Issue 1272683002: Creates BreakIterator::GetWordBreakStatus. (Closed)

Created:
5 years, 4 months ago by Julius

Modified:
5 years, 4 months ago

Reviewers:
please use gerrit instead, jungshik at Google

CC:
chromium-reviews, tapted, Matt Giuca, grt+watch_chromium.org, rouslan+spellwatch_chromium.org, rlp+watch_chromium.org, tfarina, groby+spellwatch_chromium.org, jshin+watch_chromium.org

Base URL:
https://chromium.googlesource.com/chromium/src.git@master

Target Ref:
refs/pending/heads/master

Project:
chromium

Visibility:
Public.

Description

Creates BreakIterator::GetWordBreakStatus.

For multilingual spellchecking, we need a function to tell us the current state of the iterator so we know what the spellchecker needs to pay attention to. That is, we need to know if we've found a word or characters that can be skipped over.

TEST=*Skippable*
TEST=*BreakStatus*
BUG=5102

Committed: https://crrev.com/3fc3250d48a1e1d280936a9de4c0875d4ec72e3e
Cr-Commit-Position: refs/heads/master@{#342958}
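To make the intent of the description concrete, here is a minimal sketch of how a caller might use a three-valued break status to decide what to spellcheck. The status names are the ones that come up in this review; `Segment`, `ToySegments`, and `WordsToCheck` are hypothetical, and the segmenter is a toy ASCII splitter, not ICU and not the code in this CL.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Status names as discussed in this review; everything else is hypothetical.
enum WordBreakStatus { IS_WORD_BREAK, IS_SKIPPABLE_WORD, IS_LINE_OR_CHAR_BREAK };

// Hypothetical stand-in for one segment produced by a word iterator.
struct Segment {
  std::string text;
  WordBreakStatus status;
};

// Toy segmenter (an assumption, NOT ICU): runs of ASCII letters are words,
// runs of anything else (spaces, digits, punctuation) are skippable.
std::vector<Segment> ToySegments(const std::string& input) {
  std::vector<Segment> out;
  for (char c : input) {
    const bool letter = std::isalpha(static_cast<unsigned char>(c)) != 0;
    const WordBreakStatus s = letter ? IS_WORD_BREAK : IS_SKIPPABLE_WORD;
    if (out.empty() || out.back().status != s)
      out.push_back(Segment{std::string(), s});
    out.back().text += c;
  }
  return out;
}

// Keeps only the segments a spellchecker would need to look at.
std::vector<std::string> WordsToCheck(const std::string& input) {
  std::vector<std::string> words;
  for (const Segment& seg : ToySegments(input)) {
    assert(!seg.text.empty());  // The toy segmenter never emits empty runs.
    if (seg.status == IS_WORD_BREAK)
      words.push_back(seg.text);
  }
  return words;
}
```

With this sketch, `WordsToCheck("foo, bar 42")` keeps `foo` and `bar` and skips the punctuation, spaces, and digits between them.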

Patch Set 1 : #

Patch Set 2 : #

Total comments: 2

Patch Set 3 : Made new function, added tests. #

Total comments: 6

Patch Set 4 : Added comments and such. #

Total comments: 14

Patch Set 5 : Rebase and address comments. #

Total comments: 6

Patch Set 6 : Comment clarifications and using EXPECT_EQ. #

Total comments: 8

Patch Set 7 : Updated Khmer tests and ASCII-fied comments. #

Total comments: 1

Created: 5 years, 4 months ago

Stats: +289 lines, -2 lines

M	base/i18n/break_iterator.h	2 chunks	+27 lines, -0 lines	0 comments
M	base/i18n/break_iterator.cc	1 chunk	+6 lines, -2 lines	0 comments
M	base/i18n/break_iterator_unittest.cc	1 chunk	+85 lines, -0 lines	0 comments
M	chrome/renderer/spellchecker/spellcheck_worditerator_unittest.cc	3 chunks	+171 lines, -0 lines	1 comment

Messages

Total messages: 45 (23 generated)

please use gerrit instead

On 2015/08/05 22:43:07, Julius wrote: > Rouslan, PTAL at Patch Set #1. Let's define a ...

5 years, 4 months ago (2015-08-05 23:47:55 UTC) #6

groby-ooo-7-16

On 2015/08/05 23:47:55, Rouslan wrote: > On 2015/08/05 22:43:07, Julius wrote: > > Rouslan, PTAL ...

5 years, 4 months ago (2015-08-05 23:53:29 UTC) #7

Matt Giuca

Drive-by nit. https://codereview.chromium.org/1272683002/diff/70001/base/i18n/break_iterator.h File base/i18n/break_iterator.h (right): https://codereview.chromium.org/1272683002/diff/70001/base/i18n/break_iterator.h#newcode100 base/i18n/break_iterator.h:100: nit: No extra blank line.

5 years, 4 months ago (2015-08-06 00:32:16 UTC) #8

Julius

Rouslan, PTAL at Patch Set #3. https://codereview.chromium.org/1272683002/diff/70001/base/i18n/break_iterator.h File base/i18n/break_iterator.h (right): https://codereview.chromium.org/1272683002/diff/70001/base/i18n/break_iterator.h#newcode100 base/i18n/break_iterator.h:100: On 2015/08/06 00:32:15, ...

5 years, 4 months ago (2015-08-06 20:43:54 UTC) #15

please use gerrit instead

A few comments to start. https://codereview.chromium.org/1272683002/diff/210001/base/i18n/break_iterator.h File base/i18n/break_iterator.h (right): https://codereview.chromium.org/1272683002/diff/210001/base/i18n/break_iterator.h#newcode74 base/i18n/break_iterator.h:74: enum WordBreakStatus { IS_WORD_BREAK, ...

5 years, 4 months ago (2015-08-07 17:16:59 UTC) #16

Julius

Rouslan, PTAL at Patch Set #4. https://codereview.chromium.org/1272683002/diff/210001/base/i18n/break_iterator.h File base/i18n/break_iterator.h (right): https://codereview.chromium.org/1272683002/diff/210001/base/i18n/break_iterator.h#newcode74 base/i18n/break_iterator.h:74: enum WordBreakStatus { ...

5 years, 4 months ago (2015-08-07 20:30:04 UTC) #19

please use gerrit instead

I need so much coffee to understand this :-P https://codereview.chromium.org/1272683002/diff/270001/base/i18n/break_iterator.cc File base/i18n/break_iterator.cc (right): https://codereview.chromium.org/1272683002/diff/270001/base/i18n/break_iterator.cc#newcode146 base/i18n/break_iterator.cc:146: ...

5 years, 4 months ago (2015-08-07 20:53:10 UTC) #20

Julius

5 years, 4 months ago (2015-08-10 16:06:37 UTC) #25

Rouslan, PTAL at Patch Set #5.

https://codereview.chromium.org/1272683002/diff/270001/base/i18n/break_iterat...
File base/i18n/break_iterator.cc (right):

https://codereview.chromium.org/1272683002/diff/270001/base/i18n/break_iterat...
base/i18n/break_iterator.cc:146: return IS_NOT_WORD_BREAK;
On 2015/08/07 20:53:10, Rouslan wrote:
> Call ubrk_getRuleStatus() before the if statement.
> 
> The original code called ubrk_getRuleStatus() before checking break_type_. This
> seems unusual, but there could be a good reason for it. Either way, the code
> that's using IsWord() implicitly assumes that ubrk_getRuleStatus() is always
> called inside of IsWord(). Let's not break that assumption.

Done.
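For readers following along, the ordering requested above can be sketched roughly as follows. This is an illustration under assumptions, not the code in the patch: `StatusFor` and `get_rule_status` are hypothetical names, the callback stands in for ubrk_getRuleStatus() so the snippet compiles without ICU, and the two constants mirror what I understand to be ICU's UBRK_WORD_NONE (0) and UBRK_WORD_NONE_LIMIT (100).

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Mode and status names taken from this review.
enum BreakType { BREAK_WORD, BREAK_LINE, RULE_BASED };
enum WordBreakStatus { IS_WORD_BREAK, IS_SKIPPABLE_WORD, IS_LINE_OR_CHAR_BREAK };

// Assumed to mirror ICU's UBRK_WORD_NONE and UBRK_WORD_NONE_LIMIT;
// hard-coded here so the sketch does not depend on ICU headers.
constexpr int32_t kWordNone = 0;
constexpr int32_t kWordNoneLimit = 100;

// Hypothetical helper. get_rule_status stands in for ubrk_getRuleStatus().
WordBreakStatus StatusFor(BreakType type,
                          const std::function<int32_t()>& get_rule_status) {
  assert(get_rule_status != nullptr);
  // Query the rule status first, unconditionally, so that callers relying on
  // that call happening are not affected by the mode check below.
  const int32_t status = get_rule_status();
  if (type != BREAK_WORD && type != RULE_BASED)
    return IS_LINE_OR_CHAR_BREAK;
  // A status in the "none" range means spaces, punctuation, and the like.
  return (status >= kWordNone && status < kWordNoneLimit) ? IS_SKIPPABLE_WORD
                                                          : IS_WORD_BREAK;
}
```

The point of the sketch is only the ordering: the status query happens even in modes where its result is not used.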

https://codereview.chromium.org/1272683002/diff/270001/base/i18n/break_iterat...
File base/i18n/break_iterator.h (right):

https://codereview.chromium.org/1272683002/diff/270001/base/i18n/break_iterat...
base/i18n/break_iterator.h:80: // Only used if not in BREAK_WORD or RULE_BASED mode.
On 2015/08/07 20:53:10, Rouslan wrote:
> What does returning this value mean? It's nice to know when it's used, but
> people who read your code will want to know why it's returned.

Done.

https://codereview.chromium.org/1272683002/diff/270001/base/i18n/break_iterat...
base/i18n/break_iterator.h:116: // distinction doesn't apply and it returns IS_NOT_WORD_BREAK. Otherwise, the
On 2015/08/07 20:53:10, Rouslan wrote:
> This "Otherwise" is confusing. Please be explicit about conditions that cause
> IS_SKIPPABLE_WORD to be returned.

Done.

https://codereview.chromium.org/1272683002/diff/270001/chrome/renderer/spellc...
File chrome/renderer/spellchecker/spellcheck_worditerator_unittest.cc (right):

https://codereview.chromium.org/1272683002/diff/270001/chrome/renderer/spellc...
chrome/renderer/spellchecker/spellcheck_worditerator_unittest.cc:310:
TEST(SpellcheckWordIteratorTest, BreakLine) {
On 2015/08/07 20:53:10, Rouslan wrote:
> This test should be in base/.
> 
> Also add a test for BREAK_WORD.
> 
> Only rule-based tests should be in spellchecker.

Done.

https://codereview.chromium.org/1272683002/diff/270001/chrome/renderer/spellc...
chrome/renderer/spellchecker/spellcheck_worditerator_unittest.cc:314:
base::WideToUTF16(L"foo \x1791\x17c1 Can \x041C\x0438..."));
On 2015/08/07 20:53:10, Rouslan wrote:
> Put a newline in there, so that you get one return value that's not
> IS_NOT_WORD_BREAK.

Well, it's still going to be IS_NOT_WORD_BREAK if we're using BREAK_LINE mode
but I added the newline anyway.

https://codereview.chromium.org/1272683002/diff/270001/chrome/renderer/spellc...
chrome/renderer/spellchecker/spellcheck_worditerator_unittest.cc:319: // Finds "foo".
On 2015/08/07 20:53:10, Rouslan wrote:
> Also add this throughout:
> 
> EXPECT_EQ(base::WideToUTF16(L"foo"), iter.GetString());

Done.

https://codereview.chromium.org/1272683002/diff/270001/chrome/renderer/spellc...
chrome/renderer/spellchecker/spellcheck_worditerator_unittest.cc:320:
EXPECT_TRUE(iter.IsWordBreak() == BreakIterator::IS_NOT_WORD_BREAK);
On 2015/08/07 20:53:10, Rouslan wrote:
> Can you think of a better name for BreakIterator::IS_NOT_WORD_BREAK?
> 
> BreakIterator::IsWordBreak() returns BreakIterator::IS_NOT_WORD_BREAK for
> words when BreakIterator::BREAK_LINE mode is used. That's confusing. I guess
> that's no more confusing than BreakIterator::IsWord() returning false for word
> breaks in the same mode...

IS_LINE_OR_CHAR_BREAK seems good.

please use gerrit instead

Those comments are still confusing, but better than before. You don't have to change them, ...

5 years, 4 months ago (2015-08-10 17:24:42 UTC) #26

Julius

Rouslan, PTAL at Patch Set #6. Hopefully this is a clearer way of commenting what ...

5 years, 4 months ago (2015-08-10 18:56:19 UTC) #30

please use gerrit instead

Overall lgtm. Next you need a base/i18n/ OWNER review.

5 years, 4 months ago (2015-08-10 19:31:45 UTC) #31

please use gerrit instead

By the way, the description is out of date. Please update it.

5 years, 4 months ago (2015-08-10 19:32:08 UTC) #32

Julius

jshin@chromium.org, PTAL at Patch Set #6, files: base/i18n/break_iterator.h base/i18n/break_iterator.cc base/i18n/break_iterator_unittest.cc

5 years, 4 months ago (2015-08-10 20:26:06 UTC) #34

jungshik at Google

LGTM with nits about Visual Studio + Khmer test. https://codereview.chromium.org/1272683002/diff/450001/base/i18n/break_iterator_unittest.cc File base/i18n/break_iterator_unittest.cc (right): https://codereview.chromium.org/1272683002/diff/450001/base/i18n/break_iterator_unittest.cc#newcode375 base/i18n/break_iterator_unittest.cc:375: ...

5 years, 4 months ago (2015-08-11 21:43:50 UTC) #35

jungshik at Google

And, can you add a TEST= line to the CL description? Thanks

5 years, 4 months ago (2015-08-11 21:44:22 UTC) #36

jungshik at Google

https://codereview.chromium.org/1272683002/diff/450001/chrome/renderer/spellchecker/spellcheck_worditerator_unittest.cc File chrome/renderer/spellchecker/spellcheck_worditerator_unittest.cc (right): https://codereview.chromium.org/1272683002/diff/450001/chrome/renderer/spellchecker/spellcheck_worditerator_unittest.cc#newcode327 chrome/renderer/spellchecker/spellcheck_worditerator_unittest.cc:327: EXPECT_EQ(base::WideToUTF16(L"\x1791\x17c1"), iter.GetString()); On 2015/08/11 21:43:50, jungshik at google wrote: ...

5 years, 4 months ago (2015-08-11 22:22:07 UTC) #37

Julius

5 years, 4 months ago (2015-08-12 01:22:22 UTC) #39

Fixed up the tests and nits and submitting.

https://codereview.chromium.org/1272683002/diff/450001/base/i18n/break_iterat...
File base/i18n/break_iterator_unittest.cc (right):

https://codereview.chromium.org/1272683002/diff/450001/base/i18n/break_iterat...
base/i18n/break_iterator_unittest.cc:375: // The string "foo ទេ \nCan Ми..."
which contains English, Khmer, and Russian
On 2015/08/11 21:43:50, jungshik at google wrote:
> Due to an issue with Visual Studio, we cannot use non-ASCII characters even in
> comments. Visual Studio would barf in East Asian locales (CJK) when it's asked
> to compile this source code. That really sucks, but ...Perhaps, we should
> disable that warning (that would be treated as an error). 

I got rid of the non-ASCII characters in the comments and tried to make them
clear as to what's in the string.

https://codereview.chromium.org/1272683002/diff/450001/chrome/renderer/spellc...
File chrome/renderer/spellchecker/spellcheck_worditerator_unittest.cc (right):

https://codereview.chromium.org/1272683002/diff/450001/chrome/renderer/spellc...
chrome/renderer/spellchecker/spellcheck_worditerator_unittest.cc:327:
EXPECT_EQ(base::WideToUTF16(L"\x1791\x17c1"), iter.GetString());
On 2015/08/11 22:22:07, jungshik at google wrote:
> On 2015/08/11 21:43:50, jungshik at google wrote:
> > Interesting. Even if Khmer is not treated as either ALetter or ALetterPlus,
> > w-b-iterator still does not break them apart (perhaps because it's a single
> > grapheme... rules do not have that info?). 
> 
> I figured out why Khmer is not split up here. That's because Khmer uses a
> dictionary for word (as well as line) breaking. And, it's handled outside
> non-dictionary cases. Our custom rules do not change the following line:
> 
> # For dictionary-based break
> $dictionary $dictionary;
> 

Acknowledged.

https://codereview.chromium.org/1272683002/diff/450001/chrome/renderer/spellc...
chrome/renderer/spellchecker/spellcheck_worditerator_unittest.cc:327:
EXPECT_EQ(base::WideToUTF16(L"\x1791\x17c1"), iter.GetString());
On 2015/08/11 21:43:50, jungshik at google wrote:
> Interesting. Even if Khmer is not treated as either ALetter or ALetterPlus,
> w-b-iterator still does not break them apart (perhaps because it's a single
> grapheme... rules do not have that info?). 

Acknowledged.

https://codereview.chromium.org/1272683002/diff/450001/chrome/renderer/spellc...
chrome/renderer/spellchecker/spellcheck_worditerator_unittest.cc:412: // characters, in that order.
On 2015/08/11 21:43:50, jungshik at google wrote:
> A Khmer example can be made more interesting if you take an example from ICU's
> Khmer break iterator tests. 
> 
> I took the following example from the first line in the Khmer section of
> third_party/icu/source/test/testdata/rbbitst.txt
> 
> 
> U+178F U+17BE <word break> U+179B U+17C4 U+1780 <word break> U+1798 U+1780

Swapped the Khmer text in this case with your suggested text.

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1272683002/490001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1272683002/490001

5 years, 4 months ago (2015-08-12 01:22:38 UTC) #42

commit-bot: I haz the power

Patchset 7 (id:??) landed as https://crrev.com/3fc3250d48a1e1d280936a9de4c0875d4ec72e3e Cr-Commit-Position: refs/heads/master@{#342958}

5 years, 4 months ago (2015-08-12 01:30:49 UTC) #44

jungshik at Google

5 years, 4 months ago (2015-08-12 16:56:52 UTC) #45

Message was sent while issue was closed.

https://codereview.chromium.org/1272683002/diff/490001/chrome/renderer/spellc...
File chrome/renderer/spellchecker/spellcheck_worditerator_unittest.cc (right):

https://codereview.chromium.org/1272683002/diff/490001/chrome/renderer/spellc...
chrome/renderer/spellchecker/spellcheck_worditerator_unittest.cc:415:
L"\x041C\x0438 \x178F\x17BE \x179B\x17C4\x1780 \x1798\x1780zoo. ,"));
Ick. Sorry it's not clear to you (and for a post-landing comment). 

Khmer does not use a space between words. That's why it needs a dictionary to
break between words. The Khmer portion should be

\x178F\x17BE\x179B\x17C4\x1780\x1798\x1780
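To illustrate the point, here is a small sketch of the unspaced Khmer run and of what a dictionary-based breaker would be expected to hand back. The break offsets (2, 5, 7) are copied from the `<word break>` positions in the rbbitst.txt example quoted earlier in this review; they are not computed here. `kKhmer` and `SplitAt` are hypothetical names, and the snippet does not use ICU.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// The Khmer run with no spaces between words, as suggested above.
const std::u16string kKhmer = u"\u178F\u17BE\u179B\u17C4\u1780\u1798\u1780";

// Hypothetical helper: slices text at the given end offsets (in UTF-16 code
// units). A real word iterator would find these offsets with a dictionary;
// here they are supplied by the caller.
std::vector<std::u16string> SplitAt(const std::u16string& text,
                                    const std::vector<std::size_t>& ends) {
  std::vector<std::u16string> pieces;
  std::size_t start = 0;
  for (std::size_t end : ends) {
    assert(start < end && end <= text.size());  // Offsets must be increasing.
    pieces.push_back(text.substr(start, end - start));
    start = end;
  }
  return pieces;
}
```

Calling `SplitAt(kKhmer, {2, 5, 7})` yields three pieces of 2, 3, and 2 code units, even though the input contains no space character for a purely rule-based breaker to key on.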
