|
|
Created:
5 years, 4 months ago by Julius Modified:
5 years, 4 months ago CC:
chromium-reviews, tapted, Matt Giuca, grt+watch_chromium.org, rouslan+spellwatch_chromium.org, rlp+watch_chromium.org, tfarina, groby+spellwatch_chromium.org, jshin+watch_chromium.org Base URL:
https://chromium.googlesource.com/chromium/src.git@master Target Ref:
refs/pending/heads/master Project:
chromium Visibility:
Public. |
DescriptionCreates BreakIterator::GetWordBreakStatus.
For multilingual spellchecking, we need a function to tell us the
current state of the iterator so we know what the spellchecker needs to
pay attention to. That is, we need to know if we've found a word or
characters that can be skipped over.
TEST=*Skippable*
TEST=*BreakStatus*
BUG=5102
Committed: https://crrev.com/3fc3250d48a1e1d280936a9de4c0875d4ec72e3e
Cr-Commit-Position: refs/heads/master@{#342958}
Patch Set 1 : #Patch Set 2 : #
Total comments: 2
Patch Set 3 : Made new function, added tests. #
Total comments: 6
Patch Set 4 : Added comments and such. #
Total comments: 14
Patch Set 5 : Rebase and address comments. #
Total comments: 6
Patch Set 6 : Comment clarifications and using EXPECT_EQ. #
Total comments: 8
Patch Set 7 : Updated Khmer tests and ASCII-fied comments. #
Total comments: 1
Messages
Total messages: 45 (23 generated)
Patchset #1 (id:1) has been deleted
Patchset #1 (id:20001) has been deleted
Patchset #1 (id:40001) has been deleted
juliusa@google.com changed reviewers: + rouslan@chromium.org
Rouslan, PTAL at Patch Set #1.
On 2015/08/05 22:43:07, Julius wrote: > Rouslan, PTAL at Patch Set #1. Let's define a new function so we don't have to update 10 unrelated files.
On 2015/08/05 23:47:55, Rouslan wrote: > On 2015/08/05 22:43:07, Julius wrote: > > Rouslan, PTAL at Patch Set #1. > > Let's define a new function so we don't have to update 10 unrelated files. Also: What is an "unknown word character"? I'd suggest to simply pick the word iterator for the primary language if we check in multiple languages. If two languages have significantly different word breaking rules, it becomes almost impossible to properly analyze a text in both languages simultaneously, unless you run two separate WordBreak iterators with different rules. At the very least I'd like to see a test case that highlights why we need to do this and what the expected benefit is.
Drive-by nit. https://codereview.chromium.org/1272683002/diff/70001/base/i18n/break_iterator.h File base/i18n/break_iterator.h (right): https://codereview.chromium.org/1272683002/diff/70001/base/i18n/break_iterato... base/i18n/break_iterator.h:100: nit: No extra blank line.
Patchset #3 (id:90001) has been deleted
Patchset #3 (id:110001) has been deleted
Patchset #3 (id:130001) has been deleted
Patchset #3 (id:150001) has been deleted
Patchset #3 (id:170001) has been deleted
Patchset #3 (id:190001) has been deleted
Rouslan, PTAL at Patch Set #3. https://codereview.chromium.org/1272683002/diff/70001/base/i18n/break_iterator.h File base/i18n/break_iterator.h (right): https://codereview.chromium.org/1272683002/diff/70001/base/i18n/break_iterato... base/i18n/break_iterator.h:100: On 2015/08/06 00:32:15, Matt Giuca wrote: > nit: No extra blank line. Done.
A few comments to start. https://codereview.chromium.org/1272683002/diff/210001/base/i18n/break_iterat... File base/i18n/break_iterator.h (right): https://codereview.chromium.org/1272683002/diff/210001/base/i18n/break_iterat... base/i18n/break_iterator.h:74: enum WordBreakStatus { IS_WORD_BREAK, IS_SKIPPABLE_WORD, IS_NOT_WORD_BREAK }; add newlines and comments on the meaning of these enum values here. https://codereview.chromium.org/1272683002/diff/210001/chrome/renderer/spellc... File chrome/renderer/spellchecker/spellcheck_worditerator_unittest.cc (right): https://codereview.chromium.org/1272683002/diff/210001/chrome/renderer/spellc... chrome/renderer/spellchecker/spellcheck_worditerator_unittest.cc:28: base::string16 GetRulesForLanguage(const std::string& language) { Put this in anonymous namespace. https://codereview.chromium.org/1272683002/diff/210001/chrome/renderer/spellc... chrome/renderer/spellchecker/spellcheck_worditerator_unittest.cc:311: // A string containing English, Khmer, and Russian characters. Put actual text into the comment.
Patchset #4 (id:230001) has been deleted
Patchset #4 (id:250001) has been deleted
Rouslan, PTAL at Patch Set #4. https://codereview.chromium.org/1272683002/diff/210001/base/i18n/break_iterat... File base/i18n/break_iterator.h (right): https://codereview.chromium.org/1272683002/diff/210001/base/i18n/break_iterat... base/i18n/break_iterator.h:74: enum WordBreakStatus { IS_WORD_BREAK, IS_SKIPPABLE_WORD, IS_NOT_WORD_BREAK }; On 2015/08/07 17:16:59, Rouslan wrote: > add newlines and comments on the meaning of these enum values here. Done. https://codereview.chromium.org/1272683002/diff/210001/chrome/renderer/spellc... File chrome/renderer/spellchecker/spellcheck_worditerator_unittest.cc (right): https://codereview.chromium.org/1272683002/diff/210001/chrome/renderer/spellc... chrome/renderer/spellchecker/spellcheck_worditerator_unittest.cc:28: base::string16 GetRulesForLanguage(const std::string& language) { On 2015/08/07 17:16:59, Rouslan wrote: > Put this in anonymous namespace. Done. https://codereview.chromium.org/1272683002/diff/210001/chrome/renderer/spellc... chrome/renderer/spellchecker/spellcheck_worditerator_unittest.cc:311: // A string containing English, Khmer, and Russian characters. On 2015/08/07 17:16:59, Rouslan wrote: > Put actual text into the comment. Done.
I need so much coffee to understand this :-P https://codereview.chromium.org/1272683002/diff/270001/base/i18n/break_iterat... File base/i18n/break_iterator.cc (right): https://codereview.chromium.org/1272683002/diff/270001/base/i18n/break_iterat... base/i18n/break_iterator.cc:146: return IS_NOT_WORD_BREAK; Call ubrk_getRuleStatus() before the if statement. The original code called urbk_getRuleStatus() before checking break_type_. This seems unusual, but there could be a good reason for it. Either way, the code that's using IsWord() implicitly assumes that ubrk_getRuleStatus() is always called inside of IsWord(). Let's not break that assumption. https://codereview.chromium.org/1272683002/diff/270001/base/i18n/break_iterat... File base/i18n/break_iterator.h (right): https://codereview.chromium.org/1272683002/diff/270001/base/i18n/break_iterat... base/i18n/break_iterator.h:80: // Only used if not in BREAK_WORD or RULE_BASED mode. What does returning this value mean? It's nice to know when it's used, but people who read your code will want to know why it's returned. https://codereview.chromium.org/1272683002/diff/270001/base/i18n/break_iterat... base/i18n/break_iterator.h:116: // distinction doesn't apply and it returns IS_NOT_WORD_BREAK. Otherwise, the This "Otherwise" is confusing. Please be explicit about conditions that cause IS_SKIPPABLE_WORD to be returned. https://codereview.chromium.org/1272683002/diff/270001/chrome/renderer/spellc... File chrome/renderer/spellchecker/spellcheck_worditerator_unittest.cc (right): https://codereview.chromium.org/1272683002/diff/270001/chrome/renderer/spellc... chrome/renderer/spellchecker/spellcheck_worditerator_unittest.cc:310: TEST(SpellcheckWordIteratorTest, BreakLine) { This test should be in base/. Also add a test for BREAK_WORD. Only rule-based tests should be in spellchecker. https://codereview.chromium.org/1272683002/diff/270001/chrome/renderer/spellc... chrome/renderer/spellchecker/spellcheck_worditerator_unittest.cc:314: base::WideToUTF16(L"foo \x1791\x17c1 Can \x041C\x0438...")); Put a newline in there, so that you you get one return value that's not IS_NOT_WORD_BREAK. https://codereview.chromium.org/1272683002/diff/270001/chrome/renderer/spellc... chrome/renderer/spellchecker/spellcheck_worditerator_unittest.cc:319: // Finds "foo". Also add this throught: EXPECT_EQ(base::WideToUTF16(L"foo"), iter.GetString()); https://codereview.chromium.org/1272683002/diff/270001/chrome/renderer/spellc... chrome/renderer/spellchecker/spellcheck_worditerator_unittest.cc:320: EXPECT_TRUE(iter.IsWordBreak() == BreakIterator::IS_NOT_WORD_BREAK); Can you think of a better name for BreakIterator::IS_NOT_WORD_BREAK? BreakIterator::IsWordBreak() returns BreakIterator::IS_NOT_WORD_BREAK returns for words when BreakIterator::BREAK_LINE mode is used. That's confusing. I guess that's not more confusing when BreakIterator::IsWord() returns false for word breaks in the same mode...
Patchset #5 (id:290001) has been deleted
Patchset #5 (id:310001) has been deleted
Patchset #5 (id:330001) has been deleted
Patchset #5 (id:350001) has been deleted
Rouslan, PTAL at Patch Set #5. https://codereview.chromium.org/1272683002/diff/270001/base/i18n/break_iterat... File base/i18n/break_iterator.cc (right): https://codereview.chromium.org/1272683002/diff/270001/base/i18n/break_iterat... base/i18n/break_iterator.cc:146: return IS_NOT_WORD_BREAK; On 2015/08/07 20:53:10, Rouslan wrote: > Call ubrk_getRuleStatus() before the if statement. > > The original code called urbk_getRuleStatus() before checking break_type_. This > seems unusual, but there could be a good reason for it. Either way, the code > that's using IsWord() implicitly assumes that ubrk_getRuleStatus() is always > called inside of IsWord(). Let's not break that assumption. Done. https://codereview.chromium.org/1272683002/diff/270001/base/i18n/break_iterat... File base/i18n/break_iterator.h (right): https://codereview.chromium.org/1272683002/diff/270001/base/i18n/break_iterat... base/i18n/break_iterator.h:80: // Only used if not in BREAK_WORD or RULE_BASED mode. On 2015/08/07 20:53:10, Rouslan wrote: > What does returning this value mean? It's nice to know when it's used, but > people who read your code will want to know why it's returned. Done. https://codereview.chromium.org/1272683002/diff/270001/base/i18n/break_iterat... base/i18n/break_iterator.h:116: // distinction doesn't apply and it returns IS_NOT_WORD_BREAK. Otherwise, the On 2015/08/07 20:53:10, Rouslan wrote: > This "Otherwise" is confusing. Please be explicit about conditions that cause > IS_SKIPPABLE_WORD to be returned. Done. https://codereview.chromium.org/1272683002/diff/270001/chrome/renderer/spellc... File chrome/renderer/spellchecker/spellcheck_worditerator_unittest.cc (right): https://codereview.chromium.org/1272683002/diff/270001/chrome/renderer/spellc... chrome/renderer/spellchecker/spellcheck_worditerator_unittest.cc:310: TEST(SpellcheckWordIteratorTest, BreakLine) { On 2015/08/07 20:53:10, Rouslan wrote: > This test should be in base/. > > Also add a test for BREAK_WORD. > > Only rule-based tests should be in spellchecker. Done. https://codereview.chromium.org/1272683002/diff/270001/chrome/renderer/spellc... chrome/renderer/spellchecker/spellcheck_worditerator_unittest.cc:314: base::WideToUTF16(L"foo \x1791\x17c1 Can \x041C\x0438...")); On 2015/08/07 20:53:10, Rouslan wrote: > Put a newline in there, so that you you get one return value that's not > IS_NOT_WORD_BREAK. Well, it's still going to be IS_NOT_WORD_BREAK if we're using BREAK_LINE mode but I added the newline anyway. https://codereview.chromium.org/1272683002/diff/270001/chrome/renderer/spellc... chrome/renderer/spellchecker/spellcheck_worditerator_unittest.cc:319: // Finds "foo". On 2015/08/07 20:53:10, Rouslan wrote: > Also add this throught: > > EXPECT_EQ(base::WideToUTF16(L"foo"), iter.GetString()); Done. https://codereview.chromium.org/1272683002/diff/270001/chrome/renderer/spellc... chrome/renderer/spellchecker/spellcheck_worditerator_unittest.cc:320: EXPECT_TRUE(iter.IsWordBreak() == BreakIterator::IS_NOT_WORD_BREAK); On 2015/08/07 20:53:10, Rouslan wrote: > Can you think of a better name for BreakIterator::IS_NOT_WORD_BREAK? > > BreakIterator::IsWordBreak() returns BreakIterator::IS_NOT_WORD_BREAK returns > for words when BreakIterator::BREAK_LINE mode is used. That's confusing. I guess > that's not more confusing when BreakIterator::IsWord() returns false for word > breaks in the same mode... IS_LINE_OR_CHAR_BREAK seems good.
Those comments are still confusing, but better than before. You don't have to change them, if you cannot come up with something clearer. https://codereview.chromium.org/1272683002/diff/370001/base/i18n/break_iterat... File base/i18n/break_iterator.h (right): https://codereview.chromium.org/1272683002/diff/370001/base/i18n/break_iterat... base/i18n/break_iterator.h:118: // the end of a string of word characters. IS_SKIPPABLE_WORD is returned if |string| is an overloaded term. Say "sequence", if you must. https://codereview.chromium.org/1272683002/diff/370001/base/i18n/break_iterat... base/i18n/break_iterator.h:122: BreakIterator::WordBreakStatus IsWordBreak() const; I think IsSomething() is best for functions that return boolean. You should GetSomething() for functions that return non-booleans. https://codereview.chromium.org/1272683002/diff/370001/base/i18n/break_iterat... File base/i18n/break_iterator_unittest.cc (right): https://codereview.chromium.org/1272683002/diff/370001/base/i18n/break_iterat... base/i18n/break_iterator_unittest.cc:385: EXPECT_TRUE(iter.IsWordBreak() == BreakIterator::IS_LINE_OR_CHAR_BREAK); Here and below, use EXPECT_EQ(item1, item2) instead of EXPECT_TRUE(item1 == item2). If you use EXPECT_TRUE, the error message will be "Expected true, but found false." That's not useful. If you use EXPECT_EQ, the error message will be "Expected value1, but found value2." That makes debugging test failures easier.
Patchset #6 (id:390001) has been deleted
Patchset #6 (id:410001) has been deleted
Patchset #6 (id:430001) has been deleted
Rouslan, PTAL at Patch Set #6. Hopefully this is a clearer way of commenting what the function is doing. https://codereview.chromium.org/1272683002/diff/370001/base/i18n/break_iterat... File base/i18n/break_iterator.h (right): https://codereview.chromium.org/1272683002/diff/370001/base/i18n/break_iterat... base/i18n/break_iterator.h:118: // the end of a string of word characters. IS_SKIPPABLE_WORD is returned if On 2015/08/10 17:24:42, Rouslan wrote: > |string| is an overloaded term. Say "sequence", if you must. Done. https://codereview.chromium.org/1272683002/diff/370001/base/i18n/break_iterat... base/i18n/break_iterator.h:122: BreakIterator::WordBreakStatus IsWordBreak() const; On 2015/08/10 17:24:42, Rouslan wrote: > I think IsSomething() is best for functions that return boolean. You should > GetSomething() for functions that return non-booleans. Done. https://codereview.chromium.org/1272683002/diff/370001/base/i18n/break_iterat... File base/i18n/break_iterator_unittest.cc (right): https://codereview.chromium.org/1272683002/diff/370001/base/i18n/break_iterat... base/i18n/break_iterator_unittest.cc:385: EXPECT_TRUE(iter.IsWordBreak() == BreakIterator::IS_LINE_OR_CHAR_BREAK); On 2015/08/10 17:24:42, Rouslan wrote: > Here and below, use EXPECT_EQ(item1, item2) instead of EXPECT_TRUE(item1 == > item2). If you use EXPECT_TRUE, the error message will be "Expected true, but > found false." That's not useful. If you use EXPECT_EQ, the error message will be > "Expected value1, but found value2." That makes debugging test failures easier. Done.
Overall lgtm. Next you need a base/i18n/ OWNER review.
By the way, the description is out of date. Please update it.
juliusa@google.com changed reviewers: + jshin@chromium.org
jshin@chromium.org, PTAL at Patch Set #6, files: base/i18n/break_iterator.h base/i18n/break_iterator.cc base/i18n/break_iterator_unittest.cc
LGTM with nits about visual studio + Khmer test. https://codereview.chromium.org/1272683002/diff/450001/base/i18n/break_iterat... File base/i18n/break_iterator_unittest.cc (right): https://codereview.chromium.org/1272683002/diff/450001/base/i18n/break_iterat... base/i18n/break_iterator_unittest.cc:375: // The string "foo ទេ \nCan Ми..." which contains English, Khmer, and Russian Due to an issue with Visual Studio, we cannot use non-ASCII characters even in comments. Visual Studio would barf in East Asian locales (CJK) when it's asked to compile this source code. That really sucks, but ...Perhaps, we should disable that warning (that would be treated as an error). https://codereview.chromium.org/1272683002/diff/450001/chrome/renderer/spellc... File chrome/renderer/spellchecker/spellcheck_worditerator_unittest.cc (right): https://codereview.chromium.org/1272683002/diff/450001/chrome/renderer/spellc... chrome/renderer/spellchecker/spellcheck_worditerator_unittest.cc:327: EXPECT_EQ(base::WideToUTF16(L"\x1791\x17c1"), iter.GetString()); Interesting. Even if Khmer is not treated as either ALetter or ALetterPlus, w-b-iterator still does not break them apart (perhaps because it's a single grapheme... rules does not have that info?). https://codereview.chromium.org/1272683002/diff/450001/chrome/renderer/spellc... chrome/renderer/spellchecker/spellcheck_worditerator_unittest.cc:412: // characters, in that order. A Khmer example can be made more interesting. If you take an example from ICU's Khmer break iterator tests. I took the following example from the first line in the Khmer section of third_party/icu/source/test/testdata/rbbitst.txt U+178F U+17BE <word break> U+179B U+17C4 U+1780 <word break> U+1798 U+1780
And, can you add TEST= line to the CL description? Thanks
https://codereview.chromium.org/1272683002/diff/450001/chrome/renderer/spellc... File chrome/renderer/spellchecker/spellcheck_worditerator_unittest.cc (right): https://codereview.chromium.org/1272683002/diff/450001/chrome/renderer/spellc... chrome/renderer/spellchecker/spellcheck_worditerator_unittest.cc:327: EXPECT_EQ(base::WideToUTF16(L"\x1791\x17c1"), iter.GetString()); On 2015/08/11 21:43:50, jungshik at google wrote: > Interesting. Even if Khmer is not treated as either ALetter or ALetterPlus, > w-b-iterator still does not break them apart (perhaps because it's a single > grapheme... rules does not have that info?). I figured out why Khmer is not split up here. That's because Khmer uses a dictionary for word (as well as line) breaking. And, it's handled outside non-dictionary cases. Our custom rules do not change the following line: # For dictionary-based break $dictionary $dictionary;
Patchset #7 (id:470001) has been deleted
Fixed up the tests and nits and submitting. https://codereview.chromium.org/1272683002/diff/450001/base/i18n/break_iterat... File base/i18n/break_iterator_unittest.cc (right): https://codereview.chromium.org/1272683002/diff/450001/base/i18n/break_iterat... base/i18n/break_iterator_unittest.cc:375: // The string "foo ទេ \nCan Ми..." which contains English, Khmer, and Russian On 2015/08/11 21:43:50, jungshik at google wrote: > Due to an issue with Visual Studio, we cannot use non-ASCII characters even in > comments. Visual Studio would barf in East Asian locales (CJK) when it's asked > to compile this source code. That really sucks, but ...Perhaps, we should > disable that warning (that would be treated as an error). I got rid of the non-ASCII characters in the comments and tried to make them clear as to what's in the string. https://codereview.chromium.org/1272683002/diff/450001/chrome/renderer/spellc... File chrome/renderer/spellchecker/spellcheck_worditerator_unittest.cc (right): https://codereview.chromium.org/1272683002/diff/450001/chrome/renderer/spellc... chrome/renderer/spellchecker/spellcheck_worditerator_unittest.cc:327: EXPECT_EQ(base::WideToUTF16(L"\x1791\x17c1"), iter.GetString()); On 2015/08/11 22:22:07, jungshik at google wrote: > On 2015/08/11 21:43:50, jungshik at google wrote: > > Interesting. Even if Khmer is not treated as either ALetter or ALetterPlus, > > w-b-iterator still does not break them apart (perhaps because it's a single > > grapheme... rules does not have that info?). > > I figured out why Khmer is not split up here. That's because Khmer uses a > dictionary for word (as well as line) breaking. And, it's handled outside > non-dictionary cases. Our custom rules do not change the following line: > > # For dictionary-based break > $dictionary $dictionary; > Acknowledged. https://codereview.chromium.org/1272683002/diff/450001/chrome/renderer/spellc... chrome/renderer/spellchecker/spellcheck_worditerator_unittest.cc:327: EXPECT_EQ(base::WideToUTF16(L"\x1791\x17c1"), iter.GetString()); On 2015/08/11 21:43:50, jungshik at google wrote: > Interesting. Even if Khmer is not treated as either ALetter or ALetterPlus, > w-b-iterator still does not break them apart (perhaps because it's a single > grapheme... rules does not have that info?). Acknowledged. https://codereview.chromium.org/1272683002/diff/450001/chrome/renderer/spellc... chrome/renderer/spellchecker/spellcheck_worditerator_unittest.cc:412: // characters, in that order. On 2015/08/11 21:43:50, jungshik at google wrote: > A Khmer example can be made more interesting. If you take an example from ICU's > Khmer break iterator tests. > > I took the following example from the first line in the Khmer section of > third_party/icu/source/test/testdata/rbbitst.txt > > > U+178F U+17BE <word break> U+179B U+17C4 U+1780 <word break> U+1798 U+1780 Swapped the Khmer text in this case with your suggested texted.
The CQ bit was checked by juliusa@google.com
The patchset sent to the CQ was uploaded after l-g-t-m from rouslan@chromium.org, jshin@chromium.org Link to the patchset: https://codereview.chromium.org/1272683002/#ps490001 (title: "Updated Khmer tests and ASCII-fied comments.")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1272683002/490001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1272683002/490001
Message was sent while issue was closed.
Committed patchset #7 (id:490001)
Message was sent while issue was closed.
Patchset 7 (id:??) landed as https://crrev.com/3fc3250d48a1e1d280936a9de4c0875d4ec72e3e Cr-Commit-Position: refs/heads/master@{#342958}
Message was sent while issue was closed.
https://codereview.chromium.org/1272683002/diff/490001/chrome/renderer/spellc... File chrome/renderer/spellchecker/spellcheck_worditerator_unittest.cc (right): https://codereview.chromium.org/1272683002/diff/490001/chrome/renderer/spellc... chrome/renderer/spellchecker/spellcheck_worditerator_unittest.cc:415: L"\x041C\x0438 \x178F\x17BE \x179B\x17C4\x1780 \x1798\x1780zoo. ,")); Ick. Sorry it's not clear to you (and for a post-landing comment). Khmer does not use a space between words. That's why it needs a dictionary to break between words. The Khmer portion should be \x178F\x17BE\x179B\x17C4\x1780\x1798\x1780 |