Chromium Code Reviews
chromiumcodereview-hr@appspot.gserviceaccount.com (chromiumcodereview-hr) | Please choose your nickname with Settings | Help | Chromium Project | Gerrit Changes | Sign out
(467)

Issue 1131853006: Fix word count issue for Chinese and Japanese (Closed)

Created:
5 years, 7 months ago by wychen
Modified:
5 years, 6 months ago
Reviewers:
cjhopman, kuan
Base URL:
git@github.com:chromium/dom-distiller.git@master
Target Ref:
refs/heads/master
Visibility:
Public.

Description

Fix word count issue for Chinese and Japanese Some languages don't use spaces to separate "words", like Chinese and Japanese. In our current word counting algorithm, one paragraph of Chinese or Japanese could be counted as one single word, and the content classification would fail horribly. The quick fix here is to treat Chinese and Japanese characters as words, and adjust it by a constant factor. The factor 0.55 is from comparing two versions of a long article in NY Times that is available in both English and Chinese. Read this bug for more information: https://crbug.com/484750 ** Score changes: CJK data set: https://x20web.corp.google.com/~wychen/domdistillerscore/cjk/cjk.html Average F1: 0.356 -> 0.954 No changes on other data sets: - cleaneval-golden-data - golden_data_with_knowledge - page-links-golden-data - reader-mode-golden-data - reader-images-golden-data ** Performance impact: The average time reported by eval server for dataset "reader-mode-golden-data" is used as the benchmark. To reduce noise, it is rerun for 100 times. Before: 127.380+-0.208ms After: 128.541+-0.198ms Difference: 1.161+-0.287ms Percentage: 0.91+-0.25% slower than before BUG=484750, 483710, 483713, 485177 R=cjhopman@chromium.org, kuan@chromium.org Committed: 7a1099255504d55d2fc2348c2d208d7679aa8195

Patch Set 1 #

Patch Set 2 : reorder tests #

Total comments: 9

Patch Set 3 : address comments #

Patch Set 4 : speed up #

Total comments: 6

Patch Set 5 : address comments #

Total comments: 3

Patch Set 6 : local func var #

Total comments: 2

Patch Set 7 : use java interface #

Total comments: 6

Patch Set 8 : rewrite tests #

Unified diffs Side-by-side diffs Delta from patch set Stats (+195 lines, -17 lines) Patch
M java/org/chromium/distiller/DomDistiller.java View 1 2 3 4 5 6 7 2 chunks +4 lines, -0 lines 0 comments Download
M java/org/chromium/distiller/StringUtil.java View 1 2 3 4 5 6 7 2 chunks +63 lines, -7 lines 0 comments Download
M javatests/org/chromium/distiller/DocumentTitleGetterTest.java View 1 2 3 chunks +30 lines, -1 line 0 comments Download
M javatests/org/chromium/distiller/StringUtilTest.java View 1 2 3 4 5 6 7 1 chunk +98 lines, -9 lines 0 comments Download

Messages

Total messages: 25 (3 generated)
wychen
PTAL
5 years, 7 months ago (2015-05-14 19:08:44 UTC) #2
wychen
+kuan as another native speaker of Chinese.
5 years, 7 months ago (2015-05-14 22:54:10 UTC) #4
kuan
https://codereview.chromium.org/1131853006/diff/20001/javatests/org/chromium/distiller/StringUtilTest.java File javatests/org/chromium/distiller/StringUtilTest.java (right): https://codereview.chromium.org/1131853006/diff/20001/javatests/org/chromium/distiller/StringUtilTest.java#newcode27 javatests/org/chromium/distiller/StringUtilTest.java:27: assertTrue(StringUtil.countWords("ファイナルファンタジー") < 11); this is a katakana sentence, how ...
5 years, 7 months ago (2015-05-15 01:45:07 UTC) #5
cjhopman
https://codereview.chromium.org/1131853006/diff/20001/java/org/chromium/distiller/StringUtil.java File java/org/chromium/distiller/StringUtil.java (right): https://codereview.chromium.org/1131853006/diff/20001/java/org/chromium/distiller/StringUtil.java#newcode45 java/org/chromium/distiller/StringUtil.java:45: public static native boolean containsWordCharacter(String s) /*-{ We need ...
5 years, 7 months ago (2015-05-15 20:16:55 UTC) #6
wychen
https://codereview.chromium.org/1131853006/diff/20001/java/org/chromium/distiller/StringUtil.java File java/org/chromium/distiller/StringUtil.java (right): https://codereview.chromium.org/1131853006/diff/20001/java/org/chromium/distiller/StringUtil.java#newcode45 java/org/chromium/distiller/StringUtil.java:45: public static native boolean containsWordCharacter(String s) /*-{ On 2015/05/15 ...
5 years, 7 months ago (2015-05-18 18:49:20 UTC) #7
cjhopman
2% slower seems like a large performance hit. Is that 2% total time or just ...
5 years, 7 months ago (2015-05-20 02:31:43 UTC) #8
wychen
On 2015/05/20 02:31:43, cjhopman wrote: > 2% slower seems like a large performance hit. Is ...
5 years, 7 months ago (2015-05-20 07:14:58 UTC) #9
wychen
The speed regression should be <1% now. It's fairly difficult to measure though, since performance ...
5 years, 7 months ago (2015-05-21 01:34:21 UTC) #11
wychen
On 2015/05/21 01:34:21, wychen wrote: > The speed regression should be <1% now. It's fairly ...
5 years, 7 months ago (2015-05-21 08:13:18 UTC) #12
kuan
i'll leave the lgtm to chris. https://codereview.chromium.org/1131853006/diff/20001/javatests/org/chromium/distiller/StringUtilTest.java File javatests/org/chromium/distiller/StringUtilTest.java (right): https://codereview.chromium.org/1131853006/diff/20001/javatests/org/chromium/distiller/StringUtilTest.java#newcode27 javatests/org/chromium/distiller/StringUtilTest.java:27: assertTrue(StringUtil.countWords("ファイナルファンタジー") < 11); ...
5 years, 7 months ago (2015-05-21 17:46:15 UTC) #13
cjhopman
https://codereview.chromium.org/1131853006/diff/80001/java/org/chromium/distiller/DomDistiller.java File java/org/chromium/distiller/DomDistiller.java (right): https://codereview.chromium.org/1131853006/diff/80001/java/org/chromium/distiller/DomDistiller.java#newcode27 java/org/chromium/distiller/DomDistiller.java:27: StringUtil.selectCountWordsFunc(Document.get().getBody().getInnerText()); This should use textcontent not innertext https://codereview.chromium.org/1131853006/diff/80001/java/org/chromium/distiller/StringUtil.java File ...
5 years, 7 months ago (2015-05-21 19:32:02 UTC) #14
wychen
https://codereview.chromium.org/1131853006/diff/80001/java/org/chromium/distiller/DomDistiller.java File java/org/chromium/distiller/DomDistiller.java (right): https://codereview.chromium.org/1131853006/diff/80001/java/org/chromium/distiller/DomDistiller.java#newcode27 java/org/chromium/distiller/DomDistiller.java:27: StringUtil.selectCountWordsFunc(Document.get().getBody().getInnerText()); On 2015/05/21 19:32:02, cjhopman wrote: > This should ...
5 years, 7 months ago (2015-05-21 23:07:03 UTC) #15
cjhopman
https://codereview.chromium.org/1131853006/diff/100001/java/org/chromium/distiller/StringUtil.java File java/org/chromium/distiller/StringUtil.java (right): https://codereview.chromium.org/1131853006/diff/100001/java/org/chromium/distiller/StringUtil.java#newcode54 java/org/chromium/distiller/StringUtil.java:54: $_countWords = function(s) { This is still going to ...
5 years, 7 months ago (2015-05-22 00:11:59 UTC) #16
cjhopman
https://codereview.chromium.org/1131853006/diff/100001/java/org/chromium/distiller/StringUtil.java File java/org/chromium/distiller/StringUtil.java (right): https://codereview.chromium.org/1131853006/diff/100001/java/org/chromium/distiller/StringUtil.java#newcode54 java/org/chromium/distiller/StringUtil.java:54: $_countWords = function(s) { On 2015/05/22 00:11:58, cjhopman wrote: ...
5 years, 7 months ago (2015-05-22 00:13:30 UTC) #17
wychen
On 2015/05/22 00:13:30, cjhopman wrote: > https://codereview.chromium.org/1131853006/diff/100001/java/org/chromium/distiller/StringUtil.java > File java/org/chromium/distiller/StringUtil.java (right): > > https://codereview.chromium.org/1131853006/diff/100001/java/org/chromium/distiller/StringUtil.java#newcode54 > ...
5 years, 7 months ago (2015-05-22 22:00:47 UTC) #18
wychen
https://codereview.chromium.org/1131853006/diff/100001/java/org/chromium/distiller/StringUtil.java File java/org/chromium/distiller/StringUtil.java (right): https://codereview.chromium.org/1131853006/diff/100001/java/org/chromium/distiller/StringUtil.java#newcode54 java/org/chromium/distiller/StringUtil.java:54: $_countWords = function(s) { On 2015/05/22 00:13:29, cjhopman wrote: ...
5 years, 7 months ago (2015-05-23 00:32:27 UTC) #19
cjhopman
https://codereview.chromium.org/1131853006/diff/120001/java/org/chromium/distiller/StringUtil.java File java/org/chromium/distiller/StringUtil.java (right): https://codereview.chromium.org/1131853006/diff/120001/java/org/chromium/distiller/StringUtil.java#newcode54 java/org/chromium/distiller/StringUtil.java:54: public static int _countWords; Can't this be a JavaScriptObject? ...
5 years, 7 months ago (2015-05-26 21:53:06 UTC) #20
wychen
https://codereview.chromium.org/1131853006/diff/120001/java/org/chromium/distiller/StringUtil.java File java/org/chromium/distiller/StringUtil.java (right): https://codereview.chromium.org/1131853006/diff/120001/java/org/chromium/distiller/StringUtil.java#newcode54 java/org/chromium/distiller/StringUtil.java:54: public static int _countWords; On 2015/05/26 21:53:06, cjhopman wrote: ...
5 years, 6 months ago (2015-05-29 07:44:09 UTC) #21
cjhopman
https://codereview.chromium.org/1131853006/diff/140001/java/org/chromium/distiller/StringUtil.java File java/org/chromium/distiller/StringUtil.java (right): https://codereview.chromium.org/1131853006/diff/140001/java/org/chromium/distiller/StringUtil.java#newcode53 java/org/chromium/distiller/StringUtil.java:53: private static interface CountWords { nit: probably should be ...
5 years, 6 months ago (2015-05-29 19:38:57 UTC) #22
wychen
https://codereview.chromium.org/1131853006/diff/140001/java/org/chromium/distiller/StringUtil.java File java/org/chromium/distiller/StringUtil.java (right): https://codereview.chromium.org/1131853006/diff/140001/java/org/chromium/distiller/StringUtil.java#newcode53 java/org/chromium/distiller/StringUtil.java:53: private static interface CountWords { On 2015/05/29 19:38:57, cjhopman ...
5 years, 6 months ago (2015-05-31 09:04:03 UTC) #23
cjhopman
lgtm
5 years, 6 months ago (2015-06-04 20:29:44 UTC) #24
wychen
5 years, 6 months ago (2015-06-05 21:01:04 UTC) #25
Message was sent while issue was closed.
Committed patchset #8 (id:160001) manually as
7a1099255504d55d2fc2348c2d208d7679aa8195 (presubmit successful).

Powered by Google App Engine
This is Rietveld 408576698