Description
Fix word count issue for Chinese and Japanese
Some languages, such as Chinese and Japanese, do not use spaces to
separate words. With the current word counting algorithm, an entire
paragraph of Chinese or Japanese could be counted as a single word,
and content classification would fail badly.
The quick fix is to treat each Chinese or Japanese character as a word,
adjusted by a constant factor. The factor 0.55 comes from comparing the
English and Chinese versions of a long New York Times article.
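The counting scheme described above can be sketched roughly as follows. This is an illustrative Python sketch, not the actual Java implementation in the patch, and the Unicode ranges and the helper name `count_words` are assumptions; the real change may cover different character ranges:

```python
import re

# Factor from the CL description: each CJK character counts as
# 0.55 of a word.
CJK_WORD_FACTOR = 0.55

# Rough CJK ranges (assumption): Hiragana, Katakana, CJK Unified
# Ideographs Extension A, and CJK Unified Ideographs.
CJK_RE = re.compile(u'[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff]')

def count_words(text):
    # Count each CJK character as a fractional word.
    cjk_chars = len(CJK_RE.findall(text))
    # Replace CJK characters with spaces, then count the remaining
    # space-separated words as before.
    non_cjk = CJK_RE.sub(' ', text)
    space_words = len(non_cjk.split())
    return space_words + cjk_chars * CJK_WORD_FACTOR
```

With this, a paragraph of pure Chinese yields a word count proportional to its length instead of collapsing to one word.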
Read this bug for more information:
https://crbug.com/484750
** Score changes:
CJK data set:
https://x20web.corp.google.com/~wychen/domdistillerscore/cjk/cjk.html
Average F1: 0.356 -> 0.954
No changes on other data sets:
- cleaneval-golden-data
- golden_data_with_knowledge
- page-links-golden-data
- reader-mode-golden-data
- reader-images-golden-data
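For reference, the F1 score reported above is the harmonic mean of precision and recall; a minimal sketch of the metric (the evaluation server's exact averaging is not shown here):

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall; defined as 0.0
    # when both are 0 to avoid division by zero.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```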
** Performance impact:
The average time reported by eval server for dataset
"reader-mode-golden-data" is used as the benchmark.
To reduce noise, the benchmark is rerun 100 times.
Before: 127.380+-0.208ms
After: 128.541+-0.198ms
Difference: 1.161+-0.287ms
Percentage: 0.91+-0.25% slower than before
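The reported difference follows from subtracting the means, with the uncertainties combined in quadrature (assuming independent measurements); a quick check of the arithmetic:

```python
import math

before, before_err = 127.380, 0.208  # ms, from the benchmark above
after, after_err = 128.541, 0.198    # ms

diff = after - before                               # ~1.161 ms
diff_err = math.sqrt(before_err**2 + after_err**2)  # ~0.287 ms
pct = diff / before * 100                           # ~0.91% slower
```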
BUG=484750, 483710, 483713, 485177
R=cjhopman@chromium.org, kuan@chromium.org
Committed: 7a1099255504d55d2fc2348c2d208d7679aa8195
Patch Set 1
Patch Set 2 : reorder tests
Total comments: 9
Patch Set 3 : address comments
Patch Set 4 : speed up
Total comments: 6
Patch Set 5 : address comments
Total comments: 3
Patch Set 6 : local func var
Total comments: 2
Patch Set 7 : use java interface
Total comments: 6
Patch Set 8 : rewrite tests
Messages
Total messages: 25 (3 generated)