Description
Fix word count issue for Chinese and Japanese
Some languages, such as Chinese and Japanese, do not use spaces to
separate words. With the current word counting algorithm, an entire
paragraph of Chinese or Japanese could be counted as a single word,
and content classification would fail badly.
The quick fix is to treat each Chinese or Japanese character as a word,
adjusted by a constant factor. The factor 0.55 comes from comparing the
English and Chinese versions of a long New York Times article.
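The counting scheme described above can be sketched roughly as follows. This is an illustrative Python sketch, not the actual Java implementation in the patch, and the Unicode ranges and the helper name `count_words` are assumptions; the real change may cover different character ranges:

```python
import re

# Factor from the CL description: each CJK character counts as
# 0.55 of a word.
CJK_WORD_FACTOR = 0.55

# Rough CJK ranges (assumption): Hiragana, Katakana, CJK Unified
# Ideographs Extension A, and CJK Unified Ideographs.
CJK_RE = re.compile(u'[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff]')

def count_words(text):
    # Count each CJK character as a fractional word.
    cjk_chars = len(CJK_RE.findall(text))
    # Replace CJK characters with spaces, then count the remaining
    # space-separated words as before.
    non_cjk = CJK_RE.sub(' ', text)
    space_words = len(non_cjk.split())
    return space_words + cjk_chars * CJK_WORD_FACTOR
```

With this, a paragraph of pure Chinese yields a word count proportional to its length instead of collapsing to one word.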
Read this bug for more information:
https://crbug.com/484750
** Score changes:
CJK data set:
https://x20web.corp.google.com/~wychen/domdistillerscore/cjk/cjk.html
Average F1: 0.356 -> 0.954
No changes on other data sets:
- cleaneval-golden-data
- golden_data_with_knowledge
- page-links-golden-data
- reader-mode-golden-data
- reader-images-golden-data
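For reference, the F1 score reported above is the harmonic mean of precision and recall; a minimal sketch of the metric (the evaluation server's exact averaging is not shown here):

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall; defined as 0.0
    # when both are 0 to avoid division by zero.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```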
** Performance impact:
The average time reported by eval server for dataset
"reader-mode-golden-data" is used as the benchmark.
To reduce noise, the benchmark is rerun 100 times.
Before: 127.380+-0.208ms
After: 128.541+-0.198ms
Difference: 1.161+-0.287ms
Percentage: 0.91+-0.25% slower than before
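The reported difference follows from subtracting the means, with the uncertainties combined in quadrature (assuming independent measurements); a quick check of the arithmetic:

```python
import math

before, before_err = 127.380, 0.208  # ms, from the benchmark above
after, after_err = 128.541, 0.198    # ms

diff = after - before                               # ~1.161 ms
diff_err = math.sqrt(before_err**2 + after_err**2)  # ~0.287 ms
pct = diff / before * 100                           # ~0.91% slower
```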
BUG=484750, 483710, 483713, 485177
R=cjhopman@chromium.org, kuan@chromium.org
Committed: 7a1099255504d55d2fc2348c2d208d7679aa8195
Patch Set 1
Patch Set 2 : reorder tests
Total comments: 9
Patch Set 3 : address comments
Patch Set 4 : speed up
Total comments: 6
Patch Set 5 : address comments
Total comments: 3
Patch Set 6 : local func var
Total comments: 2
Patch Set 7 : use java interface
Total comments: 6
Patch Set 8 : rewrite tests
Messages
Total messages: 25 (3 generated)