|
|
Created:
4 years, 6 months ago by Jinsuk Kim Modified:
4 years, 5 months ago CC:
blink-reviews, chromium-reviews Base URL:
https://chromium.googlesource.com/chromium/src.git@master Target Ref:
refs/pending/heads/master Project:
chromium Visibility:
Public. |
DescriptionReplace ICU with CED for auto encoding detection
This is a drop-in replacement of ICU library performing automatic text
encoding detection with CED (Compact Encdoing Detection).
CED is used extensively in Google for every crawled web page,
email message, query string, etc., and recently open-sourced for
public use. (https://github.com/google/compact_enc_det)
Also it is a much better alternative to ICU in terms of speed.
ICU introduces significant regression in page loading (up to 30%):
= ICU auto-detection vs. TOT =
page_cycler.typical_25:cold_times.page_load_time 1085.13±9.31% 754.28±12.03% (30.49%)
http://storage.googleapis.com/chromium-telemetry/html-results/results-2016-05-08_21-20-58
while CED adds virtually no additional loading time (delta < sigma):
= CED auto-detection vs. TOT =
page_cycler.typical_25:cold_times.page_load_time ms 705.70±9.49% vs. 760.31±11.90% (-7.74%)
http://storage.googleapis.com/chromium-telemetry/html-results/results-2016-05-08_20-37-54
With CED, it is feasible to turn on auto encoding detection by default
so that web pages without encoding label can be taken care of. It will be
done in a follow-up CL.
BUG=597488
Committed: https://crrev.com/13510755ac11e6e1f58c34ef8c9bd4cf925c8d70
Cr-Commit-Position: refs/heads/master@{#402622}
Patch Set 1 #Patch Set 2 : fix trybot builds #
Total comments: 5
Patch Set 3 : deps #
Total comments: 2
Patch Set 4 : hint encoding #
Messages
Total messages: 24 (6 generated)
Patchset #2 (id:20001) has been deleted
Patchset #2 (id:40001) has been deleted
jinsukkim@chromium.org changed reviewers: + jshin@chromium.org, tkent@chromium.org
A few Windows broken trybot builds can be fixed by crrev.com/2092533003.
I understand CED is faster than ICU according to the CL description. What about correctness? Is it possible that CED detects a wrong encoding though ICU can detect correctly? https://codereview.chromium.org/2081653007/diff/60001/third_party/WebKit/Sour... File third_party/WebKit/Source/platform/BUILD.gn (left): https://codereview.chromium.org/2081653007/diff/60001/third_party/WebKit/Sour... third_party/WebKit/Source/platform/BUILD.gn:369: "//third_party/icu", We shouldn't remove ICU dependency. Other code in platform depend on ICU. https://codereview.chromium.org/2081653007/diff/60001/third_party/WebKit/Sour... File third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp (right): https://codereview.chromium.org/2081653007/diff/60001/third_party/WebKit/Sour... third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp:34: #include <compact_enc_det/compact_enc_det.h> Please don't use |#include <>| for non-system headers. It should be |#include "third_party/ced/src/compact_enc_det/compact_enc_det.h"|, and we need to update platform/DEPS.
On 2016/06/22 at 23:36:41, tkent wrote: > I understand CED is faster than ICU according to the CL description. What about correctness? Is it possible that CED detects a wrong encoding though ICU can detect correctly? We don't have data for that. It's difficult to estimate correctness via automated measurements. At most we could estimate how often they differ, then we would need to do a manual pass. In my opinion, that level of diligence about correctness isn't called for here. First, detecting encoding is a rather easy problem (encodings have very regular characteristics) and I expect all autodetectors to perform well. Secondly, CED has been used internally within Google3 extensively, whereas ICU autodetector is not very popular to my knowledge, so there is every reason to believe CED is more battle-hardened. Based on these circumstantial factors, I'm pretty confident a correctness check would show a result that is neutral or favorable to CED, so I'd prefer not to waste time on it.
Patchset #3 (id:80001) has been deleted
For accuracy, I have only anecdotal results like tests against the URLs reported here https://bugs.chromium.org/p/chromium/issues/detail?id=512032#c19 It all worked fine for them. The only case I've seen was when attempting to detect the encoding of 3-letter month abbreviation words which are too short. But ICU often returns wrong results in this case too. In general, it works quite well for texts of sufficient length (>= 1KB). https://codereview.chromium.org/2081653007/diff/60001/third_party/WebKit/Sour... File third_party/WebKit/Source/platform/BUILD.gn (left): https://codereview.chromium.org/2081653007/diff/60001/third_party/WebKit/Sour... third_party/WebKit/Source/platform/BUILD.gn:369: "//third_party/icu", On 2016/06/22 23:36:40, tkent wrote: > We shouldn't remove ICU dependency. Other code in platform depend on ICU. Put it back. https://codereview.chromium.org/2081653007/diff/60001/third_party/WebKit/Sour... File third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp (right): https://codereview.chromium.org/2081653007/diff/60001/third_party/WebKit/Sour... third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp:34: #include <compact_enc_det/compact_enc_det.h> On 2016/06/22 23:36:40, tkent wrote: > Please don't use |#include <>| for non-system headers. It should be |#include > "third_party/ced/src/compact_enc_det/compact_enc_det.h"|, and we need to update > platform/DEPS. Done. Just curious - I was following what was done for ICU header. Was the policy changed not to do it that way?
I don't believe this code produces results as accurate as CED in Google or ICU because this code doesn't specify hint arguments. If you'd like to produce this change without specifying hints, we should collect data such as UMA for usage of encoding menu before/after this change or running CED without hints against DocJoins. https://codereview.chromium.org/2081653007/diff/60001/third_party/WebKit/Sour... File third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp (right): https://codereview.chromium.org/2081653007/diff/60001/third_party/WebKit/Sour... third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp:34: #include <compact_enc_det/compact_enc_det.h> On 2016/06/23 at 00:45:57, Jinsuk wrote: > On 2016/06/22 23:36:40, tkent wrote: > > Please don't use |#include <>| for non-system headers. It should be |#include > > "third_party/ced/src/compact_enc_det/compact_enc_det.h"|, and we need to update > > platform/DEPS. > > Done. Just curious - I was following what was done for ICU header. Was the policy changed not to do it that way? It's a legacy code inherited from WebKit. We should apply Chromium rule now. https://codereview.chromium.org/2081653007/diff/100001/third_party/WebKit/Sou... File third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp (right): https://codereview.chromium.org/2081653007/diff/100001/third_party/WebKit/Sou... third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp:39: const char* hintEncodingName, WTF::TextEncoding* detectedEncoding) I'm afraid ignoring |hintEncodingName| degrades CED's accuracy.
On 2016/06/23 at 03:30:39, tkent wrote: > If you'd like to produce this change without specifying hints, oops, produce -> proceed
On 2016/06/23 03:30:39, tkent wrote: > I don't believe this code produces results as accurate as CED in Google or ICU > because this code doesn't specify hint arguments. > If you'd like to produce this change without specifying hints, we should collect > data such as UMA for usage of encoding menu before/after this change or running > CED without hints against DocJoins. > I can update the code to make use of |hintEncodingName|. As far as I can tell, the hint encoding is passed in a limited situation https://cs-staging.chromium.org/chromium/src/third_party/WebKit/Source/core/h..., and for the web sites I have tested all come with the hint encoding set to null. Is the situation above what you're worried about? > https://codereview.chromium.org/2081653007/diff/60001/third_party/WebKit/Sour... > File third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp (right): > > https://codereview.chromium.org/2081653007/diff/60001/third_party/WebKit/Sour... > third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp:34: #include > <compact_enc_det/compact_enc_det.h> > On 2016/06/23 at 00:45:57, Jinsuk wrote: > > On 2016/06/22 23:36:40, tkent wrote: > > > Please don't use |#include <>| for non-system headers. It should be > |#include > > > "third_party/ced/src/compact_enc_det/compact_enc_det.h"|, and we need to > update > > > platform/DEPS. > > > > Done. Just curious - I was following what was done for ICU header. Was the > policy changed not to do it that way? > > It's a legacy code inherited from WebKit. We should apply Chromium rule now. > > https://codereview.chromium.org/2081653007/diff/100001/third_party/WebKit/Sou... > File third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp (right): > > https://codereview.chromium.org/2081653007/diff/100001/third_party/WebKit/Sou... > third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp:39: const char* > hintEncodingName, WTF::TextEncoding* detectedEncoding) > I'm afraid ignoring |hintEncodingName| degrades CED's accuracy.
On 2016/06/23 at 04:33:19, jinsukkim wrote: > On 2016/06/23 03:30:39, tkent wrote: > > I don't believe this code produces results as accurate as CED in Google or ICU > > because this code doesn't specify hint arguments. > > If you'd like to produce this change without specifying hints, we should collect > > data such as UMA for usage of encoding menu before/after this change or running > > CED without hints against DocJoins. > > > > I can update the code to make use of |hintEncodingName|. As far as I can tell, > the hint encoding is passed in a limited situation https://cs-staging.chromium.org/chromium/src/third_party/WebKit/Source/core/h..., and for the web sites I have > tested all come with the hint encoding set to null. Is the situation above > what you're worried about? Yeah, ignoring hintEncodingName makes an opportunity to lose to ICU. We should give hints as much as possible. e.g. url_hint and language_hint. Can CED distinguish EUC-JP and EUC-KR correctly without these hints?
On 2016/06/23 05:17:31, tkent wrote: > On 2016/06/23 at 04:33:19, jinsukkim wrote: > > On 2016/06/23 03:30:39, tkent wrote: > > > I don't believe this code produces results as accurate as CED in Google or > ICU > > > because this code doesn't specify hint arguments. > > > If you'd like to produce this change without specifying hints, we should > collect > > > data such as UMA for usage of encoding menu before/after this change or > running > > > CED without hints against DocJoins. > > > > > > > I can update the code to make use of |hintEncodingName|. As far as I can tell, > > the hint encoding is passed in a limited situation > https://cs-staging.chromium.org/chromium/src/third_party/WebKit/Source/core/h..., > and for the web sites I have > > tested all come with the hint encoding set to null. Is the situation above > > what you're worried about? > > Yeah, ignoring hintEncodingName makes an opportunity to lose to ICU. > > We should give hints as much as possible. e.g. url_hint and language_hint. > Can CED distinguish EUC-JP and EUC-KR correctly without these hints? More hints would definitely help, but I left it out since I saw ICU not make practical use of it either as the hint for top frame would be null. I haven't tried EUC_KR but ced works for EUC_JP as in one of the test URLs: http://www.wheel.gr.jp/~dai/rta55i/ipv6-privacy.html ced is based on Bayes detection on byte bigram, which gives quite accurate result for input of reasonable length. May give false results on short queries, but works well for web pages. sequences With reasonable length of text (1KB or more)
I think we may proceed this CL and enable encoding detection by default if hintEncodingName is passed to CED. Then we should wait for a few months before removing encoding menu to see if this causes problems.
PTAL. This needs update on CED to support encoding string to enum conversion google3/i18n/encodings/public/encodings.h. I'm working on it now. // // EncodingNameAliasToEncoding // --------------------------- // // If enc_name matches the standard name or an alias of an Encoding, // using a case-insensitive comparison, return that // Encoding. Otherwise, return UNKNOWN_ENCODING. // // Aliases include most mime-encoding names (e.g., "ISO-8859-7" for // GREEK), alternate names (e.g., "cyrillic" for ISO_8859_5) and // common variations with hyphens and underscores (e.g., "koi8-u" and // "koi8u" for RUSSIAN_KOI8_R). Encoding EncodingNameAliasToEncoding(const char *enc_name); https://codereview.chromium.org/2081653007/diff/100001/third_party/WebKit/Sou... File third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp (right): https://codereview.chromium.org/2081653007/diff/100001/third_party/WebKit/Sou... third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp:39: const char* hintEncodingName, WTF::TextEncoding* detectedEncoding) On 2016/06/23 03:30:39, tkent wrote: > I'm afraid ignoring |hintEncodingName| degrades CED's accuracy. Passed hint encoding info to CED.
On 2016/06/23 at 07:35:58, jinsukkim wrote: > PTAL. This needs update on CED to support encoding string to enum conversion google3/i18n/encodings/public/encodings.h. I'm working on it now. Looks reasonable. Let's revisit this CL after the CED change is landed.
On 2016/06/23 07:38:47, tkent wrote: > On 2016/06/23 at 07:35:58, jinsukkim wrote: > > PTAL. This needs update on CED to support encoding string to enum conversion > google3/i18n/encodings/public/encodings.h. I'm working on it now. > > Looks reasonable. Let's revisit this CL after the CED change is landed. Thanks tkent for helping the CED update land. We can now resume on this patch.
lgtm
The CQ bit was checked by jinsukkim@chromium.org
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Message was sent while issue was closed.
Committed patchset #4 (id:120001)
Message was sent while issue was closed.
Description was changed from ========== Replace ICU with CED for auto encoding detection This is a drop-in replacement of ICU library performing automatic text encoding detection with CED (Compact Encdoing Detection). CED is used extensively in Google for every crawled web page, email message, query string, etc., and recently open-sourced for public use. (https://github.com/google/compact_enc_det) Also it is a much better alternative to ICU in terms of speed. ICU introduces significant regression in page loading (up to 30%): = ICU auto-detection vs. TOT = page_cycler.typical_25:cold_times.page_load_time 1085.13±9.31% 754.28±12.03% (30.49%) http://storage.googleapis.com/chromium-telemetry/html-results/results-2016-05... while CED adds virtually no additional loading time (delta < sigma): = CED auto-detection vs. TOT = page_cycler.typical_25:cold_times.page_load_time ms 705.70±9.49% vs. 760.31±11.90% (-7.74%) http://storage.googleapis.com/chromium-telemetry/html-results/results-2016-05... With CED, it is feasible to turn on auto encoding detection by default so that web pages without encoding label can be taken care of. It will be done in a follow-up CL. BUG=597488 ========== to ========== Replace ICU with CED for auto encoding detection This is a drop-in replacement of ICU library performing automatic text encoding detection with CED (Compact Encdoing Detection). CED is used extensively in Google for every crawled web page, email message, query string, etc., and recently open-sourced for public use. (https://github.com/google/compact_enc_det) Also it is a much better alternative to ICU in terms of speed. ICU introduces significant regression in page loading (up to 30%): = ICU auto-detection vs. TOT = page_cycler.typical_25:cold_times.page_load_time 1085.13±9.31% 754.28±12.03% (30.49%) http://storage.googleapis.com/chromium-telemetry/html-results/results-2016-05... while CED adds virtually no additional loading time (delta < sigma): = CED auto-detection vs. TOT = page_cycler.typical_25:cold_times.page_load_time ms 705.70±9.49% vs. 760.31±11.90% (-7.74%) http://storage.googleapis.com/chromium-telemetry/html-results/results-2016-05... With CED, it is feasible to turn on auto encoding detection by default so that web pages without encoding label can be taken care of. It will be done in a follow-up CL. BUG=597488 Committed: https://crrev.com/13510755ac11e6e1f58c34ef8c9bd4cf925c8d70 Cr-Commit-Position: refs/heads/master@{#402622} ==========
Message was sent while issue was closed.
Patchset 4 (id:??) landed as https://crrev.com/13510755ac11e6e1f58c34ef8c9bd4cf925c8d70 Cr-Commit-Position: refs/heads/master@{#402622}
Message was sent while issue was closed.
A revert of this CL (patchset #4 id:120001) has been created in https://codereview.chromium.org/2109633003/ by shans@chromium.org. The reason for reverting is: This seems to have broken compile on Windows 8: [46/8396] CXX obj\third_party\ced\src\compact_enc_det\ced.compact_enc_det_hint_code.obj FAILED: obj/third_party/ced/src/compact_enc_det/ced.compact_enc_det_hint_code.obj ninja -t msvc -e environment.x86 -- C:\b\build\slave\cache\cipd\goma/gomacc "C:\b\depot_tools\win_toolchain\vs_files\95ddda401ec5678f15eeed01d2bee08fcbc5ee97\VC\bin\amd64_x86\cl.exe" /nologo /showIncludes /FC @obj\third_party\ced\src\compact_enc_det\ced.compact_enc_det_hint_code.obj.rsp /c ..\..\third_party\ced\src\compact_enc_det\compact_enc_det_hint_code.cc /Foobj\third_party\ced\src\compact_enc_det\ced.compact_enc_det_hint_code.obj /Fdobj\third_party\ced\ced.cc.pdb c:\b\build\slave\win8_gyp\build\src\third_party\ced\src\compact_enc_det\compact_enc_det_hint_code.cc(122): error C2220: warning treated as error - no 'object' file generated c:\b\build\slave\win8_gyp\build\src\third_party\ced\src\compact_enc_det\compact_enc_det_hint_code.cc(122): warning C4018: '<': signed/unsigned mismatch c:\b\build\slave\win8_gyp\build\src\third_party\ced\src\compact_enc_det\compact_enc_det_hint_code.cc(151): warning C4018: '<': signed/unsigned mismatch c:\b\build\slave\win8_gyp\build\src\third_party\ced\src\compact_enc_det\compact_enc_det_hint_code.cc(169): warning C4018: '<': signed/unsigned mismatch https://build.chromium.org/p/chromium.win/builders/Win8%20GYP/builds/89. |