|
|
Created:
4 years, 5 months ago by Jinsuk Kim Modified:
4 years, 4 months ago Reviewers:
cbentzel, Lei Zhang, jungshik at Google, armansito, Elly Fong-Jones, no sievers, Paweł Hajdan Jr. CC:
Paweł Hajdan Jr., cbentzel+watch_chromium.org, chromium-reviews, darin-cc_chromium.org, jam, jshin+watch_chromium.org, oshima+watch_chromium.org, stevenjb+watch_chromium.org Base URL:
https://chromium.googlesource.com/chromium/src.git@master Target Ref:
refs/pending/heads/master Project:
chromium Visibility:
Public. |
DescriptionReplace ICU encoding detection with CED
CED (Compact Encoding Detection) is a better alternative to
ICU in terms of speed and accuracy. This CL switches the encoding
detection library to CED and fixes some of the failing cases.
BUG=630113
R=armansito@chromium.org, cbentzel@chromium.org, ellyjones@chromium.org, jshin@chromium.org, phajdan.jr@chromium.org, sievers@chromium.org, thestig@chromium.org
Committed: https://chromium.googlesource.com/chromium/src/+/70de1704feea35840fbed4bc1e864a24b620569e
Patch Set 1 : icu -> ced #Patch Set 2 : temorarily reverted test files #
Total comments: 6
Patch Set 3 : addressed comments #Patch Set 4 : fix build #Patch Set 5 : HTML5 mode #
Total comments: 6
Patch Set 6 : addressed comments #
Total comments: 6
Patch Set 7 : comments/add datafiles back #
Total comments: 2
Messages
Total messages: 105 (69 generated)
Description was changed from ========== Replace ICU encoding detection with CED CED (Compact Encoding Detection) is a better alternative to ICU in terms of speed and accuracy. This CL switches the encoding detection library to CED and fixes some of the failing cases. BUG=630113 ========== to ========== Replace ICU encoding detection with CED CED (Compact Encoding Detection) is a better alternative to ICU in terms of speed and accuracy. This CL switches the encoding detection library to CED and fixes some of the failing cases. BUG=630113 ==========
Patchset #1 (id:1) has been deleted
jinsukkim@chromium.org changed reviewers: + jshin@chromium.org
jshin@: Let me have your initial review on overall change. I'll take care of some details/builds soon. Note: base::DetectEncoding() was previously returning an empty string for US-ASCII. This was changed so that the returned value is explicitly checked against the encoding name.
The CQ bit was checked by jinsukkim@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: chromeos_amd64-generic_chromium_compile_only_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/chromeos_amd64-...) linux_chromium_chromeos_ozone_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_...) linux_chromium_clobber_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_...) ios-device on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-device/builds...) ios-device-gn on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-device-gn/bui...) ios-simulator on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-simulator/bui...) ios-simulator-gn on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-simulator-gn/...) mac_chromium_compile_dbg_ng on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_comp...) mac_chromium_rel_ng on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_...)
On 2016/07/21 07:50:10, Jinsuk wrote: > jshin@: Let me have your initial review on overall change. I'll take care of > some details/builds soon. > > Note: base::DetectEncoding() was previously returning an empty string for > US-ASCII. This was changed so that the returned value is explicitly checked > against the encoding name. Looks like Rietveld doesn't know how to patch files with DOS linefeed (\d\a). I'll pull them out from the CL for now, and land manually.
The CQ bit was checked by jinsukkim@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: ios-device on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-device/builds...)
jinsukkim@chromium.org changed reviewers: + phajdan.jr@chromium.org
phajdan.jr@chromium.org: Please review changes in net/ftp
The CQ bit was checked by jinsukkim@chromium.org to run a CQ dry run
Patchset #2 (id:40001) has been deleted
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: ios-device on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-device/builds...)
https://codereview.chromium.org/2168003003/diff/60001/base/i18n/encoding_dete... File base/i18n/encoding_detection.h (right): https://codereview.chromium.org/2168003003/diff/60001/base/i18n/encoding_dete... base/i18n/encoding_detection.h:16: BASE_I18N_EXPORT bool DetectEncoding(const std::string& text, nit: Add WARN_UNUSED_RESULT https://codereview.chromium.org/2168003003/diff/60001/net/ftp/ftp_directory_l... File net/ftp/ftp_directory_listing_parser_unittest.cc (right): https://codereview.chromium.org/2168003003/diff/60001/net/ftp/ftp_directory_l... net/ftp/ftp_directory_listing_parser_unittest.cc:145: // {"dir-listing-ls-20", OK}, I'm worried about these tests getting disabled. Does that mean this change breaks them? Historically we needed DetectAllEncodings vs. just DetectEncoding. I'm taking vacation next week. Feel free to get review from net/ OWNERS, although I'd recommend that this concern gets addressed and no tests get disabled. It doesn't seem trivial to fix them.
The CQ bit was checked by jinsukkim@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
jinsukkim@chromium.org changed reviewers: + eroman@chromium.org - phajdan.jr@chromium.org
https://codereview.chromium.org/2168003003/diff/60001/base/i18n/encoding_dete... File base/i18n/encoding_detection.h (right): https://codereview.chromium.org/2168003003/diff/60001/base/i18n/encoding_dete... base/i18n/encoding_detection.h:16: BASE_I18N_EXPORT bool DetectEncoding(const std::string& text, On 2016/07/22 13:22:19, Paweł Hajdan (OOO back Aug 1) wrote: > nit: Add WARN_UNUSED_RESULT Done. https://codereview.chromium.org/2168003003/diff/60001/net/ftp/ftp_directory_l... File net/ftp/ftp_directory_listing_parser_unittest.cc (right): https://codereview.chromium.org/2168003003/diff/60001/net/ftp/ftp_directory_l... net/ftp/ftp_directory_listing_parser_unittest.cc:145: // {"dir-listing-ls-20", OK}, On 2016/07/22 13:22:19, Paweł Hajdan (OOO back Aug 1) wrote: > I'm worried about these tests getting disabled. > > Does that mean this change breaks them? > > Historically we needed DetectAllEncodings vs. just DetectEncoding. > > I'm taking vacation next week. Feel free to get review from net/ OWNERS, > although I'd recommend that this concern gets addressed and no tests get > disabled. It doesn't seem trivial to fix them. No quite the opposite - the 3 tests were already failing, and this CL fixes them. I'm temporarily commented them out because the test files to update have DOS-formatted linefeeds which Rietveld cannot handle. I'm going them put them back once the review is completed and land manually. Updated the comment to make the intention clearer.
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: ios-device on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-device/builds...)
The CQ bit was checked by jinsukkim@chromium.org to run a CQ dry run
Patchset #3 (id:80001) has been deleted
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: win8_chromium_gyp_rel on master.tryserver.chromium.win (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.win/builders/win8_chromium_gyp...)
The CQ bit was checked by jinsukkim@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Patchset #3 (id:100001) has been deleted
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: ios-simulator on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-simulator/bui...) win8_chromium_gyp_rel on master.tryserver.chromium.win (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.win/builders/win8_chromium_gyp...)
jinsukkim@chromium.org changed reviewers: + armansito@chromium.org, ellyjones@chromium.org, pfeldman@chromium.org, thestig@chromium.org
jshin@chromium.org : Please review changes in base/i18n thestig@chromium.org: Please review changes in pdf, base/ ellyjones@chromium.org: Please review changes in ios/ armansito@chromium.org: Please review changes in chromeos/ pfeldman@chromium.org: Please review changes in content/child
/ios lgtm
looks ok, but there are red bots.
The CQ bit was checked by jinsukkim@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: ios-device on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-device/builds...)
The CQ bit was checked by jinsukkim@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: ios-device on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-device/builds...)
I'm also worried about encoding names and non-html5 encodings coming out of CED (especially the latter because the former is kinda taken care of, I guess). https://github.com/google/compact_enc_det/issues/1 https://github.com/google/compact_enc_det/issues/2
The CQ bit was checked by jinsukkim@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Patchset #5 (id:160001) has been deleted
Patchset #4 (id:140001) has been deleted
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: win8_chromium_gyp_rel on master.tryserver.chromium.win (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.win/builders/win8_chromium_gyp...)
The CQ bit was checked by jinsukkim@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Patchset #4 (id:180001) has been deleted
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: win_chromium_x64_rel_ng on master.tryserver.chromium.win (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_x64_...)
The CQ bit was checked by jinsukkim@chromium.org to run a CQ dry run
Patchset #4 (id:200001) has been deleted
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
chromeos/ lgtm
thestig@: All builds are green now. jshin@: Introduced HTML5_MODE in https://crrev.com/2188663002 to handle non-HTML5 encodings (7-bit ones except ISO_2022_JP). Now they are treated as ASCII_7BIT.
Terribly sorry for the delay. My previous comment about 7-bit encoding can be addressed by *excluding* 7-bit encoding when calling CED (as I mentioned below). https://codereview.chromium.org/2168003003/diff/240001/base/i18n/encoding_det... File base/i18n/encoding_detection.cc (right): https://codereview.chromium.org/2168003003/diff/240001/base/i18n/encoding_det... base/i18n/encoding_detection.cc:18: CompactEncDet::WEB_CORPUS, FTP directory listing is one of customers, isn't it? In that case, is WEB_CORPUS a good choice? Other customers: 1. SSID encoding detection (it's not Web corpus, either) 2. pdfium uses it for font name encoding detection (again, it's not Web corpus). Use PlainText (I didn't look up CED code for the name for plain text) and document in the header file that it's for plain text only. An alternative is to add another argument to specify input text type, but it'd be an overkill for now because all 3 callers pass 'plain text' input. BTW, the detection won't be very reliable for those two cases. CED takes other hints (ccTLD, etc...). In a follow-up to this CL, you may consider passing them if available. Well, that would be more helpful when calling CED in Blink where ccTLD and other clues (e.g. the current UI language of web browser) are known. https://codereview.chromium.org/2168003003/diff/240001/base/i18n/encoding_det... base/i18n/encoding_detection.cc:19: false, // Include 7-bit encodings In this case, I would NOT include 7-bit encodings. I *hope* there's no FTP servers using ISO-2022-JP. https://codereview.chromium.org/2168003003/diff/240001/third_party/ced/BUILD.gn File third_party/ced/BUILD.gn (right): https://codereview.chromium.org/2168003003/diff/240001/third_party/ced/BUILD.... third_party/ced/BUILD.gn:17: # LC_COLLATE=c sort | sed 's/^\(.*\)$/ "\1",/' I guess not all header files are public. I'm NOT a GN expert, but it might be more kosher to list public headers separately in public[] array. (my attempt to clean up ICU that way is https://codereview.chromium.org/2186343002/ ).
On 2016/07/27 02:22:21, Jinsuk wrote: > thestig@: All builds are green now. > > jshin@: Introduced HTML5_MODE in https://crrev.com/2188663002 to handle > non-HTML5 encodings (7-bit ones except ISO_2022_JP). Now they are treated as > ASCII_7BIT. Oh.. Sorry I missed this. Actually, treating 7-bit encodings other than ISO-2022-JP as 'ASCII' can be potentially problematic. For use in Blink, let CED return 7-bit encoding names (as listed in WHATWG encoding spec) and Blink will convert the whole input to a single U+FFFD (replacement character). In base/i18n, excluding all 7-bit encodings would be better (as I mentioned in my previous comment).
On 2016/07/29 06:44:24, jungshik at google wrote: > On 2016/07/27 02:22:21, Jinsuk wrote: > > thestig@: All builds are green now. > > > > jshin@: Introduced HTML5_MODE in https://crrev.com/2188663002 to handle > > non-HTML5 encodings (7-bit ones except ISO_2022_JP). Now they are treated as > > ASCII_7BIT. > > Oh.. Sorry I missed this. Actually, treating 7-bit encodings other than > ISO-2022-JP as 'ASCII' can be potentially problematic. > For use in Blink, let CED return 7-bit encoding names (as listed in WHATWG > encoding spec) and Blink will convert the whole input to a single U+FFFD > (replacement character). > > In base/i18n, excluding all 7-bit encodings would be better (as I mentioned in > my previous comment).
On 2016/07/29 07:07:27, Jinsuk wrote: > On 2016/07/29 06:44:24, jungshik at google wrote: > > On 2016/07/27 02:22:21, Jinsuk wrote: > > > thestig@: All builds are green now. > > > > > > jshin@: Introduced HTML5_MODE in https://crrev.com/2188663002 to handle > > > non-HTML5 encodings (7-bit ones except ISO_2022_JP). Now they are treated as > > > ASCII_7BIT. > > > > Oh.. Sorry I missed this. Actually, treating 7-bit encodings other than > > ISO-2022-JP as 'ASCII' can be potentially problematic. > > For use in Blink, let CED return 7-bit encoding names (as listed in WHATWG > > encoding spec) and Blink will convert the whole input to a single U+FFFD > > (replacement character). > > > > In base/i18n, excluding all 7-bit encodings would be better (as I mentioned > in > > my previous comment). When told to ignore 7-bit encoding (by passing |true| to the flag), CED was written to return ASCII-7BIT in the first place. That's why I choose the encoding for them. I think I should still tell it to ignore 7-bit, and handles the response here manually. https://encoding.spec.whatwg.org/ doesn't seem to have HZ, ISO-2022-CN/KR defined in it. It has encoding called 'replacement' mapped to { 'csiso2022kr', 'hz-gb-2312', 'iso-2022-cn', ...} Did you mean to use one of these values? And any suggestion for UTF7 not even mentioned in the standard?
On 2016/07/29 07:12:03, Jinsuk wrote: > On 2016/07/29 07:07:27, Jinsuk wrote: > > On 2016/07/29 06:44:24, jungshik at google wrote: > > > On 2016/07/27 02:22:21, Jinsuk wrote: > > > > thestig@: All builds are green now. > > > > > > > > jshin@: Introduced HTML5_MODE in https://crrev.com/2188663002 to handle > > > > non-HTML5 encodings (7-bit ones except ISO_2022_JP). Now they are treated > as > > > > ASCII_7BIT. > > > > > > Oh.. Sorry I missed this. Actually, treating 7-bit encodings other than > > > ISO-2022-JP as 'ASCII' can be potentially problematic. > > > For use in Blink, let CED return 7-bit encoding names (as listed in WHATWG > > > encoding spec) and Blink will convert the whole input to a single U+FFFD > > > (replacement character). > > > > > > In base/i18n, excluding all 7-bit encodings would be better (as I mentioned > > in > > > my previous comment). > > When told to ignore 7-bit encoding (by passing |true| to the flag), CED was > written to return ASCII-7BIT in the first place. That's why I choose the > encoding for them. I think I should still tell it to ignore 7-bit, and handles > the response here manually. > > https://encoding.spec.whatwg.org/ doesn't seem to have HZ, ISO-2022-CN/KR > defined in it. It has encoding called 'replacement' mapped to { 'csiso2022kr', > 'hz-gb-2312', 'iso-2022-cn', ...} Did you mean to use one of these values? And > any suggestion for UTF7 not even mentioned in the standard? Sorry I was confused. I'll handle your suggestion on Blink in a follow-up CL, focusing on chrome use here. But the question still stands - telling CED to ignore 7-bit encoding makes it return ASCII-7BIT. I guess that's not what you intended to see.
The CQ bit was checked by jinsukkim@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
https://codereview.chromium.org/2168003003/diff/240001/base/i18n/encoding_det... File base/i18n/encoding_detection.cc (right): https://codereview.chromium.org/2168003003/diff/240001/base/i18n/encoding_det... base/i18n/encoding_detection.cc:18: CompactEncDet::WEB_CORPUS, On 2016/07/29 06:35:46, jungshik at google wrote: > FTP directory listing is one of customers, isn't it? In that case, is WEB_CORPUS > a good choice? > > Other customers: 1. SSID encoding detection (it's not Web corpus, either) > 2. pdfium uses it for font name encoding detection (again, it's not Web corpus). > > > Use PlainText (I didn't look up CED code for the name for plain text) and > document in the header file that it's for plain text only. > Good point. Used QUERY_CORPUS which is for plain text. > An alternative is to add another argument to specify input text type, but it'd > be an overkill for now because all 3 callers pass 'plain text' input. > > BTW, the detection won't be very reliable for those two cases. > > CED takes other hints (ccTLD, etc...). In a follow-up to this CL, you may > consider passing them if available. Well, that would be more helpful when > calling CED in Blink where ccTLD and other clues (e.g. the current UI language > of web browser) are known. Granted, but I observed that it is on par with (or better than) ICU even now. https://codereview.chromium.org/2168003003/diff/240001/base/i18n/encoding_det... base/i18n/encoding_detection.cc:19: false, // Include 7-bit encodings On 2016/07/29 06:35:46, jungshik at google wrote: > In this case, I would NOT include 7-bit encodings. > I *hope* there's no FTP servers using ISO-2022-JP. By 'NOT include' I believe you mean text in those encodings (HZ, UTF7, ISO-2022-{CN,KR}) to be left alone without conversion. Marking them as 7bit will effectively leave them intact, which is what you intend. https://codereview.chromium.org/2168003003/diff/240001/third_party/ced/BUILD.gn File third_party/ced/BUILD.gn (right): https://codereview.chromium.org/2168003003/diff/240001/third_party/ced/BUILD.... third_party/ced/BUILD.gn:17: # LC_COLLATE=c sort | sed 's/^\(.*\)$/ "\1",/' On 2016/07/29 06:35:46, jungshik at google wrote: > I guess not all header files are public. I'm NOT a GN expert, but it might be > more kosher to list public headers separately in public[] array. (my attempt to > clean up ICU that way is https://codereview.chromium.org/2186343002/ ). > Thanks for the tip. Done.
On 2016/08/01 07:48:15, Jinsuk wrote: > https://codereview.chromium.org/2168003003/diff/240001/base/i18n/encoding_det... > File base/i18n/encoding_detection.cc (right): > > https://codereview.chromium.org/2168003003/diff/240001/base/i18n/encoding_det... > base/i18n/encoding_detection.cc:18: CompactEncDet::WEB_CORPUS, > On 2016/07/29 06:35:46, jungshik at google wrote: > > FTP directory listing is one of customers, isn't it? In that case, is > WEB_CORPUS > > a good choice? > > > > Other customers: 1. SSID encoding detection (it's not Web corpus, either) > > 2. pdfium uses it for font name encoding detection (again, it's not Web > corpus). > > > > > > Use PlainText (I didn't look up CED code for the name for plain text) and > > document in the header file that it's for plain text only. > > > > Good point. Used QUERY_CORPUS which is for plain text. > > > An alternative is to add another argument to specify input text type, but it'd > > be an overkill for now because all 3 callers pass 'plain text' input. > > > > BTW, the detection won't be very reliable for those two cases. > > > > CED takes other hints (ccTLD, etc...). In a follow-up to this CL, you may > > consider passing them if available. Well, that would be more helpful when > > calling CED in Blink where ccTLD and other clues (e.g. the current UI language > > of web browser) are known. > > Granted, but I observed that it is on par with (or better than) ICU even now. > > https://codereview.chromium.org/2168003003/diff/240001/base/i18n/encoding_det... > base/i18n/encoding_detection.cc:19: false, // Include 7-bit encodings > On 2016/07/29 06:35:46, jungshik at google wrote: > > In this case, I would NOT include 7-bit encodings. > > I *hope* there's no FTP servers using ISO-2022-JP. > > By 'NOT include' I believe you mean text in those encodings (HZ, UTF7, > ISO-2022-{CN,KR}) to be left alone without conversion. Marking them as 7bit will > effectively leave them intact, which is what you intend. > > https://codereview.chromium.org/2168003003/diff/240001/third_party/ced/BUILD.gn > File third_party/ced/BUILD.gn (right): > > https://codereview.chromium.org/2168003003/diff/240001/third_party/ced/BUILD.... > third_party/ced/BUILD.gn:17: # LC_COLLATE=c sort | sed 's/^\(.*\)$/ "\1",/' > On 2016/07/29 06:35:46, jungshik at google wrote: > > I guess not all header files are public. I'm NOT a GN expert, but it might > be > > more kosher to list public headers separately in public[] array. (my attempt > to > > clean up ICU that way is https://codereview.chromium.org/2186343002/ ). > > > > Thanks for the tip. Done. thestig@, jshin@ Please take another look.
phajdan.jr@chromium.org changed reviewers: + phajdan.jr@chromium.org
https://codereview.chromium.org/2168003003/diff/60001/net/ftp/ftp_directory_l... File net/ftp/ftp_directory_listing_parser_unittest.cc (right): https://codereview.chromium.org/2168003003/diff/60001/net/ftp/ftp_directory_l... net/ftp/ftp_directory_listing_parser_unittest.cc:145: // {"dir-listing-ls-20", OK}, On 2016/07/24 at 22:24:46, Jinsuk wrote: > No quite the opposite - the 3 tests were already failing, and this CL fixes them. I'm temporarily commented them out because the test files to update have DOS-formatted linefeeds which Rietveld cannot handle. I'm going them put them back once the review is completed and land manually. Updated the comment to make the intention clearer. I'd like to understand this better. These test cases were _enabled_ before. Could you explain more what do you mean that they were failing? Also, it seems you'd like to change the corresponding data files. Could you explain more why and how would the contents change?
https://codereview.chromium.org/2168003003/diff/60001/net/ftp/ftp_directory_l... File net/ftp/ftp_directory_listing_parser_unittest.cc (right): https://codereview.chromium.org/2168003003/diff/60001/net/ftp/ftp_directory_l... net/ftp/ftp_directory_listing_parser_unittest.cc:145: // {"dir-listing-ls-20", OK}, On 2016/08/08 09:00:48, Paweł Hajdan Jr. wrote: > On 2016/07/24 at 22:24:46, Jinsuk wrote: > > No quite the opposite - the 3 tests were already failing, and this CL fixes > them. I'm temporarily commented them out because the test files to update have > DOS-formatted linefeeds which Rietveld cannot handle. I'm going them put them > back once the review is completed and land manually. Updated the comment to make > the intention clearer. > > I'd like to understand this better. > > These test cases were _enabled_ before. Could you explain more what do you mean > that they were failing? > > Also, it seems you'd like to change the corresponding data files. Could you > explain more why and how would the contents change? If you open the expected files, they are actually wrong results. They were checked in just to make the tests pass.
Okay. LGTM.
On 2016/07/25 07:52:28, Jinsuk wrote: > mailto:jshin@chromium.org : Please review changes in base/i18n > > mailto:thestig@chromium.org: Please review changes in pdf, base/ Looks ok, but I would like jshin to approve first.
LGTM Thanks ! (and sorry).
lgtm with some nits: https://codereview.chromium.org/2168003003/diff/260001/base/i18n/encoding_det... File base/i18n/encoding_detection.cc (right): https://codereview.chromium.org/2168003003/diff/260001/base/i18n/encoding_det... base/i18n/encoding_detection.cc:12: int consumedBytes; nit: foo_bar, not fooBar. https://codereview.chromium.org/2168003003/diff/260001/base/i18n/encoding_det... base/i18n/encoding_detection.cc:18: CompactEncDet::QUERY_CORPUS, // plain text two spaces in front of comments. https://codereview.chromium.org/2168003003/diff/260001/base/i18n/encoding_det... File base/i18n/encoding_detection.h (right): https://codereview.chromium.org/2168003003/diff/260001/base/i18n/encoding_det... base/i18n/encoding_detection.h:1: // Copyright (c) 2016 The Chromium Authors. All rights reserved. nit: no (c), ditto for .cc file.
jinsukkim@chromium.org changed reviewers: + sievers@chromium.org - eroman@chromium.org, pfeldman@chromium.org
Thanks! sievers@ Could you review content/? I'd like a owner's lgtm. Thanks.
On 2016/08/08 22:14:00, Jinsuk wrote: > Thanks! > > sievers@ Could you review content/? I'd like a owner's lgtm. Thanks. lgtm
jinsukkim@chromium.org changed reviewers: + cbentzel@chromium.org
cbentzel@chromium.org: Please review changes in net/data - the old test data files are wrong (lots of gibberish) but still checked in since the old encoding detector couldn't handle them. This CL fixes it by improving the detection accuracy. You can see the data files now have expected results. https://codereview.chromium.org/2168003003/diff/260001/base/i18n/encoding_det... File base/i18n/encoding_detection.cc (right): https://codereview.chromium.org/2168003003/diff/260001/base/i18n/encoding_det... base/i18n/encoding_detection.cc:12: int consumedBytes; On 2016/08/08 22:09:55, Lei Zhang wrote: > nit: foo_bar, not fooBar. Done. https://codereview.chromium.org/2168003003/diff/260001/base/i18n/encoding_det... base/i18n/encoding_detection.cc:18: CompactEncDet::QUERY_CORPUS, // plain text On 2016/08/08 22:09:55, Lei Zhang wrote: > two spaces in front of comments. Done. https://codereview.chromium.org/2168003003/diff/260001/base/i18n/encoding_det... File base/i18n/encoding_detection.h (right): https://codereview.chromium.org/2168003003/diff/260001/base/i18n/encoding_det... base/i18n/encoding_detection.h:1: // Copyright (c) 2016 The Chromium Authors. All rights reserved. On 2016/08/08 22:09:55, Lei Zhang wrote: > nit: no (c), ditto for .cc file. Done.
On 2016/08/09 01:16:19, Jinsuk wrote: > mailto:cbentzel@chromium.org: Please review changes in net/data - the old test data > files are wrong (lots of gibberish) but still checked in since the old encoding > detector couldn't handle them. This CL fixes it by improving the detection > accuracy. You can see the data files now have expected results. > I'll have to look a bit more, but I'm surprised that expected files changed when input files are unchanged.
On 2016/08/09 01:16:19, Jinsuk wrote: > mailto:cbentzel@chromium.org: Please review changes in net/data - the old test data > files are wrong (lots of gibberish) but still checked in since the old encoding > detector couldn't handle them. This CL fixes it by improving the detection > accuracy. You can see the data files now have expected results. > I'll have to look a bit more, but I'm surprised that expected files changed when input files are unchanged.
https://codereview.chromium.org/2168003003/diff/280001/net/ftp/ftp_directory_... File net/ftp/ftp_directory_listing_parser.cc (right): https://codereview.chromium.org/2168003003/diff/280001/net/ftp/ftp_directory_... net/ftp/ftp_directory_listing_parser.cc:93: if (base::CodepageToUTF16(text, encoding_name, I assume that encoding names are the same in CED as in ICU?
https://codereview.chromium.org/2168003003/diff/280001/net/ftp/ftp_directory_... File net/ftp/ftp_directory_listing_parser.cc (right): https://codereview.chromium.org/2168003003/diff/280001/net/ftp/ftp_directory_... net/ftp/ftp_directory_listing_parser.cc:93: if (base::CodepageToUTF16(text, encoding_name, On 2016/08/10 01:52:00, cbentzel wrote: > I assume that encoding names are the same in CED as in ICU? They are not precisely the same. CED returns IANA standard-compliant names. But ICU which |base::CodepageToUTF16| is based on has no problem understanding them. It uses the following table for sanitizing the names https://cs.chromium.org/chromium/src/third_party/icu/source/data/mappings/con...
On 2016/08/10 01:50:29, cbentzel wrote: > On 2016/08/09 01:16:19, Jinsuk wrote: > > mailto:cbentzel@chromium.org: Please review changes in net/data - the old test > data > > files are wrong (lots of gibberish) but still checked in since the old > encoding > > detector couldn't handle them. This CL fixes it by improving the detection > > accuracy. You can see the data files now have expected results. > > > > I'll have to look a bit more, but I'm surprised that expected files changed when > input files are unchanged. ICU was making wrong guesses on those documents (thinks they are encoded is ISO-8859-1) which are actually in windows-1251. CED works better and could detect them right.
On 2016/08/10 07:41:25, Jinsuk wrote: > On 2016/08/10 01:50:29, cbentzel wrote: > > On 2016/08/09 01:16:19, Jinsuk wrote: > > > mailto:cbentzel@chromium.org: Please review changes in net/data - the old > test > > data > > > files are wrong (lots of gibberish) but still checked in since the old > > encoding > > > detector couldn't handle them. This CL fixes it by improving the detection > > > accuracy. You can see the data files now have expected results. > > > > > > > I'll have to look a bit more, but I'm surprised that expected files changed > when > > input files are unchanged. > > ICU was making wrong guesses on those documents (thinks they are encoded is > ISO-8859-1) which are actually in windows-1251. CED works better and could > detect them right. LGTM, thanks for the answers.
The CQ bit was checked by jinsukkim@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from ellyjones@chromium.org, armansito@chromium.org, thestig@chromium.org, jshin@chromium.org, sievers@chromium.org, phajdan.jr@chromium.org Link to the patchset: https://codereview.chromium.org/2168003003/#ps280001 (title: "comments/add datafiles back")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by jinsukkim@chromium.org
Message was sent while issue was closed.
Description was changed from ========== Replace ICU encoding detection with CED CED (Compact Encoding Detection) is a better alternative to ICU in terms of speed and accuracy. This CL switches the encoding detection library to CED and fixes some of the failing cases. BUG=630113 ========== to ========== Replace ICU encoding detection with CED CED (Compact Encoding Detection) is a better alternative to ICU in terms of speed and accuracy. This CL switches the encoding detection library to CED and fixes some of the failing cases. BUG=630113 R=armansito@chromium.org, cbentzel@chromium.org, ellyjones@chromium.org, jshin@chromium.org, phajdan.jr@chromium.org, sievers@chromium.org, thestig@chromium.org Committed: https://crrev.com/70de1704feea35840fbed4bc1e864a24b620569e Cr-Commit-Position: refs/heads/master@{#411163} ==========
Message was sent while issue was closed.
Patchset 7 (id:??) landed as https://crrev.com/70de1704feea35840fbed4bc1e864a24b620569e Cr-Commit-Position: refs/heads/master@{#411163}
Message was sent while issue was closed.
Description was changed from ========== Replace ICU encoding detection with CED CED (Compact Encoding Detection) is a better alternative to ICU in terms of speed and accuracy. This CL switches the encoding detection library to CED and fixes some of the failing cases. BUG=630113 R=armansito@chromium.org, cbentzel@chromium.org, ellyjones@chromium.org, jshin@chromium.org, phajdan.jr@chromium.org, sievers@chromium.org, thestig@chromium.org Committed: https://crrev.com/70de1704feea35840fbed4bc1e864a24b620569e Cr-Commit-Position: refs/heads/master@{#411163} ========== to ========== Replace ICU encoding detection with CED CED (Compact Encoding Detection) is a better alternative to ICU in terms of speed and accuracy. This CL switches the encoding detection library to CED and fixes some of the failing cases. BUG=630113 R=armansito@chromium.org, cbentzel@chromium.org, ellyjones@chromium.org, jshin@chromium.org, phajdan.jr@chromium.org, sievers@chromium.org, thestig@chromium.org Committed: https://chromium.googlesource.com/chromium/src/+/70de1704feea35840fbed4bc1e86... ==========
Message was sent while issue was closed.
Committed patchset #7 (id:280001) manually as 70de1704feea35840fbed4bc1e864a24b620569e. |