|
|
Created:
4 years, 10 months ago by Jinsuk Kim Modified:
4 years, 8 months ago CC:
aelias_OOO_until_Jul13, blink-reviews, blink-reviews-html_chromium.org, chromium-reviews, dglazkov+blink, kinuko+watch Base URL:
https://chromium.googlesource.com/chromium/src.git@master Target Ref:
refs/pending/heads/master Project:
chromium Visibility:
Public. |
DescriptionUTF-8 detector for pages missing encoding info
Experiment crbug.com/518968 shows that about 30% of the pages
missing character encoding information are encoded in UTF-8.
This CL runs a quick one-pass scan against them to check if
they are encoded in UTF-8. It helps get the encoding right
for 30% of the pages without having to turn on full auto-encoding
detection logic which is disabled due to slow performance.
BUG=583549
TEST=Layout test: fast/encoding/unlabelled-non-ascii-utf8.html
Committed: https://crrev.com/2af3917eb9ca14b263116d664a8257ae69680610
Cr-Commit-Position: refs/heads/master@{#387209}
Patch Set 1 #
Total comments: 4
Patch Set 2 : #
Total comments: 8
Patch Set 3 : addressed comments #
Total comments: 7
Patch Set 4 : addressed comments (wtf) #Patch Set 5 : updated webkit layout tests accordingly #Patch Set 6 : #
Total comments: 14
Patch Set 7 : rebased & removed test files to be rebasedlined #Patch Set 8 : addressed comments #Patch Set 9 : rebased #Patch Set 10 : a new layout test file for testing UTF8 encoding detection #
Total comments: 2
Patch Set 11 : turned the new layout test to reference test #Patch Set 12 : left out test files that should be landed manually #Messages
Total messages: 46 (11 generated)
jinsukkim@chromium.org changed reviewers: + esprehn@chromium.org
Initial perf results for UTF8 encoding detector on Linux/Mac is at crbug.com/583549#c6. The amount of page load time increased by the detector is not obvious here since it's smaller than error margin. Will run microbenchmark if required.
aelias@chromium.org changed reviewers: + aelias@chromium.org
aelias@chromium.org changed reviewers: + aelias@chromium.org
https://codereview.chromium.org/1721373002/diff/1/third_party/WebKit/Source/c... File third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp (right): https://codereview.chromium.org/1721373002/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp:409: } else if ((m_source == DefaultEncoding || (m_source == EncodingFromParentFrame && m_hintEncoding)) && isUTF8Encoded(data, len)) { This if statement is duplicative with the other and also with shouldAutodetect(), could you create a small method for it? https://codereview.chromium.org/1721373002/diff/1/third_party/WebKit/Source/p... File third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp (right): https://codereview.chromium.org/1721373002/diff/1/third_party/WebKit/Source/p... third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp:127: if (!U_IS_UNICODE_CHAR(codePoint)) According to http://icu-project.org/apiref/icu4c/utf8_8h.html reference for U8_NEXT, "If the offset points to a trail byte or an illegal UTF-8 sequence, then c is set to a negative value.", so how about just checking for < 0 instead?
https://codereview.chromium.org/1721373002/diff/1/third_party/WebKit/Source/c... File third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp (right): https://codereview.chromium.org/1721373002/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp:409: } else if ((m_source == DefaultEncoding || (m_source == EncodingFromParentFrame && m_hintEncoding)) && isUTF8Encoded(data, len)) { This if statement is duplicative with the other and also with shouldAutodetect(), could you create a small method for it? https://codereview.chromium.org/1721373002/diff/1/third_party/WebKit/Source/p... File third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp (right): https://codereview.chromium.org/1721373002/diff/1/third_party/WebKit/Source/p... third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp:127: if (!U_IS_UNICODE_CHAR(codePoint)) According to http://icu-project.org/apiref/icu4c/utf8_8h.html reference for U8_NEXT, "If the offset points to a trail byte or an illegal UTF-8 sequence, then c is set to a negative value.", so how about just checking for < 0 instead?
Thanks for reviewing. https://codereview.chromium.org/1721373002/diff/1/third_party/WebKit/Source/c... File third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp (right): https://codereview.chromium.org/1721373002/diff/1/third_party/WebKit/Source/c... third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp:409: } else if ((m_source == DefaultEncoding || (m_source == EncodingFromParentFrame && m_hintEncoding)) && isUTF8Encoded(data, len)) { On 2016/02/24 04:37:53, aelias wrote: > This if statement is duplicative with the other and also with > shouldAutodetect(), could you create a small method for it? Done. https://codereview.chromium.org/1721373002/diff/1/third_party/WebKit/Source/p... File third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp (right): https://codereview.chromium.org/1721373002/diff/1/third_party/WebKit/Source/p... third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp:127: if (!U_IS_UNICODE_CHAR(codePoint)) On 2016/02/24 04:37:54, aelias wrote: > According to http://icu-project.org/apiref/icu4c/utf8_8h.html reference for > U8_NEXT, "If the offset points to a trail byte or an illegal UTF-8 sequence, > then c is set to a negative value.", so how about just checking for < 0 instead? Thanks for looking into the detail. Ran the unittests again with codePoint < 0 to find that it works in almost all cases but for following: // Non-characters : U+xxFFF[EF] where xx is 0x00 through 0x10 and <FDD0,FDEF> EXPECT_FALSE(isUTF8Encoded("\xef\xbf\xbe", 3)); // U+FFFE EXPECT_FALSE(isUTF8Encoded("\xf3\xbf\xbf\xbf", 4)); // U+10FFFF EXPECT_FALSE(isUTF8Encoded("\xef\xb7\x90", 3)); // U+FDD0 EXPECT_FALSE(isUTF8Encoded("\xef\xb7\xaf", 3)); // U+FDEF Not sure how important/practical these cases are.
lgtm modulo nit, still needs OWNERS approval. https://codereview.chromium.org/1721373002/diff/20001/third_party/WebKit/Sour... File third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp (right): https://codereview.chromium.org/1721373002/diff/20001/third_party/WebKit/Sour... third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp:40: bool detectTextEncoding(const char* data, size_t length, Since we're adding another kind of detector, how about renaming this to "detectEncodingUniversal"?
https://codereview.chromium.org/1721373002/diff/20001/third_party/WebKit/Sour... File third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp (right): https://codereview.chromium.org/1721373002/diff/20001/third_party/WebKit/Sour... third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp:420: if (shouldAutoDetect()) { Can we use early return instead? put a return at the end of this if block, then break the multiple || and && conditions below into return statements before the final call to setEncoding(UTF8Encoding(), EncodingFromContentSniffing); https://codereview.chromium.org/1721373002/diff/20001/third_party/WebKit/Sour... File third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp (right): https://codereview.chromium.org/1721373002/diff/20001/third_party/WebKit/Sour... third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp:116: bool isUTF8Encoded(const char* data, size_t length) seems like this stuff probably belongs in wtf/, why is this file here? https://codereview.chromium.org/1721373002/diff/20001/third_party/WebKit/Sour... third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp:118: int32_t srcLen = static_cast<int32_t>(length); why int32_t? not sure why you need to cast the length and index to int32
https://codereview.chromium.org/1721373002/diff/20001/third_party/WebKit/Sour... File third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp (right): https://codereview.chromium.org/1721373002/diff/20001/third_party/WebKit/Sour... third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp:420: if (shouldAutoDetect()) { On 2016/02/26 09:37:25, esprehn wrote: > Can we use early return instead? > > put a return at the end of this if block, then break the multiple || and && > conditions below into return statements before the final call to > > setEncoding(UTF8Encoding(), EncodingFromContentSniffing); Done. (I suppose you meant 'if' not 'return' statements) https://codereview.chromium.org/1721373002/diff/20001/third_party/WebKit/Sour... File third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp (right): https://codereview.chromium.org/1721373002/diff/20001/third_party/WebKit/Sour... third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp:40: bool detectTextEncoding(const char* data, size_t length, On 2016/02/26 09:20:43, aelias wrote: > Since we're adding another kind of detector, how about renaming this to > "detectEncodingUniversal"? Done. https://codereview.chromium.org/1721373002/diff/20001/third_party/WebKit/Sour... third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp:116: bool isUTF8Encoded(const char* data, size_t length) On 2016/02/26 09:37:25, esprehn wrote: > seems like this stuff probably belongs in wtf/, why is this file here? Sounds better. Moved to WTF/text. https://codereview.chromium.org/1721373002/diff/20001/third_party/WebKit/Sour... third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp:118: int32_t srcLen = static_cast<int32_t>(length); On 2016/02/26 09:37:25, esprehn wrote: > why int32_t? not sure why you need to cast the length and index to int32 The macro U8_NEXT requires int32_t since it does comparisons against int32 internally, like https://cs.corp.google.com/clankium/src/base/strings/string_util.cc?rcl=0&l=362. Added a comment for clarification.
Description was changed from ========== UTF-8 detector for pages missing encoding info Experiment crbug.com/518968 shows that about 30% of the pages missing character encoding information are encoded in UTF-8. This CL runs a quick one-pass scan against them to check if they are encoded in UTF-8. It helps get the encoding right for 30% of the pages without having to turn on full auto-encoding detection logic which is disabled due to slow performance. BUG=583549 ========== to ========== UTF-8 detector for pages missing encoding info Experiment crbug.com/518968 shows that about 30% of the pages missing character encoding information are encoded in UTF-8. This CL runs a quick one-pass scan against them to check if they are encoded in UTF-8. It helps get the encoding right for 30% of the pages without having to turn on full auto-encoding detection logic which is disabled due to slow performance. BUG=583549 ==========
jinsukkim@chromium.org changed reviewers: + tkent@chromium.org
tkent@: Please review wtf/text.
https://codereview.chromium.org/1721373002/diff/40001/third_party/WebKit/Sour... File third_party/WebKit/Source/wtf/text/UTF8.cpp (right): https://codereview.chromium.org/1721373002/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/wtf/text/UTF8.cpp:447: // This cast is necessary because U8_NEXT uses int32_ts. is it possible that |length| is greater than 2^31? https://codereview.chromium.org/1721373002/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/wtf/text/UTF8.cpp:454: if ((uint8_t)(data[charIndex]) >= 0x80) Please do not use C-style type cast. https://codereview.chromium.org/1721373002/diff/40001/third_party/WebKit/Sour... File third_party/WebKit/Source/wtf/wtf.gypi (right): https://codereview.chromium.org/1721373002/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/wtf/wtf.gypi:235: "text/UTF8Test.cpp", Please add UTF8Test.cpp to this CL.
https://codereview.chromium.org/1721373002/diff/40001/third_party/WebKit/Sour... File third_party/WebKit/Source/wtf/text/UTF8.cpp (right): https://codereview.chromium.org/1721373002/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/wtf/text/UTF8.cpp:447: // This cast is necessary because U8_NEXT uses int32_ts. On 2016/03/02 00:55:55, tkent wrote: > is it possible that |length| is greater than 2^31? The document data (HTML, CSS, JS, etc) can be theoretically bigger than 2^31 but it is not likely to happen in real situation. The macro U8_NEXT would fail to work in the first place, so using int32_t will not cause problems of its own. https://codereview.chromium.org/1721373002/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/wtf/text/UTF8.cpp:454: if ((uint8_t)(data[charIndex]) >= 0x80) On 2016/03/02 00:55:55, tkent wrote: > Please do not use C-style type cast. Done. https://codereview.chromium.org/1721373002/diff/40001/third_party/WebKit/Sour... File third_party/WebKit/Source/wtf/wtf.gypi (right): https://codereview.chromium.org/1721373002/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/wtf/wtf.gypi:235: "text/UTF8Test.cpp", On 2016/03/02 00:55:56, tkent wrote: > Please add UTF8Test.cpp to this CL. Done.
lgtm https://codereview.chromium.org/1721373002/diff/40001/third_party/WebKit/Sour... File third_party/WebKit/Source/wtf/text/UTF8.cpp (right): https://codereview.chromium.org/1721373002/diff/40001/third_party/WebKit/Sour... third_party/WebKit/Source/wtf/text/UTF8.cpp:447: // This cast is necessary because U8_NEXT uses int32_ts. On 2016/03/02 at 01:14:31, Jinsuk wrote: > On 2016/03/02 00:55:55, tkent wrote: > > is it possible that |length| is greater than 2^31? > > The document data (HTML, CSS, JS, etc) can be theoretically bigger than 2^31 but it is not likely to happen in real situation. The macro U8_NEXT would fail to work in the first place, so using int32_t will not cause problems of its own. ok. Even if |length| is larger than the maximum value of int32_t, this function just returns false without out-of-bound access. So, no problem.
Several webkit layout tests were affected by the change. UTF-8 detector now gives better rendering results since it helps get the encoding right for those test HTMLs. Previously many characters in them were being garbled. They were not being taken care of largely due to their not being quite relevant to the purpose of the tests. They were updated accordingly in the updated patch. Note on LayoutTests/fast/encoding/css-charset-evil/*: these css files are now always identified (correctly) as UTF-8 regardless of the error in the directive |@charset "utf-8"| being tested here. Changed the encoding to a different one (iso-8859-7) so that the tests remain meaningful. But rietveld has a trouble handling non-ASCII characters in the uploaded files. crbug.com/423920
my own comments for clarification: https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/Lay... File third_party/WebKit/LayoutTests/fast/encoding/char-decoding-invalid-trail.html (left): https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/Lay... third_party/WebKit/LayoutTests/fast/encoding/char-decoding-invalid-trail.html:22: testDecode('EUC-KR', '%C7%81', 'U+FFFD'); Remove this test since %C7%81 happens to be a valid Unicode sequence (U+01C1). This CL identifies the test document as UTF-8 document and correctly decodes the character as such, not U+FFFD any more. https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/Lay... File third_party/WebKit/LayoutTests/fast/encoding/css-charset-evil/css-charset-evil-a1.css (right): https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/Lay... third_party/WebKit/LayoutTests/fast/encoding/css-charset-evil/css-charset-evil-a1.css:1: @charset "utf-8"; The error of Rietveld shows wrong result. It should reads something like: @charset "iso-8859-7" #a1\xe1\xe2 { display: inline; }
jshin@chromium.org changed reviewers: + jshin@chromium.org
Thank you for the CL. A few comments below. BTW, rietveld does not like non-UTF-8 files. (evil-css charset files in ISO-8859-7). You have to land them manually. https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/Lay... File third_party/WebKit/LayoutTests/accessibility/element-role-mapping-normal-expected.txt (right): https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/Lay... third_party/WebKit/LayoutTests/accessibility/element-role-mapping-normal-expected.txt:7: 韓國한국 This and a number of tests whose expected results are revised are very interesting. Apparently, whoever took them and added to the layout tests overlooked the fact that the original tests were served via an HTTP server that emits 'Content-Type: text/html; charset=UTF-8'. When blink layout tests are run without a web server, they're interpreted as the default encoding (of content_shell which is windows-1252/iso-8859-1). So, all the expected results have been wrong until now. I've filed https://bugs.chromium.org/p/chromium/issues/detail?id=597517 for that. I propose that you add a separate layout test to test UTF-8 encoding detection rather than relying on these layout tests (virtually all of them whose expected results are changed from gibberish to something meaningful fall into that category). https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/Lay... File third_party/WebKit/LayoutTests/fast/encoding/char-decoding-invalid-trail.html (left): https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/Lay... third_party/WebKit/LayoutTests/fast/encoding/char-decoding-invalid-trail.html:22: testDecode('EUC-KR', '%C7%81', 'U+FFFD'); On 2016/03/07 07:36:37, Jinsuk wrote: > Remove this test since %C7%81 happens to be a valid Unicode sequence (U+01C1). > This CL identifies the test document as UTF-8 document and correctly decodes the > character as such, not U+FFFD any more. Wait. This test should NOT be affected by this CL. If it is affected, there's something wrong. This test wants to make sure that '\xC7 \x81' is 'decoded' to U+FFFD when the encoding is explicitly set to EUC-KR. OTOH, this CL is supposed to affect pages WITHOUT any specified. https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/Sou... File third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp (right): https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/Sou... third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp:427: if ((m_source == DefaultEncoding || (m_source == EncodingFromParentFrame && m_hintEncoding))) { nit: The above condition is shared by shouldAutoDetect() except that shouldAutoDetect has an additional check to see if 'autodetector' is ON. How about refactoring the above condition to a helper function |shouldDetectEncoding|? Then, this function can be something like: if (shouldDetectEncoding()) { if ( .... isUTF8Encoded( ...) ) setEncoding ....; return; if (isUniveralDetectorOn()) .... call Universal encoding detector ..... https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/Sou... third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp:429: setEncoding(UTF8Encoding(), EncodingFromContentSniffing); Given that isUTF8Encoded excludes 'ASCII' (by checking the output of U8_NEXT is larger than 0x7F), you may as well consider being more aggressive and putting this *before* the Universal detector.
Another BTW, you can also mark all the tests (corresponding to revised and now correct expectation files) as 'NeedsRebaseline' in TestExpectations files instead of including them in the CL.
Can you explain how the content sniffing works in general? The server is streaming the data to the client, so you don't have the whole resource at once. How much are we scanning? When are we doing it? What's the overhead of doing this "quick scan" on a huge page like the html spec or wikipedia cats?
https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/Lay... File third_party/WebKit/LayoutTests/accessibility/element-role-mapping-normal-expected.txt (right): https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/Lay... third_party/WebKit/LayoutTests/accessibility/element-role-mapping-normal-expected.txt:7: 韓國한국 On 2016/03/24 06:15:08, jshin (jungshik at google) wrote: > This and a number of tests whose expected results are revised are very > interesting. > > Apparently, whoever took them and added to the layout tests overlooked the fact > that the original tests were served via an HTTP server that emits 'Content-Type: > text/html; charset=UTF-8'. > > When blink layout tests are run without a web server, they're interpreted as the > default encoding (of content_shell which is windows-1252/iso-8859-1). So, all > the expected results have been wrong until now. > > I've filed https://bugs.chromium.org/p/chromium/issues/detail?id=597517 for > that. > > I propose that you add a separate layout test to test UTF-8 encoding detection > rather than relying on these layout tests (virtually all of them whose expected > results are changed from gibberish to something meaningful fall into that > category). Thanks for filing the bug. Are you going search for all the files to update them with the meta tag yourself? Feel free to split the work and assign it to me. https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/Lay... File third_party/WebKit/LayoutTests/fast/encoding/char-decoding-invalid-trail.html (left): https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/Lay... third_party/WebKit/LayoutTests/fast/encoding/char-decoding-invalid-trail.html:22: testDecode('EUC-KR', '%C7%81', 'U+FFFD'); On 2016/03/24 06:15:08, jshin (jungshik at google) wrote: > On 2016/03/07 07:36:37, Jinsuk wrote: > > Remove this test since %C7%81 happens to be a valid Unicode sequence (U+01C1). > > This CL identifies the test document as UTF-8 document and correctly decodes > the > > character as such, not U+FFFD any more. > > Wait. This test should NOT be affected by this CL. If it is affected, there's > something wrong. > This test wants to make sure that '\xC7 \x81' is 'decoded' to U+FFFD when the > encoding is explicitly set to EUC-KR. > > OTOH, this CL is supposed to affect pages WITHOUT any specified. You're right - there was a bug in XMLHtttpRequest passing text encoding (EUC-KR) only, NOT the encoding source (EncodingFromHTTPheader) when creating TextResourceDecoder instance. This was putting UTF-8 encoding detector into action which override the text encoding when the detection returns true. Reverted the test file and fixed the bug by setting the encoding source as well. Please see XMLHttpRequest.cpp. https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/Sou... File third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp (right): https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/Sou... third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp:427: if ((m_source == DefaultEncoding || (m_source == EncodingFromParentFrame && m_hintEncoding))) { On 2016/03/24 06:15:08, jshin (jungshik at google) wrote: > nit: The above condition is shared by shouldAutoDetect() except that > shouldAutoDetect has an additional check to see if 'autodetector' is ON. How > about refactoring the above condition to a helper function > |shouldDetectEncoding|? Then, this function can be something like: > > if (shouldDetectEncoding()) { > if ( .... isUTF8Encoded( ...) ) > setEncoding ....; > return; > if (isUniveralDetectorOn()) > .... call Universal encoding detector > > ..... > Done. https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/Sou... third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp:429: setEncoding(UTF8Encoding(), EncodingFromContentSniffing); On 2016/03/24 06:15:08, jshin (jungshik at google) wrote: > Given that isUTF8Encoded excludes 'ASCII' (by checking the output of U8_NEXT is > larger than 0x7F), you may as well consider being more aggressive and putting > this *before* the Universal detector. > Makes sense. Done.
On 2016/03/24 06:21:01, jshin (jungshik at google) wrote: > Another BTW, you can also mark all the tests (corresponding to revised and now > correct expectation files) as 'NeedsRebaseline' in TestExpectations files > instead of including them in the CL. Good to know. Thanks. Took them all out. Will update TestExpectations once the test files get ready with the right meta tag.
On 2016/03/24 06:32:09, esprehn wrote: > Can you explain how the content sniffing works in general? The server is > streaming the data to the client, so you don't have the whole resource at once. > How much are we scanning? When are we doing it? What's the overhead of doing > this "quick scan" on a huge page like the html spec or wikipedia cats? The content sniffing is triggered in |TextResourceDecoder| for every data chunk received from network (WebURLLoaderImpl::Context::OnReceivedData) and repeated until it detects the encoding using meta tag, BOM, or from the first few chunk with which the encoding can be determined. Then the rest don't actually initiate sniffing but only does decoding job. By 'quick scan' I meant that it simply tries to decode the data assuming it is encoded in UTF-8 which is a fairly straightforward process. The test crbug.com/583549#c8 shows that it indeed is small. I may have to run microbenchmark to measure its impact more precisely though.
On 2016/03/25 02:17:14, Jinsuk wrote: > On 2016/03/24 06:21:01, jshin (jungshik at google) wrote: > > Another BTW, you can also mark all the tests (corresponding to revised and > now > > correct expectation files) as 'NeedsRebaseline' in TestExpectations files > > instead of including them in the CL. > > Good to know. Thanks. Took them all out. Will update TestExpectations once the > test files get ready with the right meta tag. ok jinsuk so your behind all of this changing an updating info an url an textresourcedecoder an data chucnkey so i would like for you to please don't let this person that is charles pleser be authorized to change any info that belong to natalie garcia or even link web pages or even have any conection to this ip address from now on. this man has mentally fucked with me an time an carma is cathing up to him. he just lost everything because of his chose an he though his baby mom natalie.garcia will never now an she will not figerit out if i change the lanugues an decoded and change is domain an spy on me an find out what i was doing so he can get a way with his lies an cheating an faking his life. well now it will fuck with him becasue he cant even believe him self. so this can make people nuts. it not cool
What's next for this patch?
On 2016/03/31 04:46:39, esprehn wrote: > What's next for this patch? Still waiting for your approval. Will handle crbug.com/597517 together to let all the layout tests pass. The next step will be crbug.com/597488.
Sorry for the late reply. Could you take care of my comments below? Besides, you need to add a layout test specifically designed to test this CL? That is, a UTF-8 file *without* charset label with an appropriate name has to be added. Thanks. https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/Sou... File third_party/WebKit/Source/wtf/text/UTF8.cpp (right): https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/Sou... third_party/WebKit/Source/wtf/text/UTF8.cpp:445: bool isUTF8Encoded(const char* data, size_t length) Without looking at the header file or the function body, it'd be all but impossible to know what this function does (i.e. returns true ONLY if |data| is UTF-8 and contains characters beyond ASCII). How about changing the name to 'isNonAsciiUTF8'? Hmm, that sounds awkward. Well, I wish I could come up with a better name. 'isUTF8andNonASCII' better? Or, how about 'isUTF8andNotASCII' ? https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/Sou... third_party/WebKit/Source/wtf/text/UTF8.cpp:450: bool markDetected = false; At first, it took me a while to figure out what this variable is for. How about |isNonAsciiFound| or just |nonAsciiFound|? Or, you can go with |isASCIIOnly| and flip the value.
Added unlabelled-non-ascii-utf8.html to test the effect of UTF-8 encoding detector. Please see if TestExpectations is set as expected. I suppose it will newly generate mac/linux/win variations of -expected.{png|txt}. https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/Sou... File third_party/WebKit/Source/wtf/text/UTF8.cpp (right): https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/Sou... third_party/WebKit/Source/wtf/text/UTF8.cpp:445: bool isUTF8Encoded(const char* data, size_t length) On 2016/04/03 00:52:04, jshin (jungshik at google) wrote: > Without looking at the header file or the function body, it'd be all but > impossible to know what this function does (i.e. returns true ONLY if |data| is > UTF-8 and contains characters beyond ASCII). How about changing the name to > 'isNonAsciiUTF8'? Hmm, that sounds awkward. Well, I wish I could come up with a > better name. 'isUTF8andNonASCII' better? Or, how about 'isUTF8andNotASCII' ? Chose isUTF8andNotASCII https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/Sou... third_party/WebKit/Source/wtf/text/UTF8.cpp:450: bool markDetected = false; On 2016/04/03 00:52:04, jshin (jungshik at google) wrote: > At first, it took me a while to figure out what this variable is for. How about > |isNonAsciiFound| or just |nonAsciiFound|? Or, you can go with |isASCIIOnly| > and flip the value. Done.
LGTM with unlabelled-non-ascii-utf8-expected.html added. Sorry again for the delay. https://codereview.chromium.org/1721373002/diff/170001/third_party/WebKit/Lay... File third_party/WebKit/LayoutTests/TestExpectations (right): https://codereview.chromium.org/1721373002/diff/170001/third_party/WebKit/Lay... third_party/WebKit/LayoutTests/TestExpectations:98: crbug.com/583549 fast/encoding/unlabelled-non-ascii-utf8.html [ NeedsRebaseline ] Instead of adding this to TestExpectations, why don't you just add an expected.html file with 'meta charset=utf-8'? that is, exactly the same file as unlabelled-non-ascii-utf8.html but with meta charset=utf-8' inserted.
Thanks for the review. https://codereview.chromium.org/1721373002/diff/170001/third_party/WebKit/Lay... File third_party/WebKit/LayoutTests/TestExpectations (right): https://codereview.chromium.org/1721373002/diff/170001/third_party/WebKit/Lay... third_party/WebKit/LayoutTests/TestExpectations:98: crbug.com/583549 fast/encoding/unlabelled-non-ascii-utf8.html [ NeedsRebaseline ] On 2016/04/08 06:56:42, jshin (jungshik at google) wrote: > Instead of adding this to TestExpectations, why don't you just add an > expected.html file with 'meta charset=utf-8'? that is, exactly the same file as > unlabelled-non-ascii-utf8.html but with meta charset=utf-8' inserted. Sounds better. Reverted the TestExpectation and added *-expected.html.
Thanks. LGTM again :-)
BTW, all those *evil*{html,css} files in ISO-8859-7 can be messed up if you land with CQ. Would you land them manually (separately from this CL) ? That is, 1. drop them from this CL 2. land this CL without them 3. land those *evil*{css,html} files manually with 'git cl land'.
Description was changed from ========== UTF-8 detector for pages missing encoding info Experiment crbug.com/518968 shows that about 30% of the pages missing character encoding information are encoded in UTF-8. This CL runs a quick one-pass scan against them to check if they are encoded in UTF-8. It helps get the encoding right for 30% of the pages without having to turn on full auto-encoding detection logic which is disabled due to slow performance. BUG=583549 ========== to ========== UTF-8 detector for pages missing encoding info Experiment crbug.com/518968 shows that about 30% of the pages missing character encoding information are encoded in UTF-8. This CL runs a quick one-pass scan against them to check if they are encoded in UTF-8. It helps get the encoding right for 30% of the pages without having to turn on full auto-encoding detection logic which is disabled due to slow performance. BUG=583549 TEST=Layout test: fast/encoding/unlabelled-non-ascii-utf8.html ==========
On 2016/04/08 17:53:40, jshin (jungshik at google) wrote: > BTW, all those *evil*{html,css} files in ISO-8859-7 can be messed up if you land > with CQ. Would you land them manually (separately from this CL) ? > That is, 1. drop them from this CL 2. land this CL without them 3. land those > *evil*{css,html} files manually with 'git cl land'. Got the point but I don't think this CL without the updated html/css will make it, since it will break the corresponding webkit layout tests. I thought I could simply land this CL manually. Would that work?
On 2016/04/08 20:02:08, Jinsuk wrote: > On 2016/04/08 17:53:40, jshin (jungshik at google) wrote: > > BTW, all those *evil*{html,css} files in ISO-8859-7 can be messed up if you > land > > with CQ. Would you land them manually (separately from this CL) ? > > That is, 1. drop them from this CL 2. land this CL without them 3. land those > > *evil*{css,html} files manually with 'git cl land'. > > Got the point but I don't think this CL without the updated html/css will make > it, since it will break the corresponding webkit layout tests. I thought I could > simply land this CL manually. Would that work? Oh I was wrong. It's okay to split them. Will do as you suggested. Thanks.
On 2016/03/31 05:12:56, Jinsuk wrote: > On 2016/03/31 04:46:39, esprehn wrote: > > What's next for this patch? > > Still waiting for your approval. Will handle crbug.com/597517 together to let > all the layout tests pass. > > The next step will be crbug.com/597488. Let me land the CL in a few days if you do not have any more feedback. Thanks.
The CQ bit was checked by jinsukkim@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from aelias@chromium.org, tkent@chromium.org, jshin@chromium.org Link to the patchset: https://codereview.chromium.org/1721373002/#ps210001 (title: "left out test files that should be landed manually")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1721373002/210001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1721373002/210001
Message was sent while issue was closed.
Description was changed from ========== UTF-8 detector for pages missing encoding info Experiment crbug.com/518968 shows that about 30% of the pages missing character encoding information are encoded in UTF-8. This CL runs a quick one-pass scan against them to check if they are encoded in UTF-8. It helps get the encoding right for 30% of the pages without having to turn on full auto-encoding detection logic which is disabled due to slow performance. BUG=583549 TEST=Layout test: fast/encoding/unlabelled-non-ascii-utf8.html ========== to ========== UTF-8 detector for pages missing encoding info Experiment crbug.com/518968 shows that about 30% of the pages missing character encoding information are encoded in UTF-8. This CL runs a quick one-pass scan against them to check if they are encoded in UTF-8. It helps get the encoding right for 30% of the pages without having to turn on full auto-encoding detection logic which is disabled due to slow performance. BUG=583549 TEST=Layout test: fast/encoding/unlabelled-non-ascii-utf8.html ==========
Message was sent while issue was closed.
Committed patchset #12 (id:210001)
Message was sent while issue was closed.
Description was changed from ========== UTF-8 detector for pages missing encoding info Experiment crbug.com/518968 shows that about 30% of the pages missing character encoding information are encoded in UTF-8. This CL runs a quick one-pass scan against them to check if they are encoded in UTF-8. It helps get the encoding right for 30% of the pages without having to turn on full auto-encoding detection logic which is disabled due to slow performance. BUG=583549 TEST=Layout test: fast/encoding/unlabelled-non-ascii-utf8.html ========== to ========== UTF-8 detector for pages missing encoding info Experiment crbug.com/518968 shows that about 30% of the pages missing character encoding information are encoded in UTF-8. This CL runs a quick one-pass scan against them to check if they are encoded in UTF-8. It helps get the encoding right for 30% of the pages without having to turn on full auto-encoding detection logic which is disabled due to slow performance. BUG=583549 TEST=Layout test: fast/encoding/unlabelled-non-ascii-utf8.html Committed: https://crrev.com/2af3917eb9ca14b263116d664a8257ae69680610 Cr-Commit-Position: refs/heads/master@{#387209} ==========
Message was sent while issue was closed.
Patchset 12 (id:??) landed as https://crrev.com/2af3917eb9ca14b263116d664a8257ae69680610 Cr-Commit-Position: refs/heads/master@{#387209}
Message was sent while issue was closed.
A revert of this CL (patchset #12 id:210001) has been created in https://codereview.chromium.org/1888083002/ by rnephew@chromium.org. The reason for reverting is: Causes jetstream perf test to fail. crbug.com/603558. |