Chromium Code Reviews

Issue 1721373002: UTF-8 detector for pages missing encoding info (Closed)

Created: 4 years, 10 months ago by Jinsuk Kim
Modified: 4 years, 8 months ago
CC: aelias_OOO_until_Jul13, blink-reviews, blink-reviews-html_chromium.org, chromium-reviews, dglazkov+blink, kinuko+watch
Base URL: https://chromium.googlesource.com/chromium/src.git@master
Target Ref: refs/pending/heads/master
Project: chromium
Visibility: Public.

Description

UTF-8 detector for pages missing encoding info

Experiment crbug.com/518968 shows that about 30% of the pages missing character encoding information are encoded in UTF-8. This CL runs a quick one-pass scan against them to check if they are encoded in UTF-8. It helps get the encoding right for 30% of the pages without having to turn on full auto-encoding detection logic which is disabled due to slow performance.

BUG=583549
TEST=Layout test: fast/encoding/unlabelled-non-ascii-utf8.html

Committed: https://crrev.com/2af3917eb9ca14b263116d664a8257ae69680610
Cr-Commit-Position: refs/heads/master@{#387209}
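For readers unfamiliar with what a "quick one-pass scan" for UTF-8 can look like, here is a minimal standalone sketch. It is not the code from this CL: the function name looksLikeNonAsciiUtf8 is hypothetical, the sketch avoids ICU (the review thread mentions U8_NEXT in UTF8.cpp), and the choice to report pure-ASCII input as "not detected" is an assumption made for illustration, on the grounds that ASCII bytes decode the same way under most legacy default encodings.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (an illustration, not the function added by the patch).
// Returns true only if the buffer is well-formed UTF-8 and contains at least
// one multi-byte sequence. The buffer is scanned once, front to back.
bool looksLikeNonAsciiUtf8(const unsigned char* data, std::size_t length) {
    assert(data != nullptr || length == 0);
    bool sawMultiByte = false;
    std::size_t i = 0;
    while (i < length) {
        const unsigned char lead = data[i];
        if (lead < 0x80) {  // Plain ASCII byte.
            ++i;
            continue;
        }
        std::size_t trailCount = 0;
        std::uint32_t codePoint = 0;
        std::uint32_t minValue = 0;  // Smallest code point allowed for this length.
        if ((lead & 0xE0) == 0xC0) {
            trailCount = 1; codePoint = lead & 0x1F; minValue = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            trailCount = 2; codePoint = lead & 0x0F; minValue = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {
            trailCount = 3; codePoint = lead & 0x07; minValue = 0x10000;
        } else {
            return false;  // Stray continuation byte or invalid lead byte.
        }
        if (length - i <= trailCount)
            return false;  // Sequence is cut off at the end of the buffer.
        for (std::size_t k = 1; k <= trailCount; ++k) {
            const unsigned char trail = data[i + k];
            if ((trail & 0xC0) != 0x80)
                return false;  // Not a continuation byte.
            codePoint = (codePoint << 6) | (trail & 0x3F);
        }
        if (codePoint < minValue)
            return false;  // Overlong encoding.
        if (codePoint > 0x10FFFF)
            return false;  // Beyond the Unicode range.
        if (codePoint >= 0xD800 && codePoint <= 0xDFFF)
            return false;  // UTF-16 surrogates are not valid in UTF-8.
        sawMultiByte = true;
        i += trailCount + 1;
    }
    return sawMultiByte;
}
```

Note that this sketch treats a sequence truncated at the end of the buffer as invalid. A decoder that receives data in chunks would need to decide how to handle a multi-byte character split across a chunk boundary; that policy is outside the scope of this illustration.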

Patch Set 1
Total comments: 4

Patch Set 2
Total comments: 8

Patch Set 3: addressed comments
Total comments: 7

Patch Set 4: addressed comments (wtf)

Patch Set 5: updated webkit layout tests accordingly

Patch Set 6
Total comments: 14

Patch Set 7: rebased & removed test files to be rebaselined

Patch Set 8: addressed comments

Patch Set 9: rebased

Patch Set 10: a new layout test file for testing UTF8 encoding detection
Total comments: 2

Patch Set 11: turned the new layout test to reference test

Patch Set 12: left out test files that should be landed manually

Stats: +163 lines, -60 lines

A + third_party/WebKit/LayoutTests/fast/encoding/unlabelled-non-ascii-utf8.html: 2 chunks, +10 lines, -11 lines
A + third_party/WebKit/LayoutTests/fast/encoding/unlabelled-non-ascii-utf8-expected.html: 2 chunks, +11 lines, -11 lines
M third_party/WebKit/Source/core/html/parser/TextResourceDecoder.h: 2 chunks, +9 lines, -6 lines
M third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp: 4 chunks, +36 lines, -26 lines
M third_party/WebKit/Source/core/xmlhttprequest/XMLHttpRequest.cpp: 1 chunk, +5 lines, -2 lines
M third_party/WebKit/Source/platform/text/TextEncodingDetector.h: 1 chunk, +3 lines, -3 lines
M third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp: 1 chunk, +1 line, -1 line
M third_party/WebKit/Source/wtf/text/UTF8.h: 1 chunk, +6 lines, -0 lines
M third_party/WebKit/Source/wtf/text/UTF8.cpp: 1 chunk, +18 lines, -0 lines
A third_party/WebKit/Source/wtf/text/UTF8Test.cpp: 1 chunk, +63 lines, -0 lines
M third_party/WebKit/Source/wtf/wtf.gypi: 1 chunk, +1 line, -0 lines

Messages

Total messages: 46 (11 generated)
Jinsuk Kim
Initial perf results for UTF8 encoding detector on Linux/Mac are at crbug.com/583549#c6. The amount of ...
4 years, 10 months ago (2016-02-23 10:26:00 UTC) #2
aelias_OOO_until_Jul13
https://codereview.chromium.org/1721373002/diff/1/third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp File third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp (right): https://codereview.chromium.org/1721373002/diff/1/third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp#newcode409 third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp:409: } else if ((m_source == DefaultEncoding || (m_source == ...
4 years, 10 months ago (2016-02-24 04:37:56 UTC) #5
aelias_OOO_until_Jul13
https://codereview.chromium.org/1721373002/diff/1/third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp File third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp (right): https://codereview.chromium.org/1721373002/diff/1/third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp#newcode409 third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp:409: } else if ((m_source == DefaultEncoding || (m_source == ...
4 years, 10 months ago (2016-02-24 04:37:57 UTC) #6
Jinsuk Kim
Thanks for reviewing. https://codereview.chromium.org/1721373002/diff/1/third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp File third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp (right): https://codereview.chromium.org/1721373002/diff/1/third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp#newcode409 third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp:409: } else if ((m_source == DefaultEncoding ...
4 years, 10 months ago (2016-02-24 06:54:54 UTC) #7
aelias_OOO_until_Jul13
lgtm modulo nit, still needs OWNERS approval. https://codereview.chromium.org/1721373002/diff/20001/third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp File third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp (right): https://codereview.chromium.org/1721373002/diff/20001/third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp#newcode40 third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp:40: bool detectTextEncoding(const ...
4 years, 10 months ago (2016-02-26 09:20:43 UTC) #8
esprehn
https://codereview.chromium.org/1721373002/diff/20001/third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp File third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp (right): https://codereview.chromium.org/1721373002/diff/20001/third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp#newcode420 third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp:420: if (shouldAutoDetect()) { Can we use early return instead? ...
4 years, 10 months ago (2016-02-26 09:37:25 UTC) #9
Jinsuk Kim
https://codereview.chromium.org/1721373002/diff/20001/third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp File third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp (right): https://codereview.chromium.org/1721373002/diff/20001/third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp#newcode420 third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp:420: if (shouldAutoDetect()) { On 2016/02/26 09:37:25, esprehn wrote: > ...
4 years, 9 months ago (2016-03-02 00:08:32 UTC) #10
Jinsuk Kim
tkent@: Please review wtf/text.
4 years, 9 months ago (2016-03-02 00:16:25 UTC) #13
tkent
https://codereview.chromium.org/1721373002/diff/40001/third_party/WebKit/Source/wtf/text/UTF8.cpp File third_party/WebKit/Source/wtf/text/UTF8.cpp (right): https://codereview.chromium.org/1721373002/diff/40001/third_party/WebKit/Source/wtf/text/UTF8.cpp#newcode447 third_party/WebKit/Source/wtf/text/UTF8.cpp:447: // This cast is necessary because U8_NEXT uses int32_ts. ...
4 years, 9 months ago (2016-03-02 00:55:56 UTC) #14
Jinsuk Kim
https://codereview.chromium.org/1721373002/diff/40001/third_party/WebKit/Source/wtf/text/UTF8.cpp File third_party/WebKit/Source/wtf/text/UTF8.cpp (right): https://codereview.chromium.org/1721373002/diff/40001/third_party/WebKit/Source/wtf/text/UTF8.cpp#newcode447 third_party/WebKit/Source/wtf/text/UTF8.cpp:447: // This cast is necessary because U8_NEXT uses int32_ts. ...
4 years, 9 months ago (2016-03-02 01:14:32 UTC) #15
tkent
lgtm https://codereview.chromium.org/1721373002/diff/40001/third_party/WebKit/Source/wtf/text/UTF8.cpp File third_party/WebKit/Source/wtf/text/UTF8.cpp (right): https://codereview.chromium.org/1721373002/diff/40001/third_party/WebKit/Source/wtf/text/UTF8.cpp#newcode447 third_party/WebKit/Source/wtf/text/UTF8.cpp:447: // This cast is necessary because U8_NEXT uses ...
4 years, 9 months ago (2016-03-02 01:36:39 UTC) #16
Jinsuk Kim
Several webkit layout tests were affected by the change. UTF-8 detector now gives better rendering ...
4 years, 9 months ago (2016-03-07 07:08:18 UTC) #17
Jinsuk Kim
my own comments for clarification: https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/LayoutTests/fast/encoding/char-decoding-invalid-trail.html File third_party/WebKit/LayoutTests/fast/encoding/char-decoding-invalid-trail.html (left): https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/LayoutTests/fast/encoding/char-decoding-invalid-trail.html#oldcode22 third_party/WebKit/LayoutTests/fast/encoding/char-decoding-invalid-trail.html:22: testDecode('EUC-KR', '%C7%81', 'U+FFFD'); Remove ...
4 years, 9 months ago (2016-03-07 07:36:38 UTC) #18
jungshik at Google
Thank you for the CL. A few comments below. BTW, rietveld does not like non-UTF-8 ...
4 years, 9 months ago (2016-03-24 06:15:08 UTC) #20
jungshik at Google
Another BTW, you can also mark all the tests (corresponding to revised and now correct ...
4 years, 9 months ago (2016-03-24 06:21:01 UTC) #21
esprehn
Can you explain how the content sniffing works in general? The server is streaming the ...
4 years, 9 months ago (2016-03-24 06:32:09 UTC) #22
Jinsuk Kim
https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/LayoutTests/accessibility/element-role-mapping-normal-expected.txt File third_party/WebKit/LayoutTests/accessibility/element-role-mapping-normal-expected.txt (right): https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/LayoutTests/accessibility/element-role-mapping-normal-expected.txt#newcode7 third_party/WebKit/LayoutTests/accessibility/element-role-mapping-normal-expected.txt:7: 韓國한국 On 2016/03/24 06:15:08, jshin (jungshik at google) wrote: ...
4 years, 9 months ago (2016-03-25 02:15:43 UTC) #23
Jinsuk Kim
On 2016/03/24 06:21:01, jshin (jungshik at google) wrote: > Another BTW, you can also mark ...
4 years, 9 months ago (2016-03-25 02:17:14 UTC) #24
Jinsuk Kim
On 2016/03/24 06:32:09, esprehn wrote: > Can you explain how the content sniffing works in ...
4 years, 9 months ago (2016-03-25 02:18:03 UTC) #25
natgar1108
On 2016/03/25 02:17:14, Jinsuk wrote: > On 2016/03/24 06:21:01, jshin (jungshik at google) wrote: > ...
4 years, 8 months ago (2016-03-28 02:40:20 UTC) #26
esprehn
What's next for this patch?
4 years, 8 months ago (2016-03-31 04:46:39 UTC) #27
Jinsuk Kim
On 2016/03/31 04:46:39, esprehn wrote: > What's next for this patch? Still waiting for your ...
4 years, 8 months ago (2016-03-31 05:12:56 UTC) #28
jungshik at Google
Sorry for the late reply. Could you take care of my comments below? Besides, you ...
4 years, 8 months ago (2016-04-03 00:52:04 UTC) #29
Jinsuk Kim
Added unlabelled-non-ascii-utf8.html to test the effect of UTF-8 encoding detector. Please see if TestExpectations is ...
4 years, 8 months ago (2016-04-06 04:34:11 UTC) #30
jungshik at Google
LGTM with unlabelled-non-ascii-utf8-expected.html added. Sorry again for the delay. https://codereview.chromium.org/1721373002/diff/170001/third_party/WebKit/LayoutTests/TestExpectations File third_party/WebKit/LayoutTests/TestExpectations (right): https://codereview.chromium.org/1721373002/diff/170001/third_party/WebKit/LayoutTests/TestExpectations#newcode98 third_party/WebKit/LayoutTests/TestExpectations:98: ...
4 years, 8 months ago (2016-04-08 06:56:43 UTC) #31
Jinsuk Kim
Thanks for the review. https://codereview.chromium.org/1721373002/diff/170001/third_party/WebKit/LayoutTests/TestExpectations File third_party/WebKit/LayoutTests/TestExpectations (right): https://codereview.chromium.org/1721373002/diff/170001/third_party/WebKit/LayoutTests/TestExpectations#newcode98 third_party/WebKit/LayoutTests/TestExpectations:98: crbug.com/583549 fast/encoding/unlabelled-non-ascii-utf8.html [ NeedsRebaseline ] ...
4 years, 8 months ago (2016-04-08 12:12:10 UTC) #32
jungshik at Google
Thanks. LGTM again :-)
4 years, 8 months ago (2016-04-08 17:51:02 UTC) #33
jungshik at Google
BTW, all those *evil*{html,css} files in ISO-8859-7 can be messed up if you land with ...
4 years, 8 months ago (2016-04-08 17:53:40 UTC) #34
Jinsuk Kim
On 2016/04/08 17:53:40, jshin (jungshik at google) wrote: > BTW, all those *evil*{html,css} files in ...
4 years, 8 months ago (2016-04-08 20:02:08 UTC) #36
Jinsuk Kim
On 2016/04/08 20:02:08, Jinsuk wrote: > On 2016/04/08 17:53:40, jshin (jungshik at google) wrote: > ...
4 years, 8 months ago (2016-04-08 20:19:35 UTC) #37
Jinsuk Kim
On 2016/03/31 05:12:56, Jinsuk wrote: > On 2016/03/31 04:46:39, esprehn wrote: > > What's next ...
4 years, 8 months ago (2016-04-12 00:29:46 UTC) #38
commit-bot: I haz the power
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1721373002/210001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1721373002/210001
4 years, 8 months ago (2016-04-14 02:07:35 UTC) #41
commit-bot: I haz the power
Committed patchset #12 (id:210001)
4 years, 8 months ago (2016-04-14 02:25:00 UTC) #43
commit-bot: I haz the power
Patchset 12 (id:??) landed as https://crrev.com/2af3917eb9ca14b263116d664a8257ae69680610 Cr-Commit-Position: refs/heads/master@{#387209}
4 years, 8 months ago (2016-04-14 02:26:55 UTC) #45
rnephew (Reviews Here)
4 years, 8 months ago (2016-04-14 16:15:13 UTC) #46
Message was sent while issue was closed.
A revert of this CL (patchset #12 id:210001) has been created in
https://codereview.chromium.org/1888083002/ by rnephew@chromium.org.

The reason for reverting is: Causes jetstream perf test to fail.
crbug.com/603558.
