Issue 1721373002: UTF-8 detector for pages missing encoding info

Jinsuk Kim

jinsukkim@chromium.org changed reviewers: + esprehn@chromium.org

4 years, 10 months ago (2016-02-23 10:26:00 UTC) #1

Jinsuk Kim

Initial perf results for UTF8 encoding detector on Linux/Mac are at crbug.com/583549#c6. The amount of ...

4 years, 10 months ago (2016-02-23 10:26:00 UTC) #2

aelias_OOO_until_Jul13

aelias@chromium.org changed reviewers: + aelias@chromium.org

4 years, 10 months ago (2016-02-24 04:37:52 UTC) #3

aelias_OOO_until_Jul13

aelias@chromium.org changed reviewers: + aelias@chromium.org

4 years, 10 months ago (2016-02-24 04:37:52 UTC) #4

aelias_OOO_until_Jul13

https://codereview.chromium.org/1721373002/diff/1/third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp File third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp (right): https://codereview.chromium.org/1721373002/diff/1/third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp#newcode409 third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp:409: } else if ((m_source == DefaultEncoding || (m_source == ...

4 years, 10 months ago (2016-02-24 04:37:56 UTC) #5

aelias_OOO_until_Jul13

https://codereview.chromium.org/1721373002/diff/1/third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp File third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp (right): https://codereview.chromium.org/1721373002/diff/1/third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp#newcode409 third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp:409: } else if ((m_source == DefaultEncoding || (m_source == ...

4 years, 10 months ago (2016-02-24 04:37:57 UTC) #6

Jinsuk Kim

Thanks for reviewing. https://codereview.chromium.org/1721373002/diff/1/third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp File third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp (right): https://codereview.chromium.org/1721373002/diff/1/third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp#newcode409 third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp:409: } else if ((m_source == DefaultEncoding ...

4 years, 10 months ago (2016-02-24 06:54:54 UTC) #7

aelias_OOO_until_Jul13

lgtm modulo nit, still needs OWNERS approval. https://codereview.chromium.org/1721373002/diff/20001/third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp File third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp (right): https://codereview.chromium.org/1721373002/diff/20001/third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp#newcode40 third_party/WebKit/Source/platform/text/TextEncodingDetector.cpp:40: bool detectTextEncoding(const ...

4 years, 10 months ago (2016-02-26 09:20:43 UTC) #8

esprehn

https://codereview.chromium.org/1721373002/diff/20001/third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp File third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp (right): https://codereview.chromium.org/1721373002/diff/20001/third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp#newcode420 third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp:420: if (shouldAutoDetect()) { Can we use early return instead? ...

4 years, 10 months ago (2016-02-26 09:37:25 UTC) #9

Jinsuk Kim

https://codereview.chromium.org/1721373002/diff/20001/third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp File third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp (right): https://codereview.chromium.org/1721373002/diff/20001/third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp#newcode420 third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp:420: if (shouldAutoDetect()) { On 2016/02/26 09:37:25, esprehn wrote: > ...

4 years, 9 months ago (2016-03-02 00:08:32 UTC) #10

Jinsuk Kim

Description was changed from ========== UTF-8 detector for pages missing encoding info Experiment crbug.com/518968 shows ...

4 years, 9 months ago (2016-03-02 00:13:29 UTC) #11

Jinsuk Kim

jinsukkim@chromium.org changed reviewers: + tkent@chromium.org

4 years, 9 months ago (2016-03-02 00:13:29 UTC) #12

tkent

https://codereview.chromium.org/1721373002/diff/40001/third_party/WebKit/Source/wtf/text/UTF8.cpp File third_party/WebKit/Source/wtf/text/UTF8.cpp (right): https://codereview.chromium.org/1721373002/diff/40001/third_party/WebKit/Source/wtf/text/UTF8.cpp#newcode447 third_party/WebKit/Source/wtf/text/UTF8.cpp:447: // This cast is necessary because U8_NEXT uses int32_ts. ...

4 years, 9 months ago (2016-03-02 00:55:56 UTC) #14

Jinsuk Kim

https://codereview.chromium.org/1721373002/diff/40001/third_party/WebKit/Source/wtf/text/UTF8.cpp File third_party/WebKit/Source/wtf/text/UTF8.cpp (right): https://codereview.chromium.org/1721373002/diff/40001/third_party/WebKit/Source/wtf/text/UTF8.cpp#newcode447 third_party/WebKit/Source/wtf/text/UTF8.cpp:447: // This cast is necessary because U8_NEXT uses int32_ts. ...

4 years, 9 months ago (2016-03-02 01:14:32 UTC) #15

tkent

lgtm https://codereview.chromium.org/1721373002/diff/40001/third_party/WebKit/Source/wtf/text/UTF8.cpp File third_party/WebKit/Source/wtf/text/UTF8.cpp (right): https://codereview.chromium.org/1721373002/diff/40001/third_party/WebKit/Source/wtf/text/UTF8.cpp#newcode447 third_party/WebKit/Source/wtf/text/UTF8.cpp:447: // This cast is necessary because U8_NEXT uses ...

4 years, 9 months ago (2016-03-02 01:36:39 UTC) #16

Jinsuk Kim

Several webkit layout tests were affected by the change. UTF-8 detector now gives better rendering ...

4 years, 9 months ago (2016-03-07 07:08:18 UTC) #17

Jinsuk Kim

my own comments for clarification: https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/LayoutTests/fast/encoding/char-decoding-invalid-trail.html File third_party/WebKit/LayoutTests/fast/encoding/char-decoding-invalid-trail.html (left): https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/LayoutTests/fast/encoding/char-decoding-invalid-trail.html#oldcode22 third_party/WebKit/LayoutTests/fast/encoding/char-decoding-invalid-trail.html:22: testDecode('EUC-KR', '%C7%81', 'U+FFFD'); Remove ...

4 years, 9 months ago (2016-03-07 07:36:38 UTC) #18

jungshik at Google

jshin@chromium.org changed reviewers: + jshin@chromium.org

4 years, 9 months ago (2016-03-24 06:15:06 UTC) #19

jungshik at Google


4 years, 9 months ago (2016-03-24 06:15:08 UTC) #20

Thank you for the CL. A few comments below. 

BTW, rietveld does not like non-UTF-8 files. (evil-css charset files in
ISO-8859-7). You have to land them manually.

https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/Lay...
File
third_party/WebKit/LayoutTests/accessibility/element-role-mapping-normal-expected.txt
(right):

https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/Lay...
third_party/WebKit/LayoutTests/accessibility/element-role-mapping-normal-expected.txt:7:
韓國한국
This and a number of tests whose expected results are revised are very
interesting. 

Apparently, whoever took them and added them to the layout tests overlooked the
fact that the original tests were served via an HTTP server that emits
'Content-Type: text/html; charset=UTF-8'.

When blink layout tests are run without a web server, they're interpreted in the
default encoding (of content_shell, which is windows-1252/iso-8859-1). So, all
the expected results have been wrong until now.

I've filed https://bugs.chromium.org/p/chromium/issues/detail?id=597517 for
that. 

I propose that you add a separate layout test to test UTF-8 encoding detection
rather than relying on these layout tests (virtually all of them whose expected
results are changed from gibberish to something meaningful fall into that
category).
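
To make the failure mode concrete, here is a minimal sketch (not from the CL) of what happens when UTF-8 bytes are decoded with a Latin-1 default. The helper name decodeAsLatin1ToUtf8 is hypothetical, and it assumes plain ISO-8859-1; windows-1252 maps 0x80..0x9F differently, but the result is gibberish either way.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: treat every input byte as one ISO-8859-1 code point
// and re-encode the resulting text as UTF-8. This mimics an unlabelled UTF-8
// page being decoded with a Latin-1 default encoding.
std::string decodeAsLatin1ToUtf8(const unsigned char* data, std::size_t length) {
    std::string out;
    for (std::size_t i = 0; i < length; ++i) {
        unsigned char byte = data[i];
        if (byte < 0x80) {
            // ASCII is identical in both encodings.
            out.push_back(static_cast<char>(byte));
        } else {
            // U+0080..U+00FF takes two bytes in UTF-8.
            out.push_back(static_cast<char>(0xC0 | (byte >> 6)));
            out.push_back(static_cast<char>(0x80 | (byte & 0x3F)));
        }
    }
    return out;
}
```

Feeding it the three UTF-8 bytes of 한 (ED 95 9C) yields three separate Latin-1 characters instead of one Hangul syllable, which is the kind of gibberish the old expected results recorded.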

https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/Lay...
File
third_party/WebKit/LayoutTests/fast/encoding/char-decoding-invalid-trail.html
(left):

https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/Lay...
third_party/WebKit/LayoutTests/fast/encoding/char-decoding-invalid-trail.html:22:
testDecode('EUC-KR', '%C7%81', 'U+FFFD');
On 2016/03/07 07:36:37, Jinsuk wrote:
> Remove this test since %C7%81 happens to be a valid Unicode sequence (U+01C1).
> This CL identifies the test document as UTF-8 document and correctly decodes
> the character as such, not U+FFFD any more.

Wait.  This test should NOT be affected by this CL. If it is affected, there's
something wrong. 
This test wants to make sure that '\xC7 \x81' is 'decoded' to U+FFFD when the
encoding is explicitly set to EUC-KR.   

OTOH, this CL is supposed to affect pages WITHOUT any specified encoding.
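
As a worked check of why the bytes C7 81 are ambiguous, the sketch below decodes them as a two-byte UTF-8 sequence. The function name decodeTwoByteUtf8 is hypothetical and only covers the two-byte form; it is an illustration, not code from the CL.

```cpp
#include <cstdint>

// Hypothetical helper: decode a two-byte UTF-8 sequence.
// Returns the code point, or -1 if the pair is not well-formed.
std::int32_t decodeTwoByteUtf8(unsigned char lead, unsigned char trail) {
    // Two-byte lead bytes are 0xC2..0xDF (0xC0 and 0xC1 would be overlong).
    bool leadOk = lead >= 0xC2 && lead <= 0xDF;
    // Trail bytes always have the bit pattern 10xxxxxx.
    bool trailOk = (trail & 0xC0) == 0x80;
    if (!leadOk || !trailOk)
        return -1;
    // 110yyyyy 10xxxxxx -> yyyyyxxxxxx
    return ((lead & 0x1F) << 6) | (trail & 0x3F);
}
```

With lead 0xC7 and trail 0x81 this gives (0x07 << 6) | 0x01 = 0x01C1, so the same two bytes are well-formed UTF-8 for U+01C1 while the test expects U+FFFD for them under an explicit EUC-KR label.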

https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/Sou...
File third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp (right):

https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp:427: if
((m_source == DefaultEncoding || (m_source == EncodingFromParentFrame &&
m_hintEncoding))) {
nit:   The above condition is shared by shouldAutoDetect() except that
shouldAutoDetect has an additional check to see if 'autodetector' is ON.  How
about refactoring the above condition to a helper function
|shouldDetectEncoding|?  Then, this function can be something like:

if (shouldDetectEncoding()) {
    if (  .... isUTF8Encoded( ...) ) {
        setEncoding ....;
        return;
    }
    if (isUniversalDetectorOn())
        .... call Universal encoding detector
}

.....

https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp:429:
setEncoding(UTF8Encoding(), EncodingFromContentSniffing);
Given that isUTF8Encoded excludes 'ASCII' (by checking that the output of
U8_NEXT is larger than 0x7F), you may as well consider being more aggressive and
putting this *before* the Universal detector.
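
For readers without the patch at hand, the following is a rough, dependency-free sketch of a detector with the property described above: it reports UTF-8 only when the buffer is well-formed and contains at least one non-ASCII character. The name looksLikeNonAsciiUtf8 is hypothetical. The CL itself relies on ICU's U8_NEXT, whereas this version hand-rolls the byte checks, and rejecting a sequence truncated at the end of the buffer is an assumption of the sketch.

```cpp
#include <cstddef>

// Hypothetical stand-in for a check like isUTF8Encoded(): true only if the
// buffer is well-formed UTF-8 and has at least one character above 0x7F.
bool looksLikeNonAsciiUtf8(const unsigned char* data, std::size_t length) {
    bool sawNonAscii = false;
    std::size_t i = 0;
    while (i < length) {
        unsigned char lead = data[i];
        if (lead < 0x80) {  // ASCII proves nothing either way.
            ++i;
            continue;
        }
        std::size_t trailCount = 0;
        unsigned char minSecond = 0x80;
        unsigned char maxSecond = 0xBF;
        if (lead >= 0xC2 && lead <= 0xDF) {
            trailCount = 1;
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            trailCount = 2;
            if (lead == 0xE0) minSecond = 0xA0;  // reject overlong forms
            if (lead == 0xED) maxSecond = 0x9F;  // reject surrogates
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            trailCount = 3;
            if (lead == 0xF0) minSecond = 0x90;  // reject overlong forms
            if (lead == 0xF4) maxSecond = 0x8F;  // stay within U+10FFFF
        } else {
            return false;  // stray trail byte or invalid lead byte
        }
        if (i + trailCount >= length)
            return false;  // sequence truncated at end of buffer
        unsigned char second = data[i + 1];
        if (second < minSecond || second > maxSecond)
            return false;
        for (std::size_t k = 2; k <= trailCount; ++k) {
            if ((data[i + k] & 0xC0) != 0x80)
                return false;
        }
        sawNonAscii = true;
        i += trailCount + 1;
    }
    return sawNonAscii;
}
```

Because a pure-ASCII buffer returns false, running a check like this ahead of a more general detector cannot change how ASCII-only pages are handled.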

jungshik at Google

Another BTW, you can also mark all the tests (corresponding to revised and now correct ...

4 years, 9 months ago (2016-03-24 06:21:01 UTC) #21

esprehn

Can you explain how the content sniffing works in general? The server is streaming the ...

4 years, 9 months ago (2016-03-24 06:32:09 UTC) #22

Jinsuk Kim

https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/LayoutTests/accessibility/element-role-mapping-normal-expected.txt File third_party/WebKit/LayoutTests/accessibility/element-role-mapping-normal-expected.txt (right): https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/LayoutTests/accessibility/element-role-mapping-normal-expected.txt#newcode7 third_party/WebKit/LayoutTests/accessibility/element-role-mapping-normal-expected.txt:7: 韓國한국 On 2016/03/24 06:15:08, jshin (jungshik at google) wrote: ...

4 years, 9 months ago (2016-03-25 02:15:43 UTC) #23

https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/Lay...
File
third_party/WebKit/LayoutTests/accessibility/element-role-mapping-normal-expected.txt
(right):

https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/Lay...
third_party/WebKit/LayoutTests/accessibility/element-role-mapping-normal-expected.txt:7:
韓國한국
On 2016/03/24 06:15:08, jshin (jungshik at google) wrote:
> This and a number of tests whose expected results are revised are very
> interesting.
>
> Apparently, whoever took them and added to the layout tests overlooked the
> fact that the original tests were served via an HTTP server that emits
> 'Content-Type: text/html; charset=UTF-8'.
>
> When blink layout tests are run without a web server, they're interpreted as
> the default encoding (of content_shell which is windows-1252/iso-8859-1). So,
> all the expected results have been wrong until now.
>
> I've filed https://bugs.chromium.org/p/chromium/issues/detail?id=597517 for
> that.
>
> I propose that you add a separate layout test to test UTF-8 encoding detection
> rather than relying on these layout tests (virtually all of them whose
> expected results are changed from gibberish to something meaningful fall into
> that category).

Thanks for filing the bug. Are you going to search for all the files and update
them with the meta tag yourself? Feel free to split the work and assign it to me.

https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/Lay...
File
third_party/WebKit/LayoutTests/fast/encoding/char-decoding-invalid-trail.html
(left):

https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/Lay...
third_party/WebKit/LayoutTests/fast/encoding/char-decoding-invalid-trail.html:22:
testDecode('EUC-KR', '%C7%81', 'U+FFFD');
On 2016/03/24 06:15:08, jshin (jungshik at google) wrote:
> On 2016/03/07 07:36:37, Jinsuk wrote:
> > Remove this test since %C7%81 happens to be a valid Unicode sequence
> > (U+01C1).
> > This CL identifies the test document as UTF-8 document and correctly decodes
> > the character as such, not U+FFFD any more.
> 
> Wait.  This test should NOT be affected by this CL. If it is affected, there's
> something wrong. 
> This test wants to make sure that '\xC7 \x81' is 'decoded' to U+FFFD when the
> encoding is explicitly set to EUC-KR.   
> 
> OTOH, this CL is supposed to affect pages WITHOUT any specified. 

You're right - there was a bug in XMLHttpRequest: it passed only the text
encoding (EUC-KR), NOT the encoding source (EncodingFromHTTPHeader), when
creating the TextResourceDecoder instance. This put the UTF-8 encoding detector
into action, which overrode the text encoding when detection returned true.
Reverted the test file and fixed the bug by setting the encoding source as well.
Please see XMLHttpRequest.cpp.
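
A simplified sketch of why the missing encoding source mattered. The enum value names follow the ones mentioned in this thread (with EncodingFromHTTPHeader spelled as I believe Blink spells it), but DecoderSketch and its members are hypothetical stand-ins, not the real TextResourceDecoder.

```cpp
#include <string>
#include <utility>

// Subset of encoding sources named in this review; order is illustrative.
enum EncodingSource {
    DefaultEncoding,
    EncodingFromParentFrame,
    EncodingFromContentSniffing,
    EncodingFromHTTPHeader
};

// Hypothetical, heavily reduced model of a text decoder.
struct DecoderSketch {
    std::string encoding;
    EncodingSource source;
    bool hasHintEncoding;

    DecoderSketch(std::string enc, EncodingSource src, bool hint = false)
        : encoding(std::move(enc)), source(src), hasHintEncoding(hint) {}

    // Detection may only override an encoding nobody explicitly asked for.
    bool shouldDetectEncoding() const {
        return source == DefaultEncoding ||
               (source == EncodingFromParentFrame && hasHintEncoding);
    }

    // Called when the content was found to be non-ASCII, well-formed UTF-8.
    void onUtf8Detected() {
        if (!shouldDetectEncoding())
            return;  // an explicit label wins over sniffing
        encoding = "UTF-8";
        source = EncodingFromContentSniffing;
    }
};
```

A decoder created as DecoderSketch("EUC-KR", DefaultEncoding) gets switched to UTF-8 by onUtf8Detected(), which mirrors the bug; one created with EncodingFromHTTPHeader keeps EUC-KR, which mirrors the fix.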

https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/Sou...
File third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp (right):

https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp:427: if
((m_source == DefaultEncoding || (m_source == EncodingFromParentFrame &&
m_hintEncoding))) {
On 2016/03/24 06:15:08, jshin (jungshik at google) wrote:
> nit:   The above condition is shared by shouldAutoDetect() except that
> shouldAutoDetect has an additional check to see if 'autodetector' is ON.  How
> about refactoring the above condition to a helper function
> |shouldDetectEncoding|?  Then, this function can be something like:
> 
> if (shouldDetectEncoding()) {
>     if (  .... isUTF8Encoded( ...) )  
>        setEncoding ....; 
>        return;
>     if (isUniversalDetectorOn())
>          .... call Universal encoding detector 
> 
> .....
> 

Done.

https://codereview.chromium.org/1721373002/diff/100001/third_party/WebKit/Sou...
third_party/WebKit/Source/core/html/parser/TextResourceDecoder.cpp:429:
setEncoding(UTF8Encoding(), EncodingFromContentSniffing);
On 2016/03/24 06:15:08, jshin (jungshik at google) wrote:
> Given that isUTF8Encoded excludes 'ASCII' (by checking the output of U8_NEXT
> is larger than 0x7F), you may as well consider being more aggressive and
> putting this *before* the Universal detector.
> 

Makes sense. Done.

Jinsuk Kim

On 2016/03/24 06:21:01, jshin (jungshik at google) wrote: > Another BTW, you can also mark ...

4 years, 9 months ago (2016-03-25 02:17:14 UTC) #24

Jinsuk Kim

On 2016/03/24 06:32:09, esprehn wrote: > Can you explain how the content sniffing works in ...

4 years, 9 months ago (2016-03-25 02:18:03 UTC) #25

natgar1108

On 2016/03/25 02:17:14, Jinsuk wrote: > On 2016/03/24 06:21:01, jshin (jungshik at google) wrote: > ...

4 years, 8 months ago (2016-03-28 02:40:20 UTC) #26

Jinsuk Kim

On 2016/03/31 04:46:39, esprehn wrote: > What's next for this patch? Still waiting for your ...

4 years, 8 months ago (2016-03-31 05:12:56 UTC) #28

jungshik at Google

Sorry for the late reply. Could you take care of my comments below? Besides, you ...

4 years, 8 months ago (2016-04-03 00:52:04 UTC) #29

Jinsuk Kim

Added unlabelled-non-ascii-utf8.html to test the effect of UTF-8 encoding detector. Please see if TestExpectations is ...

4 years, 8 months ago (2016-04-06 04:34:11 UTC) #30

jungshik at Google

LGTM with unlabelled-non-ascii-utf8-expected.html added. Sorry again for the delay. https://codereview.chromium.org/1721373002/diff/170001/third_party/WebKit/LayoutTests/TestExpectations File third_party/WebKit/LayoutTests/TestExpectations (right): https://codereview.chromium.org/1721373002/diff/170001/third_party/WebKit/LayoutTests/TestExpectations#newcode98 third_party/WebKit/LayoutTests/TestExpectations:98: ...

4 years, 8 months ago (2016-04-08 06:56:43 UTC) #31

Jinsuk Kim

Thanks for the review. https://codereview.chromium.org/1721373002/diff/170001/third_party/WebKit/LayoutTests/TestExpectations File third_party/WebKit/LayoutTests/TestExpectations (right): https://codereview.chromium.org/1721373002/diff/170001/third_party/WebKit/LayoutTests/TestExpectations#newcode98 third_party/WebKit/LayoutTests/TestExpectations:98: crbug.com/583549 fast/encoding/unlabelled-non-ascii-utf8.html [ NeedsRebaseline ] ...

4 years, 8 months ago (2016-04-08 12:12:10 UTC) #32

jungshik at Google

BTW, all those *evil*{html,css} files in ISO-8859-7 can be messed up if you land with ...

4 years, 8 months ago (2016-04-08 17:53:40 UTC) #34

jungshik at Google

Description was changed from ========== UTF-8 detector for pages missing encoding info Experiment crbug.com/518968 shows ...

4 years, 8 months ago (2016-04-08 17:54:28 UTC) #35

Jinsuk Kim

On 2016/04/08 17:53:40, jshin (jungshik at google) wrote: > BTW, all those *evil*{html,css} files in ...

4 years, 8 months ago (2016-04-08 20:02:08 UTC) #36

Jinsuk Kim

On 2016/04/08 20:02:08, Jinsuk wrote: > On 2016/04/08 17:53:40, jshin (jungshik at google) wrote: > ...

4 years, 8 months ago (2016-04-08 20:19:35 UTC) #37

Jinsuk Kim

On 2016/03/31 05:12:56, Jinsuk wrote: > On 2016/03/31 04:46:39, esprehn wrote: > > What's next ...

4 years, 8 months ago (2016-04-12 00:29:46 UTC) #38

Jinsuk Kim

The patchset sent to the CQ was uploaded after l-g-t-m from aelias@chromium.org, tkent@chromium.org, jshin@chromium.org Link ...

4 years, 8 months ago (2016-04-14 02:07:10 UTC) #40

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1721373002/210001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1721373002/210001

4 years, 8 months ago (2016-04-14 02:07:35 UTC) #41

commit-bot: I haz the power

Description was changed from ========== UTF-8 detector for pages missing encoding info Experiment crbug.com/518968 shows ...

4 years, 8 months ago (2016-04-14 02:24:58 UTC) #42

commit-bot: I haz the power

Description was changed from ========== UTF-8 detector for pages missing encoding info Experiment crbug.com/518968 shows ...

4 years, 8 months ago (2016-04-14 02:26:54 UTC) #44

commit-bot: I haz the power

Patchset 12 (id:??) landed as https://crrev.com/2af3917eb9ca14b263116d664a8257ae69680610 Cr-Commit-Position: refs/heads/master@{#387209}

4 years, 8 months ago (2016-04-14 02:26:55 UTC) #45

rnephew (Reviews Here)

4 years, 8 months ago (2016-04-14 16:15:13 UTC) #46

Message was sent while issue was closed.

A revert of this CL (patchset #12 id:210001) has been created in
https://codereview.chromium.org/1888083002/ by rnephew@chromium.org.

The reason for reverting is: Causes jetstream perf test to fail.
crbug.com/603558.

Issue 1721373002: UTF-8 detector for pages missing encoding info (Closed)

Description

Patch Set 1

Patch Set 2

Patch Set 3 : addressed comments

Patch Set 4 : addressed comments (wtf)

Patch Set 5 : updated webkit layout tests accordingly

Patch Set 6

Patch Set 7 : rebased & removed test files to be rebaselined

Patch Set 8 : addressed comments

Patch Set 9 : rebased

Patch Set 10 : a new layout test file for testing UTF8 encoding detection

Patch Set 11 : turned the new layout test to reference test

Patch Set 12 : left out test files that should be landed manually

Messages