Chromium Code Reviews
Description
Reland "UTF-8 detector for pages missing encoding info"
TextResourceDecoder is designed (or used) in such a way that the text
encoding of a document is resolved from the first chunk (up to
4096 bytes) of text received from the network - by BOM, meta tag, or
automatic encoding detection (if enabled).
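As a rough illustration of that resolution order, here is a minimal sketch with made-up names (ResolveEncoding, kFirstChunkLimit, and the placeholder detectors are assumptions for illustration, not Blink's actual TextResourceDecoder API): the encoding is chosen once, from the head of the stream, and is not revisited afterwards.

// A minimal sketch, assuming hypothetical names; not Blink's real API.
#include <cstddef>
#include <iostream>
#include <string>

constexpr size_t kFirstChunkLimit = 4096;  // First-chunk limit described above.

// BOM sniffing on the head of the stream.
std::string DetectFromBom(const std::string& head) {
  if (head.rfind("\xEF\xBB\xBF", 0) == 0) return "UTF-8";
  if (head.rfind("\xFF\xFE", 0) == 0) return "UTF-16LE";
  if (head.rfind("\xFE\xFF", 0) == 0) return "UTF-16BE";
  return "";
}

// Placeholders: real <meta charset> pre-scanning and auto-detection are far
// more involved; here they simply report "no result".
std::string DetectFromMetaCharset(const std::string&) { return ""; }
std::string AutoDetect(const std::string&) { return ""; }

std::string ResolveEncoding(const std::string& first_chunk,
                            bool auto_detection_enabled,
                            const std::string& default_encoding) {
  const std::string head = first_chunk.substr(0, kFirstChunkLimit);
  std::string result = DetectFromBom(head);
  if (result.empty()) result = DetectFromMetaCharset(head);
  if (result.empty() && auto_detection_enabled) result = AutoDetect(head);
  return result.empty() ? default_encoding : result;
}

int main() {
  // A UTF-8 BOM wins immediately; otherwise the default codec is kept.
  std::cout << ResolveEncoding("\xEF\xBB\xBFhello", false, "windows-1252")
            << "\n";  // Prints "UTF-8".
}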
The newly introduced UTF-8 encoding detector (crrev.com/1721373002) was
reverted (crbug.com/603558) because it attempted to work in a somewhat
different way - it also examined all subsequent chunks in search of a
non-ASCII, UTF-8-encoded character sequence. This meant it was possible
for TextResourceDecoder to start with a codec for, say, windows-1252,
and then switch to one for UTF-8 later on. In theory this should still
work, but it does not in practice (perhaps it has never been used or
tested that way). This is what happened with the failed perf tests - one
of the JS files was big (13K) and pure ASCII except for one tiny
character sequence, \xc2\xa7, almost at the end.
The CL was updated so that UTF-8 encoding detection also works against
the first chunk only, like the other methods, to avoid a potential
codec switch in the middle of the document.
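Below is a minimal, self-contained sketch of the first-chunk-only approach, with made-up names (FirstChunkLooksLikeUtf8 and the windows-1252 fallback are assumptions for illustration, not the code landed in this CL): the decoder commits to UTF-8 only if the first chunk contains at least one non-ASCII byte and every non-ASCII sequence in it is well-formed UTF-8.

// A minimal sketch, assuming hypothetical names; not the CL's actual code.
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <string>

// Simplified validity check: it accepts a few technically invalid forms
// (overlong or out-of-range sequences, surrogates) that a production
// detector would reject.
bool FirstChunkLooksLikeUtf8(const uint8_t* data, size_t length) {
  const size_t limit = length < 4096 ? length : 4096;  // First chunk only.
  bool saw_non_ascii = false;
  size_t i = 0;
  while (i < limit) {
    const uint8_t byte = data[i];
    if (byte < 0x80) {  // ASCII: no evidence either way.
      ++i;
      continue;
    }
    saw_non_ascii = true;
    size_t extra;  // Number of continuation bytes expected.
    if ((byte & 0xE0) == 0xC0 && byte >= 0xC2) {
      extra = 1;  // 2-byte sequence (0xC0/0xC1 would be overlong).
    } else if ((byte & 0xF0) == 0xE0) {
      extra = 2;  // 3-byte sequence.
    } else if ((byte & 0xF8) == 0xF0 && byte <= 0xF4) {
      extra = 3;  // 4-byte sequence.
    } else {
      return false;  // Invalid lead byte.
    }
    if (i + extra >= limit) {
      break;  // Sequence cut off by the chunk boundary; ignore the tail.
    }
    for (size_t j = 1; j <= extra; ++j) {
      if ((data[i + j] & 0xC0) != 0x80)
        return false;  // Expected a continuation byte.
    }
    i += extra + 1;
  }
  return saw_non_ascii;
}

int main() {
  // A first chunk that is pure ASCII except for one UTF-8 sequence (\xc2\xa7).
  const std::string chunk = "var section = \"\xC2\xA7 7\"; // mostly ASCII";
  const bool utf8 = FirstChunkLooksLikeUtf8(
      reinterpret_cast<const uint8_t*>(chunk.data()), chunk.size());
  // "windows-1252" stands in for whatever the default codec would have been.
  std::cout << "Decoding with " << (utf8 ? "UTF-8" : "windows-1252") << "\n";
}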
BUG=583549, 603558
Committed: https://crrev.com/57139d64c5b98142ca9305792f39ae23a4950375
Cr-Commit-Position: refs/heads/master@{#388927}
Patch Set 1
Patch Set 2: add tests
Total comments: 4
Patch Set 3
Messages
Total messages: 18 (6 generated)