Chromium Code Reviews
Description
Reland "UTF-8 detector for pages missing encoding info"
TextResourceDecoder is designed (or used) in such a way that the text
encoding of a document is resolved from the first chunk (up to
4096 bytes) of text received from the network - by BOM, meta tag, or
automatic encoding detection (if enabled).
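As a rough illustration of that resolution order, here is a minimal sketch with made-up names (ResolveEncoding, kFirstChunkLimit, and the placeholder detectors are assumptions for illustration, not Blink's actual TextResourceDecoder API): the encoding is chosen once, from the head of the stream, and is not revisited afterwards.

// A minimal sketch, assuming hypothetical names; not Blink's real API.
#include <cstddef>
#include <iostream>
#include <string>

constexpr size_t kFirstChunkLimit = 4096;  // First-chunk limit described above.

// BOM sniffing on the head of the stream.
std::string DetectFromBom(const std::string& head) {
  if (head.rfind("\xEF\xBB\xBF", 0) == 0) return "UTF-8";
  if (head.rfind("\xFF\xFE", 0) == 0) return "UTF-16LE";
  if (head.rfind("\xFE\xFF", 0) == 0) return "UTF-16BE";
  return "";
}

// Placeholders: real <meta charset> pre-scanning and auto-detection are far
// more involved; here they simply report "no result".
std::string DetectFromMetaCharset(const std::string&) { return ""; }
std::string AutoDetect(const std::string&) { return ""; }

std::string ResolveEncoding(const std::string& first_chunk,
                            bool auto_detection_enabled,
                            const std::string& default_encoding) {
  const std::string head = first_chunk.substr(0, kFirstChunkLimit);
  std::string result = DetectFromBom(head);
  if (result.empty()) result = DetectFromMetaCharset(head);
  if (result.empty() && auto_detection_enabled) result = AutoDetect(head);
  return result.empty() ? default_encoding : result;
}

int main() {
  // A UTF-8 BOM wins immediately; otherwise the default codec is kept.
  std::cout << ResolveEncoding("\xEF\xBB\xBFhello", false, "windows-1252")
            << "\n";  // Prints "UTF-8".
}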
The newly introduced UTF-8 encoding detector (crrev.com/1721373002) was
reverted (crbug.com/603558) because it attempted to work in a somewhat
different way - it also examined all subsequent chunks in search of a
non-ASCII, UTF-8-encoded character sequence. This meant it was possible
for TextResourceDecoder to start with a codec for, say, windows-1252,
and then switch to one for UTF-8 later on. In theory this should still
work, but it does not in practice (perhaps it has never been used or
tested that way). This is what happened with the failed perf tests - one
of the JS files was big (13K) and pure ASCII except for one tiny
character sequence, \xc2\xa7, almost at the end.
The CL was updated so that UTF-8 encoding detection also works against
the first chunk only, like the other methods, to avoid a potential
codec switch in the middle of the document.
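Below is a minimal, self-contained sketch of the first-chunk-only approach, with made-up names (FirstChunkLooksLikeUtf8 and the windows-1252 fallback are assumptions for illustration, not the code landed in this CL): the decoder commits to UTF-8 only if the first chunk contains at least one non-ASCII byte and every non-ASCII sequence in it is well-formed UTF-8.

// A minimal sketch, assuming hypothetical names; not the CL's actual code.
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <string>

// Simplified validity check: it accepts a few technically invalid forms
// (overlong or out-of-range sequences, surrogates) that a production
// detector would reject.
bool FirstChunkLooksLikeUtf8(const uint8_t* data, size_t length) {
  const size_t limit = length < 4096 ? length : 4096;  // First chunk only.
  bool saw_non_ascii = false;
  size_t i = 0;
  while (i < limit) {
    const uint8_t byte = data[i];
    if (byte < 0x80) {  // ASCII: no evidence either way.
      ++i;
      continue;
    }
    saw_non_ascii = true;
    size_t extra;  // Number of continuation bytes expected.
    if ((byte & 0xE0) == 0xC0 && byte >= 0xC2) {
      extra = 1;  // 2-byte sequence (0xC0/0xC1 would be overlong).
    } else if ((byte & 0xF0) == 0xE0) {
      extra = 2;  // 3-byte sequence.
    } else if ((byte & 0xF8) == 0xF0 && byte <= 0xF4) {
      extra = 3;  // 4-byte sequence.
    } else {
      return false;  // Invalid lead byte.
    }
    if (i + extra >= limit) {
      break;  // Sequence cut off by the chunk boundary; ignore the tail.
    }
    for (size_t j = 1; j <= extra; ++j) {
      if ((data[i + j] & 0xC0) != 0x80)
        return false;  // Expected a continuation byte.
    }
    i += extra + 1;
  }
  return saw_non_ascii;
}

int main() {
  // A first chunk that is pure ASCII except for one UTF-8 sequence (\xc2\xa7).
  const std::string chunk = "var section = \"\xC2\xA7 7\"; // mostly ASCII";
  const bool utf8 = FirstChunkLooksLikeUtf8(
      reinterpret_cast<const uint8_t*>(chunk.data()), chunk.size());
  // "windows-1252" stands in for whatever the default codec would have been.
  std::cout << "Decoding with " << (utf8 ? "UTF-8" : "windows-1252") << "\n";
}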
BUG=583549, 603558
Committed: https://crrev.com/57139d64c5b98142ca9305792f39ae23a4950375
Cr-Commit-Position: refs/heads/master@{#388927}
Patch Set 1
Patch Set 2: add tests
Total comments: 4
Patch Set 3
Messages
Total messages: 18 (6 generated)