Issue 1407663004: Tweaking WebPageSerializerImpl to emit a BOM for UTF16/32.

Łukasz Anforowicz

Patchset #1 (id:1) has been deleted

5 years, 2 months ago (2015-10-14 17:00:54 UTC) #1

Łukasz Anforowicz

lukasza@chromium.org changed reviewers: + rdsmith@chromium.org

5 years, 2 months ago (2015-10-14 17:01:36 UTC) #2

Randy Smith (Not in Mondays)

(I feel competent to review the overall test structure, but I know very little about ...

5 years, 2 months ago (2015-10-15 18:57:06 UTC) #4

Łukasz Anforowicz

Randy, can you take another look please? https://codereview.chromium.org/1407663004/diff/40001/chrome/browser/download/save_page_browsertest.cc File chrome/browser/download/save_page_browsertest.cc (right): https://codereview.chromium.org/1407663004/diff/40001/chrome/browser/download/save_page_browsertest.cc#newcode561 chrome/browser/download/save_page_browsertest.cc:561: // Disabled ...

5 years, 2 months ago (2015-10-20 21:08:43 UTC) #5

Randy, can you take another look please?

https://codereview.chromium.org/1407663004/diff/40001/chrome/browser/download...
File chrome/browser/download/save_page_browsertest.cc (right):

https://codereview.chromium.org/1407663004/diff/40001/chrome/browser/download...
chrome/browser/download/save_page_browsertest.cc:561: // Disabled on Windows due
to flakiness. http://crbug.com/162323
On 2015/10/15 18:57:06, rdsmith wrote:
> This is a new test; have you confirmed that it's flaky?  Any possibility of
> fixing it?  I wince at landing tests in pre-flaky/disabled fashion, though if
> it's in some test infrastructure you have nothing to do with it's quite
> reasonable to not fix it.

Done.

Also following up with the following CLs:
- https://codereview.chromium.org/1408993003/ (landed)
- https://codereview.chromium.org/1408393004/ (landed)
- https://codereview.chromium.org/1413763008/ (in CR)

https://codereview.chromium.org/1407663004/diff/40001/chrome/browser/download...
chrome/browser/download/save_page_browsertest.cc:578:
EXPECT_TRUE(base::ContentsEqual(
On 2015/10/15 18:57:06, rdsmith wrote:
> I wince a bit at this, since it tests the implementation rather than the
> behavior we care about (I could imagine many character level changes that
> wouldn't effect the web page viewed by the user after saving)--is there any
way
> you can put something in the original file that would be messed up if the
> encoding was messed up, but would always be saved correctly if it wasn't?

I still think that comparing the whole file contents is desirable here:

- This captures encoding of the whole page (which is important if in the future
we decide to split encoding across 2 separate pieces of code as the tentative
crrev.com/1373573002 is doing).  Alternative here would be to foresee which
pieces might be encoded separately and add verification for each such piece
flavor.

- This captures BOM behavior.

- Avoiding whole-file-comparison in the new test doesn't help that much without
also avoiding whole-file-comparison in the 2 old tests.  And if somebody is
replacing the other 2 expected contents, then in the same go he/she can retain
and replace the file generated in the new test.

- There is a trade-off between robustness of verification and effort required to
update the tests.  I think the effort is not that big, and I think there is a
significant value in making sure that all unexpected changes are detected by the
test (i.e. I don't worry about overtesting at browser-test level as much as I
would at unit-test level).


If you feel strongly about this, then maybe we can track this as a separate bug,
that also covers the 2 old tests that currently use whole file comparison. 
WDYT?

Łukasz Anforowicz

On 2015/10/15 18:57:06, rdsmith wrote: > (I feel competent to review the overall test structure, ...

5 years, 2 months ago (2015-10-20 21:10:22 UTC) #6

Randy Smith (Not in Mondays)

On 2015/10/20 21:08:43, Łukasz Anforowicz wrote: > Randy, can you take another look please? > ...

5 years, 2 months ago (2015-10-21 17:27:16 UTC) #7

On 2015/10/20 21:08:43, Łukasz Anforowicz wrote:
> Randy, can you take another look please?
> 
>
https://codereview.chromium.org/1407663004/diff/40001/chrome/browser/download...
> File chrome/browser/download/save_page_browsertest.cc (right):
> 
>
https://codereview.chromium.org/1407663004/diff/40001/chrome/browser/download...
> chrome/browser/download/save_page_browsertest.cc:561: // Disabled on Windows
due
> to flakiness. http://crbug.com/162323
> On 2015/10/15 18:57:06, rdsmith wrote:
> > This is a new test; have you confirmed that it's flaky?  Any possibility of
> > fixing it?  I wince at landing tests in pre-flaky/disabled fashion, though
if
> > it's in some test infrastructure you have nothing to do with it's quite
> > reasonable to not fix it.
> 
> Done.
> 
> Also following up with the following CLs:
> - https://codereview.chromium.org/1408993003/ (landed)
> - https://codereview.chromium.org/1408393004/ (landed)
> - https://codereview.chromium.org/1413763008/ (in CR)
> 
>
https://codereview.chromium.org/1407663004/diff/40001/chrome/browser/download...
> chrome/browser/download/save_page_browsertest.cc:578:
> EXPECT_TRUE(base::ContentsEqual(
> On 2015/10/15 18:57:06, rdsmith wrote:
> > I wince a bit at this, since it tests the implementation rather than the
> > behavior we care about (I could imagine many character level changes that
> > wouldn't effect the web page viewed by the user after saving)--is there any
> way
> > you can put something in the original file that would be messed up if the
> > encoding was messed up, but would always be saved correctly if it wasn't?
> 
> I still think that comparing the whole file contents is desirable here:
> 
> - This captures encoding of the whole page (which is important if in the
future
> we decide to split encoding across 2 separate pieces of code as the tentative
> crrev.com/1373573002 is doing).  Alternative here would be to foresee which
> pieces might be encoded separately and add verification for each such piece
> flavor.
> 
> - This captures BOM behavior.
> 
> - Avoiding whole-file-comparison in the new test doesn't help that much
without
> also avoiding whole-file-comparison in the 2 old tests.  And if somebody is
> replacing the other 2 expected contents, then in the same go he/she can retain
> and replace the file generated in the new test.
> 
> - There is a trade-off between robustness of verification and effort required
to
> update the tests.  I think the effort is not that big, and I think there is a
> significant value in making sure that all unexpected changes are detected by
the
> test (i.e. I don't worry about overtesting at browser-test level as much as I
> would at unit-test level).
> 
> 
> If you feel strongly about this, then maybe we can track this as a separate
bug,
> that also covers the 2 old tests that currently use whole file comparison. 
> WDYT?

If there's a strong argument in that direction, I'm good with it. 
save_page_browsertest.cc and chrome/test/data LGTM--I don't know the webkit code
at all and am hoping whoever you get to evaluate the encoding can also that
code.  But if not, let me know, and I'll pepper you with questions until I
decide I know the code well enough to review it :-}.

Łukasz Anforowicz

lukasza@chromium.org changed reviewers: + dcheng@chromium.org

5 years, 2 months ago (2015-10-21 17:35:35 UTC) #8

Łukasz Anforowicz

Daniel, could you please take a look? (mostly at third_party/WebKit/Source/web/WebPageSerializerImpl.cpp, but obviously feedback for the ...

5 years, 2 months ago (2015-10-21 17:35:35 UTC) #9

Łukasz Anforowicz

lukasza@chromium.org changed reviewers: + nick@chromium.org

5 years, 2 months ago (2015-10-21 17:39:57 UTC) #10

Łukasz Anforowicz

On 2015/10/15 18:57:06, rdsmith wrote: > (I feel competent to review the overall test structure, ...

5 years, 2 months ago (2015-10-21 17:40:29 UTC) #11

dcheng

Also, you might need to land this manually with git cl land. I recall strange ...

5 years, 2 months ago (2015-10-21 20:09:29 UTC) #13

ncarter (slow)

https://codereview.chromium.org/1407663004/diff/80001/third_party/WebKit/Source/web/WebPageSerializerImpl.cpp File third_party/WebKit/Source/web/WebPageSerializerImpl.cpp (right): https://codereview.chromium.org/1407663004/diff/80001/third_party/WebKit/Source/web/WebPageSerializerImpl.cpp#newcode133 third_party/WebKit/Source/web/WebPageSerializerImpl.cpp:133: if (isHTMLMetaElement(*element)) { It looks like it's designed to ...

5 years, 2 months ago (2015-10-21 20:17:58 UTC) #14

Łukasz Anforowicz

I've replied to CR questions. I think the current CL should be put on hold ...

5 years, 2 months ago (2015-10-23 21:42:23 UTC) #15

I've replied to CR questions.  I think the current CL should be put on hold
until crbug.com/547206 is fixed.

https://codereview.chromium.org/1407663004/diff/80001/third_party/WebKit/Sour...
File third_party/WebKit/Source/web/WebPageSerializerImpl.cpp (right):

https://codereview.chromium.org/1407663004/diff/80001/third_party/WebKit/Sour...
third_party/WebKit/Source/web/WebPageSerializerImpl.cpp:133: if
(isHTMLMetaElement(*element)) {
On 2015/10/21 20:17:58, ncarter wrote:
> It looks like it's designed to strip <meta http-equiv="Content-Type"
> content="text/html; charset=UTF-8">.
> 
> Does also have the effect of stripping html5 style <meta charset="UTF-32">
tags?

I've opened crbug.com/547206 to track this (I'll add a comment to this bug with
a proposed fix in a few minutes).

I am not sure if the issue you pointed out (thanks for catching this BTW) should
block the CL under review.  Let me try to spell out the impact of this issue
before the CL / after the CL, for documents with html5-style meta/charset
element, and depending on the encoding of the document:

- UTF32 - won't work at all (content will be lost because of the IPC bug that
this CL tries to mitigate [by only sending utf8 strings over ipc])

- ISO8859-1 - will get serialized with *two* meta elements:

   - <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

     - generated by the serializer
     - before this CL this would use and say iso8859-1 - the original encoding

   - <meta charset="iso8859-1">

     - copied by the serializer from the original html document

My gut feeling is that I should not land the current CL until the <meta
charset...> issue is fixed.  Plus, I don't need to normalize to UTF8 across
browser/renderer if we use the "original urls" cl to fix the OOPIF problem
(crrev.com/1413103003).

https://codereview.chromium.org/1407663004/diff/80001/third_party/WebKit/Sour...
third_party/WebKit/Source/web/WebPageSerializerImpl.cpp:166: String xmlEncoding
= param->textEncoding.name();
On 2015/10/21 20:09:29, dcheng wrote:
> Out of curiosity, did you consider removing textEncoding from the parameters
> completely? If so, why did you opt not to do it?
> 
> (I'm not saying we should do it, just wondering about reasons for/against
doing
> this)

The main reason is an imprecise feeling that smaller diff will be better (better
for understanding the cl for review and for the future, better for compatibility
with future cls [future CLs = resurrecting serializer's merge OR other
OOPIF-related changes OR reintroducing the encoding if needed]).  I guess one
could say that we should optimizing for wholesale/overall code readability,
rather than readability of individual diifs - this is a valid argument, but I
think one can also argue for the other tradeoff.

Still - I didn't spend too much time thinking about it when authoring the CL -
the above paragraph is a post-rationalization of an already made decision...

dcheng

Blink changes seem OK, but given your reply, I'll defer from L-G-T-M until you think ...

5 years, 1 month ago (2015-10-26 17:01:31 UTC) #16

Łukasz Anforowicz

Description was changed from ========== Make sure that Save-Page-As-Complete-HTML always saves as UTF8. Before the ...

5 years ago (2015-12-01 18:49:04 UTC) #17

Description was changed from

==========
Make sure that Save-Page-As-Complete-HTML always saves as UTF8.

Before the CL, UTF32-encoded html files would render properly, but they
would be incorrectly saved as 1-byte-long file, because
WebPageSerializerImpl would attempt to preserve [1] the original file
encoding, but an IPC bug would corrupt/truncate UTF32 contents when
sending them to the browser (see crbug.com/541699#c2).

To fix the problem above, this CL makes sure that "complete-html" is
always saved as UTF8.  This fixes crbug.com/541699 - after the CL
save-page-as works fine with multi-byte encodings (i.e. a saved
utf32.htm can be successfully opened and rendered).

Other notes:

- W3C recommends saving as UTF8 - [2], [3]

- Always using UTF8 makes it easy to coordinate the encoding between
  renderer and browser processes for OOPIFs-related work - see the
  pending CL in [6].

- If saving in original encoding [1] was important, we would have noticed
  that saving of multi-byte encodings (UTF16, UTF32) was broken for 4+
  years.

- Existing PageSerializer and WebPageSerializeImpl already have code and
  set a precendent for modifying encoding references inside document
  content:
  [4] - PageSerializer.cpp - isCharsetSpecifyingNode.
  [5] - PageSerializer.cpp - emitting meta element w/ charset info
  [7] - WebPageSerializerImpl.cpp - skipping the original meta element
  [8] - WebPageSerializerImpl.cpp - emitting encoding in xml declaration
  [9] - WebPageSerializerImpl.cpp - emitting meta element w/ charset info

- This CL continues the current behavior of *not* emitting a BOM at the
  beginning of a saved html file.  I am not sure if this is an issue.
  W3C documents I've seen so far do not have a strong recommendation for
  or against inclusion of a BOM.

- This CL applies only to save as "complete html" and has no impact to save as
  "html only" or save as "mhtml".


Test steps:

- Built chrome and manually saved-as-complete-html and inspected result for
  http://anforowicz.github.io/html-encoding/utf32.html

- Run the new test (out/gn/browser_tests --gtest_filter=*SaveUTF32Complete*)
  and manually verified correctness of the output file.


[1]
https://chromium.googlesource.com/chromium/src/+/d083d29530a507b7b5a6df1c5727...).

[2] http://www.w3.org/TR/encoding/

    "The utf-8 encoding is the most appropriate encoding for interchange of
    Unicode, the universal coded character set. Therefore for new protocols
    and formats, as well as existing formats deployed in new contexts, this
    specification requires (and defines) the utf-8 encoding."

[3] http://www.w3.org/TR/html5/document-metadata#charset:

    "Authors should use UTF-8. Conformance checkers may advise authors
    against using legacy encodings. [RFC3629]"

    "Authoring tools should default to using UTF-8 for newly-created
    documents. [RFC3629]"

[4]
https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c...

[5]
https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c...

[6]
https://codereview.chromium.org/1373573002/diff/120001/content/browser/downlo...

[7]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[8]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[9]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...


BUG=541699,526786
TEST=See "Test steps" above.
==========

to

==========
Make sure that Save-Page-As-Complete-HTML always saves as UTF8.

Before the CL, UTF32-encoded html files would render properly, but they
would be incorrectly saved as 1-byte-long file, because
WebPageSerializerImpl would attempt to preserve [1] the original file
encoding, but an IPC bug would corrupt/truncate UTF32 contents when
sending them to the browser (see crbug.com/541699#c2).

To fix the problem above, this CL makes sure that "complete-html" is
always saved as UTF8.  This fixes crbug.com/541699 - after the CL
save-page-as works fine with multi-byte encodings (i.e. a saved
utf32.htm can be successfully opened and rendered).

Other notes:

- W3C recommends saving as UTF8 - [2], [3]

- If saving in original encoding [1] was important, we would have noticed
  that saving of multi-byte encodings (UTF16, UTF32) was broken for 4+
  years.

- Existing PageSerializer and WebPageSerializeImpl already have code and
  set a precendent for modifying encoding references inside document
  content:
  [4] - PageSerializer.cpp - isCharsetSpecifyingNode.
  [5] - PageSerializer.cpp - emitting meta element w/ charset info
  [7] - WebPageSerializerImpl.cpp - skipping the original meta element
  [8] - WebPageSerializerImpl.cpp - emitting encoding in xml declaration
  [9] - WebPageSerializerImpl.cpp - emitting meta element w/ charset info

- This CL continues the current behavior of *not* emitting a BOM at the
  beginning of a saved html file.  I am not sure if this is an issue.
  W3C documents I've seen so far do not have a strong recommendation for
  or against inclusion of a BOM.

- This CL applies only to save as "complete html" and has no impact to save as
  "html only" or save as "mhtml".


Test steps:

- Built chrome and manually saved-as-complete-html and inspected result for
  http://anforowicz.github.io/html-encoding/utf32.html

- Run the new test (out/gn/browser_tests --gtest_filter=*SaveUTF32Complete*)
  and manually verified correctness of the output file.


[1]
https://chromium.googlesource.com/chromium/src/+/d083d29530a507b7b5a6df1c5727...).

[2] http://www.w3.org/TR/encoding/

    "The utf-8 encoding is the most appropriate encoding for interchange of
    Unicode, the universal coded character set. Therefore for new protocols
    and formats, as well as existing formats deployed in new contexts, this
    specification requires (and defines) the utf-8 encoding."

[3] http://www.w3.org/TR/html5/document-metadata#charset:

    "Authors should use UTF-8. Conformance checkers may advise authors
    against using legacy encodings. [RFC3629]"

    "Authoring tools should default to using UTF-8 for newly-created
    documents. [RFC3629]"

[4]
https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c...

[5]
https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c...

[6]
https://codereview.chromium.org/1373573002/diff/120001/content/browser/downlo...

[7]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[8]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[9]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...


BUG=541699,526786
TEST=See "Test steps" above.
==========

Łukasz Anforowicz

Description was changed from ========== Make sure that Save-Page-As-Complete-HTML always saves as UTF8. Before the ...

5 years ago (2015-12-01 19:07:49 UTC) #18

Description was changed from

==========
Make sure that Save-Page-As-Complete-HTML always saves as UTF8.

Before the CL, UTF32-encoded html files would render properly, but they
would be incorrectly saved as 1-byte-long file, because
WebPageSerializerImpl would attempt to preserve [1] the original file
encoding, but an IPC bug would corrupt/truncate UTF32 contents when
sending them to the browser (see crbug.com/541699#c2).

To fix the problem above, this CL makes sure that "complete-html" is
always saved as UTF8.  This fixes crbug.com/541699 - after the CL
save-page-as works fine with multi-byte encodings (i.e. a saved
utf32.htm can be successfully opened and rendered).

Other notes:

- W3C recommends saving as UTF8 - [2], [3]

- If saving in original encoding [1] was important, we would have noticed
  that saving of multi-byte encodings (UTF16, UTF32) was broken for 4+
  years.

- Existing PageSerializer and WebPageSerializeImpl already have code and
  set a precendent for modifying encoding references inside document
  content:
  [4] - PageSerializer.cpp - isCharsetSpecifyingNode.
  [5] - PageSerializer.cpp - emitting meta element w/ charset info
  [7] - WebPageSerializerImpl.cpp - skipping the original meta element
  [8] - WebPageSerializerImpl.cpp - emitting encoding in xml declaration
  [9] - WebPageSerializerImpl.cpp - emitting meta element w/ charset info

- This CL continues the current behavior of *not* emitting a BOM at the
  beginning of a saved html file.  I am not sure if this is an issue.
  W3C documents I've seen so far do not have a strong recommendation for
  or against inclusion of a BOM.

- This CL applies only to save as "complete html" and has no impact to save as
  "html only" or save as "mhtml".


Test steps:

- Built chrome and manually saved-as-complete-html and inspected result for
  http://anforowicz.github.io/html-encoding/utf32.html

- Run the new test (out/gn/browser_tests --gtest_filter=*SaveUTF32Complete*)
  and manually verified correctness of the output file.


[1]
https://chromium.googlesource.com/chromium/src/+/d083d29530a507b7b5a6df1c5727...).

[2] http://www.w3.org/TR/encoding/

    "The utf-8 encoding is the most appropriate encoding for interchange of
    Unicode, the universal coded character set. Therefore for new protocols
    and formats, as well as existing formats deployed in new contexts, this
    specification requires (and defines) the utf-8 encoding."

[3] http://www.w3.org/TR/html5/document-metadata#charset:

    "Authors should use UTF-8. Conformance checkers may advise authors
    against using legacy encodings. [RFC3629]"

    "Authoring tools should default to using UTF-8 for newly-created
    documents. [RFC3629]"

[4]
https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c...

[5]
https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c...

[6]
https://codereview.chromium.org/1373573002/diff/120001/content/browser/downlo...

[7]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[8]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[9]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...


BUG=541699,526786
TEST=See "Test steps" above.
==========

to

==========
Make sure that Save-Page-As-Complete-HTML always saves as UTF8.

Before the CL, UTF32-encoded html files would render properly, but they
would be incorrectly saved as 1-byte-long file, because
WebPageSerializerImpl would attempt to preserve [1] the original file
encoding, but an IPC bug would corrupt/truncate UTF32 contents when
sending them to the browser (see crbug.com/541699#c2).

To fix the problem above, this CL makes sure that "complete-html" is
always saved as UTF8.  This fixes crbug.com/541699 - after the CL
save-page-as works fine with multi-byte encodings (i.e. a saved
utf32.htm can be successfully opened and rendered).

Other notes:

- W3C recommends saving as UTF8 - [2], [3]

- If saving in original encoding [1] was important, we would have noticed
  that saving of multi-byte encodings (UTF16, UTF32) was broken for 4+
  years.

- Existing PageSerializer and WebPageSerializeImpl already have code and
  set a precendent for modifying encoding references inside document
  content:
  [4] - PageSerializer.cpp - isCharsetSpecifyingNode.
  [5] - PageSerializer.cpp - emitting meta element w/ charset info
  [7] - WebPageSerializerImpl.cpp - skipping the original meta element
  [8] - WebPageSerializerImpl.cpp - emitting encoding in xml declaration
  [9] - WebPageSerializerImpl.cpp - emitting meta element w/ charset info

- This CL continues the current behavior of *not* emitting a BOM at the
  beginning of a saved html file.  I am not sure if this is an issue.
  W3C documents I've seen so far do not have a strong recommendation for
  or against inclusion of a BOM.

- This CL applies only to save as "complete html" and has no impact to save as
  "html only" or save as "mhtml".


Test steps:

- Built chrome and manually saved-as-complete-html and inspected result for
  http://anforowicz.github.io/html-encoding/utf32.html

- Run the new test (out/gn/browser_tests --gtest_filter=*SaveUTF32Complete*)
  and manually verified correctness of the output file.


[1]
https://chromium.googlesource.com/chromium/src/+/d083d29530a507b7b5a6df1c5727...).

[2] http://www.w3.org/TR/encoding/

    "The utf-8 encoding is the most appropriate encoding for interchange of
    Unicode, the universal coded character set. Therefore for new protocols
    and formats, as well as existing formats deployed in new contexts, this
    specification requires (and defines) the utf-8 encoding."

[3] http://www.w3.org/TR/html5/document-metadata#charset:

    "Authors should use UTF-8. Conformance checkers may advise authors
    against using legacy encodings. [RFC3629]"

    "Authoring tools should default to using UTF-8 for newly-created
    documents. [RFC3629]"

[4]
https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c...

[5]
https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c...

[6]
https://codereview.chromium.org/1373573002/diff/120001/content/browser/downlo...

[7]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[8]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[9]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...


BUG=541699
TEST=See "Test steps" above.
==========

Łukasz Anforowicz

Description was changed from ========== Make sure that Save-Page-As-Complete-HTML always saves as UTF8. Before the ...

5 years ago (2015-12-01 20:01:30 UTC) #19

Description was changed from

==========
Make sure that Save-Page-As-Complete-HTML always saves as UTF8.

Before the CL, UTF32-encoded html files would render properly, but they
would be incorrectly saved as 1-byte-long file, because
WebPageSerializerImpl would attempt to preserve [1] the original file
encoding, but an IPC bug would corrupt/truncate UTF32 contents when
sending them to the browser (see crbug.com/541699#c2).

To fix the problem above, this CL makes sure that "complete-html" is
always saved as UTF8.  This fixes crbug.com/541699 - after the CL
save-page-as works fine with multi-byte encodings (i.e. a saved
utf32.htm can be successfully opened and rendered).

Other notes:

- W3C recommends saving as UTF8 - [2], [3]

- If saving in original encoding [1] was important, we would have noticed
  that saving of multi-byte encodings (UTF16, UTF32) was broken for 4+
  years.

- Existing PageSerializer and WebPageSerializeImpl already have code and
  set a precendent for modifying encoding references inside document
  content:
  [4] - PageSerializer.cpp - isCharsetSpecifyingNode.
  [5] - PageSerializer.cpp - emitting meta element w/ charset info
  [7] - WebPageSerializerImpl.cpp - skipping the original meta element
  [8] - WebPageSerializerImpl.cpp - emitting encoding in xml declaration
  [9] - WebPageSerializerImpl.cpp - emitting meta element w/ charset info

- This CL continues the current behavior of *not* emitting a BOM at the
  beginning of a saved html file.  I am not sure if this is an issue.
  W3C documents I've seen so far do not have a strong recommendation for
  or against inclusion of a BOM.

- This CL applies only to save as "complete html" and has no impact to save as
  "html only" or save as "mhtml".


Test steps:

- Built chrome and manually saved-as-complete-html and inspected result for
  http://anforowicz.github.io/html-encoding/utf32.html

- Run the new test (out/gn/browser_tests --gtest_filter=*SaveUTF32Complete*)
  and manually verified correctness of the output file.


[1]
https://chromium.googlesource.com/chromium/src/+/d083d29530a507b7b5a6df1c5727...).

[2] http://www.w3.org/TR/encoding/

    "The utf-8 encoding is the most appropriate encoding for interchange of
    Unicode, the universal coded character set. Therefore for new protocols
    and formats, as well as existing formats deployed in new contexts, this
    specification requires (and defines) the utf-8 encoding."

[3] http://www.w3.org/TR/html5/document-metadata#charset:

    "Authors should use UTF-8. Conformance checkers may advise authors
    against using legacy encodings. [RFC3629]"

    "Authoring tools should default to using UTF-8 for newly-created
    documents. [RFC3629]"

[4]
https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c...

[5]
https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c...

[6]
https://codereview.chromium.org/1373573002/diff/120001/content/browser/downlo...

[7]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[8]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[9]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...


BUG=541699
TEST=See "Test steps" above.
==========

to

==========
Make sure that Save-Page-As-Complete-HTML always saves as UTF8.

Before the CL, UTF32-encoded html files would render properly, but they
would be incorrectly saved as 1-byte-long file, because
WebPageSerializerImpl would attempt to preserve [1] the original file
encoding, but an IPC bug would corrupt/truncate UTF32 contents when
sending them to the browser (see crbug.com/541699#c2).

Note that even after fixing the IPC bug [10], saved page would not
render properly, because the browser would not be able to correctly
detect the encoding of the file saved in the original, utf-32 encoding.

To fix the problems above, this CL makes sure that "complete-html" is
always saved as UTF8.

Other notes:

- W3C recommends saving as UTF8 - [2], [3]

- If saving in original encoding [1] was important, we would have noticed
  that saving of multi-byte encodings (UTF16, UTF32) was broken for 4+
  years.

- The only way to declare multi-byte encoding is via http headers and
  therefore encoding information gets lost when saving multi-byte
  encoded html docs into local files (i.e. there is no good place for
  saving / preserving the encoding information in the local files; in
  particular sniffing the <meta> element doesn't work for multi-byte
  encodings).

- Existing PageSerializer and WebPageSerializeImpl already have code and
  set a precendent for modifying encoding references inside document
  content:
  [4] - PageSerializer.cpp - isCharsetSpecifyingNode.
  [5] - PageSerializer.cpp - emitting meta element w/ charset info
  [7] - WebPageSerializerImpl.cpp - skipping the original meta element
  [8] - WebPageSerializerImpl.cpp - emitting encoding in xml declaration
  [9] - WebPageSerializerImpl.cpp - emitting meta element w/ charset info

- This CL continues the current behavior of *not* emitting a BOM at the
  beginning of a saved html file.  I am not sure if this is an issue.
  W3C documents I've seen so far do not have a strong recommendation for
  or against inclusion of a BOM.

- This CL applies only to save as "complete html" and has no impact to save as
  "html only" or save as "mhtml".


Test steps:

- Built chrome and manually saved-as-complete-html and inspected result for
  http://anforowicz.github.io/html-encoding/utf32.html

- Run the new save_page_browsertest.cc and manually verified correctness
  of the output (preserved by calling save_dir_Take()).


[1]
https://chromium.googlesource.com/chromium/src/+/d083d29530a507b7b5a6df1c5727...).

[2] http://www.w3.org/TR/encoding/

    "The utf-8 encoding is the most appropriate encoding for interchange of
    Unicode, the universal coded character set. Therefore for new protocols
    and formats, as well as existing formats deployed in new contexts, this
    specification requires (and defines) the utf-8 encoding."

[3] http://www.w3.org/TR/html5/document-metadata#charset:

    "Authors should use UTF-8. Conformance checkers may advise authors
    against using legacy encodings. [RFC3629]"

    "Authoring tools should default to using UTF-8 for newly-created
    documents. [RFC3629]"

[4]
https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c...

[5]
https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c...

[6]
https://codereview.chromium.org/1373573002/diff/120001/content/browser/downlo...

[7]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[8]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[9]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[10]
https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242...


BUG=541699
TEST=See "Test steps" above.
==========

Łukasz Anforowicz

Description was changed from ========== Make sure that Save-Page-As-Complete-HTML always saves as UTF8. Before the ...

5 years ago (2015-12-01 23:33:42 UTC) #20

Description was changed from

==========
Make sure that Save-Page-As-Complete-HTML always saves as UTF8.

Before the CL, UTF32-encoded html files would render properly, but they
would be incorrectly saved as 1-byte-long file, because
WebPageSerializerImpl would attempt to preserve [1] the original file
encoding, but an IPC bug would corrupt/truncate UTF32 contents when
sending them to the browser (see crbug.com/541699#c2).

Note that even after fixing the IPC bug [10], saved page would not
render properly, because the browser would not be able to correctly
detect the encoding of the file saved in the original, utf-32 encoding.

To fix the problems above, this CL makes sure that "complete-html" is
always saved as UTF8.

Other notes:

- W3C recommends saving as UTF8 - [2], [3]

- If saving in original encoding [1] was important, we would have noticed
  that saving of multi-byte encodings (UTF16, UTF32) was broken for 4+
  years.

- The only way to declare multi-byte encoding is via http headers and
  therefore encoding information gets lost when saving multi-byte
  encoded html docs into local files (i.e. there is no good place for
  saving / preserving the encoding information in the local files; in
  particular sniffing the <meta> element doesn't work for multi-byte
  encodings).

- Existing PageSerializer and WebPageSerializeImpl already have code and
  set a precendent for modifying encoding references inside document
  content:
  [4] - PageSerializer.cpp - isCharsetSpecifyingNode.
  [5] - PageSerializer.cpp - emitting meta element w/ charset info
  [7] - WebPageSerializerImpl.cpp - skipping the original meta element
  [8] - WebPageSerializerImpl.cpp - emitting encoding in xml declaration
  [9] - WebPageSerializerImpl.cpp - emitting meta element w/ charset info

- This CL continues the current behavior of *not* emitting a BOM at the
  beginning of a saved html file.  I am not sure if this is an issue.
  W3C documents I've seen so far do not have a strong recommendation for
  or against inclusion of a BOM.

- This CL applies only to save as "complete html" and has no impact to save as
  "html only" or save as "mhtml".


Test steps:

- Built chrome and manually saved-as-complete-html and inspected result for
  http://anforowicz.github.io/html-encoding/utf32.html

- Run the new save_page_browsertest.cc and manually verified correctness
  of the output (preserved by calling save_dir_Take()).


[1]
https://chromium.googlesource.com/chromium/src/+/d083d29530a507b7b5a6df1c5727...).

[2] http://www.w3.org/TR/encoding/

    "The utf-8 encoding is the most appropriate encoding for interchange of
    Unicode, the universal coded character set. Therefore for new protocols
    and formats, as well as existing formats deployed in new contexts, this
    specification requires (and defines) the utf-8 encoding."

[3] http://www.w3.org/TR/html5/document-metadata#charset:

    "Authors should use UTF-8. Conformance checkers may advise authors
    against using legacy encodings. [RFC3629]"

    "Authoring tools should default to using UTF-8 for newly-created
    documents. [RFC3629]"

[4]
https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c...

[5]
https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c...

[6]
https://codereview.chromium.org/1373573002/diff/120001/content/browser/downlo...

[7]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[8]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[9]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[10]
https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242...


BUG=541699
TEST=See "Test steps" above.
==========

to

==========
Make sure that Save-Page-As-Complete-HTML always saves as UTF8.

Before the CL, UTF32-encoded html files would render properly, but they
would be incorrectly saved as 1-byte-long file, because
WebPageSerializerImpl would attempt to preserve [1] the original file
encoding, but an IPC bug would corrupt/truncate UTF32 contents when
sending them to the browser (see crbug.com/541699#c2).

Note that even after fixing the IPC bug [10], saved page would not
render properly, because the browser would not be able to correctly
detect the encoding of the file saved in the original, utf-32 encoding.

To fix the problems above, this CL makes sure that "complete-html" is
always saved as UTF8.

Other notes:

- W3C recommends saving as UTF8 - [2], [3]

- If saving in original encoding [1] was important, we would have noticed
  that saving of multi-byte encodings (UTF16, UTF32) was broken for 4+
  years.

- The only way to declare multi-byte encoding is via http headers and
  therefore encoding information gets lost when saving multi-byte
  encoded html docs into local files (i.e. there is no good place for
  saving / preserving the encoding information in the local files; in
  particular sniffing the <meta> element doesn't work for multi-byte
  encodings).

- Existing PageSerializer and WebPageSerializeImpl already have code and
  set a precendent for modifying encoding references inside document
  content:
  [4] - PageSerializer.cpp - isCharsetSpecifyingNode.
  [5] - PageSerializer.cpp - emitting meta element w/ charset info
  [7] - WebPageSerializerImpl.cpp - skipping the original meta element
  [8] - WebPageSerializerImpl.cpp - emitting encoding in xml declaration
  [9] - WebPageSerializerImpl.cpp - emitting meta element w/ charset info

- This CL continues the current behavior of *not* emitting a BOM at the
  beginning of a saved html file.  I am not sure if this is an issue.
  W3C documents I've seen so far do not have a strong recommendation for
  or against inclusion of a BOM.

- This CL applies only to save as "complete html" and has no impact to save as
  "html only" or save as "mhtml".


Test steps:

- Built chrome and manually saved-as-complete-html and inspected result for
  http://anforowicz.github.io/html-encoding/utf32.html

- Run the new save_page_browsertest.cc and manually verified correctness
  of the output (preserved by calling save_dir_Take()).  This included
  verifying that the 2 characters from encoding-euc-kr.htm are still
  visually distinct after save-page-as-complete-html.


[1]
https://chromium.googlesource.com/chromium/src/+/d083d29530a507b7b5a6df1c5727...).

[2] http://www.w3.org/TR/encoding/

    "The utf-8 encoding is the most appropriate encoding for interchange of
    Unicode, the universal coded character set. Therefore for new protocols
    and formats, as well as existing formats deployed in new contexts, this
    specification requires (and defines) the utf-8 encoding."

[3] http://www.w3.org/TR/html5/document-metadata#charset:

    "Authors should use UTF-8. Conformance checkers may advise authors
    against using legacy encodings. [RFC3629]"

    "Authoring tools should default to using UTF-8 for newly-created
    documents. [RFC3629]"

[4]
https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c...

[5]
https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c...

[6]
https://codereview.chromium.org/1373573002/diff/120001/content/browser/downlo...

[7]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[8]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[9]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[10]
https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242...


BUG=541699
TEST=See "Test steps" above.
==========

Łukasz Anforowicz

Description was changed from ========== Make sure that Save-Page-As-Complete-HTML always saves as UTF8. Before the ...

5 years ago (2015-12-02 00:51:26 UTC) #21

Description was changed from

==========
Make sure that Save-Page-As-Complete-HTML always saves as UTF8.

Before the CL, UTF32-encoded html files would render properly, but they
would be incorrectly saved as 1-byte-long file, because
WebPageSerializerImpl would attempt to preserve [1] the original file
encoding, but an IPC bug would corrupt/truncate UTF32 contents when
sending them to the browser (see crbug.com/541699#c2).

Note that even after fixing the IPC bug [10], saved page would not
render properly, because the browser would not be able to correctly
detect the encoding of the file saved in the original, utf-32 encoding.

To fix the problems above, this CL makes sure that "complete-html" is
always saved as UTF8.

Other notes:

- W3C recommends saving as UTF8 - [2], [3]

- If saving in original encoding [1] was important, we would have noticed
  that saving of multi-byte encodings (UTF16, UTF32) was broken for 4+
  years.

- The only way to declare multi-byte encoding is via http headers and
  therefore encoding information gets lost when saving multi-byte
  encoded html docs into local files (i.e. there is no good place for
  saving / preserving the encoding information in the local files; in
  particular sniffing the <meta> element doesn't work for multi-byte
  encodings).

- Existing PageSerializer and WebPageSerializeImpl already have code and
  set a precendent for modifying encoding references inside document
  content:
  [4] - PageSerializer.cpp - isCharsetSpecifyingNode.
  [5] - PageSerializer.cpp - emitting meta element w/ charset info
  [7] - WebPageSerializerImpl.cpp - skipping the original meta element
  [8] - WebPageSerializerImpl.cpp - emitting encoding in xml declaration
  [9] - WebPageSerializerImpl.cpp - emitting meta element w/ charset info

- This CL continues the current behavior of *not* emitting a BOM at the
  beginning of a saved html file.  I am not sure if this is an issue.
  W3C documents I've seen so far do not have a strong recommendation for
  or against inclusion of a BOM.

- This CL applies only to save as "complete html" and has no impact to save as
  "html only" or save as "mhtml".


Test steps:

- Built chrome and manually saved-as-complete-html and inspected result for
  http://anforowicz.github.io/html-encoding/utf32.html

- Run the new save_page_browsertest.cc and manually verified correctness
  of the output (preserved by calling save_dir_Take()).  This included
  verifying that the 2 characters from encoding-euc-kr.htm are still
  visually distinct after save-page-as-complete-html.


[1]
https://chromium.googlesource.com/chromium/src/+/d083d29530a507b7b5a6df1c5727...).

[2] http://www.w3.org/TR/encoding/

    "The utf-8 encoding is the most appropriate encoding for interchange of
    Unicode, the universal coded character set. Therefore for new protocols
    and formats, as well as existing formats deployed in new contexts, this
    specification requires (and defines) the utf-8 encoding."

[3] http://www.w3.org/TR/html5/document-metadata#charset:

    "Authors should use UTF-8. Conformance checkers may advise authors
    against using legacy encodings. [RFC3629]"

    "Authoring tools should default to using UTF-8 for newly-created
    documents. [RFC3629]"

[4]
https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c...

[5]
https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c...

[6]
https://codereview.chromium.org/1373573002/diff/120001/content/browser/downlo...

[7]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[8]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[9]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[10]
https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242...


BUG=541699
TEST=See "Test steps" above.
==========

to

==========
Make sure that Save-Page-As-Complete-HTML always saves as UTF8.

Before the CL, UTF32-encoded html files would render properly, but they
would be incorrectly saved as 1-byte-long file, because
WebPageSerializerImpl would attempt to preserve [1] the original file
encoding, but an IPC bug would corrupt/truncate UTF32 contents when
sending them to the browser (see crbug.com/541699#c2).

Note that even after fixing the IPC bug [10], saved page would not
render properly, because the browser would not be able to correctly
detect the encoding of the file saved in the original, utf-32 encoding.

To fix the problems above, this CL makes sure that "complete-html" is
always saved as UTF8.

Other notes:

- W3C recommends saving as UTF8 - [2], [3]

- If saving in original encoding [1] was important, we would have noticed
  that saving of multi-byte encodings (UTF16, UTF32) was broken for 4+
  years.

- The only way to declare multi-byte encoding is via http headers and
  therefore encoding information gets lost when saving multi-byte
  encoded html docs into local files (i.e. there is no good place for
  saving / preserving the encoding information in the local files; in
  particular sniffing the <meta> element doesn't work for multi-byte
  encodings).

- Existing PageSerializer and WebPageSerializeImpl already have code and
  set a precendent for modifying encoding references inside document
  content:
  [4] - PageSerializer.cpp - isCharsetSpecifyingNode.
  [5] - PageSerializer.cpp - emitting meta element w/ charset info
  [7] - WebPageSerializerImpl.cpp - skipping the original meta element
  [8] - WebPageSerializerImpl.cpp - emitting encoding in xml declaration
  [9] - WebPageSerializerImpl.cpp - emitting meta element w/ charset info

- This CL continues the current behavior of *not* emitting a BOM at the
  beginning of a saved html file.  I am not sure if this is an issue.
  W3C documents I've seen so far do not have a strong recommendation for
  or against inclusion of a BOM.

- This CL applies only to save as "complete html" and has no impact to save as
  "html only" or save as "mhtml".


Test steps:

- Built chrome and manually saved-as-complete-html and inspected result for
  http://anforowicz.github.io/html-encoding/utf32.html

- Run the new save_page_browsertest.cc and manually verified correctness
  of the output (preserved by calling save_dir_Take()).

- Manually saved a page with euc-kr encoding from the modified
  WebPageSerializerTest and verified that the 2 characters are still
  visually distinct after save-page-as-complete-html.


[1]
https://chromium.googlesource.com/chromium/src/+/d083d29530a507b7b5a6df1c5727...).

[2] http://www.w3.org/TR/encoding/

    "The utf-8 encoding is the most appropriate encoding for interchange of
    Unicode, the universal coded character set. Therefore for new protocols
    and formats, as well as existing formats deployed in new contexts, this
    specification requires (and defines) the utf-8 encoding."

[3] http://www.w3.org/TR/html5/document-metadata#charset:

    "Authors should use UTF-8. Conformance checkers may advise authors
    against using legacy encodings. [RFC3629]"

    "Authoring tools should default to using UTF-8 for newly-created
    documents. [RFC3629]"

[4]
https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c...

[5]
https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c...

[6]
https://codereview.chromium.org/1373573002/diff/120001/content/browser/downlo...

[7]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[8]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[9]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[10]
https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242...


BUG=541699
TEST=See "Test steps" above.
==========

Łukasz Anforowicz

lukasza@chromium.org changed reviewers: + jsbell@chromium.org, kojii@chromium.org

5 years ago (2015-12-02 22:04:36 UTC) #22

Łukasz Anforowicz

jsbell@ + kojii@ - could you please take a look? In particular: 1. Please shout ...

5 years ago (2015-12-02 22:04:36 UTC) #23

jsbell

Quick nit about the CL description: it uses "multi-byte encodings" in a non-standard way. That ...

5 years ago (2015-12-02 23:26:10 UTC) #24

jsbell

Description was changed from ========== Make sure that Save-Page-As-Complete-HTML always saves as UTF8. Before the ...

5 years ago (2015-12-02 23:38:41 UTC) #25

Description was changed from

==========
Make sure that Save-Page-As-Complete-HTML always saves as UTF8.

Before the CL, UTF32-encoded html files would render properly, but they
would be incorrectly saved as 1-byte-long file, because
WebPageSerializerImpl would attempt to preserve [1] the original file
encoding, but an IPC bug would corrupt/truncate UTF32 contents when
sending them to the browser (see crbug.com/541699#c2).

Note that even after fixing the IPC bug [10], saved page would not
render properly, because the browser would not be able to correctly
detect the encoding of the file saved in the original, utf-32 encoding.

To fix the problems above, this CL makes sure that "complete-html" is
always saved as UTF8.

Other notes:

- W3C recommends saving as UTF8 - [2], [3]

- If saving in original encoding [1] was important, we would have noticed
  that saving of multi-byte encodings (UTF16, UTF32) was broken for 4+
  years.

- The only way to declare multi-byte encoding is via http headers and
  therefore encoding information gets lost when saving multi-byte
  encoded html docs into local files (i.e. there is no good place for
  saving / preserving the encoding information in the local files; in
  particular sniffing the <meta> element doesn't work for multi-byte
  encodings).

- Existing PageSerializer and WebPageSerializeImpl already have code and
  set a precendent for modifying encoding references inside document
  content:
  [4] - PageSerializer.cpp - isCharsetSpecifyingNode.
  [5] - PageSerializer.cpp - emitting meta element w/ charset info
  [7] - WebPageSerializerImpl.cpp - skipping the original meta element
  [8] - WebPageSerializerImpl.cpp - emitting encoding in xml declaration
  [9] - WebPageSerializerImpl.cpp - emitting meta element w/ charset info

- This CL continues the current behavior of *not* emitting a BOM at the
  beginning of a saved html file.  I am not sure if this is an issue.
  W3C documents I've seen so far do not have a strong recommendation for
  or against inclusion of a BOM.

- This CL applies only to save as "complete html" and has no impact to save as
  "html only" or save as "mhtml".


Test steps:

- Built chrome and manually saved-as-complete-html and inspected result for
  http://anforowicz.github.io/html-encoding/utf32.html

- Run the new save_page_browsertest.cc and manually verified correctness
  of the output (preserved by calling save_dir_Take()).

- Manually saved a page with euc-kr encoding from the modified
  WebPageSerializerTest and verified that the 2 characters are still
  visually distinct after save-page-as-complete-html.


[1]
https://chromium.googlesource.com/chromium/src/+/d083d29530a507b7b5a6df1c5727...).

[2] http://www.w3.org/TR/encoding/

    "The utf-8 encoding is the most appropriate encoding for interchange of
    Unicode, the universal coded character set. Therefore for new protocols
    and formats, as well as existing formats deployed in new contexts, this
    specification requires (and defines) the utf-8 encoding."

[3] http://www.w3.org/TR/html5/document-metadata#charset:

    "Authors should use UTF-8. Conformance checkers may advise authors
    against using legacy encodings. [RFC3629]"

    "Authoring tools should default to using UTF-8 for newly-created
    documents. [RFC3629]"

[4]
https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c...

[5]
https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c...

[6]
https://codereview.chromium.org/1373573002/diff/120001/content/browser/downlo...

[7]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[8]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[9]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[10]
https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242...


BUG=541699
TEST=See "Test steps" above.
==========

to

==========
Make sure that Save-Page-As-Complete-HTML always saves as UTF8.

Before the CL, UTF32-encoded html files would render properly, but they
would be incorrectly saved as 1-byte-long file, because
WebPageSerializerImpl would attempt to preserve [1] the original file
encoding, but an IPC bug would corrupt/truncate UTF32 contents when
sending them to the browser (see crbug.com/541699#c2).

Note that even after fixing the IPC bug [10], saved page would not
render properly, because the browser would not be able to correctly
detect the encoding of the file saved in the original, utf-32 encoding.

To fix the problems above, this CL makes sure that "complete-html" is
always saved as UTF8.

Other notes:

- W3C recommends saving as UTF8 - [2], [3]

- If saving in original encoding [1] was important, we would have noticed
  that saving of multi-byte encodings (UTF16, UTF32) was broken for 4+
  years.

- The only way to declare multi-byte encoding is via http headers and
  therefore encoding information gets lost when saving multi-byte
  encoded html docs into local files (i.e. there is no good place for
  saving / preserving the encoding information in the local files; in
  particular sniffing the <meta> element doesn't work for multi-byte
  encodings).

- Existing PageSerializer and WebPageSerializeImpl already have code and
  set a precendent for modifying encoding references inside document
  content:
  [4] - PageSerializer.cpp - isCharsetSpecifyingNode.
  [5] - PageSerializer.cpp - emitting meta element w/ charset info
  [7] - WebPageSerializerImpl.cpp - skipping the original meta element
  [8] - WebPageSerializerImpl.cpp - emitting encoding in xml declaration
  [9] - WebPageSerializerImpl.cpp - emitting meta element w/ charset info

- This CL continues the current behavior of *not* emitting a BOM at the
  beginning of a saved html file.  I am not sure if this is an issue.
  W3C documents I've seen so far do not have a strong recommendation for
  or against inclusion of a BOM.

- This CL applies only to save as "complete html" and has no impact to save as
  "html only" or save as "mhtml".


Test steps:

- Built chrome and manually saved-as-complete-html and inspected result for
  http://anforowicz.github.io/html-encoding/utf32.html

- Run the new save_page_browsertest.cc and manually verified correctness
  of the output (preserved by calling save_dir_Take()).

- Manually saved a page with euc-kr encoding from the modified
  WebPageSerializerTest and verified that the 2 characters are still
  visually distinct after save-page-as-complete-html.


[1]
https://chromium.googlesource.com/chromium/src/+/d083d29530a507b7b5a6df1c5727...).

[2] http://www.w3.org/TR/encoding/

    "The utf-8 encoding is the most appropriate encoding for interchange of
    Unicode, the universal coded character set. Therefore for new protocols
    and formats, as well as existing formats deployed in new contexts, this
    specification requires (and defines) the utf-8 encoding."

[3] http://www.w3.org/TR/html5/document-metadata#charset:

    "Authors should use UTF-8. Conformance checkers may advise authors
    against using legacy encodings. [RFC3629]"

    "Authoring tools should default to using UTF-8 for newly-created
    documents. [RFC3629]"

[4]
https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c...

[5]
https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c...

[6]
https://codereview.chromium.org/1373573002/diff/120001/content/browser/downlo...

[7]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[8]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[9]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[10]
https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242...


BUG=541699
TEST=See "Test steps" above.
==========

Łukasz Anforowicz

Description was changed from ========== Make sure that Save-Page-As-Complete-HTML always saves as UTF8. Before the ...

5 years ago (2015-12-02 23:43:03 UTC) #27

Description was changed from

==========
Make sure that Save-Page-As-Complete-HTML always saves as UTF8.

Before the CL, UTF32-encoded html files would render properly, but they
would be incorrectly saved as 1-byte-long file, because
WebPageSerializerImpl would attempt to preserve [1] the original file
encoding, but an IPC bug would corrupt/truncate UTF32 contents when
sending them to the browser (see crbug.com/541699#c2).

Note that even after fixing the IPC bug [10], saved page would not
render properly, because the browser would not be able to correctly
detect the encoding of the file saved in the original, utf-32 encoding.

To fix the problems above, this CL makes sure that "complete-html" is
always saved as UTF8.

Other notes:

- W3C recommends saving as UTF8 - [2], [3]

- If saving in original encoding [1] was important, we would have noticed
  that saving of multi-byte encodings (UTF16, UTF32) was broken for 4+
  years.

- The only way to declare multi-byte encoding is via http headers and
  therefore encoding information gets lost when saving multi-byte
  encoded html docs into local files (i.e. there is no good place for
  saving / preserving the encoding information in the local files; in
  particular sniffing the <meta> element doesn't work for multi-byte
  encodings).

- Existing PageSerializer and WebPageSerializeImpl already have code and
  set a precendent for modifying encoding references inside document
  content:
  [4] - PageSerializer.cpp - isCharsetSpecifyingNode.
  [5] - PageSerializer.cpp - emitting meta element w/ charset info
  [7] - WebPageSerializerImpl.cpp - skipping the original meta element
  [8] - WebPageSerializerImpl.cpp - emitting encoding in xml declaration
  [9] - WebPageSerializerImpl.cpp - emitting meta element w/ charset info

- This CL continues the current behavior of *not* emitting a BOM at the
  beginning of a saved html file.  I am not sure if this is an issue.
  W3C documents I've seen so far do not have a strong recommendation for
  or against inclusion of a BOM.

- This CL applies only to save as "complete html" and has no impact to save as
  "html only" or save as "mhtml".


Test steps:

- Built chrome and manually saved-as-complete-html and inspected result for
  http://anforowicz.github.io/html-encoding/utf32.html

- Run the new save_page_browsertest.cc and manually verified correctness
  of the output (preserved by calling save_dir_Take()).

- Manually saved a page with euc-kr encoding from the modified
  WebPageSerializerTest and verified that the 2 characters are still
  visually distinct after save-page-as-complete-html.


[1]
https://chromium.googlesource.com/chromium/src/+/d083d29530a507b7b5a6df1c5727...).

[2] http://www.w3.org/TR/encoding/

    "The utf-8 encoding is the most appropriate encoding for interchange of
    Unicode, the universal coded character set. Therefore for new protocols
    and formats, as well as existing formats deployed in new contexts, this
    specification requires (and defines) the utf-8 encoding."

[3] http://www.w3.org/TR/html5/document-metadata#charset:

    "Authors should use UTF-8. Conformance checkers may advise authors
    against using legacy encodings. [RFC3629]"

    "Authoring tools should default to using UTF-8 for newly-created
    documents. [RFC3629]"

[4]
https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c...

[5]
https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c...

[6]
https://codereview.chromium.org/1373573002/diff/120001/content/browser/downlo...

[7]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[8]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[9]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[10]
https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242...


BUG=541699
TEST=See "Test steps" above.
==========

to

==========
Make sure that Save-Page-As-Complete-HTML always saves as UTF8.

Before the CL, UTF32-encoded html files would render properly, but they
would be incorrectly saved as 1-byte-long file, because
WebPageSerializerImpl would attempt to preserve [1] the original file
encoding, but an IPC bug would corrupt/truncate UTF32 contents when
sending them to the browser (see crbug.com/541699#c2).

Note that even after fixing the IPC bug [10], saved page would not
render properly (most likely because the browser would not be able to
correctly detect the encoding of the file saved in the original, utf-32
encoding).

To fix the problems above, this CL makes sure that "complete-html" is
always saved as UTF8.

Other notes:

- W3C recommends saving as UTF8 - [2], [3]

- If saving in original encoding [1] was important, we would have noticed
  that saving of UTF16 and/or UTF32 encodings (*) was broken for 4+
  years.
  (*) The issue was affecting any encoding that could include '\0' byte.

- Declaring the encoding via <meta> element only works if the encoding
  is an ASCII-compatible character encoding (otherwise, the browser
  would not be able to sniff the <meta> element).  This means that for
  not-ASCII-compatible encodings (i.e. UTF32) the encoding can be
  declared via http headers, but there is no good place for saving /
  preserving the encoding information in the local files.  This problem
  doesn't happen for UTF8.

- Existing PageSerializer and WebPageSerializeImpl already have code and
  set a precendent for modifying encoding references inside document
  content:
  [4] - PageSerializer.cpp - isCharsetSpecifyingNode.
  [5] - PageSerializer.cpp - emitting meta element w/ charset info
  [7] - WebPageSerializerImpl.cpp - skipping the original meta element
  [8] - WebPageSerializerImpl.cpp - emitting encoding in xml declaration
  [9] - WebPageSerializerImpl.cpp - emitting meta element w/ charset info

- This CL continues the current behavior of *not* emitting a BOM at the
  beginning of a saved html file.  I am not sure if this is an issue.
  W3C documents I've seen so far do not have a strong recommendation for
  or against inclusion of a BOM.

- This CL applies only to save as "complete html" and has no impact to save as
  "html only" or save as "mhtml".


Test steps:

- Built chrome and manually saved-as-complete-html and inspected result for
  http://anforowicz.github.io/html-encoding/utf32.html

- Run the new save_page_browsertest.cc and manually verified correctness
  of the output (preserved by calling save_dir_Take()).

- Manually saved a page with euc-kr encoding from the modified
  WebPageSerializerTest and verified that the 2 characters are still
  visually distinct after save-page-as-complete-html.


[1]
https://chromium.googlesource.com/chromium/src/+/d083d29530a507b7b5a6df1c5727...).

[2] http://www.w3.org/TR/encoding/

    "The utf-8 encoding is the most appropriate encoding for interchange of
    Unicode, the universal coded character set. Therefore for new protocols
    and formats, as well as existing formats deployed in new contexts, this
    specification requires (and defines) the utf-8 encoding."

[3] http://www.w3.org/TR/html5/document-metadata#charset:

    "Authors should use UTF-8. Conformance checkers may advise authors
    against using legacy encodings. [RFC3629]"

    "Authoring tools should default to using UTF-8 for newly-created
    documents. [RFC3629]"

[4]
https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c...

[5]
https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c...

[6]
https://codereview.chromium.org/1373573002/diff/120001/content/browser/downlo...

[7]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[8]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[9]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[10]
https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242...


BUG=541699
TEST=See "Test steps" above.
==========

Łukasz Anforowicz

Description was changed from ========== Make sure that Save-Page-As-Complete-HTML always saves as UTF8. Before the ...

5 years ago (2015-12-02 23:43:59 UTC) #28

Description was changed from

==========
Make sure that Save-Page-As-Complete-HTML always saves as UTF8.

Before the CL, UTF32-encoded html files would render properly, but they
would be incorrectly saved as 1-byte-long file, because
WebPageSerializerImpl would attempt to preserve [1] the original file
encoding, but an IPC bug would corrupt/truncate UTF32 contents when
sending them to the browser (see crbug.com/541699#c2).

Note that even after fixing the IPC bug [10], saved page would not
render properly (most likely because the browser would not be able to
correctly detect the encoding of the file saved in the original, utf-32
encoding).

To fix the problems above, this CL makes sure that "complete-html" is
always saved as UTF8.

Other notes:

- W3C recommends saving as UTF8 - [2], [3]

- If saving in original encoding [1] was important, we would have noticed
  that saving of UTF16 and/or UTF32 encodings (*) was broken for 4+
  years.
  (*) The issue was affecting any encoding that could include '\0' byte.

- Declaring the encoding via <meta> element only works if the encoding
  is an ASCII-compatible character encoding (otherwise, the browser
  would not be able to sniff the <meta> element).  This means that for
  not-ASCII-compatible encodings (i.e. UTF32) the encoding can be
  declared via http headers, but there is no good place for saving /
  preserving the encoding information in the local files.  This problem
  doesn't happen for UTF8.

- Existing PageSerializer and WebPageSerializeImpl already have code and
  set a precendent for modifying encoding references inside document
  content:
  [4] - PageSerializer.cpp - isCharsetSpecifyingNode.
  [5] - PageSerializer.cpp - emitting meta element w/ charset info
  [7] - WebPageSerializerImpl.cpp - skipping the original meta element
  [8] - WebPageSerializerImpl.cpp - emitting encoding in xml declaration
  [9] - WebPageSerializerImpl.cpp - emitting meta element w/ charset info

- This CL continues the current behavior of *not* emitting a BOM at the
  beginning of a saved html file.  I am not sure if this is an issue.
  W3C documents I've seen so far do not have a strong recommendation for
  or against inclusion of a BOM.

- This CL applies only to save as "complete html" and has no impact to save as
  "html only" or save as "mhtml".


Test steps:

- Built chrome and manually saved-as-complete-html and inspected result for
  http://anforowicz.github.io/html-encoding/utf32.html

- Run the new save_page_browsertest.cc and manually verified correctness
  of the output (preserved by calling save_dir_Take()).

- Manually saved a page with euc-kr encoding from the modified
  WebPageSerializerTest and verified that the 2 characters are still
  visually distinct after save-page-as-complete-html.


[1]
https://chromium.googlesource.com/chromium/src/+/d083d29530a507b7b5a6df1c5727...).

[2] http://www.w3.org/TR/encoding/

    "The utf-8 encoding is the most appropriate encoding for interchange of
    Unicode, the universal coded character set. Therefore for new protocols
    and formats, as well as existing formats deployed in new contexts, this
    specification requires (and defines) the utf-8 encoding."

[3] http://www.w3.org/TR/html5/document-metadata#charset:

    "Authors should use UTF-8. Conformance checkers may advise authors
    against using legacy encodings. [RFC3629]"

    "Authoring tools should default to using UTF-8 for newly-created
    documents. [RFC3629]"

[4]
https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c...

[5]
https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c...

[6]
https://codereview.chromium.org/1373573002/diff/120001/content/browser/downlo...

[7]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[8]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[9]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[10]
https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242...


BUG=541699
TEST=See "Test steps" above.
==========

to

==========
Make sure that Save-Page-As-Complete-HTML always saves as UTF8.

Before the CL, UTF32-encoded html files would render properly, but they
would be incorrectly saved as 1-byte-long file, because
WebPageSerializerImpl would attempt to preserve [1] the original file
encoding, but an IPC bug would corrupt/truncate UTF32 contents when
sending them to the browser (see crbug.com/541699#c2).

Note that even after fixing the IPC bug [10], saved page would not
render properly (most likely because the browser would not be able to
correctly detect the encoding of the file saved in the original, utf-32
encoding).

To fix the problems above, this CL makes sure that "complete-html" is
always saved as UTF8.

Other notes:

- W3C recommends saving as UTF8 - [2], [3]

- If saving in original encoding [1] was important, we would have noticed
  that saving of UTF16 and/or UTF32 encodings (*) was broken for 4+
  years.
  (*) The issue was affecting any encoding that could include '\0' byte.

- Declaring the encoding via <meta> element only works if the encoding
  is an ASCII-compatible character encoding [3] (otherwise, the browser
  would not be able to sniff the <meta> element).  This means that for
  not-ASCII-compatible encodings (i.e. UTF32) the encoding can be
  declared via http headers, but there is no good place for saving /
  preserving the encoding information in the local files.  This problem
  doesn't happen for UTF8.

- Existing PageSerializer and WebPageSerializeImpl already have code and
  set a precendent for modifying encoding references inside document
  content:
  [4] - PageSerializer.cpp - isCharsetSpecifyingNode.
  [5] - PageSerializer.cpp - emitting meta element w/ charset info
  [7] - WebPageSerializerImpl.cpp - skipping the original meta element
  [8] - WebPageSerializerImpl.cpp - emitting encoding in xml declaration
  [9] - WebPageSerializerImpl.cpp - emitting meta element w/ charset info

- This CL continues the current behavior of *not* emitting a BOM at the
  beginning of a saved html file.  I am not sure if this is an issue.
  W3C documents I've seen so far do not have a strong recommendation for
  or against inclusion of a BOM.

- This CL applies only to save as "complete html" and has no impact to save as
  "html only" or save as "mhtml".


Test steps:

- Built chrome and manually saved-as-complete-html and inspected result for
  http://anforowicz.github.io/html-encoding/utf32.html

- Run the new save_page_browsertest.cc and manually verified correctness
  of the output (preserved by calling save_dir_Take()).

- Manually saved a page with euc-kr encoding from the modified
  WebPageSerializerTest and verified that the 2 characters are still
  visually distinct after save-page-as-complete-html.


[1]
https://chromium.googlesource.com/chromium/src/+/d083d29530a507b7b5a6df1c5727...).

[2] http://www.w3.org/TR/encoding/

    "The utf-8 encoding is the most appropriate encoding for interchange of
    Unicode, the universal coded character set. Therefore for new protocols
    and formats, as well as existing formats deployed in new contexts, this
    specification requires (and defines) the utf-8 encoding."

[3] http://www.w3.org/TR/html5/document-metadata#charset:

    "Authors should use UTF-8. Conformance checkers may advise authors
    against using legacy encodings. [RFC3629]"

    "Authoring tools should default to using UTF-8 for newly-created
    documents. [RFC3629]"

[4]
https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c...

[5]
https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c...

[6]
https://codereview.chromium.org/1373573002/diff/120001/content/browser/downlo...

[7]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[8]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[9]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[10]
https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242...


BUG=541699
TEST=See "Test steps" above.
==========

Łukasz Anforowicz

Description was changed from ========== Make sure that Save-Page-As-Complete-HTML always saves as UTF8. Before the ...

5 years ago (2015-12-02 23:45:08 UTC) #29

Description was changed from

==========
Make sure that Save-Page-As-Complete-HTML always saves as UTF8.

Before the CL, UTF32-encoded html files would render properly, but they
would be incorrectly saved as 1-byte-long file, because
WebPageSerializerImpl would attempt to preserve [1] the original file
encoding, but an IPC bug would corrupt/truncate UTF32 contents when
sending them to the browser (see crbug.com/541699#c2).

Note that even after fixing the IPC bug [10], saved page would not
render properly (most likely because the browser would not be able to
correctly detect the encoding of the file saved in the original, utf-32
encoding).

To fix the problems above, this CL makes sure that "complete-html" is
always saved as UTF8.

Other notes:

- W3C recommends saving as UTF8 - [2], [3]

- If saving in original encoding [1] was important, we would have noticed
  that saving of UTF16 and/or UTF32 encodings (*) was broken for 4+
  years.
  (*) The issue was affecting any encoding that could include '\0' byte.

- Declaring the encoding via <meta> element only works if the encoding
  is an ASCII-compatible character encoding [3] (otherwise, the browser
  would not be able to sniff the <meta> element).  This means that for
  not-ASCII-compatible encodings (i.e. UTF32) the encoding can be
  declared via http headers, but there is no good place for saving /
  preserving the encoding information in the local files.  This problem
  doesn't happen for UTF8.

- Existing PageSerializer and WebPageSerializeImpl already have code and
  set a precendent for modifying encoding references inside document
  content:
  [4] - PageSerializer.cpp - isCharsetSpecifyingNode.
  [5] - PageSerializer.cpp - emitting meta element w/ charset info
  [7] - WebPageSerializerImpl.cpp - skipping the original meta element
  [8] - WebPageSerializerImpl.cpp - emitting encoding in xml declaration
  [9] - WebPageSerializerImpl.cpp - emitting meta element w/ charset info

- This CL continues the current behavior of *not* emitting a BOM at the
  beginning of a saved html file.  I am not sure if this is an issue.
  W3C documents I've seen so far do not have a strong recommendation for
  or against inclusion of a BOM.

- This CL applies only to save as "complete html" and has no impact to save as
  "html only" or save as "mhtml".


Test steps:

- Built chrome and manually saved-as-complete-html and inspected result for
  http://anforowicz.github.io/html-encoding/utf32.html

- Run the new save_page_browsertest.cc and manually verified correctness
  of the output (preserved by calling save_dir_Take()).

- Manually saved a page with euc-kr encoding from the modified
  WebPageSerializerTest and verified that the 2 characters are still
  visually distinct after save-page-as-complete-html.


[1]
https://chromium.googlesource.com/chromium/src/+/d083d29530a507b7b5a6df1c5727...).

[2] http://www.w3.org/TR/encoding/

    "The utf-8 encoding is the most appropriate encoding for interchange of
    Unicode, the universal coded character set. Therefore for new protocols
    and formats, as well as existing formats deployed in new contexts, this
    specification requires (and defines) the utf-8 encoding."

[3] http://www.w3.org/TR/html5/document-metadata#charset:

    "Authors should use UTF-8. Conformance checkers may advise authors
    against using legacy encodings. [RFC3629]"

    "Authoring tools should default to using UTF-8 for newly-created
    documents. [RFC3629]"

[4]
https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c...

[5]
https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c...

[6]
https://codereview.chromium.org/1373573002/diff/120001/content/browser/downlo...

[7]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[8]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[9]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[10]
https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242...


BUG=541699
TEST=See "Test steps" above.
==========

to

==========
Make sure that Save-Page-As-Complete-HTML always saves as UTF8.

Before the CL, UTF32-encoded html files would render properly, but they
would be incorrectly saved as 1-byte-long file, because
WebPageSerializerImpl would attempt to preserve [1] the original file
encoding, but an IPC bug would corrupt/truncate UTF32 contents when
sending them to the browser (see crbug.com/541699#c2).

Note that even after fixing the IPC bug [10], saved page would not
render properly (most likely because the browser would not be able to
correctly detect the encoding of the file saved in the original, utf-32
encoding).

To fix the problems above, this CL makes sure that "complete-html" is
always saved as UTF8.

Other notes:

- W3C recommends saving as UTF8 - [2], [3]

- If saving in original encoding [1] was important, we would have noticed
  that saving of UTF16 and/or UTF32 encodings (*) was broken for 4+
  years.
  (*) The IPC issue was affecting any encoding that could include '\0' byte.

- Declaring the encoding via <meta> element only works if the encoding
  is an ASCII-compatible character encoding [3] (otherwise, the browser
  would not be able to sniff the <meta> element).  This means that for
  not-ASCII-compatible encodings (i.e. UTF32) the encoding can be
  declared via http headers, but there is no good place for saving /
  preserving the encoding information in the local files.  This problem
  doesn't happen for UTF8.

- Existing PageSerializer and WebPageSerializeImpl already have code and
  set a precendent for modifying encoding references inside document
  content:
  [4] - PageSerializer.cpp - isCharsetSpecifyingNode.
  [5] - PageSerializer.cpp - emitting meta element w/ charset info
  [7] - WebPageSerializerImpl.cpp - skipping the original meta element
  [8] - WebPageSerializerImpl.cpp - emitting encoding in xml declaration
  [9] - WebPageSerializerImpl.cpp - emitting meta element w/ charset info

- This CL continues the current behavior of *not* emitting a BOM at the
  beginning of a saved html file.  I am not sure if this is an issue.
  W3C documents I've seen so far do not have a strong recommendation for
  or against inclusion of a BOM.

- This CL applies only to save as "complete html" and has no impact to save as
  "html only" or save as "mhtml".


Test steps:

- Built chrome and manually saved-as-complete-html and inspected result for
  http://anforowicz.github.io/html-encoding/utf32.html

- Run the new save_page_browsertest.cc and manually verified correctness
  of the output (preserved by calling save_dir_Take()).

- Manually saved a page with euc-kr encoding from the modified
  WebPageSerializerTest and verified that the 2 characters are still
  visually distinct after save-page-as-complete-html.


[1]
https://chromium.googlesource.com/chromium/src/+/d083d29530a507b7b5a6df1c5727...).

[2] http://www.w3.org/TR/encoding/

    "The utf-8 encoding is the most appropriate encoding for interchange of
    Unicode, the universal coded character set. Therefore for new protocols
    and formats, as well as existing formats deployed in new contexts, this
    specification requires (and defines) the utf-8 encoding."

[3] http://www.w3.org/TR/html5/document-metadata#charset:

    "Authors should use UTF-8. Conformance checkers may advise authors
    against using legacy encodings. [RFC3629]"

    "Authoring tools should default to using UTF-8 for newly-created
    documents. [RFC3629]"

[4]
https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c...

[5]
https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c...

[6]
https://codereview.chromium.org/1373573002/diff/120001/content/browser/downlo...

[7]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[8]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[9]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[10]
https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242...


BUG=541699
TEST=See "Test steps" above.
==========

Łukasz Anforowicz

Description was changed from ========== Make sure that Save-Page-As-Complete-HTML always saves as UTF8. Before the ...

5 years ago (2015-12-02 23:45:58 UTC) #30

Description was changed from

==========
Make sure that Save-Page-As-Complete-HTML always saves as UTF8.

Before the CL, UTF32-encoded html files would render properly, but they
would be incorrectly saved as 1-byte-long file, because
WebPageSerializerImpl would attempt to preserve [1] the original file
encoding, but an IPC bug would corrupt/truncate UTF32 contents when
sending them to the browser (see crbug.com/541699#c2).

Note that even after fixing the IPC bug [10], saved page would not
render properly (most likely because the browser would not be able to
correctly detect the encoding of the file saved in the original, utf-32
encoding).

To fix the problems above, this CL makes sure that "complete-html" is
always saved as UTF8.

Other notes:

- W3C recommends saving as UTF8 - [2], [3]

- If saving in original encoding [1] was important, we would have noticed
  that saving of UTF16 and/or UTF32 encodings (*) was broken for 4+
  years.
  (*) The IPC issue was affecting any encoding that could include '\0' byte.

- Declaring the encoding via <meta> element only works if the encoding
  is an ASCII-compatible character encoding [3] (otherwise, the browser
  would not be able to sniff the <meta> element).  This means that for
  not-ASCII-compatible encodings (i.e. UTF32) the encoding can be
  declared via http headers, but there is no good place for saving /
  preserving the encoding information in the local files.  This problem
  doesn't happen for UTF8.

- Existing PageSerializer and WebPageSerializeImpl already have code and
  set a precendent for modifying encoding references inside document
  content:
  [4] - PageSerializer.cpp - isCharsetSpecifyingNode.
  [5] - PageSerializer.cpp - emitting meta element w/ charset info
  [7] - WebPageSerializerImpl.cpp - skipping the original meta element
  [8] - WebPageSerializerImpl.cpp - emitting encoding in xml declaration
  [9] - WebPageSerializerImpl.cpp - emitting meta element w/ charset info

- This CL continues the current behavior of *not* emitting a BOM at the
  beginning of a saved html file.  I am not sure if this is an issue.
  W3C documents I've seen so far do not have a strong recommendation for
  or against inclusion of a BOM.

- This CL applies only to save as "complete html" and has no impact to save as
  "html only" or save as "mhtml".


Test steps:

- Built chrome and manually saved-as-complete-html and inspected result for
  http://anforowicz.github.io/html-encoding/utf32.html

- Run the new save_page_browsertest.cc and manually verified correctness
  of the output (preserved by calling save_dir_Take()).

- Manually saved a page with euc-kr encoding from the modified
  WebPageSerializerTest and verified that the 2 characters are still
  visually distinct after save-page-as-complete-html.


[1]
https://chromium.googlesource.com/chromium/src/+/d083d29530a507b7b5a6df1c5727...).

[2] http://www.w3.org/TR/encoding/

    "The utf-8 encoding is the most appropriate encoding for interchange of
    Unicode, the universal coded character set. Therefore for new protocols
    and formats, as well as existing formats deployed in new contexts, this
    specification requires (and defines) the utf-8 encoding."

[3] http://www.w3.org/TR/html5/document-metadata#charset:

    "Authors should use UTF-8. Conformance checkers may advise authors
    against using legacy encodings. [RFC3629]"

    "Authoring tools should default to using UTF-8 for newly-created
    documents. [RFC3629]"

[4]
https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c...

[5]
https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c...

[6]
https://codereview.chromium.org/1373573002/diff/120001/content/browser/downlo...

[7]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[8]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[9]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[10]
https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242...


BUG=541699
TEST=See "Test steps" above.
==========

to

==========
Make sure that Save-Page-As-Complete-HTML always saves as UTF8.

Before the CL, UTF32-encoded html files would render properly, but they
would be incorrectly saved as 1-byte-long file, because
WebPageSerializerImpl would attempt to preserve [1] the original file
encoding, but an IPC bug would corrupt/truncate UTF32 contents when
sending them to the browser (see crbug.com/541699#c2).

Note that even after fixing the IPC bug [10], saved page would not
render properly (most likely because the browser would not be able to
correctly detect the encoding of the file saved in the original, utf-32
encoding).

To fix the problems above, this CL makes sure that "complete-html" is
always saved as UTF8.

Other notes:

- W3C recommends saving as UTF8 - [2], [3]

- If saving in original encoding [1] was important, we would have noticed
  that saving of UTF16 and/or UTF32 encodings (*) was broken for 4+
  years.
  (*) The IPC issue was affecting any encoding that could include '\0' byte.

- Declaring the encoding via <meta> element only works if the encoding
  is an ASCII-compatible character encoding [3] (otherwise, the browser
  would not be able to sniff the <meta> element).  This means that for
  not-ASCII-compatible encodings (i.e. UTF32) the encoding can be
  declared via http headers, but there is no good place for saving /
  preserving the encoding information in the local files.  This problem
  doesn't happen for UTF8.

- Existing PageSerializer and WebPageSerializeImpl already have code and
  set a precendent for modifying encoding references inside document
  content:
  [4] - PageSerializer.cpp - isCharsetSpecifyingNode.
  [5] - PageSerializer.cpp - emitting meta element w/ charset info
  [7] - WebPageSerializerImpl.cpp - skipping the original meta element
  [8] - WebPageSerializerImpl.cpp - emitting encoding in xml declaration
  [9] - WebPageSerializerImpl.cpp - emitting meta element w/ charset info

- This CL continues the current behavior of *not* emitting a BOM at the
  beginning of a saved html file.  I am not sure if this is an issue.
  W3C documents I've seen so far do not have a strong recommendation for
  or against inclusion of a BOM.

- This CL applies only to save as "complete html" and has no impact to save as
  "html only" or save as "mhtml".


Test steps:

- Built chrome and manually saved-as-complete-html and inspected result for
  http://anforowicz.github.io/html-encoding/utf32.html

- Run the new save_page_browsertest.cc and manually verified correctness
  of the output (preserved by calling save_dir_.Take()).

- Manually saved a page with euc-kr encoding from the modified
  WebPageSerializerTest and verified that the 2 characters are still
  visually distinct after save-page-as-complete-html.


[1]
https://chromium.googlesource.com/chromium/src/+/d083d29530a507b7b5a6df1c5727...).

[2] http://www.w3.org/TR/encoding/

    "The utf-8 encoding is the most appropriate encoding for interchange of
    Unicode, the universal coded character set. Therefore for new protocols
    and formats, as well as existing formats deployed in new contexts, this
    specification requires (and defines) the utf-8 encoding."

[3] http://www.w3.org/TR/html5/document-metadata#charset:

    "Authors should use UTF-8. Conformance checkers may advise authors
    against using legacy encodings. [RFC3629]"

    "Authoring tools should default to using UTF-8 for newly-created
    documents. [RFC3629]"

[4]
https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c...

[5]
https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c...

[6]
https://codereview.chromium.org/1373573002/diff/120001/content/browser/downlo...

[7]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[8]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[9]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[10]
https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242...


BUG=541699
TEST=See "Test steps" above.
==========

Łukasz Anforowicz

On 2015/12/02 23:26:10, jsbell wrote: > Quick nit about the CL description: it uses "multi-byte ...

5 years ago (2015-12-02 23:46:15 UTC) #31

jsbell

+jshin@ Thanks for the detailed CL description, BTW. It's answered most of the questions I've ...

5 years ago (2015-12-03 00:09:47 UTC) #32

+jshin@

Thanks for the detailed CL description, BTW. It's answered most of the questions
I've come up with.

The CL description mostly refers to UTF-32, a vary rarely used encoding. Even
with the IPC bug fixed, does the issue still affect UTF-16?

If we serialize a BOM for UTF-16 (and UTF-32?) documents, do they render
correctly? (i.e. maybe we can do that instead?)

On 2015/12/02 22:04:36, Łukasz Anforowicz wrote:
> jsbell@ + kojii@ - could you please take a look?  In particular:
> 
> 1. Please shout if always saving as utf-8 is wrong for some reason.

Since pages can introspect the encoding (via `document.charset` etc) and the
page's encoding is used for form submissions and URL parsing, this would change
behavior. But... I'm not sure how much we neuter saved page content such that
those would matter.

One thought: should we scope this change to just UTF-16/UTF-32 pages, and leave
the ASCII-superset(-ish) encodings (that can be identified by a <meta> tag)
alone? 

> 2. Please review my changes that tweak the test you've added in
> crrev.com/1413123005. 

It looks okay - the important thing is that it's not NFC normalized (and it
isn't). If we do always encode to utf-8 then the change is fine; we previously
wanted to ensure a non-utf-8 encoding was used since the code path differs
there.

>     2.a. I cannot cover these 2 characters via save_page_browsertest.cc,
because
> they are indistinguishable via find-in-page (although they do look visually
> distinct if one looks closely).

Good to know! We must normalize when searching, which makes perfect sense!

>     2.b. My changes to WebPageSerializerTest.cpp are silly / mechanical (i.e.
> they make the test pass in a tautological way by copying "actual" into
> "expected") but I think the CL overall does the right thing and I've manually
> verified that these 2 Korean characters are still visually distinct when
saving
> a euc-kr page after my changes.

As noted, seems okay given the premise (always encode to utf-8).

Re: BOM - agreed, don't write out a BOM for UTF-8.

Other random questions:

* stylesheets - I wonder if we detect BOMs there to sniff utf-16/32?
* script files - Do we need to worry about <script charset="utf-32"> ?

kojii

I agree with jsbell@, in general, it's fine to always save in utf-8 as long ...

5 years ago (2015-12-03 03:41:45 UTC) #33

kojii

One more random thought, since IIUC what we want is not to save in UTF-8, ...

5 years ago (2015-12-04 05:08:59 UTC) #34

jsbell

On 2015/12/04 05:08:59, kojii wrote: > One more random thought, since IIUC what we want ...

5 years ago (2015-12-04 17:55:56 UTC) #35

Łukasz Anforowicz

jsbell@ / kojii@, thank you for your feedback. Can you please take a look at ...

5 years ago (2015-12-04 18:20:38 UTC) #36

jsbell@ / kojii@, thank you for your feedback.  Can you please take a look at my
replies below?

On 2015/12/03 00:09:47, jsbell wrote:
> +jshin@
> 
> Thanks for the detailed CL description, BTW. It's answered most of the
questions
> I've come up with.
> 
> The CL description mostly refers to UTF-32, a vary rarely used encoding. Even
> with the IPC bug fixed, does the issue still affect UTF-16?

The issue still affects UTF16:

1. UTF16 without BOM (both LE and BE) does not render properly even
   before save-page-as, unless there are http headers declaring the
   encoding.

2. All of the files originally encoded as UTF-16LE, UTF-16LE-WITH-BOM,
   UTF-16BE, UTF-16BE-WITH-BOM, UTF-32 were not correctly round-tripped
   by save-page-as.  All 5 files were rendered as if the browser could
   not detect the encoding or content type (i.e. html markup itself got
   rendered).

3. Regardless of whether the original file had or didn't have BOM,
   the saved file always did NOT include a BOM.

4. The UTF16 files without BOM included extra strange character (right
   before </body>) when saved.  In binary it looked like this (this is BE):

   original: dot  eol space space lt
   original: 002e 000a 0020 0020 003c
   saved:    002e 000a 0020 0020 000a 0a0a 003c

> If we serialize a BOM for UTF-16 (and UTF-32?) documents, do they render
> correctly? (i.e. maybe we can do that instead?)

Yes.  Sort of:

- Tested by manually prepending a BOM to all of UTF16 LE / BE
  and ORIGINALLY-HAD-BOM / ORIGINALLY-HAD-NOBOM and to UTF32 (only LE in my
tests).

- Things worked fine, except as noted above there was an extra strange
  character as noted in 4. above

QUESTION1: Could you please help me see how to emit BOM properly?  I don't see
BOM-related methods in
third_party/WebKit/Source/modules/encoding/TextEncoder.h or
third_party/WebKit/Source/wtf/text/TextEncoding.h.
I guess I can manually emit the U+FEFF character from WebPageSerializerImpl.cpp,
but then this should only be done for a subset of encodings (i.e. definitely not
for something like iso-8859-2)
and I am not sure if I know how to properly detect the desirability of BOM at
runtime.

QUESTION2: Do you have any advise on landing test files in utf16 encoding?  I've
tried using .gitattributes (i.e. see
https://codereview.chromium.org/1407663004/patch/120001/130002 in an older
patchset), but this made the tooling unhappy (the patch could not be applied
when trying to run trybots against this patchset).

QUESTION3: Do you think always saving as UTF8 (maybe only for a subset of
encodings [but what subset?]) is still an option?  Or should I abandon this
approach given lack of rewriting of encoding declaration in some places (most
notably inline css stylesheets)?  The always-include-BOM approach looks like a
promising alternative (thank you for pushing me gently in this direction).

> 
> On 2015/12/02 22:04:36, Łukasz Anforowicz wrote:
> > jsbell@ + kojii@ - could you please take a look?  In particular:
> > 
> > 1. Please shout if always saving as utf-8 is wrong for some reason.
> 
> Since pages can introspect the encoding (via `document.charset` etc) and the
> page's encoding is used for form submissions and URL parsing, this would
change
> behavior. But... I'm not sure how much we neuter saved page content such that
> those would matter.
> 
> One thought: should we scope this change to just UTF-16/UTF-32 pages, and
leave
> the ASCII-superset(-ish) encodings (that can be identified by a <meta> tag)
> alone? 

Possibly.  I am not sure if we need to worry about passing through (now
incorrect) encoding declaration from inline css.
On one hand things do not work today, otoh there is no point in making them limp
along with a new (although smaller IMO) kind of bug.
> 
> > 2. Please review my changes that tweak the test you've added in
> > crrev.com/1413123005. 
> 
> It looks okay - the important thing is that it's not NFC normalized (and it
> isn't). If we do always encode to utf-8 then the change is fine; we previously
> wanted to ensure a non-utf-8 encoding was used since the code path differs
> there.
> 
> >     2.a. I cannot cover these 2 characters via save_page_browsertest.cc,
> because
> > they are indistinguishable via find-in-page (although they do look visually
> > distinct if one looks closely).
> 
> Good to know! We must normalize when searching, which makes perfect sense!
>  
> >     2.b. My changes to WebPageSerializerTest.cpp are silly / mechanical
(i.e.
> > they make the test pass in a tautological way by copying "actual" into
> > "expected") but I think the CL overall does the right thing and I've
manually
> > verified that these 2 Korean characters are still visually distinct when
> saving
> > a euc-kr page after my changes.
> 
> As noted, seems okay given the premise (always encode to utf-8).
> 
> Re: BOM - agreed, don't write out a BOM for UTF-8.
> 
> Other random questions:
> 
> * stylesheets - I wonder if we detect BOMs there to sniff utf-16/32?
> * script files - Do we need to worry about <script charset="utf-32"> ?

The important thing here is that non-frames / non-html-docs
are saved as-is, without any reencoding or link rewriting.
Therefore:
- I believe we don't need to worry about <script charset=...">.
- If BOM sniffing works fine for web content, it should work fine for saved
content.

On 2015/12/03 03:41:45, kojii wrote:
> I agree with jsbell@, in general, it's fine to always save in utf-8 as long as
> we don't normalize, and all relevant declarations are updated. But the latter
is
> likely to be not a small work.

Agreed.  Getting the rewriting right is tricky.  I am even surprised that we try
to rewrite the <meta> element today (given that by default we don't change the
encoding of the original decoding;  at least not before this CL).
> 
> * Is <meta charset> supported already? Couldn't read whether it's done or not.
> This is probably a must support even for utf-16/utf-32 only.

Yes.  Although I've added html5-style-meta support very recently -
crrev.com/1426503002

> * CSS declarations (@charset)?

Not AFAIK.  QUESTION: Does this matter for encodings like utf16, which are not
ascii-compatible / where I am guessing we won't be able to sniff the <meta>
element?

> * Do we strip utf-16/utf-32 BOM if they exist? CSS honors BOM over @charset.
See
> the tests below.

We do strip BOM when serializing HTML documents (at least for "complete-html"
kind;  I would have to double-check what we do to html docs encoded within
mhtml, but based on what I've seen I think we also strip the BOM.

> * Forms...yeah, agree with jsbell@, it's less important for it to work for
saved
> pages.
> * XHTML?

Yes - we rewrite (at least in WebPageSerializerImpl.cpp)

> * SVG?

SVG (as a non-html-doc) is saved as-is.  So rewriting/preserving of BOM is N/A.
> 
> There are some tests in W3C I18N how major browsers behave for the precedence:
>
http://www.w3.org/International/tests/repository/html5/the-input-byte-stream/...
>
http://www.w3.org/International/tests/repo/results/css-input-byte-stream#prec...

On 2015/12/04 05:08:59, kojii wrote:
> One more random thought, since IIUC what we want is not to save in UTF-8, but
to
> fix UTF-32.
> 
> What's wrong here looks to me that we handle encoded byte array as WebCString.
> True that in ASCII or Latin the encoded byte array looks like a string, but
the
> encoded stream is really a byte array, not a string.
> 
> What about changing the type to WebData or WebVector<uint_8>? They should work
> well with IPC, no?

Current IPC code works fine now (i.e. file contents are not truncated to the
first '\0' byte anymore, as they used to).
It used to (implicitly) call std::string(web_cstring.data().  Not it works fine
with calling std::string(web_cstring).

blink-reviews

The reason why I am looking at save-page-as behavior for save-page-as is that some time ...

5 years ago (2015-12-04 18:24:53 UTC) #37

The reason why I am looking at save-page-as behavior for save-page-as is
that some time ago I needed to test save-page-as with an exotic encoding
(to see if browser / renderer would disagree on what encoding to use in an
abandoned approach to an OOPIF-related problem).  I happened to pick UTF32
for my test which led me to discover the IPC problem and the round-tripping
problem.

Now - *if you think it is not worth spending the effort on utf16 and/or
utf32 encodings - please say so*.  I am only vaguely familiar with the
encodings problem area and picked utf32 for my tests somewhat randomly /
arbitrarly (i.e. without thinking if it occurs in the wild).

On Fri, Dec 4, 2015 at 9:55 AM, <jsbell@chromium.org> wrote:

> On 2015/12/04 05:08:59, kojii wrote:
>
>> One more random thought, since IIUC what we want is not to save in UTF-8,
>> but
>>
> to
>
>> fix UTF-32.
>>
>
> What's wrong here looks to me that we handle encoded byte array as
>> WebCString.
>> True that in ASCII or Latin the encoded byte array looks like a string,
>> but
>>
> the
>
>> encoded stream is really a byte array, not a string.
>>
>
> What about changing the type to WebData or WebVector<uint_8>? They should
>> work
>> well with IPC, no?
>>
>
> See the CL description - the IPC issue was apparently fixed, but for
> non-ASCII-superset encodings we fail to parse since there's no header
> indicating
> the encoding - and I guess we don't sniff it correctly. I made a
> suggestion in
> #32 that maybe serializing a BOM would cause the parser to select a correct
> encoding, but that was just a guess.
>
> (BTW, the use of UTF-16 is very low. I'm surprised that there is ANY usage
> of
> UTF-32 in the wild. Does any other browser still support that?)
>
>
> https://codereview.chromium.org/1407663004/
>

-- 
Thanks,

Lukasz

-- 
You received this message because you are subscribed to the Google Groups "Blink
Reviews" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to blink-reviews+unsubscribe@chromium.org.

chromium-reviews

The reason why I am looking at save-page-as behavior for save-page-as is that some time ...

5 years ago (2015-12-04 18:24:54 UTC) #38

The reason why I am looking at save-page-as behavior for save-page-as is
that some time ago I needed to test save-page-as with an exotic encoding
(to see if browser / renderer would disagree on what encoding to use in an
abandoned approach to an OOPIF-related problem).  I happened to pick UTF32
for my test which led me to discover the IPC problem and the round-tripping
problem.

Now - *if you think it is not worth spending the effort on utf16 and/or
utf32 encodings - please say so*.  I am only vaguely familiar with the
encodings problem area and picked utf32 for my tests somewhat randomly /
arbitrarly (i.e. without thinking if it occurs in the wild).

On Fri, Dec 4, 2015 at 9:55 AM, <jsbell@chromium.org> wrote:

> On 2015/12/04 05:08:59, kojii wrote:
>
>> One more random thought, since IIUC what we want is not to save in UTF-8,
>> but
>>
> to
>
>> fix UTF-32.
>>
>
> What's wrong here looks to me that we handle encoded byte array as
>> WebCString.
>> True that in ASCII or Latin the encoded byte array looks like a string,
>> but
>>
> the
>
>> encoded stream is really a byte array, not a string.
>>
>
> What about changing the type to WebData or WebVector<uint_8>? They should
>> work
>> well with IPC, no?
>>
>
> See the CL description - the IPC issue was apparently fixed, but for
> non-ASCII-superset encodings we fail to parse since there's no header
> indicating
> the encoding - and I guess we don't sniff it correctly. I made a
> suggestion in
> #32 that maybe serializing a BOM would cause the parser to select a correct
> encoding, but that was just a guess.
>
> (BTW, the use of UTF-16 is very low. I'm surprised that there is ANY usage
> of
> UTF-32 in the wild. Does any other browser still support that?)
>
>
> https://codereview.chromium.org/1407663004/
>

-- 
Thanks,

Lukasz

-- 
You received this message because you are subscribed to the Google Groups
"Chromium-reviews" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to chromium-reviews+unsubscribe@chromium.org.

jsbell

On 2015/12/04 18:20:38, Łukasz Anforowicz wrote: > > If we serialize a BOM for UTF-16 ...

5 years ago (2015-12-04 23:12:01 UTC) #39

On 2015/12/04 18:20:38, Łukasz Anforowicz wrote:
> > If we serialize a BOM for UTF-16 (and UTF-32?) documents, do they render
> > correctly? (i.e. maybe we can do that instead?)
> 
> Yes.  Sort of:

Yay!

> - Tested by manually prepending a BOM to all of UTF16 LE / BE
>   and ORIGINALLY-HAD-BOM / ORIGINALLY-HAD-NOBOM and to UTF32 (only LE in my
> tests).
> 
> - Things worked fine, except as noted above there was an extra strange
>   character as noted in 4. above

> QUESTION1: Could you please help me see how to emit BOM properly? 

Yeah, we don't need to write out BOMs when posting (since we send a header) and
for the Encoding API we decided that callers could prepend it manually if
desired. Plumbing a "prepend BOM" flag through TextEncoding::encode to
TextCodec::encode seems reasonable to me. On most codecs it would do nothing,
and we'd add support in TextCodecUTF16.

I don't know where the extra character is coming from. 0A0A looks suspiciously
like the low byte of the proceeding character (U+000A LINE FEED) repeated twice
- it could be a bug in the UTF-16 encoding pipeline? Does this only appear if
you manually prepend the BOM (maybe a buffer off-by-one error somewhere?) or is
it always present?

> QUESTION2: Do you have any advise on landing test files in utf16 encoding?

Hmmm, I've had problems with the CQ but not try jobs... for landing, just `git
cl land` the resource files manually after you get an l*g*t*m, before the test
if possible. If the encoded files *are* the test, just `git cl land` them. But
again, it sounds like you're having issues on the try bots?

Anyone else have suggestions?

> QUESTION3: Do you think always saving as UTF8 (maybe only for a subset of
> encodings [but what subset?]) is still an option?  Or should I abandon this
> approach given lack of rewriting of encoding declaration in some places (most
> notably inline css stylesheets)?  The always-include-BOM approach looks like a
> promising alternative (thank you for pushing me gently in this direction).

I'd go with the BOM. It's not crazy - ENCODING (which HTML defers to) states:
"the byte order mark (also known as BOM) is considered more authoritative than
anything else"

On 2015/12/04 18:24:54, chromium-reviews wrote:
> Now - *if you think it is not worth spending the effort on utf16 and/or
> utf32 encodings - please say so*.  I am only vaguely familiar with the
> encodings problem area and picked utf32 for my tests somewhat randomly /
> arbitrarly (i.e. without thinking if it occurs in the wild).

UTF-16 is worth fixing since that's the default encoding for Unicode content on
Windows it's still generated by tools when the developer isn't aware. I speak
from experience that it's easy to have an all windows-1252 ASP.NET site, make a
minor content change, and have pages silently served as UTF-16 without noticing
because it just works.

If UTF-32 requires *any* extra work I'd defer.

kojii

Sorry, I missed the discussion point in my previous reply. On 2015/12/04 at 23:12:01, jsbell ...

5 years ago (2015-12-07 02:42:37 UTC) #40

Łukasz Anforowicz

Description was changed from ========== Make sure that Save-Page-As-Complete-HTML always saves as UTF8. Before the ...

5 years ago (2015-12-07 22:08:47 UTC) #41

Description was changed from

==========
Make sure that Save-Page-As-Complete-HTML always saves as UTF8.

Before the CL, UTF32-encoded html files would render properly, but they
would be incorrectly saved as 1-byte-long file, because
WebPageSerializerImpl would attempt to preserve [1] the original file
encoding, but an IPC bug would corrupt/truncate UTF32 contents when
sending them to the browser (see crbug.com/541699#c2).

Note that even after fixing the IPC bug [10], saved page would not
render properly (most likely because the browser would not be able to
correctly detect the encoding of the file saved in the original, utf-32
encoding).

To fix the problems above, this CL makes sure that "complete-html" is
always saved as UTF8.

Other notes:

- W3C recommends saving as UTF8 - [2], [3]

- If saving in original encoding [1] was important, we would have noticed
  that saving of UTF16 and/or UTF32 encodings (*) was broken for 4+
  years.
  (*) The IPC issue was affecting any encoding that could include '\0' byte.

- Declaring the encoding via <meta> element only works if the encoding
  is an ASCII-compatible character encoding [3] (otherwise, the browser
  would not be able to sniff the <meta> element).  This means that for
  not-ASCII-compatible encodings (i.e. UTF32) the encoding can be
  declared via http headers, but there is no good place for saving /
  preserving the encoding information in the local files.  This problem
  doesn't happen for UTF8.

- Existing PageSerializer and WebPageSerializeImpl already have code and
  set a precendent for modifying encoding references inside document
  content:
  [4] - PageSerializer.cpp - isCharsetSpecifyingNode.
  [5] - PageSerializer.cpp - emitting meta element w/ charset info
  [7] - WebPageSerializerImpl.cpp - skipping the original meta element
  [8] - WebPageSerializerImpl.cpp - emitting encoding in xml declaration
  [9] - WebPageSerializerImpl.cpp - emitting meta element w/ charset info

- This CL continues the current behavior of *not* emitting a BOM at the
  beginning of a saved html file.  I am not sure if this is an issue.
  W3C documents I've seen so far do not have a strong recommendation for
  or against inclusion of a BOM.

- This CL applies only to save as "complete html" and has no impact to save as
  "html only" or save as "mhtml".


Test steps:

- Built chrome and manually saved-as-complete-html and inspected result for
  http://anforowicz.github.io/html-encoding/utf32.html

- Run the new save_page_browsertest.cc and manually verified correctness
  of the output (preserved by calling save_dir_.Take()).

- Manually saved a page with euc-kr encoding from the modified
  WebPageSerializerTest and verified that the 2 characters are still
  visually distinct after save-page-as-complete-html.


[1]
https://chromium.googlesource.com/chromium/src/+/d083d29530a507b7b5a6df1c5727...).

[2] http://www.w3.org/TR/encoding/

    "The utf-8 encoding is the most appropriate encoding for interchange of
    Unicode, the universal coded character set. Therefore for new protocols
    and formats, as well as existing formats deployed in new contexts, this
    specification requires (and defines) the utf-8 encoding."

[3] http://www.w3.org/TR/html5/document-metadata#charset:

    "Authors should use UTF-8. Conformance checkers may advise authors
    against using legacy encodings. [RFC3629]"

    "Authoring tools should default to using UTF-8 for newly-created
    documents. [RFC3629]"

[4]
https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c...

[5]
https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c...

[6]
https://codereview.chromium.org/1373573002/diff/120001/content/browser/downlo...

[7]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[8]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[9]
https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be...

[10]
https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242...


BUG=541699
TEST=See "Test steps" above.
==========

to

==========
Tweaking WebPageSerializerImpl to emit a BOM for UTF16.

Previously save-page-as would truncate saved html files if they
would include a '\0' byte (i.e. if saving a html document that was
encoded as utf16).  The truncation doesn't happen anymore [1], but the
saved html files still don't round-trip properly, because the browser is
not able to detect the encoding of the files (<meta> character cannot be
sniffed in encodings that are not ascii compatible [2] + it is obviously
impossible for local files to declare their encoding via http headers).

This CL makes changes to always emits a BOM when saving in UTF16
encoding.  This in-turn causes round-tripping tests in
save_page_browsertest.cc to pass for utf16-encoded frames.

BOM behavior for UTF16 is easy to control via wtf/text/TextCodecUTF16,
but unfortunately this is not the case for UTF32 (which goes through
wtf/text/TextCodecICU) and consequently this CL doesn't fix
round-trippng of UTF32-encoded frames.  This is okay, because UTF32 is
much more rare than UTF16.

Note that this CL applies only to save as "complete html" and has no
impact to save as "html only" or save as "mhtml".

[1]
https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242...

[2] http://www.w3.org/TR/html5/document-metadata#charset:
    If an HTML document contains a meta element with a charset attribute
    or a meta element with an http-equiv attribute in the encoding
    declaration state, then the character encoding used must be an
    ASCII-compatible character encoding.

BUG=541699
==========

Łukasz Anforowicz

jsbell@, could you please take another look? Things that are ready for review: - API ...

5 years ago (2015-12-07 22:16:04 UTC) #42

Łukasz Anforowicz

On 2015/12/04 23:12:01, jsbell wrote: > I don't know where the extra character is coming ...

5 years ago (2015-12-07 22:19:48 UTC) #43

jsbell

What's the status of those extra 0x0A bytes you were seeing? A result of temp ...

5 years ago (2015-12-07 22:53:51 UTC) #44

Łukasz Anforowicz

Description was changed from ========== Tweaking WebPageSerializerImpl to emit a BOM for UTF16. Previously save-page-as ...

5 years ago (2015-12-07 23:28:48 UTC) #45

Description was changed from

==========
Tweaking WebPageSerializerImpl to emit a BOM for UTF16.

Previously save-page-as would truncate saved html files if they
would include a '\0' byte (i.e. if saving a html document that was
encoded as utf16).  The truncation doesn't happen anymore [1], but the
saved html files still don't round-trip properly, because the browser is
not able to detect the encoding of the files (<meta> character cannot be
sniffed in encodings that are not ascii compatible [2] + it is obviously
impossible for local files to declare their encoding via http headers).

This CL makes changes to always emits a BOM when saving in UTF16
encoding.  This in-turn causes round-tripping tests in
save_page_browsertest.cc to pass for utf16-encoded frames.

BOM behavior for UTF16 is easy to control via wtf/text/TextCodecUTF16,
but unfortunately this is not the case for UTF32 (which goes through
wtf/text/TextCodecICU) and consequently this CL doesn't fix
round-trippng of UTF32-encoded frames.  This is okay, because UTF32 is
much more rare than UTF16.

Note that this CL applies only to save as "complete html" and has no
impact to save as "html only" or save as "mhtml".

[1]
https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242...

[2] http://www.w3.org/TR/html5/document-metadata#charset:
    If an HTML document contains a meta element with a charset attribute
    or a meta element with an http-equiv attribute in the encoding
    declaration state, then the character encoding used must be an
    ASCII-compatible character encoding.

BUG=541699
==========

to

==========
Tweaking WebPageSerializerImpl to emit a BOM for UTF16/32.

Previously save-page-as would truncate saved html files if they
would include a '\0' byte (i.e. if saving a html document that was
encoded as utf16).  The truncation doesn't happen anymore [1], but the
saved html files still don't round-trip properly, because the browser is
not able to detect the encoding of the files (<meta> character cannot be
sniffed in encodings that are not ascii compatible [2] + it is obviously
impossible for local files to declare their encoding via http headers).

This CL makes changes to always emits a BOM when saving in UTF16
or UTF32 encoding.  This in-turn causes round-tripping tests in
save_page_browsertest.cc to pass for tests using these encodings.

Note that this CL applies only to save as "complete html" and has no
impact to save as "html only" or save as "mhtml".

[1]
https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242...

[2] http://www.w3.org/TR/html5/document-metadata#charset:
    If an HTML document contains a meta element with a charset attribute
    or a meta element with an http-equiv attribute in the encoding
    declaration state, then the character encoding used must be an
    ASCII-compatible character encoding.

BUG=541699
==========

Łukasz Anforowicz

jsbell@, can you take another look please? Thank you for the previous round of CR ...

5 years ago (2015-12-07 23:32:36 UTC) #46

jsbell

I like the direction this CL is trending. :) Overall lgtm with nits, but you'll ...

5 years ago (2015-12-08 01:19:08 UTC) #47

jsbell

On 2015/12/07 23:32:36, Łukasz Anforowicz wrote: > It turned out that this extra character was ...

5 years ago (2015-12-08 01:20:17 UTC) #48

dcheng

third_party/WebKit LGTM, but you'll need an owner in wtf (+tkent for that review)

5 years ago (2015-12-08 01:33:35 UTC) #50

dcheng

On 2015/12/08 at 01:33:35, dcheng wrote: > third_party/WebKit LGTM, but you'll need an owner in ...

5 years ago (2015-12-08 01:33:59 UTC) #51

blink-reviews

On Monday, December 7, 2015, <dcheng@chromium.org> wrote: > On 2015/12/08 at 01:33:35, dcheng wrote: > ...

5 years ago (2015-12-08 03:53:31 UTC) #52

chromium-reviews

On Monday, December 7, 2015, <dcheng@chromium.org> wrote: > On 2015/12/08 at 01:33:35, dcheng wrote: > ...

5 years ago (2015-12-08 03:53:32 UTC) #53

Łukasz Anforowicz

jsbell@, could you please take another look? (I've tried to address your feedback in the ...

5 years ago (2015-12-09 00:50:59 UTC) #55

jsbell@, could you please take another look? (I've tried to address your
feedback in the most recent patchset)

dcheng@ + tkent@ - thanks for reviewing.

Please note that at this point the CL is failing the new test on the try bots. 
I suspect problems with patching / preserving the binary content of files under
chrome/test/data/save_page/encoding*.htm.  I think I'll land these files
manually, in a separate CL, with a .gitattributes asking to treat them as
binary.  Please let me know if you think this is a bad idea.

https://codereview.chromium.org/1407663004/diff/230001/chrome/browser/downloa...
File chrome/browser/download/save_page_browsertest.cc (right):

https://codereview.chromium.org/1407663004/diff/230001/chrome/browser/downloa...
chrome/browser/download/save_page_browsertest.cc:980: std::vector<std::string>
forbidden_substrings{
On 2015/12/08 01:19:07, jsbell wrote:
> http://chromium-cpp.appspot.com/ says that the "uniform initialization syntax"
> is banned. :(

Yes, sorry about that.  I should have rebased onto crrev.com/1441553002 (which
transitions to arrays) earlier.

https://codereview.chromium.org/1407663004/diff/230001/chrome/browser/downloa...
chrome/browser/download/save_page_browsertest.cc:1090: // Test for saving frames
with various encodings (crbug.com/541699).
On 2015/12/08 01:19:07, jsbell wrote:
> I don't think including the bug# here is necessary; this comment explains
what's
> going on.

Done.

https://codereview.chromium.org/1407663004/diff/230001/chrome/browser/downloa...
chrome/browser/download/save_page_browsertest.cc:1098: std::vector<std::string>
expected_substrings{
On 2015/12/08 01:19:07, jsbell wrote:
> (see above)

Done.

https://codereview.chromium.org/1407663004/diff/230001/chrome/browser/downloa...
chrome/browser/download/save_page_browsertest.cc:1100: "iso-8859-2.htm: Zażółć
gęślą jaźń",
On 2015/12/08 01:19:07, jsbell wrote:
> Use \x## escapes so that the source is ASCII when possible, as not all
> developers configure editors for UTF-8.

Is \u escape ok?  It fits into 80 columns limit better...

https://codereview.chromium.org/1407663004/diff/230001/third_party/WebKit/Sou...
File third_party/WebKit/Source/web/WebPageSerializerImpl.cpp (right):

https://codereview.chromium.org/1407663004/diff/230001/third_party/WebKit/Sou...
third_party/WebKit/Source/web/WebPageSerializerImpl.cpp:457: const UChar
bomCharacter = 0xFEFF;
On 2015/12/08 01:19:07, jsbell wrote:
> nit: maybe call this `byteOrderMark` as it's not technically a character and
not
> everyone knows BOM

Done.

jsbell

still lgtm - apart from the \u#### usage, alas. https://codereview.chromium.org/1407663004/diff/230001/chrome/browser/download/save_page_browsertest.cc File chrome/browser/download/save_page_browsertest.cc (right): https://codereview.chromium.org/1407663004/diff/230001/chrome/browser/download/save_page_browsertest.cc#newcode1100 chrome/browser/download/save_page_browsertest.cc:1100: ...

5 years ago (2015-12-09 01:02:47 UTC) #56

Łukasz Anforowicz

jsbell@, thanks for the review. BTW: the separate CL for the test files is @ ...

5 years ago (2015-12-09 01:27:36 UTC) #57

Łukasz Anforowicz

Description was changed from ========== Tweaking WebPageSerializerImpl to emit a BOM for UTF16/32. Previously save-page-as ...

5 years ago (2015-12-09 01:29:41 UTC) #58

Description was changed from

==========
Tweaking WebPageSerializerImpl to emit a BOM for UTF16/32.

Previously save-page-as would truncate saved html files if they
would include a '\0' byte (i.e. if saving a html document that was
encoded as utf16).  The truncation doesn't happen anymore [1], but the
saved html files still don't round-trip properly, because the browser is
not able to detect the encoding of the files (<meta> character cannot be
sniffed in encodings that are not ascii compatible [2] + it is obviously
impossible for local files to declare their encoding via http headers).

This CL makes changes to always emits a BOM when saving in UTF16
or UTF32 encoding.  This in-turn causes round-tripping tests in
save_page_browsertest.cc to pass for tests using these encodings.

Note that this CL applies only to save as "complete html" and has no
impact to save as "html only" or save as "mhtml".

[1]
https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242...

[2] http://www.w3.org/TR/html5/document-metadata#charset:
    If an HTML document contains a meta element with a charset attribute
    or a meta element with an http-equiv attribute in the encoding
    declaration state, then the character encoding used must be an
    ASCII-compatible character encoding.

BUG=541699
==========

to

==========
Tweaking WebPageSerializerImpl to emit a BOM for UTF16/32.

Previously save-page-as would truncate saved html files if they
would include a '\0' byte (i.e. if saving a html document that was
encoded as utf16).  The truncation doesn't happen anymore [1], but the
saved html files still don't round-trip properly, because the browser is
not able to detect the encoding of the files (<meta> character cannot be
sniffed in encodings that are not ascii compatible [2] + it is obviously
impossible for local files to declare their encoding via http headers).

This CL makes changes to always emits a BOM when saving in UTF16
or UTF32 encoding.  This in-turn causes round-tripping tests in
save_page_browsertest.cc to pass for tests using these encodings.

Note that this CL applies only to save as "complete html" and has no
impact to save as "html only" or save as "mhtml".

[1]
https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242...

[2] http://www.w3.org/TR/html5/document-metadata#charset:
    If an HTML document contains a meta element with a charset attribute
    or a meta element with an http-equiv attribute in the encoding
    declaration state, then the character encoding used must be an
    ASCII-compatible character encoding.

BUG=541699
==========

jungshik at Google

On 2015/12/09 01:27:36, Łukasz Anforowicz wrote: > jsbell@, thanks for the review. > > BTW: ...

5 years ago (2015-12-10 00:47:09 UTC) #59

Łukasz Anforowicz

The CQ bit was checked by lukasza@chromium.org to run a CQ dry run

5 years ago (2015-12-10 01:11:17 UTC) #60

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1407663004/310001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1407663004/310001

5 years ago (2015-12-10 01:13:34 UTC) #61

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

5 years ago (2015-12-10 06:40:57 UTC) #62

Łukasz Anforowicz

The patchset sent to the CQ was uploaded after l-g-t-m from nick@chromium.org, rdsmith@chromium.org, tkent@chromium.org, dcheng@chromium.org, ...

5 years ago (2015-12-10 17:55:59 UTC) #65

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1407663004/310001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1407663004/310001

5 years ago (2015-12-10 17:56:56 UTC) #66

commit-bot: I haz the power

Description was changed from ========== Tweaking WebPageSerializerImpl to emit a BOM for UTF16/32. Previously save-page-as ...

5 years ago (2015-12-10 18:06:18 UTC) #67

Message was sent while issue was closed.

Description was changed from

==========
Tweaking WebPageSerializerImpl to emit a BOM for UTF16/32.

Previously save-page-as would truncate saved html files if they
would include a '\0' byte (i.e. if saving a html document that was
encoded as utf16).  The truncation doesn't happen anymore [1], but the
saved html files still don't round-trip properly, because the browser is
not able to detect the encoding of the files (<meta> character cannot be
sniffed in encodings that are not ascii compatible [2] + it is obviously
impossible for local files to declare their encoding via http headers).

This CL makes changes to always emits a BOM when saving in UTF16
or UTF32 encoding.  This in-turn causes round-tripping tests in
save_page_browsertest.cc to pass for tests using these encodings.

Note that this CL applies only to save as "complete html" and has no
impact to save as "html only" or save as "mhtml".

[1]
https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242...

[2] http://www.w3.org/TR/html5/document-metadata#charset:
    If an HTML document contains a meta element with a charset attribute
    or a meta element with an http-equiv attribute in the encoding
    declaration state, then the character encoding used must be an
    ASCII-compatible character encoding.

BUG=541699
==========

to

==========
Tweaking WebPageSerializerImpl to emit a BOM for UTF16/32.

Previously save-page-as would truncate saved html files if they
would include a '\0' byte (i.e. if saving a html document that was
encoded as utf16).  The truncation doesn't happen anymore [1], but the
saved html files still don't round-trip properly, because the browser is
not able to detect the encoding of the files (<meta> character cannot be
sniffed in encodings that are not ascii compatible [2] + it is obviously
impossible for local files to declare their encoding via http headers).

This CL makes changes to always emits a BOM when saving in UTF16
or UTF32 encoding.  This in-turn causes round-tripping tests in
save_page_browsertest.cc to pass for tests using these encodings.

Note that this CL applies only to save as "complete html" and has no
impact to save as "html only" or save as "mhtml".

[1]
https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242...

[2] http://www.w3.org/TR/html5/document-metadata#charset:
    If an HTML document contains a meta element with a charset attribute
    or a meta element with an http-equiv attribute in the encoding
    declaration state, then the character encoding used must be an
    ASCII-compatible character encoding.

BUG=541699
==========

commit-bot: I haz the power

Description was changed from ========== Tweaking WebPageSerializerImpl to emit a BOM for UTF16/32. Previously save-page-as ...

5 years ago (2015-12-10 18:07:30 UTC) #69

Message was sent while issue was closed.

Description was changed from

==========
Tweaking WebPageSerializerImpl to emit a BOM for UTF16/32.

Previously save-page-as would truncate saved html files if they
would include a '\0' byte (i.e. if saving a html document that was
encoded as utf16).  The truncation doesn't happen anymore [1], but the
saved html files still don't round-trip properly, because the browser is
not able to detect the encoding of the files (<meta> character cannot be
sniffed in encodings that are not ascii compatible [2] + it is obviously
impossible for local files to declare their encoding via http headers).

This CL makes changes to always emits a BOM when saving in UTF16
or UTF32 encoding.  This in-turn causes round-tripping tests in
save_page_browsertest.cc to pass for tests using these encodings.

Note that this CL applies only to save as "complete html" and has no
impact to save as "html only" or save as "mhtml".

[1]
https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242...

[2] http://www.w3.org/TR/html5/document-metadata#charset:
    If an HTML document contains a meta element with a charset attribute
    or a meta element with an http-equiv attribute in the encoding
    declaration state, then the character encoding used must be an
    ASCII-compatible character encoding.

BUG=541699
==========

to

==========
Tweaking WebPageSerializerImpl to emit a BOM for UTF16/32.

Previously save-page-as would truncate saved html files if they
would include a '\0' byte (i.e. if saving a html document that was
encoded as utf16).  The truncation doesn't happen anymore [1], but the
saved html files still don't round-trip properly, because the browser is
not able to detect the encoding of the files (<meta> character cannot be
sniffed in encodings that are not ascii compatible [2] + it is obviously
impossible for local files to declare their encoding via http headers).

This CL makes changes to always emits a BOM when saving in UTF16
or UTF32 encoding.  This in-turn causes round-tripping tests in
save_page_browsertest.cc to pass for tests using these encodings.

Note that this CL applies only to save as "complete html" and has no
impact to save as "html only" or save as "mhtml".

[1]
https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242...

[2] http://www.w3.org/TR/html5/document-metadata#charset:
    If an HTML document contains a meta element with a charset attribute
    or a meta element with an http-equiv attribute in the encoding
    declaration state, then the character encoding used must be an
    ASCII-compatible character encoding.

BUG=541699

Committed: https://crrev.com/36f0d157b6d2c6fae627a3aba955ad4e0fe07d71
Cr-Commit-Position: refs/heads/master@{#364394}
==========

commit-bot: I haz the power

5 years ago (2015-12-10 18:07:31 UTC) #70

Message was sent while issue was closed.

Patchset 16 (id:??) landed as
https://crrev.com/36f0d157b6d2c6fae627a3aba955ad4e0fe07d71
Cr-Commit-Position: refs/heads/master@{#364394}

Issue 1407663004: Tweaking WebPageSerializerImpl to emit a BOM for UTF16/32. (Closed)

Description

Patch Set 1 : #

Patch Set 2 : Reverting unnecessary embedded_test_server.cc changes. #

Patch Set 3 : Unconditionally enabling the new test. #

Patch Set 4 : Rebasing... #

Patch Set 5 : Ready for further review. #

Patch Set 6 : Making .gitattributes more generic. #

Patch Set 7 : Reuploading with artificially high similarity requirements. #

Patch Set 8 : Dropped euc-kr test (little value + trouble with tools). #

Patch Set 9 : Dropping .gitattributes #

Patch Set 10 : Minor tweaks. #

Patch Set 11 : Saving UTF16 with BOM. Leaving UTF32 in a broken state. #

Patch Set 12 : Using TextEncoding::isNonByteBasedEncoding instead. #

Patch Set 13 : Addressed CR feedback from jsbell@. #

Patch Set 14 : Rebasing... #

Patch Set 15 : Switched \uNNNN escape codes into \xNN. #

Patch Set 16 : Rebased after landing the test files in a separate CL. #

Messages