|
|
Created:
5 years, 2 months ago by Łukasz Anforowicz Modified:
5 years ago Reviewers:
ncarter (slow), jsbell, tkent, jungshik at Google, Randy Smith (Not in Mondays), dcheng, kojii CC:
asanka, blink-reviews, cbentzel+watch_chromium.org, chromium-reviews, ncarter (slow), site-isolation-reviews_chromium.org Base URL:
https://chromium.googlesource.com/chromium/src.git@master Target Ref:
refs/pending/heads/master Project:
chromium Visibility:
Public. |
DescriptionTweaking WebPageSerializerImpl to emit a BOM for UTF16/32.
Previously save-page-as would truncate saved html files if they
would include a '\0' byte (i.e. if saving a html document that was
encoded as utf16). The truncation doesn't happen anymore [1], but the
saved html files still don't round-trip properly, because the browser is
not able to detect the encoding of the files (<meta> character cannot be
sniffed in encodings that are not ascii compatible [2] + it is obviously
impossible for local files to declare their encoding via http headers).
This CL makes changes to always emits a BOM when saving in UTF16
or UTF32 encoding. This in-turn causes round-tripping tests in
save_page_browsertest.cc to pass for tests using these encodings.
Note that this CL applies only to save as "complete html" and has no
impact to save as "html only" or save as "mhtml".
[1] https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c942426a9818e008ec%5E%21/content/renderer/render_frame_impl.cc
[2] http://www.w3.org/TR/html5/document-metadata#charset:
If an HTML document contains a meta element with a charset attribute
or a meta element with an http-equiv attribute in the encoding
declaration state, then the character encoding used must be an
ASCII-compatible character encoding.
BUG=541699
Committed: https://crrev.com/36f0d157b6d2c6fae627a3aba955ad4e0fe07d71
Cr-Commit-Position: refs/heads/master@{#364394}
Patch Set 1 : #Patch Set 2 : Reverting unnecessary embedded_test_server.cc changes. #
Total comments: 4
Patch Set 3 : Unconditionally enabling the new test. #Patch Set 4 : Rebasing... #
Total comments: 4
Patch Set 5 : Ready for further review. #Patch Set 6 : Making .gitattributes more generic. #Patch Set 7 : Reuploading with artificially high similarity requirements. #Patch Set 8 : Dropped euc-kr test (little value + trouble with tools). #Patch Set 9 : Dropping .gitattributes #Patch Set 10 : Minor tweaks. #Patch Set 11 : Saving UTF16 with BOM. Leaving UTF32 in a broken state. #
Total comments: 5
Patch Set 12 : Using TextEncoding::isNonByteBasedEncoding instead. #
Total comments: 12
Patch Set 13 : Addressed CR feedback from jsbell@. #Patch Set 14 : Rebasing... #Patch Set 15 : Switched \uNNNN escape codes into \xNN. #Patch Set 16 : Rebased after landing the test files in a separate CL. #
Messages
Total messages: 70 (26 generated)
Patchset #1 (id:1) has been deleted
lukasza@chromium.org changed reviewers: + rdsmith@chromium.org
Randy, could you please take a look?
(I feel competent to review the overall test structure, but I know very little about text encodings--would you be willing to get someone knowledgable in that area to review as well?) Top level comment: The CL description reads as if the only testing done here is manual, which isn't accurate; could you clarify that? https://codereview.chromium.org/1407663004/diff/40001/chrome/browser/download... File chrome/browser/download/save_page_browsertest.cc (right): https://codereview.chromium.org/1407663004/diff/40001/chrome/browser/download... chrome/browser/download/save_page_browsertest.cc:561: // Disabled on Windows due to flakiness. http://crbug.com/162323 This is a new test; have you confirmed that it's flaky? Any possibility of fixing it? I wince at landing tests in pre-flaky/disabled fashion, though if it's in some test infrastructure you have nothing to do with it's quite reasonable to not fix it. https://codereview.chromium.org/1407663004/diff/40001/chrome/browser/download... chrome/browser/download/save_page_browsertest.cc:578: EXPECT_TRUE(base::ContentsEqual( I wince a bit at this, since it tests the implementation rather than the behavior we care about (I could imagine many character level changes that wouldn't effect the web page viewed by the user after saving)--is there any way you can put something in the original file that would be messed up if the encoding was messed up, but would always be saved correctly if it wasn't?
Randy, can you take another look please? https://codereview.chromium.org/1407663004/diff/40001/chrome/browser/download... File chrome/browser/download/save_page_browsertest.cc (right): https://codereview.chromium.org/1407663004/diff/40001/chrome/browser/download... chrome/browser/download/save_page_browsertest.cc:561: // Disabled on Windows due to flakiness. http://crbug.com/162323 On 2015/10/15 18:57:06, rdsmith wrote: > This is a new test; have you confirmed that it's flaky? Any possibility of > fixing it? I wince at landing tests in pre-flaky/disabled fashion, though if > it's in some test infrastructure you have nothing to do with it's quite > reasonable to not fix it. Done. Also following up with the following CLs: - https://codereview.chromium.org/1408993003/ (landed) - https://codereview.chromium.org/1408393004/ (landed) - https://codereview.chromium.org/1413763008/ (in CR) https://codereview.chromium.org/1407663004/diff/40001/chrome/browser/download... chrome/browser/download/save_page_browsertest.cc:578: EXPECT_TRUE(base::ContentsEqual( On 2015/10/15 18:57:06, rdsmith wrote: > I wince a bit at this, since it tests the implementation rather than the > behavior we care about (I could imagine many character level changes that > wouldn't effect the web page viewed by the user after saving)--is there any way > you can put something in the original file that would be messed up if the > encoding was messed up, but would always be saved correctly if it wasn't? I still think that comparing the whole file contents is desirable here: - This captures encoding of the whole page (which is important if in the future we decide to split encoding across 2 separate pieces of code as the tentative crrev.com/1373573002 is doing). Alternative here would be to foresee which pieces might be encoded separately and add verification for each such piece flavor. - This captures BOM behavior. - Avoiding whole-file-comparison in the new test doesn't help that much without also avoiding whole-file-comparison in the 2 old tests. And if somebody is replacing the other 2 expected contents, then in the same go he/she can retain and replace the file generated in the new test. - There is a trade-off between robustness of verification and effort required to update the tests. I think the effort is not that big, and I think there is a significant value in making sure that all unexpected changes are detected by the test (i.e. I don't worry about overtesting at browser-test level as much as I would at unit-test level). If you feel strongly about this, then maybe we can track this as a separate bug, that also covers the 2 old tests that currently use whole file comparison. WDYT?
On 2015/10/15 18:57:06, rdsmith wrote: > (I feel competent to review the overall test structure, but I know very little > about text encodings--would you be willing to get someone knowledgable in that > area to review as well?) Ok. > > Top level comment: The CL description reads as if the only testing done here is > manual, which isn't accurate; could you clarify that? I did add a new automated test (which does automated verification). I manually looked at the file generated by the automated test, to make sure it looks right. And obviously the test via a private chrome binary was manual as well.
On 2015/10/20 21:08:43, Łukasz Anforowicz wrote: > Randy, can you take another look please? > > https://codereview.chromium.org/1407663004/diff/40001/chrome/browser/download... > File chrome/browser/download/save_page_browsertest.cc (right): > > https://codereview.chromium.org/1407663004/diff/40001/chrome/browser/download... > chrome/browser/download/save_page_browsertest.cc:561: // Disabled on Windows due > to flakiness. http://crbug.com/162323 > On 2015/10/15 18:57:06, rdsmith wrote: > > This is a new test; have you confirmed that it's flaky? Any possibility of > > fixing it? I wince at landing tests in pre-flaky/disabled fashion, though if > > it's in some test infrastructure you have nothing to do with it's quite > > reasonable to not fix it. > > Done. > > Also following up with the following CLs: > - https://codereview.chromium.org/1408993003/ (landed) > - https://codereview.chromium.org/1408393004/ (landed) > - https://codereview.chromium.org/1413763008/ (in CR) > > https://codereview.chromium.org/1407663004/diff/40001/chrome/browser/download... > chrome/browser/download/save_page_browsertest.cc:578: > EXPECT_TRUE(base::ContentsEqual( > On 2015/10/15 18:57:06, rdsmith wrote: > > I wince a bit at this, since it tests the implementation rather than the > > behavior we care about (I could imagine many character level changes that > > wouldn't effect the web page viewed by the user after saving)--is there any > way > > you can put something in the original file that would be messed up if the > > encoding was messed up, but would always be saved correctly if it wasn't? > > I still think that comparing the whole file contents is desirable here: > > - This captures encoding of the whole page (which is important if in the future > we decide to split encoding across 2 separate pieces of code as the tentative > crrev.com/1373573002 is doing). Alternative here would be to foresee which > pieces might be encoded separately and add verification for each such piece > flavor. > > - This captures BOM behavior. > > - Avoiding whole-file-comparison in the new test doesn't help that much without > also avoiding whole-file-comparison in the 2 old tests. And if somebody is > replacing the other 2 expected contents, then in the same go he/she can retain > and replace the file generated in the new test. > > - There is a trade-off between robustness of verification and effort required to > update the tests. I think the effort is not that big, and I think there is a > significant value in making sure that all unexpected changes are detected by the > test (i.e. I don't worry about overtesting at browser-test level as much as I > would at unit-test level). > > > If you feel strongly about this, then maybe we can track this as a separate bug, > that also covers the 2 old tests that currently use whole file comparison. > WDYT? If there's a strong argument in that direction, I'm good with it. save_page_browsertest.cc and chrome/test/data LGTM--I don't know the webkit code at all and am hoping whoever you get to evaluate the encoding can also that code. But if not, let me know, and I'll pepper you with questions until I decide I know the code well enough to review it :-}.
lukasza@chromium.org changed reviewers: + dcheng@chromium.org
Daniel, could you please take a look? (mostly at third_party/WebKit/Source/web/WebPageSerializerImpl.cpp, but obviously feedback for the whole CL is welcomed).
lukasza@chromium.org changed reviewers: + nick@chromium.org
On 2015/10/15 18:57:06, rdsmith wrote: > (I feel competent to review the overall test structure, but I know very little > about text encodings--would you be willing to get someone knowledgable in that > area to review as well?) Nick has kindly volunteered to do the review from this angle.
lgtm
Also, you might need to land this manually with git cl land. I recall strange things happened in the past when landing non-utf8 files. https://codereview.chromium.org/1407663004/diff/80001/third_party/WebKit/Sour... File third_party/WebKit/Source/web/WebPageSerializerImpl.cpp (right): https://codereview.chromium.org/1407663004/diff/80001/third_party/WebKit/Sour... third_party/WebKit/Source/web/WebPageSerializerImpl.cpp:166: String xmlEncoding = param->textEncoding.name(); Out of curiosity, did you consider removing textEncoding from the parameters completely? If so, why did you opt not to do it? (I'm not saying we should do it, just wondering about reasons for/against doing this)
https://codereview.chromium.org/1407663004/diff/80001/third_party/WebKit/Sour... File third_party/WebKit/Source/web/WebPageSerializerImpl.cpp (right): https://codereview.chromium.org/1407663004/diff/80001/third_party/WebKit/Sour... third_party/WebKit/Source/web/WebPageSerializerImpl.cpp:133: if (isHTMLMetaElement(*element)) { It looks like it's designed to strip <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">. Does also have the effect of stripping html5 style <meta charset="UTF-32"> tags?
I've replied to CR questions. I think the current CL should be put on hold until crbug.com/547206 is fixed. https://codereview.chromium.org/1407663004/diff/80001/third_party/WebKit/Sour... File third_party/WebKit/Source/web/WebPageSerializerImpl.cpp (right): https://codereview.chromium.org/1407663004/diff/80001/third_party/WebKit/Sour... third_party/WebKit/Source/web/WebPageSerializerImpl.cpp:133: if (isHTMLMetaElement(*element)) { On 2015/10/21 20:17:58, ncarter wrote: > It looks like it's designed to strip <meta http-equiv="Content-Type" > content="text/html; charset=UTF-8">. > > Does also have the effect of stripping html5 style <meta charset="UTF-32"> tags? I've opened crbug.com/547206 to track this (I'll add a comment to this bug with a proposed fix in a few minutes). I am not sure if the issue you pointed out (thanks for catching this BTW) should block the CL under review. Let me try to spell out the impact of this issue before the CL / after the CL, for documents with html5-style meta/charset element, and depending on the encoding of the document: - UTF32 - won't work at all (content will be lost because of the IPC bug that this CL tries to mitigate [by only sending utf8 strings over ipc]) - ISO8859-1 - will get serialized with *two* meta elements: - <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> - generated by the serializer - before this CL this would use and say iso8859-1 - the original encoding - <meta charset="iso8859-1"> - copied by the serializer from the original html document My gut feeling is that I should not land the current CL until the <meta charset...> issue is fixed. Plus, I don't need to normalize to UTF8 across browser/renderer if we use the "original urls" cl to fix the OOPIF problem (crrev.com/1413103003). https://codereview.chromium.org/1407663004/diff/80001/third_party/WebKit/Sour... third_party/WebKit/Source/web/WebPageSerializerImpl.cpp:166: String xmlEncoding = param->textEncoding.name(); On 2015/10/21 20:09:29, dcheng wrote: > Out of curiosity, did you consider removing textEncoding from the parameters > completely? If so, why did you opt not to do it? > > (I'm not saying we should do it, just wondering about reasons for/against doing > this) The main reason is an imprecise feeling that smaller diff will be better (better for understanding the cl for review and for the future, better for compatibility with future cls [future CLs = resurrecting serializer's merge OR other OOPIF-related changes OR reintroducing the encoding if needed]). I guess one could say that we should optimizing for wholesale/overall code readability, rather than readability of individual diifs - this is a valid argument, but I think one can also argue for the other tradeoff. Still - I didn't spend too much time thinking about it when authoring the CL - the above paragraph is a post-rationalization of an already made decision...
Blink changes seem OK, but given your reply, I'll defer from L-G-T-M until you think it's ready to land (which I assume is mostly blocked on fixing the <meta charset> bug.
Description was changed from ========== Make sure that Save-Page-As-Complete-HTML always saves as UTF8. Before the CL, UTF32-encoded html files would render properly, but they would be incorrectly saved as 1-byte-long file, because WebPageSerializerImpl would attempt to preserve [1] the original file encoding, but an IPC bug would corrupt/truncate UTF32 contents when sending them to the browser (see crbug.com/541699#c2). To fix the problem above, this CL makes sure that "complete-html" is always saved as UTF8. This fixes crbug.com/541699 - after the CL save-page-as works fine with multi-byte encodings (i.e. a saved utf32.htm can be successfully opened and rendered). Other notes: - W3C recommends saving as UTF8 - [2], [3] - Always using UTF8 makes it easy to coordinate the encoding between renderer and browser processes for OOPIFs-related work - see the pending CL in [6]. - If saving in original encoding [1] was important, we would have noticed that saving of multi-byte encodings (UTF16, UTF32) was broken for 4+ years. - Existing PageSerializer and WebPageSerializeImpl already have code and set a precendent for modifying encoding references inside document content: [4] - PageSerializer.cpp - isCharsetSpecifyingNode. [5] - PageSerializer.cpp - emitting meta element w/ charset info [7] - WebPageSerializerImpl.cpp - skipping the original meta element [8] - WebPageSerializerImpl.cpp - emitting encoding in xml declaration [9] - WebPageSerializerImpl.cpp - emitting meta element w/ charset info - This CL continues the current behavior of *not* emitting a BOM at the beginning of a saved html file. I am not sure if this is an issue. W3C documents I've seen so far do not have a strong recommendation for or against inclusion of a BOM. - This CL applies only to save as "complete html" and has no impact to save as "html only" or save as "mhtml". Test steps: - Built chrome and manually saved-as-complete-html and inspected result for http://anforowicz.github.io/html-encoding/utf32.html - Run the new test (out/gn/browser_tests --gtest_filter=*SaveUTF32Complete*) and manually verified correctness of the output file. [1] https://chromium.googlesource.com/chromium/src/+/d083d29530a507b7b5a6df1c5727...). [2] http://www.w3.org/TR/encoding/ "The utf-8 encoding is the most appropriate encoding for interchange of Unicode, the universal coded character set. Therefore for new protocols and formats, as well as existing formats deployed in new contexts, this specification requires (and defines) the utf-8 encoding." [3] http://www.w3.org/TR/html5/document-metadata#charset: "Authors should use UTF-8. Conformance checkers may advise authors against using legacy encodings. [RFC3629]" "Authoring tools should default to using UTF-8 for newly-created documents. [RFC3629]" [4] https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c... [5] https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c... [6] https://codereview.chromium.org/1373573002/diff/120001/content/browser/downlo... [7] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [8] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [9] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... BUG=541699,526786 TEST=See "Test steps" above. ========== to ========== Make sure that Save-Page-As-Complete-HTML always saves as UTF8. Before the CL, UTF32-encoded html files would render properly, but they would be incorrectly saved as 1-byte-long file, because WebPageSerializerImpl would attempt to preserve [1] the original file encoding, but an IPC bug would corrupt/truncate UTF32 contents when sending them to the browser (see crbug.com/541699#c2). To fix the problem above, this CL makes sure that "complete-html" is always saved as UTF8. This fixes crbug.com/541699 - after the CL save-page-as works fine with multi-byte encodings (i.e. a saved utf32.htm can be successfully opened and rendered). Other notes: - W3C recommends saving as UTF8 - [2], [3] - If saving in original encoding [1] was important, we would have noticed that saving of multi-byte encodings (UTF16, UTF32) was broken for 4+ years. - Existing PageSerializer and WebPageSerializeImpl already have code and set a precendent for modifying encoding references inside document content: [4] - PageSerializer.cpp - isCharsetSpecifyingNode. [5] - PageSerializer.cpp - emitting meta element w/ charset info [7] - WebPageSerializerImpl.cpp - skipping the original meta element [8] - WebPageSerializerImpl.cpp - emitting encoding in xml declaration [9] - WebPageSerializerImpl.cpp - emitting meta element w/ charset info - This CL continues the current behavior of *not* emitting a BOM at the beginning of a saved html file. I am not sure if this is an issue. W3C documents I've seen so far do not have a strong recommendation for or against inclusion of a BOM. - This CL applies only to save as "complete html" and has no impact to save as "html only" or save as "mhtml". Test steps: - Built chrome and manually saved-as-complete-html and inspected result for http://anforowicz.github.io/html-encoding/utf32.html - Run the new test (out/gn/browser_tests --gtest_filter=*SaveUTF32Complete*) and manually verified correctness of the output file. [1] https://chromium.googlesource.com/chromium/src/+/d083d29530a507b7b5a6df1c5727...). [2] http://www.w3.org/TR/encoding/ "The utf-8 encoding is the most appropriate encoding for interchange of Unicode, the universal coded character set. Therefore for new protocols and formats, as well as existing formats deployed in new contexts, this specification requires (and defines) the utf-8 encoding." [3] http://www.w3.org/TR/html5/document-metadata#charset: "Authors should use UTF-8. Conformance checkers may advise authors against using legacy encodings. [RFC3629]" "Authoring tools should default to using UTF-8 for newly-created documents. [RFC3629]" [4] https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c... [5] https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c... [6] https://codereview.chromium.org/1373573002/diff/120001/content/browser/downlo... [7] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [8] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [9] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... BUG=541699,526786 TEST=See "Test steps" above. ==========
Description was changed from ========== Make sure that Save-Page-As-Complete-HTML always saves as UTF8. Before the CL, UTF32-encoded html files would render properly, but they would be incorrectly saved as 1-byte-long file, because WebPageSerializerImpl would attempt to preserve [1] the original file encoding, but an IPC bug would corrupt/truncate UTF32 contents when sending them to the browser (see crbug.com/541699#c2). To fix the problem above, this CL makes sure that "complete-html" is always saved as UTF8. This fixes crbug.com/541699 - after the CL save-page-as works fine with multi-byte encodings (i.e. a saved utf32.htm can be successfully opened and rendered). Other notes: - W3C recommends saving as UTF8 - [2], [3] - If saving in original encoding [1] was important, we would have noticed that saving of multi-byte encodings (UTF16, UTF32) was broken for 4+ years. - Existing PageSerializer and WebPageSerializeImpl already have code and set a precendent for modifying encoding references inside document content: [4] - PageSerializer.cpp - isCharsetSpecifyingNode. [5] - PageSerializer.cpp - emitting meta element w/ charset info [7] - WebPageSerializerImpl.cpp - skipping the original meta element [8] - WebPageSerializerImpl.cpp - emitting encoding in xml declaration [9] - WebPageSerializerImpl.cpp - emitting meta element w/ charset info - This CL continues the current behavior of *not* emitting a BOM at the beginning of a saved html file. I am not sure if this is an issue. W3C documents I've seen so far do not have a strong recommendation for or against inclusion of a BOM. - This CL applies only to save as "complete html" and has no impact to save as "html only" or save as "mhtml". Test steps: - Built chrome and manually saved-as-complete-html and inspected result for http://anforowicz.github.io/html-encoding/utf32.html - Run the new test (out/gn/browser_tests --gtest_filter=*SaveUTF32Complete*) and manually verified correctness of the output file. [1] https://chromium.googlesource.com/chromium/src/+/d083d29530a507b7b5a6df1c5727...). [2] http://www.w3.org/TR/encoding/ "The utf-8 encoding is the most appropriate encoding for interchange of Unicode, the universal coded character set. Therefore for new protocols and formats, as well as existing formats deployed in new contexts, this specification requires (and defines) the utf-8 encoding." [3] http://www.w3.org/TR/html5/document-metadata#charset: "Authors should use UTF-8. Conformance checkers may advise authors against using legacy encodings. [RFC3629]" "Authoring tools should default to using UTF-8 for newly-created documents. [RFC3629]" [4] https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c... [5] https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c... [6] https://codereview.chromium.org/1373573002/diff/120001/content/browser/downlo... [7] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [8] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [9] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... BUG=541699,526786 TEST=See "Test steps" above. ========== to ========== Make sure that Save-Page-As-Complete-HTML always saves as UTF8. Before the CL, UTF32-encoded html files would render properly, but they would be incorrectly saved as 1-byte-long file, because WebPageSerializerImpl would attempt to preserve [1] the original file encoding, but an IPC bug would corrupt/truncate UTF32 contents when sending them to the browser (see crbug.com/541699#c2). To fix the problem above, this CL makes sure that "complete-html" is always saved as UTF8. This fixes crbug.com/541699 - after the CL save-page-as works fine with multi-byte encodings (i.e. a saved utf32.htm can be successfully opened and rendered). Other notes: - W3C recommends saving as UTF8 - [2], [3] - If saving in original encoding [1] was important, we would have noticed that saving of multi-byte encodings (UTF16, UTF32) was broken for 4+ years. - Existing PageSerializer and WebPageSerializeImpl already have code and set a precendent for modifying encoding references inside document content: [4] - PageSerializer.cpp - isCharsetSpecifyingNode. [5] - PageSerializer.cpp - emitting meta element w/ charset info [7] - WebPageSerializerImpl.cpp - skipping the original meta element [8] - WebPageSerializerImpl.cpp - emitting encoding in xml declaration [9] - WebPageSerializerImpl.cpp - emitting meta element w/ charset info - This CL continues the current behavior of *not* emitting a BOM at the beginning of a saved html file. I am not sure if this is an issue. W3C documents I've seen so far do not have a strong recommendation for or against inclusion of a BOM. - This CL applies only to save as "complete html" and has no impact to save as "html only" or save as "mhtml". Test steps: - Built chrome and manually saved-as-complete-html and inspected result for http://anforowicz.github.io/html-encoding/utf32.html - Run the new test (out/gn/browser_tests --gtest_filter=*SaveUTF32Complete*) and manually verified correctness of the output file. [1] https://chromium.googlesource.com/chromium/src/+/d083d29530a507b7b5a6df1c5727...). [2] http://www.w3.org/TR/encoding/ "The utf-8 encoding is the most appropriate encoding for interchange of Unicode, the universal coded character set. Therefore for new protocols and formats, as well as existing formats deployed in new contexts, this specification requires (and defines) the utf-8 encoding." [3] http://www.w3.org/TR/html5/document-metadata#charset: "Authors should use UTF-8. Conformance checkers may advise authors against using legacy encodings. [RFC3629]" "Authoring tools should default to using UTF-8 for newly-created documents. [RFC3629]" [4] https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c... [5] https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c... [6] https://codereview.chromium.org/1373573002/diff/120001/content/browser/downlo... [7] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [8] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [9] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... BUG=541699 TEST=See "Test steps" above. ==========
Description was changed from ========== Make sure that Save-Page-As-Complete-HTML always saves as UTF8. Before the CL, UTF32-encoded html files would render properly, but they would be incorrectly saved as 1-byte-long file, because WebPageSerializerImpl would attempt to preserve [1] the original file encoding, but an IPC bug would corrupt/truncate UTF32 contents when sending them to the browser (see crbug.com/541699#c2). To fix the problem above, this CL makes sure that "complete-html" is always saved as UTF8. This fixes crbug.com/541699 - after the CL save-page-as works fine with multi-byte encodings (i.e. a saved utf32.htm can be successfully opened and rendered). Other notes: - W3C recommends saving as UTF8 - [2], [3] - If saving in original encoding [1] was important, we would have noticed that saving of multi-byte encodings (UTF16, UTF32) was broken for 4+ years. - Existing PageSerializer and WebPageSerializeImpl already have code and set a precendent for modifying encoding references inside document content: [4] - PageSerializer.cpp - isCharsetSpecifyingNode. [5] - PageSerializer.cpp - emitting meta element w/ charset info [7] - WebPageSerializerImpl.cpp - skipping the original meta element [8] - WebPageSerializerImpl.cpp - emitting encoding in xml declaration [9] - WebPageSerializerImpl.cpp - emitting meta element w/ charset info - This CL continues the current behavior of *not* emitting a BOM at the beginning of a saved html file. I am not sure if this is an issue. W3C documents I've seen so far do not have a strong recommendation for or against inclusion of a BOM. - This CL applies only to save as "complete html" and has no impact to save as "html only" or save as "mhtml". Test steps: - Built chrome and manually saved-as-complete-html and inspected result for http://anforowicz.github.io/html-encoding/utf32.html - Run the new test (out/gn/browser_tests --gtest_filter=*SaveUTF32Complete*) and manually verified correctness of the output file. [1] https://chromium.googlesource.com/chromium/src/+/d083d29530a507b7b5a6df1c5727...). [2] http://www.w3.org/TR/encoding/ "The utf-8 encoding is the most appropriate encoding for interchange of Unicode, the universal coded character set. Therefore for new protocols and formats, as well as existing formats deployed in new contexts, this specification requires (and defines) the utf-8 encoding." [3] http://www.w3.org/TR/html5/document-metadata#charset: "Authors should use UTF-8. Conformance checkers may advise authors against using legacy encodings. [RFC3629]" "Authoring tools should default to using UTF-8 for newly-created documents. [RFC3629]" [4] https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c... [5] https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c... [6] https://codereview.chromium.org/1373573002/diff/120001/content/browser/downlo... [7] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [8] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [9] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... BUG=541699 TEST=See "Test steps" above. ========== to ========== Make sure that Save-Page-As-Complete-HTML always saves as UTF8. Before the CL, UTF32-encoded html files would render properly, but they would be incorrectly saved as 1-byte-long file, because WebPageSerializerImpl would attempt to preserve [1] the original file encoding, but an IPC bug would corrupt/truncate UTF32 contents when sending them to the browser (see crbug.com/541699#c2). Note that even after fixing the IPC bug [10], saved page would not render properly, because the browser would not be able to correctly detect the encoding of the file saved in the original, utf-32 encoding. To fix the problems above, this CL makes sure that "complete-html" is always saved as UTF8. Other notes: - W3C recommends saving as UTF8 - [2], [3] - If saving in original encoding [1] was important, we would have noticed that saving of multi-byte encodings (UTF16, UTF32) was broken for 4+ years. - The only way to declare multi-byte encoding is via http headers and therefore encoding information gets lost when saving multi-byte encoded html docs into local files (i.e. there is no good place for saving / preserving the encoding information in the local files; in particular sniffing the <meta> element doesn't work for multi-byte encodings). - Existing PageSerializer and WebPageSerializeImpl already have code and set a precendent for modifying encoding references inside document content: [4] - PageSerializer.cpp - isCharsetSpecifyingNode. [5] - PageSerializer.cpp - emitting meta element w/ charset info [7] - WebPageSerializerImpl.cpp - skipping the original meta element [8] - WebPageSerializerImpl.cpp - emitting encoding in xml declaration [9] - WebPageSerializerImpl.cpp - emitting meta element w/ charset info - This CL continues the current behavior of *not* emitting a BOM at the beginning of a saved html file. I am not sure if this is an issue. W3C documents I've seen so far do not have a strong recommendation for or against inclusion of a BOM. - This CL applies only to save as "complete html" and has no impact to save as "html only" or save as "mhtml". Test steps: - Built chrome and manually saved-as-complete-html and inspected result for http://anforowicz.github.io/html-encoding/utf32.html - Run the new save_page_browsertest.cc and manually verified correctness of the output (preserved by calling save_dir_Take()). [1] https://chromium.googlesource.com/chromium/src/+/d083d29530a507b7b5a6df1c5727...). [2] http://www.w3.org/TR/encoding/ "The utf-8 encoding is the most appropriate encoding for interchange of Unicode, the universal coded character set. Therefore for new protocols and formats, as well as existing formats deployed in new contexts, this specification requires (and defines) the utf-8 encoding." [3] http://www.w3.org/TR/html5/document-metadata#charset: "Authors should use UTF-8. Conformance checkers may advise authors against using legacy encodings. [RFC3629]" "Authoring tools should default to using UTF-8 for newly-created documents. [RFC3629]" [4] https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c... [5] https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c... [6] https://codereview.chromium.org/1373573002/diff/120001/content/browser/downlo... [7] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [8] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [9] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [10] https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242... BUG=541699 TEST=See "Test steps" above. ==========
Description was changed from ========== Make sure that Save-Page-As-Complete-HTML always saves as UTF8. Before the CL, UTF32-encoded html files would render properly, but they would be incorrectly saved as 1-byte-long file, because WebPageSerializerImpl would attempt to preserve [1] the original file encoding, but an IPC bug would corrupt/truncate UTF32 contents when sending them to the browser (see crbug.com/541699#c2). Note that even after fixing the IPC bug [10], saved page would not render properly, because the browser would not be able to correctly detect the encoding of the file saved in the original, utf-32 encoding. To fix the problems above, this CL makes sure that "complete-html" is always saved as UTF8. Other notes: - W3C recommends saving as UTF8 - [2], [3] - If saving in original encoding [1] was important, we would have noticed that saving of multi-byte encodings (UTF16, UTF32) was broken for 4+ years. - The only way to declare multi-byte encoding is via http headers and therefore encoding information gets lost when saving multi-byte encoded html docs into local files (i.e. there is no good place for saving / preserving the encoding information in the local files; in particular sniffing the <meta> element doesn't work for multi-byte encodings). - Existing PageSerializer and WebPageSerializeImpl already have code and set a precendent for modifying encoding references inside document content: [4] - PageSerializer.cpp - isCharsetSpecifyingNode. [5] - PageSerializer.cpp - emitting meta element w/ charset info [7] - WebPageSerializerImpl.cpp - skipping the original meta element [8] - WebPageSerializerImpl.cpp - emitting encoding in xml declaration [9] - WebPageSerializerImpl.cpp - emitting meta element w/ charset info - This CL continues the current behavior of *not* emitting a BOM at the beginning of a saved html file. I am not sure if this is an issue. W3C documents I've seen so far do not have a strong recommendation for or against inclusion of a BOM. - This CL applies only to save as "complete html" and has no impact to save as "html only" or save as "mhtml". Test steps: - Built chrome and manually saved-as-complete-html and inspected result for http://anforowicz.github.io/html-encoding/utf32.html - Run the new save_page_browsertest.cc and manually verified correctness of the output (preserved by calling save_dir_Take()). [1] https://chromium.googlesource.com/chromium/src/+/d083d29530a507b7b5a6df1c5727...). [2] http://www.w3.org/TR/encoding/ "The utf-8 encoding is the most appropriate encoding for interchange of Unicode, the universal coded character set. Therefore for new protocols and formats, as well as existing formats deployed in new contexts, this specification requires (and defines) the utf-8 encoding." [3] http://www.w3.org/TR/html5/document-metadata#charset: "Authors should use UTF-8. Conformance checkers may advise authors against using legacy encodings. [RFC3629]" "Authoring tools should default to using UTF-8 for newly-created documents. [RFC3629]" [4] https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c... [5] https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c... [6] https://codereview.chromium.org/1373573002/diff/120001/content/browser/downlo... [7] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [8] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [9] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [10] https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242... BUG=541699 TEST=See "Test steps" above. ========== to ========== Make sure that Save-Page-As-Complete-HTML always saves as UTF8. Before the CL, UTF32-encoded html files would render properly, but they would be incorrectly saved as 1-byte-long file, because WebPageSerializerImpl would attempt to preserve [1] the original file encoding, but an IPC bug would corrupt/truncate UTF32 contents when sending them to the browser (see crbug.com/541699#c2). Note that even after fixing the IPC bug [10], saved page would not render properly, because the browser would not be able to correctly detect the encoding of the file saved in the original, utf-32 encoding. To fix the problems above, this CL makes sure that "complete-html" is always saved as UTF8. Other notes: - W3C recommends saving as UTF8 - [2], [3] - If saving in original encoding [1] was important, we would have noticed that saving of multi-byte encodings (UTF16, UTF32) was broken for 4+ years. - The only way to declare multi-byte encoding is via http headers and therefore encoding information gets lost when saving multi-byte encoded html docs into local files (i.e. there is no good place for saving / preserving the encoding information in the local files; in particular sniffing the <meta> element doesn't work for multi-byte encodings). - Existing PageSerializer and WebPageSerializeImpl already have code and set a precendent for modifying encoding references inside document content: [4] - PageSerializer.cpp - isCharsetSpecifyingNode. [5] - PageSerializer.cpp - emitting meta element w/ charset info [7] - WebPageSerializerImpl.cpp - skipping the original meta element [8] - WebPageSerializerImpl.cpp - emitting encoding in xml declaration [9] - WebPageSerializerImpl.cpp - emitting meta element w/ charset info - This CL continues the current behavior of *not* emitting a BOM at the beginning of a saved html file. I am not sure if this is an issue. W3C documents I've seen so far do not have a strong recommendation for or against inclusion of a BOM. - This CL applies only to save as "complete html" and has no impact to save as "html only" or save as "mhtml". Test steps: - Built chrome and manually saved-as-complete-html and inspected result for http://anforowicz.github.io/html-encoding/utf32.html - Run the new save_page_browsertest.cc and manually verified correctness of the output (preserved by calling save_dir_Take()). This included verifying that the 2 characters from encoding-euc-kr.htm are still visually distinct after save-page-as-complete-html. [1] https://chromium.googlesource.com/chromium/src/+/d083d29530a507b7b5a6df1c5727...). [2] http://www.w3.org/TR/encoding/ "The utf-8 encoding is the most appropriate encoding for interchange of Unicode, the universal coded character set. Therefore for new protocols and formats, as well as existing formats deployed in new contexts, this specification requires (and defines) the utf-8 encoding." [3] http://www.w3.org/TR/html5/document-metadata#charset: "Authors should use UTF-8. Conformance checkers may advise authors against using legacy encodings. [RFC3629]" "Authoring tools should default to using UTF-8 for newly-created documents. [RFC3629]" [4] https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c... [5] https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c... [6] https://codereview.chromium.org/1373573002/diff/120001/content/browser/downlo... [7] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [8] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [9] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [10] https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242... BUG=541699 TEST=See "Test steps" above. ==========
Description was changed from ========== Make sure that Save-Page-As-Complete-HTML always saves as UTF8. Before the CL, UTF32-encoded html files would render properly, but they would be incorrectly saved as 1-byte-long file, because WebPageSerializerImpl would attempt to preserve [1] the original file encoding, but an IPC bug would corrupt/truncate UTF32 contents when sending them to the browser (see crbug.com/541699#c2). Note that even after fixing the IPC bug [10], saved page would not render properly, because the browser would not be able to correctly detect the encoding of the file saved in the original, utf-32 encoding. To fix the problems above, this CL makes sure that "complete-html" is always saved as UTF8. Other notes: - W3C recommends saving as UTF8 - [2], [3] - If saving in original encoding [1] was important, we would have noticed that saving of multi-byte encodings (UTF16, UTF32) was broken for 4+ years. - The only way to declare multi-byte encoding is via http headers and therefore encoding information gets lost when saving multi-byte encoded html docs into local files (i.e. there is no good place for saving / preserving the encoding information in the local files; in particular sniffing the <meta> element doesn't work for multi-byte encodings). - Existing PageSerializer and WebPageSerializeImpl already have code and set a precendent for modifying encoding references inside document content: [4] - PageSerializer.cpp - isCharsetSpecifyingNode. [5] - PageSerializer.cpp - emitting meta element w/ charset info [7] - WebPageSerializerImpl.cpp - skipping the original meta element [8] - WebPageSerializerImpl.cpp - emitting encoding in xml declaration [9] - WebPageSerializerImpl.cpp - emitting meta element w/ charset info - This CL continues the current behavior of *not* emitting a BOM at the beginning of a saved html file. I am not sure if this is an issue. W3C documents I've seen so far do not have a strong recommendation for or against inclusion of a BOM. - This CL applies only to save as "complete html" and has no impact to save as "html only" or save as "mhtml". Test steps: - Built chrome and manually saved-as-complete-html and inspected result for http://anforowicz.github.io/html-encoding/utf32.html - Run the new save_page_browsertest.cc and manually verified correctness of the output (preserved by calling save_dir_Take()). This included verifying that the 2 characters from encoding-euc-kr.htm are still visually distinct after save-page-as-complete-html. [1] https://chromium.googlesource.com/chromium/src/+/d083d29530a507b7b5a6df1c5727...). [2] http://www.w3.org/TR/encoding/ "The utf-8 encoding is the most appropriate encoding for interchange of Unicode, the universal coded character set. Therefore for new protocols and formats, as well as existing formats deployed in new contexts, this specification requires (and defines) the utf-8 encoding." [3] http://www.w3.org/TR/html5/document-metadata#charset: "Authors should use UTF-8. Conformance checkers may advise authors against using legacy encodings. [RFC3629]" "Authoring tools should default to using UTF-8 for newly-created documents. [RFC3629]" [4] https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c... [5] https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c... [6] https://codereview.chromium.org/1373573002/diff/120001/content/browser/downlo... [7] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [8] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [9] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [10] https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242... BUG=541699 TEST=See "Test steps" above. ========== to ========== Make sure that Save-Page-As-Complete-HTML always saves as UTF8. Before the CL, UTF32-encoded html files would render properly, but they would be incorrectly saved as 1-byte-long file, because WebPageSerializerImpl would attempt to preserve [1] the original file encoding, but an IPC bug would corrupt/truncate UTF32 contents when sending them to the browser (see crbug.com/541699#c2). Note that even after fixing the IPC bug [10], saved page would not render properly, because the browser would not be able to correctly detect the encoding of the file saved in the original, utf-32 encoding. To fix the problems above, this CL makes sure that "complete-html" is always saved as UTF8. Other notes: - W3C recommends saving as UTF8 - [2], [3] - If saving in original encoding [1] was important, we would have noticed that saving of multi-byte encodings (UTF16, UTF32) was broken for 4+ years. - The only way to declare multi-byte encoding is via http headers and therefore encoding information gets lost when saving multi-byte encoded html docs into local files (i.e. there is no good place for saving / preserving the encoding information in the local files; in particular sniffing the <meta> element doesn't work for multi-byte encodings). - Existing PageSerializer and WebPageSerializeImpl already have code and set a precendent for modifying encoding references inside document content: [4] - PageSerializer.cpp - isCharsetSpecifyingNode. [5] - PageSerializer.cpp - emitting meta element w/ charset info [7] - WebPageSerializerImpl.cpp - skipping the original meta element [8] - WebPageSerializerImpl.cpp - emitting encoding in xml declaration [9] - WebPageSerializerImpl.cpp - emitting meta element w/ charset info - This CL continues the current behavior of *not* emitting a BOM at the beginning of a saved html file. I am not sure if this is an issue. W3C documents I've seen so far do not have a strong recommendation for or against inclusion of a BOM. - This CL applies only to save as "complete html" and has no impact to save as "html only" or save as "mhtml". Test steps: - Built chrome and manually saved-as-complete-html and inspected result for http://anforowicz.github.io/html-encoding/utf32.html - Run the new save_page_browsertest.cc and manually verified correctness of the output (preserved by calling save_dir_Take()). - Manually saved a page with euc-kr encoding from the modified WebPageSerializerTest and verified that the 2 characters are still visually distinct after save-page-as-complete-html. [1] https://chromium.googlesource.com/chromium/src/+/d083d29530a507b7b5a6df1c5727...). [2] http://www.w3.org/TR/encoding/ "The utf-8 encoding is the most appropriate encoding for interchange of Unicode, the universal coded character set. Therefore for new protocols and formats, as well as existing formats deployed in new contexts, this specification requires (and defines) the utf-8 encoding." [3] http://www.w3.org/TR/html5/document-metadata#charset: "Authors should use UTF-8. Conformance checkers may advise authors against using legacy encodings. [RFC3629]" "Authoring tools should default to using UTF-8 for newly-created documents. [RFC3629]" [4] https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c... [5] https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c... [6] https://codereview.chromium.org/1373573002/diff/120001/content/browser/downlo... [7] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [8] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [9] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [10] https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242... BUG=541699 TEST=See "Test steps" above. ==========
lukasza@chromium.org changed reviewers: + jsbell@chromium.org, kojii@chromium.org
jsbell@ + kojii@ - could you please take a look? In particular: 1. Please shout if always saving as utf-8 is wrong for some reason. 2. Please review my changes that tweak the test you've added in crrev.com/1413123005. 2.a. I cannot cover these 2 characters via save_page_browsertest.cc, because they are indistinguishable via find-in-page (although they do look visually distinct if one looks closely). 2.b. My changes to WebPageSerializerTest.cpp are silly / mechanical (i.e. they make the test pass in a tautological way by copying "actual" into "expected") but I think the CL overall does the right thing and I've manually verified that these 2 Korean characters are still visually distinct when saving a euc-kr page after my changes.
Quick nit about the CL description: it uses "multi-byte encodings" in a non-standard way. That term usually refers to any encoding where a code point MAY be represented as more than one byte - e.g. shift_jis, big5 or even utf-8. You're using it to refer to encodings that are always > 1 byte per code point and include 0x00 bytes (resulting in immediately truncated cstrings). Can you reword? I don't know if there's a good term, so maybe just list UTF-16/UTF-32 explicitly.
Description was changed from ========== Make sure that Save-Page-As-Complete-HTML always saves as UTF8. Before the CL, UTF32-encoded html files would render properly, but they would be incorrectly saved as 1-byte-long file, because WebPageSerializerImpl would attempt to preserve [1] the original file encoding, but an IPC bug would corrupt/truncate UTF32 contents when sending them to the browser (see crbug.com/541699#c2). Note that even after fixing the IPC bug [10], saved page would not render properly, because the browser would not be able to correctly detect the encoding of the file saved in the original, utf-32 encoding. To fix the problems above, this CL makes sure that "complete-html" is always saved as UTF8. Other notes: - W3C recommends saving as UTF8 - [2], [3] - If saving in original encoding [1] was important, we would have noticed that saving of multi-byte encodings (UTF16, UTF32) was broken for 4+ years. - The only way to declare multi-byte encoding is via http headers and therefore encoding information gets lost when saving multi-byte encoded html docs into local files (i.e. there is no good place for saving / preserving the encoding information in the local files; in particular sniffing the <meta> element doesn't work for multi-byte encodings). - Existing PageSerializer and WebPageSerializeImpl already have code and set a precendent for modifying encoding references inside document content: [4] - PageSerializer.cpp - isCharsetSpecifyingNode. [5] - PageSerializer.cpp - emitting meta element w/ charset info [7] - WebPageSerializerImpl.cpp - skipping the original meta element [8] - WebPageSerializerImpl.cpp - emitting encoding in xml declaration [9] - WebPageSerializerImpl.cpp - emitting meta element w/ charset info - This CL continues the current behavior of *not* emitting a BOM at the beginning of a saved html file. I am not sure if this is an issue. W3C documents I've seen so far do not have a strong recommendation for or against inclusion of a BOM. - This CL applies only to save as "complete html" and has no impact to save as "html only" or save as "mhtml". Test steps: - Built chrome and manually saved-as-complete-html and inspected result for http://anforowicz.github.io/html-encoding/utf32.html - Run the new save_page_browsertest.cc and manually verified correctness of the output (preserved by calling save_dir_Take()). - Manually saved a page with euc-kr encoding from the modified WebPageSerializerTest and verified that the 2 characters are still visually distinct after save-page-as-complete-html. [1] https://chromium.googlesource.com/chromium/src/+/d083d29530a507b7b5a6df1c5727...). [2] http://www.w3.org/TR/encoding/ "The utf-8 encoding is the most appropriate encoding for interchange of Unicode, the universal coded character set. Therefore for new protocols and formats, as well as existing formats deployed in new contexts, this specification requires (and defines) the utf-8 encoding." [3] http://www.w3.org/TR/html5/document-metadata#charset: "Authors should use UTF-8. Conformance checkers may advise authors against using legacy encodings. [RFC3629]" "Authoring tools should default to using UTF-8 for newly-created documents. [RFC3629]" [4] https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c... [5] https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c... [6] https://codereview.chromium.org/1373573002/diff/120001/content/browser/downlo... [7] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [8] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [9] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [10] https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242... BUG=541699 TEST=See "Test steps" above. ========== to ========== Make sure that Save-Page-As-Complete-HTML always saves as UTF8. Before the CL, UTF32-encoded html files would render properly, but they would be incorrectly saved as 1-byte-long file, because WebPageSerializerImpl would attempt to preserve [1] the original file encoding, but an IPC bug would corrupt/truncate UTF32 contents when sending them to the browser (see crbug.com/541699#c2). Note that even after fixing the IPC bug [10], saved page would not render properly, because the browser would not be able to correctly detect the encoding of the file saved in the original, utf-32 encoding. To fix the problems above, this CL makes sure that "complete-html" is always saved as UTF8. Other notes: - W3C recommends saving as UTF8 - [2], [3] - If saving in original encoding [1] was important, we would have noticed that saving of multi-byte encodings (UTF16, UTF32) was broken for 4+ years. - The only way to declare multi-byte encoding is via http headers and therefore encoding information gets lost when saving multi-byte encoded html docs into local files (i.e. there is no good place for saving / preserving the encoding information in the local files; in particular sniffing the <meta> element doesn't work for multi-byte encodings). - Existing PageSerializer and WebPageSerializeImpl already have code and set a precendent for modifying encoding references inside document content: [4] - PageSerializer.cpp - isCharsetSpecifyingNode. [5] - PageSerializer.cpp - emitting meta element w/ charset info [7] - WebPageSerializerImpl.cpp - skipping the original meta element [8] - WebPageSerializerImpl.cpp - emitting encoding in xml declaration [9] - WebPageSerializerImpl.cpp - emitting meta element w/ charset info - This CL continues the current behavior of *not* emitting a BOM at the beginning of a saved html file. I am not sure if this is an issue. W3C documents I've seen so far do not have a strong recommendation for or against inclusion of a BOM. - This CL applies only to save as "complete html" and has no impact to save as "html only" or save as "mhtml". Test steps: - Built chrome and manually saved-as-complete-html and inspected result for http://anforowicz.github.io/html-encoding/utf32.html - Run the new save_page_browsertest.cc and manually verified correctness of the output (preserved by calling save_dir_Take()). - Manually saved a page with euc-kr encoding from the modified WebPageSerializerTest and verified that the 2 characters are still visually distinct after save-page-as-complete-html. [1] https://chromium.googlesource.com/chromium/src/+/d083d29530a507b7b5a6df1c5727...). [2] http://www.w3.org/TR/encoding/ "The utf-8 encoding is the most appropriate encoding for interchange of Unicode, the universal coded character set. Therefore for new protocols and formats, as well as existing formats deployed in new contexts, this specification requires (and defines) the utf-8 encoding." [3] http://www.w3.org/TR/html5/document-metadata#charset: "Authors should use UTF-8. Conformance checkers may advise authors against using legacy encodings. [RFC3629]" "Authoring tools should default to using UTF-8 for newly-created documents. [RFC3629]" [4] https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c... [5] https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c... [6] https://codereview.chromium.org/1373573002/diff/120001/content/browser/downlo... [7] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [8] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [9] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [10] https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242... BUG=541699 TEST=See "Test steps" above. ==========
jsbell@chromium.org changed reviewers: + jshin@chromium.org
Description was changed from ========== Make sure that Save-Page-As-Complete-HTML always saves as UTF8. Before the CL, UTF32-encoded html files would render properly, but they would be incorrectly saved as 1-byte-long file, because WebPageSerializerImpl would attempt to preserve [1] the original file encoding, but an IPC bug would corrupt/truncate UTF32 contents when sending them to the browser (see crbug.com/541699#c2). Note that even after fixing the IPC bug [10], saved page would not render properly, because the browser would not be able to correctly detect the encoding of the file saved in the original, utf-32 encoding. To fix the problems above, this CL makes sure that "complete-html" is always saved as UTF8. Other notes: - W3C recommends saving as UTF8 - [2], [3] - If saving in original encoding [1] was important, we would have noticed that saving of multi-byte encodings (UTF16, UTF32) was broken for 4+ years. - The only way to declare multi-byte encoding is via http headers and therefore encoding information gets lost when saving multi-byte encoded html docs into local files (i.e. there is no good place for saving / preserving the encoding information in the local files; in particular sniffing the <meta> element doesn't work for multi-byte encodings). - Existing PageSerializer and WebPageSerializeImpl already have code and set a precendent for modifying encoding references inside document content: [4] - PageSerializer.cpp - isCharsetSpecifyingNode. [5] - PageSerializer.cpp - emitting meta element w/ charset info [7] - WebPageSerializerImpl.cpp - skipping the original meta element [8] - WebPageSerializerImpl.cpp - emitting encoding in xml declaration [9] - WebPageSerializerImpl.cpp - emitting meta element w/ charset info - This CL continues the current behavior of *not* emitting a BOM at the beginning of a saved html file. I am not sure if this is an issue. W3C documents I've seen so far do not have a strong recommendation for or against inclusion of a BOM. - This CL applies only to save as "complete html" and has no impact to save as "html only" or save as "mhtml". Test steps: - Built chrome and manually saved-as-complete-html and inspected result for http://anforowicz.github.io/html-encoding/utf32.html - Run the new save_page_browsertest.cc and manually verified correctness of the output (preserved by calling save_dir_Take()). - Manually saved a page with euc-kr encoding from the modified WebPageSerializerTest and verified that the 2 characters are still visually distinct after save-page-as-complete-html. [1] https://chromium.googlesource.com/chromium/src/+/d083d29530a507b7b5a6df1c5727...). [2] http://www.w3.org/TR/encoding/ "The utf-8 encoding is the most appropriate encoding for interchange of Unicode, the universal coded character set. Therefore for new protocols and formats, as well as existing formats deployed in new contexts, this specification requires (and defines) the utf-8 encoding." [3] http://www.w3.org/TR/html5/document-metadata#charset: "Authors should use UTF-8. Conformance checkers may advise authors against using legacy encodings. [RFC3629]" "Authoring tools should default to using UTF-8 for newly-created documents. [RFC3629]" [4] https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c... [5] https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c... [6] https://codereview.chromium.org/1373573002/diff/120001/content/browser/downlo... [7] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [8] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [9] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [10] https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242... BUG=541699 TEST=See "Test steps" above. ========== to ========== Make sure that Save-Page-As-Complete-HTML always saves as UTF8. Before the CL, UTF32-encoded html files would render properly, but they would be incorrectly saved as 1-byte-long file, because WebPageSerializerImpl would attempt to preserve [1] the original file encoding, but an IPC bug would corrupt/truncate UTF32 contents when sending them to the browser (see crbug.com/541699#c2). Note that even after fixing the IPC bug [10], saved page would not render properly (most likely because the browser would not be able to correctly detect the encoding of the file saved in the original, utf-32 encoding). To fix the problems above, this CL makes sure that "complete-html" is always saved as UTF8. Other notes: - W3C recommends saving as UTF8 - [2], [3] - If saving in original encoding [1] was important, we would have noticed that saving of UTF16 and/or UTF32 encodings (*) was broken for 4+ years. (*) The issue was affecting any encoding that could include '\0' byte. - Declaring the encoding via <meta> element only works if the encoding is an ASCII-compatible character encoding (otherwise, the browser would not be able to sniff the <meta> element). This means that for not-ASCII-compatible encodings (i.e. UTF32) the encoding can be declared via http headers, but there is no good place for saving / preserving the encoding information in the local files. This problem doesn't happen for UTF8. - Existing PageSerializer and WebPageSerializeImpl already have code and set a precendent for modifying encoding references inside document content: [4] - PageSerializer.cpp - isCharsetSpecifyingNode. [5] - PageSerializer.cpp - emitting meta element w/ charset info [7] - WebPageSerializerImpl.cpp - skipping the original meta element [8] - WebPageSerializerImpl.cpp - emitting encoding in xml declaration [9] - WebPageSerializerImpl.cpp - emitting meta element w/ charset info - This CL continues the current behavior of *not* emitting a BOM at the beginning of a saved html file. I am not sure if this is an issue. W3C documents I've seen so far do not have a strong recommendation for or against inclusion of a BOM. - This CL applies only to save as "complete html" and has no impact to save as "html only" or save as "mhtml". Test steps: - Built chrome and manually saved-as-complete-html and inspected result for http://anforowicz.github.io/html-encoding/utf32.html - Run the new save_page_browsertest.cc and manually verified correctness of the output (preserved by calling save_dir_Take()). - Manually saved a page with euc-kr encoding from the modified WebPageSerializerTest and verified that the 2 characters are still visually distinct after save-page-as-complete-html. [1] https://chromium.googlesource.com/chromium/src/+/d083d29530a507b7b5a6df1c5727...). [2] http://www.w3.org/TR/encoding/ "The utf-8 encoding is the most appropriate encoding for interchange of Unicode, the universal coded character set. Therefore for new protocols and formats, as well as existing formats deployed in new contexts, this specification requires (and defines) the utf-8 encoding." [3] http://www.w3.org/TR/html5/document-metadata#charset: "Authors should use UTF-8. Conformance checkers may advise authors against using legacy encodings. [RFC3629]" "Authoring tools should default to using UTF-8 for newly-created documents. [RFC3629]" [4] https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c... [5] https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c... [6] https://codereview.chromium.org/1373573002/diff/120001/content/browser/downlo... [7] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [8] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [9] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [10] https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242... BUG=541699 TEST=See "Test steps" above. ==========
Description was changed from ========== Make sure that Save-Page-As-Complete-HTML always saves as UTF8. Before the CL, UTF32-encoded html files would render properly, but they would be incorrectly saved as 1-byte-long file, because WebPageSerializerImpl would attempt to preserve [1] the original file encoding, but an IPC bug would corrupt/truncate UTF32 contents when sending them to the browser (see crbug.com/541699#c2). Note that even after fixing the IPC bug [10], saved page would not render properly (most likely because the browser would not be able to correctly detect the encoding of the file saved in the original, utf-32 encoding). To fix the problems above, this CL makes sure that "complete-html" is always saved as UTF8. Other notes: - W3C recommends saving as UTF8 - [2], [3] - If saving in original encoding [1] was important, we would have noticed that saving of UTF16 and/or UTF32 encodings (*) was broken for 4+ years. (*) The issue was affecting any encoding that could include '\0' byte. - Declaring the encoding via <meta> element only works if the encoding is an ASCII-compatible character encoding (otherwise, the browser would not be able to sniff the <meta> element). This means that for not-ASCII-compatible encodings (i.e. UTF32) the encoding can be declared via http headers, but there is no good place for saving / preserving the encoding information in the local files. This problem doesn't happen for UTF8. - Existing PageSerializer and WebPageSerializeImpl already have code and set a precendent for modifying encoding references inside document content: [4] - PageSerializer.cpp - isCharsetSpecifyingNode. [5] - PageSerializer.cpp - emitting meta element w/ charset info [7] - WebPageSerializerImpl.cpp - skipping the original meta element [8] - WebPageSerializerImpl.cpp - emitting encoding in xml declaration [9] - WebPageSerializerImpl.cpp - emitting meta element w/ charset info - This CL continues the current behavior of *not* emitting a BOM at the beginning of a saved html file. I am not sure if this is an issue. W3C documents I've seen so far do not have a strong recommendation for or against inclusion of a BOM. - This CL applies only to save as "complete html" and has no impact to save as "html only" or save as "mhtml". Test steps: - Built chrome and manually saved-as-complete-html and inspected result for http://anforowicz.github.io/html-encoding/utf32.html - Run the new save_page_browsertest.cc and manually verified correctness of the output (preserved by calling save_dir_Take()). - Manually saved a page with euc-kr encoding from the modified WebPageSerializerTest and verified that the 2 characters are still visually distinct after save-page-as-complete-html. [1] https://chromium.googlesource.com/chromium/src/+/d083d29530a507b7b5a6df1c5727...). [2] http://www.w3.org/TR/encoding/ "The utf-8 encoding is the most appropriate encoding for interchange of Unicode, the universal coded character set. Therefore for new protocols and formats, as well as existing formats deployed in new contexts, this specification requires (and defines) the utf-8 encoding." [3] http://www.w3.org/TR/html5/document-metadata#charset: "Authors should use UTF-8. Conformance checkers may advise authors against using legacy encodings. [RFC3629]" "Authoring tools should default to using UTF-8 for newly-created documents. [RFC3629]" [4] https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c... [5] https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c... [6] https://codereview.chromium.org/1373573002/diff/120001/content/browser/downlo... [7] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [8] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [9] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [10] https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242... BUG=541699 TEST=See "Test steps" above. ========== to ========== Make sure that Save-Page-As-Complete-HTML always saves as UTF8. Before the CL, UTF32-encoded html files would render properly, but they would be incorrectly saved as 1-byte-long file, because WebPageSerializerImpl would attempt to preserve [1] the original file encoding, but an IPC bug would corrupt/truncate UTF32 contents when sending them to the browser (see crbug.com/541699#c2). Note that even after fixing the IPC bug [10], saved page would not render properly (most likely because the browser would not be able to correctly detect the encoding of the file saved in the original, utf-32 encoding). To fix the problems above, this CL makes sure that "complete-html" is always saved as UTF8. Other notes: - W3C recommends saving as UTF8 - [2], [3] - If saving in original encoding [1] was important, we would have noticed that saving of UTF16 and/or UTF32 encodings (*) was broken for 4+ years. (*) The issue was affecting any encoding that could include '\0' byte. - Declaring the encoding via <meta> element only works if the encoding is an ASCII-compatible character encoding [3] (otherwise, the browser would not be able to sniff the <meta> element). This means that for not-ASCII-compatible encodings (i.e. UTF32) the encoding can be declared via http headers, but there is no good place for saving / preserving the encoding information in the local files. This problem doesn't happen for UTF8. - Existing PageSerializer and WebPageSerializeImpl already have code and set a precendent for modifying encoding references inside document content: [4] - PageSerializer.cpp - isCharsetSpecifyingNode. [5] - PageSerializer.cpp - emitting meta element w/ charset info [7] - WebPageSerializerImpl.cpp - skipping the original meta element [8] - WebPageSerializerImpl.cpp - emitting encoding in xml declaration [9] - WebPageSerializerImpl.cpp - emitting meta element w/ charset info - This CL continues the current behavior of *not* emitting a BOM at the beginning of a saved html file. I am not sure if this is an issue. W3C documents I've seen so far do not have a strong recommendation for or against inclusion of a BOM. - This CL applies only to save as "complete html" and has no impact to save as "html only" or save as "mhtml". Test steps: - Built chrome and manually saved-as-complete-html and inspected result for http://anforowicz.github.io/html-encoding/utf32.html - Run the new save_page_browsertest.cc and manually verified correctness of the output (preserved by calling save_dir_Take()). - Manually saved a page with euc-kr encoding from the modified WebPageSerializerTest and verified that the 2 characters are still visually distinct after save-page-as-complete-html. [1] https://chromium.googlesource.com/chromium/src/+/d083d29530a507b7b5a6df1c5727...). [2] http://www.w3.org/TR/encoding/ "The utf-8 encoding is the most appropriate encoding for interchange of Unicode, the universal coded character set. Therefore for new protocols and formats, as well as existing formats deployed in new contexts, this specification requires (and defines) the utf-8 encoding." [3] http://www.w3.org/TR/html5/document-metadata#charset: "Authors should use UTF-8. Conformance checkers may advise authors against using legacy encodings. [RFC3629]" "Authoring tools should default to using UTF-8 for newly-created documents. [RFC3629]" [4] https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c... [5] https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c... [6] https://codereview.chromium.org/1373573002/diff/120001/content/browser/downlo... [7] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [8] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [9] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [10] https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242... BUG=541699 TEST=See "Test steps" above. ==========
Description was changed from ========== Make sure that Save-Page-As-Complete-HTML always saves as UTF8. Before the CL, UTF32-encoded html files would render properly, but they would be incorrectly saved as 1-byte-long file, because WebPageSerializerImpl would attempt to preserve [1] the original file encoding, but an IPC bug would corrupt/truncate UTF32 contents when sending them to the browser (see crbug.com/541699#c2). Note that even after fixing the IPC bug [10], saved page would not render properly (most likely because the browser would not be able to correctly detect the encoding of the file saved in the original, utf-32 encoding). To fix the problems above, this CL makes sure that "complete-html" is always saved as UTF8. Other notes: - W3C recommends saving as UTF8 - [2], [3] - If saving in original encoding [1] was important, we would have noticed that saving of UTF16 and/or UTF32 encodings (*) was broken for 4+ years. (*) The issue was affecting any encoding that could include '\0' byte. - Declaring the encoding via <meta> element only works if the encoding is an ASCII-compatible character encoding [3] (otherwise, the browser would not be able to sniff the <meta> element). This means that for not-ASCII-compatible encodings (i.e. UTF32) the encoding can be declared via http headers, but there is no good place for saving / preserving the encoding information in the local files. This problem doesn't happen for UTF8. - Existing PageSerializer and WebPageSerializeImpl already have code and set a precendent for modifying encoding references inside document content: [4] - PageSerializer.cpp - isCharsetSpecifyingNode. [5] - PageSerializer.cpp - emitting meta element w/ charset info [7] - WebPageSerializerImpl.cpp - skipping the original meta element [8] - WebPageSerializerImpl.cpp - emitting encoding in xml declaration [9] - WebPageSerializerImpl.cpp - emitting meta element w/ charset info - This CL continues the current behavior of *not* emitting a BOM at the beginning of a saved html file. I am not sure if this is an issue. W3C documents I've seen so far do not have a strong recommendation for or against inclusion of a BOM. - This CL applies only to save as "complete html" and has no impact to save as "html only" or save as "mhtml". Test steps: - Built chrome and manually saved-as-complete-html and inspected result for http://anforowicz.github.io/html-encoding/utf32.html - Run the new save_page_browsertest.cc and manually verified correctness of the output (preserved by calling save_dir_Take()). - Manually saved a page with euc-kr encoding from the modified WebPageSerializerTest and verified that the 2 characters are still visually distinct after save-page-as-complete-html. [1] https://chromium.googlesource.com/chromium/src/+/d083d29530a507b7b5a6df1c5727...). [2] http://www.w3.org/TR/encoding/ "The utf-8 encoding is the most appropriate encoding for interchange of Unicode, the universal coded character set. Therefore for new protocols and formats, as well as existing formats deployed in new contexts, this specification requires (and defines) the utf-8 encoding." [3] http://www.w3.org/TR/html5/document-metadata#charset: "Authors should use UTF-8. Conformance checkers may advise authors against using legacy encodings. [RFC3629]" "Authoring tools should default to using UTF-8 for newly-created documents. [RFC3629]" [4] https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c... [5] https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c... [6] https://codereview.chromium.org/1373573002/diff/120001/content/browser/downlo... [7] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [8] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [9] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [10] https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242... BUG=541699 TEST=See "Test steps" above. ========== to ========== Make sure that Save-Page-As-Complete-HTML always saves as UTF8. Before the CL, UTF32-encoded html files would render properly, but they would be incorrectly saved as 1-byte-long file, because WebPageSerializerImpl would attempt to preserve [1] the original file encoding, but an IPC bug would corrupt/truncate UTF32 contents when sending them to the browser (see crbug.com/541699#c2). Note that even after fixing the IPC bug [10], saved page would not render properly (most likely because the browser would not be able to correctly detect the encoding of the file saved in the original, utf-32 encoding). To fix the problems above, this CL makes sure that "complete-html" is always saved as UTF8. Other notes: - W3C recommends saving as UTF8 - [2], [3] - If saving in original encoding [1] was important, we would have noticed that saving of UTF16 and/or UTF32 encodings (*) was broken for 4+ years. (*) The IPC issue was affecting any encoding that could include '\0' byte. - Declaring the encoding via <meta> element only works if the encoding is an ASCII-compatible character encoding [3] (otherwise, the browser would not be able to sniff the <meta> element). This means that for not-ASCII-compatible encodings (i.e. UTF32) the encoding can be declared via http headers, but there is no good place for saving / preserving the encoding information in the local files. This problem doesn't happen for UTF8. - Existing PageSerializer and WebPageSerializeImpl already have code and set a precendent for modifying encoding references inside document content: [4] - PageSerializer.cpp - isCharsetSpecifyingNode. [5] - PageSerializer.cpp - emitting meta element w/ charset info [7] - WebPageSerializerImpl.cpp - skipping the original meta element [8] - WebPageSerializerImpl.cpp - emitting encoding in xml declaration [9] - WebPageSerializerImpl.cpp - emitting meta element w/ charset info - This CL continues the current behavior of *not* emitting a BOM at the beginning of a saved html file. I am not sure if this is an issue. W3C documents I've seen so far do not have a strong recommendation for or against inclusion of a BOM. - This CL applies only to save as "complete html" and has no impact to save as "html only" or save as "mhtml". Test steps: - Built chrome and manually saved-as-complete-html and inspected result for http://anforowicz.github.io/html-encoding/utf32.html - Run the new save_page_browsertest.cc and manually verified correctness of the output (preserved by calling save_dir_Take()). - Manually saved a page with euc-kr encoding from the modified WebPageSerializerTest and verified that the 2 characters are still visually distinct after save-page-as-complete-html. [1] https://chromium.googlesource.com/chromium/src/+/d083d29530a507b7b5a6df1c5727...). [2] http://www.w3.org/TR/encoding/ "The utf-8 encoding is the most appropriate encoding for interchange of Unicode, the universal coded character set. Therefore for new protocols and formats, as well as existing formats deployed in new contexts, this specification requires (and defines) the utf-8 encoding." [3] http://www.w3.org/TR/html5/document-metadata#charset: "Authors should use UTF-8. Conformance checkers may advise authors against using legacy encodings. [RFC3629]" "Authoring tools should default to using UTF-8 for newly-created documents. [RFC3629]" [4] https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c... [5] https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c... [6] https://codereview.chromium.org/1373573002/diff/120001/content/browser/downlo... [7] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [8] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [9] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [10] https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242... BUG=541699 TEST=See "Test steps" above. ==========
Description was changed from ========== Make sure that Save-Page-As-Complete-HTML always saves as UTF8. Before the CL, UTF32-encoded html files would render properly, but they would be incorrectly saved as 1-byte-long file, because WebPageSerializerImpl would attempt to preserve [1] the original file encoding, but an IPC bug would corrupt/truncate UTF32 contents when sending them to the browser (see crbug.com/541699#c2). Note that even after fixing the IPC bug [10], saved page would not render properly (most likely because the browser would not be able to correctly detect the encoding of the file saved in the original, utf-32 encoding). To fix the problems above, this CL makes sure that "complete-html" is always saved as UTF8. Other notes: - W3C recommends saving as UTF8 - [2], [3] - If saving in original encoding [1] was important, we would have noticed that saving of UTF16 and/or UTF32 encodings (*) was broken for 4+ years. (*) The IPC issue was affecting any encoding that could include '\0' byte. - Declaring the encoding via <meta> element only works if the encoding is an ASCII-compatible character encoding [3] (otherwise, the browser would not be able to sniff the <meta> element). This means that for not-ASCII-compatible encodings (i.e. UTF32) the encoding can be declared via http headers, but there is no good place for saving / preserving the encoding information in the local files. This problem doesn't happen for UTF8. - Existing PageSerializer and WebPageSerializeImpl already have code and set a precendent for modifying encoding references inside document content: [4] - PageSerializer.cpp - isCharsetSpecifyingNode. [5] - PageSerializer.cpp - emitting meta element w/ charset info [7] - WebPageSerializerImpl.cpp - skipping the original meta element [8] - WebPageSerializerImpl.cpp - emitting encoding in xml declaration [9] - WebPageSerializerImpl.cpp - emitting meta element w/ charset info - This CL continues the current behavior of *not* emitting a BOM at the beginning of a saved html file. I am not sure if this is an issue. W3C documents I've seen so far do not have a strong recommendation for or against inclusion of a BOM. - This CL applies only to save as "complete html" and has no impact to save as "html only" or save as "mhtml". Test steps: - Built chrome and manually saved-as-complete-html and inspected result for http://anforowicz.github.io/html-encoding/utf32.html - Run the new save_page_browsertest.cc and manually verified correctness of the output (preserved by calling save_dir_Take()). - Manually saved a page with euc-kr encoding from the modified WebPageSerializerTest and verified that the 2 characters are still visually distinct after save-page-as-complete-html. [1] https://chromium.googlesource.com/chromium/src/+/d083d29530a507b7b5a6df1c5727...). [2] http://www.w3.org/TR/encoding/ "The utf-8 encoding is the most appropriate encoding for interchange of Unicode, the universal coded character set. Therefore for new protocols and formats, as well as existing formats deployed in new contexts, this specification requires (and defines) the utf-8 encoding." [3] http://www.w3.org/TR/html5/document-metadata#charset: "Authors should use UTF-8. Conformance checkers may advise authors against using legacy encodings. [RFC3629]" "Authoring tools should default to using UTF-8 for newly-created documents. [RFC3629]" [4] https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c... [5] https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c... [6] https://codereview.chromium.org/1373573002/diff/120001/content/browser/downlo... [7] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [8] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [9] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [10] https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242... BUG=541699 TEST=See "Test steps" above. ========== to ========== Make sure that Save-Page-As-Complete-HTML always saves as UTF8. Before the CL, UTF32-encoded html files would render properly, but they would be incorrectly saved as 1-byte-long file, because WebPageSerializerImpl would attempt to preserve [1] the original file encoding, but an IPC bug would corrupt/truncate UTF32 contents when sending them to the browser (see crbug.com/541699#c2). Note that even after fixing the IPC bug [10], saved page would not render properly (most likely because the browser would not be able to correctly detect the encoding of the file saved in the original, utf-32 encoding). To fix the problems above, this CL makes sure that "complete-html" is always saved as UTF8. Other notes: - W3C recommends saving as UTF8 - [2], [3] - If saving in original encoding [1] was important, we would have noticed that saving of UTF16 and/or UTF32 encodings (*) was broken for 4+ years. (*) The IPC issue was affecting any encoding that could include '\0' byte. - Declaring the encoding via <meta> element only works if the encoding is an ASCII-compatible character encoding [3] (otherwise, the browser would not be able to sniff the <meta> element). This means that for not-ASCII-compatible encodings (i.e. UTF32) the encoding can be declared via http headers, but there is no good place for saving / preserving the encoding information in the local files. This problem doesn't happen for UTF8. - Existing PageSerializer and WebPageSerializeImpl already have code and set a precendent for modifying encoding references inside document content: [4] - PageSerializer.cpp - isCharsetSpecifyingNode. [5] - PageSerializer.cpp - emitting meta element w/ charset info [7] - WebPageSerializerImpl.cpp - skipping the original meta element [8] - WebPageSerializerImpl.cpp - emitting encoding in xml declaration [9] - WebPageSerializerImpl.cpp - emitting meta element w/ charset info - This CL continues the current behavior of *not* emitting a BOM at the beginning of a saved html file. I am not sure if this is an issue. W3C documents I've seen so far do not have a strong recommendation for or against inclusion of a BOM. - This CL applies only to save as "complete html" and has no impact to save as "html only" or save as "mhtml". Test steps: - Built chrome and manually saved-as-complete-html and inspected result for http://anforowicz.github.io/html-encoding/utf32.html - Run the new save_page_browsertest.cc and manually verified correctness of the output (preserved by calling save_dir_.Take()). - Manually saved a page with euc-kr encoding from the modified WebPageSerializerTest and verified that the 2 characters are still visually distinct after save-page-as-complete-html. [1] https://chromium.googlesource.com/chromium/src/+/d083d29530a507b7b5a6df1c5727...). [2] http://www.w3.org/TR/encoding/ "The utf-8 encoding is the most appropriate encoding for interchange of Unicode, the universal coded character set. Therefore for new protocols and formats, as well as existing formats deployed in new contexts, this specification requires (and defines) the utf-8 encoding." [3] http://www.w3.org/TR/html5/document-metadata#charset: "Authors should use UTF-8. Conformance checkers may advise authors against using legacy encodings. [RFC3629]" "Authoring tools should default to using UTF-8 for newly-created documents. [RFC3629]" [4] https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c... [5] https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c... [6] https://codereview.chromium.org/1373573002/diff/120001/content/browser/downlo... [7] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [8] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [9] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [10] https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242... BUG=541699 TEST=See "Test steps" above. ==========
On 2015/12/02 23:26:10, jsbell wrote: > Quick nit about the CL description: it uses "multi-byte encodings" in a > non-standard way. That term usually refers to any encoding where a code point > MAY be represented as more than one byte - e.g. shift_jis, big5 or even utf-8. > You're using it to refer to encodings that are always > 1 byte per code point > and include 0x00 bytes (resulting in immediately truncated cstrings). > > Can you reword? I don't know if there's a good term, so maybe just list > UTF-16/UTF-32 explicitly. Done. Thanks for the feedback.
+jshin@ Thanks for the detailed CL description, BTW. It's answered most of the questions I've come up with. The CL description mostly refers to UTF-32, a vary rarely used encoding. Even with the IPC bug fixed, does the issue still affect UTF-16? If we serialize a BOM for UTF-16 (and UTF-32?) documents, do they render correctly? (i.e. maybe we can do that instead?) On 2015/12/02 22:04:36, Łukasz Anforowicz wrote: > jsbell@ + kojii@ - could you please take a look? In particular: > > 1. Please shout if always saving as utf-8 is wrong for some reason. Since pages can introspect the encoding (via `document.charset` etc) and the page's encoding is used for form submissions and URL parsing, this would change behavior. But... I'm not sure how much we neuter saved page content such that those would matter. One thought: should we scope this change to just UTF-16/UTF-32 pages, and leave the ASCII-superset(-ish) encodings (that can be identified by a <meta> tag) alone? > 2. Please review my changes that tweak the test you've added in > crrev.com/1413123005. It looks okay - the important thing is that it's not NFC normalized (and it isn't). If we do always encode to utf-8 then the change is fine; we previously wanted to ensure a non-utf-8 encoding was used since the code path differs there. > 2.a. I cannot cover these 2 characters via save_page_browsertest.cc, because > they are indistinguishable via find-in-page (although they do look visually > distinct if one looks closely). Good to know! We must normalize when searching, which makes perfect sense! > 2.b. My changes to WebPageSerializerTest.cpp are silly / mechanical (i.e. > they make the test pass in a tautological way by copying "actual" into > "expected") but I think the CL overall does the right thing and I've manually > verified that these 2 Korean characters are still visually distinct when saving > a euc-kr page after my changes. As noted, seems okay given the premise (always encode to utf-8). Re: BOM - agreed, don't write out a BOM for UTF-8. Other random questions: * stylesheets - I wonder if we detect BOMs there to sniff utf-16/32? * script files - Do we need to worry about <script charset="utf-32"> ?
I agree with jsbell@, in general, it's fine to always save in utf-8 as long as we don't normalize, and all relevant declarations are updated. But the latter is likely to be not a small work. * Is <meta charset> supported already? Couldn't read whether it's done or not. This is probably a must support even for utf-16/utf-32 only. * CSS declarations (@charset)? * Do we strip utf-16/utf-32 BOM if they exist? CSS honors BOM over @charset. See the tests below. * Forms...yeah, agree with jsbell@, it's less important for it to work for saved pages. * XHTML? * SVG? There are some tests in W3C I18N how major browsers behave for the precedence: http://www.w3.org/International/tests/repository/html5/the-input-byte-stream/... http://www.w3.org/International/tests/repo/results/css-input-byte-stream#prec...
One more random thought, since IIUC what we want is not to save in UTF-8, but to fix UTF-32. What's wrong here looks to me that we handle encoded byte array as WebCString. True that in ASCII or Latin the encoded byte array looks like a string, but the encoded stream is really a byte array, not a string. What about changing the type to WebData or WebVector<uint_8>? They should work well with IPC, no?
On 2015/12/04 05:08:59, kojii wrote: > One more random thought, since IIUC what we want is not to save in UTF-8, but to > fix UTF-32. > > What's wrong here looks to me that we handle encoded byte array as WebCString. > True that in ASCII or Latin the encoded byte array looks like a string, but the > encoded stream is really a byte array, not a string. > > What about changing the type to WebData or WebVector<uint_8>? They should work > well with IPC, no? See the CL description - the IPC issue was apparently fixed, but for non-ASCII-superset encodings we fail to parse since there's no header indicating the encoding - and I guess we don't sniff it correctly. I made a suggestion in #32 that maybe serializing a BOM would cause the parser to select a correct encoding, but that was just a guess. (BTW, the use of UTF-16 is very low. I'm surprised that there is ANY usage of UTF-32 in the wild. Does any other browser still support that?)
jsbell@ / kojii@, thank you for your feedback. Can you please take a look at my replies below? On 2015/12/03 00:09:47, jsbell wrote: > +jshin@ > > Thanks for the detailed CL description, BTW. It's answered most of the questions > I've come up with. > > The CL description mostly refers to UTF-32, a vary rarely used encoding. Even > with the IPC bug fixed, does the issue still affect UTF-16? The issue still affects UTF16: 1. UTF16 without BOM (both LE and BE) does not render properly even before save-page-as, unless there are http headers declaring the encoding. 2. All of the files originally encoded as UTF-16LE, UTF-16LE-WITH-BOM, UTF-16BE, UTF-16BE-WITH-BOM, UTF-32 were not correctly round-tripped by save-page-as. All 5 files were rendered as if the browser could not detect the encoding or content type (i.e. html markup itself got rendered). 3. Regardless of whether the original file had or didn't have BOM, the saved file always did NOT include a BOM. 4. The UTF16 files without BOM included extra strange character (right before </body>) when saved. In binary it looked like this (this is BE): original: dot eol space space lt original: 002e 000a 0020 0020 003c saved: 002e 000a 0020 0020 000a 0a0a 003c > If we serialize a BOM for UTF-16 (and UTF-32?) documents, do they render > correctly? (i.e. maybe we can do that instead?) Yes. Sort of: - Tested by manually prepending a BOM to all of UTF16 LE / BE and ORIGINALLY-HAD-BOM / ORIGINALLY-HAD-NOBOM and to UTF32 (only LE in my tests). - Things worked fine, except as noted above there was an extra strange character as noted in 4. above QUESTION1: Could you please help me see how to emit BOM properly? I don't see BOM-related methods in third_party/WebKit/Source/modules/encoding/TextEncoder.h or third_party/WebKit/Source/wtf/text/TextEncoding.h. I guess I can manually emit the U+FEFF character from WebPageSerializerImpl.cpp, but then this should only be done for a subset of encodings (i.e. definitely not for something like iso-8859-2) and I am not sure if I know how to properly detect the desirability of BOM at runtime. QUESTION2: Do you have any advise on landing test files in utf16 encoding? I've tried using .gitattributes (i.e. see https://codereview.chromium.org/1407663004/patch/120001/130002 in an older patchset), but this made the tooling unhappy (the patch could not be applied when trying to run trybots against this patchset). QUESTION3: Do you think always saving as UTF8 (maybe only for a subset of encodings [but what subset?]) is still an option? Or should I abandon this approach given lack of rewriting of encoding declaration in some places (most notably inline css stylesheets)? The always-include-BOM approach looks like a promising alternative (thank you for pushing me gently in this direction). > > On 2015/12/02 22:04:36, Łukasz Anforowicz wrote: > > jsbell@ + kojii@ - could you please take a look? In particular: > > > > 1. Please shout if always saving as utf-8 is wrong for some reason. > > Since pages can introspect the encoding (via `document.charset` etc) and the > page's encoding is used for form submissions and URL parsing, this would change > behavior. But... I'm not sure how much we neuter saved page content such that > those would matter. > > One thought: should we scope this change to just UTF-16/UTF-32 pages, and leave > the ASCII-superset(-ish) encodings (that can be identified by a <meta> tag) > alone? Possibly. I am not sure if we need to worry about passing through (now incorrect) encoding declaration from inline css. On one hand things do not work today, otoh there is no point in making them limp along with a new (although smaller IMO) kind of bug. > > > 2. Please review my changes that tweak the test you've added in > > crrev.com/1413123005. > > It looks okay - the important thing is that it's not NFC normalized (and it > isn't). If we do always encode to utf-8 then the change is fine; we previously > wanted to ensure a non-utf-8 encoding was used since the code path differs > there. > > > 2.a. I cannot cover these 2 characters via save_page_browsertest.cc, > because > > they are indistinguishable via find-in-page (although they do look visually > > distinct if one looks closely). > > Good to know! We must normalize when searching, which makes perfect sense! > > > 2.b. My changes to WebPageSerializerTest.cpp are silly / mechanical (i.e. > > they make the test pass in a tautological way by copying "actual" into > > "expected") but I think the CL overall does the right thing and I've manually > > verified that these 2 Korean characters are still visually distinct when > saving > > a euc-kr page after my changes. > > As noted, seems okay given the premise (always encode to utf-8). > > Re: BOM - agreed, don't write out a BOM for UTF-8. > > Other random questions: > > * stylesheets - I wonder if we detect BOMs there to sniff utf-16/32? > * script files - Do we need to worry about <script charset="utf-32"> ? The important thing here is that non-frames / non-html-docs are saved as-is, without any reencoding or link rewriting. Therefore: - I believe we don't need to worry about <script charset=...">. - If BOM sniffing works fine for web content, it should work fine for saved content. On 2015/12/03 03:41:45, kojii wrote: > I agree with jsbell@, in general, it's fine to always save in utf-8 as long as > we don't normalize, and all relevant declarations are updated. But the latter is > likely to be not a small work. Agreed. Getting the rewriting right is tricky. I am even surprised that we try to rewrite the <meta> element today (given that by default we don't change the encoding of the original decoding; at least not before this CL). > > * Is <meta charset> supported already? Couldn't read whether it's done or not. > This is probably a must support even for utf-16/utf-32 only. Yes. Although I've added html5-style-meta support very recently - crrev.com/1426503002 > * CSS declarations (@charset)? Not AFAIK. QUESTION: Does this matter for encodings like utf16, which are not ascii-compatible / where I am guessing we won't be able to sniff the <meta> element? > * Do we strip utf-16/utf-32 BOM if they exist? CSS honors BOM over @charset. See > the tests below. We do strip BOM when serializing HTML documents (at least for "complete-html" kind; I would have to double-check what we do to html docs encoded within mhtml, but based on what I've seen I think we also strip the BOM. > * Forms...yeah, agree with jsbell@, it's less important for it to work for saved > pages. > * XHTML? Yes - we rewrite (at least in WebPageSerializerImpl.cpp) > * SVG? SVG (as a non-html-doc) is saved as-is. So rewriting/preserving of BOM is N/A. > > There are some tests in W3C I18N how major browsers behave for the precedence: > http://www.w3.org/International/tests/repository/html5/the-input-byte-stream/... > http://www.w3.org/International/tests/repo/results/css-input-byte-stream#prec... On 2015/12/04 05:08:59, kojii wrote: > One more random thought, since IIUC what we want is not to save in UTF-8, but to > fix UTF-32. > > What's wrong here looks to me that we handle encoded byte array as WebCString. > True that in ASCII or Latin the encoded byte array looks like a string, but the > encoded stream is really a byte array, not a string. > > What about changing the type to WebData or WebVector<uint_8>? They should work > well with IPC, no? Current IPC code works fine now (i.e. file contents are not truncated to the first '\0' byte anymore, as they used to). It used to (implicitly) call std::string(web_cstring.data(). Not it works fine with calling std::string(web_cstring).
The reason why I am looking at save-page-as behavior for save-page-as is that some time ago I needed to test save-page-as with an exotic encoding (to see if browser / renderer would disagree on what encoding to use in an abandoned approach to an OOPIF-related problem). I happened to pick UTF32 for my test which led me to discover the IPC problem and the round-tripping problem. Now - *if you think it is not worth spending the effort on utf16 and/or utf32 encodings - please say so*. I am only vaguely familiar with the encodings problem area and picked utf32 for my tests somewhat randomly / arbitrarly (i.e. without thinking if it occurs in the wild). On Fri, Dec 4, 2015 at 9:55 AM, <jsbell@chromium.org> wrote: > On 2015/12/04 05:08:59, kojii wrote: > >> One more random thought, since IIUC what we want is not to save in UTF-8, >> but >> > to > >> fix UTF-32. >> > > What's wrong here looks to me that we handle encoded byte array as >> WebCString. >> True that in ASCII or Latin the encoded byte array looks like a string, >> but >> > the > >> encoded stream is really a byte array, not a string. >> > > What about changing the type to WebData or WebVector<uint_8>? They should >> work >> well with IPC, no? >> > > See the CL description - the IPC issue was apparently fixed, but for > non-ASCII-superset encodings we fail to parse since there's no header > indicating > the encoding - and I guess we don't sniff it correctly. I made a > suggestion in > #32 that maybe serializing a BOM would cause the parser to select a correct > encoding, but that was just a guess. > > (BTW, the use of UTF-16 is very low. I'm surprised that there is ANY usage > of > UTF-32 in the wild. Does any other browser still support that?) > > > https://codereview.chromium.org/1407663004/ > -- Thanks, Lukasz -- You received this message because you are subscribed to the Google Groups "Blink Reviews" group. To unsubscribe from this group and stop receiving emails from it, send an email to blink-reviews+unsubscribe@chromium.org.
The reason why I am looking at save-page-as behavior for save-page-as is that some time ago I needed to test save-page-as with an exotic encoding (to see if browser / renderer would disagree on what encoding to use in an abandoned approach to an OOPIF-related problem). I happened to pick UTF32 for my test which led me to discover the IPC problem and the round-tripping problem. Now - *if you think it is not worth spending the effort on utf16 and/or utf32 encodings - please say so*. I am only vaguely familiar with the encodings problem area and picked utf32 for my tests somewhat randomly / arbitrarly (i.e. without thinking if it occurs in the wild). On Fri, Dec 4, 2015 at 9:55 AM, <jsbell@chromium.org> wrote: > On 2015/12/04 05:08:59, kojii wrote: > >> One more random thought, since IIUC what we want is not to save in UTF-8, >> but >> > to > >> fix UTF-32. >> > > What's wrong here looks to me that we handle encoded byte array as >> WebCString. >> True that in ASCII or Latin the encoded byte array looks like a string, >> but >> > the > >> encoded stream is really a byte array, not a string. >> > > What about changing the type to WebData or WebVector<uint_8>? They should >> work >> well with IPC, no? >> > > See the CL description - the IPC issue was apparently fixed, but for > non-ASCII-superset encodings we fail to parse since there's no header > indicating > the encoding - and I guess we don't sniff it correctly. I made a > suggestion in > #32 that maybe serializing a BOM would cause the parser to select a correct > encoding, but that was just a guess. > > (BTW, the use of UTF-16 is very low. I'm surprised that there is ANY usage > of > UTF-32 in the wild. Does any other browser still support that?) > > > https://codereview.chromium.org/1407663004/ > -- Thanks, Lukasz -- You received this message because you are subscribed to the Google Groups "Chromium-reviews" group. To unsubscribe from this group and stop receiving emails from it, send an email to chromium-reviews+unsubscribe@chromium.org.
On 2015/12/04 18:20:38, Łukasz Anforowicz wrote: > > If we serialize a BOM for UTF-16 (and UTF-32?) documents, do they render > > correctly? (i.e. maybe we can do that instead?) > > Yes. Sort of: Yay! > - Tested by manually prepending a BOM to all of UTF16 LE / BE > and ORIGINALLY-HAD-BOM / ORIGINALLY-HAD-NOBOM and to UTF32 (only LE in my > tests). > > - Things worked fine, except as noted above there was an extra strange > character as noted in 4. above > QUESTION1: Could you please help me see how to emit BOM properly? Yeah, we don't need to write out BOMs when posting (since we send a header) and for the Encoding API we decided that callers could prepend it manually if desired. Plumbing a "prepend BOM" flag through TextEncoding::encode to TextCodec::encode seems reasonable to me. On most codecs it would do nothing, and we'd add support in TextCodecUTF16. I don't know where the extra character is coming from. 0A0A looks suspiciously like the low byte of the proceeding character (U+000A LINE FEED) repeated twice - it could be a bug in the UTF-16 encoding pipeline? Does this only appear if you manually prepend the BOM (maybe a buffer off-by-one error somewhere?) or is it always present? > QUESTION2: Do you have any advise on landing test files in utf16 encoding? Hmmm, I've had problems with the CQ but not try jobs... for landing, just `git cl land` the resource files manually after you get an l*g*t*m, before the test if possible. If the encoded files *are* the test, just `git cl land` them. But again, it sounds like you're having issues on the try bots? Anyone else have suggestions? > QUESTION3: Do you think always saving as UTF8 (maybe only for a subset of > encodings [but what subset?]) is still an option? Or should I abandon this > approach given lack of rewriting of encoding declaration in some places (most > notably inline css stylesheets)? The always-include-BOM approach looks like a > promising alternative (thank you for pushing me gently in this direction). I'd go with the BOM. It's not crazy - ENCODING (which HTML defers to) states: "the byte order mark (also known as BOM) is considered more authoritative than anything else" On 2015/12/04 18:24:54, chromium-reviews wrote: > Now - *if you think it is not worth spending the effort on utf16 and/or > utf32 encodings - please say so*. I am only vaguely familiar with the > encodings problem area and picked utf32 for my tests somewhat randomly / > arbitrarly (i.e. without thinking if it occurs in the wild). UTF-16 is worth fixing since that's the default encoding for Unicode content on Windows it's still generated by tools when the developer isn't aware. I speak from experience that it's easy to have an all windows-1252 ASP.NET site, make a minor content change, and have pages silently served as UTF-16 without noticing because it just works. If UTF-32 requires *any* extra work I'd defer.
Sorry, I missed the discussion point in my previous reply. On 2015/12/04 at 23:12:01, jsbell wrote: > > I'd go with the BOM. It's not crazy - ENCODING (which HTML defers to) states: "the byte order mark (also known as BOM) is considered more authoritative than anything else" I agree, adding BOM is a nice fix. BOM wins over other signatures except the HTTP header, see the precedence tests: http://www.w3.org/International/tests/repository/html5/the-input-byte-stream/...
Description was changed from ========== Make sure that Save-Page-As-Complete-HTML always saves as UTF8. Before the CL, UTF32-encoded html files would render properly, but they would be incorrectly saved as 1-byte-long file, because WebPageSerializerImpl would attempt to preserve [1] the original file encoding, but an IPC bug would corrupt/truncate UTF32 contents when sending them to the browser (see crbug.com/541699#c2). Note that even after fixing the IPC bug [10], saved page would not render properly (most likely because the browser would not be able to correctly detect the encoding of the file saved in the original, utf-32 encoding). To fix the problems above, this CL makes sure that "complete-html" is always saved as UTF8. Other notes: - W3C recommends saving as UTF8 - [2], [3] - If saving in original encoding [1] was important, we would have noticed that saving of UTF16 and/or UTF32 encodings (*) was broken for 4+ years. (*) The IPC issue was affecting any encoding that could include '\0' byte. - Declaring the encoding via <meta> element only works if the encoding is an ASCII-compatible character encoding [3] (otherwise, the browser would not be able to sniff the <meta> element). This means that for not-ASCII-compatible encodings (i.e. UTF32) the encoding can be declared via http headers, but there is no good place for saving / preserving the encoding information in the local files. This problem doesn't happen for UTF8. - Existing PageSerializer and WebPageSerializeImpl already have code and set a precendent for modifying encoding references inside document content: [4] - PageSerializer.cpp - isCharsetSpecifyingNode. [5] - PageSerializer.cpp - emitting meta element w/ charset info [7] - WebPageSerializerImpl.cpp - skipping the original meta element [8] - WebPageSerializerImpl.cpp - emitting encoding in xml declaration [9] - WebPageSerializerImpl.cpp - emitting meta element w/ charset info - This CL continues the current behavior of *not* emitting a BOM at the beginning of a saved html file. I am not sure if this is an issue. W3C documents I've seen so far do not have a strong recommendation for or against inclusion of a BOM. - This CL applies only to save as "complete html" and has no impact to save as "html only" or save as "mhtml". Test steps: - Built chrome and manually saved-as-complete-html and inspected result for http://anforowicz.github.io/html-encoding/utf32.html - Run the new save_page_browsertest.cc and manually verified correctness of the output (preserved by calling save_dir_.Take()). - Manually saved a page with euc-kr encoding from the modified WebPageSerializerTest and verified that the 2 characters are still visually distinct after save-page-as-complete-html. [1] https://chromium.googlesource.com/chromium/src/+/d083d29530a507b7b5a6df1c5727...). [2] http://www.w3.org/TR/encoding/ "The utf-8 encoding is the most appropriate encoding for interchange of Unicode, the universal coded character set. Therefore for new protocols and formats, as well as existing formats deployed in new contexts, this specification requires (and defines) the utf-8 encoding." [3] http://www.w3.org/TR/html5/document-metadata#charset: "Authors should use UTF-8. Conformance checkers may advise authors against using legacy encodings. [RFC3629]" "Authoring tools should default to using UTF-8 for newly-created documents. [RFC3629]" [4] https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c... [5] https://chromium.googlesource.com/chromium/src/+/d2a014c5d15083e944894b3a624c... [6] https://codereview.chromium.org/1373573002/diff/120001/content/browser/downlo... [7] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [8] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [9] https://chromium.googlesource.com/chromium/src/+/11d41626fbe8e88fb507371df3be... [10] https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242... BUG=541699 TEST=See "Test steps" above. ========== to ========== Tweaking WebPageSerializerImpl to emit a BOM for UTF16. Previously save-page-as would truncate saved html files if they would include a '\0' byte (i.e. if saving a html document that was encoded as utf16). The truncation doesn't happen anymore [1], but the saved html files still don't round-trip properly, because the browser is not able to detect the encoding of the files (<meta> character cannot be sniffed in encodings that are not ascii compatible [2] + it is obviously impossible for local files to declare their encoding via http headers). This CL makes changes to always emits a BOM when saving in UTF16 encoding. This in-turn causes round-tripping tests in save_page_browsertest.cc to pass for utf16-encoded frames. BOM behavior for UTF16 is easy to control via wtf/text/TextCodecUTF16, but unfortunately this is not the case for UTF32 (which goes through wtf/text/TextCodecICU) and consequently this CL doesn't fix round-trippng of UTF32-encoded frames. This is okay, because UTF32 is much more rare than UTF16. Note that this CL applies only to save as "complete html" and has no impact to save as "html only" or save as "mhtml". [1] https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242... [2] http://www.w3.org/TR/html5/document-metadata#charset: If an HTML document contains a meta element with a charset attribute or a meta element with an http-equiv attribute in the encoding declaration state, then the character encoding used must be an ASCII-compatible character encoding. BUG=541699 ==========
jsbell@, could you please take another look? Things that are ready for review: - API tweaks for TextEncoding and TextCodec classes (changes under wtf/text). - Concatenating CStrings via ArrayBufferBuilder (changes in WebPageSerializerImpl.cpp). - Not fixing UTF32 - I didn't see an easy way to implement TextCodecICU::shouldIncludeBOM (other than just deciding based on the encoding name... yuck?). Notes: - This is not quite ready to land yet - I am not yet confident that the test files I have locally are correctly packaged for consumption by git / rietveld / etc. - TextEncoding::encode is called multiple times by WebPageSerializerImpl::encodeAndFlushBuffer. Therefore, even before this CL, we would instantiate TextCodec multiple times (once per call to TextEncoding::encode). I've added a TODO note about reusing this object. If this is desirable, I can try doing this in a separate CL.
On 2015/12/04 23:12:01, jsbell wrote: > I don't know where the extra character is coming from. 0A0A looks suspiciously > like the low byte of the proceeding character (U+000A LINE FEED) repeated twice > - it could be a bug in the UTF-16 encoding pipeline? Does this only appear if > you manually prepend the BOM (maybe a buffer off-by-one error somewhere?) or is > it always present? It was always present as the last 2 bytes in the original non-bom files. I don't know how it got here (I suspect gvim). After removing this mistake in the test files, the funny character is no longer appearing in the test results.
What's the status of those extra 0x0A bytes you were seeing? A result of temp code? https://codereview.chromium.org/1407663004/diff/210001/third_party/WebKit/Sou... File third_party/WebKit/Source/web/WebPageSerializerImpl.cpp (right): https://codereview.chromium.org/1407663004/diff/210001/third_party/WebKit/Sou... third_party/WebKit/Source/web/WebPageSerializerImpl.cpp:279: CString encodedBOM; Move this block up above the encodedContent initialization, as BOM comes before content? https://codereview.chromium.org/1407663004/diff/210001/third_party/WebKit/Sou... third_party/WebKit/Source/web/WebPageSerializerImpl.cpp:281: encodedBOM = param->textEncoding.encodeBOMifApplicable(); I was going to suggest some changes to encodeBOMifApplicable (and request a comment, etc) but as I was reviewing this I ran across: TextEncoding::isNonByteBasedEncoding() ... which captures the semantics we want. I think that's sufficiently close to what we want such that we should instead just call that from WebPageSerializerImpl::encodeAndFlushBuffer and make no changes to text/wtf. (Sorry, I should have known about this before.) We should comment in the code something like: // When serializing with encodings that are not ASCII-compatible // (e.g. UTF-16) it is necessary to prefix the content with a // byte order mark (BOM) so that it can be parsed correctly. if (param->textEncoding.isNonByteBasedEncoding()) { ... } https://codereview.chromium.org/1407663004/diff/210001/third_party/WebKit/Sou... third_party/WebKit/Source/web/WebPageSerializerImpl.cpp:292: result = WebCString( Don't need to wrap in blink. https://codereview.chromium.org/1407663004/diff/210001/third_party/WebKit/Sou... File third_party/WebKit/Source/web/WebPageSerializerImpl.h (right): https://codereview.chromium.org/1407663004/diff/210001/third_party/WebKit/Sou... third_party/WebKit/Source/web/WebPageSerializerImpl.h:106: bool m_noBytesEncodedYet; Can you flip the semantics of this so that it starts false and becomes true, e.g. m_hasEncodedBytes ? https://codereview.chromium.org/1407663004/diff/210001/third_party/WebKit/Sou... File third_party/WebKit/Source/wtf/text/TextEncoding.cpp (right): https://codereview.chromium.org/1407663004/diff/210001/third_party/WebKit/Sou... third_party/WebKit/Source/wtf/text/TextEncoding.cpp:90: // across repeated calls to encode and/or encodeBOMifApplicable. That would change the behavior for stateful encodings, unfortunately. Callers can use TextCodec directly if that level of control is desired. (e.g. Source/modules/TextEncoding.cpp)
Description was changed from ========== Tweaking WebPageSerializerImpl to emit a BOM for UTF16. Previously save-page-as would truncate saved html files if they would include a '\0' byte (i.e. if saving a html document that was encoded as utf16). The truncation doesn't happen anymore [1], but the saved html files still don't round-trip properly, because the browser is not able to detect the encoding of the files (<meta> character cannot be sniffed in encodings that are not ascii compatible [2] + it is obviously impossible for local files to declare their encoding via http headers). This CL makes changes to always emits a BOM when saving in UTF16 encoding. This in-turn causes round-tripping tests in save_page_browsertest.cc to pass for utf16-encoded frames. BOM behavior for UTF16 is easy to control via wtf/text/TextCodecUTF16, but unfortunately this is not the case for UTF32 (which goes through wtf/text/TextCodecICU) and consequently this CL doesn't fix round-trippng of UTF32-encoded frames. This is okay, because UTF32 is much more rare than UTF16. Note that this CL applies only to save as "complete html" and has no impact to save as "html only" or save as "mhtml". [1] https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242... [2] http://www.w3.org/TR/html5/document-metadata#charset: If an HTML document contains a meta element with a charset attribute or a meta element with an http-equiv attribute in the encoding declaration state, then the character encoding used must be an ASCII-compatible character encoding. BUG=541699 ========== to ========== Tweaking WebPageSerializerImpl to emit a BOM for UTF16/32. Previously save-page-as would truncate saved html files if they would include a '\0' byte (i.e. if saving a html document that was encoded as utf16). The truncation doesn't happen anymore [1], but the saved html files still don't round-trip properly, because the browser is not able to detect the encoding of the files (<meta> character cannot be sniffed in encodings that are not ascii compatible [2] + it is obviously impossible for local files to declare their encoding via http headers). This CL makes changes to always emits a BOM when saving in UTF16 or UTF32 encoding. This in-turn causes round-tripping tests in save_page_browsertest.cc to pass for tests using these encodings. Note that this CL applies only to save as "complete html" and has no impact to save as "html only" or save as "mhtml". [1] https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242... [2] http://www.w3.org/TR/html5/document-metadata#charset: If an HTML document contains a meta element with a charset attribute or a meta element with an http-equiv attribute in the encoding declaration state, then the character encoding used must be an ASCII-compatible character encoding. BUG=541699 ==========
jsbell@, can you take another look please? Thank you for the previous round of CR feedback, but after following your suggestion to use isNonByteBasedEncoding these CR comments are no longer applicable. Therefore I've left them unanswered. On 2015/12/07 22:53:51, jsbell wrote: > What's the status of those extra 0x0A bytes you were seeing? A result of temp > code? It turned out that this extra character was present as the last 2 bytes in the input/original files. I don't know how it got here (I suspect gvim). After removing this mistake in the test files, the funny character is no longer appearing in the test results.
I like the direction this CL is trending. :) Overall lgtm with nits, but you'll need OWNERS reviews https://codereview.chromium.org/1407663004/diff/230001/chrome/browser/downloa... File chrome/browser/download/save_page_browsertest.cc (right): https://codereview.chromium.org/1407663004/diff/230001/chrome/browser/downloa... chrome/browser/download/save_page_browsertest.cc:980: std::vector<std::string> forbidden_substrings{ http://chromium-cpp.appspot.com/ says that the "uniform initialization syntax" is banned. :( https://codereview.chromium.org/1407663004/diff/230001/chrome/browser/downloa... chrome/browser/download/save_page_browsertest.cc:1090: // Test for saving frames with various encodings (crbug.com/541699). I don't think including the bug# here is necessary; this comment explains what's going on. https://codereview.chromium.org/1407663004/diff/230001/chrome/browser/downloa... chrome/browser/download/save_page_browsertest.cc:1098: std::vector<std::string> expected_substrings{ (see above) https://codereview.chromium.org/1407663004/diff/230001/chrome/browser/downloa... chrome/browser/download/save_page_browsertest.cc:1100: "iso-8859-2.htm: Zażółć gęślą jaźń", Use \x## escapes so that the source is ASCII when possible, as not all developers configure editors for UTF-8. https://codereview.chromium.org/1407663004/diff/230001/third_party/WebKit/Sou... File third_party/WebKit/Source/web/WebPageSerializerImpl.cpp (right): https://codereview.chromium.org/1407663004/diff/230001/third_party/WebKit/Sou... third_party/WebKit/Source/web/WebPageSerializerImpl.cpp:457: const UChar bomCharacter = 0xFEFF; nit: maybe call this `byteOrderMark` as it's not technically a character and not everyone knows BOM
On 2015/12/07 23:32:36, Łukasz Anforowicz wrote: > It turned out that this extra character was present as the last 2 bytes in the > input/original files. Yay! > I don't know how it got here (I suspect gvim). Another reason to use emacs. (just kidding!)
dcheng@chromium.org changed reviewers: + tkent@chromium.org
third_party/WebKit LGTM, but you'll need an owner in wtf (+tkent for that review)
On 2015/12/08 at 01:33:35, dcheng wrote: > third_party/WebKit LGTM, but you'll need an owner in wtf (+tkent for that review) (with jsbell's comment in WebKit addressed)
On Monday, December 7, 2015, <dcheng@chromium.org> wrote: > On 2015/12/08 at 01:33:35, dcheng wrote: > >> third_party/WebKit LGTM, but you'll need an owner in wtf (+tkent for that >> > review) > > (with jsbell's comment in WebKit addressed) > > https://codereview.chromium.org/1407663004/ > > -- > You received this message because you are subscribed to the Google Groups > "Chromium Site Isolation Reviews" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to site-isolation-reviews+unsubscribe@chromium.org. > For more options, visit https://groups.google.com/a/chromium.org/d/optout. > -- You received this message because you are subscribed to the Google Groups "Blink Reviews" group. To unsubscribe from this group and stop receiving emails from it, send an email to blink-reviews+unsubscribe@chromium.org.
On Monday, December 7, 2015, <dcheng@chromium.org> wrote: > On 2015/12/08 at 01:33:35, dcheng wrote: > >> third_party/WebKit LGTM, but you'll need an owner in wtf (+tkent for that >> > review) > > (with jsbell's comment in WebKit addressed) > > https://codereview.chromium.org/1407663004/ > > -- > You received this message because you are subscribed to the Google Groups > "Chromium Site Isolation Reviews" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to site-isolation-reviews+unsubscribe@chromium.org. > For more options, visit https://groups.google.com/a/chromium.org/d/optout. > -- You received this message because you are subscribed to the Google Groups "Chromium-reviews" group. To unsubscribe from this group and stop receiving emails from it, send an email to chromium-reviews+unsubscribe@chromium.org.
wtf lgtm
jsbell@, could you please take another look? (I've tried to address your feedback in the most recent patchset) dcheng@ + tkent@ - thanks for reviewing. Please note that at this point the CL is failing the new test on the try bots. I suspect problems with patching / preserving the binary content of files under chrome/test/data/save_page/encoding*.htm. I think I'll land these files manually, in a separate CL, with a .gitattributes asking to treat them as binary. Please let me know if you think this is a bad idea. https://codereview.chromium.org/1407663004/diff/230001/chrome/browser/downloa... File chrome/browser/download/save_page_browsertest.cc (right): https://codereview.chromium.org/1407663004/diff/230001/chrome/browser/downloa... chrome/browser/download/save_page_browsertest.cc:980: std::vector<std::string> forbidden_substrings{ On 2015/12/08 01:19:07, jsbell wrote: > http://chromium-cpp.appspot.com/ says that the "uniform initialization syntax" > is banned. :( Yes, sorry about that. I should have rebased onto crrev.com/1441553002 (which transitions to arrays) earlier. https://codereview.chromium.org/1407663004/diff/230001/chrome/browser/downloa... chrome/browser/download/save_page_browsertest.cc:1090: // Test for saving frames with various encodings (crbug.com/541699). On 2015/12/08 01:19:07, jsbell wrote: > I don't think including the bug# here is necessary; this comment explains what's > going on. Done. https://codereview.chromium.org/1407663004/diff/230001/chrome/browser/downloa... chrome/browser/download/save_page_browsertest.cc:1098: std::vector<std::string> expected_substrings{ On 2015/12/08 01:19:07, jsbell wrote: > (see above) Done. https://codereview.chromium.org/1407663004/diff/230001/chrome/browser/downloa... chrome/browser/download/save_page_browsertest.cc:1100: "iso-8859-2.htm: Zażółć gęślą jaźń", On 2015/12/08 01:19:07, jsbell wrote: > Use \x## escapes so that the source is ASCII when possible, as not all > developers configure editors for UTF-8. Is \u escape ok? It fits into 80 columns limit better... https://codereview.chromium.org/1407663004/diff/230001/third_party/WebKit/Sou... File third_party/WebKit/Source/web/WebPageSerializerImpl.cpp (right): https://codereview.chromium.org/1407663004/diff/230001/third_party/WebKit/Sou... third_party/WebKit/Source/web/WebPageSerializerImpl.cpp:457: const UChar bomCharacter = 0xFEFF; On 2015/12/08 01:19:07, jsbell wrote: > nit: maybe call this `byteOrderMark` as it's not technically a character and not > everyone knows BOM Done.
still lgtm - apart from the \u#### usage, alas. https://codereview.chromium.org/1407663004/diff/230001/chrome/browser/downloa... File chrome/browser/download/save_page_browsertest.cc (right): https://codereview.chromium.org/1407663004/diff/230001/chrome/browser/downloa... chrome/browser/download/save_page_browsertest.cc:1100: "iso-8859-2.htm: Zażółć gęślą jaźń", On 2015/12/09 00:50:59, Łukasz Anforowicz wrote: > On 2015/12/08 01:19:07, jsbell wrote: > > Use \x## escapes so that the source is ASCII when possible, as not all > > developers configure editors for UTF-8. > > Is \u escape ok? It fits into 80 columns limit better... https://google.github.io/styleguide/cppguide.html#Non-ASCII_Characters seems to say that you need to use u8"...." to ensure it's encoded as UTF-8, and the u8 prefix is on the banned list https://chromium-cpp.appspot.com/ since it doesn't work in MSVS2013 yet. So... looks like no. :(
jsbell@, thanks for the review. BTW: the separate CL for the test files is @ crrev.com/1506023007 https://codereview.chromium.org/1407663004/diff/230001/chrome/browser/downloa... File chrome/browser/download/save_page_browsertest.cc (right): https://codereview.chromium.org/1407663004/diff/230001/chrome/browser/downloa... chrome/browser/download/save_page_browsertest.cc:1100: "iso-8859-2.htm: Zażółć gęślą jaźń", On 2015/12/09 01:02:47, jsbell wrote: > On 2015/12/09 00:50:59, Łukasz Anforowicz wrote: > > On 2015/12/08 01:19:07, jsbell wrote: > > > Use \x## escapes so that the source is ASCII when possible, as not all > > > developers configure editors for UTF-8. > > > > Is \u escape ok? It fits into 80 columns limit better... > > https://google.github.io/styleguide/cppguide.html#Non-ASCII_Characters seems to > say that you need to use u8"...." to ensure it's encoded as UTF-8, and the u8 > prefix is on the banned list https://chromium-cpp.appspot.com/ since it doesn't > work in MSVS2013 yet. > > So... looks like no. :( Done.
Description was changed from ========== Tweaking WebPageSerializerImpl to emit a BOM for UTF16/32. Previously save-page-as would truncate saved html files if they would include a '\0' byte (i.e. if saving a html document that was encoded as utf16). The truncation doesn't happen anymore [1], but the saved html files still don't round-trip properly, because the browser is not able to detect the encoding of the files (<meta> character cannot be sniffed in encodings that are not ascii compatible [2] + it is obviously impossible for local files to declare their encoding via http headers). This CL makes changes to always emits a BOM when saving in UTF16 or UTF32 encoding. This in-turn causes round-tripping tests in save_page_browsertest.cc to pass for tests using these encodings. Note that this CL applies only to save as "complete html" and has no impact to save as "html only" or save as "mhtml". [1] https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242... [2] http://www.w3.org/TR/html5/document-metadata#charset: If an HTML document contains a meta element with a charset attribute or a meta element with an http-equiv attribute in the encoding declaration state, then the character encoding used must be an ASCII-compatible character encoding. BUG=541699 ========== to ========== Tweaking WebPageSerializerImpl to emit a BOM for UTF16/32. Previously save-page-as would truncate saved html files if they would include a '\0' byte (i.e. if saving a html document that was encoded as utf16). The truncation doesn't happen anymore [1], but the saved html files still don't round-trip properly, because the browser is not able to detect the encoding of the files (<meta> character cannot be sniffed in encodings that are not ascii compatible [2] + it is obviously impossible for local files to declare their encoding via http headers). This CL makes changes to always emits a BOM when saving in UTF16 or UTF32 encoding. This in-turn causes round-tripping tests in save_page_browsertest.cc to pass for tests using these encodings. Note that this CL applies only to save as "complete html" and has no impact to save as "html only" or save as "mhtml". [1] https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242... [2] http://www.w3.org/TR/html5/document-metadata#charset: If an HTML document contains a meta element with a charset attribute or a meta element with an http-equiv attribute in the encoding declaration state, then the character encoding used must be an ASCII-compatible character encoding. BUG=541699 ==========
On 2015/12/09 01:27:36, Łukasz Anforowicz wrote: > jsbell@, thanks for the review. > > BTW: the separate CL for the test files is @ crrev.com/1506023007 > > https://codereview.chromium.org/1407663004/diff/230001/chrome/browser/downloa... > File chrome/browser/download/save_page_browsertest.cc (right): > > https://codereview.chromium.org/1407663004/diff/230001/chrome/browser/downloa... > chrome/browser/download/save_page_browsertest.cc:1100: "iso-8859-2.htm: Zażółć > gęślą jaźń", > On 2015/12/09 01:02:47, jsbell wrote: > > On 2015/12/09 00:50:59, Łukasz Anforowicz wrote: > > > On 2015/12/08 01:19:07, jsbell wrote: > > > > Use \x## escapes so that the source is ASCII when possible, as not all > > > > developers configure editors for UTF-8. > > > > > > Is \u escape ok? It fits into 80 columns limit better... > > > > https://google.github.io/styleguide/cppguide.html#Non-ASCII_Characters seems > to > > say that you need to use u8"...." to ensure it's encoded as UTF-8, and the u8 > > prefix is on the banned list https://chromium-cpp.appspot.com/ since it > doesn't > > work in MSVS2013 yet. > > > > So... looks like no. :( > > Done. LGTM Sorry for being late here. jsbell@ and kojii@ asked all the questions I asked. I'd also have a reservation about always saving in UTF-8 so that the latest CL looks good to me. Besides, as you found out, MSVC's lack of support for u8 AND more importantly UTF-8 input file really sucks (if somebody tries to build Chrome with one of CJK legacy encodings set as the default non-Unicode encoding on Windows, they wouldn't be able to compile Chrome with UTF-8 raw strings in the source code). Due to that, the only way possible is to use hexescape as you did in the latest version. (I wish I could just use UTF-8 strings in Chrome tree as we do in google internal). See http://crbug.com/454858
The CQ bit was checked by lukasza@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1407663004/310001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1407663004/310001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
The CQ bit was checked by lukasza@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from nick@chromium.org, rdsmith@chromium.org, tkent@chromium.org, dcheng@chromium.org, jsbell@chromium.org Link to the patchset: https://codereview.chromium.org/1407663004/#ps310001 (title: "Rebased after landing the test files in a separate CL.")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1407663004/310001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1407663004/310001
Message was sent while issue was closed.
Description was changed from ========== Tweaking WebPageSerializerImpl to emit a BOM for UTF16/32. Previously save-page-as would truncate saved html files if they would include a '\0' byte (i.e. if saving a html document that was encoded as utf16). The truncation doesn't happen anymore [1], but the saved html files still don't round-trip properly, because the browser is not able to detect the encoding of the files (<meta> character cannot be sniffed in encodings that are not ascii compatible [2] + it is obviously impossible for local files to declare their encoding via http headers). This CL makes changes to always emits a BOM when saving in UTF16 or UTF32 encoding. This in-turn causes round-tripping tests in save_page_browsertest.cc to pass for tests using these encodings. Note that this CL applies only to save as "complete html" and has no impact to save as "html only" or save as "mhtml". [1] https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242... [2] http://www.w3.org/TR/html5/document-metadata#charset: If an HTML document contains a meta element with a charset attribute or a meta element with an http-equiv attribute in the encoding declaration state, then the character encoding used must be an ASCII-compatible character encoding. BUG=541699 ========== to ========== Tweaking WebPageSerializerImpl to emit a BOM for UTF16/32. Previously save-page-as would truncate saved html files if they would include a '\0' byte (i.e. if saving a html document that was encoded as utf16). The truncation doesn't happen anymore [1], but the saved html files still don't round-trip properly, because the browser is not able to detect the encoding of the files (<meta> character cannot be sniffed in encodings that are not ascii compatible [2] + it is obviously impossible for local files to declare their encoding via http headers). This CL makes changes to always emits a BOM when saving in UTF16 or UTF32 encoding. This in-turn causes round-tripping tests in save_page_browsertest.cc to pass for tests using these encodings. Note that this CL applies only to save as "complete html" and has no impact to save as "html only" or save as "mhtml". [1] https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242... [2] http://www.w3.org/TR/html5/document-metadata#charset: If an HTML document contains a meta element with a charset attribute or a meta element with an http-equiv attribute in the encoding declaration state, then the character encoding used must be an ASCII-compatible character encoding. BUG=541699 ==========
Message was sent while issue was closed.
Committed patchset #16 (id:310001)
Message was sent while issue was closed.
Description was changed from ========== Tweaking WebPageSerializerImpl to emit a BOM for UTF16/32. Previously save-page-as would truncate saved html files if they would include a '\0' byte (i.e. if saving a html document that was encoded as utf16). The truncation doesn't happen anymore [1], but the saved html files still don't round-trip properly, because the browser is not able to detect the encoding of the files (<meta> character cannot be sniffed in encodings that are not ascii compatible [2] + it is obviously impossible for local files to declare their encoding via http headers). This CL makes changes to always emits a BOM when saving in UTF16 or UTF32 encoding. This in-turn causes round-tripping tests in save_page_browsertest.cc to pass for tests using these encodings. Note that this CL applies only to save as "complete html" and has no impact to save as "html only" or save as "mhtml". [1] https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242... [2] http://www.w3.org/TR/html5/document-metadata#charset: If an HTML document contains a meta element with a charset attribute or a meta element with an http-equiv attribute in the encoding declaration state, then the character encoding used must be an ASCII-compatible character encoding. BUG=541699 ========== to ========== Tweaking WebPageSerializerImpl to emit a BOM for UTF16/32. Previously save-page-as would truncate saved html files if they would include a '\0' byte (i.e. if saving a html document that was encoded as utf16). The truncation doesn't happen anymore [1], but the saved html files still don't round-trip properly, because the browser is not able to detect the encoding of the files (<meta> character cannot be sniffed in encodings that are not ascii compatible [2] + it is obviously impossible for local files to declare their encoding via http headers). This CL makes changes to always emits a BOM when saving in UTF16 or UTF32 encoding. This in-turn causes round-tripping tests in save_page_browsertest.cc to pass for tests using these encodings. Note that this CL applies only to save as "complete html" and has no impact to save as "html only" or save as "mhtml". [1] https://chromium.googlesource.com/chromium/src/+/17e4a38aef61b832589933c94242... [2] http://www.w3.org/TR/html5/document-metadata#charset: If an HTML document contains a meta element with a charset attribute or a meta element with an http-equiv attribute in the encoding declaration state, then the character encoding used must be an ASCII-compatible character encoding. BUG=541699 Committed: https://crrev.com/36f0d157b6d2c6fae627a3aba955ad4e0fe07d71 Cr-Commit-Position: refs/heads/master@{#364394} ==========
Message was sent while issue was closed.
Patchset 16 (id:??) landed as https://crrev.com/36f0d157b6d2c6fae627a3aba955ad4e0fe07d71 Cr-Commit-Position: refs/heads/master@{#364394} |