|
|
DescriptionString:WriteUtf8: Add REPLACE_INVALID_UTF8 option
This patch makes String::WriteUtf8 replace invalid code points (i.e. unmatched
surrogates) with the unicode replacement character when REPLACE_INVALID_UTF8 is
set. This is done to avoid creating invalid UTF-8 output which can lead to
compatibility issues with software requiring valid UTF-8 inputs (e.g. the
WebSocket protocol requires valid UTF-8 and terminates connections when invalid
UTF-8 is encountered).
R=dcarney@chromium.org
BUG=
Committed: https://code.google.com/p/v8/source/detail?r=18683
Patch Set 1 #
Total comments: 12
Patch Set 2 : DISALLOW_INVALID_UTF8 flag and fixes #
Total comments: 4
Patch Set 3 : Abandon refactoring, get core behavior change done #
Total comments: 14
Patch Set 4 : Latest version #
Total comments: 8
Patch Set 5 : Cleanup, test for edge case + incorrect fix #Patch Set 6 : Fix mistake in test case, finish patch #
Total comments: 9
Patch Set 7 : Address remaining comments #Patch Set 8 : Address latest comments #Patch Set 9 : Rebase #
Messages
Total messages: 26 (0 generated)
Just a preliminary glance. What you're doing is okay, but all other callsites of Utf8::Encode would have to change and have the separated logic for rewriting the surrogates, which is probably not what you want. You may want to keep the rewriting in Encode with a new flag for the new type of rewriting. Additionally, the length computations must be adjust for flag Also, you should always run test-api/Utf16. Since it's handling a lot of edge cases you want to consider. https://codereview.chromium.org/121173009/diff/1/src/api.cc File src/api.cc (right): https://codereview.chromium.org/121173009/diff/1/src/api.cc#newcode4519 src/api.cc:4519: // @TODO use uint16_t for previous? previous is an int because of some special values used in the encoder, like kNoPreviousCharacter https://codereview.chromium.org/121173009/diff/1/src/api.cc#newcode4529 src/api.cc:4529: return written + Utf8::Encode(buffer, code_point, false); having the length calculation here is too late. String::WriteUtf8 must compute the identical length, which it can't do if you're potentially changing it here. https://codereview.chromium.org/121173009/diff/1/src/unicode-inl.h File src/unicode-inl.h (right): https://codereview.chromium.org/121173009/diff/1/src/unicode-inl.h#newcode114 src/unicode-inl.h:114: unsigned Utf8::Encode(char* str, uchar c, bool allow_invalid) { the name of the third argument is not the same as the declaration. https://codereview.chromium.org/121173009/diff/1/src/unicode-inl.h#newcode129 src/unicode-inl.h:129: } else if (c <= kMaxThreeByteChar) { can't chop this section out - need backwards compatibility for the other call sites of Encode https://codereview.chromium.org/121173009/diff/1/src/unicode.h File src/unicode.h (right): https://codereview.chromium.org/121173009/diff/1/src/unicode.h#newcode53 src/unicode.h:53: const int kReplacementCharacter = 0xFFFD; should probably be in the Utf8 class https://codereview.chromium.org/121173009/diff/1/src/unicode.h#newcode158 src/unicode.h:158: static inline unsigned Encode(char* str, uchar c, bool replace_surrogates); this is not backwards compatible. All existing call sites which you did not cut over are casting int to bool for the third argument
On 2014/01/04 15:56:45, dcarney wrote: > Just a preliminary glance. What you're doing is okay, but all other callsites > of Utf8::Encode would have to change and have the separated logic for rewriting > the surrogates, which is probably not what you want. Running `git grep 'Utf8::Encode'` shows 4 additional call sites, and yes, it was my plan to change them. IMO Utf8::Encode should convert unicode code points into the UTF-8 encoding. It should not be concerned with UTF-16 surrogates beyond the fact that they are invalid code points inside a UTF-8 stream. I mean I can completely see the argument for putting more UTF-16 awareness into v8's UTF-8 encoder given that JavaScript strings are made up from 16Bit units, but I think separation of concerns should be the more heavily weighted argument here. > Also, you should always run test-api/Utf16. Since it's handling a lot of edge > cases you want to consider. Thanks. Will do!
https://codereview.chromium.org/121173009/diff/1/src/api.cc File src/api.cc (right): https://codereview.chromium.org/121173009/diff/1/src/api.cc#newcode4519 src/api.cc:4519: // @TODO use uint16_t for previous? On 2014/01/04 15:56:45, dcarney wrote: > previous is an int because of some special values used in the encoder, like > kNoPreviousCharacter Makes sense. Thx. https://codereview.chromium.org/121173009/diff/1/src/api.cc#newcode4529 src/api.cc:4529: return written + Utf8::Encode(buffer, code_point, false); On 2014/01/04 15:56:45, dcarney wrote: > having the length calculation here is too late. String::WriteUtf8 must compute > the identical length, which it can't do if you're potentially changing it here. Ok, I'll take a closer look. https://codereview.chromium.org/121173009/diff/1/src/unicode-inl.h File src/unicode-inl.h (right): https://codereview.chromium.org/121173009/diff/1/src/unicode-inl.h#newcode114 src/unicode-inl.h:114: unsigned Utf8::Encode(char* str, uchar c, bool allow_invalid) { On 2014/01/04 15:56:45, dcarney wrote: > the name of the third argument is not the same as the declaration. Thx, will fix. https://codereview.chromium.org/121173009/diff/1/src/unicode-inl.h#newcode129 src/unicode-inl.h:129: } else if (c <= kMaxThreeByteChar) { On 2014/01/04 15:56:45, dcarney wrote: > can't chop this section out - need backwards compatibility for the other call > sites of Encode I'd prefer changing the other call sites if needed. I'll look into this today. But yeah, if that can't be done for some reason, I'll put this code back. https://codereview.chromium.org/121173009/diff/1/src/unicode.h File src/unicode.h (right): https://codereview.chromium.org/121173009/diff/1/src/unicode.h#newcode53 src/unicode.h:53: const int kReplacementCharacter = 0xFFFD; On 2014/01/04 15:56:45, dcarney wrote: > should probably be in the Utf8 class I don't feel strongly about it, but technically it's not directly related to UTF-8, but rather provided for all unicode encoders/decoders that might need it. https://codereview.chromium.org/121173009/diff/1/src/unicode.h#newcode158 src/unicode.h:158: static inline unsigned Encode(char* str, uchar c, bool replace_surrogates); On 2014/01/04 15:56:45, dcarney wrote: > this is not backwards compatible. Is there a good reason to keep compatibility, rather than changing the call sites?
> https://codereview.chromium.org/121173009/diff/1/src/unicode.h#newcode158 > src/unicode.h:158: static inline unsigned Encode(char* str, uchar c, bool > replace_surrogates); > On 2014/01/04 15:56:45, dcarney wrote: > > this is not backwards compatible. > > Is there a good reason to keep compatibility, rather than changing the call > sites? By compatibility, I just mean the current behaviour must be maintained. You can change the call sites, but they will all have to be changed to maintain the current invalid encoding rules. That would be fine, but it's a bit odd to have to push this logic out into various places. It would seem more sensible to somehow encapsulate what you're trying to do in unicode.*, rather than in api.cc.
On 2014/01/06 07:55:58, dcarney wrote: > > https://codereview.chromium.org/121173009/diff/1/src/unicode.h#newcode158 > > src/unicode.h:158: static inline unsigned Encode(char* str, uchar c, bool > > replace_surrogates); > > On 2014/01/04 15:56:45, dcarney wrote: > > > this is not backwards compatible. > > > > Is there a good reason to keep compatibility, rather than changing the call > > sites? > > By compatibility, I just mean the current behaviour must be maintained. You can > change the call sites, but they will all have to be changed to maintain the > current invalid encoding rules. That would be fine, but it's a bit odd to have > to push this logic out into various places. It would seem more sensible to > somehow encapsulate what you're trying to do in unicode.*, rather than in > api.cc. I'll maintain current behavior throughout the system. I agree that my patch can be done without changing the semantics of Utf8::Encode. However making Utf8::Encode more modular has been very useful to my ability to reason about the system, so I'd like to continue going into this direction.
> I'll maintain current behavior throughout the system. I agree that my patch can > be done without changing the semantics of Utf8::Encode. However making > Utf8::Encode more modular has been very useful to my ability to reason about the > system, so I'd like to continue going into this direction. The direction is fine, it's just that you need to maintain state (should_replace_characters, last_character, etc) in some kind of uniform way. I think some sort of class in unicode.h is the place to put it.
This patch set addresses most issues raised in the first review, adds the DISALLOW_INVALID_UTF8 flag. The next step is to fix up the remaining callers of Utf8::Encode and to find a better abstraction for all of this. I could imagine a UTF-16 Decoder that allows writing individual UTF-16 code units, and returns full unicode code points (and maybe even has the replacement logic). But I'll have to experiment with this first.
https://codereview.chromium.org/121173009/diff/130001/src/api.cc File src/api.cc (right): https://codereview.chromium.org/121173009/diff/130001/src/api.cc#newcode4523 src/api.cc:4523: static int WritePair(uint16_t current, WritePair is a bad name here, since it may or not be a pair. https://codereview.chromium.org/121173009/diff/130001/src/api.cc#newcode4556 src/api.cc:4556: int written = WritePair(current, this is not a surrogate pair, could use Encode directly https://codereview.chromium.org/121173009/diff/130001/src/unicode-inl.h File src/unicode-inl.h (right): https://codereview.chromium.org/121173009/diff/130001/src/unicode-inl.h#newco... src/unicode-inl.h:115: if (!allow_invalid && move this block down into the kMaxThreeByteChar clause. it's in the way of the common case
https://codereview.chromium.org/121173009/diff/130001/src/unicode-inl.h File src/unicode-inl.h (right): https://codereview.chromium.org/121173009/diff/130001/src/unicode-inl.h#newco... src/unicode-inl.h:155: // @TODO give this the same semantics as Encode? i don't see an easy way to do this. You'd have to then run over the byte stream with a second pass to adjust for matching surrogates, but the current implementation here already doing that correctly and efficiently for the case where the replacement character is inserted and the case that it isn't. I think you'll ultimately need to keep the rewriting above in Utf8::Encode as it is instead of moving it to WritePair in api.cc. Instead you can just write kReplacementCharacter if the flag is set. This makes the Length and Encode functions easier to understand, since they complement one another and are in the same place. Also, you won't have to change any other call sites of Encode, which is of great benefit.
On 2014/01/07 11:05:50, dcarney wrote: > I think you'll ultimately need to keep the rewriting above in Utf8::Encode as it > is instead of moving it to WritePair in api.cc. Instead you can just write > kReplacementCharacter if the flag is set. This makes the Length and Encode > functions easier to understand, since they complement one another and are in the > same place. Also, you won't have to change any other call sites of Encode, > which is of great benefit. Ok. I'll abandon my current approach, and try to add the new behavior on top of the existing Utf8::Encode semantics. I still think the UTF abstractions I've been seeing so far are extremely leaky and don't do a good job of encapsulation [1]. But I now realize that this should be seen as a separate issue and goes beyond the scope of what my patch is trying to accomplish. --fg [1]: I'll again admit to having very little experience in the domain of VM engineering, so I'm sure these problems have been considered before and the current solution deemed right given the constraints on performance. A part of me is just thinks that it should be possible to make a nice and simple UTF-16 -> UTF-8 converter that is entirely unaware of v8::String, and that this should be doable without performance regressions. Hacking on this has been great fun so far, so maybe I'll make another attempt in this direction in the future.
> I still think the UTF abstractions I've been seeing so far are extremely leaky > and don't do a good job of encapsulation [1]. But I now realize that this should > be seen as a separate issue and goes beyond the scope of what my patch is trying > to accomplish. This code could definitely use a refresh. The duplication is probably just historical/seemed like a good idea at the time sort of thing. It really should all be templatized/abstracted into the unicode files without knowledge of String as you say.
The new patch set abandons the attempt of refactoring the UTF-16 to UTF-8 mechanism, and focuses on getting the actual task done. Hopefully it covers the problem completely.
this patch is almost ready to land. There's one edge case which needs to be handled, and some minor fixes for readability https://codereview.chromium.org/121173009/diff/230001/include/v8.h File include/v8.h (right): https://codereview.chromium.org/121173009/diff/230001/include/v8.h#newcode1650 include/v8.h:1650: DISALLOW_INVALID_UTF8 = 8 this needs a comment above https://codereview.chromium.org/121173009/diff/230001/src/api.cc File src/api.cc (right): https://codereview.chromium.org/121173009/diff/230001/src/api.cc#newcode4745 src/api.cc:4745: return writer.CompleteWrite(write_null, nchars_ref); think you need to ensure that all return points from this function check for the case that you don't allow invalid utf8, and the last character written is one half of a surrogate pair, which would be incorrectly written as a replacement character in the current implementation https://codereview.chromium.org/121173009/diff/230001/src/unicode.h File src/unicode.h (right): https://codereview.chromium.org/121173009/diff/230001/src/unicode.h#newcode51 src/unicode.h:51: * (e.g. an orphan surrogate) when converting to a UTF encoding. typo - UTF-8 https://codereview.chromium.org/121173009/diff/230001/src/unicode.h#newcode53 src/unicode.h:53: const int kReplacementCharacter = 0xFFFD; this should be in Utf8, but see below https://codereview.chromium.org/121173009/diff/230001/src/unicode.h#newcode161 src/unicode.h:161: bool allow_invalid); this either needs to be an enum to avoid passing true and false directly, or you need to keep the bool but use a default value of true so existing call sites don't change https://codereview.chromium.org/121173009/diff/230001/src/unicode.h#newcode165 src/unicode.h:165: static const uchar kBadChar = 0xFFFD; hmmm, maybe you should just rename this variable here to kReplacementChar, since that's what it is, and it might conflict with your change. you'll obviously need to check usages of it, but i think i removed most of them in the codebase
https://codereview.chromium.org/121173009/diff/230001/src/api.cc File src/api.cc (right): https://codereview.chromium.org/121173009/diff/230001/src/api.cc#newcode4708 src/api.cc:4708: // @TODO Replace magic number 3 with something more descriptive. E.g. the syntax we use is: // TODO(username) Comment. but in this case, it just needs to be done before landing...
I addressed all issues raised except for one (see inline comments). Two questions: 1) What do you think about using the term "replace" instead of "disallow"? 2) IIRC there was supposed to be some kind of feedback mechanism for the caller of WriteUtf8 to inform him that replacements have happened. Do we need this for landing the patch? Thank you so much for all the reviewing! Felix https://codereview.chromium.org/121173009/diff/230001/include/v8.h File include/v8.h (right): https://codereview.chromium.org/121173009/diff/230001/include/v8.h#newcode1650 include/v8.h:1650: DISALLOW_INVALID_UTF8 = 8 On 2014/01/10 16:49:55, dcarney wrote: > this needs a comment above Done. https://codereview.chromium.org/121173009/diff/230001/src/api.cc File src/api.cc (right): https://codereview.chromium.org/121173009/diff/230001/src/api.cc#newcode4708 src/api.cc:4708: // @TODO Replace magic number 3 with something more descriptive. E.g. On 2014/01/10 16:54:33, dcarney wrote: > the syntax we use is: > > // TODO(username) Comment. > > but in this case, it just needs to be done before landing... Done. https://codereview.chromium.org/121173009/diff/230001/src/api.cc#newcode4745 src/api.cc:4745: return writer.CompleteWrite(write_null, nchars_ref); On 2014/01/10 16:49:55, dcarney wrote: > think you need to ensure that all return points from this function check for the > case that you don't allow invalid utf8, and the last character written is one > half of a surrogate pair, which would be incorrectly written as a replacement > character in the current implementation Not sure I understand. If allow_invalid_utf8=false, then any lone surrogate is an invalid last character, and should be replaced. https://codereview.chromium.org/121173009/diff/230001/src/unicode.h File src/unicode.h (right): https://codereview.chromium.org/121173009/diff/230001/src/unicode.h#newcode51 src/unicode.h:51: * (e.g. an orphan surrogate) when converting to a UTF encoding. On 2014/01/10 16:49:55, dcarney wrote: > typo - UTF-8 Done. https://codereview.chromium.org/121173009/diff/230001/src/unicode.h#newcode53 src/unicode.h:53: const int kReplacementCharacter = 0xFFFD; On 2014/01/10 16:49:55, dcarney wrote: > this should be in Utf8, but see below Done. https://codereview.chromium.org/121173009/diff/230001/src/unicode.h#newcode161 src/unicode.h:161: bool allow_invalid); On 2014/01/10 16:49:55, dcarney wrote: > this either needs to be an enum to avoid passing true and false directly, or you > need to keep the bool but use a default value of true so existing call sites > don't change Done by defaulting to true. I don't understand the requirement of not changing existing call sites so, I assume this is just to minimize the diff of the patch? https://codereview.chromium.org/121173009/diff/230001/src/unicode.h#newcode165 src/unicode.h:165: static const uchar kBadChar = 0xFFFD; On 2014/01/10 16:49:55, dcarney wrote: > hmmm, maybe you should just rename this variable here to kReplacementChar, since > that's what it is, and it might conflict with your change. > you'll obviously need to check usages of it, but i think i removed most of them > in the codebase Done. Decided to keep the kBadChar name for now as it was still used in a few places.
https://codereview.chromium.org/121173009/diff/300001/include/v8.h File include/v8.h (right): https://codereview.chromium.org/121173009/diff/300001/include/v8.h#newcode1652 include/v8.h:1652: // @TODO Rename to REPLACE_INVALID_UTF8? sure https://codereview.chromium.org/121173009/diff/300001/src/api.cc File src/api.cc (right): https://codereview.chromium.org/121173009/diff/300001/src/api.cc#newcode4709 src/api.cc:4709: capacity / static_cast<int>(unibrow::Utf8::kMax16BitCodeUnitSize) this formatting is funny either use clang format or move the constant above the if statement to make it shorter https://codereview.chromium.org/121173009/diff/300001/src/unicode-inl.h File src/unicode-inl.h (right): https://codereview.chromium.org/121173009/diff/300001/src/unicode-inl.h#newco... src/unicode-inl.h:116: bool allow_invalid = true) { default should be in declaration https://codereview.chromium.org/121173009/diff/300001/src/unicode.h File src/unicode.h (right): https://codereview.chromium.org/121173009/diff/300001/src/unicode.h#newcode155 src/unicode.h:155: bool allow_invalid); needs default value
> 1) What do you think about using the term "replace" instead of "disallow"? sounds good > 2) IIRC there was supposed to be some kind of feedback mechanism for the caller > of WriteUtf8 to inform him that replacements have happened. Do we need this for > landing the patch? the feature seems unnecessary, skip it > > think you need to ensure that all return points from this function check for > the > > case that you don't allow invalid utf8, and the last character written is one > > half of a surrogate pair, which would be incorrectly written as a replacement > > character in the current implementation > > Not sure I understand. If allow_invalid_utf8=false, then any lone surrogate is > an invalid last character, and should be replaced. right, but if you have the following string: {lead_surrogate, trail_surrogate} and you try to write into a buffer of 3 bytes, the correct behaviour is to write nothing, not write the replacement character as will currently happen the current implementation doesn't have this problem, as it just writes the lead surrogate illegally and leaves it to the caller to deal with
My latest patch: * Fixes the missing default value in the header (sorry, I'm not used to working with header files, so I keep forgetting : /, wish the compiler told me about it) * Fixes the funny formatting * Renames allow to replace, as it's more descriptive * Adds a test case for the buffer limit being hit on a lead surrogate * Adds a fix attempt for the issue, but it's not working yet (I think I can figure it out, I need more time) * Documents the Utf8WriterVisitor Visit method as I found it to be quite complex. I hope you want to keep it, but no worries if not. I have to wrap things up for today and thought it can't hurt to send out the latest progress. However, feel free to ignore this patch set until I can figure out how to fix the lead surrogate at the end of buffer edge case. https://codereview.chromium.org/121173009/diff/300001/include/v8.h File include/v8.h (right): https://codereview.chromium.org/121173009/diff/300001/include/v8.h#newcode1652 include/v8.h:1652: // @TODO Rename to REPLACE_INVALID_UTF8? On 2014/01/13 09:19:56, dcarney wrote: > sure Done. https://codereview.chromium.org/121173009/diff/300001/src/api.cc File src/api.cc (right): https://codereview.chromium.org/121173009/diff/300001/src/api.cc#newcode4709 src/api.cc:4709: capacity / static_cast<int>(unibrow::Utf8::kMax16BitCodeUnitSize) On 2014/01/13 09:19:56, dcarney wrote: > this formatting is funny > either use clang format or move the constant above the if statement to make it > shorter Done. https://codereview.chromium.org/121173009/diff/300001/src/unicode-inl.h File src/unicode-inl.h (right): https://codereview.chromium.org/121173009/diff/300001/src/unicode-inl.h#newco... src/unicode-inl.h:116: bool allow_invalid = true) { On 2014/01/13 09:19:56, dcarney wrote: > default should be in declaration Done. https://codereview.chromium.org/121173009/diff/300001/src/unicode.h File src/unicode.h (right): https://codereview.chromium.org/121173009/diff/300001/src/unicode.h#newcode155 src/unicode.h:155: bool allow_invalid); On 2014/01/13 09:19:56, dcarney wrote: > needs default value Done.
all looks good i'll land this as soon as that edge case is handled
I'm happy to report that I found the problem with my previous patch. The test was wrong. I have fixed the test, and would like you to consider this version for inclusion into v8. Thank you so much for all the reviewing and help! --fg https://codereview.chromium.org/121173009/diff/470001/test/cctest/test-api.cc File test/cctest/test-api.cc (right): https://codereview.chromium.org/121173009/diff/470001/test/cctest/test-api.cc... test/cctest/test-api.cc:7571: context->GetIsolate(), pair, v8::String::kNormalString, 2); This mistake was the missing piece in the puzzle. My fix seems to have been correct already.
lgtm i'll land once you addressed the trivial comments below additionally, the description of the code review must now change to be the description of the change. when I land it, it will become the commit message thanks for fixing this! https://codereview.chromium.org/121173009/diff/470001/src/api.cc File src/api.cc (right): https://codereview.chromium.org/121173009/diff/470001/src/api.cc#newcode4565 src/api.cc:4565: // TODO(felixge) This function is rather complex and could benefit from drop the todo https://codereview.chromium.org/121173009/diff/470001/src/api.cc#newcode4627 src/api.cc:4627: if (replace_invalid_utf8_ && Utf16::IsLeadSurrogate(character)) { this line is in the correct place, but it's only true because we can have at most 3 characters remaining here please add a one line comment https://codereview.chromium.org/121173009/diff/470001/src/api.cc#newcode4631 src/api.cc:4631: no space https://codereview.chromium.org/121173009/diff/470001/src/unicode-inl.h File src/unicode-inl.h (right): https://codereview.chromium.org/121173009/diff/470001/src/unicode-inl.h#newco... src/unicode-inl.h:113: // with kReplacementCharacter. kBadChar
I've addressed all remaining issues and updated the issue description. Ping me in chat if there are still formatting issues or anything, I'll take care of it right away. --fg https://codereview.chromium.org/121173009/diff/470001/src/api.cc File src/api.cc (right): https://codereview.chromium.org/121173009/diff/470001/src/api.cc#newcode4565 src/api.cc:4565: // TODO(felixge) This function is rather complex and could benefit from On 2014/01/17 09:10:12, dcarney wrote: > drop the todo Done. https://codereview.chromium.org/121173009/diff/470001/src/api.cc#newcode4627 src/api.cc:4627: if (replace_invalid_utf8_ && Utf16::IsLeadSurrogate(character)) { On 2014/01/17 09:10:12, dcarney wrote: > this line is in the correct place, but it's only true because we can have at > most 3 characters remaining here > please add a one line comment Done. https://codereview.chromium.org/121173009/diff/470001/src/api.cc#newcode4631 src/api.cc:4631: On 2014/01/17 09:10:12, dcarney wrote: > no space Done. https://codereview.chromium.org/121173009/diff/470001/src/unicode-inl.h File src/unicode-inl.h (right): https://codereview.chromium.org/121173009/diff/470001/src/unicode-inl.h#newco... src/unicode-inl.h:113: // with kReplacementCharacter. On 2014/01/17 09:10:12, dcarney wrote: > kBadChar Done.
Sending my previous patch again, my first upload attempt seems to have failed.
Rebased the patch.
Message was sent while issue was closed.
Committed patchset #9 manually as r18683 (presubmit successful). |