|
|
Created:
8 years ago by Søren Gjesse Modified:
8 years ago CC:
reviews_dartlang.org Visibility:
Public. |
DescriptionAdd support for surrogates when serializing and deserializing for native ports
As Dart allows creating string with UTF-16 surrogate code units which
are not always in lead/trail pairs, we need to support this in the
communication with native ports.
This change allows surrogate code units in messages to/from native
ports and supports them in the UTF-8 encoding used on the Dart_CObject
structure string representation.
R=ager@google.com, erikcorry@google.com, asiva@google.com
BUG=
Committed: https://code.google.com/p/dart/source/detail?r=15585
Patch Set 1 #
Total comments: 3
Patch Set 2 : Added Utf16::CodePointIterator #
Total comments: 13
Patch Set 3 : Addressed review comments from asiva@ #Patch Set 4 : Addressed additional comments from asiva@ #Patch Set 5 : Use iterator reset #
Total comments: 6
Patch Set 6 : Only allow legal UTF-8 #Patch Set 7 : Fixed long line #
Total comments: 9
Patch Set 8 : Addressed more review comments #Patch Set 9 : Rebased to r15579 #
Messages
Total messages: 23 (0 generated)
Are snapshots untrusted data? Do you create symbols from snapshots? Is it possible to canonicalize the symbols incorrectly if surrogate pairs are encoded as two 3 byte UTF-8 sequences vs. one 4-byte UTF-8 sequences? Once you have thought about that, it LGTM https://codereview.chromium.org/11280150/diff/1/runtime/vm/dart_api_message.cc File runtime/vm/dart_api_message.cc (right): https://codereview.chromium.org/11280150/diff/1/runtime/vm/dart_api_message.c... runtime/vm/dart_api_message.cc:397: for (intptr_t i = 0; i < len; i++) { There is a CodePointIterator in object.h that you may be able to use instead. https://codereview.chromium.org/11280150/diff/1/runtime/vm/snapshot_test.cc File runtime/vm/snapshot_test.cc (right): https://codereview.chromium.org/11280150/diff/1/runtime/vm/snapshot_test.cc#n... runtime/vm/snapshot_test.cc:1295: " return \"\\u{10000}\\u{1F601}\\u{1F637}\\u{20000}\";\n" 😁
On 2012/11/23 14:25:47, erikcorry wrote: > Are snapshots untrusted data? Do you create symbols from snapshots? Is it > possible to canonicalize the symbols incorrectly if surrogate pairs are encoded > as two 3 byte UTF-8 sequences vs. one 4-byte UTF-8 sequences? > > Once you have thought about that, it LGTM > > https://codereview.chromium.org/11280150/diff/1/runtime/vm/dart_api_message.cc > File runtime/vm/dart_api_message.cc (right): > > https://codereview.chromium.org/11280150/diff/1/runtime/vm/dart_api_message.c... > runtime/vm/dart_api_message.cc:397: for (intptr_t i = 0; i < len; i++) { > There is a CodePointIterator in object.h that you may be able to use instead. > > https://codereview.chromium.org/11280150/diff/1/runtime/vm/snapshot_test.cc > File runtime/vm/snapshot_test.cc (right): > > https://codereview.chromium.org/11280150/diff/1/runtime/vm/snapshot_test.cc#n... > runtime/vm/snapshot_test.cc:1295: " return > \"\\u{10000}\\u{1F601}\\u{1F637}\\u{20000}\";\n" > 😁 Snapshots are generated by code inside the Dart VM - also for message snapshots generates from extension code from Dart_CObject structures. I am not sure whether it makes it "trusted data". I don't think that strings from message snapshots are made into symbols, but I couls be wrong and of-cause that can change. However when converting UTF-8 to UTF-16 (UTF-16 is used in the snapshot), both the 4-byte and the 2 * 3-byte UTF-8 representations will turn into the same two UTF-16 code units.
https://codereview.chromium.org/11280150/diff/1/runtime/vm/dart_api_message.cc File runtime/vm/dart_api_message.cc (right): https://codereview.chromium.org/11280150/diff/1/runtime/vm/dart_api_message.c... runtime/vm/dart_api_message.cc:397: for (intptr_t i = 0; i < len; i++) { On 2012/11/23 14:25:47, erikcorry wrote: > There is a CodePointIterator in object.h that you may be able to use instead. That is not possible as this is a raw de-serialized utf16 array outside the VM and not a VM string. But thanks for mentioning it. I ended up implementing a Utf16::CodePointIterator matching String::CodePointIterator (both does the same as the code below).
https://codereview.chromium.org/11280150/diff/5001/runtime/vm/dart_api_messag... File runtime/vm/dart_api_message.cc (right): https://codereview.chromium.org/11280150/diff/5001/runtime/vm/dart_api_messag... runtime/vm/dart_api_message.cc:397: Utf16::CodePointIterator it(utf16, len); If you get invalid characters here it.Next() could potentially return false right at the first character and we would end up with utf8_len being 0. Should that be reported as an error as just silently dropped and an empty string returned like it is being done now? https://codereview.chromium.org/11280150/diff/5001/runtime/vm/dart_api_messag... runtime/vm/dart_api_message.cc:404: Utf16::CodePointIterator it2(utf16, len); Would it make sense to have a reset method on the iterator instead so that you can start iterating again on the same iterator? https://codereview.chromium.org/11280150/diff/5001/runtime/vm/dart_api_messag... runtime/vm/dart_api_message.cc:795: if (!Utf8::IsValidAllowSurrogates(utf8_str, utf8_len)) { I am not sure I understand the need for this to be IsValidAllowSurrogates and not just IsValid? Is it because you might have read a partial utf8 string and are expecting more? https://codereview.chromium.org/11280150/diff/5001/runtime/vm/unicode.h File runtime/vm/unicode.h (right): https://codereview.chromium.org/11280150/diff/5001/runtime/vm/unicode.h#newco... runtime/vm/unicode.h:91: intptr_t array_len) Has two parameters in the constructor why is explicit needed? https://codereview.chromium.org/11280150/diff/5001/runtime/vm/unicode.h#newco... runtime/vm/unicode.h:98: int32_t Current() { int32_t Current() const { https://codereview.chromium.org/11280150/diff/5001/runtime/vm/unicode.h#newco... runtime/vm/unicode.h:110: int32_t ch_; Missing DISALLOW stuff?
Thanks for the review, please take another look. https://codereview.chromium.org/11280150/diff/5001/runtime/vm/dart_api_messag... File runtime/vm/dart_api_message.cc (right): https://codereview.chromium.org/11280150/diff/5001/runtime/vm/dart_api_messag... runtime/vm/dart_api_message.cc:397: Utf16::CodePointIterator it(utf16, len); On 2012/11/27 03:00:25, siva wrote: > If you get invalid characters here it.Next() could > potentially return false right at the first character > and we would end up with utf8_len being 0. > > Should that be reported as an error as just silently > dropped and an empty string returned like it is being > done now? There are no invalid characters in an UTF-16 sequence. So that cannot happen. Added test to unicode_test.cc. https://codereview.chromium.org/11280150/diff/5001/runtime/vm/dart_api_messag... runtime/vm/dart_api_message.cc:404: Utf16::CodePointIterator it2(utf16, len); On 2012/11/27 03:00:25, siva wrote: > Would it make sense to have a reset method on the iterator > instead so that you can start iterating again on the same iterator? Good point added Reset() here and for String::CodePointIterator as well. https://codereview.chromium.org/11280150/diff/5001/runtime/vm/dart_api_messag... runtime/vm/dart_api_message.cc:795: if (!Utf8::IsValidAllowSurrogates(utf8_str, utf8_len)) { On 2012/11/27 03:00:25, siva wrote: > I am not sure I understand the need for this to be > IsValidAllowSurrogates and not just IsValid? > > Is it because you might have read a partial utf8 string and are expecting more? The current Utf8::IsValid does not allow for Utf8 encoded Utf16 surrogate code points. That is the 3-byte Utf8 encodings of the code point range d800 - dbff and dc00 - dfff are not allowed. However as the Utf16 two byte strings posted can contain these code points the Utf8 strings in the Dart_CObject structures can contain 3-byte Utf8 encodings of Utf16 surrogate code points. We need to allow sending the same data as can be received. Maybe we should just make IsValid allow surrogate code points in all cases. Currently you cannot create all the strings using the Dart API Dart_NewStringFromUTF8 that you can using String.fromCharCodes inside Dart (e.g. String.fromCharCodes([0xd800])). https://codereview.chromium.org/11280150/diff/5001/runtime/vm/unicode.h File runtime/vm/unicode.h (right): https://codereview.chromium.org/11280150/diff/5001/runtime/vm/unicode.h#newco... runtime/vm/unicode.h:91: intptr_t array_len) On 2012/11/27 03:00:25, siva wrote: > Has two parameters in the constructor why is explicit needed? Not needed - removed. https://codereview.chromium.org/11280150/diff/5001/runtime/vm/unicode.h#newco... runtime/vm/unicode.h:98: int32_t Current() { On 2012/11/27 03:00:25, siva wrote: > int32_t Current() const { Done (added for String::CodePointIterator as well). https://codereview.chromium.org/11280150/diff/5001/runtime/vm/unicode.h#newco... runtime/vm/unicode.h:110: int32_t ch_; On 2012/11/27 03:00:25, siva wrote: > Missing DISALLOW stuff? Done.
https://codereview.chromium.org/11280150/diff/5001/runtime/vm/dart_api_messag... File runtime/vm/dart_api_message.cc (right): https://codereview.chromium.org/11280150/diff/5001/runtime/vm/dart_api_messag... runtime/vm/dart_api_message.cc:795: if (!Utf8::IsValidAllowSurrogates(utf8_str, utf8_len)) { I was under the impression that we allow for Utf8 encoded Utf16 surrogate code points, at least that was the intention. This might be a bug, I should probably write a test case to verify this. I would prefer if we did not have to distinguish between IsValid and IsValidAllowSurrogates. Similarly Utf8::DecodeToUTF16 should allow surrogates and not have to distinguish between the two. For instance if you look at Utf8::DecodeToUTF16 it tries to deal with supplementary characters. The fact that Utf8::Decode doesn't deal with it correctly is a bug. I think we should remove this distinction, fix Utf8::ISValid and Utf8::Decode, then this CL would be good to go. What do you think? On 2012/11/27 11:35:54, Søren Gjesse wrote: > On 2012/11/27 03:00:25, siva wrote: > > I am not sure I understand the need for this to be > > IsValidAllowSurrogates and not just IsValid? > > > > Is it because you might have read a partial utf8 string and are expecting > more? > > The current Utf8::IsValid does not allow for Utf8 encoded Utf16 surrogate code > points. That is the 3-byte Utf8 encodings of the code point range d800 - dbff > and dc00 - dfff are not allowed. However as the Utf16 two byte strings posted > can contain these code points the Utf8 strings in the Dart_CObject structures > can contain 3-byte Utf8 encodings of Utf16 surrogate code points. We need to > allow sending the same data as can be received. > > Maybe we should just make IsValid allow surrogate code points in all cases. > Currently you cannot create all the strings using the Dart API > Dart_NewStringFromUTF8 that you can using String.fromCharCodes inside Dart (e.g. > String.fromCharCodes([0xd800])).
PTAL Changed the Unicode library to always accept Utf8 encoded surrogate code units. Added and changed a number of tests.
lgtm https://codereview.chromium.org/11280150/diff/13002/runtime/vm/dart_api_impl_... File runtime/vm/dart_api_impl_test.cc (right): https://codereview.chromium.org/11280150/diff/13002/runtime/vm/dart_api_impl_... runtime/vm/dart_api_impl_test.cc:591: } This test and the one above seem to test identical values (U+4E8C), why repeat the test? https://codereview.chromium.org/11280150/diff/13002/runtime/vm/unicode.cc File runtime/vm/unicode.cc (right): https://codereview.chromium.org/11280150/diff/13002/runtime/vm/unicode.cc#new... runtime/vm/unicode.cc:62: // Check is codepoint is <= U+00FF. Check if codepoint is ...
I chatted with Carl regarding Utf8::IsValid and Utf8::Decode, he says it is incorrect to see UTF16 surrogate characters in a valid Utf8 stream and hence this change of allowing Surrogate characters is incorrect. Now the question would be how come you end up with a Utf8 stream which has surrogate characters? The bug is probably because the Dart strings which have surrogate characters are not being properly fused into a codepoint before being serialized as a Utf8 stream when sending the message.
On 2012/11/28 20:46:56, siva wrote: > I chatted with Carl regarding > Utf8::IsValid and Utf8::Decode, > he says it is incorrect to see UTF16 surrogate characters > in a valid Utf8 stream and hence this change of allowing > Surrogate characters is incorrect. > > Now the question would be how come you end up with a Utf8 stream which has > surrogate characters? The bug is probably because the Dart strings which have > surrogate characters are > not being properly fused into a codepoint before being serialized as a Utf8 > stream when sending the message. It's easy to come up with such strings using the [] operator.
On 2012/11/28 20:47:59, erikcorry wrote: > On 2012/11/28 20:46:56, siva wrote: > > I chatted with Carl regarding > > Utf8::IsValid and Utf8::Decode, > > he says it is incorrect to see UTF16 surrogate characters > > in a valid Utf8 stream and hence this change of allowing > > Surrogate characters is incorrect. > > > > Now the question would be how come you end up with a Utf8 stream which has > > surrogate characters? The bug is probably because the Dart strings which have > > surrogate characters are > > not being properly fused into a codepoint before being serialized as a Utf8 > > stream when sending the message. > > It's easy to come up with such strings using the [] operator. Also with String.fromCharCodes you can create any sequence of UTF-16 code points (see the new tests in snapshot_test.dart. I don't see that we can have an API where what can be returned by Dart_StringToUTF8 cannot always be used as valid input to Dart_NewStringFromUTF8. The same argument goes for Dart_CObject structures where received strings should be valid strings to send back.
https://codereview.chromium.org/11280150/diff/13002/runtime/vm/dart_api_messa... File runtime/vm/dart_api_message.cc (right): https://codereview.chromium.org/11280150/diff/13002/runtime/vm/dart_api_messa... runtime/vm/dart_api_message.cc:407: } At this point here we could state that if malformed Dart strings are trying to be passed out to the native world we could flag them as error. That way we can say that only valid utf8 strings will be sent to the native world.
> Also with String.fromCharCodes you can create any sequence of UTF-16 code points > (see the new tests in snapshot_test.dart. When we decided to use UTF-16 that was explicitly not supposed to happen. The interface should admit valid sequences of Unicode scalar values, only. If I understand you correctly, if the implementation has relaxed, it is a bug. > I don't see that we can have an API where what can be returned by > Dart_StringToUTF8 cannot always be used as valid input to > Dart_NewStringFromUTF8. The same argument goes for Dart_CObject structures where > received strings should be valid strings to send back. To avoid the asymmetric behavior, Dart_StringToUTF8 should reject malformed input. Avoiding the marshaling seems like a better option. Latin-1 and UTF-16 strings can be passed as-is and decoded by the user. This would be convenient, especially when passing data to system functions that already expected UTF-16.
On 2012/11/29 02:14:22, cshapiro wrote: > > Also with String.fromCharCodes you can create any sequence of UTF-16 code > points > > (see the new tests in snapshot_test.dart. > > When we decided to use UTF-16 that was explicitly not supposed to happen. The > interface should admit valid sequences of Unicode scalar values, only. > > If I understand you correctly, if the implementation has relaxed, it is a bug. > > > I don't see that we can have an API where what can be returned by > > Dart_StringToUTF8 cannot always be used as valid input to > > Dart_NewStringFromUTF8. The same argument goes for Dart_CObject structures > where > > received strings should be valid strings to send back. > > To avoid the asymmetric behavior, Dart_StringToUTF8 should reject malformed > input. > > Avoiding the marshaling seems like a better option. Latin-1 and UTF-16 strings > can be passed as-is and decoded by the user. This would be convenient, > especially when passing data to system functions that already expected UTF-16. I don't like to have two string types in the marchalling format used to communicate with native ports. Except for Windows there don't seem to be that many UTF-16 APIs out there. This makes it much more complicated for programmers. Discussed this with Mads and decided to keep UTF-8, and disallow UTF-16 surrogate code points. If such a string is received it will be turned into an unsupported object. If such a string is send the sending will fail. We can always extend this and maybe have a configuration for choosing between UTF-8 or Latin1/UTF-16 in native port message marchalling. My personal opinion here is that just relaxing our view of UTF-8 to allow UTF-16 surrogate code points is both the simplest and most pragmatic solution. There is no need to be a purist when it does not add much value. We know we can get UTF-16 strings inside the VM which can have which are not present as valid lead/trail pairs, so it seems odd not to accept that all the way through.
On 2012/11/29 08:23:33, Søren Gjesse wrote: > My personal opinion here is that just relaxing our view of UTF-8 to allow UTF-16 > surrogate code points is both the simplest and most pragmatic solution. There is > no need to be a purist when it does not add much value. We know we can get > UTF-16 strings inside the VM which can have which are not present as valid > lead/trail pairs, so it seems odd not to accept that all the way through. This is what we did for V8 and it has not caused any problems that I am aware of.
PTA(final)L The UTF-8 encodings used are now always fully legal. If a string which cannot be UTF-8 encoded is sent to a native port it will be received as an unsupported object. https://codereview.chromium.org/11280150/diff/13002/runtime/vm/dart_api_impl_... File runtime/vm/dart_api_impl_test.cc (right): https://codereview.chromium.org/11280150/diff/13002/runtime/vm/dart_api_impl_... runtime/vm/dart_api_impl_test.cc:591: } On 2012/11/28 18:22:46, siva wrote: > This test and the one above seem to test identical values (U+4E8C), why repeat > the test? This is all reverted. https://codereview.chromium.org/11280150/diff/13002/runtime/vm/dart_api_messa... File runtime/vm/dart_api_message.cc (right): https://codereview.chromium.org/11280150/diff/13002/runtime/vm/dart_api_messa... runtime/vm/dart_api_message.cc:407: } On 2012/11/28 22:02:24, siva wrote: > At this point here we could state that if malformed Dart strings are trying to > be passed out to the native world we could flag them as error. That way we can > say that only valid utf8 strings will be sent to the native world. Done. https://codereview.chromium.org/11280150/diff/13002/runtime/vm/unicode.cc File runtime/vm/unicode.cc (right): https://codereview.chromium.org/11280150/diff/13002/runtime/vm/unicode.cc#new... runtime/vm/unicode.cc:62: // Check is codepoint is <= U+00FF. On 2012/11/28 18:22:46, siva wrote: > Check if codepoint is ... Done.
lgtm
https://codereview.chromium.org/11280150/diff/13004/runtime/vm/object.h File runtime/vm/object.h (right): https://codereview.chromium.org/11280150/diff/13004/runtime/vm/object.h#newco... runtime/vm/object.h:3746: void Reset() { I would prefer that this functionality not be added until we have a client. https://codereview.chromium.org/11280150/diff/13004/runtime/vm/unicode.cc File runtime/vm/unicode.cc (right): https://codereview.chromium.org/11280150/diff/13004/runtime/vm/unicode.cc#new... runtime/vm/unicode.cc:96: if (!IsLatin1SequenceStart(code_unit)) { // > U+00FF. no period https://codereview.chromium.org/11280150/diff/13004/runtime/vm/unicode.cc#new... runtime/vm/unicode.cc:97: if (IsSupplementarySequenceStart(code_unit)) { // >= U+10000. ditto https://codereview.chromium.org/11280150/diff/13004/runtime/vm/unicode.h File runtime/vm/unicode.h (right): https://codereview.chromium.org/11280150/diff/13004/runtime/vm/unicode.h#newc... runtime/vm/unicode.h:76: class CodePointIterator { Consider Utf16::Next, which is a lighter-weight, and requires no more code at the use site.
https://codereview.chromium.org/11280150/diff/13004/runtime/vm/object.h File runtime/vm/object.h (right): https://codereview.chromium.org/11280150/diff/13004/runtime/vm/object.h#newco... runtime/vm/object.h:3746: void Reset() { On 2012/11/30 02:49:08, cshapiro wrote: > I would prefer that this functionality not be added until we have a client. Done. https://codereview.chromium.org/11280150/diff/13004/runtime/vm/unicode.cc File runtime/vm/unicode.cc (right): https://codereview.chromium.org/11280150/diff/13004/runtime/vm/unicode.cc#new... runtime/vm/unicode.cc:96: if (!IsLatin1SequenceStart(code_unit)) { // > U+00FF. On 2012/11/30 02:49:08, cshapiro wrote: > no period Done. https://codereview.chromium.org/11280150/diff/13004/runtime/vm/unicode.cc#new... runtime/vm/unicode.cc:97: if (IsSupplementarySequenceStart(code_unit)) { // >= U+10000. On 2012/11/30 02:49:08, cshapiro wrote: > ditto Done. https://codereview.chromium.org/11280150/diff/13004/runtime/vm/unicode.h File runtime/vm/unicode.h (right): https://codereview.chromium.org/11280150/diff/13004/runtime/vm/unicode.h#newc... runtime/vm/unicode.h:76: class CodePointIterator { On 2012/11/30 02:49:08, cshapiro wrote: > Consider Utf16::Next, which is a lighter-weight, and requires no more code at > the use site. Are you suggestion a static function like this: bool Utf16::Next(const uint16_t* utf16_array, intptr_t array_len, int* index, uint32_t* ch); And used this way: uint16_t* data = ... intptr_t len = ... intptr_t index = 0; uint32_t ch = -1; while (Utf16::Next(data, len, &index, &ch) { // ... use ch. } Seems simpler to keep the state in a stack allocated object.
Committing. https://codereview.chromium.org/11280150/diff/13004/runtime/vm/unicode.h File runtime/vm/unicode.h (right): https://codereview.chromium.org/11280150/diff/13004/runtime/vm/unicode.h#newc... runtime/vm/unicode.h:76: class CodePointIterator { On 2012/11/30 02:49:08, cshapiro wrote: > Consider Utf16::Next, which is a lighter-weight, and requires no more code at > the use site. I didn't realize that Utf16::Next was already in there as I did not rebase since starting this CL. Used that and removed Utf16::CodePointIterator.
Message was sent while issue was closed.
Thanks! |