Issue 11280150: Add support for surrogates when serializing and deserializing for native ports (Closed)

Created:
8 years ago by Søren Gjesse

Modified:
8 years ago

Reviewers:
erikcorry, Mads Ager (google), cshapiro, siva

CC:
reviews_dartlang.org

Base URL:
https://dart.googlecode.com/svn/branches/bleeding_edge/dart

Visibility:
Public.

Description

Add support for surrogates when serializing and deserializing for native ports

As Dart allows creating strings with UTF-16 surrogate code units which are not always in lead/trail pairs, we need to support this in the communication with native ports. This change allows surrogate code units in messages to/from native ports and supports them in the UTF-8 encoding used in the Dart_CObject structure's string representation.

R=ager@google.com, erikcorry@google.com, asiva@google.com
BUG=

Committed: https://code.google.com/p/dart/source/detail?r=15585

Patch Set 1 #

Total comments: 3

Patch Set 2 : Added Utf16::CodePointIterator #

Total comments: 13

Patch Set 3 : Addressed review comments from asiva@ #

Patch Set 4 : Addressed additional comments from asiva@ #

Patch Set 5 : Use iterator reset #

Total comments: 6

Patch Set 6 : Only allow legal UTF-8 #

Patch Set 7 : Fixed long line #

Total comments: 9

Patch Set 8 : Addressed more review comments #

Patch Set 9 : Rebased to r15579 #

Created: 8 years ago

Stats (+134 lines, -65 lines)

Status	File	Chunks	Changes	Comments
M	runtime/vm/dart_api_message.cc	2 chunks	+23 lines, -8 lines	0 comments
M	runtime/vm/object.h	1 chunk	+1 line, -1 line	0 comments
M	runtime/vm/snapshot_test.cc	6 chunks	+96 lines, -43 lines	0 comments
M	runtime/vm/unicode.h	1 chunk	+1 line, -1 line	0 comments
M	runtime/vm/unicode.cc	8 chunks	+13 lines, -12 lines	0 comments

Messages

Total messages: 23 (0 generated)

erikcorry

Are snapshots untrusted data? Do you create symbols from snapshots? Is it possible to canonicalize ...

8 years ago (2012-11-23 14:25:47 UTC) #2

Søren Gjesse

On 2012/11/23 14:25:47, erikcorry wrote: > Are snapshots untrusted data? Do you create symbols from ...

8 years ago (2012-11-26 08:52:50 UTC) #3

Søren Gjesse

https://codereview.chromium.org/11280150/diff/1/runtime/vm/dart_api_message.cc File runtime/vm/dart_api_message.cc (right): https://codereview.chromium.org/11280150/diff/1/runtime/vm/dart_api_message.cc#newcode397 runtime/vm/dart_api_message.cc:397: for (intptr_t i = 0; i < len; i++) ...

8 years ago (2012-11-26 08:53:00 UTC) #4

siva

https://codereview.chromium.org/11280150/diff/5001/runtime/vm/dart_api_message.cc File runtime/vm/dart_api_message.cc (right): https://codereview.chromium.org/11280150/diff/5001/runtime/vm/dart_api_message.cc#newcode397 runtime/vm/dart_api_message.cc:397: Utf16::CodePointIterator it(utf16, len); If you get invalid characters here ...

8 years ago (2012-11-27 03:00:25 UTC) #5

Søren Gjesse

8 years ago (2012-11-27 11:35:54 UTC) #6

Thanks for the review, please take another look.

https://codereview.chromium.org/11280150/diff/5001/runtime/vm/dart_api_messag...
File runtime/vm/dart_api_message.cc (right):

https://codereview.chromium.org/11280150/diff/5001/runtime/vm/dart_api_messag...
runtime/vm/dart_api_message.cc:397: Utf16::CodePointIterator it(utf16, len);
On 2012/11/27 03:00:25, siva wrote:
> If you get invalid characters here it.Next() could
> potentially return false right at the first character
> and we would end up with utf8_len being 0.
> 
> Should that be reported as an error or just silently
> dropped and an empty string returned like it is being
> done now?

There are no invalid characters in a UTF-16 sequence, so that cannot happen.
Added a test to unicode_test.cc.

https://codereview.chromium.org/11280150/diff/5001/runtime/vm/dart_api_messag...
runtime/vm/dart_api_message.cc:404: Utf16::CodePointIterator it2(utf16, len);
On 2012/11/27 03:00:25, siva wrote:
> Would it make sense to have a reset method on the iterator
> instead so that you can start iterating again on the same iterator?

Good point; added Reset() here and for String::CodePointIterator as well.
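
For illustration, the iterator-with-reset pattern discussed in the two comments above can be sketched as follows. This is a minimal sketch, not the VM's Utf16::CodePointIterator: the class name CodePointIteratorSketch and the helpers Utf8LengthSketch and TotalUtf8Length are hypothetical. It assumes an unpaired surrogate is reported as its own code point, so Next() only returns false at the end of the input, and it shows a first pass computing the UTF-8 length before the iterator is reset for a second pass.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for an iterator over UTF-16 code units.
// Not the VM's Utf16::CodePointIterator.
class CodePointIteratorSketch {
 public:
  CodePointIteratorSketch(const uint16_t* units, size_t len)
      : units_(units), len_(len), index_(0), ch_(0) {
    assert(units != nullptr || len == 0);
  }

  // Advances to the next code point. Returns false only at end of input.
  bool Next() {
    if (index_ >= len_) return false;
    uint16_t lead = units_[index_++];
    ch_ = lead;
    // Combine a lead surrogate with a following trail surrogate; any
    // other surrogate is passed through as a code point of its own.
    if (IsLead(lead) && index_ < len_ && IsTrail(units_[index_])) {
      uint16_t trail = units_[index_++];
      ch_ = 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00);
    }
    return true;
  }

  int32_t Current() const { return ch_; }

  // Rewinds so the same iterator can be used for a second pass.
  void Reset() {
    index_ = 0;
    ch_ = 0;
  }

 private:
  static bool IsLead(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
  static bool IsTrail(uint16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

  const uint16_t* units_;
  size_t len_;
  size_t index_;
  int32_t ch_;
};

// Number of UTF-8 bytes needed for a code point. A lone surrogate falls
// in the three-byte range.
inline size_t Utf8LengthSketch(int32_t cp) {
  if (cp < 0x80) return 1;
  if (cp < 0x800) return 2;
  if (cp < 0x10000) return 3;
  return 4;
}

// Pass 1 of a two-pass conversion: total UTF-8 length. The iterator is
// reset afterwards so the caller can run pass 2 (the actual encoding).
inline size_t TotalUtf8Length(CodePointIteratorSketch* it) {
  size_t total = 0;
  while (it->Next()) total += Utf8LengthSketch(it->Current());
  it->Reset();
  return total;
}
```

With Reset(), the caller needs only one iterator object for both passes instead of constructing a second one over the same buffer.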

https://codereview.chromium.org/11280150/diff/5001/runtime/vm/dart_api_messag...
runtime/vm/dart_api_message.cc:795: if (!Utf8::IsValidAllowSurrogates(utf8_str, utf8_len)) {
On 2012/11/27 03:00:25, siva wrote:
> I am not sure I understand the need for this to be
> IsValidAllowSurrogates and not just IsValid?
> 
> Is it because you might have read a partial utf8 string and are expecting
> more?

The current Utf8::IsValid does not allow Utf8 encoded Utf16 surrogate code
points. That is, the 3-byte Utf8 encodings of the code point ranges d800 - dbff
and dc00 - dfff are not allowed. However, as the Utf16 two-byte strings posted
can contain these code points, the Utf8 strings in the Dart_CObject structures
can contain 3-byte Utf8 encodings of Utf16 surrogate code points. We need to
allow sending the same data as can be received.

Maybe we should just make IsValid allow surrogate code points in all cases.
Currently the Dart API Dart_NewStringFromUTF8 cannot create every string that
String.fromCharCodes can create inside Dart (e.g.
String.fromCharCodes([0xd800])).
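
To make the byte-level point concrete, here is a minimal sketch of what a 3-byte encoding of a surrogate code point looks like. The functions EncodeUtf8Lenient and IsEncodedSurrogate are hypothetical names for illustration and are not part of the VM's Utf8 class. The encoder deliberately does not reject the range d800 - dfff, which is the relaxed behaviour under discussion; a strict validator would reject exactly the sequences that IsEncodedSurrogate recognizes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Encodes one code point into out (which must hold at least 4 bytes)
// and returns the number of bytes written. Surrogates are NOT rejected;
// that is the lenient behaviour, which strict UTF-8 forbids.
inline size_t EncodeUtf8Lenient(int32_t cp, uint8_t* out) {
  assert(cp >= 0 && cp <= 0x10FFFF);
  if (cp < 0x80) {
    out[0] = static_cast<uint8_t>(cp);
    return 1;
  }
  if (cp < 0x800) {
    out[0] = static_cast<uint8_t>(0xC0 | (cp >> 6));
    out[1] = static_cast<uint8_t>(0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp < 0x10000) {
    out[0] = static_cast<uint8_t>(0xE0 | (cp >> 12));
    out[1] = static_cast<uint8_t>(0x80 | ((cp >> 6) & 0x3F));
    out[2] = static_cast<uint8_t>(0x80 | (cp & 0x3F));
    return 3;
  }
  out[0] = static_cast<uint8_t>(0xF0 | (cp >> 18));
  out[1] = static_cast<uint8_t>(0x80 | ((cp >> 12) & 0x3F));
  out[2] = static_cast<uint8_t>(0x80 | ((cp >> 6) & 0x3F));
  out[3] = static_cast<uint8_t>(0x80 | (cp & 0x3F));
  return 4;
}

// True if the three bytes at p encode a code point in d800 - dfff.
// Such sequences always start with 0xED followed by a byte >= 0xA0.
inline bool IsEncodedSurrogate(const uint8_t* p) {
  return p[0] == 0xED && p[1] >= 0xA0 && p[1] <= 0xBF &&
         p[2] >= 0x80 && p[2] <= 0xBF;
}
```

For example, the lone lead surrogate 0xd800 encodes to the bytes ED A0 80, while the neighbouring non-surrogate 0xd7ff encodes to ED 9F BF and is accepted by strict validation.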

https://codereview.chromium.org/11280150/diff/5001/runtime/vm/unicode.h
File runtime/vm/unicode.h (right):

https://codereview.chromium.org/11280150/diff/5001/runtime/vm/unicode.h#newco...
runtime/vm/unicode.h:91: intptr_t array_len)
On 2012/11/27 03:00:25, siva wrote:
> The constructor has two parameters; why is explicit needed?

Not needed - removed.

https://codereview.chromium.org/11280150/diff/5001/runtime/vm/unicode.h#newco...
runtime/vm/unicode.h:98: int32_t Current() {
On 2012/11/27 03:00:25, siva wrote:
> int32_t Current() const {

Done (added for String::CodePointIterator as well).

https://codereview.chromium.org/11280150/diff/5001/runtime/vm/unicode.h#newco...
runtime/vm/unicode.h:110: int32_t ch_;
On 2012/11/27 03:00:25, siva wrote:
> Missing DISALLOW stuff?

Done.

siva

8 years ago (2012-11-28 03:28:23 UTC) #7

https://codereview.chromium.org/11280150/diff/5001/runtime/vm/dart_api_messag...
File runtime/vm/dart_api_message.cc (right):

https://codereview.chromium.org/11280150/diff/5001/runtime/vm/dart_api_messag...
runtime/vm/dart_api_message.cc:795: if (!Utf8::IsValidAllowSurrogates(utf8_str, utf8_len)) {
I was under the impression that we allow for Utf8 encoded Utf16 surrogate code
points; at least that was the intention.

This might be a bug; I should probably write a test case to verify this. I would
prefer if we did not have to distinguish between IsValid and
IsValidAllowSurrogates. Similarly, Utf8::DecodeToUTF16 should allow surrogates
and not have to distinguish between the two.

For instance, if you look at Utf8::DecodeToUTF16, it tries to deal with
supplementary characters. The fact that Utf8::Decode doesn't deal with them
correctly is a bug.

I think we should remove this distinction and fix Utf8::IsValid and
Utf8::Decode; then this CL would be good to go. What do you think?

On 2012/11/27 11:35:54, Søren Gjesse wrote:
> On 2012/11/27 03:00:25, siva wrote:
> > I am not sure I understand the need for this to be
> > IsValidAllowSurrogates and not just IsValid?
> > 
> > Is it because you might have read a partial utf8 string and are expecting
> > more?
> 
> The current Utf8::IsValid does not allow for Utf8 encoded Utf16 surrogate code
> points. That is the 3-byte Utf8 encodings of the code point range d800 - dbff
> and dc00 - dfff are not allowed. However as the Utf16 two byte strings posted
> can contain these code points the Utf8 strings in the Dart_CObject structures
> can contain 3-byte Utf8 encodings of Utf16 surrogate code points. We need to
> allow sending the same data as can be received.
> 
> Maybe we should just make IsValid allow surrogate code points in all cases.
> Currently you cannot create all the strings using the Dart API
> Dart_NewStringFromUTF8 that you can using String.fromCharCodes inside Dart
> (e.g. String.fromCharCodes([0xd800])).

Søren Gjesse

PTAL Changed the Unicode library to always accept Utf8 encoded surrogate code units. Added and ...

8 years ago (2012-11-28 15:23:18 UTC) #8

siva

lgtm https://codereview.chromium.org/11280150/diff/13002/runtime/vm/dart_api_impl_test.cc File runtime/vm/dart_api_impl_test.cc (right): https://codereview.chromium.org/11280150/diff/13002/runtime/vm/dart_api_impl_test.cc#newcode591 runtime/vm/dart_api_impl_test.cc:591: } This test and the one above seem ...

8 years ago (2012-11-28 18:22:46 UTC) #9

siva

I chatted with Carl regarding Utf8::IsValid and Utf8::Decode, he says it is incorrect to see ...

8 years ago (2012-11-28 20:46:56 UTC) #10

erikcorry

On 2012/11/28 20:46:56, siva wrote: > I chatted with Carl regarding > Utf8::IsValid and Utf8::Decode, ...

8 years ago (2012-11-28 20:47:59 UTC) #11

Søren Gjesse

On 2012/11/28 20:47:59, erikcorry wrote: > On 2012/11/28 20:46:56, siva wrote: > > I chatted ...

8 years ago (2012-11-28 21:06:28 UTC) #12

siva

https://codereview.chromium.org/11280150/diff/13002/runtime/vm/dart_api_message.cc File runtime/vm/dart_api_message.cc (right): https://codereview.chromium.org/11280150/diff/13002/runtime/vm/dart_api_message.cc#newcode407 runtime/vm/dart_api_message.cc:407: } At this point here we could state that ...

8 years ago (2012-11-28 22:02:24 UTC) #13

cshapiro

> Also with String.fromCharCodes you can create any sequence of UTF-16 code points > (see ...

8 years ago (2012-11-29 02:14:22 UTC) #15

Søren Gjesse

8 years ago (2012-11-29 08:23:33 UTC) #16

On 2012/11/29 02:14:22, cshapiro wrote:
> > Also with String.fromCharCodes you can create any sequence of UTF-16 code
> > points (see the new tests in snapshot_test.dart).
> 
> When we decided to use UTF-16 that was explicitly not supposed to happen.  The
> interface should admit valid sequences of Unicode scalar values, only. 
> 
> If I understand you correctly, if the implementation has relaxed, it is a bug.
> 
> > I don't see that we can have an API where what can be returned by
> > Dart_StringToUTF8 cannot always be used as valid input to
> > Dart_NewStringFromUTF8. The same argument goes for Dart_CObject structures
> > where received strings should be valid strings to send back.
> 
> To avoid the asymmetric behavior, Dart_StringToUTF8 should reject malformed
> input.
> 
> Avoiding the marshaling seems like a better option.  Latin-1 and UTF-16
> strings can be passed as-is and decoded by the user.  This would be convenient,
> especially when passing data to system functions that already expected UTF-16.

I don't like having two string types in the marshalling format used to
communicate with native ports. Except for Windows, there don't seem to be that
many UTF-16 APIs out there. Having two types makes it much more complicated for
programmers. Discussed this with Mads and decided to keep UTF-8 and disallow
UTF-16 surrogate code points. If such a string is received, it will be turned
into an unsupported object. If such a string is sent, the sending will fail. We
can always extend this and maybe have a configuration for choosing between UTF-8
and Latin1/UTF-16 in native port message marshalling.

My personal opinion here is that just relaxing our view of UTF-8 to allow UTF-16
surrogate code points is both the simplest and most pragmatic solution. There is
no need to be a purist when it does not add much value. We know we can get
UTF-16 strings inside the VM that contain surrogates which are not present as
valid lead/trail pairs, so it seems odd not to accept that all the way through.
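
As an illustration of the decision described above (keep UTF-8 and make sending fail for strings with unpaired surrogates), a check along the following lines could run before a string is serialized. This is only a sketch under that assumption; HasOnlyPairedSurrogates is a hypothetical name and not a function in the VM.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true if every surrogate code unit in the UTF-16 sequence is
// part of a well-formed lead/trail pair, meaning the string can be
// converted to fully legal UTF-8. Hypothetical helper, not VM code.
inline bool HasOnlyPairedSurrogates(const uint16_t* units, size_t len) {
  assert(units != nullptr || len == 0);
  for (size_t i = 0; i < len; i++) {
    uint16_t u = units[i];
    if (u >= 0xD800 && u <= 0xDBFF) {
      // A lead surrogate must be followed directly by a trail surrogate.
      if (i + 1 >= len) return false;
      uint16_t next = units[i + 1];
      if (next < 0xDC00 || next > 0xDFFF) return false;
      i++;  // Skip the trail half of the pair.
    } else if (u >= 0xDC00 && u <= 0xDFFF) {
      // A trail surrogate with no preceding lead is unpaired.
      return false;
    }
  }
  return true;
}
```

A sender would refuse to post a string for which this returns false, and a receiver would turn such a string into an unsupported object, matching the behaviour described in the message.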

erikcorry

On 2012/11/29 08:23:33, Søren Gjesse wrote: > My personal opinion here is that just relaxing ...

8 years ago (2012-11-29 08:53:31 UTC) #17

Søren Gjesse

PTA(final)L The UTF-8 encodings used are now always fully legal. If a string which cannot ...

8 years ago (2012-11-29 09:06:14 UTC) #18

cshapiro

https://codereview.chromium.org/11280150/diff/13004/runtime/vm/object.h File runtime/vm/object.h (right): https://codereview.chromium.org/11280150/diff/13004/runtime/vm/object.h#newcode3746 runtime/vm/object.h:3746: void Reset() { I would prefer that this functionality ...

8 years ago (2012-11-30 02:49:07 UTC) #20

Søren Gjesse

8 years ago (2012-11-30 12:23:07 UTC) #21

Søren Gjesse

Committing. https://codereview.chromium.org/11280150/diff/13004/runtime/vm/unicode.h File runtime/vm/unicode.h (right): https://codereview.chromium.org/11280150/diff/13004/runtime/vm/unicode.h#newcode76 runtime/vm/unicode.h:76: class CodePointIterator { On 2012/11/30 02:49:08, cshapiro wrote: ...

8 years ago (2012-11-30 13:06:03 UTC) #22

Message was sent while issue was closed.

Thanks!