Issue 121173009: String:WriteUtf8: Add REPLACE_INVALID_UTF8 option

Issue 121173009: String:WriteUtf8: Add REPLACE_INVALID_UTF8 option (Closed)

Created:
6 years, 11 months ago by haimuiba

Modified:
6 years, 11 months ago

Reviewers:
dcarney

CC:
v8-dev

Base URL:
git://github.com/v8/v8.git@master

Visibility:
Public.

More Reviews

Description

String:WriteUtf8: Add REPLACE_INVALID_UTF8 option This patch makes String::WriteUtf8 replace invalid code points (i.e. unmatched surrogates) with the unicode replacement character when REPLACE_INVALID_UTF8 is set. This is done to avoid creating invalid UTF-8 output which can lead to compatibility issues with software requiring valid UTF-8 inputs (e.g. the WebSocket protocol requires valid UTF-8 and terminates connections when invalid UTF-8 is encountered). R=dcarney@chromium.org BUG= Committed: https://code.google.com/p/v8/source/detail?r=18683

Patch Set 1 #

Total comments: 12

Patch Set 2 : DISALLOW_INVALID_UTF8 flag and fixes #

Total comments: 4

Patch Set 3 : Abandon refactoring, get core behavior change done #

Total comments: 14

Patch Set 4 : Latest version #

Total comments: 8

Patch Set 5 : Cleanup, test for edge case + incorrect fix #

Patch Set 6 : Fix mistake in test case, finish patch #

Total comments: 9

Patch Set 7 : Address remaining comments #

Patch Set 8 : Address latest comments #

Patch Set 9 : Rebase #

Created: 6 years, 11 months ago

Download [raw] [tar.bz2]

	Unified diffs	Side-by-side diffs	Delta from patch set	Stats (+145 lines, -26 lines)			Patch
M	include/v8.h	View	1 2 3 4 5 6 7 8	1 chunk	+5 lines, -1 line	0 comments	Download
M	src/api.cc	View	1 2 3 4 5 6 7 8	8 chunks	+49 lines, -18 lines	0 comments	Download
M	src/unicode.h	View	1 2 3 4	3 chunks	+13 lines, -2 lines	0 comments	Download
M	src/unicode-inl.h	View	1 2 3 4 5 6	2 chunks	+15 lines, -5 lines	0 comments	Download
M	test/cctest/test-api.cc	View	1 2 3 4 5 6 7 8	2 chunks	+63 lines, -0 lines	0 comments	Download

Messages

Total messages: 26 (0 generated)

Expand Messages | Collapse Messages

dcarney

Just a preliminary glance. What you're doing is okay, but all other callsites of Utf8::Encode ...

6 years, 11 months ago (2014-01-04 15:56:45 UTC) #2

haimuiba

On 2014/01/04 15:56:45, dcarney wrote: > Just a preliminary glance. What you're doing is okay, ...

6 years, 11 months ago (2014-01-06 05:39:58 UTC) #3

haimuiba

https://codereview.chromium.org/121173009/diff/1/src/api.cc File src/api.cc (right): https://codereview.chromium.org/121173009/diff/1/src/api.cc#newcode4519 src/api.cc:4519: // @TODO use uint16_t for previous? On 2014/01/04 15:56:45, ...

6 years, 11 months ago (2014-01-06 05:40:18 UTC) #4

dcarney

> https://codereview.chromium.org/121173009/diff/1/src/unicode.h#newcode158 > src/unicode.h:158: static inline unsigned Encode(char* str, uchar c, bool > replace_surrogates); > ...

6 years, 11 months ago (2014-01-06 07:55:58 UTC) #5

haimuiba

On 2014/01/06 07:55:58, dcarney wrote: > > https://codereview.chromium.org/121173009/diff/1/src/unicode.h#newcode158 > > src/unicode.h:158: static inline unsigned Encode(char* ...

6 years, 11 months ago (2014-01-06 08:51:02 UTC) #6

dcarney

> I'll maintain current behavior throughout the system. I agree that my patch can > ...

6 years, 11 months ago (2014-01-06 09:31:28 UTC) #7

haimuiba

This patch set addresses most issues raised in the first review, adds the DISALLOW_INVALID_UTF8 flag. ...

6 years, 11 months ago (2014-01-07 09:47:44 UTC) #8

dcarney

https://codereview.chromium.org/121173009/diff/130001/src/api.cc File src/api.cc (right): https://codereview.chromium.org/121173009/diff/130001/src/api.cc#newcode4523 src/api.cc:4523: static int WritePair(uint16_t current, WritePair is a bad name ...

6 years, 11 months ago (2014-01-07 10:12:15 UTC) #9

dcarney

https://codereview.chromium.org/121173009/diff/130001/src/unicode-inl.h File src/unicode-inl.h (right): https://codereview.chromium.org/121173009/diff/130001/src/unicode-inl.h#newcode155 src/unicode-inl.h:155: // @TODO give this the same semantics as Encode? ...

6 years, 11 months ago (2014-01-07 11:05:50 UTC) #10

haimuiba

On 2014/01/07 11:05:50, dcarney wrote: > I think you'll ultimately need to keep the rewriting ...

6 years, 11 months ago (2014-01-09 08:20:58 UTC) #11

dcarney

> I still think the UTF abstractions I've been seeing so far are extremely leaky ...

6 years, 11 months ago (2014-01-09 08:42:17 UTC) #12

haimuiba

The new patch set abandons the attempt of refactoring the UTF-16 to UTF-8 mechanism, and ...

6 years, 11 months ago (2014-01-10 06:24:29 UTC) #13

dcarney

this patch is almost ready to land. There's one edge case which needs to be ...

6 years, 11 months ago (2014-01-10 16:49:55 UTC) #14

dcarney

https://codereview.chromium.org/121173009/diff/230001/src/api.cc File src/api.cc (right): https://codereview.chromium.org/121173009/diff/230001/src/api.cc#newcode4708 src/api.cc:4708: // @TODO Replace magic number 3 with something more ...

6 years, 11 months ago (2014-01-10 16:54:33 UTC) #15

haimuiba

I addressed all issues raised except for one (see inline comments). Two questions: 1) What ...

6 years, 11 months ago (2014-01-13 07:48:20 UTC) #16

I addressed all issues raised except for one (see inline comments).

Two questions:

1) What do you think about using the term "replace" instead of "disallow"?
2) IIRC there was supposed to be some kind of feedback mechanism for the caller
of WriteUtf8 to inform him that replacements have happened. Do we need this for
landing the patch?

Thank you so much for all the reviewing!
Felix

https://codereview.chromium.org/121173009/diff/230001/include/v8.h
File include/v8.h (right):

https://codereview.chromium.org/121173009/diff/230001/include/v8.h#newcode1650
include/v8.h:1650: DISALLOW_INVALID_UTF8 = 8
On 2014/01/10 16:49:55, dcarney wrote:
> this needs a comment above

Done.

https://codereview.chromium.org/121173009/diff/230001/src/api.cc
File src/api.cc (right):

https://codereview.chromium.org/121173009/diff/230001/src/api.cc#newcode4708
src/api.cc:4708: // @TODO Replace magic number 3 with something more
descriptive. E.g.
On 2014/01/10 16:54:33, dcarney wrote:
> the syntax we use is:
> 
> // TODO(username) Comment.
> 
> but in this case, it just needs to be done before landing...

Done.

https://codereview.chromium.org/121173009/diff/230001/src/api.cc#newcode4745
src/api.cc:4745: return writer.CompleteWrite(write_null, nchars_ref);
On 2014/01/10 16:49:55, dcarney wrote:
> think you need to ensure that all return points from this function check for
the
> case that you  don't allow invalid utf8, and the last character written is one
> half of a surrogate pair, which would be incorrectly written as a replacement
> character in the current implementation

Not sure I understand. If allow_invalid_utf8=false, then any lone surrogate is
an invalid last character, and should be replaced.

https://codereview.chromium.org/121173009/diff/230001/src/unicode.h
File src/unicode.h (right):

https://codereview.chromium.org/121173009/diff/230001/src/unicode.h#newcode51
src/unicode.h:51: * (e.g. an orphan surrogate) when converting to a UTF
encoding.
On 2014/01/10 16:49:55, dcarney wrote:
> typo - UTF-8

Done.

https://codereview.chromium.org/121173009/diff/230001/src/unicode.h#newcode53
src/unicode.h:53: const int kReplacementCharacter = 0xFFFD;
On 2014/01/10 16:49:55, dcarney wrote:
> this should be in Utf8, but see below

Done.

https://codereview.chromium.org/121173009/diff/230001/src/unicode.h#newcode161
src/unicode.h:161: bool allow_invalid);
On 2014/01/10 16:49:55, dcarney wrote:
> this either needs to be an enum to avoid passing true and false directly, or
you
> need to keep the bool but use a default value of true so existing call sites
> don't change

Done by defaulting to true. I don't understand the requirement of not changing
existing call sites so, I assume this is just to minimize the diff of the patch?

https://codereview.chromium.org/121173009/diff/230001/src/unicode.h#newcode165
src/unicode.h:165: static const uchar kBadChar = 0xFFFD;
On 2014/01/10 16:49:55, dcarney wrote:
> hmmm, maybe you should just rename this variable here to kReplacementChar,
since
> that's what it is, and it might conflict with your change.
> you'll obviously need to check usages of it, but i think i removed most of
them
> in the codebase

Done. Decided to keep the kBadChar name for now as it was still used in a few
places.

dcarney

https://codereview.chromium.org/121173009/diff/300001/include/v8.h File include/v8.h (right): https://codereview.chromium.org/121173009/diff/300001/include/v8.h#newcode1652 include/v8.h:1652: // @TODO Rename to REPLACE_INVALID_UTF8? sure https://codereview.chromium.org/121173009/diff/300001/src/api.cc File src/api.cc ...

6 years, 11 months ago (2014-01-13 09:19:56 UTC) #17

dcarney

> 1) What do you think about using the term "replace" instead of "disallow"? sounds ...

6 years, 11 months ago (2014-01-13 09:24:14 UTC) #18

haimuiba

My latest patch: * Fixes the missing default value in the header (sorry, I'm not ...

6 years, 11 months ago (2014-01-15 10:52:34 UTC) #19

dcarney

all looks good i'll land this as soon as that edge case is handled

6 years, 11 months ago (2014-01-15 11:07:38 UTC) #20

haimuiba

I'm happy to report that I found the problem with my previous patch. The test ...

6 years, 11 months ago (2014-01-17 08:33:06 UTC) #21

dcarney

lgtm i'll land once you addressed the trivial comments below additionally, the description of the ...

6 years, 11 months ago (2014-01-17 09:10:12 UTC) #22

haimuiba

I've addressed all remaining issues and updated the issue description. Ping me in chat if ...

6 years, 11 months ago (2014-01-20 08:10:26 UTC) #23

haimuiba

Sending my previous patch again, my first upload attempt seems to have failed.

6 years, 11 months ago (2014-01-20 08:54:40 UTC) #24

dcarney

6 years, 11 months ago (2014-01-20 09:53:02 UTC) #26

Message was sent while issue was closed.

Committed patchset #9 manually as r18683 (presubmit successful).

Expand Messages | Collapse Messages