Issue 6524031: Terminate on \0 in data passed to String::New()

Issue 6524031: Terminate on \0 in data passed to String::New() (Closed)

Created:
9 years, 10 months ago by isaacs

Modified:
7 years, 5 months ago

Reviewers:
Lasse Reichstein, ry, Mads Ager (chromium)

CC:
v8-dev

Visibility:
Public.

Description

Terminate on \0 in data passed to String::New() Remove \0 chars from String::New fuzz-test data If strings are made to terminate on null characters in the data provided to String::New(), then the Traverse tests will fail because 0 is part of the fuzz-testing data. This change makes it so that they don't try to put null bytes in the test data. Failing test to demonstrate null-handling behavior There are many cases where you have a bunch of bytes which are null-padded or in some other way represent a null-terminated ASCII or UTF-8 string. Even if a number of bytes are provided, a null character should always terminate the string.

Patch Set 1 #

Created: 9 years, 10 months ago

Download [raw] [tar.bz2]

		Unified diffs	Side-by-side diffs	Delta from patch set	Stats (+26 lines, -4 lines)			Patch
	M	src/heap.cc	View		2 chunks	+8 lines, -2 lines	0 comments	Download
	M	test/cctest/test-strings.cc	View		3 chunks	+18 lines, -2 lines	0 comments	Download

Messages

Total messages: 5 (0 generated)

Expand Messages | Collapse Messages

Mads Ager (chromium)

Thanks for you patch. Null characters are perfectly legal in JavaScript strings and they do ...

9 years, 10 months ago (2011-02-16 09:30:12 UTC) #1

isaacs

I agree that "\u0000" is a valid JavaScript string. However, "x\0\0\0" is a C string ...

9 years, 10 months ago (2011-02-16 17:08:29 UTC) #2

Mads, The issue is that String::New(char*, size_t) is meant to read a UTF8 encoded byte ...

9 years, 10 months ago (2011-02-16 19:40:41 UTC) #3

Lasse Reichstein

Can you give a reference for null-bytes not being valid utf-8? The way I read ...

9 years, 10 months ago (2011-02-17 09:18:26 UTC) #4

isaacs

9 years, 10 months ago (2011-02-17 16:56:33 UTC) #5

So, \0 is valid UTF-8, but \xC0\x80 is the encoding for the 0 code-point in
"Modified UTF-8", which is backwards compatible with null-terminated c-strings.

It's interesting to me that v8 *does* terminate on \0 if a length is not given,
no matter what the bytes are.

> If null-terminated UTF-8 strings are common, it would be better to add a
> separate function to create JS strings from those.

That would be very agreeable.  Would it be better to add an optional flag?  It
seems like it's not much of a change in behavior.

> If you are creating the UTF-8 byte sequences yourself, you are likely to
already
> know where it ends, and it's just a matter of not throwing that information
> away. If you are using a library that returns null-terminated UTF-8 sequences,
> then it's obviously not as simple.

Here's a bit more information about a use case that prompted this.  (It's worse
than a library.)

Node's "Buffer" objects represent a block of bytes outside of v8's heap.  The
Buffer::ToString method takes the block of bytes, and passes it to v8's
String::New method.  The length of the buffer is known, and is provided to
String::New.  However, the length of the string it contains might *not* be
known.

For example, say that you are writing a client for a binary protocol.  Part of
the message contains a block of data that is 10 bytes long, which contains a
string that is between 1 and 10 characters, terminated with a \0 byte if it is
less than 10 bytes long.  Anything between the first \0 and the 10th byte is
considered garbage, and if no \0 byte is found, then the string is 10 bytes
long.

It makes sense to allocate a 10-byte buffer to read in this block, since you
know that it will be 10 bytes of the message.  It would be disastrous to rely on
the string ending in \0, since the 11th byte is the start of some *other* part
of the message (or another Buffer, or maybe just some random spot in memory). 
So, we must provide a length.  However, if there are only 5 characters in this
block, you only really want a 5-byte string.

The options seem to be:

1. Scan the bytes twice.  Once to get the length, and a second time in v8 to
convert them to a string.  This seems suboptimal.
2. Tell v8 to terminate on \0 bytes in String::New.

Expand Messages | Collapse Messages