|
|
Created:
6 years, 1 month ago by caitp (gmail) Modified:
6 years ago CC:
v8-dev Base URL:
https://chromium.googlesource.com/v8/v8.git@master Project:
v8 Visibility:
Public. |
DescriptionFix raw TemplateLiteral spans with non-ascii characters
BUG=v8:3710
Patch Set 1 #
Total comments: 5
Patch Set 2 : Use static Utf8Decoder, add more thorough unicode test case #Patch Set 3 : Don't use Utf8Decoder::WriteUtf16 when length is 0 #
Total comments: 2
Messages
Total messages: 19 (5 generated)
caitpotter88@gmail.com changed reviewers: + arv@chromium.org, dslomov@chromium.org, rossberg@chromium.org
On 2014/11/21 04:49:11, caitp wrote: > mailto:caitpotter88@gmail.com changed reviewers: > + mailto:arv@chromium.org, mailto:dslomov@chromium.org, mailto:rossberg@chromium.org It's based on https://codereview.chromium.org/742643003/#ps60001 --- but a simpler approach that will allocate more memory than is needed in ascii cases.
https://codereview.chromium.org/745233002/diff/1/src/parser.cc File src/parser.cc (right): https://codereview.chromium.org/745233002/diff/1/src/parser.cc#newcode5306 src/parser.cc:5306: Vector<uint8_t> hash_string = Vector<uint8_t>::New(num_hash_chars); One thing that seems strange to me is that we have to assemble a large string to compute the hash. The hash function just iterates over the characters in that string. Conceptually we should be able to compute the hash span by span. Each span is either a one byte string or a two byte string. We can compute the hash for these using a rolling hash. Looking into the source code here (this is new territory for me too) it looks like we can use a StringHasher and then call AddCharacters on it (which has one version for one byte string and one for two byte strings) and then finally get the hash code out. Take a look at https://code.google.com/p/chromium/codesearch#chromium/src/v8/src/objects-inl... for inspiration.
On 2014/11/21 06:12:49, arv wrote: > https://codereview.chromium.org/745233002/diff/1/src/parser.cc > File src/parser.cc (right): > > https://codereview.chromium.org/745233002/diff/1/src/parser.cc#newcode5306 > src/parser.cc:5306: Vector<uint8_t> hash_string = > Vector<uint8_t>::New(num_hash_chars); > One thing that seems strange to me is that we have to assemble a large string to > compute the hash. The hash function just iterates over the characters in that > string. > > Conceptually we should be able to compute the hash span by span. Each span is > either a one byte string or a two byte string. We can compute the hash for these > using a rolling hash. > > Looking into the source code here (this is new territory for me too) it looks > like we can use a StringHasher and then call AddCharacters on it (which has one > version for one byte string and one for two byte strings) and then finally get > the hash code out. > > Take a look at > https://code.google.com/p/chromium/codesearch#chromium/src/v8/src/objects-inl... > for inspiration. Making the hashing a bit more memory friendly seems a bit unrelated to this CL --- I can open a followup that tries to make it easier (but I think it would result in duplicating some code from ComputeUtf8Hash(), unfortunately).
Sorry for the late reply, I was travelling. Using rolling hash would be desirable, but that can be a separate CL. I have just one comment for this one. https://codereview.chromium.org/745233002/diff/1/src/parser.cc File src/parser.cc (right): https://codereview.chromium.org/745233002/diff/1/src/parser.cc#newcode5339 src/parser.cc:5339: unibrow::Utf8Decoder<256> decoder(raw_chars.get(), to_index); Use cached decoder from isoalte()->unicode_cache()->utf8_decoder() See: https://code.google.com/p/chromium/codesearch#chromium/src/v8/src/factory.cc&...
mathias@qiwi.be changed reviewers: + mathias@qiwi.be
https://codereview.chromium.org/745233002/diff/1/test/mjsunit/harmony/templat... File test/mjsunit/harmony/templates.js (right): https://codereview.chromium.org/745233002/diff/1/test/mjsunit/harmony/templat... test/mjsunit/harmony/templates.js:404: assertEquals("안녕", callSites[1].raw[0]); May want to test UTF-8 encoded symbols that take up 3 or 4 bytes as well. Example string that contains all of these (from https://mathiasbynens.be/notes/javascript-unicode#poo-test): "Iñtërnâtiônàlizætiøn☃💩"
There, hopefully that helps a bit https://codereview.chromium.org/745233002/diff/1/src/parser.cc File src/parser.cc (right): https://codereview.chromium.org/745233002/diff/1/src/parser.cc#newcode5339 src/parser.cc:5339: unibrow::Utf8Decoder<256> decoder(raw_chars.get(), to_index); On 2014/11/25 13:16:21, Dmitry Lomov (chromium) wrote: > Use cached decoder from isoalte()->unicode_cache()->utf8_decoder() > > See: > https://code.google.com/p/chromium/codesearch#chromium/src/v8/src/factory.cc&... Should this be cached outside of the loop? But, done and done! https://codereview.chromium.org/745233002/diff/1/test/mjsunit/harmony/templat... File test/mjsunit/harmony/templates.js (right): https://codereview.chromium.org/745233002/diff/1/test/mjsunit/harmony/templat... test/mjsunit/harmony/templates.js:404: assertEquals("안녕", callSites[1].raw[0]); On 2014/11/25 14:45:00, mathias wrote: > May want to test UTF-8 encoded symbols that take up 3 or 4 bytes as well. > Example string that contains all of these (from > https://mathiasbynens.be/notes/javascript-unicode#poo-test): > > "Iñtërnâtiônàlizætiøn☃💩" It's not necessarily guaranteed that there is any decoding from utf8 involved --- but I guess it doesn't hurt to add a more thorough unicode test case. Done and done!
lgtm
The CQ bit was checked by dslomov@chromium.org
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/745233002/20001
On 2014/11/26 15:59:42, I haz the power (commit-bot) wrote: > CQ is trying da patch. Follow status at > https://chromium-cq-status.appspot.com/patch-status/745233002/20001 Ahh --- it's a really small bug, adding a quick fix for it.
The CQ bit was checked by caitpotter88@gmail.com
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/745233002/40001
Message was sent while issue was closed.
Committed patchset #3 (id:40001)
Message was sent while issue was closed.
marja@chromium.org changed reviewers: + marja@chromium.org
Message was sent while issue was closed.
Some post-commit comments: https://codereview.chromium.org/745233002/diff/40001/src/parser.cc File src/parser.cc (right): https://codereview.chromium.org/745233002/diff/40001/src/parser.cc#newcode5341 src/parser.cc:5341: decoder->Reset(raw_chars.get(), to_index); Since we're now doing parsing on a background thread in some use cases, Parser should remain independent of the Isolate. It gets passed a ParseInfo which contains the UnicodeCache to use, you can use it here (via Scanner I guess). There are guards in place to make sure the background parsing cases don't mess with the heap, but sadly there are no guards to ensure we don't access the Isolate in non-thread-safe ways. :/
Message was sent while issue was closed.
https://codereview.chromium.org/745233002/diff/40001/src/parser.cc File src/parser.cc (right): https://codereview.chromium.org/745233002/diff/40001/src/parser.cc#newcode5341 src/parser.cc:5341: decoder->Reset(raw_chars.get(), to_index); On 2014/12/01 15:53:59, marja wrote: > Since we're now doing parsing on a background thread in some use cases, Parser > should remain independent of the Isolate. It gets passed a ParseInfo which > contains the UnicodeCache to use, you can use it here (via Scanner I guess). > > There are guards in place to make sure the background parsing cases don't mess > with the heap, but sadly there are no guards to ensure we don't access the > Isolate in non-thread-safe ways. :/ Oops, sorry, forgot about that! Good catch!
Message was sent while issue was closed.
Though, I should untangle the code in such a way that Parser simply doesn't have an Isolate in the cases where it shouldn't access it. But even then, this place would maybe just crash if this gets called via the non-isolate-having code path. I'll try to subconsciously think of a better solution. (There are some cases where Parser *does* need an Isolate, and both that and the background parsing case can end up in this code.) |