|
|
Created:
4 years, 9 months ago by jungshik at Google Modified:
4 years, 4 months ago Reviewers:
jochen (gone - plz use gerrit), Yang, Michael Achenbach, srl295, Dan Ehrenberg, adamk, yangguo CC:
Michael Hablich, v8-reviews_googlegroups.com Base URL:
https://chromium.googlesource.com/v8/v8.git@master Target Ref:
refs/pending/heads/master Project:
v8 Visibility:
Public. |
DescriptionUse ICU case conversion/transliterator for case conversion
When I18N is enabled, use ICU's case conversion API and transliteration
API [1] to implement String.prototype.to{Upper,Lower}Case and
String.prototype.toLocale{Upper,Lower}Case.
* ICU-based case conversion was implemented in runtime-i18n.cc/i18n.js
* The above 4 functions are overridden with those in i18n.js when
--icu_case_mapping flag is turned on. To control the override by the flag,
they're overriden in icu-case-mapping.js
Previously, toLocale{U,L}Case just called to{U,L}Case so that they didn't
support locale-sensitive case conversion for Turkic languages (az, tr),
Greek (el) and Lithuanian (lt).
Before ICU APIs for the most general case are called, a fast-path for Latin-1
is tried. It's taken from Blink and adopted as necessary. This fast path
is always tried for to{U,L}Case. For toLocale{U,L}Case, it's only taken
when a locale (explicitly specified or default) is not in {az, el, lt, tr}.
With these changes, a build with --icu_case_mapping=true passes a bunch
of tests in test262/intl402/Strings/* and intl/* that failed before.
Handling of pure ASCII strings (aligned at word boundary) are not as fast
as Unibrow's implementation that uses word-by-word case conversion. OTOH,
Latin-1 input handling is faster than Unibrow. General Unicode input
handling is slower but more accurate.
See https://docs.google.com/spreadsheets/d/1KJCJxKc1FxFXjwmYqABS0_2cNdPetvnd8gY8_HGSbrg/edit?usp=sharing for the benchmark.
This CL started with http://crrev.com/1544023002#ps200001 by littledan@,
but has changed significantly since.
[1] See why transliteration API is needed for uppercasing in Greek.
http://bugs.icu-project.org/trac/ticket/10582
R=yangguo
BUG=v8:4476, v8:4477
LOG=Y
TEST=test262/{built-ins,intl402}/Strings/*, webkit/fast/js/*, mjsunit/string-case,
intl/general/case*
Committed: https://crrev.com/b348d47bb94399045394bf4743c0c8c35328923b
Cr-Commit-Position: refs/heads/master@{#36187}
Patch Set 1 : patch copied from Dan's CL #Patch Set 2 : use UnicodeString to simplify #Patch Set 3 : do not use a macro #Patch Set 4 : working now with HandleChecked #
Total comments: 3
Patch Set 5 : add tests; fix the implementation; fix typos in testcfg #Patch Set 6 : checkpoint: use Flatten - not working #Patch Set 7 : Use CharCopy(); GetFlatContent still crashes #
Total comments: 1
Patch Set 8 : remove goto; GetFlatContent still crashes #
Total comments: 2
Patch Set 9 : isolate DisallowAllocation #Patch Set 10 : add a test with U+00FF #
Total comments: 2
Patch Set 11 : move to i18n; not yet working #Patch Set 12 : checkpoint: build still failing - unbound variable #Patch Set 13 : both to{U,L}Case and toLocale{U,L}Case working #Patch Set 14 : canonicalize |locales| and refactor: also use "lt" instead of "li" #Patch Set 15 : typo fix: li -> lt #Patch Set 16 : use transliterator for case conversion of Greek/Lithuanian #
Total comments: 1
Patch Set 17 : a bit more clean-up #Patch Set 18 : lt: do not use transliterator #Patch Set 19 : speed up toLocale*Case by 80% #Patch Set 20 : add back case conversion tests under no-i18n #Patch Set 21 : drop an entry added by mistake in test.status file #Patch Set 22 : optimize canonicalizeLanguageTag for the most common case #Patch Set 23 : get DisallowHeapAllocation out of "anon" block and use copy-on-write ctor for UnicodeString for bet… #
Total comments: 2
Patch Set 24 : enclose DisallowAllocation inside a block again #Patch Set 25 : extend the DisallowAllocation block #Patch Set 26 : fix an issue due to |sap| going out of scope #Patch Set 27 : use fastCopyFrom for UnicodeString: a minor optimization #Patch Set 28 : port back cl 1875263006 PS#8; use C API with do-while and DisallowHeapAlloc in the loop #
Total comments: 1
Patch Set 29 : fix mjsunit test; use GetFlatContents in fast path, too; 20% speed-up for ToLowerCase #Patch Set 30 : string-capitalization-expected.txt : fix #Patch Set 31 : fix string-capitalization-expected.txt #Patch Set 32 : more optimization; ToLatin1{Upper,Lower} #Patch Set 33 : use FlatContent for uppercase; add 3 templatized helpers #
Total comments: 26
Patch Set 34 : Dan's review comments addressed #Patch Set 35 : drop parallel arrays. pass locale name as a string from JS to runtime #Patch Set 36 : rebased: dropped unnecesary Wno-c99-extensions from gypi #
Total comments: 2
Patch Set 37 : trival change: unnecessary line dropped #
Total comments: 21
Patch Set 38 : do not use strcmp !! #Patch Set 39 : use the table for Latin1-Lowercasing; still slower than word-by-word conversion in Unibrow for ASCI… #Patch Set 40 : Addressed Yang's review comments #Patch Set 41 : Use Flags for icu_case_mapping in case_mapping.js; a bit clean up in C++ #Patch Set 42 : copy buffer to UnicodeString explicitly #Patch Set 43 : wip: Move to{Lower,Upper}Case override to icu-case-mapping.js #Patch Set 44 : working with icu-case-mapping.js manually compiled in; gyp change not picking it up #
Total comments: 16
Patch Set 45 : NEG_IMPL. => IMPL(es_stage, icu_case_mapping) #Patch Set 46 : rebased #Patch Set 47 : Address Yang's comments #Patch Set 48 : mark tests failing without icu_case_mapping as FAIL in test.status files #Patch Set 49 : mark webkit/../string-capitalization as FAIL because a fix is behind a flag #Patch Set 50 : use read-alias setTo for UnicodeString and add more tests #Patch Set 51 : windows compile fix: use reinterpret_cast for uc16* => UChar* #
Total comments: 1
Patch Set 52 : a bit more tweak with buffer extraction #
Total comments: 4
Patch Set 53 : move toLocale{U,L}Case behind --icu-case-mapping, too; adjust test status accordingly. roll back st… #
Total comments: 4
Patch Set 54 : use utils.import and add a new test to the fail list (passes with --icu_case_mapping) #Patch Set 55 : more efficient overrides of to{Lower,Upper}Case #Patch Set 56 : change the way toLocale*Case is overriden; still failing S15.5.4.16_A6 test #Patch Set 57 : remove function prototypes for to*Case #
Total comments: 2
Patch Set 58 : Dan's comment addressed #
Total comments: 5
Patch Set 59 : Address Yang's comment. the feature is now harmony_in_progress instead of harmony_staged #
Total comments: 7
Patch Set 60 : Flags:// --icu_..: working #Patch Set 61 : Address Yang's comment #
Total comments: 2
Patch Set 62 : rebased and build fixed #Patch Set 63 : Yang's comment addressed - return right away for no-change #
Total comments: 4
Messages
Total messages: 149 (48 generated)
littledan@chromium.org changed reviewers: + littledan@chromium.org
https://codereview.chromium.org/1812673005/diff/60001/test/intl/testcfg.py File test/intl/testcfg.py (right): https://codereview.chromium.org/1812673005/diff/60001/test/intl/testcfg.py#ne... test/intl/testcfg.py:57: flags = ["--allow-natives-syntax", "--icu-case-mapping"] + \ BTW other testcfg.py's have support for a // Flags: line at the beginning of the file, which is probably the better way to do this.
In the latest patch set, I addressed Yang's comments (from https://codereview.chromium.org/1544023002 ; compare ps#1 and the latest). However, the following issue is unresolved. Could you enlighten me as to what I'm doing wrong. Thanks https://codereview.chromium.org/1812673005/diff/120001/src/runtime/runtime-st... File src/runtime/runtime-strings.cc (right): https://codereview.chromium.org/1812673005/diff/120001/src/runtime/runtime-st... src/runtime/runtime-strings.cc:1094: String::FlatContent flat = s->GetFlatContent(); This leads to a assertion failure in objects.cc: # Fatal error in ../../src/objects.cc, line 11065 # Check failed: !AllowHeapAllocation::IsAllowed(). I have little idea what I'm doing wrong (I found a similar usage in this file and other places).
On 2016/04/07 18:57:11, jshin (jungshik at google) wrote: > In the latest patch set, I addressed Yang's comments (from > https://codereview.chromium.org/1544023002 ; compare ps#1 and the latest). > However, the following issue is unresolved. Could you enlighten me as to what > I'm doing wrong. Thanks > > https://codereview.chromium.org/1812673005/diff/120001/src/runtime/runtime-st... > File src/runtime/runtime-strings.cc (right): > > https://codereview.chromium.org/1812673005/diff/120001/src/runtime/runtime-st... > src/runtime/runtime-strings.cc:1094: String::FlatContent flat = > s->GetFlatContent(); > This leads to a assertion failure in objects.cc: > > # Fatal error in ../../src/objects.cc, line 11065 > # Check failed: !AllowHeapAllocation::IsAllowed(). > > I have little idea what I'm doing wrong (I found a similar usage in this file > and other places). Removing |DCHECK(!AllowHeapAllocation::IsAllowed())| at the beginning of GetFlatContent() seems to work (conversion works as expected without other anomaly), but obviously I have no clue as to why that DCHECK is there in the first place. (so I'd not venture to propose that it be removed).
adamk@chromium.org changed reviewers: + adamk@chromium.org
Some answers about GetFlatContent... https://codereview.chromium.org/1812673005/diff/140001/src/runtime/runtime-st... File src/runtime/runtime-strings.cc (right): https://codereview.chromium.org/1812673005/diff/140001/src/runtime/runtime-st... src/runtime/runtime-strings.cc:1094: String::FlatContent flat = s->GetFlatContent(); You'll want to add a DisallowHeapAllocation object on the stack here, like: DisallowHeapAllocation no_gc; What the DCHECK in GetFlatContent is doing is keeping you from holding onto a pointer to the flat content across a GC, which might invalidate it. Once you've added the above DisallowHeapAllocation, you'll need to make sure that you do avoid allocations while it's on the stack (otherwise you'll get other DCHECK failures). Specifically this would be the NewStringFromTwoByte() call down at the bottom; the usual thing to do is put the rest of the function inside an inner block, and allocate the DisallowHeapAllocation in there. However... https://codereview.chromium.org/1812673005/diff/140001/src/runtime/runtime-st... src/runtime/runtime-strings.cc:1106: // This UnicodeString ctor has copy-on-write semantics. It starts as a ...this comment worries me. Does this mean that converted.getBuffer() below could be pointing at |src|? That doesn't sound valid to me, since NewStringFromTwoByte needs to read from getBuffer _after_ doing an allocation, which might cause GC. This function needs to make sure the FlatContent isn't referenced when an allocation occurs.
On 2016/04/07 20:50:55, adamk wrote: > Some answers about GetFlatContent... > > https://codereview.chromium.org/1812673005/diff/140001/src/runtime/runtime-st... > File src/runtime/runtime-strings.cc (right): > > https://codereview.chromium.org/1812673005/diff/140001/src/runtime/runtime-st... > src/runtime/runtime-strings.cc:1094: String::FlatContent flat = > s->GetFlatContent(); > You'll want to add a DisallowHeapAllocation object on the stack here, like: > > DisallowHeapAllocation no_gc; > > What the DCHECK in GetFlatContent is doing is keeping you from holding onto a > pointer to the flat content across a GC, which might invalidate it. Thanks a lot !! I also discovered |DisallowHeapAllocation no_gc| right above other callers of GetFlatContent and experimented with it. As you wrote, now I'm getting DCHECK failure in other places where heap allocation *is* required. I'll follow your suggestion below to isolate DisallowHeapAllocation. > > Once you've added the above DisallowHeapAllocation, you'll need to make sure > that you do avoid allocations while it's on the stack (otherwise you'll get > other DCHECK failures). Specifically this would be the NewStringFromTwoByte() > call down at the bottom; the usual thing to do is put the rest of the function > inside an inner block, and allocate the DisallowHeapAllocation in there. > > However... > > https://codereview.chromium.org/1812673005/diff/140001/src/runtime/runtime-st... > src/runtime/runtime-strings.cc:1106: // This UnicodeString ctor has > copy-on-write semantics. It starts as a > ...this comment worries me. Does this mean that converted.getBuffer() below > could be pointing at |src|? That doesn't sound valid to me, since > NewStringFromTwoByte needs to read from getBuffer _after_ doing an allocation, > which might cause GC. This function needs to make sure the FlatContent isn't > referenced when an allocation occurs. An allocation will happen in converted.toUpper() and converted.toLower() while referring to |src|. By the time, converted.getBuffer() is called, |converted| will have a new buffer (different from |src). That new buffer can be either in stack or heap (it's icu::UniocdeString's internal representation based on the length of a string). Anyway, it looks like I have to give up on 'copy-on-write' optimization.
Forgot to ask for another look. All the tests (that have been disabled) now pass as well as a few new tests I added. https://codereview.chromium.org/1812673005/diff/60001/test/intl/testcfg.py File test/intl/testcfg.py (right): https://codereview.chromium.org/1812673005/diff/60001/test/intl/testcfg.py#ne... test/intl/testcfg.py:57: flags = ["--allow-natives-syntax", "--icu-case-mapping"] + \ On 2016/03/17 20:38:36, Dan Ehrenberg wrote: > BTW other testcfg.py's have support for a // Flags: line at the beginning of the > file, which is probably the better way to do this. I'm not sure what you meant. $ find . -name testcfg.py | xargs grep Flags does not turn up anything like that.
I have a new plan. I'd rather fix both to{Upper,Lower}Case (locale-independent / root locale) and toLocale{Upper,Lower}Case together. For that, I'll override to{Upper,Lower}Case and toLocale{U,L}Case for String the way toLocaleString() is overridden for Date/Number in i18n.js. That way, runtime-string.cc will can remain as it is and everything new/i18n/icu will be added to (runtime-)i18n.cc and i18n.js. https://codereview.chromium.org/1812673005/diff/180001/src/runtime/runtime-st... File src/runtime/runtime-strings.cc (right): https://codereview.chromium.org/1812673005/diff/180001/src/runtime/runtime-st... src/runtime/runtime-strings.cc:1271: if (FLAG_icu_case_mapping) Do we need this flag once we're satisfied with perf etc? Dan, what kind of micro benchmark test did you run?
Alright, overwriting SGTM. https://codereview.chromium.org/1812673005/diff/60001/test/intl/testcfg.py File test/intl/testcfg.py (right): https://codereview.chromium.org/1812673005/diff/60001/test/intl/testcfg.py#ne... test/intl/testcfg.py:57: flags = ["--allow-natives-syntax", "--icu-case-mapping"] + \ On 2016/04/08 at 18:09:33, jshin (jungshik at google) wrote: > On 2016/03/17 20:38:36, Dan Ehrenberg wrote: > > BTW other testcfg.py's have support for a // Flags: line at the beginning of the > > file, which is probably the better way to do this. > > I'm not sure what you meant. > > $ find . -name testcfg.py | xargs grep Flags > > does not turn up anything like that. No, Flags is used in test cases, but implemented in testcfg.py. An example of a test using flags is test/mjsunit/es6/arguments-iterator.js which has the line // Flags: --allow-natives-syntax In test/mjsunit/testcfg.py, this is supported by the code in GetFlagsForTestCase. This could be copy-pasted into intl/testcfg.py, or the tests could be moved to mjsunit (why are they separated out in the first place? is it just historical?). https://codereview.chromium.org/1812673005/diff/180001/src/runtime/runtime-st... File src/runtime/runtime-strings.cc (right): https://codereview.chromium.org/1812673005/diff/180001/src/runtime/runtime-st... src/runtime/runtime-strings.cc:1271: if (FLAG_icu_case_mapping) On 2016/04/08 at 19:44:42, jshin (jungshik at google) wrote: > Do we need this flag once we're satisfied with perf etc? > Dan, what kind of micro benchmark test did you run? Microbenchmarks and results here: https://docs.google.com/spreadsheets/d/1xDpYTaFVE97rtqQ5KyZCk4T_QJsKp-S_lRhQZ... Agreed that we won't need this flag long-term, but it's good to develop new features under a flag.
https://codereview.chromium.org/1812673005/diff/300001/src/runtime/runtime-i1... File src/runtime/runtime-i18n.cc (right): https://codereview.chromium.org/1812673005/diff/300001/src/runtime/runtime-i1... src/runtime/runtime-i18n.cc:819: false, reinterpret_cast<const UChar*>(*string_value), length); a question: Using this public API is more concise than using the internal mechanism. I just need to extract 'UChar string' (buffer and length). Is there a perf (and other reason) to use a more verbose internal way (the code above this commented-out block)? BTW, 'normalize' method (in runtime-i18n.cc) is implemented with String::Value API.
Description was changed from ========== Call out to ICU for case conversion under a new flag patch from issue 1544023002 at patchset 200001 (http://crrev.com/1544023002#ps200001) by littledan@ V8 has its own implementation of case conversion, in unibrow, which has a couple bugs. This patch calls out to Unicode case conversion from String.prototype.toLowerCase() and String.prototype.toUpperCase() if the --icu-case-mapping flag is passed. On no-intl builds, the flag is disabled. A fast-path for some latin1 strings is taken from Blink in an attempt to avoid performance regressions, and microbenchmarks show that it is competitive with the old implementation. This new behavior fixes a number of test262 tests, as well as a Kangax test. R=yangguo BUG=v8:4476 LOG=Y ========== to ========== Use ICU case conversion/transliterator for case conversion When I18N is enabled, use ICU's case conversion API and transliteration API [1] to implement String.prototype.to{Upper,Lower}Case and String.prototype.toLocale{Upper,Lower}Case. * ICU-based case conversion was implemented in runtime-i18n.cc/i18n.js * The above 4 functions are overridden with those in i18n.js when I18N is enabled. Previously, toLocale{U,L}Case just called to{U,L}Case so that they didn't support locale-sensitive case conversion for Turkic languages (az, tr), Greek (el) and Lithuanian (lt). Before ICU APIs for the most general case are called, a fast-path for Latin-1 is tried. It's taken from Blink and adopted as necessary. This fast path is always tried for to{U,L}Case. For toLocale{U,L}Case, it's only taken when a locale (explicitly specified or default) is not in {az, el, lt, tr}. With these changes, intl builds In non-intl builds, they're still handled by unibrow with a few issues. V8 has its own implementation of case conversion, in unibrow, which has a couple bugs. This patch calls out to Unicode case conversion from String.prototype.toLowerCase() and String.prototype.toUpperCase() if the --icu-case-mapping flag is passed. On no-intl builds, the flag is disabled. A fast-path for some latin1 strings is taken from Blink in an attempt to avoid performance regressions, and microbenchmarks show that it is competitive with the old implementation. This new behavior fixes a number of test262 tests, as well as a Kangax test. This CL started with http://crrev.com/1544023002#ps200001 by littledan@, but has changed significantly since. [1] See why transliteration API is needed for uppercasing in Greek. http://bugs.icu-project.org/trac/ticket/10582 http://unicode.org/cldr/trac/ticket/6921 http://unicode.org/cldr/trac/ticket/7039 R=yangguo BUG=v8:4476,v8:4477 LOG=Y ==========
Description was changed from ========== Use ICU case conversion/transliterator for case conversion When I18N is enabled, use ICU's case conversion API and transliteration API [1] to implement String.prototype.to{Upper,Lower}Case and String.prototype.toLocale{Upper,Lower}Case. * ICU-based case conversion was implemented in runtime-i18n.cc/i18n.js * The above 4 functions are overridden with those in i18n.js when I18N is enabled. Previously, toLocale{U,L}Case just called to{U,L}Case so that they didn't support locale-sensitive case conversion for Turkic languages (az, tr), Greek (el) and Lithuanian (lt). Before ICU APIs for the most general case are called, a fast-path for Latin-1 is tried. It's taken from Blink and adopted as necessary. This fast path is always tried for to{U,L}Case. For toLocale{U,L}Case, it's only taken when a locale (explicitly specified or default) is not in {az, el, lt, tr}. With these changes, intl builds In non-intl builds, they're still handled by unibrow with a few issues. V8 has its own implementation of case conversion, in unibrow, which has a couple bugs. This patch calls out to Unicode case conversion from String.prototype.toLowerCase() and String.prototype.toUpperCase() if the --icu-case-mapping flag is passed. On no-intl builds, the flag is disabled. A fast-path for some latin1 strings is taken from Blink in an attempt to avoid performance regressions, and microbenchmarks show that it is competitive with the old implementation. This new behavior fixes a number of test262 tests, as well as a Kangax test. This CL started with http://crrev.com/1544023002#ps200001 by littledan@, but has changed significantly since. [1] See why transliteration API is needed for uppercasing in Greek. http://bugs.icu-project.org/trac/ticket/10582 http://unicode.org/cldr/trac/ticket/6921 http://unicode.org/cldr/trac/ticket/7039 R=yangguo BUG=v8:4476,v8:4477 LOG=Y ========== to ========== Use ICU case conversion/transliterator for case conversion When I18N is enabled, use ICU's case conversion API and transliteration API [1] to implement String.prototype.to{Upper,Lower}Case and String.prototype.toLocale{Upper,Lower}Case. * ICU-based case conversion was implemented in runtime-i18n.cc/i18n.js * The above 4 functions are overridden with those in i18n.js when I18N is enabled. Previously, toLocale{U,L}Case just called to{U,L}Case so that they didn't support locale-sensitive case conversion for Turkic languages (az, tr), Greek (el) and Lithuanian (lt). Before ICU APIs for the most general case are called, a fast-path for Latin-1 is tried. It's taken from Blink and adopted as necessary. This fast path is always tried for to{U,L}Case. For toLocale{U,L}Case, it's only taken when a locale (explicitly specified or default) is not in {az, el, lt, tr}. With these changes, a build with I18N enabled passes a bunch of tests in test262/intl402/ and intl/ that failed before. In non-intl builds, they're still handled by unibrow with a few test failures. This CL started with http://crrev.com/1544023002#ps200001 by littledan@, but has changed significantly since. [1] See why transliteration API is needed for uppercasing in Greek. http://bugs.icu-project.org/trac/ticket/10582 http://unicode.org/cldr/trac/ticket/6921 http://unicode.org/cldr/trac/ticket/7039 R=yangguo BUG=v8:4476,v8:4477 LOG=Y ==========
Description was changed from ========== Use ICU case conversion/transliterator for case conversion When I18N is enabled, use ICU's case conversion API and transliteration API [1] to implement String.prototype.to{Upper,Lower}Case and String.prototype.toLocale{Upper,Lower}Case. * ICU-based case conversion was implemented in runtime-i18n.cc/i18n.js * The above 4 functions are overridden with those in i18n.js when I18N is enabled. Previously, toLocale{U,L}Case just called to{U,L}Case so that they didn't support locale-sensitive case conversion for Turkic languages (az, tr), Greek (el) and Lithuanian (lt). Before ICU APIs for the most general case are called, a fast-path for Latin-1 is tried. It's taken from Blink and adopted as necessary. This fast path is always tried for to{U,L}Case. For toLocale{U,L}Case, it's only taken when a locale (explicitly specified or default) is not in {az, el, lt, tr}. With these changes, a build with I18N enabled passes a bunch of tests in test262/intl402/ and intl/ that failed before. In non-intl builds, they're still handled by unibrow with a few test failures. This CL started with http://crrev.com/1544023002#ps200001 by littledan@, but has changed significantly since. [1] See why transliteration API is needed for uppercasing in Greek. http://bugs.icu-project.org/trac/ticket/10582 http://unicode.org/cldr/trac/ticket/6921 http://unicode.org/cldr/trac/ticket/7039 R=yangguo BUG=v8:4476,v8:4477 LOG=Y ========== to ========== Use ICU case conversion/transliterator for case conversion When I18N is enabled, use ICU's case conversion API and transliteration API [1] to implement String.prototype.to{Upper,Lower}Case and String.prototype.toLocale{Upper,Lower}Case. * ICU-based case conversion was implemented in runtime-i18n.cc/i18n.js * The above 4 functions are overridden with those in i18n.js when I18N is enabled. Previously, toLocale{U,L}Case just called to{U,L}Case so that they didn't support locale-sensitive case conversion for Turkic languages (az, tr), Greek (el) and Lithuanian (lt). Before ICU APIs for the most general case are called, a fast-path for Latin-1 is tried. It's taken from Blink and adopted as necessary. This fast path is always tried for to{U,L}Case. For toLocale{U,L}Case, it's only taken when a locale (explicitly specified or default) is not in {az, el, lt, tr}. With these changes, a build with I18N enabled passes a bunch of tests in test262/intl402/ and intl/ that failed before. In non-intl builds, they're still handled by unibrow with a few test failures. This CL started with http://crrev.com/1544023002#ps200001 by littledan@, but has changed significantly since. [1] See why transliteration API is needed for uppercasing in Greek. http://bugs.icu-project.org/trac/ticket/10582 R=yangguo BUG=v8:4476,v8:4477 LOG=Y ==========
On 2016/04/11 17:47:30, jshin (jungshik at google) wrote: > https://codereview.chromium.org/1812673005/diff/300001/src/runtime/runtime-i1... > File src/runtime/runtime-i18n.cc (right): > > https://codereview.chromium.org/1812673005/diff/300001/src/runtime/runtime-i1... > src/runtime/runtime-i18n.cc:819: false, reinterpret_cast<const > UChar*>(*string_value), length); > a question: Using this public API is more concise than using the internal > mechanism. I just need to extract 'UChar string' (buffer and length). Is there a > perf (and other reason) to use a more verbose internal way (the code above this > commented-out block)? BTW, 'normalize' method (in runtime-i18n.cc) is > implemented with String::Value API. I tried the public API and it's slower (as expected) than internal API so that I'll stick to the internal API. Besides, I made several tweaks in i18n.js to make the performance of toLocale{U,L}Case with or without |locales| approach the performance of to{U,L}Case as much as possible. BTW, for slow path (beyond Latin-1), to{U,L}Case in my CL is 1.8 ~ 2 times slower than Dan's original CL when the same input was tested (such as Greek characters). I'm trying to figure out the cause.
The CQ bit was checked by jshin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1812673005/440001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1812673005/440001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: v8_linux64_asan_rel_ng on tryserver.v8 (JOB_FAILED, http://build.chromium.org/p/tryserver.v8/builders/v8_linux64_asan_rel_ng/buil...) v8_linux64_asan_rel_ng_triggered on tryserver.v8 (JOB_FAILED, http://build.chromium.org/p/tryserver.v8/builders/v8_linux64_asan_rel_ng_trig...)
PS #23 failed in a debug build because of heap allocation with DisAllowHeapAllocation in effect. https://codereview.chromium.org/1812673005/diff/440001/src/runtime/runtime-i1... File src/runtime/runtime-i18n.cc (right): https://codereview.chromium.org/1812673005/diff/440001/src/runtime/runtime-i1... src/runtime/runtime-i18n.cc:784: DisallowHeapAllocation no_gc; Getting this out of an isolated block to avoid an additional buffer copy [1] does not work (I forgot to try a debug build and thought that I can get away with it) because there's a heap allocation down the road. I'll see if there's a way around it after going back to the previous patch set. A question to v8 folks: what's the impact of |DisallowHeapAllocation| on heap allocation in a third_party library (e.g. ICU)? Can bad things happen? BTW, does SmartArrayPointer<uc16) behave differently when it's inside |DisallowHeapAllocation| block? [1] I also have to make sure that additional buffer copy is indeed the reason for slow down.
https://codereview.chromium.org/1812673005/diff/440001/src/runtime/runtime-i1... File src/runtime/runtime-i18n.cc (right): https://codereview.chromium.org/1812673005/diff/440001/src/runtime/runtime-i1... src/runtime/runtime-i18n.cc:784: DisallowHeapAllocation no_gc; On 2016/04/13 at 17:34:34, jshin (jungshik at google) wrote: > Getting this out of an isolated block to avoid an additional buffer copy [1] does not work (I forgot to try a debug build and thought that I can get away with it) because there's a heap allocation down the road. > > I'll see if there's a way around it after going back to the previous patch set. > > A question to v8 folks: what's the impact of |DisallowHeapAllocation| on heap allocation in a third_party library (e.g. ICU)? Can bad things happen? BTW, does SmartArrayPointer<uc16) behave differently when it's inside |DisallowHeapAllocation| block? > > [1] I also have to make sure that additional buffer copy is indeed the reason for slow down. I believe DisallowHeapAllocation's function is to add a DCHECK that fires when allocation occurs/may occur from C++ (grep around for DCHECK(AllowHeapAllocation::IsAllowed()); ). It won't actually change the nature of allocations.
On 2016/04/13 17:43:43, Dan Ehrenberg wrote: > https://codereview.chromium.org/1812673005/diff/440001/src/runtime/runtime-i1... > File src/runtime/runtime-i18n.cc (right): > > https://codereview.chromium.org/1812673005/diff/440001/src/runtime/runtime-i1... > src/runtime/runtime-i18n.cc:784: DisallowHeapAllocation no_gc; > On 2016/04/13 at 17:34:34, jshin (jungshik at google) wrote: > > Getting this out of an isolated block to avoid an additional buffer copy [1] > does not work (I forgot to try a debug build and thought that I can get away > with it) because there's a heap allocation down the road. > > > > I'll see if there's a way around it after going back to the previous patch > set. > > > > A question to v8 folks: what's the impact of |DisallowHeapAllocation| on heap > allocation in a third_party library (e.g. ICU)? Can bad things happen? BTW, does > SmartArrayPointer<uc16) behave differently when it's inside > |DisallowHeapAllocation| block? > > > > [1] I also have to make sure that additional buffer copy is indeed the reason > for slow down. > > I believe DisallowHeapAllocation's function is to add a DCHECK that fires when > allocation occurs/may occur from C++ (grep around for > DCHECK(AllowHeapAllocation::IsAllowed()); ). It won't actually change the nature > of allocations. Sorry that I was not clear. I did see DCHECK(!AllowHeapAllocation::IsAllowed()) at the top of GetFlatContent(). My question was if allocation outside v8 such as ICU and 'NewArray' (in v8) is something to worry about in terms of gc triggering even though obviously ICU not know about DisAllowHeapAllocation and will not dcheck for it. How about NewArray used in SmartArrayPointer for uc16? It seems that it's ok. Related to that and adamk's earlier comment: I have a question up in my daughter/experimental cl at https://codereview.chromium.org/1875263006 It'd be great if my question there can be answered. Thanks
On 2016/04/14 at 00:23:18, jshin wrote: > On 2016/04/13 17:43:43, Dan Ehrenberg wrote: > > https://codereview.chromium.org/1812673005/diff/440001/src/runtime/runtime-i1... > > File src/runtime/runtime-i18n.cc (right): > > > > https://codereview.chromium.org/1812673005/diff/440001/src/runtime/runtime-i1... > > src/runtime/runtime-i18n.cc:784: DisallowHeapAllocation no_gc; > > On 2016/04/13 at 17:34:34, jshin (jungshik at google) wrote: > > > Getting this out of an isolated block to avoid an additional buffer copy [1] > > does not work (I forgot to try a debug build and thought that I can get away > > with it) because there's a heap allocation down the road. > > > > > > I'll see if there's a way around it after going back to the previous patch > > set. > > > > > > A question to v8 folks: what's the impact of |DisallowHeapAllocation| on heap > > allocation in a third_party library (e.g. ICU)? Can bad things happen? BTW, does > > SmartArrayPointer<uc16) behave differently when it's inside > > |DisallowHeapAllocation| block? > > > > > > [1] I also have to make sure that additional buffer copy is indeed the reason > > for slow down. > > > > I believe DisallowHeapAllocation's function is to add a DCHECK that fires when > > allocation occurs/may occur from C++ (grep around for > > DCHECK(AllowHeapAllocation::IsAllowed()); ). It won't actually change the nature > > of allocations. > > Sorry that I was not clear. I did see DCHECK(!AllowHeapAllocation::IsAllowed()) at the top of GetFlatContent(). > My question was if allocation outside v8 such as ICU and 'NewArray' (in v8) is something to worry about in terms > of gc triggering even though obviously ICU not know about DisAllowHeapAllocation and will not dcheck for it. > How about NewArray used in SmartArrayPointer for uc16? It seems that it's ok. Yes, I don't think ICU would call into V8 to tell whether this flag is set. > > Related to that and adamk's earlier comment: I have a question up in my daughter/experimental cl at > https://codereview.chromium.org/1875263006 > > It'd be great if my question there can be answered. Thanks
PS #24 and PS #25 do not have a crash issue that I have with an alternative CL at https://codereview.chromium.org/1875263006/ when a debug build of d8 is run with '--verify-heap' option. However, it's about 50% slower (for a microbench mark with non-Latin-1 input) than the alternative CL mentioned above.
The CQ bit was checked by jshin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1812673005/540001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1812673005/540001
https://codereview.chromium.org/1812673005/diff/540001/src/js/i18n.js File src/js/i18n.js (right): https://codereview.chromium.org/1812673005/diff/540001/src/js/i18n.js#newcode... src/js/i18n.js:2082: OverrideFunction(GlobalString.prototype, 'toLowerCase', function() { Dan, can you give me a pointer to the way a 'flag' (in this case, icu_case_mapping) can be used in JS to turn on or off this and 3 other OverideFunctions? They need to be changed depending on the value of |icu_case_mapping| flag if we want to enable what's implemented in the CL behind a flag at first.
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: v8_linux_nodcheck_rel_ng on tryserver.v8 (JOB_FAILED, http://build.chromium.org/p/tryserver.v8/builders/v8_linux_nodcheck_rel_ng/bu...) v8_linux_nodcheck_rel_ng_triggered on tryserver.v8 (JOB_FAILED, http://build.chromium.org/p/tryserver.v8/builders/v8_linux_nodcheck_rel_ng_tr...)
On 2016/04/15 10:03:20, commit-bot: I haz the power wrote: > Dry run: Try jobs failed on following builders: > v8_linux_nodcheck_rel_ng on tryserver.v8 (JOB_FAILED, > http://build.chromium.org/p/tryserver.v8/builders/v8_linux_nodcheck_rel_ng/bu...) > v8_linux_nodcheck_rel_ng_triggered on tryserver.v8 (JOB_FAILED, > http://build.chromium.org/p/tryserver.v8/builders/v8_linux_nodcheck_rel_ng_tr...) mjsunit/string-case failed because fast path does not handle sliced/external string properly. For instance, I got a bogus result for the following (in Debug build, DCHECK is triggered). out/Release/d8 -e 'print("AbRERqwewlmaopqrstuvewreafadkfdfdkfdfdfdafdldfdfadfa".substring(1).toLowerCase())' Ñrerqwewlmaopqrstuvewreafadkfdfdkfdfdfdafdldfdfadfa I'll fix that. In case of webkit/js/fast/string-capitalization, I'm not sure what's wrong. All the tests pass in the buildbot output and my local output, but the overall result is FAIL in the buildbot. https://build.chromium.org/p/tryserver.v8/builders/v8_win_nosnap_shared_rel_n... As for 'durations' failure, I have no idea what's expected result for them? https://build.chromium.org/p/tryserver.v8/builders/v8_win_nosnap_shared_rel_n...
On 2016/04/15 at 17:50:05, jshin wrote: > On 2016/04/15 10:03:20, commit-bot: I haz the power wrote: > > Dry run: Try jobs failed on following builders: > > v8_linux_nodcheck_rel_ng on tryserver.v8 (JOB_FAILED, > > http://build.chromium.org/p/tryserver.v8/builders/v8_linux_nodcheck_rel_ng/bu...) > > v8_linux_nodcheck_rel_ng_triggered on tryserver.v8 (JOB_FAILED, > > http://build.chromium.org/p/tryserver.v8/builders/v8_linux_nodcheck_rel_ng_tr...) > > mjsunit/string-case failed because fast path does not handle sliced/external string properly. For instance, I got a bogus result for the following (in Debug build, DCHECK is triggered). > > out/Release/d8 -e 'print("AbRERqwewlmaopqrstuvewreafadkfdfdkfdfdfdafdldfdfadfa".substring(1).toLowerCase())' > Ñrerqwewlmaopqrstuvewreafadkfdfdkfdfdfdafdldfdfadfa > > I'll fix that. > > In case of webkit/js/fast/string-capitalization, I'm not sure what's wrong. All the tests pass in the buildbot output and my local output, but the overall result is FAIL in the buildbot. > > https://build.chromium.org/p/tryserver.v8/builders/v8_win_nosnap_shared_rel_n... > > As for 'durations' failure, I have no idea what's expected result for them? https://build.chromium.org/p/tryserver.v8/builders/v8_win_nosnap_shared_rel_n... No need to pay attention to that; it is not an individual test failure, just an artifact of our test infrastructure.
On 2016/04/15 18:02:30, Dan Ehrenberg wrote: > On 2016/04/15 at 17:50:05, jshin wrote: Thanks, Dan, for the reply. > > > > In case of webkit/js/fast/string-capitalization, I'm not sure what's wrong. > All the tests pass in the buildbot output and my local output, but the overall > result is FAIL in the buildbot. > > > > > https://build.chromium.org/p/tryserver.v8/builders/v8_win_nosnap_shared_rel_n... Do you have any idea why buildbots regard webkit/fast/js/string-capitalization as failed? All the tests pass, but the overall result is 'failure'. I can't reproduce that locally. ( https://build.chromium.org/p/tryserver.v8/builders/v8_win_nosnap_shared_rel_n... ) > > As for 'durations' failure, I have no idea what's expected result for them? > https://build.chromium.org/p/tryserver.v8/builders/v8_win_nosnap_shared_rel_n... > > No need to pay attention to that; it is not an individual test failure, just an > artifact of our test infrastructure. Good to know that. Thanks.
On 2016/04/15 at 19:13:26, jshin wrote: > On 2016/04/15 18:02:30, Dan Ehrenberg wrote: > > On 2016/04/15 at 17:50:05, jshin wrote: > > Thanks, Dan, for the reply. > > > > > > > In case of webkit/js/fast/string-capitalization, I'm not sure what's wrong. > > All the tests pass in the buildbot output and my local output, but the overall > > result is FAIL in the buildbot. > > > > > > > > https://build.chromium.org/p/tryserver.v8/builders/v8_win_nosnap_shared_rel_n... > > Do you have any idea why buildbots regard webkit/fast/js/string-capitalization as failed? All the tests pass, but the overall result is 'failure'. I can't > reproduce that locally. ( https://build.chromium.org/p/tryserver.v8/builders/v8_win_nosnap_shared_rel_n... ) The webkit tests run by comparing the actual results with an expections file, *-expected.txt. We may have had an expected failure previously. If so, the right thing to do is to turn it into an expected success. > > > > > As for 'durations' failure, I have no idea what's expected result for them? > > https://build.chromium.org/p/tryserver.v8/builders/v8_win_nosnap_shared_rel_n... > > > > No need to pay attention to that; it is not an individual test failure, just an > > artifact of our test infrastructure. > > Good to know that. Thanks.
The CQ bit was checked by jshin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1812673005/560001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1812673005/560001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: v8_linux64_avx2_rel_ng on tryserver.v8 (JOB_FAILED, http://build.chromium.org/p/tryserver.v8/builders/v8_linux64_avx2_rel_ng/buil...) v8_linux64_avx2_rel_ng_triggered on tryserver.v8 (JOB_FAILED, http://build.chromium.org/p/tryserver.v8/builders/v8_linux64_avx2_rel_ng_trig...)
On 2016/04/15 20:25:04, Dan Ehrenberg wrote: > On 2016/04/15 at 19:13:26, jshin wrote: > https://build.chromium.org/p/tryserver.v8/builders/v8_win_nosnap_shared_rel_n... > > > > Do you have any idea why buildbots regard webkit/fast/js/string-capitalization > as failed? All the tests pass, but the overall result is 'failure'. I can't > > reproduce that locally. ( > https://build.chromium.org/p/tryserver.v8/builders/v8_win_nosnap_shared_rel_n... > ) > > The webkit tests run by comparing the actual results with an expections file, > *-expected.txt. We may have had an expected failure previously. If so, the right > thing to do is to turn it into an expected success. Ok. It's just like a regular Webkit layout test. However, webkit/fast/js/string-capitalization-expectation.txt has FAIL String("A𐐀").toLowerCase() should be a𐐨. Was a𐐀. FAIL String("a𐐨").toUpperCase() should be A𐐀. Was A𐐨. I fixed *expectation file in the latest PS.
The CQ bit was checked by jshin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1812673005/600001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1812673005/600001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
The CQ bit was checked by jshin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1812673005/620001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1812673005/620001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
The CQ bit was checked by jshin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1812673005/640001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1812673005/640001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
Description was changed from ========== Use ICU case conversion/transliterator for case conversion When I18N is enabled, use ICU's case conversion API and transliteration API [1] to implement String.prototype.to{Upper,Lower}Case and String.prototype.toLocale{Upper,Lower}Case. * ICU-based case conversion was implemented in runtime-i18n.cc/i18n.js * The above 4 functions are overridden with those in i18n.js when I18N is enabled. Previously, toLocale{U,L}Case just called to{U,L}Case so that they didn't support locale-sensitive case conversion for Turkic languages (az, tr), Greek (el) and Lithuanian (lt). Before ICU APIs for the most general case are called, a fast-path for Latin-1 is tried. It's taken from Blink and adopted as necessary. This fast path is always tried for to{U,L}Case. For toLocale{U,L}Case, it's only taken when a locale (explicitly specified or default) is not in {az, el, lt, tr}. With these changes, a build with I18N enabled passes a bunch of tests in test262/intl402/ and intl/ that failed before. In non-intl builds, they're still handled by unibrow with a few test failures. This CL started with http://crrev.com/1544023002#ps200001 by littledan@, but has changed significantly since. [1] See why transliteration API is needed for uppercasing in Greek. http://bugs.icu-project.org/trac/ticket/10582 R=yangguo BUG=v8:4476,v8:4477 LOG=Y ========== to ========== Use ICU case conversion/transliterator for case conversion When I18N is enabled, use ICU's case conversion API and transliteration API [1] to implement String.prototype.to{Upper,Lower}Case and String.prototype.toLocale{Upper,Lower}Case. * ICU-based case conversion was implemented in runtime-i18n.cc/i18n.js * The above 4 functions are overridden with those in i18n.js when I18N is enabled. Previously, toLocale{U,L}Case just called to{U,L}Case so that they didn't support locale-sensitive case conversion for Turkic languages (az, tr), Greek (el) and Lithuanian (lt). Before ICU APIs for the most general case are called, a fast-path for Latin-1 is tried. It's taken from Blink and adopted as necessary. This fast path is always tried for to{U,L}Case. For toLocale{U,L}Case, it's only taken when a locale (explicitly specified or default) is not in {az, el, lt, tr}. With these changes, a build with I18N enabled passes a bunch of tests in test262/intl402/Strings/* and intl/* that failed before. In non-intl builds, they're still handled by unibrow with a few test failures. This CL started with http://crrev.com/1544023002#ps200001 by littledan@, but has changed significantly since. [1] See why transliteration API is needed for uppercasing in Greek. http://bugs.icu-project.org/trac/ticket/10582 R=yangguo BUG=v8:4476,v8:4477 LOG=Y TEST= test262/{built-ins,intl402}/Strings/*, webkit/fast/js/*, mjsunit/string-case ==========
Hi Dan, Adam, and Yang Can you take a look? Putting this behind the flag (icu_case_mapping) is not done, yet. (Dan, can you give me a pointer? ). The performance characteristics is similar to what Dan obtained with his CL (but with different performance differential against ToT). ASCII/Latin-1 (unless U+00B5 or U+00FF is present in the inut) is faster than the current (unibrow). The full Unicode (to{U,L}Case) is slower. I'll post the result soon.
https://codereview.chromium.org/1812673005/diff/640001/src/js/i18n.js File src/js/i18n.js (right): https://codereview.chromium.org/1812673005/diff/640001/src/js/i18n.js#newcode728 src/js/i18n.js:728: if ((typeof localeID !== 'string' && typeof localeID !== 'object') || Nit: As long as you're changing this code, could you use IS_STRING(localeID), and IS_RECEIVER(localeID) to be more idiomatic (and slightly more correct)? https://codereview.chromium.org/1812673005/diff/640001/src/js/i18n.js#newcode735 src/js/i18n.js:735: if (typeof localeID === 'string' && IS_STRING https://codereview.chromium.org/1812673005/diff/640001/src/js/i18n.js#newcode... src/js/i18n.js:2000: function getCaseConversionLanguageId(locales) { Generally, the v8 convention is CamelCase with an initial capital letter. Let's use that for new code. https://codereview.chromium.org/1812673005/diff/640001/src/js/i18n.js#newcode... src/js/i18n.js:2007: } else if (typeof locales === 'string') { IS_STRING https://codereview.chromium.org/1812673005/diff/640001/src/js/i18n.js#newcode... src/js/i18n.js:2014: // StringSplit is slwoer than this. slower Consider factoring this out into a utility function. https://codereview.chromium.org/1812673005/diff/640001/src/js/i18n.js#newcode... src/js/i18n.js:2019: var CUSTOM_CASE_LANGUAGES = ['az', 'el', 'lt', 'tr']; Could we somehow query this from ICU? I thought there were more languages that had custom case conversion, like French for removing accents over vowels when going uppercase. Is there a way to avoid having this parallel array in JS and C++? For example, what if we just pass the string into the runtime, and let that interpret it. https://codereview.chromium.org/1812673005/diff/640001/src/js/i18n.js#newcode... src/js/i18n.js:2023: function localeConvertCase(s, locales, isToUpper) { LocaleConvertCase https://codereview.chromium.org/1812673005/diff/640001/src/js/i18n.js#newcode... src/js/i18n.js:2088: return %StringToLowerCaseI18N(s); ECMA262 seems to specify using the root locale in https://tc39.github.io/ecma262/#sec-string.prototype.tolowercase ; however, the implementation seems to pass "" to ICU, which would use the current default locale, right? https://codereview.chromium.org/1812673005/diff/640001/src/runtime/runtime-i1... File src/runtime/runtime-i18n.cc (right): https://codereview.chromium.org/1812673005/diff/640001/src/runtime/runtime-i1... src/runtime/runtime-i18n.cc:788: if (V8_UNLIKELY(locale_id == 1 && is_to_upper)) { Maybe make an enum for these ids, like enum LocaleID { ROOT = -1, AZERI = 0, GREEK = 1, ... }; static const char* conversion_locales[] = { ..., [GREEK] = "el", ... }; https://codereview.chromium.org/1812673005/diff/640001/src/runtime/runtime-i1... src/runtime/runtime-i18n.cc:814: case_conversion_fn fn = is_to_upper ? u_strToUpper : u_strToLower; Nit: Since this is just used locally, maybe a good case for auto? https://codereview.chromium.org/1812673005/diff/640001/src/runtime/runtime-i1... src/runtime/runtime-i18n.cc:815: const char* locale = locale_id == -1 ? "" : conversion_locales[locale_id]; The use of the root locale seems appropriate here for String.prototype.toUpperCase. However, if the default locale is different from the root locale with respect to case mapping, do we know that it will be included in the set of four languages which is included in this code? https://codereview.chromium.org/1812673005/diff/640001/src/runtime/runtime-i1... src/runtime/runtime-i18n.cc:822: // twice (overflow). Any way we could include a DCHECK to ensure it doesn't run more than twice? https://codereview.chromium.org/1812673005/diff/640001/src/runtime/runtime-i1... src/runtime/runtime-i18n.cc:847: SeqString::Truncate(result, dest_length)); Do you have a test which hits this case?
Thanks a lot for the review. I addressed your comments. One remaining question is how to pass a conversion locale between JS and runtime. Passing a string (out of 4 or 5 if 'und' is included) is more readable / maintainable, but I have a concern about wrapping and unwrapping a string. Do you have https://codereview.chromium.org/1812673005/diff/640001/src/js/i18n.js File src/js/i18n.js (right): https://codereview.chromium.org/1812673005/diff/640001/src/js/i18n.js#newcode728 src/js/i18n.js:728: if ((typeof localeID !== 'string' && typeof localeID !== 'object') || On 2016/04/20 22:01:29, Dan Ehrenberg wrote: > Nit: As long as you're changing this code, could you use IS_STRING(localeID), > and IS_RECEIVER(localeID) to be more idiomatic (and slightly more correct)? Done. https://codereview.chromium.org/1812673005/diff/640001/src/js/i18n.js#newcode735 src/js/i18n.js:735: if (typeof localeID === 'string' && On 2016/04/20 22:01:29, Dan Ehrenberg wrote: > IS_STRING Done. https://codereview.chromium.org/1812673005/diff/640001/src/js/i18n.js#newcode... src/js/i18n.js:2000: function getCaseConversionLanguageId(locales) { On 2016/04/20 22:01:29, Dan Ehrenberg wrote: > Generally, the v8 convention is CamelCase with an initial capital letter. Let's > use that for new code. Done. https://codereview.chromium.org/1812673005/diff/640001/src/js/i18n.js#newcode... src/js/i18n.js:2007: } else if (typeof locales === 'string') { On 2016/04/20 22:01:29, Dan Ehrenberg wrote: > IS_STRING Done. https://codereview.chromium.org/1812673005/diff/640001/src/js/i18n.js#newcode... src/js/i18n.js:2014: // StringSplit is slwoer than this. On 2016/04/20 22:01:29, Dan Ehrenberg wrote: > slower Typo fixed. > Consider factoring this out into a utility function. You mean "var pos .. ; if (pos != -1) .... " ? Like this? function GetLanguagePart(locale) { // StringSplit is slower than this. var pos = %_Call(StringIndexOf, locale, '-'); return pos == -1 ? locale : %_Call(StringSubstring, locale, 0, pos); } https://codereview.chromium.org/1812673005/diff/640001/src/js/i18n.js#newcode... src/js/i18n.js:2019: var CUSTOM_CASE_LANGUAGES = ['az', 'el', 'lt', 'tr']; On 2016/04/20 22:01:29, Dan Ehrenberg wrote: > Could we somehow query this from ICU? I thought there were more languages that > had custom case conversion, like French for removing accents over vowels when > going uppercase. > Some typesetting systems in the past couldn't deal with uppercase with diacritics, but that's not the case any more. The French Academy is very clear about what to do about them. (do not drop diacritics). Well, some native French speakers get confused, too. :-) Anyway, what languages are subject to special casings comes from the Unicode data files, but CLDR/ICU does not have that list exposed via API. Unicode's SpecialCasing.txt has 'tr', 'az' and 'lt'. 'el' is not there yet. ECMAScript 402 only lists 'tr', 'az' and 'lt' at the moment citing SpecialCasing.txt at unicode.org (I'll file a bug against the spec), but 'el' certainly needs a tailored case mapping. (Blink and Gecko do that, already for CSS text-transform) per strong demand from Greek speakers (Mozilla's implementation is more sophisticated than CLDR/ICU's. CLDR/ICU is looking into revising it). > Is there a way to avoid having this parallel array in JS and C++? For example, > what if we just pass the string into the runtime, and let that interpret it. I hate parallel arrays, too !! OTOH, I want to avoid passing a string from JS to C++ unless there's an efficient way to wrap and unwrap it (almost as efficient as passing an integer). StringNormalize does the same (passing integer instead of 'NFC', 'NFKC', "NFD', 'NFKD'). If there's an efficient way to pass one of a small list of strings between JS and runtime, I'd for sure use that. https://codereview.chromium.org/1812673005/diff/640001/src/js/i18n.js#newcode... src/js/i18n.js:2023: function localeConvertCase(s, locales, isToUpper) { On 2016/04/20 22:01:29, Dan Ehrenberg wrote: > LocaleConvertCase Done. https://codereview.chromium.org/1812673005/diff/640001/src/js/i18n.js#newcode... src/js/i18n.js:2088: return %StringToLowerCaseI18N(s); On 2016/04/20 22:01:29, Dan Ehrenberg wrote: > ECMA262 seems to specify using the root locale in > https://tc39.github.io/ecma262/#sec-string.prototype.tolowercase ; Right. > however, the > implementation seems to pass "" to ICU, which would use the current default > locale, right? "" for root locale and NULL for the default locale :-) @param locale The locale to consider, or "" for the root locale or NULL for the default locale. https://codereview.chromium.org/1812673005/diff/640001/src/runtime/runtime-i1... File src/runtime/runtime-i18n.cc (right): https://codereview.chromium.org/1812673005/diff/640001/src/runtime/runtime-i1... src/runtime/runtime-i18n.cc:788: if (V8_UNLIKELY(locale_id == 1 && is_to_upper)) { On 2016/04/20 22:01:30, Dan Ehrenberg wrote: > Maybe make an enum for these ids, like > > enum LocaleID { > ROOT = -1, > AZERI = 0, > GREEK = 1, > ... > }; > > static const char* conversion_locales[] = { > ..., > [GREEK] = "el", > ... > }; Is it ok to use a C99 feature? I have to add '-Wno-c99-extensions' to gyp file. This is related to your comment in i18n.js. If a string is used for locale, I don't have to worry about this. As I replied there, I have a perf concern (maybe I have to measure). https://codereview.chromium.org/1812673005/diff/640001/src/runtime/runtime-i1... src/runtime/runtime-i18n.cc:814: case_conversion_fn fn = is_to_upper ? u_strToUpper : u_strToLower; On 2016/04/20 22:01:30, Dan Ehrenberg wrote: > Nit: Since this is just used locally, maybe a good case for auto? Thanks. Done https://codereview.chromium.org/1812673005/diff/640001/src/runtime/runtime-i1... src/runtime/runtime-i18n.cc:815: const char* locale = locale_id == -1 ? "" : conversion_locales[locale_id]; On 2016/04/20 22:01:30, Dan Ehrenberg wrote: > The use of the root locale seems appropriate here for > String.prototype.toUpperCase. However, if the default locale is different from > the root locale with respect to case mapping, do we know that it will be > included in the set of four languages which is included in this code? Again, "" means root locale and NULL means the default locale. :-) As toLocale{U,L}Case without an argument, it's the default locale. And, if the default locale happens to be one of 4, that's already taken care in JS code. https://codereview.chromium.org/1812673005/diff/640001/src/runtime/runtime-i1... src/runtime/runtime-i18n.cc:822: // twice (overflow). On 2016/04/20 22:01:30, Dan Ehrenberg wrote: > Any way we could include a DCHECK to ensure it doesn't run more than twice? This is the second time I got that question in a month :-). (the first one was for a sqlite CL). This is a pretty common pattern used multiple times in both Chromium and ICU itself. Anyway, I'm turning do-while to for loop (i=0 ; i < 2; ++i). https://codereview.chromium.org/1812673005/diff/640001/src/runtime/runtime-i1... src/runtime/runtime-i18n.cc:847: SeqString::Truncate(result, dest_length)); On 2016/04/20 22:01:29, Dan Ehrenberg wrote: > Do you have a test which hits this case? intl/general/case-mapping.js has several (Greek and Turkic drop diacritic marks). test262/intl402/Strings/* has some, too.
The CQ bit was checked by jshin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1812673005/680001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1812673005/680001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: v8_linux_arm_rel_ng on tryserver.v8 (JOB_FAILED, http://build.chromium.org/p/tryserver.v8/builders/v8_linux_arm_rel_ng/builds/632) v8_linux_dbg_ng on tryserver.v8 (JOB_FAILED, http://build.chromium.org/p/tryserver.v8/builders/v8_linux_dbg_ng/builds/4777) v8_linux_gcc_compile_rel on tryserver.v8 (JOB_FAILED, http://build.chromium.org/p/tryserver.v8/builders/v8_linux_gcc_compile_rel/bu...) v8_linux_rel_ng on tryserver.v8 (JOB_FAILED, http://build.chromium.org/p/tryserver.v8/builders/v8_linux_rel_ng/builds/4759) v8_win64_rel_ng on tryserver.v8 (JOB_FAILED, http://build.chromium.org/p/tryserver.v8/builders/v8_win64_rel_ng/builds/6312)
The CQ bit was checked by jshin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1812673005/700001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1812673005/700001
On 2016/04/21 20:39:17, jshin (jungshik at google) wrote: > Thanks a lot for the review. I addressed your comments. > > One remaining question is how to pass a conversion locale between JS and > runtime. Passing a string (out of 4 or 5 if 'und' is included) is more readable > / maintainable, but I have a concern about wrapping and unwrapping a string. Do > > Is there a way to avoid having this parallel array in JS and C++? For example, > > what if we just pass the string into the runtime, and let that interpret it. > > I hate parallel arrays, too !! OTOH, I want to avoid passing a string from JS to > C++ unless there's an efficient way to wrap and unwrap it (almost as efficient > as passing an integer). Ok. I measured the perf of toLocaleLowerCase("tr") with PS 34 (parallel arrays in JS and C++) and PS 35 (no. passing a locale as string). The difference is not statistically significant at all (t-test with two independent samples with size 100). x=new Date(); n=2500000; for (var i=0; i < n; i++) { "Ii\u0130\u0131".toLocaleLowerCase("tr"); } print (new Date() - x); So, I'll go with a better one (PS 35/36). Can you take another look? Thanks
Another question: If this has to be put behind a flag (icu_case_mapping), how can I hide 'overriding' of to{L,U}Case and toLocale{L,U}Case behind the flag? It's currently done in i18n.js unconditionally. The flag (icu_case_mapping) has no effect. Thanks,
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
To make loading some code conditional on a runtime flag, follow the pattern of --promise-extras. See src/bootstrapper.cc for a simple example of how flags can be used to load additional JS files. These files are not compiled into the snapshot, so for some performance sensitive code (not sure whether or not this counts) the functions can be created in JS code from the main snapshot and just installed on the builtins conditionally at runtime. Be sure to update both build files to make the new JS file available for use. https://codereview.chromium.org/1812673005/diff/700001/test/test262/test262.s... File test/test262/test262.status (left): https://codereview.chromium.org/1812673005/diff/700001/test/test262/test262.s... test/test262/test262.status:159: 'intl402/String/prototype/toLocaleUpperCase/special_casing_Turkish': [FAIL], Good to see that all of these tests are actually fixed by your patch! But let's revert this change when the testcfg.py change is reverted. We can do this when --icu-case-mapping is staged. https://codereview.chromium.org/1812673005/diff/700001/test/test262/testcfg.py File test/test262/testcfg.py (right): https://codereview.chromium.org/1812673005/diff/700001/test/test262/testcfg.p... test/test262/testcfg.py:134: ["--icu_case_mapping"] + We usually don't add flags this way. Instead, we wait for the feature to be staged, at which point it'll be turned on that way. The staging patch is the one which should update test262.status, rather than this one.
Thanks, Dan, for the comment and pointers. Micro-benchmark results are up at https://docs.google.com/spreadsheets/d/1KJCJxKc1FxFXjwmYqABS0_2cNdPetvnd8gY8_... ASCII-only lower/uppercasing is about the same speed as before except that lowercasing 'lowercase ascii' is 60% because it just returns the input. Latin-1 only lower/uppercasing (including sharp-s) is 1.6 ~ 2.6 times as fast as before. Full-unicode lower/uppercasing (non-latin 1): This CL is better in terms of correctness at the expense of reduced speed (50~60% of Unibrow). What's curious is that my CL (using FlatContent) is a bit slower (80~95% of Dan's) for a short string. As input strings longer, the difference gets smaller between 1. UniBrow : Mine 2, Dan's : Mine. Apparently, this CL has a higher 'fixed' cost Using FlatContent will also help when an input string is nested cons. On 2016/04/25 21:10:11, Dan Ehrenberg wrote: > To make loading some code conditional on a runtime flag, follow the pattern of > --promise-extras. See src/bootstrapper.cc for a simple example of how flags can > be used to load additional JS files. These files are not compiled into the > snapshot, so for some performance sensitive code (not sure whether or not this > counts) the functions can be created in JS code from the main snapshot and just > installed on the builtins conditionally at runtime. Be sure to update both build > files to make the new JS file available for use. Thank you for the pointer. It's unfortunate that I need a separate JS file and a bit involved steps. I'm afraid that having a separate JS file (from i18n.js) will lead to some code duplication (for some utilities I'm using from i18n.js). I wonder if we want to treat this as a 'feature'. Isn't it more like a 'bug fix' (as you mentioned before :-))? If this is a bug fix, do we really want to take all those steps? > https://codereview.chromium.org/1812673005/diff/700001/test/test262/test262.s... > File test/test262/test262.status (left): > > https://codereview.chromium.org/1812673005/diff/700001/test/test262/test262.s... > test/test262/test262.status:159: > 'intl402/String/prototype/toLocaleUpperCase/special_casing_Turkish': [FAIL], > Good to see that all of these tests are actually fixed by your patch! But let's > revert this change when the testcfg.py change is reverted. We can do this when > --icu-case-mapping is staged. > > https://codereview.chromium.org/1812673005/diff/700001/test/test262/testcfg.py > File test/test262/testcfg.py (right): > > https://codereview.chromium.org/1812673005/diff/700001/test/test262/testcfg.p... > test/test262/testcfg.py:134: ["--icu_case_mapping"] + > We usually don't add flags this way. Instead, we wait for the feature to be > staged, at which point it'll be turned on that way. The staging patch is the one > which should update test262.status, rather than this one. Thanks. I left it alone for now because I had a reservation about putting this behind the flag. It'd be an easy call if perf was better across the board (instead of 'full unicode' being slower while more spec-compliant). Anyway, what would you saying about going forward without a flag? Thanks
On 2016/04/26 at 23:07:34, jshin wrote: > Thanks, Dan, for the comment and pointers. > > > Micro-benchmark results are up at > https://docs.google.com/spreadsheets/d/1KJCJxKc1FxFXjwmYqABS0_2cNdPetvnd8gY8_... > > ASCII-only lower/uppercasing is about the same speed as before except that lowercasing 'lowercase ascii' is > 60% because it just returns the input. > > Latin-1 only lower/uppercasing (including sharp-s) is 1.6 ~ 2.6 times as fast as before. > > Full-unicode lower/uppercasing (non-latin 1): This CL is better in terms of correctness at the expense of > reduced speed (50~60% of Unibrow). What's curious is that my CL (using FlatContent) is a bit slower (80~95% of Dan's) > for a short string. As input strings longer, the difference gets smaller between 1. UniBrow : Mine 2, Dan's : Mine. > Apparently, this CL has a higher 'fixed' cost > Using FlatContent will also help when an input string is nested cons. If this is slower for that case, it could be a good idea to look into making use of Unibrow's caching layer. See the Mapping class in src/unicode.h . > > On 2016/04/25 21:10:11, Dan Ehrenberg wrote: > > To make loading some code conditional on a runtime flag, follow the pattern of > > --promise-extras. See src/bootstrapper.cc for a simple example of how flags can > > be used to load additional JS files. These files are not compiled into the > > snapshot, so for some performance sensitive code (not sure whether or not this > > counts) the functions can be created in JS code from the main snapshot and just > > installed on the builtins conditionally at runtime. Be sure to update both build > > files to make the new JS file available for use. > > > Thank you for the pointer. It's unfortunate that I need a separate JS file and a bit involved steps. I'm afraid that > having a separate JS file (from i18n.js) will lead to some code duplication (for some utilities I'm using from i18n.js). > > I wonder if we want to treat this as a 'feature'. Isn't it more like a 'bug fix' (as you > mentioned before :-))? If this is a bug fix, do we really want to take all those steps? The question in my mind is, is this a bigger change with significant risk, and as such, do we want to flag it so that we can back it out if needed? I think the answer is yes here--this changes user-visible behavior in more than one way, and also changes performance properties. I don't think it's that much work, and it's something that you'd have to learn anyway if you ever implement new features in V8 (for example, ECMA 402 v3 includes a new Intl.getCanonicalLocales() method). Here's an example patch which uses this technique: https://codereview.chromium.org/1469543003 . > > > > https://codereview.chromium.org/1812673005/diff/700001/test/test262/test262.s... > > File test/test262/test262.status (left): > > > > https://codereview.chromium.org/1812673005/diff/700001/test/test262/test262.s... > > test/test262/test262.status:159: > > 'intl402/String/prototype/toLocaleUpperCase/special_casing_Turkish': [FAIL], > > Good to see that all of these tests are actually fixed by your patch! But let's > > revert this change when the testcfg.py change is reverted. We can do this when > > --icu-case-mapping is staged. > > > > https://codereview.chromium.org/1812673005/diff/700001/test/test262/testcfg.py > > File test/test262/testcfg.py (right): > > > > https://codereview.chromium.org/1812673005/diff/700001/test/test262/testcfg.p... > > test/test262/testcfg.py:134: ["--icu_case_mapping"] + > > We usually don't add flags this way. Instead, we wait for the feature to be > > staged, at which point it'll be turned on that way. The staging patch is the one > > which should update test262.status, rather than this one. > > Thanks. I left it alone for now because I had a reservation about putting this behind the flag. > > It'd be an easy call if perf was better across the board (instead of 'full unicode' being slower while more spec-compliant). > > Anyway, what would you saying about going forward without a flag? Thanks I'd personally prefer a flag, but maybe others reading this thread have another opinion.
yangguo@chromium.org changed reviewers: + yangguo@chromium.org
Very nice patch. I have some comments though. https://codereview.chromium.org/1812673005/diff/720001/src/js/i18n.js File src/js/i18n.js (right): https://codereview.chromium.org/1812673005/diff/720001/src/js/i18n.js#newcode... src/js/i18n.js:2015: if (pos != -1) Can we use V8's coding style where line breaks in an if-statement requires using {}-brackets? https://codereview.chromium.org/1812673005/diff/720001/src/js/i18n.js#newcode... src/js/i18n.js:2021: return isToUpper ? %StringToUpperCaseI18N(s) : %StringToLowerCaseI18N(s); ditto. https://codereview.chromium.org/1812673005/diff/720001/src/runtime/runtime-i1... File src/runtime/runtime-i18n.cc (right): https://codereview.chromium.org/1812673005/diff/720001/src/runtime/runtime-i1... src/runtime/runtime-i18n.cc:787: String::FlatContent flat = s->GetFlatContent(); I don't see the string being flattened anywhere leading up to this. So the string could be a cons string that has not been flattened yet. In which case the DCHECK below would fail. Can you add a test case for this? Let's just always - Flatten the string - Extract the flat content - Copy the flat content into a uc16 buffer (managed by smart pointer) via CopyChars, regardless of whether it's one-byte or two-byte. This way, we can avoid having to call fastCopyFrom, which copies the buffer yet again. - Construct the UnicodeString from that buffer. Let's not use ToWideCString. There is no other uses of it, and I think we should remove it. It's not particularly performant, using StringCharacterStream. https://codereview.chromium.org/1812673005/diff/720001/src/runtime/runtime-i1... src/runtime/runtime-i18n.cc:800: isolate->factory()->NewStringFromTwoByte(Vector<const uint16_t>( This means that any case conversion with "problematic" locales is going to produce a two-byte string even if the old string was not. Do we expect this memory bloat to cause issues? https://codereview.chromium.org/1812673005/diff/720001/src/runtime/runtime-i1... src/runtime/runtime-i18n.cc:818: String::FlatContent flat = s->GetFlatContent(); We need to make sure s is flattened at this point before trying to get the flat content. https://codereview.chromium.org/1812673005/diff/720001/src/runtime/runtime-i1... src/runtime/runtime-i18n.cc:820: if (flat.IsTwoByte()) add brackets here. https://codereview.chromium.org/1812673005/diff/720001/src/runtime/runtime-i1... src/runtime/runtime-i18n.cc:833: // dest_length == result->length() Can we make this a DCHECK? https://codereview.chromium.org/1812673005/diff/720001/src/runtime/runtime-i1... src/runtime/runtime-i18n.cc:836: // dest_length < result->length() Also make this a DCHECK. https://codereview.chromium.org/1812673005/diff/720001/src/runtime/runtime-i1... src/runtime/runtime-i18n.cc:1023: if (V8_UNLIKELY(!is_result_single_byte)) add brackets please. https://codereview.chromium.org/1812673005/diff/720001/src/runtime/runtime-i1... src/runtime/runtime-i18n.cc:1035: if (flat.IsOneByte()) brackets
On 2016/04/26 23:39:04, Dan Ehrenberg wrote: > On 2016/04/26 at 23:07:34, jshin wrote: > > Thanks, Dan, for the comment and pointers. > > > > > > Micro-benchmark results are up at > > > https://docs.google.com/spreadsheets/d/1KJCJxKc1FxFXjwmYqABS0_2cNdPetvnd8gY8_... > > > > ASCII-only lower/uppercasing is about the same speed as before except that > lowercasing 'lowercase ascii' is > > 60% because it just returns the input. > > > > Latin-1 only lower/uppercasing (including sharp-s) is 1.6 ~ 2.6 times as fast > as before. > > > > Full-unicode lower/uppercasing (non-latin 1): This CL is better in terms of > correctness at the expense of > > reduced speed (50~60% of Unibrow). What's curious is that my CL (using > FlatContent) is a bit slower (80~95% of Dan's) > > for a short string. As input strings longer, the difference gets smaller > between 1. UniBrow : Mine 2, Dan's : Mine. > > Apparently, this CL has a higher 'fixed' cost > > Using FlatContent will also help when an input string is nested cons. I re-ran the benchmark with PS #39 and updated the spreadsheet. There's some improvement. Not using strcmp helped quite a lot. (I noticed this by running v8 profiler). Also using the hardcoded result for Latin-1 lowercasing helps, too. And, 'greek sigma-sigma' lowercasing (the second one is turned to final sigma) result was completely off. In this case (where capital sigma=> lowercase final sigma is involved), ICU is faster than Unibrow. > If this is slower for that case, it could be a good idea to look into making use > of Unibrow's caching layer. See the Mapping class in src/unicode.h . I'm not sure I can. The input string is passed to ICU and what happens inside is up to ICU. By fixed cost, I mean that the cost seems to be |a * n + b| where 'b' is a fixed cost and 'n' is # of input characters. > > > > On 2016/04/25 21:10:11, Dan Ehrenberg wrote: > > > To make loading some code conditional on a runtime flag, follow the pattern > of > > > --promise-extras. See src/bootstrapper.cc for a simple example of how flags > can > > > be used to load additional JS files. These files are not compiled into the > > > snapshot, so for some performance sensitive code (not sure whether or not > this > > > counts) the functions can be created in JS code from the main snapshot and > just > > > installed on the builtins conditionally at runtime. Be sure to update both > build > > > files to make the new JS file available for use. > > > > > > Thank you for the pointer. It's unfortunate that I need a separate JS file and > a bit involved steps. I'm afraid that > > having a separate JS file (from i18n.js) will lead to some code duplication > (for some utilities I'm using from i18n.js). > > > > I wonder if we want to treat this as a 'feature'. Isn't it more like a 'bug > fix' (as you > > mentioned before :-))? If this is a bug fix, do we really want to take all > those steps? > > The question in my mind is, is this a bigger change with significant risk, and > as such, do we want to flag it so that we can back it out if needed? I think the > answer is yes here--this changes user-visible behavior in more than one way, and > also changes performance properties. > > I don't think it's that much work, and it's something that you'd have to learn > anyway if you ever implement new features in V8 (for example, ECMA 402 v3 > includes a new Intl.getCanonicalLocales() method). Here's an example patch which > uses this technique: https://codereview.chromium.org/1469543003 . Thank you for the pointer. I'll take a look. I have an idea. Let's put to{Lower,Upper}Case implemented with ICU behind the flag but make toLocale{Lower,Upper}Case available regardless of the flag. In that case, I don't have to worry about the code duplication between i18n.js and a new JS file. After that, I can port FastASCII path from Unibrow (word-by-word upper/lowercasing for ASCII). What do you think of this split approach?
Addressed and replied to Yang's comment. Thanks ! https://codereview.chromium.org/1812673005/diff/720001/src/js/i18n.js File src/js/i18n.js (right): https://codereview.chromium.org/1812673005/diff/720001/src/js/i18n.js#newcode... src/js/i18n.js:2015: if (pos != -1) On 2016/04/27 08:06:45, Yang wrote: > Can we use V8's coding style where line breaks in an if-statement requires using > {}-brackets? Done. https://codereview.chromium.org/1812673005/diff/720001/src/js/i18n.js#newcode... src/js/i18n.js:2021: return isToUpper ? %StringToUpperCaseI18N(s) : %StringToLowerCaseI18N(s); On 2016/04/27 08:06:45, Yang wrote: > ditto. Done. https://codereview.chromium.org/1812673005/diff/720001/src/runtime/runtime-i1... File src/runtime/runtime-i18n.cc (right): https://codereview.chromium.org/1812673005/diff/720001/src/runtime/runtime-i1... src/runtime/runtime-i18n.cc:787: String::FlatContent flat = s->GetFlatContent(); On 2016/04/27 08:06:45, Yang wrote: > I don't see the string being flattened anywhere leading up to this. So the > string could be a cons string that has not been flattened yet. In which case the > DCHECK below would fail. Thank you for pointing this out. Actually, two of three callers of this function flatten |s| before calling it, but I forgot to flatten it in the 3rd caller. The reason for flattening before calling this function is that two callers need to use flattened string for faster path before falling back to this full-unicode path. I fixed the 3rd caller (Runtime_StringLocaleConvertCase). > > Can you add a test case for this? I'll. (I tested some cons strings locally but I guess I didn't try it with toLocale{L,U}Case that goes through the 3rd caller where I forgot to flatten the input. > Let's just always > - Flatten the string > - Extract the flat content > - Copy the flat content into a uc16 buffer (managed by smart pointer) via > CopyChars, regardless of whether it's one-byte or two-byte. This way, we can > avoid having to call fastCopyFrom, which copies the buffer yet again. Thanks for the suggestion. BTW, fastCopyFrom is cheap if icu::UnicodeString aliases 'buffer'. The first argument in UnicodeString ctor is for that: See http://icu-project.org/apiref/icu4c/classicu_1_1UnicodeString.html#a2bd1d1822... ). Nonetheless, I found an even better way (setTo() method). Either way, transliterator-based conversion is so slow that this change will make little difference. > - Construct the UnicodeString from that buffer. I did a slight variation of your suggestion. > > Let's not use ToWideCString. There is no other uses of it, and I think we should > remove it. It's not particularly performant, using StringCharacterStream. Thanks again. Now it's gone ! https://codereview.chromium.org/1812673005/diff/720001/src/runtime/runtime-i1... src/runtime/runtime-i18n.cc:800: isolate->factory()->NewStringFromTwoByte(Vector<const uint16_t>( On 2016/04/27 08:06:45, Yang wrote: > This means that any case conversion with "problematic" locales is going to > produce a two-byte string even if the old string was not. Do we expect this > memory bloat to cause issues? That's a good point. If toLocale{L,U}Case(<no argument>) is used when the default locale is one of the four (or they're called with one of the 4 locales), there'll be unexpected memory use increase even for ASCII input. I added TODO comments here and Runtime_StringLocaleConvertCase) for now. https://codereview.chromium.org/1812673005/diff/720001/src/runtime/runtime-i1... src/runtime/runtime-i18n.cc:818: String::FlatContent flat = s->GetFlatContent(); On 2016/04/27 08:06:45, Yang wrote: > We need to make sure s is flattened at this point before trying to get the flat > content. Yup. See the comment above to the same question. https://codereview.chromium.org/1812673005/diff/720001/src/runtime/runtime-i1... src/runtime/runtime-i18n.cc:820: if (flat.IsTwoByte()) On 2016/04/27 08:06:45, Yang wrote: > add brackets here. Done. https://codereview.chromium.org/1812673005/diff/720001/src/runtime/runtime-i1... src/runtime/runtime-i18n.cc:833: // dest_length == result->length() On 2016/04/27 08:06:45, Yang wrote: > Can we make this a DCHECK? Done. https://codereview.chromium.org/1812673005/diff/720001/src/runtime/runtime-i1... src/runtime/runtime-i18n.cc:836: // dest_length < result->length() On 2016/04/27 08:06:45, Yang wrote: > Also make this a DCHECK. Done. https://codereview.chromium.org/1812673005/diff/720001/src/runtime/runtime-i1... src/runtime/runtime-i18n.cc:1023: if (V8_UNLIKELY(!is_result_single_byte)) On 2016/04/27 08:06:45, Yang wrote: > add brackets please. Done. https://codereview.chromium.org/1812673005/diff/720001/src/runtime/runtime-i1... src/runtime/runtime-i18n.cc:1035: if (flat.IsOneByte()) On 2016/04/27 08:06:45, Yang wrote: > brackets Done.
https://codereview.chromium.org/1812673005/diff/720001/src/runtime/runtime-i1... File src/runtime/runtime-i18n.cc (right): https://codereview.chromium.org/1812673005/diff/720001/src/runtime/runtime-i1... src/runtime/runtime-i18n.cc:787: String::FlatContent flat = s->GetFlatContent(); On 2016/04/28 10:50:09, jshin (jungshik at google) wrote: > On 2016/04/27 08:06:45, Yang wrote: > > I don't see the string being flattened anywhere leading up to this. So the > > string could be a cons string that has not been flattened yet. In which case > the > > DCHECK below would fail. > > Thank you for pointing this out. Actually, two of three callers of this function > flatten |s| before calling it, but > I forgot to flatten it in the 3rd caller. The reason for flattening before > calling this function is that two callers need to use flattened string for > faster path before falling back to this full-unicode path. > > I fixed the 3rd caller (Runtime_StringLocaleConvertCase). > > > > > > Can you add a test case for this? > > I'll. (I tested some cons strings locally but I guess I didn't try it with > toLocale{L,U}Case that goes through > the 3rd caller where I forgot to flatten the input. > > > > Let's just always > > - Flatten the string > > - Extract the flat content > > - Copy the flat content into a uc16 buffer (managed by smart pointer) via > > CopyChars, regardless of whether it's one-byte or two-byte. This way, we can > > avoid having to call fastCopyFrom, which copies the buffer yet again. > > Thanks for the suggestion. > > BTW, fastCopyFrom is cheap if icu::UnicodeString aliases 'buffer'. The first > argument in UnicodeString ctor is for that: See > http://icu-project.org/apiref/icu4c/classicu_1_1UnicodeString.html#a2bd1d1822... > ). Nonetheless, > I found an even better way (setTo() method). Either way, transliterator-based > conversion is so slow that this > change will make little difference. > > > > - Construct the UnicodeString from that buffer. > > I did a slight variation of your suggestion. > > > > > Let's not use ToWideCString. There is no other uses of it, and I think we > should > > remove it. It's not particularly performant, using StringCharacterStream. > > Thanks again. Now it's gone ! I did check the description of fastCopyFrom. In either case, the unicode string must not alias the original string buffer (which is what would make it fast), since that would change the original string (src points to the orignal backing of the two-byte string). So we just might as well make a copy explicitly and not rely on ICU to do it correctly implicitly.
On 2016/04/28 12:52:26, Yang wrote: > https://codereview.chromium.org/1812673005/diff/720001/src/runtime/runtime-i1... > File src/runtime/runtime-i18n.cc (right): > > https://codereview.chromium.org/1812673005/diff/720001/src/runtime/runtime-i1... > src/runtime/runtime-i18n.cc:787: String::FlatContent flat = s->GetFlatContent(); > On 2016/04/28 10:50:09, jshin (jungshik at google) wrote: > > On 2016/04/27 08:06:45, Yang wrote: > > > I don't see the string being flattened anywhere leading up to this. So the > > > string could be a cons string that has not been flattened yet. In which case > > the > > > DCHECK below would fail. > > > > Thank you for pointing this out. Actually, two of three callers of this > function > > flatten |s| before calling it, but > > I forgot to flatten it in the 3rd caller. The reason for flattening before > > calling this function is that two callers need to use flattened string for > > faster path before falling back to this full-unicode path. > > > > I fixed the 3rd caller (Runtime_StringLocaleConvertCase). > > > > > > > > > > Can you add a test case for this? > > > > I'll. (I tested some cons strings locally but I guess I didn't try it with > > toLocale{L,U}Case that goes through > > the 3rd caller where I forgot to flatten the input. > > > > > > > Let's just always > > > - Flatten the string > > > - Extract the flat content > > > - Copy the flat content into a uc16 buffer (managed by smart pointer) via > > > CopyChars, regardless of whether it's one-byte or two-byte. This way, we can > > > avoid having to call fastCopyFrom, which copies the buffer yet again. > > > > Thanks for the suggestion. > > > > BTW, fastCopyFrom is cheap if icu::UnicodeString aliases 'buffer'. The first > > argument in UnicodeString ctor is for that: See > > > http://icu-project.org/apiref/icu4c/classicu_1_1UnicodeString.html#a2bd1d1822... > > ). Nonetheless, > > I found an even better way (setTo() method). Either way, transliterator-based > > conversion is so slow that this > > change will make little difference. > > > > > > > - Construct the UnicodeString from that buffer. > > > > I did a slight variation of your suggestion. > > > > > > > > Let's not use ToWideCString. There is no other uses of it, and I think we > > should > > > remove it. It's not particularly performant, using StringCharacterStream. > > > > Thanks again. Now it's gone ! > > I did check the description of fastCopyFrom. In either case, the unicode string > must not alias the original string buffer (which is what would make it fast), > since that would change the original string (src points to the orignal backing > of the two-byte string). So we just might as well make a copy explicitly and not > rely on ICU to do it correctly implicitly. Thanks. I'll go with it. Doing it explicitly in v8 would make it easier to read the code (no more wondering as to what's gonna happen to the aliased buffer) while not making any practical difference (copying is gonna happen anyway when the transliteration is about to be done over the input. It's just a matter of who does it - v8 or ICU).
jshin@chromium.org changed reviewers: + jochen@chromium.org
+jochen for v8.gyp changes. Jochen, could you enlighten me as to why icu-case-mapping.js (that is added to experimental_library_files ) is not listed when js2c is run ? i18n.js (added to library_files the same way) is listed when js2c is run. Dan, Yang and Adam: can you take another look? Dan: are you fine with just putting to{Lower,Upper}Case behind --icu_case_mapping while making toLocale{L,U}Case available without the flag? https://codereview.chromium.org/1812673005/diff/860001/test/intl/general/case... File test/intl/general/case-mapping.js (right): https://codereview.chromium.org/1812673005/diff/860001/test/intl/general/case... test/intl/general/case-mapping.js:5: // Flags: --icu_case_mapping This does not work. --icu_case_mapping is not used by tools/run_tests.py. I have to pass --icu_case_mapping via '--extra-flags' option. ./tools/run-tests.py --buildbot --extra-flags=--icu_case_mapping --arch-and-mode=x64.Release test 262/intl402/String/* https://codereview.chromium.org/1812673005/diff/860001/test/test262/test262.s... File test/test262/test262.status (left): https://codereview.chromium.org/1812673005/diff/860001/test/test262/test262.s... test/test262/test262.status:159: 'intl402/String/prototype/toLocaleUpperCase/special_casing_Turkish': [FAIL], toLocale{U,L}Case should still pass without --icu_case_mapping because I didn't pull them out of i18n.js (that is, they're not hidden behind the flag). OTOH, the ICU-based implementation of to{Upper,Lower}Case is hidden behind the flag. Is there a way to mark them as passing when --icu_case_mapping is used? https://codereview.chromium.org/1812673005/diff/860001/tools/gyp/v8.gyp File tools/gyp/v8.gyp (right): https://codereview.chromium.org/1812673005/diff/860001/tools/gyp/v8.gyp#newco... tools/gyp/v8.gyp:2026: 'experimental_library_files': ['../../src/js/icu-case-mapping.js'], For some reason, icu-case-mapping.js is never added to the list of js files to be compiled by 'js2c' for experimental-libraries.cc and libraries-experimental.bin. When I manually built those two files ( cc and bin) using the following commands and d8 after that, d8 worked as expected. Without --icu_case_mapping, I got the current behavior and with --icu_case_mapping, I got a new correct behaviro for to{Lower,Upper}Case. python ../tools/js2c.py ../out/Release/gen/experimental-libraries.cc EXPERIMENTAL js/macros.py messages.h js/harmony-atomics.js js/harmony-regexp-exec.js js/harmony-sharedarraybuffer.js js/harmony-simd.js js/harmony-species.js js/harmony-unicode-regexps.js js/harmony-string-padding.js js/promise-extra.js js/icu-case-mapping.js python ../tools/js2c.py ../out/Release/gen/experimental-libraries.cc EXPERIMENTAL js/macros.py messages.h js/harmony-atomics.js js/harmony-regexp-exec.js js/harmony-sharedarraybuffer.js js/harmony-simd.js js/harmony-species.js js/harmony-unicode-regexps.js js/harmony-string-padding.js js/promise-extra.js js/icu-case-mapping.js --startup_blob ../out/Release/gen/libraries-experimental.bin --nojs I also wanted to try GN build, but it turned out that GN does not yet support a standalone v8 build.
https://codereview.chromium.org/1812673005/diff/860001/src/flag-definitions.h File src/flag-definitions.h (right): https://codereview.chromium.org/1812673005/diff/860001/src/flag-definitions.h... src/flag-definitions.h:184: DEFINE_NEG_IMPLICATION(es_staging, icu_case_mapping) I noticed that https://codereview.chromium.org/1513873002/ has a comment about harmony vs es_staging by Adam. BTW, I have to use DEFINE_IMPLICATION(es_staging|harmony, ...) instead . I'll fix it.
https://codereview.chromium.org/1812673005/diff/860001/tools/gyp/v8.gyp File tools/gyp/v8.gyp (right): https://codereview.chromium.org/1812673005/diff/860001/tools/gyp/v8.gyp#newco... tools/gyp/v8.gyp:1971: }, { this file doesn't exist anymore, it's now in src/ https://codereview.chromium.org/1812673005/diff/860001/tools/gyp/v8.gyp#newco... tools/gyp/v8.gyp:2026: 'experimental_library_files': ['../../src/js/icu-case-mapping.js'], On 2016/04/29 at 06:23:17, jshin (jungshik at google) wrote: > For some reason, icu-case-mapping.js is never added to the list of js files to be compiled by 'js2c' for > experimental-libraries.cc and libraries-experimental.bin. > > When I manually built those two files ( cc and bin) using the following commands and d8 after that, d8 worked as expected. Without --icu_case_mapping, I got the current > behavior and with --icu_case_mapping, I got a new correct behaviro for to{Lower,Upper}Case. > > python ../tools/js2c.py ../out/Release/gen/experimental-libraries.cc EXPERIMENTAL js/macros.py messages.h js/harmony-atomics.js js/harmony-regexp-exec.js js/harmony-sharedarraybuffer.js js/harmony-simd.js js/harmony-species.js js/harmony-unicode-regexps.js js/harmony-string-padding.js js/promise-extra.js js/icu-case-mapping.js > > python ../tools/js2c.py ../out/Release/gen/experimental-libraries.cc EXPERIMENTAL js/macros.py messages.h js/harmony-atomics.js js/harmony-regexp-exec.js js/harmony-sharedarraybuffer.js js/harmony-simd.js js/harmony-species.js js/harmony-unicode-regexps.js js/harmony-string-padding.js js/promise-extra.js js/icu-case-mapping.js --startup_blob ../out/Release/gen/libraries-experimental.bin --nojs > > I also wanted to try GN build, but it turned out that GN does not yet support a standalone v8 build. gyp is a bit peculiar here. I'd recommend to keep i18n_library_files and add i18n_experimental_library_files
Thank you for the reply. On 2016/04/29 06:58:09, jochen wrote: > https://codereview.chromium.org/1812673005/diff/860001/tools/gyp/v8.gyp > File tools/gyp/v8.gyp (right): > > https://codereview.chromium.org/1812673005/diff/860001/tools/gyp/v8.gyp#newco... > tools/gyp/v8.gyp:1971: }, { > this file doesn't exist anymore, it's now in src/ It must be fairly recent. (I rebased earlier this week, IIRC). I'll rebase it again. > > https://codereview.chromium.org/1812673005/diff/860001/tools/gyp/v8.gyp#newco... > tools/gyp/v8.gyp:2026: 'experimental_library_files': > ['../../src/js/icu-case-mapping.js'], > On 2016/04/29 at 06:23:17, jshin (jungshik at google) wrote: > > For some reason, icu-case-mapping.js is never added to the list of js files to > be compiled by 'js2c' for > > experimental-libraries.cc and libraries-experimental.bin. > > > > When I manually built those two files ( cc and bin) using the following > commands and d8 after that, d8 worked as expected. Without --icu_case_mapping, I > got the current > > behavior and with --icu_case_mapping, I got a new correct behaviro for > to{Lower,Upper}Case. > > > > python ../tools/js2c.py ../out/Release/gen/experimental-libraries.cc > EXPERIMENTAL js/macros.py messages.h js/harmony-atomics.js > js/harmony-regexp-exec.js js/harmony-sharedarraybuffer.js js/harmony-simd.js > js/harmony-species.js js/harmony-unicode-regexps.js js/harmony-string-padding.js > js/promise-extra.js js/icu-case-mapping.js > > > > python ../tools/js2c.py ../out/Release/gen/experimental-libraries.cc > EXPERIMENTAL js/macros.py messages.h js/harmony-atomics.js > js/harmony-regexp-exec.js js/harmony-sharedarraybuffer.js js/harmony-simd.js > js/harmony-species.js js/harmony-unicode-regexps.js js/harmony-string-padding.js > js/promise-extra.js js/icu-case-mapping.js --startup_blob > ../out/Release/gen/libraries-experimental.bin --nojs > > > > I also wanted to try GN build, but it turned out that GN does not yet support > a standalone v8 build. > > gyp is a bit peculiar here. I'd recommend to keep i18n_library_files and add > i18n_experimental_library_files That's exactly what I actually did first (didn't upload that version here), but it didn't work either. After scratching head for a while, I realized that v8.gyp is taking a rather winding road with i18n_library_files when just appending to library_files can work (and much simpler). So, I tried it for both library and experimental_library hoping that it would work for both cases, but it only works with library_files. Anyway, I'll try it once more.
https://codereview.chromium.org/1812673005/diff/860001/src/bootstrapper.cc File src/bootstrapper.cc (right): https://codereview.chromium.org/1812673005/diff/860001/src/bootstrapper.cc#ne... src/bootstrapper.cc:203: DECLARE_FEATURE_INITIALIZATION(icu_case_mapping, "") Is this necessary? We declare it here, call it below, only to define it as empty function. I get it that we need to do it for flags in the harmony list macros, since some do have non-empty initialize functions. But I don't see the necessity here. https://codereview.chromium.org/1812673005/diff/860001/src/js/i18n.js File src/js/i18n.js (right): https://codereview.chromium.org/1812673005/diff/860001/src/js/i18n.js#newcode734 src/js/i18n.js:734: !IS_NULL(InternalRegExpMatch(/^[a-z]{2,3}$/, localeID))) brackets https://codereview.chromium.org/1812673005/diff/860001/src/js/i18n.js#newcode... src/js/i18n.js:2013: // StringSplit is slwoer than this. "slower" https://codereview.chromium.org/1812673005/diff/860001/src/runtime/runtime-i1... File src/runtime/runtime-i18n.cc (right): https://codereview.chromium.org/1812673005/diff/860001/src/runtime/runtime-i1... src/runtime/runtime-i18n.cc:794: base::SmartArrayPointer<uc16> sap(NewArray<uc16>(src_length)); This seems wrong. In case of two-byte string, we do not copy, and 'converted' aliases the original string. Can we have test cases that test that the original string did not change?
On 2016/04/29 07:23:30, jshin (jungshik at google) wrote: > On 2016/04/29 06:58:09, jochen wrote: > > https://codereview.chromium.org/1812673005/diff/860001/tools/gyp/v8.gyp > > On 2016/04/29 at 06:23:17, jshin (jungshik at google) wrote: > > > For some reason, icu-case-mapping.js is never added to the list of js files > to > > be compiled by 'js2c' for > > > experimental-libraries.cc and libraries-experimental.bin. > > > . > > > > gyp is a bit peculiar here. I'd recommend to keep i18n_library_files and add > > i18n_experimental_library_files > > That's exactly what I actually did first (didn't upload that version here), but > it didn't work either. After scratching head for a while, > I realized that v8.gyp is taking a rather winding road with i18n_library_files > when just appending to library_files can work (and much simpler). So, I tried it > for both > library and experimental_library hoping that it would work for both cases, but > it only works with library_files. > > Anyway, I'll try it once more. After rebasing, I tried both methods and both worked !!! I don't know what was wrong with my gyp changes before rebase. Given that both work correctly, are you ok with a simpler gyp or do you still want to be on the safer side by keeping the current way? PS 46 uses a simpler approach.
Yang, thanks for taking a look. I addressed your comments. https://codereview.chromium.org/1812673005/diff/860001/src/bootstrapper.cc File src/bootstrapper.cc (right): https://codereview.chromium.org/1812673005/diff/860001/src/bootstrapper.cc#ne... src/bootstrapper.cc:203: DECLARE_FEATURE_INITIALIZATION(icu_case_mapping, "") On 2016/04/29 07:38:47, Yang wrote: > Is this necessary? We declare it here, call it below, only to define it as empty > function. I get it that we need to do it for flags in the harmony list macros, > since some do have non-empty initialize functions. But I don't see the necessity > here. Indeed. I dropped them all (decl, call, definition) and it worked well as before. Even though it's not relevant to this CL, I wonder if the same thing has to be done for promise_extra. Dan, is there a reason to have an 'empty' function defined, declared and called? https://codereview.chromium.org/1812673005/diff/860001/src/js/i18n.js File src/js/i18n.js (right): https://codereview.chromium.org/1812673005/diff/860001/src/js/i18n.js#newcode734 src/js/i18n.js:734: !IS_NULL(InternalRegExpMatch(/^[a-z]{2,3}$/, localeID))) On 2016/04/29 07:38:47, Yang wrote: > brackets Done. https://codereview.chromium.org/1812673005/diff/860001/src/js/i18n.js#newcode... src/js/i18n.js:2013: // StringSplit is slwoer than this. On 2016/04/29 07:38:47, Yang wrote: > "slower" Done. https://codereview.chromium.org/1812673005/diff/860001/src/runtime/runtime-i1... File src/runtime/runtime-i18n.cc (right): https://codereview.chromium.org/1812673005/diff/860001/src/runtime/runtime-i1... src/runtime/runtime-i18n.cc:794: base::SmartArrayPointer<uc16> sap(NewArray<uc16>(src_length)); On 2016/04/29 07:38:47, Yang wrote: > This seems wrong. In case of two-byte string, we do not copy, and 'converted' > aliases the original string. Can we have test cases that test that the original > string did not change? UnicodeString's |setTo(src, src_length| does copy the buffer pointed to by |src| before doing anything. Previously, I used |setTo(false, src, src_length)| which aliases the buffer until it's time to write/modify. So, in case of a SingleByteString, the widened buffer is stored in |sap| and then copied to UnicodeString. In case of TwoByteString, its uc16 buffer is extracted (flatbuffer) and copied to UnicodeString upfront. Anyway, I added a couple of tests (toLocale{L,U}Case("el") that do not change the input. https://codereview.chromium.org/1812673005/diff/860001/tools/gyp/v8.gyp File tools/gyp/v8.gyp (right): https://codereview.chromium.org/1812673005/diff/860001/tools/gyp/v8.gyp#newco... tools/gyp/v8.gyp:1971: }, { On 2016/04/29 06:58:09, jochen wrote: > this file doesn't exist anymore, it's now in src/ rebased. Thanks !
The CQ bit was checked by jshin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1812673005/940001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1812673005/940001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: v8_win64_rel_ng on tryserver.v8 (JOB_FAILED, http://build.chromium.org/p/tryserver.v8/builders/v8_win64_rel_ng/builds/6634) v8_win_rel_ng on tryserver.v8 (JOB_FAILED, http://build.chromium.org/p/tryserver.v8/builders/v8_win_rel_ng/builds/6522)
The CQ bit was checked by jshin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1812673005/960001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1812673005/960001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: v8_win_compile_dbg on tryserver.v8 (JOB_FAILED, http://build.chromium.org/p/tryserver.v8/builders/v8_win_compile_dbg/builds/1...) v8_win_rel_ng on tryserver.v8 (JOB_FAILED, http://build.chromium.org/p/tryserver.v8/builders/v8_win_rel_ng/builds/6524)
https://codereview.chromium.org/1812673005/diff/1000001/src/runtime/runtime-i... File src/runtime/runtime-i18n.cc (right): https://codereview.chromium.org/1812673005/diff/1000001/src/runtime/runtime-i... src/runtime/runtime-i18n.cc:802: converted.setTo(false, src, src_length); Yang, https://docs.google.com/spreadsheets/d/19FCu-uk48FT7NCvMxRWgsnrCc06iON79Uni4a... compares various combinations of aliasing and copying when obtaining |src| (GetUCharBufferFromFlat) and setting up the input UnicodeString with setTo(). Using read-alias for two-byte string when obtaining |src| and read-alias for setTo minimizes # of copy operations. One outstanding question is whether the buffer pointed to by |flat.ToUC16Vector()| will continue to be reliably available after |flat| goes out of scope even though a flattened string (from which |flat| was obtained) remains in the scope (and not gc'd away). Adam thinks that it can still cause a problem (I asked him that question a week or so ago) and I want to play safe. Therefore, I changed the CL to avoid that. I have since discovered that the implementation and usage pattern of GetCharVector() ( https://goo.gl/EsgZXq ) kinda indicates otherwise. The way the return value of GetCharVector() is used is almost identical to this one except that in my case, I need to access the buffer even after DisallowHeapAllocation is NOT in force any more. Anyway, if the way I use the buffer obtained via GetFlatContent() is safe [1], the current way (using read-alias for both steps) is the best (although the difference would be rather small relatively because Transltierator takes so long). If not, either '1. alias 2. upfront copy' or '1. copy 2. alias' should be used. '1. copy 2. alias' is what you suggested. I prefer to use '1. alias 2. upfront copy' (PS #47 ~ #49) because I'm using GetUCharBufferFromFlat() in non-Greek code-path below (which is much more common) where '1. copy' is wasteful. Well, I shouldn't tinker with this too much because 1) ICU 58 in 5 months will support el-Upper with a regular case conversion API (no more need for special-casing) 2) the cost of transliteration is so expensive that saving one copy does not help much. [1] I have yet to find a test that will break for TwoByteString path. When |sap| is inside DisallowHeapAllocation block, accessing |converted| with an unmodified buffer, I definitely get a crash for a OneByte input in a debug build.
The CQ bit was checked by jshin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1812673005/1000001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1812673005/1000001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
The CQ bit was checked by jshin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1812673005/1020001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1812673005/1020001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: v8_linux64_avx2_rel_ng on tryserver.v8 (JOB_FAILED, http://build.chromium.org/p/tryserver.v8/builders/v8_linux64_avx2_rel_ng/buil...) v8_linux64_avx2_rel_ng_triggered on tryserver.v8 (JOB_FAILED, http://build.chromium.org/p/tryserver.v8/builders/v8_linux64_avx2_rel_ng_trig...)
https://codereview.chromium.org/1812673005/diff/860001/src/bootstrapper.cc File src/bootstrapper.cc (right): https://codereview.chromium.org/1812673005/diff/860001/src/bootstrapper.cc#ne... src/bootstrapper.cc:203: DECLARE_FEATURE_INITIALIZATION(icu_case_mapping, "") On 2016/04/29 at 18:03:23, jshin (jungshik at google) wrote: > On 2016/04/29 07:38:47, Yang wrote: > > Is this necessary? We declare it here, call it below, only to define it as empty > > function. I get it that we need to do it for flags in the harmony list macros, > > since some do have non-empty initialize functions. But I don't see the necessity > > here. > > Indeed. I dropped them all (decl, call, definition) and it worked well as before. > > Even though it's not relevant to this CL, I wonder if the same thing has to be done for promise_extra. Dan, is there a reason to have an 'empty' function defined, declared and called? promise_extra was special in this way. This should just be a HARMONY_INPROGRESS feature. https://codereview.chromium.org/1812673005/diff/1020001/test/webkit/webkit.st... File test/webkit/webkit.status (right): https://codereview.chromium.org/1812673005/diff/1020001/test/webkit/webkit.st... test/webkit/webkit.status:42: 'fast/js/string-capitalization': [FAIL], I'd rather you not check this in. Instead, change the test expectations when you flip the flag on.
As we've discussed off-line, the main TODO item is to move all behavior changes from this patch under a flag. I don't see any other big issues with the code. https://codereview.chromium.org/1812673005/diff/1020001/src/runtime/runtime-i... File src/runtime/runtime-i18n.cc (right): https://codereview.chromium.org/1812673005/diff/1020001/src/runtime/runtime-i... src/runtime/runtime-i18n.cc:793: if (V8_UNLIKELY(is_to_upper && lang[0] == 'e' && lang[1] == 'l')) { Nit: use strncmp?
Thank you, Dan for the review and help. Now both to{L,U}Case and toLocale{L,U}Case are behind --icu-case-mapping. (it was pretty simple to do as you told me :-) ). All the test status were updated accordingly. https://codereview.chromium.org/1812673005/diff/1020001/src/runtime/runtime-i... File src/runtime/runtime-i18n.cc (right): https://codereview.chromium.org/1812673005/diff/1020001/src/runtime/runtime-i... src/runtime/runtime-i18n.cc:793: if (V8_UNLIKELY(is_to_upper && lang[0] == 'e' && lang[1] == 'l')) { On 2016/05/03 21:24:09, Dan Ehrenberg wrote: > Nit: use strncmp? I used to have strncmp, but changed it to the above because glibc showed up rather high in the profiling result (of toLocale{Lower,Upper}Case). Not using strncmp cuts down 5% of run-time (with a short string). https://codereview.chromium.org/1812673005/diff/1020001/test/webkit/webkit.st... File test/webkit/webkit.status (right): https://codereview.chromium.org/1812673005/diff/1020001/test/webkit/webkit.st... test/webkit/webkit.status:42: 'fast/js/string-capitalization': [FAIL], On 2016/05/03 17:58:17, Dan Ehrenberg wrote: > I'd rather you not check this in. Instead, change the test expectations when you > flip the flag on. Done.
https://codereview.chromium.org/1812673005/diff/1040001/src/js/icu-case-mappi... File src/js/icu-case-mapping.js (right): https://codereview.chromium.org/1812673005/diff/1040001/src/js/icu-case-mappi... src/js/icu-case-mapping.js:13: var LocaleConvertCase = utils.ImportNow("LocaleConvertCase"); Rather than using ImportNow (generally dispreferred), since LocaleConvertCase is called from a function declaration, you can import it using utils.Import.
https://codereview.chromium.org/1812673005/diff/1040001/src/js/icu-case-mappi... File src/js/icu-case-mapping.js (right): https://codereview.chromium.org/1812673005/diff/1040001/src/js/icu-case-mappi... src/js/icu-case-mapping.js:13: var LocaleConvertCase = utils.ImportNow("LocaleConvertCase"); On 2016/05/04 00:13:08, Dan Ehrenberg wrote: > Rather than using ImportNow (generally dispreferred), since LocaleConvertCase is > called from a function declaration, you can import it using utils.Import. Done. https://codereview.chromium.org/1812673005/diff/1040001/src/js/icu-case-mappi... src/js/icu-case-mapping.js:19: OverrideFunction(GlobalString.prototype, 'toLowerCase', function() { offline, you're worried about actually defining functions in this file instead of just installing them. Would this be better, then? OverrideFunction(GlobalString.prototype, 'toLowerCase', toLowerCaseICU); where |toLowerCaseICU| is defined in i18n.js. I found an example in harmony-regexp-exec.js exactly doing that. I'll make a change.
https://codereview.chromium.org/1812673005/diff/1040001/src/js/icu-case-mappi... File src/js/icu-case-mapping.js (right): https://codereview.chromium.org/1812673005/diff/1040001/src/js/icu-case-mappi... src/js/icu-case-mapping.js:19: OverrideFunction(GlobalString.prototype, 'toLowerCase', function() { On 2016/05/04 at 19:02:58, jshin (jungshik at google) wrote: > offline, you're worried about actually defining functions in this file instead of just installing them. > > Would this be better, then? > > OverrideFunction(GlobalString.prototype, 'toLowerCase', toLowerCaseICU); > > where |toLowerCaseICU| is defined in i18n.js. > > I found an example in harmony-regexp-exec.js exactly doing that. I'll make a change. That's what I was suggesting, though if you found that performance is already good in this configuration, then that might not be necessary. Note that in this case, you'll need to use ImportNow. Eventually, the flag will be switched to be flipped on all the time and this file will be merged into i18n.js, so this is just about performance during the transition.
On 2016/05/04 19:06:37, Dan Ehrenberg wrote: > https://codereview.chromium.org/1812673005/diff/1040001/src/js/icu-case-mappi... > File src/js/icu-case-mapping.js (right): > > https://codereview.chromium.org/1812673005/diff/1040001/src/js/icu-case-mappi... > src/js/icu-case-mapping.js:19: OverrideFunction(GlobalString.prototype, > 'toLowerCase', function() { > On 2016/05/04 at 19:02:58, jshin (jungshik at google) wrote: > > offline, you're worried about actually defining functions in this file instead > of just installing them. > > > > Would this be better, then? > > > > OverrideFunction(GlobalString.prototype, 'toLowerCase', toLowerCaseICU); > > > > where |toLowerCaseICU| is defined in i18n.js. > > > > I found an example in harmony-regexp-exec.js exactly doing that. I'll make a > change. > > That's what I was suggesting, though if you found that performance is already > good in this configuration, then that might not be necessary. Note that in this > case, you'll need to use ImportNow. Eventually, the flag will be switched to be > flipped on all the time and this file will be merged into i18n.js, so this is > just about performance during the transition. Thanks. I was able to change the way to{L,U}Case are overriden (And, I found out that I need to use importNow instead of import by trial/error because I hadn't read your reply while trying :-) ). However, my attempt to move toLocale{L,U}Case was not successful. I don't know how to deal with 'arguments[0]' in the function body. Various attempts led to all sort of funny crashes. PS 54 does not have any of this, yet. Because to{L,U}Case are more critical than their Locale counterparts (toLocale{L,U}Case was broken completely anyway in the ToT), I'll just change the way to{L,U}Case are overriden (while leaving alone toLocale{L,U}Case) hoping that it may squeeze out a bit more perf improvement (I don't know if there's any significatn difference. I'm gonna measure it).
On 2016/05/04 at 22:03:22, jshin wrote: > On 2016/05/04 19:06:37, Dan Ehrenberg wrote: > > https://codereview.chromium.org/1812673005/diff/1040001/src/js/icu-case-mappi... > > File src/js/icu-case-mapping.js (right): > > > > https://codereview.chromium.org/1812673005/diff/1040001/src/js/icu-case-mappi... > > src/js/icu-case-mapping.js:19: OverrideFunction(GlobalString.prototype, > > 'toLowerCase', function() { > > On 2016/05/04 at 19:02:58, jshin (jungshik at google) wrote: > > > offline, you're worried about actually defining functions in this file instead > > of just installing them. > > > > > > Would this be better, then? > > > > > > OverrideFunction(GlobalString.prototype, 'toLowerCase', toLowerCaseICU); > > > > > > where |toLowerCaseICU| is defined in i18n.js. > > > > > > I found an example in harmony-regexp-exec.js exactly doing that. I'll make a > > change. > > > > That's what I was suggesting, though if you found that performance is already > > good in this configuration, then that might not be necessary. Note that in this > > case, you'll need to use ImportNow. Eventually, the flag will be switched to be > > flipped on all the time and this file will be merged into i18n.js, so this is > > just about performance during the transition. > > Thanks. I was able to change the way to{L,U}Case are overriden (And, I found out that I need to use importNow instead of import by trial/error because I hadn't read your reply while trying :-) ). > > However, my attempt to move toLocale{L,U}Case was not successful. I don't know how to deal with 'arguments[0]' in the function body. Various attempts led to all sort of funny crashes. > > PS 54 does not have any of this, yet. Because to{L,U}Case are more critical than their Locale counterparts (toLocale{L,U}Case was broken completely anyway in the ToT), I'll just change the way to{L,U}Case are overriden (while leaving alone toLocale{L,U}Case) hoping that it may squeeze out a bit more perf improvement (I don't know if there's any significatn difference. I'm gonna measure it). The arguments[0] trick was just a funny way to avoid having to set the function length explicitly. You can just take the arguments as normal function parameters and later call %FunctionSetLength to set it to 0.
On 2016/05/04 22:27:17, Dan Ehrenberg wrote: > On 2016/05/04 at 22:03:22, jshin wrote: > > On 2016/05/04 19:06:37, Dan Ehrenberg wrote: > > > > https://codereview.chromium.org/1812673005/diff/1040001/src/js/icu-case-mappi... > > > File src/js/icu-case-mapping.js (right): > > > > > > > https://codereview.chromium.org/1812673005/diff/1040001/src/js/icu-case-mappi... > > > src/js/icu-case-mapping.js:19: OverrideFunction(GlobalString.prototype, > > > 'toLowerCase', function() { > > > On 2016/05/04 at 19:02:58, jshin (jungshik at google) wrote: > > > > offline, you're worried about actually defining functions in this file > instead > > > of just installing them. > > > > > > > > Would this be better, then? > > > > > > > > OverrideFunction(GlobalString.prototype, 'toLowerCase', toLowerCaseICU); > > > > > > > > where |toLowerCaseICU| is defined in i18n.js. > > > > > > > > I found an example in harmony-regexp-exec.js exactly doing that. I'll make > a > > > change. > > > > > > That's what I was suggesting, though if you found that performance is > already > > > good in this configuration, then that might not be necessary. Note that in > this > > > case, you'll need to use ImportNow. Eventually, the flag will be switched to > be > > > flipped on all the time and this file will be merged into i18n.js, so this > is > > > just about performance during the transition. > > > > Thanks. I was able to change the way to{L,U}Case are overriden (And, I found > out that I need to use importNow instead of import by trial/error because I > hadn't read your reply while trying :-) ). > > > > However, my attempt to move toLocale{L,U}Case was not successful. I don't know > how to deal with 'arguments[0]' in the function body. Various attempts led to > all sort of funny crashes. > > > > PS 54 does not have any of this, yet. Because to{L,U}Case are more critical > than their Locale counterparts (toLocale{L,U}Case was broken completely anyway > in the ToT), I'll just change the way to{L,U}Case are overriden (while leaving > alone toLocale{L,U}Case) hoping that it may squeeze out a bit more perf > improvement (I don't know if there's any significatn difference. I'm gonna > measure it). > > The arguments[0] trick was just a funny way to avoid having to set the function > length explicitly. You can just take the arguments as normal function parameters > and later call %FunctionSetLength to set it to 0. Thanks. I've tried it and it still does not work. More importantly, even PS 55 (that only changed the way to{L,U}Case are overriden) has an issue. It fails the following test and other similar tests: === test262/built-ins/String/prototype/toLowerCase/S15.5.4.16_A6 === /usr/local/google/home/jungshik/v8/v8/test/test262/data/harness/sta.js:18: Test262Error: #1: String.prototype.toLowerCase.prototype === undefined. Actual: [object Object] throw new Test262Error(message); ^ |String.prototype.toLowerCase.prototype| is supposed to be undefined, but it's not the case. BTW, PS 53 (function body in icu_case_mapping.js) vs PS 55 (function body in i18n.js): in most cases, PS 55 is 'faster' than PS 53 (by a constant time no matter how long a test is), but it's all within a standard deviation. So, it's not statistically significant. https://docs.google.com/spreadsheets/d/1KJCJxKc1FxFXjwmYqABS0_2cNdPetvnd8gY8_... has more details (PS 53 v PS55 tab).
On 2016/05/05 00:33:30, jshin (jungshik at google) wrote: > On 2016/05/04 22:27:17, Dan Ehrenberg wrote: > > On 2016/05/04 at 22:03:22, jshin wrote: > > > On 2016/05/04 19:06:37, Dan Ehrenberg wrote: > > > > > > > https://codereview.chromium.org/1812673005/diff/1040001/src/js/icu-case-mappi... > > > > File src/js/icu-case-mapping.js (right): > > > > > > > > > > > https://codereview.chromium.org/1812673005/diff/1040001/src/js/icu-case-mappi... > > > > src/js/icu-case-mapping.js:19: OverrideFunction(GlobalString.prototype, > > > > 'toLowerCase', function() { > > > > On 2016/05/04 at 19:02:58, jshin (jungshik at google) wrote: > > > > > offline, you're worried about actually defining functions in this file > > instead > > > > of just installing them. > > > > > > > > > > Would this be better, then? > > > > > > > > > > OverrideFunction(GlobalString.prototype, 'toLowerCase', toLowerCaseICU); > > > > > > > > > > where |toLowerCaseICU| is defined in i18n.js. > > > > > > > > > > I found an example in harmony-regexp-exec.js exactly doing that. I'll > make > > a > > > > change. > > > > > > > > That's what I was suggesting, though if you found that performance is > > already > > > > good in this configuration, then that might not be necessary. Note that in > > this > > > > case, you'll need to use ImportNow. Eventually, the flag will be switched > to > > be > > > > flipped on all the time and this file will be merged into i18n.js, so this > > is > > > > just about performance during the transition. > > > However, my attempt to move toLocale{L,U}Case was not successful. I don't > know > > how to deal with 'arguments[0]' in the function body. Various attempts led to > > all sort of funny crashes. ... > > The arguments[0] trick was just a funny way to avoid having to set the > function > > length explicitly. You can just take the arguments as normal function > parameters > > and later call %FunctionSetLength to set it to 0. > > Thanks. I've tried it and it still does not work. > > More importantly, even PS 55 (that only changed the way to{L,U}Case are > overriden) has an issue. It fails the following test and other similar tests: > > === test262/built-ins/String/prototype/toLowerCase/S15.5.4.16_A6 === > /usr/local/google/home/jungshik/v8/v8/test/test262/data/harness/sta.js:18: > Test262Error: #1: > String.prototype.toLowerCase.prototype === undefined. Actual: [object Object] > throw new Test262Error(message); > ^ > |String.prototype.toLowerCase.prototype| is supposed to be undefined, but it's > not the case. There's a typo in my previous attempt to override toLocale*Case in a more efficient way. With that fixed (PS 56), all the case-conversion-related tests pass with --icu_case_mapping passed. However, the above test (S15.5.4.16_A6) still fails for all 4 functions (to{U,L}Case and toLocale{U,L}Case). Dan, can you take a look at PS 56 and see if there's anything I can do to make the test happy ( |String.prototype.toLowerCase.prototype| should be undefined to pass the test) ? Otherwise, I'll just go back to PS 54 with a less efficient override of those 4 functions. As shown in the column M and N of 'PS 53 vs PS 55' tab of the following spreadsheet, the difference between PS 53 and PS 55 are within one sigma (even though PS 55 is better in most cases; perhaps pooling all the test results together also shows that PS 55 is faster than PS 53 by ~20 ms for 10^7 operations) and is insignificant. https://docs.google.com/spreadsheets/d/1KJCJxKc1FxFXjwmYqABS0_2cNdPetvnd8gY8_... Thanks
The CQ bit was checked by jshin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1812673005/1120001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1812673005/1120001
On 2016/05/05 17:54:34, jshin (jungshik at google) wrote: > > /usr/local/google/home/jungshik/v8/v8/test/test262/data/harness/sta.js:18: > > Test262Error: #1: > > String.prototype.toLowerCase.prototype === undefined. Actual: [object Object] > > throw new Test262Error(message); > > ^ > > |String.prototype.toLowerCase.prototype| is supposed to be undefined, but it's > > not the case. > > There's a typo in my previous attempt to override toLocale*Case in a more > efficient way. > With that fixed (PS 56), all the case-conversion-related tests pass with > --icu_case_mapping passed. > > However, the above test (S15.5.4.16_A6) still fails for all 4 functions > (to{U,L}Case and toLocale{U,L}Case). > > Dan, can you take a look at PS 56 and see if there's anything I can do to make > the test happy ( > |String.prototype.toLowerCase.prototype| should be undefined to pass the test) > ? Solved the problem (with Dan's help offline) ! I think the CL is in good shape for review/approval. Can all of you take another look? Thanks !
Description was changed from ========== Use ICU case conversion/transliterator for case conversion When I18N is enabled, use ICU's case conversion API and transliteration API [1] to implement String.prototype.to{Upper,Lower}Case and String.prototype.toLocale{Upper,Lower}Case. * ICU-based case conversion was implemented in runtime-i18n.cc/i18n.js * The above 4 functions are overridden with those in i18n.js when I18N is enabled. Previously, toLocale{U,L}Case just called to{U,L}Case so that they didn't support locale-sensitive case conversion for Turkic languages (az, tr), Greek (el) and Lithuanian (lt). Before ICU APIs for the most general case are called, a fast-path for Latin-1 is tried. It's taken from Blink and adopted as necessary. This fast path is always tried for to{U,L}Case. For toLocale{U,L}Case, it's only taken when a locale (explicitly specified or default) is not in {az, el, lt, tr}. With these changes, a build with I18N enabled passes a bunch of tests in test262/intl402/Strings/* and intl/* that failed before. In non-intl builds, they're still handled by unibrow with a few test failures. This CL started with http://crrev.com/1544023002#ps200001 by littledan@, but has changed significantly since. [1] See why transliteration API is needed for uppercasing in Greek. http://bugs.icu-project.org/trac/ticket/10582 R=yangguo BUG=v8:4476,v8:4477 LOG=Y TEST= test262/{built-ins,intl402}/Strings/*, webkit/fast/js/*, mjsunit/string-case ========== to ========== Use ICU case conversion/transliterator for case conversion When I18N is enabled, use ICU's case conversion API and transliteration API [1] to implement String.prototype.to{Upper,Lower}Case and String.prototype.toLocale{Upper,Lower}Case. * ICU-based case conversion was implemented in runtime-i18n.cc/i18n.js * The above 4 functions are overridden with those in i18n.js when --icu_case_mapping flag is turned on. To control the override by the flag, they're overriden in icu-case-mapping.js Previously, toLocale{U,L}Case just called to{U,L}Case so that they didn't support locale-sensitive case conversion for Turkic languages (az, tr), Greek (el) and Lithuanian (lt). Before ICU APIs for the most general case are called, a fast-path for Latin-1 is tried. It's taken from Blink and adopted as necessary. This fast path is always tried for to{U,L}Case. For toLocale{U,L}Case, it's only taken when a locale (explicitly specified or default) is not in {az, el, lt, tr}. With these changes, a build with --icu_case_mapping=true passes a bunch of tests in test262/intl402/Strings/* and intl/* that failed before. Handling of pure ASCII strings (aligned at word boundary) are not as fast as Unibrow's implementation that uses word-by-word case conversion. OTOH, Latin-1 input handling is faster than Unibrow. General Unicode input handling is slower but more accurate. This CL started with http://crrev.com/1544023002#ps200001 by littledan@, but has changed significantly since. [1] See why transliteration API is needed for uppercasing in Greek. http://bugs.icu-project.org/trac/ticket/10582 R=yangguo BUG=v8:4476,v8:4477 LOG=Y TEST= test262/{built-ins,intl402}/Strings/*, webkit/fast/js/*, mjsunit/string-case ==========
Description was changed from ========== Use ICU case conversion/transliterator for case conversion When I18N is enabled, use ICU's case conversion API and transliteration API [1] to implement String.prototype.to{Upper,Lower}Case and String.prototype.toLocale{Upper,Lower}Case. * ICU-based case conversion was implemented in runtime-i18n.cc/i18n.js * The above 4 functions are overridden with those in i18n.js when --icu_case_mapping flag is turned on. To control the override by the flag, they're overriden in icu-case-mapping.js Previously, toLocale{U,L}Case just called to{U,L}Case so that they didn't support locale-sensitive case conversion for Turkic languages (az, tr), Greek (el) and Lithuanian (lt). Before ICU APIs for the most general case are called, a fast-path for Latin-1 is tried. It's taken from Blink and adopted as necessary. This fast path is always tried for to{U,L}Case. For toLocale{U,L}Case, it's only taken when a locale (explicitly specified or default) is not in {az, el, lt, tr}. With these changes, a build with --icu_case_mapping=true passes a bunch of tests in test262/intl402/Strings/* and intl/* that failed before. Handling of pure ASCII strings (aligned at word boundary) are not as fast as Unibrow's implementation that uses word-by-word case conversion. OTOH, Latin-1 input handling is faster than Unibrow. General Unicode input handling is slower but more accurate. This CL started with http://crrev.com/1544023002#ps200001 by littledan@, but has changed significantly since. [1] See why transliteration API is needed for uppercasing in Greek. http://bugs.icu-project.org/trac/ticket/10582 R=yangguo BUG=v8:4476,v8:4477 LOG=Y TEST= test262/{built-ins,intl402}/Strings/*, webkit/fast/js/*, mjsunit/string-case ========== to ========== Use ICU case conversion/transliterator for case conversion When I18N is enabled, use ICU's case conversion API and transliteration API [1] to implement String.prototype.to{Upper,Lower}Case and String.prototype.toLocale{Upper,Lower}Case. * ICU-based case conversion was implemented in runtime-i18n.cc/i18n.js * The above 4 functions are overridden with those in i18n.js when --icu_case_mapping flag is turned on. To control the override by the flag, they're overriden in icu-case-mapping.js Previously, toLocale{U,L}Case just called to{U,L}Case so that they didn't support locale-sensitive case conversion for Turkic languages (az, tr), Greek (el) and Lithuanian (lt). Before ICU APIs for the most general case are called, a fast-path for Latin-1 is tried. It's taken from Blink and adopted as necessary. This fast path is always tried for to{U,L}Case. For toLocale{U,L}Case, it's only taken when a locale (explicitly specified or default) is not in {az, el, lt, tr}. With these changes, a build with --icu_case_mapping=true passes a bunch of tests in test262/intl402/Strings/* and intl/* that failed before. Handling of pure ASCII strings (aligned at word boundary) are not as fast as Unibrow's implementation that uses word-by-word case conversion. OTOH, Latin-1 input handling is faster than Unibrow. General Unicode input handling is slower but more accurate. This CL started with http://crrev.com/1544023002#ps200001 by littledan@, but has changed significantly since. [1] See why transliteration API is needed for uppercasing in Greek. http://bugs.icu-project.org/trac/ticket/10582 R=yangguo BUG=v8:4476,v8:4477 LOG=Y TEST=test262/{built-ins,intl402}/Strings/*, webkit/fast/js/*, mjsunit/string-case, intl/general/case* ==========
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
littledan@chromium.org changed reviewers: + machenbach@chromium.org
lgtm +machenbach for build file changes. This patch looks good to me, but I'd hold off on submitting until an LGTM from Yang as well. https://codereview.chromium.org/1812673005/diff/1120001/test/intl/testcfg.py File test/intl/testcfg.py (right): https://codereview.chromium.org/1812673005/diff/1120001/test/intl/testcfg.py#... test/intl/testcfg.py:58: flags = ["--allow_natives_syntax"] + context.mode_flags Nit: No need for this change.
build files lgtm
Thanks, Dan and Michael. Yang, can you take another look? Thanks ! https://codereview.chromium.org/1812673005/diff/1120001/test/intl/testcfg.py File test/intl/testcfg.py (right): https://codereview.chromium.org/1812673005/diff/1120001/test/intl/testcfg.py#... test/intl/testcfg.py:58: flags = ["--allow_natives_syntax"] + context.mode_flags On 2016/05/05 23:36:57, Dan Ehrenberg wrote: > Nit: No need for this change. Done.
The CQ bit was checked by jshin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1812673005/1140001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1812673005/1140001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
On 2016/05/05 23:36:57, Dan Ehrenberg wrote: > lgtm > > +machenbach for build file changes. > > This patch looks good to me, but I'd hold off on submitting until an LGTM from > Yang as well. Yang, could you take another look? Thanks
https://codereview.chromium.org/1812673005/diff/1140001/src/flag-definitions.h File src/flag-definitions.h (right): https://codereview.chromium.org/1812673005/diff/1140001/src/flag-definitions.... src/flag-definitions.h:184: DEFINE_IMPLICATION(es_staging, icu_case_mapping) Can we instead put this into the HARMONY_STAGED list below? https://codereview.chromium.org/1812673005/diff/1140001/src/js/i18n.js File src/js/i18n.js (right): https://codereview.chromium.org/1812673005/diff/1140001/src/js/i18n.js#newcod... src/js/i18n.js:2119: [ToLowerCaseI18N, ToUpperCaseI18N, ToLocaleLowerCaseI18N, ToLocaleUpperCaseI18N]. Can we avoid using this short hand? I'd rather have %FunctionRemovePrototype called verbatim. We are assuming, at this point, that forEach is installed, and we are using the correct one. This may all hold at this point, but seems unnecessary to me. And we are unnecessarily creating an array literal. https://codereview.chromium.org/1812673005/diff/1140001/src/runtime/runtime-i... File src/runtime/runtime-i18n.cc (right): https://codereview.chromium.org/1812673005/diff/1140001/src/runtime/runtime-i... src/runtime/runtime-i18n.cc:806: converted.setTo(false, src, src_length); Can you explain when the source string is copied? The comment for icu::Transliterator::transliterate says "Transliterates an entire string in place.", which suggests that the existing string may be overwritten. I haven't found a test case that checks that the input string has not been overwritten.
https://codereview.chromium.org/1812673005/diff/1160001/src/flag-definitions.h File src/flag-definitions.h (right): https://codereview.chromium.org/1812673005/diff/1160001/src/flag-definitions.... src/flag-definitions.h:205: #else Could you define this flag unconditionally, and ignore it when ICU is not present? https://codereview.chromium.org/1812673005/diff/1160001/src/js/i18n.js File src/js/i18n.js (right): https://codereview.chromium.org/1812673005/diff/1160001/src/js/i18n.js#newcod... src/js/i18n.js:2122: %FunctionRemovePrototype(ToLocaleUpperCaseI18N); Put these calls under each individual function declaration https://codereview.chromium.org/1812673005/diff/1160001/test/intl/general/cas... File test/intl/general/case-mapping.js (right): https://codereview.chromium.org/1812673005/diff/1160001/test/intl/general/cas... test/intl/general/case-mapping.js:5: // Flags: --icu_case_mapping This line is not used by the current intl test runner. Ideally, update test/intl/testcfg.py to forward flags the way mjsunit does (in which case you can make this a passing test expectation!) or remove this line.
Thank you, Yang, for taking a look. The latest CL addresses Yang's comments. Besides, Dan asked me to put '--icu_case_mapping' in the group of flags for HARMONY_INPROGRESS (instead of HARMONY_STAGED). As a result, empty global initialization function definition has to be added back for --icu_case_mapping (that was removed per Yang the other day) because HARMONY_INPROGRESS macro adds a call to an empty global init and a declaration for icu_case_mapping. https://codereview.chromium.org/1812673005/diff/1140001/src/js/i18n.js File src/js/i18n.js (right): https://codereview.chromium.org/1812673005/diff/1140001/src/js/i18n.js#newcod... src/js/i18n.js:2119: [ToLowerCaseI18N, ToUpperCaseI18N, ToLocaleLowerCaseI18N, ToLocaleUpperCaseI18N]. On 2016/05/10 10:00:46, Yang wrote: > Can we avoid using this short hand? I'd rather have %FunctionRemovePrototype > called verbatim. We are assuming, at this point, that forEach is installed, and > we are using the correct one. This may all hold at this point, but seems > unnecessary to me. And we are unnecessarily creating an array literal. Done. https://codereview.chromium.org/1812673005/diff/1140001/src/runtime/runtime-i... File src/runtime/runtime-i18n.cc (right): https://codereview.chromium.org/1812673005/diff/1140001/src/runtime/runtime-i... src/runtime/runtime-i18n.cc:806: converted.setTo(false, src, src_length); On 2016/05/10 10:00:46, Yang wrote: > Can you explain when the source string is copied? The comment for > icu::Transliterator::transliterate says "Transliterates an entire string in > place.", which suggests that the existing string may be overwritten. Whenever the read-alias buffer needs to be overwritten/modified, icu::UnicodeString makes a copy before doing that. To make 100% sure, last week I tracked down where it happens when UnicodeString is passed to Transliterate. Note that Transliterate() takes 'icu::Replaceable' (inherited by UnicodeString). Transliterate() calls Replaceable::handleReplaceBetween whose UnicodeString() implementation calls CloneIfNeeded() to honor 'copy-on-write'. > I haven't found a test case that checks that the input string has not been > overwritten. A few CLs ago, I added the following test cases where input is the same as output (in test/intl/general/case-mapping.js ). When the memory pointing to the input (unmodified) becomes inaccessible, I got a crash. That was fixed. assertEquals("ΑΒΓΔΕ", "ΑΒΓΔΕ".toLocaleUpperCase("el")); assertEquals("ΑΒΓΔΕАБ𝐀𝐁", "ΑΒΓΔΕАБ𝐀𝐁".toLocaleUpperCase("el")); assertEquals("ABCDEÂÓḴ123", "ABCDEÂÓḴ123".toLocaleUpperCase("el")); Ahah what you want this time is the opposite case. When the output is different from the input, you want me to verify that the input is NOT changed, don't you? I added test cases for that and they pass both in Debug and Release builds.
https://codereview.chromium.org/1812673005/diff/1160001/test/intl/general/cas... File test/intl/general/case-mapping.js (right): https://codereview.chromium.org/1812673005/diff/1160001/test/intl/general/cas... test/intl/general/case-mapping.js:115: var uppered = s.toLocaleUpperCase("el"); unfortunately this is not the correct test. strings in js are immutable. backup would simply alias s. comparing s to the expected string literal also wouldnt work, since string literals are canonicalized and aliases of each other. You would need to construct the expectation somehow, for example using array.join.
https://codereview.chromium.org/1812673005/diff/1160001/src/flag-definitions.h File src/flag-definitions.h (right): https://codereview.chromium.org/1812673005/diff/1160001/src/flag-definitions.... src/flag-definitions.h:205: #else On 2016/05/10 19:29:37, Dan Ehrenberg wrote: > Could you define this flag unconditionally, and ignore it when ICU is not > present? HARMONY_INPROGRESS() macro is used in a few places in bootstrap.cc. They pull in icu-case-mapping.js, the js file is NOT bundled in a build when i18n support is disabled. So, I guess we have to do this way even though it's ugly. https://codereview.chromium.org/1812673005/diff/1160001/test/intl/general/cas... File test/intl/general/case-mapping.js (right): https://codereview.chromium.org/1812673005/diff/1160001/test/intl/general/cas... test/intl/general/case-mapping.js:5: // Flags: --icu_case_mapping On 2016/05/10 19:29:37, Dan Ehrenberg wrote: > This line is not used by the current intl test runner. Ideally, update > test/intl/testcfg.py to forward flags the way mjsunit does (in which case you > can make this a passing test expectation!) or remove this line. Changes intl/testcfg.py to support 'Flags: ...'. Thanks for the suggestion.
Can you take another look? Thanks ! https://codereview.chromium.org/1812673005/diff/1160001/test/intl/general/cas... File test/intl/general/case-mapping.js (right): https://codereview.chromium.org/1812673005/diff/1160001/test/intl/general/cas... test/intl/general/case-mapping.js:115: var uppered = s.toLocaleUpperCase("el"); On 2016/05/10 20:37:36, Yang wrote: > unfortunately this is not the correct test. strings in js are immutable. backup > would simply alias s. > > comparing s to the expected string literal also wouldnt work, since string > literals are canonicalized and aliases of each other. You would need to > construct the expectation somehow, for example using array.join. Thank you for enlightening me ! Now, I split |s| to an array (backup) and join it again to compare with |s|.
The CQ bit was checked by jshin@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1812673005/1200001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1812673005/1200001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: v8_linux64_asan_rel_ng on tryserver.v8 (JOB_FAILED, http://build.chromium.org/p/tryserver.v8/builders/v8_linux64_asan_rel_ng/buil...) v8_linux64_rel_ng on tryserver.v8 (JOB_FAILED, http://build.chromium.org/p/tryserver.v8/builders/v8_linux64_rel_ng/builds/5554) v8_linux_arm_rel_ng on tryserver.v8 (JOB_FAILED, http://build.chromium.org/p/tryserver.v8/builders/v8_linux_arm_rel_ng/builds/...) v8_linux_mips64el_compile_rel on tryserver.v8 (JOB_FAILED, http://build.chromium.org/p/tryserver.v8/builders/v8_linux_mips64el_compile_r...) v8_linux_mipsel_compile_rel on tryserver.v8 (JOB_FAILED, http://build.chromium.org/p/tryserver.v8/builders/v8_linux_mipsel_compile_rel...) v8_linux_nodcheck_rel_ng on tryserver.v8 (JOB_FAILED, http://build.chromium.org/p/tryserver.v8/builders/v8_linux_nodcheck_rel_ng/bu...)
Description was changed from ========== Use ICU case conversion/transliterator for case conversion When I18N is enabled, use ICU's case conversion API and transliteration API [1] to implement String.prototype.to{Upper,Lower}Case and String.prototype.toLocale{Upper,Lower}Case. * ICU-based case conversion was implemented in runtime-i18n.cc/i18n.js * The above 4 functions are overridden with those in i18n.js when --icu_case_mapping flag is turned on. To control the override by the flag, they're overriden in icu-case-mapping.js Previously, toLocale{U,L}Case just called to{U,L}Case so that they didn't support locale-sensitive case conversion for Turkic languages (az, tr), Greek (el) and Lithuanian (lt). Before ICU APIs for the most general case are called, a fast-path for Latin-1 is tried. It's taken from Blink and adopted as necessary. This fast path is always tried for to{U,L}Case. For toLocale{U,L}Case, it's only taken when a locale (explicitly specified or default) is not in {az, el, lt, tr}. With these changes, a build with --icu_case_mapping=true passes a bunch of tests in test262/intl402/Strings/* and intl/* that failed before. Handling of pure ASCII strings (aligned at word boundary) are not as fast as Unibrow's implementation that uses word-by-word case conversion. OTOH, Latin-1 input handling is faster than Unibrow. General Unicode input handling is slower but more accurate. This CL started with http://crrev.com/1544023002#ps200001 by littledan@, but has changed significantly since. [1] See why transliteration API is needed for uppercasing in Greek. http://bugs.icu-project.org/trac/ticket/10582 R=yangguo BUG=v8:4476,v8:4477 LOG=Y TEST=test262/{built-ins,intl402}/Strings/*, webkit/fast/js/*, mjsunit/string-case, intl/general/case* ========== to ========== Use ICU case conversion/transliterator for case conversion When I18N is enabled, use ICU's case conversion API and transliteration API [1] to implement String.prototype.to{Upper,Lower}Case and String.prototype.toLocale{Upper,Lower}Case. * ICU-based case conversion was implemented in runtime-i18n.cc/i18n.js * The above 4 functions are overridden with those in i18n.js when --icu_case_mapping flag is turned on. To control the override by the flag, they're overriden in icu-case-mapping.js Previously, toLocale{U,L}Case just called to{U,L}Case so that they didn't support locale-sensitive case conversion for Turkic languages (az, tr), Greek (el) and Lithuanian (lt). Before ICU APIs for the most general case are called, a fast-path for Latin-1 is tried. It's taken from Blink and adopted as necessary. This fast path is always tried for to{U,L}Case. For toLocale{U,L}Case, it's only taken when a locale (explicitly specified or default) is not in {az, el, lt, tr}. With these changes, a build with --icu_case_mapping=true passes a bunch of tests in test262/intl402/Strings/* and intl/* that failed before. Handling of pure ASCII strings (aligned at word boundary) are not as fast as Unibrow's implementation that uses word-by-word case conversion. OTOH, Latin-1 input handling is faster than Unibrow. General Unicode input handling is slower but more accurate. See https://docs.google.com/spreadsheets/d/1KJCJxKc1FxFXjwmYqABS0_2cNdPetvnd8gY8_... for the benchmark. This CL started with http://crrev.com/1544023002#ps200001 by littledan@, but has changed significantly since. [1] See why transliteration API is needed for uppercasing in Greek. http://bugs.icu-project.org/trac/ticket/10582 R=yangguo BUG=v8:4476,v8:4477 LOG=Y TEST=test262/{built-ins,intl402}/Strings/*, webkit/fast/js/*, mjsunit/string-case, intl/general/case* ==========
LGTM https://codereview.chromium.org/1812673005/diff/1200001/src/runtime/runtime-i... File src/runtime/runtime-i18n.cc (right): https://codereview.chromium.org/1812673005/diff/1200001/src/runtime/runtime-i... src/runtime/runtime-i18n.cc:807: ConvertCaseWithTransliterator(&converted, "el-Upper"); So... if ConvertCaseWithTransliterator does not change the input, 'converted' should still alias 'src', right? Can we simply return 's' if converted.getBuffer() == src? That doesn't seem to be expensive.
Thank you, Yang ! Your comment was addressed. I'm adding the CL to CQ. https://codereview.chromium.org/1812673005/diff/1200001/src/runtime/runtime-i... File src/runtime/runtime-i18n.cc (right): https://codereview.chromium.org/1812673005/diff/1200001/src/runtime/runtime-i... src/runtime/runtime-i18n.cc:807: ConvertCaseWithTransliterator(&converted, "el-Upper"); On 2016/05/11 08:42:30, Yang wrote: > So... if ConvertCaseWithTransliterator does not change the input, 'converted' > should still alias 'src', right? Can we simply return 's' if > converted.getBuffer() == src? That doesn't seem to be expensive. Done.
lgtm
Description was changed from ========== Use ICU case conversion/transliterator for case conversion When I18N is enabled, use ICU's case conversion API and transliteration API [1] to implement String.prototype.to{Upper,Lower}Case and String.prototype.toLocale{Upper,Lower}Case. * ICU-based case conversion was implemented in runtime-i18n.cc/i18n.js * The above 4 functions are overridden with those in i18n.js when --icu_case_mapping flag is turned on. To control the override by the flag, they're overriden in icu-case-mapping.js Previously, toLocale{U,L}Case just called to{U,L}Case so that they didn't support locale-sensitive case conversion for Turkic languages (az, tr), Greek (el) and Lithuanian (lt). Before ICU APIs for the most general case are called, a fast-path for Latin-1 is tried. It's taken from Blink and adopted as necessary. This fast path is always tried for to{U,L}Case. For toLocale{U,L}Case, it's only taken when a locale (explicitly specified or default) is not in {az, el, lt, tr}. With these changes, a build with --icu_case_mapping=true passes a bunch of tests in test262/intl402/Strings/* and intl/* that failed before. Handling of pure ASCII strings (aligned at word boundary) are not as fast as Unibrow's implementation that uses word-by-word case conversion. OTOH, Latin-1 input handling is faster than Unibrow. General Unicode input handling is slower but more accurate. See https://docs.google.com/spreadsheets/d/1KJCJxKc1FxFXjwmYqABS0_2cNdPetvnd8gY8_... for the benchmark. This CL started with http://crrev.com/1544023002#ps200001 by littledan@, but has changed significantly since. [1] See why transliteration API is needed for uppercasing in Greek. http://bugs.icu-project.org/trac/ticket/10582 R=yangguo BUG=v8:4476,v8:4477 LOG=Y TEST=test262/{built-ins,intl402}/Strings/*, webkit/fast/js/*, mjsunit/string-case, intl/general/case* ========== to ========== Use ICU case conversion/transliterator for case conversion When I18N is enabled, use ICU's case conversion API and transliteration API [1] to implement String.prototype.to{Upper,Lower}Case and String.prototype.toLocale{Upper,Lower}Case. * ICU-based case conversion was implemented in runtime-i18n.cc/i18n.js * The above 4 functions are overridden with those in i18n.js when --icu_case_mapping flag is turned on. To control the override by the flag, they're overriden in icu-case-mapping.js Previously, toLocale{U,L}Case just called to{U,L}Case so that they didn't support locale-sensitive case conversion for Turkic languages (az, tr), Greek (el) and Lithuanian (lt). Before ICU APIs for the most general case are called, a fast-path for Latin-1 is tried. It's taken from Blink and adopted as necessary. This fast path is always tried for to{U,L}Case. For toLocale{U,L}Case, it's only taken when a locale (explicitly specified or default) is not in {az, el, lt, tr}. With these changes, a build with --icu_case_mapping=true passes a bunch of tests in test262/intl402/Strings/* and intl/* that failed before. Handling of pure ASCII strings (aligned at word boundary) are not as fast as Unibrow's implementation that uses word-by-word case conversion. OTOH, Latin-1 input handling is faster than Unibrow. General Unicode input handling is slower but more accurate. See https://docs.google.com/spreadsheets/d/1KJCJxKc1FxFXjwmYqABS0_2cNdPetvnd8gY8_... for the benchmark. This CL started with http://crrev.com/1544023002#ps200001 by littledan@, but has changed significantly since. [1] See why transliteration API is needed for uppercasing in Greek. http://bugs.icu-project.org/trac/ticket/10582 R=yangguo BUG=v8:4476,v8:4477 LOG=Y TEST=test262/{built-ins,intl402}/Strings/*, webkit/fast/js/*, mjsunit/string-case, intl/general/case* ==========
The CQ bit was checked by jshin@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from machenbach@chromium.org, yangguo@chromium.org Link to the patchset: https://codereview.chromium.org/1812673005/#ps1240001 (title: "Yang's comment addressed - return right away for no-change")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1812673005/1240001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1812673005/1240001
Message was sent while issue was closed.
Description was changed from ========== Use ICU case conversion/transliterator for case conversion When I18N is enabled, use ICU's case conversion API and transliteration API [1] to implement String.prototype.to{Upper,Lower}Case and String.prototype.toLocale{Upper,Lower}Case. * ICU-based case conversion was implemented in runtime-i18n.cc/i18n.js * The above 4 functions are overridden with those in i18n.js when --icu_case_mapping flag is turned on. To control the override by the flag, they're overriden in icu-case-mapping.js Previously, toLocale{U,L}Case just called to{U,L}Case so that they didn't support locale-sensitive case conversion for Turkic languages (az, tr), Greek (el) and Lithuanian (lt). Before ICU APIs for the most general case are called, a fast-path for Latin-1 is tried. It's taken from Blink and adopted as necessary. This fast path is always tried for to{U,L}Case. For toLocale{U,L}Case, it's only taken when a locale (explicitly specified or default) is not in {az, el, lt, tr}. With these changes, a build with --icu_case_mapping=true passes a bunch of tests in test262/intl402/Strings/* and intl/* that failed before. Handling of pure ASCII strings (aligned at word boundary) are not as fast as Unibrow's implementation that uses word-by-word case conversion. OTOH, Latin-1 input handling is faster than Unibrow. General Unicode input handling is slower but more accurate. See https://docs.google.com/spreadsheets/d/1KJCJxKc1FxFXjwmYqABS0_2cNdPetvnd8gY8_... for the benchmark. This CL started with http://crrev.com/1544023002#ps200001 by littledan@, but has changed significantly since. [1] See why transliteration API is needed for uppercasing in Greek. http://bugs.icu-project.org/trac/ticket/10582 R=yangguo BUG=v8:4476,v8:4477 LOG=Y TEST=test262/{built-ins,intl402}/Strings/*, webkit/fast/js/*, mjsunit/string-case, intl/general/case* ========== to ========== Use ICU case conversion/transliterator for case conversion When I18N is enabled, use ICU's case conversion API and transliteration API [1] to implement String.prototype.to{Upper,Lower}Case and String.prototype.toLocale{Upper,Lower}Case. * ICU-based case conversion was implemented in runtime-i18n.cc/i18n.js * The above 4 functions are overridden with those in i18n.js when --icu_case_mapping flag is turned on. To control the override by the flag, they're overriden in icu-case-mapping.js Previously, toLocale{U,L}Case just called to{U,L}Case so that they didn't support locale-sensitive case conversion for Turkic languages (az, tr), Greek (el) and Lithuanian (lt). Before ICU APIs for the most general case are called, a fast-path for Latin-1 is tried. It's taken from Blink and adopted as necessary. This fast path is always tried for to{U,L}Case. For toLocale{U,L}Case, it's only taken when a locale (explicitly specified or default) is not in {az, el, lt, tr}. With these changes, a build with --icu_case_mapping=true passes a bunch of tests in test262/intl402/Strings/* and intl/* that failed before. Handling of pure ASCII strings (aligned at word boundary) are not as fast as Unibrow's implementation that uses word-by-word case conversion. OTOH, Latin-1 input handling is faster than Unibrow. General Unicode input handling is slower but more accurate. See https://docs.google.com/spreadsheets/d/1KJCJxKc1FxFXjwmYqABS0_2cNdPetvnd8gY8_... for the benchmark. This CL started with http://crrev.com/1544023002#ps200001 by littledan@, but has changed significantly since. [1] See why transliteration API is needed for uppercasing in Greek. http://bugs.icu-project.org/trac/ticket/10582 R=yangguo BUG=v8:4476,v8:4477 LOG=Y TEST=test262/{built-ins,intl402}/Strings/*, webkit/fast/js/*, mjsunit/string-case, intl/general/case* ==========
Message was sent while issue was closed.
Committed patchset #63 (id:1240001)
Message was sent while issue was closed.
Description was changed from ========== Use ICU case conversion/transliterator for case conversion When I18N is enabled, use ICU's case conversion API and transliteration API [1] to implement String.prototype.to{Upper,Lower}Case and String.prototype.toLocale{Upper,Lower}Case. * ICU-based case conversion was implemented in runtime-i18n.cc/i18n.js * The above 4 functions are overridden with those in i18n.js when --icu_case_mapping flag is turned on. To control the override by the flag, they're overriden in icu-case-mapping.js Previously, toLocale{U,L}Case just called to{U,L}Case so that they didn't support locale-sensitive case conversion for Turkic languages (az, tr), Greek (el) and Lithuanian (lt). Before ICU APIs for the most general case are called, a fast-path for Latin-1 is tried. It's taken from Blink and adopted as necessary. This fast path is always tried for to{U,L}Case. For toLocale{U,L}Case, it's only taken when a locale (explicitly specified or default) is not in {az, el, lt, tr}. With these changes, a build with --icu_case_mapping=true passes a bunch of tests in test262/intl402/Strings/* and intl/* that failed before. Handling of pure ASCII strings (aligned at word boundary) are not as fast as Unibrow's implementation that uses word-by-word case conversion. OTOH, Latin-1 input handling is faster than Unibrow. General Unicode input handling is slower but more accurate. See https://docs.google.com/spreadsheets/d/1KJCJxKc1FxFXjwmYqABS0_2cNdPetvnd8gY8_... for the benchmark. This CL started with http://crrev.com/1544023002#ps200001 by littledan@, but has changed significantly since. [1] See why transliteration API is needed for uppercasing in Greek. http://bugs.icu-project.org/trac/ticket/10582 R=yangguo BUG=v8:4476,v8:4477 LOG=Y TEST=test262/{built-ins,intl402}/Strings/*, webkit/fast/js/*, mjsunit/string-case, intl/general/case* ========== to ========== Use ICU case conversion/transliterator for case conversion When I18N is enabled, use ICU's case conversion API and transliteration API [1] to implement String.prototype.to{Upper,Lower}Case and String.prototype.toLocale{Upper,Lower}Case. * ICU-based case conversion was implemented in runtime-i18n.cc/i18n.js * The above 4 functions are overridden with those in i18n.js when --icu_case_mapping flag is turned on. To control the override by the flag, they're overriden in icu-case-mapping.js Previously, toLocale{U,L}Case just called to{U,L}Case so that they didn't support locale-sensitive case conversion for Turkic languages (az, tr), Greek (el) and Lithuanian (lt). Before ICU APIs for the most general case are called, a fast-path for Latin-1 is tried. It's taken from Blink and adopted as necessary. This fast path is always tried for to{U,L}Case. For toLocale{U,L}Case, it's only taken when a locale (explicitly specified or default) is not in {az, el, lt, tr}. With these changes, a build with --icu_case_mapping=true passes a bunch of tests in test262/intl402/Strings/* and intl/* that failed before. Handling of pure ASCII strings (aligned at word boundary) are not as fast as Unibrow's implementation that uses word-by-word case conversion. OTOH, Latin-1 input handling is faster than Unibrow. General Unicode input handling is slower but more accurate. See https://docs.google.com/spreadsheets/d/1KJCJxKc1FxFXjwmYqABS0_2cNdPetvnd8gY8_... for the benchmark. This CL started with http://crrev.com/1544023002#ps200001 by littledan@, but has changed significantly since. [1] See why transliteration API is needed for uppercasing in Greek. http://bugs.icu-project.org/trac/ticket/10582 R=yangguo BUG=v8:4476,v8:4477 LOG=Y TEST=test262/{built-ins,intl402}/Strings/*, webkit/fast/js/*, mjsunit/string-case, intl/general/case* Committed: https://crrev.com/b348d47bb94399045394bf4743c0c8c35328923b Cr-Commit-Position: refs/heads/master@{#36187} ==========
Message was sent while issue was closed.
Patchset 63 (id:??) landed as https://crrev.com/b348d47bb94399045394bf4743c0c8c35328923b Cr-Commit-Position: refs/heads/master@{#36187}
Message was sent while issue was closed.
srl295@gmail.com changed reviewers: + srl295@gmail.com
Message was sent while issue was closed.
Not sure if i'm late or early to the party. https://codereview.chromium.org/1812673005/diff/1240001/src/js/i18n.js File src/js/i18n.js (right): https://codereview.chromium.org/1812673005/diff/1240001/src/js/i18n.js#newcod... src/js/i18n.js:2003: // toLocale{U,L}Case() and about 40% of toLocale{U,L}Case("<locale>"). if it's slow, please file a bug in ICU to address. Filed: http://bugs.icu-project.org/trac/ticket/12647 https://codereview.chromium.org/1812673005/diff/1240001/src/js/i18n.js#newcod... src/js/i18n.js:2019: var CUSTOM_CASE_LANGUAGES = ['az', 'el', 'lt', 'tr']; Why are these hard coded? This decision should be made in ICU. Fragile. https://codereview.chromium.org/1812673005/diff/1240001/src/js/i18n.js#newcod... src/js/i18n.js:2019: var CUSTOM_CASE_LANGUAGES = ['az', 'el', 'lt', 'tr']; Filed : http://bugs.icu-project.org/trac/ticket/12647 https://codereview.chromium.org/1812673005/diff/1240001/src/runtime/runtime-i... File src/runtime/runtime-i18n.cc (right): https://codereview.chromium.org/1812673005/diff/1240001/src/runtime/runtime-i... src/runtime/runtime-i18n.cc:754: filed ICU bug http://bugs.icu-project.org/trac/ticket/12647 to pull this fastpath into ICU. |