|
|
Chromium Code Reviews|
Created:
5 years, 1 month ago by Lasse Reichstein Nielsen Modified:
5 years, 1 month ago CC:
reviews_dartlang.org Base URL:
https://github.com/dart-lang/collection.git@master Target Ref:
refs/heads/master Visibility:
Public. |
DescriptionAdd string comparators that can compare ASCII strings ignoring case.
Also add comparators that compare digit sequences lexicographically
(which is the same as comparing numerically if there are no leading
zeros).
The comparison functions define total orderings, breaking ties for
strings that only differ by case by their first case-different letter,
upper-case being less than lower-case.
R=floitsch@google.com, rnystrom@google.com, sra@google.com
Committed: https://github.com/dart-lang/collection/commit/c94b8dc745ef332270bece64daab7f545bf0452d
Patch Set 1 #
Total comments: 10
Patch Set 2 : Address comments. #
Total comments: 18
Patch Set 3 : Address comments. PTAL #
Total comments: 14
Patch Set 4 : Address comments. #
Messages
Total messages: 25 (5 generated)
lrn@google.com changed reviewers: + nweiz@google.com
After seeing sort((a,b)=>a.toUpperCase().compareTo(b.toUpperCase())) once too often.
sra@google.com changed reviewers: + sra@google.com
DBC. I think I would rather have a sort function that takes a 'key:' argument (and calls it once per element) than have ASCII-specific comparators. List.sort() is difficult because we didn't name the 'compare' argument, but there are stable sorts in this library that could add the parameter. https://codereview.chromium.org/1412293008/diff/1/lib/src/comparators.dart File lib/src/comparators.dart (right): https://codereview.chromium.org/1412293008/diff/1/lib/src/comparators.dart#ne... lib/src/comparators.dart:120: /// Compares strings [a] and [b] lexically, with embedded numbers recognized. 'with embedded sequences of digits compared as a unit' I'm confused by the terms 'lexically' and 'lexicographically'. Do you mean the same thing? If so, use one term throughout. https://codereview.chromium.org/1412293008/diff/1/lib/src/comparators.dart#ne... lib/src/comparators.dart:127: /// the same as numerical ordering. Add examples in the doc to tease out when this is number-like and when it is not. https://codereview.chromium.org/1412293008/diff/1/lib/src/comparators.dart#ne... lib/src/comparators.dart:128: int compareWithNumbers(String a, String b) { 'With' makes it sound like one of the arguments is a number. Maybe 'compareStringsContainingNumbers' https://codereview.chromium.org/1412293008/diff/1/lib/src/comparators.dart#ne... lib/src/comparators.dart:241: // Characters are not both part , or part Fix comment.
> I think I would rather have a sort function that takes a 'key:' argument (and > calls it once per element) than have ASCII-specific comparators. > > List.sort() is difficult because we didn't name the 'compare' argument, but > there are stable sorts in this library that could add the parameter. A sort operation that extracts a comparison key from each element, then sorts by the keys, is a good idea, but it's still not as easy as just using a compare function like this. Writing a "key" function that is compatible with these sortings is non-trivial (doable, but it requires building a new value that is more complex than the original string). https://codereview.chromium.org/1412293008/diff/1/lib/src/comparators.dart File lib/src/comparators.dart (right): https://codereview.chromium.org/1412293008/diff/1/lib/src/comparators.dart#ne... lib/src/comparators.dart:120: /// Compares strings [a] and [b] lexically, with embedded numbers recognized. What I intend is: Leixcally: Dictionary sorting (alphabetical with most weight on first letter). Lexicographically: Shorter words sorted before longer words, words of same length ordered lexically. (Basically, lexical pads the shorter string on the right, lexicograpical pads it on the left, to ensure comparison of sequences with the same length) Wikipedia doesn't agree with me that there is a difference, it uses the words interchangeably, so I might need to find a different word to describe the second ordering. https://codereview.chromium.org/1412293008/diff/1/lib/src/comparators.dart#ne... lib/src/comparators.dart:127: /// the same as numerical ordering. Good idea. https://codereview.chromium.org/1412293008/diff/1/lib/src/comparators.dart#ne... lib/src/comparators.dart:128: int compareWithNumbers(String a, String b) { Yes, the name sucks. Suggestions will be appreciated. "comparStringsContainingNumbers" is't precise either - the function can compare strings without numbers too. It needs to make the point that any embedded numbers are compared "numerically" (which, in practice, is the goal, but it's achieved by using "lexicographical" ordering on the numerals which does sort "3" before "02"). https://codereview.chromium.org/1412293008/diff/1/lib/src/comparators.dart#ne... lib/src/comparators.dart:241: // Characters are not both part , or part I wonder what I meant to say :)
kevmoo@google.com changed reviewers: + kevmoo@google.com
Do we NEED this in the core SDK? Could it be in a package? What's the use case? Also: where's the update to the README?
https://codereview.chromium.org/1412293008/diff/1/lib/src/comparators.dart File lib/src/comparators.dart (right): https://codereview.chromium.org/1412293008/diff/1/lib/src/comparators.dart#ne... lib/src/comparators.dart:16: /// strings with a known structure. Describe how non-ASCII characters are handled. https://codereview.chromium.org/1412293008/diff/1/lib/src/comparators.dart#ne... lib/src/comparators.dart:33: /// Hash code for a string which is compatible with [equalsIgnoreAsciiCase]. Describe how non-ASCII characters are handled.
On 2015/11/10 19:28:45, kevmoo wrote: > Do we NEED this in the core SDK? Could it be in a package? It is already in package:collection. > What's the use case? I see people writing: (a, b) => a.toUpperCase().compareTo(b.toUpperCase()); (It's even in package:quiver as compareIgnoreCase). That does two string conversions per comparison. When sorting 30 elements, that's ~300 new string allocations (30 * log(30) * 2). If the strings are known to be ASCII (like Dart identifiers or synthetic keys) then that's a complete waste. Even more, it converts the entire strings when comparison may only look at the first few characters. In a small benchmark, I got a ~600% speed-up on string sorting. > Also: where's the update to the README? Good point.
On 2015/11/11 06:16:03, Lasse Reichstein Nielsen wrote: > On 2015/11/10 19:28:45, kevmoo wrote: > > Do we NEED this in the core SDK? Could it be in a package? > > It is already in package:collection. I repeat. Why in the SDK? Could we just encourage folks to use pkg/collection? I love the perf upside – I could even imagine a lint that finds the bad pattern – but if it doesn't need to be in the SDK, shouldn't we just point folks to the package? With all of the upside of packages: no wait for an SDK release. Available immediately. Less to maintain in the core SDK...etc etc etc
On 2015/11/11 06:27:19, kevmoo wrote: > On 2015/11/11 06:16:03, Lasse Reichstein Nielsen wrote: > > On 2015/11/10 19:28:45, kevmoo wrote: > > > Do we NEED this in the core SDK? Could it be in a package? > > > > It is already in package:collection. > > I repeat. Why in the SDK? Could we just encourage folks to use pkg/collection? > > I love the perf upside – I could even imagine a lint that finds the bad pattern > – but if it doesn't need to be in the SDK, shouldn't we just point folks to the > package? > > With all of the upside of packages: no wait for an SDK release. Available > immediately. Less to maintain in the core SDK...etc etc etc This is the package.
lrn@google.com changed reviewers: + rnystrom@google.com
lrn@google.com changed reviewers: + floitsch@google.com
PTAL.
Some nits, but LGTM! https://codereview.chromium.org/1412293008/diff/20001/CHANGELOG.md File CHANGELOG.md (right): https://codereview.chromium.org/1412293008/diff/20001/CHANGELOG.md#newcode3 CHANGELOG.md:3: * Add string comparators that ignore ASCII case and sorts numbers numerically. Nit: "sorts" -> "sort". https://codereview.chromium.org/1412293008/diff/20001/lib/src/comparators.dart File lib/src/comparators.dart (right): https://codereview.chromium.org/1412293008/diff/20001/lib/src/comparators.dar... lib/src/comparators.dart:28: if (aChar ^ bChar != 0x20) return false; Document this. This is because upper/lowercase ASCII letters only differ by their sixth bit, right? https://codereview.chromium.org/1412293008/diff/20001/lib/src/comparators.dar... lib/src/comparators.dart:46: if (lowerCaseA <= char && char <= lowerCaseZ) char -= 0x20; Document this. https://codereview.chromium.org/1412293008/diff/20001/lib/src/comparators.dar... lib/src/comparators.dart:53: return 0x1fffffff & (hash + ((0x00003fff & hash) << 15)); It would be good to document a reference for where the hash logic comes from. This is Jenkins hash, right? https://codereview.chromium.org/1412293008/diff/20001/lib/src/comparators.dar... lib/src/comparators.dart:83: if (lowerCaseA <= bChar && bChar <= lowerCaseZ) bUpperCase -= 0x20; You do this in a few places. I wonder if it's worth making a little helper function: int _toAsciiUpper(int char) ... https://codereview.chromium.org/1412293008/diff/20001/lib/src/comparators.dar... lib/src/comparators.dart:93: /// Comparison treats all upper-case ASCII letters as lower-case letter, "letter" -> "letters" https://codereview.chromium.org/1412293008/diff/20001/lib/src/comparators.dar... lib/src/comparators.dart:134: /// (See: [Natural sort order](https://en.wikipedia.org/wiki/Natural_sort_order)) Long line. I'd probably do: /// Compares strings [a] and [b] according to [natural sort ordering][]. /// /// ... /// /// [natural sort ordering]: https://en.wikipedia.org/wiki/Natural_sort_order
https://codereview.chromium.org/1412293008/diff/20001/lib/src/comparators.dart File lib/src/comparators.dart (right): https://codereview.chromium.org/1412293008/diff/20001/lib/src/comparators.dar... lib/src/comparators.dart:26: const int upperCaseA = 0x41; Hoist these common constants to top level library private constants. Naming 0x20 as _asciiCaseBit would 'document' the trick in one place. https://codereview.chromium.org/1412293008/diff/20001/lib/src/comparators.dar... lib/src/comparators.dart:146: /// "a", "a0", "a0b", "a1", "a9", "a01", "a10", "a100", "a100b", "aa" I find it odd that a01 is after a9. This is neither numerical nor plain string ordering. It would feel more 'natural' if different strings with the same numerical value were together (as advertised in the first paragraph). digit strings with the same integer value are resolved with a character-based tie-breaker. This gives order: a a0 a00 a000 a000b a00b a0b a001 a01 a1 sort --version-sort has this ordering. I think you can still do this with 'local' logic starting at the mismatch.
addressed comments, changed to real natural comparison (PTAL new code). https://codereview.chromium.org/1412293008/diff/20001/CHANGELOG.md File CHANGELOG.md (right): https://codereview.chromium.org/1412293008/diff/20001/CHANGELOG.md#newcode3 CHANGELOG.md:3: * Add string comparators that ignore ASCII case and sorts numbers numerically. On 2015/11/13 16:55:36, Bob Nystrom wrote: > Nit: "sorts" -> "sort". Done. https://codereview.chromium.org/1412293008/diff/20001/lib/src/comparators.dart File lib/src/comparators.dart (right): https://codereview.chromium.org/1412293008/diff/20001/lib/src/comparators.dar... lib/src/comparators.dart:26: const int upperCaseA = 0x41; On 2015/11/13 18:25:40, sra1 wrote: > Hoist these common constants to top level library private constants. > > Naming 0x20 as _asciiCaseBit would 'document' the trick in one place. Done. https://codereview.chromium.org/1412293008/diff/20001/lib/src/comparators.dar... lib/src/comparators.dart:28: if (aChar ^ bChar != 0x20) return false; Yes, this is a quick bailout. If the characters differ by anything but that bit, it's not the same character with different case. https://codereview.chromium.org/1412293008/diff/20001/lib/src/comparators.dar... lib/src/comparators.dart:46: if (lowerCaseA <= char && char <= lowerCaseZ) char -= 0x20; On 2015/11/13 16:55:36, Bob Nystrom wrote: > Document this. Done. https://codereview.chromium.org/1412293008/diff/20001/lib/src/comparators.dar... lib/src/comparators.dart:53: return 0x1fffffff & (hash + ((0x00003fff & hash) << 15)); It's Jenkins, stole the code from the actual string hash. https://codereview.chromium.org/1412293008/diff/20001/lib/src/comparators.dar... lib/src/comparators.dart:83: if (lowerCaseA <= bChar && bChar <= lowerCaseZ) bUpperCase -= 0x20; I could, but only if I'm sure it will get inlined. This line should be compilable to 4-5 machine instructions if both aChar and aUpperCase are in registers, and it's in the hot loop, so I don't want to risk unoptimal compilation. https://codereview.chromium.org/1412293008/diff/20001/lib/src/comparators.dar... lib/src/comparators.dart:93: /// Comparison treats all upper-case ASCII letters as lower-case letter, On 2015/11/13 16:55:36, Bob Nystrom wrote: > "letter" -> "letters" Done. https://codereview.chromium.org/1412293008/diff/20001/lib/src/comparators.dar... lib/src/comparators.dart:134: /// (See: [Natural sort order](https://en.wikipedia.org/wiki/Natural_sort_order)) On 2015/11/13 16:55:36, Bob Nystrom wrote: > Long line. I'd probably do: > > /// Compares strings [a] and [b] according to [natural sort ordering][]. > /// > /// ... > /// > /// [natural sort ordering]: https://en.wikipedia.org/wiki/Natural_sort_order Done. https://codereview.chromium.org/1412293008/diff/20001/lib/src/comparators.dar... lib/src/comparators.dart:146: /// "a", "a0", "a0b", "a1", "a9", "a01", "a10", "a100", "a100b", "aa" Good point. It's simpler to explain "numerical ordering, length as tie-breaker" than the other way around. I'll do that. It's slightly harder to implement (more special cases), but it's not that bad.
LGTM. https://codereview.chromium.org/1412293008/diff/40001/lib/src/comparators.dart File lib/src/comparators.dart (right): https://codereview.chromium.org/1412293008/diff/40001/lib/src/comparators.dar... lib/src/comparators.dart:125: int compareAsciiLowerCase(String a, String b) { How is this one different from compareAsciiUpperCase? https://codereview.chromium.org/1412293008/diff/40001/lib/src/comparators.dar... lib/src/comparators.dart:154: /// ordering, where leixcal ordering would put the `1` before the `7`, ignoring lexical https://codereview.chromium.org/1412293008/diff/40001/lib/src/comparators.dar... lib/src/comparators.dart:227: int compareAsciiUpperCaseNatural(String a, String b) { How is this one different from compareAsciiLowerCaseNatural. https://codereview.chromium.org/1412293008/diff/40001/test/comparators_test.dart File test/comparators_test.dart (right): https://codereview.chromium.org/1412293008/diff/40001/test/comparators_test.d... test/comparators_test.dart:32: "A01A", Add test similar to: A3005x A30004x
lgtm https://codereview.chromium.org/1412293008/diff/40001/lib/src/comparators.dart File lib/src/comparators.dart (right): https://codereview.chromium.org/1412293008/diff/40001/lib/src/comparators.dar... lib/src/comparators.dart:53: // Jenkins hash code ( http://en.wikipedia.org/wiki/Jenkins_hash_function). Jenkins hash code modified to operate on Smi values. https://codereview.chromium.org/1412293008/diff/40001/lib/src/comparators.dar... lib/src/comparators.dart:166: if (i == b.length) return 1; Use if (i >= b.length) to improve constraint-based bounds checking. https://codereview.chromium.org/1412293008/diff/40001/lib/src/comparators.dar... lib/src/comparators.dart:391: bool _isNumberSuffix(String string, int index) { maybe call it _isNonZeroNumberSuffix https://codereview.chromium.org/1412293008/diff/40001/test/comparators_test.dart File test/comparators_test.dart (right): https://codereview.chromium.org/1412293008/diff/40001/test/comparators_test.d... test/comparators_test.dart:91: sortedBy((a, b) => replaceNumbers(a).compareTo(replaceNumbers(b)))); I would feel more comfortable about natural order if the list was explicit. Here we are comparing two algorithms which could be broken in the same way.
https://codereview.chromium.org/1412293008/diff/40001/lib/src/comparators.dart File lib/src/comparators.dart (right): https://codereview.chromium.org/1412293008/diff/40001/lib/src/comparators.dar... lib/src/comparators.dart:10: const int _upperCaseZ = 0x5d; Whoops, that should be 0x5a, and 0x7a below. Will add to test! https://codereview.chromium.org/1412293008/diff/40001/lib/src/comparators.dar... lib/src/comparators.dart:53: // Jenkins hash code ( http://en.wikipedia.org/wiki/Jenkins_hash_function). Done. https://codereview.chromium.org/1412293008/diff/40001/lib/src/comparators.dar... lib/src/comparators.dart:125: int compareAsciiLowerCase(String a, String b) { Only in how it orders, e.g., "a" vs. "_". Since "_" is between "A" and "a", it will come before "a" when you do lower-case comparison and after "a" when you do upper-case comparison. So, basically, it differs in the ordering of only six characters when they are compared to letters (0x5b-0x61). You won't need both, but existing code is likely to have used either (a, b) => a.toUpperCase().compareTo(b.toUpperCase()) or (a, b) => a.toLowerCase().compareTo(b.toLowerCase()) , and will want to replace it with an equivalent version from here without having to change tests. https://codereview.chromium.org/1412293008/diff/40001/lib/src/comparators.dar... lib/src/comparators.dart:166: if (i == b.length) return 1; Done. https://codereview.chromium.org/1412293008/diff/40001/lib/src/comparators.dar... lib/src/comparators.dart:227: int compareAsciiUpperCaseNatural(String a, String b) { Same as above, differs on six characters that lie between upper case and lower case letters. https://codereview.chromium.org/1412293008/diff/40001/lib/src/comparators.dar... lib/src/comparators.dart:391: bool _isNumberSuffix(String string, int index) { Done.
Message was sent while issue was closed.
Committed patchset #4 (id:60001) manually as c94b8dc745ef332270bece64daab7f545bf0452d (presubmit successful).
Message was sent while issue was closed.
Was going to publish this – but noticed 4 failures when running on my mac.
Message was sent while issue was closed.
My bad. Changed the limit to exercise the test failures before committing, forgot to save when I changed them back. |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
