|
|
Created:
5 years, 2 months ago by nweiz Modified:
5 years, 2 months ago CC:
reviews_dartlang.org Base URL:
git@github.com:dart-lang/convert.git@master Target Ref:
refs/heads/master Visibility:
Public. |
DescriptionAdd a hexadecimal codec.
R=lrn@google.com
Committed: https://github.com/dart-lang/convert/commit/bc95e00ad577abac3fceb63b25dd619abc36d3de
Patch Set 1 #
Total comments: 62
Patch Set 2 : Code review changes #
Total comments: 26
Messages
Total messages: 14 (2 generated)
Cool converter. LGTM with comments (and suggestions! :) https://codereview.chromium.org/1364613002/diff/1/lib/src/hex.dart File lib/src/hex.dart (right): https://codereview.chromium.org/1364613002/diff/1/lib/src/hex.dart#newcode22 lib/src/hex.dart:22: final HexEncoder encoder = hexEncoder; I'd make these getters that return the encode/decoder instead of instance fields. https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart File lib/src/hex/decoder.dart (right): https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne... lib/src/hex/decoder.dart:17: /// Because two hexadecimal digits correspond to a single byte, this will throw Maybe mention that this is RFC 4648 base-16 encoding. https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne... lib/src/hex/decoder.dart:26: "was odd."); Add ", string, string.length" as arguments to FormatException. Maybe reduce text to "Invalid input length, must be even". https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne... lib/src/hex/decoder.dart:30: _decode(string.runes, bytes, 0); Use string.codeUnits instead. It's more efficient and any >255 value is an error in either case. https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne... lib/src/hex/decoder.dart:38: /// A sink for chunked hexadecimal decoding. A conversion sink for ... ("A sink" is too generic, and it's not the Sink class). https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne... lib/src/hex/decoder.dart:52: void addSlice(String string, int start, int end, bool isLast) { Maybe use: end = RangeError.checkValidRange(start, end, string.length); It also checks that start >= 0 and end <= string.length. This allows end to be null. If that's not OK, check for that first (and drop the assignment). https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne... lib/src/hex/decoder.dart:60: if (_lastDigit == null) { dart2js would probably prefer if the variable always contains integers, so maybe use -1 instead of null to represent "no carry". https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne... lib/src/hex/decoder.dart:66: bytes[0] = _lastDigit + _digitForCodeUnit(string.codeUnitAt(start - 1)); Slightly confusing to increment start first and then subtract one here. Maybe do it as: var firstChar = string.codeUnitAt(start); start++; bytes = ...; bytes[0] = _lastDigit + _digitsForCodeUnit(firstChar); or bytes = new Uint8List((end - start + 1) ~/ 2); bytes[0] = _lastDigit + _digitForCodeUnit(string.codeUnitAt(start - 1)); start++; https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne... lib/src/hex/decoder.dart:70: var runes = string.runes.skip(start).take(end); Should have be either .take(end).skip(start) or .skip(start).take(end - start). We should have .getRange(start, end) on Iterable that does this for you. But avoid doing this at all and just work directly on the codeUnits, not the runes. If you really want to report invalid code points (aka runes), you only need to do it in the error handling, you can still parse the code units because the first surrogate will be invalid anyway. https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne... lib/src/hex/decoder.dart:80: "was odd."); Same message as above (if you changed it there). It's sad that we don't have an input/index to report here, but it's kind of unavoidable. https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne... lib/src/hex/decoder.dart:84: } Consider extending with asUtf8Sink because you can still use _decode for utf-8 input. https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne... lib/src/hex/decoder.dart:90: int _decode(Iterable<int> runes, List<int> destination, int start) { Make it: _decode(List<int> codeUnits, int start, int end, List<int> destination, int destIndex) Then you don't have to wrap the argument in skip/take, and you can index it as a list instead of using it only as an iterator. https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne... lib/src/hex/decoder.dart:92: for (var i = start; true; i++) { remove the "true". An omitted condition in a for statement is true by default, and parts that are not used are traditionally omitted. Or just use a while(true) loop and increment i as destination[i++]=... below. https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne... lib/src/hex/decoder.dart:112: Consider optimizing this to: int digit = codeUnit ^ $0; if (codeUnit <= 9) { // Only one branch for 10 of 16 cases. return digit; } int ucLetter = codeUnit & ~0x20; // Upper-case if letter. if (ucLetter <= $Z && ucLetter >= $A) { return ucLetter - ($A - 10); } throw new FormatException(...) It would be great if the FormatException could contain the input string and index, but it's harder when you don't pass the string and index down here. Returning -1 for error is possible, but it introduces an extra check in the inner loop (which can be delayed if necessary). https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne... lib/src/hex/decoder.dart:114: "U+${codeUnit.toRadixString(16)}."); code point -> code unit add: .padLeft(4, '0') so the hex is always four digits. https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/encoder.dart File lib/src/hex/encoder.dart (right): https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/encoder.dart#ne... lib/src/hex/encoder.dart:22: var buffer = new StringBuffer(); Use an Uint8List instead of a StringBuffer. StringBuffer isn't as efficient as you would want when you know that you are only dealing with code units. Maybe: var buffer = new Uint8List(bytes.length * 2); var index = 0; var byteBits = 0; // bitwise or of all bytes in `bytes`. for (var byte in bytes) { byteBits |= byte; buffer[index++] = _codeUnitForDigit(byte ~/ 16); buffer[index++] = _codeUnitForDigit(byte % 16); } if (byteBits >= 0 && byteBits <= 255) { return new String.fromCharCodes(buffer); } // there is an invalid byte in bytes, find it and report the error. // This moves the range-check branch outside of the inner loop. for (int i = 0; i < bytes.length; i++) { var byte = bytes[i]; if (byte != byte.toUnsigned(8)) { throw new FormatException("Not a byte value (0..255)", input, i); } } https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/encoder.dart#ne... lib/src/hex/encoder.dart:33: int _codeUnitForDigit(int digit) => digit < 10 ? digit + $0 : digit + $a - 10; Maybe just use "0123456789abcdef".codeUnitAt(digit); It's only used in two places, so you can even inline it (and maybe make the string a named constant _hexDigits). https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/encoder.dart#ne... lib/src/hex/encoder.dart:48: } Consider implementing the addSlice function (and have a _convert(List<int> bytes, int start, int end) helper function that the public `convert` method uses as well. https://codereview.chromium.org/1364613002/diff/1/test/hex_test.dart File test/hex_test.dart (right): https://codereview.chromium.org/1364613002/diff/1/test/hex_test.dart#newcode24 test/hex_test.dart:24: expect(results, equals(["1ab23cd4"])); Don't expect that the input is immediately output. Another implementation would be to have a buffer in the converter, and accumulate bytes until the buffer is full or the sink is closed. That would break this test even if it's a perfectly valid implementation. Only test the final result after calling close. https://codereview.chromium.org/1364613002/diff/1/test/hex_test.dart#newcode27 test/hex_test.dart:27: expect(results, equals(["1ab23cd4", "0001feff"])); Test adding an empty list and a single-element list (possible edge cases). Maybe even test the empty list first, between non-empty ones and last. https://codereview.chromium.org/1364613002/diff/1/test/hex_test.dart#newcode46 test/hex_test.dart:46: expect(hex.decode("1Ab23cD4"), equals([0x1a, 0xb2, 0x3c, 0xd4])); Check all of them. No reason not to: "0123456789ABCDEFabcdef" https://codereview.chromium.org/1364613002/diff/1/test/hex_test.dart#newcode54 test/hex_test.dart:54: var controller = new StreamController(sync: true); Hmmm. Not really a proper use of a sync stream controller, but I can see that it's an easy way to get a Sink. https://codereview.chromium.org/1364613002/diff/1/test/hex_test.dart#newcode58 test/hex_test.dart:58: Again, don't compare intermediate results, only check the final result after calling close, and don't expect a particular split for the output lists (basically, flatten the list of result lists before comparing to it). https://codereview.chromium.org/1364613002/diff/1/test/hex_test.dart#newcode64 test/hex_test.dart:64: expect(results, equals([[0x1a, 0xb2, 0x3c, 0xd4], [0x00, 0x01, 0xfe, 0xff]])); long line. https://codereview.chromium.org/1364613002/diff/1/test/hex_test.dart#newcode80 test/hex_test.dart:80: }); Also check empty lists. In similar cases I have made a parameterized test that tried all possible splits of a list into 1-4 parts, and checks that the final result is the same in either case. That also tests empty lists. https://codereview.chromium.org/1364613002/diff/1/test/hex_test.dart#newcode82 test/hex_test.dart:82: test("rejects odd characters detected in close()", () { odd characters -> odd length Ditto below. https://codereview.chromium.org/1364613002/diff/1/test/hex_test.dart#newcode92 test/hex_test.dart:92: expect(() => sink.addSlice("1ab23cd", 5, 7, true), throwsFormatException); Long line. https://codereview.chromium.org/1364613002/diff/1/test/hex_test.dart#newcode97 test/hex_test.dart:97: expect(() => hex.decode("az"), throwsFormatException); Also check the characters: "/" ":" "@" "[" "`" "{" "\x00" "\u0141" "\u{10041}" (characters right next to 0/9, A/Z, a/z, plus NUL and values that become valid if masked incorrectly).
Code review changes
https://codereview.chromium.org/1364613002/diff/1/lib/src/hex.dart File lib/src/hex.dart (right): https://codereview.chromium.org/1364613002/diff/1/lib/src/hex.dart#newcode22 lib/src/hex.dart:22: final HexEncoder encoder = hexEncoder; On 2015/09/23 08:32:03, Lasse Reichstein Nielsen wrote: > I'd make these getters that return the encode/decoder instead of instance > fields. Done. https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart File lib/src/hex/decoder.dart (right): https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne... lib/src/hex/decoder.dart:17: /// Because two hexadecimal digits correspond to a single byte, this will throw On 2015/09/23 08:32:04, Lasse Reichstein Nielsen wrote: > Maybe mention that this is RFC 4648 base-16 encoding. Done, in the HexCodec documentation. https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne... lib/src/hex/decoder.dart:26: "was odd."); On 2015/09/23 08:32:04, Lasse Reichstein Nielsen wrote: > Add ", string, string.length" as arguments to FormatException. > Maybe reduce text to "Invalid input length, must be even". Done. https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne... lib/src/hex/decoder.dart:30: _decode(string.runes, bytes, 0); On 2015/09/23 08:32:04, Lasse Reichstein Nielsen wrote: > Use string.codeUnits instead. It's more efficient and any >255 value is an error > in either case. > Done. https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne... lib/src/hex/decoder.dart:38: /// A sink for chunked hexadecimal decoding. On 2015/09/23 08:32:04, Lasse Reichstein Nielsen wrote: > A conversion sink for ... > ("A sink" is too generic, and it's not the Sink class). Done. https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne... lib/src/hex/decoder.dart:52: void addSlice(String string, int start, int end, bool isLast) { On 2015/09/23 08:32:03, Lasse Reichstein Nielsen wrote: > Maybe use: > end = RangeError.checkValidRange(start, end, string.length); > It also checks that start >= 0 and end <= string.length. > > This allows end to be null. If that's not OK, check for that first (and drop the > assignment). Done, although I'm not checking for null explicitly because that's not the usual style. https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne... lib/src/hex/decoder.dart:60: if (_lastDigit == null) { On 2015/09/23 08:32:03, Lasse Reichstein Nielsen wrote: > dart2js would probably prefer if the variable always contains integers, so maybe > use -1 instead of null to represent "no carry". I really don't like using magic numbers to indicate "no value" when we have a null value that's explicitly for that case and fails more gracefully if it's mis-used. Unless we have concrete data that "null" makes the code size or speed substantially worse, I'd prefer to leave it in. Also, it's at least plausible that dart2js can figure out that _lastDigit's only use is guarded by an if statement that guarantees it won't be null. https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne... lib/src/hex/decoder.dart:66: bytes[0] = _lastDigit + _digitForCodeUnit(string.codeUnitAt(start - 1)); On 2015/09/23 08:32:04, Lasse Reichstein Nielsen wrote: > Slightly confusing to increment start first and then subtract one here. > Maybe do it as: > var firstChar = string.codeUnitAt(start); > start++; > bytes = ...; > bytes[0] = _lastDigit + _digitsForCodeUnit(firstChar); > or > bytes = new Uint8List((end - start + 1) ~/ 2); > bytes[0] = _lastDigit + _digitForCodeUnit(string.codeUnitAt(start - 1)); > start++; Done. https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne... lib/src/hex/decoder.dart:70: var runes = string.runes.skip(start).take(end); On 2015/09/23 08:32:03, Lasse Reichstein Nielsen wrote: > Should have be either .take(end).skip(start) or .skip(start).take(end - start). > We should have .getRange(start, end) on Iterable that does this for you. > > But avoid doing this at all and just work directly on the codeUnits, not the > runes. > If you really want to report invalid code points (aka runes), you only need to > do it in the error handling, you can still parse the code units because the > first surrogate will be invalid anyway. Changed to use codeUnits, although I'm still using take/skip. to get at the correct sublist. https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne... lib/src/hex/decoder.dart:80: "was odd."); On 2015/09/23 08:32:03, Lasse Reichstein Nielsen wrote: > Same message as above (if you changed it there). > It's sad that we don't have an input/index to report here, but it's kind of > unavoidable. Done. https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne... lib/src/hex/decoder.dart:84: } On 2015/09/23 08:32:04, Lasse Reichstein Nielsen wrote: > Consider extending with asUtf8Sink because you can still use _decode for utf-8 > input. Done. https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne... lib/src/hex/decoder.dart:90: int _decode(Iterable<int> runes, List<int> destination, int start) { On 2015/09/23 08:32:03, Lasse Reichstein Nielsen wrote: > Make it: _decode(List<int> codeUnits, int start, int end, List<int> destination, > int destIndex) > Then you don't have to wrap the argument in skip/take, and you can index it as a > list instead of using it only as an iterator. Is that going to be substantially more efficient? It makes the signature more complex and changes a lot of the implementation to be math-y instead of using semantic method calls. https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne... lib/src/hex/decoder.dart:92: for (var i = start; true; i++) { On 2015/09/23 08:32:04, Lasse Reichstein Nielsen wrote: > remove the "true". An omitted condition in a for statement is true by default, > and parts that are not used are traditionally omitted. > > Or just use a while(true) loop and increment i as destination[i++]=... below. Done. https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne... lib/src/hex/decoder.dart:112: On 2015/09/23 08:32:04, Lasse Reichstein Nielsen wrote: > Consider optimizing this to: > > int digit = codeUnit ^ $0; > > if (codeUnit <= 9) { // Only one branch for 10 of 16 cases. > return digit; > } > int ucLetter = codeUnit & ~0x20; // Upper-case if letter. > if (ucLetter <= $Z && ucLetter >= $A) { > return ucLetter - ($A - 10); > } > throw new FormatException(...) If this is worth heavily optimizing, would it be better to just use a switch statement? Couldn't that be compiled to a single jump or something clever like that? Also, it looks like your code won't catch negative code units, which could come from a misuse of the UTF-8 converter. > It would be great if the FormatException could contain the input string and > index, but it's harder when you don't pass the string and index down here. > Returning -1 for error is possible, but it introduces an extra check in the > inner loop (which can be delayed if necessary). Passing down the string would add a lot of complexity with the UTF-8 decoder, and passing back up an error value would add complexity regardless. One somewhat cleaner possibility would be to just pass in the index, and then have a try/catch that rethrows the exception with the original source in whatever format. https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne... lib/src/hex/decoder.dart:114: "U+${codeUnit.toRadixString(16)}."); On 2015/09/23 08:32:04, Lasse Reichstein Nielsen wrote: > code point -> code unit Done. > add: .padLeft(4, '0') so the hex is always four digits. Done. https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/encoder.dart File lib/src/hex/encoder.dart (right): https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/encoder.dart#ne... lib/src/hex/encoder.dart:22: var buffer = new StringBuffer(); On 2015/09/23 08:32:04, Lasse Reichstein Nielsen wrote: > Use an Uint8List instead of a StringBuffer. StringBuffer isn't as efficient as > you would want when you know that you are only dealing with code units. > > Maybe: > > var buffer = new Uint8List(bytes.length * 2); > var index = 0; > var byteBits = 0; // bitwise or of all bytes in `bytes`. > for (var byte in bytes) { > byteBits |= byte; > buffer[index++] = _codeUnitForDigit(byte ~/ 16); > buffer[index++] = _codeUnitForDigit(byte % 16); > } > if (byteBits >= 0 && byteBits <= 255) { > return new String.fromCharCodes(buffer); > } > // there is an invalid byte in bytes, find it and report the error. > // This moves the range-check branch outside of the inner loop. > for (int i = 0; i < bytes.length; i++) { > var byte = bytes[i]; > if (byte != byte.toUnsigned(8)) { > throw new FormatException("Not a byte value (0..255)", input, i); > } > } Done. https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/encoder.dart#ne... lib/src/hex/encoder.dart:33: int _codeUnitForDigit(int digit) => digit < 10 ? digit + $0 : digit + $a - 10; On 2015/09/23 08:32:04, Lasse Reichstein Nielsen wrote: > Maybe just use "0123456789abcdef".codeUnitAt(digit); Will a memory read really be better than simple arithmetic? > It's only used in two places, so you can even inline it (and maybe make the > string a named constant _hexDigits). I think it's nice to leave it as a named function. It parallels _digitForCodeUnit in the decoder, and it makes it easy to read and swap out the implementation (like we might do in this code review). https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/encoder.dart#ne... lib/src/hex/encoder.dart:48: } On 2015/09/23 08:32:04, Lasse Reichstein Nielsen wrote: > Consider implementing the addSlice function (and have a _convert(List<int> > bytes, int start, int end) helper function that the public `convert` method uses > as well. Done. https://codereview.chromium.org/1364613002/diff/1/test/hex_test.dart File test/hex_test.dart (right): https://codereview.chromium.org/1364613002/diff/1/test/hex_test.dart#newcode24 test/hex_test.dart:24: expect(results, equals(["1ab23cd4"])); On 2015/09/23 08:32:04, Lasse Reichstein Nielsen wrote: > Don't expect that the input is immediately output. > > Another implementation would be to have a buffer in the converter, and > accumulate bytes until the buffer is full or the sink is closed. That would > break this test even if it's a perfectly valid implementation. > > Only test the final result after calling close. That would be a valid implementation, but these aren't intended to be wholly black-box tests. Even if it's not a guaranteed part of the external API, the intended behavior of the current implementation is to pass through data as it comes, and that can have performance implications. I'd rather have to re-implement the test if the behavior changes than not test this behavior. https://codereview.chromium.org/1364613002/diff/1/test/hex_test.dart#newcode27 test/hex_test.dart:27: expect(results, equals(["1ab23cd4", "0001feff"])); On 2015/09/23 08:32:05, Lasse Reichstein Nielsen wrote: > Test adding an empty list and a single-element list (possible edge cases). > > Maybe even test the empty list first, between non-empty ones and last. Done. https://codereview.chromium.org/1364613002/diff/1/test/hex_test.dart#newcode46 test/hex_test.dart:46: expect(hex.decode("1Ab23cD4"), equals([0x1a, 0xb2, 0x3c, 0xd4])); On 2015/09/23 08:32:04, Lasse Reichstein Nielsen wrote: > Check all of them. No reason not to: "0123456789ABCDEFabcdef" Done. https://codereview.chromium.org/1364613002/diff/1/test/hex_test.dart#newcode64 test/hex_test.dart:64: expect(results, equals([[0x1a, 0xb2, 0x3c, 0xd4], [0x00, 0x01, 0xfe, 0xff]])); On 2015/09/23 08:32:04, Lasse Reichstein Nielsen wrote: > long line. Done. https://codereview.chromium.org/1364613002/diff/1/test/hex_test.dart#newcode80 test/hex_test.dart:80: }); On 2015/09/23 08:32:04, Lasse Reichstein Nielsen wrote: > Also check empty lists. Done. > In similar cases I have made a parameterized test that tried all possible splits > of a list into 1-4 parts, and checks that the final result is the same in either > case. That also tests empty lists. I generally prefer to have a few explicit tests that we believe will cover all relevant cases. https://codereview.chromium.org/1364613002/diff/1/test/hex_test.dart#newcode82 test/hex_test.dart:82: test("rejects odd characters detected in close()", () { On 2015/09/23 08:32:04, Lasse Reichstein Nielsen wrote: > odd characters -> odd length > Ditto below. Done. https://codereview.chromium.org/1364613002/diff/1/test/hex_test.dart#newcode92 test/hex_test.dart:92: expect(() => sink.addSlice("1ab23cd", 5, 7, true), throwsFormatException); On 2015/09/23 08:32:04, Lasse Reichstein Nielsen wrote: > Long line. Done. https://codereview.chromium.org/1364613002/diff/1/test/hex_test.dart#newcode97 test/hex_test.dart:97: expect(() => hex.decode("az"), throwsFormatException); On 2015/09/23 08:32:04, Lasse Reichstein Nielsen wrote: > Also check the characters: > "/" > ":" > "@" > "[" > "`" > "{" > "\x00" > "\u0141" > "\u{10041}" > (characters right next to 0/9, A/Z, a/z, plus NUL and values that become valid > if masked incorrectly). Done, although I replaced "{" and "[" with "g" and "G" since those are the limits of valid hex.
(Holding off on submitting for another round of comments)
lgtm https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart File lib/src/hex/decoder.dart (right): https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne... lib/src/hex/decoder.dart:60: if (_lastDigit == null) { True. This is definitely speculative, so keep it as is. https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne... lib/src/hex/decoder.dart:90: int _decode(Iterable<int> runes, List<int> destination, int start) { On 2015/09/23 22:04:21, nweiz wrote: > On 2015/09/23 08:32:03, Lasse Reichstein Nielsen wrote: > > Make it: _decode(List<int> codeUnits, int start, int end, List<int> > destination, > > int destIndex) > > Then you don't have to wrap the argument in skip/take, and you can index it as > a > > list instead of using it only as an iterator. > > Is that going to be substantially more efficient? Most likely yes. This is the inner loop of the transformation. The more code you put in here, the slower everything gets. An Iterable, in this case a SubListIterable on top of a CodeUnits list on top of a String, will have more overhead than doing .codeUnitAt(index) directly. I don't think any of our compilers are smart enough to avoid it. > It makes the signature more > complex and changes a lot of the implementation to be math-y instead of using > semantic method calls. You say "math-y" as if it's a bad thing :) And it's very much in line with most other methods that take a string in taking start,end too. It's a bad design that requires us to do it - we should have an efficient slice operator on String that returns a CharSequence (not a string, but similar interface) so we didn't have to pass around "string, start, end" manually, but alas, we didn't do that. It's not like the code is unreadable either: int _decode(List<int> input, int start, int end, List<int> output, int outIndex) { for (int i = start; i + 1 < end; i++) { var firstDigit = _digitForCodeUnit(input, i); var secondDigit = _digitForCodeUnit(input, i + 1); output[outIndex++] = firstDigit * 16 + secondDigit; } int length = end - start; if (length.isOdd) return _digitForCodeUnit(input, end - 1) * 16; return null; } int _digitForCodeUnit(List<int> codeUnits, int index) { int codeUnit = codeUnits[index]; ... decode as usual ... throw new FormatException("Not hexadecimal digit", new String.fromCodeUnits(codeUnits), index); } ... and this even makes the input and index available at the point of error. It would be even easier if you just passed the string around and did .codeUnitAt, but then you need different code for the asUtf8Sink sink. https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne... lib/src/hex/decoder.dart:112: On 2015/09/23 22:04:21, nweiz wrote: > If this is worth heavily optimizing, It's executed twice in the inner loop of the transformation, so it's a good idea to keep it small and reduce the number of unpredictable branches. I'd use either this or a table. Of course, having a benchmark would be even better :) > would it be better to just use a switch > statement? Couldn't that be compiled to a single jump or something clever like > that? It could, but it likely isn't. I wouldn't trust switch statements, and in this case (where there is no special code for any case) a simple table would be a more direct way to get get the result with a single indirection. And it avoids branches. You will need a branch to ensure that the input is in the range of the table, but that should be a predictable branch (always true, otherwise we have an error anyway). > Also, it looks like your code won't catch negative code units, which could come > from a misuse of the UTF-8 converter. That is true. For that you will need the >= 0 check: int digit = codeUnit ^ $0; if (digit <= 9) { if (digit >= 0) return digit; // Predictable branch. } else { int letter = codeUnit & ~0x20; if ($A <= letter && letter <= $Z) return letter - ($A - 10); } throw ... That check is predictable for correct input, so it shouldn't cost. Really, the only unpredictable branch is the first "<= 9". > > > It would be great if the FormatException could contain the input string and > > index, but it's harder when you don't pass the string and index down here. > > Returning -1 for error is possible, but it introduces an extra check in the > > inner loop (which can be delayed if necessary). > > Passing down the string would add a lot of complexity with the UTF-8 decoder, > and passing back up an error value would add complexity regardless. One somewhat > cleaner possibility would be to just pass in the index, and then have a > try/catch that rethrows the exception with the original source in whatever > format. Or, if you pass around a list and an index, you can do the index operation inside the _digitForCodeUnit function. https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/encoder.dart File lib/src/hex/encoder.dart (right): https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/encoder.dart#ne... lib/src/hex/encoder.dart:33: int _codeUnitForDigit(int digit) => digit < 10 ? digit + $0 : digit + $a - 10; On 2015/09/23 22:04:21, nweiz wrote: > On 2015/09/23 08:32:04, Lasse Reichstein Nielsen wrote: > > Maybe just use "0123456789abcdef".codeUnitAt(digit); > > Will a memory read really be better than simple arithmetic? It's not simple arithmetic - there is an unpredictable branch in it. It's only one branch, so I *guess* it'll be around the same time as a memory fetch from a small constant piece of memory. However, if the compiler can't determine that the argument is always in range of the string, the string lookup will add a range check too. I'd probably write it as: "0123456789ABCDEF".codeUnitAt(digit & 0xF) to help with that. And then benchmark! It's the only way to be sure. > > It's only used in two places, so you can even inline it (and maybe make the > > string a named constant _hexDigits). > > I think it's nice to leave it as a named function. It parallels > _digitForCodeUnit in the decoder, and it makes it easy to read and swap out the > implementation (like we might do in this code review). It's probably fine as-is. https://codereview.chromium.org/1364613002/diff/1/test/hex_test.dart File test/hex_test.dart (right): https://codereview.chromium.org/1364613002/diff/1/test/hex_test.dart#newcode97 test/hex_test.dart:97: expect(() => hex.decode("az"), throwsFormatException); Does it show that I'm working on BASE64 :) https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/decoder.dart File lib/src/hex/decoder.dart (right): https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/decoder.dar... lib/src/hex/decoder.dart:53: RangeError.checkValidRange(start, end, string.length); Maybe do end = RangeError.checkValidRange(...); Then a "null" end will be treated as string.length instead of failing somewhere below. https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/decoder.dar... lib/src/hex/decoder.dart:68: var codeUnits = string.codeUnits.take(end).skip(start); I want to add getRange(start, end) to Iterable, but they won't let me add more to Iterable any more :( https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/decoder.dar... lib/src/hex/decoder.dart:105: You need to handle the case where start==end somehow. Probably just wrap everything until if(isLast) in if (start < end) {...} https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/decoder.dar... lib/src/hex/decoder.dart:112: var hexPairs = (end - start - 1) ~/ 2; Maybe declare var length = end - start; Consider writing it as: var hexPairs = (length + 1) ~/ 2; bytes = new Uint8List(hexPairs); instead of needing the 1+ below. https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/decoder.dar... lib/src/hex/decoder.dart:120: _lastDigit = _decode(codeUnits, bytes, bytesStart); I still think using a list+start+end is more efficient than in iterator (and personally I think it's more readable too - it *is* a list, and reducing it to an iterator is just throwing away useful information). https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/encoder.dart File lib/src/hex/encoder.dart (right): https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/encoder.dar... lib/src/hex/encoder.dart:52: // we're only emitting ASCII-compatible characters. , and that we know the length ahead of time. https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/encoder.dar... lib/src/hex/encoder.dart:77: throw 'unreachable'; Hmm! I can see why the "throw 'unreachable' is needed, but it's still ugly. Maybe it can be rewritten as: int i = start; while (i < end) { var byte = bytes[i]; if (byte < 0 || byte > 255) break; i++; } assert(i < end); throw new FormatException(..., bytes, i); https://codereview.chromium.org/1364613002/diff/20001/test/hex_test.dart File test/hex_test.dart (right): https://codereview.chromium.org/1364613002/diff/20001/test/hex_test.dart#newc... test/hex_test.dart:137: expect(() => sink.add("a$char"), throwsFormatException); Also try them as first digit and as the single digit in a chunked conversion. https://codereview.chromium.org/1364613002/diff/20001/test/hex_test.dart#newc... test/hex_test.dart:142: test("rejects odd characters detected in convert()", () { odd characters -> odd length
https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/decoder.dart File lib/src/hex/decoder.dart (right): https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/decoder.dar... lib/src/hex/decoder.dart:68: var codeUnits = string.codeUnits.take(end).skip(start); However, since codeUnits is a list, you *can* write: string.codeUnits.getRange(start, end)
kevmoo@google.com changed reviewers: + kevmoo@google.com
DBC: rebase against what's on github? They seem a bit out of sync.
sra@google.com changed reviewers: + sra@google.com
DBC. TL;DR: pass the codeUnits all the way through to the _decode loop. https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex.dart File lib/src/hex.dart (right): https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex.dart#newcode12 lib/src/hex.dart:12: export 'hex/encoder.dart' hide hexEncoder; Why not 'show', so by reading this file I can tell what the library exports? https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/decoder.dart File lib/src/hex/decoder.dart (right): https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/decoder.dar... lib/src/hex/decoder.dart:68: var codeUnits = string.codeUnits.take(end).skip(start); On 2015/09/24 11:37:39, Lasse Reichstein Nielsen wrote: > However, since codeUnits is a list, you *can* write: > string.codeUnits.getRange(start, end) On dart2js it is going to be very much better to pass the (codeUnits,start,end) to _decode and use indexing in _decode. https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/decoder.dar... lib/src/hex/decoder.dart:143: var firstDigit = _digitForCodeUnit(iterator.current) * 16; It really helps dart2js to have the constant first. That looks nasty for expressions like `0 > n` but reads fine for `16 * _digitForCodeUnit(iterator.current)`. https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/decoder.dar... lib/src/hex/decoder.dart:146: var secondDigit = _digitForCodeUnit(iterator.current); dart2js thinks iterator.current can be null (because all iterators have a null .current at the end) so will generate type checks in the comparison arithmetic below. It is even worse, though. Since the iterable is a skip/take, which is used from many other places, dart2js will be confused about the type of .current and will believe _digitForCodeUnit is called with anything. The result is code for 'codeUnit >= $0' that would work for any type (codeUnit might be a Duration!). The cost of the dynamic dispatch to do this will dwarf the concerns about the number of correctly predicted branches. https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/decoder.dar... lib/src/hex/decoder.dart:148: destination[i] = firstDigit + secondDigit; nit: it feels wrong to call both of these xxxDigit when one is scaled and the other is not. https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/encoder.dart File lib/src/hex/encoder.dart (right): https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/encoder.dar... lib/src/hex/encoder.dart:64: buffer[bufferIndex++] = _codeUnitForDigit(byte % 16); consider: buffer[bufferIndex++] = _codeUnitForDigit((byte & 0xF0) >> 4); buffer[bufferIndex++] = _codeUnitForDigit(byte & 0x0F); This will be much faster in dart2js, where we can't really optimize div and mod because byte could be negative.
Message was sent while issue was closed.
Committed patchset #2 (id:20001) manually as bc95e00ad577abac3fceb63b25dd619abc36d3de (presubmit successful).
Message was sent while issue was closed.
https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart File lib/src/hex/decoder.dart (right): https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne... lib/src/hex/decoder.dart:90: int _decode(Iterable<int> runes, List<int> destination, int start) { On 2015/09/24 08:07:01, Lasse Reichstein Nielsen wrote: > On 2015/09/23 22:04:21, nweiz wrote: > > On 2015/09/23 08:32:03, Lasse Reichstein Nielsen wrote: > > > Make it: _decode(List<int> codeUnits, int start, int end, List<int> > > destination, > > > int destIndex) > > > Then you don't have to wrap the argument in skip/take, and you can index it > as > > a > > > list instead of using it only as an iterator. > > > > Is that going to be substantially more efficient? > > Most likely yes. > This is the inner loop of the transformation. The more code you put in here, the > slower everything gets. An Iterable, in this case a SubListIterable on top of a > CodeUnits list on top of a String, will have more overhead than doing > .codeUnitAt(index) directly. I don't think any of our compilers are smart enough > to avoid it. > > > It makes the signature more > > complex and changes a lot of the implementation to be math-y instead of using > > semantic method calls. > > You say "math-y" as if it's a bad thing :) > > And it's very much in line with most other methods that take a string in taking > start,end too. > It's a bad design that requires us to do it - we should have an efficient slice > operator on String that returns a CharSequence (not a string, but similar > interface) so we didn't have to pass around "string, start, end" manually, but > alas, we didn't do that. > > It's not like the code is unreadable either: > > int _decode(List<int> input, int start, int end, List<int> output, int outIndex) > { > for (int i = start; i + 1 < end; i++) { > var firstDigit = _digitForCodeUnit(input, i); > var secondDigit = _digitForCodeUnit(input, i + 1); > output[outIndex++] = firstDigit * 16 + secondDigit; > } > int length = end - start; > if (length.isOdd) return _digitForCodeUnit(input, end - 1) * 16; > return null; > } > > int _digitForCodeUnit(List<int> codeUnits, int index) { > int codeUnit = codeUnits[index]; > ... decode as usual ... > throw new FormatException("Not hexadecimal digit", > new String.fromCodeUnits(codeUnits), index); > } > > ... and this even makes the input and index available at the point of error. > It would be even easier if you just passed the string around and did > .codeUnitAt, but then you need different code for the asUtf8Sink sink. Done. https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne... lib/src/hex/decoder.dart:112: On 2015/09/24 08:07:01, Lasse Reichstein Nielsen wrote: > On 2015/09/23 22:04:21, nweiz wrote: > > If this is worth heavily optimizing, > > It's executed twice in the inner loop of the transformation, so it's a good idea > to keep it small and reduce the number of unpredictable branches. > I'd use either this or a table. > Of course, having a benchmark would be even better :) > > > would it be better to just use a switch > > statement? Couldn't that be compiled to a single jump or something clever like > > that? > > It could, but it likely isn't. I wouldn't trust switch statements, and in this > case (where there is no special code for any case) a simple table would be a > more direct way to get get the result with a single indirection. > And it avoids branches. You will need a branch to ensure that the input is in > the range of the table, but that should be a predictable branch (always true, > otherwise we have an error anyway). > > > > Also, it looks like your code won't catch negative code units, which could > come > > from a misuse of the UTF-8 converter. > > That is true. For that you will need the >= 0 check: > > int digit = codeUnit ^ $0; > if (digit <= 9) { > if (digit >= 0) return digit; // Predictable branch. > } else { > int letter = codeUnit & ~0x20; > if ($A <= letter && letter <= $Z) return letter - ($A - 10); > } > throw ... > > That check is predictable for correct input, so it shouldn't cost. > Really, the only unpredictable branch is the first "<= 9". > > > > > > It would be great if the FormatException could contain the input string and > > > index, but it's harder when you don't pass the string and index down here. > > > Returning -1 for error is possible, but it introduces an extra check in the > > > inner loop (which can be delayed if necessary). > > > > Passing down the string would add a lot of complexity with the UTF-8 decoder, > > and passing back up an error value would add complexity regardless. One > somewhat > > cleaner possibility would be to just pass in the index, and then have a > > try/catch that rethrows the exception with the original source in whatever > > format. > > Or, if you pass around a list and an index, you can do the index operation > inside the _digitForCodeUnit function. Done. https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/encoder.dart File lib/src/hex/encoder.dart (right): https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/encoder.dart#ne... lib/src/hex/encoder.dart:33: int _codeUnitForDigit(int digit) => digit < 10 ? digit + $0 : digit + $a - 10; On 2015/09/24 08:07:01, Lasse Reichstein Nielsen wrote: > On 2015/09/23 22:04:21, nweiz wrote: > > On 2015/09/23 08:32:04, Lasse Reichstein Nielsen wrote: > > > Maybe just use "0123456789abcdef".codeUnitAt(digit); > > > > Will a memory read really be better than simple arithmetic? > > It's not simple arithmetic - there is an unpredictable branch in it. > It's only one branch, so I *guess* it'll be around the same time as a memory > fetch from a small constant piece of memory. > > However, if the compiler can't determine that the argument is always in range of > the string, the string lookup will add a range check too. > I'd probably write it as: "0123456789ABCDEF".codeUnitAt(digit & 0xF) to help > with that. > > And then benchmark! It's the only way to be sure. > > > > It's only used in two places, so you can even inline it (and maybe make the > > > string a named constant _hexDigits). > > > > I think it's nice to leave it as a named function. It parallels > > _digitForCodeUnit in the decoder, and it makes it easy to read and swap out > the > > implementation (like we might do in this code review). > > It's probably fine as-is. > > > I'm going to leave this as-is for now. We can come back and do benchmark-driven changes if it becomes necessary in the future. https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/decoder.dart File lib/src/hex/decoder.dart (right): https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/decoder.dar... lib/src/hex/decoder.dart:53: RangeError.checkValidRange(start, end, string.length); On 2015/09/24 08:07:02, Lasse Reichstein Nielsen wrote: > Maybe do > end = RangeError.checkValidRange(...); > Then a "null" end will be treated as string.length instead of failing somewhere > below. StringConversionSink.addSlice's documentation doesn't describe [end] as being nullable, so I'd rather not add support for it here. https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/decoder.dar... lib/src/hex/decoder.dart:105: On 2015/09/24 08:07:02, Lasse Reichstein Nielsen wrote: > You need to handle the case where start==end somehow. > Probably just wrap everything until if(isLast) in if (start < end) {...} Done. https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/decoder.dar... lib/src/hex/decoder.dart:112: var hexPairs = (end - start - 1) ~/ 2; On 2015/09/24 08:07:02, Lasse Reichstein Nielsen wrote: > Maybe declare > var length = end - start; > Consider writing it as: > > var hexPairs = (length + 1) ~/ 2; > bytes = new Uint8List(hexPairs); > instead of needing the 1+ below. I like how the current writing makes hexPairs the number of *pairs* in the chunk, and then adds one to that for the extra carry-over byte. https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/decoder.dar... lib/src/hex/decoder.dart:143: var firstDigit = _digitForCodeUnit(iterator.current) * 16; On 2015/09/24 18:39:35, sra1 wrote: > It really helps dart2js to have the constant first. That looks nasty for > expressions like `0 > n` but reads fine for `16 * > _digitForCodeUnit(iterator.current)`. Done. https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/decoder.dar... lib/src/hex/decoder.dart:148: destination[i] = firstDigit + secondDigit; On 2015/09/24 18:39:35, sra1 wrote: > nit: it feels wrong to call both of these xxxDigit when one is scaled and the > other is not. > Done. https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/encoder.dart File lib/src/hex/encoder.dart (right): https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/encoder.dar... lib/src/hex/encoder.dart:52: // we're only emitting ASCII-compatible characters. On 2015/09/24 08:07:02, Lasse Reichstein Nielsen wrote: > , and that we know the length ahead of time. Done. https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/encoder.dar... lib/src/hex/encoder.dart:77: throw 'unreachable'; On 2015/09/24 08:07:02, Lasse Reichstein Nielsen wrote: > Hmm! I can see why the "throw 'unreachable' is needed, but it's still ugly. > Maybe it can be rewritten as: > > int i = start; > while (i < end) { > var byte = bytes[i]; > if (byte < 0 || byte > 255) break; > i++; > } > assert(i < end); > throw new FormatException(..., bytes, i); I think I prefer a standard-looking for loop with an awkward throw, since that makes the real code look normal. https://codereview.chromium.org/1364613002/diff/20001/test/hex_test.dart File test/hex_test.dart (right): https://codereview.chromium.org/1364613002/diff/20001/test/hex_test.dart#newc... test/hex_test.dart:137: expect(() => sink.add("a$char"), throwsFormatException); On 2015/09/24 08:07:02, Lasse Reichstein Nielsen wrote: > Also try them as first digit and as the single digit in a chunked conversion. Done. https://codereview.chromium.org/1364613002/diff/20001/test/hex_test.dart#newc... test/hex_test.dart:142: test("rejects odd characters detected in convert()", () { On 2015/09/24 08:07:02, Lasse Reichstein Nielsen wrote: > odd characters -> odd length Done.
Message was sent while issue was closed.
https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/decoder.dart File lib/src/hex/decoder.dart (right): https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/decoder.dar... lib/src/hex/decoder.dart:112: var hexPairs = (end - start - 1) ~/ 2; I guess that's just a difference in perspective. I was seeing it as the number of pairs being output, which includes the carry-over byte |