Issue 1364613002: Add a hexadecimal codec.

nweiz

5 years, 2 months ago (2015-09-22 23:23:17 UTC) #1

Lasse Reichstein Nielsen

Cool converter. LGTM with comments (and suggestions! :) https://codereview.chromium.org/1364613002/diff/1/lib/src/hex.dart File lib/src/hex.dart (right): https://codereview.chromium.org/1364613002/diff/1/lib/src/hex.dart#newcode22 lib/src/hex.dart:22: final ...

5 years, 2 months ago (2015-09-23 08:32:05 UTC) #2

nweiz

https://codereview.chromium.org/1364613002/diff/1/lib/src/hex.dart File lib/src/hex.dart (right): https://codereview.chromium.org/1364613002/diff/1/lib/src/hex.dart#newcode22 lib/src/hex.dart:22: final HexEncoder encoder = hexEncoder; On 2015/09/23 08:32:03, Lasse ...

5 years, 2 months ago (2015-09-23 22:04:21 UTC) #4

https://codereview.chromium.org/1364613002/diff/1/lib/src/hex.dart
File lib/src/hex.dart (right):

https://codereview.chromium.org/1364613002/diff/1/lib/src/hex.dart#newcode22
lib/src/hex.dart:22: final HexEncoder encoder = hexEncoder;
On 2015/09/23 08:32:03, Lasse Reichstein Nielsen wrote:
> I'd make these getters that return the encode/decoder instead of instance
> fields. 

Done.

https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart
File lib/src/hex/decoder.dart (right):

https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne...
lib/src/hex/decoder.dart:17: /// Because two hexadecimal digits correspond to a
single byte, this will throw
On 2015/09/23 08:32:04, Lasse Reichstein Nielsen wrote:
> Maybe mention that this is RFC 4648 base-16 encoding.

Done, in the HexCodec documentation.

https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne...
lib/src/hex/decoder.dart:26: "was odd.");
On 2015/09/23 08:32:04, Lasse Reichstein Nielsen wrote:
> Add ", string, string.length" as arguments to FormatException.
> Maybe reduce text to "Invalid input length, must be even".

Done.

https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne...
lib/src/hex/decoder.dart:30: _decode(string.runes, bytes, 0);
On 2015/09/23 08:32:04, Lasse Reichstein Nielsen wrote:
> Use string.codeUnits instead. It's more efficient and any >255 value is an
error
> in either case.
> 

Done.

https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne...
lib/src/hex/decoder.dart:38: /// A sink for chunked hexadecimal decoding.
On 2015/09/23 08:32:04, Lasse Reichstein Nielsen wrote:
> A conversion sink for ...
> ("A sink" is too generic, and it's not the Sink class).

Done.

https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne...
lib/src/hex/decoder.dart:52: void addSlice(String string, int start, int end,
bool isLast) {
On 2015/09/23 08:32:03, Lasse Reichstein Nielsen wrote:
> Maybe use:
>   end = RangeError.checkValidRange(start, end, string.length);
> It also checks that start >= 0 and end <= string.length.
> 
> This allows end to be null. If that's not OK, check for that first (and drop
the
> assignment).

Done, although I'm not checking for null explicitly because that's not the usual
style.

https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne...
lib/src/hex/decoder.dart:60: if (_lastDigit == null) {
On 2015/09/23 08:32:03, Lasse Reichstein Nielsen wrote:
> dart2js would probably prefer if the variable always contains integers, so
maybe
> use -1 instead of null to represent "no carry".

I really don't like using magic numbers to indicate "no value" when we have a
null value that's explicitly for that case and fails more gracefully if it's
mis-used. Unless we have concrete data that "null" makes the code size or speed
substantially worse, I'd prefer to leave it in.

Also, it's at least plausible that dart2js can figure out that _lastDigit's only
use is guarded by an if statement that guarantees it won't be null.

https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne...
lib/src/hex/decoder.dart:66: bytes[0] = _lastDigit +
_digitForCodeUnit(string.codeUnitAt(start - 1));
On 2015/09/23 08:32:04, Lasse Reichstein Nielsen wrote:
> Slightly confusing to increment start first and then subtract one here.
> Maybe do it as:
>   var firstChar = string.codeUnitAt(start);
>   start++;
>   bytes = ...;
>   bytes[0] = _lastDigit + _digitsForCodeUnit(firstChar);
> or 
>   bytes = new Uint8List((end - start + 1) ~/ 2);
>   bytes[0] = _lastDigit + _digitForCodeUnit(string.codeUnitAt(start - 1));
>   start++;

Done.

https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne...
lib/src/hex/decoder.dart:70: var runes = string.runes.skip(start).take(end);
On 2015/09/23 08:32:03, Lasse Reichstein Nielsen wrote:
> Should have be either .take(end).skip(start) or .skip(start).take(end -
start).
> We should have .getRange(start, end) on Iterable that does this for you.
> 
> But avoid doing this at all and just work directly on the codeUnits, not the
> runes.
> If you really want to report invalid code points (aka runes), you only need to
> do it in the error handling, you can still parse the code units because the
> first surrogate will be invalid anyway.

Changed to use codeUnits, although I'm still using take/skip. to get at the
correct sublist.

https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne...
lib/src/hex/decoder.dart:80: "was odd.");
On 2015/09/23 08:32:03, Lasse Reichstein Nielsen wrote:
> Same message as above (if you changed it there). 
> It's sad that we don't have an input/index to report here, but it's kind of
> unavoidable.

Done.

https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne...
lib/src/hex/decoder.dart:84: }
On 2015/09/23 08:32:04, Lasse Reichstein Nielsen wrote:
> Consider extending with asUtf8Sink because you can still use _decode for utf-8
> input.

Done.

https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne...
lib/src/hex/decoder.dart:90: int _decode(Iterable<int> runes, List<int>
destination, int start) {
On 2015/09/23 08:32:03, Lasse Reichstein Nielsen wrote:
> Make it: _decode(List<int> codeUnits, int start, int end, List<int>
destination,
> int destIndex)
> Then you don't have to wrap the argument in skip/take, and you can index it as
a
> list instead of using it only as an iterator.

Is that going to be substantially more efficient? It makes the signature more
complex and changes a lot of the implementation to be math-y instead of using
semantic method calls.

https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne...
lib/src/hex/decoder.dart:92: for (var i = start; true; i++) {
On 2015/09/23 08:32:04, Lasse Reichstein Nielsen wrote:
> remove the "true". An omitted condition in a for statement is true by default,
> and parts that are not used are traditionally omitted.
> 
> Or just use a while(true) loop and increment i as  destination[i++]=... 
below.

Done.

https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne...
lib/src/hex/decoder.dart:112: 
On 2015/09/23 08:32:04, Lasse Reichstein Nielsen wrote:
> Consider optimizing this to:
> 
> int digit = codeUnit ^ $0;
>
> if (codeUnit <= 9) {  // Only one branch for 10 of 16 cases.
>   return digit;
> }
> int ucLetter = codeUnit & ~0x20;  // Upper-case if letter.
> if (ucLetter <= $Z && ucLetter >= $A) {
>   return ucLetter - ($A - 10);
> }
> throw new FormatException(...)

If this is worth heavily optimizing, would it be better to just use a switch
statement? Couldn't that be compiled to a single jump or something clever like
that?

Also, it looks like your code won't catch negative code units, which could come
from a misuse of the UTF-8 converter.

> It would be great if the FormatException could contain the input string and
> index, but it's harder when you don't pass the string and index down here.
> Returning -1 for error is possible, but it introduces an extra check in the
> inner loop (which can be delayed if necessary).

Passing down the string would add a lot of complexity with the UTF-8 decoder,
and passing back up an error value would add complexity regardless. One somewhat
cleaner possibility would be to just pass in the index, and then have a
try/catch that rethrows the exception with the original source in whatever
format.

https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne...
lib/src/hex/decoder.dart:114: "U+${codeUnit.toRadixString(16)}.");
On 2015/09/23 08:32:04, Lasse Reichstein Nielsen wrote:
> code point -> code unit

Done.

> add: .padLeft(4, '0') so the hex is always four digits.

Done.

https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/encoder.dart
File lib/src/hex/encoder.dart (right):

https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/encoder.dart#ne...
lib/src/hex/encoder.dart:22: var buffer = new StringBuffer();
On 2015/09/23 08:32:04, Lasse Reichstein Nielsen wrote:
> Use an Uint8List instead of a StringBuffer. StringBuffer isn't as efficient as
> you would want when you know that you are only dealing with code units.
> 
> Maybe:
> 
>    var buffer = new Uint8List(bytes.length * 2);
>    var index = 0;
>    var byteBits = 0;  // bitwise or of all bytes in `bytes`.
>    for (var byte in bytes) {
>      byteBits |= byte;
>      buffer[index++] = _codeUnitForDigit(byte ~/ 16);
>      buffer[index++] = _codeUnitForDigit(byte % 16);
>   }
>   if (byteBits >= 0 && byteBits <= 255) {
>     return new String.fromCharCodes(buffer);
>   }
>   // there is an invalid byte in bytes, find it and report the error.
>   // This moves the range-check branch outside of the inner loop.
>   for (int i = 0; i < bytes.length; i++) {
>     var byte = bytes[i];
>     if (byte != byte.toUnsigned(8)) {
>       throw new FormatException("Not a byte value (0..255)", input, i);
>     }
>   }

Done.

https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/encoder.dart#ne...
lib/src/hex/encoder.dart:33: int _codeUnitForDigit(int digit) => digit < 10 ?
digit + $0 : digit + $a - 10;
On 2015/09/23 08:32:04, Lasse Reichstein Nielsen wrote:
> Maybe just use "0123456789abcdef".codeUnitAt(digit);

Will a memory read really be better than simple arithmetic?

> It's only used in two places, so you can even inline it (and maybe make the
> string a named constant _hexDigits).

I think it's nice to leave it as a named function. It parallels
_digitForCodeUnit in the decoder, and it makes it easy to read and swap out the
implementation (like we might do in this code review).

https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/encoder.dart#ne...
lib/src/hex/encoder.dart:48: }
On 2015/09/23 08:32:04, Lasse Reichstein Nielsen wrote:
> Consider implementing the addSlice function (and have a _convert(List<int>
> bytes, int start, int end) helper function that the public `convert` method
uses
> as well.

Done.

https://codereview.chromium.org/1364613002/diff/1/test/hex_test.dart
File test/hex_test.dart (right):

https://codereview.chromium.org/1364613002/diff/1/test/hex_test.dart#newcode24
test/hex_test.dart:24: expect(results, equals(["1ab23cd4"]));
On 2015/09/23 08:32:04, Lasse Reichstein Nielsen wrote:
> Don't expect that the input is immediately output.
> 
> Another implementation would be to have a buffer in the converter, and
> accumulate bytes until the buffer is full or the sink is closed. That would
> break this test even if it's a perfectly valid implementation.
> 
> Only test the final result after calling close.

That would be a valid implementation, but these aren't intended to be wholly
black-box tests. Even if it's not a guaranteed part of the external API, the
intended behavior of the current implementation is to pass through data as it
comes, and that can have performance implications. I'd rather have to
re-implement the test if the behavior changes than not test this behavior.

https://codereview.chromium.org/1364613002/diff/1/test/hex_test.dart#newcode27
test/hex_test.dart:27: expect(results, equals(["1ab23cd4", "0001feff"]));
On 2015/09/23 08:32:05, Lasse Reichstein Nielsen wrote:
> Test adding an empty list and a single-element list (possible edge cases).
> 
> Maybe even test the empty list first, between non-empty ones and last.

Done.

https://codereview.chromium.org/1364613002/diff/1/test/hex_test.dart#newcode46
test/hex_test.dart:46: expect(hex.decode("1Ab23cD4"), equals([0x1a, 0xb2, 0x3c,
0xd4]));
On 2015/09/23 08:32:04, Lasse Reichstein Nielsen wrote:
> Check all of them. No reason not to: "0123456789ABCDEFabcdef"

Done.

https://codereview.chromium.org/1364613002/diff/1/test/hex_test.dart#newcode64
test/hex_test.dart:64: expect(results, equals([[0x1a, 0xb2, 0x3c, 0xd4], [0x00,
0x01, 0xfe, 0xff]]));
On 2015/09/23 08:32:04, Lasse Reichstein Nielsen wrote:
> long line.

Done.

https://codereview.chromium.org/1364613002/diff/1/test/hex_test.dart#newcode80
test/hex_test.dart:80: });
On 2015/09/23 08:32:04, Lasse Reichstein Nielsen wrote:
> Also check empty lists.

Done.

> In similar cases I have made a parameterized test that tried all possible
splits
> of a list into 1-4 parts, and checks that the final result is the same in
either
> case. That also tests empty lists.

I generally prefer to have a few explicit tests that we believe will cover all
relevant cases.

https://codereview.chromium.org/1364613002/diff/1/test/hex_test.dart#newcode82
test/hex_test.dart:82: test("rejects odd characters detected in close()", () {
On 2015/09/23 08:32:04, Lasse Reichstein Nielsen wrote:
> odd characters -> odd length
> Ditto below.

Done.

https://codereview.chromium.org/1364613002/diff/1/test/hex_test.dart#newcode92
test/hex_test.dart:92: expect(() => sink.addSlice("1ab23cd", 5, 7, true),
throwsFormatException);
On 2015/09/23 08:32:04, Lasse Reichstein Nielsen wrote:
> Long line.

Done.

https://codereview.chromium.org/1364613002/diff/1/test/hex_test.dart#newcode97
test/hex_test.dart:97: expect(() => hex.decode("az"), throwsFormatException);
On 2015/09/23 08:32:04, Lasse Reichstein Nielsen wrote:
> Also check the characters:
>   "/"
>   ":"
>   "@"
>   "["
>   "`"
>   "{"
>   "\x00"
>   "\u0141"
>   "\u{10041}"
> (characters right next to 0/9, A/Z, a/z, plus NUL and values that become valid
> if masked incorrectly).

Done, although I replaced "{" and "[" with "g" and "G" since those are the
limits of valid hex.

Lasse Reichstein Nielsen

lgtm https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart File lib/src/hex/decoder.dart (right): https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#newcode60 lib/src/hex/decoder.dart:60: if (_lastDigit == null) { True. This is ...

5 years, 2 months ago (2015-09-24 08:07:02 UTC) #6

lgtm

https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart
File lib/src/hex/decoder.dart (right):

https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne...
lib/src/hex/decoder.dart:60: if (_lastDigit == null) {
True. This is definitely speculative, so keep it as is.

https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne...
lib/src/hex/decoder.dart:90: int _decode(Iterable<int> runes, List<int>
destination, int start) {
On 2015/09/23 22:04:21, nweiz wrote:
> On 2015/09/23 08:32:03, Lasse Reichstein Nielsen wrote:
> > Make it: _decode(List<int> codeUnits, int start, int end, List<int>
> destination,
> > int destIndex)
> > Then you don't have to wrap the argument in skip/take, and you can index it
as
> a
> > list instead of using it only as an iterator.
> 
> Is that going to be substantially more efficient? 

Most likely yes.
This is the inner loop of the transformation. The more code you put in here, the
slower everything gets. An Iterable, in this case a SubListIterable on top of a
CodeUnits list on top of a String, will have more overhead than doing
.codeUnitAt(index) directly. I don't think any of our compilers are smart enough
to avoid it.

> It makes the signature more
> complex and changes a lot of the implementation to be math-y instead of using
> semantic method calls.

You say "math-y" as if it's a bad thing :)

And it's very much in line with most other methods that take a string in taking
start,end too. 
It's a bad design that requires us to do it - we should have an efficient slice
operator on String that returns a CharSequence (not a string, but similar
interface) so we didn't have to pass around "string, start, end" manually, but
alas, we didn't do that.

It's not like the code is unreadable either:

int _decode(List<int> input, int start, int end, List<int> output, int outIndex)
{
  for (int i = start; i + 1 < end; i++) {
    var firstDigit = _digitForCodeUnit(input, i);
    var secondDigit = _digitForCodeUnit(input, i + 1);
    output[outIndex++] = firstDigit * 16 + secondDigit;
  }
  int length = end - start;
  if (length.isOdd) return _digitForCodeUnit(input, end - 1) * 16;
  return null;
}

int _digitForCodeUnit(List<int> codeUnits, int index) {
  int codeUnit = codeUnits[index];
  ... decode as usual ...
  throw new FormatException("Not hexadecimal digit", 
                new String.fromCodeUnits(codeUnits), index);
}

... and this even makes the input and index available at the point of error.
It would be even easier if you just passed the string around and did
.codeUnitAt, but then you need different code for the asUtf8Sink sink.

https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne...
lib/src/hex/decoder.dart:112: 
On 2015/09/23 22:04:21, nweiz wrote:
> If this is worth heavily optimizing, 

It's executed twice in the inner loop of the transformation, so it's a good idea
to keep it small and reduce the number of unpredictable branches.
I'd use either this or a table.
Of course, having a benchmark would be even better :)

> would it be better to just use a switch
> statement? Couldn't that be compiled to a single jump or something clever like
> that?

It could, but it likely isn't. I wouldn't trust switch statements, and in this
case (where there is no special code for any case) a simple table would be a
more direct way to get get the result with a single indirection.
And it avoids branches. You will need a branch to ensure that the input is in
the range of the table, but that should be a predictable branch (always true,
otherwise we have an error anyway). 

> Also, it looks like your code won't catch negative code units, which could
come
> from a misuse of the UTF-8 converter.

That is true. For that you will need the >= 0 check:

int digit = codeUnit ^ $0;
if (digit <= 9) {
  if (digit >= 0) return digit;  // Predictable branch.
} else {
  int letter = codeUnit & ~0x20;
  if ($A <= letter && letter <= $Z) return letter - ($A - 10);
}
throw ...

That check is predictable for correct input, so it shouldn't cost.
Really, the only unpredictable branch is the first "<= 9".

> 
> > It would be great if the FormatException could contain the input string and
> > index, but it's harder when you don't pass the string and index down here.
> > Returning -1 for error is possible, but it introduces an extra check in the
> > inner loop (which can be delayed if necessary).
> 
> Passing down the string would add a lot of complexity with the UTF-8 decoder,
> and passing back up an error value would add complexity regardless. One
somewhat
> cleaner possibility would be to just pass in the index, and then have a
> try/catch that rethrows the exception with the original source in whatever
> format.

Or, if you pass around a list and an index, you can do the index operation
inside the _digitForCodeUnit function.

https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/encoder.dart
File lib/src/hex/encoder.dart (right):

https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/encoder.dart#ne...
lib/src/hex/encoder.dart:33: int _codeUnitForDigit(int digit) => digit < 10 ?
digit + $0 : digit + $a - 10;
On 2015/09/23 22:04:21, nweiz wrote:
> On 2015/09/23 08:32:04, Lasse Reichstein Nielsen wrote:
> > Maybe just use "0123456789abcdef".codeUnitAt(digit);
> 
> Will a memory read really be better than simple arithmetic?

It's not simple arithmetic - there is an unpredictable branch in it.
It's only one branch, so I *guess* it'll be around the same time as a memory
fetch from a small constant piece of memory.

However, if the compiler can't determine that the argument is always in range of
the string, the string lookup will add a range check too.
I'd probably write it as: "0123456789ABCDEF".codeUnitAt(digit & 0xF) to help
with that.

And then benchmark! It's the only way to be sure.

> > It's only used in two places, so you can even inline it (and maybe make the
> > string a named constant _hexDigits).
> 
> I think it's nice to leave it as a named function. It parallels
> _digitForCodeUnit in the decoder, and it makes it easy to read and swap out
the
> implementation (like we might do in this code review).

It's probably fine as-is.

https://codereview.chromium.org/1364613002/diff/1/test/hex_test.dart
File test/hex_test.dart (right):

https://codereview.chromium.org/1364613002/diff/1/test/hex_test.dart#newcode97
test/hex_test.dart:97: expect(() => hex.decode("az"), throwsFormatException);
Does it show that I'm working on BASE64 :)

https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/decoder.dart
File lib/src/hex/decoder.dart (right):

https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/decoder.dar...
lib/src/hex/decoder.dart:53: RangeError.checkValidRange(start, end,
string.length);
Maybe do
  end = RangeError.checkValidRange(...);
Then a "null" end will be treated as string.length instead of failing somewhere
below.

https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/decoder.dar...
lib/src/hex/decoder.dart:68: var codeUnits =
string.codeUnits.take(end).skip(start);
I want to add getRange(start, end) to Iterable, but they won't let me add more
to Iterable any more :(

https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/decoder.dar...
lib/src/hex/decoder.dart:105: 
You need to handle the case where start==end somehow.
Probably just wrap everything until if(isLast)  in  if (start < end) {...}

https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/decoder.dar...
lib/src/hex/decoder.dart:112: var hexPairs = (end - start - 1) ~/ 2;
Maybe declare
  var length = end - start;
Consider writing it as:

  var hexPairs = (length + 1) ~/ 2;
  bytes = new Uint8List(hexPairs);
instead of needing the 1+ below.

https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/decoder.dar...
lib/src/hex/decoder.dart:120: _lastDigit = _decode(codeUnits, bytes,
bytesStart);
I still think using a list+start+end is more efficient than in iterator (and
personally I think it's more readable too - it *is* a list, and reducing it to
an iterator is just throwing away useful information).

https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/encoder.dart
File lib/src/hex/encoder.dart (right):

https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/encoder.dar...
lib/src/hex/encoder.dart:52: // we're only emitting ASCII-compatible characters.
, and that we know the length ahead of time.

https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/encoder.dar...
lib/src/hex/encoder.dart:77: throw 'unreachable';
Hmm! I can see why the "throw 'unreachable' is needed, but it's still ugly.
Maybe it can be rewritten as:

int i = start;
while (i < end) {
  var byte = bytes[i];
  if (byte < 0 || byte > 255) break;
  i++;
}
assert(i < end);
throw new FormatException(..., bytes, i);

https://codereview.chromium.org/1364613002/diff/20001/test/hex_test.dart
File test/hex_test.dart (right):

https://codereview.chromium.org/1364613002/diff/20001/test/hex_test.dart#newc...
test/hex_test.dart:137: expect(() => sink.add("a$char"), throwsFormatException);
Also try them as first digit and as the single digit in a chunked conversion.

https://codereview.chromium.org/1364613002/diff/20001/test/hex_test.dart#newc...
test/hex_test.dart:142: test("rejects odd characters detected in convert()", ()
{
odd characters -> odd length

Lasse Reichstein Nielsen

https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/decoder.dart File lib/src/hex/decoder.dart (right): https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/decoder.dart#newcode68 lib/src/hex/decoder.dart:68: var codeUnits = string.codeUnits.take(end).skip(start); However, since codeUnits is a ...

5 years, 2 months ago (2015-09-24 11:37:39 UTC) #7

kevmoo

DBC: rebase against what's on github? They seem a bit out of sync.

5 years, 2 months ago (2015-09-24 17:50:52 UTC) #9

sra1

DBC. TL;DR: pass the codeUnits all the way through to the _decode loop. https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex.dart File ...

5 years, 2 months ago (2015-09-24 18:39:36 UTC) #11

nweiz

Committed patchset #2 (id:20001) manually as bc95e00ad577abac3fceb63b25dd619abc36d3de (presubmit successful).

5 years, 2 months ago (2015-09-24 23:23:11 UTC) #12

nweiz

https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart File lib/src/hex/decoder.dart (right): https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#newcode90 lib/src/hex/decoder.dart:90: int _decode(Iterable<int> runes, List<int> destination, int start) { On ...

5 years, 2 months ago (2015-09-24 23:23:42 UTC) #13

Message was sent while issue was closed.

https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart
File lib/src/hex/decoder.dart (right):

https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne...
lib/src/hex/decoder.dart:90: int _decode(Iterable<int> runes, List<int>
destination, int start) {
On 2015/09/24 08:07:01, Lasse Reichstein Nielsen wrote:
> On 2015/09/23 22:04:21, nweiz wrote:
> > On 2015/09/23 08:32:03, Lasse Reichstein Nielsen wrote:
> > > Make it: _decode(List<int> codeUnits, int start, int end, List<int>
> > destination,
> > > int destIndex)
> > > Then you don't have to wrap the argument in skip/take, and you can index
it
> as
> > a
> > > list instead of using it only as an iterator.
> > 
> > Is that going to be substantially more efficient? 
> 
> Most likely yes.
> This is the inner loop of the transformation. The more code you put in here,
the
> slower everything gets. An Iterable, in this case a SubListIterable on top of
a
> CodeUnits list on top of a String, will have more overhead than doing
> .codeUnitAt(index) directly. I don't think any of our compilers are smart
enough
> to avoid it.
> 
> > It makes the signature more
> > complex and changes a lot of the implementation to be math-y instead of
using
> > semantic method calls.
> 
> You say "math-y" as if it's a bad thing :)
> 
> And it's very much in line with most other methods that take a string in
taking
> start,end too. 
> It's a bad design that requires us to do it - we should have an efficient
slice
> operator on String that returns a CharSequence (not a string, but similar
> interface) so we didn't have to pass around "string, start, end" manually, but
> alas, we didn't do that.
> 
> It's not like the code is unreadable either:
> 
> int _decode(List<int> input, int start, int end, List<int> output, int
outIndex)
> {
>   for (int i = start; i + 1 < end; i++) {
>     var firstDigit = _digitForCodeUnit(input, i);
>     var secondDigit = _digitForCodeUnit(input, i + 1);
>     output[outIndex++] = firstDigit * 16 + secondDigit;
>   }
>   int length = end - start;
>   if (length.isOdd) return _digitForCodeUnit(input, end - 1) * 16;
>   return null;
> }
> 
> int _digitForCodeUnit(List<int> codeUnits, int index) {
>   int codeUnit = codeUnits[index];
>   ... decode as usual ...
>   throw new FormatException("Not hexadecimal digit", 
>                 new String.fromCodeUnits(codeUnits), index);
> }
> 
> ... and this even makes the input and index available at the point of error.
> It would be even easier if you just passed the string around and did
> .codeUnitAt, but then you need different code for the asUtf8Sink sink.

Done.

https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/decoder.dart#ne...
lib/src/hex/decoder.dart:112: 
On 2015/09/24 08:07:01, Lasse Reichstein Nielsen wrote:
> On 2015/09/23 22:04:21, nweiz wrote:
> > If this is worth heavily optimizing, 
> 
> It's executed twice in the inner loop of the transformation, so it's a good
idea
> to keep it small and reduce the number of unpredictable branches.
> I'd use either this or a table.
> Of course, having a benchmark would be even better :)
> 
> > would it be better to just use a switch
> > statement? Couldn't that be compiled to a single jump or something clever
like
> > that?
> 
> It could, but it likely isn't. I wouldn't trust switch statements, and in this
> case (where there is no special code for any case) a simple table would be a
> more direct way to get get the result with a single indirection.
> And it avoids branches. You will need a branch to ensure that the input is in
> the range of the table, but that should be a predictable branch (always true,
> otherwise we have an error anyway). 
> 
>  
> > Also, it looks like your code won't catch negative code units, which could
> come
> > from a misuse of the UTF-8 converter.
> 
> That is true. For that you will need the >= 0 check:
> 
> int digit = codeUnit ^ $0;
> if (digit <= 9) {
>   if (digit >= 0) return digit;  // Predictable branch.
> } else {
>   int letter = codeUnit & ~0x20;
>   if ($A <= letter && letter <= $Z) return letter - ($A - 10);
> }
> throw ...
> 
> That check is predictable for correct input, so it shouldn't cost.
> Really, the only unpredictable branch is the first "<= 9".
> 
> > 
> > > It would be great if the FormatException could contain the input string
and
> > > index, but it's harder when you don't pass the string and index down here.
> > > Returning -1 for error is possible, but it introduces an extra check in
the
> > > inner loop (which can be delayed if necessary).
> > 
> > Passing down the string would add a lot of complexity with the UTF-8
decoder,
> > and passing back up an error value would add complexity regardless. One
> somewhat
> > cleaner possibility would be to just pass in the index, and then have a
> > try/catch that rethrows the exception with the original source in whatever
> > format.
> 
> Or, if you pass around a list and an index, you can do the index operation
> inside the _digitForCodeUnit function.

Done.

https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/encoder.dart
File lib/src/hex/encoder.dart (right):

https://codereview.chromium.org/1364613002/diff/1/lib/src/hex/encoder.dart#ne...
lib/src/hex/encoder.dart:33: int _codeUnitForDigit(int digit) => digit < 10 ?
digit + $0 : digit + $a - 10;
On 2015/09/24 08:07:01, Lasse Reichstein Nielsen wrote:
> On 2015/09/23 22:04:21, nweiz wrote:
> > On 2015/09/23 08:32:04, Lasse Reichstein Nielsen wrote:
> > > Maybe just use "0123456789abcdef".codeUnitAt(digit);
> > 
> > Will a memory read really be better than simple arithmetic?
> 
> It's not simple arithmetic - there is an unpredictable branch in it.
> It's only one branch, so I *guess* it'll be around the same time as a memory
> fetch from a small constant piece of memory.
> 
> However, if the compiler can't determine that the argument is always in range
of
> the string, the string lookup will add a range check too.
> I'd probably write it as: "0123456789ABCDEF".codeUnitAt(digit & 0xF) to help
> with that.
> 
> And then benchmark! It's the only way to be sure.
> 
> > > It's only used in two places, so you can even inline it (and maybe make
the
> > > string a named constant _hexDigits).
> > 
> > I think it's nice to leave it as a named function. It parallels
> > _digitForCodeUnit in the decoder, and it makes it easy to read and swap out
> the
> > implementation (like we might do in this code review).
> 
> It's probably fine as-is.
> 
> 
> 

I'm going to leave this as-is for now. We can come back and do benchmark-driven
changes if it becomes necessary in the future.

https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/decoder.dart
File lib/src/hex/decoder.dart (right):

https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/decoder.dar...
lib/src/hex/decoder.dart:53: RangeError.checkValidRange(start, end,
string.length);
On 2015/09/24 08:07:02, Lasse Reichstein Nielsen wrote:
> Maybe do
>   end = RangeError.checkValidRange(...);
> Then a "null" end will be treated as string.length instead of failing
somewhere
> below.

StringConversionSink.addSlice's documentation doesn't describe [end] as being
nullable, so I'd rather not add support for it here.

https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/decoder.dar...
lib/src/hex/decoder.dart:105: 
On 2015/09/24 08:07:02, Lasse Reichstein Nielsen wrote:
> You need to handle the case where start==end somehow.
> Probably just wrap everything until if(isLast)  in  if (start < end) {...}

Done.

https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/decoder.dar...
lib/src/hex/decoder.dart:112: var hexPairs = (end - start - 1) ~/ 2;
On 2015/09/24 08:07:02, Lasse Reichstein Nielsen wrote:
> Maybe declare
>   var length = end - start;
> Consider writing it as:
> 
>   var hexPairs = (length + 1) ~/ 2;
>   bytes = new Uint8List(hexPairs);
> instead of needing the 1+ below.

I like how the current writing makes hexPairs the number of *pairs* in the
chunk, and then adds one to that for the extra carry-over byte.

https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/decoder.dar...
lib/src/hex/decoder.dart:143: var firstDigit =
_digitForCodeUnit(iterator.current) * 16;
On 2015/09/24 18:39:35, sra1 wrote:
> It really helps dart2js to have the constant first.  That looks nasty for
> expressions like `0 > n` but reads fine for `16 *
> _digitForCodeUnit(iterator.current)`.

Done.

https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/decoder.dar...
lib/src/hex/decoder.dart:148: destination[i] = firstDigit + secondDigit;
On 2015/09/24 18:39:35, sra1 wrote:
> nit: it feels wrong to call both of these xxxDigit when one is scaled and the
> other is not.
> 

Done.

https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/encoder.dart
File lib/src/hex/encoder.dart (right):

https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/encoder.dar...
lib/src/hex/encoder.dart:52: // we're only emitting ASCII-compatible characters.
On 2015/09/24 08:07:02, Lasse Reichstein Nielsen wrote:
> , and that we know the length ahead of time.

Done.

https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/encoder.dar...
lib/src/hex/encoder.dart:77: throw 'unreachable';
On 2015/09/24 08:07:02, Lasse Reichstein Nielsen wrote:
> Hmm! I can see why the "throw 'unreachable' is needed, but it's still ugly.
> Maybe it can be rewritten as:
> 
> int i = start;
> while (i < end) {
>   var byte = bytes[i];
>   if (byte < 0 || byte > 255) break;
>   i++;
> }
> assert(i < end);
> throw new FormatException(..., bytes, i);

I think I prefer a standard-looking for loop with an awkward throw, since that
makes the real code look normal.

https://codereview.chromium.org/1364613002/diff/20001/test/hex_test.dart
File test/hex_test.dart (right):

https://codereview.chromium.org/1364613002/diff/20001/test/hex_test.dart#newc...
test/hex_test.dart:137: expect(() => sink.add("a$char"), throwsFormatException);
On 2015/09/24 08:07:02, Lasse Reichstein Nielsen wrote:
> Also try them as first digit and as the single digit in a chunked conversion.

Done.

https://codereview.chromium.org/1364613002/diff/20001/test/hex_test.dart#newc...
test/hex_test.dart:142: test("rejects odd characters detected in convert()", ()
{
On 2015/09/24 08:07:02, Lasse Reichstein Nielsen wrote:
> odd characters -> odd length

Done.

Lasse Reichstein Nielsen

5 years, 2 months ago (2015-09-25 06:40:33 UTC) #14

Message was sent while issue was closed.

https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/decoder.dart
File lib/src/hex/decoder.dart (right):

https://codereview.chromium.org/1364613002/diff/20001/lib/src/hex/decoder.dar...
lib/src/hex/decoder.dart:112: var hexPairs = (end - start - 1) ~/ 2;
I guess that's just a difference in perspective. I was seeing it as the number
of pairs being output, which includes the carry-over byte

Issue 1364613002: Add a hexadecimal codec. (Closed)

Description

Patch Set 1 #

Patch Set 2 : Code review changes #

Messages