|
|
Created:
11 years, 7 months ago by eroman Modified:
9 years, 6 months ago CC:
chromium-reviews_googlegroups.com Visibility:
Public. |
DescriptionCanonicalize IPv6 addresses.
BUG=http://code.google.com/p/google-url/issues/detail?id=2
Committed: http://code.google.com/p/google-url/source/detail?r=104
Patch Set 1 #Patch Set 2 : add missing unit tests #Patch Set 3 : some cleanups #Patch Set 4 : Add support for scope-ids #Patch Set 5 : cleanup and fix lint errors #Patch Set 6 : only normalize scope-id on WIN32, since it isn't necessarily a number on linux #Patch Set 7 : kill some whitespace #
Total comments: 9
Patch Set 8 : Address brettw's comments #
Total comments: 6
Patch Set 9 : remove scopeid support & fix itoa_s radix=16 for non-windows #Patch Set 10 : Use zero-compression in the canonicalized output #
Total comments: 2
Patch Set 11 : fix case where colon wasn't appended #
Total comments: 12
Patch Set 12 : Address pmarks loop change, and brettw's comments #
Total comments: 4
Patch Set 13 : address pmarks comments #
Messages
Total messages: 23 (0 generated)
brettw: please review the whole. wtc: please review the test cases in |url_canon_unittest.cc| in case you can suggest other cases to check or missing functionality. Wan-Teh, I added the "scope-id" support per your suggestion -- please check if the windows vs linux logic is right (I know how to normalize on windows, but not on other platforms)... I am not sure if it is worth the trouble to support scope-id (as opposed to just stripping it during canonicalization and worrying about it later). It does introduce some global complexity where all callers need to know about it. For example http_network_transaction would need to know to strip the scope-id when building the "Host:" header. Similarly would we respect scope-id in URLs from external pages? does it matter?
Sorry I didn't get to all of this yet. I'll do it tomorrow. http://codereview.chromium.org/99183/diff/1018/31 File src/url_canon_unittest.cc (right): http://codereview.chromium.org/99183/diff/1018/31#newcode596 Line 596: // Scope ID -- the URL may contain an optional ["%" <scope_id>] section. This percent sucks! We need to check the unescaping code in various places. I'm thinking particularly for display, but we should probably check all calls to the unescaper to make sure that we don't unescape in the hostname (possibly other than %20, but I'm not sure). Since you seem to be on the fence about supporting this, I would suggest "no".
I did some more, still not done. http://codereview.chromium.org/99183/diff/1018/29 File src/url_canon_ip.cc (right): http://codereview.chromium.org/99183/diff/1018/29#newcode374 Line 374: int end = address.end(); If 'address' is invalid, the length will be -1 and you'll have problems. Is it guaranteed to be valid? http://codereview.chromium.org/99183/diff/1018/29#newcode382 Line 382: bool is_contraction = is_colon && i != end && spec[i + 1] == ':'; "end" is 1 past the end of the string, and even if it was the end, this would be wrong. I think you want "i < end - 1" http://codereview.chromium.org/99183/diff/1018/29#newcode387 Line 387: if (is_colon || i == end) { Same thing about "end", and in several places below. Maybe you should rename it to "last" and set it to end() - 1? http://codereview.chromium.org/99183/diff/1018/29#newcode420 Line 420: // There can be at most one contraction in the literal. This is indented 1 space too much.
http://codereview.chromium.org/99183/diff/1018/29 File src/url_canon_ip.cc (right): http://codereview.chromium.org/99183/diff/1018/29#newcode374 Line 374: int end = address.end(); On 2009/05/05 09:25:25, brettw wrote: > If 'address' is invalid, the length will be -1 and you'll have problems. Is it > guaranteed to be valid? Done. Added a check for non-empty. (This turns out to have worked, but for an awkward reason. While it wasn't possible for |address| to be invalid, it was possible for it to be empty. The empty case would work since |spec[end]| is always either ']' or '%' rather than a memory error) http://codereview.chromium.org/99183/diff/1018/29#newcode382 Line 382: bool is_contraction = is_colon && i != end && spec[i + 1] == ':'; On 2009/05/05 09:25:25, brettw wrote: > "end" is 1 past the end of the string, and even if it was the end, this would be > wrong. I think you want "i < end - 1" Done. Changed to |i < end - 1|. (This turns out to have worked for the same awkward reason as above. |spec[end]| is always ']' or '%' rather than a memory error). http://codereview.chromium.org/99183/diff/1018/29#newcode387 Line 387: if (is_colon || i == end) { On 2009/05/05 09:25:25, brettw wrote: > Same thing about "end", and in several places below. Maybe you should rename it > to "last" and set it to end() - 1? Here the use of |end| is intentional. The loop goes from (i = begin; i <= end; ++i), so "end" is inclusive. The checks |i == end| are for ending a token when we have run out of input. This is modeled after |DoFindIPv4Components| which does a similar type of loop. To clarify this aspect, I have changed the loop comment: - for (int i = begin; /* nothing */; i++) { + for (int i = begin; /* i <= end */; i++) { http://codereview.chromium.org/99183/diff/1018/29#newcode420 Line 420: // There can be at most one contraction in the literal. On 2009/05/05 09:25:25, brettw wrote: > This is indented 1 space too much. Fixed.
http://codereview.chromium.org/99183/diff/5001/4002 File src/url_canon_ip.cc (right): http://codereview.chromium.org/99183/diff/5001/4002#newcode540 Line 540: int cur_index_in_address = 0; // Current output location into |address|. The comment on this line seems superfluous given your nice naming. http://codereview.chromium.org/99183/diff/5001/4002#newcode588 Line 588: // id was invalid. Has there been resolution on the scope ID thing? If we're going to keep it, can you document the range of values scope_id might have on success? http://codereview.chromium.org/99183/diff/5001/4002#newcode609 Line 609: if (id < 0) // Overflowed. Are you sure this is correct? I thought the static cast would basically truncate, so there may be high bits? I'd instead compare this as an unsigned so that it's < (1<<31) and then assign to the output with the cast.
Skimming the test data, I notice that your canonical format doesn't use any zero compression. This document describes the more-or-less widely accepted standard for IPv6 canonicalization. Does it really make sense to do something different from inet_ntop, etc.? http://tools.ietf.org/html/draft-kawamura-ipv6-text-representation-01#section-4
Also, I think you may want to drop the "%" interface identifier parsing. See RFC3986, section 3.2.2: "This syntax does not support IPv6 scoped addressing zone identifiers."
On 2009/05/09 23:34:29, pmarks wrote: > Also, I think you may want to drop the "%" interface identifier parsing. > > See RFC3986, section 3.2.2: "This syntax does not support IPv6 scoped addressing > zone identifiers." I'm sure eroman will be glad to hear that. He added the "%" interface parsing after I told him about IPv6 scoped ids, but I didn't know about RFC3986.
Thanks for all the feedback! I hope to address these comments soon and upload a new patch, but I am still busy with some other stuff. @pmarks: thanks for looking through the test datas and for the useful feedback. I am an IPv6 n00b so this stuff is definitely helpful.
Finally got around to updating this! > Skimming the test data, I notice that your canonical format doesn't > use any zero compression. Done. @pmarks: can you review the new set of expected canonicalized results? > Also, I think you may want to drop the "%" interface identifier parsing. Gladly removed! http://codereview.chromium.org/99183/diff/5001/4002 File src/url_canon_ip.cc (right): http://codereview.chromium.org/99183/diff/5001/4002#newcode540 Line 540: int cur_index_in_address = 0; // Current output location into |address|. On 2009/05/06 09:13:06, brettw wrote: > The comment on this line seems superfluous given your nice naming. Removed. http://codereview.chromium.org/99183/diff/5001/4002#newcode588 Line 588: // id was invalid. On 2009/05/06 09:13:06, brettw wrote: > Has there been resolution on the scope ID thing? > > If we're going to keep it, can you document the range of values scope_id might > have on success? Removed the Scope-ID support. http://codereview.chromium.org/99183/diff/5001/4002#newcode609 Line 609: if (id < 0) // Overflowed. On 2009/05/06 09:13:06, brettw wrote: > Are you sure this is correct? I thought the static cast would basically > truncate, so there may be high bits? I'd instead compare this as an unsigned so > that it's < (1<<31) and then assign to the output with the cast. Not applicable, since Scope-ID is now gone.
http://codereview.chromium.org/99183/diff/6013/6014 File src/url_canon_ip.cc (right): http://codereview.chromium.org/99183/diff/6013/6014#newcode585 Line 585: !IsInRange(contraction_range, i + 2) && You're removing the : before a number if a contraction starts *after* that number? That can't be right. Okay, I ran the tests locally, and it appears I'm not crazy: /home/paul/src/chromium/chromium/src/googleurl/src/gurl_unittest.cc:372: Failure Value of: url.host() Actual: "[2001db8::1]" Expected: cases[i].expected_host Which is: "[2001:db8::1]" Your Canonicalization tests don't trigger this, though. I think that means you need more tests. (Also, I didn't notice any tests that cover cases with equal-length runs, like "[1:0:0:2:0:0:3:4]")
On 2009/05/13 23:24:08, pmarks wrote: > (Also, I didn't notice any tests that cover cases with equal-length runs, like > "[1:0:0:2:0:0:3:4]") Er, ignore this part. Your "[1::2:0:0:3:0]" test covers that.
Oh, and this needs to be asked: Why are you re-implementing inet_ntop(), instead of just calling it?
On 2009/05/13 23:42:46, pmarks wrote: > Oh, and this needs to be asked: Why are you re-implementing inet_ntop(), instead > of just calling it? Do you mean this CL is just a re-implementation of inet_ntop? In which case it would be just adding code bloat, unless we need consistency across platforms. Or do you mean this CL could call inet_ntop to help with canonicalization? On Windows, inet_ntop is not added until Windows Vista: http://msdn.microsoft.com/en-us/library/cc805843%28VS.85%29.aspx But we can use getnameinfo with NI_NUMERICHOST instead. WSAAddressToString may also work, but I have never used it.
On 2009/05/14 00:07:33, wtc wrote: > Do you mean this CL is just a re-implementation of inet_ntop? > In which case it would be just adding code bloat, unless we need > consistency across platforms. > > Or do you mean this CL could call inet_ntop to help with > canonicalization? I mean that 90% of this code could be eliminated by canonicalizing with inet_ntop(inet_pton()). > > On Windows, inet_ntop is not added until Windows Vista: > http://msdn.microsoft.com/en-us/library/cc805843%28VS.85%29.aspx > But we can use getnameinfo with NI_NUMERICHOST instead. > WSAAddressToString may also work, but I have never used it. Between "use the OS libraries to save code" and "rewrite it for consistency", I don't have a strong opinion. I just asked to confirm that the authors are making an informed decision.
> Oh, and this needs to be asked: Why are you re-implementing > inet_ntop(), instead of just calling it? Brett is better equipped to answer this, so I will defer an answer. However from my understanding of the design of googleurl, its goal is to have minimal dependencies (i.e. not have to load winsock for example), and to yield consistent results across platforms (regardless of whether they support say ipv6). Given how URLs are compared in the browser and stored in history, it is probably important that canonicalization works the same everywhere... At any rate, this same question can apply to the IPv4 parsing strategy which does a custom solution, which is why I modeled IPv6 along the same lines. http://codereview.chromium.org/99183/diff/6013/6014 File src/url_canon_ip.cc (right): http://codereview.chromium.org/99183/diff/6013/6014#newcode585 Line 585: !IsInRange(contraction_range, i + 2) && On 2009/05/13 23:24:08, pmarks wrote: > You're removing the : before a number if a contraction starts *after* that > number? That can't be right. Fixed. That i+2 test shouldn't have been there. > > Okay, I ran the tests locally, and it appears I'm not crazy: > /home/paul/src/chromium/chromium/src/googleurl/src/gurl_unittest.cc:372: Failure > Value of: url.host() > Actual: "[2001db8::1]" > Expected: cases[i].expected_host > Which is: "[2001:db8::1]" > > Your Canonicalization tests don't trigger this, though. I think that means you > need more tests. > > (Also, I didn't notice any tests that cover cases with equal-length runs, like > "[1:0:0:2:0:0:3:4]")
http://codereview.chromium.org/99183/diff/7011/7012 File src/url_canon_ip.cc (right): http://codereview.chromium.org/99183/diff/7011/7012#newcode578 Line 578: for (int i = 0; i < 16; i += 2) { Here's a more efficient version of the for loop. It lets you get rid of IsInRange(): for (int i = 0; i < 16; ) { if (i == contraction_range.begin && contraction_range.len > 0) { // Jump over the contraction. if (i == 0) output->push_back(':'); output->push_back(':'); i = contraction_range.end(); } else { // Consume the next 16 bits from |address|. const uint32_t x = address[i] << 8 | address[i + 1]; i += 2; // Stringify the 16 bit number (at most requires 4 hex digits). char str[5]; _itoa_s(x, str, 16); for (int ch = 0; str[ch] != '\0'; ++ch) output->push_back(str[ch]); // Put a colon after each number, except the last. if (i < 16) { output->push_back(':'); } } }
On 2009/05/14 00:15:17, pmarks wrote: > Between "use the OS libraries to save code" and "rewrite it for consistency", I > don't have a strong opinion. I just asked to confirm that the authors are > making an informed decision. We need to have control over this code so that we behave the same across platforms, and in case there are any security problems we can fix it.
LGTM if you agree with everything http://codereview.chromium.org/99183/diff/7011/7012 File src/url_canon_ip.cc (right): http://codereview.chromium.org/99183/diff/7011/7012#newcode354 Line 354: // Reached the end of the input, DONE. I think this comment would make more sense as an end of line comment after the return (since getting to where the comment is doesn't mean we reached the end). http://codereview.chromium.org/99183/diff/7011/7012#newcode432 Line 432: uint32_t IPv6HexComponentToNumber(const CHAR* spec, Is there any reason this can't be a uint16_t? http://codereview.chromium.org/99183/diff/7011/7012#newcode486 Line 486: uint32_t number = IPv6HexComponentToNumber<CHAR>( If we can change to uint16_t above, be sure to change this one as well. http://codereview.chromium.org/99183/diff/7011/7012#newcode578 Line 578: for (int i = 0; i < 16; i += 2) { On 2009/05/14 03:13:27, pmarks wrote: > Here's a more efficient version of the for loop. It lets you get rid of > IsInRange(): > > for (int i = 0; i < 16; ) { > if (i == contraction_range.begin && contraction_range.len > 0) { > // Jump over the contraction. > if (i == 0) > output->push_back(':'); > output->push_back(':'); > i = contraction_range.end(); > > } else { > // Consume the next 16 bits from |address|. > const uint32_t x = address[i] << 8 | address[i + 1]; Maybe this should just be an int since we pass to itoa below? I'd also lose the const. > i += 2; > > // Stringify the 16 bit number (at most requires 4 hex digits). > char str[5]; > _itoa_s(x, str, 16); > for (int ch = 0; str[ch] != '\0'; ++ch) > output->push_back(str[ch]); > > // Put a colon after each number, except the last. > if (i < 16) { > output->push_back(':'); > } I wouldn't use {} here to keep it like the rest of the file. http://codereview.chromium.org/99183/diff/7011/7014 File src/url_canon_unittest.cc (right): http://codereview.chromium.org/99183/diff/7011/7014#newcode610 Line 610: Can you add a testcase with a single colon at the beginning? There is some code to handle this and I don't see a test.
Thanks for the code reviews. http://codereview.chromium.org/99183/diff/7011/7012 File src/url_canon_ip.cc (right): http://codereview.chromium.org/99183/diff/7011/7012#newcode354 Line 354: // Reached the end of the input, DONE. On 2009/05/17 04:14:16, brettw wrote: > I think this comment would make more sense as an end of line comment after the > return (since getting to where the comment is doesn't mean we reached the end). Done. http://codereview.chromium.org/99183/diff/7011/7012#newcode432 Line 432: uint32_t IPv6HexComponentToNumber(const CHAR* spec, On 2009/05/17 04:14:16, brettw wrote: > Is there any reason this can't be a uint16_t? Done. http://codereview.chromium.org/99183/diff/7011/7012#newcode486 Line 486: uint32_t number = IPv6HexComponentToNumber<CHAR>( On 2009/05/17 04:14:16, brettw wrote: > If we can change to uint16_t above, be sure to change this one as well. Done. http://codereview.chromium.org/99183/diff/7011/7012#newcode578 Line 578: for (int i = 0; i < 16; i += 2) { On 2009/05/17 04:14:16, brettw wrote: > On 2009/05/14 03:13:27, pmarks wrote: > > Here's a more efficient version of the for loop. It lets you get rid of > > IsInRange(): > > > > for (int i = 0; i < 16; ) { > > if (i == contraction_range.begin && contraction_range.len > 0) { > > // Jump over the contraction. > > if (i == 0) > > output->push_back(':'); > > output->push_back(':'); > > i = contraction_range.end(); > > > > } else { > > // Consume the next 16 bits from |address|. > > const uint32_t x = address[i] << 8 | address[i + 1]; > > Maybe this should just be an int since we pass to itoa below? I'd also lose the > const. Done. > > > > i += 2; > > > > // Stringify the 16 bit number (at most requires 4 hex digits). > > char str[5]; > > _itoa_s(x, str, 16); > > for (int ch = 0; str[ch] != '\0'; ++ch) > > output->push_back(str[ch]); > > > > // Put a colon after each number, except the last. > > if (i < 16) { > > output->push_back(':'); > > } > > I wouldn't use {} here to keep it like the rest of the file. Done. > > http://codereview.chromium.org/99183/diff/7011/7012#newcode578 Line 578: for (int i = 0; i < 16; i += 2) { On 2009/05/14 03:13:27, pmarks wrote: > Here's a more efficient version of the for loop. It lets you get rid of > IsInRange(): > > for (int i = 0; i < 16; ) { > if (i == contraction_range.begin && contraction_range.len > 0) { > // Jump over the contraction. > if (i == 0) > output->push_back(':'); > output->push_back(':'); > i = contraction_range.end(); > > } else { > // Consume the next 16 bits from |address|. > const uint32_t x = address[i] << 8 | address[i + 1]; > i += 2; > > // Stringify the 16 bit number (at most requires 4 hex digits). > char str[5]; > _itoa_s(x, str, 16); > for (int ch = 0; str[ch] != '\0'; ++ch) > output->push_back(str[ch]); > > // Put a colon after each number, except the last. > if (i < 16) { > output->push_back(':'); > } > } > } > Done. That is a better loop indeed, thanks! http://codereview.chromium.org/99183/diff/7011/7014 File src/url_canon_unittest.cc (right): http://codereview.chromium.org/99183/diff/7011/7014#newcode610 Line 610: On 2009/05/17 04:14:16, brettw wrote: > Can you add a testcase with a single colon at the beginning? There is some code > to handle this and I don't see a test. Done.
http://codereview.chromium.org/99183/diff/6030/7019 File src/url_canon_ip.cc (right): http://codereview.chromium.org/99183/diff/6030/7019#newcode572 Line 572: for (int i = 0; i < 16; ) { I suppose you should get rid of the space before the ), as lint doesn't seem to like it. http://codereview.chromium.org/99183/diff/6030/7021 File src/url_canon_unittest.cc (right): http://codereview.chromium.org/99183/diff/6030/7021#newcode613 Line 613: {"[0:0::0:0:8:]", L"[:0:0::0:0:8:]", "", url_parse::Component(), false}, The first two strings aren't the same. Is this intentional?
LGTM, otherwise.
http://codereview.chromium.org/99183/diff/6030/7019 File src/url_canon_ip.cc (right): http://codereview.chromium.org/99183/diff/6030/7019#newcode572 Line 572: for (int i = 0; i < 16; ) { On 2009/05/18 21:12:48, pmarks wrote: > I suppose you should get rid of the space before the ), as lint doesn't seem to > like it. Done. http://codereview.chromium.org/99183/diff/6030/7021 File src/url_canon_unittest.cc (right): http://codereview.chromium.org/99183/diff/6030/7021#newcode613 Line 613: {"[0:0::0:0:8:]", L"[:0:0::0:0:8:]", "", url_parse::Component(), false}, On 2009/05/18 21:12:48, pmarks wrote: > The first two strings aren't the same. Is this intentional? Fixed. (no, not intentional). |