|
|
Created:
7 years, 3 months ago by Peter Kasting Modified:
7 years, 3 months ago CC:
chromium-reviews, James Su, erikwright+watch_chromium.org, jshin+watch_chromium.org Base URL:
svn://chrome-svn/chrome/trunk/src/ Visibility:
Public. |
DescriptionSwitch the offset conversion routines from an "offsets point at characters"
worldview to an "offsets point between characters" worldview.
This more closely aligns with how the omnibox autocomplete code (which is what
this was originally written for) expects things to behave.
Direct fallout from this change:
* An input offset of 0 will always map to an output offset of 0.
* An input offset of (length of string) will always map to the length of the
output string, instead of npos.
* It's possible for multiple unique input offsets to map to a single non-npos
output offset, if they e.g. point to the start and end of a collapsed
sequence.
* Input offsets pointing into the middle of a completely-removed sequence may
not be set to npos if they fall on the boundaries of a subsequence processed
by the transformer. For example, when running FormatUrlWithOffsets() on
"http://user:pass@domain.com/" and directing it to omit both the scheme and
username/password, an input offset of "7" that points in between the scheme
and the username/password will be transformed to an output offset of 0
instead of npos.
Indirect fallout:
* A caller like SearchProvider::NavigationToMatch() will now mark certain
matches as "allowed to be default" that it didn't before. Specifically, if
the user's input string ends at the same point as the desired
|fill_into_edit|, the autocomplete offset will be calculated as (length of
string) instead of npos, and thus the match will be thought of as "inlinable"
and thus "allowed to be default".
BUG=284781
TEST=none
R=msw@chromium.org, willchan@chromium.org
Committed: https://src.chromium.org/viewvc/chrome?view=rev&revision=222426
Patch Set 1 #
Total comments: 17
Patch Set 2 : #
Total comments: 43
Patch Set 3 : #Patch Set 4 : #
Total comments: 10
Patch Set 5 : #
Total comments: 1
Messages
Total messages: 14 (0 generated)
There are no great reviewers for this, since mrossetti wrote the original code and I reviewed. willchan: base/ msw: search_provider_unittest.cc
Running off to a meeting, just wanted to say a few things first: * The fix looks right to me, thanks. * I'm concerned that there might be fallout from the fix. I looked up various callers of LimitOffset, and their callers, and it looks like it's used in a number of different places. It's possible it might fix other bugs as well, but it might also regress behavior. * That said, I think we should land this anyway and hope our bug triage catches any fallout. I'll review the test in detail after this meeting and then send approval.
https://codereview.chromium.org/23619016/diff/1/base/strings/utf_offset_strin... File base/strings/utf_offset_string_conversions.h (right): https://codereview.chromium.org/23619016/diff/1/base/strings/utf_offset_strin... base/strings/utf_offset_string_conversions.h:47: // which is greater than |limit| with npos. I share Will's concern that other callers might have properly expected the old behavior. Some callers might try size_t i = 1; std::string text = "a"; LimitOffset(...text.length()...); ...text[i]... or similar. Can you point to the codepath by which SearchProvider exercises this code? https://codereview.chromium.org/23619016/diff/1/base/strings/utf_offset_strin... File base/strings/utf_offset_string_conversions_unittest.cc (right): https://codereview.chromium.org/23619016/diff/1/base/strings/utf_offset_strin... base/strings/utf_offset_string_conversions_unittest.cc:33: {"A\xF0\x90\x8C\x80z", 5, 3}, nit: also test 6 -> 4? https://codereview.chromium.org/23619016/diff/1/base/strings/utf_offset_strin... base/strings/utf_offset_string_conversions_unittest.cc:56: {{'A', 0x00bc, 0x00be, 'z'}, 3, 5}, nit: also test 4 -> 6? https://codereview.chromium.org/23619016/diff/1/base/strings/utf_offset_strin... base/strings/utf_offset_string_conversions_unittest.cc:60: {{'A', 0xd800, 0xdf00, 'z'}, 3, 5}, nit: also test 4 -> 6? https://codereview.chromium.org/23619016/diff/1/base/strings/utf_offset_strin... base/strings/utf_offset_string_conversions_unittest.cc:80: if (*ti <= kLimit) nit: I think you should actually change this to just check *ti != kNpos. https://codereview.chromium.org/23619016/diff/1/base/strings/utf_offset_strin... base/strings/utf_offset_string_conversions_unittest.cc:94: if (*ti <= kLimit) nit: ditto https://codereview.chromium.org/23619016/diff/1/base/strings/utf_offset_strin... base/strings/utf_offset_string_conversions_unittest.cc:100: TEST(UTFOffsetStringConversionsTest, AdjustAndLimitOffsets) { nit: LimitOffsets seems fairly well tested above. I think adding use of LimitOffsets here unnecessarily conflates testing of the the two functions, and I would leave the test as-is instead. https://codereview.chromium.org/23619016/diff/1/chrome/browser/autocomplete/s... File chrome/browser/autocomplete/search_provider_unittest.cc (right): https://codereview.chromium.org/23619016/diff/1/chrome/browser/autocomplete/s... chrome/browser/autocomplete/search_provider_unittest.cc:1960: { "a.com", "[\"a.com\",[\"http://a.com/a\"],[],[]," Shouldn't this test have caught the failing-to-inline a navsuggest problem? Was it not caught because the field trial isn't exercised in tests?
I'm ignoring all of Mike's comments for now because the net_util test failures exposed more widespread issues which may subsume those. Those issues are mostly locally solved, but I realized tonight I have created something of an existential crisis regarding these offset conversion functions, and I'm not sure which way to go. The fundamental problem here comes from a disconnect between two worldviews: one in which offsets are viewed as pointing at specific characters within the string, and one in which offsets point _between_ characters in the string. (A computer graphics analogy would be the clash between coordinate systems where the center of a pixel is at the .0 versus .5 coordinate.) The existing conversion functions assume offsets point at characters, not between character. A couple behaviors fall out of this: * Pointing "one past the end" points out of the string, just like pointing two or 57 characters past the end. Therefore, the input offset is invalid, and the core change in this patch is wrong. * Unique input offsets are guaranteed to be transformed to unique output offsets or npos. However, the |inline_autocomplete_offset| in the omnibox world functions more like an offset between characters, an insertion point into the string to be precise. Given a string of length n, all offsets from 0 to n inclusive represent boundary points that touch at least one character within the string. Since code like SearchProvider::NavigationToMatch() tries to transform such offsets, it reasonably expects that an offset "just after the last character" is valid and will remain so after transformation. This view of the world has other implications for the offset conversion functions, however. Specifically, unique input offsets may be transformed to equal, non-npos output offsets. The most obvious case is if we take the string "a", with offsets 0 and 1, and collapse to "". In the "offsets point at characters" worldview, the 0 offset pointed at 'a', which is gone, so that becomes npos; and the 1 pointed out of the string, so that's npos too. In the "offsets point between characters" worldview, the 0 points before the beginning of the string and remains 0 (as, indeed, will always happen in this worldview), while the 1 points just past the end and also becomes 0 when the string is shrunk to zero size. This can become even more confusing when one considers a function like net_util::FormatUrlWithOffsets(). Let's take the URL "http://user:pass@domain.com/", and run it through a transformation which strips both the scheme and the username/password (for an output of "domain.com/"). In the "offsets point between characters" worldview, because FormatUrlWithOffsets() works piecewise on one URL component at a time, a string offset of "7" -- which points just after the scheme -- is first transformed to "0" when we strip the scheme. It then remains 0 as we strip the username/password that follow it. The consequence is that if you have three offsets (6, 7, 8) that on input all point into the middle of a block of text that will be removed, on output these become (npos, 0, npos). At first this seemed obviously wrong to me, but the more I thought about it the more this actually began to seem correct, in the sense that FormatUrlWithOffsets() treats the separate pieces of a URL as logical units the same way UTF8ToUTF16AndAdjustOffsets() treats Unicode code sequences as logical units, and if you point between two of them, it's reasonable for your offset to be preserved. So, as I said, I am nearly done locally with changing both the UTF conversion functions and the net_util ones to behave in this second fashion. At that point I can run it through the trybots again and see what else breaks. But it's worth asking: does this worldview change make sense? Is this really how the offset conversions should work, conceptually? Or was the old "offsets point _at_ particular characters" worldview the right one? If we like the old worldview, the obvious question is how to support something like NavigationToMatch() or other functions that want to refer to "one past the end". We can try to special-case this without making other changes -- which is sort of the direction the current CL here goes, though not far enough -- but it led to problems in my first attempt, though now that I understand the problem space better, I think it's probably possible. Still, it feels like a hack, and I'm not sure it's right overall. The fact that I'm waffling is what's causing me to write all this, in the hopes that you guys will have input.
On 2013/09/05 06:07:27, Peter Kasting wrote: > I'm ignoring all of Mike's comments for now because the net_util test failures > exposed more widespread issues which may subsume those. Those issues are mostly > locally solved, but I realized tonight I have created something of an > existential crisis regarding these offset conversion functions, and I'm not sure > which way to go. > > The fundamental problem here comes from a disconnect between two worldviews: one > in which offsets are viewed as pointing at specific characters within the > string, and one in which offsets point _between_ characters in the string. (A > computer graphics analogy would be the clash between coordinate systems where > the center of a pixel is at the .0 versus .5 coordinate.) > > The existing conversion functions assume offsets point at characters, not > between character. A couple behaviors fall out of this: > * Pointing "one past the end" points out of the string, just like pointing two > or 57 characters past the end. Therefore, the input offset is invalid, and the > core change in this patch is wrong. > * Unique input offsets are guaranteed to be transformed to unique output offsets > or npos. > > However, the |inline_autocomplete_offset| in the omnibox world functions more > like an offset between characters, an insertion point into the string to be > precise. Given a string of length n, all offsets from 0 to n inclusive > represent boundary points that touch at least one character within the string. > Since code like SearchProvider::NavigationToMatch() tries to transform such > offsets, it reasonably expects that an offset "just after the last character" is > valid and will remain so after transformation. > > This view of the world has other implications for the offset conversion > functions, however. Specifically, unique input offsets may be transformed to > equal, non-npos output offsets. The most obvious case is if we take the string > "a", with offsets 0 and 1, and collapse to "". In the "offsets point at > characters" worldview, the 0 offset pointed at 'a', which is gone, so that > becomes npos; and the 1 pointed out of the string, so that's npos too. In the > "offsets point between characters" worldview, the 0 points before the beginning > of the string and remains 0 (as, indeed, will always happen in this worldview), > while the 1 points just past the end and also becomes 0 when the string is > shrunk to zero size. > > This can become even more confusing when one considers a function like > net_util::FormatUrlWithOffsets(). Let's take the URL > "http://user:pass@domain.com/", and run it through a transformation which strips > both the scheme and the username/password (for an output of "domain.com/"). In > the "offsets point between characters" worldview, because FormatUrlWithOffsets() > works piecewise on one URL component at a time, a string offset of "7" -- which > points just after the scheme -- is first transformed to "0" when we strip the > scheme. It then remains 0 as we strip the username/password that follow it. > The consequence is that if you have three offsets (6, 7, 8) that on input all > point into the middle of a block of text that will be removed, on output these > become (npos, 0, npos). At first this seemed obviously wrong to me, but the > more I thought about it the more this actually began to seem correct, in the > sense that FormatUrlWithOffsets() treats the separate pieces of a URL as logical > units the same way UTF8ToUTF16AndAdjustOffsets() treats Unicode code sequences > as logical units, and if you point between two of them, it's reasonable for your > offset to be preserved. > > So, as I said, I am nearly done locally with changing both the UTF conversion > functions and the net_util ones to behave in this second fashion. At that point > I can run it through the trybots again and see what else breaks. But it's worth > asking: does this worldview change make sense? Is this really how the offset > conversions should work, conceptually? Or was the old "offsets point _at_ > particular characters" worldview the right one? > > If we like the old worldview, the obvious question is how to support something > like NavigationToMatch() or other functions that want to refer to "one past the > end". We can try to special-case this without making other changes -- which is > sort of the direction the current CL here goes, though not far enough -- but it > led to problems in my first attempt, though now that I understand the problem > space better, I think it's probably possible. Still, it feels like a hack, and > I'm not sure it's right overall. > > The fact that I'm waffling is what's causing me to write all this, in the hopes > that you guys will have input. Sorry I didn't read all that, but hopefully my thoughts are still relevant. Cursor positioning and selection requires two positions for a string of one character length, like "a". That allows a cursor '|' to be placed at "|a" and "a|". Similar indexing is needed for selection/style ranges. Can you try to avoid changing the contract of the LimitOffset function, perhaps making its behavior comment even more explicit, and change user to supply a limit of string.length()-1 as needed?
On 2013/09/05 19:18:01, msw wrote: > Sorry I didn't read all that, but hopefully my thoughts are still relevant. :( I realize it's an enormous wall of text, but I wrote it so that you guys would read and ponder it. It's not very helpful to me for you to not read the background on this. > Cursor positioning and selection requires two positions for a string of one > character length, like "a". That allows a cursor '|' to be placed at "|a" and > "a|". Similar indexing is needed for selection/style ranges. Can you try to > avoid changing the contract of the LimitOffset function, perhaps making its > behavior comment even more explicit, and change user to supply a limit of > string.length()-1 as needed? It seems like the first half of this comment is arguing for the worldview change I'm making here, and the second half is arguing against it. Could you explain more how the first two sentences lead you to the concluding sentence? After all, if LimitOffset can't support an input offset of string.length(), how can it support the sort of positioning required for cursor and selection manipulation?
New snap up that implements the larger worldview changeover. https://codereview.chromium.org/23619016/diff/1/base/strings/utf_offset_strin... File base/strings/utf_offset_string_conversions.h (right): https://codereview.chromium.org/23619016/diff/1/base/strings/utf_offset_strin... base/strings/utf_offset_string_conversions.h:47: // which is greater than |limit| with npos. On 2013/09/04 23:58:50, msw wrote: > I share Will's concern that other callers might have properly expected the old > behavior. Some callers might try size_t i = 1; std::string text = "a"; > LimitOffset(...text.length()...); ...text[i]... or similar. > > Can you point to the codepath by which SearchProvider exercises this code? See bug 284781 comment 16 for the full description. https://codereview.chromium.org/23619016/diff/1/base/strings/utf_offset_strin... File base/strings/utf_offset_string_conversions_unittest.cc (right): https://codereview.chromium.org/23619016/diff/1/base/strings/utf_offset_strin... base/strings/utf_offset_string_conversions_unittest.cc:33: {"A\xF0\x90\x8C\x80z", 5, 3}, On 2013/09/04 23:58:50, msw wrote: > nit: also test 6 -> 4? Done. https://codereview.chromium.org/23619016/diff/1/base/strings/utf_offset_strin... base/strings/utf_offset_string_conversions_unittest.cc:56: {{'A', 0x00bc, 0x00be, 'z'}, 3, 5}, On 2013/09/04 23:58:50, msw wrote: > nit: also test 4 -> 6? Done. https://codereview.chromium.org/23619016/diff/1/base/strings/utf_offset_strin... base/strings/utf_offset_string_conversions_unittest.cc:60: {{'A', 0xd800, 0xdf00, 'z'}, 3, 5}, On 2013/09/04 23:58:50, msw wrote: > nit: also test 4 -> 6? Done. https://codereview.chromium.org/23619016/diff/1/base/strings/utf_offset_strin... base/strings/utf_offset_string_conversions_unittest.cc:80: if (*ti <= kLimit) On 2013/09/04 23:58:50, msw wrote: > nit: I think you should actually change this to just check *ti != kNpos. Done. https://codereview.chromium.org/23619016/diff/1/base/strings/utf_offset_strin... base/strings/utf_offset_string_conversions_unittest.cc:94: if (*ti <= kLimit) On 2013/09/04 23:58:50, msw wrote: > nit: ditto Done. https://codereview.chromium.org/23619016/diff/1/base/strings/utf_offset_strin... base/strings/utf_offset_string_conversions_unittest.cc:100: TEST(UTFOffsetStringConversionsTest, AdjustAndLimitOffsets) { On 2013/09/04 23:58:50, msw wrote: > nit: LimitOffsets seems fairly well tested above. I think adding use of > LimitOffsets here unnecessarily conflates testing of the the two functions, and > I would leave the test as-is instead. Done. https://codereview.chromium.org/23619016/diff/1/chrome/browser/autocomplete/s... File chrome/browser/autocomplete/search_provider_unittest.cc (right): https://codereview.chromium.org/23619016/diff/1/chrome/browser/autocomplete/s... chrome/browser/autocomplete/search_provider_unittest.cc:1960: { "a.com", "[\"a.com\",[\"http://a.com/a\"],[],[]," On 2013/09/04 23:58:50, msw wrote: > Shouldn't this test have caught the failing-to-inline a navsuggest problem? This didn't catch it because the inline autocompletion was non-empty. I'm adding another test here to catch this (redundant with the ones above).
On 2013/09/05 19:31:36, Peter Kasting wrote: > On 2013/09/05 19:18:01, msw wrote: > > Sorry I didn't read all that, but hopefully my thoughts are still relevant. > > :( I realize it's an enormous wall of text, but I wrote it so that you guys > would read and ponder it. It's not very helpful to me for you to not read the > background on this. > > > Cursor positioning and selection requires two positions for a string of one > > character length, like "a". That allows a cursor '|' to be placed at "|a" and > > "a|". Similar indexing is needed for selection/style ranges. Can you try to > > avoid changing the contract of the LimitOffset function, perhaps making its > > behavior comment even more explicit, and change user to supply a limit of > > string.length()-1 as needed? > > It seems like the first half of this comment is arguing for the worldview change > I'm making here, and the second half is arguing against it. Could you explain > more how the first two sentences lead you to the concluding sentence? > > After all, if LimitOffset can't support an input offset of string.length(), how > can it support the sort of positioning required for cursor and selection > manipulation? Sorry, but I'm just really swamped; and this re-review took several hours... I read your wall of text and looked at the new patch set, and I see your pain. My last comment mistakenly thought you were removing index == limit support. Regarding your "http://user:pass@domain.com/" example, I think simply clamping indices across text modifications is wrong. (6, 7, 8) should probably become (0, 0, 0) when stripping the scheme/username/password, not (npos, 0, npos). Indices pointing to text that was removed shouldn't be treated as abstract offsets independent of the actual text content, they should probably be clamped to the leading index of the removed content. So, I'd suggest not letting that behavior weigh in on your LimitOffset decision if it shouldn't be using LimitOffset at all, and can be fixed tangentially. Anyway, it seems like some valid LimitOffset callers will need to use +1 or -1 regardless of the ultimate behavior. I appreciate your new deeper understanding of the issue and recognize that you've gone ahead with changing the worldview of this function, but I suspect that simply adding 1 to the string length for the "offsets point between characters" worldview users would be good enough. Gosh, I sure hope I can get some of my own work done someday... https://codereview.chromium.org/23619016/diff/1/chrome/browser/autocomplete/s... File chrome/browser/autocomplete/search_provider_unittest.cc (right): https://codereview.chromium.org/23619016/diff/1/chrome/browser/autocomplete/s... chrome/browser/autocomplete/search_provider_unittest.cc:1960: { "a.com", "[\"a.com\",[\"http://a.com/a\"],[],[]," On 2013/09/05 20:11:59, Peter Kasting wrote: > On 2013/09/04 23:58:50, msw wrote: > > Shouldn't this test have caught the failing-to-inline a navsuggest problem? > > This didn't catch it because the inline autocompletion was non-empty. > > I'm adding another test here to catch this (redundant with the ones above). Great! I was going to say that your new "a" -> "https://a" test was good, but having one with .com would be even better, and here it is! https://codereview.chromium.org/23619016/diff/29001/chrome/browser/autocomple... File chrome/browser/autocomplete/search_provider_unittest.cc (right): https://codereview.chromium.org/23619016/diff/29001/chrome/browser/autocomple... chrome/browser/autocomplete/search_provider_unittest.cc:1006: { "[\"a\",[\"https://a/\"],[],[]," nit: is the trailing '/' needed here and below? It's not obvious that it exercises the empty inline (and trailing index offset) codepath. https://codereview.chromium.org/23619016/diff/29001/net/base/escape_unittest.cc File net/base/escape_unittest.cc (right): https://codereview.chromium.org/23619016/diff/29001/net/base/escape_unittest.... net/base/escape_unittest.cc:340: {"", 0, 0}, nit: also test {"", 1, std::string::npos} https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc File net/base/net_util.cc (left): https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#oldc... net/base/net_util.cc:514: std::vector<size_t> OffsetsIntoComponent( Leave this helper intact, instead of writing two new similar loops. https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#oldc... net/base/net_util.cc:1730: size_t colon = parsed.username.end(); What was this code doing and why is it no longer necessary? https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#oldc... net/base/net_util.cc:1744: size_t at_sign = (parsed.password.is_valid() ? Ditto: What was this code doing and why is it no longer necessary? https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc File net/base/net_util.cc (right): https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc... net/base/net_util.cc:546: for (Offsets::iterator i(offsets_into_underlying_url.begin()); nit: use consistent iter/index looping if you're changing a lot of code here. https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc... net/base/net_util.cc:548: *i = ((*i == std::string::npos) || (*i < kViewSourceLength)) ? Why should offsets into trimmed leading text go to npos instead of 0? (I suspect that pre-existing behavior is wrong, but don't really know) https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc... net/base/net_util.cc:568: for (size_t i = 0; i < original_offsets.size(); ++i) { Why does this loop use |original_offsets| at all? Shouldn't it just add kViewSourceLength to all the non-npos |offsets_into_underlying_url|? https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc... net/base/net_util.cc:571: (*offsets_for_adjustment)[i] = (new_offset == base::string16::npos) ? offsets_for_adjustment may be NULL, check for that. https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc... net/base/net_util.cc:650: // Transform |original_offsets| into a vetor of offsets relative to nit: 'vector' https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc... net/base/net_util.cc:658: // |original_component_begin|. (Other offsets are ignored, since it Is it even worthwhile to ignore other offsets? It requires adding this comment and another conditional term to avoid a subtraction and assignment. Might it even be wrong to ignore those if later string/offset use assume they're updated to account for the transformed component? https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc... net/base/net_util.cc:676: // This offset originally pointed into the transformed component. I find it odd that we are adjusting offsets within transformed components at all. Can you provide an example of when this is done? Overall, I wonder if we could simply calculate offsets on formatted URLs, instead of calculating un-formatted URL offsets then formatting the URL and either transforming or trashing offsets. Perhaps knowing an actual use might help me understand. https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc... net/base/net_util.cc:684: (original_offset != std::string::npos)) { nit: fix indent https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc... net/base/net_util.cc:688: original_offset - original_component_end + output->length(); I suppose either is approach should yield the same result, but shouldn't this really use |original_component.len| instead of |original_component_end| and |component_str.length()| instead of |output->length()| to be 'correct'? https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc... net/base/net_util.cc:1764: base::OffsetAdjuster offset_adjuster(offsets_for_adjustment); What is this doing? Adjusting offsets past a removed trailing slash? https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc... net/base/net_util.cc:1777: AppendFormattedComponent(spec, parsed.ref, original_offsets, Nice! https://codereview.chromium.org/23619016/diff/29001/net/base/net_util_unittes... File net/base/net_util_unittest.cc (right): https://codereview.chromium.org/23619016/diff/29001/net/base/net_util_unittes... net/base/net_util_unittest.cc:2900: {0, 0}, Aren't all these cases except 27, 500000, and base::string16::npos covered by the basic_offsets verification of all offsets? That seems to be a general trend in all of these tests. Can you remove the AdjustOffsetCase arrays, and simply supply the |all_offsets| array to CheckAdjustedOffsets. It would be easy to append a few extra test cases within CheckAdjustedOffsets that should yield npos for all callers, like (|url_length| + 1), (|url_length| + 2), (|url_length| + 50), and npos. https://codereview.chromium.org/23619016/diff/29001/net/base/net_util_unittes... net/base/net_util_unittest.cc:2914: const size_t basic_offsets[] = { Optionally, you could change the size_t arrays of expected offsets to int arrays of expected offset deltas (if you think it'd be easier to read and worthwhile). In this case they'd all be 0, but a case that does remove components, like omit_auth_offsets_1, could be: const size_t omit_auth_offsets_1[] = { 0, 0, 0, 0, 0, 0, 0, 0, kNpos, kNpos, kNpos, kNpos, kNpos, kNpos, kNpos, -7, -7, -7, -7, -7, -7, -7, -7, -7, -7, -7, -7, -7, -7, -7 }; https://codereview.chromium.org/23619016/diff/29001/net/base/net_util_unittes... net/base/net_util_unittest.cc:2916: 21, 22, 23, 24, 25 Shouldn't 26 now also be part of this array for the all-offsets check? Ditto below for all the size_t arrays terminal values at the url length. https://codereview.chromium.org/23619016/diff/29001/net/base/net_util_unittes... net/base/net_util_unittest.cc:2923: {6, 6}, Again, all of these cases are redundant with the omit_auth_offsets_1 verification of all offsets. Hopefully you can simplify these tests a little; I've removed other pending comments to the same effect after expanding on these first two. https://codereview.chromium.org/23619016/diff/29001/net/base/net_util_unittes... net/base/net_util_unittest.cc:2954: // "http://foo\x30B0:\x30B0bar@www.google.com" nit: can you expand on this comment, I guess this is the expected formatted URL, but saying so explicitly (perhaps moving this to be right before the input URL) would clarify its meaning. Ditto for the other similar comments below.
Thanks for the thorough review, although I hadn't been expecting you to review net/ or base/! New snap up. I modified some .h comments to try and reflect the "modified worldview" of these functions. On 2013/09/06 00:31:07, msw wrote: > Regarding your "http://user:pass@domain.com/" example, I think simply clamping > indices across text modifications is wrong. (6, 7, 8) should probably become (0, > 0, 0) when stripping the scheme/username/password, not (npos, 0, npos). It depends on whether you view FormatURLWithOffsets() as working on a character-by-character basis -- basically just a wrapper around UTF8ToUTF16() and the like -- or on a functional block basis, where "the scheme" is an entity a bit like a long Unicode encoding sequence is an entity. I think it makes more sense to view it as the latter, in which case converting "offset into the middle of an entity that no longer exists" to npos seems more correct. Plus I worry some callers may intentionally rely on the current behavior to detect offsets that point into the middles of such entities. I am not sure though. > Anyway, it seems like some valid LimitOffset callers will need to use +1 or -1 > regardless of the ultimate behavior. How so? I haven't found any in my patch... > I suspect that simply adding 1 to the string length for the > "offsets point between characters" worldview users would be good enough. The problem with this is that the transformation functions don't take an arbitrary limit point, the infer it from the provided string. So to pretend you have a longer string, you'd have to have callers actually add characters to the end of the string, just to allow for the possibility of an offset at the end of the original string. That seems really hacky and gross, and might have a large footprint across the omnibox code. https://codereview.chromium.org/23619016/diff/29001/chrome/browser/autocomple... File chrome/browser/autocomplete/search_provider_unittest.cc (right): https://codereview.chromium.org/23619016/diff/29001/chrome/browser/autocomple... chrome/browser/autocomplete/search_provider_unittest.cc:1006: { "[\"a\",[\"https://a/\"],[],[]," On 2013/09/06 00:31:08, msw wrote: > nit: is the trailing '/' needed here and below? It's not obvious that it > exercises the empty inline (and trailing index offset) codepath. I actually wanted to make sure the trailing slash case still got handled properly (with the slash stripped off in the output). The "a.com" case omits the trailing slash, so we test both. https://codereview.chromium.org/23619016/diff/29001/net/base/escape_unittest.cc File net/base/escape_unittest.cc (right): https://codereview.chromium.org/23619016/diff/29001/net/base/escape_unittest.... net/base/escape_unittest.cc:340: {"", 0, 0}, On 2013/09/06 00:31:08, msw wrote: > nit: also test {"", 1, std::string::npos} Done. https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc File net/base/net_util.cc (left): https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#oldc... net/base/net_util.cc:514: std::vector<size_t> OffsetsIntoComponent( On 2013/09/06 00:31:08, msw wrote: > Leave this helper intact, instead of writing two new similar loops. Restored these two functions. I originally gutted these because the callsites wanted different (sometimes simpler) behavior, but after some changes relative to your comments, I ended up going back to these. See discussion below for information about behavior changes. https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#oldc... net/base/net_util.cc:1730: size_t colon = parsed.username.end(); On 2013/09/06 00:31:08, msw wrote: > What was this code doing and why is it no longer necessary? This is somewhat subtle. Before this change, AppendFormattedComponent() only transformed offsets within the component, and left ones after it untouched. The result was that as the series of AppendFormattedComponent() calls occurred, groups of offsets would be set correctly, with the ones after them left completely invalid. One consequence was that any offsets referring to text that was _not_ transformed by AppendFormattedComponent() -- such as the colon and at-sign in a username/password section of the URL -- had to be fixed up manually. That's what this code, and the code below, are doing. The other consequence was that the offset pointing at the end of the URL, since it did not refer to any text that would be transformed, was never fixed up at all. (A third consequence, which is that during the middle of making these calls the state of |offsets_for_adjustment| was somewhat incoherent, was only relevant to someone debugging the code and trying to figure out why later offsets weren't being dynamically updated as earlier components changed length.) In the new version of the patch, AppendFormattedComponent() also adjusts all offsets _after_ the transformed component, which means that none of these consequences are present anymore. This obviates the need for these two blocks of code, also avoids an additional block to fix the offset at the end of the string, and finally is less confusing while debugging because |offsets_for_adjustment| always reflects the correct values for if you just dumped the rest of the string on unmodified and returned immediately. https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc File net/base/net_util.cc (right): https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc... net/base/net_util.cc:546: for (Offsets::iterator i(offsets_into_underlying_url.begin()); On 2013/09/06 00:31:08, msw wrote: > nit: use consistent iter/index looping if you're changing a lot of code here. This is no longer so noticeably inconsistent, since the two loops are now back in separate helper functions. https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc... net/base/net_util.cc:548: *i = ((*i == std::string::npos) || (*i < kViewSourceLength)) ? On 2013/09/06 00:31:08, msw wrote: > Why should offsets into trimmed leading text go to npos instead of 0? > (I suspect that pre-existing behavior is wrong, but don't really know) It doesn't actually matter what we set them to, since we won't pay attention to their values after the transformation. The helper function now does something simpler with these, based on your comments on the similar code in AppendFormattedComponent(). https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc... net/base/net_util.cc:568: for (size_t i = 0; i < original_offsets.size(); ++i) { On 2013/09/06 00:31:08, msw wrote: > Why does this loop use |original_offsets| at all? Shouldn't it just add > kViewSourceLength to all the non-npos |offsets_into_underlying_url|? This is now moot since this once again calls AdjustForComponentTransform(). https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc... net/base/net_util.cc:571: (*offsets_for_adjustment)[i] = (new_offset == base::string16::npos) ? On 2013/09/06 00:31:08, msw wrote: > offsets_for_adjustment may be NULL, check for that. This is now moot since this once again calls AdjustForComponentTransform(). https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc... net/base/net_util.cc:650: // Transform |original_offsets| into a vetor of offsets relative to On 2013/09/06 00:31:08, msw wrote: > nit: 'vector' This is now moot. https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc... net/base/net_util.cc:658: // |original_component_begin|. (Other offsets are ignored, since it On 2013/09/06 00:31:08, msw wrote: > Is it even worthwhile to ignore other offsets? It requires adding this comment > and another conditional term to avoid a subtraction and assignment. You're right, we can unconditionally subtract since we'll ignore all the out-of-range offsets anyway. Doing things like subtracting a value from npos looks scary, but since we ignore such values, it doesn't matter. I simplified the code in OffsetsIntoComponent(). > Might it > even be wrong to ignore those if later string/offset use assume they're updated > to account for the transformed component? Code after this function returns does assume these are updated (which is a change from before this patch; see my long reply in the username/password handling section of FormatUrlWithOffsets()), so we do set the correct output values for them. I tried to make this clearer in the comments on OffsetsIntoComponent()/AdjustForComponentTransform(). https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc... net/base/net_util.cc:676: // This offset originally pointed into the transformed component. On 2013/09/06 00:31:08, msw wrote: > I find it odd that we are adjusting offsets within transformed components at > all. Can you provide an example of when this is done? Yes, any time the path needs escaping or unescaping. Each escaped character is generally equivalent to between three and nine unescaped characters, so any offsets that point anywhere within that sequence get adjusted during the actual escaping/unescaping process. This happens in real-world cases when people e.g. type something in the omnibox that the HQP matches against an escaped part of a URL's path. https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc... net/base/net_util.cc:684: (original_offset != std::string::npos)) { On 2013/09/06 00:31:08, msw wrote: > nit: fix indent Done. https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc... net/base/net_util.cc:688: original_offset - original_component_end + output->length(); On 2013/09/06 00:31:08, msw wrote: > I suppose either is approach should yield the same result, but shouldn't this > really use |original_component.len| instead of |original_component_end| and > |component_str.length()| instead of |output->length()| to be 'correct'? They do yield the same result. Now that this code has moved back into AdjustForComponentTransform() and the code says "transformed_component_end" instead of "output->length()", I think the algorithm is more obvious. https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc... net/base/net_util.cc:1764: base::OffsetAdjuster offset_adjuster(offsets_for_adjustment); On 2013/09/06 00:31:08, msw wrote: > What is this doing? Adjusting offsets past a removed trailing slash? Yes. This only really matters for when we have an offset at the end of the input string and this trailing slash was the last thing in the input. In that case we have to do this fixup to get that offset correct. Otherwise, one of the AppendFormattedComponent() calls below will correct any such trailing offset, due to the new behavior I described in my reply just above. https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc... net/base/net_util.cc:1777: AppendFormattedComponent(spec, parsed.ref, original_offsets, On 2013/09/06 00:31:08, msw wrote: > Nice! Yeah... old code made me facepalm. https://codereview.chromium.org/23619016/diff/29001/net/base/net_util_unittes... File net/base/net_util_unittest.cc (right): https://codereview.chromium.org/23619016/diff/29001/net/base/net_util_unittes... net/base/net_util_unittest.cc:2900: {0, 0}, On 2013/09/06 00:31:08, msw wrote: > Aren't all these cases except 27, 500000, and base::string16::npos covered by > the basic_offsets verification of all offsets? Yes. > Can you remove the AdjustOffsetCase arrays, and simply > supply the |all_offsets| array to CheckAdjustedOffsets. I could, but then every case would have to declare two long "every possible offset" arrays instead of one. Both routes seem roughly as lame as each other. I had an "eww, gross" reaction when I first read this code, but there doesn't seem to be an answer that (to me, at least) makes me say "ahh, that's so much less redundant and crappy". YMMV. https://codereview.chromium.org/23619016/diff/29001/net/base/net_util_unittes... net/base/net_util_unittest.cc:2914: const size_t basic_offsets[] = { On 2013/09/06 00:31:08, msw wrote: > Optionally, you could change the size_t arrays of expected offsets to int arrays > of expected offset deltas (if you think it'd be easier to read and worthwhile). > In this case they'd all be 0, but a case that does remove components, like > omit_auth_offsets_1, could be: > const size_t omit_auth_offsets_1[] = { 0, 0, 0, 0, 0, 0, 0, 0, kNpos, kNpos, > kNpos, kNpos, kNpos, kNpos, kNpos, -7, -7, -7, -7, -7, -7, -7, -7, -7, -7, -7, > -7, -7, -7, -7 }; Ultimately I find both somewhat confusing. https://codereview.chromium.org/23619016/diff/29001/net/base/net_util_unittes... net/base/net_util_unittest.cc:2916: 21, 22, 23, 24, 25 On 2013/09/06 00:31:08, msw wrote: > Shouldn't 26 now also be part of this array for the all-offsets check? Ditto > below for all the size_t arrays terminal values at the url length. CheckAdjustedOffsets() checks this value automatically, so while we could write the code such that we explicitly declared that at all the callsites instead, it wouldn't gain us anything. https://codereview.chromium.org/23619016/diff/29001/net/base/net_util_unittes... net/base/net_util_unittest.cc:2954: // "http://foo\x30B0:\x30B0bar@www.google.com" On 2013/09/06 00:31:08, msw wrote: > nit: can you expand on this comment, I guess this is the expected formatted URL, > but saying so explicitly (perhaps moving this to be right before the input URL) > would clarify its meaning. Ditto for the other similar comments below. OK, I expanded these slightly and moved them.
Okay, this is looking pretty good; I do like the "modified worldview" change overall. I hope some custom offset adjustment code can be replaced by OffsetAdjuster usage, but regardless of this change the code is fairly complex and doesn't seem well designed. https://codereview.chromium.org/23619016/diff/29001/chrome/browser/autocomple... File chrome/browser/autocomplete/search_provider_unittest.cc (right): https://codereview.chromium.org/23619016/diff/29001/chrome/browser/autocomple... chrome/browser/autocomplete/search_provider_unittest.cc:1006: { "[\"a\",[\"https://a/\"],[],[]," On 2013/09/06 19:04:36, Peter Kasting wrote: > On 2013/09/06 00:31:08, msw wrote: > > nit: is the trailing '/' needed here and below? It's not obvious that it > > exercises the empty inline (and trailing index offset) codepath. > > I actually wanted to make sure the trailing slash case still got handled > properly (with the slash stripped off in the output). The "a.com" case omits > the trailing slash, so we test both. Okay, sounds good. https://codereview.chromium.org/23619016/diff/29001/net/base/net_util_unittes... File net/base/net_util_unittest.cc (right): https://codereview.chromium.org/23619016/diff/29001/net/base/net_util_unittes... net/base/net_util_unittest.cc:2900: {0, 0}, On 2013/09/06 19:04:36, Peter Kasting wrote: > On 2013/09/06 00:31:08, msw wrote: > > Aren't all these cases except 27, 500000, and base::string16::npos covered by > > the basic_offsets verification of all offsets? > > Yes. > > > Can you remove the AdjustOffsetCase arrays, and simply > > supply the |all_offsets| array to CheckAdjustedOffsets. > > I could, but then every case would have to declare two long "every possible > offset" arrays instead of one. Both routes seem roughly as lame as each other. > I had an "eww, gross" reaction when I first read this code, but there doesn't > seem to be an answer that (to me, at least) makes me say "ahh, that's so much > less redundant and crappy". YMMV. As I suggested, you can just remove all AdjustOffsetCase arrays. Just send CheckAdjustedOffsets the full list of expected offsets from an automatically generated input array { 0, 1, 2 ... |url_string.length()| }, like it does now. CheckAdjustedOffsets can also generate the input and output offsets for cases > url_string.length() and npos. The result would be the same coverage with ~160 fewer lines; which seems decidedly better. https://codereview.chromium.org/23619016/diff/24001/net/base/net_util.cc File net/base/net_util.cc (right): https://codereview.chromium.org/23619016/diff/24001/net/base/net_util.cc#newc... net/base/net_util.cc:516: // NOTE: Offsets which didn't originally point into this component may be nit: With this helper intact, it's less compelling to remove the npos check. https://codereview.chromium.org/23619016/diff/24001/net/base/net_util.cc#newc... net/base/net_util.cc:519: Offsets OffsetsIntoComponent(const Offsets& original_offsets, Can this be replaced with an OffsetAdjuster? https://codereview.chromium.org/23619016/diff/24001/net/base/net_util.cc#newc... net/base/net_util.cc:521: Offsets offsets_into_component(original_offsets); optional nit: rename this to |offsets| for a shorter loop impl: for (Offsets::iterator i(offsets.begin()); i != offsets.end(); ++i) *i = (*i == npos) ? *i : *i - component_begin; https://codereview.chromium.org/23619016/diff/24001/net/base/net_util.cc#newc... net/base/net_util.cc:541: size_t original_component_begin, optional nit: Replace 2 begin/end pairs with Ranges or url_parse::Components. https://codereview.chromium.org/23619016/diff/24001/net/base/net_util.cc#newc... net/base/net_util.cc:550: for (size_t i = 0; i < original_offsets.size(); ++i) { Can this be replaced with an OffsetAdjuster?
New snap up. I think you're probably right that the existing FormatUrl...() functions could work a little better. They probably should have been built to construct the output offset vector by appending to it -- as they do the output string -- rather than by replacing in place. OffsetAdjuster might be easier to use if it wrote out a new output vector (maybe via a provided back_inserter?) instead of forcing you to operate in place. I'm reluctant to try and make these changes too, at least for now. If you're convinced they would help, or can think of anything else, you can file a bug against me to do them and I'll get to it. https://codereview.chromium.org/23619016/diff/29001/net/base/net_util_unittes... File net/base/net_util_unittest.cc (right): https://codereview.chromium.org/23619016/diff/29001/net/base/net_util_unittes... net/base/net_util_unittest.cc:2900: {0, 0}, On 2013/09/09 19:15:52, msw wrote: > On 2013/09/06 19:04:36, Peter Kasting wrote: > > On 2013/09/06 00:31:08, msw wrote: > > > Aren't all these cases except 27, 500000, and base::string16::npos covered > by > > > the basic_offsets verification of all offsets? > > > > Yes. > > > > > Can you remove the AdjustOffsetCase arrays, and simply > > > supply the |all_offsets| array to CheckAdjustedOffsets. > > > > I could, but then every case would have to declare two long "every possible > > offset" arrays instead of one. Both routes seem roughly as lame as each > other. > > I had an "eww, gross" reaction when I first read this code, but there doesn't > > seem to be an answer that (to me, at least) makes me say "ahh, that's so much > > less redundant and crappy". YMMV. > > As I suggested, you can just remove all AdjustOffsetCase arrays. Just send > CheckAdjustedOffsets the full list of expected offsets from an automatically > generated input array { 0, 1, 2 ... |url_string.length()| }, like it does now. > CheckAdjustedOffsets can also generate the input and output offsets for cases > > url_string.length() and npos. The result would be the same coverage with ~160 > fewer lines; which seems decidedly better. Done. https://codereview.chromium.org/23619016/diff/24001/net/base/net_util.cc File net/base/net_util.cc (right): https://codereview.chromium.org/23619016/diff/24001/net/base/net_util.cc#newc... net/base/net_util.cc:516: // NOTE: Offsets which didn't originally point into this component may be On 2013/09/09 19:15:52, msw wrote: > nit: With this helper intact, it's less compelling to remove the npos check. Moot. https://codereview.chromium.org/23619016/diff/24001/net/base/net_util.cc#newc... net/base/net_util.cc:519: Offsets OffsetsIntoComponent(const Offsets& original_offsets, On 2013/09/09 19:15:52, msw wrote: > Can this be replaced with an OffsetAdjuster? Done. https://codereview.chromium.org/23619016/diff/24001/net/base/net_util.cc#newc... net/base/net_util.cc:521: Offsets offsets_into_component(original_offsets); On 2013/09/09 19:15:52, msw wrote: > optional nit: rename this to |offsets| for a shorter loop impl: > for (Offsets::iterator i(offsets.begin()); i != offsets.end(); ++i) > *i = (*i == npos) ? *i : *i - component_begin; Moot. https://codereview.chromium.org/23619016/diff/24001/net/base/net_util.cc#newc... net/base/net_util.cc:541: size_t original_component_begin, On 2013/09/09 19:15:52, msw wrote: > optional nit: Replace 2 begin/end pairs with Ranges or url_parse::Components. This turns out to not be particularly great at the callsites :/. https://codereview.chromium.org/23619016/diff/24001/net/base/net_util.cc#newc... net/base/net_util.cc:550: for (size_t i = 0; i < original_offsets.size(); ++i) { On 2013/09/09 19:15:52, msw wrote: > Can this be replaced with an OffsetAdjuster? Not really. It's sort of possible to do with two different adjustments that work on two different vectors, but you still run into problems at the boundary points -- now that OffsetAdjuster treats 0 as "before the component" instead of "first character in the component", it becomes difficult to adjust transformed offsets of 0 back to non-zero values. And the resulting code is less efficient (including at least one extra copy of the whole vector) and similarly verbose. After trying to make it work, I've gone back to the existing way.
LGTM with a nit; thanks for doing some difficult work here. https://codereview.chromium.org/23619016/diff/77001/net/base/net_util_unittes... File net/base/net_util_unittest.cc (right): https://codereview.chromium.org/23619016/diff/77001/net/base/net_util_unittes... net/base/net_util_unittest.cc:2982: UnescapeRule::NORMAL, omit_http_offsets); nit: extra space between args.
base/ LGTM As far as net/, I didn't do a detailed review for correctness and everything because unfortunately I don't have time for that (20% time and what not). If y'all are comfortable with msw's review, then I'm fine with it. But from a net/ module OWNERS' perspective, the basics of the change LGTM. If we need a more detailed review (and I see no reason why anyone else would be more appropriate), then it'd be better to ask someone else in net/ to review.
Message was sent while issue was closed.
Committed patchset #5 manually as r222426 (presubmit successful). |