Issue 23619016: Switch the offset conversion routines from...

Issue 23619016: Switch the offset conversion routines from... (Closed)

Created:
7 years, 3 months ago by Peter Kasting

Modified:
7 years, 3 months ago

Reviewers:
msw, willchan no longer on Chromium

CC:
chromium-reviews, James Su, erikwright+watch_chromium.org, jshin+watch_chromium.org

Base URL:
svn://chrome-svn/chrome/trunk/src/

Visibility:
Public.

More Reviews

Description

Switch the offset conversion routines from an "offsets point at characters" worldview to an "offsets point between characters" worldview. This more closely aligns with how the omnibox autocomplete code (which is what this was originally written for) expects things to behave. Direct fallout from this change: * An input offset of 0 will always map to an output offset of 0. * An input offset of (length of string) will always map to the length of the output string, instead of npos. * It's possible for multiple unique input offsets to map to a single non-npos output offset, if they e.g. point to the start and end of a collapsed sequence. * Input offsets pointing into the middle of a completely-removed sequence may not be set to npos if they fall on the boundaries of a subsequence processed by the transformer. For example, when running FormatUrlWithOffsets() on "http://user:pass@domain.com/" and directing it to omit both the scheme and username/password, an input offset of "7" that points in between the scheme and the username/password will be transformed to an output offset of 0 instead of npos. Indirect fallout: * A caller like SearchProvider::NavigationToMatch() will now mark certain matches as "allowed to be default" that it didn't before. Specifically, if the user's input string ends at the same point as the desired |fill_into_edit|, the autocomplete offset will be calculated as (length of string) instead of npos, and thus the match will be thought of as "inlinable" and thus "allowed to be default". BUG=284781 TEST=none R=msw@chromium.org, willchan@chromium.org Committed: https://src.chromium.org/viewvc/chrome?view=rev&revision=222426

Patch Set 1 #

Total comments: 17

Patch Set 2 : #

Total comments: 43

Patch Set 3 : #

Patch Set 4 : #

Total comments: 10

Patch Set 5 : #

Total comments: 1

Created: 7 years, 3 months ago

Download [raw] [tar.bz2]

	Unified diffs	Side-by-side diffs	Delta from patch set	Stats (+281 lines, -355 lines)			Patch
M	base/strings/utf_offset_string_conversions.h	View	1 2 3	2 chunks	+13 lines, -7 lines	0 comments	Download
M	base/strings/utf_offset_string_conversions.cc	View	1	1 chunk	+0 lines, -4 lines	0 comments	Download
M	base/strings/utf_offset_string_conversions_unittest.cc	View	1	9 chunks	+26 lines, -16 lines	0 comments	Download
M	chrome/browser/autocomplete/search_provider_unittest.cc	View	1 2 3 4	4 chunks	+24 lines, -3 lines	0 comments	Download
M	net/base/escape_unittest.cc	View	1 2	1 chunk	+4 lines, -2 lines	0 comments	Download
M	net/base/net_util.h	View	1 2 3	1 chunk	+19 lines, -9 lines	0 comments	Download
M	net/base/net_util.cc	View	1 2 3 4	19 chunks	+113 lines, -128 lines	0 comments	Download
M	net/base/net_util_unittest.cc	View	1 2 3 4	2 chunks	+82 lines, -186 lines	1 comment	Download

Messages

Total messages: 14 (0 generated)

Expand Messages | Collapse Messages

Peter Kasting

There are no great reviewers for this, since mrossetti wrote the original code and I ...

7 years, 3 months ago (2013-09-04 20:39:17 UTC) #1

willchan no longer on Chromium

Running off to a meeting, just wanted to say a few things first: * The ...

7 years, 3 months ago (2013-09-04 20:59:23 UTC) #2

msw

https://codereview.chromium.org/23619016/diff/1/base/strings/utf_offset_string_conversions.h File base/strings/utf_offset_string_conversions.h (right): https://codereview.chromium.org/23619016/diff/1/base/strings/utf_offset_string_conversions.h#newcode47 base/strings/utf_offset_string_conversions.h:47: // which is greater than |limit| with npos. I ...

7 years, 3 months ago (2013-09-04 23:58:50 UTC) #3

Peter Kasting

I'm ignoring all of Mike's comments for now because the net_util test failures exposed more ...

7 years, 3 months ago (2013-09-05 06:07:27 UTC) #4

I'm ignoring all of Mike's comments for now because the net_util test failures
exposed more widespread issues which may subsume those.  Those issues are mostly
locally solved, but I realized tonight I have created something of an
existential crisis regarding these offset conversion functions, and I'm not sure
which way to go.

The fundamental problem here comes from a disconnect between two worldviews: one
in which offsets are viewed as pointing at specific characters within the
string, and one in which offsets point _between_ characters in the string.  (A
computer graphics analogy would be the clash between coordinate systems where
the center of a pixel is at the .0 versus .5 coordinate.)

The existing conversion functions assume offsets point at characters, not
between character.  A couple behaviors fall out of this:
* Pointing "one past the end" points out of the string, just like pointing two
or 57 characters past the end.  Therefore, the input offset is invalid, and the
core change in this patch is wrong.
* Unique input offsets are guaranteed to be transformed to unique output offsets
or npos.

However, the |inline_autocomplete_offset| in the omnibox world functions more
like an offset between characters, an insertion point into the string to be
precise.  Given a string of length n, all offsets from 0 to n inclusive
represent boundary points that touch at least one character within the string. 
Since code like SearchProvider::NavigationToMatch() tries to transform such
offsets, it reasonably expects that an offset "just after the last character" is
valid and will remain so after transformation.

This view of the world has other implications for the offset conversion
functions, however.  Specifically, unique input offsets may be transformed to
equal, non-npos output offsets.  The most obvious case is if we take the string
"a", with offsets 0 and 1, and collapse to "".  In the "offsets point at
characters" worldview, the 0 offset pointed at 'a', which is gone, so that
becomes npos; and the 1 pointed out of the string, so that's npos too.  In the
"offsets point between characters" worldview, the 0 points before the beginning
of the string and remains 0 (as, indeed, will always happen in this worldview),
while the 1 points just past the end and also becomes 0 when the string is
shrunk to zero size.

This can become even more confusing when one considers a function like
net_util::FormatUrlWithOffsets().  Let's take the URL
"http://user:pass@domain.com/", and run it through a transformation which strips
both the scheme and the username/password (for an output of "domain.com/").  In
the "offsets point between characters" worldview, because FormatUrlWithOffsets()
works piecewise on one URL component at a time, a string offset of "7" -- which
points just after the scheme -- is first transformed to "0" when we strip the
scheme.  It then remains 0 as we strip the username/password that follow it. 
The consequence is that if you have three offsets (6, 7, 8) that on input all
point into the middle of a block of text that will be removed, on output these
become (npos, 0, npos).  At first this seemed obviously wrong to me, but the
more I thought about it the more this actually began to seem correct, in the
sense that FormatUrlWithOffsets() treats the separate pieces of a URL as logical
units the same way UTF8ToUTF16AndAdjustOffsets() treats Unicode code sequences
as logical units, and if you point between two of them, it's reasonable for your
offset to be preserved.

So, as I said, I am nearly done locally with changing both the UTF conversion
functions and the net_util ones to behave in this second fashion.  At that point
I can run it through the trybots again and see what else breaks.  But it's worth
asking: does this worldview change make sense?  Is this really how the offset
conversions should work, conceptually?  Or was the old "offsets point _at_
particular characters" worldview the right one?

If we like the old worldview, the obvious question is how to support something
like NavigationToMatch() or other functions that want to refer to "one past the
end".  We can try to special-case this without making other changes -- which is
sort of the direction the current CL here goes, though not far enough -- but it
led to problems in my first attempt, though now that I understand the problem
space better, I think it's probably possible.  Still, it feels like a hack, and
I'm not sure it's right overall.

The fact that I'm waffling is what's causing me to write all this, in the hopes
that you guys will have input.

msw

On 2013/09/05 06:07:27, Peter Kasting wrote: > I'm ignoring all of Mike's comments for now ...

7 years, 3 months ago (2013-09-05 19:18:01 UTC) #5

On 2013/09/05 06:07:27, Peter Kasting wrote:
> I'm ignoring all of Mike's comments for now because the net_util test failures
> exposed more widespread issues which may subsume those.  Those issues are
mostly
> locally solved, but I realized tonight I have created something of an
> existential crisis regarding these offset conversion functions, and I'm not
sure
> which way to go.
> 
> The fundamental problem here comes from a disconnect between two worldviews:
one
> in which offsets are viewed as pointing at specific characters within the
> string, and one in which offsets point _between_ characters in the string.  (A
> computer graphics analogy would be the clash between coordinate systems where
> the center of a pixel is at the .0 versus .5 coordinate.)
> 
> The existing conversion functions assume offsets point at characters, not
> between character.  A couple behaviors fall out of this:
> * Pointing "one past the end" points out of the string, just like pointing two
> or 57 characters past the end.  Therefore, the input offset is invalid, and
the
> core change in this patch is wrong.
> * Unique input offsets are guaranteed to be transformed to unique output
offsets
> or npos.
> 
> However, the |inline_autocomplete_offset| in the omnibox world functions more
> like an offset between characters, an insertion point into the string to be
> precise.  Given a string of length n, all offsets from 0 to n inclusive
> represent boundary points that touch at least one character within the string.

> Since code like SearchProvider::NavigationToMatch() tries to transform such
> offsets, it reasonably expects that an offset "just after the last character"
is
> valid and will remain so after transformation.
> 
> This view of the world has other implications for the offset conversion
> functions, however.  Specifically, unique input offsets may be transformed to
> equal, non-npos output offsets.  The most obvious case is if we take the
string
> "a", with offsets 0 and 1, and collapse to "".  In the "offsets point at
> characters" worldview, the 0 offset pointed at 'a', which is gone, so that
> becomes npos; and the 1 pointed out of the string, so that's npos too.  In the
> "offsets point between characters" worldview, the 0 points before the
beginning
> of the string and remains 0 (as, indeed, will always happen in this
worldview),
> while the 1 points just past the end and also becomes 0 when the string is
> shrunk to zero size.
> 
> This can become even more confusing when one considers a function like
> net_util::FormatUrlWithOffsets().  Let's take the URL
> "http://user:pass@domain.com/", and run it through a transformation which
strips
> both the scheme and the username/password (for an output of "domain.com/"). 
In
> the "offsets point between characters" worldview, because
FormatUrlWithOffsets()
> works piecewise on one URL component at a time, a string offset of "7" --
which
> points just after the scheme -- is first transformed to "0" when we strip the
> scheme.  It then remains 0 as we strip the username/password that follow it. 
> The consequence is that if you have three offsets (6, 7, 8) that on input all
> point into the middle of a block of text that will be removed, on output these
> become (npos, 0, npos).  At first this seemed obviously wrong to me, but the
> more I thought about it the more this actually began to seem correct, in the
> sense that FormatUrlWithOffsets() treats the separate pieces of a URL as
logical
> units the same way UTF8ToUTF16AndAdjustOffsets() treats Unicode code sequences
> as logical units, and if you point between two of them, it's reasonable for
your
> offset to be preserved.
> 
> So, as I said, I am nearly done locally with changing both the UTF conversion
> functions and the net_util ones to behave in this second fashion.  At that
point
> I can run it through the trybots again and see what else breaks.  But it's
worth
> asking: does this worldview change make sense?  Is this really how the offset
> conversions should work, conceptually?  Or was the old "offsets point _at_
> particular characters" worldview the right one?
> 
> If we like the old worldview, the obvious question is how to support something
> like NavigationToMatch() or other functions that want to refer to "one past
the
> end".  We can try to special-case this without making other changes -- which
is
> sort of the direction the current CL here goes, though not far enough -- but
it
> led to problems in my first attempt, though now that I understand the problem
> space better, I think it's probably possible.  Still, it feels like a hack,
and
> I'm not sure it's right overall.
> 
> The fact that I'm waffling is what's causing me to write all this, in the
hopes
> that you guys will have input.

Sorry I didn't read all that, but hopefully my thoughts are still relevant.
Cursor positioning and selection requires two positions for a string of one
character length, like "a". That allows a cursor '|' to be placed at "|a" and
"a|". Similar indexing is needed for selection/style ranges. Can you try to
avoid changing the contract of the LimitOffset function, perhaps making its
behavior comment even more explicit, and change user to supply a limit of
string.length()-1 as needed?

Peter Kasting

On 2013/09/05 19:18:01, msw wrote: > Sorry I didn't read all that, but hopefully my ...

7 years, 3 months ago (2013-09-05 19:31:36 UTC) #6

Peter Kasting

New snap up that implements the larger worldview changeover. https://codereview.chromium.org/23619016/diff/1/base/strings/utf_offset_string_conversions.h File base/strings/utf_offset_string_conversions.h (right): https://codereview.chromium.org/23619016/diff/1/base/strings/utf_offset_string_conversions.h#newcode47 base/strings/utf_offset_string_conversions.h:47: ...

7 years, 3 months ago (2013-09-05 20:11:59 UTC) #7

msw

On 2013/09/05 19:31:36, Peter Kasting wrote: > On 2013/09/05 19:18:01, msw wrote: > > Sorry ...

7 years, 3 months ago (2013-09-06 00:31:07 UTC) #8

On 2013/09/05 19:31:36, Peter Kasting wrote:
> On 2013/09/05 19:18:01, msw wrote:
> > Sorry I didn't read all that, but hopefully my thoughts are still relevant.
> 
> :( I realize it's an enormous wall of text, but I wrote it so that you guys
> would read and ponder it.  It's not very helpful to me for you to not read the
> background on this.
> 
> > Cursor positioning and selection requires two positions for a string of one
> > character length, like "a". That allows a cursor '|' to be placed at "|a"
and
> > "a|". Similar indexing is needed for selection/style ranges. Can you try to
> > avoid changing the contract of the LimitOffset function, perhaps making its
> > behavior comment even more explicit, and change user to supply a limit of
> > string.length()-1 as needed?
> 
> It seems like the first half of this comment is arguing for the worldview
change
> I'm making here, and the second half is arguing against it.  Could you explain
> more how the first two sentences lead you to the concluding sentence?
> 
> After all, if LimitOffset can't support an input offset of string.length(),
how
> can it support the sort of positioning required for cursor and selection
> manipulation?

Sorry, but I'm just really swamped; and this re-review took several hours...
I read your wall of text and looked at the new patch set, and I see your pain.
My last comment mistakenly thought you were removing index == limit support.

Regarding your "http://user:pass@domain.com/" example, I think simply clamping
indices across text modifications is wrong. (6, 7, 8) should probably become (0,
0, 0) when stripping the scheme/username/password, not (npos, 0, npos). Indices
pointing to text that was removed shouldn't be treated as abstract offsets
independent of the actual text content, they should probably be clamped to the
leading index of the removed content. So, I'd suggest not letting that behavior
weigh in on your LimitOffset decision if it shouldn't be using LimitOffset at
all, and can be fixed tangentially.

Anyway, it seems like some valid LimitOffset callers will need to use +1 or -1
regardless of the ultimate behavior. I appreciate your new deeper understanding
of the issue and recognize that you've gone ahead with changing the worldview of
this function, but I suspect that simply adding 1 to the string length for the
"offsets point between characters" worldview users would be good enough.

Gosh, I sure hope I can get some of my own work done someday...

https://codereview.chromium.org/23619016/diff/1/chrome/browser/autocomplete/s...
File chrome/browser/autocomplete/search_provider_unittest.cc (right):

https://codereview.chromium.org/23619016/diff/1/chrome/browser/autocomplete/s...
chrome/browser/autocomplete/search_provider_unittest.cc:1960: { "a.com",
"[\"a.com\",[\"http://a.com/a\"],[],[],"
On 2013/09/05 20:11:59, Peter Kasting wrote:
> On 2013/09/04 23:58:50, msw wrote:
> > Shouldn't this test have caught the failing-to-inline a navsuggest problem?
> 
> This didn't catch it because the inline autocompletion was non-empty.
> 
> I'm adding another test here to catch this (redundant with the ones above).

Great! I was going to say that your new "a" -> "https://a" test was good, but
having one with .com would be even better, and here it is!

https://codereview.chromium.org/23619016/diff/29001/chrome/browser/autocomple...
File chrome/browser/autocomplete/search_provider_unittest.cc (right):

https://codereview.chromium.org/23619016/diff/29001/chrome/browser/autocomple...
chrome/browser/autocomplete/search_provider_unittest.cc:1006: {
"[\"a\",[\"https://a/\"],[],[],"
nit: is the trailing '/' needed here and below? It's not obvious that it
exercises the empty inline (and trailing index offset) codepath.

https://codereview.chromium.org/23619016/diff/29001/net/base/escape_unittest.cc
File net/base/escape_unittest.cc (right):

https://codereview.chromium.org/23619016/diff/29001/net/base/escape_unittest....
net/base/escape_unittest.cc:340: {"", 0, 0},
nit: also test {"", 1, std::string::npos}

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc
File net/base/net_util.cc (left):

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#oldc...
net/base/net_util.cc:514: std::vector<size_t> OffsetsIntoComponent(
Leave this helper intact, instead of writing two new similar loops.

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#oldc...
net/base/net_util.cc:1730: size_t colon = parsed.username.end();
What was this code doing and why is it no longer necessary?

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#oldc...
net/base/net_util.cc:1744: size_t at_sign = (parsed.password.is_valid() ?
Ditto: What was this code doing and why is it no longer necessary?

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc
File net/base/net_util.cc (right):

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc...
net/base/net_util.cc:546: for (Offsets::iterator
i(offsets_into_underlying_url.begin());
nit: use consistent iter/index looping if you're changing a lot of code here.

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc...
net/base/net_util.cc:548: *i = ((*i == std::string::npos) || (*i <
kViewSourceLength)) ?
Why should offsets into trimmed leading text go to npos instead of 0?
(I suspect that pre-existing behavior is wrong, but don't really know)

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc...
net/base/net_util.cc:568: for (size_t i = 0; i < original_offsets.size(); ++i) {
Why does this loop use |original_offsets| at all? Shouldn't it just add
kViewSourceLength to all the non-npos |offsets_into_underlying_url|?

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc...
net/base/net_util.cc:571: (*offsets_for_adjustment)[i] = (new_offset ==
base::string16::npos) ?
offsets_for_adjustment may be NULL, check for that.

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc...
net/base/net_util.cc:650: // Transform |original_offsets| into a vetor of
offsets relative to
nit: 'vector'

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc...
net/base/net_util.cc:658: // |original_component_begin|.  (Other offsets are
ignored, since it
Is it even worthwhile to ignore other offsets? It requires adding this comment
and another conditional term to avoid a subtraction and assignment. Might it
even be wrong to ignore those if later string/offset use assume they're updated
to account for the transformed component?

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc...
net/base/net_util.cc:676: // This offset originally pointed into the transformed
component.
I find it odd that we are adjusting offsets within transformed components at
all. Can you provide an example of when this is done? Overall, I wonder if we
could simply calculate offsets on formatted URLs, instead of calculating
un-formatted URL offsets then formatting the URL and either transforming or
trashing offsets. Perhaps knowing an actual use might help me understand.

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc...
net/base/net_util.cc:684: (original_offset != std::string::npos)) {
nit: fix indent

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc...
net/base/net_util.cc:688: original_offset - original_component_end +
output->length();
I suppose either is approach should yield the same result, but shouldn't this
really use |original_component.len| instead of |original_component_end| and
|component_str.length()| instead of |output->length()| to be 'correct'?

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc...
net/base/net_util.cc:1764: base::OffsetAdjuster
offset_adjuster(offsets_for_adjustment);
What is this doing? Adjusting offsets past a removed trailing slash?

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc...
net/base/net_util.cc:1777: AppendFormattedComponent(spec, parsed.ref,
original_offsets,
Nice!

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util_unittes...
File net/base/net_util_unittest.cc (right):

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util_unittes...
net/base/net_util_unittest.cc:2900: {0, 0},
Aren't all these cases except 27, 500000, and base::string16::npos covered by
the basic_offsets verification of all offsets? That seems to be a general trend
in all of these tests. Can you remove the AdjustOffsetCase arrays, and simply
supply the |all_offsets| array to CheckAdjustedOffsets. It would be easy to
append a few extra test cases within CheckAdjustedOffsets that should yield npos
for all callers, like (|url_length| + 1), (|url_length| + 2), (|url_length| +
50), and npos.

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util_unittes...
net/base/net_util_unittest.cc:2914: const size_t basic_offsets[] = {
Optionally, you could change the size_t arrays of expected offsets to int arrays
of expected offset deltas (if you think it'd be easier to read and worthwhile).
In this case they'd all be 0, but a case that does remove components, like
omit_auth_offsets_1, could be:
const size_t omit_auth_offsets_1[] = { 0, 0, 0, 0, 0, 0, 0, 0, kNpos, kNpos,
kNpos, kNpos, kNpos, kNpos, kNpos, -7, -7, -7, -7, -7, -7, -7, -7, -7, -7, -7,
-7, -7, -7, -7 };

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util_unittes...
net/base/net_util_unittest.cc:2916: 21, 22, 23, 24, 25
Shouldn't 26 now also be part of this array for the all-offsets check? Ditto
below for all the size_t arrays terminal values at the url length.

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util_unittes...
net/base/net_util_unittest.cc:2923: {6, 6},
Again, all of these cases are redundant with the omit_auth_offsets_1
verification of all offsets. Hopefully you can simplify these tests a little;
I've removed other pending comments to the same effect after expanding on these
first two.

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util_unittes...
net/base/net_util_unittest.cc:2954: //
"http://foo\x30B0:\x30B0bar@www.google.com"
nit: can you expand on this comment, I guess this is the expected formatted URL,
but saying so explicitly (perhaps moving this to be right before the input URL)
would clarify its meaning. Ditto for the other similar comments below.

Peter Kasting

Thanks for the thorough review, although I hadn't been expecting you to review net/ or ...

7 years, 3 months ago (2013-09-06 19:04:36 UTC) #9

Thanks for the thorough review, although I hadn't been expecting you to review
net/ or base/!  New snap up.  I modified some .h comments to try and reflect the
"modified worldview" of these functions.

On 2013/09/06 00:31:07, msw wrote:
> Regarding your "http://user:pass@domain.com/" example, I think simply clamping
> indices across text modifications is wrong. (6, 7, 8) should probably become
(0,
> 0, 0) when stripping the scheme/username/password, not (npos, 0, npos).

It depends on whether you view FormatURLWithOffsets() as working on a
character-by-character basis -- basically just a wrapper around UTF8ToUTF16()
and the like -- or on a functional block basis, where "the scheme" is an entity
a bit like a long Unicode encoding sequence is an entity.  I think it makes more
sense to view it as the latter, in which case converting "offset into the middle
of an entity that no longer exists" to npos seems more correct.

Plus I worry some callers may intentionally rely on the current behavior to
detect offsets that point into the middles of such entities.  I am not sure
though.

> Anyway, it seems like some valid LimitOffset callers will need to use +1 or -1
> regardless of the ultimate behavior.

How so?  I haven't found any in my patch...

> I suspect that simply adding 1 to the string length for the
> "offsets point between characters" worldview users would be good enough.

The problem with this is that the transformation functions don't take an
arbitrary limit point, the infer it from the provided string.  So to pretend you
have a longer string, you'd have to have callers actually add characters to the
end of the string, just to allow for the possibility of an offset at the end of
the original string.  That seems really hacky and gross, and might have a large
footprint across the omnibox code.

https://codereview.chromium.org/23619016/diff/29001/chrome/browser/autocomple...
File chrome/browser/autocomplete/search_provider_unittest.cc (right):

https://codereview.chromium.org/23619016/diff/29001/chrome/browser/autocomple...
chrome/browser/autocomplete/search_provider_unittest.cc:1006: {
"[\"a\",[\"https://a/\"],[],[],"
On 2013/09/06 00:31:08, msw wrote:
> nit: is the trailing '/' needed here and below? It's not obvious that it
> exercises the empty inline (and trailing index offset) codepath.

I actually wanted to make sure the trailing slash case still got handled
properly (with the slash stripped off in the output).  The "a.com" case omits
the trailing slash, so we test both.

https://codereview.chromium.org/23619016/diff/29001/net/base/escape_unittest.cc
File net/base/escape_unittest.cc (right):

https://codereview.chromium.org/23619016/diff/29001/net/base/escape_unittest....
net/base/escape_unittest.cc:340: {"", 0, 0},
On 2013/09/06 00:31:08, msw wrote:
> nit: also test {"", 1, std::string::npos}

Done.

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc
File net/base/net_util.cc (left):

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#oldc...
net/base/net_util.cc:514: std::vector<size_t> OffsetsIntoComponent(
On 2013/09/06 00:31:08, msw wrote:
> Leave this helper intact, instead of writing two new similar loops.

Restored these two functions.  I originally gutted these because the callsites
wanted different (sometimes simpler) behavior, but after some changes relative
to your comments, I ended up going back to these.  See discussion below for
information about behavior changes.

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#oldc...
net/base/net_util.cc:1730: size_t colon = parsed.username.end();
On 2013/09/06 00:31:08, msw wrote:
> What was this code doing and why is it no longer necessary?

This is somewhat subtle.

Before this change, AppendFormattedComponent() only transformed offsets within
the component, and left ones after it untouched.  The result was that as the
series of AppendFormattedComponent() calls occurred, groups of offsets would be
set correctly, with the ones after them left completely invalid.  One
consequence was that any offsets referring to text that was _not_ transformed by
AppendFormattedComponent() -- such as the colon and at-sign in a
username/password section of the URL -- had to be fixed up manually.  That's
what this code, and the code below, are doing.  The other consequence was that
the offset pointing at the end of the URL, since it did not refer to any text
that would be transformed, was never fixed up at all.  (A third consequence,
which is that during the middle of making these calls the state of
|offsets_for_adjustment| was somewhat incoherent, was only relevant to someone
debugging the code and trying to figure out why later offsets weren't being
dynamically updated as earlier components changed length.)

In the new version of the patch, AppendFormattedComponent() also adjusts all
offsets _after_ the transformed component, which means that none of these
consequences are present anymore.  This obviates the need for these two blocks
of code, also avoids an additional block to fix the offset at the end of the
string, and finally is less confusing while debugging because
|offsets_for_adjustment| always reflects the correct values for if you just
dumped the rest of the string on unmodified and returned immediately.

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc
File net/base/net_util.cc (right):

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc...
net/base/net_util.cc:546: for (Offsets::iterator
i(offsets_into_underlying_url.begin());
On 2013/09/06 00:31:08, msw wrote:
> nit: use consistent iter/index looping if you're changing a lot of code here.

This is no longer so noticeably inconsistent, since the two loops are now back
in separate helper functions.

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc...
net/base/net_util.cc:548: *i = ((*i == std::string::npos) || (*i <
kViewSourceLength)) ?
On 2013/09/06 00:31:08, msw wrote:
> Why should offsets into trimmed leading text go to npos instead of 0?
> (I suspect that pre-existing behavior is wrong, but don't really know)

It doesn't actually matter what we set them to, since we won't pay attention to
their values after the transformation.

The helper function now does something simpler with these, based on your
comments on the similar code in AppendFormattedComponent().

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc...
net/base/net_util.cc:568: for (size_t i = 0; i < original_offsets.size(); ++i) {
On 2013/09/06 00:31:08, msw wrote:
> Why does this loop use |original_offsets| at all? Shouldn't it just add
> kViewSourceLength to all the non-npos |offsets_into_underlying_url|?

This is now moot since this once again calls AdjustForComponentTransform().

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc...
net/base/net_util.cc:571: (*offsets_for_adjustment)[i] = (new_offset ==
base::string16::npos) ?
On 2013/09/06 00:31:08, msw wrote:
> offsets_for_adjustment may be NULL, check for that.

This is now moot since this once again calls AdjustForComponentTransform().

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc...
net/base/net_util.cc:650: // Transform |original_offsets| into a vetor of
offsets relative to
On 2013/09/06 00:31:08, msw wrote:
> nit: 'vector'

This is now moot.

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc...
net/base/net_util.cc:658: // |original_component_begin|.  (Other offsets are
ignored, since it
On 2013/09/06 00:31:08, msw wrote:
> Is it even worthwhile to ignore other offsets? It requires adding this comment
> and another conditional term to avoid a subtraction and assignment.

You're right, we can unconditionally subtract since we'll ignore all the
out-of-range offsets anyway.

Doing things like subtracting a value from npos looks scary, but since we ignore
such values, it doesn't matter.  I simplified the code in
OffsetsIntoComponent().

> Might it
> even be wrong to ignore those if later string/offset use assume they're
updated
> to account for the transformed component?

Code after this function returns does assume these are updated (which is a
change from before this patch; see my long reply in the username/password
handling section of FormatUrlWithOffsets()), so we do set the correct output
values for them.  I tried to make this clearer in the comments on
OffsetsIntoComponent()/AdjustForComponentTransform().

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc...
net/base/net_util.cc:676: // This offset originally pointed into the transformed
component.
On 2013/09/06 00:31:08, msw wrote:
> I find it odd that we are adjusting offsets within transformed components at
> all. Can you provide an example of when this is done?

Yes, any time the path needs escaping or unescaping.  Each escaped character is
generally equivalent to between three and nine unescaped characters, so any
offsets that point anywhere within that sequence get adjusted during the actual
escaping/unescaping process.  This happens in real-world cases when people e.g.
type something in the omnibox that the HQP matches against an escaped part of a
URL's path.

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc...
net/base/net_util.cc:684: (original_offset != std::string::npos)) {
On 2013/09/06 00:31:08, msw wrote:
> nit: fix indent

Done.

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc...
net/base/net_util.cc:688: original_offset - original_component_end +
output->length();
On 2013/09/06 00:31:08, msw wrote:
> I suppose either is approach should yield the same result, but shouldn't this
> really use |original_component.len| instead of |original_component_end| and
> |component_str.length()| instead of |output->length()| to be 'correct'?

They do yield the same result.

Now that this code has moved back into AdjustForComponentTransform() and the
code says "transformed_component_end" instead of "output->length()", I think the
algorithm is more obvious.

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc...
net/base/net_util.cc:1764: base::OffsetAdjuster
offset_adjuster(offsets_for_adjustment);
On 2013/09/06 00:31:08, msw wrote:
> What is this doing? Adjusting offsets past a removed trailing slash?

Yes.

This only really matters for when we have an offset at the end of the input
string and this trailing slash was the last thing in the input.  In that case we
have to do this fixup to get that offset correct.  Otherwise, one of the
AppendFormattedComponent() calls below will correct any such trailing offset,
due to the new behavior I described in my reply just above.

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util.cc#newc...
net/base/net_util.cc:1777: AppendFormattedComponent(spec, parsed.ref,
original_offsets,
On 2013/09/06 00:31:08, msw wrote:
> Nice!

Yeah... old code made me facepalm.

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util_unittes...
File net/base/net_util_unittest.cc (right):

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util_unittes...
net/base/net_util_unittest.cc:2900: {0, 0},
On 2013/09/06 00:31:08, msw wrote:
> Aren't all these cases except 27, 500000, and base::string16::npos covered by
> the basic_offsets verification of all offsets?

Yes.

> Can you remove the AdjustOffsetCase arrays, and simply
> supply the |all_offsets| array to CheckAdjustedOffsets.

I could, but then every case would have to declare two long "every possible
offset" arrays instead of one.  Both routes seem roughly as lame as each other. 
I had an "eww, gross" reaction when I first read this code, but there doesn't
seem to be an answer that (to me, at least) makes me say "ahh, that's so much
less redundant and crappy".  YMMV.

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util_unittes...
net/base/net_util_unittest.cc:2914: const size_t basic_offsets[] = {
On 2013/09/06 00:31:08, msw wrote:
> Optionally, you could change the size_t arrays of expected offsets to int
arrays
> of expected offset deltas (if you think it'd be easier to read and
worthwhile).
> In this case they'd all be 0, but a case that does remove components, like
> omit_auth_offsets_1, could be:
> const size_t omit_auth_offsets_1[] = { 0, 0, 0, 0, 0, 0, 0, 0, kNpos, kNpos,
> kNpos, kNpos, kNpos, kNpos, kNpos, -7, -7, -7, -7, -7, -7, -7, -7, -7, -7, -7,
> -7, -7, -7, -7 };

Ultimately I find both somewhat confusing.

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util_unittes...
net/base/net_util_unittest.cc:2916: 21, 22, 23, 24, 25
On 2013/09/06 00:31:08, msw wrote:
> Shouldn't 26 now also be part of this array for the all-offsets check? Ditto
> below for all the size_t arrays terminal values at the url length.

CheckAdjustedOffsets() checks this value automatically, so while we could write
the code such that we explicitly declared that at all the callsites instead, it
wouldn't gain us anything.

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util_unittes...
net/base/net_util_unittest.cc:2954: //
"http://foo\x30B0:\x30B0bar@www.google.com"
On 2013/09/06 00:31:08, msw wrote:
> nit: can you expand on this comment, I guess this is the expected formatted
URL,
> but saying so explicitly (perhaps moving this to be right before the input
URL)
> would clarify its meaning. Ditto for the other similar comments below.

OK, I expanded these slightly and moved them.

msw

Okay, this is looking pretty good; I do like the "modified worldview" change overall. I ...

7 years, 3 months ago (2013-09-09 19:15:52 UTC) #10

Okay, this is looking pretty good; I do like the "modified worldview" change
overall. I hope some custom offset adjustment code can be replaced by
OffsetAdjuster usage, but regardless of this change the code is fairly complex
and doesn't seem well designed.

https://codereview.chromium.org/23619016/diff/29001/chrome/browser/autocomple...
File chrome/browser/autocomplete/search_provider_unittest.cc (right):

https://codereview.chromium.org/23619016/diff/29001/chrome/browser/autocomple...
chrome/browser/autocomplete/search_provider_unittest.cc:1006: {
"[\"a\",[\"https://a/\"],[],[],"
On 2013/09/06 19:04:36, Peter Kasting wrote:
> On 2013/09/06 00:31:08, msw wrote:
> > nit: is the trailing '/' needed here and below? It's not obvious that it
> > exercises the empty inline (and trailing index offset) codepath.
> 
> I actually wanted to make sure the trailing slash case still got handled
> properly (with the slash stripped off in the output).  The "a.com" case omits
> the trailing slash, so we test both.

Okay, sounds good.

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util_unittes...
File net/base/net_util_unittest.cc (right):

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util_unittes...
net/base/net_util_unittest.cc:2900: {0, 0},
On 2013/09/06 19:04:36, Peter Kasting wrote:
> On 2013/09/06 00:31:08, msw wrote:
> > Aren't all these cases except 27, 500000, and base::string16::npos covered
by
> > the basic_offsets verification of all offsets?
> 
> Yes.
> 
> > Can you remove the AdjustOffsetCase arrays, and simply
> > supply the |all_offsets| array to CheckAdjustedOffsets.
> 
> I could, but then every case would have to declare two long "every possible
> offset" arrays instead of one.  Both routes seem roughly as lame as each
other. 
> I had an "eww, gross" reaction when I first read this code, but there doesn't
> seem to be an answer that (to me, at least) makes me say "ahh, that's so much
> less redundant and crappy".  YMMV.

As I suggested, you can just remove all AdjustOffsetCase arrays. Just send
CheckAdjustedOffsets the full list of expected offsets from an automatically
generated input array { 0, 1, 2 ... |url_string.length()| }, like it does now.
CheckAdjustedOffsets can also generate the input and output offsets for cases >
url_string.length() and npos. The result would be the same coverage with ~160
fewer lines; which seems decidedly better.

https://codereview.chromium.org/23619016/diff/24001/net/base/net_util.cc
File net/base/net_util.cc (right):

https://codereview.chromium.org/23619016/diff/24001/net/base/net_util.cc#newc...
net/base/net_util.cc:516: // NOTE: Offsets which didn't originally point into
this component may be
nit: With this helper intact, it's less compelling to remove the npos check.

https://codereview.chromium.org/23619016/diff/24001/net/base/net_util.cc#newc...
net/base/net_util.cc:519: Offsets OffsetsIntoComponent(const Offsets&
original_offsets,
Can this be replaced with an OffsetAdjuster?

https://codereview.chromium.org/23619016/diff/24001/net/base/net_util.cc#newc...
net/base/net_util.cc:521: Offsets offsets_into_component(original_offsets);
optional nit: rename this to |offsets| for a shorter loop impl:
  for (Offsets::iterator i(offsets.begin()); i != offsets.end(); ++i)
    *i = (*i == npos) ? *i : *i - component_begin;

https://codereview.chromium.org/23619016/diff/24001/net/base/net_util.cc#newc...
net/base/net_util.cc:541: size_t original_component_begin,
optional nit: Replace 2 begin/end pairs with Ranges or url_parse::Components.

https://codereview.chromium.org/23619016/diff/24001/net/base/net_util.cc#newc...
net/base/net_util.cc:550: for (size_t i = 0; i < original_offsets.size(); ++i) {
Can this be replaced with an OffsetAdjuster?

Peter Kasting

New snap up. I think you're probably right that the existing FormatUrl...() functions could work ...

7 years, 3 months ago (2013-09-09 23:43:01 UTC) #11

New snap up.

I think you're probably right that the existing FormatUrl...() functions could
work a little better.  They probably should have been built to construct the
output offset vector by appending to it -- as they do the output string --
rather than by replacing in place.  OffsetAdjuster might be easier to use if it
wrote out a new output vector (maybe via a provided back_inserter?) instead of
forcing you to operate in place.  I'm reluctant to try and make these changes
too, at least for now.  If you're convinced they would help, or can think of
anything else, you can file a bug against me to do them and I'll get to it.

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util_unittes...
File net/base/net_util_unittest.cc (right):

https://codereview.chromium.org/23619016/diff/29001/net/base/net_util_unittes...
net/base/net_util_unittest.cc:2900: {0, 0},
On 2013/09/09 19:15:52, msw wrote:
> On 2013/09/06 19:04:36, Peter Kasting wrote:
> > On 2013/09/06 00:31:08, msw wrote:
> > > Aren't all these cases except 27, 500000, and base::string16::npos covered
> by
> > > the basic_offsets verification of all offsets?
> > 
> > Yes.
> > 
> > > Can you remove the AdjustOffsetCase arrays, and simply
> > > supply the |all_offsets| array to CheckAdjustedOffsets.
> > 
> > I could, but then every case would have to declare two long "every possible
> > offset" arrays instead of one.  Both routes seem roughly as lame as each
> other. 
> > I had an "eww, gross" reaction when I first read this code, but there
doesn't
> > seem to be an answer that (to me, at least) makes me say "ahh, that's so
much
> > less redundant and crappy".  YMMV.
> 
> As I suggested, you can just remove all AdjustOffsetCase arrays. Just send
> CheckAdjustedOffsets the full list of expected offsets from an automatically
> generated input array { 0, 1, 2 ... |url_string.length()| }, like it does now.
> CheckAdjustedOffsets can also generate the input and output offsets for cases
>
> url_string.length() and npos. The result would be the same coverage with ~160
> fewer lines; which seems decidedly better.

Done.

https://codereview.chromium.org/23619016/diff/24001/net/base/net_util.cc
File net/base/net_util.cc (right):

https://codereview.chromium.org/23619016/diff/24001/net/base/net_util.cc#newc...
net/base/net_util.cc:516: // NOTE: Offsets which didn't originally point into
this component may be
On 2013/09/09 19:15:52, msw wrote:
> nit: With this helper intact, it's less compelling to remove the npos check.

Moot.

https://codereview.chromium.org/23619016/diff/24001/net/base/net_util.cc#newc...
net/base/net_util.cc:519: Offsets OffsetsIntoComponent(const Offsets&
original_offsets,
On 2013/09/09 19:15:52, msw wrote:
> Can this be replaced with an OffsetAdjuster?

Done.

https://codereview.chromium.org/23619016/diff/24001/net/base/net_util.cc#newc...
net/base/net_util.cc:521: Offsets offsets_into_component(original_offsets);
On 2013/09/09 19:15:52, msw wrote:
> optional nit: rename this to |offsets| for a shorter loop impl:
>   for (Offsets::iterator i(offsets.begin()); i != offsets.end(); ++i)
>     *i = (*i == npos) ? *i : *i - component_begin;

Moot.

https://codereview.chromium.org/23619016/diff/24001/net/base/net_util.cc#newc...
net/base/net_util.cc:541: size_t original_component_begin,
On 2013/09/09 19:15:52, msw wrote:
> optional nit: Replace 2 begin/end pairs with Ranges or url_parse::Components.

This turns out to not be particularly great at the callsites :/.

https://codereview.chromium.org/23619016/diff/24001/net/base/net_util.cc#newc...
net/base/net_util.cc:550: for (size_t i = 0; i < original_offsets.size(); ++i) {
On 2013/09/09 19:15:52, msw wrote:
> Can this be replaced with an OffsetAdjuster?

Not really.

It's sort of possible to do with two different adjustments that work on two
different vectors, but you still run into problems at the boundary points -- now
that OffsetAdjuster treats 0 as "before the component" instead of "first
character in the component", it becomes difficult to adjust transformed offsets
of 0 back to non-zero values.  And the resulting code is less efficient
(including at least one extra copy of the whole vector) and similarly verbose.

After trying to make it work, I've gone back to the existing way.

msw

LGTM with a nit; thanks for doing some difficult work here. https://codereview.chromium.org/23619016/diff/77001/net/base/net_util_unittest.cc File net/base/net_util_unittest.cc (right): ...

7 years, 3 months ago (2013-09-10 17:01:14 UTC) #12

willchan no longer on Chromium

base/ LGTM As far as net/, I didn't do a detailed review for correctness and ...

7 years, 3 months ago (2013-09-11 00:09:36 UTC) #13

Peter Kasting

7 years, 3 months ago (2013-09-11 00:42:32 UTC) #14

Message was sent while issue was closed.

Committed patchset #5 manually as r222426 (presubmit successful).

Expand Messages | Collapse Messages