Issue 226283009: Make AutocompleteInput::Parse() more strict: return QUERY for all inputs that

Peter Kasting

This is what you urged me to do a while ago. I have one major ...

6 years, 8 months ago (2014-04-05 00:23:08 UTC) #1

Mark P

Can you point me to where we discussed this before? I'd like to refresh my ...

6 years, 8 months ago (2014-04-05 00:27:48 UTC) #2

Can you point me to where we discussed this before?  I'd like to refresh my
memory.

thanks,
mark



On Fri, Apr 4, 2014 at 5:23 PM, <pkasting@chromium.org> wrote:

> Reviewers: Mark P,
>
> Message:
> This is what you urged me to do a while ago.
>
> I have one major concern here: we're now returning QUERY in cases where
> the user
> almost certainly intended to type a URL, but didn't type a valid one.
>  This will
> affect the type and ranking of results that appear in the omnibox
> dropdown, and
> that may be a bad change.  Perhaps we'd like to rank our results as if the
> user
> typed a URL, except for what we make default.
>
> I don't know how to measure the effect of this concern.
>
> Description:
> Make AutocompleteInput::Parse() more strict: return QUERY for all inputs
> that
> can't be fixed up into valid URLs.
>
> BUG=357172
> TEST=Typing "google.com/%00" (no quotes) into a debug build omnibox
> should not
> cause a DCHECK
>
> Please review this at https://codereview.chromium.org/226283009/
>
> SVN Base: svn://svn.chromium.org/chrome/trunk/src/
>
> Affected files (+15, -29 lines):
>   M     chrome/browser/autocomplete/autocomplete_input.cc
>   M     chrome/browser/autocomplete/autocomplete_input_unittest.cc
>   M     chrome/browser/autocomplete/history_quick_provider.cc
>   M     chrome/browser/autocomplete/history_url_provider.cc
>
>
> Index: chrome/browser/autocomplete/autocomplete_input.cc
> ===================================================================
> --- chrome/browser/autocomplete/autocomplete_input.cc   (revision 261885)
> +++ chrome/browser/autocomplete/autocomplete_input.cc   (working copy)
> @@ -150,20 +150,16 @@
>    if (scheme)
>      *scheme = parsed_scheme;
>
> -  // Try to fixup and canonicalize the user's typing.  We use this to help
> -  // determine if it's safe to return "URL" as the type of anything that
> has an
> -  // explicit, non-HTTP[S] scheme.  (HTTP[S] and "no scheme" inputs get
> more
> -  // sophisticated heuristics below.)  If we can't canonicalize such
> inputs, we
> -  // shouldn't mark them as "URL"s, because the rest of the autocomplete
> system
> -  // isn't going to be able to produce navigable URL matches for them,
> which can
> -  // lead to DCHECK failures later.
> +  // If we can't canonicalize the user's input, the rest of the
> autocomplete
> +  // system isn't going to be able to produce a navigable URL match for
> it.
> +  // So we just return QUERY immediately in these cases.
>    GURL placeholder_canonicalized_url;
>    if (!canonicalized_url)
>      canonicalized_url = &placeholder_canonicalized_url;
>    *canonicalized_url = URLFixerUpper::FixupURL(base::UTF16ToUTF8(text),
>
> base::UTF16ToUTF8(desired_tld));
> -  Type return_value_for_non_http_url =
> -      canonicalized_url->is_valid() ? URL : QUERY;
> +  if (!canonicalized_url->is_valid())
> +    return QUERY;
>
>    if (LowerCaseEqualsASCII(parsed_scheme, content::kFileScheme)) {
>      // A user might or might not type a scheme when entering a file URL.
>  In
> @@ -190,7 +186,7 @@
>          LowerCaseEqualsASCII(parsed_scheme, content::kViewSourceScheme)
> ||
>          LowerCaseEqualsASCII(parsed_scheme, content::kJavaScriptScheme)
> ||
>          LowerCaseEqualsASCII(parsed_scheme, content::kDataScheme))
> -      return return_value_for_non_http_url;
> +      return URL;
>
>      // Not an internal protocol.  Check and see if the user has explicitly
>      // opened this scheme as a URL before, or if the "scheme" is actually
> a
> @@ -206,7 +202,7 @@
>              base::UTF16ToUTF8(parsed_scheme), true);
>      switch (block_state) {
>        case ExternalProtocolHandler::DONT_BLOCK:
> -        return return_value_for_non_http_url;
> +        return URL;
>
>        case ExternalProtocolHandler::BLOCK:
>          // If we don't want the user to open the URL, don't let it be
> navigated
> @@ -270,9 +266,11 @@
>    // between an HTTP URL and a query, or the scheme is HTTP or HTTPS, in
> which
>    // case we should reject invalid formulations.
>
> -  // If we have an empty host it can't be a URL.
> -  if (!parts->host.is_nonempty())
> -    return QUERY;
> +  // Sanity-checks: GURL should have failed to canonicalize this as an
> HTTP[S]
> +  // URL if it had an empty host or an invalid port.
> +  DCHECK(parts->host.is_nonempty());
> +  DCHECK_NE(url_parse::PORT_INVALID,
> +            url_parse::ParsePort(text.c_str(), parts->port));
>
>    // Likewise, the RCDS can reject certain obviously-invalid hosts.  (We
> also
>    // use the registry length later below.)
> @@ -335,16 +333,6 @@
>              (host.find(' ') == base::string16::npos))) ? UNKNOWN : QUERY;
>    }
>
> -  // A port number is a good indicator that this is a URL.  However, it
> might
> -  // also be a query like "1.66:1" that looks kind of like an IP address
> and
> -  // port number. So here we only check for "port numbers" that are
> illegal and
> -  // thus mean this can't be navigated to (e.g. "1.2.3.4:garbage"), and
> we save
> -  // handling legal port numbers until after the "IP address"
> determination
> -  // below.
> -  if (url_parse::ParsePort(text.c_str(), parts->port) ==
> -      url_parse::PORT_INVALID)
> -    return QUERY;
> -
>    // Now that we've ruled out all schemes other than http or https and
> done a
>    // little more sanity checking, the presence of a scheme means this is
> likely
>    // a URL.
> Index: chrome/browser/autocomplete/autocomplete_input_unittest.cc
> ===================================================================
> --- chrome/browser/autocomplete/autocomplete_input_unittest.cc  (revision
> 261885)
> +++ chrome/browser/autocomplete/autocomplete_input_unittest.cc  (working
> copy)
> @@ -33,6 +33,7 @@
>      { ASCIIToUTF16("foo.-com"), AutocompleteInput::QUERY },
>      { ASCIIToUTF16("foo/"), AutocompleteInput::URL },
>      { ASCIIToUTF16("foo/bar"), AutocompleteInput::UNKNOWN },
> +    { ASCIIToUTF16("foo/bar%00"), AutocompleteInput::QUERY },
>      { ASCIIToUTF16("foo/bar/"), AutocompleteInput::URL },
>      { ASCIIToUTF16("foo/bar baz\\"), AutocompleteInput::URL },
>      { ASCIIToUTF16("foo.com/bar"), AutocompleteInput::URL },
> @@ -87,6 +88,7 @@
>      { ASCIIToUTF16("http://foo.c"), AutocompleteInput::URL },
>      { ASCIIToUTF16("http://foo.com"), AutocompleteInput::URL },
>      { ASCIIToUTF16("http://foo_bar.com"), AutocompleteInput::URL },
> +    { ASCIIToUTF16("http://foo/bar%00"), AutocompleteInput::QUERY },
>      { ASCIIToUTF16("http://foo/bar baz"), AutocompleteInput::URL },
>      { ASCIIToUTF16("http://-foo.com"), AutocompleteInput::URL },
>      { ASCIIToUTF16("http://foo-.com"), AutocompleteInput::URL },
> Index: chrome/browser/autocomplete/history_quick_provider.cc
> ===================================================================
> --- chrome/browser/autocomplete/history_quick_provider.cc       (revision
> 261885)
> +++ chrome/browser/autocomplete/history_quick_provider.cc       (working
> copy)
> @@ -124,9 +124,7 @@
>    // provider won't promote the URL-what-you-typed match to first
>    // for these inputs.
>    const bool can_have_url_what_you_typed_match_first =
> -      autocomplete_input_.canonicalized_url().is_valid() &&
>        (autocomplete_input_.type() != AutocompleteInput::QUERY) &&
> -      (autocomplete_input_.type() != AutocompleteInput::FORCED_QUERY) &&
>        (!autocomplete_input_.parts().username.is_nonempty() ||
>         autocomplete_input_.parts().password.is_nonempty() ||
>         autocomplete_input_.parts().path.is_nonempty());
> Index: chrome/browser/autocomplete/history_url_provider.cc
> ===================================================================
> --- chrome/browser/autocomplete/history_url_provider.cc (revision 261885)
> +++ chrome/browser/autocomplete/history_url_provider.cc (working copy)
> @@ -509,7 +509,6 @@
>    // Otherwise, this is just low-quality noise.  In the cases where we've
> parsed
>    // as UNKNOWN, we'll still show an accidental search infobar if need be.
>    bool have_what_you_typed_match =
> -      params->input.canonicalized_url().is_valid() &&
>        (params->input.type() != AutocompleteInput::QUERY) &&
>        ((params->input.type() != AutocompleteInput::UNKNOWN) ||
>         (classifier.type() == VisitClassifier::UNVISITED_INTRANET) ||
> @@ -701,8 +700,7 @@
>    const bool trim_http = !AutocompleteInput::HasHTTPScheme(input.text());
>    // Don't do this for queries -- while we can sometimes mark up a match
> for
>    // this, it's not what the user wants, and just adds noise.
> -  if ((input.type() != AutocompleteInput::QUERY) &&
> -      input.canonicalized_url().is_valid()) {
> +  if (input.type() != AutocompleteInput::QUERY) {
>      AutocompleteMatch what_you_typed(SuggestExactInput(
>          input.text(), input.canonicalized_url(), trim_http));
>      what_you_typed.relevance = CalculateRelevance(WHAT_YOU_TYPED, 0);
>
>
>

To unsubscribe from this group and stop receiving emails from it, send an email
to chromium-reviews+unsubscribe@chromium.org.

Peter Kasting

On 2014/04/05 00:27:48, Mark P wrote: > Can you point me to where we discussed ...

6 years, 8 months ago (2014-04-05 00:30:49 UTC) #3

Mark P

On 2014/04/05 00:30:49, Peter Kasting wrote: > On 2014/04/05 00:27:48, Mark P wrote: > > ...

6 years, 8 months ago (2014-04-09 22:07:08 UTC) #4

Peter Kasting

On 2014/04/09 22:07:08, Mark P wrote: > Yes, I think I said that. This change, ...

6 years, 8 months ago (2014-04-09 22:56:42 UTC) #5

Mark P

On 2014/04/15 00:00:26, Peter Kasting wrote: > Ping. I've been postponing this review because the ...

6 years, 8 months ago (2014-04-15 16:20:27 UTC) #7

Peter Kasting

On 2014/04/15 16:20:27, Mark P wrote: > On 2014/04/15 00:00:26, Peter Kasting wrote: > > ...

6 years, 8 months ago (2014-04-15 20:01:52 UTC) #8

Mark P

I don't think this is good as-is. I started going through the places where AutocompleteInput::QUERY ...

6 years, 8 months ago (2014-04-15 22:00:10 UTC) #9

I don't think this is good as-is.  I started going through the places where
AutocompleteInput::QUERY is mentioned in the code and before I even made it
through 20% of them, I found lines 372-385 in
chrome/browser/autocomplete/search_provider.cc

This change will make the suggest server get asked to provide suggestions
for URLs that it didn't used to.  For instance, if I cut
http://www.mycompany.com/?project=Secret+Project+to+Launch+Balloons+for+Inter...
and paste it into the URL box incorrectly, getting
http://www.mycompany.com/?project=Secret+Project+to+Launch+Balloons+for+Inter...,
suddenly this will be an invalid URL that will now get sent to the
suggest server.

This kind of thing concerns me (maybe it needs privacy review?  or perhaps
you can fix this code here), and I wonder what other issues I'll find if I
continue auditing all uses.

After writing that paragraph, I kept looking for a bit.  I found another
problematic case: chrome/browser/autocomplete/history_url_provider.cc line
835, where we depending on an input like "c/# foo" to get classified as a
URL.

You may want to consider fixing this bug without making this change.
 Otherwise, I can continue auditing every use of AutocompleteInput::Type to
see what else may come up and have you fix all the fallout.  This may be a
pain for both of us.

--mark

On Tue, Apr 15, 2014 at 1:01 PM, <pkasting@chromium.org> wrote:

> On 2014/04/15 16:20:27, Mark P wrote:
>
>> On 2014/04/15 00:00:26, Peter Kasting wrote:
>> > Ping.
>>
>
>  I've been postponing this review because the change makes me nervous.
>>  Can you
>> give me some examples where it improves the ordering of suggestions?
>>
>
> The change isn't designed to "improve the ordering of suggestions".
>  Indeed, my
> first comment notes that my only concern about this patch is that we may
> change
> the ordering of suggestions sometimes in ways I don't know would be an
> improvement.  I don't know that it would make things worse either.  The
> more I
> think about it the more I'm skeptical it would have much effect at all.
>  There
> would only be a change for inputs that we previously erroneously thought
> were
> URLs, that aren't valid URLs -- edge cases like the "google.com/%00" in
> the
> original report.  Given this, I don't think the change should even affect
> most
> inputs, and in cases where it does, I can't think what valid URLs we'd
> have been
> suggesting, that could be ranked lower than before.
>
> The change does fix a DCHECK, because it is a correctness fix, so I'm not
> content to just sit indefinitely on it.  Personally I am OK with just
> landing
> this.
>
> https://codereview.chromium.org/226283009/
>

To unsubscribe from this group and stop receiving emails from it, send an email
to chromium-reviews+unsubscribe@chromium.org.

Peter Kasting

On 2014/04/15 22:00:10, Mark P wrote: > I don't think this is good as-is. I ...

6 years, 8 months ago (2014-04-15 22:11:30 UTC) #10

Mark P

On Tue, Apr 15, 2014 at 3:11 PM, <pkasting@chromium.org> wrote: > On 2014/04/15 22:00:10, Mark ...

6 years, 8 months ago (2014-04-15 22:30:44 UTC) #11

On Tue, Apr 15, 2014 at 3:11 PM, <pkasting@chromium.org> wrote:

> On 2014/04/15 22:00:10, Mark P wrote:
>
>> I don't think this is good as-is.  I started going through the places
>> where
>> AutocompleteInput::QUERY is mentioned in the code and before I even made
>> it
>> through 20% of them, I found lines 372-385 in
>> chrome/browser/autocomplete/search_provider.cc
>>
>
> I think you're quoting the wrong line numbers?


Sorry, I used internal code search because the external one seems to be
broken.  The internal one appears a bit out of date.


> Neither search_provider.cc nor
> base_search_provider.cc do things relating to "what to send to the suggest
> server" on lines 372-385.  Maybe you meant 568-581?

Close.  I meant line 553-566.

>
>
>  This change will make the suggest server get asked to provide suggestions
>> for URLs that it didn't used to.  For instance, if I cut
>>
>
> http://www.mycompany.com/?project=Secret+Project+to+
> Launch+Balloons+for+Internet+Access%21
>
>> and paste it into the URL box incorrectly, getting
>>
>
> http://www.mycompany.com/?project=Secret+Project+to+
> Launch+Balloons+for+Internet+Access%252instead,
>
>  suddenly this will be an invalid URL that will now get sent to the
>> suggest server.
>>
>
> Are you sure this will actually change?  In your example, I would expect
> input_.parts().query.is_nonempty() to be true even though the input type
> is
> QUERY, which would mean this would not be sent.  Generally SegmentURL() can
> break the input into parts even when it is not canonicalizable.
>
>
>  After writing that paragraph, I kept looking for a bit.  I found another
>> problematic case: chrome/browser/autocomplete/history_url_provider.cc
>> line
>> 835, where we depending on an input like "c/# foo" to get classified as a
>> URL.
>>
>
> Why would that not continue to be classified as a URL?  I wouldn't expect
> my
> change to affect this.
>

According to the spec, a URL must not contain a literal space.


> I would simply test both cases above to ensure my change doesn't affect
> them,
> but I'm home sick today and don't have VPN access to my machine.
>
> https://codereview.chromium.org/226283009/
>

To unsubscribe from this group and stop receiving emails from it, send an email
to chromium-reviews+unsubscribe@chromium.org.

Peter Kasting

On Tue, Apr 15, 2014 at 3:30 PM, Mark Pearson <mpearson@chromium.org> wrote: > Neither search_provider.cc ...

6 years, 8 months ago (2014-04-15 22:56:30 UTC) #12

Peter Kasting

So, in our IM chat about this, you said two things: (1) You want to ...

6 years, 8 months ago (2014-04-22 21:25:30 UTC) #13

Mark P

Okay, I haven't yet audited the places type() is used yet, but I looked closely ...

6 years, 8 months ago (2014-04-24 18:31:29 UTC) #14

Peter Kasting

On 2014/04/24 18:31:29, Mark P wrote: > for an input such as "abc", is it ...

6 years, 8 months ago (2014-04-24 20:43:20 UTC) #15

On 2014/04/24 18:31:29, Mark P wrote:
> for an input such as "abc", is it possible that this doesn't canonicalize &
> validate?  e.g., (forgive my hand-wavy language) the user doesn't have an
> intranet loopback configured in their network or they have no implicit domain
> specified in their registries?

No.  Canonicalization does not pay attention to the user's history.  Whether
"http://abc/" is theoretically legal (the question being asked by the code here)
is different than whether "http://abc/" would actually resolve (the question
you're effectively asking).

> won't site: be now classified as a QUERY, where we made an explicit decision
to
> classify it as UNKNOWN before?

No, it should be handled above the point where this change is made, by the code
that handles "schemes other than HTTP or HTTPS", and thus should be unchanged.

> how about an input such as "." when the user has certain registry controlled
> domains or "..foop.com" or "foop."?  does the classification for these change
at
> all?  What about under certain registry-controlled-domain settings?

Canonicalization also doesn't care about registry-controlled domains. 
"http://foop." and "http://..foop.com/" will be canonicalizable, or not,
regardless of what domains exist.  Again, what registry-controlled domains
affect here is whether a navigation to these URLs would succeed, not whether the
URLs are syntactically valid.  This is really the same question as your first.

> what about the whole segment to handle invalid hostnames?  see lines 309-340. 
> consider for example a hostname with an "_".  some intranets resolve these. 
now
> I think we'd call them QUERY instead of UNKNOWN.

As the comment on that section says, the reason for the section's existence
there is because we try to be stricter than what GURL requires.  That means that
we're handling things there that the canonicalizer will mark as legal.  So this
patch won't affect them.

> what about the trailing slashes?  line 372-377.  we used to allow them to
force
> URL-ness, unless we hit an earlier test.  Now they're not as powerful I think.

This only affects cases where the user's input isn't navigable anyway.  For
example,

abc%00/

Because this isn't a navigable URL, it's reasonable to not mark it as one.

> what about URLs with valid TLD that Chrome doesn't know about yet?  These will
> get marked as QUERY.  Previously they were UNKNOWN.  Or the user appended a
> slash to force them to be a URL.

No, these should not get marked as QUERY, for the same reason as the two
questions above.  Canonicalization pays no attention to the registry-controlled
domain list.  "http://foo.bar.garbage/" is a legal URL, and will be marked as
such, regardless of whether ".garbage" is a real TLD.

In short: None of the points you raise appear to be problematic.

Mark P

On Thu, Apr 24, 2014 at 1:43 PM, <pkasting@chromium.org> wrote: > On 2014/04/24 18:31:29, Mark ...

6 years, 8 months ago (2014-04-24 22:42:00 UTC) #16

On Thu, Apr 24, 2014 at 1:43 PM, <pkasting@chromium.org> wrote:

> On 2014/04/24 18:31:29, Mark P wrote:
>
>> for an input such as "abc", is it possible that this doesn't canonicalize
>> &
>> validate?  e.g., (forgive my hand-wavy language) the user doesn't have an
>> intranet loopback configured in their network or they have no implicit
>> domain
>> specified in their registries?
>>
>
> No.  Canonicalization does not pay attention to the user's history.
>  Whether
> "http://abc/" is theoretically legal (the question being asked by the
> code here)
> is different than whether "http://abc/" would actually resolve (the
> question
> you're effectively asking).

Whether canonicalization pays attention to the user's history is not my
question.  My question was whether canonicalization and validity can depend
on the user's network and/or registry configs.

[addendum: I think you answered this question below.  I guess there's no
need to comment further.]

>
>
>  won't site: be now classified as a QUERY, where we made an explicit
>> decision
>>
> to
>
>> classify it as UNKNOWN before?
>>
>
> No, it should be handled above the point where this change is made, by the
> code
> that handles "schemes other than HTTP or HTTPS", and thus should be
> unchanged.

I don't understand.  This will not be "handled above the point where this
change is made" because this change was made at line 161, earlier in the
file than the code that handles "schemes other than HTTP or HTTPS" (starts
at line 177).

Or are you perhaps implying that all blah:XYZ will be reported as
is_valid() for any value of blah and any value of XYZ (including funky
characters, unescaped items, etc.)?

>
>  how about an input such as "." when the user has certain registry
>> controlled
>> domains or "..foop.com" or "foop."?  does the classification for these
>> change
>>
> at
>
>> all?  What about under certain registry-controlled-domain settings?
>>
>
> Canonicalization also doesn't care about registry-controlled domains.
> "http://foop." and "http://..foop.com/" will be canonicalizable, or not,
> regardless of what domains exist.  Again, what registry-controlled domains
> affect here is whether a navigation to these URLs would succeed, not
> whether the
> URLs are syntactically valid.  This is really the same question as your
> first.
>
>
>  what about the whole segment to handle invalid hostnames?  see lines
>> 309-340.
>> consider for example a hostname with an "_".  some intranets resolve
>> these.
>>
> now
>
>> I think we'd call them QUERY instead of UNKNOWN.
>>
>
> As the comment on that section says, the reason for the section's existence
> there is because we try to be stricter than what GURL requires.  That
> means that
> we're handling things there that the canonicalizer will mark as legal.  So
> this
> patch won't affect them.

Perhaps I'm misled by the comment.  Regardless, the URL spec clearly says
that hostnames cannot contain underscores.  That makes me think is_valid()
is (or ought to) return false for them.  This means you'll return QUERY
early and not reach this code.  If you reached this code, you might decide
to return UNKNOWN.

>
>
>  what about the trailing slashes?  line 372-377.  we used to allow them to
>>
> force
>
>> URL-ness, unless we hit an earlier test.  Now they're not as powerful I
>> think.
>>
>
> This only affects cases where the user's input isn't navigable anyway.  For
> example,
>
> abc%00/
>
> Because this isn't a navigable URL, it's reasonable to not mark it as one.

I would bet some webservers would reply fine to requests for GET "/1<2".
 e.g., the request foo/1<2 that routes to webserver foo.  Aren't these not
actually valid (< should be escaped)?  And hence you'd mark them as QUERY.
 Oreviously we could get them to appear first (be interpreted as a URL) by
typing foo/1<2/  (trailing slash).  Now I don't think we can.

Maybe I simply should patch you change to verify these things...

--mark

>
>  what about URLs with valid TLD that Chrome doesn't know about yet?  These
>> will
>> get marked as QUERY.  Previously they were UNKNOWN.  Or the user appended
>> a
>> slash to force them to be a URL.
>>
>
> No, these should not get marked as QUERY, for the same reason as the two
> questions above.  Canonicalization pays no attention to the
> registry-controlled
> domain list.  "http://foo.bar.garbage/" is a legal URL, and will be
> marked as
> such, regardless of whether ".garbage" is a real TLD.
>
> In short: None of the points you raise appear to be problematic.
>
> https://codereview.chromium.org/226283009/
>

To unsubscribe from this group and stop receiving emails from it, send an email
to chromium-reviews+unsubscribe@chromium.org.

Peter Kasting

On 2014/04/24 22:42:00, Mark P wrote: > > won't site: be now classified as a ...

6 years, 8 months ago (2014-04-24 23:04:57 UTC) #17

On 2014/04/24 22:42:00, Mark P wrote:
> >  won't site: be now classified as a QUERY, where we made an explicit
> >> decision
> >>
> > to
> >
> >> classify it as UNKNOWN before?
> >>
> >
> > No, it should be handled above the point where this change is made, by the
> > code
> > that handles "schemes other than HTTP or HTTPS", and thus should be
> > unchanged.
> 
> 
> I don't understand.  This will not be "handled above the point where this
> change is made" because this change was made at line 161, earlier in the
> file than the code that handles "schemes other than HTTP or HTTPS" (starts
> at line 177).

Oops.  Sorry, misread my own patch.  You're right about the ordering.

This should still be OK.  For schemes that GURL doesn't know about, it will be
very permissive; it assumes just about anything is legal.

However, let's assume for the sake of argument we have some input, xyz:<blah>,
which GURL says is not valid.  And let's assume the previous code treated this
as UNKNOWN.  The only reason for doing this would be to allow the user to
attempt to "handle" this via e.g. ShellExecute or something.

The problem is, even today, the user can't do that.  If we mark the
destination_url as not valid, we won't produce a match which allows the user to
attempt to open this "URL" to begin with.

So even if such a case exists, this shouldn't make much difference to it.

> >  what about the whole segment to handle invalid hostnames?  see lines
> >> 309-340.
> >> consider for example a hostname with an "_".  some intranets resolve
> >> these.
> >>
> > now
> >
> >> I think we'd call them QUERY instead of UNKNOWN.
> >>
> >
> > As the comment on that section says, the reason for the section's existence
> > there is because we try to be stricter than what GURL requires.  That
> > means that
> > we're handling things there that the canonicalizer will mark as legal.  So
> > this
> > patch won't affect them.
> 
> 
> Perhaps I'm misled by the comment.  Regardless, the URL spec clearly says
> that hostnames cannot contain underscores.  That makes me think is_valid()
> is (or ought to) return false for them.  This means you'll return QUERY
> early and not reach this code.  If you reached this code, you might decide
> to return UNKNOWN.

GURL canonicalization has very little to do with URL specs -- even less than
Chrome's overall behavior does.  The URL spec banning underscores has been
superseded, I believe twice, by successively more permissive specs -- and the
real world is even more permissive than that.  Chrome attempts to match
real-world behavior.  It does so here by being limited in how much more
restrictive it aims to be about handling URLs that GURL considers valid.  This
code is NOT about being _less_ restrictive than GURL.  GURL won't mark URLs with
underscores in hostnames invalid.  And if it did, then marking them as UNKNOWN
here wouldn't matter: again, we'd never create a navigational match to such
URLs, so the user could never choose to open them, regardless of what their
local intranet might do.

I feel like a broken record, but it's important to keep in mind that there is
never a case today where canonicalization fails and yet the user can navigate to
their input.  It cannot possibly happen.  This change doesn't affect that at
all.

As I said though, in this case, even that doesn't matter.  Canonicalization
accepts as valid various things that we then additionally restrict if
IsCanonicalizedHostCompliant() fails.  So the code here isn't widening the
permissiveness, it's narrowing it.

> I would bet some webservers would reply fine to requests for GET "/1<2".
>  e.g., the request foo/1<2 that routes to webserver foo.  Aren't these not
> actually valid (< should be escaped)?

If "<" needs escaping, that should be done as part of fixup/canonicalization. 
We shouldn't reject any inputs that are valid if escaped because we failed to
escape them.

> Maybe I simply should patch you change to verify these things...

It might be helpful.

Mark P

On Thu, Apr 24, 2014 at 4:04 PM, <pkasting@chromium.org> wrote: > On 2014/04/24 22:42:00, Mark ...

6 years, 8 months ago (2014-04-24 23:54:59 UTC) #18

On Thu, Apr 24, 2014 at 4:04 PM, <pkasting@chromium.org> wrote:

> On 2014/04/24 22:42:00, Mark P wrote:
>
>> >  won't site: be now classified as a QUERY, where we made an explicit
>> >> decision
>> >>
>> > to
>> >
>> >> classify it as UNKNOWN before?
>> >>
>> >
>> > No, it should be handled above the point where this change is made, by
>> the
>> > code
>> > that handles "schemes other than HTTP or HTTPS", and thus should be
>> > unchanged.
>>
>
>
>  I don't understand.  This will not be "handled above the point where this
>> change is made" because this change was made at line 161, earlier in the
>> file than the code that handles "schemes other than HTTP or HTTPS" (starts
>> at line 177).
>>
>
> Oops.  Sorry, misread my own patch.  You're right about the ordering.
>
> This should still be OK.  For schemes that GURL doesn't know about, it
> will be
> very permissive; it assumes just about anything is legal.
>
> However, let's assume for the sake of argument we have some input,
> xyz:<blah>,
> which GURL says is not valid.  And let's assume the previous code treated
> this
> as UNKNOWN.  The only reason for doing this would be to allow the user to
> attempt to "handle" this via e.g. ShellExecute or something.
>
> The problem is, even today, the user can't do that.  If we mark the
> destination_url as not valid, we won't produce a match which allows the
> user to
> attempt to open this "URL" to begin with.
>

Ah, this is the key.  I know you told me this umpteen times but it finally
sunk in, and I read the code that decides whether to create a UWYT match to
verify it.  I see what you mean.

My mind is now much more at rest.

Now I'll finish auditing the calls to type() to see if I foresee any other
issues.

--mark


>
> So even if such a case exists, this shouldn't make much difference to it.
>
>
>  >  what about the whole segment to handle invalid hostnames?  see lines
>> >> 309-340.
>> >> consider for example a hostname with an "_".  some intranets resolve
>> >> these.
>> >>
>> > now
>> >
>> >> I think we'd call them QUERY instead of UNKNOWN.
>> >>
>> >
>> > As the comment on that section says, the reason for the section's
>> existence
>> > there is because we try to be stricter than what GURL requires.  That
>> > means that
>> > we're handling things there that the canonicalizer will mark as legal.
>>  So
>> > this
>> > patch won't affect them.
>>
>
>
>  Perhaps I'm misled by the comment.  Regardless, the URL spec clearly says
>> that hostnames cannot contain underscores.  That makes me think is_valid()
>> is (or ought to) return false for them.  This means you'll return QUERY
>> early and not reach this code.  If you reached this code, you might decide
>> to return UNKNOWN.
>>
>
> GURL canonicalization has very little to do with URL specs -- even less
> than
> Chrome's overall behavior does.  The URL spec banning underscores has been
> superseded, I believe twice, by successively more permissive specs -- and
> the
> real world is even more permissive than that.  Chrome attempts to match
> real-world behavior.  It does so here by being limited in how much more
> restrictive it aims to be about handling URLs that GURL considers valid.
>  This
> code is NOT about being _less_ restrictive than GURL.  GURL won't mark
> URLs with
> underscores in hostnames invalid.  And if it did, then marking them as
> UNKNOWN
> here wouldn't matter: again, we'd never create a navigational match to such
> URLs, so the user could never choose to open them, regardless of what their
> local intranet might do.
>
> I feel like a broken record, but it's important to keep in mind that there
> is
> never a case today where canonicalization fails and yet the user can
> navigate to
> their input.  It cannot possibly happen.  This change doesn't affect that
> at
> all.
>
> As I said though, in this case, even that doesn't matter.  Canonicalization
> accepts as valid various things that we then additionally restrict if
> IsCanonicalizedHostCompliant() fails.  So the code here isn't widening the
> permissiveness, it's narrowing it.
>
>
>  I would bet some webservers would reply fine to requests for GET "/1<2".
>>   e.g., the request foo/1<2 that routes to webserver foo.  Aren't these
>> not
>> actually valid (< should be escaped)?
>>
>
> If "<" needs escaping, that should be done as part of
> fixup/canonicalization.
> We shouldn't reject any inputs that are valid if escaped because we failed
> to
> escape them.
>
>
>  Maybe I simply should patch you change to verify these things...
>>
>
> It might be helpful.
>
> https://codereview.chromium.org/226283009/
>

To unsubscribe from this group and stop receiving emails from it, send an email
to chromium-reviews+unsubscribe@chromium.org.

Mark P

Okay, I audited all the code. One thing in particular I looked for was whether ...

6 years, 8 months ago (2014-04-25 02:59:19 UTC) #19

Mark P

lgtm https://codereview.chromium.org/226283009/diff/40001/chrome/browser/autocomplete/history_quick_provider.cc File chrome/browser/autocomplete/history_quick_provider.cc (left): https://codereview.chromium.org/226283009/diff/40001/chrome/browser/autocomplete/history_quick_provider.cc#oldcode129 chrome/browser/autocomplete/history_quick_provider.cc:129: (autocomplete_input_.type() != AutocompleteInput::FORCED_QUERY) && On 2014/04/25 02:59:20, Mark ...

6 years, 8 months ago (2014-04-25 03:00:29 UTC) #20

Peter Kasting

6 years, 8 months ago (2014-04-25 18:08:00 UTC) #21

Message was sent while issue was closed.

Committed patchset #3 manually as r266217 (presubmit successful).

Issue 226283009: Make AutocompleteInput::Parse() more strict: return QUERY for all inputs that (Closed)

Description

Patch Set 1 #

Patch Set 2 : #

Patch Set 3 : #

Messages