|
|
Created:
6 years, 8 months ago by Peter Kasting Modified:
6 years, 8 months ago Reviewers:
Mark P CC:
chromium-reviews, James Su Base URL:
svn://svn.chromium.org/chrome/trunk/src/ Visibility:
Public. |
DescriptionMake AutocompleteInput::Parse() more strict: return QUERY for all inputs that
can't be fixed up into valid URLs.
BUG=357172
TEST=Typing "google.com/%00" (no quotes) into a debug build omnibox should not cause a DCHECK
R=mpearson@chromium.org
Committed: https://src.chromium.org/viewvc/chrome?view=rev&revision=266217
Patch Set 1 #Patch Set 2 : #Patch Set 3 : #
Total comments: 2
Messages
Total messages: 21 (0 generated)
This is what you urged me to do a while ago. I have one major concern here: we're now returning QUERY in cases where the user almost certainly intended to type a URL, but didn't type a valid one. This will affect the type and ranking of results that appear in the omnibox dropdown, and that may be a bad change. Perhaps we'd like to rank our results as if the user typed a URL, except for what we make default. I don't know how to measure the effect of this concern.
Can you point me to where we discussed this before? I'd like to refresh my memory. thanks, mark On Fri, Apr 4, 2014 at 5:23 PM, <pkasting@chromium.org> wrote: > Reviewers: Mark P, > > Message: > This is what you urged me to do a while ago. > > I have one major concern here: we're now returning QUERY in cases where > the user > almost certainly intended to type a URL, but didn't type a valid one. > This will > affect the type and ranking of results that appear in the omnibox > dropdown, and > that may be a bad change. Perhaps we'd like to rank our results as if the > user > typed a URL, except for what we make default. > > I don't know how to measure the effect of this concern. > > Description: > Make AutocompleteInput::Parse() more strict: return QUERY for all inputs > that > can't be fixed up into valid URLs. > > BUG=357172 > TEST=Typing "google.com/%00" (no quotes) into a debug build omnibox > should not > cause a DCHECK > > Please review this at https://codereview.chromium.org/226283009/ > > SVN Base: svn://svn.chromium.org/chrome/trunk/src/ > > Affected files (+15, -29 lines): > M chrome/browser/autocomplete/autocomplete_input.cc > M chrome/browser/autocomplete/autocomplete_input_unittest.cc > M chrome/browser/autocomplete/history_quick_provider.cc > M chrome/browser/autocomplete/history_url_provider.cc > > > Index: chrome/browser/autocomplete/autocomplete_input.cc > =================================================================== > --- chrome/browser/autocomplete/autocomplete_input.cc (revision 261885) > +++ chrome/browser/autocomplete/autocomplete_input.cc (working copy) > @@ -150,20 +150,16 @@ > if (scheme) > *scheme = parsed_scheme; > > - // Try to fixup and canonicalize the user's typing. We use this to help > - // determine if it's safe to return "URL" as the type of anything that > has an > - // explicit, non-HTTP[S] scheme. (HTTP[S] and "no scheme" inputs get > more > - // sophisticated heuristics below.) If we can't canonicalize such > inputs, we > - // shouldn't mark them as "URL"s, because the rest of the autocomplete > system > - // isn't going to be able to produce navigable URL matches for them, > which can > - // lead to DCHECK failures later. > + // If we can't canonicalize the user's input, the rest of the > autocomplete > + // system isn't going to be able to produce a navigable URL match for > it. > + // So we just return QUERY immediately in these cases. > GURL placeholder_canonicalized_url; > if (!canonicalized_url) > canonicalized_url = &placeholder_canonicalized_url; > *canonicalized_url = URLFixerUpper::FixupURL(base::UTF16ToUTF8(text), > > base::UTF16ToUTF8(desired_tld)); > - Type return_value_for_non_http_url = > - canonicalized_url->is_valid() ? URL : QUERY; > + if (!canonicalized_url->is_valid()) > + return QUERY; > > if (LowerCaseEqualsASCII(parsed_scheme, content::kFileScheme)) { > // A user might or might not type a scheme when entering a file URL. > In > @@ -190,7 +186,7 @@ > LowerCaseEqualsASCII(parsed_scheme, content::kViewSourceScheme) > || > LowerCaseEqualsASCII(parsed_scheme, content::kJavaScriptScheme) > || > LowerCaseEqualsASCII(parsed_scheme, content::kDataScheme)) > - return return_value_for_non_http_url; > + return URL; > > // Not an internal protocol. Check and see if the user has explicitly > // opened this scheme as a URL before, or if the "scheme" is actually > a > @@ -206,7 +202,7 @@ > base::UTF16ToUTF8(parsed_scheme), true); > switch (block_state) { > case ExternalProtocolHandler::DONT_BLOCK: > - return return_value_for_non_http_url; > + return URL; > > case ExternalProtocolHandler::BLOCK: > // If we don't want the user to open the URL, don't let it be > navigated > @@ -270,9 +266,11 @@ > // between an HTTP URL and a query, or the scheme is HTTP or HTTPS, in > which > // case we should reject invalid formulations. > > - // If we have an empty host it can't be a URL. > - if (!parts->host.is_nonempty()) > - return QUERY; > + // Sanity-checks: GURL should have failed to canonicalize this as an > HTTP[S] > + // URL if it had an empty host or an invalid port. > + DCHECK(parts->host.is_nonempty()); > + DCHECK_NE(url_parse::PORT_INVALID, > + url_parse::ParsePort(text.c_str(), parts->port)); > > // Likewise, the RCDS can reject certain obviously-invalid hosts. (We > also > // use the registry length later below.) > @@ -335,16 +333,6 @@ > (host.find(' ') == base::string16::npos))) ? UNKNOWN : QUERY; > } > > - // A port number is a good indicator that this is a URL. However, it > might > - // also be a query like "1.66:1" that looks kind of like an IP address > and > - // port number. So here we only check for "port numbers" that are > illegal and > - // thus mean this can't be navigated to (e.g. "1.2.3.4:garbage"), and > we save > - // handling legal port numbers until after the "IP address" > determination > - // below. > - if (url_parse::ParsePort(text.c_str(), parts->port) == > - url_parse::PORT_INVALID) > - return QUERY; > - > // Now that we've ruled out all schemes other than http or https and > done a > // little more sanity checking, the presence of a scheme means this is > likely > // a URL. > Index: chrome/browser/autocomplete/autocomplete_input_unittest.cc > =================================================================== > --- chrome/browser/autocomplete/autocomplete_input_unittest.cc (revision > 261885) > +++ chrome/browser/autocomplete/autocomplete_input_unittest.cc (working > copy) > @@ -33,6 +33,7 @@ > { ASCIIToUTF16("foo.-com"), AutocompleteInput::QUERY }, > { ASCIIToUTF16("foo/"), AutocompleteInput::URL }, > { ASCIIToUTF16("foo/bar"), AutocompleteInput::UNKNOWN }, > + { ASCIIToUTF16("foo/bar%00"), AutocompleteInput::QUERY }, > { ASCIIToUTF16("foo/bar/"), AutocompleteInput::URL }, > { ASCIIToUTF16("foo/bar baz\\"), AutocompleteInput::URL }, > { ASCIIToUTF16("foo.com/bar"), AutocompleteInput::URL }, > @@ -87,6 +88,7 @@ > { ASCIIToUTF16("http://foo.c"), AutocompleteInput::URL }, > { ASCIIToUTF16("http://foo.com"), AutocompleteInput::URL }, > { ASCIIToUTF16("http://foo_bar.com"), AutocompleteInput::URL }, > + { ASCIIToUTF16("http://foo/bar%00"), AutocompleteInput::QUERY }, > { ASCIIToUTF16("http://foo/bar baz"), AutocompleteInput::URL }, > { ASCIIToUTF16("http://-foo.com"), AutocompleteInput::URL }, > { ASCIIToUTF16("http://foo-.com"), AutocompleteInput::URL }, > Index: chrome/browser/autocomplete/history_quick_provider.cc > =================================================================== > --- chrome/browser/autocomplete/history_quick_provider.cc (revision > 261885) > +++ chrome/browser/autocomplete/history_quick_provider.cc (working > copy) > @@ -124,9 +124,7 @@ > // provider won't promote the URL-what-you-typed match to first > // for these inputs. > const bool can_have_url_what_you_typed_match_first = > - autocomplete_input_.canonicalized_url().is_valid() && > (autocomplete_input_.type() != AutocompleteInput::QUERY) && > - (autocomplete_input_.type() != AutocompleteInput::FORCED_QUERY) && > (!autocomplete_input_.parts().username.is_nonempty() || > autocomplete_input_.parts().password.is_nonempty() || > autocomplete_input_.parts().path.is_nonempty()); > Index: chrome/browser/autocomplete/history_url_provider.cc > =================================================================== > --- chrome/browser/autocomplete/history_url_provider.cc (revision 261885) > +++ chrome/browser/autocomplete/history_url_provider.cc (working copy) > @@ -509,7 +509,6 @@ > // Otherwise, this is just low-quality noise. In the cases where we've > parsed > // as UNKNOWN, we'll still show an accidental search infobar if need be. > bool have_what_you_typed_match = > - params->input.canonicalized_url().is_valid() && > (params->input.type() != AutocompleteInput::QUERY) && > ((params->input.type() != AutocompleteInput::UNKNOWN) || > (classifier.type() == VisitClassifier::UNVISITED_INTRANET) || > @@ -701,8 +700,7 @@ > const bool trim_http = !AutocompleteInput::HasHTTPScheme(input.text()); > // Don't do this for queries -- while we can sometimes mark up a match > for > // this, it's not what the user wants, and just adds noise. > - if ((input.type() != AutocompleteInput::QUERY) && > - input.canonicalized_url().is_valid()) { > + if (input.type() != AutocompleteInput::QUERY) { > AutocompleteMatch what_you_typed(SuggestExactInput( > input.text(), input.canonicalized_url(), trim_http)); > what_you_typed.relevance = CalculateRelevance(WHAT_YOU_TYPED, 0); > > > To unsubscribe from this group and stop receiving emails from it, send an email to chromium-reviews+unsubscribe@chromium.org.
On 2014/04/05 00:27:48, Mark P wrote: > Can you point me to where we discussed this before? I'd like to refresh my > memory. I don't have a pointer. There weren't detailed discussions of tradeoffs. You merely said a few times, I think once on a bug and once in talking to me over some medium, that it seemed to you like AutocompleteInput::Parse() really shouldn't ever return URL for anything that couldn't be canonicalized.
On 2014/04/05 00:30:49, Peter Kasting wrote: > On 2014/04/05 00:27:48, Mark P wrote: > > Can you point me to where we discussed this before? I'd like to refresh my > > memory. > > I don't have a pointer. There weren't detailed discussions of tradeoffs. You > merely said a few times, I think once on a bug and once in talking to me over > some medium, that it seemed to you like AutocompleteInput::Parse() really > shouldn't ever return URL for anything that couldn't be canonicalized. Yes, I think I said that. This change, however, seems to go farther--it seems to make everything that can't be a valid URL a QUERY. I believe things that look like URLs but cannot be canonicalized are better marked as UNKNOWN. --mark
On 2014/04/09 22:07:08, Mark P wrote: > Yes, I think I said that. This change, however, seems to go farther--it seems > to make everything that can't be a valid URL a QUERY. I believe things that > look like URLs but cannot be canonicalized are better marked as UNKNOWN. It makes no sense to mark it UNKNOWN when the history system refuses to create navigational matches, so it's literally impossible for the user to navigate. I chose QUERY over UNKNOWN quite intentionally, after examining the behavior of the whole system.
Ping.
On 2014/04/15 00:00:26, Peter Kasting wrote: > Ping. I've been postponing this review because the change makes me nervous. Can you give me some examples where it improves the ordering of suggestions? --mark
On 2014/04/15 16:20:27, Mark P wrote: > On 2014/04/15 00:00:26, Peter Kasting wrote: > > Ping. > > I've been postponing this review because the change makes me nervous. Can you > give me some examples where it improves the ordering of suggestions? The change isn't designed to "improve the ordering of suggestions". Indeed, my first comment notes that my only concern about this patch is that we may change the ordering of suggestions sometimes in ways I don't know would be an improvement. I don't know that it would make things worse either. The more I think about it the more I'm skeptical it would have much effect at all. There would only be a change for inputs that we previously erroneously thought were URLs, that aren't valid URLs -- edge cases like the "google.com/%00" in the original report. Given this, I don't think the change should even affect most inputs, and in cases where it does, I can't think what valid URLs we'd have been suggesting, that could be ranked lower than before. The change does fix a DCHECK, because it is a correctness fix, so I'm not content to just sit indefinitely on it. Personally I am OK with just landing this.
I don't think this is good as-is. I started going through the places where AutocompleteInput::QUERY is mentioned in the code and before I even made it through 20% of them, I found lines 372-385 in chrome/browser/autocomplete/search_provider.cc This change will make the suggest server get asked to provide suggestions for URLs that it didn't used to. For instance, if I cut http://www.mycompany.com/?project=Secret+Project+to+Launch+Balloons+for+Inter... and paste it into the URL box incorrectly, getting http://www.mycompany.com/?project=Secret+Project+to+Launch+Balloons+for+Inter..., suddenly this will be an invalid URL that will now get sent to the suggest server. This kind of thing concerns me (maybe it needs privacy review? or perhaps you can fix this code here), and I wonder what other issues I'll find if I continue auditing all uses. After writing that paragraph, I kept looking for a bit. I found another problematic case: chrome/browser/autocomplete/history_url_provider.cc line 835, where we depending on an input like "c/# foo" to get classified as a URL. You may want to consider fixing this bug without making this change. Otherwise, I can continue auditing every use of AutocompleteInput::Type to see what else may come up and have you fix all the fallout. This may be a pain for both of us. --mark On Tue, Apr 15, 2014 at 1:01 PM, <pkasting@chromium.org> wrote: > On 2014/04/15 16:20:27, Mark P wrote: > >> On 2014/04/15 00:00:26, Peter Kasting wrote: >> > Ping. >> > > I've been postponing this review because the change makes me nervous. >> Can you >> give me some examples where it improves the ordering of suggestions? >> > > The change isn't designed to "improve the ordering of suggestions". > Indeed, my > first comment notes that my only concern about this patch is that we may > change > the ordering of suggestions sometimes in ways I don't know would be an > improvement. I don't know that it would make things worse either. The > more I > think about it the more I'm skeptical it would have much effect at all. > There > would only be a change for inputs that we previously erroneously thought > were > URLs, that aren't valid URLs -- edge cases like the "google.com/%00" in > the > original report. Given this, I don't think the change should even affect > most > inputs, and in cases where it does, I can't think what valid URLs we'd > have been > suggesting, that could be ranked lower than before. > > The change does fix a DCHECK, because it is a correctness fix, so I'm not > content to just sit indefinitely on it. Personally I am OK with just > landing > this. > > https://codereview.chromium.org/226283009/ > To unsubscribe from this group and stop receiving emails from it, send an email to chromium-reviews+unsubscribe@chromium.org.
On 2014/04/15 22:00:10, Mark P wrote: > I don't think this is good as-is. I started going through the places where > AutocompleteInput::QUERY is mentioned in the code and before I even made it > through 20% of them, I found lines 372-385 in > chrome/browser/autocomplete/search_provider.cc I think you're quoting the wrong line numbers? Neither search_provider.cc nor base_search_provider.cc do things relating to "what to send to the suggest server" on lines 372-385. Maybe you meant 568-581? > This change will make the suggest server get asked to provide suggestions > for URLs that it didn't used to. For instance, if I cut > http://www.mycompany.com/?project=Secret+Project+to+Launch+Balloons+for+Inter... > and paste it into the URL box incorrectly, getting > http://www.mycompany.com/?project=Secret+Project+to+Launch+Balloons+for+Inter..., > suddenly this will be an invalid URL that will now get sent to the > suggest server. Are you sure this will actually change? In your example, I would expect input_.parts().query.is_nonempty() to be true even though the input type is QUERY, which would mean this would not be sent. Generally SegmentURL() can break the input into parts even when it is not canonicalizable. > After writing that paragraph, I kept looking for a bit. I found another > problematic case: chrome/browser/autocomplete/history_url_provider.cc line > 835, where we depending on an input like "c/# foo" to get classified as a > URL. Why would that not continue to be classified as a URL? I wouldn't expect my change to affect this. I would simply test both cases above to ensure my change doesn't affect them, but I'm home sick today and don't have VPN access to my machine.
On Tue, Apr 15, 2014 at 3:11 PM, <pkasting@chromium.org> wrote: > On 2014/04/15 22:00:10, Mark P wrote: > >> I don't think this is good as-is. I started going through the places >> where >> AutocompleteInput::QUERY is mentioned in the code and before I even made >> it >> through 20% of them, I found lines 372-385 in >> chrome/browser/autocomplete/search_provider.cc >> > > I think you're quoting the wrong line numbers? Sorry, I used internal code search because the external one seems to be broken. The internal one appears a bit out of date. > Neither search_provider.cc nor > base_search_provider.cc do things relating to "what to send to the suggest > server" on lines 372-385. Maybe you meant 568-581? Close. I meant line 553-566. > > > This change will make the suggest server get asked to provide suggestions >> for URLs that it didn't used to. For instance, if I cut >> > > http://www.mycompany.com/?project=Secret+Project+to+ > Launch+Balloons+for+Internet+Access%21 > >> and paste it into the URL box incorrectly, getting >> > > http://www.mycompany.com/?project=Secret+Project+to+ > Launch+Balloons+for+Internet+Access%252instead, > > suddenly this will be an invalid URL that will now get sent to the >> suggest server. >> > > Are you sure this will actually change? In your example, I would expect > input_.parts().query.is_nonempty() to be true even though the input type > is > QUERY, which would mean this would not be sent. Generally SegmentURL() can > break the input into parts even when it is not canonicalizable. > > > After writing that paragraph, I kept looking for a bit. I found another >> problematic case: chrome/browser/autocomplete/history_url_provider.cc >> line >> 835, where we depending on an input like "c/# foo" to get classified as a >> URL. >> > > Why would that not continue to be classified as a URL? I wouldn't expect > my > change to affect this. > According to the spec, a URL must not contain a literal space. > I would simply test both cases above to ensure my change doesn't affect > them, > but I'm home sick today and don't have VPN access to my machine. > > https://codereview.chromium.org/226283009/ > To unsubscribe from this group and stop receiving emails from it, send an email to chromium-reviews+unsubscribe@chromium.org.
On Tue, Apr 15, 2014 at 3:30 PM, Mark Pearson <mpearson@chromium.org> wrote: > Neither search_provider.cc nor >> base_search_provider.cc do things relating to "what to send to the suggest >> server" on lines 372-385. Maybe you meant 568-581? > > Close. I meant line 553-566. > In that case, I would expect input_.scheme() to be content::kHttpScheme, so this conditional should not immediately return "true" for the case you mention. Now, we _could_ affect things that have file: or data: schemes. But in those kinds of cases, I believe the canonicalizer is likely to mark almost any input as valid, except (again) for things like "file://...%00". And I can't really think of realistic privacy issues along that line. Can you? After writing that paragraph, I kept looking for a bit. I found another >>> problematic case: chrome/browser/autocomplete/history_url_provider.cc >>> line >>> 835, where we depending on an input like "c/# foo" to get classified as a >>> URL. >>> >> >> Why would that not continue to be classified as a URL? I wouldn't expect >> my >> change to affect this. >> > > According to the spec, a URL must not contain a literal space. > It won't. Fixup will escape it into %20 before the rest of Parse() takes a crack at it. Otherwise we wouldn't be able to navigate to this today -- remember, this change can only affect cases where we already can't navigate today. PK To unsubscribe from this group and stop receiving emails from it, send an email to chromium-reviews+unsubscribe@chromium.org.
So, in our IM chat about this, you said two things: (1) You want to audit use of AutocompleteMatch::Type further to see what effects this has. (2) You are thinking vaguely about whether to ask for this to be done as a field trial. (1) sounds like something you were planning to do. (2) sounded like it wasn't something you were sure you wanted to happen to begin with. (I certainly don't want to do it unless I need to.) I'd like to have a concrete plan here on what to do with this bug. This is the last outstanding change in my svn checkouts before I switch things over to git, and personally I'd rather just land this and move on, but if we're not going to do that, let's figure out what's going to happen and do it already.
Okay, I haven't yet audited the places type() is used yet, but I looked closely at Parse() and have a series of questions / concerns. I imagine I might be able to answer some of them myself if I thought enough, but it's your changelist and so I'm forcing you to do the heavy thinking: for an input such as "abc", is it possible that this doesn't canonicalize & validate? e.g., (forgive my hand-wavy language) the user doesn't have an intranet loopback configured in their network or they have no implicit domain specified in their registries? won't site: be now classified as a QUERY, where we made an explicit decision to classify it as UNKNOWN before? how about an input such as "." when the user has certain registry controlled domains or "..foop.com" or "foop."? does the classification for these change at all? What about under certain registry-controlled-domain settings? what about the whole segment to handle invalid hostnames? see lines 309-340. consider for example a hostname with an "_". some intranets resolve these. now I think we'd call them QUERY instead of UNKNOWN. what about the trailing slashes? line 372-377. we used to allow them to force URL-ness, unless we hit an earlier test. Now they're not as powerful I think. what about URLs with valid TLD that Chrome doesn't know about yet? These will get marked as QUERY. Previously they were UNKNOWN. Or the user appended a slash to force them to be a URL. --mark
On 2014/04/24 18:31:29, Mark P wrote: > for an input such as "abc", is it possible that this doesn't canonicalize & > validate? e.g., (forgive my hand-wavy language) the user doesn't have an > intranet loopback configured in their network or they have no implicit domain > specified in their registries? No. Canonicalization does not pay attention to the user's history. Whether "http://abc/" is theoretically legal (the question being asked by the code here) is different than whether "http://abc/" would actually resolve (the question you're effectively asking). > won't site: be now classified as a QUERY, where we made an explicit decision to > classify it as UNKNOWN before? No, it should be handled above the point where this change is made, by the code that handles "schemes other than HTTP or HTTPS", and thus should be unchanged. > how about an input such as "." when the user has certain registry controlled > domains or "..foop.com" or "foop."? does the classification for these change at > all? What about under certain registry-controlled-domain settings? Canonicalization also doesn't care about registry-controlled domains. "http://foop." and "http://..foop.com/" will be canonicalizable, or not, regardless of what domains exist. Again, what registry-controlled domains affect here is whether a navigation to these URLs would succeed, not whether the URLs are syntactically valid. This is really the same question as your first. > what about the whole segment to handle invalid hostnames? see lines 309-340. > consider for example a hostname with an "_". some intranets resolve these. now > I think we'd call them QUERY instead of UNKNOWN. As the comment on that section says, the reason for the section's existence there is because we try to be stricter than what GURL requires. That means that we're handling things there that the canonicalizer will mark as legal. So this patch won't affect them. > what about the trailing slashes? line 372-377. we used to allow them to force > URL-ness, unless we hit an earlier test. Now they're not as powerful I think. This only affects cases where the user's input isn't navigable anyway. For example, abc%00/ Because this isn't a navigable URL, it's reasonable to not mark it as one. > what about URLs with valid TLD that Chrome doesn't know about yet? These will > get marked as QUERY. Previously they were UNKNOWN. Or the user appended a > slash to force them to be a URL. No, these should not get marked as QUERY, for the same reason as the two questions above. Canonicalization pays no attention to the registry-controlled domain list. "http://foo.bar.garbage/" is a legal URL, and will be marked as such, regardless of whether ".garbage" is a real TLD. In short: None of the points you raise appear to be problematic.
On Thu, Apr 24, 2014 at 1:43 PM, <pkasting@chromium.org> wrote: > On 2014/04/24 18:31:29, Mark P wrote: > >> for an input such as "abc", is it possible that this doesn't canonicalize >> & >> validate? e.g., (forgive my hand-wavy language) the user doesn't have an >> intranet loopback configured in their network or they have no implicit >> domain >> specified in their registries? >> > > No. Canonicalization does not pay attention to the user's history. > Whether > "http://abc/" is theoretically legal (the question being asked by the > code here) > is different than whether "http://abc/" would actually resolve (the > question > you're effectively asking). Whether canonicalization pays attention to the user's history is not my question. My question was whether canonicalization and validity can depend on the user's network and/or registry configs. [addendum: I think you answered this question below. I guess there's no need to comment further.] > > > won't site: be now classified as a QUERY, where we made an explicit >> decision >> > to > >> classify it as UNKNOWN before? >> > > No, it should be handled above the point where this change is made, by the > code > that handles "schemes other than HTTP or HTTPS", and thus should be > unchanged. I don't understand. This will not be "handled above the point where this change is made" because this change was made at line 161, earlier in the file than the code that handles "schemes other than HTTP or HTTPS" (starts at line 177). Or are you perhaps implying that all blah:XYZ will be reported as is_valid() for any value of blah and any value of XYZ (including funky characters, unescaped items, etc.)? > > how about an input such as "." when the user has certain registry >> controlled >> domains or "..foop.com" or "foop."? does the classification for these >> change >> > at > >> all? What about under certain registry-controlled-domain settings? >> > > Canonicalization also doesn't care about registry-controlled domains. > "http://foop." and "http://..foop.com/" will be canonicalizable, or not, > regardless of what domains exist. Again, what registry-controlled domains > affect here is whether a navigation to these URLs would succeed, not > whether the > URLs are syntactically valid. This is really the same question as your > first. > > > what about the whole segment to handle invalid hostnames? see lines >> 309-340. >> consider for example a hostname with an "_". some intranets resolve >> these. >> > now > >> I think we'd call them QUERY instead of UNKNOWN. >> > > As the comment on that section says, the reason for the section's existence > there is because we try to be stricter than what GURL requires. That > means that > we're handling things there that the canonicalizer will mark as legal. So > this > patch won't affect them. Perhaps I'm misled by the comment. Regardless, the URL spec clearly says that hostnames cannot contain underscores. That makes me think is_valid() is (or ought to) return false for them. This means you'll return QUERY early and not reach this code. If you reached this code, you might decide to return UNKNOWN. > > > what about the trailing slashes? line 372-377. we used to allow them to >> > force > >> URL-ness, unless we hit an earlier test. Now they're not as powerful I >> think. >> > > This only affects cases where the user's input isn't navigable anyway. For > example, > > abc%00/ > > Because this isn't a navigable URL, it's reasonable to not mark it as one. I would bet some webservers would reply fine to requests for GET "/1<2". e.g., the request foo/1<2 that routes to webserver foo. Aren't these not actually valid (< should be escaped)? And hence you'd mark them as QUERY. Oreviously we could get them to appear first (be interpreted as a URL) by typing foo/1<2/ (trailing slash). Now I don't think we can. Maybe I simply should patch you change to verify these things... --mark > > what about URLs with valid TLD that Chrome doesn't know about yet? These >> will >> get marked as QUERY. Previously they were UNKNOWN. Or the user appended >> a >> slash to force them to be a URL. >> > > No, these should not get marked as QUERY, for the same reason as the two > questions above. Canonicalization pays no attention to the > registry-controlled > domain list. "http://foo.bar.garbage/" is a legal URL, and will be > marked as > such, regardless of whether ".garbage" is a real TLD. > > In short: None of the points you raise appear to be problematic. > > https://codereview.chromium.org/226283009/ > To unsubscribe from this group and stop receiving emails from it, send an email to chromium-reviews+unsubscribe@chromium.org.
On 2014/04/24 22:42:00, Mark P wrote: > > won't site: be now classified as a QUERY, where we made an explicit > >> decision > >> > > to > > > >> classify it as UNKNOWN before? > >> > > > > No, it should be handled above the point where this change is made, by the > > code > > that handles "schemes other than HTTP or HTTPS", and thus should be > > unchanged. > > > I don't understand. This will not be "handled above the point where this > change is made" because this change was made at line 161, earlier in the > file than the code that handles "schemes other than HTTP or HTTPS" (starts > at line 177). Oops. Sorry, misread my own patch. You're right about the ordering. This should still be OK. For schemes that GURL doesn't know about, it will be very permissive; it assumes just about anything is legal. However, let's assume for the sake of argument we have some input, xyz:<blah>, which GURL says is not valid. And let's assume the previous code treated this as UNKNOWN. The only reason for doing this would be to allow the user to attempt to "handle" this via e.g. ShellExecute or something. The problem is, even today, the user can't do that. If we mark the destination_url as not valid, we won't produce a match which allows the user to attempt to open this "URL" to begin with. So even if such a case exists, this shouldn't make much difference to it. > > what about the whole segment to handle invalid hostnames? see lines > >> 309-340. > >> consider for example a hostname with an "_". some intranets resolve > >> these. > >> > > now > > > >> I think we'd call them QUERY instead of UNKNOWN. > >> > > > > As the comment on that section says, the reason for the section's existence > > there is because we try to be stricter than what GURL requires. That > > means that > > we're handling things there that the canonicalizer will mark as legal. So > > this > > patch won't affect them. > > > Perhaps I'm misled by the comment. Regardless, the URL spec clearly says > that hostnames cannot contain underscores. That makes me think is_valid() > is (or ought to) return false for them. This means you'll return QUERY > early and not reach this code. If you reached this code, you might decide > to return UNKNOWN. GURL canonicalization has very little to do with URL specs -- even less than Chrome's overall behavior does. The URL spec banning underscores has been superseded, I believe twice, by successively more permissive specs -- and the real world is even more permissive than that. Chrome attempts to match real-world behavior. It does so here by being limited in how much more restrictive it aims to be about handling URLs that GURL considers valid. This code is NOT about being _less_ restrictive than GURL. GURL won't mark URLs with underscores in hostnames invalid. And if it did, then marking them as UNKNOWN here wouldn't matter: again, we'd never create a navigational match to such URLs, so the user could never choose to open them, regardless of what their local intranet might do. I feel like a broken record, but it's important to keep in mind that there is never a case today where canonicalization fails and yet the user can navigate to their input. It cannot possibly happen. This change doesn't affect that at all. As I said though, in this case, even that doesn't matter. Canonicalization accepts as valid various things that we then additionally restrict if IsCanonicalizedHostCompliant() fails. So the code here isn't widening the permissiveness, it's narrowing it. > I would bet some webservers would reply fine to requests for GET "/1<2". > e.g., the request foo/1<2 that routes to webserver foo. Aren't these not > actually valid (< should be escaped)? If "<" needs escaping, that should be done as part of fixup/canonicalization. We shouldn't reject any inputs that are valid if escaped because we failed to escape them. > Maybe I simply should patch you change to verify these things... It might be helpful.
On Thu, Apr 24, 2014 at 4:04 PM, <pkasting@chromium.org> wrote: > On 2014/04/24 22:42:00, Mark P wrote: > >> > won't site: be now classified as a QUERY, where we made an explicit >> >> decision >> >> >> > to >> > >> >> classify it as UNKNOWN before? >> >> >> > >> > No, it should be handled above the point where this change is made, by >> the >> > code >> > that handles "schemes other than HTTP or HTTPS", and thus should be >> > unchanged. >> > > > I don't understand. This will not be "handled above the point where this >> change is made" because this change was made at line 161, earlier in the >> file than the code that handles "schemes other than HTTP or HTTPS" (starts >> at line 177). >> > > Oops. Sorry, misread my own patch. You're right about the ordering. > > This should still be OK. For schemes that GURL doesn't know about, it > will be > very permissive; it assumes just about anything is legal. > > However, let's assume for the sake of argument we have some input, > xyz:<blah>, > which GURL says is not valid. And let's assume the previous code treated > this > as UNKNOWN. The only reason for doing this would be to allow the user to > attempt to "handle" this via e.g. ShellExecute or something. > > The problem is, even today, the user can't do that. If we mark the > destination_url as not valid, we won't produce a match which allows the > user to > attempt to open this "URL" to begin with. > Ah, this is the key. I know you told me this umpteen times but it finally sunk in, and I read the code that decides whether to create a UWYT match to verify it. I see what you mean. My mind is now much more at rest. Now I'll finish auditing the calls to type() to see if I foresee any other issues. --mark > > So even if such a case exists, this shouldn't make much difference to it. > > > > what about the whole segment to handle invalid hostnames? see lines >> >> 309-340. >> >> consider for example a hostname with an "_". some intranets resolve >> >> these. >> >> >> > now >> > >> >> I think we'd call them QUERY instead of UNKNOWN. >> >> >> > >> > As the comment on that section says, the reason for the section's >> existence >> > there is because we try to be stricter than what GURL requires. That >> > means that >> > we're handling things there that the canonicalizer will mark as legal. >> So >> > this >> > patch won't affect them. >> > > > Perhaps I'm misled by the comment. Regardless, the URL spec clearly says >> that hostnames cannot contain underscores. That makes me think is_valid() >> is (or ought to) return false for them. This means you'll return QUERY >> early and not reach this code. If you reached this code, you might decide >> to return UNKNOWN. >> > > GURL canonicalization has very little to do with URL specs -- even less > than > Chrome's overall behavior does. The URL spec banning underscores has been > superseded, I believe twice, by successively more permissive specs -- and > the > real world is even more permissive than that. Chrome attempts to match > real-world behavior. It does so here by being limited in how much more > restrictive it aims to be about handling URLs that GURL considers valid. > This > code is NOT about being _less_ restrictive than GURL. GURL won't mark > URLs with > underscores in hostnames invalid. And if it did, then marking them as > UNKNOWN > here wouldn't matter: again, we'd never create a navigational match to such > URLs, so the user could never choose to open them, regardless of what their > local intranet might do. > > I feel like a broken record, but it's important to keep in mind that there > is > never a case today where canonicalization fails and yet the user can > navigate to > their input. It cannot possibly happen. This change doesn't affect that > at > all. > > As I said though, in this case, even that doesn't matter. Canonicalization > accepts as valid various things that we then additionally restrict if > IsCanonicalizedHostCompliant() fails. So the code here isn't widening the > permissiveness, it's narrowing it. > > > I would bet some webservers would reply fine to requests for GET "/1<2". >> e.g., the request foo/1<2 that routes to webserver foo. Aren't these >> not >> actually valid (< should be escaped)? >> > > If "<" needs escaping, that should be done as part of > fixup/canonicalization. > We shouldn't reject any inputs that are valid if escaped because we failed > to > escape them. > > > Maybe I simply should patch you change to verify these things... >> > > It might be helpful. > > https://codereview.chromium.org/226283009/ > To unsubscribe from this group and stop receiving emails from it, send an email to chromium-reviews+unsubscribe@chromium.org.
Okay, I audited all the code. One thing in particular I looked for was whether the relative order of the matches that will get created should stay the same if we hit a test like this. e.g., all the various logic for score in SearchProvider depends on the type. I wanted to make sure the relative orders still made plausible sense. I'm satisfied. I'm comfortable this change, barring one question below. --mark https://codereview.chromium.org/226283009/diff/40001/chrome/browser/autocompl... File chrome/browser/autocomplete/history_quick_provider.cc (left): https://codereview.chromium.org/226283009/diff/40001/chrome/browser/autocompl... chrome/browser/autocomplete/history_quick_provider.cc:129: (autocomplete_input_.type() != AutocompleteInput::FORCED_QUERY) && Why did you remove this?
lgtm https://codereview.chromium.org/226283009/diff/40001/chrome/browser/autocompl... File chrome/browser/autocomplete/history_quick_provider.cc (left): https://codereview.chromium.org/226283009/diff/40001/chrome/browser/autocompl... chrome/browser/autocomplete/history_quick_provider.cc:129: (autocomplete_input_.type() != AutocompleteInput::FORCED_QUERY) && On 2014/04/25 02:59:20, Mark P wrote: > Why did you remove this? Nevermind, I see that we wouldn't reach here in FORCED_QUERY mode.
Message was sent while issue was closed.
Committed patchset #3 manually as r266217 (presubmit successful). |