|
|
DescriptionOmnibox - Do Not Allow HTTP/HTTPS Equivalence if User Explicitly Entered A Scheme
The dup elimination code is forgiving about schemes.
In particular, dup-elimination considers http and
https equivalent.
If the user explicitly enters a scheme, this is
not correct. In these cases, we should treat
these schemes as distinct.
(Actually, we can probably treat these schemes
as equivalent if the user explicitly enters a
non-http, non-https scheme. However, I'm not
sure this nuance is worth the extra code, so
I coded the simple solution.)
BUG=430722
Committed: https://crrev.com/4923cab672e256801d4b9efb904fef4c5fbf6f15
Cr-Commit-Position: refs/heads/master@{#336885}
Patch Set 1 #Patch Set 2 : new approach #
Total comments: 4
Patch Set 3 : no equivalence across schemes if any scheme is present #
Total comments: 14
Patch Set 4 : peter's comments, slight refactoringpeter's comments, slight refactoring #
Total comments: 38
Patch Set 5 : refactored, as discussed #
Total comments: 39
Patch Set 6 : revise logic #
Total comments: 14
Patch Set 7 : utf16->utf8, plus other minor comments #
Total comments: 7
Patch Set 8 : ASCII -> UTF8 #Patch Set 9 : forgot string implementation #Patch Set 10 : add tests; also rebase #Patch Set 11 : rebase (not yet verified the rebase was done correctly) #Patch Set 12 : use StartsWithASCII #
Total comments: 2
Patch Set 13 : new versions of ASCII comparisons #Patch Set 14 : finish rebase correctly #Patch Set 15 : rebase #Patch Set 16 : fix rebase error #Patch Set 17 : add log lines for debugging #Patch Set 18 : revise test cases to be correct #
Total comments: 2
Patch Set 19 : .cc -> .h #Patch Set 20 : add IDN support #Patch Set 21 : cleanup omnibox_edit_model #Patch Set 22 : possible fix for mac code (cannot compile mac code myself) #
Total comments: 22
Patch Set 23 : final? API changes #
Total comments: 6
Patch Set 24 : peter's comments #Patch Set 25 : rebase #Patch Set 26 : re-add string_split.h #Patch Set 27 : fix browsertests compile error (sigh) #Depends on Patchset: Messages
Total messages: 78 (13 generated)
mpearson@chromium.org changed reviewers: + pkasting@chromium.org
Peter, Though it touches many files, this is surprisingly simple change. Take a look at autocomplete_match.{cc,h} first. --mark
https://codereview.chromium.org/1098843004/diff/20001/components/omnibox/auto... File components/omnibox/autocomplete_match.h (right): https://codereview.chromium.org/1098843004/diff/20001/components/omnibox/auto... components/omnibox/autocomplete_match.h:177: // Returns |url| altered by stripping off "www.", converting https protocol It strikes me that there's probably a similar issue with "www."-stripping: if the user only typed foo.com, then probably foo.com and www.foo.com should be duped. But if the user typed www.foo.com, then "foo.com" is not the same thing and shouldn't be duped. I doubt this is as critical of a distinction, but it seems like if we're fixing one issue, we should fix the other as well. Technically, this probably applies to query params too -- if the user explicitly adds "&foo=1", he won't want it stripped -- but in practice this seems rare. https://codereview.chromium.org/1098843004/diff/20001/components/omnibox/auto... components/omnibox/autocomplete_match.h:188: // different schemes from having the same stripped GURL. Hmm, this doesn't seem quite right. Consider user input of some version of "foo.com" and matches for various origins from foo.com and bar.com. Whether the user explicitly types a scheme should affect deduping of foo.com matches, but I don't think it should affect deduping of bar.com ones. That is: "foo.com" -> user doesn't care about http://foo.com vs. https://foo.com "http[s]://foo.com" -> user does care about the difference; ...but in both cases, user doesn't care about http://bar.com vs. https://bar.com Thinking about this more, in conjunction with my comment above about "www.", I realize that this issue is this: Given a particular user input, we might normally mark as dupes matches where one match is inlineable and one is not. Instead we should be unwilling to dupe two matches if their inlineability differs. In the case above, when typing "foo.com", all variants of foo.com (http vs. https, www. vs. no-www.) are inlineable. So it's safe to dupe one against another. Likewise, all variants of bar.com are non-inlineable, and again, it's safe to dupe. But if the user types "http://foo.com", then the two "http" forms of "foo.com" are still inlineable, but the "https" forms aren't. We should allow duping within each pair, but not across pairs. And if the user types "https://www.foo.com", then none of the other forms can be inlined against that, so none should be duped against it. Meanwhile, none of these inputs changes the (non-)inlineability of the bar.com matches, so those should all still be dupeable. So, basically, I think the answer is to pass in sufficient info about the user's typing to understand whether some of the things we'd normally strip were explicitly present in the original typing _and the hostnames are the same_, and if so, don't strip those things.
https://codereview.chromium.org/1098843004/diff/20001/components/omnibox/auto... File components/omnibox/autocomplete_match.h (right): https://codereview.chromium.org/1098843004/diff/20001/components/omnibox/auto... components/omnibox/autocomplete_match.h:177: // Returns |url| altered by stripping off "www.", converting https protocol On 2015/04/25 00:03:12, Peter Kasting wrote: > It strikes me that there's probably a similar issue with "www."-stripping: if > the user only typed http://foo.com, then probably http://foo.com and http://www.foo.com should be > duped. But if the user typed http://www.foo.com, then "foo.com" is not the same thing > and shouldn't be duped. > > I doubt this is as critical of a distinction, but it seems like if we're fixing > one issue, we should fix the other as well. I admit this is a distinction. However, I know from watching user studies that people often add www. unnecessarily, and I don't want this mistake to cause problems. In particular, I think the number of people who add www. when they shouldn't is much larger than the number of people who care that adding www. means that don't want the non-www. version simultaneously/dupped together. > Technically, this probably applies to query params too -- if the user explicitly > adds "&foo=1", he won't want it stripped -- but in practice this seems rare. I agree. https://codereview.chromium.org/1098843004/diff/20001/components/omnibox/auto... components/omnibox/autocomplete_match.h:188: // different schemes from having the same stripped GURL. On 2015/04/25 00:03:12, Peter Kasting wrote: > Hmm, this doesn't seem quite right. > > Consider user input of some version of "foo.com" and matches for various origins > from http://foo.com and http://bar.com. > > Whether the user explicitly types a scheme should affect deduping of http://foo.com > matches, but I don't think it should affect deduping of http://bar.com ones. That is: > > "foo.com" -> user doesn't care about http://foo.com vs. https://foo.com > "http[s]://foo.com" -> user does care about the difference; > ...but in both cases, user doesn't care about http://bar.com vs. https://bar.com > > Thinking about this more, in conjunction with my comment above about "www.", I > realize that this issue is this: > > Given a particular user input, we might normally mark as dupes matches where one > match is inlineable and one is not. Instead we should be unwilling to dupe two > matches if their inlineability differs. > > In the case above, when typing "foo.com", all variants of http://foo.com (http vs. > https, www. vs. no-www.) are inlineable. So it's safe to dupe one against > another. Likewise, all variants of http://bar.com are non-inlineable, and again, it's > safe to dupe. > > But if the user types "http://foo.com", then the two "http" forms of "foo.com" > are still inlineable, but the "https" forms aren't. We should allow duping > within each pair, but not across pairs. And if the user types > "https://www.foo.com", then none of the other forms can be inlined against that, > so none should be duped against it. Meanwhile, none of these inputs changes the > (non-)inlineability of the http://bar.com matches, so those should all still be > dupeable. > > So, basically, I think the answer is to pass in sufficient info about the user's > typing to understand whether some of the things we'd normally strip were > explicitly present in the original typing _and the hostnames are the same_, and > if so, don't strip those things. Astute analysis. Here's a follow-up question: if a user types "https foo", do you want to prevent http/https matches getting dupped together?
On 2015/04/27 22:26:05, Mark P wrote: > https://codereview.chromium.org/1098843004/diff/20001/components/omnibox/auto... > File components/omnibox/autocomplete_match.h (right): > > https://codereview.chromium.org/1098843004/diff/20001/components/omnibox/auto... > components/omnibox/autocomplete_match.h:177: // Returns |url| altered by > stripping off "www.", converting https protocol > On 2015/04/25 00:03:12, Peter Kasting wrote: > > It strikes me that there's probably a similar issue with "www."-stripping: if > > the user only typed http://foo.com, then probably http://foo.com and > http://www.foo.com should be > > duped. But if the user typed http://www.foo.com, then "foo.com" is not the > same thing > > and shouldn't be duped. > > > > I doubt this is as critical of a distinction, but it seems like if we're > fixing > > one issue, we should fix the other as well. > > I admit this is a distinction. However, I know from watching user studies that > people often add www. unnecessarily, and I don't want this mistake to cause > problems. In particular, I think the number of people who add www. when they > shouldn't is much larger than the number of people who care that adding www. > means that don't want the non-www. version simultaneously/dupped together. We need to distinguish the types of problems we're addressing. There's one kind of problem where erroneously adding "www." causes the navigation to fail, or in some other way breaks things. Then there's a different problem where both with- and without-prefix versions of the hostname mean the same thing, but if they aren't duped, the user gets a dropdown with two entries in it. For the first issue, the problem itself seems rare (which is part of the justification for why we're so willing to dupe these today), and the HUP already won't return a match for the without-www hostname if you type the with-www version. So it seems unlikely this is going to cause many problems. For the second issue, if you unnecessarily add "www.", then you're liable to do it every time, and in particular, you're liable to do it the first time; and if you do it the first time (or at least most times), then the "www." result will be the default (and only) one regardless of what is typed in in the future. This only poses ranking problems for people who want the two variants merged but type the without-"www." version the first time and the majority of future times. And in that case the damage is limited to having a should-have-been-duped entry in the dropdown. All that said, perhaps there is no real meaningful gain here, so even these minor costs are enough to scuttle this facet of the idea. The only case we're really preventing is the one where you type "www.foo.com" and the browser navigates to "foo.com" anyway. When is preventing that important? Maybe if the two do different things it is, but in that case, you also want the other direction (foo.com won't navigate to www.foo.com) and this still won't provide that. > Here's a follow-up question: if a user types "https foo", do you want to prevent > http/https matches getting dupped together? I don't think so? It seems like we should only be paying attention to input text that looks like a URL. Or maybe the token should be "https://" instead of "https" but otherwise we _should_ pay attention to this case. I'm not totally sure.
On 2015/04/27 23:53:47, Peter Kasting wrote: > On 2015/04/27 22:26:05, Mark P wrote: > > > https://codereview.chromium.org/1098843004/diff/20001/components/omnibox/auto... > > File components/omnibox/autocomplete_match.h (right): > > > > > https://codereview.chromium.org/1098843004/diff/20001/components/omnibox/auto... > > components/omnibox/autocomplete_match.h:177: // Returns |url| altered by > > stripping off "www.", converting https protocol > > On 2015/04/25 00:03:12, Peter Kasting wrote: > > > It strikes me that there's probably a similar issue with "www."-stripping: > if > > > the user only typed http://foo.com, then probably http://foo.com and > > http://www.foo.com should be > > > duped. But if the user typed http://www.foo.com, then "foo.com" is not the > > same thing > > > and shouldn't be duped. > > > > > > I doubt this is as critical of a distinction, but it seems like if we're > > fixing > > > one issue, we should fix the other as well. > > > > I admit this is a distinction. However, I know from watching user studies > that > > people often add www. unnecessarily, and I don't want this mistake to cause > > problems. In particular, I think the number of people who add www. when they > > shouldn't is much larger than the number of people who care that adding www. > > means that don't want the non-www. version simultaneously/dupped together. > > We need to distinguish the types of problems we're addressing. There's one kind > of problem where erroneously adding "www." causes the navigation to fail, or in > some other way breaks things. Then there's a different problem where both with- > and without-prefix versions of the hostname mean the same thing, but if they > aren't duped, the user gets a dropdown with two entries in it. > > For the first issue, the problem itself seems rare (which is part of the > justification for why we're so willing to dupe these today), and the HUP already > won't return a match for the without-www hostname if you type the with-www > version. So it seems unlikely this is going to cause many problems. > > For the second issue, if you unnecessarily add "www.", then you're liable to do > it every time, and in particular, you're liable to do it the first time; and if > you do it the first time (or at least most times), then the "www." result will > be the default (and only) one regardless of what is typed in in the future. > > This only poses ranking problems for people who want the two variants merged but > type the without-"www." version the first time and the majority of future times. > And in that case the damage is limited to having a should-have-been-duped entry > in the dropdown. > > All that said, perhaps there is no real meaningful gain here, so even these > minor costs are enough to scuttle this facet of the idea. The only case we're > really preventing is the one where you type "www.foo.com" and the browser > navigates to "foo.com" anyway. When is preventing that important? Maybe if the > two do different things it is, but in that case, you also want the other > direction (http://foo.com won't navigate to http://www.foo.com) and this still won't provide > that. I agree with your reasoning here. I'm inclined not to do anything. > > Here's a follow-up question: if a user types "https foo", do you want to > prevent > > http/https matches getting dupped together? > > I don't think so? It seems like we should only be paying attention to input > text that looks like a URL. Or maybe the token should be "https://" instead of > "https" but otherwise we _should_ pay attention to this case. I'm not totally > sure. I think I have a solution. Roughly, I'm going to implement the following: See if any term in the input is a (case insensitive) prefix match for the suggestion URL (including scheme, i.e., no droppable prefixes here). If so, then don't change the https to http in stripped_destination_url for this suggestion. Keep these two types of URLs distinct for the purposes of dup detection. I think I'm going to be slightly for restrictive for single-term inputs. For these, the user must include at least the beginning of a hostname. This is to prevent, for example, the input "h" from causing all http / https suggestions from not being treated as dups. (I think the same behavior is fine for multi-word inputs because the other words will narrow down the suggestion list enough that having a single term h or ht causing http/https not to be treated as dups isn't likely to cause any strange behavior.) --mark
On 2015/05/25 20:49:03, Mark P wrote: > I think I'm going to be slightly for restrictive for single-term inputs. For > these, the user must include at least the beginning of a hostname. > This is to prevent, for example, the input "h" from causing all http / https > suggestions from not being treated as dups. > > (I think the same behavior is fine for multi-word inputs because the other > words will narrow down the suggestion list enough that having a single > term h or ht causing http/https not to be treated as dups isn't likely to > cause any strange behavior.) What benefit do we gain from allowing "h" to make http/https non-duped in the multi-word input case? Why wouldn't we just apply the single-term fix you describe in all cases?
On 2015/05/26 20:38:28, Peter Kasting wrote: > On 2015/05/25 20:49:03, Mark P wrote: > > I think I'm going to be slightly for restrictive for single-term inputs. For > > these, the user must include at least the beginning of a hostname. > > This is to prevent, for example, the input "h" from causing all http / https > > suggestions from not being treated as dups. > > > > (I think the same behavior is fine for multi-word inputs because the other > > words will narrow down the suggestion list enough that having a single > > term h or ht causing http/https not to be treated as dups isn't likely to > > cause any strange behavior.) > > What benefit do we gain from allowing "h" to make http/https non-duped in the > multi-word input case? Why wouldn't we just apply the single-term fix you > describe in all cases? I think if we're bothering to muck with this, I'd like the behavior that the omnibox input "https google" makes sure that "https://www.google.com/" and "http://www.google.com" (assuming for some reason we get that a suggestion too) are not grouped/dup-eliminated together. I suppose I could require that the user types out http or https to prevent this dup elimination. But if it looks like the user is typing something scheme-related (e.g., "htt"), then perhaps we should prevent this dupping earlier (for consistent evolution of the omnibox as the user types). That's why I proposed the "have any term match at the beginning of the URL" solution. --mark
On 2015/05/27 17:14:45, Mark P wrote: > On 2015/05/26 20:38:28, Peter Kasting wrote: > > On 2015/05/25 20:49:03, Mark P wrote: > > > I think I'm going to be slightly for restrictive for single-term inputs. For > > > these, the user must include at least the beginning of a hostname. > > > This is to prevent, for example, the input "h" from causing all http / https > > > suggestions from not being treated as dups. > > > > > > (I think the same behavior is fine for multi-word inputs because the other > > > words will narrow down the suggestion list enough that having a single > > > term h or ht causing http/https not to be treated as dups isn't likely to > > > cause any strange behavior.) > > > > What benefit do we gain from allowing "h" to make http/https non-duped in the > > multi-word input case? Why wouldn't we just apply the single-term fix you > > describe in all cases? > > I think if we're bothering to muck with this, I'd like the behavior that the > omnibox > input "https google" makes sure that "https://www.google.com/" and > "http://www.google.com" (assuming for some reason we get that a suggestion > too) are not grouped/dup-eliminated together. Hmm. On first glance, I don't agree with that. It's not clear that the "https" in "https google" is intended to mean "load the site over https" as opposed to, say, "match against a site with 'https' in the title or URL somewhere". What if the user typed "google https"? What if he typed "bad https" and had badssl.com (but pretend its URL was "badhttps.com") in his history? I think "https://g" is a pretty clear signal. "https" occurring as a whitespace-delimited token somewhere in the input is not. Am I missing cases?
On 2015/05/27 18:07:42, Peter Kasting wrote: > On 2015/05/27 17:14:45, Mark P wrote: > > On 2015/05/26 20:38:28, Peter Kasting wrote: > > > On 2015/05/25 20:49:03, Mark P wrote: > > > > I think I'm going to be slightly for restrictive for single-term inputs. > For > > > > these, the user must include at least the beginning of a hostname. > > > > This is to prevent, for example, the input "h" from causing all http / > https > > > > suggestions from not being treated as dups. > > > > > > > > (I think the same behavior is fine for multi-word inputs because the other > > > > words will narrow down the suggestion list enough that having a single > > > > term h or ht causing http/https not to be treated as dups isn't likely to > > > > cause any strange behavior.) > > > > > > What benefit do we gain from allowing "h" to make http/https non-duped in > the > > > multi-word input case? Why wouldn't we just apply the single-term fix you > > > describe in all cases? > > > > I think if we're bothering to muck with this, I'd like the behavior that the > > omnibox > > input "https google" makes sure that "https://www.google.com/" and > > "http://www.google.com" (assuming for some reason we get that a suggestion > > too) are not grouped/dup-eliminated together. > > Hmm. On first glance, I don't agree with that. It's not clear that the "https" > in "https google" is intended to mean "load the site over https" as opposed to, > say, "match against a site with 'https' in the title or URL somewhere". What if > the user typed "google https"? What if he typed "bad https" and had http://badssl.com > (but pretend its URL was "badhttps.com") in his history? > > I think "https://g" is a pretty clear signal. "https" occurring as a > whitespace-delimited token somewhere in the input is not. > > Am I missing cases? No, you're not missing anything. I expressed my opinion but I don't feel strongly enough to argue about it. I'll implement your suggested alteration of my proposal. --mark
So, I implemented this change as we discussed: if the user enters https://blah somewhere in the input, and it matches the URL suggestion, then don't allow the http<->https equivalence when doing dup elimination. Then I realized that design no longer fixes the original (DCHECK) bug report, which complained the user entered http://www.google and was provided https://www.google.com/ The change we discussed prevents downgrading https to http but does not allow upgrading http to https. I'm not sure what the right thing to do here is. If the user enters an explicit insecure scheme, should we allow a URL with that scheme to be dup-eliminated with / upgraded to the equivalent secure scheme? i.e., should we ignore the user's explicit intent for the sake of security? --mark
On 2015/05/30 17:10:08, Mark P (out sick) wrote: > If the user enters an explicit insecure scheme, should we allow a URL with that > scheme to be dup-eliminated with / upgraded to the equivalent secure scheme? > i.e., should we ignore the user's explicit intent for the sake of security? No. If you enter http://<something>, you should get that, not the https version.
This is again ready for review. It's likely best to review it from scratch; it's been rewritten compared to earlier patchsets. --mark
https://codereview.chromium.org/1098843004/diff/40001/chrome/browser/autocomp... File chrome/browser/autocomplete/history_quick_provider.cc (right): https://codereview.chromium.org/1098843004/diff/40001/chrome/browser/autocomp... chrome/browser/autocomplete/history_quick_provider.cc:274: base::SplitString(autocomplete_input_.text(), ' ', &words); I'm concerned about all these files doing a manual split based on spaces. It seems like we should ask ICU or other i18n code to tokenize instead, shouldn't we? After all, some languages (e.g. Thai) might not use whitespace to separate words. Separately, the fact that each file splits the input, passes the words to these utility functions, and they search through the input tokens, seems both cumbersome and inefficient to me. (If the input has many tokens, this could dramatically slow down input processing.) Maybe we should instead be setting a member variable on the AutocompleteInput during initial parsing like "vector of input tokens that look like a scheme plus something". Then we could just look at those tokens, if any existed, in the implementation of this utility. https://codereview.chromium.org/1098843004/diff/40001/components/omnibox/auto... File components/omnibox/autocomplete_match.cc (right): https://codereview.chromium.org/1098843004/diff/40001/components/omnibox/auto... components/omnibox/autocomplete_match.cc:43: std::string(url::kStandardSchemeSeparator).length())); Why not just url.host().begin? https://codereview.chromium.org/1098843004/diff/40001/components/omnibox/auto... components/omnibox/autocomplete_match.cc:48: it.substr(pos + separator_as_utf16.length()), false)) { Won't this find input like "://foo"? Shouldn't we be checking for the input to actually have a scheme? https://codereview.chromium.org/1098843004/diff/40001/components/omnibox/auto... File components/omnibox/autocomplete_result.cc (right): https://codereview.chromium.org/1098843004/diff/40001/components/omnibox/auto... components/omnibox/autocomplete_result.cc:183: for (ACMatches::iterator i(matches_.begin()); i != matches_.end(); ++i) { Nit: No {}
https://codereview.chromium.org/1098843004/diff/40001/chrome/browser/autocomp... File chrome/browser/autocomplete/history_quick_provider.cc (right): https://codereview.chromium.org/1098843004/diff/40001/chrome/browser/autocomp... chrome/browser/autocomplete/history_quick_provider.cc:274: base::SplitString(autocomplete_input_.text(), ' ', &words); On 2015/06/01 21:04:08, Peter Kasting wrote: > I'm concerned about all these files doing a manual split based on spaces. It > seems like we should ask ICU or other i18n code to tokenize instead, shouldn't > we? After all, some languages (e.g. Thai) might not use whitespace to separate > words. I would think any decent break iterator would split on punctuation strings like :// Even if ICU in some locales does not do that now, I don't we should trust it to remain that way in the future. I think splitting on spaces is reasonable. In fact, I would not want it split more narrowly. Imagine you have a user that typed http://helloworld and we realized that hello and world should be broken based on the language model. Then we would be searching for URLs that match http://hello and modifying how they're dup-eliminated. That doesn't seem right. > > Separately, the fact that each file splits the input, passes the words to these > utility functions, and they search through the input tokens, seems both > cumbersome and inefficient to me. (If the input has many tokens, this could > dramatically slow down input processing.) Maybe we should instead be setting a > member variable on the AutocompleteInput during initial parsing like "vector of > input tokens that look like a scheme plus something". Then we could just look > at those tokens, if any existed, in the implementation of this utility. This current structure slightly bothers me too. I definitely want to pass in the set of tokens because sometimes computestrippedurl() is called in loops (e.g., see shortcuts provider). We can precompute this in AutocompleteInput but that seems heavyhanded to me. Looking closer, maybe I should make EnsureUWYTIsAllowedToBeDefault() do the word splitting. Then only the direct callers of computestrippedurl() will need to split on input, and that merely means AutocompleteResult and ShortcutsProvider, the two places that call this in a loop. https://codereview.chromium.org/1098843004/diff/40001/components/omnibox/auto... File components/omnibox/autocomplete_match.cc (right): https://codereview.chromium.org/1098843004/diff/40001/components/omnibox/auto... components/omnibox/autocomplete_match.cc:43: std::string(url::kStandardSchemeSeparator).length())); On 2015/06/01 21:04:08, Peter Kasting wrote: > Why not just url.host().begin? If there is a username or password, url.host() position will be wrong. I could write conditional logic to figure out the position depending on the presence of either of those two fields. This + length seemed simplier. https://codereview.chromium.org/1098843004/diff/40001/components/omnibox/auto... components/omnibox/autocomplete_match.cc:48: it.substr(pos + separator_as_utf16.length()), false)) { On 2015/06/01 21:04:08, Peter Kasting wrote: > Won't this find input like "://foo"? Shouldn't we be checking for the input to > actually have a scheme? Yes. Do you want me to add a length>0 check?
https://codereview.chromium.org/1098843004/diff/40001/chrome/browser/autocomp... File chrome/browser/autocomplete/history_quick_provider.cc (right): https://codereview.chromium.org/1098843004/diff/40001/chrome/browser/autocomp... chrome/browser/autocomplete/history_quick_provider.cc:274: base::SplitString(autocomplete_input_.text(), ' ', &words); On 2015/06/01 22:05:50, Mark P (out sick) wrote: > On 2015/06/01 21:04:08, Peter Kasting wrote: > > I'm concerned about all these files doing a manual split based on spaces. It > > seems like we should ask ICU or other i18n code to tokenize instead, shouldn't > > we? After all, some languages (e.g. Thai) might not use whitespace to > separate > > words. > > I would think any decent break iterator would split on punctuation strings like > :// > Even if ICU in some locales does not do that now, I don't we should trust it to > remain that way in the future. > I think splitting on spaces is reasonable. > In fact, I would not want it split more narrowly. Imagine you have a user that > typed > http://helloworld > and we realized that hello and world should be broken based on the language > model. Then > we would be searching for URLs that match http://hello and modifying how they're > dup-eliminated. That doesn't seem right. Fair enough. > > Separately, the fact that each file splits the input, passes the words to > these > > utility functions, and they search through the input tokens, seems both > > cumbersome and inefficient to me. (If the input has many tokens, this could > > dramatically slow down input processing.) Maybe we should instead be setting > a > > member variable on the AutocompleteInput during initial parsing like "vector > of > > input tokens that look like a scheme plus something". Then we could just look > > at those tokens, if any existed, in the implementation of this utility. > > This current structure slightly bothers me too. I definitely want to pass in > the set of tokens because sometimes computestrippedurl() is called in loops > (e.g., see shortcuts provider). We can precompute this in AutocompleteInput but > that seems heavyhanded to me. > > Looking closer, maybe I should make EnsureUWYTIsAllowedToBeDefault() do the word > splitting. Then only the direct callers of computestrippedurl() will need to > split on input, and that merely means AutocompleteResult and ShortcutsProvider, > the two places that call this in a loop. If you make this change I'll take another look at what it does to the code. It's hard for me to imagine it in my head. https://codereview.chromium.org/1098843004/diff/40001/components/omnibox/auto... File components/omnibox/autocomplete_match.cc (right): https://codereview.chromium.org/1098843004/diff/40001/components/omnibox/auto... components/omnibox/autocomplete_match.cc:43: std::string(url::kStandardSchemeSeparator).length())); On 2015/06/01 22:05:50, Mark P (out sick) wrote: > On 2015/06/01 21:04:08, Peter Kasting wrote: > > Why not just url.host().begin? > > If there is a username or password, url.host() position will be wrong. I could > write conditional logic to figure out the position depending on the presence of > either of those two fields. This + length seemed simplier. Maybe url.GetContent() then? https://codereview.chromium.org/1098843004/diff/40001/components/omnibox/auto... components/omnibox/autocomplete_match.cc:48: it.substr(pos + separator_as_utf16.length()), false)) { On 2015/06/01 22:05:50, Mark P (out sick) wrote: > On 2015/06/01 21:04:08, Peter Kasting wrote: > > Won't this find input like "://foo"? Shouldn't we be checking for the input > to > > actually have a scheme? > > Yes. Do you want me to add a length>0 check? Maybe instead we should do something like check url::IsStandard(it.data(), url::Component(0, pos)).
Peter, This version should appeal better to your sense of aesthetics. Please take a look. thanks, mark https://codereview.chromium.org/1098843004/diff/40001/chrome/browser/autocomp... File chrome/browser/autocomplete/history_quick_provider.cc (right): https://codereview.chromium.org/1098843004/diff/40001/chrome/browser/autocomp... chrome/browser/autocomplete/history_quick_provider.cc:274: base::SplitString(autocomplete_input_.text(), ' ', &words); > > Looking closer, maybe I should make EnsureUWYTIsAllowedToBeDefault() do the > word > > splitting. Then only the direct callers of computestrippedurl() will need to > > split on input, and that merely means AutocompleteResult and > ShortcutsProvider, > > the two places that call this in a loop. > > If you make this change I'll take another look at what it does to the code. > It's hard for me to imagine it in my head. Done. Let me know what you think. https://codereview.chromium.org/1098843004/diff/40001/components/omnibox/auto... File components/omnibox/autocomplete_match.cc (right): https://codereview.chromium.org/1098843004/diff/40001/components/omnibox/auto... components/omnibox/autocomplete_match.cc:43: std::string(url::kStandardSchemeSeparator).length())); On 2015/06/01 22:21:58, Peter Kasting wrote: > On 2015/06/01 22:05:50, Mark P (out sick) wrote: > > On 2015/06/01 21:04:08, Peter Kasting wrote: > > > Why not just url.host().begin? > > > > If there is a username or password, url.host() position will be wrong. I > could > > write conditional logic to figure out the position depending on the presence > of > > either of those two fields. This + length seemed simplier. > > Maybe url.GetContent() then? I think that's still ugly. It's require a shift because GetContent returns, for example, //www.google.com (i.e., it's everything after the colon, not the scheme separator) https://codereview.chromium.org/1098843004/diff/40001/components/omnibox/auto... components/omnibox/autocomplete_match.cc:48: it.substr(pos + separator_as_utf16.length()), false)) { On 2015/06/01 22:21:58, Peter Kasting wrote: > On 2015/06/01 22:05:50, Mark P (out sick) wrote: > > On 2015/06/01 21:04:08, Peter Kasting wrote: > > > Won't this find input like "://foo"? Shouldn't we be checking for the input > > to > > > actually have a scheme? > > > > Yes. Do you want me to add a length>0 check? > > Maybe instead we should do something like check url::IsStandard(it.data(), > url::Component(0, pos)). I'll prefer to add the >0 test. I don't see any reason to require that the scheme the user entered is a standard scheme. Matching the suggestion is enough (though in reality I don't think we ever suggest anything with non-standard scheme). https://codereview.chromium.org/1098843004/diff/40001/components/omnibox/auto... File components/omnibox/autocomplete_result.cc (right): https://codereview.chromium.org/1098843004/diff/40001/components/omnibox/auto... components/omnibox/autocomplete_result.cc:183: for (ACMatches::iterator i(matches_.begin()); i != matches_.end(); ++i) { On 2015/06/01 21:04:08, Peter Kasting wrote: > Nit: No {} Done.
https://codereview.chromium.org/1098843004/diff/60001/chrome/browser/autocomp... File chrome/browser/autocomplete/history_url_provider_unittest.cc (right): https://codereview.chromium.org/1098843004/diff/60001/chrome/browser/autocomp... chrome/browser/autocomplete/history_url_provider_unittest.cc:323: for (ACMatches::iterator i = matches_.begin(); i != matches_.end(); ++i) { Nit: No {} https://codereview.chromium.org/1098843004/diff/60001/chrome/browser/autocomp... File chrome/browser/autocomplete/shortcuts_provider.cc (right): https://codereview.chromium.org/1098843004/diff/60001/chrome/browser/autocomp... chrome/browser/autocomplete/shortcuts_provider.cc:148: base::SplitString(input.text(), ' ', &words); Hmmm. Is it realy necessary for the shortcuts system to be doing all this fixup? At a glance it looks to me like this is basically what the controller will be doing anyway: computing the stripped destination URL and then de-duping by destination. I don't really see why ShortcutsProvider needs to be a special snowflake in regards to pre-deleting these before returning, but the other providers don't. It seems like if it's important quality-wise that we get kMaxMatches post-deduped matches here, the same could be said for other providers, and perhaps the answer would be to expose the relevant computation/de-duping on AutocompleteResult/AutocompleteController/somewhere as a method all providers could (and should) call before returning. https://codereview.chromium.org/1098843004/diff/60001/chrome/browser/autocomp... chrome/browser/autocomplete/shortcuts_provider.cc:244: input.canonicalized_url(), words, Instead of passing |words| to this function we could just call the version of EnsureUWYTIsAllowedToBeDefault() that takes an AutocompleteInput. I believe this is the only place where you use the URL-and-vector version of that function, so this would also eliminate a method from the class API. This would be a duplication of effort with what you do above in GetMatches(), but if you eliminate that code, that issue goes away. https://codereview.chromium.org/1098843004/diff/60001/chrome/browser/autocomp... File chrome/browser/autocomplete/shortcuts_provider.h (right): https://codereview.chromium.org/1098843004/diff/60001/chrome/browser/autocomp... chrome/browser/autocomplete/shortcuts_provider.h:55: // the user's input. |input|, |fixed_up_input_text|, and |words| are used Nit: Extra space https://codereview.chromium.org/1098843004/diff/60001/chrome/browser/autocomp... chrome/browser/autocomplete/shortcuts_provider.h:56: // to decide what can be inlined. Nit: You might want to just give one sentence on how these things are used or what the overall inlineability goal is supposed to be. https://codereview.chromium.org/1098843004/diff/60001/chrome/browser/autocomp... chrome/browser/autocomplete/shortcuts_provider.h:62: const std::vector<base::string16>& words); Nit: I kinda feel like ComputeStrippedDestinationURL() should define a typedef for this if it wants other people using this type, and we should use that here and elsewhere. https://codereview.chromium.org/1098843004/diff/60001/components/omnibox/auto... File components/omnibox/autocomplete_match.cc (right): https://codereview.chromium.org/1098843004/diff/60001/components/omnibox/auto... components/omnibox/autocomplete_match.cc:32: // indicates that the user desires this URL loaded with this scheme in Nit: "This suggests that the user may want this particular scheme, and the caller should not consider another URL like this one but with a different scheme to be a duplicate." https://codereview.chromium.org/1098843004/diff/60001/components/omnibox/auto... components/omnibox/autocomplete_match.cc:47: if ((pos != base::string16::npos) && (pos > 0) && I'm still thinking about this (pos > 0) check. I think we should actually look for "http://" and "https://" instead of just "<anything>://". If I type in "ftp://mozilla.com", it seems OK -- preferable, even -- to dupe http://mozilla.com and https://mozilla.com together. This is even more true if I type garbage like "!@#://mozilla.com". https://codereview.chromium.org/1098843004/diff/60001/components/omnibox/auto... components/omnibox/autocomplete_match.cc:446: // Possibly replace https protocol with http protocol. Nit: "Replace https protocol with http, as long as the user didn't explicitly specify one of the two." (Goes along with my above suggestion) https://codereview.chromium.org/1098843004/diff/60001/components/omnibox/auto... File components/omnibox/autocomplete_match.h (right): https://codereview.chromium.org/1098843004/diff/60001/components/omnibox/auto... components/omnibox/autocomplete_match.h:188: // to be altered during stripping. (If the user indicated a desired scheme, Nit: Vague; "If the user indicated a desired scheme" -> "If one of the words in the input looks like a scheme plus the beginning of this URL's hostname"? https://codereview.chromium.org/1098843004/diff/60001/components/omnibox/auto... components/omnibox/autocomplete_match.h:208: // split based on spaces into words. Nit: "...is the original input string, tokenized into words using the space character"? You might want to include a sentence of your rationale to me about why using space here was appropriate. https://codereview.chromium.org/1098843004/diff/60001/components/omnibox/auto... components/omnibox/autocomplete_match.h:215: void EnsureUWYTIsAllowedToBeDefault( My one concern with this API is that it will lead to a lot of duplicated effort, especially with long input strings, since we'll re-split the input on every call (which could be at least once per generated match per provider that calls this). After past "Pasting 10MB of data is slow" bugs I'm keen not to add more redundant calls. I wonder if we should be caching this on the AutocompleteInput instead, or something. https://codereview.chromium.org/1098843004/diff/60001/components/omnibox/sear... File components/omnibox/search_provider.cc (right): https://codereview.chromium.org/1098843004/diff/60001/components/omnibox/sear... components/omnibox/search_provider.cc:19: #include "base/strings/string_split.h" Nit: #include not needed
Comments in response to your comments; no code changes here. --mark https://codereview.chromium.org/1098843004/diff/60001/chrome/browser/autocomp... File chrome/browser/autocomplete/shortcuts_provider.cc (right): https://codereview.chromium.org/1098843004/diff/60001/chrome/browser/autocomp... chrome/browser/autocomplete/shortcuts_provider.cc:148: base::SplitString(input.text(), ' ', &words); On 2015/06/06 01:31:23, Peter Kasting wrote: > Hmmm. > > Is it realy necessary for the shortcuts system to be doing all this fixup? No. Prior to fixing bug 260799, Shortcuts did not do any duplicate elimination, not even for URLs that are identical and just associated with different shortcut strings. I can make the duplicate elimination in ShortcutsProvider more naive by using exact URLs if you want. That'll simplify the API changes necessary because no one (besides AutocompleteResult) will have to explicitly call ComputeStrippedDestinationUrl(). I don't think it's important quality-wise to do correct full dedupping here (as long as shortcuts is smart about how it dedups query strings if I switch to the exact URL comparison--I should check this). > At a > glance it looks to me like this is basically what the controller will be doing > anyway: computing the stripped destination URL and then de-duping by > destination. I don't really see why ShortcutsProvider needs to be a special > snowflake in regards to pre-deleting these before returning, but the other > providers don't. > > It seems like if it's important quality-wise that we get kMaxMatches > post-deduped matches here, the same could be said for other providers, and > perhaps the answer would be to expose the relevant computation/de-duping on > AutocompleteResult/AutocompleteController/somewhere as a method all providers > could (and should) call before returning. https://codereview.chromium.org/1098843004/diff/60001/chrome/browser/autocomp... chrome/browser/autocomplete/shortcuts_provider.cc:244: input.canonicalized_url(), words, On 2015/06/06 01:31:23, Peter Kasting wrote: > Instead of passing |words| to this function we could just call the version of > EnsureUWYTIsAllowedToBeDefault() that takes an AutocompleteInput. I believe > this is the only place where you use the URL-and-vector version of that > function, so this would also eliminate a method from the class API. > > This would be a duplication of effort with what you do above in GetMatches(), > but if you eliminate that code, that issue goes away. I'm passing |words| here so as to above the version that takes |input|. This function is called in a loop and don't want to have to continually split |input| on |words| if I call the latter rather than the former. https://codereview.chromium.org/1098843004/diff/60001/components/omnibox/auto... File components/omnibox/autocomplete_match.cc (right): https://codereview.chromium.org/1098843004/diff/60001/components/omnibox/auto... components/omnibox/autocomplete_match.cc:47: if ((pos != base::string16::npos) && (pos > 0) && On 2015/06/06 01:31:23, Peter Kasting wrote: > I think we should actually look for "http://" and "https://" instead of just > "<anything>://". If I type in "ftp://mozilla.com", it seems OK -- preferable, > even -- to dupe http://mozilla.com and https://mozilla.com together. This is > even more true if I type garbage like "!@#://mozilla.com". Your proposal requires explicitly looking at the scheme and making decisions based on it. In other words, if the users enters "http" or "https" you want to disable dedupping those two things. There is no where to implement your code without explicitly using those URLs constants. Do you really want me to do such a narrow solution? i.e., it won't apply by default to new protocols we introduce. https://codereview.chromium.org/1098843004/diff/60001/components/omnibox/auto... File components/omnibox/autocomplete_match.h (right): https://codereview.chromium.org/1098843004/diff/60001/components/omnibox/auto... components/omnibox/autocomplete_match.h:215: void EnsureUWYTIsAllowedToBeDefault( On 2015/06/06 01:31:24, Peter Kasting wrote: > My one concern with this API is that it will lead to a lot of duplicated effort, > especially with long input strings, since we'll re-split the input on every call > (which could be at least once per generated match per provider that calls this). > After past "Pasting 10MB of data is slow" bugs I'm keen not to add more > redundant calls. > > I wonder if we should be caching this on the AutocompleteInput instead, or > something. I'm not philosophically keen on the notion of caching things in AutocompleteInput. I don't know why. I think I don't like the idea of keeping "value" and "is already computed" fields. But I suppose we can simply compute the split version when the AutocompleteInput is constructed and then use it as needed later, only one field needed. This seems reasonable. Is there a better solution? I'm happy to do this if you want and can't think of anything better. Let me know.
https://codereview.chromium.org/1098843004/diff/60001/chrome/browser/autocomp... File chrome/browser/autocomplete/shortcuts_provider.cc (right): https://codereview.chromium.org/1098843004/diff/60001/chrome/browser/autocomp... chrome/browser/autocomplete/shortcuts_provider.cc:148: base::SplitString(input.text(), ' ', &words); On 2015/06/06 17:10:11, Mark P wrote: > On 2015/06/06 01:31:23, Peter Kasting wrote: > > Hmmm. > > > > Is it realy necessary for the shortcuts system to be doing all this fixup? > > No. Prior to fixing bug 260799, Shortcuts did not do any duplicate elimination, > not even for URLs that are identical and just associated with different shortcut > strings. I can make the duplicate elimination in ShortcutsProvider more naive > by using exact URLs if you want. That'll simplify the API changes necessary > because no one (besides AutocompleteResult) will have to explicitly call > ComputeStrippedDestinationUrl(). I'm wondering if we can go even further and avoid doing any de-duping at all without a significant quality impact. https://codereview.chromium.org/1098843004/diff/60001/chrome/browser/autocomp... chrome/browser/autocomplete/shortcuts_provider.cc:244: input.canonicalized_url(), words, On 2015/06/06 17:10:11, Mark P wrote: > On 2015/06/06 01:31:23, Peter Kasting wrote: > > Instead of passing |words| to this function we could just call the version of > > EnsureUWYTIsAllowedToBeDefault() that takes an AutocompleteInput. I believe > > this is the only place where you use the URL-and-vector version of that > > function, so this would also eliminate a method from the class API. > > > > This would be a duplication of effort with what you do above in GetMatches(), > > but if you eliminate that code, that issue goes away. > > I'm passing |words| here so as to above the version that takes |input|. This > function is called in a loop and don't want to have to continually split |input| > on |words| if I call the latter rather than the former. OK, but this seems like an argument in favor of my suggestion of caching the words on the AutocompleteInput. https://codereview.chromium.org/1098843004/diff/60001/components/omnibox/auto... File components/omnibox/autocomplete_match.cc (right): https://codereview.chromium.org/1098843004/diff/60001/components/omnibox/auto... components/omnibox/autocomplete_match.cc:47: if ((pos != base::string16::npos) && (pos > 0) && On 2015/06/06 17:10:11, Mark P wrote: > On 2015/06/06 01:31:23, Peter Kasting wrote: > > I think we should actually look for "http://" and "https://" instead of just > > "<anything>://". If I type in "ftp://mozilla.com", it seems OK -- preferable, > > even -- to dupe http://mozilla.com and https://mozilla.com together. This is > > even more true if I type garbage like "!@#://mozilla.com". > > Your proposal requires explicitly looking at the scheme and making decisions > based on it. In other words, if the users enters "http" or "https" you want to > disable dedupping those two things. There is no where to implement your code > without explicitly using those URLs constants. Do you really want me to do such > a narrow solution? i.e., it won't apply by default to new protocols we > introduce. Yes. This should be in sync with the actual scheme manipulation we do in ComputeStrippedDestinationURL(). Right now we only convert HTTPS->HTTP, so those are the schemes we should look at here. If we were to add more schemes there we would want to add them here too. (We could even add a comment in one or both of these places talking about the need to stay in sync.) https://codereview.chromium.org/1098843004/diff/60001/components/omnibox/auto... File components/omnibox/autocomplete_match.h (right): https://codereview.chromium.org/1098843004/diff/60001/components/omnibox/auto... components/omnibox/autocomplete_match.h:215: void EnsureUWYTIsAllowedToBeDefault( On 2015/06/06 17:10:11, Mark P wrote: > On 2015/06/06 01:31:24, Peter Kasting wrote: > > My one concern with this API is that it will lead to a lot of duplicated > effort, > > especially with long input strings, since we'll re-split the input on every > call > > (which could be at least once per generated match per provider that calls > this). > > After past "Pasting 10MB of data is slow" bugs I'm keen not to add more > > redundant calls. > > > > I wonder if we should be caching this on the AutocompleteInput instead, or > > something. > > I'm not philosophically keen on the notion of caching things in > AutocompleteInput. > I don't know why. I think I don't like the idea of keeping "value" and "is > already computed" fields. > But I suppose we can simply compute the split version when the AutocompleteInput > is constructed and then use it as needed later, only one field needed. > This seems reasonable. Is there a better solution? By "only one field needed" do you mean "only one additional field needed compared to now"? I was thinking more about this and I think we could limit the amount of additional storage needed by keeping a vector of words that only contains those words that start with "http://" or "https://" plus one additional letter. That will mean that for most inputs the vector will be empty and thus we'll have minimal additional memory cost/copying cost; and it will let us simplify and speed up the code that looks for matches of these words against various URLs.
Additional comments, mostly wrapping up the conclusions. --mark https://codereview.chromium.org/1098843004/diff/60001/chrome/browser/autocomp... File chrome/browser/autocomplete/shortcuts_provider.cc (right): https://codereview.chromium.org/1098843004/diff/60001/chrome/browser/autocomp... chrome/browser/autocomplete/shortcuts_provider.cc:148: base::SplitString(input.text(), ' ', &words); On 2015/06/06 18:38:15, Peter Kasting wrote: > On 2015/06/06 17:10:11, Mark P wrote: > > On 2015/06/06 01:31:23, Peter Kasting wrote: > > > Hmmm. > > > > > > Is it realy necessary for the shortcuts system to be doing all this fixup? > > > > No. Prior to fixing bug 260799, Shortcuts did not do any duplicate > elimination, > > not even for URLs that are identical and just associated with different > shortcut > > strings. I can make the duplicate elimination in ShortcutsProvider more naive > > by using exact URLs if you want. That'll simplify the API changes necessary > > because no one (besides AutocompleteResult) will have to explicitly call > > ComputeStrippedDestinationUrl(). > > I'm wondering if we can go even further and avoid doing any de-duping at all > without a significant quality impact. Look at the pointed-to bug. For almost any input, avoiding de-dupping between shortcuts means ShortcutsProvider will only return one URL (three times, each associated with different shortcut text). I think this would be a quality loss. https://codereview.chromium.org/1098843004/diff/60001/chrome/browser/autocomp... chrome/browser/autocomplete/shortcuts_provider.cc:244: input.canonicalized_url(), words, On 2015/06/06 18:38:16, Peter Kasting wrote: > On 2015/06/06 17:10:11, Mark P wrote: > > On 2015/06/06 01:31:23, Peter Kasting wrote: > > > Instead of passing |words| to this function we could just call the version > of > > > EnsureUWYTIsAllowedToBeDefault() that takes an AutocompleteInput. I believe > > > this is the only place where you use the URL-and-vector version of that > > > function, so this would also eliminate a method from the class API. > > > > > > This would be a duplication of effort with what you do above in > GetMatches(), > > > but if you eliminate that code, that issue goes away. > > > > I'm passing |words| here so as to above the version that takes |input|. This > > function is called in a loop and don't want to have to continually split > |input| > > on |words| if I call the latter rather than the former. > > OK, but this seems like an argument in favor of my suggestion of caching the > words on the AutocompleteInput. Yes, it is. https://codereview.chromium.org/1098843004/diff/60001/components/omnibox/auto... File components/omnibox/autocomplete_match.h (right): https://codereview.chromium.org/1098843004/diff/60001/components/omnibox/auto... components/omnibox/autocomplete_match.h:215: void EnsureUWYTIsAllowedToBeDefault( On 2015/06/06 18:38:16, Peter Kasting wrote: > On 2015/06/06 17:10:11, Mark P wrote: > > On 2015/06/06 01:31:24, Peter Kasting wrote: > > > My one concern with this API is that it will lead to a lot of duplicated > > effort, > > > especially with long input strings, since we'll re-split the input on every > > call > > > (which could be at least once per generated match per provider that calls > > this). > > > After past "Pasting 10MB of data is slow" bugs I'm keen not to add more > > > redundant calls. > > > > > > I wonder if we should be caching this on the AutocompleteInput instead, or > > > something. > > > > I'm not philosophically keen on the notion of caching things in > > AutocompleteInput. > > I don't know why. I think I don't like the idea of keeping "value" and "is > > already computed" fields. > > But I suppose we can simply compute the split version when the > AutocompleteInput > > is constructed and then use it as needed later, only one field needed. > > This seems reasonable. Is there a better solution? > > By "only one field needed" do you mean "only one additional field needed > compared to now"? Yes. > I was thinking more about this and I think we could limit the amount of > additional storage needed by keeping a vector of words that only contains those > words that start with "http://" or "https://" plus one additional letter. That > will mean that for most inputs the vector will be empty and thus we'll have > minimal additional memory cost/copying cost; and it will let us simplify and > speed up the code that looks for matches of these words against various URLs. Yes, I can do it this way. That's definitely more efficient. If you're willing to let me do this, I'm okay with it. I thought you'd be opposed because it seems so particular / focussed, not more generally useful (as a full word vector in AutocompleteInput would be).
https://codereview.chromium.org/1098843004/diff/60001/chrome/browser/autocomp... File chrome/browser/autocomplete/shortcuts_provider.cc (right): https://codereview.chromium.org/1098843004/diff/60001/chrome/browser/autocomp... chrome/browser/autocomplete/shortcuts_provider.cc:148: base::SplitString(input.text(), ' ', &words); On 2015/06/06 20:23:06, Mark P wrote: > On 2015/06/06 18:38:15, Peter Kasting wrote: > > On 2015/06/06 17:10:11, Mark P wrote: > > > On 2015/06/06 01:31:23, Peter Kasting wrote: > > > > Hmmm. > > > > > > > > Is it realy necessary for the shortcuts system to be doing all this fixup? > > > > > > No. Prior to fixing bug 260799, Shortcuts did not do any duplicate > > elimination, > > > not even for URLs that are identical and just associated with different > > shortcut > > > strings. I can make the duplicate elimination in ShortcutsProvider more > naive > > > by using exact URLs if you want. That'll simplify the API changes necessary > > > because no one (besides AutocompleteResult) will have to explicitly call > > > ComputeStrippedDestinationUrl(). > > > > I'm wondering if we can go even further and avoid doing any de-duping at all > > without a significant quality impact. > > Look at the pointed-to bug. For almost any input, avoiding de-dupping between > shortcuts means ShortcutsProvider will only return one URL (three times, each > associated with different shortcut text). I think this would be a quality loss. > OK. Can we add comments about why it's more important for the shortcuts provider to do this than most providers? https://codereview.chromium.org/1098843004/diff/60001/components/omnibox/auto... File components/omnibox/autocomplete_match.h (right): https://codereview.chromium.org/1098843004/diff/60001/components/omnibox/auto... components/omnibox/autocomplete_match.h:215: void EnsureUWYTIsAllowedToBeDefault( On 2015/06/06 20:23:06, Mark P wrote: > On 2015/06/06 18:38:16, Peter Kasting wrote: > > I was thinking more about this and I think we could limit the amount of > > additional storage needed by keeping a vector of words that only contains > those > > words that start with "http://" or "https://" plus one additional letter. > That > > will mean that for most inputs the vector will be empty and thus we'll have > > minimal additional memory cost/copying cost; and it will let us simplify and > > speed up the code that looks for matches of these words against various URLs. > > Yes, I can do it this way. That's definitely more efficient. If you're willing > to let me do this, I'm okay with it. I thought you'd be opposed because it > seems so particular / focussed, not more generally useful (as a full word vector > in AutocompleteInput would be). We don't have any other places that need such a vector, though, so there's no motivation for a generalized solution. (I'm a big fan of "build what you need now, reefactor later if your needs change".) Let's go ahead with this route, it seems maximally efficient.
I refactored this as we discussed. I think the comments could use some work; some of them are hard to naturally explain. I could use your help on those if you think they need improvement. thanks, mark https://codereview.chromium.org/1098843004/diff/60001/chrome/browser/autocomp... File chrome/browser/autocomplete/history_url_provider_unittest.cc (right): https://codereview.chromium.org/1098843004/diff/60001/chrome/browser/autocomp... chrome/browser/autocomplete/history_url_provider_unittest.cc:323: for (ACMatches::iterator i = matches_.begin(); i != matches_.end(); ++i) { On 2015/06/06 01:31:23, Peter Kasting wrote: > Nit: No {} Done. https://codereview.chromium.org/1098843004/diff/60001/chrome/browser/autocomp... File chrome/browser/autocomplete/shortcuts_provider.cc (right): https://codereview.chromium.org/1098843004/diff/60001/chrome/browser/autocomp... chrome/browser/autocomplete/shortcuts_provider.cc:148: base::SplitString(input.text(), ' ', &words); On 2015/06/08 20:26:59, Peter Kasting wrote: > On 2015/06/06 20:23:06, Mark P wrote: > > On 2015/06/06 18:38:15, Peter Kasting wrote: > > > On 2015/06/06 17:10:11, Mark P wrote: > > > > On 2015/06/06 01:31:23, Peter Kasting wrote: > > > > > Hmmm. > > > > > > > > > > Is it realy necessary for the shortcuts system to be doing all this > fixup? > > > > > > > > No. Prior to fixing bug 260799, Shortcuts did not do any duplicate > > > elimination, > > > > not even for URLs that are identical and just associated with different > > > shortcut > > > > strings. I can make the duplicate elimination in ShortcutsProvider more > > naive > > > > by using exact URLs if you want. That'll simplify the API changes > necessary > > > > because no one (besides AutocompleteResult) will have to explicitly call > > > > ComputeStrippedDestinationUrl(). > > > > > > I'm wondering if we can go even further and avoid doing any de-duping at all > > > without a significant quality impact. > > > > Look at the pointed-to bug. For almost any input, avoiding de-dupping between > > shortcuts means ShortcutsProvider will only return one URL (three times, each > > associated with different shortcut text). I think this would be a quality > loss. > > > > OK. Can we add comments about why it's more important for the shortcuts > provider to do this than most providers? Okay, I added some comments about why it's important to do this (below in this file). https://codereview.chromium.org/1098843004/diff/60001/chrome/browser/autocomp... File chrome/browser/autocomplete/shortcuts_provider.h (right): https://codereview.chromium.org/1098843004/diff/60001/chrome/browser/autocomp... chrome/browser/autocomplete/shortcuts_provider.h:55: // the user's input. |input|, |fixed_up_input_text|, and |words| are used On 2015/06/06 01:31:23, Peter Kasting wrote: > Nit: Extra space Now obsolete. https://codereview.chromium.org/1098843004/diff/60001/chrome/browser/autocomp... chrome/browser/autocomplete/shortcuts_provider.h:56: // to decide what can be inlined. On 2015/06/06 01:31:23, Peter Kasting wrote: > Nit: You might want to just give one sentence on how these things are used or > what the overall inlineability goal is supposed to be. Now (mostly) obsolete, or at least not relevant to this changelist. https://codereview.chromium.org/1098843004/diff/60001/chrome/browser/autocomp... chrome/browser/autocomplete/shortcuts_provider.h:62: const std::vector<base::string16>& words); On 2015/06/06 01:31:23, Peter Kasting wrote: > Nit: I kinda feel like ComputeStrippedDestinationURL() should define a typedef > for this if it wants other people using this type, and we should use that here > and elsewhere. Now obsolete. https://codereview.chromium.org/1098843004/diff/60001/components/omnibox/auto... File components/omnibox/autocomplete_match.cc (right): https://codereview.chromium.org/1098843004/diff/60001/components/omnibox/auto... components/omnibox/autocomplete_match.cc:32: // indicates that the user desires this URL loaded with this scheme in On 2015/06/06 01:31:24, Peter Kasting wrote: > Nit: "This suggests that the user may want this particular scheme, and the > caller should not consider another URL like this one but with a different scheme > to be a duplicate." Done. https://codereview.chromium.org/1098843004/diff/60001/components/omnibox/auto... components/omnibox/autocomplete_match.cc:47: if ((pos != base::string16::npos) && (pos > 0) && On 2015/06/06 18:38:16, Peter Kasting wrote: > On 2015/06/06 17:10:11, Mark P wrote: > > On 2015/06/06 01:31:23, Peter Kasting wrote: > > > I think we should actually look for "http://" and "https://" instead of just > > > "<anything>://". If I type in "ftp://mozilla.com", it seems OK -- > preferable, > > > even -- to dupe http://mozilla.com and https://mozilla.com together. This > is > > > even more true if I type garbage like "!@#://mozilla.com". > > > > Your proposal requires explicitly looking at the scheme and making decisions > > based on it. In other words, if the users enters "http" or "https" you want > to > > disable dedupping those two things. There is no where to implement your code > > without explicitly using those URLs constants. Do you really want me to do > such > > a narrow solution? i.e., it won't apply by default to new protocols we > > introduce. > > Yes. This should be in sync with the actual scheme manipulation we do in > ComputeStrippedDestinationURL(). Right now we only convert HTTPS->HTTP, so > those are the schemes we should look at here. If we were to add more schemes > there we would want to add them here too. (We could even add a comment in one > or both of these places talking about the need to stay in sync.) Acknowledged. https://codereview.chromium.org/1098843004/diff/60001/components/omnibox/auto... components/omnibox/autocomplete_match.cc:446: // Possibly replace https protocol with http protocol. On 2015/06/06 01:31:23, Peter Kasting wrote: > Nit: "Replace https protocol with http, as long as the user didn't explicitly > specify one of the two." (Goes along with my above suggestion) Done. https://codereview.chromium.org/1098843004/diff/60001/components/omnibox/auto... File components/omnibox/autocomplete_match.h (right): https://codereview.chromium.org/1098843004/diff/60001/components/omnibox/auto... components/omnibox/autocomplete_match.h:188: // to be altered during stripping. (If the user indicated a desired scheme, On 2015/06/06 01:31:24, Peter Kasting wrote: > Nit: Vague; "If the user indicated a desired scheme" -> "If one of the words in > the input looks like a scheme plus the beginning of this URL's hostname"? Done. https://codereview.chromium.org/1098843004/diff/60001/components/omnibox/auto... components/omnibox/autocomplete_match.h:208: // split based on spaces into words. On 2015/06/06 01:31:24, Peter Kasting wrote: > Nit: "...is the original input string, tokenized into words using the space > character"? > > You might want to include a sentence of your rationale to me about why using > space here was appropriate. Comment is now obsolete. https://codereview.chromium.org/1098843004/diff/60001/components/omnibox/auto... components/omnibox/autocomplete_match.h:215: void EnsureUWYTIsAllowedToBeDefault( On 2015/06/08 20:26:59, Peter Kasting wrote: > On 2015/06/06 20:23:06, Mark P wrote: > > On 2015/06/06 18:38:16, Peter Kasting wrote: > > > I was thinking more about this and I think we could limit the amount of > > > additional storage needed by keeping a vector of words that only contains > > those > > > words that start with "http://" or "https://" plus one additional letter. > > That > > > will mean that for most inputs the vector will be empty and thus we'll have > > > minimal additional memory cost/copying cost; and it will let us simplify and > > > speed up the code that looks for matches of these words against various > URLs. > > > > Yes, I can do it this way. That's definitely more efficient. If you're > willing > > to let me do this, I'm okay with it. I thought you'd be opposed because it > > seems so particular / focussed, not more generally useful (as a full word > vector > > in AutocompleteInput would be). > > We don't have any other places that need such a vector, though, so there's no > motivation for a generalized solution. (I'm a big fan of "build what you need > now, reefactor later if your needs change".) > > Let's go ahead with this route, it seems maximally efficient. Done. https://codereview.chromium.org/1098843004/diff/60001/components/omnibox/sear... File components/omnibox/search_provider.cc (right): https://codereview.chromium.org/1098843004/diff/60001/components/omnibox/sear... components/omnibox/search_provider.cc:19: #include "base/strings/string_split.h" On 2015/06/06 01:31:24, Peter Kasting wrote: > Nit: #include not needed Done.
https://codereview.chromium.org/1098843004/diff/80001/chrome/browser/autocomp... File chrome/browser/autocomplete/shortcuts_provider.cc (right): https://codereview.chromium.org/1098843004/diff/80001/chrome/browser/autocomp... chrome/browser/autocomplete/shortcuts_provider.cc:161: // to mail.google.com, so typing "m" will return them all. Note that while Nit: I might add after "return them all" this sentence: If we then simply clamp to kMaxMatches and let the AutocompleteResult take care of collapsing the duplicates, we'll effectively only be returning one match, instead of several possibilities. Then I would probably put a paragraph break (blank line) before the note. https://codereview.chromium.org/1098843004/diff/80001/chrome/browser/autocomp... chrome/browser/autocomplete/shortcuts_provider.cc:164: // are only used for deletions, and shortcuts deletes matches based on the Nit: shortcuts -> this provider? https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... File components/omnibox/autocomplete_input.cc (right): https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... components/omnibox/autocomplete_input.cc:35: // at least one additional character after the prefix. Nit: Could say "...or https:// plus at least one more character and puts..." and then drop the second sentence entirely. https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... components/omnibox/autocomplete_input.cc:36: void PopulateHttpTerms(const base::string16& text, Nit: If you change the member name, change this too -- maybe PopulateTermsFollowingScheme() or FindTermsWithExplicitScheme() or something. https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... components/omnibox/autocomplete_input.cc:38: const base::string16& http_scheme = Rather than converting these to string16s, I think we should convert |text| to UTF-8 (likely on the caller side). The code in autocomplete_match.cc has UTF-8 GURLs anyway, so this will avoid the conversions there. This means the type of |http_following_terms| (here and as a data member) will change to std::vector<std::string>. https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... components/omnibox/autocomplete_input.cc:44: std::vector<base::string16> words; Nit: words -> terms (for consistency)? https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... components/omnibox/autocomplete_input.cc:45: base::SplitString(text, ' ', &words); Nit: Consider a comment about why we don't use ICU breaking functions to split into words. https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... components/omnibox/autocomplete_input.cc:46: for (auto it : words) { Nit: const auto&, to avoid a copy https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... components/omnibox/autocomplete_input.cc:47: if (StartsWith(it, http_scheme, false) && Nit: This is shorter overall and no less efficient: std::vector<std::string> terms; base::SplitString(text, ' ', &terms); const std::string separator(url::kStandardSchemeSeparator); for (const auto& term : terms) { static const char* kSchemes[2] = { url::kHttpScheme, url::kHttpsScheme }; for (const char* scheme : kSchemes) { const std::string prefix(scheme + separator); if (StartsWith(term, prefix, false) && (term.length() > prefix.length())) http_following_terms->push_back(term.substr(prefix.length())); } } https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... File components/omnibox/autocomplete_input.h (right): https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... components/omnibox/autocomplete_input.h:187: // Returns list of words in text() that start with http:// or https://, Nit: text() -> text_ "...or https:// plus at least one more character, ..." https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... components/omnibox/autocomplete_input.h:188: // stored without the prefix. Used in duplicate elimination to detect Nit: prefix -> scheme (Technically this is less accurate but it seems clearer to me after the change I suggest above.) https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... components/omnibox/autocomplete_input.h:190: // scheme (and hence we shouldn't de-dup across schemes). Nit: How about: "...to detect whether, for a given URL, the user may have started typing that URL with an explicit scheme; see comments on AutocompleteMatch::GURLToStrippedGURL()." https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... components/omnibox/autocomplete_input.h:191: const std::vector<base::string16>& http_following_terms() const { Hmm, I don't love this name... maybe terms_following_explicit_scheme or terms_prefixed_by_http_or_https? I can't think of a non-awkward name. https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... File components/omnibox/autocomplete_match.cc (right): https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... components/omnibox/autocomplete_match.cc:37: const base::string16& url_spec_without_scheme_as_utf16 = If you follow my suggestion in autocomplete_input.cc of making |http_following_terms| a std::string, this can get simpler by avoiding the encoding conversion. https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... components/omnibox/autocomplete_match.cc:41: for (auto it : http_following_terms) { Nit: const auto&, to avoid a copy https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... components/omnibox/autocomplete_match.cc:42: if (StartsWith(url_spec_without_scheme_as_utf16, it, false)) { Nit: No {} https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... components/omnibox/autocomplete_match.cc:442: stripped_destination_url.SchemeIs(url::kHttpsScheme)) { Nit: For efficiency, reverse these subexpressions; the SchemeIs() check should be cheaper than the other call. I would probably also add an explicit "!input.http_following_terms().empty() &&" before both of these, again, just for efficiency (avoids some string constructions). https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... File components/omnibox/autocomplete_match.h (right): https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... components/omnibox/autocomplete_match.h:190: // two URLs with different schemes from having the same stripped GURL.) Nit: How about: "...altered during stripping. If this URL, minus the scheme and separator, starts with any the terms in input.http_following_terms(), we avoid converting an HTTPS scheme to HTTP. This means that URLs that differ only by these schemes won't be marked as dupes, since the distinction seems to matter to the user." https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... File components/omnibox/autocomplete_result.cc (right): https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... components/omnibox/autocomplete_result.cc:11: #include "base/strings/string_split.h" #include not needed
My computer is agonizingly slow for some reason right now so I haven't ran the tests yet, but I think it's ready for re-review. --mark https://codereview.chromium.org/1098843004/diff/80001/chrome/browser/autocomp... File chrome/browser/autocomplete/shortcuts_provider.cc (right): https://codereview.chromium.org/1098843004/diff/80001/chrome/browser/autocomp... chrome/browser/autocomplete/shortcuts_provider.cc:161: // to mail.google.com, so typing "m" will return them all. Note that while On 2015/06/09 20:36:38, Peter Kasting wrote: > Nit: I might add after "return them all" this sentence: > > If we then simply clamp to kMaxMatches and let the AutocompleteResult take care > of collapsing the duplicates, we'll effectively only be returning one match, > instead of several possibilities. Done. > Then I would probably put a paragraph break (blank line) before the note. Done. https://codereview.chromium.org/1098843004/diff/80001/chrome/browser/autocomp... chrome/browser/autocomplete/shortcuts_provider.cc:164: // are only used for deletions, and shortcuts deletes matches based on the On 2015/06/09 20:36:38, Peter Kasting wrote: > Nit: shortcuts -> this provider? Sure. https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... File components/omnibox/autocomplete_input.cc (right): https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... components/omnibox/autocomplete_input.cc:35: // at least one additional character after the prefix. On 2015/06/09 20:36:39, Peter Kasting wrote: > Nit: Could say "...or https:// plus at least one more character and puts..." and > then drop the second sentence entirely. Done. https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... components/omnibox/autocomplete_input.cc:36: void PopulateHttpTerms(const base::string16& text, On 2015/06/09 20:36:38, Peter Kasting wrote: > Nit: If you change the member name, change this too -- maybe > PopulateTermsFollowingScheme() or FindTermsWithExplicitScheme() or something. Selected PopulateTermsPrefixedByScheme(). https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... components/omnibox/autocomplete_input.cc:38: const base::string16& http_scheme = On 2015/06/09 20:36:38, Peter Kasting wrote: > Rather than converting these to string16s, I think we should convert |text| to > UTF-8 (likely on the caller side). I don't like this idea. Everywhere in the autocomplete code when dealing with user input we use string16. Anything that gets compared with user input goes to the UTF-16 space. This is a widespread convention, and I'm nervous about breaking this pattern when there's no particularly good reason. I don't recall ever seeing the input converted to a std::string even when it would be more convenient to do so. To be concrete, much of this code parallels logic in url_prefix, which uses the same strategy (converting the URL to UTF16 to compare to the user input, which remains as string16). > The code in autocomplete_match.cc has UTF-8 > GURLs anyway, so this will avoid the conversions there. > > This means the type of |http_following_terms| (here and as a data member) will > change to std::vector<std::string>. https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... components/omnibox/autocomplete_input.cc:44: std::vector<base::string16> words; On 2015/06/09 20:36:38, Peter Kasting wrote: > Nit: words -> terms (for consistency)? Done. https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... components/omnibox/autocomplete_input.cc:45: base::SplitString(text, ' ', &words); On 2015/06/09 20:36:38, Peter Kasting wrote: > Nit: Consider a comment about why we don't use ICU breaking functions to split > into words. Done. https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... components/omnibox/autocomplete_input.cc:46: for (auto it : words) { On 2015/06/09 20:36:38, Peter Kasting wrote: > Nit: const auto&, to avoid a copy Done. https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... components/omnibox/autocomplete_input.cc:47: if (StartsWith(it, http_scheme, false) && On 2015/06/09 20:36:38, Peter Kasting wrote: > Nit: This is shorter overall and no less efficient: > > std::vector<std::string> terms; > base::SplitString(text, ' ', &terms); > const std::string separator(url::kStandardSchemeSeparator); > for (const auto& term : terms) { > static const char* kSchemes[2] = { url::kHttpScheme, url::kHttpsScheme }; > for (const char* scheme : kSchemes) { > const std::string prefix(scheme + separator); > if (StartsWith(term, prefix, false) && (term.length() > prefix.length())) > http_following_terms->push_back(term.substr(prefix.length())); > } > } Done. (well, the string16 version of the above) https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... File components/omnibox/autocomplete_input.h (right): https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... components/omnibox/autocomplete_input.h:187: // Returns list of words in text() that start with http:// or https://, On 2015/06/09 20:36:39, Peter Kasting wrote: > Nit: text() -> text_ Done. > "...or https:// plus at least one more character, ..." Done. https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... components/omnibox/autocomplete_input.h:188: // stored without the prefix. Used in duplicate elimination to detect On 2015/06/09 20:36:39, Peter Kasting wrote: > Nit: prefix -> scheme > > (Technically this is less accurate but it seems clearer to me after the change I > suggest above.) Okay, done. https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... components/omnibox/autocomplete_input.h:190: // scheme (and hence we shouldn't de-dup across schemes). On 2015/06/09 20:36:39, Peter Kasting wrote: > Nit: How about: > > "...to detect whether, for a given URL, the user may have started typing that > URL with an explicit scheme; see comments on > AutocompleteMatch::GURLToStrippedGURL()." Ah, that's good to make a reference to where the important code is. Done. https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... components/omnibox/autocomplete_input.h:191: const std::vector<base::string16>& http_following_terms() const { On 2015/06/09 20:36:39, Peter Kasting wrote: > Hmm, I don't love this name... maybe terms_following_explicit_scheme or > terms_prefixed_by_http_or_https? I can't think of a non-awkward name. The latter sounds better to me (though as you admit it's still awkward). Done. https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... File components/omnibox/autocomplete_match.cc (right): https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... components/omnibox/autocomplete_match.cc:37: const base::string16& url_spec_without_scheme_as_utf16 = On 2015/06/09 20:36:39, Peter Kasting wrote: > If you follow my suggestion in autocomplete_input.cc of making > |http_following_terms| a std::string, this can get simpler by avoiding the > encoding conversion. Acknowledged, but not relevant given my earlier protest. https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... components/omnibox/autocomplete_match.cc:41: for (auto it : http_following_terms) { On 2015/06/09 20:36:39, Peter Kasting wrote: > Nit: const auto&, to avoid a copy Done. BTW, also renamed "it" to "term". https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... components/omnibox/autocomplete_match.cc:42: if (StartsWith(url_spec_without_scheme_as_utf16, it, false)) { On 2015/06/09 20:36:39, Peter Kasting wrote: > Nit: No {} Done. https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... components/omnibox/autocomplete_match.cc:442: stripped_destination_url.SchemeIs(url::kHttpsScheme)) { On 2015/06/09 20:36:39, Peter Kasting wrote: > Nit: For efficiency, reverse these subexpressions; the SchemeIs() check should > be cheaper than the other call. Good idea. Done. > I would probably also add an explicit "!input.http_following_terms().empty() &&" > before both of these, again, just for efficiency (avoids some string > constructions). This kind of thing makes sense to me. Putting in where you suggested with a && is wrong however. (Thank goodness for unit tests to notice this.) Put in the correct formulation. https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... File components/omnibox/autocomplete_match.h (right): https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... components/omnibox/autocomplete_match.h:190: // two URLs with different schemes from having the same stripped GURL.) On 2015/06/09 20:36:39, Peter Kasting wrote: > Nit: How about: > > "...altered during stripping. If this URL, minus the scheme and separator, > starts with any the terms in input.http_following_terms(), we avoid converting > an HTTPS scheme to HTTP. This means that URLs that differ only by these schemes > won't be marked as dupes, since the distinction seems to matter to the user." That's much better, thanks.
https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... File components/omnibox/autocomplete_input.cc (right): https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... components/omnibox/autocomplete_input.cc:38: const base::string16& http_scheme = On 2015/06/10 23:38:34, Mark P wrote: > On 2015/06/09 20:36:38, Peter Kasting wrote: > > Rather than converting these to string16s, I think we should convert |text| to > > UTF-8 (likely on the caller side). > > I don't like this idea. Everywhere in the autocomplete code when dealing with > user input we use string16. Anything that gets compared with user input goes to > the UTF-16 space. This is a widespread convention, and I'm nervous about > breaking this pattern when there's no particularly good reason. I don't recall > ever seeing the input converted to a std::string even when it would be more > convenient to do so. To be concrete, much of this code parallels logic in > url_prefix, which uses the same strategy (converting the URL to UTF16 to compare > to the user input, which remains as string16). This isn't an intentional pattern. The intentional pattern is to minimize conversions and code complexity, which is often accomplished by leaving input as UTF-16, but not in this case. (There are other places we convert input, or parts of it, to UTF-8 already; AutocompleteResult::SortAndCull(), ShortcutsProvider::ShortcutToACMatch(), HistoryURLProvider::DoAutocomplete(), etc.) I don't see any downsides to this. UTF-8 can represent everything UTF-16 can represent and it's impossible to confuse the two because they use different char and string types. So I see no risk of loss of functionality or introducing bugs.
(I really would rather convert to UTF-8 than UTF-16, here's some other comments though.) https://codereview.chromium.org/1098843004/diff/100001/components/omnibox/aut... File components/omnibox/autocomplete_input.cc (right): https://codereview.chromium.org/1098843004/diff/100001/components/omnibox/aut... components/omnibox/autocomplete_input.cc:38: std::vector<base::string16>* terms_prefixed_by_http_or_https) { Nit: Call this |terms_prefixed_by_scheme| to match the function name (or change the function name). I might pick that name for the class member/function as well, just for the brevity, but up to you. https://codereview.chromium.org/1098843004/diff/100001/components/omnibox/aut... components/omnibox/autocomplete_input.cc:39: base::string16 kPrefixes[2] = { Even if you were to stick with string16 (which I don't want), doing the conversions in the loop is shorter because you only need to write them once instead of twice. Also, you'd want to use ASCIIToUTF16() instead. https://codereview.chromium.org/1098843004/diff/100001/components/omnibox/aut... components/omnibox/autocomplete_input.cc:47: // examples, ICU's iterator may break on punctuation (such as ://) or decide Nit: example https://codereview.chromium.org/1098843004/diff/100001/components/omnibox/aut... components/omnibox/autocomplete_input.cc:49: // hostname is multiple words). Neither of these behaviors are desirable. Nit: are -> is ("neither" is singular) https://codereview.chromium.org/1098843004/diff/100001/components/omnibox/aut... File components/omnibox/autocomplete_input.h (right): https://codereview.chromium.org/1098843004/diff/100001/components/omnibox/aut... components/omnibox/autocomplete_input.h:187: // Returns list of words in text_ that start with http:// or https:// plus Nit: list of words in text_ -> the terms in |text_| https://codereview.chromium.org/1098843004/diff/100001/components/omnibox/aut... components/omnibox/autocomplete_input.h:188: // at least one more characther, stored without the scheme. Used in Nit: character https://codereview.chromium.org/1098843004/diff/100001/components/omnibox/aut... File components/omnibox/autocomplete_match.h (right): https://codereview.chromium.org/1098843004/diff/100001/components/omnibox/aut... components/omnibox/autocomplete_match.h:190: // we avoid converting an HTTPS scheme to HTTP. This means URLs that differ Nit: "we we avoid"
Hopefully now we're almost done. I plan to rebase this after https://codereview.chromium.org/1169173005/ (which I just sent for review) is submitted, and add some new cases to the test framework I set up there. --mark https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... File components/omnibox/autocomplete_input.cc (right): https://codereview.chromium.org/1098843004/diff/80001/components/omnibox/auto... components/omnibox/autocomplete_input.cc:38: const base::string16& http_scheme = On 2015/06/11 00:07:38, Peter Kasting wrote: > On 2015/06/10 23:38:34, Mark P wrote: > > On 2015/06/09 20:36:38, Peter Kasting wrote: > > > Rather than converting these to string16s, I think we should convert |text| > to > > > UTF-8 (likely on the caller side). > > > > I don't like this idea. Everywhere in the autocomplete code when dealing with > > user input we use string16. Anything that gets compared with user input goes > to > > the UTF-16 space. This is a widespread convention, and I'm nervous about > > breaking this pattern when there's no particularly good reason. I don't > recall > > ever seeing the input converted to a std::string even when it would be more > > convenient to do so. To be concrete, much of this code parallels logic in > > url_prefix, which uses the same strategy (converting the URL to UTF16 to > compare > > to the user input, which remains as string16). > > This isn't an intentional pattern. The intentional pattern is to minimize > conversions and code complexity, which is often accomplished by leaving input as > UTF-16, but not in this case. (There are other places we convert input, or > parts of it, to UTF-8 already; AutocompleteResult::SortAndCull(), > ShortcutsProvider::ShortcutToACMatch(), HistoryURLProvider::DoAutocomplete(), > etc.) > > I don't see any downsides to this. UTF-8 can represent everything UTF-16 can > represent and it's impossible to confuse the two because they use different char > and string types. So I see no risk of loss of functionality or introducing > bugs. Okay, I did it. Easy enough. https://codereview.chromium.org/1098843004/diff/100001/components/omnibox/aut... File components/omnibox/autocomplete_input.cc (right): https://codereview.chromium.org/1098843004/diff/100001/components/omnibox/aut... components/omnibox/autocomplete_input.cc:38: std::vector<base::string16>* terms_prefixed_by_http_or_https) { On 2015/06/11 05:21:44, Peter Kasting wrote: > Nit: Call this |terms_prefixed_by_scheme| to match the function name (or change > the function name). > > I might pick that name for the class member/function as well, just for the > brevity, but up to you. I'd rather the increase in precision of http_or_https because this isn't terms prefixed by any scheme. Instead, I revised the function match to make them match. https://codereview.chromium.org/1098843004/diff/100001/components/omnibox/aut... components/omnibox/autocomplete_input.cc:39: base::string16 kPrefixes[2] = { On 2015/06/11 05:21:44, Peter Kasting wrote: > Even if you were to stick with string16 (which I don't want), doing the > conversions in the loop is shorter because you only need to write them once > instead of twice. Also, you'd want to use ASCIIToUTF16() instead. Acknowledged. Now obsolete. https://codereview.chromium.org/1098843004/diff/100001/components/omnibox/aut... components/omnibox/autocomplete_input.cc:47: // examples, ICU's iterator may break on punctuation (such as ://) or decide On 2015/06/11 05:21:44, Peter Kasting wrote: > Nit: example Done. https://codereview.chromium.org/1098843004/diff/100001/components/omnibox/aut... components/omnibox/autocomplete_input.cc:49: // hostname is multiple words). Neither of these behaviors are desirable. On 2015/06/11 05:21:44, Peter Kasting wrote: > Nit: are -> is ("neither" is singular) Done. https://codereview.chromium.org/1098843004/diff/100001/components/omnibox/aut... File components/omnibox/autocomplete_input.h (right): https://codereview.chromium.org/1098843004/diff/100001/components/omnibox/aut... components/omnibox/autocomplete_input.h:187: // Returns list of words in text_ that start with http:// or https:// plus On 2015/06/11 05:21:44, Peter Kasting wrote: > Nit: list of words in text_ -> the terms in |text_| Done. https://codereview.chromium.org/1098843004/diff/100001/components/omnibox/aut... components/omnibox/autocomplete_input.h:188: // at least one more characther, stored without the scheme. Used in On 2015/06/11 05:21:44, Peter Kasting wrote: > Nit: character Done. https://codereview.chromium.org/1098843004/diff/100001/components/omnibox/aut... File components/omnibox/autocomplete_match.h (right): https://codereview.chromium.org/1098843004/diff/100001/components/omnibox/aut... components/omnibox/autocomplete_match.h:190: // we avoid converting an HTTPS scheme to HTTP. This means URLs that differ On 2015/06/11 05:21:44, Peter Kasting wrote: > Nit: "we we avoid" Done.
Almost done. https://codereview.chromium.org/1098843004/diff/120001/components/omnibox/aut... File components/omnibox/autocomplete_match.cc (right): https://codereview.chromium.org/1098843004/diff/120001/components/omnibox/aut... components/omnibox/autocomplete_match.cc:41: strlen(url::kStandardSchemeSeparator)); You may want to do something like this: const size_t prefix_length = url.scheme().length() + strlen(url::kStandardSchemeSeparator); DCHECK_GE(url.spec().length(), prefix_length); const std::string& url_spec_without_scheme = url.spec().substr(prefix_length); This is because substr() does not accept offsets longer than the string length for the starting position, but we should be guaranteed that our offset isn't longer than that by the conditionals done on the caller side. https://codereview.chromium.org/1098843004/diff/120001/components/omnibox/aut... components/omnibox/autocomplete_match.cc:43: if (StartsWithASCII(url_spec_without_scheme, term, false)) This isn't right; we don't want an ASCII compare. I'm not sure why string_util.h has ASCII and UTF-16 versions of StartsWith(), and UTF-8 and UTF-16 versions of EndsWith(). This is weirdly inconsistent, and you can see in the .cc file that StartsWithT<> was built to be used by both UTF-8 and UTF-16 versions. I would suggest adding a UTF-8 StartsWith() to string_util.h and then using that here.
https://codereview.chromium.org/1098843004/diff/120001/components/omnibox/aut... File components/omnibox/autocomplete_input.cc (right): https://codereview.chromium.org/1098843004/diff/120001/components/omnibox/aut... components/omnibox/autocomplete_input.cc:50: if (StartsWithASCII(term, prefix, false) && This can still be a StartsWithASCII, right? Or maybe there's some unicode equivalent characters nonsense that means I should do the UTF8 version? https://codereview.chromium.org/1098843004/diff/120001/components/omnibox/aut... File components/omnibox/autocomplete_match.cc (right): https://codereview.chromium.org/1098843004/diff/120001/components/omnibox/aut... components/omnibox/autocomplete_match.cc:41: strlen(url::kStandardSchemeSeparator)); On 2015/06/11 23:00:15, Peter Kasting wrote: > This is because substr() does not accept offsets longer than the string length > for the starting position, Good point. > but we should be guaranteed that our offset isn't > longer than that by the conditionals done on the caller side. Ah, that's why things work. > You may want to do something like this: > > const size_t prefix_length = > url.scheme().length() + strlen(url::kStandardSchemeSeparator); > DCHECK_GE(url.spec().length(), prefix_length); > const std::string& url_spec_without_scheme = url.spec().substr(prefix_length); Done. https://codereview.chromium.org/1098843004/diff/120001/components/omnibox/aut... components/omnibox/autocomplete_match.cc:43: if (StartsWithASCII(url_spec_without_scheme, term, false)) On 2015/06/11 23:00:15, Peter Kasting wrote: > This isn't right; we don't want an ASCII compare. > > I'm not sure why string_util.h has ASCII and UTF-16 versions of StartsWith(), > and UTF-8 and UTF-16 versions of EndsWith(). I wondered the same thing. > This is weirdly inconsistent, and > you can see in the .cc file that StartsWithT<> was built to be used by both > UTF-8 and UTF-16 versions. > > I would suggest adding a UTF-8 StartsWith() to string_util.h and then using that > here. Done. I thought about this originally but I don't like touching base/ much. Thanks for your encouragement.
pkasting@chromium.org changed reviewers: + brettw@chromium.org
Adding Brett for string_util question since he's in the midst of touching these separately. Brett, in this CL I had Mark add a UTF-8 StartsWith() (we had one for EndsWith() but not StartsWith()). This is so we could do a case-insensitive compare for possibly-Unicode hostnames (see https://codereview.chromium.org/1098843004/diff/160001/components/omnibox/aut... for the caller). However, it sounds from the review you just sent me like this isn't right, and this implementation of StartsWith() would mangle non-ASCII UTF-8. Is that true? What should we be doing here instead? Mark, other than the above concern, the rest of this LGTM. https://codereview.chromium.org/1098843004/diff/120001/components/omnibox/aut... File components/omnibox/autocomplete_input.cc (right): https://codereview.chromium.org/1098843004/diff/120001/components/omnibox/aut... components/omnibox/autocomplete_input.cc:50: if (StartsWithASCII(term, prefix, false) && On 2015/06/12 20:23:20, Mark P wrote: > This can still be a StartsWithASCII, right? > Or maybe there's some unicode equivalent characters nonsense that means I should > do the UTF8 version? No, this can still be StartsWithASCII().
brettw@, this is waiting in your review of the strings stuff. pkasting@, FYI I added more unit tests since the last time you looked at this. I didn't yet run them. (Running into strings linking errors; don't want to bother resolving them until I hear from brett about what we should be doing with string_utils.) --mark https://codereview.chromium.org/1098843004/diff/120001/components/omnibox/aut... File components/omnibox/autocomplete_input.cc (right): https://codereview.chromium.org/1098843004/diff/120001/components/omnibox/aut... components/omnibox/autocomplete_input.cc:50: if (StartsWithASCII(term, prefix, false) && On 2015/06/15 20:47:16, Peter Kasting wrote: > On 2015/06/12 20:23:20, Mark P wrote: > > This can still be a StartsWithASCII, right? > > Or maybe there's some unicode equivalent characters nonsense that means I > should > > do the UTF8 version? > > No, this can still be StartsWithASCII(). Acknowledged.
If you sync you will notice StartsWith changed all around. Have you tested your version? I believe what your version does is take each 8-bit value individually and doing a case-insensitive comparison based on the current locale. You can see that the new versions of these functions are pretty explicit about saying it's case sensitive or doing case-insensitive ASCII comparisons. I'm considering writing a Unicode-aware case-insensitive StartsWith/EndsWith. That would need to go in base/i18n, and I would not add a UTF-8 version. If you want a unicode-aware StartsWith today, do: base::StartsWith(base::i18n::ToLower(a), base::i18n::ToLower(b), base::CompareCase::SENSITIVE)
pkasting: I thought a bit more about this and think StartsWithASCII is all I need. After all, given input "http://bl" or "https://blah.com/a", we're looking for the http / https prefix with the scheme separator. This is clearly ASCII and it should be able to match the (UTF8) input using an ASCII comparison correctly. Then, once we have "bl" or "blah.com/a", we're matching this against the URL spec. The URL spec is definitely ASCII. (In fact, it's not even full ASCII because a lot of things will be escaped.) I revised this changelist to use the ASCII versions of the functions. Take a look if you want. --mark
https://codereview.chromium.org/1098843004/diff/220001/components/omnibox/aut... File components/omnibox/autocomplete_input.cc (right): https://codereview.chromium.org/1098843004/diff/220001/components/omnibox/aut... components/omnibox/autocomplete_input.cc:51: if (base::StartsWithASCII(term, prefix, false) && Can you use the new versions? I want to delete these old ones. base::StartsWith(term, prefix, base::CompareCase::INSENSITIVE_ASCII);
https://codereview.chromium.org/1098843004/diff/220001/components/omnibox/aut... File components/omnibox/autocomplete_input.cc (right): https://codereview.chromium.org/1098843004/diff/220001/components/omnibox/aut... components/omnibox/autocomplete_input.cc:51: if (base::StartsWithASCII(term, prefix, false) && On 2015/06/22 17:54:01, brettw wrote: > Can you use the new versions? I want to delete these old ones. > base::StartsWith(term, prefix, base::CompareCase::INSENSITIVE_ASCII); Done.
Thanks. I'm not planning a real review so don't block on me.
On 2015/06/22 17:41:04, Mark P wrote: > Then, once we have "bl" or "blah.com/a", we're matching this against > the URL spec. The URL spec is definitely ASCII. (In fact, it's not > even full ASCII because a lot of things will be escaped.) Are you sure? I thought the spec() was unescaped, in which case it's UTF-8. And the user input is _definitely_ unescaped, so if the spec were escaped, we'd be comparing the wrong things. Can you explicitly verify the patch behavior with UTF-8 input and a UTF-8 URL?
On 2015/06/22 21:07:41, Peter Kasting wrote: > On 2015/06/22 17:41:04, Mark P wrote: > > Then, once we have "bl" or "blah.com/a", we're matching this against > > the URL spec. The URL spec is definitely ASCII. (In fact, it's not > > even full ASCII because a lot of things will be escaped.) > > Are you sure? I thought the spec() was unescaped, in which case it's UTF-8. > And the user input is _definitely_ unescaped, so if the spec were escaped, we'd > be comparing the wrong things. > > Can you explicitly verify the patch behavior with UTF-8 input and a UTF-8 URL? Spec is ASCII except for the reference fragment which will be UTF-8.
On 2015/06/22 21:31:34, brettw wrote: > On 2015/06/22 21:07:41, Peter Kasting wrote: > > On 2015/06/22 17:41:04, Mark P wrote: > > > Then, once we have "bl" or "blah.com/a", we're matching this against > > > the URL spec. The URL spec is definitely ASCII. (In fact, it's not > > > even full ASCII because a lot of things will be escaped.) > > > > Are you sure? I thought the spec() was unescaped, in which case it's UTF-8. > > And the user input is _definitely_ unescaped, so if the spec were escaped, > we'd > > be comparing the wrong things. > > > > Can you explicitly verify the patch behavior with UTF-8 input and a UTF-8 URL? > > Spec is ASCII except for the reference fragment which will be UTF-8. I see. Seems like we need to escape the user's input somewhere before comparing, then, unless that's already being done.
On Mon, Jun 22, 2015 at 2:39 PM, <pkasting@chromium.org> wrote: > On 2015/06/22 21:31:34, brettw wrote: > >> On 2015/06/22 21:07:41, Peter Kasting wrote: >> > On 2015/06/22 17:41:04, Mark P wrote: >> > > Then, once we have "bl" or "blah.com/a", we're matching this against >> > > the URL spec. The URL spec is definitely ASCII. (In fact, it's not >> > > even full ASCII because a lot of things will be escaped.) >> > >> > Are you sure? I thought the spec() was unescaped, in which case it's >> UTF-8. >> > > > And the user input is _definitely_ unescaped, so if the spec were >> escaped, >> we'd >> > be comparing the wrong things. >> > >> > Can you explicitly verify the patch behavior with UTF-8 input and a >> UTF-8 >> > URL? > > Spec is ASCII except for the reference fragment which will be UTF-8. >> > > I see. Seems like we need to escape the user's input somewhere before > comparing, then, unless that's already being done. > It's not clear. We're worried about cases when the user has something like "http://xy blah" or "https://xy.com/fo blah" in the omnibox. I think if the user gets some http://whatever text in the omnibox, and they're entering enough text to get into the path somewhere (which is where we need to start worrying about escaping), I think it's more likely that they pasted a URL into the omnibox (in which case leaving the URL text as-is is better) rather than started typing the special characters (that need to be escaped) in the path. --mark > https://codereview.chromium.org/1098843004/ > To unsubscribe from this group and stop receiving emails from it, send an email to chromium-reviews+unsubscribe@chromium.org.
On 2015/06/22 22:20:28, Mark P wrote: > On Mon, Jun 22, 2015 at 2:39 PM, <mailto:pkasting@chromium.org> wrote: > > > On 2015/06/22 21:31:34, brettw wrote: > > > >> On 2015/06/22 21:07:41, Peter Kasting wrote: > >> > On 2015/06/22 17:41:04, Mark P wrote: > >> > > Then, once we have "bl" or "blah.com/a", we're matching this against > >> > > the URL spec. The URL spec is definitely ASCII. (In fact, it's not > >> > > even full ASCII because a lot of things will be escaped.) > >> > > >> > Are you sure? I thought the spec() was unescaped, in which case it's > >> UTF-8. > >> > > > > > And the user input is _definitely_ unescaped, so if the spec were > >> escaped, > >> we'd > >> > be comparing the wrong things. > >> > > >> > Can you explicitly verify the patch behavior with UTF-8 input and a > >> UTF-8 > >> > > URL? > > > > Spec is ASCII except for the reference fragment which will be UTF-8. > >> > > > > I see. Seems like we need to escape the user's input somewhere before > > comparing, then, unless that's already being done. > > > > It's not clear. > > We're worried about cases when the user has something like "http://xy blah" > or "https://xy.com/fo blah" in the omnibox. I think if the user gets some > http://whatever text in the omnibox, and they're entering enough text to > get into the path somewhere (which is where we need to start worrying about > escaping), I think it's more likely that they pasted a URL into the omnibox > (in which case leaving the URL text as-is is better) rather than started > typing the special characters (that need to be escaped) in the path. What if the user is typing an IDN? Seems like that's an obvious case where we won't match correctly. And actually, in that case, escaping the input and trying to convert to punycode won't help, because the punycode representation of a prefix of another IDN is not itself a prefix of that other IDN's punycode. I think instead the right answer is that we should take the URL we're trying to match against and convert to the unescaped Unicode form, and also attempt any relevant unescaping on the input, then compare those two strings.
On Mon, Jun 22, 2015 at 3:25 PM, <pkasting@chromium.org> wrote: > On 2015/06/22 22:20:28, Mark P wrote: > > On Mon, Jun 22, 2015 at 2:39 PM, <mailto:pkasting@chromium.org> wrote: >> > > > On 2015/06/22 21:31:34, brettw wrote: >> > >> >> On 2015/06/22 21:07:41, Peter Kasting wrote: >> >> > On 2015/06/22 17:41:04, Mark P wrote: >> >> > > Then, once we have "bl" or "blah.com/a", we're matching this >> against >> >> > > the URL spec. The URL spec is definitely ASCII. (In fact, it's >> not >> >> > > even full ASCII because a lot of things will be escaped.) >> >> > >> >> > Are you sure? I thought the spec() was unescaped, in which case it's >> >> UTF-8. >> >> >> > >> > > And the user input is _definitely_ unescaped, so if the spec were >> >> escaped, >> >> we'd >> >> > be comparing the wrong things. >> >> > >> >> > Can you explicitly verify the patch behavior with UTF-8 input and a >> >> UTF-8 >> >> >> > URL? >> > >> > Spec is ASCII except for the reference fragment which will be UTF-8. >> >> >> > >> > I see. Seems like we need to escape the user's input somewhere before >> > comparing, then, unless that's already being done. >> > >> > > It's not clear. >> > > We're worried about cases when the user has something like "http://xy >> blah" >> or "https://xy.com/fo blah" in the omnibox. I think if the user gets >> some >> http://whatever text in the omnibox, and they're entering enough text to >> get into the path somewhere (which is where we need to start worrying >> about >> escaping), I think it's more likely that they pasted a URL into the >> omnibox >> (in which case leaving the URL text as-is is better) rather than started >> typing the special characters (that need to be escaped) in the path. >> > > What if the user is typing an IDN? Seems like that's an obvious case > where we > won't match correctly. > > And actually, in that case, escaping the input and trying to convert to > punycode > won't help, because the punycode representation of a prefix of another IDN > is > not itself a prefix of that other IDN's punycode. > > I think instead the right answer is that we should take the URL we're > trying to > match against and convert to the unescaped Unicode form, and also attempt > any > relevant unescaping on the input, then compare those two strings. > If you're worried about IDNs, that's another story. Yes, this changelist doesn't handle them. If you want to add handling of IDNs, I suggest doing that in another changelist. (Personally, I think the time to implement that part of the change correctly--after all this is only about slightly better deduping across protocols--isn't worth my time compared to other things on my plate. Your speed of implementation, alternate uses of time, and opinion about the importance of this feature may vary.) --mark > > https://codereview.chromium.org/1098843004/ > To unsubscribe from this group and stop receiving emails from it, send an email to chromium-reviews+unsubscribe@chromium.org.
I'm just proposing adding an unescape call. It's hard for me to understand how that's so involved as to be worth a separate CL, especially when I'm _not_ just concerned about IDNs. If it sounds like fixing every concern I have would take more than 20 minutes of your time, then one of us is misunderstanding something.
pkasting@, As a person who's played with compiler flags before, perhaps you can help me figure out what's going on. Look at the latest try server results. I think there's a misconfiguration somewhere in the build files for autocomplete_match_unittest.cc or its containing binary (components_unittests). But I don't know how to go about figuring out why the compiler thinks this symbol is hidden and conflicts with other versions of it. thanks, mark
Try adding BASE_EXPORT to all the function definitions in utf_string_conversions.cc that are declared that way in utf_string_conversions.h?
I don't think you did anything wrong. Try running the bots again.
On 2015/06/25 22:43:50, brettw wrote: > I don't think you did anything wrong. Try running the bots again. Still a problem on the bots. The problem also occurs with my local build from this patchset, but not from my local master branch. There must be something with this patch... --mark
On 2015/06/25 23:29:03, Mark P wrote: > On 2015/06/25 22:43:50, brettw wrote: > > I don't think you did anything wrong. Try running the bots again. > > Still a problem on the bots. Did you try the fix I suggested? PK
https://codereview.chromium.org/1098843004/diff/340001/components/omnibox/aut... File components/omnibox/autocomplete_match_unittest.cc (right): https://codereview.chromium.org/1098843004/diff/340001/components/omnibox/aut... components/omnibox/autocomplete_match_unittest.cc:8: #include "base/strings/utf_string_conversions.cc" Ah, your problem is you included a .cc file here instead of the header.
https://codereview.chromium.org/1098843004/diff/340001/components/omnibox/aut... File components/omnibox/autocomplete_match_unittest.cc (right): https://codereview.chromium.org/1098843004/diff/340001/components/omnibox/aut... components/omnibox/autocomplete_match_unittest.cc:8: #include "base/strings/utf_string_conversions.cc" On 2015/06/26 04:43:18, brettw wrote: > Ah, your problem is you included a .cc file here instead of the header. Doh! Thanks for catching that.
On 2015/06/22 23:04:56, Peter Kasting wrote: > I'm just proposing adding an unescape call. It's hard for me to understand how > that's so involved as to be worth a separate CL, especially when I'm _not_ just > concerned about IDNs. > > If it sounds like fixing every concern I have would take more than 20 minutes of > your time, then one of us is misunderstanding something. I added IDN support. It was easier than I expected, though harder than you expected. Your estimate was off by a factor of ~7x. The code itself was simple, but getting all the plumbing right (and first deciding what exactly to plumb) and then revising all the callers to reflect API changes took some time. Maybe you'd be able to do it faster (maybe your IDE has better support for automating API changes?) but 20 minutes was not a good estimate. Anyway, please take a look. --mark
On 2015/06/26 22:45:03, Mark P wrote: > On 2015/06/22 23:04:56, Peter Kasting wrote: > > I'm just proposing adding an unescape call. It's hard for me to understand > how > > that's so involved as to be worth a separate CL, especially when I'm _not_ > just > > concerned about IDNs. > > > > If it sounds like fixing every concern I have would take more than 20 minutes > of > > your time, then one of us is misunderstanding something. > > I added IDN support. It was easier than I expected, though harder than you > expected. Your estimate was off by a factor of ~7x. The code itself was > simple, but getting all the plumbing right (and first deciding what exactly to > plumb) and then revising all the callers to reflect API changes took some time. > Maybe you'd be able to do it faster (maybe your IDE has better support for > automating API changes?) but 20 minutes was not a good estimate. > > Anyway, please take a look. > > --mark P.S. Do not be misled by my chromium name; I'm not actually gone quite yet...
https://codereview.chromium.org/1098843004/diff/420001/chrome/browser/autocom... File chrome/browser/autocomplete/autocomplete_provider_unittest.cc (right): https://codereview.chromium.org/1098843004/diff/420001/chrome/browser/autocom... chrome/browser/autocomplete/autocomplete_provider_unittest.cc:234: scoped_ptr<AutocompleteProviderClient> client_; Nit: Does this need to be a scoped_ptr? Can it just be a straight MockAutocompleteProviderClient? (However, see comments in omnibox_edit_model.cc that may moot this) https://codereview.chromium.org/1098843004/diff/420001/chrome/browser/autocom... File chrome/browser/autocomplete/history_url_provider_unittest.cc (right): https://codereview.chromium.org/1098843004/diff/420001/chrome/browser/autocom... chrome/browser/autocomplete/history_url_provider_unittest.cc:326: input, client_.get()->GetAcceptLanguages(), service); No need for .get() when doing ->. https://codereview.chromium.org/1098843004/diff/420001/chrome/browser/ui/coco... File chrome/browser/ui/cocoa/omnibox/omnibox_popup_view_mac_unittest.mm (right): https://codereview.chromium.org/1098843004/diff/420001/chrome/browser/ui/coco... chrome/browser/ui/cocoa/omnibox/omnibox_popup_view_mac_unittest.mm:43: scoped_ptr<AutocompleteProviderClient> client_; Nit: Again, can this drop the scoped_ptr? https://codereview.chromium.org/1098843004/diff/420001/chrome/browser/ui/omni... File chrome/browser/ui/omnibox/omnibox_edit_model.cc (right): https://codereview.chromium.org/1098843004/diff/420001/chrome/browser/ui/omni... chrome/browser/ui/omnibox/omnibox_edit_model.cc:789: make_scoped_ptr(new ChromeAutocompleteProviderClient(profile_)).get()); This seems kind of scary. I think the scope of the temporary scoped_ptr you create only extends to the end of this statement? In which case the AutocompleteResult is now holding a dangling raw pointer to a deleted object. There are a whole variety of potential fixes here, but honestly, I don't think AutocompleteResult should take a pointer to this object at all. This is partly because construction here is a bit awkward, partly because AutocompleteProviderClient is meant to be a "provider client" (see name) and not a "result client" or "general autocomplete system client", but mostly because the only actual use of this is to get the correct languages to use, and dependency injection principles suggest that, in that case, it would be better to pass in the languages to use, or even better, to pass those directly to the function that needs them, AutocompleteResult::SortAndCull(). Doing this requires touching a dozen or so callsites, but outside of tests, only a couple matter: one call from the AutocompleteController, and one call from CopyOldMatches(), which is in turn called from the AutocompleteController. The controller, of course, owns the main provider client already and can get the necessary languages directly. And while this does require changing some callsites in tests, it also allows you to avoid adding mock provider client construction to those same tests. Yay. https://codereview.chromium.org/1098843004/diff/420001/components/omnibox/aut... File components/omnibox/autocomplete_input.cc (right): https://codereview.chromium.org/1098843004/diff/420001/components/omnibox/aut... components/omnibox/autocomplete_input.cc:53: // Doing an ASCII comparison is okay because prefix is in ASCII. Tiny nit: "...prefix is ASCII" (no "in") reads slightly clearer to me. https://codereview.chromium.org/1098843004/diff/420001/components/omnibox/aut... File components/omnibox/autocomplete_input.h (right): https://codereview.chromium.org/1098843004/diff/420001/components/omnibox/aut... components/omnibox/autocomplete_input.h:204: const std::vector<std::pair<std::string, base::string16> >& Nit: No space between > >. (Yay C++11 language support) (Multiple places in multiple files) However, it may make sense to define a typedef for the pair type, the vector type, or both, rather than repeat the types in a few different places. https://codereview.chromium.org/1098843004/diff/420001/components/omnibox/aut... File components/omnibox/autocomplete_match.cc (right): https://codereview.chromium.org/1098843004/diff/420001/components/omnibox/aut... components/omnibox/autocomplete_match.cc:49: // detect a prefix match against a punycode-encoded hostname. How come you check both? Shouldn't the check against the formatted URL catch everything except the user explicitly typing punycode (which is so unlikely I'm OK with not supporting it)? If so it seems like doing just that check would be clearer, and you wouldn't need to keep a vector of pairs. https://codereview.chromium.org/1098843004/diff/420001/components/omnibox/aut... File components/omnibox/autocomplete_match.h (right): https://codereview.chromium.org/1098843004/diff/420001/components/omnibox/aut... components/omnibox/autocomplete_match.h:201: // stores the result in |stripped_destination_url|. Nit: For this function and the next, I would probably at least say something like "|languages| is used for the same purpose as in GURLToStrippedGURL()" or something. https://codereview.chromium.org/1098843004/diff/420001/components/omnibox/aut... File components/omnibox/autocomplete_match_unittest.cc (right): https://codereview.chromium.org/1098843004/diff/420001/components/omnibox/aut... components/omnibox/autocomplete_match_unittest.cc:175: // a match. Nit: For both of the comments here, you could shorten constructions like "A test to verify that for X, we Y" to just "For X, we should Y" or similar. https://codereview.chromium.org/1098843004/diff/420001/components/omnibox/aut... components/omnibox/autocomplete_match_unittest.cc:187: SCOPED_TRACE(std::string("input=") + base::WideToUTF8(cases[i].input) + Nit: I _think_ this compiles without the explicit std::string(). https://codereview.chromium.org/1098843004/diff/420001/components/omnibox/aut... File components/omnibox/autocomplete_result.h (right): https://codereview.chromium.org/1098843004/diff/420001/components/omnibox/aut... components/omnibox/autocomplete_result.h:64: AutocompleteResult(AutocompleteProviderClient* client); Nit: explicit
I'm sorry this changelist has been so sloppy. I've been distracted lately. --mark https://codereview.chromium.org/1098843004/diff/420001/chrome/browser/autocom... File chrome/browser/autocomplete/autocomplete_provider_unittest.cc (right): https://codereview.chromium.org/1098843004/diff/420001/chrome/browser/autocom... chrome/browser/autocomplete/autocomplete_provider_unittest.cc:234: scoped_ptr<AutocompleteProviderClient> client_; On 2015/06/29 05:04:58, Peter Kasting wrote: > Nit: Does this need to be a scoped_ptr? Can it just be a straight > MockAutocompleteProviderClient? > > (However, see comments in omnibox_edit_model.cc that may moot this) Now moot. https://codereview.chromium.org/1098843004/diff/420001/chrome/browser/autocom... File chrome/browser/autocomplete/history_url_provider_unittest.cc (right): https://codereview.chromium.org/1098843004/diff/420001/chrome/browser/autocom... chrome/browser/autocomplete/history_url_provider_unittest.cc:326: input, client_.get()->GetAcceptLanguages(), service); On 2015/06/29 05:04:58, Peter Kasting wrote: > No need for .get() when doing ->. Done. https://codereview.chromium.org/1098843004/diff/420001/chrome/browser/ui/coco... File chrome/browser/ui/cocoa/omnibox/omnibox_popup_view_mac_unittest.mm (right): https://codereview.chromium.org/1098843004/diff/420001/chrome/browser/ui/coco... chrome/browser/ui/cocoa/omnibox/omnibox_popup_view_mac_unittest.mm:43: scoped_ptr<AutocompleteProviderClient> client_; On 2015/06/29 05:04:58, Peter Kasting wrote: > Nit: Again, can this drop the scoped_ptr? Now moot. https://codereview.chromium.org/1098843004/diff/420001/chrome/browser/ui/omni... File chrome/browser/ui/omnibox/omnibox_edit_model.cc (right): https://codereview.chromium.org/1098843004/diff/420001/chrome/browser/ui/omni... chrome/browser/ui/omnibox/omnibox_edit_model.cc:789: make_scoped_ptr(new ChromeAutocompleteProviderClient(profile_)).get()); On 2015/06/29 05:04:59, Peter Kasting wrote: > This seems kind of scary. I think the scope of the temporary scoped_ptr you > create only extends to the end of this statement? In which case the > AutocompleteResult is now holding a dangling raw pointer to a deleted object. > > There are a whole variety of potential fixes here, but honestly, I don't think > AutocompleteResult should take a pointer to this object at all. This is partly > because construction here is a bit awkward, partly because > AutocompleteProviderClient is meant to be a "provider client" (see name) and not > a "result client" or "general autocomplete system client", but mostly because > the only actual use of this is to get the correct languages to use, and > dependency injection principles suggest that, in that case, it would be better > to pass in the languages to use, or even better, to pass those directly to the > function that needs them, AutocompleteResult::SortAndCull(). Doing this requires > touching a dozen or so callsites, but outside of tests, only a couple matter: > one call from the AutocompleteController, and one call from CopyOldMatches(), > which is in turn called from the AutocompleteController. The controller, of > course, owns the main provider client already and can get the necessary > languages directly. And while this does require changing some callsites in > tests, it also allows you to avoid adding mock provider client construction to > those same tests. Yay. That's a good alternate design. I thought of one alternate design involving passing languages, but that involved passing languages into the AutocompleteResult constructor (a la how this changelist passes the AutocompleteProviderClient). However, I realize that doesn't work well because the user's language may change and we only construct AutocompleteResult once (when we build the controller). I'm embarrassed I didn't think of your suggestion, simply passing it directly into ComputeStrippedDestinationUrl and friends. Done. https://codereview.chromium.org/1098843004/diff/420001/components/omnibox/aut... File components/omnibox/autocomplete_input.cc (right): https://codereview.chromium.org/1098843004/diff/420001/components/omnibox/aut... components/omnibox/autocomplete_input.cc:53: // Doing an ASCII comparison is okay because prefix is in ASCII. On 2015/06/29 05:04:59, Peter Kasting wrote: > Tiny nit: "...prefix is ASCII" (no "in") reads slightly clearer to me. Done. https://codereview.chromium.org/1098843004/diff/420001/components/omnibox/aut... File components/omnibox/autocomplete_input.h (right): https://codereview.chromium.org/1098843004/diff/420001/components/omnibox/aut... components/omnibox/autocomplete_input.h:204: const std::vector<std::pair<std::string, base::string16> >& On 2015/06/29 05:04:59, Peter Kasting wrote: > Nit: No space between > >. (Yay C++11 language support) (Multiple places in > multiple files) > > However, it may make sense to define a typedef for the pair type, the vector > type, or both, rather than repeat the types in a few different places. Now that it's back to a vector, I think a typedef is basically unnecessary. https://codereview.chromium.org/1098843004/diff/420001/components/omnibox/aut... File components/omnibox/autocomplete_match.cc (right): https://codereview.chromium.org/1098843004/diff/420001/components/omnibox/aut... components/omnibox/autocomplete_match.cc:49: // detect a prefix match against a punycode-encoded hostname. On 2015/06/29 05:04:59, Peter Kasting wrote: > How come you check both? I thought I had a reason at the time but I don't recall it now. Perhaps I was thinking that FormatUrl() was reasonably likely to fail? (This notion is wrong.) Now fixed to only check the formatted url. > Shouldn't the check against the formatted URL catch > everything except the user explicitly typing punycode (which is so unlikely I'm > OK with not supporting it)? If so it seems like doing just that check would be > clearer, and you wouldn't need to keep a vector of pairs. https://codereview.chromium.org/1098843004/diff/420001/components/omnibox/aut... File components/omnibox/autocomplete_match.h (right): https://codereview.chromium.org/1098843004/diff/420001/components/omnibox/aut... components/omnibox/autocomplete_match.h:201: // stores the result in |stripped_destination_url|. On 2015/06/29 05:04:59, Peter Kasting wrote: > Nit: For this function and the next, I would probably at least say something > like "|languages| is used for the same purpose as in GURLToStrippedGURL()" or > something. Done. https://codereview.chromium.org/1098843004/diff/420001/components/omnibox/aut... File components/omnibox/autocomplete_match_unittest.cc (right): https://codereview.chromium.org/1098843004/diff/420001/components/omnibox/aut... components/omnibox/autocomplete_match_unittest.cc:175: // a match. On 2015/06/29 05:04:59, Peter Kasting wrote: > Nit: For both of the comments here, you could shorten constructions like "A test > to verify that for X, we Y" to just "For X, we should Y" or similar. Done. https://codereview.chromium.org/1098843004/diff/420001/components/omnibox/aut... components/omnibox/autocomplete_match_unittest.cc:187: SCOPED_TRACE(std::string("input=") + base::WideToUTF8(cases[i].input) + On 2015/06/29 05:04:59, Peter Kasting wrote: > Nit: I _think_ this compiles without the explicit std::string(). Right you are. (You found remnants of an earlier iteration of these tests.) Done. https://codereview.chromium.org/1098843004/diff/420001/components/omnibox/aut... File components/omnibox/autocomplete_result.h (right): https://codereview.chromium.org/1098843004/diff/420001/components/omnibox/aut... components/omnibox/autocomplete_result.h:64: AutocompleteResult(AutocompleteProviderClient* client); On 2015/06/29 05:04:59, Peter Kasting wrote: > Nit: explicit Now moot.
LGTM https://codereview.chromium.org/1098843004/diff/440001/components/omnibox/aut... File components/omnibox/autocomplete_match.cc (right): https://codereview.chromium.org/1098843004/diff/440001/components/omnibox/aut... components/omnibox/autocomplete_match.cc:55: // for international (punycode) domain names. I bet I know why you did both comparisons. You were trying to do a case-insensitive comparison for inputs like "HTTP://FOO", but you didn't want to do the case-folding with ICU for a true case-insensitive UTF-8 comparison, so you backstopped it with a case-sensitive UTF-8 comparison since the existing code could do that. Now that you only have one comparison, the comment is kinda misleading. We now don't support case-insensitive comparisons at all. It would be nice to do this right, by simply using ICU to lowercase both the URL and the terms before doing the case-sensitive comparison. But I suppose I could probably implement that as a followup if it looks tricky. https://codereview.chromium.org/1098843004/diff/440001/components/omnibox/aut... File components/omnibox/autocomplete_match_unittest.cc (right): https://codereview.chromium.org/1098843004/diff/440001/components/omnibox/aut... components/omnibox/autocomplete_match_unittest.cc:154: // Don't allow URL with different schemes to be considered duplicates for Nit: URLs https://codereview.chromium.org/1098843004/diff/440001/components/omnibox/aut... File components/omnibox/autocomplete_result.cc (right): https://codereview.chromium.org/1098843004/diff/440001/components/omnibox/aut... components/omnibox/autocomplete_result.cc:184: for (ACMatches::iterator i(matches_.begin()); i != matches_.end(); ++i) { Nit: No {}
https://codereview.chromium.org/1098843004/diff/440001/components/omnibox/aut... File components/omnibox/autocomplete_match.cc (right): https://codereview.chromium.org/1098843004/diff/440001/components/omnibox/aut... components/omnibox/autocomplete_match.cc:55: // for international (punycode) domain names. On 2015/06/30 06:50:30, Peter Kasting wrote: > I bet I know why you did both comparisons. You were trying to do a > case-insensitive comparison for inputs like "HTTP://FOO", but you didn't want to > do the case-folding with ICU for a true case-insensitive UTF-8 comparison, so > you backstopped it with a case-sensitive UTF-8 comparison since the existing > code could do that. Ah, that sounds like what I was thinking. Thanks for giving me credit. :-) FWIW, we do get the case where the protocol is capitalized right (that goes through a case insensitive comparison in autocomplete_input.cc); it's on the main part of the URL that we get wrong. > > Now that you only have one comparison, the comment is kinda misleading. We now > don't support case-insensitive comparisons at all. Yeah, I agree the comment is misleading. Removed. > > It would be nice to do this right, by simply using ICU to lowercase both the URL > and the terms before doing the case-sensitive comparison. But I suppose I could > probably implement that as a followup if it looks tricky. One reason I left out case insensitivity for IDNs is that it's not clear to me that it's the right thing to do. Are you sure that it's safe to transform one IDN domain name into another by lowercasing it? Can they be owned by different entities? Note that the lowercasing algorithm depends on the locale. In any case, I'll leave whether to do this kind of thing up to you. (I think the answer is that it's not worth the time to think about, let alone code, given that it only affects deduping.) https://codereview.chromium.org/1098843004/diff/440001/components/omnibox/aut... File components/omnibox/autocomplete_match_unittest.cc (right): https://codereview.chromium.org/1098843004/diff/440001/components/omnibox/aut... components/omnibox/autocomplete_match_unittest.cc:154: // Don't allow URL with different schemes to be considered duplicates for On 2015/06/30 06:50:30, Peter Kasting wrote: > Nit: URLs Done. https://codereview.chromium.org/1098843004/diff/440001/components/omnibox/aut... File components/omnibox/autocomplete_result.cc (right): https://codereview.chromium.org/1098843004/diff/440001/components/omnibox/aut... components/omnibox/autocomplete_result.cc:184: for (ACMatches::iterator i(matches_.begin()); i != matches_.end(); ++i) { On 2015/06/30 06:50:30, Peter Kasting wrote: > Nit: No {} Done.
The CQ bit was checked by mpearson@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from pkasting@chromium.org Link to the patchset: https://codereview.chromium.org/1098843004/#ps460001 (title: "peter's comments")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1098843004/460001
The CQ bit was unchecked by commit-bot@chromium.org
Exceeded global retry quota
The CQ bit was checked by mpearson@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from pkasting@chromium.org Link to the patchset: https://codereview.chromium.org/1098843004/#ps480001 (title: "rebase")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1098843004/480001
The CQ bit was unchecked by commit-bot@chromium.org
Exceeded global retry quota
The CQ bit was checked by mpearson@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from pkasting@chromium.org Link to the patchset: https://codereview.chromium.org/1098843004/#ps500001 (title: "re-add string_split.h")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1098843004/500001
The CQ bit was unchecked by commit-bot@chromium.org
Exceeded global retry quota
The CQ bit was checked by mpearson@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from pkasting@chromium.org Link to the patchset: https://codereview.chromium.org/1098843004/#ps520001 (title: "fix browsertests compile error (sigh)")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1098843004/520001
Message was sent while issue was closed.
Committed patchset #27 (id:520001)
Message was sent while issue was closed.
Patchset 27 (id:??) landed as https://crrev.com/4923cab672e256801d4b9efb904fef4c5fbf6f15 Cr-Commit-Position: refs/heads/master@{#336885} |