|
|
Created:
9 years, 1 month ago by mrossetti Modified:
9 years ago CC:
chromium-reviews, brettw-cc_chromium.org, James Su, Paweł Hajdan Jr. Base URL:
svn://svn.chromium.org/chrome/trunk/src/ Visibility:
Public. |
DescriptionImprove Autocomplete Matches and Handling of Large Results Sets
Do not call FixupUserInput as it was prepending unexpected prefixes (such as file://) to the search string and bypassing valid results.
Move the search string decomposition operation from the HQP into the IMUI.
In the final substring filtering use whitespace delineated terms rather than words.
Instead of bailing if we get a large results set (>500) filter it down to 500 by sorting by typed-count/visit-count/last-visit.
This means it's no longer necessary to bypass the HQP if there is only one character in the search term so get rid of the ExpandedInMemoryURLIndexTest.ShortCircuit unit test.
BUG=101301, 103575
TEST=Added unit tests.
Committed: http://src.chromium.org/viewvc/chrome?view=rev&revision=112527
Patch Set 1 #Patch Set 2 : '' #Patch Set 3 : '' #Patch Set 4 : '' #Patch Set 5 : '' #
Total comments: 23
Patch Set 6 : '' #
Total comments: 2
Patch Set 7 : '' #Patch Set 8 : '' #
Total comments: 4
Messages
Total messages: 19 (0 generated)
Note that the filtering performed by HistoryItemFactorGreater (in in_memory_url_index.h) is quite simple-minded at this point. I would have added mpearson to this CL but it looks like he's not on chromium.org yet.
Adding georgey as reviewer in case Peter is too busy.
http://codereview.chromium.org/8526010/diff/21001/chrome/browser/autocomplete... File chrome/browser/autocomplete/history_quick_provider.cc (left): http://codereview.chromium.org/8526010/diff/21001/chrome/browser/autocomplete... chrome/browser/autocomplete/history_quick_provider.cc:64: if (!FixupUserInput(&autocomplete_input_)) I think removing this may be worse than keeping it. Without this, you're going to have trouble matching in URLs with escaped characters, punycode domain names, etc. Maybe you can give me some concrete examples of problems this causes and we can figure out how to fight them? http://codereview.chromium.org/8526010/diff/21001/chrome/browser/history/in_m... File chrome/browser/history/in_memory_url_index.cc (right): http://codereview.chromium.org/8526010/diff/21001/chrome/browser/history/in_m... chrome/browser/history/in_memory_url_index.cc:407: : history_info_map_(history_info_map) {} Nit: 4 spaces before colon. I generally prefer {} on separate lines unless the entire definition is all on one line. http://codereview.chromium.org/8526010/diff/21001/chrome/browser/history/in_m... chrome/browser/history/in_memory_url_index.cc:445: search_term_cache_.clear(); // Invalidate the term cache. Nit: Why is this line necessary? Wouldn't the cache already be empty? http://codereview.chromium.org/8526010/diff/21001/chrome/browser/history/in_m... chrome/browser/history/in_memory_url_index.cc:464: bool was_trimmed = false; Nit: Even shorter: bool was_trimmed = (pre_filter_item_count > kItemsToScoreLimit); if (was_trimmed) { ... http://codereview.chromium.org/8526010/diff/21001/chrome/browser/history/in_m... chrome/browser/history/in_memory_url_index.cc:478: history_ids.resize(kItemsToScoreLimit); Nit: You can eliminate this line and simply use "history_ids.begin() + kItemsToScoreLimit" in place of history_ids.end() in the copy() call below. http://codereview.chromium.org/8526010/diff/21001/chrome/browser/history/in_m... File chrome/browser/history/in_memory_url_index.h (right): http://codereview.chromium.org/8526010/diff/21001/chrome/browser/history/in_m... chrome/browser/history/in_memory_url_index.h:95: // Given a string16 in |term_string|, scans the history index and return a Nit: return -> returns http://codereview.chromium.org/8526010/diff/21001/chrome/browser/history/in_m... chrome/browser/history/in_memory_url_index.h:101: // terms, as separated by whitespace, occur withint the candidate's URL Nit: withint -> within http://codereview.chromium.org/8526010/diff/21001/chrome/browser/history/in_m... chrome/browser/history/in_memory_url_index.h:103: // |kItemsToScoreLimit| candidates (as the scoring of such a large number of Nit: instead of using parens, just add a comma before "as". http://codereview.chromium.org/8526010/diff/21001/chrome/browser/history/in_m... chrome/browser/history/in_memory_url_index.h:194: class HistoryItemFactorGreater : public std::binary_function<HistoryID, Nit: Probably slightly more readable to linebreak before the ":". http://codereview.chromium.org/8526010/diff/21001/chrome/browser/history/in_m... chrome/browser/history/in_memory_url_index.h:202: private: Nit: Blank line above this http://codereview.chromium.org/8526010/diff/21001/chrome/browser/history/in_m... File chrome/browser/history/in_memory_url_index_unittest.cc (right): http://codereview.chromium.org/8526010/diff/21001/chrome/browser/history/in_m... chrome/browser/history/in_memory_url_index_unittest.cc:228: url_index_->HistoryItemsForTerms(UTF8ToUTF16("DrudgeReport")); Nit: Can use ASCIIToUTF16() in most of these.
Issues addressed. Comments on removal of FixupUserInput provided. http://codereview.chromium.org/8526010/diff/21001/chrome/browser/autocomplete... File chrome/browser/autocomplete/history_quick_provider.cc (left): http://codereview.chromium.org/8526010/diff/21001/chrome/browser/autocomplete... chrome/browser/autocomplete/history_quick_provider.cc:64: if (!FixupUserInput(&autocomplete_input_)) Great! I'd appreciate the advice... When give a search string like '/r/' (as described in the bug), calling FixupUserInput results in the string 'file:\/\/\/r\/'. This, of course, causes the HQP to search for history items with the words 'file' and 'r' and then does a final filter for that exact string ('file:///r/'). It also sounds to me like the unit tests for this should be enhanced by adding URLs with escaped characters, etc. Could you suggest a few for completeness? Since the user input can be any combination of plain words and an URL it's important I get this right. On 2011/11/21 20:31:02, Peter Kasting wrote: > I think removing this may be worse than keeping it. Without this, you're going > to have trouble matching in URLs with escaped characters, punycode domain names, > etc. > > Maybe you can give me some concrete examples of problems this causes and we can > figure out how to fight them? http://codereview.chromium.org/8526010/diff/21001/chrome/browser/history/in_m... File chrome/browser/history/in_memory_url_index.cc (right): http://codereview.chromium.org/8526010/diff/21001/chrome/browser/history/in_m... chrome/browser/history/in_memory_url_index.cc:407: : history_info_map_(history_info_map) {} On 2011/11/21 20:31:02, Peter Kasting wrote: > Nit: 4 spaces before colon. I generally prefer {} on separate lines unless the > entire definition is all on one line. Done. http://codereview.chromium.org/8526010/diff/21001/chrome/browser/history/in_m... chrome/browser/history/in_memory_url_index.cc:445: search_term_cache_.clear(); // Invalidate the term cache. On 2011/11/21 20:31:02, Peter Kasting wrote: > Nit: Why is this line necessary? Wouldn't the cache already be empty? I updated the comment since it did not mention that the input search string might not have any words. http://codereview.chromium.org/8526010/diff/21001/chrome/browser/history/in_m... chrome/browser/history/in_memory_url_index.cc:464: bool was_trimmed = false; On 2011/11/21 20:31:02, Peter Kasting wrote: > Nit: Even shorter: > > bool was_trimmed = (pre_filter_item_count > kItemsToScoreLimit); > if (was_trimmed) { ... Done. http://codereview.chromium.org/8526010/diff/21001/chrome/browser/history/in_m... chrome/browser/history/in_memory_url_index.cc:478: history_ids.resize(kItemsToScoreLimit); On 2011/11/21 20:31:02, Peter Kasting wrote: > Nit: You can eliminate this line and simply use "history_ids.begin() + > kItemsToScoreLimit" in place of history_ids.end() in the copy() call below. Delicious! http://codereview.chromium.org/8526010/diff/21001/chrome/browser/history/in_m... File chrome/browser/history/in_memory_url_index.h (right): http://codereview.chromium.org/8526010/diff/21001/chrome/browser/history/in_m... chrome/browser/history/in_memory_url_index.h:95: // Given a string16 in |term_string|, scans the history index and return a On 2011/11/21 20:31:02, Peter Kasting wrote: > Nit: return -> returns Done. http://codereview.chromium.org/8526010/diff/21001/chrome/browser/history/in_m... chrome/browser/history/in_memory_url_index.h:101: // terms, as separated by whitespace, occur withint the candidate's URL On 2011/11/21 20:31:02, Peter Kasting wrote: > Nit: withint -> within Done. http://codereview.chromium.org/8526010/diff/21001/chrome/browser/history/in_m... chrome/browser/history/in_memory_url_index.h:103: // |kItemsToScoreLimit| candidates (as the scoring of such a large number of On 2011/11/21 20:31:02, Peter Kasting wrote: > Nit: instead of using parens, just add a comma before "as". Done. http://codereview.chromium.org/8526010/diff/21001/chrome/browser/history/in_m... chrome/browser/history/in_memory_url_index.h:194: class HistoryItemFactorGreater : public std::binary_function<HistoryID, On 2011/11/21 20:31:02, Peter Kasting wrote: > Nit: Probably slightly more readable to linebreak before the ":". Done. http://codereview.chromium.org/8526010/diff/21001/chrome/browser/history/in_m... chrome/browser/history/in_memory_url_index.h:202: private: On 2011/11/21 20:31:02, Peter Kasting wrote: > Nit: Blank line above this Done. http://codereview.chromium.org/8526010/diff/21001/chrome/browser/history/in_m... File chrome/browser/history/in_memory_url_index_unittest.cc (right): http://codereview.chromium.org/8526010/diff/21001/chrome/browser/history/in_m... chrome/browser/history/in_memory_url_index_unittest.cc:228: url_index_->HistoryItemsForTerms(UTF8ToUTF16("DrudgeReport")); On 2011/11/21 20:31:02, Peter Kasting wrote: > Nit: Can use ASCIIToUTF16() in most of these. Ah! Why did I forget that??? Thanks.
http://codereview.chromium.org/8526010/diff/21001/chrome/browser/autocomplete... File chrome/browser/autocomplete/history_quick_provider.cc (left): http://codereview.chromium.org/8526010/diff/21001/chrome/browser/autocomplete... chrome/browser/autocomplete/history_quick_provider.cc:64: if (!FixupUserInput(&autocomplete_input_)) On 2011/11/21 21:38:25, mrossetti wrote: > When give a search string like '/r/' (as described in the bug), calling > FixupUserInput results in the string 'file:\/\/\/r\/'. This, of course, causes > the HQP to search for history items with the words 'file' and 'r' and then does > a final filter for that exact string ('file:///r/'). > > It also sounds to me like the unit tests for this should be enhanced by adding > URLs with escaped characters, etc. Could you suggest a few for completeness? > > Since the user input can be any combination of plain words and an URL it's > important I get this right. Hmm. I didn't think about the fact that we could have input where it's not clear which tokens go together and which don't. If my original URL is "http://who/peter%20kasting", for example, does "peter kasting" mean "peter kasting" or "peter%20kasting"? We can't know. And if we have something like "frank google.com\mail" then as humans it's probably obvious that we should treat "google.com\mail" as a URL and fix up the '\' to be a '/', but I don't know how to code that. So maybe you're right and this should go away. My biggest worry is probably unicode domain names. I can't remember how the history backend stores those, but if it stores them in punycode, then users won't be able to type them in and get any HQP matches at all. (I think similar issues made the HUP less effective than it ideally would be and I wanted to change how the backend worked, but it was too big a task or something.) I don't know how to address this perfectly. My only idea here blows up matching time and thus is probably not terribly feasible: * For each whitespace-delimited token, try to fix it up somehow. Maybe FixupUserInput(), maybe something that does less? * If the fixed up version is different than the original, then allow matching either version The fixup stage here we want to do things like convert unicode hostnames, convert backslashes to forward slashes, and escape characters in URL paths, but we don't need it to add schemes, convert numbers to dotted quads, etc. A lot of this FixupUserInput() goes back and reverses after the fact (like the dotted quads thing) but perhaps there's a different way to go...
http://codereview.chromium.org/8526010/diff/24001/chrome/browser/history/in_m... File chrome/browser/history/in_memory_url_index.cc (right): http://codereview.chromium.org/8526010/diff/24001/chrome/browser/history/in_m... chrome/browser/history/in_memory_url_index.cc:454: HistoryIDSet history_id_set = HistoryIDSetFromWords(words); I wonder how fast it would work on lowest common denominator - the Atom netbook (such as Chromebook). I haven't cleared my history in two years on my desktop - I am sure people do not do it on netbooks also :)
On 2011/11/21 22:08:59, Peter Kasting wrote: > http://codereview.chromium.org/8526010/diff/21001/chrome/browser/autocomplete... > File chrome/browser/autocomplete/history_quick_provider.cc (left): > > http://codereview.chromium.org/8526010/diff/21001/chrome/browser/autocomplete... > chrome/browser/autocomplete/history_quick_provider.cc:64: if > (!FixupUserInput(&autocomplete_input_)) > On 2011/11/21 21:38:25, mrossetti wrote: > > When give a search string like '/r/' (as described in the bug), calling > > FixupUserInput results in the string 'file:\/\/\/r\/'. This, of course, causes > > the HQP to search for history items with the words 'file' and 'r' and then > does > > a final filter for that exact string ('file:///r/'). > > > > It also sounds to me like the unit tests for this should be enhanced by adding > > URLs with escaped characters, etc. Could you suggest a few for completeness? > > > > Since the user input can be any combination of plain words and an URL it's > > important I get this right. > > Hmm. I didn't think about the fact that we could have input where it's not > clear which tokens go together and which don't. If my original URL is > "http://who/peter%20kasting", for example, does "peter kasting" mean "peter > kasting" or "peter%20kasting"? We can't know. And if we have something like > "frank google.com\mail" then as humans it's probably obvious that we should > treat "google.com\mail" as a URL and fix up the '\' to be a '/', but I don't > know how to code that. > > So maybe you're right and this should go away. > > My biggest worry is probably unicode domain names. I can't remember how the > history backend stores those, but if it stores them in punycode, then users > won't be able to type them in and get any HQP matches at all. (I think similar > issues made the HUP less effective than it ideally would be and I wanted to > change how the backend worked, but it was too big a task or something.) I don't > know how to address this perfectly. > > My only idea here blows up matching time and thus is probably not terribly > feasible: > * For each whitespace-delimited token, try to fix it up somehow. Maybe > FixupUserInput(), maybe something that does less? > * If the fixed up version is different than the original, then allow matching > either version > > The fixup stage here we want to do things like convert unicode hostnames, > convert backslashes to forward slashes, and escape characters in URL paths, but > we don't need it to add schemes, convert numbers to dotted quads, etc. A lot of > this FixupUserInput() goes back and reverses after the fact (like the dotted > quads thing) but perhaps there's a different way to go... If it's okay with you, Peter, I'd like to go with omitting the FixupUserInput() and stick with this implementation. As is, this is a dramatic improvement over the current situation and my testing (unit tests and empirical) shows that good results are generated for URLs with encodings. For example, the example you give above, "http://who/peter%20kasting", is well-scored with search strings such as "http://who/peter%20kasting", "pet kas", "peter%20", etc. I've asked Jungshik to provide me with some samples of complex unicode URLs and page titles as well as user search strings so that I can enhance the unit tests. I've created bug 105058 to track this enhancement.
On 2011/11/22 00:07:43, mrossetti wrote: > If it's okay with you, Peter, I'd like to go with omitting the FixupUserInput() > and stick with this implementation. Yeah, that should be OK. > As is, this is a dramatic improvement over > the current situation and my testing (unit tests and empirical) shows that good > results are generated for URLs with encodings. For example, the example you give > above, "http://who/peter%20kasting", is well-scored with search strings such as > "http://who/peter%20kasting", "pet kas", "peter%20", etc. Sure, but is it also scored well for "peter kasting"? More interestingly, how about cases with encoded parens, pluses, etc.? And of course the Unicode hostname case.
On 2011/11/22 02:14:07, Peter Kasting wrote: > On 2011/11/22 00:07:43, mrossetti wrote: > > "http://who/peter%20kasting", "pet kas", "peter%20", etc. > > Sure, but is it also scored well for "peter kasting"? "peter kasting" scores nice and high. > More interestingly, how about cases with encoded parens, pluses, etc.? And of > course the Unicode hostname case. Not so well. The indexer is getting and using punycode while the search is using the raw unicode. I believe the indexer ought to be using the unicode but I would be grateful for your opinion before I start hacking away.
http://codereview.chromium.org/8526010/diff/24001/chrome/browser/history/in_m... File chrome/browser/history/in_memory_url_index.cc (right): http://codereview.chromium.org/8526010/diff/24001/chrome/browser/history/in_m... chrome/browser/history/in_memory_url_index.cc:454: HistoryIDSet history_id_set = HistoryIDSetFromWords(words); On 2011/11/21 22:52:03, GeorgeY wrote: > I wonder how fast it would work on lowest common denominator - the Atom netbook > (such as Chromebook). I haven't cleared my history in two years on my desktop - > I am sure people do not do it on netbooks also :) This should be very fast as the dataset underlying is pre-constructed from the history database once and is kept up-to-date after that. In the early days, I measured this and the set of words and characters which has to be permuted is only in the thousands, even with huge history databases.
On 2011/11/22 23:29:02, mrossetti wrote: > On 2011/11/22 02:14:07, Peter Kasting wrote: > > More interestingly, how about cases with encoded parens, pluses, etc.? And of > > course the Unicode hostname case. > > Not so well. The indexer is getting and using punycode while the search is using > the raw unicode. I believe the indexer ought to be using the unicode but I would > be grateful for your opinion before I start hacking away. My original thought, which I may have already mentioned, was that punycode should only ever be used in the "displayed to user" URL, in cases where we don't believe Unicode display is safe (e.g. possible phishing), which we already have rules for. I had wanted the entire rest of Chrome, the history system, etc. to use Unicode for all storage, matching, etc. because let's face it, no one actually remembers and tries to work directly with punycode URLs. I don't know if that philosophy is instructive. I'm pretty sure it's not what the history code actually does. Migrating the backend to do it might be annoying.
The history database stores punycode (which is relevant for when the IMUI is being rebuilt). URLs are already in punycode by the time the providers get it from the omnibox (when the history database is updated). I'm tracking the update process backwards to see if there's a convenient way to pass along a non-punycodized URL. When rebuilding the IMUI from the history database I'm afraid the choices are not good. One possibility is to have the IMUI convert back from punycode to Unicode when rebuilding. Another possibility is to record the original Unicode into the history database and perform a migration for older punycoded databases. On Wed, Nov 23, 2011 at 11:47 AM, <pkasting@chromium.org> wrote: > On 2011/11/22 23:29:02, mrossetti wrote: > >> On 2011/11/22 02:14:07, Peter Kasting wrote: >> > More interestingly, how about cases with encoded parens, pluses, etc.? >> And >> > of > >> > course the Unicode hostname case. >> > > Not so well. The indexer is getting and using punycode while the search is >> > using > >> the raw unicode. I believe the indexer ought to be using the unicode but I >> > would > >> be grateful for your opinion before I start hacking away. >> > > My original thought, which I may have already mentioned, was that punycode > should only ever be used in the "displayed to user" URL, in cases where we > don't > believe Unicode display is safe (e.g. possible phishing), which we already > have > rules for. I had wanted the entire rest of Chrome, the history system, > etc. to > use Unicode for all storage, matching, etc. because let's face it, no one > actually remembers and tries to work directly with punycode URLs. > > I don't know if that philosophy is instructive. I'm pretty sure it's not > what > the history code actually does. Migrating the backend to do it might be > annoying. > > http://codereview.chromium.**org/8526010/<http://codereview.chromium.org/8526... >
On 2011/11/23 23:18:21, mrossetti wrote: > Another possibility is to record the > original Unicode into the history database and perform a migration for > older punycoded databases. Yeah, that's obviously what I'd prefer. One question is what we'd do for unicode URLs that are displayed *to the user* in punycode (as opposed to ones where the user always sees Unicode anyway). If we store these as Unicode, we could get surprising (and possible dangerous from a phishing perspective?) matches against them where users aren't expecting. If we store as punycode, then if the user changes their accept-languages values, we have to figure out what, if anything, to do.
On 2011/11/23 23:22:12, Peter Kasting wrote: > On 2011/11/23 23:18:21, mrossetti wrote: > > Another possibility is to record the > > original Unicode into the history database and perform a migration for > > older punycoded databases. > > Yeah, that's obviously what I'd prefer. One question is what we'd do for > unicode URLs that are displayed *to the user* in punycode (as opposed to ones > where the user always sees Unicode anyway). If we store these as Unicode, we > could get surprising (and possible dangerous from a phishing perspective?) > matches against them where users aren't expecting. If we store as punycode, > then if the user changes their accept-languages values, we have to figure out > what, if anything, to do. Peter, do you consider it essential that the unicode issue be resolved as part of this CL? The unicode/punycode issue is already a problem and this CL doesn't make it any worse. Would it be okay with you if I wrote up a bug to track the unicoded/punycoded host challenge so that it can be dealt with separately?
On 2011/11/29 21:56:45, mrossetti wrote: > Would it be okay with you if I wrote up a bug to track the > unicoded/punycoded host challenge so that it can be dealt with separately? That's fine.
This is ready for a final review then.
http://codereview.chromium.org/8526010/diff/37003/chrome/browser/history/in_m... File chrome/browser/history/in_memory_url_index_unittest.cc (right): http://codereview.chromium.org/8526010/diff/37003/chrome/browser/history/in_m... chrome/browser/history/in_memory_url_index_unittest.cc:272: url_index_->Init(this, "en,ja,hi,zh"); Please fix indentation http://codereview.chromium.org/8526010/diff/37003/chrome/browser/history/in_m... chrome/browser/history/in_memory_url_index_unittest.cc:278: ScoredHistoryMatches matches = Please fix indentation
Thanks! I had to look at that three times before I saw the extra spaces! Can I get an LGTM? Then I only need to get one from Peter. http://codereview.chromium.org/8526010/diff/37003/chrome/browser/history/in_m... File chrome/browser/history/in_memory_url_index_unittest.cc (right): http://codereview.chromium.org/8526010/diff/37003/chrome/browser/history/in_m... chrome/browser/history/in_memory_url_index_unittest.cc:272: url_index_->Init(this, "en,ja,hi,zh"); On 2011/11/30 22:21:41, GeorgeY wrote: > Please fix indentation Done. http://codereview.chromium.org/8526010/diff/37003/chrome/browser/history/in_m... chrome/browser/history/in_memory_url_index_unittest.cc:278: ScoredHistoryMatches matches = On 2011/11/30 22:21:41, GeorgeY wrote: > Please fix indentation Done.
LGTM |