Issue 8526010: Improve Autocomplete Matches and Handling of Large Results Sets (Closed)

Created:
9 years, 1 month ago by mrossetti

Modified:
9 years ago

Reviewers:
GeorgeY, Peter Kasting

CC:
chromium-reviews, brettw-cc_chromium.org, James Su, Paweł Hajdan Jr.

Base URL:
svn://svn.chromium.org/chrome/trunk/src/

Visibility:
Public.

Description

Improve Autocomplete Matches and Handling of Large Results Sets

Do not call FixupUserInput, as it was prepending unexpected prefixes (such as file://) to the search string and bypassing valid results.

Move the search string decomposition operation from the HQP into the IMUI.

In the final substring filtering, use whitespace-delimited terms rather than words.

Instead of bailing if we get a large results set (>500), filter it down to 500 by sorting by typed-count/visit-count/last-visit. This means it's no longer necessary to bypass the HQP if there is only one character in the search term, so get rid of the ExpandedInMemoryURLIndexTest.ShortCircuit unit test.

BUG=101301, 103575
TEST=Added unit tests.

Committed: http://src.chromium.org/viewvc/chrome?view=rev&revision=112527
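The trimming step described above can be sketched roughly as follows. This is only an illustration, not the code from the patch: `HistoryItem`, `ItemFactorGreater`, and `TrimToLimit` are hypothetical names, and the real change sorts history IDs with a `HistoryItemFactorGreater` functor whose exact ordering is not shown on this page. The sketch assumes the priority is typed count, then visit count, then most recent visit.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a history row; field names are illustrative only.
struct HistoryItem {
  int id;
  int typed_count;
  int visit_count;
  std::int64_t last_visit;  // Larger value means a more recent visit.
};

// Assumed ordering: higher typed count first, then higher visit count,
// then more recent last visit.
inline bool ItemFactorGreater(const HistoryItem& a, const HistoryItem& b) {
  if (a.typed_count != b.typed_count)
    return a.typed_count > b.typed_count;
  if (a.visit_count != b.visit_count)
    return a.visit_count > b.visit_count;
  return a.last_visit > b.last_visit;
}

// Keeps at most |limit| of the best candidates instead of bailing out.
// Returns true if the candidate list had to be trimmed.
inline bool TrimToLimit(std::vector<HistoryItem>* items, std::size_t limit) {
  bool was_trimmed = (items->size() > limit);
  if (was_trimmed) {
    // Only the first |limit| positions need to end up sorted.
    std::partial_sort(items->begin(),
                      items->begin() + static_cast<std::ptrdiff_t>(limit),
                      items->end(), ItemFactorGreater);
    items->resize(limit);
  }
  return was_trimmed;
}
```

Using `std::partial_sort` rather than a full sort is a choice made for this sketch: only the top `limit` entries are needed, so the remainder can stay unordered before being dropped.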

Patch Set 1 #

Patch Set 2 : '' #

Patch Set 3 : '' #

Patch Set 4 : '' #

Patch Set 5 : '' #

Total comments: 23

Patch Set 6 : '' #

Total comments: 2

Patch Set 7 : '' #

Patch Set 8 : '' #

Total comments: 4

Created: 9 years ago

Stats (+220 lines, -176 lines)

M	chrome/browser/autocomplete/history_quick_provider.cc	2 chunks	+1 line, -10 lines	0 comments
M	chrome/browser/history/in_memory_url_index.h	5 chunks	+39 lines, -16 lines	0 comments
M	chrome/browser/history/in_memory_url_index.cc	5 chunks	+108 lines, -49 lines	0 comments
M	chrome/browser/history/in_memory_url_index_types.h	1 chunk	+1 line, -0 lines	0 comments
M	chrome/browser/history/in_memory_url_index_unittest.cc	13 chunks	+71 lines, -101 lines	4 comments

Messages

Total messages: 19 (0 generated)

mrossetti

Note that the filtering performed by HistoryItemFactorGreater (in in_memory_url_index.h) is quite simple-minded at this point. ...

9 years, 1 month ago (2011-11-10 21:45:26 UTC) #1

Peter Kasting

http://codereview.chromium.org/8526010/diff/21001/chrome/browser/autocomplete/history_quick_provider.cc File chrome/browser/autocomplete/history_quick_provider.cc (left): http://codereview.chromium.org/8526010/diff/21001/chrome/browser/autocomplete/history_quick_provider.cc#oldcode64 chrome/browser/autocomplete/history_quick_provider.cc:64: if (!FixupUserInput(&autocomplete_input_)) I think removing this may be worse ...

9 years, 1 month ago (2011-11-21 20:31:02 UTC) #3

mrossetti

9 years, 1 month ago (2011-11-21 21:38:25 UTC) #4

Issues addressed. Comments on removal of FixupUserInput provided.

http://codereview.chromium.org/8526010/diff/21001/chrome/browser/autocomplete...
File chrome/browser/autocomplete/history_quick_provider.cc (left):

http://codereview.chromium.org/8526010/diff/21001/chrome/browser/autocomplete...
chrome/browser/autocomplete/history_quick_provider.cc:64: if (!FixupUserInput(&autocomplete_input_))
Great! I'd appreciate the advice...

When given a search string like '/r/' (as described in the bug), calling
FixupUserInput results in the string 'file:\/\/\/r\/'. This, of course, causes
the HQP to search for history items with the words 'file' and 'r' and then does
a final filter for that exact string ('file:///r/').

It also sounds to me like the unit tests for this should be enhanced by adding
URLs with escaped characters, etc. Could you suggest a few for completeness?

Since the user input can be any combination of plain words and a URL, it's
important I get this right.

On 2011/11/21 20:31:02, Peter Kasting wrote:
> I think removing this may be worse than keeping it.  Without this, you're going
> to have trouble matching in URLs with escaped characters, punycode domain names,
> etc.
> 
> Maybe you can give me some concrete examples of problems this causes and we can
> figure out how to fight them?

http://codereview.chromium.org/8526010/diff/21001/chrome/browser/history/in_m...
File chrome/browser/history/in_memory_url_index.cc (right):

http://codereview.chromium.org/8526010/diff/21001/chrome/browser/history/in_m...
chrome/browser/history/in_memory_url_index.cc:407: :
history_info_map_(history_info_map) {}
On 2011/11/21 20:31:02, Peter Kasting wrote:
> Nit: 4 spaces before colon.  I generally prefer {} on separate lines unless the
> entire definition is all on one line.

Done.

http://codereview.chromium.org/8526010/diff/21001/chrome/browser/history/in_m...
chrome/browser/history/in_memory_url_index.cc:445: search_term_cache_.clear();  // Invalidate the term cache.
On 2011/11/21 20:31:02, Peter Kasting wrote:
> Nit: Why is this line necessary?  Wouldn't the cache already be empty?

I updated the comment since it did not mention that the input search string
might not have any words.

http://codereview.chromium.org/8526010/diff/21001/chrome/browser/history/in_m...
chrome/browser/history/in_memory_url_index.cc:464: bool was_trimmed = false;
On 2011/11/21 20:31:02, Peter Kasting wrote:
> Nit: Even shorter:
> 
>   bool was_trimmed = (pre_filter_item_count > kItemsToScoreLimit);
>   if (was_trimmed) { ...

Done.

http://codereview.chromium.org/8526010/diff/21001/chrome/browser/history/in_m...
chrome/browser/history/in_memory_url_index.cc:478:
history_ids.resize(kItemsToScoreLimit);
On 2011/11/21 20:31:02, Peter Kasting wrote:
> Nit: You can eliminate this line and simply use "history_ids.begin() +
> kItemsToScoreLimit" in place of history_ids.end() in the copy() call below.

Delicious!

http://codereview.chromium.org/8526010/diff/21001/chrome/browser/history/in_m...
File chrome/browser/history/in_memory_url_index.h (right):

http://codereview.chromium.org/8526010/diff/21001/chrome/browser/history/in_m...
chrome/browser/history/in_memory_url_index.h:95: // Given a string16 in
|term_string|, scans the history index and return a
On 2011/11/21 20:31:02, Peter Kasting wrote:
> Nit: return -> returns

Done.

http://codereview.chromium.org/8526010/diff/21001/chrome/browser/history/in_m...
chrome/browser/history/in_memory_url_index.h:101: // terms, as separated by
whitespace, occur withint the candidate's URL
On 2011/11/21 20:31:02, Peter Kasting wrote:
> Nit: withint -> within

Done.

http://codereview.chromium.org/8526010/diff/21001/chrome/browser/history/in_m...
chrome/browser/history/in_memory_url_index.h:103: // |kItemsToScoreLimit|
candidates (as the scoring of such a large number of
On 2011/11/21 20:31:02, Peter Kasting wrote:
> Nit: instead of using parens, just add a comma before "as".

Done.

http://codereview.chromium.org/8526010/diff/21001/chrome/browser/history/in_m...
chrome/browser/history/in_memory_url_index.h:194: class HistoryItemFactorGreater
: public std::binary_function<HistoryID,
On 2011/11/21 20:31:02, Peter Kasting wrote:
> Nit: Probably slightly more readable to linebreak before the ":".

Done.

http://codereview.chromium.org/8526010/diff/21001/chrome/browser/history/in_m...
chrome/browser/history/in_memory_url_index.h:202: private:
On 2011/11/21 20:31:02, Peter Kasting wrote:
> Nit: Blank line above this

Done.

http://codereview.chromium.org/8526010/diff/21001/chrome/browser/history/in_m...
File chrome/browser/history/in_memory_url_index_unittest.cc (right):

http://codereview.chromium.org/8526010/diff/21001/chrome/browser/history/in_m...
chrome/browser/history/in_memory_url_index_unittest.cc:228:
url_index_->HistoryItemsForTerms(UTF8ToUTF16("DrudgeReport"));
On 2011/11/21 20:31:02, Peter Kasting wrote:
> Nit: Can use ASCIIToUTF16() in most of these.

Ah! Why did I forget that??? Thanks.

Peter Kasting

9 years, 1 month ago (2011-11-21 22:08:59 UTC) #5

http://codereview.chromium.org/8526010/diff/21001/chrome/browser/autocomplete...
File chrome/browser/autocomplete/history_quick_provider.cc (left):

http://codereview.chromium.org/8526010/diff/21001/chrome/browser/autocomplete...
chrome/browser/autocomplete/history_quick_provider.cc:64: if (!FixupUserInput(&autocomplete_input_))
On 2011/11/21 21:38:25, mrossetti wrote:
> When given a search string like '/r/' (as described in the bug), calling
> FixupUserInput results in the string 'file:\/\/\/r\/'. This, of course, causes
> the HQP to search for history items with the words 'file' and 'r' and then does
> a final filter for that exact string ('file:///r/').
> 
> It also sounds to me like the unit tests for this should be enhanced by adding
> URLs with escaped characters, etc. Could you suggest a few for completeness?
> 
> Since the user input can be any combination of plain words and a URL, it's
> important I get this right.

Hmm.  I didn't think about the fact that we could have input where it's not
clear which tokens go together and which don't.  If my original URL is
"http://who/peter%20kasting", for example, does "peter kasting" mean "peter
kasting" or "peter%20kasting"?  We can't know.  And if we have something like
"frank google.com\mail" then as humans it's probably obvious that we should
treat "google.com\mail" as a URL and fix up the '\' to be a '/', but I don't
know how to code that.

So maybe you're right and this should go away.

My biggest worry is probably unicode domain names.  I can't remember how the
history backend stores those, but if it stores them in punycode, then users
won't be able to type them in and get any HQP matches at all.  (I think similar
issues made the HUP less effective than it ideally would be and I wanted to
change how the backend worked, but it was too big a task or something.)  I don't
know how to address this perfectly.

My only idea here blows up matching time and thus is probably not terribly
feasible:
  * For each whitespace-delimited token, try to fix it up somehow.  Maybe
FixupUserInput(), maybe something that does less?
  * If the fixed up version is different than the original, then allow matching
either version

The fixup stage here we want to do things like convert unicode hostnames,
convert backslashes to forward slashes, and escape characters in URL paths, but
we don't need it to add schemes, convert numbers to dotted quads, etc.  A lot of
this FixupUserInput() goes back and reverses after the fact (like the dotted
quads thing) but perhaps there's a different way to go...
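The per-token idea above could look roughly like the sketch below. It is not code from this patch, and it does not use the real FixupUserInput(): `LightFixup` is a hypothetical, deliberately minimal stand-in that only converts backslashes to forward slashes, one of the fixups mentioned above. Hostname conversion and path escaping are left out, and all function names here are assumptions made for illustration.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical minimal fixup: only converts '\' to '/'. It adds no scheme and
// performs no hostname or escaping conversion.
inline std::string LightFixup(std::string token) {
  std::replace(token.begin(), token.end(), '\\', '/');
  return token;
}

// For each whitespace-delimited token, returns the spellings that should be
// accepted: the original, plus the fixed-up form if it differs.
inline std::vector<std::vector<std::string> > TokenVariants(
    const std::string& input) {
  std::vector<std::vector<std::string> > variants;
  std::istringstream stream(input);
  std::string token;
  while (stream >> token) {
    std::vector<std::string> forms(1, token);
    const std::string fixed = LightFixup(token);
    if (fixed != token)
      forms.push_back(fixed);
    variants.push_back(forms);
  }
  return variants;
}

// True if every token occurs in |url| in at least one of its spellings.
inline bool MatchesAllTokens(const std::string& url, const std::string& input) {
  const std::vector<std::vector<std::string> > variants = TokenVariants(input);
  for (const std::vector<std::string>& forms : variants) {
    bool matched = false;
    for (const std::string& form : forms) {
      if (url.find(form) != std::string::npos)
        matched = true;
    }
    if (!matched)
      return false;
  }
  return true;
}
```

With this sketch, the input "frank google.com\mail" would match a URL containing both "frank" and "google.com/mail". Each token with a differing fixed-up form doubles the number of substring checks for that token, which is the matching-time cost the message above worries about.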

GeorgeY

http://codereview.chromium.org/8526010/diff/24001/chrome/browser/history/in_memory_url_index.cc File chrome/browser/history/in_memory_url_index.cc (right): http://codereview.chromium.org/8526010/diff/24001/chrome/browser/history/in_memory_url_index.cc#newcode454 chrome/browser/history/in_memory_url_index.cc:454: HistoryIDSet history_id_set = HistoryIDSetFromWords(words); I wonder how fast it ...

9 years, 1 month ago (2011-11-21 22:52:03 UTC) #6

mrossetti

9 years, 1 month ago (2011-11-22 00:07:43 UTC) #7

On 2011/11/21 22:08:59, Peter Kasting wrote:
> http://codereview.chromium.org/8526010/diff/21001/chrome/browser/autocomplete...
> File chrome/browser/autocomplete/history_quick_provider.cc (left):
> 
> http://codereview.chromium.org/8526010/diff/21001/chrome/browser/autocomplete...
> chrome/browser/autocomplete/history_quick_provider.cc:64: if (!FixupUserInput(&autocomplete_input_))
> On 2011/11/21 21:38:25, mrossetti wrote:
> > When given a search string like '/r/' (as described in the bug), calling
> > FixupUserInput results in the string 'file:\/\/\/r\/'. This, of course, causes
> > the HQP to search for history items with the words 'file' and 'r' and then does
> > a final filter for that exact string ('file:///r/').
> > 
> > It also sounds to me like the unit tests for this should be enhanced by adding
> > URLs with escaped characters, etc. Could you suggest a few for completeness?
> > 
> > Since the user input can be any combination of plain words and a URL, it's
> > important I get this right.
> 
> Hmm.  I didn't think about the fact that we could have input where it's not
> clear which tokens go together and which don't.  If my original URL is
> "http://who/peter%20kasting", for example, does "peter kasting" mean "peter
> kasting" or "peter%20kasting"?  We can't know.  And if we have something like
> "frank google.com\mail" then as humans it's probably obvious that we should
> treat "google.com\mail" as a URL and fix up the '\' to be a '/', but I don't
> know how to code that.
> 
> So maybe you're right and this should go away.
> 
> My biggest worry is probably unicode domain names.  I can't remember how the
> history backend stores those, but if it stores them in punycode, then users
> won't be able to type them in and get any HQP matches at all.  (I think similar
> issues made the HUP less effective than it ideally would be and I wanted to
> change how the backend worked, but it was too big a task or something.)  I don't
> know how to address this perfectly.
> 
> My only idea here blows up matching time and thus is probably not terribly
> feasible:
>   * For each whitespace-delimited token, try to fix it up somehow.  Maybe
> FixupUserInput(), maybe something that does less?
>   * If the fixed up version is different than the original, then allow matching
> either version
> 
> The fixup stage here we want to do things like convert unicode hostnames,
> convert backslashes to forward slashes, and escape characters in URL paths, but
> we don't need it to add schemes, convert numbers to dotted quads, etc.  A lot of
> this FixupUserInput() goes back and reverses after the fact (like the dotted
> quads thing) but perhaps there's a different way to go...

If it's okay with you, Peter, I'd like to go with omitting the FixupUserInput()
and stick with this implementation. As is, this is a dramatic improvement over
the current situation and my testing (unit tests and empirical) shows that good
results are generated for URLs with encodings. For example, the example you give
above, "http://who/peter%20kasting", is well-scored with search strings such as
"http://who/peter%20kasting", "pet kas", "peter%20", etc.

I've asked Jungshik to provide me with some samples of complex unicode URLs and
page titles as well as user search strings so that I can enhance the unit tests.
I've created bug 105058 to track this enhancement.
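To illustrate why filtering on whitespace-delimited terms rather than words helps with inputs such as "peter%20", here is a small sketch. It is an assumption-laden illustration, not the IMUI code: `SplitTerms`, `SplitWords`, and `AllTermsOccur` are hypothetical names, `SplitWords` is a crude ASCII approximation of word breaking, and the real index scores matches rather than returning a plain yes or no.

```cpp
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Splits |input| into terms on whitespace only, so "peter%20" stays whole.
inline std::vector<std::string> SplitTerms(const std::string& input) {
  std::vector<std::string> terms;
  std::istringstream stream(input);
  std::string term;
  while (stream >> term)
    terms.push_back(term);
  return terms;
}

// Crude ASCII approximation of word breaking: any non-alphanumeric character
// separates words, so "peter%20" becomes "peter" and "20".
inline std::vector<std::string> SplitWords(const std::string& input) {
  std::vector<std::string> words;
  std::string current;
  for (char c : input) {
    if (std::isalnum(static_cast<unsigned char>(c))) {
      current.push_back(c);
    } else if (!current.empty()) {
      words.push_back(current);
      current.clear();
    }
  }
  if (!current.empty())
    words.push_back(current);
  return words;
}

// Final substring filter: every whitespace-delimited term must occur
// somewhere in |url|. An empty search string matches nothing.
inline bool AllTermsOccur(const std::string& url, const std::string& input) {
  const std::vector<std::string> terms = SplitTerms(input);
  for (const std::string& term : terms) {
    if (url.find(term) == std::string::npos)
      return false;
  }
  return !terms.empty();
}
```

Under these assumptions, the search strings "pet kas" and "peter%20" both pass the filter for "http://who/peter%20kasting", which is consistent with the results reported above.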

Peter Kasting

On 2011/11/22 00:07:43, mrossetti wrote: > If it's okay with you, Peter, I'd like to ...

9 years, 1 month ago (2011-11-22 02:14:07 UTC) #8

mrossetti

On 2011/11/22 02:14:07, Peter Kasting wrote: > On 2011/11/22 00:07:43, mrossetti wrote: > > "http://who/peter%20kasting", ...

9 years, 1 month ago (2011-11-22 23:29:02 UTC) #9

mrossetti

9 years, 1 month ago (2011-11-22 23:29:12 UTC) #10

Peter Kasting

On 2011/11/22 23:29:02, mrossetti wrote: > On 2011/11/22 02:14:07, Peter Kasting wrote: > > More ...

9 years, 1 month ago (2011-11-23 18:47:34 UTC) #11

mrossetti

The history database stores punycode (which is relevant for when the IMUI is being rebuilt). ...

9 years, 1 month ago (2011-11-23 23:18:21 UTC) #12

Peter Kasting

On 2011/11/23 23:18:21, mrossetti wrote: > Another possibility is to record the > original Unicode ...

9 years, 1 month ago (2011-11-23 23:22:12 UTC) #13

mrossetti

On 2011/11/23 23:22:12, Peter Kasting wrote: > On 2011/11/23 23:18:21, mrossetti wrote: > > Another ...

9 years ago (2011-11-29 21:56:45 UTC) #14

Peter Kasting

On 2011/11/29 21:56:45, mrossetti wrote: > Would it be okay with you if I wrote ...

9 years ago (2011-11-29 21:58:04 UTC) #15

GeorgeY

http://codereview.chromium.org/8526010/diff/37003/chrome/browser/history/in_memory_url_index_unittest.cc File chrome/browser/history/in_memory_url_index_unittest.cc (right): http://codereview.chromium.org/8526010/diff/37003/chrome/browser/history/in_memory_url_index_unittest.cc#newcode272 chrome/browser/history/in_memory_url_index_unittest.cc:272: url_index_->Init(this, "en,ja,hi,zh"); Please fix indentation http://codereview.chromium.org/8526010/diff/37003/chrome/browser/history/in_memory_url_index_unittest.cc#newcode278 chrome/browser/history/in_memory_url_index_unittest.cc:278: ScoredHistoryMatches matches ...

9 years ago (2011-11-30 22:21:41 UTC) #17

mrossetti

Thanks! I had to look at that three times before I saw the extra spaces! ...

9 years ago (2011-11-30 22:45:24 UTC) #18

LGTM