Issue 3214002: Add a term feature extractor for client-side phishing detection. (Closed)

Created:
10 years, 4 months ago by Brian Ryner

Modified:
9 years, 7 months ago

Reviewers:
marria, lzheng, noelutz

CC:
chromium-reviews, Paweł Hajdan Jr., darin-cc_chromium.org, brettw-cc_chromium.org, chrome-anti-phishing_googlegroups.com

Base URL:
http://src.chromium.org/git/chromium.git

Visibility:
Public.

Description

Add a term feature extractor for client-side phishing detection.

This class creates features for n-grams in the page text that appear in the phishing classification model. It will eventually operate on the plain text that is extracted by RenderView::CaptureText().

To make it harder for phishers to enumerate the terms in the classification model, they will be supplied as SHA-256 hashes rather than plain text. The term feature extractor hashes the words in the document in order to check whether they match the model. Since this is potentially expensive, the term feature extractor limits how long it will run on each iteration, similar to the PhishingDOMFeatureExtractor.

TEST=PhishingTermFeatureExtractorTest
BUG=none

Committed: http://src.chromium.org/viewvc/chrome?view=rev&revision=58537
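For readers unfamiliar with the approach, the matching described above can be sketched roughly as follows. This is not the code in this CL: `HashTerm` and `FindTermMatches` are hypothetical names, words are split on whitespace rather than with a break iterator, and `std::hash` stands in for SHA-256 purely so the example has no external dependencies.

```cpp
#include <cstddef>
#include <functional>
#include <sstream>
#include <string>
#include <unordered_set>
#include <vector>

// Stand-in for SHA-256 (assumption): std::hash is NOT cryptographic and is
// used here only to keep the sketch self-contained.
inline std::size_t HashTerm(const std::string& term) {
  return std::hash<std::string>{}(term);
}

// Hypothetical helper: returns every n-gram (1..max_words_per_term words,
// joined by single spaces) in page_text whose hash is in term_hashes.
inline std::vector<std::string> FindTermMatches(
    const std::string& page_text,
    const std::unordered_set<std::size_t>& term_hashes,
    std::size_t max_words_per_term) {
  std::istringstream in(page_text);
  std::vector<std::string> words;
  for (std::string w; in >> w;) words.push_back(w);

  std::vector<std::string> matches;
  for (std::size_t end = 0; end < words.size(); ++end) {
    std::string ngram;
    // Grow the n-gram backwards from words[end], one word at a time.
    for (std::size_t n = 1; n <= max_words_per_term && n <= end + 1; ++n) {
      const std::string& w = words[end + 1 - n];
      ngram = ngram.empty() ? w : w + " " + ngram;
      if (term_hashes.count(HashTerm(ngram))) matches.push_back(ngram);
    }
  }
  return matches;
}
```

Because only hashes are stored, the model never has to ship the plain-text terms; the cost is one hash per candidate n-gram, which is why the running time is bounded per iteration.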

Patch Set 1 #

Total comments: 29

Patch Set 2 : address noe's comments #

Total comments: 10

Patch Set 3 : address lei's comments #

Patch Set 4 : Add an extra comment/TODO about performance. #

Created: 10 years, 3 months ago

Stats: +729 lines, -2 lines

M  base/sha2.h (1 chunk, +4 lines, -0 lines, 0 comments)
M  base/sha2.cc (2 chunks, +7 lines, -0 lines, 0 comments)
M  chrome/chrome_renderer.gypi (1 chunk, +2 lines, -0 lines, 0 comments)
M  chrome/chrome_tests.gypi (1 chunk, +1 line, -0 lines, 0 comments)
M  chrome/renderer/safe_browsing/features.h (1 chunk, +10 lines, -0 lines, 0 comments)
M  chrome/renderer/safe_browsing/features.cc (1 chunk, +3 lines, -0 lines, 0 comments)
M  chrome/renderer/safe_browsing/phishing_dom_feature_extractor.h (2 chunks, +3 lines, -2 lines, 0 comments)
A  chrome/renderer/safe_browsing/phishing_term_feature_extractor.h (1 chunk, +152 lines, -0 lines, 0 comments)
A  chrome/renderer/safe_browsing/phishing_term_feature_extractor.cc (1 chunk, +295 lines, -0 lines, 0 comments)
A  chrome/renderer/safe_browsing/phishing_term_feature_extractor_unittest.cc (1 chunk, +252 lines, -0 lines, 0 comments)

Messages

Total messages: 8 (0 generated)

noelutz

10 years, 4 months ago (2010-08-27 06:29:20 UTC) #2

Hi Brian,
This CL looks great to me.  Nice job.  I have mostly some high level questions /
comments.

thanks,
noé.

http://codereview.chromium.org/3214002/diff/1/7
File chrome/renderer/safe_browsing/features.h (right):

http://codereview.chromium.org/3214002/diff/1/7#newcode165
chrome/renderer/safe_browsing/features.h:165: // Token feature for a term
(whitespace-delimited) on a page.  Termw can be
s/Termw/Terms/?

http://codereview.chromium.org/3214002/diff/1/8
File chrome/renderer/safe_browsing/phishing_term_feature_extractor.cc (right):

http://codereview.chromium.org/3214002/diff/1/8#newcode51
chrome/renderer/safe_browsing/phishing_term_feature_extractor.cc:51: // position
returned by our iterator.
Is this an index in the text passed to the constructor of the struct?

http://codereview.chromium.org/3214002/diff/1/8#newcode102
chrome/renderer/safe_browsing/phishing_term_feature_extractor.cc:102: const
string16* page_text,
Is this a pointer because you want to indicate to the caller that the object
needs to be accessible throughout the lifetime of this object?

http://codereview.chromium.org/3214002/diff/1/8#newcode141
chrome/renderer/safe_browsing/phishing_term_feature_extractor.cc:141: if
(state_->position == -1) {
Maybe create a constant for the -1 value, e.g., kBeginPosition or kInitPosition?

http://codereview.chromium.org/3214002/diff/1/8#newcode146
chrome/renderer/safe_browsing/phishing_term_feature_extractor.cc:146: return;
Actually, is there any reason you're not doing this in the constructor of the
struct?  It might fit better there I think and you wouldn't have to use the
position = -1 case.

http://codereview.chromium.org/3214002/diff/1/8#newcode191
chrome/renderer/safe_browsing/phishing_term_feature_extractor.cc:191: }
Do you not want to track chunk_elapsed here too?

http://codereview.chromium.org/3214002/diff/1/8#newcode203
chrome/renderer/safe_browsing/phishing_term_feature_extractor.cc:203:
state_->previous_word_sizes.clear();
Ah. Now I remember ;).

http://codereview.chromium.org/3214002/diff/1/8#newcode220
chrome/renderer/safe_browsing/phishing_term_feature_extractor.cc:220:
hashes_to_check[base::SHA256HashString(current_term)] = current_term;
I'm not sure I understand why you need the hashes_to_check map.  Could you do
the lookup in the page_term_hashes_ map right here?

http://codereview.chromium.org/3214002/diff/1/9
File chrome/renderer/safe_browsing/phishing_term_feature_extractor.h (right):

http://codereview.chromium.org/3214002/diff/1/9#newcode15
chrome/renderer/safe_browsing/phishing_term_feature_extractor.h:15: //
FeatureMap will be hsahed so that they can be compared against the model.
s/hsahed/hashed/

http://codereview.chromium.org/3214002/diff/1/9#newcode74
chrome/renderer/safe_browsing/phishing_term_feature_extractor.h:74:
DoneCallback* done_callback);
Maybe mention on what thread DoneCallback will be called?

http://codereview.chromium.org/3214002/diff/1/9#newcode78
chrome/renderer/safe_browsing/phishing_term_feature_extractor.h:78: // is
unloaded or the PhishingTermFeatureExtractor is destroyed.
It's a bit odd that the user has to call that method if the object gets
destroyed.  Couldn't you call that method in the destructor?

http://codereview.chromium.org/3214002/diff/1/9#newcode86
chrome/renderer/safe_browsing/phishing_term_feature_extractor.h:86: static const
int kMaxTimePerChunkMs;
Since we have all the data, is there any reason we can't spawn another thread
instead?  Are we worried about taking too many CPU cycles away from important
threads?

http://codereview.chromium.org/3214002/diff/1/9#newcode95
chrome/renderer/safe_browsing/phishing_term_feature_extractor.h:95: static const
int kMaxTotalTimeMs;
I think it's not clear what you mean by time here.  Do you mean walltime or time
extracting features (i.e., doing work only in this code)?

Brian Ryner

10 years, 3 months ago (2010-08-27 18:29:42 UTC) #3

http://codereview.chromium.org/3214002/diff/1/7
File chrome/renderer/safe_browsing/features.h (right):

http://codereview.chromium.org/3214002/diff/1/7#newcode165
chrome/renderer/safe_browsing/features.h:165: // Token feature for a term
(whitespace-delimited) on a page.  Termw can be
On 2010/08/27 06:29:20, noelutz wrote:
> s/Termw/Terms/?

Done.

http://codereview.chromium.org/3214002/diff/1/8
File chrome/renderer/safe_browsing/phishing_term_feature_extractor.cc (right):

http://codereview.chromium.org/3214002/diff/1/8#newcode51
chrome/renderer/safe_browsing/phishing_term_feature_extractor.cc:51: // position
returned by our iterator.
On 2010/08/27 06:29:20, noelutz wrote:
> Is this an index in the text passed to the constructor of the struct?

Yep, clarified this.

http://codereview.chromium.org/3214002/diff/1/8#newcode102
chrome/renderer/safe_browsing/phishing_term_feature_extractor.cc:102: const
string16* page_text,
On 2010/08/27 06:29:20, noelutz wrote:
> Is this a pointer because you want to indicate to the caller that the object
> needs to be accessible throughout the lifetime of this object?

Yes, exactly.

http://codereview.chromium.org/3214002/diff/1/8#newcode141
chrome/renderer/safe_browsing/phishing_term_feature_extractor.cc:141: if
(state_->position == -1) {
On 2010/08/27 06:29:20, noelutz wrote:
> Maybe create a constant for the -1 value, e.g., kBeginPosition or
> kInitPosition?

Actually, I realized that -1 is the same as UBRK_DONE, which is a little
confusing.  I switched this to a separate boolean instead.

http://codereview.chromium.org/3214002/diff/1/8#newcode146
chrome/renderer/safe_browsing/phishing_term_feature_extractor.cc:146: return;
On 2010/08/27 06:29:20, noelutz wrote:
> Actually, is there any reason you're not doing this in the constructor of the
> struct?  It might fit better there I think and you wouldn't have to use the
> position = -1 case.

Yea, I was going back and forth on this.  There are a couple of reasons I ended
up doing it this way:

- It's not clear from the ICU interface that ubrk_first() is cheap to call.  I
think that in practice it is, but it seems better to run it inside of one of our
chunks of work so that we keep track of how long it takes.

- Even if we call ubrk_first when the state is initialized, we'd still want to
delay checking whether it returned UBRK_DONE until the first chunk of work --
I'm trying to avoid running the callback from inside ExtractFeatures since it
may be unexpected for the caller.  The reason that we'd want to check for
UBRK_DONE up-front is that we don't want to call ubrk_next if we're already at
the end.
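The two points above (doing all work, including setup, inside a timed chunk, and never running the completion callback from the call that starts the extraction) can be illustrated with a minimal sketch. It is only an illustration under stated assumptions: `ChunkedSum` is a hypothetical class that sums integers instead of extracting terms, and the clock is injected so the behavior can be tested deterministically.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical class; a sketch of time-sliced work, not the CL's extractor.
class ChunkedSum {
 public:
  using Ms = std::chrono::milliseconds;
  using Clock = std::function<Ms()>;  // injectable so tests can fake time
  using DoneCallback = std::function<void(long)>;

  ChunkedSum(std::vector<int> items, Ms budget, Clock now, DoneCallback done)
      : items_(std::move(items)),
        budget_(budget),
        now_(std::move(now)),
        done_(std::move(done)) {}

  // Runs one chunk and returns true if another chunk should be scheduled.
  // The done callback only ever fires from here, even for empty input, so
  // the code that constructs the worker never sees a re-entrant callback.
  bool RunChunk() {
    const Ms start = now_();
    while (next_ < items_.size()) {
      sum_ += items_[next_++];
      if (now_() - start >= budget_) break;  // budget spent: yield
    }
    if (next_ < items_.size()) return true;
    done_(sum_);
    return false;
  }

 private:
  std::vector<int> items_;
  Ms budget_;
  Clock now_;
  DoneCallback done_;
  std::size_t next_ = 0;
  long sum_ = 0;
};
```

In a real event loop the caller would post a task to call `RunChunk()` again whenever it returns true, which lets other work on the same thread run between chunks.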

http://codereview.chromium.org/3214002/diff/1/8#newcode191
chrome/renderer/safe_browsing/phishing_term_feature_extractor.cc:191: }
On 2010/08/27 06:29:20, noelutz wrote:
> Do you not want to track chunk_elapsed here too?

I'm intentionally not counting chunks where the extraction finished.  Without
those chunks, we should expect to see the TermFeatureChunkTime histogram be
pretty close to 50ms... adding in the chunks where extraction finished would
throw that off.  We could track the "final chunks" in a separate histogram, but
I'm not sure it's really worth it since we're already tracking the total time.

http://codereview.chromium.org/3214002/diff/1/8#newcode220
chrome/renderer/safe_browsing/phishing_term_feature_extractor.cc:220:
hashes_to_check[base::SHA256HashString(current_term)] = current_term;
On 2010/08/27 06:29:20, noelutz wrote:
> I'm not sure I understand why you need the hashes_to_check map.  Could you do
> the lookup in the page_term_hashes_ map right here?

It's because I want to use the already-computed word_hash rather than computing
it again (see line 211 where that's inserted into the map), and I didn't want to
duplicate the code that does the lookup / adds the feature.  However, if you can
think of a better way to structure this, let me know.

http://codereview.chromium.org/3214002/diff/1/9
File chrome/renderer/safe_browsing/phishing_term_feature_extractor.h (right):

http://codereview.chromium.org/3214002/diff/1/9#newcode15
chrome/renderer/safe_browsing/phishing_term_feature_extractor.h:15: //
FeatureMap will be hsahed so that they can be compared against the model.
On 2010/08/27 06:29:20, noelutz wrote:
> s/hsahed/hashed/

Done.

http://codereview.chromium.org/3214002/diff/1/9#newcode74
chrome/renderer/safe_browsing/phishing_term_feature_extractor.h:74:
DoneCallback* done_callback);
On 2010/08/27 06:29:20, noelutz wrote:
> Maybe mention on what thread DoneCallback will be called?

Done.  Also made this change for phishing_dom_feature_extractor.h, which has the
same comment.

http://codereview.chromium.org/3214002/diff/1/9#newcode78
chrome/renderer/safe_browsing/phishing_term_feature_extractor.h:78: // is
unloaded or the PhishingTermFeatureExtractor is destroyed.
On 2010/08/27 06:29:20, noelutz wrote:
> It's a bit odd that the user has to call that method if the object gets
> destroyed.  Couldn't you call that method in the destructor?

The reason I didn't do this is for consistency -- it seems a little confusing if
the caller sometimes has to cancel an extraction and sometimes doesn't.  I
wanted to make it clear that the caller is always responsible for cancelling an
unfinished extraction once it's started.

http://codereview.chromium.org/3214002/diff/1/9#newcode86
chrome/renderer/safe_browsing/phishing_term_feature_extractor.h:86: static const
int kMaxTimePerChunkMs;
On 2010/08/27 06:29:20, noelutz wrote:
> Since we have all the data, is there any reason we can't spawn another thread
> instead?  Are we worried about taking too many CPU cycles away from important
> threads?

As we talked about offline, this would definitely be possible, but I wanted to
avoid adding a new type of async feature extraction interface (the other being
the dom feature extractor).  Running on the same thread keeps the semantics the
same between the two.

http://codereview.chromium.org/3214002/diff/1/9#newcode95
chrome/renderer/safe_browsing/phishing_term_feature_extractor.h:95: static const
int kMaxTotalTimeMs;
On 2010/08/27 06:29:20, noelutz wrote:
> I think it's not clear what you mean by time here.  Do you mean walltime or
> time extracting features (i.e., doing work only in this code)?

Clarified this to be wall time (same for the dom feature extractor).

noelutz

10 years, 3 months ago (2010-08-27 19:21:46 UTC) #4

LGTM

http://codereview.chromium.org/3214002/diff/1/8
File chrome/renderer/safe_browsing/phishing_term_feature_extractor.cc (right):

http://codereview.chromium.org/3214002/diff/1/8#newcode146
chrome/renderer/safe_browsing/phishing_term_feature_extractor.cc:146: return;
On 2010/08/27 18:29:42, Brian Ryner wrote:
> On 2010/08/27 06:29:20, noelutz wrote:
> > Actually, is there any reason you're not doing this in the constructor of
> > the struct?  It might fit better there I think and you wouldn't have to use
> > the position = -1 case.
>
> Yea, I was going back and forth on this.  There are a couple of reasons I
> ended up doing it this way:
>
> - It's not clear from the ICU interface that ubrk_first() is cheap to call.  I
> think that in practice it is, but it seems better to run it inside of one of
> our chunks of work so that we keep track of how long it takes.
>
> - Even if we call ubrk_first when the state is initialized, we'd still want to
> delay checking whether it returned UBRK_DONE until the first chunk of work --
> I'm trying to avoid running the callback from inside ExtractFeatures since it
> may be unexpected for the caller.  The reason that we'd want to check for
> UBRK_DONE up-front is that we don't want to call ubrk_next if we're already at
> the end.

I see.  That makes sense.

http://codereview.chromium.org/3214002/diff/1/8#newcode191
chrome/renderer/safe_browsing/phishing_term_feature_extractor.cc:191: }
On 2010/08/27 18:29:42, Brian Ryner wrote:
> On 2010/08/27 06:29:20, noelutz wrote:
> > Do you not want to track chunk_elapsed here too?
> 
> I'm intentionally not counting chunks where the extraction finished.  Without
> those chunks, we should expect to see the TermFeatureChunkTime histogram be
> pretty close to 50ms... adding in the chunks where extraction finished would
> throw that off.  We could track the "final chunks" in a separate histogram,
> but I'm not sure it's really worth it since we're already tracking the total
> time.

Oh yes, I forgot about the total time stat.  Sounds good.

http://codereview.chromium.org/3214002/diff/1/8#newcode220
chrome/renderer/safe_browsing/phishing_term_feature_extractor.cc:220:
hashes_to_check[base::SHA256HashString(current_term)] = current_term;
On 2010/08/27 18:29:42, Brian Ryner wrote:
> On 2010/08/27 06:29:20, noelutz wrote:
> > I'm not sure I understand why you need the hashes_to_check map.  Could you
> > do the lookup in the page_term_hashes_ map right here?
>
> It's because I want to use the already-computed word_hash rather than
> computing it again (see line 211 where that's inserted into the map), and I
> didn't want to duplicate the code that does the lookup / adds the feature.
> However, if you can think of a better way to structure this, let me know.

I see.  I think it's fine as is.

http://codereview.chromium.org/3214002/diff/1/9
File chrome/renderer/safe_browsing/phishing_term_feature_extractor.h (right):

http://codereview.chromium.org/3214002/diff/1/9#newcode78
chrome/renderer/safe_browsing/phishing_term_feature_extractor.h:78: // is
unloaded or the PhishingTermFeatureExtractor is destroyed.
On 2010/08/27 18:29:42, Brian Ryner wrote:
> On 2010/08/27 06:29:20, noelutz wrote:
> > It's a bit odd that the user has to call that method if the object gets
> > destroyed.  Couldn't you call that method in the destructor?
> 
> The reason I didn't do this is for consistency -- it seems a little confusing
> if the caller sometimes has to cancel an extraction and sometimes doesn't.  I
> wanted to make it clear that the caller is always responsible for cancelling
> an unfinished extraction once it's started.

Fair enough.

lzheng

10 years, 3 months ago (2010-09-03 05:19:05 UTC) #5

Nice CL!

http://codereview.chromium.org/3214002/diff/7001/8008
File chrome/renderer/safe_browsing/phishing_term_feature_extractor.cc (right):

http://codereview.chromium.org/3214002/diff/7001/8008#newcode224
chrome/renderer/safe_browsing/phishing_term_feature_extractor.cc:224: for
(std::list<size_t>::iterator it = state_->previous_word_sizes.begin();
This loop is pretty heavy due to the hash computation. If we somehow have a
relatively long term in page_term_hashes_, we will do a lot of iterations with
hash computation before we get the hash we want (e.g., to get the hash for
A B C D E, we will compute AB, ABC, BC, ABCD, BCD, CD, ...). I wonder if this
could be improved by storing possible word positions in page_word_hashes_: if
we find a word at position x of a term and page_word_hashes_ tells us this word
only appears at positions y and z in bad terms, we don't need to calculate the
SHA-256 hash when it is at x. Is this possible? I am assuming here that the
SHA-256 calculation is more expensive than looking up the possible positions of
a word, and that the possible positions of a word in the terms to check are
very limited.
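One way to picture the position idea is the sketch below. It is a simplified illustration, not a proposal for the actual data format: `PositionIndex` and `CountHashes` are hypothetical, the index is keyed by plain-text words for readability (the model in this CL stores hashes), and only the last word of each candidate n-gram is checked against the index.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical index: word -> 0-based positions it occupies in any model term.
using PositionIndex = std::unordered_map<std::string, std::set<std::size_t>>;

// Counts how many n-gram hashes would be computed for `words`. An n-gram of
// length n ending at word w puts w at position n-1, so with use_positions set
// the n-gram is skipped unless the index lists that position for w.
inline std::size_t CountHashes(const std::vector<std::string>& words,
                               const PositionIndex& index,
                               std::size_t max_words_per_term,
                               bool use_positions) {
  std::size_t hashes = 0;
  for (std::size_t end = 0; end < words.size(); ++end) {
    const auto it = index.find(words[end]);
    for (std::size_t n = 1; n <= max_words_per_term && n <= end + 1; ++n) {
      if (use_positions &&
          (it == index.end() || it->second.count(n - 1) == 0)) {
        continue;  // the word never appears at this position: skip the hash
      }
      ++hashes;  // stands in for one SHA-256 computation
    }
  }
  return hashes;
}
```

For the single term "multi word test" and a page containing exactly those three words, the unfiltered loop counts six candidate hashes and the position filter counts three. Whether that saving outweighs the extra lookup and the larger model is the open question raised in the comment.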

http://codereview.chromium.org/3214002/diff/7001/8009
File chrome/renderer/safe_browsing/phishing_term_feature_extractor.h (right):

http://codereview.chromium.org/3214002/diff/7001/8009#newcode125
chrome/renderer/safe_browsing/phishing_term_feature_extractor.h:125:
base::hash_set<std::string> page_word_hashes_;
Do you need to keep your own copy of page_word_hashes_ and page_term_hashes_?
Could it just be the pointer from the caller (like features_)?

http://codereview.chromium.org/3214002/diff/7001/8009#newcode128
chrome/renderer/safe_browsing/phishing_term_feature_extractor.h:128: size_t
max_words_per_term_;
Could this be a static variable if it is the same across the browser?

http://codereview.chromium.org/3214002/diff/7001/8010
File chrome/renderer/safe_browsing/phishing_term_feature_extractor_unittest.cc
(right):

http://codereview.chromium.org/3214002/diff/7001/8010#newcode130
chrome/renderer/safe_browsing/phishing_term_feature_extractor_unittest.cc:130:
page_text = ASCIIToUTF16("bla bla multi word test bla");
How about a test for "bla bla test word multi bla"? The point is that "test",
"word", and "multi" are all in the word set, but they should not generate any
term match.

Brian Ryner

10 years, 3 months ago (2010-09-03 20:35:46 UTC) #6

Please have another look.

http://codereview.chromium.org/3214002/diff/7001/8008
File chrome/renderer/safe_browsing/phishing_term_feature_extractor.cc (right):

http://codereview.chromium.org/3214002/diff/7001/8008#newcode224
chrome/renderer/safe_browsing/phishing_term_feature_extractor.cc:224: for
(std::list<size_t>::iterator it = state_->previous_word_sizes.begin();
On 2010/09/03 05:19:05, lzheng wrote:
> This loop is pretty heavy due to the hash computation. If we somehow have a
> relatively long term in page_term_hashes_, we will do a lot of iterations with
> hash computation before we get the hash we want (e.g., to get the hash for
> A B C D E, we will compute AB, ABC, BC, ABCD, BCD, CD, ...). I wonder if this
> could be improved by storing possible word positions in page_word_hashes_: if
> we find a word at position x of a term and page_word_hashes_ tells us this
> word only appears at positions y and z in bad terms, we don't need to
> calculate the SHA-256 hash when it is at x. Is this possible? I am assuming
> here that the SHA-256 calculation is more expensive than looking up the
> possible positions of a word, and that the possible positions of a word in the
> terms to check are very limited.

Agreed that page_word_hashes_ is somewhat coarse-grained, so there are cases
like this where we'll do more work than necessary.  My preference would be to
keep things simple to start with, then we can optimize if we see from the UMA
metrics that the term extraction is too slow.  If we do need to optimize, noting
the term positions would definitely be possible, and I think we could also keep
a cache of plaintext words that aren't in page_word_hashes_.
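The cache of plain-text misses mentioned above could look something like the following sketch. `WordFilter` is a hypothetical class, and `std::hash` again stands in for SHA-256 (an assumption made to keep the example self-contained); the counter exists only to make the saved hash computations visible.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>
#include <utility>

// Hypothetical filter, not part of this CL.
class WordFilter {
 public:
  explicit WordFilter(std::unordered_set<std::size_t> word_hashes)
      : word_hashes_(std::move(word_hashes)) {}

  // Returns true if `word` is in the model's word set. Misses are remembered
  // in plain text so a repeated non-model word is rejected without rehashing.
  bool Contains(const std::string& word) {
    if (known_misses_.count(word)) return false;
    ++hash_calls_;  // one (stand-in) hash computation
    if (word_hashes_.count(std::hash<std::string>{}(word))) return true;
    known_misses_.insert(word);
    return false;
  }

  std::size_t hash_calls() const { return hash_calls_; }

 private:
  std::unordered_set<std::size_t> word_hashes_;
  std::unordered_set<std::string> known_misses_;
  std::size_t hash_calls_ = 0;
};
```

The trade-off is memory: the miss cache grows with the number of distinct non-model words on the page, so a real implementation would probably want a size cap.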

http://codereview.chromium.org/3214002/diff/7001/8009
File chrome/renderer/safe_browsing/phishing_term_feature_extractor.h (right):

http://codereview.chromium.org/3214002/diff/7001/8009#newcode125
chrome/renderer/safe_browsing/phishing_term_feature_extractor.h:125:
base::hash_set<std::string> page_word_hashes_;
On 2010/09/03 05:19:05, lzheng wrote:
> Do you need to keep your own copy of page_word_hashes_ and page_term_hashes_?
> Could it just be the pointer from the caller (like features_)?

This came up in Noe's scorer CL too.  I don't think it's a huge performance
issue since the copy would happen once on renderer startup, but it's an easy
change, so I went ahead and made it take pointers to the hash sets.

http://codereview.chromium.org/3214002/diff/7001/8009#newcode128
chrome/renderer/safe_browsing/phishing_term_feature_extractor.h:128: size_t
max_words_per_term_;
On 2010/09/03 05:19:05, lzheng wrote:
> Could this be a static variable if it is the same across the browser?

Do you mean a static const variable?  This value will come from the model (see
http://codereview.chromium.org/3363004/show), so hard-coding it that way doesn't
actually work.

http://codereview.chromium.org/3214002/diff/7001/8010
File chrome/renderer/safe_browsing/phishing_term_feature_extractor_unittest.cc
(right):

http://codereview.chromium.org/3214002/diff/7001/8010#newcode130
chrome/renderer/safe_browsing/phishing_term_feature_extractor_unittest.cc:130:
page_text = ASCIIToUTF16("bla bla multi word test bla");
On 2010/09/03 05:19:05, lzheng wrote:
> How about a test for "bla bla test word multi bla"? The point is that "test",
> "word", and "multi" are all in the word set, but they should not generate any
> term match.

Good idea, done.

lzheng

LGTM. http://codereview.chromium.org/3214002/diff/7001/8008 File chrome/renderer/safe_browsing/phishing_term_feature_extractor.cc (right): http://codereview.chromium.org/3214002/diff/7001/8008#newcode224 chrome/renderer/safe_browsing/phishing_term_feature_extractor.cc:224: for (std::list<size_t>::iterator it = state_->previous_word_sizes.begin(); On 2010/09/03 20:35:47, ...

10 years, 3 months ago (2010-09-03 20:44:30 UTC) #7

Brian Ryner

10 years, 3 months ago (2010-09-03 20:57:35 UTC) #8

Thanks,
Brian

http://codereview.chromium.org/3214002/diff/7001/8008
File chrome/renderer/safe_browsing/phishing_term_feature_extractor.cc (right):

http://codereview.chromium.org/3214002/diff/7001/8008#newcode224
chrome/renderer/safe_browsing/phishing_term_feature_extractor.cc:224: for
(std::list<size_t>::iterator it = state_->previous_word_sizes.begin();
On 2010/09/03 20:44:30, lzheng wrote:
> Okay. Can you put a todo here to highlight the potential problem here?
> 

Sure, added a TODO about measuring the performance with UMA, along with the
optimization ideas we talked about.  Another optimization we could consider, if
necessary, is to change the term format so that each word is hashed separately;
I think that would also address your concern about word ordering.

Submitting with that change.
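As a rough illustration of the per-word term format floated in the last message, the sketch below represents each term as an ordered sequence of word hashes. It is an assumption-laden sketch rather than a design: `HashedTerm`, `HashWords`, and `CountTermMatches` are hypothetical names, `std::hash` stands in for SHA-256, and words are split on whitespace.

```cpp
#include <cstddef>
#include <functional>
#include <set>
#include <sstream>
#include <string>
#include <vector>

using WordHash = std::size_t;              // stand-in for a SHA-256 digest
using HashedTerm = std::vector<WordHash>;  // one hash per word, in order

// Hashes each whitespace-delimited word of `text` exactly once.
inline HashedTerm HashWords(const std::string& text) {
  std::istringstream in(text);
  HashedTerm out;
  for (std::string w; in >> w;) out.push_back(std::hash<std::string>{}(w));
  return out;
}

// Counts occurrences of any term in the page. N-grams are compared as
// sequences of the already-computed word hashes, so no n-gram is rehashed.
inline std::size_t CountTermMatches(const std::string& page_text,
                                    const std::set<HashedTerm>& terms,
                                    std::size_t max_words_per_term) {
  const HashedTerm page = HashWords(page_text);
  std::size_t matches = 0;
  for (std::size_t start = 0; start < page.size(); ++start) {
    for (std::size_t n = 1;
         n <= max_words_per_term && start + n <= page.size(); ++n) {
      const HashedTerm ngram(page.begin() + start, page.begin() + start + n);
      if (terms.count(ngram)) ++matches;
    }
  }
  return matches;
}
```

With this layout the number of hash computations is linear in the number of page words, and word order is still significant because the sequences are compared element by element.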