Chromium Code Reviews
chromiumcodereview-hr@appspot.gserviceaccount.com (chromiumcodereview-hr) | Please choose your nickname with Settings | Help | Chromium Project | Gerrit Changes | Sign out
(109)

Issue 3214002: Add a term feature extractor for client-side phishing detection. (Closed)

Created:
10 years, 4 months ago by Brian Ryner
Modified:
9 years, 7 months ago
Reviewers:
marria, lzheng, noelutz
CC:
chromium-reviews, Paweł Hajdan Jr., darin-cc_chromium.org, brettw-cc_chromium.org, chrome-anti-phishing_googlegroups.com
Base URL:
http://src.chromium.org/git/chromium.git
Visibility:
Public.

Description

Add a term feature extractor for client-side phishing detection. This class creates features for n-grams in the page text that appear in the phishing classification model. It will eventually operate on the plain text that is extracted by RenderView::CaptureText(). To make it harder for phishers to enumerate the terms in the classification model, they will be supplied as SHA-256 hashes rather than plain text. The term feature extractor hashes the words in the document in order to check whether they match the model. Since this is potentially expensive, the term feature extractor limits how long it will run on each iteration, similar to the PhishingDOMFeatureExtractor. TEST=PhishingTermFeatureExtractorTest BUG=none Committed: http://src.chromium.org/viewvc/chrome?view=rev&revision=58537

Patch Set 1 #

Total comments: 29

Patch Set 2 : address noe's comments #

Total comments: 10

Patch Set 3 : address lei's comments #

Patch Set 4 : Add an extra comment/TODO about performance. #

Unified diffs Side-by-side diffs Delta from patch set Stats (+729 lines, -2 lines) Patch
M base/sha2.h View 1 chunk +4 lines, -0 lines 0 comments Download
M base/sha2.cc View 2 chunks +7 lines, -0 lines 0 comments Download
M chrome/chrome_renderer.gypi View 1 2 1 chunk +2 lines, -0 lines 0 comments Download
M chrome/chrome_tests.gypi View 1 2 1 chunk +1 line, -0 lines 0 comments Download
M chrome/renderer/safe_browsing/features.h View 1 1 chunk +10 lines, -0 lines 0 comments Download
M chrome/renderer/safe_browsing/features.cc View 1 chunk +3 lines, -0 lines 0 comments Download
M chrome/renderer/safe_browsing/phishing_dom_feature_extractor.h View 2 chunks +3 lines, -2 lines 0 comments Download
A chrome/renderer/safe_browsing/phishing_term_feature_extractor.h View 1 2 1 chunk +152 lines, -0 lines 0 comments Download
A chrome/renderer/safe_browsing/phishing_term_feature_extractor.cc View 1 2 3 1 chunk +295 lines, -0 lines 0 comments Download
A chrome/renderer/safe_browsing/phishing_term_feature_extractor_unittest.cc View 1 2 1 chunk +252 lines, -0 lines 0 comments Download

Messages

Total messages: 8 (0 generated)
Brian Ryner
10 years, 4 months ago (2010-08-27 00:04:05 UTC) #1
noelutz
Hi Brian, This CL looks great to me. Nice job. I have mostly some high ...
10 years, 4 months ago (2010-08-27 06:29:20 UTC) #2
Brian Ryner
http://codereview.chromium.org/3214002/diff/1/7 File chrome/renderer/safe_browsing/features.h (right): http://codereview.chromium.org/3214002/diff/1/7#newcode165 chrome/renderer/safe_browsing/features.h:165: // Token feature for a term (whitespace-delimited) on a ...
10 years, 3 months ago (2010-08-27 18:29:42 UTC) #3
noelutz
LGTM http://codereview.chromium.org/3214002/diff/1/8 File chrome/renderer/safe_browsing/phishing_term_feature_extractor.cc (right): http://codereview.chromium.org/3214002/diff/1/8#newcode146 chrome/renderer/safe_browsing/phishing_term_feature_extractor.cc:146: return; On 2010/08/27 18:29:42, Brian Ryner wrote: > ...
10 years, 3 months ago (2010-08-27 19:21:46 UTC) #4
lzheng
Nice CL! http://codereview.chromium.org/3214002/diff/7001/8008 File chrome/renderer/safe_browsing/phishing_term_feature_extractor.cc (right): http://codereview.chromium.org/3214002/diff/7001/8008#newcode224 chrome/renderer/safe_browsing/phishing_term_feature_extractor.cc:224: for (std::list<size_t>::iterator it = state_->previous_word_sizes.begin(); This loop ...
10 years, 3 months ago (2010-09-03 05:19:05 UTC) #5
Brian Ryner
Please have another look. http://codereview.chromium.org/3214002/diff/7001/8008 File chrome/renderer/safe_browsing/phishing_term_feature_extractor.cc (right): http://codereview.chromium.org/3214002/diff/7001/8008#newcode224 chrome/renderer/safe_browsing/phishing_term_feature_extractor.cc:224: for (std::list<size_t>::iterator it = state_->previous_word_sizes.begin(); ...
10 years, 3 months ago (2010-09-03 20:35:46 UTC) #6
lzheng
LGTM. http://codereview.chromium.org/3214002/diff/7001/8008 File chrome/renderer/safe_browsing/phishing_term_feature_extractor.cc (right): http://codereview.chromium.org/3214002/diff/7001/8008#newcode224 chrome/renderer/safe_browsing/phishing_term_feature_extractor.cc:224: for (std::list<size_t>::iterator it = state_->previous_word_sizes.begin(); On 2010/09/03 20:35:47, ...
10 years, 3 months ago (2010-09-03 20:44:30 UTC) #7
Brian Ryner
10 years, 3 months ago (2010-09-03 20:57:35 UTC) #8
Thanks,
Brian

http://codereview.chromium.org/3214002/diff/7001/8008
File chrome/renderer/safe_browsing/phishing_term_feature_extractor.cc (right):

http://codereview.chromium.org/3214002/diff/7001/8008#newcode224
chrome/renderer/safe_browsing/phishing_term_feature_extractor.cc:224: for
(std::list<size_t>::iterator it = state_->previous_word_sizes.begin();
On 2010/09/03 20:44:30, lzheng wrote:
> Okay. Can you put a todo here to highlight the potential problem here?
> 

Sure, added a TODO to measure the performance with UMA and the ideas we talked
about for optimizing.  Another optimization we could consider, if necessary, is
to change the term format so that each word is hashed separately; I think that
would also address your concern about word ordering.

Submitting with that change.

Powered by Google App Engine
This is Rietveld 408576698