Issue 1275002: Canonicalize the url based on Section 6.1 Safe Browsing Spec. Also fix the un... (Closed)

Created:
10 years, 9 months ago by inferno

Modified:
9 years, 7 months ago

Reviewers:
bryner (do not use), jschuh, eroman, Erik does not do reviews

CC:
chromium-reviews, Paweł Hajdan Jr., Paul Godavari, ben+cc_chromium.org, Chris Evans, gcasto (DO NOT USE)

Base URL:
svn://chrome-svn/chrome/trunk/src/

Visibility:
Public.

Description

Canonicalize the url based on Section 6.1 Safe Browsing Spec. BUG=7713 TEST=SafeBrowsingUtilTest.CanonicalizeUrl Committed: http://src.chromium.org/viewvc/chrome?view=rev&revision=43100

Patch Set 13 : '' #

Created: 10 years, 8 months ago

Stats: +355 lines, -6 lines
M	chrome/browser/safe_browsing/safe_browsing_util.h	1 chunk	+7 lines, -0 lines	0 comments
M	chrome/browser/safe_browsing/safe_browsing_util.cc	5 chunks	+147 lines, -5 lines	0 comments
M	chrome/browser/safe_browsing/safe_browsing_util_unittest.cc	2 chunks	+201 lines, -1 line	0 comments

Messages

Total messages: 28 (0 generated)

inferno

Please review this patch. Please note that i have added unittests for everything mentioned in ...

10 years, 9 months ago (2010-03-24 17:45:27 UTC) #1

inferno

minor tweaks - removed extra lines, fixed license file, added if for chars_written to prevent ...

10 years, 9 months ago (2010-03-24 18:17:00 UTC) #3

eroman

Some initial comments. Will look at the un-escaping logic next. http://codereview.chromium.org/1275002/diff/1/2 File chrome/browser/safe_browsing/safe_browsing_service.cc (right): http://codereview.chromium.org/1275002/diff/1/2#newcode72 ...

10 years, 9 months ago (2010-03-24 18:19:30 UTC) #4

eroman

Summarizing our in-person chat: We can continue passing around GURLs throughout, and only do the safe-browsing ...

10 years, 9 months ago (2010-03-24 18:34:25 UTC) #5

inferno

please wait for review till next patch. i will fix all the things as mentioned ...

10 years, 9 months ago (2010-03-24 18:53:32 UTC) #6

Erik does not do reviews

Is there something in particular you'd like me to look at? I'm happy to defer ...

10 years, 9 months ago (2010-03-24 20:45:14 UTC) #7

inferno

@erickay - nothing particular :), just wanted to keep you informed about this change. i ...

10 years, 9 months ago (2010-03-24 21:29:38 UTC) #8

Erik does not do reviews

I do have a couple of comments after all: - Since this has the potential ...

10 years, 9 months ago (2010-03-24 22:06:51 UTC) #9

inferno

Thanks @Erikkay for the comments. I will seek Eroman's help to see histograms after he ...

10 years, 9 months ago (2010-03-24 22:35:16 UTC) #10

eroman

Note that I am going to be gone the rest of this week (thu-fri). Hopefully ...

10 years, 9 months ago (2010-03-24 23:03:56 UTC) #11

eroman

http://codereview.chromium.org/1275002/diff/23001/24002 File chrome/browser/safe_browsing/safe_browsing_util.cc (right): http://codereview.chromium.org/1275002/diff/23001/24002#newcode198 chrome/browser/safe_browsing/safe_browsing_util.cc:198: std::string CanonicalizeUrl(const GURL& url) { How about making the ...

10 years, 9 months ago (2010-03-24 23:09:38 UTC) #12

inferno

Eric, i did make all the suggested changes, please review. hope we can complete before ...

10 years, 9 months ago (2010-03-25 00:37:05 UTC) #13

eroman

10 years, 9 months ago (2010-03-25 02:26:15 UTC) #14

http://codereview.chromium.org/1275002/diff/32002/36002
File chrome/browser/safe_browsing/safe_browsing_util.cc (right):

http://codereview.chromium.org/1275002/diff/32002/36002#newcode173
chrome/browser/safe_browsing/safe_browsing_util.cc:173: UnescapeRule::NORMAL |
UnescapeRule::SPACES |
style-nit: continued lines indent by 4 spaces.

http://codereview.chromium.org/1275002/diff/32002/36002#newcode176
chrome/browser/safe_browsing/safe_browsing_util.cc:176: ++loop_var <=
kMaxLoopIterations);
style-nit: please line this up with the opening '(' of the previous line.

http://codereview.chromium.org/1275002/diff/32002/36002#newcode202
chrome/browser/safe_browsing/safe_browsing_util.cc:202: std::string*
canonicalized_hostname,
Is this supposed to include the ":port" as well?

Also, does the spec say what is supposed to happen with IPv6 literal?

http://codereview.chromium.org/1275002/diff/32002/36002#newcode221
chrome/browser/safe_browsing/safe_browsing_util.cc:221: std::string
url_unescaped_str(UnescapeUrl(url_without_fragment.spec()));
Rather than trying to work with the URL's spec(), I think it will be easier to
start off by extracting the host, path and query portions:

std::string host = url.host();
std::string path = url.path();
std::string query = url.query();

And then applying the canonicalizations to each component in isolation.

The danger with using GURL::Replacements, is knowing exactly which pieces to
subtract. For example the GURL could have a username:password embedded in it,
and it doesn't look like we are stripping that right now.
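The concern about ":port", IPv6 literals, and embedded username:password can be illustrated with a small sketch. This is not the patch's code and does not use GURL; `HostFromAuthority` is a hypothetical helper that assumes IPv6 literals are always bracketed, as they are in well-formed URLs.

```cpp
#include <string>

// Hypothetical helper (not part of the patch): returns the bare host from an
// authority string such as "user:pass@host.com:8080".
// Assumption: IPv6 literals are bracketed, e.g. "[::1]:80".
std::string HostFromAuthority(const std::string& authority) {
  // Drop "user:password@" if present; userinfo ends at the last '@'.
  std::string::size_type at = authority.rfind('@');
  std::string host_port =
      (at == std::string::npos) ? authority : authority.substr(at + 1);

  // Drop ":port". Inside a bracketed IPv6 literal the colons belong to the
  // address, so only a colon after the closing ']' separates the port.
  std::string::size_type bracket = host_port.rfind(']');
  std::string::size_type colon = host_port.rfind(':');
  if (colon != std::string::npos &&
      (bracket == std::string::npos || colon > bracket)) {
    host_port.erase(colon);
  }
  return host_port;
}
```

Working on the extracted components this way makes it explicit which pieces are dropped, which is the risk raised above about subtracting pieces from the full spec().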

http://codereview.chromium.org/1275002/diff/32002/36002#newcode242
chrome/browser/safe_browsing/safe_browsing_util.cc:242: i !=
host_without_end_dots.end(); i++) {
style-nit: In the for loops, please indent the continued line by lining it up
with the parameters from the previous one.

http://codereview.chromium.org/1275002/diff/32002/36001
File chrome/browser/safe_browsing/safe_browsing_util_unittest.cc (right):

http://codereview.chromium.org/1275002/diff/32002/36001#newcode67
chrome/browser/safe_browsing/safe_browsing_util_unittest.cc:67:
GURL("http://host/%25%32%35"), NULL, NULL, NULL), "http://host/%25");
style-nit: for indentation either use 4 spaces, or line up with the parenthesis
of the previous line.

http://codereview.chromium.org/1275002/diff/32002/36001#newcode77
chrome/browser/safe_browsing/safe_browsing_util_unittest.cc:77:
GURL("http://host/asdf%25%32%35asd"), NULL, NULL, NULL),
I think these tests would be stronger if they checked the individual host, path,
query parts, since that is how it is used in the code. I suggest making the
return value "void" as it is not used except by tests.

inferno

all style nits corrected. canonicalized_hostname does not include :port. spec does not mention anything about ...

10 years, 9 months ago (2010-03-25 02:46:39 UTC) #15

inferno

I am making the changes after thinking more about this and based on the further ...

10 years, 9 months ago (2010-03-25 07:27:17 UTC) #16

gcasto (DO NOT USE)

Adding Brian, who is our canonicalization expert. On Thu, Mar 25, 2010 at 12:27 AM, ...

10 years, 9 months ago (2010-03-25 21:28:27 UTC) #17

inferno

looks like you forgot to cc brian, can you please add him. the issue is ...

10 years, 9 months ago (2010-03-25 21:33:55 UTC) #18

gcasto (DO NOT USE)

Actually adding Brian. On Thu, Mar 25, 2010 at 2:33 PM, <inferno@chromium.org> wrote: > looks ...

10 years, 9 months ago (2010-03-25 22:59:47 UTC) #19

inferno

Thank you Brian for discussing this. I got your point that we need to match ...

10 years, 9 months ago (2010-03-26 20:48:48 UTC) #21

eroman

http://codereview.chromium.org/1275002/diff/55001/56002 File chrome/browser/safe_browsing/safe_browsing_util.cc (right): http://codereview.chromium.org/1275002/diff/55001/56002#newcode175 chrome/browser/safe_browsing/safe_browsing_util.cc:175: } while (unescaped_str.compare(old_unescaped_str) && ++loop_var <= rather than compare, ...

10 years, 9 months ago (2010-03-30 01:11:22 UTC) #22

inferno

10 years, 8 months ago (2010-03-30 15:20:21 UTC) #23

Eric, thanks for your insightful review. i have made the changes. a few were
left as is, such as the url parsing, to keep behavior similar to the spec.

let me know if these look fine.

http://codereview.chromium.org/1275002/diff/55001/56002
File chrome/browser/safe_browsing/safe_browsing_util.cc (right):

http://codereview.chromium.org/1275002/diff/55001/56002#newcode175
chrome/browser/safe_browsing/safe_browsing_util.cc:175: } while
(unescaped_str.compare(old_unescaped_str) && ++loop_var <=
On 2010/03/30 01:11:23, eroman wrote:
> rather than compare, can you use (unescaped_str != old_unescaped_str) ?

Done.
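For readers following along, the loop under discussion unescapes repeatedly until the string stops changing, with an iteration cap as a safety net. A minimal sketch, assuming a simple percent-decoder: `UnescapeOnce` and `UnescapeRepeatedly` are hypothetical names, the decoder is not Chromium's unescaping routine, and the cap value is made up (only the name kMaxLoopIterations comes from the patch).

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Converts one hex digit to its value; the caller guarantees isxdigit(c).
int HexValue(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  return c - 'A' + 10;
}

// Hypothetical single-pass percent-decoder (stand-in for the real one).
std::string UnescapeOnce(const std::string& in) {
  std::string out;
  out.reserve(in.size());
  for (std::size_t i = 0; i < in.size(); ++i) {
    if (in[i] == '%' && i + 2 < in.size() &&
        std::isxdigit(static_cast<unsigned char>(in[i + 1])) &&
        std::isxdigit(static_cast<unsigned char>(in[i + 2]))) {
      out.push_back(
          static_cast<char>(HexValue(in[i + 1]) * 16 + HexValue(in[i + 2])));
      i += 2;  // Skip the two hex digits just consumed.
    } else {
      out.push_back(in[i]);
    }
  }
  return out;
}

// Unescapes until a fixed point is reached or the cap is hit.
std::string UnescapeRepeatedly(const std::string& url) {
  const int kMaxLoopIterations = 1024;  // Assumed value, for illustration.
  std::string current = url;
  for (int i = 0; i < kMaxLoopIterations; ++i) {
    std::string next = UnescapeOnce(current);
    if (next == current)  // operator== rather than compare(), as suggested.
      break;
    current.swap(next);
  }
  return current;
}
```

Note that this sketch only unescapes; the canonicalizer in the patch re-escapes afterwards, which is why the unit test for "http://host/%25%32%35" expects "http://host/%25".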

http://codereview.chromium.org/1275002/diff/55001/56002#newcode184
chrome/browser/safe_browsing/safe_browsing_util.cc:184: for (unsigned int j = 0;
j < url.length(); j++) {
On 2010/03/30 01:11:23, eroman wrote:
> nit: I suggest using |size_t i| as the counter type/name.

Done.

http://codereview.chromium.org/1275002/diff/55001/56002#newcode202
chrome/browser/safe_browsing/safe_browsing_util.cc:202: std::string*
canonicalized_hostname,
On 2010/03/30 01:11:23, eroman wrote:
> nit: indentation, line it up with the open parenthesis of the previous line.

Done.

http://codereview.chromium.org/1275002/diff/55001/56002#newcode211
chrome/browser/safe_browsing/safe_browsing_util.cc:211: // 4. Resolve path
sequences "/../" and "/./".
On 2010/03/30 01:11:23, eroman wrote:
> Note that after unescaping the input, there may be new unresolved '../'
> components in the "path".

this is handled by url parsing just before step 6. it removes these chars. i
also have added the unit test for this at end.

http://codereview.chromium.org/1275002/diff/55001/56002#newcode217
chrome/browser/safe_browsing/safe_browsing_util.cc:217:
f_replacements.ClearRef();
On 2010/03/30 01:11:23, eroman wrote:
> I think we should remove the Username and Password here as well.

Done.

http://codereview.chromium.org/1275002/diff/55001/56002#newcode223
chrome/browser/safe_browsing/safe_browsing_util.cc:223:
url_parse::ParseStandardURL(url_unescaped_str.c_str(),
On 2010/03/30 01:11:23, eroman wrote:
> nit: can you use .data() instead of .c_str() here?  (c_str() will sometimes
> require a re-allocation).

Done.

http://codereview.chromium.org/1275002/diff/55001/56002#newcode228
chrome/browser/safe_browsing/safe_browsing_util.cc:228: parsed.host.begin,
parsed.host.len): "";
On 2010/03/30 01:11:23, eroman wrote:
> nit: add a space between the ')' and the ':'.

Done.

http://codereview.chromium.org/1275002/diff/55001/56002#newcode235
chrome/browser/safe_browsing/safe_browsing_util.cc:235: bool isDotSet = false;
On 2010/03/30 01:11:23, eroman wrote:
> name_variables_like_this

Done.

http://codereview.chromium.org/1275002/diff/55001/56002#newcode253
chrome/browser/safe_browsing/safe_browsing_util.cc:253: // 5. In path, replace
runs of consecutive slashes with a single slash.
On 2010/03/30 01:11:23, eroman wrote:
> This is the same code as above for replacing consecutive dots.
> Can you extract it to a function, and simply specify as parameter either '.'
> or '/' ?

Done.

http://codereview.chromium.org/1275002/diff/55001/56002#newcode260
chrome/browser/safe_browsing/safe_browsing_util.cc:260:
path_without_consecutive_slash.resize(path.length());
On 2010/03/30 01:11:23, eroman wrote:
> as an optimization, could also do a search for '//' (since in the common case
> I imagine these cases will not be hit, and we can save some allocations).

Done.
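The two suggestions above (one shared helper parameterized on the character, plus a cheap search before doing any work) could look roughly like the following. This is a sketch rather than the committed code, and `RemoveConsecutiveChars` is a hypothetical name.

```cpp
#include <string>

// Hypothetical helper: collapses every run of `c` in `str` to a single `c`.
// Works for both '.' in the host and '/' in the path.
std::string RemoveConsecutiveChars(const std::string& str, char c) {
  // Fast path: in the common case no two `c` are adjacent, so return the
  // input without building a second string character by character.
  if (str.find(std::string(2, c)) == std::string::npos)
    return str;

  std::string out;
  out.reserve(str.size());
  for (std::string::size_type i = 0; i < str.size(); ++i) {
    // Skip `c` when the previous emitted character is already `c`.
    if (str[i] == c && !out.empty() && out.back() == c)
      continue;
    out.push_back(str[i]);
  }
  return out;
}
```

For example, the host would be passed with '.' and the path with '/', so the duplicated loops reduce to two calls.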

http://codereview.chromium.org/1275002/diff/55001/56002#newcode276
chrome/browser/safe_browsing/safe_browsing_util.cc:276:
hp_replacements.SetHost(host_without_consecutive_dots.c_str(),
On 2010/03/30 01:11:23, eroman wrote:
> can you use .data() instead of c_str() ?

Done.

http://codereview.chromium.org/1275002/diff/55001/56002#newcode278
chrome/browser/safe_browsing/safe_browsing_util.cc:278:
hp_replacements.SetPath(path_without_consecutive_slash.c_str(),
On 2010/03/30 01:11:23, eroman wrote:
> can you use .data() instead of c_str() ?

Done.

http://codereview.chromium.org/1275002/diff/55001/56002#newcode289
chrome/browser/safe_browsing/safe_browsing_util.cc:289: // 6. Step needed to
revert escaping done in url_util::ReplaceComponents.
On 2010/03/30 01:11:23, eroman wrote:
> Is it really necessary to re-assemble the pieces back into a URL, since the
> caller only needs the individual pieces?

even though the caller needs separate pieces, it is first necessary to combine
them into a url. the server protocol operates on the whole url, so a case like
http://abc.com/def%3fq=?f=m will translate into path /def and query q=?f=m
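To make the example concrete: splitting at the first '?' gives different results before and after unescaping, which is why the pieces are reassembled and reparsed. A small sketch, where `SplitPathAndQuery` is a hypothetical helper and not the url_parse API:

```cpp
#include <string>
#include <utility>

// Hypothetical helper: splits "path?query" at the first '?'.
// Returns (path, query); query is empty when there is no '?'.
std::pair<std::string, std::string> SplitPathAndQuery(const std::string& s) {
  std::string::size_type q = s.find('?');
  if (q == std::string::npos)
    return std::make_pair(s, std::string());
  return std::make_pair(s.substr(0, q), s.substr(q + 1));
}

// Escaped input, split per component:
//   SplitPathAndQuery("/def%3fq=?f=m")  ->  path "/def%3fq=", query "f=m"
// Same input after "%3f" has been unescaped to '?', split on the whole string:
//   SplitPathAndQuery("/def?q=?f=m")    ->  path "/def",      query "q=?f=m"
```

The second split matches the path and query described above, so canonicalizing each component in isolation would not agree with a server that operates on the whole unescaped url.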

http://codereview.chromium.org/1275002/diff/55001/56002#newcode317
chrome/browser/safe_browsing/safe_browsing_util.cc:317: const std::string host =
canon_host;  // const sidesteps GCC bugs below!
On 2010/03/30 01:11:23, eroman wrote:
> Can this variable be removed? (rename to use canon_host).

this is a gcc issue that requires host to be of const type, otherwise it fails
at line 340. it already existed like that previously. when i tried to make that
change it broke linux stdio compilation.

http://codereview.chromium.org/1275002/diff/55001/56002#newcode355
chrome/browser/safe_browsing/safe_browsing_util.cc:355: const std::string path =
canon_path;   // const sidesteps GCC bugs below!
On 2010/03/30 01:11:23, eroman wrote:
> nit: can you remove these two variables? (i.e. rename the places using |path|
> and |query| to directly use |canon_path| and |canon_query|?)

this is a gcc issue that requires path to be of const type, otherwise it fails.
it already existed like that previously. when i tried to make that change it
broke linux stdio compilation.

http://codereview.chromium.org/1275002/diff/55001/56002#newcode375
chrome/browser/safe_browsing/safe_browsing_util.cc:375: if (query.length() > 0)
On 2010/03/30 01:11:23, eroman wrote:
> nit: i suggest using |!query.empty()| instead.

Done.

http://codereview.chromium.org/1275002/diff/55001/56003
File chrome/browser/safe_browsing/safe_browsing_util.h (right):

http://codereview.chromium.org/1275002/diff/55001/56003#newcode278
chrome/browser/safe_browsing/safe_browsing_util.h:278: std::string
Unescape(const std::string& url);
On 2010/03/30 01:11:23, eroman wrote:
> Can you remove |Unescape()| and |Escape()| from the header file? Since they
> are only helper functions, they can be hidden inside of the .cc file to avoid
> people calling them.

Done.

http://codereview.chromium.org/1275002/diff/55001/56001
File chrome/browser/safe_browsing/safe_browsing_util_unittest.cc (right):

http://codereview.chromium.org/1275002/diff/55001/56001#newcode68
chrome/browser/safe_browsing/safe_browsing_util_unittest.cc:68: const
std::string input_url;
On 2010/03/30 01:11:23, eroman wrote:
> nit: I suggest using |const char*| for these instead.

Done.

http://codereview.chromium.org/1275002/diff/55001/56001#newcode223
chrome/browser/safe_browsing/safe_browsing_util_unittest.cc:223: };
On 2010/03/30 01:11:23, eroman wrote:
> These are good tests!
> 
> I additionally suggest having these tests:
> 
> - Capital letter in hostname
> - Username:password embedded in the URL.
> - fragment (#) and query (?) in the same URL.

added the last two. capital letter in hostname is already covered at line 143

eroman

LGTM! http://codereview.chromium.org/1275002/diff/55001/56002 File chrome/browser/safe_browsing/safe_browsing_util.cc (right): http://codereview.chromium.org/1275002/diff/55001/56002#newcode317 chrome/browser/safe_browsing/safe_browsing_util.cc:317: const std::string host = canon_host; // const sidesteps ...

10 years, 8 months ago (2010-03-30 17:22:45 UTC) #24

inferno

Thanks Eric. have made all code changes and committed http://codereview.chromium.org/1275002/diff/62001/63002 File chrome/browser/safe_browsing/safe_browsing_util.cc (right): http://codereview.chromium.org/1275002/diff/62001/63002#newcode198 chrome/browser/safe_browsing/safe_browsing_util.cc:198: ...

10 years, 8 months ago (2010-03-30 17:40:11 UTC) #25

bryner (do not use)

Correctness-wise this looks good to me. Just one question: are IDN hostnames already converted to ...

10 years, 8 months ago (2010-03-30 18:44:59 UTC) #26

inferno

On 2010/03/30 18:44:59, bryner wrote: > Correctness-wise this looks good to me. Just one question: ...

10 years, 8 months ago (2010-03-30 18:46:43 UTC) #27

inferno

10 years, 8 months ago (2010-03-30 18:47:39 UTC) #28

On 2010/03/30 18:46:43, inferno wrote:
> On 2010/03/30 18:44:59, bryner wrote:
> > Correctness-wise this looks good to me.  Just one question: are IDN
> > hostnames already converted to punycode at the point this is called?
> 
> I believe GURL automatically does it and we can go from there.

sorry exclude "can" from previous comment.
