|
|
Created:
10 years, 9 months ago by inferno Modified:
9 years, 7 months ago CC:
chromium-reviews, Paweł Hajdan Jr., Paul Godavari, ben+cc_chromium.org, Chris Evans, gcasto (DO NOT USE) Base URL:
svn://chrome-svn/chrome/trunk/src/ Visibility:
Public. |
DescriptionCanonicalize the url based on Section 6.1 Safe Browsing Spec.
BUG=7713
TEST=SafeBrowsingUtilTest.CanonicalizeUrl
Committed: http://src.chromium.org/viewvc/chrome?view=rev&revision=43100
Patch Set 1 #
Total comments: 2
Patch Set 2 : '' #Patch Set 3 : '' #Patch Set 4 : '' #
Total comments: 3
Patch Set 5 : '' #Patch Set 6 : '' #
Total comments: 7
Patch Set 7 : '' #Patch Set 8 : '' #Patch Set 9 : '' #Patch Set 10 : '' #Patch Set 11 : '' #
Total comments: 39
Patch Set 12 : '' #
Total comments: 6
Patch Set 13 : '' #
Messages
Total messages: 28 (0 generated)
Please review this patch. Please note that i have added unittests for everything mentioned in section 6.1 except the following 1. removed those that don't have scheme in there. gurl won't allow those. 2. remove one which had \ in hostname. gurl converts \ into /, so can't do much here. 3. remove one which had char %23 (#) in hostname. gurl won't allow one.
adding chris, eric kay on the cc list.
minor tweaks - removed extra lines, fixed license file, added if for chars_written to prevent zero resizing.
Some initial comments. Will look at the un-escaping logic next. http://codereview.chromium.org/1275002/diff/1/2 File chrome/browser/safe_browsing/safe_browsing_service.cc (right): http://codereview.chromium.org/1275002/diff/1/2#newcode72 chrome/browser/safe_browsing/safe_browsing_service.cc:72: GURL canon_url(safe_browsing_util::CanonicalizeUrl(url)); I don't think we should save the canonicalized URLs back to a GURL -- the risk is that GURL may re-transform the data. Rather, I suggest introducing a new type: SBCanonicalURL Then we can pass these objects around in place of GURL or string, to emphasize that it is an already canonicalized URL (according to SafeBrowsing spec). This will also make it easier to avoid bugs, since they may manifest as compile errors (see the next comment). http://codereview.chromium.org/1275002/diff/1/2#newcode77 chrome/browser/safe_browsing/safe_browsing_service.cc:77: check.url = canon_url; This is a problem, since we are associating the modified URL with the check. On completion of the check, clients are notified by: check->client->OnUrlCheckResult(check->url, result); These clients are expecting the URL parameter of OnUrlCheckResult() to be the same as the URL that was passed into CheckUrl(). (In fact, safe_browsing_resource_handler.cc is going to assert if that is not the case).
Summarizing our-person chat: We can continue passing around GURLs throughout, and only do the safe-browsing unescaping logic for the few locations where SB "path" and "hostname" are needed (i.e. GeneratePathsToCheck() and GenerateHostsToCheck()).
please wait for review till next patch. i will fix all the things as mentioned by Eric and also, thinking to remove reliance of gurl in canonicalizeurl function.
Is there something in particular you'd like me to look at? I'm happy to defer to eroman here.
@erickay - nothing particular :), just wanted to keep you informed about this change. i discussed this with eroman and i have now made the final patch. eroman - please review these new changes.
I do have a couple of comments after all: - Since this has the potential to change how many pages trigger the interstitial, we should keep an eye on before/after numbers on the histograms on the dev channel when this goes out to see how much this changes things. Make sure the current histograms are actually giving you enough info to do this analysis. - nit: any way you can fix the 80 col issues in the tests?
Thanks @Erikkay for the comments. I will seek Eroman's help to see histograms after he says ok on the review. on the nit, i just fixed fix the 80 column issue, will check in as part of final commit.
Note that I am going to be gone the rest of this week (thu-fri). Hopefully we can get through most of it. http://codereview.chromium.org/1275002/diff/23001/24004 File chrome/browser/safe_browsing/filter_false_positive_perftest.cc (right): http://codereview.chromium.org/1275002/diff/23001/24004#newcode190 chrome/browser/safe_browsing/filter_false_positive_perftest.cc:190: safe_browsing_util::CanonicalizeUrl(url_check), &hosts); I think it is better to leave the parameters as is, and only do the converstion from a GURL to a canonicalized SB string when it is used. For example: // Signature remains as is. GenerateHostsToCheck(const GURL&, ....) { // <do the necessary canonicalization of the hostname here>. } // Signature remains as is GeneratePathsToCheck(const GURL&, ...) { // <do the necessary canonicalization here>. }
http://codereview.chromium.org/1275002/diff/23001/24002 File chrome/browser/safe_browsing/safe_browsing_util.cc (right): http://codereview.chromium.org/1275002/diff/23001/24002#newcode198 chrome/browser/safe_browsing/safe_browsing_util.cc:198: std::string CanonicalizeUrl(const GURL& url) { How about making the interface: void CanonicalizedUrl(const GURL& url, std::string* canonicalized_hostname, std::string* canonicalized_path, std::string* canonicalized_query); This way the callers don't need to do extra steps to extract those parts. http://codereview.chromium.org/1275002/diff/23001/24002#newcode300 chrome/browser/safe_browsing/safe_browsing_util.cc:300: const std::string host = url.substr(parsed_input.host.begin, See comment above -- if ParseURL() splits up the host and path, we won't need to extract it here.
Eric, i did make all the suggested changes, please review. hope we can complete before you leave :)
http://codereview.chromium.org/1275002/diff/32002/36002 File chrome/browser/safe_browsing/safe_browsing_util.cc (right): http://codereview.chromium.org/1275002/diff/32002/36002#newcode173 chrome/browser/safe_browsing/safe_browsing_util.cc:173: UnescapeRule::NORMAL | UnescapeRule::SPACES | style-nit: continued lines indent by 4 spaces. http://codereview.chromium.org/1275002/diff/32002/36002#newcode176 chrome/browser/safe_browsing/safe_browsing_util.cc:176: ++loop_var <= kMaxLoopIterations); style-nit: please line this up with the opening '(' of the previous line. http://codereview.chromium.org/1275002/diff/32002/36002#newcode202 chrome/browser/safe_browsing/safe_browsing_util.cc:202: std::string* canonicalized_hostname, Is this supposed to include the ":port" as well? Also, does the spec say what is supposed to happen with IPv6 literal? http://codereview.chromium.org/1275002/diff/32002/36002#newcode221 chrome/browser/safe_browsing/safe_browsing_util.cc:221: std::string url_unescaped_str(UnescapeUrl(url_without_fragment.spec())); Rather than trying to work with the URL's spec(), I think it will be easier to start off by extracting the host, path and query portions: std::string host = url.host(); std::string path = url.path(); std::string query = url.query(); And then applying the canonicalizations to each component in isolation. The danger with using GURL::Replacements, is knowing exactly which pieces to subtract. For example the GURL could have a username:password embedded in it, and it doesn't look like we are stripping that right now. http://codereview.chromium.org/1275002/diff/32002/36002#newcode242 chrome/browser/safe_browsing/safe_browsing_util.cc:242: i != host_without_end_dots.end(); i++) { style-nit: In the for loops, please indent the continued line by lining it up with the parameters from the previous one. http://codereview.chromium.org/1275002/diff/32002/36001 File chrome/browser/safe_browsing/safe_browsing_util_unittest.cc (right): http://codereview.chromium.org/1275002/diff/32002/36001#newcode67 chrome/browser/safe_browsing/safe_browsing_util_unittest.cc:67: GURL("http://host/%25%32%35"), NULL, NULL, NULL), "http://host/%25"); style-nit: for indentation either use 4 spaces, or line up with the parenthesis of the previous line. http://codereview.chromium.org/1275002/diff/32002/36001#newcode77 chrome/browser/safe_browsing/safe_browsing_util_unittest.cc:77: GURL("http://host/asdf%25%32%35asd"), NULL, NULL, NULL), I think these tests would be stronger if they checked the individual host, path, query parts, since that is how it is used in the code. I suggest making the return value "void" as it is not used except by tests.
all style nits corrected. canonicalized_hostname does not include :port. spec does not mention anything about ipv6. Regarding suggestion of applying the canonicalizations to each component in isolation, this is not right way and spec says to do do it on url. e.g. if we have http://google.com//abc, we might get incorrect canonicalization. username:password does not need to be stripped. it won't be taken into consideration as only hostname and path and query is used. i will change the tests to use individual host, path, that will take up some time :)
I am making the changes after thinking more about his and based on the furthur discussion with Eroman. Basically, unescaping individual components is more secure than unescaping url completely since parsing can mess with lets say having a %2F (/) char in hostname. luckily this was not a problem since gurl does not allow such chars like %2f (/), %3f (?), etc in hostname, so any of the two solutions would be fine. So, i went ahead with making a patch with unescaping individual components. Also, this patch passes all the testcases in the spec. however the only issue i see left is with unescaping the %3f (?) in path since the results will be different in the two cases (different query,path). note than '#' char is not affected since luckily again, it is required by the spec to be escaped in final stage. We have contacted Garrett from Safe browsing team to give us some advise on which route to follow.
Adding Brian, who is our canonicalization expert. On Thu, Mar 25, 2010 at 12:27 AM, <inferno@chromium.org> wrote: > I am making the changes after thinking more about his and based on the > furthur > discussion with Eroman. Basically, unescaping individual components is more > secure than unescaping url completely since parsing can mess with lets say > having a %2F (/) char in hostname. luckily this was not a problem since gurl > does not allow such chars like %2f (/), %3f (?), etc in hostname, so any of > the > two solutions would be fine. > > So, i went ahead with making a patch with unescaping individual components. > Also, this patch passes all the testcases in the spec. however the only > issue i > see left is with unescaping the %3f (?) in path since the results will be > different in the two cases (different query,path). note than '#' char is not > affected since luckily again, it is required by the spec to be escaped in > final > stage. > So I haven't looked at the CL yet, but what is the security issue that you are worried about? That you may accidentally block a page that wasn't supposed to be blocked, or vice versa? Currently we unescape the whole url so Chrome should mirror this implementation unless we decide that there is a problem. > We have contacted Garrett from Safe browsing team to give us some advise on > which route to follow. > > http://codereview.chromium.org/1275002 >
looks like you forgot to cc brian, can you please add him. the issue is escaping the url can have bad consequence (a page that is supposed to block does not block) e.g. http://hos%2ft.com/abcdef can turn into http://hos/ft.com/abc.def (our GURL protects against this) http://host.com/ab%3fcd?q=abc can turn into http://host.com/abc?cd?q=abc (this is the confusion here. in this case, if we parse the url, the path and query meaning change). so, we were using the approach to do escaping, unescaping individually on hostname, path, query since it is more secure.
Actually adding Brian. On Thu, Mar 25, 2010 at 2:33 PM, <inferno@chromium.org> wrote: > looks like you forgot to cc brian, can you please add him. > > the issue is escaping the url can have bad consequence (a page that is > supposed > to block does not block) > > e.g. > > http://hos%2ft.com/abcdef can turn into http://hos/ft.com/abc.def (our GURL > protects against this) > http://host.com/ab%3fcd?q=abc can turn into http://host.com/abc?cd?q=abc > (this > is the confusion here. in this case, if we parse the url, the path and query > meaning change). so, we were using the approach to do escaping, unescaping > individually on hostname, path, query since it is more secure. > > http://codereview.chromium.org/1275002 >
Adding Brian
Thank you Brian for discussing this. I got your point that we need to match server side validation which currently happens by unescaping entire url. So, we would have to follow a similar protocol. I just uploaded this final patch for review. Eroman, when you come back, please review this patch and then, this feature will happily land in our browser :)
http://codereview.chromium.org/1275002/diff/55001/56002 File chrome/browser/safe_browsing/safe_browsing_util.cc (right): http://codereview.chromium.org/1275002/diff/55001/56002#newcode175 chrome/browser/safe_browsing/safe_browsing_util.cc:175: } while (unescaped_str.compare(old_unescaped_str) && ++loop_var <= rather than compare, can you use (unescaped_str != old_escaped_str) ? http://codereview.chromium.org/1275002/diff/55001/56002#newcode184 chrome/browser/safe_browsing/safe_browsing_util.cc:184: for (unsigned int j = 0; j < url.length(); j++) { nit: I suggest using |size_t i| as the counter type/name. http://codereview.chromium.org/1275002/diff/55001/56002#newcode202 chrome/browser/safe_browsing/safe_browsing_util.cc:202: std::string* canonicalized_hostname, nit: indentation, line it up with the open parenthesis of the previous line. http://codereview.chromium.org/1275002/diff/55001/56002#newcode211 chrome/browser/safe_browsing/safe_browsing_util.cc:211: // 4. Resolve path sequences "/../" and "/./". Note that after unescaping the input, there may be new unresolved '../' components in the "path". http://codereview.chromium.org/1275002/diff/55001/56002#newcode217 chrome/browser/safe_browsing/safe_browsing_util.cc:217: f_replacements.ClearRef(); I think we should remove the Username and Password here as well. http://codereview.chromium.org/1275002/diff/55001/56002#newcode223 chrome/browser/safe_browsing/safe_browsing_util.cc:223: url_parse::ParseStandardURL(url_unescaped_str.c_str(), nit: can you use .data() instead of .c_str() here? (c_str() will sometimes require a re-allocation). http://codereview.chromium.org/1275002/diff/55001/56002#newcode228 chrome/browser/safe_browsing/safe_browsing_util.cc:228: parsed.host.begin, parsed.host.len): ""; nit: add a space between the ')' and the ':'. http://codereview.chromium.org/1275002/diff/55001/56002#newcode235 chrome/browser/safe_browsing/safe_browsing_util.cc:235: bool isDotSet = false; name_variables_like_this http://codereview.chromium.org/1275002/diff/55001/56002#newcode253 chrome/browser/safe_browsing/safe_browsing_util.cc:253: // 5. In path, replace runs of consecutive slashes with a single slash. This is the same code as above for replacing consecutive dots. Can you extract it to a function, and simply specify as parameter either '.' or '/' ? http://codereview.chromium.org/1275002/diff/55001/56002#newcode260 chrome/browser/safe_browsing/safe_browsing_util.cc:260: path_without_consecutive_slash.resize(path.length()); as an optimization, could also do a search for '//' (since in the common case I imagine these cases will not be hit, and we can save some allocations. http://codereview.chromium.org/1275002/diff/55001/56002#newcode276 chrome/browser/safe_browsing/safe_browsing_util.cc:276: hp_replacements.SetHost(host_without_consecutive_dots.c_str(), can you use .data() instead of c_str() ? http://codereview.chromium.org/1275002/diff/55001/56002#newcode278 chrome/browser/safe_browsing/safe_browsing_util.cc:278: hp_replacements.SetPath(path_without_consecutive_slash.c_str(), can you use .data() instead of c_str() ? http://codereview.chromium.org/1275002/diff/55001/56002#newcode289 chrome/browser/safe_browsing/safe_browsing_util.cc:289: // 6. Step needed to revert escaping done in url_util::ReplaceComponents. Is it really necessary to re-assemble the pieces back into a URL, since the caller only needs the individual pieces? http://codereview.chromium.org/1275002/diff/55001/56002#newcode317 chrome/browser/safe_browsing/safe_browsing_util.cc:317: const std::string host = canon_host; // const sidesteps GCC bugs below! Can this variable be removed? (rename to us canon_host). http://codereview.chromium.org/1275002/diff/55001/56002#newcode355 chrome/browser/safe_browsing/safe_browsing_util.cc:355: const std::string path = canon_path; // const sidesteps GCC bugs below! nit: can you remove these two variables? (i.e. rename the places using |path| and |query| to directly use |canon_path| and |canon_query|? http://codereview.chromium.org/1275002/diff/55001/56002#newcode375 chrome/browser/safe_browsing/safe_browsing_util.cc:375: if (query.length() > 0) nit: i suggest using |!query.empty()| instead. http://codereview.chromium.org/1275002/diff/55001/56003 File chrome/browser/safe_browsing/safe_browsing_util.h (right): http://codereview.chromium.org/1275002/diff/55001/56003#newcode278 chrome/browser/safe_browsing/safe_browsing_util.h:278: std::string Unescape(const std::string& url); Can you remove |Unescape()| and |Escape()| from the header file? Since they are only helper functions, they can be hidden inside of the .cc file to avoid people calling them. http://codereview.chromium.org/1275002/diff/55001/56001 File chrome/browser/safe_browsing/safe_browsing_util_unittest.cc (right): http://codereview.chromium.org/1275002/diff/55001/56001#newcode68 chrome/browser/safe_browsing/safe_browsing_util_unittest.cc:68: const std::string input_url; nit: I suggest using |const char*| for these instead. http://codereview.chromium.org/1275002/diff/55001/56001#newcode223 chrome/browser/safe_browsing/safe_browsing_util_unittest.cc:223: }; These are good tests! I additionally suggest having these tests: - Capital letter in hostname - Username:password embedded in the URL. - fragment (#) and query (?) in the same URL.
Eric, thanks for your insightful review. i have made the changes. a few were left like url parsing to keep behavior similar to spec. let me know if these look fine. http://codereview.chromium.org/1275002/diff/55001/56002 File chrome/browser/safe_browsing/safe_browsing_util.cc (right): http://codereview.chromium.org/1275002/diff/55001/56002#newcode175 chrome/browser/safe_browsing/safe_browsing_util.cc:175: } while (unescaped_str.compare(old_unescaped_str) && ++loop_var <= On 2010/03/30 01:11:23, eroman wrote: > rather than compare, can you use (unescaped_str != old_escaped_str) ? Done. http://codereview.chromium.org/1275002/diff/55001/56002#newcode184 chrome/browser/safe_browsing/safe_browsing_util.cc:184: for (unsigned int j = 0; j < url.length(); j++) { On 2010/03/30 01:11:23, eroman wrote: > nit: I suggest using |size_t i| as the counter type/name. Done. http://codereview.chromium.org/1275002/diff/55001/56002#newcode202 chrome/browser/safe_browsing/safe_browsing_util.cc:202: std::string* canonicalized_hostname, On 2010/03/30 01:11:23, eroman wrote: > nit: indentation, line it up with the open parenthesis of the previous line. Done. http://codereview.chromium.org/1275002/diff/55001/56002#newcode211 chrome/browser/safe_browsing/safe_browsing_util.cc:211: // 4. Resolve path sequences "/../" and "/./". On 2010/03/30 01:11:23, eroman wrote: > Note that after unescaping the input, there may be new unresolved '../' > components in the "path". this is handled by url parsing just before step 6. it removes these chars. i also have added the unit test for this at end. http://codereview.chromium.org/1275002/diff/55001/56002#newcode217 chrome/browser/safe_browsing/safe_browsing_util.cc:217: f_replacements.ClearRef(); On 2010/03/30 01:11:23, eroman wrote: > I think we should remove the Username and Password here as well. Done. http://codereview.chromium.org/1275002/diff/55001/56002#newcode223 chrome/browser/safe_browsing/safe_browsing_util.cc:223: url_parse::ParseStandardURL(url_unescaped_str.c_str(), On 2010/03/30 01:11:23, eroman wrote: > nit: can you use .data() instead of .c_str() here? (c_str() will sometimes > require a re-allocation). Done. http://codereview.chromium.org/1275002/diff/55001/56002#newcode228 chrome/browser/safe_browsing/safe_browsing_util.cc:228: parsed.host.begin, parsed.host.len): ""; On 2010/03/30 01:11:23, eroman wrote: > nit: add a space between the ')' and the ':'. Done. http://codereview.chromium.org/1275002/diff/55001/56002#newcode235 chrome/browser/safe_browsing/safe_browsing_util.cc:235: bool isDotSet = false; On 2010/03/30 01:11:23, eroman wrote: > name_variables_like_this Done. http://codereview.chromium.org/1275002/diff/55001/56002#newcode253 chrome/browser/safe_browsing/safe_browsing_util.cc:253: // 5. In path, replace runs of consecutive slashes with a single slash. On 2010/03/30 01:11:23, eroman wrote: > This is the same code as above for replacing consecutive dots. > Can you extract it to a function, and simply specify as parameter either '.' or > '/' ? Done. http://codereview.chromium.org/1275002/diff/55001/56002#newcode260 chrome/browser/safe_browsing/safe_browsing_util.cc:260: path_without_consecutive_slash.resize(path.length()); On 2010/03/30 01:11:23, eroman wrote: > as an optimization, could also do a search for '//' (since in the common case I > imagine these cases will not be hit, and we can save some allocations. Done. http://codereview.chromium.org/1275002/diff/55001/56002#newcode276 chrome/browser/safe_browsing/safe_browsing_util.cc:276: hp_replacements.SetHost(host_without_consecutive_dots.c_str(), On 2010/03/30 01:11:23, eroman wrote: > can you use .data() instead of c_str() ? Done. http://codereview.chromium.org/1275002/diff/55001/56002#newcode278 chrome/browser/safe_browsing/safe_browsing_util.cc:278: hp_replacements.SetPath(path_without_consecutive_slash.c_str(), On 2010/03/30 01:11:23, eroman wrote: > can you use .data() instead of c_str() ? Done. http://codereview.chromium.org/1275002/diff/55001/56002#newcode289 chrome/browser/safe_browsing/safe_browsing_util.cc:289: // 6. Step needed to revert escaping done in url_util::ReplaceComponents. On 2010/03/30 01:11:23, eroman wrote: > Is it really necessary to re-assemble the pieces back into a URL, since the > caller only needs the individual pieces? even though caller needs seperate pieces, it is first necessary to combine it into url. server protocol operates on whole url so a case like http://abc.com/def%3fq=?f=m will translate into path /def and query q=?f=m http://codereview.chromium.org/1275002/diff/55001/56002#newcode317 chrome/browser/safe_browsing/safe_browsing_util.cc:317: const std::string host = canon_host; // const sidesteps GCC bugs below! On 2010/03/30 01:11:23, eroman wrote: > Can this variable be removed? (rename to us canon_host). this is gcc issue that requires host to be of const type, otherwise it fails in line 340. it even existed previously like that. when i tried to make that change it broken linux stdio compilation. http://codereview.chromium.org/1275002/diff/55001/56002#newcode355 chrome/browser/safe_browsing/safe_browsing_util.cc:355: const std::string path = canon_path; // const sidesteps GCC bugs below! On 2010/03/30 01:11:23, eroman wrote: > nit: can you remove these two variables? (i.e. rename the places using |path| > and |query| to directly use |canon_path| and |canon_query|? this is gcc issue that requires path to be of const type, otherwise it fails. it even existed previously like that. when i tried to make that change it broken linux stdio compilation. http://codereview.chromium.org/1275002/diff/55001/56002#newcode375 chrome/browser/safe_browsing/safe_browsing_util.cc:375: if (query.length() > 0) On 2010/03/30 01:11:23, eroman wrote: > nit: i suggest using |!query.empty()| instead. Done. http://codereview.chromium.org/1275002/diff/55001/56003 File chrome/browser/safe_browsing/safe_browsing_util.h (right): http://codereview.chromium.org/1275002/diff/55001/56003#newcode278 chrome/browser/safe_browsing/safe_browsing_util.h:278: std::string Unescape(const std::string& url); On 2010/03/30 01:11:23, eroman wrote: > Can you remove |Unescape()| and |Escape()| from the header file? Since they are > only helper functions, they can be hidden inside of the .cc file to avoid people > calling them. Done. http://codereview.chromium.org/1275002/diff/55001/56001 File chrome/browser/safe_browsing/safe_browsing_util_unittest.cc (right): http://codereview.chromium.org/1275002/diff/55001/56001#newcode68 chrome/browser/safe_browsing/safe_browsing_util_unittest.cc:68: const std::string input_url; On 2010/03/30 01:11:23, eroman wrote: > nit: I suggest using |const char*| for these instead. Done. http://codereview.chromium.org/1275002/diff/55001/56001#newcode223 chrome/browser/safe_browsing/safe_browsing_util_unittest.cc:223: }; On 2010/03/30 01:11:23, eroman wrote: > These are good tests! > > I additionally suggest having these tests: > > - Capital letter in hostname > - Username:password embedded in the URL. > - fragment (#) and query (?) in the same URL. added last two. capital letter in hostname in line 143
LGTM! http://codereview.chromium.org/1275002/diff/55001/56002 File chrome/browser/safe_browsing/safe_browsing_util.cc (right): http://codereview.chromium.org/1275002/diff/55001/56002#newcode317 chrome/browser/safe_browsing/safe_browsing_util.cc:317: const std::string host = canon_host; // const sidesteps GCC bugs below! On 2010/03/30 15:20:21, inferno wrote: > On 2010/03/30 01:11:23, eroman wrote: > > Can this variable be removed? (rename to us canon_host). > > this is gcc issue that requires host to be of const type, otherwise it fails in > line 340. it even existed previously like that. when i tried to make that change > it broken linux stdio compilation. I see, OK, please leave as is then. (I guess the issue here is the later code uses a const_reverse_iterator on |host|, and then triese to use the base() method in conjunction with the end iterator from host(), which results in a const mismatch). http://codereview.chromium.org/1275002/diff/62001/63002#newcode198 chrome/browser/safe_browsing/safe_browsing_util.cc:198: // Canonicalizes url as per Google Safe Browsing Specification. removeConsecutiveChars --> RemoveConsecutiveChars (function names start with captital letter). http://codereview.chromium.org/1275002/diff/62001/63002#newcode203 chrome/browser/safe_browsing/safe_browsing_util.cc:203: std::string* canonicalized_path, Nice, I like this new implementation! (Calling erase is a good tradeoff which should speed up the common case where there are no consecutives). One small suggestion I have is to pass |loc| to find(), so it doesn't have to start searching from the beginning of the string each time. http://codereview.chromium.org/1275002/diff/62001/63002#newcode281 chrome/browser/safe_browsing/safe_browsing_util.cc:281: std::string url_unescaped_with_can_hostpath; nit: for multi-line if's please surround with curly braces {}.
Thanks Eric. have made all code changes and committed http://codereview.chromium.org/1275002/diff/62001/63002 File chrome/browser/safe_browsing/safe_browsing_util.cc (right): http://codereview.chromium.org/1275002/diff/62001/63002#newcode198 chrome/browser/safe_browsing/safe_browsing_util.cc:198: std::string removeConsecutiveChars(const std::string& str, const char c) { On 2010/03/30 17:22:45, eroman wrote: > removeConsecutiveChars --> RemoveConsecutiveChars (function names start with > captital letter). Done. http://codereview.chromium.org/1275002/diff/62001/63002#newcode203 chrome/browser/safe_browsing/safe_browsing_util.cc:203: while ((loc = output.find(string_to_find)) != std::string::npos) { On 2010/03/30 17:22:45, eroman wrote: > Nice, I like this new implementation! (Calling erase is a good tradeoff which > should speed up the common case where there are no consecutives). > > One small suggestion I have is to pass |loc| to find(), so it doesn't have to > start searching from the beginning of the string each time. Done. http://codereview.chromium.org/1275002/diff/62001/63002#newcode281 chrome/browser/safe_browsing/safe_browsing_util.cc:281: if (canonicalized_hostname && final_parsed.host.len > 0) On 2010/03/30 17:22:45, eroman wrote: > nit: for multi-line if's please surround with curly braces {}. Done.
Correctness-wise this looks good to me. Just one question: are IDN hostnames already converted to punycode at this point this is called?
On 2010/03/30 18:44:59, bryner wrote: > Correctness-wise this looks good to me. Just one question: are IDN hostnames > already converted to punycode at this point this is called? I believe GURL automatically does it and we can go from there.
On 2010/03/30 18:46:43, inferno wrote: > On 2010/03/30 18:44:59, bryner wrote: > > Correctness-wise this looks good to me. Just one question: are IDN hostnames > > already converted to punycode at this point this is called? > > I believe GURL automatically does it and we can go from there. sorry exclude "can" from previous comment. |