|
|
Chromium Code Reviews|
Created:
3 years, 8 months ago by pkalinnikov Modified:
3 years, 8 months ago CC:
chromium-reviews, subresource-filter-reviews_chromium.org Target Ref:
refs/heads/master Project:
chromium Visibility:
Public. |
Description[IndexedRuleset] Improve worst-case domain list matching.
Some URL rules have long domain lists (up to 100-200 items). The previous
solution simply scanned the list linearly whenever the rule was being matched
against a network request, and checked each domain to be a subdomain of the
document's origin.
The CL replaces this algorithm. Now the origin which is matched against the
list is scanned. Each of its subdomains (in descending order of length) gets
binary-searched in the sorted list of domains, until some of them is found.
The old linear scanning still has place for short domain lists. Quick
measurements have shown that the length threshold of 5 achieves a decent
trade-off. The new binary-search-based algorithm activates only for domain
lists longer than 5 items.
Additionally, the CL changes the FlatBuffers format of the ruleset by splitting
domain list into 2 parts (included/excluded) instead of embedding the exclusion
bit into domain strings.
This CL should improve the long-tail/worst-case matching performance.
BUG=708458
Review-Url: https://codereview.chromium.org/2816693002
Cr-Commit-Position: refs/heads/master@{#464457}
Committed: https://chromium.googlesource.com/chromium/src/+/154fc5ad392d229570cf7332ce8fb686da7d96c0
Patch Set 1 #
Total comments: 8
Patch Set 2 : Clean up, add unittest. #
Total comments: 40
Patch Set 3 : Address comments. #
Total comments: 14
Patch Set 4 : Address more comments. #Patch Set 5 : Add TODO. #
Messages
Total messages: 43 (24 generated)
The CQ bit was checked by pkalinnikov@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
pkalinnikov@chromium.org changed reviewers: + csharrison@chromium.org, engedy@chromium.org
PTAL. I will add more tests with domains lists soon.
https://codereview.chromium.org/2816693002/diff/1/components/subresource_filt... File components/subresource_filter/core/common/indexed_ruleset.cc (right): https://codereview.chromium.org/2816693002/diff/1/components/subresource_filt... components/subresource_filter/core/common/indexed_ruleset.cc:368: flatbuffers::uoffset_t left = 0; I will add more comments here later. E.g, why |left| is shared between all iterations.
Some self-comments. Please don't review now, will fix them and return back to you. https://codereview.chromium.org/2816693002/diff/1/components/subresource_filt... File components/subresource_filter/core/common/indexed_ruleset.cc (right): https://codereview.chromium.org/2816693002/diff/1/components/subresource_filt... components/subresource_filter/core/common/indexed_ruleset.cc:350: size_t DomainListMatch(const url::Origin& origin, const FlatDomains& domains) { Rename and comment. https://codereview.chromium.org/2816693002/diff/1/components/subresource_filt... components/subresource_filter/core/common/indexed_ruleset.cc:351: if (domains.size() <= 5) { This 5 here is the best trade-off according to my tests. Add a comment. https://codereview.chromium.org/2816693002/diff/1/components/subresource_filt... components/subresource_filter/core/common/indexed_ruleset.cc:374: flatbuffers::uoffset_t right = domains.size(); Actually, std::lower_bound supports comparators for the case when the type of |lhs| is not the same as of |rhs|. Will use it here.
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
Looking forward to tests too. :-)
The CQ bit was checked by pkalinnikov@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Ready for review.
https://codereview.chromium.org/2816693002/diff/1/components/subresource_filt... File components/subresource_filter/core/common/indexed_ruleset.cc (right): https://codereview.chromium.org/2816693002/diff/1/components/subresource_filt... components/subresource_filter/core/common/indexed_ruleset.cc:350: size_t DomainListMatch(const url::Origin& origin, const FlatDomains& domains) { On 2017/04/12 11:23:21, pkalinnikov wrote: > Rename and comment. Done. https://codereview.chromium.org/2816693002/diff/1/components/subresource_filt... components/subresource_filter/core/common/indexed_ruleset.cc:351: if (domains.size() <= 5) { On 2017/04/12 11:23:21, pkalinnikov wrote: > This 5 here is the best trade-off according to my tests. Add a comment. Done. https://codereview.chromium.org/2816693002/diff/1/components/subresource_filt... components/subresource_filter/core/common/indexed_ruleset.cc:368: flatbuffers::uoffset_t left = 0; On 2017/04/12 11:02:07, pkalinnikov wrote: > I will add more comments here later. E.g, why |left| is shared between all > iterations. Done. https://codereview.chromium.org/2816693002/diff/1/components/subresource_filt... components/subresource_filter/core/common/indexed_ruleset.cc:374: flatbuffers::uoffset_t right = domains.size(); On 2017/04/12 11:23:21, pkalinnikov wrote: > Actually, std::lower_bound supports comparators for the case when the type of > |lhs| is not the same as of |rhs|. Will use it here. UPD. Can't std::use lower_bound because it works with iterator-like elements only (it would try dereferencing it). Also flatbuffers::Vector does not provide random access iterators.
Could you please explain in the CL description the rationale for this change?
Generally looks good. I think it needs a bit more documentation of the high level algorithm. https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... File components/subresource_filter/core/common/indexed_ruleset.cc (right): https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:79: auto precedes = [&builder](FlatStringOffset lhs, FlatStringOffset rhs) { Briefly document this function. https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:88: if (!domains_included.empty()) { Document why these are stored in sorted order. https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:358: CompareCaseInsensitiveASCII(domain_piece, target_domain) <= 0); Musing: Does this really need to be insenstive? The only uppercase bits in canonicalized hosts will be percent encodings. https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:377: // Otherwise look for each subdomain of the |origin| using binary search. Have you looked into using standard library's binary search?
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
LGTM % comments. https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... File components/subresource_filter/core/common/indexed_ruleset.cc (right): https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:65: // Reserve only for |domains_included| because it is more commonly used. nit: ... it is expected to be the one used more frequently. https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:71: HasNoUpperAscii(domain) ? domain : base::ToLowerASCII(domain)); The proto definition defines |domain| to be UTF-8. Could you please double-check that this is correct, and if so, add a comment about it? https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:80: auto* lhs_string = flatbuffers::GetTemporaryPointer(*builder, lhs); nit: const auto* https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:89: std::sort(domains_included.begin(), domains_included.end(), precedes); How come we didn't have to sort previously? https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:351: // Returns whether the |domain| precedes the |target_domain| in the domain list, phrasing nit: ... |domain| either matches or precedes ... https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:355: const base::StringPiece domain_piece = ToStringPiece(domain); nit: Have you considered doing the conversion to StringPiece at the call site? Unless there is a severe performance penalty, that would make it more obvious that this is just an operator <=, and there is no requirement which parameter must come be the search key and which must come from the list. https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:358: CompareCaseInsensitiveASCII(domain_piece, target_domain) <= 0); Same thing about UTF-8 vs. ASCII here. https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:379: base::StringPiece canonicalized_host(origin.host()); nit: Could you please add a DCHECK here that this is not a unique origin? https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:438: size_t max_included_length = 0; nit: longest_matching_included_domain_length, it looks like it's not going to create a lot of wrapping. https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:442: is_match = !!max_included_length; if (!max_included_length) return false; and then you can get rid of |is_match| entirely. https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... File components/subresource_filter/core/common/indexed_ruleset_unittest.cc (right): https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset_unittest.cc:376: } kTestCases[] = { Let's add some tests with: -- leading and/or trailing '.' characters, and -- consecutive ".." characters, -- with multiple excluded subdomains. https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset_unittest.cc:415: {{"domain.com", "~sub.domain.com"}, "http://ssub.domain.com", false}, Okay, by this point, `domain` totally stopped seeming like a real word... (https://xkcd.com/1046/) https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset_unittest.cc:468: domains.push_back("c.sub." + domain); nit: Can you also add: ~aa.sub, ~ab.sub, ~ba.sub, ~bb.sub for good measure?
https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... File components/subresource_filter/core/common/indexed_ruleset.cc (right): https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:71: HasNoUpperAscii(domain) ? domain : base::ToLowerASCII(domain)); On 2017/04/12 15:35:45, engedy wrote: > The proto definition defines |domain| to be UTF-8. Could you please double-check > that this is correct, and if so, add a comment about it? That's odd. I don't think domains need to be utf8 they should be straight up canonicalized and ascii. https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:89: std::sort(domains_included.begin(), domains_included.end(), precedes); On 2017/04/12 15:35:45, engedy wrote: > How come we didn't have to sort previously? I'm guessing it's because we are now binary searching but agreed it needs some docs.
The CQ bit was checked by pkalinnikov@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Addressed, thanks! Will you PTAL again? https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... File components/subresource_filter/core/common/indexed_ruleset.cc (right): https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:65: // Reserve only for |domains_included| because it is more commonly used. On 2017/04/12 15:35:45, engedy wrote: > nit: ... it is expected to be the one used more frequently. Done. https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:71: HasNoUpperAscii(domain) ? domain : base::ToLowerASCII(domain)); On 2017/04/12 15:35:45, engedy wrote: > The proto definition defines |domain| to be UTF-8. Could you please double-check > that this is correct, and if so, add a comment about it? This is correct. Done. https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:71: HasNoUpperAscii(domain) ? domain : base::ToLowerASCII(domain)); On 2017/04/12 15:41:11, Charlie Harrison wrote: > On 2017/04/12 15:35:45, engedy wrote: > > The proto definition defines |domain| to be UTF-8. Could you please > double-check > > that this is correct, and if so, add a comment about it? > > That's odd. I don't think domains need to be utf8 they should be straight up > canonicalized and ascii. I think it's okay either way as soon as the matching phase considers same-encoded domains. Those could be both IDNs or both UTF-8. I leave it as a follow-up TODO. https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:79: auto precedes = [&builder](FlatStringOffset lhs, FlatStringOffset rhs) { On 2017/04/12 15:19:50, Charlie Harrison wrote: > Briefly document this function. Done. https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:80: auto* lhs_string = flatbuffers::GetTemporaryPointer(*builder, lhs); On 2017/04/12 15:35:45, engedy wrote: > nit: const auto* Done. https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:88: if (!domains_included.empty()) { On 2017/04/12 15:19:50, Charlie Harrison wrote: > Document why these are stored in sorted order. Done. https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:89: std::sort(domains_included.begin(), domains_included.end(), precedes); On 2017/04/12 15:41:11, Charlie Harrison wrote: > On 2017/04/12 15:35:45, engedy wrote: > > How come we didn't have to sort previously? > > I'm guessing it's because we are now binary searching but agreed it needs some > docs. Done. https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:351: // Returns whether the |domain| precedes the |target_domain| in the domain list, On 2017/04/12 15:35:45, engedy wrote: > phrasing nit: ... |domain| either matches or precedes ... Done. https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:355: const base::StringPiece domain_piece = ToStringPiece(domain); On 2017/04/12 15:35:45, engedy wrote: > nit: Have you considered doing the conversion to StringPiece at the call site? > Unless there is a severe performance penalty, that would make it more obvious > that this is just an operator <=, and there is no requirement which parameter > must come be the search key and which must come from the list. Done. See CompareDomains. https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:358: CompareCaseInsensitiveASCII(domain_piece, target_domain) <= 0); On 2017/04/12 15:19:50, Charlie Harrison wrote: > Musing: Does this really need to be insenstive? The only uppercase bits in > canonicalized hosts will be percent encodings. Indeed. Done. https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:358: CompareCaseInsensitiveASCII(domain_piece, target_domain) <= 0); On 2017/04/12 15:35:45, engedy wrote: > Same thing about UTF-8 vs. ASCII here. Not relevant anymore, because the strings are canonicalized and compared byte-wise. As for the domains in the domain list, left a TODO to solve the non-ASCII case later. https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:377: // Otherwise look for each subdomain of the |origin| using binary search. On 2017/04/12 15:19:50, Charlie Harrison wrote: > Have you looked into using standard library's binary search? Yes. There are 2 concerns with it: 1. std::binary_search/lower_bound/etc work with iterators/pointers. flatbuffers::Vector does not provide random access iterators. 2. There is a workaround - make flatbuffers::uoffset_t pretending to be an iterator via a wrapper. IMHO, this is more boilerplate than just writing a binary search. https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:379: base::StringPiece canonicalized_host(origin.host()); On 2017/04/12 15:35:45, engedy wrote: > nit: Could you please add a DCHECK here that this is not a unique origin? Done. https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:438: size_t max_included_length = 0; On 2017/04/12 15:35:45, engedy wrote: > nit: longest_matching_included_domain_length, it looks like it's not going to > create a lot of wrapping. Done. https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:442: is_match = !!max_included_length; On 2017/04/12 15:35:45, engedy wrote: > if (!max_included_length) > return false; > > and then you can get rid of |is_match| entirely. How about this? https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... File components/subresource_filter/core/common/indexed_ruleset_unittest.cc (right): https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset_unittest.cc:376: } kTestCases[] = { On 2017/04/12 15:35:45, engedy wrote: > Let's add some tests with: > -- leading and/or trailing '.' characters, and > -- consecutive ".." characters, > -- with multiple excluded subdomains. Done. https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset_unittest.cc:415: {{"domain.com", "~sub.domain.com"}, "http://ssub.domain.com", false}, On 2017/04/12 15:35:45, engedy wrote: > Okay, by this point, `domain` totally stopped seeming like a real word... > (https://xkcd.com/1046/) Acknowledged. https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset_unittest.cc:468: domains.push_back("c.sub." + domain); On 2017/04/12 15:35:45, engedy wrote: > nit: Can you also add: ~aa.sub, ~ab.sub, ~ba.sub, ~bb.sub for good measure? Done.
Description was changed from ========== [IndexedRuleset] Improve worst-case domain list matching. BUG=708458 ========== to ========== [IndexedRuleset] Improve worst-case domain list matching. Some URL rules have long domain lists (up to 100-200 items). The previous solution simply scanned the list linearly whenever the rule was being matched against a network request, and checked each domain to be a subdomain of the document's origin. The CL replaces this algorithm. Now the origin which is matched against the list is scanned. Each of its subdomains (in descending order of length) gets binary-searched in the sorted list of domains, until some of them is found. The old linear scanning still has place for short domain lists. Quick measurements have shown that the length threshold of 5 achieves a decent trade-off. The new binary-search-based algorithm activates only for domain lists longer than 5 items. Additionally, the CL changes the FlatBuffers format of the ruleset by splitting domain list into 2 parts (included/excluded) instead of embedding the exclusion bit into domain strings. BUG=708458 ==========
Description was changed from ========== [IndexedRuleset] Improve worst-case domain list matching. Some URL rules have long domain lists (up to 100-200 items). The previous solution simply scanned the list linearly whenever the rule was being matched against a network request, and checked each domain to be a subdomain of the document's origin. The CL replaces this algorithm. Now the origin which is matched against the list is scanned. Each of its subdomains (in descending order of length) gets binary-searched in the sorted list of domains, until some of them is found. The old linear scanning still has place for short domain lists. Quick measurements have shown that the length threshold of 5 achieves a decent trade-off. The new binary-search-based algorithm activates only for domain lists longer than 5 items. Additionally, the CL changes the FlatBuffers format of the ruleset by splitting domain list into 2 parts (included/excluded) instead of embedding the exclusion bit into domain strings. BUG=708458 ========== to ========== [IndexedRuleset] Improve worst-case domain list matching. Some URL rules have long domain lists (up to 100-200 items). The previous solution simply scanned the list linearly whenever the rule was being matched against a network request, and checked each domain to be a subdomain of the document's origin. The CL replaces this algorithm. Now the origin which is matched against the list is scanned. Each of its subdomains (in descending order of length) gets binary-searched in the sorted list of domains, until some of them is found. The old linear scanning still has place for short domain lists. Quick measurements have shown that the length threshold of 5 achieves a decent trade-off. The new binary-search-based algorithm activates only for domain lists longer than 5 items. Additionally, the CL changes the FlatBuffers format of the ruleset by splitting domain list into 2 parts (included/excluded) instead of embedding the exclusion bit into domain strings. This CL shoud improve the long-tail/worst-case matching performance. BUG=708458 ==========
Description was changed from ========== [IndexedRuleset] Improve worst-case domain list matching. Some URL rules have long domain lists (up to 100-200 items). The previous solution simply scanned the list linearly whenever the rule was being matched against a network request, and checked each domain to be a subdomain of the document's origin. The CL replaces this algorithm. Now the origin which is matched against the list is scanned. Each of its subdomains (in descending order of length) gets binary-searched in the sorted list of domains, until some of them is found. The old linear scanning still has place for short domain lists. Quick measurements have shown that the length threshold of 5 achieves a decent trade-off. The new binary-search-based algorithm activates only for domain lists longer than 5 items. Additionally, the CL changes the FlatBuffers format of the ruleset by splitting domain list into 2 parts (included/excluded) instead of embedding the exclusion bit into domain strings. This CL shoud improve the long-tail/worst-case matching performance. BUG=708458 ========== to ========== [IndexedRuleset] Improve worst-case domain list matching. Some URL rules have long domain lists (up to 100-200 items). The previous solution simply scanned the list linearly whenever the rule was being matched against a network request, and checked each domain to be a subdomain of the document's origin. The CL replaces this algorithm. Now the origin which is matched against the list is scanned. Each of its subdomains (in descending order of length) gets binary-searched in the sorted list of domains, until some of them is found. The old linear scanning still has place for short domain lists. Quick measurements have shown that the length threshold of 5 achieves a decent trade-off. The new binary-search-based algorithm activates only for domain lists longer than 5 items. Additionally, the CL changes the FlatBuffers format of the ruleset by splitting domain list into 2 parts (included/excluded) instead of embedding the exclusion bit into domain strings. This CL should improve the long-tail/worst-case matching performance. BUG=708458 ==========
LGTM % comments https://codereview.chromium.org/2816693002/diff/40001/components/subresource_... File components/subresource_filter/core/common/indexed_ruleset.cc (right): https://codereview.chromium.org/2816693002/diff/40001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:409: DCHECK_LT(left, domains.size()); Might be nice to move the DCHECK into the loop so we check this every iteration. https://codereview.chromium.org/2816693002/diff/40001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:448: return (is_generic || longest_matching_included_domain_length) && optional: Consider breaking this up into a few sub variables with meaningful names. The logic is not trivial.
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
Still LGTM % comments, thanks! https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... File components/subresource_filter/core/common/indexed_ruleset.cc (right): https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:89: std::sort(domains_included.begin(), domains_included.end(), precedes); On 2017/04/13 12:09:08, pkalinnikov wrote: > On 2017/04/12 15:41:11, Charlie Harrison wrote: > > On 2017/04/12 15:35:45, engedy wrote: > > > How come we didn't have to sort previously? > > > > I'm guessing it's because we are now binary searching but agreed it needs some > > docs. > > Done. Can you please still explain to me why we only need to sort now? https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... File components/subresource_filter/core/common/indexed_ruleset_unittest.cc (right): https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset_unittest.cc:415: {{"domain.com", "~sub.domain.com"}, "http://ssub.domain.com", false}, On 2017/04/13 12:09:08, pkalinnikov wrote: > On 2017/04/12 15:35:45, engedy wrote: > > Okay, by this point, `domain` totally stopped seeming like a real word... > > (https://xkcd.com/1046/) > > Acknowledged. Glad you agree. :-) https://codereview.chromium.org/2816693002/diff/40001/components/subresource_... File components/subresource_filter/core/common/indexed_ruleset.cc (right): https://codereview.chromium.org/2816693002/diff/40001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:32: // Returns comparison result for two domains. For a list of domains sorted nit: How about saying: Performs three-way comparison between two domains. In the total order defined by this predicate, the lengths of domains will be monotonically decreasing. https://codereview.chromium.org/2816693002/diff/40001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:84: // Note: The |domain| can have non-ASCII UTF-8 characters, but these nit: ... but ToLowerASCII leaves these intact. https://codereview.chromium.org/2816693002/diff/40001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:86: // TODO(pkalinnikov): Put non-ASCII characters to lower case as well. nit: s/Put/Convert/ https://codereview.chromium.org/2816693002/diff/40001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:87: // TODO(pkalinnikov): Possibly convert to IDN here or directly assume nit: ... convert Punycode to IDN ... https://codereview.chromium.org/2816693002/diff/40001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:448: return (is_generic || longest_matching_included_domain_length) && On 2017/04/13 12:39:36, Charlie Harrison wrote: > optional: Consider breaking this up into a few sub variables with meaningful > names. The logic is not trivial. Let's find some middle ground between this and the previous form.
The CQ bit was checked by pkalinnikov@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Addressed more comments. PTAL. https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... File components/subresource_filter/core/common/indexed_ruleset.cc (right): https://codereview.chromium.org/2816693002/diff/20001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:89: std::sort(domains_included.begin(), domains_included.end(), precedes); On 2017/04/13 12:59:14, engedy wrote: > On 2017/04/13 12:09:08, pkalinnikov wrote: > > On 2017/04/12 15:41:11, Charlie Harrison wrote: > > > On 2017/04/12 15:35:45, engedy wrote: > > > > How come we didn't have to sort previously? > > > > > > I'm guessing it's because we are now binary searching but agreed it needs > some > > > docs. > > > > Done. > > Can you please still explain to me why we only need to sort now? We didn't have to sort, because the previous solution scanned domain lists end-to-end without any assumptions on their order. In this CL the naive scanning algorithm assumes so, and that is why it can break the loop when a match is found. https://codereview.chromium.org/2816693002/diff/40001/components/subresource_... File components/subresource_filter/core/common/indexed_ruleset.cc (right): https://codereview.chromium.org/2816693002/diff/40001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:32: // Returns comparison result for two domains. For a list of domains sorted On 2017/04/13 12:59:14, engedy wrote: > nit: How about saying: > > Performs three-way comparison between two domains. In the total order defined by > this predicate, the lengths of domains will be monotonically decreasing. Done. https://codereview.chromium.org/2816693002/diff/40001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:84: // Note: The |domain| can have non-ASCII UTF-8 characters, but these On 2017/04/13 12:59:14, engedy wrote: > nit: ... but ToLowerASCII leaves these intact. Done. https://codereview.chromium.org/2816693002/diff/40001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:86: // TODO(pkalinnikov): Put non-ASCII characters to lower case as well. On 2017/04/13 12:59:14, engedy wrote: > nit: s/Put/Convert/ Done. https://codereview.chromium.org/2816693002/diff/40001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:87: // TODO(pkalinnikov): Possibly convert to IDN here or directly assume On 2017/04/13 12:59:14, engedy wrote: > nit: ... convert Punycode to IDN ... Done. https://codereview.chromium.org/2816693002/diff/40001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:409: DCHECK_LT(left, domains.size()); On 2017/04/13 12:39:36, Charlie Harrison wrote: > Might be nice to move the DCHECK into the loop so we check this every iteration. Not sure it makes sense to DCHECK |left| in the loop. I added a DCHECK for |middle| because it is used as index into |domains|. Does this look good? https://codereview.chromium.org/2816693002/diff/40001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:448: return (is_generic || longest_matching_included_domain_length) && On 2017/04/13 12:59:14, engedy wrote: > On 2017/04/13 12:39:36, Charlie Harrison wrote: > > optional: Consider breaking this up into a few sub variables with meaningful > > names. The logic is not trivial. > > Let's find some middle ground between this and the previous form. How about this?
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
still LGTM https://codereview.chromium.org/2816693002/diff/40001/components/subresource_... File components/subresource_filter/core/common/indexed_ruleset.cc (right): https://codereview.chromium.org/2816693002/diff/40001/components/subresource_... components/subresource_filter/core/common/indexed_ruleset.cc:409: DCHECK_LT(left, domains.size()); On 2017/04/13 14:40:24, pkalinnikov wrote: > On 2017/04/13 12:39:36, Charlie Harrison wrote: > > Might be nice to move the DCHECK into the loop so we check this every > iteration. > > Not sure it makes sense to DCHECK |left| in the loop. I added a DCHECK for > |middle| because it is used as index into |domains|. Does this look good? Yeah that works better for me.
LGTM.
Added a TODO.
The CQ bit was checked by pkalinnikov@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from engedy@chromium.org, csharrison@chromium.org Link to the patchset: https://codereview.chromium.org/2816693002/#ps80001 (title: "Add TODO.")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
CQ is committing da patch.
Bot data: {"patchset_id": 80001, "attempt_start_ts": 1492101351420880,
"parent_rev": "a0bd2c53f52b94f64ac1521e3bdea8da54c870b4", "commit_rev":
"154fc5ad392d229570cf7332ce8fb686da7d96c0"}
Message was sent while issue was closed.
Description was changed from ========== [IndexedRuleset] Improve worst-case domain list matching. Some URL rules have long domain lists (up to 100-200 items). The previous solution simply scanned the list linearly whenever the rule was being matched against a network request, and checked each domain to be a subdomain of the document's origin. The CL replaces this algorithm. Now the origin which is matched against the list is scanned. Each of its subdomains (in descending order of length) gets binary-searched in the sorted list of domains, until some of them is found. The old linear scanning still has place for short domain lists. Quick measurements have shown that the length threshold of 5 achieves a decent trade-off. The new binary-search-based algorithm activates only for domain lists longer than 5 items. Additionally, the CL changes the FlatBuffers format of the ruleset by splitting domain list into 2 parts (included/excluded) instead of embedding the exclusion bit into domain strings. This CL should improve the long-tail/worst-case matching performance. BUG=708458 ========== to ========== [IndexedRuleset] Improve worst-case domain list matching. Some URL rules have long domain lists (up to 100-200 items). The previous solution simply scanned the list linearly whenever the rule was being matched against a network request, and checked each domain to be a subdomain of the document's origin. The CL replaces this algorithm. Now the origin which is matched against the list is scanned. Each of its subdomains (in descending order of length) gets binary-searched in the sorted list of domains, until some of them is found. The old linear scanning still has place for short domain lists. Quick measurements have shown that the length threshold of 5 achieves a decent trade-off. The new binary-search-based algorithm activates only for domain lists longer than 5 items. Additionally, the CL changes the FlatBuffers format of the ruleset by splitting domain list into 2 parts (included/excluded) instead of embedding the exclusion bit into domain strings. This CL should improve the long-tail/worst-case matching performance. BUG=708458 Review-Url: https://codereview.chromium.org/2816693002 Cr-Commit-Position: refs/heads/master@{#464457} Committed: https://chromium.googlesource.com/chromium/src/+/154fc5ad392d229570cf7332ce8f... ==========
Message was sent while issue was closed.
Committed patchset #5 (id:80001) as https://chromium.googlesource.com/chromium/src/+/154fc5ad392d229570cf7332ce8f... |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
