Issue 2816693002: [IndexedRuleset] Improve worst-case domain list matching.

pkalinnikov

The CQ bit was checked by pkalinnikov@chromium.org to run a CQ dry run

3 years, 8 months ago (2017-04-12 10:59:38 UTC) #1

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2816693002/1

3 years, 8 months ago (2017-04-12 10:59:51 UTC) #2

pkalinnikov

pkalinnikov@chromium.org changed reviewers: + csharrison@chromium.org, engedy@chromium.org

3 years, 8 months ago (2017-04-12 11:00:43 UTC) #3

pkalinnikov

https://codereview.chromium.org/2816693002/diff/1/components/subresource_filter/core/common/indexed_ruleset.cc File components/subresource_filter/core/common/indexed_ruleset.cc (right): https://codereview.chromium.org/2816693002/diff/1/components/subresource_filter/core/common/indexed_ruleset.cc#newcode368 components/subresource_filter/core/common/indexed_ruleset.cc:368: flatbuffers::uoffset_t left = 0; I will add more comments ...

3 years, 8 months ago (2017-04-12 11:02:07 UTC) #5

pkalinnikov

Some self-comments. Please don't review yet; I will fix them and get back to you. https://codereview.chromium.org/2816693002/diff/1/components/subresource_filter/core/common/indexed_ruleset.cc ...

3 years, 8 months ago (2017-04-12 11:23:21 UTC) #6

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 8 months ago (2017-04-12 11:42:39 UTC) #7

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

3 years, 8 months ago (2017-04-12 11:42:40 UTC) #8

pkalinnikov

The CQ bit was checked by pkalinnikov@chromium.org to run a CQ dry run

3 years, 8 months ago (2017-04-12 14:11:46 UTC) #10

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2816693002/20001

3 years, 8 months ago (2017-04-12 14:12:07 UTC) #11

pkalinnikov

https://codereview.chromium.org/2816693002/diff/1/components/subresource_filter/core/common/indexed_ruleset.cc File components/subresource_filter/core/common/indexed_ruleset.cc (right): https://codereview.chromium.org/2816693002/diff/1/components/subresource_filter/core/common/indexed_ruleset.cc#newcode350 components/subresource_filter/core/common/indexed_ruleset.cc:350: size_t DomainListMatch(const url::Origin& origin, const FlatDomains& domains) { On ...

3 years, 8 months ago (2017-04-12 14:19:50 UTC) #13

engedy

Could you please explain in the CL description the rationale for this change?

3 years, 8 months ago (2017-04-12 14:22:37 UTC) #14

Charlie Harrison

Generally looks good. I think it needs a bit more documentation of the high level ...

3 years, 8 months ago (2017-04-12 15:19:50 UTC) #15

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 8 months ago (2017-04-12 15:32:53 UTC) #16

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

3 years, 8 months ago (2017-04-12 15:32:54 UTC) #17

engedy

LGTM % comments. https://codereview.chromium.org/2816693002/diff/20001/components/subresource_filter/core/common/indexed_ruleset.cc File components/subresource_filter/core/common/indexed_ruleset.cc (right): https://codereview.chromium.org/2816693002/diff/20001/components/subresource_filter/core/common/indexed_ruleset.cc#newcode65 components/subresource_filter/core/common/indexed_ruleset.cc:65: // Reserve only for |domains_included| because ...

3 years, 8 months ago (2017-04-12 15:35:45 UTC) #18

Charlie Harrison

https://codereview.chromium.org/2816693002/diff/20001/components/subresource_filter/core/common/indexed_ruleset.cc File components/subresource_filter/core/common/indexed_ruleset.cc (right): https://codereview.chromium.org/2816693002/diff/20001/components/subresource_filter/core/common/indexed_ruleset.cc#newcode71 components/subresource_filter/core/common/indexed_ruleset.cc:71: HasNoUpperAscii(domain) ? domain : base::ToLowerASCII(domain)); On 2017/04/12 15:35:45, engedy ...

3 years, 8 months ago (2017-04-12 15:41:11 UTC) #19

pkalinnikov

The CQ bit was checked by pkalinnikov@chromium.org to run a CQ dry run

3 years, 8 months ago (2017-04-13 12:08:22 UTC) #20

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2816693002/40001

3 years, 8 months ago (2017-04-13 12:08:34 UTC) #21

pkalinnikov

3 years, 8 months ago (2017-04-13 12:09:09 UTC) #22

Addressed, thanks! Will you PTAL again?

https://codereview.chromium.org/2816693002/diff/20001/components/subresource_...
File components/subresource_filter/core/common/indexed_ruleset.cc (right):

https://codereview.chromium.org/2816693002/diff/20001/components/subresource_...
components/subresource_filter/core/common/indexed_ruleset.cc:65: // Reserve only
for |domains_included| because it is more commonly used.
On 2017/04/12 15:35:45, engedy wrote:
> nit: ... it is expected to be the one used more frequently.

Done.

https://codereview.chromium.org/2816693002/diff/20001/components/subresource_...
components/subresource_filter/core/common/indexed_ruleset.cc:71:
HasNoUpperAscii(domain) ? domain : base::ToLowerASCII(domain));
On 2017/04/12 15:35:45, engedy wrote:
> The proto definition defines |domain| to be UTF-8. Could you please
> double-check that this is correct, and if so, add a comment about it?

This is correct. Done.

https://codereview.chromium.org/2816693002/diff/20001/components/subresource_...
components/subresource_filter/core/common/indexed_ruleset.cc:71:
HasNoUpperAscii(domain) ? domain : base::ToLowerASCII(domain));
On 2017/04/12 15:41:11, Charlie Harrison wrote:
> On 2017/04/12 15:35:45, engedy wrote:
> > The proto definition defines |domain| to be UTF-8. Could you please
> > double-check that this is correct, and if so, add a comment about it?
> 
> That's odd. I don't think domains need to be UTF-8; they should be straight up
> canonicalized and ASCII.

I think it's okay either way as long as the matching phase compares domains
that use the same encoding. Both could be IDNs or both UTF-8. I'll leave it as
a follow-up TODO.

https://codereview.chromium.org/2816693002/diff/20001/components/subresource_...
components/subresource_filter/core/common/indexed_ruleset.cc:79: auto precedes =
[&builder](FlatStringOffset lhs, FlatStringOffset rhs) {
On 2017/04/12 15:19:50, Charlie Harrison wrote:
> Briefly document this function.

Done.

https://codereview.chromium.org/2816693002/diff/20001/components/subresource_...
components/subresource_filter/core/common/indexed_ruleset.cc:80: auto*
lhs_string = flatbuffers::GetTemporaryPointer(*builder, lhs);
On 2017/04/12 15:35:45, engedy wrote:
> nit: const auto*

Done.

https://codereview.chromium.org/2816693002/diff/20001/components/subresource_...
components/subresource_filter/core/common/indexed_ruleset.cc:88: if
(!domains_included.empty()) {
On 2017/04/12 15:19:50, Charlie Harrison wrote:
> Document why these are stored in sorted order.

Done.

https://codereview.chromium.org/2816693002/diff/20001/components/subresource_...
components/subresource_filter/core/common/indexed_ruleset.cc:89:
std::sort(domains_included.begin(), domains_included.end(), precedes);
On 2017/04/12 15:41:11, Charlie Harrison wrote:
> On 2017/04/12 15:35:45, engedy wrote:
> > How come we didn't have to sort previously?
> 
> I'm guessing it's because we are now binary searching but agreed it needs some
> docs.

Done.

https://codereview.chromium.org/2816693002/diff/20001/components/subresource_...
components/subresource_filter/core/common/indexed_ruleset.cc:351: // Returns
whether the |domain| precedes the |target_domain| in the domain list,
On 2017/04/12 15:35:45, engedy wrote:
> phrasing nit: ... |domain| either matches or precedes ...

Done.

https://codereview.chromium.org/2816693002/diff/20001/components/subresource_...
components/subresource_filter/core/common/indexed_ruleset.cc:355: const
base::StringPiece domain_piece = ToStringPiece(domain);
On 2017/04/12 15:35:45, engedy wrote:
> nit: Have you considered doing the conversion to StringPiece at the call site?
> Unless there is a severe performance penalty, that would make it more obvious
> that this is just an operator <=, and there is no requirement which parameter
> must be the search key and which must come from the list.

Done. See CompareDomains.

https://codereview.chromium.org/2816693002/diff/20001/components/subresource_...
components/subresource_filter/core/common/indexed_ruleset.cc:358:
CompareCaseInsensitiveASCII(domain_piece, target_domain) <= 0);
On 2017/04/12 15:19:50, Charlie Harrison wrote:
> Musing: Does this really need to be insensitive? The only uppercase bits in
> canonicalized hosts will be percent encodings.

Indeed. Done.

https://codereview.chromium.org/2816693002/diff/20001/components/subresource_...
components/subresource_filter/core/common/indexed_ruleset.cc:358:
CompareCaseInsensitiveASCII(domain_piece, target_domain) <= 0);
On 2017/04/12 15:35:45, engedy wrote:
> Same thing about UTF-8 vs. ASCII here.

Not relevant anymore, because the strings are canonicalized and compared
byte-wise. As for the domains in the domain list, I left a TODO to handle the
non-ASCII case later.

https://codereview.chromium.org/2816693002/diff/20001/components/subresource_...
components/subresource_filter/core/common/indexed_ruleset.cc:377: // Otherwise
look for each subdomain of the |origin| using binary search.
On 2017/04/12 15:19:50, Charlie Harrison wrote:
> Have you looked into using standard library's binary search?

Yes. There are 2 concerns with it:
1. std::binary_search/lower_bound/etc. work with iterators/pointers, and
flatbuffers::Vector does not provide random access iterators.
2. There is a workaround: make flatbuffers::uoffset_t pretend to be an
iterator via a wrapper. IMHO, this is more boilerplate than just writing a
binary search.
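
For illustration only (this is not the code from the CL): a minimal sketch of
a hand-written binary search over a container that exposes nothing but size()
and index-based access, which is the situation the reply above describes.
IndexOnlyList, LowerBoundByIndex and Contains are hypothetical names, and the
plain byte-wise ascending order is an assumption made to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a list that offers only size() and Get(i),
// i.e. no random access iterators to hand to std::lower_bound.
struct IndexOnlyList {
  std::vector<std::string> items;  // Assumed sorted in ascending order.
  std::size_t size() const { return items.size(); }
  const std::string& Get(std::size_t i) const { return items[i]; }
};

// Returns the index of the first element that is not less than |key|, or
// list.size() if every element is less than |key|.
std::size_t LowerBoundByIndex(const IndexOnlyList& list,
                              const std::string& key) {
  std::size_t left = 0;
  std::size_t right = list.size();
  while (left < right) {
    const std::size_t middle = left + (right - left) / 2;
    assert(middle < list.size());  // Stand-in for a DCHECK on the index.
    if (list.Get(middle) < key)
      left = middle + 1;
    else
      right = middle;
  }
  return left;
}

// Returns whether |key| is present in the sorted |list|.
bool Contains(const IndexOnlyList& list, const std::string& key) {
  const std::size_t pos = LowerBoundByIndex(list, key);
  return pos < list.size() && list.Get(pos) == key;
}
```

The loop needs only two indices and one comparison per step, so it is shorter
than an iterator wrapper would be, at the cost of not reusing the standard
algorithms.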

https://codereview.chromium.org/2816693002/diff/20001/components/subresource_...
components/subresource_filter/core/common/indexed_ruleset.cc:379:
base::StringPiece canonicalized_host(origin.host());
On 2017/04/12 15:35:45, engedy wrote:
> nit: Could you please add a DCHECK here that this is not a unique origin?

Done.

https://codereview.chromium.org/2816693002/diff/20001/components/subresource_...
components/subresource_filter/core/common/indexed_ruleset.cc:438: size_t
max_included_length = 0;
On 2017/04/12 15:35:45, engedy wrote:
> nit: longest_matching_included_domain_length, it looks like it's not going to
> create a lot of wrapping.

Done.

https://codereview.chromium.org/2816693002/diff/20001/components/subresource_...
components/subresource_filter/core/common/indexed_ruleset.cc:442: is_match =
!!max_included_length;
On 2017/04/12 15:35:45, engedy wrote:
> if (!max_included_length)
>   return false;
> 
> and then you can get rid of |is_match| entirely.

How about this?

https://codereview.chromium.org/2816693002/diff/20001/components/subresource_...
File components/subresource_filter/core/common/indexed_ruleset_unittest.cc
(right):

https://codereview.chromium.org/2816693002/diff/20001/components/subresource_...
components/subresource_filter/core/common/indexed_ruleset_unittest.cc:376: }
kTestCases[] = {
On 2017/04/12 15:35:45, engedy wrote:
> Let's add some tests with:
>  -- leading and/or trailing '.' characters, and
>  -- consecutive ".." characters,
>  -- with multiple excluded subdomains.

Done.

https://codereview.chromium.org/2816693002/diff/20001/components/subresource_...
components/subresource_filter/core/common/indexed_ruleset_unittest.cc:415:
{{"domain.com", "~sub.domain.com"}, "http://ssub.domain.com", false},
On 2017/04/12 15:35:45, engedy wrote:
> Okay, by this point, `domain` totally stopped seeming like a real word...
> (https://xkcd.com/1046/)

Acknowledged.

https://codereview.chromium.org/2816693002/diff/20001/components/subresource_...
components/subresource_filter/core/common/indexed_ruleset_unittest.cc:468:
domains.push_back("c.sub." + domain);
On 2017/04/12 15:35:45, engedy wrote:
> nit: Can you also add: ~aa.sub, ~ab.sub, ~ba.sub, ~bb.sub for good measure?

Done.

pkalinnikov

Description was changed from ========== [IndexedRuleset] Improve worst-case domain list matching. BUG=708458 ========== to ========== ...

3 years, 8 months ago (2017-04-13 12:24:09 UTC) #23

pkalinnikov

3 years, 8 months ago (2017-04-13 12:26:31 UTC) #24

Description was changed from

==========
[IndexedRuleset] Improve worst-case domain list matching.

Some URL rules have long domain lists (up to 100-200 items). The previous
solution simply scanned the list linearly whenever the rule was being matched
against a network request, and checked each domain to be a subdomain of the
document's origin.

The CL replaces this algorithm. Now the origin which is matched against the
list is scanned. Each of its subdomains (in descending order of length) gets
binary-searched in the sorted list of domains, until some of them is found.

The old linear scanning still has place for short domain lists. Quick
measurements have shown that the length threshold of 5 achieves a decent
trade-off. The new binary-search-based algorithm activates only for domain
lists longer than 5 items.

Additionally, the CL changes the FlatBuffers format of the ruleset by splitting
domain list into 2 parts (included/excluded) instead of embedding the exclusion
bit into domain strings.

BUG=708458
==========

to

==========
[IndexedRuleset] Improve worst-case domain list matching.

Some URL rules have long domain lists (up to 100-200 items). The previous
solution simply scanned the list linearly whenever the rule was being matched
against a network request, and checked each domain to be a subdomain of the
document's origin.

The CL replaces this algorithm. Now the origin which is matched against the
list is scanned. Each of its subdomains (in descending order of length) gets
binary-searched in the sorted list of domains, until some of them is found.

The old linear scanning still has place for short domain lists. Quick
measurements have shown that the length threshold of 5 achieves a decent
trade-off. The new binary-search-based algorithm activates only for domain
lists longer than 5 items.

Additionally, the CL changes the FlatBuffers format of the ruleset by splitting
domain list into 2 parts (included/excluded) instead of embedding the exclusion
bit into domain strings.

This CL shoud improve the long-tail/worst-case matching performance.

BUG=708458
==========

pkalinnikov

3 years, 8 months ago (2017-04-13 12:26:40 UTC) #25

Description was changed from

==========
[IndexedRuleset] Improve worst-case domain list matching.

Some URL rules have long domain lists (up to 100-200 items). The previous
solution simply scanned the list linearly whenever the rule was being matched
against a network request, and checked each domain to be a subdomain of the
document's origin.

The CL replaces this algorithm. Now the origin which is matched against the
list is scanned. Each of its subdomains (in descending order of length) gets
binary-searched in the sorted list of domains, until some of them is found.

The old linear scanning still has place for short domain lists. Quick
measurements have shown that the length threshold of 5 achieves a decent
trade-off. The new binary-search-based algorithm activates only for domain
lists longer than 5 items.

Additionally, the CL changes the FlatBuffers format of the ruleset by splitting
domain list into 2 parts (included/excluded) instead of embedding the exclusion
bit into domain strings.

This CL shoud improve the long-tail/worst-case matching performance.

BUG=708458
==========

to

==========
[IndexedRuleset] Improve worst-case domain list matching.

Some URL rules have long domain lists (up to 100-200 items). The previous
solution simply scanned the list linearly whenever the rule was being matched
against a network request, and checked each domain to be a subdomain of the
document's origin.

The CL replaces this algorithm. Now the origin which is matched against the
list is scanned. Each of its subdomains (in descending order of length) gets
binary-searched in the sorted list of domains, until some of them is found.

The old linear scanning still has place for short domain lists. Quick
measurements have shown that the length threshold of 5 achieves a decent
trade-off. The new binary-search-based algorithm activates only for domain
lists longer than 5 items.

Additionally, the CL changes the FlatBuffers format of the ruleset by splitting
domain list into 2 parts (included/excluded) instead of embedding the exclusion
bit into domain strings.

This CL should improve the long-tail/worst-case matching performance.

BUG=708458
==========
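
For illustration only (not the code from the CL): a rough sketch of the
matching scheme the description outlines, where each suffix of the host,
longest first, is binary-searched in the sorted domain list. LongerFirst and
LongestMatchBinary are hypothetical names. Using std::vector with
std::binary_search is a simplification, and the byte-wise tie-break for
equal-length domains is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Assumed ordering: longer domains first, equal lengths ordered byte-wise.
bool LongerFirst(const std::string& lhs, const std::string& rhs) {
  if (lhs.size() != rhs.size())
    return lhs.size() > rhs.size();
  return lhs < rhs;
}

// Returns the length of the longest entry in |sorted_domains| that equals
// |host| or is a parent domain of it, or 0 if there is no such entry.
std::size_t LongestMatchBinary(const std::string& host,
                               const std::vector<std::string>& sorted_domains) {
  assert(std::is_sorted(sorted_domains.begin(), sorted_domains.end(),
                        LongerFirst));
  std::size_t start = 0;
  while (start < host.size()) {
    // Candidates are visited in descending order of length, so the first
    // one found in the list is the longest match.
    const std::string candidate = host.substr(start);
    if (std::binary_search(sorted_domains.begin(), sorted_domains.end(),
                           candidate, LongerFirst)) {
      return candidate.size();
    }
    // Drop the leftmost label and try the next, shorter suffix.
    const std::size_t dot = host.find('.', start);
    if (dot == std::string::npos)
      break;
    start = dot + 1;
  }
  return 0;
}
```

With this shape the cost per rule grows with the number of labels in the host
times the logarithm of the list length, rather than with the list length
itself. The description keeps linear scanning for lists of up to 5 items, and
that dispatch is left out of the sketch.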

Charlie Harrison

LGTM % comments https://codereview.chromium.org/2816693002/diff/40001/components/subresource_filter/core/common/indexed_ruleset.cc File components/subresource_filter/core/common/indexed_ruleset.cc (right): https://codereview.chromium.org/2816693002/diff/40001/components/subresource_filter/core/common/indexed_ruleset.cc#newcode409 components/subresource_filter/core/common/indexed_ruleset.cc:409: DCHECK_LT(left, domains.size()); Might be nice to ...

3 years, 8 months ago (2017-04-13 12:39:36 UTC) #26

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 8 months ago (2017-04-13 12:53:56 UTC) #27

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

3 years, 8 months ago (2017-04-13 12:53:57 UTC) #28

engedy

Still LGTM % comments, thanks! https://codereview.chromium.org/2816693002/diff/20001/components/subresource_filter/core/common/indexed_ruleset.cc File components/subresource_filter/core/common/indexed_ruleset.cc (right): https://codereview.chromium.org/2816693002/diff/20001/components/subresource_filter/core/common/indexed_ruleset.cc#newcode89 components/subresource_filter/core/common/indexed_ruleset.cc:89: std::sort(domains_included.begin(), domains_included.end(), precedes); On ...

3 years, 8 months ago (2017-04-13 12:59:14 UTC) #29

pkalinnikov

The CQ bit was checked by pkalinnikov@chromium.org to run a CQ dry run

3 years, 8 months ago (2017-04-13 14:39:20 UTC) #30

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2816693002/60001

3 years, 8 months ago (2017-04-13 14:39:41 UTC) #31

pkalinnikov

3 years, 8 months ago (2017-04-13 14:40:24 UTC) #32

Addressed more comments. PTAL.

https://codereview.chromium.org/2816693002/diff/20001/components/subresource_...
File components/subresource_filter/core/common/indexed_ruleset.cc (right):

https://codereview.chromium.org/2816693002/diff/20001/components/subresource_...
components/subresource_filter/core/common/indexed_ruleset.cc:89:
std::sort(domains_included.begin(), domains_included.end(), precedes);
On 2017/04/13 12:59:14, engedy wrote:
> On 2017/04/13 12:09:08, pkalinnikov wrote:
> > On 2017/04/12 15:41:11, Charlie Harrison wrote:
> > > On 2017/04/12 15:35:45, engedy wrote:
> > > > How come we didn't have to sort previously?
> > > 
> > > I'm guessing it's because we are now binary searching but agreed it needs
> > > some docs.
> > 
> > Done.
> 
> Can you please still explain to me why we only need to sort now?

We didn't have to sort, because the previous solution scanned domain lists
end-to-end without any assumptions about their order.

In this CL the naive scanning algorithm assumes the list is sorted, which is
why it can break out of the loop as soon as a match is found.
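
For illustration only (not the code from the CL): a minimal sketch of why a
sorted list lets the linear scan stop early. It assumes the list is ordered
longest domain first, so the first hit is also the longest match. All names
(PrecedesSketch, IsSubdomainOrEqual, SortForMatching, LongestMatchLinear) are
hypothetical, and the byte-wise tie-break is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Assumed ordering: longer domains first, equal lengths ordered byte-wise.
bool PrecedesSketch(const std::string& lhs, const std::string& rhs) {
  if (lhs.size() != rhs.size())
    return lhs.size() > rhs.size();
  return lhs < rhs;
}

// True if |host| equals |domain| or ends with "." followed by |domain|.
bool IsSubdomainOrEqual(const std::string& host, const std::string& domain) {
  if (domain.empty() || host.size() < domain.size())
    return false;
  const std::size_t offset = host.size() - domain.size();
  if (host.compare(offset, domain.size(), domain) != 0)
    return false;
  return offset == 0 || host[offset - 1] == '.';
}

// Done once, when the rule is indexed, not on every match.
void SortForMatching(std::vector<std::string>* domains) {
  std::sort(domains->begin(), domains->end(), PrecedesSketch);
}

// Returns the length of the longest matching domain, or 0 if none matches.
std::size_t LongestMatchLinear(const std::string& host,
                               const std::vector<std::string>& sorted_domains) {
  assert(std::is_sorted(sorted_domains.begin(), sorted_domains.end(),
                        PrecedesSketch));
  for (const std::string& domain : sorted_domains) {
    if (IsSubdomainOrEqual(host, domain))
      return domain.size();  // Longest-first order: no need to keep scanning.
  }
  return 0;
}
```

Without the ordering, the scan would have to visit every entry and keep a
running maximum before it could report the longest match.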

https://codereview.chromium.org/2816693002/diff/40001/components/subresource_...
File components/subresource_filter/core/common/indexed_ruleset.cc (right):

https://codereview.chromium.org/2816693002/diff/40001/components/subresource_...
components/subresource_filter/core/common/indexed_ruleset.cc:32: // Returns
comparison result for two domains. For a list of domains sorted
On 2017/04/13 12:59:14, engedy wrote:
> nit: How about saying:
> 
> Performs three-way comparison between two domains. In the total order defined
> by this predicate, the lengths of domains will be monotonically decreasing.

Done.

https://codereview.chromium.org/2816693002/diff/40001/components/subresource_...
components/subresource_filter/core/common/indexed_ruleset.cc:84: // Note: The
|domain| can have non-ASCII UTF-8 characters, but these
On 2017/04/13 12:59:14, engedy wrote:
> nit: ... but ToLowerASCII leaves these intact.

Done.

https://codereview.chromium.org/2816693002/diff/40001/components/subresource_...
components/subresource_filter/core/common/indexed_ruleset.cc:86: //
TODO(pkalinnikov): Put non-ASCII characters to lower case as well.
On 2017/04/13 12:59:14, engedy wrote:
> nit: s/Put/Convert/

Done.

https://codereview.chromium.org/2816693002/diff/40001/components/subresource_...
components/subresource_filter/core/common/indexed_ruleset.cc:87: //
TODO(pkalinnikov): Possibly convert to IDN here or directly assume
On 2017/04/13 12:59:14, engedy wrote:
> nit: ... convert Punycode to IDN ...

Done.

https://codereview.chromium.org/2816693002/diff/40001/components/subresource_...
components/subresource_filter/core/common/indexed_ruleset.cc:409:
DCHECK_LT(left, domains.size());
On 2017/04/13 12:39:36, Charlie Harrison wrote:
> Might be nice to move the DCHECK into the loop so we check this every
> iteration.

Not sure it makes sense to DCHECK |left| in the loop. I added a DCHECK for
|middle| because it is used as an index into |domains|. Does this look good?

https://codereview.chromium.org/2816693002/diff/40001/components/subresource_...
components/subresource_filter/core/common/indexed_ruleset.cc:448: return
(is_generic || longest_matching_included_domain_length) &&
On 2017/04/13 12:59:14, engedy wrote:
> On 2017/04/13 12:39:36, Charlie Harrison wrote:
> > optional: Consider breaking this up into a few sub variables with meaningful
> > names. The logic is not trivial.
> 
> Let's find some middle ground between this and the previous form.

How about this?

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 8 months ago (2017-04-13 15:25:47 UTC) #33

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

3 years, 8 months ago (2017-04-13 15:25:48 UTC) #34

Charlie Harrison

still LGTM https://codereview.chromium.org/2816693002/diff/40001/components/subresource_filter/core/common/indexed_ruleset.cc File components/subresource_filter/core/common/indexed_ruleset.cc (right): https://codereview.chromium.org/2816693002/diff/40001/components/subresource_filter/core/common/indexed_ruleset.cc#newcode409 components/subresource_filter/core/common/indexed_ruleset.cc:409: DCHECK_LT(left, domains.size()); On 2017/04/13 14:40:24, pkalinnikov wrote: ...

3 years, 8 months ago (2017-04-13 15:53:27 UTC) #35

pkalinnikov

The patchset sent to the CQ was uploaded after l-g-t-m from engedy@chromium.org, csharrison@chromium.org Link to ...

3 years, 8 months ago (2017-04-13 16:35:51 UTC) #39

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2816693002/80001

3 years, 8 months ago (2017-04-13 16:36:20 UTC) #40

commit-bot: I haz the power

CQ is committing da patch. Bot data: {"patchset_id": 80001, "attempt_start_ts": 1492101351420880, "parent_rev": "a0bd2c53f52b94f64ac1521e3bdea8da54c870b4", "commit_rev": "154fc5ad392d229570cf7332ce8fb686da7d96c0"}

3 years, 8 months ago (2017-04-13 17:38:19 UTC) #41

commit-bot: I haz the power

3 years, 8 months ago (2017-04-13 17:39:15 UTC) #42

Message was sent while issue was closed.

Description was changed from

==========
[IndexedRuleset] Improve worst-case domain list matching.

Some URL rules have long domain lists (up to 100-200 items). The previous
solution simply scanned the list linearly whenever the rule was being matched
against a network request, and checked each domain to be a subdomain of the
document's origin.

The CL replaces this algorithm. Now the origin which is matched against the
list is scanned. Each of its subdomains (in descending order of length) gets
binary-searched in the sorted list of domains, until some of them is found.

The old linear scanning still has place for short domain lists. Quick
measurements have shown that the length threshold of 5 achieves a decent
trade-off. The new binary-search-based algorithm activates only for domain
lists longer than 5 items.

Additionally, the CL changes the FlatBuffers format of the ruleset by splitting
domain list into 2 parts (included/excluded) instead of embedding the exclusion
bit into domain strings.

This CL should improve the long-tail/worst-case matching performance.

BUG=708458
==========

to

==========
[IndexedRuleset] Improve worst-case domain list matching.

Some URL rules have long domain lists (up to 100-200 items). The previous
solution simply scanned the list linearly whenever the rule was being matched
against a network request, and checked each domain to be a subdomain of the
document's origin.

The CL replaces this algorithm. Now the origin which is matched against the
list is scanned. Each of its subdomains (in descending order of length) gets
binary-searched in the sorted list of domains, until some of them is found.

The old linear scanning still has place for short domain lists. Quick
measurements have shown that the length threshold of 5 achieves a decent
trade-off. The new binary-search-based algorithm activates only for domain
lists longer than 5 items.

Additionally, the CL changes the FlatBuffers format of the ruleset by splitting
domain list into 2 parts (included/excluded) instead of embedding the exclusion
bit into domain strings.

This CL should improve the long-tail/worst-case matching performance.

BUG=708458

Review-Url: https://codereview.chromium.org/2816693002
Cr-Commit-Position: refs/heads/master@{#464457}
Committed:
https://chromium.googlesource.com/chromium/src/+/154fc5ad392d229570cf7332ce8f...
==========

commit-bot: I haz the power

3 years, 8 months ago (2017-04-13 17:39:16 UTC) #43

Message was sent while issue was closed.

Committed patchset #5 (id:80001) as
https://chromium.googlesource.com/chromium/src/+/154fc5ad392d229570cf7332ce8f...

Issue 2816693002: [IndexedRuleset] Improve worst-case domain list matching. (Closed)

Patch Set 1

Patch Set 2 : Clean up, add unittest.

Patch Set 3 : Address comments.

Patch Set 4 : Address more comments.

Patch Set 5 : Add TODO.