Issue 2568133007: Allow CrossSiteDocumentClassifier to operate on Origins

Charlie Harrison

The CQ bit was checked by csharrison@chromium.org to run a CQ dry run

4 years ago (2016-12-13 16:56:15 UTC) #1

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2568133007/20001

4 years ago (2016-12-13 16:56:40 UTC) #2

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years ago (2016-12-13 19:04:15 UTC) #3

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: linux_chromium_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng/builds/355389)

4 years ago (2016-12-13 19:04:15 UTC) #4

Charlie Harrison

The CQ bit was checked by csharrison@chromium.org to run a CQ dry run

4 years ago (2016-12-13 19:26:49 UTC) #5

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2568133007/20001

4 years ago (2016-12-13 19:27:21 UTC) #6

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years ago (2016-12-13 20:22:09 UTC) #7

Charlie Harrison

csharrison@chromium.org changed reviewers: + nick@chromium.org, pkasting@chromium.org

3 years, 11 months ago (2017-01-03 16:50:41 UTC) #9

Charlie Harrison

nick, pkasting PTAL? This avoids potentially expensive allocation of a GURL. pkasting: I had reservations ...

3 years, 11 months ago (2017-01-03 16:50:41 UTC) #10

ncarter (slow)

lgtm https://codereview.chromium.org/2568133007/diff/20001/net/base/registry_controlled_domains/registry_controlled_domain.cc File net/base/registry_controlled_domains/registry_controlled_domain.cc (right): https://codereview.chromium.org/2568133007/diff/20001/net/base/registry_controlled_domains/registry_controlled_domain.cc#newcode107 net/base/registry_controlled_domains/registry_controlled_domain.cc:107: while (1) { Totally random thought -- is ...

3 years, 11 months ago (2017-01-03 19:38:27 UTC) #11

Charlie Harrison

https://codereview.chromium.org/2568133007/diff/20001/net/base/registry_controlled_domains/registry_controlled_domain.cc File net/base/registry_controlled_domains/registry_controlled_domain.cc (right): https://codereview.chromium.org/2568133007/diff/20001/net/base/registry_controlled_domains/registry_controlled_domain.cc#newcode107 net/base/registry_controlled_domains/registry_controlled_domain.cc:107: while (1) { On 2017/01/03 19:38:27, ncarter wrote: > ...

3 years, 11 months ago (2017-01-03 22:16:56 UTC) #12

ncarter (slow)

On 2017/01/03 22:16:56, Charlie Harrison wrote: > https://codereview.chromium.org/2568133007/diff/20001/net/base/registry_controlled_domains/registry_controlled_domain.cc > File net/base/registry_controlled_domains/registry_controlled_domain.cc (right): > > https://codereview.chromium.org/2568133007/diff/20001/net/base/registry_controlled_domains/registry_controlled_domain.cc#newcode107 ...

3 years, 11 months ago (2017-01-03 23:20:32 UTC) #13

ncarter (slow)

On 2017/01/03 23:20:32, ncarter wrote: > On 2017/01/03 22:16:56, Charlie Harrison wrote: > > > ...

3 years, 11 months ago (2017-01-04 00:19:09 UTC) #14

On 2017/01/03 23:20:32, ncarter wrote:
> On 2017/01/03 22:16:56, Charlie Harrison wrote:
> >
>
https://codereview.chromium.org/2568133007/diff/20001/net/base/registry_contr...
> > File net/base/registry_controlled_domains/registry_controlled_domain.cc
> (right):
> > 
> >
>
https://codereview.chromium.org/2568133007/diff/20001/net/base/registry_contr...
> > net/base/registry_controlled_domains/registry_controlled_domain.cc:107:
while
> > (1) {
> > On 2017/01/03 19:38:27, ncarter wrote:
> > > Totally random thought -- is this function a hot spot?
> > > 
> > > It seems like this loop could potentially be eliminated by constructing
the
> > > DAFSA so that it consumes the string in reverse (i.e. make it a 'find
> longest
> > > suffix from fixed set' computation), which would not only eliminate this
> loop,
> > > but probably mean we'd be able to stop after looking at five or six
> characters
> > > in a lot of cases (for example, as soon as we see the z in http://baz.com,
> we
> > know that
> > > 'com' is the only match).
> > 
> > I think I understand what you are saying. Rather than "lookup string in
fixed
> > set" we make the DAFSA interface instead "return longest match". Then, by
> > seeding the DAFSA backwards, we can just return the longest match using
> > host.reverse().
> > 
> > If the match lines up with a dot correctly, we can return the length.
> > 
> > This used to be a major hot spot until I recently fixed SameDomainOrHost to
> > early return when hosts are identical, but it could still be hot in some
> > circumstances.
> 
> Exactly.

www.google.com is an interesting input case to think about, since it has
collisions both in the forward and reverse directions.

In the current implementation, because www.ck and www.ro are in the list of
effective TLDs, the initial lookup for 'www.google.com' needs to scan as far as
the 'g' before the DAFSA can fail lookup for the first iteration. Then there's a
forward scan to find the dot, and a second iteration ('google.com' input) needs
to scan to the dot, to eliminate the 'google' TLD branch. Then there's a scan
forward to find the dot, and a third iteration with input 'com', which returns a
match.

With a backwards-scanning DAFSA, on input 'www.google.com', you'd need to scan
as far as 'moc.elgoog.' (because 'withgoogle.com' is in the ETLD registry) to
know that .com is the TLD, since withgoogle.com is in the registry.

With the final scan to extend the (.com) domain to (google.com), I count 32
character reads going forward, vs. 17 going backward, but if you count just the
DAFSA iterations (which are presumably more expensive than the find-next-dot
scans), it would be something like 15 vs. 11.

Alternatively, SameDomainOrHost might have an opportunity for a fast-fail path
where you only call GetDomainAndRegistryAsStringPiece once; after calling it on
the first hostname, you could check whether host2.endsWith(domain1).

Charlie Harrison

On 2017/01/04 00:19:09, ncarter wrote: > On 2017/01/03 23:20:32, ncarter wrote: > > On 2017/01/03 ...

3 years, 11 months ago (2017-01-04 18:17:54 UTC) #15

On 2017/01/04 00:19:09, ncarter wrote:
> On 2017/01/03 23:20:32, ncarter wrote:
> > On 2017/01/03 22:16:56, Charlie Harrison wrote:
> > >
> >
>
https://codereview.chromium.org/2568133007/diff/20001/net/base/registry_contr...
> > > File net/base/registry_controlled_domains/registry_controlled_domain.cc
> > (right):
> > > 
> > >
> >
>
https://codereview.chromium.org/2568133007/diff/20001/net/base/registry_contr...
> > > net/base/registry_controlled_domains/registry_controlled_domain.cc:107:
> while
> > > (1) {
> > > On 2017/01/03 19:38:27, ncarter wrote:
> > > > Totally random thought -- is this function a hot spot?
> > > > 
> > > > It seems like this loop could potentially be eliminated by constructing
> the
> > > > DAFSA so that it consumes the string in reverse (i.e. make it a 'find
> > longest
> > > > suffix from fixed set' computation), which would not only eliminate this
> > loop,
> > > > but probably mean we'd be able to stop after looking at five or six
> > characters
> > > > in a lot of cases (for example, as soon as we see the z in
http://baz.com,
> > we
> > > know that
> > > > 'com' is the only match).
> > > 
> > > I think I understand what you are saying. Rather than "lookup string in
> fixed
> > > set" we make the DAFSA interface instead "return longest match". Then, by
> > > seeding the DAFSA backwards, we can just return the longest match using
> > > host.reverse().
> > > 
> > > If the match lines up with a dot correctly, we can return the length.
> > > 
> > > This used to be a major hot spot until I recently fixed SameDomainOrHost
to
> > > early return when hosts are identical, but it could still be hot in some
> > > circumstances.
> > 
> > Exactly.
> 
> http://www.google.com is an interesting input case to think about, since it
has
> collisions both in the forward and reverse directions.
> 
> In the current implementation, because http://www.ck and http://www.ro are in
the list of
> effective TLDs, the initial lookup for 'www.google.com' needs to scan as far
as
> the 'g' before the DAFSA can fail lookup for the first iteration. Then there's
a
> forward scan to find the dot, and a second iteration ('google.com' input)
needs
> to scan to the dot, to eliminate the 'google' TLD branch. Then there's a scan
> forward to find the dot, and a third iteration with input 'com', which returns
a
> match.
> 
> With a backwards-scanning DAFSA, on input 'www.google.com', you'd need to scan
> as far as 'moc.elgoog.' (because 'withgoogle.com' is in the ETLD registry) to
> know that .com is the TLD, since http://withgoogle.com is in the registry.
> 
> With the final scan to extend the (.com) domain to (http://google.com), I
count 32
> character reads going forward, vs. 17 going backward, but if you count just
the
> DAFSA iterations (which are presumably more expensive than the find-next-dot
> scans), it would be something like 15 vs. 11.
Very interesting, thank you for doing this analysis. I expect the DAFSA
iterations to be worse due to memory locality, but I am not very familiar with
its in-memory structure.
> 
> Alternatively, SameDomainOrHost might have an opportunity for a fast-fail path
> where you only call GetDomainAndRegistryAsStringPiece once; after calling it
on
> the first hostname, you could check whether host2.endsWith(domain1).
This is probably worth it. I think a big chunk of time we end up calling
SameDomainOrHost we are calling it with a subresource URL and a
first_party_for_cookies URL. I would expect in cases where the hosts are not
identical, it is more probable for the domains to be different.

In fact, if you ensured that you always calculated the domain of the larger host
first, you could have a chance to fail-fast with a length difference from
endsWith().

ncarter (slow)

On 2017/01/04 18:17:54, Charlie Harrison wrote: > On 2017/01/04 00:19:09, ncarter wrote: > > On ...

3 years, 11 months ago (2017-01-05 17:43:49 UTC) #16

On 2017/01/04 18:17:54, Charlie Harrison wrote:
> On 2017/01/04 00:19:09, ncarter wrote:
> > On 2017/01/03 23:20:32, ncarter wrote:
> > > On 2017/01/03 22:16:56, Charlie Harrison wrote:
> > > >
> > >
> >
>
https://codereview.chromium.org/2568133007/diff/20001/net/base/registry_contr...
> > > > File net/base/registry_controlled_domains/registry_controlled_domain.cc
> > > (right):
> > > > 
> > > >
> > >
> >
>
https://codereview.chromium.org/2568133007/diff/20001/net/base/registry_contr...
> > > > net/base/registry_controlled_domains/registry_controlled_domain.cc:107:
> > while
> > > > (1) {
> > > > On 2017/01/03 19:38:27, ncarter wrote:
> > > > > Totally random thought -- is this function a hot spot?
> > > > > 
> > > > > It seems like this loop could potentially be eliminated by
constructing
> > the
> > > > > DAFSA so that it consumes the string in reverse (i.e. make it a 'find
> > > longest
> > > > > suffix from fixed set' computation), which would not only eliminate
this
> > > loop,
> > > > > but probably mean we'd be able to stop after looking at five or six
> > > characters
> > > > > in a lot of cases (for example, as soon as we see the z in
> http://baz.com,
> > > we
> > > > know that
> > > > > 'com' is the only match).
> > > > 
> > > > I think I understand what you are saying. Rather than "lookup string in
> > fixed
> > > > set" we make the DAFSA interface instead "return longest match". Then,
by
> > > > seeding the DAFSA backwards, we can just return the longest match using
> > > > host.reverse().
> > > > 
> > > > If the match lines up with a dot correctly, we can return the length.
> > > > 
> > > > This used to be a major hot spot until I recently fixed SameDomainOrHost
> to
> > > > early return when hosts are identical, but it could still be hot in some
> > > > circumstances.
> > > 
> > > Exactly.
> > 
> > http://www.google.com is an interesting input case to think about, since it
> has
> > collisions both in the forward and reverse directions.
> > 
> > In the current implementation, because http://www.ck and http://www.ro are
in
> the list of
> > effective TLDs, the initial lookup for 'www.google.com' needs to scan as far
> as
> > the 'g' before the DAFSA can fail lookup for the first iteration. Then
there's
> a
> > forward scan to find the dot, and a second iteration ('google.com' input)
> needs
> > to scan to the dot, to eliminate the 'google' TLD branch. Then there's a
scan
> > forward to find the dot, and a third iteration with input 'com', which
returns
> a
> > match.
> > 
> > With a backwards-scanning DAFSA, on input 'www.google.com', you'd need to
scan
> > as far as 'moc.elgoog.' (because 'withgoogle.com' is in the ETLD registry)
to
> > know that .com is the TLD, since http://withgoogle.com is in the registry.
> > 
> > With the final scan to extend the (.com) domain to (http://google.com), I
> count 32
> > character reads going forward, vs. 17 going backward, but if you count just
> the
> > DAFSA iterations (which are presumably more expensive than the find-next-dot
> > scans), it would be something like 15 vs. 11.
> Very interesting, thank you for doing this analysis. I expect the DAFSA
> iterations to be worse due to memory locality, but I am not very familiar with
> its in-memory structure.
> > 
> > Alternatively, SameDomainOrHost might have an opportunity for a fast-fail
path
> > where you only call GetDomainAndRegistryAsStringPiece once; after calling it
> on
> > the first hostname, you could check whether host2.endsWith(domain1).
> This is probably worth it. I think a big chunk of time we end up calling
> SameDomainOrHost we are calling it with a subresource URL and a
> first_party_for_cookies URL. I would expect in cases where the hosts are not
> identical, it is more probable for the domains to be different.
> 
> In fact, if you ensured that you always calculated the domain of the larger
host
> first, you could have a chance to fail-fast with a length difference from
> endsWith().

I've split off these two ideas into 2 new bugs (678334 and 678351), and there's
good discussion there.

this cl still lgtm as-is.

Peter Kasting

On 2017/01/03 16:50:41, Charlie Harrison wrote: > pkasting: I had reservations about exposing a SameDomainOrHost ...

3 years, 11 months ago (2017-01-06 02:07:26 UTC) #17

Charlie Harrison

On 2017/01/06 02:07:26, Peter Kasting (backlogged) wrote: > On 2017/01/03 16:50:41, Charlie Harrison wrote: > ...

3 years, 11 months ago (2017-01-06 03:43:11 UTC) #18

Charlie Harrison

BTW thanks Nick for working through those other optimizations in the linked bugs!

3 years, 11 months ago (2017-01-06 03:44:12 UTC) #19

Charlie Harrison

The CQ bit was checked by csharrison@chromium.org

3 years, 11 months ago (2017-01-06 03:44:18 UTC) #20

Charlie Harrison

The patchset sent to the CQ was uploaded after l-g-t-m from nick@chromium.org, pkasting@chromium.org Link to ...

3 years, 11 months ago (2017-01-06 03:44:19 UTC) #21

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2568133007/40001

3 years, 11 months ago (2017-01-06 03:44:30 UTC) #22

commit-bot: I haz the power

CQ is committing da patch. Bot data: {"patchset_id": 40001, "attempt_start_ts": 1483674258777820, "parent_rev": "90a3ce3d5df79337a285056150d3b9b31ad778e0", "commit_rev": "3b8662f8f863dee29b406e73ce7aa7ad22adad4e"}

3 years, 11 months ago (2017-01-06 04:47:04 UTC) #23

commit-bot: I haz the power

Description was changed from ========== Allow CrossSiteDocumentClassifier to operate on Origins BUG=348655 ========== to ========== ...

3 years, 11 months ago (2017-01-06 04:47:32 UTC) #24

commit-bot: I haz the power

3 years, 11 months ago (2017-01-06 04:47:33 UTC) #25

Message was sent while issue was closed.

Committed patchset #3 (id:40001) as
https://chromium.googlesource.com/chromium/src/+/3b8662f8f863dee29b406e73ce7a...

Issue 2568133007: Allow CrossSiteDocumentClassifier to operate on Origins (Closed)

Description

Patch Set 1 #

Patch Set 2 : rebase #

Patch Set 3 : pkasting review #

Messages