|
|
Created:
4 years ago by Charlie Harrison Modified:
3 years, 11 months ago CC:
chromium-reviews, jam, cbentzel+watch_chromium.org, darin-cc_chromium.org Target Ref:
refs/pending/heads/master Project:
chromium Visibility:
Public. |
DescriptionAllow CrossSiteDocumentClassifier to operate on Origins
BUG=348655
Review-Url: https://codereview.chromium.org/2568133007
Cr-Commit-Position: refs/heads/master@{#441874}
Committed: https://chromium.googlesource.com/chromium/src/+/3b8662f8f863dee29b406e73ce7aa7ad22adad4e
Patch Set 1 #Patch Set 2 : rebase #
Total comments: 2
Patch Set 3 : pkasting review #
Messages
Total messages: 25 (13 generated)
The CQ bit was checked by csharrison@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: linux_chromium_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_...)
The CQ bit was checked by csharrison@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
csharrison@chromium.org changed reviewers: + nick@chromium.org, pkasting@chromium.org
nick, pkasting PTAL? This avoids potentially expensive allocation of a GURL. pkasting: I had reservations about exposing a SameDomainOrHost method that took raw StringPiece host as input, as it could allow for callers to give bad/non-canonicalized hosts. An alternative for this patch would be to add a SameDomainOrHost(Origin, GURL). WDYT?
lgtm https://codereview.chromium.org/2568133007/diff/20001/net/base/registry_contr... File net/base/registry_controlled_domains/registry_controlled_domain.cc (right): https://codereview.chromium.org/2568133007/diff/20001/net/base/registry_contr... net/base/registry_controlled_domains/registry_controlled_domain.cc:107: while (1) { Totally random thought -- is this function a hot spot? It seems like this loop could potentially be eliminated by constructing the DAFSA so that it consumes the string in reverse (i.e. make it a 'find longest suffix from fixed set' computation), which would not only eliminate this loop, but probably mean we'd be able to stop after looking at five or six characters in a lot of cases (for example, as soon as we see the z in baz.com, we know that 'com' is the only match).
https://codereview.chromium.org/2568133007/diff/20001/net/base/registry_contr... File net/base/registry_controlled_domains/registry_controlled_domain.cc (right): https://codereview.chromium.org/2568133007/diff/20001/net/base/registry_contr... net/base/registry_controlled_domains/registry_controlled_domain.cc:107: while (1) { On 2017/01/03 19:38:27, ncarter wrote: > Totally random thought -- is this function a hot spot? > > It seems like this loop could potentially be eliminated by constructing the > DAFSA so that it consumes the string in reverse (i.e. make it a 'find longest > suffix from fixed set' computation), which would not only eliminate this loop, > but probably mean we'd be able to stop after looking at five or six characters > in a lot of cases (for example, as soon as we see the z in http://baz.com, we know that > 'com' is the only match). I think I understand what you are saying. Rather than "lookup string in fixed set" we make the DAFSA interface instead "return longest match". Then, by seeding the DAFSA backwards, we can just return the longest match using host.reverse(). If the match lines up with a dot correctly, we can return the length. This used to be a major hot spot until I recently fixed SameDomainOrHost to early return when hosts are identical, but it could still be hot in some circumstances.
On 2017/01/03 22:16:56, Charlie Harrison wrote: > https://codereview.chromium.org/2568133007/diff/20001/net/base/registry_contr... > File net/base/registry_controlled_domains/registry_controlled_domain.cc (right): > > https://codereview.chromium.org/2568133007/diff/20001/net/base/registry_contr... > net/base/registry_controlled_domains/registry_controlled_domain.cc:107: while > (1) { > On 2017/01/03 19:38:27, ncarter wrote: > > Totally random thought -- is this function a hot spot? > > > > It seems like this loop could potentially be eliminated by constructing the > > DAFSA so that it consumes the string in reverse (i.e. make it a 'find longest > > suffix from fixed set' computation), which would not only eliminate this loop, > > but probably mean we'd be able to stop after looking at five or six characters > > in a lot of cases (for example, as soon as we see the z in http://baz.com, we > know that > > 'com' is the only match). > > I think I understand what you are saying. Rather than "lookup string in fixed > set" we make the DAFSA interface instead "return longest match". Then, by > seeding the DAFSA backwards, we can just return the longest match using > host.reverse(). > > If the match lines up with a dot correctly, we can return the length. > > This used to be a major hot spot until I recently fixed SameDomainOrHost to > early return when hosts are identical, but it could still be hot in some > circumstances. Exactly.
On 2017/01/03 23:20:32, ncarter wrote: > On 2017/01/03 22:16:56, Charlie Harrison wrote: > > > https://codereview.chromium.org/2568133007/diff/20001/net/base/registry_contr... > > File net/base/registry_controlled_domains/registry_controlled_domain.cc > (right): > > > > > https://codereview.chromium.org/2568133007/diff/20001/net/base/registry_contr... > > net/base/registry_controlled_domains/registry_controlled_domain.cc:107: while > > (1) { > > On 2017/01/03 19:38:27, ncarter wrote: > > > Totally random thought -- is this function a hot spot? > > > > > > It seems like this loop could potentially be eliminated by constructing the > > > DAFSA so that it consumes the string in reverse (i.e. make it a 'find > longest > > > suffix from fixed set' computation), which would not only eliminate this > loop, > > > but probably mean we'd be able to stop after looking at five or six > characters > > > in a lot of cases (for example, as soon as we see the z in http://baz.com, > we > > know that > > > 'com' is the only match). > > > > I think I understand what you are saying. Rather than "lookup string in fixed > > set" we make the DAFSA interface instead "return longest match". Then, by > > seeding the DAFSA backwards, we can just return the longest match using > > host.reverse(). > > > > If the match lines up with a dot correctly, we can return the length. > > > > This used to be a major hot spot until I recently fixed SameDomainOrHost to > > early return when hosts are identical, but it could still be hot in some > > circumstances. > > Exactly. www.google.com is an interesting input case to think about, since it has collisions both in the forward and reverse directions. In the current implementation, because www.ck and www.ro are in the list of effective TLDs, the initial lookup for 'www.google.com' needs to scan as far as the 'g' before the DAFSA can fail lookup for the first iteration. Then there's a forward scan to find the dot, and a second iteration ('google.com' input) needs to scan to the dot, to eliminate the 'google' TLD branch. Then there's a scan forward to find the dot, and a third iteration with input 'com', which returns a match. With a backwards-scanning DAFSA, on input 'www.google.com', you'd need to scan as far as 'moc.elgoog.' (because 'withgoogle.com' is in the ETLD registry) to know that .com is the TLD, since withgoogle.com is in the registry. With the final scan to extend the (.com) domain to (google.com), I count 32 character reads going forward, vs. 17 going backward, but if you count just the DAFSA iterations (which are presumably more expensive than the find-next-dot scans), it would be something like 15 vs. 11. Alternatively, SameDomainOrHost might have an opportunity for a fast-fail path where you only call GetDomainAndRegistryAsStringPiece once; after calling it on the first hostname, you could check whether host2.endsWith(domain1).
On 2017/01/04 00:19:09, ncarter wrote: > On 2017/01/03 23:20:32, ncarter wrote: > > On 2017/01/03 22:16:56, Charlie Harrison wrote: > > > > > > https://codereview.chromium.org/2568133007/diff/20001/net/base/registry_contr... > > > File net/base/registry_controlled_domains/registry_controlled_domain.cc > > (right): > > > > > > > > > https://codereview.chromium.org/2568133007/diff/20001/net/base/registry_contr... > > > net/base/registry_controlled_domains/registry_controlled_domain.cc:107: > while > > > (1) { > > > On 2017/01/03 19:38:27, ncarter wrote: > > > > Totally random thought -- is this function a hot spot? > > > > > > > > It seems like this loop could potentially be eliminated by constructing > the > > > > DAFSA so that it consumes the string in reverse (i.e. make it a 'find > > longest > > > > suffix from fixed set' computation), which would not only eliminate this > > loop, > > > > but probably mean we'd be able to stop after looking at five or six > > characters > > > > in a lot of cases (for example, as soon as we see the z in http://baz.com, > > we > > > know that > > > > 'com' is the only match). > > > > > > I think I understand what you are saying. Rather than "lookup string in > fixed > > > set" we make the DAFSA interface instead "return longest match". Then, by > > > seeding the DAFSA backwards, we can just return the longest match using > > > host.reverse(). > > > > > > If the match lines up with a dot correctly, we can return the length. > > > > > > This used to be a major hot spot until I recently fixed SameDomainOrHost to > > > early return when hosts are identical, but it could still be hot in some > > > circumstances. > > > > Exactly. > > http://www.google.com is an interesting input case to think about, since it has > collisions both in the forward and reverse directions. > > In the current implementation, because http://www.ck and http://www.ro are in the list of > effective TLDs, the initial lookup for 'www.google.com' needs to scan as far as > the 'g' before the DAFSA can fail lookup for the first iteration. Then there's a > forward scan to find the dot, and a second iteration ('google.com' input) needs > to scan to the dot, to eliminate the 'google' TLD branch. Then there's a scan > forward to find the dot, and a third iteration with input 'com', which returns a > match. > > With a backwards-scanning DAFSA, on input 'www.google.com', you'd need to scan > as far as 'moc.elgoog.' (because 'withgoogle.com' is in the ETLD registry) to > know that .com is the TLD, since http://withgoogle.com is in the registry. > > With the final scan to extend the (.com) domain to (http://google.com), I count 32 > character reads going forward, vs. 17 going backward, but if you count just the > DAFSA iterations (which are presumably more expensive than the find-next-dot > scans), it would be something like 15 vs. 11. Very interesting, thank you for doing this analysis. I expect the DAFSA iterations to be worse due to memory locality, but I am not very familiar with its in-memory structure. > > Alternatively, SameDomainOrHost might have an opportunity for a fast-fail path > where you only call GetDomainAndRegistryAsStringPiece once; after calling it on > the first hostname, you could check whether host2.endsWith(domain1). This is probably worth it. I think a big chunk of time we end up calling SameDomainOrHost we are calling it with a subresource URL and a first_party_for_cookies URL. I would expect in cases where the hosts are not identical, it is more probable for the domains to be different. In fact, if you ensured that you always calculated the domain of the larger host first, you could have a chance to fail-fast with a length difference from endsWith().
On 2017/01/04 18:17:54, Charlie Harrison wrote: > On 2017/01/04 00:19:09, ncarter wrote: > > On 2017/01/03 23:20:32, ncarter wrote: > > > On 2017/01/03 22:16:56, Charlie Harrison wrote: > > > > > > > > > > https://codereview.chromium.org/2568133007/diff/20001/net/base/registry_contr... > > > > File net/base/registry_controlled_domains/registry_controlled_domain.cc > > > (right): > > > > > > > > > > > > > > https://codereview.chromium.org/2568133007/diff/20001/net/base/registry_contr... > > > > net/base/registry_controlled_domains/registry_controlled_domain.cc:107: > > while > > > > (1) { > > > > On 2017/01/03 19:38:27, ncarter wrote: > > > > > Totally random thought -- is this function a hot spot? > > > > > > > > > > It seems like this loop could potentially be eliminated by constructing > > the > > > > > DAFSA so that it consumes the string in reverse (i.e. make it a 'find > > > longest > > > > > suffix from fixed set' computation), which would not only eliminate this > > > loop, > > > > > but probably mean we'd be able to stop after looking at five or six > > > characters > > > > > in a lot of cases (for example, as soon as we see the z in > http://baz.com, > > > we > > > > know that > > > > > 'com' is the only match). > > > > > > > > I think I understand what you are saying. Rather than "lookup string in > > fixed > > > > set" we make the DAFSA interface instead "return longest match". Then, by > > > > seeding the DAFSA backwards, we can just return the longest match using > > > > host.reverse(). > > > > > > > > If the match lines up with a dot correctly, we can return the length. > > > > > > > > This used to be a major hot spot until I recently fixed SameDomainOrHost > to > > > > early return when hosts are identical, but it could still be hot in some > > > > circumstances. > > > > > > Exactly. > > > > http://www.google.com is an interesting input case to think about, since it > has > > collisions both in the forward and reverse directions. > > > > In the current implementation, because http://www.ck and http://www.ro are in > the list of > > effective TLDs, the initial lookup for 'www.google.com' needs to scan as far > as > > the 'g' before the DAFSA can fail lookup for the first iteration. Then there's > a > > forward scan to find the dot, and a second iteration ('google.com' input) > needs > > to scan to the dot, to eliminate the 'google' TLD branch. Then there's a scan > > forward to find the dot, and a third iteration with input 'com', which returns > a > > match. > > > > With a backwards-scanning DAFSA, on input 'www.google.com', you'd need to scan > > as far as 'moc.elgoog.' (because 'withgoogle.com' is in the ETLD registry) to > > know that .com is the TLD, since http://withgoogle.com is in the registry. > > > > With the final scan to extend the (.com) domain to (http://google.com), I > count 32 > > character reads going forward, vs. 17 going backward, but if you count just > the > > DAFSA iterations (which are presumably more expensive than the find-next-dot > > scans), it would be something like 15 vs. 11. > Very interesting, thank you for doing this analysis. I expect the DAFSA > iterations to be worse due to memory locality, but I am not very familiar with > its in-memory structure. > > > > Alternatively, SameDomainOrHost might have an opportunity for a fast-fail path > > where you only call GetDomainAndRegistryAsStringPiece once; after calling it > on > > the first hostname, you could check whether host2.endsWith(domain1). > This is probably worth it. I think a big chunk of time we end up calling > SameDomainOrHost we are calling it with a subresource URL and a > first_party_for_cookies URL. I would expect in cases where the hosts are not > identical, it is more probable for the domains to be different. > > In fact, if you ensured that you always calculated the domain of the larger host > first, you could have a chance to fail-fast with a length difference from > endsWith(). I've split off these two ideas into 2 new bugs (678334 and 678351), and there's good discussion there. this cl still lgtm as-is.
On 2017/01/03 16:50:41, Charlie Harrison wrote: > pkasting: I had reservations about exposing a SameDomainOrHost method that took > raw StringPiece host as input, as it could allow for callers to give > bad/non-canonicalized hosts. An alternative for this patch would be to add a > SameDomainOrHost(Origin, GURL). WDYT? I think your reservation is a good one, and I like your proposed solution. LGTM with that.
On 2017/01/06 02:07:26, Peter Kasting (backlogged) wrote: > On 2017/01/03 16:50:41, Charlie Harrison wrote: > > pkasting: I had reservations about exposing a SameDomainOrHost method that > took > > raw StringPiece host as input, as it could allow for callers to give > > bad/non-canonicalized hosts. An alternative for this patch would be to add a > > SameDomainOrHost(Origin, GURL). WDYT? > > I think your reservation is a good one, and I like your proposed solution. > > LGTM with that. Thanks, done!
BTW thanks Nick for working through those other optimizations in the linked bugs!
The CQ bit was checked by csharrison@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from nick@chromium.org, pkasting@chromium.org Link to the patchset: https://codereview.chromium.org/2568133007/#ps40001 (title: "pkasting review")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
CQ is committing da patch. Bot data: {"patchset_id": 40001, "attempt_start_ts": 1483674258777820, "parent_rev": "90a3ce3d5df79337a285056150d3b9b31ad778e0", "commit_rev": "3b8662f8f863dee29b406e73ce7aa7ad22adad4e"}
Message was sent while issue was closed.
Description was changed from ========== Allow CrossSiteDocumentClassifier to operate on Origins BUG=348655 ========== to ========== Allow CrossSiteDocumentClassifier to operate on Origins BUG=348655 Review-Url: https://codereview.chromium.org/2568133007 Cr-Commit-Position: refs/heads/master@{#441874} Committed: https://chromium.googlesource.com/chromium/src/+/3b8662f8f863dee29b406e73ce7a... ==========
Message was sent while issue was closed.
Committed patchset #3 (id:40001) as https://chromium.googlesource.com/chromium/src/+/3b8662f8f863dee29b406e73ce7a... |