Issue 2641953009: [1 of 4] Support prefix queries against the effective_tld_names DAFSA

ncarter (slow)

Description was changed from ========== registry_controlled_domains: Add FixedSetIncrementalLookup Reimplement LookupStringInFixedSet in terms of a new ...

3 years, 11 months ago (2017-01-23 22:47:36 UTC) #1

ncarter (slow)

Description was changed from ========== registry_controlled_domains: Add FixedSetIncrementalLookup Reimplement LookupStringInFixedSet in terms of a new ...

3 years, 11 months ago (2017-01-23 23:00:53 UTC) #2

ncarter (slow)

Description was changed from ========== [registry_controlled_domains] FixedSetIncrementalLookup Reimplement LookupStringInFixedSet in terms of a new incremental ...

3 years, 11 months ago (2017-01-23 23:03:27 UTC) #3

ncarter (slow)

The CQ bit was checked by nick@chromium.org to run a CQ dry run

3 years, 11 months ago (2017-01-23 23:47:19 UTC) #4

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2641953009/40001

3 years, 11 months ago (2017-01-23 23:48:25 UTC) #5

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 11 months ago (2017-01-24 00:05:20 UTC) #6

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: linux_chromium_chromeos_ozone_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_chromeos_ozone_rel_ng/builds/308313) ios-simulator on ...

3 years, 11 months ago (2017-01-24 00:05:22 UTC) #7

ncarter (slow)

Description was changed from ========== [registry_controlled_domains] FixedSetIncrementalLookup Reimplement LookupStringInFixedSet in terms of a new incremental ...

3 years, 11 months ago (2017-01-24 00:19:48 UTC) #8

Description was changed from

==========
[registry_controlled_domains] FixedSetIncrementalLookup

Reimplement LookupStringInFixedSet in terms of a new incremental
prefix lookup class, FixedSetIncrementalLookup.

Use FixedSetIncrementalLookup in a unittest to enumerate the
language of the DAFSA, and validate that it matches
the body of the .gperf file. 

Because the unittest exhaustively explores every node of
the DAFSA on every possible input, it should be safe to
demote the CHECKs in this file to DCHECKs, which is a
performance win.

Because of function inlining and the DCHECK removal, this
actually runs a about 10% faster than the old version.

Current scores on my Windows machine (best of 3 runs):

  RegistryControlledDomainTest.GetDomainAndRegistry: 98ms
  RegistryControlledDomainTest.SameDomainOrHost_Different: 170ms
  RegistryControlledDomainTest.SameDomainOrHost_Different: 194ms

Add some unittests that will be sensitive to reversed inputs.

BUG=678334
==========

to

==========
[registry_controlled_domains] support prefix queries

Reimplement LookupStringInFixedSet in terms of a new incremental
prefix lookup class, FixedSetIncrementalLookup.

Use FixedSetIncrementalLookup in a unittest to enumerate the
language of the DAFSA, and validate that it matches
the body of the .gperf file. 

Because the unittest exhaustively explores every node of
the DAFSA on every possible input, it should be safe to
demote the CHECKs in this file to DCHECKs, which is a
performance win.

Because of function inlining and the DCHECK removal, this
actually runs a about 10% faster than the old version.

Current scores on my Windows machine (best of 3 runs):

  RegistryControlledDomainTest.GetDomainAndRegistry: 98ms
  RegistryControlledDomainTest.SameDomainOrHost_Different: 170ms
  RegistryControlledDomainTest.SameDomainOrHost_Different: 194ms

Add some unittests that will be sensitive to reversed inputs.

BUG=678334
==========

ncarter (slow)

Description was changed from ========== [registry_controlled_domains] support prefix queries Reimplement LookupStringInFixedSet in terms of a ...

3 years, 11 months ago (2017-01-24 00:21:08 UTC) #9

Description was changed from

==========
[registry_controlled_domains] support prefix queries

Reimplement LookupStringInFixedSet in terms of a new incremental
prefix lookup class, FixedSetIncrementalLookup.

Use FixedSetIncrementalLookup in a unittest to enumerate the
language of the DAFSA, and validate that it matches
the body of the .gperf file. 

Because the unittest exhaustively explores every node of
the DAFSA on every possible input, it should be safe to
demote the CHECKs in this file to DCHECKs, which is a
performance win.

Because of function inlining and the DCHECK removal, this
actually runs a about 10% faster than the old version.

Current scores on my Windows machine (best of 3 runs):

  RegistryControlledDomainTest.GetDomainAndRegistry: 98ms
  RegistryControlledDomainTest.SameDomainOrHost_Different: 170ms
  RegistryControlledDomainTest.SameDomainOrHost_Different: 194ms

Add some unittests that will be sensitive to reversed inputs.

BUG=678334
==========

to

==========
Support prefix queries against the effective_tld_names DAFSA

Reimplement LookupStringInFixedSet in terms of a new incremental
prefix lookup class, FixedSetIncrementalLookup.

Use FixedSetIncrementalLookup in a unittest to enumerate the
language of the DAFSA, and validate that it matches
the body of the .gperf file. 

Because the unittest exhaustively explores every node of
the DAFSA on every possible input, it should be safe to
demote the CHECKs in this file to DCHECKs, which is a
performance win.

Because of function inlining and the DCHECK removal, this
actually runs a about 10% faster than the old version.

Current scores on my Windows machine (best of 3 runs):

  RegistryControlledDomainTest.GetDomainAndRegistry: 98ms
  RegistryControlledDomainTest.SameDomainOrHost_Different: 170ms
  RegistryControlledDomainTest.SameDomainOrHost_Different: 194ms

Add some unittests that will be sensitive to reversed inputs.

BUG=678334
==========

ncarter (slow)

Description was changed from ========== Support prefix queries against the effective_tld_names DAFSA Reimplement LookupStringInFixedSet in ...

3 years, 11 months ago (2017-01-24 00:21:40 UTC) #10

Description was changed from

==========
Support prefix queries against the effective_tld_names DAFSA

Reimplement LookupStringInFixedSet in terms of a new incremental
prefix lookup class, FixedSetIncrementalLookup.

Use FixedSetIncrementalLookup in a unittest to enumerate the
language of the DAFSA, and validate that it matches
the body of the .gperf file. 

Because the unittest exhaustively explores every node of
the DAFSA on every possible input, it should be safe to
demote the CHECKs in this file to DCHECKs, which is a
performance win.

Because of function inlining and the DCHECK removal, this
actually runs a about 10% faster than the old version.

Current scores on my Windows machine (best of 3 runs):

  RegistryControlledDomainTest.GetDomainAndRegistry: 98ms
  RegistryControlledDomainTest.SameDomainOrHost_Different: 170ms
  RegistryControlledDomainTest.SameDomainOrHost_Different: 194ms

Add some unittests that will be sensitive to reversed inputs.

BUG=678334
==========

to

==========
Support prefix queries against the effective_tld_names DAFSA

Reimplement LookupStringInFixedSet in terms of a new incremental
prefix lookup class, FixedSetIncrementalLookup.

Use FixedSetIncrementalLookup in a unittest to enumerate the
language of the DAFSA, and validate that it matches
the body of the .gperf file. 

Because the unittest exhaustively explores every node of
the DAFSA on every possible input, it should be safe to
demote the CHECKs in this file to DCHECKs, which is a
performance win.

Because of function inlining and the DCHECK removal, this
actually runs about 10% faster than the old version.

Current scores on my Windows machine (best of 3 runs):

  RegistryControlledDomainTest.GetDomainAndRegistry: 98ms
  RegistryControlledDomainTest.SameDomainOrHost_Different: 170ms
  RegistryControlledDomainTest.SameDomainOrHost_Different: 194ms

Add some unittests that will be sensitive to reversed inputs.

BUG=678334
==========

ncarter (slow)

Description was changed from ========== Support prefix queries against the effective_tld_names DAFSA Reimplement LookupStringInFixedSet in ...

3 years, 11 months ago (2017-01-24 00:21:50 UTC) #11

Description was changed from

==========
Support prefix queries against the effective_tld_names DAFSA

Reimplement LookupStringInFixedSet in terms of a new incremental
prefix lookup class, FixedSetIncrementalLookup.

Use FixedSetIncrementalLookup in a unittest to enumerate the
language of the DAFSA, and validate that it matches
the body of the .gperf file. 

Because the unittest exhaustively explores every node of
the DAFSA on every possible input, it should be safe to
demote the CHECKs in this file to DCHECKs, which is a
performance win.

Because of function inlining and the DCHECK removal, this
actually runs about 10% faster than the old version.

Current scores on my Windows machine (best of 3 runs):

  RegistryControlledDomainTest.GetDomainAndRegistry: 98ms
  RegistryControlledDomainTest.SameDomainOrHost_Different: 170ms
  RegistryControlledDomainTest.SameDomainOrHost_Different: 194ms

Add some unittests that will be sensitive to reversed inputs.

BUG=678334
==========

to

==========
Support prefix queries against the effective_tld_names DAFSA

Reimplement LookupStringInFixedSet in terms of a new incremental
prefix lookup class, FixedSetIncrementalLookup.

Use FixedSetIncrementalLookup in a unittest to enumerate the
language of the DAFSA, and validate that it matches
the body of the .gperf file. 

Because the unittest exhaustively explores every node of
the DAFSA on every possible input, it should be safe to
demote the CHECKs in this file to DCHECKs, which is a
performance win.

Because of function inlining and the CHECK removal, this
actually runs about 10% faster than the old version.

Current scores on my Windows machine (best of 3 runs):

  RegistryControlledDomainTest.GetDomainAndRegistry: 98ms
  RegistryControlledDomainTest.SameDomainOrHost_Different: 170ms
  RegistryControlledDomainTest.SameDomainOrHost_Different: 194ms

Add some unittests that will be sensitive to reversed inputs.

BUG=678334
==========

ncarter (slow)

Description was changed from ========== Support prefix queries against the effective_tld_names DAFSA Reimplement LookupStringInFixedSet in ...

3 years, 11 months ago (2017-01-24 01:33:22 UTC) #12

ncarter (slow)

Description was changed from ========== Support prefix queries against the effective_tld_names DAFSA Reimplement LookupStringInFixedSet in ...

3 years, 11 months ago (2017-01-24 15:32:01 UTC) #13

ncarter (slow)

The CQ bit was checked by nick@chromium.org to run a CQ dry run

3 years, 11 months ago (2017-01-24 21:32:50 UTC) #14

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2641953009/100001

3 years, 11 months ago (2017-01-24 21:33:25 UTC) #15

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 11 months ago (2017-01-24 21:53:05 UTC) #16

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: ios-simulator on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-simulator/builds/142930)

3 years, 11 months ago (2017-01-24 21:53:06 UTC) #17

ncarter (slow)

The CQ bit was checked by nick@chromium.org to run a CQ dry run

3 years, 11 months ago (2017-01-24 23:40:42 UTC) #18

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2641953009/140001

3 years, 11 months ago (2017-01-24 23:41:49 UTC) #19

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 11 months ago (2017-01-24 23:58:35 UTC) #20

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: cast_shell_linux on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/cast_shell_linux/builds/297512)

3 years, 11 months ago (2017-01-24 23:58:36 UTC) #21

ncarter (slow)

The CQ bit was checked by nick@chromium.org to run a CQ dry run

3 years, 11 months ago (2017-01-25 18:39:30 UTC) #22

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2641953009/160001

3 years, 11 months ago (2017-01-25 18:39:57 UTC) #23

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 11 months ago (2017-01-25 19:03:30 UTC) #24

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: cast_shell_linux on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/cast_shell_linux/builds/298128)

3 years, 11 months ago (2017-01-25 19:03:30 UTC) #25

Ryan Sleevi

rsleevi@chromium.org changed reviewers: + rsleevi@chromium.org

3 years, 11 months ago (2017-01-25 19:11:02 UTC) #26

Ryan Sleevi

https://codereview.chromium.org/2641953009/diff/120001/net/base/lookup_string_in_fixed_set.cc File net/base/lookup_string_in_fixed_set.cc (right): https://codereview.chromium.org/2641953009/diff/120001/net/base/lookup_string_in_fixed_set.cc#newcode27 net/base/lookup_string_in_fixed_set.cc:27: inline bool GetNextOffset(const unsigned char** pos, Pedantry: It doesn't ...

3 years, 11 months ago (2017-01-25 19:11:05 UTC) #27

ncarter (slow)

The CQ bit was checked by nick@chromium.org to run a CQ dry run

3 years, 11 months ago (2017-01-26 20:55:06 UTC) #28

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2641953009/180001

3 years, 11 months ago (2017-01-26 20:55:42 UTC) #29

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 11 months ago (2017-01-26 21:19:51 UTC) #30

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: linux_chromium_chromeos_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_chromeos_rel_ng/builds/354085)

3 years, 11 months ago (2017-01-26 21:19:52 UTC) #31

ncarter (slow)

The CQ bit was checked by nick@chromium.org to run a CQ dry run

3 years, 11 months ago (2017-01-26 21:27:59 UTC) #32

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2641953009/200001

3 years, 11 months ago (2017-01-26 21:28:35 UTC) #33

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 11 months ago (2017-01-26 22:39:06 UTC) #34

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

3 years, 11 months ago (2017-01-26 22:39:07 UTC) #35

ncarter (slow)

ptal https://codereview.chromium.org/2641953009/diff/120001/net/base/lookup_string_in_fixed_set.cc File net/base/lookup_string_in_fixed_set.cc (right): https://codereview.chromium.org/2641953009/diff/120001/net/base/lookup_string_in_fixed_set.cc#newcode27 net/base/lookup_string_in_fixed_set.cc:27: inline bool GetNextOffset(const unsigned char** pos, On 2017/01/25 ...

3 years, 11 months ago (2017-01-26 23:29:12 UTC) #36

ptal

https://codereview.chromium.org/2641953009/diff/120001/net/base/lookup_string...
File net/base/lookup_string_in_fixed_set.cc (right):

https://codereview.chromium.org/2641953009/diff/120001/net/base/lookup_string...
net/base/lookup_string_in_fixed_set.cc:27: inline bool GetNextOffset(const
unsigned char** pos,
On 2017/01/25 19:11:04, Ryan Sleevi wrote:
> Pedantry: It doesn't seem "inline" reaches
> https://google.github.io/styleguide/cppguide.html#Inline_Functions and given
> that the compiler can ignore it, does this provide us any benefit? Also given
> that it's an an unnamed namespace, can't the compiler fully optimize this
> anyways?
> 
> I'm not going to push hard on this, just a remark though.

It wasn't done arbitrarily; it was a big win. I just re-measured; inlining this
function drops the first benchmark time (on Windows) from ~122ms to 96-98ms. 

Prior to this change, GetNextOffset was only called from one function, and was
getting inlined; now it's being called from both the Advance() and
GetResultForCurrentSequence(), and it seems that MSVC doesn't want to inline it
without the hint.

https://codereview.chromium.org/2641953009/diff/120001/net/base/lookup_string...
net/base/lookup_string_in_fixed_set.cc:55: inline bool IsEOL(const unsigned
char* offset) {
These inline hints seem to not matter anymore; I've dropped them.

https://codereview.chromium.org/2641953009/diff/120001/net/base/lookup_string...
net/base/lookup_string_in_fixed_set.cc:89:
FixedSetIncrementalLookup::~FixedSetIncrementalLookup() {}
On 2017/01/25 19:11:04, Ryan Sleevi wrote:
> pedantry: newlines between each of these (at least when reading in the diff,
> it's harder to parse)

Done.

https://codereview.chromium.org/2641953009/diff/120001/net/base/lookup_string...
net/base/lookup_string_in_fixed_set.cc:93: // There are no possible matches
because we've exhausted the graph.
On 2017/01/25 19:11:04, Ryan Sleevi wrote:
> There are no possible matches because the graph has been exhausted

Done.

https://codereview.chromium.org/2641953009/diff/120001/net/base/lookup_string...
net/base/lookup_string_in_fixed_set.cc:104: // encoded at |*pos| to see if it is
equal to |key|.
On 2017/01/25 19:11:04, Ryan Sleevi wrote:
> Currently processing a label, so it is only necessary to check the character
...

Done.

https://codereview.chromium.org/2641953009/diff/120001/net/base/lookup_string...
net/base/lookup_string_in_fixed_set.cc:120: // the label at |*offset| matches
|key|.
On 2017/01/25 19:11:04, Ryan Sleevi wrote:
> Continue to read offsets from |pos_| until there are no more offsets, or until
> ...

Done.

https://codereview.chromium.org/2641953009/diff/120001/net/base/lookup_string...
net/base/lookup_string_in_fixed_set.cc:134: // code.
On 2017/01/25 19:11:04, Ryan Sleevi wrote:
> Since |key| was already validated as a printable ASCII value, ...

Done.

https://codereview.chromium.org/2641953009/diff/120001/net/base/lookup_string...
net/base/lookup_string_in_fixed_set.cc:152: // If we didn't find a match, then
we are off the end of the DAFSA.
On 2017/01/25 19:11:04, Ryan Sleevi wrote:
> If no match was found, then the end of the DAFSA has been reached.

Done.

https://codereview.chromium.org/2641953009/diff/120001/net/base/lookup_string...
net/base/lookup_string_in_fixed_set.cc:164: // of a normal character.
On 2017/01/25 19:11:04, Ryan Sleevi wrote:
> If |pos_| points to a position inside a label, then it is only necessary to
> determine whether ...

Done.

https://codereview.chromium.org/2641953009/diff/120001/net/base/lookup_string...
net/base/lookup_string_in_fixed_set.cc:169: // is a result code.
On 2017/01/25 19:11:04, Ryan Sleevi wrote:
> s/We need to explore/Explore/

Done.

https://codereview.chromium.org/2641953009/diff/120001/net/base/lookup_string...
net/base/lookup_string_in_fixed_set.cc:171: // We create a copy of |pos_| for
this search, since mutating |pos_| could
On 2017/01/25 19:11:04, Ryan Sleevi wrote:
> A copy of |pos_| is created for this search, since ...

Done.

https://codereview.chromium.org/2641953009/diff/120001/net/base/lookup_string...
net/base/lookup_string_in_fixed_set.cc:197: // appended every character in
|key|.
On 2017/01/25 19:11:04, Ryan Sleevi wrote:
> Do an incremental lookup until the graph is exhausted or until every character
> in |key| has been appended.

Done.

https://codereview.chromium.org/2641953009/diff/120001/net/base/lookup_string...
net/base/lookup_string_in_fixed_set.cc:205: // Check to see if the node we
reached in the DAFSA search has a result code
On 2017/01/25 19:11:03, Ryan Sleevi wrote:
> s/we//

Done.

https://codereview.chromium.org/2641953009/diff/120001/net/base/lookup_string...
File net/base/lookup_string_in_fixed_set.h (right):

https://codereview.chromium.org/2641953009/diff/120001/net/base/lookup_string...
net/base/lookup_string_in_fixed_set.h:45: // queries can be supported by
constructing the DAFSA on reversed keys.
On 2017/01/25 19:11:04, Ryan Sleevi wrote:
> First Glance: Could you expand on this by including example calls between the
> two and why one versus the other might be desirable?
> 
> I'm thinking something like
> https://google.github.io/styleguide/cppguide.html#Class_Comments
> 
> Since we're effectively providing two different ways to do the same thing, my
> hope is our documentation (for both) can explore how both ways accomplish the
> same thing, and provide a bit of guidance on when one versus the other is more
> appropriate.
> 
> In particular, if this is objectively better in all ways than the current
code,
> it seems like we'd want to deprecate the current code rather than offer both
> ways. So I'm assuming there's some nuance here?

I've added examples to the class comment.

I've also added a TODO to LookupStringInFixedSet to eliminate that function.
I'll amend step 4 to actually do that.

And full disclosure, perf wise: the old lookup code definitely benefitted from
being able to iterate over the input directly -- when it had a run of adjacent
label chars it had a memcmp-like fast path. The reason this change is a net perf
win, is mostly due to the CHECK removal.

So, we might consider having a StringPiece variant of Advance, but we'd have to
add either a 'bool reverse' parameter once we want to treat the strings in
reverse, unless we construct our DAFSA labels-reversed, (e.g. com.appspot
instead of moc.topsppa), which approach actually there's something to be said
for; it's probably more human friendly.

https://codereview.chromium.org/2641953009/diff/120001/net/base/lookup_string...
net/base/lookup_string_in_fixed_set.h:51: // which was generated by the script
make_dafsa.py during compilation.
On 2017/01/25 19:11:04, Ryan Sleevi wrote:
> Can you reference the full path to make_dafsa.py here

Done.

https://codereview.chromium.org/2641953009/diff/120001/net/base/lookup_string...
net/base/lookup_string_in_fixed_set.h:55: // their position in the search. This
is inexpensive.
On 2017/01/25 19:11:04, Ryan Sleevi wrote:
> Just from the header description, it's not entirely obvious when this is
> desirable. But I'm still reading the header incrementally, so perhaps it
becomes
> clearer in subsequent CLs.
> 
> Just sharing first documentation remarks :)

Done.

https://codereview.chromium.org/2641953009/diff/120001/net/base/lookup_string...
net/base/lookup_string_in_fixed_set.h:63: // matches (since the DAFSA format is
currently limited to those).
On 2017/01/25 19:11:04, Ryan Sleevi wrote:
> comment nit: I'm curious why we don't make that a stronger API contract
> guarantee, since it's already an implicit requirement.

Historically, having input char values in the bad ranges (0x00-0x1F, and
0x80-0xFF) could have possibly caused memory errors; in particular values
0x80-0x87 which are result codes. Now there's explicit filtering, and it
guarantees a match failure.

(There's no reason we couldn't extend the dafsa format to [normalized] UTF-8,
but obviously we'd have to carve out bits for that, which would likely shift the
whole encoding scheme around. The wording of this comment was meant to leave
that possibility open.)

Let me know how you'd like this comment to read.

https://codereview.chromium.org/2641953009/diff/120001/net/base/lookup_string...
net/base/lookup_string_in_fixed_set.h:67: // set. Returns true otherwise.
On 2017/01/25 19:11:04, Ryan Sleevi wrote:
> Comment nit: Trying to work out the negative case, and then negating the
> negative case (to figure out the true value) is a little bit of a pain.
> 
> I'm curious if you could simply express this in terms of the conditions it
> returns true for, and omit the "Returns false otherwise".
> 
> It's also unclear whether it's legal to Advance() after false has been
returned
> - is that like advancing past the end of an iterator, or is that just a no-op?

Done.

https://codereview.chromium.org/2641953009/diff/120001/net/base/lookup_string...
net/base/lookup_string_in_fixed_set.h:71: // far to Advance().
On 2017/01/25 19:11:04, Ryan Sleevi wrote:
> So one thing that is left unclear is what happens if Advance() returned false
-
> presumably, kDafsaNotFound.

I've clarified the documentation of Advance() that talks about what a false
return value means.

https://codereview.chromium.org/2641953009/diff/120001/net/base/lookup_string...
net/base/lookup_string_in_fixed_set.h:85: // Pointer to our current position in
the graph, or nullptr if we've exhausted
On 2017/01/25 19:11:04, Ryan Sleevi wrote:
> s/our/the/
> s/we've exhausted the graph/the graph is exhausted/
> 
>
https://groups.google.com/a/chromium.org/d/msg/chromium-dev/NH-S6KCkr2M/yv0z3...
> for context :)

I've rewritten the comments, since you asked. But the consensus at that link
actually seems to be that pronouns are okay.

https://codereview.chromium.org/2641953009/diff/120001/net/base/lookup_string...
net/base/lookup_string_in_fixed_set.h:92: // offsets.
On 2017/01/25 19:11:04, Ryan Sleevi wrote:
> "When paired with" seems to be describing an implementation detail/usage of
it,
> not what it does. I think you can lose the entire sentence and still
accomplish
> the same.

Done.

https://codereview.chromium.org/2641953009/diff/120001/net/base/lookup_string...
File net/base/lookup_string_in_fixed_set_unittest.cc (right):

https://codereview.chromium.org/2641953009/diff/120001/net/base/lookup_string...
net/base/lookup_string_in_fixed_set_unittest.cc:214: //
effective_tld_names.gperf.
On 2017/01/25 19:11:04, Ryan Sleevi wrote:
> So in general, I try to push back on these "Test all input" tests, because I
> don't think they're good to run every test run (versus every build), and what
> you're really testing is the make_dafsa code here, not the implementation of
the
> dafsa bits.
> 
> For example, why is it necessary to test effective_tld_names.gperf - and isn't
> that coupling the use of this class with an implementation detail?

Regarding coupling: these unittests are already written against the .gperf files
in net/base/registry_controlled_domains. This test belongs here, and not in
registry_controlled_domain_unittest.cc, because it doesn't call any functions in
registry_controlled_domain.cc.

Regarding cost: The test is currently medium-expensive: runs 300ms in debug,
30ms in release. We could make it a little faster if we restricted the range of
char values to a smaller subset, but this ways seemed fast enough.

Regarding the coverage value of this test, I see two purposes:
 - Validate that the produced dafsa is equivalent to the input gperf file. This
is a valuable test to have if we tinker with make_dafsa.py, or the decoder
(which this CL does do).
 - Together with the DCHECKs in the implementation, it validates that the
generated dafsa doesn't have any memory errors in the form of offsets past the
end of the array.

Doing this on effective_tld_names.gperf matters, I'd argue, because it's the
only one we ship right now -- this is a validation of that specific binary blob.
I'd support doing similar validation for other dafsas that get linked into the
production binary.

The second purpose could maybe be accomplished by the existing fuzzer, and
turning the DCHECKs into SECURITY_DCHECKs so that they're compiled into the
fuzzer (ASAN) build. However, SECURITY_DCHECK doesn't exist on base. Do you
think ASAN would be able to find memory bugs (out-of-range offsets) without the
|end_| checks?

The first purpose, however, seems like new and valuable test coverage; since
it's a much much stronger validation that there are no strings recognized by the
dafsa that aren't in the original input file.

There's actually such a theoretical bug in the old code, if it were passed an
input char with value 128, it would match a \0 (result code 0) node, and it's
not clear what happens after that. It's possible there would be an extra string
in the language, or an extra memory. -- see the code here
https://cs.chromium.org/chromium/src/net/base/lookup_string_in_fixed_set.cc?q...
.

I know that the input that Lookup currently receives currently has already been
canonicalized via GURL, which is why I describe this bug as theoretical. But,
since the dafsa mechanism is open to reuse for other purposes (and historically
has had other users), validating the inputs at the lookup_string_in_fixed_set
layer seems like not such a bad idea.

If you think this test is too expensive to run on every change, another option
might be to let the existing fuzzer do the work of probing the dafsa for memory
errors (accomplishing purpose 2), and doing the language-is-complete test
(accomplishing purpose 1) only on the unittest sample DAFSAs, where it will run
much faster, and we can just inline the expectation instead of having to read it
from the gperf file.

Let me know how you'd like to proceed.

Ryan Sleevi

LGTM - thanks for doing this; the documentation made this much easier to read & ...

3 years, 11 months ago (2017-01-27 00:08:24 UTC) #37

LGTM - thanks for doing this; the documentation made this much easier to read &
follow & vet it's correct.

What remains below are nits that you can exercise discretion on.

https://codereview.chromium.org/2641953009/diff/120001/net/base/lookup_string...
File net/base/lookup_string_in_fixed_set_unittest.cc (right):

https://codereview.chromium.org/2641953009/diff/120001/net/base/lookup_string...
net/base/lookup_string_in_fixed_set_unittest.cc:214: //
effective_tld_names.gperf.
On 2017/01/26 23:29:11, ncarter wrote:
> Regarding coupling: these unittests are already written against the .gperf
files
> in net/base/registry_controlled_domains. This test belongs here, and not in
> registry_controlled_domain_unittest.cc, because it doesn't call any functions
in
> registry_controlled_domain.cc.

Yeah, this used to be an implementation detail of that file. I've held of
stamping the one that moves all this back into that file because both your CLs
and because of https://codereview.chromium.org/2613833002/

> Regarding the coverage value of this test, I see two purposes:
>  - Validate that the produced dafsa is equivalent to the input gperf file.
This
> is a valuable test to have if we tinker with make_dafsa.py, or the decoder
> (which this CL does do).

I'm torn on this argument, because it doesn't seem like a good approach to
testing to me - but I very well could be wrong here.

On the one hand, part of me thinks that testing the produced dafsa is correct is
something we'd want to do in make_dafsa.py's tests - perhaps as a golden master
comparison. On the other, I can see the argument that this is a systems test.

But when I think about other decoders - such as, say, JSON parsing, it feels
weird/wrong to think we'd decode JSON and then re-encode it, and compare - which
is effectively what we're doing here. We're also not testing the interface
(which is make_dafsa.py), but an implementation detail (how make_dafsa.py reads
the gperf), and that feels like a testing abstraction bleed.

> The second purpose could maybe be accomplished by the existing fuzzer, and
> turning the DCHECKs into SECURITY_DCHECKs so that they're compiled into the
> fuzzer (ASAN) build. However, SECURITY_DCHECK doesn't exist on base. Do you
> think ASAN would be able to find memory bugs (out-of-range offsets) without
the
> |end_| checks?

I don't see why we'd need to compile them to SECURITY_DCHECKs to have the
fuzzers catch this. Am I missing something? ASAN should totally pick up if we're
reading past where the graph input was.

> The first purpose, however, seems like new and valuable test coverage; since
> it's a much much stronger validation that there are no strings recognized by
the
> dafsa that aren't in the original input file.

Sure, but this shouldn't need the 'live' file, should it?

> If you think this test is too expensive to run on every change, another option
> might be to let the existing fuzzer do the work of probing the dafsa for
memory
> errors (accomplishing purpose 2), and doing the language-is-complete test
> (accomplishing purpose 1) only on the unittest sample DAFSAs, where it will
run
> much faster, and we can just inline the expectation instead of having to read
it
> from the gperf file.

Yeah, I lean towards that as the solution, but wanted to make sure I wasn't
missing something.

https://codereview.chromium.org/2641953009/diff/220001/net/base/lookup_string...
File net/base/lookup_string_in_fixed_set.h (right):

https://codereview.chromium.org/2641953009/diff/220001/net/base/lookup_string...
net/base/lookup_string_in_fixed_set.h:77: //    return input.substr(0,
longest_match + 1);
The "longest_match + 1" is a bit of a red flag, mostly because of the "doing
math on npos" is a red flag.

I had to dig out C++03 spec to see if/why it was defined (answer: npos is
defined as -1 in some aspects, or treated as the unsigned equivalent of -1 in
other places; but -1+1 = 0 or 'largest unsigned'+1 is defined to be 0), then I
had to dig up whether .substr(0, 0) was defined (which refers to string(charT*,
size_t), which says it is.

The pedant in me suggests that something like
return longest_match == std::string::npos ? std::string() : input.substr(0,
longest_match + 1) is less ambiguous, even though it's functionally equivalent,
except now with an added conditional.

ncarter (slow)

The CQ bit was checked by nick@chromium.org to run a CQ dry run

3 years, 10 months ago (2017-02-15 23:39:02 UTC) #39

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2641953009/300001

3 years, 10 months ago (2017-02-15 23:39:53 UTC) #40

ncarter (slow)

I think this is ready to land now. I'm renumbering the patches in the series ...

3 years, 10 months ago (2017-02-15 23:42:11 UTC) #41

I think this is ready to land now.

I'm renumbering the patches in the series (the old patch 1 -- the alexa
10K-based benchmark -- may not be cleanly landable). There are existing
net_perftests, in particular the cookiemonster tests, that we ought to be able
to see movement on, in the perf waterfalls.

Give me another day to rebase the rest of the patches. I'll send 'PTAL' when
they're ready for you.

https://codereview.chromium.org/2641953009/diff/120001/net/base/lookup_string...
File net/base/lookup_string_in_fixed_set_unittest.cc (right):

https://codereview.chromium.org/2641953009/diff/120001/net/base/lookup_string...
net/base/lookup_string_in_fixed_set_unittest.cc:214: //
effective_tld_names.gperf.
On 2017/01/27 00:08:24, Ryan Sleevi wrote:
> On 2017/01/26 23:29:11, ncarter wrote:
> > Regarding coupling: these unittests are already written against the .gperf
> files
> > in net/base/registry_controlled_domains. This test belongs here, and not in
> > registry_controlled_domain_unittest.cc, because it doesn't call any
functions
> in
> > registry_controlled_domain.cc.
> 
> Yeah, this used to be an implementation detail of that file. I've held of
> stamping the one that moves all this back into that file because both your CLs
> and because of https://codereview.chromium.org/2613833002/
> 
> > Regarding the coverage value of this test, I see two purposes:
> >  - Validate that the produced dafsa is equivalent to the input gperf file.
> This
> > is a valuable test to have if we tinker with make_dafsa.py, or the decoder
> > (which this CL does do).
> 
> I'm torn on this argument, because it doesn't seem like a good approach to
> testing to me - but I very well could be wrong here.
> 
> On the one hand, part of me thinks that testing the produced dafsa is correct
is
> something we'd want to do in make_dafsa.py's tests - perhaps as a golden
master
> comparison. On the other, I can see the argument that this is a systems test.
> 
> But when I think about other decoders - such as, say, JSON parsing, it feels
> weird/wrong to think we'd decode JSON and then re-encode it, and compare -
which
> is effectively what we're doing here. We're also not testing the interface
> (which is make_dafsa.py), but an implementation detail (how make_dafsa.py
reads
> the gperf), and that feels like a testing abstraction bleed.
> 
> > The second purpose could maybe be accomplished by the existing fuzzer, and
> > turning the DCHECKs into SECURITY_DCHECKs so that they're compiled into the
> > fuzzer (ASAN) build. However, SECURITY_DCHECK doesn't exist on base. Do you
> > think ASAN would be able to find memory bugs (out-of-range offsets) without
> the
> > |end_| checks?
> 
> I don't see why we'd need to compile them to SECURITY_DCHECKs to have the
> fuzzers catch this. Am I missing something? ASAN should totally pick up if
we're
> reading past where the graph input was.
> 
> > The first purpose, however, seems like new and valuable test coverage; since
> > it's a much much stronger validation that there are no strings recognized by
> the
> > dafsa that aren't in the original input file.
> 
> Sure, but this shouldn't need the 'live' file, should it?
> 
> > If you think this test is too expensive to run on every change, another
option
> > might be to let the existing fuzzer do the work of probing the dafsa for
> memory
> > errors (accomplishing purpose 2), and doing the language-is-complete test
> > (accomplishing purpose 1) only on the unittest sample DAFSAs, where it will
> run
> > much faster, and we can just inline the expectation instead of having to
read
> it
> > from the gperf file.
> 
> Yeah, I lean towards that as the solution, but wanted to make sure I wasn't
> missing something.

I've redone these unittests so that they just run against three of the test
DAFSAs, and dropped the dependency on the production .gperf file. This plus the
fuzzer gives sufficient coverage, IMO.

The DCHECKs remain DCHECKs. I agree with your point, that we can trust
ASAN+fuzzer to catch the out-of-range reads, without needing SECURITY_DCHECKs.

https://codereview.chromium.org/2641953009/diff/220001/net/base/lookup_string...
File net/base/lookup_string_in_fixed_set.h (right):

https://codereview.chromium.org/2641953009/diff/220001/net/base/lookup_string...
net/base/lookup_string_in_fixed_set.h:77: //    return input.substr(0,
longest_match + 1);
On 2017/01/27 00:08:24, Ryan Sleevi wrote:
> The "longest_match + 1" is a bit of a red flag, mostly because of the "doing
> math on npos" is a red flag.
> 
> I had to dig out C++03 spec to see if/why it was defined (answer: npos is
> defined as -1 in some aspects, or treated as the unsigned equivalent of -1 in
> other places; but -1+1 = 0 or 'largest unsigned'+1 is defined to be 0), then I
> had to dig up whether .substr(0, 0) was defined (which refers to
string(charT*,
> size_t), which says it is.
> 
> The pedant in me suggests that something like
> return longest_match == std::string::npos ? std::string() : input.substr(0,
> longest_match + 1) is less ambiguous, even though it's functionally
equivalent,
> except now with an added conditional.

Fixed.

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 10 months ago (2017-02-16 01:44:15 UTC) #42

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: chromeos_daisy_chromium_compile_only_ng on master.tryserver.chromium.linux (JOB_TIMED_OUT, no build URL) ...

3 years, 10 months ago (2017-02-16 01:44:17 UTC) #43

ncarter (slow)

The patchset sent to the CQ was uploaded after l-g-t-m from rsleevi@chromium.org Link to the ...

3 years, 10 months ago (2017-02-16 19:10:30 UTC) #45

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2641953009/300001

3 years, 10 months ago (2017-02-16 19:11:48 UTC) #46

commit-bot: I haz the power

CQ is committing da patch. Bot data: {"patchset_id": 300001, "attempt_start_ts": 1487272230365620, "parent_rev": "654a5af38665d42fdcddafaa0a5f85810d5fdacc", "commit_rev": "0b80485a436c8c881164729adb8484868ecd065d"}

3 years, 10 months ago (2017-02-16 20:26:41 UTC) #47

commit-bot: I haz the power

Description was changed from ========== Support prefix queries against the effective_tld_names DAFSA Reimplement LookupStringInFixedSet in ...

3 years, 10 months ago (2017-02-16 20:27:17 UTC) #48

Message was sent while issue was closed.

Description was changed from

==========
Support prefix queries against the effective_tld_names DAFSA

Reimplement LookupStringInFixedSet in terms of a new incremental
prefix lookup class, FixedSetIncrementalLookup.

Use FixedSetIncrementalLookup in a unittest to enumerate the
language of the DAFSA, and validate that it matches
the body of the .gperf file. 

Because the unittest exhaustively explores every node of
the DAFSA on every possible input, it should be safe to
demote the CHECKs in this file to DCHECKs, which is a
performance win.

Because of function inlining and the CHECK removal, this
actually runs about 10% faster than the old version.

Current scores on my Windows machine (best of 3 runs):

  RegistryControlledDomainTest.GetDomainAndRegistry: 98ms
  RegistryControlledDomainTest.SameDomainOrHost_Different: 170ms
  RegistryControlledDomainTest.SameDomainOrHost_Same: 194ms

BUG=678334
==========

to

==========
Support prefix queries against the effective_tld_names DAFSA

Reimplement LookupStringInFixedSet in terms of a new incremental
prefix lookup class, FixedSetIncrementalLookup.

Use FixedSetIncrementalLookup in a unittest to enumerate the
language of the DAFSA, and validate that it matches
the body of the .gperf file. 

Because the unittest exhaustively explores every node of
the DAFSA on every possible input, it should be safe to
demote the CHECKs in this file to DCHECKs, which is a
performance win.

Because of function inlining and the CHECK removal, this
actually runs about 10% faster than the old version.

Current scores on my Windows machine (best of 3 runs):

  RegistryControlledDomainTest.GetDomainAndRegistry: 98ms
  RegistryControlledDomainTest.SameDomainOrHost_Different: 170ms
  RegistryControlledDomainTest.SameDomainOrHost_Same: 194ms

BUG=678334

Review-Url: https://codereview.chromium.org/2641953009
Cr-Commit-Position: refs/heads/master@{#451085}
Committed:
https://chromium.googlesource.com/chromium/src/+/0b80485a436c8c881164729adb84...
==========

commit-bot: I haz the power

3 years, 10 months ago (2017-02-16 20:27:18 UTC) #49

Message was sent while issue was closed.

Committed patchset #16 (id:300001) as
https://chromium.googlesource.com/chromium/src/+/0b80485a436c8c881164729adb84...

Issue 2641953009: [1 of 4] Support prefix queries against the effective_tld_names DAFSA (Closed)

Description

Patch Set 1 #

Patch Set 2 : Comments and DCHECK-ification #

Patch Set 3 : Move EnumerateDafsaLanguage to lookup_string_in_fixed_set_unittest.cc #

Patch Set 4 : Self-review. #

Patch Set 5 : Revert unrelated unittest diffs #

Patch Set 6 : Update #

Patch Set 7 : Rebase #

Patch Set 8 : add gperf to bundle data. #

Patch Set 9 : Add effective_tld_names.gperf to the unittest data dependencies. #

Patch Set 10 : Comment rewriting. #

Patch Set 11 : Convert |sequence| from string to vector (since string.pop_back() isn't everywhere) #

Patch Set 12 : Fixes. #

Patch Set 13 : Move unittests to not depend on production dafsa #

Patch Set 14 : Revert BUILD changes. #

Patch Set 15 : Revert BUILD changes. #

Patch Set 16 : Fix problem #

Messages