Issue 197183002: Reduce footprint of registry controlled domain table (Closed)

Created:
6 years, 9 months ago by Olle Liljenzin

Modified:
6 years, 6 months ago

Reviewers:
nyquist, Pam (message me for reviews), M-A Ruel, Ryan Sleevi

CC:
chromium-reviews, cbentzel+watch_chromium.org, Randy Smith (Not in Mondays)

Base URL:
https://chromium.googlesource.com/chromium/src.git@master

Visibility:
Public.

Description

Reduce footprint of registry controlled domain table

The perfect hash table containing all registry controlled domains is replaced by a compact graph (a dafsa) to reduce binary size and PSS of the running process. Size of the new structure is about 33kB compared to 380kB for the perfect hash table.

BUG=

Patch Set 1 #

Total comments: 11

Patch Set 2 : Implemented changes suggested in review. #

Patch Set 3 : Added bounds checks #

Total comments: 47

Patch Set 4 : Fixed style issues and added unit tests for make_dafsa.py #

Total comments: 25

Patch Set 5 : Fixed some doc strings #

Total comments: 8

Patch Set 6 : Fixed more style issues and doc strings #

Total comments: 14

Patch Set 7 : Implemented requested changes #

Total comments: 4

Patch Set 8 : Removed shebang and execution bits #

Created: 6 years, 7 months ago

Stats: +2201 lines, -459 lines

D	net/base/registry_controlled_domains/effective_tld_names_unittest1.cc	1 chunk	+0 lines, -218 lines	0 comments
D	net/base/registry_controlled_domains/effective_tld_names_unittest2.cc	1 chunk	+0 lines, -161 lines	0 comments
A	net/base/registry_controlled_domains/effective_tld_names_unittest3.gperf	1 chunk	+13 lines, -0 lines	0 comments
A	net/base/registry_controlled_domains/effective_tld_names_unittest4.gperf	1 chunk	+523 lines, -0 lines	0 comments
A	net/base/registry_controlled_domains/effective_tld_names_unittest5.gperf	1 chunk	+9 lines, -0 lines	0 comments
A	net/base/registry_controlled_domains/effective_tld_names_unittest6.gperf	1 chunk	+9 lines, -0 lines	0 comments
M	net/base/registry_controlled_domains/registry_controlled_domain.h	1 chunk	+5 lines, -4 lines	0 comments
M	net/base/registry_controlled_domains/registry_controlled_domain.cc	3 chunks	+175 lines, -50 lines	0 comments
M	net/base/registry_controlled_domains/registry_controlled_domain_unittest.cc	7 chunks	+166 lines, -18 lines	0 comments
M	net/net.gyp	3 chunks	+38 lines, -0 lines	0 comments
A	net/tools/tld_cleanup/PRESUBMIT.py	1 chunk	+32 lines, -0 lines	0 comments
M	net/tools/tld_cleanup/README	1 chunk	+5 lines, -8 lines	0 comments
A	net/tools/tld_cleanup/make_dafsa.py	1 chunk	+469 lines, -0 lines	0 comments
A	net/tools/tld_cleanup/make_dafsa_unittest.py	1 chunk	+757 lines, -0 lines	0 comments

Messages

Total messages: 52 (0 generated)

Pam (message me for reviews)

Do you have any performance data for this change? Some callers turn out to be ...

6 years, 9 months ago (2014-03-12 23:46:32 UTC) #4

Olle Liljenzin

On 2014/03/12 23:46:32, Pam (also PM for reviews) wrote: > Do you have any performance ...

6 years, 9 months ago (2014-03-13 12:31:15 UTC) #5

Olle Liljenzin

On 2014/03/13 12:31:15, ollel wrote: > 0.8-0.9 us for all calls and using the dafsa ...

6 years, 9 months ago (2014-03-13 12:34:49 UTC) #6

Olle Liljenzin

On 2014/03/13 12:31:15, ollel wrote: > 560 calls to GetDomainAndRegistry() and 220 calls to GetDomainAndRegistry() ...

6 years, 9 months ago (2014-03-13 12:38:54 UTC) #7

Randy Smith (Not in Mondays)

I don't know this code. Ryan, can you assist?

6 years, 9 months ago (2014-03-14 19:24:06 UTC) #8

Ryan Sleevi

6 years, 9 months ago (2014-03-18 01:23:57 UTC) #9

OK, just to ack this review, it's definitely going to take me time to parse, as
I'm sure it took time to write.

As Pam mentions, this is highly performance-sensitive code. Equally, there are a
lot of graph traversals being done for code that will *discard* the end result,
which has long been a desire to optimize.

What I mean by this, specifically, is that for the sake of 'security policy', it
*does not matter* if the eTLD == last label. That is, each of the ccTLD entries
that are bare, as well as all of the new contracted-but-not-delegated domains
*need not be traversed*

This optimization is key to being able to reclaim the performance from the
sensitive aspects (eg: cookie parsing), while still allowing the flexibility of
the less-than-performance-sensitive-but-still-important parts (eg: wildcard cert
validation, omnibox navigations)

Past discussions with pkasting@ included the possibility of keeping the
cookie-relevant parts as a perfect-hash, and move the other aspects to a
trie/dafsa, with shared storage between the two (mostly, this involves changing
gperf to treat the string-pool as pointer+len, rather than NULL-terminated)

Your main speed metrics are going to be on cookie parsing. Unfortunately, we
can't share the UMA data directly, so it's hard for you to know what impact
you'll have or to experiment yourselves.

My advice: Populate a cookie database with 3K+ cookies from a variety of a few
dozen domains (which roughly approximates what we see), and run
https://code.google.com/p/chromium/codesearch#chromium/src/content/browser/ne...
through an Optimized, "Release mode" test to see what effect your code has -
positive or negative. Bonus points for checking things like cache misses or
stalls.

https://codereview.chromium.org/197183002/diff/1/net/base/registry_controlled...
File net/base/registry_controlled_domains/registry_controlled_domain.cc (right):

https://codereview.chromium.org/197183002/diff/1/net/base/registry_controlled...
net/base/registry_controlled_domains/registry_controlled_domain.cc:70: const
unsigned char* child = pos;
Generally speaking, this level of manual string manipulation scares the crap out
of me.

I just mention it because it's going to take me a lot more time to reason about
the security on, even if it's (probably) correct.

That said, the use of magic values here (0x60, 0x40, 0x80, 0x0F) would all be
better if given symbolic names, and documented heavily within this code

const unsigned char kTerminalNodeMask = 0x80;
const unsigned char kOffsetLengthMask = 0x60;
const unsigned char kOffsetMask = 0x1F;

etc
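
As a rough illustration of the suggestion above, here is a minimal Python sketch of decoding one byte through named masks instead of inline hex literals. Only the three constant values come from the comment above; the Python names, the decode_header() helper and the assumed meaning of each bit field are hypothetical and may not match the real encoding in registry_controlled_domain.cc.

```python
# Hypothetical names and bit layout, modelled on the constants suggested
# above. The real table encoding may assign these bits differently.
TERMINAL_NODE_MASK = 0x80  # High bit: assumed to mark the last entry.
OFFSET_LENGTH_MASK = 0x60  # Two bits: assumed to select the offset width.
OFFSET_MASK = 0x1F         # Low five bits: assumed to carry offset payload.


def decode_header(byte):
  """Split one header byte into (is_last, length_bits, payload)."""
  if not 0 <= byte <= 0xFF:
    raise ValueError('expected a single byte, got %r' % (byte,))
  is_last = (byte & TERMINAL_NODE_MASK) != 0
  length_bits = (byte & OFFSET_LENGTH_MASK) >> 5
  payload = byte & OFFSET_MASK
  return is_last, length_bits, payload
```

With names in place, each test in the traversal states which field it inspects, and the bit layout is documented in one spot.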

cbentzel

On 2014/03/18 01:23:57, Ryan Sleevi wrote: > OK, just to ack this review, it's definitely ...

6 years, 9 months ago (2014-03-18 01:45:18 UTC) #10

FYI: Bryan McQuade and I started a suffix-trie based approach about two years
ago but abandoned it because something new and shiny came along.

See https://codereview.chromium.org/8226007/ and
https://code.google.com/p/domain-registry-provider/

Not sure whether picking up that approach again would be preferable.

Olle Liljenzin

On 2014/03/18 01:23:57, Ryan Sleevi wrote: > My advice: Populate a cookie database with 3K+ ...

6 years, 9 months ago (2014-03-18 15:35:43 UTC) #11

Ryan Sleevi

6 years, 9 months ago (2014-03-18 15:41:45 UTC) #12

I realize I omitted one key detail: the device to test on.

On Desktop, the overall time is so low that it hardly matters.

On Mobile, in the past, we saw much more amplified perf hits due to cache
misses and the like. Given that your dafsa is markedly smaller than the
stringpool, I would *expect* things to be much less negatively impactful,
but a good bit of data wouldn't hurt.

My inclination is that we're fine here - eliding TLDs at the last label can
remain a future optimization - and so it will just take me a bit of time to
really get my hands dirty here reviewing this.

On Mar 18, 2014 8:35 AM, <ollel@opera.com> wrote:

> On 2014/03/18 01:23:57, Ryan Sleevi wrote:
>> My advice: Populate a cookie database with 3K+ cookies from a variety of
>> a few dozen domains (which roughly approximates what we see), and run
>> https://code.google.com/p/chromium/codesearch#chromium/src/content/browser/net/sqlite_persistent_cookie_store.cc&l=554
>> through an Optimized, "Release mode" test to see what effect your code
>> has - positive or negative. Bonus points for checking things like cache
>> misses or stalls.
>
> I tested with my real personal cookies from my desktop browser. Only 2379
> cookies, but I can't see why time shouldn't scale proportionally. I
> measured total time in InitializeDatabase() and the time spent in the loop
> over GetDomainAndRegistry() inside InitializeDatabase(). (Thus two figures.)
>
> I get big variations between runs (up to 50%) both with perfect hash and
> dafsa, so I picked the best figures from ten runs for comparison.
>
> Total time for InitializeDatabase() was 2.26 ms (perfect hash) and 2.45 ms
> (dafsa). Time in the loop was 0.95 vs 1.08 ms.
>
> So the dafsa is consistently slower, although not much slower.
>
> How fast does it have to be?
>
>> Generally speaking, this level of manual string manipulation scares the
>> crap out of me.
>
> Me too. But it is a consequence of compression. Note that size of the final
> structure is just 42% of raw string data size and it replaces both the hash
> table and the string pool.
>
> https://codereview.chromium.org/197183002/
>
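
The comparison method described in the quoted message (take the best of ten runs, then compare the two implementations) can be sketched in Python as follows. The helper names best_of() and relative_slowdown() are hypothetical; the four timings are the figures quoted above, and nothing here measures the real C++ code.

```python
import time


def best_of(func, runs=10):
  """Return the shortest wall-clock time, in seconds, over `runs` calls."""
  best = float('inf')
  for _ in range(runs):
    start = time.perf_counter()
    func()
    best = min(best, time.perf_counter() - start)
  return best


def relative_slowdown(baseline, candidate):
  """Return how much slower `candidate` is than `baseline`, as a fraction."""
  return (candidate - baseline) / baseline


# Figures quoted above, in milliseconds (perfect hash first, dafsa second).
total = relative_slowdown(2.26, 2.45)  # Whole InitializeDatabase().
loop = relative_slowdown(0.95, 1.08)   # GetDomainAndRegistry() loop only.
print('total: %.1f%% slower, loop: %.1f%% slower' % (total * 100, loop * 100))
```

On the quoted figures this works out to roughly 8% for the whole of InitializeDatabase() and roughly 14% for the lookup loop. Taking the minimum rather than the mean is one common way to damp the run-to-run variation mentioned in the message.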

Ryan Sleevi

https://codereview.chromium.org/197183002/diff/1/net/base/registry_controlled_domains/registry_controlled_domain.cc File net/base/registry_controlled_domains/registry_controlled_domain.cc (right): https://codereview.chromium.org/197183002/diff/1/net/base/registry_controlled_domains/registry_controlled_domain.cc#newcode66 net/base/registry_controlled_domains/registry_controlled_domain.cc:66: int LookupString(const unsigned char* pos, const char* key, int ...

6 years, 9 months ago (2014-03-19 03:08:31 UTC) #13

Daniel Bratell

We're already using this code in Opera and I reviewed this patch before it landed ...

6 years, 9 months ago (2014-03-19 12:20:33 UTC) #14

Olle Liljenzin

https://codereview.chromium.org/197183002/diff/1/net/base/registry_controlled_domains/registry_controlled_domain.cc File net/base/registry_controlled_domains/registry_controlled_domain.cc (right): https://codereview.chromium.org/197183002/diff/1/net/base/registry_controlled_domains/registry_controlled_domain.cc#newcode72 net/base/registry_controlled_domains/registry_controlled_domain.cc:72: bool is_last = (*pos & 0x80) != 0; Iterators ...

6 years, 9 months ago (2014-03-19 14:19:41 UTC) #15

Ryan Sleevi

6 years, 9 months ago (2014-03-19 15:05:17 UTC) #16

It is absolutely possible to make this code more readable.

I'm aware of the performance sensitive nature of this, but that's rarely a
good justification for unreadable code.

It is clear there are several conceptual steps in this algorithm - reading
the next link, determining if it's a full match, yielding a common error
when it's not - that we can strive to improve.

Documentation and solid, readable code structure are, with few exceptions,
generally far more important than performance; naturally, this has been a
tension when developing for mobile.

Still, I strongly believe you can encapsulate the graph traversal without
overhead, and with a clearer structural representation.

On Mar 19, 2014 7:19 AM, <ollel@opera.com> wrote:

> https://codereview.chromium.org/197183002/diff/1/net/base/registry_controlled_domains/registry_controlled_domain.cc
> File net/base/registry_controlled_domains/registry_controlled_domain.cc (right):
>
> https://codereview.chromium.org/197183002/diff/1/net/base/registry_controlled_domains/registry_controlled_domain.cc#newcode72
> net/base/registry_controlled_domains/registry_controlled_domain.cc:72: bool is_last = (*pos & 0x80) != 0;
> Iterators may increase readability by adding a familiar interface on the
> implementation. But I can't see how it would fit here. E.g. reading a
> link has side effects (both pos and child pointers are incremented), and
> hiding that in an overloaded operator or GetNext() method would not make
> the code easier to read. The current code does not contain repeated
> patterns and adding abstraction layers will then just blow up code size
> and split the implementation on different locations, making it harder to
> follow what really happens in case of debugging.
>
> The code is also extremely performance sensitive and compilers are far
> from perfect. In case the compiler (on some platform) fails to reduce
> some redundant instructions we will have to read the assembler code to
> find out where it failed.
>
> The current code is just about 50-60 statements (not counting white
> space and comments), and I would prefer to not make it much larger.
>
> https://codereview.chromium.org/197183002/
>

Olle Liljenzin

https://codereview.chromium.org/197183002/diff/1/net/net.gyp File net/net.gyp (right): https://codereview.chromium.org/197183002/diff/1/net/net.gyp#newcode76 net/net.gyp:76: }, On 2014/03/19 03:08:32, Ryan Sleevi wrote: > Let's ...

6 years, 9 months ago (2014-03-21 09:36:41 UTC) #17

Olle Liljenzin

Ryan, No one would be more happy than me if I could make the code ...

6 years, 9 months ago (2014-03-27 16:14:58 UTC) #18

Ryan Sleevi

6 years, 9 months ago (2014-03-27 20:23:01 UTC) #19

I'll follow-up offline with a suggested rewrite, which also makes sure that I've
fundamentally understood the algorithm :)

https://codereview.chromium.org/197183002/diff/1/net/net.gyp
File net/net.gyp (right):

https://codereview.chromium.org/197183002/diff/1/net/net.gyp#newcode76
net/net.gyp:76: },
On 2014/03/21 09:36:42, Olle Liljenzin wrote:
> On 2014/03/19 03:08:32, Ryan Sleevi wrote:
> > Let's split this up. Ideally, I'd like to avoid using
> > <(SHARED_INTERMEDIATE_DIR), but I'd also like to keep test and prod data
> > separate.
> >
> > I'm happy with a target_defaults that handles gperf and outputs the
> > appropriate files, and then moving the .gperf dependency into
> > net/net_unittests, outputting to INTERMEDIATE_DIR
>
> I can't figure out how to get the right include path using INTERMEDIATE_DIR.
>
> To get effective_tld_names-inc.cc generated before
> registry_controlled_domain.cc is compiled, I still need to list
> effective_tld_names.gperf in a separate target that net depends on. (Is this
> correct?)

No, it shouldn't be. 'actions' and 'rules' are supposed to be run before the
compilation step (because of this exact reason)

This is spelled out in
https://code.google.com/p/chromium/codesearch#chromium/src/tools/gyp/pylib/gy...
for the order of execution.

[Note that 'none' types are generally "not encouraged" in GYP. I really need to
get around to writing a GYP Style Guide/All the Ways GYP can hate you Guide]

You should be able to do:
'targets': [
  {
    'target_name': 'net',
    ...
    'rules': [
      ...
    ],
    'sources': [
      ...,
      'foo.gperf',
    ],
    'include_dirs': [
      ...,
      '<(INTERMEDIATE_DIR)',
    ],
  },
]

> 
> I tried this:
> 
>   'target_defaults': {
>     'rules': [
>       {
>         'rule_name': 'dafsa',
>         'extension': 'gperf',
>         'outputs': [
>           '<(INTERMEDIATE_DIR)/<(RULE_INPUT_ROOT)-inc.cc',
>         ],
>         'inputs': [
>           'tools/tld_cleanup/make_dafsa.py',
>         ],
>         'action': [
>           'python',
>           'tools/tld_cleanup/make_dafsa.py',
>           '<(RULE_INPUT_PATH)',
>           '<(INTERMEDIATE_DIR)/<(RULE_INPUT_ROOT)-inc.cc',
>         ],
>       },
>     ],
>   },
>   'targets': [
>     {
>       'target_name': 'net_derived_sources',
>       'type': 'none',
>       'sources': [
>         'base/registry_controlled_domains/effective_tld_names.gperf',
>       ],
>       'direct_dependent_settings': {
>         'include_dirs': [
>           '<(INTERMEDIATE_DIR)',
>         ],
>       },
>     },
>     ...
>   ],
> 
> Now the problem is that effective_tld_names-inc.cc gets installed in
> obj/net/net_derived_sources.gen, but -Iobj/net/net.gen/net is added when
> compiling registry_controlled_domain.cc. I can't see how I can get
> INTERMEDIATE_DIR expanded to the same value in both places.

<(INTERMEDIATE_DIR) is per-target, so it expands to a different directory in
net_derived_sources than it does in net.

Olle Liljenzin

6 years, 9 months ago (2014-03-28 12:51:22 UTC) #20

https://codereview.chromium.org/197183002/diff/1/net/net.gyp
File net/net.gyp (right):

https://codereview.chromium.org/197183002/diff/1/net/net.gyp#newcode76
net/net.gyp:76: },
When I put the rule inside the net target, then a target for
effective_tld_names-inc.cc is added to net.ninja and it works. To avoid
duplicating the rule into both net and net_unittests, I was asked to move the
rule into target_defaults. But if I move the rule, gyp will not generate a
target for effective_tld_names-inc.cc in net.ninja and the build fails.


bmcquade1

Just an FYI that an implementation very similar to this already exists as a third ...

6 years, 9 months ago (2014-03-28 15:05:59 UTC) #21

bmcquade1

Also, I'm happy to give contributor rights to anyone that wants to contribute to the ...

6 years, 9 months ago (2014-03-28 15:24:10 UTC) #22

Olle Liljenzin

bmcquade1, I tried to build the linked program with a fresh domain table, but it ...

6 years, 9 months ago (2014-03-28 16:07:29 UTC) #23

bmcquade1

Hi Ollie, You're right, the project hasn't been updated in a while, and the increased ...

6 years, 9 months ago (2014-03-28 16:47:27 UTC) #24

Ryan Sleevi

I don't think domain-registry-provider is currently in a good state right now (doesn't handle the ...

6 years, 9 months ago (2014-03-29 02:14:28 UTC) #25

Ryan Sleevi

6 years, 8 months ago (2014-04-07 17:59:56 UTC) #26

Olle: Apologies for the delay, I spent the last week out sick.

I've attempted to distill your algorithm into something a little more 'high
level' - in as much as it tries to read the way one might mentally process it.
The biggest challenge when reading the code was trying to understand each of
the cases you were handling.

As I see it, there are several possible conditions to match a given node:
char <char>+ end_char offsets
char <char>+ return_value
char end_char offsets
char return_value
end_char offsets
return_value

We should not 'descend' into a node if the first char/end_char doesn't match. If
it started with a char, then all remaining char/end_chars should match. If it
started with end_char, well, then we just dive into offsets.

I tried to rewrite the algorithm below as a "strip the first char off - if it
doesn't match, try the next node". Then "consume all remaining chars", so that
the only thing left is an end_char or return_value. If the end_char matches,
descend. If it's a return value, return that. Otherwise, error out.

The other thing is we want to make sure we can handle memory corruption somewhat
gracefully - that is, by doing bounds checks to make sure we're not jumping off
into an abyss.
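
A minimal sketch of what such bounds checks could look like, written in Python for brevity. The helper names and the rule that jumps must move strictly forward are assumptions made for illustration, not a description of the patch.

```python
def read_byte(data, pos):
  """Return data[pos], or None if pos falls outside the table."""
  # Reject negative positions explicitly: Python would otherwise wrap them
  # around to the end of the buffer.
  if pos < 0 or pos >= len(data):
    return None
  return data[pos]


def follow_offset(data, pos, offset):
  """Return the jump target, or None if the jump would leave the table.

  Hypothetical rule: offsets are relative to pos and must move forward,
  which also prevents a corrupted table from looping forever.
  """
  target = pos + offset
  if offset <= 0 or target >= len(data):
    return None
  return target
```

A caller would treat None from either helper as a failed lookup rather than continuing to read.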

I attempted a little pseudo-code of the algorithm, for the sake of discussion,
and will attempt to write it up properly (again, apologies - still recovering
from a nasty cold/flu, so my brain may be missing the obvious bugs)

while (GetNextOffset(&pos, &offset, end)) {
  //   char <char>+ end_char offsets
  //   char <char>+ return value
  //   char end_char offsets
  //   char return value
  //   end_char offsets
  //   return_value
  bool did_consume = false;
  if (key != key_end && !IsEOL(offset, end)) {
    // Leading <char> is not a match. Don't dive into this child
    if (!IsMatch(end, &offset, &key))
      continue;
    did_consume = true;
    ++offset;
    ++key;
    // Possible matches at this point:
    // <char>+ end_char offsets
    // <char>+ return value
    // end_char offsets
    // return value
    // Remove all remaining <char> nodes possible
    while (!IsEOL(offset, end) && key != key_end) {
      if (!IsMatch(end, &offset, &key))
        return kFatal;
      ++key;
      ++offset;
    }
  }
  // Possible matches:
  // end_char offsets
  // return_value
  // If one or more <char> elements were consumed, a failure
  // to match is terminal. Otherwise, try the next node.
  if (key == key_end) {
    if (GetReturnValue(end, &offset, &return_value))
      return return_value;
    if (did_consume)
      return kFatal;
    continue;
  }
  if (!IsEndCharMatch(end, &offset, &key)) {
    if (did_consume)
      return kFatal;  // Unexpected
    continue;
  }
  ++key;
  pos = ++offset;  // Dive into child
}
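
The matching rule in the pseudo-code above (skip a sibling whose first character differs, but treat any later mismatch as final) can be tried out on a toy structure. The Python sketch below is not the byte encoding from the patch: nodes are hypothetical (label, payload) tuples, where payload is either an integer return value or a list of child nodes, and siblings are assumed to start with distinct characters.

```python
NOT_FOUND = -1


def pick_child(siblings, key, pos):
  """Return the only sibling that can match key at pos, or None."""
  for label, payload in siblings:
    if label == '':
      # A bare return-value node matches only when the key is used up.
      if pos == len(key):
        return label, payload
    elif pos < len(key) and key[pos] == label[0]:
      return label, payload
  return None


def lookup(siblings, key):
  """Return the value stored for key, or NOT_FOUND."""
  pos = 0
  while True:
    node = pick_child(siblings, key, pos)
    if node is None:
      return NOT_FOUND
    label, payload = node
    # The first character matched, so no other sibling can match:
    # a mismatch in the rest of the label ends the search.
    if not key.startswith(label, pos):
      return NOT_FOUND
    pos += len(label)
    if isinstance(payload, int):
      return payload if pos == len(key) else NOT_FOUND
    siblings = payload  # Dive into the children.


# Toy table holding "co" -> 0, "co.uk" -> 1 and "jp" -> 0.
TABLE = [
  ('co', [('', 0), ('.uk', 1)]),
  ('jp', 0),
]
```

For example, lookup(TABLE, 'co.uk') walks 'co' and then '.uk' and returns 1, while lookup(TABLE, 'co.jp') commits to the '.uk' child on the '.' and then fails without trying anything else.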

Olle Liljenzin

Ryan, Suddenly the code looks much better. :-) I will try to complete your code ...

6 years, 8 months ago (2014-04-08 08:48:22 UTC) #27

Ryan Sleevi

On Apr 8, 2014 1:48 AM, <ollel@opera.com> wrote: > > Ryan, > > Suddenly the ...

6 years, 8 months ago (2014-04-08 14:35:32 UTC) #28

Olle Liljenzin

I just don't see the point in doing the check if we can verify that ...

6 years, 8 months ago (2014-04-09 12:49:27 UTC) #29

Ryan Sleevi

On 2014/04/09 12:49:27, Olle Liljenzin wrote: > I just don't see the point in doing ...

6 years, 8 months ago (2014-04-09 20:06:06 UTC) #30

Olle Liljenzin

Ryan: Sorry for taking so long, but I had to do some other work between. ...

6 years, 8 months ago (2014-04-22 14:38:02 UTC) #31

Ryan Sleevi

Hi Olle, I think we're approaching the final stretches of the .cc side, and the ...

6 years, 8 months ago (2014-04-23 01:52:17 UTC) #32

M-A Ruel

FTR, the space saving looks nice. I just want to make sure the python script ...

6 years, 8 months ago (2014-04-23 12:26:17 UTC) #33

Pam (message me for reviews)

https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/make_dafsa.py File net/tools/tld_cleanup/make_dafsa.py (right): https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/make_dafsa.py#newcode1 net/tools/tld_cleanup/make_dafsa.py:1: #!/usr/bin/python On 2014/04/23 12:26:18, M-A Ruel wrote: > Please ...

6 years, 8 months ago (2014-04-23 13:13:56 UTC) #34

https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/ma...
File net/tools/tld_cleanup/make_dafsa.py (right):

https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/ma...
net/tools/tld_cleanup/make_dafsa.py:1: #!/usr/bin/python
On 2014/04/23 12:26:18, M-A Ruel wrote:
> Please take a look at other scripts around to get a sense of the general
> coding style.
> - Take the 4 first lines header.
> - """ for docstrings.
> - Put the main code in main(), e.g.:
> def main():
>   ...
>   return 0
> 
> 
> if __name__ == '__main__':
>   sys.exit(main())
> 
> - 2 empty lines between file level symbols
> - Imperative verbs to start the function docstrings.
> - Use with statements, e.g.
> with open(foo, 'rb') as f:
>   content = f.read()
> - While historical code tends to use CamelCase functions, we're slowly
> switching over to more PEP8 compliant names.
> - Use () instead of \.

Python coding style is described at
http://www.chromium.org/developers/coding-style
There's also a copy of pylint in the depot_tools to get basic checks.
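
To make the quoted list concrete, here is a small Python skeleton that follows several of its points: triple-double-quoted docstrings that start with an imperative verb, a with statement for file access, two blank lines between file-level symbols, and a main() that returns an exit code. The script itself (counting non-blank lines) and its names are hypothetical and unrelated to make_dafsa.py. The usual guard, if __name__ == '__main__': sys.exit(main(sys.argv[1:])), would go at the bottom; it is left out here so the block can be pasted into an interpreter without exiting.

```python
"""Count the non-blank lines of a file (illustrative skeleton only)."""

import sys


def count_lines(path):
  """Return the number of non-blank lines in the file at path."""
  with open(path, 'rb') as infile:
    content = infile.read()
  return len([line for line in content.splitlines() if line.strip()])


def main(argv):
  """Print the line count for argv[0] and return a process exit code."""
  if len(argv) != 1:
    sys.stderr.write('usage: count_lines.py <file>\n')
    return 1
  print(count_lines(argv[0]))
  return 0
```

Returning the exit code from main() instead of calling sys.exit() inside it keeps the function easy to call from a unit test.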

https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/ma...
net/tools/tld_cleanup/make_dafsa.py:225: '''Create new nodes'''
This docstring could be more descriptive.

https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/ma...
net/tools/tld_cleanup/make_dafsa.py:278: '''Return a macthing node. A new node
is created if not matching node
typos "matching"; "if no matching node"

https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/ma...
net/tools/tld_cleanup/make_dafsa.py:325: # This is an <end_label> node and no
links are follow in such nodes
language nit: "no links follow such nodes"

https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/ma...
net/tools/tld_cleanup/make_dafsa.py:433: return [line[:-3] + line[-1] for line
in lines]
On 2014/04/23 12:26:18, M-A Ruel wrote:
> Could you add in a comment what the lines should look like. I have no idea what
> this should produce from what input.

Also some unit tests for this script. There are various frameworks in use, but
at least a set of appropriately chosen inputs and expected outputs would help.
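
As a sketch of the kind of test asked for here, the expression quoted from make_dafsa.py:433 can be wrapped in a function and checked against hand-picked inputs. The function name strip_separator() and the assumed input shape, a name followed by a comma, a space and a single digit, are hypothetical; only the list comprehension itself comes from the quoted line.

```python
import unittest


def strip_separator(lines):
  """Turn 'name, N' into 'nameN' by dropping the two characters before N."""
  return [line[:-3] + line[-1] for line in lines]


class StripSeparatorTest(unittest.TestCase):
  def testSingleLine(self):
    """Check one line of the assumed 'name, digit' form."""
    self.assertEqual(strip_separator(['aa, 1']), ['aa1'])

  def testSeveralLines(self):
    """Check that every line is converted independently."""
    self.assertEqual(strip_separator(['co.uk, 0', 'jp, 2']),
                     ['co.uk0', 'jp2'])
```

The tests double as the documentation requested in the quoted comment, since they show an input next to the output it should produce.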

Olle Liljenzin

On 2014/04/23 13:13:56, Pam (also PM for reviews) wrote: > Python coding style is described ...

6 years, 8 months ago (2014-04-23 15:00:01 UTC) #35

Olle Liljenzin

https://codereview.chromium.org/197183002/diff/20001/net/base/registry_controlled_domains/registry_controlled_domain.cc File net/base/registry_controlled_domains/registry_controlled_domain.cc (right): https://codereview.chromium.org/197183002/diff/20001/net/base/registry_controlled_domains/registry_controlled_domain.cc#newcode71 net/base/registry_controlled_domains/registry_controlled_domain.cc:71: const unsigned char*& offset) { On 2014/04/23 01:52:18, Ryan ...

6 years, 8 months ago (2014-04-24 09:29:59 UTC) #36

https://codereview.chromium.org/197183002/diff/20001/net/base/registry_contro...
File net/base/registry_controlled_domains/registry_controlled_domain.cc (right):

https://codereview.chromium.org/197183002/diff/20001/net/base/registry_contro...
net/base/registry_controlled_domains/registry_controlled_domain.cc:71: const
unsigned char*& offset) {
On 2014/04/23 01:52:18, Ryan Sleevi wrote:
> Per
>
http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml?showone=Refere...
> , all reference arguments are supposed to be const.
> 
> bool GetNextOffset(const unsigned char** pos,
>                    const unsigned char** end,
>                    const unsigned char** offset)

Then it should be "** pos, * end, ** offset". The style guide says output
parameters should be last in the parameter list. Is there a similar rule for
in-out parameters?

https://codereview.chromium.org/197183002/diff/20001/net/net.gyp
File net/net.gyp (right):

https://codereview.chromium.org/197183002/diff/20001/net/net.gyp#newcode78
net/net.gyp:78: 'target_name': 'net',
On 2014/04/23 01:52:18, Ryan Sleevi wrote:
> You didn't seem to add <(SHARED_INTERMEDIATE_DIR)/net to the include_dirs here
> 
> However, see above

It looks like this include path is added to net for some other reason; I don't
know why. But I agree it should be added here explicitly, so that we don't
depend on something that could change in the future.

https://codereview.chromium.org/197183002/diff/20001/net/net.gyp#newcode1686
net/net.gyp:1686: '<(SHARED_INTERMEDIATE_DIR)/net',
On 2014/04/23 01:52:18, Ryan Sleevi wrote:
> Unnecessary, right?

Necessary unless there is a better way to add the include path to net_unittests.

https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/ma...
File net/tools/tld_cleanup/make_dafsa.py (right):

https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/ma...
net/tools/tld_cleanup/make_dafsa.py:194: def ToDafsa(words):
On 2014/04/23 12:26:18, M-A Ruel wrote:
> Correct me if I'm wrong, but nowhere you describe what data types this function
> returns exactly. I understand it is a list of tuples, with single item list
> recursively, but that's not clear.

It returns a source node described on line 24.

Clearly it would have been more readable with a node class and named attributes
instead of anonymous tuples, lists and dictionaries. But that would also have
made the script more than ten times slower.

Running make_dafsa.py to generate effective_tld_names-inc.cc takes about a
second, and it replaces gperf, which took over 4 minutes to run on the same
machine. So now we can generate the files on the fly instead of running gperf
manually as before, which is much better in my opinion.
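
For readers without the script at hand, the reviewer's reading of the return value (a list of tuples, each holding a label and a single-item list of children, nested recursively) can be sketched roughly as below. This is an assumption based on the comments in this thread, not the script's actual code: `to_nodes` and `to_words` are hypothetical names, and using `None` as the end marker is also assumed.

```python
def to_nodes(word):
    """Turns a non-empty word into nested (label, [children]) tuples.

    Hypothetical shape: one character per node, None marks the end of a word.
    """
    if len(word) == 1:
        return (word, [None])
    return (word[0], [to_nodes(word[1:])])


def to_words(node):
    """Flattens a nested node back into the list of words it spells."""
    if node is None:
        return [""]
    label, children = node
    return [label + rest for child in children for rest in to_words(child)]


# One source node per input word, before any merging of shared parts.
nodes = [to_nodes(w) for w in ["ab", "c"]]
words_back = [w for node in nodes for w in to_words(node)]
```

Plain tuples and lists like these avoid per-node object construction, which is the trade-off against readability described above.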

https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/ma...
net/tools/tld_cleanup/make_dafsa.py:201: assert(words)
On 2014/04/23 12:26:18, M-A Ruel wrote:
> So the script will throw on an empty file?

The output format doesn't support the empty graph. It starts with an offset list
which must contain at least one offset.

I will add a comment to clarify the assert.
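
One way to make the constraint explicit is an up-front check that raises a descriptive error. The sketch below uses hypothetical names (`InputError`, `check_words`) and is not taken from the patch; unlike `assert`, an explicit exception is also kept when Python runs with `-O`.

```python
class InputError(Exception):
    """Hypothetical error type for input the encoder cannot represent."""


def check_words(words):
    """Returns words unchanged, or raises InputError if the list is empty."""
    # The encoded graph starts with an offset list that needs at least one
    # entry, so an empty word list has no valid encoding.
    if not words:
        raise InputError("The word list must not be empty")
    return words
```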

https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/ma...
net/tools/tld_cleanup/make_dafsa.py:391: output += EncodePrefix(node[0])
On 2014/04/23 12:26:18, M-A Ruel wrote:
> optional nit: I generally prefer .append()

List concat and list append are different operations.
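
A minimal sketch of the difference, assuming the helper returns a list (as the `+=` on line 391 implies); the byte values are made up for illustration:

```python
# `encoded` stands in for the list returned by a helper such as
# EncodePrefix(); the values here are invented.
encoded = [0x61, 0x62]

concat = [0x01]
concat += encoded          # in-place concatenation: adds each element

appended = [0x01]
appended.append(encoded)   # adds the list itself as one nested element

extended = [0x01]
extended.extend(encoded)   # method-call spelling of the += above
```

Here `concat` and `extended` end up as flat three-element lists, while `appended` holds two elements, the second being a nested list.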

https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/ma...
net/tools/tld_cleanup/make_dafsa.py:433: return [line[:-3] + line[-1] for line
in lines]
On 2014/04/23 12:26:18, M-A Ruel wrote:
> Could you add in a comment how the lines should look like. I have no idea what
> this should produce from what input.

There are two examples, see lines 107 and 152.
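
Taken on its own, the expression keeps the last character of each line and drops the two characters before it. A sketch on made-up input, assuming lines shaped like `<name>, <digit>` (the real input format is documented in the script, not in this thread):

```python
# Hypothetical input lines; the "<name>, <digit>" shape is an assumption.
lines = ["aaa, 1", "bbb.cc, 2"]

# line[:-3] is everything except the last three characters (", 1"),
# line[-1] is the trailing digit, so the ", " separator is removed.
result = [line[:-3] + line[-1] for line in lines]
```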

M-A Ruel

Did you forget to upload? https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/make_dafsa.py File net/tools/tld_cleanup/make_dafsa.py (right): https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/make_dafsa.py#newcode14 net/tools/tld_cleanup/make_dafsa.py:14: The input strings are ...

6 years, 8 months ago (2014-04-24 12:45:18 UTC) #37

M-A Ruel

FTR, implementing a python decoder would go a long way to document the format and ...

6 years, 8 months ago (2014-04-24 12:52:41 UTC) #38

Olle Liljenzin

On 2014/04/24 12:45:18, M-A Ruel wrote: > Did you forget to upload? I was just ...

6 years, 8 months ago (2014-04-24 12:54:25 UTC) #39

M-A Ruel

On 2014/04/24 12:54:25, Olle Liljenzin wrote: > On 2014/04/24 12:45:18, M-A Ruel wrote: > > ...

6 years, 8 months ago (2014-04-24 13:00:16 UTC) #40

Olle Liljenzin

On 2014/04/24 12:52:41, M-A Ruel wrote: > FTR, implementing a python decoder would go a ...

6 years, 8 months ago (2014-04-24 13:28:38 UTC) #41

M-A Ruel

On 2014/04/24 13:28:38, Olle Liljenzin wrote: > On 2014/04/24 12:52:41, M-A Ruel wrote: > > ...

6 years, 8 months ago (2014-04-25 00:25:52 UTC) #42

Olle Liljenzin

Fixed style issues and added unit tests for make_dafsa.py. https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/make_dafsa.py File net/tools/tld_cleanup/make_dafsa.py (right): https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/make_dafsa.py#newcode14 net/tools/tld_cleanup/make_dafsa.py:14: ...

6 years, 7 months ago (2014-04-29 12:41:52 UTC) #43

M-A Ruel

lgtm with a few changes and others optional. https://codereview.chromium.org/197183002/diff/30001/net/tools/tld_cleanup/make_dafsa.py File net/tools/tld_cleanup/make_dafsa.py (right): https://codereview.chromium.org/197183002/diff/30001/net/tools/tld_cleanup/make_dafsa.py#newcode2 net/tools/tld_cleanup/make_dafsa.py:2: Remove ...

6 years, 7 months ago (2014-04-29 21:12:57 UTC) #44

Ryan Sleevi

This LGTM as well. Note that I suspect I will need to manually land this, ...

6 years, 7 months ago (2014-04-29 21:20:46 UTC) #45

Olle Liljenzin

https://codereview.chromium.org/197183002/diff/30001/net/tools/tld_cleanup/make_dafsa.py File net/tools/tld_cleanup/make_dafsa.py (right): https://codereview.chromium.org/197183002/diff/30001/net/tools/tld_cleanup/make_dafsa.py#newcode201 net/tools/tld_cleanup/make_dafsa.py:201: self.msg = msg On 2014/04/29 21:12:58, M-A Ruel wrote: ...

6 years, 7 months ago (2014-04-30 11:43:42 UTC) #46

M-A Ruel

https://codereview.chromium.org/197183002/diff/50001/net/tools/tld_cleanup/PRESUBMIT.py File net/tools/tld_cleanup/PRESUBMIT.py (right): https://codereview.chromium.org/197183002/diff/50001/net/tools/tld_cleanup/PRESUBMIT.py#newcode2 net/tools/tld_cleanup/PRESUBMIT.py:2: I meant to remove the empty line, it's not ...

6 years, 7 months ago (2014-05-02 16:53:01 UTC) #47

Olle Liljenzin

6 years, 7 months ago (2014-05-05 08:41:39 UTC) #48

M-A Ruel

python lgtm with 3 nits https://codereview.chromium.org/197183002/diff/60001/net/tools/tld_cleanup/PRESUBMIT.py File net/tools/tld_cleanup/PRESUBMIT.py (right): https://codereview.chromium.org/197183002/diff/60001/net/tools/tld_cleanup/PRESUBMIT.py#newcode1 net/tools/tld_cleanup/PRESUBMIT.py:1: #!/usr/bin/env python Technically, this ...

6 years, 7 months ago (2014-05-05 16:54:50 UTC) #49

Olle Liljenzin

https://codereview.chromium.org/197183002/diff/60001/net/tools/tld_cleanup/PRESUBMIT.py File net/tools/tld_cleanup/PRESUBMIT.py (right): https://codereview.chromium.org/197183002/diff/60001/net/tools/tld_cleanup/PRESUBMIT.py#newcode1 net/tools/tld_cleanup/PRESUBMIT.py:1: #!/usr/bin/env python On 2014/05/05 16:54:51, M-A Ruel wrote: > ...

6 years, 7 months ago (2014-05-05 17:32:04 UTC) #50

Ryan Sleevi

On 2014/05/05 17:32:04, Olle Liljenzin wrote: > https://codereview.chromium.org/197183002/diff/60001/net/tools/tld_cleanup/PRESUBMIT.py > File net/tools/tld_cleanup/PRESUBMIT.py (right): > > https://codereview.chromium.org/197183002/diff/60001/net/tools/tld_cleanup/PRESUBMIT.py#newcode1 ...

6 years, 7 months ago (2014-05-06 23:08:17 UTC) #51

Oh and thanks for bearing with me. :)
