|
|
Created:
6 years, 9 months ago by Olle Liljenzin Modified:
6 years, 6 months ago CC:
chromium-reviews, cbentzel+watch_chromium.org, Randy Smith (Not in Mondays) Base URL:
https://chromium.googlesource.com/chromium/src.git@master Visibility:
Public. |
DescriptionReduce footprint of registry controlled domain table
The perfect hash table containing all registry controlled domains is
replaced by a compact graph (a dafsa) to reduce binary size and PSS
of the running process. Size of the new structure is about 33kB
compared to 380kB for the perfect hash table.
BUG=
Patch Set 1 #
Total comments: 11
Patch Set 2 : Implemented changes suggested in review. #Patch Set 3 : Added bounds checks #
Total comments: 47
Patch Set 4 : Fixed style issues and added unit tests for make_dafsa.py #
Total comments: 25
Patch Set 5 : Fixed some doc strings #
Total comments: 8
Patch Set 6 : Fixed more style issues and doc strings #
Total comments: 14
Patch Set 7 : Implemented requested changes #
Total comments: 4
Patch Set 8 : Removed shebang and execution bits #
Created: 6 years, 7 months ago
(Patch set is too large to download)
Messages
Total messages: 52 (0 generated)
The CQ bit was checked by ollel@opera.com
The CQ bit was unchecked by ollel@opera.com
Could you please review my patch.
Do you have any performance data for this change? Some callers turn out to be pretty sensitive to small differences.
On 2014/03/12 23:46:32, Pam (also PM for reviews) wrote: > Do you have any performance data for this change? Some callers turn out to be > pretty sensitive to small differences. The time-memory tradeoff is always there. We do this to save memory on mobile devices that lack of both cpu power and memory, and after the v8 snapshot the registry domain table is the largest chunk of memory in the binary. The time for each call is quite small and trying to measure a relevant figure I logged all calls to registry_controlled_domains coming from the browser and the render process when I cold-started and visited a first url. Together I got 560 calls to GetDomainAndRegistry() and 220 calls to GetDomainAndRegistry(). Then I created a test program doing the same calls in serial and measured the time for all calls together (running on my desktop [E5-1650]). Using the perfect hash table it used about 0.8-0.9 us for all calls and using the dafsa it used about 1.15-1.19 us, thus a slowdown of about 35%. But compared to the total time for starting and navigating the browser I think this should be "fast enough".
On 2014/03/13 12:31:15, ollel wrote: > 0.8-0.9 us for all calls and using the dafsa it used about > 1.15-1.19 us Should be 0.8-0.9 ms and 1.15-1.19 ms (wrong unit).
On 2014/03/13 12:31:15, ollel wrote: > 560 calls to GetDomainAndRegistry() and 220 calls to GetDomainAndRegistry() 560 calls to GetDomainAndRegistry() and 220 calls to SameDomainOrHost()
I don't know this code. Ryan, can you assist?
OK, just to ack this review, it's definitely going to take me time to parse, as I'm sure it took time to write. As Pam mentions, this is highly performance-sensitive code. Equally, there are a lot of graph traversals being done for code that will *discard* the end result, which has long been a desire to optimize. What I mean by this, specifically, is that for the sake of 'security policy', it *does not matter* if the eTLD == last label. That is, each of the ccTLD entries that are bare, as well as all of the new contracted-but-not-delegated domains *need not be traversed* This optimization is key to being able to reclaim the performance from the sensitive aspects (eg: cookie parsing), while still allowing the flexibility of the less-than-performance-sensitive-but-still-important parts (eg: wildcard cert validation, omnibox navigations) Past discussions with pkasting@ included the possibility of keeping the cookie-relevant parts as a perfect-hash, and move the other aspects to a trie/dafsa, with shared storage between the two (mostly, this involves changing gperf to treat the string-pool as pointer+len, rather than NULL-terminated) Your main speed metrics are going to be on cookie parsing. Unfortunately, we can't share the UMA data directly, so it's hard to know what impact you'll have / experiment yourselves. My advice: Populate a cookie database with 3K+ cookies from a variety of a few dozen domains (which roughly approximates what we see), and run https://code.google.com/p/chromium/codesearch#chromium/src/content/browser/ne... through an Optimized, "Release mode" test to see what effect your code has - positive or negative. Bonus points for checking things like cache misses or stalls. https://codereview.chromium.org/197183002/diff/1/net/base/registry_controlled... File net/base/registry_controlled_domains/registry_controlled_domain.cc (right): https://codereview.chromium.org/197183002/diff/1/net/base/registry_controlled... net/base/registry_controlled_domains/registry_controlled_domain.cc:70: const unsigned char* child = pos; Generally speaking, this level of manual string manipulation scares the crap out of me. I just mention it because it's going to take me a lot more time to reason about the security on, even if it's (probably) correct. That said, the use of magic values here (0x60, 0x40, 0x80, 0x0F) would all be better if given symbolic names, and documented heavily within this code const char kTerminalNodeMask = 0x80; const char kOffsetLengthMask = 0x60; const char kOffsetMask = 0x1F; etc
On 2014/03/18 01:23:57, Ryan Sleevi wrote: > OK, just to ack this review, it's definitely going to take me time to parse, as > I'm sure it took time to write. > > As Pam mentions, this is highly performance-sensitive code. Equally, there are a > lot of graph traversals being done for code that will *discard* the end result, > which has long been a desire to optimize. > > What I mean by this, specifically, is that for the sake of 'security policy', it > *does not matter* if the eTLD == last label. That is, each of the ccTLD entries > that are bare, as well as all of the new contracted-but-not-delegated domains > *need not be traversed* > > This optimization is key to being able to reclaim the performance from the > sensitive aspects (eg: cookie parsing), while still allowing the flexibility of > the less-than-performance-sensitive-but-still-important parts (eg: wildcard cert > validation, omnibox navigations) > > Past discussions with pkasting@ included the possibility of keeping the > cookie-relevant parts as a perfect-hash, and move the other aspects to a > trie/dafsa, with shared storage between the two (mostly, this involves changing > gperf to treat the string-pool as pointer+len, rather than NULL-terminated) > > Your main speed metrics are going to be on cookie parsing. Unfortunately, we > can't share the UMA data directly, so it's hard to know what impact you'll have > / experiment yourselves. > > My advice: Populate a cookie database with 3K+ cookies from a variety of a few > dozen domains (which roughly approximates what we see), and run > https://code.google.com/p/chromium/codesearch#chromium/src/content/browser/ne... > through an Optimized, "Release mode" test to see what effect your code has - > positive or negative. Bonus points for checking things like cache misses or > stalls. > > https://codereview.chromium.org/197183002/diff/1/net/base/registry_controlled... > File net/base/registry_controlled_domains/registry_controlled_domain.cc (right): > > https://codereview.chromium.org/197183002/diff/1/net/base/registry_controlled... > net/base/registry_controlled_domains/registry_controlled_domain.cc:70: const > unsigned char* child = pos; > Generally speaking, this level of manual string manipulation scares the crap out > of me. > > I just mention it because it's going to take me a lot more time to reason about > the security on, even if it's (probably) correct. > > That said, the use of magic values here (0x60, 0x40, 0x80, 0x0F) would all be > better if given symbolic names, and documented heavily within this code > > const char kTerminalNodeMask = 0x80; > const char kOffsetLengthMask = 0x60; > const char kOffsetMask = 0x1F; > > etc FYI: Bryan McQuade and I started a suffix-trie based approach about two years ago but abandoned because something new and shiny came along. See https://codereview.chromium.org/8226007/ and https://code.google.com/p/domain-registry-provider/ not sure if picking up that approach again would be preferable.
On 2014/03/18 01:23:57, Ryan Sleevi wrote: > My advice: Populate a cookie database with 3K+ cookies from a variety of a few > dozen domains (which roughly approximates what we see), and run > https://code.google.com/p/chromium/codesearch#chromium/src/content/browser/ne... > through an Optimized, "Release mode" test to see what effect your code has - > positive or negative. Bonus points for checking things like cache misses or > stalls. I tested with my real personal cookies from my desktop browser. Only 2379 cookies, but I can't see why time shouldn't scale proportionally. I measured total time in InitializeDatabase() and the time spent in the loop over GetDomainAndRegistry() inside InitializeDatabase(). (Thus two figures.) I get big variations between runs (up to 50%) both with perfect hash and dafsa, so I picked the best figures from ten runs for comparison. Total time for InitializeDatabase() was 2.26 ms (perfect hash) and 2.45 ms (dafsa). Time in the loop was 0.95 vs 1.08 ms. So the dafsa is consistently slower, although not much slower. How fast does it have to be? > Generally speaking, this level of manual string manipulation scares the crap out > of me. Me too. But it is a consequence of compression. Note that size of the final structure is just 42% of raw string data size and it replaces both the hash table and the string pool.
I realize I omitted one key detail: the device to test on. On Desktop, the overall time is so low that it hardly matters. On Mobile, in the past, we saw much more amplified perf hits due to cache misses and the like. Given that your dafsa is markedly smaller than the stringpool, I would *expect* things to be much less negatively impactful, but a good bit of data wouldn't hurt. My inclination is that we're fine here - eliding TLDs at the last label can remain a future optimization - and so it will just take me a bit of time to really get my hands dirty here reviewing this. On Mar 18, 2014 8:35 AM, <ollel@opera.com> wrote: > On 2014/03/18 01:23:57, Ryan Sleevi wrote: > >> My advice: Populate a cookie database with 3K+ cookies from a variety of >> a few >> dozen domains (which roughly approximates what we see), and run >> > > https://code.google.com/p/chromium/codesearch#chromium/ > src/content/browser/net/sqlite_persistent_cookie_store.cc&l=554 > >> through an Optimized, "Release mode" test to see what effect your code >> has - >> positive or negative. Bonus points for checking things like cache misses >> or >> stalls. >> > > I tested with my real personal cookies from my desktop browser. Only 2379 > cookies, but I can't see why time shouldn't scale proportionally. I > measured > total time in InitializeDatabase() and the time spent in the loop over > GetDomainAndRegistry() inside InitializeDatabase(). (Thus two figures.) > > I get big variations between runs (up to 50%) both with perfect hash and > dafsa, > so I picked the best figures from ten runs for comparison. > > Total time for InitializeDatabase() was 2.26 ms (perfect hash) and 2.45 ms > (dafsa). > Time in the loop was 0.95 vs 1.08 ms. > > So the dafsa is consistently slower, although not much slower. > > How fast does it have to be? > > Generally speaking, this level of manual string manipulation scares the >> crap >> > out > >> of me. >> > > Me too. But it is a consequence of compression. Note that size of the final > structure is just 42% of raw string data size and it replaces both the hash > table and the string pool. > > https://codereview.chromium.org/197183002/ > To unsubscribe from this group and stop receiving emails from it, send an email to chromium-reviews+unsubscribe@chromium.org.
https://codereview.chromium.org/197183002/diff/1/net/base/registry_controlled... File net/base/registry_controlled_domains/registry_controlled_domain.cc (right): https://codereview.chromium.org/197183002/diff/1/net/base/registry_controlled... net/base/registry_controlled_domains/registry_controlled_domain.cc:66: int LookupString(const unsigned char* pos, const char* key, int length) { 1) Needs documentation 2) int for byte length = better to use size_t here 3) Would it make more sense to use a StringPeice, which will get optimized out. https://codereview.chromium.org/197183002/diff/1/net/base/registry_controlled... net/base/registry_controlled_domains/registry_controlled_domain.cc:72: bool is_last = (*pos & 0x80) != 0; Normally within net/ we encapsulate a lot of the string parsing within iterators (see header parsing, header value parsing, etc) The optimizer should optimize it out all the same, but it makes it easier to read. For example, you could create an iterator to read links, where the returned type can provide disambiguation about end of key, child, etc. Examples: see https://code.google.com/p/chromium/codesearch#chromium/src/net/http/http_util... https://code.google.com/p/chromium/codesearch#chromium/src/net/http/http_util... https://code.google.com/p/chromium/codesearch#chromium/src/net/http/http_requ... https://codereview.chromium.org/197183002/diff/1/net/base/registry_controlled... net/base/registry_controlled_domains/registry_controlled_domain.cc:101: return -1; Not a fan of magic values. If you took base::StringPiece(), returning a size_t (or npos) would be far clearer. [Edit: After having written that, I realize it's instead some magic flags value; definitely need to document this stuff, along with providing a meaningful constant for this negative return value] https://codereview.chromium.org/197183002/diff/1/net/base/registry_controlled... net/base/registry_controlled_domains/registry_controlled_domain.cc:206: int domain_length = host_check_len - curr_start; this should have been a size_t, IINM. https://codereview.chromium.org/197183002/diff/1/net/net.gyp File net/net.gyp (right): https://codereview.chromium.org/197183002/diff/1/net/net.gyp#newcode76 net/net.gyp:76: }, Let's split this up. Ideally, I'd like to avoid using <(SHARED_INTERMEDIATE_DIR), but I'd also like to keep test and prod data separate. I'm happy with a target_defaults that handles gperf and outputs the appropriate files, and then moving the .gperf dependency into net/net_unittests, outputting to INTERMEDIATE_DIR
We're already using this code in Opera and I reviewed this patch before it landed in Opera so I can vouch (as much as anyone can) for it functioning and being a very nice improvement. It is pretty high tech and it takes a little reading to understand but the result has been worth it. (We also experimented with a better perfect hash generator but even with a minimal perfect hash the size was 2-3 times as large than with this custom data structure).
https://codereview.chromium.org/197183002/diff/1/net/base/registry_controlled... File net/base/registry_controlled_domains/registry_controlled_domain.cc (right): https://codereview.chromium.org/197183002/diff/1/net/base/registry_controlled... net/base/registry_controlled_domains/registry_controlled_domain.cc:72: bool is_last = (*pos & 0x80) != 0; Iterators may increase readability by adding a familiar interface on the implementation. But I can't see how it would fit here. E.g. reading a link has side effects (both pos and child pointers are incremented), and hiding that in an overloaded operator or GetNext() method would not make the code easier to read. The current code does not contain repeated patterns and adding abstraction layers will then just blow up code size and split the implementation on different locations, making it harder to follow what really happens in case of debugging. The code is also extremely performance sensitive and compilers are far from perfect. In case the compiler (on some platform) fails to reduce some redundant instructions we will have to read the assembler code to find out where it failed. The current code is just about 50-60 statements (not counting white space and comments), and I would prefer to not make it much larger.
It is absolutely possible to make this code more readable. I'm aware of the performance sensitive nature of this, but that's rarely a good justification for unreadable code. It is clear there are several conceptual steps in this algorithm - reading the next link, determining if its a full match, yielding a common error when its not - that we can strive to improve. Documentation and solid, readable code structure are, with few exceptions, generally far more important than performance; naturally, this has been a tension when developing for mobile. Still, I strongly believe you can encapsulate the graph traversal without overhead, and with a clearer structural representation. On Mar 19, 2014 7:19 AM, <ollel@opera.com> wrote: > > https://codereview.chromium.org/197183002/diff/1/net/base/ > registry_controlled_domains/registry_controlled_domain.cc > File net/base/registry_controlled_domains/registry_controlled_domain.cc > (right): > > https://codereview.chromium.org/197183002/diff/1/net/base/ > registry_controlled_domains/registry_controlled_domain.cc#newcode72 > net/base/registry_controlled_domains/registry_controlled_domain.cc:72: > bool is_last = (*pos & 0x80) != 0; > Iterators may increase readability by adding a familiar interface on the > implementation. But I can't see how it would fit here. E.g. reading a > link has side effects (both pos and child pointers are incremented), and > hiding that in an overloaded operator or GetNext() method would not make > the code easier to read. The current code does not contain repeated > patterns and adding abstraction layers will then just blow up code size > and split the implementation on different locations, making it harder to > follow what really happens in case of debugging. > > The code is also extremely performance sensitive and compilers are far > from perfect. In case the compiler (on some platform) fails to reduce > some redundant instructions we will have to read the assembler code to > find out where it failed. > > The current code is just about 50-60 statements (not counting white > space and comments), and I would prefer to not make it much larger. > > https://codereview.chromium.org/197183002/ > To unsubscribe from this group and stop receiving emails from it, send an email to chromium-reviews+unsubscribe@chromium.org.
https://codereview.chromium.org/197183002/diff/1/net/net.gyp File net/net.gyp (right): https://codereview.chromium.org/197183002/diff/1/net/net.gyp#newcode76 net/net.gyp:76: }, On 2014/03/19 03:08:32, Ryan Sleevi wrote: > Let's split this up. Ideally, I'd like to avoid using > <(SHARED_INTERMEDIATE_DIR), but I'd also like to keep test and prod data > separate. > > I'm happy with a target_defaults that handles gperf and outputs the appropriate > files, and then moving the .gperf dependency into net/net_unittests, outputting > to INTERMEDIATE_DIR I can't figure out how to get the right include path using INTERMEDIATE_DIR. To get effective_tld_names-inc.cc generated before registry_controlled_domain.cc is compiled, I still need to list effective_tld_names.gperf in a separate target that net depends on. (Is this correct?) I tried this: 'target_defaults': { 'rules': [ { 'rule_name': 'dafsa', 'extension': 'gperf', 'outputs': [ '<(INTERMEDIATE_DIR)/<(RULE_INPUT_ROOT)-inc.cc', ], 'inputs': [ 'tools/tld_cleanup/make_dafsa.py', ], 'action': [ 'python', 'tools/tld_cleanup/make_dafsa.py', '<(RULE_INPUT_PATH)', '<(INTERMEDIATE_DIR)/<(RULE_INPUT_ROOT)-inc.cc', ], }, ], }, 'targets': [ { 'target_name': 'net_derived_sources', 'type': 'none', 'sources': [ 'base/registry_controlled_domains/effective_tld_names.gperf', ], 'direct_dependent_settings': { 'include_dirs': [ '<(INTERMEDIATE_DIR)', ], }, }, ... ], Now the problem is that effective_tld_names-inc.cc gets installed in obj/net/net_derived_sources.gen, but -Iobj/net/net.gen/net is added when compiling registry_controlled_domain.cc. I can't see how I can get INTERMEDIATE_DIR expanded to the same value in both places.
Ryan, No one would be more happy than me if I could make the code easier to read (without giving up performance). Replacing numeric constants with symbolic names is fine with me. But I can't figure out how I can make the lookup function more readable by restructuring the code, e.g. by using iterators. As I see it the function has one internal state that ties the code together and I can't find a good way to split it up. It would be very helpful if you could be more specific in how you would like the function to be structured. On 2014/03/19 15:05:17, Ryan Sleevi wrote: > It is absolutely possible to make this code more readable. > > I'm aware of the performance sensitive nature of this, but that's rarely a > good justification for unreadable code. > > It is clear there are several conceptual steps in this algorithm - reading > the next link, determining if its a full match, yielding a common error > when its not - that we can strive to improve. > > Documentation and solid, readable code structure are, with few exceptions, > generally far more important than performance; naturally, this has been a > tension when developing for mobile. > > Still, I strongly believe you can encapsulate the graph traversal without > overhead, and with a clearer structural representation. > On Mar 19, 2014 7:19 AM, <mailto:ollel@opera.com> wrote:
I'll follow-up offline with a suggested rewrite, which also makes sure that I've fundamentally understood the algorithm :) https://codereview.chromium.org/197183002/diff/1/net/net.gyp File net/net.gyp (right): https://codereview.chromium.org/197183002/diff/1/net/net.gyp#newcode76 net/net.gyp:76: }, On 2014/03/21 09:36:42, Olle Liljenzin wrote: > On 2014/03/19 03:08:32, Ryan Sleevi wrote: > > Let's split this up. Ideally, I'd like to avoid using > > <(SHARED_INTERMEDIATE_DIR), but I'd also like to keep test and prod data > > separate. > > > > I'm happy with a target_defaults that handles gperf and outputs the > appropriate > > files, and then moving the .gperf dependency into net/net_unittests, > outputting > > to INTERMEDIATE_DIR > > I can't figure out how to get the right include path using INTERMEDIATE_DIR. > > To get effective_tld_names-inc.cc generated before registry_controlled_domain.cc > is compiled, I still need to list effective_tld_names.gperf in a separate target > that net depends on. (Is this correct?) No, it shouldn't be. 'actions' and 'rules' are supposed to be run before the compilation step (because of this exact reason) This is spelled out in https://code.google.com/p/chromium/codesearch#chromium/src/tools/gyp/pylib/gy... for the order of execution. [Note that 'none' types are generally "not encouraged" in GYP. I really need to get around to writing a GYP Style Guide/All the Ways GYP can hate you Guide] You should be able to do 'targets': [ { 'target_name': 'net', ... 'rules': [ ... ], 'sources': [ ..., 'foo.gperf', ], 'include_dirs': [ ..., '<(INTERMEDIATE_DIR)', ], } ] > > I tried this: > > 'target_defaults': { > 'rules': [ > { > 'rule_name': 'dafsa', > 'extension': 'gperf', > 'outputs': [ > '<(INTERMEDIATE_DIR)/<(RULE_INPUT_ROOT)-inc.cc', > ], > 'inputs': [ > 'tools/tld_cleanup/make_dafsa.py', > ], > 'action': [ > 'python', > 'tools/tld_cleanup/make_dafsa.py', > '<(RULE_INPUT_PATH)', > '<(INTERMEDIATE_DIR)/<(RULE_INPUT_ROOT)-inc.cc', > ], > }, > ], > }, > 'targets': [ > { > 'target_name': 'net_derived_sources', > 'type': 'none', > 'sources': [ > 'base/registry_controlled_domains/effective_tld_names.gperf', > ], > 'direct_dependent_settings': { > 'include_dirs': [ > '<(INTERMEDIATE_DIR)', > ], > }, > }, > ... > ], > > Now the problem is that effective_tld_names-inc.cc gets installed in > obj/net/net_derived_sources.gen, but -Iobj/net/net.gen/net is added when > compiling registry_controlled_domain.cc. I can't see how I can get > INTERMEDIATE_DIR expanded to the same value in both places. <(INTERMEDIATE_DIR) is per-target
https://codereview.chromium.org/197183002/diff/1/net/net.gyp File net/net.gyp (right): https://codereview.chromium.org/197183002/diff/1/net/net.gyp#newcode76 net/net.gyp:76: }, When I put the rule inside the net target, then a target for effective_tld_names-inc.cc is added to net.ninja and it works. To avoid duplicating the rule into both net and net_unittests, I was asked to move the rule into target_defaults. But if I move the rule, gyp will not generate a target for effective_tld_names-inc.cc in net.ninja and the build fails. On 2014/03/27 20:23:01, Ryan Sleevi wrote: > On 2014/03/21 09:36:42, Olle Liljenzin wrote: > > On 2014/03/19 03:08:32, Ryan Sleevi wrote: > > > Let's split this up. Ideally, I'd like to avoid using > > > <(SHARED_INTERMEDIATE_DIR), but I'd also like to keep test and prod data > > > separate. > > > > > > I'm happy with a target_defaults that handles gperf and outputs the > > appropriate > > > files, and then moving the .gperf dependency into net/net_unittests, > > outputting > > > to INTERMEDIATE_DIR > > > > I can't figure out how to get the right include path using INTERMEDIATE_DIR. > > > > To get effective_tld_names-inc.cc generated before > registry_controlled_domain.cc > > is compiled, I still need to list effective_tld_names.gperf in a separate > target > > that net depends on. (Is this correct?) > > No, it shouldn't be. 'actions' and 'rules' are supposed to be run before the > compilation step (because of this exact reason) > > This is spelled out in > https://code.google.com/p/chromium/codesearch#chromium/src/tools/gyp/pylib/gy... > for the order of execution. > > [Note that 'none' types are generally "not encouraged" in GYP. I really need to > get around to writing a GYP Style Guide/All the Ways GYP can hate you Guide] > > You should be able to do > 'targets': [ > { > 'target_name': 'net', > ... > 'rules': [ > ... > ], > 'sources': [ > ..., > 'foo.gperf', > ], > 'include_dirs': [ > ..., > '<(INTERMEDIATE_DIR)', > ], > } > ] > > > > > I tried this: > > > > 'target_defaults': { > > 'rules': [ > > { > > 'rule_name': 'dafsa', > > 'extension': 'gperf', > > 'outputs': [ > > '<(INTERMEDIATE_DIR)/<(RULE_INPUT_ROOT)-inc.cc', > > ], > > 'inputs': [ > > 'tools/tld_cleanup/make_dafsa.py', > > ], > > 'action': [ > > 'python', > > 'tools/tld_cleanup/make_dafsa.py', > > '<(RULE_INPUT_PATH)', > > '<(INTERMEDIATE_DIR)/<(RULE_INPUT_ROOT)-inc.cc', > > ], > > }, > > ], > > }, > > 'targets': [ > > { > > 'target_name': 'net_derived_sources', > > 'type': 'none', > > 'sources': [ > > 'base/registry_controlled_domains/effective_tld_names.gperf', > > ], > > 'direct_dependent_settings': { > > 'include_dirs': [ > > '<(INTERMEDIATE_DIR)', > > ], > > }, > > }, > > ... > > ], > > > > Now the problem is that effective_tld_names-inc.cc gets installed in > > obj/net/net_derived_sources.gen, but -Iobj/net/net.gen/net is added when > > compiling registry_controlled_domain.cc. I can't see how I can get > > INTERMEDIATE_DIR expanded to the same value in both places. > > <(INTERMEDIATE_DIR) is per-target
Just an FYI that an implementation very similar to this already exists as a third party open source project, here: https://code.google.com/p/domain-registry-provider/ I'd just propose that someone take a look at the impl in this CL vs the impl in the linked code project and make a decision about which will be more maintainable going forward. I don't feel strongly about which is chosen & I haven't looked over the code in this CL - I'll leave the decision to those on this thread.
Also, I'm happy to give contributor rights to anyone that wants to contribute to the above linked project.
bmcquade1, I tried to build the linked program with a fresh domain table, but it fails with the following error: "OverflowError: Values 32768 0 0 out of range." It could be because of a syntax change in the data file, but the design document talks about 2 byte offsets into the string pool. The string data one would use in a trie has grown from 20k to 36k. Can this still work with 2 byte offsets? On 2014/03/28 15:05:59, bmcquade1 wrote: > Just an FYI that an implementation very similar to this already exists as a > third party open source project, here: > https://code.google.com/p/domain-registry-provider/ > > I'd just propose that someone take a look at the impl in this CL vs the impl in > the linked code project and make a decision about which will be more > maintainable going forward. I don't feel strongly about which is chosen & I > haven't looked over the code in this CL - I'll leave the decision to those on > this thread.
Hi Ollie, You're right, the project hasn't been updated in a while, and the increased table size no longer fits in the fixed field widths. The code is designed to detect this case and fail, as you encountered. jefftk has recently been working on updating the dat file. I'll go ahead and get that change committed. Additionally, for context, here's the relevant Chromium bug for this: https://code.google.com/p/chromium/issues/detail?id=43880 There is one open issue that emerged when updating to the most recent dat file, which Jeff is working on fixing (https://code.google.com/p/domain-registry-provider/issues/detail?id=3). We'd welcome your help on this if you are interested. I'd be happy to discuss in more detail in an email thread. Let me know!
I don't think domain-registry-provider is currently in a good state right now (doesn't handle the Intranet distinction, is known to have less-than-desirable perf for what's critical), nor is it good to hijack this review for it. I think Olle is right on the money here with this approach. If we do go the domain registry path, we'll need further cleanup there. I promised Olle I'd take a stab at cleaning it up myself, and I am still on track for doing so. Now that I've worked through the implementation (and made sure to annotate my understanding), I think that even without an iterator approach, there's still a fair bit of low-hanging fruit for stylistic/control flow cleanups that avoid having to nest conditionals so deeply. I'll follow-up on this first thing next week - apologies that this review is taking so long, but hopefully understandably, as the code is a little complex :) https://codereview.chromium.org/197183002/diff/1/net/base/registry_controlled... File net/base/registry_controlled_domains/registry_controlled_domain.cc (right): https://codereview.chromium.org/197183002/diff/1/net/base/registry_controlled... net/base/registry_controlled_domains/registry_controlled_domain.cc:90: // single byte in range 0x80-0x9F encoding the return value. As a justification for why we need to split this code into something more readable - it took me way too long to figure out whether the comment was correct and the code was wrong, or whether the code was correct and the comment wrong. In this case, it's the latter. It's a single byte in the range 0x80 - 0x8F, as reflected in the comments for make_dafsa.py
Olle: Apologies for the delay, I spent the last week out sick. I've attempted to distill your algorithm into something a little more 'high level' - in as much as it tries to read the way one might mentally process. This was the biggest challenge when reading the code, was trying to understand each of the cases you were handling. As I see it, there are several possible conditions to match a given node: char <char>+ end_char offsets char <char>+ return_value char end_char offsets char return_value end_char offsets return_value We should not 'descend' into a node if the first char/end_char doesn't match. If it started with a char, then all remaining char/end_chars should match. If it started with end_char, well, then we just dive into offsets. I tried to rewrite the algorithm below as a "strip the first char off - if it doesn't match, try the next node". Then "consume all remaining chars", so that the only thing left is an end_char or return_value. If the end_char matches, descend. If it's a return value, return that. Otherwise, error out. The other thing is we want to make sure we can handle memory corruption somewhat gracefully - that is, by doing bounds checks to make sure we're not jumping off into an abyss. I attempted a little pseudo-code of the algorithm, to sake of discussion, and will attempt to write it up proper (again, apologies - still recovering from a nasty cold/flu, so my brain may be missing the obvious bugs) while (GetNextOffset(&pos, &offset, end)) { // char <char>+ end_char offsets // char <char>+ return value // char end_char offsets // char return value // end_char offsets // return_value bool did_consume = false; if (key != key_end && !IsEOL(offset, end)) { // Leading <char> is not a match. Don't dive into this child if (!IsMatch(end, &offset, &key)) continue; did_consume = true; ++offset ++key; // Possible matches at this point: // <char>+ end_char offsets // <char>+ return value // end_char offsets // return value // Remove all remaining <char> nodes possible while (!IsEOL(offset, end) && key != key_end) { if (!IsMatch(end, &offset, &key)) return kFatal; ++key; ++offset; } } // Possible matches: // end_char offsets // return_value // If one or more <char> elements were consumed, a failure // to match is terminal. Otherwise, try the next node. if (key == key_end) { if (GetReturnValue(end, &offset, &return_value)) return return_value; if (did_consume) return kFatal; continue; } if (!IsEndCharMatch(end, &offset, &key)) { if (did_consume) return kFatal; // Unexpected continue; } ++key; pos = ++offset; // Dive into child }
Ryan, Suddenly the code looks much better. :-) I will try to complete your code and see how it performs. If I don't get time today I will look at it tomorrow. About checking bounds to protect against memory corruption. The array should be mapped read-only. So corruption could happen because of failure in: (a) make_dafsa.py, (b) compiler or (c) hardware. Bound checks wouldn't protect against (b) and (c) since the optimizer could simply remove some checks as redundant. But for (a) I could add a unit test that makes a full graph traversal and checks all nodes and offset are within bounds. That would avoid burning cpu cycles in the browser.
On Apr 8, 2014 1:48 AM, <ollel@opera.com> wrote: > > Ryan, > > Suddenly the code looks much better. :-) > > I will try to complete your code and see how it performs. If I don't get time > today I will look at it tomorrow. > > About checking bounds to protect against memory corruption. The array should be > mapped read-only. So corruption could happen because of failure in: (a) > make_dafsa.py, (b) compiler or (c) hardware. Bound checks wouldn't protect > against (b) and (c) since the optimizer could simply remove some checks as > redundant. But for (a) I could add a unit test that makes a full graph traversal > and checks all nodes and offset are within bounds. That would avoid burning cpu > cycles in the browser. > > https://codereview.chromium.org/197183002/ We actually see memory corruption fairly often on users machines - eg: third party modules or even bad ram. The goal would be to crash still (CHECK), but gracefully. The main concern would be making sure that any given offset is within the bounds of (start, end) - which is a fairly inexpensive check. To unsubscribe from this group and stop receiving emails from it, send an email to chromium-reviews+unsubscribe@chromium.org.
I just don't see the point in doing the check if we can verify that the byte array is correct at compile time. Where else in the code do we check for corruption of read-only segments? Content Shell is 95% code and 5% read-only data and why check data if we don't check the code? The check itself will introduce a new jump to an offset that could get rotten with same probability as the offset it checks, and how do we check that this offset in range? The difference is that you don't see this offset if you don't disassemble the compiler output. And how do we check that the load instruction in the check doesn't flip to a store because of a bit error? About the pseudo-code, where is end_offset handled?
On 2014/04/09 12:49:27, Olle Liljenzin wrote: > I just don't see the point in doing the check if we can verify that the byte > array is correct at compile time. Where else in the code do we check for > corruption of read-only segments? Content Shell is 95% code and 5% read-only > data and why check data if we don't check the code? The check itself will > introduce a new jump to an offset that could get rotten with same probability as > the offset it checks, and how do we check that this offset in range? The > difference is that you don't see this offset if you don't disassemble the > compiler output. And how do we check that the load instruction in the check > doesn't flip to a store because of a bit error? > > About the pseudo-code, where is end_offset handled? Well, I think part of the goal would be to supporting updating the PSL dynamically (eg: via component_updater). Still, I'm surprised to see hesitance towards bounds checking when dealing with raw pointers, which is generally good practice. As for the psuedo-code, I was thinking that end-offset would be handled by GetNextOffset, by setting pos == end, having offset point to the end_offset, and returning true. Note that, within the body, |pos| is never accessed - only offset. On the next iteration, if we didn't descend into a child, pos == end, GetNextOffset() returns false, and then we'd return kFatal at the end of the loop (since the only way we got there would be no handled condition)
Ryan: Sorry for taking so long, but I had to do some other work between. I changed the code as you suggested, except for keeping numeric masks. I could change that too if you insist, but in my eyes it is still more readable with numeric values when we do bit operations. I still don't like the bound checks in ro-data, but they are there now. I also found a small improvement in make_dafsa.py that reduces the generated size with about 1400 bytes by skipping some redundant links. This change has no impact on the code that reads the array.
Hi Olle, I think we're approaching the final stretches of the .cc side, and the graph approach itself is sound, so I've added Marc-Antoine to review the Python side for Chromium style and implementation, as he's far more familiar with the Pythonic aspects than I am. Marc-Antoine, feel free to push to another reviewer, although I suspect you'll enjoy this one. I've focused mostly on style issues. I agree that your use of constant/magic values is probably fine here, since a given magic value is only used once (the exception is 0x80, although that has several meanings in the code, and each of those tends to only be used once, so not a good constant) Mostly style issues at this point https://codereview.chromium.org/197183002/diff/20001/net/base/registry_contro... File net/base/registry_controlled_domains/registry_controlled_domain.cc (right): https://codereview.chromium.org/197183002/diff/20001/net/base/registry_contro... net/base/registry_controlled_domains/registry_controlled_domain.cc:66: size_t graph_length = sizeof(kDafsa); Per http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml?showone=Variab... , global variable names (even though this is semi-constant, save for testing), are traditionally prefixed with a g_ to indicate so const unsigned char* g_graph = kDafsa; size_t g_graph_length = sizeof(kDafsa); https://codereview.chromium.org/197183002/diff/20001/net/base/registry_contro... net/base/registry_controlled_domains/registry_controlled_domain.cc:71: const unsigned char*& offset) { Per http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml?showone=Refere... , all reference arguments are supposed to be const. bool GetNextOffset(const unsigned char** pos, const unsigned char** end, const unsigned char** offset) https://codereview.chromium.org/197183002/diff/20001/net/base/registry_contro... net/base/registry_controlled_domains/registry_controlled_domain.cc:72: if (pos < end) { Convention in Chromium is to handle errors early, to avoid nesting code too deeply if (pos == end) return false; CHECK_LT(pos, end); // When reading an offset ... CHECK_LT(pos + 2, end); size_t bytes_consumed; https://codereview.chromium.org/197183002/diff/20001/net/base/registry_contro... net/base/registry_controlled_domains/registry_controlled_domain.cc:105: CHECK(offset < end); CHECK_LT (and all other occurrences below) https://codereview.chromium.org/197183002/diff/20001/net/base/registry_contro... net/base/registry_controlled_domains/registry_controlled_domain.cc:130: if ((*offset & 0xe0) == 0x80) { nit: If you're going to capitalize 0x0F (line 131) and 0x3F (line 89), capitalize 0xE0 https://codereview.chromium.org/197183002/diff/20001/net/base/registry_contro... net/base/registry_controlled_domains/registry_controlled_domain.cc:137: int kFatal = -1; const int Having looked through this, I suspect this may make more sense as kKeyNotFound (or simply kNotFound? NotPresent?) rather than kFatal. https://codereview.chromium.org/197183002/diff/20001/net/base/registry_contro... net/base/registry_controlled_domains/registry_controlled_domain.cc:157: if (!IsMatch(offset, end, key)) continue; style nit: Convention has been to eschew the one-liners within net (though permitted by the style guide), in favour of if (!IsMatch()) continue; did_consume Apologies for that. Ditto 168, 180, 181, 185 https://codereview.chromium.org/197183002/diff/20001/net/base/registry_contro... net/base/registry_controlled_domains/registry_controlled_domain.cc:168: if (!IsMatch(offset, end, key)) return kFatal; Should comment why this is kFatal. I'm not entirely sure how best to word this (read: feel free to reword), but I suspect something along the lines of // The DAFSA guarantees that if the first char is a match, all // remaining char elements MUST match if the key is truly present. https://codereview.chromium.org/197183002/diff/20001/net/base/registry_contro... net/base/registry_controlled_domains/registry_controlled_domain.cc:173: // Possible matches: s/Possible matches/Possible matches at this point:/ (matching 161) https://codereview.chromium.org/197183002/diff/20001/net/base/registry_contro... net/base/registry_controlled_domains/registry_controlled_domain.cc:196: const int kPrivateRule = 4; While this is not code within a class, convention is that constants are listed before functions (typically at the top of the file), as that most matches the class definition ( http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml?showone=Declar... ) Move this (and line 137) up to 66 https://codereview.chromium.org/197183002/diff/20001/net/base/registry_contro... net/base/registry_controlled_domains/registry_controlled_domain.cc:232: type != -1 && (!(type & kPrivateRule) || s/-1/kFatal https://codereview.chromium.org/197183002/diff/20001/net/base/registry_contro... net/base/registry_controlled_domains/registry_controlled_domain.cc:236: // those, it can't be an actual match. s/ those/ those/ https://codereview.chromium.org/197183002/diff/20001/net/base/registry_contro... net/base/registry_controlled_domains/registry_controlled_domain.cc:389: CHECK(length > 0); CHECK_GT, although really, since it's size_t CHECK_NE(0, length); https://codereview.chromium.org/197183002/diff/20001/net/base/registry_contro... File net/base/registry_controlled_domains/registry_controlled_domain_unittest.cc (right): https://codereview.chromium.org/197183002/diff/20001/net/base/registry_contro... net/base/registry_controlled_domains/registry_controlled_domain_unittest.cc:65: const std::string& host, UnknownRegistryFilter unknown_filter) { Did you run "clang-format" over this (eg: via "git cl format")? I believe it would break similarly to lines 59/60 https://codereview.chromium.org/197183002/diff/20001/net/base/registry_contro... net/base/registry_controlled_domains/registry_controlled_domain_unittest.cc:80: void UseDomainData(const Graph &graph) { style nit: const Graph& graph https://codereview.chromium.org/197183002/diff/20001/net/base/registry_contro... net/base/registry_controlled_domains/registry_controlled_domain_unittest.cc:365: TEST_F(RegistryControlledDomainTest, TestDafsaTwoByteOffsets) { Can you add comments to each of these tests indicating what you're testing and how the data is structured (eg: look at lines 314-318 as an example) I admit, this isn't entirely consistent with the rest of this file, but these tests are fairly subtle, different keys, so it's good to explain how they're structured and what is being tested. https://codereview.chromium.org/197183002/diff/20001/net/net.gyp File net/net.gyp (right): https://codereview.chromium.org/197183002/diff/20001/net/net.gyp#newcode78 net/net.gyp:78: 'target_name': 'net', You didn't seem to add <(SHARED_INTERMEDIATE_DIR)/net to the include_dirs here However, see above https://codereview.chromium.org/197183002/diff/20001/net/net.gyp#newcode1686 net/net.gyp:1686: '<(SHARED_INTERMEDIATE_DIR)/net', Unnecessary, right?
FTR, the space saving looks nice. I just want to make sure the python script is readable for anyone on the team. https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/ma... File net/tools/tld_cleanup/make_dafsa.py (right): https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:1: #!/usr/bin/python Please take a look at other scripts around of get a sense of the general coding style. - Take the 4 first lines header. - """ for docstrings. - Put the main code in main(), e.g.: def main(): ... return 0 if __name__ == '__main__': sys.exit(main()) - 2 empty lines between file level symbols - Imperative verbs to start the function docstrings. - Use with statements, e.g. with open(foo, 'rb') as f: content = f.read() - While historical code tends to use CamelCase functions, we're slowly switching over to more PEP8 compliant names. - Use () instead of \. https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:194: def ToDafsa(words): Correct me if I'm wrong, but nowhere you describe what data types this function returns exactly. I understand it is a list of tuples, with single item list recursively, but that's not clear. https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:201: assert(words) So the script will throw on an empty file? https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:215: if node is None: if not node: (everywhere) https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:235: sink = [] Move closure variables before the function using them for sanity. Do this everywhere. In general, I'm really not a fan of closures at all but the ones in this file are small enough so I do not mind too much. https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:391: output += EncodePrefix(node[0]) optional nit: I generally prefer .append() https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:403: text = '/* This file is generated. Don\'t edit!\n\n' I'd put DO NOT EDIT in all caps. Yelling is good sometimes. It'd be good to explain what it is in the generated header. Someone reading the generated file will have a hard time figuring out what it is. https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:405: text += ' \"' + ' '.join(sys.argv) + '\"\n\n' Not necessary IMHO. https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:410: text += ', '.join(['0x%02x' % byte for byte in data[i:i + 12]]) [] are not needed https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:421: return ToCxx(byte_values) return ToCxx(Encode(dafsa)) https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:433: return [line[:-3] + line[-1] for line in lines] Could you add in a comment how the lines should look like. I have no idea what this should produce from what input.
https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/ma... File net/tools/tld_cleanup/make_dafsa.py (right): https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:1: #!/usr/bin/python On 2014/04/23 12:26:18, M-A Ruel wrote: > Please take a look at other scripts around of get a sense of the general coding > style. > - Take the 4 first lines header. > - """ for docstrings. > - Put the main code in main(), e.g.: > def main(): > ... > return 0 > > > if __name__ == '__main__': > sys.exit(main()) > > - 2 empty lines between file level symbols > - Imperative verbs to start the function docstrings. > - Use with statements, e.g. > with open(foo, 'rb') as f: > content = f.read() > - While historical code tends to use CamelCase functions, we're slowly switching > over to more PEP8 compliant names. > - Use () instead of \. Python coding style is described at http://www.chromium.org/developers/coding-style There's also a copy of pylint in the depot_tools to get basic checks. https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:225: '''Create new nodes''' This docstring could be more descriptive. https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:278: '''Return a macthing node. A new node is created if not matching node typos "matching"; "if no matching node" https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:325: # This is an <end_label> node and no links are follow in such nodes language nit: "no links follow such nodes" https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:433: return [line[:-3] + line[-1] for line in lines] On 2014/04/23 12:26:18, M-A Ruel wrote: > Could you add in a comment how the lines should look like. I have no idea what > this should produce from what input. Also some unit tests for this script. There are various frameworks in use, but at least a set of appropriately chosen inputs and expected outputs would help.
On 2014/04/23 13:13:56, Pam (also PM for reviews) wrote: > Python coding style is described at > http://www.chromium.org/developers/coding-style > There's also a copy of pylint in the depot_tools to get basic checks. I did run pylint and it doesn't complain. The coding style says MixedCase should be used. (I wrote the original code following pep-8 and changed to MixedCase after reading the style guide.) But I am happy for all comments and will make the changes requested here.
https://codereview.chromium.org/197183002/diff/20001/net/base/registry_contro... File net/base/registry_controlled_domains/registry_controlled_domain.cc (right): https://codereview.chromium.org/197183002/diff/20001/net/base/registry_contro... net/base/registry_controlled_domains/registry_controlled_domain.cc:71: const unsigned char*& offset) { On 2014/04/23 01:52:18, Ryan Sleevi wrote: > Per > http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml?showone=Refere... > , all reference arguments are supposed to be const. > > bool GetNextOffset(const unsigned char** pos, > const unsigned char** end, > const unsigned char** offset) Then it should be "** pos, * end, ** offset". The style guide says output parameters should be last in the parameter list. Is there a similar rule for in-out parameters? https://codereview.chromium.org/197183002/diff/20001/net/net.gyp File net/net.gyp (right): https://codereview.chromium.org/197183002/diff/20001/net/net.gyp#newcode78 net/net.gyp:78: 'target_name': 'net', On 2014/04/23 01:52:18, Ryan Sleevi wrote: > You didn't seem to add <(SHARED_INTERMEDIATE_DIR)/net to the include_dirs here > > However, see above It looks like this include path is added to net by some other reason. I don't know why. But I agree it should be added here to not add a dependency on something that could change in the future. https://codereview.chromium.org/197183002/diff/20001/net/net.gyp#newcode1686 net/net.gyp:1686: '<(SHARED_INTERMEDIATE_DIR)/net', On 2014/04/23 01:52:18, Ryan Sleevi wrote: > Unnecessary, right? Necessary unless there is a better way to add the include path to net_unittests. https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/ma... File net/tools/tld_cleanup/make_dafsa.py (right): https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:194: def ToDafsa(words): On 2014/04/23 12:26:18, M-A Ruel wrote: > Correct me if I'm wrong, but nowhere you describe what data types this function > returns exactly. I understand it is a list of tuples, with single item list > recursively, but that's not clear. It returns a source node described on line 24. Clearly it would have been more readable with a node class and named attributes instead of anonymous tuples, lists and dictionaries. But it would also make the script more than ten times slower. Running make_dafsa.py to generate effective_tld_names-inc.cc takes about a second and it replaces gperf that took above 4 minutes to run on the same machine. So now we can generate the files on the fly instead of running gperf manually as before, which is much better in my opinion. https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:201: assert(words) On 2014/04/23 12:26:18, M-A Ruel wrote: > So the script will throw on an empty file? The output format doesn't support the empty graph. It starts with an offset list which must contain at least one offset. I will add a comment to clearify the assert. https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:391: output += EncodePrefix(node[0]) On 2014/04/23 12:26:18, M-A Ruel wrote: > optional nit: I generally prefer .append() List concat and list append are different operations. https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:433: return [line[:-3] + line[-1] for line in lines] On 2014/04/23 12:26:18, M-A Ruel wrote: > Could you add in a comment how the lines should look like. I have no idea what > this should produce from what input. There are two examples, see lines 107 and 152.
Did you forget to upload? https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/ma... File net/tools/tld_cleanup/make_dafsa.py (right): https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:14: The input strings are assumed to consist of printable 7-bit ASCII characters Why not just enforce it and print an error message when this is not the case, then you don't have to document it. https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:194: def ToDafsa(words): On 2014/04/24 09:30:00, Olle Liljenzin wrote: > On 2014/04/23 12:26:18, M-A Ruel wrote: > > Correct me if I'm wrong, but nowhere you describe what data types this > function > > returns exactly. I understand it is a list of tuples, with single item list > > recursively, but that's not clear. > > It returns a source node described on line 24. > > Clearly it would have been more readable with a node class and named attributes > instead of anonymous tuples, lists and dictionaries. But it would also make the > script more than ten times slower. What's the relative perf impact of using collections.namedtuple? https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:391: output += EncodePrefix(node[0]) On 2014/04/24 09:30:00, Olle Liljenzin wrote: > On 2014/04/23 12:26:18, M-A Ruel wrote: > > optional nit: I generally prefer .append() > > List concat and list append are different operations. Eh, I also prefer .extend(). Using += is really confusing IMHO. https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:433: return [line[:-3] + line[-1] for line in lines] Would making it a generator be faster? E.g. replace line 433 with yield line[:-3] + line[-1]
FTR, implementing a python decoder would go a long way to document the format and act as a unit test. That said I wouldn't block the CL on this if you think it's worthless.
On 2014/04/24 12:45:18, M-A Ruel wrote: > Did you forget to upload? I was just replying to some comments. I will upload when I have implemented the requested unit tests. I assume the "import unittest" framework is ok, so now I just have to figure out how to plug this into gyp.
On 2014/04/24 12:54:25, Olle Liljenzin wrote: > On 2014/04/24 12:45:18, M-A Ruel wrote: > > Did you forget to upload? > > I was just replying to some comments. > > I will upload when I have implemented the requested unit tests. I assume the > "import unittest" framework is ok, so now I just have to figure out how to plug > this into gyp. No need to plug it into gyp, make it run on PRESUBMIT.py assuming it's relatively and key its run on the change in either make_data.py or the gperf files, there's no point to run it otherwise.
On 2014/04/24 12:52:41, M-A Ruel wrote: > FTR, implementing a python decoder would go a long way to document the format > and act as a unit test. That said I wouldn't block the CL on this if you think > it's worthless. My idea was to write a simple unit test for each function in make_dafsa.py. Tests to check that each function produces the expected output from an extremely simple input. There are already c++ tests that decode the complete output and implementing the same in python looks redundant to me.
On 2014/04/24 13:28:38, Olle Liljenzin wrote: > On 2014/04/24 12:52:41, M-A Ruel wrote: > > FTR, implementing a python decoder would go a long way to document the format > > and act as a unit test. That said I wouldn't block the CL on this if you think > > it's worthless. > > My idea was to write a simple unit test for each function in make_dafsa.py. > Tests to check that each function produces the expected output from an extremely > simple input. > > There are already c++ tests that decode the complete output and implementing the > same in python looks redundant to me. As I noted, I considered a complete parser optional.
Fixed style issues and added unit tests for make_dafsa.py. https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/ma... File net/tools/tld_cleanup/make_dafsa.py (right): https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:14: The input strings are assumed to consist of printable 7-bit ASCII characters On 2014/04/24 12:45:19, M-A Ruel wrote: > Why not just enforce it and print an error message when this is not the case, > then you don't have to document it. It is enforced, lines 205 and 431-432. But it has to be documented since the compression is based on this kind of restrictions. https://codereview.chromium.org/197183002/diff/20001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:433: return [line[:-3] + line[-1] for line in lines] On 2014/04/24 12:45:19, M-A Ruel wrote: > Would making it a generator be faster? E.g. replace line 433 with > yield line[:-3] + line[-1] You are probably right, but I don't understand exactly how you mean.
lgtm with a few changes and others optional. https://codereview.chromium.org/197183002/diff/30001/net/tools/tld_cleanup/ma... File net/tools/tld_cleanup/make_dafsa.py (right): https://codereview.chromium.org/197183002/diff/30001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:2: Remove empty line https://codereview.chromium.org/197183002/diff/30001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:201: self.msg = msg Why not use standard args? https://codereview.chromium.org/197183002/diff/30001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:202: 2 lines between file level symbols https://codereview.chromium.org/197183002/diff/30001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:381: """ """Encodes ... https://codereview.chromium.org/197183002/diff/30001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:386: buf.reverse() buf = reversed(ord(c) for c in label) ? https://codereview.chromium.org/197183002/diff/30001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:393: """ same https://codereview.chromium.org/197183002/diff/30001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:401: return [ord(c) for c in reversed(label)] I'd do: return reversed(ord(c) for c in label) instead. And I'd make encode_label() call this function. https://codereview.chromium.org/197183002/diff/30001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:439: """Generate C++ code from a word list""" Generates https://codereview.chromium.org/197183002/diff/30001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:447: """Parse gperf file and extract strings and return code""" Parses https://codereview.chromium.org/197183002/diff/30001/net/tools/tld_cleanup/ma... File net/tools/tld_cleanup/make_dafsa_unittest.py (right): https://codereview.chromium.org/197183002/diff/30001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa_unittest.py:2: remove https://codereview.chromium.org/197183002/diff/30001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa_unittest.py:9: import make_dafsa insert empty line above https://codereview.chromium.org/197183002/diff/30001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa_unittest.py:752: self.assertEquals(make_dafsa.words_to_cxx(make_dafsa.parse_gperf(infile)), Always use assertEqual instead of assertEquals. https://codereview.chromium.org/197183002/diff/30001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa_unittest.py:757: sys.exit(unittest.main()) sys.exit() is not needed, unittest.main() never return. https://codereview.chromium.org/197183002/diff/40001/net/tools/tld_cleanup/ma... File net/tools/tld_cleanup/make_dafsa.py (right): https://codereview.chromium.org/197183002/diff/40001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:209: if not words: optional: This requires words to be a list, not a generator. This would have to be fixed if you want to switch to a generator as described below. https://codereview.chromium.org/197183002/diff/40001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:450: begin = lines.index('%%') + 1 optional: Technically, you could create a small state machine and do it on the fly. I'm only stating this because you already noted it was relatively slow, e.g. state = 0 for line in infile: if state == 0: if line == '%%': state = 1 continue if state == 1: if line == '%%': state = 2 continue lines 454-460 plus the yield. That's it. https://codereview.chromium.org/197183002/diff/40001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:453: for line in lines: for line in lines[begin:end]: https://codereview.chromium.org/197183002/diff/40001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:460: line[-1]) optional: add right after line 460: yield line[:-3] + line[-1] and remove line 461.
This LGTM as well. Note that I suspect I will need to manually land this, due to the size of files (I need to land the .gperf, unittest1.gperf, and unittest2.gperf) After your edits, upload the final copy of the CL, and then if you can mail me your raw diff, I'll dcommit it manually. https://codereview.chromium.org/197183002/diff/40001/net/base/registry_contro... File net/base/registry_controlled_domains/registry_controlled_domain.cc (right): https://codereview.chromium.org/197183002/diff/40001/net/base/registry_contro... net/base/registry_controlled_domains/registry_controlled_domain.cc:176: // Possible matches at this point:: nit: extra :
https://codereview.chromium.org/197183002/diff/30001/net/tools/tld_cleanup/ma... File net/tools/tld_cleanup/make_dafsa.py (right): https://codereview.chromium.org/197183002/diff/30001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:201: self.msg = msg On 2014/04/29 21:12:58, M-A Ruel wrote: > Why not use standard args? What is standard args? I looked into the python tutorial tried to write something close to that: https://docs.python.org/2/tutorial/errors.html#user-defined-exceptions https://codereview.chromium.org/197183002/diff/30001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:202: On 2014/04/29 21:12:58, M-A Ruel wrote: > 2 lines between file level symbols Done. https://codereview.chromium.org/197183002/diff/30001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:381: """ On 2014/04/29 21:12:58, M-A Ruel wrote: > """Encodes ... Done. https://codereview.chromium.org/197183002/diff/30001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:386: buf.reverse() On 2014/04/29 21:12:58, M-A Ruel wrote: > buf = reversed(ord(c) for c in label) > ? Unfortunately reversed() wants a sequence as argument. But the label can be iterated in reverse order of course. https://codereview.chromium.org/197183002/diff/30001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:393: """ On 2014/04/29 21:12:58, M-A Ruel wrote: > same Done. https://codereview.chromium.org/197183002/diff/30001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:401: return [ord(c) for c in reversed(label)] On 2014/04/29 21:12:58, M-A Ruel wrote: > I'd do: > return reversed(ord(c) for c in label) > instead. See comment above. > And I'd make encode_label() call this function. Done. https://codereview.chromium.org/197183002/diff/30001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:439: """Generate C++ code from a word list""" On 2014/04/29 21:12:58, M-A Ruel wrote: > Generates Done. https://codereview.chromium.org/197183002/diff/30001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:447: """Parse gperf file and extract strings and return code""" On 2014/04/29 21:12:58, M-A Ruel wrote: > Parses Done. https://codereview.chromium.org/197183002/diff/30001/net/tools/tld_cleanup/ma... File net/tools/tld_cleanup/make_dafsa_unittest.py (right): https://codereview.chromium.org/197183002/diff/30001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa_unittest.py:2: On 2014/04/29 21:12:58, M-A Ruel wrote: > remove Hashbang is needed if we follow the recommendation to run the script in a subprocess: http://dev.chromium.org/developers/how-tos/depottools/presubmit-scripts https://codereview.chromium.org/197183002/diff/30001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa_unittest.py:9: import make_dafsa On 2014/04/29 21:12:58, M-A Ruel wrote: > insert empty line above Done. https://codereview.chromium.org/197183002/diff/30001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa_unittest.py:752: self.assertEquals(make_dafsa.words_to_cxx(make_dafsa.parse_gperf(infile)), On 2014/04/29 21:12:58, M-A Ruel wrote: > Always use assertEqual instead of assertEquals. Done. https://codereview.chromium.org/197183002/diff/30001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa_unittest.py:757: sys.exit(unittest.main()) On 2014/04/29 21:12:58, M-A Ruel wrote: > sys.exit() is not needed, unittest.main() never return. Done. https://codereview.chromium.org/197183002/diff/40001/net/base/registry_contro... File net/base/registry_controlled_domains/registry_controlled_domain.cc (right): https://codereview.chromium.org/197183002/diff/40001/net/base/registry_contro... net/base/registry_controlled_domains/registry_controlled_domain.cc:176: // Possible matches at this point:: On 2014/04/29 21:20:47, Ryan Sleevi wrote: > nit: extra : Done. https://codereview.chromium.org/197183002/diff/40001/net/tools/tld_cleanup/ma... File net/tools/tld_cleanup/make_dafsa.py (right): https://codereview.chromium.org/197183002/diff/40001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:209: if not words: On 2014/04/29 21:12:58, M-A Ruel wrote: > optional: This requires words to be a list, not a generator. This would have to > be fixed if you want to switch to a generator as described below. See comment below. https://codereview.chromium.org/197183002/diff/40001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:450: begin = lines.index('%%') + 1 On 2014/04/29 21:12:58, M-A Ruel wrote: > optional: Technically, you could create a small state machine and do it on the > fly. I'm only stating this because you already noted it was relatively slow, > e.g. > > state = 0 > for line in infile: > if state == 0: > if line == '%%': > state = 1 > continue > if state == 1: > if line == '%%': > state = 2 > continue > lines 454-460 plus the yield. > > That's it. I tried this but it is just slower. Low level implementation of yield is quite complex.
https://codereview.chromium.org/197183002/diff/50001/net/tools/tld_cleanup/PR... File net/tools/tld_cleanup/PRESUBMIT.py (right): https://codereview.chromium.org/197183002/diff/50001/net/tools/tld_cleanup/PR... net/tools/tld_cleanup/PRESUBMIT.py:2: I meant to remove the empty line, it's not needed, I didn't comment on the shebang. https://codereview.chromium.org/197183002/diff/50001/net/tools/tld_cleanup/PR... net/tools/tld_cleanup/PRESUBMIT.py:31: print('Running ' + cmd_name) The output is already printed when in verbose mode, I don't think it's useful to print the command in addition. https://codereview.chromium.org/197183002/diff/50001/net/tools/tld_cleanup/ma... File net/tools/tld_cleanup/make_dafsa.py (right): https://codereview.chromium.org/197183002/diff/50001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:201: self.msg = msg I meant self.args, basically, delete the __init__ function. It just works. https://codereview.chromium.org/197183002/diff/50001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:257: """Generate a new DAFSA where internal nodes are merged if there is a one to Generates (Be consistent) https://codereview.chromium.org/197183002/diff/50001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:348: assert children children = sorted(children, key = lambda x: -offsets[id(x)]) so you don't have to sort it repeatedly inside the while True https://codereview.chromium.org/197183002/diff/50001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:464: sys.exit(-1) return 1 https://codereview.chromium.org/197183002/diff/50001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:471: main() why did you remove the sys.exit()?
https://codereview.chromium.org/197183002/diff/50001/net/tools/tld_cleanup/PR... File net/tools/tld_cleanup/PRESUBMIT.py (right): https://codereview.chromium.org/197183002/diff/50001/net/tools/tld_cleanup/PR... net/tools/tld_cleanup/PRESUBMIT.py:2: On 2014/05/02 16:53:02, M-A Ruel wrote: > I meant to remove the empty line, it's not needed, I didn't comment on the > shebang. Done. https://codereview.chromium.org/197183002/diff/50001/net/tools/tld_cleanup/PR... net/tools/tld_cleanup/PRESUBMIT.py:31: print('Running ' + cmd_name) On 2014/05/02 16:53:02, M-A Ruel wrote: > The output is already printed when in verbose mode, I don't think it's useful to > print the command in addition. Done. https://codereview.chromium.org/197183002/diff/50001/net/tools/tld_cleanup/ma... File net/tools/tld_cleanup/make_dafsa.py (right): https://codereview.chromium.org/197183002/diff/50001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:201: self.msg = msg On 2014/05/02 16:53:02, M-A Ruel wrote: > I meant self.args, basically, delete the __init__ function. It just works. Done. https://codereview.chromium.org/197183002/diff/50001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:257: """Generate a new DAFSA where internal nodes are merged if there is a one to On 2014/05/02 16:53:02, M-A Ruel wrote: > Generates > (Be consistent) Done. https://codereview.chromium.org/197183002/diff/50001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:348: assert children On 2014/05/02 16:53:02, M-A Ruel wrote: > children = sorted(children, key = lambda x: -offsets[id(x)]) > so you don't have to sort it repeatedly inside the while True Done. https://codereview.chromium.org/197183002/diff/50001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:464: sys.exit(-1) On 2014/05/02 16:53:02, M-A Ruel wrote: > return 1 Done. https://codereview.chromium.org/197183002/diff/50001/net/tools/tld_cleanup/ma... net/tools/tld_cleanup/make_dafsa.py:471: main() On 2014/05/02 16:53:02, M-A Ruel wrote: > why did you remove the sys.exit()? Good question. Don't know.
python lgtm with 3 nits https://codereview.chromium.org/197183002/diff/60001/net/tools/tld_cleanup/PR... File net/tools/tld_cleanup/PRESUBMIT.py (right): https://codereview.chromium.org/197183002/diff/60001/net/tools/tld_cleanup/PR... net/tools/tld_cleanup/PRESUBMIT.py:1: #!/usr/bin/env python Technically, this file is not executable, so please remove the shebang line. Also remove the +x bit. https://codereview.chromium.org/197183002/diff/60001/net/tools/tld_cleanup/PR... net/tools/tld_cleanup/PRESUBMIT.py:21: cmd = [input_api.python_executable, test_path] Do it on all platform for consistency.
https://codereview.chromium.org/197183002/diff/60001/net/tools/tld_cleanup/PR... File net/tools/tld_cleanup/PRESUBMIT.py (right): https://codereview.chromium.org/197183002/diff/60001/net/tools/tld_cleanup/PR... net/tools/tld_cleanup/PRESUBMIT.py:1: #!/usr/bin/env python On 2014/05/05 16:54:51, M-A Ruel wrote: > Technically, this file is not executable, so please remove the shebang line. > > Also remove the +x bit. Done. https://codereview.chromium.org/197183002/diff/60001/net/tools/tld_cleanup/PR... net/tools/tld_cleanup/PRESUBMIT.py:21: cmd = [input_api.python_executable, test_path] On 2014/05/05 16:54:51, M-A Ruel wrote: > Do it on all platform for consistency. Note that I just copied the code from the documentation: http://dev.chromium.org/developers/how-tos/depottools/presubmit-scripts
On 2014/05/05 17:32:04, Olle Liljenzin wrote: > https://codereview.chromium.org/197183002/diff/60001/net/tools/tld_cleanup/PR... > File net/tools/tld_cleanup/PRESUBMIT.py (right): > > https://codereview.chromium.org/197183002/diff/60001/net/tools/tld_cleanup/PR... > net/tools/tld_cleanup/PRESUBMIT.py:1: #!/usr/bin/env python > On 2014/05/05 16:54:51, M-A Ruel wrote: > > Technically, this file is not executable, so please remove the shebang line. > > > > Also remove the +x bit. > > Done. > > https://codereview.chromium.org/197183002/diff/60001/net/tools/tld_cleanup/PR... > net/tools/tld_cleanup/PRESUBMIT.py:21: cmd = [input_api.python_executable, > test_path] > On 2014/05/05 16:54:51, M-A Ruel wrote: > > Do it on all platform for consistency. > > Note that I just copied the code from the documentation: > http://dev.chromium.org/developers/how-tos/depottools/presubmit-scripts Cloned Olle's bits at https://codereview.chromium.org/274443003 (which includes the too large files) for manual landing.
Oh and thanks for bearing with me. :) |