Chromium Code Reviews

Issue 2128173003: Merge the hash prefixes from the old store and the additions in the partial (Closed)
Created: 4 years, 5 months ago by vakh (use Gerrit instead) | Modified: 4 years, 4 months ago | CC: chromium-reviews, palmer, noé, woz | Base URL: https://chromium.googlesource.com/chromium/src.git@master | Target Ref: refs/pending/heads/master | Project: chromium | Visibility: Public

Description: Merge the hash prefixes from the old store and the additions in the partial
update fetched from the PVer4 service to update the hash prefixes in the new
store.
Design doc: https://goto.google.com/design-doc-v4store
BUG=543161
Committed: https://crrev.com/2eb43a6078426caf95b9fd46342e47696bfb4272
Cr-Commit-Position: refs/heads/master@{#405331}
Patch Set 1
Patch Set 2: git fetch and pull
  Total comments: 1
Patch Set 3: Minor changes to the test inputs
  Total comments: 25
Patch Set 4: HashPrefix as std::string and so is HashPrefixes (the list of hashes). HashPrefixes is a concatenat…
Patch Set 5: Nit: Added a NOTREACHED
Patch Set 6: git fetch and pull
  Total comments: 21
Patch Set 7: Simplified merge logic. +Nit: Fix a failing test
  Total comments: 8
Patch Set 8: Discard update if there's an error in processing it. Add tests. CR feedback.
Patch Set 9: Minor: Address comments by nparker@
Patch Set 10: Add the histogram info to histograms.xml
  Total comments: 1
Patch Set 11: Nit: s/soace/space

Dependent Patchsets:

Messages
Total messages: 37 (6 generated)
     
 vakh@chromium.org changed reviewers: + nparker@chromium.org 
 nparker: This isn't ready for a full review yet but it might be worth taking a quick look to understand the merge algorithm. Removals are TODO. shess, palmer: just FYI for now. Feel free to review any time. 
 git fetch and pull 
 Minor changes to the test inputs 
 https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... File components/safe_browsing_db/v4_store.cc (right): https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.cc:22: const uint32_t kMinHashPrefixLength = 4; Nit: Would it be easier to read if the code didn't use the word prefix, and just called things "hashes" (of different length)? https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.cc:130: } todo: Do something reasonable if response_type is something else. https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.cc:150: DCHECK(kMinHashPrefixLength <= prefix_size); todo: skip additions with invalid metadata (i.e. more than just dcheck). https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.cc:156: RecordMergeUpdateResult(result); Might as well record successes as well, unless that's recorded elsewhere. https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.cc:196: if (!found || Replace !found with !smallest_prefix, and then you don't need found. https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.cc:223: HashPrefixes& hash_prefixes = hash_prefix_map.at(prefix_size); nit (Feel free to ignore): Might be a bit more readable just all stuck together: hash_prefix_map.at(prefix_size).at( counter_map.at(prefix_size)) https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.cc:248: if (found_in_old && found_in_additions) { To remove repeated code here, could you do if (found_in_old) next_smallest_prefix_old = GetNextUnmergedPrefixForSize( next_size_for_old, old_prefixes_map, old_counter_map); if (found_in_additions) next_smallest_prefix_additions = GetNextUnmergedPrefixForSize( next_size_for_additions, additions_map, additions_counter_map); Then you can de-dup the moving/advancing by computing a variable that says which set to pull from: bool take_from_old = found_in_old && found_in_additions && compare(..); if (take_from_old) { //move from old set, increment, get next smallest. } else { //move from additions set, increment, get next smallest. } I might s/found_in_old/old_has_next/, and similar for additions. "Found" sounds like you're searching for something. Or flip it and call it "old_is_empty"? https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... File components/safe_browsing_db/v4_store.h (right): https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.h:26: typedef std::unique_ptr<char> HashPrefix; I think we talked about this already, but keeping a pointer to each 4-byte hash will triple the memory requirement. To keep one contiguous block, you could make a container, say VarHashArray, with a ctor like VarHashArray(size_t hash_length); You could make a () operator that'd return a char* ptr to the x*hash_length'th byte. You could write a VarHashArray::compare(other) that'd take into account the potentially different hash lengths for the two arrays. You'd probably need an append method, and an iterator (or just use ints). Hmm, seems like this must exist. It's almost covered by std::valarray. 
https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.h:33: typedef base::hash_map<PrefixSize, HashPrefixes> HashPrefixMap; I was going to say you should think about using a vector<HashPrefixes>, indexed by PrefixSize since that'd be faster to iterate through -- it wouldn't have to hash the size before looking it up. We have the advantage that we know the index is bounded 4-32. But.. it wouldn't be much faster. The hash function for a size_t is a noop (I looked it up), and then to find the hash bucket it'd have to do a few pointer de-ref's. So for now I'd say the readability of this outweighs the potential perf gain. We'll measure it later. https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.h:189: HashPrefixMap& hash_prefix_map, I think the style guide says output-args should be pointers rather than references. https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.h:194: static HashPrefixMap GetHashPrefixMapFromAdditions( Make HashPrefixMap a ptr output-arg. Same below. Otherwise it makes temp copies of them on the stack. https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... File components/safe_browsing_db/v4_store_unittest.cc (right): https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store_unittest.cc:201: EXPECT_EQ(4u, prefix_size); Is this testing that "----" < "-----"? https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store_unittest.cc:245: } We should test all the branches of the merge code. Some more cases: 1) old or additions has zero hashes of a particular size 2) old list runs out (merges the lexicographically last element) first 3) additions list runs out first 4) additions has an identical entry. This shouldn't happen, but we should have some defined behavior for it. 
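For reference, a self-contained sketch of the "take_from_old" merge shape suggested in the v4_store.cc:248 comment above. The names (GetNextUnmergedPrefix, MergeUpdate, the counter maps) and the std::string-based typedefs are illustrative assumptions that roughly track the later patch sets, not the CL's actual code:

    // Illustrative only. A HashPrefixMap maps a prefix size to a sorted,
    // concatenated string of prefixes of that size; a CounterMap tracks how
    // many prefixes of each size have already been merged.
    #include <cstddef>
    #include <map>
    #include <string>

    using PrefixSize = size_t;
    using HashPrefix = std::string;
    using HashPrefixes = std::string;  // concatenation of same-sized prefixes
    using HashPrefixMap = std::map<PrefixSize, HashPrefixes>;
    using CounterMap = std::map<PrefixSize, size_t>;

    // Finds the lexicographically smallest not-yet-merged prefix across all
    // sizes. Returns false when |prefix_map| is exhausted.
    bool GetNextUnmergedPrefix(const HashPrefixMap& prefix_map,
                               const CounterMap& counters,
                               HashPrefix* prefix) {
      bool found = false;
      for (const auto& pair : prefix_map) {
        PrefixSize size = pair.first;
        size_t offset = counters.at(size) * size;
        if (offset >= pair.second.size())
          continue;  // every prefix of this size has already been merged
        HashPrefix candidate = pair.second.substr(offset, size);
        if (!found || candidate < *prefix) {
          *prefix = candidate;
          found = true;
        }
      }
      return found;
    }

    HashPrefixMap MergeUpdate(const HashPrefixMap& old_map,
                              const HashPrefixMap& additions) {
      HashPrefixMap merged;
      CounterMap old_counters, add_counters;
      for (const auto& pair : old_map) old_counters[pair.first] = 0;
      for (const auto& pair : additions) add_counters[pair.first] = 0;

      HashPrefix next_old, next_add;
      bool old_has_next = GetNextUnmergedPrefix(old_map, old_counters, &next_old);
      bool add_has_next = GetNextUnmergedPrefix(additions, add_counters, &next_add);
      while (old_has_next || add_has_next) {
        // Pull from the old store only when both sides still have data and the
        // old side's next prefix sorts first; otherwise pull from the additions.
        bool take_from_old =
            old_has_next && (!add_has_next || next_old < next_add);
        HashPrefix smallest = take_from_old ? next_old : next_add;
        merged[smallest.size()] += smallest;
        if (take_from_old) {
          ++old_counters[smallest.size()];
          old_has_next = GetNextUnmergedPrefix(old_map, old_counters, &next_old);
        } else {
          ++add_counters[smallest.size()];
          add_has_next = GetNextUnmergedPrefix(additions, add_counters, &next_add);
        }
      }
      return merged;
    }

Merging {4: "aaaabbbb"} with {5: "aabbb"} this way yields {4: "aaaabbbb", 5: "aabbb"}, with "aabbb" visited between "aaaa" and "bbbb" in global sorted order.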
 HashPrefix as std::string and so is HashPrefixes (the list of hashes). HashPrefixes is a concatenation of individual HashPrefix. Reason: To save memory by avoiding having a separate pointer for each prefix. 
 Two comments left unresolved. Addressed all other feedback. PTAL. https://codereview.chromium.org/2128173003/diff/20001/components/safe_browsin... File components/safe_browsing_db/v4_store.cc (right): https://codereview.chromium.org/2128173003/diff/20001/components/safe_browsin... components/safe_browsing_db/v4_store.cc:39: UMA_HISTOGRAM_ENUMERATION("SafeBrowsing.V4MergeUpdateResult", result, TODO(vakh): add an entry for this in the histograms.xml file https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... File components/safe_browsing_db/v4_store.cc (right): https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.cc:22: const uint32_t kMinHashPrefixLength = 4; On 2016/07/11 18:09:58, Nathan Parker wrote: > Nit: Would it be easier to read if the code didn't use the word prefix, and just > called things "hashes" (of different length)? Yes, it would be slightly more compact but at the risk of being misleading since these are not hashes (which conveys full hashes). I feel that it is better to be more explicit with this. In fact, I think it might be better to avoid just "hashes" altogether to make it clear whether we mean hash prefixes or full hashes. https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.cc:130: } On 2016/07/11 18:09:58, Nathan Parker wrote: > todo: Do something reasonable if response_type is something else. Done. https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.cc:150: DCHECK(kMinHashPrefixLength <= prefix_size); On 2016/07/11 18:09:58, Nathan Parker wrote: > todo: skip additions with invalid metadata (i.e. more than just dcheck). Done. https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.cc:156: RecordMergeUpdateResult(result); On 2016/07/11 18:09:58, Nathan Parker wrote: > Might as well record successes as well, unless that's recorded elsewhere. Done. https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.cc:196: if (!found || On 2016/07/11 18:09:58, Nathan Parker wrote: > Replace !found with !smallest_prefix, and then you don't need found. Done. https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.cc:223: HashPrefixes& hash_prefixes = hash_prefix_map.at(prefix_size); On 2016/07/11 18:09:58, Nathan Parker wrote: > nit (Feel free to ignore): Might be a bit more readable just all stuck together: > > hash_prefix_map.at(prefix_size).at( > counter_map.at(prefix_size)) This code has changed so NA. https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... File components/safe_browsing_db/v4_store.h (right): https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.h:26: typedef std::unique_ptr<char> HashPrefix; On 2016/07/11 18:09:58, Nathan Parker wrote: > I think we talked about this already, but keeping a pointer to each 4-byte hash > will triple the memory requirement. > > To keep one contiguous block, you could make a container, say VarHashArray, with > a ctor like > VarHashArray(size_t hash_length); > > You could make a () operator that'd return a char* ptr to the x*hash_length'th > byte. 
You could write a VarHashArray::compare(other) that'd take into account > the potentially different hash lengths for the two arrays. You'd probably need > an append method, and an iterator (or just use ints). Hmm, seems like this must > exist. It's almost covered by std::valarray. As discussed offline, using a std::string instead. This will keep the design simple. If, at a later time, we feel the need to use a valarray of objects, we'll do so then. https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.h:33: typedef base::hash_map<PrefixSize, HashPrefixes> HashPrefixMap; On 2016/07/11 18:09:58, Nathan Parker wrote: > I was going to say you should think about using a vector<HashPrefixes>, indexed > by PrefixSize since that'd be faster to iterate through -- it wouldn't have to > hash the size before looking it up. We have the advantage that we know the > index is bounded 4-32. > > But.. it wouldn't be much faster. The hash function for a size_t is a noop (I > looked it up), and then to find the hash bucket it'd have to do a few pointer > de-ref's. So for now I'd say the readability of this outweighs the potential > perf gain. We'll measure it later. Acknowledged. https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.h:189: HashPrefixMap& hash_prefix_map, On 2016/07/11 18:09:58, Nathan Parker wrote: > I think the style guide says output-args should be pointers rather than > references. Done. https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.h:194: static HashPrefixMap GetHashPrefixMapFromAdditions( On 2016/07/11 18:09:58, Nathan Parker wrote: > Make HashPrefixMap a ptr output-arg. Same below. Otherwise it makes temp > copies of them on the stack. Done. https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... File components/safe_browsing_db/v4_store_unittest.cc (right): https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store_unittest.cc:201: EXPECT_EQ(4u, prefix_size); On 2016/07/11 18:09:58, Nathan Parker wrote: > Is this testing that "----" < "-----"? Yes, and more importantly that the order is as expected. 
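As an aside, a rough sketch of the memory trade-off being settled here; the type names are illustrative assumptions, and the figures are generic pointer/allocator sizes rather than measurements from the CL:

    #include <cstddef>
    #include <memory>
    #include <string>
    #include <vector>

    // Earlier draft: one heap block per prefix. For a 4-byte prefix, each entry
    // also pays for an 8-byte pointer plus per-allocation overhead, several
    // times the size of the data itself.
    using OwnedPrefix = std::unique_ptr<char[]>;
    using OwnedPrefixList = std::vector<OwnedPrefix>;

    // Adopted layout: all prefixes of one size concatenated into a single
    // std::string, so N prefixes of size S occupy roughly N * S contiguous bytes.
    using HashPrefixes = std::string;

    // Reading the i-th prefix back out of the concatenated form.
    std::string PrefixAt(const HashPrefixes& prefixes, size_t prefix_size,
                         size_t i) {
      return prefixes.substr(i * prefix_size, prefix_size);
    }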
 Nit: Added a NOTREACHED 
 git fetch and pull 
 Simplified merge logic. +Nit: Fix a failing test 
I ended up commenting on two different versions, but I think it worked out. https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... File components/safe_browsing_db/v4_store.cc (right): https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:21: // The minimum expected size of a hash-prefix. ...size in bytes https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:153: return PREFIX_SIZE_TOO_SMALL_FAILURE; You don't need any of the else's here. https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:161: (*additions_map)[prefix_size] = lumped_hashes; To consider, or add a todo: This copies the string unnecessarily. If you guaranteed that the proto's lived at least as long as the update process, you could store a pointer to the proto's string. https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:171: HashPrefix smallest_prefix; You could use a StringPiece here and within the loop so it doesn't actually copy the substring. https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:184: } else dcheck https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:202: HashPrefixes hash_prefixes = hash_prefix_map.at(prefix_size); This copies the whole string. You could use const HashPrefixes&. https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:211: for (const auto& pair : other_prefixes_map) { Add a comment about how this gets close to ideal, but will leave a bit of space due to deletions. https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:215: HashPrefixes existing_prefixes = (*prefix_map_to_update)[prefix_size]; const HashPrefixes&. Otherwise it makes a copy. https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:225: ReserveSpaceInPrefixMap(old_prefixes_map, &hash_prefix_map_); Is the assumption that hash_prefix_map_ is empty here? Otherwise, it'll add old+additions to current. Maybe making this one call (that takes both as args) could enforce that, by not using the existing capacity. https://codereview.chromium.org/2128173003/diff/120001/components/safe_browsi... File components/safe_browsing_db/v4_store.cc (right): https://codereview.chromium.org/2128173003/diff/120001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:223: void V4Store::MergeUpdate(const HashPrefixMap& old_prefixes_map, Nice, this function is easier to read! https://codereview.chromium.org/2128173003/diff/120001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:246: HashPrefix next_smallest_prefix_old, next_smallest_prefix_additions; I _think_ it's more efficient if you define these outside the loop. Even though it's best practice to minimize the scope as much as possible, the std::string constructor/destructor will be called in every iteration if it's declared inside. Also you could use a StringPiece to avoid copying. https://codereview.chromium.org/2128173003/diff/120001/components/safe_browsi... 
File components/safe_browsing_db/v4_store_unittest.cc (right): https://codereview.chromium.org/2128173003/diff/120001/components/safe_browsi... components/safe_browsing_db/v4_store_unittest.cc:167: EXPECT_EQ("abcde", hash_prefixes.substr(0 * prefix_size, prefix_size)); Seems like you could just test that the whole string is equal to the argument above, rather than checking substrs. There's some value in doing it down below, but here it's a bit odd. https://codereview.chromium.org/2128173003/diff/120001/components/safe_browsi... components/safe_browsing_db/v4_store_unittest.cc:212: V4Store::AddUnlumpedHashes(4, "-----0000054321abcde", &prefix_map); nit: maybe use a string that looks like it's in sections of 4 for the 4-byte hash. 
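For reference, a minimal sketch of the StringPiece idea (illustrative only, using Chromium's base::StringPiece; as discussed below, the CL keeps copying into a HashPrefix for readability):

    #include <cstddef>
    #include <string>

    #include "base/strings/string_piece.h"

    // Views the i-th prefix in the concatenated buffer without copying. The
    // piece is only valid while |hash_prefixes| stays alive and unmodified.
    base::StringPiece PrefixViewAt(const std::string& hash_prefixes,
                                   size_t prefix_size,
                                   size_t i) {
      return base::StringPiece(hash_prefixes.data() + i * prefix_size,
                               prefix_size);
    }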
 Discard update if there's an error in processing it. Add tests. CR feedback. 
 https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... File components/safe_browsing_db/v4_store.cc (right): https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.cc:248: if (found_in_old && found_in_additions) { On 2016/07/11 18:09:58, Nathan Parker wrote: > To remove repeated code here, could you do > > if (found_in_old) > next_smallest_prefix_old = GetNextUnmergedPrefixForSize( > next_size_for_old, old_prefixes_map, old_counter_map); > > if (found_in_additions) > next_smallest_prefix_additions = GetNextUnmergedPrefixForSize( > next_size_for_additions, additions_map, additions_counter_map); > > Then you can de-dup the moving/advancing by computing a variable that says which > set to pull from: > > bool take_from_old = found_in_old && found_in_additions && compare(..); > if (take_from_old) { > //move from old set, increment, get next smallest. > } else { > //move from additions set, increment, get next smallest. > } > > I might s/found_in_old/old_has_next/, and similar for additions. "Found" sounds > like you're searching for something. Or flip it and call it "old_is_empty"? > > > Done. https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... File components/safe_browsing_db/v4_store.cc (right): https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:21: // The minimum expected size of a hash-prefix. On 2016/07/12 20:35:05, Nathan Parker wrote: > ...size in bytes Done. https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:153: return PREFIX_SIZE_TOO_SMALL_FAILURE; On 2016/07/12 20:35:06, Nathan Parker wrote: > You don't need any of the else's here. Done. https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:171: HashPrefix smallest_prefix; On 2016/07/12 20:35:06, Nathan Parker wrote: > You could use a StringPiece here and within the loop so it doesn't actually copy > the substring. That would break the consistency of hash prefixes being HashPrefixes and expose the implementation detail about a hash prefix being a string. These strings are fairly small so copying them isn't worth the sacrifice in readability. https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:184: } On 2016/07/12 20:35:06, Nathan Parker wrote: > else dcheck That's a valid case. For instance, when one map has been merged entirely. In that case, all sized_index values would be outside the valid range for all corresponding HashPrefixes. https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:202: HashPrefixes hash_prefixes = hash_prefix_map.at(prefix_size); On 2016/07/12 20:35:06, Nathan Parker wrote: > This copies the whole string. You could use const HashPrefixes&. Done. https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:211: for (const auto& pair : other_prefixes_map) { On 2016/07/12 20:35:06, Nathan Parker wrote: > Add a comment about how this gets close to ideal, but will leave a bit of space > due to deletions. Updated the comment in header file (consistent with the other functions). https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... 
components/safe_browsing_db/v4_store.cc:215: HashPrefixes existing_prefixes = (*prefix_map_to_update)[prefix_size]; On 2016/07/12 20:35:06, Nathan Parker wrote: > const HashPrefixes&. Otherwise it makes a copy. Done. https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:225: ReserveSpaceInPrefixMap(old_prefixes_map, &hash_prefix_map_); On 2016/07/12 20:35:06, Nathan Parker wrote: > Is the assumption that hash_prefix_map_ is empty here? Otherwise, it'll add > old+additions to current. Maybe making this one call (that takes both as args) > could enforce that, by not using the existing capacity. Good point. However, for code reuse purposes, I think it is better to keep that method as-is. I'm clearing out hash_prefix_map_ explicitly and have added a DCHECK. https://codereview.chromium.org/2128173003/diff/120001/components/safe_browsi... File components/safe_browsing_db/v4_store.cc (right): https://codereview.chromium.org/2128173003/diff/120001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:223: void V4Store::MergeUpdate(const HashPrefixMap& old_prefixes_map, On 2016/07/12 20:35:06, Nathan Parker wrote: > Nice, this function is easier to read! Acknowledged. https://codereview.chromium.org/2128173003/diff/120001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:246: HashPrefix next_smallest_prefix_old, next_smallest_prefix_additions; On 2016/07/12 20:35:06, Nathan Parker wrote: > I _think_ it's more efficient if you define these outside the loop. Even though > it's best practice to minimize the scope as much as possible, the std::string > constructor/destructor will be called in every iteration if it's declared > inside. Done > Also you could use a StringPiece to avoid copying. Please see my other comment. https://codereview.chromium.org/2128173003/diff/120001/components/safe_browsi... File components/safe_browsing_db/v4_store_unittest.cc (right): https://codereview.chromium.org/2128173003/diff/120001/components/safe_browsi... components/safe_browsing_db/v4_store_unittest.cc:167: EXPECT_EQ("abcde", hash_prefixes.substr(0 * prefix_size, prefix_size)); On 2016/07/12 20:35:06, Nathan Parker wrote: > Seems like you could just test that the whole string is equal to the argument > above, rather than checking substrs. There's some value in doing it down below, > but here it's a bit odd. Done. https://codereview.chromium.org/2128173003/diff/120001/components/safe_browsi... components/safe_browsing_db/v4_store_unittest.cc:212: V4Store::AddUnlumpedHashes(4, "-----0000054321abcde", &prefix_map); On 2016/07/12 20:35:06, Nathan Parker wrote: > nit: maybe use a string that looks like it's in sections of 4 for the 4-byte > hash. Good idea but when I do that, the output of 'git cl format' is pretty bad (too much whitespace inserted). 
lgtm w/ two comments to address https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... File components/safe_browsing_db/v4_store.cc (right): https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:161: (*additions_map)[prefix_size] = lumped_hashes; On 2016/07/12 20:35:06, Nathan Parker wrote: > To consider, or add a todo: This copies the string unnecessarily. If you > guaranteed that the proto's lived at least as long as the update process, you > could store a pointer to the proto's string. (Missed this comment) Though, I'm not sure how to implement this easily, since it'd require HashPrefixMap to not own the data in some cases (additions) and own it in others (hash_prefix_map_). Maybe just a TODO for now. This probably only really matters at startup -- For a 2MB list loaded from disk, this'll use an extra 2MB of RAM. https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:171: HashPrefix smallest_prefix; On 2016/07/12 21:57:02, vakh wrote: > On 2016/07/12 20:35:06, Nathan Parker wrote: > > You could use a StringPiece here and within the loop so it doesn't actually > copy > > the substring. > > That would break the consistency of hash prefixes being HashPrefixes and expose > the implementation detail about a hash prefix being a string. > These strings are fairly small so copying them isn't worth the sacrifice in > readability. They are small, but there is a dynamic allocation every time one gets copied. Actually, just pulling it out of the loop would fix most of that, since then it could reuse the same buffer over and over. 
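A small sketch of the "declare it once, assign inside the loop" point (hypothetical function, not the CL's code): assign() reuses the string's existing buffer, so the per-iteration allocation largely disappears after the first pass.

    #include <cstddef>
    #include <string>

    // Returns true if |target| occurs in |hash_prefixes|, a concatenation of
    // same-sized prefixes. |prefix| lives outside the loop so its buffer is
    // reused on every iteration instead of being reallocated and freed.
    bool ContainsPrefix(const std::string& hash_prefixes, size_t prefix_size,
                        const std::string& target) {
      std::string prefix;
      for (size_t offset = 0; offset + prefix_size <= hash_prefixes.size();
           offset += prefix_size) {
        prefix.assign(hash_prefixes, offset, prefix_size);
        if (prefix == target)
          return true;
      }
      return false;
    }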
 Minor: Address comments by nparker@ 
 Add the histogram info to histograms.xml 
 vakh@chromium.org changed reviewers: + rkaplow@chromium.org 
 rkaplow@ -- can you please review changes in histograms.xml? nparker@ -- thanks for the review! https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... File components/safe_browsing_db/v4_store.cc (right): https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:161: (*additions_map)[prefix_size] = lumped_hashes; On 2016/07/12 22:47:12, Nathan Parker wrote: > On 2016/07/12 20:35:06, Nathan Parker wrote: > > To consider, or add a todo: This copies the string unnecessarily. If you > > guaranteed that the proto's lived at least as long as the update process, you > > could store a pointed to the proto's string. > > (Missed this comment) Though, I'm not sure how to implement this easily, since > it'd require HashPrefixMap to not own the data in some cases (additions) and own > it in others (hash_prefix_map_). Maybe just a TODO for now. This probably only > really matters at startup -- For a 2MB list loaded from disk, this'll use a > extra 2MB RAM. Done. https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:171: HashPrefix smallest_prefix; On 2016/07/12 22:47:12, Nathan Parker wrote: > On 2016/07/12 21:57:02, vakh wrote: > > On 2016/07/12 20:35:06, Nathan Parker wrote: > > > You could use a StringPiece here and within the loop so it doesn't actually > > copy > > > the substring. > > > > That would break the consistency of hash prefixes being HashPrefixes and > expose > > the implementation detail about a hash prefix being a string. > > These strings are fairly small so copying them isn't worth the sacrifice in > > readability. > > They are small, but there is a dynamic allocation everytime one gets copied. > Actually, just pulling it out of the loop would fix most of that, since then it > could reuse the same buffer over and over. Done. 
 shess@chromium.org changed reviewers: + shess@chromium.org 
 I find that I don't have the time/energy to really understand/review during the few days I'm around. But I wanted to comment that the existing code tries really hard to do sorted-merge type operations, because a streaming traversal of megabytes of data may be much faster than a multiply-indirect traversal of tens of kilobytes of data. https://codereview.chromium.org/2128173003/diff/180001/components/safe_browsi... File components/safe_browsing_db/v4_store.h (right): https://codereview.chromium.org/2128173003/diff/180001/components/safe_browsi... components/safe_browsing_db/v4_store.h:225: // list is exact. This ignores the soace that would otherwise be released by s/soace/space/ 
 lgtm histogram lg 
 Nit: s/soace/space 
 On 2016/07/13 05:46:08, Scott Hess (OOO Jul 1-Aug 7) wrote: > I find that I don't have the time/energy to really understand/review during the > few days I'm around. But I wanted to comment that the existing code tries > really hard to do sorted-merge type operations, because a streaming traversal of > megabytes of data may be much faster than a multiply-indirect traversal of tens > of kilobytes of data. Scott, I am not sure I understand what you're saying here. I know you're OOO so if and when possible, please try to explain this in more detail (email/CL comment/chat) and I'll address it. Meanwhile, I am going to go ahead and submit this CL. > > https://codereview.chromium.org/2128173003/diff/180001/components/safe_browsi... > File components/safe_browsing_db/v4_store.h (right): > > https://codereview.chromium.org/2128173003/diff/180001/components/safe_browsi... > components/safe_browsing_db/v4_store.h:225: // list is exact. This ignores the > soace that would otherwise be released by > s/soace/space/ Done 
 The CQ bit was checked by vakh@chromium.org 
 The patchset sent to the CQ was uploaded after l-g-t-m from nparker@chromium.org, rkaplow@chromium.org Link to the patchset: https://codereview.chromium.org/2128173003/#ps200001 (title: "Nit: s/soace/space") 
 CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or... 
 
            
              
              
            
             Committed patchset #11 (id:200001) 
 
            
              
              
            
             CQ bit was unchecked. 
 
            
              
              
            
             On 2016/07/13 20:27:02, vakh wrote: > On 2016/07/13 05:46:08, Scott Hess (OOO Jul 1-Aug 7) wrote: > > I find that I don't have the time/energy to really understand/review during > the > > few days I'm around. But I wanted to comment that the existing code tries > > really hard to do sorted-merge type operations, because a streaming traversal > of > > megabytes of data may be much faster than a multiply-indirect traversal of > tens > > of kilobytes of data. > > Scott, I am not sure I understand what you're saying here. > I know you're OOO so if and when possible, please try to explain this in more > detail (email/CL comment/chat) and I'll address it. > Meanwhile, I am going to go ahead and submit this CL. Mostly I mean that it should process two sorted inputs in order into a single sorted output, because jumping around the inputs or output is very expensive. I can see you have various maps in here, but in a superficial review I'm not sure if the maps map to lists of hashes, or to actual hashes. 
 
            
              
              
            
             Description was changed from ========== Merge the hash prefixes from the old store and the additions in the partial update fetched from the PVer4 service to update the hash prefixes in the new store. Design doc: https://goto.google.com/design-doc-v4store BUG=543161 ========== to ========== Merge the hash prefixes from the old store and the additions in the partial update fetched from the PVer4 service to update the hash prefixes in the new store. Design doc: https://goto.google.com/design-doc-v4store BUG=543161 Committed: https://crrev.com/2eb43a6078426caf95b9fd46342e47696bfb4272 Cr-Commit-Position: refs/heads/master@{#405331} ========== 
 
            
              
              
            
             Patchset 11 (id:??) landed as https://crrev.com/2eb43a6078426caf95b9fd46342e47696bfb4272 Cr-Commit-Position: refs/heads/master@{#405331} 
 
            
              
              
            
On 2016/07/13 22:57:05, Scott Hess (OOO Jul 1-Aug 7) wrote: > On 2016/07/13 20:27:02, vakh wrote: > > On 2016/07/13 05:46:08, Scott Hess (OOO Jul 1-Aug 7) wrote: > > > I find that I don't have the time/energy to really understand/review during > > the > > > few days I'm around. But I wanted to comment that the existing code tries > > > really hard to do sorted-merge type operations, because a streaming > traversal > > of > > > megabytes of data may be much faster than a multiply-indirect traversal of > > tens > > > of kilobytes of data. > > > > Scott, I am not sure I understand what you're saying here. > > I know you're OOO so if and when possible, please try to explain this in more > > detail (email/CL comment/chat) and I'll address it. > > Meanwhile, I am going to go ahead and submit this CL. > > Mostly I mean that it should process two sorted inputs in order into a single > sorted output, because jumping around the inputs or output is very expensive. I > can see you have various maps in here, but in a superficial review I'm not sure > if the maps map to lists of hashes, or to actual hashes. The map is of the form: prefix_size: <concatenated list of sorted hashes> The reason I need to jump around in the map (which isn't clear in this CL but will be clear in a future CL) is that the removals sent by the server are indexes into a list that would hold the lexicographically sorted hash prefixes of all sizes. So if the current hash prefixes we have are: 4: ["aaaa", "bbbb"] 5: ["aabbb"] and the removals list is: [1] Then we need to index 1 from sorted(["aaaa", "bbbb"] + ["aabbb"]) i.e. remove "aabbb". The current implementation keeps a counter of what elements we have merged from the existing prefix map so that we can drop the indexes in the removals list on the fly without having to create sorted(["aaaa", "bbbb"] + ["aabbb"]) first. If there's a better way to manage these removals, please let me know. 
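To illustrate the counting scheme described here (removals land in a later CL, so this is only a sketch with assumed names, and it assumes the removal indices arrive sorted ascending):

    #include <cstddef>
    #include <string>
    #include <vector>

    // |old_prefixes| is the old store's prefixes in globally sorted order (the
    // order the merge visits them in); |removals| holds 0-based indices into
    // that order. A running index lets us drop the removed entries on the fly
    // without materializing the combined sorted list first.
    std::vector<std::string> ApplyRemovals(
        const std::vector<std::string>& old_prefixes,
        const std::vector<size_t>& removals) {
      std::vector<std::string> kept;
      size_t next_removal = 0;
      for (size_t index = 0; index < old_prefixes.size(); ++index) {
        if (next_removal < removals.size() && removals[next_removal] == index) {
          ++next_removal;  // the server asked for this entry to be dropped
          continue;
        }
        kept.push_back(old_prefixes[index]);
      }
      return kept;
    }

With the example above, old_prefixes = {"aaaa", "aabbb", "bbbb"} and removals = {1} keeps "aaaa" and "bbbb" and drops "aabbb".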
 
            
              
              
            
             On 2016/07/13 23:10:54, vakh wrote: > On 2016/07/13 22:57:05, Scott Hess (OOO Jul 1-Aug 7) wrote: > > On 2016/07/13 20:27:02, vakh wrote: > > > On 2016/07/13 05:46:08, Scott Hess (OOO Jul 1-Aug 7) wrote: > > > > I find that I don't have the time/energy to really understand/review > during > > > the > > > > few days I'm around. But I wanted to comment that the existing code tries > > > > really hard to do sorted-merge type operations, because a streaming > > traversal > > > of > > > > megabytes of data may be much faster than a multiply-indirect traversal of > > > tens > > > > of kilobytes of data. > > > > > > Scott, I am not sure I understand what you're saying here. > > > I know you're OOO so if and when possible, please try to explain this in > more > > > detail (email/CL comment/chat) and I'll address it. > > > Meanwhile, I am going to go ahead and submit this CL. > > > > Mostly I mean that it should process two sorted inputs in order into a single > > sorted output, because jumping around the inputs or output is very expensive. > I > > can see you have various maps in here, but in a superficial review I'm not > sure > > if the maps map to lists of hashes, or to actual hashes. > > The map of the form: > prefix_size: <concatenated list of sorted hashes>" > > The reason I need to jump around in the map (which isn't clear in this CL but > will be clear in a future CL) is that the removals sent by the server are > indexes into a list that would hold the lexographically sorted hash prefixes of > all sizes. > So if the current hash prefixes we have are: > 4: ["aaaa", "bbbb"] > 5: ["aabbb"] > > and the removals list is: > [1] > > Then we need to index 1 from sorted(["aaaa", "bbbb"] + ["aabbb"]) i.e. remove > "aabbb". > > The current implementation keeps a counter of what elements we have merged from > the existing prefix map so that we can drop the indexes in removals list on the > fly without having to create sorted(["aaaa", "bbbb"] + ["aabbb"]) first. > If there's a better way to manage these removals, please let me know. I know this is maybe late to the game, and I don't understand why I didn't pick up on it earlier - but maybe my core concern is simply a complaint about variable prefix sizes. Having 32-bit prefixes and 32-byte hashes made things pretty clean, resulting in efficient code that was reasonably amenable to inspection. Having a more open-ended number of bins makes it harder to grok. Additionally, from what I recall when I did distribution analysis when replacing the bloom filter, it seemed like 32-bit prefixes were close to the sweet spot for sets containing between 100k and a few million items. More bits wouldn't move the needle for collisions, and fewer bits only provided savings for less than a million items. The only time having a mixture of prefix sizes will help is if the set size is within a few bits of a magic number (like between 2^21 and 2^27, say), so you could vary between 24 bits and 32 bits on demand, but I'm not sure that's something worth optimizing for. Even if you did, though, I don't think there would ever be value in sending data spanning more than an 8-bit divide (say 24-bit, 32-bit, and 40-bit). Apologies if I grossly misunderstand things. 
 
            
              
              
            
             On 2016/08/10 17:45:53, Scott Hess wrote: > On 2016/07/13 23:10:54, vakh wrote: > > On 2016/07/13 22:57:05, Scott Hess (OOO Jul 1-Aug 7) wrote: > > > On 2016/07/13 20:27:02, vakh wrote: > > > > On 2016/07/13 05:46:08, Scott Hess (OOO Jul 1-Aug 7) wrote: > > > > > I find that I don't have the time/energy to really understand/review > > during > > > > the > > > > > few days I'm around. But I wanted to comment that the existing code > tries > > > > > really hard to do sorted-merge type operations, because a streaming > > > traversal > > > > of > > > > > megabytes of data may be much faster than a multiply-indirect traversal > of > > > > tens > > > > > of kilobytes of data. > > > > > > > > Scott, I am not sure I understand what you're saying here. > > > > I know you're OOO so if and when possible, please try to explain this in > > more > > > > detail (email/CL comment/chat) and I'll address it. > > > > Meanwhile, I am going to go ahead and submit this CL. > > > > > > Mostly I mean that it should process two sorted inputs in order into a > single > > > sorted output, because jumping around the inputs or output is very > expensive. > > I > > > can see you have various maps in here, but in a superficial review I'm not > > sure > > > if the maps map to lists of hashes, or to actual hashes. > > > > The map of the form: > > prefix_size: <concatenated list of sorted hashes>" > > > > The reason I need to jump around in the map (which isn't clear in this CL but > > will be clear in a future CL) is that the removals sent by the server are > > indexes into a list that would hold the lexographically sorted hash prefixes > of > > all sizes. > > So if the current hash prefixes we have are: > > 4: ["aaaa", "bbbb"] > > 5: ["aabbb"] > > > > and the removals list is: > > [1] > > > > Then we need to index 1 from sorted(["aaaa", "bbbb"] + ["aabbb"]) i.e. remove > > "aabbb". > > > > The current implementation keeps a counter of what elements we have merged > from > > the existing prefix map so that we can drop the indexes in removals list on > the > > fly without having to create sorted(["aaaa", "bbbb"] + ["aabbb"]) first. > > If there's a better way to manage these removals, please let me know. > > I know this is maybe late to the game, and I don't understand why I didn't pick > up on it earlier - but maybe my core concern is simply a complaint about > variable prefix sizes. Having 32-bit prefixes and 32-byte hashes made things > pretty clean, resulting in efficient code that was reasonably amenable to > inspection. Having a more open-ended number of bins makes it harder to grok. > Additionally, from what I recall when I did distribution analysis when replacing > the bloom filter, it seemed like 32-bit prefixes were close to the sweet spot > for sets containing between 100k and a few million items. More bits wouldn't > move the needle for collisions, and fewer bits only provided savings for less > than a million items. The only time having a mixture of prefix sizes will help > is if the set size is within a few bits of a magic number (like between 2^21 and > 2^27, say), so you could vary between 24 bits and 32 bits on demand, but I'm not > sure that's something worth optimizing for. Even if you did, though, I don't > think there would ever be value in sending data spanning more than an 8-bit > divide (say 24-bit, 32-bit, and 40-bit). > > Apologies if I grossly misunderstand things. It isn't too late, please feel free to look at other CLs that I submitted while you were away and review those too. 
I welcome and appreciate your feedback/reviews. I am not sure I understand your comment completely but the core of it seems to be: why are we using different-sized prefixes? The reason is that the SafeBrowsing service now supports different-sized prefixes. These are used to prevent collisions between a bad website and a hugely popular website. You probably understand this better than me, but to illustrate with an example, let's assume: 1. example.com/one is a bad URL with hash: aabbccddeeeeeeee 2. google.com with hash: aabbccddgggggggg They have a common 32-bit hash-prefix: aabbccdd This means if we only support 32-bit hash prefixes, every time a user goes to google.com, it would lead to a full hash request to the SafeBrowsing service. By supporting 5-byte hash prefixes, we blacklist: aabbccddee and do not send full hash requests for aabbccddgg 
 
            
              
              
            
             On 2016/08/10 17:58:50, vakh wrote: > I am not sure I understand your comment completely but the core of it seems be: > why are we using different-sized prefixes? > The reason is that the SafeBrowsing service now supports different sized > prefixes. > These are used to prevent collisions between a bad website and a hugely popular > website. > You probably understand this better than me, but to illustrate with an example, > let's assume: > 1. example.com/one is a bad URL with hash: aabbccddeeeeeeee > 2. http://google.com with hash: aabbccddgggggggg > > They have a common 32-bit hash-prefix: aabbccdd > This means if we only support 32-bit hash prefixes, every time a user goes to > http://google.com, it would lead to a full hash request to SafeBrowsing service. > > By supporting 5-byte hash prefixes, we blacklist: aabbccddee and do not send > full hash requests for aabbccddgg My point is that for a set of ~2M items using 32-bit prefixes, there should be only a handful of collisions, so the advantage of having a 40-bit prefix for that case over a 32-byte full hash is pretty trivial. If you had ~2M items using 24-bit prefixes, there should be a fair number of collisions, so being able to send both 24-bit and 32-bit prefixes could save some bandwidth. If the set was well-sorted, though, the initial download wouldn't be much impacted (due to compression opportunities), and if prefixes waffle between 24-bit and 32-bit during updates you could easily eat up the savings in overhead. In any case, moving from 24-bit to 32-bit should reduce collisions by 1/256, at which point it's unlikely that adding 40-bit prefixes would be a material improvement. Mostly where I'm going is that it's probably not worthwhile to make the code complicated to save ~1% in bandwidth. Having a map from prefix size to lists feels complicated to me, because it makes it really easy for minor errors to have major impacts. As a for-instance of the kind of thing I worry about, you have a map full of iterators with the same signatures from different containers, so you could imagine cases comparing across containers resulting in running off the end of the container, but maybe due to statistics of hashing it only happens infrequently so the case kicks up based on a data change, rather than a code change (so suddenly all channels start crashing). [It's possible the server side has already set everything in stone for you. I've often worried about whether Chrome team gave them enough pushback about offloading processing and memory requirements onto clients, given that there are billions of clients.] 
 
            
              
 
            
              
              
            
             On 2016/08/10 18:22:15, Scott Hess wrote: > On 2016/08/10 17:58:50, vakh wrote: > > I am not sure I understand your comment completely but the core of it seems > be: > > why are we using different-sized prefixes? > > The reason is that the SafeBrowsing service now supports different sized > > prefixes. > > These are used to prevent collisions between a bad website and a hugely > popular > > website. > > You probably understand this better than me, but to illustrate with an > example, > > let's assume: > > 1. example.com/one is a bad URL with hash: aabbccddeeeeeeee > > 2. http://google.com with hash: aabbccddgggggggg > > > > They have a common 32-bit hash-prefix: aabbccdd > > This means if we only support 32-bit hash prefixes, every time a user goes to > > http://google.com, it would lead to a full hash request to SafeBrowsing > service. > > > > By supporting 5-byte hash prefixes, we blacklist: aabbccddee and do not send > > full hash requests for aabbccddgg > > My point is that for a set of ~2M items using 32-bit prefixes, there should be > only a handful of collisions, so the advantage of having a 40-bit prefix for > that case over a 32-byte full hash is pretty trivial. > > If you had ~2M items using 24-bit prefixes, there should be a fair number of > collisions, so being able to send both 24-bit and 32-bit prefixes could save > some bandwidth. If the set was well-sorted, though, the initial download > wouldn't be much impacted (due to compression opportunities), and if prefixes > waffle between 24-bit and 32-bit during updates you could easily eat up the > savings in overhead. In any case, moving from 24-bit to 32-bit should reduce > collisions by 1/256, at which point it's unlikely that adding 40-bit prefixes > would be a material improvement. > > Mostly where I'm going is that it's probably not worthwhile to make the code > complicated to save ~1% in bandwidth. Having a map from prefix size to lists > feels complicated to me I know the new design is slightly more complex to understand at first due to variable sized hash prefixes, but to be honest, I find it easier to understand overall than the PVer2/3 implementation. Part of it is because you helped keep it simple by avoiding use of locks etc and partly because the data structures used (unordered_maps) are ones used very commonly (unlike BloomFilters/PrefixSet). I would say that the increase in complexity of code is subjective. > because it makes it really easy for minor errors to > have major impacts. With the kind of user base Chrome has, this is true of any errors. > As a for-instance of the kind of thing I worry about, you > have a map full of iterators with the same signatures from different containers, > so you could imagine cases comparing across containers resulting in running off > the end of the container, but maybe due to statistics of hashing it only happens > infrequently so the case kicks up based on a data change, rather than a code > change (so suddenly all channels start crashing). I agree that any condition that can trigger based on data change can start causing crashes anytime and should be avoided. At the same time, I think that's why having unit tests and rolling out the change slowly is the right way to go about this launch, which is what I plan to do. As for the particular case of map of iterators, it is always initialized just before use and has a short lifetime so I don't expect it to go into a weird state ever. Also, there are a number of unit tests for that piece of code. 
> > [It's possible the server side has already set everything in stone for you. > I've often worried about whether Chrome team gave them enough pushback about > offloading processing and memory requirements onto clients, given that there are > billions of clients.] That's a fair point, but it is something that can be discussed for the next iteration of the server implementation. As the service works currently, this is the only option for Chrome [other than to not adopt the new server implementation].
