Chromium Code Reviews

Issue 2128173003: Merge the hash prefixes from the old store and the additions in the partial (Closed)
Created: 4 years, 5 months ago by vakh (use Gerrit instead) | Modified: 4 years, 4 months ago | CC: chromium-reviews, palmer, noé, woz | Base URL: https://chromium.googlesource.com/chromium/src.git@master | Target Ref: refs/pending/heads/master | Project: chromium | Visibility: Public

Description: Merge the hash prefixes from the old store and the additions in the partial
update fetched from the PVer4 service to update the hash prefixes in the new
store.
Design doc: https://goto.google.com/design-doc-v4store
BUG=543161
Committed: https://crrev.com/2eb43a6078426caf95b9fd46342e47696bfb4272
Cr-Commit-Position: refs/heads/master@{#405331}
Patch Set 1
Patch Set 2: git fetch and pull
  Total comments: 1
Patch Set 3: Minor changes to the test inputs
  Total comments: 25
Patch Set 4: HashPrefix as std::string and so is HashPrefixes (the list of hashes). HashPrefixes is a concatenat…
Patch Set 5: Nit: Added a NOTREACHED
Patch Set 6: git fetch and pull
  Total comments: 21
Patch Set 7: Simplified merge logic. +Nit: Fix a failing test
  Total comments: 8
Patch Set 8: Discard update if there's an error in processing it. Add tests. CR feedback.
Patch Set 9: Minor: Address comments by nparker@
Patch Set 10: Add the histogram info to histograms.xml
  Total comments: 1
Patch Set 11: Nit: s/soace/space

Dependent Patchsets:

Messages
Total messages: 37 (6 generated)
     
 vakh@chromium.org changed reviewers: + nparker@chromium.org 
 nparker: This isn't ready for a full review yet but it might be worth taking a quick look to understand the merge algorithm. Removals are TODO. shess, palmer: just FYI for now. Feel free to review any time. 
 git fetch and pull 
 Minor changes to the test inputs 
 https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... File components/safe_browsing_db/v4_store.cc (right): https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.cc:22: const uint32_t kMinHashPrefixLength = 4; Nit: Would it be easier to read if the code didn't use the word prefix, and just called things "hashes" (of different length)? https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.cc:130: } todo: Do something reasonable if response_type is something else. https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.cc:150: DCHECK(kMinHashPrefixLength <= prefix_size); todo: skip additions with invalid metadata (i.e. more than just dcheck). https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.cc:156: RecordMergeUpdateResult(result); Might as well record successes as well, unless that's recorded elsewhere. https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.cc:196: if (!found || Replace !found with !smallest_prefix, and then you don't need found. https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.cc:223: HashPrefixes& hash_prefixes = hash_prefix_map.at(prefix_size); nit (Feel free to ignore): Might be a bit more readable just all stuck together: hash_prefix_map.at(prefix_size).at( counter_map.at(prefix_size)) https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.cc:248: if (found_in_old && found_in_additions) { To remove repeated code here, could you do if (found_in_old) next_smallest_prefix_old = GetNextUnmergedPrefixForSize( next_size_for_old, old_prefixes_map, old_counter_map); if (found_in_additions) next_smallest_prefix_additions = GetNextUnmergedPrefixForSize( next_size_for_additions, additions_map, additions_counter_map); Then you can de-dup the moving/advancing by computing a variable that says which set to pull from: bool take_from_old = found_in_old && found_in_additions && compare(..); if (take_from_old) { //move from old set, increment, get next smallest. } else { //move from additions set, increment, get next smallest. } I might s/found_in_old/old_has_next/, and similar for additions. "Found" sounds like you're searching for something. Or flip it and call it "old_is_empty"? https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... File components/safe_browsing_db/v4_store.h (right): https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.h:26: typedef std::unique_ptr<char> HashPrefix; I think we talked about this already, but keeping a pointer to each 4-byte hash will triple the memory requirement. To keep one contiguous block, you could make a container, say VarHashArray, with a ctor like VarHashArray(size_t hash_length); You could make a () operator that'd return a char* ptr to the x*hash_length'th byte. You could write a VarHashArray::compare(other) that'd take into account the potentially different hash lengths for the two arrays. You'd probably need an append method, and an iterator (or just use ints). Hmm, seems like this must exist. It's almost covered by std::valarray. 
https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.h:33: typedef base::hash_map<PrefixSize, HashPrefixes> HashPrefixMap; I was going to say you should think about using a vector<HashPrefixes>, indexed by PrefixSize since that'd be faster to iterate through -- it wouldn't have to hash the size before looking it up. We have the advantage that we know the index is bounded 4-32. But.. it wouldn't be much faster. The hash function for a size_t is a noop (I looked it up), and then to find the hash bucket it'd have to do a few pointer de-ref's. So for now I'd say the readability of this outweighs the potential perf gain. We'll measure it later. https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.h:189: HashPrefixMap& hash_prefix_map, I think the style guide says output-args should be pointers rather than references. https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.h:194: static HashPrefixMap GetHashPrefixMapFromAdditions( Make HashPrefixMap a ptr output-arg. Same below. Otherwise it makes temp copies of them on the stack. https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... File components/safe_browsing_db/v4_store_unittest.cc (right): https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store_unittest.cc:201: EXPECT_EQ(4u, prefix_size); Is this testing that "----" < "-----"? https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store_unittest.cc:245: } We should test all the branches of the merge code. Some more cases: 1) old or additions has zero hashes of a particular size 2) old list runs out (merges the lexicographically last element) first 3) additions list runs out first 4) additions has an identical entry. This shouldn't happen, but we should have some defined behavior for it. 
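For reference, a self-contained sketch of the "take_from_old" merge shape suggested in the v4_store.cc:248 comment above. The names (GetNextUnmergedPrefix, MergeUpdate, the counter maps) and the std::string-based typedefs are illustrative assumptions that roughly track the later patch sets, not the CL's actual code:

    // Illustrative only. A HashPrefixMap maps a prefix size to a sorted,
    // concatenated string of prefixes of that size; a CounterMap tracks how
    // many prefixes of each size have already been merged.
    #include <cstddef>
    #include <map>
    #include <string>

    using PrefixSize = size_t;
    using HashPrefix = std::string;
    using HashPrefixes = std::string;  // concatenation of same-sized prefixes
    using HashPrefixMap = std::map<PrefixSize, HashPrefixes>;
    using CounterMap = std::map<PrefixSize, size_t>;

    // Finds the lexicographically smallest not-yet-merged prefix across all
    // sizes. Returns false when |prefix_map| is exhausted.
    bool GetNextUnmergedPrefix(const HashPrefixMap& prefix_map,
                               const CounterMap& counters,
                               HashPrefix* prefix) {
      bool found = false;
      for (const auto& pair : prefix_map) {
        PrefixSize size = pair.first;
        size_t offset = counters.at(size) * size;
        if (offset >= pair.second.size())
          continue;  // every prefix of this size has already been merged
        HashPrefix candidate = pair.second.substr(offset, size);
        if (!found || candidate < *prefix) {
          *prefix = candidate;
          found = true;
        }
      }
      return found;
    }

    HashPrefixMap MergeUpdate(const HashPrefixMap& old_map,
                              const HashPrefixMap& additions) {
      HashPrefixMap merged;
      CounterMap old_counters, add_counters;
      for (const auto& pair : old_map) old_counters[pair.first] = 0;
      for (const auto& pair : additions) add_counters[pair.first] = 0;

      HashPrefix next_old, next_add;
      bool old_has_next = GetNextUnmergedPrefix(old_map, old_counters, &next_old);
      bool add_has_next = GetNextUnmergedPrefix(additions, add_counters, &next_add);
      while (old_has_next || add_has_next) {
        // Pull from the old store only when both sides still have data and the
        // old side's next prefix sorts first; otherwise pull from the additions.
        bool take_from_old =
            old_has_next && (!add_has_next || next_old < next_add);
        HashPrefix smallest = take_from_old ? next_old : next_add;
        merged[smallest.size()] += smallest;
        if (take_from_old) {
          ++old_counters[smallest.size()];
          old_has_next = GetNextUnmergedPrefix(old_map, old_counters, &next_old);
        } else {
          ++add_counters[smallest.size()];
          add_has_next = GetNextUnmergedPrefix(additions, add_counters, &next_add);
        }
      }
      return merged;
    }

Merging {4: "aaaabbbb"} with {5: "aabbb"} this way yields {4: "aaaabbbb", 5: "aabbb"}, with "aabbb" visited between "aaaa" and "bbbb" in global sorted order.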
 HashPrefix as std::string and so is HashPrefixes (the list of hashes). HashPrefixes is a concatenation of individual HashPrefix. Reason: To save memory by avoiding having a separate pointer for each prefix. 
 Two comments left unresolved. Addressed all other feedback. PTAL. https://codereview.chromium.org/2128173003/diff/20001/components/safe_browsin... File components/safe_browsing_db/v4_store.cc (right): https://codereview.chromium.org/2128173003/diff/20001/components/safe_browsin... components/safe_browsing_db/v4_store.cc:39: UMA_HISTOGRAM_ENUMERATION("SafeBrowsing.V4MergeUpdateResult", result, TODO(vakh): add an entry for this in the histograms.xml file https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... File components/safe_browsing_db/v4_store.cc (right): https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.cc:22: const uint32_t kMinHashPrefixLength = 4; On 2016/07/11 18:09:58, Nathan Parker wrote: > Nit: Would it be easier to read if the code didn't use the word prefix, and just > called things "hashes" (of different length)? Yes, it would be slightly more compact but at the risk of being misleading since these are not hashes (which conveys full hashes). I feel that it is better to be more explicit with this. In fact, I think it might be better to avoid just "hashes" altogether to make it clear whether we mean hash prefixes or full hashes. https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.cc:130: } On 2016/07/11 18:09:58, Nathan Parker wrote: > todo: Do something reasonable if response_type is something else. Done. https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.cc:150: DCHECK(kMinHashPrefixLength <= prefix_size); On 2016/07/11 18:09:58, Nathan Parker wrote: > todo: skip additions with invalid metadata (i.e. more than just dcheck). Done. https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.cc:156: RecordMergeUpdateResult(result); On 2016/07/11 18:09:58, Nathan Parker wrote: > Might as well record successes as well, unless that's recorded elsewhere. Done. https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.cc:196: if (!found || On 2016/07/11 18:09:58, Nathan Parker wrote: > Replace !found with !smallest_prefix, and then you don't need found. Done. https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.cc:223: HashPrefixes& hash_prefixes = hash_prefix_map.at(prefix_size); On 2016/07/11 18:09:58, Nathan Parker wrote: > nit (Feel free to ignore): Might be a bit more readable just all stuck together: > > hash_prefix_map.at(prefix_size).at( > counter_map.at(prefix_size)) This code has changed so NA. https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... File components/safe_browsing_db/v4_store.h (right): https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.h:26: typedef std::unique_ptr<char> HashPrefix; On 2016/07/11 18:09:58, Nathan Parker wrote: > I think we talked about this already, but keeping a pointer to each 4-byte hash > will triple the memory requirement. > > To keep one contiguous block, you could make a container, say VarHashArray, with > a ctor like > VarHashArray(size_t hash_length); > > You could make a () operator that'd return a char* ptr to the x*hash_length'th > byte. 
You could write a VarHashArray::compare(other) that'd take into account > the potentially different hash lengths for the two arrays. You'd probably need > an append method, and an iterator (or just use ints). Hmm, seems like this must > exist. It's almost covered by std::valarray. As discussed offline, using a std::string instead. This will keep the design simple. If, at a later time, we feel the need to use a valarray of objects, we'll do so then. https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.h:33: typedef base::hash_map<PrefixSize, HashPrefixes> HashPrefixMap; On 2016/07/11 18:09:58, Nathan Parker wrote: > I was going to say you should think about using a vector<HashPrefixes>, indexed > by PrefixSize since that'd be faster to iterate through -- it wouldn't have to > hash the size before looking it up. We have the advantage that we know the > index is bounded 4-32. > > But.. it wouldn't be much faster. The hash function for a size_t is a noop (I > looked it up), and then to find the hash bucket it'd have to do a few pointer > de-ref's. So for now I'd say the readability of this outweighs the potential > perf gain. We'll measure it later. Acknowledged. https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.h:189: HashPrefixMap& hash_prefix_map, On 2016/07/11 18:09:58, Nathan Parker wrote: > I think the style guide says output-args should be pointers rather than > references. Done. https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.h:194: static HashPrefixMap GetHashPrefixMapFromAdditions( On 2016/07/11 18:09:58, Nathan Parker wrote: > Make HashPrefixMap a ptr output-arg. Same below. Otherwise it makes temp > copies of them on the stack. Done. https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... File components/safe_browsing_db/v4_store_unittest.cc (right): https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store_unittest.cc:201: EXPECT_EQ(4u, prefix_size); On 2016/07/11 18:09:58, Nathan Parker wrote: > Is this testing that "----" < "-----"? Yes, and more importantly that the order is as expected. 
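As an aside, a rough sketch of the memory trade-off being settled here; the type names are illustrative assumptions, and the figures are generic pointer/allocator sizes rather than measurements from the CL:

    #include <cstddef>
    #include <memory>
    #include <string>
    #include <vector>

    // Earlier draft: one heap block per prefix. For a 4-byte prefix, each entry
    // also pays for an 8-byte pointer plus per-allocation overhead, several
    // times the size of the data itself.
    using OwnedPrefix = std::unique_ptr<char[]>;
    using OwnedPrefixList = std::vector<OwnedPrefix>;

    // Adopted layout: all prefixes of one size concatenated into a single
    // std::string, so N prefixes of size S occupy roughly N * S contiguous bytes.
    using HashPrefixes = std::string;

    // Reading the i-th prefix back out of the concatenated form.
    std::string PrefixAt(const HashPrefixes& prefixes, size_t prefix_size,
                         size_t i) {
      return prefixes.substr(i * prefix_size, prefix_size);
    }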
 Nit: Added a NOTREACHED 
 git fetch and pull 
 Simplified merge logic. +Nit: Fix a failing test 
I ended up commenting on two different versions, but I think it worked out. https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... File components/safe_browsing_db/v4_store.cc (right): https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:21: // The minimum expected size of a hash-prefix. ...size in bytes https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:153: return PREFIX_SIZE_TOO_SMALL_FAILURE; You don't need any of the else's here. https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:161: (*additions_map)[prefix_size] = lumped_hashes; To consider, or add a todo: This copies the string unnecessarily. If you guaranteed that the proto's lived at least as long as the update process, you could store a pointer to the proto's string. https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:171: HashPrefix smallest_prefix; You could use a StringPiece here and within the loop so it doesn't actually copy the substring. https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:184: } else dcheck https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:202: HashPrefixes hash_prefixes = hash_prefix_map.at(prefix_size); This copies the whole string. You could use const HashPrefixes&. https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:211: for (const auto& pair : other_prefixes_map) { Add a comment about how this gets close to ideal, but will leave a bit of space due to deletions. https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:215: HashPrefixes existing_prefixes = (*prefix_map_to_update)[prefix_size]; const HashPrefixes&. Otherwise it makes a copy. https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:225: ReserveSpaceInPrefixMap(old_prefixes_map, &hash_prefix_map_); Is the assumption that hash_prefix_map_ is empty here? Otherwise, it'll add old+additions to current. Maybe making this one call (that takes both as args) could enforce that, by not using the existing capacity. https://codereview.chromium.org/2128173003/diff/120001/components/safe_browsi... File components/safe_browsing_db/v4_store.cc (right): https://codereview.chromium.org/2128173003/diff/120001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:223: void V4Store::MergeUpdate(const HashPrefixMap& old_prefixes_map, Nice, this function is easier to read! https://codereview.chromium.org/2128173003/diff/120001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:246: HashPrefix next_smallest_prefix_old, next_smallest_prefix_additions; I _think_ it's more efficient if you define these outside the loop. Even though it's best practice to minimize the scope as much as possible, the std::string constructor/destructor will be called in every iteration if it's declared inside. Also you could use a StringPiece to avoid copying. https://codereview.chromium.org/2128173003/diff/120001/components/safe_browsi... 
File components/safe_browsing_db/v4_store_unittest.cc (right): https://codereview.chromium.org/2128173003/diff/120001/components/safe_browsi... components/safe_browsing_db/v4_store_unittest.cc:167: EXPECT_EQ("abcde", hash_prefixes.substr(0 * prefix_size, prefix_size)); Seems like you could just test that the whole string is equal to the argument above, rather than checking substrs. There's some value in doing it down below, but here it's a bit odd. https://codereview.chromium.org/2128173003/diff/120001/components/safe_browsi... components/safe_browsing_db/v4_store_unittest.cc:212: V4Store::AddUnlumpedHashes(4, "-----0000054321abcde", &prefix_map); nit: maybe use a string that looks like it's in sections of 4 for the 4-byte hash. 
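For reference, a minimal sketch of the StringPiece idea (illustrative only, using Chromium's base::StringPiece; as discussed below, the CL keeps copying into a HashPrefix for readability):

    #include <cstddef>
    #include <string>

    #include "base/strings/string_piece.h"

    // Views the i-th prefix in the concatenated buffer without copying. The
    // piece is only valid while |hash_prefixes| stays alive and unmodified.
    base::StringPiece PrefixViewAt(const std::string& hash_prefixes,
                                   size_t prefix_size,
                                   size_t i) {
      return base::StringPiece(hash_prefixes.data() + i * prefix_size,
                               prefix_size);
    }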
 Discard update if there's an error in processing it. Add tests. CR feedback. 
 https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... File components/safe_browsing_db/v4_store.cc (right): https://codereview.chromium.org/2128173003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_store.cc:248: if (found_in_old && found_in_additions) { On 2016/07/11 18:09:58, Nathan Parker wrote: > To remove repeated code here, could you do > > if (found_in_old) > next_smallest_prefix_old = GetNextUnmergedPrefixForSize( > next_size_for_old, old_prefixes_map, old_counter_map); > > if (found_in_additions) > next_smallest_prefix_additions = GetNextUnmergedPrefixForSize( > next_size_for_additions, additions_map, additions_counter_map); > > Then you can de-dup the moving/advancing by computing a variable that says which > set to pull from: > > bool take_from_old = found_in_old && found_in_additions && compare(..); > if (take_from_old) { > //move from old set, increment, get next smallest. > } else { > //move from additions set, increment, get next smallest. > } > > I might s/found_in_old/old_has_next/, and similar for additions. "Found" sounds > like you're searching for something. Or flip it and call it "old_is_empty"? > > > Done. https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... File components/safe_browsing_db/v4_store.cc (right): https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:21: // The minimum expected size of a hash-prefix. On 2016/07/12 20:35:05, Nathan Parker wrote: > ...size in bytes Done. https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:153: return PREFIX_SIZE_TOO_SMALL_FAILURE; On 2016/07/12 20:35:06, Nathan Parker wrote: > You don't need any of the else's here. Done. https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:171: HashPrefix smallest_prefix; On 2016/07/12 20:35:06, Nathan Parker wrote: > You could use a StringPiece here and within the loop so it doesn't actually copy > the substring. That would break the consistency of hash prefixes being HashPrefixes and expose the implementation detail about a hash prefix being a string. These strings are fairly small so copying them isn't worth the sacrifice in readability. https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:184: } On 2016/07/12 20:35:06, Nathan Parker wrote: > else dcheck That's a valid case. For instance, when one map has been merged entirely. In that case, all sized_index values would be outside the valid range for all corresponding HashPrefixes. https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:202: HashPrefixes hash_prefixes = hash_prefix_map.at(prefix_size); On 2016/07/12 20:35:06, Nathan Parker wrote: > This copies the whole string. You could use const HashPrefixes&. Done. https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:211: for (const auto& pair : other_prefixes_map) { On 2016/07/12 20:35:06, Nathan Parker wrote: > Add a comment about how this gets close to ideal, but will leave a bit of space > due to deletions. Updated the comment in header file (consistent with the other functions). https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... 
components/safe_browsing_db/v4_store.cc:215: HashPrefixes existing_prefixes = (*prefix_map_to_update)[prefix_size]; On 2016/07/12 20:35:06, Nathan Parker wrote: > const HashPrefixes&. Otherwise it makes a copy. Done. https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:225: ReserveSpaceInPrefixMap(old_prefixes_map, &hash_prefix_map_); On 2016/07/12 20:35:06, Nathan Parker wrote: > Is the assumption that hash_prefix_map_ is empty here? Otherwise, it'll add > old+additions to current. Maybe making this one call (that takes both as args) > could enforce that, by not using the existing capacity. Good point. However, for code reuse purposes, I think it is better to keep that method as-is. I'm clearing out hash_prefix_map_ explicitly and have added a DCHECK. https://codereview.chromium.org/2128173003/diff/120001/components/safe_browsi... File components/safe_browsing_db/v4_store.cc (right): https://codereview.chromium.org/2128173003/diff/120001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:223: void V4Store::MergeUpdate(const HashPrefixMap& old_prefixes_map, On 2016/07/12 20:35:06, Nathan Parker wrote: > Nice, this function is easier to read! Acknowledged. https://codereview.chromium.org/2128173003/diff/120001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:246: HashPrefix next_smallest_prefix_old, next_smallest_prefix_additions; On 2016/07/12 20:35:06, Nathan Parker wrote: > I _think_ it's more efficient if you define these outside the loop. Even though > it's best practice to minimize the scope as much as possible, the std::string > constructor/destructor will be called in every iteration if it's declared > inside. Done > Also you could use a StringPiece to avoid copying. Please see my other comment. https://codereview.chromium.org/2128173003/diff/120001/components/safe_browsi... File components/safe_browsing_db/v4_store_unittest.cc (right): https://codereview.chromium.org/2128173003/diff/120001/components/safe_browsi... components/safe_browsing_db/v4_store_unittest.cc:167: EXPECT_EQ("abcde", hash_prefixes.substr(0 * prefix_size, prefix_size)); On 2016/07/12 20:35:06, Nathan Parker wrote: > Seems like you could just test that the whole string is equal to the argument > above, rather than checking substrs. There's some value in doing it down below, > but here it's a bit odd. Done. https://codereview.chromium.org/2128173003/diff/120001/components/safe_browsi... components/safe_browsing_db/v4_store_unittest.cc:212: V4Store::AddUnlumpedHashes(4, "-----0000054321abcde", &prefix_map); On 2016/07/12 20:35:06, Nathan Parker wrote: > nit: maybe use a string that looks like it's in sections of 4 for the 4-byte > hash. Good idea but when I do that, the output of 'git cl format' is pretty bad (too much whitespace inserted). 
lgtm w/ two comments to address https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... File components/safe_browsing_db/v4_store.cc (right): https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:161: (*additions_map)[prefix_size] = lumped_hashes; On 2016/07/12 20:35:06, Nathan Parker wrote: > To consider, or add a todo: This copies the string unnecessarily. If you > guaranteed that the proto's lived at least as long as the update process, you > could store a pointer to the proto's string. (Missed this comment) Though, I'm not sure how to implement this easily, since it'd require HashPrefixMap to not own the data in some cases (additions) and own it in others (hash_prefix_map_). Maybe just a TODO for now. This probably only really matters at startup -- For a 2MB list loaded from disk, this'll use an extra 2MB of RAM. https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:171: HashPrefix smallest_prefix; On 2016/07/12 21:57:02, vakh wrote: > On 2016/07/12 20:35:06, Nathan Parker wrote: > > You could use a StringPiece here and within the loop so it doesn't actually > copy > > the substring. > > That would break the consistency of hash prefixes being HashPrefixes and expose > the implementation detail about a hash prefix being a string. > These strings are fairly small so copying them isn't worth the sacrifice in > readability. They are small, but there is a dynamic allocation every time one gets copied. Actually, just pulling it out of the loop would fix most of that, since then it could reuse the same buffer over and over. 
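A small sketch of the "declare it once, assign inside the loop" point (hypothetical function, not the CL's code): assign() reuses the string's existing buffer, so the per-iteration allocation largely disappears after the first pass.

    #include <cstddef>
    #include <string>

    // Returns true if |target| occurs in |hash_prefixes|, a concatenation of
    // same-sized prefixes. |prefix| lives outside the loop so its buffer is
    // reused on every iteration instead of being reallocated and freed.
    bool ContainsPrefix(const std::string& hash_prefixes, size_t prefix_size,
                        const std::string& target) {
      std::string prefix;
      for (size_t offset = 0; offset + prefix_size <= hash_prefixes.size();
           offset += prefix_size) {
        prefix.assign(hash_prefixes, offset, prefix_size);
        if (prefix == target)
          return true;
      }
      return false;
    }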
 Minor: Address comments by nparker@ 
 Add the histogram info to histograms.xml 
 vakh@chromium.org changed reviewers: + rkaplow@chromium.org 
 rkaplow@ -- can you please review changes in histograms.xml? nparker@ -- thanks for the review! https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... File components/safe_browsing_db/v4_store.cc (right): https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:161: (*additions_map)[prefix_size] = lumped_hashes; On 2016/07/12 22:47:12, Nathan Parker wrote: > On 2016/07/12 20:35:06, Nathan Parker wrote: > > To consider, or add a todo: This copies the string unnecessarily. If you > > guaranteed that the proto's lived at least as long as the update process, you > > could store a pointed to the proto's string. > > (Missed this comment) Though, I'm not sure how to implement this easily, since > it'd require HashPrefixMap to not own the data in some cases (additions) and own > it in others (hash_prefix_map_). Maybe just a TODO for now. This probably only > really matters at startup -- For a 2MB list loaded from disk, this'll use a > extra 2MB RAM. Done. https://codereview.chromium.org/2128173003/diff/100001/components/safe_browsi... components/safe_browsing_db/v4_store.cc:171: HashPrefix smallest_prefix; On 2016/07/12 22:47:12, Nathan Parker wrote: > On 2016/07/12 21:57:02, vakh wrote: > > On 2016/07/12 20:35:06, Nathan Parker wrote: > > > You could use a StringPiece here and within the loop so it doesn't actually > > copy > > > the substring. > > > > That would break the consistency of hash prefixes being HashPrefixes and > expose > > the implementation detail about a hash prefix being a string. > > These strings are fairly small so copying them isn't worth the sacrifice in > > readability. > > They are small, but there is a dynamic allocation everytime one gets copied. > Actually, just pulling it out of the loop would fix most of that, since then it > could reuse the same buffer over and over. Done. 
 shess@chromium.org changed reviewers: + shess@chromium.org 
 I find that I don't have the time/energy to really understand/review during the few days I'm around. But I wanted to comment that the existing code tries really hard to do sorted-merge type operations, because a streaming traversal of megabytes of data may be much faster than a multiply-indirect traversal of tens of kilobytes of data. https://codereview.chromium.org/2128173003/diff/180001/components/safe_browsi... File components/safe_browsing_db/v4_store.h (right): https://codereview.chromium.org/2128173003/diff/180001/components/safe_browsi... components/safe_browsing_db/v4_store.h:225: // list is exact. This ignores the soace that would otherwise be released by s/soace/space/ 
 lgtm histogram lg 
 Nit: s/soace/space 
 On 2016/07/13 05:46:08, Scott Hess (OOO Jul 1-Aug 7) wrote: > I find that I don't have the time/energy to really understand/review during the > few days I'm around. But I wanted to comment that the existing code tries > really hard to do sorted-merge type operations, because a streaming traversal of > megabytes of data may be much faster than a multiply-indirect traversal of tens > of kilobytes of data. Scott, I am not sure I understand what you're saying here. I know you're OOO so if and when possible, please try to explain this in more detail (email/CL comment/chat) and I'll address it. Meanwhile, I am going to go ahead and submit this CL. > > https://codereview.chromium.org/2128173003/diff/180001/components/safe_browsi... > File components/safe_browsing_db/v4_store.h (right): > > https://codereview.chromium.org/2128173003/diff/180001/components/safe_browsi... > components/safe_browsing_db/v4_store.h:225: // list is exact. This ignores the > soace that would otherwise be released by > s/soace/space/ Done 
 The CQ bit was checked by vakh@chromium.org 
 The patchset sent to the CQ was uploaded after l-g-t-m from nparker@chromium.org, rkaplow@chromium.org Link to the patchset: https://codereview.chromium.org/2128173003/#ps200001 (title: "Nit: s/soace/space") 
 CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or... 
 
            
              
              
            
             Committed patchset #11 (id:200001) 
 
            
              
              
            
             CQ bit was unchecked. 
 
            
              
              
            
             On 2016/07/13 20:27:02, vakh wrote: > On 2016/07/13 05:46:08, Scott Hess (OOO Jul 1-Aug 7) wrote: > > I find that I don't have the time/energy to really understand/review during > the > > few days I'm around. But I wanted to comment that the existing code tries > > really hard to do sorted-merge type operations, because a streaming traversal > of > > megabytes of data may be much faster than a multiply-indirect traversal of > tens > > of kilobytes of data. > > Scott, I am not sure I understand what you're saying here. > I know you're OOO so if and when possible, please try to explain this in more > detail (email/CL comment/chat) and I'll address it. > Meanwhile, I am going to go ahead and submit this CL. Mostly I mean that it should process two sorted inputs in order into a single sorted output, because jumping around the inputs or output is very expensive. I can see you have various maps in here, but in a superficial review I'm not sure if the maps map to lists of hashes, or to actual hashes. 
 
            
              
              
            
             Description was changed from ========== Merge the hash prefixes from the old store and the additions in the partial update fetched from the PVer4 service to update the hash prefixes in the new store. Design doc: https://goto.google.com/design-doc-v4store BUG=543161 ========== to ========== Merge the hash prefixes from the old store and the additions in the partial update fetched from the PVer4 service to update the hash prefixes in the new store. Design doc: https://goto.google.com/design-doc-v4store BUG=543161 Committed: https://crrev.com/2eb43a6078426caf95b9fd46342e47696bfb4272 Cr-Commit-Position: refs/heads/master@{#405331} ========== 
 
            
              
              
            
             Patchset 11 (id:??) landed as https://crrev.com/2eb43a6078426caf95b9fd46342e47696bfb4272 Cr-Commit-Position: refs/heads/master@{#405331} 
 
            
              
              
            
On 2016/07/13 22:57:05, Scott Hess (OOO Jul 1-Aug 7) wrote: > On 2016/07/13 20:27:02, vakh wrote: > > On 2016/07/13 05:46:08, Scott Hess (OOO Jul 1-Aug 7) wrote: > > > I find that I don't have the time/energy to really understand/review during > > the > > > few days I'm around. But I wanted to comment that the existing code tries > > > really hard to do sorted-merge type operations, because a streaming > traversal > > of > > > megabytes of data may be much faster than a multiply-indirect traversal of > > tens > > > of kilobytes of data. > > > > Scott, I am not sure I understand what you're saying here. > > I know you're OOO so if and when possible, please try to explain this in more > > detail (email/CL comment/chat) and I'll address it. > > Meanwhile, I am going to go ahead and submit this CL. > > Mostly I mean that it should process two sorted inputs in order into a single > sorted output, because jumping around the inputs or output is very expensive. I > can see you have various maps in here, but in a superficial review I'm not sure > if the maps map to lists of hashes, or to actual hashes. The map is of the form: prefix_size: <concatenated list of sorted hashes> The reason I need to jump around in the map (which isn't clear in this CL but will be clear in a future CL) is that the removals sent by the server are indexes into a list that would hold the lexicographically sorted hash prefixes of all sizes. So if the current hash prefixes we have are: 4: ["aaaa", "bbbb"] 5: ["aabbb"] and the removals list is: [1] Then we need to index 1 from sorted(["aaaa", "bbbb"] + ["aabbb"]) i.e. remove "aabbb". The current implementation keeps a counter of what elements we have merged from the existing prefix map so that we can drop the indexes in the removals list on the fly without having to create sorted(["aaaa", "bbbb"] + ["aabbb"]) first. If there's a better way to manage these removals, please let me know. 
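To illustrate the counting scheme described here (removals land in a later CL, so this is only a sketch with assumed names, and it assumes the removal indices arrive sorted ascending):

    #include <cstddef>
    #include <string>
    #include <vector>

    // |old_prefixes| is the old store's prefixes in globally sorted order (the
    // order the merge visits them in); |removals| holds 0-based indices into
    // that order. A running index lets us drop the removed entries on the fly
    // without materializing the combined sorted list first.
    std::vector<std::string> ApplyRemovals(
        const std::vector<std::string>& old_prefixes,
        const std::vector<size_t>& removals) {
      std::vector<std::string> kept;
      size_t next_removal = 0;
      for (size_t index = 0; index < old_prefixes.size(); ++index) {
        if (next_removal < removals.size() && removals[next_removal] == index) {
          ++next_removal;  // the server asked for this entry to be dropped
          continue;
        }
        kept.push_back(old_prefixes[index]);
      }
      return kept;
    }

With the example above, old_prefixes = {"aaaa", "aabbb", "bbbb"} and removals = {1} keeps "aaaa" and "bbbb" and drops "aabbb".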
 
            
              
              
            
             On 2016/07/13 23:10:54, vakh wrote: > On 2016/07/13 22:57:05, Scott Hess (OOO Jul 1-Aug 7) wrote: > > On 2016/07/13 20:27:02, vakh wrote: > > > On 2016/07/13 05:46:08, Scott Hess (OOO Jul 1-Aug 7) wrote: > > > > I find that I don't have the time/energy to really understand/review > during > > > the > > > > few days I'm around. But I wanted to comment that the existing code tries > > > > really hard to do sorted-merge type operations, because a streaming > > traversal > > > of > > > > megabytes of data may be much faster than a multiply-indirect traversal of > > > tens > > > > of kilobytes of data. > > > > > > Scott, I am not sure I understand what you're saying here. > > > I know you're OOO so if and when possible, please try to explain this in > more > > > detail (email/CL comment/chat) and I'll address it. > > > Meanwhile, I am going to go ahead and submit this CL. > > > > Mostly I mean that it should process two sorted inputs in order into a single > > sorted output, because jumping around the inputs or output is very expensive. > I > > can see you have various maps in here, but in a superficial review I'm not > sure > > if the maps map to lists of hashes, or to actual hashes. > > The map of the form: > prefix_size: <concatenated list of sorted hashes>" > > The reason I need to jump around in the map (which isn't clear in this CL but > will be clear in a future CL) is that the removals sent by the server are > indexes into a list that would hold the lexographically sorted hash prefixes of > all sizes. > So if the current hash prefixes we have are: > 4: ["aaaa", "bbbb"] > 5: ["aabbb"] > > and the removals list is: > [1] > > Then we need to index 1 from sorted(["aaaa", "bbbb"] + ["aabbb"]) i.e. remove > "aabbb". > > The current implementation keeps a counter of what elements we have merged from > the existing prefix map so that we can drop the indexes in removals list on the > fly without having to create sorted(["aaaa", "bbbb"] + ["aabbb"]) first. > If there's a better way to manage these removals, please let me know. I know this is maybe late to the game, and I don't understand why I didn't pick up on it earlier - but maybe my core concern is simply a complaint about variable prefix sizes. Having 32-bit prefixes and 32-byte hashes made things pretty clean, resulting in efficient code that was reasonably amenable to inspection. Having a more open-ended number of bins makes it harder to grok. Additionally, from what I recall when I did distribution analysis when replacing the bloom filter, it seemed like 32-bit prefixes were close to the sweet spot for sets containing between 100k and a few million items. More bits wouldn't move the needle for collisions, and fewer bits only provided savings for less than a million items. The only time having a mixture of prefix sizes will help is if the set size is within a few bits of a magic number (like between 2^21 and 2^27, say), so you could vary between 24 bits and 32 bits on demand, but I'm not sure that's something worth optimizing for. Even if you did, though, I don't think there would ever be value in sending data spanning more than an 8-bit divide (say 24-bit, 32-bit, and 40-bit). Apologies if I grossly misunderstand things. 
 
            
              
              
            
             On 2016/08/10 17:45:53, Scott Hess wrote: > On 2016/07/13 23:10:54, vakh wrote: > > On 2016/07/13 22:57:05, Scott Hess (OOO Jul 1-Aug 7) wrote: > > > On 2016/07/13 20:27:02, vakh wrote: > > > > On 2016/07/13 05:46:08, Scott Hess (OOO Jul 1-Aug 7) wrote: > > > > > I find that I don't have the time/energy to really understand/review > > during > > > > the > > > > > few days I'm around. But I wanted to comment that the existing code > tries > > > > > really hard to do sorted-merge type operations, because a streaming > > > traversal > > > > of > > > > > megabytes of data may be much faster than a multiply-indirect traversal > of > > > > tens > > > > > of kilobytes of data. > > > > > > > > Scott, I am not sure I understand what you're saying here. > > > > I know you're OOO so if and when possible, please try to explain this in > > more > > > > detail (email/CL comment/chat) and I'll address it. > > > > Meanwhile, I am going to go ahead and submit this CL. > > > > > > Mostly I mean that it should process two sorted inputs in order into a > single > > > sorted output, because jumping around the inputs or output is very > expensive. > > I > > > can see you have various maps in here, but in a superficial review I'm not > > sure > > > if the maps map to lists of hashes, or to actual hashes. > > > > The map of the form: > > prefix_size: <concatenated list of sorted hashes>" > > > > The reason I need to jump around in the map (which isn't clear in this CL but > > will be clear in a future CL) is that the removals sent by the server are > > indexes into a list that would hold the lexographically sorted hash prefixes > of > > all sizes. > > So if the current hash prefixes we have are: > > 4: ["aaaa", "bbbb"] > > 5: ["aabbb"] > > > > and the removals list is: > > [1] > > > > Then we need to index 1 from sorted(["aaaa", "bbbb"] + ["aabbb"]) i.e. remove > > "aabbb". > > > > The current implementation keeps a counter of what elements we have merged > from > > the existing prefix map so that we can drop the indexes in removals list on > the > > fly without having to create sorted(["aaaa", "bbbb"] + ["aabbb"]) first. > > If there's a better way to manage these removals, please let me know. > > I know this is maybe late to the game, and I don't understand why I didn't pick > up on it earlier - but maybe my core concern is simply a complaint about > variable prefix sizes. Having 32-bit prefixes and 32-byte hashes made things > pretty clean, resulting in efficient code that was reasonably amenable to > inspection. Having a more open-ended number of bins makes it harder to grok. > Additionally, from what I recall when I did distribution analysis when replacing > the bloom filter, it seemed like 32-bit prefixes were close to the sweet spot > for sets containing between 100k and a few million items. More bits wouldn't > move the needle for collisions, and fewer bits only provided savings for less > than a million items. The only time having a mixture of prefix sizes will help > is if the set size is within a few bits of a magic number (like between 2^21 and > 2^27, say), so you could vary between 24 bits and 32 bits on demand, but I'm not > sure that's something worth optimizing for. Even if you did, though, I don't > think there would ever be value in sending data spanning more than an 8-bit > divide (say 24-bit, 32-bit, and 40-bit). > > Apologies if I grossly misunderstand things. It isn't too late, please feel free to look at other CLs that I submitted while you were away and review those too. 
I welcome and appreciate your feedback/reviews. I am not sure I understand your comment completely but the core of it seems to be: why are we using different-sized prefixes? The reason is that the SafeBrowsing service now supports different-sized prefixes. These are used to prevent collisions between a bad website and a hugely popular website. You probably understand this better than me, but to illustrate with an example, let's assume: 1. example.com/one is a bad URL with hash: aabbccddeeeeeeee 2. google.com with hash: aabbccddgggggggg They have a common 32-bit hash-prefix: aabbccdd This means if we only support 32-bit hash prefixes, every time a user goes to google.com, it would lead to a full hash request to the SafeBrowsing service. By supporting 5-byte hash prefixes, we blacklist: aabbccddee and do not send full hash requests for aabbccddgg 
 
            
              
              
            
             On 2016/08/10 17:58:50, vakh wrote: > I am not sure I understand your comment completely but the core of it seems be: > why are we using different-sized prefixes? > The reason is that the SafeBrowsing service now supports different sized > prefixes. > These are used to prevent collisions between a bad website and a hugely popular > website. > You probably understand this better than me, but to illustrate with an example, > let's assume: > 1. example.com/one is a bad URL with hash: aabbccddeeeeeeee > 2. http://google.com with hash: aabbccddgggggggg > > They have a common 32-bit hash-prefix: aabbccdd > This means if we only support 32-bit hash prefixes, every time a user goes to > http://google.com, it would lead to a full hash request to SafeBrowsing service. > > By supporting 5-byte hash prefixes, we blacklist: aabbccddee and do not send > full hash requests for aabbccddgg My point is that for a set of ~2M items using 32-bit prefixes, there should be only a handful of collisions, so the advantage of having a 40-bit prefix for that case over a 32-byte full hash is pretty trivial. If you had ~2M items using 24-bit prefixes, there should be a fair number of collisions, so being able to send both 24-bit and 32-bit prefixes could save some bandwidth. If the set was well-sorted, though, the initial download wouldn't be much impacted (due to compression opportunities), and if prefixes waffle between 24-bit and 32-bit during updates you could easily eat up the savings in overhead. In any case, moving from 24-bit to 32-bit should reduce collisions by 1/256, at which point it's unlikely that adding 40-bit prefixes would be a material improvement. Mostly where I'm going is that it's probably not worthwhile to make the code complicated to save ~1% in bandwidth. Having a map from prefix size to lists feels complicated to me, because it makes it really easy for minor errors to have major impacts. As a for-instance of the kind of thing I worry about, you have a map full of iterators with the same signatures from different containers, so you could imagine cases comparing across containers resulting in running off the end of the container, but maybe due to statistics of hashing it only happens infrequently so the case kicks up based on a data change, rather than a code change (so suddenly all channels start crashing). [It's possible the server side has already set everything in stone for you. I've often worried about whether Chrome team gave them enough pushback about offloading processing and memory requirements onto clients, given that there are billions of clients.] 
 
            
              
 
            
              
              
            
             On 2016/08/10 18:22:15, Scott Hess wrote: > On 2016/08/10 17:58:50, vakh wrote: > > I am not sure I understand your comment completely but the core of it seems > be: > > why are we using different-sized prefixes? > > The reason is that the SafeBrowsing service now supports different sized > > prefixes. > > These are used to prevent collisions between a bad website and a hugely > popular > > website. > > You probably understand this better than me, but to illustrate with an > example, > > let's assume: > > 1. example.com/one is a bad URL with hash: aabbccddeeeeeeee > > 2. http://google.com with hash: aabbccddgggggggg > > > > They have a common 32-bit hash-prefix: aabbccdd > > This means if we only support 32-bit hash prefixes, every time a user goes to > > http://google.com, it would lead to a full hash request to SafeBrowsing > service. > > > > By supporting 5-byte hash prefixes, we blacklist: aabbccddee and do not send > > full hash requests for aabbccddgg > > My point is that for a set of ~2M items using 32-bit prefixes, there should be > only a handful of collisions, so the advantage of having a 40-bit prefix for > that case over a 32-byte full hash is pretty trivial. > > If you had ~2M items using 24-bit prefixes, there should be a fair number of > collisions, so being able to send both 24-bit and 32-bit prefixes could save > some bandwidth. If the set was well-sorted, though, the initial download > wouldn't be much impacted (due to compression opportunities), and if prefixes > waffle between 24-bit and 32-bit during updates you could easily eat up the > savings in overhead. In any case, moving from 24-bit to 32-bit should reduce > collisions by 1/256, at which point it's unlikely that adding 40-bit prefixes > would be a material improvement. > > Mostly where I'm going is that it's probably not worthwhile to make the code > complicated to save ~1% in bandwidth. Having a map from prefix size to lists > feels complicated to me I know the new design is slightly more complex to understand at first due to variable sized hash prefixes, but to be honest, I find it easier to understand overall than the PVer2/3 implementation. Part of it is because you helped keep it simple by avoiding use of locks etc and partly because the data structures used (unordered_maps) are ones used very commonly (unlike BloomFilters/PrefixSet). I would say that the increase in complexity of code is subjective. > because it makes it really easy for minor errors to > have major impacts. With the kind of user base Chrome has, this is true of any errors. > As a for-instance of the kind of thing I worry about, you > have a map full of iterators with the same signatures from different containers, > so you could imagine cases comparing across containers resulting in running off > the end of the container, but maybe due to statistics of hashing it only happens > infrequently so the case kicks up based on a data change, rather than a code > change (so suddenly all channels start crashing). I agree that any condition that can trigger based on data change can start causing crashes anytime and should be avoided. At the same time, I think that's why having unit tests and rolling out the change slowly is the right way to go about this launch, which is what I plan to do. As for the particular case of map of iterators, it is always initialized just before use and has a short lifetime so I don't expect it to go into a weird state ever. Also, there are a number of unit tests for that piece of code. 
> > [It's possible the server side has already set everything in stone for you. > I've often worried about whether Chrome team gave them enough pushback about > offloading processing and memory requirements onto clients, given that there are > billions of clients.] That's a fair point, but it is something that can be discussed for the next iteration of the server implementation. As the service works currently, this is the only option for Chrome [other than to not adopt the new server implementation].
