Issue 2228393003: PVer4: DecodeHashes needs to sort the output of the Rice decoder

vakh (use Gerrit instead)

The CQ bit was checked by vakh@chromium.org to run a CQ dry run

4 years, 4 months ago (2016-08-11 08:13:04 UTC) #1

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2228393003/1

4 years, 4 months ago (2016-08-11 08:13:13 UTC) #2

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 4 months ago (2016-08-11 08:27:14 UTC) #3

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: linux_chromium_chromeos_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_chromeos_rel_ng/builds/259287)

4 years, 4 months ago (2016-08-11 08:27:15 UTC) #4

vakh (use Gerrit instead)

The CQ bit was checked by vakh@chromium.org to run a CQ dry run

4 years, 4 months ago (2016-08-11 21:09:19 UTC) #6

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2228393003/20001

4 years, 4 months ago (2016-08-11 21:09:47 UTC) #7

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 4 months ago (2016-08-11 21:21:29 UTC) #8

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: linux_chromium_compile_dbg_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_compile_dbg_ng/builds/139457)

4 years, 4 months ago (2016-08-11 21:21:30 UTC) #9

vakh (use Gerrit instead)

The CQ bit was checked by vakh@chromium.org to run a CQ dry run

4 years, 4 months ago (2016-08-11 21:44:37 UTC) #11

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2228393003/40001

4 years, 4 months ago (2016-08-11 21:45:29 UTC) #12

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 4 months ago (2016-08-11 23:24:15 UTC) #13

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

4 years, 4 months ago (2016-08-11 23:24:16 UTC) #14

vakh (use Gerrit instead)

PVer4: DecodeHashes needs to sort the output of the Rice decoder

4 years, 4 months ago (2016-08-12 21:53:06 UTC) #15

vakh (use Gerrit instead)

The CQ bit was checked by vakh@chromium.org to run a CQ dry run

4 years, 4 months ago (2016-08-12 21:53:08 UTC) #16

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2228393003/60001

4 years, 4 months ago (2016-08-12 21:53:32 UTC) #17

vakh (use Gerrit instead)

Description was changed from ========== Add tests to validate the Rice-decoder implementation. Also, make the ...

4 years, 4 months ago (2016-08-12 21:54:21 UTC) #18

vakh (use Gerrit instead)

Patchset #1 (id:20001) has been deleted

4 years, 4 months ago (2016-08-12 21:54:45 UTC) #20

vakh (use Gerrit instead)

Patchset #1 (id:40001) has been deleted

4 years, 4 months ago (2016-08-12 21:54:59 UTC) #21

vakh (use Gerrit instead)

Fix the test to expect sorted list of hashes from DecodeBytes

4 years, 4 months ago (2016-08-12 21:58:37 UTC) #22

vakh (use Gerrit instead)

The CQ bit was checked by vakh@chromium.org to run a CQ dry run

4 years, 4 months ago (2016-08-12 21:58:39 UTC) #23

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2228393003/80001

4 years, 4 months ago (2016-08-12 21:59:08 UTC) #24

vakh (use Gerrit instead)

vakh@chromium.org changed reviewers: + awoz@chromium.org, nparker@chromium.org

4 years, 4 months ago (2016-08-12 22:04:16 UTC) #25

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 4 months ago (2016-08-12 22:33:11 UTC) #27

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: linux_chromium_chromeos_ozone_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_chromeos_ozone_rel_ng/builds/217865) win_chromium_x64_rel_ng on ...

4 years, 4 months ago (2016-08-12 22:33:12 UTC) #28

vakh (use Gerrit instead)

Fix the corresponding test in v4_store_unittest.cc also

4 years, 4 months ago (2016-08-12 22:49:35 UTC) #29

vakh (use Gerrit instead)

The CQ bit was checked by vakh@chromium.org to run a CQ dry run

4 years, 4 months ago (2016-08-12 22:49:38 UTC) #30

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2228393003/100001

4 years, 4 months ago (2016-08-12 22:49:59 UTC) #31

Nathan Parker

https://codereview.chromium.org/2228393003/diff/80001/components/safe_browsing_db/v4_rice.cc File components/safe_browsing_db/v4_rice.cc (right): https://codereview.chromium.org/2228393003/diff/80001/components/safe_browsing_db/v4_rice.cc#newcode22 components/safe_browsing_db/v4_rice.cc:22: uint32_t FlipEndianness(uint32_t word) { This should be platform-specific. Maybe ...

4 years, 4 months ago (2016-08-12 23:01:49 UTC) #32

Scott Hess - ex-Googler

shess@chromium.org changed reviewers: + shess@chromium.org

4 years, 4 months ago (2016-08-12 23:25:46 UTC) #33

Scott Hess - ex-Googler

DVLOG! OK, I'll ignore it. https://codereview.chromium.org/2228393003/diff/100001/components/safe_browsing_db/v4_rice.cc File components/safe_browsing_db/v4_rice.cc (left): https://codereview.chromium.org/2228393003/diff/100001/components/safe_browsing_db/v4_rice.cc#oldcode5 components/safe_browsing_db/v4_rice.cc:5: #include "base/logging.h" I think ...

4 years, 4 months ago (2016-08-12 23:25:47 UTC) #34

DVLOG!  OK, I'll ignore it.

https://codereview.chromium.org/2228393003/diff/100001/components/safe_browsi...
File components/safe_browsing_db/v4_rice.cc (left):

https://codereview.chromium.org/2228393003/diff/100001/components/safe_browsi...
components/safe_browsing_db/v4_rice.cc:5: #include "base/logging.h"
I think you still need logging.h for things like NOTREACHED().  Just because
safe_math.h has it for you doesn't mean you should skip it here.

https://codereview.chromium.org/2228393003/diff/100001/components/safe_browsi...
File components/safe_browsing_db/v4_rice.cc (right):

https://codereview.chromium.org/2228393003/diff/100001/components/safe_browsi...
components/safe_browsing_db/v4_rice.cc:22: uint32_t FlipEndianness(uint32_t
word) {
Is this right?  I don't _think_ we have a mix of little-endian and big-endian
platforms running Chrome, but I'm not entirely certain.  I know there are MIPS
ports out there, and I see evidence that someone was working on cleaning up
big-endian support for MIPS.  Does this really need to always flip, or is it
more like ntohl()?  If the latter, then I suspect we have base/ support for that
kind of thing.

[Maybe base/big_endian.h ?  Or perhaps something in net/?]

https://codereview.chromium.org/2228393003/diff/100001/components/safe_browsi...
components/safe_browsing_db/v4_rice.cc:249: // |current_word_|.
I'm finding this confusing.  I _think_ what it's saying is that the entire span
of bits does not exist in current_word_.  But below it calls GetNextBits() to
read num_bits_left_in_current_word, which _does_ exist in current_word_ (so I
think it will always degenerate to GetBitsFromCurrentWord).  Then upper has to
be read via GetNextBits() to load a new word.  [Is the same possibly true for
reading lower?  Or would num_bits_left_in_current_word be zero in that case?]

https://codereview.chromium.org/2228393003/diff/100001/components/safe_browsi...
File components/safe_browsing_db/v4_rice_unittest.cc (right):

https://codereview.chromium.org/2228393003/diff/100001/components/safe_browsi...
components/safe_browsing_db/v4_rice_unittest.cc:39: }
There's an expectation below for the decoder not having additional values - do
you need an expectation for the decoder having enough values?  Or does
DECODE_SUCCESS cover that?  Will that fire repeatedly if the decoder comes up
short?  And clearly |word != expected| in case of decode failure.

[Perhaps ASSERT_EQ() would handle this case for the decode failure, if it can be
used in a helper function like this.]

https://codereview.chromium.org/2228393003/diff/100001/components/safe_browsi...
components/safe_browsing_db/v4_rice_unittest.cc:142: VerifyRiceDecoding(it);
In case of failure this isn't going to give very satisfying diagnostics.  It
might be nice to have a descriptive value logged with expectation failures.

OTOH, if test case #7 fails, I'm not entirely sure how to address it by
inspection.

vakh (use Gerrit instead)

The CQ bit was checked by vakh@chromium.org to run a CQ dry run

4 years, 4 months ago (2016-08-13 00:15:09 UTC) #36

vakh (use Gerrit instead)

Thanks for the review. PTAL. DVLOGs - removed. https://codereview.chromium.org/2228393003/diff/80001/components/safe_browsing_db/v4_rice.cc File components/safe_browsing_db/v4_rice.cc (right): https://codereview.chromium.org/2228393003/diff/80001/components/safe_browsing_db/v4_rice.cc#newcode22 components/safe_browsing_db/v4_rice.cc:22: uint32_t ...

4 years, 4 months ago (2016-08-13 00:15:14 UTC) #37

Thanks for the review. PTAL.

DVLOGs - removed.

https://codereview.chromium.org/2228393003/diff/80001/components/safe_browsin...
File components/safe_browsing_db/v4_rice.cc (right):

https://codereview.chromium.org/2228393003/diff/80001/components/safe_browsin...
components/safe_browsing_db/v4_rice.cc:22: uint32_t FlipEndianness(uint32_t
word) {
On 2016/08/12 23:01:49, Nathan Parker wrote:
> This should be platform-specific.  Maybe use C's htonl(), or something from
> chromium/src/base/big_endian.h

Waiting for Noe/Alex to confirm.

https://codereview.chromium.org/2228393003/diff/80001/components/safe_browsin...
components/safe_browsing_db/v4_rice.cc:29: bytes[3] = (char)(word & mask);
On 2016/08/12 23:01:49, Nathan Parker wrote:
> Is this ordering endian-dependent as well?

No.

https://codereview.chromium.org/2228393003/diff/80001/components/safe_browsin...
components/safe_browsing_db/v4_rice.cc:119: base::CheckedNumeric<uint32_t>
last_value(first_value);
On 2016/08/12 23:01:49, Nathan Parker wrote:
> You're casting a int64 to a unit32_t.  Is that intentional?

Yes, I am not sure why the first_value is a int64 when we only expect uint32s.
The reason for this change is to capture the value from the proto as-is and let
the NumericChecker code in Chrome raise if the value is actually greater than
what can fit in 32 bits.

https://codereview.chromium.org/2228393003/diff/80001/components/safe_browsin...
components/safe_browsing_db/v4_rice.cc:120: std::set<uint32_t> hash_prefixes;
On 2016/08/12 23:01:49, Nathan Parker wrote:
> This is going to use at least another 24 bytes of RAM per 4-byte hash (three
> pointers). I think it'd be better to use a different sorting mechanism, since
> we've got several copies of the hashes in RAM already.

Let me try what we discussed offline: use the string out to store unsorted hash
prefixes using reinterpret magic and sort in-place at the end.

https://codereview.chromium.org/2228393003/diff/80001/components/safe_browsin...
File components/safe_browsing_db/v4_rice_unittest.cc (right):

https://codereview.chromium.org/2228393003/diff/80001/components/safe_browsin...
components/safe_browsing_db/v4_rice_unittest.cc:142: VerifyRiceDecoding(it);
On 2016/08/12 23:01:49, Nathan Parker wrote:
> If this fails, will it tell you which test cast failed?  An idea: log a
testcase
> number when it fails.

It'll grok what did not match and I can then grep it.
Test case number is useful but even with that, it'd require me to count the
instances of RiceDecodingTestInfo to see which one did fail.

https://codereview.chromium.org/2228393003/diff/80001/components/safe_browsin...
components/safe_browsing_db/v4_rice_unittest.cc:257:
EXPECT_EQ(std::string("\x5\0\0\0\fL\x93\xADV\x7F\xF6o\xCEo1\x81", 16), out);
On 2016/08/12 23:01:49, Nathan Parker wrote:
> How did you generate this? (or was it just based on code output?)

Got the input from a unit test for the server.
Verified the output manually and then added it as the expected value.

https://codereview.chromium.org/2228393003/diff/80001/components/safe_browsin...
File components/safe_browsing_db/v4_store.cc (right):

https://codereview.chromium.org/2228393003/diff/80001/components/safe_browsin...
components/safe_browsing_db/v4_store.cc:205: DVLOG(1) << "Update successful: "
<< *this;
On 2016/08/12 23:01:49, Nathan Parker wrote:
> These are fairly verbose -- You should plan to remove most of them once the
code
> is stable.

Ack. Done.
Also: http://crbug.com/637468

https://codereview.chromium.org/2228393003/diff/80001/components/safe_browsin...
components/safe_browsing_db/v4_store.cc:404: DVLOG(1) << "Failure:
ADDITIONS_HAS_EXISTING_PREFIX_FAILURE";
On 2016/08/12 23:01:49, Nathan Parker wrote:
> Can you log this further up instead?  Just logging the enum # is usually
> sufficient, since it should be rarely used.
> 
> Same with those below.

Done.

https://codereview.chromium.org/2228393003/diff/80001/components/safe_browsin...
components/safe_browsing_db/v4_store.cc:506: DVLOG(1) << "Failure: Unexpected
magic number found in file: "
On 2016/08/12 23:01:49, Nathan Parker wrote:
> These DVLOGs actually seem like more clutter than they're worth, since they
just
> mean the file is corrupted.  We don't need to maintain code to tell a
developer
> which part of the file is corrupted.
> 
> Same with file version and response type.  If someday we start getting those
> errors, we can easily add a line to log it.

Done.

https://codereview.chromium.org/2228393003/diff/100001/components/safe_browsi...
File components/safe_browsing_db/v4_rice.cc (left):

https://codereview.chromium.org/2228393003/diff/100001/components/safe_browsi...
components/safe_browsing_db/v4_rice.cc:5: #include "base/logging.h"
On 2016/08/12 23:25:46, Scott Hess wrote:
> I think you still need logging.h for things like NOTREACHED().  Just because
> safe_math.h has it for you doesn't mean you should skip it here.

I agree. Didn't realize it was needed for that. Added back.

https://codereview.chromium.org/2228393003/diff/100001/components/safe_browsi...
File components/safe_browsing_db/v4_rice.cc (right):

https://codereview.chromium.org/2228393003/diff/100001/components/safe_browsi...
components/safe_browsing_db/v4_rice.cc:22: uint32_t FlipEndianness(uint32_t
word) {
On 2016/08/12 23:25:47, Scott Hess wrote:
> Is this right?  I don't _think_ we have a mix of little-endian and big-endian
> platforms running Chrome, but I'm not entirely certain.  I know there are MIPS
> ports out there, and I see evidence that someone was working on cleaning up
> big-endian support for MIPS.  Does this really need to always flip, or is it
> more like ntohl()?  If the latter, then I suspect we have base/ support for
that
> kind of thing.
> 
> [Maybe base/big_endian.h ?  Or perhaps something in net/?]

Is this flip always needed: Waiting to get a confirmation from the SafeBrowsing
team on that. cc'd you on that email.

https://codereview.chromium.org/2228393003/diff/100001/components/safe_browsi...
components/safe_browsing_db/v4_rice.cc:249: // |current_word_|.
On 2016/08/12 23:25:46, Scott Hess wrote:
> I'm finding this confusing.  I _think_ what it's saying is that the entire
span
> of bits does not exist in current_word_.  But below it calls GetNextBits() to
> read num_bits_left_in_current_word, which _does_ exist in current_word_ (so I
> think it will always degenerate to GetBitsFromCurrentWord).  Then upper has to
> be read via GetNextBits() to load a new word.  [Is the same possibly true for
> reading lower?  Or would num_bits_left_in_current_word be zero in that case?]

Let me illustrate with an example: Let's say we want to read 28 bits, but we've
already read 23 from current_word_. This means:
num_requested_bits = 28
current_word_bit_index_ = 23
num_bits_left_in_current_word = 9

We split the read operation into two parts: read 9 bits, then read 19 bits for a
total of 28.

lower = GetNextBits(9)
=> current_word_bit_index_ = 23 + 9 = 32.

upper = GetNextBits(19)
=> but, current_word_bit_index_ == 32 (kMaxBitIndex)
=> => Read next word into current_word_ and reset current_word_bit_index_ to 0
(see line 233).
=> Read 19 bits from new current_word_ into upper.
=> current_word_bit_index_ = 0 + 19 = 19

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2228393003/120001

4 years, 4 months ago (2016-08-13 00:15:33 UTC) #38

vakh (use Gerrit instead)

https://codereview.chromium.org/2228393003/diff/100001/components/safe_browsing_db/v4_rice_unittest.cc File components/safe_browsing_db/v4_rice_unittest.cc (right): https://codereview.chromium.org/2228393003/diff/100001/components/safe_browsing_db/v4_rice_unittest.cc#newcode142 components/safe_browsing_db/v4_rice_unittest.cc:142: VerifyRiceDecoding(it); On 2016/08/12 23:25:47, Scott Hess wrote: > In ...

4 years, 4 months ago (2016-08-13 00:17:16 UTC) #39

vakh (use Gerrit instead)

https://codereview.chromium.org/2228393003/diff/100001/components/safe_browsing_db/v4_rice_unittest.cc File components/safe_browsing_db/v4_rice_unittest.cc (right): https://codereview.chromium.org/2228393003/diff/100001/components/safe_browsing_db/v4_rice_unittest.cc#newcode39 components/safe_browsing_db/v4_rice_unittest.cc:39: } On 2016/08/12 23:25:47, Scott Hess wrote: > There's ...

4 years, 4 months ago (2016-08-13 00:19:44 UTC) #40

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 4 months ago (2016-08-13 01:07:45 UTC) #41

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

4 years, 4 months ago (2016-08-13 01:07:46 UTC) #42

vakh (use Gerrit instead)

Use a vector to decode and sort uint32_t values. re-interpret the memory of that vector ...

4 years, 4 months ago (2016-08-15 20:52:34 UTC) #43

vakh (use Gerrit instead)

The CQ bit was checked by vakh@chromium.org to run a CQ dry run

4 years, 4 months ago (2016-08-15 20:52:36 UTC) #44

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2228393003/140001

4 years, 4 months ago (2016-08-15 20:52:56 UTC) #45

Scott Hess - ex-Googler

It feels like there is something being forced in the flippedness area, here. Rice encoding ...

4 years, 4 months ago (2016-08-15 21:18:15 UTC) #46

It feels like there is something being forced in the flippedness area, here. 
Rice encoding should be happening on a series of fixed-size integers.  So the
client of the encoder/decoder shouldn't be doing any byte-swapping at all - the
integers should go in and come out as integers, not as blobs.

Likewise, for local use, there shouldn't be any byte-order involved.  For
convenience you might encode a series of fixed-size integers into a string blob,
but since it's for local use it should all be strictly with local byte order.

I'm not entirely certain where things are going awry.  You may need to step back
one level and look at things with fresh eyes.

https://codereview.chromium.org/2228393003/diff/80001/components/safe_browsin...
File components/safe_browsing_db/v4_rice_unittest.cc (right):

https://codereview.chromium.org/2228393003/diff/80001/components/safe_browsin...
components/safe_browsing_db/v4_rice_unittest.cc:142: VerifyRiceDecoding(it);
On 2016/08/13 00:15:14, vakh wrote:
> On 2016/08/12 23:01:49, Nathan Parker wrote:
> > If this fails, will it tell you which test cast failed?  An idea: log a
> testcase
> > number when it fails.
> 
> It'll grok what did not match and I can then grep it.
> Test case number is useful but even with that, it'd require me to count the
> instances of RiceDecodingTestInfo to see which one did fail.

It seems unlikely that the logged output will match the \x-escaped data.

https://codereview.chromium.org/2228393003/diff/100001/components/safe_browsi...
File components/safe_browsing_db/v4_rice.cc (right):

https://codereview.chromium.org/2228393003/diff/100001/components/safe_browsi...
components/safe_browsing_db/v4_rice.cc:22: uint32_t FlipEndianness(uint32_t
word) {
On 2016/08/13 00:15:14, vakh wrote:
> On 2016/08/12 23:25:47, Scott Hess wrote:
> > Is this right?  I don't _think_ we have a mix of little-endian and
big-endian
> > platforms running Chrome, but I'm not entirely certain.  I know there are
MIPS
> > ports out there, and I see evidence that someone was working on cleaning up
> > big-endian support for MIPS.  Does this really need to always flip, or is it
> > more like ntohl()?  If the latter, then I suspect we have base/ support for
> that
> > kind of thing.
> > 
> > [Maybe base/big_endian.h ?  Or perhaps something in net/?]
> 
> Is this flip always needed: Waiting to get a confirmation from the
SafeBrowsing
> team on that. cc'd you on that email.

To be clear - I suspect they are just serving network byte order, so you need to
use ntohl() or base/big_endian.h .  My guess is that they don't rearrange the
byte order based on user-agent request, so the right thing to do would be to
conditionally flip (such as those two options do).  It probably would be
preferable to use something like base/big_endian.h so that the data is simply
read out right, rather than have separate steps for read-the-data and
conditional-flip, though cache hierarchies on modern CPUs may make the
difference moot.

Actually, if it's little-endian on the wire, then having a read like your prior
reader is probably best, since it's not conditional (it simply put the first
byte in the low-order bits, and so on).

https://codereview.chromium.org/2228393003/diff/100001/components/safe_browsi...
components/safe_browsing_db/v4_rice.cc:249: // |current_word_|.
On 2016/08/13 00:15:14, vakh wrote:
> On 2016/08/12 23:25:46, Scott Hess wrote:
> > I'm finding this confusing.  I _think_ what it's saying is that the entire
> span
> > of bits does not exist in current_word_.  But below it calls GetNextBits()
to
> > read num_bits_left_in_current_word, which _does_ exist in current_word_ (so
I
> > think it will always degenerate to GetBitsFromCurrentWord).  Then upper has
to
> > be read via GetNextBits() to load a new word.  [Is the same possibly true
for
> > reading lower?  Or would num_bits_left_in_current_word be zero in that
case?]
> 
> Let me illustrate with an example: Let's say we want to read 28 bits, but
we've
> already read 23 from current_word_. This means:
> num_requested_bits = 28
> current_word_bit_index_ = 23
> num_bits_left_in_current_word = 9
> 
> We split the read operation into two parts: read 9 bits, then read 19 bits for
a
> total of 28.
> 
> lower = GetNextBits(9)
> => current_word_bit_index_ = 23 + 9 = 32.
> 
> upper = GetNextBits(19)
> => but, current_word_bit_index_ == 32 (kMaxBitIndex)
> => => Read next word into current_word_ and reset current_word_bit_index_ to 0
> (see line 233).
> => Read 19 bits from new current_word_ into upper.
> => current_word_bit_index_ = 0 + 19 = 19

Yes, but the comment is written in terms of current_word_, which is not a thing
the local code is using.  Maybe try writing a new comment without reference to
the current one, to see if a cleaner description can be made.  Like "Combine the
rest of current_word_ with additional bits from the next word."

I think the first call to GetNextBits() should actually be calling
GetBitsFromCurrentWord() to eat the remainder of current_word_.

https://codereview.chromium.org/2228393003/diff/100001/components/safe_browsi...
File components/safe_browsing_db/v4_rice_unittest.cc (right):

https://codereview.chromium.org/2228393003/diff/100001/components/safe_browsi...
components/safe_browsing_db/v4_rice_unittest.cc:39: }
On 2016/08/13 00:19:44, vakh wrote:
> On 2016/08/12 23:25:47, Scott Hess wrote:
> > There's an expectation below for the decoder not having additional values -
do
> > you need an expectation for the decoder having enough values?  Or does
> > DECODE_SUCCESS cover that?  Will that fire repeatedly if the decoder comes
up
> > short?  And clearly |word != expected| in case of decode failure.
> > 
> > [Perhaps ASSERT_EQ() would handle this case for the decode failure, if it
can
> be
> > used in a helper function like this.]
> 
> That case returns DECODE_NO_MORE_ENTRIES_FAILURE so is covered.

It's going to return DECODE_NO_MORE_ENTRIES_FAILURE, but then have a cascade of
expectation failures.  So it might be better to use ASSERT_EQ for that case.

https://codereview.chromium.org/2228393003/diff/140001/components/safe_browsi...
File components/safe_browsing_db/v4_rice.cc (right):

https://codereview.chromium.org/2228393003/diff/140001/components/safe_browsi...
components/safe_browsing_db/v4_rice.cc:132: //
https://developers.google.com/safe-browsing/v4/compression
The protocol says the data is in network byte order, it doesn't say that it
needs to be flipped.  If the platform is little-endian, flipping would be
incorrect.

https://codereview.chromium.org/2228393003/diff/140001/components/safe_browsi...
components/safe_browsing_db/v4_rice.cc:136: std::sort(out->begin(), out->end());
Whoa!  I could see ASSERT_TRUE(std::is_sorted(...)) here, as if these aren't
already sorted, I think something is fubar.

Scott Hess - ex-Googler

On 2016/08/15 21:18:15, Scott Hess wrote: > It feels like there is something being forced ...

4 years, 4 months ago (2016-08-15 21:19:38 UTC) #47

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 4 months ago (2016-08-15 21:38:02 UTC) #48

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

4 years, 4 months ago (2016-08-15 21:38:03 UTC) #49

vakh (use Gerrit instead)

Use htonl/ntohl instead of FlipEndianness (but the effect is the same). Use GetBitsFromCurrentWord for lower

4 years, 4 months ago (2016-08-16 00:18:56 UTC) #50

vakh (use Gerrit instead)

The CQ bit was checked by vakh@chromium.org to run a CQ dry run

4 years, 4 months ago (2016-08-16 00:18:59 UTC) #51

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2228393003/160001

4 years, 4 months ago (2016-08-16 00:19:47 UTC) #52

vakh (use Gerrit instead)

On 2016/08/15 21:19:38, Scott Hess wrote: > On 2016/08/15 21:18:15, Scott Hess wrote: > > ...

4 years, 4 months ago (2016-08-16 00:21:53 UTC) #53

vakh (use Gerrit instead)

Out-of-order comments! This one should have gone out before: <quote> Yes, I agree it does ...

4 years, 4 months ago (2016-08-16 00:23:03 UTC) #54

Out-of-order comments! This one should have gone out before:
<quote>
Yes, I agree it does look strange.
[Rest is repition of what I have said elsewhere.]
But what's read off the wire is bits, not bytes. The flipping happens after the
decoder has produced the uint32_t.
</quote>

https://codereview.chromium.org/2228393003/diff/80001/components/safe_browsin...
File components/safe_browsing_db/v4_rice_unittest.cc (right):

https://codereview.chromium.org/2228393003/diff/80001/components/safe_browsin...
components/safe_browsing_db/v4_rice_unittest.cc:142: VerifyRiceDecoding(it);
On 2016/08/15 21:18:15, Scott Hess wrote:
> On 2016/08/13 00:15:14, vakh wrote:
> > On 2016/08/12 23:01:49, Nathan Parker wrote:
> > > If this fails, will it tell you which test cast failed?  An idea: log a
> > testcase
> > > number when it fails.
> > 
> > It'll grok what did not match and I can then grep it.
> > Test case number is useful but even with that, it'd require me to count the
> > instances of RiceDecodingTestInfo to see which one did fail.
> 
> It seems unlikely that the logged output will match the \x-escaped data.

I changed one of the values and the produced error was:
../../components/safe_browsing_db/v4_rice_unittest.cc:39: Failure
Value of: word
  Actual: 922272696
Expected: expected
Which is: 924369848

I was then able to grep 924369848 in the file to see which case is failing.
However, that requires unique values in the file, which is an unstated
assumption so I've added a DVLOG for test case number.

https://codereview.chromium.org/2228393003/diff/100001/components/safe_browsi...
File components/safe_browsing_db/v4_rice.cc (right):

https://codereview.chromium.org/2228393003/diff/100001/components/safe_browsi...
components/safe_browsing_db/v4_rice.cc:22: uint32_t FlipEndianness(uint32_t
word) {
On 2016/08/15 21:18:15, Scott Hess wrote:
> On 2016/08/13 00:15:14, vakh wrote:
> > On 2016/08/12 23:25:47, Scott Hess wrote:
> > > Is this right?  I don't _think_ we have a mix of little-endian and
> big-endian
> > > platforms running Chrome, but I'm not entirely certain.  I know there are
> MIPS
> > > ports out there, and I see evidence that someone was working on cleaning
up
> > > big-endian support for MIPS.  Does this really need to always flip, or is
it
> > > more like ntohl()?  If the latter, then I suspect we have base/ support
for
> > that
> > > kind of thing.
> > > 
> > > [Maybe base/big_endian.h ?  Or perhaps something in net/?]
> > 
> > Is this flip always needed: Waiting to get a confirmation from the
> SafeBrowsing
> > team on that. cc'd you on that email.
> 
> To be clear - I suspect they are just serving network byte order, so you need
to
> use ntohl() or base/big_endian.h .  My guess is that they don't rearrange the
> byte order based on user-agent request, so the right thing to do would be to
> conditionally flip (such as those two options do).  It probably would be
> preferable to use something like base/big_endian.h so that the data is simply
> read out right, rather than have separate steps for read-the-data and
> conditional-flip, though cache hierarchies on modern CPUs may make the
> difference moot.
> 
> Actually, if it's little-endian on the wire, then having a read like your
prior
> reader is probably best, since it's not conditional (it simply put the first
> byte in the low-order bits, and so on).

Noe -- can you please review this part of the code since you are most aware of
the server side implementation?

This is what my implementation does for now:
1. decode encoded_data field using Rice decoder.
2. Flip the bits of each output uint32_t value
3. Sort these uint32_t (since flipping the bits destroys sort order).
4. Flip the bits back and reinterpret them as bytes in a string.

Can you specifically comment on what parts of this are avoidable and what parts
are platform specific?

https://codereview.chromium.org/2228393003/diff/100001/components/safe_browsi...
components/safe_browsing_db/v4_rice.cc:249: // |current_word_|.
On 2016/08/15 21:18:15, Scott Hess wrote:
> On 2016/08/13 00:15:14, vakh wrote:
> > On 2016/08/12 23:25:46, Scott Hess wrote:
> > > I'm finding this confusing.  I _think_ what it's saying is that the entire
> > span
> > > of bits does not exist in current_word_.  But below it calls GetNextBits()
> to
> > > read num_bits_left_in_current_word, which _does_ exist in current_word_
(so
> I
> > > think it will always degenerate to GetBitsFromCurrentWord).  Then upper
has
> to
> > > be read via GetNextBits() to load a new word.  [Is the same possibly true
> for
> > > reading lower?  Or would num_bits_left_in_current_word be zero in that
> case?]
> > 
> > Let me illustrate with an example: Let's say we want to read 28 bits, but
> we've
> > already read 23 from current_word_. This means:
> > num_requested_bits = 28
> > current_word_bit_index_ = 23
> > num_bits_left_in_current_word = 9
> > 
> > We split the read operation into two parts: read 9 bits, then read 19 bits
for
> a
> > total of 28.
> > 
> > lower = GetNextBits(9)
> > => current_word_bit_index_ = 23 + 9 = 32.
> > 
> > upper = GetNextBits(19)
> > => but, current_word_bit_index_ == 32 (kMaxBitIndex)
> > => => Read next word into current_word_ and reset current_word_bit_index_ to
0
> > (see line 233).
> > => Read 19 bits from new current_word_ into upper.
> > => current_word_bit_index_ = 0 + 19 = 19
> 
> Yes, but the comment is written in terms of current_word_, which is not a
thing
> the local code is using.  Maybe try writing a new comment without reference to
> the current one, to see if a cleaner description can be made.  Like "Combine
the
> rest of current_word_ with additional bits from the next word."
> 
> I think the first call to GetNextBits() should actually be calling
> GetBitsFromCurrentWord() to eat the remainder of current_word_.

Done.

https://codereview.chromium.org/2228393003/diff/100001/components/safe_browsi...
File components/safe_browsing_db/v4_rice_unittest.cc (right):

https://codereview.chromium.org/2228393003/diff/100001/components/safe_browsi...
components/safe_browsing_db/v4_rice_unittest.cc:39: }
On 2016/08/15 21:18:15, Scott Hess wrote:
> On 2016/08/13 00:19:44, vakh wrote:
> > On 2016/08/12 23:25:47, Scott Hess wrote:
> > > There's an expectation below for the decoder not having additional values
-
> do
> > > you need an expectation for the decoder having enough values?  Or does
> > > DECODE_SUCCESS cover that?  Will that fire repeatedly if the decoder comes
> up
> > > short?  And clearly |word != expected| in case of decode failure.
> > > 
> > > [Perhaps ASSERT_EQ() would handle this case for the decode failure, if it
> can
> > be
> > > used in a helper function like this.]
> > 
> > That case returns DECODE_NO_MORE_ENTRIES_FAILURE so is covered.
> 
> It's going to return DECODE_NO_MORE_ENTRIES_FAILURE, but then have a cascade
of
> expectation failures.  So it might be better to use ASSERT_EQ for that case.

Done. Please let me know if you meant something else.

https://codereview.chromium.org/2228393003/diff/140001/components/safe_browsi...
File components/safe_browsing_db/v4_rice.cc (right):

https://codereview.chromium.org/2228393003/diff/140001/components/safe_browsi...
components/safe_browsing_db/v4_rice.cc:132: //
https://developers.google.com/safe-browsing/v4/compression
On 2016/08/15 21:18:15, Scott Hess wrote:
> The protocol says the data is in network byte order, it doesn't say that it
> needs to be flipped.  If the platform is little-endian, flipping would be
> incorrect.

My machine is little-endian and it does require byte flipping.

The encoded data (in the form of a string) passed by the server is read
bit-by-bit to produce the Rice-decoded values. I believe, that part of the
process is independent of the machine's endianness.

Once that decoder has produced a uint32_t value, the only way to match the
checksum is to flip those bits (at least on a little-endian machine). I am not
sure about the expected behavior on big-endian machines so I'd let Noe comment
on that.

FWIW, I know that the SafeBrowsing code in Android does this bit flipping
unconditionally.

https://codereview.chromium.org/2228393003/diff/140001/components/safe_browsi...
components/safe_browsing_db/v4_rice.cc:136: std::sort(out->begin(), out->end());
On 2016/08/15 21:18:15, Scott Hess wrote:
> Whoa!  I could see ASSERT_TRUE(std::is_sorted(...)) here, as if these aren't
> already sorted, I think something is fubar.

These aren't already sorted. That's because of the bit flipping I do just before
push_back.

Scott Hess - ex-Googler

On 2016/08/16 00:21:53, vakh wrote: > On 2016/08/15 21:19:38, Scott Hess wrote: > > On ...

4 years, 4 months ago (2016-08-16 00:36:13 UTC) #55

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 4 months ago (2016-08-16 01:00:04 UTC) #56

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: win_chromium_rel_ng on master.tryserver.chromium.win (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_rel_ng/builds/274046)

4 years, 4 months ago (2016-08-16 01:00:05 UTC) #57

vakh (use Gerrit instead)

On 2016/08/16 at 00:36:13, shess wrote: > On 2016/08/16 00:21:53, vakh wrote: > > On ...

4 years, 4 months ago (2016-08-16 01:33:28 UTC) #58

On 2016/08/16 at 00:36:13, shess wrote:
> On 2016/08/16 00:21:53, vakh wrote:
> > On 2016/08/15 21:19:38, Scott Hess wrote:
> > > On 2016/08/15 21:18:15, Scott Hess wrote:
> > > > It feels like there is something being forced in the flippedness area,
here.
> > 
> > > > Rice encoding should be happening on a series of fixed-size integers. 
So
> > the
> > > > client of the encoder/decoder shouldn't be doing any byte-swapping at
all -
> > > the
> > > > integers should go in and come out as integers, not as blobs.
> > > 
> > > Hmm, to clarify - what I mean is that the flipping should be happening
when
> > > encoding onto the wire and decoding off the wire.  If the client of the
> > > encoder/decoder is doing the flipping, that implies that something wrong
is
> > > happening.
> > 
> > Yes, I agree it does look strange.
> > [Rest is repition of what I have said elsewhere.]
> > But what's read off the wire is bits, not bytes. The flipping happens after
the
> > decoder has produced the uint32_t.
> 
> That doesn't really change anything, though.  A class which decodes a stream
of integers should generally decode directly to platform integers.  The sense of
whether the source data is little-endian or big-endian should be encapsulated
within the decoder/encoder, not exposed to the client code.
> 
> I mean, correct me if I'm wrong - my understanding of this is that what we
have is a base value plus a cumulative series of increments from that base
value.  If that is decoded wrong-endian, then the increments are going to be
entirely incorrect!
That's correct.

>  Instead of being a series of small integers which increment through a flat
distribution, you would have a series of giant integers with repeated overflow.
That's not possible since I use base::CheckedNumeric<uint32_t> to ensure that
overflows don't happen.

>  Which ... which under unsigned arithmetic might just result in the expected
output after byte-flipping, I'll have to think on that.
That's not possible (see above), but even if it were, that would be a big
coincidence. Especially, if it works correctly repeatedly.

> 
> I can't see where this could be going wrong in the code, though.  Is it
possible that the datasets you're using are actually encoded incorrectly?
I am testing this against production servers that have been serving lists to
Android users for months now so it shouldn't be wrong.

vakh (use Gerrit instead)

On 2016/08/16 at 01:33:28, vakh wrote: > On 2016/08/16 at 00:36:13, shess wrote: > > ...

4 years, 4 months ago (2016-08-16 19:00:07 UTC) #59

On 2016/08/16 at 01:33:28, vakh wrote:
> On 2016/08/16 at 00:36:13, shess wrote:
> > On 2016/08/16 00:21:53, vakh wrote:
> > > On 2016/08/15 21:19:38, Scott Hess wrote:
> > > > On 2016/08/15 21:18:15, Scott Hess wrote:
> > > > > It feels like there is something being forced in the flippedness area,
here.
> > > 
> > > > > Rice encoding should be happening on a series of fixed-size integers. 
So
> > > the
> > > > > client of the encoder/decoder shouldn't be doing any byte-swapping at
all -
> > > > the
> > > > > integers should go in and come out as integers, not as blobs.
> > > > 
> > > > Hmm, to clarify - what I mean is that the flipping should be happening
when
> > > > encoding onto the wire and decoding off the wire.  If the client of the
> > > > encoder/decoder is doing the flipping, that implies that something wrong
is
> > > > happening.
> > > 
> > > Yes, I agree it does look strange.
> > > [Rest is repition of what I have said elsewhere.]
> > > But what's read off the wire is bits, not bytes. The flipping happens
after the
> > > decoder has produced the uint32_t.
> > 
> > That doesn't really change anything, though.  A class which decodes a stream
of integers should generally decode directly to platform integers.  The sense of
whether the source data is little-endian or big-endian should be encapsulated
within the decoder/encoder, not exposed to the client code.
> > 
> > I mean, correct me if I'm wrong - my understanding of this is that what we
have is a base value plus a cumulative series of increments from that base
value.  If that is decoded wrong-endian, then the increments are going to be
entirely incorrect!
> That's correct.
> 
> >  Instead of being a series of small integers which increment through a flat
distribution, you would have a series of giant integers with repeated overflow.
> That's not possible since I use base::CheckedNumeric<uint32_t> to ensure that
overflows don't happen.
> 
> >  Which ... which under unsigned arithmetic might just result in the expected
output after byte-flipping, I'll have to think on that.
> That's not possible (see above), but even if it were, that would be a big
coincidence. Especially, if it works correctly repeatedly.
> 
> > 
> > I can't see where this could be going wrong in the code, though.  Is it
possible that the datasets you're using are actually encoded incorrectly?
> I am testing this against production servers that have been serving lists to
Android users for months now so it shouldn't be wrong.

[Noe has replied to the email separately mentioning why the bit flipping needs
to be done.]
Also, Chrome only supports little-endian architectures so this is not an issue.
See: https://cs.chromium.org/chromium/src/build/build_config.h?l=100

I propose two things:
1. Add a #error for big endian platforms and move ahead with the rest of the
implementation as-is?
2. Use the FlipBytes method as I had previously, since that's exactly what
htonl/ntohl methods do unconditionally anyway.
But with htonl/ntohl, I need to depend on an external header file that's
different across platforms.

vakh (use Gerrit instead)

Use FlipEndianness instead of ntohl/htonl. #error for big endian

4 years, 4 months ago (2016-08-17 00:41:00 UTC) #60

vakh (use Gerrit instead)

The CQ bit was checked by vakh@chromium.org to run a CQ dry run

4 years, 4 months ago (2016-08-17 00:41:03 UTC) #61

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2228393003/180001

4 years, 4 months ago (2016-08-17 00:41:38 UTC) #62

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 4 months ago (2016-08-17 01:22:18 UTC) #63

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

4 years, 4 months ago (2016-08-17 01:22:19 UTC) #64

Nathan Parker

https://codereview.chromium.org/2228393003/diff/200001/components/safe_browsing_db/v4_rice.cc File components/safe_browsing_db/v4_rice.cc (right): https://codereview.chromium.org/2228393003/diff/200001/components/safe_browsing_db/v4_rice.cc#newcode28 components/safe_browsing_db/v4_rice.cc:28: uint32_t FlipEndianness(uint32_t word) { I feel like there's no ...

4 years, 4 months ago (2016-08-17 20:57:27 UTC) #66

https://codereview.chromium.org/2228393003/diff/200001/components/safe_browsi...
File components/safe_browsing_db/v4_rice.cc (right):

https://codereview.chromium.org/2228393003/diff/200001/components/safe_browsi...
components/safe_browsing_db/v4_rice.cc:28: uint32_t FlipEndianness(uint32_t
word) {
I feel like there's no point in building in an endianness dependency unless it
makes the code faster or a lot cleaner.  What's wrong with htonl()?

https://codereview.chromium.org/2228393003/diff/200001/components/safe_browsi...
components/safe_browsing_db/v4_rice.cc:135: // This flipping is done so that the
decoded uint32 is interpreted
re-flow comment.
Make a comment that the ints need to be converted to 4-byte-strings as if they
were Big Endian ints, and Chrome runs only on Little Endian.

https://codereview.chromium.org/2228393003/diff/200001/components/safe_browsi...
components/safe_browsing_db/v4_rice.cc:149: (*out)[i] =
FlipEndianness((*out)[i]);
Wait, now I'm confused again.  You're flipping, sorting as ints, then flipping
again.

OH, I think I understand: You're sorting the array as integers, rather than as
4-byte-strings, but you want them to end up in lexigraphical order.  I _think_
you could avoid all the flipping by having sort() compare 4-byte strings instead
of ints.  I'm not sure if that'll be any faster comparing ints is pretty ideal. 
But the current code has an extra pass through the array.



If what I said make sense, you could add a comment that we might want to compare
my proposed method's performance, if performance here turns out to be an issue.


It might be more clear if you didn't use uint32_t's, but instead had something
like
  typedef uint32_t Hash32Bits
  std::vector<Hash32Bits>* out

Since really once you've done the Rice decoding, you no longer treat them as
ints (and shouldn't).

https://codereview.chromium.org/2228393003/diff/200001/components/safe_browsi...
components/safe_browsing_db/v4_rice.cc:269: void
V4RiceDecoder::GetBitsFromCurrentWord(unsigned int num_requested_bits,
Could this just return x?

vakh (use Gerrit instead)

Use htonl. Return from GetBitsFromCurrentWord

4 years, 4 months ago (2016-08-17 22:10:36 UTC) #67

vakh (use Gerrit instead)

https://codereview.chromium.org/2228393003/diff/200001/components/safe_browsing_db/v4_rice.cc File components/safe_browsing_db/v4_rice.cc (right): https://codereview.chromium.org/2228393003/diff/200001/components/safe_browsing_db/v4_rice.cc#newcode28 components/safe_browsing_db/v4_rice.cc:28: uint32_t FlipEndianness(uint32_t word) { On 2016/08/17 at 20:57:27, Nathan ...

4 years, 4 months ago (2016-08-17 22:10:41 UTC) #68

https://codereview.chromium.org/2228393003/diff/200001/components/safe_browsi...
File components/safe_browsing_db/v4_rice.cc (right):

https://codereview.chromium.org/2228393003/diff/200001/components/safe_browsi...
components/safe_browsing_db/v4_rice.cc:28: uint32_t FlipEndianness(uint32_t
word) {
On 2016/08/17 at 20:57:27, Nathan Parker wrote:
> I feel like there's no point in building in an endianness dependency unless it
makes the code faster or a lot cleaner.  What's wrong with htonl()?

[I don't feel strongly about it bit have a slight preference so I changed the
code.]
There are two parts to this statement:
1. endianness dependency: this is irrelevant since Chrome has only been
supported on little-endian platforms for the longest time. I am not sure if this
code would work correctly on big endian machines so I ifdef'd it out.
2. Nothing wrong with htonl -- It is defined in different header files on
different plaforms so defining a simple bit-flipping method that does the same
thing seemed cleaner.

https://codereview.chromium.org/2228393003/diff/200001/components/safe_browsi...
components/safe_browsing_db/v4_rice.cc:135: // This flipping is done so that the
decoded uint32 is interpreted
On 2016/08/17 at 20:57:27, Nathan Parker wrote:
> re-flow comment.
> Make a comment that the ints need to be converted to 4-byte-strings as if they
were Big Endian ints, and Chrome runs only on Little Endian.

Please see my other comment about this below.

https://codereview.chromium.org/2228393003/diff/200001/components/safe_browsi...
components/safe_browsing_db/v4_rice.cc:149: (*out)[i] =
FlipEndianness((*out)[i]);
On 2016/08/17 at 20:57:27, Nathan Parker wrote:
> Wait, now I'm confused again.  You're flipping, sorting as ints, then flipping
again.
> 
> OH, I think I understand: You're sorting the array as integers, rather than as
4-byte-strings, but you want them to end up in lexigraphical order.  I _think_
you could avoid all the flipping by having sort() compare 4-byte strings instead
of ints.  I'm not sure if that'll be any faster comparing ints is pretty ideal. 
But the current code has an extra pass through the array.
> 
> 
> 
> If what I said make sense, you could add a comment that we might want to
compare my proposed method's performance, if performance here turns out to be an
issue.
> 
> 
> It might be more clear if you didn't use uint32_t's, but instead had something
like
>   typedef uint32_t Hash32Bits
>   std::vector<Hash32Bits>* out
> 
> Since really once you've done the Rice decoding, you no longer treat them as
ints (and shouldn't).

I think the explanation goes like this. The following program, on a
little-endian machine produces:

  std::vector<uint32_t> v2;
  v2.reserve(3);
  v2.push_back(0x41424344);
  v2.push_back(0);
  cptr = reinterpret_cast<char*>(v2.data());
  std::cout << "as string: " << cptr << std::endl;

Output:
as string: DCBA

This shows that when a string (which is what the server meant to send) as
re-interpreted as a uint32_t, it reverses the order of the bits.

That's why we need to do this whole bit flipping.

The server sends a response.
Parse the response as uint32_t values.
Flip bytes to reinterpret them correctly.
Sort them to get the correct lexicographical order
Flip the bytes again, since they need to be consumed as bytes not uint32_t
values.

I think the current comment describes this process fairly accurately but I am
super willing to consider suggestions.

vakh (use Gerrit instead)

The CQ bit was checked by vakh@chromium.org to run a CQ dry run

4 years, 4 months ago (2016-08-17 22:10:42 UTC) #69

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2228393003/220001

4 years, 4 months ago (2016-08-17 22:11:16 UTC) #70

Scott Hess - ex-Googler

I think I get the situation this is dealing with. I think it was an ...

4 years, 4 months ago (2016-08-17 22:42:27 UTC) #71

I think I get the situation this is dealing with.  I think it was an unfortunate
thing for the server data to have a confusing ordering dependency baked in that
way.

https://codereview.chromium.org/2228393003/diff/200001/components/safe_browsi...
File components/safe_browsing_db/v4_rice.cc (right):

https://codereview.chromium.org/2228393003/diff/200001/components/safe_browsi...
components/safe_browsing_db/v4_rice.cc:28: uint32_t FlipEndianness(uint32_t
word) {
On 2016/08/17 20:57:27, Nathan Parker wrote:
> I feel like there's no point in building in an endianness dependency unless it
> makes the code faster or a lot cleaner.  What's wrong with htonl()? 

I agree with this, if it's reasonable to just do it right, do it right.  At this
point I wouldn't worry excessively about performance, because there will
probably be bigger fish to fry in here.

That said, I also support the "If it's not little endian, break the compile"
strategy.  I'm slightly nervous about doing it as macro checks, rather than some
sort of compiler-generated check.

https://codereview.chromium.org/2228393003/diff/200001/components/safe_browsi...
components/safe_browsing_db/v4_rice.cc:107: std::vector<uint32_t>* out) {
"DecodeBytes" decodes to a vector of uint32_t.  Maybe "DecodePrefixes" or
something like that.

Also this heavily overlaps the DecodeIntegers() function, other than the
ordering constraints.  I couldn't see an obvious way to steal the underlying
storage from the RepeatedField, unfortunately.

https://codereview.chromium.org/2228393003/diff/200001/components/safe_browsi...
components/safe_browsing_db/v4_rice.cc:148: for (unsigned i = 0; i <
out->size(); i++) {
size_t rather than unsigned, here.

https://codereview.chromium.org/2228393003/diff/220001/components/safe_browsi...
File components/safe_browsing_db/v4_rice.cc (right):

https://codereview.chromium.org/2228393003/diff/220001/components/safe_browsi...
components/safe_browsing_db/v4_rice.cc:26: #endif
Is this still necessary now that it's using htonl?

https://codereview.chromium.org/2228393003/diff/220001/components/safe_browsi...
components/safe_browsing_db/v4_rice.cc:145: std::sort(out->begin(), out->end());
I think at least part of the confusion is that there's a sorting constraint
which is being explained from the bottom up (comments probably added as you
discovered things), and maybe it would be cleaner if the header comment for this
function described how the results should be sorted.

AFAICT, the sort order is because someone elsewhere is going to interpret the
raw vector::data() bytes.  I'm not making a value judgement on _that_, I'm just
saying that if you're going to bake that assumption in here, then you need to
describe it because it's a completely non-obvious requirement for someone
looking at this code.

That said ... I'm also not entirely convinced that this should return a vector
of integers, given that there's this odd sorting constraint and someone else is
accessing it as raw data.  If the only reason for the vector is for unit tests,
then IMHO it would be reasonable for the unit tests to decode/flip things for
purposes of testing.

https://codereview.chromium.org/2228393003/diff/220001/components/safe_browsi...
File components/safe_browsing_db/v4_rice_unittest.cc (right):

https://codereview.chromium.org/2228393003/diff/220001/components/safe_browsi...
components/safe_browsing_db/v4_rice_unittest.cc:253: for (unsigned i = 0; i <
expected.size(); i++) {
size_t here.

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 4 months ago (2016-08-17 22:56:49 UTC) #72

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

4 years, 4 months ago (2016-08-17 22:56:50 UTC) #73

vakh (use Gerrit instead)

DecodeBytes->DecodePrefixes. More detailed comment.

4 years, 4 months ago (2016-08-17 23:30:18 UTC) #74

vakh (use Gerrit instead)

The CQ bit was checked by vakh@chromium.org to run a CQ dry run

4 years, 4 months ago (2016-08-17 23:30:21 UTC) #75

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2228393003/240001

4 years, 4 months ago (2016-08-17 23:30:45 UTC) #76

vakh (use Gerrit instead)

https://codereview.chromium.org/2228393003/diff/200001/components/safe_browsing_db/v4_rice.cc File components/safe_browsing_db/v4_rice.cc (right): https://codereview.chromium.org/2228393003/diff/200001/components/safe_browsing_db/v4_rice.cc#newcode28 components/safe_browsing_db/v4_rice.cc:28: uint32_t FlipEndianness(uint32_t word) { On 2016/08/17 at 22:42:26, Scott ...

4 years, 4 months ago (2016-08-17 23:40:08 UTC) #77

https://codereview.chromium.org/2228393003/diff/200001/components/safe_browsi...
File components/safe_browsing_db/v4_rice.cc (right):

https://codereview.chromium.org/2228393003/diff/200001/components/safe_browsi...
components/safe_browsing_db/v4_rice.cc:28: uint32_t FlipEndianness(uint32_t
word) {
On 2016/08/17 at 22:42:26, Scott Hess wrote:
> On 2016/08/17 20:57:27, Nathan Parker wrote:
> > I feel like there's no point in building in an endianness dependency unless
it
> > makes the code faster or a lot cleaner.  What's wrong with htonl()? 
> 
> I agree with this, if it's reasonable to just do it right, do it right.  At
this point I wouldn't worry excessively about performance, because there will
probably be bigger fish to fry in here.
> 
Done.

> That said, I also support the "If it's not little endian, break the compile"
strategy.  I'm slightly nervous about doing it as macro checks, rather than some
sort of compiler-generated check.

I really think that if it should fail, it should fail spectacularly.

https://codereview.chromium.org/2228393003/diff/200001/components/safe_browsi...
components/safe_browsing_db/v4_rice.cc:107: std::vector<uint32_t>* out) {
On 2016/08/17 at 22:42:26, Scott Hess wrote:
> "DecodeBytes" decodes to a vector of uint32_t.  Maybe "DecodePrefixes" or
something like that.
> 
Done.

> Also this heavily overlaps the DecodeIntegers() function, other than the
ordering constraints.  I couldn't see an obvious way to steal the underlying
storage from the RepeatedField, unfortunately.

Hmm.

https://codereview.chromium.org/2228393003/diff/200001/components/safe_browsi...
components/safe_browsing_db/v4_rice.cc:148: for (unsigned i = 0; i <
out->size(); i++) {
On 2016/08/17 at 22:42:26, Scott Hess wrote:
> size_t rather than unsigned, here.

Done.

https://codereview.chromium.org/2228393003/diff/220001/components/safe_browsi...
File components/safe_browsing_db/v4_rice.cc (right):

https://codereview.chromium.org/2228393003/diff/220001/components/safe_browsi...
components/safe_browsing_db/v4_rice.cc:26: #endif
On 2016/08/17 at 22:42:27, Scott Hess wrote:
> Is this still necessary now that it's using htonl?

To be frank, I won't be 100% sure until I observe it.
I'd much rather it fails loud and then someone can make it work, than it fails
silently and no one notices.

https://codereview.chromium.org/2228393003/diff/220001/components/safe_browsi...
components/safe_browsing_db/v4_rice.cc:145: std::sort(out->begin(), out->end());
On 2016/08/17 at 22:42:26, Scott Hess wrote:
> I think at least part of the confusion is that there's a sorting constraint
which is being explained from the bottom up (comments probably added as you
discovered things), and maybe it would be cleaner if the header comment for this
function described how the results should be sorted.
> 
Done. Let me know if that isn't clear enough for a new reader.

> AFAICT, the sort order is because someone elsewhere is going to interpret the
raw vector::data() bytes.  I'm not making a value judgement on _that_, I'm just
saying that if you're going to bake that assumption in here, then you need to
describe it because it's a completely non-obvious requirement for someone
looking at this code.
> 
Done.

> That said ... I'm also not entirely convinced that this should return a vector
of integers, given that there's this odd sorting constraint and someone else is
accessing it as raw data.  If the only reason for the vector is for unit tests,
then IMHO it would be reasonable for the unit tests to decode/flip things for
purposes of testing.

The reason it returns a vector is that I am trying to interpret the same memory
block some times as std::vector<uint32_t> and other times as std::string.

The other way to do this would be to create a std::string, reserve its size
appropriately, and then re-interpret that block of memory as vector<uint32_t>.
The problem with that approach is that string::data() returns a const char *
pointer, so I cannot modify the memory underneath that string.
Of course I could cast away the constness but that's probably not a good idea
either. So I am between a rock and a hard place.

https://codereview.chromium.org/2228393003/diff/220001/components/safe_browsi...
File components/safe_browsing_db/v4_rice_unittest.cc (right):

https://codereview.chromium.org/2228393003/diff/220001/components/safe_browsi...
components/safe_browsing_db/v4_rice_unittest.cc:253: for (unsigned i = 0; i <
expected.size(); i++) {
On 2016/08/17 at 22:42:27, Scott Hess wrote:
> size_t here.

Done.

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 4 months ago (2016-08-18 01:08:45 UTC) #78

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

4 years, 4 months ago (2016-08-18 01:08:46 UTC) #79

vakh (use Gerrit instead)

Since most of the major points have been addressed, I am going to commit this ...

4 years, 4 months ago (2016-08-19 00:36:16 UTC) #81

vakh (use Gerrit instead)

The CQ bit was checked by vakh@chromium.org

4 years, 4 months ago (2016-08-19 00:36:23 UTC) #82

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2228393003/240001

4 years, 4 months ago (2016-08-19 00:37:11 UTC) #83

commit-bot: I haz the power

Description was changed from ========== PVer4: DecodeHashes needs to sort the output of the Rice ...

4 years, 4 months ago (2016-08-19 02:18:35 UTC) #84

commit-bot: I haz the power

Description was changed from ========== PVer4: DecodeHashes needs to sort the output of the Rice ...

4 years, 4 months ago (2016-08-19 02:20:33 UTC) #86

commit-bot: I haz the power

4 years, 4 months ago (2016-08-19 02:20:34 UTC) #87

Message was sent while issue was closed.

Patchset 10 (id:??) landed as
https://crrev.com/1670eb02e8172d1750b70c50dbe3fb968e1f365f
Cr-Commit-Position: refs/heads/master@{#413027}

Issue 2228393003: PVer4: DecodeHashes needs to sort the output of the Rice decoder (Closed)

Description

Patch Set 1 : PVer4: DecodeHashes needs to sort the output of the Rice decoder #

Patch Set 2 : Fix the test to expect sorted list of hashes from DecodeBytes #

Patch Set 3 : Fix the corresponding test in v4_store_unittest.cc also #

Patch Set 4 : Remove DVLOGs. #include logging.h #

Patch Set 5 : Use a vector to decode and sort uint32_t values. re-interpret the memory of that vector as raw hash… #

Patch Set 6 : Use htonl/ntohl instead of FlipEndianness (but the effect is the same). Use GetBitsFromCurrentWord … #

Patch Set 7 : Use FlipEndianness instead of ntohl/htonl. #error for big endian #

Patch Set 8 : rebase #

Patch Set 9 : Use htonl. Return from GetBitsFromCurrentWord #

Patch Set 10 : DecodeBytes->DecodePrefixes. More detailed comment. #

Messages