Issue 1070903003: Optimize BitStreamReader, speed up EV Whitelist loading

ian fette

5 years, 8 months ago (2015-04-08 22:49:07 UTC) #1

Ryan Sleevi

rsleevi@chromium.org changed reviewers: + eranm@chromium.org

5 years, 8 months ago (2015-04-09 01:33:56 UTC) #2

Ryan Sleevi

As I shared on the bug, I don't think this is a high-priority CL. If ...

5 years, 8 months ago (2015-04-09 01:33:57 UTC) #3

ian fette

https://codereview.chromium.org/1070903003/diff/1/components/packed_ct_ev_whitelist/bit_stream_reader.cc File components/packed_ct_ev_whitelist/bit_stream_reader.cc (right): https://codereview.chromium.org/1070903003/diff/1/components/packed_ct_ev_whitelist/bit_stream_reader.cc#newcode41 components/packed_ct_ev_whitelist/bit_stream_reader.cc:41: << (num_bits - 8); On 2015/04/09 01:33:57, Ryan Sleevi ...

5 years, 8 months ago (2015-04-09 02:43:06 UTC) #4

https://codereview.chromium.org/1070903003/diff/1/components/packed_ct_ev_whi...
File components/packed_ct_ev_whitelist/bit_stream_reader.cc (right):

https://codereview.chromium.org/1070903003/diff/1/components/packed_ct_ev_whi...
components/packed_ct_ev_whitelist/bit_stream_reader.cc:41: << (num_bits - 8);
On 2015/04/09 01:33:57, Ryan Sleevi wrote:
> I'm not terribly keen on this, which duplicates the logic of ReadBit()
> 
> Seems like there should be an internal ReadByte helper that's responsible for
> checking pre-conditions and updating current_byte_

One important precondition is whether the stream is currently byte-aligned or
not. Hence the first for loop looking at current_bit_. I'm not sure if there's a
good way to avoid duplicating that logic, as even if you had a ReadByte() helper
you wouldn't want to call it unless you were byte-aligned, otherwise you're
going to just end up reading bit-by-bit. Could add a bool IsByteAligned() helper
if you want to avoid touching current_bit_ but it seems a bit overkill IMO.

The second for loop does indeed mirror some of the logic of ReadBit (namely,
adjusting current_byte as you mention, and touching source_.data() directly).
This could be moved off into a helper function, which would hopefully get
inlined, but then we're going to have to decide whether that helper function
needs to check that it's byte aligned first. I presume it would (unless it were
private?) at which point we're double checking preconditions. Not end of the
world but the whole point was to improve the performance here :) Those
preconditions would also get checked then 8 times for a 64-bit read (common). 

Personally, I'd prefer to leave it in-line there, but if you want it in a
separate function, I'd propose making it
a) private
b) Only DCHECK() the preconditions (that current_bit_ == 7 and BitsLeft() >= 8)
c) return a uint8_t and update current_byte_ 

The only downside would be if someone expects ReadByte() to work regardless of
whether the stream is currently byte-aligned (or at the very least to check
whether that condition is met). Could either handle both cases (in which case
you're having an extra switch statement run 8 times for a 64-bit read) or just
document that being byte-aligned is a precondition.

Thoughts?

Ryan Sleevi

https://codereview.chromium.org/1070903003/diff/1/components/packed_ct_ev_whitelist/bit_stream_reader.cc File components/packed_ct_ev_whitelist/bit_stream_reader.cc (right): https://codereview.chromium.org/1070903003/diff/1/components/packed_ct_ev_whitelist/bit_stream_reader.cc#newcode41 components/packed_ct_ev_whitelist/bit_stream_reader.cc:41: << (num_bits - 8); On 2015/04/09 02:43:06, ian fette ...

5 years, 8 months ago (2015-04-09 02:57:16 UTC) #5

https://codereview.chromium.org/1070903003/diff/1/components/packed_ct_ev_whi...
File components/packed_ct_ev_whitelist/bit_stream_reader.cc (right):

https://codereview.chromium.org/1070903003/diff/1/components/packed_ct_ev_whi...
components/packed_ct_ev_whitelist/bit_stream_reader.cc:41: << (num_bits - 8);
On 2015/04/09 02:43:06, ian fette wrote:
> The second for loop does indeed mirror some of the logic of ReadBit (namely,
> adjusting current_byte as you mention, and touching source_.data() directly).
> This could be moved off into a helper function, which would hopefully get
> inlined, but then we're going to have to decide whether that helper function
> needs to check that it's byte aligned first. I presume it would (unless it
were
> private?) at which point we're double checking preconditions. Not end of the
> world but the whole point was to improve the performance here :) Those
> preconditions would also get checked then 8 times for a 64-bit read (common). 

Right, I was just talking in the context of the second loop (e.g. precondition
is that things are aligned). The first loop uses ReadBit(), such that
current_bit_/current_byte_ are always set.

I'm not worried about the precondition cost - DCHECK optimizes out in official
builds (and has for a long time) - it's just to guard against developer mistake,
which would happen if someone adjusted the first loop without adjusting the
second.

> Personally, I'd prefer to leave it in-line there, but if you want it in a
> separate function, I'd propose making it
> a) private
> b) Only DCHECK() the preconditions (that current_bit_ == 7 and BitsLeft() >=
8)
> c) return a uint8_t and update current_byte_ 

Yup, exactly what I was trying to suggest.

> just document that being byte-aligned is a precondition.

Exactly this.

From a design perspective, my grumpiness towards the current approach came from
the fact that ReadBits() didn't mutate internal state directly, but instead
delegated through ReadBit, so I wasn't keen to spread the mutation 'up'.

There's definitely room for 'cleverness', given the read patters (read 64 bits,
read 0-7 bits, read 47 bits, read 0-7 bits, read 47 bits, etc) where a
copy-mask-shift approach at the 128-bit level (so we can span the two 64-bits)
would totally work, and might yield better loop unrolling.

But that does feel like extreme microptimizations for the task, given the
discussion on the bug.

ian fette

PTAL https://codereview.chromium.org/1070903003/diff/1/components/packed_ct_ev_whitelist/bit_stream_reader.cc File components/packed_ct_ev_whitelist/bit_stream_reader.cc (left): https://codereview.chromium.org/1070903003/diff/1/components/packed_ct_ev_whitelist/bit_stream_reader.cc#oldcode10 components/packed_ct_ev_whitelist/bit_stream_reader.cc:10: On 2015/04/09 01:33:57, Ryan Sleevi wrote: > Don't ...

5 years, 8 months ago (2015-04-09 16:22:14 UTC) #6

PTAL

https://codereview.chromium.org/1070903003/diff/1/components/packed_ct_ev_whi...
File components/packed_ct_ev_whitelist/bit_stream_reader.cc (left):

https://codereview.chromium.org/1070903003/diff/1/components/packed_ct_ev_whi...
components/packed_ct_ev_whitelist/bit_stream_reader.cc:10: 
On 2015/04/09 01:33:57, Ryan Sleevi wrote:
> Don't delete this newline

Done.

https://codereview.chromium.org/1070903003/diff/1/components/packed_ct_ev_whi...
File components/packed_ct_ev_whitelist/bit_stream_reader.cc (right):

https://codereview.chromium.org/1070903003/diff/1/components/packed_ct_ev_whi...
components/packed_ct_ev_whitelist/bit_stream_reader.cc:41: << (num_bits - 8);
On 2015/04/09 02:57:16, Ryan Sleevi wrote:
> On 2015/04/09 02:43:06, ian fette wrote:
> > The second for loop does indeed mirror some of the logic of ReadBit (namely,
> > adjusting current_byte as you mention, and touching source_.data()
directly).
> > This could be moved off into a helper function, which would hopefully get
> > inlined, but then we're going to have to decide whether that helper function
> > needs to check that it's byte aligned first. I presume it would (unless it
> were
> > private?) at which point we're double checking preconditions. Not end of the
> > world but the whole point was to improve the performance here :) Those
> > preconditions would also get checked then 8 times for a 64-bit read
(common). 
> 
> Right, I was just talking in the context of the second loop (e.g. precondition
> is that things are aligned). The first loop uses ReadBit(), such that
> current_bit_/current_byte_ are always set.
> 
> I'm not worried about the precondition cost - DCHECK optimizes out in official
> builds (and has for a long time) - it's just to guard against developer
mistake,
> which would happen if someone adjusted the first loop without adjusting the
> second.
> 
> > Personally, I'd prefer to leave it in-line there, but if you want it in a
> > separate function, I'd propose making it
> > a) private
> > b) Only DCHECK() the preconditions (that current_bit_ == 7 and BitsLeft() >=
> 8)
> > c) return a uint8_t and update current_byte_ 
> 
> Yup, exactly what I was trying to suggest.
> 
> > just document that being byte-aligned is a precondition.
> 
> Exactly this.
> 
> From a design perspective, my grumpiness towards the current approach came
from
> the fact that ReadBits() didn't mutate internal state directly, but instead
> delegated through ReadBit, so I wasn't keen to spread the mutation 'up'.
> 
> There's definitely room for 'cleverness', given the read patters (read 64
bits,
> read 0-7 bits, read 47 bits, read 0-7 bits, read 47 bits, etc) where a
> copy-mask-shift approach at the 128-bit level (so we can span the two 64-bits)
> would totally work, and might yield better loop unrolling.
> 
> But that does feel like extreme microptimizations for the task, given the
> discussion on the bug.

Done.

ian fette

Can I get an LGTM? I made your requested changes, seems pretty straightforward. Local results ...

5 years, 8 months ago (2015-04-14 18:18:39 UTC) #7

Can I get an LGTM? I made your requested changes, seems pretty straightforward.
Local results showed a good speedup (~65% which included the I/O time), I'd like
to land and see what happens in the UMA data.

On 2015/04/09 16:22:14, ian fette wrote:
> PTAL
> 
>
https://codereview.chromium.org/1070903003/diff/1/components/packed_ct_ev_whi...
> File components/packed_ct_ev_whitelist/bit_stream_reader.cc (left):
> 
>
https://codereview.chromium.org/1070903003/diff/1/components/packed_ct_ev_whi...
> components/packed_ct_ev_whitelist/bit_stream_reader.cc:10: 
> On 2015/04/09 01:33:57, Ryan Sleevi wrote:
> > Don't delete this newline
> 
> Done.
> 
>
https://codereview.chromium.org/1070903003/diff/1/components/packed_ct_ev_whi...
> File components/packed_ct_ev_whitelist/bit_stream_reader.cc (right):
> 
>
https://codereview.chromium.org/1070903003/diff/1/components/packed_ct_ev_whi...
> components/packed_ct_ev_whitelist/bit_stream_reader.cc:41: << (num_bits - 8);
> On 2015/04/09 02:57:16, Ryan Sleevi wrote:
> > On 2015/04/09 02:43:06, ian fette wrote:
> > > The second for loop does indeed mirror some of the logic of ReadBit
(namely,
> > > adjusting current_byte as you mention, and touching source_.data()
> directly).
> > > This could be moved off into a helper function, which would hopefully get
> > > inlined, but then we're going to have to decide whether that helper
function
> > > needs to check that it's byte aligned first. I presume it would (unless it
> > were
> > > private?) at which point we're double checking preconditions. Not end of
the
> > > world but the whole point was to improve the performance here :) Those
> > > preconditions would also get checked then 8 times for a 64-bit read
> (common). 
> > 
> > Right, I was just talking in the context of the second loop (e.g.
precondition
> > is that things are aligned). The first loop uses ReadBit(), such that
> > current_bit_/current_byte_ are always set.
> > 
> > I'm not worried about the precondition cost - DCHECK optimizes out in
official
> > builds (and has for a long time) - it's just to guard against developer
> mistake,
> > which would happen if someone adjusted the first loop without adjusting the
> > second.
> > 
> > > Personally, I'd prefer to leave it in-line there, but if you want it in a
> > > separate function, I'd propose making it
> > > a) private
> > > b) Only DCHECK() the preconditions (that current_bit_ == 7 and BitsLeft()
>=
> > 8)
> > > c) return a uint8_t and update current_byte_ 
> > 
> > Yup, exactly what I was trying to suggest.
> > 
> > > just document that being byte-aligned is a precondition.
> > 
> > Exactly this.
> > 
> > From a design perspective, my grumpiness towards the current approach came
> from
> > the fact that ReadBits() didn't mutate internal state directly, but instead
> > delegated through ReadBit, so I wasn't keen to spread the mutation 'up'.
> > 
> > There's definitely room for 'cleverness', given the read patters (read 64
> bits,
> > read 0-7 bits, read 47 bits, read 0-7 bits, read 47 bits, etc) where a
> > copy-mask-shift approach at the 128-bit level (so we can span the two
64-bits)
> > would totally work, and might yield better loop unrolling.
> > 
> > But that does feel like extreme microptimizations for the task, given the
> > discussion on the bug.
> 
> Done.

ian fette

ian@chromium.org changed reviewers: + jam@chromium.org - rsleevi@chromium.org

5 years, 8 months ago (2015-04-15 18:10:33 UTC) #8

ian fette

Comments were addressed, not getting a response, changing reviewer. On 2015/04/14 18:18:39, ian fette wrote: ...

5 years, 8 months ago (2015-04-15 18:10:58 UTC) #9

Comments were addressed, not getting a response, changing reviewer.

On 2015/04/14 18:18:39, ian fette wrote:
> Can I get an LGTM? I made your requested changes, seems pretty
straightforward.
> Local results showed a good speedup (~65% which included the I/O time), I'd
like
> to land and see what happens in the UMA data.
> 
> On 2015/04/09 16:22:14, ian fette wrote:
> > PTAL
> > 
> >
>
https://codereview.chromium.org/1070903003/diff/1/components/packed_ct_ev_whi...
> > File components/packed_ct_ev_whitelist/bit_stream_reader.cc (left):
> > 
> >
>
https://codereview.chromium.org/1070903003/diff/1/components/packed_ct_ev_whi...
> > components/packed_ct_ev_whitelist/bit_stream_reader.cc:10: 
> > On 2015/04/09 01:33:57, Ryan Sleevi wrote:
> > > Don't delete this newline
> > 
> > Done.
> > 
> >
>
https://codereview.chromium.org/1070903003/diff/1/components/packed_ct_ev_whi...
> > File components/packed_ct_ev_whitelist/bit_stream_reader.cc (right):
> > 
> >
>
https://codereview.chromium.org/1070903003/diff/1/components/packed_ct_ev_whi...
> > components/packed_ct_ev_whitelist/bit_stream_reader.cc:41: << (num_bits -
8);
> > On 2015/04/09 02:57:16, Ryan Sleevi wrote:
> > > On 2015/04/09 02:43:06, ian fette wrote:
> > > > The second for loop does indeed mirror some of the logic of ReadBit
> (namely,
> > > > adjusting current_byte as you mention, and touching source_.data()
> > directly).
> > > > This could be moved off into a helper function, which would hopefully
get
> > > > inlined, but then we're going to have to decide whether that helper
> function
> > > > needs to check that it's byte aligned first. I presume it would (unless
it
> > > were
> > > > private?) at which point we're double checking preconditions. Not end of
> the
> > > > world but the whole point was to improve the performance here :) Those
> > > > preconditions would also get checked then 8 times for a 64-bit read
> > (common). 
> > > 
> > > Right, I was just talking in the context of the second loop (e.g.
> precondition
> > > is that things are aligned). The first loop uses ReadBit(), such that
> > > current_bit_/current_byte_ are always set.
> > > 
> > > I'm not worried about the precondition cost - DCHECK optimizes out in
> official
> > > builds (and has for a long time) - it's just to guard against developer
> > mistake,
> > > which would happen if someone adjusted the first loop without adjusting
the
> > > second.
> > > 
> > > > Personally, I'd prefer to leave it in-line there, but if you want it in
a
> > > > separate function, I'd propose making it
> > > > a) private
> > > > b) Only DCHECK() the preconditions (that current_bit_ == 7 and
BitsLeft()
> >=
> > > 8)
> > > > c) return a uint8_t and update current_byte_ 
> > > 
> > > Yup, exactly what I was trying to suggest.
> > > 
> > > > just document that being byte-aligned is a precondition.
> > > 
> > > Exactly this.
> > > 
> > > From a design perspective, my grumpiness towards the current approach came
> > from
> > > the fact that ReadBits() didn't mutate internal state directly, but
instead
> > > delegated through ReadBit, so I wasn't keen to spread the mutation 'up'.
> > > 
> > > There's definitely room for 'cleverness', given the read patters (read 64
> > bits,
> > > read 0-7 bits, read 47 bits, read 0-7 bits, read 47 bits, etc) where a
> > > copy-mask-shift approach at the 128-bit level (so we can span the two
> 64-bits)
> > > would totally work, and might yield better loop unrolling.
> > > 
> > > But that does feel like extreme microptimizations for the task, given the
> > > discussion on the bug.
> > 
> > Done.

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1070903003/40001

5 years, 8 months ago (2015-04-15 18:12:10 UTC) #14

commit-bot: I haz the power

5 years, 8 months ago (2015-04-15 19:16:06 UTC) #16

Message was sent while issue was closed.

Patchset 3 (id:??) landed as
https://crrev.com/ca0d5d46c95fa5e8d5c89a979baba0c5b6bbefd0
Cr-Commit-Position: refs/heads/master@{#325286}

Issue 1070903003: Optimize BitStreamReader, speed up EV Whitelist loading (Closed)

Description

Patch Set 1 #

Patch Set 2 : Move logic from ReadBits to ReadByte helper #

Patch Set 3 : Remove unnecessary &0xff - forgot uint8_t is exactly 1 byte not at least 1 byte #

Messages