|
|
DescriptionOptimize BitStreamReader, speed up EV Whitelist loading
The EV Certificate Whitelist loads ~1MB of data using BitStreamReader.
Currently this is read in one bit at a time. Changing BitStreamReader to
support byte reads sped up loading by about 65% on my computer.
BUG=457949
TEST=bit_stream_reader_unittest.cc
Committed: https://crrev.com/ca0d5d46c95fa5e8d5c89a979baba0c5b6bbefd0
Cr-Commit-Position: refs/heads/master@{#325286}
Patch Set 1 #
Total comments: 6
Patch Set 2 : Move logic from ReadBits to ReadByte helper #Patch Set 3 : Remove unnecessary &0xff - forgot uint8_t is exactly 1 byte not at least 1 byte #
Messages
Total messages: 16 (5 generated)
rsleevi@chromium.org changed reviewers: + eranm@chromium.org
As I shared on the bug, I don't think this is a high-priority CL. If it is, this isn't the right lever for it. That said, I'm going to take a day or two to see if I can't think of anything cleverer that might optimize better. Adding Eran as well There's also the note that the Official builds use different optimization settings than Release builds (c.f. https://code.google.com/p/chromium/codesearch#chromium/src/build/common.gypi&... ), but I mean, /O2 should be enough for anybody. https://codereview.chromium.org/1070903003/diff/1/components/packed_ct_ev_whi... File components/packed_ct_ev_whitelist/bit_stream_reader.cc (left): https://codereview.chromium.org/1070903003/diff/1/components/packed_ct_ev_whi... components/packed_ct_ev_whitelist/bit_stream_reader.cc:10: Don't delete this newline https://codereview.chromium.org/1070903003/diff/1/components/packed_ct_ev_whi... File components/packed_ct_ev_whitelist/bit_stream_reader.cc (right): https://codereview.chromium.org/1070903003/diff/1/components/packed_ct_ev_whi... components/packed_ct_ev_whitelist/bit_stream_reader.cc:41: << (num_bits - 8); I'm not terribly keen on this, which duplicates the logic of ReadBit() Seems like there should be an internal ReadByte helper that's responsible for checking pre-conditions and updating current_byte_
https://codereview.chromium.org/1070903003/diff/1/components/packed_ct_ev_whi... File components/packed_ct_ev_whitelist/bit_stream_reader.cc (right): https://codereview.chromium.org/1070903003/diff/1/components/packed_ct_ev_whi... components/packed_ct_ev_whitelist/bit_stream_reader.cc:41: << (num_bits - 8); On 2015/04/09 01:33:57, Ryan Sleevi wrote: > I'm not terribly keen on this, which duplicates the logic of ReadBit() > > Seems like there should be an internal ReadByte helper that's responsible for > checking pre-conditions and updating current_byte_ One important precondition is whether the stream is currently byte-aligned or not. Hence the first for loop looking at current_bit_. I'm not sure if there's a good way to avoid duplicating that logic, as even if you had a ReadByte() helper you wouldn't want to call it unless you were byte-aligned, otherwise you're going to just end up reading bit-by-bit. Could add a bool IsByteAligned() helper if you want to avoid touching current_bit_ but it seems a bit overkill IMO. The second for loop does indeed mirror some of the logic of ReadBit (namely, adjusting current_byte as you mention, and touching source_.data() directly). This could be moved off into a helper function, which would hopefully get inlined, but then we're going to have to decide whether that helper function needs to check that it's byte aligned first. I presume it would (unless it were private?) at which point we're double checking preconditions. Not end of the world but the whole point was to improve the performance here :) Those preconditions would also get checked then 8 times for a 64-bit read (common). Personally, I'd prefer to leave it in-line there, but if you want it in a separate function, I'd propose making it a) private b) Only DCHECK() the preconditions (that current_bit_ == 7 and BitsLeft() >= 8) c) return a uint8_t and update current_byte_ The only downside would be if someone expects ReadByte() to work regardless of whether the stream is currently byte-aligned (or at the very least to check whether that condition is met). Could either handle both cases (in which case you're having an extra switch statement run 8 times for a 64-bit read) or just document that being byte-aligned is a precondition. Thoughts?
https://codereview.chromium.org/1070903003/diff/1/components/packed_ct_ev_whi... File components/packed_ct_ev_whitelist/bit_stream_reader.cc (right): https://codereview.chromium.org/1070903003/diff/1/components/packed_ct_ev_whi... components/packed_ct_ev_whitelist/bit_stream_reader.cc:41: << (num_bits - 8); On 2015/04/09 02:43:06, ian fette wrote: > The second for loop does indeed mirror some of the logic of ReadBit (namely, > adjusting current_byte as you mention, and touching source_.data() directly). > This could be moved off into a helper function, which would hopefully get > inlined, but then we're going to have to decide whether that helper function > needs to check that it's byte aligned first. I presume it would (unless it were > private?) at which point we're double checking preconditions. Not end of the > world but the whole point was to improve the performance here :) Those > preconditions would also get checked then 8 times for a 64-bit read (common). Right, I was just talking in the context of the second loop (e.g. precondition is that things are aligned). The first loop uses ReadBit(), such that current_bit_/current_byte_ are always set. I'm not worried about the precondition cost - DCHECK optimizes out in official builds (and has for a long time) - it's just to guard against developer mistake, which would happen if someone adjusted the first loop without adjusting the second. > Personally, I'd prefer to leave it in-line there, but if you want it in a > separate function, I'd propose making it > a) private > b) Only DCHECK() the preconditions (that current_bit_ == 7 and BitsLeft() >= 8) > c) return a uint8_t and update current_byte_ Yup, exactly what I was trying to suggest. > just document that being byte-aligned is a precondition. Exactly this. From a design perspective, my grumpiness towards the current approach came from the fact that ReadBits() didn't mutate internal state directly, but instead delegated through ReadBit, so I wasn't keen to spread the mutation 'up'. There's definitely room for 'cleverness', given the read patters (read 64 bits, read 0-7 bits, read 47 bits, read 0-7 bits, read 47 bits, etc) where a copy-mask-shift approach at the 128-bit level (so we can span the two 64-bits) would totally work, and might yield better loop unrolling. But that does feel like extreme microptimizations for the task, given the discussion on the bug.
PTAL https://codereview.chromium.org/1070903003/diff/1/components/packed_ct_ev_whi... File components/packed_ct_ev_whitelist/bit_stream_reader.cc (left): https://codereview.chromium.org/1070903003/diff/1/components/packed_ct_ev_whi... components/packed_ct_ev_whitelist/bit_stream_reader.cc:10: On 2015/04/09 01:33:57, Ryan Sleevi wrote: > Don't delete this newline Done. https://codereview.chromium.org/1070903003/diff/1/components/packed_ct_ev_whi... File components/packed_ct_ev_whitelist/bit_stream_reader.cc (right): https://codereview.chromium.org/1070903003/diff/1/components/packed_ct_ev_whi... components/packed_ct_ev_whitelist/bit_stream_reader.cc:41: << (num_bits - 8); On 2015/04/09 02:57:16, Ryan Sleevi wrote: > On 2015/04/09 02:43:06, ian fette wrote: > > The second for loop does indeed mirror some of the logic of ReadBit (namely, > > adjusting current_byte as you mention, and touching source_.data() directly). > > This could be moved off into a helper function, which would hopefully get > > inlined, but then we're going to have to decide whether that helper function > > needs to check that it's byte aligned first. I presume it would (unless it > were > > private?) at which point we're double checking preconditions. Not end of the > > world but the whole point was to improve the performance here :) Those > > preconditions would also get checked then 8 times for a 64-bit read (common). > > Right, I was just talking in the context of the second loop (e.g. precondition > is that things are aligned). The first loop uses ReadBit(), such that > current_bit_/current_byte_ are always set. > > I'm not worried about the precondition cost - DCHECK optimizes out in official > builds (and has for a long time) - it's just to guard against developer mistake, > which would happen if someone adjusted the first loop without adjusting the > second. > > > Personally, I'd prefer to leave it in-line there, but if you want it in a > > separate function, I'd propose making it > > a) private > > b) Only DCHECK() the preconditions (that current_bit_ == 7 and BitsLeft() >= > 8) > > c) return a uint8_t and update current_byte_ > > Yup, exactly what I was trying to suggest. > > > just document that being byte-aligned is a precondition. > > Exactly this. > > From a design perspective, my grumpiness towards the current approach came from > the fact that ReadBits() didn't mutate internal state directly, but instead > delegated through ReadBit, so I wasn't keen to spread the mutation 'up'. > > There's definitely room for 'cleverness', given the read patters (read 64 bits, > read 0-7 bits, read 47 bits, read 0-7 bits, read 47 bits, etc) where a > copy-mask-shift approach at the 128-bit level (so we can span the two 64-bits) > would totally work, and might yield better loop unrolling. > > But that does feel like extreme microptimizations for the task, given the > discussion on the bug. Done.
Can I get an LGTM? I made your requested changes, seems pretty straightforward. Local results showed a good speedup (~65% which included the I/O time), I'd like to land and see what happens in the UMA data. On 2015/04/09 16:22:14, ian fette wrote: > PTAL > > https://codereview.chromium.org/1070903003/diff/1/components/packed_ct_ev_whi... > File components/packed_ct_ev_whitelist/bit_stream_reader.cc (left): > > https://codereview.chromium.org/1070903003/diff/1/components/packed_ct_ev_whi... > components/packed_ct_ev_whitelist/bit_stream_reader.cc:10: > On 2015/04/09 01:33:57, Ryan Sleevi wrote: > > Don't delete this newline > > Done. > > https://codereview.chromium.org/1070903003/diff/1/components/packed_ct_ev_whi... > File components/packed_ct_ev_whitelist/bit_stream_reader.cc (right): > > https://codereview.chromium.org/1070903003/diff/1/components/packed_ct_ev_whi... > components/packed_ct_ev_whitelist/bit_stream_reader.cc:41: << (num_bits - 8); > On 2015/04/09 02:57:16, Ryan Sleevi wrote: > > On 2015/04/09 02:43:06, ian fette wrote: > > > The second for loop does indeed mirror some of the logic of ReadBit (namely, > > > adjusting current_byte as you mention, and touching source_.data() > directly). > > > This could be moved off into a helper function, which would hopefully get > > > inlined, but then we're going to have to decide whether that helper function > > > needs to check that it's byte aligned first. I presume it would (unless it > > were > > > private?) at which point we're double checking preconditions. Not end of the > > > world but the whole point was to improve the performance here :) Those > > > preconditions would also get checked then 8 times for a 64-bit read > (common). > > > > Right, I was just talking in the context of the second loop (e.g. precondition > > is that things are aligned). The first loop uses ReadBit(), such that > > current_bit_/current_byte_ are always set. > > > > I'm not worried about the precondition cost - DCHECK optimizes out in official > > builds (and has for a long time) - it's just to guard against developer > mistake, > > which would happen if someone adjusted the first loop without adjusting the > > second. > > > > > Personally, I'd prefer to leave it in-line there, but if you want it in a > > > separate function, I'd propose making it > > > a) private > > > b) Only DCHECK() the preconditions (that current_bit_ == 7 and BitsLeft() >= > > 8) > > > c) return a uint8_t and update current_byte_ > > > > Yup, exactly what I was trying to suggest. > > > > > just document that being byte-aligned is a precondition. > > > > Exactly this. > > > > From a design perspective, my grumpiness towards the current approach came > from > > the fact that ReadBits() didn't mutate internal state directly, but instead > > delegated through ReadBit, so I wasn't keen to spread the mutation 'up'. > > > > There's definitely room for 'cleverness', given the read patters (read 64 > bits, > > read 0-7 bits, read 47 bits, read 0-7 bits, read 47 bits, etc) where a > > copy-mask-shift approach at the 128-bit level (so we can span the two 64-bits) > > would totally work, and might yield better loop unrolling. > > > > But that does feel like extreme microptimizations for the task, given the > > discussion on the bug. > > Done.
ian@chromium.org changed reviewers: + jam@chromium.org - rsleevi@chromium.org
Comments were addressed, not getting a response, changing reviewer. On 2015/04/14 18:18:39, ian fette wrote: > Can I get an LGTM? I made your requested changes, seems pretty straightforward. > Local results showed a good speedup (~65% which included the I/O time), I'd like > to land and see what happens in the UMA data. > > On 2015/04/09 16:22:14, ian fette wrote: > > PTAL > > > > > https://codereview.chromium.org/1070903003/diff/1/components/packed_ct_ev_whi... > > File components/packed_ct_ev_whitelist/bit_stream_reader.cc (left): > > > > > https://codereview.chromium.org/1070903003/diff/1/components/packed_ct_ev_whi... > > components/packed_ct_ev_whitelist/bit_stream_reader.cc:10: > > On 2015/04/09 01:33:57, Ryan Sleevi wrote: > > > Don't delete this newline > > > > Done. > > > > > https://codereview.chromium.org/1070903003/diff/1/components/packed_ct_ev_whi... > > File components/packed_ct_ev_whitelist/bit_stream_reader.cc (right): > > > > > https://codereview.chromium.org/1070903003/diff/1/components/packed_ct_ev_whi... > > components/packed_ct_ev_whitelist/bit_stream_reader.cc:41: << (num_bits - 8); > > On 2015/04/09 02:57:16, Ryan Sleevi wrote: > > > On 2015/04/09 02:43:06, ian fette wrote: > > > > The second for loop does indeed mirror some of the logic of ReadBit > (namely, > > > > adjusting current_byte as you mention, and touching source_.data() > > directly). > > > > This could be moved off into a helper function, which would hopefully get > > > > inlined, but then we're going to have to decide whether that helper > function > > > > needs to check that it's byte aligned first. I presume it would (unless it > > > were > > > > private?) at which point we're double checking preconditions. Not end of > the > > > > world but the whole point was to improve the performance here :) Those > > > > preconditions would also get checked then 8 times for a 64-bit read > > (common). > > > > > > Right, I was just talking in the context of the second loop (e.g. > precondition > > > is that things are aligned). The first loop uses ReadBit(), such that > > > current_bit_/current_byte_ are always set. > > > > > > I'm not worried about the precondition cost - DCHECK optimizes out in > official > > > builds (and has for a long time) - it's just to guard against developer > > mistake, > > > which would happen if someone adjusted the first loop without adjusting the > > > second. > > > > > > > Personally, I'd prefer to leave it in-line there, but if you want it in a > > > > separate function, I'd propose making it > > > > a) private > > > > b) Only DCHECK() the preconditions (that current_bit_ == 7 and BitsLeft() > >= > > > 8) > > > > c) return a uint8_t and update current_byte_ > > > > > > Yup, exactly what I was trying to suggest. > > > > > > > just document that being byte-aligned is a precondition. > > > > > > Exactly this. > > > > > > From a design perspective, my grumpiness towards the current approach came > > from > > > the fact that ReadBits() didn't mutate internal state directly, but instead > > > delegated through ReadBit, so I wasn't keen to spread the mutation 'up'. > > > > > > There's definitely room for 'cleverness', given the read patters (read 64 > > bits, > > > read 0-7 bits, read 47 bits, read 0-7 bits, read 47 bits, etc) where a > > > copy-mask-shift approach at the 128-bit level (so we can span the two > 64-bits) > > > would totally work, and might yield better loop unrolling. > > > > > > But that does feel like extreme microptimizations for the task, given the > > > discussion on the bug. > > > > Done.
lgtm
The CQ bit was checked by ian@chromium.org
The CQ bit was unchecked by ian@chromium.org
The CQ bit was checked by ian@chromium.org
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1070903003/40001
Message was sent while issue was closed.
Committed patchset #3 (id:40001)
Message was sent while issue was closed.
Patchset 3 (id:??) landed as https://crrev.com/ca0d5d46c95fa5e8d5c89a979baba0c5b6bbefd0 Cr-Commit-Position: refs/heads/master@{#325286} |