Issue 2384893002: PVer4: Test checksum on startup outside the hotpath of DB load

vakh (use Gerrit instead)

The CQ bit was checked by vakh@chromium.org to run a CQ dry run

4 years, 2 months ago (2016-10-01 00:05:29 UTC) #1

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2384893002/1

4 years, 2 months ago (2016-10-01 00:06:33 UTC) #2

vakh (use Gerrit instead)

Description was changed from ========== PVer4: Test checksum on startup outside the hotpath of DB ...

4 years, 2 months ago (2016-10-01 00:07:30 UTC) #3

vakh (use Gerrit instead)

vakh@chromium.org changed reviewers: + nparker@chromium.org, shess@chromium.org

4 years, 2 months ago (2016-10-01 00:07:31 UTC) #4

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 2 months ago (2016-10-01 01:13:44 UTC) #5

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

4 years, 2 months ago (2016-10-01 01:13:45 UTC) #6

Nathan Parker

(out-of-character code review on the weekend since I'm out Monday...) https://codereview.chromium.org/2384893002/diff/1/components/safe_browsing_db/v4_store.cc File components/safe_browsing_db/v4_store.cc (right): https://codereview.chromium.org/2384893002/diff/1/components/safe_browsing_db/v4_store.cc#newcode133 ...

4 years, 2 months ago (2016-10-03 02:26:49 UTC) #7

(out-of-character code review on the weekend since I'm out Monday...)

https://codereview.chromium.org/2384893002/diff/1/components/safe_browsing_db...
File components/safe_browsing_db/v4_store.cc (right):

https://codereview.chromium.org/2384893002/diff/1/components/safe_browsing_db...
components/safe_browsing_db/v4_store.cc:133: hash_prefix_map_.clear();
This should delete the file too, according to the .h file.  And just return
void, since there's no possibility of another value.

https://codereview.chromium.org/2384893002/diff/1/components/safe_browsing_db...
components/safe_browsing_db/v4_store.cc:233: RecordMergeUpdateTime(after -
before);
This mergeupdatetime entry will be for a different code path on startup, so
maybe not really comparable to regular updates. I'm not sure if it matters... 
One idea would be to not put the just-copy-on-startup in MergeUpdate, but call a
different function here if map_old is empty, and record it in a different 
metric.

Just make sure you think about how you'll use this metric, and if this code is
sufficient.

https://codereview.chromium.org/2384893002/diff/1/components/safe_browsing_db...
components/safe_browsing_db/v4_store.cc:423: task_runner_->PostTask(
Add a comment to why this is not run synchronously.

https://codereview.chromium.org/2384893002/diff/1/components/safe_browsing_db...
components/safe_browsing_db/v4_store.cc:424: FROM_HERE,
base::Bind(&V4Store::VerifyChecksum,
How many CPU seconds will this take, when it does run?  If it's substantian
(>100's of ms of CPU), we really should put it off till after startup is really
done.  Otherwise, it'll contend with other startup tasks for CPU time.

You could post it with a 5 minute delay.

https://codereview.chromium.org/2384893002/diff/1/components/safe_browsing_db...
components/safe_browsing_db/v4_store.cc:672: void V4Store::VerifyChecksum(const
std::string& expected_checksum) {
Maybe call this VerifyChecksumDelayed, so as to differentiate it from the
checksum check above.

https://codereview.chromium.org/2384893002/diff/1/components/safe_browsing_db...
components/safe_browsing_db/v4_store.cc:711:
RecordApplyUpdateResultWhenReadingFromDisk(CHECKSUM_MISMATCH_FAILURE);
This should probably be a different enum, since checksum failures due to disk
corruption are quite different than those from bad merges.

https://codereview.chromium.org/2384893002/diff/1/components/safe_browsing_db...
components/safe_browsing_db/v4_store.cc:713: Reset();
This is a race if someone is accessing the store on the IO thread right now. 
Maybe Reset should post to the IO thread to do the clearing.

vakh (use Gerrit instead)

Log UMA metrics per store and total using base_metric and suffixes

4 years, 2 months ago (2016-10-04 00:12:15 UTC) #8

vakh (use Gerrit instead)

The CQ bit was checked by vakh@chromium.org to run a CQ dry run

4 years, 2 months ago (2016-10-04 00:12:17 UTC) #9

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2384893002/20001

4 years, 2 months ago (2016-10-04 00:13:11 UTC) #10

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 2 months ago (2016-10-04 00:51:00 UTC) #11

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: win_chromium_compile_dbg_ng on master.tryserver.chromium.win (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_compile_dbg_ng/builds/269053) win_chromium_rel_ng on ...

4 years, 2 months ago (2016-10-04 00:51:01 UTC) #12

vakh (use Gerrit instead)

The CQ bit was checked by vakh@chromium.org to run a CQ dry run

4 years, 2 months ago (2016-10-04 07:05:49 UTC) #14

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2384893002/40001

4 years, 2 months ago (2016-10-04 07:05:58 UTC) #15

vakh (use Gerrit instead)

The CQ bit was checked by vakh@chromium.org to run a CQ dry run

4 years, 2 months ago (2016-10-04 07:14:09 UTC) #17

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2384893002/60001

4 years, 2 months ago (2016-10-04 07:14:23 UTC) #18

vakh (use Gerrit instead)

https://codereview.chromium.org/2384893002/diff/1/components/safe_browsing_db/v4_store.cc File components/safe_browsing_db/v4_store.cc (right): https://codereview.chromium.org/2384893002/diff/1/components/safe_browsing_db/v4_store.cc#newcode133 components/safe_browsing_db/v4_store.cc:133: hash_prefix_map_.clear(); On 2016/10/03 02:26:49, Nathan Parker wrote: > This ...

4 years, 2 months ago (2016-10-04 07:14:39 UTC) #19

https://codereview.chromium.org/2384893002/diff/1/components/safe_browsing_db...
File components/safe_browsing_db/v4_store.cc (right):

https://codereview.chromium.org/2384893002/diff/1/components/safe_browsing_db...
components/safe_browsing_db/v4_store.cc:133: hash_prefix_map_.clear();
On 2016/10/03 02:26:49, Nathan Parker wrote:
> This should delete the file too, according to the .h file.  And just return
> void, since there's no possibility of another value.

Done. Don't need to delete the file since it would be over-written on next
update.

https://codereview.chromium.org/2384893002/diff/1/components/safe_browsing_db...
components/safe_browsing_db/v4_store.cc:233: RecordMergeUpdateTime(after -
before);
On 2016/10/03 02:26:48, Nathan Parker wrote:
> This mergeupdatetime entry will be for a different code path on startup, so
> maybe not really comparable to regular updates. I'm not sure if it matters... 
> One idea would be to not put the just-copy-on-startup in MergeUpdate, but call
a
> different function here if map_old is empty, and record it in a different 
> metric.
> 
> Just make sure you think about how you'll use this metric, and if this code is
> sufficient.

Done. With the latest patch, the time metrics are logged separately for
ReadFromDisk, ProcessFullUpdate and ProcessPartialUpdate.

https://codereview.chromium.org/2384893002/diff/1/components/safe_browsing_db...
components/safe_browsing_db/v4_store.cc:423: task_runner_->PostTask(
On 2016/10/03 02:26:49, Nathan Parker wrote:
> Add a comment to why this is not run synchronously.

Done.

https://codereview.chromium.org/2384893002/diff/1/components/safe_browsing_db...
components/safe_browsing_db/v4_store.cc:424: FROM_HERE,
base::Bind(&V4Store::VerifyChecksum,
On 2016/10/03 02:26:49, Nathan Parker wrote:
> How many CPU seconds will this take, when it does run?

It takes 2-3 seconds on my machine.


>  If it's substantian
> (>100's of ms of CPU), we really should put it off till after startup is
really
> done.  Otherwise, it'll contend with other startup tasks for CPU time.

That's a good point but it shouldn't contend with other important tasks because
it is scheduled on the task runner, which by design, is for less critical tasks.
All important tasks should be scheduled on the IO thread, which gets preference
over task runner.
[I ran an experiment and scheduled a task on the IO thread and another on the
task runner in MergeUpdate. The one on IO thread consistently ran much earlier
than the one on task runner].

> 
> You could post it with a 5 minute delay.

As I explained above, this should not be necessary if the checksum verification
is scheduled on the task runner.

https://codereview.chromium.org/2384893002/diff/1/components/safe_browsing_db...
components/safe_browsing_db/v4_store.cc:672: void V4Store::VerifyChecksum(const
std::string& expected_checksum) {
On 2016/10/03 02:26:49, Nathan Parker wrote:
> Maybe call this VerifyChecksumDelayed, so as to differentiate it from the
> checksum check above.

Done.

https://codereview.chromium.org/2384893002/diff/1/components/safe_browsing_db...
components/safe_browsing_db/v4_store.cc:711:
RecordApplyUpdateResultWhenReadingFromDisk(CHECKSUM_MISMATCH_FAILURE);
On 2016/10/03 02:26:49, Nathan Parker wrote:
> This should probably be a different enum, since checksum failures due to disk
> corruption are quite different than those from bad merges.

Yes, but it is different. This one is:
RecordApplyUpdateResultWhenReadingFromDisk. The other one is logged using:
RecordApplyUpdateResult

https://codereview.chromium.org/2384893002/diff/1/components/safe_browsing_db...
components/safe_browsing_db/v4_store.cc:713: Reset();
On 2016/10/03 02:26:49, Nathan Parker wrote:
> This is a race if someone is accessing the store on the IO thread right now. 
> Maybe Reset should post to the IO thread to do the clearing.

Done.

vakh (use Gerrit instead)

Description was changed from ========== PVer4: Test checksum on startup outside the hotpath of DB ...

4 years, 2 months ago (2016-10-04 07:21:54 UTC) #20

Description was changed from

==========
PVer4: Test checksum on startup outside the hotpath of DB load

On startup, when MergeUpdate is called, the old_prefixes_map is empty. When we
detect this case, we shortcut the merge process.

* Instead of merging the hash prefixes in lexicographical order, one by one, we
copy the additions_map into the hash_prefix_map_.
* Also, instead of checking the checksum at this time, we post a task to the
task runner to verify the
checksum. When that task runs, if the checksum matches, it returns silently;
otherwise, it clears out the state and the hash_prefix_map_ of the store.

This reduces the merge time for Malware and SocialEngineering lists by about 5
seconds total on my machine and brings the DB load time to <1s from 12s
previously.

BUG=651911, 543161
==========

to

==========
PVer4: Test checksum on startup outside the hotpath of DB load

On startup, when MergeUpdate is called, the old_prefixes_map is empty. When we
detect this case, we shortcut the merge process.
* Instead of merging the hash prefixes in lexicographical order, one by one, we
copy the additions_map into the hash_prefix_map_.
* Then, instead of checking the checksum at this time, we post a task to the
task runner to verify the checksum.
  When that task runs, if the checksum matches, it returns silently; otherwise,
it clears out the state and the hash_prefix_map_ of the store.

This reduces the merge time for Malware and SocialEngineering lists by about 5
seconds total on my machine and brings the DB load time to <1s from 12s
previously.

Also, changed the way some of the UMA metrics are logged. I've done this for
duration-related metrics for now.
With the new structure, it is easy to see which operation is taking how long and
also the breakdown of time taken
for each of the sub-operations.
For example, there are now separate histograms for:
 - SafeBrowsing.V4ProcessFullUpdate.Time, and its sub operations
 - SafeBrowsing.V4ProcessFullUpdate.MergeUpdate.Time,
 - SafeBrowsing.V4ProcessFullUpdate.DecodeAdditions.Time.

BUG=651911, 543161
==========

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 2 months ago (2016-10-04 08:04:45 UTC) #21

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: win_chromium_compile_dbg_ng on master.tryserver.chromium.win (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_compile_dbg_ng/builds/269238)

4 years, 2 months ago (2016-10-04 08:04:46 UTC) #22

vakh (use Gerrit instead)

Remove all histogram related changes. That would be a separate CL

4 years, 2 months ago (2016-10-04 20:31:12 UTC) #23

vakh (use Gerrit instead)

The CQ bit was checked by vakh@chromium.org to run a CQ dry run

4 years, 2 months ago (2016-10-04 20:31:14 UTC) #24

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2384893002/80001

4 years, 2 months ago (2016-10-04 20:31:57 UTC) #25

vakh (use Gerrit instead)

Description was changed from ========== PVer4: Test checksum on startup outside the hotpath of DB ...

4 years, 2 months ago (2016-10-04 20:32:42 UTC) #26

Description was changed from

==========
PVer4: Test checksum on startup outside the hotpath of DB load

On startup, when MergeUpdate is called, the old_prefixes_map is empty. When we
detect this case, we shortcut the merge process.
* Instead of merging the hash prefixes in lexicographical order, one by one, we
copy the additions_map into the hash_prefix_map_.
* Then, instead of checking the checksum at this time, we post a task to the
task runner to verify the checksum.
  When that task runs, if the checksum matches, it returns silently; otherwise,
it clears out the state and the hash_prefix_map_ of the store.

This reduces the merge time for Malware and SocialEngineering lists by about 5
seconds total on my machine and brings the DB load time to <1s from 12s
previously.

Also, changed the way some of the UMA metrics are logged. I've done this for
duration-related metrics for now.
With the new structure, it is easy to see which operation is taking how long and
also the breakdown of time taken
for each of the sub-operations.
For example, there are now separate histograms for:
 - SafeBrowsing.V4ProcessFullUpdate.Time, and its sub operations
 - SafeBrowsing.V4ProcessFullUpdate.MergeUpdate.Time,
 - SafeBrowsing.V4ProcessFullUpdate.DecodeAdditions.Time.

BUG=651911, 543161
==========

to

==========
PVer4: Test checksum on startup outside the hotpath of DB load

On startup, when MergeUpdate is called, the old_prefixes_map is empty. When we
detect this case, we shortcut the merge process.
* Instead of merging the hash prefixes in lexicographical order, one by one, we
copy the additions_map into the hash_prefix_map_.
* Then, instead of checking the checksum at this time, we post a task to the
task runner to verify the checksum.
  When that task runs, if the checksum matches, it returns silently; otherwise,
it clears out the state and the hash_prefix_map_ of the store.

This reduces the merge time for Malware and SocialEngineering lists by about 5
seconds total on my machine and brings the DB load time to <1s from 12s
previously.

BUG=651911, 543161
==========

vakh (use Gerrit instead)

https://codereview.chromium.org/2384893002/diff/1/components/safe_browsing_db/v4_store.cc File components/safe_browsing_db/v4_store.cc (right): https://codereview.chromium.org/2384893002/diff/1/components/safe_browsing_db/v4_store.cc#newcode233 components/safe_browsing_db/v4_store.cc:233: RecordMergeUpdateTime(after - before); On 2016/10/04 07:14:39, vakh (Varun Khaneja) ...

4 years, 2 months ago (2016-10-04 20:34:50 UTC) #28

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 2 months ago (2016-10-04 22:05:59 UTC) #29

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

4 years, 2 months ago (2016-10-04 22:05:59 UTC) #30

Scott Hess - ex-Googler

https://codereview.chromium.org/2384893002/diff/80001/components/safe_browsing_db/v4_database.cc File components/safe_browsing_db/v4_database.cc (right): https://codereview.chromium.org/2384893002/diff/80001/components/safe_browsing_db/v4_database.cc#newcode160 components/safe_browsing_db/v4_database.cc:160: return true; Does this leave any implementations which can ...

4 years, 2 months ago (2016-10-05 04:42:13 UTC) #31

vakh (use Gerrit instead)

DB posts a VerifyChecksum task for itself. Enables update manager after checking checksum.

4 years, 2 months ago (2016-10-05 18:56:34 UTC) #32

vakh (use Gerrit instead)

The CQ bit was checked by vakh@chromium.org to run a CQ dry run

4 years, 2 months ago (2016-10-05 18:56:36 UTC) #33

vakh (use Gerrit instead)

The CQ bit was checked by vakh@chromium.org

4 years, 2 months ago (2016-10-05 18:56:38 UTC) #34

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2384893002/100001

4 years, 2 months ago (2016-10-05 18:57:22 UTC) #35

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 2 months ago (2016-10-05 18:57:23 UTC) #36

commit-bot: I haz the power

No L-G-T-M from a valid reviewer yet. CQ run can only be started by full ...

4 years, 2 months ago (2016-10-05 18:57:24 UTC) #37

vakh (use Gerrit instead)

https://codereview.chromium.org/2384893002/diff/80001/components/safe_browsing_db/v4_database.cc File components/safe_browsing_db/v4_database.cc (right): https://codereview.chromium.org/2384893002/diff/80001/components/safe_browsing_db/v4_database.cc#newcode160 components/safe_browsing_db/v4_database.cc:160: return true; On 2016/10/05 04:42:13, Scott Hess wrote: > ...

4 years, 2 months ago (2016-10-05 23:49:32 UTC) #38

https://codereview.chromium.org/2384893002/diff/80001/components/safe_browsin...
File components/safe_browsing_db/v4_database.cc (right):

https://codereview.chromium.org/2384893002/diff/80001/components/safe_browsin...
components/safe_browsing_db/v4_database.cc:160: return true;
On 2016/10/05 04:42:13, Scott Hess wrote:
> Does this leave any implementations which can return false?  If not, change
the
> function's signature to void.

Done.

https://codereview.chromium.org/2384893002/diff/80001/components/safe_browsin...
File components/safe_browsing_db/v4_database_unittest.cc (right):

https://codereview.chromium.org/2384893002/diff/80001/components/safe_browsin...
components/safe_browsing_db/v4_database_unittest.cc:128: EXPECT_EQ(true,
v4_database->ResetDatabase());
On 2016/10/05 04:42:13, Scott Hess wrote:
> EXPECT_TRUE()?

Done.

https://codereview.chromium.org/2384893002/diff/80001/components/safe_browsin...
File components/safe_browsing_db/v4_store.cc (right):

https://codereview.chromium.org/2384893002/diff/80001/components/safe_browsin...
components/safe_browsing_db/v4_store.cc:182: // which runs on task runner.
On 2016/10/05 04:42:13, Scott Hess wrote:
> I may be reading things wrong, but I think now what happens is the manager
> receives ResetDatabase() on the I/O thread, so it posts a task to the task
> runner, which asks the database to reset, which asks this to reset, which
posts
> tasks back to the I/O thread.  Seems like a lot of work.  Since reset is only
> affecting in-memory structures, it could all be on I/O thread, right?
> 
> Also, I feel like this code can't post back out to the I/O thread this way
> without running the risk of use-after-free because the next task on the task
> runner could be the DeleteSoon() on the owning V4Database instance.

I've deleted the Reset code in the V4Database since it is not being called.

The other thing that I am going to do is to make V4LDBM be incharge of all write
operations that happen on the DB/store. This is to avoid the race conditions
between making changes to the store/DB and DB getting destroyed.

Here's a single-page design document of the new proposal:
https://goto.google.com/design-doc-v4store-verifychecksum

Please review it when you have the time. Meanwhile, I'll code it up in this CL.

vakh (use Gerrit instead)

go: design-doc-v4store-verifychecksum -- VerifyChecksum in a way that avoids race conditions between the task runner ...

4 years, 2 months ago (2016-10-06 06:46:33 UTC) #39

vakh (use Gerrit instead)

Tests mimic the thread-related operations of product code. Tests pass

4 years, 2 months ago (2016-10-06 20:12:50 UTC) #40

vakh (use Gerrit instead)

The CQ bit was checked by vakh@chromium.org to run a CQ dry run

4 years, 2 months ago (2016-10-06 20:12:52 UTC) #41

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2384893002/140001

4 years, 2 months ago (2016-10-06 20:13:27 UTC) #42

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 2 months ago (2016-10-06 20:25:20 UTC) #43

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: android_compile_dbg on master.tryserver.chromium.android (JOB_FAILED, https://build.chromium.org/p/tryserver.chromium.android/builders/android_compile_dbg/builds/141676) chromeos_amd64-generic_chromium_compile_only_ng on ...

4 years, 2 months ago (2016-10-06 20:25:21 UTC) #44

vakh (use Gerrit instead)

Minor: Rebase and s/base::MessageLoop::current()->task_runner()/base::ThreadTaskRunnerHandle::Get()

4 years, 2 months ago (2016-10-06 21:36:34 UTC) #45

vakh (use Gerrit instead)

The CQ bit was checked by vakh@chromium.org to run a CQ dry run

4 years, 2 months ago (2016-10-06 21:36:37 UTC) #46

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2384893002/160001

4 years, 2 months ago (2016-10-06 21:37:18 UTC) #47

vakh (use Gerrit instead)

https://codereview.chromium.org/2384893002/diff/120001/components/safe_browsing_db/v4_database.h File components/safe_browsing_db/v4_database.h (right): https://codereview.chromium.org/2384893002/diff/120001/components/safe_browsing_db/v4_database.h#newcode134 components/safe_browsing_db/v4_database.h:134: // |db_ready_for_updates_callback| TODO: format

4 years, 2 months ago (2016-10-06 21:45:21 UTC) #48

vakh (use Gerrit instead)

The CQ bit was checked by vakh@chromium.org to run a CQ dry run

4 years, 2 months ago (2016-10-06 21:45:48 UTC) #50

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2384893002/180001

4 years, 2 months ago (2016-10-06 21:46:31 UTC) #51

vakh (use Gerrit instead)

Nathan, Scott: Please take another look. I've updated the code to follow the design doc ...

4 years, 2 months ago (2016-10-06 21:47:56 UTC) #52

Scott Hess - ex-Googler

https://codereview.chromium.org/2384893002/diff/180001/components/safe_browsing_db/v4_database.cc File components/safe_browsing_db/v4_database.cc (right): https://codereview.chromium.org/2384893002/diff/180001/components/safe_browsing_db/v4_database.cc#newcode187 components/safe_browsing_db/v4_database.cc:187: store_pair->second->Reset(); Is there any possibility that the second of ...

4 years, 2 months ago (2016-10-06 23:04:10 UTC) #53

https://codereview.chromium.org/2384893002/diff/180001/components/safe_browsi...
File components/safe_browsing_db/v4_database.cc (right):

https://codereview.chromium.org/2384893002/diff/180001/components/safe_browsi...
components/safe_browsing_db/v4_database.cc:187: store_pair->second->Reset();
Is there any possibility that the second of stores can change between
VerifyChecksumOnTaskRunner() and here?  I can see that there's at least a code
path which might change the actual store associated with the identifier, but
AFAICT there has to be a store there.  This feels a bit roundabout compared to
(*store_map_)[identifier]->Reset().

[Admittedly, (*store_map_)[identifier]->Reset() isn't pretty :-).]

https://codereview.chromium.org/2384893002/diff/180001/components/safe_browsi...
File components/safe_browsing_db/v4_database.h (right):

https://codereview.chromium.org/2384893002/diff/180001/components/safe_browsi...
components/safe_browsing_db/v4_database.h:29: DatabaseReadyForUpdatesCallback;
Why unordered_set rather than vector?  AFAICT, there's no chance you'll have
collisions to discard, and also AFAICT there would be no benefit to discarding
collisions because the operation is cheap.

https://codereview.chromium.org/2384893002/diff/180001/components/safe_browsi...
File components/safe_browsing_db/v4_local_database_manager.cc (right):

https://codereview.chromium.org/2384893002/diff/180001/components/safe_browsi...
components/safe_browsing_db/v4_local_database_manager.cc:291:
v4_database_->VerifyChecksum(db_ready_for_updates_callback);
OK, thinking out loud ... since this is enabled_, then StopOnIOThread() hasn't
been called, so there isn't an outstanding DeleteSoon().  But StopOnIOThread()
might be sent immediately after this.  In that case, the outstanding verify will
send the callback, which will short-circuit because of enabled_, and then delete
the database.

I think there's still a flaw in the threading.  For messages which go from the
manager to the db, enabled_ is sufficient, because the manager scopes enabled_
changes around db acquisition.  But StopOnIOThread() could be called while
VerifyChecksumOnTaskRunner() is in flight.  There's nothing which prevents the
client from calling the destructor immediately after StopOnIOThread() is done,
in which case the callback will use-after-free.

I think a similar sequence can be constructed for StopOnIOThread() versus the
callback from ApplyUpdate().  The only way I can think of solving it offhand is
a weak_ptr for both callbacks, so when the callback posts it is dropped on the
floor.  I think that's reasonable for a followup CL, though.

https://codereview.chromium.org/2384893002/diff/180001/components/safe_browsi...
File components/safe_browsing_db/v4_store.cc (left):

https://codereview.chromium.org/2384893002/diff/180001/components/safe_browsi...
components/safe_browsing_db/v4_store.cc:315: // static
AFAICT, the only reason it can't be static is store_path_.  Maybe it could be a
parameter?  [Sorry, I don't recall the original direction this code was aiming,
so that could be a non-goal.]

https://codereview.chromium.org/2384893002/diff/180001/components/safe_browsi...
File components/safe_browsing_db/v4_store.cc (right):

https://codereview.chromium.org/2384893002/diff/180001/components/safe_browsi...
components/safe_browsing_db/v4_store.cc:581: ;
Extra semi-colon, and also DCHECK_IS_ON() wrapper (see other comment).

https://codereview.chromium.org/2384893002/diff/180001/components/safe_browsi...
components/safe_browsing_db/v4_store.cc:736: next_smallest_prefix_size);
Why string_as_array rather than data()?  Update() takes a const void *, so
there's no need to have a mutable pointer, which I think is the reason
string_as_array() exists.

https://codereview.chromium.org/2384893002/diff/180001/components/safe_browsi...
components/safe_browsing_db/v4_store.cc:744:
checksum_ctx->Finish(base::string_as_array(&checksum), checksum.size());
This usage of string_as_array() is valid ... but it might be even more sensible
to simply have an array of char in the first place rather than heap-allocating
and pre-sizing/filling a string.  It would require a slight adjustment of the
compare, but the encodes would be fine due to StringPiece.

https://codereview.chromium.org/2384893002/diff/180001/components/safe_browsi...
components/safe_browsing_db/v4_store.cc:755: << "; store: " << *this;
That said, it might be worth wrapping this in an DCHECK_IS_ON() to make sure the
temporaries aren't in production.

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 2 months ago (2016-10-06 23:23:19 UTC) #54

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: linux_chromium_asan_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_asan_rel_ng/builds/239379)

4 years, 2 months ago (2016-10-06 23:23:21 UTC) #55

vakh (use Gerrit instead)

Thanks for the review. PTAL. https://codereview.chromium.org/2384893002/diff/180001/components/safe_browsing_db/v4_database.cc File components/safe_browsing_db/v4_database.cc (right): https://codereview.chromium.org/2384893002/diff/180001/components/safe_browsing_db/v4_database.cc#newcode187 components/safe_browsing_db/v4_database.cc:187: store_pair->second->Reset(); On 2016/10/06 23:04:10, ...

4 years, 2 months ago (2016-10-07 00:39:45 UTC) #57

Thanks for the review. PTAL.

https://codereview.chromium.org/2384893002/diff/180001/components/safe_browsi...
File components/safe_browsing_db/v4_database.cc (right):

https://codereview.chromium.org/2384893002/diff/180001/components/safe_browsi...
components/safe_browsing_db/v4_database.cc:187: store_pair->second->Reset();
On 2016/10/06 23:04:10, Scott Hess wrote:
> Is there any possibility that the second of stores can change between
> VerifyChecksumOnTaskRunner() and here?  I can see that there's at least a code
> path which might change the actual store associated with the identifier, but
> AFAICT there has to be a store there.  This feels a bit roundabout compared to
> (*store_map_)[identifier]->Reset().
> 
> [Admittedly, (*store_map_)[identifier]->Reset() isn't pretty :-).]

I didn't understand what you said there :-) Can you please re-word it for me?
Thanks.

(*store_map_)[identifier]->Reset() is concise but I wanted to add the DCHECK.
The compiler will optimize the code in Release build so I would prefer to leave
it as-is.

https://codereview.chromium.org/2384893002/diff/180001/components/safe_browsi...
File components/safe_browsing_db/v4_database.h (right):

https://codereview.chromium.org/2384893002/diff/180001/components/safe_browsi...
components/safe_browsing_db/v4_database.h:29: DatabaseReadyForUpdatesCallback;
On 2016/10/06 23:04:10, Scott Hess wrote:
> Why unordered_set rather than vector?  AFAICT, there's no chance you'll have
> collisions to discard, and also AFAICT there would be no benefit to discarding
> collisions because the operation is cheap.

Done.

https://codereview.chromium.org/2384893002/diff/180001/components/safe_browsi...
File components/safe_browsing_db/v4_local_database_manager.cc (right):

https://codereview.chromium.org/2384893002/diff/180001/components/safe_browsi...
components/safe_browsing_db/v4_local_database_manager.cc:291:
v4_database_->VerifyChecksum(db_ready_for_updates_callback);
On 2016/10/06 23:04:10, Scott Hess wrote:
> OK, thinking out loud ... since this is enabled_, then StopOnIOThread() hasn't
> been called, so there isn't an outstanding DeleteSoon().  But StopOnIOThread()
> might be sent immediately after this.  In that case, the outstanding verify
will
> send the callback, which will short-circuit because of enabled_, and then
delete
> the database.
> 

Would that be Case 2 here:
http://go/design-doc-v4store-verifychecksum/edit#heading=h.9q94nrcrdfra

I think that's covered. PLMK if you are thinking of a different case.


> I think there's still a flaw in the threading.  For messages which go from the
> manager to the db, enabled_ is sufficient, because the manager scopes enabled_
> changes around db acquisition.  But StopOnIOThread() could be called while
> VerifyChecksumOnTaskRunner() is in flight.  There's nothing which prevents the
> client from calling the destructor immediately after StopOnIOThread() is done,
> in which case the callback will use-after-free.
> 
> I think a similar sequence can be constructed for StopOnIOThread() versus the
> callback from ApplyUpdate().

http://crbug.com/647237 :)

>  The only way I can think of solving it offhand is
> a weak_ptr for both callbacks, so when the callback posts it is dropped on the
> floor.  I think that's reasonable for a followup CL, though.

https://codereview.chromium.org/2384893002/diff/180001/components/safe_browsi...
File components/safe_browsing_db/v4_store.cc (left):

https://codereview.chromium.org/2384893002/diff/180001/components/safe_browsi...
components/safe_browsing_db/v4_store.cc:315: // static
On 2016/10/06 23:04:10, Scott Hess wrote:
> AFAICT, the only reason it can't be static is store_path_.  Maybe it could be
a
> parameter?  [Sorry, I don't recall the original direction this code was
aiming,
> so that could be a non-goal.]

It was already non-static but the comment hadn't been removed.
I prefer functions to be static since it makes it easier to think about them and
test them so I'll change it back but is it OK if I do it in a later CL?

https://codereview.chromium.org/2384893002/diff/180001/components/safe_browsi...
File components/safe_browsing_db/v4_store.cc (right):

https://codereview.chromium.org/2384893002/diff/180001/components/safe_browsi...
components/safe_browsing_db/v4_store.cc:736: next_smallest_prefix_size);
On 2016/10/06 23:04:10, Scott Hess wrote:
> Why string_as_array rather than data()?  Update() takes a const void *, so
> there's no need to have a mutable pointer, which I think is the reason
> string_as_array() exists.

Done.

https://codereview.chromium.org/2384893002/diff/180001/components/safe_browsi...
components/safe_browsing_db/v4_store.cc:744:
checksum_ctx->Finish(base::string_as_array(&checksum), checksum.size());
On 2016/10/06 23:04:10, Scott Hess wrote:
> This usage of string_as_array() is valid ... but it might be even more
sensible
> to simply have an array of char in the first place rather than heap-allocating
> and pre-sizing/filling a string.  It would require a slight adjustment of the
> compare, but the encodes would be fine due to StringPiece.

Done.

https://codereview.chromium.org/2384893002/diff/180001/components/safe_browsi...
components/safe_browsing_db/v4_store.cc:755: << "; store: " << *this;
On 2016/10/06 23:04:10, Scott Hess wrote:
> That said, it might be worth wrapping this in an DCHECK_IS_ON() to make sure
the
> temporaries aren't in production.

Done.

vakh (use Gerrit instead)

The CQ bit was checked by vakh@chromium.org to run a CQ dry run

4 years, 2 months ago (2016-10-07 00:39:47 UTC) #58

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2384893002/200001

4 years, 2 months ago (2016-10-07 00:40:04 UTC) #59

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 2 months ago (2016-10-07 02:00:44 UTC) #60

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: linux_chromium_asan_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_asan_rel_ng/builds/239534)

4 years, 2 months ago (2016-10-07 02:00:46 UTC) #61

Scott Hess - ex-Googler

Hmm. I don't see any concerns not addressed, so sounds like LGTM then! https://codereview.chromium.org/2384893002/diff/180001/components/safe_browsing_db/v4_database.cc File ...

4 years, 2 months ago (2016-10-07 13:31:09 UTC) #62

Hmm.  I don't see any concerns not addressed, so sounds like LGTM then!

https://codereview.chromium.org/2384893002/diff/180001/components/safe_browsi...
File components/safe_browsing_db/v4_database.cc (right):

https://codereview.chromium.org/2384893002/diff/180001/components/safe_browsi...
components/safe_browsing_db/v4_database.cc:187: store_pair->second->Reset();
On 2016/10/07 00:39:45, vakh (Varun Khaneja) wrote:
> On 2016/10/06 23:04:10, Scott Hess wrote:
> > Is there any possibility that the second of stores can change between
> > VerifyChecksumOnTaskRunner() and here?  I can see that there's at least a
code
> > path which might change the actual store associated with the identifier, but
> > AFAICT there has to be a store there.  This feels a bit roundabout compared
to
> > (*store_map_)[identifier]->Reset().
> > 
> > [Admittedly, (*store_map_)[identifier]->Reset() isn't pretty :-).]
> 
> I didn't understand what you said there :-) Can you please re-word it for me?
> Thanks.
> 
> (*store_map_)[identifier]->Reset() is concise but I wanted to add the DCHECK.
> The compiler will optimize the code in Release build so I would prefer to
leave
> it as-is.

OK, I guess.  It just kinda feels like double-checking that water is wet, to me.

https://codereview.chromium.org/2384893002/diff/180001/components/safe_browsi...
File components/safe_browsing_db/v4_local_database_manager.cc (right):

https://codereview.chromium.org/2384893002/diff/180001/components/safe_browsi...
components/safe_browsing_db/v4_local_database_manager.cc:291:
v4_database_->VerifyChecksum(db_ready_for_updates_callback);
On 2016/10/07 00:39:45, vakh (Varun Khaneja) wrote:
> On 2016/10/06 23:04:10, Scott Hess wrote:
> > OK, thinking out loud ... since this is enabled_, then StopOnIOThread()
hasn't
> > been called, so there isn't an outstanding DeleteSoon().  But
StopOnIOThread()
> > might be sent immediately after this.  In that case, the outstanding verify
> will
> > send the callback, which will short-circuit because of enabled_, and then
> delete
> > the database.
> > 
> 
> Would that be Case 2 here:
> http://go/design-doc-v4store-verifychecksum/edit#heading=h.9q94nrcrdfra
> 
> I think that's covered. PLMK if you are thinking of a different case.

I'm thinking:

tio: NewDatabaseReady
tio: StopOnIOThread
tio: ~V4LocalDatabaseManager
tr: VerifyChecksum
tio: UAF in DatabaseReadyForUpdates

Hmm ... is ~V4LocalDatabaseManager guaranteed never to run?  Like even at
shutdown?

> > I think there's still a flaw in the threading.  For messages which go from
the
> > manager to the db, enabled_ is sufficient, because the manager scopes
enabled_
> > changes around db acquisition.  But StopOnIOThread() could be called while
> > VerifyChecksumOnTaskRunner() is in flight.  There's nothing which prevents
the
> > client from calling the destructor immediately after StopOnIOThread() is
done,
> > in which case the callback will use-after-free.
> > 
> > I think a similar sequence can be constructed for StopOnIOThread() versus
the
> > callback from ApplyUpdate().
> 
> http://crbug.com/647237 :)

Yes...

https://codereview.chromium.org/2384893002/diff/180001/components/safe_browsi...
File components/safe_browsing_db/v4_store.cc (left):

https://codereview.chromium.org/2384893002/diff/180001/components/safe_browsi...
components/safe_browsing_db/v4_store.cc:315: // static
On 2016/10/07 00:39:45, vakh (Varun Khaneja) wrote:
> On 2016/10/06 23:04:10, Scott Hess wrote:
> > AFAICT, the only reason it can't be static is store_path_.  Maybe it could
be
> a
> > parameter?  [Sorry, I don't recall the original direction this code was
> aiming,
> > so that could be a non-goal.]
> 
> It was already non-static but the comment hadn't been removed.
> I prefer functions to be static since it makes it easier to think about them
and
> test them so I'll change it back but is it OK if I do it in a later CL?

Sure.

Nathan Parker

Is there a test that verifies the VerifyChecksum's are run later on startup? If so ...

4 years, 2 months ago (2016-10-07 23:24:29 UTC) #63

vakh (use Gerrit instead)

nparker@ feedback, add test for VerifyChecksum == true, possibly fix the asan failures

4 years, 2 months ago (2016-10-09 09:25:25 UTC) #64

vakh (use Gerrit instead)

The CQ bit was checked by vakh@chromium.org to run a CQ dry run

4 years, 2 months ago (2016-10-09 09:25:29 UTC) #65

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2384893002/220001

4 years, 2 months ago (2016-10-09 09:25:41 UTC) #66

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 2 months ago (2016-10-09 10:04:31 UTC) #67

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: linux_chromium_asan_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_asan_rel_ng/builds/240410)

4 years, 2 months ago (2016-10-09 10:04:33 UTC) #68

vakh (use Gerrit instead)

surely fix the asan failures. fixed on my machine, at least.

4 years, 2 months ago (2016-10-09 12:22:49 UTC) #69

vakh (use Gerrit instead)

The CQ bit was checked by vakh@chromium.org to run a CQ dry run

4 years, 2 months ago (2016-10-09 12:22:52 UTC) #70

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2384893002/240001

4 years, 2 months ago (2016-10-09 12:23:01 UTC) #71

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 2 months ago (2016-10-09 13:58:50 UTC) #72

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

4 years, 2 months ago (2016-10-09 13:58:51 UTC) #73

vakh (use Gerrit instead)

Verify that the checksum check happens async

4 years, 2 months ago (2016-10-10 17:41:15 UTC) #74

vakh (use Gerrit instead)

The CQ bit was checked by vakh@chromium.org to run a CQ dry run

4 years, 2 months ago (2016-10-10 17:41:18 UTC) #75

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2384893002/260001

4 years, 2 months ago (2016-10-10 17:41:31 UTC) #76

vakh (use Gerrit instead)

Also added a test to ensure that the checksum check happens async. https://codereview.chromium.org/2384893002/diff/200001/components/safe_browsing_db/v4_database.cc File components/safe_browsing_db/v4_database.cc ...

4 years, 2 months ago (2016-10-10 17:42:34 UTC) #77

vakh (use Gerrit instead)

The CQ bit was checked by vakh@chromium.org

4 years, 2 months ago (2016-10-10 17:42:54 UTC) #78

vakh (use Gerrit instead)

The patchset sent to the CQ was uploaded after l-g-t-m from nparker@chromium.org, shess@chromium.org Link to ...

4 years, 2 months ago (2016-10-10 17:42:55 UTC) #79

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2384893002/260001

4 years, 2 months ago (2016-10-10 17:43:02 UTC) #80

commit-bot: I haz the power

Description was changed from ========== PVer4: Test checksum on startup outside the hotpath of DB ...

4 years, 2 months ago (2016-10-10 18:26:42 UTC) #81

commit-bot: I haz the power

Description was changed from ========== PVer4: Test checksum on startup outside the hotpath of DB ...

4 years, 2 months ago (2016-10-10 18:29:22 UTC) #83

commit-bot: I haz the power

4 years, 2 months ago (2016-10-10 18:29:24 UTC) #84

Message was sent while issue was closed.

Patchset 14 (id:??) landed as
https://crrev.com/cb8c22f7ca3eedb7e207149f1f75c90c3e6a748b
Cr-Commit-Position: refs/heads/master@{#424189}

Issue 2384893002: PVer4: Test checksum on startup outside the hotpath of DB load (Closed)

Description

Patch Set 1 #

Patch Set 2 : Log UMA metrics per store and total using base_metric and suffixes #

Patch Set 3 : Reset on IO thread #

Patch Set 4 : VerifyChecksum->VerifyChecksumDelayed #

Patch Set 5 : Remove all histogram related changes. That would be a separate CL #

Patch Set 6 : DB posts a VerifyChecksum task for itself. Enables update manager after checking checksum. #

Patch Set 7 : go: design-doc-v4store-verifychecksum -- VerifyChecksum in a way that avoids race conditions betwee… #

Patch Set 8 : Tests mimic the thread-related operations of product code. Tests pass #

Patch Set 9 : Minor: Rebase and s/base::MessageLoop::current()->task_runner()/base::ThreadTaskRunnerHandle::Get() #

Patch Set 10 : Minor: Formatting #

Patch Set 11 : shess@ feedback #

Patch Set 12 : nparker@ feedback, add test for VerifyChecksum == true, possibly fix the asan failures #

Patch Set 13 : surely fix the asan failures. fixed on my machine, at least. #

Patch Set 14 : Verify that the checksum check happens async #

Messages