|
|
Created:
4 years ago by Scott Hess - ex-Googler Modified:
3 years, 11 months ago CC:
chromium-reviews, vakh+watch_chromium.org Target Ref:
refs/pending/heads/master Project:
chromium Visibility:
Public. |
DescriptionFix race condition with CancelCheck().
In V4LocalDatabaseManager, HandleCheck() does a PostTask() to
PerformFullHashCheck(), which adds the check to pending_clients_. If the caller
calls CancelCheck() before posted task runs, the check won't be cancelled.
BUG=660293
Review-Url: https://codereview.chromium.org/2565283008
Cr-Commit-Position: refs/heads/master@{#441730}
Committed: https://chromium.googlesource.com/chromium/src/+/2bce5de9484c07967bda1ffecff891c85423b659
Patch Set 1 #
Total comments: 6
Patch Set 2 : Add some tests. #
Total comments: 10
Patch Set 3 : rebase #Patch Set 4 : reqs for each platform. #
Total comments: 4
Messages
Total messages: 44 (22 generated)
The CQ bit was checked by shess@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
shess@chromium.org changed reviewers: + nparker@chromium.org
I set out to write a test which would show OnCheckBrowseUrlResult() getting called after a CancelCheck(), but ... none of the existing tests in v4_ldbm_unittest.cc use a fake protocol manager, so there's no way to pump events end to end. I think it's possible to implement that, but maybe I should double-check if you have an aversion to the change before doing that work. Or maybe you'll say "Varun was concerned about whether he'll have enough to do when he gets back, checkin as-is." I don't really think this will fix the bug, it's more of a drive-by.
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
Nice catch. Do you think this could cause the crashes we were seeing? I have a few questions on the impl, but otherwise l-g-t-m. As for adding more tests: Yes, go for it. https://codereview.chromium.org/2565283008/diff/1/components/safe_browsing_db... File components/safe_browsing_db/v4_local_database_manager.cc (right): https://codereview.chromium.org/2565283008/diff/1/components/safe_browsing_db... components/safe_browsing_db/v4_local_database_manager.cc:586: DCHECK(!full_hash_to_store_and_hash_prefixes.empty()); Maybe add a comment that the caller should have added the client to pending_clients_? https://codereview.chromium.org/2565283008/diff/1/components/safe_browsing_db... components/safe_browsing_db/v4_local_database_manager.cc:597: // Steal the queue to protect against reentrant CancelCheck() calls. Is this if they call Cancel() from within RespondToClient --> client->OnCheckBrowseUrl? Otherwise, the Cancel call has to come on the IO thread so it wouldn't happen here.
On 2016/12/14 18:03:04, Nathan Parker wrote: > Nice catch. Do you think this could cause the crashes we were seeing? I don't _think_ so. Since we're using raw pointers, it's possible that a lost cancel plus a new request could end up with the same client object, but I'd expect to see other evidence of something like that. > As for adding more tests: Yes, go for it. OK, I'll poke at that and get back to you. https://codereview.chromium.org/2565283008/diff/1/components/safe_browsing_db... File components/safe_browsing_db/v4_local_database_manager.cc (right): https://codereview.chromium.org/2565283008/diff/1/components/safe_browsing_db... components/safe_browsing_db/v4_local_database_manager.cc:586: DCHECK(!full_hash_to_store_and_hash_prefixes.empty()); On 2016/12/14 18:03:03, Nathan Parker wrote: > Maybe add a comment that the caller should have added the client to > pending_clients_? Should have, but if CancelCheck() was called between task post and task run, it won't be there. I suppose this code could check if cancel occurred and drop the request. Not sure if that's worth additional code on every request to catch an edge case which probably never happens, but pending_checks_ should be pretty hot and empty, right? https://codereview.chromium.org/2565283008/diff/1/components/safe_browsing_db... components/safe_browsing_db/v4_local_database_manager.cc:597: // Steal the queue to protect against reentrant CancelCheck() calls. On 2016/12/14 18:03:04, Nathan Parker wrote: > Is this if they call Cancel() from within RespondToClient --> > client->OnCheckBrowseUrl? Otherwise, the Cancel call has to come on the IO > thread so it wouldn't happen here. Yes, that is my concern. We're calling out to external code and relying on it to not make certain calls which would disrupt internal structures while in use. I don't _think_ any clients chain requests together oddly (with one response cancelling a different request), but I think this case could even come up if the client being processed cancels itself (perhaps in destruction). In both functions, we could run past the end of the queue, so lacking any diagnostics complaining about that probably indicates it's not happening. But I don't know how the foreach iterator works if the container changes contents in the body.
https://codereview.chromium.org/2565283008/diff/1/components/safe_browsing_db... File components/safe_browsing_db/v4_local_database_manager.cc (right): https://codereview.chromium.org/2565283008/diff/1/components/safe_browsing_db... components/safe_browsing_db/v4_local_database_manager.cc:597: // Steal the queue to protect against reentrant CancelCheck() calls. On 2016/12/14 20:45:58, Scott Hess wrote: > In both functions, we could run past the end of the queue, so lacking any > diagnostics complaining about that probably indicates it's not happening. But I > don't know how the foreach iterator works if the container changes contents in > the body. OK, it looks like for(:) implementation expects to be able to cache end(), so it does not adjust to operations such as erase. So the prior code is plausibly incorrect, but the fact that we've not seen crashes suggesting that we're processing past the end of the queue probably means it never comes up in practice. That said, swap-to-temporary should be approximately the same cost. Maybe even a bit better, since it gets rid of the space for items in queued_checks_, which will reduce Chrome's memory usage by a trillionth.
Add some tests.
The CQ bit was checked by shess@chromium.org to run a CQ dry run
OK, tests. https://codereview.chromium.org/2565283008/diff/1/components/safe_browsing_db... File components/safe_browsing_db/v4_local_database_manager.cc (right): https://codereview.chromium.org/2565283008/diff/1/components/safe_browsing_db... components/safe_browsing_db/v4_local_database_manager.cc:604: RespondToClient(std::move(it)); Ha, it took some work, but the amusing result is that this code is moving the pointer from the vector, so if RespondToClient() leads to a CancelCheck() call, you get a crash because of a NULL-deref. https://codereview.chromium.org/2565283008/diff/20001/components/safe_browsin... File components/safe_browsing_db/v4_local_database_manager_unittest.cc (right): https://codereview.chromium.org/2565283008/diff/20001/components/safe_browsin... components/safe_browsing_db/v4_local_database_manager_unittest.cc:104: manager_to_cancel_->CancelCheck(this); Any CancelCheck() with items in the queue during queue flushing causes problems. So I didn't go for something more realistic, like a closure to cancel against the second client. https://codereview.chromium.org/2565283008/diff/20001/components/safe_browsin... components/safe_browsing_db/v4_local_database_manager_unittest.cc:346: "pIAEgAyAEIAUgBg==&$ct=application/x-protobuf&key=test_key_param"), I'd love suggestions about this. AFAICT, FakeURLFetcherFactory can't bind a prefix, it has to have the entire URL. Doesn't seem to have a generic fallback handler, either. The best I could think of would be to expose the URL-construction API so that this code could construct "The URL for the appropriate fullhash", which also seemed intrusive. [Or maybe I missed something about mocking the get-hash-protocol-manager.] https://codereview.chromium.org/2565283008/diff/20001/components/safe_browsin... components/safe_browsing_db/v4_local_database_manager_unittest.cc:372: EXPECT_FALSE(client.result_received_); Pre-patch, this expectation failed. https://codereview.chromium.org/2565283008/diff/20001/components/safe_browsin... components/safe_browsing_db/v4_local_database_manager_unittest.cc:377: // CancelCheck() is not fatal in the client callback. This does not test the RespondSafeToQueuedChecks() path, which happens if StopOnIOThread() is called with queued checks. In that case, the code would have to work around the DCHECK(enabled_) in CancelCheck().
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: win_chromium_rel_ng on master.tryserver.chromium.win (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_rel_...)
lgtm https://codereview.chromium.org/2565283008/diff/20001/components/safe_browsin... File components/safe_browsing_db/v4_local_database_manager_unittest.cc (right): https://codereview.chromium.org/2565283008/diff/20001/components/safe_browsin... components/safe_browsing_db/v4_local_database_manager_unittest.cc:104: manager_to_cancel_->CancelCheck(this); On 2016/12/15 01:14:13, Scott Hess wrote: > Any CancelCheck() with items in the queue during queue flushing causes problems. > So I didn't go for something more realistic, like a closure to cancel against > the second client. Acknowledged. https://codereview.chromium.org/2565283008/diff/20001/components/safe_browsin... components/safe_browsing_db/v4_local_database_manager_unittest.cc:346: "pIAEgAyAEIAUgBg==&$ct=application/x-protobuf&key=test_key_param"), On 2016/12/15 01:14:13, Scott Hess wrote: > I'd love suggestions about this. AFAICT, FakeURLFetcherFactory can't bind a > prefix, it has to have the entire URL. Doesn't seem to have a generic fallback > handler, either. The best I could think of would be to expose the > URL-construction API so that this code could construct "The URL for the > appropriate fullhash", which also seemed intrusive. > > [Or maybe I missed something about mocking the get-hash-protocol-manager.] Matching a prefix or having default seems like it'd be pretty useful, I agree. I think you could use a TestUrlFetcherFactory and then call OnURLFetchComplete on the right fetcher, with the response. But that would run it on the test's thread. There is V4GetHashProtocolManager::RegisterFactory(..) which looks like it was intended for this purpose, but was never used. You could create a mock or just override the protocol-manager such that when GetFullHashes(..) is called, it stashes the callback so you could invoke it later. https://codereview.chromium.org/2565283008/diff/20001/components/safe_browsin... components/safe_browsing_db/v4_local_database_manager_unittest.cc:372: EXPECT_FALSE(client.result_received_); On 2016/12/15 01:14:13, Scott Hess wrote: > Pre-patch, this expectation failed. Wait, that seems to imply that all cancels were broken? https://codereview.chromium.org/2565283008/diff/20001/components/safe_browsin... components/safe_browsing_db/v4_local_database_manager_unittest.cc:377: // CancelCheck() is not fatal in the client callback. On 2016/12/15 01:14:13, Scott Hess wrote: > This does not test the RespondSafeToQueuedChecks() path, which happens if > StopOnIOThread() is called with queued checks. In that case, the code would > have to work around the DCHECK(enabled_) in CancelCheck(). Probably fine
vakh@chromium.org changed reviewers: + vakh@chromium.org
lgtm Thanks for finding and fixing this issue! https://codereview.chromium.org/2565283008/diff/20001/components/safe_browsin... File components/safe_browsing_db/v4_local_database_manager_unittest.cc (right): https://codereview.chromium.org/2565283008/diff/20001/components/safe_browsin... components/safe_browsing_db/v4_local_database_manager_unittest.cc:372: EXPECT_FALSE(client.result_received_); On 2016/12/16 22:37:13, Nathan Parker wrote: > On 2016/12/15 01:14:13, Scott Hess wrote: > > Pre-patch, this expectation failed. > Wait, that seems to imply that all cancels were broken? No, it means that any CancelCheck calls made between calling CheckBrowseUrl and execution of PerformFullHashChecks were broken.
rebase
The CQ bit was checked by shess@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
OK, giving this a rebase. I think the breaks were probably a merge conflict which I can't be bothered to research the specifics of. https://codereview.chromium.org/2565283008/diff/20001/components/safe_browsin... File components/safe_browsing_db/v4_local_database_manager_unittest.cc (right): https://codereview.chromium.org/2565283008/diff/20001/components/safe_browsin... components/safe_browsing_db/v4_local_database_manager_unittest.cc:346: "pIAEgAyAEIAUgBg==&$ct=application/x-protobuf&key=test_key_param"), On 2016/12/16 22:37:13, Nathan Parker wrote: > On 2016/12/15 01:14:13, Scott Hess wrote: > > I'd love suggestions about this. AFAICT, FakeURLFetcherFactory can't bind a > > prefix, it has to have the entire URL. Doesn't seem to have a generic > fallback > > handler, either. The best I could think of would be to expose the > > URL-construction API so that this code could construct "The URL for the > > appropriate fullhash", which also seemed intrusive. > > > > [Or maybe I missed something about mocking the get-hash-protocol-manager.] > > Matching a prefix or having default seems like it'd be pretty useful, I agree. I > think you could use a TestUrlFetcherFactory and then call OnURLFetchComplete on > the right fetcher, with the response. But that would run it on the test's > thread. > > There is V4GetHashProtocolManager::RegisterFactory(..) which looks like it was > intended for this purpose, but was never used. You could create a mock or just > override the protocol-manager such that when GetFullHashes(..) is called, it > stashes the callback so you could invoke it later. Varun, did you have an opinion on this hard-coded URL? I'm somewhat inclined to just check things in and aggressively disable if something goes awry, since I worry that if I step out to add more mocking, I'll entirely forget what I'm doing!
> Varun, did you have an opinion on this hard-coded URL? > > I'm somewhat inclined to just check things in and aggressively disable if > something goes awry, since I worry that if I step out to add more mocking, I'll > entirely forget what I'm doing! No, I think it is more important to get this change in. The hardcoding isn't ideal but is acceptable.
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: linux_android_rel_ng on master.tryserver.chromium.android (JOB_FAILED, https://build.chromium.org/p/tryserver.chromium.android/builders/linux_androi...) linux_chromium_chromeos_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_...)
On 2017/01/05 00:08:45, vakh (Varun Khaneja) wrote: > > Varun, did you have an opinion on this hard-coded URL? > > > > I'm somewhat inclined to just check things in and aggressively disable if > > something goes awry, since I worry that if I step out to add more mocking, > I'll > > entirely forget what I'm doing! > > No, I think it is more important to get this change in. > The hardcoding isn't ideal but is acceptable. Hey, I found an argument! The URL is platform-specific. I could load each of OSX, Linux, and Windows easily enough, I guess, or we could just disable the new test for now (or both, because enumerating the URLs seems pretty brittle).
reqs for each platform.
The CQ bit was checked by shess@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
On 2017/01/05 01:23:11, Scott Hess wrote: > On 2017/01/05 00:08:45, vakh (Varun Khaneja) wrote: > > > Varun, did you have an opinion on this hard-coded URL? > > > > > > I'm somewhat inclined to just check things in and aggressively disable if > > > something goes awry, since I worry that if I step out to add more mocking, > > I'll > > > entirely forget what I'm doing! > > > > No, I think it is more important to get this change in. > > The hardcoding isn't ideal but is acceptable. > > Hey, I found an argument! The URL is platform-specific. I could load each of > OSX, Linux, and Windows easily enough, I guess, or we could just disable the new > test for now (or both, because enumerating the URLs seems pretty brittle). OK, patched something together - WDYT? I'll go look at mocking the other thing out on the side. I seem to recall there was a reason I didn't do that earlier.
On 2017/01/05 01:23:11, Scott Hess wrote: > On 2017/01/05 00:08:45, vakh (Varun Khaneja) wrote: > > > Varun, did you have an opinion on this hard-coded URL? > > > > > > I'm somewhat inclined to just check things in and aggressively disable if > > > something goes awry, since I worry that if I step out to add more mocking, > > I'll > > > entirely forget what I'm doing! > > > > No, I think it is more important to get this change in. > > The hardcoding isn't ideal but is acceptable. > > Hey, I found an argument! The URL is platform-specific. I could load each of > OSX, Linux, and Windows easily enough, I guess, or we could just disable the new > test for now (or both, because enumerating the URLs seems pretty brittle). OK, patched something together - WDYT? I'll go look at mocking the other thing out on the side. I seem to recall there was a reason I didn't do that earlier.
lgtm https://codereview.chromium.org/2565283008/diff/60001/components/safe_browsin... File components/safe_browsing_db/v4_local_database_manager_unittest.cc (right): https://codereview.chromium.org/2565283008/diff/60001/components/safe_browsin... components/safe_browsing_db/v4_local_database_manager_unittest.cc:377: // TODO(shess): Modify this to use a mock protocol manager instead you can change this to vakh, if you like and don't mind.
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
The CQ bit was checked by shess@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from nparker@chromium.org Link to the patchset: https://codereview.chromium.org/2565283008/#ps60001 (title: "reqs for each platform.")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Sheriffs: This has a known platform-specific bit. I've covered all the platforms that the dry run handles, but in my experience the waterfall and try-bots aren't always congruent. So if you're here because V4LocalDatabaseManagerTest.CancelPending is failing, revert revert revert.
CQ is committing da patch. Bot data: {"patchset_id": 60001, "attempt_start_ts": 1483644480900460, "parent_rev": "930c8e9b9bf2c2c2bd4ed42ab4939562fa6402d4", "commit_rev": "2bce5de9484c07967bda1ffecff891c85423b659"}
Message was sent while issue was closed.
Description was changed from ========== Fix race condition with CancelCheck(). In V4LocalDatabaseManager, HandleCheck() does a PostTask() to PerformFullHashCheck(), which adds the check to pending_clients_. If the caller calls CancelCheck() before posted task runs, the check won't be cancelled. BUG=660293 ========== to ========== Fix race condition with CancelCheck(). In V4LocalDatabaseManager, HandleCheck() does a PostTask() to PerformFullHashCheck(), which adds the check to pending_clients_. If the caller calls CancelCheck() before posted task runs, the check won't be cancelled. BUG=660293 Review-Url: https://codereview.chromium.org/2565283008 Cr-Commit-Position: refs/heads/master@{#441730} Committed: https://chromium.googlesource.com/chromium/src/+/2bce5de9484c07967bda1ffecff8... ==========
Message was sent while issue was closed.
Committed patchset #4 (id:60001) as https://chromium.googlesource.com/chromium/src/+/2bce5de9484c07967bda1ffecff8...
Message was sent while issue was closed.
lgtm https://codereview.chromium.org/2565283008/diff/60001/components/safe_browsin... File components/safe_browsing_db/v4_local_database_manager_unittest.cc (right): https://codereview.chromium.org/2565283008/diff/60001/components/safe_browsin... components/safe_browsing_db/v4_local_database_manager_unittest.cc:381: "Cg8KCHVuaXR0ZXN0EgMxLjAaJwgBCAIIAwgGCAcICAgJCAoQBBAIGgcKBWVXGg-" wat. Proto wire formats vary by platform? Isn't that the point of... a standard format? Does this print the request if one is not found? That'll make it easy to fix at least. Really the Factory needs a "SafeFakeDefaultResponse" method. Really.
Message was sent while issue was closed.
https://codereview.chromium.org/2565283008/diff/60001/components/safe_browsin... File components/safe_browsing_db/v4_local_database_manager_unittest.cc (right): https://codereview.chromium.org/2565283008/diff/60001/components/safe_browsin... components/safe_browsing_db/v4_local_database_manager_unittest.cc:381: "Cg8KCHVuaXR0ZXN0EgMxLjAaJwgBCAIIAwgGCAcICAgJCAoQBBAIGgcKBWVXGg-" On 2017/01/05 21:35:32, Nathan Parker wrote: > wat. Proto wire formats vary by platform? Isn't that the point of... a > standard format? The proto is portable - but each platform puts a different type in there (LINUX_PLATFORM, WINDOWS_PLATFORM, etc). > Does this print the request if one is not found? That'll make it easy to fix at > least. Yeah, I pulled these together from trybot logs. > Really the Factory needs a "SafeFakeDefaultResponse" method. Really. Or a prefix-based version or a callback-based version. But at least the callback-based option seems so obvious that I have to imagine they explicitly decided against it.
Message was sent while issue was closed.
https://codereview.chromium.org/2565283008/diff/60001/components/safe_browsin... File components/safe_browsing_db/v4_local_database_manager_unittest.cc (right): https://codereview.chromium.org/2565283008/diff/60001/components/safe_browsin... components/safe_browsing_db/v4_local_database_manager_unittest.cc:379: const char* kReqs[] = { QQ ... I kinda messed up my emacs init.el, so was having to do formatting manually rather than using google-c-style.el. So when git-cl said to run git-cl-format to reform this, and it made it four-space indent, I was like "What? Maybe I just don't remember things right." But now I've fixed things, and google-c-style.el definitely wants these to be two-space indented. Did something change? |