|
|
Description[Sync] Signal UIModelWorker to abort on sync shutdown.
This should prevent a deadlock between the sync and UI threads during
shutdown.
BUG=663600
Patch Set 1 #
Total comments: 10
Patch Set 2 : Address comments + add test. #
Total comments: 6
Patch Set 3 : Change more things. #Patch Set 4 : Fix iOS hopefully. #
Messages
Total messages: 25 (17 generated)
The CQ bit was checked by maxbogue@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
maxbogue@chromium.org changed reviewers: + fdoray@chromium.org, zea@chromium.org
zea@ or fdoray@, PTAL!
LGTM with nits https://codereview.chromium.org/2496723003/diff/1/components/sync/driver/glue... File components/sync/driver/glue/ui_model_worker.h (right): https://codereview.chromium.org/2496723003/diff/1/components/sync/driver/glue... components/sync/driver/glue/ui_model_worker.h:39: // being torn down, or aborted do to sync stopping. typo: do -> due https://codereview.chromium.org/2496723003/diff/1/components/sync/driver/glue... components/sync/driver/glue/ui_model_worker.h:39: // being torn down, or aborted do to sync stopping. nit: i think it's worth explaining why we need this (namely that the UI thread can join on the sync thread, and this prevents the sync thread from waiting on work on the UI thread)
Actually, how hard would it be to add a unit test for this?
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: win_chromium_rel_ng on master.tryserver.chromium.win (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_rel_...)
https://codereview.chromium.org/2496723003/diff/1/components/sync/driver/glue... File components/sync/driver/glue/ui_model_worker.cc (right): https://codereview.chromium.org/2496723003/diff/1/components/sync/driver/glue... components/sync/driver/glue/ui_model_worker.cc:28: const scoped_refptr<base::SingleThreadTaskRunner>& ui_thread) Pass scoped_refptr by value https://chromium.googlesource.com/chromium/src/+/master/styleguide/c++/c++.md... Relevant discussion: https://groups.google.com/a/chromium.org/d/msg/chromium-dev/TlL1D-Djta0/hDMBe... https://codereview.chromium.org/2496723003/diff/1/components/sync/driver/glue... components/sync/driver/glue/ui_model_worker.cc:30: work_done_or_abandoned_(base::WaitableEvent::ResetPolicy::AUTOMATIC, Should be base::WaitableEvent::ResetPolicy::MANUAL to permanently prevent DoWorkAndWaitUntilDoneImpl() from blocking after RequestStop() is called. https://codereview.chromium.org/2496723003/diff/1/components/sync/driver/glue... components/sync/driver/glue/ui_model_worker.cc:59: work_done_or_abandoned_.Signal(); In sync_integration_tests, this is called before the main MessageLoop is done running. That means that |work_done_or_abandonned_| is signaled and DoWorkAndWaitUntilDoneImpl() returns when the WorkCallback still has a chance to run. The WorkCallback accesses arguments which are no longer guaranteed to exist https://cs.chromium.org/chromium/src/components/sync/engine_impl/directory_up... Maybe we could change CallDoWorkAndSignalEvent() to check if the event is already signaled before running the callback? (event would need to be ResetPolicy::MANUAL)
The CQ bit was checked by maxbogue@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
On 2016/11/10 22:37:23, Nicolas Zea wrote: > Actually, how hard would it be to add a unit test for this? Turns out surprisingly difficult. I managed to come up with one that can potentially false positive but never false negative. PTAL and let me know if you can think of a better way to reproduce the issue... https://codereview.chromium.org/2496723003/diff/1/components/sync/driver/glue... File components/sync/driver/glue/ui_model_worker.cc (right): https://codereview.chromium.org/2496723003/diff/1/components/sync/driver/glue... components/sync/driver/glue/ui_model_worker.cc:28: const scoped_refptr<base::SingleThreadTaskRunner>& ui_thread) On 2016/11/11 15:40:28, fdoray wrote: > Pass scoped_refptr by value > https://chromium.googlesource.com/chromium/src/+/master/styleguide/c++/c++.md... > > Relevant discussion: > https://groups.google.com/a/chromium.org/d/msg/chromium-dev/TlL1D-Djta0/hDMBe... Huh. We do that wrong all over sync code apparently. Fixed. https://codereview.chromium.org/2496723003/diff/1/components/sync/driver/glue... components/sync/driver/glue/ui_model_worker.cc:30: work_done_or_abandoned_(base::WaitableEvent::ResetPolicy::AUTOMATIC, On 2016/11/11 15:40:28, fdoray wrote: > Should be base::WaitableEvent::ResetPolicy::MANUAL to permanently prevent > DoWorkAndWaitUntilDoneImpl() from blocking after RequestStop() is called. Sorta right but for the wrong reason. Needs to be MANUAL so CallDoWorkAndSignalEvent can abort early if it's been set, but it has to be reset inside DoWorkAndWaitUntilDoneImpl before the post to be able to do work more than once. https://codereview.chromium.org/2496723003/diff/1/components/sync/driver/glue... components/sync/driver/glue/ui_model_worker.cc:59: work_done_or_abandoned_.Signal(); On 2016/11/11 15:40:28, fdoray wrote: > In sync_integration_tests, this is called before the main MessageLoop is done > running. That means that |work_done_or_abandonned_| is signaled and > DoWorkAndWaitUntilDoneImpl() returns when the WorkCallback still has a chance to > run. The WorkCallback accesses arguments which are no longer guaranteed to exist > > https://cs.chromium.org/chromium/src/components/sync/engine_impl/directory_up... > > Maybe we could change CallDoWorkAndSignalEvent() to check if the event is > already signaled before running the callback? (event would need to be > ResetPolicy::MANUAL) Done. https://codereview.chromium.org/2496723003/diff/1/components/sync/driver/glue... File components/sync/driver/glue/ui_model_worker.h (right): https://codereview.chromium.org/2496723003/diff/1/components/sync/driver/glue... components/sync/driver/glue/ui_model_worker.h:39: // being torn down, or aborted do to sync stopping. On 2016/11/10 22:27:07, Nicolas Zea wrote: > typo: do -> due Done. https://codereview.chromium.org/2496723003/diff/1/components/sync/driver/glue... components/sync/driver/glue/ui_model_worker.h:39: // being torn down, or aborted do to sync stopping. On 2016/11/10 22:27:07, Nicolas Zea wrote: > nit: i think it's worth explaining why we need this (namely that the UI thread > can join on the sync thread, and this prevents the sync thread from waiting on > work on the UI thread) Done.
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: linux_chromium_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_...)
https://codereview.chromium.org/2496723003/diff/20001/components/sync/driver/... File components/sync/driver/glue/ui_model_worker.cc (right): https://codereview.chromium.org/2496723003/diff/20001/components/sync/driver/... components/sync/driver/glue/ui_model_worker.cc:49: work_done_or_abandoned_.Reset(); This can happen: THREAD ACTION Sync: DoWorkAndWaitUntilDone() UI: UIModelWorker::RequestStop() Sync: DoWorkAndWaitUntilDoneImpl() Sync: work_done_or_abandoned_.Reset() !!! Sync: ui_thread_->PostTask(...) Sync: work_done_or_abandoned_.Wait() !!! Blocked forever For that reason, I would use 2 events: SyncerError UIModelWorker::DoWorkAndWaitUntilDoneImpl(...) { ... // Signaled when the task is deleted, i.e. after it runs or when it is // abandoned. base::WaitableEvent work_done_or_abandoned( base::WaitableEvent::ResetPolicy::AUTOMATIC, base::WaitableEvent::InitialState::NOT_SIGNALED); if (!ui_thread_->PostTask(...)) { ... } base::WaitableEvent* events[] = {&work_done_or_abandoned, &stop_requested_}; base::WaitableEvent::WaitMany(events, arraysize(events)); ... } void UIModelWorker::RequestStop() { DCHECK(ui_thread_->BelongsToCurrentThread()); ModelSafeWorker::RequestStop(); stop_requested_.Signal(); } WDYT? https://codereview.chromium.org/2496723003/diff/20001/components/sync/driver/... File components/sync/driver/glue/ui_model_worker_unittest.cc (right): https://codereview.chromium.org/2496723003/diff/20001/components/sync/driver/... components/sync/driver/glue/ui_model_worker_unittest.cc:39: if (event) EXPECT_TRUE(event) since we don't expect |event| to be nullptr https://codereview.chromium.org/2496723003/diff/20001/components/sync/driver/... components/sync/driver/glue/ui_model_worker_unittest.cc:91: // we just get a false positive (no flakes). I think the test below avoids false positives (the test hangs unless it goes through the stopped_.IsSet() check in ModelSafeWorker::DoWorkAndWaitUntilDone, posts the WorkCallback to the UI thread, starts waiting for the WorkCallback and stops waiting when RequestStop is invoked). class SyncUIModelWorkerTest : public testing::Test { ... void StopSyncThread(Closure quit_closure) { // This should cause the sync thread to stop waiting for // this WorkCallback. worker()->RequestStop(); // This should not deadlock since the sync thread stopped // waiting for this WorkCallback. sync_thread_.Stop(); // This should unblock the run_loop.Run() call in the test body. quit_closure.Run(); } ... }; TEST_F(SyncUIModelWorkerTest, ShutdownNoDeadlock) { base::RunLoop run_loop; PostWorkToSyncThread(base::Bind(&SyncUIModelWorkerTest::StopSyncThread, base::Unretained(this), run_loop.QuitClosure())); // This blocks until run_loop.QuitClosure() is invoked // by the WorkCallback above. run_loop.Run(); }
The CQ bit was checked by maxbogue@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: ios-simulator on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-simulator/bui...)
The CQ bit was checked by maxbogue@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
PTAL at most recent changes; I'm still working on why TwoClientsPasswordManagerSettingMigratorServiceSyncTest.* are failing on some bots. I can't repro locally; gonna try a release build. https://codereview.chromium.org/2496723003/diff/20001/components/sync/driver/... File components/sync/driver/glue/ui_model_worker.cc (right): https://codereview.chromium.org/2496723003/diff/20001/components/sync/driver/... components/sync/driver/glue/ui_model_worker.cc:49: work_done_or_abandoned_.Reset(); On 2016/11/14 14:24:34, fdoray wrote: > This can happen: > THREAD ACTION > Sync: DoWorkAndWaitUntilDone() > UI: UIModelWorker::RequestStop() > Sync: DoWorkAndWaitUntilDoneImpl() > Sync: work_done_or_abandoned_.Reset() !!! > Sync: ui_thread_->PostTask(...) > Sync: work_done_or_abandoned_.Wait() !!! Blocked forever > > For that reason, I would use 2 events: > > SyncerError UIModelWorker::DoWorkAndWaitUntilDoneImpl(...) { > ... > > // Signaled when the task is deleted, i.e. after it runs or when it is > // abandoned. > base::WaitableEvent work_done_or_abandoned( > base::WaitableEvent::ResetPolicy::AUTOMATIC, > base::WaitableEvent::InitialState::NOT_SIGNALED); > > if (!ui_thread_->PostTask(...)) { > ... > } > > base::WaitableEvent* events[] = {&work_done_or_abandoned, &stop_requested_}; > base::WaitableEvent::WaitMany(events, arraysize(events)); > > ... > } > > void UIModelWorker::RequestStop() { > DCHECK(ui_thread_->BelongsToCurrentThread()); > ModelSafeWorker::RequestStop(); > stop_requested_.Signal(); > } > > WDYT? This actually occurred to me on Friday right before I stopped working... I'm just going to pass worker as a parameter and check IsStopped before running the work, since that's more "the right thing to do". I didn't know about WaitMany though, that's neat. https://codereview.chromium.org/2496723003/diff/20001/components/sync/driver/... File components/sync/driver/glue/ui_model_worker_unittest.cc (right): https://codereview.chromium.org/2496723003/diff/20001/components/sync/driver/... components/sync/driver/glue/ui_model_worker_unittest.cc:39: if (event) On 2016/11/14 14:24:34, fdoray wrote: > EXPECT_TRUE(event) since we don't expect |event| to be nullptr n/a, event is no longer needed. https://codereview.chromium.org/2496723003/diff/20001/components/sync/driver/... components/sync/driver/glue/ui_model_worker_unittest.cc:91: // we just get a false positive (no flakes). On 2016/11/14 14:24:34, fdoray wrote: > I think the test below avoids false positives (the test hangs unless it goes > through the stopped_.IsSet() check in ModelSafeWorker::DoWorkAndWaitUntilDone, > posts the WorkCallback to the UI thread, starts waiting for the WorkCallback and > stops waiting when RequestStop is invoked). > > class SyncUIModelWorkerTest : public testing::Test { > ... > > void StopSyncThread(Closure quit_closure) { > // This should cause the sync thread to stop waiting for > // this WorkCallback. > worker()->RequestStop(); > > // This should not deadlock since the sync thread stopped > // waiting for this WorkCallback. > sync_thread_.Stop(); > > // This should unblock the run_loop.Run() call in the test body. > quit_closure.Run(); > } > > ... > }; > > TEST_F(SyncUIModelWorkerTest, ShutdownNoDeadlock) { > base::RunLoop run_loop; > > PostWorkToSyncThread(base::Bind(&SyncUIModelWorkerTest::StopSyncThread, > base::Unretained(this), > run_loop.QuitClosure())); > > // This blocks until run_loop.QuitClosure() is invoked > // by the WorkCallback above. > run_loop.Run(); > } Of course, just do everything inside the work closure. Clever, thanks!
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: ios-simulator on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-simulator/bui...)
This has turned out to be far trickier than anticipated, so fdoray@ is going to take over from here. |