Created: 3 years, 7 months ago by Marijn Kruisselbrink
Modified: 3 years, 7 months ago
CC: chromium-reviews, darin-cc_chromium.org, fdoray, jam, robliao
Target Ref: refs/heads/master
Project: chromium
Visibility: Public
Description
Fix possible deadlock during shutdown by not using the FILE thread.
The leveldb service needs to be able to call code in the file service,
even during browser shutdown. If the file service runs on the file
thread, this will fail, leading to deadlock/timeout. Instead move
the file service to a new SingleThreadTaskRunner managed by the task
scheduler.
Also re-enable tests that should hopefully be fixed by this change.
BUG=712872
BUG=586194
Review-Url: https://codereview.chromium.org/2852333004
Cr-Commit-Position: refs/heads/master@{#470169}
Committed: https://chromium.googlesource.com/chromium/src/+/17399e67ce617b8936b5088dba92abb92b0cb8ee
Patch Set 1 #
Patch Set 2 : temporary change to show problem with task scheduler shutdown #
Total comments: 1
Patch Set 3 : undo temporary change, and add comment linking to mojo bug #
Total comments: 5
Patch Set 4 : use new way of specifying task traits #
Messages
Total messages: 38 (19 generated)
The CQ bit was checked by mek@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was checked by mek@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: linux_chromium_chromeos_ozone_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_...)
mek@chromium.org changed reviewers: + gab@chromium.org
gab: this change seems like a fairly straightforward replacement of a BrowserThread usage with a task-scheduler-managed thread. But I'm not sure how to deal with the resulting test failures (caused by TaskRunners outliving the task scheduler). For more context/explanation of what is going on, see also the email on scheduler-dev I sent about this yesterday.
On 2017/05/03 19:09:21, Marijn Kruisselbrink wrote:
> gab: this change seems like a fairly straightforward replacement of a BrowserThread usage with a task-scheduler-managed thread. But I'm not sure how to deal with the resulting test failures (caused by TaskRunners outliving the task scheduler). For more context/explanation of what is going on, see also the email on scheduler-dev I sent about this yesterday.

So the reason ExtensionInfoGeneratorUnitTest.BasicInfoTest is failing is valid, I think (and it's unfortunate that BrowserThread::FILE/TestBrowserThreadBundle wasn't catching that before).

What's happening is that ExtensionInfoGeneratorUnitTest_BasicInfoTest -> ExtensionServiceTestBase::ResetThreadBundle() is effectively saying it wants to scrap the current execution environment (I don't see why it makes sense to do that, honestly). In doing so it brings down the MessageLoops and worker threads that were brought up in the SetUp() phase... switching the task environment mid-flight in a unit test seems like a bad idea to me.

The intent of ExtensionServiceTestBase::ResetThreadBundle() appears to be to allow running ExtensionServiceTestBase in a non-IO_MAINLOOP environment (which it defaults to, ref. kThreadOptions in extension_service_test_base.cc).

But this is the wrong way to do that... the task environment needs to be the first thing set up and the last one torn down (or at least it must not be set up after, nor torn down before, anything that uses it). So to make it configurable, that fixture's options should be tweaked in the constructor or something...

I realize this is outside the scope of what you're trying to do; we were going to start manually migrating a few FILE thread callsites in the coming days to try and weed out such issues, but it looks like you beat us to it :) -- the solution IMO is to fix ExtensionServiceTestBase to be configurable earlier instead of wholesale resetting its task environment.
https://codereview.chromium.org/2852333004/diff/20001/content/browser/browser...
File content/browser/browser_context.cc (right):

https://codereview.chromium.org/2852333004/diff/20001/content/browser/browser...
content/browser/browser_context.cc:484: // runner when mojo supports that.
Add http://crbug.com/678155 for ease of searching.

But overall, curious why this has to do with Mojo? Mojo isn't happy running its services on sequences per se, but it shouldn't care that "file_service_runner_" is a SequencedTaskRunner (just requires adjusting FileService's interface to allow that? it doesn't appear to be truly thread-affine, rather just thread-unsafe == sequence-affine?)
On 2017/05/03 at 20:18:57, gab wrote:
> [...] the task environment needs to be the first thing set up and the last one torn down (or at least it must not be set up after, nor torn down before, anything that uses it). So to make it configurable, that fixture's options should be tweaked in the constructor or something...
> I realize this is outside the scope of what you're trying to do [...] -- the solution IMO is to fix ExtensionServiceTestBase to be configurable earlier instead of wholesale resetting its task environment.

I think ResetThreadBundle is kind of a red herring here. If I combine this CL with another in-flight CL (https://codereview.chromium.org/2861473002, which moves some unrelated code from the UI to the IO thread to fix other threading issues...), more tests start failing with the same error, but this time during test harness destruction (see https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel... for a test run of the combination of this CL and that other CL).

I of course agree that ideally nothing should try to use the task environment after shutting it down. But it's not even clear that that is what is happening. What the DCHECK is complaining about is that there is still code holding references to a TaskRunner, not that code is actively trying to use those TaskRunners. Also, for other TaskRunners it's perfectly fine to hold references to them after the system managing them has shut down: PostTask will simply start returning false, since the tasks will never run again. Why can't these TaskRunners do the same? As explained in my email, actually making sure no references to the task runners remain is very much non-trivial, in particular with mojo copying references to these task runners all over the place (with StrongBinding being particularly annoying, in that it generally manages its own lifetime and intentionally no longer tries to observe the task runner going away).

> https://codereview.chromium.org/2852333004/diff/20001/content/browser/browser...
> File content/browser/browser_context.cc (right):
> content/browser/browser_context.cc:484: // runner when mojo supports that.
> Add http://crbug.com/678155 for ease of searching.
> But overall curious why this has to do with Mojo? [...] it doesn't appear to be truly thread-affine, rather just thread-unsafe == sequence-affine?

Pretty much every mojo-related class (binding, interface ptr, etc.) uses base::ThreadTaskRunnerHandle::Get() to get the task runner to run callbacks on. So in effect every bit of code that tries to either call a mojo service or implement a mojo service currently has to live on a SingleThreadTaskRunner. (I agree that there doesn't necessarily seem to be any good reason why mojo would require thread-affinity rather than sequence-affinity, but that just is the current situation...)
On 2017/05/03 21:06:29, Marijn Kruisselbrink wrote:
> [...] What the DCHECK is complaining about is that there still is code with references to a TaskRunner, not that code is actively trying to use those TaskRunners. Also for other TaskRunners it's perfectly fine to hold references to them after the system managing them has shut down. PostTask will simply start returning false, since the tasks will never run again. Why can't these TaskRunners do the same? [...]

I agree with you for production code, and for that very reason in production we only call TaskScheduler::Shutdown() to stop it but never actually tear down its state (PostTask will return false, etc.).

In tests, however, we have to tear things down between each test. As such, the task environment should be set up first and brought down last.

Looks like this other failure you're hitting is another fallout of ExtensionServiceTestBase managing its TestBrowserThreadBundle incorrectly (in that case I'd suspect that having the content::RenderViewHostTestEnabler member earlier in the list is the problem, as the task environment will be ripped out from under it on tear down). I'm looking into improving documentation and DCHECKs to make it clearer what to do when such an ordering issue arises in tests (and fixing ExtensionServiceTestBase).

> Pretty much every mojo related class (binding, interfaceptr, etc) uses base::ThreadTaskRunnerHandle::Get() to get the task runner to run callbacks on. So in effect every bit of code that tries to either call a mojo service or implement a mojo service currently has to live on a SingleThreadTaskRunner. (I agree that there doesn't seem to necessarily be any good reason why mojo would require thread-affine, rather than sequence-affine, but that just is the current situation...)

Right... well, let's at least link to http://crbug.com/678155 so we can come back and address these when Mojo fixes that.
On 2017/05/04 18:45:48, gab wrote:
> I agree with you for production code and for that very reason in production we only call TaskScheduler::Shutdown() to stop it but never actually tear down its state (post tasks will return false, etc.).
> In tests however we have to tear things down between each test. As such the task environment should be set up first and brought down last.

FWIW, in the absence of this DCHECK, the test would crash later when calling ~SchedulerSingleThreadTaskRunner(), dereferencing |outer_| after its deletion.
Improved documentation and check in https://codereview.chromium.org/2860063003/, so that you should now get the stack traces for both destructors happening out-of-order in the test fixture. But like I said, the issue is very likely with ExtensionServiceTestBase not keeping its task environment alive long enough (and also resetting it mid-flight!)
On 2017/05/04 at 18:45:48, gab wrote:
> Looks like this other failure you're hitting is another fallout of ExtensionServiceTestBase managing its TestBrowserThreadBundle incorrectly [...]
> FWIW, in the absence of this DCHECK, the test would crash later when calling ~SchedulerSingleThreadTaskRunner() when dereferencing |outer_| after its deletion.

Actually, changing that order didn't do anything. And in the absence of the DCHECK things just worked. But I did (mostly) figure out what was going on. My other CL (https://codereview.chromium.org/2861473002) changed some shutdown logic from being sync to async, and somehow even the various flushing and running-until-idle done as part of ~TestBrowserThreadBundle wasn't enough to let the async shutdown logic finish. Not sure where things were failing, though. Anyway, I changed something in that other CL that happens to have as a side effect that in these particular tests the shutdown logic happens to still be sync, so the problem no longer manifests itself. Not entirely satisfying, as there still seems to be a problem with should-be-shutdown-blocking tasks not actually blocking shutdown (well, I'm not sure all of these tasks on the many threads involved are actually shutdown-blocking; and there are probably still problems remaining where a thread posts a task to another thread and expects a reply, but only one of those directions works with the way BrowserThreads are flushed/destroyed).

Hopefully, once more gets moved to the TaskScheduler rather than the various BrowserThreads, those things get solved as well, and these multi-thread async shutdown sequences actually work more reliably during task scheduler flushing. For now at least, I'm satisfied that this seems to work fine, and any remaining problems are definitely unrelated to this change.

> Right... well, let's at least link to http://crbug.com/678155 so we can come back and address these when Mojo fixes that.

Added a link to that.

On 2017/05/04 at 21:43:11, gab wrote:
> Improved documentation and check in https://codereview.chromium.org/2860063003/ so that you should now get the stack traces for both destructors happening out-of-order in the test fixture.
> But like I said, the issue is very likely with ExtensionServiceTestBase not keeping its task environment alive long enough (and also resetting it mid-flight!)

Unfortunately, the second stack trace would never actually fire for me, since the various browser threads that were already shut down were needed to make progress to actually release the task runner (at least that's my theory).
The CQ bit was checked by mek@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
On 2017/05/05 00:01:06, Marijn Kruisselbrink wrote:
> [...] Unfortunately the second stack trace wouldn't ever actually fire for me, since the various browser threads that were already shut down were needed to make progress to actually release the task runner (at least that's my theory).

I noticed that when I tried locally yesterday too, with your change on top of mine. I initially thought it was leaking the task runner, but then tried it in the debugger and did hit the expected crash. I first attributed it to my release build's heap being generous and somehow not crashing on the invalid |outer_| deref while doing so under debugging conditions... but now I'm thinking maybe it's a race (which resolves differently when debugging). In unit test shutdown we should flush all tasks (even those not blocking shutdown), as leaking state outside the scope of a unit test is not cool... Anyway, at least that frees you up, but I'll try to figure out what's up. Thanks for your patience and for using TaskScheduler!
On 2017/05/05 at 11:23:46, gab wrote:
> [...] In unit tests shutdown we should flush all tasks (even those not blocking shutdown) as leaking state outside the scope of a unit test is not cool...

I'm pretty sure the bug is almost entirely in my own code, FWIW. Something like this happens:
- UI thread PostTasks to the IO thread to shut down the mojo localstorage implementation
- IO thread (via mojo) PostTasks to the file task runner to do some stuff
- IO thread shuts down, as nothing is currently queued there, and it's the first one to shut down
- File task runner (via mojo) tries to reply to the IO thread, but that fails
- Somehow, somewhere this leaves a reference to the file task runner, presumably somewhere internal to mojo

Not sure about the details. And the DB thread might play a role in there too (some of the communication is more IO->DB->File->DB->IO). Actually, maybe the "problematic" reference is the one the leveldb service (living on the DB thread) holds to the file runner.

With the change to mojo bindings to no longer have a MessageLoop::DestructionObserver (from the linked mojo-on-SequencedTaskRunners bug), it seems plausible that in some circumstances the task runner the leveldb service "lives" on gets shut down before its connection is properly severed, maybe something like:
- UI thread PostTasks to the IO thread to start shutdown of mojo localstorage
- IO thread PostTasks to the leveldb service on the DB thread to get something from the DB
- DB thread PostTasks back to the IO thread with the result
- DB thread shuts down, as nothing is currently queued for it
- IO thread tries to PostTask to the DB thread to sever the leveldb service connection, but the thread is already gone, so the LevelDBService will never be destroyed properly, and the task runner reference it holds leaks

> Anyways, at least that frees you up, but I'll try to figure out what's up.. Thanks for your patience and for using TaskScheduler!

Thanks for working on TaskScheduler, it does seem to have a lot of promise and should make things a lot simpler! So does this CL look good from a TaskScheduler point of view (other than the unfortunate part of having to use a SingleThreadTaskRunner for now)?
On 2017/05/05 at 16:44:20, Marijn Kruisselbrink wrote:
> [...] - IO thread tries to posttask to DB thread to sever leveldb service connection, but thread is already gone so LevelDBService will never be destroyed properly, and task runner reference it holds leaks

Wait, no, that has thread shutdown in the wrong order. What could actually happen, depending on the state of the various components:
- UI thread PostTasks to the IO thread to start shutdown of mojo localstorage
- IO thread PostTasks to the file runner to talk to the file service
- No outstanding tasks on the DB thread, so that thread is shut down
- Since code that could only run on the DB thread held a reference to the file task runner, that reference will never be freed
On 2017/05/05 at 16:58:47, Marijn Kruisselbrink wrote:
> On 2017/05/05 at 16:44:20, Marijn Kruisselbrink wrote:
> > On 2017/05/05 at 11:23:46, gab wrote:
> > > I noticed that when I tried locally yesterday too with your change on top of mine. I initially thought it was leaking the task runner but then tried it in the debugger and did hit the expected crash, I first attributed it to my release build's heap being generous and somehow not crashing on the invalid |outer_| deref but doing so under debugging conditions... But now I'm thinking maybe it's a race (which resolves differently when debugging). In unit tests shutdown we should flush all tasks (even those not blocking shutdown) as leaking state outside the scope of a unit test is not cool...
> >
> > I'm pretty sure the bug is almost entirely in my own code FWIW. Something like what happens:
> > - UI thread PostTask to IO thread to shutdown mojo localstorage implementation
> > - IO thread (via mojo) PostTask to File task runner to do some stuff
> > - IO thread shuts down, as nothing is currently queued there, and it's the first one to shut down
> > - File task runner (via mojo) tries to reply to IO thread, but that fails
> > - Somehow somewhere this leaves a reference to the file task runner, presumably somewhere internal to mojo
> >
> > Not sure about the details. And the DB thread might play a roll in there too (some of the communication is more IO->DB->File->DB->IO). Actually, maybe the "problematic" reference is the one the leveldb service (living on the DB thread) holds to the file runner.
> > With the change to mojo bindings to no longer have a MessageLoop::DestructionObserver (from the linked mojo in sequencedtaskrunners bug) it seems plausible that in some circumstances the taskrunner the leveldb server "lives" on gets shutdown before its connection is properly severed, maybe something like:
> > - UI thread posttask to IO thread to start shutdown of mojo localstorage
> > - IO thread posttask to leveldb service in DB thread to get something from the DB
> > - DB thread posttask back to IO thread with result
> > - DB thread shuts down as nothing is currently queued for it
> > - IO thread tries to posttask to DB thread to sever leveldb service connection, but thread is already gone so LevelDBService will never be destroyed properly, and task runner reference it holds leaks
>
> Wait, no, that has thread shutdown in the wrong order, what could actually happen, depending on state of the various components:
> - UI thread posttask to IO thread to start shutdown of mojo localstorage
> - IO thread posttask to File runner to talk to file service
> - no outstanding tasks on DB thread, so thread is shut down
that line should have read "- no outstanding tasks on either IO or DB thread, so both threads are shut down" (Since IO thread is emptied before DB thread)
> - since code that could only run on DB thread had reference to file task runner, that reference will never be freed
mek@chromium.org changed reviewers: + jam@chromium.org
+jam for content OWNERS
https://codereview.chromium.org/2852333004/diff/40001/content/browser/browser... File content/browser/browser_context.cc (right):

https://codereview.chromium.org/2852333004/diff/40001/content/browser/browser... content/browser/browser_context.cc:489: base::TaskTraits()
See brand new way to specify TaskTraits as a bag instead of builder pattern (constexpr and much less verbose :)): https://cs.chromium.org/chromium/src/base/task_scheduler/task_traits.h?rcl=58...

https://codereview.chromium.org/2852333004/diff/40001/content/browser/browser... content/browser/browser_context.cc:491: .WithBaseSyncPrimitives()
Do you really need WithBaseSyncPrimitives()? (see its API)
On 2017/05/05 17:21:49, Marijn Kruisselbrink wrote:
> On 2017/05/05 at 16:58:47, Marijn Kruisselbrink wrote:
> > On 2017/05/05 at 16:44:20, Marijn Kruisselbrink wrote:
> > > On 2017/05/05 at 11:23:46, gab wrote:
> > > > I noticed that when I tried locally yesterday too with your change on top of mine. I initially thought it was leaking the task runner but then tried it in the debugger and did hit the expected crash, I first attributed it to my release build's heap being generous and somehow not crashing on the invalid |outer_| deref but doing so under debugging conditions... But now I'm thinking maybe it's a race (which resolves differently when debugging). In unit tests shutdown we should flush all tasks (even those not blocking shutdown) as leaking state outside the scope of a unit test is not cool...
> > >
> > > I'm pretty sure the bug is almost entirely in my own code FWIW. Something like what happens:
> > > - UI thread PostTask to IO thread to shutdown mojo localstorage implementation
> > > - IO thread (via mojo) PostTask to File task runner to do some stuff
> > > - IO thread shuts down, as nothing is currently queued there, and it's the first one to shut down
> > > - File task runner (via mojo) tries to reply to IO thread, but that fails
> > > - Somehow somewhere this leaves a reference to the file task runner, presumably somewhere internal to mojo
> > >
> > > Not sure about the details. And the DB thread might play a roll in there too (some of the communication is more IO->DB->File->DB->IO). Actually, maybe the "problematic" reference is the one the leveldb service (living on the DB thread) holds to the file runner.
> > > With the change to mojo bindings to no longer have a MessageLoop::DestructionObserver (from the linked mojo in sequencedtaskrunners bug) it seems plausible that in some circumstances the taskrunner the leveldb server "lives" on gets shutdown before its connection is properly severed, maybe something like:
> > > - UI thread posttask to IO thread to start shutdown of mojo localstorage
> > > - IO thread posttask to leveldb service in DB thread to get something from the DB
> > > - DB thread posttask back to IO thread with result
> > > - DB thread shuts down as nothing is currently queued for it
> > > - IO thread tries to posttask to DB thread to sever leveldb service connection, but thread is already gone so LevelDBService will never be destroyed properly, and task runner reference it holds leaks
> >
> > Wait, no, that has thread shutdown in the wrong order, what could actually happen, depending on state of the various components:
> > - UI thread posttask to IO thread to start shutdown of mojo localstorage
> > - IO thread posttask to File runner to talk to file service
> > - no outstanding tasks on DB thread, so thread is shut down
> that line should have read "- no oustanding tasks on either IO or DB thread, so both threads are shut down" (Since IO thread is emptied before DB thread)
> > - since code that could only run on DB thread had reference to file task runner, that reference will never be freed

As mentioned before this won't matter in production or browser tests so I'm guessing you're hitting this in a unit test, right?

If so it might have to do with the order of things in ~TestBrowserThreadBundle(). If there somehow still is a task in |message_loop_| by the time we reach |scoped_async_task_scheduler_.reset()| and that task has a reference to a SingleThreadTaskRunner obtained from TaskScheduler you would hit this.
The CQ bit was checked by mek@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
On 2017/05/08 at 17:30:58, gab wrote:
> On 2017/05/05 17:21:49, Marijn Kruisselbrink wrote:
> > On 2017/05/05 at 16:58:47, Marijn Kruisselbrink wrote:
> > > On 2017/05/05 at 16:44:20, Marijn Kruisselbrink wrote:
> > > Wait, no, that has thread shutdown in the wrong order, what could actually happen, depending on state of the various components:
> > > - UI thread posttask to IO thread to start shutdown of mojo localstorage
> > > - IO thread posttask to File runner to talk to file service
> > > - no outstanding tasks on DB thread, so thread is shut down
> > that line should have read "- no oustanding tasks on either IO or DB thread, so both threads are shut down" (Since IO thread is emptied before DB thread)
> > - since code that could only run on DB thread had reference to file task runner, that reference will never be freed
>
> As mentioned before this won't matter in production or browser tests so I'm guessing you're hitting this in a unit test, right?
>
> If so it might have to do with the order of things in ~TestBrowserThreadBundle(). If there somehow still is a task in |message_loop_| by the time we reach |scoped_async_task_scheduler_.reset()| and that task has a reference to a SingleThreadTaskRunner obtained from TaskScheduler you would hit this.

I wouldn't say that this only matters in unit tests. In production and browser tests things might not crash/dcheck, but shutdown will still not finish properly (i.e. I'd like to be able to post shutdown blocking tasks from one thread/taskrunner to another, and be able to get that other thread/taskrunner to post a shutdown blocking task back to the first thread/taskrunner).

And that fails in both unit tests and production because both TestBrowserThreadBundle and production code (BrowserMainLoop::ShutdownThreadsAndCleanUp) simply flush all browserthreads in order, shutting them down when there are no more shutdown blocking tasks for the thread, not allowing later threads in shutdown order to post back to earlier threads.

https://codereview.chromium.org/2852333004/diff/40001/content/browser/browser... File content/browser/browser_context.cc (right):

https://codereview.chromium.org/2852333004/diff/40001/content/browser/browser... content/browser/browser_context.cc:489: base::TaskTraits()
On 2017/05/08 at 17:14:32, gab wrote:
> See brand new way to specify TaskTraits as a bag instead of builder pattern (constexpr and much less verbose :)): https://cs.chromium.org/chromium/src/base/task_scheduler/task_traits.h?rcl=58...
Ah, nice. Done. The sample code in post_task.h should probably be updated to the new pattern too?

https://codereview.chromium.org/2852333004/diff/40001/content/browser/browser... content/browser/browser_context.cc:491: .WithBaseSyncPrimitives()
On 2017/05/08 at 17:14:32, gab wrote:
> Do you really need WithBaseSyncPrimitives()? (see its API)
Ah yes, it indeed seems that you don't need WithBaseSyncPrimitives to signal a waitable event; it's only needed to wait for one? Removed it. The other task runner here (which currently still uses the DB thread), as well as a third thread internal in the leveldb mojo code, do need WithBaseSyncPrimitives as they actually wait on waitable events, but this one indeed doesn't.
On 2017/05/08 18:18:01, Marijn Kruisselbrink wrote:
> On 2017/05/08 at 17:30:58, gab wrote:
> > On 2017/05/05 17:21:49, Marijn Kruisselbrink wrote:
> > > On 2017/05/05 at 16:58:47, Marijn Kruisselbrink wrote:
> > > > On 2017/05/05 at 16:44:20, Marijn Kruisselbrink wrote:
> > > > Wait, no, that has thread shutdown in the wrong order, what could actually happen, depending on state of the various components:
> > > > - UI thread posttask to IO thread to start shutdown of mojo localstorage
> > > > - IO thread posttask to File runner to talk to file service
> > > > - no outstanding tasks on DB thread, so thread is shut down
> > > that line should have read "- no oustanding tasks on either IO or DB thread, so both threads are shut down" (Since IO thread is emptied before DB thread)
> > > - since code that could only run on DB thread had reference to file task runner, that reference will never be freed
> >
> > As mentioned before this won't matter in production or browser tests so I'm guessing you're hitting this in a unit test, right?
> >
> > If so it might have to do with the order of things in ~TestBrowserThreadBundle(). If there somehow still is a task in |message_loop_| by the time we reach |scoped_async_task_scheduler_.reset()| and that task has a reference to a SingleThreadTaskRunner obtained from TaskScheduler you would hit this.
>
> I wouldn't say that this only matters in unit tests. In production and browser tests things might not crash/dcheck, but shutdown will still not finish properly (i.e. I'd like to be able to post shutdown blocking tasks from one thread/taskrunner to another, and be able to get that other thread/taskrunner to post a shutdown blocking task back to the first thread/taskrunner).
>
> And that fails in both unit tests and production because both TestBrowserThreadBundle and production code (BrowserMainLoop::ShutdownThreadsAndCleanUp) simply flush all browserthreads in order, shutting them down when there are no more shutdown blocking tasks for the thread, not allowing later threads in shutdown order to post back to earlier threads.

Ah, you're talking about the broader shutdown issues (pre-dating TaskScheduler). I definitely agree with you. In fact, all of TaskScheduler was designed such that we can one day support what I call "atomic shutdown". The idea is that once everything goes through TaskScheduler, every thread/sequence can post BLOCK_SHUTDOWN work back-and-forth until done without winding/tearing down resources, followed by immediate process termination (gets rid of so many ordering and lifetime problems strictly caused by shutdown).

https://codereview.chromium.org/2852333004/diff/40001/content/browser/browser... File content/browser/browser_context.cc (right):

https://codereview.chromium.org/2852333004/diff/40001/content/browser/browser... content/browser/browser_context.cc:489: base::TaskTraits()
On 2017/05/08 18:18:01, Marijn Kruisselbrink wrote:
> On 2017/05/08 at 17:14:32, gab wrote:
> > See brand new way to specify TaskTraits as a bag instead of builder pattern (constexpr and much less verbose :)): https://cs.chromium.org/chromium/src/base/task_scheduler/task_traits.h?rcl=58...
>
> Ah, nice. Done. The sample code in post_task.h should probably be updated to the new pattern too?

Ah oops looks like we forgot to update that, will do, thanks!
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: win_chromium_x64_rel_ng on master.tryserver.chromium.win (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_x64_...)
lgtm
The CQ bit was checked by mek@chromium.org
CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
CQ is committing da patch. Bot data: {"patchset_id": 60001, "attempt_start_ts": 1494288774088670, "parent_rev": "bb457134c0afdd23c2547993b9843f4d30594a65", "commit_rev": "17399e67ce617b8936b5088dba92abb92b0cb8ee"}
Message was sent while issue was closed.
Description was changed from

==========
Fix possible deadlock during shutdown by not using the FILE thread.

The leveldb service needs to be able to call code in the file service, even during browser shutdown. If the file service runs on the file thread, this will fail, leading to deadlock/timeout. Instead move the file service to a new SingleThreadTaskRunner managed by the task scheduler.

Also re-enable tests that hopefully should be fixed by this.

BUG=712872
BUG=586194
==========

to

==========
Fix possible deadlock during shutdown by not using the FILE thread.

The leveldb service needs to be able to call code in the file service, even during browser shutdown. If the file service runs on the file thread, this will fail, leading to deadlock/timeout. Instead move the file service to a new SingleThreadTaskRunner managed by the task scheduler.

Also re-enable tests that hopefully should be fixed by this.

BUG=712872
BUG=586194

Review-Url: https://codereview.chromium.org/2852333004
Cr-Commit-Position: refs/heads/master@{#470169}
Committed: https://chromium.googlesource.com/chromium/src/+/17399e67ce617b8936b5088dba92...
==========
Message was sent while issue was closed.
Committed patchset #4 (id:60001) as https://chromium.googlesource.com/chromium/src/+/17399e67ce617b8936b5088dba92... |