Created: 3 years, 7 months ago by Marijn Kruisselbrink
Modified: 3 years, 7 months ago
CC: chromium-reviews, darin-cc_chromium.org, fdoray, jam, robliao
Target Ref: refs/heads/master
Project: chromium
Visibility: Public
Description
Fix possible deadlock during shutdown by not using the FILE thread.
The leveldb service needs to be able to call code in the file service,
even during browser shutdown. If the file service runs on the file
thread, this will fail, leading to deadlock/timeout. Instead move
the file service to a new SingleThreadTaskRunner managed by the task
scheduler.
Also re-enable tests that should hopefully be fixed by this change.
BUG=712872
BUG=586194
Review-Url: https://codereview.chromium.org/2852333004
Cr-Commit-Position: refs/heads/master@{#470169}
Committed: https://chromium.googlesource.com/chromium/src/+/17399e67ce617b8936b5088dba92abb92b0cb8ee
Patch Set 1 #
Patch Set 2 : temporary change to show problem with task scheduler shutdown #
Total comments: 1
Patch Set 3 : undo temporary change, and add comment linking to mojo bug #
Total comments: 5
Patch Set 4 : use new way of specifying task traits #
Messages
Total messages: 38 (19 generated)
The CQ bit was checked by mek@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was checked by mek@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: linux_chromium_chromeos_ozone_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_...)
mek@chromium.org changed reviewers: + gab@chromium.org
gab: this change seems like a fairly straightforward replacement of a BrowserThread usage with a task-scheduler-managed thread. But I'm not sure how to deal with the resulting test failures (caused by TaskRunners outliving the task scheduler). For more context/explanation of what is going on, see also the email on scheduler-dev I sent about this yesterday.
On 2017/05/03 19:09:21, Marijn Kruisselbrink wrote:
> gab: this change seems like a fairly straightforward replacement of a BrowserThread usage with a task-scheduler-managed thread. But I'm not sure how to deal with the resulting test failures (caused by TaskRunners outliving the task scheduler). For more context/explanation of what is going on, see also the email on scheduler-dev I sent about this yesterday.

So the reason ExtensionInfoGeneratorUnitTest.BasicInfoTest is failing is valid, I think (and it's unfortunate that BrowserThread::FILE/TestBrowserThreadBundle wasn't catching that before).

What's happening is that ExtensionInfoGeneratorUnitTest_BasicInfoTest -> ExtensionServiceTestBase::ResetThreadBundle() is effectively saying it wants to scrap the current execution environment (I don't see why it makes sense to do that, honestly). In doing so it brings down the MessageLoops and worker threads that were brought up in the SetUp() phase... switching the task environment mid-flight in a unit test seems like a bad idea to me.

The intent of ExtensionServiceTestBase::ResetThreadBundle() appears to be to allow running ExtensionServiceTestBase in a non-IO_MAINLOOP environment (which it defaults to, ref. kThreadOptions in extension_service_test_base.cc).

But this is the wrong way to do that... the task environment needs to be the first thing set up and the last one torn down (or at least it must not be set up after, nor torn down before, anything that uses it). So to make it configurable, that fixture's options should be tweaked in the constructor or something...

I realize this is outside the scope of what you're trying to do; we were going to start manually migrating a few FILE thread callsites in the coming days to try and weed out such issues, but it looks like you beat us to it :) -- the solution IMO is to fix ExtensionServiceTestBase to be configurable earlier instead of wholesale resetting its task environment.
https://codereview.chromium.org/2852333004/diff/20001/content/browser/browser...
File content/browser/browser_context.cc (right):

https://codereview.chromium.org/2852333004/diff/20001/content/browser/browser...
content/browser/browser_context.cc:484: // runner when mojo supports that.
Add http://crbug.com/678155 for ease of searching.

But overall, curious why this has to do with Mojo? Mojo isn't happy running its services on sequences per se, but it shouldn't care that "file_service_runner_" is a SequencedTaskRunner (just requires adjusting FileService's interface to allow that? it doesn't appear to be truly thread-affine, rather just thread-unsafe == sequence-affine?)
On 2017/05/03 at 20:18:57, gab wrote:
> [...] the task environment needs to be the first thing set up and the last one torn down (or at least it must not be set up after, nor torn down before, anything that uses it). So to make it configurable, that fixture's options should be tweaked in the constructor or something...
> I realize this is outside the scope of what you're trying to do [...] -- the solution IMO is to fix ExtensionServiceTestBase to be configurable earlier instead of wholesale resetting its task environment.

I think ResetThreadBundle is kind of a red herring here. If I combine this CL with another in-flight CL (https://codereview.chromium.org/2861473002, which moves some unrelated code from the UI to the IO thread to fix other threading issues...), more tests start failing with the same error, but this time during test harness destruction (see https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel... for a test run of the combination of this CL and that other CL).

I of course agree that ideally nothing should try to use the task environment after shutting it down. But it's not even clear that that is what is happening. What the DCHECK is complaining about is that there is still code holding references to a TaskRunner, not that code is actively trying to use those TaskRunners. Also, for other TaskRunners it's perfectly fine to hold references to them after the system managing them has shut down: PostTask will simply start returning false, since the tasks will never run again. Why can't these TaskRunners do the same? As explained in my email, actually making sure no references to the task runners remain is very much non-trivial, in particular with mojo copying references to these task runners all over the place (with StrongBinding being particularly annoying, in that it generally manages its own lifetime and intentionally no longer tries to observe the task runner going away).

> https://codereview.chromium.org/2852333004/diff/20001/content/browser/browser...
> File content/browser/browser_context.cc (right):
> content/browser/browser_context.cc:484: // runner when mojo supports that.
> Add http://crbug.com/678155 for ease of searching.
> But overall curious why this has to do with Mojo? [...] it doesn't appear to be truly thread-affine, rather just thread-unsafe == sequence-affine?

Pretty much every mojo-related class (binding, interface ptr, etc.) uses base::ThreadTaskRunnerHandle::Get() to get the task runner to run callbacks on. So in effect every bit of code that tries to either call a mojo service or implement a mojo service currently has to live on a SingleThreadTaskRunner. (I agree that there doesn't necessarily seem to be any good reason why mojo would require thread-affinity rather than sequence-affinity, but that just is the current situation...)
On 2017/05/03 21:06:29, Marijn Kruisselbrink wrote:
> [...] What the DCHECK is complaining about is that there still is code with references to a TaskRunner, not that code is actively trying to use those TaskRunners. Also for other TaskRunners it's perfectly fine to hold references to them after the system managing them has shut down. PostTask will simply start returning false, since the tasks will never run again. Why can't these TaskRunners do the same? [...]

I agree with you for production code, and for that very reason in production we only call TaskScheduler::Shutdown() to stop it but never actually tear down its state (PostTask will return false, etc.).

In tests, however, we have to tear things down between each test. As such, the task environment should be set up first and brought down last.

Looks like this other failure you're hitting is another fallout of ExtensionServiceTestBase managing its TestBrowserThreadBundle incorrectly (in that case I'd suspect that having the content::RenderViewHostTestEnabler member earlier in the list is the problem, as the task environment will be ripped out from under it on tear down). I'm looking into improving documentation and DCHECKs to make it clearer what to do when such an ordering issue arises in tests (and fixing ExtensionServiceTestBase).

> Pretty much every mojo related class (binding, interfaceptr, etc) uses base::ThreadTaskRunnerHandle::Get() to get the task runner to run callbacks on. So in effect every bit of code that tries to either call a mojo service or implement a mojo service currently has to live on a SingleThreadTaskRunner. (I agree that there doesn't seem to necessarily be any good reason why mojo would require thread-affine, rather than sequence-affine, but that just is the current situation...)

Right... well, let's at least link to http://crbug.com/678155 so we can come back and address these when Mojo fixes that.
On 2017/05/04 18:45:48, gab wrote:
> I agree with you for production code and for that very reason in production we only call TaskScheduler::Shutdown() to stop it but never actually tear down its state (post tasks will return false, etc.).
> In tests however we have to tear things down between each test. As such the task environment should be set up first and brought down last.

FWIW, in the absence of this DCHECK, the test would crash later when calling ~SchedulerSingleThreadTaskRunner(), dereferencing |outer_| after its deletion.
Improved documentation and check in https://codereview.chromium.org/2860063003/, so that you should now get the stack traces for both destructors happening out-of-order in the test fixture. But like I said, the issue is very likely with ExtensionServiceTestBase not keeping its task environment alive long enough (and also resetting it mid-flight!)
On 2017/05/04 at 18:45:48, gab wrote:
> Looks like this other failure you're hitting is another fallout of ExtensionServiceTestBase managing its TestBrowserThreadBundle incorrectly [...]
> FWIW, in the absence of this DCHECK, the test would crash later when calling ~SchedulerSingleThreadTaskRunner() when dereferencing |outer_| after its deletion.

Actually, changing that order didn't do anything. And in the absence of the DCHECK things just worked. But I did (mostly) figure out what was going on. My other CL (https://codereview.chromium.org/2861473002) changed some shutdown logic from being sync to async, and somehow even the various flushing and running-until-idle done as part of ~TestBrowserThreadBundle wasn't enough to let the async shutdown logic finish. Not sure where things were failing, though. Anyway, I changed something in that other CL that happens to have as a side effect that in these particular tests the shutdown logic happens to still be sync, so the problem no longer manifests itself. Not entirely satisfying, as there still seems to be a problem with should-be-shutdown-blocking tasks not actually blocking shutdown (well, I'm not sure all of these tasks on the many threads involved are actually shutdown-blocking; and there are probably still problems remaining where a thread posts a task to another thread and expects a reply, but only one of those directions works with the way BrowserThreads are flushed/destroyed).

Hopefully, once more gets moved to the TaskScheduler rather than the various BrowserThreads, those things get solved as well, and these multi-thread async shutdown sequences actually work more reliably during task scheduler flushing. For now at least, I'm satisfied that this seems to work fine, and any remaining problems are definitely unrelated to this change.

> Right... well, let's at least link to http://crbug.com/678155 so we can come back and address these when Mojo fixes that.

Added a link to that.

On 2017/05/04 at 21:43:11, gab wrote:
> Improved documentation and check in https://codereview.chromium.org/2860063003/ so that you should now get the stack traces for both destructors happening out-of-order in the test fixture.
> But like I said, the issue is very likely with ExtensionServiceTestBase not keeping its task environment alive long enough (and also resetting it mid-flight!)

Unfortunately, the second stack trace would never actually fire for me, since the various browser threads that were already shut down were needed to make progress to actually release the task runner (at least that's my theory).
The CQ bit was checked by mek@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
On 2017/05/05 00:01:06, Marijn Kruisselbrink wrote:
> [...] Unfortunately the second stack trace wouldn't ever actually fire for me, since the various browser threads that were already shut down were needed to make progress to actually release the task runner (at least that's my theory).

I noticed that when I tried locally yesterday too, with your change on top of mine. I initially thought it was leaking the task runner, but then tried it in the debugger and did hit the expected crash. I first attributed it to my release build's heap being generous and somehow not crashing on the invalid |outer_| deref while doing so under debugging conditions... but now I'm thinking maybe it's a race (which resolves differently when debugging). In unit test shutdown we should flush all tasks (even those not blocking shutdown), as leaking state outside the scope of a unit test is not cool... Anyway, at least that frees you up, but I'll try to figure out what's up. Thanks for your patience and for using TaskScheduler!
On 2017/05/05 at 11:23:46, gab wrote:
> [...] In unit tests shutdown we should flush all tasks (even those not blocking shutdown) as leaking state outside the scope of a unit test is not cool...

I'm pretty sure the bug is almost entirely in my own code, FWIW. Something like this happens:
- UI thread PostTasks to the IO thread to shut down the mojo localstorage implementation
- IO thread (via mojo) PostTasks to the file task runner to do some stuff
- IO thread shuts down, as nothing is currently queued there, and it's the first one to shut down
- File task runner (via mojo) tries to reply to the IO thread, but that fails
- Somehow, somewhere this leaves a reference to the file task runner, presumably somewhere internal to mojo

Not sure about the details. And the DB thread might play a role in there too (some of the communication is more IO->DB->File->DB->IO). Actually, maybe the "problematic" reference is the one the leveldb service (living on the DB thread) holds to the file runner.

With the change to mojo bindings to no longer have a MessageLoop::DestructionObserver (from the linked mojo-on-SequencedTaskRunners bug), it seems plausible that in some circumstances the task runner the leveldb service "lives" on gets shut down before its connection is properly severed, maybe something like:
- UI thread PostTasks to the IO thread to start shutdown of mojo localstorage
- IO thread PostTasks to the leveldb service on the DB thread to get something from the DB
- DB thread PostTasks back to the IO thread with the result
- DB thread shuts down, as nothing is currently queued for it
- IO thread tries to PostTask to the DB thread to sever the leveldb service connection, but the thread is already gone, so the LevelDBService will never be destroyed properly, and the task runner reference it holds leaks

> Anyways, at least that frees you up, but I'll try to figure out what's up.. Thanks for your patience and for using TaskScheduler!

Thanks for working on TaskScheduler, it does seem to have a lot of promise and should make things a lot simpler! So does this CL look good from a TaskScheduler point of view (other than the unfortunate part of having to use a SingleThreadTaskRunner for now)?
On 2017/05/05 at 16:44:20, Marijn Kruisselbrink wrote:
> [...] - IO thread tries to posttask to DB thread to sever leveldb service connection, but thread is already gone so LevelDBService will never be destroyed properly, and task runner reference it holds leaks

Wait, no, that has thread shutdown in the wrong order. What could actually happen, depending on the state of the various components:
- UI thread PostTasks to the IO thread to start shutdown of mojo localstorage
- IO thread PostTasks to the file runner to talk to the file service
- No outstanding tasks on the DB thread, so that thread is shut down
- Since code that could only run on the DB thread held a reference to the file task runner, that reference will never be freed
On 2017/05/05 at 16:58:47, Marijn Kruisselbrink wrote:
> On 2017/05/05 at 16:44:20, Marijn Kruisselbrink wrote:
> > On 2017/05/05 at 11:23:46, gab wrote:
> > > I noticed that when I tried locally yesterday too with your change on top of mine. I initially thought it was leaking the task runner but then tried it in the debugger and did hit the expected crash, I first attributed it to my release build's heap being generous and somehow not crashing on the invalid |outer_| deref but doing so under debugging conditions... But now I'm thinking maybe it's a race (which resolves differently when debugging). In unit tests shutdown we should flush all tasks (even those not blocking shutdown) as leaking state outside the scope of a unit test is not cool...
> >
> > I'm pretty sure the bug is almost entirely in my own code FWIW. Something like what happens:
> > - UI thread PostTask to IO thread to shutdown mojo localstorage implementation
> > - IO thread (via mojo) PostTask to File task runner to do some stuff
> > - IO thread shuts down, as nothing is currently queued there, and it's the first one to shut down
> > - File task runner (via mojo) tries to reply to IO thread, but that fails
> > - Somehow somewhere this leaves a reference to the file task runner, presumably somewhere internal to mojo
> >
> > Not sure about the details. And the DB thread might play a roll in there too (some of the communication is more IO->DB->File->DB->IO). Actually, maybe the "problematic" reference is the one the leveldb service (living on the DB thread) holds to the file runner.
> > With the change to mojo bindings to no longer have a MessageLoop::DestructionObserver (from the linked mojo in sequencedtaskrunners bug) it seems plausible that in some circumstances the taskrunner the leveldb server "lives" on gets shutdown before its connection is properly severed, maybe something like:
> > - UI thread posttask to IO thread to start shutdown of mojo localstorage
> > - IO thread posttask to leveldb service in DB thread to get something from the DB
> > - DB thread posttask back to IO thread with result
> > - DB thread shuts down as nothing is currently queued for it
> > - IO thread tries to posttask to DB thread to sever leveldb service connection, but thread is already gone so LevelDBService will never be destroyed properly, and task runner reference it holds leaks
>
> Wait, no, that has thread shutdown in the wrong order, what could actually happen, depending on state of the various components:
> - UI thread posttask to IO thread to start shutdown of mojo localstorage
> - IO thread posttask to File runner to talk to file service
> - no outstanding tasks on DB thread, so thread is shut down
that line should have read "- no outstanding tasks on either IO or DB thread, so both threads are shut down" (Since IO thread is emptied before DB thread)
> - since code that could only run on DB thread had reference to file task runner, that reference will never be freed
mek@chromium.org changed reviewers: + jam@chromium.org
+jam for content OWNERS
https://codereview.chromium.org/2852333004/diff/40001/content/browser/browser... File content/browser/browser_context.cc (right):

https://codereview.chromium.org/2852333004/diff/40001/content/browser/browser... content/browser/browser_context.cc:489: base::TaskTraits()
See brand new way to specify TaskTraits as a bag instead of builder pattern (constexpr and much less verbose :)): https://cs.chromium.org/chromium/src/base/task_scheduler/task_traits.h?rcl=58...

https://codereview.chromium.org/2852333004/diff/40001/content/browser/browser... content/browser/browser_context.cc:491: .WithBaseSyncPrimitives()
Do you really need WithBaseSyncPrimitives()? (see its API)
On 2017/05/05 17:21:49, Marijn Kruisselbrink wrote:
> On 2017/05/05 at 16:58:47, Marijn Kruisselbrink wrote:
> > On 2017/05/05 at 16:44:20, Marijn Kruisselbrink wrote:
> > > On 2017/05/05 at 11:23:46, gab wrote:
> > > > I noticed that when I tried locally yesterday too with your change on top of mine. I initially thought it was leaking the task runner but then tried it in the debugger and did hit the expected crash, I first attributed it to my release build's heap being generous and somehow not crashing on the invalid |outer_| deref but doing so under debugging conditions... But now I'm thinking maybe it's a race (which resolves differently when debugging). In unit tests shutdown we should flush all tasks (even those not blocking shutdown) as leaking state outside the scope of a unit test is not cool...
> > >
> > > I'm pretty sure the bug is almost entirely in my own code FWIW. Something like what happens:
> > > - UI thread PostTask to IO thread to shutdown mojo localstorage implementation
> > > - IO thread (via mojo) PostTask to File task runner to do some stuff
> > > - IO thread shuts down, as nothing is currently queued there, and it's the first one to shut down
> > > - File task runner (via mojo) tries to reply to IO thread, but that fails
> > > - Somehow somewhere this leaves a reference to the file task runner, presumably somewhere internal to mojo
> > >
> > > Not sure about the details. And the DB thread might play a roll in there too (some of the communication is more IO->DB->File->DB->IO). Actually, maybe the "problematic" reference is the one the leveldb service (living on the DB thread) holds to the file runner.
> > > With the change to mojo bindings to no longer have a MessageLoop::DestructionObserver (from the linked mojo in sequencedtaskrunners bug) it seems plausible that in some circumstances the taskrunner the leveldb server "lives" on gets shutdown before its connection is properly severed, maybe something like:
> > > - UI thread posttask to IO thread to start shutdown of mojo localstorage
> > > - IO thread posttask to leveldb service in DB thread to get something from the DB
> > > - DB thread posttask back to IO thread with result
> > > - DB thread shuts down as nothing is currently queued for it
> > > - IO thread tries to posttask to DB thread to sever leveldb service connection, but thread is already gone so LevelDBService will never be destroyed properly, and task runner reference it holds leaks
> >
> > Wait, no, that has thread shutdown in the wrong order, what could actually happen, depending on state of the various components:
> > - UI thread posttask to IO thread to start shutdown of mojo localstorage
> > - IO thread posttask to File runner to talk to file service
> > - no outstanding tasks on DB thread, so thread is shut down
> that line should have read "- no oustanding tasks on either IO or DB thread, so both threads are shut down" (Since IO thread is emptied before DB thread)
> > - since code that could only run on DB thread had reference to file task runner, that reference will never be freed

As mentioned before this won't matter in production or browser tests so I'm guessing you're hitting this in a unit test, right?

If so it might have to do with the order of things in ~TestBrowserThreadBundle(). If there somehow still is a task in |message_loop_| by the time we reach |scoped_async_task_scheduler_.reset()| and that task has a reference to a SingleThreadTaskRunner obtained from TaskScheduler you would hit this.
The CQ bit was checked by mek@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
On 2017/05/08 at 17:30:58, gab wrote:
> On 2017/05/05 17:21:49, Marijn Kruisselbrink wrote:
> > On 2017/05/05 at 16:58:47, Marijn Kruisselbrink wrote:
> > > On 2017/05/05 at 16:44:20, Marijn Kruisselbrink wrote:
> > > Wait, no, that has thread shutdown in the wrong order, what could actually happen, depending on state of the various components:
> > > - UI thread posttask to IO thread to start shutdown of mojo localstorage
> > > - IO thread posttask to File runner to talk to file service
> > > - no outstanding tasks on DB thread, so thread is shut down
> > that line should have read "- no oustanding tasks on either IO or DB thread, so both threads are shut down" (Since IO thread is emptied before DB thread)
> > - since code that could only run on DB thread had reference to file task runner, that reference will never be freed
>
> As mentioned before this won't matter in production or browser tests so I'm guessing you're hitting this in a unit test, right?
>
> If so it might have to do with the order of things in ~TestBrowserThreadBundle(). If there somehow still is a task in |message_loop_| by the time we reach |scoped_async_task_scheduler_.reset()| and that task has a reference to a SingleThreadTaskRunner obtained from TaskScheduler you would hit this.

I wouldn't say that this only matters in unit tests. In production and browser tests things might not crash/dcheck, but shutdown will still not finish properly (i.e. I'd like to be able to post shutdown blocking tasks from one thread/taskrunner to another, and be able to get that other thread/taskrunner to post a shutdown blocking task back to the first thread/taskrunner).

And that fails in both unit tests and production because both TestBrowserThreadBundle and production code (BrowserMainLoop::ShutdownThreadsAndCleanUp) simply flush all browserthreads in order, shutting them down when there are no more shutdown blocking tasks for the thread, not allowing later threads in shutdown order to post back to earlier threads.

https://codereview.chromium.org/2852333004/diff/40001/content/browser/browser... File content/browser/browser_context.cc (right):

https://codereview.chromium.org/2852333004/diff/40001/content/browser/browser... content/browser/browser_context.cc:489: base::TaskTraits()
On 2017/05/08 at 17:14:32, gab wrote:
> See brand new way to specify TaskTraits as a bag instead of builder pattern (constexpr and much less verbose :)): https://cs.chromium.org/chromium/src/base/task_scheduler/task_traits.h?rcl=58...
Ah, nice. Done. The sample code in post_task.h should probably be updated to the new pattern too?

https://codereview.chromium.org/2852333004/diff/40001/content/browser/browser... content/browser/browser_context.cc:491: .WithBaseSyncPrimitives()
On 2017/05/08 at 17:14:32, gab wrote:
> Do you really need WithBaseSyncPrimitives()? (see its API)
Ah yes, it indeed seems that you don't need WithBaseSyncPrimitives to signal a waitable event; it's only needed to wait for one? Removed it. The other task runner here (which currently still uses the DB thread), as well as a third thread internal in the leveldb mojo code, do need WithBaseSyncPrimitives as they actually wait on waitable events, but this one indeed doesn't.
On 2017/05/08 18:18:01, Marijn Kruisselbrink wrote:
> On 2017/05/08 at 17:30:58, gab wrote:
> > On 2017/05/05 17:21:49, Marijn Kruisselbrink wrote:
> > > On 2017/05/05 at 16:58:47, Marijn Kruisselbrink wrote:
> > > > On 2017/05/05 at 16:44:20, Marijn Kruisselbrink wrote:
> > > > Wait, no, that has thread shutdown in the wrong order, what could actually happen, depending on state of the various components:
> > > > - UI thread posttask to IO thread to start shutdown of mojo localstorage
> > > > - IO thread posttask to File runner to talk to file service
> > > > - no outstanding tasks on DB thread, so thread is shut down
> > > that line should have read "- no oustanding tasks on either IO or DB thread, so both threads are shut down" (Since IO thread is emptied before DB thread)
> > > - since code that could only run on DB thread had reference to file task runner, that reference will never be freed
> >
> > As mentioned before this won't matter in production or browser tests so I'm guessing you're hitting this in a unit test, right?
> >
> > If so it might have to do with the order of things in ~TestBrowserThreadBundle(). If there somehow still is a task in |message_loop_| by the time we reach |scoped_async_task_scheduler_.reset()| and that task has a reference to a SingleThreadTaskRunner obtained from TaskScheduler you would hit this.
>
> I wouldn't say that this only matters in unit tests. In production and browser tests things might not crash/dcheck, but shutdown will still not finish properly (i.e. I'd like to be able to post shutdown blocking tasks from one thread/taskrunner to another, and be able to get that other thread/taskrunner to post a shutdown blocking task back to the first thread/taskrunner).
>
> And that fails in both unit tests and production because both TestBrowserThreadBundle and production code (BrowserMainLoop::ShutdownThreadsAndCleanUp) simply flush all browserthreads in order, shutting them down when there are no more shutdown blocking tasks for the thread, not allowing later threads in shutdown order to post back to earlier threads.

Ah, you're talking about the broader shutdown issues (pre-dating TaskScheduler). I definitely agree with you. In fact, all of TaskScheduler was designed such that we can one day support what I call "atomic shutdown". The idea is that once everything goes through TaskScheduler, every thread/sequence can post BLOCK_SHUTDOWN work back-and-forth until done without winding/tearing down resources, followed by immediate process termination (gets rid of so many ordering and lifetime problems strictly caused by shutdown).

https://codereview.chromium.org/2852333004/diff/40001/content/browser/browser... File content/browser/browser_context.cc (right):

https://codereview.chromium.org/2852333004/diff/40001/content/browser/browser... content/browser/browser_context.cc:489: base::TaskTraits()
On 2017/05/08 18:18:01, Marijn Kruisselbrink wrote:
> On 2017/05/08 at 17:14:32, gab wrote:
> > See brand new way to specify TaskTraits as a bag instead of builder pattern (constexpr and much less verbose :)): https://cs.chromium.org/chromium/src/base/task_scheduler/task_traits.h?rcl=58...
>
> Ah, nice. Done. The sample code in post_task.h should probably be updated to the new pattern too?

Ah oops looks like we forgot to update that, will do, thanks!
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: win_chromium_x64_rel_ng on master.tryserver.chromium.win (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_x64_...)
lgtm
The CQ bit was checked by mek@chromium.org
CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
CQ is committing da patch. Bot data: {"patchset_id": 60001, "attempt_start_ts": 1494288774088670, "parent_rev": "bb457134c0afdd23c2547993b9843f4d30594a65", "commit_rev": "17399e67ce617b8936b5088dba92abb92b0cb8ee"}
Message was sent while issue was closed.
Description was changed from

==========
Fix possible deadlock during shutdown by not using the FILE thread.

The leveldb service needs to be able to call code in the file service, even during browser shutdown. If the file service runs on the file thread, this will fail, leading to deadlock/timeout. Instead move the file service to a new SingleThreadTaskRunner managed by the task scheduler.

Also re-enable tests that hopefully should be fixed by this.

BUG=712872
BUG=586194
==========

to

==========
Fix possible deadlock during shutdown by not using the FILE thread.

The leveldb service needs to be able to call code in the file service, even during browser shutdown. If the file service runs on the file thread, this will fail, leading to deadlock/timeout. Instead move the file service to a new SingleThreadTaskRunner managed by the task scheduler.

Also re-enable tests that hopefully should be fixed by this.

BUG=712872
BUG=586194

Review-Url: https://codereview.chromium.org/2852333004
Cr-Commit-Position: refs/heads/master@{#470169}
Committed: https://chromium.googlesource.com/chromium/src/+/17399e67ce617b8936b5088dba92...
==========
Message was sent while issue was closed.
Committed patchset #4 (id:60001) as https://chromium.googlesource.com/chromium/src/+/17399e67ce617b8936b5088dba92... |