Issue 667683002: Ensure RenderFrameHostImpl's RenderFrameSetupPtr doesn't die too early

Issue 667683002: Ensure RenderFrameHostImpl's RenderFrameSetupPtr doesn't die too early (Closed)

Created:
6 years, 2 months ago by blundell

Modified:
6 years, 1 month ago

Reviewers:
Sam McNally, darin (slow to review), jochen (gone - plz use gerrit), Charlie Reis, yzshen1

CC:
chromium-reviews, creis+watch_chromium.org, darin-cc_chromium.org, jam, nasko+codewatch_chromium.org, viettrungluu

Base URL:
https://chromium.googlesource.com/chromium/src.git@master

Project:
chromium

Visibility:
Public.

More Reviews

Description

Ensure RenderFrameHostImpl's RenderFrameSetupPtr doesn't die too early RenderFrameHostImpl is creating a RenderFrameSetupPtr, binding it via ConnectToRemoteService(), and then using it to bootstrap the per-frame Mojo connection. However, it holds this instance in a local variable, meaning that the pipe is closed when the local variable goes out of scope. The closing of the pipe races with the processing of the message write that is the result of calling RenderFrameSetupPtr->GetServiceProviderForFrame(), as that processing occurs on a different thread. The net outcome is that the per-frame Mojo connection will flakily fail to be set up. To eliminate this race, this CL changes the RenderFrameSetupPtr instance that the RFHI holds from a local variable to a member variable. BUG=424069 TEST=Visit about://omnibox in multiple tabs. In each tab, check that submitting text causes output to appear. Committed: https://crrev.com/235d012e9d23fbcc456a06496b246fb4e05d98d6 Cr-Commit-Position: refs/heads/master@{#300864}

Patch Set 1 #

Patch Set 2 : Add TODO #

Patch Set 3 : Export hard dependency as necessary #

Patch Set 4 : Fix #

Created: 6 years, 2 months ago

Download [raw] [tar.bz2]

	Unified diffs	Side-by-side diffs	Delta from patch set	Stats (+25 lines, -5 lines)			Patch
M	content/browser/frame_host/render_frame_host_impl.h	View	1 2	2 chunks	+9 lines, -0 lines	0 comments	Download
M	content/browser/frame_host/render_frame_host_impl.cc	View	1 2	2 chunks	+4 lines, -5 lines	0 comments	Download
M	content/content_browser.gypi	View	1 2	1 chunk	+6 lines, -0 lines	0 comments	Download
M	content/content_tests.gypi	View	1 2 3	1 chunk	+6 lines, -0 lines	0 comments	Download

Messages

Total messages: 26 (6 generated)

Expand Messages | Collapse Messages | Show Generated Messages | Hide Generated Messages

yzshen1

On 2014/10/20 11:00:12, blundell wrote: I also thought of this. But according to Trung, even ...

6 years, 2 months ago (2014-10-20 16:04:02 UTC) #4

blundell

On 2014/10/20 16:04:02, yzshen1 wrote: > On 2014/10/20 11:00:12, blundell wrote: > > I also ...

6 years, 2 months ago (2014-10-20 19:21:34 UTC) #5

yzshen1

On 2014/10/20 19:21:34, blundell wrote: > On 2014/10/20 16:04:02, yzshen1 wrote: > > On 2014/10/20 ...

6 years, 2 months ago (2014-10-21 16:40:45 UTC) #8

On 2014/10/20 19:21:34, blundell wrote:
> On 2014/10/20 16:04:02, yzshen1 wrote:
> > On 2014/10/20 11:00:12, blundell wrote:
> > 
> > I also thought of this. But according to Trung, even using a local variable
> and
> > closing it immediately after calling a method on it, it should work.
> > There may be some root cause that prevents this from working correctly.
> > 
> > +CC trung.
> 
> It doesn't work in at least one edge case, which is what's tripping us up
here:
> the channel endpoint hasn't been run yet
>
(https://code.google.com/p/chromium/codesearch#chromium/src/mojo/edk/system/ch...).
> 
> In that case, the local variable's call to the method will result in the
channel
> endpoint enqueuing the message. When the local variable goes out of scope, its
> destructor will end up calling DetachFromMessagePipe() on the channel
endpoint.
> If that call ends up occurring before the call to ChannelEndpoint::Run() that
> races with it on the IO thread, then the call to ChannelEndpoint::Run() will
> short-circuit out because the channel is null:
>
https://code.google.com/p/chromium/codesearch#chromium/src/mojo/edk/system/ch....
> 
> This is the behavior that I was seeing in the geolocation browsertests that I
> was debugging, and indeed, those tests no longer fail with this change
included.
> In addition, I cannot repro the about://omnibox unresponsiveness with this
> change included.

(No need to wait for my LG. IMO it doesn't hurt to land this CL. I just think we
also need a more fundamental fix.)
My understanding is that postponing the destruction of the interface pointer
(i.e., closing the message pipe endpoint) will make the race harder to repro,
but not eliminate it. There isn't any notification/signal telling us when is
"safe" to close that message pipe endpoint after we use it to send messages.
Besides, I think by design we want a message pipe to reliably deliver remaining
messages after we close one end of it.

blundell

On 2014/10/21 16:40:45, yzshen1 wrote: > On 2014/10/20 19:21:34, blundell wrote: > > On 2014/10/20 ...

6 years, 2 months ago (2014-10-21 17:52:10 UTC) #9

On 2014/10/21 16:40:45, yzshen1 wrote:
> On 2014/10/20 19:21:34, blundell wrote:
> > On 2014/10/20 16:04:02, yzshen1 wrote:
> > > On 2014/10/20 11:00:12, blundell wrote:
> > > 
> > > I also thought of this. But according to Trung, even using a local
variable
> > and
> > > closing it immediately after calling a method on it, it should work.
> > > There may be some root cause that prevents this from working correctly.
> > > 
> > > +CC trung.
> > 
> > It doesn't work in at least one edge case, which is what's tripping us up
> here:
> > the channel endpoint hasn't been run yet
> >
>
(https://code.google.com/p/chromium/codesearch#chromium/src/mojo/edk/system/ch...).
> > 
> > In that case, the local variable's call to the method will result in the
> channel
> > endpoint enqueuing the message. When the local variable goes out of scope,
its
> > destructor will end up calling DetachFromMessagePipe() on the channel
> endpoint.
> > If that call ends up occurring before the call to ChannelEndpoint::Run()
that
> > races with it on the IO thread, then the call to ChannelEndpoint::Run() will
> > short-circuit out because the channel is null:
> >
>
https://code.google.com/p/chromium/codesearch#chromium/src/mojo/edk/system/ch....
> > 
> > This is the behavior that I was seeing in the geolocation browsertests that
I
> > was debugging, and indeed, those tests no longer fail with this change
> included.
> > In addition, I cannot repro the about://omnibox unresponsiveness with this
> > change included.
> 
> (No need to wait for my LG. IMO it doesn't hurt to land this CL. I just think
we
> also need a more fundamental fix.)
> My understanding is that postponing the destruction of the interface pointer
> (i.e., closing the message pipe endpoint) will make the race harder to repro,
> but not eliminate it. There isn't any notification/signal telling us when is
> "safe" to close that message pipe endpoint after we use it to send messages.
> Besides, I think by design we want a message pipe to reliably deliver
remaining
> messages after we close one end of it.

Hi Yuzhu,

It will eliminate the race in that the message pipe endpoint won't be closed
until the RenderFrameHostImpl dies, at which point we don't care about the Mojo
connection between that RFHI and its associated RenderFrame.

I would agree with you that it seems like a bug that the message doesn't get
reliably delivered in this case, but I'm not an expert here and would be curious
on Trung's opinion. In any case, I believe that this fix will eliminate the
problems that we're seeing on the frame-level Mojo connection due to this
behavior.

yzshen1

On 2014/10/21 17:52:10, blundell wrote: > On 2014/10/21 16:40:45, yzshen1 wrote: > > On 2014/10/20 ...

6 years, 2 months ago (2014-10-21 18:04:16 UTC) #10

On 2014/10/21 17:52:10, blundell wrote:
> On 2014/10/21 16:40:45, yzshen1 wrote:
> > On 2014/10/20 19:21:34, blundell wrote:
> > > On 2014/10/20 16:04:02, yzshen1 wrote:
> > > > On 2014/10/20 11:00:12, blundell wrote:
> > > > 
> > > > I also thought of this. But according to Trung, even using a local
> variable
> > > and
> > > > closing it immediately after calling a method on it, it should work.
> > > > There may be some root cause that prevents this from working correctly.
> > > > 
> > > > +CC trung.
> > > 
> > > It doesn't work in at least one edge case, which is what's tripping us up
> > here:
> > > the channel endpoint hasn't been run yet
> > >
> >
>
(https://code.google.com/p/chromium/codesearch#chromium/src/mojo/edk/system/ch...).
> > > 
> > > In that case, the local variable's call to the method will result in the
> > channel
> > > endpoint enqueuing the message. When the local variable goes out of scope,
> its
> > > destructor will end up calling DetachFromMessagePipe() on the channel
> > endpoint.
> > > If that call ends up occurring before the call to ChannelEndpoint::Run()
> that
> > > races with it on the IO thread, then the call to ChannelEndpoint::Run()
will
> > > short-circuit out because the channel is null:
> > >
> >
>
https://code.google.com/p/chromium/codesearch#chromium/src/mojo/edk/system/ch....
> > > 
> > > This is the behavior that I was seeing in the geolocation browsertests
that
> I
> > > was debugging, and indeed, those tests no longer fail with this change
> > included.
> > > In addition, I cannot repro the about://omnibox unresponsiveness with this
> > > change included.
> > 
> > (No need to wait for my LG. IMO it doesn't hurt to land this CL. I just
think
> we
> > also need a more fundamental fix.)
> > My understanding is that postponing the destruction of the interface pointer
> > (i.e., closing the message pipe endpoint) will make the race harder to
repro,
> > but not eliminate it. There isn't any notification/signal telling us when is
> > "safe" to close that message pipe endpoint after we use it to send messages.
> > Besides, I think by design we want a message pipe to reliably deliver
> remaining
> > messages after we close one end of it.
> 
> Hi Yuzhu,
> 
> It will eliminate the race in that the message pipe endpoint won't be closed
> until the RenderFrameHostImpl dies, at which point we don't care about the
Mojo
> connection between that RFHI and its associated RenderFrame.


> 
> I would agree with you that it seems like a bug that the message doesn't get
> reliably delivered in this case, but I'm not an expert here and would be
curious
> on Trung's opinion. In any case, I believe that this fix will eliminate the
> problems that we're seeing on the frame-level Mojo connection due to this
> behavior.

I agree with all you said above.
I was just trying to say that it doesn't seems like a general solution to delay
destruction of the message pipe handle until we no longer care about the
messages delivered over it.

(And as I said, I don't have a problem with landing this CL.)

Thanks!

Sam McNally

On 2014/10/21 18:04:16, yzshen1 wrote: > On 2014/10/21 17:52:10, blundell wrote: > > On 2014/10/21 ...

6 years, 2 months ago (2014-10-22 02:14:48 UTC) #11

On 2014/10/21 18:04:16, yzshen1 wrote:
> On 2014/10/21 17:52:10, blundell wrote:
> > On 2014/10/21 16:40:45, yzshen1 wrote:
> > > On 2014/10/20 19:21:34, blundell wrote:
> > > > On 2014/10/20 16:04:02, yzshen1 wrote:
> > > > > On 2014/10/20 11:00:12, blundell wrote:
> > > > > 
> > > > > I also thought of this. But according to Trung, even using a local
> > variable
> > > > and
> > > > > closing it immediately after calling a method on it, it should work.
> > > > > There may be some root cause that prevents this from working
correctly.
> > > > > 
> > > > > +CC trung.
> > > > 
> > > > It doesn't work in at least one edge case, which is what's tripping us
up
> > > here:
> > > > the channel endpoint hasn't been run yet
> > > >
> > >
> >
>
(https://code.google.com/p/chromium/codesearch#chromium/src/mojo/edk/system/ch...).
> > > > 
> > > > In that case, the local variable's call to the method will result in the
> > > channel
> > > > endpoint enqueuing the message. When the local variable goes out of
scope,
> > its
> > > > destructor will end up calling DetachFromMessagePipe() on the channel
> > > endpoint.
> > > > If that call ends up occurring before the call to ChannelEndpoint::Run()
> > that
> > > > races with it on the IO thread, then the call to ChannelEndpoint::Run()
> will
> > > > short-circuit out because the channel is null:
> > > >
> > >
> >
>
https://code.google.com/p/chromium/codesearch#chromium/src/mojo/edk/system/ch....
> > > > 
> > > > This is the behavior that I was seeing in the geolocation browsertests
> that
> > I
> > > > was debugging, and indeed, those tests no longer fail with this change
> > > included.
> > > > In addition, I cannot repro the about://omnibox unresponsiveness with
this
> > > > change included.
> > > 
> > > (No need to wait for my LG. IMO it doesn't hurt to land this CL. I just
> think
> > we
> > > also need a more fundamental fix.)
> > > My understanding is that postponing the destruction of the interface
pointer
> > > (i.e., closing the message pipe endpoint) will make the race harder to
> repro,
> > > but not eliminate it. There isn't any notification/signal telling us when
is
> > > "safe" to close that message pipe endpoint after we use it to send
messages.
> > > Besides, I think by design we want a message pipe to reliably deliver
> > remaining
> > > messages after we close one end of it.
> > 
> > Hi Yuzhu,
> > 
> > It will eliminate the race in that the message pipe endpoint won't be closed
> > until the RenderFrameHostImpl dies, at which point we don't care about the
> Mojo
> > connection between that RFHI and its associated RenderFrame.
> 
> 
> > 
> > I would agree with you that it seems like a bug that the message doesn't get
> > reliably delivered in this case, but I'm not an expert here and would be
> curious
> > on Trung's opinion. In any case, I believe that this fix will eliminate the
> > problems that we're seeing on the frame-level Mojo connection due to this
> > behavior.
> 
> I agree with all you said above.
> I was just trying to say that it doesn't seems like a general solution to
delay
> destruction of the message pipe handle until we no longer care about the
> messages delivered over it.
> 
> (And as I said, I don't have a problem with landing this CL.)
> 
> Thanks!

Losing messages when closing one end of a pipe does seem like a bug in the mojo
system. Please add a TODO to change this back once it's fixed. Other than that
LGTM.

blundell

+jochen@ for //content in case you can get to it before Darin.

6 years, 2 months ago (2014-10-23 06:32:07 UTC) #14

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/667683002/20001

6 years, 2 months ago (2014-10-23 09:53:32 UTC) #18

commit-bot: I haz the power

Patchset 2 (id:??) landed as https://crrev.com/235d012e9d23fbcc456a06496b246fb4e05d98d6 Cr-Commit-Position: refs/heads/master@{#300864}

6 years, 2 months ago (2014-10-23 10:41:06 UTC) #20

chromium-reviews

I think the CL may have something to do with the compilation failure on Google ...

6 years, 2 months ago (2014-10-23 16:37:32 UTC) #21

yzshen1

It also failed Google Chrome Linux x64: http://build.chromium.org/p/chromium.chrome/builders/Google%20Chrome%20Linux%20x64/builds/53445/steps/compile/logs/stdio I have reverted the CL: https://codereview.chromium.org/672263002 and ...

6 years, 2 months ago (2014-10-23 17:23:44 UTC) #22

blundell

Charlie, could you review the gypfile changes since jochen@ is out of office? This CL ...

6 years, 2 months ago (2014-10-24 16:51:25 UTC) #24

Charlie Reis

On 2014/10/24 16:51:25, blundell wrote: > Charlie, could you review the gypfile changes since jochen@ ...

6 years, 2 months ago (2014-10-24 18:07:16 UTC) #25

blundell

6 years, 1 month ago (2014-10-28 09:59:36 UTC) #26

In the meantime, the underlying Mojo issue has been fixed in
https://codereview.chromium.org/664763002, so closing this out.

Expand Messages | Collapse Messages | Show Generated Messages | Hide Generated Messages