|
|
Created:
6 years, 2 months ago by blundell Modified:
6 years, 1 month ago Reviewers:
Sam McNally, darin (slow to review), jochen (gone - plz use gerrit), Charlie Reis, yzshen1 CC:
chromium-reviews, creis+watch_chromium.org, darin-cc_chromium.org, jam, nasko+codewatch_chromium.org, viettrungluu Base URL:
https://chromium.googlesource.com/chromium/src.git@master Project:
chromium Visibility:
Public. |
DescriptionEnsure RenderFrameHostImpl's RenderFrameSetupPtr doesn't die too early
RenderFrameHostImpl is creating a RenderFrameSetupPtr, binding it via
ConnectToRemoteService(), and then using it to bootstrap the per-frame
Mojo connection. However, it holds this instance in a local variable,
meaning that the pipe is closed when the local variable goes out of
scope. The closing of the pipe races with the processing of the message
write that is the result of calling
RenderFrameSetupPtr->GetServiceProviderForFrame(), as that processing occurs
on a different thread. The net outcome is that the per-frame Mojo
connection will flakily fail to be set up.
To eliminate this race, this CL changes the RenderFrameSetupPtr instance
that the RFHI holds from a local variable to a member variable.
BUG=424069
TEST=Visit about://omnibox in multiple tabs. In each tab, check that submitting
text causes output to appear.
Committed: https://crrev.com/235d012e9d23fbcc456a06496b246fb4e05d98d6
Cr-Commit-Position: refs/heads/master@{#300864}
Patch Set 1 #Patch Set 2 : Add TODO #Patch Set 3 : Export hard dependency as necessary #Patch Set 4 : Fix #
Messages
Total messages: 26 (6 generated)
blundell@chromium.org changed reviewers: + sammc@chromium.org
yzshen@chromium.org changed reviewers: + yzshen@chromium.org
On 2014/10/20 11:00:12, blundell wrote: I also thought of this. But according to Trung, even using a local variable and closing it immediately after calling a method on it, it should work. There may be some root cause that prevents this from working correctly. +CC trung.
On 2014/10/20 16:04:02, yzshen1 wrote: > On 2014/10/20 11:00:12, blundell wrote: > > I also thought of this. But according to Trung, even using a local variable and > closing it immediately after calling a method on it, it should work. > There may be some root cause that prevents this from working correctly. > > +CC trung. It doesn't work in at least one edge case, which is what's tripping us up here: the channel endpoint hasn't been run yet (https://code.google.com/p/chromium/codesearch#chromium/src/mojo/edk/system/ch...). In that case, the local variable's call to the method will result in the channel endpoint enqueuing the message. When the local variable goes out of scope, its destructor will end up calling DetachFromMessagePipe() on the channel endpoint. If that call ends up occurring before the call to ChannelEndpoint::Run() that races with it on the IO thread, then the call to ChannelEndpoint::Run() will short-circuit out because the channel is null: https://code.google.com/p/chromium/codesearch#chromium/src/mojo/edk/system/ch.... This is the behavior that I was seeing in the geolocation browsertests that I was debugging, and indeed, those tests no longer fail with this change included. In addition, I cannot repro the about://omnibox unresponsiveness with this change included.
blundell@chromium.org changed reviewers: + darin@chromium.org
+darin@ for //content OWNERS
On 2014/10/20 19:21:34, blundell wrote: > On 2014/10/20 16:04:02, yzshen1 wrote: > > On 2014/10/20 11:00:12, blundell wrote: > > > > I also thought of this. But according to Trung, even using a local variable > and > > closing it immediately after calling a method on it, it should work. > > There may be some root cause that prevents this from working correctly. > > > > +CC trung. > > It doesn't work in at least one edge case, which is what's tripping us up here: > the channel endpoint hasn't been run yet > (https://code.google.com/p/chromium/codesearch#chromium/src/mojo/edk/system/ch...). > > In that case, the local variable's call to the method will result in the channel > endpoint enqueuing the message. When the local variable goes out of scope, its > destructor will end up calling DetachFromMessagePipe() on the channel endpoint. > If that call ends up occurring before the call to ChannelEndpoint::Run() that > races with it on the IO thread, then the call to ChannelEndpoint::Run() will > short-circuit out because the channel is null: > https://code.google.com/p/chromium/codesearch#chromium/src/mojo/edk/system/ch.... > > This is the behavior that I was seeing in the geolocation browsertests that I > was debugging, and indeed, those tests no longer fail with this change included. > In addition, I cannot repro the about://omnibox unresponsiveness with this > change included. (No need to wait for my LG. IMO it doesn't hurt to land this CL. I just think we also need a more fundamental fix.) My understanding is that postponing the destruction of the interface pointer (i.e., closing the message pipe endpoint) will make the race harder to repro, but not eliminate it. There isn't any notification/signal telling us when is "safe" to close that message pipe endpoint after we use it to send messages. Besides, I think by design we want a message pipe to reliably deliver remaining messages after we close one end of it.
On 2014/10/21 16:40:45, yzshen1 wrote: > On 2014/10/20 19:21:34, blundell wrote: > > On 2014/10/20 16:04:02, yzshen1 wrote: > > > On 2014/10/20 11:00:12, blundell wrote: > > > > > > I also thought of this. But according to Trung, even using a local variable > > and > > > closing it immediately after calling a method on it, it should work. > > > There may be some root cause that prevents this from working correctly. > > > > > > +CC trung. > > > > It doesn't work in at least one edge case, which is what's tripping us up > here: > > the channel endpoint hasn't been run yet > > > (https://code.google.com/p/chromium/codesearch#chromium/src/mojo/edk/system/ch...). > > > > In that case, the local variable's call to the method will result in the > channel > > endpoint enqueuing the message. When the local variable goes out of scope, its > > destructor will end up calling DetachFromMessagePipe() on the channel > endpoint. > > If that call ends up occurring before the call to ChannelEndpoint::Run() that > > races with it on the IO thread, then the call to ChannelEndpoint::Run() will > > short-circuit out because the channel is null: > > > https://code.google.com/p/chromium/codesearch#chromium/src/mojo/edk/system/ch.... > > > > This is the behavior that I was seeing in the geolocation browsertests that I > > was debugging, and indeed, those tests no longer fail with this change > included. > > In addition, I cannot repro the about://omnibox unresponsiveness with this > > change included. > > (No need to wait for my LG. IMO it doesn't hurt to land this CL. I just think we > also need a more fundamental fix.) > My understanding is that postponing the destruction of the interface pointer > (i.e., closing the message pipe endpoint) will make the race harder to repro, > but not eliminate it. There isn't any notification/signal telling us when is > "safe" to close that message pipe endpoint after we use it to send messages. > Besides, I think by design we want a message pipe to reliably deliver remaining > messages after we close one end of it. Hi Yuzhu, It will eliminate the race in that the message pipe endpoint won't be closed until the RenderFrameHostImpl dies, at which point we don't care about the Mojo connection between that RFHI and its associated RenderFrame. I would agree with you that it seems like a bug that the message doesn't get reliably delivered in this case, but I'm not an expert here and would be curious on Trung's opinion. In any case, I believe that this fix will eliminate the problems that we're seeing on the frame-level Mojo connection due to this behavior.
On 2014/10/21 17:52:10, blundell wrote: > On 2014/10/21 16:40:45, yzshen1 wrote: > > On 2014/10/20 19:21:34, blundell wrote: > > > On 2014/10/20 16:04:02, yzshen1 wrote: > > > > On 2014/10/20 11:00:12, blundell wrote: > > > > > > > > I also thought of this. But according to Trung, even using a local > variable > > > and > > > > closing it immediately after calling a method on it, it should work. > > > > There may be some root cause that prevents this from working correctly. > > > > > > > > +CC trung. > > > > > > It doesn't work in at least one edge case, which is what's tripping us up > > here: > > > the channel endpoint hasn't been run yet > > > > > > (https://code.google.com/p/chromium/codesearch#chromium/src/mojo/edk/system/ch...). > > > > > > In that case, the local variable's call to the method will result in the > > channel > > > endpoint enqueuing the message. When the local variable goes out of scope, > its > > > destructor will end up calling DetachFromMessagePipe() on the channel > > endpoint. > > > If that call ends up occurring before the call to ChannelEndpoint::Run() > that > > > races with it on the IO thread, then the call to ChannelEndpoint::Run() will > > > short-circuit out because the channel is null: > > > > > > https://code.google.com/p/chromium/codesearch#chromium/src/mojo/edk/system/ch.... > > > > > > This is the behavior that I was seeing in the geolocation browsertests that > I > > > was debugging, and indeed, those tests no longer fail with this change > > included. > > > In addition, I cannot repro the about://omnibox unresponsiveness with this > > > change included. > > > > (No need to wait for my LG. IMO it doesn't hurt to land this CL. I just think > we > > also need a more fundamental fix.) > > My understanding is that postponing the destruction of the interface pointer > > (i.e., closing the message pipe endpoint) will make the race harder to repro, > > but not eliminate it. There isn't any notification/signal telling us when is > > "safe" to close that message pipe endpoint after we use it to send messages. > > Besides, I think by design we want a message pipe to reliably deliver > remaining > > messages after we close one end of it. > > Hi Yuzhu, > > It will eliminate the race in that the message pipe endpoint won't be closed > until the RenderFrameHostImpl dies, at which point we don't care about the Mojo > connection between that RFHI and its associated RenderFrame. > > I would agree with you that it seems like a bug that the message doesn't get > reliably delivered in this case, but I'm not an expert here and would be curious > on Trung's opinion. In any case, I believe that this fix will eliminate the > problems that we're seeing on the frame-level Mojo connection due to this > behavior. I agree with all you said above. I was just trying to say that it doesn't seems like a general solution to delay destruction of the message pipe handle until we no longer care about the messages delivered over it. (And as I said, I don't have a problem with landing this CL.) Thanks!
On 2014/10/21 18:04:16, yzshen1 wrote: > On 2014/10/21 17:52:10, blundell wrote: > > On 2014/10/21 16:40:45, yzshen1 wrote: > > > On 2014/10/20 19:21:34, blundell wrote: > > > > On 2014/10/20 16:04:02, yzshen1 wrote: > > > > > On 2014/10/20 11:00:12, blundell wrote: > > > > > > > > > > I also thought of this. But according to Trung, even using a local > > variable > > > > and > > > > > closing it immediately after calling a method on it, it should work. > > > > > There may be some root cause that prevents this from working correctly. > > > > > > > > > > +CC trung. > > > > > > > > It doesn't work in at least one edge case, which is what's tripping us up > > > here: > > > > the channel endpoint hasn't been run yet > > > > > > > > > > (https://code.google.com/p/chromium/codesearch#chromium/src/mojo/edk/system/ch...). > > > > > > > > In that case, the local variable's call to the method will result in the > > > channel > > > > endpoint enqueuing the message. When the local variable goes out of scope, > > its > > > > destructor will end up calling DetachFromMessagePipe() on the channel > > > endpoint. > > > > If that call ends up occurring before the call to ChannelEndpoint::Run() > > that > > > > races with it on the IO thread, then the call to ChannelEndpoint::Run() > will > > > > short-circuit out because the channel is null: > > > > > > > > > > https://code.google.com/p/chromium/codesearch#chromium/src/mojo/edk/system/ch.... > > > > > > > > This is the behavior that I was seeing in the geolocation browsertests > that > > I > > > > was debugging, and indeed, those tests no longer fail with this change > > > included. > > > > In addition, I cannot repro the about://omnibox unresponsiveness with this > > > > change included. > > > > > > (No need to wait for my LG. IMO it doesn't hurt to land this CL. I just > think > > we > > > also need a more fundamental fix.) > > > My understanding is that postponing the destruction of the interface pointer > > > (i.e., closing the message pipe endpoint) will make the race harder to > repro, > > > but not eliminate it. There isn't any notification/signal telling us when is > > > "safe" to close that message pipe endpoint after we use it to send messages. > > > Besides, I think by design we want a message pipe to reliably deliver > > remaining > > > messages after we close one end of it. > > > > Hi Yuzhu, > > > > It will eliminate the race in that the message pipe endpoint won't be closed > > until the RenderFrameHostImpl dies, at which point we don't care about the > Mojo > > connection between that RFHI and its associated RenderFrame. > > > > > > I would agree with you that it seems like a bug that the message doesn't get > > reliably delivered in this case, but I'm not an expert here and would be > curious > > on Trung's opinion. In any case, I believe that this fix will eliminate the > > problems that we're seeing on the frame-level Mojo connection due to this > > behavior. > > I agree with all you said above. > I was just trying to say that it doesn't seems like a general solution to delay > destruction of the message pipe handle until we no longer care about the > messages delivered over it. > > (And as I said, I don't have a problem with landing this CL.) > > Thanks! Losing messages when closing one end of a pipe does seem like a bug in the mojo system. Please add a TODO to change this back once it's fixed. Other than that LGTM.
darin: ping.
blundell@chromium.org changed reviewers: + jochen@chromium.org
+jochen@ for //content in case you can get to it before Darin.
lgtm
sammc: TODO added.
The CQ bit was checked by blundell@chromium.org
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/667683002/20001
Message was sent while issue was closed.
Committed patchset #2 (id:20001)
Message was sent while issue was closed.
Patchset 2 (id:??) landed as https://crrev.com/235d012e9d23fbcc456a06496b246fb4e05d98d6 Cr-Commit-Position: refs/heads/master@{#300864}
Message was sent while issue was closed.
I think the CL may have something to do with the compilation failure on Google Chrome Linux bot: http://build.chromium.org/p/chromium.chrome/builders/Google%20Chrome%20Linux/... It didn't fail right after the CL was landed, but a few builds later. Maybe some dependencies haven't been properly specified? On Thu, Oct 23, 2014 at 3:41 AM, <commit-bot@chromium.org> wrote: > Patchset 2 (id:??) landed as > https://crrev.com/235d012e9d23fbcc456a06496b246fb4e05d98d6 > Cr-Commit-Position: refs/heads/master@{#300864} > > https://codereview.chromium.org/667683002/ > -- Best regards, Yuzhu Shen. To unsubscribe from this group and stop receiving emails from it, send an email to chromium-reviews+unsubscribe@chromium.org.
Message was sent while issue was closed.
It also failed Google Chrome Linux x64: http://build.chromium.org/p/chromium.chrome/builders/Google%20Chrome%20Linux%... I have reverted the CL: https://codereview.chromium.org/672263002 and also commented about possible solution there. On Thu, Oct 23, 2014 at 9:37 AM, Yuzhu Shen <yzshen@google.com> wrote: > I think the CL may have something to do with the compilation failure on > Google Chrome Linux bot: > > http://build.chromium.org/p/chromium.chrome/builders/Google%20Chrome%20Linux/... > > It didn't fail right after the CL was landed, but a few builds later. > Maybe some dependencies haven't been properly specified? > > On Thu, Oct 23, 2014 at 3:41 AM, <commit-bot@chromium.org> wrote: > >> Patchset 2 (id:??) landed as >> https://crrev.com/235d012e9d23fbcc456a06496b246fb4e05d98d6 >> Cr-Commit-Position: refs/heads/master@{#300864} >> >> https://codereview.chromium.org/667683002/ >> > > > > -- > Best regards, > Yuzhu Shen. > To unsubscribe from this group and stop receiving emails from it, send an email to chromium-reviews+unsubscribe@chromium.org.
blundell@chromium.org changed reviewers: + creis@chromium.org
Charlie, could you review the gypfile changes since jochen@ is out of office? This CL had already landed, but got reverted because I didn't get the dependencies right. Thanks!
On 2014/10/24 16:51:25, blundell wrote: > Charlie, could you review the gypfile changes since jochen@ is out of office? > This CL had already landed, but got reverted because I didn't get the > dependencies right. Thanks! I don't understand the gyp issues involved here, but rubber stamp LGTM if it matches James's recommendation from mojo-dev.
In the meantime, the underlying Mojo issue has been fixed in https://codereview.chromium.org/664763002, so closing this out. |