|
|
DescriptionFix telemetry hangs with PlzNavigate.
This was caused because some inspector protocol messages were being dropped with PlzNavigate with cross-process navigations. The scenario is that
1) FrameHostHolder::DispatchProtocolMessage was being called before the navigation happens. It adds the message to sent_messages_.
2) AboutToNavigate(NavigationHandle* navigation_handle) is called
3) FrameHostHolder::SendMessageToClient is called since the first renderer has responded to the protocol message from 1). It removes the sent message from sent_messages_ and forwards the reply using SendMessageToClient.
4) ReadyToCommitNavigation is called, which only then creates the pending FrameHostHolder
5) DidFinishNavigation is called which sets current_ to pending_.
This manifested in hangs with telemetry, since in cross-process navigations the Page.navigate command was being recieved incorrectly in the first renderer and it was sending a frameId from that renderer. The catapult code was waiting for a navigation event for that frameId which would never come (it would get it from the new process instead).
There were two bugs here with PlzNavigate, compared to the non-PlzNavigate case.
1 The first FrameHostHolder (i.e. current_ during the navigation) was not being paused. This was leading to responses to messages that were sent to the first renderer to respond to the inspector client. Since the response arrived, sent_messages_ was being cleared and we never got to send the message to the new renderer.
2) Even with the above fixed, we had to keep track of messages that were sent to the first renderer while the FrameHostHolder is (now) suspended. This didn't need to happen in the non-PlzNavigate case because the pending FrameHostHolder is created immediately (in the equivalent to step 2 above). At that point in time, SetPending ends up calling Reattach which then grabs the sent_messages_ from the current_ FrameHostHolder. The Page.navigate message was therefore being sent correctly to the second process. When the first process replies, it's ignored since it goes into pending_messages_ buffer. With PlzNavigate, there's a period of time between 2) and 4) where we know we have a navigation but we don't have the RenderFrameHost yet. So we need to emulate the non-PlzNavigate behavior by holding on to sent messages during that window, so that we can send them to the new FrameHostHolder once it commits.
This fixes telemetry_unittests' testWebPageReplay, gpu_process_launch_tests, hardware_accelerated_feature_tests and the page cyclers with PlzNavigate.
BUG=674730
CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.linux:linux_site_isolation
Committed: https://crrev.com/af4c3f9c6ca41f00427a0f1f8de02782ce227629
Cr-Commit-Position: refs/heads/master@{#439592}
Patch Set 1 #Patch Set 2 : with plznav enabled #Patch Set 3 : fix for DevToolsProtocolTest.CrossSitePauseInBeforeUnload #Patch Set 4 : now disable plznav again #Patch Set 5 : better fix to match non-plznavigate case #Patch Set 6 : now disable plznav again #Patch Set 7 : better fix to match non-plznavigate case #Patch Set 8 : now disable plznav again #
Total comments: 10
Patch Set 9 : review comment (with PlzNavigate enabled) #Patch Set 10 : now disable plznav again #Messages
Total messages: 61 (52 generated)
Description was changed from ========== a fix testWebPageReplay BUG= ========== to ========== a fix testWebPageReplay BUG= CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.linux:linux_site_isolation ==========
The CQ bit was checked by jam@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was checked by jam@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: mac_chromium_rel_ng on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_...)
The CQ bit was checked by jam@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was checked by jam@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was checked by jam@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Description was changed from ========== a fix testWebPageReplay BUG= CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.linux:linux_site_isolation ========== to ========== Fix telemetry hangs with PlzNavigate. The issue is that in the inspector protocol, when a navigation spans processes the Page::navigate command was going to the first renderer. The frameId response was therefore from the old renderer's processId. Then when the frameNavigated was sent from the new renderer, the frameIds didn't match and the catapult code was ignoring it. The fix was to start buffering messages before beforeunload request was sent. However, this broke DevToolsProtocolTest.CrossSitePauseInBeforeUnload. Looking through history, I saw r378741 which fixed it by the reverse of this change :) I put in a workaround for that test case, although it's very targeted and it doesn't seem right (i.e. could that message be intended for the new page?). So I'm sending this change to get your thoughts. This fixes telemetry_unittests' testWebPageReplay and gpu_process_launch_tests and hardware_accelerated_feature_tests with PlzNavigate. BUG=674730 ==========
jam@chromium.org changed reviewers: + clamy@chromium.org, dgozman@chromium.org
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
jam@chromium.org changed reviewers: - clamy@chromium.org, dgozman@chromium.org
Description was changed from ========== Fix telemetry hangs with PlzNavigate. The issue is that in the inspector protocol, when a navigation spans processes the Page::navigate command was going to the first renderer. The frameId response was therefore from the old renderer's processId. Then when the frameNavigated was sent from the new renderer, the frameIds didn't match and the catapult code was ignoring it. The fix was to start buffering messages before beforeunload request was sent. However, this broke DevToolsProtocolTest.CrossSitePauseInBeforeUnload. Looking through history, I saw r378741 which fixed it by the reverse of this change :) I put in a workaround for that test case, although it's very targeted and it doesn't seem right (i.e. could that message be intended for the new page?). So I'm sending this change to get your thoughts. This fixes telemetry_unittests' testWebPageReplay and gpu_process_launch_tests and hardware_accelerated_feature_tests with PlzNavigate. BUG=674730 ========== to ========== Fix telemetry hangs with PlzNavigate. The issue is that in the inspector protocol, when a navigation spans processes the Page::navigate command was going to the first renderer. The frameId response was therefore from the old renderer's processId. Then when the frameNavigated was sent from the new renderer, the frameIds didn't match and the catapult code was ignoring it. The fix was to start buffering messages before beforeunload request was sent. However, this broke DevToolsProtocolTest.CrossSitePauseInBeforeUnload. Looking through history, I saw r378741 which fixed it by the reverse of this change :) I put in a workaround for that test case, although it's very targeted and it doesn't seem right (i.e. could that message be intended for the new page?). So I'm sending this change to get your thoughts. This fixes telemetry_unittests' testWebPageReplay and gpu_process_launch_tests and hardware_accelerated_feature_tests with PlzNavigate. BUG=674730 CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.linux:linux_site_isolation ==========
Patchset #5 (id:80001) has been deleted
The CQ bit was checked by jam@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was checked by jam@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Description was changed from ========== Fix telemetry hangs with PlzNavigate. The issue is that in the inspector protocol, when a navigation spans processes the Page::navigate command was going to the first renderer. The frameId response was therefore from the old renderer's processId. Then when the frameNavigated was sent from the new renderer, the frameIds didn't match and the catapult code was ignoring it. The fix was to start buffering messages before beforeunload request was sent. However, this broke DevToolsProtocolTest.CrossSitePauseInBeforeUnload. Looking through history, I saw r378741 which fixed it by the reverse of this change :) I put in a workaround for that test case, although it's very targeted and it doesn't seem right (i.e. could that message be intended for the new page?). So I'm sending this change to get your thoughts. This fixes telemetry_unittests' testWebPageReplay and gpu_process_launch_tests and hardware_accelerated_feature_tests with PlzNavigate. BUG=674730 CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.linux:linux_site_isolation ========== to ========== Fix telemetry hangs with PlzNavigate. This was caused because some inspector protocol messages were being dropped with PlzNavigate. The scenario is that 1) FrameHostHolder::DispatchProtocolMessage was being called before the navigation happens 2) AboutToNavigate(NavigationHandle* navigation_handle) is called 3) FrameHostHolder::SendMessageToClient is called since the first renderer has responded to the protocol message 4) ReadyToCommitNavigation is called, which only then creates the pending FrameHostHolder 5) DidFinishNavigation is called which sets current_ to pending_. This manifested in hangs with telemetry, since on cross-process navigations the Page.navigate command was being recieved incorrectly in the first renderer and it was sending a frameId from that renderer. The catapult code was waiting for a navigation event for that frameId which would never come (it would get it from the new process instead), leading to hangs. There were two bugs here with PlzNavigate, compared to the non-PlzNavigate case. 1 The first FrameHostHolder (i.e. current_ during the navigation) was not being paused. This was leading to responses to messages that were sent to the renderer while it was paused to have the responses go to the inspector client. 2) Even with the above fixed, we had to keep track of messages that were sent to the first renderer while the FrameHostHolder is (now) suspended. This didn't need to happen in the non-PlzNavigate case because the pending FrameHostHolder is created immediately (in the equivalent to step 2 above). At that point in time, SetPending ends up calling Reattach which then grabs the sent_messages_ from the current_ FrameHostHolder. The Page.navigate message was therefore being sent correctly to the second process. When the first process replies, it's ignored since it goes into pending_messages_ buffer. With PlzNavigate, there's a period of time between 2) and 4) where we know we have a navigation but we don't have the RenderFrameHost yet. So we need to emulate the non-PlzNavigate behavior by holding on to sent messages during that window, so that we can send them to the new FrameHostHolder once it commits. This fixes telemetry_unittests' testWebPageReplay and gpu_process_launch_tests and hardware_accelerated_feature_tests with PlzNavigate. BUG=674730 CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.linux:linux_site_isolation ==========
The CQ bit was checked by jam@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was checked by jam@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
Description was changed from ========== Fix telemetry hangs with PlzNavigate. This was caused because some inspector protocol messages were being dropped with PlzNavigate. The scenario is that 1) FrameHostHolder::DispatchProtocolMessage was being called before the navigation happens 2) AboutToNavigate(NavigationHandle* navigation_handle) is called 3) FrameHostHolder::SendMessageToClient is called since the first renderer has responded to the protocol message 4) ReadyToCommitNavigation is called, which only then creates the pending FrameHostHolder 5) DidFinishNavigation is called which sets current_ to pending_. This manifested in hangs with telemetry, since on cross-process navigations the Page.navigate command was being recieved incorrectly in the first renderer and it was sending a frameId from that renderer. The catapult code was waiting for a navigation event for that frameId which would never come (it would get it from the new process instead), leading to hangs. There were two bugs here with PlzNavigate, compared to the non-PlzNavigate case. 1 The first FrameHostHolder (i.e. current_ during the navigation) was not being paused. This was leading to responses to messages that were sent to the renderer while it was paused to have the responses go to the inspector client. 2) Even with the above fixed, we had to keep track of messages that were sent to the first renderer while the FrameHostHolder is (now) suspended. This didn't need to happen in the non-PlzNavigate case because the pending FrameHostHolder is created immediately (in the equivalent to step 2 above). At that point in time, SetPending ends up calling Reattach which then grabs the sent_messages_ from the current_ FrameHostHolder. The Page.navigate message was therefore being sent correctly to the second process. When the first process replies, it's ignored since it goes into pending_messages_ buffer. With PlzNavigate, there's a period of time between 2) and 4) where we know we have a navigation but we don't have the RenderFrameHost yet. So we need to emulate the non-PlzNavigate behavior by holding on to sent messages during that window, so that we can send them to the new FrameHostHolder once it commits. This fixes telemetry_unittests' testWebPageReplay and gpu_process_launch_tests and hardware_accelerated_feature_tests with PlzNavigate. BUG=674730 CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.linux:linux_site_isolation ========== to ========== Fix telemetry hangs with PlzNavigate. This was caused because some inspector protocol messages were being dropped with PlzNavigate with cross-process navigations. The scenario is that 1) FrameHostHolder::DispatchProtocolMessage was being called before the navigation happens. It adds the message to sent_messages_. 2) AboutToNavigate(NavigationHandle* navigation_handle) is called 3) FrameHostHolder::SendMessageToClient is called since the first renderer has responded to the protocol message from 1). It removes the sent message from sent_messages_ and forwards the reply using SendMessageToClient. 4) ReadyToCommitNavigation is called, which only then creates the pending FrameHostHolder 5) DidFinishNavigation is called which sets current_ to pending_. This manifested in hangs with telemetry, since in cross-process navigations the Page.navigate command was being recieved incorrectly in the first renderer and it was sending a frameId from that renderer. The catapult code was waiting for a navigation event for that frameId which would never come (it would get it from the new process instead). There were two bugs here with PlzNavigate, compared to the non-PlzNavigate case. 1 The first FrameHostHolder (i.e. current_ during the navigation) was not being paused. This was leading to responses to messages that were sent to the first renderer to respond to the inspector client. Since the response arrived, sent_messages_ was being cleared and we never got to send the message to the new renderer. 2) Even with the above fixed, we had to keep track of messages that were sent to the first renderer while the FrameHostHolder is (now) suspended. This didn't need to happen in the non-PlzNavigate case because the pending FrameHostHolder is created immediately (in the equivalent to step 2 above). At that point in time, SetPending ends up calling Reattach which then grabs the sent_messages_ from the current_ FrameHostHolder. The Page.navigate message was therefore being sent correctly to the second process. When the first process replies, it's ignored since it goes into pending_messages_ buffer. With PlzNavigate, there's a period of time between 2) and 4) where we know we have a navigation but we don't have the RenderFrameHost yet. So we need to emulate the non-PlzNavigate behavior by holding on to sent messages during that window, so that we can send them to the new FrameHostHolder once it commits. This fixes telemetry_unittests' testWebPageReplay, gpu_process_launch_tests, hardware_accelerated_feature_tests and the page cyclers with PlzNavigate. BUG=674730 CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.linux:linux_site_isolation ==========
jam@chromium.org changed reviewers: + clamy@chromium.org, dgozman@chromium.org
Thank you for the fix! lgtm https://codereview.chromium.org/2585053002/diff/160001/content/browser/devtoo... File content/browser/devtools/render_frame_devtools_agent_host.cc (right): https://codereview.chromium.org/2585053002/diff/160001/content/browser/devtoo... content/browser/devtools/render_frame_devtools_agent_host.cc:601: if (!navigating_handles_.empty()) { I think that bug#2 from description is addressed here, isn't it? Since we now suspend early, no messages are being sent after AboutToNavigate up to the DidFinishNavigation - they are put to in_navigation_protocol_message_buffer_. The fix with sent_messages_whose_reply_came_while_suspended_ addresses the issue that we get the new state_cookie with a reply, and thus we have to resend all the messages we didn't get response for before navigation to get the same backend state in new renderer. WDYT? https://codereview.chromium.org/2585053002/diff/160001/content/browser/devtoo... content/browser/devtools/render_frame_devtools_agent_host.cc:708: current_->Resume(); Let's only resume when navigation_handles_ is empty to be on the safe side. https://codereview.chromium.org/2585053002/diff/160001/content/browser/devtoo... content/browser/devtools/render_frame_devtools_agent_host.cc:750: current_->Suspend(); I think we should only do this (and line before) if navigation_handle's FrameTreeNode equals frame_tree_node_. Otherwise, we'll suspend for any subframe navigation which is unnecessary.
https://codereview.chromium.org/2585053002/diff/160001/content/browser/devtoo... File content/browser/devtools/render_frame_devtools_agent_host.cc (right): https://codereview.chromium.org/2585053002/diff/160001/content/browser/devtoo... content/browser/devtools/render_frame_devtools_agent_host.cc:601: if (!navigating_handles_.empty()) { On 2016/12/19 18:26:44, dgozman wrote: > I think that bug#2 from description is addressed here, isn't it? The Page.navigate message is sent before we're (now) paused. It's step 1 in the description, while the pausing is now happening in step 2. > Since we now > suspend early, no messages are being sent after AboutToNavigate up to the > DidFinishNavigation - they are put to in_navigation_protocol_message_buffer_. > > The fix with sent_messages_whose_reply_came_while_suspended_ addresses the issue > that we get the new state_cookie with a reply, and thus we have to resend all > the messages we didn't get response for before navigation to get the same > backend state in new renderer. WDYT? https://codereview.chromium.org/2585053002/diff/160001/content/browser/devtoo... content/browser/devtools/render_frame_devtools_agent_host.cc:708: current_->Resume(); On 2016/12/19 18:26:44, dgozman wrote: > Let's only resume when navigation_handles_ is empty to be on the safe side. Ok, I don't really understand why there are multiple NavigationHandles being tracked? https://codereview.chromium.org/2585053002/diff/160001/content/browser/devtoo... content/browser/devtools/render_frame_devtools_agent_host.cc:750: current_->Suspend(); On 2016/12/19 18:26:44, dgozman wrote: > I think we should only do this (and line before) if navigation_handle's > FrameTreeNode equals frame_tree_node_. Otherwise, we'll suspend for any subframe > navigation which is unnecessary. That's always the case, i.e. see https://cs.chromium.org/chromium/src/content/browser/devtools/render_frame_de... The static RenderFrameDevToolsAgentHost::OnBeforeNavigation(NavigationHandle* navigation_handle) method always finds the right RenderFrameDevToolsAgentHost by comparing frametreenodes.
The CQ bit was checked by jam@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was checked by jam@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Patchset #10 (id:200001) has been deleted
https://codereview.chromium.org/2585053002/diff/160001/content/browser/devtoo... File content/browser/devtools/render_frame_devtools_agent_host.cc (right): https://codereview.chromium.org/2585053002/diff/160001/content/browser/devtoo... content/browser/devtools/render_frame_devtools_agent_host.cc:708: current_->Resume(); On 2016/12/19 20:13:02, jam wrote: > On 2016/12/19 18:26:44, dgozman wrote: > > Let's only resume when navigation_handles_ is empty to be on the safe side. > > Ok, I don't really understand why there are multiple NavigationHandles being > tracked? I'm not sure. From the time this was added, I vaguely remember there could be multiple navigations at the same time (maybe in download case or something?) Or this is just wrong. https://codereview.chromium.org/2585053002/diff/160001/content/browser/devtoo... content/browser/devtools/render_frame_devtools_agent_host.cc:750: current_->Suspend(); On 2016/12/19 20:13:02, jam wrote: > On 2016/12/19 18:26:44, dgozman wrote: > > I think we should only do this (and line before) if navigation_handle's > > FrameTreeNode equals frame_tree_node_. Otherwise, we'll suspend for any > subframe > > navigation which is unnecessary. > > That's always the case, i.e. see > https://cs.chromium.org/chromium/src/content/browser/devtools/render_frame_de... > > The static RenderFrameDevToolsAgentHost::OnBeforeNavigation(NavigationHandle* > navigation_handle) method always finds the right RenderFrameDevToolsAgentHost by > comparing frametreenodes. Right, my bad.
The CQ bit was unchecked by jam@chromium.org
The CQ bit was checked by jam@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was checked by jam@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
nasko@chromium.org changed reviewers: + nasko@chromium.org
https://codereview.chromium.org/2585053002/diff/160001/content/browser/devtoo... File content/browser/devtools/render_frame_devtools_agent_host.cc (right): https://codereview.chromium.org/2585053002/diff/160001/content/browser/devtoo... content/browser/devtools/render_frame_devtools_agent_host.cc:708: current_->Resume(); On 2016/12/19 20:29:52, dgozman wrote: > On 2016/12/19 20:13:02, jam wrote: > > On 2016/12/19 18:26:44, dgozman wrote: > > > Let's only resume when navigation_handles_ is empty to be on the safe side. > > > > Ok, I don't really understand why there are multiple NavigationHandles being > > tracked? > > I'm not sure. From the time this was added, I vaguely remember there could be > multiple navigations at the same time (maybe in download case or something?) Or > this is just wrong. > Yes, you can have multiple navigations simultaneously, though rarely. The most common example is a cross-process navigation that takes long time and a pushState/replaceState/fragment navigation in the meantime. The latter is considered "same page", so it starts/commits/finishes in a single event loop and the cross-process slow navigation continues to exist until timeout or commit.
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
jam@chromium.org changed reviewers: - nasko@chromium.org
The CQ bit was checked by jam@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from dgozman@chromium.org Link to the patchset: https://codereview.chromium.org/2585053002/#ps220001 (title: "now disable plznav again")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
https://codereview.chromium.org/2585053002/diff/160001/content/browser/devtoo... File content/browser/devtools/render_frame_devtools_agent_host.cc (right): https://codereview.chromium.org/2585053002/diff/160001/content/browser/devtoo... content/browser/devtools/render_frame_devtools_agent_host.cc:708: current_->Resume(); On 2016/12/19 20:59:00, nasko wrote: > On 2016/12/19 20:29:52, dgozman wrote: > > On 2016/12/19 20:13:02, jam wrote: > > > On 2016/12/19 18:26:44, dgozman wrote: > > > > Let's only resume when navigation_handles_ is empty to be on the safe > side. > > > > > > Ok, I don't really understand why there are multiple NavigationHandles being > > > tracked? > > > > I'm not sure. From the time this was added, I vaguely remember there could be > > multiple navigations at the same time (maybe in download case or something?) > Or > > this is just wrong. > > > > Yes, you can have multiple navigations simultaneously, though rarely. The most > common example is a cross-process navigation that takes long time and a > pushState/replaceState/fragment navigation in the meantime. The latter is > considered "same page", so it starts/commits/finishes in a single event loop and > the cross-process slow navigation continues to exist until timeout or commit. Thanks, this is helpful to know. You say "most common example", what are the other cases? If that was the only example, I would have changed this code to ignore in-page navigations and just have one NavigationHandle instead of a set which would simplify things.
CQ is committing da patch. Bot data: {"patchset_id": 220001, "attempt_start_ts": 1482185865416720, "parent_rev": "bbd6b319be41c428a46d6b622fd440904d002333", "commit_rev": "515fc247d336afe78768b4beea206c428841e90b"}
Message was sent while issue was closed.
Description was changed from ========== Fix telemetry hangs with PlzNavigate. This was caused because some inspector protocol messages were being dropped with PlzNavigate with cross-process navigations. The scenario is that 1) FrameHostHolder::DispatchProtocolMessage was being called before the navigation happens. It adds the message to sent_messages_. 2) AboutToNavigate(NavigationHandle* navigation_handle) is called 3) FrameHostHolder::SendMessageToClient is called since the first renderer has responded to the protocol message from 1). It removes the sent message from sent_messages_ and forwards the reply using SendMessageToClient. 4) ReadyToCommitNavigation is called, which only then creates the pending FrameHostHolder 5) DidFinishNavigation is called which sets current_ to pending_. This manifested in hangs with telemetry, since in cross-process navigations the Page.navigate command was being recieved incorrectly in the first renderer and it was sending a frameId from that renderer. The catapult code was waiting for a navigation event for that frameId which would never come (it would get it from the new process instead). There were two bugs here with PlzNavigate, compared to the non-PlzNavigate case. 1 The first FrameHostHolder (i.e. current_ during the navigation) was not being paused. This was leading to responses to messages that were sent to the first renderer to respond to the inspector client. Since the response arrived, sent_messages_ was being cleared and we never got to send the message to the new renderer. 2) Even with the above fixed, we had to keep track of messages that were sent to the first renderer while the FrameHostHolder is (now) suspended. This didn't need to happen in the non-PlzNavigate case because the pending FrameHostHolder is created immediately (in the equivalent to step 2 above). At that point in time, SetPending ends up calling Reattach which then grabs the sent_messages_ from the current_ FrameHostHolder. The Page.navigate message was therefore being sent correctly to the second process. When the first process replies, it's ignored since it goes into pending_messages_ buffer. With PlzNavigate, there's a period of time between 2) and 4) where we know we have a navigation but we don't have the RenderFrameHost yet. So we need to emulate the non-PlzNavigate behavior by holding on to sent messages during that window, so that we can send them to the new FrameHostHolder once it commits. This fixes telemetry_unittests' testWebPageReplay, gpu_process_launch_tests, hardware_accelerated_feature_tests and the page cyclers with PlzNavigate. BUG=674730 CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.linux:linux_site_isolation ========== to ========== Fix telemetry hangs with PlzNavigate. This was caused because some inspector protocol messages were being dropped with PlzNavigate with cross-process navigations. The scenario is that 1) FrameHostHolder::DispatchProtocolMessage was being called before the navigation happens. It adds the message to sent_messages_. 2) AboutToNavigate(NavigationHandle* navigation_handle) is called 3) FrameHostHolder::SendMessageToClient is called since the first renderer has responded to the protocol message from 1). It removes the sent message from sent_messages_ and forwards the reply using SendMessageToClient. 4) ReadyToCommitNavigation is called, which only then creates the pending FrameHostHolder 5) DidFinishNavigation is called which sets current_ to pending_. This manifested in hangs with telemetry, since in cross-process navigations the Page.navigate command was being recieved incorrectly in the first renderer and it was sending a frameId from that renderer. The catapult code was waiting for a navigation event for that frameId which would never come (it would get it from the new process instead). There were two bugs here with PlzNavigate, compared to the non-PlzNavigate case. 1 The first FrameHostHolder (i.e. current_ during the navigation) was not being paused. This was leading to responses to messages that were sent to the first renderer to respond to the inspector client. Since the response arrived, sent_messages_ was being cleared and we never got to send the message to the new renderer. 2) Even with the above fixed, we had to keep track of messages that were sent to the first renderer while the FrameHostHolder is (now) suspended. This didn't need to happen in the non-PlzNavigate case because the pending FrameHostHolder is created immediately (in the equivalent to step 2 above). At that point in time, SetPending ends up calling Reattach which then grabs the sent_messages_ from the current_ FrameHostHolder. The Page.navigate message was therefore being sent correctly to the second process. When the first process replies, it's ignored since it goes into pending_messages_ buffer. With PlzNavigate, there's a period of time between 2) and 4) where we know we have a navigation but we don't have the RenderFrameHost yet. So we need to emulate the non-PlzNavigate behavior by holding on to sent messages during that window, so that we can send them to the new FrameHostHolder once it commits. This fixes telemetry_unittests' testWebPageReplay, gpu_process_launch_tests, hardware_accelerated_feature_tests and the page cyclers with PlzNavigate. BUG=674730 CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.linux:linux_site_isolation Review-Url: https://codereview.chromium.org/2585053002 ==========
Message was sent while issue was closed.
Committed patchset #10 (id:220001)
Message was sent while issue was closed.
Description was changed from ========== Fix telemetry hangs with PlzNavigate. This was caused because some inspector protocol messages were being dropped with PlzNavigate with cross-process navigations. The scenario is that 1) FrameHostHolder::DispatchProtocolMessage was being called before the navigation happens. It adds the message to sent_messages_. 2) AboutToNavigate(NavigationHandle* navigation_handle) is called 3) FrameHostHolder::SendMessageToClient is called since the first renderer has responded to the protocol message from 1). It removes the sent message from sent_messages_ and forwards the reply using SendMessageToClient. 4) ReadyToCommitNavigation is called, which only then creates the pending FrameHostHolder 5) DidFinishNavigation is called which sets current_ to pending_. This manifested in hangs with telemetry, since in cross-process navigations the Page.navigate command was being recieved incorrectly in the first renderer and it was sending a frameId from that renderer. The catapult code was waiting for a navigation event for that frameId which would never come (it would get it from the new process instead). There were two bugs here with PlzNavigate, compared to the non-PlzNavigate case. 1 The first FrameHostHolder (i.e. current_ during the navigation) was not being paused. This was leading to responses to messages that were sent to the first renderer to respond to the inspector client. Since the response arrived, sent_messages_ was being cleared and we never got to send the message to the new renderer. 2) Even with the above fixed, we had to keep track of messages that were sent to the first renderer while the FrameHostHolder is (now) suspended. This didn't need to happen in the non-PlzNavigate case because the pending FrameHostHolder is created immediately (in the equivalent to step 2 above). At that point in time, SetPending ends up calling Reattach which then grabs the sent_messages_ from the current_ FrameHostHolder. The Page.navigate message was therefore being sent correctly to the second process. When the first process replies, it's ignored since it goes into pending_messages_ buffer. With PlzNavigate, there's a period of time between 2) and 4) where we know we have a navigation but we don't have the RenderFrameHost yet. So we need to emulate the non-PlzNavigate behavior by holding on to sent messages during that window, so that we can send them to the new FrameHostHolder once it commits. This fixes telemetry_unittests' testWebPageReplay, gpu_process_launch_tests, hardware_accelerated_feature_tests and the page cyclers with PlzNavigate. BUG=674730 CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.linux:linux_site_isolation Review-Url: https://codereview.chromium.org/2585053002 ========== to ========== Fix telemetry hangs with PlzNavigate. This was caused because some inspector protocol messages were being dropped with PlzNavigate with cross-process navigations. The scenario is that 1) FrameHostHolder::DispatchProtocolMessage was being called before the navigation happens. It adds the message to sent_messages_. 2) AboutToNavigate(NavigationHandle* navigation_handle) is called 3) FrameHostHolder::SendMessageToClient is called since the first renderer has responded to the protocol message from 1). It removes the sent message from sent_messages_ and forwards the reply using SendMessageToClient. 4) ReadyToCommitNavigation is called, which only then creates the pending FrameHostHolder 5) DidFinishNavigation is called which sets current_ to pending_. This manifested in hangs with telemetry, since in cross-process navigations the Page.navigate command was being recieved incorrectly in the first renderer and it was sending a frameId from that renderer. The catapult code was waiting for a navigation event for that frameId which would never come (it would get it from the new process instead). There were two bugs here with PlzNavigate, compared to the non-PlzNavigate case. 1 The first FrameHostHolder (i.e. current_ during the navigation) was not being paused. This was leading to responses to messages that were sent to the first renderer to respond to the inspector client. Since the response arrived, sent_messages_ was being cleared and we never got to send the message to the new renderer. 2) Even with the above fixed, we had to keep track of messages that were sent to the first renderer while the FrameHostHolder is (now) suspended. This didn't need to happen in the non-PlzNavigate case because the pending FrameHostHolder is created immediately (in the equivalent to step 2 above). At that point in time, SetPending ends up calling Reattach which then grabs the sent_messages_ from the current_ FrameHostHolder. The Page.navigate message was therefore being sent correctly to the second process. When the first process replies, it's ignored since it goes into pending_messages_ buffer. With PlzNavigate, there's a period of time between 2) and 4) where we know we have a navigation but we don't have the RenderFrameHost yet. So we need to emulate the non-PlzNavigate behavior by holding on to sent messages during that window, so that we can send them to the new FrameHostHolder once it commits. This fixes telemetry_unittests' testWebPageReplay, gpu_process_launch_tests, hardware_accelerated_feature_tests and the page cyclers with PlzNavigate. BUG=674730 CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.linux:linux_site_isolation Committed: https://crrev.com/af4c3f9c6ca41f00427a0f1f8de02782ce227629 Cr-Commit-Position: refs/heads/master@{#439592} ==========
Message was sent while issue was closed.
Patchset 10 (id:??) landed as https://crrev.com/af4c3f9c6ca41f00427a0f1f8de02782ce227629 Cr-Commit-Position: refs/heads/master@{#439592} |