|
|
Chromium Code Reviews|
Created:
4 years, 7 months ago by Mikhail Goncharov Modified:
4 years, 7 months ago CC:
chromium-reviews, darin-cc_chromium.org, nasko+codewatch_chromium.org, jam, creis+watch_chromium.org, Avi (use Gerrit) Base URL:
https://chromium.googlesource.com/chromium/src.git@master Target Ref:
refs/pending/heads/master Project:
chromium Visibility:
Public. |
DescriptionFix: tabs slow to response to beforeunload are closed prematurely.
Issue mitigates as browser loses some tabs after relaunch.
What happens underneath in time order:
- browser initiates shutdown and asks unload controller to fire beforeunload
events
- unload controller fills up queue of tabs to notify and fires beforeunload to
tab #1
- frame host proxies event to tab #1 and starts one second timer
- tab #1 receives beforeunload and starts processing
- timer expires and web_content_impl receives RendererUnresponsive. It fires
onbeforeunloadACK on behalf of tab #1
- unload controller receives ACK; finds tab #1 in beforeunload queue, removes it
from
that queue and put to the queue "unload", responses with 'false' (aka
"don't unload") as we want to fire unload events when every tab accepted
unload. Now unload controller fires beforeunload to tab #2
- frame host proxies event to tab #2 and starts one second timer
- tab #1 has finished and fires beforeunload ACK (it was slow to respond because
of
system load or trip to server or w/e)
- unload controller receives ACK; is unable to find tab #1 in queue (tab was
already removed), assumes that tab #1 was created after queue was formed and
responds with 'true' ("proceed unload")
- tab #1 unloads and notifies session service (session service is still alive
atm)
- session service records that tab #1 was closed. That's it. Now chrome will not
restore this tab. Same thing may happen to some other tabs. Missed tabs will
be scattered randomly across tabstrips depending on OS schedule and content.
Intend of this CL is to prevent such scenarios. Responsiveness of browser should
not be changed - even if tab is really unresponsive it will be closed as before,
after unload controller made sure that every other tab is acknowledged unload.
New tests might be flacky like TestHangInBeforeUnloadMultipleTabs
because under heavy load tab might fail to response within one second and
treated as if it has no beforeunload handler.
BUG=365052
R=clamy@chromium.org,sky@chromium.org
CQ_INCLUDE_TRYBOTS=tryserver.chromium.linux:linux_site_isolation
Committed: https://crrev.com/73fc9555a1ecb23f1076b51e4602dfee5419e2f9
Cr-Commit-Position: refs/heads/master@{#395368}
Patch Set 1 : Draft #Patch Set 2 : Revisited solution #
Total comments: 6
Patch Set 3 : Use whole rfhi::BeforeUnloadAck #
Total comments: 15
Patch Set 4 : Review fixes; new tests are disabled #Patch Set 5 : with tests completely disabled #Patch Set 6 : Correct order of checks #
Messages
Total messages: 58 (18 generated)
Description was changed from ========== Fix: tabs slow to response to beforeunload are closed prematurely. Issue mitigates as browser loses some tabs after relaunch. Scenario is following: unload controller fires 'beforeunload' event every tab sequentially and waits for answer. If tab failed to respond within one second then web_contents_impl sends ack response on behalf of tab and unload controller continues with the next. But if tab was slow to respond because of overall load to machine and fired beforeunload ACK after been declared unresponsive then unload controller is surprised by that response and tells tab to close immediately. This fix changes unload controller and web content behavior to do not force close unresponding tabs when browser is going to shutdown. Additionally fix addresses an issue of tabs that failed to respond to beforeunload quickly were marked as "don't ask again". Thus if user cancels shutdown and then clicked 'close' once again she might miss unload dialog from tab that is responding now. New tests might be flacky like MAYBE_TestHangInBeforeUnloadMultipleTabs because under heavy load tab might fail to response within one second and treated as if it has no beforeunload handler. BUG=365052 R=clamy@chromium.org ========== to ========== Fix: tabs slow to response to beforeunload are closed prematurely. Issue mitigates as browser loses some tabs after relaunch. Scenario is following: unload controller fires 'beforeunload' event every tab sequentially and waits for answer. If tab failed to respond within one second then web_contents_impl sends ack response on behalf of tab and unload controller continues with the next. But if tab was slow to respond because of overall load to machine and fired beforeunload ACK after been declared unresponsive then unload controller is surprised by that response and tells tab to close immediately. This fix changes unload controller and web content behavior to do not force close unresponding tabs when browser is going to shutdown. Additionally fix addresses an issue of tabs that failed to respond to beforeunload quickly were marked as "don't ask again". Thus if user cancels shutdown and then clicked 'close' once again she might miss unload dialog from tab that is responding now. New tests might be flacky like MAYBE_TestHangInBeforeUnloadMultipleTabs because under heavy load tab might fail to response within one second and treated as if it has no beforeunload handler. BUG=365052 R=clamy@chromium.org CQ_INCLUDE_TRYBOTS=tryserver.chromium.linux:linux_site_isolation ==========
creis@chromium.org changed reviewers: + creis@chromium.org
We should not do this. Wouldn't it mean that you couldn't close a tab if the renderer process was slow to respond, or in an infinite loop? We intentionally have a timer to keep Chrome responsive to the user's request in these cases. > Issue mitigates as browser loses some tabs after relaunch. I'm curious how this CL is related to that bug? The tabs closed when beforeunload is bypassed do get restored in the next session, which I've just verified on Canary. That's a serious bug, and any leads to help diagnose it would be great. I just don't see how this would help.
On 2016/05/11 17:10:17, Charlie Reis wrote:
> We should not do this. Wouldn't it mean that you couldn't close a tab if the
> renderer process was slow to respond, or in an infinite loop?
> We intentionally
> have a timer to keep Chrome responsive to the user's request in these cases.
No, user id still able to close the tab as all measures are still in place.
Closing of single tab isn't affected at all - only browser closing process.
> > Issue mitigates as browser loses some tabs after relaunch.
>
> I'm curious how this CL is related to that bug? The tabs closed when
> beforeunload is bypassed do get restored in the next session, which I've just
> verified on Canary.
You can easily reproduce bug this way:
- download and open "beforeunload_slow.html" in 4 tabs
- select "quit chrome" in menu
(note that some tabs are closed during shutdown exactly like users describe
in bug comments)
- open chrome: only two tabs restored while we expect to see all four.
What happens underneath in time order:
- browser initiates shutdown and asks unload controller to fire beforeunload
events
- unload controller fills up queue of tabs to notify and fires beforeunload to
tab #1
- frame host proxies event to tab #1 and starts one second timer
- tab #1 receives beforeunload and starts processing
- timer expires and web_content_impl receives RendererUnresponsive. It fires
onbeforeunloadACK on behalf of tab #1
- unload controller receives ACK; finds tab #1 in beforeunload queue, removes it
from
that queue and put to the queue "unload", responses with 'false' (aka
"don't unload") as we want to fire unload events when every tab accepted
unload. Now unload controller fires beforeunload to tab #2
- frame host proxies event to tab #2 and starts one second timer
- tab #1 has finished and fires beforeunload ACK (it was slow to respond because
of
system load or trip to server or w/e)
- unload controller receives ACK; is unable to find tab #1 in queue (tab was
already removed), assumes that tab #1 was created after queue was formed and
responds with 'true' ("proceed unload")
- tab #1 unloads and notifies session service (session service is still alive
atm)
- session service records that tab #1 was closed. That's it. Now chrome will not
restore this tab. Same thing may happen to some other tabs. Missed tabs will
be scattered randomly across tabstrips depending on OS schedule and content.
Intend of this CL is to prevent such scenarios. Responsiveness of browser should
not be changed - even if tab is really unresponsive it will be closed as before,
after unload controller made sure that every other tab is acknowledged unload.
> That's a serious bug, and any leads to help diagnose it would be great. I
just
> don't see how this would help.
On 2016/05/12 06:09:01, metaflow wrote:
> On 2016/05/11 17:10:17, Charlie Reis wrote:
> > We should not do this. Wouldn't it mean that you couldn't close a tab if
the
> > renderer process was slow to respond, or in an infinite loop?
> > We intentionally
> > have a timer to keep Chrome responsive to the user's request in these cases.
>
> No, user id still able to close the tab as all measures are still in place.
> Closing of single tab isn't affected at all - only browser closing process.
>
> > > Issue mitigates as browser loses some tabs after relaunch.
> >
> > I'm curious how this CL is related to that bug? The tabs closed when
> > beforeunload is bypassed do get restored in the next session, which I've
just
> > verified on Canary.
>
> You can easily reproduce bug this way:
> - download and open "beforeunload_slow.html" in 4 tabs
> - select "quit chrome" in menu
> (note that some tabs are closed during shutdown exactly like users describe
> in bug comments)
> - open chrome: only two tabs restored while we expect to see all four.
>
> What happens underneath in time order:
>
> - browser initiates shutdown and asks unload controller to fire beforeunload
> events
> - unload controller fills up queue of tabs to notify and fires beforeunload to
> tab #1
> - frame host proxies event to tab #1 and starts one second timer
> - tab #1 receives beforeunload and starts processing
> - timer expires and web_content_impl receives RendererUnresponsive. It fires
> onbeforeunloadACK on behalf of tab #1
> - unload controller receives ACK; finds tab #1 in beforeunload queue, removes
it
> from
> that queue and put to the queue "unload", responses with 'false' (aka
> "don't unload") as we want to fire unload events when every tab accepted
> unload. Now unload controller fires beforeunload to tab #2
> - frame host proxies event to tab #2 and starts one second timer
> - tab #1 has finished and fires beforeunload ACK (it was slow to respond
because
> of
> system load or trip to server or w/e)
> - unload controller receives ACK; is unable to find tab #1 in queue (tab was
> already removed), assumes that tab #1 was created after queue was formed and
> responds with 'true' ("proceed unload")
> - tab #1 unloads and notifies session service (session service is still alive
> atm)
> - session service records that tab #1 was closed. That's it. Now chrome will
not
> restore this tab. Same thing may happen to some other tabs. Missed tabs will
> be scattered randomly across tabstrips depending on OS schedule and content.
>
> Intend of this CL is to prevent such scenarios. Responsiveness of browser
should
> not be changed - even if tab is really unresponsive it will be closed as
before,
> after unload controller made sure that every other tab is acknowledged unload.
>
> > That's a serious bug, and any leads to help diagnose it would be great. I
> just
> > don't see how this would help.
To me it seems that the issue is more with the unload controller than with
WebContents. It could just check whether the tab is in the queue "unload"
instead of assuming right away that it was created after the queue
"beforeunload" started. If it finds it in the queue unload, then it won't
respond with true (aka unload) but false. It seems like a much simpler fix than
what you're trying to do in that CL.
On 2016/05/12 06:26:11, clamy wrote:
> On 2016/05/12 06:09:01, metaflow wrote:
> > On 2016/05/11 17:10:17, Charlie Reis wrote:
> > > We should not do this. Wouldn't it mean that you couldn't close a tab if
> the
> > > renderer process was slow to respond, or in an infinite loop?
> > > We intentionally
> > > have a timer to keep Chrome responsive to the user's request in these
cases.
> >
> > No, user id still able to close the tab as all measures are still in place.
> > Closing of single tab isn't affected at all - only browser closing process.
> >
> > > > Issue mitigates as browser loses some tabs after relaunch.
> > >
> > > I'm curious how this CL is related to that bug? The tabs closed when
> > > beforeunload is bypassed do get restored in the next session, which I've
> just
> > > verified on Canary.
> >
> > You can easily reproduce bug this way:
> > - download and open "beforeunload_slow.html" in 4 tabs
> > - select "quit chrome" in menu
> > (note that some tabs are closed during shutdown exactly like users describe
> > in bug comments)
> > - open chrome: only two tabs restored while we expect to see all four.
> >
> > What happens underneath in time order:
> >
> > - browser initiates shutdown and asks unload controller to fire beforeunload
> > events
> > - unload controller fills up queue of tabs to notify and fires beforeunload
to
> > tab #1
> > - frame host proxies event to tab #1 and starts one second timer
> > - tab #1 receives beforeunload and starts processing
> > - timer expires and web_content_impl receives RendererUnresponsive. It fires
> > onbeforeunloadACK on behalf of tab #1
> > - unload controller receives ACK; finds tab #1 in beforeunload queue,
removes
> it
> > from
> > that queue and put to the queue "unload", responses with 'false' (aka
> > "don't unload") as we want to fire unload events when every tab accepted
> > unload. Now unload controller fires beforeunload to tab #2
> > - frame host proxies event to tab #2 and starts one second timer
> > - tab #1 has finished and fires beforeunload ACK (it was slow to respond
> because
> > of
> > system load or trip to server or w/e)
> > - unload controller receives ACK; is unable to find tab #1 in queue (tab was
> > already removed), assumes that tab #1 was created after queue was formed and
> > responds with 'true' ("proceed unload")
> > - tab #1 unloads and notifies session service (session service is still
alive
> > atm)
> > - session service records that tab #1 was closed. That's it. Now chrome will
> not
> > restore this tab. Same thing may happen to some other tabs. Missed tabs will
> > be scattered randomly across tabstrips depending on OS schedule and content.
> >
> > Intend of this CL is to prevent such scenarios. Responsiveness of browser
> should
> > not be changed - even if tab is really unresponsive it will be closed as
> before,
> > after unload controller made sure that every other tab is acknowledged
unload.
> >
> > > That's a serious bug, and any leads to help diagnose it would be great. I
> > just
> > > don't see how this would help.
>
> To me it seems that the issue is more with the unload controller than with
> WebContents. It could just check whether the tab is in the queue "unload"
> instead of assuming right away that it was created after the queue
> "beforeunload" started. If it finds it in the queue unload, then it won't
> respond with true (aka unload) but false. It seems like a much simpler fix
than
> what you're trying to do in that CL.
Yes I agree and I am going to remove additional method in web_contents as it
seems really unnecessary.
What about modifications to handle following scenario?
- open two tabs with beforeunload handlers (some doc editing)
- edit data in tab #2 and select "close browser"
- for some reason tab #1 was unable to respond in time to beforeunload but still
alive
- user answers "keep browser open" to dialog from tab #2
- opens tab #1 do some edits and clicks button to close this single tab
expected behavior: tab #1 asks user if she wants to close.
actual behavior: tab will close promptly (as it was marked "allow sudden
termination" in RendererUnresponsive)
I suggest to keep this fix even though I didn't manage to find corresponding
bug. There is a test to reproduce that.
On 2016/05/12 06:52:11, metaflow wrote:
> On 2016/05/12 06:26:11, clamy wrote:
> > On 2016/05/12 06:09:01, metaflow wrote:
> > > On 2016/05/11 17:10:17, Charlie Reis wrote:
> > > > We should not do this. Wouldn't it mean that you couldn't close a tab
if
> > the
> > > > renderer process was slow to respond, or in an infinite loop?
> > > > We intentionally
> > > > have a timer to keep Chrome responsive to the user's request in these
> cases.
> > >
> > > No, user id still able to close the tab as all measures are still in
place.
> > > Closing of single tab isn't affected at all - only browser closing
process.
> > >
> > > > > Issue mitigates as browser loses some tabs after relaunch.
> > > >
> > > > I'm curious how this CL is related to that bug? The tabs closed when
> > > > beforeunload is bypassed do get restored in the next session, which I've
> > just
> > > > verified on Canary.
> > >
> > > You can easily reproduce bug this way:
> > > - download and open "beforeunload_slow.html" in 4 tabs
> > > - select "quit chrome" in menu
> > > (note that some tabs are closed during shutdown exactly like users
describe
> > > in bug comments)
> > > - open chrome: only two tabs restored while we expect to see all four.
> > >
> > > What happens underneath in time order:
> > >
> > > - browser initiates shutdown and asks unload controller to fire
beforeunload
> > > events
> > > - unload controller fills up queue of tabs to notify and fires
beforeunload
> to
> > > tab #1
> > > - frame host proxies event to tab #1 and starts one second timer
> > > - tab #1 receives beforeunload and starts processing
> > > - timer expires and web_content_impl receives RendererUnresponsive. It
fires
> > > onbeforeunloadACK on behalf of tab #1
> > > - unload controller receives ACK; finds tab #1 in beforeunload queue,
> removes
> > it
> > > from
> > > that queue and put to the queue "unload", responses with 'false' (aka
> > > "don't unload") as we want to fire unload events when every tab accepted
> > > unload. Now unload controller fires beforeunload to tab #2
> > > - frame host proxies event to tab #2 and starts one second timer
> > > - tab #1 has finished and fires beforeunload ACK (it was slow to respond
> > because
> > > of
> > > system load or trip to server or w/e)
> > > - unload controller receives ACK; is unable to find tab #1 in queue (tab
was
> > > already removed), assumes that tab #1 was created after queue was formed
and
> > > responds with 'true' ("proceed unload")
> > > - tab #1 unloads and notifies session service (session service is still
> alive
> > > atm)
> > > - session service records that tab #1 was closed. That's it. Now chrome
will
> > not
> > > restore this tab. Same thing may happen to some other tabs. Missed tabs
will
> > > be scattered randomly across tabstrips depending on OS schedule and
content.
> > >
> > > Intend of this CL is to prevent such scenarios. Responsiveness of browser
> > should
> > > not be changed - even if tab is really unresponsive it will be closed as
> > before,
> > > after unload controller made sure that every other tab is acknowledged
> unload.
> > >
> > > > That's a serious bug, and any leads to help diagnose it would be great.
I
> > > just
> > > > don't see how this would help.
> >
> > To me it seems that the issue is more with the unload controller than with
> > WebContents. It could just check whether the tab is in the queue "unload"
> > instead of assuming right away that it was created after the queue
> > "beforeunload" started. If it finds it in the queue unload, then it won't
> > respond with true (aka unload) but false. It seems like a much simpler fix
> than
> > what you're trying to do in that CL.
>
> Yes I agree and I am going to remove additional method in web_contents as it
> seems really unnecessary.
>
> What about modifications to handle following scenario?
>
> - open two tabs with beforeunload handlers (some doc editing)
> - edit data in tab #2 and select "close browser"
> - for some reason tab #1 was unable to respond in time to beforeunload but
still
> alive
> - user answers "keep browser open" to dialog from tab #2
> - opens tab #1 do some edits and clicks button to close this single tab
> expected behavior: tab #1 asks user if she wants to close.
> actual behavior: tab will close promptly (as it was marked "allow sudden
> termination" in RendererUnresponsive)
Just set allow sudden termination to false in
WebContentsImpl::RendererResponsive. It's only set to true when the renderer is
unresponsive, so it seems logical to set it to false when the renderer becomes
responsive again.
>
> I suggest to keep this fix even though I didn't manage to find corresponding
> bug. There is a test to reproduce that.
On 2016/05/12 07:24:31, clamy wrote:
> On 2016/05/12 06:52:11, metaflow wrote:
> > On 2016/05/12 06:26:11, clamy wrote:
> > > On 2016/05/12 06:09:01, metaflow wrote:
> > > > On 2016/05/11 17:10:17, Charlie Reis wrote:
> > > > > We should not do this. Wouldn't it mean that you couldn't close a tab
> if
> > > the
> > > > > renderer process was slow to respond, or in an infinite loop?
> > > > > We intentionally
> > > > > have a timer to keep Chrome responsive to the user's request in these
> > cases.
> > > >
> > > > No, user id still able to close the tab as all measures are still in
> place.
> > > > Closing of single tab isn't affected at all - only browser closing
> process.
> > > >
> > > > > > Issue mitigates as browser loses some tabs after relaunch.
> > > > >
> > > > > I'm curious how this CL is related to that bug? The tabs closed when
> > > > > beforeunload is bypassed do get restored in the next session, which
I've
> > > just
> > > > > verified on Canary.
> > > >
> > > > You can easily reproduce bug this way:
> > > > - download and open "beforeunload_slow.html" in 4 tabs
> > > > - select "quit chrome" in menu
> > > > (note that some tabs are closed during shutdown exactly like users
> describe
> > > > in bug comments)
> > > > - open chrome: only two tabs restored while we expect to see all four.
> > > >
> > > > What happens underneath in time order:
> > > >
> > > > - browser initiates shutdown and asks unload controller to fire
> beforeunload
> > > > events
> > > > - unload controller fills up queue of tabs to notify and fires
> beforeunload
> > to
> > > > tab #1
> > > > - frame host proxies event to tab #1 and starts one second timer
> > > > - tab #1 receives beforeunload and starts processing
> > > > - timer expires and web_content_impl receives RendererUnresponsive. It
> fires
> > > > onbeforeunloadACK on behalf of tab #1
> > > > - unload controller receives ACK; finds tab #1 in beforeunload queue,
> > removes
> > > it
> > > > from
> > > > that queue and put to the queue "unload", responses with 'false' (aka
> > > > "don't unload") as we want to fire unload events when every tab accepted
> > > > unload. Now unload controller fires beforeunload to tab #2
> > > > - frame host proxies event to tab #2 and starts one second timer
> > > > - tab #1 has finished and fires beforeunload ACK (it was slow to respond
> > > because
> > > > of
> > > > system load or trip to server or w/e)
> > > > - unload controller receives ACK; is unable to find tab #1 in queue (tab
> was
> > > > already removed), assumes that tab #1 was created after queue was formed
> and
> > > > responds with 'true' ("proceed unload")
> > > > - tab #1 unloads and notifies session service (session service is still
> > alive
> > > > atm)
> > > > - session service records that tab #1 was closed. That's it. Now chrome
> will
> > > not
> > > > restore this tab. Same thing may happen to some other tabs. Missed tabs
> will
> > > > be scattered randomly across tabstrips depending on OS schedule and
> content.
> > > >
> > > > Intend of this CL is to prevent such scenarios. Responsiveness of
browser
> > > should
> > > > not be changed - even if tab is really unresponsive it will be closed as
> > > before,
> > > > after unload controller made sure that every other tab is acknowledged
> > unload.
> > > >
> > > > > That's a serious bug, and any leads to help diagnose it would be
great.
> I
> > > > just
> > > > > don't see how this would help.
> > >
> > > To me it seems that the issue is more with the unload controller than with
> > > WebContents. It could just check whether the tab is in the queue "unload"
> > > instead of assuming right away that it was created after the queue
> > > "beforeunload" started. If it finds it in the queue unload, then it won't
> > > respond with true (aka unload) but false. It seems like a much simpler fix
> > than
> > > what you're trying to do in that CL.
> >
> > Yes I agree and I am going to remove additional method in web_contents as it
> > seems really unnecessary.
> >
> > What about modifications to handle following scenario?
> >
> > - open two tabs with beforeunload handlers (some doc editing)
> > - edit data in tab #2 and select "close browser"
> > - for some reason tab #1 was unable to respond in time to beforeunload but
> still
> > alive
> > - user answers "keep browser open" to dialog from tab #2
> > - opens tab #1 do some edits and clicks button to close this single tab
> > expected behavior: tab #1 asks user if she wants to close.
> > actual behavior: tab will close promptly (as it was marked "allow sudden
> > termination" in RendererUnresponsive)
>
> Just set allow sudden termination to false in
> WebContentsImpl::RendererResponsive. It's only set to true when the renderer
is
> unresponsive, so it seems logical to set it to false when the renderer becomes
> responsive again.
>
Alright. What do you think about tests flakiness? I bet new test will have same
issue as TestHangInBeforeUnloadMultipleTabs:
sometimes tab that supposed to prevent browser from close doesn't respond within
a second then browser closes and test hung or fails.
> >
> > I suggest to keep this fix even though I didn't manage to find corresponding
> > bug. There is a test to reproduce that.
creis@chromium.org changed reviewers: + sky@chromium.org
[+sky]
Wow, thanks for explaining! That didn't come across in the CL description. Can
you post this comment to the bug, to indicate that you've found the likely
cause?
I like the direction you and clamy@ are exploring to simplify the fix. I think
sky@ is probably a good reviewer for this, since he's familiar with the session
service and probably UnloadController as well.
On 2016/05/12 06:09:01, Mikhail Goncharov wrote:
> On 2016/05/11 17:10:17, Charlie Reis wrote:
> > We should not do this. Wouldn't it mean that you couldn't close a tab if
the
> > renderer process was slow to respond, or in an infinite loop?
> > We intentionally
> > have a timer to keep Chrome responsive to the user's request in these cases.
>
> No, user id still able to close the tab as all measures are still in place.
> Closing of single tab isn't affected at all - only browser closing process.
Ah. What will the outcome be if a beforeunload handler is in an infinite loop
when you try to shut down the whole browser? Will we still exit after the 1
second timer fires?
> > > Issue mitigates as browser loses some tabs after relaunch.
> >
> > I'm curious how this CL is related to that bug? The tabs closed when
> > beforeunload is bypassed do get restored in the next session, which I've
just
> > verified on Canary.
>
> You can easily reproduce bug this way:
> - download and open "beforeunload_slow.html" in 4 tabs
> - select "quit chrome" in menu
> (note that some tabs are closed during shutdown exactly like users describe
> in bug comments)
> - open chrome: only two tabs restored while we expect to see all four.
>
> What happens underneath in time order:
>
> - browser initiates shutdown and asks unload controller to fire beforeunload
> events
> - unload controller fills up queue of tabs to notify and fires beforeunload to
> tab #1
> - frame host proxies event to tab #1 and starts one second timer
> - tab #1 receives beforeunload and starts processing
> - timer expires and web_content_impl receives RendererUnresponsive. It fires
> onbeforeunloadACK on behalf of tab #1
> - unload controller receives ACK; finds tab #1 in beforeunload queue, removes
it
> from
> that queue and put to the queue "unload", responses with 'false' (aka
> "don't unload") as we want to fire unload events when every tab accepted
> unload. Now unload controller fires beforeunload to tab #2
> - frame host proxies event to tab #2 and starts one second timer
> - tab #1 has finished and fires beforeunload ACK (it was slow to respond
because
> of
> system load or trip to server or w/e)
> - unload controller receives ACK; is unable to find tab #1 in queue (tab was
> already removed), assumes that tab #1 was created after queue was formed and
> responds with 'true' ("proceed unload")
> - tab #1 unloads and notifies session service (session service is still alive
> atm)
> - session service records that tab #1 was closed. That's it. Now chrome will
not
> restore this tab. Same thing may happen to some other tabs. Missed tabs will
> be scattered randomly across tabstrips depending on OS schedule and content.
>
> Intend of this CL is to prevent such scenarios. Responsiveness of browser
should
> not be changed - even if tab is really unresponsive it will be closed as
before,
> after unload controller made sure that every other tab is acknowledged unload.
>
> > That's a serious bug, and any leads to help diagnose it would be great. I
> just
> > don't see how this would help.
I have uploaded revisited solution: - turns out it is enough to tell render_frame_host_impl to do not accept next BeforeUnloadAck to prevent duplicate responses. Modification of unload controller is not necessary. Plus we don't run into situation when render_frame_host_impl waited for both BeforeUnloadAck and UnloadAck. - I decided to remove handling of second scenario as fast_unload_controller doesn't want to get along with resetting sudden_termination_allowed and solving this is not easy side pickup anymore ^_^ I have created second patch set from scratch and not sure if that is correct way to upload it.
On 2016/05/12 17:03:41, Charlie Reis wrote: > [+sky] > > Wow, thanks for explaining! That didn't come across in the CL description. Can > you post this comment to the bug, to indicate that you've found the likely > cause? I will re-post in a moment. > I like the direction you and clamy@ are exploring to simplify the fix. I think > sky@ is probably a good reviewer for this, since he's familiar with the session > service and probably UnloadController as well. > Ah. What will the outcome be if a beforeunload handler is in an infinite loop > when you try to shut down the whole browser? Will we still exit after the 1 > second timer fires? Hanged tab will be closed along with others when unload controller finished with collecting beforeunload responses.
On 2016/05/12 17:11:31, Mikhail Goncharov wrote: > I have uploaded revisited solution: > > - turns out it is enough to tell render_frame_host_impl to do not accept next > BeforeUnloadAck to prevent duplicate responses. Modification of unload > controller is not necessary. Plus we don't run into situation when > render_frame_host_impl waited for both BeforeUnloadAck and UnloadAck. > - I decided to remove handling of second scenario as fast_unload_controller > doesn't want to get along with resetting sudden_termination_allowed and solving > this is not easy side pickup anymore ^_^ > > I have created second patch set from scratch and not sure if that is correct way > to upload it. I guess I may just delete first patch set, right?
On 2016/05/12 17:17:13, Mikhail Goncharov wrote: > On 2016/05/12 17:03:41, Charlie Reis wrote: > > [+sky] > > > > Wow, thanks for explaining! That didn't come across in the CL description. > Can > > you post this comment to the bug, to indicate that you've found the likely > > cause? > > I will re-post in a moment. > You did it already :) Thanks. > > I like the direction you and clamy@ are exploring to simplify the fix. I > think > > sky@ is probably a good reviewer for this, since he's familiar with the > session > > service and probably UnloadController as well. > > > Ah. What will the outcome be if a beforeunload handler is in an infinite loop > > when you try to shut down the whole browser? Will we still exit after the 1 > > second timer fires? > > Hanged tab will be closed along with others when unload controller finished with > collecting beforeunload responses.
Description was changed from ========== Fix: tabs slow to response to beforeunload are closed prematurely. Issue mitigates as browser loses some tabs after relaunch. Scenario is following: unload controller fires 'beforeunload' event every tab sequentially and waits for answer. If tab failed to respond within one second then web_contents_impl sends ack response on behalf of tab and unload controller continues with the next. But if tab was slow to respond because of overall load to machine and fired beforeunload ACK after been declared unresponsive then unload controller is surprised by that response and tells tab to close immediately. This fix changes unload controller and web content behavior to do not force close unresponding tabs when browser is going to shutdown. Additionally fix addresses an issue of tabs that failed to respond to beforeunload quickly were marked as "don't ask again". Thus if user cancels shutdown and then clicked 'close' once again she might miss unload dialog from tab that is responding now. New tests might be flacky like MAYBE_TestHangInBeforeUnloadMultipleTabs because under heavy load tab might fail to response within one second and treated as if it has no beforeunload handler. BUG=365052 R=clamy@chromium.org CQ_INCLUDE_TRYBOTS=tryserver.chromium.linux:linux_site_isolation ========== to ========== Fix: tabs slow to response to beforeunload are closed prematurely. Issue mitigates as browser loses some tabs after relaunch. What happens underneath in time order: - browser initiates shutdown and asks unload controller to fire beforeunload events - unload controller fills up queue of tabs to notify and fires beforeunload to tab #1 - frame host proxies event to tab #1 and starts one second timer - tab #1 receives beforeunload and starts processing - timer expires and web_content_impl receives RendererUnresponsive. It fires onbeforeunloadACK on behalf of tab #1 - unload controller receives ACK; finds tab #1 in beforeunload queue, removes it from that queue and put to the queue "unload", responses with 'false' (aka "don't unload") as we want to fire unload events when every tab accepted unload. Now unload controller fires beforeunload to tab #2 - frame host proxies event to tab #2 and starts one second timer - tab #1 has finished and fires beforeunload ACK (it was slow to respond because of system load or trip to server or w/e) - unload controller receives ACK; is unable to find tab #1 in queue (tab was already removed), assumes that tab #1 was created after queue was formed and responds with 'true' ("proceed unload") - tab #1 unloads and notifies session service (session service is still alive atm) - session service records that tab #1 was closed. That's it. Now chrome will not restore this tab. Same thing may happen to some other tabs. Missed tabs will be scattered randomly across tabstrips depending on OS schedule and content. Intend of this CL is to prevent such scenarios. Responsiveness of browser should not be changed - even if tab is really unresponsive it will be closed as before, after unload controller made sure that every other tab is acknowledged unload. New tests might be flacky like TestHangInBeforeUnloadMultipleTabs because under heavy load tab might fail to response within one second and treated as if it has no beforeunload handler. BUG=365052 R=clamy@chromium.org,sky@chromium.org CQ_INCLUDE_TRYBOTS=tryserver.chromium.linux:linux_site_isolation ==========
On 2016/05/12 17:18:14, Mikhail Goncharov wrote: > On 2016/05/12 17:11:31, Mikhail Goncharov wrote: > > I have uploaded revisited solution: > > > > - turns out it is enough to tell render_frame_host_impl to do not accept next > > BeforeUnloadAck to prevent duplicate responses. Modification of unload > > controller is not necessary. Plus we don't run into situation when > > render_frame_host_impl waited for both BeforeUnloadAck and UnloadAck. > > - I decided to remove handling of second scenario as fast_unload_controller > > doesn't want to get along with resetting sudden_termination_allowed and > solving > > this is not easy side pickup anymore ^_^ > > > > I have created second patch set from scratch and not sure if that is correct > way > > to upload it. Great, thanks. > > I guess I may just delete first patch set, right? Please don't delete the earlier patch set. It's useful for understanding how the CL evolves and what the discussion was about.
sky@, can you see if you have any recommendations for how to write a non-flaky test for this? I don't know the tests in chrome/browser/lifetime. One question about an alternate approach, but it may be bigger than we want to take on for a fix we want to merge to branch. (clamy@, I'm interested in your thoughts on how it affects PlzNavigate and FTN loading state.) https://codereview.chromium.org/1968933002/diff/20001/chrome/browser/lifetime... File chrome/browser/lifetime/browser_close_manager_browsertest.cc (right): https://codereview.chromium.org/1968933002/diff/20001/chrome/browser/lifetime... chrome/browser/lifetime/browser_close_manager_browsertest.cc:486: for (int i = 0; i < 7; i++) { Can you add a comment to this block saying why you're loading one instance of beforeunload.html vs lots of beforeunload_slow.html? https://codereview.chromium.org/1968933002/diff/20001/chrome/browser/lifetime... chrome/browser/lifetime/browser_close_manager_browsertest.cc:520: TestBeforeUnloadMultipleSlowWindows) { Is there a meaningful difference between slow windows and slow tabs? I would imagine that's the same code, but I don't know the UnloadController. https://codereview.chromium.org/1968933002/diff/20001/content/browser/web_con... File content/browser/web_contents/web_contents_impl.cc (right): https://codereview.chromium.org/1968933002/diff/20001/content/browser/web_con... content/browser/web_contents/web_contents_impl.cc:4576: rfhi->DontWaitForBeforeUnloadAck(); I wonder if we should expose and call RFHI::OnBeforeUnloadACK instead. There's other state that gets reset in there, and listeners that get notified. Plus it would call BeforeUnloadFired (via RFHM::OnBeforeUnloadACK). Seems like that keeps with the spirit of pretending the ACK arrived when the timer expires. clamy@, does that sound reasonable to you? I notice there are some timing UMA stats in there, as well as implications for PlzNavigate (NavigatorImpl::OnBeforeUnloadACK doesn't apparently call RFHM::OnBeforeUnloadACK, so I'm not sure it will automatically call BeforeUnloadFired). On the other hand, it's possible we could do this as cleanup after landing this simpler change. The simple change is nice and might be more easily merged to branch, though I do worry about adding more one-off changes that will make this harder to refactor in the long run.
On 2016/05/13 22:24:01, Charlie Reis (slow to reply) wrote: > sky@, can you see if you have any recommendations for how to write a non-flaky > test for this? I don't know the tests in chrome/browser/lifetime. > > One question about an alternate approach, but it may be bigger than we want to > take on for a fix we want to merge to branch. (clamy@, I'm interested in your > thoughts on how it affects PlzNavigate and FTN loading state.) > > https://codereview.chromium.org/1968933002/diff/20001/chrome/browser/lifetime... > File chrome/browser/lifetime/browser_close_manager_browsertest.cc (right): > > https://codereview.chromium.org/1968933002/diff/20001/chrome/browser/lifetime... > chrome/browser/lifetime/browser_close_manager_browsertest.cc:486: for (int i = > 0; i < 7; i++) { > Can you add a comment to this block saying why you're loading one instance of > beforeunload.html vs lots of beforeunload_slow.html? > > https://codereview.chromium.org/1968933002/diff/20001/chrome/browser/lifetime... > chrome/browser/lifetime/browser_close_manager_browsertest.cc:520: > TestBeforeUnloadMultipleSlowWindows) { > Is there a meaningful difference between slow windows and slow tabs? I would > imagine that's the same code, but I don't know the UnloadController. > > https://codereview.chromium.org/1968933002/diff/20001/content/browser/web_con... > File content/browser/web_contents/web_contents_impl.cc (right): > > https://codereview.chromium.org/1968933002/diff/20001/content/browser/web_con... > content/browser/web_contents/web_contents_impl.cc:4576: > rfhi->DontWaitForBeforeUnloadAck(); > I wonder if we should expose and call RFHI::OnBeforeUnloadACK instead. There's > other state that gets reset in there, and listeners that get notified. Plus it > would call BeforeUnloadFired (via RFHM::OnBeforeUnloadACK). Seems like that > keeps with the spirit of pretending the ACK arrived when the timer expires. > > clamy@, does that sound reasonable to you? I notice there are some timing UMA > stats in there, as well as implications for PlzNavigate > (NavigatorImpl::OnBeforeUnloadACK doesn't apparently call > RFHM::OnBeforeUnloadACK, so I'm not sure it will automatically call > BeforeUnloadFired). > > On the other hand, it's possible we could do this as cleanup after landing this > simpler change. The simple change is nice and might be more easily merged to > branch, though I do worry about adding more one-off changes that will make this > harder to refactor in the long run. I have to admit I'm not familiar with this code, or the impact it has on session restore. As to flakiness, can you elaborate on what is causing the flakiness? If it's timeouts, why can't you disable those for the test?
https://codereview.chromium.org/1968933002/diff/20001/chrome/browser/lifetime... File chrome/browser/lifetime/browser_close_manager_browsertest.cc (right): https://codereview.chromium.org/1968933002/diff/20001/chrome/browser/lifetime... chrome/browser/lifetime/browser_close_manager_browsertest.cc:520: TestBeforeUnloadMultipleSlowWindows) { On 2016/05/13 22:24:01, Charlie Reis (slow to reply) wrote: > Is there a meaningful difference between slow windows and slow tabs? I would > imagine that's the same code, but I don't know the UnloadController. There is a subtle difference on how unload controller iterates over windows vs tabs and this test fails in a different way with no fix (on (d)check that tab closed on another tabsstrip)
https://codereview.chromium.org/1968933002/diff/20001/content/browser/web_con... File content/browser/web_contents/web_contents_impl.cc (right): https://codereview.chromium.org/1968933002/diff/20001/content/browser/web_con... content/browser/web_contents/web_contents_impl.cc:4576: rfhi->DontWaitForBeforeUnloadAck(); On 2016/05/13 22:24:01, Charlie Reis (slow to reply) wrote: > I wonder if we should expose and call RFHI::OnBeforeUnloadACK instead. There's > other state that gets reset in there, and listeners that get notified. Plus it > would call BeforeUnloadFired (via RFHM::OnBeforeUnloadACK). Seems like that > keeps with the spirit of pretending the ACK arrived when the timer expires. > > clamy@, does that sound reasonable to you? I notice there are some timing UMA > stats in there, as well as implications for PlzNavigate > (NavigatorImpl::OnBeforeUnloadACK doesn't apparently call > RFHM::OnBeforeUnloadACK, so I'm not sure it will automatically call > BeforeUnloadFired). > > On the other hand, it's possible we could do this as cleanup after landing this > simpler change. The simple change is nice and might be more easily merged to > branch, though I do worry about adding more one-off changes that will make this > harder to refactor in the long run. I have inspected RenderFrameHostImpl::OnBeforeUnloadACK and it's seems to me that we indeed should expose this method as right now we will end up with wrong |in_flight_event_count_| as we don't decrement it in web_contents_impl.
https://codereview.chromium.org/1968933002/diff/20001/chrome/browser/lifetime... File chrome/browser/lifetime/browser_close_manager_browsertest.cc (right): https://codereview.chromium.org/1968933002/diff/20001/chrome/browser/lifetime... chrome/browser/lifetime/browser_close_manager_browsertest.cc:486: for (int i = 0; i < 7; i++) { On 2016/05/13 22:24:01, Charlie Reis (slow to reply) wrote: > Can you add a comment to this block saying why you're loading one instance of > beforeunload.html vs lots of beforeunload_slow.html? I have added comment and uploaded next patchset.
On 2016/05/13 23:38:03, sky wrote: > On 2016/05/13 22:24:01, Charlie Reis (slow to reply) wrote: > > sky@, can you see if you have any recommendations for how to write a non-flaky > > test for this? I don't know the tests in chrome/browser/lifetime. > > > > One question about an alternate approach, but it may be bigger than we want to > > take on for a fix we want to merge to branch. (clamy@, I'm interested in your > > thoughts on how it affects PlzNavigate and FTN loading state.) > > > > > https://codereview.chromium.org/1968933002/diff/20001/chrome/browser/lifetime... > > File chrome/browser/lifetime/browser_close_manager_browsertest.cc (right): > > > > > https://codereview.chromium.org/1968933002/diff/20001/chrome/browser/lifetime... > > chrome/browser/lifetime/browser_close_manager_browsertest.cc:486: for (int i = > > 0; i < 7; i++) { > > Can you add a comment to this block saying why you're loading one instance of > > beforeunload.html vs lots of beforeunload_slow.html? > > > > > https://codereview.chromium.org/1968933002/diff/20001/chrome/browser/lifetime... > > chrome/browser/lifetime/browser_close_manager_browsertest.cc:520: > > TestBeforeUnloadMultipleSlowWindows) { > > Is there a meaningful difference between slow windows and slow tabs? I would > > imagine that's the same code, but I don't know the UnloadController. > > > > > https://codereview.chromium.org/1968933002/diff/20001/content/browser/web_con... > > File content/browser/web_contents/web_contents_impl.cc (right): > > > > > https://codereview.chromium.org/1968933002/diff/20001/content/browser/web_con... > > content/browser/web_contents/web_contents_impl.cc:4576: > > rfhi->DontWaitForBeforeUnloadAck(); > > I wonder if we should expose and call RFHI::OnBeforeUnloadACK instead. > There's > > other state that gets reset in there, and listeners that get notified. Plus > it > > would call BeforeUnloadFired (via RFHM::OnBeforeUnloadACK). Seems like that > > keeps with the spirit of pretending the ACK arrived when the timer expires. > > > > clamy@, does that sound reasonable to you? I notice there are some timing UMA > > stats in there, as well as implications for PlzNavigate > > (NavigatorImpl::OnBeforeUnloadACK doesn't apparently call > > RFHM::OnBeforeUnloadACK, so I'm not sure it will automatically call > > BeforeUnloadFired). > > > > On the other hand, it's possible we could do this as cleanup after landing > this > > simpler change. The simple change is nice and might be more easily merged to > > branch, though I do worry about adding more one-off changes that will make > this > > harder to refactor in the long run. > > I have to admit I'm not familiar with this code, or the impact it has on session > restore. > > As to flakiness, can you elaborate on what is causing the flakiness? If it's > timeouts, why can't you disable those for the test? I have managed to make test fail on windows under load when 'beforeunload.html' didn't manage to respond within one second and all tabs were closed. If that is a case for original flakiness then increasing responsiveness timeout from within a test might stabilize it. Could you please take a look into http://crbug.com/276366 and http://crbug.com/517687 ? They might contain some interesting details but I don't have an access to.
On Sun, May 15, 2016 at 11:17 PM, <metaflow@yandex-team.ru> wrote: > On 2016/05/13 23:38:03, sky wrote: >> On 2016/05/13 22:24:01, Charlie Reis (slow to reply) wrote: >> > sky@, can you see if you have any recommendations for how to write a > non-flaky >> > test for this? I don't know the tests in chrome/browser/lifetime. >> > >> > One question about an alternate approach, but it may be bigger than we >> > want > to >> > take on for a fix we want to merge to branch. (clamy@, I'm interested in > your >> > thoughts on how it affects PlzNavigate and FTN loading state.) >> > >> > >> > https://codereview.chromium.org/1968933002/diff/20001/chrome/browser/lifetime... >> > File chrome/browser/lifetime/browser_close_manager_browsertest.cc >> > (right): >> > >> > >> > https://codereview.chromium.org/1968933002/diff/20001/chrome/browser/lifetime... >> > chrome/browser/lifetime/browser_close_manager_browsertest.cc:486: for >> > (int i > = >> > 0; i < 7; i++) { >> > Can you add a comment to this block saying why you're loading one >> > instance > of >> > beforeunload.html vs lots of beforeunload_slow.html? >> > >> > >> > https://codereview.chromium.org/1968933002/diff/20001/chrome/browser/lifetime... >> > chrome/browser/lifetime/browser_close_manager_browsertest.cc:520: >> > TestBeforeUnloadMultipleSlowWindows) { >> > Is there a meaningful difference between slow windows and slow tabs? I > would >> > imagine that's the same code, but I don't know the UnloadController. >> > >> > >> > https://codereview.chromium.org/1968933002/diff/20001/content/browser/web_con... >> > File content/browser/web_contents/web_contents_impl.cc (right): >> > >> > >> > https://codereview.chromium.org/1968933002/diff/20001/content/browser/web_con... >> > content/browser/web_contents/web_contents_impl.cc:4576: >> > rfhi->DontWaitForBeforeUnloadAck(); >> > I wonder if we should expose and call RFHI::OnBeforeUnloadACK instead. >> There's >> > other state that gets reset in there, and listeners that get notified. >> > Plus >> it >> > would call BeforeUnloadFired (via RFHM::OnBeforeUnloadACK). Seems like >> > that >> > keeps with the spirit of pretending the ACK arrived when the timer >> > expires. >> > >> > clamy@, does that sound reasonable to you? I notice there are some >> > timing > UMA >> > stats in there, as well as implications for PlzNavigate >> > (NavigatorImpl::OnBeforeUnloadACK doesn't apparently call >> > RFHM::OnBeforeUnloadACK, so I'm not sure it will automatically call >> > BeforeUnloadFired). >> > >> > On the other hand, it's possible we could do this as cleanup after >> > landing >> this >> > simpler change. The simple change is nice and might be more easily >> > merged > to >> > branch, though I do worry about adding more one-off changes that will >> > make >> this >> > harder to refactor in the long run. >> >> I have to admit I'm not familiar with this code, or the impact it has on > session >> restore. >> >> As to flakiness, can you elaborate on what is causing the flakiness? If >> it's >> timeouts, why can't you disable those for the test? > > I have managed to make test fail on windows under load when > 'beforeunload.html' > didn't manage to respond within one second and all tabs were closed. If that > is > a case for original flakiness then increasing responsiveness timeout from > within > a test might stabilize it. > Could you please take a look into http://crbug.com/276366 and > http://crbug.com/517687 ? They might contain some interesting details but I > don't have an access to. I've made 276366 no longer private. You should be able to view 517687 already (there isn't anything really interesting in it). -Scott > > https://codereview.chromium.org/1968933002/ -- You received this message because you are subscribed to the Google Groups "Chromium-reviews" group. To unsubscribe from this group and stop receiving emails from it, send an email to chromium-reviews+unsubscribe@chromium.org.
Ok, I've patched in the CL locally to better understand the flakiness, and it makes sense now. Any test that uses CancelClose() or AcceptClose() is flaky, since those functions wait for a dialog that might never appear if the 1 second timeout fires. I suggest that we proceed with landing this CL with the new tests disabled, then return to fix the flakiness. I've listed some ideas for how to fix it below, and I'll post them to one of the flaky test bugs. A few other cleanup comments, and otherwise I think this is ready to go. Thanks for your help tracking it down! https://codereview.chromium.org/1968933002/diff/40001/chrome/browser/lifetime... File chrome/browser/lifetime/browser_close_manager_browsertest.cc (right): https://codereview.chromium.org/1968933002/diff/40001/chrome/browser/lifetime... chrome/browser/lifetime/browser_close_manager_browsertest.cc:483: // Regression for chrbug.com/365052 caused some of tabs to be closed even if nit: No h in crbug.com. https://codereview.chromium.org/1968933002/diff/40001/chrome/browser/lifetime... chrome/browser/lifetime/browser_close_manager_browsertest.cc:491: // RenderViewHostImpl::kUnloadTimeoutMS . nit: No space before period. https://codereview.chromium.org/1968933002/diff/40001/chrome/browser/lifetime... chrome/browser/lifetime/browser_close_manager_browsertest.cc:506: ASSERT_NO_FATAL_FAILURE(CancelClose()); Yep, this is where the test will be flaky, as you mentioned. If the bot is overloaded and it takes more than 1 second to get the dialog from the responsive beforeunload.html tab, this CancelClose call will never see a dialog to cancel, and we'll quit (or rather, crash in AppModalDialogWaiter::Wait). You can repro this by setting RenderViewHostImpl::kUnloadTimeoutMS to 0. That exactly matches the flakiness in other tests in this file (e.g., the report in https://crbug.com/276366). Any test that uses CancelClose() or AcceptClose() will have this problem. I'm not really sure how to fix this. We need the timeout to fire to exercise the bug in this test, so we can't disable it entirely, but we also can't realistically guarantee that the "fast" page will respond before the timeout. One option might be gracefully handling the case that the dialog is never displayed, such that the test passes. That means that the test would always pass with the bug fixed, and it would usually fail but occasionally pass (on slow bots) if the bug regressed. I'd be ok with that outcome. Another option would be exposing some interface to the underlying timer on each tab, so that we can selectively tell them to time out or not. That seems like it would be pretty nice. That said, I don't think we want that to be a prerequisite for landing this CL, since it's a very valuable fix, and I've already verified locally that the test correctly exercises the bug and the fix. I'm inclined to say that we should land this CL with these new tests disabled, and follow up with a way to fix and enable them (perhaps with a timeout interface for testing). https://codereview.chromium.org/1968933002/diff/40001/chrome/browser/lifetime... chrome/browser/lifetime/browser_close_manager_browsertest.cc:509: // All tabs should still be open. nit: Add blank line before. https://codereview.chromium.org/1968933002/diff/40001/chrome/browser/lifetime... chrome/browser/lifetime/browser_close_manager_browsertest.cc:511: RepeatedNotificationObserver close_observer( nit: Add blank line and comment before (e.g., Quit again and close all tabs). https://codereview.chromium.org/1968933002/diff/40001/chrome/browser/lifetime... chrome/browser/lifetime/browser_close_manager_browsertest.cc:523: // Regression for chrbug.com/365052 caused CHECK in tabstrip. nit: No h in crbug.com. https://codereview.chromium.org/1968933002/diff/40001/chrome/browser/lifetime... chrome/browser/lifetime/browser_close_manager_browsertest.cc:551: // All windows should still be open. nit: Blank line before. https://codereview.chromium.org/1968933002/diff/40001/chrome/browser/lifetime... chrome/browser/lifetime/browser_close_manager_browsertest.cc:555: RepeatedNotificationObserver close_observer( nit: Comment before. https://codereview.chromium.org/1968933002/diff/40001/chrome/test/data/before... File chrome/test/data/beforeunload_slow.html (right): https://codereview.chromium.org/1968933002/diff/40001/chrome/test/data/before... chrome/test/data/beforeunload_slow.html:6: var ms = 1500; Please put a comment saying this is expected to be longer than RenderViewHostImpl::kUnloadTimeoutMS. nit: s/ms/delay/ https://codereview.chromium.org/1968933002/diff/40001/chrome/test/data/before... chrome/test/data/beforeunload_slow.html:7: ms += new Date().getTime(); nit: var finishTime = new Date().getTime() + delay; https://codereview.chromium.org/1968933002/diff/40001/content/browser/frame_h... File content/browser/frame_host/render_frame_host_impl.h (right): https://codereview.chromium.org/1968933002/diff/40001/content/browser/frame_h... content/browser/frame_host/render_frame_host_impl.h:404: // Simulate beforeunloadack on behalf of renderer if it's unrenresponsive. nit: Space before ack. https://codereview.chromium.org/1968933002/diff/40001/content/browser/frame_h... content/browser/frame_host/render_frame_host_impl.h:405: void ForceBeforeUnloadAck(); nit: s/Force/Simulate/ https://codereview.chromium.org/1968933002/diff/40001/content/browser/web_con... File content/browser/web_contents/web_contents_impl.cc (right): https://codereview.chromium.org/1968933002/diff/40001/content/browser/web_con... content/browser/web_contents/web_contents_impl.cc:4557: // If the hang is in the beforeunload handler, nit: Fix line wrap. https://codereview.chromium.org/1968933002/diff/40001/content/browser/web_con... content/browser/web_contents/web_contents_impl.cc:4566: // Hang occurred while firing the beforeunload/unload handler. nit: Drop "beforeunload/"
On 2016/05/19 22:11:56, Charlie Reis (slow to reply) wrote: > I suggest that we proceed with landing this CL with the new tests disabled, then > return to fix the flakiness. I've listed some ideas for how to fix it below, > and I'll post them to one of the flaky test bugs. I merged them all into https://crbug.com/519646 and posted the explanation there.
https://codereview.chromium.org/1968933002/diff/40001/chrome/browser/lifetime... File chrome/browser/lifetime/browser_close_manager_browsertest.cc (right): https://codereview.chromium.org/1968933002/diff/40001/chrome/browser/lifetime... chrome/browser/lifetime/browser_close_manager_browsertest.cc:506: ASSERT_NO_FATAL_FAILURE(CancelClose()); On 2016/05/19 22:11:55, Charlie Reis (slow to reply) wrote: > Yep, this is where the test will be flaky, as you mentioned. If the bot is > overloaded and it takes more than 1 second to get the dialog from the responsive > beforeunload.html tab, this CancelClose call will never see a dialog to cancel, > and we'll quit (or rather, crash in AppModalDialogWaiter::Wait). You can repro > this by setting RenderViewHostImpl::kUnloadTimeoutMS to 0. > > That exactly matches the flakiness in other tests in this file (e.g., the report > in https://crbug.com/276366). Any test that uses CancelClose() or AcceptClose() > will have this problem. > > I'm not really sure how to fix this. We need the timeout to fire to exercise > the bug in this test, so we can't disable it entirely, but we also can't > realistically guarantee that the "fast" page will respond before the timeout. That's right. I have experimented with different timeout values locally and it turns out that plain increase of timeout is not a clear cut. Under the same load, --gtest_repeat=100 for TestHangInBeforeUnloadMultipleTabs: kUnloadTimeoutMS = 100 => 15% failed kUnloadTimeoutMS = 1000 => 10% failed kUnloadTimeoutMS = 2000 => 9% failed kUnloadTimeoutMS = 3000 => 12% failed (lager timeout increases a chance that test will not be able to finish within 30 seconds). It is interesting that test with multiple windows doesn't seem to flack as much as one with single window and multiple tabs. One of possible explanations is that process composition and os scheduling differs. I am going to disable tests on the same platforms for now and implement second approach you've described below. Is it OK to upload "flak-less" patchset to this CL or better to create new one? Do you have means to bulk run particular test with load similar to normal on bots to check flakiness before / after fix? > One option might be gracefully handling the case that the dialog is never > displayed, such that the test passes. That means that the test would always > pass with the bug fixed, and it would usually fail but occasionally pass (on > slow bots) if the bug regressed. I'd be ok with that outcome. > > Another option would be exposing some interface to the underlying timer on each > tab, so that we can selectively tell them to time out or not. That seems like > it would be pretty nice. I like that > That said, I don't think we want that to be a prerequisite for landing this CL, > since it's a very valuable fix, and I've already verified locally that the test > correctly exercises the bug and the fix. I'm inclined to say that we should > land this CL with these new tests disabled, and follow up with a way to fix and > enable them (perhaps with a timeout interface for testing).
On 2016/05/20 04:20:01, Mikhail Goncharov wrote: > https://codereview.chromium.org/1968933002/diff/40001/chrome/browser/lifetime... > File chrome/browser/lifetime/browser_close_manager_browsertest.cc (right): > > https://codereview.chromium.org/1968933002/diff/40001/chrome/browser/lifetime... > chrome/browser/lifetime/browser_close_manager_browsertest.cc:506: > ASSERT_NO_FATAL_FAILURE(CancelClose()); > On 2016/05/19 22:11:55, Charlie Reis (slow to reply) wrote: > > Yep, this is where the test will be flaky, as you mentioned. If the bot is > > overloaded and it takes more than 1 second to get the dialog from the > responsive > > beforeunload.html tab, this CancelClose call will never see a dialog to > cancel, > > and we'll quit (or rather, crash in AppModalDialogWaiter::Wait). You can > repro > > this by setting RenderViewHostImpl::kUnloadTimeoutMS to 0. > > > > That exactly matches the flakiness in other tests in this file (e.g., the > report > > in https://crbug.com/276366). Any test that uses CancelClose() or > AcceptClose() > > will have this problem. > > > > I'm not really sure how to fix this. We need the timeout to fire to exercise > > the bug in this test, so we can't disable it entirely, but we also can't > > realistically guarantee that the "fast" page will respond before the timeout. > > That's right. I have experimented with different timeout values locally and it > turns out that plain increase of timeout is not a clear cut. > Under the same load, --gtest_repeat=100 for TestHangInBeforeUnloadMultipleTabs: > kUnloadTimeoutMS = 100 => 15% failed > kUnloadTimeoutMS = 1000 => 10% failed > kUnloadTimeoutMS = 2000 => 9% failed > kUnloadTimeoutMS = 3000 => 12% failed > (lager timeout increases a chance that test will not be able to finish within 30 > seconds). Yeah, we'd prefer not to have even a low probability of flaking, since that will cause work for the tree sheriffs and lead to the test being disabled eventually. > > It is interesting that test with multiple windows doesn't seem to flack as much > as one with single window and multiple tabs. One of possible explanations is > that process composition and os scheduling differs. > > I am going to disable tests on the same platforms for now and implement second > approach you've described below. Let's just land it entirely disabled, since we know it can fail on any platform. I've verified that the test does what it should, and we'll separately fix it to work on all platforms without flakes. > Is it OK to upload "flak-less" patchset to this CL or better to create new one? Let's keep this CL as the fix plus the disabled test. In a followup CL, we can fix the test framework and enable the test. > Do you have means to bulk run particular test with load similar to normal on > bots to check flakiness before / after fix? Not sure. "git try -h" will show you the options it supports (or see http://www.chromium.org/developers/testing/try-server-usage). > > One option might be gracefully handling the case that the dialog is never > > displayed, such that the test passes. That means that the test would always > > pass with the bug fixed, and it would usually fail but occasionally pass (on > > slow bots) if the bug regressed. I'd be ok with that outcome. > > > > Another option would be exposing some interface to the underlying timer on > each > > tab, so that we can selectively tell them to time out or not. That seems like > > it would be pretty nice. > > I like that Great. Let's pursue that once this fix lands. Thanks! > > > > That said, I don't think we want that to be a prerequisite for landing this > CL, > > since it's a very valuable fix, and I've already verified locally that the > test > > correctly exercises the bug and the fix. I'm inclined to say that we should > > land this CL with these new tests disabled, and follow up with a way to fix > and > > enable them (perhaps with a timeout interface for testing).
The CQ bit was checked by metaflow@yandex-team.ru to run a CQ dry run
The CQ bit was unchecked by metaflow@yandex-team.ru
The CQ bit was checked by metaflow@yandex-team.ru
The CQ bit was unchecked by metaflow@yandex-team.ru
The CQ bit was checked by metaflow@yandex-team.ru
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1968933002/80001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1968933002/80001
The CQ bit was unchecked by commit-bot@chromium.org
No L-G-T-M from a valid reviewer yet. CQ run can only be started by full committers or once the patch has received an L-G-T-M from a full committer. Even if an L-G-T-M may have been provided, it was from a non-committer, _not_ a full super star committer. See http://www.chromium.org/getting-involved/become-a-committer Note that this has nothing to do with OWNERS files.
LGTM if the try bots are happy. sky@, can you review chrome/ for OWNERS? We're planning to enable the tests as part of fixing https://crbug.com/519646. I've verified locally that they fail without the fix and pass with it (apart from the flakiness).
The CQ bit was checked by creis@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1968933002/80001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1968933002/80001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: linux_chromium_chromeos_ozone_rel_ng on tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_...)
Mikhail, can you take look at the failure in the RenderFrameHostManagerTest.CloseWithPendingWhileUnresponsive unit test?
LGTM
On 2016/05/20 16:50:18, Charlie Reis (slow to reply) wrote: > Mikhail, can you take look at the failure in the > RenderFrameHostManagerTest.CloseWithPendingWhileUnresponsive unit test? Sure I will.
On 2016/05/21 08:34:37, Mikhail Goncharov wrote: > On 2016/05/20 16:50:18, Charlie Reis (slow to reply) wrote: > > Mikhail, can you take look at the failure in the > > RenderFrameHostManagerTest.CloseWithPendingWhileUnresponsive unit test? > > Sure I will. I have fixed issue found by CloseWithPendingWhileUnresponsive test - we might have both is_waiting_for_beforeunload_ack and IsWaitingForUnloadACK set when user tries to navigate within tab to be closed.
The CQ bit was checked by metaflow@yandex-team.ru to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1968933002/100001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1968933002/100001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
> Let's keep this CL as the fix plus the disabled test. In a followup CL, we can > fix the test framework and enable the test. Here it goes https://codereview.chromium.org/2005663002
Thanks for the fix! LGTM.
The CQ bit was checked by creis@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from sky@chromium.org Link to the patchset: https://codereview.chromium.org/1968933002/#ps100001 (title: "Correct order of checks")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1968933002/100001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1968933002/100001
Message was sent while issue was closed.
Description was changed from
==========
Fix: tabs slow to response to beforeunload are closed prematurely.
Issue mitigates as browser loses some tabs after relaunch.
What happens underneath in time order:
- browser initiates shutdown and asks unload controller to fire beforeunload
events
- unload controller fills up queue of tabs to notify and fires beforeunload to
tab #1
- frame host proxies event to tab #1 and starts one second timer
- tab #1 receives beforeunload and starts processing
- timer expires and web_content_impl receives RendererUnresponsive. It fires
onbeforeunloadACK on behalf of tab #1
- unload controller receives ACK; finds tab #1 in beforeunload queue, removes it
from
that queue and put to the queue "unload", responses with 'false' (aka
"don't unload") as we want to fire unload events when every tab accepted
unload. Now unload controller fires beforeunload to tab #2
- frame host proxies event to tab #2 and starts one second timer
- tab #1 has finished and fires beforeunload ACK (it was slow to respond because
of
system load or trip to server or w/e)
- unload controller receives ACK; is unable to find tab #1 in queue (tab was
already removed), assumes that tab #1 was created after queue was formed and
responds with 'true' ("proceed unload")
- tab #1 unloads and notifies session service (session service is still alive
atm)
- session service records that tab #1 was closed. That's it. Now chrome will not
restore this tab. Same thing may happen to some other tabs. Missed tabs will
be scattered randomly across tabstrips depending on OS schedule and content.
Intend of this CL is to prevent such scenarios. Responsiveness of browser should
not be changed - even if tab is really unresponsive it will be closed as before,
after unload controller made sure that every other tab is acknowledged unload.
New tests might be flacky like TestHangInBeforeUnloadMultipleTabs
because under heavy load tab might fail to response within one second and
treated as if it has no beforeunload handler.
BUG=365052
R=clamy@chromium.org,sky@chromium.org
CQ_INCLUDE_TRYBOTS=tryserver.chromium.linux:linux_site_isolation
==========
to
==========
Fix: tabs slow to response to beforeunload are closed prematurely.
Issue mitigates as browser loses some tabs after relaunch.
What happens underneath in time order:
- browser initiates shutdown and asks unload controller to fire beforeunload
events
- unload controller fills up queue of tabs to notify and fires beforeunload to
tab #1
- frame host proxies event to tab #1 and starts one second timer
- tab #1 receives beforeunload and starts processing
- timer expires and web_content_impl receives RendererUnresponsive. It fires
onbeforeunloadACK on behalf of tab #1
- unload controller receives ACK; finds tab #1 in beforeunload queue, removes it
from
that queue and put to the queue "unload", responses with 'false' (aka
"don't unload") as we want to fire unload events when every tab accepted
unload. Now unload controller fires beforeunload to tab #2
- frame host proxies event to tab #2 and starts one second timer
- tab #1 has finished and fires beforeunload ACK (it was slow to respond because
of
system load or trip to server or w/e)
- unload controller receives ACK; is unable to find tab #1 in queue (tab was
already removed), assumes that tab #1 was created after queue was formed and
responds with 'true' ("proceed unload")
- tab #1 unloads and notifies session service (session service is still alive
atm)
- session service records that tab #1 was closed. That's it. Now chrome will not
restore this tab. Same thing may happen to some other tabs. Missed tabs will
be scattered randomly across tabstrips depending on OS schedule and content.
Intend of this CL is to prevent such scenarios. Responsiveness of browser should
not be changed - even if tab is really unresponsive it will be closed as before,
after unload controller made sure that every other tab is acknowledged unload.
New tests might be flacky like TestHangInBeforeUnloadMultipleTabs
because under heavy load tab might fail to response within one second and
treated as if it has no beforeunload handler.
BUG=365052
R=clamy@chromium.org,sky@chromium.org
CQ_INCLUDE_TRYBOTS=tryserver.chromium.linux:linux_site_isolation
==========
Message was sent while issue was closed.
Committed patchset #6 (id:100001)
Message was sent while issue was closed.
Description was changed from
==========
Fix: tabs slow to response to beforeunload are closed prematurely.
Issue mitigates as browser loses some tabs after relaunch.
What happens underneath in time order:
- browser initiates shutdown and asks unload controller to fire beforeunload
events
- unload controller fills up queue of tabs to notify and fires beforeunload to
tab #1
- frame host proxies event to tab #1 and starts one second timer
- tab #1 receives beforeunload and starts processing
- timer expires and web_content_impl receives RendererUnresponsive. It fires
onbeforeunloadACK on behalf of tab #1
- unload controller receives ACK; finds tab #1 in beforeunload queue, removes it
from
that queue and put to the queue "unload", responses with 'false' (aka
"don't unload") as we want to fire unload events when every tab accepted
unload. Now unload controller fires beforeunload to tab #2
- frame host proxies event to tab #2 and starts one second timer
- tab #1 has finished and fires beforeunload ACK (it was slow to respond because
of
system load or trip to server or w/e)
- unload controller receives ACK; is unable to find tab #1 in queue (tab was
already removed), assumes that tab #1 was created after queue was formed and
responds with 'true' ("proceed unload")
- tab #1 unloads and notifies session service (session service is still alive
atm)
- session service records that tab #1 was closed. That's it. Now chrome will not
restore this tab. Same thing may happen to some other tabs. Missed tabs will
be scattered randomly across tabstrips depending on OS schedule and content.
Intend of this CL is to prevent such scenarios. Responsiveness of browser should
not be changed - even if tab is really unresponsive it will be closed as before,
after unload controller made sure that every other tab is acknowledged unload.
New tests might be flacky like TestHangInBeforeUnloadMultipleTabs
because under heavy load tab might fail to response within one second and
treated as if it has no beforeunload handler.
BUG=365052
R=clamy@chromium.org,sky@chromium.org
CQ_INCLUDE_TRYBOTS=tryserver.chromium.linux:linux_site_isolation
==========
to
==========
Fix: tabs slow to response to beforeunload are closed prematurely.
Issue mitigates as browser loses some tabs after relaunch.
What happens underneath in time order:
- browser initiates shutdown and asks unload controller to fire beforeunload
events
- unload controller fills up queue of tabs to notify and fires beforeunload to
tab #1
- frame host proxies event to tab #1 and starts one second timer
- tab #1 receives beforeunload and starts processing
- timer expires and web_content_impl receives RendererUnresponsive. It fires
onbeforeunloadACK on behalf of tab #1
- unload controller receives ACK; finds tab #1 in beforeunload queue, removes it
from
that queue and put to the queue "unload", responses with 'false' (aka
"don't unload") as we want to fire unload events when every tab accepted
unload. Now unload controller fires beforeunload to tab #2
- frame host proxies event to tab #2 and starts one second timer
- tab #1 has finished and fires beforeunload ACK (it was slow to respond because
of
system load or trip to server or w/e)
- unload controller receives ACK; is unable to find tab #1 in queue (tab was
already removed), assumes that tab #1 was created after queue was formed and
responds with 'true' ("proceed unload")
- tab #1 unloads and notifies session service (session service is still alive
atm)
- session service records that tab #1 was closed. That's it. Now chrome will not
restore this tab. Same thing may happen to some other tabs. Missed tabs will
be scattered randomly across tabstrips depending on OS schedule and content.
Intend of this CL is to prevent such scenarios. Responsiveness of browser should
not be changed - even if tab is really unresponsive it will be closed as before,
after unload controller made sure that every other tab is acknowledged unload.
New tests might be flacky like TestHangInBeforeUnloadMultipleTabs
because under heavy load tab might fail to response within one second and
treated as if it has no beforeunload handler.
BUG=365052
R=clamy@chromium.org,sky@chromium.org
CQ_INCLUDE_TRYBOTS=tryserver.chromium.linux:linux_site_isolation
Committed: https://crrev.com/73fc9555a1ecb23f1076b51e4602dfee5419e2f9
Cr-Commit-Position: refs/heads/master@{#395368}
==========
Message was sent while issue was closed.
Patchset 6 (id:??) landed as https://crrev.com/73fc9555a1ecb23f1076b51e4602dfee5419e2f9 Cr-Commit-Position: refs/heads/master@{#395368} |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
