Issue 1968933002: Fix: tabs slow to response to beforeunload are closed prematurely.

Mikhail Goncharov

4 years, 7 months ago (2016-05-11 10:42:54 UTC) #1

Mikhail Goncharov

Description was changed from ========== Fix: tabs slow to response to beforeunload are closed prematurely. ...

4 years, 7 months ago (2016-05-11 10:42:56 UTC) #2

Description was changed from

==========
Fix: tabs slow to response to beforeunload are closed prematurely.

Issue mitigates as browser loses some tabs after relaunch.

Scenario is following: unload controller fires 'beforeunload' event every tab
sequentially and waits for answer. If tab failed to respond within one second
then web_contents_impl sends ack response on behalf of tab and unload
controller continues with the next. But if tab was slow to respond because of
overall load to machine and fired beforeunload ACK after been declared
unresponsive then unload controller is surprised by that response and tells tab
to close immediately.

This fix changes unload controller and web content behavior to do not force
close unresponding tabs when browser is going to shutdown.

Additionally fix addresses an issue of tabs that failed to respond to
beforeunload quickly were marked as "don't ask again". Thus if user cancels
shutdown and then clicked 'close' once again she might miss unload dialog from
tab that is responding now.

New tests might be flacky like MAYBE_TestHangInBeforeUnloadMultipleTabs
because under heavy load tab might fail to response within one second and
treated as if it has no beforeunload handler.

BUG=365052
R=clamy@chromium.org
==========

to

==========
Fix: tabs slow to response to beforeunload are closed prematurely.

Issue mitigates as browser loses some tabs after relaunch.

Scenario is following: unload controller fires 'beforeunload' event every tab
sequentially and waits for answer. If tab failed to respond within one second
then web_contents_impl sends ack response on behalf of tab and unload
controller continues with the next. But if tab was slow to respond because of
overall load to machine and fired beforeunload ACK after been declared
unresponsive then unload controller is surprised by that response and tells tab
to close immediately.

This fix changes unload controller and web content behavior to do not force
close unresponding tabs when browser is going to shutdown.

Additionally fix addresses an issue of tabs that failed to respond to
beforeunload quickly were marked as "don't ask again". Thus if user cancels
shutdown and then clicked 'close' once again she might miss unload dialog from
tab that is responding now.

New tests might be flacky like MAYBE_TestHangInBeforeUnloadMultipleTabs
because under heavy load tab might fail to response within one second and
treated as if it has no beforeunload handler.

BUG=365052
R=clamy@chromium.org
CQ_INCLUDE_TRYBOTS=tryserver.chromium.linux:linux_site_isolation
==========

Charlie Reis

creis@chromium.org changed reviewers: + creis@chromium.org

4 years, 7 months ago (2016-05-11 17:10:16 UTC) #3

Charlie Reis

We should not do this. Wouldn't it mean that you couldn't close a tab if ...

4 years, 7 months ago (2016-05-11 17:10:17 UTC) #4

Mikhail Goncharov

On 2016/05/11 17:10:17, Charlie Reis wrote: > We should not do this. Wouldn't it mean ...

4 years, 7 months ago (2016-05-12 06:09:01 UTC) #5

On 2016/05/11 17:10:17, Charlie Reis wrote:
> We should not do this.  Wouldn't it mean that you couldn't close a tab if the
> renderer process was slow to respond, or in an infinite loop?
> We intentionally
> have a timer to keep Chrome responsive to the user's request in these cases.

No, user id still able to close the tab as all measures are still in place.
Closing of single tab isn't affected at all - only browser closing process.

> > Issue mitigates as browser loses some tabs after relaunch.
> 
> I'm curious how this CL is related to that bug?  The tabs closed when
> beforeunload is bypassed do get restored in the next session, which I've just
> verified on Canary.

You can easily reproduce bug this way:
- download and open "beforeunload_slow.html" in 4 tabs
- select "quit chrome" in menu
(note that some tabs are closed during shutdown exactly like users describe
in bug comments)
- open chrome: only two tabs restored while we expect to see all four.

What happens underneath in time order:

- browser initiates shutdown and asks unload controller to fire beforeunload
events
- unload controller fills up queue of tabs to notify and fires beforeunload to
tab #1
- frame host proxies event to tab #1 and starts one second timer
- tab #1 receives beforeunload and starts processing
- timer expires and web_content_impl receives RendererUnresponsive. It fires
onbeforeunloadACK on behalf of tab #1
- unload controller receives ACK; finds tab #1 in beforeunload queue, removes it
from
that queue and put to the queue "unload", responses with 'false' (aka
"don't unload") as we want to fire unload events when every tab accepted
unload. Now unload controller fires beforeunload to tab #2
- frame host proxies event to tab #2 and starts one second timer
- tab #1 has finished and fires beforeunload ACK (it was slow to respond because
of
system load or trip to server or w/e)
- unload controller receives ACK; is unable to find tab #1 in queue (tab was
already removed), assumes that tab #1 was created after queue was formed and
responds with 'true' ("proceed unload")
- tab #1 unloads and notifies session service (session service is still alive
atm)
- session service records that tab #1 was closed. That's it. Now chrome will not
restore this tab. Same thing may happen to some other tabs. Missed tabs will
be scattered randomly across tabstrips depending on OS schedule and content.

Intend of this CL is to prevent such scenarios. Responsiveness of browser should
not be changed - even if tab is really unresponsive it will be closed as before,
after unload controller made sure that every other tab is acknowledged unload.

> That's a serious bug, and any leads to help diagnose it would be great.  I
just
> don't see how this would help.

clamy

On 2016/05/12 06:09:01, metaflow wrote: > On 2016/05/11 17:10:17, Charlie Reis wrote: > > We ...

4 years, 7 months ago (2016-05-12 06:26:11 UTC) #6

On 2016/05/12 06:09:01, metaflow wrote:
> On 2016/05/11 17:10:17, Charlie Reis wrote:
> > We should not do this.  Wouldn't it mean that you couldn't close a tab if
the
> > renderer process was slow to respond, or in an infinite loop?
> > We intentionally
> > have a timer to keep Chrome responsive to the user's request in these cases.
> 
> No, user id still able to close the tab as all measures are still in place.
> Closing of single tab isn't affected at all - only browser closing process.
> 
> > > Issue mitigates as browser loses some tabs after relaunch.
> > 
> > I'm curious how this CL is related to that bug?  The tabs closed when
> > beforeunload is bypassed do get restored in the next session, which I've
just
> > verified on Canary.
> 
> You can easily reproduce bug this way:
> - download and open "beforeunload_slow.html" in 4 tabs
> - select "quit chrome" in menu
> (note that some tabs are closed during shutdown exactly like users describe
> in bug comments)
> - open chrome: only two tabs restored while we expect to see all four.
> 
> What happens underneath in time order:
> 
> - browser initiates shutdown and asks unload controller to fire beforeunload
> events
> - unload controller fills up queue of tabs to notify and fires beforeunload to
> tab #1
> - frame host proxies event to tab #1 and starts one second timer
> - tab #1 receives beforeunload and starts processing
> - timer expires and web_content_impl receives RendererUnresponsive. It fires
> onbeforeunloadACK on behalf of tab #1
> - unload controller receives ACK; finds tab #1 in beforeunload queue, removes
it
> from
> that queue and put to the queue "unload", responses with 'false' (aka
> "don't unload") as we want to fire unload events when every tab accepted
> unload. Now unload controller fires beforeunload to tab #2
> - frame host proxies event to tab #2 and starts one second timer
> - tab #1 has finished and fires beforeunload ACK (it was slow to respond
because
> of
> system load or trip to server or w/e)
> - unload controller receives ACK; is unable to find tab #1 in queue (tab was
> already removed), assumes that tab #1 was created after queue was formed and
> responds with 'true' ("proceed unload")
> - tab #1 unloads and notifies session service (session service is still alive
> atm)
> - session service records that tab #1 was closed. That's it. Now chrome will
not
> restore this tab. Same thing may happen to some other tabs. Missed tabs will
> be scattered randomly across tabstrips depending on OS schedule and content.
> 
> Intend of this CL is to prevent such scenarios. Responsiveness of browser
should
> not be changed - even if tab is really unresponsive it will be closed as
before,
> after unload controller made sure that every other tab is acknowledged unload.
> 
> > That's a serious bug, and any leads to help diagnose it would be great.  I
> just
> > don't see how this would help.

To me it seems that the issue is more with the unload controller than with
WebContents. It could just check whether the tab is in the queue "unload"
instead of assuming right away that it was created after the queue
"beforeunload" started. If it finds it in the queue unload, then it won't
respond with true (aka unload) but false. It seems like a much simpler fix than
what you're trying to do in that CL.

Mikhail Goncharov

On 2016/05/12 06:26:11, clamy wrote: > On 2016/05/12 06:09:01, metaflow wrote: > > On 2016/05/11 ...

4 years, 7 months ago (2016-05-12 06:52:11 UTC) #7

On 2016/05/12 06:26:11, clamy wrote:
> On 2016/05/12 06:09:01, metaflow wrote:
> > On 2016/05/11 17:10:17, Charlie Reis wrote:
> > > We should not do this.  Wouldn't it mean that you couldn't close a tab if
> the
> > > renderer process was slow to respond, or in an infinite loop?
> > > We intentionally
> > > have a timer to keep Chrome responsive to the user's request in these
cases.
> > 
> > No, user id still able to close the tab as all measures are still in place.
> > Closing of single tab isn't affected at all - only browser closing process.
> > 
> > > > Issue mitigates as browser loses some tabs after relaunch.
> > > 
> > > I'm curious how this CL is related to that bug?  The tabs closed when
> > > beforeunload is bypassed do get restored in the next session, which I've
> just
> > > verified on Canary.
> > 
> > You can easily reproduce bug this way:
> > - download and open "beforeunload_slow.html" in 4 tabs
> > - select "quit chrome" in menu
> > (note that some tabs are closed during shutdown exactly like users describe
> > in bug comments)
> > - open chrome: only two tabs restored while we expect to see all four.
> > 
> > What happens underneath in time order:
> > 
> > - browser initiates shutdown and asks unload controller to fire beforeunload
> > events
> > - unload controller fills up queue of tabs to notify and fires beforeunload
to
> > tab #1
> > - frame host proxies event to tab #1 and starts one second timer
> > - tab #1 receives beforeunload and starts processing
> > - timer expires and web_content_impl receives RendererUnresponsive. It fires
> > onbeforeunloadACK on behalf of tab #1
> > - unload controller receives ACK; finds tab #1 in beforeunload queue,
removes
> it
> > from
> > that queue and put to the queue "unload", responses with 'false' (aka
> > "don't unload") as we want to fire unload events when every tab accepted
> > unload. Now unload controller fires beforeunload to tab #2
> > - frame host proxies event to tab #2 and starts one second timer
> > - tab #1 has finished and fires beforeunload ACK (it was slow to respond
> because
> > of
> > system load or trip to server or w/e)
> > - unload controller receives ACK; is unable to find tab #1 in queue (tab was
> > already removed), assumes that tab #1 was created after queue was formed and
> > responds with 'true' ("proceed unload")
> > - tab #1 unloads and notifies session service (session service is still
alive
> > atm)
> > - session service records that tab #1 was closed. That's it. Now chrome will
> not
> > restore this tab. Same thing may happen to some other tabs. Missed tabs will
> > be scattered randomly across tabstrips depending on OS schedule and content.
> > 
> > Intend of this CL is to prevent such scenarios. Responsiveness of browser
> should
> > not be changed - even if tab is really unresponsive it will be closed as
> before,
> > after unload controller made sure that every other tab is acknowledged
unload.
> > 
> > > That's a serious bug, and any leads to help diagnose it would be great.  I
> > just
> > > don't see how this would help.
> 
> To me it seems that the issue is more with the unload controller than with
> WebContents. It could just check whether the tab is in the queue "unload"
> instead of assuming right away that it was created after the queue
> "beforeunload" started. If it finds it in the queue unload, then it won't
> respond with true (aka unload) but false. It seems like a much simpler fix
than
> what you're trying to do in that CL.

Yes I agree and I am going to remove additional method in web_contents as it
seems really unnecessary.

What about modifications to handle following scenario?

- open two tabs with beforeunload handlers (some doc editing)
- edit data in tab #2 and select "close browser"
- for some reason tab #1 was unable to respond in time to beforeunload but still
alive
- user answers "keep browser open" to dialog from tab #2
- opens tab #1 do some edits and clicks button to close this single tab
expected behavior: tab #1 asks user if she wants to close.
actual behavior: tab will close promptly (as it was marked "allow sudden
termination" in RendererUnresponsive)

I suggest to keep this fix even though I didn't manage to find corresponding
bug. There is a test to reproduce that.

clamy

On 2016/05/12 06:52:11, metaflow wrote: > On 2016/05/12 06:26:11, clamy wrote: > > On 2016/05/12 ...

4 years, 7 months ago (2016-05-12 07:24:31 UTC) #8

On 2016/05/12 06:52:11, metaflow wrote:
> On 2016/05/12 06:26:11, clamy wrote:
> > On 2016/05/12 06:09:01, metaflow wrote:
> > > On 2016/05/11 17:10:17, Charlie Reis wrote:
> > > > We should not do this.  Wouldn't it mean that you couldn't close a tab
if
> > the
> > > > renderer process was slow to respond, or in an infinite loop?
> > > > We intentionally
> > > > have a timer to keep Chrome responsive to the user's request in these
> cases.
> > > 
> > > No, user id still able to close the tab as all measures are still in
place.
> > > Closing of single tab isn't affected at all - only browser closing
process.
> > > 
> > > > > Issue mitigates as browser loses some tabs after relaunch.
> > > > 
> > > > I'm curious how this CL is related to that bug?  The tabs closed when
> > > > beforeunload is bypassed do get restored in the next session, which I've
> > just
> > > > verified on Canary.
> > > 
> > > You can easily reproduce bug this way:
> > > - download and open "beforeunload_slow.html" in 4 tabs
> > > - select "quit chrome" in menu
> > > (note that some tabs are closed during shutdown exactly like users
describe
> > > in bug comments)
> > > - open chrome: only two tabs restored while we expect to see all four.
> > > 
> > > What happens underneath in time order:
> > > 
> > > - browser initiates shutdown and asks unload controller to fire
beforeunload
> > > events
> > > - unload controller fills up queue of tabs to notify and fires
beforeunload
> to
> > > tab #1
> > > - frame host proxies event to tab #1 and starts one second timer
> > > - tab #1 receives beforeunload and starts processing
> > > - timer expires and web_content_impl receives RendererUnresponsive. It
fires
> > > onbeforeunloadACK on behalf of tab #1
> > > - unload controller receives ACK; finds tab #1 in beforeunload queue,
> removes
> > it
> > > from
> > > that queue and put to the queue "unload", responses with 'false' (aka
> > > "don't unload") as we want to fire unload events when every tab accepted
> > > unload. Now unload controller fires beforeunload to tab #2
> > > - frame host proxies event to tab #2 and starts one second timer
> > > - tab #1 has finished and fires beforeunload ACK (it was slow to respond
> > because
> > > of
> > > system load or trip to server or w/e)
> > > - unload controller receives ACK; is unable to find tab #1 in queue (tab
was
> > > already removed), assumes that tab #1 was created after queue was formed
and
> > > responds with 'true' ("proceed unload")
> > > - tab #1 unloads and notifies session service (session service is still
> alive
> > > atm)
> > > - session service records that tab #1 was closed. That's it. Now chrome
will
> > not
> > > restore this tab. Same thing may happen to some other tabs. Missed tabs
will
> > > be scattered randomly across tabstrips depending on OS schedule and
content.
> > > 
> > > Intend of this CL is to prevent such scenarios. Responsiveness of browser
> > should
> > > not be changed - even if tab is really unresponsive it will be closed as
> > before,
> > > after unload controller made sure that every other tab is acknowledged
> unload.
> > > 
> > > > That's a serious bug, and any leads to help diagnose it would be great. 
I
> > > just
> > > > don't see how this would help.
> > 
> > To me it seems that the issue is more with the unload controller than with
> > WebContents. It could just check whether the tab is in the queue "unload"
> > instead of assuming right away that it was created after the queue
> > "beforeunload" started. If it finds it in the queue unload, then it won't
> > respond with true (aka unload) but false. It seems like a much simpler fix
> than
> > what you're trying to do in that CL.
> 
> Yes I agree and I am going to remove additional method in web_contents as it
> seems really unnecessary.
> 
> What about modifications to handle following scenario?
> 
> - open two tabs with beforeunload handlers (some doc editing)
> - edit data in tab #2 and select "close browser"
> - for some reason tab #1 was unable to respond in time to beforeunload but
still
> alive
> - user answers "keep browser open" to dialog from tab #2
> - opens tab #1 do some edits and clicks button to close this single tab
> expected behavior: tab #1 asks user if she wants to close.
> actual behavior: tab will close promptly (as it was marked "allow sudden
> termination" in RendererUnresponsive)

Just set allow sudden termination to false in
WebContentsImpl::RendererResponsive. It's only set to true when the renderer is
unresponsive, so it seems logical to set it to false when the renderer becomes
responsive again.

> 
> I suggest to keep this fix even though I didn't manage to find corresponding
> bug. There is a test to reproduce that.

Mikhail Goncharov

On 2016/05/12 07:24:31, clamy wrote: > On 2016/05/12 06:52:11, metaflow wrote: > > On 2016/05/12 ...

4 years, 7 months ago (2016-05-12 08:10:32 UTC) #9

On 2016/05/12 07:24:31, clamy wrote:
> On 2016/05/12 06:52:11, metaflow wrote:
> > On 2016/05/12 06:26:11, clamy wrote:
> > > On 2016/05/12 06:09:01, metaflow wrote:
> > > > On 2016/05/11 17:10:17, Charlie Reis wrote:
> > > > > We should not do this.  Wouldn't it mean that you couldn't close a tab
> if
> > > the
> > > > > renderer process was slow to respond, or in an infinite loop?
> > > > > We intentionally
> > > > > have a timer to keep Chrome responsive to the user's request in these
> > cases.
> > > > 
> > > > No, user id still able to close the tab as all measures are still in
> place.
> > > > Closing of single tab isn't affected at all - only browser closing
> process.
> > > > 
> > > > > > Issue mitigates as browser loses some tabs after relaunch.
> > > > > 
> > > > > I'm curious how this CL is related to that bug?  The tabs closed when
> > > > > beforeunload is bypassed do get restored in the next session, which
I've
> > > just
> > > > > verified on Canary.
> > > > 
> > > > You can easily reproduce bug this way:
> > > > - download and open "beforeunload_slow.html" in 4 tabs
> > > > - select "quit chrome" in menu
> > > > (note that some tabs are closed during shutdown exactly like users
> describe
> > > > in bug comments)
> > > > - open chrome: only two tabs restored while we expect to see all four.
> > > > 
> > > > What happens underneath in time order:
> > > > 
> > > > - browser initiates shutdown and asks unload controller to fire
> beforeunload
> > > > events
> > > > - unload controller fills up queue of tabs to notify and fires
> beforeunload
> > to
> > > > tab #1
> > > > - frame host proxies event to tab #1 and starts one second timer
> > > > - tab #1 receives beforeunload and starts processing
> > > > - timer expires and web_content_impl receives RendererUnresponsive. It
> fires
> > > > onbeforeunloadACK on behalf of tab #1
> > > > - unload controller receives ACK; finds tab #1 in beforeunload queue,
> > removes
> > > it
> > > > from
> > > > that queue and put to the queue "unload", responses with 'false' (aka
> > > > "don't unload") as we want to fire unload events when every tab accepted
> > > > unload. Now unload controller fires beforeunload to tab #2
> > > > - frame host proxies event to tab #2 and starts one second timer
> > > > - tab #1 has finished and fires beforeunload ACK (it was slow to respond
> > > because
> > > > of
> > > > system load or trip to server or w/e)
> > > > - unload controller receives ACK; is unable to find tab #1 in queue (tab
> was
> > > > already removed), assumes that tab #1 was created after queue was formed
> and
> > > > responds with 'true' ("proceed unload")
> > > > - tab #1 unloads and notifies session service (session service is still
> > alive
> > > > atm)
> > > > - session service records that tab #1 was closed. That's it. Now chrome
> will
> > > not
> > > > restore this tab. Same thing may happen to some other tabs. Missed tabs
> will
> > > > be scattered randomly across tabstrips depending on OS schedule and
> content.
> > > > 
> > > > Intend of this CL is to prevent such scenarios. Responsiveness of
browser
> > > should
> > > > not be changed - even if tab is really unresponsive it will be closed as
> > > before,
> > > > after unload controller made sure that every other tab is acknowledged
> > unload.
> > > > 
> > > > > That's a serious bug, and any leads to help diagnose it would be
great. 
> I
> > > > just
> > > > > don't see how this would help.
> > > 
> > > To me it seems that the issue is more with the unload controller than with
> > > WebContents. It could just check whether the tab is in the queue "unload"
> > > instead of assuming right away that it was created after the queue
> > > "beforeunload" started. If it finds it in the queue unload, then it won't
> > > respond with true (aka unload) but false. It seems like a much simpler fix
> > than
> > > what you're trying to do in that CL.
> > 
> > Yes I agree and I am going to remove additional method in web_contents as it
> > seems really unnecessary.
> > 
> > What about modifications to handle following scenario?
> > 
> > - open two tabs with beforeunload handlers (some doc editing)
> > - edit data in tab #2 and select "close browser"
> > - for some reason tab #1 was unable to respond in time to beforeunload but
> still
> > alive
> > - user answers "keep browser open" to dialog from tab #2
> > - opens tab #1 do some edits and clicks button to close this single tab
> > expected behavior: tab #1 asks user if she wants to close.
> > actual behavior: tab will close promptly (as it was marked "allow sudden
> > termination" in RendererUnresponsive)
> 
> Just set allow sudden termination to false in
> WebContentsImpl::RendererResponsive. It's only set to true when the renderer
is
> unresponsive, so it seems logical to set it to false when the renderer becomes
> responsive again.
> 

Alright. What do you think about tests flakiness? I bet new test will have same
issue as TestHangInBeforeUnloadMultipleTabs:
sometimes tab that supposed to prevent browser from close doesn't respond within
a second then browser closes and test hung or fails.

> > 
> > I suggest to keep this fix even though I didn't manage to find corresponding
> > bug. There is a test to reproduce that.

Charlie Reis

creis@chromium.org changed reviewers: + sky@chromium.org

4 years, 7 months ago (2016-05-12 17:03:41 UTC) #10

Charlie Reis

[+sky] Wow, thanks for explaining! That didn't come across in the CL description. Can you ...

4 years, 7 months ago (2016-05-12 17:03:41 UTC) #11

[+sky]

Wow, thanks for explaining!  That didn't come across in the CL description.  Can
you post this comment to the bug, to indicate that you've found the likely
cause?

I like the direction you and clamy@ are exploring to simplify the fix.  I think
sky@ is probably a good reviewer for this, since he's familiar with the session
service and probably UnloadController as well.

On 2016/05/12 06:09:01, Mikhail Goncharov wrote:
> On 2016/05/11 17:10:17, Charlie Reis wrote:
> > We should not do this.  Wouldn't it mean that you couldn't close a tab if
the
> > renderer process was slow to respond, or in an infinite loop?
> > We intentionally
> > have a timer to keep Chrome responsive to the user's request in these cases.
> 
> No, user id still able to close the tab as all measures are still in place.
> Closing of single tab isn't affected at all - only browser closing process.

Ah.  What will the outcome be if a beforeunload handler is in an infinite loop
when you try to shut down the whole browser?  Will we still exit after the 1
second timer fires?


> > > Issue mitigates as browser loses some tabs after relaunch.
> > 
> > I'm curious how this CL is related to that bug?  The tabs closed when
> > beforeunload is bypassed do get restored in the next session, which I've
just
> > verified on Canary.
> 
> You can easily reproduce bug this way:
> - download and open "beforeunload_slow.html" in 4 tabs
> - select "quit chrome" in menu
> (note that some tabs are closed during shutdown exactly like users describe
> in bug comments)
> - open chrome: only two tabs restored while we expect to see all four.
> 
> What happens underneath in time order:
> 
> - browser initiates shutdown and asks unload controller to fire beforeunload
> events
> - unload controller fills up queue of tabs to notify and fires beforeunload to
> tab #1
> - frame host proxies event to tab #1 and starts one second timer
> - tab #1 receives beforeunload and starts processing
> - timer expires and web_content_impl receives RendererUnresponsive. It fires
> onbeforeunloadACK on behalf of tab #1
> - unload controller receives ACK; finds tab #1 in beforeunload queue, removes
it
> from
> that queue and put to the queue "unload", responses with 'false' (aka
> "don't unload") as we want to fire unload events when every tab accepted
> unload. Now unload controller fires beforeunload to tab #2
> - frame host proxies event to tab #2 and starts one second timer
> - tab #1 has finished and fires beforeunload ACK (it was slow to respond
because
> of
> system load or trip to server or w/e)
> - unload controller receives ACK; is unable to find tab #1 in queue (tab was
> already removed), assumes that tab #1 was created after queue was formed and
> responds with 'true' ("proceed unload")
> - tab #1 unloads and notifies session service (session service is still alive
> atm)
> - session service records that tab #1 was closed. That's it. Now chrome will
not
> restore this tab. Same thing may happen to some other tabs. Missed tabs will
> be scattered randomly across tabstrips depending on OS schedule and content.
> 
> Intend of this CL is to prevent such scenarios. Responsiveness of browser
should
> not be changed - even if tab is really unresponsive it will be closed as
before,
> after unload controller made sure that every other tab is acknowledged unload.
> 
> > That's a serious bug, and any leads to help diagnose it would be great.  I
> just
> > don't see how this would help.

Mikhail Goncharov

I have uploaded revisited solution: - turns out it is enough to tell render_frame_host_impl to ...

4 years, 7 months ago (2016-05-12 17:11:31 UTC) #12

Mikhail Goncharov

On 2016/05/12 17:03:41, Charlie Reis wrote: > [+sky] > > Wow, thanks for explaining! That ...

4 years, 7 months ago (2016-05-12 17:17:13 UTC) #13

Mikhail Goncharov

On 2016/05/12 17:11:31, Mikhail Goncharov wrote: > I have uploaded revisited solution: > > - ...

4 years, 7 months ago (2016-05-12 17:18:14 UTC) #14

Mikhail Goncharov

On 2016/05/12 17:17:13, Mikhail Goncharov wrote: > On 2016/05/12 17:03:41, Charlie Reis wrote: > > ...

4 years, 7 months ago (2016-05-12 17:21:42 UTC) #15

Mikhail Goncharov

Description was changed from ========== Fix: tabs slow to response to beforeunload are closed prematurely. ...

4 years, 7 months ago (2016-05-12 17:24:43 UTC) #16

Description was changed from

==========
Fix: tabs slow to response to beforeunload are closed prematurely.

Issue mitigates as browser loses some tabs after relaunch.

Scenario is following: unload controller fires 'beforeunload' event every tab
sequentially and waits for answer. If tab failed to respond within one second
then web_contents_impl sends ack response on behalf of tab and unload
controller continues with the next. But if tab was slow to respond because of
overall load to machine and fired beforeunload ACK after been declared
unresponsive then unload controller is surprised by that response and tells tab
to close immediately.

This fix changes unload controller and web content behavior to do not force
close unresponding tabs when browser is going to shutdown.

Additionally fix addresses an issue of tabs that failed to respond to
beforeunload quickly were marked as "don't ask again". Thus if user cancels
shutdown and then clicked 'close' once again she might miss unload dialog from
tab that is responding now.

New tests might be flacky like MAYBE_TestHangInBeforeUnloadMultipleTabs
because under heavy load tab might fail to response within one second and
treated as if it has no beforeunload handler.

BUG=365052
R=clamy@chromium.org
CQ_INCLUDE_TRYBOTS=tryserver.chromium.linux:linux_site_isolation
==========

to

==========
Fix: tabs slow to response to beforeunload are closed prematurely.

Issue mitigates as browser loses some tabs after relaunch.

What happens underneath in time order:

- browser initiates shutdown and asks unload controller to fire beforeunload
events
- unload controller fills up queue of tabs to notify and fires beforeunload to
tab #1
- frame host proxies event to tab #1 and starts one second timer
- tab #1 receives beforeunload and starts processing
- timer expires and web_content_impl receives RendererUnresponsive. It fires
onbeforeunloadACK on behalf of tab #1
- unload controller receives ACK; finds tab #1 in beforeunload queue, removes it
from
that queue and put to the queue "unload", responses with 'false' (aka
"don't unload") as we want to fire unload events when every tab accepted
unload. Now unload controller fires beforeunload to tab #2
- frame host proxies event to tab #2 and starts one second timer
- tab #1 has finished and fires beforeunload ACK (it was slow to respond because
of
system load or trip to server or w/e)
- unload controller receives ACK; is unable to find tab #1 in queue (tab was
already removed), assumes that tab #1 was created after queue was formed and
responds with 'true' ("proceed unload")
- tab #1 unloads and notifies session service (session service is still alive
atm)
- session service records that tab #1 was closed. That's it. Now chrome will not
restore this tab. Same thing may happen to some other tabs. Missed tabs will
be scattered randomly across tabstrips depending on OS schedule and content.

Intend of this CL is to prevent such scenarios. Responsiveness of browser should
not be changed - even if tab is really unresponsive it will be closed as before,
after unload controller made sure that every other tab is acknowledged unload.

New tests might be flacky like TestHangInBeforeUnloadMultipleTabs
because under heavy load tab might fail to response within one second and
treated as if it has no beforeunload handler.

BUG=365052
R=clamy@chromium.org,sky@chromium.org
CQ_INCLUDE_TRYBOTS=tryserver.chromium.linux:linux_site_isolation
==========

Charlie Reis

On 2016/05/12 17:18:14, Mikhail Goncharov wrote: > On 2016/05/12 17:11:31, Mikhail Goncharov wrote: > > ...

4 years, 7 months ago (2016-05-12 18:40:56 UTC) #17

Charlie Reis

sky@, can you see if you have any recommendations for how to write a non-flaky ...

4 years, 7 months ago (2016-05-13 22:24:01 UTC) #18

sky@, can you see if you have any recommendations for how to write a non-flaky
test for this?  I don't know the tests in chrome/browser/lifetime.

One question about an alternate approach, but it may be bigger than we want to
take on for a fix we want to merge to branch.  (clamy@, I'm interested in your
thoughts on how it affects PlzNavigate and FTN loading state.)

https://codereview.chromium.org/1968933002/diff/20001/chrome/browser/lifetime...
File chrome/browser/lifetime/browser_close_manager_browsertest.cc (right):

https://codereview.chromium.org/1968933002/diff/20001/chrome/browser/lifetime...
chrome/browser/lifetime/browser_close_manager_browsertest.cc:486: for (int i =
0; i < 7; i++) {
Can you add a comment to this block saying why you're loading one instance of
beforeunload.html vs lots of beforeunload_slow.html?

https://codereview.chromium.org/1968933002/diff/20001/chrome/browser/lifetime...
chrome/browser/lifetime/browser_close_manager_browsertest.cc:520:
TestBeforeUnloadMultipleSlowWindows) {
Is there a meaningful difference between slow windows and slow tabs?  I would
imagine that's the same code, but I don't know the UnloadController.

https://codereview.chromium.org/1968933002/diff/20001/content/browser/web_con...
File content/browser/web_contents/web_contents_impl.cc (right):

https://codereview.chromium.org/1968933002/diff/20001/content/browser/web_con...
content/browser/web_contents/web_contents_impl.cc:4576:
rfhi->DontWaitForBeforeUnloadAck();
I wonder if we should expose and call RFHI::OnBeforeUnloadACK instead.  There's
other state that gets reset in there, and listeners that get notified.  Plus it
would call BeforeUnloadFired (via RFHM::OnBeforeUnloadACK).  Seems like that
keeps with the spirit of pretending the ACK arrived when the timer expires.

clamy@, does that sound reasonable to you?  I notice there are some timing UMA
stats in there, as well as implications for PlzNavigate
(NavigatorImpl::OnBeforeUnloadACK doesn't apparently call
RFHM::OnBeforeUnloadACK, so I'm not sure it will automatically call
BeforeUnloadFired).

On the other hand, it's possible we could do this as cleanup after landing this
simpler change.  The simple change is nice and might be more easily merged to
branch, though I do worry about adding more one-off changes that will make this
harder to refactor in the long run.

sky

On 2016/05/13 22:24:01, Charlie Reis (slow to reply) wrote: > sky@, can you see if ...

4 years, 7 months ago (2016-05-13 23:38:03 UTC) #19

On 2016/05/13 22:24:01, Charlie Reis (slow to reply) wrote:
> sky@, can you see if you have any recommendations for how to write a non-flaky
> test for this?  I don't know the tests in chrome/browser/lifetime.
> 
> One question about an alternate approach, but it may be bigger than we want to
> take on for a fix we want to merge to branch.  (clamy@, I'm interested in your
> thoughts on how it affects PlzNavigate and FTN loading state.)
> 
>
https://codereview.chromium.org/1968933002/diff/20001/chrome/browser/lifetime...
> File chrome/browser/lifetime/browser_close_manager_browsertest.cc (right):
> 
>
https://codereview.chromium.org/1968933002/diff/20001/chrome/browser/lifetime...
> chrome/browser/lifetime/browser_close_manager_browsertest.cc:486: for (int i =
> 0; i < 7; i++) {
> Can you add a comment to this block saying why you're loading one instance of
> beforeunload.html vs lots of beforeunload_slow.html?
> 
>
https://codereview.chromium.org/1968933002/diff/20001/chrome/browser/lifetime...
> chrome/browser/lifetime/browser_close_manager_browsertest.cc:520:
> TestBeforeUnloadMultipleSlowWindows) {
> Is there a meaningful difference between slow windows and slow tabs?  I would
> imagine that's the same code, but I don't know the UnloadController.
> 
>
https://codereview.chromium.org/1968933002/diff/20001/content/browser/web_con...
> File content/browser/web_contents/web_contents_impl.cc (right):
> 
>
https://codereview.chromium.org/1968933002/diff/20001/content/browser/web_con...
> content/browser/web_contents/web_contents_impl.cc:4576:
> rfhi->DontWaitForBeforeUnloadAck();
> I wonder if we should expose and call RFHI::OnBeforeUnloadACK instead. 
There's
> other state that gets reset in there, and listeners that get notified.  Plus
it
> would call BeforeUnloadFired (via RFHM::OnBeforeUnloadACK).  Seems like that
> keeps with the spirit of pretending the ACK arrived when the timer expires.
> 
> clamy@, does that sound reasonable to you?  I notice there are some timing UMA
> stats in there, as well as implications for PlzNavigate
> (NavigatorImpl::OnBeforeUnloadACK doesn't apparently call
> RFHM::OnBeforeUnloadACK, so I'm not sure it will automatically call
> BeforeUnloadFired).
> 
> On the other hand, it's possible we could do this as cleanup after landing
this
> simpler change.  The simple change is nice and might be more easily merged to
> branch, though I do worry about adding more one-off changes that will make
this
> harder to refactor in the long run.

I have to admit I'm not familiar with this code, or the impact it has on session
restore.

As to flakiness, can you elaborate on what is causing the flakiness? If it's
timeouts, why can't you disable those for the test?

Mikhail Goncharov

https://codereview.chromium.org/1968933002/diff/20001/chrome/browser/lifetime/browser_close_manager_browsertest.cc File chrome/browser/lifetime/browser_close_manager_browsertest.cc (right): https://codereview.chromium.org/1968933002/diff/20001/chrome/browser/lifetime/browser_close_manager_browsertest.cc#newcode520 chrome/browser/lifetime/browser_close_manager_browsertest.cc:520: TestBeforeUnloadMultipleSlowWindows) { On 2016/05/13 22:24:01, Charlie Reis (slow to ...

4 years, 7 months ago (2016-05-16 04:50:40 UTC) #20

Mikhail Goncharov

https://codereview.chromium.org/1968933002/diff/20001/content/browser/web_contents/web_contents_impl.cc File content/browser/web_contents/web_contents_impl.cc (right): https://codereview.chromium.org/1968933002/diff/20001/content/browser/web_contents/web_contents_impl.cc#newcode4576 content/browser/web_contents/web_contents_impl.cc:4576: rfhi->DontWaitForBeforeUnloadAck(); On 2016/05/13 22:24:01, Charlie Reis (slow to reply) ...

4 years, 7 months ago (2016-05-16 05:02:01 UTC) #21

Mikhail Goncharov

https://codereview.chromium.org/1968933002/diff/20001/chrome/browser/lifetime/browser_close_manager_browsertest.cc File chrome/browser/lifetime/browser_close_manager_browsertest.cc (right): https://codereview.chromium.org/1968933002/diff/20001/chrome/browser/lifetime/browser_close_manager_browsertest.cc#newcode486 chrome/browser/lifetime/browser_close_manager_browsertest.cc:486: for (int i = 0; i < 7; i++) ...

4 years, 7 months ago (2016-05-16 06:10:31 UTC) #22

Mikhail Goncharov

On 2016/05/13 23:38:03, sky wrote: > On 2016/05/13 22:24:01, Charlie Reis (slow to reply) wrote: ...

4 years, 7 months ago (2016-05-16 06:17:37 UTC) #23

On 2016/05/13 23:38:03, sky wrote:
> On 2016/05/13 22:24:01, Charlie Reis (slow to reply) wrote:
> > sky@, can you see if you have any recommendations for how to write a
non-flaky
> > test for this?  I don't know the tests in chrome/browser/lifetime.
> > 
> > One question about an alternate approach, but it may be bigger than we want
to
> > take on for a fix we want to merge to branch.  (clamy@, I'm interested in
your
> > thoughts on how it affects PlzNavigate and FTN loading state.)
> > 
> >
>
https://codereview.chromium.org/1968933002/diff/20001/chrome/browser/lifetime...
> > File chrome/browser/lifetime/browser_close_manager_browsertest.cc (right):
> > 
> >
>
https://codereview.chromium.org/1968933002/diff/20001/chrome/browser/lifetime...
> > chrome/browser/lifetime/browser_close_manager_browsertest.cc:486: for (int i
=
> > 0; i < 7; i++) {
> > Can you add a comment to this block saying why you're loading one instance
of
> > beforeunload.html vs lots of beforeunload_slow.html?
> > 
> >
>
https://codereview.chromium.org/1968933002/diff/20001/chrome/browser/lifetime...
> > chrome/browser/lifetime/browser_close_manager_browsertest.cc:520:
> > TestBeforeUnloadMultipleSlowWindows) {
> > Is there a meaningful difference between slow windows and slow tabs?  I
would
> > imagine that's the same code, but I don't know the UnloadController.
> > 
> >
>
https://codereview.chromium.org/1968933002/diff/20001/content/browser/web_con...
> > File content/browser/web_contents/web_contents_impl.cc (right):
> > 
> >
>
https://codereview.chromium.org/1968933002/diff/20001/content/browser/web_con...
> > content/browser/web_contents/web_contents_impl.cc:4576:
> > rfhi->DontWaitForBeforeUnloadAck();
> > I wonder if we should expose and call RFHI::OnBeforeUnloadACK instead. 
> There's
> > other state that gets reset in there, and listeners that get notified.  Plus
> it
> > would call BeforeUnloadFired (via RFHM::OnBeforeUnloadACK).  Seems like that
> > keeps with the spirit of pretending the ACK arrived when the timer expires.
> > 
> > clamy@, does that sound reasonable to you?  I notice there are some timing
UMA
> > stats in there, as well as implications for PlzNavigate
> > (NavigatorImpl::OnBeforeUnloadACK doesn't apparently call
> > RFHM::OnBeforeUnloadACK, so I'm not sure it will automatically call
> > BeforeUnloadFired).
> > 
> > On the other hand, it's possible we could do this as cleanup after landing
> this
> > simpler change.  The simple change is nice and might be more easily merged
to
> > branch, though I do worry about adding more one-off changes that will make
> this
> > harder to refactor in the long run.
> 
> I have to admit I'm not familiar with this code, or the impact it has on
session
> restore.
> 
> As to flakiness, can you elaborate on what is causing the flakiness? If it's
> timeouts, why can't you disable those for the test?

I have managed to make test fail on windows under load when 'beforeunload.html'
didn't manage to respond within one second and all tabs were closed. If that is
a case for original flakiness then increasing responsiveness timeout from within
a test might stabilize it.
Could you please take a look into http://crbug.com/276366 and
http://crbug.com/517687 ? They might contain some interesting details but I
don't have an access to.

sky

On Sun, May 15, 2016 at 11:17 PM, <metaflow@yandex-team.ru> wrote: > On 2016/05/13 23:38:03, sky ...

4 years, 7 months ago (2016-05-16 18:04:14 UTC) #24

On Sun, May 15, 2016 at 11:17 PM,  <metaflow@yandex-team.ru> wrote:
> On 2016/05/13 23:38:03, sky wrote:
>> On 2016/05/13 22:24:01, Charlie Reis (slow to reply) wrote:
>> > sky@, can you see if you have any recommendations for how to write a
> non-flaky
>> > test for this? I don't know the tests in chrome/browser/lifetime.
>> >
>> > One question about an alternate approach, but it may be bigger than we
>> > want
> to
>> > take on for a fix we want to merge to branch. (clamy@, I'm interested in
> your
>> > thoughts on how it affects PlzNavigate and FTN loading state.)
>> >
>> >
>>
>
https://codereview.chromium.org/1968933002/diff/20001/chrome/browser/lifetime...
>> > File chrome/browser/lifetime/browser_close_manager_browsertest.cc
>> > (right):
>> >
>> >
>>
>
https://codereview.chromium.org/1968933002/diff/20001/chrome/browser/lifetime...
>> > chrome/browser/lifetime/browser_close_manager_browsertest.cc:486: for
>> > (int i
> =
>> > 0; i < 7; i++) {
>> > Can you add a comment to this block saying why you're loading one
>> > instance
> of
>> > beforeunload.html vs lots of beforeunload_slow.html?
>> >
>> >
>>
>
https://codereview.chromium.org/1968933002/diff/20001/chrome/browser/lifetime...
>> > chrome/browser/lifetime/browser_close_manager_browsertest.cc:520:
>> > TestBeforeUnloadMultipleSlowWindows) {
>> > Is there a meaningful difference between slow windows and slow tabs? I
> would
>> > imagine that's the same code, but I don't know the UnloadController.
>> >
>> >
>>
>
https://codereview.chromium.org/1968933002/diff/20001/content/browser/web_con...
>> > File content/browser/web_contents/web_contents_impl.cc (right):
>> >
>> >
>>
>
https://codereview.chromium.org/1968933002/diff/20001/content/browser/web_con...
>> > content/browser/web_contents/web_contents_impl.cc:4576:
>> > rfhi->DontWaitForBeforeUnloadAck();
>> > I wonder if we should expose and call RFHI::OnBeforeUnloadACK instead.
>> There's
>> > other state that gets reset in there, and listeners that get notified.
>> > Plus
>> it
>> > would call BeforeUnloadFired (via RFHM::OnBeforeUnloadACK). Seems like
>> > that
>> > keeps with the spirit of pretending the ACK arrived when the timer
>> > expires.
>> >
>> > clamy@, does that sound reasonable to you? I notice there are some
>> > timing
> UMA
>> > stats in there, as well as implications for PlzNavigate
>> > (NavigatorImpl::OnBeforeUnloadACK doesn't apparently call
>> > RFHM::OnBeforeUnloadACK, so I'm not sure it will automatically call
>> > BeforeUnloadFired).
>> >
>> > On the other hand, it's possible we could do this as cleanup after
>> > landing
>> this
>> > simpler change. The simple change is nice and might be more easily
>> > merged
> to
>> > branch, though I do worry about adding more one-off changes that will
>> > make
>> this
>> > harder to refactor in the long run.
>>
>> I have to admit I'm not familiar with this code, or the impact it has on
> session
>> restore.
>>
>> As to flakiness, can you elaborate on what is causing the flakiness? If
>> it's
>> timeouts, why can't you disable those for the test?
>
> I have managed to make test fail on windows under load when
> 'beforeunload.html'
> didn't manage to respond within one second and all tabs were closed. If that
> is
> a case for original flakiness then increasing responsiveness timeout from
> within
> a test might stabilize it.
> Could you please take a look into http://crbug.com/276366 and
> http://crbug.com/517687 ? They might contain some interesting details but I
> don't have an access to.

I've made 276366 no longer private. You should be able to view 517687
already (there isn't anything really interesting in it).

  -Scott

>
> https://codereview.chromium.org/1968933002/

-- 
You received this message because you are subscribed to the Google Groups
"Chromium-reviews" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to chromium-reviews+unsubscribe@chromium.org.

Charlie Reis

Ok, I've patched in the CL locally to better understand the flakiness, and it makes ...

4 years, 7 months ago (2016-05-19 22:11:56 UTC) #25

Charlie Reis

On 2016/05/19 22:11:56, Charlie Reis (slow to reply) wrote: > I suggest that we proceed ...

4 years, 7 months ago (2016-05-20 01:12:22 UTC) #26

Mikhail Goncharov

https://codereview.chromium.org/1968933002/diff/40001/chrome/browser/lifetime/browser_close_manager_browsertest.cc File chrome/browser/lifetime/browser_close_manager_browsertest.cc (right): https://codereview.chromium.org/1968933002/diff/40001/chrome/browser/lifetime/browser_close_manager_browsertest.cc#newcode506 chrome/browser/lifetime/browser_close_manager_browsertest.cc:506: ASSERT_NO_FATAL_FAILURE(CancelClose()); On 2016/05/19 22:11:55, Charlie Reis (slow to reply) ...

4 years, 7 months ago (2016-05-20 04:20:01 UTC) #27

https://codereview.chromium.org/1968933002/diff/40001/chrome/browser/lifetime...
File chrome/browser/lifetime/browser_close_manager_browsertest.cc (right):

https://codereview.chromium.org/1968933002/diff/40001/chrome/browser/lifetime...
chrome/browser/lifetime/browser_close_manager_browsertest.cc:506:
ASSERT_NO_FATAL_FAILURE(CancelClose());
On 2016/05/19 22:11:55, Charlie Reis (slow to reply) wrote:
> Yep, this is where the test will be flaky, as you mentioned.  If the bot is
> overloaded and it takes more than 1 second to get the dialog from the
responsive
> beforeunload.html tab, this CancelClose call will never see a dialog to
cancel,
> and we'll quit (or rather, crash in AppModalDialogWaiter::Wait).  You can
repro
> this by setting RenderViewHostImpl::kUnloadTimeoutMS to 0.
> 
> That exactly matches the flakiness in other tests in this file (e.g., the
report
> in https://crbug.com/276366).  Any test that uses CancelClose() or
AcceptClose()
> will have this problem.
> 
> I'm not really sure how to fix this.  We need the timeout to fire to exercise
> the bug in this test, so we can't disable it entirely, but we also can't
> realistically guarantee that the "fast" page will respond before the timeout.

That's right. I have experimented with different timeout values locally and it
turns out that plain increase of timeout is not a clear cut.
Under the same load, --gtest_repeat=100 for TestHangInBeforeUnloadMultipleTabs:
kUnloadTimeoutMS = 100  => 15% failed
kUnloadTimeoutMS = 1000 => 10% failed
kUnloadTimeoutMS = 2000 => 9% failed
kUnloadTimeoutMS = 3000 => 12% failed
(lager timeout increases a chance that test will not be able to finish within 30
seconds).

It is interesting that test with multiple windows doesn't seem to flack as much
as one with single window and multiple tabs. One of possible explanations is
that process composition and os scheduling differs.

I am going to disable tests on the same platforms for now and implement second
approach you've described below.

Is it OK to upload "flak-less" patchset to this CL or better to create new one?

Do you have means to bulk run particular test with load similar to normal on
bots to check flakiness before / after fix?

> One option might be gracefully handling the case that the dialog is never
> displayed, such that the test passes.  That means that the test would always
> pass with the bug fixed, and it would usually fail but occasionally pass (on
> slow bots) if the bug regressed.  I'd be ok with that outcome.
> 
> Another option would be exposing some interface to the underlying timer on
each
> tab, so that we can selectively tell them to time out or not.  That seems like
> it would be pretty nice.

I like that 

> That said, I don't think we want that to be a prerequisite for landing this
CL,
> since it's a very valuable fix, and I've already verified locally that the
test
> correctly exercises the bug and the fix.  I'm inclined to say that we should
> land this CL with these new tests disabled, and follow up with a way to fix
and
> enable them (perhaps with a timeout interface for testing).

Charlie Reis

On 2016/05/20 04:20:01, Mikhail Goncharov wrote: > https://codereview.chromium.org/1968933002/diff/40001/chrome/browser/lifetime/browser_close_manager_browsertest.cc > File chrome/browser/lifetime/browser_close_manager_browsertest.cc (right): > > https://codereview.chromium.org/1968933002/diff/40001/chrome/browser/lifetime/browser_close_manager_browsertest.cc#newcode506 ...

4 years, 7 months ago (2016-05-20 04:42:10 UTC) #28

On 2016/05/20 04:20:01, Mikhail Goncharov wrote:
>
https://codereview.chromium.org/1968933002/diff/40001/chrome/browser/lifetime...
> File chrome/browser/lifetime/browser_close_manager_browsertest.cc (right):
> 
>
https://codereview.chromium.org/1968933002/diff/40001/chrome/browser/lifetime...
> chrome/browser/lifetime/browser_close_manager_browsertest.cc:506:
> ASSERT_NO_FATAL_FAILURE(CancelClose());
> On 2016/05/19 22:11:55, Charlie Reis (slow to reply) wrote:
> > Yep, this is where the test will be flaky, as you mentioned.  If the bot is
> > overloaded and it takes more than 1 second to get the dialog from the
> responsive
> > beforeunload.html tab, this CancelClose call will never see a dialog to
> cancel,
> > and we'll quit (or rather, crash in AppModalDialogWaiter::Wait).  You can
> repro
> > this by setting RenderViewHostImpl::kUnloadTimeoutMS to 0.
> > 
> > That exactly matches the flakiness in other tests in this file (e.g., the
> report
> > in https://crbug.com/276366).  Any test that uses CancelClose() or
> AcceptClose()
> > will have this problem.
> > 
> > I'm not really sure how to fix this.  We need the timeout to fire to
exercise
> > the bug in this test, so we can't disable it entirely, but we also can't
> > realistically guarantee that the "fast" page will respond before the
timeout.
> 
> That's right. I have experimented with different timeout values locally and it
> turns out that plain increase of timeout is not a clear cut.
> Under the same load, --gtest_repeat=100 for
TestHangInBeforeUnloadMultipleTabs:
> kUnloadTimeoutMS = 100  => 15% failed
> kUnloadTimeoutMS = 1000 => 10% failed
> kUnloadTimeoutMS = 2000 => 9% failed
> kUnloadTimeoutMS = 3000 => 12% failed
> (lager timeout increases a chance that test will not be able to finish within
30
> seconds).

Yeah, we'd prefer not to have even a low probability of flaking, since that will
cause work for the tree sheriffs and lead to the test being disabled eventually.

> 
> It is interesting that test with multiple windows doesn't seem to flack as
much
> as one with single window and multiple tabs. One of possible explanations is
> that process composition and os scheduling differs.
> 
> I am going to disable tests on the same platforms for now and implement second
> approach you've described below.

Let's just land it entirely disabled, since we know it can fail on any platform.
 I've verified that the test does what it should, and we'll separately fix it to
work on all platforms without flakes.

> Is it OK to upload "flak-less" patchset to this CL or better to create new
one?

Let's keep this CL as the fix plus the disabled test.  In a followup CL, we can
fix the test framework and enable the test.

> Do you have means to bulk run particular test with load similar to normal on
> bots to check flakiness before / after fix?

Not sure.  "git try -h" will show you the options it supports (or see
http://www.chromium.org/developers/testing/try-server-usage).

> > One option might be gracefully handling the case that the dialog is never
> > displayed, such that the test passes.  That means that the test would always
> > pass with the bug fixed, and it would usually fail but occasionally pass (on
> > slow bots) if the bug regressed.  I'd be ok with that outcome.
> > 
> > Another option would be exposing some interface to the underlying timer on
> each
> > tab, so that we can selectively tell them to time out or not.  That seems
like
> > it would be pretty nice.
> 
> I like that 

Great.  Let's pursue that once this fix lands.  Thanks!
> 
> 
> > That said, I don't think we want that to be a prerequisite for landing this
> CL,
> > since it's a very valuable fix, and I've already verified locally that the
> test
> > correctly exercises the bug and the fix.  I'm inclined to say that we should
> > land this CL with these new tests disabled, and follow up with a way to fix
> and
> > enable them (perhaps with a timeout interface for testing).

Mikhail Goncharov

The CQ bit was checked by metaflow@yandex-team.ru to run a CQ dry run

4 years, 7 months ago (2016-05-20 10:05:52 UTC) #29

Mikhail Goncharov

The CQ bit was unchecked by metaflow@yandex-team.ru

4 years, 7 months ago (2016-05-20 10:06:01 UTC) #30

Mikhail Goncharov

The CQ bit was checked by metaflow@yandex-team.ru

4 years, 7 months ago (2016-05-20 10:06:02 UTC) #31

Mikhail Goncharov

The CQ bit was unchecked by metaflow@yandex-team.ru

4 years, 7 months ago (2016-05-20 10:06:03 UTC) #32

Mikhail Goncharov

The CQ bit was checked by metaflow@yandex-team.ru

4 years, 7 months ago (2016-05-20 10:06:03 UTC) #33

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1968933002/80001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1968933002/80001

4 years, 7 months ago (2016-05-20 10:06:12 UTC) #34

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 7 months ago (2016-05-20 10:06:14 UTC) #35

commit-bot: I haz the power

No L-G-T-M from a valid reviewer yet. CQ run can only be started by full ...

4 years, 7 months ago (2016-05-20 10:06:15 UTC) #36

Charlie Reis

LGTM if the try bots are happy. sky@, can you review chrome/ for OWNERS? We're ...

4 years, 7 months ago (2016-05-20 16:24:37 UTC) #37

Charlie Reis

The CQ bit was checked by creis@chromium.org to run a CQ dry run

4 years, 7 months ago (2016-05-20 16:24:45 UTC) #38

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1968933002/80001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1968933002/80001

4 years, 7 months ago (2016-05-20 16:24:53 UTC) #39

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 7 months ago (2016-05-20 16:46:57 UTC) #40

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: linux_chromium_chromeos_ozone_rel_ng on tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_chromeos_ozone_rel_ng/builds/173871)

4 years, 7 months ago (2016-05-20 16:46:58 UTC) #41

Charlie Reis

Mikhail, can you take look at the failure in the RenderFrameHostManagerTest.CloseWithPendingWhileUnresponsive unit test?

4 years, 7 months ago (2016-05-20 16:50:18 UTC) #42

Mikhail Goncharov

On 2016/05/20 16:50:18, Charlie Reis (slow to reply) wrote: > Mikhail, can you take look ...

4 years, 7 months ago (2016-05-21 08:34:37 UTC) #44

Mikhail Goncharov

On 2016/05/21 08:34:37, Mikhail Goncharov wrote: > On 2016/05/20 16:50:18, Charlie Reis (slow to reply) ...

4 years, 7 months ago (2016-05-21 09:32:24 UTC) #45

Mikhail Goncharov

The CQ bit was checked by metaflow@yandex-team.ru to run a CQ dry run

4 years, 7 months ago (2016-05-21 09:33:08 UTC) #46

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1968933002/100001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1968933002/100001

4 years, 7 months ago (2016-05-21 09:33:20 UTC) #47

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 7 months ago (2016-05-21 10:45:44 UTC) #48

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

4 years, 7 months ago (2016-05-21 10:45:45 UTC) #49

Mikhail Goncharov

> Let's keep this CL as the fix plus the disabled test. In a followup ...

4 years, 7 months ago (2016-05-21 15:46:16 UTC) #50

Charlie Reis

The patchset sent to the CQ was uploaded after l-g-t-m from sky@chromium.org Link to the ...

4 years, 7 months ago (2016-05-23 16:36:45 UTC) #53

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1968933002/100001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1968933002/100001

4 years, 7 months ago (2016-05-23 16:36:57 UTC) #54

commit-bot: I haz the power

Description was changed from ========== Fix: tabs slow to response to beforeunload are closed prematurely. ...

4 years, 7 months ago (2016-05-23 18:26:59 UTC) #55

Message was sent while issue was closed.

Description was changed from

==========
Fix: tabs slow to response to beforeunload are closed prematurely.

Issue mitigates as browser loses some tabs after relaunch.

What happens underneath in time order:

- browser initiates shutdown and asks unload controller to fire beforeunload
events
- unload controller fills up queue of tabs to notify and fires beforeunload to
tab #1
- frame host proxies event to tab #1 and starts one second timer
- tab #1 receives beforeunload and starts processing
- timer expires and web_content_impl receives RendererUnresponsive. It fires
onbeforeunloadACK on behalf of tab #1
- unload controller receives ACK; finds tab #1 in beforeunload queue, removes it
from
that queue and put to the queue "unload", responses with 'false' (aka
"don't unload") as we want to fire unload events when every tab accepted
unload. Now unload controller fires beforeunload to tab #2
- frame host proxies event to tab #2 and starts one second timer
- tab #1 has finished and fires beforeunload ACK (it was slow to respond because
of
system load or trip to server or w/e)
- unload controller receives ACK; is unable to find tab #1 in queue (tab was
already removed), assumes that tab #1 was created after queue was formed and
responds with 'true' ("proceed unload")
- tab #1 unloads and notifies session service (session service is still alive
atm)
- session service records that tab #1 was closed. That's it. Now chrome will not
restore this tab. Same thing may happen to some other tabs. Missed tabs will
be scattered randomly across tabstrips depending on OS schedule and content.

Intend of this CL is to prevent such scenarios. Responsiveness of browser should
not be changed - even if tab is really unresponsive it will be closed as before,
after unload controller made sure that every other tab is acknowledged unload.

New tests might be flacky like TestHangInBeforeUnloadMultipleTabs
because under heavy load tab might fail to response within one second and
treated as if it has no beforeunload handler.

BUG=365052
R=clamy@chromium.org,sky@chromium.org
CQ_INCLUDE_TRYBOTS=tryserver.chromium.linux:linux_site_isolation
==========

to

==========
Fix: tabs slow to response to beforeunload are closed prematurely.

Issue mitigates as browser loses some tabs after relaunch.

What happens underneath in time order:

- browser initiates shutdown and asks unload controller to fire beforeunload
events
- unload controller fills up queue of tabs to notify and fires beforeunload to
tab #1
- frame host proxies event to tab #1 and starts one second timer
- tab #1 receives beforeunload and starts processing
- timer expires and web_content_impl receives RendererUnresponsive. It fires
onbeforeunloadACK on behalf of tab #1
- unload controller receives ACK; finds tab #1 in beforeunload queue, removes it
from
that queue and put to the queue "unload", responses with 'false' (aka
"don't unload") as we want to fire unload events when every tab accepted
unload. Now unload controller fires beforeunload to tab #2
- frame host proxies event to tab #2 and starts one second timer
- tab #1 has finished and fires beforeunload ACK (it was slow to respond because
of
system load or trip to server or w/e)
- unload controller receives ACK; is unable to find tab #1 in queue (tab was
already removed), assumes that tab #1 was created after queue was formed and
responds with 'true' ("proceed unload")
- tab #1 unloads and notifies session service (session service is still alive
atm)
- session service records that tab #1 was closed. That's it. Now chrome will not
restore this tab. Same thing may happen to some other tabs. Missed tabs will
be scattered randomly across tabstrips depending on OS schedule and content.

Intend of this CL is to prevent such scenarios. Responsiveness of browser should
not be changed - even if tab is really unresponsive it will be closed as before,
after unload controller made sure that every other tab is acknowledged unload.

New tests might be flacky like TestHangInBeforeUnloadMultipleTabs
because under heavy load tab might fail to response within one second and
treated as if it has no beforeunload handler.

BUG=365052
R=clamy@chromium.org,sky@chromium.org
CQ_INCLUDE_TRYBOTS=tryserver.chromium.linux:linux_site_isolation
==========

commit-bot: I haz the power

Description was changed from ========== Fix: tabs slow to response to beforeunload are closed prematurely. ...

4 years, 7 months ago (2016-05-23 18:29:45 UTC) #57

Message was sent while issue was closed.

Description was changed from

==========
Fix: tabs slow to response to beforeunload are closed prematurely.

Issue mitigates as browser loses some tabs after relaunch.

What happens underneath in time order:

- browser initiates shutdown and asks unload controller to fire beforeunload
events
- unload controller fills up queue of tabs to notify and fires beforeunload to
tab #1
- frame host proxies event to tab #1 and starts one second timer
- tab #1 receives beforeunload and starts processing
- timer expires and web_content_impl receives RendererUnresponsive. It fires
onbeforeunloadACK on behalf of tab #1
- unload controller receives ACK; finds tab #1 in beforeunload queue, removes it
from
that queue and put to the queue "unload", responses with 'false' (aka
"don't unload") as we want to fire unload events when every tab accepted
unload. Now unload controller fires beforeunload to tab #2
- frame host proxies event to tab #2 and starts one second timer
- tab #1 has finished and fires beforeunload ACK (it was slow to respond because
of
system load or trip to server or w/e)
- unload controller receives ACK; is unable to find tab #1 in queue (tab was
already removed), assumes that tab #1 was created after queue was formed and
responds with 'true' ("proceed unload")
- tab #1 unloads and notifies session service (session service is still alive
atm)
- session service records that tab #1 was closed. That's it. Now chrome will not
restore this tab. Same thing may happen to some other tabs. Missed tabs will
be scattered randomly across tabstrips depending on OS schedule and content.

Intend of this CL is to prevent such scenarios. Responsiveness of browser should
not be changed - even if tab is really unresponsive it will be closed as before,
after unload controller made sure that every other tab is acknowledged unload.

New tests might be flacky like TestHangInBeforeUnloadMultipleTabs
because under heavy load tab might fail to response within one second and
treated as if it has no beforeunload handler.

BUG=365052
R=clamy@chromium.org,sky@chromium.org
CQ_INCLUDE_TRYBOTS=tryserver.chromium.linux:linux_site_isolation
==========

to

==========
Fix: tabs slow to response to beforeunload are closed prematurely.

Issue mitigates as browser loses some tabs after relaunch.

What happens underneath in time order:

- browser initiates shutdown and asks unload controller to fire beforeunload
events
- unload controller fills up queue of tabs to notify and fires beforeunload to
tab #1
- frame host proxies event to tab #1 and starts one second timer
- tab #1 receives beforeunload and starts processing
- timer expires and web_content_impl receives RendererUnresponsive. It fires
onbeforeunloadACK on behalf of tab #1
- unload controller receives ACK; finds tab #1 in beforeunload queue, removes it
from
that queue and put to the queue "unload", responses with 'false' (aka
"don't unload") as we want to fire unload events when every tab accepted
unload. Now unload controller fires beforeunload to tab #2
- frame host proxies event to tab #2 and starts one second timer
- tab #1 has finished and fires beforeunload ACK (it was slow to respond because
of
system load or trip to server or w/e)
- unload controller receives ACK; is unable to find tab #1 in queue (tab was
already removed), assumes that tab #1 was created after queue was formed and
responds with 'true' ("proceed unload")
- tab #1 unloads and notifies session service (session service is still alive
atm)
- session service records that tab #1 was closed. That's it. Now chrome will not
restore this tab. Same thing may happen to some other tabs. Missed tabs will
be scattered randomly across tabstrips depending on OS schedule and content.

Intend of this CL is to prevent such scenarios. Responsiveness of browser should
not be changed - even if tab is really unresponsive it will be closed as before,
after unload controller made sure that every other tab is acknowledged unload.

New tests might be flacky like TestHangInBeforeUnloadMultipleTabs
because under heavy load tab might fail to response within one second and
treated as if it has no beforeunload handler.

BUG=365052
R=clamy@chromium.org,sky@chromium.org
CQ_INCLUDE_TRYBOTS=tryserver.chromium.linux:linux_site_isolation

Committed: https://crrev.com/73fc9555a1ecb23f1076b51e4602dfee5419e2f9
Cr-Commit-Position: refs/heads/master@{#395368}
==========

commit-bot: I haz the power

4 years, 7 months ago (2016-05-23 18:29:47 UTC) #58

Message was sent while issue was closed.

Patchset 6 (id:??) landed as
https://crrev.com/73fc9555a1ecb23f1076b51e4602dfee5419e2f9
Cr-Commit-Position: refs/heads/master@{#395368}

Issue 1968933002: Fix: tabs slow to response to beforeunload are closed prematurely. (Closed)

Description

Patch Set 1 : Draft #

Patch Set 2 : Revisited solution #

Patch Set 3 : Use whole rfhi::BeforeUnloadAck #

Patch Set 4 : Review fixes; new tests are disabled #

Patch Set 5 : with tests completely disabled #

Patch Set 6 : Correct order of checks #

Messages