|
|
Created:
6 years, 1 month ago by Ken Russell (switch to Gerrit) Modified:
6 years, 1 month ago CC:
chromium-reviews, telemetry+watch_chromium.org, hendrikw, jbauman Base URL:
https://chromium.googlesource.com/chromium/src.git@master Project:
chromium Visibility:
Public. |
DescriptionShut down browser processes on Windows by sending WM_CLOSE messages.
This avoids calling TerminateProcess against the browser during the
window when doing so would leak a suspended sub-process.
BUG=424024
Committed: https://crrev.com/1c02272b77cede124c7985da598b9ff51e1696cd
Cr-Commit-Position: refs/heads/master@{#304342}
Patch Set 1 #Patch Set 2 : Added back in robustness fixes. #
Total comments: 4
Patch Set 3 : Check specifically for Chrome's windows. #
Total comments: 2
Patch Set 4 : Addressed review feedback from nedngyuen. Pass app_name to CooperativelyShutdown. #Patch Set 5 : Removed tab character #
Messages
Total messages: 28 (6 generated)
kbr@chromium.org changed reviewers: + dtu@chromium.org, nednguyen@google.com, tonyg@chromium.org
Please review. Thanks. This has been tested locally. I'm running a stress test overnight to ensure this fixes all of the problems seen. Note I had to bypass the presubmit checks per Issue 433144.
https://codereview.chromium.org/726563003/diff/20001/tools/telemetry/telemetr... File tools/telemetry/telemetry/core/platform/platform_backend.py (right): https://codereview.chromium.org/726563003/diff/20001/tools/telemetry/telemetr... tools/telemetry/telemetry/core/platform/platform_backend.py:130: self._running_browser_backends.discard(browser_backend) Getting around the weakset look up problem with this is not a good idea. I want us to understand the root cause of it. https://codereview.chromium.org/726563003/diff/20001/tools/telemetry/telemetr... tools/telemetry/telemetry/core/platform/platform_backend.py:268: def IsCooperativeShutdownSupported(self): Instead of this, can we make a function called StopApplication(pid). Posix platform do "kill -9 ...", win platform do what you want below.
https://codereview.chromium.org/726563003/diff/20001/tools/telemetry/telemetr... File tools/telemetry/telemetry/core/platform/platform_backend.py (right): https://codereview.chromium.org/726563003/diff/20001/tools/telemetry/telemetr... tools/telemetry/telemetry/core/platform/platform_backend.py:130: self._running_browser_backends.discard(browser_backend) On 2014/11/14 03:56:42, nednguyen wrote: > Getting around the weakset look up problem with this is not a good idea. I want > us to understand the root cause of it. Qualifying this fix is very expensive -- it takes overnight to really confirm the fix is working. I have already seen that without this workaround, the weak set lookup occasionally throws an exception, causing tests to fail -- and with it, the tests run reliably. I am strongly in favor of putting this workaround in now, getting the bots reliable again, and debugging the root cause of the weakset problem separately. https://codereview.chromium.org/726563003/diff/20001/tools/telemetry/telemetr... tools/telemetry/telemetry/core/platform/platform_backend.py:268: def IsCooperativeShutdownSupported(self): On 2014/11/14 03:56:42, nednguyen wrote: > Instead of this, can we make a function called StopApplication(pid). Posix > platform do "kill -9 ...", win platform do what you want below. I don't want to change the current behavior of non-Windows platforms with this fix. Popen.terminate, which is currently used, seems to work fine on all platforms except Windows.
On 2014/11/14 04:03:58, Ken Russell wrote: > https://codereview.chromium.org/726563003/diff/20001/tools/telemetry/telemetr... > File tools/telemetry/telemetry/core/platform/platform_backend.py (right): > > https://codereview.chromium.org/726563003/diff/20001/tools/telemetry/telemetr... > tools/telemetry/telemetry/core/platform/platform_backend.py:130: > self._running_browser_backends.discard(browser_backend) > On 2014/11/14 03:56:42, nednguyen wrote: > > Getting around the weakset look up problem with this is not a good idea. I > want > > us to understand the root cause of it. > > Qualifying this fix is very expensive -- it takes overnight to really confirm > the fix is working. I have already seen that without this workaround, the weak > set lookup occasionally throws an exception, causing tests to fail -- and with > it, the tests run reliably. I am strongly in favor of putting this workaround in > now, getting the bots reliable again, and debugging the root cause of the > weakset problem separately. This is fair. Can you please file a bug and add me and other telemetry guys to keep track of this? > > https://codereview.chromium.org/726563003/diff/20001/tools/telemetry/telemetr... > tools/telemetry/telemetry/core/platform/platform_backend.py:268: def > IsCooperativeShutdownSupported(self): > On 2014/11/14 03:56:42, nednguyen wrote: > > Instead of this, can we make a function called StopApplication(pid). Posix > > platform do "kill -9 ...", win platform do what you want below. > > I don't want to change the current behavior of non-Windows platforms with this > fix. Popen.terminate, which is currently used, seems to work fine on all > platforms except Windows. Yes, then please use Popen.terminate for StopApplication(process_handle) for other platform.
Please add your responses inline in the code review. It makes it easier to track the conversations about individual changes. On 2014/11/14 04:10:50, nednguyen wrote: > On 2014/11/14 04:03:58, Ken Russell wrote: > > > https://codereview.chromium.org/726563003/diff/20001/tools/telemetry/telemetr... > > File tools/telemetry/telemetry/core/platform/platform_backend.py (right): > > > > > https://codereview.chromium.org/726563003/diff/20001/tools/telemetry/telemetr... > > tools/telemetry/telemetry/core/platform/platform_backend.py:130: > > self._running_browser_backends.discard(browser_backend) > > On 2014/11/14 03:56:42, nednguyen wrote: > > > Getting around the weakset look up problem with this is not a good idea. I > > want > > > us to understand the root cause of it. > > > > Qualifying this fix is very expensive -- it takes overnight to really confirm > > the fix is working. I have already seen that without this workaround, the weak > > set lookup occasionally throws an exception, causing tests to fail -- and with > > it, the tests run reliably. I am strongly in favor of putting this workaround > in > > now, getting the bots reliable again, and debugging the root cause of the > > weakset problem separately. > This is fair. Can you please file a bug and add me and other telemetry guys to > keep track of this? > > > > > > https://codereview.chromium.org/726563003/diff/20001/tools/telemetry/telemetr... > > tools/telemetry/telemetry/core/platform/platform_backend.py:268: def > > IsCooperativeShutdownSupported(self): > > On 2014/11/14 03:56:42, nednguyen wrote: > > > Instead of this, can we make a function called StopApplication(pid). Posix > > > platform do "kill -9 ...", win platform do what you want below. > > > > I don't want to change the current behavior of non-Windows platforms with this > > fix. Popen.terminate, which is currently used, seems to work fine on all > > platforms except Windows. > > Yes, then please use Popen.terminate for StopApplication(process_handle) for > other platform. I think this is not a good direction. Doing this will require *all* of the platform backends to change for this fix, not just the Windows backend. This carries the risk of breaking other platforms. Also, the Android backend (for example) works very differently than the other ports. This abstraction will not work for that platform backend. Also, the try job result http://build.chromium.org/p/tryserver.chromium.gpu/builders/win_gpu_triggered... indicates there might be remaining problems with this fix. I am trying to solve the immediate flakiness problem that has been plaguing the GPU waterfall for three weeks. Please help toward this goal.
It looks like this change ends up killing the buildbot slave process when it's sent to the try servers. Logging on to build144-a4 showed that its slave.bat window was gone after trying this CL. I suspect the next try run on win_gpu_triggered_tests will have the same problem. (I'll fix the slave if so.) Hendrik, John, do you have any ideas on how to more carefully identify the windows to which to send WM_CLOSE?
On 2014/11/14 04:52:29, Ken Russell wrote: > Please add your responses inline in the code review. It makes it easier to track > the conversations about individual changes. > > On 2014/11/14 04:10:50, nednguyen wrote: > > On 2014/11/14 04:03:58, Ken Russell wrote: > > > > > > https://codereview.chromium.org/726563003/diff/20001/tools/telemetry/telemetr... > > > File tools/telemetry/telemetry/core/platform/platform_backend.py (right): > > > > > > > > > https://codereview.chromium.org/726563003/diff/20001/tools/telemetry/telemetr... > > > tools/telemetry/telemetry/core/platform/platform_backend.py:130: > > > self._running_browser_backends.discard(browser_backend) > > > On 2014/11/14 03:56:42, nednguyen wrote: > > > > Getting around the weakset look up problem with this is not a good idea. I > > > want > > > > us to understand the root cause of it. > > > > > > Qualifying this fix is very expensive -- it takes overnight to really > confirm > > > the fix is working. I have already seen that without this workaround, the > weak > > > set lookup occasionally throws an exception, causing tests to fail -- and > with > > > it, the tests run reliably. I am strongly in favor of putting this > workaround > > in > > > now, getting the bots reliable again, and debugging the root cause of the > > > weakset problem separately. > > This is fair. Can you please file a bug and add me and other telemetry guys to > > keep track of this? > > > > > > > > > > > https://codereview.chromium.org/726563003/diff/20001/tools/telemetry/telemetr... > > > tools/telemetry/telemetry/core/platform/platform_backend.py:268: def > > > IsCooperativeShutdownSupported(self): > > > On 2014/11/14 03:56:42, nednguyen wrote: > > > > Instead of this, can we make a function called StopApplication(pid). Posix > > > > platform do "kill -9 ...", win platform do what you want below. > > > > > > I don't want to change the current behavior of non-Windows platforms with > this > > > fix. Popen.terminate, which is currently used, seems to work fine on all > > > platforms except Windows. > > > > Yes, then please use Popen.terminate for StopApplication(process_handle) for > > other platform. > > I think this is not a good direction. Doing this will require *all* of the > platform backends to change for this fix, not just the Windows backend. This > carries the risk of breaking other platforms. Also, the Android backend (for > example) works very differently than the other ports. This abstraction will not > work for that platform backend. > > Also, the try job result > http://build.chromium.org/p/tryserver.chromium.gpu/builders/win_gpu_triggered... > indicates there might be remaining problems with this fix. I am trying to solve > the immediate flakiness problem that has been plaguing the GPU waterfall for > three weeks. Please help toward this goal. I understand that you are in rush to fix this. But the way this code is structured adds more complexity to browser_backend unnecessarily. Imagine someone implements a new application backend implementation (yes, this is the goal of telemetry v2), how would they know that they should use CooperativeShutDown? It seems clear to me that the details of shutting down the application is platform dependent, then the code should be structured that way so that it's clear what's the right way to shut down the application. I can lg2m this but I want to a bug cc'ed to me to keep track of the technical debt to be cleaned.
On 2014/11/14 04:59:28, Ken Russell wrote: > It looks like this change ends up killing the buildbot slave process when it's > sent to the try servers. Logging on to build144-a4 showed that its slave.bat > window was gone after trying this CL. I suspect the next try run on > win_gpu_triggered_tests will have the same problem. (I'll fix the slave if so.) > Hendrik, John, do you have any ideas on how to more carefully identify the > windows to which to send WM_CLOSE? This blog describes a solution to kill subprocesses on Windows, can we try this? http://mackeblog.blogspot.com/2012/05/killing-subprocesses-on-windows-in.html
On 2014/11/14 06:22:10, nednguyen wrote: > On 2014/11/14 04:59:28, Ken Russell wrote: > > It looks like this change ends up killing the buildbot slave process when it's > > sent to the try servers. Logging on to build144-a4 showed that its slave.bat > > window was gone after trying this CL. I suspect the next try run on > > win_gpu_triggered_tests will have the same problem. (I'll fix the slave if > so.) > > Hendrik, John, do you have any ideas on how to more carefully identify the > > windows to which to send WM_CLOSE? > > This blog describes a solution to kill subprocesses on Windows, can we try this? > > > http://mackeblog.blogspot.com/2012/05/killing-subprocesses-on-windows-in.html Looks like this method works on win platform: http://build.chromium.org/p/tryserver.chromium.gpu/builders/win_gpu_triggered... succeeds with --pageset-repeat=200.
On 2014/11/14 17:54:08, nednguyen wrote: > On 2014/11/14 06:22:10, nednguyen wrote: > > On 2014/11/14 04:59:28, Ken Russell wrote: > > > It looks like this change ends up killing the buildbot slave process when > it's > > > sent to the try servers. Logging on to build144-a4 showed that its slave.bat > > > window was gone after trying this CL. I suspect the next try run on > > > win_gpu_triggered_tests will have the same problem. (I'll fix the slave if > > so.) > > > Hendrik, John, do you have any ideas on how to more carefully identify the > > > windows to which to send WM_CLOSE? > > > > This blog describes a solution to kill subprocesses on Windows, can we try > this? > > > > > > http://mackeblog.blogspot.com/2012/05/killing-subprocesses-on-windows-in.html > > Looks like this method works on win platform: > http://build.chromium.org/p/tryserver.chromium.gpu/builders/win_gpu_triggered... > succeeds with --pageset-repeat=200. Please see my comments on https://codereview.chromium.org/723403003 . One try job of that CL failed, indicating that the approach didn't fix the problem: http://build.chromium.org/p/tryserver.chromium.gpu/builders/win_gpu_triggered... . I have other concerns with that approach which are posted on that CL.
Please re-review. Thanks. I specialized the EnumWindows callback to find only Chrome windows. One of the try jobs of the earlier CL indicated that it had inadvertently closed buildbot's window, taking the slave offline. I'm not sure how that happened, but it was necessary to make the code more conservative in response. Filed http://crbug.com/433473 about the WeakSet problems. I haven't yet filed a follow-on bug about technical debt associated with this change. If it lands, will do so. Please leave any comments about the code inline using the review tool instead of using "Reply", so that the conversation thread is kept. I'll be running this CL through the tryjobs multiple times as well as stress testing it on one of the bots, so please let me CQ it.
https://codereview.chromium.org/726563003/diff/40001/tools/telemetry/telemetr... File tools/telemetry/telemetry/core/platform/win_platform_backend.py (right): https://codereview.chromium.org/726563003/diff/40001/tools/telemetry/telemetr... tools/telemetry/telemetry/core/platform/win_platform_backend.py:391: win32gui.GetClassName(hwnd).lower().startswith('chrome')): Can you please find another way that does not hardcode the name "chrome"? Maybe pass in app_name? The assumption that you are always deal with chrome browser is no longer true.
https://codereview.chromium.org/726563003/diff/40001/tools/telemetry/telemetr... File tools/telemetry/telemetry/core/platform/win_platform_backend.py (right): https://codereview.chromium.org/726563003/diff/40001/tools/telemetry/telemetr... tools/telemetry/telemetry/core/platform/win_platform_backend.py:391: win32gui.GetClassName(hwnd).lower().startswith('chrome')): On 2014/11/14 22:14:13, nednguyen wrote: > Can you please find another way that does not hardcode the name "chrome"? Maybe > pass in app_name? The assumption that you are always deal with chrome browser is > no longer true. Done. Now passes in app name from desktop_browser_backend.py.
lgtm
The CQ bit was checked by nednguyen@google.com
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/726563003/60001
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: chromium_presubmit on tryserver.chromium.linux (http://build.chromium.org/p/tryserver.chromium.linux/builders/chromium_presub...)
The CQ bit was checked by kbr@chromium.org
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/726563003/80001
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: chromium_presubmit on tryserver.chromium.linux (http://build.chromium.org/p/tryserver.chromium.linux/builders/chromium_presub...)
On 2014/11/15 01:47:57, I haz the power (commit-bot) wrote: > Try jobs failed on following builders: > chromium_presubmit on tryserver.chromium.linux > (http://build.chromium.org/p/tryserver.chromium.linux/builders/chromium_presub...) That was due to https://codereview.chromium.org/707353002/ , reverted in https://codereview.chromium.org/719313003 .
The CQ bit was checked by kbr@chromium.org
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/726563003/80001
Message was sent while issue was closed.
Committed patchset #5 (id:80001)
Message was sent while issue was closed.
Patchset 5 (id:??) landed as https://crrev.com/1c02272b77cede124c7985da598b9ff51e1696cd Cr-Commit-Position: refs/heads/master@{#304342} |