|
|
Created:
6 years, 10 months ago by sigbjorn Modified:
6 years, 9 months ago CC:
chromium-reviews, cbentzel+watch_chromium.org, Johnny(Jianning) Ding, Daniel Bratell Base URL:
https://chromium.googlesource.com/chromium/src.git@master Visibility:
Public. |
DescriptionGive localserver time to shut down
On machines with slow I/O, the localserver might not shut down immediately,
wait for it if necessary.
BUG=
Committed: https://src.chromium.org/viewvc/chrome?view=rev&revision=260201
Patch Set 1 #Patch Set 2 : Fixup! Give Windows plenty of time to shut down, don't call KillProcess #
Total comments: 7
Patch Set 3 : Comment cleanup #Patch Set 4 : Use TerminateJobObject. Restore KillProcess for Windows #
Total comments: 7
Patch Set 5 : Original code, using action_timeout #
Total comments: 1
Patch Set 6 : Special Windows code for shutting down localserver #Patch Set 7 : Proper windows timeout #Messages
Total messages: 59 (0 generated)
Please forward if I got the wrong reviewers (though a one line change should be quick to review)
On 2014/02/11 17:30:19, sigbjorn wrote: > Please forward if I got the wrong reviewers (though a one line change should be > quick to review) You didn't set any reviewer. You'll want to use one that is in the closest OWNERS file.
Well, what's wrong with killing the process if it doesn't shut down in time?
On 2014/02/12 18:45:51, Paweł Hajdan Jr. wrote: > Well, what's wrong with killing the process if it doesn't shut down in time? "In time" was previously 0 seconds, so would fail frequently on slow machines. For suitable definitions of "in time", I agree with you, which is what this patch does. KillProcess also attemtps to wait, so might as well wait in an more appropriate place instead. If job_handle_.Close() and WaitForSingleProcess both failed, KillProcess also tends to fail (I have not investigated why), which returns false from the function, messing with tests which use the function.
On 2014/02/13 09:01:01, sigbjorn wrote: > On 2014/02/12 18:45:51, Paweł Hajdan Jr. wrote: > > Well, what's wrong with killing the process if it doesn't shut down in time? > > "In time" was previously 0 seconds, so would fail frequently on slow machines. > For suitable definitions of "in time", I agree with you, which is what this > patch does. > > KillProcess also attemtps to wait, so might as well wait in an more appropriate > place instead. If job_handle_.Close() and WaitForSingleProcess both failed, > KillProcess also tends to fail (I have not investigated why), which returns > false from the function, messing with tests which use the function. To be clear, job_handle_.Close() should never fail. All the process in the job objects are killed. The problem here from what I understand is that WaitForSingleProcess doesn't wait at all, and since the process are killed asynchronously, the the call to KillProcess may still execute. Then the call to KillProcessw will likely fail because the process was killed at that point. I'm actually fine to put an infinite time out on Windows, but only on Windows and remove the KillProcess call on Windows.
On 2014/02/13 13:39:50, M-A Ruel wrote: > I'm actually fine to put an infinite time out on Windows, but only on Windows > and remove the KillProcess call on Windows. Infinite sounds like a long time, if it doesn't shut down timely, the function should still return an error. Increased to 60 seconds, and no calling of KillProcess on Windows.
https://codereview.chromium.org/145873004/diff/70001/net/test/spawned_test_se... File net/test/spawned_test_server/local_test_server.cc (right): https://codereview.chromium.org/145873004/diff/70001/net/test/spawned_test_se... net/test/spawned_test_server/local_test_server.cc:130: // Wait for the process to terminate. processes http://msdn.microsoft.com/library/windows/desktop/ms684161.aspx " The state of a job object is set to signaled when all of its processes are terminated because the specified end-of-job time limit has been exceeded. " So do not wait for process_handle_, wait for job_handle_ to become signaled. https://codereview.chromium.org/145873004/diff/70001/net/test/spawned_test_se... net/test/spawned_test_server/local_test_server.cc:136: ret = base::KillProcess(process_handle_, 1, true); That doesn't compile.
https://codereview.chromium.org/145873004/diff/70001/net/test/spawned_test_se... File net/test/spawned_test_server/local_test_server.cc (right): https://codereview.chromium.org/145873004/diff/70001/net/test/spawned_test_se... net/test/spawned_test_server/local_test_server.cc:130: // Wait for the process to terminate. > processes Fixed > So do not wait for process_handle_, wait for job_handle_ to become signaled. How do you propose to do this? job_handle is explicitly nulled in Close() https://codereview.chromium.org/145873004/diff/70001/net/test/spawned_test_se... net/test/spawned_test_server/local_test_server.cc:136: ret = base::KillProcess(process_handle_, 1, true); On 2014/02/14 13:54:35, M-A Ruel wrote: > That doesn't compile. Which compiler and how?
https://codereview.chromium.org/145873004/diff/70001/net/test/spawned_test_se... File net/test/spawned_test_server/local_test_server.cc (right): https://codereview.chromium.org/145873004/diff/70001/net/test/spawned_test_se... net/test/spawned_test_server/local_test_server.cc:130: // Wait for the process to terminate. On 2014/02/14 16:05:51, sigbjorn wrote: > > So do not wait for process_handle_, wait for job_handle_ to become signaled. > How do you propose to do this? job_handle is explicitly nulled in Close() > With TerminateJobObject(), since you obviously can't wait on a closed handle. http://msdn.microsoft.com/library/windows/desktop/ms686709.aspx https://codereview.chromium.org/145873004/diff/70001/net/test/spawned_test_se... net/test/spawned_test_server/local_test_server.cc:136: ret = base::KillProcess(process_handle_, 1, true); On 2014/02/14 16:05:51, sigbjorn wrote: > On 2014/02/14 13:54:35, M-A Ruel wrote: > > That doesn't compile. > > Which compiler and how? Oh sorry, I missed the fact that you had removed { too.
Please make sure KillProcess is also called on Windows. Job objects should be a secondary safety net, not the primary mechanism we rely on for process termination. "not lgtm" in the current shape
KillProcess restored to work also on Windows. https://codereview.chromium.org/145873004/diff/70001/net/test/spawned_test_se... File net/test/spawned_test_server/local_test_server.cc (right): https://codereview.chromium.org/145873004/diff/70001/net/test/spawned_test_se... net/test/spawned_test_server/local_test_server.cc:130: // Wait for the process to terminate. On 2014/02/14 20:14:37, M-A Ruel wrote: > On 2014/02/14 16:05:51, sigbjorn wrote: > > > So do not wait for process_handle_, wait for job_handle_ to become signaled. > > How do you propose to do this? job_handle is explicitly nulled in Close() > > > > With TerminateJobObject(), since you obviously can't wait on a closed handle. > http://msdn.microsoft.com/library/windows/desktop/ms686709.aspx Done.
https://codereview.chromium.org/145873004/diff/200001/net/test/spawned_test_s... File net/test/spawned_test_server/local_test_server.cc (right): https://codereview.chromium.org/145873004/diff/200001/net/test/spawned_test_s... net/test/spawned_test_server/local_test_server.cc:129: bool ret = TerminateJobObject(job_handle_, 0); Can this actually fail? If so, when and why? How is that different from just closing the job handle with JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE?
https://codereview.chromium.org/145873004/diff/200001/net/test/spawned_test_s... File net/test/spawned_test_server/local_test_server.cc (right): https://codereview.chromium.org/145873004/diff/200001/net/test/spawned_test_s... net/test/spawned_test_server/local_test_server.cc:129: bool ret = TerminateJobObject(job_handle_, 0); On 2014/02/18 19:29:54, Paweł Hajdan Jr. wrote: > Can this actually fail? If so, when and why? > > How is that different from just closing the job handle with > JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE? Sigbjorn found out it is actually killing contained processes asynchronously.
https://codereview.chromium.org/145873004/diff/200001/net/test/spawned_test_s... File net/test/spawned_test_server/local_test_server.cc (right): https://codereview.chromium.org/145873004/diff/200001/net/test/spawned_test_s... net/test/spawned_test_server/local_test_server.cc:129: bool ret = TerminateJobObject(job_handle_, 0); On 2014/02/18 19:39:33, M-A Ruel wrote: > On 2014/02/18 19:29:54, Paweł Hajdan Jr. wrote: > > Can this actually fail? If so, when and why? > > > > How is that different from just closing the job handle with > > JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE? > > Sigbjorn found out it is actually killing contained processes asynchronously. Our tests would have random failures due to this function returning False. We only saw the failures on our virtual test slaves with virtual disks (thus very slow I/O). Increasing the timeout in WaitForSingleProcess from 0 to 10 seconds eliminated the problem. That was using job_handle_.Close() though, I have no idea if TerminateJobObject behaves the same way.
https://codereview.chromium.org/145873004/diff/200001/net/test/spawned_test_s... File net/test/spawned_test_server/local_test_server.cc (right): https://codereview.chromium.org/145873004/diff/200001/net/test/spawned_test_s... net/test/spawned_test_server/local_test_server.cc:129: bool ret = TerminateJobObject(job_handle_, 0); On 2014/02/19 09:29:13, sigbjorn wrote: > Our tests would have random failures due to this function returning False. We > only saw the failures on our virtual test slaves with virtual disks (thus very > slow I/O). Increasing the timeout in WaitForSingleProcess from 0 to 10 seconds > eliminated the problem. > > That was using job_handle_.Close() though, I have no idea if TerminateJobObject > behaves the same way. Could you try this instead? 1. Close the job handle (like it was before). 2. Give the process more time to shut down (all platforms, and use the timeouts from base/test/test_timeouts). 3. Always kill the process if wait fails.
On 2014/02/19 19:20:48, Paweł Hajdan Jr. wrote: > https://codereview.chromium.org/145873004/diff/200001/net/test/spawned_test_s... > File net/test/spawned_test_server/local_test_server.cc (right): > > https://codereview.chromium.org/145873004/diff/200001/net/test/spawned_test_s... > net/test/spawned_test_server/local_test_server.cc:129: bool ret = > TerminateJobObject(job_handle_, 0); > On 2014/02/19 09:29:13, sigbjorn wrote: > > Our tests would have random failures due to this function returning False. We > > only saw the failures on our virtual test slaves with virtual disks (thus very > > slow I/O). Increasing the timeout in WaitForSingleProcess from 0 to 10 seconds > > eliminated the problem. > > > > That was using job_handle_.Close() though, I have no idea if > TerminateJobObject > > behaves the same way. > > Could you try this instead? > > 1. Close the job handle (like it was before). > 2. Give the process more time to shut down (all platforms, and use the timeouts > from base/test/test_timeouts). > 3. Always kill the process if wait fails. I explicitly said to the the reverse! Call TerminateJobObject() Wait for the job object Close it.
https://codereview.chromium.org/145873004/diff/200001/net/test/spawned_test_s... File net/test/spawned_test_server/local_test_server.cc (right): https://codereview.chromium.org/145873004/diff/200001/net/test/spawned_test_s... net/test/spawned_test_server/local_test_server.cc:129: bool ret = TerminateJobObject(job_handle_, 0); On 2014/02/19 19:20:49, Paweł Hajdan Jr. wrote: > On 2014/02/19 09:29:13, sigbjorn wrote: > > Our tests would have random failures due to this function returning False. We > > only saw the failures on our virtual test slaves with virtual disks (thus very > > slow I/O). Increasing the timeout in WaitForSingleProcess from 0 to 10 seconds > > eliminated the problem. > > > > That was using job_handle_.Close() though, I have no idea if > TerminateJobObject > > behaves the same way. > > Could you try this instead? > > 1. Close the job handle (like it was before). > 2. Give the process more time to shut down (all platforms, and use the timeouts > from base/test/test_timeouts). > 3. Always kill the process if wait fails. You mean like patch set 1 did? Except change the hardcoded 10 to a proper timeout? I have no strong opinion on how this should be implemented, and will follow recommendations, but right now we seem to be going in circles. If you and maruel could agree on a preferred method first, then I can upload a patch which does that.
https://codereview.chromium.org/145873004/diff/200001/net/test/spawned_test_s... File net/test/spawned_test_server/local_test_server.cc (right): https://codereview.chromium.org/145873004/diff/200001/net/test/spawned_test_s... net/test/spawned_test_server/local_test_server.cc:129: bool ret = TerminateJobObject(job_handle_, 0); On 2014/02/20 09:25:56, sigbjorn wrote: > You mean like patch set 1 did? Except change the hardcoded 10 to a proper > timeout? Sounds reasonable. > I have no strong opinion on how this should be implemented, and will follow > recommendations, but right now we seem to be going in circles. If you and maruel > could agree on a preferred method first, then I can upload a patch which does > that. I think my comments are pretty consistent: mostly objecting to lack of KillProcess.
https://codereview.chromium.org/145873004/diff/200001/net/test/spawned_test_s... File net/test/spawned_test_server/local_test_server.cc (right): https://codereview.chromium.org/145873004/diff/200001/net/test/spawned_test_s... net/test/spawned_test_server/local_test_server.cc:129: bool ret = TerminateJobObject(job_handle_, 0); On 2014/02/20 19:06:20, Paweł Hajdan Jr. wrote: > On 2014/02/20 09:25:56, sigbjorn wrote: > > You mean like patch set 1 did? Except change the hardcoded 10 to a proper > > timeout? > > Sounds reasonable. Now in patch set 5
Patch Set 5 LGTM++ Please let me take another look if any changes are made to it.
The CQ bit was checked by sigbjorn@opera.com
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/sigbjorn@opera.com/145873004/350001
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: linux_chromium_rel
The CQ bit was checked by sigbjorn@opera.com
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/sigbjorn@opera.com/145873004/350001
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: linux_chromium_rel, mac_chromium_rel
On 2014/02/24 13:14:13, I haz the power (commit-bot) wrote: > Try jobs failed on following builders: linux_chromium_rel I know to little about trybot to understand what is going on, trying again to see if this was random failures.
We have 4 stable failures in tests on linux_chromium_rel. They all time out in WaitForSingleProcess, suggesting local_test_server does not shut down cleanly. Either the tests are already very close to their timeout value, or trybot runs in debug mode (where action_timeout_ms_ is modified). The problem is likely elsewhere, uncovered by this change, but not introduced by it. However, I know nothing about trybot, little about local_test_server, I can't reproduce on Windows, and I don't have a Linux setup. Would one of you be able to help me out by taking a look at these failures?
On 2014/02/26 15:26:08, sigbjorn wrote: > Would one of you be able > to help me out by taking a look at these failures? Alternatively I can apply the timeout only on Windows, as initially suggested in comment #5?
On 2014/02/28 10:54:46, sigbjorn wrote: > On 2014/02/26 15:26:08, sigbjorn wrote: > > Would one of you be able > > to help me out by taking a look at these failures? > > Alternatively I can apply the timeout only on Windows, as initially suggested in > comment #5? Paweł and/or M-A Ruel, any thoughts?
On 2014/03/12 10:37:16, sigbjorn wrote: > Paweł and/or M-A Ruel, any thoughts? There are no logs now. I'm opposed to only applying the timeout on Windows. Note that even if the timeout is "too short", the process should still get killed, and if kill is successful the method should return true. It's hard to say something more without the logs. Sorry for not looking at this earlier.
The CQ bit was checked by sigbjorn@opera.com
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/sigbjorn@opera.com/145873004/350001
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: tryserver.chromium on linux_chromium_rel
On 2014/03/12 16:03:53, Paweł Hajdan Jr. wrote: > On 2014/03/12 10:37:16, sigbjorn wrote: > > Paweł and/or M-A Ruel, any thoughts? > > There are no logs now. > > I'm opposed to only applying the timeout on Windows. > > Note that even if the timeout is "too short", the process should still get > killed, and if kill is successful the method should return true. > > It's hard to say something more without the logs. Sorry for not looking at this > earlier. New logs on http://build.chromium.org/p/tryserver.chromium/builders/linux_chromium_rel/bu... and http://build.chromium.org/p/tryserver.chromium/builders/linux_chromium_rel/bu... Search for WaitForSingleProcess
Paweł and/or M-A Ruel, any thoughts before the logs disappear again? Are there any keywords I should use to ensure your attention in mail?
https://codereview.chromium.org/145873004/diff/350001/net/test/spawned_test_s... File net/test/spawned_test_server/local_test_server.cc (right): https://codereview.chromium.org/145873004/diff/350001/net/test/spawned_test_s... net/test/spawned_test_server/local_test_server.cc:129: // This kills all the processes in the job object. Remove the change to lines 133-136 and replace lines 128-131 with the following: #if defined(OS_WIN) // This kills all the processes in the job object. ::TerminateJobObject(job_handle_.Get(), 1); // This is done asynchronously so wait a bit for the processes to be killed. ::WaitForSingleObject(job_handle_.Get(), TestTimeouts::action_timeout()); job_handle_.Close(); #endif AFAIK, that's the only correct way to do this, independent of what Pawel said. Reference: http://msdn.microsoft.com/library/windows/desktop/ms684161.aspx
The CQ bit was checked by sigbjorn@opera.com
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/sigbjorn@opera.com/145873004/690001
The CQ bit was unchecked by commit-bot@chromium.org
Retried try job too often on win_rel for step(s) base_unittests, browser_tests, interactive_ui_tests, net_unittests, unit_tests http://build.chromium.org/p/tryserver.chromium/buildstatus?builder=win_rel&nu...
The CQ bit was checked by sigbjorn@opera.com
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/sigbjorn@opera.com/145873004/710001
The CQ bit was unchecked by commit-bot@chromium.org
Retried try job too often on win_rel for step(s) content_browsertests http://build.chromium.org/p/tryserver.chromium/buildstatus?builder=win_rel&nu...
The CQ bit was checked by sigbjorn@opera.com
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/sigbjorn@opera.com/145873004/710001
The CQ bit was unchecked by commit-bot@chromium.org
Retried try job too often on win_rel for step(s) content_browsertests http://build.chromium.org/p/tryserver.chromium/buildstatus?builder=win_rel&nu...
The CQ bit was checked by sigbjorn@opera.com
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/sigbjorn@opera.com/145873004/710001
The CQ bit was unchecked by commit-bot@chromium.org
Retried try job too often on chromium_presubmit for step(s) presubmit http://build.chromium.org/p/tryserver.chromium/buildstatus?builder=chromium_p...
The CQ bit was checked by sigbjorn@opera.com
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/sigbjorn@opera.com/145873004/710001
Message was sent while issue was closed.
Change committed as 260201 |