Chromium Code Reviews
chromiumcodereview-hr@appspot.gserviceaccount.com (chromiumcodereview-hr) | Please choose your nickname with Settings | Help | Chromium Project | Gerrit Changes | Sign out
(62)

Issue 300063017: Fix silent failures in test runner that lead to long timeouts (Closed)

Created:
6 years, 6 months ago by Sami
Modified:
6 years, 6 months ago
CC:
chromium-reviews, klundberg+watch_chromium.org, bulach+watch_chromium.org, yfriedman+watch_chromium.org, ilevy-cc_chromium.org, Andrew Hayden (chromium.org)
Visibility:
Public.

Description

Fix silent failures in test runner that lead to long timeouts Previously the test runner was using pexpect with a timeout to run the test script. The problem with this is that it can cause the test runner to hang if the child test script exits while still having live child processes of its own. The exact sequence of operations leading to the failure is: 1. TestRunner calls pexpect.run('test_script', timeout, ...). 2. The test script spawns at least one sub process (e.g., the host forwarder), which is attached to the same stdout as the test script itself. 3. Pexpect calls select on the test script's stdout file descriptor to receive output from the script. This is a blocking call. 4. The test script dies for some reason (e.g., disconnected device), failing to clean up all the sub processes it created in step 2. 5. Because the sub processes still have a reference to the (now-dead) test script's stdout, the select call in step #3 is not unblocked, so it hangs until the timeout fires (1.5h by default). The root cause of the problem is that the test script's sub processes outlive their parent (which is standard Unix behavior). In order to fix this we would have to change all the known daemons spawned by every test script to be proper daemons that release the controlling terminal and make the test script kill all of its subprocesses when exiting (i.e., prctl(PR_SET_PDEATHSIG, TERM)). Because the above fix involves a large number of changes and is a little fragile in the sense that we may miss updating some scripts by mistake, this patch takes a defensive approach. Namely, we fix the problem by replacing the use of pexpect with a custom loop which combines the call to select() as well as polling for the exit status of the test script. The call to select uses a smaller timeout (1s), so we can detect a dying test script almost immediately. This patch also simplifies some existing code in cmd_helper. BUG=355952 TEST=src/build/android/test_runner.py perf -v --release --single-step \ --src/tools/perf/run_benchmark -v --output-format=buildbot \ --report-root-metrics --browser=android-chromium-testshell \ rasterize_and_record_micro.key_mobile_sites Committed: https://src.chromium.org/viewvc/chrome?view=rev&revision=273530

Patch Set 1 #

Total comments: 1

Patch Set 2 : Also capture stderr. Read output before polling for completion. #

Patch Set 3 : Unbreak windows. #

Unified diffs Side-by-side diffs Delta from patch set Stats (+71 lines, -25 lines) Patch
M build/android/pylib/cmd_helper.py View 1 2 2 chunks +64 lines, -17 lines 0 comments Download
M build/android/pylib/perf/test_runner.py View 2 chunks +7 lines, -8 lines 0 comments Download

Messages

Total messages: 15 (0 generated)
Sami
6 years, 6 months ago (2014-05-28 21:35:22 UTC) #1
Andrew Hayden (chromium.org)
LGTM, and kudos for tracking this down and fixing it. https://codereview.chromium.org/300063017/diff/1/build/android/pylib/perf/test_runner.py File build/android/pylib/perf/test_runner.py (right): https://codereview.chromium.org/300063017/diff/1/build/android/pylib/perf/test_runner.py#newcode204 ...
6 years, 6 months ago (2014-05-28 21:43:37 UTC) #2
Sami
On 2014/05/28 21:43:37, Andrew Hayden wrote: > Minor thing but for the sake of conformity ...
6 years, 6 months ago (2014-05-28 22:33:32 UTC) #3
Primiano Tucci (use gerrit)
LGTM (it has ben a loooooong night) and thanks for cleaning up the crazy temp ...
6 years, 6 months ago (2014-05-28 22:44:09 UTC) #4
Sami
The CQ bit was checked by skyostil@chromium.org
6 years, 6 months ago (2014-05-28 22:44:28 UTC) #5
commit-bot: I haz the power
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/skyostil@chromium.org/300063017/20001
6 years, 6 months ago (2014-05-28 22:45:19 UTC) #6
commit-bot: I haz the power
FYI, CQ is re-trying this CL (attempt #1). The failing builders are: win_chromium_rel on tryserver.chromium ...
6 years, 6 months ago (2014-05-29 02:58:49 UTC) #7
commit-bot: I haz the power
The CQ bit was unchecked by commit-bot@chromium.org
6 years, 6 months ago (2014-05-29 03:01:28 UTC) #8
commit-bot: I haz the power
Try jobs failed on following builders: win_gpu_triggered_tests on tryserver.chromium.gpu (http://build.chromium.org/p/tryserver.chromium.gpu/builders/win_gpu_triggered_tests/builds/11650)
6 years, 6 months ago (2014-05-29 03:01:28 UTC) #9
Andrew Hayden (chromium.org)
Windoze: Traceback (most recent call last): <module> at tools\telemetry\run_tests:9 from telemetry.unittest import gtest_testrunner <module> at ...
6 years, 6 months ago (2014-05-29 07:58:18 UTC) #10
Andrew Hayden (chromium.org)
FWIW I have spent the past 20 minutes (Internet time: eternity) looking through stackoverflow and ...
6 years, 6 months ago (2014-05-29 08:14:15 UTC) #11
Sami
On 2014/05/29 08:14:15, Andrew Hayden wrote: > FWIW I have spent the past 20 minutes ...
6 years, 6 months ago (2014-05-29 14:54:44 UTC) #12
Sami
The CQ bit was checked by skyostil@chromium.org
6 years, 6 months ago (2014-05-29 14:54:49 UTC) #13
commit-bot: I haz the power
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/skyostil@chromium.org/300063017/40001
6 years, 6 months ago (2014-05-29 14:57:30 UTC) #14
commit-bot: I haz the power
6 years, 6 months ago (2014-05-29 19:02:25 UTC) #15
Message was sent while issue was closed.
Change committed as 273530

Powered by Google App Engine
This is Rietveld 408576698