Description
telemetry: Report error correctly on connection timeout
When we hit the 30s timeout, we would continue to run the test, then report a success, then attempt to dump the stdio and stacktrace.
Instead, re-raise the exception and remove the finally block to stop the test from going through its paces.
BUG=424024
Committed: https://crrev.com/2a25d5052fa4be1c9930775ac792ec53af13a4e5
Cr-Commit-Position: refs/heads/master@{#302577}
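For illustration, a minimal, self-contained sketch of the failure mode the description refers to. The names (TimeoutException, connect_to_browser, etc.) are hypothetical stand-ins, not the actual telemetry API; the point is only how a finally block lets the run carry on after a timeout.

    class TimeoutException(Exception):
      """Stands in for the 30s browser-connection timeout."""

    def connect_to_browser():
      raise TimeoutException('timed out after 30s waiting for the browser')

    def run_page_old(results):
      try:
        connect_to_browser()
        results.append('page ran')
      finally:
        # Before the fix: this ran even after the timeout, dumping stdio
        # and a stacktrace and letting the caller record a success for a
        # page that never actually ran.
        results.append('teardown ran as if the page had completed')

    def run_page_new(results):
      # After the fix: no finally block, so the timeout propagates and
      # the caller sees the failure instead of a bogus pass.
      connect_to_browser()
      results.append('page ran')

    results = []
    try:
      run_page_old(results)
    except TimeoutException:
      pass
    print(results)  # ['teardown ran as if the page had completed']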
Patch Set 1 #
Total comments: 5
Patch Set 2 : Continue to run on a browser timeout but report as failed #
Total comments: 1
Patch Set 3 : Added unit test and addressed review comments #
Total comments: 1
Patch Set 4 : Shorten the function name #
Patch Set 5 : remove unused vars #
Messages
Total messages: 23 (5 generated)
hendrikw@chromium.org changed reviewers: + ernstm@chromium.org, kbr@chromium.org
PTAL - keep in mind my python knowledge is zero :) Thanks!
CC'ing others.

What's the behavioral change here? Does it abort the test run if an exception is thrown? Is that the desired behavior? What about any retries (I realize retries are being removed)?

This change would be more obviously correct if there were an "except" clause that indicated an exception was raised, and the code in the finally block took that into account.
On 2014/10/31 23:11:04, Ken Russell wrote:
> CC'ing others.
>
> What's the behavioral change here? Does it abort the test run if an exception
> is thrown? Is that the desired behavior? What about any retries (I realize
> retries are being removed)?
>
> This change would be more obviously correct if there were an "except" clause
> that indicated an exception was raised, and the code in the finally block took
> that into account.

The retry logic still works correctly after this change. The behavioural change is that we no longer attempt to run the last test after a timeout (in the finalize).
> (in the finalize).

I guess it's actually at the top of the loop. No matter, not running the code in the finalize doesn't prevent the retries.
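A toy illustration of why the retries survive the change: the retry loop sits above the per-page try block, so an exception that propagates out of the page run is still caught and retried. The loop structure and MAX_ATTEMPTS here are assumptions for the sketch, not page_runner.py's actual code.

    class BrowserGoneException(Exception):
      pass

    MAX_ATTEMPTS = 3  # assumed retry count, purely for illustration

    def run_once(attempt):
      if attempt < 2:
        raise BrowserGoneException('browser never connected')
      return 'pass'

    def run_with_retries():
      # Re-raising from inside the page run (with its finally block
      # removed) still lands here, triggering another attempt.
      for attempt in range(MAX_ATTEMPTS):
        try:
          return run_once(attempt)
        except BrowserGoneException:
          continue  # retry from the top of the loop
      return 'fail'

    print(run_with_retries())  # 'pass' on the third attempt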
https://codereview.chromium.org/694893004/diff/1/tools/telemetry/telemetry/pa...
File tools/telemetry/telemetry/page/page_runner.py (right):

https://codereview.chromium.org/694893004/diff/1/tools/telemetry/telemetry/pa...
tools/telemetry/telemetry/page/page_runner.py:399: raise
Per our offline discussion: I think it would be better behavior overall if the specific test was marked as having failed, but that subsequent tests at least attempted to run. This will cause the step on the waterfall to still turn red, but will at least let other tests run instead of completely losing coverage if this browser crash happens.
ernstm@chromium.org changed reviewers: + nednguyen@google.com
Ned has recently tried to remove the retries entirely. He would be the best reviewer for this patch.
https://codereview.chromium.org/694893004/diff/1/tools/telemetry/telemetry/pa...
File tools/telemetry/telemetry/page/page_runner.py (right):

https://codereview.chromium.org/694893004/diff/1/tools/telemetry/telemetry/pa...
tools/telemetry/telemetry/page/page_runner.py:394: try:
try:
  DoX()
except:
  raise

Why not just DoX()?

https://codereview.chromium.org/694893004/diff/1/tools/telemetry/telemetry/pa...
tools/telemetry/telemetry/page/page_runner.py:399: raise
On 2014/10/31 23:44:02, Ken Russell wrote:
> Per our offline discussion: I think it would be better behavior overall if the
> specific test was marked as having failed, but that subsequent tests at least
> attempted to run. This will cause the step on the waterfall to still turn red,
> but will at least let other tests run instead of completely losing coverage if
> this browser crash happens.

Restarting the browser does affect performance characteristics of subsequent page tests. Imagine the scenario where we have an existing test case of 20 pages. Someone adds a new page in the 3rd position that is flaky, making the performance graphs of the 4th through 21st pages more noisy. How would one debug a regression in this case?

I think the right approach for keeping the test coverage of other pages is grouping the pages that cause browser failures into separate telemetry benchmarks.
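For readers less familiar with Python, Ned's point is that a bare `except: raise` with no other handling is a no-op wrapper: the bare raise re-raises the active exception with its original traceback, so the two forms below behave identically (do_x is a hypothetical placeholder).

    def do_x():
      raise ValueError('boom')

    # Form 1: the try/except wrapper adds nothing.
    def wrapped():
      try:
        do_x()
      except:
        raise

    # Form 2: equivalent behavior, less code.
    def unwrapped():
      do_x()

    for fn in (wrapped, unwrapped):
      try:
        fn()
      except ValueError as e:
        print(type(e).__name__, e)  # prints "ValueError boom" both times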
We are now reporting these tests as failed if we don't connect to the browser, but we continue to run the remaining tests.

https://codereview.chromium.org/694893004/diff/1/tools/telemetry/telemetry/pa...
File tools/telemetry/telemetry/page/page_runner.py (right):

https://codereview.chromium.org/694893004/diff/1/tools/telemetry/telemetry/pa...
tools/telemetry/telemetry/page/page_runner.py:394: try:
On 2014/11/03 16:42:54, nednguyen wrote:
> try:
>   DoX()
> except:
>   raise
>
> Why not just DoX()?

Yup, that was stupid. Doesn't matter, this code is gone.

https://codereview.chromium.org/694893004/diff/1/tools/telemetry/telemetry/pa...
tools/telemetry/telemetry/page/page_runner.py:399: raise
On 2014/11/03 16:42:54, nednguyen wrote:
> Restarting the browser does affect performance characteristics of subsequent
> page tests. Imagine the scenario where we have an existing test case of 20
> pages. Someone adds a new page in the 3rd position that is flaky, making the
> performance graphs of the 4th through 21st pages more noisy. How would one
> debug a regression in this case?
>
> I think the right approach for keeping the test coverage of other pages is
> grouping the pages that cause browser failures into separate telemetry
> benchmarks.

The proposed change doesn't change the way the browser is restarted; it makes a test report a failure instead of a pass, while the run continues as it normally would. In this case the browser is not crashing, but instead never starting fully (or never connecting).
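A sketch of the behaviour described here for Patch Set 2: record the connection failure against the page, then continue to the next one. The helper name and results structure are assumptions, not telemetry's actual results interface.

    class BrowserGoneException(Exception):
      pass

    def start_browser_and_run(page):
      # Hypothetical stand-in for launching the browser and running one page.
      if page == 'flaky':
        raise BrowserGoneException('browser never finished starting')

    def run_pages(pages):
      results = []
      for page in pages:
        try:
          start_browser_and_run(page)
          results.append((page, 'pass'))
        except BrowserGoneException:
          # Record a failure instead of a silent pass, but keep going so
          # the remaining pages still get coverage.
          results.append((page, 'fail'))
      return results

    print(run_pages(['good', 'flaky', 'also-good']))
    # [('good', 'pass'), ('flaky', 'fail'), ('also-good', 'pass')]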
On 2014/11/03 16:42:54, nednguyen wrote: > Restarting the browser does affect performance characteristics of subsequent > page test. Imagine the scenario where we have an existing test cases of 20 > pages. Someone adds a new page in the 3rd position that is flaky, making the > performance graph of page 4th to page 21st more noisy. How would one debug > regression in this case? With Hendrik's change here, the test will be reported as failing if the browser crashes while the test is running. Currently the test is reported to pass in this case (!) -- which is wrong behavior by any measure. > I think the right approach for keeping the test coverage of other pages is > grouping the pages that cause browser failures to a separate telemetry > benchmarks. These pages aren't expected to cause browser failures. The browser crash is another bug being separately investigated. The fact that the browser crash results in the test reported as passing is a bug. LGTM
Yes, correct error reporting is the right thing. Can you add a unittest to verify this behavior?

https://codereview.chromium.org/694893004/diff/20001/tools/telemetry/telemetr...
File tools/telemetry/telemetry/page/page_runner.py (right):

https://codereview.chromium.org/694893004/diff/20001/tools/telemetry/telemetr...
tools/telemetry/telemetry/page/page_runner.py:295: raise
How about removing this raise and the try ... except BrowserGoneException block at line 399?
Hi, PTAL

Added a unittest to test this failure, plus an extra unit test for a BrowserGoneException from ValidatePage (as requested by Ned). We no longer throw for the failure, so the except could be removed.

There's still an issue if test.is_multi_tab_test is true, but someone else can handle this later.
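A rough sketch of the shape such a unit test can take, using the standard unittest module. The runner hook and names are assumptions for illustration, not the real page_runner_unittest.py code.

    import unittest

    class BrowserGoneException(Exception):
      pass

    def run_pages(pages, run_one):
      passed, failed = [], []
      for page in pages:
        try:
          run_one(page)
          passed.append(page)
        except BrowserGoneException:
          failed.append(page)
      return passed, failed

    class BrowserGoneReportingTest(unittest.TestCase):
      def testBrowserGoneReportedAsFailure(self):
        def run_one(page):
          if page == 'bad':
            raise BrowserGoneException()
        passed, failed = run_pages(['ok', 'bad', 'ok2'], run_one)
        # The page that hit the exception is reported as failed, and the
        # pages after it still run.
        self.assertEqual(failed, ['bad'])
        self.assertEqual(passed, ['ok', 'ok2'])

    if __name__ == '__main__':
      unittest.main()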
LGTM again -- tests look good.

https://codereview.chromium.org/694893004/diff/40001/tools/telemetry/telemetr...
File tools/telemetry/telemetry/page/page_runner_unittest.py (right):

https://codereview.chromium.org/694893004/diff/40001/tools/telemetry/telemetr...
tools/telemetry/telemetry/page/page_runner_unittest.py:191: def testHandlingOfTestThatRaisesBrowserGoneExceptionFromRestartBrowserBeforeEachPage(self):
That's a mouthful. Does Telemetry respect the 80-column limit? If so, let's choose a shorter name. "testRaisingBrowserGoneException..."?
LGTM, please find better names for the tests, as Ken suggests.
The CQ bit was checked by hendrikw@chromium.org
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/694893004/60001
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: chromium_presubmit on tryserver.chromium.linux (http://build.chromium.org/p/tryserver.chromium.linux/builders/chromium_presub...)
The CQ bit was checked by hendrikw@chromium.org
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/694893004/80001
Message was sent while issue was closed.
Committed patchset #5 (id:80001)
Message was sent while issue was closed.
Patchset 5 (id:??) landed as https://crrev.com/2a25d5052fa4be1c9930775ac792ec53af13a4e5 Cr-Commit-Position: refs/heads/master@{#302577} |