Chromium Code Reviews
chromiumcodereview-hr@appspot.gserviceaccount.com (chromiumcodereview-hr) | Please choose your nickname with Settings | Help | Chromium Project | Gerrit Changes | Sign out
(607)

Issue 393283006: Temporarily increase retry limit in test_launcher (Closed)

Created:
6 years, 5 months ago by Sergey Berezin
Modified:
6 years, 5 months ago
CC:
chromium-reviews, erikwright+watch_chromium.org
Project:
chromium
Visibility:
Public.

Description

Temporarily increase retry limit in test_launcher In bot mode, test_launcher would normally retry tests up to 3 times. This is sufficient to keep large test suites like browser_tests under 1% flakiness rate if the flakiness of individual tests is also within 1%. However, some tests especially in browser_tests are up to 33% flaky, making the browser_tests suite 40% flaky. Subsequent top-level retries effectively double the runtime of all bots running these tests, and we need a solution to shorten the try job runtime. A longer-term solution is the ignorer bot (see the bug), but until it's deployed, a quick hack is to increase the number of retries so statistically we should see about 1% flakiness of browser_tests at the expense of a slight runtime increase. R=sergiyb@chromium.org BUG=390600, 395189 Committed: https://src.chromium.org/viewvc/chrome?view=rev&revision=284650

Patch Set 1 #

Unified diffs Side-by-side diffs Delta from patch set Stats (+3 lines, -1 line) Patch
M base/test/launcher/test_launcher.cc View 1 chunk +3 lines, -1 line 0 comments Download

Messages

Total messages: 21 (0 generated)
Sergey Berezin
PTAL. I'll run a few bots with browser_tests to see how this behaves before trying ...
6 years, 5 months ago (2014-07-17 00:42:59 UTC) #1
Sergiy Byelozyorov
lgtm
6 years, 5 months ago (2014-07-17 07:53:24 UTC) #2
Sergey Berezin
+agable for rubber stamp, please
6 years, 5 months ago (2014-07-17 18:17:54 UTC) #3
agable
On 2014/07/17 18:17:54, Sergey Berezin wrote: > +agable for rubber stamp, please As part of ...
6 years, 5 months ago (2014-07-17 23:37:03 UTC) #4
Sergey Berezin
On 2014/07/17 23:37:03, agable wrote: > On 2014/07/17 18:17:54, Sergey Berezin wrote: > > +agable ...
6 years, 5 months ago (2014-07-18 00:05:22 UTC) #5
agable
On 2014/07/18 00:05:22, Sergey Berezin wrote: > On 2014/07/17 23:37:03, agable wrote: > > On ...
6 years, 5 months ago (2014-07-18 15:28:07 UTC) #6
Sergey Berezin
The bug to revert it is http://crbug.com/395189 On 2014/07/18 15:28:07, agable wrote: > LGTM as ...
6 years, 5 months ago (2014-07-18 17:45:58 UTC) #7
Sergey Berezin
The CQ bit was checked by sergeyberezin@chromium.org
6 years, 5 months ago (2014-07-18 17:46:41 UTC) #8
commit-bot: I haz the power
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/sergeyberezin@chromium.org/393283006/1
6 years, 5 months ago (2014-07-18 17:48:00 UTC) #9
Sergey Berezin
+phajdan.jr for OWNERS stamp.
6 years, 5 months ago (2014-07-18 18:36:51 UTC) #10
commit-bot: I haz the power
FYI, CQ is re-trying this CL (attempt #1). The failing builders are: chromium_presubmit on tryserver.chromium ...
6 years, 5 months ago (2014-07-18 19:49:08 UTC) #11
commit-bot: I haz the power
The CQ bit was unchecked by commit-bot@chromium.org
6 years, 5 months ago (2014-07-18 19:52:59 UTC) #12
commit-bot: I haz the power
Try jobs failed on following builders: chromium_presubmit on tryserver.chromium (http://build.chromium.org/p/tryserver.chromium/builders/chromium_presubmit/builds/80971)
6 years, 5 months ago (2014-07-18 19:53:00 UTC) #13
Sergiy Byelozyorov
The CQ bit was checked by sergiyb@chromium.org
6 years, 5 months ago (2014-07-21 13:19:30 UTC) #14
Sergiy Byelozyorov
The CQ bit was unchecked by sergiyb@chromium.org
6 years, 5 months ago (2014-07-21 13:20:13 UTC) #15
Paweł Hajdan Jr.
LGTM
6 years, 5 months ago (2014-07-22 07:59:47 UTC) #16
Paweł Hajdan Jr.
The CQ bit was checked by phajdan.jr@chromium.org
6 years, 5 months ago (2014-07-22 07:59:53 UTC) #17
commit-bot: I haz the power
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/sergeyberezin@chromium.org/393283006/1
6 years, 5 months ago (2014-07-22 08:00:56 UTC) #18
commit-bot: I haz the power
FYI, CQ is re-trying this CL (attempt #1). The failing builders are: win_gpu_triggered_tests on tryserver.chromium.gpu ...
6 years, 5 months ago (2014-07-22 08:33:08 UTC) #19
commit-bot: I haz the power
Change committed as 284650
6 years, 5 months ago (2014-07-22 09:02:44 UTC) #20
jabdelmalek
6 years, 5 months ago (2014-07-22 22:20:44 UTC) #21
Message was sent while issue was closed.
On 2014/07/18 00:05:22, Sergey Berezin wrote:
> On 2014/07/17 23:37:03, agable wrote:
> > On 2014/07/17 18:17:54, Sergey Berezin wrote:
> > > +agable for rubber stamp, please
> > 
> > As part of the code yellow, we're trying to get rid of retries, not put even
> > more of them in place. I realize that this reduces the likelihood of having
to
> > retry the entire browser_tests binary, but it still feels like a step in the
> > wrong direction. Does this CL have buy-in from phajdan@, or people leading
> > aspects of the code yellow effort?
> > 
> > At the very least, I wouldn't approve this CL unless the number of times a
> given
> > test ran is part of the information reported to the flakiness dashboard. (I
> > don't know if it is or not -- please inform me.)
> > 
> > Either way, this is going to make truly failing tests take a heck of a lot
> > longer to fail :(
> 
> phajdan.jr is OOO till Monday. Regardless, I'm basing this CL on the data
we've
> been collecting for the past few weeks. Some facts:
> 
> 1. If more than 10% of tests fail in a suite / shard, they are not retried,
and
> the entire suite / shard fails.
> 2. Each test normally runs under 2 sec, only a handful are about 10 sec.
> Assuming the worst, we'll retry 10% of tests 12 times, resulting in additional
> 120% runtime increase. That is, a failing browser_test step will take 80-100
> minutes instead of 40 min in the worst case. My preliminary data suggests that
> less than 10% of CLs will fail because of browser_tests, and most of those
> failures will not hit the worst case degradation.
> 3. Currently, browser_tests flakiness is about 40%,

Where is this number from? I haven't seen that before, and I've been looking at
the try bots output for a while. I could see that this number is referring to
the number of green browser_tests runs where a test failed in its first run but
passed in one of the three retries.

> and bots like
> win_chromium_rel are at ~20% flakiness total. This means 40% of good CLs now
> take 2x time retrying the entire suite.

ditto, I've never seen rates that high. I think that's what we would have if
browser_tests never retried, but that isn't the case now.

> This change should bring browser_tests
> flakiness down to ~1%, reducing latency for these 40%. 
> 
> So, it's 40% of good CLs taking 2x time vs. <10% of failing CLs taking
slightly
> longer, and only in the worst case 2x.
> 
> On top of it, the ignorer bot will be online in a week or two, this change
will
> be reverted and we'll have the same benefit without any extra penalty.

Powered by Google App Engine
This is Rietveld 408576698