Issue 393283006: Temporarily increase retry limit in test_launcher

Issue 393283006: Temporarily increase retry limit in test_launcher (Closed)

Created:
6 years, 5 months ago by Sergey Berezin

Modified:
6 years, 5 months ago

Reviewers:
agable, Sergiy Byelozyorov, Paweł Hajdan Jr.

CC:
chromium-reviews, erikwright+watch_chromium.org

Base URL:
svn://svn.chromium.org/chrome/trunk/src

Project:
chromium

Visibility:
Public.

More Reviews

Description

Temporarily increase retry limit in test_launcher In bot mode, test_launcher would normally retry tests up to 3 times. This is sufficient to keep large test suites like browser_tests under 1% flakiness rate if the flakiness of individual tests is also within 1%. However, some tests especially in browser_tests are up to 33% flaky, making the browser_tests suite 40% flaky. Subsequent top-level retries effectively double the runtime of all bots running these tests, and we need a solution to shorten the try job runtime. A longer-term solution is the ignorer bot (see the bug), but until it's deployed, a quick hack is to increase the number of retries so statistically we should see about 1% flakiness of browser_tests at the expense of a slight runtime increase. R=sergiyb@chromium.org BUG=390600, 395189 Committed: https://src.chromium.org/viewvc/chrome?view=rev&revision=284650

Patch Set 1 #

Created: 6 years, 5 months ago

Download [raw] [tar.bz2]

		Unified diffs	Side-by-side diffs	Delta from patch set	Stats (+3 lines, -1 line)			Patch
	M	base/test/launcher/test_launcher.cc	View		1 chunk	+3 lines, -1 line	0 comments	Download

Messages

Total messages: 21 (0 generated)

Expand Messages | Collapse Messages

Sergey Berezin

PTAL. I'll run a few bots with browser_tests to see how this behaves before trying ...

6 years, 5 months ago (2014-07-17 00:42:59 UTC) #1

agable

On 2014/07/17 18:17:54, Sergey Berezin wrote: > +agable for rubber stamp, please As part of ...

6 years, 5 months ago (2014-07-17 23:37:03 UTC) #4

Sergey Berezin

On 2014/07/17 23:37:03, agable wrote: > On 2014/07/17 18:17:54, Sergey Berezin wrote: > > +agable ...

6 years, 5 months ago (2014-07-18 00:05:22 UTC) #5

On 2014/07/17 23:37:03, agable wrote:
> On 2014/07/17 18:17:54, Sergey Berezin wrote:
> > +agable for rubber stamp, please
> 
> As part of the code yellow, we're trying to get rid of retries, not put even
> more of them in place. I realize that this reduces the likelihood of having to
> retry the entire browser_tests binary, but it still feels like a step in the
> wrong direction. Does this CL have buy-in from phajdan@, or people leading
> aspects of the code yellow effort?
> 
> At the very least, I wouldn't approve this CL unless the number of times a
given
> test ran is part of the information reported to the flakiness dashboard. (I
> don't know if it is or not -- please inform me.)
> 
> Either way, this is going to make truly failing tests take a heck of a lot
> longer to fail :(

phajdan.jr is OOO till Monday. Regardless, I'm basing this CL on the data we've
been collecting for the past few weeks. Some facts:

1. If more than 10% of tests fail in a suite / shard, they are not retried, and
the entire suite / shard fails.
2. Each test normally runs under 2 sec, only a handful are about 10 sec.
Assuming the worst, we'll retry 10% of tests 12 times, resulting in additional
120% runtime increase. That is, a failing browser_test step will take 80-100
minutes instead of 40 min in the worst case. My preliminary data suggests that
less than 10% of CLs will fail because of browser_tests, and most of those
failures will not hit the worst case degradation.
3. Currently, browser_tests flakiness is about 40%, and bots like
win_chromium_rel are at ~20% flakiness total. This means 40% of good CLs now
take 2x time retrying the entire suite. This change should bring browser_tests
flakiness down to ~1%, reducing latency for these 40%. 

So, it's 40% of good CLs taking 2x time vs. <10% of failing CLs taking slightly
longer, and only in the worst case 2x.

On top of it, the ignorer bot will be online in a week or two, this change will
be reverted and we'll have the same benefit without any extra penalty.

agable

On 2014/07/18 00:05:22, Sergey Berezin wrote: > On 2014/07/17 23:37:03, agable wrote: > > On ...

6 years, 5 months ago (2014-07-18 15:28:07 UTC) #6

On 2014/07/18 00:05:22, Sergey Berezin wrote:
> On 2014/07/17 23:37:03, agable wrote:
> > On 2014/07/17 18:17:54, Sergey Berezin wrote:
> > > +agable for rubber stamp, please
> > 
> > As part of the code yellow, we're trying to get rid of retries, not put even
> > more of them in place. I realize that this reduces the likelihood of having
to
> > retry the entire browser_tests binary, but it still feels like a step in the
> > wrong direction. Does this CL have buy-in from phajdan@, or people leading
> > aspects of the code yellow effort?
> > 
> > At the very least, I wouldn't approve this CL unless the number of times a
> given
> > test ran is part of the information reported to the flakiness dashboard. (I
> > don't know if it is or not -- please inform me.)
> > 
> > Either way, this is going to make truly failing tests take a heck of a lot
> > longer to fail :(
> 
> phajdan.jr is OOO till Monday. Regardless, I'm basing this CL on the data
we've
> been collecting for the past few weeks. Some facts:
> 
> 1. If more than 10% of tests fail in a suite / shard, they are not retried,
and
> the entire suite / shard fails.
> 2. Each test normally runs under 2 sec, only a handful are about 10 sec.
> Assuming the worst, we'll retry 10% of tests 12 times, resulting in additional
> 120% runtime increase. That is, a failing browser_test step will take 80-100
> minutes instead of 40 min in the worst case. My preliminary data suggests that
> less than 10% of CLs will fail because of browser_tests, and most of those
> failures will not hit the worst case degradation.
> 3. Currently, browser_tests flakiness is about 40%, and bots like
> win_chromium_rel are at ~20% flakiness total. This means 40% of good CLs now
> take 2x time retrying the entire suite. This change should bring browser_tests
> flakiness down to ~1%, reducing latency for these 40%. 
> 
> So, it's 40% of good CLs taking 2x time vs. <10% of failing CLs taking
slightly
> longer, and only in the worst case 2x.
> 
> On top of it, the ignorer bot will be online in a week or two, this change
will
> be reverted and we'll have the same benefit without any extra penalty.

LGTM as a temporary measure. Please link to a bug specifically to revert this
CL, so that we're tracking the fact that this is temporary with more than just a
TODO.

Sergey Berezin

The bug to revert it is http://crbug.com/395189 On 2014/07/18 15:28:07, agable wrote: > LGTM as ...

6 years, 5 months ago (2014-07-18 17:45:58 UTC) #7

Sergey Berezin

The CQ bit was checked by sergeyberezin@chromium.org

6 years, 5 months ago (2014-07-18 17:46:41 UTC) #8

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/sergeyberezin@chromium.org/393283006/1

6 years, 5 months ago (2014-07-18 17:48:00 UTC) #9

commit-bot: I haz the power

FYI, CQ is re-trying this CL (attempt #1). The failing builders are: chromium_presubmit on tryserver.chromium ...

6 years, 5 months ago (2014-07-18 19:49:08 UTC) #11

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

6 years, 5 months ago (2014-07-18 19:52:59 UTC) #12

commit-bot: I haz the power

Try jobs failed on following builders: chromium_presubmit on tryserver.chromium (http://build.chromium.org/p/tryserver.chromium/builders/chromium_presubmit/builds/80971)

6 years, 5 months ago (2014-07-18 19:53:00 UTC) #13

Sergiy Byelozyorov

The CQ bit was checked by sergiyb@chromium.org

6 years, 5 months ago (2014-07-21 13:19:30 UTC) #14

Sergiy Byelozyorov

The CQ bit was unchecked by sergiyb@chromium.org

6 years, 5 months ago (2014-07-21 13:20:13 UTC) #15

Paweł Hajdan Jr.

The CQ bit was checked by phajdan.jr@chromium.org

6 years, 5 months ago (2014-07-22 07:59:53 UTC) #17

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/sergeyberezin@chromium.org/393283006/1

6 years, 5 months ago (2014-07-22 08:00:56 UTC) #18

commit-bot: I haz the power

FYI, CQ is re-trying this CL (attempt #1). The failing builders are: win_gpu_triggered_tests on tryserver.chromium.gpu ...

6 years, 5 months ago (2014-07-22 08:33:08 UTC) #19

jabdelmalek

6 years, 5 months ago (2014-07-22 22:20:44 UTC) #21

Message was sent while issue was closed.

On 2014/07/18 00:05:22, Sergey Berezin wrote:
> On 2014/07/17 23:37:03, agable wrote:
> > On 2014/07/17 18:17:54, Sergey Berezin wrote:
> > > +agable for rubber stamp, please
> > 
> > As part of the code yellow, we're trying to get rid of retries, not put even
> > more of them in place. I realize that this reduces the likelihood of having
to
> > retry the entire browser_tests binary, but it still feels like a step in the
> > wrong direction. Does this CL have buy-in from phajdan@, or people leading
> > aspects of the code yellow effort?
> > 
> > At the very least, I wouldn't approve this CL unless the number of times a
> given
> > test ran is part of the information reported to the flakiness dashboard. (I
> > don't know if it is or not -- please inform me.)
> > 
> > Either way, this is going to make truly failing tests take a heck of a lot
> > longer to fail :(
> 
> phajdan.jr is OOO till Monday. Regardless, I'm basing this CL on the data
we've
> been collecting for the past few weeks. Some facts:
> 
> 1. If more than 10% of tests fail in a suite / shard, they are not retried,
and
> the entire suite / shard fails.
> 2. Each test normally runs under 2 sec, only a handful are about 10 sec.
> Assuming the worst, we'll retry 10% of tests 12 times, resulting in additional
> 120% runtime increase. That is, a failing browser_test step will take 80-100
> minutes instead of 40 min in the worst case. My preliminary data suggests that
> less than 10% of CLs will fail because of browser_tests, and most of those
> failures will not hit the worst case degradation.
> 3. Currently, browser_tests flakiness is about 40%,

Where is this number from? I haven't seen that before, and I've been looking at
the try bots output for a while. I could see that this number is referring to
the number of green browser_tests runs where a test failed in its first run but
passed in one of the three retries.

> and bots like
> win_chromium_rel are at ~20% flakiness total. This means 40% of good CLs now
> take 2x time retrying the entire suite.

ditto, I've never seen rates that high. I think that's what we would have if
browser_tests never retried, but that isn't the case now.

> This change should bring browser_tests
> flakiness down to ~1%, reducing latency for these 40%. 
> 
> So, it's 40% of good CLs taking 2x time vs. <10% of failing CLs taking
slightly
> longer, and only in the worst case 2x.
> 
> On top of it, the ignorer bot will be online in a week or two, this change
will
> be reverted and we'll have the same benefit without any extra penalty.

Expand Messages | Collapse Messages