tools/perf/docs/perf_bot_sheriffing.md - Issue 1770383005: Reduce indentation levels for sheriff docs. - Code Review

Side by Side Diff: tools/perf/docs/perf_bot_sheriffing.md

Issue 1770383005: Reduce indentation levels for sheriff docs. (Closed) Base URL: https://chromium.googlesource.com/chromium/src.git@master

Patch Set: Created 4 years, 9 months ago

OLD	NEW
1 # Perf Bot Sheriffing	1 # Perf Bot Sheriffing

2	2

3 The perf bot sheriff is responsible for keeping the bots on the chromium.perf	3 The perf bot sheriff is responsible for keeping the bots on the chromium.perf

4 waterfall up and running, and triaging performance test failures and flakes.	4 waterfall up and running, and triaging performance test failures and flakes.

5	5

6 ## Key Responsibilities	6 ## Key Responsibilities

7	7

8 * [Keeping the chromium.perf waterfall green](#chromiumperf)	8 * [Handle Test Failures](#testfailures)

9 * [Handling Test Failures](#testfailures)	9 * [Handle Device and Bot Failures](#botfailures)
	aiolos (Not reviewing) 2016/03/09 17:57:38: Can you move the Device and Bot Failures before the Test Failures?
	dtu 2016/03/09 22:27:44: Done.
10 * [Handling Device and Bot Failures](#botfailures)	10 * [Follow up on failures](#followup)

11 * [Follow up on failures](#followup)

12	11

13 ###<a name="chromiumperf"></a> Keeping the chromium.perf waterfall green	12 ##<a name="waterfallstate"></a> Understanding the Waterfall State

14

15 The primary responsibility of the perfbot health sheriff is to keep the

16 chromium.perf waterfall green.

17

18 ####<a name="waterfallstate"></a> Understanding the Waterfall State

19	13

20 Everyone can view the chromium.perf waterfall at	14 Everyone can view the chromium.perf waterfall at

21 https://build.chromium.org/p/chromium.perf/, but for Googlers it is recommended	15 https://build.chromium.org/p/chromium.perf/, but for Googlers it is recommended

22 that you use the url **[https://uberchromegw.corp.google.com/i/chromium.perf/]	16 that you use the url **[https://uberchromegw.corp.google.com/i/chromium.perf/]

23 (https://uberchromegw.corp.google.com/i/chromium.perf/)** instead. The reason	17 (https://uberchromegw.corp.google.com/i/chromium.perf/)** instead. The reason

24 for this is that in order to make the performance tests as realistic as	18 for this is that in order to make the performance tests as realistic as

25 possible, the chromium.perf waterfall runs release official builds of Chrome.	19 possible, the chromium.perf waterfall runs release official builds of Chrome.

26 But the logs from release official builds may leak info from our partners that	20 But the logs from release official builds may leak info from our partners that

27 we do not have permission to share outside of Google. So the logs are available	21 we do not have permission to share outside of Google. So the logs are available

28 to Googlers only. To avoid manually rewriting the URL when switching between	22 to Googlers only. To avoid manually rewriting the URL when switching between

29 the upstream and downstream views of the waterfall and bots, you can install the	23 the upstream and downstream views of the waterfall and bots, you can install the

30 [Chromium Waterfall View Switcher extension](https://chrome.google.com/webstore/a/google.com/detail/chromium-waterfall-view-s/hnnplblfkmfaadpjdpkepbkdjhjpjbdp),	24 [Chromium Waterfall View Switcher extension](https://chrome.google.com/webstore/a/google.com/detail/chromium-waterfall-view-s/hnnplblfkmfaadpjdpkepbkdjhjpjbdp),

31 which adds a switching button to Chrome's URL bar.	25 which adds a switching button to Chrome's URL bar.

32	26

33 Note that there are four different views:	27 Note that there are four different views:

34	28

35 1. [Console view](https://uberchromegw.corp.google.com/i/chromium.perf/)	29 1. [Console view](https://uberchromegw.corp.google.com/i/chromium.perf/)

36 makes it easier to see a summary.	30 makes it easier to see a summary.

37 2. [Waterfall view](https://uberchromegw.corp.google.com/i/chromium.perf/waterfall)	31 2. [Waterfall view](https://uberchromegw.corp.google.com/i/chromium.perf/waterfall)

38 shows more details, including recent changes.	32 shows more details, including recent changes.

39 3. [Firefighter](https://chromiumperfstats.appspot.com/) shows traces of	33 3. [Firefighter](https://chromiumperfstats.appspot.com/) shows traces of

40 recent builds. It takes url parameter arguments:	34 recent builds. It takes url parameter arguments:

41 * master can be chromium.perf, tryserver.chromium.perf	35 * master can be chromium.perf, tryserver.chromium.perf

42 * builder can be a builder or tester name, like	36 * builder can be a builder or tester name, like

43 "Android Nexus5 Perf (2)"	37 "Android Nexus5 Perf (2)"

44 * start_time is seconds since the epoch.	38 * start_time is seconds since the epoch.

45	39
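Note on the Firefighter parameters above: a minimal sketch of composing such a URL. The base URL and the parameter names `master`, `builder`, and `start_time` come from the doc; passing them as an ordinary query string and the helper name `firefighter_url` are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Base URL as given in the doc. Sending the arguments as a plain
# query string is an assumption, not confirmed by the doc.
FIREFIGHTER_URL = "https://chromiumperfstats.appspot.com/"


def firefighter_url(master, builder, start_time):
    """Return a Firefighter URL filtered by master, builder, and start time.

    start_time is seconds since the epoch, as the doc describes.
    """
    params = {
        "master": master,
        "builder": builder,
        "start_time": int(start_time),
    }
    # urlencode escapes the spaces and parentheses in builder names.
    return FIREFIGHTER_URL + "?" + urlencode(params)


url = firefighter_url("chromium.perf", "Android Nexus5 Perf (2)", 1457500000)
print(url)
```

Encoding matters here because builder names such as "Android Nexus5 Perf (2)" contain spaces and parentheses.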

46 You can see a list of all previously filed bugs using the	40 You can see a list of all previously filed bugs using the

47 [Performance-BotHealth](https://code.google.com/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth)	41 [Performance-BotHealth](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth)

48 label in crbug.	42 label in crbug.

49	43

50 Please also check the recent	44 Please also check the recent

51 [perf-sheriffs@chromium.org](https://groups.google.com/a/chromium.org/forum/#!forum/perf-sheriffs)	45 [perf-sheriffs@chromium.org](https://groups.google.com/a/chromium.org/forum/#!forum/perf-sheriffs)

52 postings for important announcements about bot turndowns and other known issues.	46 postings for important announcements about bot turndowns and other known issues.

53	47

54 ####<a name="testfailures"></a> Handling Test Failures	48 ##<a name="testfailures"></a> Handle Test Failures

55	49

56 You want to keep the waterfall green! So any bot that is red or purple needs to	50 You want to keep the waterfall green! So any bot that is red or purple needs to

57 be investigated. When a test fails:	51 be investigated. When a test fails:

58	52

59 1. File a bug using	53 1. File a bug using

60 [this template](https://code.google.com/p/chromium/issues/entry?labels=Performance-BotHealth,Pri-1,Type-Bug-Regression,OS-?&comment=Revision+range+first+seen:%0ALink+to+failing+step+log:%0A%0A%0AIf%20the%20test%20is%20disabled,%20please%20downgrade%20to%20Pri-2.&summary=%3Ctest%3E+failure+on+chromium.perf+at+%3Crevisionrange%3E).	54 [this template](https://bugs.chromium.org/p/chromium/issues/entry?labels=Performance-BotHealth,Pri-1,Type-Bug-Regression,OS-?&comment=Revision+range+first+seen:%0ALink+to+failing+step+log:%0A%0A%0AIf%20the%20test%20is%20disabled,%20please%20downgrade%20to%20Pri-2.&summary=%3Ctest%3E+failure+on+chromium.perf+at+%3Crevisionrange%3E).

61 You'll want to be sure to include:	55 You'll want to be sure to include:

62 * Link to buildbot status page of failing build.	56 * Link to buildbot status page of failing build.

63 * Copy and paste of relevant failure snippet from the stdio.	57 * Copy and paste of relevant failure snippet from the stdio.

64 * CC the test owner from	58 * CC the test owner from

65 [go/perf-owners](https://docs.google.com/spreadsheets/d/1R_1BAOd3xeVtR0jn6wB5HHJ2K25mIbKp3iIRQKkX38o/edit#gid=0).	59 [go/perf-owners](https://docs.google.com/spreadsheets/d/1R_1BAOd3xeVtR0jn6wB5HHJ2K25mIbKp3iIRQKkX38o/edit#gid=0).

66 * The revision range the test occurred on.	60 * The revision range the test occurred on.

67 * A list of all platforms the test fails on.	61 * A list of all platforms the test fails on.

68	62

69 2. Disable the failing test if it is failing more than one out of five runs.	63 2. Disable the failing test if it is failing more than one out of five runs.

70 (see below for instructions on telemetry and other types of tests). Make sure	64 (see below for instructions on telemetry and other types of tests). Make sure

(...skipping 14 matching lines...)
85 3. Type the Bug ID from step 1, the Good Revision the last commit	79 3. Type the Bug ID from step 1, the Good Revision the last commit

86 pos data was received from, the Bad Revision the last commit pos	80 pos data was received from, the Bad Revision the last commit pos

87 and set Bisect mode to `return_code`.	81 and set Bisect mode to `return_code`.

88 * On Android and Mac, you can view platform-level screenshots of the device	82 * On Android and Mac, you can view platform-level screenshots of the device

89 screen for failing tests, links to which are printed in the logs. Often	83 screen for failing tests, links to which are printed in the logs. Often

90 this will immediately reveal failure causes that are opaque from the logs	84 this will immediately reveal failure causes that are opaque from the logs

91 alone. On other platforms, Devtools will produce tab screenshots as long as	85 alone. On other platforms, Devtools will produce tab screenshots as long as

92 the tab did not crash.	86 the tab did not crash.

93	87

94	88

95 #####<a name="telemetryfailures"></a> Disabling Telemetry Tests	89 ###<a name="telemetryfailures"></a> Disabling Telemetry Tests

96	90

97 If the test is a telemetry test, its name will have a '.' in it, such as	91 If the test is a telemetry test, its name will have a '.' in it, such as

98 thread_times.key_mobile_sites, or page_cycler.top_10. The part before the first	92 thread\_times.key\_mobile\_sites, or page\_cycler.top\_10. The part before the first
	sullivan 2016/03/09 14:38:25: We could also consider backticks instead of backslashes: `thread_times.key_mobile_sites`, or `page_cycler.top_10`
	dtu 2016/03/09 22:27:44: Done.
99 dot will be a python file in [tools/perf/benchmarks](	93 dot will be a python file in [tools/perf/benchmarks](

100 https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/).	94 https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/).

101	95
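Note on the naming rule above: a small sketch of mapping a telemetry test name to its benchmark file. The `tools/perf/benchmarks` directory and the "part before the first dot" rule come from the doc; the `.py` suffix follows from "a python file", and the helper name `benchmark_file` is hypothetical.

```python
BENCHMARKS_DIR = "tools/perf/benchmarks"


def benchmark_file(test_name):
    """Return the benchmark file path for a telemetry test name.

    Telemetry test names contain a '.'; the part before the first dot
    names the Python file. Names without a dot are rejected because,
    per the doc, they are not telemetry tests.
    """
    prefix, dot, _ = test_name.partition(".")
    if not dot or not prefix:
        raise ValueError("not a telemetry test name: %r" % test_name)
    return "%s/%s.py" % (BENCHMARKS_DIR, prefix)


print(benchmark_file("thread_times.key_mobile_sites"))
print(benchmark_file("page_cycler.top_10"))
```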

102 If a telemetry test is failing and there is no clear culprit to revert	96 If a telemetry test is failing and there is no clear culprit to revert

103 immediately, disable the test. You can do this with the `@benchmark.Disabled`	97 immediately, disable the test. You can do this with the `@benchmark.Disabled`

104 decorator. **Always add a comment next to your decorator with the bug id which	98 decorator. **Always add a comment next to your decorator with the bug id which

105 has background on why the test was disabled, and also include a BUG= line in	99 has background on why the test was disabled, and also include a BUG= line in

106 the CL.**	100 the CL.**

107	101

108 Please disable the narrowest set of bots possible; for example, if	102 Please disable the narrowest set of bots possible; for example, if

(...skipping 13 matching lines...)
122 * `all` (please use as a last resort)	116 * `all` (please use as a last resort)

123	117

124 If the test fails consistently in a very narrow set of circumstances, you may	118 If the test fails consistently in a very narrow set of circumstances, you may

125 consider implementing a ShouldDisable method on the benchmark instead.	119 consider implementing a ShouldDisable method on the benchmark instead.

126 [Here](https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/power.py&q=svelte%20file:%5Esrc/tools/perf/&sq=package:chromium&type=cs&l=72) is	120 [Here](https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/power.py&q=svelte%20file:%5Esrc/tools/perf/&sq=package:chromium&type=cs&l=72) is

127 and example of disabling a benchmark which OOMs on svelte.	121 and example of disabling a benchmark which OOMs on svelte.

128	122

129 Disabling CLs can be TBR-ed to anyone in tools/perf/OWNERS, but please do *not *	123 Disabling CLs can be TBR-ed to anyone in tools/perf/OWNERS, but please do *not *

130 submit with NOTRY=true.	124 submit with NOTRY=true.

131	125

132 #####<a name="otherfailures"></a> Disabling Other Tests	126 ###<a name="otherfailures"></a> Disabling Other Tests

133	127

134 Non-telemetry tests are configured in [chromium.perf.json](https://code.google.com/p/chromium/codesearch#chromium/src/testing/buildbot/chromium.perf.json).	128 Non-telemetry tests are configured in [chromium.perf.json](https://code.google.com/p/chromium/codesearch#chromium/src/testing/buildbot/chromium.perf.json).

135 You can TBR any of the per-file OWNERS, but please do not submit with	129 You can TBR any of the per-file OWNERS, but please do not submit with

136 NOTRY=true.	130 NOTRY=true.

137	131

138 ####<a name="botfailures"></a> Handling Device and Bot Failures	132 ##<a name="botfailures"></a> Handle Device and Bot Failures

139	133

140 #####<a name="purplebots"></a> Purple bots	134 ###<a name="purplebots"></a> Purple bots

141	135

142 When a bot goes purple, it's it's usually because of an infrastructure failure	136 When a bot goes purple, it's it's usually because of an infrastructure failure

143 outside of the tests. But you should first check the logs of a purple bot to	137 outside of the tests. But you should first check the logs of a purple bot to

144 try to better understand the problem. Sometimes a telemetry test failure can	138 try to better understand the problem. Sometimes a telemetry test failure can

145 turn the bot purple, for example.	139 turn the bot purple, for example.

146	140

147 If the bot goes purple and you believe it's an infrastructure issue, file a bug	141 If the bot goes purple and you believe it's an infrastructure issue, file a bug

148 with	142 with

149 [this template](https://code.google.com/p/chromium/issues/entry?labels=Pri-1,Performance-BotHealth,Infra-Troopers,OS-?&comment=Link+to+buildbot+status+page:&summary=Purple+Bot+on+chromium.perf),	143 [this template](https://bugs.chromium.org/p/chromium/issues/entry?labels=Pri-1,Performance-BotHealth,Infra-Troopers,OS-?&comment=Link+to+buildbot+status+page:&summary=Purple+Bot+on+chromium.perf),

150 which will automatically add the bug to the trooper queue. Be sure to note	144 which will automatically add the bug to the trooper queue. Be sure to note

151 which step is failing, and paste any relevant info from the logs into the bug.	145 which step is failing, and paste any relevant info from the logs into the bug.

152	146

153 #####<a name="devicefailures"></a> Android Device failures	147 ###<a name="devicefailures"></a> Android Device failures

154	148

155 There are two types of device failures:	149 There are two types of device failures:

156	150

157 1. A device is blacklisted in the `device_status_check` step. You can look at	151 1. A device is blacklisted in the `device_status_check` step. You can look at

158 the buildbot status page to see how many devices were listed as online during	152 the buildbot status page to see how many devices were listed as online during

159 this step. You should always see 7 devices online. If you see fewer than 7	153 this step. You should always see 7 devices online. If you see fewer than 7

160 devices online, there is a problem in the lab.	154 devices online, there is a problem in the lab.

161 2. A device is passing `device_status_check` but still in poor health. The	155 2. A device is passing `device_status_check` but still in poor health. The

162 symptom of this is that all the tests are failing on it. You can see that on	156 symptom of this is that all the tests are failing on it. You can see that on

163 the buildbot status page by looking at the `Device Affinity`. If all tests	157 the buildbot status page by looking at the `Device Affinity`. If all tests

164 with the same device affinity number are failing, it's probably a device	158 with the same device affinity number are failing, it's probably a device

165 failure.	159 failure.

166	160
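Note on the two failure types above: a rough sketch of the triage logic, assuming the sheriff has already read the online-device count and per-affinity pass/fail results off the buildbot status page. The function name, its inputs, and the message strings are hypothetical; only the threshold of 7 devices and the "all tests with the same device affinity fail" rule come from the doc.

```python
EXPECTED_DEVICES_ONLINE = 7  # the doc says 7 devices should always be online


def find_device_problems(devices_online, results_by_affinity):
    """Return a list of human-readable device problems.

    devices_online: count reported by the device_status_check step.
    results_by_affinity: dict mapping a device affinity number to a list
        of booleans, one per test (True means the test passed).
    """
    problems = []
    # Type 1: a device was blacklisted, so fewer devices are online.
    if devices_online < EXPECTED_DEVICES_ONLINE:
        problems.append(
            "lab problem: %d of %d devices online"
            % (devices_online, EXPECTED_DEVICES_ONLINE)
        )
    # Type 2: a device passes the status check but every test on it fails.
    for affinity, results in sorted(results_by_affinity.items()):
        if results and not any(results):
            problems.append("device affinity %d: all tests failing" % affinity)
    return problems


print(find_device_problems(6, {0: [True, False], 1: [False, False]}))
```

An affinity with mixed results is not flagged, since the doc treats only an all-failing affinity as a probable device failure.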

167 For both types of failures, please file a bug with [this template](https://code.google.com/p/chromium/issues/entry?labels=Pri-1,Performance-BotHealth,Infra-Labs,OS-Android&comment=Link+to+buildbot+status+page:&summary=Device+offline+on+chromium.perf)	161 For both types of failures, please file a bug with [this template](https://bugs.chromium.org/p/chromium/issues/entry?labels=Pri-1,Performance-BotHealth,Infra-Labs,OS-Android&comment=Link+to+buildbot+status+page:&summary=Device+offline+on+chromium.perf)

168 which will add an issue to the infra labs queue.	162 which will add an issue to the infra labs queue.

169	163

170 If you need help triaging, here are the common labels you should use:	164 If you need help triaging, here are the common labels you should use:

171	165

172 * Performance-BotHealth should go on all bugs you file about the bots;	166 * Performance-BotHealth should go on all bugs you file about the bots;

173 it's the label we use to track all the issues.	167 it's the label we use to track all the issues.

174 * Infra-Troopers adds the bug to the trooper queue. This is for high	168 * Infra-Troopers adds the bug to the trooper queue. This is for high

175 priority issues, like a build breakage. Please add a comment explaining	169 priority issues, like a build breakage. Please add a comment explaining

176 what you want the trooper to do.	170 what you want the trooper to do.

177 * Infra-Labs adds the bug to the labs queue. If there is a hardware	171 * Infra-Labs adds the bug to the labs queue. If there is a hardware

178 problem, like an android device not responding or a bot that likely needs	172 problem, like an android device not responding or a bot that likely needs

179 a restart, please use this label. Make sure you set the OS- label	173 a restart, please use this label. Make sure you set the OS- label

180 correctly as well, and add a comment explaining what you want the labs	174 correctly as well, and add a comment explaining what you want the labs

181 team to do.	175 team to do.

182 * Infra label is appropriate for bugs that are not high priority, but we	176 * Infra label is appropriate for bugs that are not high priority, but we

183 need infra team's help to triage. For example, the buildbot status page	177 need infra team's help to triage. For example, the buildbot status page

184 UI is weird or we are getting some infra-related log spam. The infra team	178 UI is weird or we are getting some infra-related log spam. The infra team

185 works to triage these bugs within 24 hours, so you should ping if you do	179 works to triage these bugs within 24 hours, so you should ping if you do

186 not get a response.	180 not get a response.

187 * Cr-Tests-Telemetry for telemetry failures.	181 * Cr-Tests-Telemetry for telemetry failures.

188 * Cr-Tests-AutoBisect for bisect and perf try job failures.	182 * Cr-Tests-AutoBisect for bisect and perf try job failures.

189	183

190 If you still need help, ask the speed infra chat, or escalate to sullivan@.	184 If you still need help, ask the speed infra chat, or escalate to sullivan@.

191	185

192 ####<a name="followup"></a> Follow up on failures	186 ##<a name="followup"></a> Follow up on failures

193	187

194 [Pri-0 bugs](https://code.google.com/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-0)	188 [Pri-0 bugs](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-0)

195 should have an owner or contact on speed infra team and be worked on as top	189 should have an owner or contact on speed infra team and be worked on as top

196 priority. Pri-0 generally implies an entire waterfall is down.	190 priority. Pri-0 generally implies an entire waterfall is down.

197	191

198 [Pri-1 bugs](https://code.google.com/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-1)	192 [Pri-1 bugs](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-1)

199 should be pinged daily, and checked to make sure someone is following up. Pri-1	193 should be pinged daily, and checked to make sure someone is following up. Pri-1

200 bugs are for a red test (not yet disabled), purple bot, or failing device.	194 bugs are for a red test (not yet disabled), purple bot, or failing device.

201	195

202 [Pri-2 bugs](https://code.google.com/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-2)	196 [Pri-2 bugs](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-2)

203 are for disabled tests. These should be pinged weekly, and work towards fixing	197 are for disabled tests. These should be pinged weekly, and work towards fixing

204 should be ongoing when the sheriff is not working on a Pri-1 issue. Here is the	198 should be ongoing when the sheriff is not working on a Pri-1 issue. Here is the

205 [list of Pri-2 bugs that have not been pinged in a week](https://code.google.com/p/chromium/issues/list?can=2&q=label:Performance-BotHealth%20label:Pri-2%20modified-before:today-7&sort=modified)	199 [list of Pri-2 bugs that have not been pinged in a week](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label:Performance-BotHealth%20label:Pri-2%20modified-before:today-7&sort=modified)

206	200
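Note on the follow-up queries above: a sketch of generating the "not pinged recently" issue-list URL for a given priority and age. The host, path, and the `can`, `q`, and `sort` parameters are taken from the link in the NEW column; the helper name `stale_bugs_url` and its arguments are hypothetical.

```python
from urllib.parse import quote, urlencode

ISSUE_LIST_URL = "https://bugs.chromium.org/p/chromium/issues/list"


def stale_bugs_url(priority=2, days=7):
    """Return the issue-list URL for BotHealth bugs untouched for `days` days."""
    query = "label:Performance-BotHealth label:Pri-%d modified-before:today-%d" % (
        priority,
        days,
    )
    params = {"can": 2, "q": query, "sort": "modified"}
    # quote (not quote_plus) encodes spaces as %20; ':' is kept literal
    # so the result matches the form of the link in the doc.
    return ISSUE_LIST_URL + "?" + urlencode(params, quote_via=quote, safe=":")


print(stale_bugs_url())
```

With the defaults this reproduces the weekly Pri-2 link; `stale_bugs_url(priority=1, days=1)` would give the analogous query for the daily Pri-1 pings.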

207 <!-- Unresolved issues:	201 <!-- Unresolved issues:

208 1. Do perf sheriffs watch the bisect waterfall?	202 1. Do perf sheriffs watch the bisect waterfall?

209 2. Do perf sheriffs watch the internal clank waterfall?	203 2. Do perf sheriffs watch the internal clank waterfall?

210 -->	204 -->
