tools/perf/docs/perf_bot_sheriffing.md - Issue 1770383005: Reduce indentation levels for sheriff docs.

Side by Side Diff: tools/perf/docs/perf_bot_sheriffing.md

Issue 1770383005: Reduce indentation levels for sheriff docs. (Closed) Base URL: https://chromium.googlesource.com/chromium/src.git@master

Patch Set: Commentz Created 4 years, 9 months ago


OLD	NEW
1 # Perf Bot Sheriffing	1 # Perf Bot Sheriffing

2	2

3 The perf bot sheriff is responsible for keeping the bots on the chromium.perf	3 The perf bot sheriff is responsible for keeping the bots on the chromium.perf

4 waterfall up and running, and triaging performance test failures and flakes.	4 waterfall up and running, and triaging performance test failures and flakes.

5	5

6 ## Key Responsibilities	6 ## Key Responsibilities

7	7

8 * [Keeping the chromium.perf waterfall green](#chromiumperf)	8 * [Handle Device and Bot Failures](#botfailures)

9 * [Handling Test Failures](#testfailures)	9 * [Handle Test Failures](#testfailures)

10 * [Handling Device and Bot Failures](#botfailures)	10 * [Follow up on failures](#followup)

11 * [Follow up on failures](#followup)

12	11

13 ###<a name="chromiumperf"></a> Keeping the chromium.perf waterfall green	12 ##<a name="waterfallstate"></a> Understanding the Waterfall State

14

15 The primary responsibility of the perfbot health sheriff is to keep the

16 chromium.perf waterfall green.

17

18 ####<a name="waterfallstate"></a> Understanding the Waterfall State

19	13

20 Everyone can view the chromium.perf waterfall at	14 Everyone can view the chromium.perf waterfall at

21 https://build.chromium.org/p/chromium.perf/, but for Googlers it is recommended	15 https://build.chromium.org/p/chromium.perf/, but for Googlers it is recommended

22 that you use the url **[https://uberchromegw.corp.google.com/i/chromium.perf/]	16 that you use the url **[https://uberchromegw.corp.google.com/i/chromium.perf/]

23 (https://uberchromegw.corp.google.com/i/chromium.perf/)** instead. The reason	17 (https://uberchromegw.corp.google.com/i/chromium.perf/)** instead. The reason

24 for this is that in order to make the performance tests as realistic as	18 for this is that in order to make the performance tests as realistic as

25 possible, the chromium.perf waterfall runs release official builds of Chrome.	19 possible, the chromium.perf waterfall runs release official builds of Chrome.

26 But the logs from release official builds may leak info from our partners that	20 But the logs from release official builds may leak info from our partners that

27 we do not have permission to share outside of Google. So the logs are available	21 we do not have permission to share outside of Google. So the logs are available

28 to Googlers only. To avoid manually rewriting the URL when switching between	22 to Googlers only. To avoid manually rewriting the URL when switching between

29 the upstream and downstream views of the waterfall and bots, you can install the	23 the upstream and downstream views of the waterfall and bots, you can install the

30 [Chromium Waterfall View Switcher extension](https://chrome.google.com/webstore/a/google.com/detail/chromium-waterfall-view-s/hnnplblfkmfaadpjdpkepbkdjhjpjbdp),	24 [Chromium Waterfall View Switcher extension](https://chrome.google.com/webstore/a/google.com/detail/chromium-waterfall-view-s/hnnplblfkmfaadpjdpkepbkdjhjpjbdp),

31 which adds a switching button to Chrome's URL bar.	25 which adds a switching button to Chrome's URL bar.

32	26

33 Note that there are four different views:	27 Note that there are four different views:

34	28

35 1. [Console view](https://uberchromegw.corp.google.com/i/chromium.perf/)	29 1. [Console view](https://uberchromegw.corp.google.com/i/chromium.perf/)

36 makes it easier to see a summary.	30 makes it easier to see a summary.

37 2. [Waterfall view](https://uberchromegw.corp.google.com/i/chromium.perf/waterfall)	31 2. [Waterfall view](https://uberchromegw.corp.google.com/i/chromium.perf/waterfall)

38 shows more details, including recent changes.	32 shows more details, including recent changes.

39 3. [Firefighter](https://chromiumperfstats.appspot.com/) shows traces of	33 3. [Firefighter](https://chromiumperfstats.appspot.com/) shows traces of

40 recent builds. It takes url parameter arguments:	34 recent builds. It takes url parameter arguments:

41 * master can be chromium.perf, tryserver.chromium.perf	35 * master can be chromium.perf, tryserver.chromium.perf

42 * builder can be a builder or tester name, like	36 * builder can be a builder or tester name, like

43 "Android Nexus5 Perf (2)"	37 "Android Nexus5 Perf (2)"

44 * start_time is seconds since the epoch.	38 * start_time is seconds since the epoch.
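
Reviewer aside: the three Firefighter parameters listed above can be combined into a URL programmatically. The following is a minimal sketch, assuming the parameters are passed as an ordinary query string on the root URL; the helper name `firefighter_url` is hypothetical and not part of any Chromium tooling.

```python
from urllib.parse import urlencode

# Base URL taken from the doc; passing the arguments as a plain query
# string on this path is an assumption.
FIREFIGHTER_URL = "https://chromiumperfstats.appspot.com/"


def firefighter_url(master, builder, start_time):
    """Build a Firefighter URL (hypothetical helper).

    master: e.g. "chromium.perf" or "tryserver.chromium.perf".
    builder: a builder or tester name, e.g. "Android Nexus5 Perf (2)".
    start_time: seconds since the epoch.
    """
    query = urlencode({
        "master": master,
        "builder": builder,
        "start_time": int(start_time),
    })
    return FIREFIGHTER_URL + "?" + query


print(firefighter_url("chromium.perf", "Android Nexus5 Perf (2)", 1456790400))
```

`urlencode` takes care of escaping the spaces and parentheses that commonly appear in builder names.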

45	39

46 You can see a list of all previously filed bugs using the	40 You can see a list of all previously filed bugs using the

47 [Performance-BotHealth](https://code.google.com/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth)	41 [Performance-BotHealth](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth)

48 label in crbug.	42 label in crbug.

49	43

50 Please also check the recent	44 Please also check the recent

51 [perf-sheriffs@chromium.org](https://groups.google.com/a/chromium.org/forum/#!forum/perf-sheriffs)	45 [perf-sheriffs@chromium.org](https://groups.google.com/a/chromium.org/forum/#!forum/perf-sheriffs)

52 postings for important announcements about bot turndowns and other known issues.	46 postings for important announcements about bot turndowns and other known issues.

53	47

54 ####<a name="testfailures"></a> Handling Test Failures	48 ##<a name="botfailures"></a> Handle Device and Bot Failures

	49

	50 ###<a name="purplebots"></a> Purple bots

	51

	52 When a bot goes purple, it's usually because of an infrastructure failure

	53 outside of the tests. But you should first check the logs of a purple bot to

	54 try to better understand the problem. Sometimes a telemetry test failure can

	55 turn the bot purple, for example.

	56

	57 If the bot goes purple and you believe it's an infrastructure issue, file a bug

	58 with

	59 [this template](https://bugs.chromium.org/p/chromium/issues/entry?labels=Pri-1,Performance-BotHealth,Infra-Troopers,OS-?&comment=Link+to+buildbot+status+page:&summary=Purple+Bot+on+chromium.perf),

	60 which will automatically add the bug to the trooper queue. Be sure to note

	61 which step is failing, and paste any relevant info from the logs into the bug.
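
Reviewer aside: the template link above is an issue-entry URL with `labels`, `comment`, and `summary` query parameters. A minimal sketch of building such a link from a script follows, assuming the same three parameters; the helper name `purple_bot_bug_url` and the idea of pre-filling the status page link in the comment are illustrative assumptions, not part of the doc.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Entry URL and parameter names are taken from the template link in the doc.
ISSUE_ENTRY_URL = "https://bugs.chromium.org/p/chromium/issues/entry"


def purple_bot_bug_url(status_page_url, os_label="OS-?"):
    """Return a pre-filled bug entry URL for a purple bot (hypothetical helper)."""
    labels = ["Pri-1", "Performance-BotHealth", "Infra-Troopers", os_label]
    params = {
        "labels": ",".join(labels),
        # Appending the status page link here is an assumption; the
        # template itself leaves it for the sheriff to paste in.
        "comment": "Link to buildbot status page: " + status_page_url,
        "summary": "Purple Bot on chromium.perf",
    }
    return ISSUE_ENTRY_URL + "?" + urlencode(params)


url = purple_bot_bug_url("https://build.chromium.org/p/chromium.perf/", "OS-Android")
print(parse_qs(urlsplit(url).query)["labels"])
```

`urlencode` percent-encodes the commas in the label list, which decodes to the same value as the literal commas used in the template.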

	62

	63 ###<a name="devicefailures"></a> Android Device failures

	64

	65 There are two types of device failures:

	66

	67 1. A device is blacklisted in the `device_status_check` step. You can look at

	68 the buildbot status page to see how many devices were listed as online during

	69 this step. You should always see 7 devices online. If you see fewer than 7

	70 devices online, there is a problem in the lab.

	71 2. A device is passing `device_status_check` but still in poor health. The

	72 symptom of this is that all the tests are failing on it. You can see that on

	73 the buildbot status page by looking at the `Device Affinity`. If all tests

	74 with the same device affinity number are failing, it's probably a device

	75 failure.

	76

	77 For both types of failures, please file a bug with [this template](https://bugs.chromium.org/p/chromium/issues/entry?labels=Pri-1,Performance-BotHealth,Infra-Labs,OS-Android&comment=Link+to+buildbot+status+page:&summary=Device+offline+on+chromium.perf)

	78 which will add an issue to the infra labs queue.

	79

	80 If you need help triaging, here are the common labels you should use:

	81

	82 * Performance-BotHealth should go on all bugs you file about the bots;

	83 it's the label we use to track all the issues.

	84 * Infra-Troopers adds the bug to the trooper queue. This is for high

	85 priority issues, like a build breakage. Please add a comment explaining

	86 what you want the trooper to do.

	87 * Infra-Labs adds the bug to the labs queue. If there is a hardware

	88 problem, like an android device not responding or a bot that likely needs

	89 a restart, please use this label. Make sure you set the OS- label

	90 correctly as well, and add a comment explaining what you want the labs

	91 team to do.

	92 * Infra label is appropriate for bugs that are not high priority, but we

	93 need infra team's help to triage. For example, the buildbot status page

	94 UI is weird or we are getting some infra-related log spam. The infra team

	95 works to triage these bugs within 24 hours, so you should ping if you do

	96 not get a response.

	97 * Cr-Tests-Telemetry for telemetry failures.

	98 * Cr-Tests-AutoBisect for bisect and perf try job failures.

	99

	100 If you still need help, ask the speed infra chat, or escalate to sullivan@.

	101

	102 ##<a name="testfailures"></a> Handle Test Failures

55	103

56 You want to keep the waterfall green! So any bot that is red or purple needs to	104 You want to keep the waterfall green! So any bot that is red or purple needs to

57 be investigated. When a test fails:	105 be investigated. When a test fails:

58	106

59 1. File a bug using	107 1. File a bug using

60 [this template](https://code.google.com/p/chromium/issues/entry?labels=Performance-BotHealth,Pri-1,Type-Bug-Regression,OS-?&comment=Revision+range+first+seen:%0ALink+to+failing+step+log:%0A%0A%0AIf%20the%20test%20is%20disabled,%20please%20downgrade%20to%20Pri-2.&summary=%3Ctest%3E+failure+on+chromium.perf+at+%3Crevisionrange%3E).	108 [this template](https://bugs.chromium.org/p/chromium/issues/entry?labels=Performance-BotHealth,Pri-1,Type-Bug-Regression,OS-?&comment=Revision+range+first+seen:%0ALink+to+failing+step+log:%0A%0A%0AIf%20the%20test%20is%20disabled,%20please%20downgrade%20to%20Pri-2.&summary=%3Ctest%3E+failure+on+chromium.perf+at+%3Crevisionrange%3E).

61 You'll want to be sure to include:	109 You'll want to be sure to include:

62 * Link to buildbot status page of failing build.	110 * Link to buildbot status page of failing build.

63 * Copy and paste of relevant failure snippet from the stdio.	111 * Copy and paste of relevant failure snippet from the stdio.

64 * CC the test owner from	112 * CC the test owner from

65 [go/perf-owners](https://docs.google.com/spreadsheets/d/1R_1BAOd3xeVtR0jn6wB5HHJ2K25mIbKp3iIRQKkX38o/edit#gid=0).	113 [go/perf-owners](https://docs.google.com/spreadsheets/d/1R_1BAOd3xeVtR0jn6wB5HHJ2K25mIbKp3iIRQKkX38o/edit#gid=0).

66 * The revision range the test occurred on.	114 * The revision range the test occurred on.

67 * A list of all platforms the test fails on.	115 * A list of all platforms the test fails on.

68	116

69 2. Disable the failing test if it is failing more than one out of five runs.	117 2. Disable the failing test if it is failing more than one out of five runs.

70 (see below for instructions on telemetry and other types of tests). Make sure	118 (see below for instructions on telemetry and other types of tests). Make sure

(...skipping 14 matching lines...)
85 3. Type the Bug ID from step 1, the Good Revision the last commit	133 3. Type the Bug ID from step 1, the Good Revision the last commit

86 pos data was received from, the Bad Revision the last commit pos	134 pos data was received from, the Bad Revision the last commit pos

87 and set Bisect mode to `return_code`.	135 and set Bisect mode to `return_code`.

88 * On Android and Mac, you can view platform-level screenshots of the device	136 * On Android and Mac, you can view platform-level screenshots of the device

89 screen for failing tests, links to which are printed in the logs. Often	137 screen for failing tests, links to which are printed in the logs. Often

90 this will immediately reveal failure causes that are opaque from the logs	138 this will immediately reveal failure causes that are opaque from the logs

91 alone. On other platforms, Devtools will produce tab screenshots as long as	139 alone. On other platforms, Devtools will produce tab screenshots as long as

92 the tab did not crash.	140 the tab did not crash.

93	141

94	142

95 #####<a name="telemetryfailures"></a> Disabling Telemetry Tests	143 ###<a name="telemetryfailures"></a> Disabling Telemetry Tests

96	144

97 If the test is a telemetry test, its name will have a '.' in it, such as	145 If the test is a telemetry test, its name will have a '.' in it, such as

98 thread_times.key_mobile_sites, or page_cycler.top_10. The part before the first	146 `thread_times.key_mobile_sites`, or `page_cycler.top_10`. The part before the

99 dot will be a python file in [tools/perf/benchmarks](	147 first dot will be a python file in [tools/perf/benchmarks](

100 https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/).	148 https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/).

101	149

102 If a telemetry test is failing and there is no clear culprit to revert	150 If a telemetry test is failing and there is no clear culprit to revert

103 immediately, disable the test. You can do this with the `@benchmark.Disabled`	151 immediately, disable the test. You can do this with the `@benchmark.Disabled`

104 decorator. **Always add a comment next to your decorator with the bug id which	152 decorator. **Always add a comment next to your decorator with the bug id which

105 has background on why the test was disabled, and also include a BUG= line in	153 has background on why the test was disabled, and also include a BUG= line in

106 the CL.**	154 the CL.**

107	155

108 Please disable the narrowest set of bots possible; for example, if	156 Please disable the narrowest set of bots possible; for example, if

109 the benchmark only fails on Windows Vista you can use `@benchmark.Disabled('vista')`.	157 the benchmark only fails on Windows Vista you can use `@benchmark.Disabled('vista')`.

(...skipping 12 matching lines...)
122 * `all` (please use as a last resort)	170 * `all` (please use as a last resort)
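
Reviewer aside: the decorator usage described above, including the bug-id comment the doc asks for, might look like the sketch below. It uses a local stand-in `Disabled` decorator so that it runs without a Chromium checkout; the real decorator is telemetry's `@benchmark.Disabled`. The class name, the `_disabled_platforms` attribute, and the bug id are all placeholders.

```python
def Disabled(*platforms):
    """Stand-in for telemetry's benchmark.Disabled (illustration only).

    It records the platform names on the class; how the real decorator
    stores them is not shown in the doc, so this attribute is an assumption.
    """
    def decorator(cls):
        cls._disabled_platforms = set(platforms)
        return cls
    return decorator


# Disable on the narrowest set of platforms and reference the bug inline.
@Disabled('vista')  # crbug.com/000000 (placeholder bug id)
class ExampleBenchmark(object):
    """Hypothetical benchmark that only fails on Windows Vista."""


print(sorted(ExampleBenchmark._disabled_platforms))
```

In a real CL the decorator would be imported from telemetry, and the same bug id would also go on the `BUG=` line of the CL description.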

123	171

124 If the test fails consistently in a very narrow set of circumstances, you may	172 If the test fails consistently in a very narrow set of circumstances, you may

125 consider implementing a ShouldDisable method on the benchmark instead.	173 consider implementing a ShouldDisable method on the benchmark instead.

126 [Here](https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/power.py&q=svelte%20file:%5Esrc/tools/perf/&sq=package:chromium&type=cs&l=72) is	174 [Here](https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/power.py&q=svelte%20file:%5Esrc/tools/perf/&sq=package:chromium&type=cs&l=72) is

127 an example of disabling a benchmark which OOMs on svelte.	175 an example of disabling a benchmark which OOMs on svelte.

128	176

129 Disabling CLs can be TBR-ed to anyone in tools/perf/OWNERS, but please do *not*	177 Disabling CLs can be TBR-ed to anyone in tools/perf/OWNERS, but please do *not*

130 submit with NOTRY=true.	178 submit with NOTRY=true.

131	179

132 #####<a name="otherfailures"></a> Disabling Other Tests	180 ###<a name="otherfailures"></a> Disabling Other Tests

133	181

134 Non-telemetry tests are configured in [chromium.perf.json](https://code.google.com/p/chromium/codesearch#chromium/src/testing/buildbot/chromium.perf.json).	182 Non-telemetry tests are configured in [chromium.perf.json](https://code.google.com/p/chromium/codesearch#chromium/src/testing/buildbot/chromium.perf.json).

135 You can TBR any of the per-file OWNERS, but please do not submit with	183 You can TBR any of the per-file OWNERS, but please do not submit with

136 NOTRY=true.	184 NOTRY=true.

137	185

138 ####<a name="botfailures"></a> Handling Device and Bot Failures	186 ##<a name="followup"></a> Follow up on failures

139	187

140 #####<a name="purplebots"></a> Purple bots	188 [Pri-0 bugs](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-0)

141

142 When a bot goes purple, it's usually because of an infrastructure failure

143 outside of the tests. But you should first check the logs of a purple bot to

144 try to better understand the problem. Sometimes a telemetry test failure can

145 turn the bot purple, for example.

146

147 If the bot goes purple and you believe it's an infrastructure issue, file a bug

148 with

149 [this template](https://code.google.com/p/chromium/issues/entry?labels=Pri-1,Performance-BotHealth,Infra-Troopers,OS-?&comment=Link+to+buildbot+status+page:&summary=Purple+Bot+on+chromium.perf),

150 which will automatically add the bug to the trooper queue. Be sure to note

151 which step is failing, and paste any relevant info from the logs into the bug.

152

153 #####<a name="devicefailures"></a> Android Device failures

154

155 There are two types of device failures:

156

157 1. A device is blacklisted in the `device_status_check` step. You can look at

158 the buildbot status page to see how many devices were listed as online during

159 this step. You should always see 7 devices online. If you see fewer than 7

160 devices online, there is a problem in the lab.

161 2. A device is passing `device_status_check` but still in poor health. The

162 symptom of this is that all the tests are failing on it. You can see that on

163 the buildbot status page by looking at the `Device Affinity`. If all tests

164 with the same device affinity number are failing, it's probably a device

165 failure.

166

167 For both types of failures, please file a bug with [this template](https://code.google.com/p/chromium/issues/entry?labels=Pri-1,Performance-BotHealth,Infra-Labs,OS-Android&comment=Link+to+buildbot+status+page:&summary=Device+offline+on+chromium.perf)

168 which will add an issue to the infra labs queue.

169

170 If you need help triaging, here are the common labels you should use:

171

172 * Performance-BotHealth should go on all bugs you file about the bots;

173 it's the label we use to track all the issues.

174 * Infra-Troopers adds the bug to the trooper queue. This is for high

175 priority issues, like a build breakage. Please add a comment explaining

176 what you want the trooper to do.

177 * Infra-Labs adds the bug to the labs queue. If there is a hardware

178 problem, like an android device not responding or a bot that likely needs

179 a restart, please use this label. Make sure you set the OS- label

180 correctly as well, and add a comment explaining what you want the labs

181 team to do.

182 * Infra label is appropriate for bugs that are not high priority, but we

183 need infra team's help to triage. For example, the buildbot status page

184 UI is weird or we are getting some infra-related log spam. The infra team

185 works to triage these bugs within 24 hours, so you should ping if you do

186 not get a response.

187 * Cr-Tests-Telemetry for telemetry failures.

188 * Cr-Tests-AutoBisect for bisect and perf try job failures.

189

190 If you still need help, ask the speed infra chat, or escalate to sullivan@.

191

192 ####<a name="followup"></a> Follow up on failures

193

194 [Pri-0 bugs](https://code.google.com/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-0)

195 should have an owner or contact on speed infra team and be worked on as top	189 should have an owner or contact on speed infra team and be worked on as top

196 priority. Pri-0 generally implies an entire waterfall is down.	190 priority. Pri-0 generally implies an entire waterfall is down.

197	191

198 [Pri-1 bugs](https://code.google.com/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-1)	192 [Pri-1 bugs](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-1)

199 should be pinged daily, and checked to make sure someone is following up. Pri-1	193 should be pinged daily, and checked to make sure someone is following up. Pri-1

200 bugs are for a red test (not yet disabled), purple bot, or failing device.	194 bugs are for a red test (not yet disabled), purple bot, or failing device.

201	195

202 [Pri-2 bugs](https://code.google.com/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-2)	196 [Pri-2 bugs](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-2)

203 are for disabled tests. These should be pinged weekly, and work towards fixing	197 are for disabled tests. These should be pinged weekly, and work towards fixing

204 should be ongoing when the sheriff is not working on a Pri-1 issue. Here is the	198 should be ongoing when the sheriff is not working on a Pri-1 issue. Here is the

205 [list of Pri-2 bugs that have not been pinged in a week](https://code.google.com/p/chromium/issues/list?can=2&q=label:Performance-BotHealth%20label:Pri-2%20modified-before:today-7&sort=modified)	199 [list of Pri-2 bugs that have not been pinged in a week](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label:Performance-BotHealth%20label:Pri-2%20modified-before:today-7&sort=modified)

206	200

207 <!-- Unresolved issues:	201 <!-- Unresolved issues:

208 1. Do perf sheriffs watch the bisect waterfall?	202 1. Do perf sheriffs watch the bisect waterfall?

209 2. Do perf sheriffs watch the internal clank waterfall?	203 2. Do perf sheriffs watch the internal clank waterfall?

210 -->	204 -->
