Chromium Code Reviews
Side by Side Diff: tools/perf/docs/perf_bot_sheriffing.md

Issue 1770383005: Reduce indentation levels for sheriff docs. (Closed) Base URL: https://chromium.googlesource.com/chromium/src.git@master
Patch Set: Created 4 years, 9 months ago
OLD NEW
1 # Perf Bot Sheriffing 1 # Perf Bot Sheriffing
2 2
3 The perf bot sheriff is responsible for keeping the bots on the chromium.perf 3 The perf bot sheriff is responsible for keeping the bots on the chromium.perf
4 waterfall up and running, and triaging performance test failures and flakes. 4 waterfall up and running, and triaging performance test failures and flakes.
5 5
6 ## Key Responsibilities 6 ## Key Responsibilities
7 7
8 * [Keeping the chromium.perf waterfall green](#chromiumperf) 8 * [Handle Test Failures](#testfailures)
9 * [Handling Test Failures](#testfailures) 9 * [Handle Device and Bot Failures](#botfailures)
aiolos (Not reviewing) 2016/03/09 17:57:38 Can you move the Device and Bot Failures before th
dtu 2016/03/09 22:27:44 Done.
10 * [Handling Device and Bot Failures](#botfailures) 10 * [Follow up on failures](#followup)
11 * [Follow up on failures](#followup)
12 11
13 ###<a name="chromiumperf"></a> Keeping the chromium.perf waterfall green 12 ##<a name="waterfallstate"></a> Understanding the Waterfall State
14
15 The primary responsibility of the perfbot health sheriff is to keep the
16 chromium.perf waterfall green.
17
18 ####<a name="waterfallstate"></a> Understanding the Waterfall State
19 13
20 Everyone can view the chromium.perf waterfall at 14 Everyone can view the chromium.perf waterfall at
21 https://build.chromium.org/p/chromium.perf/, but for Googlers it is recommended 15 https://build.chromium.org/p/chromium.perf/, but for Googlers it is recommended
22 that you use the url **[https://uberchromegw.corp.google.com/i/chromium.perf/] 16 that you use the url **[https://uberchromegw.corp.google.com/i/chromium.perf/]
23 (https://uberchromegw.corp.google.com/i/chromium.perf/)** instead. The reason 17 (https://uberchromegw.corp.google.com/i/chromium.perf/)** instead. The reason
24 for this is that in order to make the performance tests as realistic as 18 for this is that in order to make the performance tests as realistic as
25 possible, the chromium.perf waterfall runs release official builds of Chrome. 19 possible, the chromium.perf waterfall runs release official builds of Chrome.
26 But the logs from release official builds may leak info from our partners that 20 But the logs from release official builds may leak info from our partners that
27 we do not have permission to share outside of Google. So the logs are available 21 we do not have permission to share outside of Google. So the logs are available
28 to Googlers only. To avoid manually rewriting the URL when switching between 22 to Googlers only. To avoid manually rewriting the URL when switching between
29 the upstream and downstream views of the waterfall and bots, you can install the 23 the upstream and downstream views of the waterfall and bots, you can install the
30 [Chromium Waterfall View Switcher extension](https://chrome.google.com/webstore/a/google.com/detail/chromium-waterfall-view-s/hnnplblfkmfaadpjdpkepbkdjhjpjbdp), 24 [Chromium Waterfall View Switcher extension](https://chrome.google.com/webstore/a/google.com/detail/chromium-waterfall-view-s/hnnplblfkmfaadpjdpkepbkdjhjpjbdp),
31 which adds a switching button to Chrome's URL bar. 25 which adds a switching button to Chrome's URL bar.
32 26
33 Note that there are four different views: 27 Note that there are four different views:
34 28
35 1. [Console view](https://uberchromegw.corp.google.com/i/chromium.perf/) 29 1. [Console view](https://uberchromegw.corp.google.com/i/chromium.perf/)
36 makes it easier to see a summary. 30 makes it easier to see a summary.
37 2. [Waterfall view](https://uberchromegw.corp.google.com/i/chromium.perf/waterfall) 31 2. [Waterfall view](https://uberchromegw.corp.google.com/i/chromium.perf/waterfall)
38 shows more details, including recent changes. 32 shows more details, including recent changes.
39 3. [Firefighter](https://chromiumperfstats.appspot.com/) shows traces of 33 3. [Firefighter](https://chromiumperfstats.appspot.com/) shows traces of
40 recent builds. It takes url parameter arguments: 34 recent builds. It takes url parameter arguments:
41 * **master** can be chromium.perf, tryserver.chromium.perf 35 * **master** can be chromium.perf, tryserver.chromium.perf
42 * **builder** can be a builder or tester name, like 36 * **builder** can be a builder or tester name, like
43 "Android Nexus5 Perf (2)" 37 "Android Nexus5 Perf (2)"
44 * **start_time** is seconds since the epoch. 38 * **start_time** is seconds since the epoch.
45 39
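The Firefighter parameters listed above can be assembled into a URL programmatically. The following is a minimal sketch, assuming the three parameters are passed as ordinary query-string arguments; the helper name `firefighter_url` is hypothetical and not part of any Chromium tool. The main detail it illustrates is converting a UTC date into the seconds-since-epoch value that `start_time` expects.

```python
import calendar
import datetime
from urllib.parse import urlencode

FIREFIGHTER_URL = "https://chromiumperfstats.appspot.com/"


def firefighter_url(master, builder, start):
    """Build a Firefighter URL (hypothetical helper).

    `start` is a naive datetime interpreted as UTC. Assumes the parameters
    are plain query-string arguments.
    """
    params = {
        "master": master,    # e.g. chromium.perf or tryserver.chromium.perf
        "builder": builder,  # a builder or tester name
        # start_time is seconds since the epoch.
        "start_time": calendar.timegm(start.utctimetuple()),
    }
    # urlencode escapes the spaces and parentheses in builder names.
    return FIREFIGHTER_URL + "?" + urlencode(params)


print(firefighter_url("chromium.perf", "Android Nexus5 Perf (2)",
                      datetime.datetime(2016, 3, 9)))
```

Using `calendar.timegm` rather than `time.mktime` keeps the result independent of the local time zone of the machine running the script.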
46 You can see a list of all previously filed bugs using the 40 You can see a list of all previously filed bugs using the
47 **[Performance-BotHealth](https://code.google.com/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth)** 41 **[Performance-BotHealth](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth)**
48 label in crbug. 42 label in crbug.
49 43
50 Please also check the recent 44 Please also check the recent
51 **[perf-sheriffs@chromium.org](https://groups.google.com/a/chromium.org/forum/#!forum/perf-sheriffs)** 45 **[perf-sheriffs@chromium.org](https://groups.google.com/a/chromium.org/forum/#!forum/perf-sheriffs)**
52 postings for important announcements about bot turndowns and other known issues. 46 postings for important announcements about bot turndowns and other known issues.
53 47
54 ####<a name="testfailures"></a> Handling Test Failures 48 ##<a name="testfailures"></a> Handle Test Failures
55 49
56 You want to keep the waterfall green! So any bot that is red or purple needs to 50 You want to keep the waterfall green! So any bot that is red or purple needs to
57 be investigated. When a test fails: 51 be investigated. When a test fails:
58 52
59 1. File a bug using 53 1. File a bug using
60 [this template](https://code.google.com/p/chromium/issues/entry?labels=Performance-BotHealth,Pri-1,Type-Bug-Regression,OS-?&comment=Revision+range+first+seen:%0ALink+to+failing+step+log:%0A%0A%0AIf%20the%20test%20is%20disabled,%20please%20downgrade%20to%20Pri-2.&summary=%3Ctest%3E+failure+on+chromium.perf+at+%3Crevisionrange%3E). 54 [this template](https://bugs.chromium.org/p/chromium/issues/entry?labels=Performance-BotHealth,Pri-1,Type-Bug-Regression,OS-?&comment=Revision+range+first+seen:%0ALink+to+failing+step+log:%0A%0A%0AIf%20the%20test%20is%20disabled,%20please%20downgrade%20to%20Pri-2.&summary=%3Ctest%3E+failure+on+chromium.perf+at+%3Crevisionrange%3E).
61 You'll want to be sure to include: 55 You'll want to be sure to include:
62 * Link to buildbot status page of failing build. 56 * Link to buildbot status page of failing build.
63 * Copy and paste of relevant failure snippet from the stdio. 57 * Copy and paste of relevant failure snippet from the stdio.
64 * CC the test owner from 58 * CC the test owner from
65 [go/perf-owners](https://docs.google.com/spreadsheets/d/1R_1BAOd3xeVtR0jn6wB5HHJ2K25mIbKp3iIRQKkX38o/edit#gid=0). 59 [go/perf-owners](https://docs.google.com/spreadsheets/d/1R_1BAOd3xeVtR0jn6wB5HHJ2K25mIbKp3iIRQKkX38o/edit#gid=0).
66 * The revision range the test occurred on. 60 * The revision range the test occurred on.
67 * A list of all platforms the test fails on. 61 * A list of all platforms the test fails on.
68 62
69 2. Disable the failing test if it is failing more than one out of five runs. 63 2. Disable the failing test if it is failing more than one out of five runs.
70 (see below for instructions on telemetry and other types of tests). Make sure 64 (see below for instructions on telemetry and other types of tests). Make sure
(...skipping 14 matching lines...)
85 3. Type the **Bug ID** from step 1, the **Good Revision** the last commit 79 3. Type the **Bug ID** from step 1, the **Good Revision** the last commit
86 pos data was received from, the **Bad Revision** the last commit pos 80 pos data was received from, the **Bad Revision** the last commit pos
87 and set **Bisect mode** to `return_code`. 81 and set **Bisect mode** to `return_code`.
88 * On Android and Mac, you can view platform-level screenshots of the device 82 * On Android and Mac, you can view platform-level screenshots of the device
89 screen for failing tests, links to which are printed in the logs. Often 83 screen for failing tests, links to which are printed in the logs. Often
90 this will immediately reveal failure causes that are opaque from the logs 84 this will immediately reveal failure causes that are opaque from the logs
91 alone. On other platforms, Devtools will produce tab screenshots as long as 85 alone. On other platforms, Devtools will produce tab screenshots as long as
92 the tab did not crash. 86 the tab did not crash.
93 87
94 88
95 #####<a name="telemetryfailures"></a> Disabling Telemetry Tests 89 ###<a name="telemetryfailures"></a> Disabling Telemetry Tests
96 90
97 If the test is a telemetry test, its name will have a '.' in it, such as 91 If the test is a telemetry test, its name will have a '.' in it, such as
98 thread_times.key_mobile_sites, or page_cycler.top_10. The part before the first 92 thread\_times.key\_mobile\_sites, or page\_cycler.top\_10. The part before the first
sullivan 2016/03/09 14:38:25 We could also consider backticks instead of backsl
dtu 2016/03/09 22:27:44 Done.
99 dot will be a python file in [tools/perf/benchmarks]( 93 dot will be a python file in [tools/perf/benchmarks](
100 https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/). 94 https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/).
101 95
102 If a telemetry test is failing and there is no clear culprit to revert 96 If a telemetry test is failing and there is no clear culprit to revert
103 immediately, disable the test. You can do this with the `@benchmark.Disabled` 97 immediately, disable the test. You can do this with the `@benchmark.Disabled`
104 decorator. **Always add a comment next to your decorator with the bug id which 98 decorator. **Always add a comment next to your decorator with the bug id which
105 has background on why the test was disabled, and also include a BUG= line in 99 has background on why the test was disabled, and also include a BUG= line in
106 the CL.** 100 the CL.**
107 101
108 Please disable the narrowest set of bots possible; for example, if 102 Please disable the narrowest set of bots possible; for example, if
(...skipping 13 matching lines...)
122 * `all` (please use as a last resort) 116 * `all` (please use as a last resort)
123 117
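As an illustration of the disabling guidance above, the sketch below shows the shape of a disabled benchmark with the bug reference kept next to the decorator. To stay runnable on its own it defines a small stand-in for telemetry's `benchmark` module; in the Chromium tree you would use `from telemetry import benchmark` instead. The class name, the `disabled_on` attribute, and the bug number are placeholders, and `'android'` is assumed to be one of the accepted platform strings.

```python
import types


def _disabled(*platforms):
    """Stand-in for telemetry's Disabled decorator (assumption: it only
    records the platform strings here; the real one does more)."""
    def decorator(cls):
        cls.disabled_on = frozenset(platforms)  # hypothetical attribute
        return cls
    return decorator


# Stand-in namespace so that `@benchmark.Disabled(...)` reads as in the docs.
benchmark = types.SimpleNamespace(Disabled=_disabled)


# Disable on the narrowest set of bots, and keep the bug id beside the
# decorator. NNNNNN is a placeholder, not a real bug number.
@benchmark.Disabled('android')  # crbug.com/NNNNNN
class ExampleBenchmark(object):
    """Hypothetical benchmark used only for this illustration."""


print(sorted(ExampleBenchmark.disabled_on))
```

The CL description for such a change would also carry the matching `BUG=` line, as the text above requires.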
124 If the test fails consistently in a very narrow set of circumstances, you may 118 If the test fails consistently in a very narrow set of circumstances, you may
125 consider implementing a ShouldDisable method on the benchmark instead. 119 consider implementing a ShouldDisable method on the benchmark instead.
126 [Here](https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/power.py&q=svelte%20file:%5Esrc/tools/perf/&sq=package:chromium&type=cs&l=72) is 120 [Here](https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/power.py&q=svelte%20file:%5Esrc/tools/perf/&sq=package:chromium&type=cs&l=72) is
127 an example of disabling a benchmark which OOMs on svelte. 121 an example of disabling a benchmark which OOMs on svelte.
128 122
129 Disabling CLs can be TBR-ed to anyone in tools/perf/OWNERS, but please do **not** 123 Disabling CLs can be TBR-ed to anyone in tools/perf/OWNERS, but please do **not**
130 submit with NOTRY=true. 124 submit with NOTRY=true.
131 125
132 #####<a name="otherfailures"></a> Disabling Other Tests 126 ###<a name="otherfailures"></a> Disabling Other Tests
133 127
134 Non-telemetry tests are configured in [chromium.perf.json](https://code.google.com/p/chromium/codesearch#chromium/src/testing/buildbot/chromium.perf.json). 128 Non-telemetry tests are configured in [chromium.perf.json](https://code.google.com/p/chromium/codesearch#chromium/src/testing/buildbot/chromium.perf.json).
135 You can TBR any of the per-file OWNERS, but please do **not** submit with 129 You can TBR any of the per-file OWNERS, but please do **not** submit with
136 NOTRY=true. 130 NOTRY=true.
137 131
138 ####<a name="botfailures"></a> Handling Device and Bot Failures 132 ##<a name="botfailures"></a> Handle Device and Bot Failures
139 133
140 #####<a name="purplebots"></a> Purple bots 134 ###<a name="purplebots"></a> Purple bots
141 135
142 When a bot goes purple, it's usually because of an infrastructure failure 136 When a bot goes purple, it's usually because of an infrastructure failure
143 outside of the tests. But you should first check the logs of a purple bot to 137 outside of the tests. But you should first check the logs of a purple bot to
144 try to better understand the problem. Sometimes a telemetry test failure can 138 try to better understand the problem. Sometimes a telemetry test failure can
145 turn the bot purple, for example. 139 turn the bot purple, for example.
146 140
147 If the bot goes purple and you believe it's an infrastructure issue, file a bug 141 If the bot goes purple and you believe it's an infrastructure issue, file a bug
148 with 142 with
149 [this template](https://code.google.com/p/chromium/issues/entry?labels=Pri-1,Performance-BotHealth,Infra-Troopers,OS-?&comment=Link+to+buildbot+status+page:&summary=Purple+Bot+on+chromium.perf), 143 [this template](https://bugs.chromium.org/p/chromium/issues/entry?labels=Pri-1,Performance-BotHealth,Infra-Troopers,OS-?&comment=Link+to+buildbot+status+page:&summary=Purple+Bot+on+chromium.perf),
150 which will automatically add the bug to the trooper queue. Be sure to note 144 which will automatically add the bug to the trooper queue. Be sure to note
151 which step is failing, and paste any relevant info from the logs into the bug. 145 which step is failing, and paste any relevant info from the logs into the bug.
152 146
153 #####<a name="devicefailures"></a> Android Device failures 147 ###<a name="devicefailures"></a> Android Device failures
154 148
155 There are two types of device failures: 149 There are two types of device failures:
156 150
157 1. A device is blacklisted in the `device_status_check` step. You can look at 151 1. A device is blacklisted in the `device_status_check` step. You can look at
158 the buildbot status page to see how many devices were listed as online during 152 the buildbot status page to see how many devices were listed as online during
159 this step. You should always see 7 devices online. If you see fewer than 7 153 this step. You should always see 7 devices online. If you see fewer than 7
160 devices online, there is a problem in the lab. 154 devices online, there is a problem in the lab.
161 2. A device is passing `device_status_check` but still in poor health. The 155 2. A device is passing `device_status_check` but still in poor health. The
162 symptom of this is that all the tests are failing on it. You can see that on 156 symptom of this is that all the tests are failing on it. You can see that on
163 the buildbot status page by looking at the `Device Affinity`. If all tests 157 the buildbot status page by looking at the `Device Affinity`. If all tests
164 with the same device affinity number are failing, it's probably a device 158 with the same device affinity number are failing, it's probably a device
165 failure. 159 failure.
166 160
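The two device-failure checks described above can be expressed as a short script. This is a sketch under stated assumptions: the input format (a list of `(test_name, device_affinity, passed)` tuples) and both function names are hypothetical, and you would need to collect the data from the buildbot status page yourself.

```python
from collections import defaultdict

# The doc states that 7 devices should always be online.
EXPECTED_DEVICES = 7


def lab_problem(online_devices):
    """Type 1: fewer devices online than expected in device_status_check."""
    return online_devices < EXPECTED_DEVICES


def suspect_affinities(results):
    """Type 2: return affinity numbers for which every test failed.

    `results` is an iterable of (test_name, device_affinity, passed) tuples
    (hypothetical format).
    """
    outcomes = defaultdict(list)
    for _name, affinity, passed in results:
        outcomes[affinity].append(passed)
    # A device is suspect only if none of its tests passed.
    return sorted(a for a, passes in outcomes.items() if not any(passes))


results = [
    ("smoothness.top_25", 0, True),
    ("page_cycler.top_10", 0, False),
    ("thread_times.key_mobile_sites", 3, False),
    ("page_cycler.top_10", 3, False),
]
print(lab_problem(6), suspect_affinities(results))
```

A single failing test on an otherwise healthy affinity (affinity 0 in the sample data) is not flagged, which matches the "all tests with the same device affinity number are failing" criterion.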
167 For both types of failures, please file a bug with [this template](https://code.google.com/p/chromium/issues/entry?labels=Pri-1,Performance-BotHealth,Infra-Labs,OS-Android&comment=Link+to+buildbot+status+page:&summary=Device+offline+on+chromium.perf) 161 For both types of failures, please file a bug with [this template](https://bugs.chromium.org/p/chromium/issues/entry?labels=Pri-1,Performance-BotHealth,Infra-Labs,OS-Android&comment=Link+to+buildbot+status+page:&summary=Device+offline+on+chromium.perf)
168 which will add an issue to the infra labs queue. 162 which will add an issue to the infra labs queue.
169 163
170 If you need help triaging, here are the common labels you should use: 164 If you need help triaging, here are the common labels you should use:
171 165
172 * **Performance-BotHealth** should go on all bugs you file about the bots; 166 * **Performance-BotHealth** should go on all bugs you file about the bots;
173 it's the label we use to track all the issues. 167 it's the label we use to track all the issues.
174 * **Infra-Troopers** adds the bug to the trooper queue. This is for high 168 * **Infra-Troopers** adds the bug to the trooper queue. This is for high
175 priority issues, like a build breakage. Please add a comment explaining 169 priority issues, like a build breakage. Please add a comment explaining
176 what you want the trooper to do. 170 what you want the trooper to do.
177 * **Infra-Labs** adds the bug to the labs queue. If there is a hardware 171 * **Infra-Labs** adds the bug to the labs queue. If there is a hardware
178 problem, like an android device not responding or a bot that likely needs 172 problem, like an android device not responding or a bot that likely needs
179 a restart, please use this label. Make sure you set the **OS-** label 173 a restart, please use this label. Make sure you set the **OS-** label
180 correctly as well, and add a comment explaining what you want the labs 174 correctly as well, and add a comment explaining what you want the labs
181 team to do. 175 team to do.
182 * **Infra** label is appropriate for bugs that are not high priority, but we 176 * **Infra** label is appropriate for bugs that are not high priority, but we
183 need infra team's help to triage. For example, the buildbot status page 177 need infra team's help to triage. For example, the buildbot status page
184 UI is weird or we are getting some infra-related log spam. The infra team 178 UI is weird or we are getting some infra-related log spam. The infra team
185 works to triage these bugs within 24 hours, so you should ping if you do 179 works to triage these bugs within 24 hours, so you should ping if you do
186 not get a response. 180 not get a response.
187 * **Cr-Tests-Telemetry** for telemetry failures. 181 * **Cr-Tests-Telemetry** for telemetry failures.
188 * **Cr-Tests-AutoBisect** for bisect and perf try job failures. 182 * **Cr-Tests-AutoBisect** for bisect and perf try job failures.
189 183
190 If you still need help, ask the speed infra chat, or escalate to sullivan@. 184 If you still need help, ask the speed infra chat, or escalate to sullivan@.
191 185
192 ####<a name="followup"></a> Follow up on failures 186 ##<a name="followup"></a> Follow up on failures
193 187
194 **[Pri-0 bugs](https://code.google.com/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-0)** 188 **[Pri-0 bugs](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-0)**
195 should have an owner or contact on speed infra team and be worked on as top 189 should have an owner or contact on speed infra team and be worked on as top
196 priority. Pri-0 generally implies an entire waterfall is down. 190 priority. Pri-0 generally implies an entire waterfall is down.
197 191
198 **[Pri-1 bugs](https://code.google.com/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-1)** 192 **[Pri-1 bugs](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-1)**
199 should be pinged daily, and checked to make sure someone is following up. Pri-1 193 should be pinged daily, and checked to make sure someone is following up. Pri-1
200 bugs are for a red test (not yet disabled), purple bot, or failing device. 194 bugs are for a red test (not yet disabled), purple bot, or failing device.
201 195
202 **[Pri-2 bugs](https://code.google.com/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-2)** 196 **[Pri-2 bugs](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-2)**
203 are for disabled tests. These should be pinged weekly, and work towards fixing 197 are for disabled tests. These should be pinged weekly, and work towards fixing
204 should be ongoing when the sheriff is not working on a Pri-1 issue. Here is the 198 should be ongoing when the sheriff is not working on a Pri-1 issue. Here is the
205 [list of Pri-2 bugs that have not been pinged in a week](https://code.google.com/p/chromium/issues/list?can=2&q=label:Performance-BotHealth%20label:Pri-2%20modified-before:today-7&sort=modified) 199 [list of Pri-2 bugs that have not been pinged in a week](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label:Performance-BotHealth%20label:Pri-2%20modified-before:today-7&sort=modified)
206 200
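The follow-up cadence above (Pri-1 pinged daily, Pri-2 pinged weekly) can be checked mechanically. The sketch below assumes you already have each bug's priority and last-modified date from the issue tracker; the function name and the interval table are illustrative, not part of any Chromium tooling.

```python
import datetime

# Days between pings, per the cadence described above.
PING_INTERVAL_DAYS = {1: 1, 2: 7}


def needs_ping(priority, last_modified, today):
    """Return True if a bug is due for a ping (hypothetical helper).

    Only Pri-1 and Pri-2 have a ping schedule in the text; other
    priorities return False here (an assumption of this sketch).
    """
    interval = PING_INTERVAL_DAYS.get(priority)
    if interval is None:
        return False
    return (today - last_modified).days >= interval


today = datetime.date(2016, 3, 9)
print(needs_ping(1, datetime.date(2016, 3, 8), today))
print(needs_ping(2, datetime.date(2016, 3, 5), today))
```

The weekly case is the same condition that the `modified-before:today-7` query in the linked bug list applies on the tracker side.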
207 <!-- Unresolved issues: 201 <!-- Unresolved issues:
208 1. Do perf sheriffs watch the bisect waterfall? 202 1. Do perf sheriffs watch the bisect waterfall?
209 2. Do perf sheriffs watch the internal clank waterfall? 203 2. Do perf sheriffs watch the internal clank waterfall?
210 --> 204 -->