OLD | NEW |
---|---|
1 # Perf Bot Sheriffing | 1 # Perf Bot Sheriffing |
2 | 2 |
3 The perf bot sheriff is responsible for keeping the bots on the chromium.perf | 3 The perf bot sheriff is responsible for keeping the bots on the chromium.perf |
4 waterfall up and running, and triaging performance test failures and flakes. | 4 waterfall up and running, and triaging performance test failures and flakes. |
5 | 5 |
6 ## Key Responsibilities | 6 ## Key Responsibilities |
7 | 7 |
8 * [Keeping the chromium.perf waterfall green](#chromiumperf) | 8 * [Handle Test Failures](#testfailures) |
9 * [Handling Test Failures](#testfailures) | 9 * [Handle Device and Bot Failures](#botfailures) |
aiolos (Not reviewing)
2016/03/09 17:57:38
Can you move the Device and Bot Failures before th
dtu
2016/03/09 22:27:44
Done.
| |
10 * [Handling Device and Bot Failures](#botfailures) | 10 * [Follow up on failures](#followup) |
11 * [Follow up on failures](#followup) | |
12 | 11 |
13 ###<a name="chromiumperf"></a> Keeping the chromium.perf waterfall green | 12 ##<a name="waterfallstate"></a> Understanding the Waterfall State |
14 | |
15 The primary responsibility of the perfbot health sheriff is to keep the | |
16 chromium.perf waterfall green. | |
17 | |
18 ####<a name="waterfallstate"></a> Understanding the Waterfall State | |
19 | 13 |
20 Everyone can view the chromium.perf waterfall at | 14 Everyone can view the chromium.perf waterfall at |
21 https://build.chromium.org/p/chromium.perf/, but for Googlers it is recommended | 15 https://build.chromium.org/p/chromium.perf/, but for Googlers it is recommended |
22 that you use the url **[https://uberchromegw.corp.google.com/i/chromium.perf/] | 16 that you use the url **[https://uberchromegw.corp.google.com/i/chromium.perf/] |
23 (https://uberchromegw.corp.google.com/i/chromium.perf/)** instead. The reason | 17 (https://uberchromegw.corp.google.com/i/chromium.perf/)** instead. The reason |
24 for this is that in order to make the performance tests as realistic as | 18 for this is that in order to make the performance tests as realistic as |
25 possible, the chromium.perf waterfall runs release official builds of Chrome. | 19 possible, the chromium.perf waterfall runs release official builds of Chrome. |
26 But the logs from release official builds may leak info from our partners that | 20 But the logs from release official builds may leak info from our partners that |
27 we do not have permission to share outside of Google. So the logs are available | 21 we do not have permission to share outside of Google. So the logs are available |
28 to Googlers only. To avoid manually rewriting the URL when switching between | 22 to Googlers only. To avoid manually rewriting the URL when switching between |
29 the upstream and downstream views of the waterfall and bots, you can install the | 23 the upstream and downstream views of the waterfall and bots, you can install the |
30 [Chromium Waterfall View Switcher extension](https://chrome.google.com/webstore/a/google.com/detail/chromium-waterfall-view-s/hnnplblfkmfaadpjdpkepbkdjhjpjbdp), | 24 [Chromium Waterfall View Switcher extension](https://chrome.google.com/webstore/a/google.com/detail/chromium-waterfall-view-s/hnnplblfkmfaadpjdpkepbkdjhjpjbdp), |
31 which adds a switching button to Chrome's URL bar. | 25 which adds a switching button to Chrome's URL bar. |
32 | 26 |
33 Note that there are four different views: | 27 Note that there are four different views: |
34 | 28 |
35 1. [Console view](https://uberchromegw.corp.google.com/i/chromium.perf/) | 29 1. [Console view](https://uberchromegw.corp.google.com/i/chromium.perf/) |
36 makes it easier to see a summary. | 30 makes it easier to see a summary. |
37 2. [Waterfall view](https://uberchromegw.corp.google.com/i/chromium.perf/waterfall) | 31 2. [Waterfall view](https://uberchromegw.corp.google.com/i/chromium.perf/waterfall) |
38 shows more details, including recent changes. | 32 shows more details, including recent changes. |
39 3. [Firefighter](https://chromiumperfstats.appspot.com/) shows traces of | 33 3. [Firefighter](https://chromiumperfstats.appspot.com/) shows traces of |
40 recent builds. It takes url parameter arguments: | 34 recent builds. It takes url parameter arguments: |
41 * **master** can be chromium.perf, tryserver.chromium.perf | 35 * **master** can be chromium.perf, tryserver.chromium.perf |
42 * **builder** can be a builder or tester name, like | 36 * **builder** can be a builder or tester name, like |
43 "Android Nexus5 Perf (2)" | 37 "Android Nexus5 Perf (2)" |
44 * **start_time** is seconds since the epoch. | 38 * **start_time** is seconds since the epoch. |
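The three Firefighter parameters listed above can be combined into a URL. The following is a minimal sketch, assuming the parameters are passed as an ordinary query string on the root URL (the exact path Firefighter expects is not stated in this doc); the helper name `firefighter_url` is hypothetical.

```python
import calendar
import datetime
from urllib.parse import urlencode

FIREFIGHTER_URL = 'https://chromiumperfstats.appspot.com/'


def firefighter_url(master, builder, start):
    """Builds a Firefighter URL from master, builder and start_time.

    Assumption: the parameters go in a plain query string on the root URL.
    """
    # start_time is seconds since the epoch; the naive datetime is treated
    # as UTC here.
    start_time = calendar.timegm(start.utctimetuple())
    query = urlencode({
        'master': master,
        'builder': builder,
        'start_time': start_time,
    })
    return FIREFIGHTER_URL + '?' + query


url = firefighter_url('chromium.perf', 'Android Nexus5 Perf (2)',
                      datetime.datetime(2016, 3, 9))
print(url)
```

`urlencode` escapes the spaces and parentheses in the builder name, so names such as "Android Nexus5 Perf (2)" can be passed through unchanged.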
45 | 39 |
46 You can see a list of all previously filed bugs using the | 40 You can see a list of all previously filed bugs using the |
47 **[Performance-BotHealth](https://code.google.com/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth)** | 41 **[Performance-BotHealth](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth)** |
48 label in crbug. | 42 label in crbug. |
49 | 43 |
50 Please also check the recent | 44 Please also check the recent |
51 **[perf-sheriffs@chromium.org](https://groups.google.com/a/chromium.org/forum/#!forum/perf-sheriffs)** | 45 **[perf-sheriffs@chromium.org](https://groups.google.com/a/chromium.org/forum/#!forum/perf-sheriffs)** |
52 postings for important announcements about bot turndowns and other known issues. | 46 postings for important announcements about bot turndowns and other known issues. |
53 | 47 |
54 ####<a name="testfailures"></a> Handling Test Failures | 48 ##<a name="testfailures"></a> Handle Test Failures |
55 | 49 |
56 You want to keep the waterfall green! So any bot that is red or purple needs to | 50 You want to keep the waterfall green! So any bot that is red or purple needs to |
57 be investigated. When a test fails: | 51 be investigated. When a test fails: |
58 | 52 |
59 1. File a bug using | 53 1. File a bug using |
60 [this template](https://code.google.com/p/chromium/issues/entry?labels=Performance-BotHealth,Pri-1,Type-Bug-Regression,OS-?&comment=Revision+range+first+seen:%0ALink+to+failing+step+log:%0A%0A%0AIf%20the%20test%20is%20disabled,%20please%20downgrade%20to%20Pri-2.&summary=%3Ctest%3E+failure+on+chromium.perf+at+%3Crevisionrange%3E). | 54 [this template](https://bugs.chromium.org/p/chromium/issues/entry?labels=Performance-BotHealth,Pri-1,Type-Bug-Regression,OS-?&comment=Revision+range+first+seen:%0ALink+to+failing+step+log:%0A%0A%0AIf%20the%20test%20is%20disabled,%20please%20downgrade%20to%20Pri-2.&summary=%3Ctest%3E+failure+on+chromium.perf+at+%3Crevisionrange%3E). |
61 You'll want to be sure to include: | 55 You'll want to be sure to include: |
62 * Link to buildbot status page of failing build. | 56 * Link to buildbot status page of failing build. |
63 * Copy and paste of relevant failure snippet from the stdio. | 57 * Copy and paste of relevant failure snippet from the stdio. |
64 * CC the test owner from | 58 * CC the test owner from |
65 [go/perf-owners](https://docs.google.com/spreadsheets/d/1R_1BAOd3xeVtR0jn6wB5HHJ2K25mIbKp3iIRQKkX38o/edit#gid=0). | 59 [go/perf-owners](https://docs.google.com/spreadsheets/d/1R_1BAOd3xeVtR0jn6wB5HHJ2K25mIbKp3iIRQKkX38o/edit#gid=0). |
66 * The revision range the test occurred on. | 60 * The revision range the test occurred on. |
67 * A list of all platforms the test fails on. | 61 * A list of all platforms the test fails on. |
68 | 62 |
69 2. Disable the failing test if it is failing more than one out of five runs. | 63 2. Disable the failing test if it is failing more than one out of five runs. |
70 (see below for instructions on telemetry and other types of tests). Make sure | 64 (see below for instructions on telemetry and other types of tests). Make sure |
(...skipping 14 matching lines...) | |
85 3. Type the **Bug ID** from step 1, the **Good Revision** the last commit | 79 3. Type the **Bug ID** from step 1, the **Good Revision** the last commit |
86 pos data was received from, the **Bad Revision** the last commit pos | 80 pos data was received from, the **Bad Revision** the last commit pos |
87 and set **Bisect mode** to `return_code`. | 81 and set **Bisect mode** to `return_code`. |
88 * On Android and Mac, you can view platform-level screenshots of the device | 82 * On Android and Mac, you can view platform-level screenshots of the device |
89 screen for failing tests, links to which are printed in the logs. Often | 83 screen for failing tests, links to which are printed in the logs. Often |
90 this will immediately reveal failure causes that are opaque from the logs | 84 this will immediately reveal failure causes that are opaque from the logs |
91 alone. On other platforms, Devtools will produce tab screenshots as long as | 85 alone. On other platforms, Devtools will produce tab screenshots as long as |
92 the tab did not crash. | 86 the tab did not crash. |
93 | 87 |
94 | 88 |
95 #####<a name="telemetryfailures"></a> Disabling Telemetry Tests | 89 ###<a name="telemetryfailures"></a> Disabling Telemetry Tests |
96 | 90 |
97 If the test is a telemetry test, its name will have a '.' in it, such as | 91 If the test is a telemetry test, its name will have a '.' in it, such as |
98 thread_times.key_mobile_sites, or page_cycler.top_10. The part before the first | 92 thread\_times.key\_mobile\_sites, or page\_cycler.top\_10. The part before the first |
sullivan
2016/03/09 14:38:25
We could also consider backticks instead of backsl
dtu
2016/03/09 22:27:44
Done.
| |
99 dot will be a python file in [tools/perf/benchmarks]( | 93 dot will be a python file in [tools/perf/benchmarks]( |
100 https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/). | 94 https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/). |
101 | 95 |
102 If a telemetry test is failing and there is no clear culprit to revert | 96 If a telemetry test is failing and there is no clear culprit to revert |
103 immediately, disable the test. You can do this with the `@benchmark.Disabled` | 97 immediately, disable the test. You can do this with the `@benchmark.Disabled` |
104 decorator. **Always add a comment next to your decorator with the bug id which | 98 decorator. **Always add a comment next to your decorator with the bug id which |
105 has background on why the test was disabled, and also include a BUG= line in | 99 has background on why the test was disabled, and also include a BUG= line in |
106 the CL.** | 100 the CL.** |
107 | 101 |
108 Please disable the narrowest set of bots possible; for example, if | 102 Please disable the narrowest set of bots possible; for example, if |
(...skipping 13 matching lines...) | |
122 * `all` (please use as a last resort) | 116 * `all` (please use as a last resort) |
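As an illustration of the decorator-plus-comment convention described above, here is a minimal sketch. It uses a small stand-in for `benchmark.Disabled` so it runs on its own; the stand-in's bookkeeping, the platform name `'android'`, the class name, and the bug id are all assumptions made for the example, not taken from this doc or from telemetry.

```python
def Disabled(*platforms):
    """Stand-in for telemetry's benchmark.Disabled (an assumption for this
    sketch; the real decorator's internals may differ)."""
    def decorator(benchmark_class):
        # Record the disabled platforms on the class so they can be inspected.
        benchmark_class.disabled_on = set(platforms)
        return benchmark_class
    return decorator


# The platform name and bug id below are hypothetical placeholders.
@Disabled('android')  # crbug.com/000000: fails on the Android perf bots.
class ExampleBenchmark(object):
    """Hypothetical benchmark used only for illustration."""


print(sorted(ExampleBenchmark.disabled_on))
```

In a real CL the comment would carry the bug filed for the failure, and the CL description would include the matching BUG= line.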
123 | 117 |
124 If the test fails consistently in a very narrow set of circumstances, you may | 118 If the test fails consistently in a very narrow set of circumstances, you may |
125 consider implementing a ShouldDisable method on the benchmark instead. | 119 consider implementing a ShouldDisable method on the benchmark instead. |
126 [Here](https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/power.py&q=svelte%20file:%5Esrc/tools/perf/&sq=package:chromium&type=cs&l=72) is | 120 [Here](https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/power.py&q=svelte%20file:%5Esrc/tools/perf/&sq=package:chromium&type=cs&l=72) is |
127 an example of disabling a benchmark which OOMs on svelte. | 121 an example of disabling a benchmark which OOMs on svelte. |
128 | 122 |
129 Disabling CLs can be TBR-ed to anyone in tools/perf/OWNERS, but please do **not** | 123 Disabling CLs can be TBR-ed to anyone in tools/perf/OWNERS, but please do **not** |
130 submit with NOTRY=true. | 124 submit with NOTRY=true. |
131 | 125 |
132 #####<a name="otherfailures"></a> Disabling Other Tests | 126 ###<a name="otherfailures"></a> Disabling Other Tests |
133 | 127 |
134 Non-telemetry tests are configured in [chromium.perf.json](https://code.google.com/p/chromium/codesearch#chromium/src/testing/buildbot/chromium.perf.json). | 128 Non-telemetry tests are configured in [chromium.perf.json](https://code.google.com/p/chromium/codesearch#chromium/src/testing/buildbot/chromium.perf.json). |
135 You can TBR any of the per-file OWNERS, but please do **not** submit with | 129 You can TBR any of the per-file OWNERS, but please do **not** submit with |
136 NOTRY=true. | 130 NOTRY=true. |
137 | 131 |
138 ####<a name="botfailures"></a> Handling Device and Bot Failures | 132 ##<a name="botfailures"></a> Handle Device and Bot Failures |
139 | 133 |
140 #####<a name="purplebots"></a> Purple bots | 134 ###<a name="purplebots"></a> Purple bots |
141 | 135 |
142 When a bot goes purple, it's usually because of an infrastructure failure | 136 When a bot goes purple, it's usually because of an infrastructure failure |
143 outside of the tests. But you should first check the logs of a purple bot to | 137 outside of the tests. But you should first check the logs of a purple bot to |
144 try to better understand the problem. Sometimes a telemetry test failure can | 138 try to better understand the problem. Sometimes a telemetry test failure can |
145 turn the bot purple, for example. | 139 turn the bot purple, for example. |
146 | 140 |
147 If the bot goes purple and you believe it's an infrastructure issue, file a bug | 141 If the bot goes purple and you believe it's an infrastructure issue, file a bug |
148 with | 142 with |
149 [this template](https://code.google.com/p/chromium/issues/entry?labels=Pri-1,Performance-BotHealth,Infra-Troopers,OS-?&comment=Link+to+buildbot+status+page:&summary=Purple+Bot+on+chromium.perf), | 143 [this template](https://bugs.chromium.org/p/chromium/issues/entry?labels=Pri-1,Performance-BotHealth,Infra-Troopers,OS-?&comment=Link+to+buildbot+status+page:&summary=Purple+Bot+on+chromium.perf), |
150 which will automatically add the bug to the trooper queue. Be sure to note | 144 which will automatically add the bug to the trooper queue. Be sure to note |
151 which step is failing, and paste any relevant info from the logs into the bug. | 145 which step is failing, and paste any relevant info from the logs into the bug. |
152 | 146 |
153 #####<a name="devicefailures"></a> Android Device failures | 147 ###<a name="devicefailures"></a> Android Device failures |
154 | 148 |
155 There are two types of device failures: | 149 There are two types of device failures: |
156 | 150 |
157 1. A device is blacklisted in the `device_status_check` step. You can look at | 151 1. A device is blacklisted in the `device_status_check` step. You can look at |
158 the buildbot status page to see how many devices were listed as online during | 152 the buildbot status page to see how many devices were listed as online during |
159 this step. You should always see 7 devices online. If you see fewer than 7 | 153 this step. You should always see 7 devices online. If you see fewer than 7 |
160 devices online, there is a problem in the lab. | 154 devices online, there is a problem in the lab. |
161 2. A device is passing `device_status_check` but still in poor health. The | 155 2. A device is passing `device_status_check` but still in poor health. The |
162 symptom of this is that all the tests are failing on it. You can see that on | 156 symptom of this is that all the tests are failing on it. You can see that on |
163 the buildbot status page by looking at the `Device Affinity`. If all tests | 157 the buildbot status page by looking at the `Device Affinity`. If all tests |
164 with the same device affinity number are failing, it's probably a device | 158 with the same device affinity number are failing, it's probably a device |
165 failure. | 159 failure. |
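The second check above (every test sharing a device affinity number is failing) can be sketched as a small grouping step. This is only an illustration: the `(test_name, affinity, passed)` tuples and the function name `suspect_devices` are hypothetical, and in practice the data would be read off the buildbot status page by hand.

```python
from collections import defaultdict


def suspect_devices(results):
    """Returns the device affinity numbers for which every test failed.

    `results` is an iterable of (test_name, affinity, passed) tuples; this
    shape is an assumption made for the sketch.
    """
    outcomes_by_affinity = defaultdict(list)
    for _test_name, affinity, passed in results:
        outcomes_by_affinity[affinity].append(passed)
    # A device is suspect only if none of its tests passed.
    return sorted(affinity
                  for affinity, outcomes in outcomes_by_affinity.items()
                  if not any(outcomes))


# Hypothetical results: every test on affinity 2 fails, so device 2 is suspect.
results = [
    ('benchmark_a', 0, True),
    ('benchmark_b', 0, False),
    ('benchmark_c', 2, False),
    ('benchmark_d', 2, False),
    ('benchmark_e', 5, True),
]
print(suspect_devices(results))
```

A single failing test on an otherwise healthy affinity (like affinity 0 here) points at the test rather than the device.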
166 | 160 |
167 For both types of failures, please file a bug with [this template](https://code.google.com/p/chromium/issues/entry?labels=Pri-1,Performance-BotHealth,Infra-Labs,OS-Android&comment=Link+to+buildbot+status+page:&summary=Device+offline+on+chromium.perf) | 161 For both types of failures, please file a bug with [this template](https://bugs.chromium.org/p/chromium/issues/entry?labels=Pri-1,Performance-BotHealth,Infra-Labs,OS-Android&comment=Link+to+buildbot+status+page:&summary=Device+offline+on+chromium.perf) |
168 which will add an issue to the infra labs queue. | 162 which will add an issue to the infra labs queue. |
169 | 163 |
170 If you need help triaging, here are the common labels you should use: | 164 If you need help triaging, here are the common labels you should use: |
171 | 165 |
172 * **Performance-BotHealth** should go on all bugs you file about the bots; | 166 * **Performance-BotHealth** should go on all bugs you file about the bots; |
173 it's the label we use to track all the issues. | 167 it's the label we use to track all the issues. |
174 * **Infra-Troopers** adds the bug to the trooper queue. This is for high | 168 * **Infra-Troopers** adds the bug to the trooper queue. This is for high |
175 priority issues, like a build breakage. Please add a comment explaining | 169 priority issues, like a build breakage. Please add a comment explaining |
176 what you want the trooper to do. | 170 what you want the trooper to do. |
177 * **Infra-Labs** adds the bug to the labs queue. If there is a hardware | 171 * **Infra-Labs** adds the bug to the labs queue. If there is a hardware |
178 problem, like an android device not responding or a bot that likely needs | 172 problem, like an android device not responding or a bot that likely needs |
179 a restart, please use this label. Make sure you set the **OS-** label | 173 a restart, please use this label. Make sure you set the **OS-** label |
180 correctly as well, and add a comment explaining what you want the labs | 174 correctly as well, and add a comment explaining what you want the labs |
181 team to do. | 175 team to do. |
182 * **Infra** label is appropriate for bugs that are not high priority, but we | 176 * **Infra** label is appropriate for bugs that are not high priority, but we |
183 need infra team's help to triage. For example, the buildbot status page | 177 need infra team's help to triage. For example, the buildbot status page |
184 UI is weird or we are getting some infra-related log spam. The infra team | 178 UI is weird or we are getting some infra-related log spam. The infra team |
185 works to triage these bugs within 24 hours, so you should ping if you do | 179 works to triage these bugs within 24 hours, so you should ping if you do |
186 not get a response. | 180 not get a response. |
187 * **Cr-Tests-Telemetry** for telemetry failures. | 181 * **Cr-Tests-Telemetry** for telemetry failures. |
188 * **Cr-Tests-AutoBisect** for bisect and perf try job failures. | 182 * **Cr-Tests-AutoBisect** for bisect and perf try job failures. |
189 | 183 |
190 If you still need help, ask the speed infra chat, or escalate to sullivan@. | 184 If you still need help, ask the speed infra chat, or escalate to sullivan@. |
191 | 185 |
192 ####<a name="followup"></a> Follow up on failures | 186 ##<a name="followup"></a> Follow up on failures |
193 | 187 |
194 **[Pri-0 bugs](https://code.google.com/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-0)** | 188 **[Pri-0 bugs](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-0)** |
195 should have an owner or contact on speed infra team and be worked on as top | 189 should have an owner or contact on speed infra team and be worked on as top |
196 priority. Pri-0 generally implies an entire waterfall is down. | 190 priority. Pri-0 generally implies an entire waterfall is down. |
197 | 191 |
198 **[Pri-1 bugs](https://code.google.com/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-1)** | 192 **[Pri-1 bugs](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-1)** |
199 should be pinged daily, and checked to make sure someone is following up. Pri-1 | 193 should be pinged daily, and checked to make sure someone is following up. Pri-1 |
200 bugs are for a red test (not yet disabled), purple bot, or failing device. | 194 bugs are for a red test (not yet disabled), purple bot, or failing device. |
201 | 195 |
202 **[Pri-2 bugs](https://code.google.com/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-2)** | 196 **[Pri-2 bugs](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-2)** |
203 are for disabled tests. These should be pinged weekly, and work towards fixing | 197 are for disabled tests. These should be pinged weekly, and work towards fixing |
204 should be ongoing when the sheriff is not working on a Pri-1 issue. Here is the | 198 should be ongoing when the sheriff is not working on a Pri-1 issue. Here is the |
205 [list of Pri-2 bugs that have not been pinged in a week](https://code.google.com/p/chromium/issues/list?can=2&q=label:Performance-BotHealth%20label:Pri-2%20modified-before:today-7&sort=modified) | 199 [list of Pri-2 bugs that have not been pinged in a week](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label:Performance-BotHealth%20label:Pri-2%20modified-before:today-7&sort=modified) |
206 | 200 |
207 <!-- Unresolved issues: | 201 <!-- Unresolved issues: |
208 1. Do perf sheriffs watch the bisect waterfall? | 202 1. Do perf sheriffs watch the bisect waterfall? |
209 2. Do perf sheriffs watch the internal clank waterfall? | 203 2. Do perf sheriffs watch the internal clank waterfall? |
210 --> | 204 --> |