| OLD | NEW |
| 1 # Perf Bot Sheriffing | 1 # Perf Bot Sheriffing |
| 2 | 2 |
| 3 The perf bot sheriff is responsible for keeping the bots on the chromium.perf | 3 The perf bot sheriff is responsible for keeping the bots on the chromium.perf |
| 4 waterfall up and running, and triaging performance test failures and flakes. | 4 waterfall up and running, and triaging performance test failures and flakes. |
| 5 | 5 |
| 6 ## Key Responsibilities | 6 ## Key Responsibilities |
| 7 | 7 |
| 8 * [Keeping the chromium.perf waterfall green](#chromiumperf) | 8 * [Handle Device and Bot Failures](#botfailures) |
| 9 * [Handling Test Failures](#testfailures) | 9 * [Handle Test Failures](#testfailures) |
| 10 * [Handling Device and Bot Failures](#botfailures) | 10 * [Follow up on failures](#followup) |
| 11 * [Follow up on failures](#followup) | |
| 12 | 11 |
| 13 ###<a name="chromiumperf"></a> Keeping the chromium.perf waterfall green | 12 ##<a name="waterfallstate"></a> Understanding the Waterfall State |
| 14 | |
| 15 The primary responsibility of the perfbot health sheriff is to keep the | |
| 16 chromium.perf waterfall green. | |
| 17 | |
| 18 ####<a name="waterfallstate"></a> Understanding the Waterfall State | |
| 19 | 13 |
| 20 Everyone can view the chromium.perf waterfall at | 14 Everyone can view the chromium.perf waterfall at |
| 21 https://build.chromium.org/p/chromium.perf/, but for Googlers it is recommended | 15 https://build.chromium.org/p/chromium.perf/, but for Googlers it is recommended |
| 22 that you use the url **[https://uberchromegw.corp.google.com/i/chromium.perf/] | 16 that you use the url **[https://uberchromegw.corp.google.com/i/chromium.perf/] |
| 23 (https://uberchromegw.corp.google.com/i/chromium.perf/)** instead. The reason | 17 (https://uberchromegw.corp.google.com/i/chromium.perf/)** instead. The reason |
| 24 for this is that in order to make the performance tests as realistic as | 18 for this is that in order to make the performance tests as realistic as |
| 25 possible, the chromium.perf waterfall runs release official builds of Chrome. | 19 possible, the chromium.perf waterfall runs release official builds of Chrome. |
| 26 But the logs from release official builds may leak info from our partners that | 20 But the logs from release official builds may leak info from our partners that |
| 27 we do not have permission to share outside of Google. So the logs are available | 21 we do not have permission to share outside of Google. So the logs are available |
| 28 to Googlers only. To avoid manually rewriting the URL when switching between | 22 to Googlers only. To avoid manually rewriting the URL when switching between |
| 29 the upstream and downstream views of the waterfall and bots, you can install the | 23 the upstream and downstream views of the waterfall and bots, you can install the |
| 30 [Chromium Waterfall View Switcher extension](https://chrome.google.com/webstore/a/google.com/detail/chromium-waterfall-view-s/hnnplblfkmfaadpjdpkepbkdjhjpjbdp), | 24 [Chromium Waterfall View Switcher extension](https://chrome.google.com/webstore/a/google.com/detail/chromium-waterfall-view-s/hnnplblfkmfaadpjdpkepbkdjhjpjbdp), |
| 31 which adds a switching button to Chrome's URL bar. | 25 which adds a switching button to Chrome's URL bar. |
| 32 | 26 |
| 33 Note that there are three different views: | 27 Note that there are three different views: |
| 34 | 28 |
| 35 1. [Console view](https://uberchromegw.corp.google.com/i/chromium.perf/) | 29 1. [Console view](https://uberchromegw.corp.google.com/i/chromium.perf/) |
| 36 makes it easier to see a summary. | 30 makes it easier to see a summary. |
| 37 2. [Waterfall view](https://uberchromegw.corp.google.com/i/chromium.perf/waterfall) | 31 2. [Waterfall view](https://uberchromegw.corp.google.com/i/chromium.perf/waterfall) |
| 38 shows more details, including recent changes. | 32 shows more details, including recent changes. |
| 39 3. [Firefighter](https://chromiumperfstats.appspot.com/) shows traces of | 33 3. [Firefighter](https://chromiumperfstats.appspot.com/) shows traces of |
| 40 recent builds. It takes url parameter arguments: | 34 recent builds. It takes url parameter arguments: |
| 41 * **master** can be chromium.perf, tryserver.chromium.perf | 35 * **master** can be chromium.perf, tryserver.chromium.perf |
| 42 * **builder** can be a builder or tester name, like | 36 * **builder** can be a builder or tester name, like |
| 43 "Android Nexus5 Perf (2)" | 37 "Android Nexus5 Perf (2)" |
| 44 * **start_time** is seconds since the epoch. | 38 * **start_time** is seconds since the epoch. |
| 45 | 39 |
| 46 You can see a list of all previously filed bugs using the | 40 You can see a list of all previously filed bugs using the |
| 47 **[Performance-BotHealth](https://code.google.com/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth)** | 41 **[Performance-BotHealth](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth)** |
| 48 label in crbug. | 42 label in crbug. |
| 49 | 43 |
| 50 Please also check the recent | 44 Please also check the recent |
| 51 **[perf-sheriffs@chromium.org](https://groups.google.com/a/chromium.org/forum/#!forum/perf-sheriffs)** | 45 **[perf-sheriffs@chromium.org](https://groups.google.com/a/chromium.org/forum/#!forum/perf-sheriffs)** |
| 52 postings for important announcements about bot turndowns and other known issues. | 46 postings for important announcements about bot turndowns and other known issues. |
| 53 | 47 |
| 54 ####<a name="testfailures"></a> Handling Test Failures | 48 ##<a name="botfailures"></a> Handle Device and Bot Failures |
| 49 |
| 50 ###<a name="purplebots"></a> Purple bots |
| 51 |
| 52 When a bot goes purple, it's usually because of an infrastructure failure |
| 53 outside of the tests. But you should first check the logs of a purple bot to |
| 54 try to better understand the problem. Sometimes a telemetry test failure can |
| 55 turn the bot purple, for example. |
| 56 |
| 57 If the bot goes purple and you believe it's an infrastructure issue, file a bug |
| 58 with |
| 59 [this template](https://bugs.chromium.org/p/chromium/issues/entry?labels=Pri-1,Performance-BotHealth,Infra-Troopers,OS-?&comment=Link+to+buildbot+status+page:&summary=Purple+Bot+on+chromium.perf), |
| 60 which will automatically add the bug to the trooper queue. Be sure to note |
| 61 which step is failing, and paste any relevant info from the logs into the bug. |
| 62 |
| 63 ###<a name="devicefailures"></a> Android Device failures |
| 64 |
| 65 There are two types of device failures: |
| 66 |
| 67 1. A device is blacklisted in the `device_status_check` step. You can look at |
| 68 the buildbot status page to see how many devices were listed as online during |
| 69 this step. You should always see 7 devices online. If you see fewer than 7 |
| 70 devices online, there is a problem in the lab. |
| 71 2. A device is passing `device_status_check` but still in poor health. The |
| 72 symptom of this is that all the tests are failing on it. You can see that on |
| 73 the buildbot status page by looking at the `Device Affinity`. If all tests |
| 74 with the same device affinity number are failing, it's probably a device |
| 75 failure. |
| 76 |
| 77 For both types of failures, please file a bug with [this template](https://bugs.chromium.org/p/chromium/issues/entry?labels=Pri-1,Performance-BotHealth,Infra-Labs,OS-Android&comment=Link+to+buildbot+status+page:&summary=Device+offline+on+chromium.perf) |
| 78 which will add an issue to the infra labs queue. |
| 79 |
| 80 If you need help triaging, here are the common labels you should use: |
| 81 |
| 82 * **Performance-BotHealth** should go on all bugs you file about the bots; |
| 83 it's the label we use to track all the issues. |
| 84 * **Infra-Troopers** adds the bug to the trooper queue. This is for high |
| 85 priority issues, like a build breakage. Please add a comment explaining |
| 86 what you want the trooper to do. |
| 87 * **Infra-Labs** adds the bug to the labs queue. If there is a hardware |
| 88 problem, like an android device not responding or a bot that likely needs |
| 89 a restart, please use this label. Make sure you set the **OS-** label |
| 90 correctly as well, and add a comment explaining what you want the labs |
| 91 team to do. |
| 92 * **Infra** label is appropriate for bugs that are not high priority, but we |
| 93 need infra team's help to triage. For example, the buildbot status page |
| 94 UI is weird or we are getting some infra-related log spam. The infra team |
| 95 works to triage these bugs within 24 hours, so you should ping if you do |
| 96 not get a response. |
| 97 * **Cr-Tests-Telemetry** for telemetry failures. |
| 98 * **Cr-Tests-AutoBisect** for bisect and perf try job failures. |
| 99 |
| 100 If you still need help, ask the speed infra chat, or escalate to sullivan@. |
| 101 |
| 102 ##<a name="testfailures"></a> Handle Test Failures |
| 55 | 103 |
| 56 You want to keep the waterfall green! So any bot that is red or purple needs to | 104 You want to keep the waterfall green! So any bot that is red or purple needs to |
| 57 be investigated. When a test fails: | 105 be investigated. When a test fails: |
| 58 | 106 |
| 59 1. File a bug using | 107 1. File a bug using |
| 60 [this template](https://code.google.com/p/chromium/issues/entry?labels=Performance-BotHealth,Pri-1,Type-Bug-Regression,OS-?&comment=Revision+range+first+seen:%0ALink+to+failing+step+log:%0A%0A%0AIf%20the%20test%20is%20disabled,%20please%20downgrade%20to%20Pri-2.&summary=%3Ctest%3E+failure+on+chromium.perf+at+%3Crevisionrange%3E). | 108 [this template](https://bugs.chromium.org/p/chromium/issues/entry?labels=Performance-BotHealth,Pri-1,Type-Bug-Regression,OS-?&comment=Revision+range+first+seen:%0ALink+to+failing+step+log:%0A%0A%0AIf%20the%20test%20is%20disabled,%20please%20downgrade%20to%20Pri-2.&summary=%3Ctest%3E+failure+on+chromium.perf+at+%3Crevisionrange%3E). |
| 61 You'll want to be sure to include: | 109 You'll want to be sure to include: |
| 62 * Link to buildbot status page of failing build. | 110 * Link to buildbot status page of failing build. |
| 63 * Copy and paste of relevant failure snippet from the stdio. | 111 * Copy and paste of relevant failure snippet from the stdio. |
| 64 * CC the test owner from | 112 * CC the test owner from |
| 65 [go/perf-owners](https://docs.google.com/spreadsheets/d/1R_1BAOd3xeVtR0jn6wB5HHJ2K25mIbKp3iIRQKkX38o/edit#gid=0). | 113 [go/perf-owners](https://docs.google.com/spreadsheets/d/1R_1BAOd3xeVtR0jn6wB5HHJ2K25mIbKp3iIRQKkX38o/edit#gid=0). |
| 66 * The revision range the failure occurred on. | 114 * The revision range the failure occurred on. |
| 67 * A list of all platforms the test fails on. | 115 * A list of all platforms the test fails on. |
| 68 | 116 |
| 69 2. Disable the failing test if it is failing more than one out of five runs. | 117 2. Disable the failing test if it is failing more than one out of five runs. |
| 70 (see below for instructions on telemetry and other types of tests). Make sure | 118 (see below for instructions on telemetry and other types of tests). Make sure |
| (...skipping 14 matching lines...) |
| 85 3. Type the **Bug ID** from step 1, the **Good Revision** the last commit | 133 3. Type the **Bug ID** from step 1, the **Good Revision** the last commit |
| 86 pos data was received from, the **Bad Revision** the last commit pos | 134 pos data was received from, the **Bad Revision** the last commit pos |
| 87 and set **Bisect mode** to `return_code`. | 135 and set **Bisect mode** to `return_code`. |
| 88 * On Android and Mac, you can view platform-level screenshots of the device | 136 * On Android and Mac, you can view platform-level screenshots of the device |
| 89 screen for failing tests, links to which are printed in the logs. Often | 137 screen for failing tests, links to which are printed in the logs. Often |
| 90 this will immediately reveal failure causes that are opaque from the logs | 138 this will immediately reveal failure causes that are opaque from the logs |
| 91 alone. On other platforms, Devtools will produce tab screenshots as long as | 139 alone. On other platforms, Devtools will produce tab screenshots as long as |
| 92 the tab did not crash. | 140 the tab did not crash. |
| 93 | 141 |
| 94 | 142 |
| 95 #####<a name="telemetryfailures"></a> Disabling Telemetry Tests | 143 ###<a name="telemetryfailures"></a> Disabling Telemetry Tests |
| 96 | 144 |
| 97 If the test is a telemetry test, its name will have a '.' in it, such as | 145 If the test is a telemetry test, its name will have a '.' in it, such as |
| 98 thread_times.key_mobile_sites, or page_cycler.top_10. The part before the first | 146 `thread_times.key_mobile_sites`, or `page_cycler.top_10`. The part before the |
| 99 dot will be a python file in [tools/perf/benchmarks]( | 147 first dot will be a python file in [tools/perf/benchmarks]( |
| 100 https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/). | 148 https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/). |
| 101 | 149 |
| 102 If a telemetry test is failing and there is no clear culprit to revert | 150 If a telemetry test is failing and there is no clear culprit to revert |
| 103 immediately, disable the test. You can do this with the `@benchmark.Disabled` | 151 immediately, disable the test. You can do this with the `@benchmark.Disabled` |
| 104 decorator. **Always add a comment next to your decorator with the bug id which | 152 decorator. **Always add a comment next to your decorator with the bug id which |
| 105 has background on why the test was disabled, and also include a BUG= line in | 153 has background on why the test was disabled, and also include a BUG= line in |
| 106 the CL.** | 154 the CL.** |
| 107 | 155 |
| 108 Please disable the narrowest set of bots possible; for example, if | 156 Please disable the narrowest set of bots possible; for example, if |
| 109 the benchmark only fails on Windows Vista you can use `@benchmark.Disabled('vista')`. | 157 the benchmark only fails on Windows Vista you can use `@benchmark.Disabled('vista')`. |
| (...skipping 12 matching lines...) |
| 122 * `all` (please use as a last resort) | 170 * `all` (please use as a last resort) |
| 123 | 171 |
| 124 If the test fails consistently in a very narrow set of circumstances, you may | 172 If the test fails consistently in a very narrow set of circumstances, you may |
| 125 consider implementing a ShouldDisable method on the benchmark instead. | 173 consider implementing a ShouldDisable method on the benchmark instead. |
| 126 [Here](https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/power.py&q=svelte%20file:%5Esrc/tools/perf/&sq=package:chromium&type=cs&l=72) is | 174 [Here](https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/power.py&q=svelte%20file:%5Esrc/tools/perf/&sq=package:chromium&type=cs&l=72) is |
| 127 an example of disabling a benchmark which OOMs on svelte. | 175 an example of disabling a benchmark which OOMs on svelte. |
| 128 | 176 |
| 129 Disabling CLs can be TBR-ed to anyone in tools/perf/OWNERS, but please do **not** | 177 Disabling CLs can be TBR-ed to anyone in tools/perf/OWNERS, but please do **not** |
| 130 submit with NOTRY=true. | 178 submit with NOTRY=true. |
| 131 | 179 |
| 132 #####<a name="otherfailures"></a> Disabling Other Tests | 180 ###<a name="otherfailures"></a> Disabling Other Tests |
| 133 | 181 |
| 134 Non-telemetry tests are configured in [chromium.perf.json](https://code.google.com/p/chromium/codesearch#chromium/src/testing/buildbot/chromium.perf.json). | 182 Non-telemetry tests are configured in [chromium.perf.json](https://code.google.com/p/chromium/codesearch#chromium/src/testing/buildbot/chromium.perf.json). |
| 135 You can TBR any of the per-file OWNERS, but please do **not** submit with | 183 You can TBR any of the per-file OWNERS, but please do **not** submit with |
| 136 NOTRY=true. | 184 NOTRY=true. |
| 137 | 185 |
| 138 ####<a name="botfailures"></a> Handling Device and Bot Failures | 186 ##<a name="followup"></a> Follow up on failures |
| 139 | 187 |
| 140 #####<a name="purplebots"></a> Purple bots | 188 **[Pri-0 bugs](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-0)** |
| 141 | |
| 142 When a bot goes purple, it's usually because of an infrastructure failure | |
| 143 outside of the tests. But you should first check the logs of a purple bot to | |
| 144 try to better understand the problem. Sometimes a telemetry test failure can | |
| 145 turn the bot purple, for example. | |
| 146 | |
| 147 If the bot goes purple and you believe it's an infrastructure issue, file a bug | |
| 148 with | |
| 149 [this template](https://code.google.com/p/chromium/issues/entry?labels=Pri-1,Performance-BotHealth,Infra-Troopers,OS-?&comment=Link+to+buildbot+status+page:&summary=Purple+Bot+on+chromium.perf), | |
| 150 which will automatically add the bug to the trooper queue. Be sure to note | |
| 151 which step is failing, and paste any relevant info from the logs into the bug. | |
| 152 | |
| 153 #####<a name="devicefailures"></a> Android Device failures | |
| 154 | |
| 155 There are two types of device failures: | |
| 156 | |
| 157 1. A device is blacklisted in the `device_status_check` step. You can look at | |
| 158 the buildbot status page to see how many devices were listed as online during | |
| 159 this step. You should always see 7 devices online. If you see fewer than 7 | |
| 160 devices online, there is a problem in the lab. | |
| 161 2. A device is passing `device_status_check` but still in poor health. The | |
| 162 symptom of this is that all the tests are failing on it. You can see that on | |
| 163 the buildbot status page by looking at the `Device Affinity`. If all tests | |
| 164 with the same device affinity number are failing, it's probably a device | |
| 165 failure. | |
| 166 | |
| 167 For both types of failures, please file a bug with [this template](https://code.google.com/p/chromium/issues/entry?labels=Pri-1,Performance-BotHealth,Infra-Labs,OS-Android&comment=Link+to+buildbot+status+page:&summary=Device+offline+on+chromium.perf) | |
| 168 which will add an issue to the infra labs queue. | |
| 169 | |
| 170 If you need help triaging, here are the common labels you should use: | |
| 171 | |
| 172 * **Performance-BotHealth** should go on all bugs you file about the bots; | |
| 173 it's the label we use to track all the issues. | |
| 174 * **Infra-Troopers** adds the bug to the trooper queue. This is for high | |
| 175 priority issues, like a build breakage. Please add a comment explaining | |
| 176 what you want the trooper to do. | |
| 177 * **Infra-Labs** adds the bug to the labs queue. If there is a hardware | |
| 178 problem, like an android device not responding or a bot that likely needs | |
| 179 a restart, please use this label. Make sure you set the **OS-** label | |
| 180 correctly as well, and add a comment explaining what you want the labs | |
| 181 team to do. | |
| 182 * **Infra** label is appropriate for bugs that are not high priority, but we | |
| 183 need infra team's help to triage. For example, the buildbot status page | |
| 184 UI is weird or we are getting some infra-related log spam. The infra team | |
| 185 works to triage these bugs within 24 hours, so you should ping if you do | |
| 186 not get a response. | |
| 187 * **Cr-Tests-Telemetry** for telemetry failures. | |
| 188 * **Cr-Tests-AutoBisect** for bisect and perf try job failures. | |
| 189 | |
| 190 If you still need help, ask the speed infra chat, or escalate to sullivan@. | |
| 191 | |
| 192 ####<a name="followup"></a> Follow up on failures | |
| 193 | |
| 194 **[Pri-0 bugs](https://code.google.com/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-0)** | |
| 195 should have an owner or contact on speed infra team and be worked on as top | 189 should have an owner or contact on speed infra team and be worked on as top |
| 196 priority. Pri-0 generally implies an entire waterfall is down. | 190 priority. Pri-0 generally implies an entire waterfall is down. |
| 197 | 191 |
| 198 **[Pri-1 bugs](https://code.google.com/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-1)** | 192 **[Pri-1 bugs](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-1)** |
| 199 should be pinged daily, and checked to make sure someone is following up. Pri-1 | 193 should be pinged daily, and checked to make sure someone is following up. Pri-1 |
| 200 bugs are for a red test (not yet disabled), purple bot, or failing device. | 194 bugs are for a red test (not yet disabled), purple bot, or failing device. |
| 201 | 195 |
| 202 **[Pri-2 bugs](https://code.google.com/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-2)** | 196 **[Pri-2 bugs](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-2)** |
| 203 are for disabled tests. These should be pinged weekly, and work towards fixing | 197 are for disabled tests. These should be pinged weekly, and work towards fixing |
| 204 should be ongoing when the sheriff is not working on a Pri-1 issue. Here is the | 198 should be ongoing when the sheriff is not working on a Pri-1 issue. Here is the |
| 205 [list of Pri-2 bugs that have not been pinged in a week](https://code.google.com/p/chromium/issues/list?can=2&q=label:Performance-BotHealth%20label:Pri-2%20modified-before:today-7&sort=modified) | 199 [list of Pri-2 bugs that have not been pinged in a week](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label:Performance-BotHealth%20label:Pri-2%20modified-before:today-7&sort=modified) |
| 206 | 200 |
| 207 <!-- Unresolved issues: | 201 <!-- Unresolved issues: |
| 208 1. Do perf sheriffs watch the bisect waterfall? | 202 1. Do perf sheriffs watch the bisect waterfall? |
| 209 2. Do perf sheriffs watch the internal clank waterfall? | 203 2. Do perf sheriffs watch the internal clank waterfall? |
| 210 --> | 204 --> |
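
The Firefighter view listed under "Understanding the Waterfall State" takes `master`, `builder`, and `start_time` as URL parameters. A minimal sketch of building such a link follows. It assumes the parameters are passed as an ordinary query string on the root URL, which the doc does not spell out; the helper name `firefighter_url` and the sample timestamp are hypothetical.

```python
from urllib.parse import urlencode

# Base URL as given in the doc. Appending a plain query string to it is an
# assumption; the doc only says the page "takes url parameter arguments".
FIREFIGHTER_URL = "https://chromiumperfstats.appspot.com/"


def firefighter_url(master, builder, start_time=None):
    """Build a Firefighter link (hypothetical helper, not part of any tool).

    master:     e.g. "chromium.perf" or "tryserver.chromium.perf"
    builder:    a builder or tester name, e.g. "Android Nexus5 Perf (2)"
    start_time: seconds since the epoch (optional)
    """
    params = {"master": master, "builder": builder}
    if start_time is not None:
        params["start_time"] = int(start_time)
    # urlencode percent-escapes spaces and parentheses in the builder name.
    return FIREFIGHTER_URL + "?" + urlencode(params)


# Example with an arbitrary, made-up timestamp.
example = firefighter_url("chromium.perf", "Android Nexus5 Perf (2)", 1456790400)
```

Escaping matters here because builder names such as "Android Nexus5 Perf (2)" contain spaces and parentheses, which are easy to get wrong when editing the URL by hand.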