| OLD | NEW |
|---|---|
| 1 # Perf Bot Sheriffing | 1 # Perf Bot Sheriffing |
| 2 | 2 |
| 3 The perf bot sheriff is responsible for keeping the bots on the chromium.perf | 3 The perf bot sheriff is responsible for keeping the bots on the chromium.perf |
| 4 waterfall up and running, and triaging performance test failures and flakes. | 4 waterfall up and running, and triaging performance test failures and flakes. |
| 5 | 5 |
| 6 ## Key Responsibilities | 6 ## Key Responsibilities |
| 7 | 7 |
| 8 * [Keeping the chromium.perf waterfall green](#chromiumperf) | 8 * [Handle Test Failures](#testfailures) |
| 9 * [Handling Test Failures](#testfailures) | 9 * [Handle Device and Bot Failures](#botfailures) |
|
aiolos (Not reviewing)
2016/03/09 17:57:38
Can you move the Device and Bot Failures before th
dtu
2016/03/09 22:27:44
Done.
| |
| 10 * [Handling Device and Bot Failures](#botfailures) | 10 * [Follow up on failures](#followup) |
| 11 * [Follow up on failures](#followup) | |
| 12 | 11 |
| 13 ###<a name="chromiumperf"></a> Keeping the chromium.perf waterfall green | 12 ##<a name="waterfallstate"></a> Understanding the Waterfall State |
| 14 | |
| 15 The primary responsibility of the perfbot health sheriff is to keep the | |
| 16 chromium.perf waterfall green. | |
| 17 | |
| 18 ####<a name="waterfallstate"></a> Understanding the Waterfall State | |
| 19 | 13 |
| 20 Everyone can view the chromium.perf waterfall at | 14 Everyone can view the chromium.perf waterfall at |
| 21 https://build.chromium.org/p/chromium.perf/, but for Googlers it is recommended | 15 https://build.chromium.org/p/chromium.perf/, but for Googlers it is recommended |
| 22 that you use the url **[https://uberchromegw.corp.google.com/i/chromium.perf/] | 16 that you use the url **[https://uberchromegw.corp.google.com/i/chromium.perf/] |
| 23 (https://uberchromegw.corp.google.com/i/chromium.perf/)** instead. The reason | 17 (https://uberchromegw.corp.google.com/i/chromium.perf/)** instead. The reason |
| 24 for this is that in order to make the performance tests as realistic as | 18 for this is that in order to make the performance tests as realistic as |
| 25 possible, the chromium.perf waterfall runs release official builds of Chrome. | 19 possible, the chromium.perf waterfall runs release official builds of Chrome. |
| 26 But the logs from release official builds may leak info from our partners that | 20 But the logs from release official builds may leak info from our partners that |
| 27 we do not have permission to share outside of Google. So the logs are available | 21 we do not have permission to share outside of Google. So the logs are available |
| 28 to Googlers only. To avoid manually rewriting the URL when switching between | 22 to Googlers only. To avoid manually rewriting the URL when switching between |
| 29 the upstream and downstream views of the waterfall and bots, you can install the | 23 the upstream and downstream views of the waterfall and bots, you can install the |
| 30 [Chromium Waterfall View Switcher extension](https://chrome.google.com/webstore/a/google.com/detail/chromium-waterfall-view-s/hnnplblfkmfaadpjdpkepbkdjhjpjbdp), | 24 [Chromium Waterfall View Switcher extension](https://chrome.google.com/webstore/a/google.com/detail/chromium-waterfall-view-s/hnnplblfkmfaadpjdpkepbkdjhjpjbdp), |
| 31 which adds a switching button to Chrome's URL bar. | 25 which adds a switching button to Chrome's URL bar. |
| 32 | 26 |
| 33 Note that there are three different views: | 27 Note that there are three different views: |
| 34 | 28 |
| 35 1. [Console view](https://uberchromegw.corp.google.com/i/chromium.perf/) | 29 1. [Console view](https://uberchromegw.corp.google.com/i/chromium.perf/) |
| 36 makes it easier to see a summary. | 30 makes it easier to see a summary. |
| 37 2. [Waterfall view](https://uberchromegw.corp.google.com/i/chromium.perf/waterfall) | 31 2. [Waterfall view](https://uberchromegw.corp.google.com/i/chromium.perf/waterfall) |
| 38 shows more details, including recent changes. | 32 shows more details, including recent changes. |
| 39 3. [Firefighter](https://chromiumperfstats.appspot.com/) shows traces of | 33 3. [Firefighter](https://chromiumperfstats.appspot.com/) shows traces of |
| 40 recent builds. It takes url parameter arguments: | 34 recent builds. It takes url parameter arguments: |
| 41 * **master** can be chromium.perf, tryserver.chromium.perf | 35 * **master** can be chromium.perf, tryserver.chromium.perf |
| 42 * **builder** can be a builder or tester name, like | 36 * **builder** can be a builder or tester name, like |
| 43 "Android Nexus5 Perf (2)" | 37 "Android Nexus5 Perf (2)" |
| 44 * **start_time** is seconds since the epoch. | 38 * **start_time** is seconds since the epoch. |
| 45 | 39 |
| 46 You can see a list of all previously filed bugs using the | 40 You can see a list of all previously filed bugs using the |
| 47 **[Performance-BotHealth](https://code.google.com/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth)** | 41 **[Performance-BotHealth](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth)** |
| 48 label in crbug. | 42 label in crbug. |
| 49 | 43 |
| 50 Please also check the recent | 44 Please also check the recent |
| 51 **[perf-sheriffs@chromium.org](https://groups.google.com/a/chromium.org/forum/#!forum/perf-sheriffs)** | 45 **[perf-sheriffs@chromium.org](https://groups.google.com/a/chromium.org/forum/#!forum/perf-sheriffs)** |
| 52 postings for important announcements about bot turndowns and other known issues. | 46 postings for important announcements about bot turndowns and other known issues. |
| 53 | 47 |
| 54 ####<a name="testfailures"></a> Handling Test Failures | 48 ##<a name="testfailures"></a> Handle Test Failures |
| 55 | 49 |
| 56 You want to keep the waterfall green! So any bot that is red or purple needs to | 50 You want to keep the waterfall green! So any bot that is red or purple needs to |
| 57 be investigated. When a test fails: | 51 be investigated. When a test fails: |
| 58 | 52 |
| 59 1. File a bug using | 53 1. File a bug using |
| 60 [this template](https://code.google.com/p/chromium/issues/entry?labels=Performance-BotHealth,Pri-1,Type-Bug-Regression,OS-?&comment=Revision+range+first+seen:%0ALink+to+failing+step+log:%0A%0A%0AIf%20the%20test%20is%20disabled,%20please%20downgrade%20to%20Pri-2.&summary=%3Ctest%3E+failure+on+chromium.perf+at+%3Crevisionrange%3E). | 54 [this template](https://bugs.chromium.org/p/chromium/issues/entry?labels=Performance-BotHealth,Pri-1,Type-Bug-Regression,OS-?&comment=Revision+range+first+seen:%0ALink+to+failing+step+log:%0A%0A%0AIf%20the%20test%20is%20disabled,%20please%20downgrade%20to%20Pri-2.&summary=%3Ctest%3E+failure+on+chromium.perf+at+%3Crevisionrange%3E). |
| 61 You'll want to be sure to include: | 55 You'll want to be sure to include: |
| 62 * Link to buildbot status page of failing build. | 56 * Link to buildbot status page of failing build. |
| 63 * Copy and paste of relevant failure snippet from the stdio. | 57 * Copy and paste of relevant failure snippet from the stdio. |
| 64 * CC the test owner from | 58 * CC the test owner from |
| 65 [go/perf-owners](https://docs.google.com/spreadsheets/d/1R_1BAOd3xeVtR0jn6wB5HHJ2K25mIbKp3iIRQKkX38o/edit#gid=0). | 59 [go/perf-owners](https://docs.google.com/spreadsheets/d/1R_1BAOd3xeVtR0jn6wB5HHJ2K25mIbKp3iIRQKkX38o/edit#gid=0). |
| 66 * The revision range the failure occurred on. | 60 * The revision range the failure occurred on. |
| 67 * A list of all platforms the test fails on. | 61 * A list of all platforms the test fails on. |
| 68 | 62 |
| 69 2. Disable the failing test if it is failing more than one out of five runs. | 63 2. Disable the failing test if it is failing more than one out of five runs. |
| 70 (see below for instructions on telemetry and other types of tests). Make sure | 64 (see below for instructions on telemetry and other types of tests). Make sure |
| (...skipping 14 matching lines...) | |
| 85 3. Type the **Bug ID** from step 1, the **Good Revision** the last commit | 79 3. Type the **Bug ID** from step 1, the **Good Revision** the last commit |
| 86 pos data was received from, the **Bad Revision** the last commit pos | 80 pos data was received from, the **Bad Revision** the last commit pos |
| 87 and set **Bisect mode** to `return_code`. | 81 and set **Bisect mode** to `return_code`. |
| 88 * On Android and Mac, you can view platform-level screenshots of the device | 82 * On Android and Mac, you can view platform-level screenshots of the device |
| 89 screen for failing tests, links to which are printed in the logs. Often | 83 screen for failing tests, links to which are printed in the logs. Often |
| 90 this will immediately reveal failure causes that are opaque from the logs | 84 this will immediately reveal failure causes that are opaque from the logs |
| 91 alone. On other platforms, Devtools will produce tab screenshots as long as | 85 alone. On other platforms, Devtools will produce tab screenshots as long as |
| 92 the tab did not crash. | 86 the tab did not crash. |
| 93 | 87 |
| 94 | 88 |
| 95 #####<a name="telemetryfailures"></a> Disabling Telemetry Tests | 89 ###<a name="telemetryfailures"></a> Disabling Telemetry Tests |
| 96 | 90 |
| 97 If the test is a telemetry test, its name will have a '.' in it, such as | 91 If the test is a telemetry test, its name will have a '.' in it, such as |
| 98 thread_times.key_mobile_sites, or page_cycler.top_10. The part before the first | 92 thread\_times.key\_mobile\_sites, or page\_cycler.top\_10. The part before the first |
|
sullivan
2016/03/09 14:38:25
We could also consider backticks instead of backsl
dtu
2016/03/09 22:27:44
Done.
| |
| 99 dot will be a python file in [tools/perf/benchmarks]( | 93 dot will be a python file in [tools/perf/benchmarks]( |
| 100 https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/). | 94 https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/). |
| 101 | 95 |
| 102 If a telemetry test is failing and there is no clear culprit to revert | 96 If a telemetry test is failing and there is no clear culprit to revert |
| 103 immediately, disable the test. You can do this with the `@benchmark.Disabled` | 97 immediately, disable the test. You can do this with the `@benchmark.Disabled` |
| 104 decorator. **Always add a comment next to your decorator with the bug id which | 98 decorator. **Always add a comment next to your decorator with the bug id which |
| 105 has background on why the test was disabled, and also include a BUG= line in | 99 has background on why the test was disabled, and also include a BUG= line in |
| 106 the CL.** | 100 the CL.** |
| 107 | 101 |
| 108 Please disable the narrowest set of bots possible; for example, if | 102 Please disable the narrowest set of bots possible; for example, if |
| (...skipping 13 matching lines...) | |
| 122 * `all` (please use as a last resort) | 116 * `all` (please use as a last resort) |
| 123 | 117 |
| 124 If the test fails consistently in a very narrow set of circumstances, you may | 118 If the test fails consistently in a very narrow set of circumstances, you may |
| 125 consider implementing a ShouldDisable method on the benchmark instead. | 119 consider implementing a ShouldDisable method on the benchmark instead. |
| 126 [Here](https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/power.py&q=svelte%20file:%5Esrc/tools/perf/&sq=package:chromium&type=cs&l=72) is | 120 [Here](https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/power.py&q=svelte%20file:%5Esrc/tools/perf/&sq=package:chromium&type=cs&l=72) is |
| 127 an example of disabling a benchmark which OOMs on svelte. | 121 an example of disabling a benchmark which OOMs on svelte. |
| 128 | 122 |
| 129 Disabling CLs can be TBR-ed to anyone in tools/perf/OWNERS, but please do **not** | 123 Disabling CLs can be TBR-ed to anyone in tools/perf/OWNERS, but please do **not** |
| 130 submit with NOTRY=true. | 124 submit with NOTRY=true. |
| 131 | 125 |
| 132 #####<a name="otherfailures"></a> Disabling Other Tests | 126 ###<a name="otherfailures"></a> Disabling Other Tests |
| 133 | 127 |
| 134 Non-telemetry tests are configured in [chromium.perf.json](https://code.google.com/p/chromium/codesearch#chromium/src/testing/buildbot/chromium.perf.json). | 128 Non-telemetry tests are configured in [chromium.perf.json](https://code.google.com/p/chromium/codesearch#chromium/src/testing/buildbot/chromium.perf.json). |
| 135 You can TBR any of the per-file OWNERS, but please do **not** submit with | 129 You can TBR any of the per-file OWNERS, but please do **not** submit with |
| 136 NOTRY=true. | 130 NOTRY=true. |
| 137 | 131 |
| 138 ####<a name="botfailures"></a> Handling Device and Bot Failures | 132 ##<a name="botfailures"></a> Handle Device and Bot Failures |
| 139 | 133 |
| 140 #####<a name="purplebots"></a> Purple bots | 134 ###<a name="purplebots"></a> Purple bots |
| 141 | 135 |
| 142 When a bot goes purple, it's usually because of an infrastructure failure | 136 When a bot goes purple, it's usually because of an infrastructure failure |
| 143 outside of the tests. But you should first check the logs of a purple bot to | 137 outside of the tests. But you should first check the logs of a purple bot to |
| 144 try to better understand the problem. Sometimes a telemetry test failure can | 138 try to better understand the problem. Sometimes a telemetry test failure can |
| 145 turn the bot purple, for example. | 139 turn the bot purple, for example. |
| 146 | 140 |
| 147 If the bot goes purple and you believe it's an infrastructure issue, file a bug | 141 If the bot goes purple and you believe it's an infrastructure issue, file a bug |
| 148 with | 142 with |
| 149 [this template](https://code.google.com/p/chromium/issues/entry?labels=Pri-1,Performance-BotHealth,Infra-Troopers,OS-?&comment=Link+to+buildbot+status+page:&summary=Purple+Bot+on+chromium.perf), | 143 [this template](https://bugs.chromium.org/p/chromium/issues/entry?labels=Pri-1,Performance-BotHealth,Infra-Troopers,OS-?&comment=Link+to+buildbot+status+page:&summary=Purple+Bot+on+chromium.perf), |
| 150 which will automatically add the bug to the trooper queue. Be sure to note | 144 which will automatically add the bug to the trooper queue. Be sure to note |
| 151 which step is failing, and paste any relevant info from the logs into the bug. | 145 which step is failing, and paste any relevant info from the logs into the bug. |
| 152 | 146 |
| 153 #####<a name="devicefailures"></a> Android Device failures | 147 ###<a name="devicefailures"></a> Android Device failures |
| 154 | 148 |
| 155 There are two types of device failures: | 149 There are two types of device failures: |
| 156 | 150 |
| 157 1. A device is blacklisted in the `device_status_check` step. You can look at | 151 1. A device is blacklisted in the `device_status_check` step. You can look at |
| 158 the buildbot status page to see how many devices were listed as online during | 152 the buildbot status page to see how many devices were listed as online during |
| 159 this step. You should always see 7 devices online. If you see fewer than 7 | 153 this step. You should always see 7 devices online. If you see fewer than 7 |
| 160 devices online, there is a problem in the lab. | 154 devices online, there is a problem in the lab. |
| 161 2. A device is passing `device_status_check` but still in poor health. The | 155 2. A device is passing `device_status_check` but still in poor health. The |
| 162 symptom of this is that all the tests are failing on it. You can see that on | 156 symptom of this is that all the tests are failing on it. You can see that on |
| 163 the buildbot status page by looking at the `Device Affinity`. If all tests | 157 the buildbot status page by looking at the `Device Affinity`. If all tests |
| 164 with the same device affinity number are failing, it's probably a device | 158 with the same device affinity number are failing, it's probably a device |
| 165 failure. | 159 failure. |
| 166 | 160 |
| 167 For both types of failures, please file a bug with [this template](https://code.google.com/p/chromium/issues/entry?labels=Pri-1,Performance-BotHealth,Infra-Labs,OS-Android&comment=Link+to+buildbot+status+page:&summary=Device+offline+on+chromium.perf) | 161 For both types of failures, please file a bug with [this template](https://bugs.chromium.org/p/chromium/issues/entry?labels=Pri-1,Performance-BotHealth,Infra-Labs,OS-Android&comment=Link+to+buildbot+status+page:&summary=Device+offline+on+chromium.perf) |
| 168 which will add an issue to the infra labs queue. | 162 which will add an issue to the infra labs queue. |
| 169 | 163 |
| 170 If you need help triaging, here are the common labels you should use: | 164 If you need help triaging, here are the common labels you should use: |
| 171 | 165 |
| 172 * **Performance-BotHealth** should go on all bugs you file about the bots; | 166 * **Performance-BotHealth** should go on all bugs you file about the bots; |
| 173 it's the label we use to track all the issues. | 167 it's the label we use to track all the issues. |
| 174 * **Infra-Troopers** adds the bug to the trooper queue. This is for high | 168 * **Infra-Troopers** adds the bug to the trooper queue. This is for high |
| 175 priority issues, like a build breakage. Please add a comment explaining | 169 priority issues, like a build breakage. Please add a comment explaining |
| 176 what you want the trooper to do. | 170 what you want the trooper to do. |
| 177 * **Infra-Labs** adds the bug to the labs queue. If there is a hardware | 171 * **Infra-Labs** adds the bug to the labs queue. If there is a hardware |
| 178 problem, like an android device not responding or a bot that likely needs | 172 problem, like an android device not responding or a bot that likely needs |
| 179 a restart, please use this label. Make sure you set the **OS-** label | 173 a restart, please use this label. Make sure you set the **OS-** label |
| 180 correctly as well, and add a comment explaining what you want the labs | 174 correctly as well, and add a comment explaining what you want the labs |
| 181 team to do. | 175 team to do. |
| 182 * **Infra** label is appropriate for bugs that are not high priority, but we | 176 * **Infra** label is appropriate for bugs that are not high priority, but we |
| 183 need infra team's help to triage. For example, the buildbot status page | 177 need infra team's help to triage. For example, the buildbot status page |
| 184 UI is weird or we are getting some infra-related log spam. The infra team | 178 UI is weird or we are getting some infra-related log spam. The infra team |
| 185 works to triage these bugs within 24 hours, so you should ping if you do | 179 works to triage these bugs within 24 hours, so you should ping if you do |
| 186 not get a response. | 180 not get a response. |
| 187 * **Cr-Tests-Telemetry** for telemetry failures. | 181 * **Cr-Tests-Telemetry** for telemetry failures. |
| 188 * **Cr-Tests-AutoBisect** for bisect and perf try job failures. | 182 * **Cr-Tests-AutoBisect** for bisect and perf try job failures. |
| 189 | 183 |
| 190 If you still need help, ask the speed infra chat, or escalate to sullivan@. | 184 If you still need help, ask the speed infra chat, or escalate to sullivan@. |
| 191 | 185 |
| 192 ####<a name="followup"></a> Follow up on failures | 186 ##<a name="followup"></a> Follow up on failures |
| 193 | 187 |
| 194 **[Pri-0 bugs](https://code.google.com/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-0)** | 188 **[Pri-0 bugs](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-0)** |
| 195 should have an owner or contact on speed infra team and be worked on as top | 189 should have an owner or contact on speed infra team and be worked on as top |
| 196 priority. Pri-0 generally implies an entire waterfall is down. | 190 priority. Pri-0 generally implies an entire waterfall is down. |
| 197 | 191 |
| 198 **[Pri-1 bugs](https://code.google.com/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-1)** | 192 **[Pri-1 bugs](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-1)** |
| 199 should be pinged daily, and checked to make sure someone is following up. Pri-1 | 193 should be pinged daily, and checked to make sure someone is following up. Pri-1 |
| 200 bugs are for a red test (not yet disabled), purple bot, or failing device. | 194 bugs are for a red test (not yet disabled), purple bot, or failing device. |
| 201 | 195 |
| 202 **[Pri-2 bugs](https://code.google.com/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-2)** | 196 **[Pri-2 bugs](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-2)** |
| 203 are for disabled tests. These should be pinged weekly, and work towards fixing | 197 are for disabled tests. These should be pinged weekly, and work towards fixing |
| 204 should be ongoing when the sheriff is not working on a Pri-1 issue. Here is the | 198 should be ongoing when the sheriff is not working on a Pri-1 issue. Here is the |
| 205 [list of Pri-2 bugs that have not been pinged in a week](https://code.google.com/p/chromium/issues/list?can=2&q=label:Performance-BotHealth%20label:Pri-2%20modified-before:today-7&sort=modified) | 199 [list of Pri-2 bugs that have not been pinged in a week](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label:Performance-BotHealth%20label:Pri-2%20modified-before:today-7&sort=modified) |
| 206 | 200 |
| 207 <!-- Unresolved issues: | 201 <!-- Unresolved issues: |
| 208 1. Do perf sheriffs watch the bisect waterfall? | 202 1. Do perf sheriffs watch the bisect waterfall? |
| 209 2. Do perf sheriffs watch the internal clank waterfall? | 203 2. Do perf sheriffs watch the internal clank waterfall? |
| 210 --> | 204 --> |
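The doc says Firefighter takes `master`, `builder`, and `start_time` (seconds since the epoch) as URL parameters. A minimal Python sketch of building such a link is below. It assumes the parameters are passed as an ordinary query string on the base URL quoted in the doc, which the doc does not spell out; the helper name `firefighter_url` is hypothetical and not part of any Chromium tool.

```python
import datetime
from urllib.parse import urlencode

# Base URL as given in the doc. Treating the arguments as a plain query
# string is an assumption made for this sketch.
FIREFIGHTER_URL = "https://chromiumperfstats.appspot.com/"


def firefighter_url(master, builder, start):
    """Build a Firefighter link (hypothetical helper).

    `start` must be a timezone-aware datetime; it is converted to whole
    seconds since the epoch for the start_time parameter.
    """
    if start.tzinfo is None:
        raise ValueError("start must be timezone-aware")
    query = urlencode({
        "master": master,
        "builder": builder,
        "start_time": int(start.timestamp()),
    })
    return FIREFIGHTER_URL + "?" + query


url = firefighter_url(
    "chromium.perf",
    "Android Nexus5 Perf (2)",
    datetime.datetime(2016, 3, 9, tzinfo=datetime.timezone.utc),
)
print(url)
```

`urlencode` percent-encodes the spaces and parentheses in builder names, so a name like "Android Nexus5 Perf (2)" can be passed as-is.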