| OLD | NEW |
| 1 # Perf Bot Sheriffing | 1 # Perf Bot Sheriffing |
| 2 | 2 |
| 3 The perf bot sheriff is responsible for keeping the bots on the chromium.perf | 3 The perf bot sheriff is responsible for keeping the bots on the chromium.perf |
| 4 waterfall up and running, and triaging performance test failures and flakes. | 4 waterfall up and running, and triaging performance test failures and flakes. |
| 5 | 5 |
| 6 **[Rotation calendar](https://calendar.google.com/calendar/embed?src=google.com_2fpmo740pd1unrui9d7cgpbg2k%40group.calendar.google.com)** | 6 **[Rotation calendar](https://calendar.google.com/calendar/embed?src=google.com_2fpmo740pd1unrui9d7cgpbg2k%40group.calendar.google.com)** |
| 7 | 7 |
| 8 ## Key Responsibilities | 8 ## Key Responsibilities |
| 9 | 9 |
| 10 * [Handle Device and Bot Failures](#Handle-Device-and-Bot-Failures) | 10 * [Handle Device and Bot Failures](#Handle-Device-and-Bot-Failures) |
| (...skipping 188 matching lines...) |
| 199 * The revision range the test occurred on. | 199 * The revision range the test occurred on. |
| 200 * A list of all platforms the test fails on. | 200 * A list of all platforms the test fails on. |
| 201 2. Disable the failing test if it is failing more than one out of five runs. | 201 2. Disable the failing test if it is failing more than one out of five runs. |
| 202 (see below for instructions on telemetry and other types of tests). Make | 202 (see below for instructions on telemetry and other types of tests). Make |
| 203 sure your disable cl includes a BUG= line with the bug from step 1 and the | 203 sure your disable cl includes a BUG= line with the bug from step 1 and the |
| 204 test owner is cc-ed on the bug. | 204 test owner is cc-ed on the bug. |
| 205 3. After the disable CL lands, you can downgrade the priority to Pri-2 and | 205 3. After the disable CL lands, you can downgrade the priority to Pri-2 and |
| 206 ensure that the bug title reflects something like "Fix and re-enable | 206 ensure that the bug title reflects something like "Fix and re-enable |
| 207 testname". | 207 testname". |
| 208 4. Investigate the failure. Some tips for investigating: | 208 4. Investigate the failure. Some tips for investigating: |
| 209 * When viewing buildbot step logs, **use the **<font color="blue">[stdout]</font>** link to view logs!**. | |
| 210 This will link to logdog logs which do not expire. Do not use or link | |
| 211 to the logs found through the <font color="blue">stdio</font> link | |
| 212 whenever possible as these logs will expire. | |
| 213 * When investigating Android, look for the logcat which is uploaded to | |
| 214 Google Storage at the end of the run. logcat will contain much more | |
| 215 detailed Android device and crash info than will be found in | |
| 216 Telemetry logs. | |
| 217 * If it's a non-flaky failure, identify the first failed | 209 * If it's a non-flaky failure, identify the first failed |
| 218 build so you can narrow down the range of CLs that causes the failure. | 210 build so you can narrow down the range of CLs that causes the failure. |
| 219 You can use the | 211 You can use the |
| 220 [diagnose_test_failure](https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/diagnose_test_failure) | 212 [diagnose_test_failure](https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/diagnose_test_failure) |
| 221 script to automatically find the first failed build and the good & bad | 213 script to automatically find the first failed build and the good & bad |
| 222 revisions (which can also be used for return code bisect). | 214 revisions (which can also be used for return code bisect). |
| 223 * If you suspect a specific CL in the range, you can revert it locally and | 215 * If you suspect a specific CL in the range, you can revert it locally and |
| 224 run the test on the | 216 run the test on the |
| 225 [perf trybots](https://www.chromium.org/developers/telemetry/performance-try-bots). | 217 [perf trybots](https://www.chromium.org/developers/telemetry/performance-try-bots). |
| 226 * You can run a return code bisect to narrow down the culprit CL: | 218 * You can run a return code bisect to narrow down the culprit CL: |
| 227 1. Open up the graph in the [perf dashboard](https://chromeperf.appspot.com/report) | 219 1. Open up the graph in the [perf dashboard](https://chromeperf.appspot.com/report) |
| 228 on one of the failing platforms. | 220 on one of the failing platforms. |
| 229 2. Hover over a data point and click the "Bisect" button on the | 221 2. Hover over a data point and click the "Bisect" button on the |
| 230 tooltip. | 222 tooltip. |
| 231 3. Type the **Bug ID** from step 1, the **Good Revision** the last | 223 3. Type the **Bug ID** from step 1, the **Good Revision** the last |
| 232 commit pos data was received from, the **Bad Revision** the last | 224 commit pos data was received from, the **Bad Revision** the last |
| 233 commit pos and set **Bisect mode** to `return_code`. | 225 commit pos and set **Bisect mode** to `return_code`. |
| 234 * [Debugging telemetry failures](https://www.chromium.org/developers/telemetry/diagnosing-test-failures) | 226 * [Debugging telemetry failures](https://www.chromium.org/developers/telemetry/diagnosing-test-failures) |
| 235 * On Android and Mac, you can view platform-level screenshots of the | 227 * On Android and Mac, you can view platform-level screenshots of the |
| 236 device screen for failing tests, links to which are printed in the logs. | 228 device screen for failing tests, links to which are printed in the logs. |
| 237 Often this will immediately reveal failure causes that are opaque from | 229 Often this will immediately reveal failure causes that are opaque from |
| 238 the logs alone. On other platforms, Devtools will produce tab | 230 the logs alone. On other platforms, Devtools will produce tab |
| 239 screenshots as long as the tab did not crash. | 231 screenshots as long as the tab did not crash. |
| 240 | 232 |
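The threshold in step 2 (disable a test that fails more than one run in five) can be checked mechanically. The sketch below is only an illustration: the function name `should_disable` and the input format (a list of booleans, one per run) are assumptions, not part of any sheriffing tool.

```python
from fractions import Fraction

# Threshold from step 2: disable when a test fails more than one run in five.
DISABLE_THRESHOLD = Fraction(1, 5)


def should_disable(run_failed):
    """Return True if the failure rate exceeds one in five.

    run_failed is a hypothetical input format: a list of booleans,
    one per recent run, where True means that run failed.
    """
    if not run_failed:
        # No data, so there is nothing to justify a disable.
        return False
    # Fraction avoids floating-point surprises at exactly 1/5.
    failure_rate = Fraction(sum(run_failed), len(run_failed))
    return failure_rate > DISABLE_THRESHOLD


print(should_disable([True, False, False, False, False]))  # False: exactly 1 in 5
print(should_disable([True, True, False, False, False]))   # True: 2 in 5
```

Note that exactly one failure in five runs does not cross the threshold, because the step says "more than".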
| 233 ### Useful Logs and Debugging Info |
| 234 |
| 235 1. **Telemetry test runner logs** |
| 236 |
| 237 **_Useful Content:_** Best place to start. These logs contain all of the |
| 238 python logging information from the telemetry test runner scripts. |
| 239 |
| 240 **_Where to find:_** These logs can be found from the buildbot build page. |
| 241 Click the _"[stdout]"_ link under any of the telemetry test buildbot steps |
| 242 to view the logs. Do not use the "stdio" link, which shows similar |
| 243 information but expires earlier and is slower to load. |
| 244 |
| 245 2. **Android Logcat (Android)** |
| 246 |
| 247 **_Useful Content:_** This file contains all Android device logs. All |
| 248 Android apps and the Android system log information to logcat. This is a |
| 249 good place to look if you believe an issue is device-related (an Android |
| 250 out-of-memory problem, for example). Information about native crashes is |
| 251 also often logged here. |
| 252 |
| 253 **_Where to find:_** These logs can be found from the buildbot status page. |
| 254 Click the _"logcat dump"_ link under one of the _"gsutil upload"_ steps. |
| 255 |
| 256 3. **Test Trace (Android)** |
| 257 |
| 258 **_Useful Content:_** These logs graphically depict the start/end times for |
| 259 all telemetry tests on all of the devices. This can help determine if test |
| 260 failures were caused by an environmental issue |
| 261 (see [Cross-Device Failures](#Android-Cross-Device-Failures)). |
| 262 |
| 263 **_Where to find:_** These logs can be found from the buildbot status page. |
| 264 Click the _"Test Trace"_ link under one of the |
| 265 _"gsutil Upload Test Trace"_ steps. |
| 266 |
| 267 4. **Symbolized Stack Traces (Android)** |
| 268 |
| 269 **_Useful Content:_** Contains symbolized stack traces of any Chrome or |
| 270 Android crashes. |
| 271 |
| 272 **_Where to find:_** These logs can be found from the buildbot status page. |
| 273 The symbolized stack traces can be found under several steps. Click the link |
| 274 under the _"symbolized breakpad crashes"_ step to see symbolized Chrome |
| 275 crashes. Click the link under _"stack tool with logcat dump"_ to see |
| 276 symbolized Android crashes. |
| 277 |
| 241 ### Disabling Telemetry Tests | 278 ### Disabling Telemetry Tests |
| 242 | 279 |
| 243 If the test is a telemetry test, its name will have a '.' in it, such as | 280 If the test is a telemetry test, its name will have a '.' in it, such as |
| 244 `thread_times.key_mobile_sites` or `page_cycler.top_10`. The part before the | 281 `thread_times.key_mobile_sites` or `page_cycler.top_10`. The part before the |
| 245 first dot will be a python file in [tools/perf/benchmarks](https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/). | 282 first dot will be a python file in [tools/perf/benchmarks](https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/). |
| 246 | 283 |
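The naming rule above can be sketched in a few lines of Python. This is a minimal illustration, not an existing tool: the helper name `benchmark_file_for` is hypothetical, and it assumes the file is named after the part before the first dot with a `.py` extension.

```python
import posixpath

# Directory named in the text above.
BENCHMARKS_DIR = "tools/perf/benchmarks"


def benchmark_file_for(test_name):
    """Map a telemetry test name to its likely benchmark file.

    Hypothetical helper: assumes the file is <part before first dot>.py.
    """
    if "." not in test_name:
        # Per the text, telemetry test names contain a '.'.
        raise ValueError("not a telemetry test name: %r" % test_name)
    module = test_name.split(".", 1)[0]
    return posixpath.join(BENCHMARKS_DIR, module + ".py")


print(benchmark_file_for("thread_times.key_mobile_sites"))
# tools/perf/benchmarks/thread_times.py
```

Splitting on the first dot only matters for names that contain more than one dot; everything after the first dot is treated as the benchmark's own suffix.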
| 247 If a telemetry test is failing and there is no clear culprit to revert | 284 If a telemetry test is failing and there is no clear culprit to revert |
| 248 immediately, disable the test. You can do this with the `@benchmark.Disabled` | 285 immediately, disable the test. You can do this with the `@benchmark.Disabled` |
| 249 decorator. **Always add a comment next to your decorator with the bug id which | 286 decorator. **Always add a comment next to your decorator with the bug id which |
| 250 has background on why the test was disabled, and also include a BUG= line in | 287 has background on why the test was disabled, and also include a BUG= line in |
| (...skipping 52 matching lines...) |
| 303 | 340 |
| 304 **[Pri-2 bugs](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label%3APerformance-Sheriff-BotHealth+label%3APri-2)** | 341 **[Pri-2 bugs](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label%3APerformance-Sheriff-BotHealth+label%3APri-2)** |
| 305 are for disabled tests. These should be pinged weekly, and work towards fixing | 342 are for disabled tests. These should be pinged weekly, and work towards fixing |
| 306 should be ongoing when the sheriff is not working on a Pri-1 issue. Here is the | 343 should be ongoing when the sheriff is not working on a Pri-1 issue. Here is the |
| 307 [list of Pri-2 bugs that have not been pinged in a week](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label:Performance-Sheriff-BotHealth%20label:Pri-2%20modified-before:today-7&sort=modified). | 344 [list of Pri-2 bugs that have not been pinged in a week](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label:Performance-Sheriff-BotHealth%20label:Pri-2%20modified-before:today-7&sort=modified). |
| 308 | 345 |
| 309 <!-- Unresolved issues: | 346 <!-- Unresolved issues: |
| 310 1. Do perf sheriffs watch the bisect waterfall? | 347 1. Do perf sheriffs watch the bisect waterfall? |
| 311 2. Do perf sheriffs watch the internal clank waterfall? | 348 2. Do perf sheriffs watch the internal clank waterfall? |
| 312 --> | 349 --> |