# Perf Bot Sheriffing

The perf bot sheriff is responsible for keeping the bots on the chromium.perf
waterfall up and running, and triaging performance test failures and flakes.

**[Rotation calendar](https://calendar.google.com/calendar/embed?src=google.com_2fpmo740pd1unrui9d7cgpbg2k%40group.calendar.google.com)**

## Key Responsibilities

* [Handle Device and Bot Failures](#Handle-Device-and-Bot-Failures)
(...skipping 217 lines...)
device screen for failing tests, links to which are printed in the logs.
Often this will immediately reveal failure causes that are opaque from
the logs alone. On other platforms, Devtools will produce tab
screenshots as long as the tab did not crash.

### Useful Logs and Debugging Info

1. **Telemetry test runner logs**

    **_Useful Content:_** Best place to start. These logs contain all of the
    python logging information from the telemetry test runner scripts.

    **_Where to find:_** These logs can be found from the buildbot build page.
    Click the _"[stdout]"_ link under any of the telemetry test buildbot steps
    to view the logs. Do not use the "stdio" link, which shows similar
    information but expires earlier and is slower to load.

2. **Android Logcat (Android)**

    **_Useful Content:_** This file contains all Android device logs. All
    Android apps and the Android system log information to logcat. This is a
    good place to look if you believe an issue is device related (an Android
    out-of-memory problem, for example). Additionally, information about
    native crashes is often logged here.

    **_Where to find:_** These logs can be found from the buildbot status page.
    Click the _"logcat dump"_ link under one of the _"gsutil upload"_ steps.

3. **Test Trace (Android)**

(...skipping 10 lines...)

    **_Useful Content:_** Contains symbolized stack traces of any Chrome or
    Android crashes.

    **_Where to find:_** These logs can be found from the buildbot status page.
    The symbolized stack traces can be found under several steps. Click the
    link under the _"symbolized breakpad crashes"_ step to see symbolized
    Chrome crashes. Click the link under the _"stack tool with logcat dump"_
    step to see symbolized Android crashes.

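When skimming a large logcat dump for the signals mentioned above (native
crashes, out-of-memory kills), a small filter script can save time. This is an
illustrative sketch, not a chromium.perf tool; the patterns it looks for (the
`F/` fatal log level, the debuggerd crash banner, lowmemorykiller messages) are
common Android logcat conventions, not guaranteed to be exhaustive:

```python
import re

# Logcat lines worth flagging when triaging a device-side failure:
#   - fatal-level entries ("F/" tag prefix in brief format)
#   - native crash banners from debuggerd ("*** ***") and signal lines
#   - low-memory kills reported by the kernel's lowmemorykiller
SUSPECT_PATTERNS = [
    re.compile(r'^F/'),                    # fatal log level
    re.compile(r'\*\*\* \*\*\*'),          # debuggerd crash banner
    re.compile(r'Fatal signal \d+'),       # native crash signal line
    re.compile(r'lowmemorykiller', re.I),  # kernel OOM kills
]

def suspect_lines(logcat_text):
    """Return the logcat lines that match any crash/OOM pattern."""
    return [line for line in logcat_text.splitlines()
            if any(p.search(line) for p in SUSPECT_PATTERNS)]

# Hypothetical sample of a logcat dump, for illustration only.
sample = """\
I/chromium( 1234): [INFO:CONSOLE(1)] page loaded
F/libc    ( 1234): Fatal signal 11 (SIGSEGV) at 0x00000000
I/DEBUG   (  567): *** *** *** *** *** *** *** *** ***
E/lowmemorykiller: Killing 'org.chromium.chrome' (1234)
"""
for line in suspect_lines(sample):
    print(line)
```

Running this against a downloaded logcat dump surfaces the handful of lines
worth reading out of what is often tens of thousands.
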
## Swarming Bots

As of Q4 2016 all desktop bots have been moved to the swarming pool, with a
goal of moving all Android bots to swarming in early 2017. There is now one
machine on the chromium.perf waterfall for each desktop configuration,
triggering test tasks on 5 corresponding swarming bots. All of our swarming
bots exist in the
[chrome-perf swarming pool](https://chromium-swarm.appspot.com/botlist?c=id&c=os&c=task&c=status&f=pool%3AChrome-perf&l=100&s=id%3Aasc).

1. Buildbot status page FYIs
    * Every test that is run now has 2-3 recipe steps on the buildbot status
      page associated with it:
      1. `[trigger] <test_name>` step (you can mostly ignore this).
      2. `<test_name>`: the test that was run on the swarming bot. The
         'shard #0' link on the step takes you to the swarming task page.
      3. `<test_name> Dashboard Upload`: the upload of the perf test results
         to the perf dashboard. This will not be present if the test was
         disabled.
    * We now run all benchmark tests even if they are disabled, but disabled
      tests will always return success and you can ignore them. You can
      identify these by the 'DISABLED_BENCHMARK' link under the step and the
      fact that they don’t have an upload step after them.
2. Debugging Expiring Jobs on the waterfall
    * You can tell a job is expiring in one of two ways:
      1. Click on the 'shard #0' link of the failed test and you will see
         EXPIRED on the swarming task page.
      2. There is a 'no_results_exc' and an 'invalid_results_exc' link on the
         failing test step, with the dashboard upload step failing. (Note:
         this could be an EXPIRED job or a TIMEOUT. An expired job means the
         task never got scheduled within the 5 hour swarming timeout;
         TIMEOUT means it started running but couldn’t finish before the
         5 hour swarming timeout.)
    * You can quickly see which bots the jobs are expiring/timing out on with
      the 'Bot id' annotation on the failing test step.
    * Troubleshooting why they are expiring:
      1. The bot might be down. Check the chrome-perf pool for that bot id
         and file a ticket with go/bugatrooper if the bot is down.
         * You can also identify a down bot through
           [viceroy](https://viceroy.corp.google.com/chrome_infra/Machines/per_machine).
           Search for a bot id; if the graph stops, the bot is down.
      2. Otherwise, check the swarming task list for each bot that has
         failing jobs and examine what might be going on (there is a good
         [video](https://youtu.be/gRa0LvICthk) from maruel@ on the swarming
         UI and how to filter and search bot task lists; for example, you
         can filter on bot id and name to examine the last n runs of a test).
         * A test might be timing out on a bot, causing subsequent tests to
           expire: they would pass normally but never get scheduled because
           of the timing-out test. Debug the timing-out test.
         * A test might be taking longer than normal but still passing, and
           the extra execution time causes other unrelated tests to fail.
           Compare the last passing run to the first failing run, see if a
           test is taking significantly longer, and debug that issue.
3. Reproducing swarming task runs
    * Reproduce on your local machine using the same inputs as the bot:
      1. Note that the local machine's spec must roughly match that of the
         swarming bot.
      2. See 'Reproducing the task locally' on the swarming task page.
      3. First run the command under
         'Download input files into directory foo'.
      4. cd into foo/out/Release of those downloaded inputs.
      5. Execute the test from this directory. The command you are looking
         for should be at the top of the logs; you just need to update the
         `--isolated-script-test-output=/b/s/w/ioFB73Qz/output.json` and
         `--isolated-script-test-chartjson-output=/b/s/w/ioFB73Qz/chartjson-output.json`
         flags to be a local path.
      6. Example with tmp as a locally created dir:
         `/usr/bin/python ../../testing/scripts/run_telemetry_benchmark_as_googletest.py ../../tools/perf/run_benchmark indexeddb_perf -v --upload-results --output-format=chartjson --browser=release --isolated-script-test-output=tmp/output.json --isolated-script-test-chartjson-output=tmp/chartjson-output.json`
    * ssh into the swarming bot and run the test on that machine:
      1. NOTE: this should be a last resort since it will cause a fifth of
         the benchmarks to continuously fail on the waterfall.
      2. First you need to decommission the swarming bot so other jobs don’t
         interfere; file a ticket with go/bugatrooper.
      3. See [remote access to bots](https://sites.google.com/a/google.com/chrome-infrastructure/golo/remote-access?pli=1)
         on how to ssh into the bot and then run the test.
         Rough overview for build161-m1:
         * prodaccess --chromegolo_ssh
         * ssh build161-m1.golo
         * Password is in valentine:
           "Chrome Golo, Perf, GPU bots - chrome-bot"
         * File a bug to reboot the machine to get it online in the
           swarming pool again.
4. Running local changes on a swarming bot
    * Using sunspider as the example benchmark since it is a quick one.
    * First, run the test locally to make sure there is no issue with the
      binary or the script running the test on the swarming bot. Make sure
      dir foo exists:
      `python testing/scripts/run_telemetry_benchmark_as_googletest.py tools/perf/run_benchmark sunspider -v --output-format=chartjson --upload-results --browser=reference --output-trace-tag=_ref --isolated-script-test-output=foo/output.json --isolated-script-test-chartjson-output=foo/chart-output.json`
    * Build any dependencies needed in the isolate:
      1. `ninja -C out/Release chrome/test:telemetry_perf_tests`
      2. This target should be enough if you are running a benchmark;
         otherwise build any targets reported missing when building the
         isolate in the next step.
      3. Make sure the [compiler proxy is running](https://sites.google.com/a/google.com/goma/how-to-use-goma/how-to-use-goma-for-chrome-team?pli=1):
         * `./goma_ctl.py ensure_start` from the goma directory.
    * Build the isolate:
      1. `python tools/mb/mb.py isolate //out/Release -m chromium.perf -b "Linux Builder" telemetry_perf_tests`
         * -m is the master.
         * -b is the builder name from mb_config.pyl that corresponds to the
           platform you are running this command on.
         * telemetry_perf_tests is the isolate name.
         * You might run into internal source deps when building the
           isolate, depending on the isolate. You might need to update the
           entry in mb_config.pyl for this builder to not be an official
           build so src/internal isn’t required.
    * Archive and create the isolate hash:
      1. `python tools/swarming_client/isolate.py archive -I isolateserver.appspot.com -i out/Release/telemetry_perf_tests.isolate -s out/Release/telemetry_perf_tests.isolated`
    * Run the test with the hash from the archive step:
      1. Run the hash locally:
         * Note output paths are local.
         * `./tools/swarming_client/run_isolated.py -I https://isolateserver.appspot.com -s <insert_hash_here> -- sunspider -v --upload-results --output-format=chartjson --browser=reference --output-trace-tag=_ref --isolated-script-test-output=/usr/local/google/home/eyaich/projects/chromium/src/tmp/output.json`
      2. Trigger on a swarming bot:
         * Note paths use the swarming output dir environment variable
           ISOLATED_OUTDIR, and dimensions are based on the bot and os you
           are triggering the job on.
         * `python tools/swarming_client/swarming.py trigger -v --isolate-server isolateserver.appspot.com -S chromium-swarm.appspot.com -d id build150-m1 -d pool Chrome-perf -d os Linux -s <insert_hash_here> -- sunspider -v --upload-results --output-format=chartjson --browser=reference --output-trace-tag=_ref --isolated-script-test-output='${ISOLATED_OUTDIR}/output.json' --isolated-script-test-chartjson-output='${ISOLATED_OUTDIR}/chart-output.json'`
         * All args after the '--' are for the swarming task and not for the
           trigger command. The output dirs must be in quotes when
           triggering on a swarming bot.

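The trigger invocation above packs bot dimensions, the isolate hash, and the
task arguments into one long line. A tiny helper that assembles the argv list
(purely illustrative; the flags simply mirror the swarming.py command shown
above) makes it easier to vary the bot id, os, and hash without mangling the
quoting:

```python
def swarming_trigger_argv(bot_id, os_dim, isolated_hash, task_args):
    """Assemble the argv for tools/swarming_client/swarming.py trigger,
    mirroring the command used above. Everything after '--' goes to the
    swarming task itself, not to the trigger command."""
    return (
        ['python', 'tools/swarming_client/swarming.py', 'trigger', '-v',
         '--isolate-server', 'isolateserver.appspot.com',
         '-S', 'chromium-swarm.appspot.com',
         '-d', 'id', bot_id,            # target a specific swarming bot
         '-d', 'pool', 'Chrome-perf',   # the chrome-perf pool
         '-d', 'os', os_dim,
         '-s', isolated_hash,
         '--'] + task_args
    )

argv = swarming_trigger_argv(
    'build150-m1', 'Linux', '<insert_hash_here>',
    ['sunspider', '-v', '--upload-results', '--output-format=chartjson',
     '--browser=reference', '--output-trace-tag=_ref',
     '--isolated-script-test-output=${ISOLATED_OUTDIR}/output.json',
     '--isolated-script-test-chartjson-output=${ISOLATED_OUTDIR}/chart-output.json'])
print(' '.join(argv))
```

Keeping the task args as a separate list also makes the '--' boundary between
trigger flags and task flags explicit.
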
### Disabling Telemetry Tests

If the test is a telemetry test, its name will have a '.' in it, such as
`thread_times.key_mobile_sites` or `page_cycler.top_10`. The part before the
first dot will be a python file in [tools/perf/benchmarks](https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/).

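The name-to-file mapping can be sketched in a couple of lines (the helper name
is mine, for illustration):

```python
def benchmark_file(test_name):
    """Map a telemetry benchmark name to the file that defines it:
    the part before the first '.' names a module in tools/perf/benchmarks/."""
    module = test_name.split('.', 1)[0]
    return 'tools/perf/benchmarks/%s.py' % module

print(benchmark_file('thread_times.key_mobile_sites'))
# tools/perf/benchmarks/thread_times.py
```
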
If a telemetry test is failing and there is no clear culprit to revert
immediately, disable the test. You can do this with the `@benchmark.Disabled`
decorator. **Always add a comment next to your decorator with the bug id which
has background on why the test was disabled, and also include a BUG= line in
(...skipping 66 lines...)

There is also a weekly debrief that you should see on your calendar titled
**Weekly Speed Sheriff Retrospective**. For this meeting you should prepare
any highlights or lowlights from your sheriffing shift as well as any other
feedback you may have that could improve future sheriffing shifts.

<!-- Unresolved issues:
1. Do perf sheriffs watch the bisect waterfall?
2. Do perf sheriffs watch the internal clank waterfall?
-->