Chromium Code Reviews
# Perf Bot Sheriffing

The perf bot sheriff is responsible for keeping the bots on the chromium.perf
waterfall up and running, and triaging performance test failures and flakes.

**[Rotation calendar](https://calendar.google.com/calendar/embed?src=google.com_2fpmo740pd1unrui9d7cgpbg2k%40group.calendar.google.com)**

## Key Responsibilities

* [Handle Device and Bot Failures](#Handle-Device-and-Bot-Failures)
(...skipping 220 matching lines...)
    3. Enter the **Bug ID** from step 1, the **Good Revision** (the last
       commit pos data was received from), the **Bad Revision** (the
       latest commit pos), and set **Bisect mode** to `return_code`.
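As a concrete illustration of that step (all values below are made-up placeholders, not real bug or revision numbers), the bisect form might be filled in like:

```
Bug ID:        678910       (bug filed in step 1; hypothetical)
Good Revision: 416000       (last commit pos data was received from; hypothetical)
Bad Revision:  416120       (latest commit pos; hypothetical)
Bisect mode:   return_code
```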
* [Debugging telemetry failures](https://www.chromium.org/developers/telemetry/diagnosing-test-failures)
* On Android and Mac, you can view platform-level screenshots of the
  device screen for failing tests; links to these are printed in the logs.
  Often this will immediately reveal failure causes that are opaque from
  the logs alone. On other platforms, DevTools will produce tab
  screenshots as long as the tab did not crash.

## Swarming Bots

As of Q4 2016, all desktop bots have been moved to the swarming pool, with a
goal of moving all Android bots to swarming in early 2017. There is now one
machine on the chromium.perf waterfall for each desktop configuration, which
triggers test tasks on 5 corresponding swarming bots. All of our swarming bots
exist in the [chrome-perf swarming pool](https://chromium-swarm.appspot.com/botlist?c=id&c=os&c=task&c=status&f=pool%3AChrome-perf&l=100&s=id%3Aasc).

1. Buildbot status page FYIs
    * Every test that is run now has 2-3 recipe steps on the buildbot status
      page associated with it:
      1. '[trigger] <test_name>': you can mostly ignore this step.
      2. '<test_name>': the test that was run on the swarming bot. The
         'shard #0' link on the step takes you to the swarming task page.
      3. '<test_name> Dashboard Upload': the upload of the perf test
         results to the perf dashboard. This step will not be present if
         the test was disabled.
    * We now run all benchmark tests even if they are disabled, but disabled
      tests will always return success and you can ignore them. You can
      identify them by the 'DISABLED_BENCHMARK' link under the step and the
      fact that they don't have an upload step after them.
2. Debugging expiring jobs on the waterfall
    * You can tell a job is expiring in one of two ways:
      1. Click on the 'shard #0' link of the failed test and you will see
         EXPIRED on the swarming task page.
      2. There is a 'no_results_exc' and an 'invalid_results_exc' link on
         the failing buildbot test step, with the dashboard upload step
         failing. (Note: this could be an EXPIRED job or a TIMEOUT. An
         expired job means the task never got scheduled within the 5-hour
         swarming timeout; TIMEOUT means it started running but couldn't
         finish before the 5-hour swarming timeout.)
    * You can quickly see which bots the jobs are expiring/timing out on via
      the 'Bot id' annotation on the failing test step.
    * Troubleshooting why they are expiring:
      1. The bot might be down. Check the chrome-perf pool for that bot id,
         and file a ticket with go/bugatrooper if the bot is down.
      2. Otherwise, check the bot's swarming task list for each bot that
         has failing jobs and examine what might be going on (there is a
         good [video](https://youtu.be/gRa0LvICthk) from maruel@ on the
         swarming UI and how to filter and search bot task lists; for
         example, you can filter on bot id and name to examine the last n
         runs of a test).
         * A test might be timing out on a bot, causing subsequent tests
           to expire: they would pass normally but never get scheduled
           because of the timed-out test. Debug the timed-out test.
         * A test might be taking longer than normal but still passing,
           and the extra execution time causes other unrelated tests to
           fail. Compare the last passing run to the first failing run,
           see if you can spot a test that is taking significantly
           longer, and debug that issue.
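As a quick aid for the EXPIRED-vs-TIMEOUT distinction above, here is a minimal sketch that interprets the task-result JSON shown on the swarming task page. The `state` values follow the swarming task-result schema; treat the exact field names as assumptions rather than a definitive API contract:

```python
# Sketch: interpret a swarming task-result dict (as shown on the task
# page) to tell an expired task from a timed-out one. The 'state' values
# ('EXPIRED', 'TIMED_OUT') follow the swarming task-result schema; the
# field names here are assumptions for illustration.
def diagnose(task_result):
    state = task_result.get('state')
    if state == 'EXPIRED':
        # Never got scheduled on a bot within the swarming timeout.
        return 'expired: task was never scheduled; check whether the bot is down'
    if state == 'TIMED_OUT':
        # Started running but did not finish within the swarming timeout.
        return 'timed out: task started but ran too long; debug the slow test'
    return 'state=%s: see the task page for details' % state

print(diagnose({'state': 'EXPIRED'}))
print(diagnose({'state': 'TIMED_OUT'}))
```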
3. Reproducing swarming task runs
    * Reproduce on a local machine using the same inputs as the bot:
      1. Note that the local machine's spec must roughly match that of the
         swarming bot.
      2. See 'Reproducing the task locally' on the swarming task page.
      3. First run the command under
         'Download input files into directory foo'.
      4. cd into foo/out/Release within those downloaded inputs.
      5. Execute the test from this directory. The command you are looking
         for should be at the top of the logs; you just need to update the
         `--isolated-script-test-output=/b/s/w/ioFB73Qz/output.json` and
         `--isolated-script-test-chartjson-output=/b/s/w/ioFB73Qz/chartjson-output.json`
         flags to point at local paths.
      6. Example with tmp as a locally created dir:
         `/usr/bin/python ../../testing/scripts/run_telemetry_benchmark_as_googletest.py ../../tools/perf/run_benchmark indexeddb_perf -v --upload-results --output-format=chartjson --browser=release --isolated-script-test-output=tmp/output.json --isolated-script-test-chartjson-output=tmp/chartjson-output.json`
    * ssh into the swarming bot and run the test on that machine:
      1. NOTE: this should be a last resort, since it will cause a fifth of
         the benchmarks to continuously fail on the waterfall.
      2. First you need to decommission the swarming bot so other jobs don't
         interfere; file a ticket with go/bugatrooper.
      3. See [remote access to bots](https://sites.google.com/a/google.com/chrome-infrastructure/golo/remote-access?pli=1)
         on how to ssh into the bot and then run the test.
         Rough overview for build161-m1:
         * `prodaccess --chromegolo_ssh`
         * `ssh build161-m1.golo`
         * The password is in valentine under
           "Chrome Golo, Perf, GPU bots - chrome-bot".
         * File a bug to reboot the machine and get it back online in the
           swarming pool.
4. Running local changes on a swarming bot
    * sunspider is used as the example benchmark here, since it is a quick
      one.
    * First, run the test locally to make sure there is no issue with the
      binary or the script running the test on the swarming bot. Make sure
      dir foo exists:
      `python testing/scripts/run_telemetry_benchmark_as_googletest.py tools/perf/run_benchmark sunspider -v --output-format=chartjson --upload-results --browser=reference --output-trace-tag=_ref --isolated-script-test-output=foo/output.json --isolated-script-test-chartjson-output=foo/chart-output.json`
    * Build any dependencies needed in the isolate:
      1. `ninja -C out/Release chrome/test:telemetry_perf_tests`
      2. This target should be enough if you are running a benchmark;
         otherwise, build any targets reported as missing when you build
         the isolate in the next step.
      3. Make sure the [compiler proxy is running](https://sites.google.com/a/google.com/goma/how-to-use-goma/how-to-use-goma-for-chrome-team?pli=1):
         * `./goma_ctl.py ensure_start` from the goma directory
    * Build the isolate:
      1. `python tools/mb/mb.py isolate //out/Release -m chromium.perf -b "Linux Builder" telemetry_perf_tests`
         * `-m` is the master.
         * `-b` is the builder name from mb_config.pyl that corresponds to
           the platform you are running this command on.
         * `telemetry_perf_tests` is the isolate name.
         * You might run into internal source deps when building the
           isolate, depending on the isolate. You might need to update the
           entry in mb_config.pyl for this builder to not be an official
           build, so src/internal isn't required.
    * Archive the isolate and create the isolate hash:
      1. `python tools/swarming_client/isolate.py archive -I isolateserver.appspot.com -i out/Release/telemetry_perf_tests.isolate -s out/Release/telemetry_perf_tests.isolated`
    * Run the test with the hash from the previous step:
      1. Run the hash locally:
         * Note that output paths are local.
         * `./tools/swarming_client/run_isolated.py -I https://isolateserver.appspot.com -s <insert_hash_here> -- sunspider -v --upload-results --output-format=chartjson --browser=reference --output-trace-tag=_ref --isolated-script-test-output=/usr/local/google/home/eyaich/projects/chromium/src/tmp/output.json`
      2. Trigger on a swarming bot:
         * Note that paths use the swarming output dir environment variable
           ISOLATED_OUTDIR, and dimensions are based on the bot and OS you
           are triggering the job on.
         * `python tools/swarming_client/swarming.py trigger -v --isolate-server isolateserver.appspot.com -S chromium-swarm.appspot.com -d id build150-m1 -d pool Chrome-perf -d os Linux -s <insert_hash_here> -- sunspider -v --upload-results --output-format=chartjson --browser=reference --output-trace-tag=_ref --isolated-script-test-output='${ISOLATED_OUTDIR}/output.json' --isolated-script-test-chartjson-output='${ISOLATED_OUTDIR}/chart-output.json'`
         * All args after the '--' are for the swarming task, not for the
           trigger command. The output dirs must be in quotes when
           triggering on a swarming bot.

### Disabling Telemetry Tests

If the test is a telemetry test, its name will have a '.' in it, such as
`thread_times.key_mobile_sites` or `page_cycler.top_10`. The part before the
first dot will be a python file in [tools/perf/benchmarks](https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/).

If a telemetry test is failing and there is no clear culprit to revert
immediately, disable the test. You can do this with the `@benchmark.Disabled`
decorator. **Always add a comment next to your decorator with the bug id which
has background on why the test was disabled, and also include a BUG= line in
(...skipping 66 matching lines...)
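Tying back to the disabling instructions above, here is a minimal sketch of how a `@benchmark.Disabled` annotation reads in a benchmark file. The decorator below is a self-contained stand-in so the snippet runs on its own (in Chromium it is imported from telemetry), and the class name and bug id are hypothetical:

```python
# Stand-in for telemetry's benchmark.Disabled so this sketch is
# self-contained; in Chromium the real decorator comes from telemetry.
def Disabled(*platforms):
    def _mark(benchmark_cls):
        # Record which platforms the benchmark is disabled on.
        benchmark_cls.disabled_platforms = platforms
        return benchmark_cls
    return _mark


@Disabled('android')  # crbug.com/123456 (hypothetical bug id with background)
class ThreadTimesKeyMobileSites(object):
    """Sketch of a benchmark class such as thread_times.key_mobile_sites."""
```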

There is also a weekly debrief that you should see on your calendar, titled
**Weekly Speed Sheriff Retrospective**. For this meeting you should prepare
any highlights or lowlights from your sheriffing shift, as well as any other
feedback you may have that could improve future sheriffing shifts.

<!-- Unresolved issues:
1. Do perf sheriffs watch the bisect waterfall?
2. Do perf sheriffs watch the internal clank waterfall?
-->