Chromium Code Reviews
Side by Side Diff: tools/perf/docs/perf_bot_sheriffing.md

Issue 2621713002: Adding swarming documentation to perf sheriff docs. (Closed)
Patch Set: merge and review comments Created 3 years, 11 months ago
# Perf Bot Sheriffing

The perf bot sheriff is responsible for keeping the bots on the chromium.perf
waterfall up and running, and triaging performance test failures and flakes.

**[Rotation calendar](https://calendar.google.com/calendar/embed?src=google.com_2fpmo740pd1unrui9d7cgpbg2k%40group.calendar.google.com)**

## Key Responsibilities

* [Handle Device and Bot Failures](#Handle-Device-and-Bot-Failures)
<!-- 217 lines skipped -->
    device screen for failing tests, links to which are printed in the logs.
    Often this will immediately reveal failure causes that are opaque from
    the logs alone. On other platforms, DevTools will produce tab
    screenshots as long as the tab did not crash.

### Useful Logs and Debugging Info

1. **Telemetry test runner logs**

    **_Useful Content:_** The best place to start. These logs contain all of
    the python logging information from the telemetry test runner scripts.

    **_Where to find:_** These logs can be found from the buildbot build page.
    Click the _"[stdout]"_ link under any of the telemetry test buildbot steps
    to view the logs. Do not use the "stdio" link, which shows similar
    information but expires earlier and is slower to load.

2. **Android Logcat (Android)**

    **_Useful Content:_** This file contains all Android device logs. All
    Android apps and the Android system log information to logcat. It is a
    good place to look if you believe an issue is device related (an Android
    out-of-memory problem, for example). Additionally, information about
    native crashes is often logged here.

    **_Where to find:_** These logs can be found from the buildbot status
    page. Click the _"logcat dump"_ link under one of the _"gsutil upload"_
    steps.

3. **Test Trace (Android)**

<!-- 10 lines skipped -->

    **_Useful Content:_** Contains symbolized stack traces of any Chrome or
    Android crashes.

    **_Where to find:_** These logs can be found from the buildbot status
    page. The symbolized stack traces can be found under several steps. Click
    the link under the _"symbolized breakpad crashes"_ step to see symbolized
    Chrome crashes. Click the link under _"stack tool with logcat dump"_ to
    see symbolized Android crashes.

## Swarming Bots

As of Q4 2016, all desktop bots have been moved to the swarming pool, with a
goal of moving all Android bots to swarming in early 2017. There is now one
machine on the chromium.perf waterfall for each desktop configuration, which
triggers test tasks on 5 corresponding swarming bots. All of our swarming bots
exist in the
[chrome-perf swarming pool](https://chromium-swarm.appspot.com/botlist?c=id&c=os&c=task&c=status&f=pool%3AChrome-perf&l=100&s=id%3Aasc).

1. Buildbot status page FYIs
    * Every test that is run now has 2-3 recipe steps on the buildbot status
      page associated with it:
      1. '[trigger] <test_name>': you can mostly ignore this step.
      2. '<test_name>': the test that was run on the swarming bot. The
         'shard #0' link on the step takes you to the swarming task page.
      3. '<test_name> Dashboard Upload': the upload of the perf test results
         to the perf dashboard. This step will not be present if the test was
         disabled.
    * We now run all benchmark tests even if they are disabled, but disabled
      tests always return success and you can ignore them. You can identify
      them by the 'DISABLED_BENCHMARK' link under the step and the fact that
      they don't have an upload step after them.
2. Debugging expiring jobs on the waterfall
    * You can tell a job is expiring in one of two ways:
      1. Click on the 'shard #0' link of the failed test and you will see
         EXPIRED on the swarming task page.
      2. There is a 'no_results_exc' and an 'invalid_results_exc' link on the
         failing test step and the dashboard upload step is failing. (Note:
         this could be an EXPIRED job or a TIMEOUT. An EXPIRED job means the
         task never got scheduled within the 5 hour swarming timeout; a
         TIMEOUT means it started running but couldn't finish before the
         5 hour swarming timeout.)
    * You can quickly see which bots the jobs are expiring or timing out on
      with the 'Bot id' annotation on the failing test step.
    * Troubleshooting why jobs are expiring:
      1. The bot might be down. Check the chrome-perf pool for that bot id
         and file a ticket with go/bugatrooper if the bot is down.
         * You can also identify a down bot through
           [viceroy](https://viceroy.corp.google.com/chrome_infra/Machines/per_machine):
           search for a bot id; if the graph stops, the bot is down.
      2. Otherwise, check the swarming task list for each bot that has
         failing jobs and examine what might be going on (there is a good
         [video](https://youtu.be/gRa0LvICthk) from maruel@ on the swarming
         UI and how to filter and search bot task lists; for example, you can
         filter on bot id and name to examine the last n runs of a test).
         * A test might be timing out on a bot, causing subsequent tests to
           expire: they would pass normally but never get scheduled because
           of the timed-out test. Debug the timed-out test.
         * A test might be taking longer than normal while still passing, and
           the extra execution time causes other, unrelated tests to fail.
           Compare the last passing run with the first failing run to see if
           a test is taking significantly longer, and debug that issue.
3. Reproducing swarming task runs
    * Reproduce on a local machine using the same inputs as the bot:
      1. Note that the local machine's spec must roughly match that of the
         swarming bot.
      2. See 'Reproducing the task locally' on the swarming task page.
      3. First run the command under
         'Download input files into directory foo'.
      4. cd into foo/out/Release of those downloaded inputs.
      5. Execute the test from this directory. The command you are looking
         for should be at the top of the logs; you just need to update the
         `--isolated-script-test-output=/b/s/w/ioFB73Qz/output.json` and
         `--isolated-script-test-chartjson-output=/b/s/w/ioFB73Qz/chartjson-output.json`
         flags to be local paths.
      6. Example, with tmp as a locally created dir:
         `/usr/bin/python ../../testing/scripts/run_telemetry_benchmark_as_googletest.py ../../tools/perf/run_benchmark indexeddb_perf -v --upload-results --output-format=chartjson --browser=release --isolated-script-test-output=tmp/output.json --isolated-script-test-chartjson-output=tmp/chartjson-output.json`
    * ssh into the swarming bot and run the test on that machine:
      1. NOTE: this should be a last resort, since it will cause a fifth of
         the benchmarks to continuously fail on the waterfall.
      2. First you need to decommission the swarming bot so other jobs don't
         interfere; file a ticket with go/bugatrooper.
      3. See [remote access to bots](https://sites.google.com/a/google.com/chrome-infrastructure/golo/remote-access?pli=1)
         on how to ssh into the bot and then run the test.
         Rough overview for build161-m1:
         * `prodaccess --chromegolo_ssh`
         * `ssh build161-m1.golo`
         * The password is in valentine under
           "Chrome Golo, Perf, GPU bots - chrome-bot".
         * File a bug to reboot the machine to get it back online in the
           swarming pool.
4. Running local changes on a swarming bot
    * Using sunspider as the example benchmark since it is a quick one.
    * First, run the test locally to make sure there is no issue with the
      binary or the script running the test on the swarming bot. Make sure
      dir foo exists:
      `python testing/scripts/run_telemetry_benchmark_as_googletest.py tools/perf/run_benchmark sunspider -v --output-format=chartjson --upload-results --browser=reference --output-trace-tag=_ref --isolated-script-test-output=foo/output.json --isolated-script-test-chartjson-output=foo/chart-output.json`
    * Build any dependencies needed in the isolate:
      1. `ninja -C out/Release chrome/test:telemetry_perf_tests`
      2. This target should be enough if you are running a benchmark;
         otherwise, build any targets that are reported missing when building
         the isolate below.
      3. Make sure the [compiler proxy is running](https://sites.google.com/a/google.com/goma/how-to-use-goma/how-to-use-goma-for-chrome-team?pli=1):
         * `./goma_ctl.py ensure_start` from the goma directory
    * Build the isolate:
      1. `python tools/mb/mb.py isolate //out/Release -m chromium.perf -b "Linux Builder" telemetry_perf_tests`
         * -m is the master
         * -b is the builder name from mb_config.pyl that corresponds to the
           platform you are running this command on
         * telemetry_perf_tests is the isolate name
         * You might run into internal source deps when building the isolate,
           depending on the isolate. You might need to update the entry in
           mb_config.pyl for this builder to not be an official build so
           src/internal isn't required.
    * Archive and create the isolate hash:
      1. `python tools/swarming_client/isolate.py archive -I isolateserver.appspot.com -i out/Release/telemetry_perf_tests.isolate -s out/Release/telemetry_perf_tests.isolated`
    * Run the test with the hash from the previous step:
      1. Run the hash locally:
         * Note that output paths are local.
         * `./tools/swarming_client/run_isolated.py -I https://isolateserver.appspot.com -s <insert_hash_here> -- sunspider -v --upload-results --output-format=chartjson --browser=reference --output-trace-tag=_ref --isolated-script-test-output=/usr/local/google/home/eyaich/projects/chromium/src/tmp/output.json`
      2. Trigger on a swarming bot:
         * Note that output paths use the swarming output dir environment
           variable ISOLATED_OUTDIR, and the dimensions are based on the bot
           and OS you are triggering the job on.
         * `python tools/swarming_client/swarming.py trigger -v --isolate-server isolateserver.appspot.com -S chromium-swarm.appspot.com -d id build150-m1 -d pool Chrome-perf -d os Linux -s <insert_hash_here> -- sunspider -v --upload-results --output-format=chartjson --browser=reference --output-trace-tag=_ref --isolated-script-test-output='${ISOLATED_OUTDIR}/output.json' --isolated-script-test-chartjson-output='${ISOLATED_OUTDIR}/chart-output.json'`
         * All args after the '--' are for the swarming task and not for the
           trigger command. The output dirs must be in quotes when triggering
           on a swarming bot.
396
### Disabling Telemetry Tests

If the test is a telemetry test, its name will have a '.' in it, such as
`thread_times.key_mobile_sites` or `page_cycler.top_10`. The part before the
first dot will be a python file in [tools/perf/benchmarks](https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/).

If a telemetry test is failing and there is no clear culprit to revert
immediately, disable the test. You can do this with the `@benchmark.Disabled`
decorator. **Always add a comment next to your decorator with the bug id which
has background on why the test was disabled, and also include a BUG= line in
<!-- 66 lines skipped -->
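The `@benchmark.Disabled` mechanism described above boils down to tagging a
benchmark class with the platforms it should be skipped on. Below is a minimal,
self-contained sketch of that tagging pattern; it is a toy stand-in, not
Telemetry's actual implementation (the real decorator lives in the `telemetry`
package), and the bug id in the comment is made up for illustration:

```python
def Disabled(*platforms):
    """Toy decorator: record the platforms a benchmark is disabled on.

    Pass 'all' to disable the benchmark everywhere. This only mimics the
    tagging pattern of Telemetry's @benchmark.Disabled; it is not the real
    implementation.
    """
    def decorator(cls):
        cls.disabled_platforms = set(platforms)
        return cls
    return decorator


# Disabled due to flaky timeouts on Android; crbug.com/123456 (made-up id).
@Disabled('android')
class ThreadTimesKeyMobileSites(object):
    name = 'thread_times.key_mobile_sites'


def is_enabled(benchmark_cls, platform):
    """Return True if the benchmark should run on the given platform."""
    disabled = getattr(benchmark_cls, 'disabled_platforms', set())
    return 'all' not in disabled and platform not in disabled


print(is_enabled(ThreadTimesKeyMobileSites, 'android'))  # False
print(is_enabled(ThreadTimesKeyMobileSites, 'linux'))    # True
```

In real Telemetry code you would use the decorator provided by the framework
rather than defining your own, keeping the bug-reference comment directly next
to it as the section requires.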

There is also a weekly debrief that you should see on your calendar titled
**Weekly Speed Sheriff Retrospective**. For this meeting you should prepare
any highlights or lowlights from your sheriffing shift as well as any other
feedback you may have that could improve future sheriffing shifts.

<!-- Unresolved issues:
1. Do perf sheriffs watch the bisect waterfall?
2. Do perf sheriffs watch the internal clank waterfall?
-->