Chromium Code Reviews
# Perf Bot Sheriffing

The perf bot sheriff is responsible for keeping the bots on the chromium.perf
waterfall up and running, and triaging performance test failures and flakes.

**[Rotation calendar](https://calendar.google.com/calendar/embed?src=google.com_2fpmo740pd1unrui9d7cgpbg2k%40group.calendar.google.com)**

## Key Responsibilities

* [Handle Device and Bot Failures](#Handle-Device-and-Bot-Failures)
(...skipping 220 matching lines...)
    3. Enter the **Bug ID** from step 1, the **Good Revision** (the last
       commit pos data was received from), the **Bad Revision** (the
       latest commit pos), and set **Bisect mode** to `return_code`.
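As a concrete illustration of that step (all values below are made-up placeholders, not real bug or revision numbers), the bisect form might be filled in like:

```
Bug ID:        678910       (bug filed in step 1; hypothetical)
Good Revision: 416000       (last commit pos data was received from; hypothetical)
Bad Revision:  416120       (latest commit pos; hypothetical)
Bisect mode:   return_code
```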
* [Debugging telemetry failures](https://www.chromium.org/developers/telemetry/diagnosing-test-failures)
* On Android and Mac, you can view platform-level screenshots of the
  device screen for failing tests; links to these are printed in the logs.
  Often this will immediately reveal failure causes that are opaque from
  the logs alone. On other platforms, DevTools will produce tab
  screenshots as long as the tab did not crash.

## Swarming Bots

As of Q4 2016, all desktop bots have been moved to the swarming pool, with a
goal of moving all Android bots to swarming in early 2017. There is now one
machine on the chromium.perf waterfall for each desktop configuration, which
triggers test tasks on 5 corresponding swarming bots. All of our swarming bots
exist in the [chrome-perf swarming pool](https://chromium-swarm.appspot.com/botlist?c=id&c=os&c=task&c=status&f=pool%3AChrome-perf&l=100&s=id%3Aasc).

1. Buildbot status page FYIs
    * Every test that is run now has 2-3 recipe steps on the buildbot status
      page associated with it:
      1. '[trigger] <test_name>': you can mostly ignore this step.
      2. '<test_name>': the test that was run on the swarming bot. The
         'shard #0' link on the step takes you to the swarming task page.
      3. '<test_name> Dashboard Upload': the upload of the perf test
         results to the perf dashboard. This step will not be present if
         the test was disabled.
    * We now run all benchmark tests even if they are disabled, but disabled
      tests will always return success and you can ignore them. You can
      identify them by the 'DISABLED_BENCHMARK' link under the step and the
      fact that they don't have an upload step after them.
2. Debugging expiring jobs on the waterfall
    * You can tell a job is expiring in one of two ways:
      1. Click on the 'shard #0' link of the failed test and you will see
         EXPIRED on the swarming task page.
      2. There is a 'no_results_exc' and an 'invalid_results_exc' link on
         the failing buildbot test step, with the dashboard upload step
         failing. (Note: this could be an EXPIRED job or a TIMEOUT. An
         expired job means the task never got scheduled within the 5-hour
         swarming timeout; TIMEOUT means it started running but couldn't
         finish before the 5-hour swarming timeout.)
    * You can quickly see which bots the jobs are expiring/timing out on via
      the 'Bot id' annotation on the failing test step.
    * Troubleshooting why they are expiring:
      1. The bot might be down. Check the chrome-perf pool for that bot id,
         and file a ticket with go/bugatrooper if the bot is down.
      2. Otherwise, check the bot's swarming task list for each bot that
         has failing jobs and examine what might be going on (there is a
         good [video](https://youtu.be/gRa0LvICthk) from maruel@ on the
         swarming UI and how to filter and search bot task lists; for
         example, you can filter on bot id and name to examine the last n
         runs of a test).
         * A test might be timing out on a bot, causing subsequent tests
           to expire: they would pass normally but never get scheduled
           because of the timed-out test. Debug the timed-out test.
         * A test might be taking longer than normal but still passing,
           and the extra execution time causes other unrelated tests to
           fail. Compare the last passing run to the first failing run,
           see if you can spot a test that is taking significantly
           longer, and debug that issue.
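As a quick aid for the EXPIRED-vs-TIMEOUT distinction above, here is a minimal sketch that interprets the task-result JSON shown on the swarming task page. The `state` values follow the swarming task-result schema; treat the exact field names as assumptions rather than a definitive API contract:

```python
# Sketch: interpret a swarming task-result dict (as shown on the task
# page) to tell an expired task from a timed-out one. The 'state' values
# ('EXPIRED', 'TIMED_OUT') follow the swarming task-result schema; the
# field names here are assumptions for illustration.
def diagnose(task_result):
    state = task_result.get('state')
    if state == 'EXPIRED':
        # Never got scheduled on a bot within the swarming timeout.
        return 'expired: task was never scheduled; check whether the bot is down'
    if state == 'TIMED_OUT':
        # Started running but did not finish within the swarming timeout.
        return 'timed out: task started but ran too long; debug the slow test'
    return 'state=%s: see the task page for details' % state

print(diagnose({'state': 'EXPIRED'}))
print(diagnose({'state': 'TIMED_OUT'}))
```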
3. Reproducing swarming task runs
    * Reproduce on a local machine using the same inputs as the bot:
      1. Note that the local machine's spec must roughly match that of the
         swarming bot.
      2. See 'Reproducing the task locally' on the swarming task page.
      3. First run the command under
         'Download input files into directory foo'.
      4. cd into foo/out/Release within those downloaded inputs.
      5. Execute the test from this directory. The command you are looking
         for should be at the top of the logs; you just need to update the
         `--isolated-script-test-output=/b/s/w/ioFB73Qz/output.json` and
         `--isolated-script-test-chartjson-output=/b/s/w/ioFB73Qz/chartjson-output.json`
         flags to point at local paths.
      6. Example with tmp as a locally created dir:
         `/usr/bin/python ../../testing/scripts/run_telemetry_benchmark_as_googletest.py ../../tools/perf/run_benchmark indexeddb_perf -v --upload-results --output-format=chartjson --browser=release --isolated-script-test-output=tmp/output.json --isolated-script-test-chartjson-output=tmp/chartjson-output.json`
    * ssh into the swarming bot and run the test on that machine:
      1. NOTE: this should be a last resort, since it will cause a fifth of
         the benchmarks to continuously fail on the waterfall.
      2. First you need to decommission the swarming bot so other jobs don't
         interfere; file a ticket with go/bugatrooper.
      3. See [remote access to bots](https://sites.google.com/a/google.com/chrome-infrastructure/golo/remote-access?pli=1)
         on how to ssh into the bot and then run the test.
         Rough overview for build161-m1:
         * `prodaccess --chromegolo_ssh`
         * `ssh build161-m1.golo`
         * The password is in valentine under
           "Chrome Golo, Perf, GPU bots - chrome-bot".
         * File a bug to reboot the machine and get it back online in the
           swarming pool.
4. Running local changes on a swarming bot
    * sunspider is used as the example benchmark here, since it is a quick
      one.
    * First, run the test locally to make sure there is no issue with the
      binary or the script running the test on the swarming bot. Make sure
      dir foo exists:
      `python testing/scripts/run_telemetry_benchmark_as_googletest.py tools/perf/run_benchmark sunspider -v --output-format=chartjson --upload-results --browser=reference --output-trace-tag=_ref --isolated-script-test-output=foo/output.json --isolated-script-test-chartjson-output=foo/chart-output.json`
    * Build any dependencies needed in the isolate:
      1. `ninja -C out/Release chrome/test:telemetry_perf_tests`
      2. This target should be enough if you are running a benchmark;
         otherwise, build any targets reported as missing when you build
         the isolate in the next step.
      3. Make sure the [compiler proxy is running](https://sites.google.com/a/google.com/goma/how-to-use-goma/how-to-use-goma-for-chrome-team?pli=1):
         * `./goma_ctl.py ensure_start` from the goma directory
    * Build the isolate:
      1. `python tools/mb/mb.py isolate //out/Release -m chromium.perf -b "Linux Builder" telemetry_perf_tests`
         * `-m` is the master.
         * `-b` is the builder name from mb_config.pyl that corresponds to
           the platform you are running this command on.
         * `telemetry_perf_tests` is the isolate name.
         * You might run into internal source deps when building the
           isolate, depending on the isolate. You might need to update the
           entry in mb_config.pyl for this builder to not be an official
           build, so src/internal isn't required.
    * Archive the isolate and create the isolate hash:
      1. `python tools/swarming_client/isolate.py archive -I isolateserver.appspot.com -i out/Release/telemetry_perf_tests.isolate -s out/Release/telemetry_perf_tests.isolated`
    * Run the test with the hash from the previous step:
      1. Run the hash locally:
         * Note that output paths are local.
         * `./tools/swarming_client/run_isolated.py -I https://isolateserver.appspot.com -s <insert_hash_here> -- sunspider -v --upload-results --output-format=chartjson --browser=reference --output-trace-tag=_ref --isolated-script-test-output=/usr/local/google/home/eyaich/projects/chromium/src/tmp/output.json`
      2. Trigger on a swarming bot:
         * Note that paths use the swarming output dir environment variable
           ISOLATED_OUTDIR, and dimensions are based on the bot and OS you
           are triggering the job on.
         * `python tools/swarming_client/swarming.py trigger -v --isolate-server isolateserver.appspot.com -S chromium-swarm.appspot.com -d id build150-m1 -d pool Chrome-perf -d os Linux -s <insert_hash_here> -- sunspider -v --upload-results --output-format=chartjson --browser=reference --output-trace-tag=_ref --isolated-script-test-output='${ISOLATED_OUTDIR}/output.json' --isolated-script-test-chartjson-output='${ISOLATED_OUTDIR}/chart-output.json'`
         * All args after the '--' are for the swarming task, not for the
           trigger command. The output dirs must be in quotes when
           triggering on a swarming bot.

### Disabling Telemetry Tests

If the test is a telemetry test, its name will have a '.' in it, such as
`thread_times.key_mobile_sites` or `page_cycler.top_10`. The part before the
first dot will be a python file in [tools/perf/benchmarks](https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/).

If a telemetry test is failing and there is no clear culprit to revert
immediately, disable the test. You can do this with the `@benchmark.Disabled`
decorator. **Always add a comment next to your decorator with the bug id which
has background on why the test was disabled, and also include a BUG= line in
(...skipping 66 matching lines...)
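Tying back to the disabling instructions above, here is a minimal sketch of how a `@benchmark.Disabled` annotation reads in a benchmark file. The decorator below is a self-contained stand-in so the snippet runs on its own (in Chromium it is imported from telemetry), and the class name and bug id are hypothetical:

```python
# Stand-in for telemetry's benchmark.Disabled so this sketch is
# self-contained; in Chromium the real decorator comes from telemetry.
def Disabled(*platforms):
    def _mark(benchmark_cls):
        # Record which platforms the benchmark is disabled on.
        benchmark_cls.disabled_platforms = platforms
        return benchmark_cls
    return _mark


@Disabled('android')  # crbug.com/123456 (hypothetical bug id with background)
class ThreadTimesKeyMobileSites(object):
    """Sketch of a benchmark class such as thread_times.key_mobile_sites."""
```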

There is also a weekly debrief that you should see on your calendar, titled
**Weekly Speed Sheriff Retrospective**. For this meeting you should prepare
any highlights or lowlights from your sheriffing shift, as well as any other
feedback you may have that could improve future sheriffing shifts.

<!-- Unresolved issues:
1. Do perf sheriffs watch the bisect waterfall?
2. Do perf sheriffs watch the internal clank waterfall?
-->