Chromium Code Reviews
| Index: tools/perf/docs/perf_bot_sheriffing.md |
| diff --git a/tools/perf/docs/perf_bot_sheriffing.md b/tools/perf/docs/perf_bot_sheriffing.md |
| index 28325f113db42da2c27f64af9c32de2dd9477db2..b020ae1825ce828474bafb272ee82ded1e54a3f1 100644 |
| --- a/tools/perf/docs/perf_bot_sheriffing.md |
| +++ b/tools/perf/docs/perf_bot_sheriffing.md |
| @@ -238,6 +238,122 @@ be investigated. When a test fails: |
| the logs alone. On other platforms, Devtools will produce tab |
| screenshots as long as the tab did not crash. |
| +## Swarming Bots |
| +As of Q4 2016, all desktop bots have been moved to the swarming pool, with a |
| +goal of moving all Android bots to swarming in early 2017. There is now one |
| +machine on the chromium.perf waterfall for each desktop configuration, which |
| +triggers test tasks on 5 corresponding swarming bots. All of our swarming bots |
| +exist in the [chrome-perf swarming pool](https://chromium-swarm.appspot.com/botlist?c=id&c=os&c=task&c=status&f=pool%3AChrome-perf&l=100&s=id%3Aasc). |
| + |
| +1. Buildbot status page FYIs |
| + * Every test that is run now has 2-3 recipe steps on the buildbot status |
| + page associated with it |
| + 1. '[trigger] <test_name>' step (you can mostly ignore this) |
| + 2. '<test_name>': the test that was run on the swarming bot. The |
| + 'shard #0' link on the step takes you to the swarming task page. |
| + 3. '<test_name> Dashboard Upload': the upload of the perf test |
| + results to the perf dashboard. This will not be present if the |
| + test was disabled. |
| + * We now run all benchmark tests even if they are disabled, but disabled |
| + tests will always return success and you can ignore them. You can |
| + identify these by the 'DISABLED_BENCHMARK' link under the step and the |
| + fact that they don’t have an upload step after them. |
| +2. Debugging Expiring Jobs on the waterfall |
| + * You can tell a job is expiring in one of two ways: |
| + 1. Click on the 'shard #0' link of the failed test and you will see |
| + EXPIRED on the swarming task page |
| + 2. If there is a 'no_results_exc' and an 'invalid_results_exc' link on |
| + the buildbot failing test step with the dashboard upload step |
| + failing. (Note: this could be an EXPIRED job or a TIMEOUT. An |
| + EXPIRED job means the task never got scheduled within the 5 hour |
| + swarming timeout; TIMEOUT means it started running but couldn’t |
| + finish before the 5 hour swarming timeout.) |
| + * You can quickly see what bots the jobs are expiring/timing out on with |
| + the 'Bot id' annotation on the failing test step. |
| + * Troubleshooting why they are expiring |
| + 1. Bot might be down. Check the chrome-perf pool for that bot id and |
| + file a ticket with go/bugatrooper if the bot is down. |
| + 2. Otherwise check the bot's swarming task list for each bot that |
| + has failing jobs and examine what might be going on (good [video](https://youtu.be/gRa0LvICthk) |
| + from maruel@ on the swarming UI and how to filter and search bot |
| + task lists; for example, you can filter on bot id and name to |
| + examine the last n runs of a test). |
| + * A test might be timing out on a bot, causing subsequent tests |
| + to expire: they would pass normally but never get scheduled |
| + because of the timed-out test. Debug the timing-out test. |
| + * A test might be taking longer than normal while still passing, |
| + and the extra execution time causes other unrelated tests to |
| + fail. Compare the last passing run with the first failing run |
| + and see if a test is taking significantly longer, then debug |
| + that issue. |
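The 'Bot id' annotation can be turned into the swarming UI pages to check directly. A small helper sketch; the URL patterns are assumptions based on the chromium-swarm botlist link above, and `build150-m1` is just an example bot id:

```shell
# Hypothetical helper: turn a bot id (from the 'Bot id' annotation on the
# failing step) into swarming UI links for that bot.
bot_links() {
  bot_id="$1"
  # Bot overview page (status, current task).
  echo "bot page:  https://chromium-swarm.appspot.com/bot?id=${bot_id}"
  # Task list filtered to this bot, for scanning the last n runs.
  echo "task list: https://chromium-swarm.appspot.com/tasklist?f=bot_id%3A${bot_id}"
}

bot_links build150-m1
```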
| +3. Reproducing swarming task runs |
| + * Reproduce on local machine using same inputs as bot |
| + 1. Note that the local machine's spec must roughly match that of the |
| + swarming bot |
| + 2. See 'Reproducing the task locally' on swarming task page |
| + 3. First run the command under |
| + 'Download input files into directory foo' |
| + 4. cd into foo/out/Release of the downloaded inputs |
| + 5. Execute the test from this directory. The command you are |
| + looking for is at the top of the logs; you just need to update the |
| + `--isolated-script-test-output=/b/s/w/ioFB73Qz/output.json` and |
| + `--isolated-script-test-chartjson-output=/b/s/w/ioFB73Qz/chartjson-output.json` |
| + flags to point to local paths. |
| + 6. Example with tmp as locally created dir: |
| + `/usr/bin/python ../../testing/scripts/run_telemetry_benchmark_as_googletest.py ../../tools/perf/run_benchmark indexeddb_perf -v --upload-results --output-format=chartjson --browser=release --isolated-script-test-output=tmp/output.json --isolated-script-test-chartjson-output=tmp/chartjson-output.json` |
| + * ssh into swarming bot and run test on that machine |
| + 1. NOTE: this should be a last resort since it will cause a fifth of |
| + the benchmarks to continuously fail on the waterfall |
| + 2. First you need to decommission the swarming bot so other jobs |
| + don’t interfere; file a ticket with go/bugatrooper. |
| + 3. See [remote access to bots](https://sites.google.com/a/google.com/chrome-infrastructure/golo/remote-access?pli=1) |
| + on how to ssh into the bot and then run the test. |
| + Rough overview for build161-m1 |
| + * prodaccess --chromegolo_ssh |
| + * ssh build161-m1.golo |
| + * Password is in valentine |
| + "Chrome Golo, Perf, GPU bots - chrome-bot" |
| + * File a bug to reboot the machine to get it online in the |
| + swarming pool again |
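The local-reproduction steps above amount to re-running the task's command with the bot's `/b/s/w/...` output paths swapped for local ones. A sketch, using the indexeddb_perf example from step 6; the paths are placeholders, and `DRY_RUN` (defaulting to 1) only prints the command since no chromium checkout is assumed here:

```shell
# Sketch of step 5 above: rewrite the bot's output flags to local paths.
# Intended to run from foo/out/Release after downloading the task's inputs.
OUT_DIR="$(pwd)/tmp"
mkdir -p "$OUT_DIR"

# Same command line as the task, with /b/s/w/... paths replaced.
CMD="/usr/bin/python ../../testing/scripts/run_telemetry_benchmark_as_googletest.py \
../../tools/perf/run_benchmark indexeddb_perf -v --upload-results \
--output-format=chartjson --browser=release \
--isolated-script-test-output=$OUT_DIR/output.json \
--isolated-script-test-chartjson-output=$OUT_DIR/chartjson-output.json"

# DRY_RUN=1 (the default) just prints the rewritten command.
if [ "${DRY_RUN:-1}" = "1" ]; then
  echo "$CMD"
else
  eval "$CMD"
fi
```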
| +4. Running local changes on swarming bot |
| + * Using sunspider as the example benchmark since it is a quick one |
| + * First, run test locally to make sure there is no issue with the binary |
| + or the script running the test on the swarming bot. Make sure dir foo |
| + exists: |
| + `python testing/scripts/run_telemetry_benchmark_as_googletest.py tools/perf/run_benchmark sunspider -v --output-format=chartjson --upload-results --browser=reference --output-trace-tag=_ref --isolated-script-test-output=foo/output.json --isolated-script-test-chartjson-output=foo/chart-output.json` |
| + * Build any dependencies needed in isolate: |
| + 1. ninja -C out/Release chrome/test:telemetry_perf_tests |
| + 2. This target should be enough if you are running a benchmark; |
| + otherwise build any targets reported as missing when you build |
| + the isolate below. |
| + 3. Make sure [compiler proxy is running](https://sites.google.com/a/google.com/goma/how-to-use-goma/how-to-use-goma-for-chrome-team?pli=1) |
| + * Run `./goma_ctl.py ensure_start` from the goma directory |
| + * Build the isolate |
| + 1. `python tools/mb/mb.py isolate //out/Release -m chromium.perf -b "Linux Builder" telemetry_perf_tests` |
| + * -m is the master |
| + * -b is the builder name from mb_config.pyl that corresponds to |
| + the platform you are running this command on |
| + * telemetry_perf_tests is the isolate name |
| + * You might run into internal source deps when building the |
| + isolate, depending on the isolate. You might need to update the |
| + entry in mb_config.pyl for this builder to not be an official |
| + build so src/internal isn’t required |
| + * Archive and create the isolate hash |
| + 1. `python tools/swarming_client/isolate.py archive -I isolateserver.appspot.com -i out/Release/telemetry_perf_tests.isolate -s out/Release/telemetry_perf_tests.isolated` |
| + * Run the test with the hash from the archive step |
| + 1. Run hash locally |
| + * Note output paths are local |
| + * `./tools/swarming_client/run_isolated.py -I https://isolateserver.appspot.com -s <insert_hash_here> -- sunspider -v --upload-results --output-format=chartjson --browser=reference --output-trace-tag=_ref --isolated-script-test-output=/usr/local/google/home/eyaich/projects/chromium/src/tmp/output.json` |
| + 2. Trigger on swarming bot |
| + * Note paths are using swarming output dir environment variable |
| + ISOLATED_OUTDIR and dimensions are based on the bot and os you |
| + are triggering the job on |
| + * `python tools/swarming_client/swarming.py trigger -v --isolate-server isolateserver.appspot.com -S chromium-swarm.appspot.com -d id build150-m1 -d pool Chrome-perf -d os Linux -s <insert_hash_here> -- sunspider -v --upload-results --output-format=chartjson --browser=reference --output-trace-tag=_ref --isolated-script-test-output='${ISOLATED_OUTDIR}/output.json' --isolated-script-test-chartjson-output='${ISOLATED_OUTDIR}/chart-output.json'` |
| + * All args after the '--' are for the swarming task, not for |
| + the trigger command. The output dirs must be in quotes when |
| + triggering on a swarming bot. |
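Taken together, running local changes on a swarming bot is a four-step pipeline: build the target, build the isolate, archive it, then trigger. The sketch below only prints each command (the builder name, bot id, and hash are illustrative placeholders from the examples above; no chromium checkout is assumed):

```shell
# Print the build -> isolate -> archive -> trigger commands from section 4.
BENCHMARK=sunspider
BUILDER="Linux Builder"   # builder name from mb_config.pyl for this platform
BOT_ID=build150-m1        # target swarming bot (placeholder)
HASH="<insert_hash_here>" # printed by the archive step

# Record each command instead of executing it.
CMDS_FILE="$(mktemp)"
run() { echo "$*" >> "$CMDS_FILE"; echo "+ $*"; }

run ninja -C out/Release chrome/test:telemetry_perf_tests
run python tools/mb/mb.py isolate //out/Release -m chromium.perf \
    -b "$BUILDER" telemetry_perf_tests
run python tools/swarming_client/isolate.py archive -I isolateserver.appspot.com \
    -i out/Release/telemetry_perf_tests.isolate \
    -s out/Release/telemetry_perf_tests.isolated
run python tools/swarming_client/swarming.py trigger -v \
    --isolate-server isolateserver.appspot.com -S chromium-swarm.appspot.com \
    -d id "$BOT_ID" -d pool Chrome-perf -d os Linux -s "$HASH" -- \
    "$BENCHMARK" -v --upload-results --output-format=chartjson \
    --browser=reference --output-trace-tag=_ref \
    --isolated-script-test-output='${ISOLATED_OUTDIR}/output.json' \
    --isolated-script-test-chartjson-output='${ISOLATED_OUTDIR}/chart-output.json'
```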
| + |
| ### Disabling Telemetry Tests |
| If the test is a telemetry test, its name will have a '.' in it, such as |