| Index: tools/perf/docs/perf_bot_sheriffing.md
|
| diff --git a/tools/perf/docs/perf_bot_sheriffing.md b/tools/perf/docs/perf_bot_sheriffing.md
|
| index c907e172377a262acb8fcfe2656ef999efbc13c2..7799812c68c4cdb324b0200c8c7eb2dec2f141b4 100644
|
| --- a/tools/perf/docs/perf_bot_sheriffing.md
|
| +++ b/tools/perf/docs/perf_bot_sheriffing.md
|
| @@ -235,7 +235,7 @@ be investigated. When a test fails:
|
| 1. **Telemetry test runner logs**
|
|
|
| **_Useful Content:_** Best place to start. These logs contain all of the
|
| - python logging information from the telemetry test runner scripts.
|
| + python logging information from the telemetry test runner scripts.
|
|
|
| **_Where to find:_** These logs can be found from the buildbot build page.
|
| Click the _"[stdout]"_ link under any of the telemetry test buildbot steps
|
| @@ -244,7 +244,7 @@ be investigated. When a test fails:
|
|
|
| 2. **Android Logcat (Android)**
|
|
|
| - **_Useful Content:_** This file contains all Android device logs. All
|
| + **_Useful Content:_** This file contains all Android device logs. All
|
| Android apps and the Android system will log information to logcat. Good
|
| place to look if you believe an issue is device related
|
| (Android out-of-memory problem for example). Additionally, often information
|
| @@ -275,6 +275,125 @@ be investigated. When a test fails:
|
| Click link under _"stack tool with logcat dump"_ to see symbolized Android
|
| crashes.
|
|
|
| +## Swarming Bots
|
| +As of Q4 2016 all desktop bots have been moved to the swarming pool, with a goal
|
| +of moving all Android bots to swarming in early 2017. There is now one machine
|
| +on the chromium.perf waterfall for each desktop configuration that triggers
|
| +test tasks on 5 corresponding swarming bots. All of our swarming bots exist in
|
| +the [chrome-perf swarming pool](https://chromium-swarm.appspot.com/botlist?c=id&c=os&c=task&c=status&f=pool%3AChrome-perf&l=100&s=id%3Aasc).
|
| +
|
| +1. Buildbot status page FYIs
|
| + * Every test that is run now has 2-3 recipe steps on the buildbot status
|
| + page associated with it
|
| + 1. '[trigger] <test_name>' step (you can mostly ignore this)
|
| +      2. '<test_name>' This is the test that was run on the swarming bot; the
|
| +         'shard #0' link on the step takes you to the swarming task page.
|
| +      3. '<test_name> Dashboard Upload' This is the upload of the perf test
|
| +         results to the perf dashboard. This will not be present if the test
|
| +         was disabled.
|
| +   * We now run all benchmark tests even if they are disabled, but disabled
|
| +     tests will always return success and you can ignore them. You can
|
| +     identify these by the 'DISABLED_BENCHMARK' link under the step and the
|
| +     fact that they don't have an upload step after them.
|
| +2. Debugging Expiring Jobs on the waterfall
|
| + * You can tell a job is expiring in one of two ways:
|
| + 1. Click on the 'shard #0' link of the failed test and you will see
|
| + EXPIRED on the swarming task page
|
| +      2. If there are 'no_results_exc' and 'invalid_results_exc' links on
|
| +         the buildbot failing test step with the dashboard upload step
|
| +         failing. (Note: this could be an EXPIRED job or a TIMEOUT. An
|
| +         EXPIRED job means the task never got scheduled within the 5 hour
|
| +         swarming timeout and TIMEOUT means it started running but couldn't
|
| +         finish before the 5 hour swarming timeout.)
|
| +   * You can quickly see which bots the jobs are expiring/timing out on with
|
| +     the 'Bot id' annotation on the failing test step.
|
| + * Troubleshooting why they are expiring
|
| + 1. Bot might be down, check the chrome-perf pool for that bot-id and
|
| + file a ticket with go/bugatrooper if the bot is down.
|
| +        * You can also identify a down bot through [viceroy](https://viceroy.corp.google.com/chrome_infra/Machines/per_machine).
|
| +          Search for a bot id; if the graph stops, the bot
|
| +          is down.
|
| +      2. Otherwise check the bot's swarming task list for each bot that
|
| +         has failing jobs and examine what might be going on (good [video](https://youtu.be/gRa0LvICthk)
|
| +         from maruel@ on the swarming UI and how to filter and search bot
|
| +         task lists; for example, you can filter on bot-id and name to
|
| +         examine the last n runs of a test).
|
| +        * A test might be timing out on a bot, causing subsequent
|
| +          tests to expire: they would pass normally but never
|
| +          get scheduled because of the timing-out test. Debug the
|
| +          timing-out test.
|
| +        * A test might be taking longer than normal but still
|
| +          passing, and the extra execution time causes other unrelated
|
| +          tests to fail. Compare the last passing run with the first
|
| +          failing run and see if you can spot a test that is taking
|
| +          significantly longer, then debug that issue.
|
| +3. Reproducing swarming task runs
|
| + * Reproduce on local machine using same inputs as bot
|
| +      1. Note that the local machine's spec must roughly match that of the
|
| +         swarming bot
|
| + 2. See 'Reproducing the task locally' on swarming task page
|
| + 3. First run the command under
|
| + 'Download input files into directory foo'
|
| +      4. cd into foo/out/Release of those downloaded inputs
|
| +      5. Execute the test from this directory. The command you are looking
|
| +         for should be at the top of the logs; you just need to update the
|
| +         `--isolated-script-test-output=/b/s/w/ioFB73Qz/output.json` and
|
| +         `--isolated-script-test-chartjson-output=/b/s/w/ioFB73Qz/chartjson-output.json`
|
| +         flags to be local paths
|
| + 6. Example with tmp as locally created dir:
|
| + `/usr/bin/python ../../testing/scripts/run_telemetry_benchmark_as_googletest.py ../../tools/perf/run_benchmark indexeddb_perf -v --upload-results --output-format=chartjson --browser=release --isolated-script-test-output=tmp/output.json --isolated-script-test-chartjson-output=tmp/chartjson-output.json`
|
| + * ssh into swarming bot and run test on that machine
|
| +      1. NOTE: this should be a last resort since it will cause a fifth of
|
| +         the benchmarks to continuously fail on the waterfall
|
| +      2. First you need to decommission the swarming bot so other jobs don't
|
| +         interfere; file a ticket with go/bugatrooper
|
| + 3. See [remote access to bots](https://sites.google.com/a/google.com/chrome-infrastructure/golo/remote-access?pli=1)
|
| + on how to ssh into the bot and then run the test.
|
| + Rough overview for build161-m1
|
| +        * prodaccess --chromegolo_ssh
|
| +        * ssh build161-m1.golo
|
| + * Password is in valentine
|
| + "Chrome Golo, Perf, GPU bots - chrome-bot"
|
| + * File a bug to reboot the machine to get it online in the
|
| + swarming pool again
|
| +4. Running local changes on swarming bot
|
| +   * Using sunspider as the example benchmark since it is a quick one
|
| +   * First, run the test locally to make sure there is no issue with the binary
|
| + or the script running the test on the swarming bot. Make sure dir foo
|
| + exists:
|
| + `python testing/scripts/run_telemetry_benchmark_as_googletest.py tools/perf/run_benchmark sunspider -v --output-format=chartjson --upload-results --browser=reference --output-trace-tag=_ref --isolated-script-test-output=foo/output.json --isolated-script-test-chartjson-output=foo/chart-output.json`
|
| +   * Build any dependencies needed in the isolate:
|
| +      1. `ninja -C out/Release chrome/test:telemetry_perf_tests`
|
| +      2. This target should be enough if you are running a benchmark;
|
| +         otherwise build any targets that are reported missing when building
|
| +         the isolate in step #2.
|
| + 3. Make sure [compiler proxy is running](https://sites.google.com/a/google.com/goma/how-to-use-goma/how-to-use-goma-for-chrome-team?pli=1)
|
| + * ./goma_ctl.py ensure_start from goma directory
|
| + * Build the isolate
|
| + 1. `python tools/mb/mb.py isolate //out/Release -m chromium.perf -b "Linux Builder" telemetry_perf_tests`
|
| + * -m is the master
|
| + * -b is the builder name from mb_config.pyl that corresponds to
|
| + the platform you are running this command on
|
| + * telemetry_perf_tests is the isolate name
|
| +        * You might run into internal source deps when building the isolate,
|
| +          depending on the isolate. You might need to update the entry in
|
| +          mb_config.pyl for this builder to not be an official build so
|
| +          src/internal isn't required
|
| + * Archive and create the isolate hash
|
| + 1. `python tools/swarming_client/isolate.py archive -I isolateserver.appspot.com -i out/Release/telemetry_perf_tests.isolate -s out/Release/telemetry_perf_tests.isolated`
|
| +   * Run the test with the hash from step #3
|
| + 1. Run hash locally
|
| + * Note output paths are local
|
| + * `./tools/swarming_client/run_isolated.py -I https://isolateserver.appspot.com -s <insert_hash_here> -- sunspider -v --upload-results --output-format=chartjson --browser=reference --output-trace-tag=_ref --isolated-script-test-output=/usr/local/google/home/eyaich/projects/chromium/src/tmp/output.json`
|
| + 2. Trigger on swarming bot
|
| + * Note paths are using swarming output dir environment variable
|
| + ISOLATED_OUTDIR and dimensions are based on the bot and os you
|
| + are triggering the job on
|
| +        * `python tools/swarming_client/swarming.py trigger -v --isolate-server isolateserver.appspot.com -S chromium-swarm.appspot.com -d id build150-m1 -d pool Chrome-perf -d os Linux -s <insert_hash_here> -- sunspider -v --upload-results --output-format=chartjson --browser=reference --output-trace-tag=_ref --isolated-script-test-output='${ISOLATED_OUTDIR}/output.json' --isolated-script-test-chartjson-output='${ISOLATED_OUTDIR}/chart-output.json'`
|
| +        * All args after the '--' are for the swarming task and not for
|
| +          the trigger command. The output dirs must be in quotes when
|
| +          triggering on the swarming bot.
|
| +
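| +The `${ISOLATED_OUTDIR}` substitution used in the trigger command above can be
|
| +sanity-checked locally. A minimal sketch (the stand-in directory path is only
|
| +an example; swarming supplies the real value at task run time):
|
| +
|
| +```shell
|
| +# Simulate swarming's ISOLATED_OUTDIR expansion with a stand-in directory
|
| +# to confirm the flag values the task would actually receive.
|
| +ISOLATED_OUTDIR=/tmp/swarming_out
|
| +echo "--isolated-script-test-output=${ISOLATED_OUTDIR}/output.json"
|
| +echo "--isolated-script-test-chartjson-output=${ISOLATED_OUTDIR}/chart-output.json"
|
| +```
|
| +Note the single quotes in the trigger command: they keep the local shell from
|
| +expanding the variable, so the literal `${ISOLATED_OUTDIR}` reaches the bot.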
|
| ### Disabling Telemetry Tests
|
|
|
| If the test is a telemetry test, its name will have a '.' in it, such as
|
|
|