Chromium Code Reviews
| Index: tools/perf/docs/perf_bot_sheriffing.md |
| diff --git a/tools/perf/docs/perf_bot_sheriffing.md b/tools/perf/docs/perf_bot_sheriffing.md |
| index 28325f113db42da2c27f64af9c32de2dd9477db2..b020ae1825ce828474bafb272ee82ded1e54a3f1 100644 |
| --- a/tools/perf/docs/perf_bot_sheriffing.md |
| +++ b/tools/perf/docs/perf_bot_sheriffing.md |
| @@ -238,6 +238,122 @@ be investigated. When a test fails: |
| the logs alone. On other platforms, Devtools will produce tab |
| screenshots as long as the tab did not crash. |
| +## Swarming Bots |
| +As of Q4 2016, all desktop bots have been moved to the swarming pool, with a |
| +goal of moving all Android bots to swarming in early 2017. There is now one |
| +machine on the chromium.perf waterfall for each desktop configuration, which |
| +triggers test tasks on 5 corresponding swarming bots. All of our swarming bots |
| +exist in the [chrome-perf swarming pool](https://chromium-swarm.appspot.com/botlist?c=id&c=os&c=task&c=status&f=pool%3AChrome-perf&l=100&s=id%3Aasc). |
| + |
| +1. Buildbot status page FYIs |
| + * Every test that is run now has 2-3 recipe steps on the buildbot status |
| + page associated with it |
| + 1. '[trigger] <test_name>' step (you can mostly ignore this) |
| + 2. '<test_name>': the test that was run on the swarming bot. The |
| + 'shard #0' link on the step takes you to the swarming task page. |
| + 3. '<test_name> Dashboard Upload': the upload of the perf test |
| + results to the perf dashboard. This will not be present if the |
| + test was disabled. |
| + * We now run all benchmark tests even if they are disabled, but disabled |
| + tests will always return success and you can ignore them. You can |
| + identify these by the 'DISABLED_BENCHMARK' link under the step and the |
| + fact that they don’t have an upload step after them. |
| +2. Debugging Expiring Jobs on the waterfall |
| + * You can tell a job is expiring in one of two ways: |
| + 1. Click on the 'shard #0' link of the failed test and you will see |
| + EXPIRED on the swarming task page |
| + 2. If there is a 'no_results_exc' and an 'invalid_results_exc' link on |
| + the buildbot failing test step with the dashboard upload step |
| + failing. (Note: this could be an EXPIRED job or a TIMEOUT. An |
| + EXPIRED job means the task never got scheduled within the 5 hour |
| + swarming timeout; TIMEOUT means it started running but couldn’t |
| + finish before the 5 hour swarming timeout.) |
| + * You can quickly see what bots the jobs are expiring/timing out on with |
| + the 'Bot id' annotation on the failing test step. |
| + * Troubleshooting why they are expiring |
| + 1. Bot might be down. Check the chrome-perf pool for that bot id and |
| + file a ticket with go/bugatrooper if the bot is down. |
| + 2. Otherwise check the bot's swarming task list for each bot that |
| + has failing jobs and examine what might be going on (good [video](https://youtu.be/gRa0LvICthk) |
| + from maruel@ on the swarming UI and how to filter and search bot |
| + task lists; for example, you can filter on bot id and name to |
| + examine the last n runs of a test). |
| + * A test might be timing out on a bot, causing subsequent tests |
| + to expire: they would pass normally but never get scheduled |
| + because of the timed-out test. Debug the timing-out test. |
| + * A test might be taking longer than normal while still passing, |
| + and the extra execution time causes other unrelated tests to |
| + fail. Compare the last passing run with the first failing run |
| + and see if a test is taking significantly longer, then debug |
| + that issue. |
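The 'Bot id' annotation can be turned into the swarming UI pages to check directly. A small helper sketch; the URL patterns are assumptions based on the chromium-swarm botlist link above, and `build150-m1` is just an example bot id:

```shell
# Hypothetical helper: turn a bot id (from the 'Bot id' annotation on the
# failing step) into swarming UI links for that bot.
bot_links() {
  bot_id="$1"
  # Bot overview page (status, current task).
  echo "bot page:  https://chromium-swarm.appspot.com/bot?id=${bot_id}"
  # Task list filtered to this bot, for scanning the last n runs.
  echo "task list: https://chromium-swarm.appspot.com/tasklist?f=bot_id%3A${bot_id}"
}

bot_links build150-m1
```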
| +3. Reproducing swarming task runs |
| + * Reproduce on local machine using same inputs as bot |
| + 1. Note that the local machine's spec must roughly match that of the |
| + swarming bot |
| + 2. See 'Reproducing the task locally' on swarming task page |
| + 3. First run the command under |
| + 'Download input files into directory foo' |
| + 4. cd into foo/out/Release of the downloaded inputs |
| + 5. Execute the test from this directory. The command you are |
| + looking for is at the top of the logs; you just need to update the |
| + `--isolated-script-test-output=/b/s/w/ioFB73Qz/output.json` and |
| + `--isolated-script-test-chartjson-output=/b/s/w/ioFB73Qz/chartjson-output.json` |
| + flags to point to local paths. |
| + 6. Example with tmp as locally created dir: |
| + `/usr/bin/python ../../testing/scripts/run_telemetry_benchmark_as_googletest.py ../../tools/perf/run_benchmark indexeddb_perf -v --upload-results --output-format=chartjson --browser=release --isolated-script-test-output=tmp/output.json --isolated-script-test-chartjson-output=tmp/chartjson-output.json` |
| + * ssh into swarming bot and run test on that machine |
| + 1. NOTE: this should be a last resort since it will cause a fifth of |
| + the benchmarks to continuously fail on the waterfall |
| + 2. First you need to decommission the swarming bot so other jobs |
| + don’t interfere; file a ticket with go/bugatrooper. |
| + 3. See [remote access to bots](https://sites.google.com/a/google.com/chrome-infrastructure/golo/remote-access?pli=1) |
| + on how to ssh into the bot and then run the test. |
| + Rough overview for build161-m1 |
| + * prodaccess --chromegolo_ssh |
| + * ssh build161-m1.golo |
| + * Password is in valentine |
| + "Chrome Golo, Perf, GPU bots - chrome-bot" |
| + * File a bug to reboot the machine to get it online in the |
| + swarming pool again |
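The local-reproduction steps above amount to re-running the task's command with the bot's `/b/s/w/...` output paths swapped for local ones. A sketch, using the indexeddb_perf example from step 6; the paths are placeholders, and `DRY_RUN` (defaulting to 1) only prints the command since no chromium checkout is assumed here:

```shell
# Sketch of step 5 above: rewrite the bot's output flags to local paths.
# Intended to run from foo/out/Release after downloading the task's inputs.
OUT_DIR="$(pwd)/tmp"
mkdir -p "$OUT_DIR"

# Same command line as the task, with /b/s/w/... paths replaced.
CMD="/usr/bin/python ../../testing/scripts/run_telemetry_benchmark_as_googletest.py \
../../tools/perf/run_benchmark indexeddb_perf -v --upload-results \
--output-format=chartjson --browser=release \
--isolated-script-test-output=$OUT_DIR/output.json \
--isolated-script-test-chartjson-output=$OUT_DIR/chartjson-output.json"

# DRY_RUN=1 (the default) just prints the rewritten command.
if [ "${DRY_RUN:-1}" = "1" ]; then
  echo "$CMD"
else
  eval "$CMD"
fi
```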
| +4. Running local changes on swarming bot |
| + * Using sunspider as the example benchmark since it is a quick one |
| + * First, run test locally to make sure there is no issue with the binary |
| + or the script running the test on the swarming bot. Make sure dir foo |
| + exists: |
| + `python testing/scripts/run_telemetry_benchmark_as_googletest.py tools/perf/run_benchmark sunspider -v --output-format=chartjson --upload-results --browser=reference --output-trace-tag=_ref --isolated-script-test-output=foo/output.json --isolated-script-test-chartjson-output=foo/chart-output.json` |
| + * Build any dependencies needed in isolate: |
| + 1. ninja -C out/Release chrome/test:telemetry_perf_tests |
| + 2. This target should be enough if you are running a benchmark; |
| + otherwise build any targets reported as missing when you build |
| + the isolate below. |
| + 3. Make sure [compiler proxy is running](https://sites.google.com/a/google.com/goma/how-to-use-goma/how-to-use-goma-for-chrome-team?pli=1) |
| + * Run `./goma_ctl.py ensure_start` from the goma directory |
| + * Build the isolate |
| + 1. `python tools/mb/mb.py isolate //out/Release -m chromium.perf -b "Linux Builder" telemetry_perf_tests` |
| + * -m is the master |
| + * -b is the builder name from mb_config.pyl that corresponds to |
| + the platform you are running this command on |
| + * telemetry_perf_tests is the isolate name |
| + * You might run into internal source deps when building the |
| + isolate, depending on the isolate. You might need to update the |
| + entry in mb_config.pyl for this builder to not be an official |
| + build so src/internal isn’t required |
| + * Archive and create the isolate hash |
| + 1. `python tools/swarming_client/isolate.py archive -I isolateserver.appspot.com -i out/Release/telemetry_perf_tests.isolate -s out/Release/telemetry_perf_tests.isolated` |
| + * Run the test with the hash from the archive step |
| + 1. Run hash locally |
| + * Note output paths are local |
| + * `./tools/swarming_client/run_isolated.py -I https://isolateserver.appspot.com -s <insert_hash_here> -- sunspider -v --upload-results --output-format=chartjson --browser=reference --output-trace-tag=_ref --isolated-script-test-output=/usr/local/google/home/eyaich/projects/chromium/src/tmp/output.json` |
| + 2. Trigger on swarming bot |
| + * Note paths are using swarming output dir environment variable |
| + ISOLATED_OUTDIR and dimensions are based on the bot and os you |
| + are triggering the job on |
| + * `python tools/swarming_client/swarming.py trigger -v --isolate-server isolateserver.appspot.com -S chromium-swarm.appspot.com -d id build150-m1 -d pool Chrome-perf -d os Linux -s <insert_hash_here> -- sunspider -v --upload-results --output-format=chartjson --browser=reference --output-trace-tag=_ref --isolated-script-test-output='${ISOLATED_OUTDIR}/output.json' --isolated-script-test-chartjson-output='${ISOLATED_OUTDIR}/chart-output.json'` |
| + * All args after the '--' are for the swarming task, not for |
| + the trigger command. The output dirs must be in quotes when |
| + triggering on a swarming bot. |
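Taken together, running local changes on a swarming bot is a four-step pipeline: build the target, build the isolate, archive it, then trigger. The sketch below only prints each command (the builder name, bot id, and hash are illustrative placeholders from the examples above; no chromium checkout is assumed):

```shell
# Print the build -> isolate -> archive -> trigger commands from section 4.
BENCHMARK=sunspider
BUILDER="Linux Builder"   # builder name from mb_config.pyl for this platform
BOT_ID=build150-m1        # target swarming bot (placeholder)
HASH="<insert_hash_here>" # printed by the archive step

# Record each command instead of executing it.
CMDS_FILE="$(mktemp)"
run() { echo "$*" >> "$CMDS_FILE"; echo "+ $*"; }

run ninja -C out/Release chrome/test:telemetry_perf_tests
run python tools/mb/mb.py isolate //out/Release -m chromium.perf \
    -b "$BUILDER" telemetry_perf_tests
run python tools/swarming_client/isolate.py archive -I isolateserver.appspot.com \
    -i out/Release/telemetry_perf_tests.isolate \
    -s out/Release/telemetry_perf_tests.isolated
run python tools/swarming_client/swarming.py trigger -v \
    --isolate-server isolateserver.appspot.com -S chromium-swarm.appspot.com \
    -d id "$BOT_ID" -d pool Chrome-perf -d os Linux -s "$HASH" -- \
    "$BENCHMARK" -v --upload-results --output-format=chartjson \
    --browser=reference --output-trace-tag=_ref \
    --isolated-script-test-output='${ISOLATED_OUTDIR}/output.json' \
    --isolated-script-test-chartjson-output='${ISOLATED_OUTDIR}/chart-output.json'
```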
| + |
| ### Disabling Telemetry Tests |
| If the test is a telemetry test, its name will have a '.' in it, such as |