Chromium Code Reviews

Side by Side Diff: tools/perf/docs/perf_bot_sheriffing.md

Issue 2621713002: Adding swarming documentation to perf sheriff docs. (Closed)
Patch Set: Created 3 years, 11 months ago
# Perf Bot Sheriffing

The perf bot sheriff is responsible for keeping the bots on the chromium.perf
waterfall up and running, and triaging performance test failures and flakes.

**[Rotation calendar](https://calendar.google.com/calendar/embed?src=google.com_2fpmo740pd1unrui9d7cgpbg2k%40group.calendar.google.com)**

## Key Responsibilities

* [Handle Device and Bot Failures](#Handle-Device-and-Bot-Failures)
(...skipping 220 matching lines...)
  3. Type the **Bug ID** from step 1, the **Good Revision** (the last
     commit pos data was received from), the **Bad Revision** (the last
     commit pos), and set **Bisect mode** to `return_code`.
* [Debugging telemetry failures](https://www.chromium.org/developers/telemetry/diagnosing-test-failures)
  * On Android and Mac, you can view platform-level screenshots of the
    device screen for failing tests; links to these are printed in the logs.
    Often this will immediately reveal failure causes that are opaque from
    the logs alone. On other platforms, DevTools will produce tab
    screenshots as long as the tab did not crash.

## Swarming Bots

As of Q4 2016, all desktop bots have been moved to the swarming pool, with a
goal of moving all Android bots to swarming in early 2017. There is now one
machine on the chromium.perf waterfall for each desktop configuration, which
triggers test tasks on 5 corresponding swarming bots. All of our swarming bots
exist in the
[chrome-perf swarming pool](https://chromium-swarm.appspot.com/botlist?c=id&c=os&c=task&c=status&f=pool%3AChrome-perf&l=100&s=id%3Aasc).

1. Buildbot status page FYIs
    * Every test that is run now has 2-3 recipe steps on the buildbot status
      page associated with it:
      1. `[trigger] <test_name>`: you can mostly ignore this step.
      2. `<test_name>`: the test that was run on the swarming bot. The
         'shard #0' link on this step takes you to the swarming task page.
      3. `<test_name> Dashboard Upload`: the upload of the perf test results
         to the perf dashboard. This step will not be present if the test
         was disabled.
    * We now run all benchmark tests even if they are disabled, but disabled
      tests always return success and you can ignore them. You can identify
      them by the 'DISABLED_BENCHMARK' link under the step and the fact that
      they don't have an upload step after them.
2. Debugging expiring jobs on the waterfall
    * You can tell a job is expiring in one of two ways:
      1. Click the 'shard #0' link of the failing test and you will see
         EXPIRED on the swarming task page.
      2. There is a 'no_results_exc' and an 'invalid_results_exc' link on the
         failing test step and the dashboard upload step fails. (Note: this
         could be an EXPIRED job or a TIMEOUT. EXPIRED means the task never
         got scheduled within the 5 hour swarming timeout; TIMEOUT means it
         started running but couldn't finish before the 5 hour swarming
         timeout.)
    * You can quickly see which bots the jobs are expiring or timing out on
      from the 'Bot id' annotation on the failing test step.
    * Troubleshooting why they are expiring:
      1. The bot might be down. Check the chrome-perf pool for that bot id
         and file a ticket with go/bugatrooper if the bot is down.

         martiniss 2017/01/09 18:23:54: Maybe add a link to https://viceroy.corp.google.co
         eyaich1 2017/01/09 18:44:31: Done.

      2. Otherwise, check the swarming task list for each bot that has
         failing jobs and examine what might be going on (there is a good
         [video](https://youtu.be/gRa0LvICthk) from maruel@ on the swarming
         UI and how to filter and search bot task lists; for example, you can
         filter on bot id and test name to examine the last n runs of a test).
         * A test might be timing out on a bot and causing subsequent tests
           to expire: they would pass normally but never get scheduled
           because of the timing-out test. Debug the timing-out test.
         * A test might be taking longer than normal but still passing, and
           the extra execution time causes other unrelated tests to fail.
           Compare the last passing run to the first failing run, see if a
           test is taking significantly longer, and debug that issue.
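
The EXPIRED-versus-TIMEOUT triage above can be summarized in a sketch. The state names match what the swarming task page shows, but the dict shape here is a simplified assumption, not the exact swarming API response schema.

```python
# Sketch: triage a failing swarming task per the notes above.
# The task dict shape is a simplified assumption for illustration.

def diagnose(task):
    """Return a human-readable diagnosis for a swarming task result."""
    state = task.get('state')
    if state == 'EXPIRED':
        # Never got scheduled on a bot within the 5 hour swarming timeout:
        # look at bot health first, not at the test itself.
        return 'expired: never scheduled; check the bot in the Chrome-perf pool'
    if state == 'TIMED_OUT':
        # Started running but could not finish before the timeout: debug the
        # test (or an earlier long-running test on the same bot).
        return 'timed out: started but exceeded the limit; debug the test'
    if state == 'COMPLETED' and task.get('exit_code', 0) != 0:
        return 'failed: ran to completion with a non-zero exit code'
    return 'ok'

print(diagnose({'state': 'EXPIRED'}))
```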
3. Reproducing swarming task runs
    * Reproduce on your local machine using the same inputs as the bot:
      1. Note that the local machine's spec must roughly match that of the
         swarming bot.
      2. See 'Reproducing the task locally' on the swarming task page.
      3. First run the command under
         'Download input files into directory foo'.
      4. cd into foo/out/Release of those downloaded inputs.
      5. Execute the test from this directory. The command you are looking
         for should be at the top of the logs; you just need to update the
         `--isolated-script-test-output=/b/s/w/ioFB73Qz/output.json` and
         `--isolated-script-test-chartjson-output=/b/s/w/ioFB73Qz/chartjson-output.json`
         flags to be local paths.
      6. Example with tmp as a locally created dir:
         `/usr/bin/python ../../testing/scripts/run_telemetry_benchmark_as_googletest.py ../../tools/perf/run_benchmark indexeddb_perf -v --upload-results --output-format=chartjson --browser=release --isolated-script-test-output=tmp/output.json --isolated-script-test-chartjson-output=tmp/chartjson-output.json`
    * ssh into the swarming bot and run the test on that machine:
      1. NOTE: this should be a last resort, since it will cause a fifth of
         the benchmarks to continuously fail on the waterfall.
      2. First you need to decommission the swarming bot so other jobs don't
         interfere; file a ticket with go/bugatrooper.
      3. See [remote access to bots](https://sites.google.com/a/google.com/chrome-infrastructure/golo/remote-access?pli=1)
         on how to ssh into the bot and then run the test. Rough overview for
         build161-m1:
         * `prodaccess --chromegolo_ssh`
         * `ssh build161-m1.golo`
         * The password is in valentine under
           "Chrome Golo, Perf, GPU bots - chrome-bot".
         * File a bug to reboot the machine to get it back online in the
           swarming pool.
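
The flag rewrite in the local-reproduction steps above (pointing the `--isolated-script-test-*output` flags at local paths instead of `/b/s/w/...`) can be sketched as a helper; `localize_output_flags` is a hypothetical illustration, not part of the tooling.

```python
# Sketch: rewrite bot-specific output flags in a command copied from the
# task logs so the test writes into a local directory. Hypothetical helper.
import re

def localize_output_flags(cmd, local_dir='tmp'):
    """Point --isolated-script-test-*output flags at local_dir."""
    def repl(match):
        flag, bot_path = match.group(1), match.group(2)
        filename = bot_path.rsplit('/', 1)[-1]
        return '%s=%s/%s' % (flag, local_dir, filename)
    return re.sub(r'(--isolated-script-test(?:-chartjson)?-output)=(\S+)',
                  repl, cmd)

# Abbreviated example of a command copied from the top of the task logs.
bot_cmd = ('../../tools/perf/run_benchmark indexeddb_perf -v '
           '--isolated-script-test-output=/b/s/w/ioFB73Qz/output.json '
           '--isolated-script-test-chartjson-output=/b/s/w/ioFB73Qz/chartjson-output.json')
print(localize_output_flags(bot_cmd))
```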
4. Running local changes on a swarming bot
    * Using sunspider as the example benchmark, since it is a quick one.
    * First, run the test locally to make sure there is no issue with the
      binary or the script running the test on the swarming bot. Make sure
      dir foo exists:
      `python testing/scripts/run_telemetry_benchmark_as_googletest.py tools/perf/run_benchmark sunspider -v --output-format=chartjson --upload-results --browser=reference --output-trace-tag=_ref --isolated-script-test-output=foo/output.json --isolated-script-test-chartjson-output=foo/chart-output.json`
    * Build any dependencies needed in the isolate:
      1. `ninja -C out/Release chrome/test:telemetry_perf_tests`
      2. This target should be enough if you are running a benchmark;
         otherwise, build any targets reported as missing when you build the
         isolate in the next step.
      3. Make sure the [compiler proxy is running](https://sites.google.com/a/google.com/goma/how-to-use-goma/how-to-use-goma-for-chrome-team?pli=1):
         * `./goma_ctl.py ensure_start` from the goma directory
    * Build the isolate:
      1. `python tools/mb/mb.py isolate //out/Release -m chromium.perf -b "Linux Builder" telemetry_perf_tests`
         * `-m` is the master.
         * `-b` is the builder name from mb_config.pyl that corresponds to
           the platform you are running this command on.
         * `telemetry_perf_tests` is the isolate name.
         * You might run into internal source deps when building the isolate,
           depending on the isolate. You might need to update the entry in
           mb_config.pyl for this builder to not be an official build so
           src/internal isn't required.
    * Archive and create the isolate hash:
      1. `python tools/swarming_client/isolate.py archive -I isolateserver.appspot.com -i out/Release/telemetry_perf_tests.isolate -s out/Release/telemetry_perf_tests.isolated`
    * Run the test with the hash from the previous step:
      1. Run the hash locally (note that the output paths are local):
         * `./tools/swarming_client/run_isolated.py -I https://isolateserver.appspot.com -s <insert_hash_here> -- sunspider -v --upload-results --output-format=chartjson --browser=reference --output-trace-tag=_ref --isolated-script-test-output=/usr/local/google/home/eyaich/projects/chromium/src/tmp/output.json`
      2. Trigger on a swarming bot (note that the paths use the swarming
         output dir environment variable ISOLATED_OUTDIR, and the dimensions
         are based on the bot and OS you are triggering the job on):
         * `python tools/swarming_client/swarming.py trigger -v --isolate-server isolateserver.appspot.com -S chromium-swarm.appspot.com -d id build150-m1 -d pool Chrome-perf -d os Linux -s <insert_hash_here> -- sunspider -v --upload-results --output-format=chartjson --browser=reference --output-trace-tag=_ref --isolated-script-test-output='${ISOLATED_OUTDIR}/output.json' --isolated-script-test-chartjson-output='${ISOLATED_OUTDIR}/chart-output.json'`
         * All args after the `--` are for the swarming task, not for the
           trigger command. The output dirs must be in quotes when triggering
           on a swarming bot.
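
As a sketch, the trigger command above can be assembled programmatically. `build_trigger_cmd` is a hypothetical helper (not part of swarming_client); the bot id, pool, and dimension values are the example values from the text.

```python
# Sketch: assemble the swarming.py trigger invocation described above.
# Hypothetical helper; values mirror the example command in the text.

def build_trigger_cmd(isolated_hash, bot_id, benchmark, os_dim='Linux'):
    """Return the trigger command as an argument list."""
    dimensions = ['-d', 'id', bot_id,
                  '-d', 'pool', 'Chrome-perf',
                  '-d', 'os', os_dim]
    # Everything after '--' goes to the task itself, not to trigger.
    # ISOLATED_OUTDIR is expanded by swarming on the bot, so it is passed
    # through literally (quote it when typing this into a real shell).
    test_args = [benchmark, '-v', '--upload-results',
                 '--output-format=chartjson', '--browser=reference',
                 '--output-trace-tag=_ref',
                 '--isolated-script-test-output=${ISOLATED_OUTDIR}/output.json',
                 '--isolated-script-test-chartjson-output=${ISOLATED_OUTDIR}/chart-output.json']
    return (['python', 'tools/swarming_client/swarming.py', 'trigger', '-v',
             '--isolate-server', 'isolateserver.appspot.com',
             '-S', 'chromium-swarm.appspot.com']
            + dimensions + ['-s', isolated_hash, '--'] + test_args)

print(' '.join(build_trigger_cmd('<insert_hash_here>', 'build150-m1', 'sunspider')))
```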

### Disabling Telemetry Tests

If the test is a telemetry test, its name will have a '.' in it, such as
`thread_times.key_mobile_sites` or `page_cycler.top_10`. The part before the
first dot will be a python file in [tools/perf/benchmarks](https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/).

If a telemetry test is failing and there is no clear culprit to revert
immediately, disable the test. You can do this with the `@benchmark.Disabled`
decorator. **Always add a comment next to your decorator with the bug id which
has background on why the test was disabled, and also include a BUG= line in
(...skipping 66 matching lines...)
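
A disabled benchmark might look like the sketch below. `benchmark.Disabled` is telemetry's real decorator, but since telemetry is not importable here a minimal stand-in is defined so the example runs; the class name and bug id are hypothetical.

```python
# Sketch: disabling a benchmark with a bug-id comment, as described above.

def Disabled(*platforms):
    """Minimal stand-in for telemetry's benchmark.Disabled decorator."""
    def decorator(cls):
        cls.disabled_platforms = platforms or ('all',)
        return cls
    return decorator

# crbug.com/999999 (hypothetical bug id): benchmark times out on mac bots.
@Disabled('mac')
class ThreadTimesKeyMobileSites(object):
    """Illustrative benchmark class, like those in tools/perf/benchmarks/."""

print(ThreadTimesKeyMobileSites.disabled_platforms)
```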

There is also a weekly debrief that you should see on your calendar titled
**Weekly Speed Sheriff Retrospective**. For this meeting you should prepare
any highlights or lowlights from your sheriffing shift, as well as any other
feedback you may have that could improve future sheriffing shifts.

<!-- Unresolved issues:
1. Do perf sheriffs watch the bisect waterfall?
2. Do perf sheriffs watch the internal clank waterfall?
-->