Chromium Code Reviews
Side by Side Diff: tools/perf/docs/perf_bot_sheriffing.md

Issue 2621713002: Adding swarming documentation to perf sheriff docs. (Closed)
Patch Set: merge and review comments Created 3 years, 11 months ago
# Perf Bot Sheriffing

The perf bot sheriff is responsible for keeping the bots on the chromium.perf
waterfall up and running, and triaging performance test failures and flakes.

**[Rotation calendar](https://calendar.google.com/calendar/embed?src=google.com_2fpmo740pd1unrui9d7cgpbg2k%40group.calendar.google.com)**

## Key Responsibilities

* [Handle Device and Bot Failures](#Handle-Device-and-Bot-Failures)
<!-- 217 lines skipped -->
    device screen for failing tests, links to which are printed in the logs.
    Often this will immediately reveal failure causes that are opaque from
    the logs alone. On other platforms, DevTools will produce tab
    screenshots as long as the tab did not crash.

### Useful Logs and Debugging Info

1. **Telemetry test runner logs**

    **_Useful Content:_** The best place to start. These logs contain all of
    the python logging information from the telemetry test runner scripts.

    **_Where to find:_** These logs can be found from the buildbot build page.
    Click the _"[stdout]"_ link under any of the telemetry test buildbot steps
    to view the logs. Do not use the "stdio" link, which shows similar
    information but expires earlier and is slower to load.

2. **Android Logcat (Android)**

    **_Useful Content:_** This file contains all Android device logs. All
    Android apps and the Android system log information to logcat. It is a
    good place to look if you believe an issue is device related (an Android
    out-of-memory problem, for example). Additionally, information about
    native crashes is often logged here.

    **_Where to find:_** These logs can be found from the buildbot status
    page. Click the _"logcat dump"_ link under one of the _"gsutil upload"_
    steps.

3. **Test Trace (Android)**

<!-- 10 lines skipped -->

    **_Useful Content:_** Contains symbolized stack traces of any Chrome or
    Android crashes.

    **_Where to find:_** These logs can be found from the buildbot status
    page. The symbolized stack traces can be found under several steps. Click
    the link under the _"symbolized breakpad crashes"_ step to see symbolized
    Chrome crashes. Click the link under _"stack tool with logcat dump"_ to
    see symbolized Android crashes.

## Swarming Bots

As of Q4 2016, all desktop bots have been moved to the swarming pool, with a
goal of moving all Android bots to swarming in early 2017. There is now one
machine on the chromium.perf waterfall for each desktop configuration, which
triggers test tasks on 5 corresponding swarming bots. All of our swarming bots
exist in the
[chrome-perf swarming pool](https://chromium-swarm.appspot.com/botlist?c=id&c=os&c=task&c=status&f=pool%3AChrome-perf&l=100&s=id%3Aasc).

1. Buildbot status page FYIs
    * Every test that is run now has 2-3 recipe steps on the buildbot status
      page associated with it:
      1. '[trigger] <test_name>': you can mostly ignore this step.
      2. '<test_name>': the test that was run on the swarming bot. The
         'shard #0' link on the step takes you to the swarming task page.
      3. '<test_name> Dashboard Upload': the upload of the perf test results
         to the perf dashboard. This step will not be present if the test was
         disabled.
    * We now run all benchmark tests even if they are disabled, but disabled
      tests always return success and you can ignore them. You can identify
      them by the 'DISABLED_BENCHMARK' link under the step and the fact that
      they don't have an upload step after them.
2. Debugging expiring jobs on the waterfall
    * You can tell a job is expiring in one of two ways:
      1. Click on the 'shard #0' link of the failed test and you will see
         EXPIRED on the swarming task page.
      2. There is a 'no_results_exc' and an 'invalid_results_exc' link on the
         failing test step and the dashboard upload step is failing. (Note:
         this could be an EXPIRED job or a TIMEOUT. An EXPIRED job means the
         task never got scheduled within the 5 hour swarming timeout; a
         TIMEOUT means it started running but couldn't finish before the
         5 hour swarming timeout.)
    * You can quickly see which bots the jobs are expiring or timing out on
      with the 'Bot id' annotation on the failing test step.
    * Troubleshooting why jobs are expiring:
      1. The bot might be down. Check the chrome-perf pool for that bot id
         and file a ticket with go/bugatrooper if the bot is down.
         * You can also identify a down bot through
           [viceroy](https://viceroy.corp.google.com/chrome_infra/Machines/per_machine):
           search for a bot id; if the graph stops, the bot is down.
      2. Otherwise, check the swarming task list for each bot that has
         failing jobs and examine what might be going on (there is a good
         [video](https://youtu.be/gRa0LvICthk) from maruel@ on the swarming
         UI and how to filter and search bot task lists; for example, you can
         filter on bot id and name to examine the last n runs of a test).
         * A test might be timing out on a bot, causing subsequent tests to
           expire: they would pass normally but never get scheduled because
           of the timed-out test. Debug the timed-out test.
         * A test might be taking longer than normal while still passing, and
           the extra execution time causes other, unrelated tests to fail.
           Compare the last passing run with the first failing run to see if
           a test is taking significantly longer, and debug that issue.
3. Reproducing swarming task runs
    * Reproduce on a local machine using the same inputs as the bot:
      1. Note that the local machine's spec must roughly match that of the
         swarming bot.
      2. See 'Reproducing the task locally' on the swarming task page.
      3. First run the command under
         'Download input files into directory foo'.
      4. cd into foo/out/Release of those downloaded inputs.
      5. Execute the test from this directory. The command you are looking
         for should be at the top of the logs; you just need to update the
         `--isolated-script-test-output=/b/s/w/ioFB73Qz/output.json` and
         `--isolated-script-test-chartjson-output=/b/s/w/ioFB73Qz/chartjson-output.json`
         flags to be local paths.
      6. Example, with tmp as a locally created dir:
         `/usr/bin/python ../../testing/scripts/run_telemetry_benchmark_as_googletest.py ../../tools/perf/run_benchmark indexeddb_perf -v --upload-results --output-format=chartjson --browser=release --isolated-script-test-output=tmp/output.json --isolated-script-test-chartjson-output=tmp/chartjson-output.json`
    * ssh into the swarming bot and run the test on that machine:
      1. NOTE: this should be a last resort, since it will cause a fifth of
         the benchmarks to continuously fail on the waterfall.
      2. First you need to decommission the swarming bot so other jobs don't
         interfere; file a ticket with go/bugatrooper.
      3. See [remote access to bots](https://sites.google.com/a/google.com/chrome-infrastructure/golo/remote-access?pli=1)
         on how to ssh into the bot and then run the test.
         Rough overview for build161-m1:
         * `prodaccess --chromegolo_ssh`
         * `ssh build161-m1.golo`
         * The password is in valentine under
           "Chrome Golo, Perf, GPU bots - chrome-bot".
         * File a bug to reboot the machine to get it back online in the
           swarming pool.
4. Running local changes on a swarming bot
    * Using sunspider as the example benchmark since it is a quick one.
    * First, run the test locally to make sure there is no issue with the
      binary or the script running the test on the swarming bot. Make sure
      dir foo exists:
      `python testing/scripts/run_telemetry_benchmark_as_googletest.py tools/perf/run_benchmark sunspider -v --output-format=chartjson --upload-results --browser=reference --output-trace-tag=_ref --isolated-script-test-output=foo/output.json --isolated-script-test-chartjson-output=foo/chart-output.json`
    * Build any dependencies needed in the isolate:
      1. `ninja -C out/Release chrome/test:telemetry_perf_tests`
      2. This target should be enough if you are running a benchmark;
         otherwise, build any targets that are reported missing when building
         the isolate below.
      3. Make sure the [compiler proxy is running](https://sites.google.com/a/google.com/goma/how-to-use-goma/how-to-use-goma-for-chrome-team?pli=1):
         * `./goma_ctl.py ensure_start` from the goma directory
    * Build the isolate:
      1. `python tools/mb/mb.py isolate //out/Release -m chromium.perf -b "Linux Builder" telemetry_perf_tests`
         * -m is the master
         * -b is the builder name from mb_config.pyl that corresponds to the
           platform you are running this command on
         * telemetry_perf_tests is the isolate name
         * You might run into internal source deps when building the isolate,
           depending on the isolate. You might need to update the entry in
           mb_config.pyl for this builder to not be an official build so
           src/internal isn't required.
    * Archive and create the isolate hash:
      1. `python tools/swarming_client/isolate.py archive -I isolateserver.appspot.com -i out/Release/telemetry_perf_tests.isolate -s out/Release/telemetry_perf_tests.isolated`
    * Run the test with the hash from the previous step:
      1. Run the hash locally:
         * Note that output paths are local.
         * `./tools/swarming_client/run_isolated.py -I https://isolateserver.appspot.com -s <insert_hash_here> -- sunspider -v --upload-results --output-format=chartjson --browser=reference --output-trace-tag=_ref --isolated-script-test-output=/usr/local/google/home/eyaich/projects/chromium/src/tmp/output.json`
      2. Trigger on a swarming bot:
         * Note that output paths use the swarming output dir environment
           variable ISOLATED_OUTDIR, and the dimensions are based on the bot
           and OS you are triggering the job on.
         * `python tools/swarming_client/swarming.py trigger -v --isolate-server isolateserver.appspot.com -S chromium-swarm.appspot.com -d id build150-m1 -d pool Chrome-perf -d os Linux -s <insert_hash_here> -- sunspider -v --upload-results --output-format=chartjson --browser=reference --output-trace-tag=_ref --isolated-script-test-output='${ISOLATED_OUTDIR}/output.json' --isolated-script-test-chartjson-output='${ISOLATED_OUTDIR}/chart-output.json'`
         * All args after the '--' are for the swarming task and not for the
           trigger command. The output dirs must be in quotes when triggering
           on a swarming bot.
396
### Disabling Telemetry Tests

If the test is a telemetry test, its name will have a '.' in it, such as
`thread_times.key_mobile_sites` or `page_cycler.top_10`. The part before the
first dot will be a python file in [tools/perf/benchmarks](https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/).

If a telemetry test is failing and there is no clear culprit to revert
immediately, disable the test. You can do this with the `@benchmark.Disabled`
decorator. **Always add a comment next to your decorator with the bug id which
has background on why the test was disabled, and also include a BUG= line in
<!-- 66 lines skipped -->
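The `@benchmark.Disabled` mechanism described above boils down to tagging a
benchmark class with the platforms it should be skipped on. Below is a minimal,
self-contained sketch of that tagging pattern; it is a toy stand-in, not
Telemetry's actual implementation (the real decorator lives in the `telemetry`
package), and the bug id in the comment is made up for illustration:

```python
def Disabled(*platforms):
    """Toy decorator: record the platforms a benchmark is disabled on.

    Pass 'all' to disable the benchmark everywhere. This only mimics the
    tagging pattern of Telemetry's @benchmark.Disabled; it is not the real
    implementation.
    """
    def decorator(cls):
        cls.disabled_platforms = set(platforms)
        return cls
    return decorator


# Disabled due to flaky timeouts on Android; crbug.com/123456 (made-up id).
@Disabled('android')
class ThreadTimesKeyMobileSites(object):
    name = 'thread_times.key_mobile_sites'


def is_enabled(benchmark_cls, platform):
    """Return True if the benchmark should run on the given platform."""
    disabled = getattr(benchmark_cls, 'disabled_platforms', set())
    return 'all' not in disabled and platform not in disabled


print(is_enabled(ThreadTimesKeyMobileSites, 'android'))  # False
print(is_enabled(ThreadTimesKeyMobileSites, 'linux'))    # True
```

In real Telemetry code you would use the decorator provided by the framework
rather than defining your own, keeping the bug-reference comment directly next
to it as the section requires.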

There is also a weekly debrief that you should see on your calendar titled
**Weekly Speed Sheriff Retrospective**. For this meeting you should prepare
any highlights or lowlights from your sheriffing shift as well as any other
feedback you may have that could improve future sheriffing shifts.

<!-- Unresolved issues:
1. Do perf sheriffs watch the bisect waterfall?
2. Do perf sheriffs watch the internal clank waterfall?
-->