tools/perf/docs/perf_bot_sheriffing.md - Issue 2611183005: Add "Useful Logs and Debugging Info" section to perf sheriff doc.

Side by Side Diff: tools/perf/docs/perf_bot_sheriffing.md

Issue 2611183005: Add "Useful Logs and Debugging Info" section to perf sheriff doc. (Closed)

Patch Set: Created 3 years, 11 months ago

OLD	NEW
1 # Perf Bot Sheriffing	1 # Perf Bot Sheriffing

2	2

3 The perf bot sheriff is responsible for keeping the bots on the chromium.perf	3 The perf bot sheriff is responsible for keeping the bots on the chromium.perf

4 waterfall up and running, and triaging performance test failures and flakes.	4 waterfall up and running, and triaging performance test failures and flakes.

5	5

6 [Rotation calendar](https://calendar.google.com/calendar/embed?src=google.com_2fpmo740pd1unrui9d7cgpbg2k%40group.calendar.google.com)	6 [Rotation calendar](https://calendar.google.com/calendar/embed?src=google.com_2fpmo740pd1unrui9d7cgpbg2k%40group.calendar.google.com)

7	7

8 ## Key Responsibilities	8 ## Key Responsibilities

9	9

10 * [Handle Device and Bot Failures](#Handle-Device-and-Bot-Failures)	10 * [Handle Device and Bot Failures](#Handle-Device-and-Bot-Failures)

(...skipping 188 matching lines...)
199 * The revision range the test occurred on.	199 * The revision range the test occurred on.

200 * A list of all platforms the test fails on.	200 * A list of all platforms the test fails on.

201 2. Disable the failing test if it is failing more than one out of five runs.	201 2. Disable the failing test if it is failing more than one out of five runs.

202 (see below for instructions on telemetry and other types of tests). Make	202 (see below for instructions on telemetry and other types of tests). Make

203 sure your disable cl includes a BUG= line with the bug from step 1 and the	203 sure your disable cl includes a BUG= line with the bug from step 1 and the

204 test owner is cc-ed on the bug.	204 test owner is cc-ed on the bug.

205 3. After the disable CL lands, you can downgrade the priority to Pri-2 and	205 3. After the disable CL lands, you can downgrade the priority to Pri-2 and

206 ensure that the bug title reflects something like "Fix and re-enable	206 ensure that the bug title reflects something like "Fix and re-enable

207 testname".	207 testname".

208 4. Investigate the failure. Some tips for investigating:	208 4. Investigate the failure. Some tips for investigating:

209 * When viewing buildbot step logs, use the <font color="blue">[stdout]</font> link to view logs.

210 This will link to logdog logs which do not expire. Do not use or link

211 to the logs found through the <font color="blue">stdio</font> link

212 whenever possible as these logs will expire.

213 * When investigating Android, look for the logcat which is uploaded to

214 Google Storage at the end of the run. logcat will contain much more

215 detailed Android device and crash info than will be found in

216 Telemetry logs.

217 * If it's a non-flaky failure, identify the first failed	209 * If it's a non-flaky failure, identify the first failed

218 build so you can narrow down the range of CLs that causes the failure.	210 build so you can narrow down the range of CLs that causes the failure.

219 You can use the	211 You can use the

220 [diagnose_test_failure](https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/diagnose_test_failure)	212 [diagnose_test_failure](https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/diagnose_test_failure)

221 script to automatically find the first failed build and the good & bad	213 script to automatically find the first failed build and the good & bad

222 revisions (which can also be used for return code bisect).	214 revisions (which can also be used for return code bisect).

223 * If you suspect a specific CL in the range, you can revert it locally and	215 * If you suspect a specific CL in the range, you can revert it locally and

224 run the test on the	216 run the test on the

225 [perf trybots](https://www.chromium.org/developers/telemetry/performance-try-bots).	217 [perf trybots](https://www.chromium.org/developers/telemetry/performance-try-bots).

226 * You can run a return code bisect to narrow down the culprit CL:	218 * You can run a return code bisect to narrow down the culprit CL:

227 1. Open up the graph in the [perf dashboard](https://chromeperf.appspot.com/report)	219 1. Open up the graph in the [perf dashboard](https://chromeperf.appspot.com/report)

228 on one of the failing platforms.	220 on one of the failing platforms.

229 2. Hover over a data point and click the "Bisect" button on the	221 2. Hover over a data point and click the "Bisect" button on the

230 tooltip.	222 tooltip.

231 3. Type the Bug ID from step 1, the Good Revision the last	223 3. Type the Bug ID from step 1, the Good Revision the last

232 commit pos data was received from, the Bad Revision the last	224 commit pos data was received from, the Bad Revision the last

233 commit pos and set Bisect mode to `return_code`.	225 commit pos and set Bisect mode to `return_code`.

234 * [Debugging telemetry failures](https://www.chromium.org/developers/telemetry/diagnosing-test-failures)	226 * [Debugging telemetry failures](https://www.chromium.org/developers/telemetry/diagnosing-test-failures)

235 * On Android and Mac, you can view platform-level screenshots of the	227 * On Android and Mac, you can view platform-level screenshots of the

236 device screen for failing tests, links to which are printed in the logs.	228 device screen for failing tests, links to which are printed in the logs.

237 Often this will immediately reveal failure causes that are opaque from	229 Often this will immediately reveal failure causes that are opaque from

238 the logs alone. On other platforms, Devtools will produce tab	230 the logs alone. On other platforms, Devtools will produce tab

239 screenshots as long as the tab did not crash.	231 screenshots as long as the tab did not crash.

240	232

	233 ### Useful Logs and Debugging Info

	234

	235 1. Telemetry test runner logs

	236

	237 _Useful Content:_ Best place to start. These logs contain all of the

	238 python logging information from the telemetry test runner scripts.

	239

	240 _Where to find:_ These logs can be found from the buildbot build page.

	241 Click the _"[stdout]"_ link under any of the telemetry test buildbot steps

	242 to view the logs. Do not use the "stdio" link which will show similar

	243 information but will expire earlier and be slower to load.

	244

	245 2. Android Logcat (Android)

	246

	247 _Useful Content:_ This file contains all Android device logs. All

	248 Android apps and the Android system will log information to logcat. Good

	249 place to look if you believe an issue is device related

	250 (Android out-of-memory problem for example). Additionally, often information

	251 about native crashes will be logged here.

	252

	253 _Where to find:_ These logs can be found from the buildbot status page.

	254 Click the _"logcat dump"_ link under one of the _"gsutil upload"_ steps.

	255

	256 3. Test Trace (Android)

	257

	258 _Useful Content:_ These logs graphically depict the start/end times for

	259 all telemetry tests on all of the devices. This can help determine if test

	260 failures were caused by an environmental issue.

	261 (see [Cross-Device Failures](#Android-Cross-Device-Failures))

	262

	263 _Where to find:_ These logs can be found from the buildbot status page.

	264 Click the _"Test Trace"_ link under one of the

	265 _"gsutil Upload Test Trace"_ steps.

	266

	267 4. Symbolized Stack Traces (Android)

	268

	269 _Useful Content:_ Contains symbolized stack traces of any Chrome or

	270 Android crashes.

	271

	272 _Where to find_: These logs can be found from the buildbot status page.

	273 The symbolized stack traces can be found under several steps. Click the link

	274 under the _"symbolized breakpad crashes"_ step to see symbolized Chrome crashes.

	275 Click the link under _"stack tool with logcat dump"_ to see symbolized Android

	276 crashes.

	277

241 ### Disabling Telemetry Tests	278 ### Disabling Telemetry Tests

242	279

243 If the test is a telemetry test, its name will have a '.' in it, such as	280 If the test is a telemetry test, its name will have a '.' in it, such as

244 `thread_times.key_mobile_sites` or `page_cycler.top_10`. The part before the	281 `thread_times.key_mobile_sites` or `page_cycler.top_10`. The part before the

245 first dot will be a python file in [tools/perf/benchmarks](https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/).	282 first dot will be a python file in [tools/perf/benchmarks](https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/).
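
Editor's illustration (not part of the patch under review): the naming rule quoted above can be sketched in a few lines of Python. The helper name `benchmark_file_for` is hypothetical, and the `.py` suffix is an assumption drawn from the doc's statement that the part before the first dot "will be a python file" in `tools/perf/benchmarks`.

```python
import posixpath

# Directory named in the doc; benchmark modules are assumed to live directly in it.
BENCHMARKS_DIR = "tools/perf/benchmarks"


def benchmark_file_for(test_name):
    """Map a telemetry test name to the benchmark file it likely comes from.

    Hypothetical helper: it only applies the "part before the first dot"
    rule and does not check that the file actually exists.
    """
    module, sep, _ = test_name.partition(".")
    if not sep:
        # Per the doc, telemetry test names always contain a '.'.
        raise ValueError("not a telemetry test name (no '.'): %r" % test_name)
    return posixpath.join(BENCHMARKS_DIR, module + ".py")


print(benchmark_file_for("thread_times.key_mobile_sites"))
# prints: tools/perf/benchmarks/thread_times.py
```

A name without a dot raises `ValueError`, which matches the doc's rule of thumb that such a test is not a telemetry test.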

246	283

247 If a telemetry test is failing and there is no clear culprit to revert	284 If a telemetry test is failing and there is no clear culprit to revert

248 immediately, disable the test. You can do this with the `@benchmark.Disabled`	285 immediately, disable the test. You can do this with the `@benchmark.Disabled`

249 decorator. **Always add a comment next to your decorator with the bug id which	286 decorator. **Always add a comment next to your decorator with the bug id which

250 has background on why the test was disabled, and also include a BUG= line in	287 has background on why the test was disabled, and also include a BUG= line in

(...skipping 52 matching lines...)
303	340

304 [Pri-2 bugs](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label%3APerformance-Sheriff-BotHealth+label%3APri-2)	341 [Pri-2 bugs](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label%3APerformance-Sheriff-BotHealth+label%3APri-2)

305 are for disabled tests. These should be pinged weekly, and work towards fixing	342 are for disabled tests. These should be pinged weekly, and work towards fixing

306 should be ongoing when the sheriff is not working on a Pri-1 issue. Here is the	343 should be ongoing when the sheriff is not working on a Pri-1 issue. Here is the

307 [list of Pri-2 bugs that have not been pinged in a week](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label:Performance-Sheriff-BotHealth%20label:Pri-2%20modified-before:today-7&sort=modified).	344 [list of Pri-2 bugs that have not been pinged in a week](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label:Performance-Sheriff-BotHealth%20label:Pri-2%20modified-before:today-7&sort=modified).

308	345

309 <!-- Unresolved issues:	346 <!-- Unresolved issues:

310 1. Do perf sheriffs watch the bisect waterfall?	347 1. Do perf sheriffs watch the bisect waterfall?

311 2. Do perf sheriffs watch the internal clank waterfall?	348 2. Do perf sheriffs watch the internal clank waterfall?

312 -->	349 -->
