tools/perf/docs/perf_bot_sheriffing.md - Issue 1770383005: Reduce indentation levels for sheriff docs.

Unified Diff: tools/perf/docs/perf_bot_sheriffing.md

Issue 1770383005: Reduce indentation levels for sheriff docs. (Closed) Base URL: https://chromium.googlesource.com/chromium/src.git@master

Patch Set: Commentz Created 4 years, 9 months ago

Index: tools/perf/docs/perf_bot_sheriffing.md

diff --git a/tools/perf/docs/perf_bot_sheriffing.md b/tools/perf/docs/perf_bot_sheriffing.md

index 6b072b8f563c35225792aa526c29b02ae888e4f2..69407bac3a81e04b512bf8e132ea25554d11a2c6 100644

--- a/tools/perf/docs/perf_bot_sheriffing.md

+++ b/tools/perf/docs/perf_bot_sheriffing.md

@@ -5,17 +5,11 @@ waterfall up and running, and triaging performance test failures and flakes.

## Key Responsibilities

- * [Keeping the chromium.perf waterfall green](#chromiumperf)

- * [Handling Test Failures](#testfailures)

- * [Handling Device and Bot Failures](#botfailures)

- * [Follow up on failures](#followup)

+ * [Handle Device and Bot Failures](#botfailures)

+ * [Handle Test Failures](#testfailures)

+ * [Follow up on failures](#followup)

-###<a name="chromiumperf"></a> Keeping the chromium.perf waterfall green

-The primary responsibility of the perfbot health sheriff is to keep the

-chromium.perf waterfall green.

-####<a name="waterfallstate"></a> Understanding the Waterfall State

+##<a name="waterfallstate"></a> Understanding the Waterfall State

Everyone can view the chromium.perf waterfall at

https://build.chromium.org/p/chromium.perf/, but for Googlers it is recommended

@@ -44,20 +38,74 @@ Note that there are four different views:

* **start_time** is seconds since the epoch.

You can see a list of all previously filed bugs using the

-**[Performance-BotHealth](https://code.google.com/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth)**

+**[Performance-BotHealth](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth)**

label in crbug.

Please also check the recent

**[perf-sheriffs@chromium.org](https://groups.google.com/a/chromium.org/forum/#!forum/perf-sheriffs)**

postings for important announcements about bot turndowns and other known issues.

-####<a name="testfailures"></a> Handling Test Failures

+##<a name="botfailures"></a> Handle Device and Bot Failures

+###<a name="purplebots"></a> Purple bots

+When a bot goes purple, it's usually because of an infrastructure failure

+outside of the tests. But you should first check the logs of a purple bot to

+try to better understand the problem. Sometimes a telemetry test failure can

+turn the bot purple, for example.

+If the bot goes purple and you believe it's an infrastructure issue, file a bug

+with

+[this template](https://bugs.chromium.org/p/chromium/issues/entry?labels=Pri-1,Performance-BotHealth,Infra-Troopers,OS-?&comment=Link+to+buildbot+status+page:&summary=Purple+Bot+on+chromium.perf),

+which will automatically add the bug to the trooper queue. Be sure to note

+which step is failing, and paste any relevant info from the logs into the bug.

+###<a name="devicefailures"></a> Android Device failures

+There are two types of device failures:

+1. A device is blacklisted in the `device_status_check` step. You can look at

+ the buildbot status page to see how many devices were listed as online during

+ this step. You should always see 7 devices online. If you see fewer than 7

+ devices online, there is a problem in the lab.

+2. A device is passing `device_status_check` but still in poor health. The

+ symptom of this is that all the tests are failing on it. You can see that on

+ the buildbot status page by looking at the `Device Affinity`. If all tests

+ with the same device affinity number are failing, it's probably a device

+ failure.

+For both types of failures, please file a bug with [this template](https://bugs.chromium.org/p/chromium/issues/entry?labels=Pri-1,Performance-BotHealth,Infra-Labs,OS-Android&comment=Link+to+buildbot+status+page:&summary=Device+offline+on+chromium.perf)

+which will add an issue to the infra labs queue.

+If you need help triaging, here are the common labels you should use:

+ * **Performance-BotHealth** should go on all bugs you file about the bots;

+ it's the label we use to track all the issues.

+ * **Infra-Troopers** adds the bug to the trooper queue. This is for high

+ priority issues, like a build breakage. Please add a comment explaining

+ what you want the trooper to do.

+ * **Infra-Labs** adds the bug to the labs queue. If there is a hardware

+ problem, like an android device not responding or a bot that likely needs

+ a restart, please use this label. Make sure you set the **OS-** label

+ correctly as well, and add a comment explaining what you want the labs

+ team to do.

+ * **Infra** label is appropriate for bugs that are not high priority, but we

+ need infra team's help to triage. For example, the buildbot status page

+ UI is weird or we are getting some infra-related log spam. The infra team

+ works to triage these bugs within 24 hours, so you should ping if you do

+ not get a response.

+ * **Cr-Tests-Telemetry** for telemetry failures.

+ * **Cr-Tests-AutoBisect** for bisect and perf try job failures.

+ If you still need help, ask the speed infra chat, or escalate to sullivan@.

+##<a name="testfailures"></a> Handle Test Failures

You want to keep the waterfall green! So any bot that is red or purple needs to

be investigated. When a test fails:

1. File a bug using

- [this template](https://code.google.com/p/chromium/issues/entry?labels=Performance-BotHealth,Pri-1,Type-Bug-Regression,OS-?&comment=Revision+range+first+seen:%0ALink+to+failing+step+log:%0A%0A%0AIf%20the%20test%20is%20disabled,%20please%20downgrade%20to%20Pri-2.&summary=%3Ctest%3E+failure+on+chromium.perf+at+%3Crevisionrange%3E).

+ [this template](https://bugs.chromium.org/p/chromium/issues/entry?labels=Performance-BotHealth,Pri-1,Type-Bug-Regression,OS-?&comment=Revision+range+first+seen:%0ALink+to+failing+step+log:%0A%0A%0AIf%20the%20test%20is%20disabled,%20please%20downgrade%20to%20Pri-2.&summary=%3Ctest%3E+failure+on+chromium.perf+at+%3Crevisionrange%3E).

You'll want to be sure to include:

* Link to buildbot status page of failing build.

* Copy and paste of relevant failure snippet from the stdio.

@@ -92,11 +140,11 @@ be investigated. When a test fails:

the tab did not crash.

-#####<a name="telemetryfailures"></a> Disabling Telemetry Tests

+###<a name="telemetryfailures"></a> Disabling Telemetry Tests

If the test is a telemetry test, its name will have a '.' in it, such as

-thread_times.key_mobile_sites, or page_cycler.top_10. The part before the first

-dot will be a python file in [tools/perf/benchmarks](

+`thread_times.key_mobile_sites`, or `page_cycler.top_10`. The part before the

+first dot will be a python file in [tools/perf/benchmarks](

https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/).

If a telemetry test is failing and there is no clear culprit to revert

@@ -129,80 +177,26 @@ and example of disabling a benchmark which OOMs on svelte.

Disabling CLs can be TBR-ed to anyone in tools/perf/OWNERS, but please do **not**

submit with NOTRY=true.

-#####<a name="otherfailures"></a> Disabling Other Tests

+###<a name="otherfailures"></a> Disabling Other Tests

Non-telemetry tests are configured in [chromium.perf.json](https://code.google.com/p/chromium/codesearch#chromium/src/testing/buildbot/chromium.perf.json).

You can TBR any of the per-file OWNERS, but please do **not** submit with

NOTRY=true.

-####<a name="botfailures"></a> Handling Device and Bot Failures

-#####<a name="purplebots"></a> Purple bots

-When a bot goes purple, it's it's usually because of an infrastructure failure

-outside of the tests. But you should first check the logs of a purple bot to

-try to better understand the problem. Sometimes a telemetry test failure can

-turn the bot purple, for example.

-If the bot goes purple and you believe it's an infrastructure issue, file a bug

-with

-[this template](https://code.google.com/p/chromium/issues/entry?labels=Pri-1,Performance-BotHealth,Infra-Troopers,OS-?&comment=Link+to+buildbot+status+page:&summary=Purple+Bot+on+chromium.perf),

-which will automatically add the bug to the trooper queue. Be sure to note

-which step is failing, and paste any relevant info from the logs into the bug.

-#####<a name="devicefailures"></a> Android Device failures

-There are two types of device failures:

-1. A device is blacklisted in the `device_status_check` step. You can look at

- the buildbot status page to see how many devices were listed as online during

- this step. You should always see 7 devices online. If you see fewer than 7

- devices online, there is a problem in the lab.

-2. A device is passing `device_status_check` but still in poor health. The

- symptom of this is that all the tests are failing on it. You can see that on

- the buildbot status page by looking at the `Device Affinity`. If all tests

- with the same device affinity number are failing, it's probably a device

- failure.

-For both types of failures, please file a bug with [this template](https://code.google.com/p/chromium/issues/entry?labels=Pri-1,Performance-BotHealth,Infra-Labs,OS-Android&comment=Link+to+buildbot+status+page:&summary=Device+offline+on+chromium.perf)

-which will add an issue to the infra labs queue.

-If you need help triaging, here are the common labels you should use:

- * **Performance-BotHealth** should go on all bugs you file about the bots;

- it's the label we use to track all the issues.

- * **Infra-Troopers** adds the bug to the trooper queue. This is for high

- priority issues, like a build breakage. Please add a comment explaining

- what you want the trooper to do.

- * **Infra-Labs** adds the bug to the labs queue. If there is a hardware

- problem, like an android device not responding or a bot that likely needs

- a restart, please use this label. Make sure you set the **OS-** label

- correctly as well, and add a comment explaining what you want the labs

- team to do.

- * **Infra** label is appropriate for bugs that are not high priority, but we

- need infra team's help to triage. For example, the buildbot status page

- UI is weird or we are getting some infra-related log spam. The infra team

- works to triage these bugs within 24 hours, so you should ping if you do

- not get a response.

- * **Cr-Tests-Telemetry** for telemetry failures.

- * **Cr-Tests-AutoBisect** for bisect and perf try job failures.

- If you still need help, ask the speed infra chat, or escalate to sullivan@.

-####<a name="followup"></a> Follow up on failures

+##<a name="followup"></a> Follow up on failures

-**[Pri-0 bugs](https://code.google.com/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-0)**

+**[Pri-0 bugs](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-0)**

should have an owner or contact on speed infra team and be worked on as top

priority. Pri-0 generally implies an entire waterfall is down.

-**[Pri-1 bugs](https://code.google.com/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-1)**

+**[Pri-1 bugs](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-1)**

should be pinged daily, and checked to make sure someone is following up. Pri-1

bugs are for a red test (not yet disabled), purple bot, or failing device.

-**[Pri-2 bugs](https://code.google.com/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-2)**

+**[Pri-2 bugs](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-2)**

are for disabled tests. These should be pinged weekly, and work towards fixing

should be ongoing when the sheriff is not working on a Pri-1 issue. Here is the

-[list of Pri-2 bugs that have not been pinged in a week](https://code.google.com/p/chromium/issues/list?can=2&q=label:Performance-BotHealth%20label:Pri-2%20modified-before:today-7&sort=modified)

+[list of Pri-2 bugs that have not been pinged in a week](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label:Performance-BotHealth%20label:Pri-2%20modified-before:today-7&sort=modified)

<!-- Unresolved issues:

1. Do perf sheriffs watch the bisect waterfall?
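The two Android device failure checks described in this patch (fewer than 7 devices online after `device_status_check`, or every test sharing one device affinity failing) can be sketched as a small triage helper. This is only an illustration: the function name, the `(affinity, passed)` input shape, and the finding strings are hypothetical and do not correspond to any real buildbot or Telemetry API. Only the threshold of 7 devices and the affinity heuristic come from the doc.

```python
from collections import defaultdict

# The doc says a healthy bot should always show 7 devices online.
EXPECTED_ONLINE_DEVICES = 7


def classify_device_failures(online_devices, test_results):
    """Return human-readable findings for one build.

    online_devices: count reported by the device_status_check step.
    test_results: iterable of (device_affinity, passed) pairs; this
        input shape is an assumption made for the sketch.
    """
    findings = []

    # Type 1: a device was blacklisted, so fewer than 7 are online.
    if online_devices < EXPECTED_ONLINE_DEVICES:
        findings.append(
            f"lab problem: {online_devices} of "
            f"{EXPECTED_ONLINE_DEVICES} devices online"
        )

    # Type 2: the device passes the status check, but every test with
    # the same affinity number fails.
    by_affinity = defaultdict(list)
    for affinity, passed in test_results:
        by_affinity[affinity].append(passed)
    for affinity in sorted(by_affinity):
        if not any(by_affinity[affinity]):
            findings.append(f"probable device failure: affinity {affinity}")

    return findings
```

Either finding would lead to the same action in the doc: file a bug with the Infra-Labs template.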

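The bug templates linked throughout this patch are plain issue-entry URLs carrying `labels`, `comment`, and `summary` query parameters. A minimal sketch of how such a link could be assembled is shown here; the helper name `bug_template_url` is hypothetical, and the only grounded parts are the `bugs.chromium.org` entry URL and the parameter names visible in the diff.

```python
from urllib.parse import urlencode

# Issue-entry endpoint as it appears in the updated links in the diff.
ISSUE_ENTRY_URL = "https://bugs.chromium.org/p/chromium/issues/entry"


def bug_template_url(labels, comment, summary):
    """Build a pre-filled bug link (hypothetical helper for illustration)."""
    query = {
        "labels": ",".join(labels),
        "comment": comment,
        "summary": summary,
    }
    # Keep ',' and ':' unescaped so the result matches the style of the
    # links in the doc; spaces are encoded as '+'.
    return ISSUE_ENTRY_URL + "?" + urlencode(query, safe=",:")


device_offline = bug_template_url(
    ["Pri-1", "Performance-BotHealth", "Infra-Labs", "OS-Android"],
    "Link to buildbot status page:",
    "Device offline on chromium.perf",
)
```

With these arguments the helper reproduces the "Device offline" template link from the Android Device failures section, so swapping labels (for example `Infra-Troopers` for purple bots) is a one-line change.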