Chromium Code Reviews

Unified Diff: tools/perf/docs/perf_bot_sheriffing.md

Issue 1770383005: Reduce indentation levels for sheriff docs. (Closed) Base URL: https://chromium.googlesource.com/chromium/src.git@master
Patch Set: Commentz Created 4 years, 9 months ago
Index: tools/perf/docs/perf_bot_sheriffing.md
diff --git a/tools/perf/docs/perf_bot_sheriffing.md b/tools/perf/docs/perf_bot_sheriffing.md
index 6b072b8f563c35225792aa526c29b02ae888e4f2..69407bac3a81e04b512bf8e132ea25554d11a2c6 100644
--- a/tools/perf/docs/perf_bot_sheriffing.md
+++ b/tools/perf/docs/perf_bot_sheriffing.md
@@ -5,17 +5,11 @@ waterfall up and running, and triaging performance test failures and flakes.
## Key Responsibilities
- * [Keeping the chromium.perf waterfall green](#chromiumperf)
- * [Handling Test Failures](#testfailures)
- * [Handling Device and Bot Failures](#botfailures)
- * [Follow up on failures](#followup)
+ * [Handle Device and Bot Failures](#botfailures)
+ * [Handle Test Failures](#testfailures)
+ * [Follow up on failures](#followup)
-###<a name="chromiumperf"></a> Keeping the chromium.perf waterfall green
-
-The primary responsibility of the perfbot health sheriff is to keep the
-chromium.perf waterfall green.
-
-####<a name="waterfallstate"></a> Understanding the Waterfall State
+##<a name="waterfallstate"></a> Understanding the Waterfall State
Everyone can view the chromium.perf waterfall at
https://build.chromium.org/p/chromium.perf/, but for Googlers it is recommended
@@ -44,20 +38,74 @@ Note that there are four different views:
* **start_time** is seconds since the epoch.
You can see a list of all previously filed bugs using the
-**[Performance-BotHealth](https://code.google.com/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth)**
+**[Performance-BotHealth](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth)**
label in crbug.
Please also check the recent
**[perf-sheriffs@chromium.org](https://groups.google.com/a/chromium.org/forum/#!forum/perf-sheriffs)**
postings for important announcements about bot turndowns and other known issues.
-####<a name="testfailures"></a> Handling Test Failures
+##<a name="botfailures"></a> Handle Device and Bot Failures
+
+###<a name="purplebots"></a> Purple bots
+
+When a bot goes purple, it's usually because of an infrastructure failure
+outside of the tests. But you should first check the logs of a purple bot to
+try to better understand the problem. Sometimes a telemetry test failure can
+turn the bot purple, for example.
+
+If the bot goes purple and you believe it's an infrastructure issue, file a bug
+with
+[this template](https://bugs.chromium.org/p/chromium/issues/entry?labels=Pri-1,Performance-BotHealth,Infra-Troopers,OS-?&comment=Link+to+buildbot+status+page:&summary=Purple+Bot+on+chromium.perf),
+which will automatically add the bug to the trooper queue. Be sure to note
+which step is failing, and paste any relevant info from the logs into the bug.
+
+###<a name="devicefailures"></a> Android Device failures
+
+There are two types of device failures:
+
+1. A device is blacklisted in the `device_status_check` step. You can look at
+ the buildbot status page to see how many devices were listed as online during
+ this step. You should always see 7 devices online. If you see fewer than 7
+ devices online, there is a problem in the lab.
+2. A device is passing `device_status_check` but still in poor health. The
+ symptom of this is that all the tests are failing on it. You can see that on
+ the buildbot status page by looking at the `Device Affinity`. If all tests
+ with the same device affinity number are failing, it's probably a device
+ failure.
+
+For both types of failures, please file a bug with [this template](https://bugs.chromium.org/p/chromium/issues/entry?labels=Pri-1,Performance-BotHealth,Infra-Labs,OS-Android&comment=Link+to+buildbot+status+page:&summary=Device+offline+on+chromium.perf)
+which will add an issue to the infra labs queue.
+
+If you need help triaging, here are the common labels you should use:
+
+ * **Performance-BotHealth** should go on all bugs you file about the bots;
+ it's the label we use to track all the issues.
+ * **Infra-Troopers** adds the bug to the trooper queue. This is for high
+ priority issues, like a build breakage. Please add a comment explaining
+ what you want the trooper to do.
+ * **Infra-Labs** adds the bug to the labs queue. If there is a hardware
+ problem, like an android device not responding or a bot that likely needs
+ a restart, please use this label. Make sure you set the **OS-** label
+ correctly as well, and add a comment explaining what you want the labs
+ team to do.
+ * **Infra** label is appropriate for bugs that are not high priority, but we
+ need infra team's help to triage. For example, the buildbot status page
+ UI is weird or we are getting some infra-related log spam. The infra team
+ works to triage these bugs within 24 hours, so you should ping if you do
+ not get a response.
+ * **Cr-Tests-Telemetry** for telemetry failures.
+ * **Cr-Tests-AutoBisect** for bisect and perf try job failures.
+
+ If you still need help, ask the speed infra chat, or escalate to sullivan@.
+
+##<a name="testfailures"></a> Handle Test Failures
You want to keep the waterfall green! So any bot that is red or purple needs to
be investigated. When a test fails:
1. File a bug using
- [this template](https://code.google.com/p/chromium/issues/entry?labels=Performance-BotHealth,Pri-1,Type-Bug-Regression,OS-?&comment=Revision+range+first+seen:%0ALink+to+failing+step+log:%0A%0A%0AIf%20the%20test%20is%20disabled,%20please%20downgrade%20to%20Pri-2.&summary=%3Ctest%3E+failure+on+chromium.perf+at+%3Crevisionrange%3E).
+ [this template](https://bugs.chromium.org/p/chromium/issues/entry?labels=Performance-BotHealth,Pri-1,Type-Bug-Regression,OS-?&comment=Revision+range+first+seen:%0ALink+to+failing+step+log:%0A%0A%0AIf%20the%20test%20is%20disabled,%20please%20downgrade%20to%20Pri-2.&summary=%3Ctest%3E+failure+on+chromium.perf+at+%3Crevisionrange%3E).
You'll want to be sure to include:
* Link to buildbot status page of failing build.
* Copy and paste of relevant failure snippet from the stdio.
@@ -92,11 +140,11 @@ be investigated. When a test fails:
the tab did not crash.
-#####<a name="telemetryfailures"></a> Disabling Telemetry Tests
+###<a name="telemetryfailures"></a> Disabling Telemetry Tests
If the test is a telemetry test, its name will have a '.' in it, such as
-thread_times.key_mobile_sites, or page_cycler.top_10. The part before the first
-dot will be a python file in [tools/perf/benchmarks](
+`thread_times.key_mobile_sites`, or `page_cycler.top_10`. The part before the
+first dot will be a python file in [tools/perf/benchmarks](
https://code.google.com/p/chromium/codesearch#chromium/src/tools/perf/benchmarks/).
If a telemetry test is failing and there is no clear culprit to revert
@@ -129,80 +177,26 @@ and example of disabling a benchmark which OOMs on svelte.
Disabling CLs can be TBR-ed to anyone in tools/perf/OWNERS, but please do **not**
submit with NOTRY=true.
-#####<a name="otherfailures"></a> Disabling Other Tests
+###<a name="otherfailures"></a> Disabling Other Tests
Non-telemetry tests are configured in [chromium.perf.json](https://code.google.com/p/chromium/codesearch#chromium/src/testing/buildbot/chromium.perf.json).
You can TBR any of the per-file OWNERS, but please do **not** submit with
NOTRY=true.
-####<a name="botfailures"></a> Handling Device and Bot Failures
-
-#####<a name="purplebots"></a> Purple bots
-
-When a bot goes purple, it's it's usually because of an infrastructure failure
-outside of the tests. But you should first check the logs of a purple bot to
-try to better understand the problem. Sometimes a telemetry test failure can
-turn the bot purple, for example.
-
-If the bot goes purple and you believe it's an infrastructure issue, file a bug
-with
-[this template](https://code.google.com/p/chromium/issues/entry?labels=Pri-1,Performance-BotHealth,Infra-Troopers,OS-?&comment=Link+to+buildbot+status+page:&summary=Purple+Bot+on+chromium.perf),
-which will automatically add the bug to the trooper queue. Be sure to note
-which step is failing, and paste any relevant info from the logs into the bug.
-
-#####<a name="devicefailures"></a> Android Device failures
-
-There are two types of device failures:
-
-1. A device is blacklisted in the `device_status_check` step. You can look at
- the buildbot status page to see how many devices were listed as online during
- this step. You should always see 7 devices online. If you see fewer than 7
- devices online, there is a problem in the lab.
-2. A device is passing `device_status_check` but still in poor health. The
- symptom of this is that all the tests are failing on it. You can see that on
- the buildbot status page by looking at the `Device Affinity`. If all tests
- with the same device affinity number are failing, it's probably a device
- failure.
-
-For both types of failures, please file a bug with [this template](https://code.google.com/p/chromium/issues/entry?labels=Pri-1,Performance-BotHealth,Infra-Labs,OS-Android&comment=Link+to+buildbot+status+page:&summary=Device+offline+on+chromium.perf)
-which will add an issue to the infra labs queue.
-
-If you need help triaging, here are the common labels you should use:
-
- * **Performance-BotHealth** should go on all bugs you file about the bots;
- it's the label we use to track all the issues.
- * **Infra-Troopers** adds the bug to the trooper queue. This is for high
- priority issues, like a build breakage. Please add a comment explaining
- what you want the trooper to do.
- * **Infra-Labs** adds the bug to the labs queue. If there is a hardware
- problem, like an android device not responding or a bot that likely needs
- a restart, please use this label. Make sure you set the **OS-** label
- correctly as well, and add a comment explaining what you want the labs
- team to do.
- * **Infra** label is appropriate for bugs that are not high priority, but we
- need infra team's help to triage. For example, the buildbot status page
- UI is weird or we are getting some infra-related log spam. The infra team
- works to triage these bugs within 24 hours, so you should ping if you do
- not get a response.
- * **Cr-Tests-Telemetry** for telemetry failures.
- * **Cr-Tests-AutoBisect** for bisect and perf try job failures.
-
- If you still need help, ask the speed infra chat, or escalate to sullivan@.
-
-####<a name="followup"></a> Follow up on failures
+##<a name="followup"></a> Follow up on failures
-**[Pri-0 bugs](https://code.google.com/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-0)**
+**[Pri-0 bugs](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-0)**
should have an owner or contact on speed infra team and be worked on as top
priority. Pri-0 generally implies an entire waterfall is down.
-**[Pri-1 bugs](https://code.google.com/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-1)**
+**[Pri-1 bugs](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-1)**
should be pinged daily, and checked to make sure someone is following up. Pri-1
bugs are for a red test (not yet disabled), purple bot, or failing device.
-**[Pri-2 bugs](https://code.google.com/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-2)**
+**[Pri-2 bugs](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label%3APerformance-BotHealth+label%3APri-2)**
are for disabled tests. These should be pinged weekly, and work towards fixing
should be ongoing when the sheriff is not working on a Pri-1 issue. Here is the
-[list of Pri-2 bugs that have not been pinged in a week](https://code.google.com/p/chromium/issues/list?can=2&q=label:Performance-BotHealth%20label:Pri-2%20modified-before:today-7&sort=modified)
+[list of Pri-2 bugs that have not been pinged in a week](https://bugs.chromium.org/p/chromium/issues/list?can=2&q=label:Performance-BotHealth%20label:Pri-2%20modified-before:today-7&sort=modified)
<!-- Unresolved issues:
1. Do perf sheriffs watch the bisect waterfall?
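
Illustrative sketch, not part of this patch: two of the heuristics described in the doc above can be written as small Python helpers. The first maps a telemetry test name to its likely benchmark file, assuming the part before the first dot is the module name and the file ends in `.py`. The second flags device affinity numbers for which every test failed. The function names and the `(test_name, affinity, passed)` input shape are hypothetical; the doc does not define any tooling for this, and the results would have to be collected by hand from the buildbot status page.

```python
import posixpath
from collections import defaultdict

# Directory named in the doc; the ".py" suffix below is an assumption.
BENCHMARKS_DIR = "tools/perf/benchmarks"


def benchmark_file_for_test(test_name):
    """Return the likely benchmark file for a telemetry test, or None.

    Telemetry test names contain a '.'; the part before the first dot
    is taken to be the Python module name under tools/perf/benchmarks.
    """
    module, dot, _ = test_name.partition(".")
    if not dot or not module:
        return None  # no '.', so probably not a telemetry test
    return posixpath.join(BENCHMARKS_DIR, module + ".py")


def suspect_device_affinities(results):
    """Return the sorted affinity numbers for which every test failed.

    `results` is a hypothetical iterable of
    (test_name, affinity, passed) tuples.
    """
    outcomes_by_affinity = defaultdict(list)
    for _test_name, affinity, passed in results:
        outcomes_by_affinity[affinity].append(passed)
    return sorted(
        affinity
        for affinity, outcomes in outcomes_by_affinity.items()
        if not any(outcomes)
    )
```

An affinity is only flagged when none of its tests passed, which matches the doc's "all tests with the same device affinity number are failing" wording; a single passing test clears the device from suspicion.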