Issue 2075423002: [Findit] Group failures by culprit and send notification to codereview.

stgao

Patchset #1 (id:1) has been deleted

4 years, 6 months ago (2016-06-20 21:54:36 UTC) #1

stgao

Description was changed from ========== [Findit] Group failures by culprit to send notifications. BUG= ========== ...

4 years, 6 months ago (2016-06-20 21:55:03 UTC) #2

stgao

stgao@chromium.org changed reviewers: + chanli@chromium.org, lijeffrey@chromium.org

4 years, 6 months ago (2016-06-20 21:56:06 UTC) #3

stgao

ptal. This uses the Rietveld client to send the notifications to the culprits' code reviews.

4 years, 6 months ago (2016-06-20 21:56:07 UTC) #4

lijeffrey

https://codereview.chromium.org/2075423002/diff/20001/appengine/findit/model/wf_culprit.py File appengine/findit/model/wf_culprit.py (right): https://codereview.chromium.org/2075423002/diff/20001/appengine/findit/model/wf_culprit.py#newcode21 appengine/findit/model/wf_culprit.py:21: found_time = ndb.DateTimeProperty() do we want to index found_time ...

4 years, 6 months ago (2016-06-21 00:07:35 UTC) #5

stgao

4 years, 6 months ago (2016-06-21 15:14:30 UTC) #7

https://codereview.chromium.org/2075423002/diff/20001/appengine/findit/model/...
File appengine/findit/model/wf_culprit.py (right):

https://codereview.chromium.org/2075423002/diff/20001/appengine/findit/model/...
appengine/findit/model/wf_culprit.py:21: found_time = ndb.DateTimeProperty()
On 2016/06/21 00:07:34, lijeffrey wrote:
> do we want to index found_time should we ever want to query for historical
> WfCulprit entities? I imagine this could eventually be used in conjunction with
> auto revert and we may want metrics for that

We do want to query historical data to know which culprits, and how many, we
have sent notifications for to code review, and later to trigger auto-revert
when supported.

> 
> It also doesn't look like found_time is being set anywhere
Ooops, good catch. Changed to notification_time instead.

https://codereview.chromium.org/2075423002/diff/20001/appengine/findit/waterf...
File appengine/findit/waterfall/identify_try_job_culprit_pipeline.py (right):

https://codereview.chromium.org/2075423002/diff/20001/appengine/findit/waterf...
appengine/findit/waterfall/identify_try_job_culprit_pipeline.py:19: from
waterfall.send_notification_for_culprit_pipeline import \
On 2016/06/21 00:07:34, lijeffrey wrote:
> nit: I think gpylint doesn't like multi-line lines using \
> 
> is it possible to do something like:
> 
> from waterfall.send_notification_for_culprit_pipeline import (
>     SendNotificationForCulpritPipeline)?

Done.

https://codereview.chromium.org/2075423002/diff/20001/appengine/findit/waterf...
File appengine/findit/waterfall/send_notification_for_culprit_pipeline.py
(right):

https://codereview.chromium.org/2075423002/diff/20001/appengine/findit/waterf...
appengine/findit/waterfall/send_notification_for_culprit_pipeline.py:22: culprit
= WfCulprit.Get(repo_name, revision)
On 2016/06/21 00:07:35, lijeffrey wrote:
> nit: how about
> 
> culprit = WfCulprit.Get(repo_name, revision) or WfCulprit.Create(repo_name,
> revision)?

Good idea! Done.
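
For reference, a minimal sketch of the get-or-create pattern suggested above. The WfCulprit class here is a hypothetical in-memory stand-in (the real one is an ndb model), and the Get/Create signatures are assumed from the review comment only.

```python
class WfCulprit(object):
  """Hypothetical in-memory stand-in for the real ndb model."""

  _store = {}  # Maps (repo_name, revision) to a WfCulprit instance.

  def __init__(self, repo_name, revision):
    self.repo_name = repo_name
    self.revision = revision
    self.failed_builds = []

  @classmethod
  def Get(cls, repo_name, revision):
    # Returns None when no entity exists for this culprit yet.
    return cls._store.get((repo_name, revision))

  @classmethod
  def Create(cls, repo_name, revision):
    culprit = cls(repo_name, revision)
    cls._store[(repo_name, revision)] = culprit
    return culprit


def GetOrCreateCulprit(repo_name, revision):
  # `or` only falls through to Create when Get returns None.
  return (WfCulprit.Get(repo_name, revision) or
          WfCulprit.Create(repo_name, revision))
```

The one-liner relies on an existing entity being truthy, which holds for plain objects that define neither __len__ nor __bool__.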

https://codereview.chromium.org/2075423002/diff/20001/appengine/findit/waterf...
appengine/findit/waterfall/send_notification_for_culprit_pipeline.py:31:
should_send = len(culprit.failed_builds) >= 2  # TODO(stgao): move to config.
On 2016/06/21 00:07:34, lijeffrey wrote:
> is it possible to merge these 2?
> 
> i.e.
> 
> should_send = (len(culprit.failed_builds) >= 2 and
>     culprit.notification_status not in (status.COMPLETED, status.RUNNING))

Done.
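
A small sketch of the merged condition, written as a standalone function so it can be exercised without the datastore. The status constants and the function name are hypothetical; only the 2-build threshold and the COMPLETED/RUNNING check come from the comment above.

```python
# Hypothetical status values; the real ones live in the Findit status module.
COMPLETED = 'completed'
RUNNING = 'running'
ERROR = 'error'


def ShouldSendNotification(failed_builds, notification_status, min_builds=2):
  """Returns True if a notification should be sent for a culprit.

  A notification is sent only when enough builds share the culprit and no
  notification has already been sent or is currently being sent.
  """
  return (len(failed_builds) >= min_builds and
          notification_status not in (COMPLETED, RUNNING))
```

Passing min_builds as a parameter is one way to address the TODO about moving the threshold to config.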

chanli

https://codereview.chromium.org/2075423002/diff/20001/appengine/findit/model/wf_culprit.py File appengine/findit/model/wf_culprit.py (right): https://codereview.chromium.org/2075423002/diff/20001/appengine/findit/model/wf_culprit.py#newcode23 appengine/findit/model/wf_culprit.py:23: # The status of notification delivery. Do you want ...

4 years, 6 months ago (2016-06-21 17:50:32 UTC) #8

lijeffrey

lgtm https://codereview.chromium.org/2075423002/diff/60001/appengine/findit/model/wf_culprit.py File appengine/findit/model/wf_culprit.py (right): https://codereview.chromium.org/2075423002/diff/60001/appengine/findit/model/wf_culprit.py#newcode23 appengine/findit/model/wf_culprit.py:23: cr_notification_time = ndb.DateTimeProperty() do we still want to ...

4 years, 6 months ago (2016-06-21 17:59:54 UTC) #9

stgao

4 years, 5 months ago (2016-06-24 16:10:29 UTC) #11

https://codereview.chromium.org/2075423002/diff/20001/appengine/findit/model/...
File appengine/findit/model/wf_culprit.py (right):

https://codereview.chromium.org/2075423002/diff/20001/appengine/findit/model/...
appengine/findit/model/wf_culprit.py:23: # The status of notification delivery.
On 2016/06/21 17:50:31, chanli wrote:
> Do you want to list the statuses here? For reference?

Done.

https://codereview.chromium.org/2075423002/diff/20001/appengine/findit/model/...
appengine/findit/model/wf_culprit.py:30: def project_name(self):  # pragma: no
cover
On 2016/06/21 17:50:31, chanli wrote:
> Why not just use repo_name?

This is for UI.

I'm leaning toward making repo_name the full path of the repo checkout dir (e.g.
third_party/pdfium), and the project name the short name (e.g. pdfium).
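
One way the distinction could look, as a rough sketch only. The Culprit class is a hypothetical stand-in, and deriving the short name from the last path component is an assumption, not what the CL does.

```python
import posixpath


class Culprit(object):
  """Hypothetical stand-in showing repo_name vs. project_name."""

  def __init__(self, repo_name, revision):
    self.repo_name = repo_name  # Full checkout path, e.g. third_party/pdfium.
    self.revision = revision

  @property
  def project_name(self):
    # Short name for the UI: the last component of the checkout path.
    return posixpath.basename(self.repo_name.rstrip('/'))
```

With this, Culprit('third_party/pdfium', 'abc').project_name gives 'pdfium', while a top-level repo name is returned unchanged.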

https://codereview.chromium.org/2075423002/diff/20001/appengine/findit/waterf...
File appengine/findit/waterfall/identify_try_job_culprit_pipeline.py (right):

https://codereview.chromium.org/2075423002/diff/20001/appengine/findit/waterf...
appengine/findit/waterfall/identify_try_job_culprit_pipeline.py:344:
_NotifyCulprits(master_name, builder_name, build_number, culprits)
On 2016/06/21 17:50:31, chanli wrote:
> So only culprits found by try jobs will be included, right? And that's
> expected?

That's what I thought. Are you proposing that we send notifications for
heuristic results too?

> 
> If so, when do you want to enable this change in prod? I mean, should we wait
> for the accuracy rate for try jobs to get higher?

As in the other pipeline, we only notify if the same culprit is identified for
2+ different builders. I think that's good enough to avoid the false positives
caused by flaky compiles.
Did you find any cases where it won't work?

I intended to add a flag to disable the notification,
but that will be in a separate CL.
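
To make the "2+ different builders" rule concrete, here is a minimal sketch of grouping failures by culprit. The function names and the flat tuple input are hypothetical, and counting distinct (master, builder) pairs is an assumption about how "different builders" is measured.

```python
import collections


def GroupBuildsByCulprit(failures):
  """Groups failed builds by culprit revision.

  Args:
    failures: iterable of (revision, master_name, builder_name, build_number).

  Returns:
    A dict mapping revision to a list of (master, builder, build_number).
  """
  grouped = collections.defaultdict(list)
  for revision, master_name, builder_name, build_number in failures:
    grouped[revision].append((master_name, builder_name, build_number))
  return grouped


def CulpritsToNotify(failures, min_builders=2):
  """Returns revisions blamed by at least min_builders distinct builders."""
  revisions = []
  for revision, builds in sorted(GroupBuildsByCulprit(failures).items()):
    # Two builds on the same builder count once, so a single flaky builder
    # cannot trigger a notification by itself.
    builders = set((master, builder) for master, builder, _ in builds)
    if len(builders) >= min_builders:
      revisions.append(revision)
  return revisions
```

Note that this filter only guards against failures isolated to one builder; a wrong culprit shared by several builders would still pass it.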

https://codereview.chromium.org/2075423002/diff/20001/appengine/findit/waterf...
File appengine/findit/waterfall/send_notification_for_culprit_pipeline.py
(right):

https://codereview.chromium.org/2075423002/diff/20001/appengine/findit/waterf...
appengine/findit/waterfall/send_notification_for_culprit_pipeline.py:30: #
builds to avoid false positive due to flakiness.
On 2016/06/21 17:50:31, chanli wrote:
> Do you have data to back up this idea? What percentage of false positives
> could be filtered out by this?

This is based on the InCorrect-Found data. So far, only one case can't be
caught by this filter.

https://codereview.chromium.org/2075423002/diff/60001/appengine/findit/model/...
File appengine/findit/model/wf_culprit.py (right):

https://codereview.chromium.org/2075423002/diff/60001/appengine/findit/model/...
appengine/findit/model/wf_culprit.py:23: cr_notification_time =
ndb.DateTimeProperty()
On 2016/06/21 17:59:54, lijeffrey wrote:
> do we still want to do ndb.DateTimeProperty(indexed=True) to be able to query
> for this later?

By default, it is indexed.
But I made it explicit.

stgao

Description was changed from ========== [Findit] Group failures by culprit to send notifications. BUG=621140 ========== ...

4 years, 5 months ago (2016-06-24 16:11:32 UTC) #12

chanli

4 years, 5 months ago (2016-06-24 23:42:38 UTC) #13

https://codereview.chromium.org/2075423002/diff/20001/appengine/findit/waterf...
File appengine/findit/waterfall/identify_try_job_culprit_pipeline.py (right):

https://codereview.chromium.org/2075423002/diff/20001/appengine/findit/waterf...
appengine/findit/waterfall/identify_try_job_culprit_pipeline.py:344:
_NotifyCulprits(master_name, builder_name, build_number, culprits)
> As in the other pipeline, we only notify if the same culprit is identified for
> 2+ different builders. I think that's good enough to avoid the false positive
> caused by flaky compile.
> Did you find some cases that it won't work?
> 
> I was supposed to add a flag to disable the notification.
> But it will be in a separate CL.

Such as the case where the culprit revision is skipped by analyze?

https://codereview.chromium.org/2075423002/diff/20001/appengine/findit/waterf...
File appengine/findit/waterfall/send_notification_for_culprit_pipeline.py
(right):

https://codereview.chromium.org/2075423002/diff/20001/appengine/findit/waterf...
appengine/findit/waterfall/send_notification_for_culprit_pipeline.py:30: #
builds to avoid false positive due to flakiness.
On 2016/06/24 16:10:28, stgao wrote:
> On 2016/06/21 17:50:31, chanli wrote:
> > Do you have data to back up this idea? What percentage of false positives
> > could be filtered out by this?
> 
> This is from data in the InCorrect-Found. Only one case can't be caught by
> this filter so far.

I'm not sure about it: I can see quite a few cases where multiple builds
share the same culprit but that culprit is incorrect.

For example:
401764: [(chromium.memory, Linux Chromium OS ASan LSan Tests (1), 13919),
(chromium.win, Win7 Tests (1), 54147), (chromium.win, Win7 (32) Tests, 6659),
...]

398149: [(chromium.chromiumos, ChromiumOS amd64-generic Compile, 19370),
(chromium.chromiumos, ChromiumOS daisy Compile, 21018), (chromium.chromiumos,
ChromiumOS x86-generic Compile, 20468)]

393771: [(chromium.chrome, Google Chrome Win, 7100), (chromium, Win, 43403)]

I would say we can go with this approach, but we'd better wait for try jobs to
be more reliable.

chanli

4 years, 5 months ago (2016-06-24 23:44:55 UTC) #14

lgtm on functionality, though I'm a little worried that we cannot avoid all the
false positives.

stgao

4 years, 5 months ago (2016-06-27 17:30:12 UTC) #15

On 2016/06/24 23:44:55, chanli wrote:
> On 2016/06/24 23:42:38, chanli wrote:
> > https://codereview.chromium.org/2075423002/diff/20001/appengine/findit/waterf...
> > File appengine/findit/waterfall/identify_try_job_culprit_pipeline.py (right):
> > 
> > https://codereview.chromium.org/2075423002/diff/20001/appengine/findit/waterf...
> > appengine/findit/waterfall/identify_try_job_culprit_pipeline.py:344:
> > _NotifyCulprits(master_name, builder_name, build_number, culprits)
> > > As in the other pipeline, we only notify if the same culprit is identified
> > > for 2+ different builders. I think that's good enough to avoid the false
> > > positive caused by flaky compile.
> > > Did you find some cases that it won't work?
> > > 
> > > I was supposed to add a flag to disable the notification.
> > > But it will be in a separate CL.
> > 
> > Such as the case where the culprit revision is skipped by analyze?
> > 
> > https://codereview.chromium.org/2075423002/diff/20001/appengine/findit/waterf...
> > File appengine/findit/waterfall/send_notification_for_culprit_pipeline.py
> > (right):
> > 
> > https://codereview.chromium.org/2075423002/diff/20001/appengine/findit/waterf...
> > appengine/findit/waterfall/send_notification_for_culprit_pipeline.py:30: #
> > builds to avoid false positive due to flakiness.
> > On 2016/06/24 16:10:28, stgao wrote:
> > > On 2016/06/21 17:50:31, chanli wrote:
> > > > Do you have data to back up this idea? What percentage of false
> > > > positives could be filtered out by this?
> > > 
> > > This is from data in the InCorrect-Found. Only one case can't be caught by
> > > this filter so far.
> > 
> > I'm not sure about it: I can see quite several cases where multiple builds
> > shared the same culprit but that culprit is incorrect.
> > 
> > For example:
> > 401764: [(chromium.memory, Linux Chromium OS ASan LSan Tests (1), 13919),
> > (chromium.win, Win7 Tests (1), 54147), (chromium.win, Win7 (32) Tests,
> > 6659), ...]
This is a bug in the dependency "Analyze", not an issue within Findit.

chanli@, was a bug filed with the two people working on "Analyze"?

> > 
> > 398149: [(chromium.chromiumos, ChromiumOS amd64-generic Compile, 19370),
> > (chromium.chromiumos, ChromiumOS daisy Compile, 21018),
> > (chromium.chromiumos, ChromiumOS x86-generic Compile, 20468)]
This is not clear to me yet, but seems more like an infra issue.

> > 
> > 393771: [(chromium.chrome, Google Chrome Win, 7100), (chromium, Win, 43403)]
This is a flaky compile failure. Findit can't do anything about it.

> > 
> > I would say we can go with this approach but we'd better wait for try jobs
> > to be more reliable
> 
> lgtm on functionality, though I'm a little worried that we can not avoid all
> the false positives.

I will land as-is. Later we can tune the criteria for when the notification
should be sent.

stgao

The patchset sent to the CQ was uploaded after l-g-t-m from lijeffrey@chromium.org Link to the ...

4 years, 5 months ago (2016-06-27 17:30:19 UTC) #17