Created: 6 years, 9 months ago by qyearsley
Modified: 6 years, 8 months ago
CC: chromium-reviews, ghost stip (do not use)
Base URL: https://chromium.googlesource.com/chromium/src.git@master
Visibility: Public
Description
Refactor perf bisect script _CalculateConfidence method.
- Adding docstrings
- Factoring out _CalculateBounds method
- Bug fix???
BUG=
Committed: https://src.chromium.org/viewvc/chrome?view=rev&revision=260241
Patch Set 1
Total comments: 9
Patch Set 2: data -> values_list, current_mean -> mean
Patch Set 3: Refactor CalculateConfidence and add a unit test.
Messages
Total messages: 15 (0 generated)
Hey guys, I was just trying to figure out what the bisect script "confidence" means, and I saw one thing that might be a bug ... Could you take a look?

https://codereview.chromium.org/209853009/diff/1/tools/bisect-perf-regression.py
File tools/bisect-perf-regression.py (right):

https://codereview.chromium.org/209853009/diff/1/tools/bisect-perf-regression...
tools/bisect-perf-regression.py:2797: working_means: A list of lists of "good" result numbers.
So, this argument name kind of confused me. Individual members of this list are input to CalculateTruncatedMean, so that makes it seem like the members of this list are lists of values, and are not means?

https://codereview.chromium.org/209853009/diff/1/tools/bisect-perf-regression...
tools/bisect-perf-regression.py:2827: mean = CalculateTruncatedMean(values, 0)
The second argument to CalculateTruncatedMean is supposed to be "The % from the upper/lower portions of the data set to discards". So wouldn't CalculateTruncatedMean(vals, 0) be just Mean(vals)? Maybe it would be clearer to add a function like the one below just to make it clear we're not doing any fancy truncation stuff?

def Mean(x):
  return CalculateTruncatedMean(x, 0)

https://codereview.chromium.org/209853009/diff/1/tools/bisect-perf-regression...
tools/bisect-perf-regression.py:2830: bounds[1] = max(current_mean, bounds[1])
Attention! This changes the behavior!

In the existing version on line 2797, max is calculated as:
max(current_mean, bounds_working[0])

But it looks like bounds_working[0] is supposed to be the lower bound. Is it? Does this seem right?
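For reference, a minimal standalone sketch (not the script's actual CalculateTruncatedMean) of why a 0% truncated mean is just the plain mean:

def truncated_mean(values, truncate_fraction):
  # Sort, drop truncate_fraction of the values from each end, then average.
  # With truncate_fraction == 0 nothing is dropped, so this is the plain mean.
  values = sorted(values)
  k = int(len(values) * truncate_fraction)
  trimmed = values[k:len(values) - k] if k else values
  return sum(trimmed) / float(len(trimmed))

assert truncated_mean([1, 2, 3, 10], 0) == 4.0     # plain mean
assert truncated_mean([1, 2, 3, 10], 0.25) == 2.5  # drops 1 and 10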
https://codereview.chromium.org/209853009/diff/1/tools/bisect-perf-regression.py
File tools/bisect-perf-regression.py (right):

https://codereview.chromium.org/209853009/diff/1/tools/bisect-perf-regression...
tools/bisect-perf-regression.py:2826: for values in data:
s/data/values_list

https://codereview.chromium.org/209853009/diff/1/tools/bisect-perf-regression...
tools/bisect-perf-regression.py:2829: bounds[0] = min(current_mean, bounds[0])
s/current_mean/mean
Thanks Prasad :-)

https://codereview.chromium.org/209853009/diff/1/tools/bisect-perf-regression.py
File tools/bisect-perf-regression.py (right):

https://codereview.chromium.org/209853009/diff/1/tools/bisect-perf-regression...
tools/bisect-perf-regression.py:2826: for values in data:
On 2014/03/25 16:43:47, prasadv wrote:
> s/data/values_list

Done.

https://codereview.chromium.org/209853009/diff/1/tools/bisect-perf-regression...
tools/bisect-perf-regression.py:2829: bounds[0] = min(current_mean, bounds[0])
On 2014/03/25 16:43:47, prasadv wrote:
> s/current_mean/mean

Done.
https://codereview.chromium.org/209853009/diff/1/tools/bisect-perf-regression.py
File tools/bisect-perf-regression.py (right):

https://codereview.chromium.org/209853009/diff/1/tools/bisect-perf-regression...
tools/bisect-perf-regression.py:2830: bounds[1] = max(current_mean, bounds[1])
Nice catch! Yeah that should be max(current_mean, bounds_working[1]).

On 2014/03/25 02:25:33, qyearsley wrote:
> Attention! This changes the behavior!
>
> In the existing version on line 2797, max is calculated as:
> max(current_mean, bounds_working[0])
>
> But it looks like bounds_working[0] is supposed to be the lower bound. Is it?
> Does this seem right?
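To make the fix concrete, a hedged sketch of what the factored-out bounds helper could look like with the corrected upper-bound index (illustrative only; the plain mean stands in here for CalculateTruncatedMean(values, 0)):

def _CalculateBounds(values_list):
  # Returns [lowest mean, highest mean] across the per-run lists of values.
  means = [sum(values) / float(len(values)) for values in values_list]
  bounds = [means[0], means[0]]
  for mean in means[1:]:
    bounds[0] = min(mean, bounds[0])  # lower bound
    bounds[1] = max(mean, bounds[1])  # upper bound -- the fix: index 1, not 0
  return bounds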
https://codereview.chromium.org/209853009/diff/1/tools/bisect-perf-regression.py
File tools/bisect-perf-regression.py (right):

https://codereview.chromium.org/209853009/diff/1/tools/bisect-perf-regression...
tools/bisect-perf-regression.py:2830: bounds[1] = max(current_mean, bounds[1])
BTW, if you want to experiment with changing this, go right ahead! It was just a simple, conservative stab at calculating confidence, but it's pretty arbitrary. Had plans to try out a t-test at some point to see if it gave better results.

On 2014/03/25 19:56:21, shatch wrote:
> Nice catch! Yeah that should be max(current_mean, bounds_working[1]).
On 2014/03/25 19:58:35, shatch wrote:
> tools/bisect-perf-regression.py:2830: bounds[1] = max(current_mean, bounds[1])
> BTW, if you want to experiment with changing this, go right ahead! It was just a
> simple, conservative stab at calculating confidence, but it's pretty arbitrary.
> Had plans to try out a t-test at some point to see if it gave better results.

Aye, just looked into t-tests and it certainly seems applicable. Not entirely sure how we can decide if it gives "better results", but I do think that I would expect confidence to usually be between 0 and 100, not exactly 0 or 100 -- if the evidence is very strong that the "bad" and "good" means are statistically significantly different, I think I would expect a number more like 99.75% rather than 100.

I think that right now, we get "100% confidence" if the minimum distance between groups (dist_between_groups) is greater than the sum of the standard deviations of the two groups (len_broken_group + len_working_group), and we get "0% confidence" if there's any overlap between the two groups.

Possible follow-up change:
- Calculate the t value for the two groups.
- Use this and the sample size to calculate the p-value.
- Confidence = (1 - p-value).

One thing I still don't understand is: What exactly is in those working_means and broken_means lists? (It seems they're lists of lists of values, but I don't know if those values are means calculated from other lists of values...)
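A rough sketch of that possible follow-up, using Welch's t-test via scipy (an assumption -- scipy is not necessarily available to the bisect script, and this is one plausible formulation, not the script's current code):

from itertools import chain
from scipy import stats

def ttest_confidence(working_values_list, broken_values_list):
  # Flatten the per-run lists of values into two flat samples.
  working = list(chain.from_iterable(working_values_list))
  broken = list(chain.from_iterable(broken_values_list))
  # Welch's t-test does not assume the two groups have equal variance.
  _t_statistic, p_value = stats.ttest_ind(working, broken, equal_var=False)
  # Confidence as a percentage; e.g. p = 0.0025 gives 99.75 rather than 100.
  return 100.0 * (1.0 - p_value)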
On 2014/03/25 20:49:03, qyearsley wrote:
> Aye, just looked into t-tests and it certainly seems applicable. Not entirely
> sure how we can decide if it gives "better results", but I do think that I would
> expect confidence to usually be between 0 and 100, not exactly 0 or 100 ...

Yeah I'm not sure how to tell if it's giving "better results" other than to output it along with the existing value, and comparing them to see. Would love it if you want to experiment with this though, trying out t-tests have been on my TODO list for a long time.

> One thing I still don't understand is: What exactly is in those working_means
> and broken_means lists? (It seems they're lists of lists of values, but I don't
> know if those values are means calculated from other lists of values...)

Those values are just the values from the perf runs.
On 2014/03/26 17:55:48, shatch wrote:
> Those values are just the values from the perf runs.

Alright -- I guess each entry is results from a test run for one revision? And the results for a test run for one revision for one metric might be several values (e.g. for page_cycler warm_times/page_load_time, multiple numbers per RESULT line), or one value (e.g. for a metric that only has one number per RESULT line)?

[Note: I just looked at the function TryParseResultValuesFromOutput in the bisect script and I think this is what it appears to do -- it returns a list of one or more values.]
On 2014/03/26 18:48:56, qyearsley wrote:
> Alright -- I guess each entry is results from a test run for one revision?
> And the results for a test run for one revision for one metric might be several
> values (e.g. for page_cycler warm_times/page_load_time, multiple numbers per
> RESULT line), or one value (e.g. for a metric that only has one number per
> RESULT line)?

Yup, also the script might run the test several times.
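To make that data shape concrete, a hypothetical example of the inputs (names and numbers are made up; one inner list per test run, possibly several values per RESULT line):

# Three "good" runs and three "bad" runs of some metric; each inner list is
# the list of values parsed from one run's output.
working_values_list = [[55.2, 57.1, 56.0], [54.8, 55.9], [56.3, 55.5, 57.0]]
broken_values_list = [[63.0, 64.2, 62.8], [65.1, 63.9], [64.4, 62.5, 63.7]]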
On 2014/03/27 18:57:33, shatch wrote:
> Yup, also the script might run the test several times.

Alright, now I understand better.

In any case is this CL alright to submit?
On 2014/03/27 22:02:08, qyearsley wrote:
> In any case is this CL alright to submit?

yup, lgtm
The CQ bit was checked by qyearsley@chromium.org
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/qyearsley@chromium.org/209853009/30001
Message was sent while issue was closed.
Change committed as 260241
Message was sent while issue was closed.
A revert of this CL has been created in https://codereview.chromium.org/218613012/ by qyearsley@chromium.org. The reason for reverting is: Caused issue 358622. The bug was that at line 2832 I used a tuple instead of a list, then later I tried to assign to a member of the tuple, but Python tuples are immutable.
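The underlying Python behavior, as a minimal example:

bounds = (1.0, 2.0)  # tuple: immutable
try:
  bounds[1] = 3.0    # raises TypeError: 'tuple' object does not support item assignment
except TypeError:
  pass

bounds = [1.0, 2.0]  # list: mutable
bounds[1] = 3.0      # fine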