Created: 6 years, 9 months ago by qyearsley
Modified: 6 years, 8 months ago
CC: chromium-reviews, ghost stip (do not use)
Base URL: https://chromium.googlesource.com/chromium/src.git@master
Visibility: Public
Description
Refactor perf bisect script _CalculateConfidence method.
- Adding docstrings
- Factoring out _CalculateBounds method
- Bug fix???
BUG=
Committed: https://src.chromium.org/viewvc/chrome?view=rev&revision=260241
Patch Set 1
Total comments: 9
Patch Set 2: data -> values_list, current_mean -> mean
Patch Set 3: Refactor CalculateConfidence and add a unit test.
Messages
Total messages: 15 (0 generated)
Hey guys, I was just trying to figure out what the bisect script "confidence" means, and I saw one thing that might be a bug ... Could you take a look?

https://codereview.chromium.org/209853009/diff/1/tools/bisect-perf-regression.py
File tools/bisect-perf-regression.py (right):

https://codereview.chromium.org/209853009/diff/1/tools/bisect-perf-regression...
tools/bisect-perf-regression.py:2797: working_means: A list of lists of "good" result numbers.
So, this argument name kind of confused me. Individual members of this list are input to CalculateTruncatedMean, so that makes it seem like the members of this list are lists of values, and are not means?

https://codereview.chromium.org/209853009/diff/1/tools/bisect-perf-regression...
tools/bisect-perf-regression.py:2827: mean = CalculateTruncatedMean(values, 0)
The second argument to CalculateTruncatedMean is supposed to be "The % from the upper/lower portions of the data set to discards". So wouldn't CalculateTruncatedMean(vals, 0) be just Mean(vals)? Maybe it would be clearer to add a function like the one below just to make it clear we're not doing any fancy truncation stuff?

def Mean(x):
  return CalculateTruncatedMean(x, 0)

https://codereview.chromium.org/209853009/diff/1/tools/bisect-perf-regression...
tools/bisect-perf-regression.py:2830: bounds[1] = max(current_mean, bounds[1])
Attention! This changes the behavior!

In the existing version on line 2797, max is calculated as:
max(current_mean, bounds_working[0])

But it looks like bounds_working[0] is supposed to be the lower bound. Is it? Does this seem right?
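For reference, a minimal standalone sketch (not the script's actual CalculateTruncatedMean) of why a 0% truncated mean is just the plain mean:

def truncated_mean(values, truncate_fraction):
  # Sort, drop truncate_fraction of the values from each end, then average.
  # With truncate_fraction == 0 nothing is dropped, so this is the plain mean.
  values = sorted(values)
  k = int(len(values) * truncate_fraction)
  trimmed = values[k:len(values) - k] if k else values
  return sum(trimmed) / float(len(trimmed))

assert truncated_mean([1, 2, 3, 10], 0) == 4.0     # plain mean
assert truncated_mean([1, 2, 3, 10], 0.25) == 2.5  # drops 1 and 10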
https://codereview.chromium.org/209853009/diff/1/tools/bisect-perf-regression.py
File tools/bisect-perf-regression.py (right):

https://codereview.chromium.org/209853009/diff/1/tools/bisect-perf-regression...
tools/bisect-perf-regression.py:2826: for values in data:
s/data/values_list

https://codereview.chromium.org/209853009/diff/1/tools/bisect-perf-regression...
tools/bisect-perf-regression.py:2829: bounds[0] = min(current_mean, bounds[0])
s/current_mean/mean
Thanks Prasad :-)

https://codereview.chromium.org/209853009/diff/1/tools/bisect-perf-regression.py
File tools/bisect-perf-regression.py (right):

https://codereview.chromium.org/209853009/diff/1/tools/bisect-perf-regression...
tools/bisect-perf-regression.py:2826: for values in data:
On 2014/03/25 16:43:47, prasadv wrote:
> s/data/values_list

Done.

https://codereview.chromium.org/209853009/diff/1/tools/bisect-perf-regression...
tools/bisect-perf-regression.py:2829: bounds[0] = min(current_mean, bounds[0])
On 2014/03/25 16:43:47, prasadv wrote:
> s/current_mean/mean

Done.
https://codereview.chromium.org/209853009/diff/1/tools/bisect-perf-regression.py
File tools/bisect-perf-regression.py (right):

https://codereview.chromium.org/209853009/diff/1/tools/bisect-perf-regression...
tools/bisect-perf-regression.py:2830: bounds[1] = max(current_mean, bounds[1])
Nice catch! Yeah that should be max(current_mean, bounds_working[1]).

On 2014/03/25 02:25:33, qyearsley wrote:
> Attention! This changes the behavior!
>
> In the existing version on line 2797, max is calculated as:
> max(current_mean, bounds_working[0])
>
> But it looks like bounds_working[0] is supposed to be the lower bound. Is it?
> Does this seem right?
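To make the fix concrete, a hedged sketch of what the factored-out bounds helper could look like with the corrected upper-bound index (illustrative only; the plain mean stands in here for CalculateTruncatedMean(values, 0)):

def _CalculateBounds(values_list):
  # Returns [lowest mean, highest mean] across the per-run lists of values.
  means = [sum(values) / float(len(values)) for values in values_list]
  bounds = [means[0], means[0]]
  for mean in means[1:]:
    bounds[0] = min(mean, bounds[0])  # lower bound
    bounds[1] = max(mean, bounds[1])  # upper bound -- the fix: index 1, not 0
  return bounds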
https://codereview.chromium.org/209853009/diff/1/tools/bisect-perf-regression.py
File tools/bisect-perf-regression.py (right):

https://codereview.chromium.org/209853009/diff/1/tools/bisect-perf-regression...
tools/bisect-perf-regression.py:2830: bounds[1] = max(current_mean, bounds[1])
BTW, if you want to experiment with changing this, go right ahead! It was just a simple, conservative stab at calculating confidence, but it's pretty arbitrary. Had plans to try out a t-test at some point to see if it gave better results.

On 2014/03/25 19:56:21, shatch wrote:
> Nice catch! Yeah that should be max(current_mean, bounds_working[1]).
On 2014/03/25 19:58:35, shatch wrote:
> tools/bisect-perf-regression.py:2830: bounds[1] = max(current_mean, bounds[1])
> BTW, if you want to experiment with changing this, go right ahead! It was just a
> simple, conservative stab at calculating confidence, but it's pretty arbitrary.
> Had plans to try out a t-test at some point to see if it gave better results.

Aye, just looked into t-tests and it certainly seems applicable. Not entirely sure how we can decide if it gives "better results", but I do think that I would expect confidence to usually be between 0 and 100, not exactly 0 or 100 -- if the evidence is very strong that the "bad" and "good" means are statistically significantly different, I think I would expect a number more like 99.75% rather than 100.

I think that right now, we get "100% confidence" if the minimum distance between groups (dist_between_groups) is greater than the sum of the standard deviations of the two groups (len_broken_group + len_working_group), and we get "0% confidence" if there's any overlap between the two groups.

Possible follow-up change:
- Calculate the t value for the two groups.
- Use this and the sample size to calculate the p-value.
- Confidence = (1 - p-value).

One thing I still don't understand is: What exactly is in those working_means and broken_means lists? (It seems they're lists of lists of values, but I don't know if those values are means calculated from other lists of values...)
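A rough sketch of that possible follow-up, using Welch's t-test via scipy (an assumption -- scipy is not necessarily available to the bisect script, and this is one plausible formulation, not the script's current code):

from itertools import chain
from scipy import stats

def ttest_confidence(working_values_list, broken_values_list):
  # Flatten the per-run lists of values into two flat samples.
  working = list(chain.from_iterable(working_values_list))
  broken = list(chain.from_iterable(broken_values_list))
  # Welch's t-test does not assume the two groups have equal variance.
  _t_statistic, p_value = stats.ttest_ind(working, broken, equal_var=False)
  # Confidence as a percentage; e.g. p = 0.0025 gives 99.75 rather than 100.
  return 100.0 * (1.0 - p_value)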
On 2014/03/25 20:49:03, qyearsley wrote:
> Aye, just looked into t-tests and it certainly seems applicable. Not entirely
> sure how we can decide if it gives "better results", but I do think that I would
> expect confidence to usually be between 0 and 100, not exactly 0 or 100 ...

Yeah I'm not sure how to tell if it's giving "better results" other than to output it along with the existing value, and comparing them to see. Would love it if you want to experiment with this though, trying out t-tests have been on my TODO list for a long time.

> One thing I still don't understand is: What exactly is in those working_means
> and broken_means lists? (It seems they're lists of lists of values, but I don't
> know if those values are means calculated from other lists of values...)

Those values are just the values from the perf runs.
On 2014/03/26 17:55:48, shatch wrote:
> Those values are just the values from the perf runs.

Alright -- I guess each entry is results from a test run for one revision? And the results for a test run for one revision for one metric might be several values (e.g. for page_cycler warm_times/page_load_time, multiple numbers per RESULT line), or one value (e.g. for a metric that only has one number per RESULT line)?

[Note: I just looked at the function TryParseResultValuesFromOutput in the bisect script and I think this is what it appears to do -- it returns a list of one or more values.]
On 2014/03/26 18:48:56, qyearsley wrote:
> Alright -- I guess each entry is results from a test run for one revision?
> And the results for a test run for one revision for one metric might be several
> values (e.g. for page_cycler warm_times/page_load_time, multiple numbers per
> RESULT line), or one value (e.g. for a metric that only has one number per
> RESULT line)?

Yup, also the script might run the test several times.
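To make that data shape concrete, a hypothetical example of the inputs (names and numbers are made up; one inner list per test run, possibly several values per RESULT line):

# Three "good" runs and three "bad" runs of some metric; each inner list is
# the list of values parsed from one run's output.
working_values_list = [[55.2, 57.1, 56.0], [54.8, 55.9], [56.3, 55.5, 57.0]]
broken_values_list = [[63.0, 64.2, 62.8], [65.1, 63.9], [64.4, 62.5, 63.7]]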
On 2014/03/27 18:57:33, shatch wrote:
> Yup, also the script might run the test several times.

Alright, now I understand better.

In any case is this CL alright to submit?
On 2014/03/27 22:02:08, qyearsley wrote:
> In any case is this CL alright to submit?

yup, lgtm
The CQ bit was checked by qyearsley@chromium.org
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/qyearsley@chromium.org/209853009/30001
Message was sent while issue was closed.
Change committed as 260241
Message was sent while issue was closed.
A revert of this CL has been created in https://codereview.chromium.org/218613012/ by qyearsley@chromium.org. The reason for reverting is: Caused issue 358622. The bug was that at line 2832 I used a tuple instead of a list, then later I tried to assign to a member of the tuple, but Python tuples are immutable.
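The underlying Python behavior, as a minimal example:

bounds = (1.0, 2.0)  # tuple: immutable
try:
  bounds[1] = 3.0    # raises TypeError: 'tuple' object does not support item assignment
except TypeError:
  pass

bounds = [1.0, 2.0]  # list: mutable
bounds[1] = 3.0      # fine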