Chromium Code Reviews
Side by Side Diff: appengine/findit/crash/detect_regression_range.py

Issue 2325503002: Reorganizing detect_regression_range.py (Closed)
Patch Set: detect_regression_range.py: added todo note Created 4 years, 3 months ago
 # Copyright 2016 The Chromium Authors. All rights reserved.
 # Use of this source code is governed by a BSD-style license that can be
 # found in the LICENSE file.

 import logging

 # Default value for the maximum number of versions to look back.
 _MAXIMUM_WINDOW_SIZE = 30

 # Add epsilon to avoid dividing by zero when computing spikiness.
 _EPSILON = 0.00000001

 # Default value to control weight of current data when computing spikiness.
 _DEFAULT_ALPHA = 0.9

 # Threshold for calling something a spike.
 _SPIKINESS_THRESHOLD = 20


 # TODO(wrengr): make this streaming, rather than holding onto the whole list.
-def GetSpikes(data_series, alpha=_DEFAULT_ALPHA,
+def GetSpikes(events, get_value, alpha=_DEFAULT_ALPHA,
               threshold=_SPIKINESS_THRESHOLD):
   """Given a time series, determine the regression ranges for each
   anomalous spike.

-  The time series is represented by a list of (x,y) tuples, where each
-  tuple denotes an event, the position of the tuple in the list indicates
-  the relative ordering of events, x is an abstract name for the point
-  in time the event occurred, and y is the value associated with that
-  event. N.B., this function doesn't care at all about what the x values
-  actually look like; we just pass them around as abstract values,
-  thus they can be whatever the caller wants.
+  The time series is represented by a list of "events" together with a
+  function for computing the "value" of each event. The events themselves
+  can be any sort of object (including None); all we care about is their
+  value according to the valuation function. We assume the list is given
+  with the events already in order, but otherwise we do not care about
+  their exact position within the list. We return a list of pairs of
+  events, where the two events in each pair are adjacent in the original
+  list. Whenever we determine that the value of some event deviates too
+  much from the values of previous events, we will return a pair of the
+  event which deviates too much, together with the immediately previous
+  event.

-  We use exponential smoothing as our model for determining whether
-  an event is a spike or not. This model has two parameters: the alpha
-  parameter determines how readily we update the running average for each
-  new event, and the threshold parameter determines when an event is
-  considered anomalous enough to be called a spike. N.B., exponential
-  smoothing is considered a naive model for analysing time series,
-  because it cannot account for trends; but since we are only interested
-  in detecting deviation from zero, this model is good enough for us.
+  We use exponential smoothing as our model for determining when there
+  are spikes in the events' values. This model has two parameters:
+  the alpha parameter determines how readily we update the running
+  average each time we see a new event, and the threshold parameter
+  determines when an event's value is considered anomalous enough to
+  be called a spike. N.B., exponential smoothing is considered a naive
+  model for analysing time series, because it cannot account for trends;
+  but since we are only interested in detecting deviation from zero,
+  this model is good enough for us.

   Args:
-    data_series (list): A list of (x, y) tuples where y is a number.
+    events (list): A list of objects.
+    get_value (callable): a function from event objects to a number.
     alpha (float): In (0, 1], controls how we balance between evidence
       from new events vs the running average. When alpha=0 we completely
       ignore the new event; when alpha=1 we completely ignore the running
       average.
     threshold (float): Threshold for when something is considered to be
       a spike.

   Returns:
-    A list of (y0,y1) pairs denoting the regression range for each
-    spike. That is, y0 is the abstract name for the point in time just
-    before the spike, and y1 is the abstract name for the point in time
-    just after the spike. If data_series has fewer than 2 elements,
+    A list of (e0,e1) pairs denoting the regression range for each
+    spike. That is, e0 is the event just before the spike, and e1 is the
+    event just after the spike. If events has fewer than 2 elements,
     then we return the empty list rather than returning None.
   """
+  # TODO(wrengr): assert callable(get_value), so we can give a decent
+  # error message if it's not.
+
   # TODO(wrengr): would it be better to return None so callers fail fast?
-  if not data_series or len(data_series) == 1:
+  if not events or len(events) == 1:
     return []

-  logging.info('For data series %s', repr(data_series))
+  logging.info('For data series %s', repr(events))
   logging.info('Alpha is %f', alpha)
   logging.info('Threshold is %f', threshold)

   # TODO(wrengr): do we care enough about precision to avoid the floating
   # point issues with computing things the naive way?
   spikes = []
-  previous_x, previous_mean = data_series[0]
-  for event in data_series[1:]:
-    x, y = event
-
-    current_mean = (1 - alpha) * previous_mean + alpha * y
+  previous_event = events[0]
+  previous_mean = get_value(previous_event)
+  for event in events[1:]:
+    current_value = get_value(event)
+    current_mean = (1 - alpha) * previous_mean + alpha * current_value

     # Spikiness is basically relative increase. Add epsilon to both
     # numerator and denominator to avoid dividing by zero error.
     spikiness = ((current_mean - previous_mean + _EPSILON) /
                  (previous_mean + _EPSILON))
-    logging.info('The spikiness of data %s is %s',
+    logging.info('The spikiness of event %s is %s',
                  repr(event), spikiness)
     if spikiness > threshold:
-      spikes.append((previous_x,x))
+      spikes.append((previous_event, event))

+    previous_event = event
     previous_mean = current_mean
-    previous_x = x
-
-  # Because we don't know the x for the next event after the final one,
-  # we ignore spikes at the very end of the series.

   return spikes


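To make the new `(events, get_value)` signature concrete, here is a minimal, self-contained sketch of the patched detector (logging stripped, constants inlined) run on a hypothetical `(version, cpm)` series; the data values are made up for illustration:

```python
_EPSILON = 0.00000001  # avoids division by zero in the spikiness ratio


def get_spikes(events, get_value, alpha=0.9, threshold=20):
  # Restatement of the patched GetSpikes, trimmed of logging.
  if len(events) < 2:
    return []
  spikes = []
  previous_event = events[0]
  previous_mean = get_value(previous_event)
  for event in events[1:]:
    # Exponentially smoothed running average of the values seen so far.
    current_mean = (1 - alpha) * previous_mean + alpha * get_value(event)
    # Spikiness is the relative increase of the smoothed mean.
    spikiness = ((current_mean - previous_mean + _EPSILON) /
                 (previous_mean + _EPSILON))
    if spikiness > threshold:
      spikes.append((previous_event, event))
    previous_event = event
    previous_mean = current_mean
  return spikes


# Hypothetical crash-per-million series: flat, then a jump at '55.0'.
series = [('53.0', 0.001), ('54.0', 0.0012), ('55.0', 1.0), ('56.0', 1.1)]
print(get_spikes(series, lambda e: e[1]))
# [(('54.0', 0.0012), ('55.0', 1.0))] -- the pair bracketing the jump
```

Note that a series starting at exactly zero will flag the first nonzero value as a spike, since the epsilon-padded baseline is tiny; that is the "deviation from zero" behavior the docstring alludes to.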
 def GetAttributesListFromHistoricData(historic_metadata, attributes):
   """Returns a list of attributes from historic_metadata.

   Args:
     historic_metadata (list): A list of dict of metadata, for example:
       [{'chrome_version': '1', 'cpm': 0}, {'chrome_version': '2', 'cpm': 0}]

(... 31 matching lines skipped ...)
       [{'chrome_version': '1', 'cpm': 0}, {'chrome_version': '2', 'cpm': 0}]
     max_win_size (int): Number of versions to look back from
       the current version.

   Returns:
     A tuple, (last_good_version, first_bad_version) or None if none found.
   """
   if not historic_metadata:
     return None

+  # TODO(http://crbug.com/644406): we should be able to remove this code
+  # for converting historic_metadata to pairs now. Just need to update
+  # the unit tests to match.
+
   # Truncate the historic data so we only analyze data for max_win_size of
   # latest versions.
   versions_to_cpm = GetAttributesListFromHistoricData(
       historic_metadata[-max_win_size:], ['chrome_version', 'cpm'])
Sharu Jiang 2016/09/08 00:48:24: We don't need this anymore. We can just use: spik
wrengr (wrong one) 2016/09/08 18:12:13: Right. I just hadn't gotten around to deleting the
Sharu Jiang 2016/09/08 18:26:08: Acknowledged.

-  spikes = GetSpikes(versions_to_cpm)
+  spikes = GetSpikes(versions_to_cpm, lambda x: x[1])

   if not spikes:
     logging.warning('Failed to find spikes in history data %s' % repr(
         historic_metadata))
     return None

   # Only return the latest regression range.
-  return spikes[-1]
+  return spikes[-1][0]
Sharu Jiang 2016/09/08 00:48:24: I think this should be return spikes[-1][0]['chro
wrengr (wrong one) 2016/09/08 18:12:13: Do you mean you want DetectRegressionRange to (con
Sharu Jiang 2016/09/08 18:26:08: The final result of DetectRegressionRange should s
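The thread above converges on dropping the tuple conversion and letting GetSpikes value the metadata dicts directly, then pulling the version strings out of the last spike pair. A self-contained sketch of that idea (the detector is restated from the patch; the data and the final version extraction are hypothetical, since the exact return shape was still under discussion when this patch set was uploaded):

```python
_EPSILON = 0.00000001  # avoids division by zero in the spikiness ratio


def get_spikes(events, get_value, alpha=0.9, threshold=20):
  # Restatement of the patched GetSpikes, trimmed of logging.
  if len(events) < 2:
    return []
  spikes = []
  previous_event = events[0]
  previous_mean = get_value(previous_event)
  for event in events[1:]:
    current_mean = (1 - alpha) * previous_mean + alpha * get_value(event)
    spikiness = ((current_mean - previous_mean + _EPSILON) /
                 (previous_mean + _EPSILON))
    if spikiness > threshold:
      spikes.append((previous_event, event))
    previous_event = event
    previous_mean = current_mean
  return spikes


# Hypothetical crash-rate history; each event is a metadata dict, valued
# by its 'cpm' field -- no pre-conversion to (version, cpm) tuples needed.
historic_metadata = [
    {'chrome_version': '53.0', 'cpm': 0.001},
    {'chrome_version': '54.0', 'cpm': 0.0012},
    {'chrome_version': '55.0', 'cpm': 1.0},
]
spikes = get_spikes(historic_metadata, lambda d: d['cpm'])
last_good, first_bad = spikes[-1]
print((last_good['chrome_version'], first_bad['chrome_version']))
# ('54.0', '55.0')
```

Because the events are arbitrary objects, the caller keeps the whole metadata dict in each spike pair and can extract whichever fields the final regression-range result should carry.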
Powered by Google App Engine
This is Rietveld 408576698