Chromium Code Reviews

Diff: appengine/findit/crash/detect_regression_range.py

Issue 2325503002: Reorganizing detect_regression_range.py (Closed)
Patch Set: detect_regression_range.py: removing dependency on https://codereview.chromium.org/2311393003/ and … Created 4 years, 3 months ago
  # Copyright 2016 The Chromium Authors. All rights reserved.
  # Use of this source code is governed by a BSD-style license that can be
  # found in the LICENSE file.

  import logging

- # Maximum number of versions to look back.
+ # Default value for the maximum number of versions to look back.
  _MAXIMUM_WINDOW_SIZE = 30

- # Add epsilon to avoid dividing by zero when computing spike score.
+ # Add epsilon to avoid dividing by zero when computing spikiness.
  _EPSILON = 0.00000001

- # Default value to control weight of current data when computing spike score.
+ # Default value to control weight of current data when computing spikiness.
  _DEFAULT_ALPHA = 0.9

- # Threshold of spike score.
- _SPIKENESS_THRESHOLD = 20
+ # Threshold for calling something a spike.
+ _SPIKINESS_THRESHOLD = 20
- # TODO(http://crbug.com/644406): This code needs cleaning/refactoring
- def GetSpikeIndexes(data_series, alpha=_DEFAULT_ALPHA,
-                     threshold=_SPIKENESS_THRESHOLD):
-   """Given a time series, determine anomalous spikes.
+ # TODO(wrengr): make this streaming, rather than holding onto the whole list.
+ def GetSpikes(events, get_value, alpha=_DEFAULT_ALPHA,
+               threshold=_SPIKINESS_THRESHOLD):
+   """Given a time series, determine the regression ranges for each
+   anomalous spike.

Sharu Jiang 2016/09/08 21:33:14  nit: A convention, for long doc string, keep the f…
stgao 2016/09/09 18:13:59  nit: one line summary here.

-   The time series is represented by a list of (x, y) tuples, where the
-   position of the tuple in the list is the time at which the event
-   occurred, x is an abstract name for that time, and y is the value of
-   the time series at that point in time. We detect spikes by determining
-   when the time series diverges too much from the running average (of
-   the y values). The alpha parameter determines how readily we update
-   the running average; which is to say, how much weight we give each
-   new event compared to the weight we give to the current running
-   average. The threshold parameter defines when we've deviated 'too far'
-   from the running average, and therefore say that this event is in fact
-   a spike.
+   The time series is represented by a list of "events" together with a
+   function for computing the "value" of each event. The events themselves
+   can be any sort of object (including None); all we care about is their
+   value according to the valuation function. We assume the list is given
+   with the events already in order, but otherwise we do not care about
+   their exact position within the list. We return a list of pairs of
+   events, where the two events in each pair are adjacent in the original
+   list. Whenever we determine that the value of some event deviates too
+   much from the values of previous events, we will return a pair of the
+   event which deviates too much, together with the immediately previous
+   event.

stgao 2016/09/09 18:13:58  Information in this paragraph could be merged into…

+   We use exponential smoothing as our model for determining when there
+   are spikes in the events' values. This model has two parameters:
+   the alpha parameter determines how readily we update the running
+   average each time we see a new event, and the threshold parameter
+   determines when an event's value is considered anomalous enough to
+   be called a spike. N.B., exponential smoothing is considered a naive
+   model for analysing time series, because it cannot account for trends;
+   but since we are only interested in detecting deviation from zero,
+   this model is good enough for us.

-   Args:
-     data_series (list): A list of (x, y) tuples where y is a number.
-     alpha (float): In (0, 1], it controls the weight of current data
-       when computing the running mean; a higher value has more weight.
-     threshold (float): Threshold for when something counts as a spike.
-
-   Returns:
-     A list of spike indexes (i.e., positions in the data_series list;
-     not the abstract names associated with those indices).
-   """
+   Args:
+     events (list): A list of objects.
+     get_value (callable): A function from event objects to a number.
+     alpha (float): In (0, 1], controls how we balance between evidence
+       from new events vs the running average. When alpha=0 we completely
+       ignore the new event; when alpha=1 we completely ignore the running
+       average.
+     threshold (float): Threshold for when something is considered to be
+       a spike.
+
+   Returns:
+     A list of (e0, e1) pairs denoting the regression range for each
+     spike. That is, e0 is the event just before the spike, and e1 is the
+     event just after the spike. If events has fewer than 2 elements,
+     then we return the empty list rather than returning None.
+   """
-   spike_indexes = []
-
-   if not data_series or len(data_series) == 1:
-     return spike_indexes
+   # TODO(wrengr): assert callable(get_value), so we can give a decent
+   # error message if it's not.
+
+   # TODO(wrengr): would it be better to return None so callers fail fast?
+   if not events or len(events) == 1:
+     return []

-   logging.info('For data series %s', repr(data_series))
+   logging.info('For data series %s', repr(events))
+   logging.info('Alpha is %f', alpha)
    logging.info('Threshold is %f', threshold)

-   previous_mean = data_series[0][1]
-   for i, (_, y) in enumerate(data_series[1:]):
-     current_mean = (1 - alpha) * previous_mean + alpha * y
+   # TODO(wrengr): do we care enough about precision to avoid the floating
+   # point issues of computing the mean in the naive/straightforward way?
+   spikes = []
+   previous_event = events[0]
+   previous_mean = get_value(previous_event)
+   for event in events[1:]:
+     current_value = get_value(event)
+     current_mean = (1 - alpha) * previous_mean + alpha * current_value

-     # Spike score is basically relative increase. Add epsilon
-     # to both numerator and denominator to avoid dividing by zero error.
-     spike_score = ((current_mean - previous_mean + _EPSILON) /
-                    (previous_mean + _EPSILON))
-     logging.info('The spike score for data %s is %s',
-                  repr(data_series[i]), spike_score)
+     # Spikiness is basically relative increase. Add epsilon to both
+     # numerator and denominator to avoid dividing by zero error.
+     spikiness = ((current_mean - previous_mean + _EPSILON) /
+                  (previous_mean + _EPSILON))
+     logging.info('The spikiness of event %s is %s',
+                  repr(event), spikiness)

stgao 2016/09/09 18:13:58  nit: could it fix in one line?

-     if spike_score > threshold:
-       spike_indexes.append(i + 1)
+     if spikiness > threshold:
+       spikes.append((previous_event, event))

+     previous_event = event
      previous_mean = current_mean

-   return spike_indexes
+   return spikes
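To illustrate the behavior of the new GetSpikes, here is a minimal standalone sketch of the same logic (restated here since the module under review is not importable from this page; the constants mirror the patch's defaults), run on a small synthetic series:

```python
# Minimal standalone sketch of the GetSpikes logic under review.
# Constants mirror the patch's defaults.
_EPSILON = 0.00000001
_DEFAULT_ALPHA = 0.9
_SPIKINESS_THRESHOLD = 20

def GetSpikes(events, get_value, alpha=_DEFAULT_ALPHA,
              threshold=_SPIKINESS_THRESHOLD):
  """Return (previous_event, event) pairs wherever the smoothed value spikes."""
  if not events or len(events) == 1:
    return []
  spikes = []
  previous_event = events[0]
  previous_mean = get_value(previous_event)
  for event in events[1:]:
    # Exponentially smoothed running mean of the values seen so far.
    current_mean = (1 - alpha) * previous_mean + alpha * get_value(event)
    spikiness = ((current_mean - previous_mean + _EPSILON) /
                 (previous_mean + _EPSILON))
    if spikiness > threshold:
      spikes.append((previous_event, event))
    previous_event = event
    previous_mean = current_mean
  return spikes

# A flat-then-spiking series: only the jump from 0 to 100 is a spike.
events = [{'v': '1', 'cpm': 0}, {'v': '2', 'cpm': 0},
          {'v': '3', 'cpm': 100}, {'v': '4', 'cpm': 100}]
print(GetSpikes(events, lambda e: e['cpm']))
# -> [({'v': '2', 'cpm': 0}, {'v': '3', 'cpm': 100})]
```

Note that the steady 100 after the jump produces no further spikes: once the running mean catches up, the relative increase falls below the threshold.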
-
-
- def GetRegressionRangeFromSpike(spike_index, versions):
-   """Get the regression range based on spike_index and versions."""
-   if spike_index < 1 or spike_index >= len(versions):
-     return None
-
-   return (versions[spike_index - 1], versions[spike_index])
-
-
- def GetAttributesListFromHistoricData(historic_metadata, attributes):
-   """Returns a list of attributes from historic_metadata.
-
-   Args:
-     historic_metadata (list): A list of dicts of metadata, for example:
-       [{'chrome_version': '1', 'cpm': 0}, {'chrome_version': '2', 'cpm': 0}]
-     attributes (list): A list of attribute names.
-
-   Returns:
-     A list of strs (if attributes has only 1 element), or tuples.
-   """
-   if not attributes:
-     return []
-
-   attributes_list = []
-
-   for data in historic_metadata:
-     attributes_entry = []
-
-     for attribute in attributes:
-       attributes_entry.append(data[attribute])
-
-     if len(attributes_entry) == 1:
-       attributes_list.append(attributes_entry[0])
-     else:
-       attributes_list.append(tuple(attributes_entry))
-
-   return attributes_list


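The deleted helper simply projected each metadata dict onto the requested keys, unwrapping single-element tuples; since the new code only ever needs one attribute, the lambda passed to GetSpikes replaces it. A compact sketch of the old behavior (a restatement, not the reviewed code):

```python
# Sketch of what the deleted helper computed: project each metadata dict
# onto the requested keys, unwrapping single-element results.
def GetAttributesListFromHistoricData(historic_metadata, attributes):
  if not attributes:
    return []
  return [data[attributes[0]] if len(attributes) == 1
          else tuple(data[a] for a in attributes)
          for data in historic_metadata]

historic = [{'chrome_version': '1', 'cpm': 0},
            {'chrome_version': '2', 'cpm': 5}]
print(GetAttributesListFromHistoricData(historic, ['chrome_version', 'cpm']))
# -> [('1', 0), ('2', 5)]
print(GetAttributesListFromHistoricData(historic, ['cpm']))
# -> [0, 5]
```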
  def DetectRegressionRange(historic_metadata, max_win_size=_MAXIMUM_WINDOW_SIZE):
    """Detect regression range from historic_metadata data.

    Args:
      historic_metadata (list): A list of dicts of metadata; the list is sorted
        by 'chrome_version' from oldest to latest. For example:
        [{'chrome_version': '1', 'cpm': 0}, {'chrome_version': '2', 'cpm': 0}]
      max_win_size (int): Number of versions to look back from
        the current version.

    Returns:
      A tuple, (last_good_version, first_bad_version), or None if none found.
    """
    if not historic_metadata:
      return None

    # Truncate the historic data so we only analyze data for max_win_size of
    # latest versions.
-   versions_to_cpm = GetAttributesListFromHistoricData(
-       historic_metadata[-max_win_size:], ['chrome_version', 'cpm'])
-
-   versions, _ = zip(*versions_to_cpm)
-   spike_indexes = GetSpikeIndexes(versions_to_cpm)
-
-   if not spike_indexes:
+   spikes = GetSpikes(historic_metadata[-max_win_size:], lambda x: x['cpm'])
+
+   if not spikes:
      logging.warning('Failed to find spikes in history data %s' % repr(
          historic_metadata))
      return None

-   # Only return the latest regression range.
-   return GetRegressionRangeFromSpike(spike_indexes[-1], versions)
+   # Only return the last/most-recent regression range.
+   last_good, first_bad = spikes[-1]
+   return last_good['chrome_version'], first_bad['chrome_version']
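Putting the pieces together, the new DetectRegressionRange returns the version pair surrounding the most recent cpm spike. A self-contained sketch of the end-to-end flow (again a restatement of the reviewed logic, with the same defaults):

```python
_EPSILON = 0.00000001

def GetSpikes(events, get_value, alpha=0.9, threshold=20):
  # Minimal restatement of the GetSpikes logic under review.
  if not events or len(events) == 1:
    return []
  spikes = []
  previous_event, previous_mean = events[0], get_value(events[0])
  for event in events[1:]:
    current_mean = (1 - alpha) * previous_mean + alpha * get_value(event)
    spikiness = ((current_mean - previous_mean + _EPSILON) /
                 (previous_mean + _EPSILON))
    if spikiness > threshold:
      spikes.append((previous_event, event))
    previous_event, previous_mean = event, current_mean
  return spikes

def DetectRegressionRange(historic_metadata, max_win_size=30):
  # Look only at the most recent max_win_size versions, then report the
  # (last_good, first_bad) chrome_version pair around the latest spike.
  if not historic_metadata:
    return None
  spikes = GetSpikes(historic_metadata[-max_win_size:], lambda x: x['cpm'])
  if not spikes:
    return None
  last_good, first_bad = spikes[-1]
  return last_good['chrome_version'], first_bad['chrome_version']

history = [{'chrome_version': '1', 'cpm': 0},
           {'chrome_version': '2', 'cpm': 0},
           {'chrome_version': '3', 'cpm': 100}]
print(DetectRegressionRange(history))  # -> ('2', '3')
```

The crash rate (cpm) jumps at version '3', so the regression range is ('2', '3'): version '2' is the last good version and '3' the first bad one.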
