Chromium Code Reviews

| OLD | NEW |
|---|---|
| 1 # Copyright 2016 The Chromium Authors. All rights reserved. | 1 # Copyright 2016 The Chromium Authors. All rights reserved. |
| 2 # Use of this source code is governed by a BSD-style license that can be | 2 # Use of this source code is governed by a BSD-style license that can be |
| 3 # found in the LICENSE file. | 3 # found in the LICENSE file. |
| 4 | 4 |
| 5 import logging | 5 import logging |
| 6 | 6 |
| 7 # Maximum number of versions to look back. | 7 # Default value for the maximum number of versions to look back. |
| 8 _MAXIMUM_WINDOW_SIZE = 30 | 8 _MAXIMUM_WINDOW_SIZE = 30 |
| 9 | 9 |
| 10 # Add epsilon to avoid dividing by zero when computing spike score. | 10 # Add epsilon to avoid dividing by zero when computing spikiness. |
| 11 _EPSILON = 0.00000001 | 11 _EPSILON = 0.00000001 |
| 12 | 12 |
| 13 # Default value to control weight of current data when computing spike score. | 13 # Default value to control weight of current data when computing spikiness. |
| 14 _DEFAULT_ALPHA = 0.9 | 14 _DEFAULT_ALPHA = 0.9 |
| 15 | 15 |
| 16 # Threshold of spike score. | 16 # Threshold for calling something a spike. |
| 17 _SPIKENESS_THRESHOLD = 20 | 17 _SPIKINESS_THRESHOLD = 20 |
| 18 | 18 |
| 19 | 19 |
| 20 # TODO(http://crbug.com/644406): This code needs cleaning/refactoring | 20 # TODO(wrengr): make this streaming, rather than holding onto the whole list. |
| 21 def GetSpikeIndexes(data_series, alpha=_DEFAULT_ALPHA, | 21 def GetSpikes(events, get_value, alpha=_DEFAULT_ALPHA, |
| 22 threshold=_SPIKENESS_THRESHOLD): | 22 threshold=_SPIKINESS_THRESHOLD): |
| 23 """Given a time series, determine anomolous spikes. | 23 """Given a time series, determine the regression ranges for each |
> **Sharu Jiang** (2016/09/08 21:33:14): nit: A convention, for long doc string, keep the f…

> **stgao** (2016/09/09 18:13:59): nit: one line summary here.

| OLD | NEW |
|---|---|
| | 24 anomalous spike. |
| 24 | 25 |
| 25 The time series is represented by a list of (x,y) tuples, where the | 26 The time series is represented by a list of "events" together with a |
| 26 position of the tuple in the list is the time at which the event | 27 function for computing the "value" of each event. The events themselves |
> **stgao** (2016/09/09 18:13:58): Information in this paragraph could be merged into…

| OLD | NEW |
|---|---|
| 27 occured, x is an abstract name for that time, and y is the value of | 28 can be any sort of object (including None), all we care about is their |
| 28 the time series at that point in time. We detect spikes by determining | 29 value according to the valuation function. We assume the list is given |
| 29 when the time series diverges too much from the running average (of | 30 with the events already in order, but otherwise we do not care about |
| 30 the y values). The alpha parameter determines how readily we update | 31 their exact position within the list. We return a list of pairs of |
| 31 the running average; which is to say, how much weight we give each | 32 events, where the two events in each pair are adjacent in the original |
| 32 new event compared to the weight we give to the current running | 33 list. Whenever we determine that the value of some event deviates too |
| 33 average. The threshold parameter defines when we've deviated 'too far' | 34 much from the values of previous events, we will return a pair of the |
| 34 from the running average, and therefore say that this event is in fact | 35 event which deviates too much, together with the immediately previous |
| 35 a spike. | 36 event. |
| | 37 |
| | 38 We use exponential smoothing as our model for determining when there |
| | 39 are spikes in the events' values. This model has two parameters: |
| | 40 the alpha parameter determines how readily we update the running |
| | 41 average each time we see a new event, and the threshold parameter |
| | 42 determines when an event's value is considered anomalous enough to |
| | 43 be called a spike. N.B., exponential smoothing is considered a naive |
| | 44 model for analysing time series, because it cannot account for trends; |
| | 45 but since we are only interested in detecting deviation from zero, |
| | 46 this model is good enough for us. |
| 36 | 47 |
| 37 Args: | 48 Args: |
| 38 data_series (list): A list of (x, y) tuples where y is a number. | 49 events (list): A list of objects. |
| 39 alpha (float): In (0, 1], it controls the weight of current data | 50 get_value (callable): a function from event objects to a number. |
| 40 when computing the running mean, a higher value has more weight. | 51 alpha (float): In (0, 1], controls how we balance between evidence |
| 41 threshold (float): Threshold for when something counts as a spike. | 52 from new events vs the running average. When alpha=0 we completely |
| | 53 ignore the new event; when alpha=1 we completely ignore the running |
| | 54 average. |
| | 55 threshold (float): Threshold for when something is considered to be |
| | 56 a spike. |
| 42 | 57 |
| 43 Returns: | 58 Returns: |
| 44 A list of spike indexes (i.e., positions in the data_series list; | 59 A list of (e0,e1) pairs denoting the regression range for each |
| 45 not the abstract names associated with those indices) | 60 spike. That is, e0 is the event just before the spike, and e1 is the |
| | 61 event just after the spike. If events has fewer than 2 elements, |
| | 62 then we return the empty list rather than returning None. |
| 46 """ | 63 """ |
| 47 spike_indexes = [] | 64 # TODO(wrengr): assert callable(get_value), so we can give a decent |
| | 65 # error message if it's not. |
| 48 | 66 |
| 49 if not data_series or len(data_series) == 1: | 67 # TODO(wrengr): would it be better to return None so callers fail fast? |
| 50 return spike_indexes | 68 if not events or len(events) == 1: |
| | 69 return [] |
| 51 | 70 |
| 52 logging.info('For data series %s', repr(data_series)) | 71 logging.info('For data series %s', repr(events)) |
| | 72 logging.info('Alpha is %f', alpha) |
| 53 logging.info('Threshold is %f', threshold) | 73 logging.info('Threshold is %f', threshold) |
| 54 | 74 |
| 55 previous_mean = data_series[0][1] | 75 # TODO(wrengr): do we care enough about precision to avoid the floating |
| 56 for i, (_, y) in enumerate(data_series[1:]): | 76 # point issues of computing the mean in the naive/straightforward way? |
| 57 current_mean = (1 - alpha) * previous_mean + alpha * y | 77 spikes = [] |
| | 78 previous_event = events[0] |
| | 79 previous_mean = get_value(previous_event) |
| | 80 for event in events[1:]: |
| | 81 current_value = get_value(event) |
| | 82 current_mean = (1 - alpha) * previous_mean + alpha * current_value |
| 58 | 83 |
| 59 # Spike score is basically relative increase. Add epsilon | 84 # Spikiness is basically relative increase. Add epsilon to both |
| 60 # to both numerator and denominator to avoid dividing by zero error. | 85 # numerator and denominator to avoid dividing by zero error. |
| 61 spike_score = ((current_mean - previous_mean + _EPSILON) / | 86 spikiness = ((current_mean - previous_mean + _EPSILON) / |
| 62 (previous_mean + _EPSILON)) | 87 (previous_mean + _EPSILON)) |
| 63 logging.info('The spike score for data %s is %s', | 88 logging.info('The spikiness of event %s is %s', |
| 64 repr(data_series[i]), spike_score) | 89 repr(event), spikiness) |
> **stgao** (2016/09/09 18:13:58): nit: could it fix in one line?

| OLD | NEW |
|---|---|
| 65 if spike_score > threshold: | 90 if spikiness > threshold: |
| 66 spike_indexes.append(i + 1) | 91 spikes.append((previous_event, event)) |
| 67 | 92 |
| | 93 previous_event = event |
| 68 previous_mean = current_mean | 94 previous_mean = current_mean |
| 69 | 95 |
| 70 return spike_indexes | 96 return spikes |
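As an editor's illustration of the rewritten `GetSpikes`, here is a standalone sketch of the exponential-smoothing detection described in the new docstring (not the CL's module; the lowercase name `get_spikes` and the sample series are hypothetical):

```python
def get_spikes(events, get_value, alpha=0.9, threshold=20):
    """Return (previous_event, event) pairs wherever the smoothed value
    jumps by more than `threshold` relative to the running mean."""
    epsilon = 1e-8  # avoid dividing by zero for all-zero series
    if len(events) < 2:
        return []
    spikes = []
    previous_event = events[0]
    previous_mean = get_value(previous_event)
    for event in events[1:]:
        # Exponential smoothing: alpha weights the new observation
        # against the running average.
        current_mean = (1 - alpha) * previous_mean + alpha * get_value(event)
        # Spikiness is the relative increase of the running mean.
        spikiness = ((current_mean - previous_mean + epsilon) /
                     (previous_mean + epsilon))
        if spikiness > threshold:
            spikes.append((previous_event, event))
        previous_event = event
        previous_mean = current_mean
    return spikes

# A flat series followed by a jump yields one regression range:
data = [{'v': '1', 'cpm': 0.0}, {'v': '2', 'cpm': 0.0}, {'v': '3', 'cpm': 5.0}]
print(get_spikes(data, lambda e: e['cpm']))  # one pair: the '2' and '3' events
```

A higher alpha makes the running mean track new events more aggressively (so sudden jumps register sooner), while a higher threshold demands a larger relative increase before a pair is reported.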
| 71 | |
| 72 | |
| 73 def GetRegressionRangeFromSpike(spike_index, versions): | |
| 74 """Get the regression range based on spike_index and versions.""" | |
| 75 if spike_index < 1 or spike_index >= len(versions): | |
| 76 return None | |
| 77 | |
| 78 return (versions[spike_index - 1], versions[spike_index]) | |
| 79 | |
| 80 | |
| 81 def GetAttributesListFromHistoricData(historic_metadata, attributes): | |
| 82 """Returns a list of attributes from historic_metadata. | |
| 83 | |
| 84 Args: | |
| 85 historic_metadata (list): A list of dict of metadata, for example: | |
| 86 [{'chrome_version': '1', 'cpm': 0}, {'chrome_version': '2', 'cpm': 0}] | |
| 87 attributes (list): A list of attribute names. | |
| 88 | |
| 89 Returns: | |
| 90 A list of strs(attributes has only 1 element), or tuples | |
| 91 """ | |
| 92 if not attributes: | |
| 93 return [] | |
| 94 | |
| 95 attributes_list = [] | |
| 96 | |
| 97 for data in historic_metadata: | |
| 98 attributes_entry = [] | |
| 99 | |
| 100 for attribute in attributes: | |
| 101 attributes_entry.append(data[attribute]) | |
| 102 | |
| 103 if len(attributes_entry) == 1: | |
| 104 attributes_list.append(attributes_entry[0]) | |
| 105 else: | |
| 106 attributes_list.append(tuple(attributes_entry)) | |
| 107 | |
| 108 return attributes_list | |
| 109 | 97 |
| 110 | 98 |
| 111 def DetectRegressionRange(historic_metadata, max_win_size=_MAXIMUM_WINDOW_SIZE): | 99 def DetectRegressionRange(historic_metadata, max_win_size=_MAXIMUM_WINDOW_SIZE): |
| 112 """Detect regression range from historic_metadata data. | 100 """Detect regression range from historic_metadata data. |
| 113 | 101 |
| 114 Args: | 102 Args: |
| 115 historic_metadata (list): A list of dict of metadata, the list is sorted by | 103 historic_metadata (list): A list of dict of metadata, the list is sorted by |
| 116 'chrome_version' from oldest to latest. | 104 'chrome_version' from oldest to latest. |
| 117 For example: | 105 For example: |
| 118 [{'chrome_version': '1', 'cpm': 0}, {'chrome_version': '2', 'cpm': 0}] | 106 [{'chrome_version': '1', 'cpm': 0}, {'chrome_version': '2', 'cpm': 0}] |
| 119 max_win_size (int): Number of versions to look back from | 107 max_win_size (int): Number of versions to look back from |
| 120 the current version. | 108 the current version. |
| 121 | 109 |
| 122 Returns: | 110 Returns: |
| 123 A tuple, (last_good_version, first_bad_version) or None if none found. | 111 A tuple, (last_good_version, first_bad_version) or None if none found. |
| 124 """ | 112 """ |
| 125 if not historic_metadata: | 113 if not historic_metadata: |
| 126 return None | 114 return None |
| 127 | 115 |
| 128 # Truncate the historic data so we only analyze data for max_win_size of | 116 # Truncate the historic data so we only analyze data for max_win_size of |
| 129 # latest versions. | 117 # latest versions. |
| 130 versions_to_cpm = GetAttributesListFromHistoricData( | 118 spikes = GetSpikes(historic_metadata[-max_win_size:], lambda x: x['cpm']) |
| 131 historic_metadata[-max_win_size:], ['chrome_version', 'cpm']) | |
| 132 | 119 |
| 133 versions, _ = zip(*versions_to_cpm) | 120 if not spikes: |
| 134 spike_indexes = GetSpikeIndexes(versions_to_cpm) | |
| 135 | |
| 136 if not spike_indexes: | |
| 137 logging.warning('Failed to find spikes in history data %s' % repr( | 121 logging.warning('Failed to find spikes in history data %s' % repr( |
| 138 historic_metadata)) | 122 historic_metadata)) |
| 139 return None | 123 return None |
| 140 | 124 |
| 141 # Only return the latest regression range. | 125 # Only return the last/most-recent regression range. |
| 142 return GetRegressionRangeFromSpike(spike_indexes[-1], versions) | 126 last_good, first_bad = spikes[-1] |
| | 127 return last_good['chrome_version'], first_bad['chrome_version'] |
| | 128 |
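To show the end-to-end flow of the rewritten `DetectRegressionRange` (truncate to the window, detect cpm spikes, report the last one), here is a self-contained sketch; the function and variable names are illustrative stand-ins, not the CL's code:

```python
def detect_regression_range(historic_metadata, max_win_size=30,
                            alpha=0.9, threshold=20):
    """Return (last_good_version, first_bad_version) for the most recent
    cpm spike in the last max_win_size versions, or None if none found."""
    if not historic_metadata:
        return None
    window = historic_metadata[-max_win_size:]
    epsilon = 1e-8
    spikes = []
    previous, previous_mean = window[0], window[0]['cpm']
    for entry in window[1:]:
        current_mean = (1 - alpha) * previous_mean + alpha * entry['cpm']
        spikiness = ((current_mean - previous_mean + epsilon) /
                     (previous_mean + epsilon))
        if spikiness > threshold:
            spikes.append((previous, entry))
        previous, previous_mean = entry, current_mean
    if not spikes:
        return None
    # Only the last/most-recent regression range is reported.
    last_good, first_bad = spikes[-1]
    return last_good['chrome_version'], first_bad['chrome_version']

history = [{'chrome_version': '50.0.1', 'cpm': 0.0},
           {'chrome_version': '50.0.2', 'cpm': 0.0},
           {'chrome_version': '50.0.3', 'cpm': 3.0}]
print(detect_regression_range(history))  # → ('50.0.2', '50.0.3')
```

Note the new code returns version strings directly from the spike's event dicts, so the old `GetRegressionRangeFromSpike` and `GetAttributesListFromHistoricData` helpers are no longer needed.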