Chromium Code Reviews

| OLD | NEW |
|---|---|
| 1 # Copyright 2016 The Chromium Authors. All rights reserved. | 1 # Copyright 2016 The Chromium Authors. All rights reserved. |
| 2 # Use of this source code is governed by a BSD-style license that can be | 2 # Use of this source code is governed by a BSD-style license that can be |
| 3 # found in the LICENSE file. | 3 # found in the LICENSE file. |
| 4 | 4 |
| 5 import logging | 5 import logging |
| 6 | 6 |
| 7 # Default value for the maximum number of versions to look back. | 7 # Default value for the maximum number of versions to look back. |
| 8 _MAXIMUM_WINDOW_SIZE = 30 | 8 _MAXIMUM_WINDOW_SIZE = 30 |
| 9 | 9 |
| 10 # Add epsilon to avoid dividing by zero when computing spikiness. | 10 # Add epsilon to avoid dividing by zero when computing spikiness. |
| 11 _EPSILON = 0.00000001 | 11 _EPSILON = 0.00000001 |
| 12 | 12 |
| 13 # Default value to control weight of current data when computing spikiness. | 13 # Default value to control weight of current data when computing spikiness. |
| 14 _DEFAULT_ALPHA = 0.9 | 14 _DEFAULT_ALPHA = 0.9 |
| 15 | 15 |
| 16 # Threshold for calling something a spike. | 16 # Threshold for calling something a spike. |
| 17 _SPIKINESS_THRESHOLD = 20 | 17 _SPIKINESS_THRESHOLD = 20 |
| 18 | 18 |
| 19 | 19 |
| 20 # TODO(wrengr): make this streaming, rather than holding onto the whole list. | 20 # TODO(wrengr): make this streaming, rather than holding onto the whole list. |
| 21 def GetSpikes(data_series, alpha=_DEFAULT_ALPHA, | 21 def GetSpikes(events, get_value, alpha=_DEFAULT_ALPHA, |
| 22 threshold=_SPIKINESS_THRESHOLD): | 22 threshold=_SPIKINESS_THRESHOLD): |
| 23 """Given a time series, determine the regression ranges for each | 23 """Given a time series, determine the regression ranges for each |
| 24 anomalous spike. | 24 anomalous spike. |
| 25 | 25 |
| 26 The time series is represented by a list of (x,y) tuples, where each | 26 The time series is represented by a list of "events" together with a |
| 27 tuple denotes an event, the position of the tuple in the list indicates | 27 function for computing the "value" of each event. The events themselves |
| 28 the relative ordering of events, x is an abstract name for the point | 28 can be any sort of object (including None), all we care about is their |
| 29 in time the event occurred, and y is the value associated with that | 29 value according to the valuation function. We assume the list is given |
| 30 event. N.B., this function doesn't care at all about what the x values | 30 with the events already in order, but otherwise we do not care about |
| 31 actually look like; we just pass them around as abstract values, | 31 their exact position within the list. We return a list of pairs of |
| 32 thus they can be whatever the caller wants. | 32 events, where the two events in each pair are adjacent in the original |
| 33 | 33 list. Whenever we determine that the value of some event deviates too |
| 34 We use exponential smoothing as our model for determining whether | 34 much from the values of previous events, we will return a pair of the |
| 35 an event is a spike or not. This model has two parameters: the alpha | 35 event which deviates too much, together with the immediately previous |
| 36 parameter determines how readily we update the running average for each | 36 event. |
| 37 new event, and the threshold parameter determines when an event is | 37 |
| 38 considered anomalous enough to be called a spike. N.B., exponential | 38 We use exponential smoothing as our model for determining when there |
| 39 smoothing is considered a naive model for analysing time series, | 39 are spikes in the events' values. This model has two parameters: |
| 40 because it cannot account for trends; but since we are only interested | 40 the alpha parameter determines how readily we update the running |
| 41 in detecting deviation from zero, this model is good enough for us. | 41 average each time we see a new event, and the threshold parameter |
| 42 determines when an event's value is considered anomalous enough to | |
| 43 be called a spike. N.B., exponential smoothing is considered a naive | |
| 44 model for analysing time series, because it cannot account for trends; | |
| 45 but since we are only interested in detecting deviation from zero, | |
| 46 this model is good enough for us. | |
| 42 | 47 |
| 43 Args: | 48 Args: |
| 44 data_series (list): A list of (x, y) tuples where y is a number. | 49 events (list): A list of objects. |
| 50 get_value (callable): A function from event objects to a number. | |
| 45 alpha (float): In (0, 1], controls how we balance between evidence | 51 alpha (float): In (0, 1], controls how we balance between evidence |
| 46 from new events vs the running average. When alpha=0 we completely | 52 from new events vs the running average. When alpha=0 we completely |
| 47 ignore the new event; when alpha=1 we completely ignore the running | 53 ignore the new event; when alpha=1 we completely ignore the running |
| 48 average. | 54 average. |
| 49 threshold (float): Threshold for when something is considered to be | 55 threshold (float): Threshold for when something is considered to be |
| 50 a spike. | 56 a spike. |
| 51 | 57 |
| 52 Returns: | 58 Returns: |
| 53 A list of (y0,y1) pairs denoting the regression range for each | 59 A list of (e0,e1) pairs denoting the regression range for each |
| 54 spike. That is, y0 is the abstract name for the point in time just | 60 spike. That is, e0 is the event just before the spike, and e1 is the |
| 55 before the spike, and y1 is the abstract name for the point in time | 61 event just after the spike. If events has fewer than 2 elements, |
| 56 just after the spike. If data_series has fewer than 2 elements, | |
| 57 then we return the empty list rather than returning None. | 62 then we return the empty list rather than returning None. |
| 58 """ | 63 """ |
| 64 # TODO(wrengr): assert callable(get_value), so we can give a decent | |
| 65 # error message if it's not. | |
| 66 | |
| 59 # TODO(wrengr): would it be better to return None so callers fail fast? | 67 # TODO(wrengr): would it be better to return None so callers fail fast? |
| 60 if not data_series or len(data_series) == 1: | 68 if not events or len(events) == 1: |
| 61 return [] | 69 return [] |
| 62 | 70 |
| 63 logging.info('For data series %s', repr(data_series)) | 71 logging.info('For data series %s', repr(events)) |
| 64 logging.info('Alpha is %f', alpha) | 72 logging.info('Alpha is %f', alpha) |
| 65 logging.info('Threshold is %f', threshold) | 73 logging.info('Threshold is %f', threshold) |
| 66 | 74 |
| 67 # TODO(wrengr): do we care enough about precision to avoid the floating | 75 # TODO(wrengr): do we care enough about precision to avoid the floating |
| 68 # point issues with computing things the naive way? | 76 # point issues with computing things the naive way? |
| 69 spikes = [] | 77 spikes = [] |
| 70 previous_x, previous_mean = data_series[0] | 78 previous_event = events[0] |
| 71 for event in data_series[1:]: | 79 previous_mean = get_value(previous_event) |
| 72 x, y = event | 80 for event in events[1:]: |
| 73 | 81 current_value = get_value(event) |
| 74 current_mean = (1 - alpha) * previous_mean + alpha * y | 82 current_mean = (1 - alpha) * previous_mean + alpha * current_value |
| 75 | 83 |
| 76 # Spikiness is basically relative increase. Add epsilon to both | 84 # Spikiness is basically relative increase. Add epsilon to both |
| 77 # numerator and denominator to avoid a division-by-zero error. | 85 # numerator and denominator to avoid a division-by-zero error. |
| 78 spikiness = ((current_mean - previous_mean + _EPSILON) / | 86 spikiness = ((current_mean - previous_mean + _EPSILON) / |
| 79 (previous_mean + _EPSILON)) | 87 (previous_mean + _EPSILON)) |
| 80 logging.info('The spikiness of data %s is %s', | 88 logging.info('The spikiness of event %s is %s', |
| 81 repr(event), spikiness) | 89 repr(event), spikiness) |
| 82 if spikiness > threshold: | 90 if spikiness > threshold: |
| 83 spikes.append((previous_x,x)) | 91 spikes.append((previous_event, event)) |
| 84 | 92 |
| 93 previous_event = event | |
| 85 previous_mean = current_mean | 94 previous_mean = current_mean |
| 86 previous_x = x | |
| 87 | |
| 88 # Because we don't know the x for the next event after the final one, | |
| 89 # we ignore spikes at the very end of the series. | |
| 90 | 95 |
| 91 return spikes | 96 return spikes |
| 92 | 97 |
| 93 | 98 |
| 94 def GetAttributesListFromHistoricData(historic_metadata, attributes): | 99 def GetAttributesListFromHistoricData(historic_metadata, attributes): |
| 95 """Returns a list of attributes from historic_metadata. | 100 """Returns a list of attributes from historic_metadata. |
| 96 | 101 |
| 97 Args: | 102 Args: |
| 98 historic_metadata (list): A list of dict of metadata, for example: | 103 historic_metadata (list): A list of dict of metadata, for example: |
| 99 [{'chrome_version': '1', 'cpm': 0}, {'chrome_version': '2', 'cpm': 0}] | 104 [{'chrome_version': '1', 'cpm': 0}, {'chrome_version': '2', 'cpm': 0}] |
| (...skipping 31 matching lines...) | |
| 131 [{'chrome_version': '1', 'cpm': 0}, {'chrome_version': '2', 'cpm': 0}] | 136 [{'chrome_version': '1', 'cpm': 0}, {'chrome_version': '2', 'cpm': 0}] |
| 132 max_win_size (int): Number of versions to look back from | 137 max_win_size (int): Number of versions to look back from |
| 133 the current version. | 138 the current version. |
| 134 | 139 |
| 135 Returns: | 140 Returns: |
| 136 A tuple, (last_good_version, first_bad_version) or None if none found. | 141 A tuple, (last_good_version, first_bad_version) or None if none found. |
| 137 """ | 142 """ |
| 138 if not historic_metadata: | 143 if not historic_metadata: |
| 139 return None | 144 return None |
| 140 | 145 |
| 146 # TODO(http://crbug.com/644406): we should be able to remove this code | |
| 147 # for converting historic_metadata to pairs now. Just need to update | |
| 148 # the unit tests to match. | |
| 149 | |
| 141 # Truncate the historic data so we only analyze data for max_win_size of | 150 # Truncate the historic data so we only analyze data for max_win_size of |
| 142 # latest versions. | 151 # latest versions. |
| 143 versions_to_cpm = GetAttributesListFromHistoricData( | 152 versions_to_cpm = GetAttributesListFromHistoricData( |
| 144 historic_metadata[-max_win_size:], ['chrome_version', 'cpm']) | 153 historic_metadata[-max_win_size:], ['chrome_version', 'cpm']) |
|
Sharu Jiang
2016/09/08 00:48:24
We don't need this anymore.
We can just use:
spik
wrengr (wrong one)
2016/09/08 18:12:13
Right. I just hadn't gotten around to deleting the
Sharu Jiang
2016/09/08 18:26:08
Acknowledged.
| |
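The patch above reworks GetSpikes into an events-plus-valuation-function API: exponential smoothing of the values, with a relative-increase ("spikiness") threshold. A minimal runnable sketch of that loop, with names lowercased here for illustration; the dict-based call at the bottom follows the TODO about dropping the pair-conversion step and is a sketch of the reviewers' suggestion, not the patch itself:

```python
# Sketch of the spike detection under review: exponential smoothing with
# a relative-increase threshold. Defaults mirror the patch; the epsilon
# guards against division by zero when the running mean is zero.
_EPSILON = 1e-8

def get_spikes(events, get_value, alpha=0.9, threshold=20):
    """Return (previous_event, event) pairs where the smoothed value spikes."""
    if len(events) < 2:
        return []
    spikes = []
    previous_event = events[0]
    previous_mean = get_value(previous_event)
    for event in events[1:]:
        # Blend the running average with the new value (exponential smoothing).
        current_mean = (1 - alpha) * previous_mean + alpha * get_value(event)
        # Spikiness is the relative increase of the smoothed value.
        spikiness = ((current_mean - previous_mean + _EPSILON) /
                     (previous_mean + _EPSILON))
        if spikiness > threshold:
            spikes.append((previous_event, event))
        previous_event = event
        previous_mean = current_mean
    return spikes

# Per the TODO, a caller could pass historic_metadata dicts directly with
# a value extractor, instead of converting to (version, cpm) pairs first.
history = [
    {'chrome_version': '1', 'cpm': 0.0},
    {'chrome_version': '2', 'cpm': 0.0},
    {'chrome_version': '3', 'cpm': 5.0},  # cpm jumps here
    {'chrome_version': '4', 'cpm': 5.1},
]
spikes = get_spikes(history, lambda d: d['cpm'])
```

With this data the jump from version 2 to 3 is the only spike, so the result is a single (last_good, first_bad) pair of event dicts.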
| 145 | 154 |
| 146 spikes = GetSpikes(versions_to_cpm) | 155 spikes = GetSpikes(versions_to_cpm, lambda x: x[1]) |
| 147 | 156 |
| 148 if not spikes: | 157 if not spikes: |
| 149 logging.warning('Failed to find spikes in history data %s' % repr( | 158 logging.warning('Failed to find spikes in history data %s' % repr( |
| 150 historic_metadata)) | 159 historic_metadata)) |
| 151 return None | 160 return None |
| 152 | 161 |
| 153 # Only return the latest regression range. | 162 # Only return the latest regression range. |
| 154 return spikes[-1] | 163 return spikes[-1][0] |
|
Sharu Jiang
2016/09/08 00:48:24
I think this should be
return spikes[-1][0]['chro
wrengr (wrong one)
2016/09/08 18:12:13
Do you mean you want DetectRegressionRange to (con
Sharu Jiang
2016/09/08 18:26:08
The final result of DetectRegressionRange should s
| |
| 155 | 164 |
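The final thread debates what DetectRegressionRange should ultimately return. A hedged sketch of one reading of that discussion, with the truncated window and last-spike selection made explicit; detect_regression_range and its (last_good_version, first_bad_version) return shape are assumptions for illustration, not necessarily what the CL landed:

```python
# Sketch: truncate history to the latest max_win_size versions, detect
# spikes in cpm via exponential smoothing, and report the most recent
# regression range as version strings.
def detect_regression_range(historic_metadata, max_win_size=30,
                            alpha=0.9, threshold=20):
    if not historic_metadata:
        return None
    # Only analyze the latest max_win_size versions.
    window = historic_metadata[-max_win_size:]
    spikes = []
    previous = window[0]
    previous_mean = previous['cpm']
    for event in window[1:]:
        current_mean = (1 - alpha) * previous_mean + alpha * event['cpm']
        # Relative increase of the smoothed mean, guarded against zero.
        spikiness = ((current_mean - previous_mean + 1e-8) /
                     (previous_mean + 1e-8))
        if spikiness > threshold:
            spikes.append((previous, event))
        previous = event
        previous_mean = current_mean
    if not spikes:
        return None
    # Only the latest regression range matters to callers.
    last_good, first_bad = spikes[-1]
    return (last_good['chrome_version'], first_bad['chrome_version'])

result = detect_regression_range([
    {'chrome_version': '52', 'cpm': 0.0},
    {'chrome_version': '53', 'cpm': 0.0},
    {'chrome_version': '54', 'cpm': 3.0},
])
```

Here the cpm jump between versions 53 and 54 yields ('53', '54'), i.e. the last good and first bad versions.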