| OLD | NEW |
| 1 # Copyright 2016 The Chromium Authors. All rights reserved. | 1 # Copyright 2016 The Chromium Authors. All rights reserved. |
| 2 # Use of this source code is governed by a BSD-style license that can be | 2 # Use of this source code is governed by a BSD-style license that can be |
| 3 # found in the LICENSE file. | 3 # found in the LICENSE file. |
| 4 | 4 |
| 5 import logging | 5 import logging |
| 6 | 6 |
| 7 # Default value for the maximum number of versions to look back. | 7 # Default value for the maximum number of versions to look back. |
| 8 _MAXIMUM_WINDOW_SIZE = 30 | 8 _MAXIMUM_WINDOW_SIZE = 30 |
| 9 | 9 |
| 10 # Add epsilon to avoid dividing by zero when computing spikiness. | 10 # Add epsilon to avoid dividing by zero when computing spikiness. |
| 11 _EPSILON = 0.00000001 | 11 _EPSILON = 0.00000001 |
| 12 | 12 |
| 13 # Default value to control weight of current data when computing spikiness. | 13 # Default value to control weight of current data when computing spikiness. |
| 14 _DEFAULT_ALPHA = 0.9 | 14 _DEFAULT_ALPHA = 0.9 |
| 15 | 15 |
| 16 # Threshold for calling something a spike. | 16 # Threshold for calling something a spike. |
| 17 _SPIKINESS_THRESHOLD = 20 | 17 _SPIKINESS_THRESHOLD = 20 |
| 18 | 18 |
| 19 | 19 |
| 20 # TODO(wrengr): make this streaming, rather than holding onto the whole list. | 20 # TODO(wrengr): make this streaming, rather than holding onto the whole list. |
| 21 def GetSpikes(events, get_value, alpha=_DEFAULT_ALPHA, | 21 def GetSpikes(events, get_value, alpha=_DEFAULT_ALPHA, |
| 22 threshold=_SPIKINESS_THRESHOLD): | 22 threshold=_SPIKINESS_THRESHOLD): |
| 23 """Given a time series, detect regression ranges for anomalous spikes. | 23 """Given a time series, detect regression ranges for anomalous spikes. |
| 24 | 24 |
| 25 The time series is represented by a list of "events" together with | 25 The time series is represented by a list of "events" together with |
| 26 a function for computing the "value" of each event. We assume the | 26 a function for computing the "value" of each event. We assume the |
| 27 events are given in order, and the only thing we care about them is | 27 events are given in order, and the only thing we care about them is |
| 28 their value. As we scan through the list, if we notice any "spikes" in | 28 their value. As we scan through the list, if we notice any "spikes" in |
| 29 the course of values (i.e., the current value seems anomalous compared | 29 the course of values (i.e., the current value seems anomalous compared |
| 30 to the values we've seen previously), then we produce a tuple of the | 30 to the values we've seen previously), then we produce a tuple of the |
| 31 events bracketing the spike. Since there can be many spikes, we return | 31 events bracketing the spike. Since there can be many spikes, we return |
| 32 a list of these tuples. | 32 a list of these tuples. |
| 33 | 33 |
| 34 The model we use for detecting spikes is exponential smoothing. This | 34 The model we use for detecting spikes is exponential smoothing. This |
| 35 model is based on the running average of the events' values, and it has | 35 model is based on the running average of the events' values, and it has |
| 36 two parameters: the alpha parameter determines how readily we update | 36 two parameters: the alpha parameter determines how readily we update |
| 37 the running average each time we see a new event, and the threshold | 37 the running average each time we see a new event, and the threshold |
| 38 parameter determines when an event's value deviates far enough from the | 38 parameter determines when an event's value deviates far enough from the |
| 39 running average that we call it a spike. N.B., in time series analysis | 39 running average that we call it a spike. N.B., in time series analysis |
| 40 more generally, exponential smoothing is considered a naive model since | 40 more generally, exponential smoothing is considered a naive model since |
| 41 it cannot account for things like scaling (i.e., if we multiply all | 41 it cannot account for things like scaling (i.e., if we multiply all |
| 42 the values by some constant, then we'll need to adjust the threshold | 42 the values by some constant, then we'll need to adjust the threshold |
| 43 by that constant too) and noise (i.e., if we add noise to the values, | 43 by that constant too) and noise (i.e., if we add noise to the values, |
| 44 then we'll need to adjust the threshold to try and filter that noise | 44 then we'll need to adjust the threshold to try and filter that noise |
| 45 out). However, based on some preliminary tests, this naive model seems | 45 out). However, based on some preliminary tests, this naive model seems |
| 46 to be good enough for our particular task. | 46 to be good enough for our particular task. |
| 47 | 47 |
| 48 Args: | 48 Args: |
| 49 events (list): A list of objects representing "events" in a time | 49 events (list): A list of objects representing "events" in a time |
| 50 series. The events themselves can be any sort of object (including | 50 series. The events themselves can be any sort of object (including |
| 51 None), all we care about is their value according to get_value. We | 51 None), all we care about is their value according to get_value. We |
| 52 assume the events are already in order, but we do not care about | 52 assume the events are already in order, but we do not care about |
| 53 their exact position within the list. | 53 their exact position within the list. |
| 54 get_value (callable): a valuation function mapping events to numbers. | 54 get_value (callable): a valuation function mapping events to numbers. |
| 55 alpha (float): In (0, 1], controls how we balance between evidence | 55 alpha (float): In (0, 1], controls how we balance between evidence |
| 56 from new events vs the running average. When alpha=0 we completely | 56 from new events vs the running average. When alpha=0 we completely |
| 57 ignore the new event; when alpha=1 we completely ignore the running | 57 ignore the new event; when alpha=1 we completely ignore the running |
| (...skipping 66 matching lines...) Expand 10 before | Expand all | Expand 10 after Loading... |
| 124 spikes = GetSpikes(historic_metadata[-max_win_size:], lambda x: x['cpm']) | 124 spikes = GetSpikes(historic_metadata[-max_win_size:], lambda x: x['cpm']) |
| 125 | 125 |
| 126 if not spikes: | 126 if not spikes: |
| 127 logging.warning('Failed to find spikes in history data %s' % repr( | 127 logging.warning('Failed to find spikes in history data %s' % repr( |
| 128 historic_metadata)) | 128 historic_metadata)) |
| 129 return None | 129 return None |
| 130 | 130 |
| 131 # Only return the last/most-recent regression range. | 131 # Only return the last/most-recent regression range. |
| 132 last_good, first_bad = spikes[-1] | 132 last_good, first_bad = spikes[-1] |
| 133 return last_good['chrome_version'], first_bad['chrome_version'] | 133 return last_good['chrome_version'], first_bad['chrome_version'] |
| 134 | |
| OLD | NEW |