Chromium Code Reviews

| OLD | NEW |
|---|---|
| 1 # Copyright 2016 The Chromium Authors. All rights reserved. | 1 # Copyright 2016 The Chromium Authors. All rights reserved. |
| 2 # Use of this source code is governed by a BSD-style license that can be | 2 # Use of this source code is governed by a BSD-style license that can be |
| 3 # found in the LICENSE file. | 3 # found in the LICENSE file. |
| 4 | 4 |
| 5 import logging | 5 import logging |
| 6 | 6 |
| 7 # Default value for the maximum number of versions to look back. | 7 # Default value for the maximum number of versions to look back. |
| 8 _MAXIMUM_WINDOW_SIZE = 30 | 8 _MAXIMUM_WINDOW_SIZE = 30 |
| 9 | 9 |
| 10 # Add epsilon to avoid dividing by zero when computing spikiness. | 10 # Add epsilon to avoid dividing by zero when computing spikiness. |
| 11 _EPSILON = 0.00000001 | 11 _EPSILON = 0.00000001 |
| 12 | 12 |
| 13 # Default value to control weight of current data when computing spikiness. | 13 # Default value to control weight of current data when computing spikiness. |
| 14 _DEFAULT_ALPHA = 0.9 | 14 _DEFAULT_ALPHA = 0.9 |
| 15 | 15 |
| 16 # Threshold for calling something a spike. | 16 # Threshold for calling something a spike. |
| 17 _SPIKINESS_THRESHOLD = 20 | 17 _SPIKINESS_THRESHOLD = 20 |
| 18 | 18 |
| 19 | 19 |
| 20 # TODO(wrengr): make this streaming, rather than holding onto the whole list. | 20 # TODO(wrengr): make this streaming, rather than holding onto the whole list. |
| 21 def GetSpikes(data_series, alpha=_DEFAULT_ALPHA, | 21 def GetSpikes(events, get_value, alpha=_DEFAULT_ALPHA, |
| 22 threshold=_SPIKINESS_THRESHOLD): | 22 threshold=_SPIKINESS_THRESHOLD): |
| 23 """Given a time series, determine the regression ranges for each | 23 """Given a time series, determine the regression ranges for each |
| 24 anomalous spike. | 24 anomalous spike. |
| 25 | 25 |
| 26 The time series is represented by a list of (x,y) tuples, where each | 26 The time series is represented by a list of "events" together with a |
| 27 tuple denotes an event, the position of the tuple in the list indicates | 27 function for computing the "value" of each event. The events themselves |
| 28 the relative ordering of events, x is an abstract name for the point | 28 can be any sort of object (including None), all we care about is their |
| 29 in time the event occurred, and y is the value associated with that | 29 value according to the valuation function. We assume the list is given |
| 30 event. N.B., this function doesn't care at all about what the x values | 30 with the events already in order, but otherwise we do not care about |
| 31 actually look like; we just pass them around as abstract values, | 31 their exact position within the list. We return a list of pairs of |
| 32 thus they can be whatever the caller wants. | 32 events, where the two events in each pair are adjacent in the original |
| 33 | 33 list. Whenever we determine that the value of some event deviates too |
| 34 We use exponential smoothing as our model for determining whether | 34 much from the values of previous events, we will return a pair of the |
| 35 an event is a spike or not. This model has two parameters: the alpha | 35 event which deviates too much, together with the immediately previous |
| 36 parameter determines how readily we update the running average for each | 36 event. |
| 37 new event, and the threshold parameter determines when an event is | 37 |
| 38 considered anomalous enough to be called a spike. N.B., exponential | 38 We use exponential smoothing as our model for determining when there |
| 39 smoothing is considered a naive model for analysing time series, | 39 are spikes in the events' values. This model has two parameters: |
| 40 because it cannot account for trends; but since we are only interested | 40 the alpha parameter determines how readily we update the running |
| 41 in detecting deviation from zero, this model is good enough for us. | 41 average each time we see a new event, and the threshold parameter |
| 42 determines when an event's value is considered anomalous enough to | |
| 43 be called a spike. N.B., exponential smoothing is considered a naive | |
| 44 model for analysing time series, because it cannot account for trends; | |
| 45 but since we are only interested in detecting deviation from zero, | |
| 46 this model is good enough for us. | |
| 42 | 47 |
| 43 Args: | 48 Args: |
| 44 data_series (list): A list of (x, y) tuples where y is a number. | 49 events (list): A list of objects. |
| 50 get_value (callable): A function from event objects to a number. | |
| 45 alpha (float): In (0, 1], controls how we balance between evidence | 51 alpha (float): In (0, 1], controls how we balance between evidence |
| 46 from new events vs the running average. When alpha=0 we completely | 52 from new events vs the running average. When alpha=0 we completely |
| 47 ignore the new event; when alpha=1 we completely ignore the running | 53 ignore the new event; when alpha=1 we completely ignore the running |
| 48 average. | 54 average. |
| 49 threshold (float): Threshold for when something is considered to be | 55 threshold (float): Threshold for when something is considered to be |
| 50 a spike. | 56 a spike. |
| 51 | 57 |
| 52 Returns: | 58 Returns: |
| 53 A list of (y0,y1) pairs denoting the regression range for each | 59 A list of (e0,e1) pairs denoting the regression range for each |
| 54 spike. That is, y0 is the abstract name for the point in time just | 60 spike. That is, e0 is the event just before the spike, and e1 is the |
| 55 before the spike, and y1 is the abstract name for the point in time | 61 event just after the spike. If events has fewer than 2 elements, |
| 56 just after the spike. If data_series has fewer than 2 elements, | |
| 57 then we return the empty list rather than returning None. | 62 then we return the empty list rather than returning None. |
| 58 """ | 63 """ |
| 64 # TODO(wrengr): assert callable(get_value), so we can give a decent | |
| 65 # error message if it's not. | |
| 66 | |
| 59 # TODO(wrengr): would it be better to return None so callers fail fast? | 67 # TODO(wrengr): would it be better to return None so callers fail fast? |
| 60 if not data_series or len(data_series) == 1: | 68 if not events or len(events) == 1: |
| 61 return [] | 69 return [] |
| 62 | 70 |
| 63 logging.info('For data series %s', repr(data_series)) | 71 logging.info('For data series %s', repr(events)) |
| 64 logging.info('Alpha is %f', alpha) | 72 logging.info('Alpha is %f', alpha) |
| 65 logging.info('Threshold is %f', threshold) | 73 logging.info('Threshold is %f', threshold) |
| 66 | 74 |
| 67 # TODO(wrengr): do we care enough about precision to avoid the floating | 75 # TODO(wrengr): do we care enough about precision to avoid the floating |
| 68 # point issues with computing things the naive way? | 76 # point issues with computing things the naive way? |
| 69 spikes = [] | 77 spikes = [] |
| 70 previous_x, previous_mean = data_series[0] | 78 previous_event = events[0] |
| 71 for event in data_series[1:]: | 79 previous_mean = get_value(previous_event) |
| 72 x, y = event | 80 for event in events[1:]: |
| 73 | 81 current_value = get_value(event) |
| 74 current_mean = (1 - alpha) * previous_mean + alpha * y | 82 current_mean = (1 - alpha) * previous_mean + alpha * current_value |
| 75 | 83 |
| 76 # Spikiness is basically relative increase. Add epsilon to both | 84 # Spikiness is basically relative increase. Add epsilon to both |
| 77 # numerator and denominator to avoid a division-by-zero error. | 85 # numerator and denominator to avoid a division-by-zero error. |
| 78 spikiness = ((current_mean - previous_mean + _EPSILON) / | 86 spikiness = ((current_mean - previous_mean + _EPSILON) / |
| 79 (previous_mean + _EPSILON)) | 87 (previous_mean + _EPSILON)) |
| 80 logging.info('The spikiness of data %s is %s', | 88 logging.info('The spikiness of event %s is %s', |
| 81 repr(event), spikiness) | 89 repr(event), spikiness) |
| 82 if spikiness > threshold: | 90 if spikiness > threshold: |
| 83 spikes.append((previous_x,x)) | 91 spikes.append((previous_event, event)) |
| 84 | 92 |
| 93 previous_event = event | |
| 85 previous_mean = current_mean | 94 previous_mean = current_mean |
| 86 previous_x = x | |
| 87 | |
| 88 # Because we don't know the x for the next event after the final one, | |
| 89 # we ignore spikes at the very end of the series. | |
| 90 | 95 |
| 91 return spikes | 96 return spikes |
| 92 | 97 |
| 93 | 98 |
| 94 def GetAttributesListFromHistoricData(historic_metadata, attributes): | 99 def GetAttributesListFromHistoricData(historic_metadata, attributes): |
| 95 """Returns a list of attributes from historic_metadata. | 100 """Returns a list of attributes from historic_metadata. |
| 96 | 101 |
| 97 Args: | 102 Args: |
| 98 historic_metadata (list): A list of dict of metadata, for example: | 103 historic_metadata (list): A list of dict of metadata, for example: |
| 99 [{'chrome_version': '1', 'cpm': 0}, {'chrome_version': '2', 'cpm': 0}] | 104 [{'chrome_version': '1', 'cpm': 0}, {'chrome_version': '2', 'cpm': 0}] |
| (...skipping 31 matching lines...) | |
| 131 [{'chrome_version': '1', 'cpm': 0}, {'chrome_version': '2', 'cpm': 0}] | 136 [{'chrome_version': '1', 'cpm': 0}, {'chrome_version': '2', 'cpm': 0}] |
| 132 max_win_size (int): Number of versions to look back from | 137 max_win_size (int): Number of versions to look back from |
| 133 the current version. | 138 the current version. |
| 134 | 139 |
| 135 Returns: | 140 Returns: |
| 136 A tuple, (last_good_version, first_bad_version) or None if none found. | 141 A tuple, (last_good_version, first_bad_version) or None if none found. |
| 137 """ | 142 """ |
| 138 if not historic_metadata: | 143 if not historic_metadata: |
| 139 return None | 144 return None |
| 140 | 145 |
| 146 # TODO(http://crbug.com/644406): we should be able to remove this code | |
| 147 # for converting historic_metadata to pairs now. Just need to update | |
| 148 # the unit tests to match. | |
| 149 | |
| 141 # Truncate the historic data so we only analyze data for max_win_size of | 150 # Truncate the historic data so we only analyze data for max_win_size of |
| 142 # latest versions. | 151 # latest versions. |
| 143 versions_to_cpm = GetAttributesListFromHistoricData( | 152 versions_to_cpm = GetAttributesListFromHistoricData( |
| 144 historic_metadata[-max_win_size:], ['chrome_version', 'cpm']) | 153 historic_metadata[-max_win_size:], ['chrome_version', 'cpm']) |
|
Sharu Jiang
2016/09/08 00:48:24
We don't need this anymore.
We can just use:
spik
wrengr (wrong one)
2016/09/08 18:12:13
Right. I just hadn't gotten around to deleting the
Sharu Jiang
2016/09/08 18:26:08
Acknowledged.
| |
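The patch above reworks GetSpikes into an events-plus-valuation-function API: exponential smoothing of the values, with a relative-increase ("spikiness") threshold. A minimal runnable sketch of that loop, with names lowercased here for illustration; the dict-based call at the bottom follows the TODO about dropping the pair-conversion step and is a sketch of the reviewers' suggestion, not the patch itself:

```python
# Sketch of the spike detection under review: exponential smoothing with
# a relative-increase threshold. Defaults mirror the patch; the epsilon
# guards against division by zero when the running mean is zero.
_EPSILON = 1e-8

def get_spikes(events, get_value, alpha=0.9, threshold=20):
    """Return (previous_event, event) pairs where the smoothed value spikes."""
    if len(events) < 2:
        return []
    spikes = []
    previous_event = events[0]
    previous_mean = get_value(previous_event)
    for event in events[1:]:
        # Blend the running average with the new value (exponential smoothing).
        current_mean = (1 - alpha) * previous_mean + alpha * get_value(event)
        # Spikiness is the relative increase of the smoothed value.
        spikiness = ((current_mean - previous_mean + _EPSILON) /
                     (previous_mean + _EPSILON))
        if spikiness > threshold:
            spikes.append((previous_event, event))
        previous_event = event
        previous_mean = current_mean
    return spikes

# Per the TODO, a caller could pass historic_metadata dicts directly with
# a value extractor, instead of converting to (version, cpm) pairs first.
history = [
    {'chrome_version': '1', 'cpm': 0.0},
    {'chrome_version': '2', 'cpm': 0.0},
    {'chrome_version': '3', 'cpm': 5.0},  # cpm jumps here
    {'chrome_version': '4', 'cpm': 5.1},
]
spikes = get_spikes(history, lambda d: d['cpm'])
```

With this data the jump from version 2 to 3 is the only spike, so the result is a single (last_good, first_bad) pair of event dicts.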
| 145 | 154 |
| 146 spikes = GetSpikes(versions_to_cpm) | 155 spikes = GetSpikes(versions_to_cpm, lambda x: x[1]) |
| 147 | 156 |
| 148 if not spikes: | 157 if not spikes: |
| 149 logging.warning('Failed to find spikes in history data %s' % repr( | 158 logging.warning('Failed to find spikes in history data %s' % repr( |
| 150 historic_metadata)) | 159 historic_metadata)) |
| 151 return None | 160 return None |
| 152 | 161 |
| 153 # Only return the latest regression range. | 162 # Only return the latest regression range. |
| 154 return spikes[-1] | 163 return spikes[-1][0] |
|
Sharu Jiang
2016/09/08 00:48:24
I think this should be
return spikes[-1][0]['chro
wrengr (wrong one)
2016/09/08 18:12:13
Do you mean you want DetectRegressionRange to (con
Sharu Jiang
2016/09/08 18:26:08
The final result of DetectRegressionRange should s
| |
| 155 | 164 |
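The final thread debates what DetectRegressionRange should ultimately return. A hedged sketch of one reading of that discussion, with the truncated window and last-spike selection made explicit; detect_regression_range and its (last_good_version, first_bad_version) return shape are assumptions for illustration, not necessarily what the CL landed:

```python
# Sketch: truncate history to the latest max_win_size versions, detect
# spikes in cpm via exponential smoothing, and report the most recent
# regression range as version strings.
def detect_regression_range(historic_metadata, max_win_size=30,
                            alpha=0.9, threshold=20):
    if not historic_metadata:
        return None
    # Only analyze the latest max_win_size versions.
    window = historic_metadata[-max_win_size:]
    spikes = []
    previous = window[0]
    previous_mean = previous['cpm']
    for event in window[1:]:
        current_mean = (1 - alpha) * previous_mean + alpha * event['cpm']
        # Relative increase of the smoothed mean, guarded against zero.
        spikiness = ((current_mean - previous_mean + 1e-8) /
                     (previous_mean + 1e-8))
        if spikiness > threshold:
            spikes.append((previous, event))
        previous = event
        previous_mean = current_mean
    if not spikes:
        return None
    # Only the latest regression range matters to callers.
    last_good, first_bad = spikes[-1]
    return (last_good['chrome_version'], first_bad['chrome_version'])

result = detect_regression_range([
    {'chrome_version': '52', 'cpm': 0.0},
    {'chrome_version': '53', 'cpm': 0.0},
    {'chrome_version': '54', 'cpm': 3.0},
])
```

Here the cpm jump between versions 53 and 54 yields ('53', '54'), i.e. the last good and first bad versions.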