Chromium Code Reviews

| OLD | NEW |
|---|---|
| 1 # Copyright 2016 The Chromium Authors. All rights reserved. | 1 # Copyright 2016 The Chromium Authors. All rights reserved. |
| 2 # Use of this source code is governed by a BSD-style license that can be | 2 # Use of this source code is governed by a BSD-style license that can be |
| 3 # found in the LICENSE file. | 3 # found in the LICENSE file. |
| 4 | 4 |
| 5 import logging | 5 import logging |
| 6 | 6 |
| 7 # Maximum number of versions to look back. | 7 # Default value for the maximum number of versions to look back. |
| 8 _MAXIMUM_WINDOW_SIZE = 30 | 8 _MAXIMUM_WINDOW_SIZE = 30 |
| 9 | 9 |
| 10 # Add epsilon to avoid dividing by zero when computing spike score. | 10 # Add epsilon to avoid dividing by zero when computing spikiness. |
| 11 _EPSILON = 0.00000001 | 11 _EPSILON = 0.00000001 |
| 12 | 12 |
| 13 # Default value to control weight of current data when computing spike score. | 13 # Default value to control weight of current data when computing spikiness. |
| 14 _DEFAULT_ALPHA = 0.9 | 14 _DEFAULT_ALPHA = 0.9 |
| 15 | 15 |
| 16 # Threshold of spike score. | 16 # Threshold for calling something a spike. |
| 17 _SPIKENESS_THRESHOLD = 20 | 17 _SPIKINESS_THRESHOLD = 20 |
| 18 | 18 |
| 19 | 19 |
| 20 # TODO(http://crbug.com/644406): This code needs cleaning/refactoring | 20 # TODO(wrengr): make this streaming, rather than holding onto the whole list. |
| 21 def GetSpikeIndexes(data_series, alpha=_DEFAULT_ALPHA, | 21 def GetSpikes(events, get_value, alpha=_DEFAULT_ALPHA, |
| 22 threshold=_SPIKENESS_THRESHOLD): | 22 threshold=_SPIKINESS_THRESHOLD): |
| 23 """Given a time series, determine anomolous spikes. | 23 """Given a time series, determine the regression ranges for each |
> **Sharu Jiang** (2016/09/08 21:33:14): nit: A convention, for long doc string, keep the f…

> **stgao** (2016/09/09 18:13:59): nit: one line summary here.

| OLD | NEW |
|---|---|
| | 24 anomalous spike. |
| 24 | 25 |
| 25 The time series is represented by a list of (x,y) tuples, where the | 26 The time series is represented by a list of "events" together with a |
| 26 position of the tuple in the list is the time at which the event | 27 function for computing the "value" of each event. The events themselves |
> **stgao** (2016/09/09 18:13:58): Information in this paragraph could be merged into…

| OLD | NEW |
|---|---|
| 27 occured, x is an abstract name for that time, and y is the value of | 28 can be any sort of object (including None), all we care about is their |
| 28 the time series at that point in time. We detect spikes by determining | 29 value according to the valuation function. We assume the list is given |
| 29 when the time series diverges too much from the running average (of | 30 with the events already in order, but otherwise we do not care about |
| 30 the y values). The alpha parameter determines how readily we update | 31 their exact position within the list. We return a list of pairs of |
| 31 the running average; which is to say, how much weight we give each | 32 events, where the two events in each pair are adjacent in the original |
| 32 new event compared to the weight we give to the current running | 33 list. Whenever we determine that the value of some event deviates too |
| 33 average. The threshold parameter defines when we've deviated 'too far' | 34 much from the values of previous events, we will return a pair of the |
| 34 from the running average, and therefore say that this event is in fact | 35 event which deviates too much, together with the immediately previous |
| 35 a spike. | 36 event. |
| | 37 |
| | 38 We use exponential smoothing as our model for determining when there |
| | 39 are spikes in the events' values. This model has two parameters: |
| | 40 the alpha parameter determines how readily we update the running |
| | 41 average each time we see a new event, and the threshold parameter |
| | 42 determines when an event's value is considered anomalous enough to |
| | 43 be called a spike. N.B., exponential smoothing is considered a naive |
| | 44 model for analysing time series, because it cannot account for trends; |
| | 45 but since we are only interested in detecting deviation from zero, |
| | 46 this model is good enough for us. |
| 36 | 47 |
| 37 Args: | 48 Args: |
| 38 data_series (list): A list of (x, y) tuples where y is a number. | 49 events (list): A list of objects. |
| 39 alpha (float): In (0, 1], it controls the weight of current data | 50 get_value (callable): a function from event objects to a number. |
| 40 when computing the running mean, a higher value has more weight. | 51 alpha (float): In (0, 1], controls how we balance between evidence |
| 41 threshold (float): Threshold for when something counts as a spike. | 52 from new events vs the running average. When alpha=0 we completely |
| | 53 ignore the new event; when alpha=1 we completely ignore the running |
| | 54 average. |
| | 55 threshold (float): Threshold for when something is considered to be |
| | 56 a spike. |
| 42 | 57 |
| 43 Returns: | 58 Returns: |
| 44 A list of spike indexes (i.e., positions in the data_series list; | 59 A list of (e0,e1) pairs denoting the regression range for each |
| 45 not the abstract names associated with those indices) | 60 spike. That is, e0 is the event just before the spike, and e1 is the |
| | 61 event just after the spike. If events has fewer than 2 elements, |
| | 62 then we return the empty list rather than returning None. |
| 46 """ | 63 """ |
| 47 spike_indexes = [] | 64 # TODO(wrengr): assert callable(get_value), so we can give a decent |
| | 65 # error message if it's not. |
| 48 | 66 |
| 49 if not data_series or len(data_series) == 1: | 67 # TODO(wrengr): would it be better to return None so callers fail fast? |
| 50 return spike_indexes | 68 if not events or len(events) == 1: |
| | 69 return [] |
| 51 | 70 |
| 52 logging.info('For data series %s', repr(data_series)) | 71 logging.info('For data series %s', repr(events)) |
| | 72 logging.info('Alpha is %f', alpha) |
| 53 logging.info('Threshold is %f', threshold) | 73 logging.info('Threshold is %f', threshold) |
| 54 | 74 |
| 55 previous_mean = data_series[0][1] | 75 # TODO(wrengr): do we care enough about precision to avoid the floating |
| 56 for i, (_, y) in enumerate(data_series[1:]): | 76 # point issues of computing the mean in the naive/straightforward way? |
| 57 current_mean = (1 - alpha) * previous_mean + alpha * y | 77 spikes = [] |
| | 78 previous_event = events[0] |
| | 79 previous_mean = get_value(previous_event) |
| | 80 for event in events[1:]: |
| | 81 current_value = get_value(event) |
| | 82 current_mean = (1 - alpha) * previous_mean + alpha * current_value |
| 58 | 83 |
| 59 # Spike score is basically relative increase. Add epsilon | 84 # Spikiness is basically relative increase. Add epsilon to both |
| 60 # to both numerator and denominator to avoid dividing by zero error. | 85 # numerator and denominator to avoid dividing by zero error. |
| 61 spike_score = ((current_mean - previous_mean + _EPSILON) / | 86 spikiness = ((current_mean - previous_mean + _EPSILON) / |
| 62 (previous_mean + _EPSILON)) | 87 (previous_mean + _EPSILON)) |
| 63 logging.info('The spike score for data %s is %s', | 88 logging.info('The spikiness of event %s is %s', |
| 64 repr(data_series[i]), spike_score) | 89 repr(event), spikiness) |
> **stgao** (2016/09/09 18:13:58): nit: could it fix in one line?

| OLD | NEW |
|---|---|
| 65 if spike_score > threshold: | 90 if spikiness > threshold: |
| 66 spike_indexes.append(i + 1) | 91 spikes.append((previous_event, event)) |
| 67 | 92 |
| | 93 previous_event = event |
| 68 previous_mean = current_mean | 94 previous_mean = current_mean |
| 69 | 95 |
| 70 return spike_indexes | 96 return spikes |
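As an editor's illustration of the rewritten `GetSpikes`, here is a standalone sketch of the exponential-smoothing detection described in the new docstring (not the CL's module; the lowercase name `get_spikes` and the sample series are hypothetical):

```python
def get_spikes(events, get_value, alpha=0.9, threshold=20):
    """Return (previous_event, event) pairs wherever the smoothed value
    jumps by more than `threshold` relative to the running mean."""
    epsilon = 1e-8  # avoid dividing by zero for all-zero series
    if len(events) < 2:
        return []
    spikes = []
    previous_event = events[0]
    previous_mean = get_value(previous_event)
    for event in events[1:]:
        # Exponential smoothing: alpha weights the new observation
        # against the running average.
        current_mean = (1 - alpha) * previous_mean + alpha * get_value(event)
        # Spikiness is the relative increase of the running mean.
        spikiness = ((current_mean - previous_mean + epsilon) /
                     (previous_mean + epsilon))
        if spikiness > threshold:
            spikes.append((previous_event, event))
        previous_event = event
        previous_mean = current_mean
    return spikes

# A flat series followed by a jump yields one regression range:
data = [{'v': '1', 'cpm': 0.0}, {'v': '2', 'cpm': 0.0}, {'v': '3', 'cpm': 5.0}]
print(get_spikes(data, lambda e: e['cpm']))  # one pair: the '2' and '3' events
```

A higher alpha makes the running mean track new events more aggressively (so sudden jumps register sooner), while a higher threshold demands a larger relative increase before a pair is reported.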
| 71 | |
| 72 | |
| 73 def GetRegressionRangeFromSpike(spike_index, versions): | |
| 74 """Get the regression range based on spike_index and versions.""" | |
| 75 if spike_index < 1 or spike_index >= len(versions): | |
| 76 return None | |
| 77 | |
| 78 return (versions[spike_index - 1], versions[spike_index]) | |
| 79 | |
| 80 | |
| 81 def GetAttributesListFromHistoricData(historic_metadata, attributes): | |
| 82 """Returns a list of attributes from historic_metadata. | |
| 83 | |
| 84 Args: | |
| 85 historic_metadata (list): A list of dict of metadata, for example: | |
| 86 [{'chrome_version': '1', 'cpm': 0}, {'chrome_version': '2', 'cpm': 0}] | |
| 87 attributes (list): A list of attribute names. | |
| 88 | |
| 89 Returns: | |
| 90 A list of strs(attributes has only 1 element), or tuples | |
| 91 """ | |
| 92 if not attributes: | |
| 93 return [] | |
| 94 | |
| 95 attributes_list = [] | |
| 96 | |
| 97 for data in historic_metadata: | |
| 98 attributes_entry = [] | |
| 99 | |
| 100 for attribute in attributes: | |
| 101 attributes_entry.append(data[attribute]) | |
| 102 | |
| 103 if len(attributes_entry) == 1: | |
| 104 attributes_list.append(attributes_entry[0]) | |
| 105 else: | |
| 106 attributes_list.append(tuple(attributes_entry)) | |
| 107 | |
| 108 return attributes_list | |
| 109 | 97 |
| 110 | 98 |
| 111 def DetectRegressionRange(historic_metadata, max_win_size=_MAXIMUM_WINDOW_SIZE): | 99 def DetectRegressionRange(historic_metadata, max_win_size=_MAXIMUM_WINDOW_SIZE): |
| 112 """Detect regression range from historic_metadata data. | 100 """Detect regression range from historic_metadata data. |
| 113 | 101 |
| 114 Args: | 102 Args: |
| 115 historic_metadata (list): A list of dict of metadata, the list is sorted by | 103 historic_metadata (list): A list of dict of metadata, the list is sorted by |
| 116 'chrome_version' from oldest to latest. | 104 'chrome_version' from oldest to latest. |
| 117 For example: | 105 For example: |
| 118 [{'chrome_version': '1', 'cpm': 0}, {'chrome_version': '2', 'cpm': 0}] | 106 [{'chrome_version': '1', 'cpm': 0}, {'chrome_version': '2', 'cpm': 0}] |
| 119 max_win_size (int): Number of versions to look back from | 107 max_win_size (int): Number of versions to look back from |
| 120 the current version. | 108 the current version. |
| 121 | 109 |
| 122 Returns: | 110 Returns: |
| 123 A tuple, (last_good_version, first_bad_version) or None if none found. | 111 A tuple, (last_good_version, first_bad_version) or None if none found. |
| 124 """ | 112 """ |
| 125 if not historic_metadata: | 113 if not historic_metadata: |
| 126 return None | 114 return None |
| 127 | 115 |
| 128 # Truncate the historic data so we only analyze data for max_win_size of | 116 # Truncate the historic data so we only analyze data for max_win_size of |
| 129 # latest versions. | 117 # latest versions. |
| 130 versions_to_cpm = GetAttributesListFromHistoricData( | 118 spikes = GetSpikes(historic_metadata[-max_win_size:], lambda x: x['cpm']) |
| 131 historic_metadata[-max_win_size:], ['chrome_version', 'cpm']) | |
| 132 | 119 |
| 133 versions, _ = zip(*versions_to_cpm) | 120 if not spikes: |
| 134 spike_indexes = GetSpikeIndexes(versions_to_cpm) | |
| 135 | |
| 136 if not spike_indexes: | |
| 137 logging.warning('Failed to find spikes in history data %s' % repr( | 121 logging.warning('Failed to find spikes in history data %s' % repr( |
| 138 historic_metadata)) | 122 historic_metadata)) |
| 139 return None | 123 return None |
| 140 | 124 |
| 141 # Only return the latest regression range. | 125 # Only return the last/most-recent regression range. |
| 142 return GetRegressionRangeFromSpike(spike_indexes[-1], versions) | 126 last_good, first_bad = spikes[-1] |
| | 127 return last_good['chrome_version'], first_bad['chrome_version'] |
| | 128 |
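To show the end-to-end flow of the rewritten `DetectRegressionRange` (truncate to the window, detect cpm spikes, report the last one), here is a self-contained sketch; the function and variable names are illustrative stand-ins, not the CL's code:

```python
def detect_regression_range(historic_metadata, max_win_size=30,
                            alpha=0.9, threshold=20):
    """Return (last_good_version, first_bad_version) for the most recent
    cpm spike in the last max_win_size versions, or None if none found."""
    if not historic_metadata:
        return None
    window = historic_metadata[-max_win_size:]
    epsilon = 1e-8
    spikes = []
    previous, previous_mean = window[0], window[0]['cpm']
    for entry in window[1:]:
        current_mean = (1 - alpha) * previous_mean + alpha * entry['cpm']
        spikiness = ((current_mean - previous_mean + epsilon) /
                     (previous_mean + epsilon))
        if spikiness > threshold:
            spikes.append((previous, entry))
        previous, previous_mean = entry, current_mean
    if not spikes:
        return None
    # Only the last/most-recent regression range is reported.
    last_good, first_bad = spikes[-1]
    return last_good['chrome_version'], first_bad['chrome_version']

history = [{'chrome_version': '50.0.1', 'cpm': 0.0},
           {'chrome_version': '50.0.2', 'cpm': 0.0},
           {'chrome_version': '50.0.3', 'cpm': 3.0}]
print(detect_regression_range(history))  # → ('50.0.2', '50.0.3')
```

Note the new code returns version strings directly from the spike's event dicts, so the old `GetRegressionRangeFromSpike` and `GetAttributesListFromHistoricData` helpers are no longer needed.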