Chromium Code Reviews
Diff: appengine/findit/crash/detect_regression_range.py

Issue 2325503002: Reorganizing detect_regression_range.py (Closed)
Patch Set: Addressing nits Created 4 years, 3 months ago
 # Copyright 2016 The Chromium Authors. All rights reserved.
 # Use of this source code is governed by a BSD-style license that can be
 # found in the LICENSE file.

 import logging

-# Maximum number of versions to look back.
+# Default value for the maximum number of versions to look back.
 _MAXIMUM_WINDOW_SIZE = 30

-# Add epsilon to avoid dividing by zero when computing spike score.
+# Add epsilon to avoid dividing by zero when computing spikiness.
 _EPSILON = 0.00000001

-# Default value to control weight of current data when computing spike score.
+# Default value to control weight of current data when computing spikiness.
 _DEFAULT_ALPHA = 0.9

-# Threshold of spike score.
-_SPIKENESS_THRESHOLD = 20
+# Threshold for calling something a spike.
+_SPIKINESS_THRESHOLD = 20


-# TODO(http://crbug.com/644406): This code needs cleaning/refactoring
-def GetSpikeIndexes(data_series, alpha=_DEFAULT_ALPHA,
-                    threshold=_SPIKENESS_THRESHOLD):
-  """Given a time series, determine anomolous spikes.
-
-  The time series is represented by a list of (x,y) tuples, where the
-  position of the tuple in the list is the time at which the event
-  occured, x is an abstract name for that time, and y is the value of
-  the time series at that point in time. We detect spikes by determining
-  when the time series diverges too much from the running average (of
-  the y values). The alpha parameter determines how readily we update
-  the running average; which is to say, how much weight we give each
-  new event compared to the weight we give to the current running
-  average. The threshold parameter defines when we've deviated 'too far'
-  from the running average, and therefore say that this event is in fact
-  a spike.
-
-  Args:
-    data_series (list): A list of (x, y) tuples where y is a number.
-    alpha (float): In (0, 1], it controls the weight of current data
-      when computing the running mean, a higher value has more weight.
-    threshold (float): Threshold for when something counts as a spike.
-
-  Returns:
-    A list of spike indexes (i.e., positions in the data_series list;
-    not the abstract names associated with those indices)
-  """
+# TODO(wrengr): make this streaming, rather than holding onto the whole list.
+def GetSpikes(events, get_value, alpha=_DEFAULT_ALPHA,
+              threshold=_SPIKINESS_THRESHOLD):
+  """Given a time series, detect regression ranges for anomalous spikes.
+
+  The time series is represented by a list of "events" together with
+  a function for computing the "value" of each event. We assume the
+  events are given in order, and the only thing we care about them is
+  their value. As we scan through the list, if we notice any "spikes" in
+  the course of values (i.e., the current value seems anomalous compared
+  to the values we've seen previously), then we produce a tuple of the
+  events bracketing the spike. Since there can be many spikes, we return
+  a list of these tuples.
+
+  The model we use for detecting spikes is exponential smoothing. This
+  model is based on the running average of the events' values, and it has
+  two parameters: the alpha parameter determines how readily we update
+  the running average each time we see a new event, and the threshold
+  parameter determines when an event's value deviates far enough from the
+  running average that we call it a spike. N.B., in time series analysis
+  more generally, exponential smoothing is considered a naive model since
+  it cannot account for things like scaling (i.e., if we multiply all
+  the values by some constant, then we'll need to adjust the threshold
+  by that constant too) and noise (i.e., if we add noise to the values,
+  then we'll need to adjust the threshold to try and filter that noise
+  out). However, based on some preliminary tests, this naive model seems
+  to be good enough for our particular task.
+
+  Args:
+    events (list): A list of objects representing "events" in a time
+      series. The events themselves can be any sort of object (including
+      None), all we care about is their value according to get_value. We
+      assume the events are already in order, but we do not care about
+      their exact position within the list.
+    get_value (callable): a valuation function mapping events to numbers.
+    alpha (float): In (0, 1], controls how we balance between evidence
+      from new events vs the running average. When alpha=0 we completely
+      ignore the new event; when alpha=1 we completely ignore the running
+      average.
+    threshold (float): How far away from the running average something
+      can be before we consider it a spike.
+
+  Returns:
+    A list of event pairs, each of which denotes the regression range
+    for a spike. That is, for each pair, the first component is the
+    event just before the spike, and the second component is the event
+    just after the spike. Thus, the pairs are always of events which
+    were adjacent in the event list. If the event list has fewer than
+    two elements, then we return the empty list (rather than returning
+    None or throwing an exception).
+  """
-  spike_indexes = []
+  # TODO(wrengr): assert callable(get_value), so we can give a decent
+  # error message if it's not.

-  if not data_series or len(data_series) == 1:
-    return spike_indexes
+  # TODO(wrengr): would it be better to return None so callers fail fast?
+  if not events or len(events) == 1:
+    return []

-  logging.info('For data series %s', repr(data_series))
+  logging.info('For data series %s', repr(events))
+  logging.info('Alpha is %f', alpha)
   logging.info('Threshold is %f', threshold)

-  previous_mean = data_series[0][1]
-  for i, (_, y) in enumerate(data_series[1:]):
-    current_mean = (1 - alpha) * previous_mean + alpha * y
+  # TODO(wrengr): do we care enough about precision to avoid the floating
+  # point issues of computing the mean in the naive/straightforward way?
+  spikes = []
+  previous_event = events[0]
+  previous_mean = get_value(previous_event)
+  for event in events[1:]:
+    current_value = get_value(event)
+    current_mean = (1 - alpha) * previous_mean + alpha * current_value

-    # Spike score is basically relative increase. Add epsilon
-    # to both numerator and denominator to avoid dividing by zero error.
-    spike_score = ((current_mean - previous_mean + _EPSILON) /
-                   (previous_mean + _EPSILON))
-    logging.info('The spike score for data %s is %s',
-                 repr(data_series[i]), spike_score)
-    if spike_score > threshold:
-      spike_indexes.append(i + 1)
+    # Spikiness is basically relative increase. Add epsilon to both
+    # numerator and denominator to avoid dividing by zero error.
+    spikiness = ((current_mean - previous_mean + _EPSILON) /
+                 (previous_mean + _EPSILON))
+    logging.info('The spikiness of event %s is %s', repr(event), spikiness)
+    if spikiness > threshold:
+      spikes.append((previous_event, event))

+    previous_event = event
     previous_mean = current_mean

-  return spike_indexes
+  return spikes
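As a sanity check on the parameters in this patch, the running-mean update and spikiness formula from GetSpikes can be worked through by hand. This is a standalone sketch, not part of the patch; the input numbers are made up for illustration:

```python
_EPSILON = 0.00000001  # same guard value as the module constant

def update_mean(previous_mean, value, alpha):
  # Exponential smoothing: blend the new value into the running average.
  return (1 - alpha) * previous_mean + alpha * value

def spikiness(previous_mean, current_mean):
  # Relative increase of the running average, guarded against
  # division by zero exactly as in GetSpikes.
  return (current_mean - previous_mean + _EPSILON) / (previous_mean + _EPSILON)

# With alpha=0.9 the newest value dominates, so a 100x jump in the
# observed value moves the mean almost all the way in a single step.
m0 = 0.01
m1 = update_mean(m0, 1.0, alpha=0.9)
print(round(m1, 3))            # 0.901
print(spikiness(m0, m1) > 20)  # True: roughly an 89x relative increase
```

This also shows why the docstring calls the model naive about scaling: multiplying every value by a constant changes the spikiness only through the ratio, but shifting the baseline near zero makes even small jumps exceed the threshold.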
-
-
-def GetRegressionRangeFromSpike(spike_index, versions):
-  """Get the regression range based on spike_index and versions."""
-  if spike_index < 1 or spike_index >= len(versions):
-    return None
-
-  return (versions[spike_index - 1], versions[spike_index])
-
-
-def GetAttributesListFromHistoricData(historic_metadata, attributes):
-  """Returns a list of attributes from historic_metadata.
-
-  Args:
-    historic_metadata (list): A list of dict of metadata, for example:
-      [{'chrome_version': '1', 'cpm': 0}, {'chrome_version': '2', 'cpm': 0}]
-    attributes (list): A list of attribute names.
-
-  Returns:
-    A list of strs(attributes has only 1 element), or tuples
-  """
-  if not attributes:
-    return []
-
-  attributes_list = []
-
-  for data in historic_metadata:
-    attributes_entry = []
-
-    for attribute in attributes:
-      attributes_entry.append(data[attribute])
-
-    if len(attributes_entry) == 1:
-      attributes_list.append(attributes_entry[0])
-    else:
-      attributes_list.append(tuple(attributes_entry))
-
-  return attributes_list


 def DetectRegressionRange(historic_metadata, max_win_size=_MAXIMUM_WINDOW_SIZE):
   """Detect regression range from historic_metadata data.

   Args:
-    historic_metadata (list): A list of dict of metadata, the list is sorted by
-      'chrome_version' from oldest to latest.
-      For example:
+    historic_metadata (list): A list of dict of metadata, the list is
+      sorted by 'chrome_version' from oldest to latest. For example:
       [{'chrome_version': '1', 'cpm': 0}, {'chrome_version': '2', 'cpm': 0}]
-    max_win_size (int): Number of versions to look back from
-      the currect version.
+    max_win_size (int): Number of versions to look back from the current
+      version.

   Returns:
     A tuple, (last_good_version, first_bad_version) or None if none found.
   """
   if not historic_metadata:
     return None

   # Truncate the historic data so we only analyze data for max_win_size of
   # latest versions.
-  versions_to_cpm = GetAttributesListFromHistoricData(
-      historic_metadata[-max_win_size:], ['chrome_version', 'cpm'])
+  spikes = GetSpikes(historic_metadata[-max_win_size:], lambda x: x['cpm'])

-  versions, _ = zip(*versions_to_cpm)
-  spike_indexes = GetSpikeIndexes(versions_to_cpm)
-
-  if not spike_indexes:
+  if not spikes:
     logging.warning('Failed to find spikes in history data %s' % repr(
         historic_metadata))
     return None

-  # Only return the latest regression range.
-  return GetRegressionRangeFromSpike(spike_indexes[-1], versions)
+  # Only return the last/most-recent regression range.
+  last_good, first_bad = spikes[-1]
+  return last_good['chrome_version'], first_bad['chrome_version']
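To see the whole pipeline end to end, here is a condensed, standalone copy of the new GetSpikes and DetectRegressionRange (logging and TODO comments elided so it runs on its own; the sample history is made up) showing the regression range that gets returned:

```python
_EPSILON = 0.00000001
_DEFAULT_ALPHA = 0.9
_SPIKINESS_THRESHOLD = 20
_MAXIMUM_WINDOW_SIZE = 30

def GetSpikes(events, get_value, alpha=_DEFAULT_ALPHA,
              threshold=_SPIKINESS_THRESHOLD):
  # Condensed copy of the reviewed GetSpikes, minus logging.
  if not events or len(events) == 1:
    return []
  spikes = []
  previous_event = events[0]
  previous_mean = get_value(previous_event)
  for event in events[1:]:
    current_value = get_value(event)
    current_mean = (1 - alpha) * previous_mean + alpha * current_value
    spikiness = ((current_mean - previous_mean + _EPSILON) /
                 (previous_mean + _EPSILON))
    if spikiness > threshold:
      spikes.append((previous_event, event))
    previous_event = event
    previous_mean = current_mean
  return spikes

def DetectRegressionRange(historic_metadata, max_win_size=_MAXIMUM_WINDOW_SIZE):
  # Condensed copy of the reviewed DetectRegressionRange, minus logging.
  if not historic_metadata:
    return None
  spikes = GetSpikes(historic_metadata[-max_win_size:], lambda x: x['cpm'])
  if not spikes:
    return None
  last_good, first_bad = spikes[-1]
  return last_good['chrome_version'], first_bad['chrome_version']

# The cpm value is flat through version '5', then jumps at '6':
# the detected regression range brackets the jump.
history = [{'chrome_version': str(v), 'cpm': 0.01} for v in range(1, 6)]
history += [{'chrome_version': '6', 'cpm': 5.0},
            {'chrome_version': '7', 'cpm': 5.0}]
print(DetectRegressionRange(history))  # ('5', '6')
```

Note that only the last spike's range is returned; if the series spiked twice inside the window, the earlier range would be discarded.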
