Chromium Code Reviews

Side by Side Diff: appengine/findit/crash/loglinear/model.py

Issue 2617273002: [Predator] Move ``SingleFeatureScore`` to LLM. (Closed)
Patch Set: Address comments. Created 3 years, 11 months ago
# Copyright 2016 The Chromium Authors. All rights reserved.
# Use of this source code is governed by a BSD-style license that can be
# found in the LICENSE file.

# TODO(http://crbug.com/669639): there are lots of ways to make the code
# in this file better. We avoid having separate todos per task; instead
# see that meta-issue ticket.

+import logging
import math
import numpy as np
# N.B., ``np.array`` can't take generators; you must pass explicit lists.

from libs.math.functions import MemoizedFunction
from libs.math.logarithms import logsumexp
from libs.math.vectors import vsum

EPSILON = 0.00001

-def ToFeatureFunction(fs):
-  """Given an array of scalar-valued functions, return an array-valued function.
-
-  Args:
-    fs (iterable): A collection of curried functions ``X -> Y -> A``.
-      That is, given a particular ``x`` they return a function ``Y -> A``.
-
-  Returns:
-    A function ``X -> Y -> list(A)`` where for all ``x``, ``y``, and
-    ``i`` we have that ``ToFeatureFunction(fs)(x)(y)[i] == fs[i](x)(y)``.
-  """
-  def _FeatureFunction(x):
-    fxs = [f(x) for f in fs]
-    return lambda y: [fx(y) for fx in fxs]
-
-  return _FeatureFunction
-
-
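For reference, a minimal standalone sketch (Python 3; toy ``distance`` and ``product`` features are invented here) of what the removed ``ToFeatureFunction`` did: it bundled several curried feature functions ``X -> Y -> A`` into one function returning the list of all feature values, applying the first argument only once.

```python
def to_feature_function(fs):
    """Turn a list of curried X -> Y -> A functions into one X -> Y -> list(A)."""
    def feature_function(x):
        fxs = [f(x) for f in fs]  # apply the shared first argument once
        return lambda y: [fx(y) for fx in fxs]
    return feature_function

# Two toy curried features on numbers (hypothetical, for illustration only).
distance = lambda x: lambda y: abs(x - y)
product = lambda x: lambda y: x * y

f = to_feature_function([distance, product])
print(f(3)(5))  # [2, 15] == [distance(3)(5), product(3)(5)]
```

The CL replaces this positional list with a name-keyed dict of ``FeatureValue``s, which is why the helper is no longer needed.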
class UnnormalizedLogLinearModel(object):
  """An unnormalized loglinear model.

  We use this model for combining a bunch of different features in order
  to decide who to blame. In particular, the nice thing about loglinear
  models is that (a) they allow separating the amount of weight we
  give to each feature from the feature itself, and (b) they allow us
  to keep adding features at will without needing to worry about the
  fiddly details needed for it to be a probability model.

(...12 matching lines skipped...)
  rather than returning a probability per se.
  """

  def __init__(self, feature_function, weights, epsilon=None):
    """Construct a new model with the given weights and feature function.

    Args:
      feature_function: A function ``X -> Y -> list(FeatureValue)``. N.B.,
        for all ``x`` and ``y`` the length of ``feature_function(x)(y)``
        must be the same as the length of ``weights``.
-      weights (list of float): coefficients for how important we consider
-        each component of the feature vector for deciding which ``y``
-        to blame.
+      weights (dict of float): the weights for the features. Each key of
+        the dictionary is the name of the feature that weight is for. We
+        take this argument as a dict rather than as a list so that callers
+        needn't worry about what order to provide the weights in.
      epsilon (float): The absolute-error threshold for considering a
        weight to be "equal to zero". N.B., this should be a positive
        number, as we will compare it against the absolute value of
        each weight.
    """
    if epsilon is None:
      epsilon = EPSILON
-    self._weights = np.array([
-        w if isinstance(w, float) and math.fabs(w) >= epsilon else 0.
-        for w in weights])

wrengr 2017/01/12 19:09:09: It is trivial to do that now. Instead of saying ``
Sharu Jiang 2017/01/13 01:08:34: This one is trivial, there are some other places w

+
+    # TODO(crbug.com/680207) Filter zero weights, use sparse representation of
+    # weight covector.
+    self._weights = {
+        name: weight if isinstance(weight, float) and
+            math.fabs(weight) >= epsilon else 0.
+        for name, weight in weights.iteritems()
+    }

    self._quadrance = None

    # TODO(crbug.com/674752): we need better names for ``self._features``.
    def _Features(x):
      """Wrap ``feature_function`` to memoize things and ensure types.

      This outer wrapping takes each ``x`` to a memoized instance of
      ``_FeaturesGivenX``. That is, for each ``x`` we return a
-      ``MemoizedFunction`` from ``Y`` to ``list(FeatureValue)``.
+      ``MemoizedFunction`` from ``Y`` to ``dict(str to FeatureValue)``.
      """
      fx = feature_function(x)
      def _FeaturesGivenX(y):
        """Wrap ``feature_function(x)`` to ensure appropriate types.

        This inner wrapper ensures that the resulting ``FeatureValue``
        array has the same length as the weight covector.
        """
        fxy = fx(y)
        # N.B., we're assuming that ``len(self.weights)`` is O(1).
        assert len(fxy) == len(self.weights), TypeError(
            "vector length mismatch: %d != %d" % (len(fxy), len(self.weights)))
        return fxy

      # Memoize on ``Y``, to ensure we don't need to recompute
      # ``FeatureValue``s nor recheck the lengths.
      return MemoizedFunction(_FeaturesGivenX)

    # Memoize on ``X``, to ensure we share the memo tables on ``Y``.
    self._features = MemoizedFunction(_Features)

    # TODO(crbug.com/674752): we need better names for ``self._scores``.
    # N.B., this is just the inner product of ``self.weights``
    # against ``self._features(x)``. If we can compute this in some
    # more efficient way, we should. In particular, we will want to
    # make the weights sparse, in which case we need to use a sparse
    # variant of the dot product.
-    self._scores = MemoizedFunction(lambda x:
-        self._features(x).map(lambda fxy:
-            self.weights.dot(np.array([feature.value for feature in fxy]))))
+    self._scores = MemoizedFunction(lambda x: self._features(x).map(
+        lambda fxy: math.fsum(self.SingleFeatureScore(feature)
+                              for feature in fxy.itervalues())))
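The new scoring path can be sketched standalone in Python 3 (``MemoizedFunction`` replaced here by ``functools.lru_cache``, and ``FeatureValue`` by a hypothetical namedtuple stand-in; the ``touch``/``distance`` features are invented for illustration):

```python
import math
from collections import namedtuple
from functools import lru_cache

# Hypothetical stand-in for Predator's FeatureValue.
FeatureValue = namedtuple('FeatureValue', 'name value')

weights = {'touch': 2.0, 'distance': 0.5}

@lru_cache(maxsize=None)            # memoize on x, sharing the y-memo tables
def features(x):
    @lru_cache(maxsize=None)        # memoize on y for this fixed x
    def features_given_x(y):
        return {
            'touch': FeatureValue('touch', float(x == y)),
            'distance': FeatureValue('distance', -abs(x - y)),
        }
    return features_given_x

def score(x, y):
    # Inner product of the weight covector with the name-keyed feature dict;
    # this is what the MemoizedFunction lambda above computes.
    return math.fsum(weights[f.name] * f.value
                     for f in features(x)(y).values())

print(score(3, 3))  # 2.0: 'touch' fires, 'distance' contributes 0
```

Keying features by name (rather than position) is what lets the weight covector go sparse later without re-aligning vectors.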
+  def SingleFeatureScore(self, feature_value):
+    """Returns the score (aka weighted value) of a ``FeatureValue``.
+
+    This function assumes the report's stacktrace has already had any necessary
+    preprocessing (like filtering or truncating) applied.

wrengr 2017/01/12 19:09:09: I don't understand this sentence. If it was from t
Sharu Jiang 2017/01/13 01:08:34: Done.

+
+    Args:
+      feature_value (FeatureValue): the feature value to check.
+
+    Returns:
+      The score of the feature value.
+    """
+    return feature_value.value * self._weights.get(feature_value.name, 0.)
+
+  # TODO(crbug.com/673964): something better for detecting "close to log(0)".
+  def LogZeroish(self, x):
+    """Determine whether a float is close enough to log(0).
+
+    If a ``FeatureValue`` has a (log-domain) score of -inf for a given
+    ``Suspect``, then that suspect has zero probability of being the
+    culprit. We want to filter these suspects out, to clean up the
+    output of classification; so this method encapsulates the logic of
+    that check.
+
+    Args:
+      x (float): the float to check
+
+    Returns:
+      ``True`` if ``x`` is close enough to log(0); else ``False``.
+    """
+    return x < 0 and math.isinf(x)
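A standalone Python 3 sketch of these two new methods (the ``min_distance`` feature name and the weight are invented): unknown features score zero, and a ``-inf`` score marks a suspect with exactly zero probability.

```python
import math
from collections import namedtuple

FeatureValue = namedtuple('FeatureValue', 'name value')  # hypothetical stand-in

weights = {'min_distance': 1.5}

def single_feature_score(fv):
    # Features without a registered weight default to 0, i.e. are ignored.
    return fv.value * weights.get(fv.name, 0.)

def log_zeroish(x):
    # -inf in the log domain corresponds to a normal-domain probability of 0.
    return x < 0 and math.isinf(x)

good = FeatureValue('min_distance', 0.8)
dead = FeatureValue('min_distance', float('-inf'))

print(single_feature_score(good))               # close to 1.2
print(log_zeroish(single_feature_score(dead)))  # True: filter this suspect
```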

  def ClearWeightBasedMemos(self):
    """Clear all the memos that depend on the weight covector."""
    self._quadrance = None
    self._scores.ClearMemos()

  def ClearAllMemos(self):
    """Clear all memos, even those independent of the weight covector."""
    self.ClearWeightBasedMemos()
    self._features.ClearMemos()

  @property
  def weights(self):
    """The weight covector.

-    At present we return the weights as an ``np.ndarray``, but in the
-    future that may be replaced by a more general type which specifies
-    the semantics rather than the implementation details.
+    At present we return the weights as a dict mapping feature name to its
+    weight, but in the future that may be replaced by a more general type
+    which specifies the semantics rather than the implementation details.
    """
    return self._weights

  @property
  def l0(self):  # pragma: no cover
    """The l0-norm of the weight covector.

    N.B., despite being popularly called the "l0-norm", this isn't
    actually a norm in the mathematical sense."""
-    return float(np.count_nonzero(self.weights))
+    return float(len(self.weights) - self.weights.values().count(0.))

  @property
  def l1(self):  # pragma: no cover
    """The l1 (aka: Manhattan) norm of the weight covector."""
-    return math.fsum(math.fabs(w) for w in self.weights)
+    return math.fsum(math.fabs(w) for w in self.weights.itervalues())

  @property
  def quadrance(self):
    """The square of the l2 norm of the weight covector.

    This value is often more helpful to have direct access to, as it
    avoids the need for non-rational functions (e.g., sqrt) and shows up
    as its own quantity in many places. Also, computing it directly avoids
    the error introduced by squaring the square-root of an IEEE-754 float.
    """
    if self._quadrance is None:
-      self._quadrance = math.fsum(math.fabs(w)**2 for w in self.weights)
+      self._quadrance = math.fsum(
+          math.fabs(w)**2 for w in self.weights.itervalues())

    return self._quadrance

  @property
  def l2(self):
    """The l2 (aka Euclidean) norm of the weight covector.

    If you need the square of the l2-norm, do not use this property.
    Instead, use the ``quadrance`` property which is more accurate than
    squaring this one.
    """
    return math.sqrt(self.quadrance)
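The dict-based norm computations can be checked in isolation (Python 3, with an invented already-thresholded weight dict):

```python
import math

# Toy weight covector after epsilon-thresholding; 'c' was zeroed out.
weights = {'a': 3.0, 'b': -4.0, 'c': 0.}

# l0: the number of nonzero weights (not a true norm).
l0 = float(len(weights) - list(weights.values()).count(0.))
# l1: sum of absolute values.
l1 = math.fsum(math.fabs(w) for w in weights.values())
# quadrance: the exact square of l2, computed without a sqrt round-trip.
quadrance = math.fsum(math.fabs(w) ** 2 for w in weights.values())
l2 = math.sqrt(quadrance)

print(l0, l1, quadrance, l2)  # 2.0 7.0 25.0 5.0
```

Note the Python 3 wrinkle: ``dict.values()`` is a view, so the ``count`` trick in the reviewed (Python 2) code needs an explicit ``list(...)`` there.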

  def Features(self, x):
    """Returns a function mapping ``y`` to its feature vector given ``x``.

    Args:
      x (X): the value of the independent variable.

    Returns:
-      A ``MemoizedFunction`` of type ``Y -> np.array(float)``.
+      A ``MemoizedFunction`` of type ``Y -> dict(str to float)``.
    """
    return self._features(x)

  def Score(self, x):
    """Returns a function mapping ``y`` to its "score" given ``x``.

    Semantically, the "score" of ``y`` given ``x`` is the
    unnormalized log-domain conditional probability of ``y`` given
    ``x``. Operationally, we compute this by taking the inner product
    of the weight covector with the ``Features(x)(y)`` vector. The
    conditional probability itself can be computed by exponentiating
    the "score" and normalizing with the partition function. However,
    for determining the "best" ``y`` we don't need to bother computing
    the probability, because ``Score(x)`` is monotonic with respect to
    ``Probability(x)``.

    Args:
      x (X): the value of the independent variable.

    Returns:
      A ``MemoizedFunction`` of type ``Y -> float``.
    """
    return self._scores(x)
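The monotonicity claim in the ``Score`` docstring is easy to see numerically (Python 3 sketch with invented scores): exponentiating and normalizing by the partition function never changes which ``y`` wins.

```python
import math

# Toy unnormalized log-domain scores for three candidate suspects.
scores = {'y1': 1.0, 'y2': 3.0, 'y3': 2.0}

# Partition function Z, then the normalized conditional probabilities.
z = math.fsum(math.exp(s) for s in scores.values())
probs = {y: math.exp(s) / z for y, s in scores.items()}

best_by_score = max(scores, key=scores.get)
best_by_prob = max(probs, key=probs.get)
print(best_by_score, best_by_prob)  # y2 y2: exp is monotonic
```

This is why classification can stop at ``Score(x)`` and skip the (more expensive and numerically touchier) normalization.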
+  def FormatReasons(self, features):
+    """Collect and format a list of all ``FeatureValue.reason`` strings.
+
+    Args:
+      features (literable of FeatureValue): the values whose ``reason``

wrengr 2017/01/12 19:09:09: "literable" -> "iterable"
Sharu Jiang 2017/01/13 01:08:34: Done.

+        strings should be collected.
+
+    Returns:
+      A list of ``(str, float, str)`` triples; where the first string is
+      the feature name, the float is some numeric representation of how
+      much influence this feature exerts on the ``Suspect`` being blamed,
+      and the final string is the ``FeatureValue.reason``. The list is
+      sorted by feature name, just to ensure that it comes out in some
+      canonical order.
+
+      At present, the float is the log-domain score of the feature
+      value. However, this isn't the best thing for UX reasons. In the
+      future it might be replaced by the normal-domain score, or by
+      the probability.
+    """
+    formatted_reasons = []
+    for feature in features:
+      feature_score = self.SingleFeatureScore(feature)
+      if self.LogZeroish(feature_score):  # pragma: no cover
+        logging.debug('Discarding reasons from feature %s'
+                      ' because it has zero probability' % feature.name)
+        continue
+
+      formatted_reasons.append((feature.name, feature_score, feature.reason))
+
+    formatted_reasons.sort(key=lambda formatted_reason: formatted_reason[0])
+    return formatted_reasons
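A self-contained Python 3 sketch of the reason-formatting logic (``FeatureValue`` again a hypothetical namedtuple stand-in; feature names and reasons invented): zero-probability features are dropped, the rest come out sorted by name.

```python
import math
from collections import namedtuple

FeatureValue = namedtuple('FeatureValue', 'name value reason')

weights = {'touch': 2.0, 'distance': 1.0}

def single_feature_score(fv):
    return fv.value * weights.get(fv.name, 0.)

def format_reasons(features):
    reasons = []
    for fv in features:
        s = single_feature_score(fv)
        if s < 0 and math.isinf(s):   # log(0): zero probability, discard
            continue
        reasons.append((fv.name, s, fv.reason))
    reasons.sort(key=lambda r: r[0])  # canonical order by feature name
    return reasons

fvs = [
    FeatureValue('touch', 1.0, 'suspect touched the crashed file'),
    FeatureValue('distance', float('-inf'), 'crash line very far away'),
]
print(format_reasons(fvs))  # only the 'touch' reason survives
```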
+
+  def AggregateChangedFiles(self, features):
+    """Merge multiple ``FeatureValue.changed_files`` lists into one.
+
+    Args:
+      features (iterable of FeatureValue): the values whose ``changed_files``
+        lists should be aggregated.
+
+    Returns:
+      A list of ``ChangedFile`` objects sorted by file name. The sorting
+      is not essential, but is provided to ease testing by ensuring the
+      output is in some canonical order.
+
+    Raises:
+      ``ValueError`` if any file name is given inconsistent ``blame_url``s.
+    """
+    all_changed_files = {}
+    for feature in features:
+      if self.LogZeroish(self.SingleFeatureScore(feature)):  # pragma: no cover
+        logging.debug('Discarding changed files from feature %s'
+                      ' because it has zero probability' % feature.name)
+        continue
+
+      for changed_file in feature.changed_files or []:
+        accumulated_changed_file = all_changed_files.get(changed_file.name)
+        if accumulated_changed_file is None:
+          all_changed_files[changed_file.name] = changed_file
+          continue
+
+        if (accumulated_changed_file.blame_url !=
+            changed_file.blame_url):  # pragma: no cover
+          raise ValueError('Blame URLs do not match: %s != %s'
+              % (accumulated_changed_file.blame_url, changed_file.blame_url))
+        accumulated_changed_file.reasons.extend(changed_file.reasons or [])
+
+    changed_files = all_changed_files.values()
+    changed_files.sort(key=lambda changed_file: changed_file.name)
+    return changed_files
+

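The merge step can be exercised standalone (Python 3; ``ChangedFile`` is a hypothetical stand-in for the real class, and the file names/URLs are invented): files with the same name merge by concatenating reasons, and a blame-URL mismatch is a hard error.

```python
class ChangedFile(object):
    """Hypothetical stand-in for Predator's ChangedFile."""
    def __init__(self, name, blame_url, reasons):
        self.name, self.blame_url, self.reasons = name, blame_url, reasons

def aggregate(changed_file_lists):
    merged = {}
    for changed_files in changed_file_lists:
        for cf in changed_files:
            seen = merged.get(cf.name)
            if seen is None:
                merged[cf.name] = cf           # first sighting of this file
            elif seen.blame_url != cf.blame_url:
                raise ValueError('Blame URLs do not match: %s != %s'
                                 % (seen.blame_url, cf.blame_url))
            else:
                seen.reasons.extend(cf.reasons)
    # Sort by file name for a canonical, test-friendly order.
    return sorted(merged.values(), key=lambda cf: cf.name)

out = aggregate([
    [ChangedFile('a.cc', 'http://blame/a.cc', ['touched'])],
    [ChangedFile('a.cc', 'http://blame/a.cc', ['near crash'])],
])
print(out[0].reasons)  # ['touched', 'near crash']
```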
class LogLinearModel(UnnormalizedLogLinearModel):
  """A loglinear probability model.

  This class is distinct from ``UnnormalizedLogLinearModel`` in that
  we can provide probabilities (not just scores). However, to do so we
  require a specification of the subsets of ``Y`` for each ``x``.
  """
  def __init__(self, Y_given_X, feature_function, weights, epsilon=None):
    """Construct a new probabilistic model.

    Args:
      Y_given_X: a function from ``X`` to an iterable object giving the
        subset of ``Y`` which has non-zero probability given the
        ``x``. When in doubt about whether some ``y`` has zero probability
        or not, it is always safe/correct to return a larger subset of
        ``Y`` (it'll just take more computation time is all). This is
        needed for computing the partition function and expectation. N.B.,
        we do not actually need to know/enumerate all of ``Y``,
        only the subsets for each ``x``.
      feature_function: A function ``X -> Y -> list(float)``. N.B.,
        for all ``x`` and ``y`` the length of ``feature_function(x)(y)``
        must be the same as the length of ``weights``.
-      weights (list of float): coefficients for how important we consider
-        each component of the feature vector for deciding which ``y``
-        to blame.
+      weights (dict of float): the weights for the features. Each key of
+        the dictionary is the name of the feature that weight is for. We
+        take this argument as a dict rather than as a list so that callers
+        needn't worry about what order to provide the weights in.
      epsilon (float): The absolute-error threshold for considering a
        weight to be "equal to zero". N.B., this should be a positive
        number, as we will compare it against the absolute value of
        each weight.
    """
    super(LogLinearModel, self).__init__(feature_function, weights, epsilon)

    self._Y = Y_given_X

    def _LogZ(x):
(...70 matching lines skipped...)
    function returns; rather, it's a sort of average of all the results
    returned. For more information you can take a look at Wikipedia
    <https://en.wikipedia.org/wiki/Expected_value>.
    """
    prob_given_x = self.Probability(x)
    # N.B., the ``*`` below is vector scaling! If we want to make this
    # method polymorphic in the return type of ``f`` then we'll need an
    # API that provides both scaling and ``vsum``.
    return vsum([prob_given_x(y) * f(y) for y in self._Y(x)])
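The expectation computed above is just a probability-weighted average over the ``Y_given_X`` support. A scalar Python 3 sketch (toy scores and support, with ``f(y) = y``; the real code generalizes this to vectors via ``vsum``):

```python
import math

# Toy support for a fixed x, with invented log-domain scores.
ys = [0, 1, 2]
scores = {y: float(y) for y in ys}

# Normalize via the partition function Z to get probabilities.
z = math.fsum(math.exp(s) for s in scores.values())
prob = lambda y: math.exp(scores[y]) / z

# E[f] under the conditional distribution, here with f(y) = y.
expectation = math.fsum(prob(y) * float(y) for y in ys)
print(expectation)  # ~1.575: a weighted average, not any single f(y)
```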
