Chromium Code Reviews

Side by Side Diff: appengine/findit/crash/loglinear/model.py

Issue 2617273002: [Predator] Move ``SingleFeatureScore`` to LLM. (Closed)
Patch Set: Fix nit. Created 3 years, 11 months ago
 # Copyright 2016 The Chromium Authors. All rights reserved.
 # Use of this source code is governed by a BSD-style license that can be
 # found in the LICENSE file.

 # TODO(http://crbug.com/669639): there are lots of ways to make the code
 # in this file better. We avoid having separate todos per task; instead
 # see that meta-issue ticket.

+import logging
 import math
 import numpy as np
 # N.B., ``np.array`` can't take generators; you must pass explicit lists.

 from libs.math.functions import MemoizedFunction
 from libs.math.logarithms import logsumexp
 from libs.math.vectors import vsum

 EPSILON = 0.00001

-def ToFeatureFunction(fs):
-  """Given an array of scalar-valued functions, return an array-valued function.
-
-  Args:
-    fs (iterable): A collection of curried functions ``X -> Y -> A``.
-      That is, given a particular ``x`` they return a function ``Y -> A``.
-
-  Returns:
-    A function ``X -> Y -> list(A)`` where for all ``x``, ``y``, and
-    ``i`` we have that ``ToFeatureFunction(fs)(x)(y)[i] == fs[i](x)(y)``.
-  """
-  def _FeatureFunction(x):
-    fxs = [f(x) for f in fs]
-    return lambda y: [fx(y) for fx in fxs]
-
-  return _FeatureFunction
-
-
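For readers following the refactor, the contract of the helper removed above can be sketched in isolation. This is a minimal reconstruction; the `add` and `mul` lambdas are made-up stand-ins for real feature functions.

```python
def ToFeatureFunction(fs):
    """Turn a list of curried scalar functions into one curried
    array-valued function, per the removed helper's docstring."""
    def _FeatureFunction(x):
        fxs = [f(x) for f in fs]  # partially apply every feature to x
        return lambda y: [fx(y) for fx in fxs]
    return _FeatureFunction

add = lambda x: lambda y: x + y
mul = lambda x: lambda y: x * y
g = ToFeatureFunction([add, mul])

# The documented law: g(x)(y)[i] == fs[i](x)(y)
assert g(2)(3) == [add(2)(3), mul(2)(3)] == [5, 6]
```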
 class UnnormalizedLogLinearModel(object):
   """An unnormalized loglinear model.

   We use this model for combining a bunch of different features in order
   to decide who to blame. In particular, the nice thing about loglinear
   models is that (a) they allow separating the amount of weight we
   give to each feature from the feature itself, and (b) they allow us
   to keep adding features at will without needing to worry about the
   fiddly details needed for it to be a probability model.

(...skipping 12 matching lines...)
   rather than returning a probability per se.
   """

   def __init__(self, feature_function, weights, epsilon=None):
     """Construct a new model with the given weights and feature function.

     Args:
       feature_function: A function ``X -> Y -> list(FeatureValue)``. N.B.,
         for all ``x`` and ``y`` the length of ``feature_function(x)(y)``
         must be the same as the length of ``weights``.
-      weights (list of float): coefficients for how important we consider
-        each component of the feature vector for deciding which ``y``
-        to blame.
+      weights (dict of float): the weights for the features. The keys of
+        the dictionary are the names of the features the weights are for.
+        We take this argument as a dict rather than as a list so that
+        callers needn't worry about what order to provide the weights in.
       epsilon (float): The absolute-error threshold for considering a
         weight to be "equal to zero". N.B., this should be a positive
         number, as we will compare it against the absolute value of
         each weight.
     """
     if epsilon is None:
-      epsilon = EPSILON
-    self._weights = np.array([
-        w if isinstance(w, float) and math.fabs(w) >= epsilon else 0.
-        for w in weights])
+      self._epsilon = EPSILON
+    else:
+      self._epsilon = epsilon
+
+    # TODO(crbug.com/680207) Filter zero weights, use sparse representation of
+    # weight covector.
+    self._weights = {
+        name: weight for name, weight in weights.iteritems()
+        if self.IsNonZeroWeight(weight)
+    }
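The new weight-filtering behavior can be sketched standalone. This uses Python 3 syntax (`.items()`) and made-up feature names; the CL itself targets Python 2's `iteritems`.

```python
import math

EPSILON = 0.00001  # same default threshold as the model

def IsNonZeroWeight(weight, epsilon=EPSILON):
    # Non-floats and weights within epsilon of zero are filtered out.
    return isinstance(weight, float) and math.fabs(weight) >= epsilon

raw_weights = {'touch_crashed_file': 1.0,   # hypothetical feature names
               'min_distance': 0.5,
               'noise': 1e-9}               # effectively zero: dropped
weights = {name: w for name, w in raw_weights.items()
           if IsNonZeroWeight(w)}
assert weights == {'touch_crashed_file': 1.0, 'min_distance': 0.5}
```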

     self._quadrance = None

     # TODO(crbug.com/674752): we need better names for ``self._features``.
     def _Features(x):
       """Wrap ``feature_function`` to memoize things and ensure types.

       This outer wrapping takes each ``x`` to a memoized instance of
       ``_FeaturesGivenX``. That is, for each ``x`` we return a
-      ``MemoizedFunction`` from ``Y`` to ``list(FeatureValue)``.
+      ``MemoizedFunction`` from ``Y`` to ``dict(str to FeatureValue)``.
       """
       fx = feature_function(x)
       def _FeaturesGivenX(y):
         """Wrap ``feature_function(x)`` to ensure appropriate types.

         This inner wrapper ensures that the resulting ``FeatureValue``
         array has the same length as the weight covector.
         """
         fxy = fx(y)
         # N.B., we're assuming that ``len(self.weights)`` is O(1).
         assert len(fxy) == len(self.weights), TypeError(
             "vector length mismatch: %d != %d" % (len(fxy), len(self.weights)))
         return fxy

       # Memoize on ``Y``, to ensure we don't need to recompute
       # ``FeatureValue``s nor recheck the lengths.
       return MemoizedFunction(_FeaturesGivenX)

     # Memoize on ``X``, to ensure we share the memo tables on ``Y``.
     self._features = MemoizedFunction(_Features)

     # TODO(crbug.com/674752): we need better names for ``self._scores``.
     # N.B., this is just the inner product of ``self.weights``
     # against ``self._features(x)``. If we can compute this in some
     # more efficient way, we should. In particular, we will want to
     # make the weights sparse, in which case we need to use a sparse
     # variant of the dot product.
-    self._scores = MemoizedFunction(lambda x:
-        self._features(x).map(lambda fxy:
-            self.weights.dot(np.array([feature.value for feature in fxy]))))
+    self._scores = MemoizedFunction(lambda x: self._features(x).map(
+        lambda fxy: math.fsum(self.SingleFeatureScore(feature)
+                              for feature in fxy.itervalues())))
+  def IsNonZeroWeight(self, weight):
+    return isinstance(weight, float) and math.fabs(weight) >= self._epsilon

chanli 2017/01/23 01:56:39: In your CL https://codereview.chromium.org/2625073
Sharu Jiang 2017/01/23 21:53:32: Done.

+  def SingleFeatureScore(self, feature_value):
+    """Returns the score (aka weighted value) of a ``FeatureValue``.
+
+    Args:
+      feature_value (FeatureValue): the feature value to check.
+
+    Returns:
+      The score of the feature value.
+    """
+    return feature_value.value * self._weights.get(feature_value.name, 0.)
+
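To illustrate how the score now decomposes into per-feature scores summed with `math.fsum`: a minimal sketch with hypothetical weights and a namedtuple standing in for the real `FeatureValue` class.

```python
import math
from collections import namedtuple

FeatureValue = namedtuple('FeatureValue', ['name', 'value'])

weights = {'a': 2.0, 'b': -1.0}  # made-up weight covector

def SingleFeatureScore(fv):
    # Unknown feature names carry weight 0, so they cannot affect the score.
    return fv.value * weights.get(fv.name, 0.)

fxy = {'a': FeatureValue('a', 0.5),
       'b': FeatureValue('b', 1.0),
       'c': FeatureValue('c', 9.9)}  # 'c' has no weight
score = math.fsum(SingleFeatureScore(fv) for fv in fxy.values())
assert score == 0.0  # 2.0*0.5 + (-1.0)*1.0 + 0.0
```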
+  # TODO(crbug.com/673964): something better for detecting "close to log(0)".
+  def LogZeroish(self, x):
+    """Determine whether a float is close enough to log(0).
+
+    If a ``FeatureValue`` has a (log-domain) score of -inf for a given
+    ``Suspect``, then that suspect has zero probability of being the
+    culprit. We want to filter these suspects out, to clean up the
+    output of classification; so this method encapsulates the logic of
+    that check.
+
+    Args:
+      x (float): the float to check.
+
+    Returns:
+      ``True`` if ``x`` is close enough to log(0); else ``False``.
+    """
+    return x < 0 and math.isinf(x)

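The `LogZeroish` check is self-contained enough to demo directly: a log-domain score of `-inf` corresponds to a normal-domain probability of `exp(-inf) == 0`, which is exactly the case the model wants to filter out.

```python
import math

def LogZeroish(x):
    # Only -inf passes: it must be both negative and infinite.
    return x < 0 and math.isinf(x)

assert LogZeroish(float('-inf'))
assert not LogZeroish(float('inf'))   # +inf is not log(0)
assert not LogZeroish(-1e308)         # merely huge-and-negative is kept
assert math.exp(float('-inf')) == 0.0  # why -inf means zero probability
```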
   def ClearWeightBasedMemos(self):
     """Clear all the memos that depend on the weight covector."""
     self._quadrance = None
     self._scores.ClearMemos()

   def ClearAllMemos(self):
     """Clear all memos, even those independent of the weight covector."""
     self.ClearWeightBasedMemos()
     self._features.ClearMemos()

   @property
   def weights(self):
     """The weight covector.

-    At present we return the weights as an ``np.ndarray``, but in the
-    future that may be replaced by a more general type which specifies
-    the semantics rather than the implementation details.
+    At present we return the weights as a dict mapping feature name to its
+    weight, but in the future that may be replaced by a more general type
+    which specifies the semantics rather than the implementation details.
     """
     return self._weights

   @property
   def l0(self):  # pragma: no cover
     """The l0-norm of the weight covector.

     N.B., despite being popularly called the "l0-norm", this isn't
     actually a norm in the mathematical sense."""
-    return float(np.count_nonzero(self.weights))
+    return float(len(self.weights) - self.weights.values().count(0.))

   @property
   def l1(self):  # pragma: no cover
     """The l1 (aka: Manhattan) norm of the weight covector."""
-    return math.fsum(math.fabs(w) for w in self.weights)
+    return math.fsum(math.fabs(w) for w in self.weights.itervalues())

   @property
   def quadrance(self):
     """The square of the l2 norm of the weight covector.

     This value is often more helpful to have direct access to, as it
     avoids the need for non-rational functions (e.g., sqrt) and shows up
     as its own quantity in many places. Also, computing it directly avoids
     the error introduced by squaring the square-root of an IEEE-754 float.
     """
     if self._quadrance is None:
-      self._quadrance = math.fsum(math.fabs(w)**2 for w in self.weights)
+      self._quadrance = math.fsum(
+          math.fabs(w)**2 for w in self.weights.itervalues())

     return self._quadrance

   @property
   def l2(self):
     """The l2 (aka Euclidean) norm of the weight covector.

     If you need the square of the l2-norm, do not use this property.
     Instead, use the ``quadrance`` property which is more accurate than
     squaring this one.
     """
     return math.sqrt(self.quadrance)
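Over a dict-valued covector the norm properties reduce to ordinary sums, which is easy to sanity-check with a made-up covector (Python 3's `.values()` here in place of the CL's `itervalues`):

```python
import math

weights = {'a': 3.0, 'b': -4.0, 'c': 0.0}  # hypothetical covector

l0 = float(len(weights) - list(weights.values()).count(0.))
l1 = math.fsum(math.fabs(w) for w in weights.values())
quadrance = math.fsum(math.fabs(w) ** 2 for w in weights.values())
l2 = math.sqrt(quadrance)

# A 3-4-5 triangle: two nonzero weights, |3| + |-4| = 7, 9 + 16 = 25.
assert (l0, l1, quadrance, l2) == (2.0, 7.0, 25.0, 5.0)
```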

   def Features(self, x):
     """Returns a function mapping ``y`` to its feature vector given ``x``.

     Args:
       x (X): the value of the independent variable.

     Returns:
-      A ``MemoizedFunction`` of type ``Y -> np.array(float)``.
+      A ``MemoizedFunction`` of type ``Y -> dict(str to float)``.
     """
     return self._features(x)

   def Score(self, x):
     """Returns a function mapping ``y`` to its "score" given ``x``.

     Semantically, the "score" of ``y`` given ``x`` is the
     unnormalized log-domain conditional probability of ``y`` given
     ``x``. Operationally, we compute this by taking the inner product
     of the weight covector with the ``Features(x)(y)`` vector. The
     conditional probability itself can be computed by exponentiating
     the "score" and normalizing with the partition function. However,
     for determining the "best" ``y`` we don't need to bother computing
     the probability, because ``Score(x)`` is monotonic with respect to
     ``Probability(x)``.

     Args:
       x (X): the value of the independent variable.

     Returns:
       A ``MemoizedFunction`` of type ``Y -> float``.
     """
     return self._scores(x)
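The monotonicity claim in the docstring can be checked directly: exponentiating the scores and normalizing by the partition function never reorders candidates, so the best suspect can be picked from raw scores alone. The scores below are made up for illustration.

```python
import math

scores = {'y1': 1.5, 'y2': 0.2, 'y3': -3.0}  # hypothetical log-domain scores

z = math.fsum(math.exp(s) for s in scores.values())  # partition function Z(x)
probs = {y: math.exp(s) / z for y, s in scores.items()}

# Same argmax, and indeed the same total order, before and after normalizing.
assert max(scores, key=scores.get) == max(probs, key=probs.get) == 'y1'
assert sorted(scores, key=scores.get) == sorted(probs, key=probs.get)
assert abs(math.fsum(probs.values()) - 1.0) < 1e-12
```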

+  def FormatReasons(self, features):
+    """Collect and format a list of all ``FeatureValue.reason`` strings.
+
+    Args:
+      features (iterable of FeatureValue): the values whose ``reason``
+        strings should be collected.
+
+    Returns:
+      A list of ``(str, float, str)`` triples; where the first string is
+      the feature name, the float is some numeric representation of how
+      much influence this feature exerts on the ``Suspect`` being blamed,
+      and the final string is the ``FeatureValue.reason``. The list is
+      sorted by feature name, just to ensure that it comes out in some
+      canonical order.
+
+      At present, the float is the log-domain score of the feature
+      value. However, this isn't the best thing for UX reasons. In the
+      future it might be replaced by the normal-domain score, or by
+      the probability.
+    """
+    formatted_reasons = []
+    for feature in features:
+      feature_score = self.SingleFeatureScore(feature)
+      if self.LogZeroish(feature_score):  # pragma: no cover
+        logging.debug('Discarding reasons from feature %s'
+                      ' because it has zero probability', feature.name)
+        continue
+
+      formatted_reasons.append((feature.name, feature_score, feature.reason))
+
+    formatted_reasons.sort(key=lambda formatted_reason: formatted_reason[0])
+    return formatted_reasons
+
+  def AggregateChangedFiles(self, features):
+    """Merge multiple ``FeatureValue.changed_files`` lists into one.
+
+    Args:
+      features (iterable of FeatureValue): the values whose ``changed_files``
+        lists should be aggregated.
+
+    Returns:
+      A list of ``ChangedFile`` objects sorted by file name. The sorting
+      is not essential, but is provided to ease testing by ensuring the
+      output is in some canonical order.
+
+    Raises:
+      ``ValueError`` if any file name is given inconsistent ``blame_url``s.
+    """
+    all_changed_files = {}
+    for feature in features:
+      if self.LogZeroish(self.SingleFeatureScore(feature)):  # pragma: no cover
+        logging.debug('Discarding changed files from feature %s'
+                      ' because it has zero probability', feature.name)
+        continue
+
+      for changed_file in feature.changed_files or []:
+        accumulated_changed_file = all_changed_files.get(changed_file.name)
+        if accumulated_changed_file is None:
+          all_changed_files[changed_file.name] = changed_file
+          continue
+
+        if (accumulated_changed_file.blame_url !=
+            changed_file.blame_url):  # pragma: no cover
+          raise ValueError('Blame URLs do not match: %s != %s'
+                           % (accumulated_changed_file.blame_url,
+                              changed_file.blame_url))
+        accumulated_changed_file.reasons.extend(changed_file.reasons or [])
+
+    changed_files = all_changed_files.values()
+    changed_files.sort(key=lambda changed_file: changed_file.name)
+    return changed_files
+
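The merge-by-name logic in `AggregateChangedFiles` can be sketched with a hypothetical stand-in for the real `ChangedFile` class (which lives elsewhere in Findit); the file names and reasons below are invented.

```python
class ChangedFile(object):
    # Minimal stand-in; the real class has more fields.
    def __init__(self, name, blame_url, reasons):
        self.name, self.blame_url, self.reasons = name, blame_url, reasons

reported = [ChangedFile('a.cc', 'url/a', ['touched near crash']),
            ChangedFile('b.cc', 'url/b', ['renamed']),
            ChangedFile('a.cc', 'url/a', ['modified'])]

merged = {}
for cf in reported:
    acc = merged.get(cf.name)
    if acc is None:
        merged[cf.name] = cf           # first sighting of this file
    elif acc.blame_url != cf.blame_url:
        raise ValueError('Blame URLs do not match: %s != %s'
                         % (acc.blame_url, cf.blame_url))
    else:
        acc.reasons.extend(cf.reasons or [])  # accumulate reasons

changed_files = sorted(merged.values(), key=lambda cf: cf.name)
assert [cf.name for cf in changed_files] == ['a.cc', 'b.cc']
assert changed_files[0].reasons == ['touched near crash', 'modified']
```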

 class LogLinearModel(UnnormalizedLogLinearModel):
   """A loglinear probability model.

   This class is distinct from ``UnnormalizedLogLinearModel`` in that
   we can provide probabilities (not just scores). However, to do so we
   require a specification of the subsets of ``Y`` for each ``x``.
   """
   def __init__(self, Y_given_X, feature_function, weights, epsilon=None):
     """Construct a new probabilistic model.

     Args:
       Y_given_X: a function from ``X`` to an iterable object giving the
         subset of ``Y`` which has non-zero probability given the
         ``x``. When in doubt about whether some ``y`` has zero probability
         or not, it is always safe/correct to return a larger subset of
         ``Y`` (it'll just take more computation time is all). This is
         needed for computing the partition function and expectation. N.B.,
         we do not actually need to know/enumerate *all* of ``Y``,
         only the subsets for each ``x``.
       feature_function: A function ``X -> Y -> list(float)``. N.B.,
         for all ``x`` and ``y`` the length of ``feature_function(x)(y)``
         must be the same as the length of ``weights``.
-      weights (list of float): coefficients for how important we consider
-        each component of the feature vector for deciding which ``y``
-        to blame.
+      weights (dict of float): the weights for the features. The keys of
+        the dictionary are the names of the features the weights are for.
+        We take this argument as a dict rather than as a list so that
+        callers needn't worry about what order to provide the weights in.
       epsilon (float): The absolute-error threshold for considering a
         weight to be "equal to zero". N.B., this should be a positive
         number, as we will compare it against the absolute value of
         each weight.
     """
     super(LogLinearModel, self).__init__(feature_function, weights, epsilon)

     self._Y = Y_given_X

     def _LogZ(x):
(...skipping 70 matching lines...)
     function returns; rather, it's a sort of average of all the results
     returned. For more information you can take a look at Wikipedia
     <https://en.wikipedia.org/wiki/Expected_value>.
     """
     prob_given_x = self.Probability(x)
     # N.B., the ``*`` below is vector scaling! If we want to make this
     # method polymorphic in the return type of ``f`` then we'll need an
     # API that provides both scaling and ``vsum``.
     return vsum([prob_given_x(y) * f(y) for y in self._Y(x)])

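The expectation computed above is a probability-weighted vector sum. A self-contained sketch, with a stand-in for `libs.math.vectors.vsum` and made-up probabilities and feature vectors:

```python
import math

def vsum(vectors):
    # Componentwise sum, standing in for libs.math.vectors.vsum.
    return [math.fsum(component) for component in zip(*vectors)]

prob_given_x = {'y1': 0.25, 'y2': 0.75}   # hypothetical P(y | x)
f = {'y1': [1.0, 0.0], 'y2': [0.0, 1.0]}  # hypothetical f(y)

# E[f] = sum over y of P(y | x) * f(y), with ``*`` as vector scaling.
expectation = vsum([[prob_given_x[y] * c for c in f[y]]
                    for y in prob_given_x])
assert expectation == [0.25, 0.75]
```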

This is Rietveld 408576698