Chromium Code Reviews

Side by Side Diff: appengine/findit/crash/loglinear/model.py

Issue 2617273002: [Predator] Move ``SingleFeatureScore`` to LLM. (Closed)
Patch Set: Address comments. Created 3 years, 11 months ago
# Copyright 2016 The Chromium Authors. All rights reserved.
# Use of this source code is governed by a BSD-style license that can be
# found in the LICENSE file.

# TODO(http://crbug.com/669639): there are lots of ways to make the code
# in this file better. We avoid having separate todos per task; instead
# see that meta-issue ticket.

+import logging
import math
import numpy as np
# N.B., ``np.array`` can't take generators; you must pass explicit lists.

from libs.math.functions import MemoizedFunction
from libs.math.logarithms import logsumexp
from libs.math.vectors import vsum

EPSILON = 0.00001

-def ToFeatureFunction(fs):
-  """Given an array of scalar-valued functions, return an array-valued function.
-
-  Args:
-    fs (iterable): A collection of curried functions ``X -> Y -> A``.
-      That is, given a particular ``x`` they return a function ``Y -> A``.
-
-  Returns:
-    A function ``X -> Y -> list(A)`` where for all ``x``, ``y``, and
-    ``i`` we have that ``ToFeatureFunction(fs)(x)(y)[i] == fs[i](x)(y)``.
-  """
-  def _FeatureFunction(x):
-    fxs = [f(x) for f in fs]
-    return lambda y: [fx(y) for fx in fxs]
-
-  return _FeatureFunction
-
-
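For reference, a minimal standalone sketch (Python 3; toy ``distance`` and ``product`` features are invented here) of what the removed ``ToFeatureFunction`` did: it bundled several curried feature functions ``X -> Y -> A`` into one function returning the list of all feature values, applying the first argument only once.

```python
def to_feature_function(fs):
    """Turn a list of curried X -> Y -> A functions into one X -> Y -> list(A)."""
    def feature_function(x):
        fxs = [f(x) for f in fs]  # apply the shared first argument once
        return lambda y: [fx(y) for fx in fxs]
    return feature_function

# Two toy curried features on numbers (hypothetical, for illustration only).
distance = lambda x: lambda y: abs(x - y)
product = lambda x: lambda y: x * y

f = to_feature_function([distance, product])
print(f(3)(5))  # [2, 15] == [distance(3)(5), product(3)(5)]
```

The CL replaces this positional list with a name-keyed dict of ``FeatureValue``s, which is why the helper is no longer needed.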
class UnnormalizedLogLinearModel(object):
  """An unnormalized loglinear model.

  We use this model for combining a bunch of different features in order
  to decide who to blame. In particular, the nice thing about loglinear
  models is that (a) they allow separating the amount of weight we
  give to each feature from the feature itself, and (b) they allow us
  to keep adding features at will without needing to worry about the
  fiddly details needed for it to be a probability model.

(...12 matching lines skipped...)
  rather than returning a probability per se.
  """

  def __init__(self, feature_function, weights, epsilon=None):
    """Construct a new model with the given weights and feature function.

    Args:
      feature_function: A function ``X -> Y -> list(FeatureValue)``. N.B.,
        for all ``x`` and ``y`` the length of ``feature_function(x)(y)``
        must be the same as the length of ``weights``.
-      weights (list of float): coefficients for how important we consider
-        each component of the feature vector for deciding which ``y``
-        to blame.
+      weights (dict of float): the weights for the features. Each key of
+        the dictionary is the name of the feature that weight is for. We
+        take this argument as a dict rather than as a list so that callers
+        needn't worry about what order to provide the weights in.
      epsilon (float): The absolute-error threshold for considering a
        weight to be "equal to zero". N.B., this should be a positive
        number, as we will compare it against the absolute value of
        each weight.
    """
    if epsilon is None:
      epsilon = EPSILON
-    self._weights = np.array([
-        w if isinstance(w, float) and math.fabs(w) >= epsilon else 0.
-        for w in weights])

wrengr 2017/01/12 19:09:09: It is trivial to do that now. Instead of saying ``
Sharu Jiang 2017/01/13 01:08:34: This one is trivial, there are some other places w

+
+    # TODO(crbug.com/680207) Filter zero weights, use sparse representation of
+    # weight covector.
+    self._weights = {
+        name: weight if isinstance(weight, float) and
+            math.fabs(weight) >= epsilon else 0.
+        for name, weight in weights.iteritems()
+    }

    self._quadrance = None

    # TODO(crbug.com/674752): we need better names for ``self._features``.
    def _Features(x):
      """Wrap ``feature_function`` to memoize things and ensure types.

      This outer wrapping takes each ``x`` to a memoized instance of
      ``_FeaturesGivenX``. That is, for each ``x`` we return a
-      ``MemoizedFunction`` from ``Y`` to ``list(FeatureValue)``.
+      ``MemoizedFunction`` from ``Y`` to ``dict(str to FeatureValue)``.
      """
      fx = feature_function(x)
      def _FeaturesGivenX(y):
        """Wrap ``feature_function(x)`` to ensure appropriate types.

        This inner wrapper ensures that the resulting ``FeatureValue``
        array has the same length as the weight covector.
        """
        fxy = fx(y)
        # N.B., we're assuming that ``len(self.weights)`` is O(1).
        assert len(fxy) == len(self.weights), TypeError(
            "vector length mismatch: %d != %d" % (len(fxy), len(self.weights)))
        return fxy

      # Memoize on ``Y``, to ensure we don't need to recompute
      # ``FeatureValue``s nor recheck the lengths.
      return MemoizedFunction(_FeaturesGivenX)

    # Memoize on ``X``, to ensure we share the memo tables on ``Y``.
    self._features = MemoizedFunction(_Features)

    # TODO(crbug.com/674752): we need better names for ``self._scores``.
    # N.B., this is just the inner product of ``self.weights``
    # against ``self._features(x)``. If we can compute this in some
    # more efficient way, we should. In particular, we will want to
    # make the weights sparse, in which case we need to use a sparse
    # variant of the dot product.
-    self._scores = MemoizedFunction(lambda x:
-        self._features(x).map(lambda fxy:
-            self.weights.dot(np.array([feature.value for feature in fxy]))))
+    self._scores = MemoizedFunction(lambda x: self._features(x).map(
+        lambda fxy: math.fsum(self.SingleFeatureScore(feature)
+                              for feature in fxy.itervalues())))
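The new scoring path can be sketched standalone in Python 3 (``MemoizedFunction`` replaced here by ``functools.lru_cache``, and ``FeatureValue`` by a hypothetical namedtuple stand-in; the ``touch``/``distance`` features are invented for illustration):

```python
import math
from collections import namedtuple
from functools import lru_cache

# Hypothetical stand-in for Predator's FeatureValue.
FeatureValue = namedtuple('FeatureValue', 'name value')

weights = {'touch': 2.0, 'distance': 0.5}

@lru_cache(maxsize=None)            # memoize on x, sharing the y-memo tables
def features(x):
    @lru_cache(maxsize=None)        # memoize on y for this fixed x
    def features_given_x(y):
        return {
            'touch': FeatureValue('touch', float(x == y)),
            'distance': FeatureValue('distance', -abs(x - y)),
        }
    return features_given_x

def score(x, y):
    # Inner product of the weight covector with the name-keyed feature dict;
    # this is what the MemoizedFunction lambda above computes.
    return math.fsum(weights[f.name] * f.value
                     for f in features(x)(y).values())

print(score(3, 3))  # 2.0: 'touch' fires, 'distance' contributes 0
```

Keying features by name (rather than position) is what lets the weight covector go sparse later without re-aligning vectors.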
+  def SingleFeatureScore(self, feature_value):
+    """Returns the score (aka weighted value) of a ``FeatureValue``.
+
+    This function assumes the report's stacktrace has already had any necessary
+    preprocessing (like filtering or truncating) applied.

wrengr 2017/01/12 19:09:09: I don't understand this sentence. If it was from t
Sharu Jiang 2017/01/13 01:08:34: Done.

+
+    Args:
+      feature_value (FeatureValue): the feature value to check.
+
+    Returns:
+      The score of the feature value.
+    """
+    return feature_value.value * self._weights.get(feature_value.name, 0.)
+
+  # TODO(crbug.com/673964): something better for detecting "close to log(0)".
+  def LogZeroish(self, x):
+    """Determine whether a float is close enough to log(0).
+
+    If a ``FeatureValue`` has a (log-domain) score of -inf for a given
+    ``Suspect``, then that suspect has zero probability of being the
+    culprit. We want to filter these suspects out, to clean up the
+    output of classification; so this method encapsulates the logic of
+    that check.
+
+    Args:
+      x (float): the float to check
+
+    Returns:
+      ``True`` if ``x`` is close enough to log(0); else ``False``.
+    """
+    return x < 0 and math.isinf(x)
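A standalone Python 3 sketch of these two new methods (the ``min_distance`` feature name and the weight are invented): unknown features score zero, and a ``-inf`` score marks a suspect with exactly zero probability.

```python
import math
from collections import namedtuple

FeatureValue = namedtuple('FeatureValue', 'name value')  # hypothetical stand-in

weights = {'min_distance': 1.5}

def single_feature_score(fv):
    # Features without a registered weight default to 0, i.e. are ignored.
    return fv.value * weights.get(fv.name, 0.)

def log_zeroish(x):
    # -inf in the log domain corresponds to a normal-domain probability of 0.
    return x < 0 and math.isinf(x)

good = FeatureValue('min_distance', 0.8)
dead = FeatureValue('min_distance', float('-inf'))

print(single_feature_score(good))               # close to 1.2
print(log_zeroish(single_feature_score(dead)))  # True: filter this suspect
```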

  def ClearWeightBasedMemos(self):
    """Clear all the memos that depend on the weight covector."""
    self._quadrance = None
    self._scores.ClearMemos()

  def ClearAllMemos(self):
    """Clear all memos, even those independent of the weight covector."""
    self.ClearWeightBasedMemos()
    self._features.ClearMemos()

  @property
  def weights(self):
    """The weight covector.

-    At present we return the weights as an ``np.ndarray``, but in the
-    future that may be replaced by a more general type which specifies
-    the semantics rather than the implementation details.
+    At present we return the weights as a dict mapping feature name to its
+    weight, but in the future that may be replaced by a more general type
+    which specifies the semantics rather than the implementation details.
    """
    return self._weights

  @property
  def l0(self):  # pragma: no cover
    """The l0-norm of the weight covector.

    N.B., despite being popularly called the "l0-norm", this isn't
    actually a norm in the mathematical sense."""
-    return float(np.count_nonzero(self.weights))
+    return float(len(self.weights) - self.weights.values().count(0.))

  @property
  def l1(self):  # pragma: no cover
    """The l1 (aka: Manhattan) norm of the weight covector."""
-    return math.fsum(math.fabs(w) for w in self.weights)
+    return math.fsum(math.fabs(w) for w in self.weights.itervalues())

  @property
  def quadrance(self):
    """The square of the l2 norm of the weight covector.

    This value is often more helpful to have direct access to, as it
    avoids the need for non-rational functions (e.g., sqrt) and shows up
    as its own quantity in many places. Also, computing it directly avoids
    the error introduced by squaring the square-root of an IEEE-754 float.
    """
    if self._quadrance is None:
-      self._quadrance = math.fsum(math.fabs(w)**2 for w in self.weights)
+      self._quadrance = math.fsum(
+          math.fabs(w)**2 for w in self.weights.itervalues())

    return self._quadrance

  @property
  def l2(self):
    """The l2 (aka Euclidean) norm of the weight covector.

    If you need the square of the l2-norm, do not use this property.
    Instead, use the ``quadrance`` property which is more accurate than
    squaring this one.
    """
    return math.sqrt(self.quadrance)
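The dict-based norm computations can be checked in isolation (Python 3, with an invented already-thresholded weight dict):

```python
import math

# Toy weight covector after epsilon-thresholding; 'c' was zeroed out.
weights = {'a': 3.0, 'b': -4.0, 'c': 0.}

# l0: the number of nonzero weights (not a true norm).
l0 = float(len(weights) - list(weights.values()).count(0.))
# l1: sum of absolute values.
l1 = math.fsum(math.fabs(w) for w in weights.values())
# quadrance: the exact square of l2, computed without a sqrt round-trip.
quadrance = math.fsum(math.fabs(w) ** 2 for w in weights.values())
l2 = math.sqrt(quadrance)

print(l0, l1, quadrance, l2)  # 2.0 7.0 25.0 5.0
```

Note the Python 3 wrinkle: ``dict.values()`` is a view, so the ``count`` trick in the reviewed (Python 2) code needs an explicit ``list(...)`` there.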

  def Features(self, x):
    """Returns a function mapping ``y`` to its feature vector given ``x``.

    Args:
      x (X): the value of the independent variable.

    Returns:
-      A ``MemoizedFunction`` of type ``Y -> np.array(float)``.
+      A ``MemoizedFunction`` of type ``Y -> dict(str to float)``.
    """
    return self._features(x)

  def Score(self, x):
    """Returns a function mapping ``y`` to its "score" given ``x``.

    Semantically, the "score" of ``y`` given ``x`` is the
    unnormalized log-domain conditional probability of ``y`` given
    ``x``. Operationally, we compute this by taking the inner product
    of the weight covector with the ``Features(x)(y)`` vector. The
    conditional probability itself can be computed by exponentiating
    the "score" and normalizing with the partition function. However,
    for determining the "best" ``y`` we don't need to bother computing
    the probability, because ``Score(x)`` is monotonic with respect to
    ``Probability(x)``.

    Args:
      x (X): the value of the independent variable.

    Returns:
      A ``MemoizedFunction`` of type ``Y -> float``.
    """
    return self._scores(x)
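The monotonicity claim in the ``Score`` docstring is easy to see numerically (Python 3 sketch with invented scores): exponentiating and normalizing by the partition function never changes which ``y`` wins.

```python
import math

# Toy unnormalized log-domain scores for three candidate suspects.
scores = {'y1': 1.0, 'y2': 3.0, 'y3': 2.0}

# Partition function Z, then the normalized conditional probabilities.
z = math.fsum(math.exp(s) for s in scores.values())
probs = {y: math.exp(s) / z for y, s in scores.items()}

best_by_score = max(scores, key=scores.get)
best_by_prob = max(probs, key=probs.get)
print(best_by_score, best_by_prob)  # y2 y2: exp is monotonic
```

This is why classification can stop at ``Score(x)`` and skip the (more expensive and numerically touchier) normalization.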
+  def FormatReasons(self, features):
+    """Collect and format a list of all ``FeatureValue.reason`` strings.
+
+    Args:
+      features (literable of FeatureValue): the values whose ``reason``

wrengr 2017/01/12 19:09:09: "literable" -> "iterable"
Sharu Jiang 2017/01/13 01:08:34: Done.

+        strings should be collected.
+
+    Returns:
+      A list of ``(str, float, str)`` triples; where the first string is
+      the feature name, the float is some numeric representation of how
+      much influence this feature exerts on the ``Suspect`` being blamed,
+      and the final string is the ``FeatureValue.reason``. The list is
+      sorted by feature name, just to ensure that it comes out in some
+      canonical order.
+
+      At present, the float is the log-domain score of the feature
+      value. However, this isn't the best thing for UX reasons. In the
+      future it might be replaced by the normal-domain score, or by
+      the probability.
+    """
+    formatted_reasons = []
+    for feature in features:
+      feature_score = self.SingleFeatureScore(feature)
+      if self.LogZeroish(feature_score):  # pragma: no cover
+        logging.debug('Discarding reasons from feature %s'
+                      ' because it has zero probability' % feature.name)
+        continue
+
+      formatted_reasons.append((feature.name, feature_score, feature.reason))
+
+    formatted_reasons.sort(key=lambda formatted_reason: formatted_reason[0])
+    return formatted_reasons
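A self-contained Python 3 sketch of the reason-formatting logic (``FeatureValue`` again a hypothetical namedtuple stand-in; feature names and reasons invented): zero-probability features are dropped, the rest come out sorted by name.

```python
import math
from collections import namedtuple

FeatureValue = namedtuple('FeatureValue', 'name value reason')

weights = {'touch': 2.0, 'distance': 1.0}

def single_feature_score(fv):
    return fv.value * weights.get(fv.name, 0.)

def format_reasons(features):
    reasons = []
    for fv in features:
        s = single_feature_score(fv)
        if s < 0 and math.isinf(s):   # log(0): zero probability, discard
            continue
        reasons.append((fv.name, s, fv.reason))
    reasons.sort(key=lambda r: r[0])  # canonical order by feature name
    return reasons

fvs = [
    FeatureValue('touch', 1.0, 'suspect touched the crashed file'),
    FeatureValue('distance', float('-inf'), 'crash line very far away'),
]
print(format_reasons(fvs))  # only the 'touch' reason survives
```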
+
+  def AggregateChangedFiles(self, features):
+    """Merge multiple ``FeatureValue.changed_files`` lists into one.
+
+    Args:
+      features (iterable of FeatureValue): the values whose ``changed_files``
+        lists should be aggregated.
+
+    Returns:
+      A list of ``ChangedFile`` objects sorted by file name. The sorting
+      is not essential, but is provided to ease testing by ensuring the
+      output is in some canonical order.
+
+    Raises:
+      ``ValueError`` if any file name is given inconsistent ``blame_url``s.
+    """
+    all_changed_files = {}
+    for feature in features:
+      if self.LogZeroish(self.SingleFeatureScore(feature)):  # pragma: no cover
+        logging.debug('Discarding changed files from feature %s'
+                      ' because it has zero probability' % feature.name)
+        continue
+
+      for changed_file in feature.changed_files or []:
+        accumulated_changed_file = all_changed_files.get(changed_file.name)
+        if accumulated_changed_file is None:
+          all_changed_files[changed_file.name] = changed_file
+          continue
+
+        if (accumulated_changed_file.blame_url !=
+            changed_file.blame_url):  # pragma: no cover
+          raise ValueError('Blame URLs do not match: %s != %s'
+              % (accumulated_changed_file.blame_url, changed_file.blame_url))
+        accumulated_changed_file.reasons.extend(changed_file.reasons or [])
+
+    changed_files = all_changed_files.values()
+    changed_files.sort(key=lambda changed_file: changed_file.name)
+    return changed_files
+

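The merge step can be exercised standalone (Python 3; ``ChangedFile`` is a hypothetical stand-in for the real class, and the file names/URLs are invented): files with the same name merge by concatenating reasons, and a blame-URL mismatch is a hard error.

```python
class ChangedFile(object):
    """Hypothetical stand-in for Predator's ChangedFile."""
    def __init__(self, name, blame_url, reasons):
        self.name, self.blame_url, self.reasons = name, blame_url, reasons

def aggregate(changed_file_lists):
    merged = {}
    for changed_files in changed_file_lists:
        for cf in changed_files:
            seen = merged.get(cf.name)
            if seen is None:
                merged[cf.name] = cf           # first sighting of this file
            elif seen.blame_url != cf.blame_url:
                raise ValueError('Blame URLs do not match: %s != %s'
                                 % (seen.blame_url, cf.blame_url))
            else:
                seen.reasons.extend(cf.reasons)
    # Sort by file name for a canonical, test-friendly order.
    return sorted(merged.values(), key=lambda cf: cf.name)

out = aggregate([
    [ChangedFile('a.cc', 'http://blame/a.cc', ['touched'])],
    [ChangedFile('a.cc', 'http://blame/a.cc', ['near crash'])],
])
print(out[0].reasons)  # ['touched', 'near crash']
```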
class LogLinearModel(UnnormalizedLogLinearModel):
  """A loglinear probability model.

  This class is distinct from ``UnnormalizedLogLinearModel`` in that
  we can provide probabilities (not just scores). However, to do so we
  require a specification of the subsets of ``Y`` for each ``x``.
  """
  def __init__(self, Y_given_X, feature_function, weights, epsilon=None):
    """Construct a new probabilistic model.

    Args:
      Y_given_X: a function from ``X`` to an iterable object giving the
        subset of ``Y`` which has non-zero probability given the
        ``x``. When in doubt about whether some ``y`` has zero probability
        or not, it is always safe/correct to return a larger subset of
        ``Y`` (it'll just take more computation time is all). This is
        needed for computing the partition function and expectation. N.B.,
        we do not actually need to know/enumerate all of ``Y``,
        only the subsets for each ``x``.
      feature_function: A function ``X -> Y -> list(float)``. N.B.,
        for all ``x`` and ``y`` the length of ``feature_function(x)(y)``
        must be the same as the length of ``weights``.
-      weights (list of float): coefficients for how important we consider
-        each component of the feature vector for deciding which ``y``
-        to blame.
+      weights (dict of float): the weights for the features. Each key of
+        the dictionary is the name of the feature that weight is for. We
+        take this argument as a dict rather than as a list so that callers
+        needn't worry about what order to provide the weights in.
      epsilon (float): The absolute-error threshold for considering a
        weight to be "equal to zero". N.B., this should be a positive
        number, as we will compare it against the absolute value of
        each weight.
    """
    super(LogLinearModel, self).__init__(feature_function, weights, epsilon)

    self._Y = Y_given_X

    def _LogZ(x):
(...70 matching lines skipped...)
    function returns; rather, it's a sort of average of all the results
    returned. For more information you can take a look at Wikipedia
    <https://en.wikipedia.org/wiki/Expected_value>.
    """
    prob_given_x = self.Probability(x)
    # N.B., the ``*`` below is vector scaling! If we want to make this
    # method polymorphic in the return type of ``f`` then we'll need an
    # API that provides both scaling and ``vsum``.
    return vsum([prob_given_x(y) * f(y) for y in self._Y(x)])
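The expectation computed above is just a probability-weighted average over the ``Y_given_X`` support. A scalar Python 3 sketch (toy scores and support, with ``f(y) = y``; the real code generalizes this to vectors via ``vsum``):

```python
import math

# Toy support for a fixed x, with invented log-domain scores.
ys = [0, 1, 2]
scores = {y: float(y) for y in ys}

# Normalize via the partition function Z to get probabilities.
z = math.fsum(math.exp(s) for s in scores.values())
prob = lambda y: math.exp(scores[y]) / z

# E[f] under the conditional distribution, here with f(y) = y.
expectation = math.fsum(prob(y) * float(y) for y in ys)
print(expectation)  # ~1.575: a weighted average, not any single f(y)
```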
