Chromium Code Reviews

Side by Side Diff: appengine/findit/crash/loglinear/model.py

Issue 2625073003: [Predator] Add MetaWeight and MetaFeatureValue to group multiple weights and features together. (Closed)
Patch Set: Rebase. Created 3 years, 11 months ago
 # Copyright 2016 The Chromium Authors. All rights reserved.
 # Use of this source code is governed by a BSD-style license that can be
 # found in the LICENSE file.

 # TODO(http://crbug.com/669639): there are lots of ways to make the code
 # in this file better. We avoid having separate todos per task; instead
 # see that meta-issue ticket.

 import logging
 import math
(...skipping 25 matching lines...)
   where particular values for that random variable are called ``y``. The
   partition function is called ``Z``. And (somewhat non-conventionally)
   we will call the log of the partition function ``zeta``.

   This class is distinct from ``LogLinearModel`` in that we do not require
   a specification of ``Y``. This class is sufficient for determining
   which ``y`` is the most likely, though it can only return a "score"
   rather than returning a probability per se.
   """

-  def __init__(self, feature_function, weights, epsilon=None):
-    """Construct a new model with the given weights and feature function.
+  def __init__(self, meta_feature, meta_weight, epsilon=None):
+    """Construct a new model with the given meta_feature and meta_weight.

     Args:
-      feature_function: A function ``X -> Y -> list(FeatureValue)``. N.B.,
-        for all ``x`` and ``y`` the length of ``feature_function(x)(y)``
-        must be the same as the length of ``weights``.
-      weights (dict of float): the weights for the features. The keys of
-        the dictionary are the names of the feature that weight is
-        for. We take this argument as a dict rather than as a list so that
-        callers needn't worry about what order to provide the weights in.
+      meta_feature (MetaFeature): A function ``X -> Y -> MetaFeatureValue``
+        which bundles all the features together.
+      meta_weight (MetaWeight): All the weights for the features, grouped
+        to mirror the structure of ``meta_feature``.
       epsilon (float): The absolute-error threshold for considering a
         weight to be "equal to zero". N.B., this should be a positive
         number, as we will compare it against the absolute value of
         each weight.
     """
-    if epsilon is None:
-      self._epsilon = EPSILON
-    else:
-      self._epsilon = epsilon
+    self._epsilon = EPSILON if epsilon is None else epsilon
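The ``epsilon`` threshold above feeds ``DropZeroWeights``, which this patch moves into ``MetaWeight``. As a rough, hypothetical sketch of that behavior (the real ``MetaWeight`` API may differ; the ``MetaWeightSketch`` name and its dict-of-floats storage are assumptions, written in Python 3):

```python
import math

EPSILON = 0.00001  # stand-in for the module-level EPSILON default


class MetaWeightSketch(object):
    """Hypothetical stand-in for MetaWeight: a bag of named float weights."""

    def __init__(self, weights):
        self._weights = dict(weights)

    def DropZeroWeights(self, epsilon=EPSILON):
        # Any weight within ``epsilon`` of zero is treated as exactly zero
        # and dropped, leaving a sparse weight covector.
        self._weights = {name: w for name, w in self._weights.items()
                         if math.fabs(w) > epsilon}

    def __len__(self):
        return len(self._weights)


mw = MetaWeightSketch({'touch_crashed_file': 0.5, 'noise': 1e-9})
mw.DropZeroWeights(epsilon=0.0001)
assert len(mw) == 1  # the near-zero 'noise' weight was dropped
```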
 
-    # TODO(crbug.com/680207) Filter zero weights, use sparse representaion of
-    # weight covector.
-    self._weights = {
-        name: weight for name, weight in weights.iteritems()
-        if self.IsNonZeroWeight(weight)
-    }
+    self._meta_weight = meta_weight
+    self._meta_weight.DropZeroWeights(self._epsilon)

     self._quadrance = None
-
     # TODO(crbug.com/674752): we need better names for ``self._features``.
     def _Features(x):
-      """Wrap ``feature_function`` to memoize things and ensure types.
+      """Wrap ``meta_feature`` to memoize things.

       This outer wrapping takes each ``x`` to a memoized instance of
       ``_FeaturesGivenX``. That is, for each ``x`` we return a
       ``MemoizedFunction`` from ``Y`` to ``dict(str to FeatureValue)``.
       """
-      fx = feature_function(x)
-      def _FeaturesGivenX(y):
-        """Wrap ``feature_function(x)`` to ensure appropriate types.
-
-        This inner wrapper ensures that the resulting ``FeatureValue``
-        array has the same length as the weight covector.
-        """
-        fxy = fx(y)
-        # N.B., we're assuming that ``len(self.weights)`` is O(1).
-        assert len(fxy) == len(self.weights), TypeError(
-            "vector length mismatch: %d != %d" % (len(fxy), len(self.weights)))
-        return fxy
-
       # Memoize on ``Y``, to ensure we don't need to recompute
       # ``FeatureValue``s nor recheck the lengths.
-      return MemoizedFunction(_FeaturesGivenX)
+      return MemoizedFunction(meta_feature(x))

     # Memoize on ``X``, to ensure we share the memo tables on ``Y``.
     self._features = MemoizedFunction(_Features)
     # TODO(crbug.com/674752): we need better names for ``self._scores``.
-    # N.B., this is just the inner product of ``self.weights``
+    # N.B., this is just the inner product of ``self._meta_weight``
     # against ``self._features(x)``. If we can compute this in some
     # more efficient way, we should. In particular, we will want to
     # make the weights sparse, in which case we need to use a sparse
     # variant of the dot product.
     self._scores = MemoizedFunction(lambda x: self._features(x).map(
-        lambda fxy: math.fsum(self.SingleFeatureScore(feature)
-                              for feature in fxy.itervalues())))
+        lambda fxy: self._meta_weight * fxy))
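Assuming ``MetaWeight.__mul__`` implements the sparse inner product this comment asks for, the core computation reduces to something like the following hypothetical helper (the ``sparse_dot`` name and dict representation are assumptions):

```python
import math

def sparse_dot(weights, feature_values):
    """Sparse dot product of a weight dict against a feature-value dict.

    Iterates only over the (nonzero) weights; features without a weight
    contribute nothing, which is what makes the representation sparse.
    """
    return math.fsum(w * feature_values.get(name, 0.)
                     for name, w in weights.items())


weights = {'a': 2.0, 'b': -1.0}
features = {'a': 3.0, 'b': 1.0, 'c': 100.0}  # 'c' has zero weight
assert sparse_dot(weights, features) == 5.0  # 2*3 + (-1)*1
```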
-  def IsNonZeroWeight(self, weight):
-    """Determines whether a weight is zero or not using ``self._epsilon``."""
-    return isinstance(weight, float) and math.fabs(weight) > self._epsilon
-
-  def SingleFeatureScore(self, feature_value):
-    """Returns the score (aka weighted value) of a ``FeatureValue``.
-
-    Args:
-      feature_value (FeatureValue): the feature value to check.
-
-    Returns:
-      The score of the feature value.
-    """
-    return feature_value.value * self._weights.get(feature_value.name, 0.)
-
-  # TODO(crbug.com/673964): something better for detecting "close to log(0)".
-  def LogZeroish(self, x):
-    """Determine whether a float is close enough to log(0).
-
-    If a ``FeatureValue`` has a (log-domain) score of -inf for a given
-    ``Suspect``, then that suspect has zero probability of being the
-    culprit. We want to filter these suspects out, to clean up the
-    output of classification; so this method encapsulates the logic of
-    that check.
-
-    Args:
-      x (float): the float to check
-
-    Returns:
-      ``True`` if ``x`` is close enough to log(0); else ``False``.
-    """
-    return x < 0 and math.isinf(x)
 
   def ClearWeightBasedMemos(self):
     """Clear all the memos that depend on the weight covector."""
     self._quadrance = None
     self._scores.ClearMemos()

   def ClearAllMemos(self):
     """Clear all memos, even those independent of the weight covector."""
     self.ClearWeightBasedMemos()
     self._features.ClearMemos()

   @property
-  def weights(self):
+  def meta_weight(self):
     """The weight covector.

     At present we return the weights as a dict mapping each feature name to its
     weight, but in the future that may be replaced by a more general type which
     specifies the semantics rather than the implementation details.
     """
-    return self._weights
+    return self._meta_weight
+
+  @meta_weight.setter
+  def meta_weight(self, new_meta_weight):
+    self._meta_weight = new_meta_weight
+    self._meta_weight.DropZeroWeights(self._epsilon)
 
   @property
   def l0(self):  # pragma: no cover
     """The l0-norm of the weight covector.

     N.B., despite being popularly called the "l0-norm", this isn't
     actually a norm in the mathematical sense."""
-    return float(len(self.weights) - self.weights.values().count(0.))
+    return self._meta_weight.l0

   @property
   def l1(self):  # pragma: no cover
     """The l1 (aka: Manhattan) norm of the weight covector."""
-    return math.fsum(math.fabs(w) for w in self.weights.itervalues())
+    return self._meta_weight.l1

   @property
   def quadrance(self):
     """The square of the l2 norm of the weight covector.

     This value is often more helpful to have direct access to, as it
     avoids the need for non-rational functions (e.g., sqrt) and shows up
     as its own quantity in many places. Also, computing it directly avoids
     the error introduced by squaring the square-root of an IEEE-754 float.
     """
-    if self._quadrance is None:
-      self._quadrance = math.fsum(
-          math.fabs(w)**2 for w in self.weights.itervalues())
-
-    return self._quadrance
+    return self._meta_weight.quadrance
 
   @property
   def l2(self):
     """The l2 (aka Euclidean) norm of the weight covector.

     If you need the square of the l2-norm, do not use this property.
     Instead, use the ``quadrance`` property which is more accurate than
     squaring this one.
     """
     return math.sqrt(self.quadrance)
+  # TODO(crbug.com/673964): something better for detecting "close to log(0)".
+  def LogZeroish(self, x):
+    """Determine whether a float is close enough to log(0).
+
+    If a ``FeatureValue`` has a (log-domain) score of -inf for a given
+    ``Suspect``, then that suspect has zero probability of being the
+    culprit. We want to filter these suspects out, to clean up the
+    output of classification; so this method encapsulates the logic of
+    that check.
+
+    Args:
+      x (float): the float to check.
+
+    Returns:
+      ``True`` if ``x`` is close enough to log(0); else ``False``.
+    """
+    return x < 0 and math.isinf(x)
+
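``LogZeroish`` hinges on the fact that ``log(0)`` is ``-inf`` in the limit, so a ``-inf`` log-domain score marks a zero-probability outcome. A standalone version behaves like this:

```python
import math

def log_zeroish(x):
    # Only -inf counts as log(0); +inf and all finite scores do not.
    return x < 0 and math.isinf(x)

assert log_zeroish(float('-inf'))
assert not log_zeroish(float('inf'))      # +inf is not log(0)
assert not log_zeroish(-1000.0)           # very negative but finite is fine
assert not log_zeroish(math.log(1e-300))  # tiny probabilities stay finite
```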
   def Features(self, x):
     """Returns a function mapping ``y`` to its feature vector given ``x``.

     Args:
       x (X): the value of the independent variable.

     Returns:
       A ``MemoizedFunction`` of type ``Y -> dict(str to float)``.
     """
     return self._features(x)
(...skipping 12 matching lines...)
     ``Probability(x)``.

     Args:
       x (X): the value of the independent variable.

     Returns:
       A ``MemoizedFunction`` of type ``Y -> float``.
     """
     return self._scores(x)

-  def FormatReasons(self, features):
-    """Collect and format a list of all ``FeatureValue.reason`` strings.
-
-    Args:
-      features (iterable of FeatureValue): the values whose ``reason``
-        strings should be collected.
-
-    Returns:
-      A list of ``(str, float, str)`` triples; where the first string is
-      the feature name, the float is some numeric representation of how
-      much influence this feature exerts on the ``Suspect`` being blamed,
-      and the final string is the ``FeatureValue.reason``. The list is
-      sorted by feature name, just to ensure that it comes out in some
-      canonical order.
-
-      At present, the float is the log-domain score of the feature
-      value. However, this isn't the best thing for UX reasons. In the
-      future it might be replaced by the normal-domain score, or by
-      the probability.
-    """
-    formatted_reasons = []
-    for feature in features:
-      feature_score = self.SingleFeatureScore(feature)
-      if self.LogZeroish(feature_score):  # pragma: no cover
-        logging.debug('Discarding reasons from feature %s'
-                      ' because it has zero probability' % feature.name)
-        continue
-
-      formatted_reasons.append((feature.name, feature_score, feature.reason))
-
-    formatted_reasons.sort(key=lambda formatted_reason: formatted_reason[0])
-    return formatted_reasons
-
-  def AggregateChangedFiles(self, features):
-    """Merge multiple ``FeatureValue.changed_files`` lists into one.
-
-    Args:
-      features (iterable of FeatureValue): the values whose ``changed_files``
-        lists should be aggregated.
-
-    Returns:
-      A list of ``ChangedFile`` objects sorted by file name. The sorting
-      is not essential, but is provided to ease testing by ensuring the
-      output is in some canonical order.
-
-    Raises:
-      ``ValueError`` if any file name is given inconsistent ``blame_url``s.
-    """
-    all_changed_files = {}
-    for feature in features:
-      if self.LogZeroish(self.SingleFeatureScore(feature)):  # pragma: no cover
-        logging.debug('Discarding changed files from feature %s'
-                      ' because it has zero probability' % feature.name)
-        continue
-
-      for changed_file in feature.changed_files or []:
-        accumulated_changed_file = all_changed_files.get(changed_file.name)
-        if accumulated_changed_file is None:
-          all_changed_files[changed_file.name] = changed_file
-          continue
-
-        if (accumulated_changed_file.blame_url !=
-            changed_file.blame_url):  # pragma: no cover
-          raise ValueError('Blame URLs do not match: %s != %s'
-              % (accumulated_changed_file.blame_url, changed_file.blame_url))
-        accumulated_changed_file.reasons.extend(changed_file.reasons or [])
-
-    changed_files = all_changed_files.values()
-    changed_files.sort(key=lambda changed_file: changed_file.name)
-    return changed_files
-
 
 class LogLinearModel(UnnormalizedLogLinearModel):
   """A loglinear probability model.

   This class is distinct from ``UnnormalizedLogLinearModel`` in that
   we can provide probabilities (not just scores). However, to do so we
   require a specification of the subsets of ``Y`` for each ``x``.
   """
-  def __init__(self, Y_given_X, feature_function, weights, epsilon=None):
+  def __init__(self, Y_given_X, meta_feature, meta_weight, epsilon=None):
     """Construct a new probabilistic model.

     Args:
       Y_given_X: a function from ``X`` to an iterable object giving the
         subset of ``Y`` which has non-zero probability given the
         ``x``. When in doubt about whether some ``y`` has zero probability
         or not, it is always safe/correct to return a larger subset of
         ``Y`` (it'll just take more computation time is all). This is
         needed for computing the partition function and expectation. N.B.,
         we do not actually need to know/enumerate *all* of ``Y``,
         only the subsets for each ``x``.
-      feature_function: A function ``X -> Y -> list(float)``. N.B.,
-        for all ``x`` and ``y`` the length of ``feature_function(x)(y)``
-        must be the same as the length of ``weights``.
-      weights (dict of float): the weights for the features. The keys of
-        the dictionary are the names of the feature that weight is
-        for. We take this argument as a dict rather than as a list so that
-        callers needn't worry about what order to provide the weights in.
+      meta_feature (MetaFeature): A function ``X -> Y -> MetaFeatureValue``
+        which bundles all the features together.
+      meta_weight (MetaWeight): All the weights for the features, grouped
+        to mirror the structure of ``meta_feature``.
       epsilon (float): The absolute-error threshold for considering a
         weight to be "equal to zero". N.B., this should be a positive
         number, as we will compare it against the absolute value of
         each weight.
     """
-    super(LogLinearModel, self).__init__(feature_function, weights, epsilon)
+    super(LogLinearModel, self).__init__(meta_feature, meta_weight, epsilon)

     self._Y = Y_given_X

     def _LogZ(x):
       score_given_x = self._scores(x)
       return logsumexp([score_given_x(y) for y in self._Y(x)])
     self._zetas = MemoizedFunction(_LogZ)
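``logsumexp`` here computes ``zeta = log Z`` without overflowing. A standard max-shifted implementation (a sketch, not the helper actually imported by this file) looks like:

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs)).

    Shifting by the maximum keeps exp() from overflowing even when the
    log-domain scores are large.
    """
    m = max(xs)
    if math.isinf(m) and m < 0:
        return m  # every term is log(0), so the sum is still log(0)
    return m + math.log(math.fsum(math.exp(x - m) for x in xs))


# With two equal scores the distribution is uniform over two outcomes;
# naive exp(1000.0) would overflow a float, but the shifted form is fine.
scores = [1000.0, 1000.0]
zeta = logsumexp(scores)
assert abs(zeta - (1000.0 + math.log(2.0))) < 1e-9
```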
 
   def ClearWeightBasedMemos(self):
     """Clear all the memos that depend on the weight covector."""
(...skipping 22 matching lines...)
       x (X): the value of the independent variable.

     Returns:
       The normalizing constant of ``LogProbability(x)``.
     """
     return self._zetas(x)
 
   def Probability(self, x):
     """The normal-domain distribution over ``y`` given ``x``.

-    That is, ``self.Probability(x)(y)`` returns ``p(y | x; self.weights)``
+    That is, ``self.Probability(x)(y)`` returns ``p(y | x; self._meta_weight)``
     which is the model's estimation of ``Pr(y|x)``.

     If you need the log-probability, don't use this method. Instead,
     use the ``LogProbability`` method which computes it directly and
     thus is more accurate than taking the log of this one.
     """
     logprob_given_x = self.LogProbability(x)
     return lambda y: math.exp(logprob_given_x(y))
 
   def LogProbability(self, x):
(...skipping 20 matching lines...)
     the particular array returned may not actually be one that the
     function returns; rather, it's a sort of average of all the results
     returned. For more information you can take a look at Wikipedia
     <https://en.wikipedia.org/wiki/Expected_value>.
     """
     prob_given_x = self.Probability(x)
     # N.B., the ``*`` below is vector scaling! If we want to make this
     # method polymorphic in the return type of ``f`` then we'll need an
     # API that provides both scaling and ``vsum``.
     return vsum([prob_given_x(y) * f(y) for y in self._Y(x)])
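For scalar-valued ``f``, the expectation reduces to a probability-weighted sum; a toy sketch with plain floats (standing in for ``vsum`` and vector scaling, which the real code needs for array-valued ``f``):

```python
import math

def expectation(prob_given_x, Y, f):
    # E[f] = sum over y of p(y | x) * f(y). Since this ``f`` returns a
    # float, plain scaling and fsum replace vector scaling and ``vsum``.
    return math.fsum(prob_given_x(y) * f(y) for y in Y)


# Fair-coin toy example: two outcomes, each with probability 0.5.
prob = lambda y: 0.5
assert expectation(prob, ['heads', 'tails'], lambda y: 1.0) == 1.0
assert expectation(prob, [0, 1], lambda y: float(y)) == 0.5
```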

This is Rietveld 408576698