Chromium Code Reviews

| OLD | NEW |
|---|---|
| 1 # Copyright 2016 The Chromium Authors. All rights reserved. | 1 # Copyright 2016 The Chromium Authors. All rights reserved. |
| 2 # Use of this source code is governed by a BSD-style license that can be | 2 # Use of this source code is governed by a BSD-style license that can be |
| 3 # found in the LICENSE file. | 3 # found in the LICENSE file. |
| 4 | 4 |
| 5 # TODO(http://crbug.com/669639): there are lots of ways to make the code | 5 # TODO(http://crbug.com/669639): there are lots of ways to make the code |
| 6 # in this file better. We avoid having separate todos per task; instead | 6 # in this file better. We avoid having separate todos per task; instead |
| 7 # see that meta-issue ticket. | 7 # see that meta-issue ticket. |
| 8 | 8 |
| 9 import math | 9 import math |
| 10 import numpy as np | 10 import numpy as np |
| (...skipping 42 matching lines...) | |
| 53 where particular values for that random variable are called ``y``. The | 53 where particular values for that random variable are called ``y``. The |
| 54 partition function is called ``Z``. And (somewhat non-conventionally) | 54 partition function is called ``Z``. And (somewhat non-conventionally) |
| 55 we will call the log of the partition function ``zeta``. | 55 we will call the log of the partition function ``zeta``. |
| 56 | 56 |
| 57 This class is distinct from ``LogLinearModel`` in that we do not require | 57 This class is distinct from ``LogLinearModel`` in that we do not require |
| 58 a specification of ``Y``. This class is sufficient for determining | 58 a specification of ``Y``. This class is sufficient for determining |
| 59 which ``y`` is the most likely, though it can only return a "score" | 59 which ``y`` is the most likely, though it can only return a "score" |
| 60 rather than returning a probability per se. | 60 rather than returning a probability per se. |
| 61 """ | 61 """ |
| 62 | 62 |
| 63 def __init__(self, feature_function, weights, epsilon=None): | 63 def __init__(self, feature_function, weights, feature_to_weight, |
| 64 epsilon=None): | |
|
wrengr 2017/01/09 19:31:51: given the dict of weights the list is redundant. F
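The redundancy the reviewer points out can be sketched: once ``feature_to_weight`` is supplied, an ordered ``weights`` list is derivable from it given a canonical ordering of feature names. A minimal sketch under that assumption (the feature names and the ``feature_order`` list below are hypothetical, not from this CL):

```python
import numpy as np

def WeightsFromDict(feature_to_weight, feature_order):
    """Build the dense, ordered weight vector from the name-keyed dict.

    Features absent from the dict default to 0., mirroring the
    ``.get(name, 0.)`` fallback used by ``SingleFeatureScore``.
    """
    return np.array([feature_to_weight.get(name, 0.) for name in feature_order])

# Hypothetical feature names, for illustration only.
feature_to_weight = {'touch_crashed_file': 2.0, 'min_distance': -0.5}
feature_order = ['touch_crashed_file', 'min_distance', 'top_frame']
weights = WeightsFromDict(feature_to_weight, feature_order)
```

With a helper like this, callers could supply only the dict and never worry about weight ordering, which is the reviewer's point.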
| 64 """Construct a new model with the given weights and feature function. | 65 """Construct a new model with the given weights and feature function. |
| 65 | 66 |
| 66 Args: | 67 Args: |
| 67 feature_function: A function ``X -> Y -> list(FeatureValue)``. N.B., | 68 feature_function: A function ``X -> Y -> list(FeatureValue)``. N.B., |
| 68 for all ``x`` and ``y`` the length of ``feature_function(x)(y)`` | 69 for all ``x`` and ``y`` the length of ``feature_function(x)(y)`` |
| 69 must be the same as the length of ``weights``. | 70 must be the same as the length of ``weights``. |
| 70 weights (list of float): coefficients for how important we consider | 71 weights (list of float): coefficients for how important we consider |
| 71 each component of the feature vector for deciding which ``y`` | 72 each component of the feature vector for deciding which ``y`` |
| 72 to blame. | 73 to blame. |
| 74 feature_to_weight (dict of float): the weights for the features. The keys | |
| 75 of the dictionary are the names of the features that the weights are | |
| 76 for. We take this argument as a dict rather than as a list so that | |
| 77 callers needn't worry about what order to provide the weights in. | |
| 73 epsilon (float): The absolute-error threshold for considering a | 78 epsilon (float): The absolute-error threshold for considering a |
| 74 weight to be "equal to zero". N.B., this should be a positive | 79 weight to be "equal to zero". N.B., this should be a positive |
| 75 number, as we will compare it against the absolute value of | 80 number, as we will compare it against the absolute value of |
| 76 each weight. | 81 each weight. |
| 77 """ | 82 """ |
| 78 if epsilon is None: | 83 if epsilon is None: |
| 79 epsilon = EPSILON | 84 epsilon = EPSILON |
| 85 | |
| 86 self._feature_to_weight = feature_to_weight | |
| 80 self._weights = np.array([ | 87 self._weights = np.array([ |
|
wrengr 2017/01/09 19:31:52: Once we no longer take the list of weights, this w
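The comprehension under discussion zeroes out any weight whose magnitude falls below ``epsilon``. A standalone sketch of that thresholding, assuming an illustrative ``EPSILON`` default (the real constant is defined elsewhere in this file):

```python
import math
import numpy as np

EPSILON = 1e-4  # illustrative default; not the CL's actual value

def ThresholdWeights(weights, epsilon=EPSILON):
    """Replace near-zero (or non-float) weights with exactly 0.

    Mirrors the comprehension in ``__init__``: a weight survives only if
    it is a float and its absolute value is at least ``epsilon``.
    """
    return np.array([
        w if isinstance(w, float) and math.fabs(w) >= epsilon else 0.
        for w in weights])

# Near-zero and non-float entries are forced to 0., making the vector sparser.
sparse = ThresholdWeights([1.5, 1e-9, -2.0, None])
```

Forcing tiny weights to exactly zero is what makes the sparse dot product mentioned later in the file worthwhile.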
| 81 w if isinstance(w, float) and math.fabs(w) >= epsilon else 0. | 88 w if isinstance(w, float) and math.fabs(w) >= epsilon else 0. |
| 82 for w in weights]) | 89 for w in weights]) |
| 83 | 90 |
| 84 self._quadrance = None | 91 self._quadrance = None |
| 85 | 92 |
| 86 # TODO(crbug.com/674752): we need better names for ``self._features``. | 93 # TODO(crbug.com/674752): we need better names for ``self._features``. |
| 87 def _Features(x): | 94 def _Features(x): |
| 88 """Wrap ``feature_function`` to memoize things and ensure types. | 95 """Wrap ``feature_function`` to memoize things and ensure types. |
| 89 | 96 |
| 90 This outer wrapping takes each ``x`` to a memoized instance of | 97 This outer wrapping takes each ``x`` to a memoized instance of |
| (...skipping 23 matching lines...) | |
| 114 # TODO(crbug.com/674752): we need better names for ``self._scores``. | 121 # TODO(crbug.com/674752): we need better names for ``self._scores``. |
| 115 # N.B., this is just the inner product of ``self.weights`` | 122 # N.B., this is just the inner product of ``self.weights`` |
| 116 # against ``self._features(x)``. If we can compute this in some | 123 # against ``self._features(x)``. If we can compute this in some |
| 117 # more efficient way, we should. In particular, we will want to | 124 # more efficient way, we should. In particular, we will want to |
| 118 # make the weights sparse, in which case we need to use a sparse | 125 # make the weights sparse, in which case we need to use a sparse |
| 119 # variant of the dot product. | 126 # variant of the dot product. |
| 120 self._scores = MemoizedFunction(lambda x: | 127 self._scores = MemoizedFunction(lambda x: |
| 121 self._features(x).map(lambda fxy: | 128 self._features(x).map(lambda fxy: |
| 122 self.weights.dot(np.array([feature.value for feature in fxy])))) | 129 self.weights.dot(np.array([feature.value for feature in fxy])))) |
| 123 | 130 |
| 131 def SingleFeatureScore(self, feature_value): | |
| 132 """Returns the score (aka weighted value) of a ``FeatureValue``. | |
| 133 | |
| 134 This function assumes the report's stacktrace has already had any necessary | |
| 135 preprocessing (like filtering or truncating) applied. | |
| 136 | |
| 137 Args: | |
| 138 feature_value (FeatureValue): the feature value to check. | |
| 139 | |
| 140 Returns: | |
| 141 The score of the feature value. | |
| 142 """ | |
| 143 return feature_value.value * self._feature_to_weight.get( | |
| 144 feature_value.name, 0.) | |
| 145 | |
| 124 def ClearWeightBasedMemos(self): | 146 def ClearWeightBasedMemos(self): |
| 125 """Clear all the memos that depend on the weight covector.""" | 147 """Clear all the memos that depend on the weight covector.""" |
| 126 self._quadrance = None | 148 self._quadrance = None |
| 127 self._scores.ClearMemos() | 149 self._scores.ClearMemos() |
| 128 | 150 |
| 129 def ClearAllMemos(self): | 151 def ClearAllMemos(self): |
| 130 """Clear all memos, even those independent of the weight covector.""" | 152 """Clear all memos, even those independent of the weight covector.""" |
| 131 self.ClearWeightBasedMemos() | 153 self.ClearWeightBasedMemos() |
| 132 self._features.ClearMemos() | 154 self._features.ClearMemos() |
| 133 | 155 |
| (...skipping 77 matching lines...) | |
| 211 return self._scores(x) | 233 return self._scores(x) |
| 212 | 234 |
| 213 | 235 |
| 214 class LogLinearModel(UnnormalizedLogLinearModel): | 236 class LogLinearModel(UnnormalizedLogLinearModel): |
| 215 """A loglinear probability model. | 237 """A loglinear probability model. |
| 216 | 238 |
| 217 This class is distinct from ``UnnormalizedLogLinearModel`` in that | 239 This class is distinct from ``UnnormalizedLogLinearModel`` in that |
| 218 we can provide probabilities (not just scores). However, to do so we | 240 we can provide probabilities (not just scores). However, to do so we |
| 219 require a specification of the subsets of ``Y`` for each ``x``. | 241 require a specification of the subsets of ``Y`` for each ``x``. |
| 220 """ | 242 """ |
| 221 def __init__(self, Y_given_X, feature_function, weights, epsilon=None): | 243 def __init__(self, Y_given_X, feature_function, weights, feature_to_weight, |
| 244 epsilon=None): | |
| 222 """Construct a new probabilistic model. | 245 """Construct a new probabilistic model. |
| 223 | 246 |
| 224 Args: | 247 Args: |
| 225 Y_given_X: a function from ``X`` to an iterable object giving the | 248 Y_given_X: a function from ``X`` to an iterable object giving the |
| 226 subset of ``Y`` which has non-zero probability given the | 249 subset of ``Y`` which has non-zero probability given the |
| 227 ``x``. When in doubt about whether some ``y`` has zero probability | 250 ``x``. When in doubt about whether some ``y`` has zero probability |
| 228 or not, it is always safe/correct to return a larger subset of | 251 or not, it is always safe/correct to return a larger subset of |
| 229 ``Y`` (it'll just take more computation time is all). This is | 252 ``Y`` (it'll just take more computation time is all). This is |
| 230 needed for computing the partition function and expectation. N.B., | 253 needed for computing the partition function and expectation. N.B., |
| 231 we do not actually need to know/enumerate *all* of ``Y``, | 254 we do not actually need to know/enumerate *all* of ``Y``, |
| 232 only the subsets for each ``x``. | 255 only the subsets for each ``x``. |
| 233 feature_function: A function ``X -> Y -> list(float)``. N.B., | 256 feature_function: A function ``X -> Y -> list(float)``. N.B., |
| 234 for all ``x`` and ``y`` the length of ``feature_function(x)(y)`` | 257 for all ``x`` and ``y`` the length of ``feature_function(x)(y)`` |
| 235 must be the same as the length of ``weights``. | 258 must be the same as the length of ``weights``. |
| 236 weights (list of float): coefficients for how important we consider | 259 weights (list of float): coefficients for how important we consider |
| 237 each component of the feature vector for deciding which ``y`` | 260 each component of the feature vector for deciding which ``y`` |
| 238 to blame. | 261 to blame. |
|
wrengr 2017/01/09 19:31:52: documentation wasn't updated; though see my previo
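For context on what ``LogLinearModel`` adds: with the subset of ``Y`` in hand, scores become probabilities via the log-partition function ``zeta``, computed stably with log-sum-exp. A minimal sketch over plain lists of scores (the ``LogSumExp`` helper is a stand-in for the ``logsumexp`` the code calls):

```python
import math

def LogSumExp(scores):
    """Numerically stable log(sum(exp(s) for s in scores)).

    Subtracting the max before exponentiating avoids overflow for
    large scores, without changing the result.
    """
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def Probabilities(scores):
    """Normalize unnormalized log-scores into a probability distribution."""
    zeta = LogSumExp(scores)  # zeta = log Z, the log-partition function
    return [math.exp(s - zeta) for s in scores]

# Probabilities sum to 1 and preserve the ordering of the scores.
probs = Probabilities([1.0, 2.0, 3.0])
```

This is exactly the role ``self._zetas`` plays below: memoize ``log Z`` per ``x`` so each ``P(y | x)`` is just ``exp(score(y) - zeta)``.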
| 239 epsilon (float): The absolute-error threshold for considering a | 262 epsilon (float): The absolute-error threshold for considering a |
| 240 weight to be "equal to zero". N.B., this should be a positive | 263 weight to be "equal to zero". N.B., this should be a positive |
| 241 number, as we will compare it against the absolute value of | 264 number, as we will compare it against the absolute value of |
| 242 each weight. | 265 each weight. |
| 243 """ | 266 """ |
| 244 super(LogLinearModel, self).__init__(feature_function, weights, epsilon) | 267 super(LogLinearModel, self).__init__(feature_function, weights, |
| 268 feature_to_weight, epsilon) | |
| 245 | 269 |
| 246 self._Y = Y_given_X | 270 self._Y = Y_given_X |
| 247 | 271 |
| 248 def _LogZ(x): | 272 def _LogZ(x): |
| 249 score_given_x = self._scores(x) | 273 score_given_x = self._scores(x) |
| 250 return logsumexp([score_given_x(y) for y in self._Y(x)]) | 274 return logsumexp([score_given_x(y) for y in self._Y(x)]) |
| 251 self._zetas = MemoizedFunction(_LogZ) | 275 self._zetas = MemoizedFunction(_LogZ) |
| 252 | 276 |
| 253 def ClearWeightBasedMemos(self): | 277 def ClearWeightBasedMemos(self): |
| 254 """Clear all the memos that depend on the weight covector.""" | 278 """Clear all the memos that depend on the weight covector.""" |
| (...skipping 64 matching lines...) | |
| 319 function returns; rather, it's a sort of average of all the results | 343 function returns; rather, it's a sort of average of all the results |
| 320 returned. For more information you can take a look at Wikipedia | 344 returned. For more information you can take a look at Wikipedia |
| 321 <https://en.wikipedia.org/wiki/Expected_value>. | 345 <https://en.wikipedia.org/wiki/Expected_value>. |
| 322 """ | 346 """ |
| 323 prob_given_x = self.Probability(x) | 347 prob_given_x = self.Probability(x) |
| 324 # N.B., the ``*`` below is vector scaling! If we want to make this | 348 # N.B., the ``*`` below is vector scaling! If we want to make this |
| 325 # method polymorphic in the return type of ``f`` then we'll need an | 349 # method polymorphic in the return type of ``f`` then we'll need an |
| 326 # API that provides both scaling and ``vsum``. | 350 # API that provides both scaling and ``vsum``. |
| 327 return vsum([prob_given_x(y) * f(y) for y in self._Y(x)]) | 351 return vsum([prob_given_x(y) * f(y) for y in self._Y(x)]) |
| 328 | 352 |
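The ``Expectation`` shown above is just the probability-weighted sum of ``f`` over the support of ``Y`` given ``x``. A minimal standalone sketch for scalar-valued ``f`` (the distribution below is made up for illustration):

```python
def Expectation(prob_given_x, f, ys):
    """E[f] = sum over y of P(y | x) * f(y), for a scalar-valued f."""
    return sum(prob_given_x(y) * f(y) for y in ys)

# Hypothetical distribution over three candidates, with f(y) = y:
probs = {0: 0.5, 1: 0.25, 2: 0.25}
print(Expectation(probs.get, float, [0, 1, 2]))  # 0.5*0 + 0.25*1 + 0.25*2 = 0.75
```

The real method generalizes this by using ``vsum`` and vector scaling so ``f`` may return vectors rather than scalars.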