Chromium Code Reviews| OLD | NEW |
|---|---|
| (Empty) | |
| 1 # Copyright 2016 The Chromium Authors. All rights reserved. | |
| 2 # Use of this source code is governed by a BSD-style license that can be | |
| 3 # found in the LICENSE file. | |
| 4 | |
| 5 import math | |
| 6 import numpy as np | |
| 7 | |
| 8 from libs.math.functions import MemoizedFunction | |
| 9 from libs.math.logarithms import logsumexp | |
| 10 from libs.math.vectors import vsum | |
| 11 | |
| 12 | |
def ToFeatureFunction(fs):
  """Bundle an iterable of scalar-valued functions into one vector-valued one.

  Args:
    fs (iterable): A collection of functions, each mapping ``X`` to a
      function from ``Y`` to ``float``.

  Returns:
    A function from ``X`` to a function from ``Y`` to a ``list`` of
    ``float``, satisfying ``ToFeatureFunction(fs)(x)(y)[i] == fs[i](x)(y)``.
  """
  def _FeatureFunction(x):
    # Partially apply every component function to ``x`` up front, so that
    # repeated calls with different ``y`` values don't redo that work.
    partials = [f(x) for f in fs]

    # TODO(wrengr): allow ``fx(y)`` to be a list, and flatten everything.
    def _FeatureVector(y):
      return [g(y) for g in partials]

    return _FeatureVector

  return _FeatureFunction
| 30 | |
| 31 | |
# TODO(http://crbug.com/669639): lots of ways to make this code better.
class LogLinearModel(object):
  """A loglinear probability model.

  We use this model for combining a bunch of different features in order
  to decide who to blame. In particular, the nice thing about loglinear
  models is that (a) they allow separating the amount of weight we
  give to each feature from the feature itself, and (b) they allow us
  to keep adding features at will without needing to worry about the
  fiddly details needed for it to be a probability model.

  Throughout the documentation we use the following conventional
  terminology. The independent variable (aka: data to be classified) is
  called ``X``, with particular values for that variable called ``x``. The
  dependent variable (aka: the label returned by classification) is
  called ``Y``, with particular values for that random variable called
  ``y``. The partition function is called ``Z``. And, somewhat
  non-conventionally, we will call the log of the partition function
  ``zeta``.
  """

  # The default absolute-error threshold for considering a weight to be
  # "equal to zero", used when the caller does not supply ``epsilon``.
  _DEFAULT_EPSILON = 0.00001

  def __init__(self, Y, feature_function, weights, epsilon=None):
    """
    Args:
      Y (iterable): the entire range of values for the dependent
        variable. This is needed for computing the partition function.
      feature_function: A function from ``X`` to ``Y`` to a list of
        ``float``. N.B., for all ``x`` and ``y`` the length of
        ``feature_function(x)(y)`` must be the same as the length of
        ``weights``.
      weights (list of float): coefficients for how much we believe
        each component of the feature vector.
      epsilon (float): The absolute-error threshold for considering a
        weight to be "equal to zero". N.B., this should be a positive
        number, as we will compare it against the absolute value of
        each weight. Defaults to ``_DEFAULT_EPSILON``.
    """
    self._Y = frozenset(Y)

    if epsilon is None:
      epsilon = self._DEFAULT_EPSILON

    # Clamp weights within ``epsilon`` of zero to exactly zero, so that
    # properties like ``l0`` only count weights we actually believe in.
    # N.B., ``np.array`` can't take generators.
    self._weights = np.array([
        w if isinstance(w, float) and math.fabs(w) >= epsilon else 0.0
        for w in weights])

    def _FeaturesMemoizedOnY(x):
      fx = feature_function(x)
      def _FeaturesCoercedToCovector(y):
        # N.B., ``np.array`` can't take generators.
        fxy = np.array(list(fx(y)))
        # Raise explicitly rather than via ``assert`` so the check is not
        # stripped when Python runs with ``-O``. N.B., we're assuming that
        # ``len(self.weights)`` is O(1).
        if len(fxy) != len(self.weights):
          raise TypeError(
              "vector length mismatch: %d != %d"
              % (len(fxy), len(self.weights)))
        return fxy
      return MemoizedFunction(_FeaturesCoercedToCovector)
    self._features = MemoizedFunction(_FeaturesMemoizedOnY)

    # N.B., this is just the inner product of ``self.weights``
    # against ``self._features(x)``. If we can compute this in some
    # more efficient way, we should. In particular, we will want to
    # make the weights sparse, in which case we need to use a sparse
    # variant of the dot product.
    self._scores = MemoizedFunction(lambda x:
        self._features(x).map(self.weights.dot))

    def _LogZ(x):
      score_given_x = self._scores(x)
      return logsumexp([score_given_x(y) for y in self._Y])
    self._zetas = MemoizedFunction(_LogZ)

    # Memo for the ``quadrance`` property; ``None`` means "not computed yet".
    self._quadrance = None

  def ClearWeightBasedMemos(self):
    """Clear all the memos that depend on the weight covector."""
    self._quadrance = None
    self._scores.ClearMemos()
    self._zetas.ClearMemos()

  def ClearAllMemos(self):
    """Clear all memos, even those independent of the weight covector."""
    self.ClearWeightBasedMemos()
    self._features.ClearMemos()

  @property
  def weights(self):
    """The weight covector.

    At present we return the weights as an ``np.ndarray``, but in the
    future that may be replaced by a more general type which specifies
    the semantics rather than the implementation details.
    """
    return self._weights

  @property
  def l0(self):  # pragma: no cover
    """The l0-norm of the weight covector.

    N.B., despite being popularly called the "l0-norm", this isn't
    actually a norm in the mathematical sense."""
    return float(np.count_nonzero(self.weights))

  @property
  def l1(self):  # pragma: no cover
    """The l1 (aka: Manhattan) norm of the weight covector."""
    return math.fsum(math.fabs(w) for w in self.weights)

  @property
  def quadrance(self):
    """The square of the l2 norm of the weight covector.

    This value is often more helpful to have direct access to, as it
    avoids the need for non-rational functions (e.g., sqrt) and shows up
    as its own quantity in many places. Also, computing it directly avoids
    the error introduced by squaring the square-root of an IEEE-754 float.
    """
    if self._quadrance is None:
      # ``w * w`` is already non-negative, so no ``fabs`` is needed; and
      # ``math.fsum`` keeps the accumulation accurate.
      self._quadrance = math.fsum(w * w for w in self.weights)

    return self._quadrance

  @property
  def l2(self):
    """The l2 (aka Euclidean) norm of the weight covector.

    If you need the square of the l2-norm, do not use this property.
    Instead, use the ``quadrance`` property which is more accurate than
    squaring this one.
    """
    return math.sqrt(self.quadrance)

  # TODO(wrengr): avoid the extra indirection from this method to the
  # MemoizedFunction stored in the field.
  def Features(self, x):
    """Return a function mapping ``y`` to its feature vector given ``x``.

    Args:
      x (X): the value of the independent variable.

    Returns:
      A ``MemoizedFunction`` from ``Y`` to an ``np.array`` of ``float``.
    """
    return self._features(x)

  # TODO(wrengr): avoid the extra indirection from this method to the
  # MemoizedFunction stored in the field.
  def Score(self, x):
    """Return a function mapping ``y`` to its "score" given ``x``.

    Semantically, the "score" of ``y`` given ``x`` is the
    unnormalized log-domain conditional probability of ``y`` given
    ``x``. Operationally, we compute this by taking the inner product
    of the weight covector with the ``Features(x)(y)`` vector. The
    conditional probability itself can be computed by exponentiating
    the "score" and normalizing with the partition function. However,
    for determining the "best" ``y`` we don't need to bother computing
    the probability, because ``Score(x)`` is monotonic with respect to
    ``Probability(x)``.

    Args:
      x (X): the value of the independent variable.

    Returns:
      A ``MemoizedFunction`` from ``Y`` to ``float``.
    """
    return self._scores(x)

  def Z(self, x):
    """The normal-domain conditional partition-function given ``x``.

    If you need the log of the partition function, do not use this
    method. Instead, use the ``logZ`` method which computes it directly
    and thus is more accurate than taking the log of this one.

    Args:
      x (X): the value of the independent variable.

    Returns:
      The normalizing constant of ``Probability(x)``.
    """
    return math.exp(self.logZ(x))

  # TODO(wrengr): avoid the extra indirection from this method to the
  # MemoizedFunction stored in the field.
  def logZ(self, x):
    """The log-domain conditional partition-function given ``x``.

    Args:
      x (X): the value of the independent variable.

    Returns:
      The normalizing constant of ``LogProbability(x)``.
    """
    return self._zetas(x)

  def Probability(self, x):
    """The normal-domain distribution over ``y`` given ``x``.

    That is, ``self.Probability(x)(y)`` returns ``p(y | x; self.weights)``
    which is the model's estimation of ``Pr(y|x)``.

    If you need the log-probability, don't use this method. Instead,
    use the ``LogProbability`` method which computes it directly and
    thus is more accurate than taking the log of this one.
    """
    logprob_given_x = self.LogProbability(x)
    return lambda y: math.exp(logprob_given_x(y))

  def LogProbability(self, x):
    """The log-domain distribution over ``y`` given ``x``.

    That is, ``self.LogProbability(x)(y)`` returns the log of
    ``self.Probability(x)(y)``. However, this method computes the
    log-probabilities directly and so is more accurate and efficient
    than taking the log of ``Probability``.
    """
    zeta_x = self.logZ(x)
    score_given_x = self.Score(x)
    return lambda y: score_given_x(y) - zeta_x

  def Expectation(self, x, f):
    """Compute the expectation of a function with respect to ``Probability(x)``.

    Args:
      x (X): the value of the independent variable.
      f (...): a function from ``Y`` to ``np.ndarray`` of ``float``.

    Returns:
      An ``np.ndarray`` of ``float``
    """
    prob_given_x = self.Probability(x)
    # N.B., the ``*`` below is vector scaling! If we want to make this
    # method polymorphic in the return type of ``f`` then we'll need an
    # API that provides both scaling and ``vsum``.
    return vsum([prob_given_x(y) * f(y) for y in self._Y])
| 267 | |
| OLD | NEW |