| Index: appengine/findit/crash/loglinear.py |
| diff --git a/appengine/findit/crash/loglinear.py b/appengine/findit/crash/loglinear.py |
| new file mode 100644 |
| index 0000000000000000000000000000000000000000..cd54981c730e21a08558ef86b86724806cce7ccf |
| --- /dev/null |
| +++ b/appengine/findit/crash/loglinear.py |
| @@ -0,0 +1,267 @@ |
| +# Copyright 2016 The Chromium Authors. All rights reserved. |
| +# Use of this source code is governed by a BSD-style license that can be |
| +# found in the LICENSE file. |
| + |
| +import math |
| +import numpy as np |
| + |
| +from libs.math.functions import MemoizedFunction |
| +from libs.math.logarithms import logsumexp |
| +from libs.math.vectors import vsum |
| + |
| +# Threshold below which an absolute weight is treated as zero. |
| +EPSILON = 0.00001 |
| + |
| + |
| +def ToFeatureFunction(fs): |
| + """Given an array of scalar-valued functions, return a vector-valued function. |
| + |
| + Args: |
| + fs (iterable): A collection of functions from ``X`` to functions from |
|
Sharu Jiang
2016/12/06 20:49:19
X -> Y -> float should be more clear.
wrengr
2016/12/07 00:55:38
Can do. I wasn't sure if folks were comfortable wi
|
| + ``Y`` to ``float``. |
| + |
| + Returns: |
| + A function from ``X`` to a function from ``Y`` to a ``list`` of |
| + ``float`` where ``ToFeatureFunction(fs)(x)(y)[i] == fs[i](x)(y)``. |
| + """ |
| + def _FeatureFunction(x): |
|
Sharu Jiang
2016/12/06 20:49:19
What would X look like? (regression or stacktrace
wrengr
2016/12/07 00:55:38
X will be whatever information is given a priori a
Sharu Jiang
2016/12/07 18:40:04
Acknowledged.
|
| + fxs = [f(x) for f in fs] |
| + # TODO(wrengr): allow ``fx(y)`` to be a list, and flatten everything. |
| + return lambda y: [fx(y) for fx in fxs] |
| + |
| + return _FeatureFunction |
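A quick sketch of how ``ToFeatureFunction`` composes scalar features (toy
functions invented for illustration, not part of this CL):

    f0 = lambda x: lambda y: float(x == y)   # indicator feature
    f1 = lambda x: lambda y: float(len(y))   # length feature
    feature = ToFeatureFunction([f0, f1])
    assert feature('ab')('ab') == [1.0, 2.0]  # [f0('ab')('ab'), f1('ab')('ab')]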
| + |
| + |
| +# TODO(http://crbug.com/669639): lots of ways to make this code better. |
| +class LogLinearModel(object): |
| + """A loglinear probability model. |
| + |
| + We use this model for combining a bunch of different features in order |
| + to decide who to blame. In particular, the nice thing about loglinear |
| + models is that (a) they allow separating the amount of weight we |
| + give to each feature from the feature itself, and (b) they allow us |
| + to keep adding features at will without needing to worry about the |
| + fiddly details needed for it to be a probability model. |
| + |
| + Throughout the documentation we use the following conventional |
| + terminology. The independent variable (aka: data to be classified) is |
| + called ``X``, with particular values for that variable called ``x``. The |
| + dependent variable (aka: the label returned by classification) is |
| + called ``Y``, with particular values for that random variable called |
| + ``y``. The partition function is called ``Z``. And, somewhat |
| + non-conventionally, we will call the log of the partition function |
| + ``zeta``. |
| + """ |
| + |
| + def __init__(self, Y, feature_function, weights, epsilon=None): |
| + """ |
| + Args: |
| +      Y (iterable): the entire range of values for the dependent |
| +        variable. This is needed for computing the partition function. |
| + feature_function: A function from ``X`` to ``Y`` to a list of |
| + ``float``. N.B., for all ``x`` and ``y`` the length of |
| + ``feature_function(x)(y)`` must be the same as the length of |
| + ``weights``. |
| + weights (list of float): coefficients for how much we believe |
| + each component of the feature vector. |
| +      epsilon (float): The absolute-error threshold for considering a |
| +        weight to be "equal to zero". N.B., this should be a positive |
| +        number, as we will compare it against the absolute value of |
| +        each weight. Defaults to the module-level ``EPSILON`` when |
| +        ``None``. |
| + """ |
| + self._Y = frozenset(Y) |
|
Sharu Jiang
2016/12/07 00:05:06
Instead of passing all Y(changelogs) into LogLinea
wrengr
2016/12/07 00:55:38
Could do that. Of course, those constants need to
|
| + |
| +    if epsilon is None: |
| +      epsilon = EPSILON |
|
Sharu Jiang
2016/12/06 20:49:19
we should define a EPSILON constant.
wrengr
2016/12/07 00:55:38
Done.
|
| + # N.B., ``np.array`` can't take generators. |
| + self._weights = np.array([ |
| +        float(w) if math.fabs(w) >= epsilon else 0.0 |
| + for w in weights]) |
| + |
| + def _FeaturesMemoizedOnY(x): |
| + fx = feature_function(x) |
| + def _FeaturesCoercedToCovector(y): |
| + # N.B., ``np.array`` can't take generators. |
| + fxy = np.array(list(fx(y))) |
| + # N.B., we're assuming that ``len(self.weights)`` is O(1). |
| + assert len(fxy) == len(self.weights), TypeError( |
| + "vector length mismatch: %d != %d" % (len(fxy), len(self.weights))) |
| + return fxy |
| + return MemoizedFunction(_FeaturesCoercedToCovector) |
| + self._features = MemoizedFunction(_FeaturesMemoizedOnY) |
| + |
| + # N.B., this is just the inner product of ``self.weights`` |
| + # against ``self._features(x)``. If we can compute this in some |
| + # more efficient way, we should. In particular, we will want to |
| +    # make the weights sparse, in which case we need to use a sparse |
| + # variant of the dot product. |
| + self._scores = MemoizedFunction(lambda x: |
| + self._features(x).map(self.weights.dot)) |
| + |
| + def _LogZ(x): |
| + score_given_x = self._scores(x) |
| + return logsumexp([score_given_x(y) for y in self._Y]) |
| + self._zetas = MemoizedFunction(_LogZ) |
| + |
| + self._quadrance = None |
| + |
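A minimal construction sketch (toy values; assumes this module and its
``libs.math`` dependencies are importable, and reuses the illustrative
features from the ``ToFeatureFunction`` sketch above):

    f0 = lambda x: lambda y: float(x == y)
    f1 = lambda x: lambda y: float(len(y))
    model = LogLinearModel(
        Y=['a', 'bb'],
        feature_function=ToFeatureFunction([f0, f1]),
        weights=[2.0, 0.5])  # the docstring asks for floats

Later sketches below reuse this toy ``model``.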
| + def ClearWeightBasedMemos(self): |
| + """Clear all the memos that depend on the weight covector.""" |
| + self._quadrance = None |
| + self._scores.ClearMemos() |
| + self._zetas.ClearMemos() |
| + |
| + def ClearAllMemos(self): |
| + """Clear all memos, even those independent of the weight covector.""" |
| + self.ClearWeightBasedMemos() |
| + self._features.ClearMemos() |
| + |
| + @property |
| + def weights(self): |
| + """The weight covector. |
| + |
| + At present we return the weights as an ``np.ndarray``, but in the |
| + future that may be replaced by a more general type which specifies |
| + the semantics rather than the implementation details. |
| + """ |
| + return self._weights |
| + |
| + @property |
| + def l0(self): # pragma: no cover |
| + """The l0-norm of the weight covector. |
| + |
| + N.B., despite being popularly called the "l0-norm", this isn't |
| + actually a norm in the mathematical sense.""" |
| + return float(np.count_nonzero(self.weights)) |
| + |
| + @property |
| + def l1(self): # pragma: no cover |
| + """The l1 (aka: Manhattan) norm of the weight covector.""" |
| + return math.fsum(math.fabs(w) for w in self.weights) |
| + |
| + @property |
| + def quadrance(self): |
| + """The square of the l2 norm of the weight covector. |
| + |
| + This value is often more helpful to have direct access to, as it |
| + avoids the need for non-rational functions (e.g., sqrt) and shows up |
| + as its own quantity in many places. Also, computing it directly avoids |
| + the error introduced by squaring the square-root of an IEEE-754 float. |
| + """ |
| + if self._quadrance is None: |
| + self._quadrance = math.fsum(math.fabs(w)**2 for w in self.weights) |
| + |
| + return self._quadrance |
| + |
| + @property |
| + def l2(self): |
| + """The l2 (aka Euclidean) norm of the weight covector. |
| + |
| + If you need the square of the l2-norm, do not use this property. |
| + Instead, use the ``quadrance`` property which is more accurate than |
| + squaring this one. |
| + """ |
| + return math.sqrt(self.quadrance) |
| + |
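On the toy ``model`` above, the four weight-covector properties work out to
(hand-computed, for illustration only):

    model.l0         # 2.0  -- two nonzero weights
    model.l1         # 2.5  == |2.0| + |0.5|
    model.quadrance  # 4.25 == 2.0**2 + 0.5**2
    model.l2         # ~2.0616 == sqrt(4.25)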
| + # TODO(wrengr): avoid the extra indirection from this method to the |
| + # MemoizedFunction stored in the field. |
| + def Features(self, x): |
| + """Return a function mapping ``y`` to its feature vector given ``x``. |
| + |
| + Args: |
| +      x (X): the value of the independent variable. |
| + |
| + Returns: |
| + A ``MemoizedFunction`` from ``Y`` to an ``np.array`` of ``float``. |
| + """ |
| + return self._features(x) |
| + |
| + # TODO(wrengr): avoid the extra indirection from this method to the |
| + # MemoizedFunction stored in the field. |
| + def Score(self, x): |
| + """Return a function mapping ``y`` to its "score" given ``x``. |
| + |
| + Semantically, the "score" of ``y`` given ``x`` is the |
| + unnormalized log-domain conditional probability of ``y`` given |
| + ``x``. Operationally, we compute this by taking the inner product |
| + of the weight covector with the ``Features(x)(y)`` vector. The |
| + conditional probability itself can be computed by exponentiating |
| + the "score" and normalizing with the partition function. However, |
| + for determining the "best" ``y`` we don't need to bother computing |
| + the probability, because ``Score(x)`` is monotonic with respect to |
| + ``Probability(x)``. |
| + |
| + Args: |
| +      x (X): the value of the independent variable. |
| + |
| + Returns: |
| + A ``MemoizedFunction`` from ``Y`` to ``float``. |
| + """ |
| + return self._scores(x) |
| + |
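Since ``Score(x)`` is monotonic in the conditional probability, ranking
candidates needs only scores. With the toy ``model`` above:

    score = model.Score('a')
    score('a')   # 2.0*1.0 + 0.5*1.0 == 2.5
    score('bb')  # 2.0*0.0 + 0.5*2.0 == 1.0
    best = max(['a', 'bb'], key=score)  # 'a'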
| + def Z(self, x): |
| + """The normal-domain conditional partition-function given ``x``. |
| + |
| + If you need the log of the partition function, do not use this |
| +    method. Instead, use the ``LogZ`` method which computes it directly |
| + and thus is more accurate than taking the log of this one. |
| + |
| + Args: |
| +      x (X): the value of the independent variable. |
|
Sharu Jiang
2016/12/07 00:05:06
I think x is independent variable?
wrengr
2016/12/07 00:55:38
yep, good catch.
|
| + |
| + Returns: |
| + The normalizing constant of ``Probability(x)``. |
| + """ |
| +    return math.exp(self.LogZ(x)) |
| + |
| + # TODO(wrengr): avoid the extra indirection from this method to the |
| + # MemoizedFunction stored in the field. |
| +  def LogZ(self, x): |
|
Sharu Jiang
2016/12/07 00:05:06
To be consistent with other names, this should be
wrengr
2016/12/07 00:55:38
Done.
|
| + """The log-domain conditional partition-function given ``x``. |
| + |
| + Args: |
| +      x (X): the value of the independent variable. |
| + |
| + Returns: |
| + The normalizing constant of ``LogProbability(x)``. |
| + """ |
| + return self._zetas(x) |
| + |
| + def Probability(self, x): |
| + """The normal-domain distribution over ``y`` given ``x``. |
| + |
| + That is, ``self.Probability(x)(y)`` returns ``p(y | x; self.weights)`` |
| + which is the model's estimation of ``Pr(y|x)``. |
| + |
| + If you need the log-probability, don't use this method. Instead, |
| + use the ``LogProbability`` method which computes it directly and |
| + thus is more accurate than taking the log of this one. |
| + """ |
| + logprob_given_x = self.LogProbability(x) |
| + return lambda y: math.exp(logprob_given_x(y)) |
| + |
| + def LogProbability(self, x): |
| + """The log-domain distribution over ``y`` given ``x``. |
| + |
| + That is, ``self.LogProbability(x)(y)`` returns the log of |
| + ``self.Probability(x)(y)``. However, this method computes the |
| + log-probabilities directly and so is more accurate and efficient |
| + than taking the log of ``Probability``. |
| + """ |
| +    zeta_x = self.LogZ(x) |
| + score_given_x = self.Score(x) |
| + return lambda y: score_given_x(y) - zeta_x |
| + |
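A normalization sanity check on the toy ``model`` above (values hand-computed
for illustration):

    p = model.Probability('a')
    p('a')   # exp(2.5) / (exp(2.5) + exp(1.0)) ~= 0.8176
    p('bb')  # exp(1.0) / (exp(2.5) + exp(1.0)) ~= 0.1824
    assert abs(p('a') + p('bb') - 1.0) < 1e-9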
| + def Expectation(self, x, f): |
| + """Compute the expectation of a function with respect to ``Probability(x)``. |
|
Sharu Jiang
2016/12/07 00:05:06
Probability and Expectation are only used for trai
wrengr
2016/12/07 00:55:38
Semantically there's actually a three-way split ba
|
| + |
| + Args: |
| +      x (X): the value of the independent variable. |
| +      f (callable): a function from ``Y`` to ``np.ndarray`` of ``float``. |
| + |
| +    Returns: |
| +      An ``np.ndarray`` of ``float``. |
| + """ |
| + prob_given_x = self.Probability(x) |
| + # N.B., the ``*`` below is vector scaling! If we want to make this |
| + # method polymorphic in the return type of ``f`` then we'll need an |
| + # API that provides both scaling and ``vsum``. |
| + return vsum([prob_given_x(y) * f(y) for y in self._Y]) |
| + |
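One standard use of ``Expectation`` is the expected feature vector, which is
the second term in the gradient of a loglinear model's conditional
log-likelihood, ``f(x, y) - E[f(x, Y) | x]`` (textbook loglinear math, not
stated in this CL). With the toy ``model`` above:

    observed = model.Features('a')('a')                     # np.array([1.0, 1.0])
    expected = model.Expectation('a', model.Features('a'))  # E[f('a', Y) | x='a']
    gradient = observed - expected                          # ascent direction for the weights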