Issue 2517383005: Implementing loglinear classification (without training), for CL classification

Issue 2517383005: Implementing loglinear classification (without training), for CL classification (Closed)

Created:
4 years, 1 month ago by wrengr

Modified:
4 years ago

Reviewers:
Sharu Jiang, Martin Barbella, inferno

CC:
chromium-reviews, infra-reviews+infra_chromium.org, aarya, stgao

Target Ref:
refs/heads/master

Project:
infra

Visibility:
Public.

More Reviews

Description

Implementing loglinear models, for CL classification This CL adds an implementation of loglinear probability models, and translates the ./crash/scorers into feature functions for use with those models. It does not, however, put LLMs in place for being used as the primary algorithm for CL classification just yet. That should be done in a separate CL accompanied by delta testing to ensure it is as effective as we'd like. BUG= TBR=stgao@chromium.org Committed: https://chromium.googlesource.com/infra/infra/+/0f32beb2e7fb069726d9f856b9af55463d5c38ce

Patch Set 1 : Breaking apart the CL #

Patch Set 2 : rebase #

Patch Set 3 : rebase #

Patch Set 4 : rebase #

Patch Set 5 : rebase #

Patch Set 6 : linting #

Patch Set 7 : coverage tests for loglinear.py #

Patch Set 8 : adding tests for MinDistanceFeature #

Patch Set 9 : adding more tests #

Patch Set 10 : rebase #

Total comments: 28

Patch Set 11 : Removed the changelist_features directory for this CL. Will re-add in another CL #

Patch Set 12 : addressing nits #

Patch Set 13 : breaking apart normalized and unnormalized models #

Total comments: 2

Created: 4 years ago

Download [raw] [tar.bz2]

	Unified diffs	Side-by-side diffs	Delta from patch set	Stats (+382 lines, -4 lines)			Patch
A	appengine/findit/crash/loglinear.py	View	1 2 3 4 5 6 7 8 9 10 11 12	1 chunk	+302 lines, -0 lines	2 comments	Download
M	appengine/findit/crash/results.py	View	1 2 3 4 5	3 chunks	+5 lines, -2 lines	0 comments	Download
A	appengine/findit/crash/test/loglinear_test.py	View	1 2 3 4 5 6 7 8 9 10 11	1 chunk	+73 lines, -0 lines	0 comments	Download
M	appengine/findit/libs/math/functions.py	View	1 2 3 4 5 6 7 8 9	1 chunk	+1 line, -1 line	0 comments	Download
M	appengine/findit/libs/math/test/functions_test.py	View	1 2 3 4 5 6 7 8 9	1 chunk	+1 line, -1 line	0 comments	Download

Dependent Patchsets:

Issue 2560723005 Patch 1

Issue 2560723005 Patch 20001

Messages

Total messages: 31 (14 generated)

Expand Messages | Collapse Messages | Show Generated Messages | Hide Generated Messages

wrengr

Here's a preliminary version of the CL. I still have to write the unittests, and ...

4 years ago (2016-12-01 00:49:52 UTC) #7

Sharu Jiang

Reviewing, however I still have several questions in the doc, can you help me answer ...

4 years ago (2016-12-01 18:51:07 UTC) #8

wrengr

Looking at the doc comments now https://codereview.chromium.org/2517383005/diff/40001/appengine/findit/crash/changelist_classifier_features/__init__.py File appengine/findit/crash/changelist_classifier_features/__init__.py (right): https://codereview.chromium.org/2517383005/diff/40001/appengine/findit/crash/changelist_classifier_features/__init__.py#newcode1 appengine/findit/crash/changelist_classifier_features/__init__.py:1: # Copyright 2016 ...

4 years ago (2016-12-01 19:08:09 UTC) #9

wrengr

On 2016/12/01 22:25:51, Martin Barbella wrote: > Could you post the link to the doc ...

4 years ago (2016-12-01 22:28:15 UTC) #11

wrengr

I've uploaded a version of loglinear_test.py which gets 100% coverage, but it's terrible as actual ...

4 years ago (2016-12-02 23:34:11 UTC) #14

inferno

lgtm with nits. Sharu - can you take a look at this as well. https://codereview.chromium.org/2517383005/diff/240001/appengine/findit/crash/changelist_features/min_distance.py ...

4 years ago (2016-12-06 18:07:06 UTC) #18

Sharu Jiang

Publish some reviews first, still need to take a further look. https://codereview.chromium.org/2517383005/diff/240001/appengine/findit/crash/changelist_features/min_distance.py File appengine/findit/crash/changelist_features/min_distance.py (right): ...

4 years ago (2016-12-06 20:49:20 UTC) #19

Sharu Jiang

https://codereview.chromium.org/2517383005/diff/240001/appengine/findit/crash/loglinear.py File appengine/findit/crash/loglinear.py (right): https://codereview.chromium.org/2517383005/diff/240001/appengine/findit/crash/loglinear.py#newcode69 appengine/findit/crash/loglinear.py:69: self._Y = frozenset(Y) Instead of passing all Y(changelogs) into ...

4 years ago (2016-12-07 00:05:07 UTC) #20

wrengr

https://codereview.chromium.org/2517383005/diff/240001/appengine/findit/crash/changelist_features/test/min_distance_test.py File appengine/findit/crash/changelist_features/test/min_distance_test.py (right): https://codereview.chromium.org/2517383005/diff/240001/appengine/findit/crash/changelist_features/test/min_distance_test.py#newcode13 appengine/findit/crash/changelist_features/test/min_distance_test.py:13: class MinDistanceTest(ScorerTestSuite): On 2016/12/06 20:49:19, Sharu Jiang wrote: > ...

4 years ago (2016-12-07 00:55:38 UTC) #21

https://codereview.chromium.org/2517383005/diff/240001/appengine/findit/crash...
File appengine/findit/crash/changelist_features/test/min_distance_test.py
(right):

https://codereview.chromium.org/2517383005/diff/240001/appengine/findit/crash...
appengine/findit/crash/changelist_features/test/min_distance_test.py:13: class
MinDistanceTest(ScorerTestSuite):
On 2016/12/06 20:49:19, Sharu Jiang wrote:
> Now that we changed Scorer -> Feature, we should also change the
ScorerTestSuite
> to FeatureTestSuite.

Will do. I actually just pushed a revised CL that removes the
changelist_features directory for this CL. Reason being that their types weren't
quite right. I will re-add them in the CL that actually uses loglinear models
for the changelist classifier, and will be sure to rename that file in/for that
CL.

https://codereview.chromium.org/2517383005/diff/240001/appengine/findit/crash...
File appengine/findit/crash/changelist_features/top_frame_index.py (right):

https://codereview.chromium.org/2517383005/diff/240001/appengine/findit/crash...
appengine/findit/crash/changelist_features/top_frame_index.py:33:
self.max_frame_index = float(max_frame_index)
On 2016/12/06 20:49:19, Sharu Jiang wrote:
> why converting the max_frame_index to float here? I think it's naturally an
> integer. we can convert it when computing ratio.

I was waffling back and forth on whether to do it here or in __call__. Put it
here just to avoid converting it repeatedly. I agree that naturally it's an
(positive) integer. Will change

https://codereview.chromium.org/2517383005/diff/240001/appengine/findit/crash...
appengine/findit/crash/changelist_features/top_frame_index.py:72: ``c1``) we can
instead use three features: ``w2*(x**2) + w1*x + w0``;
On 2016/12/06 20:49:19, Sharu Jiang wrote:
> Just a thought, we can also use kernel to achieve this if we are using Yes/No
> labels. But the problem would be among all the changelogs, there may just be
one
> or two culprits, not quite sure if this SVM idea worth a try or not.

We could use some kernel tricks to generate a whole bunch of features which are
functions of the [0,1] value TopFrameIndexFeature returns. Not sure if it'd buy
us a whole lot, but could be worth checking out if you want. I mainly want to
try this feature (and the analogous one for MinDistance) because often we find
that the relevance of things falls off more rapidly than linearly in the
distance. This probably isn't the ideal non-linear function to use, but if it
ends up getting non-zero weight that would be an indication that this line of
exploration may be worth exploring further.

Overall it's better to have a bunch of features which aren't (quite so
obviously) correlated, since that gives us a bunch of different indicators and
so a better overall picture about what's going on. This just seemed like some
low-hanging fruit for testing out the model with a few more features.

https://codereview.chromium.org/2517383005/diff/240001/appengine/findit/crash...
File appengine/findit/crash/loglinear.py (right):

https://codereview.chromium.org/2517383005/diff/240001/appengine/findit/crash...
appengine/findit/crash/loglinear.py:17: fs (iterable): A collection of functions
from ``X`` to functions from
On 2016/12/06 20:49:19, Sharu Jiang wrote:
> X -> Y -> float should be more clear.

Can do. I wasn't sure if folks were comfortable with that notation being used in
the docstrings.

https://codereview.chromium.org/2517383005/diff/240001/appengine/findit/crash...
appengine/findit/crash/loglinear.py:24: def _FeatureFunction(x):
On 2016/12/06 20:49:19, Sharu Jiang wrote:
> What would X look like? (regression or stacktrace maybe?) e.g. if fx is one
kind
> of features like LinearMinDistance, the only parameter it needs is the maximum
> distance, which is an Constant specified in the module. In this case, X is
> independent of Y->float, do we plan to make X->(Y->float) an identity
function?

X will be whatever information is given a priori and stays the same for all
possible answers we consider. For our classifiers that'd be something like the
CrashReport (perhaps with some additional info Predator computed before getting
to the classifiers). X is all the stuff we take for granted and assume to be
true no matter which y we're looking at. It's perfectly fine for features to
completely ignore the x they're given; we provide it so that if the feature
wants/needs it, then the feature has access to it.

Y will be whatever information we're trying to generate; e.g., the labels
classification will return. Ultimately we'll return the y with the highest
probability, or a list of ys in order of probability. So for the component
classifier, Y would be the set of all possible components. Similarly for the
project classifier. For the CL classifier the natural choice is to have Y be the
CLs themselves (or some simplification of CLs that's sufficient for giving good
answers), since we want to get the CL most likely to be the culprit. Alas, since
there are so many possible CLs that could be blamed, that introduces some
practical issues with using the set of CLs as Y; but that's the idea at least.

So each feature will return a real value for each pair [(x,y) for y in Y], where
x is fixed for the classification task. If we assume the weight of this feature
is positive, then mapping (x,y) to a higher value means the feature thinks y is
more likely to be the culprit. We don't need to worry about scaling the exact
value (since the weights will handle that); just so long a higher means more
likely. (If the weights are negative then the meanings get flipped: larger
values => more negative scores => less likely.)

--- 

The ToFeatureFunction function is just for pushing things around so they have
the right type. Often it's easier to define features independently, but then we
end up with a list of all the features. ToFeatureFunction squishes things around
so we end up with a single function that returns a list of values. So, for the
sake of this function it doesn't really matter what X and Y are; we can convert
list(X -> Y -> float) to X -> Y -> list(float) regardless.

https://codereview.chromium.org/2517383005/diff/240001/appengine/findit/crash...
appengine/findit/crash/loglinear.py:69: self._Y = frozenset(Y)
On 2016/12/07 00:05:06, Sharu Jiang wrote:
> Instead of passing all Y(changelogs) into LogLinear, how about just passing an
> offline computed constant Z？

Could do that. Of course, those constants need to be computed somewhere...

https://codereview.chromium.org/2517383005/diff/240001/appengine/findit/crash...
appengine/findit/crash/loglinear.py:72: epsilon = 0.00001
On 2016/12/06 20:49:19, Sharu Jiang wrote:
> we should define a EPSILON constant.

Done.

https://codereview.chromium.org/2517383005/diff/240001/appengine/findit/crash...
appengine/findit/crash/loglinear.py:207: x (X): the value of the dependent
variable.
On 2016/12/07 00:05:06, Sharu Jiang wrote:
> I think x is independent variable?

yep, good catch.

https://codereview.chromium.org/2517383005/diff/240001/appengine/findit/crash...
appengine/findit/crash/loglinear.py:216: def logZ(self, x):
On 2016/12/07 00:05:06, Sharu Jiang wrote:
> To be consistent with other names, this should be LogZ

Done.

https://codereview.chromium.org/2517383005/diff/240001/appengine/findit/crash...
appengine/findit/crash/loglinear.py:253: """Compute the expectation of a
function with respect to ``Probability(x)``.
On 2016/12/07 00:05:06, Sharu Jiang wrote:
> Probability and Expectation are only used for training not for culprit
> suggestion, how about move them to training module? So we don't need to pass
> huge set of Y to this class.

Semantically there's actually a three-way split based on what you want to get
out:
(1) scores: needs the weights and the feature function, that's it
(2) probabilities: also needs Y
(3) likelihoods: needs both Y and training_data

It makes sense to be able to return probabilities even if we don't have a
training set. Realizing this three-way split in the code is what I was
suggesting at the end of today's meeting.

wrengr

I've uploaded a new version which (a) breaks the LogLinearModel class into two based on ...

4 years ago (2016-12-07 01:22:21 UTC) #22

Sharu Jiang

https://codereview.chromium.org/2517383005/diff/240001/appengine/findit/crash/loglinear.py File appengine/findit/crash/loglinear.py (right): https://codereview.chromium.org/2517383005/diff/240001/appengine/findit/crash/loglinear.py#newcode24 appengine/findit/crash/loglinear.py:24: def _FeatureFunction(x): On 2016/12/07 00:55:38, wrengr wrote: > On ...

4 years ago (2016-12-07 18:40:04 UTC) #23

https://codereview.chromium.org/2517383005/diff/240001/appengine/findit/crash...
File appengine/findit/crash/loglinear.py (right):

https://codereview.chromium.org/2517383005/diff/240001/appengine/findit/crash...
appengine/findit/crash/loglinear.py:24: def _FeatureFunction(x):
On 2016/12/07 00:55:38, wrengr wrote:
> On 2016/12/06 20:49:19, Sharu Jiang wrote:
> > What would X look like? (regression or stacktrace maybe?) e.g. if fx is one
> kind
> > of features like LinearMinDistance, the only parameter it needs is the
maximum
> > distance, which is an Constant specified in the module. In this case, X is
> > independent of Y->float, do we plan to make X->(Y->float) an identity
> function?
> 
> X will be whatever information is given a priori and stays the same for all
> possible answers we consider. For our classifiers that'd be something like the
> CrashReport (perhaps with some additional info Predator computed before
getting
> to the classifiers). X is all the stuff we take for granted and assume to be
> true no matter which y we're looking at. It's perfectly fine for features to
> completely ignore the x they're given; we provide it so that if the feature
> wants/needs it, then the feature has access to it.
> 
> Y will be whatever information we're trying to generate; e.g., the labels
> classification will return. Ultimately we'll return the y with the highest
> probability, or a list of ys in order of probability. So for the component
> classifier, Y would be the set of all possible components. Similarly for the
> project classifier. For the CL classifier the natural choice is to have Y be
the
> CLs themselves (or some simplification of CLs that's sufficient for giving
good
> answers), since we want to get the CL most likely to be the culprit. Alas,
since
> there are so many possible CLs that could be blamed, that introduces some
> practical issues with using the set of CLs as Y; but that's the idea at least.
> 
> So each feature will return a real value for each pair [(x,y) for y in Y],
where
> x is fixed for the classification task. If we assume the weight of this
feature
> is positive, then mapping (x,y) to a higher value means the feature thinks y
is
> more likely to be the culprit. We don't need to worry about scaling the exact
> value (since the weights will handle that); just so long a higher means more
> likely. (If the weights are negative then the meanings get flipped: larger
> values => more negative scores => less likely.)
> 
> --- 
> 
> The ToFeatureFunction function is just for pushing things around so they have
> the right type. Often it's easier to define features independently, but then
we
> end up with a list of all the features. ToFeatureFunction squishes things
around
> so we end up with a single function that returns a list of values. So, for the
> sake of this function it doesn't really matter what X and Y are; we can
convert
> list(X -> Y -> float) to X -> Y -> list(float) regardless.

Acknowledged.

https://codereview.chromium.org/2517383005/diff/300001/appengine/findit/crash...
File appengine/findit/crash/loglinear.py (right):

https://codereview.chromium.org/2517383005/diff/300001/appengine/findit/crash...
appengine/findit/crash/loglinear.py:194: class
LogLinearModel(UnnormalizedLogLinearModel):
one question, so the online/offline split is between
``UnnormalizedLogLinearModel`` and ``LogLinearModel`` + ``Trainable*`` ?

wrengr

https://codereview.chromium.org/2517383005/diff/300001/appengine/findit/crash/loglinear.py File appengine/findit/crash/loglinear.py (right): https://codereview.chromium.org/2517383005/diff/300001/appengine/findit/crash/loglinear.py#newcode194 appengine/findit/crash/loglinear.py:194: class LogLinearModel(UnnormalizedLogLinearModel): On 2016/12/07 18:40:04, Sharu Jiang wrote: > ...

4 years ago (2016-12-08 20:09:07 UTC) #24

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2517383005/300001

4 years ago (2016-12-08 20:46:20 UTC) #28

commit-bot: I haz the power

4 years ago (2016-12-08 21:01:41 UTC) #31

Message was sent while issue was closed.

Committed patchset #13 (id:300001) as
https://chromium.googlesource.com/infra/infra/+/0f32beb2e7fb069726d9f856b9af5...

Expand Messages | Collapse Messages | Show Generated Messages | Hide Generated Messages