Chromium Code Reviews

Side by Side Diff: appengine/findit/crash/loglinear/training.py

Issue 2617273002: [Predator] Move ``SingleFeatureScore`` to LLM. (Closed)
Patch Set: Address comments. Created 3 years, 11 months ago
 # Copyright 2016 The Chromium Authors. All rights reserved.
 # Use of this source code is governed by a BSD-style license that can be
 # found in the LICENSE file.

 import math
 import numpy as np
 # N.B., ``np.array`` can't take generators; you must pass explicit lists.
 import scipy.optimize as spo

 from crash.loglinear.model import LogLinearModel
(...skipping 15 matching lines...)
         ``Y`` (it'll just take more computation time is all). This is
         needed for computing the partition function and expectation. N.B.,
         we do not actually need to know/enumerate *all* of ``Y``,
         only the subsets for each ``x``.
       training_data (iterable): a collection of ``(x, y)`` pairs where
         ``y`` is the known-correct label for ``x``.
       feature_function: A function from ``X`` to ``Y`` to a list of
         ``float``. N.B., the length of the list must be the same for all
         ``x`` and ``y``, and must be the same as the length of the list
         of weights.
-      initial_weights (list of float): the pre-training coefficients
+      initial_weights (dict from str to float): the pre-training coefficients
         for how much we believe components of the feature vector. This
         provides the seed for training; this starting value shouldn't
         affect the final weights obtained by training (thanks to
         convexity), but will affect how long it takes for training
         to converge.
wrengr 2017/01/12 19:09:10 Should add a note that the dictionary should not b
Sharu Jiang 2017/01/13 01:08:35 Done.
       epsilon (float): The absolute-error threshold for considering a
         weight to be "equal to zero". N.B., this should be a positive
         number, as we will compare it against the absolute value of
         each weight.
     """
     super(TrainableLogLinearModel, self).__init__(
         Y_given_X, feature_function, initial_weights, epsilon)
     self._training_data = training_data
-
+    self._feature_order = list(initial_weights.keys())
+    self._np_weights = None
wrengr 2017/01/12 19:09:10 I think it'd be clearer to set the _weights and _n
Sharu Jiang 2017/01/13 01:08:35 Done.
     self._observed_feature_vector = vsum([
         self.FeaturesAsNumPyArray(x)(y)
         for x, y in self._training_data])

-    # Even though this is identical to the superclass definition, we must
-    # re-provide it in order to define the setter.
+    self.np_weights = self.WeightsToNumPyArrayWeights(initial_weights)
+
   @property
-  def weights(self):
-    """The weight covector.
+  def np_weights(self):
+    """The NumPy Array of weight covector.
wrengr 2017/01/12 19:09:10 "of weight covector" -> "of the weight covector"
Sharu Jiang 2017/01/13 01:08:35 Done.

     At present we return the weights as an ``np.ndarray``, but in the
     future that may be replaced by a more general type which specifies
     the semantics rather than the implementation details.
wrengr 2017/01/12 19:09:10 This paragraph should be deleted, since the whole
Sharu Jiang 2017/01/13 01:08:35 Done.
     """
-    return self._weights
+    return self._np_weights

-  @weights.setter
-  def weights(self, new_weights):  # pylint: disable=W0221
-    """Mutate the weight covector, and clear memos as necessary.
+  @np_weights.setter
+  def np_weights(self, new_np_weights):  # pylint: disable=W0221
+    """Mutate the np array weight covector, and clear memos as necessary.
wrengr 2017/01/12 19:09:10 It's fine to leave off the "np array" part, since
Sharu Jiang 2017/01/13 01:08:35 Done.

     This setter attempts to avoid clearing memos whenever possible,
     but errs on the side of caution/correctness when it needs to.

     Args:
-      new_weights (np.ndarray): the new weights to use. Must have the
-        same shape as the old ``np.ndarray``.
+      new_np_weights (np.ndarray): the new weights to use. They will be
+        converted to a weights dict mapping feature_name to its weight.
     """
-    if new_weights is self._weights:
+    if np.array_equal(self._np_weights, new_np_weights):
       return

-    if not isinstance(new_weights, np.ndarray):
-      raise TypeError('Expected an np.ndarray but got %s instead'
-                      % new_weights.__class__.__name__)
-
-    if new_weights.shape != self._weights.shape:
-      raise TypeError('Weight shape mismatch: %s != %s'
-                      % (new_weights.shape, self._weights.shape))
-
+    self._np_weights = new_np_weights
     self.ClearWeightBasedMemos()
-    self._weights = new_weights
+    self._weights = self.NumPyArrayWeightsToWeights(new_np_weights)

   def FeaturesAsNumPyArray(self, x):
     """A variant of ``Features`` which returns a ``np.ndarray``.

+    Note, the features nparray should have the same order as in
wrengr 2017/01/12 19:09:10 "nparray" -> "np array"
Sharu Jiang 2017/01/13 01:08:35 Done.
+    self._feature_order to stay aligned with weights np array.
+
     For training we need to have the feature function return an
     ``np.ndarray(float)`` rather than the ``list(FeatureValue)`` used
     elsewhere. This function performs the necessary conversion.

     N.B., at present we do not memoize this function. The underlying
     ``Features`` method is memoized, so we won't re-compute the features
     each time; but we will repeatedly copy the floats into newly allocated
     ``np.ndarray`` objects. If that turns out to be a performance
     bottleneck, we can add the extra layer of memoization to avoid that.
     """
     fx = self.Features(x)
-    return lambda y: np.array([fxy.value for fxy in fx(y)])
+    def FeaturesAsNumPyArrayGivenX(y):
+      fxys = fx(y)
+      return np.array([fxys[feature_name].value if feature_name in fxys else 0.
wrengr 2017/01/12 19:09:10 If feature_name is not in fxys then that should be
Sharu Jiang 2017/01/13 01:08:35 Done.
+                       for feature_name in self._feature_order])
+
+    return FeaturesAsNumPyArrayGivenX
+
+  def WeightsAsNumPyArray(self):
wrengr 2017/01/12 19:09:10 This is unnecessary. Callers should use the ``np_w
Sharu Jiang 2017/01/13 01:08:35 Done.
+    """Returns the numpy array version of the weights.
+
+    Note, this conversion is needed because the model uses a weights dict to
+    organize weights for features; however, SciPy training (e.g. BFGS) needs
+    a numpy array to do computation.
+    """
+    return self.np_weights
+
+  def NumPyArrayWeightsToWeights(self, np_weights):
wrengr 2017/01/12 19:09:10 This "weights to weights" name is confusing. Also,
Sharu Jiang 2017/01/13 01:08:35 Done.
+    """Converts a numpy array to a dict (mapping feature name to weight).
+
+    Note, this conversion is needed because the model uses a weights dict to
+    organize weights for features; however, SciPy training (e.g. BFGS) needs
+    a numpy array to do computation.
+
+    Args:
+      np_weights (np.ndarray): Weights in the same order as
+        self._feature_order. Note, the feature np array should also be
+        serialized in the same order as self._feature_order to match.
+
+    Returns:
+      A dict mapping feature name to weight.
+    """
+    if not isinstance(np_weights, np.ndarray):
+      raise TypeError('Expected an np.ndarray but got %s instead'
+                      % np_weights.__class__.__name__)
+
+    return {feature_name: weight
+            for feature_name, weight in zip(self._feature_order, np_weights)}
+
+  def WeightsToNumPyArrayWeights(self, weights, default=0.):
wrengr 2017/01/12 19:09:10 again, name is confusing and should also be privat
Sharu Jiang 2017/01/13 01:08:35 Done.
+    """Converts a dict (mapping feature name to weight) to a numpy array."""
+    return np.array([weights.get(feature_name, default)
+                     for feature_name in self._feature_order])

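The dict ↔ ``np.ndarray`` conversion above hinges on ``self._feature_order`` fixing one stable ordering for both weights and features. A standalone sketch of the round trip (the feature names and helper function names here are hypothetical, not the class API):

```python
import numpy as np

# Hypothetical fixed feature order, standing in for self._feature_order.
feature_order = ['touch_crashed_file', 'min_distance', 'top_frame']

def weights_to_np(weights, default=0.):
  """Dict -> np.ndarray, in feature_order; missing features get `default`."""
  return np.array([weights.get(name, default) for name in feature_order])

def np_to_weights(np_weights):
  """np.ndarray -> dict, zipped against the same feature_order."""
  return dict(zip(feature_order, np_weights))

weights = {'touch_crashed_file': 1.5, 'min_distance': -0.25}
arr = weights_to_np(weights)        # array([ 1.5 , -0.25,  0.  ])
round_tripped = np_to_weights(arr)  # 'top_frame' now present with weight 0.
```

Because both directions consult the same ``feature_order`` list, the dot products computed from the array stay aligned with the per-feature dict the rest of the model uses.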
   def LogLikelihood(self):
     """The conditional log-likelihood of the training data.

     The conditional likelihood of the training data is the product
     of ``Pr(y|x)`` for each ``(x, y)`` pair in the training data; so
     the conditional log-likelihood is the log of that. This is called
     "likelihood" because it is thought of as a function of the weight
     covector, with the training data held fixed.

     This is the ideal objective function for training the weights, as it
     will give us the MLE weight covector for the training data. However,
     in practice, we want to do regularization to ensure we don't overfit
     the training data and to reduce classification time by ensuring that
     the weight vector is sparse. Thus, the actual objective function
     will be the log-likelihood plus some penalty terms for regularization.
     """
     observed_zeta = math.fsum(self.LogZ(x) for x, _ in self._training_data)
-    observed_score = self.weights.dot(self._observed_feature_vector)
+    observed_score = self.WeightsAsNumPyArray().dot(
+        self._observed_feature_vector)
     return observed_score - observed_zeta
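The quantity this method computes is ``sum_i [w . f(x_i, y_i) - log Z(x_i)]`` with ``Z(x) = sum_y exp(w . f(x, y))``. A toy check, with an invented feature function, label set, and weights (none of these names come from the module):

```python
import math
import numpy as np

# Hypothetical feature function: [indicator that y matches x, length of y].
def f(x, y):
  return np.array([1.0 if y == x else 0.0, float(len(y))])

w = np.array([2.0, -0.5])
Y_given_X = {'a': ['a', 'b'], 'b': ['a', 'b']}
training_data = [('a', 'a'), ('b', 'b')]

def log_z(x):
  # Log of the partition function over the candidate labels for x.
  return math.log(sum(math.exp(w.dot(f(x, y))) for y in Y_given_X[x]))

# Conditional log-likelihood: total observed score minus total log Z.
observed_score = sum(w.dot(f(x, y)) for x, y in training_data)
observed_zeta = math.fsum(log_z(x) for x, _ in training_data)
log_likelihood = observed_score - observed_zeta
```

Since each ``Pr(y|x)`` is at most 1, the result is always nonpositive, and it increases as the weights make the observed labels more probable.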

   def LogLikelihoodGradient(self):
     """The gradient (aka Jacobian) of ``LogLikelihood``."""
     expected_feature_vector = vsum([
         self.Expectation(x, self.FeaturesAsNumPyArray(x))
         for x, _ in self._training_data])
     return self._observed_feature_vector - expected_feature_vector

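That the gradient is "observed minus expected feature vectors" can be sanity-checked against finite differences on a toy model (everything below is illustrative, not the module's API):

```python
import math
import numpy as np

# Hypothetical two-feature toy model, as in the sketch above.
def f(x, y):
  return np.array([1.0 if y == x else 0.0, float(len(y))])

Y = ['a', 'b']
training_data = [('a', 'a'), ('b', 'b')]

def log_likelihood(w):
  ll = 0.0
  for x, y in training_data:
    z = sum(math.exp(w.dot(f(x, yp))) for yp in Y)
    ll += w.dot(f(x, y)) - math.log(z)
  return ll

def gradient(w):
  # Observed features minus model-expected features, summed over the data.
  grad = np.zeros(2)
  for x, y in training_data:
    scores = np.array([math.exp(w.dot(f(x, yp))) for yp in Y])
    probs = scores / scores.sum()
    expected = sum(p * f(x, yp) for p, yp in zip(probs, Y))
    grad += f(x, y) - expected
  return grad

w = np.array([0.3, -0.2])
# Central finite differences should agree with the analytic gradient.
eps = 1e-6
numeric = np.array([
    (log_likelihood(w + eps * e) - log_likelihood(w - eps * e)) / (2 * eps)
    for e in np.eye(2)])
```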
   def TrainWeights(self, l2_penalty):
     """Optimize the weight covector based on the training data.

     Args:
       l2_penalty (float): the hyperparameter for how much to penalize
         weight covectors far from zero.

     Returns:
       Nothing, but has the side effect of mutating the stored weights.
     """
-    initial_weights = self.weights
+    initial_np_weights = self.WeightsAsNumPyArray()

     # We want to minimize the number of times we reset the weights since
     # that clears our memos. One might think we could do that in the
     # between-iterations callback; but actually, in a single iteration,
     # BFGS calls the objective function and gradient more than once with
     # different arguments; so, alas, we must reset the weights in both.
     # This is why the ``weights`` setter tries to avoid clearing memos
     # when possible.

-    def objective_function(new_weights):
-      self.weights = new_weights
+    def objective_function(new_np_weights):
+      self.np_weights = new_np_weights
       return -self.LogLikelihood() + 0.5 * l2_penalty * self.quadrance

-    def objective_function_gradient(new_weights):
-      self.weights = new_weights
-      return -self.LogLikelihoodGradient() + l2_penalty * self.weights
+    def objective_function_gradient(new_np_weights):
+      self.np_weights = new_np_weights
+      return -self.LogLikelihoodGradient() + l2_penalty * self.np_weights

     result = spo.minimize(
         objective_function,
-        initial_weights,
+        initial_np_weights,
         method='BFGS',
         jac=objective_function_gradient)

     if not result.success:  # pragma: no cover
       # This should happen infrequently enough that there's no point in
       # logging it and attempting to carry on.
       raise Exception(
           'TrainableLogLinearModel.TrainWeights failed:'
           '\n\tReason: %s'
           '\n\tCurrent objective value: %s'
           '\n\tCurrent objective gradient: %s'
           '\n\tIterations: %d'
           '\n\tFunction evaluations: %d'
           '\n\tGradient evaluations: %d'
           % (result.message, result.fun, result.jac, result.nit, result.nfev,
              result.njev))

     # This shouldn't really be necessary, since we're resetting it
     # directly during training; but just to be safe/sure.
-    self.weights = result.x
+    self.np_weights = result.x
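The training step above, minimizing the negated L2-penalized log-likelihood with BFGS via ``spo.minimize``, can be sketched end to end on a toy model. Everything below is illustrative (invented feature function and data); it is not the Predator code path:

```python
import numpy as np
import scipy.optimize as spo

def neg_penalized_ll(w, l2_penalty):
  # Toy two-label model: feature 0 indicates the correct label,
  # feature 1 is constant (so the penalty should drive its weight to 0).
  f = lambda x, y: np.array([1.0 if y == x else 0.0, 1.0])
  data = [('a', 'a'), ('b', 'b')]
  ll = 0.0
  for x, y in data:
    scores = np.array([w.dot(f(x, yp)) for yp in ['a', 'b']])
    ll += w.dot(f(x, y)) - np.log(np.exp(scores).sum())
  # Negated log-likelihood plus the 0.5 * l2 * |w|^2 penalty term.
  return -ll + 0.5 * l2_penalty * w.dot(w)

result = spo.minimize(neg_penalized_ll, x0=np.zeros(2), args=(1.0,),
                      method='BFGS')
trained = result.x  # converged weight vector
```

Without the penalty the indicator weight would grow without bound on separable data; the L2 term makes the optimum finite, which is part of why the real ``TrainWeights`` adds it.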