Chromium Code Reviews

Side by Side Diff: appengine/findit/crash/loglinear/training.py

Issue 2617273002: [Predator] Move ``SingleFeatureScore`` to LLM. (Closed)
Patch Set: Address comments. Created 3 years, 11 months ago
 # Copyright 2016 The Chromium Authors. All rights reserved.
 # Use of this source code is governed by a BSD-style license that can be
 # found in the LICENSE file.

 import math
 import numpy as np
 # N.B., ``np.array`` can't take generators; you must pass explicit lists.
 import scipy.optimize as spo

 from crash.loglinear.model import LogLinearModel
(...skipping 15 matching lines...)
         ``Y`` (it'll just take more computation time is all). This is
         needed for computing the partition function and expectation. N.B.,
         we do not actually need to know/enumerate *all* of ``Y``,
         only the subsets for each ``x``.
       training_data (iterable): a collection of ``(x, y)`` pairs where
         ``y`` is the known-correct label for ``x``.
       feature_function: A function from ``X`` to ``Y`` to a list of
         ``float``. N.B., the length of the list must be the same for all
         ``x`` and ``y``, and must be the same as the length of the list
         of weights.
-      initial_weights (list of float): the pre-training coefficients
+      initial_weights (dict from str to float): the pre-training coefficients
         for how much we believe components of the feature vector. This
         provides the seed for training; this starting value shouldn't
         affect the final weights obtained by training (thanks to
         convexity), but will affect how long it takes for training
         to converge.
wrengr 2017/01/12 19:09:10 Should add a note that the dictionary should not b
Sharu Jiang 2017/01/13 01:08:35 Done.
       epsilon (float): The absolute-error threshold for considering a
         weight to be "equal to zero". N.B., this should be a positive
         number, as we will compare it against the absolute value of
         each weight.
     """
     super(TrainableLogLinearModel, self).__init__(
         Y_given_X, feature_function, initial_weights, epsilon)
     self._training_data = training_data
-
+    self._feature_order = list(initial_weights.keys())
+    self._np_weights = None
wrengr 2017/01/12 19:09:10 I think it'd be clearer to set the _weights and _n
Sharu Jiang 2017/01/13 01:08:35 Done.
     self._observed_feature_vector = vsum([
         self.FeaturesAsNumPyArray(x)(y)
         for x, y in self._training_data])

-    # Even though this is identical to the superclass definition, we must
-    # re-provide it in order to define the setter.
+    self.np_weights = self.WeightsToNumPyArrayWeights(initial_weights)
+
   @property
-  def weights(self):
-    """The weight covector.
+  def np_weights(self):
+    """The NumPy Array of weight covector.
wrengr 2017/01/12 19:09:10 "of weight covector" -> "of the weight covector"
Sharu Jiang 2017/01/13 01:08:35 Done.

     At present we return the weights as an ``np.ndarray``, but in the
     future that may be replaced by a more general type which specifies
     the semantics rather than the implementation details.
wrengr 2017/01/12 19:09:10 This paragraph should be deleted, since the whole
Sharu Jiang 2017/01/13 01:08:35 Done.
     """
-    return self._weights
+    return self._np_weights

-  @weights.setter
-  def weights(self, new_weights):  # pylint: disable=W0221
-    """Mutate the weight covector, and clear memos as necessary.
+  @np_weights.setter
+  def np_weights(self, new_np_weights):  # pylint: disable=W0221
+    """Mutate the np array weight covector, and clear memos as necessary.
wrengr 2017/01/12 19:09:10 It's fine to leave off the "np array" part, since
Sharu Jiang 2017/01/13 01:08:35 Done.

     This setter attempts to avoid clearing memos whenever possible,
     but errs on the side of caution/correctness when it needs to.

     Args:
-      new_weights (np.ndarray): the new weights to use. Must have the
-        same shape as the old ``np.ndarray``.
+      new_np_weights (np.ndarray): the new weights to use. They will be
+        converted to a weights dict mapping feature_name to its weight.
     """
-    if new_weights is self._weights:
+    if np.array_equal(self._np_weights, new_np_weights):
       return

-    if not isinstance(new_weights, np.ndarray):
-      raise TypeError('Expected an np.ndarray but got %s instead'
-                      % new_weights.__class__.__name__)
-
-    if new_weights.shape != self._weights.shape:
-      raise TypeError('Weight shape mismatch: %s != %s'
-                      % (new_weights.shape, self._weights.shape))
-
+    self._np_weights = new_np_weights
     self.ClearWeightBasedMemos()
-    self._weights = new_weights
+    self._weights = self.NumPyArrayWeightsToWeights(new_np_weights)

   def FeaturesAsNumPyArray(self, x):
     """A variant of ``Features`` which returns a ``np.ndarray``.

+    Note, the features nparray should have the same order as in
wrengr 2017/01/12 19:09:10 "nparray" -> "np array"
Sharu Jiang 2017/01/13 01:08:35 Done.
+    self._feature_order to stay aligned with weights np array.
+
     For training we need to have the feature function return an
     ``np.ndarray(float)`` rather than the ``list(FeatureValue)`` used
     elsewhere. This function performs the necessary conversion.

     N.B., at present we do not memoize this function. The underlying
     ``Features`` method is memoized, so we won't re-compute the features
     each time; but we will repeatedly copy the floats into newly allocated
     ``np.ndarray`` objects. If that turns out to be a performance
     bottleneck, we can add the extra layer of memoization to avoid that.
     """
     fx = self.Features(x)
-    return lambda y: np.array([fxy.value for fxy in fx(y)])
+    def FeaturesAsNumPyArrayGivenX(y):
+      fxys = fx(y)
+      return np.array([fxys[feature_name].value if feature_name in fxys else 0.
wrengr 2017/01/12 19:09:10 If feature_name is not in fxys then that should be
Sharu Jiang 2017/01/13 01:08:35 Done.
+                       for feature_name in self._feature_order])
+
+    return FeaturesAsNumPyArrayGivenX
+
+  def WeightsAsNumPyArray(self):
wrengr 2017/01/12 19:09:10 This is unnecessary. Callers should use the ``np_w
Sharu Jiang 2017/01/13 01:08:35 Done.
+    """Returns the numpy array version of the weights.
+
+    Note, this conversion is needed because the model uses a weights dict to
+    organize weights for features; however, SciPy training (e.g. BFGS) needs
+    a numpy array to do computation.
+    """
+    return self.np_weights
+
+  def NumPyArrayWeightsToWeights(self, np_weights):
wrengr 2017/01/12 19:09:10 This "weights to weights" name is confusing. Also,
Sharu Jiang 2017/01/13 01:08:35 Done.
+    """Converts a numpy array to a dict (mapping feature name to weight).
+
+    Note, this conversion is needed because the model uses a weights dict to
+    organize weights for features; however, SciPy training (e.g. BFGS) needs
+    a numpy array to do computation.
+
+    Args:
+      np_weights (np.ndarray): Weights in the same order as
+        self._feature_order. Note, the feature np array should also be
+        serialized in the same order as self._feature_order to match.
+
+    Returns:
+      A dict mapping feature name to weight.
+    """
+    if not isinstance(np_weights, np.ndarray):
+      raise TypeError('Expected an np.ndarray but got %s instead'
+                      % np_weights.__class__.__name__)
+
+    return {feature_name: weight
+            for feature_name, weight in zip(self._feature_order, np_weights)}
+
+  def WeightsToNumPyArrayWeights(self, weights, default=0.):
wrengr 2017/01/12 19:09:10 again, name is confusing and should also be privat
Sharu Jiang 2017/01/13 01:08:35 Done.
+    """Converts a dict (mapping feature name to weight) to a numpy array."""
+    return np.array([weights.get(feature_name, default)
+                     for feature_name in self._feature_order])

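The dict ↔ ``np.ndarray`` conversion above hinges on ``self._feature_order`` fixing one stable ordering for both weights and features. A standalone sketch of the round trip (the feature names and helper function names here are hypothetical, not the class API):

```python
import numpy as np

# Hypothetical fixed feature order, standing in for self._feature_order.
feature_order = ['touch_crashed_file', 'min_distance', 'top_frame']

def weights_to_np(weights, default=0.):
  """Dict -> np.ndarray, in feature_order; missing features get `default`."""
  return np.array([weights.get(name, default) for name in feature_order])

def np_to_weights(np_weights):
  """np.ndarray -> dict, zipped against the same feature_order."""
  return dict(zip(feature_order, np_weights))

weights = {'touch_crashed_file': 1.5, 'min_distance': -0.25}
arr = weights_to_np(weights)        # array([ 1.5 , -0.25,  0.  ])
round_tripped = np_to_weights(arr)  # 'top_frame' now present with weight 0.
```

Because both directions consult the same ``feature_order`` list, the dot products computed from the array stay aligned with the per-feature dict the rest of the model uses.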
   def LogLikelihood(self):
     """The conditional log-likelihood of the training data.

     The conditional likelihood of the training data is the product
     of ``Pr(y|x)`` for each ``(x, y)`` pair in the training data; so
     the conditional log-likelihood is the log of that. This is called
     "likelihood" because it is thought of as a function of the weight
     covector, with the training data held fixed.

     This is the ideal objective function for training the weights, as it
     will give us the MLE weight covector for the training data. However,
     in practice, we want to do regularization to ensure we don't overfit
     the training data and to reduce classification time by ensuring that
     the weight vector is sparse. Thus, the actual objective function
     will be the log-likelihood plus some penalty terms for regularization.
     """
     observed_zeta = math.fsum(self.LogZ(x) for x, _ in self._training_data)
-    observed_score = self.weights.dot(self._observed_feature_vector)
+    observed_score = self.WeightsAsNumPyArray().dot(
+        self._observed_feature_vector)
     return observed_score - observed_zeta
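The quantity this method computes is ``sum_i [w . f(x_i, y_i) - log Z(x_i)]`` with ``Z(x) = sum_y exp(w . f(x, y))``. A toy check, with an invented feature function, label set, and weights (none of these names come from the module):

```python
import math
import numpy as np

# Hypothetical feature function: [indicator that y matches x, length of y].
def f(x, y):
  return np.array([1.0 if y == x else 0.0, float(len(y))])

w = np.array([2.0, -0.5])
Y_given_X = {'a': ['a', 'b'], 'b': ['a', 'b']}
training_data = [('a', 'a'), ('b', 'b')]

def log_z(x):
  # Log of the partition function over the candidate labels for x.
  return math.log(sum(math.exp(w.dot(f(x, y))) for y in Y_given_X[x]))

# Conditional log-likelihood: total observed score minus total log Z.
observed_score = sum(w.dot(f(x, y)) for x, y in training_data)
observed_zeta = math.fsum(log_z(x) for x, _ in training_data)
log_likelihood = observed_score - observed_zeta
```

Since each ``Pr(y|x)`` is at most 1, the result is always nonpositive, and it increases as the weights make the observed labels more probable.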

   def LogLikelihoodGradient(self):
     """The gradient (aka Jacobian) of ``LogLikelihood``."""
     expected_feature_vector = vsum([
         self.Expectation(x, self.FeaturesAsNumPyArray(x))
         for x, _ in self._training_data])
     return self._observed_feature_vector - expected_feature_vector

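That the gradient is "observed minus expected feature vectors" can be sanity-checked against finite differences on a toy model (everything below is illustrative, not the module's API):

```python
import math
import numpy as np

# Hypothetical two-feature toy model, as in the sketch above.
def f(x, y):
  return np.array([1.0 if y == x else 0.0, float(len(y))])

Y = ['a', 'b']
training_data = [('a', 'a'), ('b', 'b')]

def log_likelihood(w):
  ll = 0.0
  for x, y in training_data:
    z = sum(math.exp(w.dot(f(x, yp))) for yp in Y)
    ll += w.dot(f(x, y)) - math.log(z)
  return ll

def gradient(w):
  # Observed features minus model-expected features, summed over the data.
  grad = np.zeros(2)
  for x, y in training_data:
    scores = np.array([math.exp(w.dot(f(x, yp))) for yp in Y])
    probs = scores / scores.sum()
    expected = sum(p * f(x, yp) for p, yp in zip(probs, Y))
    grad += f(x, y) - expected
  return grad

w = np.array([0.3, -0.2])
# Central finite differences should agree with the analytic gradient.
eps = 1e-6
numeric = np.array([
    (log_likelihood(w + eps * e) - log_likelihood(w - eps * e)) / (2 * eps)
    for e in np.eye(2)])
```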
   def TrainWeights(self, l2_penalty):
     """Optimize the weight covector based on the training data.

     Args:
       l2_penalty (float): the hyperparameter for how much to penalize
         weight covectors far from zero.

     Returns:
       Nothing, but has the side effect of mutating the stored weights.
     """
-    initial_weights = self.weights
+    initial_np_weights = self.WeightsAsNumPyArray()

     # We want to minimize the number of times we reset the weights since
     # that clears our memos. One might think we could do that in the
     # between-iterations callback; but actually, in a single iteration,
     # BFGS calls the objective function and gradient more than once with
     # different arguments; so, alas, we must reset the weights in both.
     # This is why the ``weights`` setter tries to avoid clearing memos
     # when possible.

-    def objective_function(new_weights):
-      self.weights = new_weights
+    def objective_function(new_np_weights):
+      self.np_weights = new_np_weights
       return -self.LogLikelihood() + 0.5 * l2_penalty * self.quadrance

-    def objective_function_gradient(new_weights):
-      self.weights = new_weights
-      return -self.LogLikelihoodGradient() + l2_penalty * self.weights
+    def objective_function_gradient(new_np_weights):
+      self.np_weights = new_np_weights
+      return -self.LogLikelihoodGradient() + l2_penalty * self.np_weights

     result = spo.minimize(
         objective_function,
-        initial_weights,
+        initial_np_weights,
         method='BFGS',
         jac=objective_function_gradient)

     if not result.success:  # pragma: no cover
       # This should happen infrequently enough that there's no point in
       # logging it and attempting to carry on.
       raise Exception(
           'TrainableLogLinearModel.TrainWeights failed:'
           '\n\tReason: %s'
           '\n\tCurrent objective value: %s'
           '\n\tCurrent objective gradient: %s'
           '\n\tIterations: %d'
           '\n\tFunction evaluations: %d'
           '\n\tGradient evaluations: %d'
           % (result.message, result.fun, result.jac, result.nit, result.nfev,
              result.njev))

     # This shouldn't really be necessary, since we're resetting it
     # directly during training; but just to be safe/sure.
-    self.weights = result.x
+    self.np_weights = result.x
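The training step above, minimizing the negated L2-penalized log-likelihood with BFGS via ``spo.minimize``, can be sketched end to end on a toy model. Everything below is illustrative (invented feature function and data); it is not the Predator code path:

```python
import numpy as np
import scipy.optimize as spo

def neg_penalized_ll(w, l2_penalty):
  # Toy two-label model: feature 0 indicates the correct label,
  # feature 1 is constant (so the penalty should drive its weight to 0).
  f = lambda x, y: np.array([1.0 if y == x else 0.0, 1.0])
  data = [('a', 'a'), ('b', 'b')]
  ll = 0.0
  for x, y in data:
    scores = np.array([w.dot(f(x, yp)) for yp in ['a', 'b']])
    ll += w.dot(f(x, y)) - np.log(np.exp(scores).sum())
  # Negated log-likelihood plus the 0.5 * l2 * |w|^2 penalty term.
  return -ll + 0.5 * l2_penalty * w.dot(w)

result = spo.minimize(neg_penalized_ll, x0=np.zeros(2), args=(1.0,),
                      method='BFGS')
trained = result.x  # converged weight vector
```

Without the penalty the indicator weight would grow without bound on separable data; the L2 term makes the optimum finite, which is part of why the real ``TrainWeights`` adds it.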