Chromium Code Reviews| OLD | NEW |
|---|---|
| 1 # Copyright 2016 The Chromium Authors. All rights reserved. | 1 # Copyright 2016 The Chromium Authors. All rights reserved. |
| 2 # Use of this source code is governed by a BSD-style license that can be | 2 # Use of this source code is governed by a BSD-style license that can be |
| 3 # found in the LICENSE file. | 3 # found in the LICENSE file. |
| 4 | 4 |
| 5 import math | 5 import math |
| 6 import numpy as np | 6 import numpy as np |
| 7 # N.B., ``np.array`` can't take generators; you must pass explicit lists. | 7 # N.B., ``np.array`` can't take generators; you must pass explicit lists. |
| 8 import scipy.optimize as spo | 8 import scipy.optimize as spo |
| 9 | 9 |
| 10 from crash.loglinear.model import LogLinearModel | 10 from crash.loglinear.model import LogLinearModel |
| (...skipping 15 matching lines...) Expand all Loading... | |
| 26 ``Y`` (it'll just take more computation time is all). This is | 26 ``Y`` (it'll just take more computation time is all). This is |
| 27 needed for computing the partition function and expectation. N.B., | 27 needed for computing the partition function and expectation. N.B., |
| 28 we do not actually need to know/enumerate of *all* of ``Y``, | 28 we do not actually need to know/enumerate of *all* of ``Y``, |
| 29 only the subsets for each ``x``. | 29 only the subsets for each ``x``. |
| 30 training_data (iterable): a collection of ``(x, y)`` pairs where | 30 training_data (iterable): a collection of ``(x, y)`` pairs where |
| 31 ``y`` is the known-correct label for ``x``. | 31 ``y`` is the known-correct label for ``x``. |
| 32 feature_function: A function from ``X`` to ``Y`` to a list of | 32 feature_function: A function from ``X`` to ``Y`` to a list of |
| 33 ``float``. N.B., the length of the list must be the same for all | 33 ``float``. N.B., the length of the list must be the same for all |
| 34 ``x`` and ``y``, and must be the same as the length of the list | 34 ``x`` and ``y``, and must be the same as the length of the list |
| 35 of weights. | 35 of weights. |
| 36 initial_weights (list of float): the pre-training coefficients | 36 initial_weights (dict from str to float): the pre-training coefficients |
| 37 for how much we believe components of the feature vector. This | 37 for how much we believe components of the feature vector. This |
| 38 provides the seed for training; this starting value shouldn't | 38 provides the seed for training; this starting value shouldn't |
| 39 affect the final weights obtained by training (thanks to | 39 affect the final weights obtained by training (thanks to |
| 40 convexity), but will affect how long it takes for training | 40 convexity), but will affect how long it takes for training |
| 41 to converge. | 41 to converge. |
| 42 epsilon (float): The absolute-error threshold for considering a | 42 epsilon (float): The absolute-error threshold for considering a |
| 43 weight to be "equal to zero". N.B., this should be a positive | 43 weight to be "equal to zero". N.B., this should be a positive |
| 44 number, as we will compare it against the absolute value of | 44 number, as we will compare it against the absolute value of |
| 45 each weight. | 45 each weight. |
| 46 """ | 46 """ |
| 47 super(TrainableLogLinearModel, self).__init__( | 47 super(TrainableLogLinearModel, self).__init__( |
| 48 Y_given_X, feature_function, initial_weights, epsilon) | 48 Y_given_X, feature_function, initial_weights, epsilon) |
| 49 self._training_data = training_data | 49 self._training_data = training_data |
| 50 self._feature_order = initial_weights.keys() | |
|
wrengr
2017/01/11 20:38:30
You should explicitly convert this to a list (or t
Sharu Jiang
2017/01/12 01:41:38
I think the ``.keys()`` will return a list, but ex
wrengr
2017/01/12 18:16:16
So it does. Weird.
| |
| 50 | 51 |
| 51 self._observed_feature_vector = vsum([ | 52 self._observed_feature_vector = vsum([ |
| 52 self.FeaturesAsNumPyArray(x)(y) | 53 self.FeaturesAsNumPyArray(x)(y) |
| 53 for x, y in self._training_data]) | 54 for x, y in self._training_data]) |
| 54 | 55 |
| 55 # Even though this is identical to the superclass definition, we must | 56 # Even though this is identical to the superclass definition, we must |
| 56 # re-provide it in order to define the setter. | 57 # re-provide it in order to define the setter. |
| 57 @property | 58 @property |
| 58 def weights(self): | 59 def weights(self): |
| 59 """The weight covector. | 60 """The weight covector. |
| 60 | 61 |
| 61 At present we return the weights as an ``np.ndarray``, but in the | 62 At present we return the weights as an ``np.ndarray``, but in the |
| 62 future that may be replaced by a more general type which specifies | 63 future that may be replaced by a more general type which specifies |
| 63 the semantics rather than the implementation details. | 64 the semantics rather than the implementation details. |
| 64 """ | 65 """ |
| 65 return self._weights | 66 return self._weights |
| 66 | 67 |
| 67 @weights.setter | 68 @weights.setter |
| 68 def weights(self, new_weights): # pylint: disable=W0221 | 69 def weights(self, new_np_weights): # pylint: disable=W0221 |
| 69 """Mutate the weight covector, and clear memos as necessary. | 70 """Mutate the weight covector, and clear memos as necessary. |
| 70 | 71 |
| 71 This setter attempts to avoid clearing memos whenever possible, | 72 This setter attempts to avoid clearing memos whenever possible, |
| 72 but errs on the side of caution/correctness when it needs to. | 73 but errs on the side of caution/correctness when it needs to. |
| 73 | 74 |
| 74 Args: | 75 Args: |
| 75 new_weights (np.ndarray): the new weights to use. Must have the | 76 new_np_weights (np.ndarray): the new weights to use. It will be converted |
| 76 same shape as the old ``np.ndarray``. | 77 to weights dict mapping feature_name to its weight. |
| 77 """ | 78 """ |
| 78 if new_weights is self._weights: | 79 new_weights = self.NumPyArrayToWeights(new_np_weights) |
| 79 return | |
| 80 | |
| 81 if not isinstance(new_weights, np.ndarray): | |
| 82 raise TypeError('Expected an np.ndarray but got %s instead' | |
| 83 % new_weights.__class__.__name__) | |
| 84 | |
| 85 if new_weights.shape != self._weights.shape: | |
| 86 raise TypeError('Weight shape mismatch: %s != %s' | |
| 87 % (new_weights.shape, self._weights.shape)) | |
| 88 | |
| 89 self.ClearWeightBasedMemos() | 80 self.ClearWeightBasedMemos() |
| 90 self._weights = new_weights | 81 self._weights = new_weights |
|
wrengr
2017/01/11 20:38:31
To avoid excessive conversion back and forth, this
Sharu Jiang
2017/01/12 01:41:38
Done.
| |
| 91 | 82 |
| 92 def FeaturesAsNumPyArray(self, x): | 83 def FeaturesAsNumPyArray(self, x): |
| 93 """A variant of ``Features`` which returns a ``np.ndarray``. | 84 """A variant of ``Features`` which returns a ``np.ndarray``. |
| 94 | 85 |
| 86 Note, the features nparray should have the same order as in | |
| 87 self._feature_order to stay aligned with weights np array. | |
| 88 | |
| 95 For training we need to have the feature function return an | 89 For training we need to have the feature function return an |
| 96 ``np.ndarray(float)`` rather than the ``list(FeatureValue)`` used | 90 ``np.ndarray(float)`` rather than the ``list(FeatureValue)`` used |
| 97 elsewhere. This function performs the necessary conversion. | 91 elsewhere. This function performs the necessary conversion. |
| 98 | 92 |
| 99 N.B., at present we do not memoize this function. The underlying | 93 N.B., at present we do not memoize this function. The underlying |
| 100 ``Features`` method is memoized, so we won't re-compute the features | 94 ``Features`` method is memoized, so we won't re-compute the features |
| 101 each time; but we will repeatedly copy the floats into newly allocated | 95 each time; but we will repeatedly copy the floats into newly allocated |
| 102 ``np.ndarray`` objects. If that turns out to be a performance | 96 ``np.ndarray`` objects. If that turns out to be a performance |
| 103 bottleneck, we can add the extra layer of memoization to avoid that. | 97 bottleneck, we can add the extra layer of memoization to avoid that. |
| 104 """ | 98 """ |
| 105 fx = self.Features(x) | 99 fx = self.Features(x) |
| 106 return lambda y: np.array([fxy.value for fxy in fx(y)]) | 100 def FeaturesAsNumPyArrayGivenX(y): |
| 101 fxys = fx(y) | |
| 102 return np.array([fxys[feature_name].value if feature_name in fxys else 0. | |
|
wrengr
2017/01/11 20:38:30
is clearer to use ``fxys.get(feature_name, 0.)``
Sharu Jiang
2017/01/12 01:41:38
I have to do this, because I need ``fxys[feature_n
wrengr
2017/01/12 18:16:16
hrm... that does make it tricky.
This sort of thi
| |
| 103 for feature_name in self._feature_order]) | |
| 104 | |
| 105 return FeaturesAsNumPyArrayGivenX | |
| 106 | |
| 107 def WeightsAsNumPyArray(self): | |
| 108 """Converts dict (mapping feature name to weight) to numpy array. | |
| 109 | |
| 110 Note, this conversion is needed because model uses weights dict to organize | |
| 111 weights for features, however SciPy trainning (e.g. BFGS) needs numpy array | |
| 112 to do computaion. | |
| 113 """ | |
| 114 return np.array([self._weights.get(feature_name, 0.) | |
| 115 for feature_name in self._feature_order]) | |
|
wrengr
2017/01/11 20:38:31
Again, to avoid excessive conversion, we can do th
Sharu Jiang
2017/01/12 01:41:38
Done.
| |
| 116 | |
| 117 def NumPyArrayToWeights(self, np_weights): | |
| 118 """Converts numpy array to dict (mapping feature name to weight). | |
| 119 | |
| 120 Note, this conversion is needed because model uses weights dict to organize | |
| 121 weights for features, however SciPy training (e.g. BFGS) needs numpy array | |
| 122 to do computation. | |
| 123 | |
| 124 Args: | |
| 125 np_weights (np.ndarray): Weights which have the same order of | |
| 126 self._feature_order. Note, feature np array should also be serialized by |
| 127 the same order as self._feature_order to match. | |
| 128 | |
| 129 Returns: | |
| 130 A dict mapping feature name to weight. | |
| 131 """ | |
| 132 if not isinstance(np_weights, np.ndarray): | |
| 133 raise TypeError('Expected an np.ndarray but got %s instead' | |
| 134 % np_weights.__class__.__name__) | |
| 135 | |
| 136 return {feature_name: weight | |
| 137 for feature_name, weight in zip(self._feature_order, np_weights)} | |
| 107 | 138 |
| 108 def LogLikelihood(self): | 139 def LogLikelihood(self): |
| 109 """The conditional log-likelihood of the training data. | 140 """The conditional log-likelihood of the training data. |
| 110 | 141 |
| 111 The conditional likelihood of the training data is the product | 142 The conditional likelihood of the training data is the product |
| 112 of ``Pr(y|x)`` for each ``(x, y)`` pair in the training data; so | 143 of ``Pr(y|x)`` for each ``(x, y)`` pair in the training data; so |
| 113 the conditional log-likelihood is the log of that. This is called | 144 the conditional log-likelihood is the log of that. This is called |
| 114 "likelihood" because it is thought of as a function of the weight | 145 "likelihood" because it is thought of as a function of the weight |
| 115 covector, with the training data held fixed. | 146 covector, with the training data held fixed. |
| 116 | 147 |
| 117 This is the ideal objective function for training the weights, as it | 148 This is the ideal objective function for training the weights, as it |
| 118 will give us the MLE weight covector for the training data. However, | 149 will give us the MLE weight covector for the training data. However, |
| 119 in practice, we want to do regularization to ensure we don't overfit | 150 in practice, we want to do regularization to ensure we don't overfit |
| 120 the training data and to reduce classification time by ensuring that | 151 the training data and to reduce classification time by ensuring that |
| 121 the weight vector is sparse. Thus, the actual objective function | 152 the weight vector is sparse. Thus, the actual objective function |
| 122 will be the log-likelihood plus some penalty terms for regularization. | 153 will be the log-likelihood plus some penalty terms for regularization. |
| 123 """ | 154 """ |
| 124 observed_zeta = math.fsum(self.LogZ(x) for x, _ in self._training_data) | 155 observed_zeta = math.fsum(self.LogZ(x) for x, _ in self._training_data) |
| 125 observed_score = self.weights.dot(self._observed_feature_vector) | 156 observed_score = self.WeightsAsNumPyArray().dot( |
| 157 self._observed_feature_vector) | |
| 126 return observed_score - observed_zeta | 158 return observed_score - observed_zeta |
| 127 | 159 |
| 128 def LogLikelihoodGradient(self): | 160 def LogLikelihoodGradient(self): |
| 129 """The gradient (aka Jacobian) of ``LogLikelihood``.""" | 161 """The gradient (aka Jacobian) of ``LogLikelihood``.""" |
| 130 expected_feature_vector = vsum([ | 162 expected_feature_vector = vsum([ |
| 131 self.Expectation(x, self.FeaturesAsNumPyArray(x)) | 163 self.Expectation(x, self.FeaturesAsNumPyArray(x)) |
| 132 for x, _ in self._training_data]) | 164 for x, _ in self._training_data]) |
| 133 return self._observed_feature_vector - expected_feature_vector | 165 return self._observed_feature_vector - expected_feature_vector |
| 134 | 166 |
| 135 def TrainWeights(self, l2_penalty): | 167 def TrainWeights(self, l2_penalty): |
| 136 """Optimize the weight covector based on the training data. | 168 """Optimize the weight covector based on the training data. |
| 137 | 169 |
| 138 Args: | 170 Args: |
| 139 l2_penalty (float): the hyperparameter for how much to penalize | 171 l2_penalty (float): the hyperparameter for how much to penalize |
| 140 weight covectors far from zero. | 172 weight covectors far from zero. |
| 141 | 173 |
| 142 Returns: | 174 Returns: |
| 143 Nothing, but has the side effect of mutating the stored weights. | 175 Nothing, but has the side effect of mutating the stored weights. |
| 144 """ | 176 """ |
| 145 initial_weights = self.weights | 177 initial_weights = self.WeightsAsNumPyArray() |
| 146 | 178 |
| 147 # We want to minimize the number of times we reset the weights since | 179 # We want to minimize the number of times we reset the weights since |
| 148 # that clears our memos. One might think we could do that in the | 180 # that clears our memos. One might think we could do that in the |
| 149 # between-iterations callback; but actually, in a single iteration, | 181 # between-iterations callback; but actually, in a single iteration, |
| 150 # BFGS calls the objective function and gradient more than once with | 182 # BFGS calls the objective function and gradient more than once with |
| 151 # different arguments; so, alas, we must reset the weights in both. | 183 # different arguments; so, alas, we must reset the weights in both. |
| 152 # This is why the ``weights`` setter tries to avoid clearing memos | 184 # This is why the ``weights`` setter tries to avoid clearing memos |
| 153 # when possible. | 185 # when possible. |
| 154 | 186 |
| 155 def objective_function(new_weights): | 187 def objective_function(new_weights): |
| 156 self.weights = new_weights | 188 self.weights = new_weights |
| 157 return -self.LogLikelihood() + 0.5 * l2_penalty * self.quadrance | 189 return -self.LogLikelihood() + 0.5 * l2_penalty * self.quadrance |
| 158 | 190 |
| 159 def objective_function_gradient(new_weights): | 191 def objective_function_gradient(new_weights): |
| 160 self.weights = new_weights | 192 self.weights = new_weights |
| 161 return -self.LogLikelihoodGradient() + l2_penalty * self.weights | 193 return -self.LogLikelihoodGradient() + l2_penalty * new_weights |
|
wrengr
2017/01/11 20:38:31
I don't think that's any cleaner than the old code
Sharu Jiang
2017/01/12 01:41:38
This is because we need np_array here instead of a
wrengr
2017/01/12 18:16:16
Ah, right <chagrin>
| |
| 162 | 194 |
| 163 result = spo.minimize( | 195 result = spo.minimize( |
| 164 objective_function, | 196 objective_function, |
| 165 initial_weights, | 197 initial_weights, |
| 166 method='BFGS', | 198 method='BFGS', |
| 167 jac=objective_function_gradient) | 199 jac=objective_function_gradient) |
| 168 | 200 |
| 169 if not result.success: # pragma: no cover | 201 if not result.success: # pragma: no cover |
| 170 # This should happen infrequently enough that there's no point in | 202 # This should happen infrequently enough that there's no point in |
| 171 # logging it and attempting to carry on. | 203 # logging it and attempting to carry on. |
| 172 raise Exception( | 204 raise Exception( |
| 173 'TrainableLogLinearModel.TrainWeights failed:' | 205 'TrainableLogLinearModel.TrainWeights failed:' |
| 174 '\n\tReason: %s' | 206 '\n\tReason: %s' |
| 175 '\n\tCurrent objective value: %s' | 207 '\n\tCurrent objective value: %s' |
| 176 '\n\tCurrent objective gradient: %s' | 208 '\n\tCurrent objective gradient: %s' |
| 177 '\n\tIterations: %d' | 209 '\n\tIterations: %d' |
| 178 '\n\tFunction evaluations: %d' | 210 '\n\tFunction evaluations: %d' |
| 179 '\n\tGradient evaluations: %d' | 211 '\n\tGradient evaluations: %d' |
| 180 % (result.message, result.fun, result.jac, result.nit, result.nfev, | 212 % (result.message, result.fun, result.jac, result.nit, result.nfev, |
| 181 result.njev)) | 213 result.njev)) |
| 182 | 214 |
| 183 # This shouldn't really be necessary, since we're resetting it | 215 # This shouldn't really be necessary, since we're resetting it |
| 184 # directly during training; but just to be safe/sure. | 216 # directly during training; but just to be safe/sure. |
| 185 self.weights = result.x | 217 self.weights = result.x |
| OLD | NEW |