appengine/findit/waterfall/process_base_swarming_task_result_pipeline.py - Issue 2526963002: [Findit] Implement retry within swarming_util.py when making server calls

Side by Side Diff: appengine/findit/waterfall/process_base_swarming_task_result_pipeline.py

Issue 2526963002: [Findit] Implement retry within swarming_util.py when making server calls (Closed)

Patch Set: Self-review Created 4 years ago

OLD	NEW
1 # Copyright 2016 The Chromium Authors. All rights reserved.	1 # Copyright 2016 The Chromium Authors. All rights reserved.

2 # Use of this source code is governed by a BSD-style license that can be	2 # Use of this source code is governed by a BSD-style license that can be

3 # found in the LICENSE file.	3 # found in the LICENSE file.

4	4

5 from collections import defaultdict	5 from collections import defaultdict

6 import datetime	6 import datetime

7 import logging	7 import logging

8 import time	8 import time

9	9

10 from common.http_client_appengine import HttpClientAppengine as HttpClient	10 from common.http_client_appengine import HttpClientAppengine as HttpClient

(...skipping 70 matching lines...)
81 task_completed = False	81 task_completed = False

82 tests_statuses = {}	82 tests_statuses = {}

83 step_name_no_platform = None	83 step_name_no_platform = None

84 task = self._GetSwarmingTask(*call_args)	84 task = self._GetSwarmingTask(*call_args)

85	85

86 while not task_completed:	86 while not task_completed:

87 data, error = swarming_util.GetSwarmingTaskResultById(	87 data, error = swarming_util.GetSwarmingTaskResultById(

88 task_id, self.HTTP_CLIENT)	88 task_id, self.HTTP_CLIENT)

89	89

90 if error:	90 if error:

91 # An error occurred when trying to contact the swarming server.	91 # An error occurred at some point when trying to retrieve data from

92 task.status = analysis_status.ERROR	92 # the swarming server, even if eventually successful.

93 task.error = error	93 task.error = error

94 task.put()	94 task.put()

95 break	95

	96 if not data:

	97 # Even after retry, no data was received.

	98 task.status = analysis_status.ERROR
	chanli 2016/11/23 23:45:32 This status is not saved in data store?
	lijeffrey 2016/11/28 23:55:24 This should be fine, outside the while loop there is a task.put() call.
	chanli 2016/11/29 18:49:38 I just want to make it consistent. There is a task.put() at line94, so maybe remove that line? Or as you mentioned in the comment at ln121, save it right away?
	lijeffrey 2016/11/30 18:18:47 I don't think the put() call belongs here, since immediately after the break there are the calls to update the time stamps which have a put() after them too so there would be 2 writes in close succession. The case for the error is we want to write the error, even if the status is eventually COMPLETED but we want to note that an error was encountered so when we measure performance we don't penalize the speed of the swarming task unnecessarily because of unexpected outages.
	It is possible for both data and error to be returned, meaning the call was eventually successful but issues were encountered so the .put() call is needed for the if error branch.
	chanli 2016/11/30 18:53:29 As discussed offline, this change is not needed.
	99 break

96	100

97 task_state = data['state']	101 task_state = data['state']

98 exit_code = (data.get('exit_code') if	102 exit_code = (data.get('exit_code') if

99 task_state == swarming_util.STATE_COMPLETED else None)	103 task_state == swarming_util.STATE_COMPLETED else None)

100 step_name_no_platform = (	104 step_name_no_platform = (

101 step_name_no_platform or swarming_util.GetTagValue(	105 step_name_no_platform or swarming_util.GetTagValue(

102 data.get('tags', {}), 'ref_name'))	106 data.get('tags', {}), 'ref_name'))

103	107

104 if task_state not in swarming_util.STATES_RUNNING:	108 if task_state not in swarming_util.STATES_RUNNING:

105 task_completed = True	109 task_completed = True

106	110

107 if (task_state == swarming_util.STATE_COMPLETED and	111 if (task_state == swarming_util.STATE_COMPLETED and

108 int(exit_code) != swarming_util.TASK_FAILED):	112 int(exit_code) != swarming_util.TASK_FAILED):

109 outputs_ref = data.get('outputs_ref')	113 outputs_ref = data.get('outputs_ref')

110 output_json, error = swarming_util.GetSwarmingTaskFailureLog(	114 output_json, error = swarming_util.GetSwarmingTaskFailureLog(

111 outputs_ref, self.HTTP_CLIENT)	115 outputs_ref, self.HTTP_CLIENT)

112	116

	117 task.status = analysis_status.COMPLETED

	118

113 if error:	119 if error:

114 task.status = analysis_status.ERROR

115 task.error = error	120 task.error = error

116 else:	121 task.put()
	chanli 2016/11/23 23:45:32 This put is not necessary?
	lijeffrey 2016/11/28 23:55:24 This can be a design choice, the idea is to capture the error as early as possible so even flake swarming tasks in progress that are experiencing issues can be detected/queried for earlier rather than waiting for the whole thing to complete. WDYT?
117 task.status = analysis_status.COMPLETED	122

	123 if not output_json:
	chanli 2016/11/30 18:53:29 If not output_json, this task is actually failed and we should break the loop right away. I just committed a CL yesterday with this change, so you need to rebase and (possibly) address conflicts.
	lijeffrey 2016/11/30 20:12:11 Rebased. As discussed offline, it's possible we get output_refs, but then when trying to contact the isolated server we experience problems, so this should be ok as it is.
	124 # Retry was ultimately unsuccessful.

	125 task.status = analysis_status.ERROR

118	126

119 tests_statuses = self._CheckTestsRunStatuses(output_json, *call_args)	127 tests_statuses = self._CheckTestsRunStatuses(output_json, *call_args)
	chanli 2016/11/23 23:45:32 This line will affect ln 125, right?
	chanli 2016/11/28 23:44:18 This comment is not valid, please ignore.
120 task.tests_statuses = tests_statuses	128 task.tests_statuses = tests_statuses

121 task.put()	129 task.put()

122 else:	130 else:

123 if exit_code is not None:	131 if exit_code is not None:

124 # Swarming task completed, but the task failed.	132 # Swarming task completed, but the task failed.

125 code = int(exit_code)	133 code = int(exit_code)

126 message = swarming_util.EXIT_CODE_DESCRIPTIONS[code]	134 message = swarming_util.EXIT_CODE_DESCRIPTIONS[code]

127 else:	135 else:

128 # The swarming task did not complete.	136 # The swarming task did not complete.

129 code = swarming_util.STATES_NOT_RUNNING_TO_ERROR_CODES[task_state]	137 code = swarming_util.STATES_NOT_RUNNING_TO_ERROR_CODES[task_state]

(...skipping 62 matching lines...)
192 task_id (str): The task id to query the swarming server on the progress	200 task_id (str): The task id to query the swarming server on the progress

193 of a swarming task.	201 of a swarming task.

194	202

195 Returns:	203 Returns:

196 A dict of lists for reliable/flaky tests.	204 A dict of lists for reliable/flaky tests.

197 """	205 """

198 call_args = self._GetArgs(master_name, builder_name, build_number,	206 call_args = self._GetArgs(master_name, builder_name, build_number,

199 step_name, *args)	207 step_name, *args)

200 step_name_no_platform = self._MonitorSwarmingTask(task_id, *call_args)	208 step_name_no_platform = self._MonitorSwarmingTask(task_id, *call_args)

201 return step_name, step_name_no_platform	209 return step_name, step_name_no_platform
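
The thread on new line 98 rests on one point: after retries, a call can return both data and an error, so the error is recorded while polling continues, and only missing data is treated as a failure. The sketch below illustrates that contract in isolation. It is not the swarming_util implementation: `call_with_retry`, `record_result`, the dict-based task, and the 'ERROR' string (standing in for analysis_status.ERROR) are all hypothetical names chosen for illustration.

```python
def call_with_retry(fetch, max_attempts=3, sleep=lambda seconds: None):
    """Hypothetical retry helper returning a (data, error) tuple.

    `fetch` is any callable returning (data, error_message). The last
    error seen is returned even when a later attempt succeeds, so the
    caller can tell that the server had problems along the way.
    """
    last_error = None
    for attempt in range(max_attempts):
        data, error = fetch()
        if error is None:
            return data, last_error
        last_error = {'message': error, 'attempt': attempt + 1}
        sleep(2 ** attempt)  # Exponential backoff; a no-op by default.
    return None, last_error


def record_result(data, error, task):
    """Applies the branch order from the diff to a plain dict task.

    Returns True if polling can continue with `data`, False otherwise.
    """
    if error:
        # Recorded even if the call was eventually successful.
        task['error'] = error
    if not data:
        # Even after retry, no data was received.
        task['status'] = 'ERROR'
        return False
    return True


# One failed attempt followed by a success: data and error both come back.
responses = iter([(None, 'timeout'), ({'state': 'COMPLETED'}, None)])
data, error = call_with_retry(lambda: next(responses))
task = {}
can_continue = record_result(data, error, task)
```

In this run `can_continue` is True and `task` carries only the error, which matches the case described in the thread where a task may still end up COMPLETED while the outage remains visible for performance measurements.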
