|
|
DescriptionRetry non-streaming gRPC calls
BUG=
Review-Url: https://codereview.chromium.org/2592683002
Committed: https://github.com/luci/luci-py/commit/0f3cdde6c7b7b9118ac93a549b18fdc7b96b716b
Patch Set 1 #
Total comments: 13
Patch Set 2 : Response to comments #
Total comments: 1
Patch Set 3 : Call calculate_sleep_before_retry and add unit tests #
Total comments: 4
Patch Set 4 : Final changes from maruel #Patch Set 5 : Disable incorrect pylint warning #
Depends on Patchset: Messages
Total messages: 23 (8 generated)
aludwin@google.com changed reviewers: + maruel@chromium.org, ryanmartens@google.com
I confirmed that when the proxy briefly goes away, the bot no longer dies. This doesn't help for the Isolate calls because those use streaming gRPC, yields and generators so there's no feasible way to rewind them.
https://codereview.chromium.org/2592683002/diff/1/appengine/swarming/swarming... File appengine/swarming/swarming_bot/bot_code/remote_client_grpc.py (right): https://codereview.chromium.org/2592683002/diff/1/appengine/swarming/swarming... appengine/swarming/swarming_bot/bot_code/remote_client_grpc.py:232: num_attempts = 0 for num_attempts in xrange(MAX_GRPC_ATTEMPTS): https://codereview.chromium.org/2592683002/diff/1/appengine/swarming/swarming... appengine/swarming/swarming_bot/bot_code/remote_client_grpc.py:238: logging.warning('call_grpc succeeded after %d attempts', num_attempts) IMHO it'd not needed since we can infer from the other logs entries. https://codereview.chromium.org/2592683002/diff/1/appengine/swarming/swarming... appengine/swarming/swarming_bot/bot_code/remote_client_grpc.py:241: logging.warning('call_grpc - gRPC error: %s', str(rpc_error)) str() is not needed https://codereview.chromium.org/2592683002/diff/1/appengine/swarming/swarming... appengine/swarming/swarming_bot/bot_code/remote_client_grpc.py:246: time.sleep(1) You need exponential backoff. e.g. https://en.wikipedia.org/wiki/Exponential_backoff so you should start with a very quick wait then wait longer. https://codereview.chromium.org/2592683002/diff/1/appengine/swarming/swarming... appengine/swarming/swarming_bot/bot_code/remote_client_grpc.py:250: except Exception as e: I don't think this is needed.
https://codereview.chromium.org/2592683002/diff/1/appengine/swarming/swarming... File appengine/swarming/swarming_bot/bot_code/remote_client_grpc.py (right): https://codereview.chromium.org/2592683002/diff/1/appengine/swarming/swarming... appengine/swarming/swarming_bot/bot_code/remote_client_grpc.py:238: logging.warning('call_grpc succeeded after %d attempts', num_attempts) On 2016/12/20 15:05:59, M-A Ruel wrote: > IMHO it'd not needed since we can infer from the other logs entries. I thought it was nice to see but I agree it's technically redundant. I'll take it out. https://codereview.chromium.org/2592683002/diff/1/appengine/swarming/swarming... appengine/swarming/swarming_bot/bot_code/remote_client_grpc.py:241: logging.warning('call_grpc - gRPC error: %s', str(rpc_error)) On 2016/12/20 15:05:59, M-A Ruel wrote: > str() is not needed Done. https://codereview.chromium.org/2592683002/diff/1/appengine/swarming/swarming... appengine/swarming/swarming_bot/bot_code/remote_client_grpc.py:246: time.sleep(1) On 2016/12/20 15:05:59, M-A Ruel wrote: > You need exponential backoff. > e.g. https://en.wikipedia.org/wiki/Exponential_backoff > > > so you should start with a very quick wait then wait longer. What would you recommend as the initial, maximum and exponent? https://codereview.chromium.org/2592683002/diff/1/appengine/swarming/swarming... appengine/swarming/swarming_bot/bot_code/remote_client_grpc.py:250: except Exception as e: On 2016/12/20 15:05:59, M-A Ruel wrote: > I don't think this is needed. Which part? I just want to catch it so I can log it here.
You forgot to upload. https://codereview.chromium.org/2592683002/diff/1/appengine/swarming/swarming... File appengine/swarming/swarming_bot/bot_code/remote_client_grpc.py (right): https://codereview.chromium.org/2592683002/diff/1/appengine/swarming/swarming... appengine/swarming/swarming_bot/bot_code/remote_client_grpc.py:246: time.sleep(1) On 2016/12/20 16:42:35, aludwin wrote: > On 2016/12/20 15:05:59, M-A Ruel wrote: > > You need exponential backoff. > > e.g. https://en.wikipedia.org/wiki/Exponential_backoff > > > > > > so you should start with a very quick wait then wait longer. > > What would you recommend as the initial, maximum and exponent? Here's an idea: https://github.com/luci/luci-py/blob/master/client/utils/net.py#L754 https://codereview.chromium.org/2592683002/diff/1/appengine/swarming/swarming... appengine/swarming/swarming_bot/bot_code/remote_client_grpc.py:250: except Exception as e: On 2016/12/20 16:42:35, aludwin wrote: > On 2016/12/20 15:05:59, M-A Ruel wrote: > > I don't think this is needed. > > Which part? I just want to catch it so I can log it here. Whatever catches the unrelated exception will likely print the stack anyway, I'm not concerned about that.
https://codereview.chromium.org/2592683002/diff/1/appengine/swarming/swarming... File appengine/swarming/swarming_bot/bot_code/remote_client_grpc.py (right): https://codereview.chromium.org/2592683002/diff/1/appengine/swarming/swarming... appengine/swarming/swarming_bot/bot_code/remote_client_grpc.py:232: num_attempts = 0 On 2016/12/20 15:05:59, M-A Ruel wrote: > for num_attempts in xrange(MAX_GRPC_ATTEMPTS): Hmm, the reason I made it a for loop in the first place was because I need custom loop-exiting code anyway - I need to re-throw the exception in the "except" block. Since the for loop can never terminate naturally, I turned it into a "while True." Do you still think it should be a "for"?
https://codereview.chromium.org/2592683002/diff/1/appengine/swarming/swarming... File appengine/swarming/swarming_bot/bot_code/remote_client_grpc.py (right): https://codereview.chromium.org/2592683002/diff/1/appengine/swarming/swarming... appengine/swarming/swarming_bot/bot_code/remote_client_grpc.py:232: num_attempts = 0 On 2016/12/20 19:51:44, aludwin wrote: > On 2016/12/20 15:05:59, M-A Ruel wrote: > > for num_attempts in xrange(MAX_GRPC_ATTEMPTS): > > Hmm, the reason I made it a for loop in the first place was because I need > custom loop-exiting code anyway - I need to re-throw the exception in the > "except" block. Since the for loop can never terminate naturally, I turned it > into a "while True." > > Do you still think it should be a "for"? I don't mind much.
lgtm with change https://codereview.chromium.org/2592683002/diff/20001/appengine/swarming/swar... File appengine/swarming/swarming_bot/bot_code/remote_client_grpc.py (right): https://codereview.chromium.org/2592683002/diff/20001/appengine/swarming/swar... appengine/swarming/swarming_bot/bot_code/remote_client_grpc.py:249: # random.random() returns [0.0, 1.0). Starts with relatively short Please call net.calculate_sleep_before_retry(num_attempts, 10.) here so we don't duplicate this code and if we want to update the retry mechanism we can update all locations at once.
Hi M-A, I made some changes to the call_grpc function and also added a bunch of unit tests. Can you please have a quick look? Thanks, A
lgtm https://codereview.chromium.org/2592683002/diff/40001/appengine/swarming/swar... File appengine/swarming/swarming_bot/bot_code/remote_client_grpc_test.py (right): https://codereview.chromium.org/2592683002/diff/40001/appengine/swarming/swar... appengine/swarming/swarming_bot/bot_code/remote_client_grpc_test.py:47: add empty line here https://codereview.chromium.org/2592683002/diff/40001/appengine/swarming/swar... appengine/swarming/swarming_bot/bot_code/remote_client_grpc_test.py:202: try: FYI, you can do instead: with self.asserRaises(remote_client_grpc.grpc.RpcError) as e: self._client.do_handshake(self.get_bot_attributes_dict()) self.assertEqual(self._num_sleeps, 2) self.assertEqual(e.code(), remote_client_grpc.grpc.StatusCode.UNAVAILABLE)
https://codereview.chromium.org/2592683002/diff/40001/appengine/swarming/swar... File appengine/swarming/swarming_bot/bot_code/remote_client_grpc_test.py (right): https://codereview.chromium.org/2592683002/diff/40001/appengine/swarming/swar... appengine/swarming/swarming_bot/bot_code/remote_client_grpc_test.py:47: On 2016/12/21 16:31:44, M-A Ruel wrote: > add empty line here Done. https://codereview.chromium.org/2592683002/diff/40001/appengine/swarming/swar... appengine/swarming/swarming_bot/bot_code/remote_client_grpc_test.py:202: try: On 2016/12/21 16:31:44, M-A Ruel wrote: > FYI, you can do instead: > > with self.asserRaises(remote_client_grpc.grpc.RpcError) as e: > self._client.do_handshake(self.get_bot_attributes_dict()) > self.assertEqual(self._num_sleeps, 2) > self.assertEqual(e.code(), > remote_client_grpc.grpc.StatusCode.UNAVAILABLE) Done.
The CQ bit was checked by aludwin@google.com
The patchset sent to the CQ was uploaded after l-g-t-m from maruel@chromium.org Link to the patchset: https://codereview.chromium.org/2592683002/#ps60001 (title: "Final changes from maruel")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: Luci-py Presubmit on luci.infra.try (JOB_FAILED, https://luci-milo.appspot.com/swarming/task/333b1ddede0ba410)
The CQ bit was checked by aludwin@google.com
The patchset sent to the CQ was uploaded after l-g-t-m from maruel@chromium.org Link to the patchset: https://codereview.chromium.org/2592683002/#ps80001 (title: "Disable incorrect pylint warning")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
CQ is committing da patch. Bot data: {"patchset_id": 80001, "attempt_start_ts": 1482340234569090, "parent_rev": "c29b41108967de2a33a2d90c4db8fb59fc1c4221", "commit_rev": "0f3cdde6c7b7b9118ac93a549b18fdc7b96b716b"}
Message was sent while issue was closed.
Description was changed from ========== Retry non-streaming gRPC calls BUG= ========== to ========== Retry non-streaming gRPC calls BUG= Review-Url: https://codereview.chromium.org/2592683002 Committed: https://github.com/luci/luci-py/commit/0f3cdde6c7b7b9118ac93a549b18fdc7b96b716b ==========
Message was sent while issue was closed.
Committed patchset #5 (id:80001) as https://github.com/luci/luci-py/commit/0f3cdde6c7b7b9118ac93a549b18fdc7b96b716b |