Chromium Code Reviews
chromiumcodereview-hr@appspot.gserviceaccount.com (chromiumcodereview-hr) | Please choose your nickname with Settings | Help | Chromium Project | Gerrit Changes | Sign out
(395)

Issue 2914803004: Fix non-idempotent task that /poll reaped but returned HTTP 500. (Closed)

Created:
3 years, 6 months ago by M-A Ruel
Modified:
3 years, 6 months ago
Reviewers:
Vadim Sh.
CC:
martiniss, chromium-reviews, infra-reviews+luci-py_chromium.org, iannucci
Target Ref:
refs/heads/master
Project:
luci-py
Visibility:
Public.

Description

Fix non-idempotent task that /poll reaped but returned HTTP 500. This case makes the task get BOT_DIED, which is really poor user experience. Instead cleverly detect that the bot never gave a task update, and retry the task anyway, as it's guaranteed that no user code was run. This should help gain one 9 of reliability as /poll return path is a critical path that significantly affected BOT_DIED levels when retries on idempotent:false were disabled. Change the BOT_DIED latency from 5 minutes to 2 minutes. Make more items unicode instead of str, which makes unit test error diffing easier to read. Most of this CL is test changes. Tweak asserts in _pre_put_hook(), .started_ts must be set in TaskRunResult but it is not set in new_run_result() anymore. The main downside of the current implementation is that it doesn't work with 'id' locked task, which will be addressed afterward to keep this CL simple. R=vadimsh@chromium.org BUG=728716 Review-Url: https://codereview.chromium.org/2914803004 Committed: https://github.com/luci/luci-py/commit/5251e45e518f606fba7a28ec465cc56a8e18e018

Patch Set 1 : . #

Patch Set 2 : One more log #

Patch Set 3 : Rebase on 2927053002 to reduce delta #

Total comments: 2

Patch Set 4 : Add test case #

Unified diffs Side-by-side diffs Delta from patch set Stats (+192 lines, -26 lines) Patch
M appengine/swarming/server/task_result.py View 1 2 3 3 chunks +9 lines, -3 lines 0 comments Download
M appengine/swarming/server/task_result_test.py View 1 2 3 8 chunks +14 lines, -6 lines 0 comments Download
M appengine/swarming/server/task_scheduler.py View 1 2 3 5 chunks +19 lines, -2 lines 0 comments Download
M appengine/swarming/server/task_scheduler_test.py View 1 2 3 6 chunks +150 lines, -15 lines 0 comments Download

Messages

Total messages: 21 (13 generated)
M-A Ruel
live on staging
3 years, 6 months ago (2017-06-01 19:02:29 UTC) #6
M-A Ruel
Forwarding to Robbie as Vadim is OOO
3 years, 6 months ago (2017-06-01 23:07:50 UTC) #10
martiniss
ping? I'd appreciate it if this would land :)
3 years, 6 months ago (2017-06-06 20:01:07 UTC) #11
M-A Ruel
Forwarding to Vadim. Split busywork in a separate CL so this CL is more focused.
3 years, 6 months ago (2017-06-08 13:23:49 UTC) #14
Vadim Sh.
https://codereview.chromium.org/2914803004/diff/60001/appengine/swarming/server/task_scheduler.py File appengine/swarming/server/task_scheduler.py (right): https://codereview.chromium.org/2914803004/diff/60001/appengine/swarming/server/task_scheduler.py#newcode225 appengine/swarming/server/task_scheduler.py:225: # - task hadn't got any ping at all ...
3 years, 6 months ago (2017-06-08 22:06:01 UTC) #15
M-A Ruel
https://codereview.chromium.org/2914803004/diff/60001/appengine/swarming/server/task_scheduler.py File appengine/swarming/server/task_scheduler.py (right): https://codereview.chromium.org/2914803004/diff/60001/appengine/swarming/server/task_scheduler.py#newcode225 appengine/swarming/server/task_scheduler.py:225: # - task hadn't got any ping at all ...
3 years, 6 months ago (2017-06-12 17:16:21 UTC) #16
Vadim Sh.
lgtm
3 years, 6 months ago (2017-06-16 22:25:07 UTC) #17
commit-bot: I haz the power
3 years, 6 months ago (2017-06-17 00:54:20 UTC) #21
Message was sent while issue was closed.
Committed patchset #4 (id:80001) as
https://github.com/luci/luci-py/commit/5251e45e518f606fba7a28ec465cc56a8e18e018

Powered by Google App Engine
This is Rietveld 408576698