Issue 2914803004: Fix non-idempotent task that /poll reaped but returned HTTP 500.

M-A Ruel

Patchset #1 (id:1) has been deleted

3 years, 6 months ago (2017-06-01 18:57:34 UTC) #1

M-A Ruel

Description was changed from ========== Fix non-idempotent task that /poll reaped but returned HTTP 500. ...

3 years, 6 months ago (2017-06-01 19:00:28 UTC) #2

M-A Ruel

Description was changed from ========== Fix non-idempotent task that /poll reaped but returned HTTP 500. ...

3 years, 6 months ago (2017-06-01 19:00:49 UTC) #3

M-A Ruel

Description was changed from ========== Fix non-idempotent task that /poll reaped but returned HTTP 500. ...

3 years, 6 months ago (2017-06-01 19:00:59 UTC) #4

M-A Ruel

Description was changed from ========== Fix non-idempotent task that /poll reaped but returned HTTP 500. ...

3 years, 6 months ago (2017-06-01 19:02:17 UTC) #5

M-A Ruel

Description was changed from ========== Fix non-idempotent task that /poll reaped but returned HTTP 500. ...

3 years, 6 months ago (2017-06-01 19:03:52 UTC) #7

Description was changed from

==========
Fix non-idempotent task that /poll reaped but returned HTTP 500.

This case makes the task get BOT_DIED, which is really poor user experience.
Instead cleverly detect that the bot never gave a task update, and retry the
task anyway, as it's guaranteed that no user code was run.

This should help gain one 9 of reliability as /poll return path is a critical
path that significantly affected BOT_DIED levels when retries on
idempotent:false were disabled.

Change the BOT_DIED latency from 5 minutes to 2 minutes.

Make more items unicode instead of str, which makes unit test error diffing
easier to read.

Tweak asserts in _pre_put_hook(), .started_ts must be set in TaskRunResult but
it is not set in new_run_result() anymore.

The main downside of the current implementation is that it doesn't work with
'id' locked task, which will be addressed afterward to keep this CL simple.

R=vadimsh@chromium.org
BUG=728716
==========

to

==========
Fix non-idempotent task that /poll reaped but returned HTTP 500.

This case makes the task get BOT_DIED, which is really poor user experience.
Instead cleverly detect that the bot never gave a task update, and retry the
task anyway, as it's guaranteed that no user code was run.

This should help gain one 9 of reliability as /poll return path is a critical
path that significantly affected BOT_DIED levels when retries on
idempotent:false were disabled.

Change the BOT_DIED latency from 5 minutes to 2 minutes.

Make more items unicode instead of str, which makes unit test error diffing
easier to read. Most of this CL is test changes.

Tweak asserts in _pre_put_hook(), .started_ts must be set in TaskRunResult but
it is not set in new_run_result() anymore.

The main downside of the current implementation is that it doesn't work with
'id' locked task, which will be addressed afterward to keep this CL simple.

R=vadimsh@chromium.org
BUG=728716
==========

M-A Ruel

Description was changed from ========== Fix non-idempotent task that /poll reaped but returned HTTP 500. ...

3 years, 6 months ago (2017-06-01 23:07:35 UTC) #8

Description was changed: the reviewer line R=vadimsh@chromium.org became
R=iannucci@chromium.org. The rest of the description is unchanged.

M-A Ruel

maruel@chromium.org changed reviewers: + iannucci@chromium.org - vadimsh@chromium.org

3 years, 6 months ago (2017-06-01 23:07:36 UTC) #9

M-A Ruel

Description was changed from ========== Fix non-idempotent task that /poll reaped but returned HTTP 500. ...

3 years, 6 months ago (2017-06-08 13:23:25 UTC) #12

Description was changed: the reviewer line R=iannucci@chromium.org became
R=vadimsh@chromium.org again. The rest of the description is unchanged.

M-A Ruel

maruel@chromium.org changed reviewers: + vadimsh@chromium.org - iannucci@chromium.org

3 years, 6 months ago (2017-06-08 13:23:25 UTC) #13

M-A Ruel

Forwarding to Vadim. Split busywork in a separate CL so this CL is more focused.

3 years, 6 months ago (2017-06-08 13:23:49 UTC) #14

Vadim Sh.

https://codereview.chromium.org/2914803004/diff/60001/appengine/swarming/server/task_scheduler.py
File appengine/swarming/server/task_scheduler.py (right):

https://codereview.chromium.org/2914803004/diff/60001/appengine/swarming/server/task_scheduler.py#newcode225
appengine/swarming/server/task_scheduler.py:225: # - task hadn't got any ping at all ...

3 years, 6 months ago (2017-06-08 22:06:01 UTC) #15

M-A Ruel

https://codereview.chromium.org/2914803004/diff/60001/appengine/swarming/server/task_scheduler.py
File appengine/swarming/server/task_scheduler.py (right):

https://codereview.chromium.org/2914803004/diff/60001/appengine/swarming/server/task_scheduler.py#newcode225
appengine/swarming/server/task_scheduler.py:225: # - task hadn't got any ping at all ...

3 years, 6 months ago (2017-06-12 17:16:21 UTC) #16

commit-bot: I haz the power

CQ is committing da patch. Bot data: {"patchset_id": 80001, "attempt_start_ts": 1497660596875810, "parent_rev": "b833e9503650f0e60e379c05af07a8cb54ab1302", "commit_rev": "5251e45e518f606fba7a28ec465cc56a8e18e018"}

3 years, 6 months ago (2017-06-17 00:54:16 UTC) #19

commit-bot: I haz the power

Description was changed from ========== Fix non-idempotent task that /poll reaped but returned HTTP 500. ...

3 years, 6 months ago (2017-06-17 00:54:18 UTC) #20

Message was sent while issue was closed.

Description was changed to add the Review-Url and Committed lines. The final
description reads:

==========
Fix non-idempotent task that /poll reaped but returned HTTP 500.

This case makes the task get BOT_DIED, which is really poor user experience.
Instead cleverly detect that the bot never gave a task update, and retry the
task anyway, as it's guaranteed that no user code was run.

This should help gain one 9 of reliability as /poll return path is a critical
path that significantly affected BOT_DIED levels when retries on
idempotent:false were disabled.

Change the BOT_DIED latency from 5 minutes to 2 minutes.

Make more items unicode instead of str, which makes unit test error diffing
easier to read. Most of this CL is test changes.

Tweak asserts in _pre_put_hook(), .started_ts must be set in TaskRunResult but
it is not set in new_run_result() anymore.

The main downside of the current implementation is that it doesn't work with
'id' locked task, which will be addressed afterward to keep this CL simple.

R=vadimsh@chromium.org
BUG=728716

Review-Url: https://codereview.chromium.org/2914803004
Committed:
https://github.com/luci/luci-py/commit/5251e45e518f606fba7a28ec465cc56a8e18e018
==========
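
For readers skimming this log, the retry rule in the description can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in task_scheduler.py: `RunResult`, `classify` and `BOT_PING_TOLERANCE` are hypothetical names, the 2 minute value is taken from the description, and the branch that retries idempotent tasks is inferred from the remark about retries on idempotent:false being disabled. It is written for Python 3 for brevity.

```python
import datetime
from dataclasses import dataclass
from typing import Optional

# Assumption: 2 minutes is the new BOT_DIED latency named in the description.
# The constant name is hypothetical.
BOT_PING_TOLERANCE = datetime.timedelta(minutes=2)


@dataclass
class RunResult:
    """Hypothetical stand-in for a task run record, not the real TaskRunResult."""
    idempotent: bool
    reaped_ts: datetime.datetime
    # None means the bot never sent a single task update after /poll reaped it.
    last_update_ts: Optional[datetime.datetime] = None


def classify(run: RunResult, now: datetime.datetime) -> str:
    """Returns 'RUNNING', 'RETRY' or 'BOT_DIED' for a run that may have gone silent."""
    last_seen = run.last_update_ts or run.reaped_ts
    if now - last_seen < BOT_PING_TOLERANCE:
        # Still within the tolerated silence window.
        return 'RUNNING'
    if run.last_update_ts is None:
        # The bot never acknowledged the task (for example /poll reaped it but
        # the response was an HTTP 500), so no user code ran. Retrying is safe
        # even for a non-idempotent task.
        return 'RETRY'
    if run.idempotent:
        # Assumption: idempotent tasks stay retriable after a silent bot.
        return 'RETRY'
    # A non-idempotent task that did start: user code may have run.
    return 'BOT_DIED'
```

The point of the sketch is the ordering of the checks: the "never gave a task update" case is tested before the idempotency flag, which is what lets a non-idempotent task be retried instead of ending as BOT_DIED.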

commit-bot: I haz the power

3 years, 6 months ago (2017-06-17 00:54:20 UTC) #21

Message was sent while issue was closed.

Committed patchset #4 (id:80001) as
https://github.com/luci/luci-py/commit/5251e45e518f606fba7a28ec465cc56a8e18e018

Issue 2914803004: Fix non-idempotent task that /poll reaped but returned HTTP 500. (Closed)

Patch Set 1 : .

Patch Set 2 : One more log

Patch Set 3 : Rebase on 2927053002 to reduce delta

Patch Set 4 : Add test case