|
|
Chromium Code Reviews
DescriptionFix non-idempotent task that /poll reaped but returned HTTP 500.
This case makes the task get BOT_DIED, which is really poor user experience.
Instead cleverly detect that the bot never gave a task update, and retry the
task anyway, as it's guaranteed that no user code was run.
This should help gain one 9 of reliability as /poll return path is a critical
path that significantly affected BOT_DIED levels when retries on
idempotent:false were disabled.
Change the BOT_DIED latency from 5 minutes to 2 minutes.
Make more items unicode instead of str, which makes unit test error diffing
easier to read. Most of this CL is test changes.
Tweak asserts in _pre_put_hook(), .started_ts must be set in TaskRunResult but
it is not set in new_run_result() anymore.
The main downside of the current implementation is that it doesn't work with
'id' locked task, which will be addressed afterward to keep this CL simple.
R=vadimsh@chromium.org
BUG=728716
Review-Url: https://codereview.chromium.org/2914803004
Committed: https://github.com/luci/luci-py/commit/5251e45e518f606fba7a28ec465cc56a8e18e018
Patch Set 1 : . #Patch Set 2 : One more log #Patch Set 3 : Rebase on 2927053002 to reduce delta #
Total comments: 2
Patch Set 4 : Add test case #
Messages
Total messages: 21 (13 generated)
Patchset #1 (id:1) has been deleted
Description was changed from ========== Fix non-idempotent task that /poll reaped but returned HTTP 500. This case makes the task get BOT_DIED, which is really poor user experience. Instead cleverly detect that the bot never gave a task update, and retry the task anyway, as it's guaranteed that no user code was run. Make more items unicode instead of str, which makes unit test error diffing easier to read. Tweak asserts in _pre_put_hook(), .started_ts must be set in TaskRunResult but it is not set in new_run_result() anymore. R=vadimsh@chromium.org BUG=728716 ========== to ========== Fix non-idempotent task that /poll reaped but returned HTTP 500. This case makes the task get BOT_DIED, which is really poor user experience. Instead cleverly detect that the bot never gave a task update, and retry the task anyway, as it's guaranteed that no user code was run. Change the BOT_DIED latency from 5 minutes to 2 minutes. Make more items unicode instead of str, which makes unit test error diffing easier to read. Tweak asserts in _pre_put_hook(), .started_ts must be set in TaskRunResult but it is not set in new_run_result() anymore. This is not exactly as described in issue 728716 but it is close enough to fix most of the problem. The main downside of the current implementation is that it doesn't work with id: locked task, which will be addressed afterward. R=vadimsh@chromium.org BUG=728716 ==========
Description was changed from ========== Fix non-idempotent task that /poll reaped but returned HTTP 500. This case makes the task get BOT_DIED, which is really poor user experience. Instead cleverly detect that the bot never gave a task update, and retry the task anyway, as it's guaranteed that no user code was run. Change the BOT_DIED latency from 5 minutes to 2 minutes. Make more items unicode instead of str, which makes unit test error diffing easier to read. Tweak asserts in _pre_put_hook(), .started_ts must be set in TaskRunResult but it is not set in new_run_result() anymore. This is not exactly as described in issue 728716 but it is close enough to fix most of the problem. The main downside of the current implementation is that it doesn't work with id: locked task, which will be addressed afterward. R=vadimsh@chromium.org BUG=728716 ========== to ========== Fix non-idempotent task that /poll reaped but returned HTTP 500. This case makes the task get BOT_DIED, which is really poor user experience. Instead cleverly detect that the bot never gave a task update, and retry the task anyway, as it's guaranteed that no user code was run. Change the BOT_DIED latency from 5 minutes to 2 minutes. Make more items unicode instead of str, which makes unit test error diffing easier to read. Tweak asserts in _pre_put_hook(), .started_ts must be set in TaskRunResult but it is not set in new_run_result() anymore. The main downside of the current implementation is that it doesn't work with 'id' locked task, which will be addressed afterward. R=vadimsh@chromium.org BUG=728716 ==========
Description was changed from ========== Fix non-idempotent task that /poll reaped but returned HTTP 500. This case makes the task get BOT_DIED, which is really poor user experience. Instead cleverly detect that the bot never gave a task update, and retry the task anyway, as it's guaranteed that no user code was run. Change the BOT_DIED latency from 5 minutes to 2 minutes. Make more items unicode instead of str, which makes unit test error diffing easier to read. Tweak asserts in _pre_put_hook(), .started_ts must be set in TaskRunResult but it is not set in new_run_result() anymore. The main downside of the current implementation is that it doesn't work with 'id' locked task, which will be addressed afterward. R=vadimsh@chromium.org BUG=728716 ========== to ========== Fix non-idempotent task that /poll reaped but returned HTTP 500. This case makes the task get BOT_DIED, which is really poor user experience. Instead cleverly detect that the bot never gave a task update, and retry the task anyway, as it's guaranteed that no user code was run. Change the BOT_DIED latency from 5 minutes to 2 minutes. Make more items unicode instead of str, which makes unit test error diffing easier to read. Tweak asserts in _pre_put_hook(), .started_ts must be set in TaskRunResult but it is not set in new_run_result() anymore. The main downside of the current implementation is that it doesn't work with 'id' locked task, which will be addressed afterward to keep this CL simple. R=vadimsh@chromium.org BUG=728716 ==========
Description was changed from ========== Fix non-idempotent task that /poll reaped but returned HTTP 500. This case makes the task get BOT_DIED, which is really poor user experience. Instead cleverly detect that the bot never gave a task update, and retry the task anyway, as it's guaranteed that no user code was run. Change the BOT_DIED latency from 5 minutes to 2 minutes. Make more items unicode instead of str, which makes unit test error diffing easier to read. Tweak asserts in _pre_put_hook(), .started_ts must be set in TaskRunResult but it is not set in new_run_result() anymore. The main downside of the current implementation is that it doesn't work with 'id' locked task, which will be addressed afterward to keep this CL simple. R=vadimsh@chromium.org BUG=728716 ========== to ========== Fix non-idempotent task that /poll reaped but returned HTTP 500. This case makes the task get BOT_DIED, which is really poor user experience. Instead cleverly detect that the bot never gave a task update, and retry the task anyway, as it's guaranteed that no user code was run. This should help gain one 9 of reliability as /poll return path is a critical path that significantly affected BOT_DIED levels when retries on idempotent:false were disabled. Change the BOT_DIED latency from 5 minutes to 2 minutes. Make more items unicode instead of str, which makes unit test error diffing easier to read. Tweak asserts in _pre_put_hook(), .started_ts must be set in TaskRunResult but it is not set in new_run_result() anymore. The main downside of the current implementation is that it doesn't work with 'id' locked task, which will be addressed afterward to keep this CL simple. R=vadimsh@chromium.org BUG=728716 ==========
live on staging
Description was changed from ========== Fix non-idempotent task that /poll reaped but returned HTTP 500. This case makes the task get BOT_DIED, which is really poor user experience. Instead cleverly detect that the bot never gave a task update, and retry the task anyway, as it's guaranteed that no user code was run. This should help gain one 9 of reliability as /poll return path is a critical path that significantly affected BOT_DIED levels when retries on idempotent:false were disabled. Change the BOT_DIED latency from 5 minutes to 2 minutes. Make more items unicode instead of str, which makes unit test error diffing easier to read. Tweak asserts in _pre_put_hook(), .started_ts must be set in TaskRunResult but it is not set in new_run_result() anymore. The main downside of the current implementation is that it doesn't work with 'id' locked task, which will be addressed afterward to keep this CL simple. R=vadimsh@chromium.org BUG=728716 ========== to ========== Fix non-idempotent task that /poll reaped but returned HTTP 500. This case makes the task get BOT_DIED, which is really poor user experience. Instead cleverly detect that the bot never gave a task update, and retry the task anyway, as it's guaranteed that no user code was run. This should help gain one 9 of reliability as /poll return path is a critical path that significantly affected BOT_DIED levels when retries on idempotent:false were disabled. Change the BOT_DIED latency from 5 minutes to 2 minutes. Make more items unicode instead of str, which makes unit test error diffing easier to read. Most of this CL is test changes. Tweak asserts in _pre_put_hook(), .started_ts must be set in TaskRunResult but it is not set in new_run_result() anymore. The main downside of the current implementation is that it doesn't work with 'id' locked task, which will be addressed afterward to keep this CL simple. R=vadimsh@chromium.org BUG=728716 ==========
Description was changed from ========== Fix non-idempotent task that /poll reaped but returned HTTP 500. This case makes the task get BOT_DIED, which is really poor user experience. Instead cleverly detect that the bot never gave a task update, and retry the task anyway, as it's guaranteed that no user code was run. This should help gain one 9 of reliability as /poll return path is a critical path that significantly affected BOT_DIED levels when retries on idempotent:false were disabled. Change the BOT_DIED latency from 5 minutes to 2 minutes. Make more items unicode instead of str, which makes unit test error diffing easier to read. Most of this CL is test changes. Tweak asserts in _pre_put_hook(), .started_ts must be set in TaskRunResult but it is not set in new_run_result() anymore. The main downside of the current implementation is that it doesn't work with 'id' locked task, which will be addressed afterward to keep this CL simple. R=vadimsh@chromium.org BUG=728716 ========== to ========== Fix non-idempotent task that /poll reaped but returned HTTP 500. This case makes the task get BOT_DIED, which is really poor user experience. Instead cleverly detect that the bot never gave a task update, and retry the task anyway, as it's guaranteed that no user code was run. This should help gain one 9 of reliability as /poll return path is a critical path that significantly affected BOT_DIED levels when retries on idempotent:false were disabled. Change the BOT_DIED latency from 5 minutes to 2 minutes. Make more items unicode instead of str, which makes unit test error diffing easier to read. Most of this CL is test changes. Tweak asserts in _pre_put_hook(), .started_ts must be set in TaskRunResult but it is not set in new_run_result() anymore. The main downside of the current implementation is that it doesn't work with 'id' locked task, which will be addressed afterward to keep this CL simple. R=iannucci@chromium.org BUG=728716 ==========
maruel@chromium.org changed reviewers: + iannucci@chromium.org - vadimsh@chromium.org
Forwarding to Robbie as Vadim is OOO
ping? I'd appreciate it if this would land :)
Description was changed from ========== Fix non-idempotent task that /poll reaped but returned HTTP 500. This case makes the task get BOT_DIED, which is really poor user experience. Instead cleverly detect that the bot never gave a task update, and retry the task anyway, as it's guaranteed that no user code was run. This should help gain one 9 of reliability as /poll return path is a critical path that significantly affected BOT_DIED levels when retries on idempotent:false were disabled. Change the BOT_DIED latency from 5 minutes to 2 minutes. Make more items unicode instead of str, which makes unit test error diffing easier to read. Most of this CL is test changes. Tweak asserts in _pre_put_hook(), .started_ts must be set in TaskRunResult but it is not set in new_run_result() anymore. The main downside of the current implementation is that it doesn't work with 'id' locked task, which will be addressed afterward to keep this CL simple. R=iannucci@chromium.org BUG=728716 ========== to ========== Fix non-idempotent task that /poll reaped but returned HTTP 500. This case makes the task get BOT_DIED, which is really poor user experience. Instead cleverly detect that the bot never gave a task update, and retry the task anyway, as it's guaranteed that no user code was run. This should help gain one 9 of reliability as /poll return path is a critical path that significantly affected BOT_DIED levels when retries on idempotent:false were disabled. Change the BOT_DIED latency from 5 minutes to 2 minutes. Make more items unicode instead of str, which makes unit test error diffing easier to read. Most of this CL is test changes. Tweak asserts in _pre_put_hook(), .started_ts must be set in TaskRunResult but it is not set in new_run_result() anymore. The main downside of the current implementation is that it doesn't work with 'id' locked task, which will be addressed afterward to keep this CL simple. R=vadimsh@chromium.org BUG=728716 ==========
maruel@chromium.org changed reviewers: + vadimsh@chromium.org - iannucci@chromium.org
Forwarding to Vadim. Split busywork in a separate CL so this CL is more focused.
https://codereview.chromium.org/2914803004/diff/60001/appengine/swarming/serv... File appengine/swarming/server/task_scheduler.py (right): https://codereview.chromium.org/2914803004/diff/60001/appengine/swarming/serv... appengine/swarming/server/task_scheduler.py:225: # - task hadn't got any ping at all from task_runner.run_command() What happens if the ping from first bot then comes later anyway, when the task is already given to another bot. Can you add a test for this specific case (if it doesn't exist already)?
https://codereview.chromium.org/2914803004/diff/60001/appengine/swarming/serv... File appengine/swarming/server/task_scheduler.py (right): https://codereview.chromium.org/2914803004/diff/60001/appengine/swarming/serv... appengine/swarming/server/task_scheduler.py:225: # - task hadn't got any ping at all from task_runner.run_command() On 2017/06/08 22:06:01, Vadim Sh. wrote: > What happens if the ping from first bot then comes later anyway, when the task > is already given to another bot. > > Can you add a test for this specific case (if it doesn't exist already)? Added test_bot_poll_http_500_but_bot_reapears_after_BOT_PING_TOLERANCE to clearly test this condition.
lgtm
The CQ bit was checked by maruel@chromium.org
CQ is committing da patch.
Bot data: {"patchset_id": 80001, "attempt_start_ts": 1497660596875810,
"parent_rev": "b833e9503650f0e60e379c05af07a8cb54ab1302", "commit_rev":
"5251e45e518f606fba7a28ec465cc56a8e18e018"}
Message was sent while issue was closed.
Description was changed from ========== Fix non-idempotent task that /poll reaped but returned HTTP 500. This case makes the task get BOT_DIED, which is really poor user experience. Instead cleverly detect that the bot never gave a task update, and retry the task anyway, as it's guaranteed that no user code was run. This should help gain one 9 of reliability as /poll return path is a critical path that significantly affected BOT_DIED levels when retries on idempotent:false were disabled. Change the BOT_DIED latency from 5 minutes to 2 minutes. Make more items unicode instead of str, which makes unit test error diffing easier to read. Most of this CL is test changes. Tweak asserts in _pre_put_hook(), .started_ts must be set in TaskRunResult but it is not set in new_run_result() anymore. The main downside of the current implementation is that it doesn't work with 'id' locked task, which will be addressed afterward to keep this CL simple. R=vadimsh@chromium.org BUG=728716 ========== to ========== Fix non-idempotent task that /poll reaped but returned HTTP 500. This case makes the task get BOT_DIED, which is really poor user experience. Instead cleverly detect that the bot never gave a task update, and retry the task anyway, as it's guaranteed that no user code was run. This should help gain one 9 of reliability as /poll return path is a critical path that significantly affected BOT_DIED levels when retries on idempotent:false were disabled. Change the BOT_DIED latency from 5 minutes to 2 minutes. Make more items unicode instead of str, which makes unit test error diffing easier to read. Most of this CL is test changes. Tweak asserts in _pre_put_hook(), .started_ts must be set in TaskRunResult but it is not set in new_run_result() anymore. The main downside of the current implementation is that it doesn't work with 'id' locked task, which will be addressed afterward to keep this CL simple. R=vadimsh@chromium.org BUG=728716 Review-Url: https://codereview.chromium.org/2914803004 Committed: https://github.com/luci/luci-py/commit/5251e45e518f606fba7a28ec465cc56a8e18e018 ==========
Message was sent while issue was closed.
Committed patchset #4 (id:80001) as https://github.com/luci/luci-py/commit/5251e45e518f606fba7a28ec465cc56a8e18e018 |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
