|
|
Chromium Code Reviews|
Created:
4 years, 5 months ago by Paweł Hajdan Jr. Modified:
4 years, 1 month ago CC:
chromium-reviews, infra-reviews+luci-py_chromium.org Base URL:
https://github.com/luci/luci-py.git@master Target Ref:
refs/heads/master Project:
luci-py Visibility:
Public. |
Descriptionswarming: add active jobs pending times metric
BUG=chromium:624508
Committed: https://github.com/luci/luci-py/commit/a00b8e245035fd044ac2af5007b5523cd71e8ee1
Patch Set 1 #Patch Set 2 : pylint #
Total comments: 2
Patch Set 3 : basic test #Patch Set 4 : tests #
Total comments: 1
Messages
Total messages: 26 (13 generated)
The CQ bit was checked by phajdan.jr@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was checked by phajdan.jr@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
sergeyberezin@chromium.org changed reviewers: + maruel@chromium.org, sergeyberezin@chromium.org, vadimsh@chromium.org
Overall looks good, modulo a comment (bug). Thanks! However, I'd get someone else more familiar with swarming to review for any subtleties. I'd also suggest adding a test for this metric. Make some time pass in the test: _set_global_metrics(self.now + <some timedelta>) and measure the expected total pending time as pending_durations.get(flags).sum You also may want to be careful adding more code to this function. As it is, it's already timing out ~20% of the time on the production server. This might push it over the edge. Look for saved queries in https://bigquery.cloud.google.com/queries/chromium-swarm for the actual data. We might need to think of how to optimize global metrics collection, or to split it into multiple requests somehow. Most of the time is spent in python code, not in datastore access. https://codereview.chromium.org/2121323002/diff/20001/appengine/swarming/ts_m... File appengine/swarming/ts_mon_metrics.py (right): https://codereview.chromium.org/2121323002/diff/20001/appengine/swarming/ts_m... appengine/swarming/ts_mon_metrics.py:174: jobs_pending_distributions[key].add(pending_duration) pending_durations.total_seconds()
The CQ bit was checked by phajdan.jr@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: Luci-py Presubmit on master.tryserver.infra (JOB_FAILED, https://build.chromium.org/p/tryserver.infra/builders/Luci-py%20Presubmit/bui...)
The CQ bit was checked by phajdan.jr@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
Yeah, added a test. I had some trouble running swarming tests locally but it's fixed now. PTAL https://codereview.chromium.org/2121323002/diff/20001/appengine/swarming/ts_m... File appengine/swarming/ts_mon_metrics.py (right): https://codereview.chromium.org/2121323002/diff/20001/appengine/swarming/ts_m... appengine/swarming/ts_mon_metrics.py:174: jobs_pending_distributions[key].add(pending_duration) On 2016/07/06 at 18:31:47, Sergey Berezin wrote: > pending_durations.total_seconds() Done (the test also detected this).
LGTM, thanks! Feel free to test it on chromium-swarm-dev, and please watch it carefully if you deploy in production - there is a real danger ts_mon's cron job will not fit in 60s with this additional load.
lgtm Also I've realized this monitoring completely skips tiny tasks (ones that manage to start, run and finish between ts_mon cron job invocations) and deduped tasks :(
The CQ bit was checked by phajdan.jr@chromium.org
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Message was sent while issue was closed.
Description was changed from ========== swarming: add active jobs pending times metric BUG=chromium:624508 ========== to ========== swarming: add active jobs pending times metric BUG=chromium:624508 Committed: https://github.com/luci/luci-py/commit/a00b8e245035fd044ac2af5007b5523cd71e8ee1 ==========
Message was sent while issue was closed.
Committed patchset #4 (id:60001) as https://github.com/luci/luci-py/commit/a00b8e245035fd044ac2af5007b5523cd71e8ee1
Message was sent while issue was closed.
On 2016/07/10 14:18:44, Vadim Sh. wrote: > Also I've realized this monitoring completely skips tiny tasks (ones that manage > to start, run and finish between ts_mon cron job invocations) and deduped tasks > :( In this particular case, it's a feature. Non-cumulative metrics capture the "instantaneous" load on the system, and are not designed to measure CPU-seconds, for example. We'd need to set up cumulative metrics for that, and populate them in places that handle job completion.
Message was sent while issue was closed.
stip@chromium.org changed reviewers: + stip@chromium.org
Message was sent while issue was closed.
https://codereview.chromium.org/2121323002/diff/60001/appengine/swarming/ts_m... File appengine/swarming/ts_mon_metrics.py (right): https://codereview.chromium.org/2121323002/diff/60001/appengine/swarming/ts_m... appengine/swarming/ts_mon_metrics.py:219: [jobs_running, executors_pool, executors_status]) I believe the metric should have been added to this list. Without that, stale data is reported depending on random assignment of task queue executors per https://chromium.googlesource.com/infra/infra/+/01a706d5a431461af911838cf276c... |
