|
|
Chromium Code Reviews
DescriptionReimplement Machine Provider integration
BUG=627560
Committed: https://github.com/luci/luci-py/commit/4ad6fd1e59b4095a6876974a93b30d1bec5c9649
Patch Set 1 #
Total comments: 25
Patch Set 2 #Patch Set 3 : Fix handlers_test.py #Patch Set 4 #
Messages
Total messages: 23 (10 generated)
smut@google.com changed reviewers: + vadimsh@chromium.org
Design doc: https://docs.google.com/document/d/1Bvsu4J0sMziUii3ELBHCLyLq7nfjIiGYDHs3O8BsE... Next steps: Create BotInfo as soon as hostname is known, ensure it has MP-related fields. This has been live on staging for some time now.
looks good, mostly nits https://codereview.chromium.org/2386793002/diff/1/appengine/swarming/server/l... File appengine/swarming/server/lease_management.py (right): https://codereview.chromium.org/2386793002/diff/1/appengine/swarming/server/l... appengine/swarming/server/lease_management.py:140: ndb.Future.wait_any(futures) this doesn't check status code (e.g. if transaction fails, the error is ignored). Same for wait_all below. If you want to ignore errors intentionally, fine, but leave a comment. https://codereview.chromium.org/2386793002/diff/1/appengine/swarming/server/l... appengine/swarming/server/lease_management.py:156: for machine_lease in MachineLease.query( I think it will be more efficient to do a single MachineLease.query() and then load individual MachineType entities by ID specified in machine_lease.machine_type. Though in practice it probably doesn't matter, so whatever... https://codereview.chromium.org/2386793002/diff/1/appengine/swarming/server/l... appengine/swarming/server/lease_management.py:260: def increment_lease_request_id(key): nit: update_client_request_id It takes an entity with client_request_id == None and makes a new client_request_id *shrug* https://codereview.chromium.org/2386793002/diff/1/appengine/swarming/server/l... appengine/swarming/server/lease_management.py:299: if machine_lease.client_request_id: is this possible if drained == True? https://codereview.chromium.org/2386793002/diff/1/appengine/swarming/server/l... appengine/swarming/server/lease_management.py:327: if machine_lease.client_request_id: can the body of this 'if' be converted into a separate function? It is hard to follow. https://codereview.chromium.org/2386793002/diff/1/appengine/swarming/server/l... appengine/swarming/server/lease_management.py:331: duration=machine_lease.lease_duration_secs, I really really wish we can randomize this a bit to avoid periodic avalanches of lease requests. E.g. make duration be something like (0.9 + 0.2 * random()) * duration. It will make lease refreshes even out in time eventually (changes to the config would still suck though, but they are relatively infrequent). https://codereview.chromium.org/2386793002/diff/1/appengine/swarming/server/l... appengine/swarming/server/lease_management.py:354: 'Lease request failed: %s\nRequest ID: %s\nError: %s', nit: "Lease request failed\n%s...." it will make errors grouped in ereporter2 email under single section titled "Lease request failed". Ereporter2 picks first line of an error as an identifier. https://codereview.chromium.org/2386793002/diff/1/appengine/swarming/server/l... appengine/swarming/server/lease_management.py:368: 'Request expired: %s\nRequest ID:%s\nExpired: %s', same here https://codereview.chromium.org/2386793002/diff/1/appengine/swarming/server/l... appengine/swarming/server/lease_management.py:388: # TODO(smut): Create BotInfo, associate lease information. yep, in the same transaction https://codereview.chromium.org/2386793002/diff/1/appengine/swarming/server/l... appengine/swarming/server/lease_management.py:403: delete_machine_lease(key) shouldn't it be in 'else' section of 'if not machine_lease.drained'?
Patchset #2 (id:20001) has been deleted
https://codereview.chromium.org/2386793002/diff/1/appengine/swarming/server/l... File appengine/swarming/server/lease_management.py (right): https://codereview.chromium.org/2386793002/diff/1/appengine/swarming/server/l... appengine/swarming/server/lease_management.py:140: ndb.Future.wait_any(futures) On 2016/10/03 20:26:31, Vadim Sh. wrote: > this doesn't check status code (e.g. if transaction fails, the error is > ignored). Same for wait_all below. > > If you want to ignore errors intentionally, fine, but leave a comment. Done. https://codereview.chromium.org/2386793002/diff/1/appengine/swarming/server/l... appengine/swarming/server/lease_management.py:156: for machine_lease in MachineLease.query( On 2016/10/03 20:26:32, Vadim Sh. wrote: > I think it will be more efficient to do a single MachineLease.query() and then > load individual MachineType entities by ID specified in > machine_lease.machine_type. > > Though in practice it probably doesn't matter, so whatever... There's only one MachineType right now. Didn't do this, but can do later. https://codereview.chromium.org/2386793002/diff/1/appengine/swarming/server/l... appengine/swarming/server/lease_management.py:260: def increment_lease_request_id(key): On 2016/10/03 20:26:32, Vadim Sh. wrote: > nit: update_client_request_id > > It takes an entity with client_request_id == None and makes a new > client_request_id *shrug* Done. https://codereview.chromium.org/2386793002/diff/1/appengine/swarming/server/l... appengine/swarming/server/lease_management.py:299: if machine_lease.client_request_id: On 2016/10/03 20:26:31, Vadim Sh. wrote: > is this possible if drained == True? Yes, there's the case where the MachineLease already sent a lease request and then MachineType got updated and we drained the MachineLease. We need to wait for the leased machine to be returned. Right now we just wait for the lease to expire and don't renew it, later we will use the lease release RPC to return the machine right away. https://codereview.chromium.org/2386793002/diff/1/appengine/swarming/server/l... appengine/swarming/server/lease_management.py:327: if machine_lease.client_request_id: On 2016/10/03 20:26:31, Vadim Sh. wrote: > can the body of this 'if' be converted into a separate function? It is hard to > follow. Done. https://codereview.chromium.org/2386793002/diff/1/appengine/swarming/server/l... appengine/swarming/server/lease_management.py:331: duration=machine_lease.lease_duration_secs, On 2016/10/03 20:26:32, Vadim Sh. wrote: > I really really wish we can randomize this a bit to avoid periodic avalanches of > lease requests. E.g. make duration be something like (0.9 + 0.2 * random()) * > duration. It will make lease refreshes even out in time eventually (changes to > the config would still suck though, but they are relatively infrequent). Your suggestion of varying with +/- 10% seems reasonable to me, but I think that lease duration is part of deduplication logic in MP (it just hashes the whole received proto), so we actually need to retransmit the same lease_duration_secs for the same client_request_id. Will do something in another CL. https://codereview.chromium.org/2386793002/diff/1/appengine/swarming/server/l... appengine/swarming/server/lease_management.py:354: 'Lease request failed: %s\nRequest ID: %s\nError: %s', On 2016/10/03 20:26:32, Vadim Sh. wrote: > nit: "Lease request failed\n%s...." > > it will make errors grouped in ereporter2 email under single section titled > "Lease request failed". Ereporter2 picks first line of an error as an > identifier. Done. https://codereview.chromium.org/2386793002/diff/1/appengine/swarming/server/l... appengine/swarming/server/lease_management.py:368: 'Request expired: %s\nRequest ID:%s\nExpired: %s', On 2016/10/03 20:26:31, Vadim Sh. wrote: > same here Done. https://codereview.chromium.org/2386793002/diff/1/appengine/swarming/server/l... appengine/swarming/server/lease_management.py:388: # TODO(smut): Create BotInfo, associate lease information. On 2016/10/03 20:26:32, Vadim Sh. wrote: > yep, in the same transaction xg in log_lease_fulfillment? https://codereview.chromium.org/2386793002/diff/1/appengine/swarming/server/l... appengine/swarming/server/lease_management.py:403: delete_machine_lease(key) On 2016/10/03 20:26:32, Vadim Sh. wrote: > shouldn't it be in 'else' section of 'if not machine_lease.drained'? Er, yes. I just added return above under the if so it matches the pattern used above.
lgtm https://codereview.chromium.org/2386793002/diff/1/appengine/swarming/server/l... File appengine/swarming/server/lease_management.py (right): https://codereview.chromium.org/2386793002/diff/1/appengine/swarming/server/l... appengine/swarming/server/lease_management.py:331: duration=machine_lease.lease_duration_secs, On 2016/10/03 21:18:07, smut wrote: > On 2016/10/03 20:26:32, Vadim Sh. wrote: > > I really really wish we can randomize this a bit to avoid periodic avalanches > of > > lease requests. E.g. make duration be something like (0.9 + 0.2 * random()) * > > duration. It will make lease refreshes even out in time eventually (changes to > > the config would still suck though, but they are relatively infrequent). > > Your suggestion of varying with +/- 10% seems reasonable to me, but I think that > lease duration is part of deduplication logic in MP (it just hashes the whole > received proto), so we actually need to retransmit the same lease_duration_secs > for the same client_request_id. Will do something in another CL. Instead of random() you can use some hash-like function of client_request_id. It will be effectively "as random", but deterministic. https://codereview.chromium.org/2386793002/diff/1/appengine/swarming/server/l... appengine/swarming/server/lease_management.py:388: # TODO(smut): Create BotInfo, associate lease information. On 2016/10/03 21:18:07, smut wrote: > On 2016/10/03 20:26:32, Vadim Sh. wrote: > > yep, in the same transaction > > xg in log_lease_fulfillment? Yes. Assuming 'manage_lease' is idempotent and will successfully call 'log_lease_fulfillment' again with exact same parameters if it fails the first time. (I think usage of 'client_request_id' ensures that?). xg transactions are not big deal if they are infrequent, like in this case
https://codereview.chromium.org/2386793002/diff/1/appengine/swarming/server/l... File appengine/swarming/server/lease_management.py (right): https://codereview.chromium.org/2386793002/diff/1/appengine/swarming/server/l... appengine/swarming/server/lease_management.py:331: duration=machine_lease.lease_duration_secs, On 2016/10/03 21:34:18, Vadim Sh. wrote: > On 2016/10/03 21:18:07, smut wrote: > > On 2016/10/03 20:26:32, Vadim Sh. wrote: > > > I really really wish we can randomize this a bit to avoid periodic > avalanches > > of > > > lease requests. E.g. make duration be something like (0.9 + 0.2 * random()) > * > > > duration. It will make lease refreshes even out in time eventually (changes > to > > > the config would still suck though, but they are relatively infrequent). > > > > Your suggestion of varying with +/- 10% seems reasonable to me, but I think > that > > lease duration is part of deduplication logic in MP (it just hashes the whole > > received proto), so we actually need to retransmit the same > lease_duration_secs > > for the same client_request_id. Will do something in another CL. > > Instead of random() you can use some hash-like function of client_request_id. It > will be effectively "as random", but deterministic. random.seed(machine_lease.client_request_id) random.random() https://codereview.chromium.org/2386793002/diff/1/appengine/swarming/server/l... appengine/swarming/server/lease_management.py:388: # TODO(smut): Create BotInfo, associate lease information. On 2016/10/03 21:34:18, Vadim Sh. wrote: > On 2016/10/03 21:18:07, smut wrote: > > On 2016/10/03 20:26:32, Vadim Sh. wrote: > > > yep, in the same transaction > > > > xg in log_lease_fulfillment? > > Yes. Assuming 'manage_lease' is idempotent and will successfully call > 'log_lease_fulfillment' again with exact same parameters if it fails the first > time. (I think usage of 'client_request_id' ensures that?). > > xg transactions are not big deal if they are infrequent, like in this case Yes should be idempotent.
https://codereview.chromium.org/2386793002/diff/1/appengine/swarming/server/l... File appengine/swarming/server/lease_management.py (right): https://codereview.chromium.org/2386793002/diff/1/appengine/swarming/server/l... appengine/swarming/server/lease_management.py:331: duration=machine_lease.lease_duration_secs, On 2016/10/03 21:41:50, smut wrote: > On 2016/10/03 21:34:18, Vadim Sh. wrote: > > On 2016/10/03 21:18:07, smut wrote: > > > On 2016/10/03 20:26:32, Vadim Sh. wrote: > > > > I really really wish we can randomize this a bit to avoid periodic > > avalanches > > > of > > > > lease requests. E.g. make duration be something like (0.9 + 0.2 * > random()) > > * > > > > duration. It will make lease refreshes even out in time eventually > (changes > > to > > > > the config would still suck though, but they are relatively infrequent). > > > > > > Your suggestion of varying with +/- 10% seems reasonable to me, but I think > > that > > > lease duration is part of deduplication logic in MP (it just hashes the > whole > > > received proto), so we actually need to retransmit the same > > lease_duration_secs > > > for the same client_request_id. Will do something in another CL. > > > > Instead of random() you can use some hash-like function of client_request_id. > It > > will be effectively "as random", but deterministic. > > random.seed(machine_lease.client_request_id) > random.random() Sort of, except random.seed will modify global seed :-/ Here's what I did in similar case: https://github.com/luci/luci-py/blob/master/appengine/swarming/server/bot_man...
The CQ bit was checked by smut@google.com
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: Luci-py Presubmit on luci.infra.try (JOB_FAILED, https://luci-milo.appspot.com/swarming/task/31a59dc084619410)
The CQ bit was checked by smut@google.com
The patchset sent to the CQ was uploaded after l-g-t-m from vadimsh@chromium.org Link to the patchset: https://codereview.chromium.org/2386793002/#ps60001 (title: "Fix handlers_test.py")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: Luci-py Presubmit on luci.infra.try (JOB_FAILED, https://luci-milo.appspot.com/swarming/task/31a5a2edd9244310)
The CQ bit was checked by smut@google.com
The patchset sent to the CQ was uploaded after l-g-t-m from vadimsh@chromium.org Link to the patchset: https://codereview.chromium.org/2386793002/#ps80001 (title: "")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Message was sent while issue was closed.
Description was changed from ========== Reimplement Machine Provider integration BUG=627560 ========== to ========== Reimplement Machine Provider integration BUG=627560 Committed: https://github.com/luci/luci-py/commit/4ad6fd1e59b4095a6876974a93b30d1bec5c9649 ==========
Message was sent while issue was closed.
Committed patchset #4 (id:80001) as https://github.com/luci/luci-py/commit/4ad6fd1e59b4095a6876974a93b30d1bec5c9649
Message was sent while issue was closed.
On 2016/10/03 21:34:18, Vadim Sh. wrote: > lgtm > > https://codereview.chromium.org/2386793002/diff/1/appengine/swarming/server/l... > File appengine/swarming/server/lease_management.py (right): > > https://codereview.chromium.org/2386793002/diff/1/appengine/swarming/server/l... > appengine/swarming/server/lease_management.py:331: > duration=machine_lease.lease_duration_secs, > On 2016/10/03 21:18:07, smut wrote: > > On 2016/10/03 20:26:32, Vadim Sh. wrote: > > > I really really wish we can randomize this a bit to avoid periodic > avalanches > > of > > > lease requests. E.g. make duration be something like (0.9 + 0.2 * random()) > * > > > duration. It will make lease refreshes even out in time eventually (changes > to > > > the config would still suck though, but they are relatively infrequent). > > > > Your suggestion of varying with +/- 10% seems reasonable to me, but I think > that > > lease duration is part of deduplication logic in MP (it just hashes the whole > > received proto), so we actually need to retransmit the same > lease_duration_secs > > for the same client_request_id. Will do something in another CL. > > Instead of random() you can use some hash-like function of client_request_id. It > will be effectively "as random", but deterministic. > > https://codereview.chromium.org/2386793002/diff/1/appengine/swarming/server/l... > appengine/swarming/server/lease_management.py:388: # TODO(smut): Create BotInfo, > associate lease information. > On 2016/10/03 21:18:07, smut wrote: > > On 2016/10/03 20:26:32, Vadim Sh. wrote: > > > yep, in the same transaction > > > > xg in log_lease_fulfillment? > > Yes. Assuming 'manage_lease' is idempotent and will successfully call > 'log_lease_fulfillment' again with exact same parameters if it fails the first > time. (I think usage of 'client_request_id' ensures that?). > > xg transactions are not big deal if they are infrequent, like in this case Just realized it's hard to do in an xg because of the way BotInfos are normally created. What happens everywhere else is we call bot_management.bot_event() to log a bot-related event. This calls datastore_utils.store_new_version(), which asserts that not ndb.in_transaction() before storing the BotRoot, BotInfo, and BotEvent together in a transaction. We could modify it to have a code path which skips the assertion and then stores MachineLease, BotRoot, BotInfo, and BotEvent in a cross-group. But then we need to write code to circumvent the whole bot_event() route which is normally the only way to create a BotInfo. How about we don't do an xg on fulfillment and instead do a check for each fulfilled machine in lease_management.manage_lease(), like: if machine_lease.hostname and machine_lease.lease_expiration_ts > now: # Request is fulfilled and unexpired. if BotInfo doesn't exist or exists but doesn't have MP-related fields: bot_event('lease_fulfilled', machine_lease.hostname, machine_lease.lease_expiration_ts, ...) bot_event() will take care of logging as a BotEvent that a lease was fulfilled and an MP bot is now expected, as well as creating the BotRoot and BotInfo or adding the MP-related info if the BotRoot/BotInfo are already created. I don't think there's any harm in not doing an xg. It's not really a weird state if MachineLease has hostname and lease_expiration_ts but BotInfo hasn't been created yet. What do you think? |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
