|
|
Created:
5 years, 7 months ago by luqui Modified:
4 years, 6 months ago CC:
chromium-reviews, jbudorick+watch_chromium.org, klundberg+watch_chromium.org, yfriedman+watch_chromium.org Base URL:
https://chromium.googlesource.com/chromium/src.git@master Target Ref:
refs/pending/heads/master Project:
chromium Visibility:
Public. |
DescriptionFix last_devices to be quieter, and improve device affinity.
Now we keep track of the number of builds a device has been offline for,
and when it hits 2 we send an email. This will prevent a lot of false
positives.
We now clear the list of known devices whenever we detect a new one so we
don't go on reporting the same old missing device forever. We assume an
engineer has tended to it whenever there is an unseen device.
Also replaced the device affinity calculation with a new round-robin hashing
scheme that will be more stable as the list of devices changes.
BUG=470211, 481783, 517572
Patch Set 1 #
Total comments: 24
Patch Set 2 : Review comments #Patch Set 3 : Improve docs #Patch Set 4 : Improve diagram #
Total comments: 8
Patch Set 5 : Improve comments #
Total comments: 2
Patch Set 6 : Nearest-available hash conflict resolution #Patch Set 7 : Sort devices again for reboot-stability #
Messages
Total messages: 27 (4 generated)
luqui@chromium.org changed reviewers: + jbudorick@chromium.org, navabi@chromium.org
ptal
https://codereview.chromium.org/1148873007/diff/1/build/android/buildbot/bb_d... File build/android/buildbot/bb_device_status_check.py (right): https://codereview.chromium.org/1148873007/diff/1/build/android/buildbot/bb_d... build/android/buildbot/bb_device_status_check.py:116: def CheckForMissingDevices(options, adb_online_devices): rebase, I changed this function a few days ago. https://codereview.chromium.org/1148873007/diff/1/build/android/buildbot/bb_d... build/android/buildbot/bb_device_status_check.py:141: # Increment the count of any missing devices. last_devices = dict( (k, 0 if k in adb_online_devices else last_devices[k] + 1) for k, v in last_devices.iteritems()) ? https://codereview.chromium.org/1148873007/diff/1/build/android/buildbot/bb_d... build/android/buildbot/bb_device_status_check.py:151: missing_devices = { k: v for k, v in last_devices.iteeritems() if v != 0 } - iteeritems -> iteritems - if not v - nit: no spaces at the ends of comprehensions https://codereview.chromium.org/1148873007/diff/1/build/android/buildbot/bb_d... build/android/buildbot/bb_device_status_check.py:180: quite_missing = [ k for k, v in last_devices.iteritems() if v == 3 ] if v > 2 ? https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/device/... File build/android/pylib/device/device_list.py (right): https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/device/... build/android/pylib/device/device_list.py:13: def GetOfflineDeviceMap(file_name): This is poorly named; it sounds like the map contains offline devices when the map can contain online or offline devices. https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/se... File build/android/pylib/perf/setup.py (left): https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/se... build/android/pylib/perf/setup.py:28: devices = android_commands.GetAttachedDevices() How is this still around... https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/se... File build/android/pylib/perf/setup.py (right): https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/se... build/android/pylib/perf/setup.py:50: This is a way to make sure that affinity -> device mapping is relatively nit: line break after the first line of the docstring also, this should include Args: and Returns: https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/se... build/android/pylib/perf/setup.py:51: stable while still maximizing the use of resources.""" I'm not sure this is stable if you lose one of the lexicographically early devices. That's why the persistent device list exists. Also, this docstring should better explain what's going on. https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/se... build/android/pylib/perf/setup.py:100: all_devices = android_commands.GetAttachedDevices() device_utils.DeviceUtils.HealthyDevices() if you want serials, [d.adb.GetDeviceSerial() for d in device_utils.DeviceUtils.HealthyDevices()] android_commands is deprecated. also, you will absolutely want to sort this. https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/te... File build/android/pylib/perf/test_runner.py (right): https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/te... build/android/pylib/perf/test_runner.py:68: NUM_DEVICE_AFFINITIES = 8 What if the number of devices is higher than 8? I think the speed team has designs on something like that... https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/te... build/android/pylib/perf/test_runner.py:201: assert 0 <= affinity and affinity < NUM_DEVICE_AFFINITIES, ( This should be an exception. We generally avoid assertions in the test runner, especially on data we're reading from somewhere else (e.g. the perf test list).
luqui@chromium.org changed reviewers: + simonhatch@chromium.org
+simonhatch, while I was in this code trying to improve device_status_check, I noticed that the way we assign affinities means adding a previously unknown device will scramble on average half of the assignments. I've implemented a new scheme which will only change 16% of them. See pylib/perf/setup/.py. I'm not familiar with the perf process, let me know what you think. https://codereview.chromium.org/1148873007/diff/1/build/android/buildbot/bb_d... File build/android/buildbot/bb_device_status_check.py (right): https://codereview.chromium.org/1148873007/diff/1/build/android/buildbot/bb_d... build/android/buildbot/bb_device_status_check.py:116: def CheckForMissingDevices(options, adb_online_devices): On 2015/05/23 01:06:49, jbudorick wrote: > rebase, I changed this function a few days ago. Done. https://codereview.chromium.org/1148873007/diff/1/build/android/buildbot/bb_d... build/android/buildbot/bb_device_status_check.py:141: # Increment the count of any missing devices. On 2015/05/23 01:06:49, jbudorick wrote: > last_devices = dict( > (k, 0 if k in adb_online_devices else last_devices[k] + 1) > for k, v in last_devices.iteritems()) > > ? Yeah that was a bit clumsy. Went with a more verbose version of this suggestion. https://codereview.chromium.org/1148873007/diff/1/build/android/buildbot/bb_d... build/android/buildbot/bb_device_status_check.py:151: missing_devices = { k: v for k, v in last_devices.iteeritems() if v != 0 } On 2015/05/23 01:06:48, jbudorick wrote: > - iteeritems -> iteritems > - if not v > - nit: no spaces at the ends of comprehensions Done, except for "not v" -- mixing up numberness and truthiness here is too conceptually punny for me https://codereview.chromium.org/1148873007/diff/1/build/android/buildbot/bb_d... build/android/buildbot/bb_device_status_check.py:180: quite_missing = [ k for k, v in last_devices.iteritems() if v == 3 ] On 2015/05/23 01:06:49, jbudorick wrote: > if v > 2 > > ? Done. https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/device/... File build/android/pylib/device/device_list.py (right): https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/device/... build/android/pylib/device/device_list.py:13: def GetOfflineDeviceMap(file_name): On 2015/05/23 01:06:49, jbudorick wrote: > This is poorly named; it sounds like the map contains offline devices when the > map can contain online or offline devices. Done. https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/se... File build/android/pylib/perf/setup.py (left): https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/se... build/android/pylib/perf/setup.py:28: devices = android_commands.GetAttachedDevices() On 2015/05/23 01:06:49, jbudorick wrote: > How is this still around... Looks like I forgot to sync before starting work on this :-( https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/se... File build/android/pylib/perf/setup.py (right): https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/se... build/android/pylib/perf/setup.py:50: This is a way to make sure that affinity -> device mapping is relatively On 2015/05/23 01:06:49, jbudorick wrote: > nit: line break after the first line of the docstring > > also, this should include Args: and Returns: Done. https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/se... build/android/pylib/perf/setup.py:51: stable while still maximizing the use of resources.""" On 2015/05/23 01:06:49, jbudorick wrote: > I'm not sure this is stable if you lose one of the lexicographically early > devices. That's why the persistent device list exists. > > Also, this docstring should better explain what's going on. IIUC the persistent device list basically messes up everything if a new device is added. You gave me an excuse to do some math, so I added a bunch of docs explaining the scheme and analyzing its stability. https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/se... build/android/pylib/perf/setup.py:100: all_devices = android_commands.GetAttachedDevices() On 2015/05/23 01:06:49, jbudorick wrote: > device_utils.DeviceUtils.HealthyDevices() > > if you want serials, > > [d.adb.GetDeviceSerial() > for d in device_utils.DeviceUtils.HealthyDevices()] > > android_commands is deprecated. > > also, you will absolutely want to sort this. Done. https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/te... File build/android/pylib/perf/test_runner.py (right): https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/te... build/android/pylib/perf/test_runner.py:68: NUM_DEVICE_AFFINITIES = 8 On 2015/05/23 01:06:49, jbudorick wrote: > What if the number of devices is higher than 8? I think the speed team has > designs on something like that... As far as I could tell there are no specs of device_affinity greater than 8. Maybe you could help me look harder (not sure where else to look) https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/te... build/android/pylib/perf/test_runner.py:201: assert 0 <= affinity and affinity < NUM_DEVICE_AFFINITIES, ( On 2015/05/23 01:06:49, jbudorick wrote: > This should be an exception. We generally avoid assertions in the test runner, > especially on data we're reading from somewhere else (e.g. the perf test list). Done.
navabi@google.com changed reviewers: + navabi@google.com
https://codereview.chromium.org/1148873007/diff/60001/build/android/buildbot/... File build/android/buildbot/bb_device_status_check.py (right): https://codereview.chromium.org/1148873007/diff/60001/build/android/buildbot/... build/android/buildbot/bb_device_status_check.py:128: # detected It looks like it denotes a count map. The comment makes it sound like it is just a list of device serial numbers. https://codereview.chromium.org/1148873007/diff/60001/build/android/buildbot/... build/android/buildbot/bb_device_status_check.py:137: # Increment the count of missing devices and add new devices to the map. this comment is also confusing. Makes it sound like you are incrementing the number of missing devices. But I think you are incrementing the count for each missing device. https://codereview.chromium.org/1148873007/diff/60001/build/android/pylib/per... File build/android/pylib/perf/setup.py (right): https://codereview.chromium.org/1148873007/diff/60001/build/android/pylib/per... build/android/pylib/perf/setup.py:100: Ok, so here we have something like this: device_cycle = [s1, s2, s3, s1, s2, s3, s1, s2] where the length of device_cycle is NUM_DEVICE_AFFINITIES Can you explain to me why this hashing/rehashing is better than just assigning affinity[x] = device_cycle[x]. I.e. assign device at index i to affinity i? It was my impression that the problem was just that we don't reassign device affinity when we needed to. (I.e. when devices go missing and/or when we find new devices). It looks like your code now does the DistributeAffinity when you need to, but I don't understand the need for the extra complexity of this hashing solution. Is it so that the same tests don't continue to get screwed every time a device goes missing or comes online, and this spreads out which tests get effected?
https://codereview.chromium.org/1148873007/diff/60001/build/android/buildbot/... File build/android/buildbot/bb_device_status_check.py (right): https://codereview.chromium.org/1148873007/diff/60001/build/android/buildbot/... build/android/buildbot/bb_device_status_check.py:128: # detected On 2015/05/27 23:36:19, navabi wrote: > It looks like it denotes a count map. The comment makes it sound like it is just > a list of device serial numbers. Done. https://codereview.chromium.org/1148873007/diff/60001/build/android/buildbot/... build/android/buildbot/bb_device_status_check.py:137: # Increment the count of missing devices and add new devices to the map. On 2015/05/27 23:36:20, navabi wrote: > this comment is also confusing. Makes it sound like you are incrementing the > number of missing devices. But I think you are incrementing the count for each > missing device. Done. https://codereview.chromium.org/1148873007/diff/60001/build/android/pylib/per... File build/android/pylib/perf/setup.py (right): https://codereview.chromium.org/1148873007/diff/60001/build/android/pylib/per... build/android/pylib/perf/setup.py:100: On 2015/05/27 23:36:20, navabi wrote: > Ok, so here we have something like this: > device_cycle = [s1, s2, s3, s1, s2, s3, s1, s2] > where the length of device_cycle is NUM_DEVICE_AFFINITIES > > Can you explain to me why this hashing/rehashing is better than just assigning > affinity[x] = device_cycle[x]. I.e. assign device at index i to affinity i? Imagine we have devices H J M, thus the affinity assignments are H - 0 3 6 J - 1 4 7 M - 2 5 Now we add a new device I, which redistributes H - 0 4 # was 0 3 6 I - 1 5 # new device J - 2 6 # was 1 4 7 M - 3 7 # was 2 5 The only test which stayed on its device was 0. When the number of devices changes it is really bad like this. It's better when we replace a device with one that sorts differently -- then any tests on devices that are lexicographically between the old and new serials will be reassigned, on average 33%. > It was my impression that the problem was just that we don't reassign device > affinity when we needed to. (I.e. when devices go missing and/or when we find > new devices). > > It looks like your code now does the DistributeAffinity when you need to, but I > don't understand the need for the extra complexity of this hashing solution. The extra algorithmic complexity of this hashing solution decreases the complexity of the system as a whole. Previously affinity distribution depended on the entire set of devices we had previously seen -- whenever that changes, affinity distribution changes. So if the bot is clobbered all the affinities get reassigned. First I wanted it to be pure, so it only depended on the set of online devices and no historical state. That would be achieved by your simpler solution. But secondly I wanted it to be "continuous" -- so small changes in the set of online devices produced small changes in the affinity assignment. That is the reason for the complexity.
> The extra algorithmic complexity of this hashing solution decreases the > complexity of the system as a whole. Previously affinity distribution depended > on the entire set of devices we had previously seen -- whenever that changes, > affinity distribution changes. So if the bot is clobbered all the affinities > get reassigned. First I wanted it to be pure, so it only depended on the set of > online devices and no historical state. That would be achieved by your simpler > solution. But secondly I wanted it to be "continuous" -- so small changes in the > set of online devices produced small changes in the affinity assignment. That > is the reason for the complexity. Thank you for the explanation. I see now how it helps. I'll let Simon and other speed team members comment on whether the "continuous" solution warrants the extra complexity. I have no strong feelings either way, and like this solution if it is solving a problem that the speed team encounters. The other stuff LGTM. Please wait for other reviewers to review.
https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/te... File build/android/pylib/perf/test_runner.py (right): https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/te... build/android/pylib/perf/test_runner.py:68: NUM_DEVICE_AFFINITIES = 8 On 2015/05/27 20:01:12, luqui wrote: > On 2015/05/23 01:06:49, jbudorick wrote: > > What if the number of devices is higher than 8? I think the speed team has > > designs on something like that... > > As far as I could tell there are no specs of device_affinity greater than 8. > Maybe you could help me look harder (not sure where else to look) This isn't a thing yet, but we did ask for some investigation into upping device limits: crbug.com/466105, so it's something to keep in mind. You might be able to derive this instead of hardcoding it by looking at max(test affinities). https://codereview.chromium.org/1148873007/diff/60001/build/android/pylib/per... File build/android/pylib/perf/setup.py (right): https://codereview.chromium.org/1148873007/diff/60001/build/android/pylib/per... build/android/pylib/perf/setup.py:100: On 2015/05/28 00:59:54, luqui wrote: > On 2015/05/27 23:36:20, navabi wrote: > > Ok, so here we have something like this: > > device_cycle = [s1, s2, s3, s1, s2, s3, s1, s2] > > where the length of device_cycle is NUM_DEVICE_AFFINITIES > > > > Can you explain to me why this hashing/rehashing is better than just assigning > > affinity[x] = device_cycle[x]. I.e. assign device at index i to affinity i? > > > Imagine we have devices H J M, thus the affinity assignments are > > H - 0 3 6 > J - 1 4 7 > M - 2 5 > > Now we add a new device I, which redistributes > > H - 0 4 # was 0 3 6 > I - 1 5 # new device > J - 2 6 # was 1 4 7 > M - 3 7 # was 2 5 > > The only test which stayed on its device was 0. When the number of devices > changes it is really bad like this. It's better when we replace a device with > one that sorts differently -- then any tests on devices that are > lexicographically between the old and new serials will be reassigned, on average > 33%. > > > It was my impression that the problem was just that we don't reassign device > > affinity when we needed to. (I.e. when devices go missing and/or when we find > > new devices). > > > > It looks like your code now does the DistributeAffinity when you need to, but > I > > don't understand the need for the extra complexity of this hashing solution. > > The extra algorithmic complexity of this hashing solution decreases the > complexity of the system as a whole. Previously affinity distribution depended > on the entire set of devices we had previously seen -- whenever that changes, > affinity distribution changes. So if the bot is clobbered all the affinities > get reassigned. First I wanted it to be pure, so it only depended on the set of > online devices and no historical state. That would be achieved by your simpler > solution. But secondly I wanted it to be "continuous" -- so small changes in the > set of online devices produced small changes in the affinity assignment. That > is the reason for the complexity. This is a really neat solution, but I'm also worried about the extra complexity. Wouldn't it be simpler and easier to understand if we just persisted the previous devices + affinities, and new devices just fill in for missing ones? What are the downsides to that vs this solution, keeping historical state?
https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/te... File build/android/pylib/perf/test_runner.py (right): https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/te... build/android/pylib/perf/test_runner.py:68: NUM_DEVICE_AFFINITIES = 8 On 2015/05/27 at 20:01:12, luqui wrote: > On 2015/05/23 01:06:49, jbudorick wrote: > > What if the number of devices is higher than 8? I think the speed team has > > designs on something like that... > > As far as I could tell there are no specs of device_affinity greater than 8. Maybe you could help me look harder (not sure where else to look) Currently there aren't, and I don't think we have any bot deployments with >8 devices. However, as I said, the speed team is at least considering doing so, and this doesn't handle that case very well. https://codereview.chromium.org/1148873007/diff/60001/build/android/pylib/per... File build/android/pylib/perf/setup.py (right): https://codereview.chromium.org/1148873007/diff/60001/build/android/pylib/per... build/android/pylib/perf/setup.py:100: On 2015/05/28 at 13:59:18, shatch wrote: > On 2015/05/28 00:59:54, luqui wrote: > > On 2015/05/27 23:36:20, navabi wrote: > > > Ok, so here we have something like this: > > > device_cycle = [s1, s2, s3, s1, s2, s3, s1, s2] > > > where the length of device_cycle is NUM_DEVICE_AFFINITIES > > > > > > Can you explain to me why this hashing/rehashing is better than just assigning > > > affinity[x] = device_cycle[x]. I.e. assign device at index i to affinity i? > > > > > > Imagine we have devices H J M, thus the affinity assignments are > > > > H - 0 3 6 > > J - 1 4 7 > > M - 2 5 > > > > Now we add a new device I, which redistributes > > > > H - 0 4 # was 0 3 6 > > I - 1 5 # new device > > J - 2 6 # was 1 4 7 > > M - 3 7 # was 2 5 > > > > The only test which stayed on its device was 0. When the number of devices > > changes it is really bad like this. It's better when we replace a device with > > one that sorts differently -- then any tests on devices that are > > lexicographically between the old and new serials will be reassigned, on average > > 33%. > > > > > It was my impression that the problem was just that we don't reassign device > > > affinity when we needed to. (I.e. when devices go missing and/or when we find > > > new devices). > > > > > > It looks like your code now does the DistributeAffinity when you need to, but > > I > > > don't understand the need for the extra complexity of this hashing solution. > > > > The extra algorithmic complexity of this hashing solution decreases the > > complexity of the system as a whole. Previously affinity distribution depended > > on the entire set of devices we had previously seen -- whenever that changes, > > affinity distribution changes. So if the bot is clobbered all the affinities > > get reassigned. First I wanted it to be pure, so it only depended on the set of > > online devices and no historical state. That would be achieved by your simpler > > solution. But secondly I wanted it to be "continuous" -- so small changes in the > > set of online devices produced small changes in the affinity assignment. That > > is the reason for the complexity. > > This is a really neat solution, but I'm also worried about the extra complexity. I'm going to come back to do another full review later, but I just want to add that I completely agree with this sentiment. > > Wouldn't it be simpler and easier to understand if we just persisted the previous devices + affinities, and new devices just fill in for missing ones? What are the downsides to that vs this solution, keeping historical state?
On 2015/05/28 13:59:18, shatch wrote: > https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/te... > File build/android/pylib/perf/test_runner.py (right): > > https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/te... > build/android/pylib/perf/test_runner.py:68: NUM_DEVICE_AFFINITIES = 8 > On 2015/05/27 20:01:12, luqui wrote: > > On 2015/05/23 01:06:49, jbudorick wrote: > > > What if the number of devices is higher than 8? I think the speed team has > > > designs on something like that... > > > > As far as I could tell there are no specs of device_affinity greater than 8. > > Maybe you could help me look harder (not sure where else to look) > > This isn't a thing yet, but we did ask for some investigation into upping device > limits: crbug.com/466105, so it's something to keep in mind. You might be able > to derive this instead of hardcoding it by looking at max(test affinities). > > https://codereview.chromium.org/1148873007/diff/60001/build/android/pylib/per... > File build/android/pylib/perf/setup.py (right): > > https://codereview.chromium.org/1148873007/diff/60001/build/android/pylib/per... > build/android/pylib/perf/setup.py:100: > On 2015/05/28 00:59:54, luqui wrote: > > On 2015/05/27 23:36:20, navabi wrote: > > > Ok, so here we have something like this: > > > device_cycle = [s1, s2, s3, s1, s2, s3, s1, s2] > > > where the length of device_cycle is NUM_DEVICE_AFFINITIES > > > > > > Can you explain to me why this hashing/rehashing is better than just > assigning > > > affinity[x] = device_cycle[x]. I.e. assign device at index i to affinity i? > > > > > > Imagine we have devices H J M, thus the affinity assignments are > > > > H - 0 3 6 > > J - 1 4 7 > > M - 2 5 > > > > Now we add a new device I, which redistributes > > > > H - 0 4 # was 0 3 6 > > I - 1 5 # new device > > J - 2 6 # was 1 4 7 > > M - 3 7 # was 2 5 > > > > The only test which stayed on its device was 0. When the number of devices > > changes it is really bad like this. It's better when we replace a device with > > one that sorts differently -- then any tests on devices that are > > lexicographically between the old and new serials will be reassigned, on > average > > 33%. > > > > > It was my impression that the problem was just that we don't reassign device > > > affinity when we needed to. (I.e. when devices go missing and/or when we > find > > > new devices). > > > > > > It looks like your code now does the DistributeAffinity when you need to, > but > > I > > > don't understand the need for the extra complexity of this hashing solution. > > > > The extra algorithmic complexity of this hashing solution decreases the > > complexity of the system as a whole. Previously affinity distribution > depended > > on the entire set of devices we had previously seen -- whenever that changes, > > affinity distribution changes. So if the bot is clobbered all the affinities > > get reassigned. First I wanted it to be pure, so it only depended on the set > of > > online devices and no historical state. That would be achieved by your simpler > > solution. But secondly I wanted it to be "continuous" -- so small changes in > the > > set of online devices produced small changes in the affinity assignment. That > > is the reason for the complexity. > > This is a really neat solution, but I'm also worried about the extra complexity. > > Wouldn't it be simpler and easier to understand if we just persisted the > previous devices + affinities, and new devices just fill in for missing ones? > What are the downsides to that vs this solution, keeping historical state? Well, I thought about that, and the way I perceive it, properly accounting for the details in "just fill in for missing ones" is a lot of complexity, too. We have to have an "expected set" of devices so we know the difference between a fill-in and a decommissioned/replaced device. Who maintains that list -- do we just generate it the first time we haven't seen one before? What if that was a bad state; i.e. one of the devices was bad right off the bat? Who watches for that? So many questions. I am a little bewildered by everyone calling this solution complex. There is really not that much to it -- each device picks a random sequence of numbers representing its choices, then we go around giving each device its first available choice. Am I scaring you off with the math in the docstring?
> Well, I thought about that, and the way I perceive it, properly accounting for > the details in "just fill in for missing ones" is a lot of complexity, too. We > have to have an "expected set" of devices so we know the difference between a > fill-in and a decommissioned/replaced device. Who maintains that list -- do we > just generate it the first time we haven't seen one before? What if that was a > bad state; i.e. one of the devices was bad right off the bat? Who watches for > that? So many questions. I think this is a good argument for your approach over the previous devices + affinities approach. > I am a little bewildered by everyone calling this solution complex. There is > really not that much to it -- each device picks a random sequence of numbers > representing its choices, then we go around giving each device its first > available choice. Am I scaring you off with the math in the docstring? I do think the math is too much for the docstring. I'm not against this approach. I just wanted justification, which now I have. Please fix the docstring to more concisely explain that this strategy minimizes changes in affinity distribution. Otherwise, this lgtm.
On 2015/06/03 at 01:36:30, navabi wrote: > > Well, I thought about that, and the way I perceive it, properly accounting for > > the details in "just fill in for missing ones" is a lot of complexity, too. We > > have to have an "expected set" of devices so we know the difference between a > > fill-in and a decommissioned/replaced device. Who maintains that list -- do we > > just generate it the first time we haven't seen one before? What if that was a > > bad state; i.e. one of the devices was bad right off the bat? Who watches for > > that? So many questions. > > I think this is a good argument for your approach over the previous devices + affinities approach. Perhaps I'm alone in this opinion, but it seems to me that your current approach is complex in both concept and implementation, while the "replace missing devices" approach is merely complex in the latter. I frankly don't think it's as simple as you seem to boil it down to below. I was spooked by the initial implementation (i.e. before the docstring), and I remained spooked afterwards. That's not to say there isn't a place in the test scripts that are complex if they solve a difficult problem (cough PushIfNeeded/PushChangedFiles), though. > > > I am a little bewildered by everyone calling this solution complex. There is > > really not that much to it -- each device picks a random sequence of numbers > > representing its choices, then we go around giving each device its first > > available choice. Am I scaring you off with the math in the docstring? > > I do think the math is too much for the docstring. I'm not against this approach. I just wanted justification, which now I have. > > Please fix the docstring to more concisely explain that this strategy minimizes changes in affinity distribution. Otherwise, this lgtm. If you don't mind, Luke, I'd like to mull this over a bit more vs a device tracking & replacement method before approving it.
On 2015/06/03 at 03:47:56, jbudorick wrote: > On 2015/06/03 at 01:36:30, navabi wrote: > > > Well, I thought about that, and the way I perceive it, properly accounting for > > > the details in "just fill in for missing ones" is a lot of complexity, too. We > > > have to have an "expected set" of devices so we know the difference between a > > > fill-in and a decommissioned/replaced device. Who maintains that list -- do we > > > just generate it the first time we haven't seen one before? What if that was a > > > bad state; i.e. one of the devices was bad right off the bat? Who watches for > > > that? So many questions. > > > > I think this is a good argument for your approach over the previous devices + affinities approach. > > Perhaps I'm alone in this opinion, but it seems to me that your current approach is complex in both concept and implementation, while the "replace missing devices" approach is merely complex in the latter. > > I frankly don't think it's as simple as you seem to boil it down to below. I was spooked by the initial implementation (i.e. before the docstring), and I remained spooked afterwards. That's not to say there isn't a place in the test scripts *for implementations* that are complex if they solve a difficult problem (cough PushIfNeeded/PushChangedFiles), though. > > > > > > I am a little bewildered by everyone calling this solution complex. There is > > > really not that much to it -- each device picks a random sequence of numbers > > > representing its choices, then we go around giving each device its first > > > available choice. Am I scaring you off with the math in the docstring? > > > > I do think the math is too much for the docstring. I'm not against this approach. I just wanted justification, which now I have. > > > > Please fix the docstring to more concisely explain that this strategy minimizes changes in affinity distribution. Otherwise, this lgtm. > > If you don't mind, Luke, I'd like to mull this over a bit more vs a device tracking & replacement method before approving it.
On 2015/06/03 03:47:56, jbudorick wrote: > On 2015/06/03 at 01:36:30, navabi wrote: > > > Well, I thought about that, and the way I perceive it, properly accounting > for > > > the details in "just fill in for missing ones" is a lot of complexity, too. > We > > > have to have an "expected set" of devices so we know the difference between > a > > > fill-in and a decommissioned/replaced device. Who maintains that list -- do > we > > > just generate it the first time we haven't seen one before? What if that > was a > > > bad state; i.e. one of the devices was bad right off the bat? Who watches > for > > > that? So many questions. > > > > I think this is a good argument for your approach over the previous devices + > affinities approach. > > Perhaps I'm alone in this opinion, but it seems to me that your current approach > is complex in both concept and implementation, while the "replace missing > devices" approach is merely complex in the latter. > > I frankly don't think it's as simple as you seem to boil it down to below. I was > spooked by the initial implementation (i.e. before the docstring), and I > remained spooked afterwards. That's not to say there isn't a place in the test > scripts that are complex if they solve a difficult problem (cough > PushIfNeeded/PushChangedFiles), though. > I agree. The way I see it is, if I ran across this code randomly, I'm not sure how easily I could debug/modify it. That might be because I'm not amazing at math :) I'd prefer a simpler solution, but I don't feel strongly enough about this to hold you up if everyone else is happy with this. > > > > > I am a little bewildered by everyone calling this solution complex. There > is > > > really not that much to it -- each device picks a random sequence of numbers > > > representing its choices, then we go around giving each device its first > > > available choice. Am I scaring you off with the math in the docstring? > > > > I do think the math is too much for the docstring. I'm not against this > approach. I just wanted justification, which now I have. > > > > Please fix the docstring to more concisely explain that this strategy > minimizes changes in affinity distribution. Otherwise, this lgtm. > > If you don't mind, Luke, I'd like to mull this over a bit more vs a device > tracking & replacement method before approving it.
> > If you don't mind, Luke, I'd like to mull this over a bit more vs a device > > tracking & replacement method before approving it. It looks like the other reviewers are hesitant. Let's think of a simpler solution that has the 2 characteristics that Luke mentioned: 1) depends on the set of online devices (i.e. no historical state) 2) continuous i.e. small changes in the set of online devices produces only small changes in the affinity assignment Hashing the device serial numbers and then %'ing with the number of affinities we need seems like a good general approach for this. The problem comes when we get conflicts, and I think the "weirdness" in the implementation (i.e. the part that is hard to understand if you don't understand the problem being solved) is that loop that will then try to give each device their second choice. How about you take the device hash, % it to find what affinity it gets, and then if there is a conflict you just find the next available affinity? It takes care of #1. For #2, the change will almost certainly not be as small as Luke's solution but simplifies the implementation enough that it is worth it. I.e. instead of seeing this strange while true loop that keeps doing things until it finds the magical assignment, it becomes obvious that hashing is used to find its place, and takes the closest spot it can in case of conflict.
One more thing to think about... https://codereview.chromium.org/1148873007/diff/80001/build/android/pylib/per... File build/android/pylib/perf/setup.py (right): https://codereview.chromium.org/1148873007/diff/80001/build/android/pylib/per... build/android/pylib/perf/setup.py:94: devices = sorted(devices) You may want to not sort the devices. I know you want to avoid keeping historical data (i.e. what were the devices or affinity assignment I had the last build), but 'adb devices' gives you some free historical data. I.e. If I have devices, [a, b, c] on the bot, and 'b' gets replaces with 'd', you 'adb devices' will show [a, c, d]. You may want this, because it let's the existing devices (i.e. a and c) get their first choices before d gets to choose. That way d gets forced into being assigned to what test suites are left, which are those that b had. I know it's not as 'pure' of a solution as it relies on this adb devices artifact, but just something to think about as we consider simpler solutions.
Here's the next-available collision strategy. Monte-carlo shows that this adds an extra 0.3 reassignments per change. https://codereview.chromium.org/1148873007/diff/80001/build/android/pylib/per... File build/android/pylib/perf/setup.py (right): https://codereview.chromium.org/1148873007/diff/80001/build/android/pylib/per... build/android/pylib/perf/setup.py:94: devices = sorted(devices) On 2015/06/04 22:30:11, navabi wrote: > You may want to not sort the devices. I know you want to avoid keeping > historical data (i.e. what were the devices or affinity assignment I had the > last build), but 'adb devices' gives you some free historical data. > > I.e. If I have devices, [a, b, c] on the bot, and 'b' gets replaces with 'd', > you 'adb devices' will show [a, c, d]. You may want this, because it let's the > existing devices (i.e. a and c) get their first choices before d gets to choose. > That way d gets forced into being assigned to what test suites are left, which > are those that b had. > > I know it's not as 'pure' of a solution as it relies on this adb devices > artifact, but just something to think about as we consider simpler solutions. I ran some monte carlo simulations and this turned out to be a very good suggestion. It took it down from 2.1 average reassignments per change to 1.36. Of course, if the history is ever lost we will have to pay that debt... Will we have to pay every time the host is rebooted? (I.e. do they come up randomly?)
> Will we have to pay every time the host is rebooted? (I.e. do they come up randomly?) Oh yeah. That's a great question. Most (maybe all) bots restart between builds, so if it is random, this won't work. I am not sure how it works.
How are people feeling about this? I reinstated sorting because of reboots. Stability is pretty important.
On 2015/06/16 19:12:44, luqui wrote: > How are people feeling about this? > > I reinstated sorting because of reboots. Stability is pretty important. I'm not sure reboots make devices come up randomly. It's an annoying thing to test, but I tried running: navabi@yeezus:/usr/local/google/code/buildbot/build_internal/scripts/slave/android$ python restart_usb.py After restarting all the USB ports, the devices show up in the same order. The order is determined by what USB port they are on. So they will not show up in order if the USB port each device is plugged into is switched. But if a device is replaced, the new device will end up in the same place as the old device, which I believe is optimal for device affinity assignment. The bots should all have restart_usb in /usr/bin/ but it requires root. jbudorick@ and shatch@, can you please take a look and give your thoughts. (FYI. jbudorick@ is ooo until Thursday)
On 2015/06/16 20:24:47, navabi wrote: > On 2015/06/16 19:12:44, luqui wrote: > > How are people feeling about this? > > > > I reinstated sorting because of reboots. Stability is pretty important. > > I'm not sure reboots make devices come up randomly. It's an annoying thing to > test, but I tried running: > navabi@yeezus:/usr/local/google/code/buildbot/build_internal/scripts/slave/android$ > python restart_usb.py > > After restarting all the USB ports, the devices show up in the same order. The > order is determined by what USB port they are on. So they will not show up in > order if the USB port each device is plugged into is switched. But if a device > is replaced, the new device will end up in the same place as the old device, > which I believe is optimal for device affinity assignment. The bots should all > have restart_usb in /usr/bin/ but it requires root. > > jbudorick@ and shatch@, can you please take a look and give your thoughts. (FYI. > jbudorick@ is ooo until Thursday) Sorry for taking so long to get back. I'm good with this, seems simpler. lgtm
> Sorry for taking so long to get back. I'm good with this, seems simpler. lgtm Simon, unfortunately, I'm not sure *this* works. We can't do restart_usb.py without root permission. This was more of an observation to either agree with Luke's original implementation or come up with a new idea. Here is a new idea that actually works (i.e. doesn't need root) that I like better. Luke let me know what you think. lsusb does not require root. It will always list the usb ports in the same order, regardless of restart. With this I think we can achieve the optimal solution. i.e. if a device x is replaced by a new device y, all the other devices will have the same test affinity and device y will take all of x's. lsusb -v lists all the devices on the USB hubs. The -v (verbose) option will include the device serial numbers (i.e. the same thing adb devices shows). You want that order, as it will be consistent between reboots. I think this is the clearest and most optimal way to deal with switching devices. And while it may not be the simplest (i.e. because of the work parsing the output of lsub -v), it's still not too complicated. I particularly like that it is explicitly using USB port mapping to devices to deal with switched devices. what does everyone think?
On 2015/06/22 at 21:13:54, navabi wrote: > > Sorry for taking so long to get back. I'm good with this, seems simpler. lgtm > > Simon, unfortunately, I'm not sure *this* works. We can't do restart_usb.py without root permission. This was more of an observation to either agree with Luke's original implementation or come up with a new idea. Here is a new idea that actually works (i.e. doesn't need root) that I like better. Luke let me know what you think. > > lsusb does not require root. It will always list the usb ports in the same order, regardless of restart. With this I think we can achieve the optimal solution. i.e. if a device x is replaced by a new device y, all the other devices will have the same test affinity and device y will take all of x's. > > lsusb -v lists all the devices on the USB hubs. The -v (verbose) option will include the device serial numbers (i.e. the same thing adb devices shows). You want that order, as it will be consistent between reboots. > > I think this is the clearest and most optimal way to deal with switching devices. And while it may not be the simplest (i.e. because of the work parsing the output of lsub -v), it's still not too complicated. I particularly like that it is explicitly using USB port mapping to devices to deal with switched devices. > > what does everyone think? +1 to using lsusb
> +1 to using lsusb I spoke with luqui@ about using lsub and he likes this too. Please also change the description to reflect this. i.e. replace "Also replaced the device affinity calculation with a new round-robin hashing scheme that will be more stable as the list of devices changes."
Description was changed from ========== Fix last_devices to be quieter, and improve device affinity. Now we keep track of the number of builds a device has been offline for, and when it hits 2 we send an email. This will prevent a lot of false positives. We now clear the list of known devices whenever we detect a new one so we don't go on reporting the same old missing device forever. We assume an engineer has tended to it whenever there is an unseen device. Also replaced the device affinity calculation with a new round-robin hashing scheme that will be more stable as the list of devices changes. BUG=470211,481783,517572 ========== to ========== Fix last_devices to be quieter, and improve device affinity. Now we keep track of the number of builds a device has been offline for, and when it hits 2 we send an email. This will prevent a lot of false positives. We now clear the list of known devices whenever we detect a new one so we don't go on reporting the same old missing device forever. We assume an engineer has tended to it whenever there is an unseen device. Also replaced the device affinity calculation with a new round-robin hashing scheme that will be more stable as the list of devices changes. BUG=470211,481783,517572 ========== |