Issue 1148873007: Fix last_devices to be quieter, and improve device affinity.

luqui

luqui@chromium.org changed reviewers: + jbudorick@chromium.org, navabi@chromium.org

5 years, 7 months ago (2015-05-23 00:27:51 UTC) #1

jbudorick

https://codereview.chromium.org/1148873007/diff/1/build/android/buildbot/bb_device_status_check.py File build/android/buildbot/bb_device_status_check.py (right): https://codereview.chromium.org/1148873007/diff/1/build/android/buildbot/bb_device_status_check.py#newcode116 build/android/buildbot/bb_device_status_check.py:116: def CheckForMissingDevices(options, adb_online_devices): rebase, I changed this function a ...

5 years, 7 months ago (2015-05-23 01:06:49 UTC) #3

luqui

luqui@chromium.org changed reviewers: + simonhatch@chromium.org

5 years, 6 months ago (2015-05-27 20:01:10 UTC) #4

luqui

+simonhatch, while I was in this code trying to improve device_status_check, I noticed that the ...

5 years, 6 months ago (2015-05-27 20:01:12 UTC) #5

+simonhatch, while I was in this code trying to improve device_status_check, I
noticed that the way we assign affinities means adding a previously unknown
device will scramble on average half of the assignments.  I've implemented a new
scheme which will only change 16% of them. See pylib/perf/setup/.py. I'm not
familiar with the perf process, let me know what you think.

https://codereview.chromium.org/1148873007/diff/1/build/android/buildbot/bb_d...
File build/android/buildbot/bb_device_status_check.py (right):

https://codereview.chromium.org/1148873007/diff/1/build/android/buildbot/bb_d...
build/android/buildbot/bb_device_status_check.py:116: def
CheckForMissingDevices(options, adb_online_devices):
On 2015/05/23 01:06:49, jbudorick wrote:
> rebase, I changed this function a few days ago.

Done.

https://codereview.chromium.org/1148873007/diff/1/build/android/buildbot/bb_d...
build/android/buildbot/bb_device_status_check.py:141: # Increment the count of
any missing devices.
On 2015/05/23 01:06:49, jbudorick wrote:
>   last_devices = dict(
>       (k, 0 if k in adb_online_devices else last_devices[k] + 1)
>       for k, v in last_devices.iteritems())
> 
> ?

Yeah that was a bit clumsy.  Went with a more verbose version of this
suggestion.

https://codereview.chromium.org/1148873007/diff/1/build/android/buildbot/bb_d...
build/android/buildbot/bb_device_status_check.py:151: missing_devices = { k: v
for k, v in last_devices.iteeritems() if v != 0 }
On 2015/05/23 01:06:48, jbudorick wrote:
> - iteeritems -> iteritems
> - if not v
> - nit: no spaces at the ends of comprehensions

Done, except for "not v" -- mixing up numberness and truthiness here is too
conceptually punny for me

https://codereview.chromium.org/1148873007/diff/1/build/android/buildbot/bb_d...
build/android/buildbot/bb_device_status_check.py:180: quite_missing = [ k for k,
v in last_devices.iteritems() if v == 3 ]
On 2015/05/23 01:06:49, jbudorick wrote:
> if v > 2
> 
> ?

Done.

https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/device/...
File build/android/pylib/device/device_list.py (right):

https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/device/...
build/android/pylib/device/device_list.py:13: def
GetOfflineDeviceMap(file_name):
On 2015/05/23 01:06:49, jbudorick wrote:
> This is poorly named; it sounds like the map contains offline devices when the
> map can contain online or offline devices.

Done.

https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/se...
File build/android/pylib/perf/setup.py (left):

https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/se...
build/android/pylib/perf/setup.py:28: devices =
android_commands.GetAttachedDevices()
On 2015/05/23 01:06:49, jbudorick wrote:
> How is this still around...

Looks like I forgot to sync before starting work on this :-(

https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/se...
File build/android/pylib/perf/setup.py (right):

https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/se...
build/android/pylib/perf/setup.py:50: This is a way to make sure that affinity
-> device mapping is relatively
On 2015/05/23 01:06:49, jbudorick wrote:
> nit: line break after the first line of the docstring
> 
> also, this should include Args: and Returns:

Done.

https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/se...
build/android/pylib/perf/setup.py:51: stable while still maximizing the use of
resources."""
On 2015/05/23 01:06:49, jbudorick wrote:
> I'm not sure this is stable if you lose one of the lexicographically early
> devices. That's why the persistent device list exists.
> 
> Also, this docstring should better explain what's going on.

IIUC the persistent device list basically messes up everything if a new device
is added.

You gave me an excuse to do some math, so I added a bunch of docs explaining the
scheme and analyzing its stability.

https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/se...
build/android/pylib/perf/setup.py:100: all_devices =
android_commands.GetAttachedDevices()
On 2015/05/23 01:06:49, jbudorick wrote:
> device_utils.DeviceUtils.HealthyDevices()
> 
> if you want serials,
> 
>   [d.adb.GetDeviceSerial()
>    for d in device_utils.DeviceUtils.HealthyDevices()]
> 
> android_commands is deprecated.
> 
> also, you will absolutely want to sort this.

Done.

https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/te...
File build/android/pylib/perf/test_runner.py (right):

https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/te...
build/android/pylib/perf/test_runner.py:68: NUM_DEVICE_AFFINITIES = 8
On 2015/05/23 01:06:49, jbudorick wrote:
> What if the number of devices is higher than 8? I think the speed team has
> designs on something like that...

As far as I could tell there are no specs of device_affinity greater than 8. 
Maybe you could help me look harder (not sure where else to look)

https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/te...
build/android/pylib/perf/test_runner.py:201: assert 0 <= affinity and affinity <
NUM_DEVICE_AFFINITIES, (
On 2015/05/23 01:06:49, jbudorick wrote:
> This should be an exception. We generally avoid assertions in the test runner,
> especially on data we're reading from somewhere else (e.g. the perf test
list).

Done.

navabi

https://codereview.chromium.org/1148873007/diff/60001/build/android/buildbot/bb_device_status_check.py File build/android/buildbot/bb_device_status_check.py (right): https://codereview.chromium.org/1148873007/diff/60001/build/android/buildbot/bb_device_status_check.py#newcode128 build/android/buildbot/bb_device_status_check.py:128: # detected It looks like it denotes a count ...

5 years, 6 months ago (2015-05-27 23:36:20 UTC) #7

luqui

https://codereview.chromium.org/1148873007/diff/60001/build/android/buildbot/bb_device_status_check.py File build/android/buildbot/bb_device_status_check.py (right): https://codereview.chromium.org/1148873007/diff/60001/build/android/buildbot/bb_device_status_check.py#newcode128 build/android/buildbot/bb_device_status_check.py:128: # detected On 2015/05/27 23:36:19, navabi wrote: > It ...

5 years, 6 months ago (2015-05-28 00:59:54 UTC) #8

https://codereview.chromium.org/1148873007/diff/60001/build/android/buildbot/...
File build/android/buildbot/bb_device_status_check.py (right):

https://codereview.chromium.org/1148873007/diff/60001/build/android/buildbot/...
build/android/buildbot/bb_device_status_check.py:128: # detected
On 2015/05/27 23:36:19, navabi wrote:
> It looks like it denotes a count map. The comment makes it sound like it is
just
> a list of device serial numbers.

Done.

https://codereview.chromium.org/1148873007/diff/60001/build/android/buildbot/...
build/android/buildbot/bb_device_status_check.py:137: # Increment the count of
missing devices and add new devices to the map.
On 2015/05/27 23:36:20, navabi wrote:
> this comment is also confusing. Makes it sound like you are incrementing the
> number of missing devices. But I think you are incrementing the count for each
> missing device.

Done.

https://codereview.chromium.org/1148873007/diff/60001/build/android/pylib/per...
File build/android/pylib/perf/setup.py (right):

https://codereview.chromium.org/1148873007/diff/60001/build/android/pylib/per...
build/android/pylib/perf/setup.py:100: 
On 2015/05/27 23:36:20, navabi wrote:
> Ok, so here we have something like this:
> device_cycle = [s1, s2, s3, s1, s2, s3, s1, s2]
> where the length of device_cycle is NUM_DEVICE_AFFINITIES
> 
> Can you explain to me why this hashing/rehashing is better than just assigning
> affinity[x] = device_cycle[x]. I.e. assign device at index i to affinity i?

Imagine we have devices H J M, thus the affinity assignments are

H - 0 3 6
J - 1 4 7
M - 2 5

Now we add a new device I, which redistributes

H - 0 4   # was 0 3 6
I - 1 5   # new device
J - 2 6   # was 1 4 7
M - 3 7   # was 2 5

The only test which stayed on its device was 0.  When the number of devices
changes it is really bad like this.  It's better when we replace a device with
one that sorts differently -- then any tests on devices that are
lexicographically between the old and new serials will be reassigned, on average
33%.

> It was my impression that the problem was just that we don't reassign device
> affinity when we needed to. (I.e. when devices go missing and/or when we find
> new devices).
> 
> It looks like your code now does the DistributeAffinity when you need to, but
I
> don't understand the need for the extra complexity of this hashing solution.

The extra algorithmic complexity of this hashing solution decreases the
complexity of the system as a whole.  Previously affinity distribution depended
on the entire set of devices we had previously seen -- whenever that changes,
affinity distribution changes.  So if the bot is clobbered all the affinities
get reassigned.  First I wanted it to be pure, so it only depended on the set of
online devices and no historical state. That would be achieved by your simpler
solution. But secondly I wanted it to be "continuous" -- so small changes in the
set of online devices produced small changes in the affinity assignment.  That
is the reason for the complexity.

navabi

> The extra algorithmic complexity of this hashing solution decreases the > complexity of the ...

5 years, 6 months ago (2015-05-28 01:19:39 UTC) #9

shatch

https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/test_runner.py File build/android/pylib/perf/test_runner.py (right): https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/test_runner.py#newcode68 build/android/pylib/perf/test_runner.py:68: NUM_DEVICE_AFFINITIES = 8 On 2015/05/27 20:01:12, luqui wrote: > ...

5 years, 6 months ago (2015-05-28 13:59:18 UTC) #10

https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/te...
File build/android/pylib/perf/test_runner.py (right):

https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/te...
build/android/pylib/perf/test_runner.py:68: NUM_DEVICE_AFFINITIES = 8
On 2015/05/27 20:01:12, luqui wrote:
> On 2015/05/23 01:06:49, jbudorick wrote:
> > What if the number of devices is higher than 8? I think the speed team has
> > designs on something like that...
> 
> As far as I could tell there are no specs of device_affinity greater than 8. 
> Maybe you could help me look harder (not sure where else to look)

This isn't a thing yet, but we did ask for some investigation into upping device
limits: crbug.com/466105, so it's something to keep in mind. You might be able
to derive this instead of hardcoding it by looking at max(test affinities).

https://codereview.chromium.org/1148873007/diff/60001/build/android/pylib/per...
File build/android/pylib/perf/setup.py (right):

https://codereview.chromium.org/1148873007/diff/60001/build/android/pylib/per...
build/android/pylib/perf/setup.py:100: 
On 2015/05/28 00:59:54, luqui wrote:
> On 2015/05/27 23:36:20, navabi wrote:
> > Ok, so here we have something like this:
> > device_cycle = [s1, s2, s3, s1, s2, s3, s1, s2]
> > where the length of device_cycle is NUM_DEVICE_AFFINITIES
> > 
> > Can you explain to me why this hashing/rehashing is better than just
assigning
> > affinity[x] = device_cycle[x]. I.e. assign device at index i to affinity i?
> 
> 
> Imagine we have devices H J M, thus the affinity assignments are
> 
> H - 0 3 6
> J - 1 4 7
> M - 2 5
> 
> Now we add a new device I, which redistributes
> 
> H - 0 4   # was 0 3 6
> I - 1 5   # new device
> J - 2 6   # was 1 4 7
> M - 3 7   # was 2 5
> 
> The only test which stayed on its device was 0.  When the number of devices
> changes it is really bad like this.  It's better when we replace a device with
> one that sorts differently -- then any tests on devices that are
> lexicographically between the old and new serials will be reassigned, on
average
> 33%.
> 
> > It was my impression that the problem was just that we don't reassign device
> > affinity when we needed to. (I.e. when devices go missing and/or when we
find
> > new devices).
> > 
> > It looks like your code now does the DistributeAffinity when you need to,
but
> I
> > don't understand the need for the extra complexity of this hashing solution.
> 
> The extra algorithmic complexity of this hashing solution decreases the
> complexity of the system as a whole.  Previously affinity distribution
depended
> on the entire set of devices we had previously seen -- whenever that changes,
> affinity distribution changes.  So if the bot is clobbered all the affinities
> get reassigned.  First I wanted it to be pure, so it only depended on the set
of
> online devices and no historical state. That would be achieved by your simpler
> solution. But secondly I wanted it to be "continuous" -- so small changes in
the
> set of online devices produced small changes in the affinity assignment.  That
> is the reason for the complexity.

This is a really neat solution, but I'm also worried about the extra complexity.

Wouldn't it be simpler and easier to understand if we just persisted the
previous devices + affinities, and new devices just fill in for missing ones?
What are the downsides to that vs this solution, keeping historical state?

jbudorick

https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/test_runner.py File build/android/pylib/perf/test_runner.py (right): https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/test_runner.py#newcode68 build/android/pylib/perf/test_runner.py:68: NUM_DEVICE_AFFINITIES = 8 On 2015/05/27 at 20:01:12, luqui wrote: ...

5 years, 6 months ago (2015-05-28 14:03:17 UTC) #11

https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/te...
File build/android/pylib/perf/test_runner.py (right):

https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/te...
build/android/pylib/perf/test_runner.py:68: NUM_DEVICE_AFFINITIES = 8
On 2015/05/27 at 20:01:12, luqui wrote:
> On 2015/05/23 01:06:49, jbudorick wrote:
> > What if the number of devices is higher than 8? I think the speed team has
> > designs on something like that...
> 
> As far as I could tell there are no specs of device_affinity greater than 8. 
Maybe you could help me look harder (not sure where else to look)

Currently there aren't, and I don't think we have any bot deployments with >8
devices. However, as I said, the speed team is at least considering doing so,
and this doesn't handle that case very well.

https://codereview.chromium.org/1148873007/diff/60001/build/android/pylib/per...
File build/android/pylib/perf/setup.py (right):

https://codereview.chromium.org/1148873007/diff/60001/build/android/pylib/per...
build/android/pylib/perf/setup.py:100: 
On 2015/05/28 at 13:59:18, shatch wrote:
> On 2015/05/28 00:59:54, luqui wrote:
> > On 2015/05/27 23:36:20, navabi wrote:
> > > Ok, so here we have something like this:
> > > device_cycle = [s1, s2, s3, s1, s2, s3, s1, s2]
> > > where the length of device_cycle is NUM_DEVICE_AFFINITIES
> > > 
> > > Can you explain to me why this hashing/rehashing is better than just
assigning
> > > affinity[x] = device_cycle[x]. I.e. assign device at index i to affinity
i?
> > 
> > 
> > Imagine we have devices H J M, thus the affinity assignments are
> > 
> > H - 0 3 6
> > J - 1 4 7
> > M - 2 5
> > 
> > Now we add a new device I, which redistributes
> > 
> > H - 0 4   # was 0 3 6
> > I - 1 5   # new device
> > J - 2 6   # was 1 4 7
> > M - 3 7   # was 2 5
> > 
> > The only test which stayed on its device was 0.  When the number of devices
> > changes it is really bad like this.  It's better when we replace a device
with
> > one that sorts differently -- then any tests on devices that are
> > lexicographically between the old and new serials will be reassigned, on
average
> > 33%.
> > 
> > > It was my impression that the problem was just that we don't reassign
device
> > > affinity when we needed to. (I.e. when devices go missing and/or when we
find
> > > new devices).
> > > 
> > > It looks like your code now does the DistributeAffinity when you need to,
but
> > I
> > > don't understand the need for the extra complexity of this hashing
solution.
> > 
> > The extra algorithmic complexity of this hashing solution decreases the
> > complexity of the system as a whole.  Previously affinity distribution
depended
> > on the entire set of devices we had previously seen -- whenever that
changes,
> > affinity distribution changes.  So if the bot is clobbered all the
affinities
> > get reassigned.  First I wanted it to be pure, so it only depended on the
set of
> > online devices and no historical state. That would be achieved by your
simpler
> > solution. But secondly I wanted it to be "continuous" -- so small changes in
the
> > set of online devices produced small changes in the affinity assignment. 
That
> > is the reason for the complexity.
> 
> This is a really neat solution, but I'm also worried about the extra
complexity.

I'm going to come back to do another full review later, but I just want to add
that I completely agree with this sentiment.

> 
> Wouldn't it be simpler and easier to understand if we just persisted the
previous devices + affinities, and new devices just fill in for missing ones?
What are the downsides to that vs this solution, keeping historical state?

luqui

On 2015/05/28 13:59:18, shatch wrote: > https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/test_runner.py > File build/android/pylib/perf/test_runner.py (right): > > https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/test_runner.py#newcode68 > ...

5 years, 6 months ago (2015-06-03 01:13:00 UTC) #12

On 2015/05/28 13:59:18, shatch wrote:
>
https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/te...
> File build/android/pylib/perf/test_runner.py (right):
> 
>
https://codereview.chromium.org/1148873007/diff/1/build/android/pylib/perf/te...
> build/android/pylib/perf/test_runner.py:68: NUM_DEVICE_AFFINITIES = 8
> On 2015/05/27 20:01:12, luqui wrote:
> > On 2015/05/23 01:06:49, jbudorick wrote:
> > > What if the number of devices is higher than 8? I think the speed team has
> > > designs on something like that...
> > 
> > As far as I could tell there are no specs of device_affinity greater than 8.

> > Maybe you could help me look harder (not sure where else to look)
> 
> This isn't a thing yet, but we did ask for some investigation into upping
device
> limits: crbug.com/466105, so it's something to keep in mind. You might be able
> to derive this instead of hardcoding it by looking at max(test affinities).
> 
>
https://codereview.chromium.org/1148873007/diff/60001/build/android/pylib/per...
> File build/android/pylib/perf/setup.py (right):
> 
>
https://codereview.chromium.org/1148873007/diff/60001/build/android/pylib/per...
> build/android/pylib/perf/setup.py:100: 
> On 2015/05/28 00:59:54, luqui wrote:
> > On 2015/05/27 23:36:20, navabi wrote:
> > > Ok, so here we have something like this:
> > > device_cycle = [s1, s2, s3, s1, s2, s3, s1, s2]
> > > where the length of device_cycle is NUM_DEVICE_AFFINITIES
> > > 
> > > Can you explain to me why this hashing/rehashing is better than just
> assigning
> > > affinity[x] = device_cycle[x]. I.e. assign device at index i to affinity
i?
> > 
> > 
> > Imagine we have devices H J M, thus the affinity assignments are
> > 
> > H - 0 3 6
> > J - 1 4 7
> > M - 2 5
> > 
> > Now we add a new device I, which redistributes
> > 
> > H - 0 4   # was 0 3 6
> > I - 1 5   # new device
> > J - 2 6   # was 1 4 7
> > M - 3 7   # was 2 5
> > 
> > The only test which stayed on its device was 0.  When the number of devices
> > changes it is really bad like this.  It's better when we replace a device
with
> > one that sorts differently -- then any tests on devices that are
> > lexicographically between the old and new serials will be reassigned, on
> average
> > 33%.
> > 
> > > It was my impression that the problem was just that we don't reassign
device
> > > affinity when we needed to. (I.e. when devices go missing and/or when we
> find
> > > new devices).
> > > 
> > > It looks like your code now does the DistributeAffinity when you need to,
> but
> > I
> > > don't understand the need for the extra complexity of this hashing
solution.
> > 
> > The extra algorithmic complexity of this hashing solution decreases the
> > complexity of the system as a whole.  Previously affinity distribution
> depended
> > on the entire set of devices we had previously seen -- whenever that
changes,
> > affinity distribution changes.  So if the bot is clobbered all the
affinities
> > get reassigned.  First I wanted it to be pure, so it only depended on the
set
> of
> > online devices and no historical state. That would be achieved by your
simpler
> > solution. But secondly I wanted it to be "continuous" -- so small changes in
> the
> > set of online devices produced small changes in the affinity assignment. 
That
> > is the reason for the complexity.
> 
> This is a really neat solution, but I'm also worried about the extra
complexity.
> 
> Wouldn't it be simpler and easier to understand if we just persisted the
> previous devices + affinities, and new devices just fill in for missing ones?
> What are the downsides to that vs this solution, keeping historical state?

Well, I thought about that, and the way I perceive it, properly accounting for
the details in "just fill in for missing ones" is a lot of complexity, too.  We
have to have an "expected set" of devices so we know the difference between a
fill-in and a decommissioned/replaced device.  Who maintains that list -- do we
just generate it the first time we haven't seen one before?  What if that was a
bad state; i.e. one of the devices was bad right off the bat?  Who watches for
that?  So many questions.

I am a little bewildered by everyone calling this solution complex.  There is
really not that much to it -- each device picks a random sequence of numbers
representing its choices, then we go around giving each device its first
available choice.  Am I scaring you off with the math in the docstring?

navabi

> Well, I thought about that, and the way I perceive it, properly accounting for ...

5 years, 6 months ago (2015-06-03 01:36:30 UTC) #13

jbudorick

On 2015/06/03 at 01:36:30, navabi wrote: > > Well, I thought about that, and the ...

5 years, 6 months ago (2015-06-03 03:47:56 UTC) #14

On 2015/06/03 at 01:36:30, navabi wrote:
> > Well, I thought about that, and the way I perceive it, properly accounting
for
> > the details in "just fill in for missing ones" is a lot of complexity, too. 
We
> > have to have an "expected set" of devices so we know the difference between
a
> > fill-in and a decommissioned/replaced device.  Who maintains that list -- do
we
> > just generate it the first time we haven't seen one before?  What if that
was a
> > bad state; i.e. one of the devices was bad right off the bat?  Who watches
for
> > that?  So many questions.
> 
> I think this is a good argument for your approach over the previous devices +
affinities approach.

Perhaps I'm alone in this opinion, but it seems to me that your current approach
is complex in both concept and implementation, while the "replace missing
devices" approach is merely complex in the latter.

I frankly don't think it's as simple as you seem to boil it down to below. I was
spooked by the initial implementation (i.e. before the docstring), and I
remained spooked afterwards. That's not to say there isn't a place in the test
scripts that are complex if they solve a difficult problem (cough
PushIfNeeded/PushChangedFiles), though.

> 
> > I am a little bewildered by everyone calling this solution complex.  There
is
> > really not that much to it -- each device picks a random sequence of numbers
> > representing its choices, then we go around giving each device its first
> > available choice.  Am I scaring you off with the math in the docstring?
> 
> I do think the math is too much for the docstring. I'm not against this
approach. I just wanted justification, which now I have.
> 
> Please fix the docstring to more concisely explain that this strategy
minimizes changes in affinity distribution. Otherwise, this lgtm.

If you don't mind, Luke, I'd like to mull this over a bit more vs a device
tracking & replacement method before approving it.

jbudorick

On 2015/06/03 at 03:47:56, jbudorick wrote: > On 2015/06/03 at 01:36:30, navabi wrote: > > ...

5 years, 6 months ago (2015-06-03 03:49:08 UTC) #15

On 2015/06/03 at 03:47:56, jbudorick wrote:
> On 2015/06/03 at 01:36:30, navabi wrote:
> > > Well, I thought about that, and the way I perceive it, properly accounting
for
> > > the details in "just fill in for missing ones" is a lot of complexity,
too.  We
> > > have to have an "expected set" of devices so we know the difference
between a
> > > fill-in and a decommissioned/replaced device.  Who maintains that list --
do we
> > > just generate it the first time we haven't seen one before?  What if that
was a
> > > bad state; i.e. one of the devices was bad right off the bat?  Who watches
for
> > > that?  So many questions.
> > 
> > I think this is a good argument for your approach over the previous devices
+ affinities approach.
> 
> Perhaps I'm alone in this opinion, but it seems to me that your current
approach is complex in both concept and implementation, while the "replace
missing devices" approach is merely complex in the latter.
> 
> I frankly don't think it's as simple as you seem to boil it down to below. I
was spooked by the initial implementation (i.e. before the docstring), and I
remained spooked afterwards. That's not to say there isn't a place in the test
scripts 

*for implementations*

that are complex if they solve a difficult problem (cough
PushIfNeeded/PushChangedFiles), though.
> 
> > 
> > > I am a little bewildered by everyone calling this solution complex.  There
is
> > > really not that much to it -- each device picks a random sequence of
numbers
> > > representing its choices, then we go around giving each device its first
> > > available choice.  Am I scaring you off with the math in the docstring?
> > 
> > I do think the math is too much for the docstring. I'm not against this
approach. I just wanted justification, which now I have.
> > 
> > Please fix the docstring to more concisely explain that this strategy
minimizes changes in affinity distribution. Otherwise, this lgtm.
> 
> If you don't mind, Luke, I'd like to mull this over a bit more vs a device
tracking & replacement method before approving it.

shatch

On 2015/06/03 03:47:56, jbudorick wrote: > On 2015/06/03 at 01:36:30, navabi wrote: > > > ...

5 years, 6 months ago (2015-06-03 14:53:34 UTC) #16

On 2015/06/03 03:47:56, jbudorick wrote:
> On 2015/06/03 at 01:36:30, navabi wrote:
> > > Well, I thought about that, and the way I perceive it, properly accounting
> for
> > > the details in "just fill in for missing ones" is a lot of complexity,
too. 
> We
> > > have to have an "expected set" of devices so we know the difference
between
> a
> > > fill-in and a decommissioned/replaced device.  Who maintains that list --
do
> we
> > > just generate it the first time we haven't seen one before?  What if that
> was a
> > > bad state; i.e. one of the devices was bad right off the bat?  Who watches
> for
> > > that?  So many questions.
> > 
> > I think this is a good argument for your approach over the previous devices
+
> affinities approach.
> 
> Perhaps I'm alone in this opinion, but it seems to me that your current
approach
> is complex in both concept and implementation, while the "replace missing
> devices" approach is merely complex in the latter.
> 
> I frankly don't think it's as simple as you seem to boil it down to below. I
was
> spooked by the initial implementation (i.e. before the docstring), and I
> remained spooked afterwards. That's not to say there isn't a place in the test
> scripts that are complex if they solve a difficult problem (cough
> PushIfNeeded/PushChangedFiles), though.
> 

I agree. The way I see it is, if I ran across this code randomly, I'm not sure
how easily I could debug/modify it. That might be because I'm not amazing at
math :)

I'd prefer a simpler solution, but I don't feel strongly enough about this to
hold you up if everyone else is happy with this.

> > 
> > > I am a little bewildered by everyone calling this solution complex.  There
> is
> > > really not that much to it -- each device picks a random sequence of
numbers
> > > representing its choices, then we go around giving each device its first
> > > available choice.  Am I scaring you off with the math in the docstring?
> > 
> > I do think the math is too much for the docstring. I'm not against this
> approach. I just wanted justification, which now I have.
> > 
> > Please fix the docstring to more concisely explain that this strategy
> minimizes changes in affinity distribution. Otherwise, this lgtm.
> 
> If you don't mind, Luke, I'd like to mull this over a bit more vs a device
> tracking & replacement method before approving it.

navabi

> > If you don't mind, Luke, I'd like to mull this over a bit ...

5 years, 6 months ago (2015-06-03 22:24:15 UTC) #17

navabi

One more thing to think about... https://codereview.chromium.org/1148873007/diff/80001/build/android/pylib/perf/setup.py File build/android/pylib/perf/setup.py (right): https://codereview.chromium.org/1148873007/diff/80001/build/android/pylib/perf/setup.py#newcode94 build/android/pylib/perf/setup.py:94: devices = sorted(devices) ...

5 years, 6 months ago (2015-06-04 22:30:12 UTC) #18

luqui

Here's the next-available collision strategy. Monte-carlo shows that this adds an extra 0.3 reassignments per ...

5 years, 6 months ago (2015-06-12 02:43:36 UTC) #19

navabi

> Will we have to pay every time the host is rebooted? (I.e. do they ...

5 years, 6 months ago (2015-06-12 18:13:02 UTC) #20

luqui

How are people feeling about this? I reinstated sorting because of reboots. Stability is pretty ...

5 years, 6 months ago (2015-06-16 19:12:44 UTC) #21

navabi

On 2015/06/16 19:12:44, luqui wrote: > How are people feeling about this? > > I ...

5 years, 6 months ago (2015-06-16 20:24:47 UTC) #22

shatch

On 2015/06/16 20:24:47, navabi wrote: > On 2015/06/16 19:12:44, luqui wrote: > > How are ...

5 years, 6 months ago (2015-06-22 13:32:56 UTC) #23

navabi

> Sorry for taking so long to get back. I'm good with this, seems simpler. ...

5 years, 6 months ago (2015-06-22 21:13:54 UTC) #24

jbudorick

On 2015/06/22 at 21:13:54, navabi wrote: > > Sorry for taking so long to get ...

5 years, 6 months ago (2015-06-22 21:25:15 UTC) #25

navabi

5 years, 6 months ago (2015-06-23 19:12:38 UTC) #26

> +1 to using lsusb

I spoke with luqui@ about using lsub and he likes this too. Please also change
the description to reflect this.

i.e. replace "Also replaced the device affinity calculation with a new
round-robin hashing scheme that will be more stable as the list of devices
changes."

jbudorick

4 years, 6 months ago (2016-06-10 14:25:56 UTC) #27

Description was changed from

==========
Fix last_devices to be quieter, and improve device affinity.

Now we keep track of the number of builds a device has been offline for,
and when it hits 2 we send an email.  This will prevent a lot of false
positives.

We now clear the list of known devices whenever we detect a new one so we
don't go on reporting the same old missing device forever.   We assume an
engineer has tended to it whenever there is an unseen device.

Also replaced the device affinity calculation with a new round-robin hashing
scheme that will be more stable as the list of devices changes.

BUG=470211,481783,517572
==========

to

==========
Fix last_devices to be quieter, and improve device affinity.

Now we keep track of the number of builds a device has been offline for,
and when it hits 2 we send an email.  This will prevent a lot of false
positives.

We now clear the list of known devices whenever we detect a new one so we
don't go on reporting the same old missing device forever.   We assume an
engineer has tended to it whenever there is an unseen device.

Also replaced the device affinity calculation with a new round-robin hashing
scheme that will be more stable as the list of devices changes.

BUG=470211,481783,517572
==========

Issue 1148873007: Fix last_devices to be quieter, and improve device affinity. (Closed)

Description

Patch Set 1 #

Patch Set 2 : Review comments #

Patch Set 3 : Improve docs #

Patch Set 4 : Improve diagram #

Patch Set 5 : Improve comments #

Patch Set 6 : Nearest-available hash conflict resolution #

Patch Set 7 : Sort devices again for reboot-stability #

Messages