build/android/pylib/perf/setup.py - Issue 1148873007: Fix last_devices to be quieter, and improve device affinity.

Unified Diff: build/android/pylib/perf/setup.py

Issue 1148873007: Fix last_devices to be quieter, and improve device affinity. (Closed) Base URL: https://chromium.googlesource.com/chromium/src.git@master

Patch Set: Improve diagram Created 5 years, 7 months ago

Use n/p to move between diff chunks; N/P to move between comments. Draft comments are only viewable by you.

Jump to:

View side-by-side diff with in-line comments

Download patch

Index: build/android/pylib/perf/setup.py

diff --git a/build/android/pylib/perf/setup.py b/build/android/pylib/perf/setup.py

index 8e1fc28a7e651bef567b336e5593d8be64c75701..de870c2e4f9164739361801a56cf3b5c636bc393 100644

--- a/build/android/pylib/perf/setup.py

+++ b/build/android/pylib/perf/setup.py

@@ -6,6 +6,7 @@

import json

import fnmatch

+import hashlib

import logging

import os

import shutil

@@ -18,18 +19,6 @@ from pylib.perf import test_runner

from pylib.utils import test_environment

-def _GetAllDevices():

- devices_path = os.path.join(os.environ.get('CHROMIUM_OUT_DIR', 'out'),

- device_list.LAST_DEVICES_FILENAME)

- try:

- devices = [device_utils.DeviceUtils(s)

- for s in device_list.GetPersistentDeviceList(devices_path)]

- except IOError as e:

- logging.error('Unable to find %s [%s]', devices_path, e)

- devices = device_utils.DeviceUtils.HealthyDevices()

- return sorted(devices)

def _GetStepsDictFromSingleStep(test_options):

# Running a single command, build the tests structure.

steps_dict = {

@@ -56,6 +45,80 @@ def _GetStepsDict(test_options):

return steps

+def DistributeAffinities(devices):

+ """Create a mapping from device to affinity using a round-robin hash scheme.

+ Args:

+ devices: a list of device serials.

+ Returns: a dictionary mapping each of the given device serials to a list

+ of affinities (numbers between 0 and NUM_DEVICE_AFFINITIES-1) it should

+ run.

+ This function implements an algorithm to assign affinities to devices in a

+ relatively stable way while still maximizing the use of resources.

+ We begin with all affinities unassigned. We cycle through the list of device

+ serials, and for each one we assign a yet-unassigned affinity. We choose the

+ affinity to assign by hashing the device serial, so that changes in the set of

+ devices will produce minimal changes in the assignments. We resolve

+ collisions by rehashing the hash that produced the collision (which will be

+ unique, since it is an md5 hash which we squish down).

+ Another way to think of this: each device corresponds to a random series of

+ affinities, then we cycle through the devices giving each device its first

+ available choice. We expect the first few devices in the list to get their

+ first choice (so they will definitely be stable). Stability is only in danger

+ if a change to the device list produces a cascade of devices getting a

+ different choice than they had before.

+ The probability that a later choice is directly disrupted by an earlier change

+ is 1/A, since there are A different assignments to be chosen and hash

+ functions distribute uniformly. However, there are transitive disruptions: a

+ change in the 1st choice might cause a change in the 2nd choice, which in turn

+ causes a change in the 3rd choice. Call the probability that a the n'th choice

+ is disrupted by changing i'th choice D(i,n) (assuming i < n). We have the

+ relation:

+ D(i,n) = 1 - (1 - 1/A) product from j=i+1 to n-1 of { 1 - D(i,j) D(j,n) }

+ ^^^ ^^^^^^^^^^^^^

+ Disrupted directly Transitive disruptions

+ From numerical solutions with A=8, the worst case D(1,8) = 26% (the

+ probability of a device added lexicographically first disrupting the last

+ device in the cycle). The average probability of a choice being affected by a

+ change is 16%.

+ """

+ if not devices:

+ return {}

+ devices = sorted(devices)

+ device_cycle = []

+ while len(device_cycle) < test_runner.NUM_DEVICE_AFFINITIES:

+ device_cycle.extend(devices)

+ device_cycle = device_cycle[:test_runner.NUM_DEVICE_AFFINITIES]

navabi 2015/05/27 23:36:20 Ok, so here we have something like this: device_cy

luqui 2015/05/28 00:59:54 Imagine we have devices H J M, thus the affinity a

shatch 2015/05/28 13:59:18 This is a really neat solution, but I'm also worri

On 2015/05/28 00:59:54, luqui wrote: > On 2015/05/27 23:36:20, navabi wrote: > > Ok, so here we have something like this: > > device_cycle = [s1, s2, s3, s1, s2, s3, s1, s2] > > where the length of device_cycle is NUM_DEVICE_AFFINITIES > > > > Can you explain to me why this hashing/rehashing is better than just assigning > > affinity[x] = device_cycle[x]. I.e. assign device at index i to affinity i? > > > Imagine we have devices H J M, thus the affinity assignments are > > H - 0 3 6 > J - 1 4 7 > M - 2 5 > > Now we add a new device I, which redistributes > > H - 0 4 # was 0 3 6 > I - 1 5 # new device > J - 2 6 # was 1 4 7 > M - 3 7 # was 2 5 > > The only test which stayed on its device was 0. When the number of devices > changes it is really bad like this. It's better when we replace a device with > one that sorts differently -- then any tests on devices that are > lexicographically between the old and new serials will be reassigned, on average > 33%. > > > It was my impression that the problem was just that we don't reassign device > > affinity when we needed to. (I.e. when devices go missing and/or when we find > > new devices). > > > > It looks like your code now does the DistributeAffinity when you need to, but > I > > don't understand the need for the extra complexity of this hashing solution. > > The extra algorithmic complexity of this hashing solution decreases the > complexity of the system as a whole. Previously affinity distribution depended > on the entire set of devices we had previously seen -- whenever that changes, > affinity distribution changes. So if the bot is clobbered all the affinities > get reassigned. First I wanted it to be pure, so it only depended on the set of > online devices and no historical state. That would be achieved by your simpler > solution. But secondly I wanted it to be "continuous" -- so small changes in the > set of online devices produced small changes in the affinity assignment. That > is the reason for the complexity.

This is a really neat solution, but I'm also worried about the extra complexity. Wouldn't it be simpler and easier to understand if we just persisted the previous devices + affinities, and new devices just fill in for missing ones? What are the downsides to that vs this solution, keeping historical state?

jbudorick 2015/05/28 14:03:16 I'm going to come back to do another full review l

On 2015/05/28 at 13:59:18, shatch wrote: > On 2015/05/28 00:59:54, luqui wrote: > > On 2015/05/27 23:36:20, navabi wrote: > > > Ok, so here we have something like this: > > > device_cycle = [s1, s2, s3, s1, s2, s3, s1, s2] > > > where the length of device_cycle is NUM_DEVICE_AFFINITIES > > > > > > Can you explain to me why this hashing/rehashing is better than just assigning > > > affinity[x] = device_cycle[x]. I.e. assign device at index i to affinity i? > > > > > > Imagine we have devices H J M, thus the affinity assignments are > > > > H - 0 3 6 > > J - 1 4 7 > > M - 2 5 > > > > Now we add a new device I, which redistributes > > > > H - 0 4 # was 0 3 6 > > I - 1 5 # new device > > J - 2 6 # was 1 4 7 > > M - 3 7 # was 2 5 > > > > The only test which stayed on its device was 0. When the number of devices > > changes it is really bad like this. It's better when we replace a device with > > one that sorts differently -- then any tests on devices that are > > lexicographically between the old and new serials will be reassigned, on average > > 33%. > > > > > It was my impression that the problem was just that we don't reassign device > > > affinity when we needed to. (I.e. when devices go missing and/or when we find > > > new devices). > > > > > > It looks like your code now does the DistributeAffinity when you need to, but > > I > > > don't understand the need for the extra complexity of this hashing solution. > > > > The extra algorithmic complexity of this hashing solution decreases the > > complexity of the system as a whole. Previously affinity distribution depended > > on the entire set of devices we had previously seen -- whenever that changes, > > affinity distribution changes. So if the bot is clobbered all the affinities > > get reassigned. First I wanted it to be pure, so it only depended on the set of > > online devices and no historical state. That would be achieved by your simpler > > solution. But secondly I wanted it to be "continuous" -- so small changes in the > > set of online devices produced small changes in the affinity assignment. That > > is the reason for the complexity. > > This is a really neat solution, but I'm also worried about the extra complexity.

I'm going to come back to do another full review later, but I just want to add that I completely agree with this sentiment.

+ affinity_to_device = {}

+ # Hash and rehash each device ID until we get a unique affinity.

+ for device in device_cycle:

+ cur_hash = device

+ while True:

+ cur_hash = hashlib.md5(cur_hash).hexdigest()

+ affinity = int(int(cur_hash, 16) % test_runner.NUM_DEVICE_AFFINITIES)

+ if affinity not in affinity_to_device:

+ affinity_to_device[affinity] = device

+ break

+ # Invert the dictionary we just built.

+ device_to_affinity = {}

+ for k, v in affinity_to_device.iteritems():

+ if v not in device_to_affinity:

+ device_to_affinity[v] = []

+ device_to_affinity[v].append(k)

+ return device_to_affinity

def Setup(test_options):

"""Create and return the test runner factory and tests.

@@ -74,8 +137,9 @@ def Setup(test_options):

# Before running the tests, kill any leftover server.

test_environment.CleanupLeftoverProcesses()

- # We want to keep device affinity, so return all devices ever seen.

- all_devices = _GetAllDevices()

+ all_devices = [d.adb.GetDeviceSerial()

+ for d in device_utils.DeviceUtils.HealthyDevices()]

+ affinity_map = DistributeAffinities(all_devices)

steps_dict = _GetStepsDict(test_options)

sorted_step_names = sorted(steps_dict['steps'].keys())

@@ -89,9 +153,9 @@ def Setup(test_options):

with file(test_options.flaky_steps, 'r') as f:

flaky_steps = json.load(f)

- def TestRunnerFactory(device, shard_index):

+ def TestRunnerFactory(device, _):

+ affinities = affinity_map[device]

return test_runner.TestRunner(

- test_options, device, shard_index, len(all_devices),

- steps_dict, flaky_steps)

+ test_options, device, affinities, steps_dict, flaky_steps)

return (TestRunnerFactory, sorted_step_names, all_devices)

« build/android/buildbot/bb_device_status_check.py ('K') | « build/android/pylib/device/device_list.py ('k') | build/android/pylib/perf/test_runner.py » ('j') | no next file with comments »