|
|
Description[Android] Parallelize provision_devices.py.
BUG=401163
Committed: https://src.chromium.org/viewvc/chrome?view=rev&revision=290988
Patch Set 1 #Patch Set 2 : fix presubmit issues #Patch Set 3 : #Patch Set 4 : #Patch Set 5 : fix location prompt flake #Patch Set 6 : #Patch Set 7 : #
Total comments: 7
Patch Set 8 : post-split rebase #Patch Set 9 : set persist.sys.usb.config in /data/local.prop #Patch Set 10 : only run LaunchHostHeartbeat if options.auto_reconnect is enabled #Patch Set 11 : rebase #
Messages
Total messages: 35 (0 generated)
This CL runs the entirety of provisioning in parallel across the desired set of devices. Unfortunately, the gains aren't quite as dramatic as we might've hoped. The parts of provisioning that seem to take the longest, the reboots, are already run in parallel. I've only been able to test this CL on two devices. In that setup, it showed a fairly consistent 12.5% drop in real time for provisioning (~4m before, ~3.5min after). I would imagine that these performance gains would be more dramatic with more attached devices, although I'm somewhat concerned about running into rumored adb race conditions in that scenario.
On 2014/08/02 01:28:15, jbudorick wrote: > This CL runs the entirety of provisioning in parallel across the desired set of > devices. > > Unfortunately, the gains aren't quite as dramatic as we might've hoped. The > parts of provisioning that seem to take the longest, the reboots, are already > run in parallel. > > I've only been able to test this CL on two devices. In that setup, it showed a > fairly consistent 12.5% drop in real time for provisioning (~4m before, ~3.5min > after). I would imagine that these performance gains would be more dramatic with > more attached devices, although I'm somewhat concerned about running into > rumored adb race conditions in that scenario. Let's test with more devices before submitting. We are already seeing stability issues with provision_devices so I think we should be very careful about changes like this and test as much as possible first. We can get you more devices for testing when we are back after the move.
On 2014/08/02 17:12:56, klundberg wrote: > On 2014/08/02 01:28:15, jbudorick wrote: > > This CL runs the entirety of provisioning in parallel across the desired set > of > > devices. > > > > Unfortunately, the gains aren't quite as dramatic as we might've hoped. The > > parts of provisioning that seem to take the longest, the reboots, are already > > run in parallel. > > > > I've only been able to test this CL on two devices. In that setup, it showed a > > fairly consistent 12.5% drop in real time for provisioning (~4m before, > ~3.5min > > after). I would imagine that these performance gains would be more dramatic > with > > more attached devices, although I'm somewhat concerned about running into > > rumored adb race conditions in that scenario. > > Let's test with more devices before submitting. > We are already seeing stability issues with provision_devices so I think we > should be very careful about changes like this and test as much as possible > first. > Agreed, I'm somewhat concerned about running into the race conditions in adb that Craig mentioned before he left. > We can get you more devices for testing when we are back after the move.
On 2014/08/03 04:11:16, jbudorick wrote: > On 2014/08/02 17:12:56, klundberg wrote: > > On 2014/08/02 01:28:15, jbudorick wrote: > > > This CL runs the entirety of provisioning in parallel across the desired set > > of > > > devices. > > > > > > Unfortunately, the gains aren't quite as dramatic as we might've hoped. The > > > parts of provisioning that seem to take the longest, the reboots, are > already > > > run in parallel. > > > > > > I've only been able to test this CL on two devices. In that setup, it showed > a > > > fairly consistent 12.5% drop in real time for provisioning (~4m before, > > ~3.5min > > > after). I would imagine that these performance gains would be more dramatic > > with > > > more attached devices, although I'm somewhat concerned about running into > > > rumored adb race conditions in that scenario. > > > > Let's test with more devices before submitting. > > We are already seeing stability issues with provision_devices so I think we > > should be very careful about changes like this and test as much as possible > > first. > > > > Agreed, I'm somewhat concerned about running into the race conditions in adb > that Craig mentioned before he left. > > > We can get you more devices for testing when we are back after the move. +1 to more testing. Let me know if you need me to get you more devices.
ptal I ran the latest CL 100x overnight on 8 devices (3xN4, 3xN5, 2xN7) in a setup similar to the way the bots work. Before each call to provision_devices, I reset usb and adb, and after each call to provision_devices, I cleared the device blacklist. Notable statistics: - total run time: 471 minutes - average run time: 4m 42s per call to provision_devices - longest run time: 5m 35s - 17 devices were blacklisted in total. All of them were because the device failed to come back after setting 'persist.sys.usb.config' in device_settings. When this happens, adb reports seeing the device serial and lists it as 'offline'. Resetting usb and adb seems to clear the issue, so my current theory is that that's a flake we can't fix. I have not tested perf provisioning yet. I will be doing so starting today.
On 2014/08/15 14:19:16, jbudorick wrote: > ptal > > I ran the latest CL 100x overnight on 8 devices (3xN4, 3xN5, 2xN7) in a setup > similar to the way the bots work. Before each call to provision_devices, I reset > usb and adb, and after each call to provision_devices, I cleared the device > blacklist. > > Notable statistics: > - total run time: 471 minutes > - average run time: 4m 42s per call to provision_devices > - longest run time: 5m 35s > - 17 devices were blacklisted in total. All of them were because the device > failed to come back after setting 'persist.sys.usb.config' in device_settings. > When this happens, adb reports seeing the device serial and lists it as > 'offline'. Resetting usb and adb seems to clear the issue, so my current theory > is that that's a flake we can't fix. > > I have not tested perf provisioning yet. I will be doing so starting today. w.r.t the trybot runs for patchset #7: - android_dbg_tests_recipe doesn't seem to detect the successful provisioning; the actual provision logic finishes in a little more than 4 minutes, but the step doesn't figure that out. - android _dbg_triggered_tests finished provisioning in less than 3 minutes.
Looking good. I just recommend breaking this into as many patches as possible because of the lack of testing and potential side effects of the various changes. Then if something goes wrong w/ any of it, it'll be a lot easier to diagnose. https://codereview.chromium.org/434193002/diff/120001/build/android/pylib/and... File build/android/pylib/android_commands.py (right): https://codereview.chromium.org/434193002/diff/120001/build/android/pylib/and... build/android/pylib/android_commands.py:716: command = command.replace('\'', '\'\\\'\'') I'd recommend landing this separately first. https://codereview.chromium.org/434193002/diff/120001/build/android/pylib/con... File build/android/pylib/content_settings.py (right): https://codereview.chromium.org/434193002/diff/120001/build/android/pylib/con... build/android/pylib/content_settings.py:53: if not key: ditto, given the lack of tests, I'd recommend a separate change here. https://codereview.chromium.org/434193002/diff/120001/build/android/pylib/dev... File build/android/pylib/device/device_utils.py (right): https://codereview.chromium.org/434193002/diff/120001/build/android/pylib/dev... build/android/pylib/device/device_utils.py:710: self._RunShellCommandImpl('stop') Seems like this should be in system_properties, no? https://codereview.chromium.org/434193002/diff/120001/build/android/pylib/dev... File build/android/pylib/device_settings.py (right): https://codereview.chromium.org/434193002/diff/120001/build/android/pylib/dev... build/android/pylib/device_settings.py:100: ENABLE_LOCATION_SETTINGS = [ Seems like another good thing to land separately.
Splitting the flake fixes out into a separate CL or two makes sense to me. https://codereview.chromium.org/434193002/diff/120001/build/android/pylib/con... File build/android/pylib/content_settings.py (right): https://codereview.chromium.org/434193002/diff/120001/build/android/pylib/con... build/android/pylib/content_settings.py:53: if not key: On 2014/08/15 17:19:43, tonyg wrote: > ditto, given the lack of tests, I'd recommend a separate change here. This one seems unlikely to cause more flakes -- anything that behaves differently because of this is no longer dying because of an AssertionError. https://codereview.chromium.org/434193002/diff/120001/build/android/pylib/dev... File build/android/pylib/device/device_utils.py (right): https://codereview.chromium.org/434193002/diff/120001/build/android/pylib/dev... build/android/pylib/device/device_utils.py:710: self._RunShellCommandImpl('stop') On 2014/08/15 17:19:43, tonyg wrote: > Seems like this should be in system_properties, no? It could be, although I'm plotting to get rid of system_properties as part of the rewrite, (which, at the moment, is a little far down on my priorities list). https://codereview.chromium.org/434193002/diff/120001/build/android/pylib/uti... File build/android/pylib/utils/parallelizer.py (right): https://codereview.chromium.org/434193002/diff/120001/build/android/pylib/uti... build/android/pylib/utils/parallelizer.py:177: # TODO(jbudorick): XXX REMOVE BEFORE COMMITTING This will be gone in the next patchset, fwiw.
Split off four CLs from this one: https://codereview.chromium.org/477953002/ https://codereview.chromium.org/477153003/ https://codereview.chromium.org/473293002/ https://codereview.chromium.org/481433004/ This CL is now dependent on all four, so I won't be updating it until they all land.
ptal. The four dependent CLs have all landed. wrt trybot runs: android_dbg_tests_recipe: I'm still trying to figure out why recipes isn't detecting provision completion after the CL. Were it able to do so, provision_devices would have finished running on 8xN5 in a little over 2 minutes. android_dbg_triggered_tests #1: Finished in 8+ minutes with a device hitting the blacklist. Was otherwise finished around 4 minutes, which is consistent with what I was seeing before the split.
The CQ bit was checked by jbudorick@chromium.org
The CQ bit was unchecked by jbudorick@chromium.org
On 2014/08/18 22:41:22, jbudorick wrote: > The CQ bit was checked by mailto:jbudorick@chromium.org misclicked :/
lgtm
+luqui for recipe issues Luke, do you have any idea why the recipe version isn't detecting the end of the revised provisioning step? e.g., http://build.chromium.org/p/tryserver.chromium.linux/builders/android_dbg_tes...
On 2014/08/18 23:09:16, jbudorick wrote: > +luqui for recipe issues > > Luke, do you have any idea why the recipe version isn't detecting the end of the > revised provisioning step? e.g., > http://build.chromium.org/p/tryserver.chromium.linux/builders/android_dbg_tes... lgtm. That is strange. I'm trying it again on the recipe version to see if it is flaky and/or to see if it fails in the same place.
On 2014/08/19 00:08:46, navabi wrote: > On 2014/08/18 23:09:16, jbudorick wrote: > > +luqui for recipe issues > > > > Luke, do you have any idea why the recipe version isn't detecting the end of > the > > revised provisioning step? e.g., > > > http://build.chromium.org/p/tryserver.chromium.linux/builders/android_dbg_tes... > > lgtm. That is strange. I'm trying it again on the recipe version to see if it is > flaky and/or to see if it fails in the same place. It's been failing in the same way since patchset #3, so I imagine it isn't a flake. I don't think I can commit this before we figure out what's going on. I think it'd shatter the downstream waterfall.
It looks like we're running afoul of one of the deadlocks mentioned here: https://docs.python.org/2/library/subprocess.html in build/scripts/common/annotator.run_step in the upstream build repository. We're trying to pipe the stdout and stderr from the subprocess running the step each to a separate thread for filtering.
On 2014/08/19 15:46:11, jbudorick wrote: > It looks like we're running afoul of one of the deadlocks mentioned here: > > https://docs.python.org/2/library/subprocess.html > > in build/scripts/common/annotator.run_step in the upstream build repository. > We're trying to pipe the stdout and stderr from the subprocess running the step > each to a separate thread for filtering. I should've noted: we run afoul of those deadlocks when options.auto_reconnect is enabled, thus running LaunchHostHeartbeat. I had mistakenly had that running in all setups. That is fixed in patchset 10. I'd imagine that we're still liable to run into the deadlocks if options.auto_reconnect is enabled.
Provisioning has succeeded on every run of android_dbg_tests_recipe since patchset 10, and the --auto-reconnect option appears to only be used by the Webkit Android (Nexus 4) bot, which doesn't use recipes at the moment. I'm going to try to land this. I'll warn the android sheriffs.
The CQ bit was checked by jbudorick@chromium.org
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/jbudorick@chromium.org/434193002/200001
On 2014/08/20 19:58:37, jbudorick wrote: > The CQ bit was checked by mailto:jbudorick@chromium.org Thanks John for diagnosing the --auto-reconnect. I noticed that launching the host heartbeat was the last thing that happened before the failure. I should have noticed that it should not have been running on the try testers. I think we can remove the --auto-reconnect form the Webkit bots, as we have since upgraded to more stable devices (this was added back when we had GN's I think). We are also in the process of experimenting with Appurify's APK's which do the same thing, except more sophisticated (at least that's what it sounds like). I'll make sure to keep an eye out for when webkit moves to recipes to not use --auto-reconnect.
FYI, CQ is re-trying this CL (attempt #1). The failing builders are: linux_gpu on tryserver.chromium.gpu (http://build.chromium.org/p/tryserver.chromium.gpu/builders/linux_gpu/builds/...) android_clang_dbg on tryserver.chromium.linux (http://build.chromium.org/p/tryserver.chromium.linux/builders/android_clang_d...) linux_chromium_gn_rel on tryserver.chromium.linux (http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_...)
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: linux_gpu on tryserver.chromium.gpu (http://build.chromium.org/p/tryserver.chromium.gpu/builders/linux_gpu/builds/...) android_clang_dbg on tryserver.chromium.linux (http://build.chromium.org/p/tryserver.chromium.linux/builders/android_clang_d...)
The CQ bit was checked by jbudorick@chromium.org
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/jbudorick@chromium.org/434193002/200001
FYI, CQ is re-trying this CL (attempt #1). The failing builders are: linux_gpu on tryserver.chromium.gpu (http://build.chromium.org/p/tryserver.chromium.gpu/builders/linux_gpu/builds/...) android_clang_dbg on tryserver.chromium.linux (http://build.chromium.org/p/tryserver.chromium.linux/builders/android_clang_d...)
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: linux_gpu on tryserver.chromium.gpu (http://build.chromium.org/p/tryserver.chromium.gpu/builders/linux_gpu/builds/...) android_clang_dbg on tryserver.chromium.linux (http://build.chromium.org/p/tryserver.chromium.linux/builders/android_clang_d...)
The CQ bit was checked by jam@chromium.org
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/jbudorick@chromium.org/434193002/200001
FYI, CQ is re-trying this CL (attempt #1). The failing builders are: linux_gpu on tryserver.chromium.gpu (http://build.chromium.org/p/tryserver.chromium.gpu/builders/linux_gpu/builds/...)
Message was sent while issue was closed.
Committed patchset #11 (200001) as 290988 |