Created: 7 years, 1 month ago by pasko
Modified: 6 years, 2 months ago
CC: chromium-reviews, craigdh+watch_chromium.org, bulach+watch_chromium.org, yfriedman+watch_chromium.org, ilevy-cc_chromium.org, klundberg+watch_chromium.org, frankf+watch_chromium.org
Base URL: https://chromium.googlesource.com/chromium/src.git@master
Visibility: Public.
Description: Telemetry: manage CPU core online state for some device types
To reduce the variance in testing on Qualcomm devices:
1. stop the mpdecision daemon from managing CPU cores
2. enable all cores, as this produces more stable page load times.
Also output CPU frequency numbers before and after running tests, to get an idea of what typical frequencies occur and to manage them later as constants.
Let's see how this works and, in later CLs, attempt to manage CPU frequencies.
BUG=318915
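As a rough illustration of the approach in the description (not the actual patch, which lives in build/android/pylib/perf/perf_control.py), stopping the hotplug daemon and forcing cores online amounts to a few root shell writes to sysfs. The helper names and the device-serial plumbing below are hypothetical:

  import subprocess

  def _adb_shell_as_root(serial, shell_cmd):
    # Hypothetical helper: run a shell command on the device as root via adb.
    return subprocess.check_output(
        ['adb', '-s', serial, 'shell', 'su -c "%s"' % shell_cmd]).decode()

  def force_cpus_online(serial, num_cpus):
    # Stop the Qualcomm hotplug daemon so it cannot take cores offline again.
    _adb_shell_as_root(serial, 'stop mpdecision')
    for cpu in range(num_cpus):
      # Writing '1' to the per-CPU 'online' file brings the core up.
      _adb_shell_as_root(
          serial, 'echo 1 > /sys/devices/system/cpu/cpu%d/online' % cpu)

  def dump_cpu_frequencies(serial, num_cpus):
    # Log current frequencies before/after a test run, as the description suggests.
    for cpu in range(num_cpus):
      freq = _adb_shell_as_root(
          serial,
          'cat /sys/devices/system/cpu/cpu%d/cpufreq/scaling_cur_freq' % cpu)
      print('cpu%d: %s kHz' % (cpu, freq.strip()))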
Patch Set 1 #
Total comments: 39
Patch Set 2 : . #
Total comments: 2
Patch Set 3 : enable 4 cores, modify files without su #
Patch Set 4 : Useful only for N4, enable only for that #
Patch Set 5 : rebased on top of r263055 #
Messages
Total messages: 25 (0 generated)
The state management in this code is complex, but I could not make it simpler without breaking some cases. Also, am I right that we are not accustomed to testing changes like this?
thanks egor! just nits and suggestions, I like the overall idea!!

not sure what you mean by "not accustomed to testing changes"? this is part of the test harness, and it'll be exercised all the time... if you meant tests for the tests, then I'd agree, we probably stop at some level (especially since this is essentially issuing adb shell commands based on the device response, there's not much logic to test...) on parts that have more meaty logic, we eventually add tests... for instance, just recently dominik has kindly added some for the "surface stats": https://codereview.chromium.org/71353006/ so I think the answer to your question is "it depends" :) where there's meaty logic, we tend to have tests (certainly in telemetry), but thin wrappers most of the time don't...

https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
File build/android/pylib/perf/perf_control.py (right):

https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
build/android/pylib/perf/perf_control.py:21: ]
nit: align with _ from 17

https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
build/android/pylib/perf/perf_control.py:35: # Manage CPUs on some device types manually to reduce variance in results.
"manually" is a bit misleading, and they're all "managed" one way or another....
so perhaps:
self._forced_cpus as the list (i.e, in place of "cpus_to_manager" ?
no need for the bool, you can infer that from self._forced_cpus not being empty... wdyt?

https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
build/android/pylib/perf/perf_control.py:55: for cpu in xrange(self._kernel_max +1):
nit: space after +

https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
build/android/pylib/perf/perf_control.py:60: logging.info('Bringing CPU #%d offline.' %cpu_index)
nit: logging takes a format and varargs, so replace % with ,

https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
build/android/pylib/perf/perf_control.py:73: else:
nit: no need for else, unindent the following block.

https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
build/android/pylib/perf/perf_control.py:75: logging.warning('Gave up bringing CPU %d online')
nit: missing ", cpu_index"

https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
build/android/pylib/perf/perf_control.py:78: logging.warning('Could not bring CPU %d online, retrying..' %
nit: s/%/,/

https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
build/android/pylib/perf/perf_control.py:83: return '\n'.join(self._adb.GetProtectedFileContents(file_name))
is this function needed? :)
line 71 would be fine:
'1' in ['1', 'foo', 'bar']
and then just do this directly at line 90...

https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
build/android/pylib/perf/perf_control.py:100: # Stop the proprietary 'mpdecision' hotplug manager to get a constant
is "proprietary" relevant? How about just "Stop 'mdpdecision' hotplug ...

https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
build/android/pylib/perf/perf_control.py:128: pass
sorry, this block is a bit confusing :) how about just inside 122:
if not self._forced_cpus:
  # Set the default perf mode only if the CPU was not forced online.
  for cpu in xrange(...):
    ...

https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
build/android/pylib/perf/perf_control.py:152: logging.info('Writing scaling governor mode \'%s\' -> %s',
nit: replace \' with "
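For readers unfamiliar with the logging nit above: passing the arguments to the logging call instead of pre-formatting with % lets logging do the interpolation itself. A minimal sketch:

  import logging

  logging.basicConfig(level=logging.INFO)
  cpu_index = 1
  # Works, but the string is formatted eagerly and a missing argument raises
  # at the call site:
  logging.info('Bringing CPU #%d offline.' % cpu_index)
  # Preferred: let logging interpolate lazily from the varargs.
  logging.info('Bringing CPU #%d offline.', cpu_index)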
In addition to the inline comments, I have never really understood what we are trying to test by setting performance mode. On the only tests we run with and without performance mode (the old warm and cold startup tests on kapook.com) the results track each other very closely, and there is no evidence that one is noisier than the other. The more we modify the phone's behaviour from its default behaviour, the less likely it is that the regressions we find correspond to the regressions experienced by real users.

https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
File build/android/pylib/perf/perf_control.py (right):

https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
build/android/pylib/perf/perf_control.py:48: 'Nexus 5': [0, 1],
What is your intention in only managing two of the CPUs? I have certainly seen, in traces, Chrome using 4 CPUs, and the (relatively small) difference in Chrome startup time between a normal Nexus 4 and a svelte Nexus 4 appeared to be due to svelte disabling two of the CPUs.

https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
build/android/pylib/perf/perf_control.py:61: self._adb.SetProtectedFileContents(PerfControl._CPU_ONLINE_FMT, '0')
This doesn't use cpu_index, how does it know which CPU to take offline? Do you mean PerfControl._CPU_ONLINE_FMT % cpu_index?

https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
build/android/pylib/perf/perf_control.py:101: # number of CPUs online when running tests. TODO(pasko): stop 'thermald'.
Isn't stopping thermald going to burn out the device?

https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
build/android/pylib/perf/perf_control.py:118: 'Nexus 4': 'ondemand',
Please add 'Nexus 5' to this list.

https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
build/android/pylib/perf/perf_control.py:142: if cpu in self._cpus_on_originally:
I think if the CPU was both online originally and was in the manually managed set you need to reset the performance mode.

https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
build/android/pylib/perf/perf_control.py:145: self._BringCpuOffline(cpu)
Does this clear the performance mode back to the default, or does it simply hide it? If it simply hides it then you need to reset the performance mode before taking it offline.
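To illustrate the bug flagged at line 61 above (the format string is never interpolated with the CPU index), here is a minimal sketch assuming _CPU_ONLINE_FMT is the usual sysfs path template; the adb calls are shown as comments since the wrapper object is not defined here:

  # Assumed template, matching the sysfs layout on Android/Linux.
  _CPU_ONLINE_FMT = '/sys/devices/system/cpu/cpu%d/online'

  cpu_index = 2
  # Buggy: writes to the literal path '/sys/devices/system/cpu/cpu%d/online'.
  #   adb.SetProtectedFileContents(_CPU_ONLINE_FMT, '0')
  # Fixed: interpolate the index first.
  #   adb.SetProtectedFileContents(_CPU_ONLINE_FMT % cpu_index, '0')
  print(_CPU_ONLINE_FMT % cpu_index)  # -> /sys/devices/system/cpu/cpu2/online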
Marcus, Anthony, thank you for review!

On 2013/11/15 19:00:12, aberent wrote:
> In addition to the inline comments, I have never really understood what we are
> trying to test by setting performance mode. On the only tests we run with and
> without performance mode (the old warm and cold startup tests on kapook.com)
> the results track each other very closely, and there is no evidence that one is
> noisier than the other. The more we modify the phone's behaviour from its
> default behaviour, the less likely it is that the regressions we find
> correspond to the regressions experienced by real users.

I feel strongly that we have to reduce all possible sources of variance in our measurements. Without actively fighting for reproducible results, we cannot track regressions, or even observe any performance change. This CL addresses one point about variance, but alone it will not magically fix our numbers. There are a few known methodological problems that, if fixed, together may produce better results, or they may not, due to some other still unknown sources of variance that are masking the signal.

Startup tests have a built-in variability, I believe, because they rely a lot on loading binaries from disk, which makes the flash controller a source of variability. So I am not surprised that we cannot sort out signal from noise in startup tests. However, my hope is that not all tests are like this; *some* JavaScript non-GC-intensive workloads might be much more stable than others.

To answer your question: if we want to choose which of the performance modes should be our constant throughout the test runs, the highest-performance mode that users hit should be our choice, for a few reasons: (a) CPUs are designed to be more predictable in this mode, and (b) optimizing hot codepaths usually pays off better than optimizing cold paths. There are issues with high-perf mode that are hard:
1. melting silicon
2. running benchmarks always in high-perf mode would mask some issues with power management induced by our software: when we drop frames at the wrong times, causing later perf-critical parts to be executed in low-power CPU mode.
I don't have any good generic solutions in mind for (1) and (2), rather some ad-hoc, pain-in-the-tail ones, but I still feel strongly that we should always run performance tests in high-performance CPU mode.

https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
File build/android/pylib/perf/perf_control.py (right):

https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
build/android/pylib/perf/perf_control.py:21: ]
On 2013/11/15 17:54:43, bulach wrote:
> nit: align with _ from 17
Done.

https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
build/android/pylib/perf/perf_control.py:35: # Manage CPUs on some device types manually to reduce variance in results.
On 2013/11/15 17:54:43, bulach wrote:
> "manually" is a bit misleading, and they're all "managed" one way or another....
> so perhaps:
> self._forced_cpus as the list (i.e, in place of "cpus_to_manager" ?
> no need for the bool, you can infer that from self._forced_cpus not being
> empty... wdyt?
I tried that, and it looks less readable to me to check for emptiness of a list in a few places compared to checking for one boolean mode-like variable that is set up once and remains constant throughout the lifetime of the object. Though arguably the list is set up only once too. _forced_cpus also sounds a bit general, how about _must_be_online_cpus?

https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
build/android/pylib/perf/perf_control.py:48: 'Nexus 5': [0, 1],
On 2013/11/15 19:00:12, aberent wrote:
> What is your intention in only managing two of the CPUs? I have certainly seen,
> in traces, Chrome using 4 CPUs, and the (relatively small) difference in Chrome
> startup time between a normal Nexus 4 and a svelte Nexus 4 appeared to be due
> to svelte disabling two of the CPUs.
I looked at a few pagecycler runs on the bot and found that 2-3 CPUs are used at the end of the tests. With 2 I am also trying to be on the safer side to avoid overheating, but we can certainly tweak these numbers later. The main purpose is making the number of CPUs online a constant throughout the test run, to make the results more predictable. Frankly, I did not measure whether anything in fact became more stable; we would need to check that on the dashboard, since the variability impact can be quite different for different tests.

https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
build/android/pylib/perf/perf_control.py:55: for cpu in xrange(self._kernel_max +1):
On 2013/11/15 17:54:43, bulach wrote:
> nit: space after +
Done.

https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
build/android/pylib/perf/perf_control.py:60: logging.info('Bringing CPU #%d offline.' %cpu_index)
On 2013/11/15 17:54:43, bulach wrote:
> nit: logging takes a format and varargs, so replace % with ,
Done.

https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
build/android/pylib/perf/perf_control.py:61: self._adb.SetProtectedFileContents(PerfControl._CPU_ONLINE_FMT, '0')
On 2013/11/15 19:00:12, aberent wrote:
> This doesn't use cpu_index, how does it know which CPU to take offline? Do you
> mean PerfControl._CPU_ONLINE_FMT % cpu_index?
That's a great catch! Thank you! Do you think it makes sense to mock self._adb and verify the logic in unittests? The logic is not particularly tricky, but... with Python I am never sure even the syntax of the code is correct without covering every line of code at runtime, which can be even more tedious than writing unittests. Yes, I am not a big advocate for Python :)

https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
build/android/pylib/perf/perf_control.py:73: else:
On 2013/11/15 17:54:43, bulach wrote:
> nit: no need for else, unindent the following block.
Done.

https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
build/android/pylib/perf/perf_control.py:75: logging.warning('Gave up bringing CPU %d online')
On 2013/11/15 17:54:43, bulach wrote:
> nit: missing ", cpu_index"
Thank you! You know, I really got used to GCC warnings about unsatisfied format strings, but luckily good reviewers do the job of the compiler. It adds a little bit of a feeling of guilt, but nothing unmanageable :) Done.

https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
build/android/pylib/perf/perf_control.py:78: logging.warning('Could not bring CPU %d online, retrying..' %
On 2013/11/15 17:54:43, bulach wrote:
> nit: s/%/,/
Done.

https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
build/android/pylib/perf/perf_control.py:83: return '\n'.join(self._adb.GetProtectedFileContents(file_name))
On 2013/11/15 17:54:43, bulach wrote:
> is this function needed? :)
> line 71 would be fine:
> '1' in ['1', 'foo', 'bar']
> and then just do this directly at line 90...
This function made 3 lines (lines 56, 72 and 90) shorter and not wrap; on the other hand, it added 3 lines to the file. The final argument of it having a descriptive name won.

https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
build/android/pylib/perf/perf_control.py:100: # Stop the proprietary 'mpdecision' hotplug manager to get a constant
On 2013/11/15 17:54:43, bulach wrote:
> is "proprietary" relevant? How about just "Stop 'mdpdecision' hotplug ...
No, not relevant, I was just angry :) Removed. I would not take your s/ suggestion with the typo because it may sound even more rude than "proprietary". But if you want, we can make it "m#$%decision".

https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
build/android/pylib/perf/perf_control.py:101: # number of CPUs online when running tests. TODO(pasko): stop 'thermald'.
On 2013/11/15 19:00:12, aberent wrote:
> Isn't stopping thermald going to burn out the device?
Yes, it certainly may. So I was thinking of having it as an option for devices with enough heatsinks installed. That is not typical of user devices, but I value low variance more than being maximally representative. Let's separate this: in-lab performance tests should have low variance, in-field metrics/histograms should be representative. WDYT?

https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
build/android/pylib/perf/perf_control.py:118: 'Nexus 4': 'ondemand',
On 2013/11/15 19:00:12, aberent wrote:
> Please add 'Nexus 5' to this list.
Done.

https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
build/android/pylib/perf/perf_control.py:128: pass
On 2013/11/15 17:54:43, bulach wrote:
> sorry, this block is a bit confusing :) how about just inside 122:
> if not self._forced_cpus:
>   # Set the default perf mode only if the CPU was not forced online.
>   for cpu in xrange(...):
>     ...
Sorry, it was. Good suggestion. Done.

https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
build/android/pylib/perf/perf_control.py:142: if cpu in self._cpus_on_originally:
On 2013/11/15 19:00:12, aberent wrote:
> I think if the CPU was both online originally and was in the manually managed
> set you need to reset the performance mode.
That's right. I forgot to restore the scaling governor mode, mainly because it would be manipulated by mpdecision anyway. This would be cleaner if restores were performed with the scaling governor involved explicitly. Done. By the way, I always wanted to ask: do we use SetDefaultPerfMode() for something that is running on perf bots, or is it just for local experimentation?

https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
build/android/pylib/perf/perf_control.py:145: self._BringCpuOffline(cpu)
On 2013/11/15 19:00:12, aberent wrote:
> Does this clear the performance mode back to the default, or does it simply
> hide it? If it simply hides it then you need to reset the performance mode
> before taking it offline.
Done.

https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
build/android/pylib/perf/perf_control.py:152: logging.info('Writing scaling governor mode \'%s\' -> %s',
On 2013/11/15 17:54:43, bulach wrote:
> nit: replace \' with "
Done.
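The restore-the-governor point discussed above boils down to saving each CPU's scaling_governor value before forcing 'performance' and writing it back on teardown. A minimal sketch of that idea, assuming direct sysfs access on a rooted device (the real code goes through the adb wrapper instead):

  _GOVERNOR_FMT = '/sys/devices/system/cpu/cpu%d/cpufreq/scaling_governor'

  def set_governor(cpu, mode):
    with open(_GOVERNOR_FMT % cpu, 'w') as f:
      f.write(mode)

  def run_with_performance_governor(cpus, test_fn):
    # Remember what each CPU was using so it can be restored afterwards.
    saved = {}
    for cpu in cpus:
      with open(_GOVERNOR_FMT % cpu) as f:
        saved[cpu] = f.read().strip()
      set_governor(cpu, 'performance')
    try:
      return test_fn()
    finally:
      # Restore the original governor even if the test raises.
      for cpu, mode in saved.items():
        set_governor(cpu, mode)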
On 2013/11/18 15:11:18, pasko wrote:
> Marcus, Anthony, thank you for review!
>
> On 2013/11/15 19:00:12, aberent wrote:
> > In addition to the inline comments, I have never really understood what we
> > are trying to test by setting performance mode. On the only tests we run with
> > and without performance mode (the old warm and cold startup tests on
> > kapook.com) the results track each other very closely, and there is no
> > evidence that one is noisier than the other. The more we modify the phone's
> > behaviour from its default behaviour, the less likely it is that the
> > regressions we find correspond to the regressions experienced by real users.
>
> I feel strongly that we have to reduce all possible sources of variance in our
> measurements. Without actively fighting for reproducible results, we cannot
> track regressions, or even observe any performance change. This CL addresses
> one point about variance, but alone it will not magically fix our numbers.
> There are a few known methodological problems that, if fixed, together may
> produce better results, or they may not, due to some other still unknown
> sources of variance that are masking the signal.
>
> Startup tests have a built-in variability, I believe, because they rely a lot
> on loading binaries from disk, which makes the flash controller a source of
> variability. So I am not surprised that we cannot sort out signal from noise in
> startup tests. However, my hope is that not all tests are like this; *some*
> JavaScript non-GC-intensive workloads might be much more stable than others.

Actually the startup tests, at least on Manta and Mako, have much lower variability than many of our other tests. See https://chromeperf.appspot.com/report?masters=ClankUpstreamInternal&bots=mako.... The typical sd on the kapook.com results on Manta is around 1%, and it is less than this on Mako. My worry, apart from performance mode not being representative, is that setting performance mode may actually increase the variability of the tests by increasing thermal throttling. On the startup tests we throw away results and re-run any tests that are thermally throttled; I don't think we yet do this on Telemetry tests, although we do print a warning in the log.

It would be interesting to try some tests with and without performance mode, and see how the variability compares.

Maybe we should really have two sets of bots: one for test devices (possibly development boards with lots of extra cooling) running permanently at maximum performance, and a second set running typical user devices in a typical user configuration.
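The throw-away-and-retry behaviour aberent describes for thermally throttled startup runs amounts to the loop below. This is a hypothetical sketch, not the actual harness code; run_startup_test and is_device_throttled stand in for whatever the real test and throttle checks are:

  def run_until_not_throttled(run_startup_test, is_device_throttled,
                              max_attempts=5):
    """Re-run a benchmark until a run completes without thermal throttling."""
    for attempt in range(max_attempts):
      result = run_startup_test()
      if not is_device_throttled():
        return result
      # The run is suspect: the CPU was clocked down mid-test, so discard it.
      print('Attempt %d was thermally throttled, retrying...' % (attempt + 1))
    raise RuntimeError(
        'Device overheated on every run, giving up after %d retries'
        % max_attempts)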
https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
File build/android/pylib/perf/perf_control.py (right):

https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
build/android/pylib/perf/perf_control.py:101: # number of CPUs online when running tests. TODO(pasko): stop 'thermald'.
On 2013/11/18 15:11:18, pasko wrote:
> On 2013/11/15 19:00:12, aberent wrote:
> > Isn't stopping thermald going to burn out the device?
>
> Yes, it certainly may. So I was thinking of having it as an option for devices
> with enough heatsinks installed. That is not typical of user devices, but I
> value low variance more than being maximally representative. Let's separate
> this: in-lab performance tests should have low variance, in-field
> metrics/histograms should be representative. WDYT?
Makes sense, if we can find devices that will run reliably at full performance without overheating.
On 2013/11/18 16:31:51, aberent wrote:
> Actually the startup tests, at least on Manta and Mako, have much lower
> variability than many of our other tests. See
> https://chromeperf.appspot.com/report?masters=ClankUpstreamInternal&bots=mako....
> The typical sd on the kapook.com results on Manta is around 1%, and it is less
> than this on Mako. My worry, apart from performance mode not being
> representative, is that setting performance mode may actually increase the
> variability of the tests by increasing thermal throttling. On the startup tests
> we throw away results and re-run any tests that are thermally throttled; I
> don't think we yet do this on Telemetry tests, although we do print a warning
> in the log.

Oh, that's great stuff! By the way, did you look at CPU utilization numbers on startup tests? I would not be surprised if it's below 50%. Does thermal throttling appear often on startup tests? Sorry for the off-topic :)

> It would be interesting to try some tests with and without performance mode,
> and see how the variability compares.
>
> Maybe we should really have two sets of bots: one for test devices (possibly
> development boards with lots of extra cooling) running permanently at maximum
> performance, and a second set running typical user devices in a typical user
> configuration.

Yes, that'd be cool. The 2x maintenance cost makes me sad, though. So if I am forced to choose between real devices and rock-stable cooled dev boards (do they exist?), my choice should be obvious now.
On 2013/11/18 16:48:22, pasko wrote:
> On 2013/11/18 16:31:51, aberent wrote:
> > Actually the startup tests, at least on Manta and Mako, have much lower
> > variability than many of our other tests. See
> > https://chromeperf.appspot.com/report?masters=ClankUpstreamInternal&bots=mako....
> > The typical sd on the kapook.com results on Manta is around 1%, and it is
> > less than this on Mako. My worry, apart from performance mode not being
> > representative, is that setting performance mode may actually increase the
> > variability of the tests by increasing thermal throttling. On the startup
> > tests we throw away results and re-run any tests that are thermally
> > throttled; I don't think we yet do this on Telemetry tests, although we do
> > print a warning in the log.
>
> Oh, that's great stuff! By the way, did you look at CPU utilization numbers on
> startup tests? I would not be surprised if it's below 50%. Does thermal
> throttling appear often on startup tests?

I don't know what the overall CPU usage is on the startup tests, but I do know that some parts of the startup sequence are CPU bound, and other parts are a long way from being CPU bound. There is a measurable difference in startup time between a Nexus 4 with four enabled CPUs and one with two enabled CPUs (an occam svelte), and if one looks at a trace one can see that at times during startup all the CPUs are running at 100% (even with 4 CPUs). At one time we had an issue on some targets, when running with the default performance governor setting, where the CPU frequency would be stepped down during the non-CPU-bound sections of startup and would then take some time to step up again in the CPU-bound sections, but this has now (I think) gone away due to other changes in the startup sequence. Note that we would not have seen this issue if we had been running in performance mode, but it was entirely reproducible.

Yes, thermal throttling is a big problem in the startup tests. I originally wrote the thermal throttling code because thermal throttling was a major source of variance in the startup tests. I am rather surprised that kapook.com gives any results at all on the bots; I haven't been able to run it locally on devices that detect thermal throttling since I put in the thermal throttling check, since they always overheat on every run and give up after 5 retries.

> Sorry for the off-topic :)
>
> > It would be interesting to try some tests with and without performance mode,
> > and see how the variability compares.
> >
> > Maybe we should really have two sets of bots: one for test devices (possibly
> > development boards with lots of extra cooling) running permanently at maximum
> > performance, and a second set running typical user devices in a typical user
> > configuration.
>
> Yes, that'd be cool. The 2x maintenance cost makes me sad, though. So if I am
> forced to choose between real devices and rock-stable cooled dev boards (do
> they exist?), my choice should be obvious now.
On 2013/11/18 17:22:03, aberent wrote:
> On 2013/11/18 16:48:22, pasko wrote:
> > On 2013/11/18 16:31:51, aberent wrote:
> > > Actually the startup tests, at least on Manta and Mako, have much lower
> > > variability than many of our other tests. See
> > > https://chromeperf.appspot.com/report?masters=ClankUpstreamInternal&bots=mako....
> > > The typical sd on the kapook.com results on Manta is around 1%, and it is
> > > less than this on Mako. My worry, apart from performance mode not being
> > > representative, is that setting performance mode may actually increase the
> > > variability of the tests by increasing thermal throttling. On the startup
> > > tests we throw away results and re-run any tests that are thermally
> > > throttled; I don't think we yet do this on Telemetry tests, although we do
> > > print a warning in the log.
> >
> > Oh, that's great stuff! By the way, did you look at CPU utilization numbers
> > on startup tests? I would not be surprised if it's below 50%. Does thermal
> > throttling appear often on startup tests?
>
> I don't know what the overall CPU usage is on the startup tests, but I do know
> that some parts of the startup sequence are CPU bound, and other parts are a
> long way from being CPU bound. There is a measurable difference in startup time
> between a Nexus 4 with four enabled CPUs and one with two enabled CPUs (an
> occam svelte), and if one looks at a trace one can see that at times during
> startup all the CPUs are running at 100% (even with 4 CPUs). At one time we had
> an issue on some targets, when running with the default performance governor
> setting, where the CPU frequency would be stepped down during the
> non-CPU-bound sections of startup and would then take some time to step up
> again in the CPU-bound sections, but this has now (I think) gone away due to
> other changes in the startup sequence. Note that we would not have seen this
> issue if we had been running in performance mode, but it was entirely
> reproducible.
>
> Yes, thermal throttling is a big problem in the startup tests. I originally
> wrote the thermal throttling code because thermal throttling was a major source
> of variance in the startup tests. I am rather surprised that kapook.com gives
> any results at all on the bots; I haven't been able to run it locally on
> devices that detect thermal throttling since I put in the thermal throttling
> check, since they always overheat on every run and give up after 5 retries.

Thanks! Interesting.

P.S.: Looked at kapook.com for the first time. Well, an impressive piece of design and engineering, I would say. I hope it stands out separately as a non-typical marvel.
oops, forgot to upload my latest changes yesterday. Now done, PTAL.
lgtm, mostly nits.. please give aberent a chance to look into it as well..

https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
File build/android/pylib/perf/perf_control.py (right):

https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
build/android/pylib/perf/perf_control.py:35: # Manage CPUs on some device types manually to reduce variance in results.
On 2013/11/18 15:11:18, pasko wrote:
> On 2013/11/15 17:54:43, bulach wrote:
> > "manually" is a bit misleading, and they're all "managed" one way or
> > another....
> > so perhaps:
> > self._forced_cpus as the list (i.e, in place of "cpus_to_manager" ?
> > no need for the bool, you can infer that from self._forced_cpus not being
> > empty... wdyt?
>
> I tried that, and it looks less readable to me to check for emptiness of a list
> in a few places compared to checking for one boolean mode-like variable that is
> set up once and remains constant throughout the lifetime of the object. Though
> arguably the list is set up only once too. _forced_cpus also sounds a bit
> general, how about _must_be_online_cpus?

how about this: extract cpu_indexes_to_manage_manually into a constant:

_MODELS_TO_FULLY_MANAGED_CPU = {
  'Nexus 4': [0, 1],
  'Nexus 5': [0, 1],
}

then rather than a bool + list, it can use:
self._fully_managed_cpus = _MODELS_TO_FULLY_MANAGED_CPU.get(
    self._adb.GetProductModel(), [])

and from then on, you can just test "if self._fully_managed_cpus" as a simple bool-like test..

I still think "manually" is a bit misleading as nothing here is manual, this script can't even be called from the shell directly :)

...or perhaps even more directly, like governor_mode, and just have:

self._must_be_online_cpus = {
  'Nexus 4': [0, 1],
  'Nexus 5': [0, 1],
}.get(self._adb.GetProductModel(), [])

https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
build/android/pylib/perf/perf_control.py:83: return '\n'.join(self._adb.GetProtectedFileContents(file_name))
On 2013/11/18 15:11:18, pasko wrote:
> On 2013/11/15 17:54:43, bulach wrote:
> > is this function needed? :)
> > line 71 would be fine:
> > '1' in ['1', 'foo', 'bar']
> > and then just do this directly at line 90...
>
> This function made 3 lines (lines 56, 72 and 90) shorter and not wrap; on the
> other hand, it added 3 lines to the file. The final argument of it having a
> descriptive name won.

perhaps then name it _GetProtectedFileContents, like the underlying method it's calling?

https://codereview.chromium.org/73173005/diff/120001/build/android/pylib/perf...
File build/android/pylib/perf/perf_control.py (right):

https://codereview.chromium.org/73173005/diff/120001/build/android/pylib/perf...
build/android/pylib/perf/perf_control.py:102: # online when running tests. TODO(pasko): stop 'thermald'.
make sure you send a heads up to chrome-perf-sheriffs, and please keep an eye on the bots to ensure nothing will burn out :) assuming you tried locally, right?
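As a self-contained illustration of the pattern bulach suggests above (a model-keyed dict whose .get() fallback doubles as a bool-like test), here is a small runnable sketch; DummyAdb stands in for the real adb wrapper used by PerfControl:

  _MODELS_TO_FULLY_MANAGED_CPUS = {
    'Nexus 4': [0, 1],
    'Nexus 5': [0, 1],
  }

  class DummyAdb(object):
    # Stand-in for the real adb wrapper.
    def GetProductModel(self):
      return 'Nexus 5'

  adb = DummyAdb()
  # Unknown models fall back to an empty list, which is falsy.
  fully_managed_cpus = _MODELS_TO_FULLY_MANAGED_CPUS.get(
      adb.GetProductModel(), [])
  if fully_managed_cpus:
    print('Forcing CPUs %s online' % fully_managed_cpus)
  else:
    print('Leaving CPU hotplug management to the device')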
https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
File build/android/pylib/perf/perf_control.py (right):

https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
build/android/pylib/perf/perf_control.py:35: # Manage CPUs on some device types manually to reduce variance in results.
On 2013/11/19 15:13:57, bulach wrote:
> On 2013/11/18 15:11:18, pasko wrote:
> > On 2013/11/15 17:54:43, bulach wrote:
> > > "manually" is a bit misleading, and they're all "managed" one way or
> > > another....
> > > so perhaps:
> > > self._forced_cpus as the list (i.e, in place of "cpus_to_manager" ?
> > > no need for the bool, you can infer that from self._forced_cpus not being
> > > empty... wdyt?
> >
> > I tried that, and it looks less readable to me to check for emptiness of a
> > list in a few places compared to checking for one boolean mode-like variable
> > that is set up once and remains constant throughout the lifetime of the
> > object. Though arguably the list is set up only once too. _forced_cpus also
> > sounds a bit general, how about _must_be_online_cpus?
>
> how about this: extract cpu_indexes_to_manage_manually into a constant:
>
> _MODELS_TO_FULLY_MANAGED_CPU = {
>   'Nexus 4': [0, 1],
>   'Nexus 5': [0, 1],
> }
>
> then rather than a bool + list, it can use:
> self._fully_managed_cpus = _MODELS_TO_FULLY_MANAGED_CPU.get(
>     self._adb.GetProductModel(), [])
>
> and from then on, you can just test "if self._fully_managed_cpus" as a simple
> bool-like test..
>
> I still think "manually" is a bit misleading as nothing here is manual, this
> script can't even be called from the shell directly :)
>
> ...or perhaps even more directly, like governor_mode, and just have:
>
> self._must_be_online_cpus = {
>   'Nexus 4': [0, 1],
>   'Nexus 5': [0, 1],
> }.get(self._adb.GetProductModel(), [])

I like "fully managed CPUs" as a term. Thanks!! self._fully_managed_cpus is okay, I'll go with that. It is less readable to me when one variable gets interpreted as different types in different contexts (I just had one of those rare Perl flashbacks), but you are right, it seems more pythonistic readers will find it more convenient.

https://codereview.chromium.org/73173005/diff/1/build/android/pylib/perf/perf...
build/android/pylib/perf/perf_control.py:83: return '\n'.join(self._adb.GetProtectedFileContents(file_name))
On 2013/11/19 15:13:57, bulach wrote:
> perhaps then name it _GetProtectedFileContents, like the underlying method it's
> calling?
Whichever you like. Done.

https://codereview.chromium.org/73173005/diff/120001/build/android/pylib/perf...
File build/android/pylib/perf/perf_control.py (right):

https://codereview.chromium.org/73173005/diff/120001/build/android/pylib/perf...
build/android/pylib/perf/perf_control.py:102: # online when running tests. TODO(pasko): stop 'thermald'.
On 2013/11/19 15:13:57, bulach wrote:
> make sure you send a heads up to chrome-perf-sheriffs,
Yes, I am planning to let the sheriffs know; this can shift performance numbers in unexpected ways.
> and please keep an eye on the bots to ensure nothing will burn out :) assuming
> you tried locally, right?
I ran this locally, of course. It is very unlikely that anything burns with the current CL, so I'll leave most of the watching to the sheriffs and will just occasionally look at the bots.
Great catch on the scaling governor fixes! Could we please land those fixes first so we can observe their effects in isolation? However, I'm really skeptical about the idea of disabling cores. Can you explain that a little further and punt it out to a separate CL?
On 2013/11/19 18:07:24, tonyg wrote:
> Great catch on the scaling governor fixes! Could we please land those fixes
> first so we can observe their effects in isolation?
>
> However, I'm really skeptical about the idea of disabling cores. Can you
> explain that a little further and punt it out to a separate CL?

The scaling governor mode not being set for all CPUs (the scaling governor issue) is tightly connected to managing cores. When we stop 'mpdecision' there is no daemon or kernel module left to bring CPUs back online on Qualcomm chips. So when we run tests we need to bring up a somewhat representative number of CPUs directly in the Telemetry PerfControl. In practice we never stop a CPU, we only bring up an extra core on N4/N5 slightly before 'mpdecision' would have done so.

I looked at a few perfbot logs to get a somewhat representative number of CPUs online at the moment the tests had just finished running, though this can be after the device had cooled for a few milliseconds.

Moreover, fun fact: on N7/Tegra one core actually works faster if the other cores are down. It remains to be seen what a predictable and representative performance mode is for that chip. I'd vote for running some tests in 2 modes: 1 core and 2 cores.
I'm also curious how this will affect the smoothness benchmark. Part of the jankiness there comes from the fact that mpdecision or its equivalent is slow in bringing up additional cores, so we limp along with a single core for the first few frames. Having two cores online from the start should improve this, but isn't entirely realistic.

I think ultimately we want to optimize for scrolling smoothly with just one core since that gives us more room to breathe, so I like Egor's idea of running some tests in two modes.
On 2013/11/21 10:56:20, Sami wrote:
> I'm also curious how this will affect the smoothness benchmark. Part of the
> jankiness there comes from the fact that mpdecision or its equivalent is slow
> in bringing up additional cores, so we limp along with a single core for the
> first few frames. Having two cores online from the start should improve this,
> but isn't entirely realistic.
>
> I think ultimately we want to optimize for scrolling smoothly with just one
> core since that gives us more room to breathe, so I like Egor's idea of running
> some tests in two modes.

Even the current disabling of CPU frequency scaling takes us down a path of sacrificing realism for low noise. This takes us even farther down that path.

I think that is a good idea for microbenchmarks like octane, blink_perf, cc_perftests, etc. But it's a bad idea for regression tests like page_cycler, smoothness, etc.

What would folks think about first changing Telemetry so that we don't call the methods that put us into low-noise mode for those end-to-end regression tests, but do it for the micros? Then we can land this improvement to our noise reduction functionality.
On 2013/11/21 16:20:43, tonyg wrote:
> On 2013/11/21 10:56:20, Sami wrote:
> > I'm also curious how this will affect the smoothness benchmark. Part of the
> > jankiness there comes from the fact that mpdecision or its equivalent is slow
> > in bringing up additional cores, so we limp along with a single core for the
> > first few frames. Having two cores online from the start should improve this,
> > but isn't entirely realistic.
> >
> > I think ultimately we want to optimize for scrolling smoothly with just one
> > core since that gives us more room to breathe, so I like Egor's idea of
> > running some tests in two modes.
>
> Even the current disabling of CPU frequency scaling takes us down a path of
> sacrificing realism for low noise. This takes us even farther down that path.
>
> I think that is a good idea for microbenchmarks like octane, blink_perf,
> cc_perftests, etc. But it's a bad idea for regression tests like page_cycler,
> smoothness, etc.
>
> What would folks think about first changing Telemetry so that we don't call the
> methods that put us into low-noise mode for those end-to-end regression tests,
> but do it for the micros? Then we can land this improvement to our noise
> reduction functionality.

Yes, sounds good to me. My concerns about this have been about end-to-end benchmarks (in particular startup benchmarks). For microbenchmarks, running in performance mode probably makes sense.
On 2013/11/21 17:47:32, aberent wrote:
> On 2013/11/21 16:20:43, tonyg wrote:
> > On 2013/11/21 10:56:20, Sami wrote:
> > > I'm also curious how this will affect the smoothness benchmark. Part of the
> > > jankiness there comes from the fact that mpdecision or its equivalent is
> > > slow in bringing up additional cores, so we limp along with a single core
> > > for the first few frames. Having two cores online from the start should
> > > improve this, but isn't entirely realistic.
> > >
> > > I think ultimately we want to optimize for scrolling smoothly with just one
> > > core since that gives us more room to breathe, so I like Egor's idea of
> > > running some tests in two modes.
> >
> > Even the current disabling of CPU frequency scaling takes us down a path of
> > sacrificing realism for low noise. This takes us even farther down that path.

I prefer to think of this CL as fixing the broken disabling of CPU frequency scaling. If for some tests we do not want to disable cpufreq scaling, then OK, let's address that, but the current state is a half-solution that is not consistent within itself.

> > I think that is a good idea for microbenchmarks like octane, blink_perf,
> > cc_perftests, etc. But it's a bad idea for regression tests like page_cycler,
> > smoothness, etc.
> >
> > What would folks think about first changing Telemetry so that we don't call
> > the methods that put us into low-noise mode for those end-to-end regression
> > tests, but do it for the micros? Then we can land this improvement to our
> > noise reduction functionality.
>
> Yes, sounds good to me. My concerns about this have been about end-to-end
> benchmarks (in particular startup benchmarks). For microbenchmarks, running in
> performance mode probably makes sense.

I disagree.

I feel strongly that low variance is more important than our immediate attempts to be more 'representative'. That is because:
1. we spend immense resources on bisecting performance problems just to discover that regressions are within noise; this slows us down and does not prevent regressions
2. in-lab benchmarks are always liars, in all industries, some more, some less. Benchmarks are not representative, but they are a good tool to assess some properties of the product.

Ideally we would have *all* benchmarks run in two modes: one 'closest to representative', as much as we can get (non-rooted devices?), and another 'tuned for stable performance' with proper cooling etc. We should not separate by benchmark type to decide which subset of them runs in which mode.

By the way, are we OK with the current level of realism on the page_cycler benchmarks? I would think realism requires non-rooted devices, with network delay simulation. I remember we actually discovered problems in the presence of network delays in the past.
On 2013/11/21 19:53:11, pasko wrote:
> On 2013/11/21 17:47:32, aberent wrote:
> > On 2013/11/21 16:20:43, tonyg wrote:
> > > On 2013/11/21 10:56:20, Sami wrote:
> > > > I'm also curious how this will affect the smoothness benchmark. Part of
> > > > the jankiness there comes from the fact that mpdecision or its equivalent
> > > > is slow in bringing up additional cores, so we limp along with a single
> > > > core for the first few frames. Having two cores online from the start
> > > > should improve this, but isn't entirely realistic.
> > > >
> > > > I think ultimately we want to optimize for scrolling smoothly with just
> > > > one core since that gives us more room to breathe, so I like Egor's idea
> > > > of running some tests in two modes.
> > >
> > > Even the current disabling of CPU frequency scaling takes us down a path of
> > > sacrificing realism for low noise. This takes us even farther down that
> > > path.
>
> I prefer to think of this CL as fixing the broken disabling of CPU frequency
> scaling. If for some tests we do not want to disable cpufreq scaling, then OK,
> let's address that, but the current state is a half-solution that is not
> consistent within itself.

Agreed; if we are going to use performance mode at all we should use it correctly, and this CL appears to fix how we currently set performance mode. I don't think anything in this CL changes when we attempt to set performance mode.

> > > I think that is a good idea for microbenchmarks like octane, blink_perf,
> > > cc_perftests, etc. But it's a bad idea for regression tests like
> > > page_cycler, smoothness, etc.
> > >
> > > What would folks think about first changing Telemetry so that we don't call
> > > the methods that put us into low-noise mode for those end-to-end regression
> > > tests, but do it for the micros? Then we can land this improvement to our
> > > noise reduction functionality.
> >
> > Yes, sounds good to me. My concerns about this have been about end-to-end
> > benchmarks (in particular startup benchmarks). For microbenchmarks, running
> > in performance mode probably makes sense.
>
> I disagree.
>
> I feel strongly that low variance is more important than our immediate attempts
> to be more 'representative'. That is because:
> 1. we spend immense resources on bisecting performance problems just to
> discover that regressions are within noise; this slows us down and does not
> prevent regressions
> 2. in-lab benchmarks are always liars, in all industries, some more, some less.
> Benchmarks are not representative, but they are a good tool to assess some
> properties of the product.
>
> Ideally we would have *all* benchmarks run in two modes: one 'closest to
> representative', as much as we can get (non-rooted devices?), and another
> 'tuned for stable performance' with proper cooling etc. We should not separate
> by benchmark type to decide which subset of them runs in which mode.
>
> By the way, are we OK with the current level of realism on the page_cycler
> benchmarks? I would think realism requires non-rooted devices, with network
> delay simulation. I remember we actually discovered problems in the presence of
> network delays in the past.

Do we need to have this discussion before landing this CL? While I have some doubts as to the benefits of performance mode (I am not convinced that this has much to do with the variability we are seeing), I don't have much data. Once this has landed we will be able to see the true effect of performance mode on the variability of the test results, and whether (and on which tests) the benefits of it are sufficient to justify the lack of realism.

BTW on the downstream bots we do run what benchmarks we can on a non-rooted S3, in addition to running on rooted Nexus devices.
On 2013/11/22 10:00:09, aberent wrote:
> > I disagree.
> >
> > I feel strongly that low variance is more important than our immediate
> > attempts to be more 'representative'. That is because:
> > 1. we spend immense resources on bisecting performance problems just to
> > discover that regressions are within noise; this slows us down and does not
> > prevent regressions
> > 2. in-lab benchmarks are always liars, in all industries, some more, some
> > less. Benchmarks are not representative, but they are a good tool to assess
> > some properties of the product.
> >
> > Ideally we would have *all* benchmarks run in two modes: one 'closest to
> > representative', as much as we can get (non-rooted devices?), and another
> > 'tuned for stable performance' with proper cooling etc. We should not
> > separate by benchmark type to decide which subset of them runs in which mode.
> >
> > By the way, are we OK with the current level of realism on the page_cycler
> > benchmarks? I would think realism requires non-rooted devices, with network
> > delay simulation. I remember we actually discovered problems in the presence
> > of network delays in the past.
>
> Do we need to have this discussion before landing this CL? While I have some
> doubts as to the benefits of performance mode (I am not convinced that this has
> much to do with the variability we are seeing), I don't have much data. Once
> this has landed we will be able to see the true effect of performance mode on
> the variability of the test results, and whether (and on which tests) the
> benefits of it are sufficient to justify the lack of realism.

I agree that we should land this and observe the effect. :) The performance mode discussion is indeed broad, we will continue it beyond this review.

> BTW on the downstream bots we do run what benchmarks we can on a non-rooted S3,
> in addition to running on rooted Nexus devices.

I just looked at the downstream page_cycler runs on the S3; they all say "User build device, skipping test" and walk away. The S3 bot is *not* useless, it runs startup tests and other benchmarks, just not (yet?) page_cycler.

So, Anthony, does this CL LGTY?
lgtm
lgtm
lgtm, thanks!
After a conversation with tonyg@ we decided to wait until we have a dedicated bot tuned for perf predictability. The device on that bot should have the same hardware/OS configuration as one of our existing perf devices, and then be tuned to be more predictable (i.e. a constant number of cores, scaling governor, or even userspace frequency management).
On 2013/11/27 18:29:17, pasko wrote:
> After a conversation with tonyg@ we decided to wait until we have a dedicated
> bot tuned for perf predictability. The device on that bot should have the same
> hardware/OS configuration as one of our existing perf devices, and then be
> tuned to be more predictable (i.e. a constant number of cores, scaling
> governor, or even userspace frequency management).

done in another patch, closing this