Issue 2541843007: [Telemetry][Android] Wait for device under test to cool between pages.

rnephew (Reviews Here)

rnephew@chromium.org changed reviewers: + nednguyen@google.com, sullivan@chromium.org

4 years ago (2016-12-01 18:38:00 UTC) #1

rnephew (Reviews Here)

https://codereview.chromium.org/2541843007/diff/1/telemetry/telemetry/internal/platform/platform_backend.py File telemetry/telemetry/internal/platform/platform_backend.py (right): https://codereview.chromium.org/2541843007/diff/1/telemetry/telemetry/internal/platform/platform_backend.py#newcode304 telemetry/telemetry/internal/platform/platform_backend.py:304: pass Debating if I should add a CanWaitForTemperature method ...

4 years ago (2016-12-01 18:38:00 UTC) #2

nednguyen

On 2016/12/01 18:38:00, rnephew (Reviews Here) wrote: > https://codereview.chromium.org/2541843007/diff/1/telemetry/telemetry/internal/platform/platform_backend.py > File telemetry/telemetry/internal/platform/platform_backend.py (right): > > ...

4 years ago (2016-12-01 19:01:46 UTC) #3

rnephew (Reviews Here)

4 years ago (2016-12-01 19:14:02 UTC) #4

On 2016/12/01 19:01:46, nednguyen wrote:
> On 2016/12/01 18:38:00, rnephew (Reviews Here) wrote:
> > https://codereview.chromium.org/2541843007/diff/1/telemetry/telemetry/interna...
> > File telemetry/telemetry/internal/platform/platform_backend.py (right):
> >
> > https://codereview.chromium.org/2541843007/diff/1/telemetry/telemetry/interna...
> > telemetry/telemetry/internal/platform/platform_backend.py:304: pass
> > Debating if I should add a CanWaitForTemperature method that defaults to
> > false, having each platform that implements it override it to true, and
> > having this raise NotImplementedError, but that also seems like unnecessary
> > boilerplate code when we can just make it a NOP on platforms that do not
> > support it yet (or platforms that have no need to support it).
> >
> > https://codereview.chromium.org/2541843007/diff/1/telemetry/telemetry/page/sh...
> > File telemetry/telemetry/page/shared_page_state.py (right):
> >
> > https://codereview.chromium.org/2541843007/diff/1/telemetry/telemetry/page/sh...
> > telemetry/telemetry/page/shared_page_state.py:189: # Make sure device under test is at a suitable temperature.
> > Decided to put it here, because doing it here makes it wait between pages.
> > This will make sure that on long-running tests, such as system health, we
> > do not start overheating partway through. I do expect this to increase the
> > amount of time single benchmarks take, since it will cool between pages, but
> > I do not expect this to increase the amount of time a test run takes in
> > total, since the Android test runner already waits between benchmarks for
> > the temp to drop.
>
> I think this really should be a command-line flag
> (--wait-to-cool-down-between-runs) or a benchmark property
> (wait_to_cool_down_between_runs). We don't want to enable this by default for
> all benchmarks, since it can be expensive, and benchmarks like power probably
> don't care much about the temperature.

It would help decrease noise, though, if we always do it between each page. And
as stated above, in theory it shouldn't increase run times very much, since it's
just moving where it waits for temperatures to drop. I'll look into adding it as
a flag though.
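
For readers following the platform_backend.py discussion, here is a minimal sketch of the "no-op by default" option described in the quoted comment. All names below (PlatformBackend, WaitForTemperature, FakeAndroidBackend, CoolBetweenPages) are hypothetical illustrations, not the code in this CL; the point is only that callers need no CanWaitForTemperature check when the base class does nothing.

```python
class PlatformBackend(object):
    """Hypothetical base backend: waiting for temperature is a no-op."""

    def WaitForTemperature(self, temp_celsius):
        # Platforms that cannot (or need not) cool down simply do nothing,
        # so callers never have to ask "can this platform wait?" first.
        pass


class FakeAndroidBackend(PlatformBackend):
    """Hypothetical Android-like backend that overrides the no-op."""

    def __init__(self):
        self.waited_for = []

    def WaitForTemperature(self, temp_celsius):
        # A real backend would poll the device here; this fake only
        # records the request so the behaviour is observable.
        self.waited_for.append(temp_celsius)


def CoolBetweenPages(backend, temp_celsius=35):
    """Caller side: one unconditional call, regardless of platform."""
    backend.WaitForTemperature(temp_celsius)
```

The alternative from the comment (a capability method plus NotImplementedError) would add an extra method and an extra branch at every call site, which is the boilerplate being weighed here.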

nednguyen

4 years ago (2016-12-01 19:21:41 UTC) #5

We currently also have _WaitForThermalThrottling if needed in
https://github.com/catapult-project/catapult/blob/master/telemetry/telemetry/....
If you work on this, let's take this slow & do it right.

A few questions I have are:
1) How is the method you're implementing different from
_WaitForThermalThrottling? Can we merge the two into one thing?
2) On a typical long-running benchmark like system_health.mobile_memory, how
much time does WaitingForDeviceToCool between stories add to the overall
benchmark run?
3) Can we also implement this on desktop? IIRC, Charlie was having some problems
with the power increasing across runs on Mac.

Can you make a design doc about this work that answers these questions?

nednguyen

Description was changed from ========== [Telemetry][Android] Wait for device under test to cool between pages. ...

4 years ago (2016-12-01 19:21:56 UTC) #6

rnephew (Reviews Here)

4 years ago (2016-12-01 19:28:20 UTC) #8

If we already have _WaitForThermalThrottling, then maybe thermal throttling
isn't the issue in crbug.com/669923, making this unnecessary?

I don't know enough about Mac hardware to implement it there atm. Since this (if
this is even what is causing it) is causing problems on bisects for Android, I'd
prefer to roll it out on Android first, then work on the other platforms, so we
can get unstuck there.

I don't think this requires a full DD, but I'll make a 1-pager for it.

rnephew (Reviews Here)

I added a command line flag and moved where it calls LetTemperatureCool to reflect where ...

4 years ago (2016-12-02 15:54:57 UTC) #9

charliea (OOO until 10-5)

https://codereview.chromium.org/2541843007/diff/20001/telemetry/telemetry/internal/story_runner.py File telemetry/telemetry/internal/story_runner.py (right): https://codereview.chromium.org/2541843007/diff/20001/telemetry/telemetry/internal/story_runner.py#newcode64 telemetry/telemetry/internal/story_runner.py:64: help='Temperature to wait for between pages. In tenths of' ...

4 years ago (2016-12-02 16:03:12 UTC) #10

rnephew (Reviews Here)

https://codereview.chromium.org/2541843007/diff/20001/telemetry/telemetry/internal/story_runner.py File telemetry/telemetry/internal/story_runner.py (right): https://codereview.chromium.org/2541843007/diff/20001/telemetry/telemetry/internal/story_runner.py#newcode64 telemetry/telemetry/internal/story_runner.py:64: help='Temperature to wait for between pages. In tenths of' ...

4 years ago (2016-12-12 16:58:58 UTC) #11

nednguyen

https://codereview.chromium.org/2541843007/diff/40001/telemetry/telemetry/internal/story_runner.py File telemetry/telemetry/internal/story_runner.py (right): https://codereview.chromium.org/2541843007/diff/40001/telemetry/telemetry/internal/story_runner.py#newcode64 telemetry/telemetry/internal/story_runner.py:64: help='Temperature to wait for between pages. In degrees C.') ...

4 years ago (2016-12-15 15:38:11 UTC) #13

rnephew (Reviews Here)

On 2016/12/15 15:38:11, nednguyen wrote: > https://codereview.chromium.org/2541843007/diff/40001/telemetry/telemetry/internal/story_runner.py > File telemetry/telemetry/internal/story_runner.py (right): > > https://codereview.chromium.org/2541843007/diff/40001/telemetry/telemetry/internal/story_runner.py#newcode64 > ...

4 years ago (2016-12-15 15:47:05 UTC) #14

nednguyen

4 years ago (2016-12-15 16:01:19 UTC) #15

On 2016/12/15 15:47:05, rnephew (Reviews Here) wrote:
> On 2016/12/15 15:38:11, nednguyen wrote:
> > https://codereview.chromium.org/2541843007/diff/40001/telemetry/telemetry/int...
> > File telemetry/telemetry/internal/story_runner.py (right):
> >
> > https://codereview.chromium.org/2541843007/diff/40001/telemetry/telemetry/int...
> > telemetry/telemetry/internal/story_runner.py:64: help='Temperature to wait for between pages. In degrees C.')
> > I think we should not expose this. Instead, can you do a local run of the
> > system health benchmark on a test phone & see whether this has a big effect
> > on runtime?
> >
> > If not, we should just go ahead & replace _WaitForThermalThrottlingIfNeeded
> > with this new mechanism.
>
> I don't think any amount of local testing will adequately warm a device like it
> would be in the labs. We typically do not see temperature issues outside of lab
> conditions, which are nearly impossible to replicate outside the lab. I also
> expect a single benchmark run to take longer; it's the overall telemetry run
> that should not take longer.
>
> Old:               New:
> [Test 1 start]     [Test 1 start]
>   [page 1]           [page 1]
>   [page 2]           [Cool device]
>   [page 3]           [page 2]
> [Cool device]        [Cool device]
> [Cool device]        [page 3]
> [Cool device]        [Cool device]
>
> That is about how I expect it to change: moving all the cooling from the end of
> the benchmark to between pages. It's likely that we will see a small increase
> because of Newton's law of cooling, but probably not a very large one. I just
> don't think we can adequately test it outside of the lab.
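
To make the Newton's-law-of-cooling remark concrete, here is a toy calculation, not a measurement. It assumes the device cools exponentially toward ambient (dT/dt = -k * (T - ambient)), that each page adds a fixed amount of heat, and that no cooling happens while a page runs. Every number below (ambient, target, heat per page, k) is made up purely for illustration.

```python
import math


def cooling_time(start_c, target_c, ambient_c, k_per_s):
    """Seconds to cool from start_c to target_c under Newton's law."""
    if start_c <= target_c:
        return 0.0
    if target_c <= ambient_c:
        raise ValueError('target must be above ambient')
    # Solve T(t) = ambient + (start - ambient) * exp(-k * t) for T(t) = target.
    return math.log((start_c - ambient_c) / (target_c - ambient_c)) / k_per_s


# Hypothetical values, chosen only to show the shape of the effect.
AMBIENT_C = 25.0
TARGET_C = 35.0
HEAT_PER_PAGE_C = 5.0
K_PER_S = 0.01
PAGES = 3

# Old: run all pages, then cool once from the accumulated temperature.
old_total = cooling_time(TARGET_C + PAGES * HEAT_PER_PAGE_C,
                         TARGET_C, AMBIENT_C, K_PER_S)
# New: cool back to the target after every page.
new_total = PAGES * cooling_time(TARGET_C + HEAT_PER_PAGE_C,
                                 TARGET_C, AMBIENT_C, K_PER_S)

print('old: %.1f s, new: %.1f s' % (old_total, new_total))
```

Under these assumptions the per-page schedule spends more total time cooling, because cooling is fastest when the device is hottest. How large the gap is on real hardware depends on values this sketch only guesses at.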

I see. I think the next steps then could be:
1) Remove the command-line flag. We should be the ones who know best which
temperature to use. In general, fewer public APIs will give you fewer headaches
in the long run.
2) Go ahead with landing this patch, but treat it as an experiment. Add logging
to know whether the wait for temperature is actually triggered & how long it
takes.
3) Collect the data:
i) Whether this helps fix crbug.com/669923.
ii) How much runtime this adds to the benchmark.
iii) Whether this makes other benchmarks more stable or less stable. Give
this about 3 weeks to see whether this CL gets blamed by any bisect.

4) Assuming the data in 3) is good, go ahead and migrate
_WaitForThermalThrottlingIfNeeded(state.platform) over to this new method of
waiting for the device to cool.
If the data in 3) is not good, we may need to revert this CL & decide what to do
next.
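
A rough sketch of what the logging in step 2 could look like, assuming a polling loop around some temperature reader. The function name, its parameters, and the injected clock/sleep hooks are all hypothetical and are not taken from the patch; it only shows one way to record whether the wait triggered and how long it took.

```python
import logging
import time


def wait_for_temperature(read_temp, target_celsius, timeout_s=300.0,
                         poll_interval_s=5.0, clock=time.time,
                         sleep=time.sleep):
    """Polls read_temp() until it is <= target_celsius or timeout_s elapses.

    Returns a (triggered, waited_seconds, reached_target) tuple.
    """
    start = clock()
    if read_temp() <= target_celsius:
        logging.info('Device already at or below %s C; no wait needed.',
                     target_celsius)
        return False, 0.0, True

    reached = False
    while clock() - start < timeout_s:
        sleep(poll_interval_s)
        if read_temp() <= target_celsius:
            reached = True
            break

    waited = clock() - start
    logging.info('Waited %.1f s for device to cool to %s C (reached=%s).',
                 waited, target_celsius, reached)
    return True, waited, reached
```

Passing clock and sleep as parameters keeps the loop testable without a real device or real delays; the returned tuple is the kind of data steps 3 i) to iii) would need to aggregate.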

rnephew (Reviews Here)

https://codereview.chromium.org/2541843007/diff/40001/telemetry/telemetry/internal/story_runner.py File telemetry/telemetry/internal/story_runner.py (right): https://codereview.chromium.org/2541843007/diff/40001/telemetry/telemetry/internal/story_runner.py#newcode64 telemetry/telemetry/internal/story_runner.py:64: help='Temperature to wait for between pages. In degrees C.') ...

4 years ago (2016-12-15 16:29:01 UTC) #16

rnephew (Reviews Here)

The CQ bit was checked by rnephew@chromium.org to run a CQ dry run

4 years ago (2016-12-15 16:29:04 UTC) #17

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2541843007/60001

4 years ago (2016-12-15 16:29:13 UTC) #18

rnephew (Reviews Here)

On 2016/12/15 16:29:01, rnephew (Reviews Here) wrote: > https://codereview.chromium.org/2541843007/diff/40001/telemetry/telemetry/internal/story_runner.py > File telemetry/telemetry/internal/story_runner.py (right): > > ...

4 years ago (2016-12-15 16:30:47 UTC) #19

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years ago (2016-12-15 16:39:37 UTC) #20

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: Catapult Mac Tryserver on master.tryserver.client.catapult (JOB_FAILED, https://build.chromium.org/p/tryserver.client.catapult/builders/Catapult%20Mac%20Tryserver/builds/5984)

4 years ago (2016-12-15 16:39:38 UTC) #21

rnephew (Reviews Here)

The CQ bit was checked by rnephew@chromium.org to run a CQ dry run

4 years ago (2016-12-15 16:50:55 UTC) #22

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2541843007/80001

4 years ago (2016-12-15 16:50:57 UTC) #24

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years ago (2016-12-15 17:01:27 UTC) #25

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: Catapult Mac Tryserver on master.tryserver.client.catapult (JOB_FAILED, https://build.chromium.org/p/tryserver.client.catapult/builders/Catapult%20Mac%20Tryserver/builds/5985)

4 years ago (2016-12-15 17:01:28 UTC) #26

nednguyen

On 2016/12/15 17:48:55, nednguyen wrote: > lgtm +Juan & Hector since this may affect crbug.com/671156

4 years ago (2016-12-15 17:49:39 UTC) #28

rnephew (Reviews Here)

On 2016/12/15 17:49:39, nednguyen wrote: > On 2016/12/15 17:48:55, nednguyen wrote: > > lgtm > ...

4 years ago (2016-12-19 21:31:10 UTC) #29

rnephew (Reviews Here)

On 2016/12/19 21:31:10, rnephew (Reviews Here) wrote: > On 2016/12/15 17:49:39, nednguyen wrote: > > ...

4 years ago (2016-12-19 21:31:22 UTC) #30

rnephew (Reviews Here)

On 2016/12/19 21:31:10, rnephew (Reviews Here) wrote: > On 2016/12/15 17:49:39, nednguyen wrote: > > ...

4 years ago (2016-12-19 21:31:25 UTC) #31

rnephew (Reviews Here)

The patchset sent to the CQ was uploaded after l-g-t-m from nednguyen@google.com, charliea@chromium.org Link to ...

4 years ago (2016-12-19 21:31:35 UTC) #33

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2541843007/120001

4 years ago (2016-12-19 21:31:38 UTC) #34

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years ago (2016-12-19 21:43:02 UTC) #35

commit-bot: I haz the power

Try jobs failed on following builders: Catapult Mac Tryserver on master.tryserver.client.catapult (JOB_FAILED, https://build.chromium.org/p/tryserver.client.catapult/builders/Catapult%20Mac%20Tryserver/builds/6017)

4 years ago (2016-12-19 21:43:03 UTC) #36

rnephew (Reviews Here)

The patchset sent to the CQ was uploaded after l-g-t-m from nednguyen@google.com, charliea@chromium.org Link to ...

4 years ago (2016-12-19 22:43:01 UTC) #38

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2541843007/140001

4 years ago (2016-12-19 22:43:04 UTC) #39

commit-bot: I haz the power

CQ is committing da patch. Bot data: {"patchset_id": 140001, "attempt_start_ts": 1482187380743900, "parent_rev": "d77eaf7f69e860436f0245c10ea2c186937ab666", "commit_rev": "70f42a7c55ca69cdeb9aa6ec7e40ff3f155040b9"}

4 years ago (2016-12-19 23:01:40 UTC) #40

commit-bot: I haz the power

Description was changed from ========== [Telemetry][Android] Wait for device under test to cool between pages. ...

4 years ago (2016-12-19 23:01:43 UTC) #41

commit-bot: I haz the power

4 years ago (2016-12-19 23:01:44 UTC) #42

Message was sent while issue was closed.

Committed patchset #8 (id:140001) as
https://chromium.googlesource.com/external/github.com/catapult-project/catapu...

Issue 2541843007: [Telemetry][Android] Wait for device under test to cool between pages. (Closed)

Description

Patch Set 1 #

Patch Set 2 : add as cmdline flag #

Patch Set 3 : convert to tenths at android level #

Patch Set 4 : [Telemetry][Android] Wait for device under test to cool between pages. #

Patch Set 5 : Get rid of reference to deleted cmdline arg #

Patch Set 6 : Fix tests #

Patch Set 7 : rebase #

Patch Set 8 : fix more tests #

Messages