Issue 2237643002: Delay asynchronous work from the metrics clients until after startup.

gab

gab@chromium.org changed reviewers: + isherman@chromium.org

4 years, 4 months ago (2016-08-10 21:07:28 UTC) #1

gab

The CQ bit was checked by gab@chromium.org to run a CQ dry run

4 years, 4 months ago (2016-08-10 21:08:21 UTC) #3

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2237643002/1

4 years, 4 months ago (2016-08-10 21:09:13 UTC) #4

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 4 months ago (2016-08-10 21:24:51 UTC) #5

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: linux_chromium_chromeos_compile_dbg_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_chromeos_compile_dbg_ng/builds/244658) linux_chromium_compile_dbg_ng on ...

4 years, 4 months ago (2016-08-10 21:24:53 UTC) #6

Ilya Sherman

isherman@chromium.org changed reviewers: + asvitkine@chromium.org, bcwhite@chromium.org

4 years, 4 months ago (2016-08-11 04:17:59 UTC) #7

Ilya Sherman

+Alexei and Brian as a sanity check Hmm. There are a number of subtle timing ...

4 years, 4 months ago (2016-08-11 04:18:00 UTC) #8

bcwhite

> Hmm. There are a number of subtle timing issues that could come up here. ...

4 years, 4 months ago (2016-08-11 10:42:21 UTC) #9

gab

4 years, 4 months ago (2016-08-11 11:08:41 UTC) #10

On 2016/08/11 04:18:00, Ilya Sherman wrote:
> +Alexei and Brian as a sanity check
> 
> Hmm.  There are a number of subtle timing issues that could come up here.  For
> example, the persistent memory allocator should really be created as early as
> possible.  How much of a timing difference does this CL make, roughly?

This would delay the work by several seconds, i.e. by roughly the average
startup time, instead of it being one of the first tasks in the BlockingPool.
AfterStartupTaskUtils queues tasks and posts them in a random order once
startup completes (first frame painted). I figured this was okay as you're
already using the BlockingPool without sequencing (and shouldn't really rely on
throughput from it for correctness either -- though as its first clients you
might have gotten away with this).
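
A minimal sketch of the queue-then-release behavior described above. It assumes
a single-threaded model, and AfterStartupQueue, PostAfterStartup and
SetStartupComplete are hypothetical names for illustration, not the actual
AfterStartupTaskUtils API:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <random>
#include <utility>
#include <vector>

// Hypothetical stand-in for an "after startup" task queue. This is NOT the
// Chromium AfterStartupTaskUtils API, only an illustration of the idea.
class AfterStartupQueue {
 public:
  // Runs the task right away if startup is done, otherwise holds it.
  void PostAfterStartup(std::function<void()> task) {
    if (startup_complete_) {
      task();
      return;
    }
    pending_.push_back(std::move(task));
  }

  // Called once, e.g. when the first frame has been painted. Releases the
  // held tasks in a shuffled order, so callers must not rely on ordering.
  void SetStartupComplete() {
    assert(!startup_complete_);
    startup_complete_ = true;
    std::shuffle(pending_.begin(), pending_.end(), rng_);
    std::vector<std::function<void()>> tasks = std::move(pending_);
    pending_.clear();
    for (auto& task : tasks)
      task();
  }

  std::size_t pending_count() const { return pending_.size(); }

 private:
  bool startup_complete_ = false;
  std::vector<std::function<void()>> pending_;
  std::mt19937 rng_{std::random_device{}()};
};
```

Nothing posted before SetStartupComplete() runs until that call, and the
shuffle makes it explicit that no ordering between the held tasks can be
assumed.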

It's kind of an accident that this works today IMO as there really shouldn't be
any pool available before the end of PreCreateThreads.

If that delay is unacceptable, another option is a controlled delay -- i.e.
using a DeferredSequencedTaskRunner (or introducing a DeferredTaskRunner if we
want to avoid unnecessary sequencing) and then manually releasing it at the
very end of PreCreateThreadsImpl.
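
The controlled-delay option could look roughly like the sketch below. It
assumes a single-threaded model; DeferredRunner and its methods are
hypothetical names and do not reproduce the real DeferredSequencedTaskRunner
interface:

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical deferred, sequenced runner: tasks are held in FIFO order
// until Start() is called, then run in the order they were posted.
class DeferredRunner {
 public:
  void PostTask(std::function<void()> task) {
    queue_.push_back(std::move(task));
    if (started_)
      Drain();
  }

  // Releases the held tasks. In the scenario discussed here this would be
  // called at the very end of the pre-thread-creation phase.
  void Start() {
    assert(!started_);
    started_ = true;
    Drain();
  }

 private:
  void Drain() {
    if (draining_)
      return;  // A running task posted another task; the outer loop runs it.
    draining_ = true;
    while (!queue_.empty()) {
      std::function<void()> task = std::move(queue_.front());
      queue_.pop_front();
      task();
    }
    draining_ = false;
  }

  bool started_ = false;
  bool draining_ = false;
  std::deque<std::function<void()>> queue_;
};
```

Unlike a release at "after startup", the release point here is chosen by the
caller, so the added delay is bounded by when Start() is invoked.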

> 
> Also: Is this CL *only* needed to be able to *experiment* on the new scheduler?
> If so, could it be reverted in the future, whether we move to the new scheduler
> or not?  If the answer is yes, I'd prefer to somehow make it really clear that
> change is temporary, by naming the methods to indicate such.

No, it's more than that. 1) We are almost certainly moving to the new scheduler.
2) There will almost always need to be live experiments from the scheduler's
internals (or at least the possibility to bring one up without a refactoring
each time). 3) As mentioned above, there shouldn't really be pools live during
PreCreateThreads, but the scheduler needs to be up before any thread; hence
it's currently a strong requirement that it be last in PreCreateThreads (or it
could be first in CreateThreads I guess, but same problem).
So no, it's not temporary: the metrics async work needs to be delayed at least
until after PreCreateThreads (after startup is just a convenience, as it
appeared to be non-critical work, but I might have misinterpreted that).

gab

The CQ bit was checked by gab@chromium.org to run a CQ dry run

4 years, 4 months ago (2016-08-11 12:28:32 UTC) #11

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2237643002/20001

4 years, 4 months ago (2016-08-11 12:28:47 UTC) #12

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 4 months ago (2016-08-11 12:30:37 UTC) #13

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: ios-device on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-device/builds/50306) ios-simulator on ...

4 years, 4 months ago (2016-08-11 12:30:41 UTC) #14

gab

The CQ bit was checked by gab@chromium.org to run a CQ dry run

4 years, 4 months ago (2016-08-11 12:39:11 UTC) #15

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2237643002/40001

4 years, 4 months ago (2016-08-11 12:39:21 UTC) #16

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 4 months ago (2016-08-11 14:01:58 UTC) #18

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: linux_chromium_asan_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_asan_rel_ng/builds/208110)

4 years, 4 months ago (2016-08-11 14:01:59 UTC) #19

Alexei Svitkine (slow)

4 years, 4 months ago (2016-08-11 15:12:05 UTC) #20

I can look at this more on Friday, but I believe our expectation when
writing this code was simply to schedule things that will run when Chrome
gets out of single-threaded mode and spins up the blocking pool. (Not that
these would be scheduled right away.)

I think being able to do that is useful and it would be good to be able to
preserve this capability. Whether these specific tasks could be delayed
until after start up is imho orthogonal and I'll have to look at exactly
what they are doing again.


Ilya Sherman

Okay, per discussion, I think this LGTM. You might want to wait for Alexei to ...

4 years, 4 months ago (2016-08-11 18:43:04 UTC) #21

gab

On 2016/08/11 15:12:05, Alexei Svitkine (very slow) wrote: > I can look at this more ...

4 years, 4 months ago (2016-08-11 22:48:30 UTC) #22

gab

The CQ bit was checked by gab@chromium.org to run a CQ dry run

4 years, 4 months ago (2016-08-12 04:19:03 UTC) #23

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2237643002/60001

4 years, 4 months ago (2016-08-12 04:19:19 UTC) #24

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 4 months ago (2016-08-12 05:58:59 UTC) #25

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: linux_chromium_asan_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_asan_rel_ng/builds/208734)

4 years, 4 months ago (2016-08-12 05:59:01 UTC) #26

Alexei Svitkine (slow)

I have some concerns. Comments below. https://codereview.chromium.org/2237643002/diff/60001/chrome/browser/metrics/chrome_metrics_service_client.cc File chrome/browser/metrics/chrome_metrics_service_client.cc (right): https://codereview.chromium.org/2237643002/diff/60001/chrome/browser/metrics/chrome_metrics_service_client.cc#newcode425 chrome/browser/metrics/chrome_metrics_service_client.cc:425: content::BrowserThread::GetBlockingPool())))); Similar to ...

4 years, 4 months ago (2016-08-12 18:28:18 UTC) #27

I have some concerns. Comments below.

https://codereview.chromium.org/2237643002/diff/60001/chrome/browser/metrics/...
File chrome/browser/metrics/chrome_metrics_service_client.cc (right):

https://codereview.chromium.org/2237643002/diff/60001/chrome/browser/metrics/...
chrome/browser/metrics/chrome_metrics_service_client.cc:425:
content::BrowserThread::GetBlockingPool()))));
Similar to AntiVirusMetricsProvider, this could delay getting the data in time.

https://codereview.chromium.org/2237643002/diff/60001/chrome/browser/metrics/...
chrome/browser/metrics/chrome_metrics_service_client.cc:486:
content::BrowserThread::GetBlockingPool()))));
This seems fine. Used to clean up registry data when metrics is off.

https://codereview.chromium.org/2237643002/diff/60001/chrome/browser/metrics/...
chrome/browser/metrics/chrome_metrics_service_client.cc:489: new
AntiVirusMetricsProvider(new AfterStartupTaskUtils::Runner(
This will delay running GetAntiVirusProductsOnFileThread(). (Which is not well
named, as an aside.)

If it runs too late, ProvideSystemProfileMetrics() will not have AV info in
time.

Seems like this could affect behavior. I would want us to verify things or add a
histogram to measure the impact of this, rather than just blindly landing this.

https://codereview.chromium.org/2237643002/diff/60001/chrome/browser/metrics/...
File chrome/browser/metrics/chrome_metrics_services_manager_client.cc (right):

https://codereview.chromium.org/2237643002/diff/60001/chrome/browser/metrics/...
chrome/browser/metrics/chrome_metrics_services_manager_client.cc:43: FROM_HERE,
content::BrowserThread::GetBlockingPool(),
Hmm, in the following CL I'm clearing the client info if consent is toggled off:

https://codereview.chromium.org/2222903004/

I wonder if there's any chance of a race with this change.

For example, if this schedules a task, then the "toggle checkbox off" code runs
and then this task runs, then we will get data written.

Now, I find it hard to reason about whether the above is possible, since I'm not
sure what "After startup" means. Is it arbitrarily delayed by some random value?

If so, I can imagine this going wrong. Ideally, every call to
StoreMetricsClientInfo should use the same sequenced runner. Which might be a
bit hard to structure.
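
The ordering described above can be replayed deterministically. This is only a
sketch under assumptions: ClientInfoStore and the task list are hypothetical
stand-ins, not the real StoreMetricsClientInfo code path.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for the persisted client info.
struct ClientInfoStore {
  bool has_info = false;
  std::string client_id;
  void Store(const std::string& id) { has_info = true; client_id = id; }
  void Clear() { has_info = false; client_id.clear(); }
};

// Replays the problematic ordering and returns true if client info ends up
// stored even though consent was turned off in between.
bool StaleWriteAfterConsentOff() {
  ClientInfoStore store;
  std::vector<std::function<void()>> delayed_tasks;

  // 1. During startup the store is scheduled, but its execution is delayed.
  delayed_tasks.push_back([&store] { store.Store("client-123"); });

  // 2. The user turns metrics consent off; the info is cleared right away.
  store.Clear();
  assert(!store.has_info);

  // 3. The delayed task finally runs and writes the info back.
  for (auto& task : delayed_tasks)
    task();

  return store.has_info;
}
```

If both the store and the clear went through the same sequenced runner, the
clear would be queued behind the store and would run last, which is the
structure suggested above.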

gab

Brain dump here before heading off on vacation. https://codereview.chromium.org/2237643002/diff/60001/chrome/browser/metrics/chrome_metrics_service_client.cc File chrome/browser/metrics/chrome_metrics_service_client.cc (right): https://codereview.chromium.org/2237643002/diff/60001/chrome/browser/metrics/chrome_metrics_service_client.cc#newcode489 chrome/browser/metrics/chrome_metrics_service_client.cc:489: new ...

4 years, 4 months ago (2016-08-15 19:39:56 UTC) #28

Brain dump here before heading off on vacation.

https://codereview.chromium.org/2237643002/diff/60001/chrome/browser/metrics/...
File chrome/browser/metrics/chrome_metrics_service_client.cc (right):

https://codereview.chromium.org/2237643002/diff/60001/chrome/browser/metrics/...
chrome/browser/metrics/chrome_metrics_service_client.cc:489: new
AntiVirusMetricsProvider(new AfterStartupTaskUtils::Runner(
On 2016/08/12 18:28:18, Alexei Svitkine (very slow) wrote:
> This will delay running GetAntiVirusProductsOnFileThread(). (Which is not well
> named, as an aside.)
> 
> If it runs too late, ProvideSystemProfileMetrics() will not have AV info in
> time.
> 
> Seems like this could affect behavior. I would want us to verify things or add a
> histogram to measure the impact of this, rather than just blindly landing this.

For startup's sake on slow machines (forgetting the initial re-ordering premise
of this CL for a second), should we not delay the first UMA ping until this
information has been gathered instead of rushing to acquire it while the machine
is under heavy load on startup? This doesn't appear critical to painting the
first frame and hence we should be happy to delay it.

Most users will complete startup in the first 30 seconds and then would schedule
these tasks (with a random 0-10s delay, see after_startup_task_utils.cc). So for
most users we should have this information in time (1 minute IIUC), for others,
gathering it is probably hurting anyways and delaying the first UMA ping (or
flagging it as missing the info) sounds better to me.
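
The timing argument above amounts to simple arithmetic. The sketch below uses
the figures quoted in this comment (30 s startup, up to 10 s random delay,
first ping at roughly 1 minute) as assumed constants; it only bounds when the
task starts, not when it finishes.

```cpp
// Figures taken from the comment above; treat them as assumptions.
constexpr int kTypicalStartupSeconds = 30;
constexpr int kMaxRandomDelaySeconds = 10;
constexpr int kFirstPingSeconds = 60;

// Latest time, in seconds since launch, at which a delayed task starts.
constexpr int WorstCaseTaskStartSeconds(int startup_seconds) {
  return startup_seconds + kMaxRandomDelaySeconds;
}

// True if the delayed task is guaranteed to start before the first ping.
constexpr bool StartsBeforeFirstPing(int startup_seconds) {
  return WorstCaseTaskStartSeconds(startup_seconds) < kFirstPingSeconds;
}

static_assert(StartsBeforeFirstPing(kTypicalStartupSeconds),
              "typical startup leaves headroom before the first ping");
```

With these numbers the task starts at 40 s in the worst case for a typical
startup; a machine that needs 50 s or more to start up would no longer fit
before the 60 s mark.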

From offline discussion: delaying the regular ping is okay, but not the
stability ping (I'm not sure about the details here).

https://codereview.chromium.org/2237643002/diff/60001/chrome/browser/metrics/...
File chrome/browser/metrics/chrome_metrics_services_manager_client.cc (right):

https://codereview.chromium.org/2237643002/diff/60001/chrome/browser/metrics/...
chrome/browser/metrics/chrome_metrics_services_manager_client.cc:43: FROM_HERE,
content::BrowserThread::GetBlockingPool(),
On 2016/08/12 18:28:18, Alexei Svitkine (very slow) wrote:
> Hmm, in the following CL I'm clearing the client info if consent is toggled off:
> 
> https://codereview.chromium.org/2222903004/
> 
> I wonder if there's any chance of a race with this change.
> 
> For example, if this schedules a task, then the "toggle checkbox off" code runs
> and then this task runs, then we will get data written.
> 
> Now, I find it hard to reason about whether the above is possible, since I'm not
> sure what "After startup" means. Is it arbitrarily delayed by some random value?
> 
> If so, I can imagine this going wrong. Ideally, every call to
> StoreMetricsClientInfo should use the same sequenced runner. Which might be a
> bit hard to structure.

If there is, the race condition already exists today (an unsequenced PostTask to
a TaskRunner provides absolutely no guarantees of ordering). The extra delay
might highlight the race but it won't cause it.

From offline discussion: true, but today people would have to change settings
manually super early, whereas with this CL they have up to a 10-second window
after first paint during which they could disable UMA and have this delayed task
re-save state after the fact. This can perhaps be handled by a
CancellableFlag?
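
One way the flag idea could work, as a sketch only: the delayed task checks a
shared flag before writing. CancelFlag, MakeCancellable and ClientInfoStore are
hypothetical names, and std::atomic<bool> is used here in place of whatever
flag type the code would actually use.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for the persisted client info.
struct ClientInfoStore {
  bool has_info = false;
  std::string client_id;
  void Store(const std::string& id) { has_info = true; client_id = id; }
  void Clear() { has_info = false; client_id.clear(); }
};

using CancelFlag = std::shared_ptr<std::atomic<bool>>;

// Wraps a task so that it becomes a no-op once the flag has been set.
std::function<void()> MakeCancellable(CancelFlag cancelled,
                                      std::function<void()> task) {
  assert(cancelled);
  return [cancelled, task] {
    if (!cancelled->load())
      task();
  };
}

// Replays: delayed store queued, consent turned off (flag set and info
// cleared), then the delayed task runs. Returns whether info was written.
bool StaleWriteWithCancellation() {
  ClientInfoStore store;
  CancelFlag cancelled = std::make_shared<std::atomic<bool>>(false);
  std::vector<std::function<void()>> delayed_tasks;

  delayed_tasks.push_back(
      MakeCancellable(cancelled, [&store] { store.Store("client-123"); }));

  cancelled->store(true);  // Consent toggled off before the task runs.
  store.Clear();

  for (auto& task : delayed_tasks)
    task();

  return store.has_info;
}
```

This closes the window only when the flag is set before the task starts. If the
task and the toggle can run concurrently on different threads, the check and
the write would still need to be sequenced or locked together.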

Alexei Svitkine (slow)

This can be closed now with robliao's CL, right?

4 years, 3 months ago (2016-09-15 16:41:05 UTC) #29

gab

4 years, 3 months ago (2016-09-15 17:23:58 UTC) #30

On 2016/09/15 16:41:05, Alexei Svitkine (very slow) wrote:
> This can be closed now with robliao's CL, right?

Yes

Issue 2237643002: Delay asynchronous work from the metrics clients until after startup. (Closed)


Patch Set 1 #

Patch Set 2 : rebase dependent #

Patch Set 3 : rebase dependent #
