|
|
DescriptionDelay asynchronous work from the metrics clients until after startup.
BUG=636518
Patch Set 1 #Patch Set 2 : rebase dependent #Patch Set 3 : rebase dependent #
Total comments: 6
Depends on Patchset: Dependent Patchsets: Messages
Total messages: 30 (19 generated)
gab@chromium.org changed reviewers: + isherman@chromium.org
Ilya PTAL, thanks!
The CQ bit was checked by gab@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: linux_chromium_chromeos_compile_dbg_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_...) linux_chromium_compile_dbg_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_...) linux_chromium_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_...)
isherman@chromium.org changed reviewers: + asvitkine@chromium.org, bcwhite@chromium.org
+Alexei and Brian as a sanity check Hmm. There are a number of subtle timing issues that could come up here. For example, the persistent memory allocator should really be created as early as possible. How much of a timing difference does this CL make, roughly? Also: Is this CL *only* needed to be able to *experiment* on the new scheduler? If so, could it be reverted in the future, whether we move to the new scheduler or not? If the answer is yes, I'd prefer to somehow make it really clear that change is temporary, by naming the methods to indicate such.
> Hmm. There are a number of subtle timing issues that could come up here. For > example, the persistent memory allocator should really be created as early as > possible. How much of a timing difference does this CL make, roughly? The memory allocator currently gets created as soon as field trials are activated (i.e. not affected by this CL) but eventually it may happen sooner, such as when BrowserMainRunner is initializing. It depends on how important persistence is for those metrics that evaluate startup behavior. The delay of the FileMetricsProvider runner is fine; there will be plenty of time between startup finishing and the first metrics report at the 1-minute mark to collect those.
On 2016/08/11 04:18:00, Ilya Sherman wrote: > +Alexei and Brian as a sanity check > > Hmm. There are a number of subtle timing issues that could come up here. For > example, the persistent memory allocator should really be created as early as > possible. How much of a timing difference does this CL make, roughly? This would delay several seconds (i.e. by the average startup time instead of being one of the first tasks in the BlockingPool), AfterStartupTaskUtils queues tasks and posts them in a random order once startup completes (first frame painted). I figured this was okay as you're using the BlockingPool without sequencing already (and shouldn't really rely on throughput from it for correctness either -- though being its first clients you might have gotten away with this..). It's kind of an accident that this works today IMO as there really shouldn't be any pool available before the end of PreCreateThreads. If that delay is unacceptable, another option is a controlled delay -- i.e. by using a DeferredSequencedTaskRunner (or introducing DeferredTaskRunner if we want to avoid unnecessary sequencing) and then manually releasing it at the very end of PreCreateThreadsImpl > > Also: Is this CL *only* needed to be able to *experiment* on the new scheduler? > If so, could it be reverted in the future, whether we move to the new scheduler > or not? If the answer is yes, I'd prefer to somehow make it really clear that > change is temporary, by naming the methods to indicate such. No it's more than that. 1) we are almost certainly moving to the new scheduler. 2) there will almost always need to be live experiments from the scheduler's internals (or at least need the possibility to bring one up without a refactoring each time). 3) as mentioned above there shouldn't really be pools live during PreCreateThreads but the scheduler needs to be up before any thread hence it's currently a strong requirement that it be last in PreCreateThreads (or could be first in CreateThreads I guess but same problem). So no it's not temporary, the metrics async work needs to be at least delayed after PreCreateThreads (after startup is just a convenience as it appeared to be non critical work but I might have misinterpreted that).
The CQ bit was checked by gab@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: ios-device on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-device/builds...) ios-simulator on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-simulator/bui...) mac_chromium_compile_dbg_ng on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_comp...)
The CQ bit was checked by gab@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Patchset #2 (id:20001) has been deleted
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: linux_chromium_asan_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_...)
I can look at this more on Friday, but I believe our expectation when writing this code is to simply schedule thing that will run when Chrome gets out of single threaded mode and spins up the blocking pool. (Not that these would be scheduled right away.) I think being able to do that is useful and it would be good to be able to preserve this capability. Whether these specific tasks could be delayed until after start up is imho orthogonal and I'll have to look at exactly what they are doing again. On Aug 11, 2016 4:08 AM, <gab@chromium.org> wrote: > On 2016/08/11 04:18:00, Ilya Sherman wrote: > > +Alexei and Brian as a sanity check > > > > Hmm. There are a number of subtle timing issues that could come up here. > For > > example, the persistent memory allocator should really be created as > early as > > possible. How much of a timing difference does this CL make, roughly? > > This would delay several seconds (i.e. by the average startup time instead > of > being one of the first tasks in the BlockingPool), AfterStartupTaskUtils > queues > tasks and posts them in a random order once startup completes (first frame > painted). I figured this was okay as you're using the BlockingPool without > sequencing already (and shouldn't really rely on throughput from it for > correctness either -- though being its first clients you might have gotten > away > with this..). > > It's kind of an accident that this works today IMO as there really > shouldn't be > any pool available before the end of PreCreateThreads. > > If that delay is unacceptable, another option is a controlled delay -- > i.e. by > using a DeferredSequencedTaskRunner (or introducing DeferredTaskRunner if > we > want to avoid unnecessary sequencing) and then manually releasing it at the > very end of PreCreateThreadsImpl > > > > > Also: Is this CL *only* needed to be able to *experiment* on the new > scheduler? > > If so, could it be reverted in the future, whether we move to the new > scheduler > > or not? If the answer is yes, I'd prefer to somehow make it really clear > that > > change is temporary, by naming the methods to indicate such. > > No it's more than that. 1) we are almost certainly moving to the new > scheduler. > 2) there will almost always need to be live experiments from the > scheduler's > internals (or at least need the possibility to bring one up without a > refactoring each time). 3) as mentioned above there shouldn't really be > pools > live during PreCreateThreads but the scheduler needs to be up before any > thread > hence it's currently a strong requirement that it be last in > PreCreateThreads > (or could be first in CreateThreads I guess but same problem). > So no it's not temporary, the metrics async work needs to be at least > delayed > after PreCreateThreads (after startup is just a convenience as it appeared > to be > non critical work but I might have misinterpreted that). > > https://codereview.chromium.org/2237643002/ > -- You received this message because you are subscribed to the Google Groups "Chromium-reviews" group. To unsubscribe from this group and stop receiving emails from it, send an email to chromium-reviews+unsubscribe@chromium.org.
Okay, per discussion, I think this LGTM. You might want to wait for Alexei to take another look on Friday, just to be safe. (And, FYI, I'll be OOO the rest of today and tomorrow.)
On 2016/08/11 15:12:05, Alexei Svitkine (very slow) wrote: > I can look at this more on Friday, but I believe our expectation when > writing this code is to simply schedule thing that will run when Chrome > gets out of single threaded mode and spins up the blocking pool. (Not that > these would be scheduled right away.) Right, except that BrowserThreadImpl happens to lazily instantiate the pool right then, which brings up its threads on the first PostTask and essentially violates PreCreateThreads()'s contract (I understand that this isn't really your problem, but your code happens to be the only one triggering this so a local resolution would be easier for now). > > I think being able to do that is useful and it would be good to be able to > preserve this capability. Whether these specific tasks could be delayed > until after start up is imho orthogonal and I'll have to look at exactly > what they are doing again. Agreed that it's orthogonal. We just need _a_ delay. We could introduce a generic way to buffer up SequencedWorkerPool tasks until signaled or we could pass a DeferredTaskRunner to SetupMetricsAndFieldTrials() and release it when we please. But if running after startup works for you, it's the simplest solution IMO (and pushing more things after startup is never a bad thing either, your code happens to not have sequencing requirements so yay :-))
The CQ bit was checked by gab@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: linux_chromium_asan_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_...)
I have some concerns. Comments below. https://codereview.chromium.org/2237643002/diff/60001/chrome/browser/metrics/... File chrome/browser/metrics/chrome_metrics_service_client.cc (right): https://codereview.chromium.org/2237643002/diff/60001/chrome/browser/metrics/... chrome/browser/metrics/chrome_metrics_service_client.cc:425: content::BrowserThread::GetBlockingPool())))); Similar to AntiVirusMetricsProvider, this could delay getting the data in time. https://codereview.chromium.org/2237643002/diff/60001/chrome/browser/metrics/... chrome/browser/metrics/chrome_metrics_service_client.cc:486: content::BrowserThread::GetBlockingPool())))); This seems fine. Used to clean up registry data when metrics is off. https://codereview.chromium.org/2237643002/diff/60001/chrome/browser/metrics/... chrome/browser/metrics/chrome_metrics_service_client.cc:489: new AntiVirusMetricsProvider(new AfterStartupTaskUtils::Runner( This will delay running GetAntiVirusProductsOnFileThread(). (Which is not well named, as an aside.) If it runs too late, ProvideSystemProfileMetrics() will not have AV info in time. Seems like this could affect behavior. I would want us to verify things or add a histogram to measure the impact of this, rather than just blindly landing this. https://codereview.chromium.org/2237643002/diff/60001/chrome/browser/metrics/... File chrome/browser/metrics/chrome_metrics_services_manager_client.cc (right): https://codereview.chromium.org/2237643002/diff/60001/chrome/browser/metrics/... chrome/browser/metrics/chrome_metrics_services_manager_client.cc:43: FROM_HERE, content::BrowserThread::GetBlockingPool(), Hmm, in the following CL I'm clearing the client info if consent is toggled off: https://codereview.chromium.org/2222903004/ I wonder if there's any chance of a race with this change. For example, if this schedules a task, then the "toggle checkbox off" code runs and then this task runs, then we will get data written. Now, I find it hard to reason about whether the above is possible, since I'm not sure what "After startup" means. Is it arbitrarily delayed by some random value? If so, I can imagine this going wrong. Ideally, every call to StoreMetricsClientInfo should use the same sequenced runner. Which might be a bit hard to structure.
Brain dump here before heading off on vacation. https://codereview.chromium.org/2237643002/diff/60001/chrome/browser/metrics/... File chrome/browser/metrics/chrome_metrics_service_client.cc (right): https://codereview.chromium.org/2237643002/diff/60001/chrome/browser/metrics/... chrome/browser/metrics/chrome_metrics_service_client.cc:489: new AntiVirusMetricsProvider(new AfterStartupTaskUtils::Runner( On 2016/08/12 18:28:18, Alexei Svitkine (very slow) wrote: > This will delay running GetAntiVirusProductsOnFileThread(). (Which is not well > named, as an aside.) > > If it runs too late, ProvideSystemProfileMetrics() will not have AV info in > time. > > Seems like this could affect behavior. I would want us to verify things or add a > histogram to measure the impact of this, rather than just blindly landing this. For startup's sake on slow machines (forgetting the initial re-ordering premise of this CL for a second), should we not delay the first UMA ping until this information has been gathered instead of rushing to acquire it while the machine is under heavy load on startup? This doesn't appear critical to painting the first frame and hence we should be happy to delay it. Most users will complete startup in the first 30 seconds and then would schedule these tasks (with a random 0-10s delay, see after_startup_task_utils.cc). So for most users we should have this information in time (1 minute IIUC), for others, gathering it is probably hurting anyways and delaying the first UMA ping (or flagging it as missing the info) sounds better to me. From offline discussion : delaying regular ping is okay but not stability ping (I'm not sure about details here). https://codereview.chromium.org/2237643002/diff/60001/chrome/browser/metrics/... File chrome/browser/metrics/chrome_metrics_services_manager_client.cc (right): https://codereview.chromium.org/2237643002/diff/60001/chrome/browser/metrics/... chrome/browser/metrics/chrome_metrics_services_manager_client.cc:43: FROM_HERE, content::BrowserThread::GetBlockingPool(), On 2016/08/12 18:28:18, Alexei Svitkine (very slow) wrote: > Hmm, in the following CL I'm clearing the client info if consent is toggled off: > > https://codereview.chromium.org/2222903004/ > > I wonder if there's any chance of a race with this change. > > For example, if this schedules a task, then the "toggle checkbox off" code runs > and then this task runs, then we will get data written. > > Now, I find it hard to reason about whether the above is possible, since I'm not > sure what "After startup" means. Is it arbitrarily delayed by some random value? > > If so, I can imagine this going wrong. Ideally, every call to > StoreMetricsClientInfo should use the same sequenced runner. Which might be a > bit hard to structure. If there is, the race condition already exists today (an unsequenced PostTask to a TaskRunner provides absolutely no guarantees of ordering. The extra delay might highlight the race but it won't cause it. From offline discussion : true but today people would have to change settings manually super early whereas with this CL they have up to a 10 seconds window after first paint during which they could disable UMA and have this delayed task re-save state after the fact later. This can perhaps be handled by a CancellableFlag?
This can be closed now with robliao's CL, right?
On 2016/09/15 16:41:05, Alexei Svitkine (very slow) wrote: > This can be closed now with robliao's CL, right? Yes |