Issue 2554123002: Support parallel captures from the StackSamplingProfiler.

Mike Wittman

wittman@chromium.org changed reviewers: + wittman@chromium.org

4 years ago (2016-12-06 21:04:56 UTC) #1

Mike Wittman

4 years ago (2016-12-06 21:04:58 UTC) #2

Haven't reviewed the logic in great detail but this approach looks reasonable to
me, with two high-level comments:

The standard mechanism for inter-thread communication in Chrome is via PostTask
to a task runner/message loop. We should support this method for requesting
captures instead of asynchronous state access guarded by a lock, unless there
are profiler-specific reasons that prevent us from doing so. (It's entirely
possible there could be, but I'm not aware of any blockers to this at the
moment.) This probably would simplify some of the logic as a side effect.

I believe you'll need to join the profiler thread in the main thread before
exiting Chrome, to ensure clean shutdown in tests and during normal execution. I
initially implemented the profiler without the join and there were failures at
shutdown due to walking stacks of threads that had already been destroyed.
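
For illustration only, a minimal sketch of the two points above: requests are handed to the sampling thread as posted tasks rather than through lock-guarded shared state, and the thread is joined before its owner goes away. It is written against the C++ standard library rather than Chromium's base, and SketchTaskRunner and CountCapturesStarted are hypothetical names that do not appear in this CL.

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical stand-in for a task runner; not Chromium's base::TaskRunner.
class SketchTaskRunner {
 public:
  SketchTaskRunner() : worker_([this] { Run(); }) {}

  // Joining here ensures no task can run after the owner has been destroyed.
  ~SketchTaskRunner() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      quit_ = true;
    }
    wake_.notify_one();
    worker_.join();
  }

  // Callable from any thread; the task runs later on the worker thread.
  void PostTask(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      tasks_.push_back(std::move(task));
    }
    wake_.notify_one();
  }

 private:
  void Run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        wake_.wait(lock, [this] { return quit_ || !tasks_.empty(); });
        if (tasks_.empty())
          return;  // quit_ is set and the queue has been drained.
        task = std::move(tasks_.front());
        tasks_.pop_front();
      }
      task();  // Runs outside the lock.
    }
  }

  std::mutex mutex_;
  std::condition_variable wake_;
  std::deque<std::function<void()>> tasks_;
  bool quit_ = false;
  std::thread worker_;  // Declared last so the members above exist first.
};

// The counter is touched only by posted tasks, so it needs no lock of its own.
int CountCapturesStarted(int requests) {
  int started = 0;
  {
    SketchTaskRunner runner;
    for (int i = 0; i < requests; ++i)
      runner.PostTask([&started] { ++started; });
  }  // The destructor drains the queue and joins the worker.
  return started;
}
```

The only lock left is the one inside the queue; the profiler state itself is confined to one thread.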

https://codereview.chromium.org/2554123002/diff/1/base/profiler/stack_samplin...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/1/base/profiler/stack_samplin...
base/profiler/stack_sampling_profiler.cc:223: static subtle::AtomicWord
next_capture_id_;
We probably can avoid the need for a thread-safe id by identifying the ActiveCapture
by its address (e.g. as an opaque void*).
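
To make the trade-off concrete, a small sketch of the two identification schemes side by side. ActiveCaptureSketch, NextCaptureId and HandleFor are hypothetical names for illustration, not the code in this patch set.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical capture record, for illustration only.
struct ActiveCaptureSketch {
  int sampling_interval_ms = 0;
};

// Option 1: a process-wide counter. It needs an atomic because any thread may
// request an id, but an id is never handed out twice.
inline std::uint64_t NextCaptureId() {
  static std::atomic<std::uint64_t> next_id{1};
  return next_id.fetch_add(1, std::memory_order_relaxed);
}

// Option 2: the object's address as an opaque handle. No shared counter is
// needed, but the allocator may give the same address to a later capture.
using CaptureHandle = const void*;

inline CaptureHandle HandleFor(const ActiveCaptureSketch& capture) {
  return &capture;
}
```

The address form only stays unambiguous while the capture object is alive, so it would need care if a handle can outlive its capture.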

https://codereview.chromium.org/2554123002/diff/1/base/profiler/stack_samplin...
base/profiler/stack_sampling_profiler.cc:348:
capture->native_sampler->ProfileRecordingStarting(&profile.modules);
The matching call to ProfileRecordingStopped has been dropped with these
changes.

https://codereview.chromium.org/2554123002/diff/1/base/profiler/stack_samplin...
base/profiler/stack_sampling_profiler.cc:379: wait = TimeDelta::FromDays(365); 
// A long, long time.
There's a general desire to have as few persistent threads as possible in
Chrome, so we probably should have the sampling thread terminate after a period
of inactivity, and restart on demand.

https://codereview.chromium.org/2554123002/diff/1/base/profiler/stack_samplin...
base/profiler/stack_sampling_profiler.cc:441:
active_captures_.push_back(std::move(capture_ptr));
push_heap?
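
A sketch of what I mean, assuming active_captures_ is meant to be kept as a heap ordered by next sample time (an assumption on my part). PendingCapture, LaterFirst, AddCapture and PopEarliest are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical record; the real ActiveCapture holds more than this.
struct PendingCapture {
  int id;
  long next_sample_time_ms;
};

// With the std heap functions, "greater than" puts the earliest time in front.
struct LaterFirst {
  bool operator()(const PendingCapture& a, const PendingCapture& b) const {
    return a.next_sample_time_ms > b.next_sample_time_ms;
  }
};

// push_back alone breaks the heap ordering; push_heap restores it.
void AddCapture(std::vector<PendingCapture>& heap, PendingCapture capture) {
  heap.push_back(capture);
  std::push_heap(heap.begin(), heap.end(), LaterFirst());
}

// Removes and returns the capture whose next sample is due soonest.
PendingCapture PopEarliest(std::vector<PendingCapture>& heap) {
  assert(!heap.empty());
  std::pop_heap(heap.begin(), heap.end(), LaterFirst());
  PendingCapture earliest = heap.back();
  heap.pop_back();
  return earliest;
}
```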

https://codereview.chromium.org/2554123002/diff/1/base/profiler/stack_samplin...
base/profiler/stack_sampling_profiler.cc:494:
NativeStackSampler::Create(thread_id_, &RecordAnnotations,
We'll need to refactor to use a single stack copy buffer across all
NativeStackSamplers, as the buffer is fairly large.

Mike Wittman

On 2016/12/06 21:04:58, Mike Wittman wrote: > I believe you'll need to join the profiler ...

4 years ago (2016-12-06 21:10:01 UTC) #3

bcwhite

> The standard mechanism for inter-thread communication in > Chrome is via PostTask to a ...

4 years ago (2016-12-07 15:15:30 UTC) #4

Mike Wittman

On 2016/12/07 15:15:30, bcwhite wrote: > > The standard mechanism for inter-thread communication in > ...

4 years ago (2016-12-07 16:25:02 UTC) #5

Mike Wittman

https://codereview.chromium.org/2554123002/diff/1/base/profiler/stack_sampling_profiler.cc File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/1/base/profiler/stack_sampling_profiler.cc#newcode223 base/profiler/stack_sampling_profiler.cc:223: static subtle::AtomicWord next_capture_id_; On 2016/12/07 16:25:02, Mike Wittman wrote: ...

4 years ago (2016-12-07 17:20:42 UTC) #6

bcwhite

> > > The standard mechanism for inter-thread communication in > > > Chrome is ...

4 years ago (2016-12-07 17:48:21 UTC) #7

Mike Wittman

On 2016/12/07 17:48:21, bcwhite wrote: > > > > The standard mechanism for inter-thread communication ...

4 years ago (2016-12-07 18:58:21 UTC) #8

bcwhite

4 years ago (2016-12-07 19:54:24 UTC) #9

On 2016/12/07 18:58:21, Mike Wittman wrote:
> On 2016/12/07 17:48:21, bcwhite wrote:
> > > > > The standard mechanism for inter-thread communication in
> > > > > Chrome is via PostTask to a task runner/message loop.
> > > > 
> > > > I considered the message-loop but it didn't appear to
> > > > offer acceptable timing guarantees.  The code I saw was
> > > > only "run until idle" which meant no timing support
> > > > for exiting the loop to perform a sampling operation.
> > > 
> > > Can we use PostDelayedTask on the message loop's task runner for this?
> > 
> > It's possible but the precision would be poor and I think timing accuracy is
> > more important than convenience in this case...  not that I find the message
> > loop to be very convenient for this use.
> 
> Why do you say the precision would be poorer? I believe the waiting in both
> cases operates at system timer tick resolution: WaitableEvent via
> SleepConditionVariableSRW, and e.g. MessagePumpForUI via
> MsgWaitForMultipleObjectsEx.
> (https://randomascii.wordpress.com/2013/04/02/sleep-variation-investigated has
> analysis of the sleep resolution, and
> https://msdn.microsoft.com/en-us/library/ms687069(VS.85).aspx documents the
> MsgWaitForMultipleObjectsEx resolution.)

It's not the OS call but the code of the loop.  There's more overhead in the
general-purpose class and while it may wait with the same resolution, it may do
other things, too.  And the code could always change outside of our control
violating our assumptions.

Is there a way to cancel a delayed task?  If not then there will be some added
complexity so that it can be stopped immediately but not fail when the delayed
task gets executed.

On the other hand, Thread supports restarting of the thread without having to
completely recreate the object.  That means no home-grown Singleton-with-delete
around SimpleThread and no need to go to a lower-level PlatformThread.  That's a
plus.
https://cs.chromium.org/chromium/src/testing/gtest/include/gtest/gtest.h?l=446

Need to sleep on it.  :-)

Mike Wittman

4 years ago (2016-12-07 20:36:33 UTC) #10

On 2016/12/07 19:54:24, bcwhite wrote:
> On 2016/12/07 18:58:21, Mike Wittman wrote:
> > On 2016/12/07 17:48:21, bcwhite wrote:
> > > > > > The standard mechanism for inter-thread communication in
> > > > > > Chrome is via PostTask to a task runner/message loop.
> > > > > 
> > > > > I considered the message-loop but it didn't appear to
> > > > > offer acceptable timing guarantees.  The code I saw was
> > > > > only "run until idle" which meant no timing support
> > > > > for exiting the loop to perform a sampling operation.
> > > > 
> > > > Can we use PostDelayedTask on the message loop's task runner for this?
> > > 
> > > It's possible but the precision would be poor and I think timing accuracy is
> > > more important than convenience in this case...  not that I find the message
> > > loop to be very convenient for this use.
> > 
> > Why do you say the precision would be poorer? I believe the waiting in both
> > cases operates at system timer tick resolution: WaitableEvent via
> > SleepConditionVariableSRW, and e.g. MessagePumpForUI via
> > MsgWaitForMultipleObjectsEx.
> > (https://randomascii.wordpress.com/2013/04/02/sleep-variation-investigated has
> > analysis of the sleep resolution, and
> > https://msdn.microsoft.com/en-us/library/ms687069(VS.85).aspx documents the
> > MsgWaitForMultipleObjectsEx resolution.)
> 
> It's not the OS call but the code of the loop.  There's more overhead in the
> general-purpose class and while it may wait with the same resolution, it may do
> other things, too.  And the code could always change outside of our control
> violating our assumptions.

I'd be most concerned about e.g. Windows sending extraneous messages to the
message loop and those delaying the processing. I don't know what the likelihood
of that occurring is though. As far as implementation overhead goes, the GPU
main thread uses a message loop and it presumably has more stringent performance
requirements than the profiler, in order to maintain frame rate.

> Is there a way to cancel a delayed task?  If not then there will be some added
> complexity so that it can be stopped immediately but not fail when the delayed
> task gets executed.

There's CancelableCallback, but it's not clear if its use of WeakPtrs would work
within the constraints of the profiler.

> On the other hand, Thread supports restarting of the thread without having to
> completely recreate the object.  That means no home-grown Singleton-with-delete
> around SimpleThread and no need to go to a lower-level PlatformThread.  That's a
> plus.
> https://cs.chromium.org/chromium/src/testing/gtest/include/gtest/gtest.h?l=446
> 
> Need to sleep on it.  :-)

bcwhite

New patch set using a message loop! Still rough but tests pass... except for the ...

4 years ago (2016-12-09 17:58:23 UTC) #11

Mike Wittman

4 years ago (2016-12-09 21:45:02 UTC) #12

On 2016/12/09 17:58:23, bcwhite wrote:
> New patch set using a message loop!

Great! Made a pass through looking for thread safety and higher-level issues.

> Still rough but tests pass... except for the one that checks that concurrent
> profiling isn't allowed.  :-)

Good. Tests definitely will need to be extended to provide good test coverage
for the concurrent case.

> There is code to deal with the thread exiting but actually having it exit will
> require a small addition to RunLoop.  Specifically, I'll need to add a
> QuitWhenEmpty() to sit beside QuitWhenIdle(), the latter stopping when there are
> no immediate tasks to execute even if there are pending delayed tasks.

I'm not sure this is necessary... see the final comment in the code.

Beyond the comments below the other major issue I'm aware of is ensuring
profiling doesn't occur after threads are destroyed.

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:265: void PerformCapture(ActiveCapture*
capture);
The term "capture" is overloaded in the method names to mean both the recording
of all the samples and the recording of a single sample. Can we use something
like "record sample" for the latter case to be consistent with the
NativeStackSampler? e.g. this becomes something like RecordSampleForCapture.

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:311:
scoped_refptr<SingleThreadTaskRunner> runner = task_runner();
According to the Thread documentation, task_runner() can only be safely called
from the thread that invokes Start().

Thread's API is not thread-safe in general, so care should be taken to ensure
that it's only used from the proper threads, including making liberal use of
DCHECKs/ThreadChecker since it's non-trivial to validate from reading the code.

The same goes for ensuring execution on proper threads in other functions in
this class (and for documenting thread expectations to the reader).

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:336: runner->PostTask(FROM_HERE,
Bind(&SamplingThread::StartCaptureTask,
It's common practice to implement thread hopping using just one function,
checking whether the execution is on the desired thread at the start of the
function, and if not, posting a task back to the same function on the desired
thread. I think that could be done here by checking the thread id, and probably
would make the code a little easier to follow.
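
A rough sketch of the single-function pattern I have in mind, using only the standard library. HopRunner, SketchSampler and StopFromCallerThread are hypothetical names; in the real code the thread check would come from the task runner or thread id rather than the thread_local used here.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical worker with a task queue; not Chromium's base::Thread.
class HopRunner {
 public:
  HopRunner() : worker_([this] { Run(); }) {}
  ~HopRunner() {
    PostTask(nullptr);  // An empty task is the quit marker in this sketch.
    worker_.join();
  }

  void PostTask(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      tasks_.push_back(std::move(task));
    }
    wake_.notify_one();
  }

  // True only when called from this runner's worker thread.
  bool RunsTasksOnCurrentThread() const { return Current() == this; }

 private:
  // Per-thread pointer to the runner whose Run() owns the calling thread.
  static const HopRunner*& Current() {
    static thread_local const HopRunner* current = nullptr;
    return current;
  }

  void Run() {
    Current() = this;
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        wake_.wait(lock, [this] { return !tasks_.empty(); });
        task = std::move(tasks_.front());
        tasks_.pop_front();
      }
      if (!task)
        return;  // Quit marker reached; earlier tasks have all run.
      task();
    }
  }

  std::mutex mutex_;
  std::condition_variable wake_;
  std::deque<std::function<void()>> tasks_;
  std::thread worker_;  // Declared last so the queue exists before Run().
};

class SketchSampler {
 public:
  explicit SketchSampler(std::vector<int>* stopped_ids)
      : stopped_ids_(stopped_ids) {}

  // Single entry point: re-posts itself if called off the sampling thread.
  void StopCapture(int id) {
    if (!runner_.RunsTasksOnCurrentThread()) {
      runner_.PostTask([this, id] { StopCapture(id); });
      return;
    }
    // Everything below runs only on the sampling thread.
    assert(runner_.RunsTasksOnCurrentThread());
    stopped_ids_->push_back(id);
  }

 private:
  std::vector<int>* stopped_ids_;
  HopRunner runner_;  // Destroyed first, which joins the worker.
};

std::vector<int> StopFromCallerThread() {
  std::vector<int> stopped;
  {
    SketchSampler sampler(&stopped);
    sampler.StopCapture(7);
    sampler.StopCapture(9);
  }  // Joins the worker before `stopped` is read.
  return stopped;
}
```

The threading expectation for the operation then lives in one place, at the top of the function.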

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:344:
scoped_refptr<SingleThreadTaskRunner> runner = task_runner();
Same issue with task_runner() here.

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:352: FROM_HERE,
Bind(&SamplingThread::StopCaptureTask, Unretained(this), id));
Same comment here about thread hopping.

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:429: 
where does the capture get erased from active_captures_?

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:471:
TimeDelta::FromSeconds(kMinimumThreadRunTimeSeconds));
My understanding is that the message loop just waits if no tasks are present. I
believe it must be forcibly quit or its thread shut down to terminate it.

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

4 years ago (2016-12-09 22:41:37 UTC) #13

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/40001

4 years ago (2016-12-09 22:42:29 UTC) #14

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years ago (2016-12-09 22:53:31 UTC) #15

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: mac_chromium_compile_dbg_ng on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_compile_dbg_ng/builds/321176)

4 years ago (2016-12-09 22:53:31 UTC) #16

bcwhite

4 years ago (2016-12-09 23:38:31 UTC) #18

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:311:
scoped_refptr<SingleThreadTaskRunner> runner = task_runner();
> According to the Thread documentation, task_runner() can only be safely called
> from the thread that invokes Start().

The comment for Thread::task_runner() says:
  // In addition to this Thread's owning sequence, this can also safely be
  // called from the underlying thread itself.

> Thread's API is not thread-safe in general, so care should be taken to ensure
> that it's only used from the proper threads, including making liberal use of
> DCHECKs/ThreadChecker since it's non-trivial to validate from reading the code.

There's a DCHECK in Thread::task_runner() that verifies this.

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:336: runner->PostTask(FROM_HERE,
Bind(&SamplingThread::StartCaptureTask,
On 2016/12/09 21:45:02, Mike Wittman wrote:
> It's common practice to implement thread hopping using just one function,
> checking whether the execution is on the desired thread at the start of the
> function, and if not, posting a task back to the same function on the desired
> thread. I think that could be done here by checking the thread id, and probably
> would make the code a little easier to follow.

Add() and Stop() are always coming from a different thread.  The
StartCaptureTask() could be merged into the same method but I think that would
be more confusing because of all the work done above to make sure the thread is
actually running.

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:429: 
On 2016/12/09 21:45:01, Mike Wittman wrote:
> where does the capture get erased from active_captures_?

In ::Cleanup()
... which I realized after uploading that I forgot to write.  :-)

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:471:
TimeDelta::FromSeconds(kMinimumThreadRunTimeSeconds));
On 2016/12/09 21:45:01, Mike Wittman wrote:
> My understanding is that the message loop just waits if no tasks are present. I
> believe it must be forcibly quit or its thread shut down to terminate it.

Correct.  My idea is to add the ability for it to self-destruct when "empty"
(which is not the same thing as "idle").

I didn't think it was possible for an outside class like this one to tell if the
message_loop is empty, but perhaps it can -- I'll have to check.  If so, then I
can add a check at the end of every task to terminate the loop if it is empty.

Mike Wittman

4 years ago (2016-12-10 00:24:23 UTC) #19

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:311:
scoped_refptr<SingleThreadTaskRunner> runner = task_runner();
On 2016/12/09 23:38:30, bcwhite wrote:
> > According to the Thread documentation, task_runner() can only be safely called
> > from the thread that invokes Start().
> 
> The comment for Thread::task_runner() says:
>   // In addition to this Thread's owning sequence, this can also safely be
>   // called from the underlying thread itself.

Right, but Add() will never be called on the thread itself, correct?

If I'm not mistaken task_runner() will be invoked on a thread other than the
thread itself and the one that called Start(), once a second thread attempts to
profile itself concurrently.

Unit tests for the concurrency functionality will help catch this type of issue.

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:336: runner->PostTask(FROM_HERE,
Bind(&SamplingThread::StartCaptureTask,
On 2016/12/09 23:38:30, bcwhite wrote:
> On 2016/12/09 21:45:02, Mike Wittman wrote:
> > It's common practice to implement thread hopping using just one function,
> > checking whether the execution is on the desired thread at the start of the
> > function, and if not, posting a task back to the same function on the desired
> > thread. I think that could be done here by checking the thread id, and probably
> > would make the code a little easier to follow.
> 
> Add() and Stop() are always coming from a different thread.  The
> StartCaptureTask() could be merged into the same method but I think that would
> be more confusing because of all the work done above to make sure the thread is
> actually running.

I think it's worth doing this for Stop() at least. The other benefit of the
one-function thread hop implementation is that it clearly documents/enforces
threading expectations for an operation in a single location.

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:471:
TimeDelta::FromSeconds(kMinimumThreadRunTimeSeconds));
On 2016/12/09 23:38:30, bcwhite wrote:
> On 2016/12/09 21:45:01, Mike Wittman wrote:
> > My understanding is that the message loop just waits if no tasks are present. I
> > believe it must be forcibly quit or its thread shut down to terminate it.
> 
> Correct.  My idea is to add the ability for it to self-destruct when "empty"
> (which is not the same thing as "idle").

I think that will be confusing to readers since it will operate differently than
all the other message loops in the application. Couldn't this be addressed by
posting a delayed quit task when the number of captures drops to zero (and
canceling the task if the number of captures becomes non-zero)?
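
Roughly what I'm picturing, as a single-threaded sketch with a virtual clock. A generation counter stands in for CancelableCallback here, and SketchLoop, CaptureTracker, QuitsAfterIdle and the 60 second timeout are all hypothetical. It only covers state changes that happen on the sampling thread itself.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <utility>

// Hypothetical single-threaded loop with a virtual clock; not MessageLoop.
class SketchLoop {
 public:
  void PostDelayedTask(long delay_ms, std::function<void()> task) {
    delayed_.emplace(now_ms_ + delay_ms, std::move(task));
  }
  void Quit() { quit_ = true; }
  bool quit_called() const { return quit_; }

  // Runs tasks in due-time order until Quit() is called or none are left.
  void Run() {
    while (!quit_ && !delayed_.empty()) {
      auto it = delayed_.begin();
      now_ms_ = it->first;
      std::function<void()> task = std::move(it->second);
      delayed_.erase(it);
      task();
    }
  }

 private:
  long now_ms_ = 0;
  bool quit_ = false;
  std::multimap<long, std::function<void()>> delayed_;
};

constexpr long kIdleTimeoutMs = 60000;  // Assumed value, for illustration.

class CaptureTracker {
 public:
  explicit CaptureTracker(SketchLoop* loop) : loop_(loop) {}

  void OnCaptureStarted() {
    ++active_captures_;
    ++quit_generation_;  // Turns any already-posted quit task into a no-op.
  }

  void OnCaptureFinished() {
    assert(active_captures_ > 0);
    if (--active_captures_ > 0)
      return;
    const int generation = ++quit_generation_;
    loop_->PostDelayedTask(kIdleTimeoutMs, [this, generation] {
      if (generation == quit_generation_)
        loop_->Quit();  // Still idle since the task was posted.
    });
  }

 private:
  SketchLoop* loop_;
  int active_captures_ = 0;
  int quit_generation_ = 0;
};

// Returns whether the loop quit on its own after the last capture finished.
bool QuitsAfterIdle(bool restart_before_timeout) {
  SketchLoop loop;
  CaptureTracker tracker(&loop);
  tracker.OnCaptureStarted();
  tracker.OnCaptureFinished();  // Count hits zero: delayed quit is posted.
  if (restart_before_timeout)
    tracker.OnCaptureStarted();  // Count is non-zero again: quit is stale.
  loop.Run();
  return loop.quit_called();
}
```

This keeps the loop itself a plain run-forever loop; only the profiler decides when to ask it to quit.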

bcwhite

4 years ago (2016-12-13 16:08:11 UTC) #20

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:311:
scoped_refptr<SingleThreadTaskRunner> runner = task_runner();
On 2016/12/10 00:24:23, Mike Wittman wrote:
> On 2016/12/09 23:38:30, bcwhite wrote:
> > > According to the Thread documentation, task_runner() can only be safely called
> > > from the thread that invokes Start().
> > 
> > The comment for Thread::task_runner() says:
> >   // In addition to this Thread's owning sequence, this can also safely be
> >   // called from the underlying thread itself.
> 
> Right, but Add() will never be called on the thread itself, correct?
> 
> If I'm not mistaken task_runner() will be invoked on a thread other than the
> thread itself and the one that called Start(), once a second thread attempts to
> profile itself concurrently.
> 
> Unit tests for the concurrency functionality will help catch this type of issue.

Ah, I understand.  So if this is called from other than the thread that started
it, I need to post-task to the thread that started it... which will then
post-task to the worker thread.

But is it possible that whatever thread started it has the same restrictions and
won't allow posts from just anywhere?

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:336: runner->PostTask(FROM_HERE,
Bind(&SamplingThread::StartCaptureTask,
On 2016/12/10 00:24:23, Mike Wittman wrote:
> On 2016/12/09 23:38:30, bcwhite wrote:
> > On 2016/12/09 21:45:02, Mike Wittman wrote:
> > > It's common practice to implement thread hopping using just one function,
> > > checking whether the execution is on the desired thread at the start of the
> > > function, and if not, posting a task back to the same function on the desired
> > > thread. I think that could be done here by checking the thread id, and probably
> > > would make the code a little easier to follow.
> > 
> > Add() and Stop() are always coming from a different thread.  The
> > StartCaptureTask() could be merged into the same method but I think that would
> > be more confusing because of all the work done above to make sure the thread is
> > actually running.
> 
> I think it's worth doing this for Stop() at least. The other benefit of the
> one-function thread hop implementation is that it clearly documents/enforces
> threading expectations for an operation in a single location.

I can see that.  The downside is that there are then two different styles for
methods of the same class.

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:471:
TimeDelta::FromSeconds(kMinimumThreadRunTimeSeconds));
On 2016/12/10 00:24:23, Mike Wittman wrote:
> On 2016/12/09 23:38:30, bcwhite wrote:
> > On 2016/12/09 21:45:01, Mike Wittman wrote:
> > > My understanding is that the message loop just waits if no tasks are present. I
> > > believe it must be forcibly quit or its thread shut down to terminate it.
> > 
> > Correct.  My idea is to add the ability for it to self-destruct when "empty"
> > (which is not the same thing as "idle").
> 
> I think that will be confusing to readers since it will operate differently than
> all the other message loops in the application. Couldn't this be addressed by
> posting a delayed quit task when the number of captures drops to zero (and
> canceling the task if the number of captures becomes non-zero)?

Message loops are already RunForever or RunUntilIdle.  Adding RunUntilEmpty
seems a natural (and generally useful) extension.

Posting a delayed quit is still a race condition because a new task could get
posted from another thread just as the quit starts executing.

It's possible that's a problem no matter what the solution.  Even RunUntilEmpty
may have that issue -- I'd have to investigate further.  There may need to be
some sort of atomic operation, such as a simple counter, no matter what.

Given the variations, I think it would be best to leave it as "run forever" in
this CL and do the quit-when-idle as a follow-up CL.

Mike Wittman

4 years ago (2016-12-13 18:16:41 UTC) #21

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:311:
scoped_refptr<SingleThreadTaskRunner> runner = task_runner();
On 2016/12/13 16:08:11, bcwhite wrote:
> On 2016/12/10 00:24:23, Mike Wittman wrote:
> > On 2016/12/09 23:38:30, bcwhite wrote:
> > > > According to the Thread documentation, task_runner() can only be safely called
> > > > from the thread that invokes Start().
> > > 
> > > The comment for Thread::task_runner() says:
> > >   // In addition to this Thread's owning sequence, this can also safely be
> > >   // called from the underlying thread itself.
> > 
> > Right, but Add() will never be called on the thread itself, correct?
> > 
> > If I'm not mistaken task_runner() will be invoked on a thread other than the
> > thread itself and the one that called Start(), once a second thread attempts to
> > profile itself concurrently.
> > 
> > Unit tests for the concurrency functionality will help catch this type of issue.
> 
> Ah, I understand.  So if this is called from other than the thread that started
> it, I need to post-task to the thread that started it... which will then
> post-task to the worker thread.
> 
> But is it possible that whatever thread started it has the same restrictions and
> won't allow posts from just anywhere?

It may be possible to call task_runner() on the Start thread, then maintain an
instance of the scoped_refptr<SingleThreadTaskRunner> as a member variable on
SamplingThread, since SingleThreadTaskRunner is a thread-safe refcounted type.

Delaying the quit-when-idle behavior to a follow on CL hopefully should make
this a little easier to deal with.
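
A sketch of the cached-handle idea, with std::shared_ptr standing in for scoped_refptr and a bare queue standing in for SingleThreadTaskRunner. SketchRunner, SamplingThreadSketch and PostFromTwoThreads are hypothetical names, and the sketch assumes Start() completes before any other thread touches the object.

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical thread-safe queue handle; it only stores tasks in this sketch.
class SketchRunner {
 public:
  void PostTask(std::function<void()> task) {
    std::lock_guard<std::mutex> lock(mutex_);
    tasks_.push_back(std::move(task));
  }
  std::size_t pending() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return tasks_.size();
  }

 private:
  mutable std::mutex mutex_;
  std::vector<std::function<void()>> tasks_;
};

class SamplingThreadSketch {
 public:
  // Called once on the owning thread, before other threads use this object.
  void Start() { task_runner_ = std::make_shared<SketchRunner>(); }

  // After Start() the member is never written again, so any thread may copy
  // it; the reference count itself is updated atomically.
  std::shared_ptr<SketchRunner> task_runner() const { return task_runner_; }

 private:
  std::shared_ptr<SketchRunner> task_runner_;
};

// Two threads that did not call Start() both post through the cached handle.
std::size_t PostFromTwoThreads() {
  SamplingThreadSketch sampling;
  sampling.Start();
  std::thread a([&sampling] { sampling.task_runner()->PostTask([] {}); });
  std::thread b([&sampling] { sampling.task_runner()->PostTask([] {}); });
  a.join();
  b.join();
  return sampling.task_runner()->pending();
}
```

Callers then never need to touch the thread object itself from an arbitrary thread, only the handle.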

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:471:
TimeDelta::FromSeconds(kMinimumThreadRunTimeSeconds));
On 2016/12/13 16:08:11, bcwhite wrote:
> On 2016/12/10 00:24:23, Mike Wittman wrote:
> > On 2016/12/09 23:38:30, bcwhite wrote:
> > > On 2016/12/09 21:45:01, Mike Wittman wrote:
> > > > My understanding is that the message loop just waits if no tasks are present. I
> > > > believe it must be forcibly quit or its thread shut down to terminate it.
> > > 
> > > Correct.  My idea is to add the ability for it to self-destruct when "empty"
> > > (which is not the same thing as "idle").
> > 
> > I think that will be confusing to readers since it will operate differently than
> > all the other message loops in the application. Couldn't this be addressed by
> > posting a delayed quit task when the number of captures drops to zero (and
> > canceling the task if the number of captures becomes non-zero)?
> 
> Message loops are already RunForever or RunUntilIdle.  Adding RunUntilEmpty
> seems a natural (and generally useful) extension.

RunForever is pretty much the only mode that's used in Chrome itself;
RunUntilIdle is used almost exclusively for testing. I believe RunUntilIdle is
generally considered an anti-pattern in production code because of its action at
a distance properties -- anyone else in the system (including the OS) can
unintentionally keep the message loop alive by posting messages to it. There's
only a dozen instances of RunUntilIdle in actual Chrome code, all of which are
in highly constrained scenarios:
https://cs.chromium.org/search/?q=rununtilidle%5C(%5C);+file:%5C.cc$+-file:te...

RunUntilEmpty will be subject to the same issues, I think.

> Posting a delayed quit is still a race condition because a new task could get
> posted from another thread just as the quit starts executing.
> 
> It's possible that's a problem no matter what the solution.  Even RunUntilEmpty
> may have that issue -- I'd have to investigate further.  There may need to be
> some sort of atomic operation, such as a simple counter, no matter what.
> 
> Given the variations, I think it would be best to leave it as "run forever" in
> this CL and do the quit-when-idle as a follow-up CL.

Deferring the quit-when-idle behavior to a follow-on CL SGTM. I suspect there
will be more than enough complexity to deal with just implementing the "run
forever" mode.

bcwhite

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sampling_profiler.cc File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sampling_profiler.cc#newcode311 base/profiler/stack_sampling_profiler.cc:311: scoped_refptr<SingleThreadTaskRunner> runner = task_runner(); > It may be possible ...

4 years ago (2016-12-14 15:37:59 UTC) #22

Mike Wittman

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sampling_profiler.cc File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sampling_profiler.cc#newcode311 base/profiler/stack_sampling_profiler.cc:311: scoped_refptr<SingleThreadTaskRunner> runner = task_runner(); On 2016/12/14 15:37:59, bcwhite wrote: ...

4 years ago (2016-12-14 18:00:48 UTC) #23

bcwhite

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sampling_profiler.cc File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sampling_profiler.cc#newcode311 base/profiler/stack_sampling_profiler.cc:311: scoped_refptr<SingleThreadTaskRunner> runner = task_runner(); > > Wasn't there an ...

4 years ago (2016-12-14 19:39:39 UTC) #24

Mike Wittman

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sampling_profiler.cc File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sampling_profiler.cc#newcode311 base/profiler/stack_sampling_profiler.cc:311: scoped_refptr<SingleThreadTaskRunner> runner = task_runner(); On 2016/12/14 19:39:39, bcwhite wrote: ...

4 years ago (2016-12-14 20:47:51 UTC) #25

bcwhite

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sampling_profiler.cc File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sampling_profiler.cc#newcode311 base/profiler/stack_sampling_profiler.cc:311: scoped_refptr<SingleThreadTaskRunner> runner = task_runner(); > > I must be ...

4 years ago (2016-12-15 11:42:15 UTC) #26

bcwhite

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sampling_profiler.cc File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sampling_profiler.cc#newcode311 base/profiler/stack_sampling_profiler.cc:311: scoped_refptr<SingleThreadTaskRunner> runner = task_runner(); On 2016/12/15 11:42:15, bcwhite wrote: ...

4 years ago (2016-12-15 15:01:16 UTC) #27

Mike Wittman

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sampling_profiler.cc File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sampling_profiler.cc#newcode311 base/profiler/stack_sampling_profiler.cc:311: scoped_refptr<SingleThreadTaskRunner> runner = task_runner(); On 2016/12/15 15:01:16, bcwhite wrote: ...

4 years ago (2016-12-15 17:22:46 UTC) #28

bcwhite

4 years ago (2016-12-15 18:07:50 UTC) #29

Switched to task-runner.  Still a bit rough and tests need to be updated/added
-- working on that now.

https://codereview.chromium.org/2554123002/diff/1/base/profiler/stack_samplin...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/1/base/profiler/stack_samplin...
base/profiler/stack_sampling_profiler.cc:223: static subtle::AtomicWord
next_capture_id_;
On 2016/12/07 17:20:42, Mike Wittman wrote:
> On 2016/12/07 16:25:02, Mike Wittman wrote:
> > On 2016/12/07 15:15:30, bcwhite wrote:
> > > On 2016/12/06 21:04:58, Mike Wittman wrote:
> > > > We probably can avoid need for a thread-safe id by identifying the
> > > > ActiveCapture by its address (e.g. as an opaque void*).
> > > 
> > > My concern with that is that addresses may be reused.  A capture could
> > > start and then complete, getting freed.  A new capture could start and
> > > reuse the same address, reasonably likely given that the allocation is
> > > the exact same number of bytes as the free'd block.  Then a stop-request
> > > for the first one could be made and cause the new one to stop.
> > > 
> > > The incrementing integer will also repeat but not for a long, long time.
> > 
> > Yes, care would need to be taken to ensure the StackSamplingProfiler doesn't
> > retain the address beyond when the object is deleted. This may be feasible,
> > depending on the synchronization we ultimately have in place with the
> > threads owning the StackSamplingProfilers. Let's reconsider once we have
> > something closer to a final implementation.
> 
> Also, the current implementation can use base::StaticAtomicSequenceNumber.

Done.
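
For readers following the id-versus-address discussion above, here is a minimal sketch of the sequence-number approach. It uses std::atomic as a stand-in for base::StaticAtomicSequenceNumber; the function name NextCollectionId is hypothetical and not taken from the CL.

```cpp
#include <atomic>

// Hypothetical stand-in for base::StaticAtomicSequenceNumber: hands out
// process-wide ids that are not reused until the counter wraps around.
int NextCollectionId() {
  static std::atomic<int> next_id{0};
  // fetch_add returns the previous value, so concurrent callers on
  // different threads each receive a distinct id.
  return next_id.fetch_add(1, std::memory_order_relaxed);
}
```

Unlike a heap address, an id handed out this way is not recycled when a collection finishes and its storage is freed, so a late stop request for an old collection should not match a newer one.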

https://codereview.chromium.org/2554123002/diff/1/base/profiler/stack_samplin...
base/profiler/stack_sampling_profiler.cc:348:
capture->native_sampler->ProfileRecordingStarting(&profile.modules);
On 2016/12/06 21:04:57, Mike Wittman wrote:
> The matching call to ProfileRecordingStopped has been dropped with these
> changes.

Done.

https://codereview.chromium.org/2554123002/diff/1/base/profiler/stack_samplin...
base/profiler/stack_sampling_profiler.cc:379: wait = TimeDelta::FromDays(365); 
// A long, long time.
On 2016/12/06 21:04:58, Mike Wittman wrote:
> There's a general desire to have as few persistent threads as possible in
> Chrome, so we probably should have the sampling thread terminate after a
> period of inactivity, and restart on demand.

To be done in a future CL.

https://codereview.chromium.org/2554123002/diff/1/base/profiler/stack_samplin...
base/profiler/stack_sampling_profiler.cc:441:
active_captures_.push_back(std::move(capture_ptr));
On 2016/12/06 21:04:57, Mike Wittman wrote:
> push_heap?

Acknowledged.

https://codereview.chromium.org/2554123002/diff/1/base/profiler/stack_samplin...
base/profiler/stack_sampling_profiler.cc:494:
NativeStackSampler::Create(thread_id_, &RecordAnnotations,
On 2016/12/06 21:04:58, Mike Wittman wrote:
> We'll need to refactor to use a single stack copy buffer across all
> NativeStackSamplers, as the buffer is fairly large.

Future CL.

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:265: void PerformCapture(ActiveCapture*
capture);
On 2016/12/09 21:45:02, Mike Wittman wrote:
> The term "capture" is overloaded in the method names to mean both the
> recording of all the samples and the recording of a single sample. Can we use
> something like "record sample" for the latter case to be consistent with the
> NativeStackSampler? e.g. this becomes something like RecordSampleForCapture.

Done, though shortened to just Begin/End/PerformRecording.

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:344:
scoped_refptr<SingleThreadTaskRunner> runner = task_runner();
On 2016/12/09 21:45:01, Mike Wittman wrote:
> Same issue with task_runner() here.

Done.

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:471:
TimeDelta::FromSeconds(kMinimumThreadRunTimeSeconds));
On 2016/12/13 18:16:41, Mike Wittman wrote:
> On 2016/12/13 16:08:11, bcwhite wrote:
> > On 2016/12/10 00:24:23, Mike Wittman wrote:
> > > On 2016/12/09 23:38:30, bcwhite wrote:
> > > > On 2016/12/09 21:45:01, Mike Wittman wrote:
> > > > > My understanding is that the message loop just waits if no tasks are
> > > > > present. I believe it must be forcibly quit or its thread shut down
> > > > > to terminate it.
> > > > 
> > > > Correct.  My idea is to add the ability for it to self-destruct when
> > > > "empty" (which is not the same thing as "idle").
> > > 
> > > I think that will be confusing to readers since it will operate
> > > differently than all the other message loops in the application. Couldn't
> > > this be addressed by posting a delayed quit task when the number of
> > > captures drops to zero (and canceling the task if the number of captures
> > > becomes non-zero)?
> > 
> > Message loops are already RunForever or RunUntilIdle.  Adding RunUntilEmpty
> > seems a natural (and generally useful) extension.
> 
> RunForever is pretty much the only mode that's used in Chrome itself;
> RunUntilIdle is used almost exclusively for testing. I believe RunUntilIdle is
> generally considered an anti-pattern in production code because of its action
> at a distance properties -- anyone else in the system (including the OS) can
> unintentionally keep the message loop alive by posting messages to it. There's
> only a dozen instances of RunUntilIdle in actual Chrome code, all of which are
> in highly constrained scenarios:
> https://cs.chromium.org/search/?q=rununtilidle%5C(%5C);+file:%5C.cc$+-file:te...
> 
> RunUntilEmpty will be subject to the same issues, I think.
> 
> > Posting a delayed quit is still a race condition because a new task could
> > get posted from another thread just as the quit starts executing.
> > 
> > It's possible that's a problem no matter what the solution.  Even
> > RunUntilEmpty may have that issue -- I'd have to investigate further.  There
> > may need to be some sort of atomic operation, such as a simple counter, no
> > matter what.
> > 
> > Given the variations, I think it would be best to leave it as "run forever"
> > in this CL and do the quit-when-idle as a follow-up CL.
> 
> Deferring the quit-when-idle behavior to a follow-on CL SGTM. I suspect there
> will be more than enough complexity to deal with just implementing the "run
> forever" mode.

Acknowledged.

Mike Wittman

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sampling_profiler.cc File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sampling_profiler.cc#newcode265 base/profiler/stack_sampling_profiler.cc:265: void PerformCapture(ActiveCapture* capture); On 2016/12/15 18:07:50, bcwhite wrote: > ...

4 years ago (2016-12-15 20:37:53 UTC) #30

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:265: void PerformCapture(ActiveCapture*
capture);
On 2016/12/15 18:07:50, bcwhite wrote:
> On 2016/12/09 21:45:02, Mike Wittman wrote:
> > The term "capture" is overloaded in the method names to mean both the
> > recording of all the samples and the recording of a single sample. Can we
> > use something like "record sample" for the latter case to be consistent with
> > the NativeStackSampler? e.g. this becomes something like
> > RecordSampleForCapture.
> 
> Done, though shortened to just Begin/End/PerformRecording.

This still has the same issue: "recording" is used to refer to both the
recording of one sample and all the samples. It's also not clear what the
relationship between "capture", "recording", and "collection" is; all three
terms are used variously in code and comments.

Can we regularize all this terminology? My suggestion:
- use "record" or "record sample" to refer to the recording of one stack/sample
- use "collection" to refer to the collection of all the samples for one
request.

With that, these functions become BeginCollection, EndCollection, RecordSample.
ActiveCapture becomes ActiveCollection, or better, CollectionContext since that
makes clear it's strictly state associated with a collection and not behavior.

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:285: std::map<int,
WeakPtr<ActiveCapture>> active_captures_;
I'm not seeing the benefit of using WeakPtr<ActiveCapture> here, rather than
unique_ptr<ActiveCapture>, and delete by erasing the id. Can you explain?

Seems like unique_ptr could avoid the whole WeakPtr machinery, make it much
easier to reason about ownership, and reduce the amount of state in the system.
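
To make the suggestion concrete, a rough sketch of what map-owned contexts could look like. CollectionContext here is a simplified, hypothetical stand-in, and the ActiveCollections wrapper with its Add/Remove names is invented for illustration rather than taken from the CL.

```cpp
#include <cstddef>
#include <map>
#include <memory>

// Hypothetical, simplified stand-in for the per-collection state.
struct CollectionContext {
  explicit CollectionContext(int id) : collection_id(id) {}
  int collection_id;
  int samples_recorded = 0;
};

// The map is the sole owner of every context; erasing an id destroys the
// context, so there is no separate weak-pointer bookkeeping to reason about.
class ActiveCollections {
 public:
  void Add(int id) {
    contexts_[id] = std::make_unique<CollectionContext>(id);
  }

  // Returns true if a context with this id existed and was destroyed.
  bool Remove(int id) { return contexts_.erase(id) == 1; }

  std::size_t size() const { return contexts_.size(); }

 private:
  std::map<int, std::unique_ptr<CollectionContext>> contexts_;
};
```

With this shape, "is the collection still active?" and "who owns the context?" have the same answer: whether the id is present in the map.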

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:321: DCHECK(task_runner_);
Remove this DCHECK? Seems like it's just verifying Thread's documented behavior
at this point.

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:327: DCHECK(success);
Remove this one as well? I don't think this will ever fail, and if it does, it's
an issue internal to the message loop/task runner.

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:365: }
Why not erase the capture from active_captures_ at the end of this function?

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:400: bool success =
task_runner()->PostDelayedTask(
While the task_runner() accesses on the profiler thread don't need to be guarded
by the lock, that won't be at all obvious to the casual reader. Can we
encapsulate this subtlety within functions (e.g.
GetTaskRunner/GetOrCreateTaskRunner), handling the locking (if necessary) there?

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:404: DCHECK(success);
I think this can be removed for the same reasons as above.

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

4 years ago (2016-12-16 01:15:05 UTC) #31

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/80001

4 years ago (2016-12-16 01:15:56 UTC) #32

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years ago (2016-12-16 01:25:42 UTC) #33

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: ios-device on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-device/builds/124024) ios-device-xcode-clang on ...

4 years ago (2016-12-16 01:25:43 UTC) #34

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

4 years ago (2016-12-21 15:31:01 UTC) #35

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/100001

4 years ago (2016-12-21 15:31:26 UTC) #36

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years ago (2016-12-21 15:39:54 UTC) #37

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: ios-device on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-device/builds/126668)

4 years ago (2016-12-21 15:39:55 UTC) #38

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

4 years ago (2016-12-21 16:08:55 UTC) #39

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/120001

4 years ago (2016-12-21 16:09:24 UTC) #40

bcwhite

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sampling_profiler.cc File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sampling_profiler.cc#newcode265 base/profiler/stack_sampling_profiler.cc:265: void PerformCapture(ActiveCapture* capture); On 2016/12/15 20:37:53, Mike Wittman wrote: ...

4 years ago (2016-12-21 16:39:11 UTC) #42

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:265: void PerformCapture(ActiveCapture*
capture);
On 2016/12/15 20:37:53, Mike Wittman wrote:
> On 2016/12/15 18:07:50, bcwhite wrote:
> > On 2016/12/09 21:45:02, Mike Wittman wrote:
> > > The term "capture" is overloaded in the method names to mean both the
> > > recording of all the samples and the recording of a single sample. Can we
> > > use something like "record sample" for the latter case to be consistent
> > > with the NativeStackSampler? e.g. this becomes something like
> > > RecordSampleForCapture.
> > 
> > Done, though shortened to just Begin/End/PerformRecording.
> 
> This still has the same issue: "recording" is used to refer to both the
> recording of one sample and all the samples. It's also not clear what the
> relationship between "capture", "recording", and "collection" is; all three
> terms are used variously in code and comments.
> 
> Can we regularize all this terminology? My suggestion:
> - use "record" or "record sample" to refer to the recording of one
> stack/sample
> - use "collection" to refer to the collection of all the samples for one
> request.
> 
> With that, these functions become BeginCollection, EndCollection,
> RecordSample. ActiveCapture becomes ActiveCollection, or better,
> CollectionContext since that makes clear it's strictly state associated with a
> collection and not behavior.

Done.

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:285: std::map<int,
WeakPtr<ActiveCapture>> active_captures_;
On 2016/12/15 20:37:53, Mike Wittman wrote:
> I'm not seeing the benefit of using WeakPtr<ActiveCapture> here, rather than
> unique_ptr<ActiveCapture>, and delete by erasing the id. Can you explain?
> 
> Seems like unique_ptr could avoid the whole WeakPtr machinery, make it much
> easier to reason about ownership, and reduce the amount of state in the
> system.

The ownership was with the posted tasks so other pointers needed to be weak. 
But it didn't work out like I was thinking so went another way.  Since the
single instance of this class never gets destructed, ownership can stay in this
map and raw pointers passed to posted tasks.

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:321: DCHECK(task_runner_);
On 2016/12/15 20:37:53, Mike Wittman wrote:
> Remove this DCHECK? Seems like it's just verifying Thread's documented
> behavior at this point.

Done.

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:327: DCHECK(success);
On 2016/12/15 20:37:53, Mike Wittman wrote:
> Remove this one as well? I don't think this will ever fail, and if it does,
> it's an issue internal to the message loop/task runner.

Done.

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:365: }
On 2016/12/15 20:37:53, Mike Wittman wrote:
> Why not erase the capture from active_captures_ at the end of this function?

Done.

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:400: bool success =
task_runner()->PostDelayedTask(
On 2016/12/15 20:37:53, Mike Wittman wrote:
> While the task_runner() accesses on the profiler thread don't need to be
> guarded by the lock, that won't be at all obvious to the casual reader. Can we
> encapsulate this subtlety within functions (e.g.
> GetTaskRunner/GetOrCreateTaskRunner), handling the locking (if necessary)
> there?

Such a method would have to fetch the current thread-id and compare it to the id
of the sampling thread to know whether it needs to use the (lock-protected)
member variable or call Thread::task_runner().  Unfortunately, getting the
current thread-id can be a system call which means we probably shouldn't do it
unless necessary.

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:404: DCHECK(success);
On 2016/12/15 20:37:53, Mike Wittman wrote:
> I think this can be removed for the same reasons as above.

Done.

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years ago (2016-12-21 18:28:12 UTC) #43

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: win_chromium_rel_ng on master.tryserver.chromium.win (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_rel_ng/builds/354149)

4 years ago (2016-12-21 18:28:13 UTC) #44

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

4 years ago (2016-12-21 18:38:33 UTC) #45

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/140001

4 years ago (2016-12-21 18:39:13 UTC) #46

Mike Wittman

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sampling_profiler.cc File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sampling_profiler.cc#newcode285 base/profiler/stack_sampling_profiler.cc:285: std::map<int, WeakPtr<ActiveCapture>> active_captures_; On 2016/12/21 16:39:10, bcwhite wrote: > ...

4 years ago (2016-12-21 19:38:41 UTC) #47

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:285: std::map<int,
WeakPtr<ActiveCapture>> active_captures_;
On 2016/12/21 16:39:10, bcwhite wrote:
> On 2016/12/15 20:37:53, Mike Wittman wrote:
> > I'm not seeing the benefit of using WeakPtr<ActiveCapture> here, rather than
> > unique_ptr<ActiveCapture>, and delete by erasing the id. Can you explain?
> > 
> > Seems like unique_ptr could avoid the whole WeakPtr machinery, make it much
> > easier to reason about ownership, and reduce the amount of state in the
> > system.
> 
> The ownership was with the posted tasks so other pointers needed to be weak.
> But it didn't work out like I was thinking so went another way.  Since the
> single instance of this class never gets destructed, ownership can stay in
> this map and raw pointers passed to posted tasks.

Can we pass the id rather than the raw pointer? Paying the small overhead of
looking up the context in the map is IMHO well worth the benefit of not having
to consider whether there are lifetime issues between the context references in
the posted tasks and active_captures_.
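
A sketch of the id-passing idea, under stated assumptions: SamplingThreadSketch, its method names, and the std::queue of closures standing in for the task runner are all hypothetical, chosen only to show the lookup-by-id pattern.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <queue>
#include <utility>

// Hypothetical, simplified per-collection state.
struct CollectionContext {
  int samples_recorded = 0;
};

// Hypothetical stand-in for the sampling thread; a std::queue of closures
// plays the role of the task runner.
class SamplingThreadSketch {
 public:
  // Registers a new collection and returns its id.
  int AddCollection() {
    int id = next_id_++;
    active_collections_[id] = std::make_unique<CollectionContext>();
    return id;
  }

  // Stops a collection; tasks already queued for it become no-ops.
  void RemoveCollection(int id) { active_collections_.erase(id); }

  // The queued task captures only the id, never a pointer to the context.
  void PostRecordSample(int id) {
    tasks_.push([this, id] { RecordSample(id); });
  }

  void RunPendingTasks() {
    while (!tasks_.empty()) {
      std::function<void()> task = std::move(tasks_.front());
      tasks_.pop();
      task();
    }
  }

  int total_samples() const { return total_samples_; }

 private:
  void RecordSample(int id) {
    auto found = active_collections_.find(id);
    // Not found: the collection was stopped after this task was posted.
    if (found == active_collections_.end())
      return;
    ++found->second->samples_recorded;
    ++total_samples_;
  }

  int next_id_ = 0;
  int total_samples_ = 0;
  std::map<int, std::unique_ptr<CollectionContext>> active_collections_;
  std::queue<std::function<void()>> tasks_;
};
```

The cost is one map lookup per task; in exchange, a task that outlives its collection finds nothing and returns, rather than dereferencing a stale pointer.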

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:400: bool success =
task_runner()->PostDelayedTask(
On 2016/12/21 16:39:10, bcwhite wrote:
> On 2016/12/15 20:37:53, Mike Wittman wrote:
> > While the task_runner() accesses on the profiler thread don't need to be
> > guarded by the lock, that won't be at all obvious to the casual reader. Can
> > we encapsulate this subtlety within functions (e.g.
> > GetTaskRunner/GetOrCreateTaskRunner), handling the locking (if necessary)
> > there?
> 
> Such a method would have to fetch the current thread-id and compare it to the
> id of the sampling thread to know whether it needs to use the (lock-protected)
> member variable or call Thread::task_runner().  Unfortunately, getting the
> current thread-id can be a system call which means we probably shouldn't do it
> unless necessary.

Fetching the thread id on Windows is cheap: it's stored in the Thread
Environment Block, which is accessed via a segment register and doesn't require
a syscall. I took a look at glibc and it also does not need a syscall to get the
thread id.

https://codereview.chromium.org/2554123002/diff/120001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/120001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:187: bool UpdateNextSampleTime() {
structs should not have methods providing behavior; this function probably
should be moved out to SamplingThread.

https://codereview.chromium.org/2554123002/diff/120001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:269: static constexpr int
kMinimumThreadRunTimeSeconds = 60;
This can be removed.

https://codereview.chromium.org/2554123002/diff/120001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:338:
DCHECK(collection->native_sampler);
This DCHECK can be moved to CollectionContext::CollectionContext() and the
function removed. No need for an explicit Begin/Finish function pair if there's
nothing to do on begin.

https://codereview.chromium.org/2554123002/diff/120001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:344: collection->stopped = true;
The stopped state can be removed since it's now redundant to the presence of the
context in active_collections_.

https://codereview.chromium.org/2554123002/diff/120001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:446: DCHECK_EQ(0U,
active_collections_.size());
nit: DCHECK(active_collections_.empty());

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years ago (2016-12-21 19:50:27 UTC) #49

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

4 years ago (2016-12-22 15:31:25 UTC) #51

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/160001

4 years ago (2016-12-22 15:31:42 UTC) #52

bcwhite

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sampling_profiler.cc File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sampling_profiler.cc#newcode188 base/profiler/stack_sampling_profiler.cc:188: bool UpdateNextSampleTime() { > structs should not have methods ...

4 years ago (2016-12-22 16:12:10 UTC) #53

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:188: bool UpdateNextSampleTime() {
> structs should not have methods providing behavior; this function
> probably should be moved out to SamplingThread.

Really?  I've seen it many times and Alexei has in the past even requested
methods being added to structs if their actions are solely confined to the
data of those structs.

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:285: std::map<int,
WeakPtr<ActiveCapture>> active_captures_;
On 2016/12/21 19:38:41, Mike Wittman wrote:
> On 2016/12/21 16:39:10, bcwhite wrote:
> > On 2016/12/15 20:37:53, Mike Wittman wrote:
> > > I'm not seeing the benefit of using WeakPtr<ActiveCapture> here, rather
> > > than unique_ptr<ActiveCapture>, and delete by erasing the id. Can you
> > > explain?
> > > 
> > > Seems like unique_ptr could avoid the whole WeakPtr machinery, make it
> > > much easier to reason about ownership, and reduce the amount of state in
> > > the system.
> > 
> > The ownership was with the posted tasks so other pointers needed to be weak.
> > But it didn't work out like I was thinking so went another way.  Since the
> > single instance of this class never gets destructed, ownership can stay in
> > this map and raw pointers passed to posted tasks.
> 
> Can we pass the id rather than the raw pointer? Paying the small overhead of
> looking up the context in the map is IMHO well worth the benefit of not having
> to consider whether there are lifetime issues between the context references
> in the posted tasks and active_captures_.

Done.

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:400: bool success =
task_runner()->PostDelayedTask(
On 2016/12/21 19:38:41, Mike Wittman wrote:
> On 2016/12/21 16:39:10, bcwhite wrote:
> > On 2016/12/15 20:37:53, Mike Wittman wrote:
> > > While the task_runner() accesses on the profiler thread don't need to be
> > > guarded by the lock, that won't be at all obvious to the casual reader.
> > > Can we encapsulate this subtlety within functions (e.g.
> > > GetTaskRunner/GetOrCreateTaskRunner), handling the locking (if necessary)
> > > there?
> > 
> > Such a method would have to fetch the current thread-id and compare it to
> > the id of the sampling thread to know whether it needs to use the
> > (lock-protected) member variable or call Thread::task_runner().
> > Unfortunately, getting the current thread-id can be a system call which
> > means we probably shouldn't do it unless necessary.
> 
> Fetching the thread id on Windows is cheap: it's stored in the Thread
> Environment Block, which is accessed via a segment register and doesn't
> require a syscall. I took a look at glibc and it also does not need a syscall
> to get the thread id.

Linux does a direct syscall():
https://cs.chromium.org/chromium/src/base/threading/platform_thread_posix.cc?...

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years ago (2016-12-22 16:27:29 UTC) #54

Mike Wittman

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sampling_profiler.cc File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sampling_profiler.cc#newcode188 base/profiler/stack_sampling_profiler.cc:188: bool UpdateNextSampleTime() { On 2016/12/22 16:12:10, bcwhite wrote: > ...

4 years ago (2016-12-22 17:38:22 UTC) #56

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:188: bool UpdateNextSampleTime() {
On 2016/12/22 16:12:10, bcwhite wrote:
> > structs should not have methods providing behavior; this function
> > probably should be moved out to SamplingThread.
> 
> Really?  I've seen it many times and Alexei has in the past even requested
> methods being added to structs if their actions are solely confined to the
> data of those structs.

The style guide says structs should not have any functionality beyond
access/setting the data members:
https://engdoc.corp.google.com/eng/doc/devguide/cpp/styleguide.shtml?cl=head#...

I think this is the right thing to do regardless, so that all the logic dealing
with the context state is collocated. The relationship between sample and
params.samples_per_burst, for example, is defined across both this code and
SamplingThread::PerformRecording.

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:400: bool success =
task_runner()->PostDelayedTask(
On 2016/12/22 16:12:10, bcwhite wrote:
> On 2016/12/21 19:38:41, Mike Wittman wrote:
> > On 2016/12/21 16:39:10, bcwhite wrote:
> > > On 2016/12/15 20:37:53, Mike Wittman wrote:
> > > > While the task_runner() accesses on the profiler thread don't need to be
> > > > guarded by the lock, that won't be at all obvious to the casual reader.
> > > > Can we encapsulate this subtlety within functions (e.g.
> > > > GetTaskRunner/GetOrCreateTaskRunner), handling the locking (if
> > > > necessary) there?
> > > 
> > > Such a method would have to fetch the current thread-id and compare it to
> > > the id of the sampling thread to know whether it needs to use the
> > > (lock-protected) member variable or call Thread::task_runner().
> > > Unfortunately, getting the current thread-id can be a system call which
> > > means we probably shouldn't do it unless necessary.
> > 
> > Fetching the thread id on Windows is cheap: it's stored in the Thread
> > Environment Block, which is accessed via a segment register and doesn't
> > require a syscall. I took a look at glibc and it also does not need a
> > syscall to get the thread id.
> 
> Linux does a direct syscall():
> https://cs.chromium.org/chromium/src/base/threading/platform_thread_posix.cc?...

Ah, missed that the Linux implementation doesn't go through pthread_self().

Given that Linux syscall overhead is in the 10's to 100's of ns, and the task
runner likely will be accessed at most a handful of times every 100ms, I think
we can afford to pay this minimal overhead to make the code less tricky and more
robust to changes.
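
As a back-of-the-envelope check of that estimate, the arithmetic can be written out as below. The concrete figures are assumptions picked from the ranges mentioned above (500 ns per syscall, 5 task-runner accesses per 100 ms), not measurements.

```cpp
// Assumed figures, for illustration only.
constexpr double kSyscallNs = 500.0;            // within the "100's of ns" range
constexpr double kAccessesPerInterval = 5.0;    // "a handful" of accesses
constexpr double kIntervalNs = 100.0 * 1000.0 * 1000.0;  // 100 ms in ns

// Fraction of each 100 ms interval spent fetching the thread id.
constexpr double OverheadFraction() {
  return kSyscallNs * kAccessesPerInterval / kIntervalNs;
}

// 2500 ns out of 100,000,000 ns, i.e. well under one hundredth of a percent.
static_assert(OverheadFraction() < 1e-4, "overhead should be negligible");
```

Under these assumptions the thread-id lookups cost about 2.5 microseconds per 100 ms, or roughly 0.0025% of the interval.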

https://codereview.chromium.org/2554123002/diff/160001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/160001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:400: if (found ==
active_collections_.end())
It would be good to retain the comment here indicating that this situation can
happen when the collection was stopped.

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 11 months ago (2017-01-05 16:27:25 UTC) #57

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/180001

3 years, 11 months ago (2017-01-05 16:27:44 UTC) #58

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 11 months ago (2017-01-05 16:32:34 UTC) #59

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/200001

3 years, 11 months ago (2017-01-05 16:32:51 UTC) #60

bcwhite

3 years, 11 months ago (2017-01-05 16:35:58 UTC) #62

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:188: bool UpdateNextSampleTime() {
On 2016/12/22 17:38:22, Mike Wittman wrote:
> On 2016/12/22 16:12:10, bcwhite wrote:
> > > structs should not have methods providing behavior; this function
> > > probably should be moved out to SamplingThread.
> > 
> > Really?  I've seen it many times and Alexei has in the past even requested
> > methods being added to structs if their actions are solely confined to the
> > data of those structs.
> 
> The style guide says structs should not have any functionality beyond
> accessing/setting the data members:
> https://engdoc.corp.google.com/eng/doc/devguide/cpp/styleguide.shtml?cl=head#...
> 
> I think this is the right thing to do regardless, so that all the logic dealing
> with the context state is collocated. The relationship between sample and
> params.samples_per_burst, for example, is defined across both this code and
> SamplingThread::PerformRecording.

Done.
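
A rough sketch of the resulting shape, for readers following along: the struct carries only data and the stepping logic lives with the class that owns the collection. The field names, AdvanceToNextSample, and SamplingThreadSketch are hypothetical and simplified; the real UpdateNextSampleTime also deals with timing, which is omitted here.

```cpp
// Passive data only: no behavior beyond holding the members.
struct SamplingParamsSketch {
  int bursts = 1;
  int samples_per_burst = 1;
};

struct CollectionContextSketch {
  SamplingParamsSketch params;
  int burst = 0;   // index of the current burst
  int sample = 0;  // index of the current sample within the burst
};

class SamplingThreadSketch {
 public:
  // Advances the context to the next sample. Returns false once every
  // sample of every burst has been taken.
  static bool AdvanceToNextSample(CollectionContextSketch* context) {
    if (++context->sample < context->params.samples_per_burst)
      return true;
    context->sample = 0;
    return ++context->burst < context->params.bursts;
  }
};
```

Keeping the sample/samples_per_burst relationship in one place is the point; the struct stays a plain record.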

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:400: bool success =
task_runner()->PostDelayedTask(
On 2016/12/22 17:38:22, Mike Wittman wrote:
> On 2016/12/22 16:12:10, bcwhite wrote:
> > On 2016/12/21 19:38:41, Mike Wittman wrote:
> > > On 2016/12/21 16:39:10, bcwhite wrote:
> > > > On 2016/12/15 20:37:53, Mike Wittman wrote:
> > > > > While the task_runner() accesses on the profiler thread don't need to
be
> > > > guarded
> > > > > by the lock, that won't be at all obvious to the casual reader. Can we
> > > > > encapsulate this subtlety within functions (e.g.
> > > > > GetTaskRunner/GetOrCreateTaskRunner), handling the locking (if
> necessary)
> > > > there?
> > > > 
> > > > Such a method would have to fetch the current thread-id and compare it
to
> > the
> > > id
> > > > of the sampling thread to know whether it needs to use the
> (lock-protected)
> > > > member variable or call Thread::task_runner().  Unfortunately, getting
the
> > > > current thread-id can be a system call which means we probably shouldn't
> do
> > it
> > > > unless necessary.
> > > 
> > > Fetching the thread id on Windows is cheap: it's stored in the Thread
> > > Environment Block, which is accessed via a segment register and doesn't
> > require
> > > a syscall. I took a look at glibc and it also does not need a syscall to
get
> > the
> > > thread id.
> > 
> > Linux does a direct syscall():
> >
>
https://cs.chromium.org/chromium/src/base/threading/platform_thread_posix.cc?...
> 
> Ah, missed that the Linux implementation doesn't go through pthread_self().
> 
> Given that Linux syscall overhead is in the 10's to 100's of ns, and the task
> runner likely will be accessed at most a handful of times every 100ms, I think
> we can afford to pay this minimal overhead to make the code less tricky and
more
> robust to changes.

I started down this path but it ends up requiring the lock every access.  I
can't compare the current thread-id to the sampling thread's ID without it
waiting for that ID to be valid, which only happens after it has been started.

But I don't want to start it until needed and the only way to tell if it's needed
is to call IsRunning() or check the task_runner_ local variable to see if it's
set, both of which require a lock.

Getting the thread's ID does an event-wait so that'll need to be cached and
locked as well, though it can probably share the same lock as task_runner_.

Acquiring a lock isn't expensive but would be required with every sample and
it's not necessary when we already know we're running on the sampling thread.

I think a comment is the better way to make things obvious to the casual reader.

https://codereview.chromium.org/2554123002/diff/160001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/160001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:400: if (found ==
active_collections_.end())
On 2016/12/22 17:38:22, Mike Wittman wrote:
> It would be good to retain the comment here indicating that this situation can
> happen when the collection was stopped.

Done.

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 11 months ago (2017-01-05 17:26:32 UTC) #63

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

3 years, 11 months ago (2017-01-05 17:26:32 UTC) #64

Mike Wittman

3 years, 11 months ago (2017-01-05 21:08:39 UTC) #65

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:400: bool success =
task_runner()->PostDelayedTask(
On 2017/01/05 16:35:58, bcwhite wrote:
> On 2016/12/22 17:38:22, Mike Wittman wrote:
> > On 2016/12/22 16:12:10, bcwhite wrote:
> > > On 2016/12/21 19:38:41, Mike Wittman wrote:
> > > > On 2016/12/21 16:39:10, bcwhite wrote:
> > > > > On 2016/12/15 20:37:53, Mike Wittman wrote:
> > > > > > While the task_runner() accesses on the profiler thread don't need
to
> be
> > > > > guarded
> > > > > > by the lock, that won't be at all obvious to the casual reader. Can
we
> > > > > > encapsulate this subtlety within functions (e.g.
> > > > > > GetTaskRunner/GetOrCreateTaskRunner), handling the locking (if
> > necessary)
> > > > > there?
> > > > > 
> > > > > Such a method would have to fetch the current thread-id and compare it
> to
> > > the
> > > > id
> > > > > of the sampling thread to know whether it needs to use the
> > (lock-protected)
> > > > > member variable or call Thread::task_runner().  Unfortunately, getting
> the
> > > > > current thread-id can be a system call which means we probably
shouldn't
> > do
> > > it
> > > > > unless necessary.
> > > > 
> > > > Fetching the thread id on Windows is cheap: it's stored in the Thread
> > > > Environment Block, which is accessed via a segment register and doesn't
> > > require
> > > > a syscall. I took a look at glibc and it also does not need a syscall to
> get
> > > the
> > > > thread id.
> > > 
> > > Linux does a direct syscall():
> > >
> >
>
https://cs.chromium.org/chromium/src/base/threading/platform_thread_posix.cc?...
> > 
> > Ah, missed that the Linux implementation doesn't go through pthread_self().
> > 
> > Given that Linux syscall overhead is in the 10's to 100's of ns, and the
task
> > runner likely will be accessed at most a handful of times every 100ms, I
think
> > we can afford to pay this minimal overhead to make the code less tricky and
> more
> > robust to changes.
> 
> I started down this path but it ends up requiring the lock every access.  I
> can't compare the current thread-id to the sampling thread's ID without it
> waiting for that ID to be valid, which only happens after it has been started.
> 
> But I don't want to start it until needed and the only way to tell if its
needed
> is to call IsRunning() or check the task_runner_ local variable to see if it's
> set, both of which require a lock.
> 
> Getting the thread's ID does an event-wait so that'll need to be cached and
> locked as well, though it can probably share the same lock as task_runner_.
> 
> Acquiring a lock isn't expensive but would be required with every sample and
> it's not necessary when we already know we're running on the sampling thread.

Yeah, it's probably not worth going to the extent of acquiring the lock on the
profiler thread.

> I think a comment is the better way to make things obvious to the casual
reader.

I think it would be better to create and use explicit functions for getting the
task runner on either the sampling thread or on other threads (e.g.
GetTaskRunnerOnOtherThread GetOrCreateTaskRunnerOnOtherThread,
GetTaskRunnerOnOwnThread). That will still encapsulate the locking and will
force reviewers and developers to consider the appropriate method for getting
the task runner when making future changes. We should be able to DCHECK in these
functions to validate correct usage as well.
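
Roughly what I have in mind, as a sketch under simplifying assumptions: standard-library types stand in for base::Lock, base::Thread and the task runner, assert stands in for DCHECK, and BindToCurrentThread is a hypothetical hook that the sampling thread would call when it starts. None of this is the CL's actual code.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <thread>

struct TaskRunnerSketch {};  // stand-in for the real task runner type

class SamplingThreadAccessors {
 public:
  // For threads other than the sampling thread. Returns nullptr if the
  // sampling thread was never set up; never creates anything.
  std::shared_ptr<TaskRunnerSketch> GetTaskRunnerOnOtherThread() {
    std::lock_guard<std::mutex> guard(lock_);
    assert(std::this_thread::get_id() != sampling_thread_id_);
    return task_runner_;
  }

  // For threads other than the sampling thread. Creates the runner on first
  // use (real code would start the sampling thread at this point).
  std::shared_ptr<TaskRunnerSketch> GetOrCreateTaskRunnerOnOtherThread() {
    std::lock_guard<std::mutex> guard(lock_);
    assert(std::this_thread::get_id() != sampling_thread_id_);
    if (!task_runner_)
      task_runner_ = std::make_shared<TaskRunnerSketch>();
    return task_runner_;
  }

  // For the sampling thread only. No lock is taken: the runner was set
  // before this thread started and is not replaced while it runs.
  std::shared_ptr<TaskRunnerSketch> GetTaskRunnerOnOwnThread() {
    assert(std::this_thread::get_id() == sampling_thread_id_);
    return task_runner_;
  }

  // Hypothetical hook: the first thing the sampling thread would call.
  void BindToCurrentThread() {
    std::lock_guard<std::mutex> guard(lock_);
    sampling_thread_id_ = std::this_thread::get_id();
  }

 private:
  std::mutex lock_;
  std::shared_ptr<TaskRunnerSketch> task_runner_;
  std::thread::id sampling_thread_id_;  // default value matches no thread
};
```

The names make the calling context part of the interface, and the asserts turn a misuse into an immediate failure in debug builds.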

bcwhite

3 years, 11 months ago (2017-01-05 22:04:22 UTC) #66

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:400: bool success =
task_runner()->PostDelayedTask(
On 2017/01/05 21:08:39, Mike Wittman wrote:
> On 2017/01/05 16:35:58, bcwhite wrote:
> > On 2016/12/22 17:38:22, Mike Wittman wrote:
> > > On 2016/12/22 16:12:10, bcwhite wrote:
> > > > On 2016/12/21 19:38:41, Mike Wittman wrote:
> > > > > On 2016/12/21 16:39:10, bcwhite wrote:
> > > > > > On 2016/12/15 20:37:53, Mike Wittman wrote:
> > > > > > > While the task_runner() accesses on the profiler thread don't need
> to
> > be
> > > > > > guarded
> > > > > > > by the lock, that won't be at all obvious to the casual reader.
Can
> we
> > > > > > > encapsulate this subtlety within functions (e.g.
> > > > > > > GetTaskRunner/GetOrCreateTaskRunner), handling the locking (if
> > > necessary)
> > > > > > there?
> > > > > > 
> > > > > > Such a method would have to fetch the current thread-id and compare
it
> > to
> > > > the
> > > > > id
> > > > > > of the sampling thread to know whether it needs to use the
> > > (lock-protected)
> > > > > > member variable or call Thread::task_runner().  Unfortunately,
getting
> > the
> > > > > > current thread-id can be a system call which means we probably
> shouldn't
> > > do
> > > > it
> > > > > > unless necessary.
> > > > > 
> > > > > Fetching the thread id on Windows is cheap: it's stored in the Thread
> > > > > Environment Block, which is accessed via a segment register and
doesn't
> > > > require
> > > > > a syscall. I took a look at glibc and it also does not need a syscall
to
> > get
> > > > the
> > > > > thread id.
> > > > 
> > > > Linux does a direct syscall():
> > > >
> > >
> >
>
https://cs.chromium.org/chromium/src/base/threading/platform_thread_posix.cc?...
> > > 
> > > Ah, missed that the Linux implementation doesn't go through
pthread_self().
> > > 
> > > Given that Linux syscall overhead is in the 10's to 100's of ns, and the
> task
> > > runner likely will be accessed at most a handful of times every 100ms, I
> think
> > > we can afford to pay this minimal overhead to make the code less tricky
and
> > more
> > > robust to changes.
> > 
> > I started down this path but it ends up requiring the lock every access.  I
> > can't compare the current thread-id to the sampling thread's ID without it
> > waiting for that ID to be valid, which only happens after it has been
started.
> > 
> > But I don't want to start it until needed and the only way to tell if its
> needed
> > is to call IsRunning() or check the task_runner_ local variable to see if
it's
> > set, both of which require a lock.
> > 
> > Getting the thread's ID does an event-wait so that'll need to be cached and
> > locked as well, though it can probably share the same lock as task_runner_.
> > 
> > Acquiring a lock isn't expensive but would be required with every sample and
> > it's not necessary when we already know we're running on the sampling
thread.
> 
> Yeah, it's probably not worth going to the extent of acquiring the lock on the
> profiler thread.
> 
> > I think a comment is the better way to make things obvious to the casual
> reader.
> 
> I think it would be better to create and use explicit functions for getting
the
> task runner on either the sampling thread or on other threads (e.g.
> GetTaskRunnerOnOtherThread GetOrCreateTaskRunnerOnOtherThread,
> GetTaskRunnerOnOwnThread). That will still encapsulate the locking and will
> force reviewers and developers to consider the appropriate method for getting
> the task runner when making future changes. We should be able to DCHECK in
these
> functions to validate correct usage as well.

Trying this but there are issues.

GetOrCreate isn't enough because Stop() needs to know the value without creating
-- we don't want to start the sampling thread there.  It could access
task_runner_ directly (while locked) as it does now but that means multiple
methods accessing task_runner_ which is what creating the method was supposed to
avoid.

If I leave the "create" part in Add() then it would be one to access
task_runner_ directly.

Given that only two methods access task_runner_ currently, there's no win here.

Similarly, the GetFromSamplingThread() method ends up just a wrapper around
Thread::task_runner() since the desired DCHECK is already in
Thread::task_runner().

Mike Wittman

3 years, 11 months ago (2017-01-05 23:30:32 UTC) #67

On 2017/01/05 22:04:22, bcwhite wrote:
>
https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
> File base/profiler/stack_sampling_profiler.cc (right):
> 
>
https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
> base/profiler/stack_sampling_profiler.cc:400: bool success =
> task_runner()->PostDelayedTask(
> On 2017/01/05 21:08:39, Mike Wittman wrote:
> > On 2017/01/05 16:35:58, bcwhite wrote:
> > > On 2016/12/22 17:38:22, Mike Wittman wrote:
> > > > On 2016/12/22 16:12:10, bcwhite wrote:
> > > > > On 2016/12/21 19:38:41, Mike Wittman wrote:
> > > > > > On 2016/12/21 16:39:10, bcwhite wrote:
> > > > > > > On 2016/12/15 20:37:53, Mike Wittman wrote:
> > > > > > > > While the task_runner() accesses on the profiler thread don't
need
> > to
> > > be
> > > > > > > guarded
> > > > > > > > by the lock, that won't be at all obvious to the casual reader.
> Can
> > we
> > > > > > > > encapsulate this subtlety within functions (e.g.
> > > > > > > > GetTaskRunner/GetOrCreateTaskRunner), handling the locking (if
> > > > necessary)
> > > > > > > there?
> > > > > > > 
> > > > > > > Such a method would have to fetch the current thread-id and
compare
> it
> > > to
> > > > > the
> > > > > > id
> > > > > > > of the sampling thread to know whether it needs to use the
> > > > (lock-protected)
> > > > > > > member variable or call Thread::task_runner().  Unfortunately,
> getting
> > > the
> > > > > > > current thread-id can be a system call which means we probably
> > shouldn't
> > > > do
> > > > > it
> > > > > > > unless necessary.
> > > > > > 
> > > > > > Fetching the thread id on Windows is cheap: it's stored in the
Thread
> > > > > > Environment Block, which is accessed via a segment register and
> doesn't
> > > > > require
> > > > > > a syscall. I took a look at glibc and it also does not need a
syscall
> to
> > > get
> > > > > the
> > > > > > thread id.
> > > > > 
> > > > > Linux does a direct syscall():
> > > > >
> > > >
> > >
> >
>
https://cs.chromium.org/chromium/src/base/threading/platform_thread_posix.cc?...
> > > > 
> > > > Ah, missed that the Linux implementation doesn't go through
> pthread_self().
> > > > 
> > > > Given that Linux syscall overhead is in the 10's to 100's of ns, and the
> > task
> > > > runner likely will be accessed at most a handful of times every 100ms, I
> > think
> > > > we can afford to pay this minimal overhead to make the code less tricky
> and
> > > more
> > > > robust to changes.
> > > 
> > > I started down this path but it ends up requiring the lock every access. 
I
> > > can't compare the current thread-id to the sampling thread's ID without it
> > > waiting for that ID to be valid, which only happens after it has been
> started.
> > > 
> > > But I don't want to start it until needed and the only way to tell if its
> > needed
> > > is to call IsRunning() or check the task_runner_ local variable to see if
> it's
> > > set, both of which require a lock.
> > > 
> > > Getting the thread's ID does an event-wait so that'll need to be cached
and
> > > locked as well, though it can probably share the same lock as
task_runner_.
> > > 
> > > Acquiring a lock isn't expensive but would be required with every sample
and
> > > it's not necessary when we already know we're running on the sampling
> thread.
> > 
> > Yeah, it's probably not worth going to the extent of acquiring the lock on
the
> > profiler thread.
> > 
> > > I think a comment is the better way to make things obvious to the casual
> > reader.
> > 
> > I think it would be better to create and use explicit functions for getting
> the
> > task runner on either the sampling thread or on other threads (e.g.
> > GetTaskRunnerOnOtherThread GetOrCreateTaskRunnerOnOtherThread,
> > GetTaskRunnerOnOwnThread). That will still encapsulate the locking and will
> > force reviewers and developers to consider the appropriate method for
getting
> > the task runner when making future changes. We should be able to DCHECK in
> these
> > functions to validate correct usage as well.
> 
> Trying this but there are issues.
> 
> GetOrCreate isn't enough because Stop() needs to know the value without
creating
> -- we don't want to start the sampling thread there.  It could access
> task_runner_ directly (while locked) while it does now but that means multiple
> methods accessing task_runner_ which is what creating the method was supposed
to
> avoid.

I'm proposing two separate functions for those cases:
GetOrCreateTaskRunnerOnOtherThread() called by Add(), and
GetTaskRunnerOnOtherThread() called by Stop(). (With a third function
GetTaskRunnerOnOwnThread() called by StartCollectionTask() and
PerformCollectionTask().)

> If I leave the "create" part in Add() then it would be one to access
> task_runner_ directly.
> 
> Given that only two methods access task_runner_ currently, there's no win
here.

The win here is in terms of code readability and maintainability. In particular:

- it makes the mechanism for accessing the task runner and the associated
constraints involved self-documenting in the code itself. This avoids the issue
of the comments getting out of sync with the code. It also reduces the burden on
future developers and reviewers for determining how to use the task runner
correctly in new code: one can simply look at the class interface rather than
dig through unrelated functions or base class headers to understand the
appropriate constraints.

- it encapsulates the locking in the smallest possible scope, making it obvious
exactly which operations need to be protected by the lock. As it is now, it
would not be clear to someone unfamiliar with the code whether the PostTask
calls in Add() and Stop() require the lock be held.

- it encapsulates the subtlety around the task runner access in functions
dedicated to that task. Assuming the functions are invoked correctly according
to their names, a reader of the code could verify that this subtle behavior is
correct without having to read through all the code.

If someone introduces new code accessing the task runner the wrong way, the
consequence will be non-deterministic failures whose cause will be
extraordinarily difficult to track down. So it's worth additional effort and
complexity up-front to avoid this (this seems to be the general philosophy to
multithreading issues across Chrome).
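
To make the lock-scope point concrete, a sketch of how Add() and Stop() might read with the accessors in place. FakeTaskRunner, ProfilerSketch, the empty posted tasks, and PendingForTesting are hypothetical; the only thing being illustrated is that the lock is held just long enough to copy the runner pointer, and the post happens outside it.

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for a task runner; it only queues the tasks.
class FakeTaskRunner {
 public:
  void PostTask(std::function<void()> task) {
    std::lock_guard<std::mutex> guard(lock_);
    tasks_.push_back(std::move(task));
  }
  std::size_t pending() {
    std::lock_guard<std::mutex> guard(lock_);
    return tasks_.size();
  }

 private:
  std::mutex lock_;
  std::vector<std::function<void()>> tasks_;
};

class ProfilerSketch {
 public:
  // Starts (in this sketch: creates) the runner if needed, then posts.
  void Add() {
    std::shared_ptr<FakeTaskRunner> runner = GetOrCreateTaskRunner();
    runner->PostTask([] {});  // posted without holding lock_
  }

  // Never creates the runner. Returns false if nothing was ever started.
  bool Stop() {
    std::shared_ptr<FakeTaskRunner> runner = GetTaskRunner();
    if (!runner)
      return false;
    runner->PostTask([] {});  // posted without holding lock_
    return true;
  }

  std::size_t PendingForTesting() {
    std::shared_ptr<FakeTaskRunner> runner = GetTaskRunner();
    return runner ? runner->pending() : 0;
  }

 private:
  // The lock covers only the read of task_runner_.
  std::shared_ptr<FakeTaskRunner> GetTaskRunner() {
    std::lock_guard<std::mutex> guard(lock_);
    return task_runner_;
  }

  // The lock covers only the check-and-create of task_runner_.
  std::shared_ptr<FakeTaskRunner> GetOrCreateTaskRunner() {
    std::lock_guard<std::mutex> guard(lock_);
    if (!task_runner_)
      task_runner_ = std::make_shared<FakeTaskRunner>();
    return task_runner_;
  }

  std::mutex lock_;
  std::shared_ptr<FakeTaskRunner> task_runner_;
};
```

With this shape a reader can see from Add() and Stop() alone that no lock is held across the PostTask call.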

> Similarly, the GetFromSamplingThread() method ends up just a wrapper around
> Thread::task_runner() since the desired DCHECK is already in
> Thread::task_runner().

That DCHECK appears to be verifying a looser condition than what we care about.
In particular it looks like it will succeed if the message loop is running,
regardless of which thread is invoking the function.

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 11 months ago (2017-01-06 13:58:30 UTC) #68

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/220001

3 years, 11 months ago (2017-01-06 13:58:46 UTC) #69

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 11 months ago (2017-01-06 15:10:03 UTC) #70

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

3 years, 11 months ago (2017-01-06 15:10:04 UTC) #71

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 11 months ago (2017-01-06 15:24:48 UTC) #72

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/240001

3 years, 11 months ago (2017-01-06 15:25:07 UTC) #73

bcwhite

3 years, 11 months ago (2017-01-06 15:32:59 UTC) #74

> > > I think it would be better to create and use explicit functions for
getting
> > the
> > > task runner on either the sampling thread or on other threads (e.g.
> > > GetTaskRunnerOnOtherThread GetOrCreateTaskRunnerOnOtherThread,
> > > GetTaskRunnerOnOwnThread). That will still encapsulate the locking and
will
> > > force reviewers and developers to consider the appropriate method for
> getting
> > > the task runner when making future changes. We should be able to DCHECK in
> > these
> > > functions to validate correct usage as well.
> > 
> > Trying this but there are issues.
> > 
> > GetOrCreate isn't enough because Stop() needs to know the value without
> creating
> > -- we don't want to start the sampling thread there.  It could access
> > task_runner_ directly (while locked) while it does now but that means
multiple
> > methods accessing task_runner_ which is what creating the method was
supposed
> to
> > avoid.
> 
> I'm proposing two separate functions for those cases:
> GetOrCreateTaskRunnerOnOtherThread() called by Add(), and
> GetTaskRunnerOnOtherThread() called by Stop(). (With a third function
> GetTaskRunnerOnOwnThread() called by StartCollectionTask() and
> PerformCollectionTask().)
> 
> > If I leave the "create" part in Add() then it would be one to access
> > task_runner_ directly.
> > 
> > Given that only two methods access task_runner_ currently, there's no win
> here.
> 
> The win here is in terms of code readability and maintainability. In
particular:
> 
> - it makes the mechanism for accessing the task runner and the associated
> constraints involved self-documenting in the code itself. This avoids the
issue
> of the comments getting out of sync with the code. It also reduces the burden
on
> future developers and reviewers for determining how to use the task runner
> correctly in new code: one can simply look at the class interface rather than
> dig through unrelated functions or base class headers to understand the
> appropriate constraints.
> 
> - it encapsulates the locking in the smallest possible scope, making it
obvious
> exactly which operations need to be protected by the lock. As it is now, it
> would not be clear to someone unfamiliar with the code whether the PostTask
> calls in Add() and Stop() require the lock be held.
> 
> - it encapsulates the subtlety around the task runner access in functions
> dedicated to that task. Assuming the functions are invoked correctly according
> to their names, a reader of the code could verify that this subtle behavior is
> correct without having to read through all the code.
> 
> If someone introduces new code accessing the task runner the wrong way, the
> consequence will be non-deterministic failures whose cause will be
> extraordinarily difficult to track down. So it's worth additional effort and
> complexity up-front to avoid this (this seems to be the general philosophy to
> multithreading issues across Chrome).

I understand what you're saying, but I find it harder to read with helper
methods than with full sentence comments.
But done.

Mike Wittman

3 years, 11 months ago (2017-01-06 16:02:40 UTC) #75

On 2017/01/06 15:32:59, bcwhite wrote:
> > > > I think it would be better to create and use explicit functions for
> getting
> > > the
> > > > task runner on either the sampling thread or on other threads (e.g.
> > > > GetTaskRunnerOnOtherThread GetOrCreateTaskRunnerOnOtherThread,
> > > > GetTaskRunnerOnOwnThread). That will still encapsulate the locking and
> will
> > > > force reviewers and developers to consider the appropriate method for
> > getting
> > > > the task runner when making future changes. We should be able to DCHECK
in
> > > these
> > > > functions to validate correct usage as well.
> > > 
> > > Trying this but there are issues.
> > > 
> > > GetOrCreate isn't enough because Stop() needs to know the value without
> > creating
> > > -- we don't want to start the sampling thread there.  It could access
> > > task_runner_ directly (while locked) while it does now but that means
> multiple
> > > methods accessing task_runner_ which is what creating the method was
> supposed
> > to
> > > avoid.
> > 
> > I'm proposing two separate functions for those cases:
> > GetOrCreateTaskRunnerOnOtherThread() called by Add(), and
> > GetTaskRunnerOnOtherThread() called by Stop(). (With a third function
> > GetTaskRunnerOnOwnThread() called by StartCollectionTask() and
> > PerformCollectionTask().)
> > 
> > > If I leave the "create" part in Add() then it would be one to access
> > > task_runner_ directly.
> > > 
> > > Given that only two methods access task_runner_ currently, there's no win
> > here.
> > 
> > The win here is in terms of code readability and maintainability. In
> particular:
> > 
> > - it makes the mechanism for accessing the task runner and the associated
> > constraints involved self-documenting in the code itself. This avoids the
> issue
> > of the comments getting out of sync with the code. It also reduces the
burden
> on
> > future developers and reviewers for determining how to use the task runner
> > correctly in new code: one can simply look at the class interface rather
than
> > dig through unrelated functions or base class headers to understand the
> > appropriate constraints.
> > 
> > - it encapsulates the locking in the smallest possible scope, making it
> obvious
> > exactly which operations need to be protected by the lock. As it is now, it
> > would not be clear to someone unfamiliar with the code whether the PostTask
> > calls in Add() and Stop() require the lock be held.
> > 
> > - it encapsulates the subtlety around the task runner access in functions
> > dedicated to that task. Assuming the functions are invoked correctly
according
> > to their names, a reader of the code could verify that this subtle behavior
is
> > correct without having to read through all the code.
> > 
> > If someone introduces new code accessing the task runner the wrong way, the
> > consequence will be non-deterministic failures whose cause will be
> > extraordinarily difficult to track down. So it's worth additional effort and
> > complexity up-front to avoid this (this seems to be the general philosophy
to
> > multithreading issues across Chrome).
> 
> I understand what you're saying, but I find it harder to read with helper
> methods than with full sentence comments.
> But done.

Thanks.

https://codereview.chromium.org/2554123002/diff/240001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/240001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:232: // Get tas krunner that is usable
from the sampling thread itself.
nit: task runner

https://codereview.chromium.org/2554123002/diff/240001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:318: task_runner_ = task_runner();
nit: Thread::task_runner() to be explicitly clear where this function is coming
from

https://codereview.chromium.org/2554123002/diff/240001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:339:
StackSamplingProfiler::SamplingThread::GetTaskRunnerFromSamplingThread() {
How about GetTaskRunnerOnSamplingThread? GetTaskRunnerFromSamplingThread is
ambiguous since "SamplingThread" could refer to either the thread of execution
or the SamplingThread class itself.

https://codereview.chromium.org/2554123002/diff/240001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:344: return task_runner();
nit: Thread::task_runner() here also

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 11 months ago (2017-01-06 16:42:51 UTC) #76

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/260001

3 years, 11 months ago (2017-01-06 16:43:18 UTC) #77

bcwhite

https://codereview.chromium.org/2554123002/diff/240001/base/profiler/stack_sampling_profiler.cc File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/240001/base/profiler/stack_sampling_profiler.cc#newcode232 base/profiler/stack_sampling_profiler.cc:232: // Get tas krunner that is usable from the ...

3 years, 11 months ago (2017-01-06 16:43:52 UTC) #78

Mike Wittman

Thanks, at a high level I think this is looking good at this point. The ...

3 years, 11 months ago (2017-01-06 17:45:51 UTC) #79

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 11 months ago (2017-01-06 17:47:55 UTC) #80

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: win_chromium_x64_rel_ng on master.tryserver.chromium.win (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_x64_rel_ng/builds/343608)

3 years, 11 months ago (2017-01-06 17:47:58 UTC) #81

bcwhite

> 1. Comprehensive tests of the new functionality. I will be surprised if these > ...

3 years, 11 months ago (2017-01-06 20:47:57 UTC) #82

Mike Wittman

3 years, 11 months ago (2017-01-06 21:19:24 UTC) #83

On 2017/01/06 20:47:57, bcwhite wrote:
> >  1. Comprehensive tests of the new functionality. I will be surprised if
> > these doesn't flush out issues that neither of us has anticipated.
> 
> From your experience, what kinds of tests are required?  Is there something
> missing in the existing tests that wouldn't catch differences in the sample
> timing, or start/stop?
> 
> The most comprehensive test I can think of off hand would be two threads both
> with separate stacks of "dummy" methods.  Sampling both simultaneously should
> result in total times that are reasonably consistent.  Plus ensure that the
> stack samples themselves show only the methods for each specific thread.

The main things I think need testing are:
 - correctness when profiling with multiple threads
 - proper handling of various overlappings of Start()/Stop()/destroy/thread exit
events on the profiled thread with collection occurring/not occurring on the
profiler thread
 - proper handling of various interleavings of Start()/Stop()/destroy/thread
exit events on multiple profiled threads

Checking for reasonably consistent times between collections across two threads
should be done manually, but I'm not sure how easily this can be implemented in
an automated fashion that doesn't flake on a test slave under load. (We already
have some issues like this in the current tests: http://crbug.com/551939.)

> >  2. Handling of thread lifetime issues, particularly profiled threads
> > exiting while profiling is occurring, and correct behavior during
> > application shutdown. Plus tests for these.
> 
> That's something that is completely new, right?  The new code should behave
> the same as the old code.  In that case, I'd prefer to do it in a different CL.

No, this was handled before and is not handled in the current code. It
needs to be supported before this CL can go in, otherwise the profiler will very
likely crash on application shutdown.

The existing implementation stops the profiling then joins the profiler thread
on destruction of its profiler object, but that approach only works in a single
threaded implementation. I think we'll need a new approach for the multithreaded
implementation.
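
For reference, the stop-then-join pattern described above can be sketched
roughly as follows. This is only an illustration using the C++ standard
library: SingleTargetProfiler, SampleLoop and loop_exited are hypothetical
names, not the actual StackSamplingProfiler code, and the stack walk is reduced
to a placeholder.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical stand-in for the single-threaded design: each profiler object
// owns its own sampling thread and joins it on destruction.
class SingleTargetProfiler {
 public:
  // |loop_exited| is set by the sampling thread when its loop finishes; it
  // exists only so the effect of the join can be observed from outside.
  explicit SingleTargetProfiler(std::atomic<bool>* loop_exited)
      : loop_exited_(loop_exited), sampler_([this] { SampleLoop(); }) {
    assert(loop_exited_ != nullptr);
  }

  ~SingleTargetProfiler() {
    stop_requested_.store(true);  // Stop the profiling...
    sampler_.join();              // ...then join the sampling thread.
  }

 private:
  void SampleLoop() {
    while (!stop_requested_.load()) {
      // Placeholder for suspending the target thread and walking its stack.
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    loop_exited_->store(true);
  }

  std::atomic<bool>* loop_exited_;
  std::atomic<bool> stop_requested_{false};
  std::thread sampler_;  // Declared last so the other members exist first.
};
```

With one sampling thread per profiler object the destructor can own the join.
Once a single sampling thread serves several profilers, no individual
destructor can join it without affecting the others, which is presumably why a
different approach is needed.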

Mike Wittman

Two other things remaining to do also come to mind: Using a single stack buffer ...

3 years, 11 months ago (2017-01-07 02:29:18 UTC) #84

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 11 months ago (2017-01-12 21:25:17 UTC) #85

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/280001

3 years, 11 months ago (2017-01-12 21:26:14 UTC) #86

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 11 months ago (2017-01-12 23:15:28 UTC) #87

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: linux_chromium_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng/builds/370107)

3 years, 11 months ago (2017-01-12 23:15:29 UTC) #88

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 11 months ago (2017-01-13 14:33:59 UTC) #89

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/300001

3 years, 11 months ago (2017-01-13 14:34:18 UTC) #90

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 11 months ago (2017-01-13 15:22:29 UTC) #91

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: linux_chromium_chromeos_ozone_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_chromeos_ozone_rel_ng/builds/302778)

3 years, 11 months ago (2017-01-13 15:22:30 UTC) #92

bcwhite

> Using a single stack buffer for all captures. This needs to be implemented > ...

3 years, 11 months ago (2017-01-16 15:26:08 UTC) #94

bcwhite

> No, this was handled before before and is not handled in the current code. ...

3 years, 11 months ago (2017-01-16 16:09:52 UTC) #95

bcwhite

On 2017/01/16 16:09:52, bcwhite wrote: > > No, this was handled before before and is ...

3 years, 11 months ago (2017-01-16 20:08:19 UTC) #96

Mike Wittman

On 2017/01/16 20:08:19, bcwhite wrote: > On 2017/01/16 16:09:52, bcwhite wrote: > > > No, ...

3 years, 11 months ago (2017-01-17 17:01:21 UTC) #97

On 2017/01/16 20:08:19, bcwhite wrote:
> On 2017/01/16 16:09:52, bcwhite wrote:
> > > No, this was handled before and is not handled in the current code. It
> > > needs to be supported before this CL can go in, otherwise the profiler
> > > will very likely crash on application shutdown.
> > > 
> > > The existing implementation stops the profiling then joins the profiler
> > > thread on destruction of its profiler object, but that approach only
> > > works in a single threaded implementation. I think we'll need a new
> > > approach for the multithreaded implementation.
> > 
> > I think there's going to have to be a hook in the shutdown() code to do
> > this as I don't expect the objects calling Stop() necessarily know if
> > they're doing so because of a browser shutdown.  Trying to stop and join
> > the thread would be impossible because other sampling operations could be
> > ongoing.
> > 
> > As long as there are "async runner" samples supported, telling it to "join
> > when finished" wouldn't work because those async samples could take an
> > arbitrary amount of time.
> > 
> > I'm thinking of adding a call to a new Shutdown() method in
> > BrowserMainLoop::PreShutdown().  Seem reasonable?
> 
> Turns out there is a problem with this when the thread is started on-demand by
> whatever thread happens to want to do the sampling...  Stop() can only be
> called by whatever thread called Start().
> 
> While it's possible to use StopSoon() from a posted task and have the thread
> stop, which would be sufficient for the true shutdown case, there's no way to
> restart it until after Stop() is called.  That means that the same mechanism
> can't be used to halt the thread when it has nothing to do.
> 
> I'm wondering if a task_runner_ for the main UI thread could be given to it
> that is somehow used to start/stop the thread.
> 
> Or I abandon the Thread class and go back to PlatformThread::Delegate and
> manage my own message-loop... but that seems likely to run into the same
> complications.
> 
> Thoughts?

I'm not sure we actually need to stop the profiling thread pre-shutdown, as long
as it doesn't delay or otherwise adversely impact process shutdown.

I think the main issue is ensuring that the profiler does not attempt to profile
threads after they have exited, which would result in access violations if the
stack memory has been freed.

bcwhite

On 2017/01/17 17:01:21, Mike Wittman wrote: > On 2017/01/16 20:08:19, bcwhite wrote: > > On ...

3 years, 11 months ago (2017-01-17 17:07:41 UTC) #98

Shutdown isn't such a problem because I can do "StopSoon()" and let it go.  The
problem is that we also want this thread to stop and restart when necessary and
that turns out to be complicated.

Mike Wittman

On 2017/01/17 17:07:41, bcwhite wrote: > On 2017/01/17 17:01:21, Mike Wittman wrote: > > On ...

3 years, 11 months ago (2017-01-17 17:19:33 UTC) #99

Having the UI thread start/stop the profiler thread seems like a reasonable
fallback to me, if it's difficult or not possible to do so from arbitrary
threads due to the thread affinity restrictions.

Note that the profiled thread exit scenario needs to be handled independent of
process shutdown.

Mike Wittman

On 2017/01/16 15:26:08, bcwhite wrote: > > Using a single stack buffer for all captures. ...

3 years, 11 months ago (2017-01-17 17:37:40 UTC) #100

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 11 months ago (2017-01-25 21:41:52 UTC) #101

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/320001

3 years, 11 months ago (2017-01-25 21:42:57 UTC) #102

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 11 months ago (2017-01-25 22:01:37 UTC) #103

bcwhite

Finally got a working start/stop solution that supports browser shutdown and idle shutdown. Still need ...

3 years, 11 months ago (2017-01-25 22:02:17 UTC) #104

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/340001

3 years, 11 months ago (2017-01-25 22:02:38 UTC) #106

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 11 months ago (2017-01-25 23:33:46 UTC) #107

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: mac_chromium_rel_ng on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/376187)

3 years, 11 months ago (2017-01-25 23:33:47 UTC) #108

Mike Wittman

On 2017/01/25 22:02:17, bcwhite wrote: > Finally got a working start/stop solution that supports browser ...

3 years, 11 months ago (2017-01-26 02:25:01 UTC) #109

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 10 months ago (2017-01-30 19:58:55 UTC) #110

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/360001

3 years, 10 months ago (2017-01-30 19:59:23 UTC) #111

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 10 months ago (2017-01-30 21:22:02 UTC) #112

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: mac_chromium_rel_ng on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/378629)

3 years, 10 months ago (2017-01-30 21:22:03 UTC) #113

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 10 months ago (2017-01-31 14:26:44 UTC) #114

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/380001

3 years, 10 months ago (2017-01-31 14:26:59 UTC) #115

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 10 months ago (2017-01-31 15:53:40 UTC) #116

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: linux_chromium_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng/builds/381105)

3 years, 10 months ago (2017-01-31 15:53:41 UTC) #117

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 10 months ago (2017-01-31 17:56:16 UTC) #118

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/400001

3 years, 10 months ago (2017-01-31 17:56:37 UTC) #119

bcwhite

> A meta point: can we implement and review the support for the profiled thread ...

3 years, 10 months ago (2017-01-31 17:58:56 UTC) #121

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 10 months ago (2017-01-31 19:15:00 UTC) #122

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: mac_chromium_rel_ng on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/379174)

3 years, 10 months ago (2017-01-31 19:15:02 UTC) #123

Mike Wittman

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sampling_profiler.cc File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sampling_profiler.cc#newcode474 base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that the thread under ...

3 years, 10 months ago (2017-01-31 22:14:27 UTC) #124

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 10 months ago (2017-02-01 14:45:42 UTC) #125

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/420001

3 years, 10 months ago (2017-02-01 14:45:50 UTC) #126

bcwhite

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sampling_profiler.cc File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sampling_profiler.cc#newcode474 base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that the thread under ...

3 years, 10 months ago (2017-02-01 14:47:29 UTC) #127

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that
the thread under test is gone.
> I don't think relying on SuspendThread to error out on terminated threads is a
> viable mechanism for handling thread exit. Thread ids are reused on Windows so
> there's no guarantee that another thread won't have been started with the same
> id. I suspect we'll need some kind of formal synchronization between the
> target threads and the profiler thread to coordinate thread exit.

While reuse of thread-ids is possible, I don't think it's a concern:

1) The thread has to exit and the ID reused relatively quickly.
2) The presence of a foreign stack-frame in the data would be obvious and easily
dismissed.
3) It's non-trivial (at best) to have an outside, independent watcher learn when
a thread exits.

I believe it's not worth addressing this until it proves to be a real problem.

> Also, independent of that, an empty result here could happen for many other
> reasons -- the stack pointer pointing to a guard page for example.

Makes sense.  I'll make the native sampler record information from the last
sample attempt.

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:899:
PlatformThread::Sleep(TimeDelta::FromSeconds(3));
On 2017/01/31 22:14:27, Mike Wittman wrote:
> Coordinating threads via sleep will cause this test to be flaky when run under
> load. We should do proper coordination via WaitableEvents to guarantee the
> expected test behavior. I think this should be possible by adding calls into
> the TestDelegate within the profiler at appropriate coordination points, and
> supplying a TestDelegate implementation in the test that waits for the
> expected events (e.g. thread has terminated).

I'll look at that.
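
As a rough idea of what event-based coordination could look like in place of
the Sleep() call, here is a minimal sketch. It assumes nothing about the real
TestDelegate interface: Event is a small stand-in built on the standard library
(not base::WaitableEvent), and TestDelegate::OnTargetThreadFinishing and
WaitForTargetThreadExit are hypothetical names.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Minimal manual-reset event; a hypothetical stand-in for a waitable event.
class Event {
 public:
  void Signal() {
    std::lock_guard<std::mutex> lock(mutex_);
    signaled_ = true;
    cv_.notify_all();
  }

  void Wait() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return signaled_; });
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool signaled_ = false;
};

// Hypothetical test hook: the code under test calls into the delegate at a
// coordination point and the test blocks on the matching event.
struct TestDelegate {
  Event target_finishing;
  void OnTargetThreadFinishing() { target_finishing.Signal(); }
};

// Returns true once the target thread has reported that it is finishing.
// The wait is driven by the event rather than by a fixed delay.
bool WaitForTargetThreadExit() {
  TestDelegate delegate;
  bool target_ran = false;
  std::thread target([&] {
    target_ran = true;
    delegate.OnTargetThreadFinishing();  // Last thing the thread does.
  });
  delegate.target_finishing.Wait();  // Blocks until signaled; no Sleep().
  target.join();
  return target_ran;
}
```

The test then takes only as long as the thread actually needs, and it cannot
pass or fail depending on how loaded the bot is.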

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 10 months ago (2017-02-01 16:12:12 UTC) #128

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: mac_chromium_rel_ng on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/379786)

3 years, 10 months ago (2017-02-01 16:12:13 UTC) #129

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 10 months ago (2017-02-01 17:51:53 UTC) #130

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/440001

3 years, 10 months ago (2017-02-01 17:52:08 UTC) #131

bcwhite

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sampling_profiler_unittest.cc File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sampling_profiler_unittest.cc#newcode899 base/profiler/stack_sampling_profiler_unittest.cc:899: PlatformThread::Sleep(TimeDelta::FromSeconds(3)); On 2017/01/31 22:14:27, Mike Wittman wrote: > Coordinating ...

3 years, 10 months ago (2017-02-01 17:53:10 UTC) #132

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 10 months ago (2017-02-01 17:55:49 UTC) #133

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/460001

3 years, 10 months ago (2017-02-01 17:56:14 UTC) #134

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 10 months ago (2017-02-01 17:57:49 UTC) #135

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/480001

3 years, 10 months ago (2017-02-01 17:58:12 UTC) #136

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: ios-device on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-device/builds/145772) ios-device-xcode-clang on ...

3 years, 10 months ago (2017-02-01 17:59:46 UTC) #138

Mike Wittman

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sampling_profiler.cc File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sampling_profiler.cc#newcode474 base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that the thread under ...

3 years, 10 months ago (2017-02-01 17:59:54 UTC) #139

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that
the thread under test is gone.
On 2017/02/01 14:47:29, bcwhite wrote:
> > I don't think relying on SuspendThread to error out on terminated threads
> > is a viable mechanism for handling thread exit. Thread ids are reused on
> > Windows so there's no guarantee that another thread won't have been started
> > with the same id. I suspect we'll need some kind of formal synchronization
> > between the target threads and the profiler thread to coordinate thread
> > exit.
> 
> While reuse of thread-ids is possible, I don't think it's a concern:
> 
> 1) The thread has to exit and the ID reused relatively quickly.
> 2) The presence of a foreign stack-frame in the data would be obvious and
> easily dismissed.
> 3) It's non-trivial (at best) to have an outside, independent watcher learn
> when a thread exits.
> 
> I believe it's not worth addressing this until it proves to be a real problem.

I think this is a serious concern, and requires a solution that we have
confidence in up front. Since the profiler effectively controls the execution of
the entire rest of Chrome, it's imperative that it be as bulletproof as
possible. Avoiding non-deterministic failure modes is absolutely essential
because the resulting failures will be difficult to notice and next to
impossible to investigate effectively.

100ms is a pretty huge window in system execution terms. If the profiled thread
exits, hundreds if not thousands of thread creations could occur before the next
attempted sample, any of which could reuse the id.

If a thread in another process claims the id, then it's not clear what will
happen. Worst case scenario would be that we succeed in suspending the thread,
only to crash while trying to copy the stack. That would likely deadlock some
random innocent process on the system, a nasty scenario that we should be
avoiding at all costs.

If a thread in Chrome claims the id, then the profiler will happily continue
profiling the other thread, silently generating wrong data. There would be no
way to reliably detect this scenario in the data processing.

I don't think we need an independent thread watcher, since we can rely on the
StackSamplingProfiler's destructor being called before thread exit.

Straw man proposal: put a WaitableEvent "profiling_stopped" in the
CollectionContext. Thread a "profiler_destroying" flag through to the
RemoveCollectionTask from the StackSamplingProfiler destructor. After posting
the RemoveCollectionTask in the target thread, wait on the profiling_stopped
event. When executing the RemoveCollectionTask in the profiler thread, signal
the profiling_stopped event if the profiler_destroying flag is present.

I believe this would ensure the target thread has not exited until the profiler
is finished with it.
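
To make the straw man concrete, here is a minimal sketch of the handshake using
only the C++ standard library. Everything apart from the names taken from the
proposal (CollectionContext, RemoveCollectionTask, profiling_stopped,
profiler_destroying) is hypothetical: Event stands in for WaitableEvent,
SamplingThread is a toy task queue rather than the real profiler thread, and
StopAndWait represents what the StackSamplingProfiler destructor would do.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Minimal manual-reset event; a hypothetical stand-in for WaitableEvent.
class Event {
 public:
  void Signal() {
    std::lock_guard<std::mutex> lock(mutex_);
    signaled_ = true;
    cv_.notify_all();
  }
  void Wait() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return signaled_; });
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool signaled_ = false;
};

struct CollectionContext {
  Event profiling_stopped;
  bool sampling_active = true;  // Cleared on the profiler thread.
};

// Toy profiler thread: runs posted tasks in order; an empty task means quit.
class SamplingThread {
 public:
  SamplingThread() : thread_([this] { Run(); }) {}
  ~SamplingThread() {
    PostTask(std::function<void()>());
    thread_.join();
  }
  void PostTask(std::function<void()> task) {
    std::lock_guard<std::mutex> lock(mutex_);
    tasks_.push(std::move(task));
    cv_.notify_one();
  }

 private:
  void Run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !tasks_.empty(); });
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      if (!task) return;
      task();
    }
  }

  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  std::thread thread_;  // Declared last so the queue exists before Run().
};

// Runs on the profiler thread.
void RemoveCollectionTask(CollectionContext* context, bool profiler_destroying) {
  context->sampling_active = false;  // No further samples of this thread.
  if (profiler_destroying) context->profiling_stopped.Signal();
}

// Runs on the target thread, e.g. from the profiler's destructor.
void StopAndWait(SamplingThread* sampling_thread, CollectionContext* context) {
  sampling_thread->PostTask(
      [context] { RemoveCollectionTask(context, /*profiler_destroying=*/true); });
  context->profiling_stopped.Wait();  // Target cannot exit before this returns.
}

bool DemoHandshake() {
  CollectionContext context;       // Outlives the sampling thread below.
  SamplingThread sampling_thread;
  StopAndWait(&sampling_thread, &context);
  return !context.sampling_active;
}
```

In this sketch the context is declared before the sampling thread so that it is
still alive while the profiler thread signals it; a real implementation would
need an equivalent lifetime guarantee for the CollectionContext.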

bcwhite

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sampling_profiler.cc File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sampling_profiler.cc#newcode474 base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that the thread under ...

3 years, 10 months ago (2017-02-01 19:11:16 UTC) #141

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 10 months ago (2017-02-01 19:23:10 UTC) #142

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: mac_chromium_rel_ng on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/379894)

3 years, 10 months ago (2017-02-01 19:23:11 UTC) #143

Mike Wittman

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sampling_profiler.cc File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sampling_profiler.cc#newcode474 base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that the thread under ...

3 years, 10 months ago (2017-02-01 20:26:51 UTC) #144

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that
the thread under test is gone.
On 2017/02/01 19:11:16, bcwhite wrote:
> > I don't think we need an independent thread watcher, since we can rely on
> > the StackSamplingProfiler's destructor being called before thread exit.
> 
> I don't understand.  The StackSamplingProfiler lifetime is completely
> independent of any thread it might be sampling.

That's true, the current interface allows an arbitrary thread id to be supplied
for profiling. We'd need to restrict the profiler to working on the thread where
it's created to rely on this behavior.

This is probably a reasonable trade-off to make though, considering the
anticipated use cases in the new thread scheduler world. I think the profiler
will be used either for self profiling, or for profiling directed by the thread
scheduler. In the former case the StackSamplingProfiler would be allocated on
the thread's stack. In the latter case, the thread scheduler will know when
threads exit and can coordinate with the profiler internals via some
to-be-defined mechanism.

> > Straw man proposal: put a WaitableEvent "profiling_stopped" in the
> > CollectionContext. Thread a "profiler_destroying" flag through to the
> > RemoveCollectionTask from the StackSamplingProfiler destructor. After
> > posting the RemoveCollectionTask in the target thread, wait on the
> > profiling_stopped event. When executing the RemoveCollectionTask in the
> > profiler thread, signal the profiling_stopped event if the
> > profiler_destroying flag is present.
> > 
> > I believe this would ensure the target thread has not exited until the
> > profiler is finished with it.
> 
> I'm confused.  An independent thread can exit at any time, right?

Chrome threads can exit by quitting the message loop. I believe directly exiting
threads is not supported in Chrome because it doesn't run destructors or do
other necessary cleanup. In the case of threads managed by the thread scheduler,
the scheduler itself will be responsible for thread exit.

> How about this for a simpler technique:  When the native sampler is created,
> use GetThreadTimes() to get lpCreationTime.  Before each sample, do the same.
> If the creation-time changes, it must be a different thread.  In addition,
> there is an lpExitTime that would determine if the thread has exited (but not
> yet been reaped).

This would dramatically reduce the window of vulnerability, but I don't think it
could prevent this situation from occurring. GetThreadTimes() is an inherently
racy API; there will always be some window between the time that
GetThreadTimes() is invoked and the actions taken as a result, during which time
threads could be destroyed or created.

bcwhite

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sampling_profiler.cc File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sampling_profiler.cc#newcode474 base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that the thread under ...

3 years, 10 months ago (2017-02-01 20:37:59 UTC) #145

Mike Wittman

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sampling_profiler.cc File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sampling_profiler.cc#newcode474 base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that the thread under ...

3 years, 10 months ago (2017-02-01 21:19:04 UTC) #146

bcwhite

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sampling_profiler.cc File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sampling_profiler.cc#newcode474 base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that the thread under ...

3 years, 10 months ago (2017-02-01 22:01:47 UTC) #147

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that
the thread under test is gone.
> > > This would dramatically reduce the window of vulnerability, but I don't think it
> > > could prevent this situation from occurring. GetThreadTimes() is an inherently
> > > racy API; there will always be some window between the time that
> > > GetThreadTimes() is invoked and the actions taken as a result, during which time
> > > threads could be destroyed or created.
> >
> > If the call were made while the thread was suspended, there wouldn't be any
> > race.
> >
> > But to avoid being suspended any longer than necessary, the check could be done
> > after the acquisition.  On the incredibly slim chance that the thread died and
> > was replaced with an identical ID in those few ns, the worst that would happen
> > is that the last sample would get discarded unnecessarily.
>
> There would still be a race between the time the thread id was provided to the
> profiler and the first time the thread was suspended. (At least that one, there
> may be others too.)

The thread creation time would be captured during the ctor of the
StackSamplingProfiler so at a known time and from a known thread.

The thread under test could still die and be replaced in that time but that's a
race outside of this module.  It's up to the caller to ensure that the thread it
wants to profile is still alive when the ctor returns, before Start is called,
something it has the chance of doing because it has more knowledge.

It'll be possible to verify the thread even on the very first sampling attempt.


> It's really difficult to be sure that all relevant thread interleaving scenarios
> have been considered with an API like this. And even if they have, it will be
> significantly difficult for other developers to validate the correctness of the
> resulting code. If and when some future developer makes changes here, there's a
> small chance they will understand the subtleties sufficiently to avoid
> introducing races.

Any solution is going to have potential race conditions to verify but this at
least is simple and easy to follow.

Mike Wittman

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sampling_profiler.cc File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sampling_profiler.cc#newcode474 base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that the thread under ...

3 years, 10 months ago (2017-02-02 02:48:05 UTC) #148

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that
the thread under test is gone.
On 2017/02/01 22:01:47, bcwhite wrote:
> > > > This would dramatically reduce the window of vulnerability, but I don't think it
> > > > could prevent this situation from occurring. GetThreadTimes() is an inherently
> > > > racy API; there will always be some window between the time that
> > > > GetThreadTimes() is invoked and the actions taken as a result, during which time
> > > > threads could be destroyed or created.
> > >
> > > If the call were made while the thread was suspended, there wouldn't be any
> > > race.
> > >
> > > But to avoid being suspended any longer than necessary, the check could be done
> > > after the acquisition.  On the incredibly slim chance that the thread died and
> > > was replaced with an identical ID in those few ns, the worst that would happen
> > > is that the last sample would get discarded unnecessarily.
> >
> > There would still be a race between the time the thread id was provided to the
> > profiler and the first time the thread was suspended. (At least that one, there
> > may be others too.)
>
> The thread creation time would be captured during the ctor of the
> StackSamplingProfiler so at a known time and from a known thread.
>
> The thread under test could still die and be replaced in that time but that's a
> race outside of this module.  It's up to the caller to ensure that the thread it
> wants to profile is still alive when the ctor returns, before Start is called,
> something it has the chance of doing because it has more knowledge.
>
> It'll be possible to verify the thread even on the very first sampling attempt.

I am not convinced that we've enumerated and addressed all the possible races
here, and I am skeptical that this is realistically possible given the
dependence on Win32 implementation details. Take GetThreadTimes() for example:
your analysis assumes that this executes in a short time and that the values it
returns reflect some relatively current state of reality. Neither of these is
guaranteed to be true, and even if they are now the behavior could change in the
future. There are probably other assumptions we're both making about how this
call and SuspendThread work that may be invalid. Depending on undocumented
behavior is risky and should be avoided where possible.

An entirely separate can of worms is cross-platform support. GetThreadTimes() is
a Win32 API. Even if there are equivalent APIs on OS X, iOS, Linux, and Android,
there's basically zero chance we can depend on winning the same races,
consistently, on every one of those platforms now and in the future. It's also
unknown if the SuspendThread-equivalent will reliably tell us if the thread was
terminated (this is true for Windows too for that matter).

If GetThreadTimes() and its other-platform equivalents take locks then they
cannot be called while the thread is suspended, making races unavoidable. 

> Any solution is going to have potential race conditions to verify but this at
> least is simple and easy to follow.

As far as I'm aware the strawman proposal I mentioned has no race conditions due
to the use of established synchronization primitives. It also depends solely on
cross-platform interfaces.

Given all the issues with the SuspendThread/GetThreadTimes approach I'm not OK
moving forward with it for handling thread exit. We need a solution that
guarantees correct behavior in all cases.

bcwhite

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sampling_profiler.cc File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sampling_profiler.cc#newcode474 base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that the thread under ...

3 years, 10 months ago (2017-02-02 14:24:25 UTC) #149

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that
the thread under test is gone.
> I am not convinced that we've enumerated and addressed all the possible races
> here, and I am skeptical that this is realistically possible given the
> dependence on Win32 implementation details. Take GetThreadTimes() for example:
> your analysis assumes that this executes in a short time and that the values it
> returns reflect some relatively current state of reality. Neither of these is
> guaranteed to be true, and even if they are now the behavior could change in the
> future. There are probably other assumptions we're both making about how this
> call and SuspendThread work that may be invalid. Depending on undocumented
> behavior is risky and should be avoided where possible.

Undocumented?  GetThreadTimes is a published and supported API of which the only
behavior we're looking at is the reported creation time.
https://msdn.microsoft.com/en-us/library/windows/desktop/ms683237(v=vs.85).aspx

We can't guarantee the time and operation of future Chrome changes, either, but
since GetThreadTimes() is a published API upon which thousands of applications
likely depend, I don't see it changing in any significant way.  And it really
doesn't matter if it's not exceptionally quick (though it likely is) since it'll
be running on the sampling thread after the sampled thread has been resumed.  As
you've said, 100s of ms is a lot of time.  Even if there were many concurrent
profiles being collected, it's not going to be significant compared to the existing
activities of stopping a thread, copying its stack, and decoding it.


> An entirely separate can of worms is cross-platform support. GetThreadTimes() is
> a Win32 API. Even if there are equivalent APIs on OS X, iOS, Linux, and Android,
> there's basically zero chance we can depend on winning the same races,
> consistently, on every one of those platforms now and in the future. It's also
> unknown if the SuspendThread-equivalent will reliably tell us if the thread was
> terminated (this is true for Windows too for that matter).

Cross-platform is already a can of worms.  Recording and checking the thread-ID
would live in the platform-specific NativeStackSamplerWin class.  When (if?)
other platform support gets added, those classes can use whatever is appropriate
for them.  Or if nothing works, then a more complex solution can be
investigated.  There's no benefit in trying to code specifically for them in
advance.


> If GetThreadTimes() and its other-platform equivalents take locks then they
> cannot be called while the thread is suspended, making races unavoidable. 

I wouldn't do it while suspended anyway for reasons I mentioned previously.


> > Any solution is going to have potential race conditions to verify but this at
> > least is simple and easy to follow.
>
> As far as I'm aware the strawman proposal I mentioned has no race conditions due
> to the use of established synchronization primitives. It also depends solely on
> cross-platform interfaces.

It does.  I don't know what they are, but I'm sure they're there.  Managing the
start/stop of a thread proved to be insanely difficult.  But even if I'm wrong,
the proposal is far more complex and difficult to understand than this simple,
self-contained solution.  The proposal also makes assumptions about the threads
under test, something that may prove limiting in the future.  Somebody is bound
to want to trace a PlatformThread without a message-loop at some point.


> Given all the issues with the SuspendThread/GetThreadTimes approach I'm not OK
> moving forward with it for handling thread exit. We need a solution that
> guarantees correct behavior in all cases.

No, you don't.  You need to be sure it won't crash but other than that, you just
need a solution that has a signal-to-noise ratio sufficient to analyze the data;
I see very little, if any, noise coming from this.  We can't let "perfect" be
the enemy of the "good".

This is a simple solution and can be implemented quickly and cleanly.  We should
do it.  If it proves to be untenable in the field, then we can investigate more
complicated methods.

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 10 months ago (2017-02-02 16:17:05 UTC) #150

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/500001

3 years, 10 months ago (2017-02-02 16:17:26 UTC) #151

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 10 months ago (2017-02-02 17:50:07 UTC) #152

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: linux_chromium_chromeos_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_chromeos_rel_ng/builds/357412)

3 years, 10 months ago (2017-02-02 17:50:09 UTC) #153

Mike Wittman

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sampling_profiler.cc File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sampling_profiler.cc#newcode474 base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that the thread under ...

3 years, 10 months ago (2017-02-02 19:22:31 UTC) #154

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that
the thread under test is gone.
On 2017/02/02 14:24:25, bcwhite wrote:
> > I am not convinced that we've enumerated and addressed all the possible races
> > here, and I am skeptical that this is realistically possible given the
> > dependence on Win32 implementation details. Take GetThreadTimes() for example:
> > your analysis assumes that this executes in a short time and that the values it
> > returns reflect some relatively current state of reality. Neither of these is
> > guaranteed to be true, and even if they are now the behavior could change in the
> > future. There are probably other assumptions we're both making about how this
> > call and SuspendThread work that may be invalid. Depending on undocumented
> > behavior is risky and should be avoided where possible.
>
> Undocumented?  GetThreadTimes is a published and supported API of which the only
> behavior we're looking at is the reported creation time.
> https://msdn.microsoft.com/en-us/library/windows/desktop/ms683237(v=vs.85).aspx
>
> We can't guarantee the time and operation of future Chrome changes, either, but
> since GetThreadTimes() is a published API upon which thousands of applications
> likely depend, I don't see it changing in any significant way.  And it really
> doesn't matter if it's not exceptionally quick (though it likely is) since it'll
> be running on the sampling thread after the sampled thread has been resumed.  As
> you've said, 100s of ms is a lot of time.  Even if there were many concurrent
> profiles being collected, it's not going to be significant compared to the existing
> activities of stopping a thread, copying its stack, and decoding it.
> 

Several points:

1. Running the check after the sampling has already happened still leaves a
100ms race window, and still allows the failure scenarios I mentioned in comment
#139.

2. The point I was trying, inelegantly, to make above is that Win32
implementation details affect the length of the race window in ways which are
difficult to predict.

3. All it takes for this approach to go sideways, due to SuspendThread operating
on a different thread than GetThreadTimes, is one ill-timed context switch on
the profiler thread within the race window. The length of the window just makes
the probability of hitting this case more or less likely.

4. Given the number of Chrome users and the number of times this code is run,
events with even a vanishingly small probability of occurring will occur
reliably over the population. A one-in-a-million event during profiling will
occur hundreds of times per day, just over the population of canary and dev
users. The guard page check in the code is there to handle a case that occurs
with a probability of around 1 in 10,000,000, and was generating a
non-negligible number of crash reports.
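
The arithmetic behind point 4 can be sketched as a simple expected-value calculation. This is only an illustration: the daily event count below is an assumed figure picked to be consistent with "hundreds of times per day", not a number taken from this thread or from any measurement.

```cpp
#include <cassert>

// Expected number of occurrences of an event that happens with probability
// `probability` on each of `trials` independent trials. This is just
// probability * trials (linearity of expectation).
double ExpectedOccurrences(double probability, double trials) {
  assert(probability >= 0.0 && probability <= 1.0);
  return probability * trials;
}

// ASSUMPTION: hypothetical number of profiler sampling events per day over
// the canary and dev population. Chosen for illustration only.
constexpr double kAssumedEventsPerDay = 3e8;
```

Under that assumed volume, ExpectedOccurrences(1e-6, kAssumedEventsPerDay) is about 300 per day, and a 1-in-10,000,000 event such as the guard page case is still about 30 per day. A different assumed volume scales both figures linearly.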

> > An entirely separate can of worms is cross-platform support. GetThreadTimes() is
> > a Win32 API. Even if there are equivalent APIs on OS X, iOS, Linux, and Android,
> > there's basically zero chance we can depend on winning the same races,
> > consistently, on every one of those platforms now and in the future. It's also
> > unknown if the SuspendThread-equivalent will reliably tell us if the thread was
> > terminated (this is true for Windows too for that matter).
>
> Cross-platform is already a can of worms.  Recording and checking the thread-ID
> would live in the platform-specific NativeStackSamplerWin class.  When (if?)
> other platform support gets added, those classes can use whatever is appropriate
> for them.  Or if nothing works, then a more complex solution can be
> investigated.  There's no benefit in trying to code specifically for them in
> advance.

The OS X implementation is in progress; one of the Mac developers has already
started working on it.

> > If GetThreadTimes() and its other-platform equivalents take locks then they
> > cannot be called while the thread is suspended, making races unavoidable. 
> 
> I wouldn't do it while suspended anyway for reasons I mentioned previously.
> 
> 
> > > Any solution is going to have potential race conditions to verify but this at
> > > least is simple and easy to follow.
> >
> > As far as I'm aware the strawman proposal I mentioned has no race conditions due
> > to the use of established synchronization primitives. It also depends solely on
> > cross-platform interfaces.
>
> It does.  I don't know what they are, but I'm sure they're there.  Managing the
> start/stop of a thread proved to be insanely difficult.  But even if I'm wrong,
> the proposal is far more complex and difficult to understand than this simple,
> self-contained solution.  The proposal also makes assumptions about the threads
> under test, something that may prove limiting in the future.  Somebody is bound
> to want to trace a PlatformThread without a message-loop at some point.

The strawman proposal uses standard Chrome synchronization primitives and would
be significantly easier for the average Chrome developer to understand than the
use of Win32 APIs. Effectively the only restriction it places on the profiled
threads is that, if they are being profiled from a thread other than themselves,
that thread must be responsible for ensuring the profiled thread outlives the
profiling. There's no need for the profiled thread to have a message loop.

> > Given all the issues with the SuspendThread/GetThreadTimes approach I'm not OK
> > moving forward with it for handling thread exit. We need a solution that
> > guarantees correct behavior in all cases.
>
> No, you don't.  You need to be sure it won't crash but other than that, you just
> need a solution that has a signal-to-noise ratio sufficient to analyze the data;
> I see very little, if any, noise coming from this.  We can't let "perfect" be
> the enemy of the "good".
>
> This is a simple solution and can be implemented quickly and cleanly.  We should
> do it.  If it proves to be untenable in the field, then we can investigate more
> complicated methods.

I am not sure it won't crash and I haven't seen sufficient justification in this
thread for why it won't crash. I've even outlined a possible scenario where not
only will it crash, but it will deadlock other processes on the system. This
would not only be incredibly poor behavior, but potentially a huge PR black eye
("Chrome is so unstable it crashes my other applications too!").

Summarizing my objections to this approach:
- it's platform-specific, and it's highly uncertain whether it can even be
applied to other platforms
- it's racy, probably unavoidably so
- we don't understand all the consequences of losing the race, but there's good
reason to believe that at least one consequence is crashes and system
instability
- losing the race will happen with non-negligible frequency over the user
population
- there's good reason to believe that an alternate implementation is possible
that doesn't have these drawbacks

This is the best I can do to explain why I can't approve this approach to
handling thread exit. If you want to continue pursuing it, you'll need to
escalate to danakj or another Chrome threading/synchronization guru.

bcwhite

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sampling_profiler.cc File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sampling_profiler.cc#newcode474 base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that the thread under ...

3 years, 10 months ago (2017-02-02 20:46:16 UTC) #155

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that
the thread under test is gone.
> > Undocumented?  GetThreadTimes is a published and supported API of which the only
> > behavior we're looking at is the reported creation time.
> > https://msdn.microsoft.com/en-us/library/windows/desktop/ms683237(v=vs.85).aspx
> >
> > We can't guarantee the time and operation of future Chrome changes, either, but
> > since GetThreadTimes() is a published API upon which thousands of applications
> > likely depend, I don't see it changing in any significant way.  And it really
> > doesn't matter if it's not exceptionally quick (though it likely is) since it'll
> > be running on the sampling thread after the sampled thread has been resumed.  As
> > you've said, 100s of ms is a lot of time.  Even if there were many concurrent
> > profiles being collected, it's not going to be significant compared to the existing
> > activities of stopping a thread, copying its stack, and decoding it.
> >
>
> Several points:
>
> 1. Running the check after the sampling has already happened still leaves a
> 100ms race window, and still allows the failure scenarios I mentioned in comment
> #139.

100ms?  If the check happens immediately after the sampling, it is checked as
soon as the sample is converted.  That's microseconds.

And the failure case is that the last sample gets dropped unnecessarily.  That's
not a serious problem.


> 2. The point I was trying, inelegantly, to make above is that Win32
> implementation details affect the length of the race window in ways which are
> difficult to predict.
>
> 3. All it takes for this approach to go sideways, due to SuspendThread operating
> on a different thread than GetThreadTimes, is one ill-timed context switch on
> the profiler thread within the race window. The length of the window just makes
> the probability of hitting this case more or less likely.

But it's not.  The GetThreadTimes is done on the same thread, just after it is
resumed, all within the NativeStackSampler.


> 4. Given the number of Chrome users and the number of times this code is run,
> events with even a vanishingly small probability of occurring will occur
> reliably over the population. A one-in-a-million event during profiling will
> occur hundreds of times per day, just over the population of canary and dev
> users. The guard page check in the code is there to handle a case that occurs
> with a probability of around 1 in 10,000,000, and was generating a
> non-negligible number of crash reports.

Crashes are serious things, and one-in-ten-million crashes would be a
significant addition to the current number of crashes.

But one-in-a-million bad samples is in no way significant, an error of 0.0001%
which is minuscule and certainly less than the normal variation seen when
aggregating samples.


> > > An entirely separate can of worms is cross-platform support. GetThreadTimes() is
> > > a Win32 API. Even if there are equivalent APIs on OS X, iOS, Linux, and Android,
> > > there's basically zero chance we can depend on winning the same races,
> > > consistently, on every one of those platforms now and in the future. It's also
> > > unknown if the SuspendThread-equivalent will reliably tell us if the thread was
> > > terminated (this is true for Windows too for that matter).
> >
> > Cross-platform is already a can of worms.  Recording and checking the thread-ID
> > would live in the platform-specific NativeStackSamplerWin class.  When (if?)
> > other platform support gets added, those classes can use whatever is appropriate
> > for them.  Or if nothing works, then a more complex solution can be
> > investigated.  There's no benefit in trying to code specifically for them in
> > advance.
>
> The OS X implementation is in progress; one of the Mac developers has already
> started working on it.

And do they claim that there is no platform-specific way to detect if a thread
has exited?


> > It does.  I don't know what they are, but I'm sure they're there.  Managing the
> > start/stop of a thread proved to be insanely difficult.  But even if I'm wrong,
> > the proposal is far more complex and difficult to understand than this simple,
> > self-contained solution.  The proposal also makes assumptions about the threads
> > under test, something that may prove limiting in the future.  Somebody is bound
> > to want to trace a PlatformThread without a message-loop at some point.
>
> The strawman proposal uses standard Chrome synchronization primitives and would
> be significantly easier for the average Chrome developer to understand than the
> use of Win32 APIs. Effectively the only restriction it places on the profiled
> threads is that, if they are being profiled from a thread other than themselves,
> that thread must be responsible for ensuring the profiled thread outlives the
> profiling. There's no need for the profiled thread to have a message loop.

It also requires modification of every thread that needs to be sampled,
something that will limit the ease of using this tool.  Plus, those
modifications may have far-reaching effects since they will change the
characteristics of how the thread exits, possibly affecting other threads
waiting on it.  There's no way to predict how far those effects will propagate.


> > > Given all the issues with the SuspendThread/GetThreadTimes approach I'm not OK
> > > moving forward with it for handling thread exit. We need a solution that
> > > guarantees correct behavior in all cases.
> >
> > No, you don't.  You need to be sure it won't crash but other than that, you just
> > need a solution that has a signal-to-noise ratio sufficient to analyze the data;
> > I see very little, if any, noise coming from this.  We can't let "perfect" be
> > the enemy of the "good".
> >
> > This is a simple solution and can be implemented quickly and cleanly.  We should
> > do it.  If it proves to be untenable in the field, then we can investigate more
> > complicated methods.
>
> I am not sure it won't crash and I haven't seen sufficient justification in this
> thread for why it won't crash. I've even outlined a possible scenario where not
> only will it crash, but it will deadlock other processes on the system. This
> would not only be incredibly poor behavior, but potentially a huge PR black eye
> ("Chrome is so unstable it crashes my other applications too!").

If it won't work or risks other processes then I agree with you and a more
complicated solution needs to be used.

What I'm trying to determine at the moment is whether a thread-ID can be reused
while a win::ScopedHandle to that thread is still open.  That seems to me like
something the OS would prevent...

From this article on SO...
http://stackoverflow.com/questions/14863919/does-a-thread-id-stay-unique-vali...
... it seems to be the case that the thread-ID CANNOT be reused until the
ScopedHandle used by the NativeStackSampler releases it.

If that is the case then there is no need to worry about the thread being
replaced with one of an identical ID while it is being sampled and all this can
be simplified.
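
The behavior being asked about can be pictured with a toy model. This is a sketch only, in portable C++ rather than Win32: FakeIdAllocator and its methods are hypothetical names, and the rule it encodes (an ID returns to the free pool only after the thread has exited and every handle to it is closed) is the reading of the linked article stated above, not something verified here against Windows.

```cpp
#include <cassert>
#include <map>
#include <set>

// Toy model of thread-ID allocation. NOT the Windows implementation.
class FakeIdAllocator {
 public:
  // Hands out the lowest recycled id if one exists, else a fresh id.
  int CreateThread() {
    if (free_ids_.empty())
      return next_id_++;
    int id = *free_ids_.begin();
    free_ids_.erase(free_ids_.begin());
    return id;
  }
  void OpenHandle(int id) { ++open_handles_[id]; }
  void CloseHandle(int id) {
    assert(open_handles_[id] > 0);
    if (--open_handles_[id] == 0 && exited_.count(id) > 0)
      Recycle(id);
  }
  void ExitThread(int id) {
    exited_.insert(id);
    if (open_handles_[id] == 0)
      Recycle(id);
  }

 private:
  // An id becomes reusable only once the thread has exited and no handle
  // to it remains open.
  void Recycle(int id) {
    exited_.erase(id);
    open_handles_.erase(id);
    free_ids_.insert(id);
  }
  int next_id_ = 1;
  std::set<int> free_ids_;
  std::set<int> exited_;
  std::map<int, int> open_handles_;
};

// The sampler still holds its handle when the sampled thread exits.
bool IdReusedWhileHandleOpen() {
  FakeIdAllocator os;
  int sampled = os.CreateThread();
  os.OpenHandle(sampled);
  os.ExitThread(sampled);
  return os.CreateThread() == sampled;
}

// Same sequence, but the sampler closes its handle before the next thread
// is created.
bool IdReusedAfterHandleClosed() {
  FakeIdAllocator os;
  int sampled = os.CreateThread();
  os.OpenHandle(sampled);
  os.ExitThread(sampled);
  os.CloseHandle(sampled);
  return os.CreateThread() == sampled;
}
```

In this model IdReusedWhileHandleOpen() is false and IdReusedAfterHandleClosed() is true. The model always reuses the lowest free id, which is a simplification; it says nothing about how soon a real OS would recycle an id.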

Is there something I'm missing about this?

Mike Wittman

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sampling_profiler.cc File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sampling_profiler.cc#newcode474 base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that the thread under ...

3 years, 10 months ago (2017-02-02 22:29:15 UTC) #156

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that
the thread under test is gone.
On 2017/02/02 20:46:16, bcwhite wrote:
> https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa...
> File base/profiler/stack_sampling_profiler.cc (right):
>
> https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa...
> base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that
> the thread under test is gone.
> > > Undocumented?  GetThreadTimes is a published and supported API of which the only
> > > behavior we're looking at is the reported creation time.
> > > https://msdn.microsoft.com/en-us/library/windows/desktop/ms683237(v=vs.85).aspx
> > >
> > > We can't guarantee the time and operation of future Chrome changes, either, but
> > > since GetThreadTimes() is a published API upon which thousands of applications
> > > likely depend, I don't see it changing in any significant way.  And it really
> > > doesn't matter if it's not exceptionally quick (though it likely is) since it'll
> > > be running on the sampling thread after the sampled thread has been resumed.  As
> > > you've said, 100s of ms is a lot of time.  Even if there were many concurrent
> > > profiles being collected, it's not going to be significant compared to the existing
> > > activities of stopping a thread, copying its stack, and decoding it.
> > >
> >
> > Several points:
> >
> > 1. Running the check after the sampling has already happened still leaves a
> > 100ms race window, and still allows the failure scenarios I mentioned in comment
> > #139.
>
> 100ms?  If the check happens immediately after the sampling, it is checked as
> soon as the sample is converted.  That's us.

Are you presuming the GetThreadTimes call is made both before and after the
thread suspension? Otherwise the race window is the interval between the
GetThreadTimes call and the next SuspendThread call, which is the 100ms between
the end of one sample and the start of the next. If the thread gets replaced
during that window, SuspendThread will operate on the wrong thread.

> And the failure case is that the last sample gets dropped unnecessarily.  That's
> not a serious problem.

The failure I've mentioned results in a crash and deadlock and works like this:

1. GetThreadTimes call is made. Thread id and creation time match what was seen
previously.
2. Time goes by...
3. The thread is replaced by a thread in another process.
4. SuspendThread is called on the new thread and succeeds.
5. The NativeStackSampler attempts to copy the thread's stack, but generates an
access violation because the thread's stack lives in another process' address
space.
6. Chrome crashes. The thread in the other process remains permanently
suspended.
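
Steps 1 through 4 are a check-then-act race, which can be replayed deterministically in a toy model. This is a minimal sketch in portable C++, not Win32 code: FakeKernel, FakeThread, and the process and thread numbers are all hypothetical, with GetCreationTime and Suspend standing in for GetThreadTimes() and SuspendThread().

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Toy model of a system-wide thread table keyed by thread id.
struct FakeThread {
  std::uint64_t creation_time;
  int owning_process;
  bool suspended;
};

class FakeKernel {
 public:
  void CreateThread(int id, std::uint64_t creation_time, int process) {
    threads_[id] = FakeThread{creation_time, process, false};
  }
  void ExitThread(int id) { threads_.erase(id); }

  // Stand-in for GetThreadTimes(): reports the creation time of whichever
  // thread currently owns `id`.
  bool GetCreationTime(int id, std::uint64_t* out) const {
    auto it = threads_.find(id);
    if (it == threads_.end())
      return false;
    *out = it->second.creation_time;
    return true;
  }

  // Stand-in for SuspendThread(): suspends whichever thread currently owns
  // `id` and returns its owning process, or -1 if no such thread exists.
  int Suspend(int id) {
    auto it = threads_.find(id);
    if (it == threads_.end())
      return -1;
    it->second.suspended = true;
    return it->second.owning_process;
  }

 private:
  std::map<int, FakeThread> threads_;
};

// Replays steps 1-4 and returns the process whose thread ended up suspended.
int SuspendedProcessAfterRace() {
  const int kChrome = 1, kOther = 2, kThreadId = 7;
  FakeKernel kernel;
  kernel.CreateThread(kThreadId, 1000, kChrome);

  // Step 1: the identity check passes.
  std::uint64_t seen = 0;
  if (!kernel.GetCreationTime(kThreadId, &seen) || seen != 1000)
    return -1;

  // Steps 2-3: time passes; the thread exits and another process gets a
  // thread with the same id.
  kernel.ExitThread(kThreadId);
  kernel.CreateThread(kThreadId, 2000, kOther);

  // Step 4: the suspend call succeeds, but on the other process's thread.
  return kernel.Suspend(kThreadId);
}
```

SuspendedProcessAfterRace() returns 2 here: the check in step 1 said nothing about the state at step 4. Steps 5 and 6 (the access violation and the permanently suspended thread) are outside what this model simulates.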

> > 2. The point I was trying, inelegantly, to make above is that Win32
> > implementation details affect the length of the race window in ways which are
> > difficult to predict.
> >
> > 3. All it takes for this approach to go sideways, due to SuspendThread operating
> > on a different thread than GetThreadTimes, is one ill-timed context switch on
> > the profiler thread within the race window. The length of the window just makes
> > the probability of hitting this case more or less likely.
>
> But it's not.  The GetThreadTimes is done on the same thread, just after it is
> resumed, all within the NativeStackSampler.

What do you mean by "it's not"? The context switch won't happen in this case?

> > 4. Given the number of Chrome users and the number of times this code is
> > run, events with even a vanishingly small probability of occurring will
> > occur reliably over the population. A one-in-a-million event during
> > profiling will occur hundreds of times per day, just over the population of
> > canary and dev users. The guard page check in the code is there to handle a
> > case that occurs with a probability of around 1 in 10,000,000, and was
> > generating a non-negligible number of crash reports.
> 
> Crashes are serious things, and one-in-ten-million crashes would be a
> significant addition to the current number of crashes.
> 
> But one-in-a-million bad samples is in no way significant, an error of 0.0001%
> which is minuscule and certainly less than the normal variation seen when
> aggregating samples.

The issue I see with this affecting the data is that it's turning what is
currently a retryable failure (SuspendThread failed) into a permanent failure.
To the extent that SuspendThread fails for reasons other than the thread having
exited, we will miss out on collecting the rest of the collection's samples.
It's unclear how often this will happen. There's one case described in the
documentation, but it's unclear what other scenarios will cause this behavior.
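
The difference between the two failure policies can be sketched as follows. This is a toy model, not the profiler's code: FailurePolicy and CountCollectedSamples are hypothetical names, and the vector of suspend results is an assumed input rather than measured data.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical policies for a failed suspend attempt.
enum class FailurePolicy {
  kRetryNextInterval,  // Skip this sample, try again at the next interval.
  kAbortCollection,    // Treat the failure as "thread exited" and stop.
};

// |suspend_ok[i]| records whether the suspend attempt for sample i succeeded.
// Returns how many samples the collection ends up with under |policy|.
std::size_t CountCollectedSamples(const std::vector<bool>& suspend_ok,
                                  FailurePolicy policy) {
  std::size_t collected = 0;
  for (bool ok : suspend_ok) {
    if (ok) {
      ++collected;
      continue;
    }
    if (policy == FailurePolicy::kAbortCollection) break;
  }
  return collected;
}
```

For a collection where only the second of four suspend attempts fails transiently, the retry policy keeps three samples while the abort policy keeps one, which is the data loss described above.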

> > > > An entirely separate can of worms is cross-platform support.
> > > > GetThreadTimes() is a Win32 API. Even if there are equivalent APIs on
> > > > OS X, iOS, Linux, and Android, there's basically zero chance we can
> > > > depend on winning the same races, consistently, on every one of those
> > > > platforms now and in the future. It's also unknown if the
> > > > SuspendThread-equivalent will reliably tell us if the thread was
> > > > terminated (this is true for Windows too for that matter).
> > > 
> > > Cross-platform is already a can of worms.  Recording and checking the
> > > thread-ID would live in the platform-specific NativeStackSamplerWin
> > > class.  When (if?) other platform support gets added, those classes can
> > > use whatever is appropriate for them.  Or if nothing works, then a more
> > > complex solution can be investigated.  There's no benefit in trying to
> > > code specifically for them in advance.
> > 
> > The OS X implementation is in progress; one of the Mac developers has
> > already started working on it.
> 
> And do they claim that there is no platform-specific way to detect if a thread
> has exited?

I looked at the initial OS X implementation and the SuspendThread equivalent
effectively operates on pid_t's, which are reused by the OS. So it's very likely
subject to the same race conditions.

> > > It does.  I don't know what they are, but I'm sure they're there.
> > > Managing the start/stop of a thread proved to be insanely difficult.  But
> > > even if I'm wrong, the proposal is far more complex and difficult to
> > > understand than this simple, self-contained solution.  The proposal also
> > > makes assumptions about the threads under test, something that may prove
> > > limiting in the future.  Somebody is bound to want to trace a
> > > PlatformThread without a message-loop at some point.
> > 
> > The strawman proposal uses standard Chrome synchronization primitives and
> > would be significantly easier to understand by the average Chrome developer
> > than the use of Win32 APIs. Effectively the only restrictions it places on
> > the profiled threads is that, if they are being profiled from a thread other
> > than themselves, that thread must be responsible for ensuring the profiled
> > thread outlives the profiling. There's no need for the profiled thread to
> > have a message loop.
> 
> It also requires modification of every thread that needs to be sampled,
> something that will limit the ease of using this tool.  Plus, those
> modifications may have far-reaching effects since they will change the
> characteristics of how the thread exits, possibly affecting other threads
> waiting on it.  There's no way to predict how far those effects will
> propagate.

It doesn't require modification of every thread that needs to be sampled; it
just requires allocation of a StackSamplingProfiler on the stack for any thread
that wants to sample itself. The other main use case, threads sampled on
behalf of the thread scheduler, requires no changes. The only effect on thread
exit is that the profiled thread might need to wait for the profiler thread to
finish its current sample, and there may still be ways to mitigate that if it
proves burdensome.

> > > > Given all the issues with the SuspendThread/GetThreadTimes approach I'm
> > > > not OK moving forward with it for handling thread exit. We need a
> > > > solution that guarantees correct behavior in all cases.
> > > 
> > > No, you don't.  You need to be sure it won't crash but other than that,
> > > you just need a solution that has a signal-to-noise ratio sufficient to
> > > analyze the data; I see very little, if any, noise coming from this.  We
> > > can't let "perfect" be the enemy of the "good".
> > > 
> > > This is a simple solution and can be implemented quickly and cleanly.  We
> > > should do it.  If it proves to be untenable in the field, then we can
> > > investigate more complicated methods.
> > 
> > I am not sure it won't crash and I haven't seen sufficient justification in
> > this thread for why it won't crash. I've even outlined a possible scenario
> > where not only will it crash, but it will deadlock other processes on the
> > system. This would not only be incredibly poor behavior, but potentially a
> > huge PR black eye ("Chrome is so unstable it crashes my other applications
> > too!").
> 
> If it won't work or risks other processes then I agree with you and a more
> complicated solution needs to be used.
> 
> What I'm trying to determine at the moment is if a thread-ID can be reused
> while a win::ScopedHandle to it is still open to that thread.  Seems to me
> something the OS would prevent...
> 
> From this article on SO...
> http://stackoverflow.com/questions/14863919/does-a-thread-id-stay-unique-vali...
> ... it seems to be the case that the thread-ID CANNOT be reused until the
> ScopedHandle used by the NativeStackSampler releases it.
> 
> If that is the case then there is no need to worry about the thread being
> replaced with one of an identical ID while it is being sampled and all this
> can be simplified.
> 
> Is there something I'm missing about this?

I don't know if a thread id can be reused while holding a handle. But even if
this works on Windows, it's unlikely to work on OS X since there's no
corresponding handle concept. On that platform we'll probably have to implement
a synchronization-based solution similar to what I've proposed. At that point,
given a cross-platform implementation, there would be no reason not to replace
the Windows implementation with that solution to minimize code complexity. I
don't see why we shouldn't just avoid the intermediate step.

bcwhite

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sampling_profiler.cc File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sampling_profiler.cc#newcode474 base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that the thread under ...

3 years, 10 months ago (2017-02-03 13:28:39 UTC) #157

Mike Wittman

3 years, 10 months ago (2017-02-03 17:24:15 UTC) #158

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that
the thread under test is gone.
On 2017/02/03 13:28:38, bcwhite wrote:
> Okay, you've convinced me: If a thread were to exit and its ID be reused,
> whether it be another Chrome thread or in some outside process, then Bad
> Things(tm) may happen if any attempt to suspend that thread is made.  There is
> no way, in the NativeStackSampler alone, to fully eliminate such a
> possibility.
> 
> I also agree with you completely that any solution should be implemented, if
> practical, in a generic manner that applies to all architectures and even,
> ideally, is reusable for other purposes.  That's just a general principle and
> good idea.
> 
> However, documentation states that a thread-ID cannot be reused under Windows
> as long as there are any open handles to that thread.  Since the
> NativeStackSamplerWin object holds an open handle, the thread-ID cannot be
> reused and thus no other solution is necessary. Additional protection should
> not be implemented speculatively based on what may be needed in the future.
> 
> If the implementation of stack-sampling for another OS does need a more
> elaborate solution, then one should be implemented at that time as part of
> that effort.
> 
> Do you agree?

No, sorry, I don't agree.

Support for other OS's is not a future concern. Mac support is being worked on
right now by an engineer the team has committed for the project.

I don't see how this approach could work on Mac. Going with it would not only
shift the burden for implementing cross-platform support onto the Mac developer
but would make their job much harder by forcing them to deal with a bunch of
complexity that doesn't apply to them. They're not going to be comfortable
changing the Windows-specific implementation, so the end result will be two
different implementations doing the same thing. Then, someone with Windows
experience will have to come in and clean this up just to get back to the state
we would have been in if we'd implemented the cross-platform interface in the
first place.

That's a ton of unnecessary work and code churn.

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 10 months ago (2017-02-03 18:45:20 UTC) #159

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/520001

3 years, 10 months ago (2017-02-03 18:45:46 UTC) #160

bcwhite

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sampling_profiler.cc File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sampling_profiler.cc#newcode474 base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that the thread under ...

3 years, 10 months ago (2017-02-03 18:47:58 UTC) #161

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 10 months ago (2017-02-03 18:53:37 UTC) #162

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: ios-device on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-device/builds/147111) ios-device-xcode-clang on ...

3 years, 10 months ago (2017-02-03 18:53:39 UTC) #163

Mike Wittman

3 years, 10 months ago (2017-02-04 01:09:29 UTC) #164

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that
the thread under test is gone.
On 2017/02/03 18:47:58, bcwhite wrote:
> > I don't see how this approach could work on Mac. Going with it would not
> > only shift the burden for implementing cross-platform support onto the Mac
> > developer
> 
> Correct.  Or at least to a different CL.
> 
> 
> > but would make their job much harder by forcing them to deal with a bunch of
> > complexity that doesn't apply to them.
> 
> There is no other complexity.

The complexity is inherent in this change: it adds functions and state to the
cross-platform NativeStackSampler interface that only apply to Windows. The Mac
developer will have to understand and keep straight what parts of this interface
apply to them, what parts don't, how both of those parts interact with each
other and with the cross-platform support that they will need to implement, and
how to make the StackSamplingProfiler work with both systems at the same time.

I've been here -- working on a cross-platform interface with inconsistent
platform-specific implementations -- and it increases the level of difficulty
and complexity substantially, well beyond what it would take to just implement
the cross-platform support from scratch.

> There is absolutely nothing that needs to be done
> here.  The native sampler will work exactly as it did before.  There is no
> need to try to handle the "thread under test dies" case as it is already safe.

Why do you say this will work as it did before? The profiler thread join in the
StackSamplingProfiler destructor, which prevented profiling past thread exit in
the single-thread-profiling implementation, has gone away with the
multi-threading changes. It hasn't been replaced with something that works
consistently across platforms.

It's particularly ill-timed to be regressing this behavior now, right when we're
trying to bring up the profiler on Mac.

> It's safe exactly as it was.  For me to write something that isn't necessary
> on the hope that it will fulfill someone else's need would not be in line with
> standard Chrome development and could easily result in more churn trying to do
> so.

Chrome is a cross-platform product. Implementing platform-specific functionality
that not only does not generalize across platforms, but makes it harder to do so
is counterproductive. Especially if there is a known need for the functionality
on other platforms at time of implementation and a likely viable
platform-independent alternative.

I appreciate the effort that has gone into this thread exit implementation. But
the bottom line is that, even if it works on Windows, it would move the project
further from where it needs to be rather than closer.

The suggestion I made for a cross platform approach is conceptually similar to
the single-thread-profiling join-on-destruction implementation, so there's good
reason to believe it will work without a huge amount of effort.

bcwhite

3 years, 10 months ago (2017-02-04 02:07:01 UTC) #165

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that
the thread under test is gone.
> The complexity is inherent in this change: it adds functions and state to the
> cross-platform NativeStackSampler interface that only apply to Windows.

Not true.  I can remove completely everything I added to the NativeSampler and
it will continue to work.  I left them in place because it helps the *test*.

In fact, the only addition is recording the state of the thread under test in
detail.  The detail could be easily reduced or removed completely.


> Why do you say this will work as it did before? The profiler thread join in
> the StackSamplingProfiler destructor, which prevented profiling past thread
> exit in the single-thread-profiling implementation, has gone away with the
> multi-threading changes. It hasn't been replaced with something that works
> consistently across platforms.

It continues to sample but gets only empty frames because the SuspendThread call
will fail.  When there are no more samples to take, it executes the callback. 
If the object of that callback has gone away (it should be a weak-pointer), then
the callback will do nothing.
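
A minimal sketch of the "callback does nothing if its object is gone" behavior. It uses std::weak_ptr as a stand-in for Chrome's own weak-pointer type, which has different threading rules; ProfileReceiver and MakeCompletedCallback are hypothetical names used only for illustration.

```cpp
#include <functional>
#include <memory>
#include <vector>

// Hypothetical receiver of completed profiles.
struct ProfileReceiver {
  int profiles_received = 0;
  void OnProfilesCompleted(const std::vector<int>& samples) {
    (void)samples;  // A real receiver would store or upload the samples.
    ++profiles_received;
  }
};

// Builds a completion callback that holds only a weak reference. If the
// receiver has been destroyed by the time sampling finishes, lock() returns
// null and the callback silently does nothing.
std::function<void(const std::vector<int>&)> MakeCompletedCallback(
    std::weak_ptr<ProfileReceiver> receiver) {
  return [receiver](const std::vector<int>& samples) {
    if (std::shared_ptr<ProfileReceiver> strong = receiver.lock())
      strong->OnProfilesCompleted(samples);
  };
}
```

Note that this only protects the callback target. It says nothing about whether the sampled thread itself is still alive while samples are being taken.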

Dealing with the sampling of the thread that owns the Profiler isn't sufficient
anyway.  Thread A could start sampling on thread B but then B could exit without
A's knowledge or destruction of the profiler doing the sampling.  Handling this
general case also handles the A-samples-A special case, and the Windows
NativeStackSampler handles the general case by returning empty stack frames
after the thread exit.

Yes, it would be nice to be able to tell if the thread has actually exited and
stop the sampling immediately but I haven't found any way to do that reliably.


> It's particularly ill-timed to be regressing this behavior now, right when
> we're trying to bring up the profiler on Mac.

When it was only the main thread sampling the main thread during startup, there
was only the special case but now that we want to be able to sample any thread
at any time, the general case has to be addressed and thus will have to be
addressed on the Mac, too.  And in addressing that, it'll address the special
case as well.


I've asked Alexei to weigh in on this discussion because it seems we're coming
at this from different directions and a new voice may help clear things up.

Mike Wittman

3 years, 10 months ago (2017-02-06 20:54:06 UTC) #166

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that
the thread under test is gone.
On 2017/02/04 02:07:01, bcwhite wrote:
> https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa...
> File base/profiler/stack_sampling_profiler.cc (right):
> 
> https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa...
> base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates
> that the thread under test is gone.
> > The complexity is inherent in this change: it adds functions and state to
> > the cross-platform NativeStackSampler interface that only apply to Windows.
> 
> Not true.  I can remove completely everything I added to the NativeSampler and
> it will continue to work.  I left them in place because it helps the *test*.
> 
> In fact, the only addition is recording the state of the thread under test in
> detail.  The detail could be easily reduced or removed completely.

Even if the interface is updated to be the same across platforms, the behavior
will still be different and the StackSamplingProfiler/SamplingThread will need
to operate differently depending on whether it is running on Windows or Mac. The
code will have to operate against two different sets of invariants.

> > Why do you say this will work as it did before? The profiler thread join in
> > the StackSamplingProfiler destructor, which prevented profiling past thread
> > exit in the single-thread-profiling implementation, has gone away with the
> > multi-threading changes. It hasn't been replaced with something that works
> > consistently across platforms.
> 
> It continues to sample but gets only empty frames because the SuspendThread
> call will fail.  When there are no more samples to take, it executes the
> callback.  If the object of that callback has gone away (it should be a
> weak-pointer), then the callback will do nothing.

How can this work on Mac? It's still subject to all the problems I outlined at
the end of comment #154.

> Dealing with the sampling of the thread that owns the Profiler isn't
> sufficient anyway.  Thread A could start sampling on thread B but then B could
> exit without A's knowledge or destruction of the profiler doing the sampling.
> Handling this general case also handles the A-samples-A special case and the
> Windows NativeStackSampler handles the generic case by returning empty stack
> frames after the thread exit.

General A-samples-B is a relatively unimportant use case within Chrome, and
should not be driving the implementation. The two main use cases that need to be
supported now and in the future, respectively, are A-samples-A and thread
scheduler-samples-thread scheduler-managed thread.

A-samples-A can be handled in a cross-platform manner by having the
StackSamplingProfiler coordinate with SamplingThread to ensure the
SamplingThread is done with the profiled thread before the profiled thread
exits.

Thread scheduler-directed sampling can be handled by the thread scheduler
notifying the SamplingThread before it shuts down threads.

In a thread scheduler world, general A-samples-B is at best a niche use case. If
someone wants this, it still can be supported the same way as A-samples-A by
delegating the responsibility for ensuring that the thread outlives the
StackSamplingProfiler onto the StackSamplingProfiler user. In pretty much any
non-thread-scheduler case where A-samples-B would be useful, A will already have
some relationship with B that it can leverage to be notified prior to B exiting.
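
The coordination described above (the profiled thread, or the thread scheduler on its behalf, tells the sampling thread it is going away and waits for any in-flight sample) might look roughly like this. It is a sketch under assumptions: SamplingCoordinator and its methods are hypothetical, targets are identified by plain ints, and std::mutex / std::condition_variable stand in for Chrome's own lock and condition variable classes.

```cpp
#include <condition_variable>
#include <mutex>
#include <set>

// Hypothetical coordinator shared by the sampling thread and profiled threads.
class SamplingCoordinator {
 public:
  // Called by the sampling thread before it touches |target|. Returns false
  // if the target has been retired, in which case it must not be sampled.
  bool BeginSample(int target) {
    std::lock_guard<std::mutex> lock(mu_);
    if (retired_.count(target)) return false;
    in_progress_.insert(target);
    return true;
  }

  // Called by the sampling thread once it has resumed |target| and no longer
  // references its stack.
  void EndSample(int target) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      in_progress_.erase(target);
    }
    cv_.notify_all();
  }

  // Called on behalf of |target| before it exits. Blocks further samples of
  // the target, then waits for any sample already in progress to finish.
  void RetireAndWait(int target) {
    std::unique_lock<std::mutex> lock(mu_);
    retired_.insert(target);
    cv_.wait(lock, [&] { return in_progress_.count(target) == 0; });
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::set<int> in_progress_;
  std::set<int> retired_;
};
```

In the A-samples-A case the StackSamplingProfiler destructor would be the natural caller of RetireAndWait; in the scheduler case the scheduler would call it before shutting a thread down. The worst-case wait is one sample's duration.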

> I've asked Alexei to weigh in on this discussion because it seems we're coming
> at this from different directions and a new voice may help clear things up.

Thanks, I think this is a good idea.

Alexei Svitkine (slow)

asvitkine@chromium.org changed reviewers: + asvitkine@chromium.org

3 years, 10 months ago (2017-02-06 22:44:34 UTC) #167

Alexei Svitkine (slow)

Thanks for looping me in. I tried to read through a good chunk of recent ...

3 years, 10 months ago (2017-02-06 22:44:35 UTC) #168

bcwhite

3 years, 10 months ago (2017-02-07 14:04:46 UTC) #169

> I tried to read through a good chunk of recent discussion here and hopefully I
> got all the context. Here's my understanding:
> 
>   - We are worried that a thread could die and its ID re-used without sampling
> profiler noticing.
>   - Brian proposes a solution that would query the thread's created time using
> a Win32 API to guard against this.

Originally, yes.  I've since discovered that thread-ID reuse is not possible
under Windows because the native sampler continues to hold an open handle to
that thread which prevents it from being fully released by the OS and thus its
identifier will remain unique as long as sampling continues.
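
The claim can be pictured as reference counting on the thread object. The sketch below only models that claim; it does not establish how Windows actually behaves. ThreadIdTable and its methods are hypothetical and are not Win32 APIs.

```cpp
#include <map>

// Hypothetical model: an id stays allocated while the thread is running or
// while any handle to it remains open, and becomes reusable only after both
// the thread has exited and the last handle has been closed.
class ThreadIdTable {
 public:
  // Starts a thread and returns the lowest id not currently allocated. The
  // running thread itself counts as one reference.
  int Spawn() {
    int id = 1;
    while (refs_.count(id)) ++id;
    refs_[id] = 1;
    return id;
  }
  void Open(int id) { ++refs_[id]; }  // An observer opens a handle.
  void Close(int id) { Release(id); }  // An observer closes its handle.
  void Exit(int id) { Release(id); }   // The thread itself terminates.

 private:
  void Release(int id) {
    if (--refs_[id] == 0) refs_.erase(id);  // Id is free for reuse.
  }
  std::map<int, int> refs_;  // id -> outstanding reference count.
};
```

In this model, a sampler that opens a handle before the thread exits keeps the id out of circulation until it closes that handle, so a later Spawn cannot return the same id in the meantime.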


>   - Mike is worried about the above because a) it's platform specific and b)
> it has potential races; Mike proposes a platform-agnostic solution.

Partially true because though Windows doesn't need a fix, that's still
platform-specific.  There are no races, however!


The key difference is that when it was single-sampling only, destructing the
StackSamplingProfiler object caused a join of the SamplingThread which in turn
meant that sampling of the target thread had necessarily stopped.  Thus, a
thread that wanted to initiate sampling upon itself could create a
StackSamplingProfiler as a local variable.  This ensured that it got destructed
before the thread exited and thus there was no possibility of self-sampling to
continue after the thread exited.
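
The join-on-destruction guarantee can be sketched with standard C++ threads. This is an illustration, not the profiler's code: ScopedSampler is a hypothetical class, and incrementing a counter stands in for taking a stack sample.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical RAII sampler: owns its sampling thread and joins it in the
// destructor, so no "sample" can be taken after the object goes out of scope.
class ScopedSampler {
 public:
  explicit ScopedSampler(std::atomic<int>* samples_taken)
      : worker_([this, samples_taken] {
          do {
            samples_taken->fetch_add(1);  // Stand-in for sampling the stack.
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
          } while (!stop_.load());
        }) {}

  ~ScopedSampler() {
    stop_.store(true);
    worker_.join();  // Blocks until the sampling thread has fully stopped.
  }

  ScopedSampler(const ScopedSampler&) = delete;
  ScopedSampler& operator=(const ScopedSampler&) = delete;

 private:
  std::atomic<bool> stop_{false};  // Declared before worker_ so it is
                                   // initialized before the thread starts.
  std::thread worker_;
};
```

Declared as a local in the profiled thread's top-level function, such an object is destroyed before the thread exits, which is the property the single-sampling implementation relied on.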

It was this no-self-sampling-after-death that the current Mac development was
counting on to avoid accidentally sampling the wrong thread should the
target-thread exit and be replaced by an identical identifier.

Unfortunately, multi-sampling means that the SamplingThread no longer
necessarily exits just because one StackSamplingProfiler object is done or
stopping due to going out of scope.  The Thread::Join synchronization step has
been removed.  That means that sampling can continue even though the initiating
StackSamplingProfiler has gone away and there is no way for a self-sampling
thread to definitively know that sampling has stopped before exiting.  Thus,
thread-ID re-use could be a (quite serious) problem if there are no other
protections in place.

On top of this A-samples-A case, there's the more general case of A-samples-B
where A and B are arbitrary threads and either A or B could exit at any time and
in any order.  A-samples-B also includes the fire-and-forget case where some
method on thread A calls the existing static
StackSamplingProfiler::StartAndRunAsync(...) to initiate sampling on itself and
then returns.  In that case there is no object on the stack to destruct and
sampling always continues until fully completed.

Mike, did I miss anything there?


The solutions, then...

1) Let each OS deal with this on its own in whatever way is appropriate.

For Windows, this means doing nothing as protection exists in the form of the
open handle already held by the native sampler. There's no development cost to
this solution; nothing needs to be written now that would be discarded in the
future.

2) Create a mutex that a destructing StackSamplingProfiler can wait on until the
SamplingThread has finished with the thread in question.

This is a better solution but I don't think it should be implemented in this
specific CL because it's non-trivial and unnecessary for Windows.  It also
solves only the A-samples-A case which I believe to be insufficient as it seems
likely that conditional sampling (i.e. take some samples when this event occurs)
will need to use fire-and-forget asynchronous sampling and/or be sampling a
thread other than itself.  I could be wrong but I'd prefer to leave the
discussion of that to a separate CL.

3) Something else.  :-)

Also fine... but also not for this CL.

Alexei Svitkine (slow)

Thanks for clarifying - sorry for not getting the full context the first time through. ...

3 years, 10 months ago (2017-02-07 15:58:26 UTC) #170

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 10 months ago (2017-02-07 17:25:02 UTC) #171

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/540001

3 years, 10 months ago (2017-02-07 17:25:24 UTC) #172

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 10 months ago (2017-02-07 17:34:23 UTC) #173

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: ios-simulator on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-simulator/builds/150389) ios-simulator-xcode-clang on ...

3 years, 10 months ago (2017-02-07 17:34:25 UTC) #174

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 10 months ago (2017-02-07 18:05:10 UTC) #175

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/560001

3 years, 10 months ago (2017-02-07 18:05:24 UTC) #176

Mike Wittman

3 years, 10 months ago (2017-02-07 18:23:24 UTC) #179

On 2017/02/07 15:58:26, Alexei Svitkine (slow) wrote:
> Thanks for clarifying - sorry for not getting the full context the first time
> through.
> 
> In this case I agree that we should go with the simpler solution in this CL -
> since there are no holes in it for Windows - and this will allow us to get
> this functionality sooner without very much technical debt.

I do not agree that there are no races and no other holes in this for Windows --
this hasn't been established in this review, and there's a non-trivial amount of
review and implementation effort left in order to do so. As you stated
initially, it's not so simple, and if we're going to put that effort in, it
would be best to try for a platform-agnostic solution.

The areas of concern I'm currently aware of on Windows are:

- Can we be confident that the thread does not get swapped out while holding the
handle, even past thread exit? I can believe this would be true, but do we have
informed guidance on this either from Microsoft or one of our Windows gurus? I
was unable to find anything definitive on this when I looked for it.

- This approach imposes the new assumption that SuspendThread is race-free with
respect to all the thread state we care about. Can we be confident that there
*aren't* some odd race conditions within SuspendThread where it could still
succeed despite the thread being partially torn down? Again, do we have informed
guidance on this either from Microsoft or one of our Windows gurus? What
about antivirus products that hook SuspendThread -- are they likely to either
get this right, or not fail often enough for it to be an issue? (This isn't a
theoretical concern: I've seen this scenario in crash dumps.)

- SuspendThread can fail for reasons other than thread exit. Can we be confident
that we can detect these "false positives" so we don't abort data
collection early in cases other than thread exit? If not, can we be confident
that we haven't introduced substantial skew or bias into the resulting data? Do
we have informed guidance on the failure modes of this API?

These are just the items off the top of my head -- it's very difficult to reason
about racy algorithms, so it's likely there are others that would come up in
review. Notably, we've barely considered how injected third party code would
interact with this approach. We also don't have code currently for the
GetThreadTimes() part of this approach, so it's hard to estimate what issues
would need to be considered there.

It's also important to consider the consequences of missing something or
glossing over potential issues. Any issues that make it into the code are likely
to result in generalized instability within Chrome if not system instability.
Reverse engineering causes of issues from crash dumps will be exceedingly
difficult, particularly if they're due to racy behavior. The cost of
investigating a single race bug is likely to dwarf the cost of implementing the
platform-agnostic solution.

> We can leave room for revising this in a follow-up CL if it ends up being
> needed for other platforms. We should make it very clear with at least
> comments - but maybe even #ifdef #error for non-Windows - that this needs to
> be considered when it comes time to port to another platform.

The trade off in terms of cross-platform support as I see it is this:

1. Assuming Brian's approach works, Windows gains multiple thread profiling
support. The cross-platform profiler code no longer works for Mac. In order to
bring up the profiler, the Mac developer has to implement and understand not
just the relatively constrained platform-dependent piece of the profiler, but
the platform-independent piece including substantial non-trivial threading and
synchronization concerns.

2. Assuming my platform-agnostic approach works, all platforms gain multiple
thread profiling support. Profiling scenarios other than thread-profiles-self
place responsibility on the entity initiating the profiling to ensure the
profiling stops before the thread exits. I have little concern about this caveat
since the thread scheduler generally should be able to provide this coordination
in the cases we care about.

bcwhite

3 years, 10 months ago (2017-02-07 19:18:40 UTC) #180

> > In this case I agree that we should go with the simpler solution in this CL
> > - since there are no holes in it for Windows - and this will allow us to get
> > this functionality sooner without very much technical debt.
>
> I do not agree that there are no races and no other holes in this for Windows
> -- this hasn't been established in this review, and there's a non-trivial
> amount of review and implementation effort left in order to do so. As you
> stated initially, it's not so simple, and if we're going to put that effort in,
> it would be best to try for a platform-agnostic solution.
>
> The areas of concern I'm currently aware of on Windows are:
>
> - Can we be confident that the thread does not get swapped out while holding
> the handle, even past thread exit? I can believe this would be true, but do we
> have informed guidance on this either from Microsoft or one of our Windows
> gurus? I was unable to find anything definitive on this when I looked for it.

http://stackoverflow.com/questions/14863919/does-a-thread-id-stay-unique-vali...

"So an identifier can only be reused after last thread handle is closed"

The answer is without a reference to support this but I'm inclined to believe it
since it makes sense: A handle is an open OS reference and the OS isn't going to
destroy and reuse an object to which open references exist. I've posted a
comment asking for said reference.

Have you found any documentation to the contrary?


> - This approach imposes the new assumption that SuspendThread is race-free
> with respect to all the thread state we care about. Can we be confident that
> there *aren't* some odd race conditions within SuspendThread where it could
> still succeed despite the thread being partially torn down? Again, do we have
> informed guidance on this either from Microsoft or one of our Windows gurus?
> What about AV's that hook SuspendThread -- are they likely to either get this
> right, or not fail often enough for it to be an issue? (This isn't a
> theoretical concern: I've seen this scenario in crash dumps.)

Since SuspendThread is mainly intended for debuggers (according to official MSDN
documentation) which cannot necessarily know in advance if the thread they're
trying to stop might have suddenly exited, I'm again inclined to believe that
it's going to be safe.


> - SuspendThread can fail for reasons other than thread exit. Can we be
> confident that we can detect these "false positives" so we don't abort
> data collection early in cases other than thread exit? If not, can we be
> confident that we haven't introduced substantial skew or bias into the
> resulting data? Do we have informed guidance on the failure modes of this API?

The Windows native sampler does not abort collection in the case of a failed
SuspendThread.  It just records an empty frame and will try again at the next
sampling interval.  This is unchanged from the previous working behavior so any
skew or bias encountered with the new code is already present in the old code.
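
For illustration, the behavior described above can be sketched in a
platform-neutral way. This is only a sketch under stated assumptions: `Sample`,
`TryCapture`, and `CollectSamples` are hypothetical names, not the profiler's
actual types, and the callback stands in for the SuspendThread-based capture.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// One captured stack. An empty `frames` vector stands for a failed capture.
struct Sample {
  std::vector<const void*> frames;
};

// Hypothetical stand-in for the platform capture step. Returns false when the
// target thread could not be suspended, for whatever reason.
using TryCapture = std::function<bool(std::vector<const void*>*)>;

// Collects `count` samples. A failed capture records an empty sample and
// sampling carries on at the next interval instead of aborting.
std::vector<Sample> CollectSamples(const TryCapture& try_capture,
                                   std::size_t count) {
  std::vector<Sample> samples;
  samples.reserve(count);
  for (std::size_t i = 0; i < count; ++i) {
    Sample sample;
    if (!try_capture(&sample.frames))
      sample.frames.clear();  // Keep the slot, but with no frames.
    samples.push_back(std::move(sample));
  }
  return samples;
}
```

With this shape, a failed suspension costs one empty entry and nothing else,
which is the property claimed for both the old and the new code.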


> These are just the items off the top of my head -- it's very difficult to
> reason about racy algorithms, so it's likely there are others that would come
> up in review. Notably, we've barely considered how injected third party code
> would interact with this approach. We also don't have code currently for the
> GetThreadTimes() part of this approach, so it's hard to estimate what issues
> would need to be considered there.

GetThreadTimes was abandoned last week as it's not necessary when thread IDs
cannot be reused.  The code that tried using that information was removed in
https://codereview.chromium.org/2554123002/#ps520001


> It's also important to consider the consequences of missing something or
> glossing over potential issues. Any issues that make it into the code are
> likely to result in generalized instability within Chrome if not system
> instability. Reverse engineering causes of issues from crash dumps will be
> exceedingly difficult, particularly if they're due to racy behavior. The cost
> of investigating a single race bug is likely to dwarf the cost of implementing
> the platform-agnostic solution.

That assumes that there are no race possibilities in the platform-agnostic
solution.  Given that it will require cross-thread communication and mutex
access just to cover the simplest A-samples-A-below case, I think that's a very
big assumption.


> > We can leave room for revising this in a follow-up CL if it ends up being
> > needed for other platforms. We should make it very clear with at least
> > comments - but maybe even #ifdef #error for non-Windows - that this needs to
> > be considered when it comes time to port to another platform.
> 
> The trade off in terms of cross-platform support as I see it is this:
> 
> 1. Assuming Brian's approach works, Windows gains multiple thread profiling
> support. The cross-platform profiler code no longer works for Mac. In order to
> bring up the profiler, the Mac developer has to implement and understand not
> just the relatively constrained platform-dependent piece of the profiler, but
> the platform-independent piece including substantial non-trivial threading and
> synchronization concerns.

On an adjacent CL, you had me change a class named "Common" to "StackBuffer"
because that was the only thing currently contained within it.  A generic
solution was made specific because that was all that was necessary for that CL. 
No consideration was given to perhaps a Mac solution needing something else. 
But here you're suggesting adding a huge piece of complex synchronization that
is unneeded for Windows in order to support something being written elsewhere.


> 2. Assuming my platform-agnostic approach works, all platforms gain multiple
> thread profiling support. Profiling scenarios other than thread-profiles-self
> place responsibility on the entity initiating the profiling to ensure the
> profiling stops before the thread exits. I have little concern about this
> caveat since the thread scheduler generally should be able to provide this
> coordination in the cases we care about.

The basic solution doesn't even fully support A-samples-A.  It supports only
A-samples-A-below (meaning until the current scope exits).  We'd have to remove
the static StartAndRunAsync methods, or perhaps limit them exclusively to the UI
thread.


I'm not saying that a platform-agnostic solution isn't of benefit.  I'm just
saying it shouldn't be done here.

I also believe that fixing only the A-samples-A-below case is insufficient and
that a full general solution to the A-samples-B case should be found in order to
avoid having to rewrite it later when a developer needs to profile exactly
that case.  But this is a discussion for that other CL.

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 10 months ago (2017-02-07 19:43:07 UTC) #181

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: mac_chromium_rel_ng on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/383266)

3 years, 10 months ago (2017-02-07 19:43:12 UTC) #182

Mike Wittman

3 years, 10 months ago (2017-02-07 21:44:28 UTC) #183

We still haven't reached definitive conclusions on the known open questions I
raised, which corroborates the belief that this approach does not provide a
simple slam-dunk solution on Windows.

On 2017/02/07 19:18:40, bcwhite wrote:
> > > In this case I agree that we should go with the simpler solution in this
> > > CL - since there are no holes in it for Windows - and this will allow us
> > > to get this functionality sooner without very much technical debt.
> >
> > I do not agree that there are no races and no other holes in this for
> > Windows -- this hasn't been established in this review, and there's a
> > non-trivial amount of review and implementation effort left in order to do
> > so. As you stated initially, it's not so simple, and if we're going to put
> > that effort in, it would be best to try for a platform-agnostic solution.
> >
> > The areas of concern I'm currently aware of on Windows are:
> >
> > - Can we be confident that the thread does not get swapped out while holding
> > the handle, even past thread exit? I can believe this would be true, but do
> > we have informed guidance on this either from Microsoft or one of our
> > Windows gurus? I was unable to find anything definitive on this when I
> > looked for it.
> 
> http://stackoverflow.com/questions/14863919/does-a-thread-id-stay-unique-vali...
> 
> "So an identifier can only be reused after last thread handle is closed"
> 
> The answer is without a reference to support this but I'm inclined to believe
> it since it makes sense: A handle is an open OS reference and the OS isn't
> going to destroy and reuse an object to which open references exist. I've
> posted a comment asking for said reference.
> 
> Have you found any documentation to the contrary?
> 
> 
> > - This approach imposes the new assumption that SuspendThread is race-free
> > with respect to all the thread state we care about. Can we be confident that
> > there *aren't* some odd race conditions within SuspendThread where it could
> > still succeed despite the thread being partially torn down? Again, do we
> > have informed guidance on this either from Microsoft or one of our Windows
> > gurus? What about AV's that hook SuspendThread -- are they likely to either
> > get this right, or not fail often enough for it to be an issue? (This isn't
> > a theoretical concern: I've seen this scenario in crash dumps.)
> 
> Since SuspendThread is mainly intended for debuggers (according to official
> MSDN documentation) which cannot necessarily know in advance if the thread
> they're trying to stop might have suddenly exited, I'm again inclined to
> believe that it's going to be safe.
> 
> 
> > - SuspendThread can fail for reasons other than thread exit. Can we be
> > confident that we can detect these "false positives" so we don't abort
> > data collection early in cases other than thread exit? If not, can we be
> > confident that we haven't introduced substantial skew or bias into the
> > resulting data? Do we have informed guidance on the failure modes of this
> > API?
>
> The Windows native sampler does not abort collection in the case of a failed
> SuspendThread.  It just records an empty frame and will try again at the next
> sampling interval.  This is unchanged from the previous working behavior so
> any skew or bias encountered with the new code is already present in the old
> code.

Lines 475-482 of stack_sampling_profiler.cc in patch set 17 certainly appear to
stop the profiling.

> > These are just the items off the top of my head -- it's very difficult to
> > reason about racy algorithms, so it's likely there are others that would
> > come up in review. Notably, we've barely considered how injected third party
> > code would interact with this approach. We also don't have code currently
> > for the GetThreadTimes() part of this approach, so it's hard to estimate
> > what issues would need to be considered there.
> 
> GetThreadTimes was abandoned last week as it's not necessary when thread IDs
> cannot be reused.  The code that tried using that information was removed in
> https://codereview.chromium.org/2554123002/#ps520001
> 
> 
> > It's also important to consider the consequences of missing something or
> > glossing over potential issues. Any issues that make it into the code are
> > likely to result in generalized instability within Chrome if not system
> > instability. Reverse engineering causes of issues from crash dumps will be
> > exceedingly difficult, particularly if they're due to racy behavior. The
> > cost of investigating a single race bug is likely to dwarf the cost of
> > implementing the platform-agnostic solution.
>
> That assumes that there are no race possibilities in the platform-agnostic
> solution.  Given that it will require cross-thread communication and mutex
> access just to cover the simplest A-samples-A-below case, I think that's a
> very big assumption.

The platform-agnostic solution uses the *exact same* synchronization point in
StackSamplingProfiler as the current single-thread implementation, which already
performs cross-thread communication using WaitableEvents, and is proven to work.
The only major difference between the current and proposed approach is whether
the profiling thread exits. (The proposed approach uses a WaitableEvent, not a
mutex.)

> > > We can leave room for revising this in a follow-up CL if it ends up being
> > > needed for other platforms. We should make it very clear with at least
> > > comments - but maybe even #ifdef #error for non-Windows - that this needs
> > > to be considered when it comes time to port to another platform.
> >
> > The trade off in terms of cross-platform support as I see it is this:
> >
> > 1. Assuming Brian's approach works, Windows gains multiple thread profiling
> > support. The cross-platform profiler code no longer works for Mac. In order
> > to bring up the profiler, the Mac developer has to implement and understand
> > not just the relatively constrained platform-dependent piece of the
> > profiler, but the platform-independent piece including substantial
> > non-trivial threading and synchronization concerns.
>
> On an adjacent CL, you had me change a class named "Common" to "StackBuffer"
> because that was the only thing currently contained within it.  A generic
> solution was made specific because that was all that was necessary for that
> CL.  No consideration was given to perhaps a Mac solution needing something
> else.

No, that's wrong. I have a very good idea of what is necessary for Mac, having
reviewed the initial NativeStackSampler implementation in
https://codereview.chromium.org/1346453004. My judgement, in that case and this
one, is strongly informed by what is needed for the platform given the likely
implementation.

> But here you're suggesting adding a huge piece of complex synchronization that
> is unneeded for Windows in order to support something being written elsewhere.

How can you conclude it's huge and complex if you haven't explored it? Since
it's based on the current single-thread approach it's unlikely to be
significantly more complicated than that, and the amount of code is likely to be
comparable to the Windows-only approach. 

> > 2. Assuming my platform-agnostic approach works, all platforms gain multiple
> > thread profiling support. Profiling scenarios other than
> > thread-profiles-self place responsibility on the entity initiating the
> > profiling to ensure the profiling stops before the thread exits. I have
> > little concern about this caveat since the thread scheduler generally should
> > be able to provide this coordination in the cases we care about.
>
> The basic solution doesn't even fully support A-samples-A.  It supports only
> A-samples-A-below (meaning until the current scope exits).  We'd have to
> remove the static StartAndRunAsync methods, or perhaps limit them exclusively
> to the UI thread.

Removing StartAndRunAsync would be fine with me. It's not used, and the purpose
for which it was implemented was found to be supportable using an object-owned
StackSamplingProfiler. In most if not all cases where people might be tempted to
use it, they'd be better off using Start and coordinating threads. That would be
better from a system design perspective since it would force the inter-thread
relationships to be explicit in the code.

> I'm not saying that a platform-agnostic solution isn't of benefit.  I'm just
> saying it shouldn't be done here.
> 
> I also disagree that fixing the A-samples-A-below case is insufficient and
> that a full general solution to the A-samples-B case should be found in order
> to avoid having to rewrite it later when a developer has need to profile
> exactly that case.  But this is a discussion for that other CL.

bcwhite

3 years, 10 months ago (2017-02-07 23:04:07 UTC) #184

On 2017/02/07 21:44:28, Mike Wittman wrote:
> We still haven't reached definitive conclusions on the known open questions I
> raised, which corroborates the belief that this approach does not provide a
> simple slam-dunk solution on Windows.

I provided my evidence and my reasoning.  If you have something definitive to
the contrary, please provide it.


> > > - SuspendThread can fail for reasons other than thread exit. Can we be
> > > confident that we can detect these "false positives" so we don't abort
> > > data collection early in cases other than thread exit? If not, can we be
> > > confident that we haven't introduced substantial skew or bias into the
> > > resulting data? Do we have informed guidance on the failure modes of this
> > > API?
> >
> > The Windows native sampler does not abort collection in the case of a failed
> > SuspendThread.  It just records an empty frame and will try again at the
> > next sampling interval.  This is unchanged from the previous working
> > behavior so any skew or bias encountered with the new code is already
> > present in the old code.
>
> Lines 475-482 of stack_sampling_profiler.cc in patch set 17 certainly appear
> to stop the profiling.

Yes.  I left that in to react to a thread having exited.  There seemed no reason
not to.  But the Windows native stack sampler never sets THREAD_EXITED because
it has no way to detect that the thread has exited.  Other OSes may have better
luck and be able to stop the sampling early.


> > > These are just the items off the top of my head -- it's very difficult to
> > > reason about racy algorithms, so it's likely there are others that would
> > > come up in review. Notably, we've barely considered how injected third
> > > party code would interact with this approach. We also don't have code
> > > currently for the GetThreadTimes() part of this approach, so it's hard to
> > > estimate what issues would need to be considered there.
> > 
> > GetThreadTimes was abandoned last week as it's not necessary when thread IDs
> > cannot be reused.  The code that tried using that information was removed in
> > https://codereview.chromium.org/2554123002/#ps520001
> > 
> > 
> > > It's also important to consider the consequences of missing something or
> > > glossing over potential issues. Any issues that make it into the code are
> > > likely to result in generalized instability within Chrome if not system
> > > instability. Reverse engineering causes of issues from crash dumps will be
> > > exceedingly difficult, particularly if they're due to racy behavior. The
> > > cost of investigating a single race bug is likely to dwarf the cost of
> > > implementing the platform-agnostic solution.
> >
> > That assumes that there are no race possibilities in the platform-agnostic
> > solution.  Given that it will require cross-thread communication and mutex
> > access just to cover the simplest A-samples-A-below case, I think that's a
> > very big assumption.
>
> The platform-agnostic solution uses the *exact same* synchronization point in
> StackSamplingProfiler as the current single-thread implementation, which
> already performs cross-thread communication using WaitableEvents, and is
> proven to work. The only major difference between the current and proposed
> approach is whether the profiling thread exits. (The proposed approach uses a
> WaitableEvent, not a mutex.)

Except (a) it's not needed for the CL I'm writing and (b) is not the best
solution (in my opinion).


> > > > We can leave room for revising this in a follow-up CL if it ends up
> > > > being needed for other platforms. We should make it very clear with at
> > > > least comments - but maybe even #ifdef #error for non-Windows - that
> > > > this needs to be considered when it comes time to port to another
> > > > platform.
> > >
> > > The trade off in terms of cross-platform support as I see it is this:
> > >
> > > 1. Assuming Brian's approach works, Windows gains multiple thread
> > > profiling support. The cross-platform profiler code no longer works for
> > > Mac. In order to bring up the profiler, the Mac developer has to implement
> > > and understand not just the relatively constrained platform-dependent
> > > piece of the profiler, but the platform-independent piece including
> > > substantial non-trivial threading and synchronization concerns.
> >
> > On an adjacent CL, you had me change a class named "Common" to "StackBuffer"
> > because that was the only thing currently contained within it.  A generic
> > solution was made specific because that was all that was necessary for that
> > CL.  No consideration was given to perhaps a Mac solution needing something
> > else.
>
> No, that's wrong. I have a very good idea of what is necessary for Mac, having
> reviewed the initial NativeStackSampler implementation in
> https://codereview.chromium.org/1346453004. My judgement, in that case and
> this one, is strongly informed by what is needed for the platform given the
> likely implementation.

But that's you.  You're asking me to write for something about which I have no
knowledge in a CL that doesn't need it.


> > But here you're suggesting adding a huge piece of complex synchronization
> > that is unneeded for Windows in order to support something being written
> > elsewhere.
>
> How can you conclude it's huge and complex if you haven't explored it? Since
> it's based on the current single-thread approach it's unlikely to be
> significantly more complicated than that, and the amount of code is likely to
> be comparable to the Windows-only approach.

Please do not assume that because I disagree with you that I haven't explored
the option.  The implementation I see is for the dtor to create a WaitableEvent
that gets passed to a StopTask() via a parameter to PostTask.  Stop() would
then wait on that event and StopTask() would signal it when ready.  Seems pretty
simple on the surface, though that can be misleading.
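
To make the shape of that idea concrete, here is a minimal sketch using only
the C++ standard library. Everything in it is an assumption for illustration:
`Event` is a stand-in for base::WaitableEvent, `SamplingThread` is a toy FIFO
task runner rather than Chrome's, and the `sampling` flag stands in for the
profiler's real capture state.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>

// Minimal stand-in for base::WaitableEvent (assumed manual-reset).
class Event {
 public:
  void Signal() {
    std::lock_guard<std::mutex> lock(mutex_);
    signaled_ = true;
    cv_.notify_all();
  }
  void Wait() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return signaled_; });
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool signaled_ = false;
};

// Hypothetical sampling thread that runs posted tasks in FIFO order.
class SamplingThread {
 public:
  SamplingThread() : thread_([this] { Run(); }) {}
  ~SamplingThread() {
    PostTask(nullptr);  // An empty task asks Run() to return.
    thread_.join();
  }
  void PostTask(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      tasks_.push_back(std::move(task));
    }
    cv_.notify_one();
  }

 private:
  void Run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !tasks_.empty(); });
        task = std::move(tasks_.front());
        tasks_.pop_front();
      }
      if (!task)
        return;
      task();
    }
  }

  std::mutex mutex_;
  std::condition_variable cv_;
  std::deque<std::function<void()>> tasks_;
  std::thread thread_;  // Declared last so the queue exists before it starts.
};

// Hypothetical profiler: Stop() blocks until the stop task has run.
class Profiler {
 public:
  Profiler(SamplingThread* thread, bool* sampling)
      : thread_(thread), sampling_(sampling) {
    thread_->PostTask([sampling] { *sampling = true; });
  }
  ~Profiler() { Stop(); }

  void Stop() {
    // Shared ownership keeps the event alive until both sides are done.
    auto stopped = std::make_shared<Event>();
    bool* sampling = sampling_;
    thread_->PostTask([sampling, stopped] {
      *sampling = false;  // Stands in for removing the active capture.
      stopped->Signal();
    });
    stopped->Wait();
  }

 private:
  SamplingThread* thread_;
  bool* sampling_;
};
```

In this sketch, once `~Profiler()` returns the sampling thread is guaranteed
to have processed the stop request, so no further sample can be attempted for
that capture. The cost is that the destructor blocks on another thread.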

It's not complex.  But it's not necessary for this CL.  And I believe it's an
inadequate solution because it fails to cover too many use-cases and thus will
eventually have to be removed in favor of something more complex.

Regardless of what I believe, though, it still doesn't belong in this CL.


> > > 2. Assuming my platform-agnostic approach works, all platforms gain
> > > multiple thread profiling support. Profiling scenarios other than
> > > thread-profiles-self place responsibility on the entity initiating the
> > > profiling to ensure the profiling stops before the thread exits. I have
> > > little concern about this caveat since the thread scheduler generally
> > > should be able to provide this coordination in the cases we care about.
> >
> > The basic solution doesn't even fully support A-samples-A.  It supports only
> > A-samples-A-below (meaning until the current scope exits).  We'd have to
> > remove the static StartAndRunAsync methods, or perhaps limit them
> > exclusively to the UI thread.
>
> Removing StartAndRunAsync would be fine with me. It's not used, and the
> purpose for which it was implemented was found to be supportable using an
> object-owned StackSamplingProfiler. In most if not all cases where people
> might be tempted to use it, they'd be better off using Start and coordinating
> threads. That would be better from a system design perspective since it would
> force the inter-thread relationships to be explicit in the code.

That's news to me.  If the interface supports nothing but A-samples-A-following
then it's a different ballgame.  It still doesn't belong in this CL but it's now
a reasonable solution.

Coded up quickly:
https://codereview.chromium.org/2680703004

It should probably still have a DCHECK that sampled-thread == current-thread but
I gotta go take my son to Judo.  :-)

> 
> > I'm not saying that a platform-agnostic solution isn't of benefit.  I'm just
> > saying it shouldn't be done here.
> > 
> > I also disagree that fixing the A-samples-A-below case is insufficient and
> > that a full general solution to the A-samples-B case should be found in
> > order to avoid having to rewrite it later when a developer has need to
> > profile exactly that case.  But this is a discussion for that other CL.

Mike Wittman

3 years, 10 months ago (2017-02-10 01:36:09 UTC) #185

On 2017/02/07 23:04:07, bcwhite wrote:
> But the Windows native stack sampler never sets THREAD_EXITED because
> it has no way to detect that the thread has exited.

If this approach can't tell us when the thread has exited, then it doesn't solve
the problem at hand.

Generating samples for a thread must stop after thread exit. Otherwise the extra
samples will skew the results.

Continuing profiler execution may also waste power due to unnecessary wakeups.

> > > No consideration was given to perhaps a Mac solution needing something
> > > else.
> >
> > No, that's wrong. I have a very good idea of what is necessary for Mac,
> > having reviewed the initial NativeStackSampler implementation in
> > https://codereview.chromium.org/1346453004. My judgement, in that case and
> > this one, is strongly informed by what is needed for the platform given the
> > likely implementation.
> 
> But that's you.  You're asking me to write for something about which I have no
> knowledge in a CL that doesn't need it.

Understanding the broader needs of the code and asking reviewees to address them
is precisely my responsibility as OWNER and reviewer.

There's no special knowledge needed to write the platform-agnostic code since
the constraints on Mac are exactly the same as on Windows. That's the whole
point of having platform-agnostic code.

> Please do not assume that because I disagree with you that I haven't explored
> the option.

I won't assume you haven't explored the option if you don't assume I haven't
thoroughly considered the larger context for this change. :)

> > Removing StartAndRunAsync would be fine with me. It's not used, and the
> > purpose for which it was implemented was found to be supportable using an
> > object-owned StackSamplingProfiler. In most if not all cases where people
> > might be tempted to use it, they'd be better off using Start and
> > coordinating threads. That would be better from a system design perspective
> > since it would force the inter-thread relationships to be explicit in the
> > code.
>
> That's news to me.  If the interface supports nothing but
> A-samples-A-following then it's a different ballgame.  It still doesn't belong
> in this CL but it's now a reasonable solution.
>
> Coded up quickly:
> https://codereview.chromium.org/2680703004
>
> It should probably still have a DCHECK that sampled-thread == current-thread
> but I gotta go take my son to Judo.  :-)

That looks like a reasonable start to a platform-agnostic solution.

> > > I'm not saying that a platform-agnostic solution isn't of benefit.  I'm
> > > just saying it shouldn't be done here.
> > >
> > > I also disagree that fixing the A-samples-A-below case is insufficient and
> > > that a full general solution to the A-samples-B case should be found in
> > > order to avoid having to rewrite it later when a developer has need to
> > > profile exactly that case.  But this is a discussion for that other CL.

As I've mentioned, we don't need to have a general case solution to the
A-samples-B problem because it can be addressed external to the profiler for the
likely use cases in Chrome. Doing so is even preferable because it will lead to
better system design around threading. I'm afraid you're just going to have to
believe me on this. :)

bcwhite

3 years, 10 months ago (2017-02-10 14:47:18 UTC) #186

> > But the Windows native stack sampler never sets THREAD_EXITED because
> > it has no way to detect that the thread has exited.
> 
> If this approach can't tell us when the thread has exited, then it doesn't
> solve the problem at hand.

No, but it provides a mechanism for a native sampler that *can* detect the exit
of a thread to report such and have sampling stop in that case.
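
As a rough sketch of that mechanism, assuming a three-way capture result:
`CaptureResult`, `SamplingStats`, and `RunSampling` are hypothetical names for
this illustration, and only the THREAD_EXITED idea itself comes from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical result of one capture attempt. kThreadExited mirrors the
// THREAD_EXITED state: a native sampler reports it only if it can detect it.
enum class CaptureResult { kOk, kSuspendFailed, kThreadExited };

struct SamplingStats {
  std::size_t recorded = 0;    // Samples with frames.
  std::size_t empty = 0;       // Samples recorded as empty after a failure.
  bool stopped_early = false;  // True if the sampler reported thread exit.
};

// Runs up to `max_samples` capture attempts. A plain failure is recorded as
// an empty sample and sampling continues; a reported thread exit ends the run.
SamplingStats RunSampling(const std::function<CaptureResult()>& capture,
                          std::size_t max_samples) {
  SamplingStats stats;
  for (std::size_t i = 0; i < max_samples; ++i) {
    switch (capture()) {
      case CaptureResult::kOk:
        ++stats.recorded;
        break;
      case CaptureResult::kSuspendFailed:
        ++stats.empty;
        break;
      case CaptureResult::kThreadExited:
        stats.stopped_early = true;
        return stats;
    }
  }
  return stats;
}
```

A sampler that never returns `kThreadExited`, as described for Windows, simply
never takes the early-return path, so the hook is inert there.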

I'll remove it if you prefer but since it was already written, and a seemingly
useful feature, I left it in.


> Generating samples for a thread must stop after thread exit. Otherwise the
> extra samples will skew the results.
> 
> Continuing profiler execution may also waste power due to unnecessary wakeups.

But it does stop!  Destruction of the profiler requests the stop of the
sampling.  It just doesn't wait for it to stop.

In the rare case where somebody creates a profiler on a thread that samples
itself until its own death then it's possible that one sample may occur after
the thread dies but before the sampling thread gets around to processing the
"Remove" task that was posted to it.

Note that because the posted task is not delayed, it will come before any
pending sampling tasks which means that the thread under test would have to post
the task and exit completely before an already-started "RecordSample" task on
the sampling thread actually gets around to trying to suspend the thread.

So yes, it *can* happen without "join synchronization" but it would be an
amazingly rare occurrence with the only problem being a single empty stack frame
recorded at the end.  Such an occurrence would be nothing but noise, if it were
to actually happen at all, of a threshold far below the general variation of the
sampling itself.
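
The ordering argument can be modeled with a small simulation. This is a
simplified model with a virtual clock, not Chromium's task runner:
`FakeTaskQueue` and its methods are hypothetical, and tasks are reduced to
names so that only the run order is visible.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <string>
#include <utility>
#include <vector>

struct PendingTask {
  std::int64_t run_at_ms;  // Virtual time at which the task becomes runnable.
  std::uint64_t sequence;  // Tie-breaker: earlier posts run first.
  std::string name;
};

// Orders the priority queue so the earliest runnable task is on top.
struct RunsLater {
  bool operator()(const PendingTask& a, const PendingTask& b) const {
    if (a.run_at_ms != b.run_at_ms)
      return a.run_at_ms > b.run_at_ms;
    return a.sequence > b.sequence;
  }
};

// Hypothetical single-threaded task queue driven by a virtual clock.
class FakeTaskQueue {
 public:
  void PostDelayedTask(std::string name, std::int64_t delay_ms) {
    PendingTask task{now_ms_ + delay_ms, next_sequence_++, std::move(name)};
    tasks_.push(std::move(task));
  }
  void PostTask(std::string name) { PostDelayedTask(std::move(name), 0); }

  // Runs every task in order and returns the names in the order they ran.
  std::vector<std::string> RunAll() {
    std::vector<std::string> order;
    while (!tasks_.empty()) {
      PendingTask task = tasks_.top();
      tasks_.pop();
      if (task.run_at_ms > now_ms_)
        now_ms_ = task.run_at_ms;  // Advance the clock to the task's time.
      order.push_back(task.name);
    }
    return order;
  }

 private:
  std::int64_t now_ms_ = 0;
  std::uint64_t next_sequence_ = 0;
  std::priority_queue<PendingTask, std::vector<PendingTask>, RunsLater> tasks_;
};
```

In this model a "Remove" task posted with no delay runs before a
"RecordSample" task that is still waiting out its delay. The remaining window
is a RecordSample that has already started running when Remove is posted,
which the model does not cover.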


> > > > No consideration was given to perhaps a Mac solution needing something
> > > > else.
> > >
> > > No, that's wrong. I have a very good idea of what is necessary for Mac,
> > > having reviewed the initial NativeStackSampler implementation in
> > > https://codereview.chromium.org/1346453004. My judgement, in that case and
> > > this one, is strongly informed by what is needed for the platform given
> > > the likely implementation.
> >
> > But that's you.  You're asking me to write for something about which I have
> > no knowledge in a CL that doesn't need it.
>
> Understanding the broader needs of the code and asking reviewees to address
> them is precisely my responsibility as OWNER and reviewer.

Sure.  Nobody is saying that these things aren't important.


> There's no special knowledge needed to write the platform-agnostic code since
> the constraints on Mac are exactly the same as on Windows. That's the whole
> point of having platform-agnostic code.

But it's unnecessary *here* which is why it should be in a separate CL.  A
separate CL where a proper discussion of what is necessary can be held and the
intricacies of it can be explored in its own context.


> > Please do not assume that because I disagree with you that I haven't
> > explored the option.
> 
> I won't assume you haven't explored the option if you don't assume I haven't
> thoroughly considered the larger context for this change. :)

Nobody is arguing that you haven't.  I'm only arguing that it's unnecessary here
and should be done in a different CL.


> > > Removing StartAndRunAsync would be fine with me. It's not used, and the
> > > purpose for which it was implemented was found to be supportable using an
> > > object-owned StackSamplingProfiler. In most if not all cases where people
> > > might be tempted to use it, they'd be better off using Start and
> > > coordinating threads. That would be better from a system design
> > > perspective since it would force the inter-thread relationships to be
> > > explicit in the code.
> > 
> > That's news to me.  If the interface supports nothing but
> > A-samples-A-following then it's a different ballgame.  It still doesn't
> > belong in this CL but it's now a reasonable solution.
> > 
> > Coded up quickly:
> > https://codereview.chromium.org/2680703004
> > 
> > It should probably still have a DCHECK that sampled-thread == current-thread
> > but I gotta go take my son to Judo.  :-)
> 
> That looks like a reasonable start to a platform-agnostic solution.

Comments welcome.  Happy to get it done.


> > > > I'm not saying that a platform-agnostic solution isn't of benefit.  I'm
> > > > just saying it shouldn't be done here.
> > > > 
> > > > I also disagree that fixing the A-samples-A-below case is insufficient
> > > > and that a full general solution to the A-samples-B case should be found
> > > > in order to avoid having to rewrite it later when a developer has need to
> > > > profile exactly that case.  But this is a discussion for that other CL.
> 
> As I've mentioned, we don't need to have a general case solution to the
> A-samples-B problem because it can be addressed external to the profiler for
> the likely use cases in Chrome. Doing so is even preferable because it will
> lead to better system design around threading. I'm afraid you're just going to
> have to believe me on this. :)

Fine.   But let's discuss it on another CL so this one can start testing.

Mike Wittman

On 2017/02/10 14:47:18, bcwhite wrote: > > > But the Windows native stack sampler never ...

3 years, 10 months ago (2017-02-10 17:28:02 UTC) #187

bcwhite

3 years, 10 months ago (2017-02-10 18:36:06 UTC) #188

> > > If this approach can't tell us when the thread has exited, then it doesn't
> > > solve the problem at hand.
> > 
> > No, but it provides a mechanism for a native sampler that *can* detect the
> > exit of a thread to report such and have sampling stop in that case.
> > 
> > I'll remove it if you prefer but since it was already written, and a
> > seemingly useful feature, I left it in.
> 
> By "this approach" I mean the entire strategy of handling thread exit by
> relying on SuspendThread failing.

Again, there is no strategy of handling thread-exit by SuspendThread failing.  I
tried that and removed it a week or two ago.  Now if a thread exits, it'll just
append empty frames.


> > > Generating samples for a thread must stop after thread exit. Otherwise the
> > > extra samples will skew the results.
> > > 
> > > Continuing profiler execution may also waste power due to unnecessary
> > > wakeups.
> > 
> > But it does stop!  Destruction of the profiler requests the stop of the
> > sampling.  It just doesn't wait for it to stop.
> 
> It does not stop in the A-samples-B case, where B exits.

Why bring that up when you want to remove support for such?

Yes, in that case you could end up with many empty frames at the end of the
sample.  Such could easily be pruned, either in Chrome or on the server, if you
feel it's a real problem.

There are no stability issues, however: the sampler won't end up sampling some
other thread with the same ID, because on Windows open handles prevent the ID
from being reused.


> > > There's no special knowledge needed to write the platform-agnostic code
> > > since the constraints on Mac are exactly the same as on Windows. That's the
> > > whole point of having platform-agnostic code.
> > 
> > But it's unnecessary *here* which is why it should be in a separate CL.  A
> > separate CL where a proper discussion of what is necessary can be held and
> > the intricacies of it can be explored in its own context.
> 
> It's premature to consider what would be done in any follow-on CLs when we
> don't even know if the approach in the current CL is viable.

I believe it is, and have provided evidence and reasoning to support it. I have
no evidence to the contrary.

Mike Wittman

3 years, 10 months ago (2017-02-10 21:00:17 UTC) #189

On 2017/02/10 18:36:06, bcwhite wrote:
> > > > If this approach can't tell us when the thread has exited, then it
> > > > doesn't solve the problem at hand.
> > > 
> > > No, but it provides a mechanism for a native sampler that *can* detect the
> > > exit of a thread to report such and have sampling stop in that case.
> > > 
> > > I'll remove it if you prefer but since it was already written, and a
> > > seemingly useful feature, I left it in.
> > 
> > By "this approach" I mean the entire strategy of handling thread exit by
> > relying on SuspendThread failing.
> 
> Again, there is no strategy of handling thread-exit by SuspendThread failing.
> I tried that and removed it a week or two ago.  Now if a thread exits, it'll
> just append empty frames.

Huh? How does the empty frame get appended in the thread exit case if not by the
SuspendThread call failing?

> > > > Generating samples for a thread must stop after thread exit. Otherwise
> > > > the extra samples will skew the results.
> > > > 
> > > > Continuing profiler execution may also waste power due to unnecessary
> > > > wakeups.
> > > 
> > > But it does stop!  Destruction of the profiler requests the stop of the
> > > sampling.  It just doesn't wait for it to stop.
> > 
> > It does not stop in the A-samples-B case, where B exits.
> 
> Why bring that up when you want to remove support for such?

I didn't say I want to remove support for A-samples-B. I said we don't need a
*general case* solution for A-samples-B, where there is no relationship between
A and B other than the profiling. The other cases of A-samples-B can be handled
external to the profiler by having the profiler user do the necessary thread
synchronization.
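
As a rough illustration of "external" synchronization, here is a minimal sketch, assuming thread A owns thread B's lifetime. FakeProfiler and ProfileWorkerWithExternalSync are hypothetical names made up for the example and are not part of StackSamplingProfiler; the only point is the ordering: stop sampling, then let B exit, then join.

```cpp
#include <atomic>
#include <thread>

// Hypothetical stand-in for a profiler that samples a thread other than the
// one that owns it. It records only whether the target was still running when
// sampling stopped; a real profiler would be walking the target's stack.
class FakeProfiler {
 public:
  explicit FakeProfiler(const std::atomic<bool>& target_running)
      : target_running_(target_running) {}

  void Start() { sampling_ = true; }

  void Stop() {
    if (sampling_)
      target_running_at_stop_ = target_running_.load();
    sampling_ = false;
  }

  bool target_running_at_stop() const { return target_running_at_stop_; }

 private:
  const std::atomic<bool>& target_running_;
  bool sampling_ = false;
  bool target_running_at_stop_ = false;
};

// Thread A (the caller) profiles thread B. Because A controls when B may
// exit, A can guarantee that sampling has stopped before B goes away.
bool ProfileWorkerWithExternalSync() {
  std::atomic<bool> b_running{false};
  std::atomic<bool> b_may_exit{false};

  std::thread b([&] {
    b_running = true;
    while (!b_may_exit)
      std::this_thread::yield();  // Stand-in for B's real work.
    b_running = false;
  });

  while (!b_running)
    std::this_thread::yield();  // Wait until B is up before profiling it.

  FakeProfiler profiler(b_running);
  profiler.Start();
  profiler.Stop();    // 1. Stop sampling B ...
  b_may_exit = true;  // 2. ... only then allow B to exit ...
  b.join();           // 3. ... and wait for it to finish.
  return profiler.target_running_at_stop();
}
```

With this ordering the profiler never observes an exited target, so no post-exit samples can be produced regardless of what the profiler itself can detect.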

> Yes, in that case you could end up with many empty frames at the end of the
> sample.  Such could easily be pruned, either in Chrome or on the server, if
> you feel it's a real problem.

The empty samples* cannot be pruned because it's not possible to know which of
them are the result of thread exit. Some or all can be the result of the
transient issues detected by SuspendThreadAndRecordStack.

Treating them all as thread exit samples does not work because it throws out the
valid data represented by the transient-issue samples. Treating them all as
transient-issue samples also does not work because it treats the bogus
post-thread-exit samples as valid data.

Either way would skew the results, and we would be blind to the severity of the
problem.

* To be clear: the scenario results in empty samples at the end of the
collection rather than empty frames at the end of the sample.
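
To make the ambiguity concrete, here is a minimal sketch, assuming a simplified model in which a failed stack walk is recorded as a sample with no frames. Sample, Profile and the two scenario functions are hypothetical and only illustrate the argument; they are not the profiler's real types.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-ins for the profiler's data types.
struct Sample {
  std::vector<std::uintptr_t> frames;  // Empty when the stack walk failed.
};
using Profile = std::vector<Sample>;

// Counts the empty samples at the end of a collection. Nothing recorded in
// the profile says *why* any of them is empty.
std::size_t CountTrailingEmptySamples(const Profile& profile) {
  std::size_t count = 0;
  for (auto it = profile.rbegin();
       it != profile.rend() && it->frames.empty(); ++it) {
    ++count;
  }
  return count;
}

// Scenario 1: the thread is still alive, but the last two attempts to
// suspend it failed for transient reasons.
Profile TransientFailureProfile() {
  return {Sample{{0x1000, 0x2000}}, Sample{}, Sample{}};
}

// Scenario 2: the thread exited after the first sample, so the last two
// attempts could not succeed.
Profile ThreadExitProfile() {
  return {Sample{{0x1000, 0x2000}}, Sample{}, Sample{}};
}
```

Both scenarios yield byte-for-byte identical collections, so any pruning rule applied after the fact has to treat them the same way.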

> > > > There's no special knowledge needed to write the platform-agnostic code
> > > > since the constraints on Mac are exactly the same as on Windows. That's
> > > > the whole point of having platform-agnostic code.
> > > 
> > > But it's unnecessary *here* which is why it should be in a separate CL.  A
> > > separate CL where a proper discussion of what is necessary can be held and
> > > the intricacies of it can be explored in its own context.
> > 
> > It's premature to consider what would be done in any follow-on CLs when we
> > don't even know if the approach in the current CL is viable.
> 
> I believe it is, and have provided evidence and reasoning to support it. I
> have no evidence to the contrary.

Understood, and I've already stated why I find this evidence and reasoning
insufficient.

bcwhite

3 years, 10 months ago (2017-02-11 02:57:10 UTC) #190

> > > By "this approach" I mean the entire strategy of handling thread exit by
> > > relying on SuspendThread failing.
> > 
> > Again, there is no strategy of handling thread-exit by SuspendThread
> > failing.  I tried that and removed it a week or two ago.  Now if a thread
> > exits, it'll just append empty frames.
> 
> Huh? How does the empty frame get appended in the thread exit case if not by
> the SuspendThread call failing?

In exactly that way: by SuspendThread failing.  A failing SuspendThread call
results in an empty sample.  I experimented with treating that failure as a
thread exit but, due to your concern that it may be only a transient error, I
removed it.

I found no way for NativeStackSamplerWin to detect that a thread has exited, so
that condition never gets set.  Other OSes may have that ability and could set
it.


> > > > > Generating samples for a thread must stop after thread exit. Otherwise
> > > > > the extra samples will skew the results.
> > > > > 
> > > > > Continuing profiler execution may also waste power due to unnecessary
> > > > > wakeups.
> > > > 
> > > > But it does stop!  Destruction of the profiler requests the stop of the
> > > > sampling.  It just doesn't wait for it to stop.
> > > 
> > > It does not stop in the A-samples-B case, where B exits.
> > 
> > Why bring that up when you want to remove support for such?
> 
> I didn't say I want to remove support for A-samples-B. I said we don't need a
> *general case* solution for A-samples-B, where there is no relationship
> between A and B other than the profiling. The other cases of A-samples-B can
> be handled external to the profiler by having the profiler user do the
> necessary thread synchronization.

Okay.


> > Yes, in that case you could end up with many empty frames at the end of the
> > sample.  Such could easily be pruned, either in Chrome or on the server, if
> > you feel it's a real problem.
> 
> The empty samples* cannot be pruned because it's not possible to know which of
> them are the result of thread exit. Some or all can be the result of the
> transient issues detected by SuspendThreadAndRecordStack.

Yes, it's possible that the final empty samples are due to a transient failure
but dropping them anyway isn't really going to cause any more confusion than
leaving them in.


> Treating them all as thread exit samples does not work because it throws out
> the valid data represented by the transient-issue samples. Treating them all
> as transient-issue samples also does not work because it treats the bogus
> post-thread-exit samples as valid data.
> 
> Either way would skew the results, and we would be blind to the severity of
> the problem.

But that's not something we have to worry about here because, as you said above,
management of A-samples-B is to be handled by external synchronization and this
is only an A-samples-B issue.


> * To be clear: the scenario results in empty samples at the end of the
> collection rather than empty frames at the end of the sample.

Right.


> > > > > There's no special knowledge needed to write the platform-agnostic
> > > > > code since the constraints on Mac are exactly the same as on Windows.
> > > > > That's the whole point of having platform-agnostic code.
> > > > 
> > > > But it's unnecessary *here* which is why it should be in a separate CL.
> > > > A separate CL where a proper discussion of what is necessary can be held
> > > > and the intricacies of it can be explored in its own context.
> > > 
> > > It's premature to consider what would be done in any follow-on CLs when we
> > > don't even know if the approach in the current CL is viable.
> > 
> > I believe it is, and have provided evidence and reasoning to support it. I
> > have no evidence to the contrary.
> 
> Understood, and I've already stated why I find this evidence and reasoning
> insufficient.

We should talk about this on a VC Monday with Alexei because we're just going in
circles here.

Mike Wittman

On 2017/02/11 02:57:10, bcwhite wrote: > > Treating them all as thread exit samples does ...

3 years, 10 months ago (2017-02-13 18:12:44 UTC) #191

bcwhite

> > > Treating them all as thread exit samples does not work because it ...

3 years, 10 months ago (2017-02-13 18:25:38 UTC) #192

Mike Wittman

3 years, 10 months ago (2017-02-13 18:32:12 UTC) #193

On 2017/02/13 18:25:38, bcwhite wrote:
> > > > Treating them all as thread exit samples does not work because it throws
> > > > out the valid data represented by the transient-issue samples. Treating
> > > > them all as transient-issue samples also does not work because it treats
> > > > the bogus post-thread-exit samples as valid data.
> > > > 
> > > > Either way would skew the results, and we would be blind to the severity
> > > > of the problem.
> > > 
> > > But that's not something we have to worry about here because, as you said
> > > above, management of A-samples-B is to be handled by external
> > > synchronization and this is only an A-samples-B issue.
> > 
> > That's correct, if we have the StackSamplingProfiler/SamplingThread
> > synchronization in place at StackSamplingProfiler destruction (or otherwise
> > pre-thread-exit). We need to have that implementation in place to avoid this
> > scenario.
> 
> The current implementation is not synchronized but it's damned close with the
> extremely unlikely case with at most one sample being taken after the dtor
> returns.  This is safe under Windows (empty sample should the thread have had
> time to exit) but unknown for future OS implementations.

We saw intermittent crashes in the profiler when running tests on the buildbots,
before the Join implementation was in place, so I don't think we can conclude
it's sufficiently close to synchronized. It's also completely unsynchronized in
the A-samples-B case.

> > > We should talk about this on a VC Monday with Alexei because we're just
> > > going in circles here.
> > 
> > I think we have come full circle -- if I understand correctly we've just
> > established that the current proposed approach to handling thread exit is
> > exactly functionally equivalent to the existing implementation.
> > 
> > The main remaining piece left to make the thread exit piece work is the
> > StackSamplingProfiler/SamplingThread synchronization. If we can do that and
> > some code cleanup, then I'm good for that part of the review. Then the
> > remaining piece to get this thing done is the review of the shutdown
> > behavior.
> 
> Synchronization is done in draft.  It's here:
> https://codereview.chromium.org/2680703004

Yes, I think we need to move that into this CL.

bcwhite

> > > > > Treating them all as thread exit samples does not work ...

3 years, 10 months ago (2017-02-13 18:46:39 UTC) #194

Mike Wittman

3 years, 10 months ago (2017-02-13 18:58:43 UTC) #195

On 2017/02/13 18:46:39, bcwhite wrote:
> > > > > > Treating them all as thread exit samples does not work because it
> > > > > > throws out the valid data represented by the transient-issue
> > > > > > samples. Treating them all as transient-issue samples also does not
> > > > > > work because it treats the bogus post-thread-exit samples as valid
> > > > > > data.
> > > > > > 
> > > > > > Either way would skew the results, and we would be blind to the
> > > > > > severity of the problem.
> > > > > 
> > > > > But that's not something we have to worry about here because, as you
> > > > > said above, management of A-samples-B is to be handled by external
> > > > > synchronization and this is only an A-samples-B issue.
> > > > 
> > > > That's correct, if we have the StackSamplingProfiler/SamplingThread
> > > > synchronization in place at StackSamplingProfiler destruction (or
> > > > otherwise pre-thread-exit). We need to have that implementation in place
> > > > to avoid this scenario.
> > > 
> > > The current implementation is not synchronized but it's damned close with
> > > the extremely unlikely case with at most one sample being taken after the
> > > dtor returns.  This is safe under Windows (empty sample should the thread
> > > have had time to exit) but unknown for future OS implementations.
> > 
> > We saw intermittent crashes in the profiler when running tests on the
> > buildbots, before the Join implementation was in place, so I don't think we
> > can conclude it's sufficiently close to synchronized.
> 
> Were those crashes due to the sampled-thread exiting while under test or the
> sampling-thread continuing to operate?

They were due to the sampled-thread exiting while under test.

> > It's also completely unsynchronized in
> > the A-samples-B case.
> 
> Yes.  There is no fix for that since they're independent and thus either could
> exit at any time.

Right, which is why we need the synchronization in this CL.

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 10 months ago (2017-02-13 20:16:10 UTC) #196

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/580001

3 years, 10 months ago (2017-02-13 20:17:19 UTC) #197

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 10 months ago (2017-02-13 21:44:59 UTC) #199

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: linux_chromium_chromeos_ozone_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_chromeos_ozone_rel_ng/builds/320732)

3 years, 10 months ago (2017-02-13 21:45:00 UTC) #200

Mike Wittman

On 2017/02/13 21:08:52, bcwhite wrote: > Sync-stop CL merged into this one. PTAL Thanks. https://codereview.chromium.org/2554123002/diff/580001/base/profiler/native_stack_sampler.h ...

3 years, 10 months ago (2017-02-13 22:35:57 UTC) #201

Mike Wittman

We'll probably need to special case the ThreadRestrictions::AssertWaitAllowed implementation for this to address the test ...

3 years, 10 months ago (2017-02-13 22:41:04 UTC) #202

Mike Wittman

On 2017/02/13 22:41:04, Mike Wittman wrote: > We'll probably need to special case the ThreadRestrictions::AssertWaitAllowed ...

3 years, 10 months ago (2017-02-13 22:43:21 UTC) #203

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 10 months ago (2017-02-14 14:21:38 UTC) #204

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/600001

3 years, 10 months ago (2017-02-14 14:22:08 UTC) #205

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 10 months ago (2017-02-14 14:25:42 UTC) #206

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/620001

3 years, 10 months ago (2017-02-14 14:26:16 UTC) #207

bcwhite

https://codereview.chromium.org/2554123002/diff/580001/base/profiler/native_stack_sampler.h File base/profiler/native_stack_sampler.h (right): https://codereview.chromium.org/2554123002/diff/580001/base/profiler/native_stack_sampler.h#newcode24 base/profiler/native_stack_sampler.h:24: // The thread state as determined by the sampler ...

3 years, 10 months ago (2017-02-14 14:33:02 UTC) #209

bcwhite

bcwhite@chromium.org changed reviewers: + brettw@chromium.org

3 years, 10 months ago (2017-02-14 14:37:01 UTC) #210

bcwhite

brettw@chromium.org: Please review changes in base/threading/thread_restrictions.h Rationale is provided here: https://codereview.chromium.org/2554123002/diff/620001/base/profiler/stack_sampling_profiler.cc line 614 // The ...

3 years, 10 months ago (2017-02-14 14:37:03 UTC) #211

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 10 months ago (2017-02-14 14:39:14 UTC) #212

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/640001

3 years, 10 months ago (2017-02-14 14:39:39 UTC) #213

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 10 months ago (2017-02-14 16:08:26 UTC) #214

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: linux_chromium_chromeos_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_chromeos_rel_ng/builds/364445)

3 years, 10 months ago (2017-02-14 16:08:28 UTC) #215

Mike Wittman

On 2017/02/14 14:37:03, bcwhite wrote: > mailto:brettw@chromium.org: Please review changes in > base/threading/thread_restrictions.h > > ...

3 years, 10 months ago (2017-02-14 16:17:34 UTC) #216

bcwhite

On 2017/02/14 16:17:34, Mike Wittman wrote: > On 2017/02/14 14:37:03, bcwhite wrote: > > mailto:brettw@chromium.org: ...

3 years, 10 months ago (2017-02-14 17:32:44 UTC) #217

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 10 months ago (2017-02-14 17:34:35 UTC) #218

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/660001

3 years, 10 months ago (2017-02-14 17:38:07 UTC) #219

Mike Wittman

On 2017/02/14 17:32:44, bcwhite wrote: > On 2017/02/14 16:17:34, Mike Wittman wrote: > > On ...

3 years, 10 months ago (2017-02-14 17:50:56 UTC) #220

Mike Wittman

https://codereview.chromium.org/2554123002/diff/640001/base/profiler/stack_sampling_profiler.h File base/profiler/stack_sampling_profiler.h (right): https://codereview.chromium.org/2554123002/diff/640001/base/profiler/stack_sampling_profiler.h#newcode200 base/profiler/stack_sampling_profiler.h:200: // IMPORTANT: This should generally be created on the ...

3 years, 10 months ago (2017-02-14 17:52:32 UTC) #221

https://codereview.chromium.org/2554123002/diff/640001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.h (right):

https://codereview.chromium.org/2554123002/diff/640001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.h:200: // IMPORTANT: This should generally
be created on the local stack (i.e. NOT
This guidance is more conservative than necessary. I think it's sufficient to
say that the object must be destroyed before thread exit.

The general expectation in Chrome is that non-singleton objects are destroyed on
the thread where they are created. Not destroying would be a memory leak, so the
restriction for this constructor is fairly unremarkable and probably doesn't
justify an IMPORTANT label.

https://codereview.chromium.org/2554123002/diff/640001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.h:211: // IMPORTANT: Only threads
guaranteed to live beyond the lifetime of the
It would be best to move the prescriptive advice to the top, and drop the text
about the current thread within this comment.

Maybe something like:

IMPORTANT: Users of this interface must ensure the specified thread outlives
this object. Otherwise the profiler will continue trying to profile past thread
exit, resulting in crashes or worse.

https://codereview.chromium.org/2554123002/diff/640001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.h:241: static void Shutdown();
It's worth considering process shutdown now that we have a solution for thread
exit.

The destruction of the profilers before profiled thread exit will ensure we
won't be actively profiling at process shutdown, so we don't need to worry about
the profiled threads at shutdown.

The guidance from the thread scheduler team is that we should not be trying to
terminate the profiler thread at process shutdown, because it's just unnecessary
work. So we don't need to do anything special there either.

I'm not aware of anything else that would need to happen at shutdown, so I think
we can remove this function and the next one.

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 10 months ago (2017-02-14 19:31:07 UTC) #222

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: linux_chromium_chromeos_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_chromeos_rel_ng/builds/364549)

3 years, 10 months ago (2017-02-14 19:31:10 UTC) #223

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 10 months ago (2017-02-14 20:10:52 UTC) #224

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/680001

3 years, 10 months ago (2017-02-14 20:11:54 UTC) #225

bcwhite

https://codereview.chromium.org/2554123002/diff/640001/base/profiler/stack_sampling_profiler.h File base/profiler/stack_sampling_profiler.h (right): https://codereview.chromium.org/2554123002/diff/640001/base/profiler/stack_sampling_profiler.h#newcode200 base/profiler/stack_sampling_profiler.h:200: // IMPORTANT: This should generally be created on the ...

3 years, 10 months ago (2017-02-14 20:13:27 UTC) #227

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 10 months ago (2017-02-14 20:36:18 UTC) #228

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/700001

3 years, 10 months ago (2017-02-14 20:36:56 UTC) #229

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 10 months ago (2017-02-14 22:10:14 UTC) #230

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: android_n5x_swarming_rel on master.tryserver.chromium.android (JOB_FAILED, https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_swarming_rel/builds/118830)

3 years, 10 months ago (2017-02-14 22:10:15 UTC) #231

Mike Wittman

Still thinking through ShutdownTask() and the tests... https://codereview.chromium.org/2554123002/diff/640001/base/profiler/stack_sampling_profiler.h File base/profiler/stack_sampling_profiler.h (right): https://codereview.chromium.org/2554123002/diff/640001/base/profiler/stack_sampling_profiler.h#newcode241 base/profiler/stack_sampling_profiler.h:241: static void ...

3 years, 10 months ago (2017-02-15 03:26:44 UTC) #232

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 10 months ago (2017-02-15 16:10:51 UTC) #233

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/720001

3 years, 10 months ago (2017-02-15 16:11:18 UTC) #234

bcwhite

https://codereview.chromium.org/2554123002/diff/640001/base/profiler/stack_sampling_profiler.h File base/profiler/stack_sampling_profiler.h (right): https://codereview.chromium.org/2554123002/diff/640001/base/profiler/stack_sampling_profiler.h#newcode241 base/profiler/stack_sampling_profiler.h:241: static void Shutdown(); On 2017/02/15 03:26:44, Mike Wittman wrote: ...

3 years, 10 months ago (2017-02-15 16:17:35 UTC) #235

https://codereview.chromium.org/2554123002/diff/640001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.h (right):

https://codereview.chromium.org/2554123002/diff/640001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.h:241: static void Shutdown();
On 2017/02/15 03:26:44, Mike Wittman wrote:
> On 2017/02/14 20:13:27, bcwhite wrote:
> > > The destruction of the profilers before profiled thread exit will ensure
> > > we won't be actively profiling at process shutdown, so we don't need to
> > > worry about the profiled threads at shutdown.
> > 
> > As long as we assume that the assumptions a thread makes about the lifetime
> > of a thread under test are still valid during shutdown.
> 
> There's a well-defined shut down procedure for process threads within
> BrowserMainLoop::ShutdownThreadsAndCleanUp(), so I think this is a reasonable
> assumption. A worst case scenario might require some special case code to run
> before this function, but I don't anticipate this to be necessary.

Acknowledged.

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:216: TimeDelta
task_runner_idle_shutdown_time_ = TimeDelta::FromSeconds(5);
On 2017/02/15 03:26:44, Mike Wittman wrote:
> I think it's worth bumping this up to something more like a minute, since
> keeping a thread around is very cheap.

Done.

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:257: if (task_runner) {
On 2017/02/15 03:26:44, Mike Wittman wrote:
> No need to check the task_runner anymore.

Done.

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:310: // started it so that it can be
self-managed or stopped on by another
On 2017/02/15 03:26:44, Mike Wittman wrote:
> nit: stopped by

Done.

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:350: // calculated it now.
On 2017/02/15 03:26:44, Mike Wittman wrote:
> nit: calculate

Done.

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:390:
collection->native_sampler->RecordStackSample(&sample);
On 2017/02/15 03:26:44, Mike Wittman wrote:
> nit: RecordStackSample(&profile.samples.back()) and remove the previous line

Done.
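
For readers following along, the suggested shape is roughly the following. This is a minimal sketch with hypothetical stand-in types (Sample, Profile, FakeNativeSampler), not the code in the patch set: append the new element first, then let the sampler fill it in place through a pointer to the vector's last element, so no separate reference variable is needed.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-ins for the profiler's data types.
struct Sample {
  std::vector<std::uintptr_t> frames;
};
struct Profile {
  std::vector<Sample> samples;
};

// Hypothetical sampler that writes two fixed frames into the given sample.
class FakeNativeSampler {
 public:
  void RecordStackSample(Sample* sample) { sample->frames = {0x1000, 0x2000}; }
};

// Appends an empty sample and records directly into it.
void RecordSample(FakeNativeSampler& sampler, Profile& profile) {
  profile.samples.emplace_back();
  sampler.RecordStackSample(&profile.samples.back());
}
```

The pointer is only used for the duration of the call, so later growth of the vector cannot invalidate anything that is still held.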

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:399: void
StackSamplingProfiler::SamplingThread::CheckForIdle() {
On 2017/02/15 03:26:44, Mike Wittman wrote:
> how about calling this ScheduleShutdownIfIdle, to be clear what's happening
> here?

Done.

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:415: CollectionContext* collection =
collection_ptr.get();
On 2017/02/15 03:26:44, Mike Wittman wrote:
> nit: better to save off the id and initial_delay and pass those below, to
> eliminate the burden on the reader of figuring out if the pointer usage is
> safe

Done.
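
A minimal sketch of the pattern being asked for, assuming the collection is handed off with std::move before the first sample is scheduled. CollectionContext here is a cut-down hypothetical version, and SamplingThreadModel and ScheduleFirstSample are invented for the example.

```cpp
#include <chrono>
#include <cstddef>
#include <map>
#include <memory>
#include <utility>

// Hypothetical, cut-down collection state.
struct CollectionContext {
  int collection_id;
  std::chrono::milliseconds initial_delay;
};

// Hypothetical model of the sampling thread's bookkeeping.
class SamplingThreadModel {
 public:
  // Takes ownership of |collection| and schedules its first sample.
  void AddCollection(std::unique_ptr<CollectionContext> collection) {
    // Copy what is needed before ownership moves, so nothing below has to
    // reason about whether |collection| is still safe to dereference.
    const int id = collection->collection_id;
    const std::chrono::milliseconds initial_delay = collection->initial_delay;

    active_collections_[id] = std::move(collection);
    ScheduleFirstSample(id, initial_delay);
  }

  int last_scheduled_id() const { return last_scheduled_id_; }
  std::chrono::milliseconds last_scheduled_delay() const {
    return last_scheduled_delay_;
  }
  std::size_t active_count() const { return active_collections_.size(); }

 private:
  // Stand-in for posting a delayed task; it only records its arguments.
  void ScheduleFirstSample(int id, std::chrono::milliseconds delay) {
    last_scheduled_id_ = id;
    last_scheduled_delay_ = delay;
  }

  std::map<int, std::unique_ptr<CollectionContext>> active_collections_;
  int last_scheduled_id_ = -1;
  std::chrono::milliseconds last_scheduled_delay_{0};
};
```

Using plain copies of the two values keeps the code correct even if the hand-off is later reordered or the pointer is reset.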

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:445: collection->next_sample_time =
Time::Now();
On 2017/02/15 03:26:44, Mike Wittman wrote:
> Can we initialize this in the CollectionContext constructor?

No, because it needs to be recorded when the first sample is taken, which is
after the initial delay plus whatever else the sampling thread is doing.
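
A minimal sketch of why the anchor time is taken at the first sample rather than at construction. CollectionTiming and OnSampleTaken are hypothetical names, and advancing from the previous target time is one common way to keep a steady average interval; it is an assumption of the sketch, not necessarily what the patch set does.

```cpp
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical, simplified per-collection timing state.
struct CollectionTiming {
  Clock::duration sampling_interval{};
  bool started = false;
  Clock::time_point next_sample_time;
};

// Called on the sampling thread when a sample is actually taken at |now|.
// Returns the time at which the following sample should be scheduled.
Clock::time_point OnSampleTaken(CollectionTiming& timing,
                                Clock::time_point now) {
  if (!timing.started) {
    // Anchor the schedule at the first real sample. Construction time would
    // be too early: it precedes the initial delay and any queueing on the
    // sampling thread.
    timing.next_sample_time = now;
    timing.started = true;
  }
  // Advance from the previous target rather than from |now|, so that one
  // late sample does not push every later sample back.
  timing.next_sample_time += timing.sampling_interval;
  return timing.next_sample_time;
}
```

Passing the current time in as a parameter keeps the function deterministic, which makes the scheduling easy to check with fixed time points.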

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:508: // This will keep a consistent
average interval between samples but will
On 2017/02/15 03:26:44, Mike Wittman wrote:
> Should this comment be on the above if statement?

Done.

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:562: native_sampler_ =
NativeStackSampler::Create(thread_id_, &RecordAnnotations,
On 2017/02/15 03:26:44, Mike Wittman wrote:
> Can we continue to use the current strategy of constructing this as a local
> variable in Start()? It's moved within that function, so the pointer is not
> valid after that anyway.

I thought about that but decided to leave it this way because it should allow
multiple start/stop operations without having to recreate it each time.

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:579: // short task or none at all if
sampling has already completed.
On 2017/02/15 03:26:44, Mike Wittman wrote:
> nit: ... as long as it takes to collect one sample, taking ~200μs, or none at
> all ...
> 
> Jank has a precise definition (tasks taking >118ms) so readers will want to
> evaluate the wait duration in absolute terms.

Done.

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.h (right):

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.h:187: // are move-only. This should run
quickly as possible that another thread,
On 2017/02/15 03:26:44, Mike Wittman wrote:
> I'm having trouble parsing this. How about:
> 
> Other threads, including the UI thread, may block on callback completion, so
> this should run as quickly as possible.
> 
> ?

Done.

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.h:200: // Ensure that this object gets
destroyed before the current thread exits.
On 2017/02/15 03:26:44, Mike Wittman wrote:
> nit: The caller must ensure ...
> 
> Comments with imperatives directed at the reader are not generally done and
> will be confusing to most developers.

Done.

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.h:209: // IMPORTANT: Ensure that the
thread being sampled does not exit before this
On 2017/02/15 03:26:44, Mike Wittman wrote:
> Same here.

Done.

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.h:233: static bool
IsSamplingThreadRunningForTesting();
On 2017/02/15 03:26:44, Mike Wittman wrote:
> Can we move these three functions into an internal TestApi class? (See other
> examples in the code.)
> 
> And also, do the same for the two functions they call in SamplingThread? It's
> not obvious what parts of that class are there for test purposes.

Done.  Let me know if it's what you were thinking.

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 10 months ago (2017-02-15 18:01:49 UTC) #236

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

3 years, 10 months ago (2017-02-15 18:01:51 UTC) #237

Mike Wittman

3 years, 10 months ago (2017-02-15 21:56:01 UTC) #238

Wow, supporting restartable threads certainly makes for tons of complications
around thread start and thread exit...

I think this general approach can work. Hopefully it's just a matter of fixing a
few things and simplifying to facilitate understanding.

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:415: CollectionContext* collection =
collection_ptr.get();
On 2017/02/15 16:17:34, bcwhite wrote:
> On 2017/02/15 03:26:44, Mike Wittman wrote:
> > nit: better to save off the id and initial_delay and pass those below, to
> > eliminate the burden on the reader of figuring out if the pointer usage is
> > safe
> 
> Done.

This is no longer potentially dangerous, so I think we can remove the comment as
well.

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.h (right):

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.h:233: static bool
IsSamplingThreadRunningForTesting();
On 2017/02/15 16:17:35, bcwhite wrote:
> On 2017/02/15 03:26:44, Mike Wittman wrote:
> > Can we move these three functions into an internal TestApi class? (See other
> > examples in the code.)
> > 
> > And also, do the same for the two functions they call in SamplingThread?
> > It's not obvious what parts of that class are there for test purposes.
> 
> Done.  Let me know if it's what you were thinking.

Looks good. Can you add a TestAPI to SamplingThread as well so we can
distinguish the test support code in that class from the other code?

https://codereview.chromium.org/2554123002/diff/720001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/720001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:380: // The currently active profile
being acptured.
nit: captured

https://codereview.chromium.org/2554123002/diff/720001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:462: void
StackSamplingProfiler::SamplingThread::ShutdownTask() {
I think we need some kind of invalidation for ShutdownTasks in-flight. Otherwise, I
believe we can get in a situation where an earlier-posted ShutdownTask shuts
down the thread immediately after a collection finishes. The relevant sequence
of events would be:

- a collection starts
- the collection stops and posts ShutdownTask #1
- a new collection starts
- the new collection stops and posts ShutdownTask #2
- ShutdownTask #1 executes shortly thereafter, finds a non-zero
task_runner_create_requests_, and posts itself again
- ShutdownTask #1.1 executes immediately, finds that all conditions to stop have
been satisfied, and shuts down the thread

CancelableTaskTracker unfortunately doesn't support delayed tasks, and is also
focused on tasks with reply, so I'd suggest implementing a poor-man's task
cancellation:
 - keep a "state" counter and increment it whenever the ShutdownTask should be
invalidated
 - bind the current state counter value to a ShutdownTask argument when posting
 - abort the ShutdownTask if the passed state counter is not equal to the
current state counter

I suspect this would make the logic a little easier to reason about too.
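
A minimal sketch of the counter-based cancellation suggested above, to make the three steps concrete. It is single-threaded and uses a plain queue in place of the message loop; `ShutdownScheduler`, `InvalidateShutdown`, `PostShutdownTask` and `RunNextTask` are hypothetical names chosen for illustration and are not taken from the patch.

```cpp
#include <cstdint>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical stand-in for the sampling thread's shutdown bookkeeping.
// None of these names come from the patch under review.
class ShutdownScheduler {
 public:
  // Called whenever a pending ShutdownTask should no longer take effect,
  // e.g. when a new collection is added.
  void InvalidateShutdown() { ++state_; }

  // Binds the current counter value into the task at post time.
  void PostShutdownTask() {
    const uint64_t posted_state = state_;
    queue_.push_back([this, posted_state] { ShutdownTask(posted_state); });
  }

  // Runs the oldest queued task, if any. Stands in for the message loop.
  bool RunNextTask() {
    if (queue_.empty())
      return false;
    std::function<void()> task = std::move(queue_.front());
    queue_.pop_front();
    task();
    return true;
  }

  bool stopped() const { return stopped_; }

 private:
  void ShutdownTask(uint64_t posted_state) {
    // A stale task sees that the counter has moved on and does nothing.
    if (posted_state != state_)
      return;
    stopped_ = true;
  }

  uint64_t state_ = 0;
  bool stopped_ = false;
  std::deque<std::function<void()>> queue_;
};
```

With this shape, a task posted before `InvalidateShutdown()` runs as a no-op, and only a task posted after the most recent invalidation can stop the thread.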

https://codereview.chromium.org/2554123002/diff/720001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:469: if (!active_collections_.empty())
Isn't this case already covered by the task_runner_create_requests_ check below?
i.e. if a new collection was added then task_runner_create_requests_ must have
been incremented, right?

https://codereview.chromium.org/2554123002/diff/720001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:485: StopSoon();
One note on the Stop/StopSoon/DetachFromSequence calls here and in
GetOrCreateTaskRunner: these require mutual exclusion per the Thread interface.
We have this by virtue of the task_runner_lock_, but it's not obvious from
reading the code that these calls have to be guarded by that lock to ensure
correct operation.

I'd wait to address this until we're pretty close to a resolution on the restart
behavior, however, in case other things change in the meantime.

Mike Wittman

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sampling_profiler.cc File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sampling_profiler.cc#newcode445 base/profiler/stack_sampling_profiler.cc:445: collection->next_sample_time = Time::Now(); On 2017/02/15 16:17:34, bcwhite wrote: > ...

3 years, 10 months ago (2017-02-15 22:52:16 UTC) #239

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 10 months ago (2017-02-16 17:38:34 UTC) #240

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/740001

3 years, 10 months ago (2017-02-16 17:39:36 UTC) #241

bcwhite

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sampling_profiler.cc File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sampling_profiler.cc#newcode415 base/profiler/stack_sampling_profiler.cc:415: CollectionContext* collection = collection_ptr.get(); On 2017/02/15 21:56:00, Mike Wittman ...

3 years, 10 months ago (2017-02-16 17:39:49 UTC) #242

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:415: CollectionContext* collection =
collection_ptr.get();
On 2017/02/15 21:56:00, Mike Wittman wrote:
> On 2017/02/15 16:17:34, bcwhite wrote:
> > On 2017/02/15 03:26:44, Mike Wittman wrote:
> > > nit: better to save off the id and initial_delay and pass those below, to
> > > eliminate the burden on the reader of figuring out if the pointer usage is
> > > safe
> > 
> > Done.
> 
> This is no longer potentially dangerous, so I think we can remove the comment
> as well.

Done.

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:445: collection->next_sample_time =
Time::Now();
On 2017/02/15 22:52:16, Mike Wittman wrote:
> On 2017/02/15 16:17:34, bcwhite wrote:
> > On 2017/02/15 03:26:44, Mike Wittman wrote:
> > > Can we initialize this in the CollectionContext constructor?
> > 
> > No because it needs to be recorded when the first sample is being done which
> > is after the initial delay plus any other things the sampling-thread is doing.
> 
> But the value recorded isn't derived from next_sample_time; the only place
> where next_sample_time is read is in the PostDelayedTask call below. It looks to me
> that the max expression will be equal to TimeDelta() regardless of whether
> next_sample_time is set to Time::Now() here or when the CollectionContext is
> constructed.

next_sample_time is persistent and all sample times will be calculated from its
starting value.  If it is set any time before the very first sample, then it
will race to catch up, capturing multiple samples at the start.
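
A small simulation of the catch-up effect described above, as a sketch under stated assumptions: times are plain milliseconds rather than `base::Time`, each sample is treated as instantaneous, and the delay is computed as `max(next_sample_time - now, 0)` in the spirit of the max expression mentioned in the review. `SimulateSampleTimes` is a hypothetical helper, not code from the patch.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Returns the times (in ms) at which `count` samples would be taken when the
// schedule is anchored at `anchor_ms` but the first sample actually runs at
// `first_sample_ms`. Hypothetical helper for illustration only.
std::vector<int64_t> SimulateSampleTimes(int64_t anchor_ms,
                                         int64_t first_sample_ms,
                                         int64_t interval_ms,
                                         int count) {
  std::vector<int64_t> times;
  int64_t now = first_sample_ms;
  int64_t next_sample_time = anchor_ms;
  for (int i = 0; i < count; ++i) {
    times.push_back(now);  // Take a sample (assumed to take no time).
    next_sample_time += interval_ms;
    // Never wait a negative amount; a schedule that is behind fires at once.
    const int64_t delay = std::max<int64_t>(next_sample_time - now, 0);
    now += delay;
  }
  return times;
}
```

Anchoring at the first sample (`SimulateSampleTimes(500, 500, 100, 4)`) spaces the samples one interval apart, while anchoring at construction time (`SimulateSampleTimes(0, 500, 100, 4)`) takes all four samples back to back at 500 ms because the schedule is still behind.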

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:562: native_sampler_ =
NativeStackSampler::Create(thread_id_, &RecordAnnotations,
On 2017/02/15 22:52:16, Mike Wittman wrote:
> On 2017/02/15 16:17:34, bcwhite wrote:
> > On 2017/02/15 03:26:44, Mike Wittman wrote:
> > > Can we continue to use the current strategy of constructing this as a
> > > local variable in Start()? It's moved within that function, so the pointer is
> > > not valid after that anyway.
> > 
> > I thought about that but decided to leave it this way because it should
> > allow multiple start/stop operations without having to recreate it each time.
> 
> On second look, I think we have to create the native sampler in Start()
> because the current implementation is not correct. The native sampler is moved into
> the CollectionContext the first time Start is called, and is never recreated, so
> it's not valid to move again when Start is called a second time.

That was the case but not any more.  Ownership of the native sampler stays with
this object and the context has only a pointer to it, much like the signaled
event.

An invalid native_sampler_ is used to return early in many API calls so that
they don't try to access a sampling thread that doesn't exist.

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.h (right):

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.h:233: static bool
IsSamplingThreadRunningForTesting();
On 2017/02/15 21:56:00, Mike Wittman wrote:
> On 2017/02/15 16:17:35, bcwhite wrote:
> > On 2017/02/15 03:26:44, Mike Wittman wrote:
> > > Can we move these three functions into an internal TestApi class? (See
> > > other examples in the code.)
> > > 
> > > And also, do the same for the two functions they call in SamplingThread?
> > > It's not obvious what parts of that class are there for test purposes.
> > 
> > Done.  Let me know if it's what you were thinking.
> 
> Looks good. Can you add a TestAPI to SamplingThread as well so we can
> distinguish the test support code in that class from the other code?

I didn't think it was necessary since that class is embedded inside this .cc
file and so is not accessible from the outside.

https://codereview.chromium.org/2554123002/diff/720001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/720001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:380: // The currently active profile
being acptured.
On 2017/02/15 21:56:01, Mike Wittman wrote:
> nit: captured

Done.

https://codereview.chromium.org/2554123002/diff/720001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:462: void
StackSamplingProfiler::SamplingThread::ShutdownTask() {
On 2017/02/15 21:56:01, Mike Wittman wrote:
> I think we need some kind of invalidation for ShutdownTasks in-flight. Otherwise,
> I believe we can get in a situation where an earlier-posted ShutdownTask shuts
> down the thread immediately after a collection finishes. The relevant sequence
> of events would be:
> 
> - a collection starts
> - the collection stops and posts ShutdownTask #1
> - a new collection starts
> - the new collection stops and posts ShutdownTask #2
> - ShutdownTask #1 executes shortly thereafter, finds a non-zero
> task_runner_create_requests_, and posts itself again
> - ShutdownTask #1.1 executes immediately, finds that all conditions to stop
> have been satisfied, and shuts down the thread
> 
> 
> CancelableTaskTracker unfortunately doesn't support delayed tasks, and is also
> focused on tasks with reply, so I'd suggest implementing a poor-man's task
> cancellation:
>  - keep a "state" counter and increment it whenever the ShutdownTask should
> be invalidated
>  - bind the current state counter value to a ShutdownTask argument when
> posting
>  - abort the ShutdownTask if the passed state counter is not equal to the
> current state counter
> 
> I suspect this would make the logic a little easier to reason about too.

I can use the existing task_runner_create_requests_ as the state and then just
pass its current value and a first/second flag to ShutdownTask.

https://codereview.chromium.org/2554123002/diff/720001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:469: if (!active_collections_.empty())
On 2017/02/15 21:56:01, Mike Wittman wrote:
> Isn't this case already covered by the task_runner_create_requests_ check
> below?
> i.e. if a new collection was added then task_runner_create_requests_ must have
> been incremented, right?

They happen at different times so something could race in between.

I think it could fail without this check if:
- last sample of only active collection begins
- Add increments create_requests and posts AddCollectionTask
- last sample completes and does Finish+ScheduleShutdownIfIdle
- ScheduleShutdownIfIdle finds active_collections to be empty, posts delayed task
using current create_requests (which was incremented above)
- AddCollectionTask runs, adds new collection to active_collections
- profiling of the new collection goes on and on
- ShutdownTask eventually runs, create_requests is unchanged, posts second
task
- ShutdownTask runs again, still no change, thinks it's done
- thread exits while a collection is still running

I have a love/hate relationship with this stuff.

https://codereview.chromium.org/2554123002/diff/720001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:485: StopSoon();
On 2017/02/15 21:56:01, Mike Wittman wrote:
> One note on the Stop/StopSoon/DetachFromSequence calls here and in
> GetOrCreateTaskRunner: these require mutual exclusion per the Thread
> interface.
> We have this by virtue of the task_runner_lock_, but it's not obvious from
> reading the code that these calls have to be guarded by that lock to ensure
> correct operation.
> 
> I'd wait to address this until we're pretty close to a resolution on the
> restart behavior, however, in case other things change in the meantime.

Acknowledged.

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 10 months ago (2017-02-16 18:00:11 UTC) #243

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: linux_chromium_asan_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_asan_rel_ng/builds/312563)

3 years, 10 months ago (2017-02-16 18:00:13 UTC) #244

Mike Wittman

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sampling_profiler.cc File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sampling_profiler.cc#newcode445 base/profiler/stack_sampling_profiler.cc:445: collection->next_sample_time = Time::Now(); On 2017/02/16 17:39:49, bcwhite wrote: > ...

3 years, 10 months ago (2017-02-17 16:10:19 UTC) #245

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:445: collection->next_sample_time =
Time::Now();
On 2017/02/16 17:39:49, bcwhite wrote:
> On 2017/02/15 22:52:16, Mike Wittman wrote:
> > On 2017/02/15 16:17:34, bcwhite wrote:
> > > On 2017/02/15 03:26:44, Mike Wittman wrote:
> > > > Can we initialize this in the CollectionContext constructor?
> > > 
> > > No because it needs to be recorded when the first sample is being done
> > > which is after the initial delay plus any other things the sampling-thread is
> > > doing.
> > 
> > But the value recorded isn't derived from next_sample_time; the only place
> > where next_sample_time is read is in the PostDelayedTask call below. It looks to me
> > that the max expression will be equal to TimeDelta() regardless of whether
> > next_sample_time is set to Time::Now() here or when the CollectionContext is
> > constructed.
> 
> next_sample_time is persistent and all sample times will be calculated from
> its starting value.  If it is set any time before the very first sample, then it
> will race to catch up, capturing multiple samples at the start.

Ah, right. Missed that all the sample times are offset from the initial one.

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:562: native_sampler_ =
NativeStackSampler::Create(thread_id_, &RecordAnnotations,
On 2017/02/16 17:39:49, bcwhite wrote:
> On 2017/02/15 22:52:16, Mike Wittman wrote:
> > On 2017/02/15 16:17:34, bcwhite wrote:
> > > On 2017/02/15 03:26:44, Mike Wittman wrote:
> > > > Can we continue to use the current strategy of constructing this as a
> > > > local variable in Start()? It's moved within that function, so the pointer is
> > > > not valid after that anyway.
> > > 
> > > I thought about that but decided to leave it this way because it should
> > > allow multiple start/stop operations without having to recreate it each time.
> > 
> > On second look, I think we have to create the native sampler in Start()
> > because the current implementation is not correct. The native sampler is moved into
> > the CollectionContext the first time Start is called, and is never recreated, so
> > it's not valid to move again when Start is called a second time.
> 
> That was the case but not any more.  Ownership of the native sampler stays
> with this object and the context has only a pointer to it, much like the signaled
> event.

I just remembered why I implemented the creation of the native sampler in Start
originally: it's so that the use and destruction of the object does not occur on
different threads. This makes it trivial to reason about the correctness of the
use of the object, without having to consider any synchronization concerns.

The readability benefit of not having to think about synchronization greatly
outweighs the runtime benefit of avoiding additional object constructions, so we
should keep the existing behavior.

> An invalid native_sampler_ is used to return early in many API calls so as to
> not try to access a sampling-thread that doesn't exist.

I don't understand. In patch set 25, how can native_sampler_ ever be null after
creation?

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.h (right):

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.h:233: static bool
IsSamplingThreadRunningForTesting();
On 2017/02/16 17:39:49, bcwhite wrote:
> On 2017/02/15 21:56:00, Mike Wittman wrote:
> > On 2017/02/15 16:17:35, bcwhite wrote:
> > > On 2017/02/15 03:26:44, Mike Wittman wrote:
> > > > Can we move these three functions into an internal TestApi class? (See
> > > > other examples in the code.)
> > > > 
> > > > And also, do the same for the two functions they call in SamplingThread?
> > > > It's not obvious what parts of that class are there for test purposes.
> > > 
> > > Done.  Let me know if it's what you were thinking.
> > 
> > Looks good. Can you add a TestAPI to SamplingThread as well so we can
> > distinguish the test support code in that class from the other code?
> 
> I didn't think it was necessary since that class is embedded inside this .cc
> file and so has no access from the outside.

Segregating test support code all the way down makes it clear which parts of the
class are core functionality. This makes it easier to understand the important
parts of the code and facilitates refactoring.
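
A minimal sketch of the nested test-API pattern being requested, assuming a hypothetical `Sampler` class with a single private flag. Only the nesting is the point; the names and members are illustrative and do not mirror SamplingThread.

```cpp
// Hypothetical class used only to show the TestAPI nesting pattern.
class Sampler {
 public:
  // All test-only entry points live in this nested class, so everything
  // outside it can be read as core functionality.
  class TestAPI {
   public:
    // A nested class may access the enclosing class's private members.
    static bool IsRunning(const Sampler& sampler) { return sampler.running_; }
    static void ForceStop(Sampler* sampler) { sampler->running_ = false; }
  };

  void Start() { running_ = true; }

 private:
  bool running_ = false;
};
```

Tests would then call `Sampler::TestAPI::IsRunning(sampler)` rather than a `...ForTesting()` method mixed in with the production interface.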

https://codereview.chromium.org/2554123002/diff/720001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/720001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:469: if (!active_collections_.empty())
On 2017/02/16 17:39:49, bcwhite wrote:
> On 2017/02/15 21:56:01, Mike Wittman wrote:
> > Isn't this case already covered by the task_runner_create_requests_ check
> > below?
> > i.e. if a new collection was added then task_runner_create_requests_ must
> > have been incremented, right?
> 
> They happen at different times so something could race in between.
> 
> I think it could fail without this check if:
> - last sample of only active collection begins
> - Add increments create_requests and posts AddCollectionTask
> - last sample completes and does Finish+ScheduleShutdownIfIdle
> - SSII finds active_collections to be empty, posts delayed task using current
> create_requests (which was incremented above)
> - AddCollectionTask runs, adds new collection to active_collections
> - profiling of the new collection goes on and on
> - ShutdownTask eventually runs, create_requests is unchanged, posts second
> task
> - ShutdownTask runs again, still no change, thinks it's done
> - thread exits while a collection is still running
> 
> I have a love/hate relationship with this stuff.

Yeah, I can see that sequence of events occurring.

Stepping back a bit it seems like we have a number of interacting constraints
around shutdown/startup:

1. A delayed shutdown must be initiated when the number of active collections
drops to zero.

2. Any delayed shutdowns must be aborted (or have no effect) if there are
pending collections at the time of execution of the shutdown.

3. Requests for collection must be synchronous with respect to shutdown
execution. Otherwise collection requests can be racily added only to have the
thread exit before they get serviced.

4. It's not possible to actually perform the thread exit itself synchronously
with respect to other events in the system because the thread can't hold a lock
as it exits. This means that waiting for the thread to exit must be done on
another thread.

5. Taking #3 and #4 together, we effectively need to synchronize the thread exit
execution (which takes place across two different threads), with the requests
for collection.

Does that sound like a reasonable summary? Are there other relevant
synchronization constraints? I want to be able to convince myself that the
solution works from first principles.

https://codereview.chromium.org/2554123002/diff/740001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/740001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:466: // get postponed until StopSoon
can run thus eliminating the race.
It seems like the key to eliminating the race is actually setting task_runner_
to null while holding the lock, since that indicates to GetOrCreateTaskRunner
that it needs to wait for the thread to shut down before restarting it.

https://codereview.chromium.org/2554123002/diff/740001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:482: if (!second && task_runner_) {
I think the task_runner_create_requests_ check above should avoid the need for a
second task posting:

If GetOrCreateTaskRunner was already executed, then we would have failed the
check, regardless of whether the task got posted yet. If GetOrCreateTaskRunner
is being executed now and waiting on task_runner_lock_, then it will not start
executing until we've done the shutdown sequence below and reset task_runner_ so
it won't have a chance to post any tasks until after restart.

task_runner_create_requests_ essentially serves as a proxy for whether there
could be pending add requests that haven't been executed yet.
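
A simplified model of the argument above, offered as a sketch rather than the patch's implementation. It assumes a request counter that is bumped under the same lock the shutdown path takes, and it uses a `bool` in place of a non-null `task_runner_`. No real thread is started or joined, and `ThreadLifetimeSketch` and its members are hypothetical names.

```cpp
#include <mutex>

// Hypothetical, simplified model of the restartable thread's bookkeeping.
class ThreadLifetimeSketch {
 public:
  // Models GetOrCreateTaskRunner(): every call bumps the request counter and
  // (re)starts the thread if it is not currently running.
  void GetOrCreateTaskRunner() {
    std::lock_guard<std::mutex> lock(lock_);
    ++create_requests_;
    if (!running_) {
      running_ = true;  // Real code would wait for the old thread, then restart.
      ++start_count_;
    }
  }

  // Value to bind into a shutdown task at post time.
  int create_requests() {
    std::lock_guard<std::mutex> lock(lock_);
    return create_requests_;
  }

  // Models ShutdownTask(): returns true if the shutdown went ahead.
  bool ShutdownTask(int requests_when_posted) {
    std::lock_guard<std::mutex> lock(lock_);
    // Any request made since posting means an add may still be pending.
    if (requests_when_posted != create_requests_)
      return false;
    running_ = false;  // Analogous to resetting task_runner_ under the lock.
    return true;
  }

  bool running() {
    std::lock_guard<std::mutex> lock(lock_);
    return running_;
  }

  int start_count() {
    std::lock_guard<std::mutex> lock(lock_);
    return start_count_;
  }

 private:
  std::mutex lock_;
  int create_requests_ = 0;
  int start_count_ = 0;
  bool running_ = false;
};
```

Because the counter check and the state reset happen under one lock, a request either lands before the check (and aborts the shutdown) or after the reset (and triggers a restart). This model deliberately leaves out the separate `active_collections_` check debated earlier in the thread.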

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 10 months ago (2017-02-21 16:18:42 UTC) #246

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/760001

3 years, 10 months ago (2017-02-21 16:19:10 UTC) #247

bcwhite

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sampling_profiler.cc File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sampling_profiler.cc#newcode562 base/profiler/stack_sampling_profiler.cc:562: native_sampler_ = NativeStackSampler::Create(thread_id_, &RecordAnnotations, > I just remembered why ...

3 years, 10 months ago (2017-02-21 16:21:19 UTC) #248

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:562: native_sampler_ =
NativeStackSampler::Create(thread_id_, &RecordAnnotations,
> I just remembered why I implemented the creation of the native sampler in
> Start originally: it's so that the use and destruction of the object does not occur
> on different threads. This makes it trivial to reason about the correctness of
> the use of the object, without having to consider any synchronization concerns.

Use will always be on a different thread than construction because use is always
on the SamplingThread.

Right now construction & destruction of the native sampler occurs on the same
thread that does construction & destruction of the generic object.

If I move the construction to Start and Start can be called by yet another
thread, then construction and destruction of the native sampler can be called on
different threads because the destruction of the native sampler will occur with
the destruction of the generic sampler.

Moving destruction of the native sampler to Stop won't help because there is no
requirement that Stop ever be called.

> The readability benefit of not having to think about synchronization greatly
> outweighs the runtime benefit of avoiding additional object constructions, so
> we should keep the existing behavior.

The existing behavior was that the native sampler was constructed by the thread
calling Start but destructed on the sampling thread.  That seems riskier.

> > An invalid native_sampler_ is used to return early in many API calls so as to
> > not try to access a sampling-thread that doesn't exist.
> 
> I don't understand. In patch set 25, how can native_sampler_ ever be null
> after creation?

Unsupported platforms return nullptr from NativeStackSampler::Create().
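
A minimal sketch of that early-return pattern, assuming a factory that yields a null pointer on unsupported platforms. `NativeSamplerSketch`, `CreateSampler` and `ProfilerSketch` are hypothetical stand-ins; the real `NativeStackSampler::Create()` takes different arguments.

```cpp
#include <memory>

// Hypothetical stand-in for the platform-specific sampler.
struct NativeSamplerSketch {};

// Returns nullptr when the platform has no sampler implementation.
std::unique_ptr<NativeSamplerSketch> CreateSampler(bool platform_supported) {
  if (!platform_supported)
    return nullptr;
  return std::unique_ptr<NativeSamplerSketch>(new NativeSamplerSketch());
}

class ProfilerSketch {
 public:
  explicit ProfilerSketch(bool platform_supported)
      : native_sampler_(CreateSampler(platform_supported)) {}

  // Returns false without touching any sampling thread if there is no
  // native sampler to drive.
  bool Start() {
    if (!native_sampler_)
      return false;
    started_ = true;
    return true;
  }

  bool started() const { return started_; }

 private:
  std::unique_ptr<NativeSamplerSketch> native_sampler_;
  bool started_ = false;
};
```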

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.h (right):

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.h:233: static bool
IsSamplingThreadRunningForTesting();
On 2017/02/17 16:10:19, Mike Wittman wrote:
> On 2017/02/16 17:39:49, bcwhite wrote:
> > On 2017/02/15 21:56:00, Mike Wittman wrote:
> > > On 2017/02/15 16:17:35, bcwhite wrote:
> > > > On 2017/02/15 03:26:44, Mike Wittman wrote:
> > > > > Can we move these three functions into an internal TestApi class? (See
> > > > > other examples in the code.)
> > > > > 
> > > > > And also, do the same for the two functions they call in SamplingThread?
> > > > > It's not obvious what parts of that class are there for test purposes.
> > > > 
> > > > Done.  Let me know if it's what you were thinking.
> > > 
> > > Looks good. Can you add a TestAPI to SamplingThread as well so we can
> > > distinguish the test support code in that class from the other code?
> > 
> > I didn't think it was necessary since that class is embedded inside this .cc
> > file and so has no access from the outside.
> 
> Segregating test support code all the way down makes it clear which parts of
> the class are core functionality. This makes it easier to understand the important
> parts of the code and facilitates refactoring.

Done.

https://codereview.chromium.org/2554123002/diff/720001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/720001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:469: if (!active_collections_.empty())
> Yeah, I can see that sequence of events occurring.
> 
> 
> Stepping back a bit it seems like we have a number of interacting constraints
> around shutdown/startup:
> 
> 1. A delayed shutdown must be initiated when the number of active collections
> drops to zero.

Should.  Since new collections could come in during the delay, "must" is
stronger than necessary since that condition cannot be relied upon later on.

> 2. Any delayed shutdowns must be aborted (or have no effect) if there are
> pending collections at the time of execution of the shutdown.
> 
> 3. Requests for collection must be synchronous with respect to shutdown
> execution. Otherwise collection requests can be racily added only to have the
> thread exit before they get serviced.
> 
> 4. It's not possible to actually perform the thread exit itself synchronously
> with respect to other events in the system because the thread can't hold a
> lock as it exits. This means that waiting for the thread to exit must be done on
> another thread.

Yes.  The thread must indicate that it is about to exit so that an outside
thread can know to wait for the thread to exit (and possibly restart it).

> 5. Taking #3 and #4 together, we effectively need to synchronize the thread
> exit execution (which takes place across two different threads), with the
> requests for collection.
> 
> Does that sound like a reasonable summary? Are there other relevant
> synchronization constraints? I want to be able to convince myself that the
> solution works from first principles.

And thread API access has to be synchronized (which is done via the
task_runner_lock_).

https://codereview.chromium.org/2554123002/diff/740001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/740001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:466: // get postponed until StopSoon
can run thus eliminating the race.
On 2017/02/17 16:10:19, Mike Wittman wrote:
> It seems like the key to eliminating the race is actually setting task_runner_
> to null while holding the lock, since that indicates to GetOrCreateTaskRunner
> that it needs to wait for the thread to shut down before restarting it.

Done.

https://codereview.chromium.org/2554123002/diff/740001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:482: if (!second && task_runner_) {
On 2017/02/17 16:10:19, Mike Wittman wrote:
> I think the task_runner_create_requests_ check above should avoid the need for
> a second task posting:
> 
> If GetOrCreateTaskRunner was already executed, then we would have failed the
> check, regardless of whether the task got posted yet. If GetOrCreateTaskRunner
> is being executed now and waiting on task_runner_lock_, then it will not start
> executing until we've done the shutdown sequence below and reset task_runner_
> so it won't have a chance to post any tasks until after restart.
> 
> task_runner_create_requests_ essentially serves as a proxy for whether there
> could be pending add requests that haven't been executed yet.

Yeah, I think that's reasonable on the assumption that the "idle delay" is
sufficient to ensure the execution of the AddCollectionTask associated with an
increment of task_runner_create_requests_.

But what about this:

- task_runner_create_requests_ is incremented (at time T)
- ScheduleShutdownIfIdle runs with this new value
- AddTask gets posted
- New collection runs for 55 seconds
- ShutdownTask runs at T+60 seconds
- thread exits only 5 seconds after the completion of the last collection

It's not really a problem; it just results in the thread possibly exiting
without waiting for the full delay.

Also, I think the fix would be complicated.  ScheduleShutdownIfIdle would have
to do the two back-to-back runs to get both the latest
task_runner_create_requests_ count and the non-empty collections boolean.

On top of that...  The PostTask done by Add() would have to be done while the
task_runner_lock_ remains held to ensure that those two
ScheduleShutdownIfIdle() task runs couldn't occur between the increment and the
posting.  Doing that would require undoing the GetOrCreateTaskRunner helper
method.

Mike Wittman

3 years, 10 months ago (2017-02-21 18:57:02 UTC) #249

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:562: native_sampler_ =
NativeStackSampler::Create(thread_id_, &RecordAnnotations,
On 2017/02/21 16:21:19, bcwhite wrote:
> > The readability benefit of not having to think about synchronization greatly
> > outweighs the runtime benefit of avoiding additional object constructions, so
> > we should keep the existing behavior.
> 
> The existing behavior was that the native sampler was constructed by the thread
> calling Start but destructed on the sampling thread.  That seems riskier.

What's the risk you're concerned about in creating and destroying the native
sampler on different threads? It's not possible for those two lifecycle events
to overlap, so there's no coordination required. I'm not seeing where this could
go wrong.

If use and destruction are on different threads, however, then this requires
coordination between the threads to ensure that the object is not destroyed
while it is still being used. The more state that requires cross-thread
synchronization, the more difficult it is to understand and validate the code,
and the harder it is to make future changes correctly.

> > > An invalid native_sampler_ is used to return early in many API calls so as
> > > not to try to access a sampling-thread that doesn't exist.
> > 
> > I don't understand. In patch set 25, how can native_sampler_ ever be null
> > after creation?
> 
> Unsupported platforms return nullptr from NativeStackSampler::Create().

Right. I don't think it's worth the complexity to try to maintain the
NativeStackSampler pointer as the sentinel for whether sampling is supported.
Better to save off a boolean on first attempted creation or add a
NativeStackSampler::IsSupported() function, whatever's simpler.

https://codereview.chromium.org/2554123002/diff/720001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/720001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:469: if (!active_collections_.empty())
On 2017/02/21 16:21:19, bcwhite wrote:
> > Stepping back a bit it seems like we have a number of interacting constraints
> > around shutdown/startup:
> > 
> > 1. A delayed shutdown must be initiated when the number of active collections
> > drops to zero.
> 
> Should.  Since new collections could come in during the delay, "must" is
> stronger than necessary since that condition cannot be relied upon later on.
> 
> 
> > 2. Any delayed shutdowns must be aborted (or have no effect) if there are
> > pending collections at the time of execution of the shutdown.

Actually, I think this should be:

2. Any delayed shutdowns must be aborted (or have no effect) if any additional
collections have occurred or are pending at the time of execution of the
shutdown.

> > 3. Requests for collection must be synchronous with respect to shutdown
> > execution. Otherwise collection requests can be racily added only to have the
> > thread exit before they get serviced.
> > 
> > 4. It's not possible to actually perform the thread exit itself synchronously
> > with respect to other events in the system because the thread can't hold a lock
> > as it exits. This means that waiting for the thread to exit must be done on
> > another thread.
> 
> Yes.  The thread must indicate that it is about to exit so that an outside
> thread can know to wait for the thread to exit (and possibly restart it).
> 
> 
> > 5. Taking #3 and #4 together, we effectively need to synchronize the thread
> > exit execution (which takes place across two different threads), with the
> > requests for collection.
> > 
> > Does that sound like a reasonable summary? Are there other relevant
> > synchronization constraints? I want to be able to convince myself that the
> > solution works from first principles.
> 
> And thread API access has to be synchronized (which is done via the
> task_runner_lock_).

Great, I think we can get to a workable and understandable (although nontrivial)
solution within these constraints.

https://codereview.chromium.org/2554123002/diff/740001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/740001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:482: if (!second && task_runner_) {
On 2017/02/21 16:21:19, bcwhite wrote:
> But what about this:
> 
> - task_runner_create_requests_ is incremented (at time T)
> - ScheduleShutdownIfIdle runs with this new value
> - AddTask gets posted
> - New collection runs for 55 seconds
> - ShutdownTask runs at T+60 seconds
> - thread exits only 5 seconds after the completion of the last collection
> 
> It's not really a problem; it just results in the thread possibly exiting
> without waiting for the full delay.
> 
> Also, I think the fix would be complicated.  ScheduleShutdownIfIdle would have
> to do the two back-to-back runs to get both the latest
> task_runner_create_requests_ count and the non-empty collections boolean.
> 
> On top of that...  The PostTask done by Add() would have to be done while the
> task_runner_lock_ remains held to ensure that those two
> ScheduleShutdownIfIdle() task runs couldn't occur between the increment and the
> posting.  Doing that would require undoing the GetOrCreateTaskRunner helper
> method.

It seems like the basic issue here is distinguishing whether the
active_collections_.empty() state is the result of the state being empty since
the ShutdownTask was posted, or the result of collections occurring then
completing.

Rather than trying to figure this out post hoc, I think a simpler mechanism
would be to make a note of when collections start, e.g. by separately
incrementing task_runner_create_requests_ (under lock) in AddCollectionTask.
Then the create_requests check would detect this case too, and there would be no
need to try to divine the current state using the active_collections_.empty()
check and multiple task postings in ShutdownTask.
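
A minimal sketch of that counter check, for illustration only. It assumes std::mutex in place of base::Lock and models the delayed shutdown task as a plain function call. IdleShutdownTracker and its method names are hypothetical, not code from this patch:

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-in for the sampling thread's bookkeeping. std::mutex
// replaces base::Lock and no real thread is started.
class IdleShutdownTracker {
 public:
  // Called (under lock) whenever a collection is added.
  void NoteCollectionAdded() {
    std::lock_guard<std::mutex> lock(lock_);
    ++add_events_;
  }

  // Snapshot taken at the moment the delayed shutdown task is posted.
  int CurrentAddEvents() {
    std::lock_guard<std::mutex> lock(lock_);
    return add_events_;
  }

  // Body of the delayed shutdown task. Shuts down only if no collection was
  // added since the snapshot. Returns whether the shutdown proceeded.
  bool TryShutdown(int add_events_at_post) {
    std::lock_guard<std::mutex> lock(lock_);
    if (add_events_ != add_events_at_post)
      return false;  // A collection started in the meantime: abort.
    running_ = false;
    return true;
  }

  bool running() {
    std::lock_guard<std::mutex> lock(lock_);
    return running_;
  }

 private:
  std::mutex lock_;
  int add_events_ = 0;
  bool running_ = true;
};
```

With this shape the shutdown task never needs to inspect active_collections_: a collection that started and finished during the delay still changes the counter, so the stale shutdown aborts.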

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 10 months ago (2017-02-21 19:04:39 UTC) #250

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

3 years, 10 months ago (2017-02-21 19:04:41 UTC) #251

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 10 months ago (2017-02-21 21:46:42 UTC) #252

bcwhite

3 years, 10 months ago (2017-02-21 21:48:05 UTC) #253

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:562: native_sampler_ =
NativeStackSampler::Create(thread_id_, &RecordAnnotations,

> If use and destruction are on different threads, however, then this requires
> coordination between the threads to ensure that the object is not destroyed
> while it is still being used.

That's already done because destruction requires waiting on collection having
finished.


> Right. I don't think it's worth the complexity to try to maintain the
> NativeStackSampler pointer as the sentinel for whether sampling is supported.
> Better to save off a boolean on first attempted creation or add a
> NativeStackSampler::IsSupported() function, whatever's simpler.

Existing collection_id will do it, too.  Done.

https://codereview.chromium.org/2554123002/diff/740001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/740001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:482: if (!second && task_runner_) {
On 2017/02/21 18:57:02, Mike Wittman wrote:
> On 2017/02/21 16:21:19, bcwhite wrote:
> > But what about this:
> > 
> > - task_runner_create_requests_ is incremented (at time T)
> > - ScheduleShutdownIfIdle runs with this new value
> > - AddTask gets posted
> > - New collection runs for 55 seconds
> > - ShutdownTask runs at T+60 seconds
> > - thread exits only 5 seconds after the completion of the last collection
> > 
> > It's not really a problem; it just results in the thread possibly exiting
> > without waiting for the full delay.
> > 
> > Also, I think the fix would be complicated.  ScheduleShutdownIfIdle would
> > have to do the two back-to-back runs to get both the latest
> > task_runner_create_requests_ count and the non-empty collections boolean.
> > 
> > On top of that...  The PostTask done by Add() would have to be done while the
> > task_runner_lock_ remains held to ensure that those two
> > ScheduleShutdownIfIdle() task runs couldn't occur between the increment and
> > the posting.  Doing that would require undoing the GetOrCreateTaskRunner
> > helper method.
> 
> It seems like the basic issue here is distinguishing whether the
> active_collections_.empty() state is the result of the state being empty since
> the ShutdownTask was posted, or the result of collections occurring then
> completing.
> 
> Rather than trying to figure this out post hoc, I think a simpler mechanism
> would be to make a note of when collections start, e.g. by separately
> incrementing task_runner_create_requests_ (under lock) in AddCollectionTask.
> Then the create_requests check would detect this case too, and there would be
> no need to try to divine the current state using the
> active_collections_.empty() check and multiple task postings in ShutdownTask.

Works for me.

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/780001

3 years, 10 months ago (2017-02-21 21:48:07 UTC) #254

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 10 months ago (2017-02-21 23:53:03 UTC) #255

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: chromeos_amd64-generic_chromium_compile_only_ng on master.tryserver.chromium.linux (JOB_TIMED_OUT, no build URL) ...

3 years, 10 months ago (2017-02-21 23:53:08 UTC) #256

Mike Wittman

3 years, 10 months ago (2017-02-22 03:06:48 UTC) #257

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:562: native_sampler_ =
NativeStackSampler::Create(thread_id_, &RecordAnnotations,
On 2017/02/21 21:48:05, bcwhite wrote:
> > Right. I don't think it's worth the complexity to try to maintain the
> > NativeStackSampler pointer as the sentinel for whether sampling is supported.
> > Better to save off a boolean on first attempted creation or add a
> > NativeStackSampler::IsSupported() function, whatever's simpler.
> 
> Existing collection_id will do it, too.  Done.

That works.

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:220: int task_runner_create_requests_ =
0;
This should have a more general name now. collection_add_events_ ?

Also, documentation on what it's used for.

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:436: AutoLock lock(task_runner_lock_);
nit: enclose in a block

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:504: task_runner_ = nullptr;
The thread execution state from here to GetOrCreateTaskRunner is both pretty
subtle and critically important. So it probably should have a representation
with more inherent meaning than a null task runner pointer. I'd recommend an
enum and associated variable; e.g.

enum class ThreadExecutionState {
  NOT_STARTED,
  RUNNING,
  EXITING
};

Top-level documentation for the management of thread exit could be usefully hung
off of this type and it potentially would allow meaningful DCHECKS about the
thread execution state in the code, which would be helpful from a correctness
and documentation perspective.
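
To illustrate how the state variable might be used, here is a rough sketch. It assumes std::mutex in place of base::Lock and assert() in place of DCHECK; SamplingThreadStateTracker and its methods are hypothetical names, and the real join/restart work is reduced to a comment:

```cpp
#include <cassert>
#include <mutex>

enum class ThreadExecutionState { NOT_STARTED, RUNNING, EXITING };

// Hypothetical holder for the state; every access goes through lock_.
class SamplingThreadStateTracker {
 public:
  // Called before posting work. Returns true if the thread had to be
  // (re)started. Real code would join the old thread in the EXITING case
  // before starting a new one.
  bool EnsureRunning() {
    std::lock_guard<std::mutex> lock(lock_);
    if (state_ == ThreadExecutionState::RUNNING)
      return false;
    state_ = ThreadExecutionState::RUNNING;
    return true;
  }

  // Called on the sampling thread just before it exits.
  void MarkExiting() {
    std::lock_guard<std::mutex> lock(lock_);
    assert(state_ == ThreadExecutionState::RUNNING);  // Stand-in for DCHECK.
    state_ = ThreadExecutionState::EXITING;
  }

  ThreadExecutionState state() {
    std::lock_guard<std::mutex> lock(lock_);
    return state_;
  }

 private:
  std::mutex lock_;
  ThreadExecutionState state_ = ThreadExecutionState::NOT_STARTED;
};
```

Unlike a null task runner pointer, the enum distinguishes "never started" from "exiting", so a restart path can tell whether there is an old thread to join.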

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:600: finished_event_.Wait();
One more comment here, after considering the case of performing more than one
collection from the same StackSamplingProfiler object. I believe we need to move
the wait into Stop() to support that case: as it is now the WaitableEvent
remains signalled once the first collection finishes, so the
StackSamplingProfiler object could be destroyed any time after the first
collection completes, without waiting for a later collection to complete.

We probably should have a test exercising this case as well.

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:623: if (collection_id_ == -1) {
This should be a named constant.

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:848: // Capture thread should
still be running at this point.
The shutdown and restart behavior should be broken out into a different test (or
tests).

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:866: #define
MAYBE_ConcurrentProfiling ConcurrentProfiling
I think we should test different interleavings of Start(), Stop(), and profiler
destruction for the concurrent case, probably in different tests.

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 10 months ago (2017-02-22 14:23:35 UTC) #258

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/800001

3 years, 10 months ago (2017-02-22 14:23:49 UTC) #259

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 10 months ago (2017-02-22 14:32:43 UTC) #260

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: cast_shell_linux on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/cast_shell_linux/builds/314246) linux_chromium_tsan_rel_ng on ...

3 years, 10 months ago (2017-02-22 14:32:45 UTC) #261

bcwhite

3 years, 10 months ago (2017-02-22 14:32:51 UTC) #262

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:220: int task_runner_create_requests_ =
0;
On 2017/02/22 03:06:48, Mike Wittman wrote:
> This should have a more general name now. collection_add_events_ ?
> 
> Also, documentation on what it's used for.

Done.

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:436: AutoLock lock(task_runner_lock_);
On 2017/02/22 03:06:48, Mike Wittman wrote:
> nit: enclose in a block

Done.

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:504: task_runner_ = nullptr;
On 2017/02/22 03:06:48, Mike Wittman wrote:
> The thread execution state from here to GetOrCreateTaskRunner is both pretty
> subtle and critically important. So it probably should have a representation
> with more inherent meaning than a null task runner pointer. I'd recommend an
> enum and associated variable; e.g.
> 
> enum class ThreadExecutionState {
>   NOT_STARTED,
>   RUNNING,
>   EXITING
> };
> 
> Top-level documentation for the management of thread exit could be usefully
> hung off of this type and it potentially would allow meaningful DCHECKS about
> the thread execution state in the code, which would be helpful from a
> correctness and documentation perspective.

While I like the idea, it's a second variable accessed independently of
task_runner_, access to which has been largely pushed into "helper methods". 
Since those helper methods acquire and release the lock privately, it'll mean
doing two acquire/release operations on the lock, changing all the helper
methods to be WhileLocked, or removing the helper methods altogether.

Preferences?

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:600: finished_event_.Wait();
> One more comment here, after considering the case of performing more than one
> collection from the same StackSamplingProfiler object. I believe we need to
> move the wait into Stop() to support that case: as it is now the WaitableEvent
> remains signalled once the first collection finishes, so the
> StackSamplingProfiler object could be destroyed any time after the first
> collection completes, without waiting for a later collection to complete.

The same occurred to me but I think it's better to do the wait in Start() so
that Stop() remains asynchronous.

This means that the controlling thread won't block until absolutely necessary,
which means probably never since the stop will happen within microseconds.

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:623: if (collection_id_ == -1) {
On 2017/02/22 03:06:47, Mike Wittman wrote:
> This should be a named constant.

Done.

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:848: // Capture thread should
still be running at this point.
On 2017/02/22 03:06:48, Mike Wittman wrote:
> The shutdown and restart behavior should be broken out into a different test
> (or tests).
Done.

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:866: #define
MAYBE_ConcurrentProfiling ConcurrentProfiling
On 2017/02/22 03:06:48, Mike Wittman wrote:
> I think we should test different interleavings of Start(), Stop(), and
> profiler destruction for the concurrent case, probably in different tests.
Done.

Mike Wittman

3 years, 10 months ago (2017-02-22 20:32:18 UTC) #263

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:504: task_runner_ = nullptr;
On 2017/02/22 14:32:51, bcwhite wrote:
> On 2017/02/22 03:06:48, Mike Wittman wrote:
> > The thread execution state from here to GetOrCreateTaskRunner is both pretty
> > subtle and critically important. So it probably should have a representation
> > with more inherent meaning than a null task runner pointer. I'd recommend an
> > enum and associated variable; e.g.
> > 
> > enum class ThreadExecutionState {
> >   NOT_STARTED,
> >   RUNNING,
> >   EXITING
> > };
> > 
> > Top-level documentation for the management of thread exit could be usefully
> > hung off of this type and it potentially would allow meaningful DCHECKS about
> > the thread execution state in the code, which would be helpful from a
> > correctness and documentation perspective.
> 
> While I like the idea, it's a second variable accessed independently of
> task_runner_, access to which has been largely pushed into "helper methods". 
> Since those helper methods acquire and release the lock privately, it'll mean
> doing two acquire/release operations on the lock, changing all the helper
> methods to be WhileLocked, or removing the helper methods altogether.
> 
> Preferences?

I think using the enum is worth doing for the improved documentation and
readability, plus the fact that it removes the ambiguity between the not started
state and the exiting state.

Just using it in place of checking task_runner_ for null shouldn't require any
additional locking.

Adding DCHECKS might require additional locking and complexity, but the locking
could be gated by DCHECK_IS_ON() to avoid any non-debug performance impact. The
question of whether the added complexity is worth it probably can be considered
on a case-by-case basis.

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:600: finished_event_.Wait();
On 2017/02/22 14:32:51, bcwhite wrote:
> > One more comment here, after considering the case of performing more than
> > one collection from the same StackSamplingProfiler object. I believe we need
> > to move the wait into Stop() to support that case: as it is now the
> > WaitableEvent remains signalled once the first collection finishes, so the
> > StackSamplingProfiler object could be destroyed any time after the first
> > collection completes, without waiting for a later collection to complete.
> The same occurred to me but I think it's better to do the wait in Start() so
> that Stop() remains asynchronous.
> 
> This means that the controlling thread won't block until absolutely necessary,
> which means probably never since the stop will happen within microseconds.

Sounds reasonable to me, but the current name doesn't quite fit for this use.
Better to call it something like profiling_inactive_.
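
For reference, a rough sketch of the wait-in-Start() arrangement. It assumes a hand-rolled manual-reset event as a stand-in for base::WaitableEvent; ManualResetEvent, ProfilerSketch and OnCollectionFinished are hypothetical names and the hand-off to the sampling thread is left as a comment:

```cpp
#include <condition_variable>
#include <mutex>

// Minimal manual-reset event; a stand-in for base::WaitableEvent.
class ManualResetEvent {
 public:
  explicit ManualResetEvent(bool initially_signaled)
      : signaled_(initially_signaled) {}

  void Signal() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      signaled_ = true;
    }
    cv_.notify_all();
  }

  void Reset() {
    std::lock_guard<std::mutex> lock(mutex_);
    signaled_ = false;
  }

  void Wait() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return signaled_; });
  }

  bool IsSignaled() {
    std::lock_guard<std::mutex> lock(mutex_);
    return signaled_;
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool signaled_;
};

// Hypothetical profiler shell showing only the event handling.
class ProfilerSketch {
 public:
  // Destruction blocks until the most recent collection has finished.
  ~ProfilerSketch() { profiling_inactive_.Wait(); }

  void Start() {
    // Wait for any previously started profiling to complete.
    profiling_inactive_.Wait();
    profiling_inactive_.Reset();
    // ... hand the collection to the sampling thread here ...
  }

  // Invoked by the sampling thread when the collection finishes.
  void OnCollectionFinished() { profiling_inactive_.Signal(); }

  bool IsInactive() { return profiling_inactive_.IsSignaled(); }

 private:
  // Starts signaled: no collection is in progress before the first Start().
  ManualResetEvent profiling_inactive_{true};
};
```

Because Start() resets the event before each collection, the destructor waits for the latest collection rather than only the first, while Stop() can stay asynchronous.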

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:32: const int NO_ID = -1;
nit: how about NULL_COLLECTION_ID, so it's clear what state this corresponds to?

This also could use some documentation.

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:223: int task_runner_create_requests_ =
0;
remove

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:224: TimeDelta
task_runner_idle_shutdown_time_ = TimeDelta::FromSeconds(60);
If we're not supporting configurable shutdown times (per comment in the header),
then this can be replaced with something like:
  bool disable_idle_shutdown_for_testing_ = false;
And the TimeDelta::FromSeconds(60) can be moved into ScheduleShutdownIfIdle().
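
Roughly like the following sketch, which assumes std::chrono in place of base::TimeDelta. IdleShutdownPolicy is a hypothetical wrapper, and the function reports the delay to the caller instead of posting a task:

```cpp
#include <chrono>

// Hypothetical wrapper around the idle-shutdown decision.
class IdleShutdownPolicy {
 public:
  void DisableIdleShutdownForTesting() {
    disable_idle_shutdown_for_testing_ = true;
  }

  // Returns true and fills in |delay| when a delayed shutdown task should be
  // posted. Returns false when shutdown is disabled or work is still active.
  bool ScheduleShutdownIfIdle(bool has_active_collections,
                              std::chrono::seconds* delay) const {
    if (disable_idle_shutdown_for_testing_ || has_active_collections)
      return false;
    // The fixed delay lives here rather than in a configurable member.
    *delay = std::chrono::seconds(60);
    return true;
  }

 private:
  bool disable_idle_shutdown_for_testing_ = false;
};
```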

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:592:
WaitableEvent::InitialState::SIGNALED),
Comment on why the initial state is signaled.

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:630: finished_event_.Reset();
This code is equivalent to just:
  finished_event_.Wait();

For the comment I think it's sufficient to say "Wait for any previously started
profiling to complete."

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:642: finished_event_.Signal();
This is no longer necessary.

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.h (right):

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.h:194: static void
SetSamplingThreadIdleShutdownTime(int shutdown_ms);
This function is only ever used to disable the idle shutdown, so we should
name it accordingly, e.g. DisableSamplingThreadIdleShutdown().

I can't see a test use for setting the shutdown time that wouldn't be better
served by explicit coordination using the TestDelegate.

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.h:291: // An ID uniquely identifying this
collection to the sampling thread.
Also mention the conditions under which this will be null.

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:858:
sampling_completed.TimedWait(AVeryLongTimeDelta());
What's the reason for the TimedWait?

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:911:
PlatformThread::YieldCurrentThread();
Channeling brucedawson@: while (condition) yield(); results in a busy wait if
there are spare execution cycles in the system, where the thread is repeatedly
scheduled, executes, and yields. This is bad for power usage. Admittedly this is
not a big concern in tests, but people do tend to copy-paste code around. The
preferred formulation is while (condition) sleep(1); to allow the processor to
be idle for some time.
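
A small sketch of the sleep-based form, written against the standard library rather than base::PlatformThread. SleepUntil is a hypothetical helper, and the returned sleep count exists only to make the behavior observable:

```cpp
#include <chrono>
#include <thread>

// Polls |condition_met| until it returns true, sleeping 1 ms between checks
// so the processor can go idle instead of spinning. Returns how many times
// the loop slept.
template <typename Predicate>
int SleepUntil(Predicate condition_met) {
  int sleeps = 0;
  while (!condition_met()) {
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
    ++sleeps;
  }
  return sleeps;
}
```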

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:922: #define
MAYBE_ConcurrentProfiling1 ConcurrentProfiling1
Test name and comment should be descriptive of the specific conditions this is
testing, rather than using a numeric suffix.

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:962:
WaitableEvent::WaitMany(sampling_completed_rawptrs.data(), 2);
This probably should wait until all profilers have completed, then verify that
results are as expected. The expectation below is not correct since both of the
profilers could have completed before this call.

Same comment potentially applies to the tests below, depending on what is being
tested.

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1070:
WaitableEvent::WaitMany(sampling_completed_rawptrs.data(), 2);
2 => 3, or better, arraysize(params). Since we're duplicating and changing tests
here, we probably should replace all the constants here and in the previous
tests with arraysize(params) to make this less fragile.
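
A small sketch of the idea, assuming nothing about the real test code: arraysize() is Chromium's macro, and the version below uses plain template deduction to get the same effect. SamplingParams, ArraySize and TotalSamples are hypothetical names for illustration only.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the per-profiler parameters used in the tests.
struct SamplingParams {
  int samples_per_burst = 0;
};

// Returns the element count of a C array, deduced at compile time.
template <typename T, std::size_t N>
constexpr std::size_t ArraySize(const T (&)[N]) {
  return N;
}

// Sums samples_per_burst over every entry. The loop bound comes from the
// array type, so adding a third profiler cannot leave a stale "2" behind.
template <std::size_t N>
int TotalSamples(const SamplingParams (&params)[N]) {
  int total = 0;
  for (std::size_t i = 0; i < N; ++i)
    total += params[i].samples_per_burst;
  return total;
}
```

Deriving the count from the array keeps duplicated tests correct when the number of parameter sets changes.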

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1077: 
I'd like to have a test that interleaves Start and Stop calls on different
profilers as well.

bcwhite

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sampling_profiler.cc File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sampling_profiler.cc#newcode504 base/profiler/stack_sampling_profiler.cc:504: task_runner_ = nullptr; > Just using it in place ...

3 years, 10 months ago (2017-02-22 20:51:15 UTC) #264

Mike Wittman

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sampling_profiler.cc File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sampling_profiler.cc#newcode504 base/profiler/stack_sampling_profiler.cc:504: task_runner_ = nullptr; On 2017/02/22 20:51:15, bcwhite wrote: > ...

3 years, 10 months ago (2017-02-22 21:11:13 UTC) #265

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 10 months ago (2017-02-23 16:06:44 UTC) #266

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/820001

3 years, 10 months ago (2017-02-23 16:07:01 UTC) #267

bcwhite

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sampling_profiler.cc File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sampling_profiler.cc#newcode504 base/profiler/stack_sampling_profiler.cc:504: task_runner_ = nullptr; On 2017/02/22 21:11:13, Mike Wittman wrote: ...

3 years, 10 months ago (2017-02-23 16:08:33 UTC) #268

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 10 months ago (2017-02-23 17:24:45 UTC) #269

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

3 years, 10 months ago (2017-02-23 17:24:47 UTC) #270

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 10 months ago (2017-02-23 18:17:16 UTC) #271

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/840001

3 years, 10 months ago (2017-02-23 18:18:08 UTC) #272

Mike Wittman

https://codereview.chromium.org/2554123002/diff/820001/base/profiler/stack_sampling_profiler.cc File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/820001/base/profiler/stack_sampling_profiler.cc#newcode187 base/profiler/stack_sampling_profiler.cc:187: enum ThreadExecutionState { This needs substantial comments explaining the ...

3 years, 10 months ago (2017-02-23 18:26:52 UTC) #273

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 10 months ago (2017-02-23 18:29:05 UTC) #274

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: cast_shell_linux on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/cast_shell_linux/builds/315375) ios-device on ...

3 years, 10 months ago (2017-02-23 18:29:07 UTC) #275

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 10 months ago (2017-02-23 21:55:30 UTC) #276

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/860001

3 years, 10 months ago (2017-02-23 21:56:06 UTC) #277

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 10 months ago (2017-02-23 22:06:34 UTC) #278

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: ios-device on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-device/builds/159206) ios-simulator-xcode-clang on ...

3 years, 10 months ago (2017-02-23 22:06:36 UTC) #279

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 10 months ago (2017-02-23 22:07:37 UTC) #280

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/880001

3 years, 10 months ago (2017-02-23 22:08:25 UTC) #281

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 10 months ago (2017-02-23 23:49:13 UTC) #282

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

3 years, 10 months ago (2017-02-23 23:49:15 UTC) #283

bcwhite

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sampling_profiler.cc File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sampling_profiler.cc#newcode600 base/profiler/stack_sampling_profiler.cc:600: finished_event_.Wait(); On 2017/02/22 20:32:17, Mike Wittman wrote: > On ...

3 years, 10 months ago (2017-02-24 20:39:25 UTC) #285

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:600: finished_event_.Wait();
On 2017/02/22 20:32:17, Mike Wittman wrote:
> On 2017/02/22 14:32:51, bcwhite wrote:
> > > One more comment here, after considering the case of performing more than
> > > one collection from the same StackSamplingProfiler object. I believe we
> > > need to move the wait into Stop() to support that case: as it is now the
> > > WaitableEvent remains signalled once the first collection finishes, so the
> > > StackSamplingProfiler object could be destroyed any time after the first
> > > collection completes, without waiting for a later collection to complete.
> > 
> > The same occurred to me but I think it's better to do the wait in Start() so
> > that Stop() remains asynchronous.
> > 
> > This means that the controlling thread won't block until absolutely
> > necessary, which means probably never since the stop will happen within
> > microseconds.
> 
> Sounds reasonable to me, but the current name doesn't quite fit for this use.
> Better to call it something like profiling_inactive_.

Done.
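
The arrangement discussed above (wait in Start(), event initially signaled, named profiling_inactive_) can be sketched roughly as follows. This is standard C++ under stated assumptions, not the patch itself: ManualResetEvent and FakeProfiler are hypothetical stand-ins for base::WaitableEvent and StackSamplingProfiler.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical manual-reset event: stays signaled until Reset() is called.
class ManualResetEvent {
 public:
  explicit ManualResetEvent(bool initially_signaled)
      : signaled_(initially_signaled) {}

  void Signal() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      signaled_ = true;
    }
    cv_.notify_all();
  }

  void Reset() {
    std::lock_guard<std::mutex> lock(mutex_);
    signaled_ = false;
  }

  void Wait() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return signaled_; });
  }

  bool IsSignaled() {
    std::lock_guard<std::mutex> lock(mutex_);
    return signaled_;
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool signaled_;
};

// Hypothetical profiler shell showing only the event handling.
class FakeProfiler {
 public:
  void Start() {
    // Wait for any previously started profiling to complete. The event starts
    // out signaled, so the very first Start() does not block.
    profiling_inactive_.Wait();
    profiling_inactive_.Reset();
    ++starts_;
  }

  // Would be invoked by the sampling thread when a collection finishes.
  void OnCollectionFinished() { profiling_inactive_.Signal(); }

  bool inactive() { return profiling_inactive_.IsSignaled(); }
  int starts() const { return starts_; }

 private:
  ManualResetEvent profiling_inactive_{true};
  int starts_ = 0;
};
```

With this shape Stop() never has to block; only a second Start() (or the destructor) waits, and only if the earlier collection is still in flight.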

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:32: const int NO_ID = -1;
On 2017/02/22 20:32:18, Mike Wittman wrote:
> nit: how about NULL_COLLECTION_ID, so it's clear what state this corresponds
> to?
> 
> This also could use some documentation.

Done.

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:223: int task_runner_create_requests_ =
0;
On 2017/02/22 20:32:18, Mike Wittman wrote:
> remove

Done.

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:224: TimeDelta
task_runner_idle_shutdown_time_ = TimeDelta::FromSeconds(60);
On 2017/02/22 20:32:18, Mike Wittman wrote:
> If we're not supporting configurable shutdown times (per comment in the
> header), then this can be replaced with something like:
>   bool disable_idle_shutdown_for_testing_ = false;
> And the TimeDelta::FromSeconds(60) can be moved into ScheduleShutdownIfIdle().

Done.

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:592:
WaitableEvent::InitialState::SIGNALED),
On 2017/02/22 20:32:18, Mike Wittman wrote:
> Comment on why the initial state is signaled.

Done.

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:630: finished_event_.Reset();
On 2017/02/22 20:32:18, Mike Wittman wrote:
> This code is equivalent to just:
>   finished_event_.Wait();
> 
> For the comment I think it's sufficient to say "Wait for any previously
> started profiling to complete."

Done.

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:642: finished_event_.Signal();
On 2017/02/22 20:32:17, Mike Wittman wrote:
> This is no longer necessary.

Done.

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.h (right):

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.h:194: static void
SetSamplingThreadIdleShutdownTime(int shutdown_ms);
On 2017/02/22 20:32:18, Mike Wittman wrote:
> This function is only ever used to disable the idle shutdown, so we should
> name it accordingly, e.g. DisableSamplingThreadIdleShutdown().
> 
> I can't see a test use for setting the shutdown time that wouldn't be better
> served by explicit coordination using the TestDelegate.

Done.

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.h:291: // An ID uniquely identifying this
collection to the sampling thread.
On 2017/02/22 20:32:18, Mike Wittman wrote:
> Also mention the conditions under which this will be null.

Done.

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:858:
sampling_completed.TimedWait(AVeryLongTimeDelta());
On 2017/02/22 20:32:18, Mike Wittman wrote:
> What's the reason for the TimedWait?

To ensure that it runs to completion before being stopped.  Otherwise it could
stop before the first sample and there wouldn't be any evidence that it would
have run.

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:922: #define
MAYBE_ConcurrentProfiling1 ConcurrentProfiling1
On 2017/02/22 20:32:18, Mike Wittman wrote:
> Test name and comment should be descriptive of the specific conditions this is
> testing, rather than using a numeric suffix.

Done.

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:962:
WaitableEvent::WaitMany(sampling_completed_rawptrs.data(), 2);
On 2017/02/22 20:32:18, Mike Wittman wrote:
> This probably should wait until all profilers have completed, then verify that
> results are as expected. The expectation below is not correct since both of
> the profilers could have completed before this call.

The test below isn't the number of completed profiles but rather that the
completed profile has data.

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1070:
WaitableEvent::WaitMany(sampling_completed_rawptrs.data(), 2);
On 2017/02/22 20:32:18, Mike Wittman wrote:
> 2 => 3, or better, arraysize(params). Since we're duplicating and changing
> tests here, we probably should replace all the constants here and in the
> previous tests with arraysize(params) to make this less fragile.

Done.

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1077: 
On 2017/02/22 20:32:18, Mike Wittman wrote:
> I'd like to have a test that interleaves Start and Stop calls on different
> profilers as well.

Done.

https://codereview.chromium.org/2554123002/diff/820001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/820001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:187: enum ThreadExecutionState {
On 2017/02/23 18:26:52, Mike Wittman wrote:
> This needs substantial comments explaining the lifecycle of the thread and the
> meaning of the different states, so readers can understand the lifecycle and
> corresponding subtleties without having to piece it together from the
> implementing code.

Done.

https://codereview.chromium.org/2554123002/diff/820001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:265: if (state != RUNNING)
On 2017/02/23 18:26:52, Mike Wittman wrote:
> DCHECK_EQ(RUNNING, state)
> 
> unless this is called in the other states in tests

Done.

https://codereview.chromium.org/2554123002/diff/820001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:316: if (state != RUNNING)
On 2017/02/23 18:26:52, Mike Wittman wrote:
> DCHECK_NE(NOT_STARTED, state) before this

It could legitimately be in that state if the collection runs to completion,
the idle time expires, the thread shuts down, and then an attempt is made to
remove the collection.

https://codereview.chromium.org/2554123002/diff/820001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:331: if (task_runner_thread_state_ !=
RUNNING) {
On 2017/02/23 18:26:52, Mike Wittman wrote:
> I think this code would be easier to follow structured like:
> 
> if (task_runner_thread_state_ == RUNNING) {
>   ... code currently in else clause ...
> 
>   return task_runner_;
> }
> 
> if (task_runner_thread_state_ == EXITING) {
>   // ...
>   Stop();
> }
> 
> ... code currently in if clause ...
> 
> return task_runner_;

Done.

https://codereview.chromium.org/2554123002/diff/820001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:363: if (task_runner_thread_state_ !=
RUNNING)
On 2017/02/23 18:26:52, Mike Wittman wrote:
> This would be better as a DCHECK:
> DCHECK(task_runner_thread_state_ == RUNNING ? !!task_runner_ : !task_runner_)

It needs to handle not-RUNNING by exiting early or the GetThreadId() below will
hang.  I can DCHECK the task-runner, though.

https://codereview.chromium.org/2554123002/diff/820001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:572: task_runner_thread_state_ =
NOT_STARTED;
On 2017/02/23 18:26:52, Mike Wittman wrote:
> I don't think we need or want this. CleanUp is documented to be called after
> the message loop ends, so this would overwrite the EXITING state before it
> could be seen by GetOrCreateTaskRunner().

Done.

Mike Wittman

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sampling_profiler_unittest.cc File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sampling_profiler_unittest.cc#newcode858 base/profiler/stack_sampling_profiler_unittest.cc:858: sampling_completed.TimedWait(AVeryLongTimeDelta()); On 2017/02/24 20:39:24, bcwhite wrote: > On 2017/02/22 ...

3 years, 9 months ago (2017-02-27 23:27:35 UTC) #286

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:858:
sampling_completed.TimedWait(AVeryLongTimeDelta());
On 2017/02/24 20:39:24, bcwhite wrote:
> On 2017/02/22 20:32:18, Mike Wittman wrote:
> > What's the reason for the TimedWait?
> 
> To ensure that it runs to completion before being stopped.  Otherwise it could
> stop before the first sample and there wouldn't be any evidence that it would
> have run.

Right, but why not just Wait(), and remove the Wait() call below?

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:962:
WaitableEvent::WaitMany(sampling_completed_rawptrs.data(), 2);
On 2017/02/24 20:39:24, bcwhite wrote:
> On 2017/02/22 20:32:18, Mike Wittman wrote:
> > This probably should wait until all profilers have completed, then verify
> > that results are as expected. The expectation below is not correct since
> > both of the profilers could have completed before this call.
> 
> The test below isn't the number of completed profiles but rather that the
> completed profile has data.

Ah right, misread that.

https://codereview.chromium.org/2554123002/diff/820001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/820001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:316: if (state != RUNNING)
On 2017/02/24 20:39:25, bcwhite wrote:
> On 2017/02/23 18:26:52, Mike Wittman wrote:
> > DCHECK_NE(NOT_STARTED, state) before this
> 
> It could legitimately be in that state if the collection runs to completion,
> the idle time expires, the thread shuts down, and then an attempt is made to
> remove the collection.

Won't it be in the EXITING state in that case?

https://codereview.chromium.org/2554123002/diff/820001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:363: if (task_runner_thread_state_ !=
RUNNING)
On 2017/02/24 20:39:25, bcwhite wrote:
> On 2017/02/23 18:26:52, Mike Wittman wrote:
> > This would be better as a DCHECK:
> > DCHECK(task_runner_thread_state_ == RUNNING ? !!task_runner_ :
> >     !task_runner_)
> 
> It needs to handle not-RUNNING by exiting early or the GetThreadId() below
> will hang.  I can DCHECK the task-runner, though.

How about putting the DCHECKs in a conditional then?

if (task_runner_thread_state_ == RUNNING) {
  // ...
  DCHECK_NE(GetThreadId(), PlatformThread::CurrentId());
  DCHECK(task_runner_);
} else {
  DCHECK(!task_runner_);
}

return task_runner_;

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:191: // because it has exited. It will
be started (or restarted) when a sampling
As the code is currently, the state is only set to NOT_STARTED when the thread
has never been started.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:202: // initiated.
We should mention that new profiling requests (which occur on their own thread)
are responsible for ensuring the exit has completed then starting the thread and
transitioning to the RUNNING state.
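
The lifecycle being described can be sketched as a small state machine. This is an illustration under assumptions, not the patch's code: FakeSamplingThread, EnsureRunningForAdd and BeginIdleShutdown are hypothetical names, and the join/start steps are reduced to counters.

```cpp
#include <cassert>

// NOT_STARTED: the thread has never been started.
// RUNNING:     the thread is up and servicing collections.
// EXITING:     an idle shutdown has been initiated on the sampling thread.
enum class ThreadState { NOT_STARTED, RUNNING, EXITING };

class FakeSamplingThread {
 public:
  // Called on the requesting thread when a new collection is added. It is the
  // requester's job to finish off a pending exit and bring the thread back.
  void EnsureRunningForAdd() {
    if (state_ == ThreadState::RUNNING)
      return;
    if (state_ == ThreadState::EXITING) {
      // Stand-in for Stop()/join: make sure the old thread has fully exited.
      ++joins_;
    }
    // Stand-in for starting (or restarting) the thread.
    ++starts_;
    state_ = ThreadState::RUNNING;
  }

  // Called when the idle timer fires and decides to shut the thread down.
  void BeginIdleShutdown() {
    assert(state_ == ThreadState::RUNNING);
    state_ = ThreadState::EXITING;
  }

  ThreadState state() const { return state_; }
  int starts() const { return starts_; }
  int joins() const { return joins_; }

 private:
  ThreadState state_ = ThreadState::NOT_STARTED;
  int starts_ = 0;
  int joins_ = 0;
};
```

Note that nothing in this sketch ever moves the state back to NOT_STARTED, matching the observation that the state only holds that value before the first start.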

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:340:
StackSamplingProfiler::SamplingThread::GetOrCreateTaskRunner() {
We should rename this to something like GetOrCreateTaskRunnerForAdd, since
incrementing the task_runner_add_events_ is predicated on this only being called
for Add. Also add a comment discussing why this should only be called from Add.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:354: // to call Stop() before Start().
This is safe even the thread has never
The last sentence is no longer relevant and can be removed.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:669: profiling_inactive_.Reset();
We can avoid the manual reset by setting the reset policy to AUTOMATIC.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:929: params[0].initial_delay =
TimeDelta::FromMilliseconds(10);
Do we need an initial delay for this set of params (and the one below)?

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:931:
params[0].samples_per_burst = 10;
Can we reduce this to something like 3-5 samples (and the one below)? If the
sampling is serviced at the normal timer tick interval of 15.6ms, then the 10
samples in this test likely will take 160+ ms.

We should strive to minimize test execution time where possible.

Same comment applies to the subsequent tests as well.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:935:
params[1].samples_per_burst = 10;
We should make this value different than above, and check the number of samples
returned below, to test that we've profiled against each set of parameters once.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:964:
sampling_completed_rawptrs.data(), sampling_completed_rawptrs.size());
This block of code down to this line can be extracted into a utility function
and reused across all these tests.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:969:
EXPECT_TRUE(sampling_completed[other_profiler]->TimedWait(
This should be a regular Wait() call. The test will fail with a time out if the
event is never signaled.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:989: params[1].initial_delay =
TimeDelta::FromMilliseconds(2);
The resolution for all waiting values is the 15.6ms timer tick interval, so this
profiler is likely to start on the same timer tick as the previous one (with 94%
probability).
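
For reference, the 94% figure is consistent with the two profilers' initial delays differing by 1ms, which is an assumption here since the other delay is not quoted in this comment. A minimal sketch of that arithmetic, with SameTickProbability as a hypothetical helper and a uniformly random tick phase assumed:

```cpp
#include <algorithm>
#include <cassert>

// Probability that two wake-ups requested |delta_ms| apart are serviced on the
// same timer tick, assuming the first request lands at a uniformly random
// phase within a tick of length |tick_ms| (tick_ms must be positive).
double SameTickProbability(double delta_ms, double tick_ms) {
  return std::max(0.0, 1.0 - delta_ms / tick_ms);
}
```

With a 1ms difference and a 15.6ms tick this gives 1 - 1/15.6, or about 0.936.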

The same logic applies to the sampling intervals, so this test will
substantially be testing the same behavior as the previous one, except for the
Stop behavior at the end.

As a general principle, we can't rely on wait times to force a specific
interleaving of execution. Even if things work out relative to the timer tick
interval, execution still can be delayed and push things onto the same timer
tick if the system is under load.

Given this, I think the best we can do for the ConcurrentProfiling tests is to
test various interleavings of Start/Stop/destroy calls.

e.g.:

Start() // 1
// sample
Start() // 2
// sample
Stop() // 1
// sample
Stop() // 2

Start() // 1
// sample
Start() // 2
// sample
destroy // 2
// sample
destroy // 1

where the fact that samples have occurred is validated by signaling the test
delegate.
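
The first interleaving above can be modeled as explicit steps instead of timed waits. The sketch below is a simplification under assumptions: it collapses the sampling thread into a direct TakeSample() call, whereas the real test would block on an event signaled by the test delegate at each "// sample" point. FakeSampler and RunStartStartStopStop are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <set>

// Hypothetical single-threaded model of a sampler shared by several profilers.
class FakeSampler {
 public:
  void Start(int profiler_id) { active_.insert(profiler_id); }
  void Stop(int profiler_id) { active_.erase(profiler_id); }

  // Records one sample for every profiler that is currently active.
  void TakeSample() {
    for (int id : active_)
      ++samples_[id];
  }

  int SamplesFor(int profiler_id) const {
    auto it = samples_.find(profiler_id);
    return it == samples_.end() ? 0 : it->second;
  }

 private:
  std::set<int> active_;
  std::map<int, int> samples_;
};

// Start 1, sample, Start 2, sample, Stop 1, sample, Stop 2.
FakeSampler RunStartStartStopStop() {
  FakeSampler sampler;
  sampler.Start(1);
  sampler.TakeSample();
  sampler.Start(2);
  sampler.TakeSample();
  sampler.Stop(1);
  sampler.TakeSample();
  sampler.Stop(2);
  return sampler;
}
```

Because every step is ordered by the test itself, the outcome does not depend on timer resolution or system load.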

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:991:
params[1].samples_per_burst = 10;
If this test is intending to exercise the stopping of one profiler, then the
params for this profiler (or the other) should be set up to ensure it executes
much longer than the other one, to try to avoid races between profiler exit and
the Stop call.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1026: profiler[i].reset();
The resetting code is unnecessary; the profilers will be destroyed when
|profilers| is destroyed at the end of the block.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1039: params[0].initial_delay
= TimeDelta::FromMilliseconds(8);
Same comments here with respect to the timer tick interval and forcing specific
interleaving of execution.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1082: profiler[i].reset();
This is unnecessary.

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 9 months ago (2017-03-06 14:30:18 UTC) #287

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/900001

3 years, 9 months ago (2017-03-06 14:30:36 UTC) #288

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 9 months ago (2017-03-06 16:47:16 UTC) #289

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

3 years, 9 months ago (2017-03-06 16:47:18 UTC) #290

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 9 months ago (2017-03-13 18:42:14 UTC) #291

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/920001

3 years, 9 months ago (2017-03-13 18:43:07 UTC) #292

bcwhite

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sampling_profiler_unittest.cc File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sampling_profiler_unittest.cc#newcode858 base/profiler/stack_sampling_profiler_unittest.cc:858: sampling_completed.TimedWait(AVeryLongTimeDelta()); On 2017/02/27 23:27:34, Mike Wittman wrote: > On ...

3 years, 9 months ago (2017-03-13 18:50:18 UTC) #293

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:858:
sampling_completed.TimedWait(AVeryLongTimeDelta());
On 2017/02/27 23:27:34, Mike Wittman wrote:
> On 2017/02/24 20:39:24, bcwhite wrote:
> > On 2017/02/22 20:32:18, Mike Wittman wrote:
> > > What's the reason for the TimedWait?
> > 
> > To ensure that it runs to completion before being stopped.  Otherwise it
> > could stop before the first sample and there wouldn't be any evidence that
> > it would have run.
> 
> Right, but why not just Wait(), and remove the Wait() call below?

Done.

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:911:
PlatformThread::YieldCurrentThread();
On 2017/02/22 20:32:18, Mike Wittman wrote:
> Channeling brucedawson@: while (condition) yield(); results in a busy wait if
> there are spare execution cycles in the system, where the thread is repeatedly
> scheduled, executes, and yields. This is bad for power usage. Admittedly this
> is not a big concern in tests, but people do tend to copy-paste code around.
> The preferred formulation is while (condition) sleep(1); to allow the
> processor to be idle for some time.

Done.

https://codereview.chromium.org/2554123002/diff/820001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/820001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:316: if (state != RUNNING)
On 2017/02/27 23:27:34, Mike Wittman wrote:
> On 2017/02/24 20:39:25, bcwhite wrote:
> > On 2017/02/23 18:26:52, Mike Wittman wrote:
> > > DCHECK_NE(NOT_STARTED, state) before this
> > 
> > It could legitimately be in that state if the collection runs to
> > completion, the idle time expires, the thread shuts down, and then an
> > attempt is made to remove the collection.
> 
> Won't it be in the EXITING state in that case?

Yes.  Originally I had planned for the thread to go from EXITING to NOT_STARTED
when it had exited completely.

https://codereview.chromium.org/2554123002/diff/820001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:363: if (task_runner_thread_state_ !=
RUNNING)
On 2017/02/27 23:27:34, Mike Wittman wrote:
> On 2017/02/24 20:39:25, bcwhite wrote:
> > On 2017/02/23 18:26:52, Mike Wittman wrote:
> > > This would be better as a DCHECK:
> > > DCHECK(task_runner_thread_state_ == RUNNING ? !!task_runner_ :
> > >     !task_runner_)
> > 
> > It needs to handle not-RUNNING by exiting early or the GetThreadId() below
> > will hang.  I can DCHECK the task-runner, though.
> 
> How about putting the DCHECKS in a conditional then?
> 
> if (task_runner_thread_state_ == RUNNING) {
>   // ...
>   DCHECK_NE(GetThreadId(), PlatformThread::CurrentId());
>   DCHECK(task_runner_);
> } else {
>   DCHECK(!task_runner_);
> }
> 
> return task_runner_;

Done.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:191: // because it has exited. It will
be started (or restarted) when a sampling
On 2017/02/27 23:27:34, Mike Wittman wrote:
> As the code is currently, the state is only set to NOT_STARTED when the thread
> has never been started.

Done.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:202: // initiated.
On 2017/02/27 23:27:34, Mike Wittman wrote:
> We should mention that new profiling requests (which occur on their own
> thread) are responsible for ensuring the exit has completed then starting the
> thread and transitioning to the RUNNING state.

Done.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:340:
StackSamplingProfiler::SamplingThread::GetOrCreateTaskRunner() {
On 2017/02/27 23:27:34, Mike Wittman wrote:
> We should rename this to something like GetOrCreateTaskRunnerForAdd, since
> incrementing the task_runner_add_events_ is predicated on this only being
> called for Add. Also add a comment discussing why this should only be called
> from Add.

Done.  This is exactly why I didn't want these helper methods in the first
place.  They're not "general purpose".

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:354: // to call Stop() before Start().
This is safe even the thread has never
On 2017/02/27 23:27:34, Mike Wittman wrote:
> The last sentence is no longer relevant and can be removed.

Done.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:669: profiling_inactive_.Reset();
On 2017/02/27 23:27:34, Mike Wittman wrote:
> We can avoid the manual reset by setting the reset policy to AUTOMATIC.

True, but only because there is currently nothing else (other than the dtor)
that waits on this. Conceptually it's not an automatic-reset event, since it
could be checked for multiple reasons.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:929: params[0].initial_delay =
TimeDelta::FromMilliseconds(10);
On 2017/02/27 23:27:34, Mike Wittman wrote:
> Do we need an initial delay for this set of params (and the one below)?

The initial delay just provides some extra time to be confident that both are
scheduled before one starts to execute.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:931:
params[0].samples_per_burst = 10;
On 2017/02/27 23:27:35, Mike Wittman wrote:
> Can we reduce this to something like 3-5 samples (and the one below)? If the
> sampling is serviced at the normal timer tick interval of 15.6ms, then the 10
> samples in this test likely will take 160+ ms.

The sampling will take 10ms + timer-resolution.  Samples are taken at strict
times and if the thread runs behind then multiple samples will be taken to
"catch up".

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:935:
params[1].samples_per_burst = 10;
On 2017/02/27 23:27:34, Mike Wittman wrote:
> We should make this value different than above, and check the number of
> samples returned below, to test that we've profiled against each set of
> parameters once.

Done.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:964:
sampling_completed_rawptrs.data(), sampling_completed_rawptrs.size());
On 2017/02/27 23:27:35, Mike Wittman wrote:
> This block of code down to this line can be extracted into a utility function
> and reused across all these tests.

Done.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:969:
EXPECT_TRUE(sampling_completed[other_profiler]->TimedWait(
On 2017/02/27 23:27:35, Mike Wittman wrote:
> This should be a regular Wait() call. The test will fail with a time out if
> the event is never signaled.

Done.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:989: params[1].initial_delay =
TimeDelta::FromMilliseconds(2);
> The resolution for all waiting values is the 15.6ms timer tick interval, so
> this profiler is likely to start on the same timer tick as the previous one
> (with 94% probability).

It doesn't actually matter because events are posted to the queue in order based
on desired run time.  Thus, regardless of when execution is scheduled, the two
tasks will interleave.
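
For illustration, a minimal sketch of why ordering by desired run time yields the interleaving regardless of when the thread actually wakes up. It uses a plain std::priority_queue rather than Chrome's message loop; DelayedTask, RunsLater, and ExecutionOrder are hypothetical names.

```cpp
#include <queue>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical delayed task: ordered by desired run time, with the posting
// sequence number breaking ties.
struct DelayedTask {
  int run_time_ms;
  int sequence;
  std::string name;
};

// Comparator that makes the task with the earliest run time the queue's top.
struct RunsLater {
  bool operator()(const DelayedTask& a, const DelayedTask& b) const {
    return std::tie(a.run_time_ms, a.sequence) >
           std::tie(b.run_time_ms, b.sequence);
  }
};

// Returns the task names in the order a run-time-ordered queue would execute
// them, independent of the order in which they were posted.
std::vector<std::string> ExecutionOrder(const std::vector<DelayedTask>& tasks) {
  std::priority_queue<DelayedTask, std::vector<DelayedTask>, RunsLater> queue(
      RunsLater(), tasks);
  std::vector<std::string> order;
  while (!queue.empty()) {
    order.push_back(queue.top().name);
    queue.pop();
  }
  return order;
}
```

Posting all of profiler A's samples (at 0, 4, 8 ms) before profiler B's (at 2, 6, 10 ms) still produces the order A, B, A, B, A, B in this model.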

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:991:
params[1].samples_per_burst = 10;
On 2017/02/27 23:27:35, Mike Wittman wrote:
> If this test is intending to exercise the stopping of one profiler, then the
> params for this profiler (or the other) should be set up to ensure it executes
> much longer than the other one, to try to avoid races between profiler exit
> and the Stop call.

Races between exit and stop should be safe.  Keeping them close helps test that
edge case.  The test is just "don't crash".

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1026: profiler[i].reset();
On 2017/02/27 23:27:35, Mike Wittman wrote:
> The resetting code is unnecessary; the profilers will be destroyed when
> |profilers| is destroyed at the end of the block.

Yes but they will be destroyed in descending order (opposite of construction). I
want them destroyed in ascending order just like all other calls.
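
For illustration, a minimal sketch of the ordering difference, assuming the profilers are held in a built-in array of std::unique_ptr (the Recorder class and both function names are hypothetical stand-ins). Elements of a built-in array are destroyed in reverse order of construction; for a std::vector the element destruction order is not specified by the standard.

```cpp
#include <memory>
#include <vector>

// Hypothetical stand-in for a profiler: appends its id to a log on destruction.
class Recorder {
 public:
  Recorder(int id, std::vector<int>* log) : id_(id), log_(log) {}
  ~Recorder() { log_->push_back(id_); }

 private:
  int id_;
  std::vector<int>* log_;
};

// Explicit reset() in a loop destroys the objects in ascending index order.
std::vector<int> DestroyWithExplicitReset() {
  std::vector<int> log;
  {
    std::unique_ptr<Recorder> profiler[3];
    for (int i = 0; i < 3; ++i)
      profiler[i].reset(new Recorder(i, &log));
    for (int i = 0; i < 3; ++i)
      profiler[i].reset();
  }
  return log;
}

// Leaving destruction to the end of the block destroys the array elements in
// reverse (descending) order.
std::vector<int> DestroyAtScopeExit() {
  std::vector<int> log;
  {
    std::unique_ptr<Recorder> profiler[3];
    for (int i = 0; i < 3; ++i)
      profiler[i].reset(new Recorder(i, &log));
  }
  return log;
}
```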

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1039: params[0].initial_delay
= TimeDelta::FromMilliseconds(8);
On 2017/02/27 23:27:35, Mike Wittman wrote:
> Same comments here with respect to the timer tick interval and forcing
> specific interleaving of execution.

Acknowledged.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1082: profiler[i].reset();
On 2017/02/27 23:27:34, Mike Wittman wrote:
> This is unnecessary.

Acknowledged.

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 9 months ago (2017-03-13 19:45:58 UTC) #294

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: linux_chromium_chromeos_ozone_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_chromeos_ozone_rel_ng/builds/339041)

3 years, 9 months ago (2017-03-13 19:46:00 UTC) #295

Mike Wittman

3 years, 9 months ago (2017-03-14 18:57:34 UTC) #296

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:911:
PlatformThread::YieldCurrentThread();
On 2017/03/13 18:50:17, bcwhite wrote:
> On 2017/02/22 20:32:18, Mike Wittman wrote:
> > Channeling brucedawson@: while (condition) yield(); results in a busy wait
> > if there are spare execution cycles in the system, where the thread is
> > repeatedly scheduled, executes, and yields. This is bad for power usage.
> > Admittedly this is not a big concern in tests, but people do tend to
> > copy-paste code around. The preferred formulation is while (condition)
> > sleep(1); to allow the processor to be idle for some time.
> 
> Done.

Sleeping for 1ms is preferable, to minimize test execution time.
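
For illustration, a minimal sketch of the preferred formulation in standard C++ rather than Chrome's base:: APIs. WaitUntilSet is a hypothetical helper; the returned count exists only to make the behavior observable.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Polls |done| until it becomes true, sleeping 1 ms between checks so the
// processor can go idle instead of being rescheduled in a yield() loop.
// Returns the number of sleeps performed.
int WaitUntilSet(const std::atomic<bool>& done) {
  int sleeps = 0;
  while (!done.load(std::memory_order_acquire)) {
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
    ++sleeps;
  }
  return sleeps;
}
```

The yield-based variant would replace the sleep_for call with std::this_thread::yield(), which keeps the thread runnable and therefore busy whenever spare cycles exist.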

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:191: // because it has exited. It will
be started (or restarted) when a sampling
On 2017/03/13 18:50:17, bcwhite wrote:
> On 2017/02/27 23:27:34, Mike Wittman wrote:
> > As the code is currently, the state is only set to NOT_STARTED when the
> > thread has never been started.
> 
> Done.

I think it would be clearer to remove the "(or restarted)" part since the thread
is not in this execution state before being restarted.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:340:
StackSamplingProfiler::SamplingThread::GetOrCreateTaskRunner() {
On 2017/03/13 18:50:17, bcwhite wrote:
> On 2017/02/27 23:27:34, Mike Wittman wrote:
> > We should rename this to something like GetOrCreateTaskRunnerForAdd, since
> > incrementing the task_runner_add_events_ is predicated on this only being
> > called for Add. Also add a comment discussing why this should only be called
> > from Add.
> 
> Done.  This is exactly why I didn't want these helper methods in the first
> place.  They're not "general purpose".

The most important values for Chrome code are readability, understandability,
and ease of modification. I think this remains a quite reasonable encapsulation
by those metrics despite the clunky name. If this function needs to be used from
a non-Add function in the future it's a simple modification to revert the name
change and take a flag or callback for the Add case, since the relevant
assumptions are documented (if not DCHECK'ed) in the code.

Writing "general purpose" code outside of public APIs is pretty universally
considered an anti-pattern in Chrome, because it adds complexity based on
assumptions about how the code will be used in the future. When those
assumptions are wrong, as they often are, the added complexity makes it more
difficult to refactor to support use cases the original developer didn't
anticipate: the developer doing the refactoring has to tease apart the aspects
of the general purpose solution that are actually required and used from those
that are superfluous.

The most effective way to make code future-proof is to make it "generalizable"
rather than "general purpose", which is accomplished through readability and
understandability: being explicit about the constraints, assumptions, and intent
of the code.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:669: profiling_inactive_.Reset();
On 2017/03/13 18:50:17, bcwhite wrote:
> On 2017/02/27 23:27:34, Mike Wittman wrote:
> > We can avoid the manual reset by setting the reset policy to AUTOMATIC.
> 
> True but only because there is currently nothing else (other than the dtor)
> that waits on this. Conceptually it's not an automatic-reset that could be
> checked for multiple reasons.

Acknowledged.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:929: params[0].initial_delay =
TimeDelta::FromMilliseconds(10);
On 2017/03/13 18:50:18, bcwhite wrote:
> On 2017/02/27 23:27:34, Mike Wittman wrote:
> > Do we need an initial delay for this set of params (and the one below)?
> 
> The initial delay just provides some extra time to be confident that both are
> scheduled before one starts to execute.

Ok. Please document this within the test.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:931:
params[0].samples_per_burst = 10;
On 2017/03/13 18:50:18, bcwhite wrote:
> On 2017/02/27 23:27:35, Mike Wittman wrote:
> > Can we reduce this to something like 3-5 samples (and the one below)? If the
> > sampling is serviced at the normal timer tick interval of 15.6ms, then the
> > 10 samples in this test likely will take 160+ ms.
> 
> The sampling will take 10ms + timer-resolution.  Samples are taken at strict
> times and if the thread runs behind then multiple samples will be taken to
> "catch up".
> 

Please document this as well.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:989: params[1].initial_delay =
TimeDelta::FromMilliseconds(2);
On 2017/03/13 18:50:18, bcwhite wrote:
> > The resolution for all waiting values is the 15.6ms timer tick interval, so
> > this profiler is likely to start on the same timer tick as the previous one
> > (with 94% probability).
> 
> It doesn't actually matter because events are posted to the queue in order
> based on desired run time.  Thus, regardless of when execution is scheduled,
> the two tasks will interleave.

Ok, I can see that. In that case though, it's not clear to me what the value is
in testing multiple interleavings (see my general comment below).

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:991:
params[1].samples_per_burst = 10;
On 2017/03/13 18:50:18, bcwhite wrote:
> On 2017/02/27 23:27:35, Mike Wittman wrote:
> > If this test is intending to exercise the stopping of one profiler, then the
> > params for this profiler (or the other) should be set up to ensure it
> > executes much longer than the other one, to try to avoid races between
> > profiler exit and the Stop call.
> 
> Races between exit and stop should be safe.  Keeping them close helps test
> that edge case.  The test is just "don't crash".

If the desire is to test both winning and losing the race, that should be done
in two separate tests, each of which is written to unambiguously exercise one
case or the other.

Having a race in the test, even if benign, risks flaky failures if only one of
the two cases fails.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1026: profiler[i].reset();
On 2017/03/13 18:50:18, bcwhite wrote:
> On 2017/02/27 23:27:35, Mike Wittman wrote:
> > The resetting code is unnecessary; the profilers will be destroyed when
> > |profilers| is destroyed at the end of the block.
> 
> Yes but they will be destroyed in descending order (opposite of construction).
> I want them destroyed in ascending order just like all other calls.

Why do you want this in this particular test? How does the ordering of the Stop
calls and destruction relate to the interleaving scenario above? Same question
applies to the tests below too.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1127:
EXPECT_FALSE(sampling_completed[1]->IsSignaled());
The first two profilers could both complete by this point if the system is under
load, resulting in flaky failures.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1132:
EXPECT_FALSE(sampling_completed[2]->IsSignaled());
Same here for the second and third profilers.

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:242: Lock task_runner_lock_;
I think it would be better to call this something like
thread_execution_state_lock_ at this point, since it's basically protecting
changes that affect the thread's execution state.

This naming would also have the benefit of encompassing the protection of the
non-thread-safe Start/Stop/StopSoon/DetachFromSequence Thread API calls. The
documentation should state that these are covered by the lock as well.

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:489: // Another increment of "create
requests" serves to invalidate any pending
create requests => add events

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:549: // those always increments "create
requests". There may be other requests,
create requests => add events

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:940:
TEST(StackSamplingProfilerTest, MAYBE_ConcurrentProfiling_InSync) {
General comments on testing this functionality, now that I mostly understand
what the current tests are doing:

1. The most important aspects of this change to test are the subtleties around
collection and thread lifetime. We should have dedicated tests exercising the
different conditional outcomes for all the conditionals in
GetOrCreateTaskRunnerForAdd, Remove, FinishCollection, and ShutdownTask, and for
the id-not-found cases in RemoveCollectionTask and PerformCollectionTask.

There are several interleavings of Start/Stop/destroy in the current tests, but
it's not clear which of the cases are being exercised by them, which is why they
should have dedicated tests.

2. Testing that multiple profilers can run concurrently is important, but it's
not clear to me what the value is in testing multiple sampling interleavings.
The sampling interleaving is mostly an implementation detail of the profiler --
users won't care how they're interleaved with other collections as long as their
samples are collected close to "on time". The interleaving behavior is also
substantially provided by the message loop, so tests of different interleavings
are testing mostly that behavior not the behavior implemented in this class.
Perhaps I'm missing something though -- is there another reason to test multiple
interleavings?

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1025: // Stop and destroy all
profilers, always in the some order. Don't crash.
nit: same

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1076: // Stop and destroy all
profilers, always in the some order. Don't crash.
nit: same

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 9 months ago (2017-03-16 15:53:53 UTC) #297

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/940001

3 years, 9 months ago (2017-03-16 15:54:22 UTC) #298

bcwhite

3 years, 9 months ago (2017-03-16 15:56:25 UTC) #299

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:911:
PlatformThread::YieldCurrentThread();
On 2017/03/14 18:57:33, Mike Wittman wrote:
> On 2017/03/13 18:50:17, bcwhite wrote:
> > On 2017/02/22 20:32:18, Mike Wittman wrote:
> > > Channeling brucedawson@: while (condition) yield(); results in a busy wait
> > > if there are spare execution cycles in the system, where the thread is
> > > repeatedly scheduled, executes, and yields. This is bad for power usage.
> > > Admittedly this is not a big concern in tests, but people do tend to
> > > copy-paste code around. The preferred formulation is while (condition)
> > > sleep(1); to allow the processor to be idle for some time.
> > 
> > Done.
> 
> Sleeping for 1ms is preferable, to minimize test execution time.

Done.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:191: // because it has exited. It will
be started (or restarted) when a sampling
On 2017/03/14 18:57:33, Mike Wittman wrote:
> On 2017/03/13 18:50:17, bcwhite wrote:
> > On 2017/02/27 23:27:34, Mike Wittman wrote:
> > > As the code is currently, the state is only set to NOT_STARTED when the
> > > thread has never been started.
> > 
> > Done.
> 
> I think it would be clearer to remove the "(or restarted)" part since the
> thread is not in this execution state before being restarted.

Done.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:929: params[0].initial_delay =
TimeDelta::FromMilliseconds(10);
On 2017/03/14 18:57:33, Mike Wittman wrote:
> On 2017/03/13 18:50:18, bcwhite wrote:
> > On 2017/02/27 23:27:34, Mike Wittman wrote:
> > > Do we need an initial delay for this set of params (and the one below)?
> > 
> > The initial delay just provides some extra time to be confident that both
> > are scheduled before one starts to execute.
> 
> Ok. Please document this within the test.

Done.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:931:
params[0].samples_per_burst = 10;
On 2017/03/14 18:57:33, Mike Wittman wrote:
> On 2017/03/13 18:50:18, bcwhite wrote:
> > On 2017/02/27 23:27:35, Mike Wittman wrote:
> > > Can we reduce this to something like 3-5 samples (and the one below)? If
> > > the sampling is serviced at the normal timer tick interval of 15.6ms, then
> > > the 10 samples in this test likely will take 160+ ms.
> > 
> > The sampling will take 10ms + timer-resolution.  Samples are taken at strict
> > times and if the thread runs behind then multiple samples will be taken to
> > "catch up".
> > 
> 
> Please document this as well.

Done.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:991:
params[1].samples_per_burst = 10;
> If the desire is to test both winning and losing the race, that should be done
> in two separate tests, each of which is written to unambiguously exercise one
> case or the other.

The desire is to make sure that different sampling parameters will operate in
parallel.  Anything else is superfluous.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1026: profiler[i].reset();
On 2017/03/14 18:57:33, Mike Wittman wrote:
> On 2017/03/13 18:50:18, bcwhite wrote:
> > On 2017/02/27 23:27:35, Mike Wittman wrote:
> > > The resetting code is unnecessary; the profilers will be destroyed when
> > > |profilers| is destroyed at the end of the block.
> > 
> > Yes but they will be destroyed in descending order (opposite of
> > construction). I want them destroyed in ascending order just like all other
> > calls.
> 
> Why do you want this in this particular test? How does the ordering of the
> Stop calls and destruction relate to the interleaving scenario above? Same
> question applies to the tests below too.

It's an "interleave" test so order needs to be consistent to ensure that.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1127:
EXPECT_FALSE(sampling_completed[1]->IsSignaled());
On 2017/03/14 18:57:33, Mike Wittman wrote:
> The first two profilers could both complete by this point if the system is
> under load, resulting in flaky failures.

Done.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1132:
EXPECT_FALSE(sampling_completed[2]->IsSignaled());
On 2017/03/14 18:57:33, Mike Wittman wrote:
> Same here for the second and third profilers.

Done.

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:242: Lock task_runner_lock_;
> I think it would be better to call this something like
> thread_execution_state_lock_ at this point, since it's basically protecting
> changes that affect the thread's execution state.

All the variables it protects start with task_runner_.

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:489: // Another increment of "create
requests" serves to invalidate any pending
On 2017/03/14 18:57:33, Mike Wittman wrote:
> create requests => add events

Done.

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:549: // those always increments "create
requests". There may be other requests,
On 2017/03/14 18:57:33, Mike Wittman wrote:
> create requests => add events

Done.

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:940:
TEST(StackSamplingProfilerTest, MAYBE_ConcurrentProfiling_InSync) {
> 1. The most important aspects of this change to test are the subtleties around
> collection and thread lifetime. We should have dedicated tests exercising the
> different conditional outcomes for all the conditionals in
> GetOrCreateTaskRunnerForAdd, Remove, FinishCollection, and ShutdownTask, and
> for the id-not-found cases in RemoveCollectionTask and PerformCollectionTask.

GetOrCreateTaskRunnerForAdd:
- state==RUNNING: tested by ConcurrentProfiling_*
  Entered when doing parallel sampling.
- state==EXITING: tested by WillRestartSampler
  Entered after sampler shutdown.
- state==other: tested by everything
  Every first sampler will exercise this state.

Remove:
- state!=RUNNING: added StopAfterIdle test
- state==other: tested by every valid Stop()

ShutdownTask:
- not idle: tested everywhere
- changed add-event
  This is to handle a race condition which is necessarily difficult
  (if not impossible) to test.

RemoveCollectionTask:
- found: tested every stop before completion
- not-found: tested every stop after completion

PerformCollectionTask:
- found: tested with every valid sample
- not-found: tested every stop before completion


> 2. Testing that multiple profilers can run concurrently is important, but it's
> not clear to me what the value is in testing multiple sampling interleavings.
> The sampling interleaving is mostly an implementation detail of the profiler
> -- users won't care how they're interleaved with other collections as long as
> their samples are collected close to "on time". The interleaving behavior is
> also substantially provided by the message loop, so tests of different
> interleavings are testing mostly that behavior not the behavior implemented in
> this class. Perhaps I'm missing something though -- is there another reason to
> test multiple interleavings?

Removed one of the tests.

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1025: // Stop and destroy all
profilers, always in the some order. Don't crash.
On 2017/03/14 18:57:33, Mike Wittman wrote:
> nit: same

Done.

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1076: // Stop and destroy all
profilers, always in the some order. Don't crash.
On 2017/03/14 18:57:34, Mike Wittman wrote:
> nit: same

Done.

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 9 months ago (2017-03-16 17:06:53 UTC) #300

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: linux_chromium_chromeos_ozone_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_chromeos_ozone_rel_ng/builds/341630)

3 years, 9 months ago (2017-03-16 17:06:55 UTC) #301

Mike Wittman

3 years, 9 months ago (2017-03-18 01:38:41 UTC) #302

Still taking another look at the ConcurrentProfiling_* tests.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:931:
params[0].samples_per_burst = 10;
On 2017/03/16 15:56:25, bcwhite wrote:
> On 2017/03/14 18:57:33, Mike Wittman wrote:
> > On 2017/03/13 18:50:18, bcwhite wrote:
> > > On 2017/02/27 23:27:35, Mike Wittman wrote:
> > > > Can we reduce this to something like 3-5 samples (and the one below)? If
> > > > the sampling is serviced at the normal timer tick interval of 15.6ms,
> > > > then the 10 samples in this test likely will take 160+ ms.
> > > 
> > > The sampling will take 10ms + timer-resolution.  Samples are taken at
> > > strict times and if the thread runs behind then multiple samples will be
> > > taken to "catch up".
> > > 
> > 
> > Please document this as well.
> 
> Done.

I don't see something equivalent to this in the comments. The text above ("The
sampling will take 10ms + ...") would be fine to just copy into the comment.

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:242: Lock task_runner_lock_;
On 2017/03/16 15:56:25, bcwhite wrote:
> > I think it would be better to call this something like
> > thread_execution_state_lock_ at this point, since it's basically protecting
> > changes that affect the thread's execution state.
> 
> All the variables it protects start with task_runner_.

Yes, and their names should be updated also, for the same reason.

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:940:
TEST(StackSamplingProfilerTest, MAYBE_ConcurrentProfiling_InSync) {
On 2017/03/16 15:56:25, bcwhite wrote:
> > 1. The most important aspects of this change to test are the subtleties
> > around collection and thread lifetime. We should have dedicated tests
> > exercising the different conditional outcomes for all the conditionals in
> > GetOrCreateTaskRunnerForAdd, Remove, FinishCollection, and ShutdownTask, and
> > for the id-not-found cases in RemoveCollectionTask and
> > PerformCollectionTask.

Relying on existing tests is OK if the test is obviously and reliably testing
the particular behavior. All the other cases need dedicated tests so that (1)
it's clear that the behavior is important and that it's actually being tested,
and (2) the test of the behavior survives changes and refactoring to the code
and tests. In particular, the behavior under test in several of the cases below
is subject to races dependent on the sampling parameters, system load, and
shutdown idle time.

> GetOrCreateTaskRunnerForAdd:
> - state==RUNNING: tested by ConcurrentProfiling_*
>   Entered when doing parallel sampling.

We can't rely on the RUNNING state being tested in the ConcurrentProfiling_*
tests because the state in the second and later Start() calls is subject to
races dependent on the sampling parameters, the load on the system, and the
shutdown idle time.

> - state==EXITING: tested by WillRestartSampler
>   Entered after sampler shutdown.

This is a good test for this behavior.

> - state==other: tested by everything
>   Every first sampler will exercise this state.
> 
> Remove:
> - state!=RUNNING: added StopAfterIdle test

This is also a good test for this behavior.

> - state==other: tested by every valid Stop()

We can't rely on state==other(EXITING) being tested in Remove for the same
configuration/load-dependent reasons as for state==RUNNING in
GetOrCreateTaskRunnerForAdd.

> ShutdownTask:
> - not idle: tested everywhere

This is not tested everywhere, if I'm not mistaken, since ShutdownTask is only
executed in tests where InitiateSamplingThreadIdleShutdown() is called.
Independent of that, I think this conditional can be removed entirely. See the
comment on the code.

> - changed add-event
>   This is to handle a race condition which is necessarily difficult
>   (if not impossible) to test.

If the previous conditional is removed, this can be tested by calling
InitiateSamplingThreadIdleShutdown(true) while a collection is in progress.

> RemoveCollectionTask:
> - found: tested every stop before completion

We can't rely on the item being found for the same
configuration/load-dependent reasons as above.

> - not-found: tested every stop after completion

This is pretty subtle in most of the tests but it's a key part of StopAfterIdle,
so I think that's reasonable.

> PerformCollectionTask:
> - found: tested with every valid sample
> - not-found: tested every stop before completion

We can't rely on the item being not-found for the same
configuration/load-dependent reasons as above.

When run on their own, tests of the cases with configuration/load-dependent
sensitivity can have exactly the configuration they need to avoid races in the
behavior under test. They should be set up to be race-free even if the idle
shutdown is set to a very short duration.

https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:543: return;
I don't think this conditional is needed.

Alternative solution:

1. Add a bool simulateDefunctShutdown parameter to
InitiateSamplingThreadIdleShutdown(). Bind sampler->task_runner_add_events_-1 to
the posted task if true.

2. Document on the InitiateSamplingThreadIdleShutdown() interface that
simulateDefunctShutdown must be set to true if any collections are still active.

3. CHECK(simulateDefunctShutdown || active_collections_.empty()) within
InitiateSamplingThreadIdleShutdown() to enforce (2).

4. Eliminate a race in FinishCollection so that tests can depend on the profiler
being idle when all collections have finished: at the end of the function, save
off the finished WaitableEvent and only signal it after
active_collections_.erase().
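
A rough sketch of the ordering in step 4, under stated assumptions: SimpleEvent
is a stand-in for base::WaitableEvent, and Sampler, Add() and IsIdle() are
hypothetical names used only for illustration, not the real profiler code. The
sketch also adds a lock so that IsIdle() can be called from a test thread. The
only point is that the event is saved off and signaled after the erase, so a
waiter that wakes up always observes an idle sampler.

```cpp
#include <cassert>
#include <condition_variable>
#include <map>
#include <mutex>

// Stand-in for base::WaitableEvent (assumed manual-reset for this sketch).
class SimpleEvent {
 public:
  void Signal() {
    std::lock_guard<std::mutex> lock(mutex_);
    signaled_ = true;
    cv_.notify_all();
  }
  void Wait() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return signaled_; });
  }
  bool IsSignaled() {
    std::lock_guard<std::mutex> lock(mutex_);
    return signaled_;
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool signaled_ = false;
};

// Hypothetical sampler that tracks active collections by id.
class Sampler {
 public:
  // Registers a collection; |finished| is signaled when it completes.
  int Add(SimpleEvent* finished) {
    assert(finished != nullptr);
    std::lock_guard<std::mutex> lock(lock_);
    int id = next_id_++;
    active_collections_[id] = finished;
    return id;
  }

  void FinishCollection(int id) {
    SimpleEvent* finished = nullptr;
    {
      std::lock_guard<std::mutex> lock(lock_);
      auto it = active_collections_.find(id);
      if (it == active_collections_.end())
        return;
      finished = it->second;  // Save off the event first...
      active_collections_.erase(it);
    }
    finished->Signal();  // ...and signal only after the erase.
  }

  bool IsIdle() {
    std::lock_guard<std::mutex> lock(lock_);
    return active_collections_.empty();
  }

 private:
  std::mutex lock_;
  std::map<int, SimpleEvent*> active_collections_;
  int next_id_ = 0;
};
```

With this ordering, a test that waits on the event and then checks IsIdle()
cannot see the finished collection still registered.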

https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:355: void
CreateProfilers(PlatformThreadId target_thread_id,
Nice, encapsulating this functionality makes the tests cleaner.

Can you pass the input profiles as a vector, and the three output arguments as
pointers to empty vectors (and push_back() here), so we don't need to keep all
the array/vector sizes in sync in the tests?

Also, input arguments should appear before output arguments in the interface.
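
Something along these lines is what I have in mind. This is only a sketch: the
SamplingParams, Profile and Profiler types are simplified stand-ins, the three
output arguments are guesses at what the helper returns, and plain assert() is
used only so the snippet is self-contained (a real test helper would use the
gtest macros).

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Simplified, hypothetical stand-ins for the real profiler types.
struct SamplingParams {
  int samples_per_burst = 1;
};
struct Profile {
  int sample_count = 0;
};
struct Profiler {
  explicit Profiler(const SamplingParams& p) : params(p) {}
  SamplingParams params;
};

// Inputs come first (by value or const reference); outputs come last (by
// pointer). The outputs must be empty on entry and are filled with one entry
// per element of |params|, so callers never have to keep sizes in sync.
void CreateProfilers(int target_thread_id,
                     const std::vector<SamplingParams>& params,
                     std::vector<Profile>* profiles,
                     std::vector<bool>* completed,
                     std::vector<std::unique_ptr<Profiler>>* profilers) {
  assert(profiles->empty());
  assert(completed->empty());
  assert(profilers->empty());
  (void)target_thread_id;  // Unused in this sketch.

  for (const SamplingParams& p : params) {
    profiles->push_back(Profile());
    completed->push_back(false);
    profilers->push_back(std::make_unique<Profiler>(p));
  }
}
```

A test then only declares the params vector and three empty output vectors;
the number of profilers follows from params.size().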

https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:359: SamplingParams* params,
const

https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:922: #define
MAYBE_WillRestartSampler WillRestartSampler
WillRestartSamplerAfterIdleShutdown would be a better name for this test.

https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:943: while
(StackSamplingProfiler::TestAPI::IsSamplingThreadRunning())
Thinking about this in terms of the underlying StackSamplingProfiler state, I
don't see the need to wait for the thread to exit in this test.

Thread::Stop() is documented to handle both the pre-exit and exited cases, so
the behavior in StackSamplingProfiler is the same regardless of which state the
thread is in.

https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:956: #define
MAYBE_StopAfterIdle StopAfterIdle
StopAfterIdleShutdown would be a better name.

https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:980: while
(StackSamplingProfiler::TestAPI::IsSamplingThreadRunning())
I don't think we need this for the same reason as above.

https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1001: // run at their
scheduled, interleaved times regardless of whatever
This doesn't make sense to me. How can the samples run at their scheduled times
if the thread hasn't woken up by then?

https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1110: profiler[0]->Start();
How does the ordering of the Start and Stop calls and destruction relate to the
parameters above?

Mike Wittman

https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sampling_profiler.cc File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sampling_profiler.cc#newcode543 base/profiler/stack_sampling_profiler.cc:543: return; On 2017/03/18 01:38:41, Mike Wittman wrote: > I ...

3 years, 9 months ago (2017-03-20 14:59:40 UTC) #303

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 9 months ago (2017-03-20 20:50:17 UTC) #304

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/960001

3 years, 9 months ago (2017-03-20 20:51:06 UTC) #305

bcwhite

3 years, 9 months ago (2017-03-20 21:50:51 UTC) #306

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:931:
params[0].samples_per_burst = 10;
On 2017/03/18 01:38:41, Mike Wittman wrote:
> On 2017/03/16 15:56:25, bcwhite wrote:
> > On 2017/03/14 18:57:33, Mike Wittman wrote:
> > > On 2017/03/13 18:50:18, bcwhite wrote:
> > > > On 2017/02/27 23:27:35, Mike Wittman wrote:
> > > > > Can we reduce this to something like 3-5 samples (and the one below)?
> > > > > If the sampling is serviced at the normal timer tick interval of
> > > > > 15.6ms, then the 10 samples in this test likely will take 160+ ms.
> > > > 
> > > > The sampling will take 10ms + timer-resolution.  Samples are taken at
> > > > strict times and if the thread runs behind then multiple samples will be
> > > > taken to "catch up".
> > > > 
> > > 
> > > Please document this as well.
> > 
> > Done.
> 
> I don't see something equivalent to this in the comments. The text above ("The
> sampling will take 10ms + ...") would be fine to just copy into the comment.

Done.
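
To make the "catch up" arithmetic concrete, here is a minimal sketch. It
assumes sample i is scheduled strictly at i * interval after the start, and
the 1 ms interval used in the comments is inferred from the "10ms" figure
above rather than taken from the code. The function names are hypothetical;
the real scheduling logic may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Number of samples whose scheduled time has passed |elapsed_us| after the
// start, when sample i is due at i * interval_us (so sample 0 is due at once).
int64_t SamplesScheduledBy(int64_t elapsed_us,
                           int64_t interval_us,
                           int64_t samples_per_burst) {
  assert(interval_us > 0);
  if (elapsed_us < 0)
    return 0;
  return std::min(samples_per_burst, elapsed_us / interval_us + 1);
}

// Number of samples to take right now so the burst is back on schedule.
int64_t SamplesToTakeNow(int64_t elapsed_us,
                         int64_t interval_us,
                         int64_t samples_per_burst,
                         int64_t already_taken) {
  int64_t due = SamplesScheduledBy(elapsed_us, interval_us, samples_per_burst);
  return std::max<int64_t>(0, due - already_taken);
}
```

Under these assumptions, a thread that first wakes 15.6 ms after the start of
a 10-sample, 1 ms burst has all 10 samples due at once, which is why the burst
ends roughly one timer tick after its scheduled end instead of taking ten
ticks.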

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:242: Lock task_runner_lock_;
On 2017/03/18 01:38:41, Mike Wittman wrote:
> On 2017/03/16 15:56:25, bcwhite wrote:
> > > I think it would be better to call this something like
> > > thread_execution_state_lock_ at this point, since it's basically
> > > protecting changes that affect the thread's execution state.
> > 
> > All the variables it protects start with task_runner_.
> 
> Yes, and their names should be updated also, for the same reason.

Done.

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:940:
TEST(StackSamplingProfilerTest, MAYBE_ConcurrentProfiling_InSync) {
> Relying on existing tests is OK if the test is obviously and reliably testing
> the particular behavior. All the other cases need dedicated tests so that (1)
> it's clear that the behavior is important and that it's actually being tested,
> and (2) the test of the behavior survives changes and refactoring to the code
> and tests. In particular, the behavior under test in several of the cases
> below is subject to races dependent on the sampling parameters, system load,
> and shutdown idle time.
>
> > GetOrCreateTaskRunnerForAdd:
> > - state==RUNNING: tested by ConcurrentProfiling_*
> >   Entered when doing parallel sampling.
> 
> We can't rely on the RUNNING state being tested in the ConcurrentProfiling_*
> tests because the state in the second and later Start() calls is subject to
> races dependent on the sampling parameters, the load on the system, and the
> shutdown idle time.

I've disabled the idle shutdown in those tests.  60 seconds wasn't going to be
an issue but this is guaranteed.


> > - state==EXITING: tested by WillRestartSampler
> >   Entered after sampler shutdown.
> 
> This is a good test for this behavior.
> 
> > - state==other: tested by everything
> >   Every first sampler will exercise this state.
> > 
> > Remove:
> > - state!=RUNNING: added StopAfterIdle test
> 
> This is also a good test for this behavior.
> 
> > - state==other: tested by every valid Stop()
> 
> We can't rely on state==other(EXITING) being tested in Remove for the same
> configuration/load-dependent reasons as for state==RUNNING in
> GetOrCreateTaskRunnerForAdd.

For EXITING state specifically, the StopAfterIdle test will call Remove() when
it is in the EXITING state.


> > ShutdownTask:
> > - not idle: tested everywhere
> 
> This is not tested everywhere, if I'm not mistaken, since ShutdownTask is only
> executed in tests where InitiateSamplingThreadIdleShutdown() is called.
> Independent of that, I think this conditional can be removed entirely. See the
> comment on the code.

And wherever DisableIdleShutdown is not called.


> > - changed add-event
> >   This is to handle a race condition which is necessarily difficult
> >   (if not impossible) to test.
> 
> If the previous conditional is removed, this can be tested by calling
> InitiateSamplingThreadIdleShutdown(true) while a collection is in process.
> 
> > RemoveCollectionTask:
> > - found: tested every stop before completion
> 
> We can't rely on the item being found for the same
> configuration/load-dependent reasons as above.

There are dozens of stop calls before completion, some with huge timeouts. 
They're not all going to miss.


> > - not-found: tested every stop after completion
> 
> This is pretty subtle in most of the tests but it's a key part of
> StopAfterIdle. I think that's reasonable.
> 
> > PerformCollectionTask:
> > - found: tested with every valid sample
> > - not-found: tested every stop before completion
> 
> We can't rely on the item being not-found for the same
> configuration/load-dependent reasons as above.

For all practical purposes, I believe we can.  Your original StopDuring*() tests
do this.

https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:543: return;
> I don't think this conditional is needed.
> 
> Alternative solution:
> 
> 1. Add a bool simulateDefunctShutdown parameter to
> InitiateSamplingThreadIdleShutdown(). Bind sampler->task_runner_add_events_-1
> to the posted task if true.
> 
> 2. Document on the InitiateSamplingThreadIdleShutdown() interface that
> simulateDefunctShutdown must be set to true if any collections are still
> active.

That puts a lot of burden and complexity on the test to avoid what amounts to a
"var != 0" at the top of this method.  I don't see it being worth it.

https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:355: void
CreateProfilers(PlatformThreadId target_thread_id,
On 2017/03/18 01:38:41, Mike Wittman wrote:
> Nice, encapsulating this functionality makes the tests cleaner.
> 
> Can you pass the input profiles as a vector, and the three output arguments as
> pointers to empty vectors (and push_back() here), so we don't need to keep all
> the array/vector sizes in sync in the tests?
> 
> Also, input arguments should appear before output arguments in the interface.

Done.

https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:359: SamplingParams* params,
On 2017/03/18 01:38:41, Mike Wittman wrote:
> const

Done.

https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:922: #define
MAYBE_WillRestartSampler WillRestartSampler
On 2017/03/18 01:38:41, Mike Wittman wrote:
> WillRestartSamplerAfterIdleShutdown would be a better name for this test.

Done.

https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:943: while
(StackSamplingProfiler::TestAPI::IsSamplingThreadRunning())
On 2017/03/18 01:38:41, Mike Wittman wrote:
> Thinking about this in terms of the underlying StackSamplingProfiler state, I
> don't see the need to wait for the thread to exit in this test.
> 
> Thread::Stop() is documented to handle both the pre-exit and exited cases, so
> the behavior in StackSamplingProfiler is the same regardless of which state
> the thread is in.

But then you wouldn't be able to tell that the thread restarted.  For all the
test could know, the same thread would have continued to run.

https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:956: #define
MAYBE_StopAfterIdle StopAfterIdle
On 2017/03/18 01:38:41, Mike Wittman wrote:
> StopAfterIdleShutdown would be a better name.

Done.

https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1001: // run at their
scheduled, interleaved times regardless of whatever
On 2017/03/18 01:38:41, Mike Wittman wrote:
> This doesn't make sense to me. How can the samples run at their scheduled
> times if the thread hasn't woken up by then?

Fixed wording.

https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1110: profiler[0]->Start();
On 2017/03/18 01:38:41, Mike Wittman wrote:
> How does the ordering of the Start and Stop calls and destruction relate to
> the parameters above?

They don't.  It's three different sampling parameters that start in a staggered
ordering.

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 9 months ago (2017-03-20 22:19:18 UTC) #307

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: linux_chromium_chromeos_ozone_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_chromeos_ozone_rel_ng/builds/343714)

3 years, 9 months ago (2017-03-20 22:19:20 UTC) #308

Mike Wittman

3 years, 9 months ago (2017-03-21 16:50:38 UTC) #309

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:242: Lock task_runner_lock_;
On 2017/03/20 21:50:51, bcwhite wrote:
> On 2017/03/18 01:38:41, Mike Wittman wrote:
> > On 2017/03/16 15:56:25, bcwhite wrote:
> > > > I think it would be better to call this something like
> > > > thread_execution_state_lock_ at this point, since it's basically
> > > > protecting changes that affect the thread's execution state.
> > > 
> > > All the variables it protects start with task_runner_.
> > 
> > Yes, and their names should be updated also, for the same reason.
> 
> Done.

Please add a comment stating that the lock also protects execution of the
non-thread-safe Thread API calls related to the execution state: Start, Stop,
StopSoon, DetachFromSequence.
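
As an illustration of what that comment would be describing, here is a
minimal sketch. FakeThread stands in for base::Thread, and
SamplingThreadController, EnsureRunning() and BeginIdleShutdown() are
hypothetical names; only the lock name and the RUNNING/EXITING states come
from this review. The point is that the state variable and the
non-thread-safe thread calls are only ever touched with the lock held.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-in for base::Thread. As with the real class, these
// methods are not safe to call concurrently from different threads.
class FakeThread {
 public:
  void Start() {
    assert(!running_);
    running_ = true;
    ++start_count_;
  }
  void StopSoon() { running_ = false; }  // Pretend the thread exits promptly.
  void Stop() { running_ = false; }      // Fine whether or not it has exited.
  int start_count() const { return start_count_; }

 private:
  bool running_ = false;
  int start_count_ = 0;
};

enum class ThreadExecutionState { NOT_STARTED, RUNNING, EXITING };

class SamplingThreadController {
 public:
  // Starts the thread if needed, restarting it after an idle shutdown.
  void EnsureRunning() {
    std::lock_guard<std::mutex> lock(thread_execution_state_lock_);
    if (state_ == ThreadExecutionState::RUNNING)
      return;
    if (state_ == ThreadExecutionState::EXITING)
      thread_.Stop();  // Join the exiting thread before restarting it.
    thread_.Start();
    state_ = ThreadExecutionState::RUNNING;
  }

  // Asks the thread to exit because it has been idle.
  void BeginIdleShutdown() {
    std::lock_guard<std::mutex> lock(thread_execution_state_lock_);
    if (state_ != ThreadExecutionState::RUNNING)
      return;
    thread_.StopSoon();
    state_ = ThreadExecutionState::EXITING;
  }

  ThreadExecutionState state() {
    std::lock_guard<std::mutex> lock(thread_execution_state_lock_);
    return state_;
  }

  int thread_start_count() {
    std::lock_guard<std::mutex> lock(thread_execution_state_lock_);
    return thread_.start_count();
  }

 private:
  // Protects |state_| and also every call into the non-thread-safe |thread_|
  // API that changes its execution state (Start, Stop, StopSoon).
  std::mutex thread_execution_state_lock_;
  ThreadExecutionState state_ = ThreadExecutionState::NOT_STARTED;
  FakeThread thread_;
};
```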

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:940:
TEST(StackSamplingProfilerTest, MAYBE_ConcurrentProfiling_InSync) {
To repeat:
Relying on existing tests is OK if the test is obviously and reliably testing
the particular behavior. *All the other cases need dedicated tests* so that (1)
it's clear that the behavior is important and that it's actually being tested,
and (2) the test of the behavior survives changes and refactoring to the code
and tests.

The purpose in pointing out possible races is not to try to fix or explain away
the races. It's to demonstrate that the existing tests are not obviously and
reliably testing some of the behaviors, and that the test of those behaviors
needs to be split out into dedicated tests. Please create separate tests for
those behaviors.

> > > Remove:
> > > - state!=RUNNING: added StopAfterIdle test
> > 
> > This is also a good test for this behavior.
> > 
> > > - state==other: tested by every valid Stop()
> > 
> > We can't rely on state==other(EXITING) being tested in Remove for the same
> > configuration/load-dependent reasons as for state==RUNNING in
> > GetOrCreateTaskRunnerForAdd.
> 
> For EXITING state specifically, the StopAfterIdle test will call Remove() when
> it is in the EXITING state.

I think this is true but it took me 10+ minutes of reasoning about the code to
reach that conclusion. So this is still not obvious and still needs a dedicated
test.

https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:543: return;
On 2017/03/20 21:50:51, bcwhite wrote:
> > I don't think this conditional is needed.
> > 
> > Alternative solution:
> > 
> > 1. Add a bool simulateDefunctShutdown parameter to
> > InitiateSamplingThreadIdleShutdown(). Bind
> > sampler->task_runner_add_events_-1 to the posted task if true.
> > 
> > 2. Document on the InitiateSamplingThreadIdleShutdown() interface that
> > simulateDefunctShutdown must be set to true if any collections are still
> > active.
> 
> That puts a lot of burden and complexity on the test to avoid what amounts to
> a "var != 0" at the top of this method.  I don't see it being worth it.

Why do we need this conditional at all at this point? With the changes to
FinishCollection the TestAPI now has the ability to tell if the profiler under
test is idle.

As a side note, test code is exactly where the burden of test complexity
belongs. Putting it in production code makes the code less readable and
understandable, especially in cases like this where the behavior is already
very subtle.

https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:355: void
CreateProfilers(PlatformThreadId target_thread_id,
On 2017/03/20 21:50:51, bcwhite wrote:
> On 2017/03/18 01:38:41, Mike Wittman wrote:
> > Nice, encapsulating this functionality makes the tests cleaner.
> > 
> > Can you pass the input profiles as a vector, and the three output arguments
> > as pointers to empty vectors (and push_back() here), so we don't need to
> > keep all the array/vector sizes in sync in the tests?
> > 
> > Also, input arguments should appear before output arguments in the
> > interface.
> 
> Done.

params should be passed as a vector also.

https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:943: while
(StackSamplingProfiler::TestAPI::IsSamplingThreadRunning())
On 2017/03/20 21:50:51, bcwhite wrote:
> On 2017/03/18 01:38:41, Mike Wittman wrote:
> > Thinking about this in terms of the underlying StackSamplingProfiler state,
> > I don't see the need to wait for the thread to exit in this test.
> > 
> > Thread::Stop() is documented to handle both the pre-exit and exited cases,
> > so the behavior in StackSamplingProfiler is the same regardless of which
> > state the thread is in.
> 
> But then you wouldn't be able to tell that the thread restarted.  For all the
> test could know, the same thread would have continued to run.

That's true. But whether the thread stopped after idle shutdown is a different
concern than whether a new collection can start. It's also a very important
behavior on its own and deserves to be separated out into its own test.

https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1110: profiler[0]->Start();
On 2017/03/20 21:50:51, bcwhite wrote:
> On 2017/03/18 01:38:41, Mike Wittman wrote:
> > How does the ordering of the Start and Stop calls and destruction relate to
> > the parameters above?
> 
> They don't.  It's three different sampling parameters that start in a
> staggered ordering.

In that case, if there's still a motivation for this ordering then it should be
documented and moved to its own test since it's independent. If not, then this
can be simplified to three Start() calls followed by three wait calls, with the
Stop() calls removed.

https://codereview.chromium.org/2554123002/diff/960001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/960001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:430: // move them because this
collection is about to be deleted.
This last sentence is no longer relevant and can be removed.

https://codereview.chromium.org/2554123002/diff/960001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/960001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:362: CHECK(profiles->empty());
ASSERT_TRUE rather than CHECK in test code (or EXPECT_TRUE if ASSERT_TRUE
doesn't work here)

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 9 months ago (2017-03-22 17:44:11 UTC) #310

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/980001

3 years, 9 months ago (2017-03-22 17:45:10 UTC) #311

bcwhite

3 years, 9 months ago (2017-03-22 17:48:55 UTC) #312

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:242: Lock task_runner_lock_;
On 2017/03/21 16:50:38, Mike Wittman wrote:
> On 2017/03/20 21:50:51, bcwhite wrote:
> > On 2017/03/18 01:38:41, Mike Wittman wrote:
> > > On 2017/03/16 15:56:25, bcwhite wrote:
> > > > > I think it would be better to call this something like
> > > > > thread_execution_state_lock_ at this point, since it's basically
> > > > > protecting changes that affect the thread's execution state.
> > > > 
> > > > All the variables it protects start with task_runner_.
> > > 
> > > Yes, and their names should be updated also, for the same reason.
> > 
> > Done.
> 
> Please add a comment stating that the lock also protects execution of the
> non-thread-safe Thread API calls related to the execution state: Start, Stop,
> StopSoon, DetachFromSequence.

Done.

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:940:
TEST(StackSamplingProfilerTest, MAYBE_ConcurrentProfiling_InSync) {
On 2017/03/21 16:50:38, Mike Wittman wrote:
> To repeat:
> Relying on existing tests is OK if the test is obviously and reliably testing
> the particular behavior. *All the other cases need dedicated tests* so that
> (1) it's clear that the behavior is important and that it's actually being
> tested, and (2) the test of the behavior survives changes and refactoring to
> the code and tests.
> 
> The purpose in pointing out possible races is not to try to fix or explain
> away the races. It's to demonstrate that the existing tests are not obviously
> and reliably testing some of the behaviors, and that the test of those
> behaviors needs to be split out into dedicated tests. Please create separate
> tests for those behaviors.
> 
> > > > Remove:
> > > > - state!=RUNNING: added StopAfterIdle test
> > > 
> > > This is also a good test for this behavior.
> > > 
> > > > - state==other: tested by every valid Stop()
> > > 
> > > We can't rely on state==other(EXITING) being tested in Remove for the same
> > > configuration/load-dependent reasons as for state==RUNNING in
> > > GetOrCreateTaskRunnerForAdd.
> > 
> > For EXITING state specifically, the StopAfterIdle test will call Remove()
> > when it is in the EXITING state.
> 
> I think this is true but it took me 10+ minutes of reasoning about the code to
> reach that conclusion. So this is still not obvious and still needs a
> dedicated test.

This IS the dedicated test!  StopAfterIdleShutdown was added, at your request,
just to cover this case.  The only way to know that a thread has gone into the
EXITING state is that it has stopped for being idle.  At any prior time, it
could still be a posted task waiting to execute.
I'll add a comment to that effect.


> > RemoveCollectionTask:
> > - found: tested every stop before completion
> 
> We can't rely on the item being found for the same
> configuration/load-dependent reasons as above.

The run-time for the tasks is 1 day, longer than the run-time of the test.  If
the test completes, the task was stopped before completion.

https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:543: return;
On 2017/03/21 16:50:38, Mike Wittman wrote:
> On 2017/03/20 21:50:51, bcwhite wrote:
> > > I don't think this conditional is needed.
> > > 
> > > Alternative solution:
> > > 
> > > 1. Add a bool simulateDefunctShutdown parameter to
> > > InitiateSamplingThreadIdleShutdown(). Bind
> > > sampler->task_runner_add_events_-1 to the posted task if true.
> > > 
> > > 2. Document on the InitiateSamplingThreadIdleShutdown() interface that
> > > simulateDefunctShutdown must be set to true if any collections are still
> > > active.
> > 
> > That puts a lot of burden and complexity on the test to avoid what amounts
> > to a "var != 0" at the top of this method.  I don't see it being worth it.
> 
> Why do we need this conditional at all at this point? With the changes to
> FinishCollection the TestAPI now has the ability to tell if the profiler under
> test is idle.
> 
> As a side note, test code is exactly where the burden of test complexity
> belongs. Putting it in production code makes the code less readable and
> understandable, especially in cases like this where the behavior is already
> very subtle.

Done.

https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:355: void
CreateProfilers(PlatformThreadId target_thread_id,
On 2017/03/21 16:50:38, Mike Wittman wrote:
> On 2017/03/20 21:50:51, bcwhite wrote:
> > On 2017/03/18 01:38:41, Mike Wittman wrote:
> > > Nice, encapsulating this functionality makes the tests cleaner.
> > > 
> > > Can you pass the input profiles as a vector, and the three output
> > > arguments as pointers to empty vectors (and push_back() here), so we don't
> > > need to keep all the array/vector sizes in sync in the tests?
> > > 
> > > Also, input arguments should appear before output arguments in the
> > > interface.
> > 
> > Done.
> 
> param should be passed as a vector also.

Done.

https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:943: while
(StackSamplingProfiler::TestAPI::IsSamplingThreadRunning())
On 2017/03/21 16:50:38, Mike Wittman wrote:
> On 2017/03/20 21:50:51, bcwhite wrote:
> > On 2017/03/18 01:38:41, Mike Wittman wrote:
> > > Thinking about this in terms of the underlying StackSamplingProfiler
> > > state, I don't see the need to wait for the thread to exit in this test.
> > > 
> > > Thread::Stop() is documented to handle both the pre-exit and exited cases,
> > > so the behavior in StackSamplingProfiler is the same regardless of which
> > > state the thread is in.
> > 
> > But then you wouldn't be able to tell that the thread restarted.  For all
> > the test could know, the same thread would have continued to run.
> 
> That's true. But whether the thread stopped after idle shutdown is a different
> concern than whether a new collection can start. It's also a very important
> behavior on its own and deserves to be separated out into its own test.

Done.

https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1110: profiler[0]->Start();
On 2017/03/21 16:50:38, Mike Wittman wrote:
> On 2017/03/20 21:50:51, bcwhite wrote:
> > On 2017/03/18 01:38:41, Mike Wittman wrote:
> > > How does the ordering of the Start and Stop calls and destruction relate
> > > to the parameters above?
> > 
> > They don't.  It's three different sampling parameters that start in a
> > staggered ordering.
> 
> In that case, if there's still a motivation for this ordering then it should
> be documented and moved to its own test since it's independent. If not, then
> this can be simplified to three Start() calls followed by three wait calls,
> with the Stop() calls removed.

No motivation, per se.  If you don't feel that staggered start/stop calls are
of any use then I'll just remove the test because it would become essentially
the same as the other ConcurrentProfiling tests.

https://codereview.chromium.org/2554123002/diff/960001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/960001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:430: // move them because this
collection is about to be deleted.
On 2017/03/21 16:50:38, Mike Wittman wrote:
> This last sentence is no longer relevant and can be removed.

Done.

https://codereview.chromium.org/2554123002/diff/960001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/960001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:362: CHECK(profiles->empty());
On 2017/03/21 16:50:38, Mike Wittman wrote:
> ASSERT_TRUE rather than CHECK in test code (or EXPECT_TRUE if ASSERT_TRUE
> doesn't work here)

Done.

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 9 months ago (2017-03-22 18:56:34 UTC) #313

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: win_chromium_x64_rel_ng on master.tryserver.chromium.win (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_x64_rel_ng/builds/389292)

3 years, 9 months ago (2017-03-22 18:56:36 UTC) #314

Mike Wittman

3 years, 9 months ago (2017-03-23 22:18:32 UTC) #315

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:940:
TEST(StackSamplingProfilerTest, MAYBE_ConcurrentProfiling_InSync) {
> > > > > Remove:
> > > > > - state!=RUNNING: added StopAfterIdle test
> > > > 
> > > > This is also a good test for this behavior.
> > > > 
> > > > > - state==other: tested by every valid Stop()
> > > > 
> > > > We can't rely on state==other(EXITING) being tested in Remove for the same
> > > > configuration/load-dependent reasons as for state==RUNNING in
> > > > GetOrCreateTaskRunnerForAdd.
> > > 
> > > For EXITING state specifically, the StopAfterIdle test will call Remove()
> > > when it is in the EXITING state.
> > 
> > I think this is true but it took me 10+ minutes of reasoning about the code to
> > reach that conclusion. So this is still not obvious and still needs a
> > dedicated test.
> 
> This IS the dedicated test!  StopAfterIdleShutdown was added, at your request,
> just to cover this case.  The only way to know that a thread has gone into the
> EXITING state is that it has stopped for being idle.  At any prior time, it
> could still be a posted task waiting to execute.
> I'll add a comment to that effect.

This is still pretty subtle and could use some even more extensive comments
explaining how it works. I made suggestions in the code. 

> > > GetOrCreateTaskRunnerForAdd:
> > > - state==RUNNING: tested by ConcurrentProfiling_*
> > >   Entered when doing parallel sampling.
> > 
> > We can't rely on the RUNNING state being tested in the ConcurrentProfiling_*
> > tests because the state in the second and later Start() calls is subject to
> > races dependent on the sampling parameters, the load on the system, and the
> > shutdown idle time.
> 
> I've disabled the idle shutdown in those tests.  60 seconds wasn't going to be
> an issue but this is guaranteed.

It's still not obvious to a reader that this scenario is reliably tested. The
ConcurrentProfiling_* tests' stated intention is to test behavior other than
this, so they could easily be changed in the future such that they no longer
test this behavior.

Please write a dedicated test. All that's required is starting a profiler with a
very large initial delay before starting a second profiler.

> > ShutdownTask:
> > - changed add-event
> >   This is to handle a race condition which is necessarily difficult
> >   (if not impossible) to test.
> 
> If the previous conditional is removed, this can be tested by calling
> InitiateSamplingThreadIdleShutdown(true) while a collection is in process.

Please write a dedicated test.

> > > RemoveCollectionTask:
> > > - found: tested every stop before completion
> > 
> > We can't rely on the item being found for the same
> > configuration/load-dependent reasons as above.
> 
> There are dozens of stop calls before completion, some with huge timeouts. 
> They're not all going to miss.

None of these tests are intending to reliably exercise this behavior and could
easily be changed in the future such that they no longer test the behavior. The
use of Stop in the destructor is an implementation detail and shouldn't be
relied upon when testing this behavior.

Please write a dedicated test. All that's required is starting a profiler with a
very long sampling interval, waiting for one sample to be collected using the
test delegate, and stopping the profiler.
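
A minimal sketch of the shape such a test could take, assuming a toy sampler rather than the real StackSamplingProfiler. `Event`, `ToySampler` and `StopDuringIntervalFindsCollection` are hypothetical names invented for illustration; `Event` plays the role of a waitable event signaled from the test delegate, and the one-hour interval stands in for "a very long sampling interval".

```cpp
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// One-shot event; a hypothetical stand-in for a waitable event.
class Event {
 public:
  void Signal() {
    std::lock_guard<std::mutex> lock(mutex_);
    signaled_ = true;
    cv_.notify_all();
  }
  void Wait() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return signaled_; });
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool signaled_ = false;
};

// Toy sampler (NOT the real profiler): takes up to |max_samples| samples,
// waiting |interval| between them, and calls |on_sample| after each one.
class ToySampler {
 public:
  ToySampler(std::chrono::milliseconds interval, int max_samples,
             std::function<void()> on_sample)
      : interval_(interval),
        max_samples_(max_samples),
        on_sample_(std::move(on_sample)) {}
  ~ToySampler() { Stop(); }

  void Start() { thread_ = std::thread([this] { Run(); }); }

  // Returns true if the collection had not finished yet (the "found" case).
  bool Stop() {
    bool was_active;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      was_active = !finished_;
      stop_requested_ = true;
    }
    cv_.notify_all();
    if (thread_.joinable())
      thread_.join();
    return was_active;
  }

  int samples_taken() {
    std::lock_guard<std::mutex> lock(mutex_);
    return samples_;
  }

 private:
  void Run() {
    std::unique_lock<std::mutex> lock(mutex_);
    while (!stop_requested_ && samples_ < max_samples_) {
      ++samples_;
      lock.unlock();
      on_sample_();  // Plays the role of the test delegate callback.
      lock.lock();
      if (samples_ < max_samples_)
        cv_.wait_for(lock, interval_, [this] { return stop_requested_; });
    }
    finished_ = true;
  }

  const std::chrono::milliseconds interval_;
  const int max_samples_;
  std::function<void()> on_sample_;
  std::mutex mutex_;
  std::condition_variable cv_;
  bool stop_requested_ = false;
  bool finished_ = false;
  int samples_ = 0;
  std::thread thread_;
};

// Start with a very long interval, wait for exactly one sample, then stop.
bool StopDuringIntervalFindsCollection() {
  Event sampled;
  ToySampler sampler(std::chrono::hours(1), 2,
                     [&sampled] { sampled.Signal(); });
  sampler.Start();
  sampled.Wait();               // The first sample has been recorded.
  bool found = sampler.Stop();  // The second sample is an hour away.
  return found && sampler.samples_taken() == 1;
}
```

Because the wait is on the delegate's signal rather than on a timer, the stop is guaranteed to land between the first and second samples regardless of system load, which is the property the dedicated test is after.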

> > > - not-found: tested every stop after completion
> > 
> > This is pretty subtle in most of the tests but it's a key part of
> > StopAfterIdle, I think that's reasonable.
> > 
> > > PerformCollectionTask:
> > > - found: tested with every valid sample
> > > - not-found: tested every stop before completion
> > 
> > We can't rely on the item being not-found for the same
> > configuration/load-dependent reasons as above.
> 
> For all practical purposes, I believe we can.  Your original StopDuring*()
> tests do this.

There's no reason to believe that the test process will continue execution long
enough for the next PerformCollectionTask to be executed after Stop. Even if
there was, none of these tests are intending to exercise this behavior and could
easily be changed in the future such that they no longer test the behavior.

Please write a dedicated test. This is trickier but I believe can be done by
relying on the message loop task ordering and observing when samples are taken
via the test delegate. Start two profilers with interleaved execution, wait for
both to take samples, stop the first, and observe that two samples of the second
occur with no interleaved sample from the first.

https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1110: profiler[0]->Start();
On 2017/03/22 17:48:54, bcwhite wrote:
> On 2017/03/21 16:50:38, Mike Wittman wrote:
> > On 2017/03/20 21:50:51, bcwhite wrote:
> > > On 2017/03/18 01:38:41, Mike Wittman wrote:
> > > > How does the ordering of the Start and Stop calls and destruction relate
> > > > to the parameters above?
> > > 
> > > They don't.  It's three different sampling parameters that start in a
> > > staggered ordering.
> > 
> > In that case, if there's still a motivation for this ordering then it should
> > be documented and moved to its own test since it's independent. If not, then
> > this can be simplified to three Start() calls followed by three wait calls and
> > remove the Stop() calls.
> 
> No motivation, per se.  If you don't feel that staggered start/stop calls are
> of any use then I'll just remove the test because it would become essentially
> the same as the other ConcurrentProfiling tests.

Removing SGTM. Any specific behaviors exercised by staggered start/stop would be
better addressed in focused tests.

https://codereview.chromium.org/2554123002/diff/980001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/980001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1011: while
(StackSamplingProfiler::TestAPI::IsSamplingThreadRunning())
Since the previous test has already established that the thread will eventually
exit, there's no need to reverify it here. The important behavior under test
here is simply that profiling can take place after idle shutdown has run.

https://codereview.chromium.org/2554123002/diff/980001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1046: // its task and before
the thread actually exits.
This is much better, although I think this information would be easier to
understand if fleshed out more and applied to the individual calls. Suggested
comments below.

https://codereview.chromium.org/2554123002/diff/980001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1049:
StackSamplingProfiler::TestAPI::InitiateSamplingThreadIdleShutdown();
// Post a ShutdownTask on the sampling thread, which will mark the thread as
EXITING and shut down the thread asynchronously after the function exits.

https://codereview.chromium.org/2554123002/diff/980001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1050: while
(StackSamplingProfiler::TestAPI::IsSamplingThreadRunning())
// Wait for the thread to exit to ensure the ShutdownTask has finished executing
and has set the thread state to EXITING.

https://codereview.chromium.org/2554123002/diff/980001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1053: // Ensure it's still
safe to stop.
// Attempt to stop the profiler now that we know the thread is in the EXITING
state.

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 9 months ago (2017-03-27 17:45:24 UTC) #317

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/1000001

3 years, 9 months ago (2017-03-27 17:46:31 UTC) #318

bcwhite

3 years, 9 months ago (2017-03-27 17:52:43 UTC) #319

Some re-work was necessary to fix tests when --gtest_shuffle is set.

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:940:
TEST(StackSamplingProfilerTest, MAYBE_ConcurrentProfiling_InSync) {
> > This IS the dedicated test!  StopAfterIdleShutdown was added, at your
> > request, just to cover this case.  The only way to know that a thread has gone
> > into the EXITING state is that it has stopped for being idle.  At any prior
> > time, it could still be a posted task waiting to execute.
> > I'll add a comment to that effect.
> 
> This is still pretty subtle and could use some even more extensive comments
> explaining how it works. I made suggestions in the code. 

Done.


> > > > GetOrCreateTaskRunnerForAdd:
> > > > - state==RUNNING: tested by ConcurrentProfiling_*
> > > >   Entered when doing parallel sampling.
> > > 
> > > We can't rely on the RUNNING state being tested in the
> > > ConcurrentProfiling_* tests because the state in the second and later
> > > Start() calls is subject to races dependent on the sampling parameters, the
> > > load on the system, and the shutdown idle time.
> > 
> > I've disabled the idle shutdown in those tests.  60 seconds wasn't going to
> > be an issue but this is guaranteed.
> 
> It's still not obvious to a reader that this scenario is reliably tested. The
> ConcurrentProfiling_* tests' stated intention is to test behavior other than
> this so could easily be changed in the future such that they no longer test
> this behavior.
> 
> Please write a dedicated test. All that's required is starting a profiler with
> a very large initial delay before starting a second profiler.

Done.  MultipleStart.


> > > ShutdownTask:
> > > - changed add-event
> > >   This is to handle a race condition which is necessarily difficult
> > >   (if not impossible) to test.
> > 
> > If the previous conditional is removed, this can be tested by calling
> > InitiateSamplingThreadIdleShutdown(true) while a collection is in process.
> 
> Please write a dedicated test.

IdleShutdownAbort

It's definitely not that easy but I've changed the InitiateShutdown into a
PerformShutdown that waits for the task to execute.  At that point the test can
know that at least StopSoon() has been called and the state set to EXITING but
there's still no way to know if the thread has actually exited without waiting.


> > > > RemoveCollectionTask:
> > > > - found: tested every stop before completion
> > > 
> > > We can't rely on the item being found for the same
> > > configuration/load-dependent reasons as above.
> > 
> > There are dozens of stop calls before completion, some with huge timeouts.
> > They're not all going to miss.
> 
> None of these tests are intending to reliably exercise this behavior and could
> easily be changed in the future such that they no longer test the behavior.
> The use of Stop in the destructor is an implementation detail and shouldn't be
> relied upon when testing this behavior.
> 
> Please write a dedicated test. All that's required is starting a profiler with
> a very long sampling interval, waiting for one sample to be collected using the
> test delegate, and stopping the profiler.

StopSafely

Thanks for being specific.  Done.


> > > > - not-found: tested every stop after completion
> > > 
> > > This is pretty subtle in most of the tests but it's a key part of
> > > StopAfterIdle, I think that's reasonable.
> > > 
> > > > PerformCollectionTask:
> > > > - found: tested with every valid sample
> > > > - not-found: tested every stop before completion
> > > 
> > > We can't rely on the item being not-found for the same
> > > configuration/load-dependent reasons as above.
> > 
> > For all practical purposes, I believe we can.  Your original StopDuring*()
> > tests do this.
> 
> There's no reason to believe that the test process will continue execution
> long enough for the next PerformCollectionTask to be executed after Stop. Even
> if there was, none of these tests are intending to exercise this behavior and
> could easily be changed in the future such that they no longer test the
> behavior.
> 
> Please write a dedicated test. This is trickier but I believe can be done by
> relying on the message loop task ordering and observing when samples are taken
> via the test delegate. Start two profilers with interleaved execution, wait
> for both to take samples, stop the first, and observe that two samples of the
> second occur with no interleaved sample from the first.

Done as part of StopSafely.

https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1110: profiler[0]->Start();
On 2017/03/23 22:18:31, Mike Wittman wrote:
> On 2017/03/22 17:48:54, bcwhite wrote:
> > On 2017/03/21 16:50:38, Mike Wittman wrote:
> > > On 2017/03/20 21:50:51, bcwhite wrote:
> > > > On 2017/03/18 01:38:41, Mike Wittman wrote:
> > > > > How does the ordering of the Start and Stop calls and destruction
> > > > > relate to the parameters above?
> > > > 
> > > > They don't.  It's three different sampling parameters that start in a
> > > > staggered ordering.
> > > 
> > > In that case, if there's still a motivation for this ordering then it
> > > should be documented and moved to its own test since it's independent. If
> > > not, then this can be simplified to three Start() calls followed by three
> > > wait calls and remove the Stop() calls.
> > 
> > No motivation, per se.  If you don't feel that staggered start/stop calls
> > are of any use then I'll just remove the test because it would become
> > essentially the same as the other ConcurrentProfiling tests.
> 
> Removing SGTM. Any specific behaviors exercised by staggered start/stop would
> be better addressed in focused tests.

Done.

https://codereview.chromium.org/2554123002/diff/980001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/980001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1011: while
(StackSamplingProfiler::TestAPI::IsSamplingThreadRunning())
On 2017/03/23 22:18:31, Mike Wittman wrote:
> Since the previous test has already established that the thread will
> eventually exit, there's no need to reverify it here. The important behavior
> under test here is simply that profiling can take place after idle shutdown
> has run.

It's not enough to know that it will exit.  It must have actually exited before
CaptureProfiles() below in order to know that it restarted and didn't just get
the idle-shutdown cancelled.

Without the IsSamplingThreadRunning() there is no way to know that the idle
shutdown has run let alone that the thread has exited. 
InitiateSamplingThreadIdleShutdown only posts a task that could be delayed by
any amount of time based on current activity and system load.

https://codereview.chromium.org/2554123002/diff/980001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1046: // its task and before
the thread actually exits.
On 2017/03/23 22:18:31, Mike Wittman wrote:
> This is much better, although I think this information would be easier to
> understand fleshed out more and applied to on the individual calls. Suggested
> comments below.

Acknowledged.

https://codereview.chromium.org/2554123002/diff/980001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1049:
StackSamplingProfiler::TestAPI::InitiateSamplingThreadIdleShutdown();
On 2017/03/23 22:18:31, Mike Wittman wrote:
> // Post a ShutdownTask on the sampling thread, which will mark the thread as
> EXITING and shut down the thread asynchronously after the function exits.

Done.

https://codereview.chromium.org/2554123002/diff/980001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1050: while
(StackSamplingProfiler::TestAPI::IsSamplingThreadRunning())
On 2017/03/23 22:18:31, Mike Wittman wrote:
> // Wait for the thread to exit to ensure the ShutdownTask has finished
> executing and has set the thread state to EXITING.

Done.

https://codereview.chromium.org/2554123002/diff/980001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1053: // Ensure it's still
safe to stop.
On 2017/03/23 22:18:31, Mike Wittman wrote:
> // Attempt to stop the profiler now that we know the thread is in the EXITING
> state.

Done.

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 9 months ago (2017-03-27 19:35:09 UTC) #320

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

3 years, 9 months ago (2017-03-27 19:35:12 UTC) #321

Mike Wittman

3 years, 8 months ago (2017-03-28 19:32:14 UTC) #322

Seems like we're converging on the tests covering the lifetime behaviors.

Can you also provide tests for correct behavior when
 - concurrently profiling two different threads, and
 - concurrently profiling FROM two different threads

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:940:
TEST(StackSamplingProfilerTest, MAYBE_ConcurrentProfiling_InSync) {
> > > > > RemoveCollectionTask:
> > > > > - found: tested every stop before completion
> > > > 
> > > > We can't rely on the item being found for the same
> > > > configuration/load-dependent reasons as above.
> > > 
> > > There are dozens of stop calls before completion, some with huge timeouts.
> > > They're not all going to miss.
> > 
> > None of these tests are intending to reliably exercise this behavior and
> > could easily be changed in the future such that they no longer test the
> > behavior. The use of Stop in the destructor is an implementation detail and
> > shouldn't be relied upon when testing this behavior.
> > 
> > Please write a dedicated test. All that's required is starting a profiler
> > with a very long sampling interval, waiting for one sample to be collected
> > using the test delegate, and stopping the profiler.
> 
> StopSafely
> 
> Thanks for being specific.  Done.

Please split this out into a dedicated test. It's not easy to even determine
that this behavior is being tested from reading StopSafely because of the
complexity required for the not found case of PerformCollectionTask.

https://codereview.chromium.org/2554123002/diff/980001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/980001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1011: while
(StackSamplingProfiler::TestAPI::IsSamplingThreadRunning())
On 2017/03/27 17:52:43, bcwhite wrote:
> On 2017/03/23 22:18:31, Mike Wittman wrote:
> > Since the previous test has already established that the thread will
> > eventually exit, there's no need to reverify it here. The important behavior
> > under test here is simply that profiling can take place after idle shutdown
> > has run.
> 
> It's not enough to know that it will exit.  It must have actually exited
> before CaptureProfiles() below in order to know that it restarted and didn't
> just get the idle-shutdown cancelled.

That's true. The explicit event for shutdown run rather than waiting on the
thread to exit is a clearer indication of what's happening regardless.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:282: DCHECK(sampler);
No need to DCHECK this. We can assume the singleton operates correctly. Applies
two places below also.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:288:
DCHECK(sampler->active_collections_.empty());
CHECK

No reason to use DCHECK in test API code. Applies to DCHECKs in
ShutdownAssumingIdle as well.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:763: if (collection_id_ !=
NULL_COLLECTION_ID) {
Why do we need this conditional?

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.h (right):

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.h:200: // still happens asynchronously.
Watch IsSamplingThreadRunningForTesting()
IsSamplingThreadRunningForTesting => IsSamplingThreadRunning

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.h:204: static void
PerformSamplingThreadIdleShutdown(bool simulate_start);
Can we call this simulate_intervening_start, to make it clear that the shutdown
is not doing some new start-like activity.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:385: if (delegates) {
shorter:
profilers->push_back(MakeUnique<StackSamplingProfiler>(target_thread_id,
params[i], callback, delegates ? (*delegates)[i].get() : nullptr));

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:387: target_thread_id,
params[i], callback, delegates->at(i).get()));
(*delegates)[i].get()

vector<>::at() is no different than operator[]() since Chrome builds without
exceptions, and is more confusing to read for the same reason.
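
For reference, a small self-contained sketch of the optional-delegate lookup with `operator[]`. `Delegate`, `DelegateList`, `DelegateAt` and `RoundTripId` are hypothetical names used only for illustration, and `std::unique_ptr` construction stands in for MakeUnique.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a test delegate type.
struct Delegate {
  int id;
};

using DelegateList = std::vector<std::unique_ptr<Delegate>>;

// Returns the i-th delegate, or nullptr when no list was supplied.
// (*delegates)[i] dereferences the pointer to the vector and then indexes it.
// Without the parentheses, delegates[i] would index the pointer itself, as if
// it pointed into an array of vectors.
Delegate* DelegateAt(const DelegateList* delegates, std::size_t i) {
  return delegates ? (*delegates)[i].get() : nullptr;
}

// Builds a one-element list and reads the element back through DelegateAt().
int RoundTripId(int id) {
  DelegateList list;
  list.push_back(std::unique_ptr<Delegate>(new Delegate{id}));
  return DelegateAt(&list, 0)->id;
}
```

The parenthesization matters because subscripting binds tighter than unary `*`.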

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:646:
StackSamplingProfiler::TestAPI::Reset();
Why do only some of the tests use Reset()?

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:850: subtle::AtomicWord count_
= 0;
This should use locks rather than atomic ops. From the atomicops.h file header:
"If you plan to use these routines, you should have a good reason, such as solid
evidence that performance would otherwise suffer, or there being no
alternative."
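
A rough sketch of what the lock-based version could look like. `LockedCounter` and `CountFromThreads` are hypothetical names, and `std::mutex` stands in for base::Lock purely so the example is self-contained; the real delegate would use whatever lock type the test file already uses.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical counter guarded by a lock instead of atomic ops.
// std::mutex stands in for base::Lock so the sketch is self-contained.
class LockedCounter {
 public:
  void Increment() {
    std::lock_guard<std::mutex> lock(lock_);
    ++count_;
  }
  int Get() const {
    std::lock_guard<std::mutex> lock(lock_);
    return count_;
  }

 private:
  mutable std::mutex lock_;
  int count_ = 0;
};

// Increments one counter from several threads and returns the final value.
int CountFromThreads(int num_threads, int increments_per_thread) {
  LockedCounter counter;
  std::vector<std::thread> threads;
  for (int t = 0; t < num_threads; ++t) {
    threads.emplace_back([&counter, increments_per_thread] {
      for (int i = 0; i < increments_per_thread; ++i)
        counter.Increment();
    });
  }
  for (auto& thread : threads)
    thread.join();
  return counter.Get();
}
```

In a test delegate the increment happens once per sample, so lock overhead is unlikely to matter, and the locking version is easier to reason about than explicit memory ordering.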

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:869:
std::vector<std::unique_ptr<SampleRecordedCounter>> samples_recorded;
std::vector<std::unique_ptr<NativeStackSamplerTestDelegate>>
and access with
static_cast<SampleRecordedCounter*>(samples_recorded[0].get())

reinterpret_cast across two levels of template instantiation is highly unsafe
and dependent on multiple layers of undefined behavior.
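
A minimal sketch of the suggested storage and access pattern, assuming simplified stand-in types. `TestDelegate` here is a hypothetical reduction of NativeStackSamplerTestDelegate to a single virtual method, and `CountViaBaseVector` is an illustrative helper, not code from the patch.

```cpp
#include <memory>
#include <vector>

// Hypothetical, simplified stand-in for the delegate base class.
class TestDelegate {
 public:
  virtual ~TestDelegate() = default;
  virtual void OnPreStackWalk() = 0;
};

// Derived delegate that counts how many times it was invoked.
class SampleRecordedCounter : public TestDelegate {
 public:
  void OnPreStackWalk() override { ++count_; }
  int Get() const { return count_; }

 private:
  int count_ = 0;
};

// Stores the delegate through a base-class pointer, then recovers the derived
// type with static_cast. This is valid only because the element is known to
// have been created as a SampleRecordedCounter.
int CountViaBaseVector() {
  std::vector<std::unique_ptr<TestDelegate>> samples_recorded;
  samples_recorded.push_back(
      std::unique_ptr<TestDelegate>(new SampleRecordedCounter()));
  samples_recorded[0]->OnPreStackWalk();
  samples_recorded[0]->OnPreStackWalk();
  auto* counter =
      static_cast<SampleRecordedCounter*>(samples_recorded[0].get());
  return counter->Get();
}
```

The vector's element type matches what the profiler-creation helper expects, so no cast between unrelated template instantiations is needed.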

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:884: // Wait for both to start
accumulating samples.
It seems like using WaitableEvents in the test delegate would be more
appropriate here than sleeping until the relevant conditions are met.

However, I considered what it would take to do this and I think it results in a
more complicated solution. There's inherently a race between the task posted by
the Stop call and the next PerformCollectionTask on profiler 0, so it's possible
that either zero or one collections could take place on that profiler after
Stop() returns. Thus one can't know how many times to wait for collection on
profiler 0 before it stops.

Can you add a comment with this information so that future readers of this test
know why this seemingly appropriate solution doesn't work well here?

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1065: // will be 10ms (delay)
+ 10x1ms (sampling) + 1/2 timer minimum interval.
This comment is no longer relevant and can be removed.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1066: params[0].initial_delay
= TimeDelta::FromDays(1);
AVeryLongTimeDelta() is the established way to say "effectively infinite time"
in these tests. Applies one other place below too.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1067:
params[0].sampling_interval = TimeDelta::FromMilliseconds(1);
This parameter can be removed.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1070: params[1].initial_delay
= TimeDelta::FromMilliseconds(0);
This line can be removed since this is the default initial delay.

Same comment applies four other places below.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1081: profilers[1]->Start();
We should wait on the second profiler to finish and check that it got the right
data, to validate that the Start() call succeeded.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1082:
EXPECT_FALSE(sampling_completed[0]->IsSignaled());
This should be removed since what it's checking is unrelated to the behavior
under test.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1162:
std::vector<SamplingParams> params(1);
There's no need for vectors here since there's only one profiler.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1263: // Though the
shutdown-task has been executed, any actual exit of the
Since we can't reliably validate that the thread was not stopped, I think the
best way to check the behavior is to collect another profile using a second
profiler and validate that it works as expected. This is the expected
user-observable behavior anyway.

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 8 months ago (2017-03-29 14:47:59 UTC) #323

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/1020001

3 years, 8 months ago (2017-03-29 14:48:15 UTC) #324

bcwhite

3 years, 8 months ago (2017-03-29 14:56:59 UTC) #325

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:940:
TEST(StackSamplingProfilerTest, MAYBE_ConcurrentProfiling_InSync) {
On 2017/03/28 19:32:01, Mike Wittman wrote:
> > > > > > RemoveCollectionTask:
> > > > > > - found: tested every stop before completion
> > > > > 
> > > > > We can't rely on the item being found for the same
> > > > > configuration/load-dependent reasons as above.
> > > > 
> > > > There are dozens of stop calls before completion, some with huge
> > > > timeouts. They're not all going to miss.
> > > 
> > > None of these tests are intending to reliably exercise this behavior and
> > > could easily be changed in the future such that they no longer test the
> > > behavior. The use of Stop in the destructor is an implementation detail and
> > > shouldn't be relied upon when testing this behavior.
> > > 
> > > Please write a dedicated test. All that's required is starting a profiler
> > > with a very long sampling interval, waiting for one sample to be collected
> > > using the test delegate, and stopping the profiler.
> > 
> > StopSafely
> > 
> > Thanks for being specific.  Done.
> 
> Please split this out into a dedicated test. It's not easy to even determine
> that this behavior is being tested from reading StopSafely because of the
> complexity required for the not found case of PerformCollectionTask.

That would be pretty much the same as StopDuringInterSampleInterval but using a
test delegate rather than timers.  That's what you want?

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:282: DCHECK(sampler);
On 2017/03/28 19:32:02, Mike Wittman wrote:
> No need to DCHECK this. We can assume the singleton operates correctly.
> Applies two places below also.

Done.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:288:
DCHECK(sampler->active_collections_.empty());
On 2017/03/28 19:32:02, Mike Wittman wrote:
> CHECK
> 
> No reason to use DCHECK in test API code. Applies to DCHECKs in
> ShutdownAssumingIdle as well.

Calling this with active samples would indicate bad test code.  I'd rather flag
it explicitly than lose time debugging a test that is just getting bad
results.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:763: if (collection_id_ !=
NULL_COLLECTION_ID) {
On 2017/03/28 19:32:02, Mike Wittman wrote:
> Why do we need this conditional?

Because you wanted to ensure that Remove() wasn't called when the sampling
thread had "not started".  I'll remove the DCHECK from there instead.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.h (right):

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.h:200: // still happens asynchronously.
Watch IsSamplingThreadRunningForTesting()
On 2017/03/28 19:32:02, Mike Wittman wrote:
> IsSamplingThreadRunningForTesting => IsSamplingThreadRunning

Done.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.h:204: static void
PerformSamplingThreadIdleShutdown(bool simulate_start);
On 2017/03/28 19:32:02, Mike Wittman wrote:
> Can we call this simulate_intervening_start, to make it clear that the
> shutdown is not doing some new start-like activity.

Adding "intervening" wouldn't indicate anything about whether it is or is not
doing a new start-like activity.  "Simulate" and the comment, on the other hand,
make it pretty clear.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:385: if (delegates) {
On 2017/03/28 19:32:12, Mike Wittman wrote:
> shorter:
> profilers->push_back(MakeUnique<StackSamplingProfiler>(target_thread_id,
> params[i], callback, delegates ? (*delegates)[i].get() : nullptr));

Done.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:387: target_thread_id,
params[i], callback, delegates->at(i).get()));
On 2017/03/28 19:32:02, Mike Wittman wrote:
> (*delegates)[i].get()
> 
> vector<>::at() is no different than operator[]() since Chrome builds without
> exceptions, and is more confusing to read for the same reason.

at() is required when working with const vectors (though I had left that out
from the parameter definition).

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:646:
StackSamplingProfiler::TestAPI::Reset();
On 2017/03/28 19:32:07, Mike Wittman wrote:
> Why do only some of the tests use Reset()?

It's only necessary for tests that deal with the startup of the sampling thread.
 I'd prefer it on all but figured you'd claim it was not required.  Added
everywhere.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:850: subtle::AtomicWord count_
= 0;
On 2017/03/28 19:32:02, Mike Wittman wrote:
> This should use locks rather than atomic ops. From the atomicops.h file
> header:
> "If you plan to use these routines, you should have a good reason, such as
> solid evidence that performance would otherwise suffer, or there being no
> alternative."

I'm relatively well versed in the subtleties of atomics, thanks.  But fine,
we'll do it your way.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:869:
std::vector<std::unique_ptr<SampleRecordedCounter>> samples_recorded;
On 2017/03/28 19:32:02, Mike Wittman wrote:
> std::vector<std::unique_ptr<NativeStackSamplerTestDelegate>>
> and access with
> static_cast<SampleRecordedCounter*>(samples_recorded[0].get())
> 
> reinterpret_cast across two levels of template instantiation is highly unsafe
> and dependent on multiple layers of undefined behavior.

I tried many different ways, and this was the only one that would fully compile.
unique_ptr will down-cast automatically but not a vector of them even though
they're compatible.  Making it a native vector of the base class would mean
up-casting it with every use in this method.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:884: // Wait for both to start
accumulating samples.
On 2017/03/28 19:32:06, Mike Wittman wrote:
> It seems like using WaitableEvents in the test delegate would be more
> appropriate here than sleeping until the relevant conditions are met.
> 
> However, I considered what it would take to do this and I think it results in
> a more complicated solution. There's inherently a race between the task posted
> by the Stop call and the next PerformCollectionTask on profiler 0, so it's
> possible that either zero or one collections could take place on that profiler
> after Stop() returns. Thus one can't know how many times to wait for
> collection on profiler 0 before it stops.
> 
> Can you add a comment with this information so that future readers of this
> test know why this seemingly appropriate solution doesn't work well here?

Done.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1065: // will be 10ms (delay)
+ 10x1ms (sampling) + 1/2 timer minimum interval.
On 2017/03/28 19:32:05, Mike Wittman wrote:
> This comment is no longer relevant and can be removed.

Done.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1066: params[0].initial_delay
= TimeDelta::FromDays(1);
On 2017/03/28 19:32:09, Mike Wittman wrote:
> AVeryLongTimeDelta() is the established way to say "effectively infinite time"
> in these tests. Applies one other place below too.

Done.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1067:
params[0].sampling_interval = TimeDelta::FromMilliseconds(1);
On 2017/03/28 19:32:11, Mike Wittman wrote:
> This parameter can be removed.

Done.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1070: params[1].initial_delay
= TimeDelta::FromMilliseconds(0);
On 2017/03/28 19:32:02, Mike Wittman wrote:
> This line can be removed since this is the default initial delay.
> 
> Same comment applies four other places below.

Done.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1081: profilers[1]->Start();
On 2017/03/28 19:32:08, Mike Wittman wrote:
> We should wait on the second profiler to finish and check that it got the
> right data, to validate that the Start() call succeeded.

Done.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1082:
EXPECT_FALSE(sampling_completed[0]->IsSignaled());
On 2017/03/28 19:32:08, Mike Wittman wrote:
> This should be removed since what it's checking is unrelated to the behavior
> under test.

Done.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1162:
std::vector<SamplingParams> params(1);
On 2017/03/28 19:32:04, Mike Wittman wrote:
> There's no need for vectors here since there's only one profiler.

CreateProfilers() takes a vector so this allows code-reuse.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1263: // Though the
shutdown-task has been executed, any actual exit of the
On 2017/03/28 19:32:02, Mike Wittman wrote:
> Since we can't reliably validate that the thread was not stopped, I think the
> best way to check the behavior is to collect another profile using a second
> profiler and validate that it works as expected. This is the expected
> user-observable behavior anyway.

If the sampling thread did stop (when it shouldn't) then starting a new
collection would quietly restart it.  Thus, the test would always pass.

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 8 months ago (2017-03-29 16:34:23 UTC) #326

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: linux_chromium_chromeos_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_chromeos_rel_ng/builds/393832)

3 years, 8 months ago (2017-03-29 16:34:26 UTC) #327

Mike Wittman

3 years, 8 months ago (2017-03-30 16:18:39 UTC) #328

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:940:
TEST(StackSamplingProfilerTest, MAYBE_ConcurrentProfiling_InSync) {
On 2017/03/29 14:56:57, bcwhite wrote:
> On 2017/03/28 19:32:01, Mike Wittman wrote:
> > > > > > > RemoveCollectionTask:
> > > > > > > - found: tested every stop before completion
> > > > > > 
> > > > > > We can't rely on the item being found for the same
> > > > > > configuration/load-dependent reasons as above.
> > > > > 
> > > > > There are dozens of stop calls before completion, some with huge
> > > > > timeouts.
> > > > > 
> > > > > They're not all going to miss.
> > > > 
> > > > None of these tests are intending to reliably exercise this behavior and
> > > > could easily be changed in the future such that they no longer test the
> > > > behavior. The use of Stop in the destructor is an implementation detail
> > > > and shouldn't be relied upon when testing this behavior.
> > > > 
> > > > Please write a dedicated test. All that's required is starting a
> > > > profiler with a very long sampling interval, waiting for one sample to
> > > > be collected using the test delegate, and stopping the profiler.
> > > 
> > > StopSafely
> > > 
> > > Thanks for being specific.  Done.
> > 
> > Please split this out into a dedicated test. It's not easy to even determine
> > that this behavior is being tested from reading StopSafely because of the
> > complexity required for the not found case of PerformCollectionTask.
> 
> That would be pretty much the same as StopDuringInterSampleInterval but using
> a test delegate rather than timers.  That's what you want?

Yes. We might as well replace StopDuringInterSampleInterval's implementation
with that one, along with a comment specifying the additional behavior being
tested, since that's one of the existing tests that is racy.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:288:
DCHECK(sampler->active_collections_.empty());
On 2017/03/29 14:56:57, bcwhite wrote:
> On 2017/03/28 19:32:02, Mike Wittman wrote:
> > CHECK
> > 
> > No reason to use DCHECK in test API code. Applies to DCHECKs in
> > ShutdownAssumingIdle as well.
> 
> Calling this with active samples would indicate bad test code.  I'd rather
> flag it explicitly rather than lose time debugging a test that is just getting
> bad results.

My comment was intended to convey that this should be CHECK rather than DCHECK.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:763: if (collection_id_ !=
NULL_COLLECTION_ID) {
On 2017/03/29 14:56:57, bcwhite wrote:
> On 2017/03/28 19:32:02, Mike Wittman wrote:
> > Why do we need this conditional?
> 
> Because you wanted to ensure that Remove() wasn't called when the sampling
> thread had "not started".  I'll remove the DCHECK from there instead.

What was in the previous change that caused you to add this when it wasn't
needed before? Is this to support resetting the thread?

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.h (right):

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.h:204: static void
PerformSamplingThreadIdleShutdown(bool simulate_start);
On 2017/03/29 14:56:58, bcwhite wrote:
> On 2017/03/28 19:32:02, Mike Wittman wrote:
> > Can we call this simulate_intervening_start, to make it clear that the
> > shutdown is not doing some new start-like activity.
> 
> Adding "intervening" wouldn't indicate anything about whether it is or is not
> doing a new start-like activity.  "Simulate" and the comment, on the other
> hand, make it pretty clear.

It would indicate the temporal relationship of the start event to the idle
shutdown, which is not possible to infer directly from the function signature
without reading the comment. It would be very easy to interpret this signature
as meaning that the function is simulating a start of the thread after idle
shutdown.

The behavior around timing of the starts/adds with respect to the shutdown is
the most tricky part of this entire implementation to understand, so I think
it's important to be abundantly clear in the interfaces.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:387: target_thread_id,
params[i], callback, delegates->at(i).get()));
On 2017/03/29 14:56:59, bcwhite wrote:
> On 2017/03/28 19:32:02, Mike Wittman wrote:
> > (*delegates)[i].get()
> > 
> > vector<>::at() is no different than operator[]() since Chrome builds without
> > exceptions, and is more confusing to read for the same reason.
> 
> at() is required when working with const vectors (though I had left that out
> from the parameter definition).

The const change is good. at() is not required. vector<>::operator[] has a const
overload.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:646:
StackSamplingProfiler::TestAPI::Reset();
On 2017/03/29 14:56:58, bcwhite wrote:
> On 2017/03/28 19:32:07, Mike Wittman wrote:
> > Why do only some of the tests use Reset()?
> 
> It's only necessary for tests that deal with the startup of the sampling
> thread. I'd prefer it on all but figured you'd claim it was not required.
> Added everywhere.

If we need to clean up state to make the test runs hermetic, which it sounds
like we do, then ensuring every test executes with the expected state is the
appropriate thing to do. It prevents mysterious failures when people add tests
in the future.

The expected way to accomplish this is to use a test fixture and do the cleanup
in the TearDown method. The relevant principle is that tests should clean up
after themselves.
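
To illustrate the principle, here is a minimal sketch of per-test cleanup in a fixture. It deliberately avoids depending on gtest so that it stands alone: FakeThreadState, ProfilerTestFixture and RunAllTests are hypothetical names, and the small runner only mimics the way gtest brackets each TEST_F body with SetUp and TearDown.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical stand-in for process-wide sampling thread state that would
// leak from one test into the next unless it is reset.
struct FakeThreadState {
  static bool& running() {
    static bool is_running = false;
    return is_running;
  }
  static void Reset() { running() = false; }
};

// Fixture in the shape gtest expects: SetUp runs before each test body and
// TearDown runs after it.
class ProfilerTestFixture {
 public:
  // Every test must begin from the same clean state.
  void SetUp() { assert(!FakeThreadState::running()); }
  // Each test cleans up after itself, whatever its body did.
  void TearDown() { FakeThreadState::Reset(); }
};

// Tiny runner that mimics how a TEST_F is driven: a fresh fixture per test,
// with the body bracketed by SetUp and TearDown. Returns the number of tests.
int RunAllTests(const std::vector<std::function<void()>>& bodies) {
  int tests_run = 0;
  for (const auto& body : bodies) {
    ProfilerTestFixture fixture;
    fixture.SetUp();
    body();
    fixture.TearDown();
    ++tests_run;
  }
  return tests_run;
}
```

With the reset living in TearDown, a newly added test gets the cleanup without its author having to remember to call Reset() explicitly.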

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:850: subtle::AtomicWord count_
= 0;
On 2017/03/29 14:56:58, bcwhite wrote:
> On 2017/03/28 19:32:02, Mike Wittman wrote:
> > This should use locks rather than atomic ops. From the atomicops.h file
> > header:
> > "If you plan to use these routines, you should have a good reason, such as
> > solid evidence that performance would otherwise suffer, or there being no
> > alternative."
> 
> I'm relatively well versed in the subtleties of atomics, thanks.  But fine,
> we'll do it your way.

You may be but readers generally won't, and we optimize for code readability,
not writeability.

This is not my way, it's the accepted guidance for using this functionality in
both Chrome and google3, and has been for 8 years.
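
For reference, a minimal sketch of what the lock-based counter could look like. It uses std::mutex as a stand-in for base::Lock so that it is self-contained; LockedSampleCounter, OnSampleRecorded and RecordTwoSamples are hypothetical names, not the ones in the CL.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>

// Hypothetical sample counter guarded by a lock instead of atomic ops.
class LockedSampleCounter {
 public:
  // Would be called on the sampling thread each time a sample is recorded.
  void OnSampleRecorded() {
    std::lock_guard<std::mutex> guard(lock_);
    ++count_;
  }

  // Would be called on the test thread to read the current count.
  std::size_t Get() const {
    std::lock_guard<std::mutex> guard(lock_);
    return count_;
  }

 private:
  mutable std::mutex lock_;  // mutable so that Get() can stay const
  std::size_t count_ = 0;
};

// Example use: record two samples and read the count back.
std::size_t RecordTwoSamples() {
  LockedSampleCounter counter;
  counter.OnSampleRecorded();
  counter.OnSampleRecorded();
  assert(counter.Get() == 2);
  return counter.Get();
}
```

The lock costs a little more than an atomic increment, but in a test that cost is unlikely to matter and the reader doesn't have to reason about memory ordering.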

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:869:
std::vector<std::unique_ptr<SampleRecordedCounter>> samples_recorded;
On 2017/03/29 14:56:58, bcwhite wrote:
> On 2017/03/28 19:32:02, Mike Wittman wrote:
> > std::vector<std::unique_ptr<NativeStackSamplerTestDelegate>>
> > and access with
> > static_cast<SampleRecordedCounter*>(samples_recorded[0].get())
> > 
> > reinterpret_cast across two levels of template instantiation is highly
> > unsafe and dependent on multiple layers of undefined behavior.
> 
> I tried many different ways this was the only one that would fully compile. 
> unique_ptr will down-cast automatically but not a vector of them even though
> they're compatible.  Making it a native vector of the base class would mean
> up-casting it with every use in this method.

Casting to SampleRecordedCounter* on use is what's required. Using
reinterpret_cast in this way is not typesafe or valid C++, and only functions at
all because std::unique_ptr<NativeStackSamplerTestDelegate> and
std::unique_ptr<SampleRecordedCounter>, and
std::vector<std::unique_ptr<NativeStackSamplerTestDelegate>> and
std::vector<std::unique_ptr<SampleRecordedCounter>>, happen to compile to the
same binary representations.

A reasonable rule of thumb for casting in Chrome is: use static_cast where
possible, and reinterpret_cast between pointers to POD types*. Any other case is
probably not type safe.

* with caveats: see bit_cast.
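
A minimal sketch of the cast-on-use pattern, assuming simplified stand-in classes: TestDelegate and RecordedCounter are hypothetical substitutes for NativeStackSamplerTestDelegate and SampleRecordedCounter, and OnPreStackWalk here is just an illustrative virtual method.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical base class standing in for NativeStackSamplerTestDelegate.
class TestDelegate {
 public:
  virtual ~TestDelegate() = default;
  virtual void OnPreStackWalk() = 0;
};

// Hypothetical derived class standing in for SampleRecordedCounter.
class RecordedCounter : public TestDelegate {
 public:
  void OnPreStackWalk() override { ++count_; }
  int Get() const { return count_; }

 private:
  int count_ = 0;
};

// Stores the delegate through a base-class pointer and casts only at the
// point where the derived interface is needed.
int CountFirstDelegate() {
  std::vector<std::unique_ptr<TestDelegate>> delegates;
  delegates.push_back(std::make_unique<RecordedCounter>());
  delegates[0]->OnPreStackWalk();
  delegates[0]->OnPreStackWalk();

  // static_cast is valid here because the dynamic type of the element is
  // known to be RecordedCounter.
  auto* counter = static_cast<RecordedCounter*>(delegates[0].get());
  return counter->Get();
}
```

The vector keeps its real element type, so no reinterpretation of the container is needed. If the cast is required in many places, a small local helper that performs it keeps the call sites short.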

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1162:
std::vector<SamplingParams> params(1);
On 2017/03/29 14:56:58, bcwhite wrote:
> On 2017/03/28 19:32:04, Mike Wittman wrote:
> > There's no need for vectors here since there's only one profiler.
> 
> CreateProfilers() takes a vector so this allows code-reuse.

Yes, but we already have several instances where scalar profilers are created
directly within tests, so this would be the one inconsistency in the code.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1263: // Though the
shutdown-task has been executed, any actual exit of the
On 2017/03/29 14:56:58, bcwhite wrote:
> On 2017/03/28 19:32:02, Mike Wittman wrote:
> > Since we can't reliably validate that the thread was not stopped, I think
> > the best way to check the behavior is to collect another profile using a
> > second profiler and validate that it works as expected. This is the expected
> > user-observable behavior anyway.
> 
> If the sampling thread did stop (when it shouldn't) then starting a new
> collection would quietly restart it.  Thus, the test would always pass.

It's still worth checking this to prevent future regressions in user-observable
behavior on this code path.

https://codereview.chromium.org/2554123002/diff/1020001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/1020001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1611: ProfilerThread
profiler2("profiler2", target_thread_id, params2);
How about profiler_thread1 and profiler_thread2 to make it clear these are
ProfilerThread objects and not StackSamplingProfiler objects?

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 8 months ago (2017-03-30 18:53:17 UTC) #329

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/1040001

3 years, 8 months ago (2017-03-30 18:54:04 UTC) #330

bcwhite

3 years, 8 months ago (2017-03-30 18:54:51 UTC) #331

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:940:
TEST(StackSamplingProfilerTest, MAYBE_ConcurrentProfiling_InSync) {
On 2017/03/30 16:18:38, Mike Wittman wrote:
> On 2017/03/29 14:56:57, bcwhite wrote:
> > On 2017/03/28 19:32:01, Mike Wittman wrote:
> > > > > > > > RemoveCollectionTask:
> > > > > > > > - found: tested every stop before completion
> > > > > > > 
> > > > > > > We can't rely on the item being found for the same
> > > > > > > configuration/load-dependent reasons as above.
> > > > > > 
> > > > > > There are dozens of stop calls before completion, some with huge
> > > > > > timeouts.
> > > > > > 
> > > > > > They're not all going to miss.
> > > > > 
> > > > > None of these tests are intending to reliably exercise this behavior
> > > > > and could easily be changed in the future such that they no longer
> > > > > test the behavior. The use of Stop in the destructor is an
> > > > > implementation detail and shouldn't be relied upon when testing this
> > > > > behavior.
> > > > > 
> > > > > Please write a dedicated test. All that's required is starting a
> > > > > profiler with a very long sampling interval, waiting for one sample
> > > > > to be collected using the test delegate, and stopping the profiler.
> > > > 
> > > > StopSafely
> > > > 
> > > > Thanks for being specific.  Done.
> > > 
> > > Please split this out into a dedicated test. It's not easy to even
> > > determine that this behavior is being tested from reading StopSafely
> > > because of the complexity required for the not found case of
> > > PerformCollectionTask.
> > 
> > That would be pretty much the same as StopDuringInterSampleInterval but
> > using a test delegate rather than timers.  That's what you want?
> 
> Yes. We might as well replace StopDuringInterSampleInterval's implementation
> with that one, along with a comment specifying the additional behavior being
> tested, since that's one of the existing tests that is racy.

Done.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:288:
DCHECK(sampler->active_collections_.empty());
On 2017/03/30 16:18:38, Mike Wittman wrote:
> On 2017/03/29 14:56:57, bcwhite wrote:
> > On 2017/03/28 19:32:02, Mike Wittman wrote:
> > > CHECK
> > > 
> > > No reason to use DCHECK in test API code. Applies to DCHECKs in
> > > ShutdownAssumingIdle as well.
> > 
> > Calling this with active samples would indicate bad test code.  I'd rather
> > flag it explicitly rather than lose time debugging a test that is just
> > getting bad results.
> 
> My comment was intended to convey that this should be CHECK rather than
> DCHECK.

Does the compiler completely remove this as dead code when it's being shipped? 
If not, it would be better to not include the statements in release builds for
simple code-size reasons.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:763: if (collection_id_ !=
NULL_COLLECTION_ID) {
On 2017/03/30 16:18:38, Mike Wittman wrote:
> On 2017/03/29 14:56:57, bcwhite wrote:
> > On 2017/03/28 19:32:02, Mike Wittman wrote:
> > > Why do we need this conditional?
> > 
> > Because you wanted to ensure that Remove() wasn't called when the sampling
> > thread had "not started".  I'll remove the DCHECK from there instead.
> 
> What was in the previous change that caused you to add this when it wasn't
> needed before? Is this to support resetting the thread?

StopWithoutStarting.  If the object is created but never started then the dtor
will call Stop() which will call Remove() when the thread is NOT_STARTED.
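
A minimal sketch of that path, under simplified assumptions: FakeSamplingThread, FakeProfiler and kNullCollectionId are hypothetical stand-ins for the real sampling thread, StackSamplingProfiler and NULL_COLLECTION_ID, and the asserts play the role of the DCHECK being discussed.

```cpp
#include <cassert>

// Hypothetical sentinel standing in for NULL_COLLECTION_ID.
constexpr int kNullCollectionId = -1;

// Hypothetical stand-in for the sampling thread; it counts Remove() calls.
class FakeSamplingThread {
 public:
  int Add() {
    started_ = true;
    return next_id_++;
  }

  void Remove(int id) {
    // Remove() is only meaningful once the thread has been started.
    assert(started_);
    assert(id != kNullCollectionId);
    ++removals_;
  }

  int removals() const { return removals_; }

 private:
  bool started_ = false;
  int next_id_ = 0;
  int removals_ = 0;
};

// Hypothetical stand-in for the profiler object.
class FakeProfiler {
 public:
  explicit FakeProfiler(FakeSamplingThread* thread) : thread_(thread) {}
  ~FakeProfiler() { Stop(); }

  void Start() { collection_id_ = thread_->Add(); }

  void Stop() {
    // Without this guard, destroying a never-started profiler would call
    // Remove() on a thread that has not been started.
    if (collection_id_ != kNullCollectionId) {
      thread_->Remove(collection_id_);
      collection_id_ = kNullCollectionId;
    }
  }

 private:
  FakeSamplingThread* thread_;
  int collection_id_ = kNullCollectionId;
};

// Returns how many Remove() calls one profiler lifetime produces.
int RemovalsForLifetime(bool start) {
  FakeSamplingThread thread;
  {
    FakeProfiler profiler(&thread);
    if (start)
      profiler.Start();
  }  // The destructor calls Stop() here.
  return thread.removals();
}
```

In this sketch a profiler that is never started produces zero Remove() calls, and one that is started produces exactly one.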

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.h (right):

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.h:204: static void
PerformSamplingThreadIdleShutdown(bool simulate_start);
On 2017/03/30 16:18:38, Mike Wittman wrote:
> On 2017/03/29 14:56:58, bcwhite wrote:
> > On 2017/03/28 19:32:02, Mike Wittman wrote:
> > > Can we call this simulate_intervening_start, to make it clear that the
> > > shutdown is not doing some new start-like activity.
> > 
> > Adding "intervening" wouldn't indicate anything about whether it is or is
> > not doing a new start-like activity.  "Simulate" and the comment, on the
> > other hand, make it pretty clear.
> 
> It would indicate the temporal relationship of the start event to the idle
> shutdown, which is not possible to infer directly from the function signature
> without reading the comment. It would be very easy to interpret this signature
> as meaning that the function is simulating a start of the thread after idle
> shutdown.
> 
> The behavior around timing of the starts/adds with respect to the shutdown is
> the most tricky part of this entire implementation to understand, so I think
> it's important to be abundantly clear in the interfaces.

Done.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:387: target_thread_id,
params[i], callback, delegates->at(i).get()));
On 2017/03/30 16:18:38, Mike Wittman wrote:
> On 2017/03/29 14:56:59, bcwhite wrote:
> > On 2017/03/28 19:32:02, Mike Wittman wrote:
> > > (*delegates)[i].get()
> > > 
> > > vector<>::at() is no different than operator[]() since Chrome builds
> > > without exceptions, and is more confusing to read for the same reason.
> > 
> > at() is required when working with const vectors (though I had left that out
> > from the parameter definition).
> 
> The const change is good. at() is not required. vector<>::operator[] has a
> const overload.

Right.  It's maps that don't allow [] with const.
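
For the record, a small sketch of the difference. ReadSecond and ReadKey are hypothetical helper names used only for illustration.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// std::vector has a const overload of operator[], so indexing through a
// const reference compiles.
int ReadSecond(const std::vector<int>& values) {
  assert(values.size() > 1);
  return values[1];
}

// std::map::operator[] may insert a default-constructed value, so it has no
// const overload; table[key] would not compile here. at() is const-qualified
// and assumes the key is present.
int ReadKey(const std::map<std::string, int>& table, const std::string& key) {
  return table.at(key);
}
```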

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:646:
StackSamplingProfiler::TestAPI::Reset();
On 2017/03/30 16:18:38, Mike Wittman wrote:
> On 2017/03/29 14:56:58, bcwhite wrote:
> > On 2017/03/28 19:32:07, Mike Wittman wrote:
> > > Why do only some of the tests use Reset()?
> > 
> > It's only necessary for tests that deal with the startup of the sampling
> > thread. I'd prefer it on all but figured you'd claim it was not required.
> > Added everywhere.
> 
> If we need to clean up state to make the test runs hermetic, which it sounds
> like we do, then ensuring every test executes with the expected state is the
> appropriate thing to do. It prevents mysterious failures when people add tests
> in the future.
> 
> The expected way to accomplish this is to use a test fixture and do the
> cleanup in the TearDown method. The relevant principle is that tests should
> clean up after themselves.

Done.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:869:
std::vector<std::unique_ptr<SampleRecordedCounter>> samples_recorded;
On 2017/03/30 16:18:38, Mike Wittman wrote:
> On 2017/03/29 14:56:58, bcwhite wrote:
> > On 2017/03/28 19:32:02, Mike Wittman wrote:
> > > std::vector<std::unique_ptr<NativeStackSamplerTestDelegate>>
> > > and access with
> > > static_cast<SampleRecordedCounter*>(samples_recorded[0].get())
> > > 
> > > reinterpret_cast across two levels of template instantiation is highly
> > > unsafe and dependent on multiple layers of undefined behavior.
> > 
> > I tried many different ways this was the only one that would fully compile. 
> > unique_ptr will down-cast automatically but not a vector of them even though
> > they're compatible.  Making it a native vector of the base class would mean
> > up-casting it with every use in this method.
> 
> Casting to SampleRecordedCounter* on use is what's required. Using
> reinterpret_cast in this way is not typesafe or valid C++, and only functions
> at all because std::unique_ptr<NativeStackSamplerTestDelegate> and
> std::unique_ptr<SampleRecordedCounter>, and
> std::vector<std::unique_ptr<NativeStackSamplerTestDelegate>> and
> std::vector<std::unique_ptr<SampleRecordedCounter>>, happen to compile to the
> same binary representations.
> 
> A reasonable rule of thumb for casting in Chrome is: use static_cast where
> possible, and reinterpret_cast between pointers to POD types*. Any other case
> is probably not type safe.
> 
> * with caveats: see bit_cast.

Done.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1162:
std::vector<SamplingParams> params(1);
On 2017/03/30 16:18:38, Mike Wittman wrote:
> On 2017/03/29 14:56:58, bcwhite wrote:
> > On 2017/03/28 19:32:04, Mike Wittman wrote:
> > > There's no need for vectors here since there's only one profiler.
> > 
> > CreateProfilers() takes a vector so this allows code-reuse.
> 
> Yes, but we already have several instances where scalar profilers are created
> directly within tests, so this would be the one inconsistency in the code.

Usually where it's being passed to CaptureProfiles, which does all the
heavy-lifting of creating the profiler.  Here I'm using CreateProfilers to do
the heavy-lifting and it takes vectors.  It's much more difficult to make
mistakes by re-using CreateProfilers even if it makes a bit more boiler-plate
code in the test.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1263: // Though the
shutdown-task has been executed, any actual exit of the
On 2017/03/30 16:18:38, Mike Wittman wrote:
> On 2017/03/29 14:56:58, bcwhite wrote:
> > On 2017/03/28 19:32:02, Mike Wittman wrote:
> > > Since we can't reliably validate that the thread was not stopped, I think
> > > the best way to check the behavior is to collect another profile using a
> > > second profiler and validate that it works as expected. This is the
> > > expected user-observable behavior anyway.
> > 
> > If the sampling thread did stop (when it shouldn't) then starting a new
> > collection would quietly restart it.  Thus, the test would always pass.
> 
> It's still worth checking this to prevent future regressions in
> user-observable behavior on this code path.

That is either WillRestartSamplerAfterIdleShutdown or CanRunMultipleTimes.

https://codereview.chromium.org/2554123002/diff/1020001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/1020001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1611: ProfilerThread
profiler2("profiler2", target_thread_id, params2);
On 2017/03/30 16:18:38, Mike Wittman wrote:
> How about profiler_thread1 and profiler_thread2 to make it clear these are
> ProfilerThread objects and not StackSamplingProfiler objects?

Done.

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 8 months ago (2017-03-30 19:04:31 UTC) #332

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: ios-device on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-device/builds/180954) ios-device-xcode-clang on ...

3 years, 8 months ago (2017-03-30 19:04:33 UTC) #333

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 8 months ago (2017-03-30 19:39:13 UTC) #334

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/1060001

3 years, 8 months ago (2017-03-30 19:40:40 UTC) #335

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 8 months ago (2017-03-30 19:52:10 UTC) #336

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: ios-device on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-device/builds/180998) ios-device-xcode-clang on ...

3 years, 8 months ago (2017-03-30 19:52:12 UTC) #337

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 8 months ago (2017-03-30 20:07:04 UTC) #338

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/1080001

3 years, 8 months ago (2017-03-30 20:08:03 UTC) #339

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 8 months ago (2017-03-30 23:23:06 UTC) #340

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

3 years, 8 months ago (2017-03-30 23:23:08 UTC) #341

Mike Wittman

3 years, 8 months ago (2017-03-31 01:38:22 UTC) #342

With the changes below this should be looking pretty good. I'm going to make
another pass for anything I missed.

Beyond that, I think it would be good to get a quick review from gab@ in case
anything has changed with the Thread API in the meantime. And you'll need to
ping brettw@ for the thread_restrictions.h change.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:288:
DCHECK(sampler->active_collections_.empty());
On 2017/03/30 18:54:51, bcwhite wrote:
> On 2017/03/30 16:18:38, Mike Wittman wrote:
> > On 2017/03/29 14:56:57, bcwhite wrote:
> > > On 2017/03/28 19:32:02, Mike Wittman wrote:
> > > > CHECK
> > > > 
> > > > No reason to use DCHECK in test API code. Applies to DCHECKs in
> > > > ShutdownAssumingIdle as well.
> > > 
> > > Calling this with active samples would indicate bad test code.  I'd rather
> > > flag it explicitly rather than lose time debugging a test that is just
> > > getting bad results.
> > 
> > My comment was intended to convey that this should be CHECK rather than
> > DCHECK.
> 
> Does the compiler completely remove this as dead code when it's being shipped?
> If not, it would be better to not include the statements in release builds for
> simple code-size reasons.

Yes, the linkers on various platforms do dead code elimination, so none of the
TestAPI code should be present in a release binary.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:763: if (collection_id_ !=
NULL_COLLECTION_ID) {
On 2017/03/30 18:54:51, bcwhite wrote:
> On 2017/03/30 16:18:38, Mike Wittman wrote:
> > On 2017/03/29 14:56:57, bcwhite wrote:
> > > On 2017/03/28 19:32:02, Mike Wittman wrote:
> > > > Why do we need this conditional?
> > > 
> > > Because you wanted to ensure that Remove() wasn't called when the sampling
> > > thread had "not started".  I'll remove the DCHECK from there instead.
> > 
> > What was in the previous change that caused you to add this when it wasn't
> > needed before? Is this to support resetting the thread?
> 
> StopWithoutStarting.  If the object is created but never started then the dtor
> will call Stop() which will call Remove() when the thread is NOT_STARTED.

Removing the DCHECK sounds like a good solution.
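
For readers following this thread, here is a minimal sketch of the StopWithoutStarting case discussed above. It is not the patch's code: FakeSamplingThread, FakeProfiler and kNullCollectionId are hypothetical stand-ins, and the sketch only assumes what the thread states, namely that the destructor calls Stop(), which calls Remove() even when the thread was never started.

```cpp
#include <cassert>
#include <set>

// Hypothetical sentinel standing in for NULL_COLLECTION_ID.
constexpr int kNullCollectionId = -1;

// Hypothetical stand-in for the shared sampling thread.
class FakeSamplingThread {
 public:
  enum State { NOT_STARTED, RUNNING };

  // Registers a collection and marks the thread as running.
  int Add() {
    state_ = RUNNING;
    const int id = next_id_++;
    active_.insert(id);
    return id;
  }

  // Deliberately has no "thread must be running" assertion: removing an
  // unknown id (including the null id before any start) is a no-op.
  bool Remove(int id) { return active_.erase(id) > 0; }

  State state() const { return state_; }
  bool idle() const { return active_.empty(); }

 private:
  State state_ = NOT_STARTED;
  int next_id_ = 0;
  std::set<int> active_;
};

// Hypothetical stand-in for a profiler whose destructor always calls Stop().
class FakeProfiler {
 public:
  explicit FakeProfiler(FakeSamplingThread* thread) : thread_(thread) {}
  ~FakeProfiler() { Stop(); }

  void Start() {
    assert(collection_id_ == kNullCollectionId);  // Start at most once.
    collection_id_ = thread_->Add();
  }

  void Stop() {
    thread_->Remove(collection_id_);  // Safe even if Start() never ran.
    collection_id_ = kNullCollectionId;
  }

 private:
  FakeSamplingThread* thread_;
  int collection_id_ = kNullCollectionId;
};

// Creates and destroys a profiler without starting it.
bool StopWithoutStartingIsHarmless() {
  FakeSamplingThread thread;
  { FakeProfiler never_started(&thread); }
  return thread.state() == FakeSamplingThread::NOT_STARTED && thread.idle();
}
```

With a tolerant Remove(), the caller needs no conditional on the collection id, which is the trade-off being agreed to here.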

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1162:
std::vector<SamplingParams> params(1);
On 2017/03/30 18:54:51, bcwhite wrote:
> On 2017/03/30 16:18:38, Mike Wittman wrote:
> > On 2017/03/29 14:56:58, bcwhite wrote:
> > > On 2017/03/28 19:32:04, Mike Wittman wrote:
> > > > There's no need for vectors here since there's only one profiler.
> > > 
> > > CreateProfilers() takes a vector so this allows code-reuse.
> > 
> > Yes, but we already have several instances where scalar profilers are
> > created directly within tests, so this would be the one inconsistency in
> > the code.
> 
> Usually where it's being passed to CaptureProfiles, which does all the
> heavy-lifting of creating the profiler.  Here I'm using CreateProfilers to do
> the heavy-lifting and it takes vectors.  It's much more difficult to make
> mistakes by re-using CreateProfilers even if it makes a bit more boiler-plate
> code in the test.

OK, sounds reasonable.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1263: // Though the
shutdown-task has been executed, any actual exit of the
On 2017/03/30 18:54:51, bcwhite wrote:
> On 2017/03/30 16:18:38, Mike Wittman wrote:
> > On 2017/03/29 14:56:58, bcwhite wrote:
> > > On 2017/03/28 19:32:02, Mike Wittman wrote:
> > > > Since we can't reliably validate that the thread was not stopped, I
> > > > think the best way to check the behavior is to collect another profile
> > > > using a second profiler and validate that it works as expected. This is
> > > > the expected user-observable behavior anyway.
> > > 
> > > If the sampling thread did stop (when it shouldn't) then starting a new
> > > collection would quietly restart it.  Thus, the test would always pass.
> > 
> > It's still worth checking this to prevent future regressions in
> > user-observable behavior on this code path.
> 
> That is either WillRestartSamplerAfterIdleShutdown or CanRunMultipleTimes.

Neither of those tests exercises the shutdown abort code path.

https://codereview.chromium.org/2554123002/diff/1020001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/1020001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1611: ProfilerThread
profiler2("profiler2", target_thread_id, params2);
On 2017/03/30 18:54:51, bcwhite wrote:
> On 2017/03/30 16:18:38, Mike Wittman wrote:
> > How about profiler_thread1 and profiler_thread2 to make it clear these are
> > ProfilerThread objects and not StackSamplingProfiler objects?
> 
> Done.

I don't see this change in the patch sets.

https://codereview.chromium.org/2554123002/diff/1080001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/1080001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:637:
StackSamplingProfiler::ResetAnnotationsForTesting();
Can you move this function into the TestAPI for consistency?

https://codereview.chromium.org/2554123002/diff/1080001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:637:
StackSamplingProfiler::ResetAnnotationsForTesting();
Also, this call should go in TearDown.

https://codereview.chromium.org/2554123002/diff/1080001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:638:
StackSamplingProfiler::TestAPI::DisableIdleShutdown();
It's worth commenting that we disable idle shutdown because it takes too long to
occur to be testable. The behavior is tested in some tests by artificially
triggering an idle shutdown.

https://codereview.chromium.org/2554123002/diff/1080001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1009: while (
This test would be simpler and easier to understand using a WaitableEvent that
gets signaled in the test delegate and waited on here, along with two samples
per burst and a long sampling interval.
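
A rough sketch of the signal-and-wait shape suggested here, written in standard C++ so it stands alone. SimpleEvent is a hypothetical stand-in for base::WaitableEvent, and SampleRecordedEvent and its method names are assumptions for illustration, not the delegate in the patch.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical manual-reset event; a stand-in for base::WaitableEvent.
class SimpleEvent {
 public:
  void Signal() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      signaled_ = true;
    }
    cv_.notify_all();
  }

  // Blocks until Signal() has been called.
  void Wait() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return signaled_; });
  }

  bool IsSignaled() {
    std::lock_guard<std::mutex> lock(mutex_);
    return signaled_;
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool signaled_ = false;
};

// Hypothetical test delegate: the sampling thread calls OnSampleRecorded().
class SampleRecordedEvent {
 public:
  void OnSampleRecorded() { sample_recorded_.Signal(); }
  void WaitForSampleToOccur() { sample_recorded_.Wait(); }
  bool SampleOccurred() { return sample_recorded_.IsSignaled(); }

 private:
  SimpleEvent sample_recorded_;
};

// Simulates a sampling thread recording one sample while the test body
// blocks on the event instead of polling in a loop.
bool WaitForFirstSample() {
  SampleRecordedEvent delegate;
  std::thread sampler([&delegate] { delegate.OnSampleRecorded(); });
  delegate.WaitForSampleToOccur();
  assert(delegate.SampleOccurred());  // Wait() only returns after Signal().
  sampler.join();
  return delegate.SampleOccurred();
}
```

The test body then reads as "start, wait for the first sample, act", with no timing-dependent loop condition.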

Mike Wittman

wittman@chromium.org changed reviewers: + gab@chromium.org

3 years, 8 months ago (2017-03-31 01:46:30 UTC) #343

Mike Wittman

Hi gab, can you review the Thread API usage in this change for consistency with ...

3 years, 8 months ago (2017-03-31 01:46:33 UTC) #344

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 8 months ago (2017-03-31 13:33:27 UTC) #346

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/1090005

3 years, 8 months ago (2017-03-31 13:33:48 UTC) #347

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 8 months ago (2017-03-31 13:49:40 UTC) #348

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/1110001

3 years, 8 months ago (2017-03-31 13:49:59 UTC) #350

bcwhite

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_sampling_profiler.cc File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_sampling_profiler.cc#newcode288 base/profiler/stack_sampling_profiler.cc:288: DCHECK(sampler->active_collections_.empty()); On 2017/03/31 01:38:21, Mike Wittman wrote: > On ...

3 years, 8 months ago (2017-03-31 13:57:56 UTC) #351

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:288:
DCHECK(sampler->active_collections_.empty());
On 2017/03/31 01:38:21, Mike Wittman wrote:
> On 2017/03/30 18:54:51, bcwhite wrote:
> > On 2017/03/30 16:18:38, Mike Wittman wrote:
> > > On 2017/03/29 14:56:57, bcwhite wrote:
> > > > On 2017/03/28 19:32:02, Mike Wittman wrote:
> > > > > CHECK
> > > > > 
> > > > > No reason to use DCHECK in test API code. Applies to DCHECKs in
> > > > > ShutdownAssumingIdle as well.
> > > > 
> > > > Calling this with active samples would indicate bad test code.  I'd
> > > > rather flag it explicitly rather than lose time debugging a test that is
> > > > just getting bad results.
> > > 
> > > My comment was intended to convey that this should be CHECK rather than
> > > DCHECK.
> > 
> > Does the compiler completely remove this as dead code when it's being
> > shipped? If not, it would be better to not include the statements in release
> > builds for simple code-size reasons.
> 
> Yes, the linkers on various platforms do dead code elimination, so none of the
> TestAPI code should be present in a release binary.

Done.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1263: // Though the
shutdown-task has been executed, any actual exit of the
On 2017/03/31 01:38:21, Mike Wittman wrote:
> On 2017/03/30 18:54:51, bcwhite wrote:
> > On 2017/03/30 16:18:38, Mike Wittman wrote:
> > > On 2017/03/29 14:56:58, bcwhite wrote:
> > > > On 2017/03/28 19:32:02, Mike Wittman wrote:
> > > > > Since we can't reliably validate that the thread was not stopped, I
> > > > > think the best way to check the behavior is to collect another profile
> > > > > using a second profiler and validate that it works as expected. This
> > > > > is the expected user-observable behavior anyway.
> > > > 
> > > > If the sampling thread did stop (when it shouldn't) then starting a new
> > > > collection would quietly restart it.  Thus, the test would always pass.
> > > 
> > > It's still worth checking this to prevent future regressions in
> > > user-observable behavior on this code path.
> > 
> > That is either WillRestartSamplerAfterIdleShutdown or CanRunMultipleTimes.
> 
> Neither of those tests exercises the shutdown abort code path.

Done.

https://codereview.chromium.org/2554123002/diff/1020001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/1020001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1611: ProfilerThread
profiler2("profiler2", target_thread_id, params2);
On 2017/03/31 01:38:22, Mike Wittman wrote:
> On 2017/03/30 18:54:51, bcwhite wrote:
> > On 2017/03/30 16:18:38, Mike Wittman wrote:
> > > How about profiler_thread1 and profiler_thread2 to make it clear these are
> > > ProfilerThread objects and not StackSamplingProfiler objects?
> > 
> > Done.
> 
> I don't see this change in the patch sets.

It's there; #38 and above.

https://codereview.chromium.org/2554123002/diff/1080001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/1080001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:637:
StackSamplingProfiler::ResetAnnotationsForTesting();
On 2017/03/31 01:38:22, Mike Wittman wrote:
> Can you move this function into the TestAPI for consistency?

Done.  And it makes sense to call that as part of the more general Reset()
behavior.

https://codereview.chromium.org/2554123002/diff/1080001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:637:
StackSamplingProfiler::ResetAnnotationsForTesting();
On 2017/03/31 01:38:22, Mike Wittman wrote:
> Also, this call should go in TearDown.

Reset needs to happen in SetUp because it's possible that other tests (in other
files) could have set some values here.  But it should be in tear-down, too.

https://codereview.chromium.org/2554123002/diff/1080001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:638:
StackSamplingProfiler::TestAPI::DisableIdleShutdown();
On 2017/03/31 01:38:22, Mike Wittman wrote:
> It's worth commenting that we disable idle shutdown because it takes too long
> to occur to be testable. The behavior is tested in some tests by artificially
> triggering an idle shutdown.

Done.

https://codereview.chromium.org/2554123002/diff/1080001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1009: while (
On 2017/03/31 01:38:22, Mike Wittman wrote:
> This test would be simpler and easier to understand using a WaitableEvent that
> gets signaled in the test delegate and waited on here, along with two samples
> per burst and a long sampling interval.

Done.

bcwhite

@brettw: stack_sampling_profiler.cc, 738 // The behavior of sampling a thread that has exited is undefined ...

3 years, 8 months ago (2017-03-31 13:59:19 UTC) #352

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 8 months ago (2017-03-31 14:59:29 UTC) #353

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

3 years, 8 months ago (2017-03-31 14:59:31 UTC) #354

brettw

Can you write a better CL description? I don't really know what's going on.

3 years, 8 months ago (2017-03-31 17:12:44 UTC) #355

Mike Wittman

https://codereview.chromium.org/2554123002/diff/1020001/base/profiler/stack_sampling_profiler_unittest.cc File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/1020001/base/profiler/stack_sampling_profiler_unittest.cc#newcode1611 base/profiler/stack_sampling_profiler_unittest.cc:1611: ProfilerThread profiler2("profiler2", target_thread_id, params2); On 2017/03/31 13:57:56, bcwhite wrote: ...

3 years, 8 months ago (2017-03-31 18:12:33 UTC) #356

https://codereview.chromium.org/2554123002/diff/1020001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/1020001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1611: ProfilerThread
profiler2("profiler2", target_thread_id, params2);
On 2017/03/31 13:57:56, bcwhite wrote:
> On 2017/03/31 01:38:22, Mike Wittman wrote:
> > On 2017/03/30 18:54:51, bcwhite wrote:
> > > On 2017/03/30 16:18:38, Mike Wittman wrote:
> > > > How about profiler_thread1 and profiler_thread2 to make it clear these
> > > > are ProfilerThread objects and not StackSamplingProfiler objects?
> > > 
> > > Done.
> > 
> > I don't see this change in the patch sets.
> 
> It's there; #38 and above.

Ah, the thread names changed. My comment was intended to be about the variable
names.

https://codereview.chromium.org/2554123002/diff/1080001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/1080001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:637:
StackSamplingProfiler::ResetAnnotationsForTesting();
On 2017/03/31 13:57:56, bcwhite wrote:
> On 2017/03/31 01:38:22, Mike Wittman wrote:
> > Also, this call should go in TearDown.
> 
> Reset needs to happen in SetUp because it's possible that other tests (in
> other files) could have set some values here.  But it should be in tear-down,
> too.

That's a very good point that we need to be concerned about other users of the
profiler. Unfortunately, I don't think running Reset() in SetUp() is sufficient
(or necessary, even) to solve the issue.

For the annotations case, non-profiler tests that exercise calls to
SetProcessMilestone() won't benefit from this fixture. They'll potentially set
the same milestone multiple times, hitting at least one DCHECK. We don't see
this problem now because none of the milestone setting code appears to be tested
outside of browser_tests and interactive_ui_tests which run one test per
process.

For the sampling thread case, I don't think anything needs to be done at SetUp()
to ensure correct behavior. The profiler tests are the only ones in the
base_unittests target that exercise the profiler, and should be the only ones in
that target that will ever do so, so running Reset() only in TearDown() is
sufficient. All the other unit test targets should be fine without doing Reset()
-- they'll get the normal thread restart behavior -- but it still would be good
practice to mock out the profiler in unit tests of new code using the profiler
to isolate that code from the profiler dependency.

Given these facts, I think the best course of action is:
 - only run DisableIdleShutdown() in SetUp()
 - only run Reset() in TearDown(), and add a comment on the fixture saying that
all profiler tests in base_unittests must use the fixture to ensure proper
clean-up
 - in a separate CL, update the milestone handling logic to accept milestones
being set multiple times and add a comment about why/how this happens
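
The first two items could look roughly like the sketch below. It avoids the gtest dependency so it stands alone, and FakeTestAPI and ProfilerFixtureSketch are hypothetical stand-ins. What Reset() clears in the real TestAPI is not shown in this thread, so the "leftover state" counter is an assumption.

```cpp
#include <cassert>

// Hypothetical stand-in for the profiler's static test API. The statics model
// process-wide state shared by every test in the same binary.
struct FakeTestAPI {
  static bool idle_shutdown_enabled;
  static int leftover_state;  // Nonzero when a test left state behind.

  static void DisableIdleShutdown() { idle_shutdown_enabled = false; }
  static void Reset() { leftover_state = 0; }
};
bool FakeTestAPI::idle_shutdown_enabled = true;
int FakeTestAPI::leftover_state = 0;

// All profiler tests in the target are assumed to use this fixture so that
// clean-up always runs, whatever the test body did.
class ProfilerFixtureSketch {
 public:
  // Idle shutdown is only disabled here; nothing is reset up front.
  void SetUp() { FakeTestAPI::DisableIdleShutdown(); }

  // Reset happens only here, after the test body has run.
  void TearDown() { FakeTestAPI::Reset(); }
};

// Runs one fake test body through the fixture and reports whether the shared
// state is clean afterwards.
bool RunOneTestAndCheckCleanState() {
  ProfilerFixtureSketch fixture;
  fixture.SetUp();
  assert(!FakeTestAPI::idle_shutdown_enabled);  // The body relies on this.
  FakeTestAPI::leftover_state = 1;              // The body leaves state behind.
  fixture.TearDown();
  return FakeTestAPI::leftover_state == 0;
}
```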

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:249: // Updates the |next_sample_time|
time based on configured parameters.
add comment on the meaning of the return value

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:259: // so that multiple threads may
make those calls.
broaden comment to discuss the general thread execution state under the lock.
e.g.:

// State maintained about the current execution (or non-execution) of
// the thread. This state must always be accessed while holding the
// lock. A copy of the task-runner is maintained here for use by any
// calling thread; this is necessary because Thread's accessor for it is
// not itself thread-safe. The lock is also used to order calls to the
// Thread API (Start, Stop, StopSoon, & DetachFromSequence) so that
// multiple threads may make those calls.

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:275: std::map<int,
std::unique_ptr<CollectionContext>> active_collections_;
nit: move this declaration above the lock to make totally obvious that this is
not protected by the lock.

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:603: // All capturing has completed so
finish the collection. Let object expire.
nit: clarify the meaning of "Let object expire."

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:613: // eliminating the race.
mention what the race is

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.h (right):

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.h:28: class WaitableEvent;
This can be removed since the header is already included.

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.h:183: // Testing support.
Add comment: The functions on this API are static because they affect the single
sampling thread that is used across all StackSamplingProfilers.

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.h:245: // This will block until the
callback has been run.
update comment: This will block until the callback has been run _if profiling is
taking place_.

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:364:
std::vector<std::unique_ptr<WaitableEvent>>* completed) {
There's a very subtle issue with this function in that it hides destruction
ordering constraints from the caller: if the profiles or completed vectors were
declared before the profilers vector in a test, the profilers could access those
objects after they were destroyed but before the profilers were destroyed. This
would cause flaky crashes that would be very difficult to track down.

We should put these in a struct to enforce proper destruction order:
struct ProfilerState {
  std::unique_ptr<StackSamplingProfiler> profiler;
  CallStackProfiles profiles;
  std::unique_ptr<WaitableEvent> completed;
};

along with a comment calling out the reason for the ordering. Then, return a
std::vector<ProfilerState> from this function.
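
One detail worth checking when writing that ordering comment: C++ destroys non-static data members in reverse order of declaration. For the profiler to be destroyed before the objects it refers to, it therefore has to be declared last. The sketch below uses hypothetical Fake* types and a ProfilerStateSketch struct, not the real classes, and declares the profiler last, which differs from the member order in the snippet above. It records the resulting destruction order.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Records destructor calls so the order can be inspected afterwards.
std::vector<std::string>& DestructionLog() {
  static std::vector<std::string> log;
  return log;
}

// Hypothetical stand-ins for CallStackProfiles and WaitableEvent.
struct FakeProfiles {
  ~FakeProfiles() { DestructionLog().push_back("profiles"); }
};
struct FakeEvent {
  ~FakeEvent() { DestructionLog().push_back("completed"); }
};

// Hypothetical stand-in for a profiler holding pointers to the other two.
struct FakeProfiler {
  FakeProfiler(FakeProfiles* profiles, FakeEvent* completed)
      : profiles_(profiles), completed_(completed) {
    assert(profiles_ && completed_);
  }
  ~FakeProfiler() { DestructionLog().push_back("profiler"); }

  FakeProfiles* profiles_;
  FakeEvent* completed_;
};

struct ProfilerStateSketch {
  // Order matters: members are destroyed in reverse declaration order, so the
  // objects referenced by the profiler come first and the profiler comes last.
  FakeEvent completed;
  FakeProfiles profiles;
  FakeProfiler profiler{&profiles, &completed};
};

// Constructs and destroys one state object and returns the destruction order.
std::vector<std::string> DestroyOneState() {
  DestructionLog().clear();
  { ProfilerStateSketch state; }
  return DestructionLog();
}
```

In this sketch the profiler's destructor runs while the profiles and the event are still alive, which is the property the struct is meant to guarantee.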

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:903:
static_cast<SampleRecordedCounter*>(samples_recorded[0].get())->Get() ==
I think we should be able to avoid casting entirely by using a template
function. Taking the comment on the CreateProfilers function into account:

template <typename T>
std::vector<ProfilerState> CreateProfilers(
    PlatformThreadId target_thread_id,
    const std::vector<SamplingParams>& params,
    const std::vector<std::unique_ptr<T>>* test_delegates) {
  // ... existing function ...
}

std::vector<ProfilerState> CreateProfilers(
    PlatformThreadId target_thread_id,
    const std::vector<SamplingParams>& params) {
  return CreateProfilers<NativeStackSamplerTestDelegate>(target_thread_id,
      params, nullptr);
}

And convert the non-delegate-users to the second overload.
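
To show why the template removes the cast, here is a self-contained sketch with hypothetical stand-in types. DelegateBase, RecordSamples and this simplified SampleRecordedCounter are not the test file's real definitions. Because the vector keeps its concrete element type, the caller can call Get() directly.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical base interface that the code under test calls into.
struct DelegateBase {
  virtual ~DelegateBase() = default;
  virtual void OnSample() {}
};

// Hypothetical concrete delegate that counts recorded samples.
class SampleRecordedCounter : public DelegateBase {
 public:
  void OnSample() override { ++count_; }
  int Get() const { return count_; }

 private:
  int count_ = 0;
};

// Accepts a vector of any delegate type derived from DelegateBase and drives
// each delegate through the base interface only.
template <typename T>
void RecordSamples(const std::vector<std::unique_ptr<T>>& delegates,
                   int samples_each) {
  assert(samples_each >= 0);
  for (const auto& delegate : delegates) {
    DelegateBase* base = delegate.get();  // Implicit upcast, no cast needed.
    for (int i = 0; i < samples_each; ++i)
      base->OnSample();
  }
}

// The caller owns a vector of the concrete type, so no static_cast is needed
// to reach Get().
int CountAfterThreeSamples() {
  std::vector<std::unique_ptr<SampleRecordedCounter>> samples_recorded;
  samples_recorded.push_back(std::make_unique<SampleRecordedCounter>());
  RecordSamples(samples_recorded, 3);
  return samples_recorded[0]->Get();
}
```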

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:991: void Wait() {
sample_recorded_.Wait(); }
nit: WaitForSampleToOccur

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1154: // Initiate an "idle"
shutdown and ensure it happens. Idle-shutdown was
Idle-shutdown is disabled in the test fixture ...

bcwhite

Description was changed from ========== Support parallel captures from the StackSamplingProfiler. BUG=671716 ========== to ========== ...

3 years, 8 months ago (2017-04-03 14:02:42 UTC) #357

bcwhite

Description was changed from ========== Support parallel captures from the StackSamplingProfiler. Previously, only one sampling ...

3 years, 8 months ago (2017-04-03 14:03:33 UTC) #358

gab

On 2017/03/31 01:46:33, Mike Wittman wrote: > Hi gab, can you review the Thread API ...

3 years, 8 months ago (2017-04-03 17:00:19 UTC) #359

bcwhite

Description was changed from ========== Support parallel captures from the StackSamplingProfiler. Previously, only one sampling ...

3 years, 8 months ago (2017-04-03 18:06:34 UTC) #360

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 8 months ago (2017-04-03 18:09:59 UTC) #361

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/1130001

3 years, 8 months ago (2017-04-03 18:10:37 UTC) #362

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 8 months ago (2017-04-03 18:17:13 UTC) #363

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/1150001

3 years, 8 months ago (2017-04-03 18:18:03 UTC) #365

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 8 months ago (2017-04-03 19:57:43 UTC) #366

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: win_chromium_rel_ng on master.tryserver.chromium.win (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_rel_ng/builds/413886)

3 years, 8 months ago (2017-04-03 19:57:45 UTC) #367

bcwhite

CL description changes: Done. https://codereview.chromium.org/2554123002/diff/1080001/base/profiler/stack_sampling_profiler_unittest.cc File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/1080001/base/profiler/stack_sampling_profiler_unittest.cc#newcode637 base/profiler/stack_sampling_profiler_unittest.cc:637: StackSamplingProfiler::ResetAnnotationsForTesting(); > Given these facts, ...

3 years, 8 months ago (2017-04-03 20:18:13 UTC) #368

CL description changes: Done.

https://codereview.chromium.org/2554123002/diff/1080001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/1080001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:637:
StackSamplingProfiler::ResetAnnotationsForTesting();
> Given these facts, I think the best course of action is:
>  - only run DisableIdleShutdown() in SetUp()
>  - only run Reset() in TearDown(), and add a comment on the fixture saying
> that all profiler tests in base_unittests must use the fixture to ensure
> proper clean-up

Done.

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:249: // Updates the |next_sample_time|
time based on configured parameters.
On 2017/03/31 18:12:33, Mike Wittman wrote:
> add comment on the meaning of the return value

Done.

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:259: // so that multiple threads may
make those calls.
On 2017/03/31 18:12:33, Mike Wittman wrote:
> broaden comment to discuss the general thread execution state under the lock.
> e.g.:
> 
> // State maintained about the current execution (or non-execution) of
> // the thread. This state must always be accessed while holding the
> // lock. A copy of the task-runner is maintained here for use by any
> // calling thread; this is necessary because Thread's accessor for it is
> // not itself thread-safe. The lock is also used to order calls to the
> // Thread API (Start, Stop, StopSoon, & DetachFromSequence) so that
> // multiple threads may make those calls.

Done.

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:275: std::map<int,
std::unique_ptr<CollectionContext>> active_collections_;
On 2017/03/31 18:12:33, Mike Wittman wrote:
> nit: move this declaration above the lock to make totally obvious that this is
> not protected by the lock.

Done.

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:603: // All capturing has completed so
finish the collection. Let object expire.
On 2017/03/31 18:12:33, Mike Wittman wrote:
> nit: clarify the meaning of "Let object expire."

Done.

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:613: // eliminating the race.
On 2017/03/31 18:12:33, Mike Wittman wrote:
> mention what the race is

Done.

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.h (right):

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.h:28: class WaitableEvent;
On 2017/03/31 18:12:33, Mike Wittman wrote:
> This can be removed since the header is already included.

Done.

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.h:183: // Testing support.
On 2017/03/31 18:12:33, Mike Wittman wrote:
> Add comment: The functions on this API are static because they affect the
> single sampling thread that is used across all StackSamplingProfilers.

Done.

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.h:245: // This will block until the
callback has been run.
On 2017/03/31 18:12:33, Mike Wittman wrote:
> update comment: This will block until the callback has been run _if profiling
> is taking place_.

Done.

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:364:
std::vector<std::unique_ptr<WaitableEvent>>* completed) {
On 2017/03/31 18:12:33, Mike Wittman wrote:
> There's a very subtle issue with this function in that it hides destruction
> ordering constraints from the caller: if the profiles or completed vectors were
> declared before the profilers vector in a test, the profilers could access those
> objects after they were destroyed but before the profilers were destroyed. This
> would cause flaky crashes that would be very difficult to track down.
> 
> We should put these in a struct to enforce proper destruction order:
> struct ProfilerState {
>   std::unique_ptr<StackSamplingProfiler> profiler;
>   CallStackProfiles profiles;
>   std::unique_ptr<WaitableEvent> completed;
> };
> 
> along with a comment calling out the reason for the ordering. Then, return a
> std::vector<ProfilerState> from this function.

Better than that, I think, is to create a struct that contains everything for a
single test profiler and create a vector of pointers to those.

This cleans up a lot of things and means we can get rid of vectors of a single
element.
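
The ordering constraint can be illustrated with a minimal, self-contained sketch. C++ destroys non-static data members in reverse order of declaration, so the member that holds pointers to the others is declared last. `Results`, `FakeProfiler`, `TestProfilerState`, and `RunDestructionOrderDemo` are hypothetical stand-ins for illustration, not the types used in this CL.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Records the order in which destructors run (illustration only).
std::vector<std::string>& DestructionLog() {
  static std::vector<std::string> log;
  return log;
}

// Hypothetical stand-in for the data a profiler writes into.
struct Results {
  ~Results() { DestructionLog().push_back("results"); }
  std::vector<int> samples;
};

// Hypothetical stand-in for a profiler that keeps a pointer to Results.
class FakeProfiler {
 public:
  explicit FakeProfiler(Results* results) : results_(results) {}
  ~FakeProfiler() {
    // Touching |results_| here is safe only if it is still alive.
    results_->samples.push_back(1);
    DestructionLog().push_back("profiler");
  }

 private:
  Results* results_;
};

// Members are destroyed in reverse declaration order, so the profiler
// (declared last) is destroyed before the Results it points at.
struct TestProfilerState {
  TestProfilerState() : profiler(&results) {}
  Results results;
  FakeProfiler profiler;
};

// Creates and destroys one state object and returns the destructor order.
std::vector<std::string> RunDestructionOrderDemo() {
  DestructionLog().clear();
  { TestProfilerState state; }
  return DestructionLog();
}
```

Keeping everything for one profiler in a single struct means a test cannot accidentally declare the pieces in an unsafe order.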

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:903:
static_cast<SampleRecordedCounter*>(samples_recorded[0].get())->Get() ==
On 2017/03/31 18:12:33, Mike Wittman wrote:
> I think we should be able to avoid casting entirely by using a template
> function. Taking the comment on the CreateProfilers function into account:
> 
> template <typename T>
> std::vector<ProfilerState> CreateProfilers(
>     PlatformThreadId target_thread_id,
>     const std::vector<SamplingParams>& params,
>     const std::vector<std::unique_ptr<T>>* test_delegates) {
>   // ... existing function ...
> }
> 
> std::vector<ProfilerState> CreateProfilers(
>     PlatformThreadId target_thread_id,
>     const std::vector<SamplingParams>& params) {
>   return CreateProfilers<NativeStackSamplerTestDelegate>(target_thread_id,
>       params, nullptr);
> }
> 
> And convert the non-delegate-users to the second overload.

No casting necessary with new TestProfilerInfo structure.

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:991: void Wait() {
sample_recorded_.Wait(); }
On 2017/03/31 18:12:33, Mike Wittman wrote:
> nit: WaitForSampleToOccur

Done.

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1154: // Initiate an "idle"
shutdown and ensure it happens. Idle-shutdown was
On 2017/03/31 18:12:33, Mike Wittman wrote:
> Idle-shutdown is disabled in the test fixture ...

Done.

brettw

owners lgtm but I mostly only looked at the API. Be sure to follow up ...

3 years, 8 months ago (2017-04-03 21:42:51 UTC) #369

bcwhite

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_sampling_profiler.cc File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_sampling_profiler.cc#newcode291 base/profiler/stack_sampling_profiler.cc:291: CHECK(sampler->active_collections_.empty()); On 2017/04/03 21:42:51, brettw (plz ping after 24h) ...

3 years, 8 months ago (2017-04-04 12:59:31 UTC) #370

Alexei Svitkine (slow)

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_sampling_profiler.cc File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_sampling_profiler.cc#newcode291 base/profiler/stack_sampling_profiler.cc:291: CHECK(sampler->active_collections_.empty()); On 2017/04/04 12:59:31, bcwhite wrote: > On 2017/04/03 ...

3 years, 8 months ago (2017-04-04 15:44:26 UTC) #371

bcwhite

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_sampling_profiler.cc File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_sampling_profiler.cc#newcode291 base/profiler/stack_sampling_profiler.cc:291: CHECK(sampler->active_collections_.empty()); On 2017/04/04 15:44:26, Alexei Svitkine (slow) wrote: > ...

3 years, 8 months ago (2017-04-04 15:54:49 UTC) #372

Mike Wittman

3 years, 8 months ago (2017-04-04 17:59:52 UTC) #373

This is looking good to me. Just waiting for gab's review.

A couple comments on the CL description:

> Sampling will continue until the desired number has been collected,
> it is manually stopped, or the controlling object gets destructed.

... until the desired number _of samples_ has been collected ...

> The thread under test is expected to live at least as long as the
> thread controlling the sampling.

The sampled thread is expected to live at least as long ...

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:364:
std::vector<std::unique_ptr<WaitableEvent>>* completed) {
On 2017/04/03 20:18:13, bcwhite wrote:
> On 2017/03/31 18:12:33, Mike Wittman wrote:
> > There's a very subtle issue with this function in that it hides destruction
> > ordering constraints from the caller: if the profiles or completed vectors
> > were declared after the profilers vector in a test, the profilers could
> > access those objects after they were destroyed but before the profilers were
> > destroyed. This would cause flaky crashes that would be very difficult to
> > track down.
> > 
> > We should put these in a struct to enforce proper destruction order:
> > struct ProfilerState {
> >   CallStackProfiles profiles;
> >   std::unique_ptr<WaitableEvent> completed;
> >   std::unique_ptr<StackSamplingProfiler> profiler;
> > };
> > 
> > along with a comment calling out the reason for the ordering. Then, return a
> > std::vector<ProfilerState> from this function.
> 
> Better than that, I think, is to create a struct that contains everything for
> a single test profiler and create a vector of pointers to those.
> 
> This cleans up a lot of things and means we can get rid of vectors of a single
> element.

Nice, that makes things much cleaner.

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:291:
CHECK(sampler->active_collections_.empty());
On 2017/04/04 15:54:49, bcwhite wrote:
> On 2017/04/04 15:44:26, Alexei Svitkine (slow) wrote:
> > On 2017/04/04 12:59:31, bcwhite wrote:
> > > On 2017/04/03 21:42:51, brettw (plz ping after 24h) wrote:
> > > > Can this and the two non-debug assertions 2 functions down be converted
> > > > to debug ones? Non-debug assertions add ~100 bytes each to the release
> > > > binary. And since these are test ones, most of the tests are run with
> > > > debug assertions enabled (even in release mode).
> > > 
> > > I had them DCHECK originally but was told to convert them to CHECK because
> > > the linker would remove these methods as "dead code" (they're only called
> > > from tests).  Is that not the case?
> > 
> > If they're only called from tests, then what's the problem with them being
> > removed as dead code from Chrome build?
> > 
> > They shouldn't be removed by the linker when building the tests.
> 
> No problem.  The question was whether there was any difference between CHECK
> and DCHECK in test code.
> 
> If test code is removed by the linker when building FOR RELEASE then it
> wouldn't matter which check is used inside that code.  If it hangs around,
> then we should definitely use DCHECK.

I did a spot check of several TestAPI definitions in Windows 64-bit 59.0.3060.0
canary and there are no symbols for those classes in the release build despite
the presence of symbols for the enclosing class. There are no TestAPI symbols in
the build at all.
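
For readers unfamiliar with the distinction being debated, here is a minimal sketch of an always-on check versus a debug-only check. `SKETCH_CHECK`, `SKETCH_DCHECK`, `kDebugChecksEnabled`, and `CountConditionEvaluations` are hypothetical names for illustration; they are not Chromium's actual `CHECK`/`DCHECK` definitions, and the sketch assumes the debug variant is controlled by `NDEBUG`.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Hypothetical always-on check: evaluated and enforced in every build.
#define SKETCH_CHECK(cond)                                \
  do {                                                    \
    if (!(cond)) {                                        \
      std::fprintf(stderr, "Check failed: %s\n", #cond);  \
      std::abort();                                       \
    }                                                     \
  } while (0)

// Hypothetical debug-only check: when NDEBUG is defined the condition is
// still type-checked but never evaluated, so it adds no runtime cost.
#ifdef NDEBUG
constexpr bool kDebugChecksEnabled = false;
#define SKETCH_DCHECK(cond) \
  do {                      \
    if (false) {            \
      (void)(cond);         \
    }                       \
  } while (0)
#else
constexpr bool kDebugChecksEnabled = true;
#define SKETCH_DCHECK(cond) SKETCH_CHECK(cond)
#endif

// Counts how many of the two conditions were actually evaluated.
int CountConditionEvaluations() {
  int evaluations = 0;
  SKETCH_CHECK(++evaluations > 0);   // Always evaluated.
  SKETCH_DCHECK(++evaluations > 0);  // Evaluated only in debug builds.
  return evaluations;
}
```

If tests are generally built with debug checks enabled, both forms fire in tests and only the always-on form costs anything in a release binary.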

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:882: SampleRecordedCounter
samples_recorded[2];
nit: arraysize(params)

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1274: profiler_info.Reset();
Now that we have the TestProfilerInfo struct it would be simpler just to create
a new TestProfilerInfo here and run that.

brettw

The DCHECK/CHECK thing doesn't matter either way in practice for this patch so I want ...

3 years, 8 months ago (2017-04-04 18:20:55 UTC) #374

bcwhite

3 years, 8 months ago (2017-04-05 12:14:12 UTC) #375

Description was changed from

==========
Support parallel captures from the StackSamplingProfiler.

Previously, only one sampling operation could be running and it was
generally used to profile the startup of the browser.  To make it more
useful, it can now run against any thread and multiple profilers can
execute in parallel.

Sampling will continue until the desired number has been collected,
it is manually stopped, or the controlling object gets destructed.

The SamplingThread is a singleton base::Thread that is self-managing.
- It is started (via GetOrCreateTaskRunnerForAdd) on the calling
  thread when work arrives.
- It stops (via ShutdownTask) on its own thread when it has been
  idle for 1 minute.
- DetachFromSequence is called after both of these to allow for
  accessing the API from different threads.
- thread_execution_state_lock_ is held when doing Thread API calls to
  ensure that access is sequenced.

The thread under test is expected to live at least as long as the
thread controlling the sampling.

BUG=671716
==========

to

==========
Support parallel captures from the StackSamplingProfiler.

Previously, only one sampling operation could be running and it was
generally used to profile the startup of the browser.  To make it more
useful, it can now run against any thread and multiple profilers can
execute in parallel.

Sampling will continue until the desired number of samples has been
collected, it is manually stopped, or the controlling object gets
destructed.

The SamplingThread is a singleton base::Thread that is self-managing.
- It is started (via GetOrCreateTaskRunnerForAdd) on the calling
  thread when work arrives.
- It stops (via ShutdownTask) on its own thread when it has been
  idle for 1 minute.
- DetachFromSequence is called after both of these to allow for
  accessing the API from different threads.
- thread_execution_state_lock_ is held when doing Thread API calls to
  ensure that access is sequenced.

The sampled thread is expected to live at least as long as the
thread controlling the sampling.

BUG=671716
==========

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 8 months ago (2017-04-05 13:09:13 UTC) #376

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/1170001

3 years, 8 months ago (2017-04-05 13:09:30 UTC) #377

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 8 months ago (2017-04-05 15:30:51 UTC) #378

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

3 years, 8 months ago (2017-04-05 15:30:52 UTC) #379

gab

3 years, 8 months ago (2017-04-05 20:38:43 UTC) #380

Some comments, some drive-bys on a few nits I spotted and a meta-comment:

Why does sampling have to be startable from any thread? Besides starting and
self-stopping, what makes the threading complicated here?

Can this CL be split? >1K LOCs is a huge review... I therefore didn't read the
whole thing but did look at everything that seemed related to threading (skipped
tests mostly though).

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:114: class
StackSamplingProfiler::SamplingThread : public Thread {
This is big enough to warrant its own file and unit tests IMO

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:240: // Check if the sampling thread is
idle and begin a shutdown if so.
"begin a shutdown if so" sounds weird to me

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:291:
CHECK(sampler->active_collections_.empty());
On 2017/04/04 17:59:51, Mike Wittman wrote:
> On 2017/04/04 15:54:49, bcwhite wrote:
> > On 2017/04/04 15:44:26, Alexei Svitkine (slow) wrote:
> > > On 2017/04/04 12:59:31, bcwhite wrote:
> > > > On 2017/04/03 21:42:51, brettw (plz ping after 24h) wrote:
> > > > > Can this and the two non-debug assertions 2 functions down be
> > > > > converted to debug ones? Non-debug assertions add ~100 bytes each to
> > > > > the release binary. And since these are test ones, most of the tests
> > > > > are run with debug assertions enabled (even in release mode).
> > > > 
> > > > I had them DCHECK originally but was told to convert them to CHECK
> > > > because the linker would remove these methods as "dead code" (they're
> > > > only called from tests).  Is that not the case?
> > > 
> > > If they're only called from tests, then what's the problem with them being
> > > removed as dead code from Chrome build?
> > > 
> > > They shouldn't be removed by the linker when building the tests.
> > 
> > No problem.  The question was whether there was any difference between CHECK
> > and DCHECK in test code.
> > 
> > If test code is removed by the linker when building FOR RELEASE then it
> > wouldn't matter which check is used inside that code.  If it hangs around,
> > then we should definitely use DCHECK.
> 
> I did a spot check of several TestAPI definitions in Windows 64-bit
> 59.0.3060.0 canary and there are no symbols for those classes in the release
> build despite the presence of symbols for the enclosing class. There are no
> TestAPI symbols in the build at all.

Yes, this should be DCHECK, don't CHECK just to force a test in non-test code:
https://chromium.googlesource.com/chromium/src/+/master/styleguide/c++/c++.md...

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:354: void
StackSamplingProfiler::SamplingThread::TestAPI::ShutdownTaskAndSignalEvent(
You can just post two tasks in a row instead of having a custom helper that
does two things in one.

base::Thread will run all tasks before winding down anyways.

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:366: :
Thread("Chrome_SamplingProfilerThread") {}
No need to prefix with "Chrome_", the thread names will always be viewed as
scoped to Chrome's browser process anyways. No need to suffix with "Thread"
either.

"StackSamplingProfiler" is a shorter and more precise name.

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:369: Stop();
Not necessary, ~Thread() does this already so = default; is sufficient here.

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:380: options.priority =
ThreadPriority::DISPLAY;
Hmmm I don't think that's appropriate. On Android only UI/IO run at DISPLAY I
think and on Desktop no thread runs at DISPLAY priority for now (all NORMAL or
lower).

I understand that sampling has to be regular to be accurate but we also don't
want to slow down the product in order to sample... right?

With this we're telling the OS: if you're under crunch and there's only one
thing you can make Chrome do, schedule this thread.

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:435: // thread and the thread that
creates it (i.e. this thread).
Add " for thread-safety reasons which are alleviated in SamplingThread per
gating its access on |thread_execution_state_lock_|."

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:475: CollectionContext* collection) {
Add DCHECK_EQ(GetThreadId(), PlatformThread::CurrentId()); calls that document
methods that always run on the sampling thread (others should use AutoLock or
have a meta-comment so that it's clear which context each method is entered
from).

Or even better would be to split this state into (1) a class that only runs on
sampling thread (members never touched from elsewhere) and (2) the
multi-threaded part with state behind lock.
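
As a rough illustration of the first suggestion, the sketch below guards a method with a thread-identity check using only the standard library. `ThreadAffinity` and `SketchSampler` are hypothetical names, and the method returns `false` instead of asserting so the behaviour is observable; real code would use a debug assertion as suggested above.

```cpp
#include <cassert>
#include <thread>

// Hypothetical helper: remembers one thread id and reports whether the
// calling thread matches it.
class ThreadAffinity {
 public:
  explicit ThreadAffinity(std::thread::id owner) : owner_(owner) {}
  bool CalledOnOwner() const { return std::this_thread::get_id() == owner_; }

 private:
  const std::thread::id owner_;
};

// Hypothetical sampler whose RecordSample() must run on one specific thread.
class SketchSampler {
 public:
  explicit SketchSampler(std::thread::id sampling_thread)
      : affinity_(sampling_thread) {}

  // Records a sample only when called on the bound thread. Returning false
  // stands in for the debug assertion that production code would use.
  bool RecordSample() {
    if (!affinity_.CalledOnOwner())
      return false;
    ++samples_;
    return true;
  }

  int samples() const { return samples_; }

 private:
  ThreadAffinity affinity_;
  int samples_ = 0;  // Only touched on the bound thread, so no lock needed.
};
```

A default-constructed `std::thread::id` represents no thread, so a sampler bound to it rejects every caller; that makes the negative case easy to exercise without spawning a thread.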

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:603:
std::max(collection->next_sample_time - Time::Now(), TimeDelta()));
This isn't required I think, negative delays should be the same as no delays.

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:616: // get postponed until
thread_execution_state_ is updated, thus eliminating
|thread_execution_state_|

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:639: // work comes in. Remove the
thread_execution_state_task_runner_ to avoid
|thread_execution_state_task_runner_| and maybe elsewhere too

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:759: profiling_inactive_.Reset();
Use a ResetPolicy::AUTOMATIC WaitableEvent?

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.h (right):

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.h:194: static bool
IsSamplingThreadRunning();
Thread::IsRunning() isn't thread-safe (though the check is sadly disabled right
now [1]) and as such this method isn't either (must be called from owning
sequence). Please document it as such (or probably this entire TestAPI class as
such in fact).

[1]
https://cs.chromium.org/chromium/src/base/threading/thread.cc?type=cs&q=%22//...

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.h:197: static void DisableIdleShutdown();
Since you should support having multiple pending delayed shutdown tasks in your
queue (I don't see anything that prevents that from happening), why bother
disabling them? Your tests should complete well before they fire for real
anyway, so it shouldn't be a source of flakiness; disabling them merely moves
you further from testing your real product code.

https://codereview.chromium.org/2554123002/diff/1170001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/1170001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:729: // Stop is immediate but
asynchronous. There is a non-zero probability that
If it's asynchronous, it's not immediate :).

"// Stop is asynchronous: there is a non-zero..."

bcwhite

3 years, 8 months ago (2017-04-06 16:18:51 UTC) #381

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:291:
CHECK(sampler->active_collections_.empty());
On 2017/04/04 17:59:51, Mike Wittman wrote:
> On 2017/04/04 15:54:49, bcwhite wrote:
> > On 2017/04/04 15:44:26, Alexei Svitkine (slow) wrote:
> > > On 2017/04/04 12:59:31, bcwhite wrote:
> > > > On 2017/04/03 21:42:51, brettw (plz ping after 24h) wrote:
> > > > > Can this and the two non-debug assertions 2 functions down be
> > > > > converted to debug ones? Non-debug assertions add ~100 bytes each to
> > > > > the release binary. And since these are test ones, most of the tests
> > > > > are run with debug assertions enabled (even in release mode).
> > > > 
> > > > I had them DCHECK originally but was told to convert them to CHECK
> > > > because the linker would remove these methods as "dead code" (they're
> > > > only called from tests).  Is that not the case?
> > > 
> > > If they're only called from tests, then what's the problem with them being
> > > removed as dead code from Chrome build?
> > > 
> > > They shouldn't be removed by the linker when building the tests.
> > 
> > No problem.  The question was whether there was any difference between CHECK
> > and DCHECK in test code.
> > 
> > If test code is removed by the linker when building FOR RELEASE then it
> > wouldn't matter which check is used inside that code.  If it hangs around,
> > then we should definitely use DCHECK.
> 
> I did a spot check of several TestAPI definitions in Windows 64-bit
> 59.0.3060.0 canary and there are no symbols for those classes in the release
> build despite the presence of symbols for the enclosing class. There are no
> TestAPI symbols in the build at all.

Mike, you asked for them to be CHECK and you're the owner.  Your call.

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:882: SampleRecordedCounter
samples_recorded[2];
On 2017/04/04 17:59:52, Mike Wittman wrote:
> nit: arraysize(params)

Done.

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1274: profiler_info.Reset();
On 2017/04/04 17:59:52, Mike Wittman wrote:
> Now that we have the TestProfilerInfo struct it would be simpler just to
> create a new TestProfilerInfo here and run that.

Done.

Mike Wittman

3 years, 8 months ago (2017-04-06 16:55:41 UTC) #382

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:291:
CHECK(sampler->active_collections_.empty());
On 2017/04/06 16:18:49, bcwhite wrote:
> On 2017/04/04 17:59:51, Mike Wittman wrote:
> > On 2017/04/04 15:54:49, bcwhite wrote:
> > > On 2017/04/04 15:44:26, Alexei Svitkine (slow) wrote:
> > > > On 2017/04/04 12:59:31, bcwhite wrote:
> > > > > On 2017/04/03 21:42:51, brettw (plz ping after 24h) wrote:
> > > > > > Can this and the two non-debug assertions 2 functions down be
> > > > > > converted to debug ones? Non-debug assertions add ~100 bytes each
> > > > > > to the release binary. And since these are test ones, most of the
> > > > > > tests are run with debug assertions enabled (even in release mode).
> > > > > 
> > > > > I had them DCHECK originally but was told to convert them to CHECK
> > > > > because the linker would remove these methods as "dead code" (they're
> > > > > only called from tests).  Is that not the case?
> > > > 
> > > > If they're only called from tests, then what's the problem with them
> > > > being removed as dead code from Chrome build?
> > > > 
> > > > They shouldn't be removed by the linker when building the tests.
> > > 
> > > No problem.  The question was whether there was any difference between
> > > CHECK and DCHECK in test code.
> > > 
> > > If test code is removed by the linker when building FOR RELEASE then it
> > > wouldn't matter which check is used inside that code.  If it hangs around,
> > > then we should definitely use DCHECK.
> > 
> > I did a spot check of several TestAPI definitions in Windows 64-bit
> > 59.0.3060.0 canary and there are no symbols for those classes in the release
> > build despite the presence of symbols for the enclosing class. There are no
> > TestAPI symbols in the build at all.
> 
> Mike, you asked for them to be CHECK and you're the owner.  Your call.

Following the guidance from Brett and Gab is OK with me. If DCHECKs are
generally always enabled in tests then the CHECKs don't really buy us anything.

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 8 months ago (2017-04-06 18:38:31 UTC) #383

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/1190001

3 years, 8 months ago (2017-04-06 18:39:12 UTC) #384

bcwhite

3 years, 8 months ago (2017-04-06 18:40:19 UTC) #385

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:114: class
StackSamplingProfiler::SamplingThread : public Thread {
On 2017/04/05 20:38:42, gab wrote:
> This is big enough to warrant its own file and unit tests IMO

Acknowledged.

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:240: // Check if the sampling thread is
idle and begin a shutdown if so.
On 2017/04/05 20:38:42, gab wrote:
> "begin a shutdown if so" sounds weird to me

Done.

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:291:
CHECK(sampler->active_collections_.empty());
On 2017/04/06 16:55:41, Mike Wittman wrote:
> On 2017/04/06 16:18:49, bcwhite wrote:
> > On 2017/04/04 17:59:51, Mike Wittman wrote:
> > > On 2017/04/04 15:54:49, bcwhite wrote:
> > > > On 2017/04/04 15:44:26, Alexei Svitkine (slow) wrote:
> > > > > On 2017/04/04 12:59:31, bcwhite wrote:
> > > > > > On 2017/04/03 21:42:51, brettw (plz ping after 24h) wrote:
> > > > > > > Can this and the two non-debug assertions 2 functions down be
> > > > > > > converted to debug ones? Non-debug assertions add ~100 bytes each
> > > > > > > to the release binary. And since these are test ones, most of the
> > > > > > > tests are run with debug assertions enabled (even in release
> > > > > > > mode).
> > > > > > 
> > > > > > I had them DCHECK originally but was told to convert them to CHECK
> > > > > > because the linker would remove these methods as "dead code"
> > > > > > (they're only called from tests).  Is that not the case?
> > > > > 
> > > > > If they're only called from tests, then what's the problem with them
> > > > > being removed as dead code from Chrome build?
> > > > > 
> > > > > They shouldn't be removed by the linker when building the tests.
> > > > 
> > > > No problem.  The question was whether there was any difference between
> > > > CHECK and DCHECK in test code.
> > > > 
> > > > If test code is removed by the linker when building FOR RELEASE then it
> > > > wouldn't matter which check is used inside that code.  If it hangs
> > > > around, then we should definitely use DCHECK.
> > > 
> > > I did a spot check of several TestAPI definitions in Windows 64-bit
> > > 59.0.3060.0 canary and there are no symbols for those classes in the
> > > release build despite the presence of symbols for the enclosing class.
> > > There are no TestAPI symbols in the build at all.
> > 
> > Mike, you asked for them to be CHECK and you're the owner.  Your call.
> 
> Following the guidance from Brett and Gab is OK with me. If DCHECKs are
> generally always enabled in tests then the CHECKs don't really buy us
> anything.

Done.

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:354: void
StackSamplingProfiler::SamplingThread::TestAPI::ShutdownTaskAndSignalEvent(
On 2017/04/05 20:38:42, gab wrote:
> You can just post two tasks in a row instead of having a custom helper that
> does two things in one.
> 
> base::Thread will run all tasks before winding down anyways.

Wouldn't two successive posts create a race-condition?
- ShutdownTask gets posted
   - ShutdownTask runs
   - thread exits
- SignalDoneTask gets posted
   - never runs

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:366: :
Thread("Chrome_SamplingProfilerThread") {}
On 2017/04/05 20:38:42, gab wrote:
> No need to prefix with "Chrome_", the thread names will always be viewed as
> scoped to Chrome's browser process anyways. No need to suffix with "Thread"
> either.
> 
> "StackSamplingProfiler" is a shorter and more precise name.

Done.

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:369: Stop();
On 2017/04/05 20:38:42, gab wrote:
> Not necessary, ~Thread() does this already so = default; is sufficient here.

Done.

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:380: options.priority =
ThreadPriority::DISPLAY;
On 2017/04/05 20:38:42, gab wrote:
> Hmmm I don't think that's appropriate. On Android only UI/IO run at DISPLAY I
> think and on Desktop no thread runs at DISPLAY priority for now (all NORMAL or
> lower).
> 
> I understand that sampling has to be regular to be accurate but we also don't
> want to slow down the product in order to sample... right?

Right.


> With this we're telling the OS: if you're under crunch and there's only one
> thing you can make Chrome do, schedule this thread.

Done.  I can remove the method and let the default one do the default thing.

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:435: // thread and the thread that
creates it (i.e. this thread).
On 2017/04/05 20:38:42, gab wrote:
> Add " for thread-safety reasons which are alleviated in SamplingThread per
> gating its access on |thread_execution_state_lock_|."

Done.

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:475: CollectionContext* collection) {
On 2017/04/05 20:38:42, gab wrote:
> Add DCHECK_EQ(GetThreadId(), PlatformThread::CurrentId()); calls that document
> methods that always run on the sampling thread (others should use AutoLock or
> have a meta-comment so that it's clear which context each method is entered
> from).

Done.


> Or even better would be to split this state into (1) a class that only runs on
> sampling thread (members never touched from elsewhere) and (2) the
> multi-threaded part with state behind lock.

As in...

Create yet another sub-class that has all the thread-specific access and give it
a pointer to its parent class for accessing the shared information, including
the lock to that information?

I'm not sure that's worth the effort but I'll think about it.
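
For reference, the DCHECK_EQ suggestion above amounts to recording which thread owns a piece of state and asserting on entry to every method that must run there. A minimal sketch in standard C++, assuming a hypothetical SamplingState class and using plain assert() in place of Chromium's DCHECK macros (neither the class nor its methods come from the CL):

```cpp
#include <cassert>
#include <thread>

// Hypothetical stand-in for state that must only be touched on one thread.
class SamplingState {
 public:
  // Records the calling thread as the owner of this state.
  void BindToCurrentThread() { owner_ = std::this_thread::get_id(); }

  // True only when called on the thread that was bound as the owner.
  bool CalledOnOwningThread() const {
    return owner_ == std::this_thread::get_id();
  }

  // A method that documents (and enforces in debug builds) its context.
  void RecordSample() {
    assert(CalledOnOwningThread());
    ++samples_;
  }

  int samples() const { return samples_; }

 private:
  std::thread::id owner_;  // Default value represents "no thread".
  int samples_ = 0;
};
```

Methods without such a check would be the ones entered from arbitrary threads, which is where the lock (or a comment naming the calling context) would go.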

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:603:
std::max(collection->next_sample_time - Time::Now(), TimeDelta()));
On 2017/04/05 20:38:42, gab wrote:
> This isn't required I think, negative delays should be the same as no delays.

There is a DCHECK in incoming_task_queue.cc (line 45) that checks that delay is
not negative.
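
To illustrate why the std::max is there: if the next sample time has already passed, the subtraction goes negative, and clamping to a zero delay keeps the posted delay valid. A small sketch using std::chrono rather than base::Time/TimeDelta; DelayUntil is a hypothetical helper name, not something in the CL:

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

using Clock = std::chrono::steady_clock;

// Returns how long to wait before the next sample, never less than zero.
// If `next_sample_time` is already in the past, the result is a zero delay.
Clock::duration DelayUntil(Clock::time_point next_sample_time,
                           Clock::time_point now) {
  return std::max(next_sample_time - now, Clock::duration::zero());
}
```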

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:616: // get postponed until
thread_execution_state_ is updated, thus eliminating
On 2017/04/05 20:38:42, gab wrote:
> |thread_execution_state_|

Done.

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:639: // work comes in. Remove the
thread_execution_state_task_runner_ to avoid
On 2017/04/05 20:38:42, gab wrote:
> |thread_execution_state_task_runner_ | and maybe elsewhere too

Done.

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:759: profiling_inactive_.Reset();
On 2017/04/05 20:38:42, gab wrote:
> Use a ResetPolicy::AUTOMATIC WaitableEvent?

There are other Wait calls on this that don't reset it.  Comment added where it
is defined.

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.h (right):

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.h:194: static bool
IsSamplingThreadRunning();
On 2017/04/05 20:38:43, gab wrote:
> Thread::IsRunning() isn't thread-safe (though the check is sadly disabled right
> now [1]) and as such this method isn't either (must be called from owning
> sequence). Please document it as such (or probably this entire TestAPI class as
> such in fact).

I see.  Would the best solution be to have CleanUp set a flag (under lock) and
return that?

> 
> [1] https://cs.chromium.org/chromium/src/base/threading/thread.cc?type=cs&q=%22//...

Done.  I also added a DetachFromSequence in the code for this.  Let me know if
that's unnecessary.

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.h:197: static void DisableIdleShutdown();
On 2017/04/05 20:38:42, gab wrote:
> Since you should support having multiple pending delayed shutdown tasks in your
> queue (I don't see anything that prevents that from happening),

They can and it's handled.


> why bother
> disable them? Your tests should complete much before they fire for real anyways
> so it shouldn't be a source of flakiness, disabling them merely brings you
> further from testing your real product code.

Just for guaranteed operation.  Right now the idle shutdown time is 1 minute but
it's an internal thing.  If it was changed to 1s we wouldn't want the tests to
become flaky.

https://codereview.chromium.org/2554123002/diff/1170001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/1170001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:729: // Stop is immediate but
asynchronous. There is a non-zero probability that
On 2017/04/05 20:38:43, gab wrote:
> If it's asynchronous, it's not immediate :).
> 
> "// Stop is asynchronous: there is a non-zero..."

Done.  :-)

Alexei Svitkine (slow)

asvitkine@chromium.org changed reviewers: - asvitkine@chromium.org

3 years, 8 months ago (2017-04-06 18:42:54 UTC) #386

bcwhite

> Why does sampling have to be startable from any thread? Besides starting and > ...

3 years, 8 months ago (2017-04-06 18:44:03 UTC) #387

gab

3 years, 8 months ago (2017-04-06 19:31:28 UTC) #388

On 2017/04/06 18:44:03, bcwhite wrote:
> > Why does sampling have to be startable from any thread? Besides starting and
> > self-stopping, what makes the threading complicated here?
> 
> Generally, a thread will initiate sampling upon itself but since any thread
> could do that to itself, the sampling thread needs to be startable from any
> thread.
> 
> And there may be cases as well where, for example, the UI thread wants to
> initiate sampling on some worker thread.
> 
> 
> > Can this CL be split? >1K LOCs is a huge review... I therefore didn't read the
> > whole thing but did look at everything that seemed related to threading (skipped
> > tests mostly though).
> 
> I don't see how.  It's one change.  And it's a replacement for currently-active
> functionality so even if it could be broken into pieces, it wouldn't "drop in"
> (or revert cleanly) as needed.

I'm not the main reviewer so won't force it but the way we usually do this in
base/task_scheduler land et al. is build individual components on their own w/
unit tests that aren't yet attached to the system (e.g. StackSamplingProfiler
class could be one) instead of having them hidden in anonymous namespace and
tested by integration. This makes testing easier (focused unit tests), eventual
refactoring easier (e.g. sequenced_worker_pool.cc is a mumbo-jumbo mess of
anonymous classes and is hard to test and refactor because of that -- we avoided
that in base/task_scheduler, base::internal:: namespace is used instead to
depict impl-only boundary), and incremental CLs.



Didn't do another full pass but overall threading lgtm w/ comments below.

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:354: void
StackSamplingProfiler::SamplingThread::TestAPI::ShutdownTaskAndSignalEvent(
On 2017/04/06 18:40:18, bcwhite wrote:
> On 2017/04/05 20:38:42, gab wrote:
> > You can just post two tasks in a row instead of having a custom helper that
> > does two things in one.
> > 
> > base::Thread will run all tasks before winding down anyways.
> 
> Wouldn't two successive posts create a race-condition?
> - ShutdownTask gets posted
>    - ShutdownTask runs
>    - thread exits
> - SignalDoneTask gets posted
>    - never runs

Ah, good point, yes. Hadn't realized ShutdownTask() initiated its own async
shutdown (via StopSoon()) when I wrote this, that paradigm is a first in Chrome!
(but it's okay, we had talked about it, just forgot)
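
The interleaving bcwhite describes can be reproduced with a toy, single-threaded model. This is only a sketch: ToyThread and both helper functions are hypothetical, and the model assumes that a task posted after the thread has exited is simply dropped.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

// Toy, single-threaded model of a task queue whose "thread" can stop itself.
// This is NOT base::Thread; every name here is hypothetical.
class ToyThread {
 public:
  // Returns false and drops the task if the thread has already exited.
  bool PostTask(std::function<void()> task) {
    if (exited_)
      return false;
    queue_.push_back(std::move(task));
    return true;
  }

  // Meant to be called from inside a task, like a self-initiated stop.
  void StopSoon() { stop_requested_ = true; }

  // Runs everything currently queued, then exits if a stop was requested.
  void RunUntilIdle() {
    while (!queue_.empty()) {
      std::function<void()> task = std::move(queue_.front());
      queue_.pop_front();
      task();
    }
    if (stop_requested_)
      exited_ = true;
  }

 private:
  std::deque<std::function<void()>> queue_;
  bool stop_requested_ = false;
  bool exited_ = false;
};

// Two separate posts, with the thread getting to run in between.
// Returns whether the "done" signal was ever delivered.
bool TwoPostsWithUnluckyTiming() {
  ToyThread thread;
  bool signaled = false;
  thread.PostTask([&thread] { thread.StopSoon(); });
  thread.RunUntilIdle();  // The shutdown task runs and the thread exits here.
  thread.PostTask([&signaled] { signaled = true; });  // Dropped.
  thread.RunUntilIdle();
  return signaled;
}

// One task that requests shutdown and then signals.
bool SingleCombinedTask() {
  ToyThread thread;
  bool signaled = false;
  thread.PostTask([&thread, &signaled] {
    thread.StopSoon();
    signaled = true;
  });
  thread.RunUntilIdle();
  return signaled;
}
```

In this model the combined task always delivers the signal, while the two-post version loses it whenever the thread runs between the posts, which is the case for keeping the single helper.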

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:603:
std::max(collection->next_sample_time - Time::Now(), TimeDelta()));
On 2017/04/06 18:40:18, bcwhite wrote:
> On 2017/04/05 20:38:42, gab wrote:
> > This isn't required I think, negative delays should be the same as no delays.
> 
> There is a DCHECK in incoming_task_queue.cc (line 45) that checks that delay is
> not negative.

Ah ok interesting, probably an artifact but std::max here is fine then :)

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:759: profiling_inactive_.Reset();
On 2017/04/06 18:40:18, bcwhite wrote:
> On 2017/04/05 20:38:42, gab wrote:
> > Use a ResetPolicy::AUTOMATIC WaitableEvent?
> 
> There are other Wait calls on this that don't reset it.  Comment added where it
> is defined.

The only other call I see is in the destructor (at which point resetting or not
doesn't matter)?
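
The difference between the two reset policies under discussion can be shown with a non-blocking toy model. This is a sketch only: ToyEvent and TryWait are hypothetical and mirror just the reset behaviour, not the blocking semantics or the API of base::WaitableEvent.

```cpp
#include <cassert>

// Non-blocking toy model of an event with two reset policies.
// Hypothetical; it mirrors only the reset behaviour, not base::WaitableEvent.
enum class ResetPolicy { kManual, kAutomatic };

class ToyEvent {
 public:
  explicit ToyEvent(ResetPolicy policy) : policy_(policy) {}

  void Signal() { signaled_ = true; }
  void Reset() { signaled_ = false; }

  // Stands in for a Wait() that would return immediately: reports whether the
  // event is signaled, and consumes the signal under the automatic policy.
  bool TryWait() {
    const bool was_signaled = signaled_;
    if (was_signaled && policy_ == ResetPolicy::kAutomatic)
      signaled_ = false;
    return was_signaled;
  }

 private:
  const ResetPolicy policy_;
  bool signaled_ = false;
};
```

With the manual policy every waiter sees the signal until someone calls Reset(); with the automatic policy the first successful wait clears it, so a second waiter would block. Whether that matters here depends on how many places wait on the event, which is the point being debated.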

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.h (right):

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.h:194: static bool
IsSamplingThreadRunning();
On 2017/04/06 18:40:18, bcwhite wrote:
> On 2017/04/05 20:38:43, gab wrote:
> > Thread::IsRunning() isn't thread-safe (though the check is sadly disabled right
> > now [1]) and as such this method isn't either (must be called from owning
> > sequence). Please document it as such (or probably this entire TestAPI class as
> > such in fact).
> 
> I see.  Would the best solution be to have CleanUp set a flag (under lock) and
> return that?
> 
> > 
> > [1] https://cs.chromium.org/chromium/src/base/threading/thread.cc?type=cs&q=%22//...
> 
> Done.  I also added a DetachFromSequence in the code for this.  Let me know if
> that's unnecessary.

Hmmm if the comment above is respected and IsRunning() is always called from
owning thread then it's always fine. No need to have a fancy Cleanup() -- that's
already what Thread::IsRunning() is doing when called properly.

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.h:197: static void DisableIdleShutdown();
On 2017/04/06 18:40:18, bcwhite wrote:
> On 2017/04/05 20:38:42, gab wrote:
> > Since you should support having multiple pending delayed shutdown tasks in your
> > queue (I don't see anything that prevents that from happening),
> 
> They can and it's handled.
> 
> 
> > why bother
> > disable them? Your tests should complete much before they fire for real anyways
> > so it shouldn't be a source of flakiness, disabling them merely brings you
> > further from testing your real product code.
> 
> Just for guaranteed operation.  Right now the idle shutdown time is 1 minute but
> it's an internal thing.  If it was changed to 1s we wouldn't want the tests to
> become flaky.

Hmm okay, but you also by doing so don't test calling ShutdownTask with other
pending ShutdownTasks.

Up to you and wittman@ to decide which you prefer, just highlighting this.

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 8 months ago (2017-04-06 21:53:18 UTC) #389

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

3 years, 8 months ago (2017-04-06 21:53:24 UTC) #390

Mike Wittman

3 years, 8 months ago (2017-04-06 21:54:30 UTC) #391

LGTM

Since this is a pretty major change, we should land it after next week's branch
point to minimize risk and potential disturbance to the release stabilization
process. The following Monday or Wednesday morning (4/17 or 4/19) would be ideal
as that would give one day of bake time in canary before the following dev
release.

Also, please manually sample browser_tests output on later runs of the CQ and
try bots during the day that this has landed to ensure it's not causing
stability issues. (Grep'ing for crash stacks containing "StackSamplingProfiler"
in the browser_tests log output is sufficient.)

On 2017/04/06 19:31:28, gab (behind) wrote:
> On 2017/04/06 18:44:03, bcwhite wrote:
> > > Can this CL be split? >1K LOCs is a huge review... I therefore didn't read the
> > > whole thing but did look at everything that seemed related to threading (skipped
> > > tests mostly though).
> > 
> > I don't see how.  It's one change.  And it's a replacement for currently-active
> > functionality so even if it could be broken into pieces, it wouldn't "drop in"
> > (or revert cleanly) as needed.
> 
> I'm not the main reviewer so won't force it but the way we usually do this in
> base/task_scheduler land et al. is build individual components on their own w/
> unit tests that aren't yet attached to the system (e.g. StackSamplingProfiler
> class could be one) instead of having them hidden in anonymous namespace and
> tested by integration. This makes testing easier (focused unit tests), eventual
> refactoring easier (e.g. sequenced_worker_pool.cc is a mumbo-jumbo mess of
> anonymous classes and is hard to test and refactor because of that -- we avoided
> that in base/task_scheduler, base::internal:: namespace is used instead to
> depict impl-only boundary), and incremental CLs.

I'm fully on board with building incrementally, but unfortunately that was for
the most part not a viable option with this change. :( The difficulty here is
that the vast majority of the complexity is in the interrelationships of the
thread and collection lifetimes along with the tasks implementing them. Even
now, I don't see a way that these could have been meaningfully separated while
preserving the essential complexity of the problem.

That said, there are opportunities to enforce a more explicit decoupling of some
pieces of this implementation. Moving SamplingThread to its own file would be
beneficial. Also: factoring out the collection sampling state management from
SamplingThread, and decoupling SamplingThread from the platform
NativeStackSampler implementations by using a mock in tests.

But this review has gone on long enough as it is, and those things can be
addressed independently.

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.h (right):

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.h:197: static void DisableIdleShutdown();
On 2017/04/06 19:31:28, gab (behind) wrote:
> On 2017/04/06 18:40:18, bcwhite wrote:
> > On 2017/04/05 20:38:42, gab wrote:
> > > Since you should support having multiple pending delayed shutdown tasks in your
> > > queue (I don't see anything that prevents that from happening),
> > 
> > They can and it's handled.
> > 
> > 
> > > why bother
> > > disable them? Your tests should complete much before they fire for real anyways
> > > so it shouldn't be a source of flakiness, disabling them merely brings you
> > > further from testing your real product code.
> > 
> > Just for guaranteed operation.  Right now the idle shutdown time is 1 minute but
> > it's an internal thing.  If it was changed to 1s we wouldn't want the tests to
> > become flaky.
> 
> Hmm okay, but you also by doing so don't test calling ShutdownTask with other
> pending ShutdownTasks.
> 
> Up to you and wittman@ to decide which you prefer, just highlighting this.

Yes, the tests don't explicitly generate multiple ShutdownTasks, but we
extensively considered this behavior in the review and I'm satisfied that all
the code paths exercised in this case are adequately tested.
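
The thread does not spell out how overlapping ShutdownTasks are handled. One common way to make stale delayed tasks harmless is to have each task carry a counter value captured when it was scheduled and do nothing if the counter has moved on since. The sketch below illustrates that idea only; IdleShutdownModel and all of its members are hypothetical, and this is not claimed to be the mechanism the CL uses.

```cpp
#include <cassert>

// Toy model of an idle-shutdown scheme. All names are hypothetical.
class IdleShutdownModel {
 public:
  // New work arrives: the thread is (re)started and the counter advances.
  void AddWork() {
    running_ = true;
    ++add_events_;
  }

  // Value a delayed shutdown task would capture when it is scheduled.
  int add_events() const { return add_events_; }

  // Body of a delayed shutdown task. It only acts if no work has been added
  // since it was scheduled; otherwise it is stale and does nothing.
  void ShutdownTask(int add_events_when_scheduled) {
    if (add_events_when_scheduled != add_events_)
      return;
    running_ = false;
  }

  bool running() const { return running_; }

 private:
  int add_events_ = 0;
  bool running_ = false;
};
```

Under this scheme any number of shutdown tasks can be pending at once; only the one scheduled after the most recent work can actually stop the thread.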

bcwhite

> Since this is a pretty major change, we should land it after next week's ...

3 years, 8 months ago (2017-04-07 15:41:55 UTC) #392

bcwhite

The patchset sent to the CQ was uploaded after l-g-t-m from brettw@chromium.org Link to the ...

3 years, 8 months ago (2017-04-19 11:46:04 UTC) #394

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/1190001

3 years, 8 months ago (2017-04-19 11:46:28 UTC) #395

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 8 months ago (2017-04-19 11:49:46 UTC) #396

commit-bot: I haz the power

Try jobs failed on following builders: chromium_presubmit on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/chromium_presubmit/builds/414999) ios-device on master.tryserver.chromium.mac (JOB_FAILED, ...

3 years, 8 months ago (2017-04-19 11:49:49 UTC) #397

bcwhite

The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run

3 years, 8 months ago (2017-04-19 13:59:37 UTC) #398

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/1210001

3 years, 8 months ago (2017-04-19 14:00:05 UTC) #399

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 8 months ago (2017-04-19 15:20:05 UTC) #400

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

3 years, 8 months ago (2017-04-19 15:20:07 UTC) #401

bcwhite

The patchset sent to the CQ was uploaded after l-g-t-m from wittman@chromium.org, gab@chromium.org, brettw@chromium.org Link ...

3 years, 8 months ago (2017-04-19 15:25:45 UTC) #403

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2554123002/1210001

3 years, 8 months ago (2017-04-19 15:26:13 UTC) #404

commit-bot: I haz the power

CQ is committing da patch. Bot data: {"patchset_id": 1210001, "attempt_start_ts": 1492615544585300, "parent_rev": "431dd44543668f59e341aaf350f1370690ee9b35", "commit_rev": "69e964496800e75cb0e3cdd974436659bd24e9cf"}

3 years, 8 months ago (2017-04-19 15:30:31 UTC) #405

commit-bot: I haz the power

3 years, 8 months ago (2017-04-19 15:30:46 UTC) #406

Message was sent while issue was closed.

Description was changed from

==========
Support parallel captures from the StackSamplingProfiler.

Previously, only one sampling operation could be running and it was
generally used to profile the startup of the browser.  To make it more
useful, it can now run against any thread and multiple profilers can
execute in parallel.

Sampling will continue until the desired number of samples has been
collected, it is manually stopped, or the controlling object gets
destructed.

The SamplingThread is a singleton base::Thread that is self-managing.
- It is started (via GetOrCreateTaskRunnerForAdd) on the calling
  thread when work arrives.
- It stops (via ShutdownTask) on its own thread when it has been
  idle for 1 minute.
- DetachFromSequence is called after both of these to allow for
  accessing the API from different threads.
- thread_execution_state_lock_ is held when doing Thread API calls to
  ensure that access is sequenced.

The sampled thread is expected to live at least as long as the
thread controlling the sampling.

BUG=671716
==========

to

==========
Support parallel captures from the StackSamplingProfiler.

Previously, only one sampling operation could be running and it was
generally used to profile the startup of the browser.  To make it more
useful, it can now run against any thread and multiple profilers can
execute in parallel.

Sampling will continue until the desired number of samples has been
collected, it is manually stopped, or the controlling object gets
destructed.

The SamplingThread is a singleton base::Thread that is self-managing.
- It is started (via GetOrCreateTaskRunnerForAdd) on the calling
  thread when work arrives.
- It stops (via ShutdownTask) on its own thread when it has been
  idle for 1 minute.
- DetachFromSequence is called after both of these to allow for
  accessing the API from different threads.
- thread_execution_state_lock_ is held when doing Thread API calls to
  ensure that access is sequenced.

The sampled thread is expected to live at least as long as the
thread controlling the sampling.

BUG=671716

Review-Url: https://codereview.chromium.org/2554123002
Cr-Commit-Position: refs/heads/master@{#465614}
Committed:
https://chromium.googlesource.com/chromium/src/+/69e964496800e75cb0e3cdd97443...
==========

commit-bot: I haz the power

Committed patchset #45 (id:1210001) as https://chromium.googlesource.com/chromium/src/+/69e964496800e75cb0e3cdd974436659bd24e9cf

3 years, 8 months ago (2017-04-19 15:30:50 UTC) #407

bcwhite

3 years, 8 months ago (2017-04-19 18:39:06 UTC) #408

Message was sent while issue was closed.

Description was changed to

==========
Support parallel captures from the StackSamplingProfiler.

Previously, only one sampling operation could be running and it was
generally used to profile the startup of the browser.  To make it more
useful, it can now run against any thread and multiple profilers can
execute in parallel.

Sampling will continue until the desired number of samples has been
collected, it is manually stopped, or the controlling object gets
destructed.

The SamplingThread is a singleton base::Thread that is self-managing.
- It is started (via GetOrCreateTaskRunnerForAdd) on the calling
  thread when work arrives.
- It stops (via ShutdownTask) on its own thread when it has been
  idle for 1 minute.
- DetachFromSequence is called after both of these to allow for
  accessing the API from different threads.
- thread_execution_state_lock_ is held when doing Thread API calls to
  ensure that access is sequenced.

The sampled thread is expected to live at least as long as the
thread controlling the sampling.

SHERIFFS: Don't hesitate to roll this back if it correlates well with some kind
of instability. Sampling has been known to have odd effects in the past and this
rewrites a large part of it.

BUG=671716

Review-Url: https://codereview.chromium.org/2554123002
Cr-Commit-Position: refs/heads/master@{#465614}
Committed:
https://chromium.googlesource.com/chromium/src/+/69e964496800e75cb0e3cdd97443...
==========

bcwhite

3 years, 8 months ago (2017-04-19 18:39:43 UTC) #409

Message was sent while issue was closed.

Description was changed to

==========
Support parallel captures from the StackSamplingProfiler.

Previously, only one sampling operation could be running and it was
generally used to profile the startup of the browser.  To make it more
useful, it can now run against any thread and multiple profilers can
execute in parallel.

Sampling will continue until the desired number of samples has been
collected, it is manually stopped, or the controlling object gets
destructed.

The SamplingThread is a singleton base::Thread that is self-managing.
- It is started (via GetOrCreateTaskRunnerForAdd) on the calling
  thread when work arrives.
- It stops (via ShutdownTask) on its own thread when it has been
  idle for 1 minute.
- DetachFromSequence is called after both of these to allow for
  accessing the API from different threads.
- thread_execution_state_lock_ is held when doing Thread API calls to
  ensure that access is sequenced.

The sampled thread is expected to live at least as long as the
thread controlling the sampling.

SHERIFFS: Don't hesitate to roll this back if it correlates well with
some kind of instability. Sampling has been known to have odd effects
in the past and this rewrites a large part of it.

BUG=671716

Review-Url: https://codereview.chromium.org/2554123002
Cr-Commit-Position: refs/heads/master@{#465614}
Committed:
https://chromium.googlesource.com/chromium/src/+/69e964496800e75cb0e3cdd97443...
==========

lijeffrey

On 2017/04/19 15:30:50, commit-bot: I haz the power wrote: > Committed patchset #45 (id:1210001) as ...

3 years, 8 months ago (2017-04-26 23:52:53 UTC) #410

lijeffrey

On 2017/04/26 23:52:53, lijeffrey wrote: > On 2017/04/19 15:30:50, commit-bot: I haz the power wrote: ...

3 years, 8 months ago (2017-04-26 23:54:12 UTC) #411

Mike Wittman

3 years, 8 months ago (2017-04-27 00:34:54 UTC) #412

Message was sent while issue was closed.

Can you provide a link to a build where this test fails? It's passing in
build 39545 linked in the analysis.

In any case, this change is unlikely to be the cause of flakiness on Mac
because the modified functionality is only enabled for 64-bit Windows.

On Wed, Apr 26, 2017 at 4:54 PM, <lijeffrey@chromium.org> wrote:

> On 2017/04/26 23:52:53, lijeffrey wrote:
> > On 2017/04/19 15:30:50, commit-bot: I haz the power wrote:
> > > Committed patchset #45 (id:1210001) as
> > >
> >
> https://chromium.googlesource.com/chromium/src/+/
> 69e964496800e75cb0e3cdd974436659bd24e9cf
> >
> > Hey guys, Findit's analysis for a flaky test
> >
> "BrowserCloseManagerBrowserTest/BrowserCloseManagerBrowserTest
> .AddBeforeUnloadDuringClosing/0"
> > suggests this as the culprit according to analysis
> >
> ag9zfmZpbmRpdC1mb3ItbWVy6AELEhdNYXN0ZXJGbGFrZUFuYWx5c2lzUm9v
> dCKxAWNocm9taXVtLm1hYy9NYWMxMC45IFRlc3RzIChkYmcpLzM5NTUwL2Jy
> b3dzZXJfdGVzdHMvUW5KdmQzTmxja05zYjNObFRXRnVZV2RsY2tKeWIzZHpa
> WEpVWlhOMEwwSnliM2R6WlhKRGJHOXpaVTFoYm1GblpYSkNjbTkzYzJWeVZH
> VnpkQzVCWkdSQ1pXWnZjbVZWYm14dllXUkVkWEpwYm1kRGJHOXphVzVuTHpB
> PQwLEhNNYXN0ZXJGbGFrZUFuYWx5c2lzGAEM,
> > can someone please help verify?
> >
> > Thanks,
> > Jeff on behalf of Findit team
>
> Oops sorry here's the full link to the analysis:
>
> https://findit-for-me.appspot.com/waterfall/flake?key=
> ag9zfmZpbmRpdC1mb3ItbWVy6AELEhdNYXN0ZXJGbGFrZUFuYWx5c2lzUm9v
> dCKxAWNocm9taXVtLm1hYy9NYWMxMC45IFRlc3RzIChkYmcpLzM5NTUwL2Jy
> b3dzZXJfdGVzdHMvUW5KdmQzTmxja05zYjNObFRXRnVZV2RsY2tKeWIzZHpa
> WEpVWlhOMEwwSnliM2R6WlhKRGJHOXpaVTFoYm1GblpYSkNjbTkzYzJWeVZH
> VnpkQzVCWkdSQ1pXWnZjbVZWYm14dllXUkVkWEpwYm1kRGJHOXphVzVuTHpB
> PQwLEhNNYXN0ZXJGbGFrZUFuYWx5c2lzGAEM
>
> https://codereview.chromium.org/2554123002/
>

lijeffrey

3 years, 7 months ago (2017-04-27 11:16:17 UTC) #413

Message was sent while issue was closed.

On 2017/04/27 00:34:54, Mike Wittman wrote:
> Can you provide a link to a build where this test fails? It's passing in
> build 39545 linked in the analysis.
> 
> In any case, this change is unlikely to be the cause of flakiness on Mac
> because the modified functionality is only enabled for 64-bit Windows.
> 
> On Wed, Apr 26, 2017 at 4:54 PM, <lijeffrey@chromium.org> wrote:
> 
> > On 2017/04/26 23:52:53, lijeffrey wrote:
> > > On 2017/04/19 15:30:50, commit-bot: I haz the power wrote:
> > > > Committed patchset #45 (id:1210001) as
> > > >
> > >
> > https://chromium.googlesource.com/chromium/src/+/
> > 69e964496800e75cb0e3cdd974436659bd24e9cf
> > >
> > > Hey guys, Findit's analysis for a flaky test
> > >
> > "BrowserCloseManagerBrowserTest/BrowserCloseManagerBrowserTest
> > .AddBeforeUnloadDuringClosing/0"
> > > suggests this as the culprit according to analysis
> > >
> > ag9zfmZpbmRpdC1mb3ItbWVy6AELEhdNYXN0ZXJGbGFrZUFuYWx5c2lzUm9v
> > dCKxAWNocm9taXVtLm1hYy9NYWMxMC45IFRlc3RzIChkYmcpLzM5NTUwL2Jy
> > b3dzZXJfdGVzdHMvUW5KdmQzTmxja05zYjNObFRXRnVZV2RsY2tKeWIzZHpa
> > WEpVWlhOMEwwSnliM2R6WlhKRGJHOXpaVTFoYm1GblpYSkNjbTkzYzJWeVZH
> > VnpkQzVCWkdSQ1pXWnZjbVZWYm14dllXUkVkWEpwYm1kRGJHOXphVzVuTHpB
> > PQwLEhNNYXN0ZXJGbGFrZUFuYWx5c2lzGAEM,
> > > can someone please help verify?
> > >
> > > Thanks,
> > > Jeff on behalf of Findit team
> >
> > Oops sorry here's the full link to the analysis:
> >
> > https://findit-for-me.appspot.com/waterfall/flake?key=
> > ag9zfmZpbmRpdC1mb3ItbWVy6AELEhdNYXN0ZXJGbGFrZUFuYWx5c2lzUm9v
> > dCKxAWNocm9taXVtLm1hYy9NYWMxMC45IFRlc3RzIChkYmcpLzM5NTUwL2Jy
> > b3dzZXJfdGVzdHMvUW5KdmQzTmxja05zYjNObFRXRnVZV2RsY2tKeWIzZHpa
> > WEpVWlhOMEwwSnliM2R6WlhKRGJHOXpaVTFoYm1GblpYSkNjbTkzYzJWeVZH
> > VnpkQzVCWkdSQ1pXWnZjbVZWYm14dllXUkVkWEpwYm1kRGJHOXphVzVuTHpB
> > PQwLEhNNYXN0ZXJGbGFrZUFuYWx5c2lzGAEM
> >
> > https://codereview.chromium.org/2554123002/
> >
> 

Thanks for the reply!
https://luci-milo.appspot.com/buildbot/chromium.mac/Mac10.9%20Tests%20%28dbg%...
and
https://luci-milo.appspot.com/buildbot/chromium.mac/Mac10.9%20Tests%20%28dbg%...
both fail for the same test which appears to have started flaking after this CL
landed. If it's a false positive please let us know so we can improve the flake
analyzer! :)

Mike Wittman

3 years, 7 months ago (2017-04-27 16:51:38 UTC) #414

Message was sent while issue was closed.

This is a false positive. The failure appears to be due to some destruction
ordering error in Mac UI, which is entirely unrelated to the CL at hand.

On Thu, Apr 27, 2017 at 4:16 AM, <lijeffrey@chromium.org> wrote:

> On 2017/04/27 00:34:54, Mike Wittman wrote:
> > Can you provide a link to a build where this test fails? It's passing in
> > build 39545 linked in the analysis.
> >
> > In any case, this change is unlikely to be the cause of flakiness on Mac
> > because the modified functionality is only enabled for 64-bit Windows.
> >
> > On Wed, Apr 26, 2017 at 4:54 PM, <lijeffrey@chromium.org> wrote:
> >
> > > On 2017/04/26 23:52:53, lijeffrey wrote:
> > > > On 2017/04/19 15:30:50, commit-bot: I haz the power wrote:
> > > > > Committed patchset #45 (id:1210001) as
> > > > > https://chromium.googlesource.com/chromium/src/+/69e964496800e75cb0e3cdd974436659bd24e9cf
> > > >
> > > > Hey guys, Findit's analysis for a flaky test
> > > > "BrowserCloseManagerBrowserTest/BrowserCloseManagerBrowserTest.AddBeforeUnloadDuringClosing/0"
> > > > suggests this as the culprit according to analysis
> > > >
> > > ag9zfmZpbmRpdC1mb3ItbWVy6AELEhdNYXN0ZXJGbGFrZUFuYWx5c2lzUm9v
> > > dCKxAWNocm9taXVtLm1hYy9NYWMxMC45IFRlc3RzIChkYmcpLzM5NTUwL2Jy
> > > b3dzZXJfdGVzdHMvUW5KdmQzTmxja05zYjNObFRXRnVZV2RsY2tKeWIzZHpa
> > > WEpVWlhOMEwwSnliM2R6WlhKRGJHOXpaVTFoYm1GblpYSkNjbTkzYzJWeVZH
> > > VnpkQzVCWkdSQ1pXWnZjbVZWYm14dllXUkVkWEpwYm1kRGJHOXphVzVuTHpB
> > > PQwLEhNNYXN0ZXJGbGFrZUFuYWx5c2lzGAEM,
> > > > can someone please help verify?
> > > >
> > > > Thanks,
> > > > Jeff on behalf of Findit team
> > >
> > > Oops sorry here's the full link to the analysis:
> > >
> > > https://findit-for-me.appspot.com/waterfall/flake?key=
> > > ag9zfmZpbmRpdC1mb3ItbWVy6AELEhdNYXN0ZXJGbGFrZUFuYWx5c2lzUm9v
> > > dCKxAWNocm9taXVtLm1hYy9NYWMxMC45IFRlc3RzIChkYmcpLzM5NTUwL2Jy
> > > b3dzZXJfdGVzdHMvUW5KdmQzTmxja05zYjNObFRXRnVZV2RsY2tKeWIzZHpa
> > > WEpVWlhOMEwwSnliM2R6WlhKRGJHOXpaVTFoYm1GblpYSkNjbTkzYzJWeVZH
> > > VnpkQzVCWkdSQ1pXWnZjbVZWYm14dllXUkVkWEpwYm1kRGJHOXphVzVuTHpB
> > > PQwLEhNNYXN0ZXJGbGFrZUFuYWx5c2lzGAEM
> > >
> > > https://codereview.chromium.org/2554123002/
> > >
> >
>
> Thanks for the reply!
> https://luci-milo.appspot.com/buildbot/chromium.mac/Mac10.9%20Tests%20%28dbg%29/39548
> and
> https://luci-milo.appspot.com/buildbot/chromium.mac/Mac10.9%20Tests%20%28dbg%29/39550
> both fail for the same test which appears to have started flaking after this CL
> landed. If it's a false positive please let us know so we can improve the flake
> analyzer! :)
>
> https://codereview.chromium.org/2554123002/
>


Issue 2554123002: Support parallel captures from the StackSamplingProfiler. (Closed)