Description
Support parallel captures from the StackSamplingProfiler.

Previously, only one sampling operation could be running and it was
generally used to profile the startup of the browser. To make it more
useful, it can now run against any thread and multiple profilers can
execute in parallel.

Sampling will continue until the desired number of samples has been
collected, it is manually stopped, or the controlling object gets
destructed.

The SamplingThread is a singleton base::Thread that is self-managing.
- It is started (via GetOrCreateTaskRunnerForAdd) on the calling
  thread when work arrives.
- It stops (via ShutdownTask) on its own thread when it has been
  idle for 1 minute.
- DetachFromSequence is called after both of these to allow for
  accessing the API from different threads.
- thread_execution_state_lock_ is held when doing Thread API calls to
  ensure that access is sequenced.

The sampled thread is expected to live at least as long as the
thread controlling the sampling.

SHERIFFS: Don't hesitate to roll this back if it correlates well with
some kind of instability. Sampling has been known to have odd effects
in the past and this rewrites a large part of it.

BUG=671716
Review-Url: https://codereview.chromium.org/2554123002
Cr-Commit-Position: refs/heads/master@{#465614}
Committed: https://chromium.googlesource.com/chromium/src/+/69e964496800e75cb0e3cdd974436659bd24e9cf

Patch Set 1
Total comments: 13
Patch Set 2 : move to Thread and MessageLoop
Total comments: 31
Patch Set 3 : switched to task-runner and address review comments
Total comments: 23
Patch Set 4 : rebased
Patch Set 5 : addressed review comments by wittman
Patch Set 6 : pass ID instead of pointers
Total comments: 2
Patch Set 7 : addressed review comments by wittman
Patch Set 8 : rebased
Patch Set 9 : use helper methods for getting task runner
Total comments: 8
Patch Set 10 : some minor cleanup
Patch Set 11 : rebased
Patch Set 12 : working shutdown, both idle and forced, with tests
Patch Set 13 : support for death of thread-under-test
Total comments: 22
Patch Set 14 : improved detection of thread state
Patch Set 15 : use events to track thread lifetime in tests
Patch Set 16 : add check that thread under test doesn't change
Patch Set 17 : removed unnecessary thread-id check and fix test appropriately
Patch Set 18 : added prevention of using profiler on non-windows platforms
Patch Set 19 : merged synchronized-stop CL
Total comments: 22
Patch Set 20 : addressed review comments by wittman
Patch Set 21 : fixed typo
Total comments: 8
Patch Set 22 : remove shutdown(); comment improvements
Patch Set 23 : fix deadlock problem with GetTaskRunner(); fix layout of thread_restrictions.h
Total comments: 46
Patch Set 24 : addressed review comments by wittman
Total comments: 11
Patch Set 25 : addressed review comments by wittman
Total comments: 6
Patch Set 26 : addressed review comments by wittman
Patch Set 27 : addressed review comments by wittman
Total comments: 20
Patch Set 28 : addressed review comments by wittman
Total comments: 33
Patch Set 29 : switch to separate thread-state variable
Total comments: 16
Patch Set 30 : addressed review comments by wittman
Patch Set 31 : addressed review comments by wittman
Total comments: 49
Patch Set 32 : rebased
Patch Set 33 : addressed review comments by wittman
Total comments: 26
Patch Set 34 : addressed review comments by wittman
Total comments: 28
Patch Set 35 : addressed review comments by wittman
Total comments: 4
Patch Set 36 : addressed review comments by wittman
Total comments: 11
Patch Set 37 : more tests; improved tests
Total comments: 61
Patch Set 38 : more test improvements
Total comments: 5
Patch Set 39 : addressed review comments by wittman
Patch Set 40 : fixed signed/unsigned comparison in test
Total comments: 10
Patch Set 41 : addressed review comments by wittman
Total comments: 25
Patch Set 42 : addressed review comments by wittman
Total comments: 47
Patch Set 43 : addressed review comments by wittman
Total comments: 2
Patch Set 44 : addressed review comments by gab
Patch Set 45 : rebased
Messages
Total messages: 414 (252 generated)
wittman@chromium.org changed reviewers: + wittman@chromium.org
Haven't reviewed the logic in great detail but this approach looks reasonable to me, with two high-level comments:

The standard mechanism for inter-thread communication in Chrome is via PostTask to a task runner/message loop. We should support this method for requesting captures instead of asynchronous state access guarded by a lock, unless there are profiler-specific reasons that prevent us from doing so. (It's entirely possible there could be, but I'm not aware of any blockers to this at the moment.) This probably would simplify some of the logic as a side effect.

I believe you'll need to join the profiler thread in the main thread before exiting Chrome, to ensure clean shutdown in tests and during normal execution. I initially implemented the profiler without the join and there were failures at shutdown due to walking stacks of threads that had already been destroyed.

https://codereview.chromium.org/2554123002/diff/1/base/profiler/stack_samplin...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/1/base/profiler/stack_samplin...
base/profiler/stack_sampling_profiler.cc:223: static subtle::AtomicWord next_capture_id_;
We probably can avoid need for a thread-safe id by identifying the ActiveCapture by its address (e.g. as an opaque void*).

https://codereview.chromium.org/2554123002/diff/1/base/profiler/stack_samplin...
base/profiler/stack_sampling_profiler.cc:348: capture->native_sampler->ProfileRecordingStarting(&profile.modules);
The matching call to ProfileRecordingStopped has been dropped with these changes.

https://codereview.chromium.org/2554123002/diff/1/base/profiler/stack_samplin...
base/profiler/stack_sampling_profiler.cc:379: wait = TimeDelta::FromDays(365); // A long, long time.
There's a general desire to have as few persistent threads as possible in Chrome, so we probably should have the sampling thread terminate after a period of inactivity, and restart on demand.

https://codereview.chromium.org/2554123002/diff/1/base/profiler/stack_samplin...
base/profiler/stack_sampling_profiler.cc:441: active_captures_.push_back(std::move(capture_ptr));
push_heap?

https://codereview.chromium.org/2554123002/diff/1/base/profiler/stack_samplin...
base/profiler/stack_sampling_profiler.cc:494: NativeStackSampler::Create(thread_id_, &RecordAnnotations,
We'll need to refactor to use a single stack copy buffer across all NativeStackSamplers, as the buffer is fairly large.
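For readers unfamiliar with the "push_heap?" remark: the comment itself gives no detail, so the following is only a guess at the intent, namely keeping the pending captures in a heap ordered by when each one is next due for a sample. `PendingCapture`, `LaterFirst`, `AddCapture`, and `PopEarliest` are hypothetical names for this sketch and are not part of the CL.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical record: which capture, and when its next sample is due.
struct PendingCapture {
  int id;
  int next_sample_ms;
};

// Comparator that makes the std heap functions keep the capture with the
// smallest next_sample_ms at the front of the vector.
struct LaterFirst {
  bool operator()(const PendingCapture& a, const PendingCapture& b) const {
    return a.next_sample_ms > b.next_sample_ms;
  }
};

// push_back followed by push_heap restores the heap property in O(log n).
void AddCapture(std::vector<PendingCapture>* heap, PendingCapture capture) {
  heap->push_back(capture);
  std::push_heap(heap->begin(), heap->end(), LaterFirst());
}

// pop_heap moves the earliest-due capture to the back, where it is removed.
PendingCapture PopEarliest(std::vector<PendingCapture>* heap) {
  assert(!heap->empty());
  std::pop_heap(heap->begin(), heap->end(), LaterFirst());
  PendingCapture earliest = heap->back();
  heap->pop_back();
  return earliest;
}
```

With this layout the sampling loop would only ever need to look at the front element to know how long to wait, at the cost of re-inserting a capture after each sample.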
On 2016/12/06 21:04:58, Mike Wittman wrote: > I believe you'll need to join the profiler thread in the main thread before > exiting Chrome, to ensure clean shutdown in tests and during normal execution. I > initially implemented the profiler without the join and there were failures at > shutdown due to walking stacks of threads that had already been destroyed. It may be necessary to do some kind of synchronization on non-main threads too, to prevent walk-after-destroy issues there as well. We should add tests for this scenario if possible.
> The standard mechanism for inter-thread communication in
> Chrome is via PostTask to a task runner/message loop.

I considered the message-loop but it didn't appear to offer acceptable timing guarantees. The code I saw was only "run until idle" which meant no timing support for exiting the loop to perform a sampling operation. At best, I'd still need the capture_change_ WaitableEvent but use it to signal when to run the message-loop instead of accessing shared data structures directly. I can do that but it seems significantly more complicated, larger in code, and slower to execute (though those last two aren't really significant).

> I believe you'll need to join the profiler thread in
> the main thread before exiting Chrome.

Will do.

https://codereview.chromium.org/2554123002/diff/1/base/profiler/stack_samplin...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/1/base/profiler/stack_samplin...
base/profiler/stack_sampling_profiler.cc:223: static subtle::AtomicWord next_capture_id_;
On 2016/12/06 21:04:58, Mike Wittman wrote:
> We probably can avoid need for a thread-safe id by identifying the ActiveCapture
> by its address (e.g. as an opaque void*).

My concern with that is that addresses may be reused. A capture could start and then complete, getting freed. A new capture could start and reuse the same address, reasonably likely given that the allocation is the exact same number of bytes as the free'd block. Then a stop-request for the first one could be made and cause the new one to stop.

The incrementing integer will also repeat but not for a long, long time.
On 2016/12/07 15:15:30, bcwhite wrote: > > The standard mechanism for inter-thread communication in > > Chrome is via PostTask to a task runner/message loop. > > I considered the message-loop but it didn't appear to > offer acceptable timing guarantees. The code I saw was > only "run until idle" which meant no timing support > for exiting the loop to perform a sampling operation. Can we use PostDelayedTask on the message loop's task runner for this? https://codereview.chromium.org/2554123002/diff/1/base/profiler/stack_samplin... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/1/base/profiler/stack_samplin... base/profiler/stack_sampling_profiler.cc:223: static subtle::AtomicWord next_capture_id_; On 2016/12/07 15:15:30, bcwhite wrote: > On 2016/12/06 21:04:58, Mike Wittman wrote: > > We probably can avoid need for a thread-safe id by identifying the > ActiveCapture > > by its address (e.g. as an opaque void*). > > My concern with that is that addresses may be reused. A capture could start and > then complete, getting freed. A new capture could start and reuse the same > address, reasonably likely given that the allocation is the exact same number of > bytes as the free'd block. Then a stop-request for the first one could be made > and cause the new one to stop. > > The incrementing integer will also repeat but not for a long, long time. Yes, care would need to be taken to ensure the StackSamplingProfiler doesn't retain the address beyond when the object is deleted. This may be feasible, depending on the synchronization we ultimately have in place with the threads owning the StackSamplingProfilers. Let's reconsider once we have something closer to a final implementation.
https://codereview.chromium.org/2554123002/diff/1/base/profiler/stack_samplin... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/1/base/profiler/stack_samplin... base/profiler/stack_sampling_profiler.cc:223: static subtle::AtomicWord next_capture_id_; On 2016/12/07 16:25:02, Mike Wittman wrote: > On 2016/12/07 15:15:30, bcwhite wrote: > > On 2016/12/06 21:04:58, Mike Wittman wrote: > > > We probably can avoid need for a thread-safe id by identifying the > > ActiveCapture > > > by its address (e.g. as an opaque void*). > > > > My concern with that is that addresses may be reused. A capture could start > and > > then complete, getting freed. A new capture could start and reuse the same > > address, reasonably likely given that the allocation is the exact same number > of > > bytes as the free'd block. Then a stop-request for the first one could be > made > > and cause the new one to stop. > > > > The incrementing integer will also repeat but not for a long, long time. > > Yes, care would need to be taken to ensure the StackSamplingProfiler doesn't > retain the address beyond when the object is deleted. This may be feasible, > depending on the synchronization we ultimately have in place with the threads > owning the StackSamplingProfilers. Let's reconsider once we have something > closer to a final implementation. Also, the current implementation can use base::StaticAtomicSequenceNumber.
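To make the id-versus-address trade-off in this exchange concrete, here is a minimal standalone sketch. It uses `std::atomic<int>` as a stand-in for the sequence-number helper mentioned above; `ActiveCapture` and `CaptureRegistry` are hypothetical simplifications, not the types in the CL. Only id generation is atomic here; the map is assumed to be touched from a single thread.

```cpp
#include <atomic>
#include <cassert>
#include <map>
#include <memory>
#include <utility>

// Hypothetical stand-in for the per-capture state.
struct ActiveCapture {
  int samples_remaining = 0;
};

// Hypothetical registry that identifies captures by an increasing integer
// rather than by address, so a late Stop() for a finished capture cannot hit
// a newer capture that happens to reuse the same memory.
class CaptureRegistry {
 public:
  // fetch_add is atomic, so ids stay unique even if several threads call Add
  // to obtain ids (the map itself is NOT thread-safe in this sketch).
  int Add(int samples) {
    const int id = next_id_.fetch_add(1, std::memory_order_relaxed);
    assert(captures_.count(id) == 0);
    auto capture = std::make_unique<ActiveCapture>();
    capture->samples_remaining = samples;
    captures_[id] = std::move(capture);
    return id;
  }

  // Returns true only if a capture with this id was still active.
  bool Stop(int id) { return captures_.erase(id) == 1; }

  bool IsActive(int id) const { return captures_.count(id) != 0; }

 private:
  std::atomic<int> next_id_{1};
  std::map<int, std::unique_ptr<ActiveCapture>> captures_;
};
```

A stale stop request simply finds no entry for its id, whereas a stale pointer could compare equal to a freshly allocated capture.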
> > > The standard mechanism for inter-thread communication in > > > Chrome is via PostTask to a task runner/message loop. > > > > I considered the message-loop but it didn't appear to > > offer acceptable timing guarantees. The code I saw was > > only "run until idle" which meant no timing support > > for exiting the loop to perform a sampling operation. > > Can we use PostDelayedTask on the message loop's task runner for this? It's possible but the precision would be poor and I think timing accuracy is more important than convenience in this case... not that I find the message loop to be very convenient for this use.
On 2016/12/07 17:48:21, bcwhite wrote: > > > > The standard mechanism for inter-thread communication in > > > > Chrome is via PostTask to a task runner/message loop. > > > > > > I considered the message-loop but it didn't appear to > > > offer acceptable timing guarantees. The code I saw was > > > only "run until idle" which meant no timing support > > > for exiting the loop to perform a sampling operation. > > > > Can we use PostDelayedTask on the message loop's task runner for this? > > It's possible but the precision would be poor and I think timing accuracy is > more important than convenience in this case... not that I find the message > loop to be very convenient for this use. Why do you say the precision would be poorer? I believe the waiting in both cases operates at system timer tick resolution: WaitableEvent via SleepConditionVariableSRW, and e.g. MessagePumpForUI via MsgWaitForMultipleObjectsEx. (https://randomascii.wordpress.com/2013/04/02/sleep-variation-investigated has analysis of the sleep resolution, and https://msdn.microsoft.com/en-us/library/ms687069(VS.85).aspx documents the MsgWaitForMultipleObjectsEx resolution.)
On 2016/12/07 18:58:21, Mike Wittman wrote: > On 2016/12/07 17:48:21, bcwhite wrote: > > > > > The standard mechanism for inter-thread communication in > > > > > Chrome is via PostTask to a task runner/message loop. > > > > > > > > I considered the message-loop but it didn't appear to > > > > offer acceptable timing guarantees. The code I saw was > > > > only "run until idle" which meant no timing support > > > > for exiting the loop to perform a sampling operation. > > > > > > Can we use PostDelayedTask on the message loop's task runner for this? > > > > It's possible but the precision would be poor and I think timing accuracy is > > more important than convenience in this case... not that I find the message > > loop to be very convenient for this use. > > Why do you say the precision would be poorer? I believe the waiting in both > cases operates at system timer tick resolution: WaitableEvent via > SleepConditionVariableSRW, and e.g. MessagePumpForUI via > MsgWaitForMultipleObjectsEx. > (https://randomascii.wordpress.com/2013/04/02/sleep-variation-investigated has > analysis of the sleep resolution, and > https://msdn.microsoft.com/en-us/library/ms687069(VS.85).aspx documents the > MsgWaitForMultipleObjectsEx resolution.) It's not the OS call but the code of the loop. There's more overhead in the general-purpose class and while it may wait with the same resolution, it may do other things, too. And the code could always change outside of our control violating our assumptions. Is there a way to cancel a delayed task? If not then there will be some added complexity so that it can be stopped immediately but not fail when the delayed task gets executed. On the other hand, Thread supports restarting of the thread without having to completely recreate the object. That means no home-grown Singleton-with-delete around SimpleThread and no need to go to a lower-level PlatformThread. That's a plus. 
https://cs.chromium.org/chromium/src/testing/gtest/include/gtest/gtest.h?l=446 Need to sleep on it. :-)
On 2016/12/07 19:54:24, bcwhite wrote: > On 2016/12/07 18:58:21, Mike Wittman wrote: > > On 2016/12/07 17:48:21, bcwhite wrote: > > > > > > The standard mechanism for inter-thread communication in > > > > > > Chrome is via PostTask to a task runner/message loop. > > > > > > > > > > I considered the message-loop but it didn't appear to > > > > > offer acceptable timing guarantees. The code I saw was > > > > > only "run until idle" which meant no timing support > > > > > for exiting the loop to perform a sampling operation. > > > > > > > > Can we use PostDelayedTask on the message loop's task runner for this? > > > > > > It's possible but the precision would be poor and I think timing accuracy is > > > more important than convenience in this case... not that I find the message > > > loop to be very convenient for this use. > > > > Why do you say the precision would be poorer? I believe the waiting in both > > cases operates at system timer tick resolution: WaitableEvent via > > SleepConditionVariableSRW, and e.g. MessagePumpForUI via > > MsgWaitForMultipleObjectsEx. > > (https://randomascii.wordpress.com/2013/04/02/sleep-variation-investigated has > > analysis of the sleep resolution, and > > https://msdn.microsoft.com/en-us/library/ms687069(VS.85).aspx documents the > > MsgWaitForMultipleObjectsEx resolution.) > > It's not the OS call but the code of the loop. There's more overhead in the > general-purpose class and while it may wait with the same resolution, it may do > other things, too. And the code could always change outside of our control > violating our assumptions. I'd be most concerned about e.g. Windows sending extraneous messages to the message loop and those delaying the processing. I don't know what the likelihood of that occurring is though. As far as implementation overhead goes, the GPU main thread uses a message loop and it presumably has more stringent performance requirements than the profiler, in order to maintain frame rate. 
> Is there a way to cancel a delayed task? If not then there will be some added > complexity so that it can be stopped immediately but not fail when the delayed > task gets executed. There's CancelableCallback, but it's not clear if its use of WeakPtrs would work within the constraints of the profiler. > On the other hand, Thread supports restarting of the thread without having to > completely recreate the object. That means no home-grown Singleton-with-delete > around SimpleThread and no need to go to a lower-level PlatformThread. That's a > plus. > https://cs.chromium.org/chromium/src/testing/gtest/include/gtest/gtest.h?l=446 > > Need to sleep on it. :-)
New patch set using a message loop! Still rough but tests pass... except for the one that checks that concurrent profiling isn't allowed. :-) There is code to deal with the thread exiting but actually having it exit will require a small addition to RunLoop. Specifically, I'll need to add a QuitWhenEmpty() to sit beside QuitWhenIdle(), the latter stopping when there are no immediate tasks to execute even if there are pending delayed tasks.
On 2016/12/09 17:58:23, bcwhite wrote: > New patch set using a message loop! Great! Made a pass through looking for thread safety and higher-level issues. > Still rough but tests pass... except for the one that checks that concurrent > profiling isn't allowed. :-) Good. Tests definitely will need to be extended to provide good test coverage for the concurrent case. > There is code to deal with the thread exiting but actually having it exit will > require a small addition to RunLoop. Specifically, I'll need to add a > QuitWhenEmpty() to sit beside QuitWhenIdle(), the latter stopping when there are > no immediate tasks to execute even if there are pending delayed tasks. I'm not sure this is necessary... see the final comment in the code. Beyond the comments below the other major issue I'm aware of is ensuring profiling doesn't occur after threads are destroyed. https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam... base/profiler/stack_sampling_profiler.cc:265: void PerformCapture(ActiveCapture* capture); The term "capture" is overloaded in the method names to mean both the recording of all the samples and the recording of a single sample. Can we use something like "record sample" for the latter case to be consistent with the NativeStackSampler? e.g. this becomes something like RecordSampleForCapture. https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam... base/profiler/stack_sampling_profiler.cc:311: scoped_refptr<SingleThreadTaskRunner> runner = task_runner(); According to the Thread documentation, task_runner() can only be safely called from the thread that invokes Start(). 
Thread's API is not thread-safe in general, so care should be taken to ensure that it's only used from the proper threads, including making liberal use of DCHECKs/ThreadChecker since it's non-trivial to validate from reading the code. The same goes for ensuring execution on proper threads in other functions in this class (and for documenting thread expectations to the reader). https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam... base/profiler/stack_sampling_profiler.cc:336: runner->PostTask(FROM_HERE, Bind(&SamplingThread::StartCaptureTask, It's common practice to implement thread hopping using just one function, checking whether the execution is on the desired thread at the start of the function, and if not, posting a task back to the same function on the desired thread. I think that could be done here by checking the thread id, and probably would make the code a little easier to follow. https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam... base/profiler/stack_sampling_profiler.cc:344: scoped_refptr<SingleThreadTaskRunner> runner = task_runner(); Same issue with task_runner() here. https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam... base/profiler/stack_sampling_profiler.cc:352: FROM_HERE, Bind(&SamplingThread::StopCaptureTask, Unretained(this), id)); Same comment here about thread hopping. https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam... base/profiler/stack_sampling_profiler.cc:429: where does the capture get erased from active_captures_? https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam... base/profiler/stack_sampling_profiler.cc:471: TimeDelta::FromSeconds(kMinimumThreadRunTimeSeconds)); My understanding is that the message loop just waits if no tasks are present. I believe it must be forcibly quit or its thread shut down to terminate it.
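The one-function thread hop suggested above can be sketched without any Chromium dependencies. This is an illustration under stated assumptions, not the CL's code: `ManualTaskRunner` is a hypothetical single-threaded stand-in that queues tasks until they are pumped by hand, and `SamplingSequence` is a hypothetical owner of the active-capture set.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <set>
#include <utility>

// Hypothetical stand-in for a task runner. Tasks queue up until RunPending()
// drains them; while draining, the caller counts as being "on" the sequence.
class ManualTaskRunner {
 public:
  void PostTask(std::function<void()> task) {
    tasks_.push_back(std::move(task));
  }
  bool RunsTasksInCurrentSequence() const { return running_; }
  void RunPending() {
    running_ = true;
    while (!tasks_.empty()) {
      std::function<void()> task = std::move(tasks_.front());
      tasks_.pop_front();
      task();
    }
    running_ = false;
  }

 private:
  std::deque<std::function<void()>> tasks_;
  bool running_ = false;
};

// Each public method checks where it is running and, if that is not the
// sampling sequence, re-posts itself there. The threading expectation is
// stated and enforced in one place per operation.
class SamplingSequence {
 public:
  explicit SamplingSequence(ManualTaskRunner* runner) : runner_(runner) {}

  void Add(int id) {
    if (!runner_->RunsTasksInCurrentSequence()) {
      runner_->PostTask([this, id] { Add(id); });
      return;
    }
    active_.insert(id);
  }

  void Stop(int id) {
    if (!runner_->RunsTasksInCurrentSequence()) {
      runner_->PostTask([this, id] { Stop(id); });
      return;
    }
    assert(runner_->RunsTasksInCurrentSequence());
    active_.erase(id);
  }

  bool IsActive(int id) const { return active_.count(id) != 0; }

 private:
  ManualTaskRunner* runner_;
  std::set<int> active_;
};
```

Calling `Stop(id)` from outside the sequence only queues work; the state changes once the runner executes the re-posted call.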
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: mac_chromium_compile_dbg_ng on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_comp...)
Patchset #3 (id:40001) has been deleted
https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam... base/profiler/stack_sampling_profiler.cc:311: scoped_refptr<SingleThreadTaskRunner> runner = task_runner(); > According to the Thread documentation, task_runner() can only be safely called > from the thread that invokes Start(). The comment for Thread::task_runner() says: // In addition to this Thread's owning sequence, this can also safely be // called from the underlying thread itself. > Thread's API is not thread-safe in general, so care should be taken to ensure > that it's only used from the proper threads, including making liberal use of > DCHECKs/ThreadChecker since it's non-trivial to validate from reading the code. There's a DCHECK in Thread::task_runner() that verifies this. https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam... base/profiler/stack_sampling_profiler.cc:336: runner->PostTask(FROM_HERE, Bind(&SamplingThread::StartCaptureTask, On 2016/12/09 21:45:02, Mike Wittman wrote: > It's common practice to implement thread hopping using just one function, > checking whether the execution is on the desired thread at the start of the > function, and if not, posting a task back to the same function on the desired > thread. I think that could be done here by checking the thread id, and probably > would make the code a little easier to follow. Add() and Stop() are always coming from a different thread. The StartCaptureTask() could be merged into the same method but I think that would be more confusing because of all the work done above to make sure the thread is actually running. https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam... base/profiler/stack_sampling_profiler.cc:429: On 2016/12/09 21:45:01, Mike Wittman wrote: > where does the capture get erased from active_captures_? 
In ::Cleanup() ... which I realized after uploading that I forgot to write. :-) https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam... base/profiler/stack_sampling_profiler.cc:471: TimeDelta::FromSeconds(kMinimumThreadRunTimeSeconds)); On 2016/12/09 21:45:01, Mike Wittman wrote: > My understanding is that the message loop just waits if no tasks are present. I > believe it must be forcibly quit or its thread shut down to terminate it. Correct. My idea is to add the ability for it to self-destruct when "empty" (which is not the same thing as "idle"). I didn't think it was possible for an outside class like this one to tell if the message_loop is empty, but perhaps it can -- I'll have to check. If so, then I can add a check at the end of every task to terminate the loop if it is empty.
https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam... base/profiler/stack_sampling_profiler.cc:311: scoped_refptr<SingleThreadTaskRunner> runner = task_runner(); On 2016/12/09 23:38:30, bcwhite wrote: > > According to the Thread documentation, task_runner() can only be safely called > > from the thread that invokes Start(). > > The comment for Thread::task_runner() says: > // In addition to this Thread's owning sequence, this can also safely be > // called from the underlying thread itself. Right, but Add() will never be called on the thread itself, correct? If I'm not mistaken task_runner() will be invoked on a thread other than the thread itself and the one that called Start(), once a second thread attempts to profile itself concurrently. Unit tests for the concurrency functionality will help catch this type of issue. https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam... base/profiler/stack_sampling_profiler.cc:336: runner->PostTask(FROM_HERE, Bind(&SamplingThread::StartCaptureTask, On 2016/12/09 23:38:30, bcwhite wrote: > On 2016/12/09 21:45:02, Mike Wittman wrote: > > It's common practice to implement thread hopping using just one function, > > checking whether the execution is on the desired thread at the start of the > > function, and if not, posting a task back to the same function on the desired > > thread. I think that could be done here by checking the thread id, and > probably > > would make the code a little easier to follow. > > Add() and Stop() are always coming from a different thread. The > StartCaptureTask() could be merged into the same method but I think that would > be more confusing because of all the work done above to make sure the thread is > actually running. I think it's worth doing this for Stop() at least. 
The other benefit of the one-function thread hop implementation is that it clearly documents/enforces threading expectations for an operation in a single location. https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam... base/profiler/stack_sampling_profiler.cc:471: TimeDelta::FromSeconds(kMinimumThreadRunTimeSeconds)); On 2016/12/09 23:38:30, bcwhite wrote: > On 2016/12/09 21:45:01, Mike Wittman wrote: > > My understanding is that the message loop just waits if no tasks are present. > I > > believe it must be forcibly quit or its thread shut down to terminate it. > > Correct. My idea is to add the ability for it to self-destruct when "empty" > (which is not the same thing as "idle"). I think that will be confusing to readers since it will operate differently than all the other message loops in the application. Couldn't this be addressed by posting a delayed quit task when the number of captures drops to zero (and canceling the task if the number of captures becomes non-zero)?
https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam... base/profiler/stack_sampling_profiler.cc:311: scoped_refptr<SingleThreadTaskRunner> runner = task_runner(); On 2016/12/10 00:24:23, Mike Wittman wrote: > On 2016/12/09 23:38:30, bcwhite wrote: > > > According to the Thread documentation, task_runner() can only be safely > called > > > from the thread that invokes Start(). > > > > The comment for Thread::task_runner() says: > > // In addition to this Thread's owning sequence, this can also safely be > > // called from the underlying thread itself. > > Right, but Add() will never be called on the thread itself, correct? > > If I'm not mistaken task_runner() will be invoked on a thread other than the > thread itself and the one that called Start(), once a second thread attempts to > profile itself concurrently. > > Unit tests for the concurrency functionality will help catch this type of issue. Ah, I understand. So if this is called from other than the thread that started it, I need to post-task to the thread that started it... which will then post-task to the worker thread. But is it possible that whatever thread started it has the same restrictions and won't allow posts from just anywhere? https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam... base/profiler/stack_sampling_profiler.cc:336: runner->PostTask(FROM_HERE, Bind(&SamplingThread::StartCaptureTask, On 2016/12/10 00:24:23, Mike Wittman wrote: > On 2016/12/09 23:38:30, bcwhite wrote: > > On 2016/12/09 21:45:02, Mike Wittman wrote: > > > It's common practice to implement thread hopping using just one function, > > > checking whether the execution is on the desired thread at the start of the > > > function, and if not, posting a task back to the same function on the > desired > > > thread. 
I think that could be done here by checking the thread id, and > > probably > > > would make the code a little easier to follow. > > > > Add() and Stop() are always coming from a different thread. The > > StartCaptureTask() could be merged into the same method but I think that would > > be more confusing because of all the work done above to make sure the thread > is > > actually running. > > I think it's worth doing this for Stop() at least. The other benefit of the > one-function thread hop implementation is that it clearly documents/enforces > threading expectations for an operation in a single location. I can see that. The downside is that there are then two different styles for methods of the same class. https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam... base/profiler/stack_sampling_profiler.cc:471: TimeDelta::FromSeconds(kMinimumThreadRunTimeSeconds)); On 2016/12/10 00:24:23, Mike Wittman wrote: > On 2016/12/09 23:38:30, bcwhite wrote: > > On 2016/12/09 21:45:01, Mike Wittman wrote: > > > My understanding is that the message loop just waits if no tasks are > present. > > I > > > believe it must be forcibly quit or its thread shut down to terminate it. > > > > Correct. My idea is to add the ability for it to self-destruct when "empty" > > (which is not the same thing as "idle"). > > I think that will be confusing to readers since it will operate differently than > all the other message loops in the application. Couldn't this be addressed by > posting a delayed quit task when the number of captures drops to zero (and > canceling the task if the number of captures becomes non-zero)? Message loops are already RunForever or RunUntilIdle. Adding RunUntilEmpty seems a natural (and generally useful) extension. Posting a delayed quit is still a race condition because a new task could get posted from another thread just as the quit starts executing. It's possible that's a problem no matter what the solution. 
Even RunUntilEmpty may have that issue -- I'd have to investigate further. There may need to be some sort of atomic operation, such as a simple counter, no matter what. Given the variations, I think it would be best to leave it as "run forever" in this CL and do the quit-when-idle as a follow-up CL.
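For reference, the one-function thread hop discussed above for Stop() could look roughly like the sketch below. It uses plain std::thread rather than base::Thread, and Worker, Sampler, OnWorkerThread and StopFromAnotherThread are hypothetical names chosen for illustration, not code from this CL.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a thread with a task queue (not base::Thread).
class Worker {
 public:
  Worker() : thread_([this] { Run(); }) {}
  ~Worker() {
    PostTask(nullptr);  // An empty task tells the loop to exit.
    thread_.join();     // Every task posted earlier has run by now.
  }

  // Safe to call from any thread.
  void PostTask(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      tasks_.push_back(std::move(task));
    }
    wakeup_.notify_one();
  }

  bool OnWorkerThread() const {
    return std::this_thread::get_id() == thread_.get_id();
  }

 private:
  void Run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        wakeup_.wait(lock, [this] { return !tasks_.empty(); });
        task = std::move(tasks_.front());
        tasks_.pop_front();
      }
      if (!task)
        return;
      task();
    }
  }

  std::mutex mutex_;
  std::condition_variable wakeup_;
  std::deque<std::function<void()>> tasks_;
  std::thread thread_;  // Declared last so the queue exists before Run() starts.
};

class Sampler {
 public:
  explicit Sampler(std::vector<int>* stopped) : stopped_(stopped) {}

  // Callable from any thread. If the caller is not the worker, the call
  // re-posts itself, so the code below the check only runs on the worker.
  void Stop(int id) {
    if (!worker_.OnWorkerThread()) {
      worker_.PostTask([this, id] { Stop(id); });
      return;
    }
    stopped_->push_back(id);  // Only ever touched on the worker thread.
  }

 private:
  std::vector<int>* stopped_;
  Worker worker_;  // Destroyed first, which drains the queue and joins.
};

std::vector<int> StopFromAnotherThread() {
  std::vector<int> stopped;
  {
    Sampler sampler(&stopped);
    sampler.Stop(7);  // Called off the worker thread, so it hops.
  }                   // Sampler teardown waits for the hop to finish.
  return stopped;
}
```

The threading expectation for Stop() lives in one place (the check at the top), which is the benefit argued for above; the cost is the per-call thread-id comparison.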
https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:311: scoped_refptr<SingleThreadTaskRunner> runner = task_runner();
On 2016/12/13 16:08:11, bcwhite wrote:
> On 2016/12/10 00:24:23, Mike Wittman wrote:
> > On 2016/12/09 23:38:30, bcwhite wrote:
> > > > According to the Thread documentation, task_runner() can only be safely called
> > > > from the thread that invokes Start().
> > >
> > > The comment for Thread::task_runner() says:
> > > // In addition to this Thread's owning sequence, this can also safely be
> > > // called from the underlying thread itself.
> >
> > Right, but Add() will never be called on the thread itself, correct?
> >
> > If I'm not mistaken task_runner() will be invoked on a thread other than the
> > thread itself and the one that called Start(), once a second thread attempts to
> > profile itself concurrently.
> >
> > Unit tests for the concurrency functionality will help catch this type of issue.
>
> Ah, I understand. So if this is called from other than the thread that started
> it, I need to post-task to the thread that started it... which will then
> post-task to the worker thread.
>
> But is it possible that whatever thread started it has the same restrictions and
> won't allow posts from just anywhere?

It may be possible to call task_runner() on the Start thread, then maintain an instance of the scoped_refptr<SingleThreadTaskRunner> as a member variable on SamplingThread, since SingleThreadTaskRunner is thread-safe refcounted type.

Delaying the quit-when-idle behavior to a follow on CL hopefully should make this a little easier to deal with.

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:471: TimeDelta::FromSeconds(kMinimumThreadRunTimeSeconds));
On 2016/12/13 16:08:11, bcwhite wrote:
> On 2016/12/10 00:24:23, Mike Wittman wrote:
> > On 2016/12/09 23:38:30, bcwhite wrote:
> > > On 2016/12/09 21:45:01, Mike Wittman wrote:
> > > > My understanding is that the message loop just waits if no tasks are present. I
> > > > believe it must be forcibly quit or its thread shut down to terminate it.
> > >
> > > Correct. My idea is to add the ability for it to self-destruct when "empty"
> > > (which is not the same thing as "idle").
> >
> > I think that will be confusing to readers since it will operate differently than
> > all the other message loops in the application. Couldn't this be addressed by
> > posting a delayed quit task when the number of captures drops to zero (and
> > canceling the task if the number of captures becomes non-zero)?
>
> Message loops are already RunForever or RunUntilIdle. Adding RunUntilEmpty
> seems a natural (and generally useful) extension.

RunForever is pretty much the only mode that's used in Chrome itself; RunUntilIdle is used almost exclusively for testing. I believe RunUntilIdle is generally considered an anti-pattern in production code because of its action at a distance properties -- anyone else in the system (including the OS) can unintentionally keep the message loop alive by posting messages to it. There's only a dozen instances of RunUntilIdle in actual Chrome code, all of which are in highly constrained scenarios:
https://cs.chromium.org/search/?q=rununtilidle%5C(%5C);+file:%5C.cc$+-file:te...

RunUntilEmpty will be subject to the same issues, I think.

> Posting a delayed quit is still a race condition because a new task could get
> posted from another thread just as the quit starts executing.
>
> It's possible that's a problem no matter what the solution. Even RunUntilEmpty
> may have that issue -- I'd have to investigate further. There may need to be
> some sort of atomic operation, such as a simple counter, no matter what.
>
> Given the variations, I think it would be best to leave it as "run forever" in
> this CL and do the quit-when-idle as a follow-up CL.

Deferring the quit-when-idle behavior to a follow-on CL SGTM. I suspect there will be more than enough complexity to deal with just implementing the "run forever" mode.
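As a note for the follow-up CL: one way the "simple counter" idea could close the delayed-quit race is to snapshot an add counter when the quit is scheduled and re-check it under a lock when the quit task runs. The sketch below is only an illustration under that assumption; IdleShutdownGuard and its method names are hypothetical and nothing here is taken from the patch.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical helper: decides whether a delayed quit may still proceed.
class IdleShutdownGuard {
 public:
  // Called from any thread when new work is added. Returns true if the
  // worker is not running and the caller has to (re)start it.
  bool OnAdd() {
    std::lock_guard<std::mutex> lock(lock_);
    ++add_events_;
    const bool must_start = !running_;
    running_ = true;
    return must_start;
  }

  // Called when the worker goes idle; the value is handed to the delayed
  // quit task so it can detect work that arrived in the meantime.
  int SnapshotForShutdown() {
    std::lock_guard<std::mutex> lock(lock_);
    return add_events_;
  }

  // Body of the delayed quit task. Returns true if shutdown went ahead.
  bool TryShutdown(int snapshot) {
    std::lock_guard<std::mutex> lock(lock_);
    if (snapshot != add_events_)
      return false;  // Something was added after scheduling; stay alive.
    running_ = false;
    return true;
  }

 private:
  std::mutex lock_;
  int add_events_ = 0;
  bool running_ = false;
};
```

Because OnAdd() and TryShutdown() take the same lock, an add either lands before the check (and cancels the quit) or after it (and sees that a restart is needed), so no add can be lost between the two.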
https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:311: scoped_refptr<SingleThreadTaskRunner> runner = task_runner();
> It may be possible to call task_runner() on the Start thread, then maintain an
> instance of the scoped_refptr<SingleThreadTaskRunner> as a member variable on
> SamplingThread, since SingleThreadTaskRunner is thread-safe refcounted type.

Wasn't there an issue with a task-runner not being available for the Startup collection? Why the callback had to be direct instead of a posted task?

This is getting complicated. Are you sure it's not easier and less prone to problems to just use a lock and some shared data structures?
https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:311: scoped_refptr<SingleThreadTaskRunner> runner = task_runner();
On 2016/12/14 15:37:59, bcwhite wrote:
> > It may be possible to call task_runner() on the Start thread, then maintain an
> > instance of the scoped_refptr<SingleThreadTaskRunner> as a member variable on
> > SamplingThread, since SingleThreadTaskRunner is thread-safe refcounted type.
>
> Wasn't there an issue with a task-runner not being available for the Startup
> collection? Why the callback had to be direct instead of a posted task?

The UI thread does not have a task runner when it starts the profiler because it hasn't started its message loop at that point. The profiler thread doesn't have this issue because its message loop starts with the thread.

> This is getting complicated. Are you sure it's not easier and less prone to
> problems to just use a lock and some shared data structures?

This shouldn't be significantly more complicated than the other places in Chrome where threads are started for the purpose of other threads posting tasks to them. The one extra factor here is the desire to start the thread from an arbitrary thread. If that proves to be not workable, I think we can fall back to requesting that the profiling thread be started on a specific thread (e.g. the UI thread).
https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:311: scoped_refptr<SingleThreadTaskRunner> runner = task_runner();
> > Wasn't there an issue with a task-runner not being available for the Startup
> > collection? Why the callback had to be direct instead of a posted task?
>
> The UI thread does not have a task runner when it starts the profiler because it
> hasn't started its message loop at that point. The profiler thread doesn't have
> this issue because its message loop starts with the thread.

I must be missing something. The SamplingThread's task-runner is postable by only the SamplingThread and whatever thread created it.

The very first thing that will start a capture (and create the SamplingThread) doesn't have a task-runner at that point so there is no way to get a refptr to it for future use. And even if it did, how would you know it would be postable from any arbitrary thread?
https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:311: scoped_refptr<SingleThreadTaskRunner> runner = task_runner();
On 2016/12/14 19:39:39, bcwhite wrote:
> > > Wasn't there an issue with a task-runner not being available for the Startup
> > > collection? Why the callback had to be direct instead of a posted task?
> >
> > The UI thread does not have a task runner when it starts the profiler because it
> > hasn't started its message loop at that point. The profiler thread doesn't have
> > this issue because its message loop starts with the thread.
>
> I must be missing something. The SamplingThread's task-runner is postable by
> only the SamplingThread and whatever thread created it.

I don't think this is correct. Thread::task_runner() can only be called on those two threads, but I believe if the scoped_refptr<SingleThreadTaskRunner> returned by that function is saved somewhere, other threads can use it to post tasks. The scoped_refptr is thread-safe and all methods on TaskRunner are also thread-safe.

In Chrome for example, the IO thread is created by the UI thread, but other threads can post tasks directly to it. So there must already be a defined mechanism to do this.

> The very first thing that will start a capture (and create the SamplingThread)
> doesn't have a task-runner at that point so there is no way to get a refptr to
> it for future use. And even if it did, how would you know it would be postable
> from any arbitrary thread?

Why can't the thread that starts the SamplingThread get the TaskRunner via the task_runner() accessor after calling Start()?
https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:311: scoped_refptr<SingleThreadTaskRunner> runner = task_runner();
> > I must be missing something. The SamplingThread's task-runner is postable by
> > only the SamplingThread and whatever thread created it.
>
> I don't think this is correct. Thread::task_runner() can only be called on those
> two threads, but I believe if the scoped_refptr<SingleThreadTaskRunner> returned
> by that function is saved somewhere, other threads can use it to post tasks. The
> scoped_refptr is thread-safe and all methods on TaskRunner are also thread-safe.

Ah! So _fetching_ the task-runner must be done on one of those two threads but, once fetched, it can be _used_ from any thread? I had assumed that the 2-thread limitation applied to the latter. This makes better sense since it seemed odd to have restrictions about which threads could post to a queue.
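To make the fetch-once, use-anywhere distinction concrete, here is a small sketch of the idea using only the standard library. TaskQueue, RunPending and PostFromManyThreads are hypothetical names, and std::shared_ptr stands in for scoped_refptr; this is an analogy, not the actual base:: API.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical thread-safe queue playing the role of a task runner.
class TaskQueue {
 public:
  // Safe to call from any thread that holds a reference to the queue.
  void PostTask(std::function<void()> task) {
    std::lock_guard<std::mutex> lock(lock_);
    tasks_.push_back(std::move(task));
  }

  // Runs everything queued so far on the calling thread.
  void RunPending() {
    std::vector<std::function<void()>> pending;
    {
      std::lock_guard<std::mutex> lock(lock_);
      pending.swap(tasks_);
    }
    for (auto& task : pending)
      task();
  }

 private:
  std::mutex lock_;
  std::vector<std::function<void()>> tasks_;
};

int PostFromManyThreads(int thread_count) {
  // The reference is obtained once, on the "starting" thread...
  auto runner = std::make_shared<TaskQueue>();
  int sum = 0;
  std::vector<std::thread> posters;
  for (int i = 0; i < thread_count; ++i) {
    // ...and each copy of it can then be used to post from another thread.
    posters.emplace_back(
        [runner, &sum] { runner->PostTask([&sum] { ++sum; }); });
  }
  for (auto& poster : posters)
    poster.join();
  runner->RunPending();  // `sum` is only modified here, after the joins.
  return sum;
}
```

The restriction in the discussion above applies to the accessor that hands out the reference, not to posting through a reference that has already been saved.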
https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:311: scoped_refptr<SingleThreadTaskRunner> runner = task_runner();
On 2016/12/15 11:42:15, bcwhite wrote:
> > > I must be missing something. The SamplingThread's task-runner is postable by
> > > only the SamplingThread and whatever thread created it.
> >
> > I don't think this is correct. Thread::task_runner() can only be called on those
> > two threads, but I believe if the scoped_refptr<SingleThreadTaskRunner> returned
> > by that function is saved somewhere, other threads can use it to post tasks. The
> > scoped_refptr is thread-safe and all methods on TaskRunner are also thread-safe.
>
> Ah! So _fetching_ the task-runner must be done on one of those two threads but,
> once fetched, it can be _used_ from any thread? I had assumed that the 2-thread
> limitation applied to the latter. This makes better sense since it seemed odd
> to have restrictions about which threads could post to a queue.

It's going to still be necessary to have a lock, this time to protect the saved task-runner pointer. Since the thread is started on-demand rather than in the (single-threaded) constructor, there could be other threads trying to read that pointer at the same moment that it is being set.

The same lock will protect against having multiple Start() calls from attempting to launch the sampling thread.
https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:311: scoped_refptr<SingleThreadTaskRunner> runner = task_runner();
On 2016/12/15 15:01:16, bcwhite wrote:
> On 2016/12/15 11:42:15, bcwhite wrote:
> > > > I must be missing something. The SamplingThread's task-runner is postable by
> > > > only the SamplingThread and whatever thread created it.
> > >
> > > I don't think this is correct. Thread::task_runner() can only be called on those
> > > two threads, but I believe if the scoped_refptr<SingleThreadTaskRunner> returned
> > > by that function is saved somewhere, other threads can use it to post tasks. The
> > > scoped_refptr is thread-safe and all methods on TaskRunner are also thread-safe.
> >
> > Ah! So _fetching_ the task-runner must be done on one of those two threads but,
> > once fetched, it can be _used_ from any thread? I had assumed that the 2-thread
> > limitation applied to the latter. This makes better sense since it seemed odd
> > to have restrictions about which threads could post to a queue.
>
> It's going to still be necessary to have a lock, this time to protect the saved
> task-runner pointer. Since the thread is started on-demand rather than in the
> (single-threaded) constructor, there could be other threads trying to read that
> pointer at the same moment that it is being set.
>
> The same lock will protect against having multiple Start() calls from attempting
> to launch the sampling thread.

Yeah, I'm not surprised we can't completely avoid synchronization. Doing it in this form will at least be fairly limited in scope.
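A limited-scope version of that synchronization might look like the sketch below: a single lock guards both the saved task-runner pointer and the one-time start. TaskRunner and SamplingThreadState are hypothetical placeholders, and std::make_shared stands in for the real Start() call, so treat this as an outline of the locking only.

```cpp
#include <cassert>
#include <memory>
#include <mutex>

// Hypothetical placeholder for the object tasks are posted to.
struct TaskRunner {};

class SamplingThreadState {
 public:
  // Callable from any thread. The lock covers both the read of the saved
  // pointer and the one-time start, so two callers cannot both start it.
  std::shared_ptr<TaskRunner> GetOrCreateTaskRunner() {
    std::lock_guard<std::mutex> lock(lock_);
    if (!task_runner_) {
      task_runner_ = std::make_shared<TaskRunner>();  // Stands in for Start().
      ++start_count_;
    }
    return task_runner_;
  }

  int start_count() const {
    std::lock_guard<std::mutex> lock(lock_);
    return start_count_;
  }

 private:
  mutable std::mutex lock_;
  std::shared_ptr<TaskRunner> task_runner_;
  int start_count_ = 0;
};
```

Callers get their own reference back, so the lock is held only for the lookup and not while tasks are being posted.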
Switched to task-runner. Still a bit rough and tests need to be updated/added -- working on that now.

https://codereview.chromium.org/2554123002/diff/1/base/profiler/stack_samplin...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/1/base/profiler/stack_samplin...
base/profiler/stack_sampling_profiler.cc:223: static subtle::AtomicWord next_capture_id_;
On 2016/12/07 17:20:42, Mike Wittman wrote:
> On 2016/12/07 16:25:02, Mike Wittman wrote:
> > On 2016/12/07 15:15:30, bcwhite wrote:
> > > On 2016/12/06 21:04:58, Mike Wittman wrote:
> > > > We probably can avoid need for a thread-safe id by identifying the ActiveCapture
> > > > by its address (e.g. as an opaque void*).
> > >
> > > My concern with that is that addresses may be reused. A capture could start and
> > > then complete, getting freed. A new capture could start and reuse the same
> > > address, reasonably likely given that the allocation is the exact same number of
> > > bytes as the free'd block. Then a stop-request for the first one could be made
> > > and cause the new one to stop.
> > >
> > > The incrementing integer will also repeat but not for a long, long time.
> >
> > Yes, care would need to be taken to ensure the StackSamplingProfiler doesn't
> > retain the address beyond when the object is deleted. This may be feasible,
> > depending on the synchronization we ultimately have in place with the threads
> > owning the StackSamplingProfilers. Let's reconsider once we have something
> > closer to a final implementation.
>
> Also, the current implementation can use base::StaticAtomicSequenceNumber.

Done.

https://codereview.chromium.org/2554123002/diff/1/base/profiler/stack_samplin...
base/profiler/stack_sampling_profiler.cc:348: capture->native_sampler->ProfileRecordingStarting(&profile.modules);
On 2016/12/06 21:04:57, Mike Wittman wrote:
> The matching call to ProfileRecordingStopped has been dropped with these
> changes.

Done.

https://codereview.chromium.org/2554123002/diff/1/base/profiler/stack_samplin...
base/profiler/stack_sampling_profiler.cc:379: wait = TimeDelta::FromDays(365); // A long, long time.
On 2016/12/06 21:04:58, Mike Wittman wrote:
> There's a general desire to have as few persistent threads as possible in
> Chrome, so we probably should have the sampling thread terminate after a period
> of inactivity, and restart on demand.

To be done in a future CL.

https://codereview.chromium.org/2554123002/diff/1/base/profiler/stack_samplin...
base/profiler/stack_sampling_profiler.cc:441: active_captures_.push_back(std::move(capture_ptr));
On 2016/12/06 21:04:57, Mike Wittman wrote:
> push_heap?

Acknowledged.

https://codereview.chromium.org/2554123002/diff/1/base/profiler/stack_samplin...
base/profiler/stack_sampling_profiler.cc:494: NativeStackSampler::Create(thread_id_, &RecordAnnotations,
On 2016/12/06 21:04:58, Mike Wittman wrote:
> We'll need to refactor to use a single stack copy buffer across all
> NativeStackSamplers, as the buffer is fairly large.

Future CL.

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:265: void PerformCapture(ActiveCapture* capture);
On 2016/12/09 21:45:02, Mike Wittman wrote:
> The term "capture" is overloaded in the method names to mean both the recording
> of all the samples and the recording of a single sample. Can we use something
> like "record sample" for the latter case to be consistent with the
> NativeStackSampler? e.g. this becomes something like RecordSampleForCapture.

Done, though shortened to just Begin/End/PerformRecording.

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:344: scoped_refptr<SingleThreadTaskRunner> runner = task_runner();
On 2016/12/09 21:45:01, Mike Wittman wrote:
> Same issue with task_runner() here.

Done.

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:471: TimeDelta::FromSeconds(kMinimumThreadRunTimeSeconds));
On 2016/12/13 18:16:41, Mike Wittman wrote:
> On 2016/12/13 16:08:11, bcwhite wrote:
> > On 2016/12/10 00:24:23, Mike Wittman wrote:
> > > On 2016/12/09 23:38:30, bcwhite wrote:
> > > > On 2016/12/09 21:45:01, Mike Wittman wrote:
> > > > > My understanding is that the message loop just waits if no tasks are present. I
> > > > > believe it must be forcibly quit or its thread shut down to terminate it.
> > > >
> > > > Correct. My idea is to add the ability for it to self-destruct when "empty"
> > > > (which is not the same thing as "idle").
> > >
> > > I think that will be confusing to readers since it will operate differently than
> > > all the other message loops in the application. Couldn't this be addressed by
> > > posting a delayed quit task when the number of captures drops to zero (and
> > > canceling the task if the number of captures becomes non-zero)?
> >
> > Message loops are already RunForever or RunUntilIdle. Adding RunUntilEmpty
> > seems a natural (and generally useful) extension.
>
> RunForever is pretty much the only mode that's used in Chrome itself;
> RunUntilIdle is used almost exclusively for testing. I believe RunUntilIdle is
> generally considered an anti-pattern in production code because of its action at
> a distance properties -- anyone else in the system (including the OS) can
> unintentionally keep the message loop alive by posting messages to it. There's
> only a dozen instances of RunUntilIdle in actual Chrome code, all of which are
> in highly constrained scenarios:
> https://cs.chromium.org/search/?q=rununtilidle%5C(%5C);+file:%5C.cc$+-file:te...
>
> RunUntilEmpty will be subject to the same issues, I think.
>
> > Posting a delayed quit is still a race condition because a new task could get
> > posted from another thread just as the quit starts executing.
> >
> > It's possible that's a problem no matter what the solution. Even RunUntilEmpty
> > may have that issue -- I'd have to investigate further. There may need to be
> > some sort of atomic operation, such as a simple counter, no matter what.
> >
> > Given the variations, I think it would be best to leave it as "run forever" in
> > this CL and do the quit-when-idle as a follow-up CL.
>
> Deferring the quit-when-idle behavior to a follow-on CL SGTM. I suspect there
> will be more than enough complexity to deal with just implementing the "run
> forever" mode.

Acknowledged.
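On the id-versus-address point at the top of this message: the property being relied on is that a process-wide counter hands out values that are not reused for a very long time, unlike heap addresses. The review names base::StaticAtomicSequenceNumber for this; the sketch below shows the same idea with std::atomic instead, and NextCollectionId is a hypothetical name.

```cpp
#include <atomic>
#include <cassert>

// Returns a fresh id on every call, from any thread. Unlike a heap address,
// a value is not handed out again until the counter wraps around.
int NextCollectionId() {
  static std::atomic<int> next_id{0};
  return next_id.fetch_add(1, std::memory_order_relaxed);
}
```

A stale stop request that carries an old id then simply fails to match anything, instead of stopping whichever newer capture happened to reuse the freed allocation.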
https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:265: void PerformCapture(ActiveCapture* capture);
On 2016/12/15 18:07:50, bcwhite wrote:
> On 2016/12/09 21:45:02, Mike Wittman wrote:
> > The term "capture" is overloaded in the method names to mean both the recording
> > of all the samples and the recording of a single sample. Can we use something
> > like "record sample" for the latter case to be consistent with the
> > NativeStackSampler? e.g. this becomes something like RecordSampleForCapture.
>
> Done, though shortened to just Begin/End/PerformRecording.

This still has the same issue: "recording" is used to refer to both the recording of one sample and all the samples. It's also not clear what the relationship between "capture", "recording", and "collection" is; all three terms are used variously in code and comments.

Can we regularize all this terminology? My suggestion:
- use "record" or "record sample" to refer to the recording of one stack/sample
- use "collection" to refer to the collection of all the samples for one request.

With that, these functions become BeginCollection, EndCollection, RecordSample. ActiveCapture becomes ActiveCollection, or better, CollectionContext since that makes clear it's strictly state associated with a collection and not behavior.

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:285: std::map<int, WeakPtr<ActiveCapture>> active_captures_;
I'm not seeing the benefit of using WeakPtr<ActiveCapture> here, rather than unique_ptr<ActiveCapture>, and delete by erasing the id. Can you explain?

Seems like unique_ptr could avoid the whole WeakPtr machinery, make it much easier to reason about ownership, and reduce the amount of state in the system.

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:321: DCHECK(task_runner_);
Remove this DCHECK? Seems like it's just verifying Thread's documented behavior at this point.

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:327: DCHECK(success);
Remove this one as well? I don't think this will ever fail, and if it does, it's an issue internal to the message loop/task runner.

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:365: }
Why not erase the capture from active_captures_ at the end of this function?

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:400: bool success = task_runner()->PostDelayedTask(
While the task_runner() accesses on the profiler thread don't need to be guarded by the lock, that won't be at all obvious to the casual reader. Can we encapsulate this subtlety within functions (e.g. GetTaskRunner/GetOrCreateTaskRunner), handling the locking (if necessary) there?

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:404: DCHECK(success);
I think this can be removed for the same reasons as above.
Dry run: Try jobs failed on following builders: ios-device on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-device/builds...) ios-device-xcode-clang on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-device-xcode-...) ios-simulator on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-simulator/bui...) ios-simulator-xcode-clang on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-simulator-xco...)
Dry run: Try jobs failed on following builders: ios-device on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-device/builds...)
Patchset #5 (id:100001) has been deleted
https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/20001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:265: void PerformCapture(ActiveCapture* capture);
On 2016/12/15 20:37:53, Mike Wittman wrote:
> On 2016/12/15 18:07:50, bcwhite wrote:
> > On 2016/12/09 21:45:02, Mike Wittman wrote:
> > > The term "capture" is overloaded in the method names to mean both the recording
> > > of all the samples and the recording of a single sample. Can we use something
> > > like "record sample" for the latter case to be consistent with the
> > > NativeStackSampler? e.g. this becomes something like RecordSampleForCapture.
> >
> > Done, though shortened to just Begin/End/PerformRecording.
>
> This still has the same issue: "recording" is used to refer to both the
> recording of one sample and all the samples. It's also not clear what the
> relationship between "capture", "recording", and "collection" is; all three
> terms are used variously in code and comments.
>
> Can we regularize all this terminology? My suggestion:
> - use "record" or "record sample" to refer to the recording of one stack/sample
> - use "collection" to refer to the collection of all the samples for one
> request.
>
> With that, these functions become BeginCollection, EndCollection, RecordSample.
> ActiveCapture becomes ActiveCollection, or better, CollectionContext since that
> makes clear it's strictly state associated with a collection and not behavior.

Done.

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:285: std::map<int, WeakPtr<ActiveCapture>> active_captures_;
On 2016/12/15 20:37:53, Mike Wittman wrote:
> I'm not seeing the benefit of using WeakPtr<ActiveCapture> here, rather than
> unique_ptr<ActiveCapture>, and delete by erasing the id. Can you explain?
>
> Seems like unique_ptr could avoid the whole WeakPtr machinery, make it much
> easier to reason about ownership, and reduce the amount of state in the system.

The ownership was with the posted tasks so other pointers needed to be weak. But it didn't work out like I was thinking so went another way. Since the single instance of this class never gets destructed, ownership can stay in this map and raw pointers passed to posted tasks.

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:321: DCHECK(task_runner_);
On 2016/12/15 20:37:53, Mike Wittman wrote:
> Remove this DCHECK? Seems like it's just verifying Thread's documented behavior
> at this point.

Done.

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:327: DCHECK(success);
On 2016/12/15 20:37:53, Mike Wittman wrote:
> Remove this one as well? I don't think this will ever fail, and if it does, it's
> an issue internal to the message loop/task runner.

Done.

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:365: }
On 2016/12/15 20:37:53, Mike Wittman wrote:
> Why not erase the capture from active_captures_ at the end of this function?

Done.

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:400: bool success = task_runner()->PostDelayedTask(
On 2016/12/15 20:37:53, Mike Wittman wrote:
> While the task_runner() accesses on the profiler thread don't need to be guarded
> by the lock, that won't be at all obvious to the casual reader. Can we
> encapsulate this subtlety within functions (e.g.
> GetTaskRunner/GetOrCreateTaskRunner), handling the locking (if necessary) there?

Such a method would have to fetch the current thread-id and compare it to the id of the sampling thread to know whether it needs to use the (lock-protected) member variable or call Thread::task_runner(). Unfortunately, getting the current thread-id can be a system call which means we probably shouldn't do it unless necessary.

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:404: DCHECK(success);
On 2016/12/15 20:37:53, Mike Wittman wrote:
> I think this can be removed for the same reasons as above.

Done.
Dry run: Try jobs failed on following builders: win_chromium_rel_ng on master.tryserver.chromium.win (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_rel_...)
https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:285: std::map<int, WeakPtr<ActiveCapture>> active_captures_;
On 2016/12/21 16:39:10, bcwhite wrote:
> On 2016/12/15 20:37:53, Mike Wittman wrote:
> > I'm not seeing the benefit of using WeakPtr<ActiveCapture> here, rather than
> > unique_ptr<ActiveCapture>, and delete by erasing the id. Can you explain?
> >
> > Seems like unique_ptr could avoid the whole WeakPtr machinery, make it much
> > easier to reason about ownership, and reduce the amount of state in the
> > system.
>
> The ownership was with the posted tasks so other pointers needed to be weak.
> But it didn't work out like I was thinking so went another way. Since the
> single instance of this class never gets destructed, ownership can stay in this
> map and raw pointers passed to posted tasks.

Can we pass the id rather than the raw pointer? Paying the small overhead of looking up the context in the map is IMHO well worth the benefit of not having to consider whether there are lifetime issues between the context references in the posted tasks and active_captures_.

https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam...
base/profiler/stack_sampling_profiler.cc:400: bool success = task_runner()->PostDelayedTask(
On 2016/12/21 16:39:10, bcwhite wrote:
> On 2016/12/15 20:37:53, Mike Wittman wrote:
> > While the task_runner() accesses on the profiler thread don't need to be
> > guarded by the lock, that won't be at all obvious to the casual reader. Can we
> > encapsulate this subtlety within functions (e.g.
> > GetTaskRunner/GetOrCreateTaskRunner), handling the locking (if necessary)
> > there?
>
> Such a method would have to fetch the current thread-id and compare it to the id
> of the sampling thread to know whether it needs to use the (lock-protected)
> member variable or call Thread::task_runner(). Unfortunately, getting the
> current thread-id can be a system call which means we probably shouldn't do it
> unless necessary.

Fetching the thread id on Windows is cheap: it's stored in the Thread Environment Block, which is accessed via a segment register and doesn't require a syscall. I took a look at glibc and it also does not need a syscall to get the thread id.

https://codereview.chromium.org/2554123002/diff/120001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/120001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:187: bool UpdateNextSampleTime() {
structs should not have methods providing behavior; this function probably should be moved out to SamplingThread.

https://codereview.chromium.org/2554123002/diff/120001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:269: static constexpr int kMinimumThreadRunTimeSeconds = 60;
This can be removed.

https://codereview.chromium.org/2554123002/diff/120001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:338: DCHECK(collection->native_sampler);
This DCHECK can be moved to CollectionContext::CollectionContext() and the function removed. No need for an explicit Begin/Finish function pair if there's nothing to do on begin.

https://codereview.chromium.org/2554123002/diff/120001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:344: collection->stopped = true;
The stopped state can be removed since it's now redundant to the presence of the context in active_collections_.

https://codereview.chromium.org/2554123002/diff/120001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:446: DCHECK_EQ(0U, active_collections_.size());
nit: DCHECK(active_collections_.empty());
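For reference, the "pass the id rather than the raw pointer" suggestion could look roughly like the sketch below. This is not the code in the patch: CollectionRegistry, RecordSample and SamplesTaken are hypothetical names, and it uses only the standard library rather than base:: types. The map is the sole owner of each context, a task carries only the id, and a failed lookup means the collection was stopped before the task ran.

```cpp
#include <cassert>
#include <map>
#include <memory>

// Hypothetical stand-in for the per-collection state discussed above.
struct CollectionContext {
  explicit CollectionContext(int id) : collection_id(id) {}
  const int collection_id;
  int samples_taken = 0;
};

// Hypothetical registry: the map is the only owner of every context.
class CollectionRegistry {
 public:
  // Creates a context, takes ownership, and returns the id tasks should carry.
  int Add() {
    const int id = next_id_++;
    active_collections_.emplace(id, std::make_unique<CollectionContext>(id));
    return id;
  }

  // Erasing the id destroys the context; no other owner exists.
  void Remove(int id) { active_collections_.erase(id); }

  // What a posted task would do with its id. Returns false if the collection
  // was stopped (and therefore erased) before the task ran.
  bool RecordSample(int id) {
    auto found = active_collections_.find(id);
    if (found == active_collections_.end())
      return false;
    ++found->second->samples_taken;
    return true;
  }

  // Returns the sample count, or -1 if the collection no longer exists.
  int SamplesTaken(int id) const {
    auto found = active_collections_.find(id);
    return found == active_collections_.end() ? -1
                                              : found->second->samples_taken;
  }

 private:
  int next_id_ = 0;
  std::map<int, std::unique_ptr<CollectionContext>> active_collections_;
};
```

With this shape there is no separate "stopped" flag to keep in sync: presence in the map is the state, and a stale task simply finds nothing to do.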
Patchset #5 (id:120001) has been deleted
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam... base/profiler/stack_sampling_profiler.cc:188: bool UpdateNextSampleTime() { > structs should not have methods providing behavior; this function > probably should be moved out to SamplingThread. Really? I've seen it many times and Alexei has in the past even requested methods being added to structs if they're actions are solely confined to the data of those structs. https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam... base/profiler/stack_sampling_profiler.cc:285: std::map<int, WeakPtr<ActiveCapture>> active_captures_; On 2016/12/21 19:38:41, Mike Wittman wrote: > On 2016/12/21 16:39:10, bcwhite wrote: > > On 2016/12/15 20:37:53, Mike Wittman wrote: > > > I'm not seeing the benefit of using WeakPtr<ActiveCapture> here, rather than > > > unique_ptr<ActiveCapture>, and delete by erasing the id. Can you explain? > > > > > > Seems like unique_ptr could avoid the whole WeakPtr machinery, make it much > > > easier to reason about ownership, and reduce the amount of state in the > > system. > > > > The ownership was with the posted tasks so other pointers needed to be weak. > > But it didn't work out like I was thinking so went another way. Since the > > single instance of this class never gets destructed, ownership can stay in > this > > map and raw pointers passed to posted tasks. > > Can we pass the id rather than the raw pointer? Paying the small overhead of > looking up the context in the map is IMHO well worth the benefit of not having > to consider whether there are lifetime issues between the context references in > the posted tasks and active_captures_. Done. https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam... 
base/profiler/stack_sampling_profiler.cc:400: bool success = task_runner()->PostDelayedTask( On 2016/12/21 19:38:41, Mike Wittman wrote: > On 2016/12/21 16:39:10, bcwhite wrote: > > On 2016/12/15 20:37:53, Mike Wittman wrote: > > > While the task_runner() accesses on the profiler thread don't need to be > > guarded > > > by the lock, that won't be at all obvious to the casual reader. Can we > > > encapsulate this subtlety within functions (e.g. > > > GetTaskRunner/GetOrCreateTaskRunner), handling the locking (if necessary) > > there? > > > > Such a method would have to fetch the current thread-id and compare it to the > id > > of the sampling thread to know whether it needs to use the (lock-protected) > > member variable or call Thread::task_runner(). Unfortunately, getting the > > current thread-id can be a system call which means we probably shouldn't do it > > unless necessary. > > Fetching the thread id on Windows is cheap: it's stored in the Thread > Environment Block, which is accessed via a segment register and doesn't require > a syscall. I took a look at glibc and it also does not need a syscall to get the > thread id. Linux does a direct syscall(): https://cs.chromium.org/chromium/src/base/threading/platform_thread_posix.cc?...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam... base/profiler/stack_sampling_profiler.cc:188: bool UpdateNextSampleTime() { On 2016/12/22 16:12:10, bcwhite wrote: > > structs should not have methods providing behavior; this function > > probably should be moved out to SamplingThread. > > Really? I've seen it many times and Alexei has in the past even requested > methods being added to structs if they're actions are solely confined to the > data of those structs. The style guide says structs should not have any functionality beyond access/setting the data members: https://engdoc.corp.google.com/eng/doc/devguide/cpp/styleguide.shtml?cl=head#... I think this is the right thing to do regardless, so that all the logic dealing with the context state is collocated. The relationship between sample and params.samples_per_burst, for example, is defined across both this code and SamplingThread::PerformRecording. https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam... base/profiler/stack_sampling_profiler.cc:400: bool success = task_runner()->PostDelayedTask( On 2016/12/22 16:12:10, bcwhite wrote: > On 2016/12/21 19:38:41, Mike Wittman wrote: > > On 2016/12/21 16:39:10, bcwhite wrote: > > > On 2016/12/15 20:37:53, Mike Wittman wrote: > > > > While the task_runner() accesses on the profiler thread don't need to be > > > guarded > > > > by the lock, that won't be at all obvious to the casual reader. Can we > > > > encapsulate this subtlety within functions (e.g. > > > > GetTaskRunner/GetOrCreateTaskRunner), handling the locking (if necessary) > > > there? > > > > > > Such a method would have to fetch the current thread-id and compare it to > the > > id > > > of the sampling thread to know whether it needs to use the (lock-protected) > > > member variable or call Thread::task_runner(). 
Unfortunately, getting the > > > current thread-id can be a system call which means we probably shouldn't do > it > > > unless necessary. > > > > Fetching the thread id on Windows is cheap: it's stored in the Thread > > Environment Block, which is accessed via a segment register and doesn't > require > > a syscall. I took a look at glibc and it also does not need a syscall to get > the > > thread id. > > Linux does a direct syscall(): > https://cs.chromium.org/chromium/src/base/threading/platform_thread_posix.cc?... Ah, missed that the Linux implementation doesn't go through pthread_self(). Given that Linux syscall overhead is in the 10's to 100's of ns, and the task runner likely will be accessed at most a handful of times every 100ms, I think we can afford to pay this minimal overhead to make the code less tricky and more robust to changes. https://codereview.chromium.org/2554123002/diff/160001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/160001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:400: if (found == active_collections_.end()) It would be good to retain the comment here indicating that this situation can happen when the collection was stopped.
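To put a rough number on the overhead argument above, here is a small back-of-envelope sketch. The inputs are assumptions chosen from the ranges mentioned in the comment (100 ns per syscall as the upper end, and "a handful" read as 5 accesses per 100 ms); OverheadFraction is a hypothetical helper, not anything in the patch.

```cpp
#include <cassert>

// Fraction of wall-clock time spent fetching the thread id, given the cost of
// one lookup in nanoseconds, the number of lookups per interval, and the
// interval length in milliseconds.
constexpr double OverheadFraction(double syscall_ns,
                                  int accesses_per_interval,
                                  double interval_ms) {
  return (syscall_ns * accesses_per_interval) / (interval_ms * 1e6);
}

// Assumed inputs: 100 ns per lookup, 5 lookups every 100 ms.
// 500 ns out of 100,000,000 ns is on the order of a few millionths.
constexpr double kAssumedOverhead = OverheadFraction(100.0, 5, 100.0);
```

Under those assumptions the cost is a few parts per million of the sampling thread's time, which is the basis for calling it minimal.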
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Patchset #7 (id:180001) has been deleted
https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam... base/profiler/stack_sampling_profiler.cc:188: bool UpdateNextSampleTime() { On 2016/12/22 17:38:22, Mike Wittman wrote: > On 2016/12/22 16:12:10, bcwhite wrote: > > > structs should not have methods providing behavior; this function > > > probably should be moved out to SamplingThread. > > > > Really? I've seen it many times and Alexei has in the past even requested > > methods being added to structs if they're actions are solely confined to the > > data of those structs. > > The style guide says structs should not have any functionality beyond > access/setting the data members: > https://engdoc.corp.google.com/eng/doc/devguide/cpp/styleguide.shtml?cl=head#... > > I think this is the right thing to do regardless, so that all the logic dealing > with the context state is collocated. The relationship between sample and > params.samples_per_burst, for example, is defined across both this code and > SamplingThread::PerformRecording. Done. https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam... base/profiler/stack_sampling_profiler.cc:400: bool success = task_runner()->PostDelayedTask( On 2016/12/22 17:38:22, Mike Wittman wrote: > On 2016/12/22 16:12:10, bcwhite wrote: > > On 2016/12/21 19:38:41, Mike Wittman wrote: > > > On 2016/12/21 16:39:10, bcwhite wrote: > > > > On 2016/12/15 20:37:53, Mike Wittman wrote: > > > > > While the task_runner() accesses on the profiler thread don't need to be > > > > guarded > > > > > by the lock, that won't be at all obvious to the casual reader. Can we > > > > > encapsulate this subtlety within functions (e.g. > > > > > GetTaskRunner/GetOrCreateTaskRunner), handling the locking (if > necessary) > > > > there? 
> > > > > > > > Such a method would have to fetch the current thread-id and compare it to > > the > > > id > > > > of the sampling thread to know whether it needs to use the > (lock-protected) > > > > member variable or call Thread::task_runner(). Unfortunately, getting the > > > > current thread-id can be a system call which means we probably shouldn't > do > > it > > > > unless necessary. > > > > > > Fetching the thread id on Windows is cheap: it's stored in the Thread > > > Environment Block, which is accessed via a segment register and doesn't > > require > > > a syscall. I took a look at glibc and it also does not need a syscall to get > > the > > > thread id. > > > > Linux does a direct syscall(): > > > https://cs.chromium.org/chromium/src/base/threading/platform_thread_posix.cc?... > > Ah, missed that the Linux implementation doesn't go through pthread_self(). > > Given that Linux syscall overhead is in the 10's to 100's of ns, and the task > runner likely will be accessed at most a handful of times every 100ms, I think > we can afford to pay this minimal overhead to make the code less tricky and more > robust to changes. I started down this path but it ends up requiring the lock every access. I can't compare the current thread-id to the sampling thread's ID without it waiting for that ID to be valid, which only happens after it has been started. But I don't want to start it until needed and the only way to tell if its needed is to call IsRunning() or check the task_runner_ local variable to see if it's set, both of which require a lock. Getting the thread's ID does an event-wait so that'll need to be cached and locked as well, though it can probably share the same lock as task_runner_. Acquiring a lock isn't expensive but would be required with every sample and it's not necessary when we already know we're running on the sampling thread. I think a comment is the better way to make things obvious to the casual reader. 
https://codereview.chromium.org/2554123002/diff/160001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/160001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:400: if (found == active_collections_.end()) On 2016/12/22 17:38:22, Mike Wittman wrote: > It would be good to retain the comment here indicating that this situation can > happen when the collection was stopped. Done.
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam... base/profiler/stack_sampling_profiler.cc:400: bool success = task_runner()->PostDelayedTask( On 2017/01/05 16:35:58, bcwhite wrote: > On 2016/12/22 17:38:22, Mike Wittman wrote: > > On 2016/12/22 16:12:10, bcwhite wrote: > > > On 2016/12/21 19:38:41, Mike Wittman wrote: > > > > On 2016/12/21 16:39:10, bcwhite wrote: > > > > > On 2016/12/15 20:37:53, Mike Wittman wrote: > > > > > > While the task_runner() accesses on the profiler thread don't need to > be > > > > > guarded > > > > > > by the lock, that won't be at all obvious to the casual reader. Can we > > > > > > encapsulate this subtlety within functions (e.g. > > > > > > GetTaskRunner/GetOrCreateTaskRunner), handling the locking (if > > necessary) > > > > > there? > > > > > > > > > > Such a method would have to fetch the current thread-id and compare it > to > > > the > > > > id > > > > > of the sampling thread to know whether it needs to use the > > (lock-protected) > > > > > member variable or call Thread::task_runner(). Unfortunately, getting > the > > > > > current thread-id can be a system call which means we probably shouldn't > > do > > > it > > > > > unless necessary. > > > > > > > > Fetching the thread id on Windows is cheap: it's stored in the Thread > > > > Environment Block, which is accessed via a segment register and doesn't > > > require > > > > a syscall. I took a look at glibc and it also does not need a syscall to > get > > > the > > > > thread id. > > > > > > Linux does a direct syscall(): > > > > > > https://cs.chromium.org/chromium/src/base/threading/platform_thread_posix.cc?... > > > > Ah, missed that the Linux implementation doesn't go through pthread_self(). 
> > > > Given that Linux syscall overhead is in the 10's to 100's of ns, and the task > > runner likely will be accessed at most a handful of times every 100ms, I think > > we can afford to pay this minimal overhead to make the code less tricky and > more > > robust to changes. > > I started down this path but it ends up requiring the lock every access. I > can't compare the current thread-id to the sampling thread's ID without it > waiting for that ID to be valid, which only happens after it has been started. > > But I don't want to start it until needed and the only way to tell if its needed > is to call IsRunning() or check the task_runner_ local variable to see if it's > set, both of which require a lock. > > Getting the thread's ID does an event-wait so that'll need to be cached and > locked as well, though it can probably share the same lock as task_runner_. > > Acquiring a lock isn't expensive but would be required with every sample and > it's not necessary when we already know we're running on the sampling thread. Yeah, it's probably not worth going to the extent of acquiring the lock on the profiler thread. > I think a comment is the better way to make things obvious to the casual reader. I think it would be better to create and use explicit functions for getting the task runner on either the sampling thread or on other threads (e.g. GetTaskRunnerOnOtherThread GetOrCreateTaskRunnerOnOtherThread, GetTaskRunnerOnOwnThread). That will still encapsulate the locking and will force reviewers and developers to consider the appropriate method for getting the task runner when making future changes. We should be able to DCHECK in these functions to validate correct usage as well.
https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam... base/profiler/stack_sampling_profiler.cc:400: bool success = task_runner()->PostDelayedTask( On 2017/01/05 21:08:39, Mike Wittman wrote: > On 2017/01/05 16:35:58, bcwhite wrote: > > On 2016/12/22 17:38:22, Mike Wittman wrote: > > > On 2016/12/22 16:12:10, bcwhite wrote: > > > > On 2016/12/21 19:38:41, Mike Wittman wrote: > > > > > On 2016/12/21 16:39:10, bcwhite wrote: > > > > > > On 2016/12/15 20:37:53, Mike Wittman wrote: > > > > > > > While the task_runner() accesses on the profiler thread don't need > to > > be > > > > > > guarded > > > > > > > by the lock, that won't be at all obvious to the casual reader. Can > we > > > > > > > encapsulate this subtlety within functions (e.g. > > > > > > > GetTaskRunner/GetOrCreateTaskRunner), handling the locking (if > > > necessary) > > > > > > there? > > > > > > > > > > > > Such a method would have to fetch the current thread-id and compare it > > to > > > > the > > > > > id > > > > > > of the sampling thread to know whether it needs to use the > > > (lock-protected) > > > > > > member variable or call Thread::task_runner(). Unfortunately, getting > > the > > > > > > current thread-id can be a system call which means we probably > shouldn't > > > do > > > > it > > > > > > unless necessary. > > > > > > > > > > Fetching the thread id on Windows is cheap: it's stored in the Thread > > > > > Environment Block, which is accessed via a segment register and doesn't > > > > require > > > > > a syscall. I took a look at glibc and it also does not need a syscall to > > get > > > > the > > > > > thread id. > > > > > > > > Linux does a direct syscall(): > > > > > > > > > > https://cs.chromium.org/chromium/src/base/threading/platform_thread_posix.cc?... 
> > > > > > Ah, missed that the Linux implementation doesn't go through pthread_self(). > > > > > > Given that Linux syscall overhead is in the 10's to 100's of ns, and the > task > > > runner likely will be accessed at most a handful of times every 100ms, I > think > > > we can afford to pay this minimal overhead to make the code less tricky and > > more > > > robust to changes. > > > > I started down this path but it ends up requiring the lock every access. I > > can't compare the current thread-id to the sampling thread's ID without it > > waiting for that ID to be valid, which only happens after it has been started. > > > > But I don't want to start it until needed and the only way to tell if its > needed > > is to call IsRunning() or check the task_runner_ local variable to see if it's > > set, both of which require a lock. > > > > Getting the thread's ID does an event-wait so that'll need to be cached and > > locked as well, though it can probably share the same lock as task_runner_. > > > > Acquiring a lock isn't expensive but would be required with every sample and > > it's not necessary when we already know we're running on the sampling thread. > > Yeah, it's probably not worth going to the extent of acquiring the lock on the > profiler thread. > > > I think a comment is the better way to make things obvious to the casual > reader. > > I think it would be better to create and use explicit functions for getting the > task runner on either the sampling thread or on other threads (e.g. > GetTaskRunnerOnOtherThread GetOrCreateTaskRunnerOnOtherThread, > GetTaskRunnerOnOwnThread). That will still encapsulate the locking and will > force reviewers and developers to consider the appropriate method for getting > the task runner when making future changes. We should be able to DCHECK in these > functions to validate correct usage as well. Trying this but there are issues. 
GetOrCreate isn't enough because Stop() needs to know the value without creating it -- we don't want to start the sampling thread there. Stop() could access task_runner_ directly (while locked), as it does now, but that means multiple methods accessing task_runner_, which is what creating the method was supposed to avoid. If I leave the "create" part in Add() then it would be one more method accessing task_runner_ directly. Given that only two methods access task_runner_ currently, there's no win here. Similarly, the GetFromSamplingThread() method ends up just a wrapper around Thread::task_runner() since the desired DCHECK is already in Thread::task_runner().
On 2017/01/05 22:04:22, bcwhite wrote: > https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam... > File base/profiler/stack_sampling_profiler.cc (right): > > https://codereview.chromium.org/2554123002/diff/60001/base/profiler/stack_sam... > base/profiler/stack_sampling_profiler.cc:400: bool success = > task_runner()->PostDelayedTask( > On 2017/01/05 21:08:39, Mike Wittman wrote: > > On 2017/01/05 16:35:58, bcwhite wrote: > > > On 2016/12/22 17:38:22, Mike Wittman wrote: > > > > On 2016/12/22 16:12:10, bcwhite wrote: > > > > > On 2016/12/21 19:38:41, Mike Wittman wrote: > > > > > > On 2016/12/21 16:39:10, bcwhite wrote: > > > > > > > On 2016/12/15 20:37:53, Mike Wittman wrote: > > > > > > > > While the task_runner() accesses on the profiler thread don't need > > to > > > be > > > > > > > guarded > > > > > > > > by the lock, that won't be at all obvious to the casual reader. > Can > > we > > > > > > > > encapsulate this subtlety within functions (e.g. > > > > > > > > GetTaskRunner/GetOrCreateTaskRunner), handling the locking (if > > > > necessary) > > > > > > > there? > > > > > > > > > > > > > > Such a method would have to fetch the current thread-id and compare > it > > > to > > > > > the > > > > > > id > > > > > > > of the sampling thread to know whether it needs to use the > > > > (lock-protected) > > > > > > > member variable or call Thread::task_runner(). Unfortunately, > getting > > > the > > > > > > > current thread-id can be a system call which means we probably > > shouldn't > > > > do > > > > > it > > > > > > > unless necessary. > > > > > > > > > > > > Fetching the thread id on Windows is cheap: it's stored in the Thread > > > > > > Environment Block, which is accessed via a segment register and > doesn't > > > > > require > > > > > > a syscall. I took a look at glibc and it also does not need a syscall > to > > > get > > > > > the > > > > > > thread id. 
> > > > > > > > > > Linux does a direct syscall(): > > > > > > > > > > > > > > > https://cs.chromium.org/chromium/src/base/threading/platform_thread_posix.cc?... > > > > > > > > Ah, missed that the Linux implementation doesn't go through > pthread_self(). > > > > > > > > Given that Linux syscall overhead is in the 10's to 100's of ns, and the > > task > > > > runner likely will be accessed at most a handful of times every 100ms, I > > think > > > > we can afford to pay this minimal overhead to make the code less tricky > and > > > more > > > > robust to changes. > > > > > > I started down this path but it ends up requiring the lock every access. I > > > can't compare the current thread-id to the sampling thread's ID without it > > > waiting for that ID to be valid, which only happens after it has been > started. > > > > > > But I don't want to start it until needed and the only way to tell if its > > needed > > > is to call IsRunning() or check the task_runner_ local variable to see if > it's > > > set, both of which require a lock. > > > > > > Getting the thread's ID does an event-wait so that'll need to be cached and > > > locked as well, though it can probably share the same lock as task_runner_. > > > > > > Acquiring a lock isn't expensive but would be required with every sample and > > > it's not necessary when we already know we're running on the sampling > thread. > > > > Yeah, it's probably not worth going to the extent of acquiring the lock on the > > profiler thread. > > > > > I think a comment is the better way to make things obvious to the casual > > reader. > > > > I think it would be better to create and use explicit functions for getting > the > > task runner on either the sampling thread or on other threads (e.g. > > GetTaskRunnerOnOtherThread GetOrCreateTaskRunnerOnOtherThread, > > GetTaskRunnerOnOwnThread). 
That will still encapsulate the locking and will > > force reviewers and developers to consider the appropriate method for getting > > the task runner when making future changes. We should be able to DCHECK in > these > > functions to validate correct usage as well. > > Trying this but there are issues. > > GetOrCreate isn't enough because Stop() needs to know the value without creating > -- we don't want to start the sampling thread there. It could access > task_runner_ directly (while locked) while it does now but that means multiple > methods accessing task_runner_ which is what creating the method was supposed to > avoid. I'm proposing two separate functions for those cases: GetOrCreateTaskRunnerOnOtherThread() called by Add(), and GetTaskRunnerOnOtherThread() called by Stop(). (With a third function GetTaskRunnerOnOwnThread() called by StartCollectionTask() and PerformCollectionTask().) > If I leave the "create" part in Add() then it would be one to access > task_runner_ directly. > > Given that only two methods access task_runner_ currently, there's no win here. The win here is in terms of code readability and maintainability. In particular: - it makes the mechanism for accessing the task runner and the associated constraints involved self-documenting in the code itself. This avoids the issue of the comments getting out of sync with the code. It also reduces the burden on future developers and reviewers for determining how to use the task runner correctly in new code: one can simply look at the class interface rather than dig through unrelated functions or base class headers to understand the appropriate constraints. - it encapsulates the locking in the smallest possible scope, making it obvious exactly which operations need to be protected by the lock. As it is now, it would not be clear to someone unfamiliar with the code whether the PostTask calls in Add() and Stop() require the lock be held. 
- it encapsulates the subtlety around the task runner access in functions dedicated to that task. Assuming the functions are invoked correctly according to their names, a reader of the code could verify that this subtle behavior is correct without having to read through all the code. If someone introduces new code accessing the task runner the wrong way, the consequence will be non-deterministic failures whose cause will be extraordinarily difficult to track down. So it's worth additional effort and complexity up-front to avoid this (this seems to be the general philosophy to multithreading issues across Chrome). > Similarly, the GetFromSamplingThread() method ends up just a wrapper around > Thread::task_runner() since the desired DCHECK is already in > Thread::task_runner(). That DCHECK appears to be verifying a looser condition than what we care about. In particular it looks like it will succeed if the message loop is running, regardless of which thread is invoking the function.
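A minimal sketch of the three-accessor shape proposed above, for illustration only. The function names come from the comment; everything else is an assumption: FakeTaskRunner stands in for the real task runner type, std::mutex stands in for the lock, and a thread_local pointer stands in for whatever the real code uses to know it is on the sampling thread. The assert calls play the role of the DCHECKs and check thread identity rather than just "the loop is running".

```cpp
#include <cassert>
#include <memory>
#include <mutex>

// Hypothetical stand-in for a task runner; not the real base:: type.
struct FakeTaskRunner {};

namespace {
// Non-null only on the sampling thread, set by that thread's startup code.
thread_local std::shared_ptr<FakeTaskRunner> tls_own_task_runner;
}  // namespace

class SamplingThreadSketch {
 public:
  // For Add(): called off the sampling thread; creates the runner on demand
  // (which stands in for starting the thread).
  std::shared_ptr<FakeTaskRunner> GetOrCreateTaskRunnerOnOtherThread() {
    assert(!tls_own_task_runner);  // Must not be on the sampling thread.
    std::lock_guard<std::mutex> hold(lock_);
    if (!task_runner_)
      task_runner_ = std::make_shared<FakeTaskRunner>();
    return task_runner_;
  }

  // For Stop(): never starts anything; returns null when not running.
  std::shared_ptr<FakeTaskRunner> GetTaskRunnerOnOtherThread() {
    assert(!tls_own_task_runner);  // Must not be on the sampling thread.
    std::lock_guard<std::mutex> hold(lock_);
    return task_runner_;
  }

  // For tasks already running on the sampling thread: no lock is needed
  // because only that thread ever reads or writes its thread-local copy.
  static std::shared_ptr<FakeTaskRunner> GetTaskRunnerOnOwnThread() {
    assert(tls_own_task_runner);  // Must be on the sampling thread.
    return tls_own_task_runner;
  }

  // Hypothetical hook: the first thing the sampling thread runs.
  static void InitOnSamplingThread(std::shared_ptr<FakeTaskRunner> runner) {
    tls_own_task_runner = std::move(runner);
  }

 private:
  std::mutex lock_;                              // Guards task_runner_.
  std::shared_ptr<FakeTaskRunner> task_runner_;  // Null until first Add().
};
```

The lock is taken only inside the two "OnOtherThread" accessors, so callers such as Add() and Stop() can post their tasks after the accessor returns without holding it.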
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
> > > I think it would be better to create and use explicit functions for getting > > the > > > task runner on either the sampling thread or on other threads (e.g. > > > GetTaskRunnerOnOtherThread GetOrCreateTaskRunnerOnOtherThread, > > > GetTaskRunnerOnOwnThread). That will still encapsulate the locking and will > > > force reviewers and developers to consider the appropriate method for > getting > > > the task runner when making future changes. We should be able to DCHECK in > > these > > > functions to validate correct usage as well. > > > > Trying this but there are issues. > > > > GetOrCreate isn't enough because Stop() needs to know the value without > creating > > -- we don't want to start the sampling thread there. It could access > > task_runner_ directly (while locked) while it does now but that means multiple > > methods accessing task_runner_ which is what creating the method was supposed > to > > avoid. > > I'm proposing two separate functions for those cases: > GetOrCreateTaskRunnerOnOtherThread() called by Add(), and > GetTaskRunnerOnOtherThread() called by Stop(). (With a third function > GetTaskRunnerOnOwnThread() called by StartCollectionTask() and > PerformCollectionTask().) > > > If I leave the "create" part in Add() then it would be one to access > > task_runner_ directly. > > > > Given that only two methods access task_runner_ currently, there's no win > here. > > The win here is in terms of code readability and maintainability. In particular: > > - it makes the mechanism for accessing the task runner and the associated > constraints involved self-documenting in the code itself. This avoids the issue > of the comments getting out of sync with the code. It also reduces the burden on > future developers and reviewers for determining how to use the task runner > correctly in new code: one can simply look at the class interface rather than > dig through unrelated functions or base class headers to understand the > appropriate constraints. 
> > - it encapsulates the locking in the smallest possible scope, making it obvious > exactly which operations need to be protected by the lock. As it is now, it > would not be clear to someone unfamiliar with the code whether the PostTask > calls in Add() and Stop() require the lock be held. > > - it encapsulates the subtlety around the task runner access in functions > dedicated to that task. Assuming the functions are invoked correctly according > to their names, a reader of the code could verify that this subtle behavior is > correct without having to read through all the code. > > If someone introduces new code accessing the task runner the wrong way, the > consequence will be non-deterministic failures whose cause will be > extraordinarily difficult to track down. So it's worth additional effort and > complexity up-front to avoid this (this seems to be the general philosophy to > multithreading issues across Chrome). I understand what you're saying, but I find it harder to read with helper methods than with full sentence comments. But done.
On 2017/01/06 15:32:59, bcwhite wrote: > > > > I think it would be better to create and use explicit functions for > getting > > > the > > > > task runner on either the sampling thread or on other threads (e.g. > > > > GetTaskRunnerOnOtherThread GetOrCreateTaskRunnerOnOtherThread, > > > > GetTaskRunnerOnOwnThread). That will still encapsulate the locking and > will > > > > force reviewers and developers to consider the appropriate method for > > getting > > > > the task runner when making future changes. We should be able to DCHECK in > > > these > > > > functions to validate correct usage as well. > > > > > > Trying this but there are issues. > > > > > > GetOrCreate isn't enough because Stop() needs to know the value without > > creating > > > -- we don't want to start the sampling thread there. It could access > > > task_runner_ directly (while locked) while it does now but that means > multiple > > > methods accessing task_runner_ which is what creating the method was > supposed > > to > > > avoid. > > > > I'm proposing two separate functions for those cases: > > GetOrCreateTaskRunnerOnOtherThread() called by Add(), and > > GetTaskRunnerOnOtherThread() called by Stop(). (With a third function > > GetTaskRunnerOnOwnThread() called by StartCollectionTask() and > > PerformCollectionTask().) > > > > > If I leave the "create" part in Add() then it would be one to access > > > task_runner_ directly. > > > > > > Given that only two methods access task_runner_ currently, there's no win > > here. > > > > The win here is in terms of code readability and maintainability. In > particular: > > > > - it makes the mechanism for accessing the task runner and the associated > > constraints involved self-documenting in the code itself. This avoids the > issue > > of the comments getting out of sync with the code. 
It also reduces the burden > on > > future developers and reviewers for determining how to use the task runner > > correctly in new code: one can simply look at the class interface rather than > > dig through unrelated functions or base class headers to understand the > > appropriate constraints. > > > > - it encapsulates the locking in the smallest possible scope, making it > obvious > > exactly which operations need to be protected by the lock. As it is now, it > > would not be clear to someone unfamiliar with the code whether the PostTask > > calls in Add() and Stop() require the lock be held. > > > > - it encapsulates the subtlety around the task runner access in functions > > dedicated to that task. Assuming the functions are invoked correctly according > > to their names, a reader of the code could verify that this subtle behavior is > > correct without having to read through all the code. > > > > If someone introduces new code accessing the task runner the wrong way, the > > consequence will be non-deterministic failures whose cause will be > > extraordinarily difficult to track down. So it's worth additional effort and > > complexity up-front to avoid this (this seems to be the general philosophy to > > multithreading issues across Chrome). > > I understand what you're saying, but I find it harder to read with helper > methods than with full sentence comments. > But done. Thanks. https://codereview.chromium.org/2554123002/diff/240001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/240001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:232: // Get tas krunner that is usable from the sampling thread itself. nit: task runner https://codereview.chromium.org/2554123002/diff/240001/base/profiler/stack_sa... 
base/profiler/stack_sampling_profiler.cc:318: task_runner_ = task_runner(); nit: Thread::task_runner() to be explicitly clear where this function is coming from https://codereview.chromium.org/2554123002/diff/240001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:339: StackSamplingProfiler::SamplingThread::GetTaskRunnerFromSamplingThread() { How about GetTaskRunnerOnSamplingThread? GetTaskRunnerFromSamplingThread is ambiguous since "SamplingThread" could refer to either the thread of execution or the SamplingThread class itself. https://codereview.chromium.org/2554123002/diff/240001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:344: return task_runner(); nit: Thread::task_runner() here also
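For readers following the accessor discussion above, here is a minimal sketch of the three-function split, assuming a plain std::mutex in place of the real lock. TaskRunnerSketch, SamplingThreadSketch, and start_count() are hypothetical names invented for illustration; this is not the code in the CL, and "starting the thread" is reduced to creating an object.

```cpp
#include <cassert>
#include <memory>
#include <mutex>

// Hypothetical stand-in for a task runner; not the Chromium type.
struct TaskRunnerSketch {};

class SamplingThreadSketch {
 public:
  // For Add(): returns the task runner, "starting" the thread if needed.
  std::shared_ptr<TaskRunnerSketch> GetOrCreateTaskRunnerOnOtherThread() {
    std::lock_guard<std::mutex> lock(lock_);
    if (!task_runner_) {
      task_runner_ = std::make_shared<TaskRunnerSketch>();
      ++start_count_;
    }
    return task_runner_;
  }

  // For Stop(): never starts the thread, so the result may be null.
  std::shared_ptr<TaskRunnerSketch> GetTaskRunnerOnOtherThread() {
    std::lock_guard<std::mutex> lock(lock_);
    return task_runner_;
  }

  // For tasks already running on the sampling thread. The assert plays the
  // role of a DCHECK: a task can only run there if the thread was started.
  std::shared_ptr<TaskRunnerSketch> GetTaskRunnerOnSamplingThread() {
    std::lock_guard<std::mutex> lock(lock_);
    assert(task_runner_ != nullptr);
    return task_runner_;
  }

  // Number of times the thread was "started"; exists only for this sketch.
  int start_count() {
    std::lock_guard<std::mutex> lock(lock_);
    return start_count_;
  }

 private:
  std::mutex lock_;  // Guards both members below.
  std::shared_ptr<TaskRunnerSketch> task_runner_;
  int start_count_ = 0;
};
```

The lock is taken only inside the accessors, so a caller can post to the returned runner without holding it, which is the "smallest possible scope" argument made in the review.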
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
https://codereview.chromium.org/2554123002/diff/240001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/240001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:232: // Get tas krunner that is usable from the sampling thread itself. On 2017/01/06 16:02:40, Mike Wittman wrote: > nit: task runner Done. https://codereview.chromium.org/2554123002/diff/240001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:318: task_runner_ = task_runner(); On 2017/01/06 16:02:40, Mike Wittman wrote: > nit: Thread::task_runner() to be explicitly clear where this function is coming > from Done. https://codereview.chromium.org/2554123002/diff/240001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:339: StackSamplingProfiler::SamplingThread::GetTaskRunnerFromSamplingThread() { On 2017/01/06 16:02:40, Mike Wittman wrote: > How about GetTaskRunnerOnSamplingThread? GetTaskRunnerFromSamplingThread is > ambiguous since "SamplingThread" could refer to either the thread of execution > or the SamplingThread class itself. Done. https://codereview.chromium.org/2554123002/diff/240001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:344: return task_runner(); On 2017/01/06 16:02:40, Mike Wittman wrote: > nit: Thread::task_runner() here also Done.
Thanks, at a high level I think this is looking good at this point. The major missing pieces that I see are: 1. Comprehensive tests of the new functionality. I will be surprised if these don't flush out issues that neither of us has anticipated. 2. Handling of thread lifetime issues, particularly profiled threads exiting while profiling is occurring, and correct behavior during application shutdown. Plus tests for these.
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: win_chromium_x64_rel_ng on master.tryserver.chromium.win (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_x64_...)
> 1. Comprehensive tests of the new functionality. I will be surprised if these > doesn't flush out issues that neither of us has anticipated. From your experience, what kinds of tests are required? Is there something missing in the existing tests that wouldn't catch differences in the sample timing, or start/stop? The most comprehensive test I can think of off hand would be two threads both with separate stacks of "dummy" methods. Sampling both simultaneously should result in total times that are reasonably consistent. Plus ensure that the stack samples themselves show only the methods for each specific thread. > 2. Handling of thread lifetime issues, particularly profiled threads exiting > while profiling is occurring, and correct behavior during application shutdown. > Plus tests for for these. That's something that is completely new, right? The new code should behave the same as the old code. In that case, I'd prefer to do it in a different CL.
On 2017/01/06 20:47:57, bcwhite wrote: > > 1. Comprehensive tests of the new functionality. I will be surprised if these > > doesn't flush out issues that neither of us has anticipated. > > From your experience, what kinds of tests are required? Is there something > missing in the existing tests that wouldn't catch differences in the sample > timing, or start/stop? > > The most comprehensive test I can think of off hand would be two threads both > with separate stacks of "dummy" methods. Sampling both simultaneously should > result in total times that are reasonably consistent. Plus ensure that the > stack samples themselves show only the methods for each specific thread. The main things I think need testing are: - correctness when profiling with multiple threads - proper handling of various overlappings of Start()/Stop()/destroy/thread exit events on the profiled thread with collection occurring/not occurring on the profiler thread - proper handling of various interleavings of Start()/Stop()/destroy/thread exit events on multiple profiled threads Checking for reasonably consistent times between collections across two threads should be done manually, but I'm not sure how easily this can be implemented in an automated fashion that doesn't flake on a test slave under load. (We already have some issues like this in the current tests: http://crbug.com/551939.) > > 2. Handling of thread lifetime issues, particularly profiled threads exiting > > while profiling is occurring, and correct behavior during application > shutdown. > > Plus tests for for these. > > That's something that is completely new, right? The new code should behave the > same as the old code. In that case, I'd prefer to do it in a different CL. No, this was handled before and is not handled in the current code. It needs to be supported before this CL can go in, otherwise the profiler will very likely crash on application shutdown. 
The existing implementation stops the profiling then joins the profiler thread on destruction of its profiler object, but that approach only works in a single threaded implementation. I think we'll need a new approach for the multithreaded implementation.
Two other things remaining to do also come to mind: Using a single stack buffer for all captures. This needs to be implemented before using the profiler on a second thread, so it doesn't strictly need to be in this CL, but it should be a fairly trivial change. Prepare for gradual roll-out. I'm gun-shy about enabling this large a change wholesale on canary given the history of unforeseen issues related to the profiling. We'll need to preserve the old implementation alongside the new implementation and choose which one to use at runtime, gradually increasing the enable percentage while validating that there is no additional stability impact. Creating the separate implementation can be done immediately prior to submit.
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: linux_chromium_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_...)
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: linux_chromium_chromeos_ozone_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_...)
Patchset #11 (id:280001) has been deleted
> Using a single stack buffer for all captures. This needs to be implemented > before using the profiler on a second thread, so it doesn't strictly need to be > in this CL, but it should be a fairly trivial change. https://codereview.chromium.org/2601633002/ > Prepare for gradual roll-out. I'm gunshy about enabling this large a change > wholesale on canary given the history of unforeseen issues related to the > profiling. We'll need to preserve the old implementation alongside the new > implementation and choose which one to use at runtime, gradually increasing the > enable percentage while validating that there is no additional stability impact. > Creating the separate implementation can be done immediately prior to submit. That sounds like a lot of effort to protect canary. If you're concerned, how about a Finch experiment to enable/disable it, so that it can be immediately turned off if there is a problem?
> No, this was handled before before and is not handled in the current code. It > needs to be supported before this CL can go in, otherwise the profiler will very > likely crash on application shutdown. > > The existing implementation stops the profiling then joins the profiler thread > on destruction of its profiler object, but that approach only works in a single > threaded implementation. I think we'll need a new approach for the multithreaded > implementation. I think there's going to have to be a hook in the shutdown() code to do this, as I don't expect the objects calling Stop() necessarily know if they're doing so because of a browser shutdown. Trying to stop and join the thread would be impossible because other sampling operations could be ongoing. As long as there are "async runner" samples supported, telling it to "join when finished" wouldn't work because those async samples could take an arbitrary amount of time. I'm thinking of adding a call to a new Shutdown() method in BrowserMainLoop::PreShutdown(). Seem reasonable?
On 2017/01/16 16:09:52, bcwhite wrote: > > No, this was handled before before and is not handled in the current code. It > > needs to be supported before this CL can go in, otherwise the profiler will > very > > likely crash on application shutdown. > > > > The existing implementation stops the profiling then joins the profiler thread > > on destruction of its profiler object, but that approach only works in a > single > > threaded implementation. I think we'll need a new approach for the > multithreaded > > implementation. > > I think there's going to have to be a hook in the shutdown() code to do this as > I don't expect the objects calling Stop()necessarily know if they're doing so > because of a browser shutdown. Trying to stop and join the thread would be > impossible because other sampling operations could be ongoing. > > As long as there are "async runner" samples supported, telling it to "join when > finished" wouldn't work because those async samples could take an arbitrary > amount of time. > > I'm thinking of adding a call to a new Shutdown() method in > BrowserMainLoop::PreShutdown(). Seem reasonable? Turns out there is a problem with this when the thread is started on-demand by whatever thread happens to want to do the sampling... Stop() can only be called by whatever thread called Start(). While it's possible to use StopSoon() from a posted task and have the thread stop, which would be sufficient for the true shutdown case, there's no way to restart it until after Stop() is called. That means that the same mechanism can't be used to halt the thread when it has nothing to do. I'm wondering if a task_runner_ for the main UI thread could be given to it that is somehow used to start/stop the thread. Or I abandon the Thread class and go back to PlatformThread::Delegate and manage my own message-loop... but that seems likely to run into the same complications. Thoughts?
On 2017/01/16 20:08:19, bcwhite wrote: > On 2017/01/16 16:09:52, bcwhite wrote: > > > No, this was handled before before and is not handled in the current code. > It > > > needs to be supported before this CL can go in, otherwise the profiler will > > very > > > likely crash on application shutdown. > > > > > > The existing implementation stops the profiling then joins the profiler > thread > > > on destruction of its profiler object, but that approach only works in a > > single > > > threaded implementation. I think we'll need a new approach for the > > multithreaded > > > implementation. > > > > I think there's going to have to be a hook in the shutdown() code to do this > as > > I don't expect the objects calling Stop()necessarily know if they're doing so > > because of a browser shutdown. Trying to stop and join the thread would be > > impossible because other sampling operations could be ongoing. > > > > As long as there are "async runner" samples supported, telling it to "join > when > > finished" wouldn't work because those async samples could take an arbitrary > > amount of time. > > > > I'm thinking of adding a call to a new Shutdown() method in > > BrowserMainLoop::PreShutdown(). Seem reasonable? > > Turns out there is a problem with this when the thread is started on-demand by > whatever thread happens to want to do the sampling... Stop() can only be called > by whatever thread called Start(). > > While it's possible to use StopSoon() from a posted task and have the thread > stop, which would be sufficient for the true shutdown case, there's no way to > restart it until after Stop() is called. That means that the same mechanism > can't be use to halt the thread when it has nothing to do. > > I'm wondering if a task_runner_ for the main UI thread could be given to it that > somehow used to start/stop the thread. > > Or I abandon the Thread class and go back to PlatformThread::Delegate and manage > my own message-loop... 
but that seems likely to run into the same complications. > > Thoughts? I'm not sure we actually need to stop the profiling thread pre-shutdown, as long as it doesn't delay or otherwise adversely impact process shutdown. I think the main issue is ensuring that the profiler does not attempt to profile threads after they have exited, which would result in access violations if the stack memory has been freed.
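One way to picture the constraint described above (never sample a thread after it has exited) is a registry that target threads leave before they finish, guarded by the same lock the sampler holds while sampling. This is only an illustrative sketch under that assumption; LiveThreadRegistrySketch and its integer tokens are hypothetical and are not the mechanism used in this CL.

```cpp
#include <cassert>
#include <mutex>
#include <set>

// Hypothetical registry of threads that are still safe to sample. Tokens are
// handed out by the embedder; they are deliberately not OS thread ids.
class LiveThreadRegistrySketch {
 public:
  void Register(int token) {
    std::lock_guard<std::mutex> lock(lock_);
    live_.insert(token);
  }

  // Called by the target thread before it exits. Because it takes the same
  // lock as SampleIfLive(), it waits for any in-progress sample to finish.
  void Unregister(int token) {
    std::lock_guard<std::mutex> lock(lock_);
    live_.erase(token);
  }

  // Runs `sample` only while the target is still registered. The lock is held
  // for the duration, so the target cannot complete Unregister() mid-sample.
  template <typename Fn>
  bool SampleIfLive(int token, Fn sample) {
    std::lock_guard<std::mutex> lock(lock_);
    if (live_.count(token) == 0)
      return false;
    sample();
    return true;
  }

 private:
  std::mutex lock_;
  std::set<int> live_;
};
```

The trade-off in this sketch is that a thread trying to exit may block briefly behind a sample in progress.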
On 2017/01/17 17:01:21, Mike Wittman wrote: > On 2017/01/16 20:08:19, bcwhite wrote: > > On 2017/01/16 16:09:52, bcwhite wrote: > > > > No, this was handled before before and is not handled in the current code. > > It > > > > needs to be supported before this CL can go in, otherwise the profiler > will > > > very > > > > likely crash on application shutdown. > > > > > > > > The existing implementation stops the profiling then joins the profiler > > thread > > > > on destruction of its profiler object, but that approach only works in a > > > single > > > > threaded implementation. I think we'll need a new approach for the > > > multithreaded > > > > implementation. > > > > > > I think there's going to have to be a hook in the shutdown() code to do this > > as > > > I don't expect the objects calling Stop()necessarily know if they're doing > so > > > because of a browser shutdown. Trying to stop and join the thread would be > > > impossible because other sampling operations could be ongoing. > > > > > > As long as there are "async runner" samples supported, telling it to "join > > when > > > finished" wouldn't work because those async samples could take an arbitrary > > > amount of time. > > > > > > I'm thinking of adding a call to a new Shutdown() method in > > > BrowserMainLoop::PreShutdown(). Seem reasonable? > > > > Turns out there is a problem with this when the thread is started on-demand by > > whatever thread happens to want to do the sampling... Stop() can only be > called > > by whatever thread called Start(). > > > > While it's possible to use StopSoon() from a posted task and have the thread > > stop, which would be sufficient for the true shutdown case, there's no way to > > restart it until after Stop() is called. That means that the same mechanism > > can't be use to halt the thread when it has nothing to do. > > > > I'm wondering if a task_runner_ for the main UI thread could be given to it > that > > somehow used to start/stop the thread. 
> > > > Or I abandon the Thread class and go back to PlatformThread::Delegate and > manage > > my own message-loop... but that seems likely to run into the same > complications. > > > > Thoughts? > > I'm not sure we actually need to stop the profiling thread pre-shutdown, as long > as it doesn't delay or otherwise adversely impact process shutdown. > > I think the main issue is ensuring that the profiler does not attempt to profile > threads after they have exited, which would result in access violations if the > stack memory has been freed. Shutdown isn't such a problem because I can do "StopSoon()" and let it go. The problem is that we also want this thread to stop and restart when necessary and that turns out to be complicated.
On 2017/01/17 17:07:41, bcwhite wrote: > On 2017/01/17 17:01:21, Mike Wittman wrote: > > On 2017/01/16 20:08:19, bcwhite wrote: > > > On 2017/01/16 16:09:52, bcwhite wrote: > > > > > No, this was handled before before and is not handled in the current > code. > > > It > > > > > needs to be supported before this CL can go in, otherwise the profiler > > will > > > > very > > > > > likely crash on application shutdown. > > > > > > > > > > The existing implementation stops the profiling then joins the profiler > > > thread > > > > > on destruction of its profiler object, but that approach only works in a > > > > single > > > > > threaded implementation. I think we'll need a new approach for the > > > > multithreaded > > > > > implementation. > > > > > > > > I think there's going to have to be a hook in the shutdown() code to do > this > > > as > > > > I don't expect the objects calling Stop()necessarily know if they're doing > > so > > > > because of a browser shutdown. Trying to stop and join the thread would > be > > > > impossible because other sampling operations could be ongoing. > > > > > > > > As long as there are "async runner" samples supported, telling it to "join > > > when > > > > finished" wouldn't work because those async samples could take an > arbitrary > > > > amount of time. > > > > > > > > I'm thinking of adding a call to a new Shutdown() method in > > > > BrowserMainLoop::PreShutdown(). Seem reasonable? > > > > > > Turns out there is a problem with this when the thread is started on-demand > by > > > whatever thread happens to want to do the sampling... Stop() can only be > > called > > > by whatever thread called Start(). > > > > > > While it's possible to use StopSoon() from a posted task and have the thread > > > stop, which would be sufficient for the true shutdown case, there's no way > to > > > restart it until after Stop() is called. That means that the same mechanism > > > can't be use to halt the thread when it has nothing to do. 
> > > > > > I'm wondering if a task_runner_ for the main UI thread could be given to it > > that > > > somehow used to start/stop the thread. > > > > > > Or I abandon the Thread class and go back to PlatformThread::Delegate and > > manage > > > my own message-loop... but that seems likely to run into the same > > complications. > > > > > > Thoughts? > > > > I'm not sure we actually need to stop the profiling thread pre-shutdown, as > long > > as it doesn't delay or otherwise adversely impact process shutdown. > > > > I think the main issue is ensuring that the profiler does not attempt to > profile > > threads after they have exited, which would result in access violations if the > > stack memory has been freed. > > Shutdown isn't such a problem because I can do "StopSoon()" and let it go. The > problem is that we also want this thread to stop and restart when necessary and > that turns out to be complicated. Having the UI thread start/stop the profiler thread seems like a reasonable fallback to me, if it's difficult or not possible to do so from arbitrary threads due to the thread affinity restrictions. Note that the profiled thread exit scenario needs to be handled independent of process shutdown.
On 2017/01/16 15:26:08, bcwhite wrote: > > Using a single stack buffer for all captures. This needs to be implemented > > before using the profiler on a second thread, so it doesn't strictly need to > be > > in this CL, but it should be a fairly trivial change. > > https://codereview.chromium.org/2601633002/ > > > > Prepare for gradual roll-out. I'm gunshy about enabling this large a change > > wholesale on canary given the history of unforeseen issues related to the > > profiling. We'll need to preserve the old implementation alongside the new > > implementation and choose which one to use at runtime, gradually increasing > the > > enable percentage while validating that there is no additional stability > impact. > > Creating the separate implementation can be done immediately prior to submit. > > That sounds like a lot of effort to protect canary. If you're concerned, how > about a Finch experiment to enable/disable it that it can be immediately turned > off if there is a problem. I wouldn't be that concerned, except that there's a history of bad interactions between the profiler and third party code, and the adverse impacts tend to be deadlocks or browser process crashes. It's not possible to use a Finch kill switch for this, unfortunately, because the profiler executes before metrics initialization. A second option would be to commit this very early in the day on a non-dev release day, then review browser test executions on the try bots throughout the day, reverting if there are any issues. If things looked clean there I think the risk in canary would be relatively low.
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Finally got a working start/stop solution that supports browser shutdown and idle shutdown. Still need to tie it into the browser lifetime and create some more tests but wanted to give you a chance to comment.
Patchset #12 (id:320001) has been deleted
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: mac_chromium_rel_ng on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_...)
On 2017/01/25 22:02:17, bcwhite wrote: > Finally got a working start/stop solution that supports browser shutdown and > idle shutdown. > > Still need to tie it into the browser lifetime and create some more tests but > wanted to give you a chance to comment. Nice. Haven't reviewed in detail yet, but I will take a closer look. A meta point: can we implement and review the support for the profiled thread exit scenario first, before addressing the shutdown behavior? (i.e. stopping the profiling of a thread that exits during the normal course of Chrome operation.) The thread exit scenario needs to be supported regardless, and I suspect the shutdown behavior may be partially or fully implementable in terms of the thread exit support. Doing it in this order will be easier to review and probably will lead to less complicated code in the end.
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: mac_chromium_rel_ng on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_...)
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: linux_chromium_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_...)
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Patchset #13 (id:360001) has been deleted
> A meta point: can we implement and review the support for the profiled thread > exit scenario first, before addressing the shutdown behavior? (i.e. stopping the > profiling of a thread that exits during the normal course of Chrome operation.) Support added. Test added.
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: mac_chromium_rel_ng on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_...)
https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that the thread under test is gone. I don't think relying on SuspendThread to error out on terminated threads is a viable mechanism for handling thread exit. Thread ids are reused on Windows so there's no guarantee that another thread won't have been started with the same id. I suspect we'll need some kind of formal synchronization between the target threads and the profiler thread to coordinate thread exit. Also, independent of that, an empty result here could happen for many other reasons -- the stack pointer pointing to a guard page for example. https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:899: PlatformThread::Sleep(TimeDelta::FromSeconds(3)); Coordinating threads via sleep will cause this test to be flaky when run under load. We should do proper coordination via WaitableEvents to guarantee the expected test behavior. I think this should be possible by adding calls into the TestDelegate within the profiler at appropriate coordination points, and supplying a TestDelegate implementation in the test that waits for the expected events (e.g. thread has terminated).
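To illustrate the event-based coordination suggested above, here is a minimal sketch that replaces a fixed sleep with a blocking wait. It assumes standard C++ primitives rather than base::WaitableEvent; EventSketch, TestDelegateSketch, OnTargetThreadGone(), and RunTargetAndWaitForExitNotice() are hypothetical names, and the "profiler" thread is simulated by a thread that joins the target.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Minimal manual-reset event built on the standard library.
class EventSketch {
 public:
  void Signal() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      signaled_ = true;
    }
    cv_.notify_all();
  }

  // Blocks until Signal() has been called; no timing assumptions.
  void Wait() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return signaled_; });
  }

  bool IsSignaled() {
    std::lock_guard<std::mutex> lock(mutex_);
    return signaled_;
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool signaled_ = false;
};

// Hypothetical test hook that the profiler would call at a coordination point.
struct TestDelegateSketch {
  EventSketch target_gone;
  void OnTargetThreadGone() { target_gone.Signal(); }
};

// Returns true once the test has observed the "target gone" notification.
bool RunTargetAndWaitForExitNotice() {
  TestDelegateSketch delegate;
  std::thread target([] { /* work being profiled would run here */ });
  std::thread profiler([&target, &delegate] {
    target.join();  // Stand-in for the profiler learning the target exited.
    delegate.OnTargetThreadGone();
  });
  delegate.target_gone.Wait();  // Replaces a fixed-length sleep in the test.
  profiler.join();
  return delegate.target_gone.IsSignaled();
}
```

Because the test blocks until the notification arrives, its outcome does not depend on how heavily loaded the machine is.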
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that the thread under test is gone. > I don't think relying on SuspendThread to error out on terminated threads is a > viable mechanism for handling thread exit. Thread ids are reused on Windows so > there's no guarantee that another thread won't have been started with the same > id. I suspect we'll need some kind of formal synchronization between the target > threads and the profiler thread to coordinate thread exit. While reuse of thread-ids is possible, I don't think it's a concern: 1) The thread has to exit and the ID reused relatively quickly. 2) The presence of a foreign stack-frame in the data would be obvious and easily dismissed. 3) It's non-trivial (at best) to have an outside, independent watcher learn when a thread exits. I believe it's not worth addressing this until it proves to be a real problem. > Also, independent of that, an empty result here could happen for many other > reasons -- the stack pointer pointing to a guard page for example. Makes sense. I'll make the native sampler record information from the last sample attempt. https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:899: PlatformThread::Sleep(TimeDelta::FromSeconds(3)); On 2017/01/31 22:14:27, Mike Wittman wrote: > Coordinating threads via sleep will cause this test to be flaky when run under > load. We should do proper coordination via WaitableEvents to guarantee the > expected test behavior. 
I think this should be possible by adding calls into the > TestDelegate within the profiler at appropriate coordination points, and > supplying a TestDelegate implementation in the test that waits for the expected > events (e.g. thread has terminated). I'll look at that.
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: mac_chromium_rel_ng on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_...)
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:899: PlatformThread::Sleep(TimeDelta::FromSeconds(3)); On 2017/01/31 22:14:27, Mike Wittman wrote: > Coordinating threads via sleep will cause this test to be flaky when run under > load. We should do proper coordination via WaitableEvents to guarantee the > expected test behavior. I think this should be possible by adding calls into the > TestDelegate within the profiler at appropriate coordination points, and > supplying a TestDelegate implementation in the test that waits for the expected > events (e.g. thread has terminated). Done.
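A minimal sketch of the event-based coordination suggested above, assuming only the standard library. `SimpleEvent` is a hypothetical stand-in for base::WaitableEvent, and the `OnTargetThreadExited` hook on `TestDelegate` is an illustrative name, not the interface in the patch. The point is that the test blocks on a signal instead of sleeping for a fixed time.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for base::WaitableEvent, built only on the standard
// library so the sketch is self-contained.
class SimpleEvent {
 public:
  void Signal() {
    std::lock_guard<std::mutex> lock(mutex_);
    signaled_ = true;
    cv_.notify_all();  // Notify under the lock so a waiter cannot race ahead.
  }
  void Wait() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return signaled_; });
  }
  bool IsSignaled() {
    std::lock_guard<std::mutex> lock(mutex_);
    return signaled_;
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool signaled_ = false;
};

// Hypothetical hook interface; the real TestDelegate may look different.
class TestDelegate {
 public:
  virtual ~TestDelegate() = default;
  virtual void OnTargetThreadExited() = 0;
};

// Test-side implementation that lets the test wait for the expected event.
class WaitingTestDelegate : public TestDelegate {
 public:
  void OnTargetThreadExited() override { exited_.Signal(); }
  void WaitForTargetThreadExit() { exited_.Wait(); }
  bool HasExited() { return exited_.IsSignaled(); }

 private:
  SimpleEvent exited_;
};

bool RunTargetAndWaitForExit() {
  WaitingTestDelegate delegate;
  std::thread target([] { /* short-lived thread under test */ });
  // Stands in for the profiler noticing the exit and calling the delegate.
  std::thread observer([&target, &delegate] {
    target.join();
    delegate.OnTargetThreadExited();
  });
  // Blocks until the event is signalled; no fixed-length sleep is involved.
  delegate.WaitForTargetThreadExit();
  observer.join();
  assert(delegate.HasExited());
  return delegate.HasExited();
}
```

Because the wait returns only after the signal, the test outcome does not depend on how loaded the machine is.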
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Patchset #18 (id:480001) has been deleted
Dry run: Try jobs failed on following builders: ios-device on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-device/builds...) ios-device-xcode-clang on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-device-xcode-...) ios-simulator-xcode-clang on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-simulator-xco...)
https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that the thread under test is gone. On 2017/02/01 14:47:29, bcwhite wrote: > > I don't think relying on SuspendThread to error out on terminated threads is a > > viable mechanism for handling thread exit. Thread ids are reused on Windows so > > there's no guarantee that another thread won't have been started with the same > > id. I suspect we'll need some kind of formal synchronization between the > target > > threads and the profiler thread to coordinate thread exit. > > While reuse of thread-ids is possible, I don't think it's a concern: > > 1) The thread has to exit and the ID reused relatively quickly. > 2) The presence of a foreign stack-frame in the data would be obvious and easily > dismissed. > 3) It's non-trivial (at best) to have an outside, independent watcher learn when > a thread exits. > > I believe it's not worth addressing this until it proves to be a real problem. I think this is a serious concern, and requires a solution that we have confidence in up front. Since the profiler effectively controls the execution of the entire rest of Chrome, it's imperative that it be as bulletproof as possible. Avoiding non-deterministic failure modes is absolutely essential because the resulting failures will be difficult to notice and next to impossible to investigate effectively. 100ms is a pretty huge window in system execution terms. If the profiled thread exits, hundreds if not thousands of thread creations could occur before the next attempted sample, any of which could reuse the id. If a thread in another process claims the id, then it's not clear what will happen. 
Worst case scenario would be that we succeed in suspending the thread, only to crash while trying to copy the stack. That would likely deadlock some random innocent process on the system, a nasty scenario that we should be avoiding at all costs. If a thread in Chrome claims the id, then the profiler will happily continue profiling the other thread, silently generating wrong data. There would be no way to reliably detect this scenario in the data processing. I don't think we need an independent thread watcher, since we can rely on the StackSamplingProfiler's destructor being called before thread exit. Straw man proposal: put a WaitableEvent "profiling_stopped" in the CollectionContext. Thread a "profiler_destroying" flag through to the RemoveCollectionTask from the StackSamplingProfiler destructor. After posting the RemoveCollectionTask in the target thread, wait on the profiling_stopped event. When executing the RemoveCollectionTask in the profiler thread, signal the profiling_stopped event if the profiler_destroying flag is present. I believe this would ensure the target thread has not exited until the profiler is finished with it.
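The straw man proposal above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the patch: `SimpleEvent` stands in for base::WaitableEvent, `TaskThread` is a hypothetical single-thread task queue standing in for the sampling thread, and the `collecting` field is invented for the example. Only the names `profiling_stopped`, `profiler_destroying`, `CollectionContext` and `RemoveCollectionTask` come from the comment.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical stand-in for base::WaitableEvent.
class SimpleEvent {
 public:
  void Signal() {
    std::lock_guard<std::mutex> lock(mutex_);
    signaled_ = true;
    cv_.notify_all();
  }
  void Wait() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return signaled_; });
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool signaled_ = false;
};

// Hypothetical stand-in for the sampling thread: runs posted tasks in order.
class TaskThread {
 public:
  TaskThread() : thread_([this] { Run(); }) {}
  ~TaskThread() {
    PostTask(nullptr);  // An empty task tells the loop to quit.
    thread_.join();
  }
  void PostTask(std::function<void()> task) {
    std::lock_guard<std::mutex> lock(mutex_);
    tasks_.push(std::move(task));
    cv_.notify_one();
  }

 private:
  void Run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !tasks_.empty(); });
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      if (!task) return;
      task();
    }
  }
  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  std::thread thread_;  // Declared last so the other members exist first.
};

struct CollectionContext {
  SimpleEvent profiling_stopped;
  bool collecting = true;  // Illustrative state, not from the patch.
};

// Runs on the sampling thread; signals only when the profiler is going away.
void RemoveCollectionTask(CollectionContext* context, bool profiler_destroying) {
  context->collecting = false;
  if (profiler_destroying) context->profiling_stopped.Signal();
}

class Profiler {
 public:
  Profiler(TaskThread* sampling_thread, CollectionContext* context)
      : sampling_thread_(sampling_thread), context_(context) {}
  ~Profiler() {
    CollectionContext* context = context_;
    sampling_thread_->PostTask(
        [context] { RemoveCollectionTask(context, /*profiler_destroying=*/true); });
    // The destroying (target) thread cannot proceed to exit until the
    // sampling thread has finished with this collection.
    context_->profiling_stopped.Wait();
  }

 private:
  TaskThread* sampling_thread_;
  CollectionContext* context_;
};

bool DestroyProfilerAndCheckStopped() {
  CollectionContext context;
  TaskThread sampling_thread;
  { Profiler profiler(&sampling_thread, &context); }  // Destructor blocks here.
  assert(!context.collecting);
  return !context.collecting;
}
```

The guarantee in this sketch holds only if the profiler object is destroyed on, or on behalf of, the thread being sampled, which is the restriction discussed in the thread.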
Patchset #16 (id:440001) has been deleted
https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that the thread under test is gone. > I don't think we need an independent thread watcher, since we can rely on the > StackSamplingProfiler's destructor being called before thread exit. I don't understand. The StackSamplingProfiler lifetime is completely independent of any thread it might be sampling. > Straw man proposal: put a WaitableEvent "profiling_stopped" in the > CollectionContext. Thread a "profiler_destroying" flag through to the > RemoveCollectionTask from the StackSamplingProfiler destructor. After posting > the RemoveCollectionTask in the target thread, wait on the profiling_stopped > event. When executing the RemoveCollectionTask in the profiler thread, signal > the profiling_stopped event if the profiler_destroying flag is present. > > I believe this would ensure the target thread has not exited until the profiler > is finished with it. I'm confused. An independent thread can exit at any time, right? How about this for a simpler technique: When the native sampler is created, use GetThreadTimes() to get lpCreationTime. Before each sample, do the same. If the creation-time changes, it must be a different thread. In addition, there is an lpExitTime that would determine if the thread has exited (but not yet been reaped).
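A minimal, platform-neutral sketch of the creation-time comparison described above. `ThreadSnapshot`, `ThreadCheck` and `CheckThread` are hypothetical names introduced for illustration. On Windows the snapshot would presumably be filled in from GetThreadTimes() (lpCreationTime and lpExitTime), but how the platform layer obtains these values is outside the sketch and is represented here by plain fields.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical snapshot of a thread's identity as seen by the sampler.
struct ThreadSnapshot {
  std::uint64_t creation_time;  // Opaque timestamp supplied by the platform.
  bool has_exited;              // True if the platform reports an exit.
};

enum class ThreadCheck { kSameThread, kExited, kDifferentThread };

// Compares the snapshot recorded when the sampler was created against one
// taken around a sample attempt.
ThreadCheck CheckThread(const ThreadSnapshot& recorded,
                        const ThreadSnapshot& current) {
  // The recorded snapshot is assumed to describe a live thread.
  assert(!recorded.has_exited);
  // A different creation time means the id now names a different thread.
  if (current.creation_time != recorded.creation_time)
    return ThreadCheck::kDifferentThread;
  // Same thread, but it has exited and has not yet been reaped.
  if (current.has_exited) return ThreadCheck::kExited;
  return ThreadCheck::kSameThread;
}
```

The comparison itself is trivial; the open question debated in the rest of the thread is when the second snapshot is taken relative to suspending the thread.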
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: mac_chromium_rel_ng on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_...)
https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that the thread under test is gone. On 2017/02/01 19:11:16, bcwhite wrote: > > I don't think we need an independent thread watcher, since we can rely on the > > StackSamplingProfiler's destructor being called before thread exit. > > I don't understand. The StackSamplingProfiler lifetime is completely > independent of any thread it might be sampling. That's true, the current interface allows an arbitrary thread id to be supplied for profiling. We'd need to restrict the profiler to working on the thread where it's created to rely on this behavior. This is probably a reasonable trade-off to make though, considering the anticipated use cases in the new thread scheduler world. I think the profiler will be used either for self profiling, or for profiling directed by the thread scheduler. In the former case the StackSamplingProfiler would be allocated on the thread's stack. In the latter case, the thread scheduler will know when threads exit and can coordinate with the profiler internals via some to-be-defined mechanism. > > Straw man proposal: put a WaitableEvent "profiling_stopped" in the > > CollectionContext. Thread a "profiler_destroying" flag through to the > > RemoveCollectionTask from the StackSamplingProfiler destructor. After posting > > the RemoveCollectionTask in the target thread, wait on the profiling_stopped > > event. When executing the RemoveCollectionTask in the profiler thread, signal > > the profiling_stopped event if the profiler_destroying flag is present. > > > > I believe this would ensure the target thread has not exited until the > profiler > > is finished with it. > > I'm confused. An independent thread can exit at any time, right?
Chrome threads can exit by quitting the message loop. I believe directly exiting threads is not supported in Chrome because it doesn't run destructors or do other necessary cleanup. In the case of threads managed by the thread scheduler, the scheduler itself will be responsible for thread exit. > How about this for a simpler technique: When the native sampler is created, use > GetThreadTimes() to get lpCreationTime. Before each sample, do the same. If > the creation-time changes, it must be a different thread. In addition, there is > an lpExitTime that would determine if the thread has exited (but not yet been > reaped). This would dramatically reduce the window of vulnerability, but I don't think it could prevent this situation from occurring. GetThreadTimes() is an inherently racy API; there will always be some window between the time that GetThreadTimes() is invoked and the actions taken as a result, during which time threads could be destroyed or created.
https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that the thread under test is gone. > This would dramatically reduce the window of vulnerability, but I don't think it > could prevent this situation from occurring. GetThreadTimes() is an inherently > racy API; there will always be some window between the time that > GetThreadTimes() is invoked and the actions taken as a result, during which time > threads could be destroyed or created. If the call were made while the thread was suspended, there wouldn't be any race. But to avoid being suspended any longer than necessary, the check could be done after the acquisition. On the incredibly slim chance that the thread died and was replaced with an identical ID in those few ns, the worst that would happen is that the last sample would get discarded unnecessarily.
https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that the thread under test is gone. On 2017/02/01 20:37:59, bcwhite wrote: > > This would dramatically reduce the window of vulnerability, but I don't think > it > > could prevent this situation from occurring. GetThreadTimes() is an inherently > > racy API; there will always be some window between the time that > > GetThreadTimes() is invoked and the actions taken as a result, during which > time > > threads could be destroyed or created. > > If the call were made while the thread was suspended, there wouldn't be any > race. > > But to avoid being suspended any longer than necessary, the check could be done > after the acquisition. On the incredibly slim chance that the thread died and > was replaced with an identical ID in those few ns, the worst that would happen > is that the last sample would get discarded unnecessarily. There would still be a race between the time the thread id was provided to the profiler and the first time the thread was suspended. (At least that one, there may be others too.) It's really difficult to be sure that all relevant thread interleaving scenarios have been considered with an API like this. And even if they have, it will be significantly difficult for other developers to validate the correctness of the resulting code. If and when some future developer makes changes here, there's a small chance they will understand the subtleties sufficiently to avoid introducing races. I'm generally not comfortable with a solution here that requires either analyzing races away or winning them at runtime, if there is a viable non-racy alternative.
https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that the thread under test is gone. > > > This would dramatically reduce the window of vulnerability, but I don't > think > > it > > > could prevent this situation from occurring. GetThreadTimes() is an > inherently > > > racy API; there will always be some window between the time that > > > GetThreadTimes() is invoked and the actions taken as a result, during which > > time > > > threads could be destroyed or created. > > > > If the call were made while the thread was suspended, there wouldn't be any > > race. > > > > But to avoid being suspended any longer than necessary, the check could be > done > > after the acquisition. On the incredibly slim chance that the thread died and > > was replaced with an identical ID in those few ns, the worst that would happen > > is that the last sample would get discarded unnecessarily. > > There would still be a race between the time the thread id was provided to the > profiler and the first time the thread was suspended. (At least that one, there > may be others too.) The thread creation time would be captured during the ctor of the StackSamplingProfiler so at a known time and from a known thread. The thread under test could still die and be replaced in that time but that's a race outside of this module. It's up to the caller to ensure that the thread it wants to profile is still alive when the ctor returns, before Start is called, something it has the chance of doing because it has more knowledge. It'll be possible to verify the thread even on the very first sampling attempt. > It's really difficult to be sure that all relevant thread interleaving scenarios > have been considered with an API like this. 
And even if they have, it will be > significantly difficult for other developers to validate the correctness of the > resulting code. If and when some future developer makes changes here, there's a > small chance they will understand the subtleties sufficiently to avoid > introducing races. Any solution is going to have potential race conditions to verify but this at least is simple and easy to follow.
https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that the thread under test is gone. On 2017/02/01 22:01:47, bcwhite wrote: > > > > This would dramatically reduce the window of vulnerability, but I don't > > think > > > it > > > > could prevent this situation from occurring. GetThreadTimes() is an > > inherently > > > > racy API; there will always be some window between the time that > > > > GetThreadTimes() is invoked and the actions taken as a result, during > which > > > time > > > > threads could be destroyed or created. > > > > > > If the call were made while the thread was suspended, there wouldn't be any > > > race. > > > > > > But to avoid being suspended any longer than necessary, the check could be > > done > > > after the acquisition. On the incredibly slim chance that the thread died > and > > > was replaced with an identical ID in those few ns, the worst that would > happen > > > is that the last sample would get discarded unnecessarily. > > > > There would still be a race between the time the thread id was provided to the > > profiler and the first time the thread was suspended. (At least that one, > there > > may be others too.) > > The thread creation time would be captured during the ctor of the > StackSamplingProfiler so at a known time and from a known thread. > > The thread under test could still die and be replaced in that time but that's a > race outside of this module. It's up to the caller to ensure that the thread it > wants to profile is still alive when the ctor returns, before Start is called, > something it has the chance of doing because it has more knowledge. > > It'll be possible to verify the thread even on the very first sampling attempt. 
I am not convinced that we've enumerated and addressed all the possible races here, and I am skeptical that this is realistically possible given the dependence on Win32 implementation details. Take GetThreadTimes() for example: your analysis assumes that this executes in a short time and that the values it returns reflect some relatively current state of reality. Neither of these is guaranteed to be true, and even if they are now, the behavior could change in the future. There are probably other assumptions we're both making about how this call and SuspendThread work that may be invalid. Depending on undocumented behavior is risky and should be avoided where possible. An entirely separate can of worms is cross-platform support. GetThreadTimes() is a Win32 API. Even if there are equivalent APIs on OS X, iOS, Linux, and Android, there's basically zero chance we can depend on winning the same races, consistently, on every one of those platforms now and in the future. It's also unknown if the SuspendThread-equivalent will reliably tell us if the thread was terminated (this is true for Windows too for that matter). If GetThreadTimes() and its other-platform equivalents take locks then they cannot be called while the thread is suspended, making races unavoidable. > Any solution is going to have potential race conditions to verify but this at > least is simple and easy to follow. As far as I'm aware the strawman proposal I mentioned has no race conditions due to the use of established synchronization primitives. It also depends solely on cross-platform interfaces. Given all the issues with the SuspendThread/GetThreadTimes approach I'm not OK moving forward with it for handling thread exit. We need a solution that guarantees correct behavior in all cases.
https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that the thread under test is gone. > I am not convinced that we've enumerated and addressed all the possible races > here, and I am skeptical that this is realistically possible given the > dependence on Win32 implementation details. Take GetThreadTimes() for example: > your analysis assumes that this executes in a short time and that the values it > returns reflect some relatively current state of reality. Neither of these is > guaranteed to be true, and even if they are now the behavior could change in the > future. There are probably other assumptions we're both making about how this > call and SuspendThread work that may be invalid. Depending on undocumented > behavior is risky and should be avoided where possible. Undocumented? GetThreadTimes is a published and supported API of which the only behavior we're looking at is the reported creation time. https://msdn.microsoft.com/en-us/library/windows/desktop/ms683237(v=vs.85).aspx We can't guarantee the time and operation of future Chrome changes, either, but since GetThreadTimes() is a published API upon which thousands of applications likely depend, I don't see it changing in any significant way. And it really doesn't matter if it's not exceptionally quick (though it likely is) since it'll be running on the sampling thread after the sampled thread has been resumed. As you've said, 100s of ms is a lot of time. Even if there were many concurrent profiles being collected, it's not going to be significant compared to the existing activities of stopping a thread, copying its stack, and decoding it. > An entirely separate can of worms is cross-platform support. GetThreadTimes() is > a Win32 API.
Even if there are equivalent APIs on OS X, iOS, Linux, and Android, > there's basically zero chance we can depend on winning the same races, > consistently, on every one of those platforms now and in the future. It's also > unknown if the SuspendThread-equivalent will reliably tell us if the thread was > terminated (this is true for Windows too for that matter). Cross-platform is already a can of worms. Recording and checking the thread-ID would live in the platform-specific NativeStackSamplerWin class. When (if?) other platform support gets added, those classes can use whatever is appropriate for them. Or if nothing works, then a more complex solution can be investigated. There's no benefit in trying to code specifically for them in advance. > If GetThreadTimes() and its other-platform equivalents take locks then they > cannot be called while the thread is suspended, making races unavoidable. I wouldn't do it while suspended anyway for reasons I mentioned previously. > > Any solution is going to have potential race conditions to verify but this at > > least is simple and easy to follow. > > As far as I'm aware the strawman proposal I mentioned has no race conditions due > to the use of established synchronization primitives. It also depends solely on > cross-platform interfaces. It does. I don't know what they are, but I'm sure they're there. Managing the start/stop of a thread proved to be insanely difficult. But even if I'm wrong, the proposal is far more complex and difficult to understand than this simple, self-contained solution. The proposal also makes assumptions about the threads under test, something that may prove limiting in the future. Somebody is bound to want to trace a PlatformThread without a message-loop at some point. > Given all the issues with the SuspendThread/GetThreadTimes approach I'm not OK > moving forward with it for handling thread exit. We need a solution that > guarantees correct behavior in all cases. No, you don't. 
You need to be sure it won't crash but other than that, you just need a solution that has a signal-to-noise ratio sufficient to analyze the data; I see very little, if any, noise coming from this. We can't let "perfect" be the enemy of the "good". This is a simple solution and can be implemented quickly and cleanly. We should do it. If it proves to be untenable in the field, then we can investigate more complicated methods.
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: linux_chromium_chromeos_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_...)
https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that the thread under test is gone. On 2017/02/02 14:24:25, bcwhite wrote: > > I am not convinced that we've enumerated and addressed all the possible races > > here, and I am skeptical that this is realistically possible given the > > dependence on Win32 implementation details. Take GetThreadTimes() for example: > > your analysis assumes that this executes in a short time and that the values > it > > returns reflect some relatively current state of reality. Neither of these is > > guaranteed to be true, and even if they are now the behavior could change in > the > > future. There are probably other assumptions we're both making about how this > > call and SuspendThread work that may be invalid. Depending on undocumented > > behavior is risky and should be avoided where possible. > > Undocumented? GetThreadTimes is a published and supported API of which the only > behavior we're looking at is the reported creation time. > https://msdn.microsoft.com/en-us/library/windows/desktop/ms683237(v=vs.85).aspx > > We can't guarantee the time and operation of future Chrome changes, either, but > since GetThreadTimes() is a published API upon which thousands of applications > likely depend, I don't see it changing in any significant way. And it really > doesn't matter if it's not exceptionally quick (though it likely is) since it'll > be running on the sampling thread after the sampled thread has been resumed. As > you've said, 100s of ms is a lot of time. Even if there were many concurrent > profiles being collected, it's not going to be significant compared to the existing > activities of stopping a thread, copying its stack, and decoding it. > Several points: 1.
Running the check after the sampling has already happened still leaves a 100ms race window, and still allows the failure scenarios I mentioned in comment #139. 2. The point I was trying, inelegantly, to make above is that Win32 implementation details affect the length of the race window in ways which are difficult to predict. 3. All it takes for this approach to go sideways, due to SuspendThread operating on a different thread than GetThreadTimes, is one ill-timed context switch on the profiler thread within the race window. The length of the window just makes the probability of hitting this case more or less likely. 4. Given the number of Chrome users and the number of times this code is run, events with even a vanishingly small probability of occurring will occur reliably over the population. A one-in-a-million event during profiling will occur hundreds of times per day, just over the population of canary and dev users. The guard page check in the code is there to handle a case that occurs with a probability of around 1 in 10,000,000, and was generating a non-negligible number of crash reports. > > An entirely separate can of worms is cross-platform support. GetThreadTimes() > is > > a Win32 API. Even if there are equivalent APIs on OS X, iOS, Linux, and > Android, > > there's basically zero chance we can depend on winning the same races, > > consistently, on every one of those platforms now and in the future. It's also > > unknown if the SuspendThread-equivalent will reliably tell us if the thread > was > > terminated (this is true for Windows too for that matter). > > Cross-platform is already a can of worms. Recording and checking the thread-ID > would live in the platform-specific NativeStackSamplerWin class. When (if?) > other platform support gets added, those classes can use whatever is appropriate > for them. Or if nothing works, then a more complex solution can be > investigated. There's no benefit in trying to code specifically for them in > advance. 
The OS X implementation is in progress; one of the Mac developers has already started working on it. > > If GetThreadTimes() and its other-platform equivalents take locks then they > > cannot be called while the thread is suspended, making races unavoidable. > > I wouldn't do it while suspended anyway for reasons I mentioned previously. > > > > > Any solution is going to have potential race conditions to verify but this > at > > > least is simple and easy to follow. > > > > As far as I'm aware the strawman proposal I mentioned has no race conditions > due > > to the use of established synchronization primitives. It also depends solely > on > > cross-platform interfaces. > > It does. I don't know what they are, but I'm sure they're there. Managing the > start/stop of a thread proved to be insanely difficult. But even if I'm wrong, > the proposal is far more complex and difficult to understand than this simple, > self-contained solution. The proposal also makes assumptions about the threads > under test, something that may prove limiting in the future. Somebody is bound > to want to trace a PlatformThread without a message-loop at some point. The strawman proposal uses standard Chrome synchronization primitives and would be significantly easier to understand by the average Chrome developer than the use of Win32 APIs. Effectively the only restrictions it places on the profiled threads is that, if they are being profiled from a thread other than themselves, that thread must be responsible for ensuring the profiled thread outlives the profiling. There's no need for the profiled thread to have a message loop. > > Given all the issues with the SuspendThread/GetThreadTimes approach I'm not OK > > moving forward with it for handling thread exit. We need a solution that > > guarantees correct behavior in all cases. > > No, you don't. 
You need to be sure it won't crash but other than that, you just > need a solution that has a signal-to-noise ratio sufficient to analyze the data; > I see very little, if any, noise coming from this. We can't let "perfect" be > the enemy of the "good". > > This is a simple solution and can be implemented quickly and cleanly. We should > do it. If it proves to be untenable in the field, then we can investigate more > complicated methods. I am not sure it won't crash and I haven't seen sufficient justification in this thread for why it won't crash. I've even outlined a possible scenario where not only will it crash, but it will deadlock other processes on the system. This would not only be incredibly poor behavior, but potentially a huge PR black eye ("Chrome is so unstable it crashes my other applications too!").
Summarizing my objections to this approach:
- it's platform-specific, and it's highly uncertain whether it can even be applied to other platforms
- it's racy, probably unavoidably so
- we don't understand all the consequences of losing the race, but there's good reason to believe that at least one consequence is crashes and system instability
- losing the race will happen with non-negligible frequency over the user population
- there's good reason to believe that an alternate implementation is possible that doesn't have these drawbacks
This is the best I can do to explain why I can't approve this approach to handling thread exit. If you want to continue pursuing it, you'll need to escalate to danakj or another Chrome threading/synchronization guru.
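The frequency argument in point 4 of the comment above is a plain expected-value calculation: occurrences per day equal the number of samples per day divided by the odds. A small sketch follows. The daily sample volume used in the usage note is a hypothetical figure chosen only to land in the "hundreds per day" range mentioned in the comment; it is not a measured number.

```cpp
#include <cassert>
#include <cstdint>

// Expected number of occurrences per day of an event that happens once in
// every `one_in` samples, given `samples_per_day` samples across all users.
// Integer division is used, so the result is rounded down.
std::uint64_t ExpectedEventsPerDay(std::uint64_t samples_per_day,
                                   std::uint64_t one_in) {
  assert(one_in > 0);
  return samples_per_day / one_in;
}
```

For example, with a hypothetical 300,000,000 samples per day, a one-in-a-million event would be expected about 300 times per day, and a one-in-ten-million event about 30 times per day.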
https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that the thread under test is gone. > > Undocumented? GetThreadTimes is a published and supported API of which the > only > > behavior we're looking at is the reported creation time. > > > https://msdn.microsoft.com/en-us/library/windows/desktop/ms683237(v=vs.85).aspx > > > > We can't guarantee the time and operation of future Chrome changes, either, > but > > since GetThreadTimes() is a published API upon which thousands of applications > > likely depend, I don't see it changing in any significant way. And it really > > doesn't matter if it's not exceptionally quick (though it likely is) since > it'll > > be running on the sampling thread after the sampled thread has been resumed. > As > > you've said, 100s of ms is a lot of time. Even if there were many concurrent > > profiles being collected, it's not going to be significant compared to the > existing > > activities of stopping a thread, copying its stack, and decoding it. > > > > Several points: > > 1. Running the check after the sampling has already happened still leaves a > 100ms race window, and still allows the failure scenarios I mentioned in comment > #139. 100ms? If the check happens immediately after the sampling, it is checked as soon as the sample is converted. That's microseconds. And the failure case is that the last sample gets dropped unnecessarily. That's not a serious problem. > 2. The point I was trying, inelegantly, to make above is that Win32 > implementation details affect the length of the race window in ways which are > difficult to predict. > > 3.
All it takes for this approach to go sideways, due to SuspendThread operating > on a different thread than GetThreadTimes, is one ill-timed context switch on > the profiler thread within the race window. The length of the window just makes > the probability of hitting this case more or less likely. But it's not. The GetThreadTimes is done on the same thread, just after it is resumed, all within the NativeStackSampler. > 4. Given the number of Chrome users and the number of times this code is run, > events with even a vanishingly small probability of occurring will occur > reliably over the population. A one-in-a-million event during profiling will > occur hundreds of times per day, just over the population of canary and dev > users. The guard page check in the code is there to handle a case that occurs > with a probability of around 1 in 10,000,000, and was generating a > non-negligible number of crash reports. Crashes are serious things, and one-in-ten-million crashes would be a significant addition to the current number of crashes. But one-in-a-million bad samples is in no way significant, an error of 0.0001% which is minuscule and certainly less than the normal variation seen when aggregating samples. > > > An entirely separate can of worms is cross-platform support. > GetThreadTimes() > > is > > > a Win32 API. Even if there are equivalent APIs on OS X, iOS, Linux, and > > Android, > > > there's basically zero chance we can depend on winning the same races, > > > consistently, on every one of those platforms now and in the future. It's > also > > > unknown if the SuspendThread-equivalent will reliably tell us if the thread > > was > > > terminated (this is true for Windows too for that matter). > > > > Cross-platform is already a can of worms. Recording and checking the > thread-ID > > would live in the platform-specific NativeStackSamplerWin class. When (if?) > > other platform support gets added, those classes can use whatever is > appropriate > > for them. 
Or if nothing works, then a more complex solution can be > > investigated. There's no benefit in trying to code specifically for them in > > advance. > > The OS X implementation is in progress; one of the Mac developers has already > started working on it. And do they claim that there is no platform-specific way to detect if a thread has exited? > > It does. I don't know what they are, but I'm sure they're there. Managing > the > > start/stop of a thread proved to be insanely difficult. But even if I'm > wrong, > > the proposal is far more complex and difficult to understand than this simple, > > self-contained solution. The proposal also makes assumptions about the > threads > > under test, something that may prove limiting in the future. Somebody is > bound > > to want to trace a PlatformThread without a message-loop at some point. > > The strawman proposal uses standard Chrome synchronization primitives and would > be significantly easier to understand by the average Chrome developer than the > use of Win32 APIs. Effectively the only restrictions it places on the profiled > threads is that, if they are being profiled from a thread other than themselves, > that thread must be responsible for ensuring the profiled thread outlives the > profiling. There's no need for the profiled thread to have a message loop. It also requires modification of every thread that needs to be sampled, something that will limit the ease of using this tool. Plus, those modifications may have far-reaching effects since they will change the characteristics of how the thread exits, possibly affecting other threads waiting on it. There's no way to predict how far those effects will propagate. > > > Given all the issues with the SuspendThread/GetThreadTimes approach I'm not > OK > > > moving forward with it for handling thread exit. We need a solution that > > > guarantees correct behavior in all cases. > > > > No, you don't. 
You need to be sure it won't crash but other than that, you > just > > need a solution that has a signal-to-noise ratio sufficient to analyze the > data; > > I see very little, if any, noise coming from this. We can't let "perfect" be > > the enemy of the "good". > > > > This is a simple solution and can be implemented quickly and cleanly. We > should > > do it. If it proves to be untenable in the field, then we can investigate > more > > complicated methods. > > I am not sure it won't crash and I haven't seen sufficient justification in this > thread for why it won't crash. I've even outlined a possible scenario where not > only will it crash, but it will deadlock other processes on the system. This > would not only be incredibly poor behavior, but potentially a huge PR black eye > ("Chrome is so unstable it crashes my other applications too!"). If it won't work or risks other processes then I agree with you and a more complicated solution needs to be used. What I'm trying to determine at the moment is if a thread-ID can be reused while a win::ScopedHandle to it is still open to that thread. Seems to me something the OS would prevent... From this article on SO... http://stackoverflow.com/questions/14863919/does-a-thread-id-stay-unique-vali... ... it seems to be the case that the thread-ID CANNOT be reused until the ScopedHandle used by the NativeStackSampler releases it. If that is the case then there is no need to worry about the thread being replaced with one of an identical ID while it is being sampled and all this can be simplified. Is there something I'm missing about this?
https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that the thread under test is gone. On 2017/02/02 20:46:16, bcwhite wrote: > https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... > File base/profiler/stack_sampling_profiler.cc (right): > > https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... > base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that > the thread under test is gone. > > > Undocumented? GetThreadTimes is a published and supported API of which the > > only > > > behavior we're looking at is the reported creation time. > > > > > > https://msdn.microsoft.com/en-us/library/windows/desktop/ms683237(v=vs.85).aspx > > > > > > We can't guarantee the time and operation of future Chrome changes, either, > > but > > > since GetThreadTime() is a published API upon which thousands of > applications > > > likely depend, I don't see it changing in any significant way. And it > really > > > doesn't matter if it's not exceptionally quick (though it likely is) since > > it'll > > > be running on the sampling thread after the sampled thread has been resumed. > > > As > > > you've said, 100s of ms is a lot of time. Even if there were many > concurrent > > > profiles being collected, it's not going be significant compared to the > > existing > > > activities of stopping a thread, copying its stack, and decoding it. > > > > > > > Several points: > > > > 1. Running the check after the sampling has already happened still leaves a > > 100ms race window, and still allows the failure scenarios I mentioned in > comment > > #139. > > 100ms? If the check happens immediately after the sampling, it is checked as > soon as the sample is converted. That's us. 
Are you presuming the GetThreadTimes call is made both before and after the thread suspension? Otherwise the race window is the interval between the GetThreadTimes call and the next SuspendThread call, which is the 100ms between the end of one sample and the start of the next. If the thread gets replaced during that window, SuspendThread will operate on the wrong thread. > And the failure case is that the last sample gets dropped unnecessarily. That's > not a serious problem. The failure I've mentioned results in a crash and deadlock and works like this: 1. GetThreadTimes call is made. Thread id and creation time match what was seen previously. 2. Time goes by... 3. The thread is replaced by a thread in another process. 4. SuspendThread is called on the new thread and succeeds. 5. The NativeStackSampler attempts to copy the thread's stack, but generates an access violation because the thread's stack lives in another process' address space. 6. Chrome crashes. The thread in the other process remains permanently suspended. > > 2. The point I was trying, inelegantly, to make above is that Win32 > > implementation details affect the length of the race window in ways which are > > difficult to predict. > > > > 3. All it takes for this approach to go sideways, due to SuspendThread > operating > > on a different thread than GetThreadTimes, is one ill-timed context switch on > > the profiler thread within the race window. The length of the window just > makes > > the probability of hitting this case more or less likely. > > But it's not. The GetThreadTimes is done on the same thread, just after it is > resumed, all within the NativeStackSampler. What do you mean by "it's not"? The context switch won't happen in this case? > > 4. Given the number of Chrome users and the number of times this code is run, > > events with even a vanishingly small probability of occurring will occur > > reliably over the population. 
A one-in-a-million event during profiling will > > occur hundreds of times per day, just over the population of canary and dev > > users. The guard page check in the code is there to handle a case that occurs > > with a probability of around 1 in 10,000,000, and was generating a > > non-negligible number of crash reports. > > Crashes are serious things, and one-in-ten-million crashes would be a > significant addition to the current number of crashes. > > But one-in-a-million bad samples is in no way significant, an error of 0.0001% > which is minuscule and certainly less than the normal variation seen when > aggregating samples. The issue I see with this affecting the data is that it's turning what is currently a retryable failure (SuspendThread failed) into a permanent failure. To the extent that SuspendThread fails for reasons other than the thread having exited, we will miss out on collecting all the rest of the collection's samples. It's unclear how often this will happen. There's one case described in the documentation, but it's unclear what other scenarios will cause this behavior. > > > > An entirely separate can of worms is cross-platform support. > > GetThreadTimes() > > > is > > > > a Win32 API. Even if there are equivalent APIs on OS X, iOS, Linux, and > > > Android, > > > > there's basically zero chance we can depend on winning the same races, > > > > consistently, on every one of those platforms now and in the future. It's > > also > > > > unknown if the SuspendThread-equivalent will reliably tell us if the > thread > > > was > > > > terminated (this is true for Windows too for that matter). > > > > > > Cross-platform is already a can of worms. Recording and checking the > > thread-ID > > > would live in the platform-specific NativeStackSamplerWin class. When (if?) > > > other platform support gets added, those classes can use whatever is > > appropriate > > > for them. Or if nothing works, then a more complex solution can be > > > investigated. 
There's no benefit in trying to code specifically for them in > > > advance. > > > > The OS X implementation is in progress; one of the Mac developers has already > > started working on it. > > And do they claim that there is no platform-specific way to detect if a thread > has exited? I looked at the initial OS X implementation and the SuspendThread equivalent effectively operates on pid_t's, which are reused by the OS. So it's very likely subject to the same race conditions. > > > It does. I don't know what they are, but I'm sure they're there. Managing > > the > > > start/stop of a thread proved to be insanely difficult. But even if I'm > > wrong, > > > the proposal is far more complex and difficult to understand than this > simple, > > > self-contained solution. The proposal also makes assumptions about the > > threads > > > under test, something that may prove limiting in the future. Somebody is > > bound > > > to want to trace a PlatformThread without a message-loop at some point. > > > > The strawman proposal uses standard Chrome synchronization primitives and > would > > be significantly easier to understand by the average Chrome developer than the > > use of Win32 APIs. Effectively the only restrictions it places on the profiled > > threads is that, if they are being profiled from a thread other than > themselves, > > that thread must be responsible for ensuring the profiled thread outlives the > > profiling. There's no need for the profiled thread to have a message loop. > > It also requires modification of every thread that needs to be sampled, > something that will limit the ease of using this tool. Plus, those > modifications may have far-reaching effects since they will change the > characteristics of how the thread exits, possibly affecting other threads > waiting on it. There's no way to predict how far those effects will propagate. 
It doesn't require modification of every thread that needs to be sampled; it just requires allocation of a StackSamplingProfiler on the stack for any thread that wants to sample itself. The other main use case, threads sampled on behalf of the thread scheduler, requires no changes. The only effect on thread exit is that the profiled thread might need to wait for the profiler thread to finish its current sample, and there may still be ways to mitigate that if it proves burdensome. > > > > Given all the issues with the SuspendThread/GetThreadTimes approach I'm > not > > OK > > > > moving forward with it for handling thread exit. We need a solution that > > > > guarantees correct behavior in all cases. > > > > > > No, you don't. You need to be sure it won't crash but other than that, you > > just > > > need a solution that has a signal-to-noise ratio sufficient to analyze the > > data; > > > I see very little, if any, noise coming from this. We can't let "perfect" > be > > > the enemy of the "good". > > > > > > This is a simple solution and can be implemented quickly and cleanly. We > > should > > > do it. If it proves to be untenable in the field, then we can investigate > > more > > > complicated methods. > > > > I am not sure it won't crash and I haven't seen sufficient justification in > this > > thread for why it won't crash. I've even outlined a possible scenario where > not > > only will it crash, but it will deadlock other processes on the system. This > > would not only be incredibly poor behavior, but potentially a huge PR black > eye > > ("Chrome is so unstable it crashes my other applications too!"). > > If it won't work or risks other processes then I agree with you and a more > complicated solution needs to be used. > > What I'm trying to determine at the moment is if a thread-ID can be reused while > a win::ScopedHandle to it is still open to that thread. Seems to me something > the OS would prevent... > > From this article on SO... 
> http://stackoverflow.com/questions/14863919/does-a-thread-id-stay-unique-vali... > ... it seems to be the case that the thread-ID CANNOT be reused until the > ScopedHandle used by the NativeStackSampler releases it. > > If that is the case then there is no need to worry about the thread being > replaced with one of an identical ID while it is being sampled and all this can > be simplified. > > Is there something I'm missing about this? I don't know if a thread id can be reused while holding a handle. But even if this works on Windows, it's unlikely to work on OS X since there's no corresponding handle concept. On that platform we'll probably have to implement a synchronization-based solution similar to what I've proposed. At that point, given a cross-platform implementation, there would be no reason not to replace the Windows implementation with that solution to minimize code complexity. I don't see why we shouldn't just avoid the intermediate step.
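The crash-and-deadlock sequence listed in this comment (identity check passes, time goes by, the thread is replaced, the suspend lands on the replacement) is a check-then-act race. The following is only a simulation sketch of that sequence: `FakeOs`, `FakeThread`, and `SuspendedWrongThread` are hypothetical names invented for illustration, the in-memory map stands in for the OS thread table, and no Win32 calls are made.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical model of one OS thread. Not a real API.
struct FakeThread {
  uint64_t creation_time;
  bool in_our_process;
  bool suspended;
};

// Hypothetical model of an OS thread table in which IDs are recycled, so the
// same ID can name different threads at different times.
class FakeOs {
 public:
  void Spawn(uint32_t id, uint64_t creation_time, bool in_our_process) {
    threads_[id] = FakeThread{creation_time, in_our_process, false};
  }
  void Exit(uint32_t id) { threads_.erase(id); }

  // Stand-in for an identity check based on the reported creation time.
  bool CreationTimeMatches(uint32_t id, uint64_t expected) const {
    auto it = threads_.find(id);
    return it != threads_.end() && it->second.creation_time == expected;
  }

  // Suspends whichever thread currently owns |id|; nullptr if none does.
  FakeThread* Suspend(uint32_t id) {
    auto it = threads_.find(id);
    if (it == threads_.end()) return nullptr;
    it->second.suspended = true;
    return &it->second;
  }

 private:
  std::map<uint32_t, FakeThread> threads_;
};

// Replays the steps: returns true if a thread in another process ended up
// suspended even though the identity check had passed beforehand.
bool SuspendedWrongThread() {
  FakeOs os;
  const uint32_t kThreadId = 42;
  const uint64_t kOriginalCreationTime = 100;
  os.Spawn(kThreadId, kOriginalCreationTime, /*in_our_process=*/true);

  // Step 1: the check sees the expected id and creation time.
  const bool check_passed =
      os.CreationTimeMatches(kThreadId, kOriginalCreationTime);

  // Steps 2-3: within the race window the thread exits and the id is reused
  // by a thread that lives in a different process.
  os.Exit(kThreadId);
  os.Spawn(kThreadId, /*creation_time=*/250, /*in_our_process=*/false);

  // Step 4: the suspend is keyed on the id alone, so it succeeds on the
  // replacement thread.
  FakeThread* victim = os.Suspend(kThreadId);
  return check_passed && victim != nullptr && victim->suspended &&
         !victim->in_our_process;
}
```

Calling `SuspendedWrongThread()` in this model returns true: the check and the suspend are separate operations, so nothing ties the suspend to the thread that was verified. Whether the real OS permits the id reuse in step 3 is exactly the open question in the thread.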
https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that the thread under test is gone. Okay, you've convinced me: If a thread were to exit and its ID be reused, whether it be another Chrome thread or in some outside process, then Bad Things(tm) may happen if any attempt to suspend that thread is made. There is no way, in the NativeStackSampler alone, to fully eliminate such a possibility. I also agree with you completely that any solution should be implemented, if practical, in a generic manner that applies to all architectures and even, ideally, is reusable for other purposes. That's just a general principle and a good idea. However, documentation states that a thread-ID cannot be reused under Windows as long as there are any open handles to that thread. Since the NativeStackSamplerWin object holds an open handle, the thread-ID cannot be reused and thus no other solution is necessary. Additional protection should not be implemented speculatively based on what may be needed in the future. If the implementation of stack-sampling for another OS does need a more elaborate solution, then one should be implemented at that time as part of that effort. Do you agree?
https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that the thread under test is gone. On 2017/02/03 13:28:38, bcwhite wrote: > Okay, you've convinced me: If a thread were to exit and its ID be reused, > whether it be another Chrome thread or in some outside process, then Bad > Things(tm) may happen if any attempt to suspend that thread is made. There is > no way, in the NativeStackSampler alone, to fully eliminate such a possibility. > > I also agree with you completely that any solution should be implemented, if > practical, in a generic manner that applies to all architectures and even, > ideally, is reusable for other purposes. That's just a general principal and > good idea. > > However, documentation states that a thread-ID cannot be reused under Windows as > long as there are any open handles to that thread. Since the > NativeStackSamplerWin object holds an open handle, the thread-ID cannot be > reused and thus no other solution is necessary. Additional protection should not > be implemented speculatively based on what may be needed in the future. > > If the implementation of stack-sampling for another OS does need a more > elaborate solution, then one should be implemented at that time as part of that > effort. > > Do you agree? No, sorry, I don't agree. Support for other OS's is not a future concern. Mac support is being worked on right now by an engineer the team has committed for the project. I don't see how this approach could work on Mac. Going with it would not only shift the burden for implementing cross-platform support onto the Mac developer but would make their job much harder by forcing them to deal with a bunch of complexity that doesn't apply to them. 
They're not going to be comfortable changing the Windows-specific implementation, so the end result will be two different implementations doing the same thing. Then, someone with Windows experience will have to come in and clean this up just to get back to the state we would have been if we'd implemented the cross-platform interface in the first place. That's a ton of unnecessary work and code churn.
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that the thread under test is gone. > I don't see how this approach could work on Mac. Going with it would not only > shift the burden for implementing cross-platform support onto the Mac developer Correct. Or at least to a different CL. > but would make their job much harder by forcing them to deal with a bunch of > complexity that doesn't apply to them. There is no other complexity. There is absolutely nothing that needs to be done here. The native sampler will work exactly as it did before. There is no need to try to handle the "thread under test dies" case as it is already safe. It's safe exactly as it was. For me to write something that isn't necessary in the hope that it will fulfill someone else's need would not be in line with standard Chrome development and could easily result in more churn trying to do so.
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: ios-device on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-device/builds...) ios-device-xcode-clang on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-device-xcode-...)
https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that the thread under test is gone. On 2017/02/03 18:47:58, bcwhite wrote: > > I don't see how this approach could work on Mac. Going with it would not only > > shift the burden for implementing cross-platform support onto the Mac > developer > > Correct. Or at least to a different CL. > > > > but would make their job much harder by forcing them to deal with a bunch of > > complexity that doesn't apply to them. > > There is no other complexity. The complexity is inherent in this change: it adds functions and state to the cross-platform NativeStackSampler interface that only apply to Windows. The Mac developer will have to understand and keep straight what parts of this interface apply to them, what parts don't, how both of those parts interact with each other and with the cross-platform support that they will need to implement, and how to make the StackSamplingProfiler work with both systems at the same time. I've been here -- working on a cross-platform interface with inconsistent platform-specific implementations -- and it increases the level of difficulty and complexity substantially, well beyond what it would take to just implement the cross-platform support from scratch. > There is absolutely nothing that needs to be done > here. The native sampler will work exactly as it did before. There is no need > to try to handle the "thread under test dies" case as it is already safe. Why do you say this will work as it did before? The profiler thread join in the StackSamplingProfiler destructor, which prevented profiling past thread exit in the single-thread-profiling implementation, has gone away with the multi-threading changes. 
It hasn't been replaced with something that works consistently across platforms. It's particularly ill-timed to be regressing this behavior now, right when we're trying to bring up the profiler on Mac. > It's safe exactly as it was. For me to write something that isn't necessary on > the hope that it will fulfill someone else need would not be in line with > standard Chrome development and could easily result in more churn trying to do > so. Chrome is a cross-platform product. Implementing platform-specific functionality that not only does not generalize across platforms, but makes it harder to do so is counterproductive. Especially if there is a known need for the functionality on other platforms at time of implementation and a likely viable platform-independent alternative. I appreciate the effort that has gone into this thread exit implementation. But the bottom line is that, even if it works on Windows, it would move the project further from where it needs to be rather than closer. The suggestion I made for a cross platform approach is conceptually similar to the single-thread-profiling join-on-destruction implementation, so there's good reason to believe it will work without a huge amount of effort.
https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that the thread under test is gone. > The complexity is inherent in this change: it adds functions and state to the > cross-platform NativeStackSampler interface that only apply to Windows. Not true. I can remove completely everything I added to the NativeSampler and it will continue to work. I left them in place because it helps the *test*. In fact, the only addition is recording the state of the thread under test in detail. The detail could be easily reduced or removed completely. > Why do you say this will work as it did before? The profiler thread join in the > StackSamplingProfiler destructor, which prevented profiling past thread exit in > the single-thread-profiling implementation, has gone away with the > multi-threading changes. It hasn't been replaced with something that works > consistently across platforms. It continues to sample but gets only empty frames because the SuspendThread call will fail. When there are no more samples to take, it executes the callback. If the object of that callback has gone away (it should be a weak-pointer), then the callback will do nothing. Dealing with the sampling of the thread that owns the Profiler isn't sufficient anyway. Thread A could start sampling on thread B but then B could exit without A's knowledge or destruction of the profiler doing the sampling. Handling this general case also handles the A-samples-A special case and the Windows NativeStackSampler handles the generic case by returning empty stack frames after the thread exit. Yes, it would be nice to be able to tell if the thread has actually exited and stop the sampling immediately but I haven't found any way to do that reliably. 
> It's particularly ill-timed to be regressing this behavior now, right when we're > trying to bring up the profiler on Mac. When it was only the main thread sampling the main thread during startup, there was only the special case but now that we want to be able to sample any thread at any time, the general case has to be addressed and thus will have to be addressed on the Mac, too. And in addressing that, it'll address the special case as well. I've asked Alexei to weigh in on this discussion because it seems we're coming at this from different directions and a new voice may help clear things up.
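The completion-callback behavior described in this comment (the callback does nothing once its target object has gone away) can be sketched as follows. This is an illustration under assumptions, not the CL's code: it uses the standard library's `std::weak_ptr` as a stand-in for Chrome's weak pointers, and `ProfileReceiver`, `MakeCompletionCallback`, and `CallbackIsNoOpAfterReceiverDestroyed` are hypothetical names.

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Hypothetical object that wants to hear about finished profiles.
struct ProfileReceiver {
  int profiles_received = 0;
  void OnProfilesCompleted() { ++profiles_received; }
};

// Builds a callback that holds only a weak reference to the receiver. The
// returned bool reports whether the receiver was still alive and was called.
std::function<bool()> MakeCompletionCallback(
    const std::shared_ptr<ProfileReceiver>& receiver) {
  std::weak_ptr<ProfileReceiver> weak = receiver;
  return [weak]() -> bool {
    if (std::shared_ptr<ProfileReceiver> strong = weak.lock()) {
      strong->OnProfilesCompleted();
      return true;
    }
    return false;  // Receiver is gone: the callback is a no-op.
  };
}

// Usage: run the callback once while the receiver lives and once after it
// has been destroyed. Returns true if only the first run had any effect.
bool CallbackIsNoOpAfterReceiverDestroyed() {
  auto receiver = std::make_shared<ProfileReceiver>();
  std::function<bool()> callback = MakeCompletionCallback(receiver);

  const bool ran_while_alive =
      callback() && receiver->profiles_received == 1;

  receiver.reset();  // The owner goes away before sampling finishes.
  const bool ran_after_destruction = callback();

  return ran_while_alive && !ran_after_destruction;
}
```

Note that this only makes the completion callback safe. It does not stop the sampler from attempting further samples on the target thread, which is the part the two reviewers disagree about.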
https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that the thread under test is gone. On 2017/02/04 02:07:01, bcwhite wrote: > https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... > File base/profiler/stack_sampling_profiler.cc (right): > > https://codereview.chromium.org/2554123002/diff/400001/base/profiler/stack_sa... > base/profiler/stack_sampling_profiler.cc:474: // An empty result indicates that > the thread under test is gone. > > The complexity is inherent in this change: it adds functions and state to the > > cross-platform NativeStackSampler interface that only apply to Windows. > > Not true. I can remove completely everything I added to the NativeSampler and > it will continue to work. I left them in place because it helps the *test*. > > In fact, the only addition is recording the state of the thread under test in > detail. The detail could be easily reduced or removed completely. Even if the interface is updated to be the same across platforms, the behavior will still be different and the StackSamplingProfiler/SamplingThread will need to operate differently depending on whether it is running on Windows or Mac. The code will have to operate against two different sets of invariants. > > Why do you say this will work as it did before? The profiler thread join in > the > > StackSamplingProfiler destructor, which prevented profiling past thread exit > in > > the single-thread-profiling implementation, has gone away with the > > multi-threading changes. It hasn't been replaced with something that works > > consistently across platforms. > > It continues to sample but gets only empty frames because the SuspendThread call > will fail. 
When there are no more samples to take, it executes the callback. > If the object of that callback has gone away (it should be a weak-pointer), then > the callback will do nothing. How can this work on Mac? It's still subject to all the problems I outlined at the end of comment #154. > Dealing the the sampling of the thread that owns the Profiler isn't sufficient > anyway. Thread A could start sampling on thread B but then B could exit without > A's knowledge or destruction of the profiler doing the sampling. Handling this > general case also handles the A-samples-A special case and the Windows > NativeStackSampler handles the generic case by returning empty stack frames > after the thread exit. General A-samples-B is a relatively unimportant use case within Chrome, and should not be driving the implementation. The two main use cases that need to be supported now and in the future, respectively, are A-samples-A and thread scheduler-samples-thread scheduler-managed thread. A-samples-A can be handled in a cross-platform manner by having the StackSamplingProfiler coordinate with SamplingThread to ensure the SamplingThread is done with the profiled thread before the profiled thread exits. Thread scheduler-directed sampling can be handled by the thread scheduler notifying the SamplingThread before it shuts down threads. In a thread scheduler world, general A-samples-B is at best a niche use case. If someone wants this, it still can be supported the same way as A-samples-A by delegating the responsibility for ensuring that the thread outlives the StackSamplingProfiler onto the StackSamplingProfiler user. In pretty much any non-thread-scheduler case where A-samples-B would be useful, A will already have some relationship with B that it can leverage to be notified prior to B exiting. > I've asked Alexei to weigh in on this discussion because it seems we're coming > at this from different directions and a new voice may help clear things up. 
Thanks, I think this is a good idea.
asvitkine@chromium.org changed reviewers: + asvitkine@chromium.org
Thanks for looping me in.

I tried to read through a good chunk of recent discussion here and hopefully I got all the context. Here's my understanding:

- We are worried that a thread could die and its ID be re-used without the sampling profiler noticing.
- Brian proposes a solution that would query the thread's creation time using a Win32 API to guard against this.
- Mike is worried about the above because a) it's platform specific and b) it has potential races; Mike proposes a platform-agnostic solution.

Here are my thoughts: In general, before diving into the details I was tempted to agree with Brian to go with a simpler solution to make this work on Windows first and leave the ability to come back and fix things as a follow-up. However, diving into the details it seems the simple solution is not necessarily so simple, since there are a bunch of edge cases we'd need to specifically handle to avoid some races.

Since the solution no longer sounds quite as simple if we need to add extra complexity to handle those edge cases, and since there are still more concerns with it, I think giving Mike's suggested solution a shot is worthwhile. In the end it might end up not being that much more work than the Win32 solution with all its edge cases addressed, and if it has no platform-specific dependencies, hopefully it could benefit the other platform ports in the future.
> I tried to read through a good chunk of recent discussion here and hopefully I > got all the context. Here's my understanding: > > - We are worried that a thread could die and its ID re-used without sampling > profiler noticing. > - Brian proposes a solution that would query the thread's created time using a > Win32 API to guard against this. Originally, yes. I've since discovered that thread-ID reuse is not possible under Windows because the native sampler continues to hold an open handle to that thread which prevents it from being fully released by the OS and thus its identifier will remain unique as long as sampling continues. > - Mike is worried about the above because a) it's platform specific and b) it > has potential races; Mike proposes a platform-agnostic solution. Partially true because though Windows doesn't need a fix, that's still platform-specific. There are no races, however! The key difference is that when it was single-sampling only, destructing the StackSamplingProfiler object caused a join of the SamplingThread which in turn meant that sampling of the target thread had necessarily stopped. Thus, a thread that wanted to initiate sampling upon itself could create a StackSamplingProfiler as a local variable. This ensured that it got destructed before the thread exited and thus there was no possibility of self-sampling to continue after the thread exited. It was this no-self-sampling-after-death that the current Mac development was counting on to avoid accidentally sampling the wrong thread should the target-thread exit and be replaced by an identical identifier. Unfortunately, multi-sampling means that the SamplingThread no longer necessarily exits just because one StackSamplingProfiler object is done or stopping due to going out of scope. The Thread::Join synchronization step has been removed. 
That means that sampling can continue even though the initiating StackSamplingProfiler has gone away and there is no way for a self-sampling thread to definitively know that sampling has stopped before exiting. Thus, thread-ID re-use could be a (quite serious) problem if there are no other protections in place. On top of this A-samples-A case, there's the more general case of A-samples-B where A and B are arbitrary threads and either A or B could exit at any time and in any order. A-samples-B also includes the fire-and-forget case where some method on thread A calls the existing static StackSamplingProfiler::StartAndRunAsync(...) to initiate sampling on itself and then returns. In that case there is no object on the stack to destruct and sampling always continues until fully completed. Mike, did I miss anything there? The solutions, then... 1) Let each OS deal with this on its own in whatever way is appropriate. For Windows, this means doing nothing as protection exists in the form of the open handle already held by the native sampler. There's no development cost to this solution; nothing needs to be written now that would be discarded in the future. 2) Create a mutex that a destructing StackSamplingProfiler can wait on until the SamplingThread has finished with the thread in question. This is a better solution but I don't think it should be implemented in this specific CL because it's non-trivial and unnecessary for Windows. It also solves only the A-samples-A case which I believe to be insufficient as it seems likely that conditional sampling (i.e. take some samples when this event occurs) will need to use fire-and-forget asynchronous sampling and/or be sampling a thread other than itself. I could be wrong but I'd prefer to leave the discussion of that to a separate CL. 3) Something else. :-) Also fine... but also not for this CL.
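For readers following the argument, the single-sampling guarantee described above (destroying the profiler joins the sampling thread, so a stack-allocated profiler cannot outlive its scope's thread) can be sketched roughly as follows. This is an illustration only, not the Chromium code: `ScopedProfiler`, its counter parameter, and the 1 ms interval are hypothetical names and values, and `std::thread` stands in for `base::Thread`.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical stand-in for the single-sampling profiler: it owns its
// sampling thread and joins it on destruction.
class ScopedProfiler {
 public:
  explicit ScopedProfiler(std::atomic<int>* samples)
      : samples_(samples), sampler_([this] { Run(); }) {}

  ~ScopedProfiler() {
    assert(sampler_.joinable());
    stop_.store(true);
    sampler_.join();  // Once this returns, no further sample can be taken.
  }

 private:
  void Run() {
    while (!stop_.load()) {
      samples_->fetch_add(1);  // Placeholder for "suspend target, copy stack".
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
  }

  std::atomic<int>* samples_;
  std::atomic<bool> stop_{false};
  std::thread sampler_;  // Declared last so the fields above exist first.
};
```

With a shared, long-lived sampling thread there is no such join, which is the gap the thread-ID re-use discussion is about.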
Thanks for clarifying - sorry for not getting the full context the first time through. In this case I agree that we should go with the simpler solution in this CL - since there are no holes in it for Windows - and this will allow us to get this functionality sooner without very much technical debt. We can leave room for revising this in a follow-up CL if it ends up being needed for other platforms. We should make it very clear, at least with comments - but maybe even with an #ifdef #error for non-Windows - that this needs to be considered when it comes time to port to another platform.
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: ios-simulator on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-simulator/bui...) ios-simulator-xcode-clang on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-simulator-xco...)
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Patchset #19 (id:540001) has been deleted
Patchset #13 (id:380001) has been deleted
On 2017/02/07 15:58:26, Alexei Svitkine (slow) wrote: > Thanks for clarifying - sorry for not getting the full context the first time > through. > > In this case I agree that we should go with the simpler solution in this CL - > since there are no holes in it for Windows - and this will allow us to get this > functionality sooner without very much technical debt. I do not agree that there are no races and no other holes in this for Windows -- this hasn't been established in this review, and there's a non-trivial amount of review and implementation effort left in order to do so. As you stated initially, it's not so simple, and if we're going to put that effort in, it would be best to try for a platform-agnostic solution. The areas of concern I'm currently aware of on Windows are: - Can we be confident that the thread does not get swapped out while holding the handle, even past thread exit? I can believe this would be true, but do we have informed guidance on this either from Microsoft or one of our Windows gurus? I was unable to find anything definitive on this when I looked for it. - This approach imposes the new assumption that SuspendThread is race-free with respect to all the thread state we care about. Can we be confident that there *aren't* some odd race conditions within SuspendThread where it could still succeed despite the thread being partially torn down? Again, do we have informed guidance on this from either from Microsoft or one of our Windows gurus? What about AV's that hook SuspendThread -- are they likely to either get this right, or not fail often enough for it to be an issue? (This isn't a theoretical concern: I've seen this scenario in crash dumps.) - SuspendThread can fail for reasons other than thread exit. Can we be confident that either we can detect these "false positives" so we don't abort data collection early in cases other than thread exit? 
If not, can we be confident that we haven't introduced substantial skew or bias into the resulting data? Do we have informed guidance on the failure modes of this API? These are just the items off the top of my head -- it's very difficult to reason about racy algorithms, so it's likely there are others that would come up in review. Notably, we've barely considered how injected third party code would interact with this approach. We also don't have code currently for the GetThreadTimes() part of this approach, so it's hard to estimate what issues would need to be considered there. It's also important to consider the consequences of missing something or glossing over potential issues. Any issues that make it into the code are likely to result in generalized instability within Chrome if not system instability. Reverse engineering causes of issues from crash dumps will be exceedingly difficult, particularly if they're due to racy behavior. The cost of investigating a single race bug is likely to dwarf the cost of implementing the platform-agnostic solution. > We can leave room for revising this in a follow-up CL if it ends up being needed > for other platforms. We should it make it very clear with at least comments - > but maybe even #ifdef #error for non-Windows - that this needs to be considered > when it comes time to port to another platform. The trade off in terms of cross-platform support as I see it is this: 1. Assuming Brian's approach works, Windows gains multiple thread profiling support. The cross-platform profiler code no longer works for Mac. In order to bring up the profiler, the Mac developer has to implement and understand not just the relatively constrained platform-dependent piece of the profiler, but the platform-independent piece including substantial non-trivial threading and synchronization concerns. 2. Assuming my platform-agnostic approach works, all platforms gain multiple thread profiling support. 
Profiling scenarios other than thread-profiles-self place responsibility on the entity initiating the profiling to ensure the profiling stops before the thread exits. I have little concern about this caveat since the thread scheduler generally should be able to provide this coordination in the cases we care about.
> > In this case I agree that we should go with the simpler solution in this CL - > > since there are no holes in it for Windows - and this will allow us to get > this > > functionality sooner without very much technical debt. > > I do not agree that there are no races and no other holes in this for Windows -- > this hasn't been established in this review, and there's a non-trivial amount of > review and implementation effort left in order to do so. As you stated > initially, it's not so simple, and if we're going to put that effort in, it > would be best to try for a platform-agnostic solution. > > The areas of concern I'm currently aware of on Windows are: > > - Can we be confident that the thread does not get swapped out while holding the > handle, even past thread exit? I can believe this would be true, but do we have > informed guidance on this either from Microsoft or one of our Windows gurus? I > was unable to find anything definitive on this when I looked for it. http://stackoverflow.com/questions/14863919/does-a-thread-id-stay-unique-vali... "So an identifier can only be reused after last thread handle is closed" The answer is without a reference to support this but I'm inclined to believe it since it makes sense: A handle is an open OS reference and the OS isn't going to destroy and reuse an object to which open references exist. I've posted a comment asking for said reference. Have you found any documentation to the contrary? > - This approach imposes the new assumption that SuspendThread is race-free with > respect to all the thread state we care about. Can we be confident that there > *aren't* some odd race conditions within SuspendThread where it could still > succeed despite the thread being partially torn down? Again, do we have informed > guidance on this from either from Microsoft or one of our Windows gurus? What > about AV's that hook SuspendThread -- are they likely to either get this right, > or not fail often enough for it to be an issue? 
(This isn't a theoretical > concern: I've seen this scenario in crash dumps.)

Since SuspendThread is mainly intended for debuggers (according to official MSDN documentation), which cannot necessarily know in advance whether the thread they're trying to stop has suddenly exited, I'm again inclined to believe that it's going to be safe.

> - SuspendThread can fail for reasons other than thread exit. Can we be confident > that either we can detect these "false positives" so we don't abort data > collection early in cases other than thread exit? If not, can we be confident > that we haven't introduced substantial skew or bias into the resulting data? Do > we have informed guidance on the failure modes of this API?

The Windows native sampler does not abort collection in the case of a failed SuspendThread. It just records an empty frame and will try again at the next sampling interval. This is unchanged from the previous working behavior, so any skew or bias encountered with the new code is already present in the old code.

> These are just the items off the top of my head -- it's very difficult to reason > about racy algorithms, so it's likely there are others that would come up in > review. Notably, we've barely considered how injected third party code would > interact with this approach. We also don't have code currently for the > GetThreadTimes() part of this approach, so it's hard to estimate what issues > would need to be considered there.

GetThreadTimes was abandoned last week as it's not necessary when thread IDs cannot be reused. The code that tried using that information was removed in https://codereview.chromium.org/2554123002/#ps520001

> It's also important to consider the consequences of missing something or > glossing over potential issues. Any issues that make it into the code are likely > to result in generalized instability within Chrome if not system instability.
> Reverse engineering causes of issues from crash dumps will be exceedingly > difficult, particularly if they're due to racy behavior. The cost of > investigating a single race bug is likely to dwarf the cost of implementing the > platform-agnostic solution.

That assumes that there are no race possibilities in the platform-agnostic solution. Given that it will require cross-thread communication and mutex access just to cover the simplest A-samples-A-below case, I think that's a very big assumption.

> > We can leave room for revising this in a follow-up CL if it ends up being > needed > > for other platforms. We should it make it very clear with at least comments - > > but maybe even #ifdef #error for non-Windows - that this needs to be > considered > > when it comes time to port to another platform. > > The trade off in terms of cross-platform support as I see it is this: > > 1. Assuming Brian's approach works, Windows gains multiple thread profiling > support. The cross-platform profiler code no longer works for Mac. In order to > bring up the profiler, the Mac developer has to implement and understand not > just the relatively constrained platform-dependent piece of the profiler, but > the platform-independent piece including substantial non-trivial threading and > synchronization concerns.

On an adjacent CL, you had me change a class named "Common" to "StackBuffer" because that was the only thing currently contained within it. A generic solution was made specific because that was all that was necessary for that CL. No consideration was given to perhaps a Mac solution needing something else. But here you're suggesting adding a huge piece of complex synchronization that is unneeded for Windows in order to support something being written elsewhere.

> 2. Assuming my platform-agnostic approach works, all platforms gain multiple > thread profiling support.
Profiling scenarios other than thread-profiles-self > place responsibility on the entity initiating the profiling to ensure the > profiling stops before the thread exits. I have little concern about this caveat > since the thread scheduler generally should be able provide this coordination in > the cases we care about. The basic solution doesn't even fully support A-samples-A. It supports only A-samples-A-below (meaning until the current scope exits). We'd have to remove the static StartAndRunAsync methods, or perhaps limit them exclusively to the UI thread. I'm not saying that a platform-agnostic solution isn't of benefit. I'm just saying it shouldn't be done here. I also disagree that fixing the A-samples-A-below case is insufficient and that a full general solution to the A-samples-B case should be found in order to avoid having to rewrite it later when a developer has need to profile exactly that case. But this is a discussion for that other CL.
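The failure handling described in this reply (a failed suspend yields an empty sample and collection continues at the next interval) might look roughly like the sketch below. It is not the Windows NativeStackSampler: `Sample`, `CollectSamples`, and the `try_capture` callback are hypothetical names, and the callback merely stands in for "suspend the thread and walk its stack".

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

using Sample = std::vector<std::string>;  // One stack; empty means "no frames".

// Hypothetical collection loop. `try_capture` returns false when the target
// thread could not be suspended, for whatever reason.
std::vector<Sample> CollectSamples(
    int count, const std::function<bool(Sample*)>& try_capture) {
  assert(count >= 0);
  std::vector<Sample> samples;
  for (int i = 0; i < count; ++i) {
    Sample stack;
    if (!try_capture(&stack))
      stack.clear();  // Record an empty sample rather than aborting.
    samples.push_back(std::move(stack));
  }
  return samples;
}
```

Note that a loop of this shape cannot tell a transient failure from thread exit, so it keeps producing empty samples until `count` is reached.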
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: mac_chromium_rel_ng on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_...)
We still haven't reached definitive conclusions on the known open questions I raised, which corroborates the belief that this approach does not provide a simple slam-dunk solution on Windows. On 2017/02/07 19:18:40, bcwhite wrote: > > > In this case I agree that we should go with the simpler solution in this CL > - > > > since there are no holes in it for Windows - and this will allow us to get > > this > > > functionality sooner without very much technical debt. > > > > I do not agree that there are no races and no other holes in this for Windows > -- > > this hasn't been established in this review, and there's a non-trivial amount > of > > review and implementation effort left in order to do so. As you stated > > initially, it's not so simple, and if we're going to put that effort in, it > > would be best to try for a platform-agnostic solution. > > > > The areas of concern I'm currently aware of on Windows are: > > > > - Can we be confident that the thread does not get swapped out while holding > the > > handle, even past thread exit? I can believe this would be true, but do we > have > > informed guidance on this either from Microsoft or one of our Windows gurus? I > > was unable to find anything definitive on this when I looked for it. > > http://stackoverflow.com/questions/14863919/does-a-thread-id-stay-unique-vali... > > "So an identifier can only be reused after last thread handle is closed" > > The answer is without a reference to support this but I'm inclined to believe it > since it makes sense: A handle is an open OS reference and the OS isn't going to > destroy and reuse an object to which open references exist. I've posted a > comment asking for said reference. > > Have you found any documentation to the contrary? > > > > - This approach imposes the new assumption that SuspendThread is race-free > with > > respect to all the thread state we care about. 
Can we be confident that there > > *aren't* some odd race conditions within SuspendThread where it could still > > succeed despite the thread being partially torn down? Again, do we have > informed > > guidance on this from either from Microsoft or one of our Windows gurus? What > > about AV's that hook SuspendThread -- are they likely to either get this > right, > > or not fail often enough for it to be an issue? (This isn't a theoretical > > concern: I've seen this scenario in crash dumps.) > > Since SuspendThread is mainly intended for debuggers (according to official MSDN > documentation) which cannot necessarily know in advance if the thread they're > trying to stop might have suddenly exited, I'm again inclined to believe that > its going to be safe. > > > > - SuspendThread can fail for reasons other than thread exit. Can we be > confident > > that either we can detect these "false positives" so we don't abort data > > collection early in cases other than thread exit? If not, can we be confident > > that we haven't introduced substantial skew or bias into the resulting data? > Do > > we have informed guidance on the failure modes of this API? > > The windows native sampler does not abort collection in the case of a failed > SuspendThread. It just records an empty frame and will try again at the next > sampling interval. This is unchanged from the previous working behavior so any > skew or bias encountered with the new code is already present in the old code. Lines 475-482 of stack_sampling_profiler.cc in patch set 17 certainly appear to stop the profiling. > > These are just the items off the top of my head -- it's very difficult to> reason > > about racy algorithms, so it's likely there are others that would come up in > > review. Notably, we've barely considered how injected third party code would > > interact with this approach. 
We also don't have code currently for the > > GetThreadTimes() part of this approach, so it's hard to estimate what issues > > would need to be considered there. > > GetThreadTimes was abandoned last week as it's not necessary when thread IDs > cannot be reused. The code that tried using that information was removed in > https://codereview.chromium.org/2554123002/#ps520001 > > > > It's also important to consider the consequences of missing something or > > glossing over potential issues. Any issues that make it into the code are > likely > > to result in generalized instability within Chrome if not system instability. > > Reverse engineering causes of issues from crash dumps will be exceedingly > > difficult, particularly if they're due to racy behavior. The cost of > > investigating a single race bug is likely to dwarf the cost of implementing > the > > platform-agnostic solution. > > That assumes that there are not race possibilities in the platform-agnostic > solution. Given that such will require cross-thread communication and mutex > access just to cover the simplest A-samples-A-below case, I think it's very big > assumption. The platform-agnostic solution uses the *exact same* synchronization point in StackSamplingProfiler as the current single-thread implementation, which already performs cross-thread communication using WaitableEvents, and is proven to work. The only major difference between the current and proposed approach is whether the profiling thread exits. (The proposed approach uses a WaitableEvent, not a mutex.) > > > We can leave room for revising this in a follow-up CL if it ends up being> > needed > > > for other platforms. We should it make it very clear with at least comments > - > > > but maybe even #ifdef #error for non-Windows - that this needs to be > > considered > > > when it comes time to port to another platform. > > > > The trade off in terms of cross-platform support as I see it is this: > > > > 1. 
Assuming Brian's approach works, Windows gains multiple thread profiling > > support. The cross-platform profiler code no longer works for Mac. In order to > > bring up the profiler, the Mac developer has to implement and understand not > > just the relatively constrained platform-dependent piece of the profiler, but > > the platform-independent piece including substantial non-trivial threading and > > synchronization concerns. > > On an adjacent CL, you had me change a class named "Common" to "StackBuffer" > because that was the only thing currently contained within it. A generic > solution was made specific because that was all that was necessary for that CL. > No consideration was given to perhaps a Mac solution needing something else. No, that's wrong. I have a very good idea of what is necessary for Mac, having reviewed the initial NativeStackSampler implementation in https://codereview.chromium.org/1346453004. My judgement, in that case and this one, is strongly informed by what is needed for the platform given the likely implementation. > But here you're suggesting adding a huge piece of complex synchronization that > is unneeded for Windows in order to support something being written elsewhere. How can you conclude it's huge and complex if you haven't explored it? Since it's based on the current single-thread approach it's unlikely to be significantly more complicated than that, and the amount of code is likely to be comparable to the Windows-only approach. > > 2. Assuming my platform-agnostic approach works, all platforms gain multiple > > thread profiling support. Profiling scenarios other than thread-profiles-self > > place responsibility on the entity initiating the profiling to ensure the > > profiling stops before the thread exits. I have little concern about this > caveat > > since the thread scheduler generally should be able provide this coordination > in > > the cases we care about. > > The basic solution doesn't even fully support A-samples-A. 
It supports only > A-samples-A-below (meaning until the current scope exits). We'd have to remove > the static StartAndRunAsync methods, or perhaps limit them exclusively to the UI > thread. Removing StartAndRunAsync would be fine with me. It's not used, and the purpose for which it was implemented was found to be supportable using an object-owned StackSamplingProfiler. In most if not all cases where people might be tempted to use it, they'd be better off using Start and coordinating threads. That would be better from a system design perspective since it would force the inter-thread relationships to be explicit in the code. > I'm not saying that a platform-agnostic solution isn't of benefit. I'm just > saying it shouldn't be done here. > > I also disagree that fixing the A-samples-A-below case is insufficient and that > a full general solution to the A-samples-B case should be found in order to > avoid having to rewrite it later when a developer has need to profile exactly > that case. But this is a discussion for that other CL.
On 2017/02/07 21:44:28, Mike Wittman wrote: > We still haven't reached definitive conclusions on the known open questions I > raised, which corroborates the belief that this approach does not provide a > simple slam-dunk solution on Windows. I provided my evidence and my reasoning. If you have something definitive to the contrary, please provide it. > > > - SuspendThread can fail for reasons other than thread exit. Can we be > > confident > > > that either we can detect these "false positives" so we don't abort data > > > collection early in cases other than thread exit? If not, can we be > confident > > > that we haven't introduced substantial skew or bias into the resulting data? > > Do > > > we have informed guidance on the failure modes of this API? > > > > The windows native sampler does not abort collection in the case of a failed > > SuspendThread. It just records an empty frame and will try again at the next > > sampling interval. This is unchanged from the previous working behavior so > any > > skew or bias encountered with the new code is already present in the old code. > > Lines 475-482 of stack_sampling_profiler.cc in patch set 17 certainly appear to > stop the profiling. Yes. I left that in to react to a thread having exited. There seemed no reason not to. But the Windows native stack sampler never sets THREAD_EXITED because it has no way to detect that the thread has exited. Other OS may have better luck and be able to stop the sampling early. > > > These are just the items off the top of my head -- it's very difficult to> > reason > > > about racy algorithms, so it's likely there are others that would come up in > > > review. Notably, we've barely considered how injected third party code would > > > interact with this approach. We also don't have code currently for the > > > GetThreadTimes() part of this approach, so it's hard to estimate what issues > > > would need to be considered there. 
> > > > GetThreadTimes was abandoned last week as it's not necessary when thread IDs > > cannot be reused. The code that tried using that information was removed in > > https://codereview.chromium.org/2554123002/#ps520001 > > > > > > > It's also important to consider the consequences of missing something or > > > glossing over potential issues. Any issues that make it into the code are > > likely > > > to result in generalized instability within Chrome if not system > instability. > > > Reverse engineering causes of issues from crash dumps will be exceedingly > > > difficult, particularly if they're due to racy behavior. The cost of > > > investigating a single race bug is likely to dwarf the cost of implementing > > the > > > platform-agnostic solution. > > > > That assumes that there are not race possibilities in the platform-agnostic > > solution. Given that such will require cross-thread communication and mutex > > access just to cover the simplest A-samples-A-below case, I think it's very > big > > assumption. > > The platform-agnostic solution uses the *exact same* synchronization point in > StackSamplingProfiler as the current single-thread implementation, which already > performs cross-thread communication using WaitableEvents, and is proven to work. > The only major difference between the current and proposed approach is whether > the profiling thread exits. (The proposed approach uses a WaitableEvent, not a > mutex.) Except (a) it's not needed for the CL I'm writing and (b) is not the best solution (in my opinion). > > > > We can leave room for revising this in a follow-up CL if it ends up being> > > needed > > > > for other platforms. We should it make it very clear with at least > comments > > - > > > > but maybe even #ifdef #error for non-Windows - that this needs to be > > > considered > > > > when it comes time to port to another platform. > > > > > > The trade off in terms of cross-platform support as I see it is this: > > > > > > 1. 
Assuming Brian's approach works, Windows gains multiple thread profiling > > > support. The cross-platform profiler code no longer works for Mac. In order > to > > > bring up the profiler, the Mac developer has to implement and understand not > > > just the relatively constrained platform-dependent piece of the profiler, > but > > > the platform-independent piece including substantial non-trivial threading > and > > > synchronization concerns. > > > > On an adjacent CL, you had me change a class named "Common" to "StackBuffer" > > because that was the only thing currently contained within it. A generic > > solution was made specific because that was all that was necessary for that > CL. > > No consideration was given to perhaps a Mac solution needing something else. > > No, that's wrong. I have a very good idea of what is necessary for Mac, having > reviewed the initial NativeStackSampler implementation in > https://codereview.chromium.org/1346453004. My judgement, in that case and this > one, is strongly informed by what is needed for the platform given the likely > implementation.

But that's you. You're asking me to write for something about which I have no knowledge in a CL that doesn't need it.

> > But here you're suggesting adding a huge piece of complex synchronization that > > is unneeded for Windows in order to support something being written elsewhere. > > How can you conclude it's huge and complex if you haven't explored it? Since > it's based on the current single-thread approach it's unlikely to be > significantly more complicated than that, and the amount of code is likely to be > comparable to the Windows-only approach.

Please do not assume that because I disagree with you that I haven't explored the option. The implementation I see is for the dtor to create a WaitableEvent that gets passed to a StopTask() via a parameter to PostTask. Stop() would then wait on that event and StopTask() would signal it when ready.
Seems pretty simple on the surface, though that can be misleading. It's not complex. But it's not necessary for this CL. And I believe it's an inadequate solution because it fails to cover too many use-cases and thus will eventually have to be removed in favor of something more complex. Regardless of what I believe, though, it still doesn't belong in this CL. > > > 2. Assuming my platform-agnostic approach works, all platforms gain multiple > > > thread profiling support. Profiling scenarios other than > thread-profiles-self > > > place responsibility on the entity initiating the profiling to ensure the > > > profiling stops before the thread exits. I have little concern about this > > caveat > > > since the thread scheduler generally should be able provide this > coordination > > in > > > the cases we care about. > > > > The basic solution doesn't even fully support A-samples-A. It supports only > > A-samples-A-below (meaning until the current scope exits). We'd have to > remove > > the static StartAndRunAsync methods, or perhaps limit them exclusively to the > UI > > thread. > > Removing StartAndRunAsync would be fine with me. It's not used, and the purpose > for which it was implemented was found to be supportable using an object-owned > StackSamplingProfiler. In most if not all cases where people might be tempted to > use it, they'd be better off using Start and coordinating threads. That would be > better from a system design perspective since it would force the inter-thread > relationships to be explicit in the code. That's news to me. If the interface supports nothing but A-samples-A-following then it's a different ballgame. It still doesn't belong in this CL but it's now a reasonable solution. Coded up quickly: https://codereview.chromium.org/2680703004 It should probably still have a DCHECK that sampled-thread == current-thread but I gotta go take my son to Judo. :-) > > > I'm not saying that a platform-agnostic solution isn't of benefit. 
I'm just > > saying it shouldn't be done here. > > > > I also disagree that fixing the A-samples-A-below case is insufficient and > that > > a full general solution to the A-samples-B case should be found in order to > > avoid having to rewrite it later when a developer has need to profile exactly > > that case. But this is a discussion for that other CL.
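For reference, the blocking stop described in the comment above (the destructor creates an event, hands it to a stop task on the sampling thread, and waits for the signal) can be sketched in standalone C++. This is only an illustration under assumptions: `Event` and `WorkerThread` are hypothetical stand-ins for base::WaitableEvent and the sampling thread's task runner, and `SampleThenStopSynchronously` is a made-up helper, not code from this CL.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical stand-in for base::WaitableEvent (manual reset, starts unsignaled).
class Event {
 public:
  void Signal() {
    std::lock_guard<std::mutex> lock(mutex_);
    signaled_ = true;
    cv_.notify_all();
  }
  void Wait() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return signaled_; });
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool signaled_ = false;
};

// Hypothetical stand-in for the sampling thread: runs posted tasks in FIFO order.
class WorkerThread {
 public:
  WorkerThread() : thread_([this] { Run(); }) {}
  ~WorkerThread() {
    PostTask(nullptr);  // An empty task tells the loop to quit.
    thread_.join();
  }
  void PostTask(std::function<void()> task) {
    std::lock_guard<std::mutex> lock(mutex_);
    tasks_.push(std::move(task));
    cv_.notify_one();
  }

 private:
  void Run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !tasks_.empty(); });
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      if (!task)
        return;
      task();
    }
  }

  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  std::thread thread_;  // Declared last so the queue exists before Run() starts.
};

// Posts `num_samples` sampling tasks, then performs a blocking stop.
// Returns how many samples had been recorded once the stop completed.
int SampleThenStopSynchronously(int num_samples) {
  WorkerThread sampling_thread;
  std::atomic<int> recorded{0};
  for (int i = 0; i < num_samples; ++i)
    sampling_thread.PostTask([&recorded] { ++recorded; });

  // Stop(): post the stop task carrying the event, then block on the event.
  // The event is shared so it outlives whichever side finishes last.
  auto stopped = std::make_shared<Event>();
  sampling_thread.PostTask([stopped] { stopped->Signal(); });
  stopped->Wait();
  return recorded.load();
}
```

Because the queue is FIFO and the stop task is posted last, every sample posted before the stop has run by the time the wait returns. The trade-off is that the caller blocks until the sampling thread reaches the stop task.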
On 2017/02/07 23:04:07, bcwhite wrote: > But the Windows native stack sampler never sets THREAD_EXITED because > it has no way to detect that the thread has exited. If this approach can't tell us when the thread has exited, then it doesn't solve the problem at hand. Generating samples for a thread must stop after thread exit. Otherwise the extra samples will skew the results. Continuing profiler execution may also waste power due to unnecessary wakeups. > > > No consideration was given to perhaps a Mac solution needing something else. > > > > No, that's wrong. I have a very good idea of what is necessary for Mac, having > > reviewed the initial NativeStackSampler implementation in > > https://codereview.chromium.org/1346453004. My judgement, in that case and > this > > one, is strongly informed by what is needed for the platform given the likely > > implementation. > > But that's you. You're asking me to write for something about which I have no > knowledge in a CL that doesn't need it. Understanding the broader needs of the code and asking reviewees to address them is precisely my responsibility as OWNER and reviewer. There's no special knowledge needed to write the platform-agnostic code since the constraints on Mac are exactly the same as on Windows. That's the whole point of having platform-agnostic code. > Please do not assume that because I disagree with you that I haven't explored > the option. I won't assume you haven't explored the option if you don't assume I haven't thoroughly considered the larger context for this change. :) > > Removing StartAndRunAsync would be fine with me. It's not used, and the > purpose > > for which it was implemented was found to be supportable using an object-owned > > StackSamplingProfiler. In most if not all cases where people might be tempted > to > > use it, they'd be better off using Start and coordinating threads. 
That would > be > > better from a system design perspective since it would force the inter-thread > > relationships to be explicit in the code. > > That's news to me. If the interface supports nothing but A-samples-A-following > then it's a different ballgame. It still doesn't belong in this CL but it's now > a reasonable solution. > > Coded up quickly: > https://codereview.chromium.org/2680703004 > > It should probably still have a DCHECK that sampled-thread == current-thread but > I gotta go take my son to Judo. :-) That looks like a reasonable start to a platform-agnostic solution. > > > > > I'm not saying that a platform-agnostic solution isn't of benefit. I'm just > > > saying it shouldn't be done here. > > > > > > I also disagree that fixing the A-samples-A-below case is insufficient and > > that > > > a full general solution to the A-samples-B case should be found in order to > > > avoid having to rewrite it later when a developer has need to profile > exactly > > > that case. But this is a discussion for that other CL. As I've mentioned, we don't need to have a general case solution to the A-samples-B problem because it can be addressed external to the profiler for the likely use cases in Chrome. Doing so is even preferable because it will lead to better system design around threading. I'm afraid you're just going to have to believe me on this. :)
> > But the Windows native stack sampler never sets THREAD_EXITED because > > it has no way to detect that the thread has exited. > > If this approach can't tell us when the thread has exited, then it doesn't solve > the problem at hand. No, but it provides a mechanism for a native sampler that *can* detect the exit of a thread to report such and have sampling stop in that case. I'll remove it if you prefer but since it was already written, and a seemingly useful feature, I left it in. > Generating samples for a thread must stop after thread exit. Otherwise the extra > samples will skew the results. > > Continuing profiler execution may also waste power due to unnecessary wakeups. But it does stop! Destruction of the profiler requests the stop of the sampling. It just doesn't wait for it to stop. In the rare case where somebody creates a profiler on a thread that samples itself until its own death then it's possible that one sample may occur after the thread dies but before the sampling thread gets around to processing the "Remove" task that was posted to it. Note that because the posted task is not delayed, it will come before any pending sampling tasks which means that the thread under test would have to post the task and exit completely before an already-started "RecordSample" task on the sampling thread actually gets around to trying to suspend the thread. So yes, it *can* happen without "join synchronization" but it would be an amazingly rare occurrence with the only problem being a single empty stack frame recorded at the end. Such an occurrence would be nothing but noise, if it were to actually happen at all, of a threshold far below the general variation of the sampling itself. > > > > No consideration was given to perhaps a Mac solution needing something > else. > > > > > > No, that's wrong. 
I have a very good idea of what is necessary for Mac, having reviewed the initial NativeStackSampler implementation in https://codereview.chromium.org/1346453004. My judgement, in that case and this one, is strongly informed by what is needed for the platform given the likely implementation.
> >
> > But that's you. You're asking me to write for something about which I have no knowledge in a CL that doesn't need it.
>
> Understanding the broader needs of the code and asking reviewees to address them is precisely my responsibility as OWNER and reviewer.

Sure. Nobody is saying that these things aren't important.

> There's no special knowledge needed to write the platform-agnostic code since the constraints on Mac are exactly the same as on Windows. That's the whole point of having platform-agnostic code.

But it's unnecessary *here* which is why it should be in a separate CL. A separate CL where a proper discussion of what is necessary can be held and the intricacies of it can be explored in its own context.

> > Please do not assume that because I disagree with you that I haven't explored the option.
>
> I won't assume you haven't explored the option if you don't assume I haven't thoroughly considered the larger context for this change. :)

Nobody is arguing that you haven't. I'm only arguing that it's unnecessary here and should be done in a different CL.

> > > Removing StartAndRunAsync would be fine with me. It's not used, and the purpose for which it was implemented was found to be supportable using an object-owned StackSamplingProfiler. In most if not all cases where people might be tempted to use it, they'd be better off using Start and coordinating threads. That would be better from a system design perspective since it would force the inter-thread relationships to be explicit in the code.
> >
> > That's news to me.
If the interface supports nothing but > A-samples-A-following > > then it's a different ballgame. It still doesn't belong in this CL but it's > now > > a reasonable solution. > > > > Coded up quickly: > > https://codereview.chromium.org/2680703004 > > > > It should probably still have a DCHECK that sampled-thread == current-thread > but > > I gotta go take my son to Judo. :-) > > That looks like a reasonable start to a platform-agnostic solution. Comments welcome. Happy to get it done. > > > > > > I'm not saying that a platform-agnostic solution isn't of benefit. > I'm just > > > > saying it shouldn't be done here. > > > > > > > > I also disagree that fixing the A-samples-A-below case is insufficient and > > > that > > > > a full general solution to the A-samples-B case should be found in order > to > > > > avoid having to rewrite it later when a developer has need to profile > > exactly > > > > that case. But this is a discussion for that other CL. > > As I've mentioned, we don't need to have a general case solution to the > A-samples-B problem because it can be addressed external to the profiler for the > likely use cases in Chrome. Doing so is even preferable because it will lead to > better system design around threading. I'm afraid you're just going to have to > believe me on this. :) Fine. But let's discuss it on another CL so this one can start testing.
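The ordering argument made above (an undelayed "Remove" task runs ahead of a pending, delayed "RecordSample" task) can be modeled with a tiny virtual-time task queue. This is a sketch only: `FakeTaskQueue` is hypothetical and simply assumes that an undelayed task runs before a delayed task that is not yet due. It does not model a RecordSample task that has already started running, which is the residual race the comment concedes.

```cpp
#include <cassert>
#include <queue>
#include <string>
#include <vector>

struct Task {
  long run_at_ms;    // Earliest virtual time at which the task may run.
  long sequence;     // Posting order, used to break ties.
  std::string name;
};

// Orders the priority queue so the earliest-due, earliest-posted task is on top.
struct RunsLater {
  bool operator()(const Task& a, const Task& b) const {
    if (a.run_at_ms != b.run_at_ms)
      return a.run_at_ms > b.run_at_ms;
    return a.sequence > b.sequence;
  }
};

// Hypothetical single-threaded model of a task queue with delayed tasks.
class FakeTaskQueue {
 public:
  void PostDelayedTask(const std::string& name, long delay_ms) {
    queue_.push(Task{now_ms_ + delay_ms, next_sequence_++, name});
  }
  void PostTask(const std::string& name) { PostDelayedTask(name, 0); }

  // Drains the queue, advancing virtual time as needed, and returns the task
  // names in the order they would have run.
  std::vector<std::string> RunAll() {
    std::vector<std::string> order;
    while (!queue_.empty()) {
      Task task = queue_.top();
      queue_.pop();
      if (task.run_at_ms > now_ms_)
        now_ms_ = task.run_at_ms;
      order.push_back(task.name);
    }
    return order;
  }

 private:
  long now_ms_ = 0;
  long next_sequence_ = 0;
  std::priority_queue<Task, std::vector<Task>, RunsLater> queue_;
};

// A sample is pending 100 ms out when the profiler is destroyed and posts an
// undelayed Remove.
std::vector<std::string> RemoveVersusPendingSample() {
  FakeTaskQueue queue;
  queue.PostDelayedTask("RecordSample", 100);
  queue.PostTask("Remove");
  return queue.RunAll();
}
```

In this model the Remove task always wins, so the only way a sample can land after destruction is if RecordSample was already executing when Remove was posted.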
On 2017/02/10 14:47:18, bcwhite wrote: > > > But the Windows native stack sampler never sets THREAD_EXITED because > > > it has no way to detect that the thread has exited. > > > > If this approach can't tell us when the thread has exited, then it doesn't > solve > > the problem at hand. > > No, but it provides a mechanism for a native sampler that *can* detect the exit > of a thread to report such and have sampling stop in that case. > > I'll remove it if you prefer but since it was already written, and a seemingly > useful feature, I left it in. By "this approach" I mean the entire strategy of handling thread exit by relying on SuspendThread failing. > > Generating samples for a thread must stop after thread exit. Otherwise the > extra > > samples will skew the results. > > > > Continuing profiler execution may also waste power due to unnecessary wakeups. > > But it does stop! Destruction of the profiler requests the stop of the > sampling. It just doesn't wait for it to stop. It does not stop in the A-samples-B case, where B exits. > > There's no special knowledge needed to write the platform-agnostic code since > > the constraints on Mac are exactly the same as on Windows. That's the whole > > point of having platform-agnostic code. > > But it's unnecessary *here* which is why it should be in a separate CL. A > separate CL where a proper discussion of what is necessary can be held and the > intricacies of it can be explored in its own context. It's premature to consider what would be done in any follow-on CLs when we don't even know if the approach in the current CL is viable.
> > > If this approach can't tell us when the thread has exited, then it doesn't solve the problem at hand.
> >
> > No, but it provides a mechanism for a native sampler that *can* detect the exit of a thread to report such and have sampling stop in that case.
> >
> > I'll remove it if you prefer but since it was already written, and a seemingly useful feature, I left it in.
>
> By "this approach" I mean the entire strategy of handling thread exit by relying on SuspendThread failing.

Again, there is no strategy of handling thread-exit by SuspendThread failing. I tried that and removed it a week or two ago. Now if a thread exits, it'll just append empty frames.

> > > Generating samples for a thread must stop after thread exit. Otherwise the extra samples will skew the results.
> > >
> > > Continuing profiler execution may also waste power due to unnecessary wakeups.
> >
> > But it does stop! Destruction of the profiler requests the stop of the sampling. It just doesn't wait for it to stop.
>
> It does not stop in the A-samples-B case, where B exits.

Why bring that up when you want to remove support for such?

Yes, in that case you could end up with many empty frames at the end of the sample. Such could easily be pruned, either in Chrome or on the server, if you feel it's a real problem. There are no stability issues, however, because Windows won't start sampling some other thread with the same ID: open handles prevent the ID from being reused.

> > > There's no special knowledge needed to write the platform-agnostic code since the constraints on Mac are exactly the same as on Windows. That's the whole point of having platform-agnostic code.
> >
> > But it's unnecessary *here* which is why it should be in a separate CL. A separate CL where a proper discussion of what is necessary can be held and the intricacies of it can be explored in its own context.
> > It's premature to consider what would be done in any follow-on CLs when we don't > even know if the approach in the current CL is viable. I believe it is, and have provided evidence and reasoning to support it. I have no evidence to the contrary.
On 2017/02/10 18:36:06, bcwhite wrote: > > > > If this approach can't tell us when the thread has exited, then it doesn't > > > solve > > > > the problem at hand. > > > > > > No, but it provides a mechanism for a native sampler that *can* detect the > > exit > > > of a thread to report such and have sampling stop in that case. > > > > > > I'll remove it if you prefer but since it was already written, and a > seemingly > > > useful feature, I left it in. > > > > By "this approach" I mean the entire strategy of handling thread exit by > relying > > on SuspendThread failing. > > Again, there is no strategy of handling thread-exit by SuspendThread failing. I > tried that and removed it a week or two ago. Now if a thread exits, it'll just > append empty frames. Huh? How does the empty frame get appended in the thread exit case if not by the SuspendThread call failing? > > > > Generating samples for a thread must stop after thread exit. Otherwise the > > > extra > > > > samples will skew the results. > > > > > > > > Continuing profiler execution may also waste power due to unnecessary > > wakeups. > > > > > > But it does stop! Destruction of the profiler requests the stop of the > > > sampling. It just doesn't wait for it to stop. > > > > It does not stop in the A-samples-B case, where B exits. > > Why bring that up when you want to remove support for such? I didn't say I want to remove support for A-samples-B. I said we don't need a *general case* solution for A-samples-B, where there is no relationship between A and B other than the profiling. The other cases of A-samples-B can be handled external to the profiler by having the profiler user do the necessary thread synchronization. > Yes, in that case you could end up with many empty frames at the end of the > sample. Such could easily be pruned, either in Chrome or on the server, if you > feel it's a real problem. 
The empty samples* cannot be pruned because it's not possible to know which of them are the result of thread exit. Some or all can be the result of the transient issues detected by SuspendThreadAndRecordStack. Treating them all as thread exit samples does not work because it throws out the valid data represented by the transient-issue samples. Treating them all as transient-issue samples also does not work because it treats the bogus post-thread-exit samples as valid data. Either way would skew the results, and we would be blind to the severity of the problem. * To be clear: the scenario results in empty samples at the end of the collection rather than empty frames at the end of the sample. > > > > There's no special knowledge needed to write the platform-agnostic code > > since > > > > the constraints on Mac are exactly the same as on Windows. That's the > whole > > > > point of having platform-agnostic code. > > > > > > But it's unnecessary *here* which is why it should be in a separate CL. A > > > separate CL where a proper discussion of what is necessary can be held and > the > > > intricacies of it can be explored in its own context. > > > > It's premature to consider what would be done in any follow-on CLs when we > don't > > even know if the approach in the current CL is viable. > > I believe it is, and have provided evidence and reasoning to support it. I have > no evidence to the contrary. Understood, and I've already stated why I find this evidence and reasoning insufficient.
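A small standalone sketch of the ambiguity described above. The `Sample` type and both collection helpers are hypothetical; the point is only that a collection ending in thread exit and one ending in transient capture failures can contain identical data, so a pruning pass has nothing to distinguish them by.

```cpp
#include <cassert>
#include <vector>

// Hypothetical simplification: a sample is a list of frame addresses, and an
// empty list means the capture attempt recorded nothing.
using Sample = std::vector<unsigned long>;

// Drops every empty sample at the end of a collection.
std::vector<Sample> PruneTrailingEmpty(std::vector<Sample> samples) {
  while (!samples.empty() && samples.back().empty())
    samples.pop_back();
  return samples;
}

// The thread exited after the first sample, so later attempts come back empty.
std::vector<Sample> CollectionWithThreadExit() {
  return {Sample{0x10, 0x20}, Sample{}, Sample{}};
}

// The thread is still alive, but the last two attempts failed transiently.
std::vector<Sample> CollectionWithTransientFailures() {
  return {Sample{0x10, 0x20}, Sample{}, Sample{}};
}
```

Both helpers return the same data, so pruning either discards two legitimate "no stack captured" observations in the second case, or keeping them counts two bogus post-exit samples in the first.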
> > > By "this approach" I mean the entire strategy of handling thread exit by > > relying > > > on SuspendThread failing. > > > > Again, there is no strategy of handling thread-exit by SuspendThread failing. > I > > tried that and removed it a week or two ago. Now if a thread exits, it'll > just > > append empty frames. > > Huh? How does the empty frame get appended in the thread exit case if not by the > SuspendThread call failing? In exactly that way: by SuspendThread failing. A failing SuspendThread call results in an empty sample. I experimented with it causing an exit but due to your concern of it being only a transient error, I removed it. There is no way I found for NativeStackSamplerWin to detect that a thread has exited so that condition never gets set. Other OS may have that ability and set it. > > > > > Generating samples for a thread must stop after thread exit. Otherwise > the > > > > extra > > > > > samples will skew the results. > > > > > > > > > > Continuing profiler execution may also waste power due to unnecessary > > > wakeups. > > > > > > > > But it does stop! Destruction of the profiler requests the stop of the > > > > sampling. It just doesn't wait for it to stop. > > > > > > It does not stop in the A-samples-B case, where B exits. > > > > Why bring that up when you want to remove support for such? > > I didn't say I want to remove support for A-samples-B. I said we don't need a > *general case* solution for A-samples-B, where there is no relationship between > A and B other than the profiling. The other cases of A-samples-B can be handled > external to the profiler by having the profiler user do the necessary thread > synchronization. Okay. > > Yes, in that case you could end up with many empty frames at the end of the > > sample. Such could easily be pruned, either in Chrome or on the server, if > you > > feel it's a real problem. 
>
> The empty samples* cannot be pruned because it's not possible to know which of them are the result of thread exit. Some or all can be the result of the transient issues detected by SuspendThreadAndRecordStack.

Yes, it's possible that the final empty samples are due to a transient failure but dropping them anyway isn't really going to cause any more confusion than leaving them in.

> Treating them all as thread exit samples does not work because it throws out the valid data represented by the transient-issue samples. Treating them all as transient-issue samples also does not work because it treats the bogus post-thread-exit samples as valid data.
>
> Either way would skew the results, and we would be blind to the severity of the problem.

But that's not something we have to worry about here because, as you said above, management of A-samples-B is to be handled by external synchronization and this is only an A-samples-B issue.

> * To be clear: the scenario results in empty samples at the end of the collection rather than empty frames at the end of the sample.

Right.

> > > > > There's no special knowledge needed to write the platform-agnostic code since the constraints on Mac are exactly the same as on Windows. That's the whole point of having platform-agnostic code.
> > > >
> > > > But it's unnecessary *here* which is why it should be in a separate CL. A separate CL where a proper discussion of what is necessary can be held and the intricacies of it can be explored in its own context.
> > >
> > > It's premature to consider what would be done in any follow-on CLs when we don't even know if the approach in the current CL is viable.
> >
> > I believe it is, and have provided evidence and reasoning to support it. I have no evidence to the contrary.
>
> Understood, and I've already stated why I find this evidence and reasoning insufficient.
We should talk about this on a VC Monday with Alexei because we're just going in circles here.
On 2017/02/11 02:57:10, bcwhite wrote: > > Treating them all as thread exit samples does not work because it throws out > the > > valid data represented by the transient-issue samples. Treating them all as > > transient-issue samples also does not work because it treats the bogus > > post-thread-exit samples as valid data. > > > > Either way would skew the results, and we would be blind to the severity of > the > > problem. > > But that's not something we have to worry about here because, as you said above, > management of A-samples-B is to be handled by external synchronization and this > is only an A-samples-B issue. That's correct, if we have the StackSamplingProfiler/SamplingThread synchronization in place at StackSamplingProfiler destruction (or otherwise pre-thread-exit). We need to have that implementation in place to avoid this scenario. > We should talk about this on a VC Monday with Alexei because we're just going in > circles here. I think we have come full circle -- if I understand correctly we've just established that the current proposed approach to handling thread exit is exactly functionally equivalent to the existing implementation. The main remaining piece left to make the thread exit piece work is the StackSamplingProfiler/SamplingThread synchronization. If we can do that and some code cleanup, then I'm good for that part of the review. Then the remaining piece to get this thing done is the review of the shutdown behavior. Not sure if we still need the VC, but I'm happy to do it if you want.
> > > Treating them all as thread exit samples does not work because it throws out the valid data represented by the transient-issue samples. Treating them all as transient-issue samples also does not work because it treats the bogus post-thread-exit samples as valid data.
> > >
> > > Either way would skew the results, and we would be blind to the severity of the problem.
> >
> > But that's not something we have to worry about here because, as you said above, management of A-samples-B is to be handled by external synchronization and this is only an A-samples-B issue.
>
> That's correct, if we have the StackSamplingProfiler/SamplingThread synchronization in place at StackSamplingProfiler destruction (or otherwise pre-thread-exit). We need to have that implementation in place to avoid this scenario.

The current implementation is not synchronized but it's damned close with the extremely unlikely case with at most one sample being taken after the dtor returns. This is safe under Windows (empty sample should the thread have had time to exit) but unknown for future OS implementations.

> > We should talk about this on a VC Monday with Alexei because we're just going in circles here.
>
> I think we have come full circle -- if I understand correctly we've just established that the current proposed approach to handling thread exit is exactly functionally equivalent to the existing implementation.
>
> The main remaining piece left to make the thread exit piece work is the StackSamplingProfiler/SamplingThread synchronization. If we can do that and some code cleanup, then I'm good for that part of the review. Then the remaining piece to get this thing done is the review of the shutdown behavior.

Synchronization is done in draft. It's here: https://codereview.chromium.org/2680703004
On 2017/02/13 18:25:38, bcwhite wrote:
> > > > Treating them all as thread exit samples does not work because it throws out the valid data represented by the transient-issue samples. Treating them all as transient-issue samples also does not work because it treats the bogus post-thread-exit samples as valid data.
> > > >
> > > > Either way would skew the results, and we would be blind to the severity of the problem.
> > >
> > > But that's not something we have to worry about here because, as you said above, management of A-samples-B is to be handled by external synchronization and this is only an A-samples-B issue.
> >
> > That's correct, if we have the StackSamplingProfiler/SamplingThread synchronization in place at StackSamplingProfiler destruction (or otherwise pre-thread-exit). We need to have that implementation in place to avoid this scenario.
>
> The current implementation is not synchronized but it's damned close with the extremely unlikely case with at most one sample being taken after the dtor returns. This is safe under Windows (empty sample should the thread have had time to exit) but unknown for future OS implementations.

We saw intermittent crashes in the profiler when running tests on the buildbots, before the Join implementation was in place, so I don't think we can conclude it's sufficiently close to synchronized. It's also completely unsynchronized in the A-samples-B case.

> > > We should talk about this on a VC Monday with Alexei because we're just going in circles here.
> >
> > I think we have come full circle -- if I understand correctly we've just established that the current proposed approach to handling thread exit is exactly functionally equivalent to the existing implementation.
> >
> > The main remaining piece left to make the thread exit piece work is the StackSamplingProfiler/SamplingThread synchronization. If we can do that and some code cleanup, then I'm good for that part of the review. Then the remaining piece to get this thing done is the review of the shutdown behavior.
>
> Synchronization is done in draft. It's here:
> https://codereview.chromium.org/2680703004

Yes, I think we need to move that into this CL.
> > > > > Treating them all as thread exit samples does not work because it throws out the valid data represented by the transient-issue samples. Treating them all as transient-issue samples also does not work because it treats the bogus post-thread-exit samples as valid data.
> > > > >
> > > > > Either way would skew the results, and we would be blind to the severity of the problem.
> > > >
> > > > But that's not something we have to worry about here because, as you said above, management of A-samples-B is to be handled by external synchronization and this is only an A-samples-B issue.
> > >
> > > That's correct, if we have the StackSamplingProfiler/SamplingThread synchronization in place at StackSamplingProfiler destruction (or otherwise pre-thread-exit). We need to have that implementation in place to avoid this scenario.
> >
> > The current implementation is not synchronized but it's damned close with the extremely unlikely case with at most one sample being taken after the dtor returns. This is safe under Windows (empty sample should the thread have had time to exit) but unknown for future OS implementations.
>
> We saw intermittent crashes in the profiler when running tests on the buildbots, before the Join implementation was in place, so I don't think we can conclude it's sufficiently close to synchronized.

Were those crashes due to the sampled-thread exiting while under test or the sampling-thread continuing to operate?

> It's also completely unsynchronized in the A-samples-B case.

Yes. There is no fix for that since they're independent and thus either could exit at any time.
On 2017/02/13 18:46:39, bcwhite wrote:
> > > > > > Treating them all as thread exit samples does not work because it throws out the valid data represented by the transient-issue samples. Treating them all as transient-issue samples also does not work because it treats the bogus post-thread-exit samples as valid data.
> > > > > >
> > > > > > Either way would skew the results, and we would be blind to the severity of the problem.
> > > > >
> > > > > But that's not something we have to worry about here because, as you said above, management of A-samples-B is to be handled by external synchronization and this is only an A-samples-B issue.
> > > >
> > > > That's correct, if we have the StackSamplingProfiler/SamplingThread synchronization in place at StackSamplingProfiler destruction (or otherwise pre-thread-exit). We need to have that implementation in place to avoid this scenario.
> > >
> > > The current implementation is not synchronized but it's damned close with the extremely unlikely case with at most one sample being taken after the dtor returns. This is safe under Windows (empty sample should the thread have had time to exit) but unknown for future OS implementations.
> >
> > We saw intermittent crashes in the profiler when running tests on the buildbots, before the Join implementation was in place, so I don't think we can conclude it's sufficiently close to synchronized.
>
> Were those crashes due to the sampled-thread exiting while under test or the sampling-thread continuing to operate?

They were due to the sampled-thread exiting while under test.

> > It's also completely unsynchronized in the A-samples-B case.
>
> Yes. There is no fix for that since they're independent and thus either could exit at any time.

Right, which is why we need the synchronization in this CL.
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Sync-stop CL merged into this one. PTAL
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: linux_chromium_chromeos_ozone_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_...)
On 2017/02/13 21:08:52, bcwhite wrote:
> Sync-stop CL merged into this one. PTAL

Thanks.

https://codereview.chromium.org/2554123002/diff/580001/base/profiler/native_s...
File base/profiler/native_stack_sampler.h (right):

https://codereview.chromium.org/2554123002/diff/580001/base/profiler/native_s...
base/profiler/native_stack_sampler.h:24: // The thread state as determined by the sampler during it's last attempt.
All changes in this file and native_stack_sampler_win.cc can be removed since they're not adding any functionality to the existing implementation.

https://codereview.chromium.org/2554123002/diff/580001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/580001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:130: PlatformThreadId target; // The thread being sampled.
const

https://codereview.chromium.org/2554123002/diff/580001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:131: SamplingParams params; // Information about how to sample.
const

https://codereview.chromium.org/2554123002/diff/580001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:132: CompletedCallback callback; // Callback made when sampling is complete.
const

https://codereview.chromium.org/2554123002/diff/580001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:133: WaitableEvent* finished; // Signaled when all sampling is complete.
WaitableEvent* const

https://codereview.chromium.org/2554123002/diff/580001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:136: std::unique_ptr<NativeStackSampler> native_sampler;
const

https://codereview.chromium.org/2554123002/diff/580001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:136: std::unique_ptr<NativeStackSampler> native_sampler;
Move all const state before all mutable state, to make it easier to understand what's changing here.

https://codereview.chromium.org/2554123002/diff/580001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.h (right):

https://codereview.chromium.org/2554123002/diff/580001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.h:41: //
If we refactor the constructors to:

  StackSamplingProfiler(const SamplingParams& params,
                        const CompletedCallback& callback,
                        NativeStackSamplerTestDelegate* test_delegate = nullptr);

for the A-samples-A case, and

  StackSamplingProfiler(PlatformThreadId thread_id,
                        const SamplingParams& params,
                        const CompletedCallback& callback,
                        NativeStackSamplerTestDelegate* test_delegate = nullptr);

for the A-samples-B case, then this comment can go directly on the second constructor where it's more likely to be noticed.

https://codereview.chromium.org/2554123002/diff/580001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/580001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:879: TEST(StackSamplingProfilerTest, MAYBE_DestroyThreadWhileProfiling) {
This test can be removed since it's no longer the responsibility of the StackSamplingProfiler to manage thread exit.

https://codereview.chromium.org/2554123002/diff/580001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:973: // Checks that requests to start profiling while another profile is taking place
This comment needs updating.

https://codereview.chromium.org/2554123002/diff/580001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1020: // Give the other profiler a chance to finish and verify it does no.
nit: does so
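One way the const-first layout suggested above for the per-collection state could look. This is a sketch with assumptions: the struct name `CollectionState`, the stand-in type definitions, and the mutable `burst` and `sample` counters are hypothetical; only the five field names and their comments come from the lines under review.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <type_traits>
#include <utility>

// Hypothetical stand-ins for the Chromium types named in the review.
using PlatformThreadId = int;
struct SamplingParams { int samples_per_burst = 300; };
using CompletedCallback = std::function<void(int)>;
struct WaitableEvent {};
struct NativeStackSampler {};

// Hypothetical name for the struct holding one collection's state.
struct CollectionState {
  CollectionState(PlatformThreadId target,
                  const SamplingParams& params,
                  CompletedCallback callback,
                  WaitableEvent* finished,
                  std::unique_ptr<NativeStackSampler> sampler)
      : target(target),
        params(params),
        callback(std::move(callback)),
        finished(finished),
        native_sampler(std::move(sampler)) {}

  // Const state: fixed for the lifetime of the collection.
  const PlatformThreadId target;     // The thread being sampled.
  const SamplingParams params;       // Information about how to sample.
  const CompletedCallback callback;  // Callback made when sampling is complete.
  WaitableEvent* const finished;     // Signaled when all sampling is complete.
  const std::unique_ptr<NativeStackSampler> native_sampler;

  // Mutable state (hypothetical fields): changes as sampling progresses.
  int burst = 0;
  int sample = 0;
};
```

With the const members grouped first, a reader can see at a glance that only the trailing counters change after construction, and the compiler rejects accidental writes to the rest.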
We'll probably need to special case the ThreadRestrictions::AssertWaitAllowed implementation for this to address the test failures.
On 2017/02/13 22:41:04, Mike Wittman wrote: > We'll probably need to special case the ThreadRestrictions::AssertWaitAllowed > implementation for this to address the test failures. ScopedAllowWait is probably the right way to go.
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Patchset #20 (id:600001) has been deleted
https://codereview.chromium.org/2554123002/diff/580001/base/profiler/native_s... File base/profiler/native_stack_sampler.h (right): https://codereview.chromium.org/2554123002/diff/580001/base/profiler/native_s... base/profiler/native_stack_sampler.h:24: // The thread state as determined by the sampler during it's last attempt. On 2017/02/13 22:35:57, Mike Wittman wrote: > All changes in this file and native_stack_sampler_win.cc can be removed since > they're not adding any functionality to the existing implementation. Done. https://codereview.chromium.org/2554123002/diff/580001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/580001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:130: PlatformThreadId target; // The thread being sampled. On 2017/02/13 22:35:57, Mike Wittman wrote: > const Done. https://codereview.chromium.org/2554123002/diff/580001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:131: SamplingParams params; // Information about how to sample. On 2017/02/13 22:35:57, Mike Wittman wrote: > const Done. https://codereview.chromium.org/2554123002/diff/580001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:132: CompletedCallback callback; // Callback made when sampling is complete. On 2017/02/13 22:35:57, Mike Wittman wrote: > const Done. https://codereview.chromium.org/2554123002/diff/580001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:133: WaitableEvent* finished; // Signaled when all sampling is complete. On 2017/02/13 22:35:57, Mike Wittman wrote: > WaitableEvent* const Done. https://codereview.chromium.org/2554123002/diff/580001/base/profiler/stack_sa... 
base/profiler/stack_sampling_profiler.cc:136: std::unique_ptr<NativeStackSampler> native_sampler; On 2017/02/13 22:35:57, Mike Wittman wrote: > Move all const state before all mutable state, to make it easier to understand > what's changing here. Done. https://codereview.chromium.org/2554123002/diff/580001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:136: std::unique_ptr<NativeStackSampler> native_sampler; On 2017/02/13 22:35:57, Mike Wittman wrote: > Move all const state before all mutable state, to make it easier to understand > what's changing here. Done. https://codereview.chromium.org/2554123002/diff/580001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.h (right): https://codereview.chromium.org/2554123002/diff/580001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.h:41: // On 2017/02/13 22:35:57, Mike Wittman wrote: > If we refactor the constructors to: > > StackSamplingProfiler(const SamplingParams& params, > const CompletedCallback& callback, > NativeStackSamplerTestDelegate* test_delegate = > nullptr); > > for the A-samples-A case, and > > StackSamplingProfiler(PlatformThreadId thread_id, > const SamplingParams& params, > const CompletedCallback& callback, > NativeStackSamplerTestDelegate* test_delegate = > nullptr); > > for the A-samples-B case, then this comment can go directly on the second > constructor where it's more likely to be noticed. I like it. Done. https://codereview.chromium.org/2554123002/diff/580001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/580001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:879: TEST(StackSamplingProfilerTest, MAYBE_DestroyThreadWhileProfiling) { On 2017/02/13 22:35:57, Mike Wittman wrote: > This test can be removed since it's no longer the responsibility of the > StackSamplingProfiler to manage thread exit. Done. 
https://codereview.chromium.org/2554123002/diff/580001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:973: // Checks that requests to start profiling while another profile is taking place On 2017/02/13 22:35:57, Mike Wittman wrote: > This comment needs updating. Done. https://codereview.chromium.org/2554123002/diff/580001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:1020: // Give the other profiler a chance to finish and verify it does no. On 2017/02/13 22:35:57, Mike Wittman wrote: > nit: does so Done.
bcwhite@chromium.org changed reviewers: + brettw@chromium.org
brettw@chromium.org: Please review changes in base/threading/thread_restrictions.h Rationale is provided here: https://codereview.chromium.org/2554123002/diff/620001/base/profiler/stack_sa... line 614 // The behavior of sampling a thread that has exited is undefined and could // cause Bad Things(tm) to occur. The safety model provided by this class is // that an instance of this object is expected to live at least as long as // the thread it is sampling. However, because the sampling is performed // asynchnously by the SamplingThread, there is no way to guarantee this is // true without waiting for it to signal that it has finished.
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: linux_chromium_chromeos_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_...)
On 2017/02/14 14:37:03, bcwhite wrote: > mailto:brettw@chromium.org: Please review changes in > base/threading/thread_restrictions.h > > Rationale is provided here: > https://codereview.chromium.org/2554123002/diff/620001/base/profiler/stack_sa... > line 614 > > // The behavior of sampling a thread that has exited is undefined and could > // cause Bad Things(tm) to occur. The safety model provided by this class is > // that an instance of this object is expected to live at least as long as > // the thread it is sampling. However, because the sampling is performed > // asynchnously by the SamplingThread, there is no way to guarantee this is > // true without waiting for it to signal that it has finished. Also, the maximum expected wait here is the time to take one sample, or a few hundred microseconds. Typical wait will be zero. It's worth adding that to the comment.
On 2017/02/14 16:17:34, Mike Wittman wrote: > On 2017/02/14 14:37:03, bcwhite wrote: > > mailto:brettw@chromium.org: Please review changes in > > base/threading/thread_restrictions.h > > > > Rationale is provided here: > > > https://codereview.chromium.org/2554123002/diff/620001/base/profiler/stack_sa... > > line 614 > > > > // The behavior of sampling a thread that has exited is undefined and could > > // cause Bad Things(tm) to occur. The safety model provided by this class is > > // that an instance of this object is expected to live at least as long as > > // the thread it is sampling. However, because the sampling is performed > > // asynchnously by the SamplingThread, there is no way to guarantee this is > > // true without waiting for it to signal that it has finished. > > Also, the maximum expected wait here is the time to take one sample, or a few > hundred microseconds. Typical wait will be zero. > > It's worth adding that to the comment. There's a comment to that effect a few lines earlier. // Stop is immediate but asynchronous. There is a non-zero probability that // one more sample will be taken after this call returns.
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
On 2017/02/14 17:32:44, bcwhite wrote: > On 2017/02/14 16:17:34, Mike Wittman wrote: > > On 2017/02/14 14:37:03, bcwhite wrote: > > > mailto:brettw@chromium.org: Please review changes in > > > base/threading/thread_restrictions.h > > > > > > Rationale is provided here: > > > > > > https://codereview.chromium.org/2554123002/diff/620001/base/profiler/stack_sa... > > > line 614 > > > > > > // The behavior of sampling a thread that has exited is undefined and could > > > // cause Bad Things(tm) to occur. The safety model provided by this class is > > > // that an instance of this object is expected to live at least as long as > > > // the thread it is sampling. However, because the sampling is performed > > > // asynchnously by the SamplingThread, there is no way to guarantee this is > > > // true without waiting for it to signal that it has finished. > > > > Also, the maximum expected wait here is the time to take one sample, or a few > > hundred microseconds. Typical wait will be zero. > > > > It's worth adding that to the comment. > > There's a comment to that effect a few lines earlier. > > // Stop is immediate but asynchronous. There is a non-zero probability that > // one more sample will be taken after this call returns. It's important to have the information directly within the comment on ScopedAllowWait, to facilitate auditing of ScopedAllowWait usage by people unfamiliar with this code. The goal of disallowing waits is to avoid jank, so it's also important to specify the time bounds for the expected and maximum waits to justify why this won't be janky.
https://codereview.chromium.org/2554123002/diff/640001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.h (right): https://codereview.chromium.org/2554123002/diff/640001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.h:200: // IMPORTANT: This should generally be created on the local stack (i.e. NOT This guidance is more conservative than necessary. I think it's sufficient to say that the object must be destroyed before thread exit. The general expectation in Chrome is that non-singleton objects are destroyed on the thread where they are created. Not destroying would be a memory leak, so the restriction for this constructor is fairly unremarkable and probably doesn't justify an IMPORTANT label. https://codereview.chromium.org/2554123002/diff/640001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.h:211: // IMPORTANT: Only threads guaranteed to live beyond the lifetime of the It would be best to move the prescriptive advice to the top, and drop the text about the current thread within this comment. Maybe something like: IMPORTANT: Users of this interface must ensure the specified thread outlives this object. Otherwise the profiler will continue trying to profile past thread exit, resulting in crashes or worse. https://codereview.chromium.org/2554123002/diff/640001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.h:241: static void Shutdown(); It's worth considering process shutdown now that we have a solution for thread exit. The destruction of the profilers before profiled thread exit will ensure we won't be actively profiling at process shutdown, so we don't need to worry about the profiled threads at shutdown. The guidance from the thread scheduler team is that we should not be trying to terminate the profiler thread at process shutdown, because it's just unnecessary work. So we don't need to do anything special there either. 
I'm not aware of anything else that would need to happen at shutdown, so I think we can remove this function and the next one.
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: linux_chromium_chromeos_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_...)
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Patchset #22 (id:660001) has been deleted
https://codereview.chromium.org/2554123002/diff/640001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.h (right): https://codereview.chromium.org/2554123002/diff/640001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.h:200: // IMPORTANT: This should generally be created on the local stack (i.e. NOT On 2017/02/14 17:52:32, Mike Wittman wrote: > This guidance is more conservative than necessary. I think it's sufficient to > say that the object must be destroyed before thread exit. > > The general expectation in Chrome is that non-singleton objects are destroyed on > the thread where they are created. Not destroying would be a memory leak, so the > restriction for this constructor is fairly unremarkable and probably doesn't > justify an IMPORTANT label. Done. https://codereview.chromium.org/2554123002/diff/640001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.h:211: // IMPORTANT: Only threads guaranteed to live beyond the lifetime of the On 2017/02/14 17:52:32, Mike Wittman wrote: > It would be best to move the prescriptive advice to the top, and drop the text > about the current thread within this comment. > > Maybe something like: > > IMPORTANT: Users of this interface must ensure the specified thread outlives > this object. Otherwise the profiler will continue trying to profile past thread > exit, resulting in crashes or worse. Done. https://codereview.chromium.org/2554123002/diff/640001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.h:241: static void Shutdown(); > The destruction of the profilers before profiled thread exit will ensure we > won't be actively profiling at process shutdown, so we don't need to worry about > the profiled threads at shutdown. As long as we assume that the assumptions a thread makes about the lifetime of a thread under test are still valid during shutdown. Done.
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: android_n5x_swarming_rel on master.tryserver.chromium.android (JOB_FAILED, https://build.chromium.org/p/tryserver.chromium.android/builders/android_n5x_...)
Still thinking through ShutdownTask() and the tests... https://codereview.chromium.org/2554123002/diff/640001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.h (right): https://codereview.chromium.org/2554123002/diff/640001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.h:241: static void Shutdown(); On 2017/02/14 20:13:27, bcwhite wrote: > > The destruction of the profilers before profiled thread exit will ensure we > > won't be actively profiling at process shutdown, so we don't need to worry > about > > the profiled threads at shutdown. > > As long as we assume that the assumptions a thread makes about the lifetime of a > thread under test are still valid during shutdown. There's a well-defined shut down procedure for process threads within BrowserMainLoop::ShutdownThreadsAndCleanUp(), so I think this is a reasonable assumption. A worst case scenario might require some special case code to run before this function, but I don't anticipate this to be necessary. https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:216: TimeDelta task_runner_idle_shutdown_time_ = TimeDelta::FromSeconds(5); I think it's worth bumping this up to something more like a minute, since keeping a thread around is very cheap. https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:257: if (task_runner) { No need to check the task_runner anymore. https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:310: // started it so that it can be self-managed or stopped on by another nit: stopped by https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... 
base/profiler/stack_sampling_profiler.cc:350: // calculated it now. nit: calculate https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:390: collection->native_sampler->RecordStackSample(&sample); nit: RecordStackSample(&profile.samples.back()) and remove the previous line https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:399: void StackSamplingProfiler::SamplingThread::CheckForIdle() { how about calling this ScheduleShutdownIfIdle, to be clear what's happening here? https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:415: CollectionContext* collection = collection_ptr.get(); nit: better to save off the id and initial_delay and pass those below, to eliminate the burden on the reader of figuring out if the pointer usage is safe https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:445: collection->next_sample_time = Time::Now(); Can we initialize this in the CollectionContext constructor? https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:508: // This will keep a consistent average interval between samples but will Should this comment be on the above if statement? https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:562: native_sampler_ = NativeStackSampler::Create(thread_id_, &RecordAnnotations, Can we continue to use the current strategy of constructing this as a local variable in Start()? It's moved within that function, so the pointer is not valid after that anyway. https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... 
base/profiler/stack_sampling_profiler.cc:579: // short task or none at all if sampling has already completed. nit: ... as long as it takes to collect one sample, taking ~200μs, or none at all ... Jank has a precise definition (tasks taking >118ms) so readers will want to evaluate the wait duration in absolute terms. https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.h (right): https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.h:187: // are move-only. This should run quickly as possible that another thread, I'm having trouble parsing this. How about: Other threads, including the UI thread, may block on callback completion, so this should run as quickly as possible. ? https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.h:200: // Ensure that this object gets destroyed before the current thread exits. nit: The caller must ensure ... Comments with imperatives directed at the reader are not generally done and will be confusing to most developers. https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.h:209: // IMPORTANT: Ensure that the thread being sampled does not exit before this Same here. https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.h:233: static bool IsSamplingThreadRunningForTesting(); Can we move these three functions into an internal TestApi class? (See other examples in the code.) And also, do the same for the two functions they call in SamplingThread? It's not obvious what parts of that class are there for test purposes.
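As an illustration of the TestApi request above, the usual pattern is a nested class that holds all test-only accessors, so the production interface stays free of `...ForTesting` functions. The sketch below is hypothetical: `SamplingThreadSketch`, its members, and the `TestApi` method names are invented for the example and are not the names used in the CL.

```cpp
#include <cassert>

class SamplingThreadSketch {
 public:
  class TestApi;  // All test-only access goes through this nested class.

  // Production interface (simplified for the sketch).
  void Start() { running_ = true; }
  void Stop() { running_ = false; }

 private:
  bool running_ = false;
  int idle_shutdown_seconds_ = 60;
};

// A nested class is a member of its enclosing class, so it can read and
// write the private state without any friend declaration.
class SamplingThreadSketch::TestApi {
 public:
  explicit TestApi(SamplingThreadSketch* thread) : thread_(thread) {
    assert(thread_ != nullptr);
  }

  bool IsRunning() const { return thread_->running_; }
  int idle_shutdown_seconds() const { return thread_->idle_shutdown_seconds_; }
  void SetIdleShutdownSeconds(int seconds) {
    thread_->idle_shutdown_seconds_ = seconds;
  }

 private:
  SamplingThreadSketch* const thread_;
};
```

A test would construct `SamplingThreadSketch::TestApi api(&thread);` and call `api.IsRunning()`, which makes it obvious at the call site that test-only machinery is in use.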
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
https://codereview.chromium.org/2554123002/diff/640001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.h (right): https://codereview.chromium.org/2554123002/diff/640001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.h:241: static void Shutdown(); On 2017/02/15 03:26:44, Mike Wittman wrote: > On 2017/02/14 20:13:27, bcwhite wrote: > > > The destruction of the profilers before profiled thread exit will ensure we > > > won't be actively profiling at process shutdown, so we don't need to worry > > about > > > the profiled threads at shutdown. > > > > As long as we assume that the assumptions a thread makes about the lifetime of > a > > thread under test are still valid during shutdown. > > There's a well-defined shut down procedure for process threads within > BrowserMainLoop::ShutdownThreadsAndCleanUp(), so I think this is a reasonable > assumption. A worst case scenario might require some special case code to run > before this function, but I don't anticipate this to be necessary. Acknowledged. https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:216: TimeDelta task_runner_idle_shutdown_time_ = TimeDelta::FromSeconds(5); On 2017/02/15 03:26:44, Mike Wittman wrote: > I think it's worth bumping this up to something more like a minute, since > keeping a thread around is very cheap. Done. https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:257: if (task_runner) { On 2017/02/15 03:26:44, Mike Wittman wrote: > No need to check the task_runner anymore. Done. https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... 
base/profiler/stack_sampling_profiler.cc:310: // started it so that it can be self-managed or stopped on by another On 2017/02/15 03:26:44, Mike Wittman wrote: > nit: stopped by Done. https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:350: // calculated it now. On 2017/02/15 03:26:44, Mike Wittman wrote: > nit: calculate Done. https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:390: collection->native_sampler->RecordStackSample(&sample); On 2017/02/15 03:26:44, Mike Wittman wrote: > nit: RecordStackSample(&profile.samples.back()) and remove the previous line Done. https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:399: void StackSamplingProfiler::SamplingThread::CheckForIdle() { On 2017/02/15 03:26:44, Mike Wittman wrote: > how about calling this ScheduleShutdownIfIdle, to be clear what's happening > here? Done. https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:415: CollectionContext* collection = collection_ptr.get(); On 2017/02/15 03:26:44, Mike Wittman wrote: > nit: better to save off the id and initial_delay and pass those below, to > eliminate the burden on the reader of figuring out if the pointer usage is safe Done. https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:445: collection->next_sample_time = Time::Now(); On 2017/02/15 03:26:44, Mike Wittman wrote: > Can we initialize this in the CollectionContext constructor? No because it needs to be recorded when the first sample is being done which is after the initial delay plus any other things the sampling-thread is doing. https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... 
base/profiler/stack_sampling_profiler.cc:508: // This will keep a consistent average interval between samples but will On 2017/02/15 03:26:44, Mike Wittman wrote: > Should this comment be on the above if statement? Done. https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:562: native_sampler_ = NativeStackSampler::Create(thread_id_, &RecordAnnotations, On 2017/02/15 03:26:44, Mike Wittman wrote: > Can we continue to use the current strategy of constructing this as a local > variable in Start()? It's moved within that function, so the pointer is not > valid after that anyway. I thought about that but decided to leave it this way because it should allow multiple start/stop operations without having to recreate it each time. https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:579: // short task or none at all if sampling has already completed. On 2017/02/15 03:26:44, Mike Wittman wrote: > nit: ... as long as it takes to collect one sample, taking ~200μs, or none at > all ... > > Jank has a precise definition (tasks taking >118ms) so readers will want to > evaluate the wait duration in absolute terms. Done. https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.h (right): https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.h:187: // are move-only. This should run quickly as possible that another thread, On 2017/02/15 03:26:44, Mike Wittman wrote: > I'm having trouble parsing this. How about: > > Other threads, including the UI thread, may block on callback completion, so > this should run as quickly as possible. > > ? Done. https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... 
base/profiler/stack_sampling_profiler.h:200: // Ensure that this object gets destroyed before the current thread exits. On 2017/02/15 03:26:44, Mike Wittman wrote: > nit: The caller must ensure ... > > Comments with imperatives directed at the reader are not generally done and will > be confusing to most developers. Done. https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.h:209: // IMPORTANT: Ensure that the thread being sampled does not exit before this On 2017/02/15 03:26:44, Mike Wittman wrote: > Same here. Done. https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.h:233: static bool IsSamplingThreadRunningForTesting(); On 2017/02/15 03:26:44, Mike Wittman wrote: > Can we move these three functions into an internal TestApi class? (See other > examples in the code.) > > And also, do the same for the two functions they call in SamplingThread? It's > not obvious what parts of that class are there for test purposes. Done. Let me know if it's what you were thinking.
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
Wow, supporting restartable threads certainly makes for tons of complications around thread start and thread exit... I think this general approach can work. Hopefully it's just a matter of fixing a few things and simplifying to facilitate understanding. https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:415: CollectionContext* collection = collection_ptr.get(); On 2017/02/15 16:17:34, bcwhite wrote: > On 2017/02/15 03:26:44, Mike Wittman wrote: > > nit: better to save off the id and initial_delay and pass those below, to > > eliminate the burden on the reader of figuring out if the pointer usage is > safe > > Done. This is no longer potentially dangerous, so I think we can remove the comment as well. https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.h (right): https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.h:233: static bool IsSamplingThreadRunningForTesting(); On 2017/02/15 16:17:35, bcwhite wrote: > On 2017/02/15 03:26:44, Mike Wittman wrote: > > Can we move these three functions into an internal TestApi class? (See other > > examples in the code.) > > > > And also, do the same for the two functions they call in SamplingThread? It's > > not obvious what parts of that class are there for test purposes. > > Done. Let me know if it's what you were thinking. Looks good. Can you add a TestAPI to SamplingThread as well so we can distinguish the test support code in that class from the other code? https://codereview.chromium.org/2554123002/diff/720001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/720001/base/profiler/stack_sa... 
base/profiler/stack_sampling_profiler.cc:380: // The currently active profile being acptured. nit: captured https://codereview.chromium.org/2554123002/diff/720001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:462: void StackSamplingProfiler::SamplingThread::ShutdownTask() { I think we need some kind of invalidation for ShutdownTasks in-flight. Otherwise, I believe we can get in a situation where an earlier-posted ShutdownTask shuts down the thread immediately after a collection finishes. The relevant sequence of events would be:
- a collection starts
- the collection stops and posts ShutdownTask #1
- a new collection starts
- the new collection stops and posts ShutdownTask #2
- ShutdownTask #1 executes shortly thereafter, finds a non-zero task_runner_create_requests_, and posts itself again
- ShutdownTask #1.1 executes immediately, finds that all conditions to stop have been satisfied, and shuts down the thread
CancelableTaskTracker unfortunately doesn't support delayed tasks, and is also focused on tasks with reply, so I'd suggest implementing a poor-man's task cancellation:
- keep a "state" counter and increment whenever the ShutdownTask should be invalidated
- bind the current state counter value to a ShutdownTask argument when posting
- abort the ShutdownTask if the passed state counter is not equal to the current state counter
I suspect this would make the logic a little easier to reason about too. https://codereview.chromium.org/2554123002/diff/720001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:469: if (!active_collections_.empty()) Isn't this case already covered by the task_runner_create_requests_ check below? i.e. if a new collection was added then task_runner_create_requests_ must have been incremented, right? https://codereview.chromium.org/2554123002/diff/720001/base/profiler/stack_sa... 
base/profiler/stack_sampling_profiler.cc:485: StopSoon(); One note on the Stop/StopSoon/DetachFromSequence calls here and in GetOrCreateTaskRunner: these require mutual exclusion per the Thread interface. We have this by virtue of the task_runner_lock_, but it's not obvious from reading the code that these calls have to be guarded by that lock to ensure correct operation. I'd wait to address this until we're pretty close to a resolution on the restart behavior, however, in case other things change in the meantime.
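The state-counter cancellation proposed in this message for ShutdownTask can be sketched as follows. This is a simplified, single-threaded model under stated assumptions: a plain `std::deque` stands in for the task runner, delays are ignored, and `IdleShutdownSketch`, `OnCollectionAdded`, `RunPendingTasks`, and `state_id_` are hypothetical names rather than anything in the CL.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

class IdleShutdownSketch {
 public:
  // A new collection arrived: bump the counter so that every shutdown task
  // already in flight becomes stale.
  void OnCollectionAdded() {
    ++state_id_;
    running_ = true;
  }

  // The last collection finished: post a shutdown task bound to the counter
  // value at posting time.
  void ScheduleShutdownIfIdle() {
    const int posted_state = state_id_;
    pending_.push_back([this, posted_state] { ShutdownTask(posted_state); });
  }

  // Stand-in for the message loop: run everything that has been posted.
  void RunPendingTasks() {
    while (!pending_.empty()) {
      std::function<void()> task = std::move(pending_.front());
      pending_.pop_front();
      task();
    }
  }

  bool running() const { return running_; }

 private:
  void ShutdownTask(int posted_state) {
    assert(posted_state <= state_id_);  // The counter only ever increases.
    if (posted_state != state_id_)
      return;  // Stale: a collection was added after this task was posted.
    running_ = false;
  }

  int state_id_ = 0;
  bool running_ = false;
  std::deque<std::function<void()>> pending_;
};
```

In the problematic sequence described above, ShutdownTask #1 is posted with an older counter value than the one current when it runs, so it returns without stopping the thread; only the task posted after the final collection can shut it down.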
https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:445: collection->next_sample_time = Time::Now(); On 2017/02/15 16:17:34, bcwhite wrote: > On 2017/02/15 03:26:44, Mike Wittman wrote: > > Can we initialize this in the CollectionContext constructor? > > No because it needs to be recorded when the first sample is being done which is > after the initial delay plus any other things the sampling-thread is doing. But the value recorded isn't derived from next_sample_time; the only place where next_sample_time is read is in the PostDelayedTaskCall below. It looks to me that the max expression will be equal to TimeDelta() regardless of whether next_sample_time is set to Time::Now() here or when the CollectionContext is constructed. https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:562: native_sampler_ = NativeStackSampler::Create(thread_id_, &RecordAnnotations, On 2017/02/15 16:17:34, bcwhite wrote: > On 2017/02/15 03:26:44, Mike Wittman wrote: > > Can we continue to use the current strategy of constructing this as a local > > variable in Start()? It's moved within that function, so the pointer is not > > valid after that anyway. > > I thought about that but decided to leave it this way because it should allow > multiple start/stop operations without having to recreate it each time. On second look, I think we have to create the native sampler in Start() because the current implementation is not correct. The native sampler is moved into the CollectionContext the first time Start is called, and is never recreated, so it's not valid to move again when Start is called a second time.
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:415: CollectionContext* collection = collection_ptr.get(); On 2017/02/15 21:56:00, Mike Wittman wrote: > On 2017/02/15 16:17:34, bcwhite wrote: > > On 2017/02/15 03:26:44, Mike Wittman wrote: > > > nit: better to save off the id and initial_delay and pass those below, to > > > eliminate the burden on the reader of figuring out if the pointer usage is > > safe > > > > Done. > > This is no longer potentially dangerous, so I think we can remove the comment as > well. Done. https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:445: collection->next_sample_time = Time::Now(); On 2017/02/15 22:52:16, Mike Wittman wrote: > On 2017/02/15 16:17:34, bcwhite wrote: > > On 2017/02/15 03:26:44, Mike Wittman wrote: > > > Can we initialize this in the CollectionContext constructor? > > > > No because it needs to be recorded when the first sample is being done which > is > > after the initial delay plus any other things the sampling-thread is doing. > > But the value recorded isn't derived from next_sample_time; the only place where > next_sample_time is read is in the PostDelayedTaskCall below. It looks to me > that the max expression will be equal to TimeDelta() regardless of whether > next_sample_time is set to Time::Now() here or when the CollectionContext is > constructed. next_sample_time is persistent and all sample times will be calculated from its starting value. If it is set any time before the very first sample, then it will race to catch up, capturing multiple samples at the start. https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... 
base/profiler/stack_sampling_profiler.cc:562: native_sampler_ = NativeStackSampler::Create(thread_id_, &RecordAnnotations, On 2017/02/15 22:52:16, Mike Wittman wrote: > On 2017/02/15 16:17:34, bcwhite wrote: > > On 2017/02/15 03:26:44, Mike Wittman wrote: > > > Can we continue to use the current strategy of constructing this as a local > > > variable in Start()? It's moved within that function, so the pointer is not > > > valid after that anyway. > > > > I thought about that but decided to leave it this way because it should allow > > multiple start/stop operations without having to recreate it each time. > > On second look, I think we have to create the native sampler in Start() because > the current implementation is not correct. The native sampler is moved into the > CollectionContext the first time Start is called, and is never recreated, so > it's not valid to move again when Start is called a second time. That was the case but not any more. Ownership of the native sampler stays with this object and the context has only a pointer to it, much like the signaled event. An invalid native_sampler_ is used to return early in many API calls so as to not try to access a sampling-thread that doesn't exist. https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.h (right): https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.h:233: static bool IsSamplingThreadRunningForTesting(); On 2017/02/15 21:56:00, Mike Wittman wrote: > On 2017/02/15 16:17:35, bcwhite wrote: > > On 2017/02/15 03:26:44, Mike Wittman wrote: > > > Can we move these three functions into an internal TestApi class? (See other > > > examples in the code.) > > > > > > And also, do the same for the two functions they call in SamplingThread? > It's > > > not obvious what parts of that class are there for test purposes. > > > > Done. Let me know if it's what you were thinking. 
> > Looks good. Can you add a TestAPI to SamplingThread as well so we can > distinguish the test support code in that class from the other code? I didn't think it was necessary since that class is embedded inside this .cc file and so has no access from the outside. https://codereview.chromium.org/2554123002/diff/720001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/720001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:380: // The currently active profile being acptured. On 2017/02/15 21:56:01, Mike Wittman wrote: > nit: captured Done. https://codereview.chromium.org/2554123002/diff/720001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:462: void StackSamplingProfiler::SamplingThread::ShutdownTask() { On 2017/02/15 21:56:01, Mike Wittman wrote: > I think need some kind of invalidation for ShutdownTasks in-flight. Otherwise, I > believe we can get in a situation where an earlier-posted ShutdownTask shuts > down the thread immediately after a collection finishes. 
The relevant sequence > of events would be: > > - a collection starts > - the collection stops and posts ShutdownTask #1 > - a new collection starts > - the new collection stops and posts ShutdownTask #2 > - ShutdownTask #1 executes shortly thereafter, finds a non-zero > task_runner_create_requests_, and posts itself again > - ShutdownTask #1.1 executes immediately, finds that all conditions to stop have > been satisfied, and shuts down the thread > > > CancelableTaskTracker unfortunately doesn't support delayed tasks, and is also > focused on tasks with reply, so I'd suggest implementing a poor-man's task > cancellation: > - keep a "state" counter and increment whenever the the ShutdownTask should be > invalidated > - bind the current state counter value to a ShutdownTask argument when posting > - abort the ShutdownTask if the passed state counter is not equal to the > current state counter > > I suspect this would make the logic a little easier to reason about too. I can use the existing task_runner_create_requests_ as the state and then just pass its current value and a first/second flag to ShutdownTask. https://codereview.chromium.org/2554123002/diff/720001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:469: if (!active_collections_.empty()) On 2017/02/15 21:56:01, Mike Wittman wrote: > Isn't this case already covered by the task_runner_create_requests_ check below? > i.e. if a new collection was added then task_runner_create_requests_ must have > been incremented, right? They happen at different times so something could race in between. 
I think it could fail without this check if:
- last sample of only active collection begins
- Add increments create_requests and posts AddCollectionTask
- last sample completes and does Finish+ScheduleShutdownIfIdle
- SSII finds active_collections to be empty, posts delayed task using current create_requests (which was incremented above)
- AddCollectionTask runs, adds new collection to active_collections
- profiling of the new collection goes on and on
- ShutdownTask eventually runs, create_requests is unchanged, posts second task
- ShutdownTask runs again, still no change, thinks it's done
- thread exits while a collection is still running

I have a love/hate relationship with this stuff.

https://codereview.chromium.org/2554123002/diff/720001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:485: StopSoon();
On 2017/02/15 21:56:01, Mike Wittman wrote:
> One note on the Stop/StopSoon/DetachFromSequence calls here and in
> GetOrCreateTaskRunner: these require mutual exclusion per the Thread interface.
> We have this by virtue of the task_runner_lock_, but it's not obvious from
> reading the code that these calls have to be guarded by that lock to ensure
> correct operation.
>
> I'd wait to address this until we're pretty close to a resolution on the restart
> behavior, however, in case other things change in the mean time.

Acknowledged.
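The next_sample_time point made in this reply (all sample times are offset from the initial value, so setting it too early causes a burst of catch-up samples) can be illustrated with a small fake-clock simulation. This is a sketch under assumptions: the exact delay expression in the patch is not shown in this thread, so the `max(next_sample_time - now, 0)` form is inferred from the discussion, sampling is treated as instantaneous, and `DelaysBetweenSamples` and its parameters are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <vector>

using Ms = std::chrono::milliseconds;

// Returns the delay that precedes each sample after the first one.
// `initial_next_sample_time` is the value next_sample_time holds when the
// first sample is taken, and `first_sample_at` is the (fake) clock time at
// which the sampling thread actually gets to that first sample.
std::vector<Ms> DelaysBetweenSamples(Ms initial_next_sample_time,
                                     Ms first_sample_at,
                                     Ms interval,
                                     int samples) {
  std::vector<Ms> delays;
  Ms now = first_sample_at;
  Ms next_sample_time = initial_next_sample_time;
  for (int i = 1; i < samples; ++i) {
    // A sample has just been taken at `now`; schedule the next one
    // relative to the persistent next_sample_time, never in the past.
    next_sample_time += interval;
    const Ms delay = std::max(next_sample_time - now, Ms(0));
    delays.push_back(delay);
    now += delay;  // The fake clock jumps to when the delayed task runs.
  }
  return delays;
}
```

With a 100 ms interval and a first sample that only happens 250 ms after construction, initializing next_sample_time at the first sample yields evenly spaced delays, while initializing it at construction (time 0) yields zero-length delays until the schedule catches up.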
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: linux_chromium_asan_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_...)
https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:445: collection->next_sample_time = Time::Now(); On 2017/02/16 17:39:49, bcwhite wrote: > On 2017/02/15 22:52:16, Mike Wittman wrote: > > On 2017/02/15 16:17:34, bcwhite wrote: > > > On 2017/02/15 03:26:44, Mike Wittman wrote: > > > > Can we initialize this in the CollectionContext constructor? > > > > > > No because it needs to be recorded when the first sample is being done which > > is > > > after the initial delay plus any other things the sampling-thread is doing. > > > > But the value recorded isn't derived from next_sample_time; the only place > where > > next_sample_time is read is in the PostDelayedTaskCall below. It looks to me > > that the max expression will be equal to TimeDelta() regardless of whether > > next_sample_time is set to Time::Now() here or when the CollectionContext is > > constructed. > > next_sample_time is persistent and all sample times will be calculated from its > starting value. If it is set any time before the very first sample, then it > will race to catch up, capturing multiple samples at the start. Ah, right. Missed that all the sample times are offset from the initial one. https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:562: native_sampler_ = NativeStackSampler::Create(thread_id_, &RecordAnnotations, On 2017/02/16 17:39:49, bcwhite wrote: > On 2017/02/15 22:52:16, Mike Wittman wrote: > > On 2017/02/15 16:17:34, bcwhite wrote: > > > On 2017/02/15 03:26:44, Mike Wittman wrote: > > > > Can we continue to use the current strategy of constructing this as a > local > > > > variable in Start()? It's moved within that function, so the pointer is > not > > > > valid after that anyway. 
> > > > > > I thought about that but decided to leave it this way because it should > allow > > > multiple start/stop operations without having to recreate it each time. > > > > On second look, I think we have to create the native sampler in Start() > because > > the current implementation is not correct. The native sampler is moved into > the > > CollectionContext the first time Start is called, and is never recreated, so > > it's not valid to move again when Start is called a second time. > > That was the case but not any more. Ownership of the native sampler stays with > this object and the context has only a pointer to it, much like the signaled > event. I just remembered why I implemented the creation of the native sampler in Start originally: it's so that the use and destruction of the object does not occur on different threads. This makes it trivial to reason about the correctness of the use of the object, without having to consider any synchronization concerns. The readability benefit of not having to think about synchronization greatly outweighs the runtime benefit of avoiding additional object constructions, so we should keep the existing behavior. > A invalid native_sampler_ is used to return early in many API calls so as to not > try to access a sampling-thread that doesn't exist. I don't understand. In patch set 25, how can native_sampler_ ever be null after creation? https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.h (right): https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.h:233: static bool IsSamplingThreadRunningForTesting(); On 2017/02/16 17:39:49, bcwhite wrote: > On 2017/02/15 21:56:00, Mike Wittman wrote: > > On 2017/02/15 16:17:35, bcwhite wrote: > > > On 2017/02/15 03:26:44, Mike Wittman wrote: > > > > Can we move these three functions into an internal TestApi class? 
(See > other > > > > examples in the code.) > > > > > > > > And also, do the same for the two functions they call in SamplingThread? > > It's > > > > not obvious what parts of that class are there for test purposes. > > > > > > Done. Let me know if it's what you were thinking. > > > > Looks good. Can you add a TestAPI to SamplingThread as well so we can > > distinguish the test support code in that class from the other code? > > I didn't think it was necessary since that class is embedded inside this .cc > file and so has no access from the outside. Segregating test support code all the way down makes it clear which parts of the class are core functionality. This makes it easier to understand the important parts of the code and facilitates refactoring. https://codereview.chromium.org/2554123002/diff/720001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/720001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:469: if (!active_collections_.empty()) On 2017/02/16 17:39:49, bcwhite wrote: > On 2017/02/15 21:56:01, Mike Wittman wrote: > > Isn't this case already covered by the task_runner_create_requests_ check > below? > > i.e. if a new collection was added then task_runner_create_requests_ must have > > been incremented, right? > > They happen at different times so something could race in between. 
> > I think it could fail without this check if: > - last sample of only active collection begins > - Add increments create_requests and posts AddCollectionTask > - last sample completes and does Finish+ScheduleShutdownIfIdle > - SSII finds active_collections to be empty, posts delayed task using current > create_requests (which was incremented above) > - AddCollectionTask runs, adds new collection to active_collections > - profiling of the new collections goes on and on > - ShutdownTask eventually runs, creation_requests is unchanged, posts second > task > - ShutdownTask runs again, still no change, thinks its done > - thread exits while a collection is still running > > I have a love/hate relationship with this stuff. Yeah, I can see that sequence of events occurring. Stepping back a bit it seems like we have a number of interacting constraints around shutdown/startup: 1. A delayed shutdown must be initiated when the number of active collections drops to zero. 2. Any delayed shutdowns must be aborted (or have no effect) if there are pending collections at the time of execution of the shutdown. 3. Requests for collection must be synchronous with respect to shutdown execution. Otherwise collection requests can be racily added only to have the thread exit before they get serviced. 4. It's not possible to actually perform the thread exit itself synchronously with respect to other events in the system because the thread can't hold a lock as it exits. This means that seeing the thread to exit must be done on another thread. 5. Taking #3 and #4 together, we effectively need to synchronize the thread exit execution (which takes place across two different threads), with the requests for collection. Does that sound like a reasonable summary? Are there other relevant synchronization constraints? I want to be able to convince myself that the solution works from first principles. https://codereview.chromium.org/2554123002/diff/740001/base/profiler/stack_sa... 
File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/740001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:466: // get postponed until StopSoon can run thus eliminating the race. It seems like the key to eliminating the race is actually setting task_runner_ to null while holding the lock, since that indicates to GetOrCreateTaskRunner that it needs to wait for the thread to shut down before restarting it. https://codereview.chromium.org/2554123002/diff/740001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:482: if (!second && task_runner_) { I think the task_runner_create_requests_ check above should avoid the need for a second task posting: If GetOrCreateTaskRunner was already executed, then we would have failed the check, regardless of whether the task got posted yet. If GetOrCreateTaskRunner is being executed now and waiting on task_runner_lock_, then it will not start executing until we've done the shutdown sequence below and reset task_runner_ so it won't have a chance to post any tasks until after restart. task_runner_create_requests_ essentially serves as a proxy for whether there could be pending add requests that haven't been executed yet.
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:562: native_sampler_ = NativeStackSampler::Create(thread_id_, &RecordAnnotations,
> I just remembered why I implemented the creation of the native sampler in Start
> originally: it's so that the use and destruction of the object does not occur on
> different threads. This makes it trivial to reason about the correctness of the
> use of the object, without having to consider any synchronization concerns.

Use will always be on a different thread than construction because use is always on the SamplingThread.

Right now construction & destruction of the native sampler occur on the same thread that does construction & destruction of the generic object. If I move the construction to Start and Start can be called by yet another thread, then construction and destruction of the native sampler can be called on different threads because the destruction of the native sampler will occur with the destruction of the generic sampler.

Moving destruction of the native sampler to Stop won't help because there is no requirement that Stop ever be called.

> The readability benefit of not having to think about synchronization greatly
> outweighs the runtime benefit of avoiding additional object constructions, so we
> should keep the existing behavior.

The existing behavior was that the native sampler was constructed by the thread calling Start but destructed on the sampling thread. That seems riskier.

> > A invalid native_sampler_ is used to return early in many API calls so as to
> not
> > try to access a sampling-thread that doesn't exist.
>
> I don't understand. In patch set 25, how can native_sampler_ ever be null after
> creation?

Unsupported platforms return nullptr from NativeStackSampler::Create().
https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.h (right): https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.h:233: static bool IsSamplingThreadRunningForTesting(); On 2017/02/17 16:10:19, Mike Wittman wrote: > On 2017/02/16 17:39:49, bcwhite wrote: > > On 2017/02/15 21:56:00, Mike Wittman wrote: > > > On 2017/02/15 16:17:35, bcwhite wrote: > > > > On 2017/02/15 03:26:44, Mike Wittman wrote: > > > > > Can we move these three functions into an internal TestApi class? (See > > other > > > > > examples in the code.) > > > > > > > > > > And also, do the same for the two functions they call in SamplingThread? > > > It's > > > > > not obvious what parts of that class are there for test purposes. > > > > > > > > Done. Let me know if it's what you were thinking. > > > > > > Looks good. Can you add a TestAPI to SamplingThread as well so we can > > > distinguish the test support code in that class from the other code? > > > > I didn't think it was necessary since that class is embedded inside this .cc > > file and so has no access from the outside. > > Segregating test support code all the way down makes it clear which parts of the > class are core functionality. This makes it easier to understand the important > parts of the code and facilitates refactoring. Done. https://codereview.chromium.org/2554123002/diff/720001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/720001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:469: if (!active_collections_.empty()) > Yeah, I can see that sequence of events occurring. > > > Stepping back a bit it seems like we have a number of interacting constraints > around shutdown/startup: > > 1. A delayed shutdown must be initiated when the number of active collections > drops to zero. Should. 
Since new collections could come in during the delay, "must" is stronger than necessary since that condition cannot be relied upon later on. > 2. Any delayed shutdowns must be aborted (or have no effect) if there are > pending collections at the time of execution of the shutdown. > > 3. Requests for collection must be synchronous with respect to shutdown > execution. Otherwise collection requests can be racily added only to have the > thread exit before they get serviced. > > 4. It's not possible to actually perform the thread exit itself synchronously > with respect to other events in the system because the thread can't hold a lock > as it exits. This means that seeing the thread to exit must be done on another > thread. Yes. The thread must indicate that it is about to exit so that an outside thread can know to wait for the thread to exit (and possibly restart it). > 5. Taking #3 and #4 together, we effectively need to synchronize the thread exit > execution (which takes place across two different threads), with the requests > for collection. > > Does that sound like a reasonable summary? Are there other relevant > synchronization constraints? I want to be able to convince myself that the > solution works from first principles. And thread API access has to be synchronized (which is done via the task_runner_lock_). https://codereview.chromium.org/2554123002/diff/740001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/740001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:466: // get postponed until StopSoon can run thus eliminating the race. On 2017/02/17 16:10:19, Mike Wittman wrote: > It seems like the key to eliminating the race is actually setting task_runner_ > to null while holding the lock, since that indicates to GetOrCreateTaskRunner > that it needs to wait for the thread to shut down before restarting it. Done. 
https://codereview.chromium.org/2554123002/diff/740001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:482: if (!second && task_runner_) {
On 2017/02/17 16:10:19, Mike Wittman wrote:
> I think the task_runner_create_requests_ check above should avoid the need for a
> second task posting:
>
> If GetOrCreateTaskRunner was already executed, then we would have failed the
> check, regardless of whether the task got posted yet. If GetOrCreateTaskRunner
> is being executed now and waiting on task_runner_lock_, then it will not start
> executing until we've done the shutdown sequence below and reset task_runner_ so
> it won't have a chance to post any tasks until after restart.
>
> task_runner_create_requests_ essentially serves as a proxy for whether there
> could be pending add requests that haven't been executed yet.

Yeah, I think that's reasonable on the assumption that the "idle delay" is sufficient to ensure the execution of the AddCollectionTask associated with an increment of task_runner_create_requests_.

But what about this:
- task_runner_create_requests_ is incremented (at time T)
- ScheduleShutdownIfIdle runs with this new value
- AddTask gets posted
- New collection runs for 55 seconds
- ShutdownTask runs at T+60 seconds
- thread exits only 5 seconds after the completion of the last collection

It's not really a problem; it just results in the thread possibly exiting without waiting for the full delay.

Also, I think the fix would be complicated. ScheduleShutdownIfIdle would have to do the two back-to-back runs to get both the latest task_runner_create_requests_ count and the non-empty collections boolean.

On top of that... The PostTask done by Add() would have to be done while the task_runner_lock_ remains held to ensure that both those two ScheduleShutdownIfIdle() task runs couldn't occur between the increment and the posting. Doing that would require undoing the GetOrCreateTaskRunner helper method.
https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:562: native_sampler_ = NativeStackSampler::Create(thread_id_, &RecordAnnotations, On 2017/02/21 16:21:19, bcwhite wrote: > > The readability benefit of not having to think about synchronization greatly > > outweighs the runtime benefit of avoiding additional object constructions, so > we > > should keep the existing behavior. > > The existing behavior was that the native sampler was constructed by the thread > calling Start but destructed on the sampling thread. That seems riskier. What's the risk you're concerned about in creating and destroying the native sampler on different threads? It's not possible for those two lifecycle events to overlap, so there's no coordination required. I'm not seeing where this could go wrong. If use and destruction are on different threads, however, then this requires coordination between the threads to ensure that the object is not destroyed while it is still being used. The more state that requires cross-thread synchronization, the more difficult it is to understand and validate the code, and the harder it is to make future changes correctly. > > > A invalid native_sampler_ is used to return early in many API calls so as to > > not > > > try to access a sampling-thread that doesn't exist. > > > > I don't understand. In patch set 25, how can native_sampler_ ever be null > after > > creation? > > Unsupported platforms return nullptr from NativeStackSampler::Create(). Right. I don't think it's worth the complexity to try to maintain the NativeStackSampler pointer as the sentinel for whether sampling is supported. Better to save off a boolean on first attempted creation or add a NativeStackSampler::IsSupported() function, whatever's simpler. 
https://codereview.chromium.org/2554123002/diff/720001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/720001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:469: if (!active_collections_.empty())
On 2017/02/21 16:21:19, bcwhite wrote:
> > Stepping back a bit it seems like we have a number of interacting constraints
> > around shutdown/startup:
> >
> > 1. A delayed shutdown must be initiated when the number of active collections
> > drops to zero.
>
> Should. Since new collections could come in during the delay, "must" is
> stronger than necessary since that condition cannot be relied upon later on.
>
> > 2. Any delayed shutdowns must be aborted (or have no effect) if there are
> > pending collections at the time of execution of the shutdown.

Actually, I think this should be:

2. Any delayed shutdowns must be aborted (or have no effect) if any additional
collections have occurred or are pending at the time of execution of the
shutdown.

> > 3. Requests for collection must be synchronous with respect to shutdown
> > execution. Otherwise collection requests can be racily added only to have the
> > thread exit before they get serviced.
> >
> > 4. It's not possible to actually perform the thread exit itself synchronously
> > with respect to other events in the system because the thread can't hold a
> > lock as it exits. This means that seeing the thread to exit must be done on
> > another thread.
>
> Yes. The thread must indicate that it is about to exit so that an outside
> thread can know to wait for the thread to exit (and possibly restart it).
>
> > 5. Taking #3 and #4 together, we effectively need to synchronize the thread
> > exit execution (which takes place across two different threads), with the
> > requests for collection.
> >
> > Does that sound like a reasonable summary? Are there other relevant
> > synchronization constraints? I want to be able to convince myself that the
> > solution works from first principles.
>
> And thread API access has to be synchronized (which is done via the
> task_runner_lock_).

Great, I think we can get to a workable and understandable (although nontrivial)
solution within these constraints.

https://codereview.chromium.org/2554123002/diff/740001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/740001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:482: if (!second && task_runner_) {
On 2017/02/21 16:21:19, bcwhite wrote:
> But what about this:
>
> - task_runner_create_requests_ is incremented (at time T)
> - ScheduleShutdownIfIdle runs with this new value
> - AddTask gets posted
> - New collection runs for 55 seconds
> - ShutdownTask runs at T+60 seconds
> - thread exits only 5 seconds after the completion of the last collection
>
> It's not really a problem; it just results in the thread possibly exiting
> without waiting for the full delay.
>
> Also, I think the fix would be complicated. ScheduleShutdownIfIdle would have
> to do the two back-to-back runs to get both the latest
> task_runner_creation_requests_ count and the non-empty collections boolean.
>
> On top of that... The PostTask done by Add() would have to be done while the
> task_runner_lock_ remains held to ensure that both those two
> ScheduleShutdownIfIdle() task runs couldn't occur between the increment and the
> posting. Doing that would require undoing the GetOrCreateTaskRunner helper
> method.

It seems like the basic issue here is distinguishing whether the
active_collections_.empty() state is the result of the state being empty since
the ShutdownTask was posted, or the result of collections occurring then
completing.

Rather than trying to figure this out post hoc, I think a simpler mechanism
would be to make a note of when collections start, e.g. by separately
incrementing task_runner_create_requests_ (under lock) in AddCollectionTask.
Then the create_requests check would detect this case too, and there would be no
need to try to divine the current state using the active_collections_.empty()
check and multiple task postings in ShutdownTask.
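
For readers following along, the counter idea suggested above might look
roughly like the sketch below. It is a standalone illustration only: it uses
std::mutex instead of base::Lock, and the class IdleShutdownTracker and all of
its method names are hypothetical, not taken from the CL.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-in for the sampling thread's bookkeeping (not CL code).
class IdleShutdownTracker {
 public:
  // Called whenever a collection is added. Plays the role of incrementing
  // the add-event counter under lock in AddCollectionTask.
  void OnCollectionAdded() {
    std::lock_guard<std::mutex> lock(lock_);
    ++add_events_;
  }

  // Called when a delayed shutdown is scheduled. The returned snapshot is
  // meant to travel with the delayed task.
  int SnapshotForShutdown() const {
    std::lock_guard<std::mutex> lock(lock_);
    return add_events_;
  }

  // Called when the delayed shutdown task runs. The shutdown proceeds only
  // if no collection was added since the snapshot was taken, which also
  // covers collections that started and finished during the delay.
  bool ShouldShutDown(int snapshot) const {
    std::lock_guard<std::mutex> lock(lock_);
    return snapshot == add_events_;
  }

 private:
  mutable std::mutex lock_;
  int add_events_ = 0;
};
```

With this shape the shutdown task never needs to inspect whether the set of
active collections is empty; a stale snapshot is enough to make it a no-op.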
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:562: native_sampler_ = NativeStackSampler::Create(thread_id_, &RecordAnnotations,
> If use and destruction are on different threads, however, then this requires
> coordination between the threads to ensure that the object is not destroyed
> while it is still being used.

That's already done because destruction requires waiting on collection having
finished.

> Right. I don't think it's worth the complexity to try to maintain the
> NativeStackSampler pointer as the sentinel for whether sampling is supported.
> Better to save off a boolean on first attempted creation or add a
> NativeStackSampler::IsSupported() function, whatever's simpler.

Existing collection_id will do it, too. Done.

https://codereview.chromium.org/2554123002/diff/740001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/740001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:482: if (!second && task_runner_) {
On 2017/02/21 18:57:02, Mike Wittman wrote:
> On 2017/02/21 16:21:19, bcwhite wrote:
> > But what about this:
> >
> > - task_runner_create_requests_ is incremented (at time T)
> > - ScheduleShutdownIfIdle runs with this new value
> > - AddTask gets posted
> > - New collection runs for 55 seconds
> > - ShutdownTask runs at T+60 seconds
> > - thread exits only 5 seconds after the completion of the last collection
> >
> > It's not really a problem; it just results in the thread possibly exiting
> > without waiting for the full delay.
> >
> > Also, I think the fix would be complicated. ScheduleShutdownIfIdle would have
> > to do the two back-to-back runs to get both the latest
> > task_runner_creation_requests_ count and the non-empty collections boolean.
> >
> > On top of that... The PostTask done by Add() would have to be done while the
> > task_runner_lock_ remains held to ensure that both those two
> > ScheduleShutdownIfIdle() task runs couldn't occur between the increment and
> > the posting. Doing that would require undoing the GetOrCreateTaskRunner
> > helper method.
>
> It seems like the basic issue here is distinguishing whether the
> active_collections_.empty() state is the result of the state being empty since
> the ShutdownTask was posted, or the result of collections occurring then
> completing.
>
> Rather than trying to figure this out post hoc, I think a simpler mechanism
> would be to make a note of when collections start, e.g. by separately
> incrementing task_runner_create_requests_ (under lock) in AddCollectionTask.
> Then the create_requests check would detect this case too, and there would be no
> need to try to divine the current state using the active_collections_.empty()
> check and multiple task postings in ShutdownTask.

Works for me.
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: chromeos_amd64-generic_chromium_compile_only_ng on master.tryserver.chromium.linux (JOB_TIMED_OUT, no build URL) chromeos_daisy_chromium_compile_only_ng on master.tryserver.chromium.linux (JOB_TIMED_OUT, no build URL) chromium_presubmit on master.tryserver.chromium.linux (JOB_TIMED_OUT, no build URL) linux_chromium_asan_rel_ng on master.tryserver.chromium.linux (JOB_TIMED_OUT, no build URL) linux_chromium_chromeos_rel_ng on master.tryserver.chromium.linux (JOB_TIMED_OUT, no build URL) linux_chromium_rel_ng on master.tryserver.chromium.linux (JOB_TIMED_OUT, no build URL)
https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/700001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:562: native_sampler_ = NativeStackSampler::Create(thread_id_, &RecordAnnotations,
On 2017/02/21 21:48:05, bcwhite wrote:
> > Right. I don't think it's worth the complexity to try to maintain the
> > NativeStackSampler pointer as the sentinel for whether sampling is supported.
> > Better to save off a boolean on first attempted creation or add a
> > NativeStackSampler::IsSupported() function, whatever's simpler.
>
> Existing collection_id will do it, too. Done.

That works.

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:220: int task_runner_create_requests_ = 0;
This should have a more general name now. collection_add_events_ ?

Also, documentation on what it's used for.

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:436: AutoLock lock(task_runner_lock_);
nit: enclose in a block

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:504: task_runner_ = nullptr;
The thread execution state from here to GetOrCreateTaskRunner is both pretty
subtle and critically important. So it probably should have a representation
with more inherent meaning than a null task runner pointer. I'd recommend an
enum and associated variable; e.g.

  enum class ThreadExecutionState {
    NOT_STARTED,
    RUNNING,
    EXITING
  };

Top-level documentation for the management of thread exit could be usefully hung
off of this type and it potentially would allow meaningful DCHECKs about the
thread execution state in the code, which would be helpful from a correctness
and documentation perspective.

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:600: finished_event_.Wait();
One more comment here, after considering the case of performing more than one
collection from the same StackSamplingProfiler object. I believe we need to move
the wait into Stop() to support that case: as it is now the WaitableEvent
remains signalled once the first collection finishes, so the
StackSamplingProfiler object could be destroyed any time after the first
collection completes, without waiting for a later collection to complete.

We probably should have a test exercising this case as well.

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:623: if (collection_id_ == -1) {
This should be a named constant.

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:848: // Capture thread should still be running at this point.
The shutdown and restart behavior should be broken out into a different test (or
tests).

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:866: #define MAYBE_ConcurrentProfiling ConcurrentProfiling
I think we should test different interleavings of Start(), Stop(), and profiler
destruction for the concurrent case, probably in different tests.
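
As a rough illustration of how the proposed enum could carry documentation and
support checks, here is a small standalone sketch. The three states come from
the comment above; the comments on each state, the set of allowed transitions,
IsValidTransition() and ThreadStateHolder are all assumptions made for this
example, and plain assert() stands in for DCHECK.

```cpp
#include <cassert>

enum class ThreadExecutionState {
  NOT_STARTED,  // Assumed meaning: the thread has never been started.
  RUNNING,      // Assumed meaning: the thread is up and accepting tasks.
  EXITING,      // Assumed meaning: the thread has committed to shutting down
                // and must be joined before it can be started again.
};

// Hypothetical helper: encodes the transitions this sketch assumes are legal.
inline bool IsValidTransition(ThreadExecutionState from,
                              ThreadExecutionState to) {
  switch (from) {
    case ThreadExecutionState::NOT_STARTED:
      return to == ThreadExecutionState::RUNNING;
    case ThreadExecutionState::RUNNING:
      return to == ThreadExecutionState::EXITING;
    case ThreadExecutionState::EXITING:
      return to == ThreadExecutionState::RUNNING;  // Restart after joining.
  }
  return false;
}

// Hypothetical holder that checks every state change, DCHECK-style.
class ThreadStateHolder {
 public:
  void TransitionTo(ThreadExecutionState next) {
    assert(IsValidTransition(state_, next));
    state_ = next;
  }
  ThreadExecutionState state() const { return state_; }

 private:
  ThreadExecutionState state_ = ThreadExecutionState::NOT_STARTED;
};
```

Unlike a null task runner pointer, this keeps "never started" and "exiting"
distinguishable.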
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: cast_shell_linux on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/cast_shell_linu...) linux_chromium_tsan_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_...) ios-device-xcode-clang on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-device-xcode-...) ios-simulator-xcode-clang on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-simulator-xco...)
https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:220: int task_runner_create_requests_ = 0;
On 2017/02/22 03:06:48, Mike Wittman wrote:
> This should have a more general name now. collection_add_events_ ?
>
> Also, documentation on what it's used for.

Done.

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:436: AutoLock lock(task_runner_lock_);
On 2017/02/22 03:06:48, Mike Wittman wrote:
> nit: enclose in a block

Done.

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:504: task_runner_ = nullptr;
On 2017/02/22 03:06:48, Mike Wittman wrote:
> The thread execution state from here to GetOrCreateTaskRunner is both pretty
> subtle and critically important. So it probably should have a representation
> with more inherent meaning than a null task runner pointer. I'd recommend an
> enum and associated variable; e.g.
>
>   enum class ThreadExecutionState {
>     NOT_STARTED,
>     RUNNING,
>     EXITING
>   };
>
> Top-level documentation for the management of thread exit could be usefully hung
> off of this type and it potentially would allow meaningful DCHECKs about the
> thread execution state in the code, which would be helpful from a correctness
> and documentation perspective.

While I like the idea, it's a second variable accessed independently of
task_runner_, access to which has been largely pushed into "helper methods".
Since those helper methods acquire and release the lock privately, it'll mean
doing two acquire/release operations on the lock, changing all the helper
methods to be WhileLocked, or removing the helper methods altogether.

Preferences?

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:600: finished_event_.Wait();
> One more comment here, after considering the case of performing more than one
> collection from the same StackSamplingProfiler object. I believe we need to move
> the wait into Stop() to support that case: as it is now the WaitableEvent
> remains signalled once the first collection finishes, so the
> StackSamplingProfiler object could be destroyed any time after the first
> collection completes, without waiting for a later collection to complete.

The same occurred to me but I think it's better to do the wait in Start() so
that Stop() remains asynchronous.

This means that the controlling thread won't block until absolutely necessary,
which means probably never since the stop will happen within microseconds.

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:623: if (collection_id_ == -1) {
On 2017/02/22 03:06:47, Mike Wittman wrote:
> This should be a named constant.

Done.

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:848: // Capture thread should still be running at this point.
On 2017/02/22 03:06:48, Mike Wittman wrote:
> The shutdown and restart behavior should be broken out into a different test (or
> tests).

Done.

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:866: #define MAYBE_ConcurrentProfiling ConcurrentProfiling
On 2017/02/22 03:06:48, Mike Wittman wrote:
> I think we should test different interleavings of Start(), Stop(), and profiler
> destruction for the concurrent case, probably in different tests.

Done.
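
The "wait in Start(), keep Stop() asynchronous" idea can be sketched as below.
This is not the CL's code: ManualResetEvent is a minimal stand-in for
base::WaitableEvent built on std::condition_variable, and ProfilerSketch with
its methods is hypothetical. The sketch assumes a manual-reset event that
starts out signaled, meaning "no collection is in flight".

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Minimal manual-reset event; a stand-in for base::WaitableEvent.
class ManualResetEvent {
 public:
  explicit ManualResetEvent(bool initially_signaled)
      : signaled_(initially_signaled) {}

  void Signal() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      signaled_ = true;
    }
    cv_.notify_all();
  }

  void Reset() {
    std::lock_guard<std::mutex> lock(mutex_);
    signaled_ = false;
  }

  void Wait() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return signaled_; });
  }

  bool IsSignaled() {
    std::lock_guard<std::mutex> lock(mutex_);
    return signaled_;
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool signaled_;
};

// Hypothetical profiler shell showing only the event handling.
class ProfilerSketch {
 public:
  // Starts signaled: nothing is running, so the first Start() must not block.
  ProfilerSketch() : profiling_inactive_(true) {}

  void Start() {
    // Wait for any previously started profiling to complete.
    profiling_inactive_.Wait();
    profiling_inactive_.Reset();
    // ... a real implementation would hand the collection to the sampler ...
  }

  // Invoked (in a real implementation, by the sampling thread) when the
  // collection finishes or has been removed after a stop request.
  void OnCollectionFinished() { profiling_inactive_.Signal(); }

  // What a destructor would call to avoid being destroyed mid-collection.
  void WaitUntilInactive() { profiling_inactive_.Wait(); }

  bool IsInactive() { return profiling_inactive_.IsSignaled(); }

 private:
  ManualResetEvent profiling_inactive_;
};
```

Because the event is reset on every Start(), a second collection on the same
object is covered by the same wait as the first.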
https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:504: task_runner_ = nullptr;
On 2017/02/22 14:32:51, bcwhite wrote:
> On 2017/02/22 03:06:48, Mike Wittman wrote:
> > The thread execution state from here to GetOrCreateTaskRunner is both pretty
> > subtle and critically important. So it probably should have a representation
> > with more inherent meaning than a null task runner pointer. I'd recommend an
> > enum and associated variable; e.g.
> >
> >   enum class ThreadExecutionState {
> >     NOT_STARTED,
> >     RUNNING,
> >     EXITING
> >   };
> >
> > Top-level documentation for the management of thread exit could be usefully
> > hung off of this type and it potentially would allow meaningful DCHECKs about
> > the thread execution state in the code, which would be helpful from a
> > correctness and documentation perspective.
>
> While I like the idea, it's a second variable accessed independently of
> task_runner_, access to which has been largely pushed into "helper methods".
> Since those helper methods acquire and release the lock privately, it'll mean
> doing two acquire/release operations on the lock, changing all the helper
> methods to be WhileLocked, or removing the helper methods altogether.
>
> Preferences?

I think using the enum is worth doing for the improved documentation and
readability, plus the fact that it removes the ambiguity between the not started
state and the exiting state.

Just using it in place of checking task_runner_ for null shouldn't require any
additional locking.

Adding DCHECKs might require additional locking and complexity, but the locking
could be gated by DCHECK_IS_ON() to avoid any non-debug performance impact. The
question of whether the added complexity is worth it probably can be considered
on a case-by-case basis.

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:600: finished_event_.Wait();
On 2017/02/22 14:32:51, bcwhite wrote:
> > One more comment here, after considering the case of performing more than one
> > collection from the same StackSamplingProfiler object. I believe we need to
> > move the wait into Stop() to support that case: as it is now the WaitableEvent
> > remains signalled once the first collection finishes, so the
> > StackSamplingProfiler object could be destroyed any time after the first
> > collection completes, without waiting for a later collection to complete.
>
> The same occurred to me but I think it's better to do the wait in Start() so
> that Stop() remains asynchronous.
>
> This means that the controlling thread won't block until absolutely necessary,
> which means probably never since the stop will happen within microseconds.

Sounds reasonable to me, but the current name doesn't quite fit for this use.
Better to call it something like profiling_inactive_.

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:32: const int NO_ID = -1;
nit: how about NULL_COLLECTION_ID, so it's clear what state this corresponds to?

This also could use some documentation.

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:223: int task_runner_create_requests_ = 0;
remove

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:224: TimeDelta task_runner_idle_shutdown_time_ = TimeDelta::FromSeconds(60);
If we're not supporting configurable shutdown times (per comment in the header),
then this can be replaced with something like:
  bool disable_idle_shutdown_for_testing_ = false;
And the TimeDelta::FromSeconds(60) can be moved into ScheduleShutdownIfIdle().

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:592: WaitableEvent::InitialState::SIGNALED),
Comment on why the initial state is signaled.

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:630: finished_event_.Reset();
This code is equivalent to just:
  finished_event_.Wait();

For the comment I think it's sufficient to say "Wait for any previously started
profiling to complete."

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:642: finished_event_.Signal();
This is no longer necessary.

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.h (right):

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.h:194: static void SetSamplingThreadIdleShutdownTime(int shutdown_ms);
This function is only ever used to disable the idle shutdown, so we should
name it accordingly, e.g. DisableSamplingThreadIdleShutdown().

I can't see a test use for setting the shutdown time that wouldn't be better
served by explicit coordination using the TestDelegate.

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.h:291: // An ID uniquely identifying this collection to the sampling thread.
Also mention the conditions under which this will be null.

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:858: sampling_completed.TimedWait(AVeryLongTimeDelta());
What's the reason for the TimedWait?

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:911: PlatformThread::YieldCurrentThread();
Channeling brucedawson@:
  while (condition) yield();
results in a busy wait if there are spare execution cycles in the system, where
the thread is repeatedly scheduled, executes, and yields. This is bad for power
usage. Admittedly this is not a big concern in tests, but people do tend to
copy-paste code around.

The preferred formulation is
  while (condition) sleep(1);
to allow the processor to be idle for some time.

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:922: #define MAYBE_ConcurrentProfiling1 ConcurrentProfiling1
Test name and comment should be descriptive of the specific conditions this is
testing, rather than using a numeric suffix.

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:962: WaitableEvent::WaitMany(sampling_completed_rawptrs.data(), 2);
This probably should wait until all profilers have completed, then verify that
results are as expected. The expectation below is not correct since both of the
profilers could have completed before this call.

Same comment potentially applies to the tests below, depending on what is being
tested.

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1070: WaitableEvent::WaitMany(sampling_completed_rawptrs.data(), 2);
2 => 3, or better, arraysize(params). Since we're duplicating and changing tests
here, we probably should replace all the constants here and in the previous
tests with arraysize(params) to make this less fragile.

https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1077:
I'd like to have a test that interleaves Start and Stop calls on different
profilers as well.
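
The yield-versus-sleep point above can be illustrated with standard C++ in
place of PlatformThread. This is a sketch only; WaitWithSleep is a hypothetical
helper, and the 1 ms interval mirrors the "sleep(1)" in the comment.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Polls `done` until it becomes true, sleeping between checks so the CPU can
// go idle instead of spinning through yield calls. Returns how many times it
// had to sleep, which is handy for observing the behavior in a test.
inline int WaitWithSleep(const std::atomic<bool>& done) {
  int sleeps = 0;
  while (!done.load()) {
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
    ++sleeps;
  }
  return sleeps;
}
```

A loop built on std::this_thread::yield() would have the same shape but would
keep the thread runnable whenever the system has spare cycles.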
https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:504: task_runner_ = nullptr;
> Just using it in place of checking task_runner_ for null shouldn't require any
> additional locking.

The task-runner that is checked against null is returned from helper methods
that access task_runner_ while locked.

If thread_state_ is to be used instead then the lock has to be acquired to
access that member variable. This may be done before or after fetching the
task-runner but either way the lock is acquired twice.

The solutions I can see off-hand are to have the helper functions return both
values or eliminating the helper functions so that both values can be checked as
needed under the same lock.
https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:504: task_runner_ = nullptr;
On 2017/02/22 20:51:15, bcwhite wrote:
> > Just using it in place of checking task_runner_ for null shouldn't require any
> > additional locking.
>
> The task-runner that is checked against null is returned from helper methods
> that access task_runner_ while locked.
>
> If thread_state_ is to be used instead then the lock has to be acquired to
> access that member variable. This may be done before or after fetching the
> task-runner but either way the lock is acquired twice.
>
> The solutions I can see off-hand are to have the helper functions return both
> values or eliminating the helper functions so that both values can be checked as
> needed under the same lock.

The only non-test place I see where the task_runner state is being checked
without the lock is Remove(). I think it's reasonable to have GetTaskRunner()
return the enum as an optional out param to accommodate that use.
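
A helper that hands back both values under a single lock acquisition, as
suggested above, might look something like this. It is an assumption-laden
sketch: TaskRunner is an empty placeholder for the real task runner type,
std::shared_ptr and std::mutex replace scoped_refptr and base::Lock, and
SamplingThreadSketch plus SetRunningForTesting() are invented for the example.

```cpp
#include <cassert>
#include <memory>
#include <mutex>

enum ThreadExecutionState { NOT_STARTED, RUNNING, EXITING };

struct TaskRunner {};  // Placeholder for the real task runner type.

// Hypothetical holder for the two pieces of state guarded by one lock.
class SamplingThreadSketch {
 public:
  // Returns the task runner and, if |out_state| is non-null, also reports the
  // thread state read under the same lock acquisition.
  std::shared_ptr<TaskRunner> GetTaskRunner(
      ThreadExecutionState* out_state = nullptr) {
    std::lock_guard<std::mutex> lock(lock_);
    if (out_state)
      *out_state = state_;
    return task_runner_;
  }

  // Test-only hook for this sketch: pretend the thread has been started.
  void SetRunningForTesting() {
    std::lock_guard<std::mutex> lock(lock_);
    task_runner_ = std::make_shared<TaskRunner>();
    state_ = RUNNING;
  }

 private:
  std::mutex lock_;
  ThreadExecutionState state_ = NOT_STARTED;
  std::shared_ptr<TaskRunner> task_runner_;
};
```

Callers that only need the task runner can ignore the out param, while a
caller like Remove() can read a consistent pair without locking twice.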
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:504: task_runner_ = nullptr;
On 2017/02/22 21:11:13, Mike Wittman wrote:
> On 2017/02/22 20:51:15, bcwhite wrote:
> > > Just using it in place of checking task_runner_ for null shouldn't require
> > > any additional locking.
> >
> > The task-runner that is checked against null is returned from helper methods
> > that access task_runner_ while locked.
> >
> > If thread_state_ is to be used instead then the lock has to be acquired to
> > access that member variable. This may be done before or after fetching the
> > task-runner but either way the lock is acquired twice.
> >
> > The solutions I can see off-hand are to have the helper functions return both
> > values or eliminating the helper functions so that both values can be checked
> > as needed under the same lock.
>
> The only non-test place I see where the task_runner state is being checked
> without the lock is Remove(). I think it's reasonable to have GetTaskRunner()
> return the enum as an optional out param to accommodate that use.

Done.
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
https://codereview.chromium.org/2554123002/diff/820001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/820001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:187: enum ThreadExecutionState {
This needs substantial comments explaining the lifecycle of the thread and the
meaning of the different states, so readers can understand the lifecycle and
corresponding subtleties without having to piece it together from the
implementing code.

https://codereview.chromium.org/2554123002/diff/820001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:265: if (state != RUNNING)
DCHECK_EQ(RUNNING, state)

unless this is called in the other states in tests

https://codereview.chromium.org/2554123002/diff/820001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:316: if (state != RUNNING)
DCHECK_NE(NOT_STARTED, state) before this

https://codereview.chromium.org/2554123002/diff/820001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:331: if (task_runner_thread_state_ != RUNNING) {
I think this code would be easier to follow structured like:

  if (task_runner_thread_state_ == RUNNING) {
    ... code currently in else clause ...

    return task_runner_;
  }

  if (task_runner_thread_state_ == EXITING) {
    // ...
    Stop();
  }

  ... code currently in if clause ...

  return task_runner_;

https://codereview.chromium.org/2554123002/diff/820001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:363: if (task_runner_thread_state_ != RUNNING)
This would be better as a DCHECK:
  DCHECK(task_runner_thread_state_ == RUNNING ? !!task_runner_ : !task_runner_)

https://codereview.chromium.org/2554123002/diff/820001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:572: task_runner_thread_state_ = NOT_STARTED;
I don't think we need or want this. CleanUp is documented to be called after the
message loop ends, so this would overwrite the EXITING state before it could be
seen by GetOrCreateTaskRunner().
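
To make the suggested control flow concrete, here is a compilable sketch of a
GetOrCreateTaskRunner() shaped that way. Everything beyond the three state
names and the early-return structure is an assumption: TaskRunner is an empty
placeholder, std::shared_ptr and std::mutex stand in for Chromium types, the
joins_ counter merely marks where the real code would call Stop(), and
BeginExit() is an invented hook that simulates the idle shutdown.

```cpp
#include <cassert>
#include <memory>
#include <mutex>

enum ThreadExecutionState { NOT_STARTED, RUNNING, EXITING };

struct TaskRunner {};  // Placeholder for the real task runner type.

class SamplingThreadSketch {
 public:
  std::shared_ptr<TaskRunner> GetOrCreateTaskRunner() {
    std::lock_guard<std::mutex> lock(lock_);
    if (state_ == RUNNING) {
      // Fast path: the thread is alive, so the task runner must exist.
      assert(task_runner_);
      return task_runner_;
    }

    if (state_ == EXITING) {
      // The real code would call Stop() here to join the exiting thread
      // before starting it again; this sketch only counts the event.
      ++joins_;
    }

    // Common (re)start path for both NOT_STARTED and EXITING.
    task_runner_ = std::make_shared<TaskRunner>();
    state_ = RUNNING;
    ++starts_;
    return task_runner_;
  }

  // Simulates the idle shutdown deciding to exit: the state moves to EXITING
  // and the task runner is dropped, but the state is NOT reset to NOT_STARTED.
  void BeginExit() {
    std::lock_guard<std::mutex> lock(lock_);
    state_ = EXITING;
    task_runner_.reset();
  }

  int starts() const { return starts_; }
  int joins() const { return joins_; }

 private:
  std::mutex lock_;
  ThreadExecutionState state_ = NOT_STARTED;
  std::shared_ptr<TaskRunner> task_runner_;
  int starts_ = 0;
  int joins_ = 0;
};
```

Leaving the state at EXITING until the next GetOrCreateTaskRunner() call is
what lets the restart path know a join is required.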
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: cast_shell_linux on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/cast_shell_linu...) ios-device on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-device/builds...)
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: ios-device on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-device/builds...) ios-simulator-xcode-clang on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-simulator-xco...)
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
Patchset #31 (id:860001) has been deleted
https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/780001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:600: finished_event_.Wait(); On 2017/02/22 20:32:17, Mike Wittman wrote: > On 2017/02/22 14:32:51, bcwhite wrote: > > > One more comment here, after considering the case of performing more than > one > > > collection from the same StackSamplingProfiler object. I believe we need to > > move > > > the wait into Stop() to support that case: as it is now the WaitableEvent > > > remains signalled once the first collection finishes, so the > > > StackSamplingProfiler object could be destroyed any time after the first > > > collection completes, without waiting for a later collection to complete. > > > > The same occurred to me but I think it's better to do the wait in Start() so > > that Stop() remains asynchronous. > > > > This means that the controlling thread won't block until absolutely necessary, > > which means probably never since the stop will happen within microseconds. > > Sounds reasonable to me, but the current name doesn't quite fit for this use. > Better to call it something like profiling_inactive_. Done. https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:32: const int NO_ID = -1; On 2017/02/22 20:32:18, Mike Wittman wrote: > nit: how about NULL_COLLECTION_ID, so it's clear what state this corresponds to? > > This also could use some documentation. Done. https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:223: int task_runner_create_requests_ = 0; On 2017/02/22 20:32:18, Mike Wittman wrote: > remove Done. 
https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:224: TimeDelta task_runner_idle_shutdown_time_ = TimeDelta::FromSeconds(60); On 2017/02/22 20:32:18, Mike Wittman wrote: > If we're not supporting configurable shutdown times (per comment in the header), > then this can be replaced with something like: > bool disable_idle_shutdown_for_testing_ = false; > And the TimeDelta::FromSeconds(60) can be moved into ScheduleShutdownIfIdle(). Done. https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:592: WaitableEvent::InitialState::SIGNALED), On 2017/02/22 20:32:18, Mike Wittman wrote: > Comment on why the initial state is signaled. Done. https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:630: finished_event_.Reset(); On 2017/02/22 20:32:18, Mike Wittman wrote: > This code is equivalent to just: > finished_event_.Wait(); > > For the comment I think it's sufficient to say "Wait for any previously started > profiling to complete." Done. https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:642: finished_event_.Signal(); On 2017/02/22 20:32:17, Mike Wittman wrote: > This is no longer necessary. Done. https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.h (right): https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.h:194: static void SetSamplingThreadIdleShutdownTime(int shutdown_ms); On 2017/02/22 20:32:18, Mike Wittman wrote: > This function is only ever used to disable the the idle shutdown, so we should > name it accordingly, e.g. DisableSamplingThreadIdleShutdown(). 
> > I can't see a test use for setting the shutdown time that wouldn't be better > served by explicit coordination using the TestDelegate. Done. https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.h:291: // An ID uniquely identifying this collection to the sampling thread. On 2017/02/22 20:32:18, Mike Wittman wrote: > Also mention the conditions under which this will be null. Done. https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:858: sampling_completed.TimedWait(AVeryLongTimeDelta()); On 2017/02/22 20:32:18, Mike Wittman wrote: > What's the reason for the TimedWait? To ensure that it runs to completion before being stopped. Otherwise it could stop before the first sample and there wouldn't be any evidence that it would have run. https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:922: #define MAYBE_ConcurrentProfiling1 ConcurrentProfiling1 On 2017/02/22 20:32:18, Mike Wittman wrote: > Test name and comment should be descriptive of the specific conditions this is > testing, rather than using a numeric suffix. Done. https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:962: WaitableEvent::WaitMany(sampling_completed_rawptrs.data(), 2); On 2017/02/22 20:32:18, Mike Wittman wrote: > This probably should wait until all profilers have completed, then verify that > results are as expected. The expectation below is not correct since both of the > profilers could have completed before this call. The test below isn't the number of completed profiles but rather that the completed profile has data. 
https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:1070: WaitableEvent::WaitMany(sampling_completed_rawptrs.data(), 2); On 2017/02/22 20:32:18, Mike Wittman wrote: > 2 => 3, or better, arraysize(params). Since we're duplicating and changing tests > here, we probably should replace all the constants here and in the previous > tests with arraysize(params) to make this less fragile. Done. https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:1077: On 2017/02/22 20:32:18, Mike Wittman wrote: > I'd like to have a test that interleaves Start and Stop calls on different > profilers as well. Done. https://codereview.chromium.org/2554123002/diff/820001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/820001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:187: enum ThreadExecutionState { On 2017/02/23 18:26:52, Mike Wittman wrote: > This needs substantial comments explaining the lifecycle of the thread and the > meaning of the different states, so readers can understand the lifecycle and > corresponding subtleties without having to piece it together from the > implementing code. Done. https://codereview.chromium.org/2554123002/diff/820001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:265: if (state != RUNNING) On 2017/02/23 18:26:52, Mike Wittman wrote: > DCHECK_EQ(RUNNING, state) > > unless this is called in the other states in tests Done. https://codereview.chromium.org/2554123002/diff/820001/base/profiler/stack_sa... 
base/profiler/stack_sampling_profiler.cc:316: if (state != RUNNING)
On 2017/02/23 18:26:52, Mike Wittman wrote:
> DCHECK_NE(NOT_STARTED, state) before this

It could legitimately be in that state if the collection runs to completion, the idle time expires, the thread shuts down, and then an attempt is made to remove the collection.

https://codereview.chromium.org/2554123002/diff/820001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:331: if (task_runner_thread_state_ != RUNNING) {
On 2017/02/23 18:26:52, Mike Wittman wrote:
> I think this code would be easier to follow structured like:
>
> if (task_runner_thread_state_ == RUNNING) {
>   ... code currently in else clause ...
>
>   return task_runner_;
> }
>
> if (task_runner_thread_state_ == EXITING) {
>   // ...
>   Stop();
> }
>
> ... code currently in if clause ...
>
> return task_runner_;

Done.

https://codereview.chromium.org/2554123002/diff/820001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:363: if (task_runner_thread_state_ != RUNNING)
On 2017/02/23 18:26:52, Mike Wittman wrote:
> This would be better as a DCHECK:
> DCHECK(task_runner_thread_state_ == RUNNING ? !!task_runner_ : !task_runner_)

It needs to handle not-RUNNING by exiting early or the GetThreadId() below will hang. I can DCHECK the task-runner, though.

https://codereview.chromium.org/2554123002/diff/820001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:572: task_runner_thread_state_ = NOT_STARTED;
On 2017/02/23 18:26:52, Mike Wittman wrote:
> I don't think we need or want this. CleanUp is documented to be called after the
> message loop ends, so this would overwrite the EXITING state before it could be
> seen by GetOrCreateTaskRunner().

Done.
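The early-return structure suggested above for GetOrCreateTaskRunner() can be sketched in isolation. This is only an illustration under stated assumptions, not the code under review: SamplingThreadSketch, BeginIdleShutdown, the integer runner id, and the call counters are hypothetical stand-ins, std::mutex takes the place of base::Lock, and assert takes the place of DCHECK.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-in for the sampling thread's bookkeeping. None of these
// names come from the change under review.
class SamplingThreadSketch {
 public:
  enum State { NOT_STARTED, RUNNING, EXITING };

  // Returns a non-zero runner id, starting or restarting the thread if needed.
  int GetOrCreateRunner() {
    std::lock_guard<std::mutex> guard(lock_);

    // Common case first: the thread is up, so hand back its runner.
    if (state_ == RUNNING) {
      assert(runner_id_ != 0);
      return runner_id_;
    }

    // A previous idle shutdown was initiated; finish it before restarting.
    if (state_ == EXITING)
      ++stop_calls_;  // Stands in for a blocking Stop() on the old thread.

    assert(runner_id_ == 0);
    runner_id_ = ++start_calls_;  // Stands in for starting the thread.
    state_ = RUNNING;
    return runner_id_;
  }

  // Models the idle-shutdown path. The state is left as EXITING rather than
  // being reset to NOT_STARTED, so the next caller can see that a stop is due.
  void BeginIdleShutdown() {
    std::lock_guard<std::mutex> guard(lock_);
    if (state_ != RUNNING)
      return;
    state_ = EXITING;
    runner_id_ = 0;
  }

  State state() {
    std::lock_guard<std::mutex> guard(lock_);
    return state_;
  }

  int stop_calls() {
    std::lock_guard<std::mutex> guard(lock_);
    return stop_calls_;
  }

 private:
  std::mutex lock_;
  State state_ = NOT_STARTED;
  int runner_id_ = 0;
  int start_calls_ = 0;
  int stop_calls_ = 0;
};
```

In this model the EXITING state is what tells a later caller that the old thread must be stopped before a new one is started, which is why overwriting it with NOT_STARTED during cleanup would lose information.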
https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:858: sampling_completed.TimedWait(AVeryLongTimeDelta()); On 2017/02/24 20:39:24, bcwhite wrote: > On 2017/02/22 20:32:18, Mike Wittman wrote: > > What's the reason for the TimedWait? > > To ensure that it runs to completion before being stopped. Otherwise it could > stop before the first sample and there wouldn't be any evidence that it would > have run. Right, but why not just Wait(), and remove the Wait() call below? https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:962: WaitableEvent::WaitMany(sampling_completed_rawptrs.data(), 2); On 2017/02/24 20:39:24, bcwhite wrote: > On 2017/02/22 20:32:18, Mike Wittman wrote: > > This probably should wait until all profilers have completed, then verify that > > results are as expected. The expectation below is not correct since both of > the > > profilers could have completed before this call. > > The test below isn't the number of completed profiles but rather that the > completed profile has data. Ah right, misread that. https://codereview.chromium.org/2554123002/diff/820001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/820001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:316: if (state != RUNNING) On 2017/02/24 20:39:25, bcwhite wrote: > On 2017/02/23 18:26:52, Mike Wittman wrote: > > DCHECK_NE(NOT_STARTED, state) before this > > It could legitimately be in that state if the the collection runs to completion, > the idle time expires, the thread shuts down, and then an attempt is made to > remove the collection. Won't it be in the EXITING state in that case? 
https://codereview.chromium.org/2554123002/diff/820001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:363: if (task_runner_thread_state_ != RUNNING) On 2017/02/24 20:39:25, bcwhite wrote: > On 2017/02/23 18:26:52, Mike Wittman wrote: > > This would be better as a DCHECK: > > DCHECK(task_runner_thread_state_ == RUNNING ? !!task_runner_ : !task_runner_) > > It needs to handle not-RUNNING by exiting early or the GetThreadId() below will > hang. I can DCHECK the task-runner, though. How about putting the DCHECKS in a conditional then? if (task_runner_thread_state_ == RUNNING) { // ... DCHECK_NE(GetThreadId(), PlatformThread::CurrentId()); DCHECK(task_runner_); } else { DCHECK(!task_runner_); } return task_runner_; https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:191: // because it has exited. It will be started (or restarted) when a sampling As the code is currently, the state is only set to NOT_STARTED when the thread has never been started. https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:202: // initiated. We should mention that new profiling requests (which occur on their own thread) are responsible for ensuring the exit has completed then starting the thread and transitioning to the RUNNING state. https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:340: StackSamplingProfiler::SamplingThread::GetOrCreateTaskRunner() { We should rename this to something like GetOrCreateTaskRunnerForAdd, since incrementing the task_runner_add_events_ is predicated on this only being called for Add. Also add a comment discussing why this should only be called from Add. 
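Why the helper is tied to Add can be illustrated with a small model. It assumes, as the comments in this thread suggest but do not spell out, that the add-event counter lets a previously scheduled idle-shutdown task detect that new work arrived in the meantime. IdleShutdownSketch and all of its members are hypothetical names chosen for this sketch.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical model of an idle shutdown guarded by an add-event counter.
class IdleShutdownSketch {
 public:
  // Called once per Add request. Bumping the counter makes any shutdown
  // request captured before this point stale.
  void OnAdd() {
    std::lock_guard<std::mutex> guard(lock_);
    ++add_events_;
    running_ = true;
  }

  // Called when the thread becomes idle. Returns the counter value that the
  // delayed shutdown task should carry with it.
  int ScheduleShutdown() {
    std::lock_guard<std::mutex> guard(lock_);
    return add_events_;
  }

  // The delayed shutdown task. It proceeds only if no Add happened since it
  // was scheduled. Returns true if the thread was shut down.
  bool ShutdownTask(int add_events_at_schedule) {
    std::lock_guard<std::mutex> guard(lock_);
    assert(add_events_at_schedule <= add_events_);  // Counter only grows.
    if (add_events_at_schedule != add_events_)
      return false;  // New work arrived; this request is stale.
    running_ = false;
    return true;
  }

  bool running() {
    std::lock_guard<std::mutex> guard(lock_);
    return running_;
  }

 private:
  std::mutex lock_;
  int add_events_ = 0;
  bool running_ = false;
};
```

If a caller other than Add incremented the counter, or Add failed to, the staleness check would no longer mean "new work arrived", which is the constraint the rename is meant to make visible.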
https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:354: // to call Stop() before Start(). This is safe even the thread has never The last sentence is no longer relevant and can be removed. https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:669: profiling_inactive_.Reset(); We can avoid the manual reset by setting the reset policy to AUTOMATIC. https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:929: params[0].initial_delay = TimeDelta::FromMilliseconds(10); Do we need an initial delay for this set of params (and the one below)? https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:931: params[0].samples_per_burst = 10; Can we reduce this to something like 3-5 samples (and the one below)? If the sampling is serviced at the normal timer tick interval of 15.6ms, then the 10 samples in this test likely will take 160+ ms. We should strive to minimize test execution time where possible. Same comment applies to the subsequent tests as well. https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:935: params[1].samples_per_burst = 10; We should make this value different than above, and check the number of samples returned below, to test that we've profiled against each set of parameters once. https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... 
base/profiler/stack_sampling_profiler_unittest.cc:964: sampling_completed_rawptrs.data(), sampling_completed_rawptrs.size()); This block of code down to this line can be extracted into a utility function and reused across all these tests. https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:969: EXPECT_TRUE(sampling_completed[other_profiler]->TimedWait( This should be a regular Wait() call. The test will fail with a time out if the event is never signaled. https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:989: params[1].initial_delay = TimeDelta::FromMilliseconds(2); The resolution for all waiting values is the 15.6ms timer tick interval, so this profiler is likely to start on the same timer tick as the previous one (with 94% probability). The same logic applies to the sampling intervals, so this test will substantially be testing the same behavior as the previous one, except for the Stop behavior at the end. As a general principle, we can't rely on wait times to force a specific interleaving of execution. Even if things work out relative to the timer tick interval, execution still can be delayed and push things onto the same timer tick if the system is under load. Given this, I think the best we can do for the ConcurrentProfiling tests is to test various interleavings of Start/Stop/destroy calls. e.g.: Start() // 1 // sample Start() // 2 // sample Stop() // 1 // sample Stop() // 2 Start() // 1 // sample Start() // 2 // sample destroy // 2 // sample destroy // 1 where the fact that samples have occurred is validated by signaling the test delegate. https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... 
base/profiler/stack_sampling_profiler_unittest.cc:991: params[1].samples_per_burst = 10; If this test is intending to exercise the stopping of one profiler, then the params for this profiler (or the other) should be set up to ensure it executes much longer than the other one, to try to avoid races between profiler exit and the Stop call. https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:1026: profiler[i].reset(); The resetting code is unnecessary; the profilers will be destroyed when |profilers| is destroyed at the end of the block. https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:1039: params[0].initial_delay = TimeDelta::FromMilliseconds(8); Same comments here with respect to the timer tick interval and forcing specific interleaving of execution. https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:1082: profiler[i].reset(); This is unnecessary.
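The manual versus automatic reset policy raised earlier in this round for profiling_inactive_ can be shown with a small stand-in built on the standard library. EventSketch is a hypothetical class, not base::WaitableEvent; it models only the reset behaviour, and signaled_for_testing() is an invented peek that does not consume the signal.

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical event with a selectable reset policy.
class EventSketch {
 public:
  enum class ResetPolicy { MANUAL, AUTOMATIC };

  EventSketch(ResetPolicy policy, bool initially_signaled)
      : policy_(policy), signaled_(initially_signaled) {}

  void Signal() {
    {
      std::lock_guard<std::mutex> guard(lock_);
      signaled_ = true;
    }
    cv_.notify_all();
  }

  // Only needed with the MANUAL policy.
  void Reset() {
    std::lock_guard<std::mutex> guard(lock_);
    signaled_ = false;
  }

  // Blocks until signaled. With the AUTOMATIC policy the signal is consumed
  // by the waiter that wakes up; with MANUAL it stays set until Reset().
  void Wait() {
    std::unique_lock<std::mutex> guard(lock_);
    cv_.wait(guard, [this] { return signaled_; });
    if (policy_ == ResetPolicy::AUTOMATIC)
      signaled_ = false;
  }

  // Peeks at the state without consuming it.
  bool signaled_for_testing() {
    std::lock_guard<std::mutex> guard(lock_);
    return signaled_;
  }

 private:
  const ResetPolicy policy_;
  std::mutex lock_;
  std::condition_variable cv_;
  bool signaled_;
};
```

With a single waiter the two policies behave the same apart from the explicit Reset() call. They differ once more than one party needs to observe the same signal, since an automatic event hands the signal to only one of them.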
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:858: sampling_completed.TimedWait(AVeryLongTimeDelta()); On 2017/02/27 23:27:34, Mike Wittman wrote: > On 2017/02/24 20:39:24, bcwhite wrote: > > On 2017/02/22 20:32:18, Mike Wittman wrote: > > > What's the reason for the TimedWait? > > > > To ensure that it runs to completion before being stopped. Otherwise it could > > stop before the first sample and there wouldn't be any evidence that it would > > have run. > > Right, but why not just Wait(), and remove the Wait() call below? Done. https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:911: PlatformThread::YieldCurrentThread(); On 2017/02/22 20:32:18, Mike Wittman wrote: > Channeling brucedawson@: while (condition) yield(); results in a busy wait if > there spare execution cycles in the system, where the thread is repeatedly > scheduled, executes, and yields. This is bad for power usage. Admittedly this is > not a big concern in tests, but people do tend to copy-paste code around. The > preferred formulation is while (condition) sleep(1); to allow the processor to > be idle for some time. Done. https://codereview.chromium.org/2554123002/diff/820001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/820001/base/profiler/stack_sa... 
base/profiler/stack_sampling_profiler.cc:316: if (state != RUNNING) On 2017/02/27 23:27:34, Mike Wittman wrote: > On 2017/02/24 20:39:25, bcwhite wrote: > > On 2017/02/23 18:26:52, Mike Wittman wrote: > > > DCHECK_NE(NOT_STARTED, state) before this > > > > It could legitimately be in that state if the the collection runs to > completion, > > the idle time expires, the thread shuts down, and then an attempt is made to > > remove the collection. > > Won't it be in the EXITING state in that case? Yes. Originally I had planned for the thread to go from EXITING to NOT_STARTED when it had exited completely. https://codereview.chromium.org/2554123002/diff/820001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:363: if (task_runner_thread_state_ != RUNNING) On 2017/02/27 23:27:34, Mike Wittman wrote: > On 2017/02/24 20:39:25, bcwhite wrote: > > On 2017/02/23 18:26:52, Mike Wittman wrote: > > > This would be better as a DCHECK: > > > DCHECK(task_runner_thread_state_ == RUNNING ? !!task_runner_ : > !task_runner_) > > > > It needs to handle not-RUNNING by exiting early or the GetThreadId() below > will > > hang. I can DCHECK the task-runner, though. > > How about putting the DCHECKS in a conditional then? > > if (task_runner_thread_state_ == RUNNING) { > // ... > DCHECK_NE(GetThreadId(), PlatformThread::CurrentId()); > DCHECK(task_runner_); > } else { > DCHECK(!task_runner_); > } > > return task_runner_; Done. https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:191: // because it has exited. It will be started (or restarted) when a sampling On 2017/02/27 23:27:34, Mike Wittman wrote: > As the code is currently, the state is only set to NOT_STARTED when the thread > has never been started. Done. 
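The yield-versus-sleep point quoted earlier in this round of replies can be sketched with the standard library. This is an illustration only: WaitBySleeping is a hypothetical helper, and std::this_thread::sleep_for stands in for whatever sleep call the test actually uses.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Polls |done| until it becomes true, sleeping 1ms between checks so the
// processor can go idle instead of being rescheduled in a tight loop.
// Returns the number of sleeps performed.
inline int WaitBySleeping(const std::atomic<bool>& done) {
  int sleeps = 0;
  while (!done.load()) {
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
    ++sleeps;
  }
  return sleeps;
}
```

Replacing the sleep with std::this_thread::yield() would keep the loop correct, but on an otherwise idle machine the thread would be rescheduled immediately and spin.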
https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:202: // initiated. On 2017/02/27 23:27:34, Mike Wittman wrote: > We should mention that new profiling requests (which occur on their own thread) > are responsible for ensuring the exit has completed then starting the thread and > transitioning to the RUNNING state. Done. https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:340: StackSamplingProfiler::SamplingThread::GetOrCreateTaskRunner() { On 2017/02/27 23:27:34, Mike Wittman wrote: > We should rename this to something like GetOrCreateTaskRunnerForAdd, since > incrementing the task_runner_add_events_ is predicated on this only being called > for Add. Also add a comment discussing why this should only be called from Add. Done. This is exactly why I didn't want these helper methods in the first place. They're not "general purpose". https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:354: // to call Stop() before Start(). This is safe even the thread has never On 2017/02/27 23:27:34, Mike Wittman wrote: > The last sentence is no longer relevant and can be removed. Done. https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:669: profiling_inactive_.Reset(); On 2017/02/27 23:27:34, Mike Wittman wrote: > We can avoid the manual reset by setting the reset policy to AUTOMATIC. True but only because there is currently nothing else (other than the dtor) that waits on this. Conceptually it's not an automatic-reset that could be checked for multiple reasons. https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... 
base/profiler/stack_sampling_profiler_unittest.cc:929: params[0].initial_delay = TimeDelta::FromMilliseconds(10); On 2017/02/27 23:27:34, Mike Wittman wrote: > Do we need an initial delay for this set of params (and the one below)? The initial delay just provides some extra time to be confident that both are scheduled before one starts to execute. https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:931: params[0].samples_per_burst = 10; On 2017/02/27 23:27:35, Mike Wittman wrote: > Can we reduce this to something like 3-5 samples (and the one below)? If the > sampling is serviced at the normal timer tick interval of 15.6ms, then the 10 > samples in this test likely will take 160+ ms. The sampling will take 10ms + timer-resolution. Samples are taken at strict times and if the thread runs behind then multiple samples will be taken to "catch up". https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:935: params[1].samples_per_burst = 10; On 2017/02/27 23:27:34, Mike Wittman wrote: > We should make this value different than above, and check the number of samples > returned below, to test that we've profiled against each set of parameters once. Done. https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:964: sampling_completed_rawptrs.data(), sampling_completed_rawptrs.size()); On 2017/02/27 23:27:35, Mike Wittman wrote: > This block of code down to this line can be extracted into a utility function > and reused across all these tests. Done. https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:969: EXPECT_TRUE(sampling_completed[other_profiler]->TimedWait( On 2017/02/27 23:27:35, Mike Wittman wrote: > This should be a regular Wait() call. 
> The test will fail with a time out if the event is never signaled.

Done.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:989: params[1].initial_delay = TimeDelta::FromMilliseconds(2);
> The resolution for all waiting values is the 15.6ms timer tick interval, so this
> profiler is likely to start on the same timer tick as the previous one (with 94%
> probability).

It doesn't actually matter because events are posted to the queue in order based on desired run time. Thus, regardless of when execution is scheduled, the two tasks will interleave.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:991: params[1].samples_per_burst = 10;
On 2017/02/27 23:27:35, Mike Wittman wrote:
> If this test is intending to exercise the stopping of one profiler, then the
> params for this profiler (or the other) should be set up to ensure it executes
> much longer than the other one, to try to avoid races between profiler exit and
> the Stop call.

Races between exit and stop should be safe. Keeping them close helps test that edge case. The test is just "don't crash".

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1026: profiler[i].reset();
On 2017/02/27 23:27:35, Mike Wittman wrote:
> The resetting code is unnecessary; the profilers will be destroyed when
> |profilers| is destroyed at the end of the block.

Yes but they will be destroyed in descending order (opposite of construction). I want them destroyed in ascending order just like all other calls.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1039: params[0].initial_delay = TimeDelta::FromMilliseconds(8);
On 2017/02/27 23:27:35, Mike Wittman wrote:
> Same comments here with respect to the timer tick interval and forcing specific
> interleaving of execution.

Acknowledged.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1082: profiler[i].reset();
On 2017/02/27 23:27:34, Mike Wittman wrote:
> This is unnecessary.

Acknowledged.
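The ordering point behind the explicit reset() loop can be checked in isolation. The sketch assumes the profilers live in a built-in array of std::unique_ptr, which this thread does not confirm; Tracker and both helper functions are hypothetical names used only to record destruction order.

```cpp
#include <memory>
#include <vector>

// Hypothetical type that appends its id to a log when destroyed.
struct Tracker {
  Tracker(int tracker_id, std::vector<int>* destruction_log)
      : id(tracker_id), log(destruction_log) {}
  ~Tracker() { log->push_back(id); }
  int id;
  std::vector<int>* log;
};

// Lets the array go out of scope: built-in array elements are destroyed in
// reverse order of construction, so the log reads last to first.
inline std::vector<int> DestroyImplicitly() {
  std::vector<int> log;
  {
    std::unique_ptr<Tracker> items[3];
    for (int i = 0; i < 3; ++i)
      items[i].reset(new Tracker(i, &log));
  }
  return log;
}

// Resets each element explicitly, first to last, before the array dies.
inline std::vector<int> DestroyExplicitly() {
  std::vector<int> log;
  {
    std::unique_ptr<Tracker> items[3];
    for (int i = 0; i < 3; ++i)
      items[i].reset(new Tracker(i, &log));
    for (int i = 0; i < 3; ++i)
      items[i].reset();
  }
  return log;
}
```

Whether the ascending order matters for the scenario under test is a separate question, and it is the one the reviewer goes on to ask.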
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: linux_chromium_chromeos_ozone_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_...)
https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:911: PlatformThread::YieldCurrentThread(); On 2017/03/13 18:50:17, bcwhite wrote: > On 2017/02/22 20:32:18, Mike Wittman wrote: > > Channeling brucedawson@: while (condition) yield(); results in a busy wait if > > there spare execution cycles in the system, where the thread is repeatedly > > scheduled, executes, and yields. This is bad for power usage. Admittedly this > is > > not a big concern in tests, but people do tend to copy-paste code around. The > > preferred formulation is while (condition) sleep(1); to allow the processor to > > be idle for some time. > > Done. Sleeping for 1ms is preferable, to minimize test execution time. https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:191: // because it has exited. It will be started (or restarted) when a sampling On 2017/03/13 18:50:17, bcwhite wrote: > On 2017/02/27 23:27:34, Mike Wittman wrote: > > As the code is currently, the state is only set to NOT_STARTED when the thread > > has never been started. > > Done. I think it would be clearer to remove the "(or restarted)" part since the thread is not in this execution state before being restarted. https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... 
base/profiler/stack_sampling_profiler.cc:340: StackSamplingProfiler::SamplingThread::GetOrCreateTaskRunner() { On 2017/03/13 18:50:17, bcwhite wrote: > On 2017/02/27 23:27:34, Mike Wittman wrote: > > We should rename this to something like GetOrCreateTaskRunnerForAdd, since > > incrementing the task_runner_add_events_ is predicated on this only being > called > > for Add. Also add a comment discussing why this should only be called from > Add. > > Done. This is exactly why I didn't want these helper methods in the first > place. They're not "general purpose". The most important values for Chrome code are readability, understandability, and ease of modification. I think this remains a quite reasonable encapsulation by those metrics despite the clunky name. If this function needs to be used from a non-Add function in the future it's a simple modification to revert the name change and take a flag or callback for the Add case, since the relevant assumptions are documented (if not DCHECK'ed) in the code. Writing "general purpose" code outside of public APIs is pretty universally considered an anti-pattern in Chrome, because it adds complexity based on assumptions about how the code will be used in the future. When those assumptions are wrong, as they often are, the added complexity makes it more difficult to refactor to support use cases the original developer didn't anticipate: the developer doing the refactoring has to tease apart the aspects of the general purpose solution that are actually required and used from those that are superfluous. The most effective way to make code future-proof is to make it "generalizable" rather than "general purpose", which is accomplished through readability and understandability: being explicit about the constraints, assumptions, and intent of the code. https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... 
base/profiler/stack_sampling_profiler.cc:669: profiling_inactive_.Reset(); On 2017/03/13 18:50:17, bcwhite wrote: > On 2017/02/27 23:27:34, Mike Wittman wrote: > > We can avoid the manual reset by setting the reset policy to AUTOMATIC. > > True but only because there is currently nothing else (other than the dtor) that > waits on this. Conceptually it's not an automatic-reset that could be checked > for multiple reasons. Acknowledged. https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:929: params[0].initial_delay = TimeDelta::FromMilliseconds(10); On 2017/03/13 18:50:18, bcwhite wrote: > On 2017/02/27 23:27:34, Mike Wittman wrote: > > Do we need an initial delay for this set of params (and the one below)? > > The initial delay just provides some extra time to be confident that both are > scheduled before one starts to execute. Ok. Please document this within the test. https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:931: params[0].samples_per_burst = 10; On 2017/03/13 18:50:18, bcwhite wrote: > On 2017/02/27 23:27:35, Mike Wittman wrote: > > Can we reduce this to something like 3-5 samples (and the one below)? If the > > sampling is serviced at the normal timer tick interval of 15.6ms, then the 10 > > samples in this test likely will take 160+ ms. > > The sampling will take 10ms + timer-resolution. Samples are taken at strict > times and if the thread runs behind then multiple samples will be taken to > "catch up". > Please document this as well. https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... 
base/profiler/stack_sampling_profiler_unittest.cc:989: params[1].initial_delay = TimeDelta::FromMilliseconds(2);
On 2017/03/13 18:50:18, bcwhite wrote:
> > The resolution for all waiting values is the 15.6ms timer tick interval, so this
> > profiler is likely to start on the same timer tick as the previous one (with 94%
> > probability).
>
> It doesn't actually matter because events are posted to the queue in order based
> on desired run time. Thus, regardless of when execution is scheduled, the two
> tasks will interleave.

Ok, I can see that. In that case though, it's not clear to me what the value is in testing multiple interleavings (see my general comment below).

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:991: params[1].samples_per_burst = 10;
On 2017/03/13 18:50:18, bcwhite wrote:
> On 2017/02/27 23:27:35, Mike Wittman wrote:
> > If this test is intending to exercise the stopping of one profiler, then the
> > params for this profiler (or the other) should be set up to ensure it executes
> > much longer than the other one, to try to avoid races between profiler exit and
> > the Stop call.
>
> Races between exit and stop should be safe. Keeping them close helps test that
> edge case. The test is just "don't crash".

If the desire is to test both winning and losing the race, that should be done in two separate tests, each of which is written to unambiguously exercise one case or the other. Having a race in the test, even if benign, risks flaky failures if only one of the two cases fails.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1026: profiler[i].reset();
On 2017/03/13 18:50:18, bcwhite wrote:
> On 2017/02/27 23:27:35, Mike Wittman wrote:
> > The resetting code is unnecessary; the profilers will be destroyed when
> > |profilers| is destroyed at the end of the block.
>
> Yes but they will be destroyed in descending order (opposite of construction). I
> want them destroyed in ascending order just like all other calls.

Why do you want this in this particular test? How does the ordering of the Stop calls and destruction relate to the interleaving scenario above? Same question applies to the tests below too.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1127: EXPECT_FALSE(sampling_completed[1]->IsSignaled());
The first two profilers could both complete by this point if the system is under load, resulting in flaky failures.

https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:1132: EXPECT_FALSE(sampling_completed[2]->IsSignaled());
Same here for the second and third profilers.

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:242: Lock task_runner_lock_;
I think it would be better to call this something like thread_execution_state_lock_ at this point, since it's basically protecting changes that affect the thread's execution state. This naming would also have the benefit of encompassing the protection of the non-thread-safe Start/Stop/StopSoon/DetachFromSequence Thread API calls. The documentation should state that these are covered by the lock as well.

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:489: // Another increment of "create requests" serves to invalidate any pending
create requests => add events

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:549: // those always increments "create requests".
There may be other requests, create requests => add events https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:940: TEST(StackSamplingProfilerTest, MAYBE_ConcurrentProfiling_InSync) { General comments on testing this functionality, now that I mostly understand what the current tests are doing: 1. The most important aspects of this change to test are the subtleties around collection and thread lifetime. We should have dedicated tests exercising the different conditional outcomes for all the conditionals in GetOrCreateTaskRunnerForAdd, Remove, FinishCollection, and ShutdownTask, and for the id-not-found cases in RemoveCollectionTask and PerformCollectionTask. There are several interleavings of Start/Stop/destroy in the current tests, but it's not clear which of the cases are being exercised by them, which is why they should have dedicated tests. 2. Testing that multiple profilers can run concurrently is important, but it's not clear to me what the value is in testing multiple sampling interleavings. The sampling interleaving is mostly an implementation detail of the profiler -- users won't care how they're interleaved with other collections as long as their samples are collected close to "on time". The interleaving behavior is also substantially provided by the message loop, so tests of different interleavings are testing mostly that behavior not the behavior implemented in this class. Perhaps I'm missing something though -- is there another reason to test multiple interleavings? https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:1025: // Stop and destroy all profilers, always in the some order. Don't crash. 
nit: same https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:1076: // Stop and destroy all profilers, always in the some order. Don't crash. nit: same
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/800001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:911: PlatformThread::YieldCurrentThread(); On 2017/03/14 18:57:33, Mike Wittman wrote: > On 2017/03/13 18:50:17, bcwhite wrote: > > On 2017/02/22 20:32:18, Mike Wittman wrote: > > > Channeling brucedawson@: while (condition) yield(); results in a busy wait > if > > > there spare execution cycles in the system, where the thread is repeatedly > > > scheduled, executes, and yields. This is bad for power usage. Admittedly > this > > is > > > not a big concern in tests, but people do tend to copy-paste code around. > The > > > preferred formulation is while (condition) sleep(1); to allow the processor > to > > > be idle for some time. > > > > Done. > > Sleeping for 1ms is preferable, to minimize test execution time. Done. https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:191: // because it has exited. It will be started (or restarted) when a sampling On 2017/03/14 18:57:33, Mike Wittman wrote: > On 2017/03/13 18:50:17, bcwhite wrote: > > On 2017/02/27 23:27:34, Mike Wittman wrote: > > > As the code is currently, the state is only set to NOT_STARTED when the > thread > > > has never been started. > > > > Done. > > I think it would be clearer to remove the "(or restarted)" part since the thread > is not in this execution state before being restarted. Done. https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... 
base/profiler/stack_sampling_profiler_unittest.cc:929: params[0].initial_delay = TimeDelta::FromMilliseconds(10); On 2017/03/14 18:57:33, Mike Wittman wrote: > On 2017/03/13 18:50:18, bcwhite wrote: > > On 2017/02/27 23:27:34, Mike Wittman wrote: > > > Do we need an initial delay for this set of params (and the one below)? > > > > The initial delay just provides some extra time to be confident that both are > > scheduled before one starts to execute. > > Ok. Please document this within the test. Done. https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:931: params[0].samples_per_burst = 10; On 2017/03/14 18:57:33, Mike Wittman wrote: > On 2017/03/13 18:50:18, bcwhite wrote: > > On 2017/02/27 23:27:35, Mike Wittman wrote: > > > Can we reduce this to something like 3-5 samples (and the one below)? If the > > > sampling is serviced at the normal timer tick interval of 15.6ms, then the > 10 > > > samples in this test likely will take 160+ ms. > > > > The sampling will take 10ms + timer-resolution. Samples are taken at strict > > times and if the thread runs behind then multiple samples will be taken to > > "catch up". > > > > Please document this as well. Done. https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:991: params[1].samples_per_burst = 10; > If the desire is to test both winning and losing the race, that should be done > in two separate tests, each of which is written to unambiguously exercise one > case or the other. The desire is to make sure that different sampling parameters will operate in parallel. Anything else is superfluous. https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... 
base/profiler/stack_sampling_profiler_unittest.cc:1026: profiler[i].reset(); On 2017/03/14 18:57:33, Mike Wittman wrote: > On 2017/03/13 18:50:18, bcwhite wrote: > > On 2017/02/27 23:27:35, Mike Wittman wrote: > > > The resetting code is unnecessary; the profilers will be destroyed when > > > |profilers| is destroyed at the end of the block. > > > > Yes but they will be destroyed in descending order (oppose of construction). I > > want them destroyed in ascending order just like all other calls. > > Why do you want this in this particular test? How does the ordering of the Stop > calls and destruction relate to the interleaving scenario above? Same question > applies to the tests below too. It's an "interleave" test so order needs to be consistent to ensure that. https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:1127: EXPECT_FALSE(sampling_completed[1]->IsSignaled()); On 2017/03/14 18:57:33, Mike Wittman wrote: > The first two profilers could both complete by this point if the system is under > load, resulting in flaky failures. Done. https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:1132: EXPECT_FALSE(sampling_completed[2]->IsSignaled()); On 2017/03/14 18:57:33, Mike Wittman wrote: > Same here for the second and third profilers. Done. https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:242: Lock task_runner_lock_; > I think it would be better to call this something like > thread_execution_state_lock_ at this point, since it's basically protecting > changes that affect the thread's execution state. All the variables it protects start with task_runner_. 
https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:489: // Another increment of "create requests" serves to invalidate any pending On 2017/03/14 18:57:33, Mike Wittman wrote: > create requests => add events Done. https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:549: // those always increments "create requests". There may be other requests, On 2017/03/14 18:57:33, Mike Wittman wrote: > create requests => add events Done. https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:940: TEST(StackSamplingProfilerTest, MAYBE_ConcurrentProfiling_InSync) { > 1. The most important aspects of this change to test are the subtleties around > collection and thread lifetime. We should have dedicated tests exercising the > different conditional outcomes for all the conditionals in > GetOrCreateTaskRunnerForAdd, Remove, FinishCollection, and ShutdownTask, and for > the id-not-found cases in RemoveCollectionTask and PerformCollectionTask. GetOrCreateTaskRunnerForAdd: - state==RUNNING: tested by ConcurrentProfiling_* Entered when doing parallel sampling. - state==EXITING: tested by WillRestartSampler Entered after sampler shutdown. - state==other: tested by everything Every first sampler will exercise this state. Remove: - state!=RUNNING: added StopAfterIdle test - state==other: tested by every valid Stop() ShutdownTask: - not idle: tested everywhere - changed add-event This is to handle a race condition which is necessarily difficult (if not impossible) to test. 
RemoveCollectionTask: - found: tested every stop before completion - not-found: tested every stop after completion PerformCollectionTask: - found: tested with every valid sample - not-found: tested every stop before completion > 2. Testing that multiple profilers can run concurrently is important, but it's > not clear to me what the value is in testing multiple sampling interleavings. > The sampling interleaving is mostly an implementation detail of the profiler -- > users won't care how they're interleaved with other collections as long as their > samples are collected close to "on time". The interleaving behavior is also > substantially provided by the message loop, so tests of different interleavings > are testing mostly that behavior not the behavior implemented in this class. > Perhaps I'm missing something though -- is there another reason to test multiple > interleavings? Removed one of the tests. https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:1025: // Stop and destroy all profilers, always in the some order. Don't crash. On 2017/03/14 18:57:33, Mike Wittman wrote: > nit: same Done. https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:1076: // Stop and destroy all profilers, always in the some order. Don't crash. On 2017/03/14 18:57:34, Mike Wittman wrote: > nit: same Done.
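The "changed add-event" case in ShutdownTask discussed above can be illustrated with a small model. This is only a sketch under assumptions: `SamplingThreadSketch`, its task queue, and its member names are hypothetical stand-ins, not the Chromium implementation. It shows the general idea of a posted shutdown task remembering the add-event count at post time and doing nothing if a later add has changed it.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical model of the sampling thread; not the Chromium class.
class SamplingThreadSketch {
 public:
  // A new collection arrives: count the add event and mark the thread running.
  void Add() {
    ++add_events_;
    ++active_collections_;
    running_ = true;
  }

  // A collection finishes. If none remain, queue an idle shutdown that
  // remembers how many add events had been seen when it was posted.
  void Finish() {
    assert(active_collections_ > 0);
    if (--active_collections_ == 0) {
      const int add_events_at_post = add_events_;
      tasks_.push_back(
          [this, add_events_at_post] { ShutdownTask(add_events_at_post); });
    }
  }

  // Stand-in for the message loop: runs every queued task in order.
  void RunPendingTasks() {
    while (!tasks_.empty()) {
      std::function<void()> task = std::move(tasks_.front());
      tasks_.pop_front();
      task();
    }
  }

  bool running() const { return running_; }

 private:
  void ShutdownTask(int add_events_at_post) {
    // An Add() after this task was posted makes the shutdown request stale.
    if (add_events_at_post != add_events_)
      return;
    running_ = false;
  }

  std::deque<std::function<void()>> tasks_;
  int add_events_ = 0;
  int active_collections_ = 0;
  bool running_ = false;
};
```

In this model a test can provoke the stale-shutdown path deterministically by calling Add() between Finish() and RunPendingTasks(); whether the real class can be driven that way is a separate question.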
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: linux_chromium_chromeos_ozone_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_...)
Still taking another look at the ConcurrentProfiling_* tests. https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:931: params[0].samples_per_burst = 10; On 2017/03/16 15:56:25, bcwhite wrote: > On 2017/03/14 18:57:33, Mike Wittman wrote: > > On 2017/03/13 18:50:18, bcwhite wrote: > > > On 2017/02/27 23:27:35, Mike Wittman wrote: > > > > Can we reduce this to something like 3-5 samples (and the one below)? If > the > > > > sampling is serviced at the normal timer tick interval of 15.6ms, then the > > 10 > > > > samples in this test likely will take 160+ ms. > > > > > > The sampling will take 10ms + timer-resolution. Samples are taken at strict > > > times and if the thread runs behind then multiple samples will be taken to > > > "catch up". > > > > > > > Please document this as well. > > Done. I don't see something equivalent to this in the comments. The text above ("The sampling will take 10ms + ...") would be fine to just copy into the comment. https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:242: Lock task_runner_lock_; On 2017/03/16 15:56:25, bcwhite wrote: > > I think it would be better to call this something like > > thread_execution_state_lock_ at this point, since it's basically protecting > > changes that affect the thread's execution state. > > All the variables it protects start with task_runner_. Yes, and their names should be updated also, for the same reason. https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa... 
File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:940: TEST(StackSamplingProfilerTest, MAYBE_ConcurrentProfiling_InSync) { On 2017/03/16 15:56:25, bcwhite wrote: > > 1. The most important aspects of this change to test are the subtleties around > > collection and thread lifetime. We should have dedicated tests exercising the > > different conditional outcomes for all the conditionals in > > GetOrCreateTaskRunnerForAdd, Remove, FinishCollection, and ShutdownTask, and > for > > the id-not-found cases in RemoveCollectionTask and PerformCollectionTask. Relying on existing tests is OK if the test is obviously and reliably testing the particular behavior. All the other cases need dedicated tests so that (1) it's clear that the behavior is important and that it's actually being tested, and (2) the test of the behavior survives changes and refactoring to the code and tests. In particular, the behavior under test in several of the cases below is subject to races dependent on the sampling parameters, system load, and shutdown idle time. > GetOrCreateTaskRunnerForAdd: > - state==RUNNING: tested by ConcurrentProfiling_* > Entered when doing parallel sampling. We can't rely on the RUNNING state being tested in the ConcurrentProfiling_* tests because the state in the second and later Start() calls is subject to races dependent on the sampling parameters, the load on the system, and the shutdown idle time. > - state==EXITING: tested by WillRestartSampler > Entered after sampler shutdown. This is a good test for this behavior. > - state==other: tested by everything > Every first sampler will exercise this state. > > Remove: > - state!=RUNNING: added StopAfterIdle test This is also a good test for this behavior. 
> - state==other: tested by every valid Stop() We can't rely on state==other(EXITING) being tested in Remove for the same configuration/load-dependent reasons as for state==RUNNING in GetOrCreateTaskRunnerForAdd. > ShutdownTask: > - not idle: tested everywhere This is not tested everywhere, if I'm not mistaken, since ShutdownTask is only executed in tests where InitiateSamplingThreadIdleShutdown() is called. Independent of that, I think this conditional can be removed entirely. See the comment on the code. > - changed add-event > This is to handle a race condition which is necessarily difficult > (if not impossible) to test. If the previous conditional is removed, this can be tested by calling InitiateSamplingThreadIdleShutdown(true) while a collection is in progress. > RemoveCollectionTask: > - found: tested every stop before completion We can't rely on the item being found for the same configuration/load-dependent reasons as above. > - not-found: tested every stop after completion This is pretty subtle in most of the tests but it's a key part of StopAfterIdle; I think that's reasonable. > PerformCollectionTask: > - found: tested with every valid sample > - not-found: tested every stop before completion We can't rely on the item being not-found for the same configuration/load-dependent reasons as above. When run on their own, tests of the cases with configuration/load-dependent sensitivity can have exactly the configuration they need to avoid races in the behavior under test. They should be set up to be race-free even if the idle shutdown is set to a very short duration. https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:543: return; I don't think this conditional is needed. Alternative solution: 1.
Add a bool simulateDefunctShutdown parameter to InitiateSamplingThreadIdleShutdown(). Bind sampler->task_runner_add_events_-1 to the posted task if true. 2. Document on the InitiateSamplingThreadIdleShutdown() interface that simulateDefunctShutdown must be set to true if any collections are still active. 3. CHECK(simulateDefunctShutdown || active_collections_.empty()) within InitiateSamplingThreadIdleShutdown() to enforce (2). 4. Eliminate a race in FinishCollection so that tests can depend on the profiler being idle when all collections have finished: at the end of the function, save off the finished WaitableEvent and only signal it after active_collections_.erase(). https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:355: void CreateProfilers(PlatformThreadId target_thread_id, Nice, encapsulating this functionality makes the tests cleaner. Can you pass the input profiles as a vector, and the three output arguments as pointers to empty vectors (and push_back() here), so we don't need to keep all the array/vector sizes in sync in the tests? Also, input arguments should appear before output arguments in the interface. https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:359: SamplingParams* params, const https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:922: #define MAYBE_WillRestartSampler WillRestartSampler WillRestartSamplerAfterIdleShutdown would be a better name for this test. https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa... 
base/profiler/stack_sampling_profiler_unittest.cc:943: while (StackSamplingProfiler::TestAPI::IsSamplingThreadRunning()) Thinking about this in terms of the underlying StackSamplingProfiler state, I don't see the need to wait for the thread to exit in this test. Thread::Stop() is documented to handle both the pre-exit and exited cases, so the behavior in StackSamplingProfiler is the same regardless of which state the thread is in. https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:956: #define MAYBE_StopAfterIdle StopAfterIdle StopAfterIdleShutdown would be a better name. https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:980: while (StackSamplingProfiler::TestAPI::IsSamplingThreadRunning()) I don't think we need this for the same reason as above. https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:1001: // run at their scheduled, interleaved times regardless of whatever This doesn't make sense to me. How can the samples run at their scheduled times if the thread hasn't woken up by then? https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:1110: profiler[0]->Start(); How does the ordering of the Start and Stop calls and destruction relate to the parameters above?
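The "samples are taken at strict times, with catch-up" claim quoted above can be made concrete with a little arithmetic. The helper below is a hypothetical sketch, not code from the patch: it assumes each sample is scheduled at a fixed offset from the first sample time, so a thread that wakes late gets a zero delay for every overdue sample and takes them back to back.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

using Ms = std::chrono::milliseconds;

// Hypothetical helper: delay before sample |index| when samples are scheduled
// at strict times (first_sample_time + index * interval) rather than relative
// to when the previous sample actually ran.
Ms DelayUntilSample(Ms first_sample_time, Ms interval, int index, Ms now) {
  assert(index >= 0);
  const Ms scheduled = first_sample_time + interval * index;
  // A sample whose time has already passed is due immediately ("catch up").
  return std::max(scheduled - now, Ms(0));
}
```

With an assumed 10ms initial delay and 1ms interval, a thread that first wakes at 26ms (one 15.6ms tick late) would see a zero delay for all ten samples, which is consistent with the "10ms + timer-resolution" estimate rather than 160+ ms.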
https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler.cc:543: return;
On 2017/03/18 01:38:41, Mike Wittman wrote:
> I don't think this conditional is needed.
>
> Alternative solution:
>
> 1. Add a bool simulateDefunctShutdown parameter to
> InitiateSamplingThreadIdleShutdown(). Bind sampler->task_runner_add_events_-1 to
> the posted task if true.
>
> 2. Document on the InitiateSamplingThreadIdleShutdown() interface that
> simulateDefunctShutdown must be set to true if any collections are still active.
>
> 3. CHECK(simulateDefunctShutdown || active_collections_.empty()) within
> InitiateSamplingThreadIdleShutdown() to enforce (2).
>
> 4. Eliminate a race in FinishCollection so that tests can depend on the profiler
> being idle when all collections have finished: at the end of the function, save
> off the finished WaitableEvent and only signal it after
> active_collections_.erase().

I think we'll also need:

5. Invoke the callback after active_collections_.erase() as well, so that both Stop and the callback are guaranteed to be called with the profiler idle on the last collection.
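Items 4 and 5 above come down to an ordering rule inside FinishCollection: remove the collection first, then notify. The sketch below is a single-threaded model with hypothetical types (`EventSketch` stands in for base::WaitableEvent and `SamplerSketch` for the sampling thread); it is meant only to show the ordering, not the real class layout.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <map>
#include <utility>

// Hypothetical stand-in for base::WaitableEvent; only tracks a signaled flag.
class EventSketch {
 public:
  void Signal() { signaled_.store(true); }
  bool IsSignaled() const { return signaled_.load(); }

 private:
  std::atomic<bool> signaled_{false};
};

// Hypothetical model of the sampler's bookkeeping; not the Chromium class.
class SamplerSketch {
 public:
  int Add(EventSketch* finished, std::function<void()> callback) {
    assert(finished != nullptr);
    const int id = next_id_++;
    active_collections_.emplace(id, Collection{finished, std::move(callback)});
    return id;
  }

  void FinishCollection(int id) {
    auto found = active_collections_.find(id);
    if (found == active_collections_.end())
      return;
    // Save what is needed for notification before the entry goes away.
    EventSketch* finished = found->second.finished;
    std::function<void()> callback = std::move(found->second.callback);
    active_collections_.erase(found);
    // Only now notify: observers see the sampler as idle if this was the
    // last collection.
    if (callback)
      callback();
    finished->Signal();
  }

  bool IsIdle() const { return active_collections_.empty(); }

 private:
  struct Collection {
    EventSketch* finished;  // Not owned.
    std::function<void()> callback;
  };

  std::map<int, Collection> active_collections_;
  int next_id_ = 0;
};
```

The point of the ordering is that a test waiting on the finished event, or running inside the callback, can rely on IsIdle() being true after the last collection completes.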
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/880001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:931: params[0].samples_per_burst = 10; On 2017/03/18 01:38:41, Mike Wittman wrote: > On 2017/03/16 15:56:25, bcwhite wrote: > > On 2017/03/14 18:57:33, Mike Wittman wrote: > > > On 2017/03/13 18:50:18, bcwhite wrote: > > > > On 2017/02/27 23:27:35, Mike Wittman wrote: > > > > > Can we reduce this to something like 3-5 samples (and the one below)? If > > the > > > > > sampling is serviced at the normal timer tick interval of 15.6ms, then > the > > > 10 > > > > > samples in this test likely will take 160+ ms. > > > > > > > > The sampling will take 10ms + timer-resolution. Samples are taken at > strict > > > > times and if the thread runs behind then multiple samples will be taken to > > > > "catch up". > > > > > > > > > > Please document this as well. > > > > Done. > > I don't see something equivalent to this in the comments. The text above ("The > sampling will take 10ms + ...") would be fine to just copy into the comment. Done. https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:242: Lock task_runner_lock_; On 2017/03/18 01:38:41, Mike Wittman wrote: > On 2017/03/16 15:56:25, bcwhite wrote: > > > I think it would be better to call this something like > > > thread_execution_state_lock_ at this point, since it's basically protecting > > > changes that affect the thread's execution state. > > > > All the variables it protects start with task_runner_. > > Yes, and their names should be updated also, for the same reason. Done. 
https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:940: TEST(StackSamplingProfilerTest, MAYBE_ConcurrentProfiling_InSync) { > Relying on existing tests is OK if the test is obviously and reliably testing > the particular behavior. All the other cases need dedicated tests so that (1) > it's clear that the behavior is important and that it's actually being tested, > and (2) the test of the behavior survives changes and refactoring to the code > and tests. In particular, the behavior under test in several of the cases below > is subject to races dependent on the sampling parameters, system load, and > shutdown idle time. > > > GetOrCreateTaskRunnerForAdd: > > - state==RUNNING: tested by ConcurrentProfiling_* > > Entered when doing parallel sampling. > > We can't rely on the RUNNING state being tested in the ConcurrentProfiling_* > tests because the state in the second and later Start() calls is subject to > races dependent on the sampling parameters, the load on the system, and the > shutdown idle time. I've disabled the idle shutdown in those tests. 60 seconds wasn't going to be an issue but this is guaranteed. > > - state==EXITING: tested by WillRestartSampler > > Entered after sampler shutdown. > > This is a good test for this behavior. > > > - state==other: tested by everything > > Every first sampler will exercise this state. > > > > Remove: > > - state!=RUNNING: added StopAfterIdle test > > This is also a good test for this behavior. > > > - state==other: tested by every valid Stop() > > We can't rely on state==other(EXITING) being tested in Remove for the same > configuration/load-dependent reasons as for state==RUNNING in > GetOrCreateTaskRunnerForAdd. 
For EXITING state specifically, the StopAfterIdle test will call Remove() when it is in the EXITING state. > > ShutdownTask: > > - not idle: tested everywhere > > This is not tested everywhere, If I'm not mistaken, since ShutdownTask is only > executed in tests where InitiateSamplingThreadIdleShutdown() is called. > Independent of that, I think this conditional can be removed entirely. See the > comment on the code. And wherever DisableIdleShutdown is not called. > > - changed add-event > > This is to handle a race condition which is necessarily difficult > > (if not impossible) to test. > > If the previous conditional is removed, this can be tested by calling > InitiateSamplingThreadIdleShutdown(true) while a collection is in process. > > > RemoveCollectionTask: > > - found: tested every stop before completion > > We can't rely on the item being found for the for the same > configuration/load-dependent reasons as above. There are dozens of stop calls before completion, some with huge timeouts. They're not all going to miss. > > - not-found: tested every stop after completion > > This is pretty subtle in most of the tests but it's a key part of StopAfterIdle, > I think that's reasonable. > > > PerformCollectionTask: > > - found: tested with every valid sample > > - not-found: tested every stop before completion > > We can't rely on the item being not-found for the for the same > configuration/load-dependent reasons as above. For all practical purposes, I believe we can. Your original StopDuring*() tests do this. https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:543: return; > I don't think this conditional is needed. > > Alternative solution: > > 1. Add a bool simulateDefunctShutdown parameter to > InitiateSamplingThreadIdleShutdown(). 
Bind sampler->task_runner_add_events_-1 to > the posted task if true. > > 2. Document on the InitiateSamplingThreadIdleShutdown() interface that > simulateDefunctShutdown must be set to true if any collections are still active. That puts a lot of burden and complexity on the test to avoid what amounts to a "var != 0" at the top of this method. I don't see it being worth it. https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:355: void CreateProfilers(PlatformThreadId target_thread_id, On 2017/03/18 01:38:41, Mike Wittman wrote: > Nice, encapsulating this functionality makes the tests cleaner. > > Can you pass the input profiles as a vector, and the three output arguments as > pointers to empty vectors (and push_back() here), so we don't need to keep all > the array/vector sizes in sync in the tests? > > Also, input arguments should appear before output arguments in the interface. Done. https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:359: SamplingParams* params, On 2017/03/18 01:38:41, Mike Wittman wrote: > const Done. https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:922: #define MAYBE_WillRestartSampler WillRestartSampler On 2017/03/18 01:38:41, Mike Wittman wrote: > WillRestartSamplerAfterIdleShutdown would be a better name for this test. Done. https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa... 
base/profiler/stack_sampling_profiler_unittest.cc:943: while (StackSamplingProfiler::TestAPI::IsSamplingThreadRunning()) On 2017/03/18 01:38:41, Mike Wittman wrote: > Thinking about this in terms of the underlying StackSamplingProfiler state, I > don't see the need to wait for the thread to exit in this test. > > Thread::Stop() is documented to handle both the pre-exit and exited cases, so > the behavior in StackSamplingProfiler is the same regardless of which state the > thread is in. But then you wouldn't be able to tell that the thread restarted. For all the test could know, the same thread would have continued to run. https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:956: #define MAYBE_StopAfterIdle StopAfterIdle On 2017/03/18 01:38:41, Mike Wittman wrote: > StopAfterIdleShutdown would be a better name. Done. https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:1001: // run at their scheduled, interleaved times regardless of whatever On 2017/03/18 01:38:41, Mike Wittman wrote: > This doesn't make sense to me. How can the samples run at their scheduled times > if the thread hasn't woken up by then? Fixed wording. https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:1110: profiler[0]->Start(); On 2017/03/18 01:38:41, Mike Wittman wrote: > How does the ordering of the Start and Stop calls and destruction relate to the > parameters above? They don't. It's three different sampling parameters that start in a staggered ordering.
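For the wait loops on IsSamplingThreadRunning() discussed in this thread, the earlier advice was to sleep for 1ms per iteration rather than yield. A minimal sketch of that polling pattern follows; `SleepWhile` is a hypothetical helper written with the standard library rather than base::PlatformThread, and the returned sleep count exists only to make the behavior observable.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: polls |condition| and sleeps between checks so the
// processor can go idle, instead of spinning on a yield. Returns the number
// of sleeps performed.
int SleepWhile(const std::function<bool()>& condition,
               std::chrono::milliseconds interval =
                   std::chrono::milliseconds(1)) {
  assert(interval.count() > 0);
  int sleeps = 0;
  while (condition()) {
    std::this_thread::sleep_for(interval);
    ++sleeps;
  }
  return sleeps;
}
```

A test would pass a lambda wrapping the state it is waiting on, for example one that returns whether the sampling thread is still running.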
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: linux_chromium_chromeos_ozone_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_...)
https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:242: Lock task_runner_lock_; On 2017/03/20 21:50:51, bcwhite wrote: > On 2017/03/18 01:38:41, Mike Wittman wrote: > > On 2017/03/16 15:56:25, bcwhite wrote: > > > > I think it would be better to call this something like > > > > thread_execution_state_lock_ at this point, since it's basically > protecting > > > > changes that affect the thread's execution state. > > > > > > All the variables it protects start with task_runner_. > > > > Yes, and their names should be updated also, for the same reason. > > Done. Please add a comment stating that the lock also protects execution of the non-thread-safe Thread API calls related to the execution state: Start, Stop, StopSoon, DetachFromSequence. https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:940: TEST(StackSamplingProfilerTest, MAYBE_ConcurrentProfiling_InSync) { To repeat: Relying on existing tests is OK if the test is obviously and reliably testing the particular behavior. *All the other cases need dedicated tests* so that (1) it's clear that the behavior is important and that it's actually being tested, and (2) the test of the behavior survives changes and refactoring to the code and tests. The purpose in pointing out possible races is not to try to fix or explain away the races. It's to demonstrate that the existing tests are not obviously and reliably testing some of the behaviors, and that the test of those behaviors needs to be split out into dedicated tests. Please create separate tests for those behaviors. 
> > > Remove: > > > - state!=RUNNING: added StopAfterIdle test > > > > This is also a good test for this behavior. > > > > > - state==other: tested by every valid Stop() > > > > We can't rely on state==other(EXITING) being tested in Remove for the same > > configuration/load-dependent reasons as for state==RUNNING in > > GetOrCreateTaskRunnerForAdd. > > For EXITING state specifically, the StopAfterIdle test will call Remove() when > it is in the EXITING state. I think this is true but it took me 10+ minutes of reasoning about the code to reach that conclusion. So this is still not obvious and still needs a dedicated test. https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:543: return; On 2017/03/20 21:50:51, bcwhite wrote: > > I don't think this conditional is needed. > > > > Alternative solution: > > > > 1. Add a bool simulateDefunctShutdown parameter to > > InitiateSamplingThreadIdleShutdown(). Bind sampler->task_runner_add_events_-1 > to > > the posted task if true. > > > > 2. Document on the InitiateSamplingThreadIdleShutdown() interface that > > simulateDefunctShutdown must be set to true if any collections are still > active. > > That puts a lot of burden and complexity on the test to avoid what amounts to a > "var != 0" at the top of this method. I don't see it being worth it. Why do we need this conditional at all at this point? With the changes to FinishCollection the TestAPI now has the ability to tell if the profiler under test is idle. As a side note, test code is exactly where the burden of test complexity belongs. Putting it in production code makes the code less readable and understandable, especially in cases like this where the the behavior is already very subtle. https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa... 
File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:355: void CreateProfilers(PlatformThreadId target_thread_id, On 2017/03/20 21:50:51, bcwhite wrote: > On 2017/03/18 01:38:41, Mike Wittman wrote: > > Nice, encapsulating this functionality makes the tests cleaner. > > > > Can you pass the input profiles as a vector, and the three output arguments as > > pointers to empty vectors (and push_back() here), so we don't need to keep all > > the array/vector sizes in sync in the tests? > > > > Also, input arguments should appear before output arguments in the interface. > > Done. param should be passed as a vector also. https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:943: while (StackSamplingProfiler::TestAPI::IsSamplingThreadRunning()) On 2017/03/20 21:50:51, bcwhite wrote: > On 2017/03/18 01:38:41, Mike Wittman wrote: > > Thinking about this in terms of the underlying StackSamplingProfiler state, I > > don't see the need to wait for the thread to exit in this test. > > > > Thread::Stop() is documented to handle both the pre-exit and exited cases, so > > the behavior in StackSamplingProfiler is the same regardless of which state > the > > thread is in. > > But then you wouldn't be able to tell that the thread restarted. For all the > test could know, the same thread would have continued to run. That's true. But whether the thread stopped after idle shutdown is a different concern than whether a new collection can start. It's also a very important behavior on its own and deserves to be separated out into its own test. https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa... 
base/profiler/stack_sampling_profiler_unittest.cc:1110: profiler[0]->Start(); On 2017/03/20 21:50:51, bcwhite wrote: > On 2017/03/18 01:38:41, Mike Wittman wrote: > > How does the ordering of the Start and Stop calls and destruction relate to > the > > parameters above? > > They don't. It's three different sampling parameters that start in a staggered > ordering. In that case, if there's still a motivation for this ordering then it should be documented and moved to its own test since it's independent. If not, then this can be simplified to three Start() calls followed by three wait calls and remove the Stop() calls. https://codereview.chromium.org/2554123002/diff/960001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/960001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:430: // move them because this collection is about to be deleted. This last sentence is no longer relevant and can be removed. https://codereview.chromium.org/2554123002/diff/960001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/960001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:362: CHECK(profiles->empty()); ASSERT_TRUE rather than CHECK in test code (or EXPECT_TRUE if ASSERT_TRUE doesn't work here)
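For reference, the lock comment requested above (the lock guards both the execution-state fields and the non-thread-safe Thread control calls) can be illustrated with a minimal standalone sketch. This is not the Chromium code: FakeThread, FakeSampler, EnsureStarted, Shutdown, and start_count are hypothetical names, and std::mutex stands in for base::Lock.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-in for a thread object whose control calls are not
// thread-safe and therefore must be sequenced by the caller.
class FakeThread {
 public:
  void Start() { running_ = true; }
  void Stop() { running_ = false; }
  bool running() const { return running_; }

 private:
  bool running_ = false;
};

// Hypothetical owner of the thread; all names here are illustrative only.
class FakeSampler {
 public:
  // Starts the thread if it is not already running.
  void EnsureStarted() {
    std::lock_guard<std::mutex> lock(thread_execution_state_lock_);
    if (!thread_.running()) {
      thread_.Start();
      ++start_count_;
    }
  }

  // Stops the thread if it is running.
  void Shutdown() {
    std::lock_guard<std::mutex> lock(thread_execution_state_lock_);
    if (thread_.running())
      thread_.Stop();
  }

  int start_count() {
    std::lock_guard<std::mutex> lock(thread_execution_state_lock_);
    return start_count_;
  }

 private:
  // Protects the execution-state fields below. It also sequences every call
  // to the non-thread-safe FakeThread control methods (Start, Stop), so the
  // public methods may be called from any thread.
  std::mutex thread_execution_state_lock_;
  FakeThread thread_;
  int start_count_ = 0;
};
```

The point of the comment on the lock is that a reader cannot tell from the field list alone that the Thread calls are covered too.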
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:242: Lock task_runner_lock_; On 2017/03/21 16:50:38, Mike Wittman wrote: > On 2017/03/20 21:50:51, bcwhite wrote: > > On 2017/03/18 01:38:41, Mike Wittman wrote: > > > On 2017/03/16 15:56:25, bcwhite wrote: > > > > > I think it would be better to call this something like > > > > > thread_execution_state_lock_ at this point, since it's basically > > protecting > > > > > changes that affect the thread's execution state. > > > > > > > > All the variables it protects start with task_runner_. > > > > > > Yes, and their names should be updated also, for the same reason. > > > > Done. > > Please add a comment stating that the lock also protects execution of the > non-thread-safe Thread API calls related to the execution state: Start, Stop, > StopSoon, DetachFromSequence. Done. https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:940: TEST(StackSamplingProfilerTest, MAYBE_ConcurrentProfiling_InSync) { On 2017/03/21 16:50:38, Mike Wittman wrote: > To repeat: > Relying on existing tests is OK if the test is obviously and reliably testing > the particular behavior. *All the other cases need dedicated tests* so that (1) > it's clear that the behavior is important and that it's actually being tested, > and (2) the test of the behavior survives changes and refactoring to the code > and tests. > > The purpose in pointing out possible races is not to try to fix or explain away > the races. 
It's to demonstrate that the existing tests are not obviously and > reliably testing some of the behaviors, and that the test of those behaviors > needs to be split out into dedicated tests. Please create separate tests for > those behaviors. > > > > > Remove: > > > > - state!=RUNNING: added StopAfterIdle test > > > > > > This is also a good test for this behavior. > > > > > > > - state==other: tested by every valid Stop() > > > > > > We can't rely on state==other(EXITING) being tested in Remove for the same > > > configuration/load-dependent reasons as for state==RUNNING in > > > GetOrCreateTaskRunnerForAdd. > > > > For EXITING state specifically, the StopAfterIdle test will call Remove() when > > it is in the EXITING state. > > I think this is true but it took me 10+ minutes of reasoning about the code to > reach that conclusion. So this is still not obvious and still needs a dedicated > test. This IS the dedicated test! StopAfterIdleShutdown was added, at your request, just to cover this case. The only way to know that a thread has gone into the EXITING state is that it has stopped for being idle. At any prior time, it could still be a posted task waiting to execute. I'll add a comment to that effect. > > RemoveCollectionTask: > > - found: tested every stop before completion > > We can't rely on the item being found for the for the same > configuration/load-dependent reasons as above. The run-time for the tasks is 1 day, longer than the run-time of the test. If the test completes, the task was stopped before completion. https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:543: return; On 2017/03/21 16:50:38, Mike Wittman wrote: > On 2017/03/20 21:50:51, bcwhite wrote: > > > I don't think this conditional is needed. 
> > > > > > Alternative solution: > > > > > > 1. Add a bool simulateDefunctShutdown parameter to > > > InitiateSamplingThreadIdleShutdown(). Bind > sampler->task_runner_add_events_-1 > > to > > > the posted task if true. > > > > > > 2. Document on the InitiateSamplingThreadIdleShutdown() interface that > > > simulateDefunctShutdown must be set to true if any collections are still > > active. > > > > That puts a lot of burden and complexity on the test to avoid what amounts to > a > > "var != 0" at the top of this method. I don't see it being worth it. > > Why do we need this conditional at all at this point? With the changes to > FinishCollection the TestAPI now has the ability to tell if the profiler under > test is idle. > > As a side note, test code is exactly where the burden of test complexity > belongs. Putting it in production code makes the code less readable and > understandable, especially in cases like this where the the behavior is already > very subtle. Done. https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:355: void CreateProfilers(PlatformThreadId target_thread_id, On 2017/03/21 16:50:38, Mike Wittman wrote: > On 2017/03/20 21:50:51, bcwhite wrote: > > On 2017/03/18 01:38:41, Mike Wittman wrote: > > > Nice, encapsulating this functionality makes the tests cleaner. > > > > > > Can you pass the input profiles as a vector, and the three output arguments > as > > > pointers to empty vectors (and push_back() here), so we don't need to keep > all > > > the array/vector sizes in sync in the tests? > > > > > > Also, input arguments should appear before output arguments in the > interface. > > > > Done. > > param should be passed as a vector also. Done. https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa... 
base/profiler/stack_sampling_profiler_unittest.cc:943: while (StackSamplingProfiler::TestAPI::IsSamplingThreadRunning()) On 2017/03/21 16:50:38, Mike Wittman wrote: > On 2017/03/20 21:50:51, bcwhite wrote: > > On 2017/03/18 01:38:41, Mike Wittman wrote: > > > Thinking about this in terms of the underlying StackSamplingProfiler state, > I > > > don't see the need to wait for the thread to exit in this test. > > > > > > Thread::Stop() is documented to handle both the pre-exit and exited cases, > so > > > the behavior in StackSamplingProfiler is the same regardless of which state > > the > > > thread is in. > > > > But then you wouldn't be able to tell that the thread restarted. For all the > > test could know, the same thread would have continued to run. > > That's true. But whether the thread stopped after idle shutdown is a different > concern than whether a new collection can start. It's also a very important > behavior on its own and deserves to be separated out into its own test. Done. https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:1110: profiler[0]->Start(); On 2017/03/21 16:50:38, Mike Wittman wrote: > On 2017/03/20 21:50:51, bcwhite wrote: > > On 2017/03/18 01:38:41, Mike Wittman wrote: > > > How does the ordering of the Start and Stop calls and destruction relate to > > the > > > parameters above? > > > > They don't. It's three different sampling parameters that start in a > staggered > > ordering. > > In that case, if there's still a motivation for this ordering then it should be > documented and moved to its own test since it's independent. If not, then this > can be simplified to three Start() calls followed by three wait calls and remove > the Stop() calls. No motivation, per say. If you don't feel that staggered start/stop calls are of any use then I'll just remove the test because it would become essentially the same as the other ConcurrentProfiling tests. 
https://codereview.chromium.org/2554123002/diff/960001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/960001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler.cc:430: // move them because this collection is about to be deleted. On 2017/03/21 16:50:38, Mike Wittman wrote: > This last sentence is no longer relevant and can be removed. Done. https://codereview.chromium.org/2554123002/diff/960001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/960001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:362: CHECK(profiles->empty()); On 2017/03/21 16:50:38, Mike Wittman wrote: > ASSERT_TRUE rather than CHECK in test code (or EXPECT_TRUE if ASSERT_TRUE > doesn't work here) Done.
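The interface shape agreed on for CreateProfilers (inputs as vectors, outputs as pointers to empty vectors filled with push_back(), inputs before outputs) might look roughly like the sketch below. FakeParams, FakeProfiler, and CreateFakeProfilers are hypothetical stand-ins, not the types in the unittest, and plain assert() is used only so the sketch builds without gtest; in the real test the emptiness check would be ASSERT_TRUE as requested.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-ins for the sampling parameters and profiler types.
struct FakeParams {
  int samples_per_burst;
};

struct FakeProfiler {
  explicit FakeProfiler(const FakeParams& p) : params(p) {}
  FakeParams params;
};

// Input arguments come first and output arguments last. The output vector
// must arrive empty so that element i of |profilers| always corresponds to
// element i of |params|, with no sizes to keep in sync at the call site.
void CreateFakeProfilers(
    const std::vector<FakeParams>& params,
    std::vector<std::unique_ptr<FakeProfiler>>* profilers) {
  assert(profilers->empty());
  for (const FakeParams& p : params)
    profilers->push_back(std::make_unique<FakeProfiler>(p));
}
```

A caller builds one params vector and passes the address of a freshly constructed output vector, so adding a fourth profiler to a test only means adding a fourth entry to params.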
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: win_chromium_x64_rel_ng on master.tryserver.chromium.win (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_x64_...)
https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:940: TEST(StackSamplingProfilerTest, MAYBE_ConcurrentProfiling_InSync) { > > > > > Remove: > > > > > - state!=RUNNING: added StopAfterIdle test > > > > > > > > This is also a good test for this behavior. > > > > > > > > > - state==other: tested by every valid Stop() > > > > > > > > We can't rely on state==other(EXITING) being tested in Remove for the same > > > > configuration/load-dependent reasons as for state==RUNNING in > > > > GetOrCreateTaskRunnerForAdd. > > > > > > For EXITING state specifically, the StopAfterIdle test will call Remove() > when > > > it is in the EXITING state. > > > > I think this is true but it took me 10+ minutes of reasoning about the code to > > reach that conclusion. So this is still not obvious and still needs a > dedicated > > test. > > This IS the dedicated test! StopAfterIdleShutdown was added, at your request, > just to cover this case. The only way to know that a thread has gone into the > EXITING state is that it has stopped for being idle. At any prior time, it > could still be a posted task waiting to execute. > I'll add a comment to that effect. This is still pretty subtle and could use some even more extensive comments explaining how it works. I made suggestions in the code. > > > GetOrCreateTaskRunnerForAdd: > > > - state==RUNNING: tested by ConcurrentProfiling_* > > > Entered when doing parallel sampling. > > > > We can't rely on the RUNNING state being tested in the ConcurrentProfiling_* > > tests because the state in the second and later Start() calls is subject to > > races dependent on the sampling parameters, the load on the system, and the > > shutdown idle time. > > I've disabled the idle shutdown in those tests. 
60 seconds wasn't going to be > an issue but this is guaranteed. It's still not obvious to a reader that this scenario is reliably tested. The ConcurrentProfiling_* tests stated intention is to test behavior other than this so could easily be changed in the future such that they no longer test this behavior. Please write a dedicated test. All that's required is starting a profiler with a very large initial delay before starting a second profiler. > > ShutdownTask: > > - changed add-event > > This is to handle a race condition which is necessarily difficult > > (if not impossible) to test. > > If the previous conditional is removed, this can be tested by calling > InitiateSamplingThreadIdleShutdown(true) while a collection is in process. Please write a dedicated test. > > > RemoveCollectionTask: > > > - found: tested every stop before completion > > > > We can't rely on the item being found for the for the same > > configuration/load-dependent reasons as above. > > There are dozens of stop calls before completion, some with huge timeouts. > They're not all going to miss. None of these tests are intending to reliably exercise this behavior and could easily be changed in the future such that they no longer test the behavior. The use of Stop in the destructor is an implementation detail and shouldn't be relied upon when testing this behavior. Please write a dedicated test. All that's required is starting a profiler with a very long sampling interval, waiting for one sample to be collected using the test delegate, and stopping the profiler. > > > - not-found: tested every stop after completion > > > > This is pretty subtle in most of the tests but it's a key part of > StopAfterIdle, > > I think that's reasonable. > > > > > PerformCollectionTask: > > > - found: tested with every valid sample > > > - not-found: tested every stop before completion > > > > We can't rely on the item being not-found for the for the same > > configuration/load-dependent reasons as above. 
> > For all practical purposes, I believe we can. Your original StopDuring*() tests > do this. There's no reason to believe that the test process will continue execution long enough for the next PerformCollectionTask to be executed after Stop. Even if there was, none of these tests are intending to exercise this behavior and could easily be changed in the future such that they no longer test the behavior. Please write a dedicated test. This is trickier but I believe can be done by relying on the message loop task ordering and observing when samples are taken via the test delegate. Start two profilers with interleaved execution, wait for both to take samples, stop the first, and observe that two samples of the second occur with no interleaved sample from the first. https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:1110: profiler[0]->Start(); On 2017/03/22 17:48:54, bcwhite wrote: > On 2017/03/21 16:50:38, Mike Wittman wrote: > > On 2017/03/20 21:50:51, bcwhite wrote: > > > On 2017/03/18 01:38:41, Mike Wittman wrote: > > > > How does the ordering of the Start and Stop calls and destruction relate > to > > > the > > > > parameters above? > > > > > > They don't. It's three different sampling parameters that start in a > > staggered > > > ordering. > > > > In that case, if there's still a motivation for this ordering then it should > be > > documented and moved to its own test since it's independent. If not, then this > > can be simplified to three Start() calls followed by three wait calls and > remove > > the Stop() calls. > > No motivation, per say. If you don't feel that staggered start/stop calls are > of any use then I'll just remove the test because it would become essentially > the same as the other ConcurrentProfiling tests. Removing SGTM. 
Any specific behaviors exercised by staggered start/stop would be better addressed in focused tests. https://codereview.chromium.org/2554123002/diff/980001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/980001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:1011: while (StackSamplingProfiler::TestAPI::IsSamplingThreadRunning()) Since the previous test has already established that the thread will eventually exit, there's no need to reverify it here. The important behavior under test here is simply that profiling can take place after idle shutdown has run. https://codereview.chromium.org/2554123002/diff/980001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:1046: // its task and before the thread actually exits. This is much better, although I think this information would be easier to understand if it were fleshed out more and applied to the individual calls. Suggested comments below. https://codereview.chromium.org/2554123002/diff/980001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:1049: StackSamplingProfiler::TestAPI::InitiateSamplingThreadIdleShutdown(); // Post a ShutdownTask on the sampling thread, which will mark the thread as EXITING and shut down the thread asynchronously after the function exits. https://codereview.chromium.org/2554123002/diff/980001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:1050: while (StackSamplingProfiler::TestAPI::IsSamplingThreadRunning()) // Wait for the thread to exit to ensure the ShutdownTask has finished executing and has set the thread state to EXITING. https://codereview.chromium.org/2554123002/diff/980001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:1053: // Ensure it's still safe to stop. // Attempt to stop the profiler now that we know the thread is in the EXITING state.
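The test recipe suggested above for the PerformCollectionTask not-found case (two interleaved profilers, stop the first, then observe only samples from the second) relies on FIFO task ordering. A rough standalone model of that reasoning follows. It is a sketch under assumptions: FakeSamplingLoop and its methods are hypothetical, the queue is single-threaded, and integer ids stand in for collections.

```cpp
#include <cassert>
#include <deque>
#include <set>
#include <vector>

// Hypothetical single-threaded model of the sampling thread's FIFO queue.
class FakeSamplingLoop {
 public:
  // Registers a collection and queues its first sampling task.
  void Add(int id) {
    active_.insert(id);
    tasks_.push_back(id);
  }

  // Unregisters a collection. Its already-queued task stays in the queue.
  void Remove(int id) { active_.erase(id); }

  // Runs the task at the front of the queue. If its collection has been
  // removed (the "not found" case), the task records nothing and is not
  // reposted. Otherwise it records a sample and reposts itself at the back.
  void RunOne() {
    assert(!tasks_.empty());
    int id = tasks_.front();
    tasks_.pop_front();
    if (active_.count(id) == 0)
      return;
    samples_.push_back(id);
    tasks_.push_back(id);
  }

  const std::vector<int>& samples() const { return samples_; }

 private:
  std::set<int> active_;
  std::deque<int> tasks_;
  std::vector<int> samples_;
};
```

In this model, once both collections have sampled and the first is removed, the next task to run belongs to the removed collection and takes the not-found path, after which every recorded sample comes from the second collection. That is the observable ordering the dedicated test would assert on through the test delegate.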
more tests; improved tests
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Some re-work was necessary to fix tests when --gtest_shuffle is set. https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:940: TEST(StackSamplingProfilerTest, MAYBE_ConcurrentProfiling_InSync) { > > This IS the dedicated test! StopAfterIdleShutdown was added, at your request, > > just to cover this case. The only way to know that a thread has gone into the > > EXITING state is that it has stopped for being idle. At any prior time, it > > could still be a posted task waiting to execute. > > I'll add a comment to that effect. > > This is still pretty subtle and could use some even more extensive comments > explaining how it works. I made suggestions in the code. Done. > > > > GetOrCreateTaskRunnerForAdd: > > > > - state==RUNNING: tested by ConcurrentProfiling_* > > > > Entered when doing parallel sampling. > > > > > > We can't rely on the RUNNING state being tested in the ConcurrentProfiling_* > > > tests because the state in the second and later Start() calls is subject to > > > races dependent on the sampling parameters, the load on the system, and the > > > shutdown idle time. > > > > I've disabled the idle shutdown in those tests. 60 seconds wasn't going to be > > an issue but this is guaranteed. > > It's still not obvious to a reader that this scenario is reliably tested. The > ConcurrentProfiling_* tests stated intention is to test behavior other than this > so could easily be changed in the future such that they no longer test this > behavior. > > Please write a dedicated test. All that's required is starting a profiler with a > very large initial delay before starting a second profiler. Done. MultipleStart. 
> > > ShutdownTask: > > > - changed add-event > > > This is to handle a race condition which is necessarily difficult > > > (if not impossible) to test. > > > > If the previous conditional is removed, this can be tested by calling > > InitiateSamplingThreadIdleShutdown(true) while a collection is in process. > > Please write a dedicated test. IdleShutdownAbort It's definitely not that easy but I've changed the InitiateShutdown into a PerformShutdown that waits for the task to execute. At that point the test can know that at least StopSoon() has been called and the state set to EXITING but there's still no way to know if the thread has actually exited without waiting. > > > > RemoveCollectionTask: > > > > - found: tested every stop before completion > > > > > > We can't rely on the item being found for the for the same > > > configuration/load-dependent reasons as above. > > > > There are dozens of stop calls before completion, some with huge timeouts. > > They're not all going to miss. > > None of these tests are intending to reliably exercise this behavior and could > easily be changed in the future such that they no longer test the behavior. The > use of Stop in the destructor is an implementation detail and shouldn't be > relied upon when testing this behavior. > > Please write a dedicated test. All that's required is starting a profiler with a > very long sampling interval, waiting for one sample to be collected using the > test delegate, and stopping the profiler. StopSafely Thanks for being specific. Done. > > > > - not-found: tested every stop after completion > > > > > > This is pretty subtle in most of the tests but it's a key part of > > StopAfterIdle, > > > I think that's reasonable. > > > > > > > PerformCollectionTask: > > > > - found: tested with every valid sample > > > > - not-found: tested every stop before completion > > > > > > We can't rely on the item being not-found for the for the same > > > configuration/load-dependent reasons as above. 
> > > > For all practical purposes, I believe we can. Your original StopDuring*() > tests > > do this. > > There's no reason to believe that the test process will continue execution long > enough for the next PerformCollectionTask to be executed after Stop. Even if > there was, none of these tests are intending to exercise this behavior and could > easily be changed in the future such that they no longer test the behavior. > > Please write a dedicated test. This is trickier but I believe can be done by > relying on the message loop task ordering and observing when samples are taken > via the test delegate. Start two profilers with interleaved execution, wait for > both to take samples, stop the first, and observe that two samples of the second > occur with no interleaved sample from the first. Done as part of StopSafely. https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/940001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:1110: profiler[0]->Start(); On 2017/03/23 22:18:31, Mike Wittman wrote: > On 2017/03/22 17:48:54, bcwhite wrote: > > On 2017/03/21 16:50:38, Mike Wittman wrote: > > > On 2017/03/20 21:50:51, bcwhite wrote: > > > > On 2017/03/18 01:38:41, Mike Wittman wrote: > > > > > How does the ordering of the Start and Stop calls and destruction relate > > to > > > > the > > > > > parameters above? > > > > > > > > They don't. It's three different sampling parameters that start in a > > > staggered > > > > ordering. > > > > > > In that case, if there's still a motivation for this ordering then it should > > be > > > documented and moved to its own test since it's independent. If not, then > this > > > can be simplified to three Start() calls followed by three wait calls and > > remove > > > the Stop() calls. > > > > No motivation, per say. 
If you don't feel that staggered start/stop calls are > > of any use then I'll just remove the test because it would become essentially > > the same as the other ConcurrentProfiling tests. > > Removing SGTM. Any specific behaviors exercised by staggered stop/stop would be > better addressed in focused tests. Done. https://codereview.chromium.org/2554123002/diff/980001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/980001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:1011: while (StackSamplingProfiler::TestAPI::IsSamplingThreadRunning()) On 2017/03/23 22:18:31, Mike Wittman wrote: > Since the previous test has already established that the thread will eventually > exit, there's no need to reverify it here. The important behavior under test > here is simply that profiling can take place after idle shutdown has run. It's not enough to know that it will exit. It must have actually exited before CaptureProfiles() below in order to know that it restarted and didn't just get the idle-shutdown cancelled. Without the IsSamplingThreadRunning() there is no way to know that the idle shutdown has run let alone that the thread has exited. InitiateSamplingThreadIdleShutdown only posts a task that could be delayed by any amount of time based on current activity and system load. https://codereview.chromium.org/2554123002/diff/980001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:1046: // its task and before the thread actually exits. On 2017/03/23 22:18:31, Mike Wittman wrote: > This is much better, although I think this information would be easier to > understand fleshed out more and applied to on the individual calls. Suggested > comments below. Acknowledged. https://codereview.chromium.org/2554123002/diff/980001/base/profiler/stack_sa... 
base/profiler/stack_sampling_profiler_unittest.cc:1049: StackSamplingProfiler::TestAPI::InitiateSamplingThreadIdleShutdown(); On 2017/03/23 22:18:31, Mike Wittman wrote: > // Post a ShutdownTask on the sampling thread, which will mark the thread as > EXITING and shut down the thread asynchronously after the function exits. Done. https://codereview.chromium.org/2554123002/diff/980001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:1050: while (StackSamplingProfiler::TestAPI::IsSamplingThreadRunning()) On 2017/03/23 22:18:31, Mike Wittman wrote: > // Wait for the thread to exit to ensure the ShutdownTask has finished executing > and has set the thread state to EXITING. Done. https://codereview.chromium.org/2554123002/diff/980001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:1053: // Ensure it's still safe to stop. On 2017/03/23 22:18:31, Mike Wittman wrote: > // Attempt to stop the profiler now that we know the thread is in the EXITING > state. Done.
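The IdleShutdownAbort case discussed above (a ShutdownTask that must do nothing if an add happened after it was posted) can be modelled in a few lines. This is only a sketch inferred from the discussion of the add-events counter, not the actual implementation: FakeSamplingThread, its State enum, and the method names are hypothetical.

```cpp
#include <cassert>

// Hypothetical model of the sampling thread's lifetime states.
enum class State { NOT_STARTED, RUNNING, EXITING };

class FakeSamplingThread {
 public:
  // Models adding a collection: the thread is (re)started and the add-event
  // counter advances.
  void Add() {
    state_ = State::RUNNING;
    ++add_events_;
  }

  int add_events() const { return add_events_; }
  State state() const { return state_; }

  // Models a shutdown task that captured |add_events_at_post| when it was
  // posted. If any Add() intervened, the shutdown is stale and is abandoned.
  // Returns true if the shutdown proceeded.
  bool ShutdownTask(int add_events_at_post) {
    if (add_events_at_post != add_events_)
      return false;
    state_ = State::EXITING;
    return true;
  }

 private:
  State state_ = State::NOT_STARTED;
  int add_events_ = 0;
};
```

A test in this style posts the shutdown with a deliberately stale counter value to simulate the intervening start, checks that the state is still RUNNING, and then checks that a shutdown with the current value does move the state to EXITING.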
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
Seems like we're converging on the tests covering the lifetime behaviors. Can you also provide tests for correct behavior when - concurrently profiling two different threads, and - concurrently profiling FROM two different threads https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa... base/profiler/stack_sampling_profiler_unittest.cc:940: TEST(StackSamplingProfilerTest, MAYBE_ConcurrentProfiling_InSync) { > > > > > RemoveCollectionTask: > > > > > - found: tested every stop before completion > > > > > > > > We can't rely on the item being found for the for the same > > > > configuration/load-dependent reasons as above. > > > > > > There are dozens of stop calls before completion, some with huge timeouts. > > > They're not all going to miss. > > > > None of these tests are intending to reliably exercise this behavior and could > > easily be changed in the future such that they no longer test the behavior. > The > > use of Stop in the destructor is an implementation detail and shouldn't be > > relied upon when testing this behavior. > > > > Please write a dedicated test. All that's required is starting a profiler with > a > > very long sampling interval, waiting for one sample to be collected using the > > test delegate, and stopping the profiler. > > StopSafely > > Thanks for being specific. Done. Please split this out into a dedicated test. It's not easy to even determine that this behavior is being tested from reading StopSafely because of the complexity required for the not found case of PerformCollectionTask. https://codereview.chromium.org/2554123002/diff/980001/base/profiler/stack_sa... File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/980001/base/profiler/stack_sa... 
base/profiler/stack_sampling_profiler_unittest.cc:1011: while (StackSamplingProfiler::TestAPI::IsSamplingThreadRunning())
On 2017/03/27 17:52:43, bcwhite wrote:
> On 2017/03/23 22:18:31, Mike Wittman wrote:
> > Since the previous test has already established that the thread will eventually
> > exit, there's no need to reverify it here. The important behavior under test
> > here is simply that profiling can take place after idle shutdown has run.
>
> It's not enough to know that it will exit. It must have actually exited before
> CaptureProfiles() below in order to know that it restarted and didn't just get
> the idle-shutdown cancelled.

That's true. The explicit event for shutdown run rather than waiting on the thread to exit is a clearer indication of what's happening regardless.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:282: DCHECK(sampler);
No need to DCHECK this. We can assume the singleton operates correctly. Applies two places below also.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:288: DCHECK(sampler->active_collections_.empty());
CHECK

No reason to use DCHECK in test API code. Applies to DCHECKs in ShutdownAssumingIdle as well.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:763: if (collection_id_ != NULL_COLLECTION_ID) {
Why do we need this conditional?

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.h (right):

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.h:200: // still happens asynchronously.
Watch IsSamplingThreadRunningForTesting()
IsSamplingThreadRunningForTesting => IsSamplingThreadRunning

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.h:204: static void PerformSamplingThreadIdleShutdown(bool simulate_start);
Can we call this simulate_intervening_start, to make it clear that the shutdown is not doing some new start-like activity.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:385: if (delegates) {
shorter:
profilers->push_back(MakeUnique<StackSamplingProfiler>(target_thread_id, params[i], callback, delegates ? (*delegates)[i].get() : nullptr));

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:387: target_thread_id, params[i], callback, delegates->at(i).get()));
(*delegates)[i].get()

vector<>::at() is no different than operator[]() since Chrome builds without exceptions, and is more confusing to read for the same reason.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:646: StackSamplingProfiler::TestAPI::Reset();
Why do only some of the tests use Reset()?

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:850: subtle::AtomicWord count_ = 0;
This should use locks rather than atomic ops. From the atomicops.h file header:
"If you plan to use these routines, you should have a good reason, such as solid evidence that performance would otherwise suffer, or there being no alternative."

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:869: std::vector<std::unique_ptr<SampleRecordedCounter>> samples_recorded;
std::vector<std::unique_ptr<NativeStackSamplerTestDelegate>>
and access with
static_cast<SampleRecordedCounter*>(samples_recorded[0].get())

reinterpret_cast across two levels of template instantiation is highly unsafe and dependent on multiple layers of undefined behavior.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:884: // Wait for both to start accumulating samples.
It seems like using WaitableEvents in the test delegate would be more appropriate here than sleeping until the relevant conditions are met.

However, I considered what it would take to do this and I think it results in a more complicated solution. There's inherently a race between the task posted by the Stop call and the next PerformCollectionTask on profiler 0, so it's possible that either zero or one collections could take place on that profiler after Stop() returns. Thus one can't know how many times to wait for collection on profiler 0 before it stops.

Can you add a comment with this information so that future readers of this test know why this seemingly appropriate solution doesn't work well here?

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1065: // will be 10ms (delay) + 10x1ms (sampling) + 1/2 timer minimum interval.
This comment is no longer relevant and can be removed.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1066: params[0].initial_delay = TimeDelta::FromDays(1);
AVeryLongTimeDelta() is the established way to say "effectively infinite time" in these tests. Applies one other place below too.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1067: params[0].sampling_interval = TimeDelta::FromMilliseconds(1);
This parameter can be removed.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1070: params[1].initial_delay = TimeDelta::FromMilliseconds(0);
This line can be removed since this is the default initial delay.

Same comment applies four other places below.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1081: profilers[1]->Start();
We should wait on the second profiler to finish and check that it got the right data, to validate that the Start() call succeeded.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1082: EXPECT_FALSE(sampling_completed[0]->IsSignaled());
This should be removed since what it's checking is unrelated to the behavior under test.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1162: std::vector<SamplingParams> params(1);
There's no need for vectors here since there's only one profiler.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1263: // Though the shutdown-task has been executed, any actual exit of the
Since we can't reliably validate that the thread was not stopped, I think the best way to check the behavior is to collect another profile using a second profiler and validate that it works as expected. This is the expected user-observable behavior anyway.
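
For readers following the comment on line 869, here is a minimal standalone sketch of the suggested pattern: store the delegates as unique_ptrs to the base class and static_cast to the derived type at the point of use. TestDelegate and CountingDelegate are hypothetical stand-ins for NativeStackSamplerTestDelegate and SampleRecordedCounter, and MakeDelegates and CountAt are illustrative helpers; none of this is taken from the patch.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for NativeStackSamplerTestDelegate (assumption).
class TestDelegate {
 public:
  virtual ~TestDelegate() = default;
  virtual void OnPreStackWalk() = 0;
};

// Hypothetical stand-in for SampleRecordedCounter (assumption).
class CountingDelegate : public TestDelegate {
 public:
  void OnPreStackWalk() override { ++count_; }
  std::size_t Get() const { return count_; }

 private:
  std::size_t count_ = 0;
};

// The vector holds base-class pointers, so it can be handed to any helper
// that expects delegates without casting the container itself.
std::vector<std::unique_ptr<TestDelegate>> MakeDelegates(std::size_t n) {
  std::vector<std::unique_ptr<TestDelegate>> delegates;
  for (std::size_t i = 0; i < n; ++i)
    delegates.push_back(std::make_unique<CountingDelegate>());
  return delegates;
}

// Cast on use: a static_cast from base pointer to derived pointer is valid
// here because every element was created as a CountingDelegate.
std::size_t CountAt(const std::vector<std::unique_ptr<TestDelegate>>& delegates,
                    std::size_t i) {
  return static_cast<const CountingDelegate*>(delegates[i].get())->Get();
}
```

The cast is repeated at each use, but it stays within what the type system can check, unlike reinterpreting the whole container.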
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:940: TEST(StackSamplingProfilerTest, MAYBE_ConcurrentProfiling_InSync) {
On 2017/03/28 19:32:01, Mike Wittman wrote:
> > > > > > RemoveCollectionTask:
> > > > > > - found: tested every stop before completion
> > > > >
> > > > > We can't rely on the item being found for the same
> > > > > configuration/load-dependent reasons as above.
> > > >
> > > > There are dozens of stop calls before completion, some with huge timeouts.
> > > > They're not all going to miss.
> > >
> > > None of these tests are intending to reliably exercise this behavior and could
> > > easily be changed in the future such that they no longer test the behavior. The
> > > use of Stop in the destructor is an implementation detail and shouldn't be
> > > relied upon when testing this behavior.
> > >
> > > Please write a dedicated test. All that's required is starting a profiler with a
> > > very long sampling interval, waiting for one sample to be collected using the
> > > test delegate, and stopping the profiler.
> >
> > StopSafely
> >
> > Thanks for being specific. Done.
>
> Please split this out into a dedicated test. It's not easy to even determine
> that this behavior is being tested from reading StopSafely because of the
> complexity required for the not found case of PerformCollectionTask.

That would be pretty much the same as StopDuringInterSampleInterval but using a test delegate rather than timers. That's what you want?

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:282: DCHECK(sampler);
On 2017/03/28 19:32:02, Mike Wittman wrote:
> No need to DCHECK this. We can assume the singleton operates correctly. Applies
> two places below also.

Done.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:288: DCHECK(sampler->active_collections_.empty());
On 2017/03/28 19:32:02, Mike Wittman wrote:
> CHECK
>
> No reason to use DCHECK in test API code. Applies to DCHECKs in
> ShutdownAssumingIdle as well.

Calling this with active samples would indicate bad test code. I'd rather flag it explicitly than lose time debugging a test that is just getting bad results.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:763: if (collection_id_ != NULL_COLLECTION_ID) {
On 2017/03/28 19:32:02, Mike Wittman wrote:
> Why do we need this conditional?

Because you wanted to ensure that Remove() wasn't called when the sampling thread had "not started". I'll remove the DCHECK from there instead.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.h (right):

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.h:200: // still happens asynchronously. Watch IsSamplingThreadRunningForTesting()
On 2017/03/28 19:32:02, Mike Wittman wrote:
> IsSamplingThreadRunningForTesting => IsSamplingThreadRunning

Done.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.h:204: static void PerformSamplingThreadIdleShutdown(bool simulate_start);
On 2017/03/28 19:32:02, Mike Wittman wrote:
> Can we call this simulate_intervening_start, to make it clear that the shutdown
> is not doing some new start-like activity.
Adding "intervening" wouldn't indicate anything about whether it is or is not doing a new start-like activity. "Simulate" and the comment, on the other hand, make it pretty clear.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:385: if (delegates) {
On 2017/03/28 19:32:12, Mike Wittman wrote:
> shorter:
> profilers->push_back(MakeUnique<StackSamplingProfiler>(target_thread_id,
> params[i], callback, delegates ? (*delegates)[i].get() : nullptr));

Done.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:387: target_thread_id, params[i], callback, delegates->at(i).get()));
On 2017/03/28 19:32:02, Mike Wittman wrote:
> (*delegates)[i].get()
>
> vector<>::at() is no different than operator[]() since Chrome builds without
> exceptions, and is more confusing to read for the same reason.

at() is required when working with const vectors (though I had left that out from the parameter definition).

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:646: StackSamplingProfiler::TestAPI::Reset();
On 2017/03/28 19:32:07, Mike Wittman wrote:
> Why do only some of the tests use Reset()?

It's only necessary for tests that deal with the startup of the sampling thread. I'd prefer it on all but figured you'd claim it was not required. Added everywhere.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:850: subtle::AtomicWord count_ = 0;
On 2017/03/28 19:32:02, Mike Wittman wrote:
> This should use locks rather than atomic ops.
> From the atomicops.h file header:
> "If you plan to use these routines, you should have a good reason, such as solid
> evidence that performance would otherwise suffer, or there being no
> alternative."

I'm relatively well versed in the subtleties of atomics, thanks. But fine, we'll do it your way.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:869: std::vector<std::unique_ptr<SampleRecordedCounter>> samples_recorded;
On 2017/03/28 19:32:02, Mike Wittman wrote:
> std::vector<std::unique_ptr<NativeStackSamplerTestDelegate>>
> and access with
> static_cast<SampleRecordedCounter*>(samples_recorded[0].get())
>
> reinterpret_cast across two levels of template instantiation is highly unsafe
> and dependent on multiple layers of undefined behavior.

I tried many different ways; this was the only one that would fully compile. unique_ptr will down-cast automatically but not a vector of them even though they're compatible. Making it a native vector of the base class would mean up-casting it with every use in this method.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:884: // Wait for both to start accumulating samples.
On 2017/03/28 19:32:06, Mike Wittman wrote:
> It seems like using WaitableEvents in the test delegate would be more
> appropriate here than sleeping until the relevant conditions are met.
>
> However, I considered what it would take to do this and I think it results in a
> more complicated solution. There's inherently a race between the task posted by
> the Stop call and the next PerformCollectionTask on profiler 0, so it's possible
> that either zero or one collections could take place on that profiler after
> Stop() returns. Thus one can't know how many times to wait for collection on
> profiler 0 before it stops.
>
> Can you add a comment with this information so that future readers of this test
> know why this seemingly appropriate solution doesn't work well here?

Done.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1065: // will be 10ms (delay) + 10x1ms (sampling) + 1/2 timer minimum interval.
On 2017/03/28 19:32:05, Mike Wittman wrote:
> This comment is no longer relevant and can be removed.

Done.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1066: params[0].initial_delay = TimeDelta::FromDays(1);
On 2017/03/28 19:32:09, Mike Wittman wrote:
> AVeryLongTimeDelta() is the established way to say "effectively infinite time"
> in these tests. Applies one other place below too.

Done.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1067: params[0].sampling_interval = TimeDelta::FromMilliseconds(1);
On 2017/03/28 19:32:11, Mike Wittman wrote:
> This parameter can be removed.

Done.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1070: params[1].initial_delay = TimeDelta::FromMilliseconds(0);
On 2017/03/28 19:32:02, Mike Wittman wrote:
> This line can be removed since this is the default initial delay.
>
> Same comment applies four other places below.

Done.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1081: profilers[1]->Start();
On 2017/03/28 19:32:08, Mike Wittman wrote:
> We should wait on the second profiler to finish and check that it got the right
> data, to validate that the Start() call succeeded.

Done.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1082: EXPECT_FALSE(sampling_completed[0]->IsSignaled());
On 2017/03/28 19:32:08, Mike Wittman wrote:
> This should be removed since what it's checking is unrelated to the behavior
> under test.

Done.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1162: std::vector<SamplingParams> params(1);
On 2017/03/28 19:32:04, Mike Wittman wrote:
> There's no need for vectors here since there's only one profiler.

CreateProfilers() takes a vector so this allows code-reuse.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1263: // Though the shutdown-task has been executed, any actual exit of the
On 2017/03/28 19:32:02, Mike Wittman wrote:
> Since we can't reliably validate that the thread was not stopped, I think the
> best way to check the behavior is to collect another profile using a second
> profiler and validate that it works as expected. This is the expected
> user-observable behavior anyway.

If the sampling thread did stop (when it shouldn't) then starting a new collection would quietly restart it. Thus, the test would always pass.
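
As a rough illustration of the lock-based counter agreed on for line 850, the sketch below guards a plain integer with a mutex instead of using atomic operations. It is standalone C++ that uses std::mutex and std::lock_guard as stand-ins for Chromium's lock types; the class name SampleCounter and its methods are assumptions for illustration, not the code that landed.

```cpp
#include <cstddef>
#include <mutex>

// Hypothetical counter that a test delegate could bump once per sample.
// Every access takes the lock, so no memory-ordering reasoning is needed.
class SampleCounter {
 public:
  // Called from the sampling thread (assumed) each time a sample is taken.
  void Increment() {
    std::lock_guard<std::mutex> guard(lock_);
    ++count_;
  }

  // Called from the test thread to read the current value.
  std::size_t Get() const {
    std::lock_guard<std::mutex> guard(lock_);
    return count_;
  }

 private:
  mutable std::mutex lock_;  // mutable so that Get() can remain const.
  std::size_t count_ = 0;
};
```

The lock costs slightly more than an atomic increment, which is unlikely to matter in a unit test, and it keeps the code readable for people unfamiliar with atomics.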
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: linux_chromium_chromeos_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_...)
https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:940: TEST(StackSamplingProfilerTest, MAYBE_ConcurrentProfiling_InSync) {
On 2017/03/29 14:56:57, bcwhite wrote:
> On 2017/03/28 19:32:01, Mike Wittman wrote:
> > > > > > > RemoveCollectionTask:
> > > > > > > - found: tested every stop before completion
> > > > > >
> > > > > > We can't rely on the item being found for the same
> > > > > > configuration/load-dependent reasons as above.
> > > > >
> > > > > There are dozens of stop calls before completion, some with huge timeouts.
> > > > > They're not all going to miss.
> > > >
> > > > None of these tests are intending to reliably exercise this behavior and could
> > > > easily be changed in the future such that they no longer test the behavior. The
> > > > use of Stop in the destructor is an implementation detail and shouldn't be
> > > > relied upon when testing this behavior.
> > > >
> > > > Please write a dedicated test. All that's required is starting a profiler with a
> > > > very long sampling interval, waiting for one sample to be collected using the
> > > > test delegate, and stopping the profiler.
> > >
> > > StopSafely
> > >
> > > Thanks for being specific. Done.
> >
> > Please split this out into a dedicated test. It's not easy to even determine
> > that this behavior is being tested from reading StopSafely because of the
> > complexity required for the not found case of PerformCollectionTask.
>
> That would be pretty much the same as StopDuringInterSampleInterval but using a
> test delegate rather than timers. That's what you want?

Yes.
We might as well replace StopDuringInterSampleInterval's implementation with that one, along with a comment specifying the additional behavior being tested, since that's one of the existing tests that is racy.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:288: DCHECK(sampler->active_collections_.empty());
On 2017/03/29 14:56:57, bcwhite wrote:
> On 2017/03/28 19:32:02, Mike Wittman wrote:
> > CHECK
> >
> > No reason to use DCHECK in test API code. Applies to DCHECKs in
> > ShutdownAssumingIdle as well.
>
> Calling this with active samples would indicate bad test code. I'd rather flag
> it explicitly than lose time debugging a test that is just getting bad
> results.

My comment was intended to convey that this should be CHECK rather than DCHECK.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:763: if (collection_id_ != NULL_COLLECTION_ID) {
On 2017/03/29 14:56:57, bcwhite wrote:
> On 2017/03/28 19:32:02, Mike Wittman wrote:
> > Why do we need this conditional?
>
> Because you wanted to ensure that Remove() wasn't called when the sampling
> thread had "not started". I'll remove the DCHECK from there instead.

What was in the previous change that caused you to add this when it wasn't needed before? Is this to support resetting the thread?

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.h (right):

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.h:204: static void PerformSamplingThreadIdleShutdown(bool simulate_start);
On 2017/03/29 14:56:58, bcwhite wrote:
> On 2017/03/28 19:32:02, Mike Wittman wrote:
> > Can we call this simulate_intervening_start, to make it clear that the shutdown
> > is not doing some new start-like activity.
>
> Adding "intervening" wouldn't indicate anything about whether it is or is not
> doing a new start-like activity. "Simulate" and the comment, on the other hand,
> make it pretty clear.

It would indicate the temporal relationship of the start event to the idle shutdown, which is not possible to infer directly from the function signature without reading the comment. It would be very easy to interpret this signature as meaning that the function is simulating a start of the thread after idle shutdown.

The behavior around timing of the starts/adds with respect to the shutdown is the most tricky part of this entire implementation to understand, so I think it's important to be abundantly clear in the interfaces.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:387: target_thread_id, params[i], callback, delegates->at(i).get()));
On 2017/03/29 14:56:59, bcwhite wrote:
> On 2017/03/28 19:32:02, Mike Wittman wrote:
> > (*delegates)[i].get()
> >
> > vector<>::at() is no different than operator[]() since Chrome builds without
> > exceptions, and is more confusing to read for the same reason.
>
> at() is required when working with const vectors (though I had left that out
> from the parameter definition).

The const change is good. at() is not required. vector<>::operator[] has a const overload.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:646: StackSamplingProfiler::TestAPI::Reset();
On 2017/03/29 14:56:58, bcwhite wrote:
> On 2017/03/28 19:32:07, Mike Wittman wrote:
> > Why do only some of the tests use Reset()?
>
> It's only necessary for tests that deal with the startup of the sampling thread.
> I'd prefer it on all but figured you'd claim it was not required. Added
> everywhere.

If we need to clean up state to make the test runs hermetic, which it sounds like we do, then ensuring every test executes with the expected state is the appropriate thing to do. It prevents mysterious failures when people add tests in the future.

The expected way to accomplish this is to use a test fixture and do the cleanup in the TearDown method. The relevant principle is that tests should clean up after themselves.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:850: subtle::AtomicWord count_ = 0;
On 2017/03/29 14:56:58, bcwhite wrote:
> On 2017/03/28 19:32:02, Mike Wittman wrote:
> > This should use locks rather than atomic ops. From the atomicops.h file header:
> > "If you plan to use these routines, you should have a good reason, such as solid
> > evidence that performance would otherwise suffer, or there being no
> > alternative."
>
> I'm relatively well versed in the subtleties of atomics, thanks. But fine,
> we'll do it your way.

You may be but readers generally won't, and we optimize for code readability, not writeability. This is not my way, it's the accepted guidance for using this functionality in both Chrome and google3, and has been for 8 years.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:869: std::vector<std::unique_ptr<SampleRecordedCounter>> samples_recorded;
On 2017/03/29 14:56:58, bcwhite wrote:
> On 2017/03/28 19:32:02, Mike Wittman wrote:
> > std::vector<std::unique_ptr<NativeStackSamplerTestDelegate>>
> > and access with
> > static_cast<SampleRecordedCounter*>(samples_recorded[0].get())
> >
> > reinterpret_cast across two levels of template instantiation is highly unsafe
> > and dependent on multiple layers of undefined behavior.
>
> I tried many different ways; this was the only one that would fully compile.
> unique_ptr will down-cast automatically but not a vector of them even though
> they're compatible. Making it a native vector of the base class would mean
> up-casting it with every use in this method.

Casting to SampleRecordedCounter* on use is what's required. Using reinterpret_cast in this way is not typesafe or valid C++, and only functions at all because std::unique_ptr<NativeStackSamplerTestDelegate> and std::unique_ptr<SampleRecordedCounter>, and std::vector<std::unique_ptr<NativeStackSamplerTestDelegate>> and std::vector<std::unique_ptr<SampleRecordedCounter>>, happen to compile to the same binary representations.

A reasonable rule of thumb for casting in Chrome is: use static_cast where possible, and reinterpret_cast between pointers to POD types*. Any other case is probably not type safe.

* with caveats: see bit_cast.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1162: std::vector<SamplingParams> params(1);
On 2017/03/29 14:56:58, bcwhite wrote:
> On 2017/03/28 19:32:04, Mike Wittman wrote:
> > There's no need for vectors here since there's only one profiler.
>
> CreateProfilers() takes a vector so this allows code-reuse.

Yes, but we already have several instances where scalar profilers are created directly within tests, so this would be the one inconsistency in the code.
https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1263: // Though the shutdown-task has been executed, any actual exit of the
On 2017/03/29 14:56:58, bcwhite wrote:
> On 2017/03/28 19:32:02, Mike Wittman wrote:
> > Since we can't reliably validate that the thread was not stopped, I think the
> > best way to check the behavior is to collect another profile using a second
> > profiler and validate that it works as expected. This is the expected
> > user-observable behavior anyway.
>
> If the sampling thread did stop (when it shouldn't) then starting a new
> collection would quietly restart it. Thus, the test would always pass.

It's still worth checking this to prevent future regressions in user-observable behavior on this code path.

https://codereview.chromium.org/2554123002/diff/1020001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/1020001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1611: ProfilerThread profiler2("profiler2", target_thread_id, params2);
How about profiler_thread1 and profiler_thread2 to make it clear these are ProfilerThread objects and not StackSamplingProfiler objects?
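
To make the point about line 387 concrete, the following standalone sketch indexes a vector through a pointer-to-const with operator[], which compiles because std::vector provides a const overload. The Delegate type and the DelegateAt helper are hypothetical and only mirror the shape of the expression discussed above; the optional-pointer parameter is an assumption about how the helper receives its delegates.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical delegate type used only for this illustration.
struct Delegate {
  explicit Delegate(int id) : id(id) {}
  int id;
};

// `delegates` may be null (assumed to mean "no delegates supplied").
// (*delegates)[i] resolves to the const overload of operator[], so at()
// is not needed to read from a const vector.
const Delegate* DelegateAt(
    const std::vector<std::unique_ptr<Delegate>>* delegates,
    std::size_t i) {
  return delegates ? (*delegates)[i].get() : nullptr;
}
```

Note the parentheses: `*delegates[i]` would index the pointer itself before dereferencing, which is not the intended element access.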
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/920001/base/profiler/stack_sa...
base/profiler/stack_sampling_profiler_unittest.cc:940: TEST(StackSamplingProfilerTest, MAYBE_ConcurrentProfiling_InSync) {
On 2017/03/30 16:18:38, Mike Wittman wrote:
> On 2017/03/29 14:56:57, bcwhite wrote:
> > On 2017/03/28 19:32:01, Mike Wittman wrote:
> > > > > > > > RemoveCollectionTask:
> > > > > > > > - found: tested every stop before completion
> > > > > > >
> > > > > > > We can't rely on the item being found for the same
> > > > > > > configuration/load-dependent reasons as above.
> > > > > >
> > > > > > There are dozens of stop calls before completion, some with huge timeouts.
> > > > > > They're not all going to miss.
> > > > >
> > > > > None of these tests are intending to reliably exercise this behavior and could
> > > > > easily be changed in the future such that they no longer test the behavior. The
> > > > > use of Stop in the destructor is an implementation detail and shouldn't be
> > > > > relied upon when testing this behavior.
> > > > >
> > > > > Please write a dedicated test. All that's required is starting a profiler with a
> > > > > very long sampling interval, waiting for one sample to be collected using the
> > > > > test delegate, and stopping the profiler.
> > > >
> > > > StopSafely
> > > >
> > > > Thanks for being specific. Done.
> > >
> > > Please split this out into a dedicated test. It's not easy to even determine
> > > that this behavior is being tested from reading StopSafely because of the
> > > complexity required for the not found case of PerformCollectionTask.
> >
> > That would be pretty much the same as StopDuringInterSampleInterval but using a
> > test delegate rather than timers. That's what you want?
>
> Yes.
> We might as well replace StopDuringInterSampleInterval's implementation
> with that one, along with a comment specifying the additional behavior being
> tested, since that's one of the existing tests that is racy.

Done.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:288: DCHECK(sampler->active_collections_.empty());
On 2017/03/30 16:18:38, Mike Wittman wrote:
> On 2017/03/29 14:56:57, bcwhite wrote:
> > On 2017/03/28 19:32:02, Mike Wittman wrote:
> > > CHECK
> > >
> > > No reason to use DCHECK in test API code. Applies to DCHECKs in
> > > ShutdownAssumingIdle as well.
> >
> > Calling this with active samples would indicate bad test code. I'd rather flag
> > it explicitly than lose time debugging a test that is just getting bad
> > results.
>
> My comment was intended to convey that this should be CHECK rather than DCHECK.

Does the compiler completely remove this as dead code when it's being shipped? If not, it would be better to not include the statements in release builds for simple code-size reasons.

https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:763: if (collection_id_ != NULL_COLLECTION_ID) {
On 2017/03/30 16:18:38, Mike Wittman wrote:
> On 2017/03/29 14:56:57, bcwhite wrote:
> > On 2017/03/28 19:32:02, Mike Wittman wrote:
> > > Why do we need this conditional?
> >
> > Because you wanted to ensure that Remove() wasn't called when the sampling
> > thread had "not started". I'll remove the DCHECK from there instead.
>
> What was in the previous change that caused you to add this when it wasn't
> needed before? Is this to support resetting the thread?

StopWithoutStarting.
If the object is created but never started then the dtor will call Stop() which will call Remove() when the thread is NOT_STARTED. https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s... File base/profiler/stack_sampling_profiler.h (right): https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s... base/profiler/stack_sampling_profiler.h:204: static void PerformSamplingThreadIdleShutdown(bool simulate_start); On 2017/03/30 16:18:38, Mike Wittman wrote: > On 2017/03/29 14:56:58, bcwhite wrote: > > On 2017/03/28 19:32:02, Mike Wittman wrote: > > > Can we call this simulate_intervening_start, to make it clear that the > > shutdown > > > is not doing some new start-like activity. > > > > Adding "intervening" wouldn't indicate anything about whether it is or is not > > doing a new start-like activity. "Simulate" and the comment, on the other > hand, > > make it pretty clear. > > It would indicate the temporal relationship of the start event to the idle > shutdown, which is not possible to infer directly from the function signature > without reading the comment. It would be very easy to interpret this signature > as meaning that the function is simulating a start of the thread after idle > shutdown. > > The behavior around timing of the starts/adds with respect to the shutdown is > the most tricky part of this entire implementation to understand, so I think > it's important to be abundantly clear in the interfaces. Done. https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s... File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s... 
base/profiler/stack_sampling_profiler_unittest.cc:387: target_thread_id, params[i], callback, delegates->at(i).get())); On 2017/03/30 16:18:38, Mike Wittman wrote: > On 2017/03/29 14:56:59, bcwhite wrote: > > On 2017/03/28 19:32:02, Mike Wittman wrote: > > > (*delegates)[i].get() > > > > > > vector<>::at() is no different than operator[]() since Chrome builds without > > > exceptions, and is more confusing to read for the same reason. > > > > at() is required when working with const vectors (though I had left that out > > from the parameter definition). > > The const change is good. at() is not required. vector<>::operator[] has a const > overload. Right. It's maps that don't allow [] with const. https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s... base/profiler/stack_sampling_profiler_unittest.cc:646: StackSamplingProfiler::TestAPI::Reset(); On 2017/03/30 16:18:38, Mike Wittman wrote: > On 2017/03/29 14:56:58, bcwhite wrote: > > On 2017/03/28 19:32:07, Mike Wittman wrote: > > > Why do only some of the tests use Reset()? > > > > It's only necessary for tests that deal with the startup of the sampling > thread. > > I'd prefer it on all but figured you'd claim it was not required. Added > > everywhere. > > If we need to clean up state to make the test runs hermetic, which it sounds > like we do, then ensuring every test executes with the expected state is the > appropriate thing to do. It prevents mysterious failures when people add tests > in the future. > > The expected way to accomplish this is to use a test fixture and do the cleanup > in the TearDown method. The relevant principle is that tests should clean up > after themselves. Done. https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s... 
base/profiler/stack_sampling_profiler_unittest.cc:869: std::vector<std::unique_ptr<SampleRecordedCounter>> samples_recorded; On 2017/03/30 16:18:38, Mike Wittman wrote: > On 2017/03/29 14:56:58, bcwhite wrote: > > On 2017/03/28 19:32:02, Mike Wittman wrote: > > > std::vector<std::unique_ptr<NativeStackSamplerTestDelegate>> > > > and access with > > > static_cast<SampleRecordedCounter*>(samples_recorded[0].get()) > > > > > > reinterpret_cast across two levels of template instantiation is highly > unsafe > > > and dependent on multiple layers of undefined behavior. > > > > I tried many different ways this was the only one that would fully compile. > > unique_ptr will down-cast automatically but not a vector of them even though > > they're compatible. Making it a native vector of the base class would mean > > up-casting it with every use in this method. > > Casting to SampleRecordedCounter* on use is what's required. Using > reinterpret_cast in this way is not typesafe or valid C++, and only functions at > all because std::unique_ptr<NativeStackSamplerTestDelegate> and > std::unique_ptr<SampleRecordedCounter>, and > std::vector<std::unique_ptr<NativeStackSamplerTestDelegate>> > std::vector<std::unique_ptr<SampleRecordedCounter>> happen to compile to the > same binary representations. > > A reasonable rule of thumb for casting in Chrome is: use static_cast where > possible, and reinterpret_cast between pointers to POD types*. Any other case is > probably not type safe. > > * with caveats: see bit_cast. Done. https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s... base/profiler/stack_sampling_profiler_unittest.cc:1162: std::vector<SamplingParams> params(1); On 2017/03/30 16:18:38, Mike Wittman wrote: > On 2017/03/29 14:56:58, bcwhite wrote: > > On 2017/03/28 19:32:04, Mike Wittman wrote: > > > There's no need for vectors here since there's only one profiler. > > > > CreateProfilers() takes a vector so this allows code-reuse. 
> > Yes, but we already have several instances where scalar profilers are created > directly within tests, so this would be the one inconsistency in the code. Usually where it's being passed to CaptureProfiles, which does all the heavy-lifting of creating the profiler. Here I'm using CreateProfilers to do the heavy-lifting and it takes vectors. It's much more difficult to make mistakes by re-using CreateProfilers even if it makes a bit more boiler-plate code in the test. https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s... base/profiler/stack_sampling_profiler_unittest.cc:1263: // Though the shutdown-task has been executed, any actual exit of the On 2017/03/30 16:18:38, Mike Wittman wrote: > On 2017/03/29 14:56:58, bcwhite wrote: > > On 2017/03/28 19:32:02, Mike Wittman wrote: > > > Since we can't reliably validate that the thread was not stopped, I think > the > > > best way to check the behavior is to collect another profile using a second > > > profiler and validate that it works as expected. This is the expected > > > user-observable behavior anyway. > > > > If the sampling thread did stop (when it shouldn't) then starting a new > > collection would quietly restart it. Thus, the test would always pass. > > It's still worth checking this to prevent future regressions in user-observable > behavior on this code path. That is either WillRestartSamplerAfterIdleShutdown or CanRunMultipleTimes. https://codereview.chromium.org/2554123002/diff/1020001/base/profiler/stack_s... File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/1020001/base/profiler/stack_s... base/profiler/stack_sampling_profiler_unittest.cc:1611: ProfilerThread profiler2("profiler2", target_thread_id, params2); On 2017/03/30 16:18:38, Mike Wittman wrote: > How about profiler_thread1 and profiler_thread2 to make it clear these are > ProfilerThread objects and not StackSamplingProfiler objects? Done.
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: ios-device on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-device/builds...) ios-device-xcode-clang on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-device-xcode-...) ios-simulator-xcode-clang on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-simulator-xco...) mac_chromium_compile_dbg_ng on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_comp...)
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: ios-device on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-device/builds...) ios-device-xcode-clang on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-device-xcode-...)
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
With the changes below this should be looking pretty good. I'm going to make another pass for anything I missed. Beyond that, I think it would be good to get a quick review from gab@ in case anything has changed with the Thread API in the meantime. And you'll need to ping brettw@ for the thread_restrictions.h change. https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s... base/profiler/stack_sampling_profiler.cc:288: DCHECK(sampler->active_collections_.empty()); On 2017/03/30 18:54:51, bcwhite wrote: > On 2017/03/30 16:18:38, Mike Wittman wrote: > > On 2017/03/29 14:56:57, bcwhite wrote: > > > On 2017/03/28 19:32:02, Mike Wittman wrote: > > > > CHECK > > > > > > > > No reason to use DCHECK in test API code. Applies to DCHECKs in > > > > ShutdownAssumingIdle as well. > > > > > > Calling this with active samples would indicate bad test code. I'd rather > > flag > > > it explicitly rather than lose time debugging a test that is just getting > bad > > > results. > > > > My comment was intended to convey that this should be CHECK rather than > DCHECK. > > Does the compiler completely remove this as dead code when it's being shipped? > If not, it would be better to not include the statements in release builds for > simple code-size reasons. Yes, the linkers on various platforms do dead code elimination, so none of the TestAPI code should be present in a release binary. https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s... base/profiler/stack_sampling_profiler.cc:763: if (collection_id_ != NULL_COLLECTION_ID) { On 2017/03/30 18:54:51, bcwhite wrote: > On 2017/03/30 16:18:38, Mike Wittman wrote: > > On 2017/03/29 14:56:57, bcwhite wrote: > > > On 2017/03/28 19:32:02, Mike Wittman wrote: > > > > Why do we need this conditional? 
> > > > > > Because you wanted to ensure that Remove() wasn't called when the sampling > > > thread had "not started". I'll remove the DCHECK from there instead. > > > > What was in the previous change that caused you to add this when it wasn't > > needed before? Is this to support resetting the thread? > > StopWithoutStarting. If the object is created but never started then the dtor > will call Stop() which will call Remove() when the thread is NOT_STARTED. Removing the DCHECK sounds like a good solution. https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s... File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s... base/profiler/stack_sampling_profiler_unittest.cc:1162: std::vector<SamplingParams> params(1); On 2017/03/30 18:54:51, bcwhite wrote: > On 2017/03/30 16:18:38, Mike Wittman wrote: > > On 2017/03/29 14:56:58, bcwhite wrote: > > > On 2017/03/28 19:32:04, Mike Wittman wrote: > > > > There's no need for vectors here since there's only one profiler. > > > > > > CreateProfilers() takes a vector so this allows code-reuse. > > > > Yes, but we already have several instances where scalar profilers are created > > directly within tests, so this would be the one inconsistency in the code. > > Usually where it's being passed to CaptureProfiles, which does all the > heavy-lifting of creating the profiler. Here I'm using CreateProfilers to do > the heavy-lifting and it takes vectors. It's much more difficult to make > mistakes by re-using CreateProfilers even if it makes a bit more boiler-plate > code in the test. OK, sounds reasonable. https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s... 
base/profiler/stack_sampling_profiler_unittest.cc:1263: // Though the shutdown-task has been executed, any actual exit of the On 2017/03/30 18:54:51, bcwhite wrote: > On 2017/03/30 16:18:38, Mike Wittman wrote: > > On 2017/03/29 14:56:58, bcwhite wrote: > > > On 2017/03/28 19:32:02, Mike Wittman wrote: > > > > Since we can't reliably validate that the thread was not stopped, I think > > the > > > > best way to check the behavior is to collect another profile using a > second > > > > profiler and validate that it works as expected. This is the expected > > > > user-observable behavior anyway. > > > > > > If the sampling thread did stop (when it shouldn't) then starting a new > > > collection would quietly restart it. Thus, the test would always pass. > > > > It's still worth checking this to prevent future regressions in > user-observable > > behavior on this code path. > > That is either WillRestartSamplerAfterIdleShutdown or CanRunMultipleTimes. Neither of those tests exercise the shutdown abort code path. https://codereview.chromium.org/2554123002/diff/1020001/base/profiler/stack_s... File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/1020001/base/profiler/stack_s... base/profiler/stack_sampling_profiler_unittest.cc:1611: ProfilerThread profiler2("profiler2", target_thread_id, params2); On 2017/03/30 18:54:51, bcwhite wrote: > On 2017/03/30 16:18:38, Mike Wittman wrote: > > How about profiler_thread1 and profiler_thread2 to make it clear these are > > ProfilerThread objects and not StackSamplingProfiler objects? > > Done. I don't see this change in the patch sets. https://codereview.chromium.org/2554123002/diff/1080001/base/profiler/stack_s... File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/1080001/base/profiler/stack_s... 
base/profiler/stack_sampling_profiler_unittest.cc:637: StackSamplingProfiler::ResetAnnotationsForTesting(); Can you move this function into the TestAPI for consistency? https://codereview.chromium.org/2554123002/diff/1080001/base/profiler/stack_s... base/profiler/stack_sampling_profiler_unittest.cc:637: StackSamplingProfiler::ResetAnnotationsForTesting(); Also, this call should go in TearDown. https://codereview.chromium.org/2554123002/diff/1080001/base/profiler/stack_s... base/profiler/stack_sampling_profiler_unittest.cc:638: StackSamplingProfiler::TestAPI::DisableIdleShutdown(); It's worth commenting that we disable idle shutdown because it takes too long to occur to be testable. The behavior is tested in some tests by artificially triggering an idle shutdown. https://codereview.chromium.org/2554123002/diff/1080001/base/profiler/stack_s... base/profiler/stack_sampling_profiler_unittest.cc:1009: while ( This test would be simpler and easier to understand using a WaitableEvent that gets signaled in the test delegate and waited on here, along with two samples per burst and a long sampling interval.
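For readers following the WaitableEvent suggestion above, here is a minimal sketch of the pattern: the test delegate signals an event when a sample is about to be taken, and the test thread blocks on that event instead of polling in a loop. `SimpleEvent` is a simplified standard-library stand-in for base::WaitableEvent, and `SampleRecordedEvent` and its method names are hypothetical. This is not the code from the patch.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Simplified, manual-reset stand-in for base::WaitableEvent, built on the
// standard library so the sketch compiles on its own.
class SimpleEvent {
 public:
  void Signal() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      signaled_ = true;
    }
    cv_.notify_all();
  }

  // Blocks until Signal() has been called at least once.
  void Wait() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return signaled_; });
  }

  bool IsSignaled() {
    std::lock_guard<std::mutex> lock(mutex_);
    return signaled_;
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool signaled_ = false;
};

// Hypothetical test delegate: the sampling side calls OnPreStackWalk() when
// it is about to record a sample, and the test thread blocks in
// WaitForSampleToOccur() rather than spinning on a counter.
class SampleRecordedEvent {
 public:
  void OnPreStackWalk() { sample_recorded_.Signal(); }
  void WaitForSampleToOccur() { sample_recorded_.Wait(); }
  bool SampleOccurred() { return sample_recorded_.IsSignaled(); }

 private:
  SimpleEvent sample_recorded_;
};
```

With a long sampling interval, a test built this way can wait for the first sample and then stop the profiler, knowing the stop lands between samples without depending on timers.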
wittman@chromium.org changed reviewers: + gab@chromium.org
Hi gab, can you review the Thread API usage in this change for consistency with current best practice?
Patchset #40 (id:1060001) has been deleted
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Patchset #41 (id:1090005) has been deleted
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s... base/profiler/stack_sampling_profiler.cc:288: DCHECK(sampler->active_collections_.empty()); On 2017/03/31 01:38:21, Mike Wittman wrote: > On 2017/03/30 18:54:51, bcwhite wrote: > > On 2017/03/30 16:18:38, Mike Wittman wrote: > > > On 2017/03/29 14:56:57, bcwhite wrote: > > > > On 2017/03/28 19:32:02, Mike Wittman wrote: > > > > > CHECK > > > > > > > > > > No reason to use DCHECK in test API code. Applies to DCHECKs in > > > > > ShutdownAssumingIdle as well. > > > > > > > > Calling this with active samples would indicate bad test code. I'd rather > > > flag > > > > it explicitly rather than lose time debugging a test that is just getting > > bad > > > > results. > > > > > > My comment was intended to convey that this should be CHECK rather than > > DCHECK. > > > > Does the compiler completely remove this as dead code when it's being shipped? > > If not, it would be better to not include the statements in release builds for > > simple code-size reasons. > > Yes, the linkers on various platforms do dead code elimination, so none of the > TestAPI code should be present in a release binary. Done. https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s... File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/1000001/base/profiler/stack_s... 
base/profiler/stack_sampling_profiler_unittest.cc:1263: // Though the shutdown-task has been executed, any actual exit of the On 2017/03/31 01:38:21, Mike Wittman wrote: > On 2017/03/30 18:54:51, bcwhite wrote: > > On 2017/03/30 16:18:38, Mike Wittman wrote: > > > On 2017/03/29 14:56:58, bcwhite wrote: > > > > On 2017/03/28 19:32:02, Mike Wittman wrote: > > > > > Since we can't reliably validate that the thread was not stopped, I > think > > > the > > > > > best way to check the behavior is to collect another profile using a > > second > > > > > profiler and validate that it works as expected. This is the expected > > > > > user-observable behavior anyway. > > > > > > > > If the sampling thread did stop (when it shouldn't) then starting a new > > > > collection would quietly restart it. Thus, the test would always pass. > > > > > > It's still worth checking this to prevent future regressions in > > user-observable > > > behavior on this code path. > > > > That is either WillRestartSamplerAfterIdleShutdown or CanRunMultipleTimes. > > Neither of those tests exercise the shutdown abort code path. Done. https://codereview.chromium.org/2554123002/diff/1020001/base/profiler/stack_s... File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/1020001/base/profiler/stack_s... base/profiler/stack_sampling_profiler_unittest.cc:1611: ProfilerThread profiler2("profiler2", target_thread_id, params2); On 2017/03/31 01:38:22, Mike Wittman wrote: > On 2017/03/30 18:54:51, bcwhite wrote: > > On 2017/03/30 16:18:38, Mike Wittman wrote: > > > How about profiler_thread1 and profiler_thread2 to make it clear these are > > > ProfilerThread objects and not StackSamplingProfiler objects? > > > > Done. > > I don't see this change in the patch sets. It's there; #38 and above. https://codereview.chromium.org/2554123002/diff/1080001/base/profiler/stack_s... 
File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/1080001/base/profiler/stack_s... base/profiler/stack_sampling_profiler_unittest.cc:637: StackSamplingProfiler::ResetAnnotationsForTesting(); On 2017/03/31 01:38:22, Mike Wittman wrote: > Can you move this function into the TestAPI for consistency? Done. And it makes sense to call that as part of the more general Reset() behavior. https://codereview.chromium.org/2554123002/diff/1080001/base/profiler/stack_s... base/profiler/stack_sampling_profiler_unittest.cc:637: StackSamplingProfiler::ResetAnnotationsForTesting(); On 2017/03/31 01:38:22, Mike Wittman wrote: > Also, this call should go in TearDown. Reset needs to happen in SetUp because it's possible that other tests (in other files) could have set some values here. But it should be in tear-down, too. https://codereview.chromium.org/2554123002/diff/1080001/base/profiler/stack_s... base/profiler/stack_sampling_profiler_unittest.cc:638: StackSamplingProfiler::TestAPI::DisableIdleShutdown(); On 2017/03/31 01:38:22, Mike Wittman wrote: > It's worth commenting that we disable idle shutdown because it takes too long to > occur to be testable. The behavior is tested in some tests by artificially > triggering an idle shutdown. Done. https://codereview.chromium.org/2554123002/diff/1080001/base/profiler/stack_s... base/profiler/stack_sampling_profiler_unittest.cc:1009: while ( On 2017/03/31 01:38:22, Mike Wittman wrote: > This test would be simpler and easier to understand using a WaitableEvent that > gets signaled in the test delegate and waited on here, along with two samples > per burst and a long sampling interval. Done.
@brettw: stack_sampling_profiler.cc, 738
  // The behavior of sampling a thread that has exited is undefined and could
  // cause Bad Things(tm) to occur. The safety model provided by this class is
  // that an instance of this object is expected to live at least as long as
  // the thread it is sampling. However, because the sampling is performed
  // asynchronously by the SamplingThread, there is no way to guarantee this
  // is true without waiting for it to signal that it has finished.
  //
  // The wait time should, at most, be only as long as it takes to collect one
  // sample (~200us) or none at all if sampling has already completed.
  ThreadRestrictions::ScopedAllowWait allow_wait;
  profiling_inactive_.Wait();
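The safety argument in the snippet above can be illustrated with a small standalone sketch: an object whose work runs on another thread blocks in its destructor until that work has finished, so nothing touches caller-owned state afterwards. `ScopedSampler` is a hypothetical class, and `std::thread::join()` is used here only as a stand-in for waiting on the `profiling_inactive_` event. The real profiler shares one sampling thread and does not join it.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical stand-in for a profiler whose sampling happens on another
// thread and writes into state owned by the caller.
class ScopedSampler {
 public:
  explicit ScopedSampler(std::atomic<int>* samples_taken)
      : samples_taken_(samples_taken), worker_([this] { Run(); }) {}

  // Requests a stop and then blocks until the sampling side has finished.
  // join() plays the role that profiling_inactive_.Wait() plays in the
  // snippet quoted above: once the destructor returns, nothing touches
  // |samples_taken_| again.
  ~ScopedSampler() {
    stop_requested_.store(true);
    worker_.join();
  }

  ScopedSampler(const ScopedSampler&) = delete;
  ScopedSampler& operator=(const ScopedSampler&) = delete;

 private:
  void Run() {
    while (!stop_requested_.load()) {
      samples_taken_->fetch_add(1);
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
  }

  std::atomic<int>* const samples_taken_;
  std::atomic<bool> stop_requested_{false};
  // Declared last so the members above are initialized before the thread runs.
  std::thread worker_;
};
```

The wait in the destructor is bounded by the time needed to finish the sample in progress, which matches the "at most one sample" reasoning given for allowing the wait.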
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
Can you write a better CL description? I don't really know what's going on.
https://codereview.chromium.org/2554123002/diff/1020001/base/profiler/stack_s... File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/1020001/base/profiler/stack_s... base/profiler/stack_sampling_profiler_unittest.cc:1611: ProfilerThread profiler2("profiler2", target_thread_id, params2); On 2017/03/31 13:57:56, bcwhite wrote: > On 2017/03/31 01:38:22, Mike Wittman wrote: > > On 2017/03/30 18:54:51, bcwhite wrote: > > > On 2017/03/30 16:18:38, Mike Wittman wrote: > > > > How about profiler_thread1 and profiler_thread2 to make it clear these are > > > > ProfilerThread objects and not StackSamplingProfiler objects? > > > > > > Done. > > > > I don't see this change in the patch sets. > > It's there; #38 and above. Ah, the thread names changed. My comment was intended to be about the variable names. https://codereview.chromium.org/2554123002/diff/1080001/base/profiler/stack_s... File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/1080001/base/profiler/stack_s... base/profiler/stack_sampling_profiler_unittest.cc:637: StackSamplingProfiler::ResetAnnotationsForTesting(); On 2017/03/31 13:57:56, bcwhite wrote: > On 2017/03/31 01:38:22, Mike Wittman wrote: > > Also, this call should go in TearDown. > > Reset needs to happen in SetUp because it's possible that other tests (in other > files) could have set some values here. But it should be in tear-down, too. That's a very good point that we need to be concerned about other users of the profiler. Unfortunately, I don't think running Reset() in SetUp() is sufficient (or necessary, even) to solve the issue. For the annotations case, non-profiler tests that exercise calls to SetProcessMilestone() won't benefit from this fixture. They'll potentially set the same milestone multiple times, hitting at least one DCHECK. 
We don't see this problem now because none of the milestone setting code appears to be tested outside of browser_tests and interactive_ui_tests which run one test per process. For the sampling thread case, I don't think anything needs to be done at SetUp() to ensure correct behavior. The profiler tests are the only ones in the base_unittests target that exercise the profiler, and should be the only ones in that target that will ever do so, so running Reset() only in TearDown() is sufficient. All the other unit test targets should be fine without doing Reset() -- they'll get the normal thread restart behavior -- but it still would be good practice to mock out the profiler in unit tests of new code using the profiler to isolate that code from the profiler dependency. Given these facts, I think the best course of action is: - only run DisableIdleShutdown() in SetUp() - only run Reset() in TearDown(), and add a comment on the fixture saying that all profiler tests in base_unittests must use the fixture to ensure proper clean-up - in a separate CL, update the milestone handling logic to accept milestones being set multiple times and add a comment about why/how this happens https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s... base/profiler/stack_sampling_profiler.cc:249: // Updates the |next_sample_time| time based on configured parameters. add comment on the meaning of the return value https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s... base/profiler/stack_sampling_profiler.cc:259: // so that multiple threads may make those calls. broaden comment to discuss the general thread execution state under the lock. e.g.: // State maintained about the current execution (or non-execution) of // the thread. This state must always be accessed while holding the // lock. 
A copy of the task-runner is maintained here for use by any // calling thread; this is necessary because Thread's accessor for it is // not itself thread-safe. The lock is also used to order calls to the // Thread API (Start, Stop, StopSoon, & DetachFromSequence) so that // multiple threads may make those calls. https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s... base/profiler/stack_sampling_profiler.cc:275: std::map<int, std::unique_ptr<CollectionContext>> active_collections_; nit: move this declaration above the lock to make totally obvious that this is not protected by the lock. https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s... base/profiler/stack_sampling_profiler.cc:603: // All capturing has completed so finish the collection. Let object expire. nit: clarify the meaning of "Let object expire." https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s... base/profiler/stack_sampling_profiler.cc:613: // eliminating the race. mention what the race is https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s... File base/profiler/stack_sampling_profiler.h (right): https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s... base/profiler/stack_sampling_profiler.h:28: class WaitableEvent; This can be removed since the header is already included. https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s... base/profiler/stack_sampling_profiler.h:183: // Testing support. Add comment: The functions on this API are static because they affect the single sampling thread that is used across all StackSamplingProfilers. https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s... base/profiler/stack_sampling_profiler.h:245: // This will block until the callback has been run. update comment: This will block until the callback has been run _if profiling is taking place_. 
https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s... File base/profiler/stack_sampling_profiler_unittest.cc (right): https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s... base/profiler/stack_sampling_profiler_unittest.cc:364: std::vector<std::unique_ptr<WaitableEvent>>* completed) { There's a very subtle issue with this function in that it hides destruction ordering constraints from the caller: if the profiles or completed vectors were declared after the profilers vector in a test, they would be destroyed first (locals are destroyed in reverse order of declaration), so the profilers could access those objects after they were destroyed but before the profilers were destroyed. This would cause flaky crashes that would be very difficult to track down. We should put these in a struct to enforce proper destruction order: struct ProfilerState { std::unique_ptr<WaitableEvent> completed; CallStackProfiles profiles; std::unique_ptr<StackSamplingProfiler> profiler; }; along with a comment calling out the reason for the ordering (members are destroyed in reverse declaration order, so the profiler is declared last and destroyed first). Then, return a std::vector<ProfilerState> from this function. https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s... base/profiler/stack_sampling_profiler_unittest.cc:903: static_cast<SampleRecordedCounter*>(samples_recorded[0].get())->Get() == I think we should be able to avoid casting entirely by using a template function. Taking the comment on the CreateProfilers function into account: template <typename T> std::vector<ProfilerState> CreateProfilers( PlatformThreadId target_thread_id, const std::vector<SamplingParams>& params, const std::vector<std::unique_ptr<T>>* test_delegates) { // ... existing function ... 
} std::vector<ProfilerState> CreateProfilers( PlatformThreadId target_thread_id, const std::vector<SamplingParams>& params) { return CreateProfilers<NativeStackSamplerTestDelegate>(target_thread_id, params, nullptr); } And convert the non-delegate-users to the second overload. 
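The two casting options discussed above can be compared in a minimal sketch. `Delegate` and `Counter` are hypothetical stand-ins for NativeStackSamplerTestDelegate and SampleRecordedCounter, and `NotifyAll` stands in for a helper such as CreateProfilers that only needs the base interface. None of this is taken from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for the base delegate interface.
class Delegate {
 public:
  virtual ~Delegate() = default;
  virtual void OnPreStackWalk() = 0;
};

// Hypothetical stand-in for a derived delegate that counts samples.
class Counter : public Delegate {
 public:
  void OnPreStackWalk() override { ++count_; }
  std::size_t Get() const { return count_; }

 private:
  std::size_t count_ = 0;
};

// Option 1: the container holds base pointers and each element is down-cast
// on use. static_cast is valid only because the caller knows every element
// really is a Counter. operator[] has a const overload, so at() is not needed.
std::size_t CountViaCast(
    const std::vector<std::unique_ptr<Delegate>>& delegates,
    std::size_t index) {
  return static_cast<const Counter*>(delegates[index].get())->Get();
}

// Option 2: the helper is a template over the concrete delegate type, so the
// caller keeps a vector of the derived type and never needs a cast.
template <typename T>
void NotifyAll(const std::vector<std::unique_ptr<T>>& delegates) {
  for (const auto& delegate : delegates) {
    Delegate* base = delegate.get();  // Implicit derived-to-base conversion.
    base->OnPreStackWalk();
  }
}
```

Neither option reinterprets one container type as another. The only conversions are the implicit derived-to-base pointer conversion and, in option 1, an explicit per-element static_cast.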
https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s... base/profiler/stack_sampling_profiler_unittest.cc:991: void Wait() { sample_recorded_.Wait(); } nit: WaitForSampleToOccur https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s... base/profiler/stack_sampling_profiler_unittest.cc:1154: // Initiate an "idle" shutdown and ensure it happens. Idle-shutdown was Idle-shutdown is disabled in the test fixture ...
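On the destruction-order point raised for CreateProfilers, the rule that matters is that C++ destroys non-static data members in reverse order of declaration. A minimal sketch, using a hypothetical `Tracked` type in place of the real profiler, profiles, and event objects, shows a bundle in which the referencing object is declared last and is therefore destroyed first:

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Records the order in which destructors run.
std::vector<std::string>& DestructionLog() {
  static std::vector<std::string> log;
  return log;
}

// Hypothetical placeholder type that logs its own destruction.
struct Tracked {
  explicit Tracked(std::string tracked_name) : name(std::move(tracked_name)) {}
  ~Tracked() { DestructionLog().push_back(name); }
  std::string name;
};

// Hypothetical bundle modelled on the struct proposed in the review. Members
// are destroyed in reverse declaration order, so the objects being referenced
// are declared first and the object that references them is declared last.
struct ProfilerStateSketch {
  Tracked completed{"completed"};
  Tracked profiles{"profiles"};
  Tracked profiler{"profiler"};
};
```

Bundling the objects this way puts the ordering constraint in one commented place, so an individual test cannot get it wrong by declaring its local vectors in a different order.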
Description was changed from ========== Support parallel captures from the StackSamplingProfiler. BUG=671716 ========== to ========== Support parallel captures from the StackSamplingProfiler. Previously, only one sampling operation could be running and it was generally used to profile the startup of the browser. To make it more useful, it can now run against any thread and multiple profilers can execute in parallel. Sampling will continue until the desired number has been collected, it is manually stopped, or the controlling object gets destructed. The thread under test is expected to live at least as long as the thread controlling the sampling. BUG=671716 ==========
On 2017/03/31 01:46:33, Mike Wittman wrote: > Hi gab, can you review the Thread API usage in this change for consistency with > current best practice? I'll be happy to but can you please describe (ideally in CL description) how threading works in this CL so I can spare grokking 1400 LOCs from scratch :).
Description was changed from ========== Support parallel captures from the StackSamplingProfiler. Previously, only one sampling operation could be running and it was generally used to profile the startup of the browser. To make it more useful, it can now run against any thread and multiple profilers can execute in parallel. Sampling will continue until the desired number has been collected, it is manually stopped, or the controlling object gets destructed. The thread under test is expected to live at least as long as the thread controlling the sampling. BUG=671716 ========== to ========== Support parallel captures from the StackSamplingProfiler. Previously, only one sampling operation could be running and it was generally used to profile the startup of the browser. To make it more useful, it can now run against any thread and multiple profilers can execute in parallel. Sampling will continue until the desired number has been collected, it is manually stopped, or the controlling object gets destructed. The SamplingThread is a singleton base::Thread that is self-managing. - It is started (via GetOrCreateTaskRunnerForAdd) on the calling thread when work arrives. - It stops (via ShutdownTask) on its own thread when it has been idle for 1 minute. - DetachFromSequence is called after both of these to allow for accessing the API from different threads. - thread_execution_state_lock_ is held when doing Thread API calls to ensure that access is sequenced. The thread under test is expected to live at least as long as the thread controlling the sampling. BUG=671716 ==========
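The SamplingThread lifecycle in the updated description (start on demand when work arrives, stop itself after an idle period, one lock ordering the start/stop calls) can be sketched with the standard library alone. This is only an illustration: `LazyWorker`, `Post`, `IsRunning` and `RunLazyWorkerDemo` are hypothetical names, the idle timeout is 20 ms rather than 1 minute, and nothing here uses or mirrors the real base::Thread API.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical stand-in for the CL's SamplingThread. Not Chromium code.
class LazyWorker {
 public:
  explicit LazyWorker(std::chrono::milliseconds idle_timeout)
      : idle_timeout_(idle_timeout) {}

  // Assumes no other thread calls Post() while the destructor runs.
  ~LazyWorker() {
    {
      std::lock_guard<std::mutex> lock(lock_);
      stopping_ = true;
    }
    wake_.notify_all();
    if (thread_.joinable())
      thread_.join();
  }

  // Callable from any thread. Starts the worker thread on demand.
  void Post(std::function<void()> task) {
    std::lock_guard<std::mutex> lock(lock_);
    tasks_.push_back(std::move(task));
    if (running_) {
      wake_.notify_one();
      return;
    }
    // A previous run may have idled out. Its last locked action was clearing
    // running_, so joining it while holding the lock cannot deadlock.
    if (thread_.joinable())
      thread_.join();
    running_ = true;
    thread_ = std::thread(&LazyWorker::Run, this);
  }

  bool IsRunning() {
    std::lock_guard<std::mutex> lock(lock_);
    return running_;
  }

 private:
  void Run() {
    std::unique_lock<std::mutex> lock(lock_);
    for (;;) {
      const bool has_reason = wake_.wait_for(
          lock, idle_timeout_, [this] { return stopping_ || !tasks_.empty(); });
      if (stopping_ || !has_reason) {
        running_ = false;  // Forced stop or idle timeout: shut down.
        return;
      }
      std::function<void()> task = std::move(tasks_.front());
      tasks_.pop_front();
      lock.unlock();
      task();  // Run outside the lock so Post() is not blocked by a task.
      lock.lock();
    }
  }

  const std::chrono::milliseconds idle_timeout_;
  std::mutex lock_;  // Guards everything below and orders start/stop.
  std::condition_variable wake_;
  std::deque<std::function<void()>> tasks_;
  bool running_ = false;
  bool stopping_ = false;
  std::thread thread_;
};

// Posts a task, waits for the idle shutdown, then posts again to show that
// the worker restarts. Returns the number of tasks that ran.
int RunLazyWorkerDemo() {
  std::atomic<int> tasks_run{0};  // Declared first so it outlives the worker.
  LazyWorker worker(std::chrono::milliseconds(20));
  worker.Post([&tasks_run] { ++tasks_run; });
  while (worker.IsRunning())
    std::this_thread::sleep_for(std::chrono::milliseconds(5));
  worker.Post([&tasks_run] { ++tasks_run; });
  while (tasks_run.load() < 2)
    std::this_thread::sleep_for(std::chrono::milliseconds(5));
  return tasks_run.load();
}
```

The sketch keeps `running_` under the same lock that orders start and join, which is the role the description assigns to thread_execution_state_lock_. How the real class handles a restart is not shown in this thread, so the join-before-restart step is an assumption of the sketch.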
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Patchset #42 (id:1130001) has been deleted
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: win_chromium_rel_ng on master.tryserver.chromium.win (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_rel_...)
CL description changes: Done.

https://codereview.chromium.org/2554123002/diff/1080001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/1080001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:637: StackSamplingProfiler::ResetAnnotationsForTesting();
> Given these facts, I think the best course of action is:
> - only run DisableIdleShutdown() in SetUp()
> - only run Reset() in TearDown(), and add a comment on the fixture saying that all profiler tests in base_unittests must use the fixture to ensure proper clean-up

Done.

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:249: // Updates the |next_sample_time| time based on configured parameters.
On 2017/03/31 18:12:33, Mike Wittman wrote:
> add comment on the meaning of the return value

Done.

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:259: // so that multiple threads may make those calls.
On 2017/03/31 18:12:33, Mike Wittman wrote:
> broaden comment to discuss the general thread execution state under the lock.
> e.g.:
>
> // State maintained about the current execution (or non-execution) of
> // the thread. This state must always be accessed while holding the
> // lock. A copy of the task-runner is maintained here for use by any
> // calling thread; this is necessary because Thread's accessor for it is
> // not itself thread-safe. The lock is also used to order calls to the
> // Thread API (Start, Stop, StopSoon, & DetachFromSequence) so that
> // multiple threads may make those calls.

Done.

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:275: std::map<int, std::unique_ptr<CollectionContext>> active_collections_;
On 2017/03/31 18:12:33, Mike Wittman wrote:
> nit: move this declaration above the lock to make totally obvious that this is not protected by the lock.

Done.

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:603: // All capturing has completed so finish the collection. Let object expire.
On 2017/03/31 18:12:33, Mike Wittman wrote:
> nit: clarify the meaning of "Let object expire."

Done.

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:613: // eliminating the race.
On 2017/03/31 18:12:33, Mike Wittman wrote:
> mention what the race is

Done.

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.h (right):

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.h:28: class WaitableEvent;
On 2017/03/31 18:12:33, Mike Wittman wrote:
> This can be removed since the header is already included.

Done.

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.h:183: // Testing support.
On 2017/03/31 18:12:33, Mike Wittman wrote:
> Add comment: The functions on this API are static because they affect the single sampling thread that is used across all StackSamplingProfilers.

Done.

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.h:245: // This will block until the callback has been run.
On 2017/03/31 18:12:33, Mike Wittman wrote:
> update comment: This will block until the callback has been run _if profiling is taking place_.

Done.

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:364: std::vector<std::unique_ptr<WaitableEvent>>* completed) {
On 2017/03/31 18:12:33, Mike Wittman wrote:
> There's a very subtle issue with this function in that it hides destruction ordering constraints from the caller: if the profiles or completed vectors were declared before the profilers vector in a test, the profilers could access those objects after they were destroyed but before the profilers were destroyed. This would cause flaky crashes that would be very difficult to track down.
>
> We should put these in a struct to enforce proper destruction order:
> struct ProfilerState {
>   std::unique_ptr<StackSamplingProfiler> profiler;
>   CallStackProfiles profiles;
>   std::unique_ptr<WaitableEvent> completed;
> };
>
> along with a comment calling out the reason for the ordering. Then, return a std::vector<ProfilerState> from this function.

Better than that, I think, is to create a struct that contains everything for a single test profiler and create a vector of pointers to those.

This cleans up a lot of things and means we can get rid of vectors of a single element.

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:903: static_cast<SampleRecordedCounter*>(samples_recorded[0].get())->Get() ==
On 2017/03/31 18:12:33, Mike Wittman wrote:
> I think we should be able to avoid casting entirely by using a template function. Taking the comment on the CreateProfilers function into account:
>
> template <typename T>
> std::vector<ProfilerState> CreateProfilers(
>     PlatformThreadId target_thread_id,
>     const std::vector<SamplingParams>& params,
>     const std::vector<std::unique_ptr<T>>* test_delegates) {
>   // ... existing function ...
> }
>
> std::vector<ProfilerState> CreateProfilers(
>     PlatformThreadId target_thread_id,
>     const std::vector<SamplingParams>& params) {
>   return CreateProfilers<NativeStackSamplerTestDelegate>(target_thread_id,
>                                                          params, nullptr);
> }
>
> And convert the non-delegate-users to the second overload.

No casting necessary with new TestProfilerInfo structure.

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:991: void Wait() { sample_recorded_.Wait(); }
On 2017/03/31 18:12:33, Mike Wittman wrote:
> nit: WaitForSampleToOccur

Done.

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1154: // Initiate an "idle" shutdown and ensure it happens. Idle-shutdown was
On 2017/03/31 18:12:33, Mike Wittman wrote:
> Idle-shutdown is disabled in the test fixture ...

Done.
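The destruction-ordering concern raised on CreateProfilers can be made concrete with a small standalone sketch. C++ destroys non-static data members in reverse order of declaration, so the member that references the others has to be declared last in order to be destroyed first. `Tracked`, `ProfilerBundle` and `DestroyBundleAndGetOrder` below are hypothetical names used only to record the order; they are not the TestProfilerInfo struct from the patch.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Global log of destructor calls, in the order they happen.
std::vector<std::string>& DestructionLog() {
  static std::vector<std::string> log;
  return log;
}

// Appends its name to the log when destroyed.
struct Tracked {
  explicit Tracked(std::string n) : name(std::move(n)) {}
  ~Tracked() { DestructionLog().push_back(name); }
  std::string name;
};

// Hypothetical bundle of everything one test profiler needs. The order is
// deliberate: the profiler refers to the other two members, so it is
// declared last and therefore destroyed first.
struct ProfilerBundle {
  Tracked profiles{"profiles"};
  Tracked completed{"completed"};
  Tracked profiler{"profiler"};
};

// Creates and destroys one bundle and returns the destruction order.
std::vector<std::string> DestroyBundleAndGetOrder() {
  DestructionLog().clear();
  { ProfilerBundle bundle; }
  return DestructionLog();
}
```

Bundling the objects this way puts the ordering constraint in one place instead of relying on every test to declare its locals in a safe order.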
owners lgtm but I mostly only looked at the API. Be sure to follow up with the others for the real review.

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:291: CHECK(sampler->active_collections_.empty());
Can this and the two non-debug assertions 2 functions down be converted to debug ones? Non-debug assertions add ~100 bytes each to the release binary. And since these are test ones, most of the tests are run with debug assertions enabled (even in release mode).
https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:291: CHECK(sampler->active_collections_.empty());
On 2017/04/03 21:42:51, brettw (plz ping after 24h) wrote:
> Can this and the two non-debug assertions 2 functions down be converted to debug ones? Non-debug assertions add ~100 bytes each to the release binary. And since these are test ones, most of the tests are run with debug assertions enabled (even in release mode).

I had them DCHECK originally but was told to convert them to CHECK because the linker would remove these methods as "dead code" (they're only called from tests). Is that not the case?
https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:291: CHECK(sampler->active_collections_.empty());
On 2017/04/04 12:59:31, bcwhite wrote:
> On 2017/04/03 21:42:51, brettw (plz ping after 24h) wrote:
> > Can this and the two non-debug assertions 2 functions down be converted to debug ones? Non-debug assertions add ~100 bytes each to the release binary. And since these are test ones, most of the tests are run with debug assertions enabled (even in release mode).
>
> I had them DCHECK originally but was told to convert them to CHECK because the linker would remove these methods as "dead code" (they're only called from tests). Is that not the case?

If they're only called from tests, then what's the problem with them being removed as dead code from Chrome build?

They shouldn't be removed by the linker when building the tests.
https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:291: CHECK(sampler->active_collections_.empty());
On 2017/04/04 15:44:26, Alexei Svitkine (slow) wrote:
> On 2017/04/04 12:59:31, bcwhite wrote:
> > On 2017/04/03 21:42:51, brettw (plz ping after 24h) wrote:
> > > Can this and the two non-debug assertions 2 functions down be converted to debug ones? Non-debug assertions add ~100 bytes each to the release binary. And since these are test ones, most of the tests are run with debug assertions enabled (even in release mode).
> >
> > I had them DCHECK originally but was told to convert them to CHECK because the linker would remove these methods as "dead code" (they're only called from tests). Is that not the case?
>
> If they're only called from tests, then what's the problem with them being removed as dead code from Chrome build?
>
> They shouldn't be removed by the linker when building the tests.

No problem. The question was whether there was any difference between CHECK and DCHECK in test code.

If test code is removed by the linker when building FOR RELEASE then it wouldn't matter which check is used inside that code. If it hangs around, then we should definitely use DCHECK.
This is looking good to me. Just waiting for gab's review.

A couple comments on the CL description:

> Sampling will continue until the desired number has been collected,
> it is manually stopped, or the controlling object gets destructed.

... until the desired number _of samples_ has been collected ...

> The thread under test is expected to live at least as long as the
> thread controlling the sampling.

The sampled thread is expected to live at least as long ...

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/1110001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:364: std::vector<std::unique_ptr<WaitableEvent>>* completed) {
On 2017/04/03 20:18:13, bcwhite wrote:
> On 2017/03/31 18:12:33, Mike Wittman wrote:
> > There's a very subtle issue with this function in that it hides destruction ordering constraints from the caller: if the profiles or completed vectors were declared before the profilers vector in a test, the profilers could access those objects after they were destroyed but before the profilers were destroyed. This would cause flaky crashes that would be very difficult to track down.
> >
> > We should put these in a struct to enforce proper destruction order:
> > struct ProfilerState {
> >   std::unique_ptr<StackSamplingProfiler> profiler;
> >   CallStackProfiles profiles;
> >   std::unique_ptr<WaitableEvent> completed;
> > };
> >
> > along with a comment calling out the reason for the ordering. Then, return a std::vector<ProfilerState> from this function.
>
> Better than that, I think, is to create a struct that contains everything for a single test profiler and create a vector of pointers to those.
>
> This cleans up a lot of things and means we can get rid of vectors of a single element.

Nice, that makes things much cleaner.

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:291: CHECK(sampler->active_collections_.empty());
On 2017/04/04 15:54:49, bcwhite wrote:
> On 2017/04/04 15:44:26, Alexei Svitkine (slow) wrote:
> > On 2017/04/04 12:59:31, bcwhite wrote:
> > > On 2017/04/03 21:42:51, brettw (plz ping after 24h) wrote:
> > > > Can this and the two non-debug assertions 2 functions down be converted to debug ones? Non-debug assertions add ~100 bytes each to the release binary. And since these are test ones, most of the tests are run with debug assertions enabled (even in release mode).
> > >
> > > I had them DCHECK originally but was told to convert them to CHECK because the linker would remove these methods as "dead code" (they're only called from tests). Is that not the case?
> >
> > If they're only called from tests, then what's the problem with them being removed as dead code from Chrome build?
> >
> > They shouldn't be removed by the linker when building the tests.
>
> No problem. The question was whether there was any difference between CHECK and DCHECK in test code.
>
> If test code is removed by the linker when building FOR RELEASE then it wouldn't matter which check is used inside that code. If it hangs around, then we should definitely use DCHECK.

I did a spot check of several TestAPI definitions in Windows 64-bit 59.0.3060.0 canary and there are no symbols for those classes in the release build despite the presence of symbols for the enclosing class. There are no TestAPI symbols in the build at all.

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:882: SampleRecordedCounter samples_recorded[2];
nit: arraysize(params)

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1274: profiler_info.Reset();
Now that we have the TestProfilerInfo struct it would be simpler just to create a new TestProfilerInfo here and run that.
The DCHECK/CHECK thing doesn't matter either way in practice for this patch so I want to avoid bikeshedding on that and delaying this patch. My inclination for DCHECK is that one would normally use DCHECK for this type of thing (for the reasons I outlined), so we should do it here too. Non-debug assertions jump out at me as either being critically important or wrong.
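The distinction being debated can be illustrated with a pair of toy macros. This is a sketch under stated assumptions: `SKETCH_CHECK`, `SKETCH_DCHECK`, `SKETCH_DCHECK_IS_ON` and `SKETCH_DCHECK_ALWAYS_ON` are hypothetical names invented here and are not the definitions from Chromium's logging headers. The always-on check evaluates its condition in every build, while the debug check is compiled to nothing when `NDEBUG` is defined, unless a build flag forces it back on (the situation described above for tests built in release mode).

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Always-on assertion: the condition is evaluated in every build type.
#define SKETCH_CHECK(cond)                                   \
  do {                                                       \
    if (!(cond)) {                                           \
      std::fprintf(stderr, "Check failed: %s\n", #cond);     \
      std::abort();                                          \
    }                                                        \
  } while (0)

// Debug-only assertion: active unless NDEBUG is defined, and a hypothetical
// SKETCH_DCHECK_ALWAYS_ON flag can keep it active in optimized builds.
#if defined(NDEBUG) && !defined(SKETCH_DCHECK_ALWAYS_ON)
#define SKETCH_DCHECK_IS_ON 0
// The condition still has to compile but is never evaluated.
#define SKETCH_DCHECK(cond) \
  do {                      \
    if (false) {            \
      (void)(cond);         \
    }                       \
  } while (0)
#else
#define SKETCH_DCHECK_IS_ON 1
#define SKETCH_DCHECK(cond) SKETCH_CHECK(cond)
#endif

// Returns how many times the asserted condition was actually evaluated:
// 2 when debug checks are on, 1 when they are compiled out.
int CountConditionEvaluations() {
  int evaluations = 0;
  auto collections_empty = [&evaluations] {
    ++evaluations;
    return true;
  };
  SKETCH_CHECK(collections_empty());
  SKETCH_DCHECK(collections_empty());
  return evaluations;
}
```

Because the disabled form never evaluates its argument, a debug-only check costs nothing in a release binary, which is the size argument made earlier in the thread.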
Description was changed from ========== Support parallel captures from the StackSamplingProfiler. Previously, only one sampling operation could be running and it was generally used to profile the startup of the browser. To make it more useful, it can now run against any thread and multiple profilers can execute in parallel. Sampling will continue until the desired number has been collected, it is manually stopped, or the controlling object gets destructed. The SamplingThread is a singleton base::Thread that is self-managing. - It is started (via GetOrCreateTaskRunnerForAdd) on the calling thread when work arrives. - It stops (via ShutdownTask) on its own thread when it has been idle for 1 minute. - DetachFromSequence is called after both of these to allow for accessing the API from different threads. - thread_execution_state_lock_ is held when doing Thread API calls to ensure that access is sequenced. The thread under test is expected to live at least as long as the thread controlling the sampling. BUG=671716 ========== to ========== Support parallel captures from the StackSamplingProfiler. Previously, only one sampling operation could be running and it was generally used to profile the startup of the browser. To make it more useful, it can now run against any thread and multiple profilers can execute in parallel. Sampling will continue until the desired number of samples has been collected, it is manually stopped, or the controlling object gets destructed. The SamplingThread is a singleton base::Thread that is self-managing. - It is started (via GetOrCreateTaskRunnerForAdd) on the calling thread when work arrives. - It stops (via ShutdownTask) on its own thread when it has been idle for 1 minute. - DetachFromSequence is called after both of these to allow for accessing the API from different threads. - thread_execution_state_lock_ is held when doing Thread API calls to ensure that access is sequenced. 
The sampled thread is expected to live at least as long as the thread controlling the sampling. BUG=671716 ==========
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
Some comments, some drive-bys on a few nits I spotted and a meta-comment:

Why does sampling have to be startable from any thread? Besides starting and self-stopping, what makes the threading complicated here?

Can this CL be split? >1K LOCs is a huge review... I therefore didn't read the whole thing but did look at everything that seemed related to threading (skipped tests mostly though).

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:114: class StackSamplingProfiler::SamplingThread : public Thread {
This is big enough to warrant its own file and unit tests IMO

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:240: // Check if the sampling thread is idle and begin a shutdown if so.
"begin a shutdown if so" sounds weird to me

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:291: CHECK(sampler->active_collections_.empty());
On 2017/04/04 17:59:51, Mike Wittman wrote:
> On 2017/04/04 15:54:49, bcwhite wrote:
> > On 2017/04/04 15:44:26, Alexei Svitkine (slow) wrote:
> > > On 2017/04/04 12:59:31, bcwhite wrote:
> > > > On 2017/04/03 21:42:51, brettw (plz ping after 24h) wrote:
> > > > > Can this and the two non-debug assertions 2 functions down be converted to debug ones? Non-debug assertions add ~100 bytes each to the release binary. And since these are test ones, most of the tests are run with debug assertions enabled (even in release mode).
> > > >
> > > > I had them DCHECK originally but was told to convert them to CHECK because the linker would remove these methods as "dead code" (they're only called from tests). Is that not the case?
> > >
> > > If they're only called from tests, then what's the problem with them being removed as dead code from Chrome build?
> > >
> > > They shouldn't be removed by the linker when building the tests.
> >
> > No problem. The question was whether there was any difference between CHECK and DCHECK in test code.
> >
> > If test code is removed by the linker when building FOR RELEASE then it wouldn't matter which check is used inside that code. If it hangs around, then we should definitely use DCHECK.
>
> I did a spot check of several TestAPI definitions in Windows 64-bit 59.0.3060.0 canary and there are no symbols for those classes in the release build despite the presence of symbols for the enclosing class. There are no TestAPI symbols in the build at all.

Yes, this should be DCHECK, don't CHECK just to force a test in non-test code: https://chromium.googlesource.com/chromium/src/+/master/styleguide/c++/c++.md...

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:354: void StackSamplingProfiler::SamplingThread::TestAPI::ShutdownTaskAndSignalEvent(
You can just post two tasks in a row instead of a having a custom helper that does two things in one. base::Thread will run all tasks before winding down anyways.

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:366: : Thread("Chrome_SamplingProfilerThread") {}
No need to prefix with "Chrome_", the thread names will always be viewed as scoped to Chrome's browser process anyways. No need to suffix with "Thread" either. "StackSamplingProfiler" is a shorter and more precise name.

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:369: Stop();
Not necessary, ~Thread() does this already so = default; is sufficient here.

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:380: options.priority = ThreadPriority::DISPLAY;
Hmmm I don't think that's appropriate. On Android only UI/IO run at DISPLAY I think and on Desktop no thread runs at DISPLAY priority for now (all NORMAL or lower). I understand that sampling has to be regular to be accurate but we also don't want to slow down the product in order to sample... right? With this we're telling the OS, if you're under crunch and there's only one thing you can make chrome do: schedule this thread..

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:435: // thread and the thread that creates it (i.e. this thread).
Add " for thread-safety reasons which are alleviated in SamplingThread per gating its access on |thread_execution_state_lock_|."

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:475: CollectionContext* collection) {
Add DCHECK_EQ(GetThreadId(), PlatformThread::CurrentId()); calls that document methods that always run on the sampling thread (others should use AutoLock or have a meta-comment so that it's clear which context each method is entered from).

Or even better would be to split this state into (1) a class that only runs on sampling thread (members never touched from elsewhere) and (2) the multi-threaded part with state behind lock.

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:603: std::max(collection->next_sample_time - Time::Now(), TimeDelta()));
This isn't required I think, negative delays should be the same as no delays.

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:616: // get postponed until thread_execution_state_ is updated, thus eliminating
|thread_execution_state_|

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:639: // work comes in. Remove the thread_execution_state_task_runner_ to avoid
|thread_execution_state_task_runner_ | and maybe elsewhere too

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:759: profiling_inactive_.Reset();
Use a ResetPolicy::AUTOMATIC WaitableEvent?

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.h (right):

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.h:194: static bool IsSamplingThreadRunning();
Thread::IsRunning() isn't thread-safe (though the check is sadly disabled right now [1]) and as such this method isn't either (must be called from owning sequence). Please document it as such (or probably this entire TestAPI class as such in fact).

[1] https://cs.chromium.org/chromium/src/base/threading/thread.cc?type=cs&q=%22//...

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.h:197: static void DisableIdleShutdown();
Since you should support having multiple pending delayed shutdown tasks in your queue (I don't see anything that prevents that from happening), why bother disable them? Your tests should complete much before they fire for real anyways so it shouldn't be a source of flakiness, disabling them merely brings you further from testing your real product code.

https://codereview.chromium.org/2554123002/diff/1170001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/1170001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:729: // Stop is immediate but asynchronous. There is a non-zero probability that
If it's asynchronous, it's not immediate :).

"// Stop is asynchronous: there is a non-zero..."
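The suggestion to use an automatically resetting event instead of calling Reset() by hand can be illustrated with a small standard-library sketch. `SketchEvent` and `StaysSignaledAfterWait` are hypothetical names, and the class only approximates the idea of the two reset policies; it does not reproduce base::WaitableEvent's actual interface or semantics.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical event with two reset policies. Not Chromium's WaitableEvent.
class SketchEvent {
 public:
  enum class ResetPolicy { kManual, kAutomatic };

  explicit SketchEvent(ResetPolicy policy) : policy_(policy) {}

  void Signal() {
    {
      std::lock_guard<std::mutex> lock(lock_);
      signaled_ = true;
    }
    cv_.notify_all();
  }

  // Only meaningful for the manual policy, where the caller must clear the
  // event explicitly before reusing it.
  void Reset() {
    std::lock_guard<std::mutex> lock(lock_);
    signaled_ = false;
  }

  // Blocks until signaled. With the automatic policy the event is cleared
  // as part of the same locked step that releases the waiter.
  void Wait() {
    std::unique_lock<std::mutex> lock(lock_);
    cv_.wait(lock, [this] { return signaled_; });
    if (policy_ == ResetPolicy::kAutomatic)
      signaled_ = false;
  }

  bool IsSignaled() {
    std::lock_guard<std::mutex> lock(lock_);
    return signaled_;
  }

 private:
  const ResetPolicy policy_;
  std::mutex lock_;
  std::condition_variable cv_;
  bool signaled_ = false;
};

// Signals, waits once, and reports whether the event is still signaled.
bool StaysSignaledAfterWait(SketchEvent::ResetPolicy policy) {
  SketchEvent event(policy);
  event.Signal();
  event.Wait();  // Returns immediately because the event is already set.
  return event.IsSignaled();
}
```

With the automatic policy there is no window between a waiter being released and a separate Reset() call, which removes one place where a later Signal() could be lost or observed twice.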
https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler.cc (right):

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler.cc:291: CHECK(sampler->active_collections_.empty());
On 2017/04/04 17:59:51, Mike Wittman wrote:
> On 2017/04/04 15:54:49, bcwhite wrote:
> > On 2017/04/04 15:44:26, Alexei Svitkine (slow) wrote:
> > > On 2017/04/04 12:59:31, bcwhite wrote:
> > > > On 2017/04/03 21:42:51, brettw (plz ping after 24h) wrote:
> > > > > Can this and the two non-debug assertions 2 functions down be converted to debug ones? Non-debug assertions add ~100 bytes each to the release binary. And since these are test ones, most of the tests are run with debug assertions enabled (even in release mode).
> > > >
> > > > I had them DCHECK originally but was told to convert them to CHECK because the linker would remove these methods as "dead code" (they're only called from tests). Is that not the case?
> > >
> > > If they're only called from tests, then what's the problem with them being removed as dead code from Chrome build?
> > >
> > > They shouldn't be removed by the linker when building the tests.
> >
> > No problem. The question was whether there was any difference between CHECK and DCHECK in test code.
> >
> > If test code is removed by the linker when building FOR RELEASE then it wouldn't matter which check is used inside that code. If it hangs around, then we should definitely use DCHECK.
>
> I did a spot check of several TestAPI definitions in Windows 64-bit 59.0.3060.0 canary and there are no symbols for those classes in the release build despite the presence of symbols for the enclosing class. There are no TestAPI symbols in the build at all.

Mike, you asked for them to be CHECK and you're the owner. Your call.

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
File base/profiler/stack_sampling_profiler_unittest.cc (right):

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:882: SampleRecordedCounter samples_recorded[2];
On 2017/04/04 17:59:52, Mike Wittman wrote:
> nit: arraysize(params)

Done.

https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s...
base/profiler/stack_sampling_profiler_unittest.cc:1274: profiler_info.Reset();
On 2017/04/04 17:59:52, Mike Wittman wrote:
> Now that we have the TestProfilerInfo struct it would be simpler just to create a new TestProfilerInfo here and run that.

Done.
https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s... base/profiler/stack_sampling_profiler.cc:291: CHECK(sampler->active_collections_.empty()); On 2017/04/06 16:18:49, bcwhite wrote: > On 2017/04/04 17:59:51, Mike Wittman wrote: > > On 2017/04/04 15:54:49, bcwhite wrote: > > > On 2017/04/04 15:44:26, Alexei Svitkine (slow) wrote: > > > > On 2017/04/04 12:59:31, bcwhite wrote: > > > > > On 2017/04/03 21:42:51, brettw (plz ping after 24h) wrote: > > > > > > Can this and the two non-debug assertions 2 functions down be > converted > > to > > > > > debug > > > > > > ones? Non-debug assertions add ~100 bytes each to the release binary. > > And > > > > > since > > > > > > these are test ones, most of the tests are run with debug assertions > > > enabled > > > > > > (even in release mode). > > > > > > > > > > I had them DCHECK originally but was told to convert them to CHECK > because > > > the > > > > > linker would remove these methods as "dead code" (they're only called > from > > > > > tests). Is that not the case? > > > > > > > > If they're only called from tests, then what's the problem with them being > > > > removed as dead code from Chrome build? > > > > > > > > They shouldn't be removed by the linker when building the tests. > > > > > > No problem. The question was whether there was any difference between CHECK > > and > > > DCHECK in test code. > > > > > > If test code is removed by the linker when building FOR RELEASE then it > > wouldn't > > > matter which check is used inside that code. If it hangs around, then we > > should > > > definitely use DCHECK. > > > > I did a spot check of several TestAPI definitions in Windows 64-bit > 59.0.3060.0 > > canary and there are no symbols for those classes in the release build despite > > the presence of symbols for the enclosing class. 
There are no TestAPI symbols > in > > the build at all. > > Mike, you asked for them to be CHECK and you're the owner. Your call. Following the guidance from Brett and Gab is OK with me. If DCHECKs are generally always enabled in tests then the CHECKs don't really buy us anything.
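The CHECK versus DCHECK trade-off discussed above can be sketched with stand-in macros. This is a minimal illustration, not Chromium's logging implementation: the names SKETCH_CHECK and SKETCH_DCHECK, the SKETCH_DCHECK_IS_ON switch, and CountEvaluations are all hypothetical. It only shows why a debug-only assertion costs nothing when the switch is off, while an always-on assertion stays in the binary.

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical stand-ins for CHECK/DCHECK; the names and the
// SKETCH_DCHECK_IS_ON switch are invented for this illustration.
#define SKETCH_CHECK(condition)                               \
  do {                                                        \
    if (!(condition)) {                                       \
      std::fprintf(stderr, "Check failed: %s\n", #condition); \
      std::abort();                                           \
    }                                                         \
  } while (0)

#if defined(SKETCH_DCHECK_IS_ON)
#define SKETCH_DCHECK(condition) SKETCH_CHECK(condition)
#else
// Compiled out: the condition is type-checked but never evaluated.
#define SKETCH_DCHECK(condition) \
  do {                           \
    if (false) {                 \
      (void)(condition);         \
    }                            \
  } while (0)
#endif

// Counts how many of the conditions below were actually evaluated.
int CountEvaluations() {
  int evaluations = 0;
  SKETCH_CHECK(++evaluations > 0);   // Always evaluated.
  SKETCH_DCHECK(++evaluations > 0);  // Evaluated only when the switch is on.
  return evaluations;
}
```

With the switch undefined, only the always-on assertion runs. Defining SKETCH_DCHECK_IS_ON before the macros models a build where debug assertions are enabled, which is the situation the reviewers describe for most test runs.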
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s... base/profiler/stack_sampling_profiler.cc:114: class StackSamplingProfiler::SamplingThread : public Thread { On 2017/04/05 20:38:42, gab wrote: > This is big enough to warrant its own file and unit tests IMO Acknowledged. https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s... base/profiler/stack_sampling_profiler.cc:240: // Check if the sampling thread is idle and begin a shutdown if so. On 2017/04/05 20:38:42, gab wrote: > "begin a shutdown if so" sounds weird to me Done. https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s... base/profiler/stack_sampling_profiler.cc:291: CHECK(sampler->active_collections_.empty()); On 2017/04/06 16:55:41, Mike Wittman wrote: > On 2017/04/06 16:18:49, bcwhite wrote: > > On 2017/04/04 17:59:51, Mike Wittman wrote: > > > On 2017/04/04 15:54:49, bcwhite wrote: > > > > On 2017/04/04 15:44:26, Alexei Svitkine (slow) wrote: > > > > > On 2017/04/04 12:59:31, bcwhite wrote: > > > > > > On 2017/04/03 21:42:51, brettw (plz ping after 24h) wrote: > > > > > > > Can this and the two non-debug assertions 2 functions down be > > converted > > > to > > > > > > debug > > > > > > > ones? Non-debug assertions add ~100 bytes each to the release > binary. > > > And > > > > > > since > > > > > > > these are test ones, most of the tests are run with debug assertions > > > > enabled > > > > > > > (even in release mode). > > > > > > > > > > > > I had them DCHECK originally but was told to convert them to CHECK > > because > > > > the > > > > > > linker would remove these methods as "dead code" (they're only called > > from > > > > > > tests). Is that not the case? 
> > > > > > > > > > If they're only called from tests, then what's the problem with them > being > > > > > removed as dead code from Chrome build? > > > > > > > > > > They shouldn't be removed by the linker when building the tests. > > > > > > > > No problem. The question was whether there was any difference between > CHECK > > > and > > > > DCHECK in test code. > > > > > > > > If test code is removed by the linker when building FOR RELEASE then it > > > wouldn't > > > > matter which check is used inside that code. If it hangs around, then we > > > should > > > > definitely use DCHECK. > > > > > > I did a spot check of several TestAPI definitions in Windows 64-bit > > 59.0.3060.0 > > > canary and there are no symbols for those classes in the release build > despite > > > the presence of symbols for the enclosing class. There are no TestAPI > symbols > > in > > > the build at all. > > > > Mike, you asked for them to be CHECK and you're the owner. Your call. > > Following the guidance from Brett and Gab is OK with me. If DCHECKs are > generally always enabled in tests then the CHECKs don't really buy us anything. Done. https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s... base/profiler/stack_sampling_profiler.cc:354: void StackSamplingProfiler::SamplingThread::TestAPI::ShutdownTaskAndSignalEvent( On 2017/04/05 20:38:42, gab wrote: > You can just post two tasks in a row instead of a having a custom helper that > does two things in one. > > base::Thread will run all tasks before winding down anyways. Wouldn't two successive posts create a race-condition? - ShutdownTask gets posted - ShutdownTask runs - thread exits - SignalDoneTask gets posted - never runs https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s... 
base/profiler/stack_sampling_profiler.cc:366: : Thread("Chrome_SamplingProfilerThread") {} On 2017/04/05 20:38:42, gab wrote: > No need to prefix with "Chrome_", the thread names will always be viewed as > scoped to Chrome's browser process anyways. No need to suffix with "Thread" > either. > > "StackSamplingProfiler" is a shorter and more precise name. Done. https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s... base/profiler/stack_sampling_profiler.cc:369: Stop(); On 2017/04/05 20:38:42, gab wrote: > Not necessary, ~Thread() does this already so = default; is sufficient here. Done. https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s... base/profiler/stack_sampling_profiler.cc:380: options.priority = ThreadPriority::DISPLAY; > Hmmm I don't think that's appropriate. On Android only UI/IO run at DISPLAY I > think and on Desktop no thread runs at DISPLAY priority for now (all NORMAL or > lower). > > I understand that sampling has to be regular to be accurate but we also don't > want to slow down the product in order to sample... right? Right. > With this we're telling the OS, if you're under crunch and there's only one > thing you can make chrome do: schedule this thread.. Done. I can remove the method and let the default one do the default thing. https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s... base/profiler/stack_sampling_profiler.cc:435: // thread and the thread that creates it (i.e. this thread). On 2017/04/05 20:38:42, gab wrote: > Add " for thread-safety reasons which are alleviated in SamplingThread per > gating its access on |thread_execution_state_lock_|." Done. https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s... 
base/profiler/stack_sampling_profiler.cc:475: CollectionContext* collection) { On 2017/04/05 20:38:42, gab wrote: > Add DCHECK_EQ(GetThreadId(), PlatformThread::CurrentId()); calls that document > methods that always run on the sampling thread (others should use AutoLock or > have a meta-comment so that it's clear which context each method is entered > from). Done. > Or even better would be to split this state into (1) a class that only runs on > sampling thread (members never touched from elsewhere) and (2) the > multi-threaded part with state behind lock. As in... Create yet another sub-class that has all the thread-specific access and give it a pointer to its parent class for accessing the shared information, including the lock to that information? I'm not sure that's worth the effort but I'll think about it. https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s... base/profiler/stack_sampling_profiler.cc:603: std::max(collection->next_sample_time - Time::Now(), TimeDelta())); On 2017/04/05 20:38:42, gab wrote: > This isn't required I think, negative delays should be the same as no delays. There is a DCHECK in incoming_task_queue.cc (line 45) that checks that delay is not negative. https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s... base/profiler/stack_sampling_profiler.cc:616: // get postponed until thread_execution_state_ is updated, thus eliminating On 2017/04/05 20:38:42, gab wrote: > |thread_execution_state_| Done. https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s... base/profiler/stack_sampling_profiler.cc:639: // work comes in. Remove the thread_execution_state_task_runner_ to avoid On 2017/04/05 20:38:42, gab wrote: > |thread_execution_state_task_runner_ | and maybe elsewhere too Done. https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s... 
base/profiler/stack_sampling_profiler.cc:759: profiling_inactive_.Reset(); On 2017/04/05 20:38:42, gab wrote: > Use a ResetPolicy::AUTOMATIC WaitableEvent? There are other Wait calls on this that don't reset it. Comment added where it is defined. https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s... File base/profiler/stack_sampling_profiler.h (right): https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s... base/profiler/stack_sampling_profiler.h:194: static bool IsSamplingThreadRunning(); On 2017/04/05 20:38:43, gab wrote: > Thread::IsRunning() isn't thread-safe (though the check is sadly disabled right > now [1]) and as such this method isn't either (must be called from owning > sequence). Please document it as such (or probably this entire TestAPI class as > such in fact). I see. Would the best solution be to have CleanUp set a flag (under lock) and return that? > > [1] > https://cs.chromium.org/chromium/src/base/threading/thread.cc?type=cs&q=%22//... Done. I also added a DetachFromSequence in the code for this. Let me know if that's unnecessary. https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s... base/profiler/stack_sampling_profiler.h:197: static void DisableIdleShutdown(); On 2017/04/05 20:38:42, gab wrote: > Since you should support having multiple pending delayed shutdown tasks in your > queue (I don't see anything that prevents that from happening), They can and it's handled. > why bother > disable them? Your tests should complete much before they fire for real anyways > so it shouldn't be a source of flakiness, disabling them merely brings you > further from testing your real product code. Just for guaranteed operation. Right now the idle shutdown time is 1 minute but it's an internal thing. If it was changed to 1s we wouldn't want the tests to become flaky. https://codereview.chromium.org/2554123002/diff/1170001/base/profiler/stack_s... 
File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/1170001/base/profiler/stack_s... base/profiler/stack_sampling_profiler.cc:729: // Stop is immediate but asynchronous. There is a non-zero probability that On 2017/04/05 20:38:43, gab wrote: > If it's asynchronous, it's not immediate :). > > "// Stop is asynchronous: there is a non-zero..." Done. :-)
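The ordering concern raised above for ShutdownTaskAndSignalEvent can be modeled with a toy task queue. This is a sketch under stated assumptions, not base::Thread: ToyTaskQueue and both helper functions are hypothetical, and the queue simply rejects tasks posted after a task has requested shutdown, which stands in for "thread exits" in the scenario.

```cpp
#include <deque>
#include <functional>
#include <utility>

// Toy stand-in for a thread's task queue (hypothetical, not base::Thread).
class ToyTaskQueue {
 public:
  // Returns false if shutdown was already requested; the task is dropped.
  bool Post(std::function<void()> task) {
    if (stopped_) return false;
    tasks_.push_back(std::move(task));
    return true;
  }

  // Runs queued tasks until the queue is empty or a task requests shutdown.
  void RunPending() {
    while (!stopped_ && !tasks_.empty()) {
      std::function<void()> task = std::move(tasks_.front());
      tasks_.pop_front();
      task();
    }
  }

  // Called from inside a task to request shutdown.
  void StopSoon() { stopped_ = true; }

 private:
  std::deque<std::function<void()>> tasks_;
  bool stopped_ = false;
};

// Two separate posts: the shutdown task may run before the second post.
bool SignalRunsWithTwoPosts() {
  ToyTaskQueue queue;
  bool signaled = false;
  queue.Post([&] { queue.StopSoon(); });
  queue.RunPending();  // Models the thread running between the two posts.
  queue.Post([&] { signaled = true; });
  queue.RunPending();
  return signaled;
}

// One combined task: shutdown and signal cannot be separated.
bool SignalRunsWithCombinedTask() {
  ToyTaskQueue queue;
  bool signaled = false;
  queue.Post([&] {
    queue.StopSoon();
    signaled = true;
  });
  queue.RunPending();
  return signaled;
}
```

In this model the two-post variant loses the signal whenever the shutdown task runs first, while the combined task always signals. Real thread scheduling makes the first outcome intermittent rather than certain.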
asvitkine@chromium.org changed reviewers: - asvitkine@chromium.org
> Why does sampling have to be startable from any thread? Besides starting and > self-stopping, what makes the threading complicated here? Generally, a thread will initiate sampling upon itself, but since any thread could do that to itself, the sampling thread needs to be startable from any thread. There may also be cases where, for example, the UI thread wants to initiate sampling on some worker thread. > Can this CL be split? >1K LOCs is a huge review... I therefore didn't read the > whole thing but did look at everything that seemed related to threading (skipped > tests mostly though). I don't see how. It's one change. And it's a replacement for currently-active functionality, so even if it could be broken into pieces, it wouldn't "drop in" (or revert cleanly) as needed.
On 2017/04/06 18:44:03, bcwhite wrote: > > Why does sampling have to be startable from any thread? Besides starting and > > self-stopping, what makes the threading complicated here? > > Generally, a thread will initiate sampling upon itself but since any thread > could do that to itself, the sampling thread needs to be startable from any > thread. > > And there may be cases as well well, for example, the UI thread wants to > initiate sampling on some worker thread. > > > > Can this CL be split? >1K LOCs is a huge review... I therefore didn't read the > > whole thing but did look at everything that seemed related to threading > (skipped > > tests mostly though). > > I don't see how. It's one change. And it's a replacement for currently-active > functionality so even it it could be broken into pieces, it wouldn't "drop in" > (or revert cleanly) as needed. I'm not the main reviewer so won't force it but the way we usually do this in base/task_scheduler land et al. is build individual components on their own w/ unit tests that aren't yet attached to the system (e.g. StackSamplingProfiler class could be one) instead of having them hidden in anonymous namespace and tested by integration. This makes testing easier (focused unit tests), eventual refactoring easier (e.g. sequenced_worker_pool.cc is a mumbo-jumbo mess of anonymous classes and is hard to test and refactor because of that -- we avoided that in base/task_scheduler, base::internal:: namespace is used instead to depict impl-only boundary), and incremental CLs. Didn't do another full pass but overall threading lgtm w/ comments below. https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s... File base/profiler/stack_sampling_profiler.cc (right): https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s... 
base/profiler/stack_sampling_profiler.cc:354: void StackSamplingProfiler::SamplingThread::TestAPI::ShutdownTaskAndSignalEvent( On 2017/04/06 18:40:18, bcwhite wrote: > On 2017/04/05 20:38:42, gab wrote: > > You can just post two tasks in a row instead of a having a custom helper that > > does two things in one. > > > > base::Thread will run all tasks before winding down anyways. > > Wouldn't two successive posts create a race-condition? > - ShutdownTask gets posted > - ShutdownTask runs > - thread exits > - SignalDoneTask gets posted > - never runs Ah, good point, yes. Hadn't realized ShutdownTask() initiated its own async shutdown (via StopSoon()) when I wrote this, that paradigm is a first in Chrome! (but it's okay, we had talked about it, just forgot) https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s... base/profiler/stack_sampling_profiler.cc:603: std::max(collection->next_sample_time - Time::Now(), TimeDelta())); On 2017/04/06 18:40:18, bcwhite wrote: > On 2017/04/05 20:38:42, gab wrote: > > This isn't required I think, negative delays should be the same as no delays. > > There is a DCHECK in incoming_task_queue.cc (line 45) that checks that delay is > not negative. Ah ok interesting, probably an artifact but std::max here is fine then :) https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s... base/profiler/stack_sampling_profiler.cc:759: profiling_inactive_.Reset(); On 2017/04/06 18:40:18, bcwhite wrote: > On 2017/04/05 20:38:42, gab wrote: > > Use a ResetPolicy::AUTOMATIC WaitableEvent? > > There are other Wait calls on this that don't reset it. Comment added where it > is defined. The only other call I see is in the destructor (at which point resetting or not doesn't matter)? https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s... File base/profiler/stack_sampling_profiler.h (right): https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s... 
base/profiler/stack_sampling_profiler.h:194: static bool IsSamplingThreadRunning(); On 2017/04/06 18:40:18, bcwhite wrote: > On 2017/04/05 20:38:43, gab wrote: > > Thread::IsRunning() isn't thread-safe (though the check is sadly disabled > right > > now [1]) and as such this method isn't either (must be called from owning > > sequence). Please document it as such (or probably this entire TestAPI class > as > > such in fact). > > I see. Would the best solution be to have CleanUp set a flag (under lock) and > return that? > > > > > [1] > > > https://cs.chromium.org/chromium/src/base/threading/thread.cc?type=cs&q=%22//... > > Done. I also added a DetachFromSequence in the code for this. Let me know if > that's unnecessary. Hmmm if the comment above is respected and IsRunning() is always called from owning thread then it's always fine. No need to have a fancy Cleanup() -- that's already what Thread::IsRunning() is doing when called properly. https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s... base/profiler/stack_sampling_profiler.h:197: static void DisableIdleShutdown(); On 2017/04/06 18:40:18, bcwhite wrote: > On 2017/04/05 20:38:42, gab wrote: > > Since you should support having multiple pending delayed shutdown tasks in > your > > queue (I don't see anything that prevents that from happening), > > They can and it's handled. > > > > why bother > > disable them? Your tests should complete much before they fire for real > anyways > > so it shouldn't be a source of flakiness, disabling them merely brings you > > further from testing your real product code. > > Just for guaranteed operation. Right now the idle shutdown time is 1 minute but > it's an internal thing. If it was changed to 1s we wouldn't want the tests to > become flaky. Hmm okay, but you also by doing so don't test calling ShutdownTask with other pending ShutdownTasks. Up to you and wittman@ to decide which you prefer, just highlighting this.
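The std::max clamp kept at stack_sampling_profiler.cc:603 can be sketched with std::chrono standing in for base::Time and base::TimeDelta. That substitution is an assumption made for a self-contained example, and DelayUntil is a hypothetical helper name, not a function from the CL.

```cpp
#include <algorithm>
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical helper: returns the delay to use when posting the next sample
// task. A sample time already in the past yields a zero delay, never a
// negative one, so a non-negative-delay assertion in the task queue holds.
Clock::duration DelayUntil(Clock::time_point next_sample_time,
                           Clock::time_point now) {
  return std::max(next_sample_time - now, Clock::duration::zero());
}
```

Passing the current time in as a parameter, rather than reading the clock inside the helper, keeps the sketch deterministic and easy to test.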
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
LGTM Since this is a pretty major change, we should land it after next week's branch point to minimize risk and potential disturbance to the release stabilization process. The following Monday or Wednesday morning (4/17 or 4/19) would be ideal as that would give one day of bake time in canary before the following dev release. Also, please manually sample browser_tests output on later runs of the CQ and try bots during the day that this has landed to ensure it's not causing stability issues. (Grep'ing for crash stacks containing "StackSamplingProfiler" in the browser_tests log output is sufficient.) On 2017/04/06 19:31:28, gab (behind) wrote: > On 2017/04/06 18:44:03, bcwhite wrote: > > > Can this CL be split? >1K LOCs is a huge review... I therefore didn't read > the > > > whole thing but did look at everything that seemed related to threading > > (skipped > > > tests mostly though). > > > > I don't see how. It's one change. And it's a replacement for > currently-active > > functionality so even it it could be broken into pieces, it wouldn't "drop in" > > (or revert cleanly) as needed. > > I'm not the main reviewer so won't force it but the way we usually do this in > base/task_scheduler land et al. is build individual components on their own w/ > unit tests that aren't yet attached to the system (e.g. StackSamplingProfiler > class could be one) instead of having them hidden in anonymous namespace and > tested by integration. This makes testing easier (focused unit tests), eventual > refactoring easier (e.g. sequenced_worker_pool.cc is a mumbo-jumbo mess of > anonymous classes and is hard to test and refactor because of that -- we avoided > that in base/task_scheduler, base::internal:: namespace is used instead to > depict impl-only boundary), and incremental CLs. I'm fully on board with building incrementally, but unfortunately that was for the most part not a viable option with this change. 
:( The difficulty here is that the vast majority of the complexity is in the interrelationships of the thread and collection lifetimes along with the tasks implementing them. Even now, I don't see a way that these could have been meaningfully separated while preserving the essential complexity of the problem. That said, there are opportunities to enforce a more explicit decoupling of some pieces of this implementation. Moving SamplingThread to its own file would be beneficial. Also: factoring out the collection sampling state management from SamplingThread, and decoupling SamplingThread from the platform NativeStackSampler implementations by using a mock in tests. But this review has gone on long enough as it is, and those things can be addressed independently. https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s... File base/profiler/stack_sampling_profiler.h (right): https://codereview.chromium.org/2554123002/diff/1150001/base/profiler/stack_s... base/profiler/stack_sampling_profiler.h:197: static void DisableIdleShutdown(); On 2017/04/06 19:31:28, gab (behind) wrote: > On 2017/04/06 18:40:18, bcwhite wrote: > > On 2017/04/05 20:38:42, gab wrote: > > > Since you should support having multiple pending delayed shutdown tasks in > > your > > > queue (I don't see anything that prevents that from happening), > > > > They can and it's handled. > > > > > > > why bother > > > disable them? Your tests should complete much before they fire for real > > anyways > > > so it shouldn't be a source of flakiness, disabling them merely brings you > > > further from testing your real product code. > > > > Just for guaranteed operation. Right now the idle shutdown time is 1 minute > but > > it's an internal thing. If it was changed to 1s we wouldn't want the tests to > > become flaky. > > Hmm okay, but you also by doing so don't test calling ShutdownTask with other > pending ShutdownTasks. 
> > Up to you and wittman@ to decide which you prefer, just highlighting this. Yes, the tests don't explicitly generate multiple ShutdownTasks, but we extensively considered this behavior in the review and I'm satisfied that all the code paths exercised in this case are adequately tested.
> Since this is a pretty major change, we should land it after next week's branch > point to minimize risk and potential disturbance to the release stabilization > process. The following Monday or Wednesday morning (4/17 or 4/19) would be ideal > as that would give one day of bake time in canary before the following dev > release. I'll make sure there is time to bake. > I'm fully on board with building incrementally, but unfortunately that was for > the most part not a viable option with this change. :( The difficulty here is > that the vast majority of the complexity is in the interrelationships of the > thread and collection lifetimes along with the tasks implementing them. Even > now, I don't see a way that these could have been meaningfully separated while > preserving the essential complexity of the problem. There is actually a second CL to this, though this one is by far the larger. :-)
The CQ bit was checked by bcwhite@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from brettw@chromium.org Link to the patchset: https://codereview.chromium.org/2554123002/#ps1190001 (title: "addressed review comments by gab")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: chromium_presubmit on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/chromium_presub...) ios-device on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/ios-device/builds...) mac_chromium_compile_dbg_ng on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_comp...) mac_chromium_rel_ng on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_...)
The CQ bit was checked by bcwhite@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
The CQ bit was checked by bcwhite@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from wittman@chromium.org, gab@chromium.org, brettw@chromium.org Link to the patchset: https://codereview.chromium.org/2554123002/#ps1210001 (title: "rebased")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
CQ is committing da patch. Bot data: {"patchset_id": 1210001, "attempt_start_ts": 1492615544585300, "parent_rev": "431dd44543668f59e341aaf350f1370690ee9b35", "commit_rev": "69e964496800e75cb0e3cdd974436659bd24e9cf"}
Message was sent while issue was closed.
Committed patchset #45 (id:1210001) as https://chromium.googlesource.com/chromium/src/+/69e964496800e75cb0e3cdd97443...
On 2017/04/19 15:30:50, commit-bot: I haz the power wrote: > Committed patchset #45 (id:1210001) as > https://chromium.googlesource.com/chromium/src/+/69e964496800e75cb0e3cdd97443... Hey guys, Findit's analysis for a flaky test "BrowserCloseManagerBrowserTest/BrowserCloseManagerBrowserTest.AddBeforeUnloadDuringClosing/0" suggests this as the culprit according to analysis ag9zfmZpbmRpdC1mb3ItbWVy6AELEhdNYXN0ZXJGbGFrZUFuYWx5c2lzUm9vdCKxAWNocm9taXVtLm1hYy9NYWMxMC45IFRlc3RzIChkYmcpLzM5NTUwL2Jyb3dzZXJfdGVzdHMvUW5KdmQzTmxja05zYjNObFRXRnVZV2RsY2tKeWIzZHpaWEpVWlhOMEwwSnliM2R6WlhKRGJHOXpaVTFoYm1GblpYSkNjbTkzYzJWeVZHVnpkQzVCWkdSQ1pXWnZjbVZWYm14dllXUkVkWEpwYm1kRGJHOXphVzVuTHpBPQwLEhNNYXN0ZXJGbGFrZUFuYWx5c2lzGAEM, can someone please help verify? Thanks, Jeff on behalf of Findit team
Message was sent while issue was closed.
On 2017/04/26 23:52:53, lijeffrey wrote:
> Hey guys, Findit's analysis for a flaky test
> "BrowserCloseManagerBrowserTest/BrowserCloseManagerBrowserTest.AddBeforeUnloadDuringClosing/0"
> suggests this as the culprit according to analysis
> ag9zfmZpbmRpdC1mb3ItbWVy6AELEhdNYXN0ZXJGbGFrZUFuYWx5c2lzUm9vdCKxAWNocm9taXVtLm1hYy9NYWMxMC45IFRlc3RzIChkYmcpLzM5NTUwL2Jyb3dzZXJfdGVzdHMvUW5KdmQzTmxja05zYjNObFRXRnVZV2RsY2tKeWIzZHpaWEpVWlhOMEwwSnliM2R6WlhKRGJHOXpaVTFoYm1GblpYSkNjbTkzYzJWeVZHVnpkQzVCWkdSQ1pXWnZjbVZWYm14dllXUkVkWEpwYm1kRGJHOXphVzVuTHpBPQwLEhNNYXN0ZXJGbGFrZUFuYWx5c2lzGAEM,
> can someone please help verify?
>
> Thanks,
> Jeff on behalf of Findit team

Oops sorry here's the full link to the analysis:
https://findit-for-me.appspot.com/waterfall/flake?key=ag9zfmZpbmRpdC1mb3ItbWV...
Message was sent while issue was closed.
Can you provide a link to a build where this test fails? It's passing in
build 39545 linked in the analysis.

In any case, this change is unlikely to be the cause of flakiness on Mac
because the modified functionality is only enabled for 64-bit Windows.

On Wed, Apr 26, 2017 at 4:54 PM, <lijeffrey@chromium.org> wrote:
> Oops sorry here's the full link to the analysis:
> https://findit-for-me.appspot.com/waterfall/flake?key=ag9zfmZpbmRpdC1mb3ItbWVy6AELEhdNYXN0ZXJGbGFrZUFuYWx5c2lzUm9vdCKxAWNocm9taXVtLm1hYy9NYWMxMC45IFRlc3RzIChkYmcpLzM5NTUwL2Jyb3dzZXJfdGVzdHMvUW5KdmQzTmxja05zYjNObFRXRnVZV2RsY2tKeWIzZHpaWEpVWlhOMEwwSnliM2R6WlhKRGJHOXpaVTFoYm1GblpYSkNjbTkzYzJWeVZHVnpkQzVCWkdSQ1pXWnZjbVZWYm14dllXUkVkWEpwYm1kRGJHOXphVzVuTHpBPQwLEhNNYXN0ZXJGbGFrZUFuYWx5c2lzGAEM
Message was sent while issue was closed.
On 2017/04/27 00:34:54, Mike Wittman wrote:
> Can you provide a link to a build where this test fails? It's passing in
> build 39545 linked in the analysis.
>
> In any case, this change is unlikely to be the cause of flakiness on Mac
> because the modified functionality is only enabled for 64-bit Windows.

Thanks for the reply!
https://luci-milo.appspot.com/buildbot/chromium.mac/Mac10.9%20Tests%20%28dbg%...
and
https://luci-milo.appspot.com/buildbot/chromium.mac/Mac10.9%20Tests%20%28dbg%...
both fail for the same test which appears to have started flaking after this CL
landed. If it's a false positive please let us know so we can improve the flake
analyzer! :)
Message was sent while issue was closed.
This is a false positive. The failure appears to be due to some destruction
ordering error in Mac UI, which is entirely unrelated to the CL at hand.

On Thu, Apr 27, 2017 at 4:16 AM, <lijeffrey@chromium.org> wrote:
> Thanks for the reply!
> https://luci-milo.appspot.com/buildbot/chromium.mac/Mac10.9%20Tests%20%28dbg%29/39548
> and
> https://luci-milo.appspot.com/buildbot/chromium.mac/Mac10.9%20Tests%20%28dbg%29/39550
> both fail for the same test which appears to have started flaking after this CL
> landed. If it's a false positive please let us know so we can improve the flake
> analyzer! :)