Description

Enable sharding of AllocationRegister on desktop.
Previously, when native heap profiling was enabled, all allocations in a process
were gated by a global lock. This proves to be a significant performance hit on
macOS. Profiling shows that 50-75% of all wall time is spent waiting on this
lock.
This CL introduces a new class ShardedAllocationRegister to handle sharding for
desktop platforms. This stores allocation/backtrace information across 64
different AllocationRegister instances. In addition, the sizes of the fixed-size
hash maps were changed: the number of allocation buckets was reduced by a factor
of 8, and the number of backtrace buckets by a factor of 16.
The new class ShardedAllocationRegister is thread-safe, and its consumers no
longer need to acquire a lock when using the container. Each consumer still
needs to know whether heap profiling is enabled. This CL uses a subtle::Atomic32
along with Acquire_Load and Release_Store to determine this state. Using a bool
to record this state and a base::Lock to synchronize reading/writing from the
bool has performance almost as bad as the global lock.
BUG=724651
Review-Url: https://codereview.chromium.org/2890363003
Cr-Commit-Position: refs/heads/master@{#473696}
Committed: https://chromium.googlesource.com/chromium/src/+/bd599af5d33647f0b498311eab05be79fac98e63
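As a rough sketch of the scheme the description outlines (this is not the actual Chromium code: std::mutex and std::atomic stand in for base::Lock and base::subtle::Atomic32, and the class and method names below are hypothetical), allocations are distributed across 64 independently locked registers, with a lock-free atomic flag on the hot path:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Sketch: allocations are spread across kShardCount independent maps, each
// guarded by its own lock, so threads contend only when they hash to the
// same shard instead of on one global lock.
constexpr size_t kShardCount = 64;

class ShardedRegisterSketch {
 public:
  void SetEnabled(bool enabled) {
    // Release store: a thread that observes enabled == true also observes
    // any initialization performed before this store.
    enabled_.store(enabled, std::memory_order_release);
  }

  bool is_enabled() const {
    // Lock-free fast path, checked on every allocation.
    return enabled_.load(std::memory_order_acquire);
  }

  void Insert(const void* address, size_t size) {
    Shard& shard = ShardFor(address);
    std::lock_guard<std::mutex> lock(shard.lock);
    shard.allocations[address] = size;
  }

  void Remove(const void* address) {
    Shard& shard = ShardFor(address);
    std::lock_guard<std::mutex> lock(shard.lock);
    shard.allocations.erase(address);
  }

  size_t SizeOf(const void* address) {
    Shard& shard = ShardFor(address);
    std::lock_guard<std::mutex> lock(shard.lock);
    auto it = shard.allocations.find(address);
    return it == shard.allocations.end() ? 0 : it->second;
  }

 private:
  struct Shard {
    std::mutex lock;
    std::unordered_map<const void*, size_t> allocations;
  };

  Shard& ShardFor(const void* address) {
    // Shard by the allocation address, as the CL does. Skip the low bits,
    // which are mostly zero due to allocator alignment.
    uintptr_t bits = reinterpret_cast<uintptr_t>(address);
    return shards_[(bits >> 4) % kShardCount];
  }

  std::atomic<bool> enabled_{false};
  Shard shards_[kShardCount];
};
```

Note that, as the review below discusses, sharding by address only helps the allocations table; similar backtraces may still be inserted into many shards.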
Patch Set 1
Patch Set 2: Fix performance.
Patch Set 3: Fixes.
Patch Set 4: clang format.
Patch Set 5: comment from primiano.
Total comments: 20
Patch Set 6: Comments from primiano.
Patch Set 7: Clean up.
Total comments: 4
Patch Set 8: comments from primiano, thakis.

Messages
Total messages: 66 (38 generated)
The CQ bit was checked by erikchen@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
erikchen@chromium.org changed reviewers: + primiano@chromium.org
primiano: Please review.
On 2017/05/20 01:42:17, erikchen wrote: > primiano: Please review.

Sorry for the incomplete review but, as usual, I am on the runway with roaming only. Quickly glanced and seems good. The only modification that I would do is to move the atomic inside the sharded register to avoid having to pull it into the 3 clients. This will make this review less controversial (pulling base/atomics in blink). What I would do is:
- always instantiate the sharded register in the 3 MDPs, regardless of heap profiling being on. However, make the ctor itself a no-op.
- add an Initialize method to the sharded class. Invoke it in OnHeapProfilerEnabled. This method will do the actual expensive creation of the tables.
At this point the atomic can live entirely within the sharded class, and you can avoid that replicated complexity in the 3 clients.

Also, plz use a boring array instead of a vector. I know that you reserved that but I don't like to have to trust stl if a simpler option is available (you know me, I have trust issues)
Dry run: Try jobs failed on following builders: win_chromium_rel_ng on master.tryserver.chromium.win (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_rel_...)
On 2017/05/20 02:52:44, Primiano (slow - traveling) wrote: > On 2017/05/20 01:42:17, erikchen wrote: > > primiano: Please review. > > Sorry for the incomplete review but, as usual, I am on the runway with roaming > only. > Quickly glanced and seems good. The only modification that I would do is to move > the atomic inside the sharded register to avoid having to pull it into the 3 > clients. > This will make this review less controversial (pulling base/atomics in blink) > What I would do is : > - always instantiate the sharded register in the 3 mdp, regardless of heap > profiling being on. However male the ctor itself a no-op. > - add an initialize method to the sharded class. Invoke it > onheapprofilerenabled. This method will do the actual expensive creation of the > tables. > At this point the atomic can live entirely within the sharded class, and you can > avoid that replicated complexity in the 3 clients. Done. PTAL > > Also, plz use a boring array instead of a vector. I know that you reserved that > but I don't like to have to trust stl if a simpler option is available (you know > me, I have trust issues) Sorry, calling premature optimization on this one. The performance cost of vector vs array is minimal compared to the cost of grabbing the stack trace, storing it, etc. This is outweighed by the clarity of having a vector [automatic destructor] vs array.
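A minimal sketch of the pattern adopted here (hypothetical names; std::atomic stands in for base's subtle atomics): the object is always constructed, but construction is a no-op, and the expensive table creation is deferred to an Initialize() that early-outs when already done.

```cpp
#include <atomic>
#include <cstddef>
#include <memory>
#include <mutex>
#include <unordered_map>

// Sketch: cheap construction, lazy initialization. The atomic flag lives
// entirely inside the class, so clients never touch atomics directly.
class LazyRegisterSketch {
 public:
  LazyRegisterSketch() = default;  // no-op: no tables allocated yet

  void Initialize() {
    // Early-out if already initialized, so callers can invoke this
    // unconditionally when heap profiling is turned on.
    if (IsInitialized())
      return;
    tables_.reset(new Tables());  // the actual expensive creation
    initialized_.store(true, std::memory_order_release);
  }

  bool IsInitialized() const {
    return initialized_.load(std::memory_order_acquire);
  }

 private:
  struct Tables {
    std::mutex lock;
    std::unordered_map<const void*, size_t> allocations;
  };

  std::atomic<bool> initialized_{false};
  std::unique_ptr<Tables> tables_;
};
```

This keeps the replicated enabled-check logic out of the three dump providers, as suggested above.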
Dry run: Try jobs failed on following builders: win_chromium_x64_rel_ng on master.tryserver.chromium.win (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_x64_...)
On 2017/05/20 17:26:42, erikchen wrote: > Sorry, calling premature optimization on this one. The performance cost of > vector vs array is minimal compared to the cost of grabbing the stack trace, > storing it, etc.

It's not premature optimization; it's just: don't trust and don't use things if you don't need them.

> This is outweighed by the clarity of having a vector [automatic destructor] vs array.

Not sure I follow, maybe I am missing something. What I was suggesting is a std::unique_ptr<RegisterAndLock[]>. Wouldn't that have the same exact semantics, same clarity, but pull in less complexity?
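For illustration, the suggested alternative might look like the following (the contents of RegisterAndLock are hypothetical placeholders): std::unique_ptr<T[]> still gives automatic destruction, like std::vector, but with a fixed size and none of vector's capacity and growth machinery.

```cpp
#include <cstddef>
#include <memory>
#include <mutex>

// Hypothetical stand-in for the per-shard state discussed in the review.
struct RegisterAndLock {
  std::mutex lock;
  int placeholder = 0;  // register state would live here
};

constexpr size_t kShardCount = 64;

// A heap array with automatic cleanup: unique_ptr's array specialization
// calls delete[] on destruction, so no manual teardown is needed.
std::unique_ptr<RegisterAndLock[]> MakeShards() {
  return std::unique_ptr<RegisterAndLock[]>(new RegisterAndLock[kShardCount]);
}
```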
dskiba@chromium.org changed reviewers: + dskiba@chromium.org
Couple of thoughts:

First, it seems that three parameters (2 hash table capacities and a shard count) need to be configured in one place. I would move them all into sharded_allocation_register.h - that way you could also statically preallocate space for allocation_registers_.

The second thing is that AllocationRegister inserts into two hash tables (allocations, backtraces), but we shard only by allocation address - i.e. we probably end up inserting similar backtraces into many registers.

And maybe we don't need locks at all - in some of my experiments I saw that most pointers are allocated / freed mostly on the same thread. I.e. how about a per-thread AllocationRegister? Or, more specifically, a per-thread allocations hash table + sharded backtraces hash table?
On 2017/05/22 15:55:22, DmitrySkiba wrote:
> First, it seems that three parameters (2 hash table capacities and a shard count) need to be configured in one place. I would move them all into sharded_allocation_register.h - that way you could also statically preallocate space for allocation_registers_.

Right now the size of the FixedHashMap is a template parameter. This means that while it's possible to make the size dynamically set by another TU, it would be a non-trivial refactor. I'm not inclined to do that right now - the comments make it abundantly clear that ShardCount must be paired with bucket/capacity numbers.

> The second thing is that AllocationRegister inserts into two hash tables (allocations, backtraces), but we shard only by allocation address - i.e. we probably end up inserting similar backtraces into many registers.

Yup, this is true. Given that this is an order of magnitude improvement in performance, I'm not particularly concerned with additional memory bloat.

> And maybe we don't need locks at all - in some of my experiments I saw that most pointers are allocated / freed mostly on the same thread. I.e. how about a per-thread AllocationRegister? Or, more specifically, a per-thread allocations hash table + sharded backtraces hash table?

We discussed this. There's no easy solution to pointers freed on a different thread.
https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/malloc... File base/trace_event/malloc_dump_provider.cc (right): https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/malloc... base/trace_event/malloc_dump_provider.cc:307: { at this point this {} scope delimiter is not needed anymore. https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/malloc... base/trace_event/malloc_dump_provider.cc:308: if (allocation_register_) { this condition is always going to evaluate to true, as you populate the unique_ptr on ctor. I guess you want a if(...IsEnabled()) ? https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/malloc... base/trace_event/malloc_dump_provider.cc:337: if (!allocation_register_->IsInitialized()) just call Initialize and early out in its implementation if IsInitialized already? https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/malloc... File base/trace_event/malloc_dump_provider.h (right): https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/malloc... base/trace_event/malloc_dump_provider.h:53: std::unique_ptr<ShardedAllocationRegister> allocation_register_; there is no need for a unique_ptr anymore here now, this could just be a field, right? https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/sharde... File base/trace_event/sharded_allocation_register.cc (right): https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/sharde... base/trace_event/sharded_allocation_register.cc:21: size_t HashAddress(const void* address) { since this is copy-pasted from a header that is included here (well in the .h file) can you just reuse AddressHasher ? If you are worried about perf, you can move it to a static inline HashAddress function in heap_profiler_allocation_register.h and use in both. 
Feels a bit odd that we have to have two identical copy-pasted functions in two files that include each other to achieve the same goal (hash an address) https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/sharde... base/trace_event/sharded_allocation_register.cc:37: ral.reset(new RegisterAndLock); doesn't make a difference here (so I don't care about what you do in this line) but as a general suggestion, get used to () around new. If RegisterAndLock was POD this would have skipped the initialization. (this is one of the arguments why people on chromium-dev push for MakeUnique, because it enforces the initialization) https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/sharde... base/trace_event/sharded_allocation_register.cc:44: void ShardedAllocationRegister::SetEnabled(bool enabled) { I would remove this method and do this in Initialize(). This will remove a potential bug if the client does the SetEnabled(), Initialize() in the wrong order. Also there doesn't seem to be a valid semantic for doing SetEnabled on its own. So I would have only two methods (or one) in total, not 4. Up to you if you want to call them: - SetEnabled() (which would do the initialization also) and SetDisabled() - SetEnabled(bool), where the "true" path does also the initialization https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/sharde... base/trace_event/sharded_allocation_register.cc:49: return base::subtle::Acquire_Load(&enabled_); I'd move this to the header file and turn it into is_enabled(). Otherwise this introduces an extra call to do the check that: (i) is not needed at all; (ii) wasn't there before this CL; (iii) shouldn't emit any more code than a function call as this boils down to a read op. https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/sharde... File base/trace_event/sharded_allocation_register.h (right): https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/sharde... 
base/trace_event/sharded_allocation_register.h:35: bool IsEnabled() const; this could be an in-line is_enabled() const { return base::subtle..... } so it gets inlined in the caller. (but fine even without if you have any reason against) https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/sharde... base/trace_event/sharded_allocation_register.h:55: using MetricsMap = std::unordered_map<AllocationContext, AllocationMetrics>; nit: structs and type declarations should go on the top of the section, move to line 28. practically this confused me a bunch of times because line 55 above has AllocationMetrics, but that is declared in another file. this OutputMetrics is declared in an adjacent line but is used only in line 65. Hence moving them up solves the confusion by not having it adjacent to something that seems similar.
primiano: PTAL https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/malloc... File base/trace_event/malloc_dump_provider.cc (right): https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/malloc... base/trace_event/malloc_dump_provider.cc:307: { On 2017/05/22 16:37:50, Primiano Tucci wrote: > at this point this {} scope delimiter is not needed anymore. Done. https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/malloc... base/trace_event/malloc_dump_provider.cc:308: if (allocation_register_) { On 2017/05/22 16:37:50, Primiano Tucci wrote: > this condition is always going to evaluate to true, as you populate the > unique_ptr on ctor. I guess you want a if(...IsEnabled()) ? The condition is already checked on line 290. Removed this line. https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/malloc... base/trace_event/malloc_dump_provider.cc:337: if (!allocation_register_->IsInitialized()) On 2017/05/22 16:37:50, Primiano Tucci wrote: > just call Initialize and early out in its implementation if IsInitialized > already? Done. https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/malloc... File base/trace_event/malloc_dump_provider.h (right): https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/malloc... base/trace_event/malloc_dump_provider.h:53: std::unique_ptr<ShardedAllocationRegister> allocation_register_; On 2017/05/22 16:37:50, Primiano Tucci wrote: > there is no need for a unique_ptr anymore here now, this could just be a field, > right? Done. https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/sharde... File base/trace_event/sharded_allocation_register.cc (right): https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/sharde... 
base/trace_event/sharded_allocation_register.cc:21: size_t HashAddress(const void* address) { On 2017/05/22 16:37:51, Primiano Tucci wrote: > since this is copy-pasted from a header that is included here (well in the .h > file) can you just reuse AddressHasher ? If you are worried about perf, you can > move it to a static inline HashAddress function in > heap_profiler_allocation_register.h and use in both. > Feels a bit odd that we have to have two identical copy-pasted function in two > files that include each other to achieve the same goal (hash an address) Done. https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/sharde... base/trace_event/sharded_allocation_register.cc:37: ral.reset(new RegisterAndLock); On 2017/05/22 16:37:51, Primiano Tucci wrote: > doesn't make a difference here (so I don't care about what you do in this line) > but as a general suggestion, get used to () around new. If RegisterAndLock was > POD this would have skipped the initialization. > (this is one of the arguments why people on chromium-dev push for MakeUnique, > because it enforces the initialization) Done. https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/sharde... base/trace_event/sharded_allocation_register.cc:44: void ShardedAllocationRegister::SetEnabled(bool enabled) { Done. [SetEnabled and SetDisabled] https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/sharde... base/trace_event/sharded_allocation_register.cc:49: return base::subtle::Acquire_Load(&enabled_); On 2017/05/22 16:37:51, Primiano Tucci wrote: > I'd make move this to the header file and turn into is_enabled(). This will > introduce an extra call to do the check that: (i) is not needed at all; (ii) > wasn't there before this cl; (iii) shouldn't emit any more code than a function > call as this boils down to a read op. Done. https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/sharde... 
File base/trace_event/sharded_allocation_register.h (right): https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/sharde... base/trace_event/sharded_allocation_register.h:35: bool IsEnabled() const; On 2017/05/22 16:37:51, Primiano Tucci wrote: > this could be an in-line is_enabled() const { return base::subtle..... } so it > gets inlined in the caller. > (but fine even without if you have any reason against) Done. https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/sharde... base/trace_event/sharded_allocation_register.h:55: using MetricsMap = std::unordered_map<AllocationContext, AllocationMetrics>; On 2017/05/22 16:37:51, Primiano Tucci wrote: > nit: structs and type declarations should go on the top of the section, move to > line 28. > practically this confused me a bunch of times because line 55 above has > AllocationMetrics, but that is declared in another file. this OutputMetrics is > declared in an adjacent line but is used only in line 65. Hence moving them up > solves the confusion by not having it adjacent to something that seems similar. Done.
On 2017/05/22 16:24:52, erikchen wrote:
> Right now the size of the FixedHashMap is a template parameter. This means that while it's possible to make the size dynamically set by another TU, it would be a non-trivial refactor. I'm not inclined to do that right now - the comments make it abundantly clear that ShardCount must be paired with bucket/capacity numbers.

I see. BTW, why do we need the ShardedAllocationRegister thing at all - why not move locks into AllocationRegister? I.e. inside AllocationRegister you would have TablesAndLock with both hash tables + lock. In my view either ShardedAllocationRegister or AllocationRegister needs to go - we don't need to have both of them.

> Yup, this is true. Given that this is an order of magnitude improvement in performance, I'm not particularly concerned with additional memory bloat.

Regarding the improvement - can you add a perf test that would demonstrate the performance problem? It will be useful to make sure future changes don't regress the performance.
Cool, thanks a lot for the patience. Now that we are done with nits and renames, let me express some more high level comments. I am not fully sure that this is the right architecture to achieve this sharding. How about ... I am joking (I always hated those sort of codereview cycles). LGTM https://codereview.chromium.org/2890363003/diff/120001/base/trace_event/shard... File base/trace_event/sharded_allocation_register.cc (right): https://codereview.chromium.org/2890363003/diff/120001/base/trace_event/shard... base/trace_event/sharded_allocation_register.cc:16: size_t ShardCount = 1; this should have been a const size_t kShardCount, but I spotted it too late. However, not worth another re-upload & CQ cycle just for this. Change it only if you plan to make some other change to this file, otherwise we'll live with it, or change the next time we touch this file again, if ever.
The CQ bit was unchecked by erikchen@chromium.org
The CQ bit was checked by erikchen@chromium.org
CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by erikchen@chromium.org
The CQ bit was checked by erikchen@chromium.org to run a CQ dry run
On 2017/05/22 17:37:55, Primiano Tucci wrote: > Cool, thanks a lot for the patience. > Now that we are done with nits and renames, let me express some more high level > comments. I am not fully sure that this is the right architecture to achieve > this sharding. How about ... > > I am joking (I always hated those sort of codereview cycles). > LGTM Well, at least we should add a perf test for this. Otherwise we'll spend more time in the future when (not if) AllocationRegister is changed again. > > https://codereview.chromium.org/2890363003/diff/120001/base/trace_event/shard... > File base/trace_event/sharded_allocation_register.cc (right): > > https://codereview.chromium.org/2890363003/diff/120001/base/trace_event/shard... > base/trace_event/sharded_allocation_register.cc:16: size_t ShardCount = 1; > this should have been a const size_t kShardCount, but I spotted it too late. > However, not worth another re-upload & CQ cycle just for this. > > Change it only if you plan to make some other change to this file, otherwise > we'll live with it, or change the next time we touch this file again, if ever.
Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
On 2017/05/22 17:36:05, DmitrySkiba wrote: > I see. BTW, why do we need ShardedAllocationRegister thing at all - why not move > locks into AllocationRegister? I.e. inside AllocationRegister you would have > TablesAndLock with both hash tables + lock. > In my view either ShardedAllocationRegister or AllocationRegister needs to go - > we don't need to have both of them. Erik and I discussed this offline; I had a similar suggestion. Doing what you say would make sense but be non-trivial, because it means redesigning the freelists (and having N of them). The reality is that, right now, this is something simple (the Sharded class) which wraps something tested that has reliably worked (at least as far as data storage and thread safety are concerned) for a year (the AllocationRegister). We could definitely refactor this, but that means exposing ourselves to a risk in terms of crashes / having to debug thread-safe code. It is not impossible to deal with that, but it comes with a cost. Given all the things on our plates, I'd like to iterate by composition right now and move that refactoring cost to a later stage. Right now our P0 is having a heap profiler that works and is perf-efficient. I'd file a low-prio bug to track the possible future improvements. Definitely, by having a single instance we could be more memory-efficient and have only one, perhaps smaller, pool of cells and use them more efficiently (maybe). But, until we can prove that it causes a problem, that is a lower-prio refactoring.
erikchen@chromium.org changed reviewers: + thakis@chromium.org
thakis: Please review third_party/WebKit/Source/platform/*
On 2017/05/22 17:44:59, Primiano Tucci wrote: > On 2017/05/22 17:36:05, DmitrySkiba wrote: > > I see. BTW, why do we need ShardedAllocationRegister thing at all - why not > move > > locks into AllocationRegister? I.e. inside AllocationRegister you would have > > TablesAndLock with both hash tables + lock. > > In my view either ShardedAllocationRegister or AllocationRegister needs to go > - > > we don't need to have both of them. > > Erik and I discussed this offline, I had some similar suggestion. Doing what you > say would make sense but be non trivial, because means redesigning the freelists > (and having N of them)0. > The reality is that, right now, this is something simple (the Sharded class) > which wraps something test that has reliably worked (at least as long as data > storage and thread safety are concerned) for a year (the AllocationRegister). > We could definitely refactor this, but that means exposing ourself to a risk in > term crashes / having to debug thread-safe code. It is not impossible to deal > with that but comes with a cost. > Given all the things on our plates, I like to iterate by composition right now > and move that refactoring cost to a later stage. > Right now our P0 is having a heap profiler that works and is perf efficient. > I'd file a low-prio bug to track the possible future improvements. Definitely by > having a single instance we could be more memory-efficient and have only one, > perhaps smaller, pool of cells and use them more efficiently (maybe). > But, until we can prove that it causes a problem, that is a lower-prio > refactoring. This is not what I was suggesting. 
Right now we have this:

AllocationRegister {
  HashMap allocations;
  HashMap backtraces;
};

ShardedAllocationRegister {
  RegisterAndLock {
    AllocationRegister register;
    Lock lock;
  };
  RegisterAndLock allocation_registers_[];
}

I'm proposing to change it to:

AllocationRegister {
  TablesAndLock {
    HashMap allocations;
    HashMap backtraces;
    Lock lock;
  };
  TablesAndLock tables_[];
};

Equally easy, but without an extra thing on the side.
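For illustration only, the address-based sharding that both layouts rely on can be sketched in standalone C++. This is a hedged sketch, not the actual base/trace_event code: `ShardedRegister`, its members, and the shard-selection hash are hypothetical stand-ins, and a plain `std::unordered_map` stands in for the fixed-size hash maps.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Hypothetical sketch of sharding by allocation address. Each shard pairs
// its own table with its own lock, so two threads contend only when their
// addresses hash to the same shard.
constexpr size_t kShardCount = 64;  // Power of two, as in the CL.

struct Shard {
  std::mutex lock;
  std::unordered_map<const void*, size_t> allocations;  // address -> size
};

class ShardedRegister {
 public:
  void Insert(const void* address, size_t size) {
    Shard& shard = ShardFor(address);
    std::lock_guard<std::mutex> hold(shard.lock);
    shard.allocations[address] = size;
  }

  void Remove(const void* address) {
    Shard& shard = ShardFor(address);
    std::lock_guard<std::mutex> hold(shard.lock);
    shard.allocations.erase(address);
  }

  // Sums entries across all shards; for illustration/testing only.
  size_t CountForTesting() {
    size_t total = 0;
    for (Shard& shard : shards_) {
      std::lock_guard<std::mutex> hold(shard.lock);
      total += shard.allocations.size();
    }
    return total;
  }

 private:
  Shard& ShardFor(const void* address) {
    // Discard the low bits first: allocator alignment tends to zero them,
    // which would otherwise skew the shard distribution.
    uintptr_t key = reinterpret_cast<uintptr_t>(address) >> 4;
    return shards_[key & (kShardCount - 1)];
  }

  std::array<Shard, kShardCount> shards_;
};
```

Note that, as pointed out earlier in the review, sharding only by address means similar backtraces still land in many shards; this sketch models only the allocation table, not the backtrace table.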
On 2017/05/22 17:44:47, DmitrySkiba wrote: > On 2017/05/22 17:37:55, Primiano Tucci wrote: > > Cool, thanks a lot for the patience. > > Now that we are done with nits and renames, let me express some more high > level > > comments. I am not fully sure that this is the right architecture to achieve > > this sharding. How about ... > > > > I am joking (I always hated those sort of codereview cycles). > > LGTM > > Well, at least we should add a perf test for this. Otherwise we'll spend more > time in the future when (not if) AllocationRegister is changed again. I will LG any CL that adds a perf test to this (in that case tracing_perftests already has the right harness and is hooked up to the perf dashboard). That, however, shouldn't be a blocker for this CL. Erik here took something that doesn't have a perf test and made it like 2x faster on Mac. Requiring a perf test for this means discouraging any future attempt at improving things :) I'd definitely be on board with requiring a perf test if this were a controversial micro-optimization. But having data that shows contention (which is pretty expected, given that we have 40 threads and do 400k allocations/second) and a sharding plan to reduce the lock domain makes the plan quite sensible. Maybe I am overlooking something; is there anything here that realistically looks like it could regress performance?
On 2017/05/22 17:49:17, DmitrySkiba wrote: > On 2017/05/22 17:44:59, Primiano Tucci wrote: > > On 2017/05/22 17:36:05, DmitrySkiba wrote: > > > I see. BTW, why do we need ShardedAllocationRegister thing at all - why not > > move > > > locks into AllocationRegister? I.e. inside AllocationRegister you would have > > > TablesAndLock with both hash tables + lock. > > > In my view either ShardedAllocationRegister or AllocationRegister needs to > go > > - > > > we don't need to have both of them. > > > > Erik and I discussed this offline, I had some similar suggestion. Doing what > you > > say would make sense but be non trivial, because means redesigning the > freelists > > (and having N of them)0. > > The reality is that, right now, this is something simple (the Sharded class) > > which wraps something test that has reliably worked (at least as long as data > > storage and thread safety are concerned) for a year (the AllocationRegister). > > We could definitely refactor this, but that means exposing ourself to a risk > in > > term crashes / having to debug thread-safe code. It is not impossible to deal > > with that but comes with a cost. > > Given all the things on our plates, I like to iterate by composition right now > > and move that refactoring cost to a later stage. > > Right now our P0 is having a heap profiler that works and is perf efficient. > > I'd file a low-prio bug to track the possible future improvements. Definitely > by > > having a single instance we could be more memory-efficient and have only one, > > perhaps smaller, pool of cells and use them more efficiently (maybe). > > But, until we can prove that it causes a problem, that is a lower-prio > > refactoring. > > This is not what I was suggesting. 
> Right now we have this:
>
> AllocationRegister {
>   HashMap allocations;
>   HashMap backtraces;
> };
>
> ShardedAllocationRegister {
>   RegisterAndLock {
>     AllocationRegister register;
>     Lock lock;
>   };
>   RegisterAndLock allocation_registers_[];
> }
>
> I'm proposing to change it to:
>
> AllocationRegister {
>   TablesAndLock {
>     HashMap allocations;
>     HashMap backtraces;
>     Lock lock;
>   };
>   TablesAndLock tables_[];
> };
>
> Equally easy, but without an extra thing on the side.

Yeah, that would have meant more code shuffling, more risk of breaking something, and a larger review. Given that this CL is in shape and green, at this point the refactoring might not justify its cost, as it would have no functional benefit.
On 2017/05/22 17:49:39, Primiano Tucci wrote: > On 2017/05/22 17:44:47, DmitrySkiba wrote: > > On 2017/05/22 17:37:55, Primiano Tucci wrote: > > > Cool, thanks a lot for the patience. > > > Now that we are done with nits and renames, let me express some more high > > level > > > comments. I am not fully sure that this is the right architecture to achieve > > > this sharding. How about ... > > > > > > I am joking (I always hated those sort of codereview cycles). > > > LGTM > > > > Well, at least we should add a perf test for this. Otherwise we'll spend more > > time in the future when (not if) AllocationRegister is changed again. > > I will LG any CL that adds a perf test to this (in that case tracing_preftests > has already the right harness and is hooked to the perf dashboard). > That, however, shouldn't be a blocker for this CL. Erik here took something that > doesn't have a perf test and made it like 2x faster on mac. Requiring a > perf-test for this means discouraging any future attempt of improving things :) > I'd definitely onboard with requiring a perftest, if this was about a > controversial micro-optimization. But, having the data that shows contention > (which is pretty expected, given that we have 40 threads and do 400k > allocations/seconds) and a sharding plan to reduce the lock domain, makes the > plan quite sensible. Maybe I am overlooking something, is there anything here > that realistically looks like could regress performance? Yes, so the burden to write tests falls to the next person who touches the code. Erik used Instruments to assess the performance, so a change to accommodate some Windows-specific need would have to make sure the code doesn't regress on macOS, using a macOS-specific tool. Perf tests also serve another function - they outline scenarios. I believe that the change in this CL is an improvement, but I don't exactly understand the extent of it. How many threads need to be contending to show the improvement? Why is the shard count 64?
Maybe 32 works better? Or 128? This is impossible to answer without a perf test.
On 2017/05/22 17:56:50, DmitrySkiba wrote: > Yes, so the burden to write tests falls to the next person who touches the code. I am not sure I follow. Lots of people, including you, have touched code in the heap profiler and we lived all this time without a perf test. Why will the next person suddenly require a perf test? Again, it's desirable, but I don't see how this is a requirement. > Erik used Instruments to assess the performance, so a change to accommodate some > Windows-specific need would have to make sure the code doesn't regress on macOS, > using macOS-specific tool. It's all based on realism and pragmatism. This is just introducing sharding and reducing the lock contention domain, which I expect not to be OS-specific. Can you give a realistic example of how making a change on !Mac would regress performance on Mac because of this? How does this CL make that statement more true than the state before this CL? > Perf tests also serve another function - they outline scenarios. I believe that > the change in this CL is an improvement, but I don't exactly understand the > extent of it. Isn't the commit message explaining it? Again, I agree that a perf test is nice to have, but I don't see how it is a necessity now (as opposed to all the other CLs that this code has seen). > How many threads need to be contending to show the improvement? > Why sharding count is 64? Maybe 32 works better? Or 128? I asked Erik to check with Instruments what was a reasonable number that kept the improvement, and he came up with 64. If you come along with a CL whose commit message says "actually, on Windows it looks like 32 is a better number because of this measurement, so adding an #ifdef there", I will LG your CL as well :)
https://codereview.chromium.org/2890363003/diff/120001/third_party/WebKit/Sou... File third_party/WebKit/Source/platform/PartitionAllocMemoryDumpProvider.cpp (right): https://codereview.chromium.org/2890363003/diff/120001/third_party/WebKit/Sou... third_party/WebKit/Source/platform/PartitionAllocMemoryDumpProvider.cpp:201: if (!allocation_register_.is_enabled()) I assume it's not a problem that is_enabled() and Insert() below are no longer protected by the same lock as an atomic unit (also below, also in the other file).
Durr, forgot to say "webkit/platform lgtm"
https://codereview.chromium.org/2890363003/diff/120001/third_party/WebKit/Sou... File third_party/WebKit/Source/platform/PartitionAllocMemoryDumpProvider.cpp (right): https://codereview.chromium.org/2890363003/diff/120001/third_party/WebKit/Sou... third_party/WebKit/Source/platform/PartitionAllocMemoryDumpProvider.cpp:201: if (!allocation_register_.is_enabled()) On 2017/05/22 18:48:34, Nico wrote: > I assume it's not a problem that is_enabled() and Insert() below are no longer > protected by the same lock as an atomic unit (also below, also in the other > file). Right. If our reasoning is correct, is_enabled is a flag that can only transition false -> true, and when it does, the allocation_register_ is initialized beforehand and the flag is set with release semantics (the check here has acquire semantics). The true -> false transition doesn't really tear down the register (we don't even have a path to disable the heap profiler today), so |allocation_register_| will be safe to access even in the case of hitting the true -> false race (if ever).
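The false -> true-only transition described here can be sketched with standard C++ atomics, using std::atomic with explicit memory orderings as a stand-in for base::subtle::Atomic32's Release_Store/Acquire_Load. Class and member names below are illustrative, not the real Chromium code:

```cpp
#include <atomic>

// Illustrative sketch: the enable flag transitions only false -> true,
// and the release/acquire pairing guarantees that any thread observing
// is_enabled() == true also observes the fully initialized register state.
class RegisterSketch {
 public:
  void Enable() {
    initialized_state_ = 42;  // Stand-in for setting up the register.
    // Publish the flag last, with release semantics, so the store above
    // is visible to any reader that sees enabled_ == true.
    enabled_.store(true, std::memory_order_release);
  }

  bool is_enabled() const {
    // Acquire pairs with the release store in Enable().
    return enabled_.load(std::memory_order_acquire);
  }

  int state() const { return initialized_state_; }

 private:
  std::atomic<bool> enabled_{false};
  int initialized_state_ = 0;
};
```

Because there is no true -> false teardown path, a reader that checks is_enabled() and then touches the register without a lock never observes a destroyed register, which is exactly the property being relied on in the comment above.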
On Mon, May 22, 2017 at 2:51 PM, <primiano@chromium.org> wrote:
> third_party/WebKit/Source/platform/PartitionAllocMemoryDumpProvider.cpp:201: if (!allocation_register_.is_enabled())
> On 2017/05/22 18:48:34, Nico wrote:
> > I assume it's not a problem that is_enabled() and Insert() below are no longer protected by the same lock as an atomic unit (also below, also in the other file).
>
> right. if our reasoning is correct, is_enabled is a flag that can only transition false -> true, and when it does, the allocation_register_ is initialized before and the flag is set with release semantics (The check here has acquire semantics).
> The true -> false transition doesn't really tear down the register (we don't even have a path to disable the heap profiler today), so that |allocation_register_| will be safe to access even in case of hitting the true -> false race (if ever)

Sounds like this might be subtle enough to warrant a comment?
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
https://codereview.chromium.org/2890363003/diff/120001/base/trace_event/shard... File base/trace_event/sharded_allocation_register.cc (right): https://codereview.chromium.org/2890363003/diff/120001/base/trace_event/shard... base/trace_event/sharded_allocation_register.cc:16: size_t ShardCount = 1; On 2017/05/22 17:37:55, Primiano Tucci wrote: > this should have been a const size_t kShardCount, but I spotted it too late. > However, not worth another re-upload & CQ cycle just for this. > > Change it only if you plan to make some other change to this file, otherwise > we'll live with it, or change the next time we touch this file again, if ever. Done.
> > > Sounds like this might be subtle enough to warrant a comment?

Done.
The CQ bit was checked by erikchen@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from primiano@chromium.org, thakis@chromium.org Link to the patchset: https://codereview.chromium.org/2890363003/#ps140001 (title: "comments from primiano, thakis.")
CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
On 2017/05/22 19:16:21, commit-bot: I haz the power wrote: > CQ is trying da patch. > > Follow status at: > https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or... I spoke with dskiba@. We agreed that the current design is acceptable, and that we should have a perf test. I filed https://bugs.chromium.org/p/chromium/issues/detail?id=725174 against myself.
CQ is committing da patch. Bot data: {"patchset_id": 140001, "attempt_start_ts": 1495480553991550, "parent_rev": "1c38965c165d21706028861200fab06b1eca0bea", "commit_rev": "bd599af5d33647f0b498311eab05be79fac98e63"}
Message was sent while issue was closed.
Description was changed from ========== Enable sharding of AllocationRegister on desktop. Previously, when native heap profiling was enabled, all allocations in a process were gated by a global lock. This proves to be a significant performance hit on macOS. Profiling shows that 50-75% of all wall time is spent waiting on this lock. This CL introduces a new class ShardedAllocationRegister to handle sharding for desktop platforms. This stores allocation/backtrace information across 64 different AllocationRegister instances. In addition, the sizes of the fixed size hash maps were changed. The number of allocation buckets was reduced by a factor of 8, and the number of backtrace buckets reduced by a factor of 16. The new class ShardedAllocationRegister is thread-safe, and its consumers no longer need to acquire a lock when using the container. Each consumer still needs to know whether heap profiling is enabled. This CL uses a subtle::Atomic32 along with Acquire_Load and Release_Store to determine this state. Using a bool to record this state and a base::Lock to synchronize reading/writing from the bool has performance almost as bad as the global lock. BUG=724651 ========== to ========== Enable sharding of AllocationRegister on desktop. Previously, when native heap profiling was enabled, all allocations in a process were gated by a global lock. This proves to be a significant performance hit on macOS. Profiling shows that 50-75% of all wall time is spent waiting on this lock. This CL introduces a new class ShardedAllocationRegister to handle sharding for desktop platforms. This stores allocation/backtrace information across 64 different AllocationRegister instances. In addition, the sizes of the fixed size hash maps were changed. The number of allocation buckets was reduced by a factor of 8, and the number of backtrace buckets reduced by a factor of 16. 
The new class ShardedAllocationRegister is thread-safe, and its consumers no longer need to acquire a lock when using the container. Each consumer still needs to know whether heap profiling is enabled. This CL uses a subtle::Atomic32 along with Acquire_Load and Release_Store to determine this state. Using a bool to record this state and a base::Lock to synchronize reading/writing from the bool has performance almost as bad as the global lock. BUG=724651 Review-Url: https://codereview.chromium.org/2890363003 Cr-Commit-Position: refs/heads/master@{#473696} Committed: https://chromium.googlesource.com/chromium/src/+/bd599af5d33647f0b498311eab05... ==========
Message was sent while issue was closed.
Committed patchset #8 (id:140001) as https://chromium.googlesource.com/chromium/src/+/bd599af5d33647f0b498311eab05... |