Description

Enable sharding of AllocationRegister on desktop.
Previously, when native heap profiling was enabled, all allocations in a process
were gated by a global lock. This proves to be a significant performance hit on
macOS. Profiling shows that 50-75% of all wall time is spent waiting on this
lock.
This CL introduces a new class ShardedAllocationRegister to handle sharding for
desktop platforms. This stores allocation/backtrace information across 64
different AllocationRegister instances. In addition, the sizes of the fixed-size
hash maps were changed: the number of allocation buckets was reduced by a factor
of 8, and the number of backtrace buckets by a factor of 16.
The new class ShardedAllocationRegister is thread-safe, and its consumers no
longer need to acquire a lock when using the container. Each consumer still
needs to know whether heap profiling is enabled. This CL uses a subtle::Atomic32
along with Acquire_Load and Release_Store to determine this state. Using a bool
to record this state and a base::Lock to synchronize reading/writing from the
bool has performance almost as bad as the global lock.
BUG=724651
Review-Url: https://codereview.chromium.org/2890363003
Cr-Commit-Position: refs/heads/master@{#473696}
Committed: https://chromium.googlesource.com/chromium/src/+/bd599af5d33647f0b498311eab05be79fac98e63
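As a rough sketch of the scheme the description outlines (this is not the actual Chromium code: std::mutex and std::atomic stand in for base::Lock and base::subtle::Atomic32, and the class and method names below are hypothetical), allocations are distributed across 64 independently locked registers, with a lock-free atomic flag on the hot path:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Sketch: allocations are spread across kShardCount independent maps, each
// guarded by its own lock, so threads contend only when they hash to the
// same shard instead of on one global lock.
constexpr size_t kShardCount = 64;

class ShardedRegisterSketch {
 public:
  void SetEnabled(bool enabled) {
    // Release store: a thread that observes enabled == true also observes
    // any initialization performed before this store.
    enabled_.store(enabled, std::memory_order_release);
  }

  bool is_enabled() const {
    // Lock-free fast path, checked on every allocation.
    return enabled_.load(std::memory_order_acquire);
  }

  void Insert(const void* address, size_t size) {
    Shard& shard = ShardFor(address);
    std::lock_guard<std::mutex> lock(shard.lock);
    shard.allocations[address] = size;
  }

  void Remove(const void* address) {
    Shard& shard = ShardFor(address);
    std::lock_guard<std::mutex> lock(shard.lock);
    shard.allocations.erase(address);
  }

  size_t SizeOf(const void* address) {
    Shard& shard = ShardFor(address);
    std::lock_guard<std::mutex> lock(shard.lock);
    auto it = shard.allocations.find(address);
    return it == shard.allocations.end() ? 0 : it->second;
  }

 private:
  struct Shard {
    std::mutex lock;
    std::unordered_map<const void*, size_t> allocations;
  };

  Shard& ShardFor(const void* address) {
    // Shard by the allocation address, as the CL does. Skip the low bits,
    // which are mostly zero due to allocator alignment.
    uintptr_t bits = reinterpret_cast<uintptr_t>(address);
    return shards_[(bits >> 4) % kShardCount];
  }

  std::atomic<bool> enabled_{false};
  Shard shards_[kShardCount];
};
```

Note that, as the review below discusses, sharding by address only helps the allocations table; similar backtraces may still be inserted into many shards.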
Patch Set 1
Patch Set 2: Fix performance.
Patch Set 3: Fixes.
Patch Set 4: clang format.
Patch Set 5: comment from primiano.
Total comments: 20
Patch Set 6: Comments from primiano.
Patch Set 7: Clean up.
Total comments: 4
Patch Set 8: comments from primiano, thakis.

Messages
Total messages: 66 (38 generated)
The CQ bit was checked by erikchen@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
erikchen@chromium.org changed reviewers: + primiano@chromium.org
primiano: Please review.
On 2017/05/20 01:42:17, erikchen wrote: > primiano: Please review.

Sorry for the incomplete review but, as usual, I am on the runway with roaming only. Quickly glanced and seems good. The only modification that I would do is to move the atomic inside the sharded register to avoid having to pull it into the 3 clients. This will make this review less controversial (pulling base/atomics in blink). What I would do is:
- always instantiate the sharded register in the 3 MDPs, regardless of heap profiling being on. However, make the ctor itself a no-op.
- add an Initialize method to the sharded class. Invoke it in OnHeapProfilerEnabled. This method will do the actual expensive creation of the tables.
At this point the atomic can live entirely within the sharded class, and you can avoid that replicated complexity in the 3 clients.

Also, plz use a boring array instead of a vector. I know that you reserved that but I don't like to have to trust stl if a simpler option is available (you know me, I have trust issues)
Dry run: Try jobs failed on following builders: win_chromium_rel_ng on master.tryserver.chromium.win (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_rel_...)
On 2017/05/20 02:52:44, Primiano (slow - traveling) wrote: > On 2017/05/20 01:42:17, erikchen wrote: > > primiano: Please review. > > Sorry for the incomplete review but, as usual, I am on the runway with roaming > only. > Quickly glanced and seems good. The only modification that I would do is to move > the atomic inside the sharded register to avoid having to pull it into the 3 > clients. > This will make this review less controversial (pulling base/atomics in blink) > What I would do is : > - always instantiate the sharded register in the 3 mdp, regardless of heap > profiling being on. However male the ctor itself a no-op. > - add an initialize method to the sharded class. Invoke it > onheapprofilerenabled. This method will do the actual expensive creation of the > tables. > At this point the atomic can live entirely within the sharded class, and you can > avoid that replicated complexity in the 3 clients. Done. PTAL > > Also, plz use a boring array instead of a vector. I know that you reserved that > but I don't like to have to trust stl if a simpler option is available (you know > me, I have trust issues) Sorry, calling premature optimization on this one. The performance cost of vector vs array is minimal compared to the cost of grabbing the stack trace, storing it, etc. This is outweighed by the clarity of having a vector [automatic destructor] vs array.
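A minimal sketch of the pattern adopted here (hypothetical names; std::atomic stands in for base's subtle atomics): the object is always constructed, but construction is a no-op, and the expensive table creation is deferred to an Initialize() that early-outs when already done.

```cpp
#include <atomic>
#include <cstddef>
#include <memory>
#include <mutex>
#include <unordered_map>

// Sketch: cheap construction, lazy initialization. The atomic flag lives
// entirely inside the class, so clients never touch atomics directly.
class LazyRegisterSketch {
 public:
  LazyRegisterSketch() = default;  // no-op: no tables allocated yet

  void Initialize() {
    // Early-out if already initialized, so callers can invoke this
    // unconditionally when heap profiling is turned on.
    if (IsInitialized())
      return;
    tables_.reset(new Tables());  // the actual expensive creation
    initialized_.store(true, std::memory_order_release);
  }

  bool IsInitialized() const {
    return initialized_.load(std::memory_order_acquire);
  }

 private:
  struct Tables {
    std::mutex lock;
    std::unordered_map<const void*, size_t> allocations;
  };

  std::atomic<bool> initialized_{false};
  std::unique_ptr<Tables> tables_;
};
```

This keeps the replicated enabled-check logic out of the three dump providers, as suggested above.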
Dry run: Try jobs failed on following builders: win_chromium_x64_rel_ng on master.tryserver.chromium.win (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_x64_...)
On 2017/05/20 17:26:42, erikchen wrote: > Sorry, calling premature optimization on this one. The performance cost of > vector vs array is minimal compared to the cost of grabbing the stack trace, > storing it, etc.

It's not premature optimization; it's just: don't trust and don't use things if you don't need them.

> This is outweighed by the clarity of having a vector [automatic destructor] vs array.

Not sure I follow, maybe I am missing something. What I was suggesting is a std::unique_ptr<RegisterAndLock[]>. Wouldn't that have the same exact semantics, same clarity, but pull in less complexity?
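For illustration, the suggested alternative might look like the following (the contents of RegisterAndLock are hypothetical placeholders): std::unique_ptr<T[]> still gives automatic destruction, like std::vector, but with a fixed size and none of vector's capacity and growth machinery.

```cpp
#include <cstddef>
#include <memory>
#include <mutex>

// Hypothetical stand-in for the per-shard state discussed in the review.
struct RegisterAndLock {
  std::mutex lock;
  int placeholder = 0;  // register state would live here
};

constexpr size_t kShardCount = 64;

// A heap array with automatic cleanup: unique_ptr's array specialization
// calls delete[] on destruction, so no manual teardown is needed.
std::unique_ptr<RegisterAndLock[]> MakeShards() {
  return std::unique_ptr<RegisterAndLock[]>(new RegisterAndLock[kShardCount]);
}
```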
dskiba@chromium.org changed reviewers: + dskiba@chromium.org
Couple of thoughts:

First, it seems that three parameters (2 hash table capacities and a shard count) need to be configured in one place. I would move them all into sharded_allocation_register.h - that way you could also statically preallocate space for allocation_registers_.

The second thing is that AllocationRegister inserts into two hash tables (allocations, backtraces), but we shard only by allocation address - i.e. we probably end up inserting similar backtraces into many registers.

And maybe we don't need locks at all - in some of my experiments I saw that most pointers are allocated / freed mostly on the same thread. I.e. how about a per-thread AllocationRegister? Or, more specifically, a per-thread allocations hash table + sharded backtraces hash table?
On 2017/05/22 15:55:22, DmitrySkiba wrote:
> First, it seems that three parameters (2 hash table capacities and a shard count) need to be configured in one place. I would move them all into sharded_allocation_register.h - that way you could also statically preallocate space for allocation_registers_.

Right now the size of the FixedHashMap is a template parameter. This means that while it's possible to make the size dynamically set by another TU, it would be a non-trivial refactor. I'm not inclined to do that right now - the comments make it abundantly clear that ShardCount must be paired with bucket/capacity numbers.

> The second thing is that AllocationRegister inserts into two hash tables (allocations, backtraces), but we shard only by allocation address - i.e. we probably end up inserting similar backtraces into many registers.

Yup, this is true. Given that this is an order of magnitude improvement in performance, I'm not particularly concerned with additional memory bloat.

> And maybe we don't need locks at all - in some of my experiments I saw that most pointers are allocated / freed mostly on the same thread. I.e. how about a per-thread AllocationRegister? Or, more specifically, a per-thread allocations hash table + sharded backtraces hash table?

We discussed this. There's no easy solution to pointers freed on a different thread.
https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/malloc... File base/trace_event/malloc_dump_provider.cc (right): https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/malloc... base/trace_event/malloc_dump_provider.cc:307: { at this point this {} scope delimiter is not needed anymore. https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/malloc... base/trace_event/malloc_dump_provider.cc:308: if (allocation_register_) { this condition is always going to evaluate to true, as you populate the unique_ptr on ctor. I guess you want a if(...IsEnabled()) ? https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/malloc... base/trace_event/malloc_dump_provider.cc:337: if (!allocation_register_->IsInitialized()) just call Initialize and early out in its implementation if IsInitialized already? https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/malloc... File base/trace_event/malloc_dump_provider.h (right): https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/malloc... base/trace_event/malloc_dump_provider.h:53: std::unique_ptr<ShardedAllocationRegister> allocation_register_; there is no need for a unique_ptr anymore here now, this could just be a field, right? https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/sharde... File base/trace_event/sharded_allocation_register.cc (right): https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/sharde... base/trace_event/sharded_allocation_register.cc:21: size_t HashAddress(const void* address) { since this is copy-pasted from a header that is included here (well in the .h file) can you just reuse AddressHasher ? If you are worried about perf, you can move it to a static inline HashAddress function in heap_profiler_allocation_register.h and use in both. 
Feels a bit odd that we have to have two identical copy-pasted functions in two files that include each other to achieve the same goal (hash an address) https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/sharde... base/trace_event/sharded_allocation_register.cc:37: ral.reset(new RegisterAndLock); doesn't make a difference here (so I don't care about what you do in this line) but as a general suggestion, get used to () around new. If RegisterAndLock was POD this would have skipped the initialization. (this is one of the arguments why people on chromium-dev push for MakeUnique, because it enforces the initialization) https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/sharde... base/trace_event/sharded_allocation_register.cc:44: void ShardedAllocationRegister::SetEnabled(bool enabled) { I would remove this method and do this in Initialize(). This will remove a potential bug if the client does the SetEnabled(), Initialize() in the wrong order. Also there doesn't seem to be a valid semantic for doing SetEnabled on its own. So I would have only two methods (or one) in total, not 4. Up to you if you want to call them: - SetEnabled() (which would do the initialization also) and SetDisabled() - SetEnabled(bool), where the "true" path does also the initialization https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/sharde... base/trace_event/sharded_allocation_register.cc:49: return base::subtle::Acquire_Load(&enabled_); I'd move this to the header file and turn it into is_enabled(). Otherwise this introduces an extra call to do the check that: (i) is not needed at all; (ii) wasn't there before this CL; (iii) shouldn't emit any more code than a function call as this boils down to a read op. https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/sharde... File base/trace_event/sharded_allocation_register.h (right): https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/sharde... 
base/trace_event/sharded_allocation_register.h:35: bool IsEnabled() const; this could be an in-line is_enabled() const { return base::subtle..... } so it gets inlined in the caller. (but fine even without if you have any reason against) https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/sharde... base/trace_event/sharded_allocation_register.h:55: using MetricsMap = std::unordered_map<AllocationContext, AllocationMetrics>; nit: structs and type declarations should go on the top of the section, move to line 28. practically this confused me a bunch of times because line 55 above has AllocationMetrics, but that is declared in another file. this OutputMetrics is declared in an adjacent line but is used only in line 65. Hence moving them up solves the confusion by not having it adjacent to something that seems similar.
primiano: PTAL https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/malloc... File base/trace_event/malloc_dump_provider.cc (right): https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/malloc... base/trace_event/malloc_dump_provider.cc:307: { On 2017/05/22 16:37:50, Primiano Tucci wrote: > at this point this {} scope delimiter is not needed anymore. Done. https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/malloc... base/trace_event/malloc_dump_provider.cc:308: if (allocation_register_) { On 2017/05/22 16:37:50, Primiano Tucci wrote: > this condition is always going to evaluate to true, as you populate the > unique_ptr on ctor. I guess you want a if(...IsEnabled()) ? The condition is already checked on line 290. Removed this line. https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/malloc... base/trace_event/malloc_dump_provider.cc:337: if (!allocation_register_->IsInitialized()) On 2017/05/22 16:37:50, Primiano Tucci wrote: > just call Initialize and early out in its implementation if IsInitialized > already? Done. https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/malloc... File base/trace_event/malloc_dump_provider.h (right): https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/malloc... base/trace_event/malloc_dump_provider.h:53: std::unique_ptr<ShardedAllocationRegister> allocation_register_; On 2017/05/22 16:37:50, Primiano Tucci wrote: > there is no need for a unique_ptr anymore here now, this could just be a field, > right? Done. https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/sharde... File base/trace_event/sharded_allocation_register.cc (right): https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/sharde... 
base/trace_event/sharded_allocation_register.cc:21: size_t HashAddress(const void* address) { On 2017/05/22 16:37:51, Primiano Tucci wrote: > since this is copy-pasted from a header that is included here (well in the .h > file) can you just reuse AddressHasher ? If you are worried about perf, you can > move it to a static inline HashAddress function in > heap_profiler_allocation_register.h and use in both. > Feels a bit odd that we have to have two identical copy-pasted function in two > files that include each other to achieve the same goal (hash an address) Done. https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/sharde... base/trace_event/sharded_allocation_register.cc:37: ral.reset(new RegisterAndLock); On 2017/05/22 16:37:51, Primiano Tucci wrote: > doesn't make a difference here (so I don't care about what you do in this line) > but as a general suggestion, get used to () around new. If RegisterAndLock was > POD this would have skipped the initialization. > (this is one of the arguments why people on chromium-dev push for MakeUnique, > because it enforces the initialization) Done. https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/sharde... base/trace_event/sharded_allocation_register.cc:44: void ShardedAllocationRegister::SetEnabled(bool enabled) { Done. [SetEnabled and SetDisabled] https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/sharde... base/trace_event/sharded_allocation_register.cc:49: return base::subtle::Acquire_Load(&enabled_); On 2017/05/22 16:37:51, Primiano Tucci wrote: > I'd make move this to the header file and turn into is_enabled(). This will > introduce an extra call to do the check that: (i) is not needed at all; (ii) > wasn't there before this cl; (iii) shouldn't emit any more code than a function > call as this boils down to a read op. Done. https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/sharde... 
File base/trace_event/sharded_allocation_register.h (right): https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/sharde... base/trace_event/sharded_allocation_register.h:35: bool IsEnabled() const; On 2017/05/22 16:37:51, Primiano Tucci wrote: > this could be an in-line is_enabled() const { return base::subtle..... } so it > gets inlined in the caller. > (but fine even without if you have any reason against) Done. https://codereview.chromium.org/2890363003/diff/80001/base/trace_event/sharde... base/trace_event/sharded_allocation_register.h:55: using MetricsMap = std::unordered_map<AllocationContext, AllocationMetrics>; On 2017/05/22 16:37:51, Primiano Tucci wrote: > nit: structs and type declarations should go on the top of the section, move to > line 28. > practically this confused me a bunch of times because line 55 above has > AllocationMetrics, but that is declared in another file. this OutputMetrics is > declared in an adjacent line but is used only in line 65. Hence moving them up > solves the confusion by not having it adjacent to something that seems similar. Done.
On 2017/05/22 16:24:52, erikchen wrote:
> Right now the size of the FixedHashMap is a template parameter. This means that while it's possible to make the size dynamically set by another TU, it would be a non-trivial refactor. I'm not inclined to do that right now - the comments make it abundantly clear that ShardCount must be paired with bucket/capacity numbers.

I see. BTW, why do we need the ShardedAllocationRegister thing at all - why not move locks into AllocationRegister? I.e. inside AllocationRegister you would have TablesAndLock with both hash tables + lock. In my view either ShardedAllocationRegister or AllocationRegister needs to go - we don't need to have both of them.

> Yup, this is true. Given that this is an order of magnitude improvement in performance, I'm not particularly concerned with additional memory bloat.

Regarding the improvement - can you add a perf test that would demonstrate the performance problem? It will be useful to make sure future changes don't regress the performance.
Cool, thanks a lot for the patience. Now that we are done with nits and renames, let me express some more high level comments. I am not fully sure that this is the right architecture to achieve this sharding. How about ... I am joking (I always hated those sort of codereview cycles). LGTM https://codereview.chromium.org/2890363003/diff/120001/base/trace_event/shard... File base/trace_event/sharded_allocation_register.cc (right): https://codereview.chromium.org/2890363003/diff/120001/base/trace_event/shard... base/trace_event/sharded_allocation_register.cc:16: size_t ShardCount = 1; this should have been a const size_t kShardCount, but I spotted it too late. However, not worth another re-upload & CQ cycle just for this. Change it only if you plan to make some other change to this file, otherwise we'll live with it, or change the next time we touch this file again, if ever.
The CQ bit was unchecked by erikchen@chromium.org
The CQ bit was checked by erikchen@chromium.org
CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by erikchen@chromium.org
The CQ bit was checked by erikchen@chromium.org to run a CQ dry run
On 2017/05/22 17:37:55, Primiano Tucci wrote: > Cool, thanks a lot for the patience. > Now that we are done with nits and renames, let me express some more high level > comments. I am not fully sure that this is the right architecture to achieve > this sharding. How about ... > > I am joking (I always hated those sort of codereview cycles). > LGTM Well, at least we should add a perf test for this. Otherwise we'll spend more time in the future when (not if) AllocationRegister is changed again. > > https://codereview.chromium.org/2890363003/diff/120001/base/trace_event/shard... > File base/trace_event/sharded_allocation_register.cc (right): > > https://codereview.chromium.org/2890363003/diff/120001/base/trace_event/shard... > base/trace_event/sharded_allocation_register.cc:16: size_t ShardCount = 1; > this should have been a const size_t kShardCount, but I spotted it too late. > However, not worth another re-upload & CQ cycle just for this. > > Change it only if you plan to make some other change to this file, otherwise > we'll live with it, or change the next time we touch this file again, if ever.
Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
On 2017/05/22 17:36:05, DmitrySkiba wrote: > I see. BTW, why do we need ShardedAllocationRegister thing at all - why not move > locks into AllocationRegister? I.e. inside AllocationRegister you would have > TablesAndLock with both hash tables + lock. > In my view either ShardedAllocationRegister or AllocationRegister needs to go - > we don't need to have both of them. Erik and I discussed this offline; I had a similar suggestion. Doing what you say would make sense but be non-trivial, because it means redesigning the freelists (and having N of them). The reality is that, right now, this is something simple (the Sharded class) which wraps something tested that has reliably worked (at least as far as data storage and thread safety are concerned) for a year (the AllocationRegister). We could definitely refactor this, but that means exposing ourselves to a risk in terms of crashes / having to debug thread-safe code. It is not impossible to deal with that, but it comes with a cost. Given all the things on our plates, I'd like to iterate by composition right now and move that refactoring cost to a later stage. Right now our P0 is having a heap profiler that works and is perf-efficient. I'd file a low-prio bug to track the possible future improvements. Definitely, by having a single instance we could be more memory-efficient and have only one, perhaps smaller, pool of cells and use them more efficiently (maybe). But, until we can prove that it causes a problem, that is a lower-prio refactoring.
erikchen@chromium.org changed reviewers: + thakis@chromium.org
thakis: Please review third_party/WebKit/Source/platform/*
On 2017/05/22 17:44:59, Primiano Tucci wrote: > On 2017/05/22 17:36:05, DmitrySkiba wrote: > > I see. BTW, why do we need ShardedAllocationRegister thing at all - why not > move > > locks into AllocationRegister? I.e. inside AllocationRegister you would have > > TablesAndLock with both hash tables + lock. > > In my view either ShardedAllocationRegister or AllocationRegister needs to go > - > > we don't need to have both of them. > > Erik and I discussed this offline, I had some similar suggestion. Doing what you > say would make sense but be non trivial, because means redesigning the freelists > (and having N of them)0. > The reality is that, right now, this is something simple (the Sharded class) > which wraps something test that has reliably worked (at least as long as data > storage and thread safety are concerned) for a year (the AllocationRegister). > We could definitely refactor this, but that means exposing ourself to a risk in > term crashes / having to debug thread-safe code. It is not impossible to deal > with that but comes with a cost. > Given all the things on our plates, I like to iterate by composition right now > and move that refactoring cost to a later stage. > Right now our P0 is having a heap profiler that works and is perf efficient. > I'd file a low-prio bug to track the possible future improvements. Definitely by > having a single instance we could be more memory-efficient and have only one, > perhaps smaller, pool of cells and use them more efficiently (maybe). > But, until we can prove that it causes a problem, that is a lower-prio > refactoring. This is not what I was suggesting. 
Right now we have this:

AllocationRegister {
  HashMap allocations;
  HashMap backtraces;
};

ShardedAllocationRegister {
  RegisterAndLock {
    AllocationRegister register;
    Lock lock;
  };
  RegisterAndLock allocation_registers_[];
}

I'm proposing to change it to:

AllocationRegister {
  TablesAndLock {
    HashMap allocations;
    HashMap backtraces;
    Lock lock;
  };
  TablesAndLock tables_[];
};

Equally easy, but without an extra thing on the side.
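For illustration only, the address-based sharding that both layouts rely on can be sketched in standalone C++. This is a hedged sketch, not the actual base/trace_event code: `ShardedRegister`, its members, and the shard-selection hash are hypothetical stand-ins, and a plain `std::unordered_map` stands in for the fixed-size hash maps.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Hypothetical sketch of sharding by allocation address. Each shard pairs
// its own table with its own lock, so two threads contend only when their
// addresses hash to the same shard.
constexpr size_t kShardCount = 64;  // Power of two, as in the CL.

struct Shard {
  std::mutex lock;
  std::unordered_map<const void*, size_t> allocations;  // address -> size
};

class ShardedRegister {
 public:
  void Insert(const void* address, size_t size) {
    Shard& shard = ShardFor(address);
    std::lock_guard<std::mutex> hold(shard.lock);
    shard.allocations[address] = size;
  }

  void Remove(const void* address) {
    Shard& shard = ShardFor(address);
    std::lock_guard<std::mutex> hold(shard.lock);
    shard.allocations.erase(address);
  }

  // Sums entries across all shards; for illustration/testing only.
  size_t CountForTesting() {
    size_t total = 0;
    for (Shard& shard : shards_) {
      std::lock_guard<std::mutex> hold(shard.lock);
      total += shard.allocations.size();
    }
    return total;
  }

 private:
  Shard& ShardFor(const void* address) {
    // Discard the low bits first: allocator alignment tends to zero them,
    // which would otherwise skew the shard distribution.
    uintptr_t key = reinterpret_cast<uintptr_t>(address) >> 4;
    return shards_[key & (kShardCount - 1)];
  }

  std::array<Shard, kShardCount> shards_;
};
```

Note that, as pointed out earlier in the review, sharding only by address means similar backtraces still land in many shards; this sketch models only the allocation table, not the backtrace table.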
On 2017/05/22 17:44:47, DmitrySkiba wrote: > On 2017/05/22 17:37:55, Primiano Tucci wrote: > > Cool, thanks a lot for the patience. > > Now that we are done with nits and renames, let me express some more high > level > > comments. I am not fully sure that this is the right architecture to achieve > > this sharding. How about ... > > > > I am joking (I always hated those sort of codereview cycles). > > LGTM > > Well, at least we should add a perf test for this. Otherwise we'll spend more > time in the future when (not if) AllocationRegister is changed again. I will LG any CL that adds a perf test to this (in that case tracing_perftests already has the right harness and is hooked up to the perf dashboard). That, however, shouldn't be a blocker for this CL. Erik here took something that doesn't have a perf test and made it like 2x faster on Mac. Requiring a perf test for this means discouraging any future attempt at improving things :) I'd definitely be on board with requiring a perf test if this were a controversial micro-optimization. But having data that shows contention (which is pretty expected, given that we have 40 threads and do 400k allocations/second) and a sharding plan to reduce the lock domain makes the plan quite sensible. Maybe I am overlooking something; is there anything here that realistically looks like it could regress performance?
On 2017/05/22 17:49:17, DmitrySkiba wrote: > On 2017/05/22 17:44:59, Primiano Tucci wrote: > > On 2017/05/22 17:36:05, DmitrySkiba wrote: > > > I see. BTW, why do we need ShardedAllocationRegister thing at all - why not > > move > > > locks into AllocationRegister? I.e. inside AllocationRegister you would have > > > TablesAndLock with both hash tables + lock. > > > In my view either ShardedAllocationRegister or AllocationRegister needs to > go > > - > > > we don't need to have both of them. > > > > Erik and I discussed this offline, I had some similar suggestion. Doing what > you > > say would make sense but be non trivial, because means redesigning the > freelists > > (and having N of them)0. > > The reality is that, right now, this is something simple (the Sharded class) > > which wraps something test that has reliably worked (at least as long as data > > storage and thread safety are concerned) for a year (the AllocationRegister). > > We could definitely refactor this, but that means exposing ourself to a risk > in > > term crashes / having to debug thread-safe code. It is not impossible to deal > > with that but comes with a cost. > > Given all the things on our plates, I like to iterate by composition right now > > and move that refactoring cost to a later stage. > > Right now our P0 is having a heap profiler that works and is perf efficient. > > I'd file a low-prio bug to track the possible future improvements. Definitely > by > > having a single instance we could be more memory-efficient and have only one, > > perhaps smaller, pool of cells and use them more efficiently (maybe). > > But, until we can prove that it causes a problem, that is a lower-prio > > refactoring. > > This is not what I was suggesting. 
> Right now we have this:
>
> AllocationRegister {
>   HashMap allocations;
>   HashMap backtraces;
> };
>
> ShardedAllocationRegister {
>   RegisterAndLock {
>     AllocationRegister register;
>     Lock lock;
>   };
>   RegisterAndLock allocation_registers_[];
> }
>
> I'm proposing to change it to:
>
> AllocationRegister {
>   TablesAndLock {
>     HashMap allocations;
>     HashMap backtraces;
>     Lock lock;
>   };
>   TablesAndLock tables_[];
> };
>
> Equally easy, but without an extra thing on the side.

Yeah, that would have meant more code shuffling, more risk of breaking something, and a larger review. Given that this CL is in shape and green, at this point the refactoring might not justify its cost, as it would have no functional benefit.
On 2017/05/22 17:49:39, Primiano Tucci wrote: > On 2017/05/22 17:44:47, DmitrySkiba wrote: > > On 2017/05/22 17:37:55, Primiano Tucci wrote: > > > Cool, thanks a lot for the patience. > > > Now that we are done with nits and renames, let me express some more high > > level > > > comments. I am not fully sure that this is the right architecture to achieve > > > this sharding. How about ... > > > > > > I am joking (I always hated those sort of codereview cycles). > > > LGTM > > > > Well, at least we should add a perf test for this. Otherwise we'll spend more > > time in the future when (not if) AllocationRegister is changed again. > > I will LG any CL that adds a perf test to this (in that case tracing_preftests > has already the right harness and is hooked to the perf dashboard). > That, however, shouldn't be a blocker for this CL. Erik here took something that > doesn't have a perf test and made it like 2x faster on mac. Requiring a > perf-test for this means discouraging any future attempt of improving things :) > I'd definitely onboard with requiring a perftest, if this was about a > controversial micro-optimization. But, having the data that shows contention > (which is pretty expected, given that we have 40 threads and do 400k > allocations/seconds) and a sharding plan to reduce the lock domain, makes the > plan quite sensible. Maybe I am overlooking something, is there anything here > that realistically looks like could regress performance? Yes, so the burden to write tests falls to the next person who touches the code. Erik used Instruments to assess the performance, so a change to accommodate some Windows-specific need would have to make sure the code doesn't regress on macOS, using a macOS-specific tool. Perf tests also serve another function - they outline scenarios. I believe that the change in this CL is an improvement, but I don't exactly understand the extent of it. How many threads need to be contending to show the improvement? Why is the shard count 64?
Maybe 32 works better? Or 128? This is impossible to answer without a perf test.
On 2017/05/22 17:56:50, DmitrySkiba wrote: > Yes, so the burden to write tests falls to the next person who touches the code. I am not sure I follow. Lots of people, including you, have touched code in the heap profiler and we lived all this time without a perf test. Why will the next person suddenly require a perf test? Again, it's desirable, but I don't see how this is a requirement. > Erik used Instruments to assess the performance, so a change to accommodate some > Windows-specific need would have to make sure the code doesn't regress on macOS, > using macOS-specific tool. It's all based on realism and pragmatism. This is just introducing sharding and reducing the lock contention domain, which I expect not to be OS-specific. Can you give a realistic example of how making a change on !Mac would regress performance on Mac because of this? How does this CL make that statement more true than the state before this CL? > Perf tests also serve another function - they outline scenarios. I believe that > the change in this CL is an improvement, but I don't exactly understand the > extent of it. Isn't the commit message explaining it? Again, I agree that a perf test is nice to have, but I don't see how it is a necessity now (as opposed to all the other CLs that this code has seen). > How many threads need to be contending to show the improvement? > Why sharding count is 64? Maybe 32 works better? Or 128? I asked Erik to check with Instruments what was a reasonable number that kept the improvement, and he came up with 64. If you come along with a CL whose commit message says "actually, on Windows it looks like 32 is a better number because of this measurement, so adding an #ifdef there", I will LG your CL as well :)
https://codereview.chromium.org/2890363003/diff/120001/third_party/WebKit/Sou... File third_party/WebKit/Source/platform/PartitionAllocMemoryDumpProvider.cpp (right): https://codereview.chromium.org/2890363003/diff/120001/third_party/WebKit/Sou... third_party/WebKit/Source/platform/PartitionAllocMemoryDumpProvider.cpp:201: if (!allocation_register_.is_enabled()) I assume it's not a problem that is_enabled() and Insert() below are no longer protected by the same lock as an atomic unit (also below, also in the other file).
Durr, forgot to say "webkit/platform lgtm"
https://codereview.chromium.org/2890363003/diff/120001/third_party/WebKit/Sou... File third_party/WebKit/Source/platform/PartitionAllocMemoryDumpProvider.cpp (right): https://codereview.chromium.org/2890363003/diff/120001/third_party/WebKit/Sou... third_party/WebKit/Source/platform/PartitionAllocMemoryDumpProvider.cpp:201: if (!allocation_register_.is_enabled()) On 2017/05/22 18:48:34, Nico wrote: > I assume it's not a problem that is_enabled() and Insert() below are no longer > protected by the same lock as an atomic unit (also below, also in the other > file). Right. If our reasoning is correct, is_enabled is a flag that can only transition false -> true, and when it does, the allocation_register_ is initialized beforehand and the flag is set with release semantics (the check here has acquire semantics). The true -> false transition doesn't really tear down the register (we don't even have a path to disable the heap profiler today), so |allocation_register_| will be safe to access even in the case of hitting the true -> false race (if ever).
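The false -> true-only transition described here can be sketched with standard C++ atomics, using std::atomic with explicit memory orderings as a stand-in for base::subtle::Atomic32's Release_Store/Acquire_Load. Class and member names below are illustrative, not the real Chromium code:

```cpp
#include <atomic>

// Illustrative sketch: the enable flag transitions only false -> true,
// and the release/acquire pairing guarantees that any thread observing
// is_enabled() == true also observes the fully initialized register state.
class RegisterSketch {
 public:
  void Enable() {
    initialized_state_ = 42;  // Stand-in for setting up the register.
    // Publish the flag last, with release semantics, so the store above
    // is visible to any reader that sees enabled_ == true.
    enabled_.store(true, std::memory_order_release);
  }

  bool is_enabled() const {
    // Acquire pairs with the release store in Enable().
    return enabled_.load(std::memory_order_acquire);
  }

  int state() const { return initialized_state_; }

 private:
  std::atomic<bool> enabled_{false};
  int initialized_state_ = 0;
};
```

Because there is no true -> false teardown path, a reader that checks is_enabled() and then touches the register without a lock never observes a destroyed register, which is exactly the property being relied on in the comment above.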
On Mon, May 22, 2017 at 2:51 PM, <primiano@chromium.org> wrote:
> third_party/WebKit/Source/platform/PartitionAllocMemoryDumpProvider.cpp:201: if (!allocation_register_.is_enabled())
> On 2017/05/22 18:48:34, Nico wrote:
> > I assume it's not a problem that is_enabled() and Insert() below are no longer protected by the same lock as an atomic unit (also below, also in the other file).
>
> right. if our reasoning is correct, is_enabled is a flag that can only transition false -> true, and when it does, the allocation_register_ is initialized before and the flag is set with release semantics (The check here has acquire semantics).
> The true -> false transition doesn't really tear down the register (we don't even have a path to disable the heap profiler today), so that |allocation_register_| will be safe to access even in case of hitting the true -> false race (if ever)

Sounds like this might be subtle enough to warrant a comment?
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
https://codereview.chromium.org/2890363003/diff/120001/base/trace_event/shard... File base/trace_event/sharded_allocation_register.cc (right): https://codereview.chromium.org/2890363003/diff/120001/base/trace_event/shard... base/trace_event/sharded_allocation_register.cc:16: size_t ShardCount = 1; On 2017/05/22 17:37:55, Primiano Tucci wrote: > this should have been a const size_t kShardCount, but I spotted it too late. > However, not worth another re-upload & CQ cycle just for this. > > Change it only if you plan to make some other change to this file, otherwise > we'll live with it, or change the next time we touch this file again, if ever. Done.
> > > Sounds like this might be subtle enough to warrant a comment?

Done.
The CQ bit was checked by erikchen@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from primiano@chromium.org, thakis@chromium.org Link to the patchset: https://codereview.chromium.org/2890363003/#ps140001 (title: "comments from primiano, thakis.")
CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
On 2017/05/22 19:16:21, commit-bot: I haz the power wrote: > CQ is trying da patch. > > Follow status at: > https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or... I spoke with dskiba@. We agreed that the current design is acceptable, and that we should have a perf test. I filed https://bugs.chromium.org/p/chromium/issues/detail?id=725174 against myself.
CQ is committing da patch. Bot data: {"patchset_id": 140001, "attempt_start_ts": 1495480553991550, "parent_rev": "1c38965c165d21706028861200fab06b1eca0bea", "commit_rev": "bd599af5d33647f0b498311eab05be79fac98e63"}
Message was sent while issue was closed.
Description was changed from ========== Enable sharding of AllocationRegister on desktop. Previously, when native heap profiling was enabled, all allocations in a process were gated by a global lock. This proves to be a significant performance hit on macOS. Profiling shows that 50-75% of all wall time is spent waiting on this lock. This CL introduces a new class ShardedAllocationRegister to handle sharding for desktop platforms. This stores allocation/backtrace information across 64 different AllocationRegister instances. In addition, the sizes of the fixed size hash maps were changed. The number of allocation buckets was reduced by a factor of 8, and the number of backtrace buckets reduced by a factor of 16. The new class ShardedAllocationRegister is thread-safe, and its consumers no longer need to acquire a lock when using the container. Each consumer still needs to know whether heap profiling is enabled. This CL uses a subtle::Atomic32 along with Acquire_Load and Release_Store to determine this state. Using a bool to record this state and a base::Lock to synchronize reading/writing from the bool has performance almost as bad as the global lock. BUG=724651 ========== to ========== Enable sharding of AllocationRegister on desktop. Previously, when native heap profiling was enabled, all allocations in a process were gated by a global lock. This proves to be a significant performance hit on macOS. Profiling shows that 50-75% of all wall time is spent waiting on this lock. This CL introduces a new class ShardedAllocationRegister to handle sharding for desktop platforms. This stores allocation/backtrace information across 64 different AllocationRegister instances. In addition, the sizes of the fixed size hash maps were changed. The number of allocation buckets was reduced by a factor of 8, and the number of backtrace buckets reduced by a factor of 16. 
The new class ShardedAllocationRegister is thread-safe, and its consumers no longer need to acquire a lock when using the container. Each consumer still needs to know whether heap profiling is enabled. This CL uses a subtle::Atomic32 along with Acquire_Load and Release_Store to determine this state. Using a bool to record this state and a base::Lock to synchronize reading/writing from the bool has performance almost as bad as the global lock. BUG=724651 Review-Url: https://codereview.chromium.org/2890363003 Cr-Commit-Position: refs/heads/master@{#473696} Committed: https://chromium.googlesource.com/chromium/src/+/bd599af5d33647f0b498311eab05... ==========
Message was sent while issue was closed.
Committed patchset #8 (id:140001) as https://chromium.googlesource.com/chromium/src/+/bd599af5d33647f0b498311eab05... |