Created: 4 years, 2 months ago by benjhayden
Modified: 4 years, 2 months ago
CC: catapult-reviews@chromium.org, tracing-review@chromium.org
Target Ref: refs/heads/master
Project: catapult
Visibility: Public.
Description
Serialize Histograms more efficiently.
In a results2.html that Juan emailed recently, 16700 Histograms took 95646761
bytes, for an average of 5727 bytes per Histogram. The only Diagnostics were
IterationInfo. This CL reduces the average size of each Histogram in the
memory.top_10_mobile benchmark to 843 bytes, an improvement of 86%.
Avoid serializing unnecessary information.
Serialize sparse centralBins arrays as dictionaries.
Serialize HistogramBinBoundaries builder recipes instead of the boundaries themselves.
Add tests to monitor serialization sizes.
Trimming floats is left for another CL.
BUG=catapult:#2794
Committed: https://chromium.googlesource.com/external/github.com/catapult-project/catapult/+/03fbd54d8fdb0defa1a677b71d67a9f46544fa86
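The sparse-centralBins idea in the description above can be sketched like this. This is an illustrative sketch in plain JavaScript, not Catapult's actual implementation; the function names and the size heuristic are assumptions:

```javascript
// Sketch: serialize a mostly-empty centralBins array of counts as an
// {index: count} dictionary when most bins are empty, since the dict form
// omits the zeros. Illustration only; not Catapult's real code.
function serializeCentralBins(bins) {
  const sparse = {};
  let nonEmpty = 0;
  bins.forEach((count, i) => {
    if (count !== 0) {
      sparse[i] = count;
      ++nonEmpty;
    }
  });
  // Heuristic (an assumption): the dict wins when over half the bins are empty.
  if (nonEmpty * 2 < bins.length) return sparse;
  return bins;
}

function deserializeCentralBins(dictOrArray, numBins) {
  if (Array.isArray(dictOrArray)) return dictOrArray;
  const bins = new Array(numBins).fill(0);
  for (const [i, count] of Object.entries(dictOrArray)) {
    bins[Number(i)] = count;
  }
  return bins;
}
```

For a 100-bin histogram with samples in a handful of bins, the dictionary form collapses 100 comma-separated zeros down to a few key/value pairs.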
Patch Set 1 #
Patch Set 2 : . #
Patch Set 3 : make HistogramBin.asDict() even more efficient #
Patch Set 4 : HistogramBinBoundaries builder recipes #
Patch Set 5 : back out DICT_TAGS #
Total comments: 8
Patch Set 6 : binBoundariesBuilderDict_ #
Patch Set 7 : *binRanges(), cache HistogramBinBoundaries #
Patch Set 8 : RunningStatistics.asDict undefined #
Total comments: 22
Patch Set 9 : uniformlySampleArray #
Patch Set 10 : . #
Patch Set 11 : . #
Messages
Total messages: 44 (16 generated)
Description was changed from ========== Serialize Histograms more efficiently. Avoid serializing unnecessary information. Use small numbers instead of verbose strings for dict keys. Add tests to monitor serialization sizes. BUG=catapult:#2794 ========== to ========== Serialize Histograms more efficiently. Avoid serializing unnecessary information. Serialize sparse centralBins arrays as dictionaries. Use small numbers instead of verbose strings for dict keys. Add tests to monitor serialization sizes. BUG=catapult:#2794 ==========
benjhayden@chromium.org changed reviewers: + eakuefner@chromium.org, nednguyen@google.com
PTAL :-)
Description was changed from ========== Serialize Histograms more efficiently. Avoid serializing unnecessary information. Serialize sparse centralBins arrays as dictionaries. Use small numbers instead of verbose strings for dict keys. Add tests to monitor serialization sizes. BUG=catapult:#2794 ========== to ========== Serialize Histograms more efficiently. Avoid serializing unnecessary information. Serialize sparse centralBins arrays as dictionaries. Use small numbers instead of verbose strings for dict keys. Add tests to monitor serialization sizes. Trimming floats is left for another CL. BUG=catapult:#2794 ==========
On 2016/09/23 23:09:14, benjhayden wrote: > PTAL :-) Can you run Juan's benchmark to see how much saving we get?
Description was changed from ========== Serialize Histograms more efficiently. Avoid serializing unnecessary information. Serialize sparse centralBins arrays as dictionaries. Use small numbers instead of verbose strings for dict keys. Add tests to monitor serialization sizes. Trimming floats is left for another CL. BUG=catapult:#2794 ========== to ========== Serialize Histograms more efficiently. In a results2.html that Juan emailed recently, 16700 Histograms took 95646761 bytes, for an average of 5727 bytes per Histogram. Avoid serializing unnecessary information. Serialize sparse centralBins arrays as dictionaries. Use small numbers instead of verbose strings for dict keys. Add tests to monitor serialization sizes. Trimming floats is left for another CL. Serializing BinBoundaries recipes instead of boundaries numbers is left for another CL, though it will likely have a larger impact than these changes. BUG=catapult:#2794 ==========
Description was changed from ========== Serialize Histograms more efficiently. In a results2.html that Juan emailed recently, 16700 Histograms took 95646761 bytes, for an average of 5727 bytes per Histogram. Avoid serializing unnecessary information. Serialize sparse centralBins arrays as dictionaries. Use small numbers instead of verbose strings for dict keys. Add tests to monitor serialization sizes. Trimming floats is left for another CL. Serializing BinBoundaries recipes instead of boundaries numbers is left for another CL, though it will likely have a larger impact than these changes. BUG=catapult:#2794 ========== to ========== Serialize Histograms more efficiently. In a results2.html that Juan emailed recently, 16700 Histograms took 95646761 bytes, for an average of 5727 bytes per Histogram. The only Diagnostics were IterationInfo. This CL should reduce the average size of Histograms by nearly half. Avoid serializing unnecessary information. Serialize sparse centralBins arrays as dictionaries. Use small numbers instead of verbose strings for dict keys. Add tests to monitor serialization sizes. Trimming floats is left for another CL. Serializing BinBoundaries recipes instead of boundaries numbers is left for another CL, though it will likely have a larger impact than these changes. Changing Diagnostic dicts to use tags instead of string keys is left for another CL. BUG=catapult:#2794 ==========
Description was changed from ========== Serialize Histograms more efficiently. In a results2.html that Juan emailed recently, 16700 Histograms took 95646761 bytes, for an average of 5727 bytes per Histogram. The only Diagnostics were IterationInfo. This CL should reduce the average size of Histograms by nearly half. Avoid serializing unnecessary information. Serialize sparse centralBins arrays as dictionaries. Use small numbers instead of verbose strings for dict keys. Add tests to monitor serialization sizes. Trimming floats is left for another CL. Serializing BinBoundaries recipes instead of boundaries numbers is left for another CL, though it will likely have a larger impact than these changes. Changing Diagnostic dicts to use tags instead of string keys is left for another CL. BUG=catapult:#2794 ========== to ========== Serialize Histograms more efficiently. In a results2.html that Juan emailed recently, 16700 Histograms took 95646761 bytes, for an average of 5727 bytes per Histogram. The only Diagnostics were IterationInfo. This CL should reduce the average size of Histograms by nearly half. Avoid serializing unnecessary information. Serialize sparse centralBins arrays as dictionaries. Use small numbers instead of verbose strings for dict keys. Add tests to monitor serialization sizes. Trimming floats is left for another CL. Serializing BinBoundaries recipes instead of boundaries numbers is left for another CL, though it will likely have a significant impact. Changing Diagnostic dicts to use tags instead of string keys is left for another CL. BUG=catapult:#2794 ==========
On 2016/09/23 at 23:11:09, nednguyen wrote: > On 2016/09/23 23:09:14, benjhayden wrote: > > PTAL :-) > > Can you run Juan's benchmark to see how much saving do we get? This command: tools/perf/run_benchmark memory.top_10_mobile --browser android-chrome --output-format html2 --results-label out-416763-8d5 --reset-results produced a results2.html that is 36741609 bytes, which is 39% of the size of the 92MB results2.html that Juan emailed recently.
Description was changed from ========== Serialize Histograms more efficiently. In a results2.html that Juan emailed recently, 16700 Histograms took 95646761 bytes, for an average of 5727 bytes per Histogram. The only Diagnostics were IterationInfo. This CL should reduce the average size of Histograms by nearly half. Avoid serializing unnecessary information. Serialize sparse centralBins arrays as dictionaries. Use small numbers instead of verbose strings for dict keys. Add tests to monitor serialization sizes. Trimming floats is left for another CL. Serializing BinBoundaries recipes instead of boundaries numbers is left for another CL, though it will likely have a significant impact. Changing Diagnostic dicts to use tags instead of string keys is left for another CL. BUG=catapult:#2794 ========== to ========== Serialize Histograms more efficiently. In a results2.html that Juan emailed recently, 16700 Histograms took 95646761 bytes, for an average of 5727 bytes per Histogram. The only Diagnostics were IterationInfo. This CL should reduce the average size of Histograms by about half. Avoid serializing unnecessary information. Serialize sparse centralBins arrays as dictionaries. Use small numbers instead of verbose strings for dict keys. Add tests to monitor serialization sizes. Trimming floats is left for another CL. Serializing BinBoundaries recipes instead of boundaries numbers is left for another CL, though it will likely have a significant impact. Changing Diagnostic dicts to use tags instead of string keys is left for another CL. BUG=catapult:#2794 ==========
On 2016/09/26 05:30:38, benjhayden wrote: > On 2016/09/23 at 23:11:09, nednguyen wrote: > > On 2016/09/23 23:09:14, benjhayden wrote: > > > PTAL :-) > > > > Can you run Juan's benchmark to see how much saving do we get? > > This command: > tools/perf/run_benchmark memory.top_10_mobile --browser android-chrome > --output-format html2 --results-label out-416763-8d5 --reset-results > produced a results2.html that is 36741609 bytes, which is 39% of the size of the > 92MB results2.html that Juan emailed recently. Nice win! I will look at this later today, thanks Ben!
The optimization of serializing sparse centralBins arrays as dictionaries makes a lot of sense to me. However, I don't really buy the idea of mapping dictionary strings to numbers. It's basically a compression scheme, which would be better achieved by gzip. If unpacking the zipped data creates large strings, let's use https://github.com/catapult-project/catapult/issues/2826 to address it.
Description was changed from ========== Serialize Histograms more efficiently. In a results2.html that Juan emailed recently, 16700 Histograms took 95646761 bytes, for an average of 5727 bytes per Histogram. The only Diagnostics were IterationInfo. This CL should reduce the average size of Histograms by about half. Avoid serializing unnecessary information. Serialize sparse centralBins arrays as dictionaries. Use small numbers instead of verbose strings for dict keys. Add tests to monitor serialization sizes. Trimming floats is left for another CL. Serializing BinBoundaries recipes instead of boundaries numbers is left for another CL, though it will likely have a significant impact. Changing Diagnostic dicts to use tags instead of string keys is left for another CL. BUG=catapult:#2794 ========== to ========== Serialize Histograms more efficiently. In a results2.html that Juan emailed recently, 16700 Histograms took 95646761 bytes, for an average of 5727 bytes per Histogram. The only Diagnostics were IterationInfo. This CL should reduce the average size of Histograms by about half. Avoid serializing unnecessary information. Serialize sparse centralBins arrays as dictionaries. Serialize HistogramBinBoundaries builder recipes instead of the boundaries themselves. Use short strings for dict keys. Add tests to monitor serialization sizes. Trimming floats is left for another CL. Changing Diagnostic dicts to use tags instead of string keys is left for another CL. Histograms now look like this: {"b":[0,[0,1000,10]],"N":"","u":"unitless","g":"1197bc31-0bbb-44aa-80f1-066121e20dc9","m":10,"v":[230,250,270,300,510,580,620,680,850,880],"r":[93,990,6.1115859726001505,532.483870967742,100,49521,6530807.225806451],"c":[0,12,[11,[{"foo":{"type":"Generic","value":"bar"}}]],10,10,10,10,10,10,10]} BUG=catapult:#2794 ==========
Description was changed from ========== Serialize Histograms more efficiently. In a results2.html that Juan emailed recently, 16700 Histograms took 95646761 bytes, for an average of 5727 bytes per Histogram. The only Diagnostics were IterationInfo. This CL should reduce the average size of Histograms by about half. Avoid serializing unnecessary information. Serialize sparse centralBins arrays as dictionaries. Serialize HistogramBinBoundaries builder recipes instead of the boundaries themselves. Use short strings for dict keys. Add tests to monitor serialization sizes. Trimming floats is left for another CL. Changing Diagnostic dicts to use tags instead of string keys is left for another CL. Histograms now look like this: {"b":[0,[0,1000,10]],"N":"","u":"unitless","g":"1197bc31-0bbb-44aa-80f1-066121e20dc9","m":10,"v":[230,250,270,300,510,580,620,680,850,880],"r":[93,990,6.1115859726001505,532.483870967742,100,49521,6530807.225806451],"c":[0,12,[11,[{"foo":{"type":"Generic","value":"bar"}}]],10,10,10,10,10,10,10]} BUG=catapult:#2794 ========== to ========== Serialize Histograms more efficiently. In a results2.html that Juan emailed recently, 16700 Histograms took 95646761 bytes, for an average of 5727 bytes per Histogram. The only Diagnostics were IterationInfo. This CL should reduce the average size of Histograms by about half. Avoid serializing unnecessary information. Serialize sparse centralBins arrays as dictionaries. Serialize HistogramBinBoundaries builder recipes instead of the boundaries themselves. Use short strings for dict keys. Add tests to monitor serialization sizes. Trimming floats is left for another CL. Changing Diagnostic dicts to use short string keys is left for another CL. 
Histograms now look like this: {"b":[0,[0,1000,10]],"N":"","u":"unitless","g":"1197bc31-0bbb-44aa-80f1-066121e20dc9","m":10,"v":[230,250,270,300,510,580,620,680,850,880],"r":[93,990,6.1115859726001505,532.483870967742,100,49521,6530807.225806451],"c":[0,12,[11,[{"foo":{"type":"Generic","value":"bar"}}]],10,10,10,10,10,10,10]} BUG=catapult:#2794 ==========
Description was changed from ========== Serialize Histograms more efficiently. In a results2.html that Juan emailed recently, 16700 Histograms took 95646761 bytes, for an average of 5727 bytes per Histogram. The only Diagnostics were IterationInfo. This CL should reduce the average size of Histograms by about half. Avoid serializing unnecessary information. Serialize sparse centralBins arrays as dictionaries. Serialize HistogramBinBoundaries builder recipes instead of the boundaries themselves. Use short strings for dict keys. Add tests to monitor serialization sizes. Trimming floats is left for another CL. Changing Diagnostic dicts to use short string keys is left for another CL. Histograms now look like this: {"b":[0,[0,1000,10]],"N":"","u":"unitless","g":"1197bc31-0bbb-44aa-80f1-066121e20dc9","m":10,"v":[230,250,270,300,510,580,620,680,850,880],"r":[93,990,6.1115859726001505,532.483870967742,100,49521,6530807.225806451],"c":[0,12,[11,[{"foo":{"type":"Generic","value":"bar"}}]],10,10,10,10,10,10,10]} BUG=catapult:#2794 ========== to ========== Serialize Histograms more efficiently. In a results2.html that Juan emailed recently, 16700 Histograms took 95646761 bytes, for an average of 5727 bytes per Histogram. The only Diagnostics were IterationInfo. This CL should reduce the average size of Histograms by about half to two-thirds. Avoid serializing unnecessary information. Serialize sparse centralBins arrays as dictionaries. Serialize HistogramBinBoundaries builder recipes instead of the boundaries themselves. Use short strings for dict keys. Add tests to monitor serialization sizes. Trimming floats is left for another CL. Changing Diagnostic dicts to use short string keys is left for another CL. 
Histograms now look like this: {"b":[0,[0,1000,10]],"N":"","u":"unitless","g":"1197bc31-0bbb-44aa-80f1-066121e20dc9","m":10,"v":[230,250,270,300,510,580,620,680,850,880],"r":[93,990,6.1115859726001505,532.483870967742,100,49521,6530807.225806451],"c":[0,12,[11,[{"foo":{"type":"Generic","value":"bar"}}]],10,10,10,10,10,10,10]} BUG=catapult:#2794 ==========
On 2016/09/26 at 18:45:57, nednguyen wrote: > The optimization of serializing sparse centralBins arrays as dictionaries makes a lot of sense to me. > > However, I don't really buy the idea of mapping dictionary strings to number. It's basically a compression scheme, which should better be achieved by gzip. If unpacking the zipped data create large string, let's use https://github.com/catapult-project/catapult/issues/2826 to address it. Actually, JSON keys must be strings, so I changed the number tags to short strings. I tried to choose meaningful characters, but I think that shorter keys do have a significant impact here. I could understand an argument that JSON should theoretically be human-readable, but I'm not sure if that's a valid use case here. value-set-json is already serialized to a single line, and none of our tools (devtools, vim, less) are capable of dealing with a 92MB-long line without line-breaks. If you're going to have to dip into python/javascript repl anyway in order to debug Histogram.asDict/fromDict, then you might as well write a couple lines of code to help you parse the results. Similarly, protocol buffer provides tools to convert between binary and pretty-printed text. Sorry, I don't understand the objection to "compression schemes". Compression is just using fewer bits to encode the same information. Serializing sparse centralBins is a compression scheme. I agree that JSONStream will solve the v8-string-too-large problem once and for all, but efficiency (or rather, I'd call JSONStream an inefficient work-around) in one part of the pipeline does not absolve inefficiency in another part. We're trying to optimize the entire workflow of generating results, storing them, serving them, and loading them. JSONStream will make loading obscenely large JSONs possible, and gzip will help store and serve them at the cost of loading latency, but the entire workflow would always be faster if those obscenely large JSONs were smaller. 
I also implemented the bin boundary builder recipes, which had a significant impact. LMK if you want to VC. Thanks :-)
On 2016/09/26 at 19:41:00, benjhayden wrote: > On 2016/09/26 at 18:45:57, nednguyen wrote: > > The optimization of serializing sparse centralBins arrays as dictionaries makes a lot of sense to me. > > > > However, I don't really buy the idea of mapping dictionary strings to number. It's basically a compression scheme, which should better be achieved by gzip. If unpacking the zipped data create large string, let's use https://github.com/catapult-project/catapult/issues/2826 to address it. > > Actually, JSON keys must be strings, so I changed the number tags to short strings. I tried to choose meaningful characters, but I think that shorter keys do have a significant impact here. > I could understand an argument that JSON should theoretically be human-readable, but I'm not sure if that's a valid use case here. value-set-json is already serialized to a single line, and none of our tools (devtools, vim, less) are capable of dealing with a 92MB-long line without line-breaks. If you're going to have to dip into python/javascript repl anyway in order to debug Histogram.asDict/fromDict, then you might as well write a couple lines of code to help you parse the results. Similarly, protocol buffer provides tools to convert between binary and pretty-printed text. > > Sorry, I don't understand the objection to "compression schemes". Compression is just using fewer bits to encode the same information. Serializing sparse centralBins is a compression scheme. > > I agree that JSONStream will solve the v8-string-too-large problem once and for all, but efficiency (or rather, I'd call JSONStream an inefficient work-around) in one part of the pipeline does not absolve inefficiency in another part. We're trying to optimize the entire workflow of generating results, storing them, serving them, and loading them. 
JSONStream will make loading obscenely large JSONs possible, and gzip will help store and serve them at the cost of loading latency, but the entire workflow would always be faster if those obscenely large JSONs were smaller. > > I also implemented the bin boundary builder recipes, which had a significant impact. > > LMK if you want to VC. > > Thanks :-) Discussed offline. I'll back out the mapping to short key strings and check that the builder recipes fix the binBoundaries heisenbug. Thanks :-)
On 2016/09/26 at 20:39:56, benjhayden wrote: > On 2016/09/26 at 19:41:00, benjhayden wrote: > > On 2016/09/26 at 18:45:57, nednguyen wrote: > > > The optimization of serializing sparse centralBins arrays as dictionaries makes a lot of sense to me. > > > > > > However, I don't really buy the idea of mapping dictionary strings to number. It's basically a compression scheme, which should better be achieved by gzip. If unpacking the zipped data create large string, let's use https://github.com/catapult-project/catapult/issues/2826 to address it. > > > > Actually, JSON keys must be strings, so I changed the number tags to short strings. I tried to choose meaningful characters, but I think that shorter keys do have a significant impact here. > > I could understand an argument that JSON should theoretically be human-readable, but I'm not sure if that's a valid use case here. value-set-json is already serialized to a single line, and none of our tools (devtools, vim, less) are capable of dealing with a 92MB-long line without line-breaks. If you're going to have to dip into python/javascript repl anyway in order to debug Histogram.asDict/fromDict, then you might as well write a couple lines of code to help you parse the results. Similarly, protocol buffer provides tools to convert between binary and pretty-printed text. > > > > Sorry, I don't understand the objection to "compression schemes". Compression is just using fewer bits to encode the same information. Serializing sparse centralBins is a compression scheme. > > > > I agree that JSONStream will solve the v8-string-too-large problem once and for all, but efficiency (or rather, I'd call JSONStream an inefficient work-around) in one part of the pipeline does not absolve inefficiency in another part. We're trying to optimize the entire workflow of generating results, storing them, serving them, and loading them. 
JSONStream will make loading obscenely large JSONs possible, and gzip will help store and serve them at the cost of loading latency, but the entire workflow would always be faster if those obscenely large JSONs were smaller. > > > > I also implemented the bin boundary builder recipes, which had a significant impact. > > > > LMK if you want to VC. > > > > Thanks :-) > > Discussed offline. I'll back out the mapping to short key strings and check that the builder recipes fix the binBoundaries heisenbug. > Thanks :-) For posterity, the argument that convinced me is that long key strings won't be a problem when the entire pipeline is streaming all the way from Histogram objects to JSONStream to gzip'd files on disk and back again, and the mapping between short string keys and Histogram fields could be fragile, and we want to try to avoid changing the serialization format too many times.
Description was changed from ========== Serialize Histograms more efficiently. In a results2.html that Juan emailed recently, 16700 Histograms took 95646761 bytes, for an average of 5727 bytes per Histogram. The only Diagnostics were IterationInfo. This CL should reduce the average size of Histograms by about half to two-thirds. Avoid serializing unnecessary information. Serialize sparse centralBins arrays as dictionaries. Serialize HistogramBinBoundaries builder recipes instead of the boundaries themselves. Use short strings for dict keys. Add tests to monitor serialization sizes. Trimming floats is left for another CL. Changing Diagnostic dicts to use short string keys is left for another CL. Histograms now look like this: {"b":[0,[0,1000,10]],"N":"","u":"unitless","g":"1197bc31-0bbb-44aa-80f1-066121e20dc9","m":10,"v":[230,250,270,300,510,580,620,680,850,880],"r":[93,990,6.1115859726001505,532.483870967742,100,49521,6530807.225806451],"c":[0,12,[11,[{"foo":{"type":"Generic","value":"bar"}}]],10,10,10,10,10,10,10]} BUG=catapult:#2794 ========== to ========== Serialize Histograms more efficiently. In a results2.html that Juan emailed recently, 16700 Histograms took 95646761 bytes, for an average of 5727 bytes per Histogram. The only Diagnostics were IterationInfo. This CL should reduce the average size of Histograms by about half to two-thirds. Avoid serializing unnecessary information. Serialize sparse centralBins arrays as dictionaries. Serialize HistogramBinBoundaries builder recipes instead of the boundaries themselves. Add tests to monitor serialization sizes. Trimming floats is left for another CL. Changing Diagnostic dicts to use short string keys is left for another CL. 
Histograms now look like this: {"b":[0,[0,1000,10]],"N":"","u":"unitless","g":"1197bc31-0bbb-44aa-80f1-066121e20dc9","m":10,"v":[230,250,270,300,510,580,620,680,850,880],"r":[93,990,6.1115859726001505,532.483870967742,100,49521,6530807.225806451],"c":[0,12,[11,[{"foo":{"type":"Generic","value":"bar"}}]],10,10,10,10,10,10,10]} BUG=catapult:#2794 ==========
Description was changed from ========== Serialize Histograms more efficiently. In a results2.html that Juan emailed recently, 16700 Histograms took 95646761 bytes, for an average of 5727 bytes per Histogram. The only Diagnostics were IterationInfo. This CL should reduce the average size of Histograms by about half to two-thirds. Avoid serializing unnecessary information. Serialize sparse centralBins arrays as dictionaries. Serialize HistogramBinBoundaries builder recipes instead of the boundaries themselves. Add tests to monitor serialization sizes. Trimming floats is left for another CL. Changing Diagnostic dicts to use short string keys is left for another CL. Histograms now look like this: {"b":[0,[0,1000,10]],"N":"","u":"unitless","g":"1197bc31-0bbb-44aa-80f1-066121e20dc9","m":10,"v":[230,250,270,300,510,580,620,680,850,880],"r":[93,990,6.1115859726001505,532.483870967742,100,49521,6530807.225806451],"c":[0,12,[11,[{"foo":{"type":"Generic","value":"bar"}}]],10,10,10,10,10,10,10]} BUG=catapult:#2794 ========== to ========== Serialize Histograms more efficiently. In a results2.html that Juan emailed recently, 16700 Histograms took 95646761 bytes, for an average of 5727 bytes per Histogram. The only Diagnostics were IterationInfo. This CL should reduce the average size of Histograms by about half to two-thirds. Avoid serializing unnecessary information. Serialize sparse centralBins arrays as dictionaries. Serialize HistogramBinBoundaries builder recipes instead of the boundaries themselves. Add tests to monitor serialization sizes. Trimming floats is left for another CL. BUG=catapult:#2794 ==========
Serializing bin boundary builder recipes does seem to fix the mysterious boundary deletions. :-)
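A builder recipe can be replayed to reconstruct the boundaries on deserialization. The sketch below infers the recipe layout from the example dict's "b" field, [minBoundary, [sliceType, maxBoundary, numBins], ...]; that layout and the LINEAR tag value are assumptions inferred from the example, not the actual Catapult format:

```javascript
// Sketch: rebuild bin boundaries from a builder "recipe" instead of
// serializing every boundary number. Layout is an assumption based on the
// example "b":[0,[0,1000,10]] in the CL description.
const LINEAR = 0;  // assumed tag for a linearly-spaced slice of bins

function buildBoundaries(recipe) {
  const boundaries = [recipe[0]];  // first element: the minimum boundary
  for (const slice of recipe.slice(1)) {
    const [type, max, numBins] = slice;
    const prev = boundaries[boundaries.length - 1];
    if (type === LINEAR) {
      const width = (max - prev) / numBins;
      for (let i = 1; i <= numBins; ++i) boundaries.push(prev + width * i);
    }
    // Exponentially-spaced slices would be handled analogously.
  }
  return boundaries;
}
```

So `[0, [0, 1000, 10]]` expands to the eleven boundaries 0, 100, ..., 1000, and the serialized form stays the same size no matter how many bins the recipe produces.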
https://codereview.chromium.org/2364243002/diff/80001/tracing/tracing/base/ru... File tracing/tracing/base/running_statistics.html (right):
https://codereview.chromium.org/2364243002/diff/80001/tracing/tracing/base/ru... tracing/tracing/base/running_statistics.html:143: return 0; nit: can you change this to return "null"?
https://codereview.chromium.org/2364243002/diff/80001/tracing/tracing/value/h... File tracing/tracing/value/histogram.html (right):
https://codereview.chromium.org/2364243002/diff/80001/tracing/tracing/value/h... tracing/tracing/value/histogram.html:72: return this.count; I would keep this an Object.freeze([this.count]) for simplicity. I don't expect the size of "[5]" to be much different from the size of "5".
https://codereview.chromium.org/2364243002/diff/80001/tracing/tracing/value/h... tracing/tracing/value/histogram.html:162: while (this.maxNumSampleValues < this.sampleValues_.length) { seems like you just want to randomly remove (this.sampleValues_.length - n) elements. Can you add a method like getRandomSamples(populationArray, sampleSize)?
https://codereview.chromium.org/2364243002/diff/80001/tracing/tracing/value/h... tracing/tracing/value/histogram.html:550: dict.binBoundaries = this.binBoundaries_; I thought you meant to serialize the bin boundary builder instead of the boundaries themselves?
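The getRandomSamples suggestion above (which a later patch set names uniformlySampleArray) can be sketched with reservoir sampling, which draws a fixed-size uniform sample in one pass instead of repeatedly removing random elements. This body is an illustration, not the actual Catapult code:

```javascript
// Sketch: keep a uniform random sample of sampleSize elements from
// population, one pass, each element equally likely to survive
// (reservoir sampling, Algorithm R). Illustration only.
function uniformlySampleArray(population, sampleSize) {
  if (population.length <= sampleSize) return population.slice();
  const sample = population.slice(0, sampleSize);
  for (let i = sampleSize; i < population.length; ++i) {
    const j = Math.floor(Math.random() * (i + 1));
    if (j < sampleSize) sample[j] = population[i];
  }
  return sample;
}
```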
On 2016/09/27 15:04:49, nednguyen wrote: > https://codereview.chromium.org/2364243002/diff/80001/tracing/tracing/base/ru... > File tracing/tracing/base/running_statistics.html (right): > > https://codereview.chromium.org/2364243002/diff/80001/tracing/tracing/base/ru... > tracing/tracing/base/running_statistics.html:143: return 0; > nits: can you change to return "null"? > > https://codereview.chromium.org/2364243002/diff/80001/tracing/tracing/value/h... > File tracing/tracing/value/histogram.html (right): > > https://codereview.chromium.org/2364243002/diff/80001/tracing/tracing/value/h... > tracing/tracing/value/histogram.html:72: return this.count; > I would keep this a Object.freeze([this.count]) for simplicity. I don't expect > size of "[5]" is that much different from size of "5" > > https://codereview.chromium.org/2364243002/diff/80001/tracing/tracing/value/h... > tracing/tracing/value/histogram.html:162: while (this.maxNumSampleValues < > this.sampleValues_.length) { > seems like you just want to randomly remove (this.sampleValues_.length - n) > element. Can you add a method like getRandomSamples(populationArray, > sampleSize)? > > https://codereview.chromium.org/2364243002/diff/80001/tracing/tracing/value/h... > tracing/tracing/value/histogram.html:550: dict.binBoundaries = > this.binBoundaries_; > I thought you mean to serialize the bin boundary builder instead of the > boundaries themselves? Let me know if my idea is silly: instead of serializing the bins directly, why can't we just serialize the histogram's configs & all the collected samples (a dictionary + an array of numbers)? Aren't those enough to reconstruct the whole histogram?
On 2016/09/27 at 15:07:52, nednguyen wrote: > On 2016/09/27 15:04:49, nednguyen wrote: > > https://codereview.chromium.org/2364243002/diff/80001/tracing/tracing/base/ru... > > File tracing/tracing/base/running_statistics.html (right): > > > > https://codereview.chromium.org/2364243002/diff/80001/tracing/tracing/base/ru... > > tracing/tracing/base/running_statistics.html:143: return 0; > > nits: can you change to return "null"? > > > > https://codereview.chromium.org/2364243002/diff/80001/tracing/tracing/value/h... > > File tracing/tracing/value/histogram.html (right): > > > > https://codereview.chromium.org/2364243002/diff/80001/tracing/tracing/value/h... > > tracing/tracing/value/histogram.html:72: return this.count; > > I would keep this a Object.freeze([this.count]) for simplicity. I don't expect > > size of "[5]" is that much different from size of "5" > > > > https://codereview.chromium.org/2364243002/diff/80001/tracing/tracing/value/h... > > tracing/tracing/value/histogram.html:162: while (this.maxNumSampleValues < > > this.sampleValues_.length) { > > seems like you just want to randomly remove (this.sampleValues_.length - n) > > element. Can you add a method like getRandomSamples(populationArray, > > sampleSize)? > > > > https://codereview.chromium.org/2364243002/diff/80001/tracing/tracing/value/h... > > tracing/tracing/value/histogram.html:550: dict.binBoundaries = > > this.binBoundaries_; > > I thought you mean to serialize the bin boundary builder instead of the > > boundaries themselves? > > Let me know if my idea is silly: instead of serializing the bins directly, why can we just serialize the histogram's configs & all the collected samples (a dictionary + a array of numbers)? Aren't those enough to reconstruct the whole histogram? 
Histograms could contain millions of samples, so we do need machinery for serializing bins, but you're right: when there are fewer than maxNumSampleValues samples, it would be better to serialize only those and not the bins. I'll do that. Thanks!
On 2016/09/27 16:02:54, benjhayden wrote: > On 2016/09/27 at 15:07:52, nednguyen wrote: > > On 2016/09/27 15:04:49, nednguyen wrote: > > > > https://codereview.chromium.org/2364243002/diff/80001/tracing/tracing/base/ru... > > > File tracing/tracing/base/running_statistics.html (right): > > > > > > > https://codereview.chromium.org/2364243002/diff/80001/tracing/tracing/base/ru... > > > tracing/tracing/base/running_statistics.html:143: return 0; > > > nits: can you change to return "null"? > > > > > > > https://codereview.chromium.org/2364243002/diff/80001/tracing/tracing/value/h... > > > File tracing/tracing/value/histogram.html (right): > > > > > > > https://codereview.chromium.org/2364243002/diff/80001/tracing/tracing/value/h... > > > tracing/tracing/value/histogram.html:72: return this.count; > > > I would keep this a Object.freeze([this.count]) for simplicity. I don't > expect > > > size of "[5]" is that much different from size of "5" > > > > > > > https://codereview.chromium.org/2364243002/diff/80001/tracing/tracing/value/h... > > > tracing/tracing/value/histogram.html:162: while (this.maxNumSampleValues < > > > this.sampleValues_.length) { > > > seems like you just want to randomly remove (this.sampleValues_.length - n) > > > element. Can you add a method like getRandomSamples(populationArray, > > > sampleSize)? > > > > > > > https://codereview.chromium.org/2364243002/diff/80001/tracing/tracing/value/h... > > > tracing/tracing/value/histogram.html:550: dict.binBoundaries = > > > this.binBoundaries_; > > > I thought you mean to serialize the bin boundary builder instead of the > > > boundaries themselves? > > > > Let me know if my idea is silly: instead of serializing the bins directly, why > can we just serialize the histogram's configs & all the collected samples (a > dictionary + a array of numbers)? Aren't those enough to reconstruct the whole > histogram? 
> > Histograms could contain millions of samples, so we do need machinery for > serializing bins, but, you're right, when there are fewer than > maxNumSampleValues, then, yes, it would be better to only serialize those and > not the bins. I'll do that. Thanks! Arghh, I forgot the case where we need to keep track of the bin's sample count, which could be in the millions. I think we should stick with one strategy of serializing the histogram for now to simplify things, and serializing the bins (or bin builder) seems good to me.
On 2016/09/27 at 16:22:17, nednguyen wrote: > On 2016/09/27 16:02:54, benjhayden wrote: > > On 2016/09/27 at 15:07:52, nednguyen wrote: > > > On 2016/09/27 15:04:49, nednguyen wrote: > > > > > > https://codereview.chromium.org/2364243002/diff/80001/tracing/tracing/base/ru... > > > > File tracing/tracing/base/running_statistics.html (right): > > > > > > > > > > https://codereview.chromium.org/2364243002/diff/80001/tracing/tracing/base/ru... > > > > tracing/tracing/base/running_statistics.html:143: return 0; > > > > nits: can you change to return "null"? > > > > > > > > > > https://codereview.chromium.org/2364243002/diff/80001/tracing/tracing/value/h... > > > > File tracing/tracing/value/histogram.html (right): > > > > > > > > > > https://codereview.chromium.org/2364243002/diff/80001/tracing/tracing/value/h... > > > > tracing/tracing/value/histogram.html:72: return this.count; > > > > I would keep this a Object.freeze([this.count]) for simplicity. I don't > > expect > > > > size of "[5]" is that much different from size of "5" > > > > > > > > > > https://codereview.chromium.org/2364243002/diff/80001/tracing/tracing/value/h... > > > > tracing/tracing/value/histogram.html:162: while (this.maxNumSampleValues < > > > > this.sampleValues_.length) { > > > > seems like you just want to randomly remove (this.sampleValues_.length - n) > > > > element. Can you add a method like getRandomSamples(populationArray, > > > > sampleSize)? > > > > > > > > > > https://codereview.chromium.org/2364243002/diff/80001/tracing/tracing/value/h... > > > > tracing/tracing/value/histogram.html:550: dict.binBoundaries = > > > > this.binBoundaries_; > > > > I thought you mean to serialize the bin boundary builder instead of the > > > > boundaries themselves? > > > > > > Let me know if my idea is silly: instead of serializing the bins directly, why > > can we just serialize the histogram's configs & all the collected samples (a > > dictionary + a array of numbers)? 
Aren't those enough to reconstruct the whole > > histogram? > > > > Histograms could contain millions of samples, so we do need machinery for > > serializing bins, but, you're right, when there are fewer than > > maxNumSampleValues, then, yes, it would be better to only serialize those and > > not the bins. I'll do that. Thanks! > > Arghh, I forgot the case that we need to keep track of the bin's sample count which could be million. I think we should stick with one strategy of serializing the histogram now to simplify things, and serializing the bin (or bin builder) seems good to me. K, I'll just fix the binBoundaries_ field name. :-)
Description was changed from ========== Serialize Histograms more efficiently. In a results2.html that Juan emailed recently, 16700 Histograms took 95646761 bytes, for an average of 5727 bytes per Histogram. The only Diagnostics were IterationInfo. This CL should reduce the average size of Histograms by about half to two-thirds. Avoid serializing unnecessary information. Serialize sparse centralBins arrays as dictionaries. Serialize HistogramBinBoundaries builder recipes instead of the boundaries themselves. Add tests to monitor serialization sizes. Trimming floats is left for another CL. BUG=catapult:#2794 ========== to ========== Serialize Histograms more efficiently. In a results2.html that Juan emailed recently, 16700 Histograms took 95646761 bytes, for an average of 5727 bytes per Histogram. The only Diagnostics were IterationInfo. This CL reduces the average size of each Histogram in the memory.top_10_mobile benchmark to 843 bytes, an improvement of 86%. Avoid serializing unnecessary information. Serialize sparse centralBins arrays as dictionaries. Serialize HistogramBinBoundaries builder recipes instead of the boundaries themselves. Add tests to monitor serialization sizes. Trimming floats is left for another CL. BUG=catapult:#2794 ==========
I'm still occasionally seeing (unmergeable) in results2.html with memory.top_10_mobile. I found one histogram that was serialized (by vinn) without its leading 0 bin boundary, and one that created the wrong number of bins when it was deserialized (by chrome 53.0.2785.92) even though it used the exact same HistogramBinBoundaries object with the same boundaries_ array as other Histograms that created the correct number of bins. Reloading fixed the latter histogram but not the former. What do you do when for loops are broken? Anyway, this CL is getting old. PTAL? :-) https://codereview.chromium.org/2364243002/diff/80001/tracing/tracing/base/ru... File tracing/tracing/base/running_statistics.html (right): https://codereview.chromium.org/2364243002/diff/80001/tracing/tracing/base/ru... tracing/tracing/base/running_statistics.html:143: return 0; On 2016/09/27 at 15:04:49, nednguyen wrote: > nits: can you change to return "null"? We prefer undefined over null in tracing, but yes. https://codereview.chromium.org/2364243002/diff/80001/tracing/tracing/value/h... File tracing/tracing/value/histogram.html (right): https://codereview.chromium.org/2364243002/diff/80001/tracing/tracing/value/h... tracing/tracing/value/histogram.html:72: return this.count; On 2016/09/27 at 15:04:49, nednguyen wrote: > I would keep this a Object.freeze([this.count]) for simplicity. I don't expect size of "[5]" is that much different from size of "5" "[5]" is 300% larger than "5". Multiplied by dozens of bins per histogram times thousands of histograms. I think that's significant. https://codereview.chromium.org/2364243002/diff/80001/tracing/tracing/value/h... tracing/tracing/value/histogram.html:162: while (this.maxNumSampleValues < this.sampleValues_.length) { On 2016/09/27 at 15:04:49, nednguyen wrote: > seems like you just want to randomly remove (this.sampleValues_.length - n) element. Can you add a method like getRandomSamples(populationArray, sampleSize)? Statistics.uniformlySampleArray?
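The helper floated above (Statistics.uniformlySampleArray) could look roughly like this. This is an illustrative sketch, not the actual catapult API: it mirrors the while-loop at histogram.html:162, removing a uniformly-random element until the array fits under the sample limit.

```javascript
// Hypothetical sketch of the helper discussed above (names are illustrative):
// trim `samples` in place down to `sampleSize` by repeatedly removing an
// element chosen uniformly at random.
function uniformlySampleArray(samples, sampleSize) {
  while (samples.length > sampleSize) {
    const i = Math.floor(Math.random() * samples.length);
    samples.splice(i, 1);  // remove one random element in place
  }
  return samples;
}

const trimmed = uniformlySampleArray([1, 2, 3, 4, 5, 6], 3);
// trimmed has length 3, and every element came from the original array.
```

Repeated splice() makes this O(n^2) in the worst case; for the sample counts involved here that is fine, and it keeps the sketch close to the loop being refactored.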
https://codereview.chromium.org/2364243002/diff/140001/tracing/tracing/base/r... File tracing/tracing/base/running_statistics.html (right): https://codereview.chromium.org/2364243002/diff/140001/tracing/tracing/base/r... tracing/tracing/base/running_statistics.html:142: return undefined; JSON.stringify(undefined) --> undefined, so I think it's better to use null. https://codereview.chromium.org/2364243002/diff/140001/tracing/tracing/value/... File tracing/tracing/value/histogram.html (right): https://codereview.chromium.org/2364243002/diff/140001/tracing/tracing/value/... tracing/tracing/value/histogram.html:72: return this.count; ping on not special casing on this? https://codereview.chromium.org/2364243002/diff/140001/tracing/tracing/value/... tracing/tracing/value/histogram.html:84: var DEFAULT_SUMMARY_OPTIONS = new Map([ nits: Object.freeze(..) this? https://codereview.chromium.org/2364243002/diff/140001/tracing/tracing/value/... tracing/tracing/value/histogram.html:125: this.binBoundariesBuilderDict_ = binBoundaries.asDict(); Oh, now I see that the API is binBoundaries.asDict() and not binBoundaries.builder.asDict(). If you choose to hide the implementation details of binBoundaries's serialization, keep binBoundariesDict_ is fine. Though another question: why do we need to create this object right here? Isn't it only needed for asDict() method? https://codereview.chromium.org/2364243002/diff/140001/tracing/tracing/value/... tracing/tracing/value/histogram.html:137: this.summaryOptions.set('percentile', []); Why not put 'percentile' in DEFAULT_SUMMARY_OPTIONS as well? https://codereview.chromium.org/2364243002/diff/140001/tracing/tracing/value/... tracing/tracing/value/histogram.html:164: var i = parseInt(Math.random() * this.sampleValues_.length); Ping on refactor this to a helper method https://codereview.chromium.org/2364243002/diff/140001/tracing/tracing/value/... tracing/tracing/value/histogram.html:592: // efficient. 
Can you factor the logic of serializing CENTRAL_BINS to a separate method? this asDict() method is getting too long.. https://codereview.chromium.org/2364243002/diff/140001/tracing/tracing/value/... tracing/tracing/value/histogram.html:599: var centralBinsArrayOverhead = 0; I think this is a bit too complicated. Since the only extra overhead of the dictionary form is the index (should not exceeds 10k) plus the colon and comma which is very negligible, I think it's ok to just use the dict form over the array form. I would choose code simplicity, except when efficiency win is huge. In addition, a problem with "try both & pick the less overhead" strategy is that it increases peak heap memory usage. https://codereview.chromium.org/2364243002/diff/140001/tracing/tracing/value/... tracing/tracing/value/histogram.html:961: // If boundaries_ had been built, then clear them. This seems like a big change to existing API of histogram? On the left, I don't find any logic that clear existing boundaries https://codereview.chromium.org/2364243002/diff/140001/tracing/tracing/value/... tracing/tracing/value/histogram.html:966: this.range.addValue(nextMaxBinBoundary); why not this.addBinBoundary(nextMaxBinBoundary) here?
https://codereview.chromium.org/2364243002/diff/80001/tracing/tracing/value/h... File tracing/tracing/value/histogram.html (right): https://codereview.chromium.org/2364243002/diff/80001/tracing/tracing/value/h... tracing/tracing/value/histogram.html:72: return this.count; On 2016/09/27 18:46:34, benjhayden wrote: > On 2016/09/27 at 15:04:49, nednguyen wrote: > > I would keep this a Object.freeze([this.count]) for simplicity. I don't expect > size of "[5]" is that much different from size of "5" > > "[5]" is 300% larger than "5". > Multiplied by dozens of bins per histogram times thousands of histograms. > I think that's significant. I looked at how this saving contributes to the total result. In Juan's example, we have 16700 Histograms which took 95646761 bytes. As in https://github.com/catapult-project/catapult/blob/master/tracing/tracing/valu..., I see we typically have around 100 bins per histogram. So assuming all those bins are non-empty and only have 'count', you are saving 2 bytes * 16700 * 100 = 3340000 bytes, which is only 3.4% of that 95646761 bytes. But it's rarely the case that all histogram bins are filled, so the actual saving from this optimization is much smaller.
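The arithmetic above can be checked directly. The per-bin byte counts come from JSON.stringify, and the 16700-histograms, ~100-bins figures are the ones quoted in the thread:

```javascript
// "[5]" costs 3 characters where "5" costs 1, so each serialized bin saves
// 2 bytes by dropping the array wrapper.
const asArray = JSON.stringify([5]);   // '[5]'
const asNumber = JSON.stringify(5);    // '5'
const perBinSaving = asArray.length - asNumber.length;  // 2

// Upper bound on the total saving, using the figures quoted above:
// 16700 histograms, ~100 bins each, all assumed non-empty.
const totalSaving = perBinSaving * 16700 * 100;  // 3340000 bytes, ~3.4% of 95646761
```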
On 2016/09/27 19:02:58, nednguyen wrote: > https://codereview.chromium.org/2364243002/diff/80001/tracing/tracing/value/h... > File tracing/tracing/value/histogram.html (right): > > https://codereview.chromium.org/2364243002/diff/80001/tracing/tracing/value/h... > tracing/tracing/value/histogram.html:72: return this.count; > On 2016/09/27 18:46:34, benjhayden wrote: > > On 2016/09/27 at 15:04:49, nednguyen wrote: > > > I would keep this a Object.freeze([this.count]) for simplicity. I don't > expect > > size of "[5]" is that much different from size of "5" > > > > "[5]" is 300% larger than "5". > > Multiplied by dozens of bins per histogram times thousands of histograms. > > I think that's significant. > > I look at how this save contribute to the total result. In Juan's example, we > have 16700 Histograms which took 95646761 > bytes. > > As in > https://github.com/catapult-project/catapult/blob/master/tracing/tracing/valu..., > I see we typical have around 100 bins for a histogram. > > So assuming all those bins are non-empty and only has 'count', you are saving 2 > bytes * 16700 * 100 = 3340000 bytes, which is only 3.4% of that 95646761 bytes. > But it's rarely the case that all histogram bins are filled, so the actual > saving from this optimization is much smaller.. For the case where the for loop is broken, let's try to create a minimal repro and file a v8 bug.
https://codereview.chromium.org/2364243002/diff/140001/tracing/tracing/value/... File tracing/tracing/value/histogram.html (right): https://codereview.chromium.org/2364243002/diff/140001/tracing/tracing/value/... tracing/tracing/value/histogram.html:72: return this.count; On 2016/09/27 at 18:48:54, nednguyen wrote: > ping on not special casing on this? Done. https://codereview.chromium.org/2364243002/diff/140001/tracing/tracing/value/... tracing/tracing/value/histogram.html:84: var DEFAULT_SUMMARY_OPTIONS = new Map([ On 2016/09/27 at 18:48:54, nednguyen wrote: > nits: Object.freeze(..) this? Hm, it looks like Object.freeze doesn't work on Maps? https://jsfiddle.net/11kuq8m3/ http://stackoverflow.com/questions/35747325/is-there-a-way-to-freeze-a-es6-map Also, Object.freeze() impacts performance and is not necessary for these asDict() methods. I only added them to see if they would reveal any hidden places that might modify the dicts on their way out of vinn. There don't seem to be any problems with removing them. https://codereview.chromium.org/2364243002/diff/140001/tracing/tracing/value/... tracing/tracing/value/histogram.html:125: this.binBoundariesBuilderDict_ = binBoundaries.asDict(); On 2016/09/27 at 18:48:54, nednguyen wrote: > Oh, now I see that the API is binBoundaries.asDict() and not binBoundaries.builder.asDict(). If you choose to hide the implementation details of binBoundaries's serialization, keep binBoundariesDict_ is fine. > > Though another question: why do we need to create this object right here? Isn't it only needed for asDict() method? I added a comment. It's possible that binBoundaries can be modified after creating this histogram. https://codereview.chromium.org/2364243002/diff/140001/tracing/tracing/value/... tracing/tracing/value/histogram.html:137: this.summaryOptions.set('percentile', []); On 2016/09/27 at 18:48:54, nednguyen wrote: > Why not put 'percentile' in DEFAULT_SUMMARY_OPTIONS as well? 
I added a comment to DEFAULT_SUMMARY_OPTIONS. Its default value is [], which is modifiable, so clients can push() to it, so each Histogram instance needs a separate Array instance. https://codereview.chromium.org/2364243002/diff/140001/tracing/tracing/value/... tracing/tracing/value/histogram.html:164: var i = parseInt(Math.random() * this.sampleValues_.length); On 2016/09/27 at 18:48:54, nednguyen wrote: > Ping on refactor this to a helper method Done. https://codereview.chromium.org/2364243002/diff/140001/tracing/tracing/value/... tracing/tracing/value/histogram.html:592: // efficient. On 2016/09/27 at 18:48:54, nednguyen wrote: > Can you factor the logic of serializing CENTRAL_BINS to a separate method? this asDict() method is getting too long.. Done. https://codereview.chromium.org/2364243002/diff/140001/tracing/tracing/value/... tracing/tracing/value/histogram.html:599: var centralBinsArrayOverhead = 0; On 2016/09/27 at 18:48:54, nednguyen wrote: > I think this is a bit too complicated. Since the only extra overhead of the dictionary form is the index (should not exceeds 10k) plus the colon and comma which is very negligible, I think it's ok to just use the dict form over the array form. > > I would choose code simplicity, except when efficiency win is huge. > > In addition, a problem with "try both & pick the less overhead" strategy is that it increases peak heap memory usage. If a histogram has 1000 bins and they are all non-empty, dictOverhead=4890 and arrayOverhead=1000. https://codereview.chromium.org/2364243002/diff/140001/tracing/tracing/value/... tracing/tracing/value/histogram.html:961: // If boundaries_ had been built, then clear them. On 2016/09/27 at 18:48:54, nednguyen wrote: > This seems like a big change to existing API of histogram? On the left, I don't find any logic that clear existing boundaries The existing boundaries are preserved in builder_. https://codereview.chromium.org/2364243002/diff/140001/tracing/tracing/value/... 
tracing/tracing/value/histogram.html:966: this.range.addValue(nextMaxBinBoundary); On 2016/09/27 at 18:48:54, nednguyen wrote: > why not this.addBinBoundary(nextMaxBinBoundary) here? That's not how it works. Callers call addBinBoundary/addLinearBins/addExponentialBins. Those methods add elements to builder_, not boundaries_. When a Histogram calls binRanges(), HBB builds boundaries_ from builder_. There are several HBBs created statically in metrics and in this file. Many of them might never be used, so we can speed up loading trace-viewer by delaying computing boundaries_ until they are needed. Many Histograms can re-use the same HBB, so it is useful to cache boundaries_ in HBB. And Histograms need the builder_, so HBB must keep both its builder_ and its boundaries_. But if an HBB is modified after a Histogram uses its boundaries_, then the boundaries_ must be updated. The same logic applies for a never-used HBB as an HBB that was modified after it was used: its updated boundaries_ might never be needed. Now, if you're suggesting that, since an HBB was used once, then it will probably be used again after it was modified, then I can refactor build_ so that it can be called either all at once or iteratively as clients modify it, and it would behave differently based on whether it had ever been used, so that it wouldn't need to re-compute old boundaries for old builder slices. I can do that. But it would be complicated, and I feel like I might already be over-optimizing just a bit.
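The builder_/boundaries_ scheme described above can be sketched as follows. This is a simplified illustration, not the real HistogramBinBoundaries: the real build step expands linear/exponential recipe slices, whereas here the recipe already is the boundary list, so a copy stands in for "build".

```javascript
// Simplified illustration of the caching scheme described above: builder_ is
// the serializable recipe and the source of truth; boundaries_ is a lazily
// built cache that is dropped whenever the recipe grows after being built.
class LazyBinBoundaries {
  constructor(minBinBoundary) {
    this.builder_ = [minBinBoundary];  // serialized instead of boundaries_
    this.boundaries_ = undefined;      // built on demand, then cached
  }

  addBinBoundary(nextMaxBinBoundary) {
    this.builder_.push(nextMaxBinBoundary);
    this.boundaries_ = undefined;  // recipe changed; invalidate the cache
  }

  get boundaries() {
    if (this.boundaries_ === undefined) {
      // The real build step expands each recipe slice (linear, exponential);
      // here the recipe is already the boundary list, so a copy suffices.
      this.boundaries_ = this.builder_.slice();
    }
    return this.boundaries_;
  }
}

const hbb = new LazyBinBoundaries(0);
hbb.addBinBoundary(10);
const first = hbb.boundaries;   // built now: [0, 10]
hbb.addBinBoundary(20);         // cache invalidated, first is unaffected
const second = hbb.boundaries;  // rebuilt: [0, 10, 20]
```

Because boundaries are only computed on first access, HBBs created statically in metrics but never used by a Histogram cost nothing, which is the loading-speed argument made above.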
Thanks for your patience. lgtm https://codereview.chromium.org/2364243002/diff/140001/tracing/tracing/value/... File tracing/tracing/value/histogram.html (right): https://codereview.chromium.org/2364243002/diff/140001/tracing/tracing/value/... tracing/tracing/value/histogram.html:599: var centralBinsArrayOverhead = 0; On 2016/09/27 21:05:16, benjhayden wrote: > On 2016/09/27 at 18:48:54, nednguyen wrote: > > I think this is a bit too complicated. Since the only extra overhead of the > dictionary form is the index (should not exceeds 10k) plus the colon and comma > which is very negligible, I think it's ok to just use the dict form over the > array form. > > > > I would choose code simplicity, except when efficiency win is huge. > > > > In addition, a problem with "try both & pick the less overhead" strategy is > that it increases peak heap memory usage. > > If a histogram has 1000 bins and they are all non-empty, dictOverhead=4890 and > arrayOverhead=1000. You convinced me with your numbers that this is needed. Though can you find a simpler heuristic, e.g. if more than 50% of the bins are non-empty, then use the array form? I think the problem of heap space growing as we try both is still there.
https://codereview.chromium.org/2364243002/diff/140001/tracing/tracing/base/r... File tracing/tracing/base/running_statistics.html (right): https://codereview.chromium.org/2364243002/diff/140001/tracing/tracing/base/r... tracing/tracing/base/running_statistics.html:142: return undefined; On 2016/09/27 at 18:48:53, nednguyen wrote: > JSON.stringify(undefined) --> undefined, so I think it's better to use null. https://github.com/catapult-project/catapult/blob/master/docs/style-guide.md#... How about the empty array []?
On 2016/09/27 21:50:11, benjhayden wrote: > https://codereview.chromium.org/2364243002/diff/140001/tracing/tracing/base/r... > File tracing/tracing/base/running_statistics.html (right): > > https://codereview.chromium.org/2364243002/diff/140001/tracing/tracing/base/r... > tracing/tracing/base/running_statistics.html:142: return undefined; > On 2016/09/27 at 18:48:53, nednguyen wrote: > > JSON.stringify(undefined) --> undefined, so I think it's better to use null. > > https://github.com/catapult-project/catapult/blob/master/docs/style-guide.md#... > > How about the empty array []? empty array sgtm
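For reference, the JSON.stringify behavior driving this exchange: undefined has no JSON representation, so stringify returns the value undefined (not a string) at the top level and drops undefined-valued object properties, while null and [] round-trip cleanly.

```javascript
// undefined is not representable in JSON.
const top = JSON.stringify(undefined);            // undefined (the value, not a string)
const dropped = JSON.stringify({ a: undefined }); // '{}' -- the property is dropped
// null and the empty array both survive serialization.
const asNull = JSON.stringify(null);              // 'null'
const asEmpty = JSON.stringify([]);               // '[]'
```

This is why returning undefined from an asDict() helper is risky here, and why the thread settles on the empty array.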
https://codereview.chromium.org/2364243002/diff/140001/tracing/tracing/value/... File tracing/tracing/value/histogram.html (right): https://codereview.chromium.org/2364243002/diff/140001/tracing/tracing/value/... tracing/tracing/value/histogram.html:599: var centralBinsArrayOverhead = 0; On 2016/09/27 at 21:46:29, nednguyen wrote: > On 2016/09/27 21:05:16, benjhayden wrote: > > On 2016/09/27 at 18:48:54, nednguyen wrote: > > > I think this is a bit too complicated. Since the only extra overhead of the > > dictionary form is the index (should not exceeds 10k) plus the colon and comma > > which is very negligible, I think it's ok to just use the dict form over the > > array form. > > > > > > I would choose code simplicity, except when efficiency win is huge. > > > > > > In addition, a problem with "try both & pick the less overhead" strategy is > > that it increases peak heap memory usage. > > > > If a histogram has 1000 bins and they are all non-empty, dictOverhead=4890 and > > arrayOverhead=1000. > > You convinced me with your number that this is needed. Though can you find a simpler heuristic, s.t like if 50% bin are non empty, then use array form? I think the problem with heap space growing as we try both is still there. I didn't expect the tipping point to actually be at 50% since the overhead calculation is so complex, but I ran a simulation, and the tipping point drops exponentially from 100% at 2 bins to 55% at 20 bins to 50.5% at 200 bins to 50.1% at 1000 bins. So I'll just go with 50%. I still need to iterate over all centralBins twice, though, in order to avoid computing both the array and dictionary forms, since Histogram doesn't keep track of empty bins as it's being built.
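The overhead comparison behind those numbers can be reconstructed roughly as follows. This accounting is an assumption, not the CL's actual code: in dict form each non-empty bin pays its decimal index plus about two punctuation characters, while in array form every bin, empty or not, pays about one character.

```javascript
// Rough reconstruction (an assumption, not the CL's actual code) of the
// dict-vs-array serialization overhead estimate quoted above.
function dictOverhead(nonEmptyIndices) {
  // Each non-empty bin pays its index's digits plus ':' and ','.
  let total = 0;
  for (const i of nonEmptyIndices) total += String(i).length + 2;
  return total;
}

function arrayOverhead(numBins) {
  // Each bin, empty or not, pays ~1 character (placeholder or separator).
  return numBins;
}

// With 1000 bins all non-empty:
const allNonEmpty = [];
for (let i = 0; i < 1000; i++) allNonEmpty.push(i);
const d = dictOverhead(allNonEmpty);  // 4890, the figure quoted above
const a = arrayOverhead(1000);        // 1000, so the array form wins here
```

Under this model the dict form only wins when enough bins are empty, which is consistent with the simulated tipping point converging toward 50% as the bin count grows.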
On 2016/09/27 22:47:50, benjhayden wrote: > https://codereview.chromium.org/2364243002/diff/140001/tracing/tracing/value/... > File tracing/tracing/value/histogram.html (right): > > https://codereview.chromium.org/2364243002/diff/140001/tracing/tracing/value/... > tracing/tracing/value/histogram.html:599: var centralBinsArrayOverhead = 0; > On 2016/09/27 at 21:46:29, nednguyen wrote: > > On 2016/09/27 21:05:16, benjhayden wrote: > > > On 2016/09/27 at 18:48:54, nednguyen wrote: > > > > I think this is a bit too complicated. Since the only extra overhead of > the > > > dictionary form is the index (should not exceeds 10k) plus the colon and > comma > > > which is very negligible, I think it's ok to just use the dict form over the > > > array form. > > > > > > > > I would choose code simplicity, except when efficiency win is huge. > > > > > > > > In addition, a problem with "try both & pick the less overhead" strategy > is > > > that it increases peak heap memory usage. > > > > > > If a histogram has 1000 bins and they are all non-empty, dictOverhead=4890 > and > > > arrayOverhead=1000. > > > > You convinced me with your number that this is needed. Though can you find a > simpler heuristic, s.t like if 50% bin are non empty, then use array form? I > think the problem with heap space growing as we try both is still there. > > I didn't expect the tipping point to actually be at 50% since the overhead > calculation is so complex, but I ran a simulation, and the tipping point drops > exponentially from 100% at 2 bins to 55% at 20 bins to 50.5% at 200 bins to > 50.1% at 1000 bins. So I'll just go with 50%. > > I still need to iterate over all centralBins twice, though, in order to avoid > computing both the array and dictionary forms, since Histogram doesn't keep > track of empty bins as it's being built. sgtm. Iterating over the centralBins is super fast, I believe.
The CQ bit was checked by benjhayden@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from nednguyen@google.com Link to the patchset: https://codereview.chromium.org/2364243002/#ps200001 (title: ".")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Thanks for the great review! Sorry there turned out to be more design decisions than I'd thought. I should have started with a doc.
Message was sent while issue was closed.
Description was changed from ========== Serialize Histograms more efficiently. In a results2.html that Juan emailed recently, 16700 Histograms took 95646761 bytes, for an average of 5727 bytes per Histogram. The only Diagnostics were IterationInfo. This CL reduces the average size of each Histogram in the memory.top_10_mobile benchmark to 843 bytes, an improvement of 86%. Avoid serializing unnecessary information. Serialize sparse centralBins arrays as dictionaries. Serialize HistogramBinBoundaries builder recipes instead of the boundaries themselves. Add tests to monitor serialization sizes. Trimming floats is left for another CL. BUG=catapult:#2794 ========== to ========== Serialize Histograms more efficiently. In a results2.html that Juan emailed recently, 16700 Histograms took 95646761 bytes, for an average of 5727 bytes per Histogram. The only Diagnostics were IterationInfo. This CL reduces the average size of each Histogram in the memory.top_10_mobile benchmark to 843 bytes, an improvement of 86%. Avoid serializing unnecessary information. Serialize sparse centralBins arrays as dictionaries. Serialize HistogramBinBoundaries builder recipes instead of the boundaries themselves. Add tests to monitor serialization sizes. Trimming floats is left for another CL. BUG=catapult:#2794 Committed: https://chromium.googlesource.com/external/github.com/catapult-project/catapu... ==========
Committed patchset #11 (id:200001) as https://chromium.googlesource.com/external/github.com/catapult-project/catapu...
On 2016/09/28 04:28:35, benjhayden wrote: > Thanks for the great review! > Sorry there turned out to be more design decisions than I'd thought. I should > have started with a doc. No problem. Another thing that would help is dividing this patch into multiple ones, each focused on a single optimization.