Created: 6 years, 6 months ago by Alexei Svitkine (slow)
Modified: 6 years, 6 months ago
Reviewers: Ilya Sherman
CC: chromium-reviews, Ilya Sherman, asvitkine+watch_chromium.org, jar+watch_chromium.org, vadimt
Base URL: svn://svn.chromium.org/chrome/trunk/src/
Visibility: Public.
Description
Increase "log too large" threshold in MetricsService.
The current value is 50k, which is actually the average log
size on many platforms. Increase it to 100k, a more
reasonable default that, according to UMA data, should cover
the majority of cases where logs are discarded due to this
size limit.
BUG=380282
Committed: https://src.chromium.org/viewvc/chrome?view=rev&revision=278612
Patch Set 1 #
Total comments: 5
Patch Set 2 : #
Total comments: 2
Patch Set 3 : #
Messages
Total messages: 19 (0 generated)
https://codereview.chromium.org/313613003/diff/1/chrome/browser/metrics/metri...
File chrome/browser/metrics/metrics_service.cc (right):
https://codereview.chromium.org/313613003/diff/1/chrome/browser/metrics/metri...
chrome/browser/metrics/metrics_service.cc:250: const size_t kUploadLogAvoidRetransmitSize = 250000;
Optional nit: This might be a little easier to interpret as "250 * 1024" or "250 * 1000".
https://codereview.chromium.org/313613003/diff/1/chrome/browser/metrics/metri...
chrome/browser/metrics/metrics_service.cc:250: const size_t kUploadLogAvoidRetransmitSize = 250000;
This limit is now very close to kStorageByteLimitPerLogType from metrics_log_manager.cc. Should we bump that limit as well? If not, we risk dropping more small logs for the sake of storing a single large log.
https://codereview.chromium.org/313613003/diff/1/chrome/browser/metrics/metri...
chrome/browser/metrics/metrics_service.cc:250: const size_t kUploadLogAvoidRetransmitSize = 250000;
What is actually accounting for the larger size of logs today? I'm concerned about just bumping this limit. Should we be thinking about how to reduce the size of the logs instead? UMA ideally shouldn't have a significant impact on a user's network or disk usage...
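For context, the check being discussed amounts to a size guard on the staged log before it is (re)transmitted. The following is only a simplified sketch of that guard, with an illustrative function name rather than the actual metrics_service.cc code:

  #include <cstddef>
  #include <string>

  // Threshold above which a staged log is dropped instead of being
  // retransmitted (value from patch set 1 of this CL).
  const size_t kUploadLogAvoidRetransmitSize = 250000;

  // Sketch only: the real check lives inside MetricsService's upload path.
  bool ShouldDiscardOversizedLog(const std::string& serialized_log) {
    return serialized_log.length() > kUploadLogAvoidRetransmitSize;
  }

A log that trips this check is discarded rather than kept for retransmission, which is why the threshold directly determines how much data is lost.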
Responding to one comment for now, will look at the others in a bit.
https://codereview.chromium.org/313613003/diff/1/chrome/browser/metrics/metri...
File chrome/browser/metrics/metrics_service.cc (right):
https://codereview.chromium.org/313613003/diff/1/chrome/browser/metrics/metri...
chrome/browser/metrics/metrics_service.cc:250: const size_t kUploadLogAvoidRetransmitSize = 250000;
On 2014/06/03 19:51:49, Ilya Sherman wrote:
> What is actually accounting for the larger size of logs today? I'm concerned
> about just bumping this limit. Should we be thinking about how to reduce the
> size of the logs instead? UMA ideally shouldn't have a significant impact on a
> user's network or disk usage...
I'm not sure when the old value was chosen or how it was picked. My guess is that we simply have more histograms now than before.
I agree that we should try to reduce the size of logs. We already compress them, which provides on average a 50% saving for net transmissions. We probably should persist compressed logs too, rather than uncompressed ones.
And we do want to start deprecating / removing histograms more aggressively.
Additionally, there are probably more efficient ways to store them than in local state - e.g. the base64 round-trip probably doesn't help performance.
Still, I don't think those efforts should block this change. Losing a large number of logs like we do currently is quite bad.
https://codereview.chromium.org/313613003/diff/1/chrome/browser/metrics/metri... File chrome/browser/metrics/metrics_service.cc (right): https://codereview.chromium.org/313613003/diff/1/chrome/browser/metrics/metri... chrome/browser/metrics/metrics_service.cc:250: const size_t kUploadLogAvoidRetransmitSize = 250000; On 2014/06/03 19:56:45, Alexei Svitkine wrote: > On 2014/06/03 19:51:49, Ilya Sherman wrote: > > What is actually accounting for the larger size of logs today? I'm concerned > > about just bumping this limit. Should we be thinking about how to reduce the > > size of the logs instead? UMA ideally shouldn't have a significant impact on > a > > user's network or disk usage... > > I'm not sure when the old value was chosen or how it was picked. My guess is we > simply have more histograms now than before. > > I agree that we should try to reduce the size of logs. We already compress them, > which provides on average a 50% saving for net transmissions. We probably should > persist compressed logs too, rather than uncompressed ones. > > And we do want to start deprecating / removing histograms more aggressively. > > Additional, there's probably more efficient ways to store them than in local > state - e.g. the base64 round-trip probably doesn't help performance. > > Still, I don't think those efforts should block this change. Losing a large > number of logs like we do currently is quite bad. I agree that dropping lots of logs is quite bad, and that we should stop doing that :) However, while I don't think we should abandon this change in favor of something more fundamental, I do think it's important to take the time to better understand the problem before proceeding. Specifically, I'd like to have a game plan that goes beyond "let's bump this limit now, and then think about root causes later" before we go ahead and bump the limit. Otherwise, there's a good chance that, despite good intentions, we'll just bump the limit and forget about it. I'd really like to better understand what's taking up space in the large logs. Is it really histograms? Is it perhaps profiler data, or maybe even user actions? Also, are most logs around 50K now, or are we dealing with some sort of multi-modal distribution? Data would help. In terms of compression, I agree that compression is a good idea. We should probably discard logs based on their compressed size, not their inflated size. We could then be more conservative about bumping this limit. It might be the case that after deeper cogitation, we decide that bumping the limit is really the right thing to do. If so, that's fine. I just want to make sure that we take an appropriate amount of time to consider root causes and other possible solutions before we just commit to using more resources.
On 2014/06/03 20:25:19, Ilya Sherman wrote: > https://codereview.chromium.org/313613003/diff/1/chrome/browser/metrics/metri... > File chrome/browser/metrics/metrics_service.cc (right): > > https://codereview.chromium.org/313613003/diff/1/chrome/browser/metrics/metri... > chrome/browser/metrics/metrics_service.cc:250: const size_t > kUploadLogAvoidRetransmitSize = 250000; > On 2014/06/03 19:56:45, Alexei Svitkine wrote: > > On 2014/06/03 19:51:49, Ilya Sherman wrote: > > > What is actually accounting for the larger size of logs today? I'm > concerned > > > about just bumping this limit. Should we be thinking about how to reduce > the > > > size of the logs instead? UMA ideally shouldn't have a significant impact > on > > a > > > user's network or disk usage... > > > > I'm not sure when the old value was chosen or how it was picked. My guess is > we > > simply have more histograms now than before. > > > > I agree that we should try to reduce the size of logs. We already compress > them, > > which provides on average a 50% saving for net transmissions. We probably > should > > persist compressed logs too, rather than uncompressed ones. > > > > And we do want to start deprecating / removing histograms more aggressively. > > > > Additional, there's probably more efficient ways to store them than in local > > state - e.g. the base64 round-trip probably doesn't help performance. > > > > Still, I don't think those efforts should block this change. Losing a large > > number of logs like we do currently is quite bad. > > I agree that dropping lots of logs is quite bad, and that we should stop doing > that :) > > However, while I don't think we should abandon this change in favor of something > more fundamental, I do think it's important to take the time to better > understand the problem before proceeding. Specifically, I'd like to have a game > plan that goes beyond "let's bump this limit now, and then think about root > causes later" before we go ahead and bump the limit. Otherwise, there's a good > chance that, despite good intentions, we'll just bump the limit and forget about > it. > > I'd really like to better understand what's taking up space in the large logs. > Is it really histograms? Is it perhaps profiler data, or maybe even user > actions? Also, are most logs around 50K now, or are we dealing with some sort > of multi-modal distribution? Data would help. > > In terms of compression, I agree that compression is a good idea. We should > probably discard logs based on their compressed size, not their inflated size. > We could then be more conservative about bumping this limit. > > It might be the case that after deeper cogitation, we decide that bumping the > limit is really the right thing to do. If so, that's fine. I just want to make > sure that we take an appropriate amount of time to consider root causes and > other possible solutions before we just commit to using more resources. Good point about profiler data - I suspect that it may indeed be one of the culprits. From previous analyses, I know that profiler data accounts for about 25% of the overall data, which is quite high given that we only send it in the first metrics log. Indeed, I was seeing the limit hit in local builds for the initial log upload, so likely profiler data may be the cause there. Let me investigate more... +vadimt FYI
On 2014/06/03 20:46:18, Alexei Svitkine wrote: > On 2014/06/03 20:25:19, Ilya Sherman wrote: > > > https://codereview.chromium.org/313613003/diff/1/chrome/browser/metrics/metri... > > File chrome/browser/metrics/metrics_service.cc (right): > > > > > https://codereview.chromium.org/313613003/diff/1/chrome/browser/metrics/metri... > > chrome/browser/metrics/metrics_service.cc:250: const size_t > > kUploadLogAvoidRetransmitSize = 250000; > > On 2014/06/03 19:56:45, Alexei Svitkine wrote: > > > On 2014/06/03 19:51:49, Ilya Sherman wrote: > > > > What is actually accounting for the larger size of logs today? I'm > > concerned > > > > about just bumping this limit. Should we be thinking about how to reduce > > the > > > > size of the logs instead? UMA ideally shouldn't have a significant impact > > on > > > a > > > > user's network or disk usage... > > > > > > I'm not sure when the old value was chosen or how it was picked. My guess is > > we > > > simply have more histograms now than before. > > > > > > I agree that we should try to reduce the size of logs. We already compress > > them, > > > which provides on average a 50% saving for net transmissions. We probably > > should > > > persist compressed logs too, rather than uncompressed ones. > > > > > > And we do want to start deprecating / removing histograms more aggressively. > > > > > > Additional, there's probably more efficient ways to store them than in local > > > state - e.g. the base64 round-trip probably doesn't help performance. > > > > > > Still, I don't think those efforts should block this change. Losing a large > > > number of logs like we do currently is quite bad. > > > > I agree that dropping lots of logs is quite bad, and that we should stop doing > > that :) > > > > However, while I don't think we should abandon this change in favor of > something > > more fundamental, I do think it's important to take the time to better > > understand the problem before proceeding. Specifically, I'd like to have a > game > > plan that goes beyond "let's bump this limit now, and then think about root > > causes later" before we go ahead and bump the limit. Otherwise, there's a > good > > chance that, despite good intentions, we'll just bump the limit and forget > about > > it. > > > > I'd really like to better understand what's taking up space in the large logs. > > > Is it really histograms? Is it perhaps profiler data, or maybe even user > > actions? Also, are most logs around 50K now, or are we dealing with some sort > > of multi-modal distribution? Data would help. > > > > In terms of compression, I agree that compression is a good idea. We should > > probably discard logs based on their compressed size, not their inflated size. > > > We could then be more conservative about bumping this limit. > > > > It might be the case that after deeper cogitation, we decide that bumping the > > limit is really the right thing to do. If so, that's fine. I just want to > make > > sure that we take an appropriate amount of time to consider root causes and > > other possible solutions before we just commit to using more resources. > > Good point about profiler data - I suspect that it may indeed be one of the > culprits. From previous analyses, I know that profiler data accounts for about > 25% of the overall data, which is quite high given that we only send it in the > first metrics log. Indeed, I was seeing the limit hit in local builds for the > initial log upload, so likely profiler data may be the cause there. Let me > investigate more... 
> +vadimt FYI
Indeed, it looks like profiler data is one of the contributors. Re-running my local test with a TOT official build, I get the following for the first upload:
[27616:1287:0603/165940:INFO:metrics_log_base.cc(93)] -----------------------------
[27616:1287:0603/165940:INFO:metrics_log_base.cc(94)] Size total: 77307
[27616:1287:0603/165940:INFO:metrics_log_base.cc(99)] Size without profiler data: 30113
[27616:1287:0603/165940:INFO:metrics_log_base.cc(100)] -----------------------------
What I did was simply remove the profiler data, re-serialize the log and compare both sizes. So on my Mac build, profiler data contributes 47k of the 77k of the initial log. We also know it contributes a significant amount server-side.
Any good ideas about what to do about it? In the past, when I only had the server-side data size on my mind, I thought we could simply introduce a sampling-percentage param - so if it's 0.5, then we would only collect it for 50% of the clients. But this wouldn't work if we're trying to optimize to keep individual log size down. I suppose we could either collect profiler data earlier at startup, so there's less of it (which doesn't sound ideal), or have a separate log just for it, but that seems like a lot of effort for dubious benefit (i.e. we'd still be uploading just as much data).
It's also not clear to me whether we should really be worrying about this extra size per-log. Honestly, it's still not that much data for a network transmission (though perhaps a bit heavier for reading serialized logs from local state).
Ilya, do you have any concrete suggestions / ideas here? Or are you okay with upping the limit now that we know the cause? WDYT?
In addition to upping the limit, I could also add some more detailed histograms about log sizes, so that later on we could get data from actual users rather than anecdotal evidence from local testing.
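For reference, the comparison above can be reproduced with something along these lines. This is only a sketch: it assumes the profiler data lives in the repeated profiler_event field of the UMA proto, and the header path is approximate.

  #include <cstddef>
  #include <utility>

  // Header path approximate for the UMA protobuf definitions.
  #include "chrome/common/metrics/proto/chrome_user_metrics_extension.pb.h"

  // Returns {total serialized size, size with profiler data stripped}.
  // Assumes profiler data is stored in the repeated profiler_event field.
  std::pair<size_t, size_t> MeasureProfilerContribution(
      metrics::ChromeUserMetricsExtension uma_proto) {
    const size_t total_size = uma_proto.ByteSize();
    uma_proto.clear_profiler_event();
    const size_t size_without_profiler = uma_proto.ByteSize();
    return std::make_pair(total_size, size_without_profiler);
  }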
On 2014/06/03 21:09:34, Alexei Svitkine wrote: > On 2014/06/03 20:46:18, Alexei Svitkine wrote: > > On 2014/06/03 20:25:19, Ilya Sherman wrote: > > > > > > https://codereview.chromium.org/313613003/diff/1/chrome/browser/metrics/metri... > > > File chrome/browser/metrics/metrics_service.cc (right): > > > > > > > > > https://codereview.chromium.org/313613003/diff/1/chrome/browser/metrics/metri... > > > chrome/browser/metrics/metrics_service.cc:250: const size_t > > > kUploadLogAvoidRetransmitSize = 250000; > > > On 2014/06/03 19:56:45, Alexei Svitkine wrote: > > > > On 2014/06/03 19:51:49, Ilya Sherman wrote: > > > > > What is actually accounting for the larger size of logs today? I'm > > > concerned > > > > > about just bumping this limit. Should we be thinking about how to > reduce > > > the > > > > > size of the logs instead? UMA ideally shouldn't have a significant > impact > > > on > > > > a > > > > > user's network or disk usage... > > > > > > > > I'm not sure when the old value was chosen or how it was picked. My guess > is > > > we > > > > simply have more histograms now than before. > > > > > > > > I agree that we should try to reduce the size of logs. We already compress > > > them, > > > > which provides on average a 50% saving for net transmissions. We probably > > > should > > > > persist compressed logs too, rather than uncompressed ones. > > > > > > > > And we do want to start deprecating / removing histograms more > aggressively. > > > > > > > > Additional, there's probably more efficient ways to store them than in > local > > > > state - e.g. the base64 round-trip probably doesn't help performance. > > > > > > > > Still, I don't think those efforts should block this change. Losing a > large > > > > number of logs like we do currently is quite bad. > > > > > > I agree that dropping lots of logs is quite bad, and that we should stop > doing > > > that :) > > > > > > However, while I don't think we should abandon this change in favor of > > something > > > more fundamental, I do think it's important to take the time to better > > > understand the problem before proceeding. Specifically, I'd like to have a > > game > > > plan that goes beyond "let's bump this limit now, and then think about root > > > causes later" before we go ahead and bump the limit. Otherwise, there's a > > good > > > chance that, despite good intentions, we'll just bump the limit and forget > > about > > > it. > > > > > > I'd really like to better understand what's taking up space in the large > logs. > > > > > Is it really histograms? Is it perhaps profiler data, or maybe even user > > > actions? Also, are most logs around 50K now, or are we dealing with some > sort > > > of multi-modal distribution? Data would help. > > > > > > In terms of compression, I agree that compression is a good idea. We should > > > probably discard logs based on their compressed size, not their inflated > size. > > > > > We could then be more conservative about bumping this limit. > > > > > > It might be the case that after deeper cogitation, we decide that bumping > the > > > limit is really the right thing to do. If so, that's fine. I just want to > > make > > > sure that we take an appropriate amount of time to consider root causes and > > > other possible solutions before we just commit to using more resources. > > > > Good point about profiler data - I suspect that it may indeed be one of the > > culprits. 
From previous analyses, I know that profiler data accounts for about > > 25% of the overall data, which is quite high given that we only send it in the > > first metrics log. Indeed, I was seeing the limit hit in local builds for the > > initial log upload, so likely profiler data may be the cause there. Let me > > investigate more... +vadimt FYI > > Indeed, it looks like profiler data is one of the contributors. Re-running my > local test with a TOT official build, I get the following for the first upload: > > [27616:1287:0603/165940:INFO:metrics_log_base.cc(93)] > ----------------------------- > [27616:1287:0603/165940:INFO:metrics_log_base.cc(94)] Size total: 77307 > [27616:1287:0603/165940:INFO:metrics_log_base.cc(99)] Size without profiler > data: 30113 > [27616:1287:0603/165940:INFO:metrics_log_base.cc(100)] > ----------------------------- > > What I did was simply remove the profiler data and re-serialize it and compare > both sizes. So on my Mac build, profiler data contributes to 47k out of the 77k > of the initial log. We also know it contributes a significant amount > server-side. > > Any good ideas about what to do about it? In the past, when I only had the > server-side data size on my mind, I thought we could simply introduce a > sampling-percentage param - so if it's 0.5, then we would only collect it for > 50% of the clients. But this wouldn't work if we're trying to optimize to keep > individual log size down. I suppose we could either collect profiler data > earlier at start up, so there's less of it (which doesn't sound ideal) or have a > separate log just for it, but that seems a lot of effort for dubious benefit > (i.e. we'd still be uploading just as much data). > > It's also not clear to me whether we should really be worrying about this extra > size per-log. Honestly, it's still not that much data for a network transmission > (though perhaps a bit heavier for reading serialized logs from local state). > > Ilya, do you have any concrete suggestions / ideas here? Or are you okay with > upping the limit now that we know the cause? WDYT? > > In addition to upping the limit, I could also add some more detailed histograms > about log sizes, so that later one we could get data from actual users rather > than anecdotal evidence from local testing. I agree that individual log size isn't really the metric that we care about: We care more about the cumulative size. Concretely, I'd suggest the following: (a) For network data usage, measure what we actually send, i.e. compressed bytes. We can then bump the limit, but to something less than 250KB. (b) For storage to disk, we should definitely compress the data. In fact, we should probably just compress the serialized logs immediately when we serialize them. (c) Also for storage to disk, we might want to think about storing profiler data separately. We could then limit the amount of profiler data we're willing to keep separately from how we limit other data. (d) We should add more metrics, including especially the uploaded log size. This will give a better indication of how frequent large logs are, and what the distribution looks like. (e) We should check with the Android and iOS teams. I think the mobile browsers are likely going to be the most concerned about increasing this threshold.
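A minimal sketch of what item (d) could look like at the point where a log is staged for upload; the histogram names and bucket ranges below are placeholders, not necessarily what eventually landed:

  #include <cstddef>
  #include "base/metrics/histogram.h"

  // Records uploaded-log sizes in KB so the real-world size distribution
  // (and how often logs approach the discard threshold) can be observed.
  void RecordLogSizeHistograms(size_t uncompressed_bytes,
                               size_t compressed_bytes) {
    UMA_HISTOGRAM_CUSTOM_COUNTS("UMA.LogSize.Uncompressed",  // placeholder name
                                static_cast<int>(uncompressed_bytes / 1024),
                                1, 1000, 50);
    UMA_HISTOGRAM_CUSTOM_COUNTS("UMA.LogSize.Compressed",  // placeholder name
                                static_cast<int>(compressed_bytes / 1024),
                                1, 1000, 50);
  }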
Some ideas regarding profiler data, not necessarily for near-term implementation:
1. We can focus on sending only “interesting” data. Probably, the most interesting issues that can be found with the profiler are (1) slow startup, (2) slow pageloads and (3) UI thread blocking/junk. Currently, we send the first 30 sec, but for (1) it’s enough to stop recording when the browser is considered “started up”, which is typically earlier. For (2), we can turn recording on and off when the pageload task chain starts and ends. For (3) we can send only tasks on the UI thread that take more than a specified threshold (200ms?).
2. I can imagine decimation of the data on the client, with the decimation factor delivered via Finch.
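Idea #2 above could be gated with a session-randomized field trial. The sketch below uses a made-up trial/group name; the actual sampling percentages would be configured server-side via Finch.

  #include <string>
  #include "base/metrics/field_trial.h"

  // Hypothetical trial: clients assigned to the "Disabled" group skip
  // profiler collection entirely; group percentages are controlled via Finch.
  bool ShouldCollectProfilerData() {
    return base::FieldTrialList::FindFullName("UMAProfilerSampling") !=
           "Disabled";
  }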
Not junk, jank :)
I'll take a look at compressing logs before they're stored to local state and also at checking the size of the logs after compression (hoping to do it Friday during my flight). It's a little tricky because:
- We need to remain compatible with older logs (i.e. introduce a new pref)
- Now we'll need to depend on compression_utils.cc from core MetricsService code, not just the net uploader
- We also now should store the hash with the logs (since the hash is of the uncompressed data, so we can't re-compute it when loading)
Anyway, it just means we have to be a bit more careful, but it should be doable.
Regarding profiler data, I agree with what vadimt@ said. It seems like the biggest bang for the buck we can get short-term is to use a Finch session-randomized field trial to control what percentage of users actually send up profiler data (i.e. your point #2). Then we can scale it up/down server-side as needed. We can even set different percentages per platform, since, for example, we probably get a lot more data than we really need from Windows compared to other platforms. And this should provide a very good reduction in logs server-side. Vadim, could you take a look at setting this up?
On Wed, Jun 4, 2014 at 9:49 AM, <vadimt@chromium.org> wrote:
> Not junk, jank :)
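A sketch of the persistence change being described, assuming a GzipCompress helper from compression_utils and a pref format that stores the gzipped, base64-encoded log together with the hash of the uncompressed data (names, namespaces and include paths are approximate):

  #include <string>
  #include "base/base64.h"
  #include "base/sha1.h"
  #include "chrome/browser/metrics/compression_utils.h"  // path approximate

  // Prepares a log for local-state storage: gzip, then base64-encode, and
  // keep the SHA-1 of the uncompressed text so it doesn't need to be
  // recomputed (which would require decompressing) when logs are reloaded.
  bool PrepareLogForPersisting(const std::string& log_text,
                               std::string* encoded_log,
                               std::string* hash) {
    std::string compressed;
    if (!chrome::GzipCompress(log_text, &compressed))  // signature assumed
      return false;
    base::Base64Encode(compressed, encoded_log);
    *hash = base::SHA1HashString(log_text);
    return true;
  }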
That plan sounds good. Thanks, Alexei!
Now that the compression change has landed, we're still seeing discarded logs, but almost all of them now fall under the 100k mark. So I've updated this CL to set 100k as the new limit. Ilya, are you OK with this now?
Should we bump kStorageByteLimitPerLogType as well?
https://codereview.chromium.org/313613003/diff/20001/components/metrics/metri...
File components/metrics/metrics_service.cc (right):
https://codereview.chromium.org/313613003/diff/20001/components/metrics/metri...
components/metrics/metrics_service.cc:225: const size_t kUploadLogAvoidRetransmitSize = 100 * 1000;
nit: Do we want 1024 rather than 1000, so that we're measuring in actual kilobytes?
I think we don't need to bump kStorageByteLimitPerLogType, since the logic is to keep the greater of min_log_count logs or kStorageByteLimitPerLogType bytes of logs. So if there's a very large log, we'll probably ignore that limit anyway and store min_log_count of logs (8 for ongoing logs and 20 for initial logs). It's only if the logs are very small that we'll try to store at least kStorageByteLimitPerLogType (300k) of them. So I think we can keep that limit as-is.
https://codereview.chromium.org/313613003/diff/20001/components/metrics/metri...
File components/metrics/metrics_service.cc (right):
https://codereview.chromium.org/313613003/diff/20001/components/metrics/metri...
components/metrics/metrics_service.cc:225: const size_t kUploadLogAvoidRetransmitSize = 100 * 1000;
On 2014/06/17 21:02:48, Ilya Sherman wrote:
> nit: Do we want 1024 rather than 1000, so that we're measuring in actual
> kilobytes?
Done.
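For reference, the store-limit behavior described above boils down to something like the following simplified sketch of the log-pruning logic (names approximate; the real code enforces additional constraints):

  #include <cstddef>
  #include <string>
  #include <vector>

  // Counts how many of the most recent logs to keep: logs are kept until
  // there are at least |min_log_count| of them AND at least |min_log_bytes|
  // bytes, i.e. the greater of the two limits wins.
  size_t CountLogsToKeep(const std::vector<std::string>& logs,  // newest last
                         size_t min_log_count,
                         size_t min_log_bytes) {
    size_t bytes_kept = 0;
    size_t logs_kept = 0;
    for (std::vector<std::string>::const_reverse_iterator it = logs.rbegin();
         it != logs.rend(); ++it) {
      if (logs_kept >= min_log_count && bytes_kept >= min_log_bytes)
        break;
      bytes_kept += it->size();
      ++logs_kept;
    }
    return logs_kept;
  }

With min_log_count at 8 (ongoing) or 20 (initial) and min_log_bytes at 300k, a single very large log simply satisfies the byte limit sooner, so the count limit dominates - which is why bumping kStorageByteLimitPerLogType isn't needed.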
Makes sense. LGTM. Thanks, Alexei.
The CQ bit was checked by isherman@chromium.org
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/asvitkine@chromium.org/313613003/40001
FYI, CQ is re-trying this CL (attempt #1). The failing builders are: android_aosp on tryserver.chromium (http://build.chromium.org/p/tryserver.chromium/builders/android_aosp/builds/8...)
Message was sent while issue was closed.
Change committed as 278612