Created: 6 years, 6 months ago by Alexei Svitkine (slow)
Modified: 6 years, 6 months ago
Reviewers: Ilya Sherman
CC: chromium-reviews, Ilya Sherman, asvitkine+watch_chromium.org, jar+watch_chromium.org, vadimt
Base URL: svn://svn.chromium.org/chrome/trunk/src/
Visibility: Public.
Description
Increase "log too large" threshold in MetricsService.
The current value is 50k, which is actually the average log
size on many platforms. Increase it to 100k, a more
reasonable default that, according to UMA data, should cover
the majority of cases where logs are discarded due to this
size limit.
BUG=380282
Committed: https://src.chromium.org/viewvc/chrome?view=rev&revision=278612
Patch Set 1 #
Total comments: 5
Patch Set 2 : #
Total comments: 2
Patch Set 3 : #
Messages
Total messages: 19 (0 generated)
https://codereview.chromium.org/313613003/diff/1/chrome/browser/metrics/metri...
File chrome/browser/metrics/metrics_service.cc (right):
https://codereview.chromium.org/313613003/diff/1/chrome/browser/metrics/metri...
chrome/browser/metrics/metrics_service.cc:250: const size_t kUploadLogAvoidRetransmitSize = 250000;
Optional nit: This might be a little easier to interpret as "250 * 1024" or "250 * 1000".
https://codereview.chromium.org/313613003/diff/1/chrome/browser/metrics/metri...
chrome/browser/metrics/metrics_service.cc:250: const size_t kUploadLogAvoidRetransmitSize = 250000;
This limit is now very close to kStorageByteLimitPerLogType from metrics_log_manager.cc. Should we bump that limit as well? If not, we risk dropping more small logs for the sake of storing a single large log.
https://codereview.chromium.org/313613003/diff/1/chrome/browser/metrics/metri...
chrome/browser/metrics/metrics_service.cc:250: const size_t kUploadLogAvoidRetransmitSize = 250000;
What is actually accounting for the larger size of logs today? I'm concerned about just bumping this limit. Should we be thinking about how to reduce the size of the logs instead? UMA ideally shouldn't have a significant impact on a user's network or disk usage...
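For context, the check being discussed amounts to a size guard on the staged log before it is (re)transmitted. The following is only a simplified sketch of that guard, with an illustrative function name rather than the actual metrics_service.cc code:

  #include <cstddef>
  #include <string>

  // Threshold above which a staged log is dropped instead of being
  // retransmitted (value from patch set 1 of this CL).
  const size_t kUploadLogAvoidRetransmitSize = 250000;

  // Sketch only: the real check lives inside MetricsService's upload path.
  bool ShouldDiscardOversizedLog(const std::string& serialized_log) {
    return serialized_log.length() > kUploadLogAvoidRetransmitSize;
  }

A log that trips this check is discarded rather than kept for retransmission, which is why the threshold directly determines how much data is lost.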
Responding to one comment for now, will look at the others in a bit.
https://codereview.chromium.org/313613003/diff/1/chrome/browser/metrics/metri...
File chrome/browser/metrics/metrics_service.cc (right):
https://codereview.chromium.org/313613003/diff/1/chrome/browser/metrics/metri...
chrome/browser/metrics/metrics_service.cc:250: const size_t kUploadLogAvoidRetransmitSize = 250000;
On 2014/06/03 19:51:49, Ilya Sherman wrote:
> What is actually accounting for the larger size of logs today? I'm concerned
> about just bumping this limit. Should we be thinking about how to reduce the
> size of the logs instead? UMA ideally shouldn't have a significant impact on a
> user's network or disk usage...
I'm not sure when the old value was chosen or how it was picked. My guess is that we simply have more histograms now than before.
I agree that we should try to reduce the size of logs. We already compress them, which provides on average a 50% saving for net transmissions. We probably should persist compressed logs too, rather than uncompressed ones.
And we do want to start deprecating / removing histograms more aggressively.
Additionally, there are probably more efficient ways to store them than in local state - e.g. the base64 round-trip probably doesn't help performance.
Still, I don't think those efforts should block this change. Losing a large number of logs like we do currently is quite bad.
https://codereview.chromium.org/313613003/diff/1/chrome/browser/metrics/metri... File chrome/browser/metrics/metrics_service.cc (right): https://codereview.chromium.org/313613003/diff/1/chrome/browser/metrics/metri... chrome/browser/metrics/metrics_service.cc:250: const size_t kUploadLogAvoidRetransmitSize = 250000; On 2014/06/03 19:56:45, Alexei Svitkine wrote: > On 2014/06/03 19:51:49, Ilya Sherman wrote: > > What is actually accounting for the larger size of logs today? I'm concerned > > about just bumping this limit. Should we be thinking about how to reduce the > > size of the logs instead? UMA ideally shouldn't have a significant impact on > a > > user's network or disk usage... > > I'm not sure when the old value was chosen or how it was picked. My guess is we > simply have more histograms now than before. > > I agree that we should try to reduce the size of logs. We already compress them, > which provides on average a 50% saving for net transmissions. We probably should > persist compressed logs too, rather than uncompressed ones. > > And we do want to start deprecating / removing histograms more aggressively. > > Additional, there's probably more efficient ways to store them than in local > state - e.g. the base64 round-trip probably doesn't help performance. > > Still, I don't think those efforts should block this change. Losing a large > number of logs like we do currently is quite bad. I agree that dropping lots of logs is quite bad, and that we should stop doing that :) However, while I don't think we should abandon this change in favor of something more fundamental, I do think it's important to take the time to better understand the problem before proceeding. Specifically, I'd like to have a game plan that goes beyond "let's bump this limit now, and then think about root causes later" before we go ahead and bump the limit. Otherwise, there's a good chance that, despite good intentions, we'll just bump the limit and forget about it. I'd really like to better understand what's taking up space in the large logs. Is it really histograms? Is it perhaps profiler data, or maybe even user actions? Also, are most logs around 50K now, or are we dealing with some sort of multi-modal distribution? Data would help. In terms of compression, I agree that compression is a good idea. We should probably discard logs based on their compressed size, not their inflated size. We could then be more conservative about bumping this limit. It might be the case that after deeper cogitation, we decide that bumping the limit is really the right thing to do. If so, that's fine. I just want to make sure that we take an appropriate amount of time to consider root causes and other possible solutions before we just commit to using more resources.
On 2014/06/03 20:25:19, Ilya Sherman wrote: > https://codereview.chromium.org/313613003/diff/1/chrome/browser/metrics/metri... > File chrome/browser/metrics/metrics_service.cc (right): > > https://codereview.chromium.org/313613003/diff/1/chrome/browser/metrics/metri... > chrome/browser/metrics/metrics_service.cc:250: const size_t > kUploadLogAvoidRetransmitSize = 250000; > On 2014/06/03 19:56:45, Alexei Svitkine wrote: > > On 2014/06/03 19:51:49, Ilya Sherman wrote: > > > What is actually accounting for the larger size of logs today? I'm > concerned > > > about just bumping this limit. Should we be thinking about how to reduce > the > > > size of the logs instead? UMA ideally shouldn't have a significant impact > on > > a > > > user's network or disk usage... > > > > I'm not sure when the old value was chosen or how it was picked. My guess is > we > > simply have more histograms now than before. > > > > I agree that we should try to reduce the size of logs. We already compress > them, > > which provides on average a 50% saving for net transmissions. We probably > should > > persist compressed logs too, rather than uncompressed ones. > > > > And we do want to start deprecating / removing histograms more aggressively. > > > > Additional, there's probably more efficient ways to store them than in local > > state - e.g. the base64 round-trip probably doesn't help performance. > > > > Still, I don't think those efforts should block this change. Losing a large > > number of logs like we do currently is quite bad. > > I agree that dropping lots of logs is quite bad, and that we should stop doing > that :) > > However, while I don't think we should abandon this change in favor of something > more fundamental, I do think it's important to take the time to better > understand the problem before proceeding. Specifically, I'd like to have a game > plan that goes beyond "let's bump this limit now, and then think about root > causes later" before we go ahead and bump the limit. Otherwise, there's a good > chance that, despite good intentions, we'll just bump the limit and forget about > it. > > I'd really like to better understand what's taking up space in the large logs. > Is it really histograms? Is it perhaps profiler data, or maybe even user > actions? Also, are most logs around 50K now, or are we dealing with some sort > of multi-modal distribution? Data would help. > > In terms of compression, I agree that compression is a good idea. We should > probably discard logs based on their compressed size, not their inflated size. > We could then be more conservative about bumping this limit. > > It might be the case that after deeper cogitation, we decide that bumping the > limit is really the right thing to do. If so, that's fine. I just want to make > sure that we take an appropriate amount of time to consider root causes and > other possible solutions before we just commit to using more resources. Good point about profiler data - I suspect that it may indeed be one of the culprits. From previous analyses, I know that profiler data accounts for about 25% of the overall data, which is quite high given that we only send it in the first metrics log. Indeed, I was seeing the limit hit in local builds for the initial log upload, so likely profiler data may be the cause there. Let me investigate more... +vadimt FYI
On 2014/06/03 20:46:18, Alexei Svitkine wrote: > On 2014/06/03 20:25:19, Ilya Sherman wrote: > > > https://codereview.chromium.org/313613003/diff/1/chrome/browser/metrics/metri... > > File chrome/browser/metrics/metrics_service.cc (right): > > > > > https://codereview.chromium.org/313613003/diff/1/chrome/browser/metrics/metri... > > chrome/browser/metrics/metrics_service.cc:250: const size_t > > kUploadLogAvoidRetransmitSize = 250000; > > On 2014/06/03 19:56:45, Alexei Svitkine wrote: > > > On 2014/06/03 19:51:49, Ilya Sherman wrote: > > > > What is actually accounting for the larger size of logs today? I'm > > concerned > > > > about just bumping this limit. Should we be thinking about how to reduce > > the > > > > size of the logs instead? UMA ideally shouldn't have a significant impact > > on > > > a > > > > user's network or disk usage... > > > > > > I'm not sure when the old value was chosen or how it was picked. My guess is > > we > > > simply have more histograms now than before. > > > > > > I agree that we should try to reduce the size of logs. We already compress > > them, > > > which provides on average a 50% saving for net transmissions. We probably > > should > > > persist compressed logs too, rather than uncompressed ones. > > > > > > And we do want to start deprecating / removing histograms more aggressively. > > > > > > Additional, there's probably more efficient ways to store them than in local > > > state - e.g. the base64 round-trip probably doesn't help performance. > > > > > > Still, I don't think those efforts should block this change. Losing a large > > > number of logs like we do currently is quite bad. > > > > I agree that dropping lots of logs is quite bad, and that we should stop doing > > that :) > > > > However, while I don't think we should abandon this change in favor of > something > > more fundamental, I do think it's important to take the time to better > > understand the problem before proceeding. Specifically, I'd like to have a > game > > plan that goes beyond "let's bump this limit now, and then think about root > > causes later" before we go ahead and bump the limit. Otherwise, there's a > good > > chance that, despite good intentions, we'll just bump the limit and forget > about > > it. > > > > I'd really like to better understand what's taking up space in the large logs. > > > Is it really histograms? Is it perhaps profiler data, or maybe even user > > actions? Also, are most logs around 50K now, or are we dealing with some sort > > of multi-modal distribution? Data would help. > > > > In terms of compression, I agree that compression is a good idea. We should > > probably discard logs based on their compressed size, not their inflated size. > > > We could then be more conservative about bumping this limit. > > > > It might be the case that after deeper cogitation, we decide that bumping the > > limit is really the right thing to do. If so, that's fine. I just want to > make > > sure that we take an appropriate amount of time to consider root causes and > > other possible solutions before we just commit to using more resources. > > Good point about profiler data - I suspect that it may indeed be one of the > culprits. From previous analyses, I know that profiler data accounts for about > 25% of the overall data, which is quite high given that we only send it in the > first metrics log. Indeed, I was seeing the limit hit in local builds for the > initial log upload, so likely profiler data may be the cause there. Let me > investigate more... 
> +vadimt FYI
Indeed, it looks like profiler data is one of the contributors. Re-running my local test with a TOT official build, I get the following for the first upload:
[27616:1287:0603/165940:INFO:metrics_log_base.cc(93)] -----------------------------
[27616:1287:0603/165940:INFO:metrics_log_base.cc(94)] Size total: 77307
[27616:1287:0603/165940:INFO:metrics_log_base.cc(99)] Size without profiler data: 30113
[27616:1287:0603/165940:INFO:metrics_log_base.cc(100)] -----------------------------
What I did was simply remove the profiler data, re-serialize the log and compare both sizes. So on my Mac build, profiler data contributes 47k of the 77k of the initial log. We also know it contributes a significant amount server-side.
Any good ideas about what to do about it? In the past, when I only had the server-side data size on my mind, I thought we could simply introduce a sampling-percentage param - so if it's 0.5, then we would only collect it for 50% of the clients. But this wouldn't work if we're trying to optimize to keep individual log size down. I suppose we could either collect profiler data earlier at startup, so there's less of it (which doesn't sound ideal), or have a separate log just for it, but that seems like a lot of effort for dubious benefit (i.e. we'd still be uploading just as much data).
It's also not clear to me whether we should really be worrying about this extra size per-log. Honestly, it's still not that much data for a network transmission (though perhaps a bit heavier for reading serialized logs from local state).
Ilya, do you have any concrete suggestions / ideas here? Or are you okay with upping the limit now that we know the cause? WDYT?
In addition to upping the limit, I could also add some more detailed histograms about log sizes, so that later on we could get data from actual users rather than anecdotal evidence from local testing.
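For reference, the comparison above can be reproduced with something along these lines. This is only a sketch: it assumes the profiler data lives in the repeated profiler_event field of the UMA proto, and the header path is approximate.

  #include <cstddef>
  #include <utility>

  // Header path approximate for the UMA protobuf definitions.
  #include "chrome/common/metrics/proto/chrome_user_metrics_extension.pb.h"

  // Returns {total serialized size, size with profiler data stripped}.
  // Assumes profiler data is stored in the repeated profiler_event field.
  std::pair<size_t, size_t> MeasureProfilerContribution(
      metrics::ChromeUserMetricsExtension uma_proto) {
    const size_t total_size = uma_proto.ByteSize();
    uma_proto.clear_profiler_event();
    const size_t size_without_profiler = uma_proto.ByteSize();
    return std::make_pair(total_size, size_without_profiler);
  }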
On 2014/06/03 21:09:34, Alexei Svitkine wrote: > On 2014/06/03 20:46:18, Alexei Svitkine wrote: > > On 2014/06/03 20:25:19, Ilya Sherman wrote: > > > > > > https://codereview.chromium.org/313613003/diff/1/chrome/browser/metrics/metri... > > > File chrome/browser/metrics/metrics_service.cc (right): > > > > > > > > > https://codereview.chromium.org/313613003/diff/1/chrome/browser/metrics/metri... > > > chrome/browser/metrics/metrics_service.cc:250: const size_t > > > kUploadLogAvoidRetransmitSize = 250000; > > > On 2014/06/03 19:56:45, Alexei Svitkine wrote: > > > > On 2014/06/03 19:51:49, Ilya Sherman wrote: > > > > > What is actually accounting for the larger size of logs today? I'm > > > concerned > > > > > about just bumping this limit. Should we be thinking about how to > reduce > > > the > > > > > size of the logs instead? UMA ideally shouldn't have a significant > impact > > > on > > > > a > > > > > user's network or disk usage... > > > > > > > > I'm not sure when the old value was chosen or how it was picked. My guess > is > > > we > > > > simply have more histograms now than before. > > > > > > > > I agree that we should try to reduce the size of logs. We already compress > > > them, > > > > which provides on average a 50% saving for net transmissions. We probably > > > should > > > > persist compressed logs too, rather than uncompressed ones. > > > > > > > > And we do want to start deprecating / removing histograms more > aggressively. > > > > > > > > Additional, there's probably more efficient ways to store them than in > local > > > > state - e.g. the base64 round-trip probably doesn't help performance. > > > > > > > > Still, I don't think those efforts should block this change. Losing a > large > > > > number of logs like we do currently is quite bad. > > > > > > I agree that dropping lots of logs is quite bad, and that we should stop > doing > > > that :) > > > > > > However, while I don't think we should abandon this change in favor of > > something > > > more fundamental, I do think it's important to take the time to better > > > understand the problem before proceeding. Specifically, I'd like to have a > > game > > > plan that goes beyond "let's bump this limit now, and then think about root > > > causes later" before we go ahead and bump the limit. Otherwise, there's a > > good > > > chance that, despite good intentions, we'll just bump the limit and forget > > about > > > it. > > > > > > I'd really like to better understand what's taking up space in the large > logs. > > > > > Is it really histograms? Is it perhaps profiler data, or maybe even user > > > actions? Also, are most logs around 50K now, or are we dealing with some > sort > > > of multi-modal distribution? Data would help. > > > > > > In terms of compression, I agree that compression is a good idea. We should > > > probably discard logs based on their compressed size, not their inflated > size. > > > > > We could then be more conservative about bumping this limit. > > > > > > It might be the case that after deeper cogitation, we decide that bumping > the > > > limit is really the right thing to do. If so, that's fine. I just want to > > make > > > sure that we take an appropriate amount of time to consider root causes and > > > other possible solutions before we just commit to using more resources. > > > > Good point about profiler data - I suspect that it may indeed be one of the > > culprits. 
From previous analyses, I know that profiler data accounts for about > > 25% of the overall data, which is quite high given that we only send it in the > > first metrics log. Indeed, I was seeing the limit hit in local builds for the > > initial log upload, so likely profiler data may be the cause there. Let me > > investigate more... +vadimt FYI > > Indeed, it looks like profiler data is one of the contributors. Re-running my > local test with a TOT official build, I get the following for the first upload: > > [27616:1287:0603/165940:INFO:metrics_log_base.cc(93)] > ----------------------------- > [27616:1287:0603/165940:INFO:metrics_log_base.cc(94)] Size total: 77307 > [27616:1287:0603/165940:INFO:metrics_log_base.cc(99)] Size without profiler > data: 30113 > [27616:1287:0603/165940:INFO:metrics_log_base.cc(100)] > ----------------------------- > > What I did was simply remove the profiler data and re-serialize it and compare > both sizes. So on my Mac build, profiler data contributes to 47k out of the 77k > of the initial log. We also know it contributes a significant amount > server-side. > > Any good ideas about what to do about it? In the past, when I only had the > server-side data size on my mind, I thought we could simply introduce a > sampling-percentage param - so if it's 0.5, then we would only collect it for > 50% of the clients. But this wouldn't work if we're trying to optimize to keep > individual log size down. I suppose we could either collect profiler data > earlier at start up, so there's less of it (which doesn't sound ideal) or have a > separate log just for it, but that seems a lot of effort for dubious benefit > (i.e. we'd still be uploading just as much data). > > It's also not clear to me whether we should really be worrying about this extra > size per-log. Honestly, it's still not that much data for a network transmission > (though perhaps a bit heavier for reading serialized logs from local state). > > Ilya, do you have any concrete suggestions / ideas here? Or are you okay with > upping the limit now that we know the cause? WDYT? > > In addition to upping the limit, I could also add some more detailed histograms > about log sizes, so that later one we could get data from actual users rather > than anecdotal evidence from local testing. I agree that individual log size isn't really the metric that we care about: We care more about the cumulative size. Concretely, I'd suggest the following: (a) For network data usage, measure what we actually send, i.e. compressed bytes. We can then bump the limit, but to something less than 250KB. (b) For storage to disk, we should definitely compress the data. In fact, we should probably just compress the serialized logs immediately when we serialize them. (c) Also for storage to disk, we might want to think about storing profiler data separately. We could then limit the amount of profiler data we're willing to keep separately from how we limit other data. (d) We should add more metrics, including especially the uploaded log size. This will give a better indication of how frequent large logs are, and what the distribution looks like. (e) We should check with the Android and iOS teams. I think the mobile browsers are likely going to be the most concerned about increasing this threshold.
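A minimal sketch of what item (d) could look like at the point where a log is staged for upload; the histogram names and bucket ranges below are placeholders, not necessarily what eventually landed:

  #include <cstddef>
  #include "base/metrics/histogram.h"

  // Records uploaded-log sizes in KB so the real-world size distribution
  // (and how often logs approach the discard threshold) can be observed.
  void RecordLogSizeHistograms(size_t uncompressed_bytes,
                               size_t compressed_bytes) {
    UMA_HISTOGRAM_CUSTOM_COUNTS("UMA.LogSize.Uncompressed",  // placeholder name
                                static_cast<int>(uncompressed_bytes / 1024),
                                1, 1000, 50);
    UMA_HISTOGRAM_CUSTOM_COUNTS("UMA.LogSize.Compressed",  // placeholder name
                                static_cast<int>(compressed_bytes / 1024),
                                1, 1000, 50);
  }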
Some ideas regarding profiler data, not necessarily for near-term implementation:
1. We can focus on sending only “interesting” data. Probably, the most interesting issues that can be found with the profiler are (1) slow startup, (2) slow pageloads and (3) UI thread blocking/junk. Currently, we send the first 30 sec, but for (1) it’s enough to stop recording when the browser is considered “started up”, which is typically earlier. For (2), we can turn recording on and off when the pageload task chain starts and ends. For (3) we can send only tasks on the UI thread that take more than a specified threshold (200ms?).
2. I can imagine decimation of the data on the client, with the decimation factor delivered via Finch.
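Idea #2 above could be gated with a session-randomized field trial. The sketch below uses a made-up trial/group name; the actual sampling percentages would be configured server-side via Finch.

  #include <string>
  #include "base/metrics/field_trial.h"

  // Hypothetical trial: clients assigned to the "Disabled" group skip
  // profiler collection entirely; group percentages are controlled via Finch.
  bool ShouldCollectProfilerData() {
    return base::FieldTrialList::FindFullName("UMAProfilerSampling") !=
           "Disabled";
  }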
Not junk, jank :)
I'll take a look at compressing logs before they're stored to local state and also at checking the size of the logs after compression (hoping to do it Friday during my flight). It's a little tricky because:
- We need to remain compatible with older logs (i.e. introduce a new pref)
- Now we'll need to depend on compression_utils.cc from core MetricsService code, not just the net uploader
- We also now should store the hash with the logs (since the hash is of the uncompressed data, so we can't re-compute it when loading)
Anyway, it just means we have to be a bit more careful, but it should be doable.
Regarding profiler data, I agree with what vadimt@ said. It seems like the biggest bang for the buck we can get short-term is to use a Finch session-randomized field trial to control what percentage of users actually send up profiler data (i.e. your point #2). Then we can scale it up/down server-side as needed. We can even set different percentages per platform, since, for example, we probably get a lot more data than we really need from Windows compared to other platforms. And this should provide a very good reduction in logs server-side. Vadim, could you take a look at setting this up?
On Wed, Jun 4, 2014 at 9:49 AM, <vadimt@chromium.org> wrote:
> Not junk, jank :)
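A sketch of the persistence change being described, assuming a GzipCompress helper from compression_utils and a pref format that stores the gzipped, base64-encoded log together with the hash of the uncompressed data (names, namespaces and include paths are approximate):

  #include <string>
  #include "base/base64.h"
  #include "base/sha1.h"
  #include "chrome/browser/metrics/compression_utils.h"  // path approximate

  // Prepares a log for local-state storage: gzip, then base64-encode, and
  // keep the SHA-1 of the uncompressed text so it doesn't need to be
  // recomputed (which would require decompressing) when logs are reloaded.
  bool PrepareLogForPersisting(const std::string& log_text,
                               std::string* encoded_log,
                               std::string* hash) {
    std::string compressed;
    if (!chrome::GzipCompress(log_text, &compressed))  // signature assumed
      return false;
    base::Base64Encode(compressed, encoded_log);
    *hash = base::SHA1HashString(log_text);
    return true;
  }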
That plan sounds good. Thanks, Alexei!
Now that the compression change has landed, we're still seeing discarded logs, but almost all of them now fall under the 100k mark. So I've updated this CL to set 100k as the new limit. Ilya, are you OK with this now?
Should we bump kStorageByteLimitPerLogType as well?
https://codereview.chromium.org/313613003/diff/20001/components/metrics/metri...
File components/metrics/metrics_service.cc (right):
https://codereview.chromium.org/313613003/diff/20001/components/metrics/metri...
components/metrics/metrics_service.cc:225: const size_t kUploadLogAvoidRetransmitSize = 100 * 1000;
nit: Do we want 1024 rather than 1000, so that we're measuring in actual kilobytes?
I think we don't need to bump kStorageByteLimitPerLogType, since the logic is to keep the greater of min_log_count logs or kStorageByteLimitPerLogType bytes of logs. So if there's a very large log, we'll probably ignore that limit anyway and store min_log_count of logs (8 for ongoing logs and 20 for initial logs). It's only if the logs are very small that we'll try to store at least kStorageByteLimitPerLogType (300k) of them. So I think we can keep that limit as-is.
https://codereview.chromium.org/313613003/diff/20001/components/metrics/metri...
File components/metrics/metrics_service.cc (right):
https://codereview.chromium.org/313613003/diff/20001/components/metrics/metri...
components/metrics/metrics_service.cc:225: const size_t kUploadLogAvoidRetransmitSize = 100 * 1000;
On 2014/06/17 21:02:48, Ilya Sherman wrote:
> nit: Do we want 1024 rather than 1000, so that we're measuring in actual
> kilobytes?
Done.
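For reference, the store-limit behavior described above boils down to something like the following simplified sketch of the log-pruning logic (names approximate; the real code enforces additional constraints):

  #include <cstddef>
  #include <string>
  #include <vector>

  // Counts how many of the most recent logs to keep: logs are kept until
  // there are at least |min_log_count| of them AND at least |min_log_bytes|
  // bytes, i.e. the greater of the two limits wins.
  size_t CountLogsToKeep(const std::vector<std::string>& logs,  // newest last
                         size_t min_log_count,
                         size_t min_log_bytes) {
    size_t bytes_kept = 0;
    size_t logs_kept = 0;
    for (std::vector<std::string>::const_reverse_iterator it = logs.rbegin();
         it != logs.rend(); ++it) {
      if (logs_kept >= min_log_count && bytes_kept >= min_log_bytes)
        break;
      bytes_kept += it->size();
      ++logs_kept;
    }
    return logs_kept;
  }

With min_log_count at 8 (ongoing) or 20 (initial) and min_log_bytes at 300k, a single very large log simply satisfies the byte limit sooner, so the count limit dominates - which is why bumping kStorageByteLimitPerLogType isn't needed.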
Makes sense. LGTM. Thanks, Alexei.
The CQ bit was checked by isherman@chromium.org
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/asvitkine@chromium.org/313613003/40001
FYI, CQ is re-trying this CL (attempt #1). The failing builders are: android_aosp on tryserver.chromium (http://build.chromium.org/p/tryserver.chromium/builders/android_aosp/builds/8...)
Message was sent while issue was closed.
Change committed as 278612