Created: 4 years, 1 month ago by vakh (use Gerrit instead)
Modified: 4 years, 1 month ago
CC: chromium-reviews, vakh+watch_chromium.org
Target Ref: refs/pending/heads/master
Project: chromium
Visibility: Public.
Description:
Handle timeout for update requests.
1. Starts 30 second timer when the request to fetch updates is sent.
2. If a response is received within that time, the timer is reset.
3. Otherwise, the outstanding request is cancelled and another scheduled after the same duration as the one cancelled.
BUG=543161
Committed: https://crrev.com/1074a4f1111494a115606cecf8363abd63e9bb49
Cr-Commit-Position: refs/heads/master@{#431475}
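The three numbered steps in the description can be sketched as a small, Chromium-free state machine. This is only an illustrative model of the policy under discussion, not the actual V4UpdateProtocolManager; the UpdateRequestTracker class and its method names are invented for the example, and time is injected so the logic is checkable without a real timer:

```cpp
#include <cassert>
#include <chrono>
#include <optional>

using Seconds = std::chrono::seconds;

// Illustrative sketch of the timeout policy from the CL description.
class UpdateRequestTracker {
 public:
  static constexpr Seconds kTimeout{30};

  // Step 1: the update request is sent; remember when, arming the timeout.
  void OnRequestSent(Seconds now) { sent_at_ = now; }

  // Step 2: a response arrived within the window, so the timer is reset.
  // Returns false if there was no outstanding request or it already timed out.
  bool OnResponse(Seconds now) {
    if (!sent_at_ || now - *sent_at_ > kTimeout) return false;
    sent_at_.reset();
    return true;
  }

  // Step 3: the timer fired with no response; cancel the outstanding request
  // and return when the next one should be scheduled (after the regular
  // next_update_interval, rather than retrying immediately).
  std::optional<Seconds> OnTimerFired(Seconds now, Seconds next_update_interval) {
    if (!sent_at_ || now - *sent_at_ <= kTimeout) return std::nullopt;
    sent_at_.reset();                   // Cancel the outstanding request.
    return now + next_update_interval;  // Reschedule, don't hammer the server.
  }

 private:
  std::optional<Seconds> sent_at_;  // Set while a request is outstanding.
};
```

Note that OnTimerFired models the behavior that patch set 2 settled on (reschedule after next_update_interval_), not the original patch set 1 behavior of immediately issuing another request.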
Patch Set 1 #
Total comments: 8
Patch Set 2 : On timeout, schedule an update request after next_update_interval_ #
Patch Set 3 : Added a histogram for counting update timeouts #
Total comments: 8
Patch Set 4 : nparker@ review #
Patch Set 5 : rebase #
Patch Set 6 : Add UMA for the case when the update request does not time out #
Patch Set 7 : rebase #
Messages
Total messages: 64 (36 generated)
The CQ bit was checked by vakh@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Description was changed from ========== Handle timeout for update requests. 1. Starts 30 second timer when the request to fetch updates is sent. 2. If a response is received within that time, the timer is reset. 3. Otherwise, the outstanding request is cancelled and another sent. BUG=543161 ========== to ========== Handle timeout for update requests. 1. Starts 30 second timer when the request to fetch updates is sent. 2. If a response is received within that time, the timer is reset. 3. Otherwise, the outstanding request is cancelled and another sent. BUG=543161 ==========
vakh@chromium.org changed reviewers: + nparker@chromium.org, shess@chromium.org
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
https://codereview.chromium.org/2470923003/diff/1/components/safe_browsing_db... File components/safe_browsing_db/v4_update_protocol_manager.cc (right): https://codereview.chromium.org/2470923003/diff/1/components/safe_browsing_db... components/safe_browsing_db/v4_update_protocol_manager.cc:308: request_.reset(); This should log some UMA value to indicate the request failed. https://codereview.chromium.org/2470923003/diff/1/components/safe_browsing_db... components/safe_browsing_db/v4_update_protocol_manager.cc:309: IssueUpdateRequest(); This will keep trying every 30 seconds. Is that what we want? Maybe it should have a fixed number of total requests (3?) and then give up and wait for a full update cycle (30 min).
https://codereview.chromium.org/2470923003/diff/1/components/safe_browsing_db... File components/safe_browsing_db/v4_update_protocol_manager.cc (right): https://codereview.chromium.org/2470923003/diff/1/components/safe_browsing_db... components/safe_browsing_db/v4_update_protocol_manager.cc:309: IssueUpdateRequest(); On 2016/11/02 21:56:57, Nathan Parker wrote: > This will keep trying every 30 seconds. Is that what we want? Maybe it should > have a fixed number of total requests (3?) and then give up and wait for a full > update cycle (30 min). Does the spec address this? I see language about how to handle errors, but not about timeouts. If the requests are reaching the Internet, then we definitely shouldn't be stacking them up like this because an unresponsive server is definitely bad! If the requests are _not_ reaching the Internet, then we shouldn't be stacking them up because we're going to repeatedly power up the CPU to process our requests and the radio to attempt to send them.
https://codereview.chromium.org/2470923003/diff/1/components/safe_browsing_db... File components/safe_browsing_db/v4_update_protocol_manager.cc (right): https://codereview.chromium.org/2470923003/diff/1/components/safe_browsing_db... components/safe_browsing_db/v4_update_protocol_manager.cc:308: request_.reset(); On 2016/11/02 21:56:57, Nathan Parker wrote: > This should log some UMA value to indicate the request failed. I thought about it but I couldn't find a good way to log the occurrence of an event. Perhaps a UMA_HISTOGRAM_ENUMERATION that logs SUCCESS/FAILURE? Is there a better way to log this? https://codereview.chromium.org/2470923003/diff/1/components/safe_browsing_db... components/safe_browsing_db/v4_update_protocol_manager.cc:309: IssueUpdateRequest(); On 2016/11/02 21:56:57, Nathan Parker wrote: > This will keep trying every 30 seconds. Is that what we want? Maybe it should > have a fixed number of total requests (3?) Sure, that's possible. But what happens after that? We can't pause update requests. One way to do this is to perform backoff (different from the backoff specified by the protocol) and cap the backoff: 30 sec, 1 min, 2 min, 4 min, 4 min, 4 min... > and then give up and wait for a full > update cycle (30 min). Not sure what you mean here. This error is happening when we try to fetch the next update (typically) 30 mins after the last one. Are you saying that we should just skip an update entirely? That's worth considering but it means that one failure can lead to a somewhat stale DB. https://codereview.chromium.org/2470923003/diff/1/components/safe_browsing_db... components/safe_browsing_db/v4_update_protocol_manager.cc:309: IssueUpdateRequest(); On 2016/11/02 22:05:08, Scott Hess wrote: > On 2016/11/02 21:56:57, Nathan Parker wrote: > > This will keep trying every 30 seconds. Is that what we want? Maybe it should > > have a fixed number of total requests (3?) and then give up and wait for a > full > > update cycle (30 min). 
> > Does the spec address this? I see language about how to handle errors, but not about timeouts. No, I don't see it: https://developers.google.com/safe-browsing/v4/update-api > > If the requests are reaching the Internet, then we definitely shouldn't be > stacking them up like this because an unresponsive server is definitely bad! If > the requests are _not_ reaching the Internet, then we shouldn't be stacking them > up because we're going to repeatedly power up the CPU to process our requests and > the radio to attempt to send them. So, just drop the request on timeout and schedule the next update at the last known frequency? This would be similar to Nathan's proposal. I am neutral on that idea. Note that the request can also get dropped because of an intermittent connection failure and I am sure there are a variety of other reasons. I'll also check with the SafeBrowsing team if they have any opinion about it.
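The capped backoff floated in the exchange above ("30 sec, 1 min, 2 min, 4 min, 4 min, 4 min...") would look roughly like this. It is one of the options discussed, not what the CL ultimately landed (the landed code reschedules after the regular next_update_interval_ instead), and the function name is invented for the sketch:

```cpp
#include <algorithm>

// Doubling backoff capped at 4 minutes, distinct from the backoff the
// Safe Browsing protocol itself specifies for server errors.
int BackoffSeconds(int attempt) {
  constexpr int kBaseSeconds = 30;  // First retry delay (attempt 0).
  constexpr int kCapSeconds = 240;  // Cap at 4 minutes.
  int delay = kBaseSeconds;
  for (int i = 0; i < attempt; ++i) delay = std::min(delay * 2, kCapSeconds);
  return delay;
}
```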
https://codereview.chromium.org/2470923003/diff/1/components/safe_browsing_db... File components/safe_browsing_db/v4_update_protocol_manager.cc (right): https://codereview.chromium.org/2470923003/diff/1/components/safe_browsing_db... components/safe_browsing_db/v4_update_protocol_manager.cc:308: request_.reset(); On 2016/11/02 22:37:22, vakh (Varun Khaneja) wrote: > On 2016/11/02 21:56:57, Nathan Parker wrote: > > This should log some UMA value to indicate the request failed. > > I thought about it but I couldn't find a good way to log the occurence of an > event. Perhaps a UMA_HISTOGRAM_ENUMERATION that logs SUCCESS/FAILURE? Is there a > better way to log this? Yes, you can use a UMA_HISTOGRAM just as a counter. You could compare that to the one that logs the time of responses, or success/error statuses, to see what fraction of requests are timing out.
The CQ bit was checked by vakh@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
On timeout, schedule an update request after next_update_interval_
The CQ bit was checked by vakh@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Changed the behavior on update timeout to schedule a regular update using the existing value of next_update_interval_. Working on adding UMA. https://codereview.chromium.org/2470923003/diff/1/components/safe_browsing_db... File components/safe_browsing_db/v4_update_protocol_manager.cc (right): https://codereview.chromium.org/2470923003/diff/1/components/safe_browsing_db... components/safe_browsing_db/v4_update_protocol_manager.cc:309: IssueUpdateRequest(); On 2016/11/02 22:05:08, Scott Hess wrote: > On 2016/11/02 21:56:57, Nathan Parker wrote: > > This will keep trying every 30 seconds. Is that what we want? Maybe it should > > have a fixed number of total requests (3?) and then give up and wait for a > full > > update cycle (30 min). > > Does the spec address this? I see language about how to handle errors, but not > about timeouts. > > If the requests are reaching the Internet, then we definitely shouldn't be > stacking them up like this because an unresponsive server is definitely bad! If > the requests are _not_ reaching the Internet, then we shouldn't be stacking them > up because we're going to repeatedly power up the CPU to process our requests and > the radio to attempt to send them. Changed the code to schedule a normal update after timeout. Please note that the first request is scheduled between 1 and 5 minutes after startup so if the first update request after startup fails, it'll still be retried at the same random interval (between 1 and 5 minutes). I think that's reasonable because the DB may be really out of date at startup.
The CQ bit was checked by vakh@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Added a histogram for counting update timeouts
The CQ bit was checked by vakh@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
On 2016/11/05 00:32:21, Nathan Parker wrote: > https://codereview.chromium.org/2470923003/diff/1/components/safe_browsing_db... > File components/safe_browsing_db/v4_update_protocol_manager.cc (right): > > https://codereview.chromium.org/2470923003/diff/1/components/safe_browsing_db... > components/safe_browsing_db/v4_update_protocol_manager.cc:308: request_.reset(); > On 2016/11/02 22:37:22, vakh (Varun Khaneja) wrote: > > On 2016/11/02 21:56:57, Nathan Parker wrote: > > > This should log some UMA value to indicate the request failed. > > > > I thought about it but I couldn't find a good way to log the occurence of an > > event. Perhaps a UMA_HISTOGRAM_ENUMERATION that logs SUCCESS/FAILURE? Is there > a > > better way to log this? > > Yes, you can use a UMA_HISTOGRAM just as a counter. You could compare that to > the one that logs the time of responses, or success/error statuses, to see what > fraction of requests are timing out. Done. Histogram: SafeBrowsing.V4Update.Timeout.Count recorded 1 samples, mean = 1.0 (flags = 0x1) 0 O (0 = 0.0%) 1 ------------------------------------------------------------------------O (1 = 100.0%) {0.0%} 2 ...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
lgtm https://codereview.chromium.org/2470923003/diff/40001/components/safe_browsin... File components/safe_browsing_db/v4_update_protocol_manager.cc (right): https://codereview.chromium.org/2470923003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_update_protocol_manager.cc:306: UMA_HISTOGRAM_COUNTS_100("SafeBrowsing.V4Update.Timeout.Count", 1); Or UMA_HISTOGRAM_BOOLEAN("..", true) either way https://codereview.chromium.org/2470923003/diff/40001/components/safe_browsin... File components/safe_browsing_db/v4_update_protocol_manager.h (right): https://codereview.chromium.org/2470923003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_update_protocol_manager.h:182: // Used to interrupt and re-issue update requests that take too long. It doesn't re-issue, it just reschedules now, yes? And the CL description needs updating as well, to this effect. https://codereview.chromium.org/2470923003/diff/40001/components/safe_browsin... File components/safe_browsing_db/v4_update_protocol_manager_unittest.cc (right): https://codereview.chromium.org/2470923003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_update_protocol_manager_unittest.cc:232: I'm not clear on why you need to remove these RunPendingTasks.. https://codereview.chromium.org/2470923003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_update_protocol_manager_unittest.cc:358: // Now wait for the next request to be scheduled. Is there some state you can test between these to verify it changes? Like, maybe there shouldn't be a Fetcher(1) yet?
Description was changed from ========== Handle timeout for update requests. 1. Starts 30 second timer when the request to fetch updates is sent. 2. If a response is received within that time, the timer is reset. 3. Otherwise, the outstanding request is cancelled and another sent. BUG=543161 ========== to ========== Handle timeout for update requests. 1. Starts 30 second timer when the request to fetch updates is sent. 2. If a response is received within that time, the timer is reset. 3. Otherwise, the outstanding request is cancelled and another scheduled after the same duration as the one cancelled. BUG=543161 ==========
nparker@ review
The CQ bit was checked by vakh@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
vakh@chromium.org changed reviewers: + rkaplow@chromium.org
rkaplow@ -- Can you please review the one new histogram? Thanks. https://codereview.chromium.org/2470923003/diff/40001/components/safe_browsin... File components/safe_browsing_db/v4_update_protocol_manager.cc (right): https://codereview.chromium.org/2470923003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_update_protocol_manager.cc:306: UMA_HISTOGRAM_COUNTS_100("SafeBrowsing.V4Update.Timeout.Count", 1); On 2016/11/08 23:39:29, Nathan Parker wrote: > Or UMA_HISTOGRAM_BOOLEAN("..", true) > either way Done. https://codereview.chromium.org/2470923003/diff/40001/components/safe_browsin... File components/safe_browsing_db/v4_update_protocol_manager.h (right): https://codereview.chromium.org/2470923003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_update_protocol_manager.h:182: // Used to interrupt and re-issue update requests that take too long. On 2016/11/08 23:39:29, Nathan Parker wrote: > It doesn't re-issue, it just reschedules now, yes? > > And the CL description needs updating as well, to this effect. Done. https://codereview.chromium.org/2470923003/diff/40001/components/safe_browsin... File components/safe_browsing_db/v4_update_protocol_manager_unittest.cc (right): https://codereview.chromium.org/2470923003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_update_protocol_manager_unittest.cc:232: On 2016/11/08 23:39:29, Nathan Parker wrote: > I'm not clear on why you need to remove these RunPendingTasks.. IssueUpdateRequest queues a network request right away so there's no need to wait before getting |fetcher|. https://codereview.chromium.org/2470923003/diff/40001/components/safe_browsin... components/safe_browsing_db/v4_update_protocol_manager_unittest.cc:358: // Now wait for the next request to be scheduled. On 2016/11/08 23:39:29, Nathan Parker wrote: > Is there some state you can test between these to verify it changes? Like, > maybe there shouldn't be a Fetcher(1) yet? Done. 
Not much else to test.
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: linux_android_rel_ng on master.tryserver.chromium.android (JOB_FAILED, https://build.chromium.org/p/tryserver.chromium.android/builders/linux_androi...)
LGTM. Dangling philosophical question, though. If we're just going to retry on next retry, does a timeout matter? As opposed to just cancelling before the retry? I'm not familiar enough with the networking stack's subtleties to know if it would wedge things or something. There doesn't seem to be anything tracking elapsed time from request to response in this code. Is there something of the sort elsewhere? I'm not entirely sure whether that would help, since not only will requests have a 99th percentile response time, devices and networks will, so just because N-seconds is good for the fleet might not mean it's good for a particular user. But such stats might be useful to inform whether a particular timeout is _clearly_ in bounds, or a bit aggressive.
lgtm looks ok, but unusual there's no other case checked. Should this also have a false where it didn't timeout get tracked, or is that really not important?
rebase
The CQ bit was checked by vakh@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
On 2016/11/09 19:12:18, rkaplow wrote: > lgtm > > looks ok, but unusual there's no other case checked. Should this also have a > false where it didn't timeout get tracked, or is that really not important? Done.
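The metric that came out of this exchange, a single boolean histogram recorded on both paths (true on timeout, false on a timely response), can be modeled with a plain counter pair. UMA_HISTOGRAM_BOOLEAN is the real Chromium macro discussed above; the TimeoutStats type here is only an illustrative stand-in for what the aggregated data lets you compute:

```cpp
#include <cassert>

// Stand-in for one UMA_HISTOGRAM_BOOLEAN recorded on both the timeout
// path and the response path; the interesting aggregate is the fraction
// of update requests that hit the 30s timeout.
struct TimeoutStats {
  int timed_out = 0;
  int responded = 0;
  void Record(bool timeout) { (timeout ? timed_out : responded)++; }
  double TimeoutFraction() const {
    int total = timed_out + responded;
    return total ? static_cast<double>(timed_out) / total : 0.0;
  }
};
```

Recording both the true and the false case is what makes the fraction computable at all; a timeout-only counter (as in patch set 3) gives a numerator with no denominator.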
Add UMA for the case when the update request does not time out
The CQ bit was checked by vakh@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
On 2016/11/09 01:02:25, Scott Hess wrote: > LGTM. > > Dangling philosophical question, though. If we're just going to retry on next > retry, does a timeout matter? As opposed to just cancelling before the retry? > I'm not familiar enough with the networking stack's subtleties to know if it > would wedge things or something. Neither am I. I guess tracking the UMA counter would tell us how much we need to care about this case. > > There doesn't seem to be anything tracking elapsed time from request to response > in this code. Is there something of the sort elsewhere? I'm not entirely sure > whether that would help, since not only will requests have a 99th percentile > response time, devices and networks will, so just because N-seconds is good for > the fleet might not mean it's good for a particular user. But such stats might > be useful to inform whether a particular timeout is _clearly_ in bounds, or a > bit aggressive. Again, if we find the UMA stat to be too high (>2%, may be > 1%) we can look into it more closely. WDYT?
The CQ bit was checked by vakh@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from nparker@chromium.org, rkaplow@chromium.org, shess@chromium.org Link to the patchset: https://codereview.chromium.org/2470923003/#ps100001 (title: "Add UMA for the case when the update request does not time out")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
On 2016/11/10 22:10:11, vakh (Varun Khaneja) wrote: > On 2016/11/09 01:02:25, Scott Hess wrote: > > LGTM. > > > > Dangling philosophical question, though. If we're just going to retry on next > > retry, does a timeout matter? As opposed to just cancelling before the retry? > > > I'm not familiar enough with the networking stack's subtleties to know if it > > would wedge things or something. > > Neither am I. I guess tracking the UMA counter would tell us how much we need to > care about this case. > > > > > There doesn't seem to be anything tracking elapsed time from request to > response > > in this code. Is there something of the sort elsewhere? I'm not entirely > sure > > whether that would help, since not only will requests have a 99th percentile > > response time, devices and networks will, so just because N-seconds is good > for > > the fleet might not mean it's good for a particular user. But such stats > might > > be useful to inform whether a particular timeout is _clearly_ in bounds, or a > > bit aggressive. > > Again, if we find the UMA stat to be too high (>2%, may be > 1%) we can look > into it more closely. WDYT? I guess? It's hard to tell, because the most likely outcome of the UMA stat is to show a very tiny fraction of timeouts. Honestly, if it's something like 2% that maybe indicates Something Is Very Wrong (what does it mean if 2% of http requests are failing? That sounds terrible to me).
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders:
  android_compile_dbg on master.tryserver.chromium.android (JOB_FAILED, https://build.chromium.org/p/tryserver.chromium.android/builders/android_comp...)
  linux_chromium_asan_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_...)
On 2016/11/10 22:18:36, Scott Hess wrote: > On 2016/11/10 22:10:11, vakh (Varun Khaneja) wrote: > > On 2016/11/09 01:02:25, Scott Hess wrote: > > > LGTM. > > > > > > Dangling philosophical question, though. If we're just going to retry on > next > > > retry, does a timeout matter? As opposed to just cancelling before the > retry? > > > > > I'm not familiar enough with the networking stack's subtleties to know if it > > > would wedge things or something. > > > > Neither am I. I guess tracking the UMA counter would tell us how much we need > to > > care about this case. > > > > > > > > There doesn't seem to be anything tracking elapsed time from request to > > response > > > in this code. Is there something of the sort elsewhere? I'm not entirely > > sure > > > whether that would help, since not only will requests have a 99th percentile > > > response time, devices and networks will, so just because N-seconds is good > > for > > > the fleet might not mean it's good for a particular user. But such stats > > might > > > be useful to inform whether a particular timeout is _clearly_ in bounds, or > a > > > bit aggressive. > > > > Again, if we find the UMA stat to be too high (>2%, may be > 1%) we can look > > into it more closely. WDYT? > > I guess? It's hard to tell, because the most likely outcome of the UMA stat is > to show a very tiny fraction of timeouts. Honestly, if it's something like 2% > that maybe indicates Something Is Very Wrong (what does it mean if 2% of http > requests are failing? That sounds terrible to me). Yes, 2% is high. The UMA counter I added in this CL will tell us what the real rate of 30s timeout is. Once we have that number, we can discuss whether 30s is too high or low enough.
The CQ bit was checked by vakh@chromium.org
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders:
  chromium_presubmit on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/chromium_presub...)
  linux_chromium_asan_rel_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_...)
  linux_chromium_chromeos_compile_dbg_ng on master.tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_...)
  mac_chromium_rel_ng on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_...)
rebase
The CQ bit was checked by vakh@chromium.org to run a CQ dry run
The CQ bit was checked by vakh@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from nparker@chromium.org, rkaplow@chromium.org, shess@chromium.org Link to the patchset: https://codereview.chromium.org/2470923003/#ps120001 (title: "rebase")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Message was sent while issue was closed.
Description was changed from ========== Handle timeout for update requests. 1. Starts 30 second timer when the request to fetch updates is sent. 2. If a response is received within that time, the timer is reset. 3. Otherwise, the outstanding request is cancelled and another scheduled after the same duration as the one cancelled. BUG=543161 ========== to ========== Handle timeout for update requests. 1. Starts 30 second timer when the request to fetch updates is sent. 2. If a response is received within that time, the timer is reset. 3. Otherwise, the outstanding request is cancelled and another scheduled after the same duration as the one cancelled. BUG=543161 ==========
Message was sent while issue was closed.
Committed patchset #7 (id:120001)
Message was sent while issue was closed.
Description was changed from ========== Handle timeout for update requests. 1. Starts 30 second timer when the request to fetch updates is sent. 2. If a response is received within that time, the timer is reset. 3. Otherwise, the outstanding request is cancelled and another scheduled after the same duration as the one cancelled. BUG=543161 ========== to ========== Handle timeout for update requests. 1. Starts 30 second timer when the request to fetch updates is sent. 2. If a response is received within that time, the timer is reset. 3. Otherwise, the outstanding request is cancelled and another scheduled after the same duration as the one cancelled. BUG=543161 Committed: https://crrev.com/1074a4f1111494a115606cecf8363abd63e9bb49 Cr-Commit-Position: refs/heads/master@{#431475} ==========
Message was sent while issue was closed.
Patchset 7 (id:??) landed as https://crrev.com/1074a4f1111494a115606cecf8363abd63e9bb49 Cr-Commit-Position: refs/heads/master@{#431475}