|
|
Created:
6 years, 6 months ago by Sergiy Byelozyorov Modified:
6 years, 6 months ago CC:
chromium-reviews, Dirk Pranke, cmp-cc_chromium.org, iannucci+depot_tools_chromium.org, ilevy-cc_chromium.org Base URL:
https://chromium.googlesource.com/chromium/tools/depot_tools.git@master Visibility:
Public. |
DescriptionChanged number of times send operation is repeated. This will help to reduce apply_issue flakiness.
BUG=373797
R=maruel@chromium.org
Committed: http://src.chromium.org/viewvc/chrome?view=rev&revision=277040
Patch Set 1 #
Total comments: 5
Patch Set 2 : #
Total comments: 2
Patch Set 3 : #
Total comments: 2
Patch Set 4 : #Messages
Total messages: 22 (0 generated)
PTAL
https://codereview.chromium.org/323363002/diff/1/rietveld.py File rietveld.py (right): https://codereview.chromium.org/323363002/diff/1/rietveld.py#newcode438 rietveld.py:438: time.sleep(1+5**retry) 5**3 = 125. That's a hell of a long time. I don't see why it would sleep more than 10 seconds. It's better to fail that be sleeping for several minutes. What's the current failure rate?
https://codereview.chromium.org/323363002/diff/1/rietveld.py File rietveld.py (right): https://codereview.chromium.org/323363002/diff/1/rietveld.py#newcode438 rietveld.py:438: time.sleep(1+5**retry) On 2014/06/11 13:27:41, M-A Ruel wrote: > 5**3 = 125. That's a hell of a long time. > > I don't see why it would sleep more than 10 seconds. It's better to fail that be > sleeping for several minutes. I guess what I didn't consider is interactive usage. Is this used in any commands which can be invoked by the users interactively? > What's the current failure rate? In fact, it is quite low: about 0.23% on tryserver.chromium.
https://codereview.chromium.org/323363002/diff/1/rietveld.py File rietveld.py (right): https://codereview.chromium.org/323363002/diff/1/rietveld.py#newcode438 rietveld.py:438: time.sleep(1+5**retry) On 2014/06/11 15:46:42, Sergiy Byelozyorov wrote: > On 2014/06/11 13:27:41, M-A Ruel wrote: > > 5**3 = 125. That's a hell of a long time. > > > > I don't see why it would sleep more than 10 seconds. It's better to fail that > be > > sleeping for several minutes. > I guess what I didn't consider is interactive usage. Is this used in any > commands which can be invoked by the users interactively? Oh, I'm concerned about try jobs what would get stuck in a loop too. > > What's the current failure rate? > In fact, it is quite low: about 0.23% on tryserver.chromium. That's surprisingly high, especially that the code is already trying 5 times. What I'm concerned is that these .23% failures could be: - failed to apply the patch at all. (?) - the DB is corrupted, so retrying wouldn't help. This CL only helps when the transient failure is lasting more than 16 seconds (previous total sleep time) but less than 160s (new sleep time). I'd recommend to try more often than sleeping more. It's GAE, it can sustain the load.
PTAL https://codereview.chromium.org/323363002/diff/1/rietveld.py File rietveld.py (right): https://codereview.chromium.org/323363002/diff/1/rietveld.py#newcode438 rietveld.py:438: time.sleep(1+5**retry) On 2014/06/11 17:52:48, M-A Ruel wrote: > Oh, I'm concerned about try jobs what would get stuck in a loop too. Why is this an issue? We only wait extra few minutes and this only happens 0.23% of the time. > > > What's the current failure rate? > What I'm concerned is that these .23% failures could be: > - failed to apply the patch at all. (?) > - the DB is corrupted, so retrying wouldn't help. These are all urllib2.URLError errors when connecting to the GAE. > This CL only helps when the transient failure is lasting more than 16 seconds > (previous total sleep time) but less than 160s (new sleep time). Previously it was 55 seconds (11 seconds each time as maxtries was used instead of retry). New time would be 786s (2s + 6s + 26s + 126s + 626s), each longer delay with decreasing probability. > I'd recommend to try more often than sleeping more. It's GAE, it can sustain the load. We can try that too. This CL is an experiment anyway. We do not know what's really happening. I'll increase the number of tries and reduce the delay.
https://codereview.chromium.org/323363002/diff/1/rietveld.py File rietveld.py (right): https://codereview.chromium.org/323363002/diff/1/rietveld.py#newcode438 rietveld.py:438: time.sleep(1+5**retry) On 2014/06/12 09:43:50, Sergiy Byelozyorov wrote: > On 2014/06/11 17:52:48, M-A Ruel wrote: > > Oh, I'm concerned about try jobs what would get stuck in a loop too. > Why is this an issue? We only wait extra few minutes and this only happens 0.23% > of the time. > > > > > What's the current failure rate? > > What I'm concerned is that these .23% failures could be: > > - failed to apply the patch at all. (?) > > - the DB is corrupted, so retrying wouldn't help. > These are all urllib2.URLError errors when connecting to the GAE. Yes but I don't know if you specifically looked at url errors or a raw apply_issue to calculate the number. > > This CL only helps when the transient failure is lasting more than 16 seconds > > (previous total sleep time) but less than 160s (new sleep time). > Previously it was 55 seconds (11 seconds each time as maxtries was used instead > of retry). New time would be 786s (2s + 6s + 26s + 126s + 626s), each longer > delay with decreasing probability. It only sleeps for 0, 1, 2, and 3. It doesn't sleep for try #4. > > I'd recommend to try more often than sleeping more. It's GAE, it can sustain > the load. > We can try that too. This CL is an experiment anyway. We do not know what's > really happening. I'll increase the number of tries and reduce the delay. Please do this instead. It's fine to hammer the server. There's a few things you may want to optimize for: - Make it work reliably even when the server is transiently stable. - Make it fail fast to save costs. - Make it succeed fast to save devs time. - Reduce the number of requests to save costs. - Increase the number of requests to get faster turn around. It's clear to me here that you should aim to optimize is to save devs time. As such, sleeping is a net cost with no benefit.
> Please do this instead. It's fine to hammer the server. There's a few things you > may want to optimize for: > - Make it work reliably even when the server is transiently stable. > - Make it fail fast to save costs. > - Make it succeed fast to save devs time. > - Reduce the number of requests to save costs. > - Increase the number of requests to get faster turn around. > > It's clear to me here that you should aim to optimize is to save devs time. As > such, sleeping is a net cost with no benefit. My 2 cents: if GAE is down for 20 seconds 0.23% of the time, and we only do a few quick retries, we waste potentially hours of devs time by failing the patch and having them click the CQ button again. If we wait 500+ seconds in retries, we only lose about 10 mins of wait time, but then know for sure the app is down for good, so rejecting a CL may not be a bad idea (can't commit it anyway, no patch!). This code I believe is used both in bots and interactively, and what may make more sense to me is to specify some retry effort parameter (say the #retries) in kwargs, and make sure kwargs are passed through every exported API call. Then bots can dial it up as needed without affecting other tools.
On 2014/06/12 16:26:18, Sergey Berezin wrote: > > Please do this instead. It's fine to hammer the server. There's a few things > you > > may want to optimize for: > > - Make it work reliably even when the server is transiently stable. > > - Make it fail fast to save costs. > > - Make it succeed fast to save devs time. > > - Reduce the number of requests to save costs. > > - Increase the number of requests to get faster turn around. > > > > It's clear to me here that you should aim to optimize is to save devs time. As > > such, sleeping is a net cost with no benefit. > > My 2 cents: if GAE is down for 20 seconds 0.23% of the time, and we only do a > few quick retries, we waste potentially hours of devs time by failing the patch > and having them click the CQ button again. If we wait 500+ seconds in retries, > we only lose about 10 mins of wait time, but then know for sure the app is down > for good, so rejecting a CL may not be a bad idea (can't commit it anyway, no > patch!). > > This code I believe is used both in bots and interactively, and what may make > more sense to me is to specify some retry effort parameter (say the #retries) in > kwargs, and make sure kwargs are passed through every exported API call. Then > bots can dial it up as needed without affecting other tools. You misunderstood me. What I implicitly asked was to (approximatively) raise maxtries to 100 and limit sleep to 10 seconds, which gives you a ~6 minutes window.
On 2014/06/12 17:06:21, M-A Ruel wrote: > [snip] > You misunderstood me. What I implicitly asked was to (approximatively) raise > maxtries to 100 and limit sleep to 10 seconds, which gives you a ~6 minutes > window. Summary of our offline discussion: based on this data: https://isolateserver.appspot.com/stats?resolution=minutes, it appears that AppEngine is unlikely completely down for very long, but more likely a particular instance is "bad" for a particular request. So a quick retry has a high chance of succeeding. Having said that, there might be times when the app is indeed completely down, but not likely longer than for a few minutes (like, <5min). So, a long series of relatively quick retries (e.g. what M-A is suggesting above) is indeed the best approach based on the data we have. Thanks, M-A for clarifying it!
https://codereview.chromium.org/323363002/diff/20001/rietveld.py File rietveld.py (right): https://codereview.chromium.org/323363002/diff/20001/rietveld.py#newcode438 rietveld.py:438: time.sleep(1+retry) nit: Technically, this is a linear backoff now. Fix the comment?
On 2014/06/12 17:06:21, M-A Ruel wrote: > You misunderstood me. What I implicitly asked was to (approximatively) raise > maxtries to 100 and limit sleep to 10 seconds, which gives you a ~6 minutes > window. I assume you meant maxtries=40, because with 100, the window would be 16 minutes.
https://codereview.chromium.org/323363002/diff/20001/rietveld.py File rietveld.py (right): https://codereview.chromium.org/323363002/diff/20001/rietveld.py#newcode438 rietveld.py:438: time.sleep(1+retry) On 2014/06/12 17:54:17, Sergey Berezin wrote: > nit: Technically, this is a linear backoff now. Fix the comment? Done.
On 2014/06/12 16:26:18, Sergey Berezin wrote: > This code I believe is used both in bots and interactively, and what may make > more sense to me is to specify some retry effort parameter (say the #retries) in > kwargs, and make sure kwargs are passed through every exported API call. Then > bots can dial it up as needed without affecting other tools. Users can always hit Ctrl+C interactively to interrupt the process if it takes too long for them. However, I can imagine if a user runs a command and goes for lunch, they would like to have it finished.
https://codereview.chromium.org/323363002/diff/40001/rietveld.py File rietveld.py (right): https://codereview.chromium.org/323363002/diff/40001/rietveld.py#newcode438 rietveld.py:438: time.sleep(10) Please replace with: time.sleep(min(10, 1+maxtries*2))
On 2014/06/13 15:42:55, M-A Ruel wrote: > https://codereview.chromium.org/323363002/diff/40001/rietveld.py > File rietveld.py (right): > > https://codereview.chromium.org/323363002/diff/40001/rietveld.py#newcode438 > rietveld.py:438: time.sleep(10) > Please replace with: > > time.sleep(min(10, 1+maxtries*2)) Why 1+maxtries*2? That would always be 201 and 10 will always be smaller. Did you mean 1+retry*2?
On 2014/06/13 16:38:07, Sergiy Byelozyorov wrote: > On 2014/06/13 15:42:55, M-A Ruel wrote: > > https://codereview.chromium.org/323363002/diff/40001/rietveld.py > > File rietveld.py (right): > > > > https://codereview.chromium.org/323363002/diff/40001/rietveld.py#newcode438 > > rietveld.py:438: time.sleep(10) > > Please replace with: > > > > time.sleep(min(10, 1+maxtries*2)) > > Why 1+maxtries*2? That would always be 201 and 10 will always be smaller. Did > you mean 1+retry*2? Yes, I meant time.sleep(min(10, 1+retry*2))
On 2014/06/13 16:53:55, M-A Ruel wrote: > On 2014/06/13 16:38:07, Sergiy Byelozyorov wrote: > > On 2014/06/13 15:42:55, M-A Ruel wrote: > > > https://codereview.chromium.org/323363002/diff/40001/rietveld.py > > > File rietveld.py (right): > > > > > > https://codereview.chromium.org/323363002/diff/40001/rietveld.py#newcode438 > > > rietveld.py:438: time.sleep(10) > > > Please replace with: > > > > > > time.sleep(min(10, 1+maxtries*2)) > > > > Why 1+maxtries*2? That would always be 201 and 10 will always be smaller. Did > > you mean 1+retry*2? > > Yes, I meant > time.sleep(min(10, 1+retry*2)) Done.
https://codereview.chromium.org/323363002/diff/40001/rietveld.py File rietveld.py (right): https://codereview.chromium.org/323363002/diff/40001/rietveld.py#newcode438 rietveld.py:438: time.sleep(10) On 2014/06/13 15:42:55, M-A Ruel wrote: > Please replace with: > > time.sleep(min(10, 1+maxtries*2)) Done.
lgtm!
The CQ bit was checked by sergiyb@chromium.org
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/sergiyb@chromium.org/323363002/60001
Message was sent while issue was closed.
Change committed as 277040 |