|
|
DescriptionAdds TCP FastOpen blackhole recovery code to avoid getting persistently blackholed.
BUG=402005, 413370
Committed: https://crrev.com/764425441131e861b035499e3ecb5787c866a266
Cr-Commit-Position: refs/heads/master@{#297763}
Patch Set 1 #
Total comments: 5
Patch Set 2 : Fixed constness #
Total comments: 2
Patch Set 3 : Nit fixed. #
Total comments: 14
Patch Set 4 : "Streamlined logic for when RecordTCPFastOpenStatus is called (and other nits.)" #
Total comments: 12
Patch Set 5 : Nits addressed and histogram updated. #
Total comments: 12
Patch Set 6 : Comments added and/or clarified. #
Total comments: 10
Patch Set 7 : Removed check in UpdateTCPFastOpenStatusAfterRead #Patch Set 8 : Rebase. #Patch Set 9 : Minor nit. #
Messages
Total messages: 33 (5 generated)
jri@chromium.org changed reviewers: + mmenke@chromium.org, rdsmith@chromium.org
Randy: I have a question inline that you may be able to answer, PTAL. https://codereview.chromium.org/583883002/diff/1/net/socket/tcp_socket_libeve... File net/socket/tcp_socket_libevent.cc (right): https://codereview.chromium.org/583883002/diff/1/net/socket/tcp_socket_libeve... net/socket/tcp_socket_libevent.cc:423: tcp_fastopen_status_ = FAST_OPEN_STATUS_UNKNOWN; Why do we need to reset tcp_fastopen_status_ here?
On 2014/09/19 02:18:16, Jana wrote: > Randy: I have a question inline that you may be able to answer, PTAL. > > https://codereview.chromium.org/583883002/diff/1/net/socket/tcp_socket_libeve... > File net/socket/tcp_socket_libevent.cc (right): > > https://codereview.chromium.org/583883002/diff/1/net/socket/tcp_socket_libeve... > net/socket/tcp_socket_libevent.cc:423: tcp_fastopen_status_ = > FAST_OPEN_STATUS_UNKNOWN; > Why do we need to reset tcp_fastopen_status_ here? Probably won't get to this until next week.
Answered the question to the best of my ability. Do you want an actual review? If so, I'll need to page back in the state machine that was in my head when I wrote the code :-} (primarily around the FastOpenStatus enum). https://codereview.chromium.org/583883002/diff/1/net/socket/tcp_socket_libeve... File net/socket/tcp_socket_libevent.cc (right): https://codereview.chromium.org/583883002/diff/1/net/socket/tcp_socket_libeve... net/socket/tcp_socket_libevent.cc:423: tcp_fastopen_status_ = FAST_OPEN_STATUS_UNKNOWN; On 2014/09/19 02:18:16, Jana wrote: > Why do we need to reset tcp_fastopen_status_ here? So I don't remember clearly why I did this, but looking at the code, I suspect I was worried about the object being re-used; I don't see anything in the API definition of the implementation that disallows that? https://codereview.chromium.org/583883002/diff/20001/net/socket/tcp_socket_li... File net/socket/tcp_socket_libevent.h (right): https://codereview.chromium.org/583883002/diff/20001/net/socket/tcp_socket_li... net/socket/tcp_socket_libevent.h:185: bool tcp_fastopen_write_attempted_; nit, suggestion: I'm used to a blank line before comments.
Randy -- thanks for the answer. I've responded inline. Of course, if you want to do a full review that'd be fine too :-) https://codereview.chromium.org/583883002/diff/1/net/socket/tcp_socket_libeve... File net/socket/tcp_socket_libevent.cc (right): https://codereview.chromium.org/583883002/diff/1/net/socket/tcp_socket_libeve... net/socket/tcp_socket_libevent.cc:423: tcp_fastopen_status_ = FAST_OPEN_STATUS_UNKNOWN; On 2014/09/22 16:05:13, rdsmith wrote: > On 2014/09/19 02:18:16, Jana wrote: > > Why do we need to reset tcp_fastopen_status_ here? > > So I don't remember clearly why I did this, but looking at the code, I suspect I > was worried about the object being re-used; I don't see anything in the API > definition of the implementation that disallows that? Maybe Matt can answer this question, but as I read it, I don't think that happens... client_socket_factory creates TCPClientSockets, and a new TCPSocket (which maps to TCPSocketLibevent for *ix) is created for each TCPClientSocket. That said, I agree that I also don't see anything in the API definition of TCPSocketLibevent that explicitly prevents an object from being reused... If it is possible that a TCPSocketLibevent object is reused, then we'll have to reset use_tcp_fastopen_ as well, and I'd want to double-check the code to make sure that state is correctly reset. https://codereview.chromium.org/583883002/diff/20001/net/socket/tcp_socket_li... File net/socket/tcp_socket_libevent.h (right): https://codereview.chromium.org/583883002/diff/20001/net/socket/tcp_socket_li... net/socket/tcp_socket_libevent.h:185: bool tcp_fastopen_write_attempted_; On 2014/09/22 16:05:13, rdsmith wrote: > nit, suggestion: I'm used to a blank line before comments. Done.
Unless you ask, I'm afraid I'm not going to do a full review. This week is going to be pretty tight with just what's on my plate. Next week I should have more room. https://codereview.chromium.org/583883002/diff/1/net/socket/tcp_socket_libeve... File net/socket/tcp_socket_libevent.cc (right): https://codereview.chromium.org/583883002/diff/1/net/socket/tcp_socket_libeve... net/socket/tcp_socket_libevent.cc:423: tcp_fastopen_status_ = FAST_OPEN_STATUS_UNKNOWN; On 2014/09/22 17:04:56, Jana wrote: > On 2014/09/22 16:05:13, rdsmith wrote: > > On 2014/09/19 02:18:16, Jana wrote: > > > Why do we need to reset tcp_fastopen_status_ here? > > > > So I don't remember clearly why I did this, but looking at the code, I suspect > I > > was worried about the object being re-used; I don't see anything in the API > > definition of the implementation that disallows that? > > Maybe Matt can answer this question, but as I read it, I don't think that > happens... client_socket_factory creates TCPClientSockets, and a new TCPSocket > (which maps to TCPSocketLibevent for *ix) is created for each TCPClientSocket. > That said, I agree that I also don't see anything in the API definition of > TCPSocketLibevent that explicitly prevents an object from being reused... > > If it is possible that a TCPSocketLibevent object is reused, then we'll have to > reset use_tcp_fastopen_ as well, and I'd want to double-check the code to make > sure that state is correctly reset. So I'm a fairly big believer in not making assumptions beyond what's codified in the header file about how a class will be used. If you wanted to stick a note in the header file about not reusing the class and dump a DCHECK someplace in the state transition to make sure it's not being re-used, I'm comfortable with not resetting the state.
On 2014/09/22 17:42:47, rdsmith wrote: > Unless you ask, I'm afraid I'm not going to do a full review. This week is > going to be pretty tight with just what's on my plate. Next week I should have > more room. > > https://codereview.chromium.org/583883002/diff/1/net/socket/tcp_socket_libeve... > File net/socket/tcp_socket_libevent.cc (right): > > https://codereview.chromium.org/583883002/diff/1/net/socket/tcp_socket_libeve... > net/socket/tcp_socket_libevent.cc:423: tcp_fastopen_status_ = > FAST_OPEN_STATUS_UNKNOWN; > On 2014/09/22 17:04:56, Jana wrote: > > On 2014/09/22 16:05:13, rdsmith wrote: > > > On 2014/09/19 02:18:16, Jana wrote: > > > > Why do we need to reset tcp_fastopen_status_ here? > > > > > > So I don't remember clearly why I did this, but looking at the code, I > suspect > > I > > > was worried about the object being re-used; I don't see anything in the API > > > definition of the implementation that disallows that? > > > > Maybe Matt can answer this question, but as I read it, I don't think that > > happens... client_socket_factory creates TCPClientSockets, and a new TCPSocket > > (which maps to TCPSocketLibevent for *ix) is created for each TCPClientSocket. > > That said, I agree that I also don't see anything in the API definition of > > TCPSocketLibevent that explicitly prevents an object from being reused... > > > > If it is possible that a TCPSocketLibevent object is reused, then we'll have > to > > reset use_tcp_fastopen_ as well, and I'd want to double-check the code to make > > sure that state is correctly reset. > > So I'm a fairly big believer in not making assumptions beyond what's codified in > the header file about how a class will be used. If you wanted to stick a note > in the header file about not reusing the class and dump a DCHECK someplace in > the state transition to make sure it's not being re-used, I'm comfortable with > not resetting the state. Wow, that paragraph was written too fast. Let me try again: So I'm a fairly big believer in not making assumptions beyond what's codified in the header file about how a class will be used. If you wanted to stick a note in the header file about not reusing an instance of the class and dump a DCHECK someplace in the class state transition (connect can't be called twice?) to make sure it's not being re-used, I'm comfortable with not resetting tcp_fastopen_status_.
No troubles, Randy. Hopefully, Matt can do a review, and all will be well :-) https://codereview.chromium.org/583883002/diff/1/net/socket/tcp_socket_libeve... File net/socket/tcp_socket_libevent.cc (right): https://codereview.chromium.org/583883002/diff/1/net/socket/tcp_socket_libeve... net/socket/tcp_socket_libevent.cc:423: tcp_fastopen_status_ = FAST_OPEN_STATUS_UNKNOWN; On 2014/09/22 17:42:47, rdsmith wrote: > On 2014/09/22 17:04:56, Jana wrote: > > On 2014/09/22 16:05:13, rdsmith wrote: > > > On 2014/09/19 02:18:16, Jana wrote: > > > > Why do we need to reset tcp_fastopen_status_ here? > > > > > > So I don't remember clearly why I did this, but looking at the code, I > suspect > > I > > > was worried about the object being re-used; I don't see anything in the API > > > definition of the implementation that disallows that? > > > > Maybe Matt can answer this question, but as I read it, I don't think that > > happens... client_socket_factory creates TCPClientSockets, and a new TCPSocket > > (which maps to TCPSocketLibevent for *ix) is created for each TCPClientSocket. > > That said, I agree that I also don't see anything in the API definition of > > TCPSocketLibevent that explicitly prevents an object from being reused... > > > > If it is possible that a TCPSocketLibevent object is reused, then we'll have > to > > reset use_tcp_fastopen_ as well, and I'd want to double-check the code to make > > sure that state is correctly reset. > > So I'm a fairly big believer in not making assumptions beyond what's codified in > the header file about how a class will be used. If you wanted to stick a note > in the header file about not reusing the class and dump a DCHECK someplace in > the state transition to make sure it's not being re-used, I'm comfortable with > not resetting the state. Yup, that sounds reasonable. I will try a DCHECK and check in with Matt as well. If not, we'll have to reset more TCP FastOpen state as well.
https://codereview.chromium.org/583883002/diff/40001/net/socket/tcp_socket_li... File net/socket/tcp_socket_libevent.cc (right): https://codereview.chromium.org/583883002/diff/40001/net/socket/tcp_socket_li... net/socket/tcp_socket_libevent.cc:148: tcp_fastopen_status_, FAST_OPEN_MAX_VALUE); You need to update histograms.xml in this CL as well. https://codereview.chromium.org/583883002/diff/40001/net/socket/tcp_socket_li... net/socket/tcp_socket_libevent.cc:148: tcp_fastopen_status_, FAST_OPEN_MAX_VALUE); Since you're significantly changing this histogram, should probably rename it (Net.TcpFastOpenSocketConnection2) and deprecate the old one, to avoid mixing old and new data. https://codereview.chromium.org/583883002/diff/40001/net/socket/tcp_socket_li... net/socket/tcp_socket_libevent.cc:666: void TCPSocketLibevent::RecordTCPFastOpenStatus() { nit: Think "Record" is a bit of a misnomer, since we don't do the UMA_* logic here. https://codereview.chromium.org/583883002/diff/40001/net/socket/tcp_socket_li... net/socket/tcp_socket_libevent.cc:668: (tcp_fastopen_status_ == FAST_OPEN_FAST_CONNECT_RETURN || nit: Should be consistent here: Either always use fast_open/FastOpen or always use fastopen/Fastopen. https://codereview.chromium.org/583883002/diff/40001/net/socket/tcp_socket_li... net/socket/tcp_socket_libevent.cc:669: tcp_fastopen_status_ == FAST_OPEN_SLOW_CONNECT_RETURN)) { While you're here, would you mind switching to an early return here instead? That's the preferred style, and makes to code a little easier to follow. https://codereview.chromium.org/583883002/diff/40001/net/socket/tcp_socket_li... net/socket/tcp_socket_libevent.cc:676: FAST_OPEN_FAST_CONNECT_FAILED : FAST_OPEN_SLOW_CONNECT_FAILED); If g_tcp_fastopen_has_failed is true, you're assuming this particular fast open attempt failed, even if it might not have? https://codereview.chromium.org/583883002/diff/40001/net/socket/tcp_socket_li... File net/socket/tcp_socket_libevent.h (right): https://codereview.chromium.org/583883002/diff/40001/net/socket/tcp_socket_li... net/socket/tcp_socket_libevent.h:112: FAST_OPEN_SLOW_CONNECT_FAILED, These should be added at the end of the list, since they're used in a UMA histogram.
About the reuse issue: The network stack itself doesn't reuse sockets after they're disconnected. However, I'm reasonably certain that there are external consumers of sockets that *do*, at least for the server socket classes, so unless we want to explicitly forbid that, and fix all embedders, we should try to continue supporting socket reuse. That having been said, if I were designing the socket API myself, I would have forbidden that in the first place, as we don't get a whole lot out of it.
Thanks, Matt. I've addressed your comments, and made some changes otherwise also. In particular, I've changed when and how TCP FastOpen status is recorded. PTAL. (i) The high-order idea is that on a write (connect+write), we record ERROR on failure, but we can only record success (or some sort of FastOpen failure) when a read fails. This is when we know if TCP FastOpen succeeded or failed, and that status is now recorded here. (ii) On _any_ failure, we turn off TCP FastOpen for subsequent connections. This may be overly conservative, but my concern is that if there's a persistent failure mode anywhere, we need an escape hatch into non-FastOpen connects. https://codereview.chromium.org/583883002/diff/40001/net/socket/tcp_socket_li... File net/socket/tcp_socket_libevent.cc (right): https://codereview.chromium.org/583883002/diff/40001/net/socket/tcp_socket_li... net/socket/tcp_socket_libevent.cc:148: tcp_fastopen_status_, FAST_OPEN_MAX_VALUE); On 2014/09/23 15:21:00, mmenke wrote: > You need to update histograms.xml in this CL as well. I did add histograms.xml to this CL, but to add myself as an owner for this histogram. There are no changes to be made there, since the values in that histogram are simply enumerations from TcpSocketStatus. https://codereview.chromium.org/583883002/diff/40001/net/socket/tcp_socket_li... net/socket/tcp_socket_libevent.cc:148: tcp_fastopen_status_, FAST_OPEN_MAX_VALUE); On 2014/09/23 15:21:00, mmenke wrote: > Since you're significantly changing this histogram, should probably rename it > (Net.TcpFastOpenSocketConnection2) and deprecate the old one, to avoid mixing > old and new data. I thought about that ... but this histogram doesn't even show up yet in UMA, since TFO is currently only possible by being explicitly user-enabled... I don't think there's much (any) data yet. (In addition, my intent is to only look at histograms for the experiments (variations) that I'm going to have users in, and the experiment will be conditioned on a Min version for Chrome.) https://codereview.chromium.org/583883002/diff/40001/net/socket/tcp_socket_li... net/socket/tcp_socket_libevent.cc:666: void TCPSocketLibevent::RecordTCPFastOpenStatus() { On 2014/09/23 15:21:00, mmenke wrote: > nit: Think "Record" is a bit of a misnomer, since we don't do the UMA_* logic > here. Sure, but we are recording the status here locally, aren't we? UMA* logic simply pushes this status out. I cant seem to think of a better verb -- feel free to suggest! https://codereview.chromium.org/583883002/diff/40001/net/socket/tcp_socket_li... net/socket/tcp_socket_libevent.cc:668: (tcp_fastopen_status_ == FAST_OPEN_FAST_CONNECT_RETURN || On 2014/09/23 15:21:00, mmenke wrote: > nit: Should be consistent here: Either always use fast_open/FastOpen or always > use fastopen/Fastopen. Yup, good call. I left this one as it was (this histogram has been around), but I've now made it all consistent (I think). Done. https://codereview.chromium.org/583883002/diff/40001/net/socket/tcp_socket_li... net/socket/tcp_socket_libevent.cc:669: tcp_fastopen_status_ == FAST_OPEN_SLOW_CONNECT_RETURN)) { On 2014/09/23 15:21:00, mmenke wrote: > While you're here, would you mind switching to an early return here instead? > That's the preferred style, and makes to code a little easier to follow. Done. https://codereview.chromium.org/583883002/diff/40001/net/socket/tcp_socket_li... net/socket/tcp_socket_libevent.cc:676: FAST_OPEN_FAST_CONNECT_FAILED : FAST_OPEN_SLOW_CONNECT_FAILED); On 2014/09/23 15:21:00, mmenke wrote: > If g_tcp_fastopen_has_failed is true, you're assuming this particular fast open > attempt failed, even if it might not have? I don't think so -- but I've streamlined the logic for when recording is done, and it should be a bit clearer and more consistent now. PTAL. https://codereview.chromium.org/583883002/diff/40001/net/socket/tcp_socket_li... File net/socket/tcp_socket_libevent.h (right): https://codereview.chromium.org/583883002/diff/40001/net/socket/tcp_socket_li... net/socket/tcp_socket_libevent.h:112: FAST_OPEN_SLOW_CONNECT_FAILED, On 2014/09/23 15:21:01, mmenke wrote: > These should be added at the end of the list, since they're used in a UMA > histogram. Thanks for catching this! Done.
https://codereview.chromium.org/583883002/diff/60001/net/socket/tcp_socket_li... File net/socket/tcp_socket_libevent.cc (right): https://codereview.chromium.org/583883002/diff/60001/net/socket/tcp_socket_li... net/socket/tcp_socket_libevent.cc:422: // Reset TCP FastOpen state. Should we have: if (tcp_fastopen_write_attempted_) { UMA_HISTOGRAM_ENUMERATION("Net.TcpFastOpenSocketConnection" fast_open_status_, FAST_OPEN_MAX_VALUE); } And then just call Close() in the destructor? https://codereview.chromium.org/583883002/diff/60001/net/socket/tcp_socket_li... net/socket/tcp_socket_libevent.cc:573: !g_tcp_fastopen_has_failed) { The !g_tcp_fastopen_has_failed still seems kinda weird here. Suppose this happens: * Try to connect socket A using Tcp Fast Open * Try to connect socket B using Tcp Fast Open * Socket A ends up here with a failure. g_tcp_fastopen_has_failed is set to false. * Socket B ends up here (With a success or a failure): We now have tcp_fastopen_write_attempted_ is true, tcp_fastopen_connected_ is false, g_tcp_fastopen_has_failed is true, and tcp_fastopen_status_ is TCP_FASTOPEN_FAST_CONNECT_RETURN or TCP_FASTOPEN_SLOW_CONNECT_RETURN, presumably. We don't update tcp_fastopen_status_ for socket B because of the failure, but we then record the histogram on close/destruction for socket B...And the value we record isn't actually in any way comparable to the histogram we record for socket A, since we don't call RecordTCPFastOpenStatus(). https://codereview.chromium.org/583883002/diff/60001/net/socket/tcp_socket_li... net/socket/tcp_socket_libevent.cc:685: void TCPSocketLibevent::RecordTCPFastOpenStatus() { On 2014/09/24 00:51:26, Jana wrote: > On 2014/09/23 15:21:00, mmenke wrote: > > nit: Think "Record" is a bit of a misnomer, since we don't do the UMA_* logic > > here. > > Sure, but we are recording the status here locally, aren't we? UMA* logic simply > pushes this status out. I cant seem to think of a better verb -- feel free to > suggest! Maybe UpdateTCPFastOpenStatusAfterRead? https://codereview.chromium.org/583883002/diff/60001/net/socket/tcp_socket_li... net/socket/tcp_socket_libevent.cc:698: TCP_FASTOPEN_FAST_CONNECT_FAILED : TCP_FASTOPEN_SLOW_CONNECT_FAILED); nit: 4 space indent. https://codereview.chromium.org/583883002/diff/60001/net/socket/tcp_socket_li... File net/socket/tcp_socket_libevent.h (right): https://codereview.chromium.org/583883002/diff/60001/net/socket/tcp_socket_li... net/socket/tcp_socket_libevent.h:145: TCP_FASTOPEN_MAX_VALUE One high level question... How do you intend to look at these histograms? The fail and success numbers aren't really comparable, since we disable fastopen on failure. Suppose we have two users, FastOpen works for one, but errors out on the other. Both use 100 connections that could theoretically support fastopen. We record 1 failure (Maybe a couple more, if we start multiple connections before our first failure), and 100 successes. How do you interpret those histogram numbers? https://codereview.chromium.org/583883002/diff/60001/tools/metrics/histograms... File tools/metrics/histograms/histograms.xml (left): https://codereview.chromium.org/583883002/diff/60001/tools/metrics/histograms... tools/metrics/histograms/histograms.xml:17733: <histogram name="Net.TcpFastOpenSocketConnection" enum="TcpSocketStatus"> Per my earlier comment, you're adding two values to the TcpSocketStatus enum. You need to add them to this file as well, so raw numbers don't show up in the histograms. Just search in this file for the other occurrence of "TcpSocketStatus".
Thanks, Matt. I've addressed your comments, and I thought about the histograms... I've added another element, and described it in my responses inline. PTAL. https://codereview.chromium.org/583883002/diff/60001/net/socket/tcp_socket_li... File net/socket/tcp_socket_libevent.cc (right): https://codereview.chromium.org/583883002/diff/60001/net/socket/tcp_socket_li... net/socket/tcp_socket_libevent.cc:422: // Reset TCP FastOpen state. On 2014/09/24 14:46:11, mmenke wrote: > Should we have: > > if (tcp_fastopen_write_attempted_) { > UMA_HISTOGRAM_ENUMERATION("Net.TcpFastOpenSocketConnection" > fast_open_status_, FAST_OPEN_MAX_VALUE); > } > > And then just call Close() in the destructor? Good suggestion -- Done. https://codereview.chromium.org/583883002/diff/60001/net/socket/tcp_socket_li... net/socket/tcp_socket_libevent.cc:573: !g_tcp_fastopen_has_failed) { On 2014/09/24 14:46:11, mmenke wrote: > The !g_tcp_fastopen_has_failed still seems kinda weird here. Suppose this > happens: > > * Try to connect socket A using Tcp Fast Open > * Try to connect socket B using Tcp Fast Open > * Socket A ends up here with a failure. g_tcp_fastopen_has_failed is set to > false. > * Socket B ends up here (With a success or a failure): We now have > tcp_fastopen_write_attempted_ is true, tcp_fastopen_connected_ is false, > g_tcp_fastopen_has_failed is true, and tcp_fastopen_status_ is > TCP_FASTOPEN_FAST_CONNECT_RETURN or TCP_FASTOPEN_SLOW_CONNECT_RETURN, > presumably. We don't update tcp_fastopen_status_ for socket B because of the > failure, but we then record the histogram on close/destruction for socket > B...And the value we record isn't actually in any way comparable to the > histogram we record for socket A, since we don't call RecordTCPFastOpenStatus(). Hmm -- good point. I had the g_tcp_fastopen_has_failed check as a redundant one, but as you point out in your example, it prevents a perfectly good histogram event from being recorded correctly. Also in light of my new histogram additions, it makes sense to simply allow all failures to go through and record themselves. I'm removing the check for g_tcp_fastopen_has_failed here. https://codereview.chromium.org/583883002/diff/60001/net/socket/tcp_socket_li... net/socket/tcp_socket_libevent.cc:685: void TCPSocketLibevent::RecordTCPFastOpenStatus() { On 2014/09/24 14:46:11, mmenke wrote: > On 2014/09/24 00:51:26, Jana wrote: > > On 2014/09/23 15:21:00, mmenke wrote: > > > nit: Think "Record" is a bit of a misnomer, since we don't do the UMA_* > logic > > > here. > > > > Sure, but we are recording the status here locally, aren't we? UMA* logic > simply > > pushes this status out. I cant seem to think of a better verb -- feel free to > > suggest! > > Maybe UpdateTCPFastOpenStatusAfterRead? Done. https://codereview.chromium.org/583883002/diff/60001/net/socket/tcp_socket_li... net/socket/tcp_socket_libevent.cc:698: TCP_FASTOPEN_FAST_CONNECT_FAILED : TCP_FASTOPEN_SLOW_CONNECT_FAILED); On 2014/09/24 14:46:11, mmenke wrote: > nit: 4 space indent. Done. https://codereview.chromium.org/583883002/diff/60001/net/socket/tcp_socket_li... File net/socket/tcp_socket_libevent.h (right): https://codereview.chromium.org/583883002/diff/60001/net/socket/tcp_socket_li... net/socket/tcp_socket_libevent.h:145: TCP_FASTOPEN_MAX_VALUE On 2014/09/24 14:46:11, mmenke wrote: > One high level question... How do you intend to look at these histograms? The > fail and success numbers aren't really comparable, since we disable fastopen on > failure. > > Suppose we have two users, FastOpen works for one, but errors out on the other. > Both use 100 connections that could theoretically support fastopen. > > We record 1 failure (Maybe a couple more, if we start multiple connections > before our first failure), and 100 successes. How do you interpret those > histogram numbers? This is a good point. I thought about this a while... I've added a histogram element that records "Could have tried TFO but didn't since it had previously failed." I think between this one and the rest of the success/failure cases, I'll have all I need. In the case you describe, we will now record 1 failure + 99 previously_failed. I'm failrly certain that this gives us everything we need: (i) #successes vs. #failures (and potential_failures), and (ii) the distribution of various kinds of failures. https://codereview.chromium.org/583883002/diff/60001/tools/metrics/histograms... File tools/metrics/histograms/histograms.xml (left): https://codereview.chromium.org/583883002/diff/60001/tools/metrics/histograms... tools/metrics/histograms/histograms.xml:17733: <histogram name="Net.TcpFastOpenSocketConnection" enum="TcpSocketStatus"> On 2014/09/24 14:46:11, mmenke wrote: > Per my earlier comment, you're adding two values to the TcpSocketStatus enum. > You need to add them to this file as well, so raw numbers don't show up in the > histograms. Just search in this file for the other occurrence of > "TcpSocketStatus". Ah -- I don't know why I missed this. Thanks.
Jana: You'll need signoff from a histograms owner as well as me.
jri@chromium.org changed reviewers: + asvitkine@chromium.org - rdsmith@chromium.org
+asvitkine for histograms.
jri@chromium.org changed reviewers: + isherman@chromium.org
+isherman, also for histograms review (in case Alexei isn't able to get to this today.)
On 2014/09/26 00:05:45, Jana wrote: > +isherman, also for histograms review (in case Alexei isn't able to get to this > today.) Sorry, doesn't look like I'm going to get to it tonight myself - just too tired to do a reasonably careful review.
histograms lgtm
Thanks, Ilya!
jri@chromium.org changed reviewers: - asvitkine@chromium.org
https://codereview.chromium.org/583883002/diff/80001/net/socket/tcp_socket_li... File net/socket/tcp_socket_libevent.cc (right): https://codereview.chromium.org/583883002/diff/80001/net/socket/tcp_socket_li... net/socket/tcp_socket_libevent.cc:45: nit: Remove extra blank line. https://codereview.chromium.org/583883002/diff/80001/net/socket/tcp_socket_li... net/socket/tcp_socket_libevent.cc:273: rv = HandleReadCompleted(buf, rv); Previously, we didn't call RecordFastOpenStatus on synchronous read errors. That was a bug, right? https://codereview.chromium.org/583883002/diff/80001/net/socket/tcp_socket_li... net/socket/tcp_socket_libevent.cc:445: tcp_fastopen_status_ = TCP_FASTOPEN_PREVIOUSLY_FAILED; This doesn't quite match the not-yet-failed case - it doesn't catch socket reuses (Which is probably irrelevant), and does catch sockets that we never actually call connect on. I think this is coode enough, but maybe a comment about it? https://codereview.chromium.org/583883002/diff/80001/net/socket/tcp_socket_li... net/socket/tcp_socket_libevent.cc:590: UpdateTCPFastOpenStatusAfterRead(); Not relevant for this CL, but... So if we ever get an error during connect, like ERR_NO_ROUTE_TO_HOST, we disable fastopen until restart? https://codereview.chromium.org/583883002/diff/80001/net/socket/tcp_socket_li... net/socket/tcp_socket_libevent.cc:618: // TCP FastOpen for all subsequent connections. This comment seems to be making the assumption that it was the first write that failed, but this isn't necessarily true. We may end up writing to a socket multiple times before we end up reading from it, and this code triggers as long as we haven't completed a read yet (successfully or with error). Do we care to distinguish fails on the first write vs failures on a subsequent write? Also, if an embedder tries to read from and write to a socket at once, and both fail, the read error always takes precedence with this code, as long as the socket is alive long enough, regardless of the order we learn about the errors from. Do we care?
https://codereview.chromium.org/583883002/diff/80001/net/socket/tcp_socket_li... File net/socket/tcp_socket_libevent.cc (right): https://codereview.chromium.org/583883002/diff/80001/net/socket/tcp_socket_li... net/socket/tcp_socket_libevent.cc:590: UpdateTCPFastOpenStatusAfterRead(); On 2014/09/29 17:49:34, mmenke wrote: > Not relevant for this CL, but... So if we ever get an error during connect, > like ERR_NO_ROUTE_TO_HOST, we disable fastopen until restart? Not relevant meaning I don't want to force a more complicated / smarter solution in this CL, but I think we should think about it.
Thanks much, Matt, for the careful review. Responses inline. https://codereview.chromium.org/583883002/diff/80001/net/socket/tcp_socket_li... File net/socket/tcp_socket_libevent.cc (right): https://codereview.chromium.org/583883002/diff/80001/net/socket/tcp_socket_li... net/socket/tcp_socket_libevent.cc:45: On 2014/09/29 17:49:34, mmenke wrote: > nit: Remove extra blank line. Done. https://codereview.chromium.org/583883002/diff/80001/net/socket/tcp_socket_li... net/socket/tcp_socket_libevent.cc:273: rv = HandleReadCompleted(buf, rv); On 2014/09/29 17:49:34, mmenke wrote: > Previously, we didn't call RecordFastOpenStatus on synchronous read errors. > That was a bug, right? Yes, I believe so. That's why I moved it to HandleReadCompleted to ensure that it was always called at the end of a read. https://codereview.chromium.org/583883002/diff/80001/net/socket/tcp_socket_li... net/socket/tcp_socket_libevent.cc:445: tcp_fastopen_status_ = TCP_FASTOPEN_PREVIOUSLY_FAILED; On 2014/09/29 17:49:34, mmenke wrote: > This doesn't quite match the not-yet-failed case - it doesn't catch socket > reuses (Which is probably irrelevant), and does catch sockets that we never > actually call connect on. I think this is coode enough, but maybe a comment > about it? Yes, you're right on both accounts. I agree that it's reasonable to drop the socket reuse cases (we're conditioning all of this on SSL sockets at this point anyways), and I did think about doing it more accurately on a Write() call. I estimated that this is reasonably adequate, especially since we're now only measuring this for SSL sockets, and I expect that it should be uncommon for a subsequent connect to not occur. I've added a comment to tcp_socket_libevent.h and a TODO; PTAL. https://codereview.chromium.org/583883002/diff/80001/net/socket/tcp_socket_li... net/socket/tcp_socket_libevent.cc:590: UpdateTCPFastOpenStatusAfterRead(); On 2014/09/29 17:49:34, mmenke wrote: > Not relevant for this CL, but... So if we ever get an error during connect, > like ERR_NO_ROUTE_TO_HOST, we disable fastopen until restart? Yes we are more conservative than we need to be. We could condition turning TFO off on more specific errors... I would also like to do optimizations to re-enabling fastopen, such as re-enabling after a few hours, for instance. Adding TODOs everywhere this flag is set to true to document this point. I've also added a next step for filtering based on more specific write() errors. I did think about this, and I think that data has to point the way here. I agree that smarter machinery is in order, but we'll have to do this carefully to keep the user from getting persistently blackholed. https://codereview.chromium.org/583883002/diff/80001/net/socket/tcp_socket_li... net/socket/tcp_socket_libevent.cc:618: // TCP FastOpen for all subsequent connections. On 2014/09/29 17:49:34, mmenke wrote: > This comment seems to be making the assumption that it was the first write that > failed, but this isn't necessarily true. We may end up writing to a socket > multiple times before we end up reading from it, and this code triggers as long > as we haven't completed a read yet (successfully or with error). Do we care to > distinguish fails on the first write vs failures on a subsequent write? Hmm... good point. I am comfortable, especially given the scope of our experiment at this point (SSL only), to make this assumption. Either ways, this seems to me to fall under the category of better resilience on failures with FastOpen, and I'm comfortable leaving a note here to document this point. > Also, if an embedder tries to read from and write to a socket at once, and both > fail, the read error always takes precedence with this code, as long as the > socket is alive long enough, regardless of the order we learn about the errors > from. Do we care? There are other issues with an embedder attempting read before a write: the read will fail on a TFO socket (this is the read-before-write issue on a TFO socket that is yet unresolved, and a major reason for limiting this experiment to SSL). That said, yes, I think the read() error is the one that I explicitly wanted to give precedence to. I hadn't thought about the case you describe, but TFO failure modes are better articulated on a read, since we get additional information from the kernel there.
Two more questions, a couple minor comments. https://codereview.chromium.org/583883002/diff/80001/net/socket/tcp_socket_li... File net/socket/tcp_socket_libevent.cc (right): https://codereview.chromium.org/583883002/diff/80001/net/socket/tcp_socket_li... net/socket/tcp_socket_libevent.cc:618: // TCP FastOpen for all subsequent connections. On 2014/09/29 23:01:47, Jana wrote: > On 2014/09/29 17:49:34, mmenke wrote: > > This comment seems to be making the assumption that it was the first write > that > > failed, but this isn't necessarily true. We may end up writing to a socket > > multiple times before we end up reading from it, and this code triggers as > long > > as we haven't completed a read yet (successfully or with error). Do we care > to > > distinguish fails on the first write vs failures on a subsequent write? > > Hmm... good point. I am comfortable, especially given the scope of our > experiment at this point (SSL only), to make this assumption. Either ways, this > seems to me to fall under the category of better resilience on failures with > FastOpen, and I'm comfortable leaving a note here to document this point. > > > Also, if an embedder tries to read from and write to a socket at once, and > both > > fail, the read error always takes precedence with this code, as long as the > > socket is alive long enough, regardless of the order we learn about the errors > > from. Do we care? > > There are other issues with an embedder attempting read before a write: the read > will fail on a TFO socket (this is the read-before-write issue on a TFO socket > that is yet unresolved, and a major reason for limiting this experiment to SSL). > That said, yes, I think the read() error is the one that I explicitly wanted to > give precedence to. I hadn't thought about the case you describe, but TFO > failure modes are better articulated on a read, since we get additional > information from the kernel there. I should note I was wrong here. UpdateTCPFastOpenStatusAfterRead checks tcp_fastopen_status_, and does nothing on certain statuses. https://codereview.chromium.org/583883002/diff/100001/net/socket/tcp_socket_l... File net/socket/tcp_socket_libevent.cc (right): https://codereview.chromium.org/583883002/diff/100001/net/socket/tcp_socket_l... net/socket/tcp_socket_libevent.cc:242: if (use_tcp_fastopen_ && !tcp_fastopen_write_attempted_ && One more question: tcp_fastopen_write_attempted_ is set to true on the first write attempt, not when the first write (Which is also really the connection attempt) completes. Will socket_->IsConnected return true that soon? https://codereview.chromium.org/583883002/diff/100001/net/socket/tcp_socket_l... net/socket/tcp_socket_libevent.cc:587: if (rv >= 0) Is 0 a reliable indicator of success here? https://codereview.chromium.org/583883002/diff/100001/net/socket/tcp_socket_l... net/socket/tcp_socket_libevent.cc:704: tcp_fastopen_status_ != TCP_FASTOPEN_SLOW_CONNECT_RETURN)) { I think it's a little weird that here we check tcp_fastopen_status_, but in HandleReadCompleted we check tcp_fastopen_connected_. In both places, we're largely interested in the same thing. Do we even need the tcp_fastopen_status_ check there? https://codereview.chromium.org/583883002/diff/100001/net/socket/tcp_socket_l... net/socket/tcp_socket_libevent.cc:720: bool server_acked_data(false); nit: While you're here, "=" is almost always used to initialize fundamental types and enums
Thanks, Matt. I've answered your questions below, and made small mods to the code in response. PTAL. https://codereview.chromium.org/583883002/diff/100001/net/socket/tcp_socket_l... File net/socket/tcp_socket_libevent.cc (right): https://codereview.chromium.org/583883002/diff/100001/net/socket/tcp_socket_l... net/socket/tcp_socket_libevent.cc:242: if (use_tcp_fastopen_ && !tcp_fastopen_write_attempted_ && On 2014/09/30 14:30:34, mmenke wrote: > One more question: tcp_fastopen_write_attempted_ is set to true on the first > write attempt, not when the first write (Which is also really the connection > attempt) completes. Will socket_->IsConnected return true that soon? Good question -- it took me some digging to figure this out. So the answer turns out is yes. socket_->IsConnected() uses recv() and looks for EWOULDBLOCK to indicate that the socket is connected. If our TCP FastOpen sendto() returns success, then any subsequent read on a socket which has a TFO write in progress returns EWOULDBLOCK, which means socket_->IsConnected returns true. If the sendto() returns -1, with EINPROGRESS, that translates to IO_PENDING, which also results in the same outcome on socket_->IsConnected. If the write fails otherwise, then it's an error. So, as long as the Write() (connect) has started, socket_->IsConnected() will return true. Does that answer your question? https://codereview.chromium.org/583883002/diff/100001/net/socket/tcp_socket_l... net/socket/tcp_socket_libevent.cc:587: if (rv >= 0) On 2014/09/30 14:30:34, mmenke wrote: > Is 0 a reliable indicator of success here? I think so. 0 indicates a close, not an error, which is what should happen on any (including TFO) failure. As I understand it, if a TFO connect fails underneath, read returns -1 with errno set to ECONNREFUSED. https://codereview.chromium.org/583883002/diff/100001/net/socket/tcp_socket_l... net/socket/tcp_socket_libevent.cc:704: tcp_fastopen_status_ != TCP_FASTOPEN_SLOW_CONNECT_RETURN)) { On 2014/09/30 14:30:34, mmenke wrote: > I think it's a little weird that here we check tcp_fastopen_status_, but in > HandleReadCompleted we check tcp_fastopen_connected_. In both places, we're > largely interested in the same thing. > > Do we even need the tcp_fastopen_status_ check there? We don't need the fastopen status check here at all... since we shouldn't be calling this method unless we're in the ReadCompleted call with tcp_fastopen_write_attempted && !tcp_fastopen_connected. This was leftover code from earlier that I didn't clean up... I think its safe now to remove this, so I'm removing it, and adding a DCHECK. https://codereview.chromium.org/583883002/diff/100001/net/socket/tcp_socket_l... net/socket/tcp_socket_libevent.cc:720: bool server_acked_data(false); On 2014/09/30 14:30:34, mmenke wrote: > nit: While you're here, "=" is almost always used to initialize fundamental > types and enums Done.
LGTM! https://codereview.chromium.org/583883002/diff/100001/net/socket/tcp_socket_l... File net/socket/tcp_socket_libevent.cc (right): https://codereview.chromium.org/583883002/diff/100001/net/socket/tcp_socket_l... net/socket/tcp_socket_libevent.cc:587: if (rv >= 0) On 2014/10/01 01:54:00, Jana wrote: > On 2014/09/30 14:30:34, mmenke wrote: > > Is 0 a reliable indicator of success here? > > I think so. 0 indicates a close, not an error, which is what should happen on > any (including TFO) failure. As I understand it, if a TFO connect fails > underneath, read returns -1 with errno set to ECONNREFUSED. Right, my concern was just that a confused server could somehow managed to make this look like a graceful close... Can't really see how that would happen, though. https://codereview.chromium.org/583883002/diff/100001/net/socket/tcp_socket_l... net/socket/tcp_socket_libevent.cc:704: tcp_fastopen_status_ != TCP_FASTOPEN_SLOW_CONNECT_RETURN)) { On 2014/10/01 01:54:00, Jana wrote: > On 2014/09/30 14:30:34, mmenke wrote: > > I think it's a little weird that here we check tcp_fastopen_status_, but in > > HandleReadCompleted we check tcp_fastopen_connected_. In both places, we're > > largely interested in the same thing. > > > > Do we even need the tcp_fastopen_status_ check there? > > We don't need the fastopen status check here at all... since we shouldn't be > calling this method unless we're in the ReadCompleted call with > tcp_fastopen_write_attempted && !tcp_fastopen_connected. This was leftover code > from earlier that I didn't clean up... I think its safe now to remove this, so > I'm removing it, and adding a DCHECK. Two read failures in a row could get us here...But yea, after one read failure, no one should be reading from us again.
Awesome -- Thanks, Matt! I'll rebase and submit.
The CQ bit was checked by jri@chromium.org
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/583883002/160001
Message was sent while issue was closed.
Committed patchset #9 (id:160001) as 6373bd5afd83165835230108d465b89aca3e3b9b
Message was sent while issue was closed.
Patchset 9 (id:??) landed as https://crrev.com/764425441131e861b035499e3ecb5787c866a266 Cr-Commit-Position: refs/heads/master@{#297763} |