Created: 6 years, 3 months ago by miu
Modified: 6 years, 3 months ago
Reviewers: Alpha Left Google
CC: chromium-reviews, hclam+watch_chromium.org, cbentzel+watch_chromium.org, imcheng+watch_chromium.org, hguihot+watch_chromium.org, jasonroberts+watch_google.com, avayvod+watch_chromium.org, pwestin+watch_google.com, feature-media-reviews_chromium.org, miu+watch_chromium.org, hubbe+watch_chromium.org, mikhal+watch_chromium.org
Base URL: https://chromium.googlesource.com/chromium/src.git@master
Project: chromium
Visibility: Public.
Description:
[Cast] Sanity-check for unbounded queue growth in PacedSender.
Removes the spammy warning message, and replaces it with a sanity-
check that the queue has not grown past 10 seconds' worth of
packets. If the sanity-check trips, a process crash dump will be
sent to make us aware, via crash reports, that this is a problem
in-the-wild.
Committed: https://crrev.com/1904e8682930a06460ffc323cc8b1c88c5f4966b
Cr-Commit-Position: refs/heads/master@{#294928}
Patch Set 1
Patch Set 2: Generate crash report if impossible situation is reached.
Patch Set 3: No packet dropping.
Messages
Total messages: 18 (2 generated)
miu@chromium.org changed reviewers: + hclam@chromium.org
hclam: PTAL.
What bug is this going to fix? We should never drop RTCP packets; in fact, the pacer overwrites the RTCP packet with a later version. Also, it doesn't seem right to not send anything.
On 2014/09/04 22:39:52, Alpha wrote:
> What bug is this going to fix? We should never RTCP packets, in fact the pacer
> overwrites the RTCP packet with a later version. Also it doesn't seem right to
> not send anything.

Good point on the RTCP packets. I misunderstood that code. I'll revert the change to SendRtcpPacket() in the next patch set.

The bug this fixes: This removes that annoying "Packet queue is too long" warning message that spams the logs. I chatted with hubbe@ yesterday, and he said it was bogus.

When I removed the warning, a part of me was scared PacedSender's queue could grow indefinitely. I wanted to make sure there was protection for when this is running in-the-wild.

IMO, this is the right approach, since 10 seconds' worth of packets is a huge upper-bound. It can easily be 50+ MB of data held in memory. If this point is reached, it is a strong signal that the queue is growing faster than it is emptying. Also, it's perfectly fine to not send anything because packets are dropped in downstream code anyway (or as UDP packets on the network). The callers of the Send/Resend methods will simply attempt to re-send packets when the ACKs are not received.
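The "10 seconds' worth of packets" bound above can be put in rough numbers. The helper below is a sketch for illustration only (it is not code from this CL), and the 10 Mbps bitrate and 1500-byte packet size used in the comments are assumptions, not values from PacedSender:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: how many packets accumulate over a window of
// `seconds` at a given bitrate, assuming a fixed packet size.
constexpr size_t MaxEnqueuedPackets(uint64_t bitrate_bps,
                                    size_t packet_size_bytes,
                                    unsigned seconds) {
  // Total bytes produced over the window, divided by the packet size.
  return static_cast<size_t>(bitrate_bps / 8 * seconds / packet_size_bytes);
}

// Hypothetical helper: total bytes held if that whole window is queued.
constexpr size_t QueueBytes(uint64_t bitrate_bps, unsigned seconds) {
  return static_cast<size_t>(bitrate_bps / 8 * seconds);
}
```

At an assumed 10 Mbps with ~1500-byte packets, 10 seconds is roughly 8,300 packets (~12.5 MB); at higher mirroring bitrates the queue would reach the 50+ MB figure quoted above.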
> IMO, this is the right approach, since 10 seconds' worth of packets is a huge
> upper-bound. It can easily be 50+ MB of data held in memory. If this point is
> reached, it is a strong signal that the queue is growing faster than it is
> emptying. Also, it's perfectly fine to not send anything because packets are
> dropped in downstream code anyway (or as UDP packets on the network). The
> callers of the Send/Resend methods will simply attempt to re-send packets when
> the ACKs are not received.

Would it be easier to just raise the limit? If we do accumulate this much data, it seems worse not to send it (and silently drop it). The receiver doesn't know there's subsequent data. Really, having this much data queued indicates something is wrong, and, what's worse, the developer cannot detect it.

Currently the only possible ways to lose a packet are a socket error or loss in the network. I would prefer not to add other ways that packets can be dropped.
On 2014/09/05 18:48:29, Alpha wrote:
> Would the easier to just to raise the limit?

I don't understand. Raise what limit?

> If we do accumulate this much data it seems worse not to send it (and silently
> drop it). Receiver doesn't know there's subsequent data. And really have this
> much data queued indicates something is wrong and if it seems worse developer
> cannot detect it.

The point is, if this much data accumulates in the PacedSender, it will NEVER be able to empty the queue. I totally agree that having this much data queued means something is wrong. That's why I've left in WARNING messages. I can change to LOG(DFATAL) if you feel that's more appropriate.

BTW--It's not true that the receiver doesn't know about subsequent data. Two reasons:

1. FrameSender::ResendForKickstart().

2. After some packets are sent, PacedSender will once again allow packets to be enqueued. The caller of SendPackets() will send packets for later frames, and then the receiver will notice that some packets were dropped in-between.

> Currently the only possible way to lose a packet is socket error or lost in the
> network. I would prefer not to add other ways that packets can be dropped.

Packets are never going to be dropped, except when the 10-second queue length limit is reached. That should never happen unless we have a bug in our implementation. And if we do have a bug, at least we won't gobble-up memory indefinitely until Chrome crashes. ;-)
On 2014/09/05 20:50:34, miu wrote:
> > Would the easier to just to raise the limit?
>
> I don't understand. Raise what limit?

https://code.google.com/p/chromium/codesearch#chromium/src/media/cast/net/pac...

The limit is 100; wouldn't it be simple to raise it to something like kMaxPacketsAllowedEnqueued, if the goal is to have the developer notice such a case without being annoying?

> The point is, if this much data accumulates in the PacedSender, it will NEVER be
> able to empty the queue. I totally agree that having this much data queued
> means something is wrong. That's why I've left in WARNING messages. I can
> change to LOG(DFATAL) if you feel that's more appropriate.
> [...]
> Packets are never going to be dropped, except when the 10-second queue length
> limit is reached. That should never happen unless we have a bug in our
> implementation. And if we do have a bug, at least we won't gobble-up memory
> indefinitely until Chrome crashes. ;-)

I'm just not sure how dropping packets when we exceed kMaxPacketsAllowedEnqueued can help, because the situation we're talking about is one where our implementation is critically broken. Having it run continuously undetected is not ideal. What about having PacedSender flow the error to the Javascript API, such that we can terminate the session with an error logged in the extension?
On 2014/09/08 18:26:28, Alpha wrote:
> The limit is 100, wouldn't it be simple to raise it to something like
> kMaxPacketsAllowedEnqueued. If the goal is to have developer notice such a case
> and not be annoying.
> [...]
> I'm just not sure how dropping packets if exceed kMaxPacketsAllowedEnqueued can
> help, because the situation we're talking about is our implementation is
> critically broken. Having it running continuously undetected is not ideal. What
> about having PacedSender flow the error to the Javascript API such that we can
> terminate the session with an error logged in the extension?

Undetected problems are horrible. I suggest that in addition to dropping packets if we reach kMaxPacketsAllowedEnqueued, we should also call DumpWithoutCrashing(). That way we will know if the impossible is less impossible than we thought.

/Hubbe
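Hubbe's suggestion (and what patch set 3 ultimately lands on: report, don't drop) can be sketched as follows. This is not the CL's code: `DumpWithoutCrashingStub` is a stand-in for Chromium's real `base::debug::DumpWithoutCrashing()`, and the class name and limit here are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <utility>
#include <vector>

// Stand-in for base::debug::DumpWithoutCrashing(), which uploads a
// minidump without terminating the process. Here we just count calls.
static int g_dump_count = 0;
void DumpWithoutCrashingStub() { ++g_dump_count; }

using Packet = std::vector<uint8_t>;

// Hypothetical queue wrapper illustrating the sanity-check behavior:
// nothing is ever dropped; crossing the "impossible" size only files
// a one-time crash report so the bug becomes visible in crash stats.
class BoundedQueueCheck {
 public:
  explicit BoundedQueueCheck(size_t impossible_limit)
      : impossible_limit_(impossible_limit) {}

  void Enqueue(Packet packet) {
    queue_.push_back(std::move(packet));
    if (queue_.size() >= impossible_limit_ && !dumped_) {
      dumped_ = true;  // report once per process, not on every packet
      DumpWithoutCrashingStub();
    }
  }

  size_t size() const { return queue_.size(); }

 private:
  const size_t impossible_limit_;
  bool dumped_ = false;
  std::deque<Packet> queue_;
};
```

The one-shot `dumped_` flag mirrors the need to rate-limit reports: once the queue is past the limit, every subsequent enqueue would otherwise generate a dump.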
PTAL. Patch Set 2 incorporates fixes, plus Hubbe's idea about crash reporting when the "impossible queue size" is reached. Other comments in-line:

On 2014/09/08 18:26:28, Alpha wrote:
> The limit is 100, wouldn't it be simple to raise it to something like
> kMaxPacketsAllowedEnqueued. If the goal is to have developer notice such a case
> and not be annoying.

No, there actually is NO limit in the current code. "100" is only a threshold for determining when to log the warning message, and it's currently wrong (and very spammy).

> I'm just not sure how dropping packets if exceed kMaxPacketsAllowedEnqueued can
> help, because the situation we're talking about is our implementation is
> critically broken. Having it running continuously undetected is not ideal.

I totally AGREE with you in that we don't want to hide problems. I liked Hubbe's idea about creating a crash report if our "impossible limit" is ever reached, and incorporated this in the latest PS.

What I don't agree with is that you seem to be arguing we should not drop packets, and instead let the queue grow until Chrome crashes on OOM. IMO, it's much better for Cast to simply "freeze" for the user, a bug that can be easily identified for our team to deal with, rather than a random Chrome OOM crash that is not actionable (because these crashes rarely have stack traces that point at the problematic code).

> What about having PacedSender flow the error to the Javascript API such that we
> can terminate the session with an error logged in the extension?

Possibly. What would be really useful is to bubble-up any irrecoverable Cast run-time failures via the JS API. That's a much bigger project/change than I intend to tackle in this change.
> What I don't agree with is that you see to be arguing we do not drop packets,
> and instead let the queue grow until Chrome crashes on OOM. IMO, it's much
> better for Cast to simply "freeze" for the user, a bug that can be easily
> identified for our team to deal with, rather than a random Chrome OOM crash that
> is not actionable (because these crashes rarely have stack traces that point at
> the problematic code).

I'm not saying we should let the queue grow until there's an OOM. My point is that there's no bug (that we know of) that would cause the queue to grow so big that it causes an OOM. I'm just wary of putting in a change for an issue that may or may not exist. If this problem does come up, the best thing to do is to terminate the session with an error message to report.

> Possibly. What would be really useful is to bubble-up any irrecoverable Cast
> run-time failures via the JS API. That's a much bigger project/change than I
> intend to tackle in this change.

What about filing a bug (we might already have one) so that the JS API can install an error callback for network transport? This way the network layer can report things like destination unreachable and critical failures like this one.
On 2014/09/11 01:04:16, Alpha wrote:
> > What I don't agree with is that you see to be arguing we do not drop packets,
> > and instead let the queue grow until Chrome crashes on OOM. IMO, it's much
> > better for Cast to simply "freeze" for the user, a bug that can be easily
> > identified for our team to deal with, rather than a random Chrome OOM crash
> > that is not actionable (because these crashes rarely have stack traces that
> > point at the problematic code).

One more point: if the number of packets does grow to such a volume, then dropping them in PacedSender is not enough. Packets are already cached in PacketStorage, so dropping a packet in PacedSender doesn't actually release it: there won't be a material effect on memory usage. I think the right approach is to bubble this kind of error up to the top-level JS API. But if you think this approach fixes the problem you see, I'm okay with filing a bug for doing the right thing.
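The shared-ownership point above can be illustrated with plain `std::shared_ptr` standing in for Cast's ref-counted packet type; the names and structure here are hypothetical, not the real PacketStorage/PacedSender types:

```cpp
#include <cstdint>
#include <memory>
#include <vector>

using Packet = std::shared_ptr<std::vector<uint8_t>>;

// Returns the payload's reference count after the pacer "drops" its copy.
// The retransmission cache (PacketStorage in the real code) still holds a
// reference, so erasing the pacer's entry frees no payload memory.
long RefCountAfterPacerDrop() {
  Packet payload = std::make_shared<std::vector<uint8_t>>(1500);

  std::vector<Packet> packet_storage{payload};  // retransmission cache
  std::vector<Packet> pacer_queue{payload};     // pacer's send queue

  pacer_queue.clear();  // "drop" the packet in the pacer only

  // Local `payload` + packet_storage entry are still alive.
  return payload.use_count();
}
```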
New patch set (#3). PTAL. I took out all the packet dropping. All this change does now is to remove the spammy warning, but send a crash dump if the "impossible limit" is ever reached by users in-the-wild. LGTU?
On 2014/09/15 21:29:20, miu wrote:
> New patch set (#3). PTAL.
>
> I took out all the packet dropping. All this change does now is to remove the
> spammy warning, but send a crash dump if the "impossible limit" is ever reached
> by users in-the-wild.
>
> LGTU?

Code LGTM. Please update the CL description to reflect the changes.
On 2014/09/15 21:30:56, Alpha wrote:
> Code LGTM. Needs to update the CL description to reflect the changes.

Just did. Thanks! :)
The CQ bit was checked by miu@chromium.org
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patchset/539323002/40001
Message was sent while issue was closed.
Committed patchset #3 (id:40001) as d07112e514112b45a62c7f2685332099d4d4ffa3
Message was sent while issue was closed.
Patchset 3 (id:??) landed as https://crrev.com/1904e8682930a06460ffc323cc8b1c88c5f4966b
Cr-Commit-Position: refs/heads/master@{#294928}