|
|
Chromium Code Reviews|
Created:
6 years, 7 months ago by mdempsky Modified:
6 years, 7 months ago CC:
chromium-reviews, darin-cc_chromium.org, jam Base URL:
svn://svn.chromium.org/chrome/trunk/src Visibility:
Public. |
DescriptionUse RecvMsgWithPid to find real PID for zygote children
The new PID discovery protocol now works as follows:
1. The ZygoteHost allocates a UNIX socket pair and includes one end
with its fork request to the zygote process.
2. After forking, the zygote child process sends a message over the
UNIX socket pair.
3. The ZygoteHost receives the message and a PID for the sender (i.e.,
child), and writes the PID back over the control socket to the zygote.
If the zygote fails to fork a child, its end of the socket pair will
be closed, and the ZygoteHost will receive EOF instead of a message.
4. In the non-NaCl case, the zygote writes the child's PID over a pipe
for the child to receive. (In the NaCl case, this pipe is no longer
used and no PID value is sent to the NaCl child process.)
5. Finally, the zygote sends its normal pickled response back to the
ZygoteHost.
Two manual tests to make sure this works:
1. Make sure that the "End Process" button in the "Task Manager" dialog
still terminates renderer and NaCl processes correctly under Linux.
2. Make sure that Linux Chrome release builds gracefully fail (i.e.,
don't crash) when nacl_helper is missing at chrome startup and the user
navigates to a page that uses NaCl; e.g.,
http://gonativeclient.appspot.com/demo/ and click "Game of Life".
BUG=357670
TEST=manual: see above
NOTRY=true
Committed: https://src.chromium.org/viewvc/chrome?view=rev&revision=269011
Patch Set 1 #Patch Set 2 : Tweak PID discovery to handle crashing child processes #
Total comments: 29
Patch Set 3 : Respond to jln feedback #
Total comments: 14
Patch Set 4 : Respond to more jln feedback #Patch Set 5 : LOG(FATAL) if we don't receive a child ping #
Total comments: 4
Patch Set 6 : More jln feedback #Patch Set 7 : Fix compile error #
Total comments: 6
Patch Set 8 : Restore error handling code paths #
Messages
Total messages: 27 (0 generated)
This is still a work in progress, but tests are already green, so I thought I'd mail it out so you can take a look and give any early feedback you might have. I feel like the ZygoteHost<->Zygote fork protocol is a bit brittle, but it seems to work. Happy to discuss it more.
- This looks pretty good, but I'm a little worried that it's hard to assess that the childs are still "trusted" while we do all this. It's especially true in the NaCl case, where I think the child could just go on since it doesn't wait to get its pid back. It's not a big deal, but something to think about. Maybe this simply requires a bit more comment, or a final "I'm a child and I'm no longer trusted" message, so that we clearly know in the Zygote when we can expect that the child could become hostile. - I would love if we could use Pickles instead of raw read/write to sockets for the pids. This predates you, so feel free to punt to a later CL. - waitpid + WNOHANG is racy. A child's socket can appear close, but the kernel can still return 0 in waitpid(). We should rewrite the main Zygote loop with a ppoll that also listens to SIGCHLD. Once that's done, we can avoid waitpid and simply add the pid to a "reap later" set. It's ok to keep as-is (as it's clearly a bad failure case if we can assert that the child is still trusted at this point). Maybe we should NOREACHED() it. https://chromiumcodereview.appspot.com/269543014/diff/20001/components/nacl/l... File components/nacl/loader/nacl_helper_linux.cc (right): https://chromiumcodereview.appspot.com/269543014/diff/20001/components/nacl/l... components/nacl/loader/nacl_helper_linux.cc:120: "x", Let's make this a defined constant and check for it. https://chromiumcodereview.appspot.com/269543014/diff/20001/content/browser/z... File content/browser/zygote_host/zygote_host_impl_linux.cc (right): https://chromiumcodereview.appspot.com/269543014/diff/20001/content/browser/z... content/browser/zygote_host/zygote_host_impl_linux.cc:305: if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, raw_socks) != 0) { I think it's fine to PCHECK here, this looks like a catastrophic case. https://chromiumcodereview.appspot.com/269543014/diff/20001/content/browser/z... content/browser/zygote_host/zygote_host_impl_linux.cc:345: autoclose_fds.clear(); Hmmm, I sense a new SendMessage that takes ownership might be useful :) (For another CL) https://chromiumcodereview.appspot.com/269543014/diff/20001/content/zygote/zy... File content/zygote/zygote_linux.cc (right): https://chromiumcodereview.appspot.com/269543014/diff/20001/content/zygote/zy... content/zygote/zygote_linux.cc:55: if (pipe(raw_pipe) != 0) Feel free to PCHECK and return void, as you prefer. https://chromiumcodereview.appspot.com/269543014/diff/20001/content/zygote/zy... content/zygote/zygote_linux.cc:345: pid_oracle.get(), "x", 1, std::vector<int>())); Do you want to make it a defined constant and check it on the other end? The FD soup can get confusing and it could be helpful to immediately know if one uses the wrong fd. https://chromiumcodereview.appspot.com/269543014/diff/20001/content/zygote/zy... content/zygote/zygote_linux.cc:349: if (!base::ReadFromFD(read_pipe.get(), I would much rather not add more of these without using Pickles. Since existing code already does that, feel free to land as-is (and eventually migrate to Pickles in a further CL). https://chromiumcodereview.appspot.com/269543014/diff/20001/content/zygote/zy... content/zygote/zygote_linux.cc:377: reinterpret_cast<char*>(&real_pid), Pickle? https://chromiumcodereview.appspot.com/269543014/diff/20001/content/zygote/zy... content/zygote/zygote_linux.cc:386: error: Let's kill this label at all cost! https://chromiumcodereview.appspot.com/269543014/diff/20001/content/zygote/zy... content/zygote/zygote_linux.cc:387: if (waitpid(pid, NULL, WNOHANG) == -1) This is racy. The kernel can perfectly return immediately even if the child is dead, so we'd keep a zombie around. Let's at least document this and return an (other) error if waitpid returns 0. https://chromiumcodereview.appspot.com/269543014/diff/20001/content/zygote/zy... content/zygote/zygote_linux.cc:392: // If we're not using a helper, send the PID back to the child process. Isn't it better to send something in all cases? I would much rather be sure that the child process is still "trusted" and this point and know that it'll wait for us https://chromiumcodereview.appspot.com/269543014/diff/20001/content/zygote/zy... content/zygote/zygote_linux.cc:395: HANDLE_EINTR(write(write_pipe.get(), &real_pid, sizeof(real_pid))); Pickle? https://chromiumcodereview.appspot.com/269543014/diff/20001/content/zygote/zy... content/zygote/zygote_linux.cc:395: HANDLE_EINTR(write(write_pipe.get(), &real_pid, sizeof(real_pid))); Aren't we at risk of SIGPIPE if the child disappears here? Do you want to use sendmsg with MSG_NOSIGNAL instead? https://chromiumcodereview.appspot.com/269543014/diff/20001/content/zygote/zy... content/zygote/zygote_linux.cc:444: if (numfds != static_cast<int>(fds.size()) - 1) This seems very confusing. Is it really problematic to account for the oracle socket on the other end?
Thanks for the feedback! I've responded to a couple of your questions, and I'll work on implementing the rest of your suggestions now. https://chromiumcodereview.appspot.com/269543014/diff/20001/components/nacl/l... File components/nacl/loader/nacl_helper_linux.cc (right): https://chromiumcodereview.appspot.com/269543014/diff/20001/components/nacl/l... components/nacl/loader/nacl_helper_linux.cc:120: "x", On 2014/05/02 18:25:00, jln wrote: > Let's make this a defined constant and check for it. Yep, I just used a dummy string constant for proof-of-concept/simplicity. I'll fix that properly now. https://chromiumcodereview.appspot.com/269543014/diff/20001/content/zygote/zy... File content/zygote/zygote_linux.cc (right): https://chromiumcodereview.appspot.com/269543014/diff/20001/content/zygote/zy... content/zygote/zygote_linux.cc:349: if (!base::ReadFromFD(read_pipe.get(), On 2014/05/02 18:25:00, jln wrote: > I would much rather not add more of these without using Pickles. > > Since existing code already does that, feel free to land as-is (and eventually > migrate to Pickles in a further CL). I'll look into this, and defer if I think it'll make the current CL too ugly. https://chromiumcodereview.appspot.com/269543014/diff/20001/content/zygote/zy... content/zygote/zygote_linux.cc:386: error: On 2014/05/02 18:25:00, jln wrote: > Let's kill this label at all cost! Yeah, I'm not a fan of it and hope to kill it completely still. I at least got it down to just two cases and it only has the waitpid() call, so it's at least a big improvement over the status quo I think. https://chromiumcodereview.appspot.com/269543014/diff/20001/content/zygote/zy... content/zygote/zygote_linux.cc:392: // If we're not using a helper, send the PID back to the child process. On 2014/05/02 18:25:00, jln wrote: > Isn't it better to send something in all cases? I would much rather be sure that > the child process is still "trusted" and this point and know that it'll wait for > us I'm leaning towards not sending the PID, but I don't feel super strongly either way. My rationale for doing it this way was: 1. We want to discourage dependencies on children knowing their real PID, and helper (i.e., NaCl) children already don't care. 2. Non-SFI NaCl children actually specifically don't want to know their real PID, so we need to add some helper-specific logic here either way; previously it was sending a useless 0 PID, but now we might as well not send anything. 3. Previously sending the PID was also an important synchronization so the child would know when we've discovered its PID and it can close the dummy socket. This is no longer a concern now that we don't have a dummy socket. 4. Arguably the children are still trusted even now, because we haven't responded to the browser yet, so it hasn't had a chance to send the child any untrusted content. https://chromiumcodereview.appspot.com/269543014/diff/20001/content/zygote/zy... content/zygote/zygote_linux.cc:395: HANDLE_EINTR(write(write_pipe.get(), &real_pid, sizeof(real_pid))); On 2014/05/02 18:25:00, jln wrote: > Aren't we at risk of SIGPIPE if the child disappears here? Yeah, I was wondering about that. I was somewhat relying on us ignoring SIGPIPE. > Do you want to use sendmsg with MSG_NOSIGNAL instead? Yeah, I'll change the pipe to a socketpair and use sendmsg/MSG_NOSIGNAL instead.
https://codereview.chromium.org/269543014/diff/20001/components/nacl/loader/n... File components/nacl/loader/nacl_helper_linux.cc (right): https://codereview.chromium.org/269543014/diff/20001/components/nacl/loader/n... components/nacl/loader/nacl_helper_linux.cc:120: "x", On 2014/05/02 18:25:00, jln wrote: > Let's make this a defined constant and check for it. Done. https://codereview.chromium.org/269543014/diff/20001/content/browser/zygote_h... File content/browser/zygote_host/zygote_host_impl_linux.cc (right): https://codereview.chromium.org/269543014/diff/20001/content/browser/zygote_h... content/browser/zygote_host/zygote_host_impl_linux.cc:305: if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, raw_socks) != 0) { On 2014/05/02 18:25:00, jln wrote: > I think it's fine to PCHECK here, this looks like a catastrophic case. Done. https://codereview.chromium.org/269543014/diff/20001/content/browser/zygote_h... content/browser/zygote_host/zygote_host_impl_linux.cc:345: autoclose_fds.clear(); On 2014/05/02 18:25:00, jln wrote: > Hmmm, I sense a new SendMessage that takes ownership might be useful :) (For > another CL) Deferred. This is a bit tricky because we're only closing a subset of the descriptors we're sending, so we can't simply pass ownership of all the descriptors. https://codereview.chromium.org/269543014/diff/20001/content/zygote/zygote_li... File content/zygote/zygote_linux.cc (right): https://codereview.chromium.org/269543014/diff/20001/content/zygote/zygote_li... content/zygote/zygote_linux.cc:55: if (pipe(raw_pipe) != 0) On 2014/05/02 18:25:00, jln wrote: > Feel free to PCHECK and return void, as you prefer. Done. https://codereview.chromium.org/269543014/diff/20001/content/zygote/zygote_li... content/zygote/zygote_linux.cc:345: pid_oracle.get(), "x", 1, std::vector<int>())); On 2014/05/02 18:25:00, jln wrote: > Do you want to make it a defined constant and check it on the other end? > > The FD soup can get confusing and it could be helpful to immediately know if one > uses the wrong fd. Done. https://codereview.chromium.org/269543014/diff/20001/content/zygote/zygote_li... content/zygote/zygote_linux.cc:349: if (!base::ReadFromFD(read_pipe.get(), On 2014/05/02 20:23:09, mdempsky wrote: > I'll look into this, and defer if I think it'll make the current CL too ugly. I'm going to keep the pipe using just plain read/write for now. I'll fix that in a followup. https://codereview.chromium.org/269543014/diff/20001/content/zygote/zygote_li... content/zygote/zygote_linux.cc:377: reinterpret_cast<char*>(&real_pid), On 2014/05/02 18:25:00, jln wrote: > Pickle? Done. https://codereview.chromium.org/269543014/diff/20001/content/zygote/zygote_li... content/zygote/zygote_linux.cc:386: error: On 2014/05/02 18:25:00, jln wrote: > Let's kill this label at all cost! Done. https://codereview.chromium.org/269543014/diff/20001/content/zygote/zygote_li... content/zygote/zygote_linux.cc:387: if (waitpid(pid, NULL, WNOHANG) == -1) On 2014/05/02 18:25:00, jln wrote: > This is racy. The kernel can perfectly return immediately even if the child is > dead, so we'd keep a zombie around. I changed this to using kill(pid, SIGKILL) in KillAndReap(), so I think there shouldn't be any races now. It should also help catch if the child process closes the PID oracle socket, but then doesn't exit on its own. https://codereview.chromium.org/269543014/diff/20001/content/zygote/zygote_li... content/zygote/zygote_linux.cc:395: HANDLE_EINTR(write(write_pipe.get(), &real_pid, sizeof(real_pid))); On 2014/05/02 20:23:09, mdempsky wrote: > Yeah, I'll change the pipe to a socketpair and use sendmsg/MSG_NOSIGNAL instead. Decided to keep as a pipe and use read/write for now. I noticed a few other places where we're write()'ing to pipes/sockets and risk SIGPIPE'ing, so I'm going to tackle that separately. https://codereview.chromium.org/269543014/diff/20001/content/zygote/zygote_li... content/zygote/zygote_linux.cc:444: if (numfds != static_cast<int>(fds.size()) - 1) On 2014/05/02 18:25:00, jln wrote: > This seems very confusing. > Is it really problematic to account for the oracle socket on the other end? Nope, just didn't think to do that instead. Agreed, the logic is a bit cleaner this way.
This looks good in general, but I think we may want more assertions and less fallbacks. This code is very complex and has no unittest (and it wouldn't be reasonably possible to add them without a major refactor), so assertions are key to prevent regressions. While we're only running "trusted" code, I think we should LOG(FATAL) as much as possible. One could argue that the ZygoteHost should not have to trust the Zygote, and I could understand using NOTREACHED() there instead. (Even though in practice the zygote can trivially DoS the browser and that's not going to be easy to change. Let me know what you think. My preference would be LOG(FATAL) with a comment explaining that the child is still in a trusted codepath. https://chromiumcodereview.appspot.com/269543014/diff/40001/content/browser/z... File content/browser/zygote_host/zygote_host_impl_linux.cc (right): https://chromiumcodereview.appspot.com/269543014/diff/40001/content/browser/z... content/browser/zygote_host/zygote_host_impl_linux.cc:310: pickle.WriteInt(1 + mapping.size()); This should be documented. How about something such as: size_t num_fds_to_send = 0; Then when you to fds.push_back(peer_sock) you add 1 to this. And before you add the mapping you add mapping.size() (and then do pickle.WriteInt(num_fds_to_send);) https://chromiumcodereview.appspot.com/269543014/diff/40001/content/browser/z... content/browser/zygote_host/zygote_host_impl_linux.cc:350: real_pid = -1; Add NOTREACHED()? I wouldn't be shocked by LOG(FATAL) either. https://chromiumcodereview.appspot.com/269543014/diff/40001/content/zygote/zy... File content/zygote/zygote_linux.cc (right): https://chromiumcodereview.appspot.com/269543014/diff/40001/content/zygote/zy... content/zygote/zygote_linux.cc:73: PCHECK(pid == waitpid(pid, NULL, 0)); HANDLE_EINTR() https://chromiumcodereview.appspot.com/269543014/diff/40001/content/zygote/zy... content/zygote/zygote_linux.cc:191: LOG(WARNING) << "Unexpected real PID message from browser"; Let's add NOTREACHED() then? https://chromiumcodereview.appspot.com/269543014/diff/40001/content/zygote/zy... content/zygote/zygote_linux.cc:416: KillAndReap(pid, use_helper); Since the children are trusted at this point, what do you think about LOG(FATAL) instead here? This code is very complex and I think we should use assertion as much as possible when they make sense to prevent things from regressing. https://chromiumcodereview.appspot.com/269543014/diff/40001/content/zygote/zy... content/zygote/zygote_linux.cc:426: KillAndReap(pid, use_helper); Same remark, how about just LOG(FATAL)? KillAndReap is already too complex since it won't work with fork delegates. https://chromiumcodereview.appspot.com/269543014/diff/40001/content/zygote/zy... File content/zygote/zygote_linux.h (right): https://chromiumcodereview.appspot.com/269543014/diff/40001/content/zygote/zy... content/zygote/zygote_linux.h:80: // UMA_HISTOGRAM_ENUMERATION. It's a little unfortunately that this method is so poorly documented, but let's add some documentation for |pid_oracle| at least.
https://codereview.chromium.org/269543014/diff/40001/content/browser/zygote_h... File content/browser/zygote_host/zygote_host_impl_linux.cc (right): https://codereview.chromium.org/269543014/diff/40001/content/browser/zygote_h... content/browser/zygote_host/zygote_host_impl_linux.cc:310: pickle.WriteInt(1 + mapping.size()); On 2014/05/05 19:01:42, jln wrote: > This should be documented. Done. I implemented something slightly different than you suggested, but I think it amounts to the same thing. https://codereview.chromium.org/269543014/diff/40001/content/browser/zygote_h... content/browser/zygote_host/zygote_host_impl_linux.cc:350: real_pid = -1; On 2014/05/05 19:01:42, jln wrote: > Add NOTREACHED()? I wouldn't be shocked by LOG(FATAL) either. Done. Went with LOG(FATAL). https://codereview.chromium.org/269543014/diff/40001/content/zygote/zygote_li... File content/zygote/zygote_linux.cc (right): https://codereview.chromium.org/269543014/diff/40001/content/zygote/zygote_li... content/zygote/zygote_linux.cc:73: PCHECK(pid == waitpid(pid, NULL, 0)); On 2014/05/05 19:01:42, jln wrote: > HANDLE_EINTR() Removed KillAndReap(). https://codereview.chromium.org/269543014/diff/40001/content/zygote/zygote_li... content/zygote/zygote_linux.cc:191: LOG(WARNING) << "Unexpected real PID message from browser"; On 2014/05/05 19:01:42, jln wrote: > Let's add NOTREACHED() then? Done. https://codereview.chromium.org/269543014/diff/40001/content/zygote/zygote_li... content/zygote/zygote_linux.cc:416: KillAndReap(pid, use_helper); On 2014/05/05 19:01:42, jln wrote: > Since the children are trusted at this point, what do you think about LOG(FATAL) > instead here? Works for me, done. https://codereview.chromium.org/269543014/diff/40001/content/zygote/zygote_li... content/zygote/zygote_linux.cc:426: KillAndReap(pid, use_helper); On 2014/05/05 19:01:42, jln wrote: > Same remark, how about just LOG(FATAL)? KillAndReap is already too complex since > it won't work with fork delegates. Done. https://codereview.chromium.org/269543014/diff/40001/content/zygote/zygote_li... File content/zygote/zygote_linux.h (right): https://codereview.chromium.org/269543014/diff/40001/content/zygote/zygote_li... content/zygote/zygote_linux.h:80: // UMA_HISTOGRAM_ENUMERATION. On 2014/05/05 19:01:42, jln wrote: > It's a little unfortunately that this method is so poorly documented, but let's > add some documentation for |pid_oracle| at least. Done. Tried to improve the uma_* docs a little bit too.
lgtm https://chromiumcodereview.appspot.com/269543014/diff/80001/components/nacl/l... File components/nacl/loader/nacl_helper_linux.cc (right): https://chromiumcodereview.appspot.com/269543014/diff/80001/components/nacl/l... components/nacl/loader/nacl_helper_linux.cc:118: // Ping the PID oracle socket. Feel free to add a DCHECK that child_fds.size() is > than std::max(kPidOracleFDIndex, kBrowserFDndex). https://chromiumcodereview.appspot.com/269543014/diff/80001/content/zygote/zy... File content/zygote/zygote_linux.cc (right): https://chromiumcodereview.appspot.com/269543014/diff/80001/content/zygote/zy... content/zygote/zygote_linux.cc:175: LOG(WARNING) << "Unexpected real PID message from browser"; LOG(ERROR) then?
https://chromiumcodereview.appspot.com/269543014/diff/80001/components/nacl/l... File components/nacl/loader/nacl_helper_linux.cc (right): https://chromiumcodereview.appspot.com/269543014/diff/80001/components/nacl/l... components/nacl/loader/nacl_helper_linux.cc:118: // Ping the PID oracle socket. On 2014/05/05 23:09:33, jln wrote: > Feel free to add a DCHECK that child_fds.size() is > than > std::max(kPidOracleFDIndex, kBrowserFDndex). Done. https://chromiumcodereview.appspot.com/269543014/diff/80001/content/zygote/zy... File content/zygote/zygote_linux.cc (right): https://chromiumcodereview.appspot.com/269543014/diff/80001/content/zygote/zy... content/zygote/zygote_linux.cc:175: LOG(WARNING) << "Unexpected real PID message from browser"; On 2014/05/05 23:09:33, jln wrote: > LOG(ERROR) then? Done.
piman: Please review tiny changes to content/public for OWNERS approval. Thanks!
lgtm
Mark, could you please take a look for OWNER approval of the small change in components/nacl/loader/nacl_helper_linux.cc ?
I do have a concern about the robustness of trusting the child process to be running, in cases of OOM -- see below. How is this code path tested these days? Are there any automated test cases that check that the PID we get back works? IIRC, it used to be that this was only tested manually, by checking the PID in the Task Manager GUI and checking that "End Process" worked. If that's still the case, would it be worth adding an automated test? Can you add a "TEST=" line to the commit message, please? https://codereview.chromium.org/269543014/diff/120001/content/browser/zygote_... File content/browser/zygote_host/zygote_host_impl_linux.cc (right): https://codereview.chromium.org/269543014/diff/120001/content/browser/zygote_... content/browser/zygote_host/zygote_host_impl_linux.cc:357: LOG(FATAL) << "Did not receive ping from zygote child"; What happens if the nacl_helper executable is missing? It might be that Chromium starts and works, but if it ever tries to launch a NaCl process, this failure occurs and kills the browser. That would be unfortunate. It would be worth testing manually. I *suspect* that what happens in that case is that ForkRequest() returns -1, because NaClForkDelegate::Fork() returns -1 if nacl_helper didn't start successfully. You only check for a -1 result from the IPC below... Might this also kill the browser unnecessarily in OOM conditions? IIRC, zygote subprocesses are set to be more readily killable on OOM (though I'm not sure how early that happens), and any recently-fork()'d child is probably more of a target for the OOM killer than the browser would normally be. It looks like this would also kill the browser if fork() returned an error, e.g. due to out-of-memory. https://codereview.chromium.org/269543014/diff/120001/content/common/zygote_c... File content/common/zygote_commands_linux.h (right): https://codereview.chromium.org/269543014/diff/120001/content/common/zygote_c... content/common/zygote_commands_linux.h:21: // Message sent by zygote children to the browser so it can discover their Nit: to disambiguate -- "so the browser can discover the zygote child's process ID" perhaps? https://codereview.chromium.org/269543014/diff/120001/content/zygote/zygote_l... File content/zygote/zygote_linux.cc (right): https://codereview.chromium.org/269543014/diff/120001/content/zygote/zygote_l... content/zygote/zygote_linux.cc:400: // a message to the browser, the browser won't have found its PID. Won't the browser have crashed with LOG(FATAL) in that case?
> How is this code path tested these days? Are there any automated test cases > that check that the PID we get back works? There are some unit tests to make sure that RecvMsgWithPid() works across PID namespaces in sandbox/linux/services/unix_domain_socket_unittests.cc, and we do same sanity checking when we launch the Zygote that it's working correctly (e.g., to rule out pre-3.0 Linux kernels that might support PID namespaces but not do the translation correctly). There's no zygote unit tests to my knowledge aside from browser_tests and related full browser testing. > IIRC, it used to be that this was only tested manually, by checking the PID in > the Task Manager GUI and checking that "End Process" worked. If that's still > the case, would it be worth adding an automated test? Can you add a "TEST=" > line to the commit message, please? Will do. https://codereview.chromium.org/269543014/diff/120001/content/browser/zygote_... File content/browser/zygote_host/zygote_host_impl_linux.cc (right): https://codereview.chromium.org/269543014/diff/120001/content/browser/zygote_... content/browser/zygote_host/zygote_host_impl_linux.cc:357: LOG(FATAL) << "Did not receive ping from zygote child"; On 2014/05/06 23:34:13, Mark Seaborn wrote: > What happens if the nacl_helper executable is missing? With the current patch set, like you suggest: if the browser tries to launch a NaCl process, it will crash here. Earlier patch sets, I had more fallback cases, but jln@ and I decided to remove them in favor of fatal failures, because we don't have good testing coverage for this code. I'm fine with adding them back in though if you think that's important. > Might this also kill the browser unnecessarily in OOM conditions? IIRC, zygote > subprocesses are set to be more readily killable on OOM (though I'm not sure how > early that happens), and any recently-fork()'d child is probably more of a > target for the OOM killer than the browser would normally be. I believe the zygote child's OOM score is adjusted at the end of this function with the call to AdjustRendererOOMScore(). > It looks like this would also kill the browser if fork() returned an error, e.g. > due to out-of-memory. Yes, I believe so. https://codereview.chromium.org/269543014/diff/120001/content/zygote/zygote_l... File content/zygote/zygote_linux.cc (right): https://codereview.chromium.org/269543014/diff/120001/content/zygote/zygote_l... content/zygote/zygote_linux.cc:400: // a message to the browser, the browser won't have found its PID. On 2014/05/06 23:34:13, Mark Seaborn wrote: > Won't the browser have crashed with LOG(FATAL) in that case? In this version of the patch set, yes. The comment's a bit vestigial from earlier patch set revisions where I tried to handle this case.
On 2014/05/06 23:56:37, mdempsky wrote: > Earlier patch sets, I had more fallback cases, but jln@ and I decided to remove > them in favor of fatal failures, because we don't have good testing coverage for > this code. I'm fine with adding them back in though if you think that's > important. It's hard to strike the right balance here. I'm in general more in favor of strong assertion for this code, given how poorly tested it is. Matthew and I had discussed the robustness aspect in case of OOM and concluded that we may want to crash hard, but that was by a tiny margin. There are good arguments for a gracefull fallback in this case, so I think doing a graceful fallback in the nacl_loader / ZygoteHost paths is fine, as long as we NOTREACHED() it to ensure that everything is going as planned in debug builds. Mark, does that work for you?
I've restored the error handling paths that I originally had, so failing to fork a child or the child crashing before it sends the ping should no longer be fatal (aside from the UNREACHED() in debug builds). I'll verify tomorrow that it fails gracefully when nacl_helper is missing. https://codereview.chromium.org/269543014/diff/120001/content/common/zygote_c... File content/common/zygote_commands_linux.h (right): https://codereview.chromium.org/269543014/diff/120001/content/common/zygote_c... content/common/zygote_commands_linux.h:21: // Message sent by zygote children to the browser so it can discover their On 2014/05/06 23:34:13, Mark Seaborn wrote: > Nit: to disambiguate -- "so the browser can discover the zygote child's process > ID" perhaps? Done.
On 6 May 2014 17:52, <jln@chromium.org> wrote: > On 2014/05/06 23:56:37, mdempsky wrote: > >> Earlier patch sets, I had more fallback cases, but jln@ and I decided to >> remove > > them in favor of fatal failures, because we don't have good testing >> coverage for > > this code. I'm fine with adding them back in though if you think that's >> important. >> > > It's hard to strike the right balance here. I'm in general more in favor of > strong assertion for this code, given how poorly tested it is. > > Matthew and I had discussed the robustness aspect in case of OOM and > concluded > that we may want to crash hard, but that was by a tiny margin. > > There are good arguments for a gracefull fallback in this case, so I think > doing > a graceful fallback in the nacl_loader / ZygoteHost paths is fine, as long > as we > NOTREACHED() it to ensure that everything is going as planned in debug > builds. > > Mark, does that work for you? Yes, that's OK with me, thanks. Mark To unsubscribe from this group and stop receiving emails from it, send an email to chromium-reviews+unsubscribe@chromium.org.
LGTM
On 2014/05/07 05:39:25, mdempsky wrote: > I'll verify tomorrow that it > fails gracefully when nacl_helper is missing. Verified that Chrome release mode still behaves sanely when nacl_helper is missing. E.g., I loaded gonativeclient.appspot.com/demo/ and loaded the Game of Life demo, and saw "Load failed" and these console errors: [8:8:0507/115642:ERROR:nacl_fork_delegate_linux.cc(299)] Cannot launch NaCl process: nacl_helper failed to start [20304:20374:0507/115642:ERROR:zygote_host_impl_linux.cc(357)] Did not receive ping from zygote child [8:8:0507/115642:ERROR:zygote_linux.cc(524)] Zygote could not fork: process_type nacl-loader numfds 2 child_pid -1 [20304:20376:0507/115642:ERROR:child_process_launcher.cc(329)] Failed to launch child process [26,2587847040:11:56:42.806087] ServiceRuntime::StartSelLdrContinuation (start failed) A debug build on the other hand crashes as expected. However, the actual crash is currently the DCHECK(pid != -1) in unix_domain_socket_linux.cc instead of the NOTREACHED() in zygote_host_impl_linux.cc; I plan to fix the DCHECK in unix_domain_socket_linux.cc, but unless anyone objects I don't think it's worth holding up this CL for that fix. Additionally, I verified that the Task Manager still works correctly for terminating renderer and NaCl processes.
The CQ bit was checked by mdempsky@chromium.org
On 2014/05/06 23:34:12, Mark Seaborn wrote: > IIRC, it used to be that this was only tested manually, by checking the PID in > the Task Manager GUI and checking that "End Process" worked. If that's still > the case, would it be worth adding an automated test? Can you add a "TEST=" > line to the commit message, please? Done. I added two TEST= lines describing the manual tests that I conducted.
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/mdempsky@chromium.org/269543014/140001
FYI, CQ is re-trying this CL (attempt #1). Please consider checking whether the failures are real, and report flakes to chrome-troopers@google.com. The failing builders are: android_dbg_triggered_tests on tryserver.chromium win_chromium_rel on tryserver.chromium
CQ is drunk. All bots that matters are green. Added NOTRY=true
The CQ bit was unchecked by jln@chromium.org
The CQ bit was checked by jln@chromium.org
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/mdempsky@chromium.org/269543014/140001
Message was sent while issue was closed.
Change committed as 269011 |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
