Issue 2163073002: Avoid crashing if MHTMLGenerationManager::JobFinished is called twice.

Łukasz Anforowicz

lukasza@chromium.org changed reviewers: + dewittj@chromium.org, rdsmith@chromium.org

4 years, 5 months ago (2016-07-19 23:27:36 UTC) #1

Łukasz Anforowicz

Randy + Justin, can you take a look? I've confirmed (with some printf-style debugging) that ...

4 years, 5 months ago (2016-07-19 23:27:37 UTC) #2

dewittj

this lgtm, but it all makes me wonder if there is a better ownership model ...

4 years, 5 months ago (2016-07-20 16:37:48 UTC) #3

Randy Smith (Not in Mondays)

LGTM either way, but if it would work to just stop observation on JobFinished(), do ...

4 years, 5 months ago (2016-07-20 16:46:58 UTC) #4

Randy Smith (Not in Mondays)

Oh, whoops, two followon comments: * Test for this case? * When reading the code ...

4 years, 5 months ago (2016-07-20 16:48:20 UTC) #5

Łukasz Anforowicz

Can you take another look please? On 2016/07/20 16:48:20, Randy Smith - Not in Fridays ...

4 years, 5 months ago (2016-07-20 21:10:07 UTC) #6

Can you take another look please?

On 2016/07/20 16:48:20, Randy Smith - Not in Fridays wrote:
> Oh, whoops, two followon comments:
> 
> * Test for this case?

I am hoping that regressions would be caught by ClusterFuzz.

I am having a difficult time thinking how to write a new, explicit testcase for
this crash.  Two reasons:
- The repro is inherently racey
- I don't understand what ClusterFuzz does to cause the renderer to exit

Actually, after some more thinking I thought that I could shut down
RenderProcessHost in a BrowserMessageFilter that watches out for MHTML responses
from the renderer (*).  This seemed like a good idea, but despite verifying that
MHTMLGenerationManager::RenderProcessExited gets called I cannot repro the crash
in 100 iterations... :-/

(*)

class GenerateMHTMLAndExitRendererMessageFilter : public BrowserMessageFilter {
 public:
  GenerateMHTMLAndExitRendererMessageFilter(
      RenderProcessHostImpl* render_process_host)
      : BrowserMessageFilter(FrameMsgStart),
        render_process_host_(render_process_host) {}

 protected:
  ~GenerateMHTMLAndExitRendererMessageFilter() override {}

 private:
  bool OnMessageReceived(const IPC::Message& message) override {
    if (message.type() == FrameHostMsg_SerializeAsMHTMLResponse::ID) {
      LOG(ERROR) << "TEST FILTER KICKING IN";
      BrowserThread::PostTask(BrowserThread::UI, FROM_HERE, base::Bind(
          base::IgnoreResult(&RenderProcessHostImpl::FastShutdownIfPossible),
          base::Unretained(render_process_host_)));
    }

    return false;
  };

  RenderProcessHostImpl* render_process_host_;

  DISALLOW_COPY_AND_ASSIGN(GenerateMHTMLAndExitRendererMessageFilter);
};

// Regression test for the crash/race from https://crbug.com/612098.
IN_PROC_BROWSER_TEST_F(MHTMLGenerationTest, GenerateMHTMLAndExitRenderer) {
  NavigateToURL(shell(), embedded_test_server()->GetURL("/simple_page.html"));

  RenderProcessHostImpl* render_process_host =
static_cast<RenderProcessHostImpl*>(
      shell()->web_contents()->GetRenderProcessHost());
  scoped_refptr<BrowserMessageFilter> filter =
      new GenerateMHTMLAndExitRendererMessageFilter(render_process_host);
  render_process_host->AddFilter(filter.get());

  base::FilePath path(temp_dir_.path());
  path = path.Append(FILE_PATH_LITERAL("test.mht"));
  GenerateMHTMLForCurrentPage(MHTMLGenerationParams(path));

  EXPECT_GT(ReadFileSizeFromDisk(path), 100);  // Verify the actual file size.
}
 
> * When reading the code in order to do this review I found myself tripping
over
> the concept of job before quite getting it.  Suggestion: Consider adding a
> sentence (either in the header file or the .md documentation) indicating that
> MHTMLGenerationManager is a singleton and there is one job per requested save
> operation which interacts with all relevant renderers?

Done - I've updated mhtml_generation_manager.h and save-page-as.md (it already
talked a little bit about this in "Classes overview" section, but I've expanded
other sections a little bit)

https://codereview.chromium.org/2163073002/diff/1/content/browser/download/mh...
File content/browser/download/mhtml_generation_manager.cc (right):

https://codereview.chromium.org/2163073002/diff/1/content/browser/download/mh...
content/browser/download/mhtml_generation_manager.cc:404: job->MarkAsFinished();
On 2016/07/20 16:46:58, Randy Smith - Not in Fridays wrote:
> On 2016/07/20 16:37:48, dewittj wrote:
> > May want to clear |job|'s |observed_renderer_process_host_| as well, to
> prevent
> > more spurious callbacks.

Done.  Good point.
> 
> This is a good point, but wouldn't this solve this problem in a different
> (possibly cleaner) way?  If you stop observing things when the job is signaled
> as finished on the UI thread, you remove the possibility of getting further
> spurious notifications and keep the flow simple.  
> 
> (I have no objection to adding the is_finished_ flag and using it in DCHECKs
or
> CHECKs to enforce the behavior, but I'm pulled towards a model that naturally
> guarantees a single call to JobFinished because of the state machine rather
than
> one that handles idempotency.)

Interesting.  This requires ensuring that there is no other way to call
JobFinished twice.  Before this CL a malicious-or-buggy renderer could send 2
SerializeAsMHTMLResponse(mhtml_generation_in_renderer_succeeded=false) quickly
one after the other, and cause similar crash (with no involvement from
RenderProcessExited).

Let me ensure that |frame_tree_node_id_of_busy_frame_|-related verification is
done also in |mhtml_generation_in_renderer_succeeded=false| case.  I believe
this should mean that JobFinished is only called once per job.

Done?

Randy Smith (Not in Mondays)

On 2016/07/20 21:10:07, Łukasz Anforowicz wrote: > Can you take another look please? > > ...

4 years, 5 months ago (2016-07-20 22:47:33 UTC) #7

On 2016/07/20 21:10:07, Łukasz Anforowicz wrote:
> Can you take another look please?
> 
> On 2016/07/20 16:48:20, Randy Smith - Not in Fridays wrote:
> > Oh, whoops, two followon comments:
> > 
> > * Test for this case?
> 
> I am hoping that regressions would be caught by ClusterFuzz.
> 
> I am having a difficult time thinking how to write a new, explicit testcase
for
> this crash.  Two reasons:
> - The repro is inherently racey
> - I don't understand what ClusterFuzz does to cause the renderer to exit
> 
> Actually, after some more thinking I thought that I could shut down
> RenderProcessHost in a BrowserMessageFilter that watches out for MHTML
responses
> from the renderer (*).  This seemed like a good idea, but despite verifying
that
> MHTMLGenerationManager::RenderProcessExited gets called I cannot repro the
crash
> in 100 iterations... :-/

Ok, I thought I understood this crash and now I no longer think that. 
Job::CloseFile turns into a no-op (well, ok, runs the callback with a -1
argument) if the browser_file_ isn't valid, and the browser_file_ is std::move'd
to the file thread in that function.  So repeated calls to Job::CloseFile should
be idempotent (modulo repeated callbacks) as long as job hasn't been deleted. 
If job has been deleted, FindJob() should return null, which should result in a
ReceivedBadMessage() in OnSErializeAsMHTMLResponse.  And destruction of the job
should result in the removal of the observer.  So if the success result happens
first this should work fine, and if the failure result happens I'd expect a
ReceivedBadMessage() in OnSerializeAsMHTMLResponse.  I'm sure there's something
basic I'm missing, but I'm now missing something; can you clarify it?

In terms of the test, my (obviously incomplete) analysis above suggests that
there's probably a difference based on the success/failure ordering, and it
sounds like your test should result in failure then success; maybe that's the
problem?  More generally, if your test were to work I would think it would cause
the crash in all cases, so I'd think single stepping through how it interfaces
through the code would allow you to figure out where the problem is.  At least
if you understand how the crash is happening in the first place, which I no
longer do :-}.

> 
> (*)
> 
> class GenerateMHTMLAndExitRendererMessageFilter : public BrowserMessageFilter
{
>  public:
>   GenerateMHTMLAndExitRendererMessageFilter(
>       RenderProcessHostImpl* render_process_host)
>       : BrowserMessageFilter(FrameMsgStart),
>         render_process_host_(render_process_host) {}
> 
>  protected:
>   ~GenerateMHTMLAndExitRendererMessageFilter() override {}
> 
>  private:
>   bool OnMessageReceived(const IPC::Message& message) override {
>     if (message.type() == FrameHostMsg_SerializeAsMHTMLResponse::ID) {
>       LOG(ERROR) << "TEST FILTER KICKING IN";
>       BrowserThread::PostTask(BrowserThread::UI, FROM_HERE, base::Bind(
>           base::IgnoreResult(&RenderProcessHostImpl::FastShutdownIfPossible),
>           base::Unretained(render_process_host_)));
>     }
> 
>     return false;
>   };
> 
>   RenderProcessHostImpl* render_process_host_;
> 
>   DISALLOW_COPY_AND_ASSIGN(GenerateMHTMLAndExitRendererMessageFilter);
> };
> 
> // Regression test for the crash/race from https://crbug.com/612098.
> IN_PROC_BROWSER_TEST_F(MHTMLGenerationTest, GenerateMHTMLAndExitRenderer) {
>   NavigateToURL(shell(), embedded_test_server()->GetURL("/simple_page.html"));
> 
>   RenderProcessHostImpl* render_process_host =
> static_cast<RenderProcessHostImpl*>(
>       shell()->web_contents()->GetRenderProcessHost());
>   scoped_refptr<BrowserMessageFilter> filter =
>       new GenerateMHTMLAndExitRendererMessageFilter(render_process_host);
>   render_process_host->AddFilter(filter.get());
> 
>   base::FilePath path(temp_dir_.path());
>   path = path.Append(FILE_PATH_LITERAL("test.mht"));
>   GenerateMHTMLForCurrentPage(MHTMLGenerationParams(path));
> 
>   EXPECT_GT(ReadFileSizeFromDisk(path), 100);  // Verify the actual file size.
> }
>  
> > * When reading the code in order to do this review I found myself tripping
> over
> > the concept of job before quite getting it.  Suggestion: Consider adding a
> > sentence (either in the header file or the .md documentation) indicating
that
> > MHTMLGenerationManager is a singleton and there is one job per requested
save
> > operation which interacts with all relevant renderers?
> 
> Done - I've updated mhtml_generation_manager.h and save-page-as.md (it already
> talked a little bit about this in "Classes overview" section, but I've
expanded
> other sections a little bit)
> 
>
https://codereview.chromium.org/2163073002/diff/1/content/browser/download/mh...
> File content/browser/download/mhtml_generation_manager.cc (right):
> 
>
https://codereview.chromium.org/2163073002/diff/1/content/browser/download/mh...
> content/browser/download/mhtml_generation_manager.cc:404:
job->MarkAsFinished();
> On 2016/07/20 16:46:58, Randy Smith - Not in Fridays wrote:
> > On 2016/07/20 16:37:48, dewittj wrote:
> > > May want to clear |job|'s |observed_renderer_process_host_| as well, to
> > prevent
> > > more spurious callbacks.
> 
> Done.  Good point.
> > 
> > This is a good point, but wouldn't this solve this problem in a different
> > (possibly cleaner) way?  If you stop observing things when the job is
signaled
> > as finished on the UI thread, you remove the possibility of getting further
> > spurious notifications and keep the flow simple.  
> > 
> > (I have no objection to adding the is_finished_ flag and using it in DCHECKs
> or
> > CHECKs to enforce the behavior, but I'm pulled towards a model that
naturally
> > guarantees a single call to JobFinished because of the state machine rather
> than
> > one that handles idempotency.)
> 
> Interesting.  This requires ensuring that there is no other way to call
> JobFinished twice.  Before this CL a malicious-or-buggy renderer could send 2
> SerializeAsMHTMLResponse(mhtml_generation_in_renderer_succeeded=false) quickly
> one after the other, and cause similar crash (with no involvement from
> RenderProcessExited).
> 
> Let me ensure that |frame_tree_node_id_of_busy_frame_|-related verification is
> done also in |mhtml_generation_in_renderer_succeeded=false| case.  I believe
> this should mean that JobFinished is only called once per job.
> 
> Done?

Randy Smith (Not in Mondays)

Whoops, also one comment. https://codereview.chromium.org/2163073002/diff/1/content/browser/download/mhtml_generation_manager.cc File content/browser/download/mhtml_generation_manager.cc (right): https://codereview.chromium.org/2163073002/diff/1/content/browser/download/mhtml_generation_manager.cc#newcode404 content/browser/download/mhtml_generation_manager.cc:404: job->MarkAsFinished(); On 2016/07/20 21:10:07, Łukasz ...

4 years, 5 months ago (2016-07-20 22:47:49 UTC) #8

Whoops, also one comment.

https://codereview.chromium.org/2163073002/diff/1/content/browser/download/mh...
File content/browser/download/mhtml_generation_manager.cc (right):

https://codereview.chromium.org/2163073002/diff/1/content/browser/download/mh...
content/browser/download/mhtml_generation_manager.cc:404: job->MarkAsFinished();
On 2016/07/20 21:10:07, Łukasz Anforowicz wrote:
> On 2016/07/20 16:46:58, Randy Smith - Not in Fridays wrote:
> > On 2016/07/20 16:37:48, dewittj wrote:
> > > May want to clear |job|'s |observed_renderer_process_host_| as well, to
> > prevent
> > > more spurious callbacks.
> 
> Done.  Good point.
> > 
> > This is a good point, but wouldn't this solve this problem in a different
> > (possibly cleaner) way?  If you stop observing things when the job is
signaled
> > as finished on the UI thread, you remove the possibility of getting further
> > spurious notifications and keep the flow simple.  
> > 
> > (I have no objection to adding the is_finished_ flag and using it in DCHECKs
> or
> > CHECKs to enforce the behavior, but I'm pulled towards a model that
naturally
> > guarantees a single call to JobFinished because of the state machine rather
> than
> > one that handles idempotency.)
> 
> Interesting.  This requires ensuring that there is no other way to call
> JobFinished twice.  Before this CL a malicious-or-buggy renderer could send 2
> SerializeAsMHTMLResponse(mhtml_generation_in_renderer_succeeded=false) quickly
> one after the other, and cause similar crash (with no involvement from
> RenderProcessExited).
> 
> Let me ensure that |frame_tree_node_id_of_busy_frame_|-related verification is
> done also in |mhtml_generation_in_renderer_succeeded=false| case.  I believe
> this should mean that JobFinished is only called once per job.
> 
> Done?

Sure, that sounds good.  Thanks for thinking the issue through further than I
had.  But that sounds exactly right--there's one place the message can come from
in the success path (the single frame with the message outstanding to it) one it
can come from in the error path, and each path guarantees that the other won't
be able to send a message following.

Łukasz Anforowicz

On 2016/07/20 22:47:33, Randy Smith - Not in Fridays wrote: > Ok, I thought I ...

4 years, 5 months ago (2016-07-20 22:53:11 UTC) #9

Łukasz Anforowicz

Randy, I have a repro test in the latest patchset. It consistently fails on my ...

4 years, 5 months ago (2016-07-21 21:11:29 UTC) #10

dewittj

does a regular PostTask without delay repro the failure?

4 years, 5 months ago (2016-07-21 21:14:45 UTC) #11

Łukasz Anforowicz

On 2016/07/21 21:14:45, dewittj wrote: > does a regular PostTask without delay repro the failure? ...

4 years, 5 months ago (2016-07-21 21:24:56 UTC) #12

Randy Smith (Not in Mondays)

On 2016/07/21 21:24:56, Łukasz Anforowicz wrote: > On 2016/07/21 21:14:45, dewittj wrote: > > does ...

4 years, 5 months ago (2016-07-21 21:34:25 UTC) #13

On 2016/07/21 21:24:56, Łukasz Anforowicz wrote:
> On 2016/07/21 21:14:45, dewittj wrote:
> > does a regular PostTask without delay repro the failure?
> 
> It does not.
> 
> As Randy pointed out offline we have to achieve the following sequence on the
UI
> thread:
> 1. Regular handling of the IPC calls JobFinished
> 2. RendererProcessExited calls JobFinished and synchronously calls
OnFileClosed
> (because file was std::moved)
> 3. OnFileClosed is called from #1 (and crashes before this CL)
> 
> When we intercept the IPC message (and are ready to inject the test-trigerred
> renderer exit), we are on IO thread and have not yet posted to the UI thread a
> call to the real/product handler of the IPC message.  If we called PostTask,
> then #2 would happen before #1.  Calling PostDelayedTask means that #1 and #2
> are in the right order.  We still have a race between #2 and #3 that depends
on
> the delay (if the delay is too big, then #3 wins and we have no crash because
#3
> stops observation that triggers #2 [even before this CL]).

Would it work to bounce the FastShutdownIfPossible task off the file thread
before heading over to the UI thread?  Actually, thinking it through, what you
want to do is send it to the UI thread, and *then* bounce it off the FILE
thread.  My thinking is that our interception is on the IO thread before the
post to the UI thread of the successful result, but we know that successful
result will be posted in the same IO processed task as we are in.  If we send
the shutdown message -> UI -> FILE -> UI, it will initially hit the UI thread
before the success message (but the success message will be right behind it in
the queue because it'll be dispatched right afterwards from a different message
handling function), will hit the FILE thread before the file close message from
the UI thread generated by the success message, and (key), then hit the UI
thread *before* the return message from the FILE thread indicating success of
the file shutdown.  There's still a bit of a race in that if for some reason our
message gets delayed in getting to the UI thread there's some chance the message
back from the FILE thread will get executed first, but I basically can't imagine
how that would happen; the IO->UI messages are going to be put right next to
each other in the UI queue, and when the UI sends the message to the FILE
thread, the next thing it should do is execute our destruction message.  WDYT?

Please do document the h**l out of whatever you/we end of doing--I pity the poor
person who will need to figure out what the heck is going on in this test in the
future.

Łukasz Anforowicz

On 2016/07/21 21:34:25, Randy Smith - Not in Fridays wrote: > On 2016/07/21 21:24:56, Łukasz ...

4 years, 5 months ago (2016-07-21 22:50:30 UTC) #14

On 2016/07/21 21:34:25, Randy Smith - Not in Fridays wrote:
> On 2016/07/21 21:24:56, Łukasz Anforowicz wrote:
> > On 2016/07/21 21:14:45, dewittj wrote:
> > > does a regular PostTask without delay repro the failure?
> > 
> > It does not.
> > 
> > As Randy pointed out offline we have to achieve the following sequence on
the
> UI
> > thread:
> > 1. Regular handling of the IPC calls JobFinished
> > 2. RendererProcessExited calls JobFinished and synchronously calls
> OnFileClosed
> > (because file was std::moved)
> > 3. OnFileClosed is called from #1 (and crashes before this CL)
> > 
> > When we intercept the IPC message (and are ready to inject the
test-trigerred
> > renderer exit), we are on IO thread and have not yet posted to the UI thread
a
> > call to the real/product handler of the IPC message.  If we called PostTask,
> > then #2 would happen before #1.  Calling PostDelayedTask means that #1 and
#2
> > are in the right order.  We still have a race between #2 and #3 that depends
> on
> > the delay (if the delay is too big, then #3 wins and we have no crash
because
> #3
> > stops observation that triggers #2 [even before this CL]).
> 
> Would it work to bounce the FastShutdownIfPossible task off the file thread
> before heading over to the UI thread?  Actually, thinking it through, what you
> want to do is send it to the UI thread, and *then* bounce it off the FILE
> thread.  My thinking is that our interception is on the IO thread before the
> post to the UI thread of the successful result, but we know that successful
> result will be posted in the same IO processed task as we are in.  If we send
> the shutdown message -> UI -> FILE -> UI, it will initially hit the UI thread
> before the success message (but the success message will be right behind it in
> the queue because it'll be dispatched right afterwards from a different
message
> handling function), will hit the FILE thread before the file close message
from
> the UI thread generated by the success message, and (key), then hit the UI
> thread *before* the return message from the FILE thread indicating success of
> the file shutdown.  There's still a bit of a race in that if for some reason
our
> message gets delayed in getting to the UI thread there's some chance the
message
> back from the FILE thread will get executed first, but I basically can't
imagine
> how that would happen; the IO->UI messages are going to be put right next to
> each other in the UI queue, and when the UI sends the message to the FILE
> thread, the next thing it should do is execute our destruction message.  WDYT?

Hmmm... indeed - this works and consistently repros the crash.
I've got this in patchset #4.

> Please do document the h**l out of whatever you/we end of doing--I pity the
poor
> person who will need to figure out what the heck is going on in this test in
the
> future.

Done in a big comment in the ...browsertest.cc file.

I wonder how easy or difficult it will be to preserve this test after
mojofication... :-/

Łukasz Anforowicz

I assume it is okay to land this CL in the current shape. Please shout ...

4 years, 5 months ago (2016-07-22 16:34:44 UTC) #15

Łukasz Anforowicz

The CQ bit was checked by lukasza@chromium.org to run a CQ dry run

4 years, 5 months ago (2016-07-22 16:34:52 UTC) #16

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2163073002/60001

4 years, 5 months ago (2016-07-22 16:35:07 UTC) #17

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 5 months ago (2016-07-22 17:19:40 UTC) #18

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: win_chromium_rel_ng on master.tryserver.chromium.win (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_rel_ng/builds/260684)

4 years, 5 months ago (2016-07-22 17:19:40 UTC) #19

Łukasz Anforowicz

The CQ bit was checked by lukasza@chromium.org

4 years, 5 months ago (2016-07-22 19:32:57 UTC) #20

Łukasz Anforowicz

The patchset sent to the CQ was uploaded after l-g-t-m from rdsmith@chromium.org, dewittj@chromium.org Link to ...

4 years, 5 months ago (2016-07-22 19:32:58 UTC) #21

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2163073002/60001

4 years, 5 months ago (2016-07-22 19:33:38 UTC) #22

commit-bot: I haz the power

Description was changed from ========== Avoid crashing if MHTMLGenerationManager::JobFinished is called twice. This CL makes ...

4 years, 5 months ago (2016-07-22 20:25:44 UTC) #24

commit-bot: I haz the power

Patchset 4 (id:??) landed as https://crrev.com/a668d7811c2c19662cbeec9e10a3e283292445e5 Cr-Commit-Position: refs/heads/master@{#407244}

4 years, 5 months ago (2016-07-22 20:25:45 UTC) #25

Randy Smith (Not in Mondays)

4 years, 5 months ago (2016-07-23 14:06:35 UTC) #26

Message was sent while issue was closed.

Still/again LGTM.  Sorry to not get back to you yesterday; busy, messy day.

Issue 2163073002: Avoid crashing if MHTMLGenerationManager::JobFinished is called twice. (Closed)

Description

Patch Set 1 #

Patch Set 2 : Extra IPC sanitization + simpler idempotency. #

Patch Set 3 : Repro test for the crash. #

Patch Set 4 : Posting the injection ->UI->FILE->UI. #

Messages