Chromium Code Reviews
chromiumcodereview-hr@appspot.gserviceaccount.com (chromiumcodereview-hr) | Please choose your nickname with Settings | Help | Chromium Project | Gerrit Changes | Sign out
(74)

Issue 2540483002: Allow Offline MHTML save operations to succeed even with missing frames.

Created:
4 years ago by carlosk
Modified:
3 years, 8 months ago
CC:
chromium-reviews, asanka, romax+watch_chromium.org, fgorski+watch_chromium.org, dewittj+watch_chromium.org, jam, petewil+watch_chromium.org, chili+watch_chromium.org, darin-cc_chromium.org, dimich+watch_chromium.org
Target Ref:
refs/pending/heads/master
Project:
chromium
Visibility:
Public.

Description

Allow Offline MHTML save operations to succeed even with missing frames. This change adds a new configuration parameter to MHTMLGenerationParams to allow for ignoring missing frames when saving a page to MHTML. This makes the saving process more robust at the cost of possible loss of page consistency/content. For the case of Offline usage, having *a* page saved is much more useful than having *none*. Some sites have pages with very dynamic DOMs, frames are coming and going frequently, causing a high save failure rate. This change should improve these situations. New tests were also added to test both the previously untested case of failing when a frame disappears and the newly introduced case of ignoring the same situation. BUG=655708, 655723

Patch Set 1 #

Total comments: 13

Patch Set 2 : Address code review comments. #

Unified diffs Side-by-side diffs Delta from patch set Stats (+146 lines, -11 lines) Patch
M chrome/browser/android/offline_pages/offline_page_mhtml_archiver.cc View 1 chunk +1 line, -0 lines 0 comments Download
M content/browser/download/mhtml_generation_browsertest.cc View 1 11 chunks +128 lines, -5 lines 0 comments Download
M content/browser/download/mhtml_generation_manager.cc View 1 chunk +12 lines, -6 lines 0 comments Download
M content/public/common/mhtml_generation_params.h View 1 chunk +5 lines, -0 lines 0 comments Download

Depends on Patchset:

Messages

Total messages: 11 (7 generated)
carlosk
fgorski@: PTAL at offline_pages files. nasko@, lukasza@: PTAL and MHTML generation and test files. Note ...
4 years ago (2016-11-28 21:34:16 UTC) #4
Łukasz Anforowicz
I've provided feedback on some small nits, but a bigger question I had was: how ...
4 years ago (2016-11-28 22:15:37 UTC) #5
carlosk
To make sure this is clear: I am not changing the default behavior of MHTML ...
4 years ago (2016-11-29 02:11:20 UTC) #8
Łukasz Anforowicz
4 years ago (2016-11-29 19:18:13 UTC) #9
On 2016/11/29 02:11:20, carlosk wrote:
> To make sure this is clear: I am not changing the default behavior of MHTML
> generation. Only the users/callers that actively set the new bool parameter to
> ignore missing frames will get different results. Only Offline pages do that
> with this patch.

Are we able to distinguish in the UMA data, whether MHTML generation was
triggered by the user VS by the offline pages feature?  I guess after this CL
lands, the remaining failures would have to be triggered by something outside of
offline pages.  I wonder if these other cases might also benefit from the change
in behavior implemented by the CL under review.

> In the Offline context we were seeing frequent failures on some specific
> websites, like TechCrunch for instance (see https://crbug.com/655723). Once I
> had the results of the failure reason from the patch this change is based off
I
> confirmed what was our suspicion that iframes disappearing during the load
were
> causing the failures. These are less important iframes (ads, general
> notifications, etc) that should not cause the whole saving operation to fail.

Out of curiosity - do you know if changing when the save happens might have
changed the frequency of the problem?  (i.e. maybe frames are more likely to
disappear shortly after a load)
> 
> On 2016/11/28 22:15:37, Łukasz Anforowicz wrote:
> > I've provided feedback on some small nits, but a bigger question I had was:
> how
> > the generated mhtml file will behave/render in the fallback case?
> 
> With Offline we already save pages midway through a load and many times these
> pages are acceptable enough. We still don't have a systematic test to verify
the
> "acceptability" of a semi-loaded saved page -- a not simple problem to solve
--
> but we're working on it.
> 
> Also keep in mind that our focus users are from EM markets where terrible
> connectivity is the norm. They are already used to interacting with
half-loaded
> pages and having something useful enough for offline usage is better than
having
> nothing.

Ack.
>  
> > 1) Is it worth adding the mhtml from the fallback (removed frames) case to
> > third_party/WebKit/LayoutTests/mhtml ?
> 
> You mean storing the generated MHTML with missing iframes to verify it loads
as
> expected, right?
> 
Yes.  This would also be a regression test for making sure the dialog mentioned
below doesn't show up (in case of a dangling CID URI).

> While trying to do this I noticed a big problem of this change which I hadn't
> realized before: when the last frame we will not write the footer and then
have
> a completely invalid file. Currently header and footer are being written by
the
> renderer and that is a problem in this case. While thinking about optimizing
> MHTML generation I had thought about moving these two parts to the browser and
> this problem provides another reason to do so.
> 
> I will have to investigate it further and will probably need another patch to
> land before this one. What are your thoughts on this?
> 
Thanks for catching the problem.

Some random notes:

- Writing to the file in the renderer process is mostly done out of inertia -
this is the way it was always done and there was no good reason to change this.

- Writing to the file from the browser process sounds good.  It sounds
especially good for short, bounded snippets of the file (like the final mhtml
boundary).  It is trickier for unbounded file parts (like mhtml snipets),
because of limits on the IPC size.

> > 2) does the mhtml generated in the fallback case render properly in IE and
> Edge?
> 
> The best answer I can give right now is that we don't care. Seriously! :)

Isn't advancing *open* (and interoperable) web platform part of Chromium agenda?
:-P

> We are currently only using the Offline generated files internally to
> Chrome/Offline (saved files are stored in Chrome private folder on Android).
> Once we get to implement the sharing of offlined pages we will have to worry
> about that.

Hmm.... okay... I guess we can take care of this at that point.

> > 3) what does Chromium do when opening an mhtml file with some missing parts?

> > will any network requests be made (to fetch http://example.com/foo if there
is
> > no corresponding mhtml part)?  what if the missing part was referred to
using
> a
> > content-id URI - will the dialog pointed out in https://crbug.com/570353
show
> up
> > (I vaguely [and possibly incorrectly] recall that maybe the dialog behavior
is
> > different on Android vs on desktop platforms...).
> >
> > Maybe the questions above shouldn't affect the CL under review, because the
> > dangling URIs are a known, preexisting issue (https://crbug.com/570353).  At
> the
> > same time, the CL under review does seem to make the dangling URI case more
> > common (not just dangling-CID case, but also a dangling-HTTP case).
> 
> I did not personally test the quality of half-loaded saved pages but as I
> mentioned they are being looked at and something useful is frequently the
case.
> 
> I was told by my team that no network requests are made when opening an MHTML
> file for security reasons.

Oh, okay.  I knew that there were discussions about this, but I didn't know that
we already made sure that network requests are blocked in case of MHTML.
> 
> I wasn't aware of that issue. Indeed this change will make MTHML files with
> dangling cid: URIs more common. But I'd say it should be solved as a separate
> issue. Also as a data point: I tried opening the attached mhtml file and I
> didn't get an external protocol dialog.
> 
Thanks for checking.  Maybe I forgot or didn't realize that this was taken care
of...
>
https://codereview.chromium.org/2540483002/diff/1/content/browser/download/mh...
> File content/browser/download/mhtml_generation_browsertest.cc (right):
> 
>
https://codereview.chromium.org/2540483002/diff/1/content/browser/download/mh...
> content/browser/download/mhtml_generation_browsertest.cc:254: :
> BrowserMessageFilter(FrameMsgStart), web_contents_(web_contents) {}
> On 2016/11/28 22:15:37, Łukasz Anforowicz wrote:
> > nit: DCHECK(web_contents_) ?
> 
> Done.
> 
>
https://codereview.chromium.org/2540483002/diff/1/content/browser/download/mh...
> content/browser/download/mhtml_generation_browsertest.cc:260: bool
> OnMessageReceived(const IPC::Message& message) override {
> On 2016/11/28 22:15:37, Łukasz Anforowicz wrote:
> > nit: DCHECK_CURRENTLY_ON(BrowserThread::IO) ?
> 
> Done.
> 
>
https://codereview.chromium.org/2540483002/diff/1/content/browser/download/mh...
> content/browser/download/mhtml_generation_browsertest.cc:276: void
> RemoveAllChildFrames() {
> On 2016/11/28 22:15:37, Łukasz Anforowicz wrote:
> > nit: DCHECK_CURRENTLY_ON(BrowserThread::UI) ?
> > 
> > (this is not very important, but I think it does help with readability -
with
> > the DCHECK it is immediately obvious which method is called on which thread
> > [without the DCHECK it is also pretty straightforward but requires extra
> > thinking]).
> 
> Agreed and done. Also added these checks to the other filter class.
> 
>
https://codereview.chromium.org/2540483002/diff/1/content/browser/download/mh...
> content/browser/download/mhtml_generation_browsertest.cc:284:
> root->RemoveChild(root->child_at(root->child_count() - 1));
> On 2016/11/28 22:15:37, Łukasz Anforowicz wrote:
> > Is it okay to instead call root->RemoveChild(0)?  This would simplify the
code
> a
> > little bit, while also removing all child frames (although in a different,
> > reversed order).  If the order of removal is important, then can you please
> add
> > a comment saying this + explaining why this is important?
> 
> This was just my ODD of always removing vector elements from its end. :)
> 
> But done anyway (with just a tiny bit of pain).
> 
>
https://codereview.chromium.org/2540483002/diff/1/content/browser/download/mh...
> content/browser/download/mhtml_generation_browsertest.cc:297:
> NavigateToURL(shell(),
> On 2016/11/28 22:15:37, Łukasz Anforowicz wrote:
> > nit: ASSERT_TRUE(NavigateToURL(...)
> > 
> > (this will also make alexmos@ happy I think - see
> >
>
https://chromium.googlesource.com/chromium/src/+/4df0d565e3cd9258f0a5fc4517bf...
> > :-)
> 
> Done for all calls of this function.
> 
>
https://codereview.chromium.org/2540483002/diff/1/content/browser/download/mh...
> content/browser/download/mhtml_generation_browsertest.cc:329:
> NavigateToURL(shell(),
> On 2016/11/28 22:15:37, Łukasz Anforowicz wrote:
> > nit: ASSERT_TRUE(NavigateToURL(...)
> 
> Done.

https://codereview.chromium.org/2540483002/diff/1/content/browser/download/mh...
File content/browser/download/mhtml_generation_browsertest.cc (right):

https://codereview.chromium.org/2540483002/diff/1/content/browser/download/mh...
content/browser/download/mhtml_generation_browsertest.cc:284:
root->RemoveChild(root->child_at(root->child_count() - 1));
On 2016/11/29 02:11:20, carlosk wrote:
> On 2016/11/28 22:15:37, Łukasz Anforowicz wrote:
> > Is it okay to instead call root->RemoveChild(0)?  This would simplify the
code
> a
> > little bit, while also removing all child frames (although in a different,
> > reversed order).  If the order of removal is important, then can you please
> add
> > a comment saying this + explaining why this is important?
> 
> This was just my ODD of always removing vector elements from its end. :)
> 
> But done anyway (with just a tiny bit of pain).

Thanks.

I see now that the simpler code might be O(N^2) because it needs to shift the
remaining elements with each removal.  Hmmm... maybe this is fine (test code
with only a few elements) but I think I now understand why you initially did
this in a different way.

Powered by Google App Engine
This is Rietveld 408576698