|
|
Created:
4 years ago by carlosk Modified:
3 years, 8 months ago CC:
chromium-reviews, asanka, romax+watch_chromium.org, fgorski+watch_chromium.org, dewittj+watch_chromium.org, jam, petewil+watch_chromium.org, chili+watch_chromium.org, darin-cc_chromium.org, dimich+watch_chromium.org Target Ref:
refs/pending/heads/master Project:
chromium Visibility:
Public. |
DescriptionAllow Offline MHTML save operations to succeed even with missing frames.
This change adds a new configuration parameter to MHTMLGenerationParams
to allow for ignoring missing frames when saving a page to MHTML. This
makes the saving process more robust at the cost of possible loss of
page consistency/content.
For the case of Offline usage, having *a* page saved is much more useful
than having *none*. Some sites have pages with very dynamic DOMs, frames
are coming and going frequently, causing a high save failure rate. This
change should improve these situations.
New tests were also added to test both the previously untested case of
failing when a frame disappears and the newly introduced case of ignoring
the same situation.
BUG=655708, 655723
Patch Set 1 #
Total comments: 13
Patch Set 2 : Address code review comments. #
Depends on Patchset: Messages
Total messages: 11 (7 generated)
The CQ bit was checked by carlosk@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
carlosk@chromium.org changed reviewers: + fgorski@chromium.org, lukasza@chromium.org, nasko@chromium.org
fgorski@: PTAL at offline_pages files. nasko@, lukasza@: PTAL and MHTML generation and test files. Note this patch is based off of https://codereview.chromium.org/2519273002/ which is still struggling to land due to build issues. :/
I've provided feedback on some small nits, but a bigger question I had was: how the generated mhtml file will behave/render in the fallback case? 1) Is it worth adding the mhtml from the fallback (removed frames) case to third_party/WebKit/LayoutTests/mhtml ? 2) does the mhtml generated in the fallback case render properly in IE and Edge? 3) what does Chromium do when opening an mhtml file with some missing parts? will any network requests be made (to fetch http://example.com/foo if there is no corresponding mhtml part)? what if the missing part was referred to using a content-id URI - will the dialog pointed out in https://crbug.com/570353 show up (I vaguely [and possibly incorrectly] recall that maybe the dialog behavior is different on Android vs on desktop platforms...). Maybe the questions above shouldn't affect the CL under review, because the dangling URIs are a known, preexisting issue (https://crbug.com/570353). At the same time, the CL under review does seem to make the dangling URI case more common (not just dangling-CID case, but also a dangling-HTTP case). https://codereview.chromium.org/2540483002/diff/1/content/browser/download/mh... File content/browser/download/mhtml_generation_browsertest.cc (right): https://codereview.chromium.org/2540483002/diff/1/content/browser/download/mh... content/browser/download/mhtml_generation_browsertest.cc:254: : BrowserMessageFilter(FrameMsgStart), web_contents_(web_contents) {} nit: DCHECK(web_contents_) ? https://codereview.chromium.org/2540483002/diff/1/content/browser/download/mh... content/browser/download/mhtml_generation_browsertest.cc:260: bool OnMessageReceived(const IPC::Message& message) override { nit: DCHECK_CURRENTLY_ON(BrowserThread::IO) ? https://codereview.chromium.org/2540483002/diff/1/content/browser/download/mh... content/browser/download/mhtml_generation_browsertest.cc:276: void RemoveAllChildFrames() { nit: DCHECK_CURRENTLY_ON(BrowserThread::UI) ? (this is not very important, but I think it does help with readability - with the DCHECK it is immediately obvious which method is called on which thread [without the DCHECK it is also pretty straightforward but requires extra thinking]). https://codereview.chromium.org/2540483002/diff/1/content/browser/download/mh... content/browser/download/mhtml_generation_browsertest.cc:284: root->RemoveChild(root->child_at(root->child_count() - 1)); Is it okay to instead call root->RemoveChild(0)? This would simplify the code a little bit, while also removing all child frames (although in a different, reversed order). If the order of removal is important, then can you please add a comment saying this + explaining why this is important? https://codereview.chromium.org/2540483002/diff/1/content/browser/download/mh... content/browser/download/mhtml_generation_browsertest.cc:297: NavigateToURL(shell(), nit: ASSERT_TRUE(NavigateToURL(...) (this will also make alexmos@ happy I think - see https://chromium.googlesource.com/chromium/src/+/4df0d565e3cd9258f0a5fc4517bf... :-) https://codereview.chromium.org/2540483002/diff/1/content/browser/download/mh... content/browser/download/mhtml_generation_browsertest.cc:329: NavigateToURL(shell(), nit: ASSERT_TRUE(NavigateToURL(...)
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: win_chromium_rel_ng on master.tryserver.chromium.win (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_rel_...)
To make sure this is clear: I am not changing the default behavior of MHTML generation. Only the users/callers that actively set the new bool parameter to ignore missing frames will get different results. Only Offline pages do that with this patch. In the Offline context we were seeing frequent failures on some specific websites, like TechCrunch for instance (see https://crbug.com/655723). Once I had the results of the failure reason from the patch this change is based off I confirmed what was our suspicion that iframes disappearing during the load were causing the failures. These are less important iframes (ads, general notifications, etc) that should not cause the whole saving operation to fail. On 2016/11/28 22:15:37, Łukasz Anforowicz wrote: > I've provided feedback on some small nits, but a bigger question I had was: how > the generated mhtml file will behave/render in the fallback case? With Offline we already save pages midway through a load and many times these pages are acceptable enough. We still don't have a systematic test to verify the "acceptability" of a semi-loaded saved page -- a not simple problem to solve -- but we're working on it. Also keep in mind that our focus users are from EM markets where terrible connectivity is the norm. They are already used to interacting with half-loaded pages and having something useful enough for offline usage is better than having nothing. > 1) Is it worth adding the mhtml from the fallback (removed frames) case to > third_party/WebKit/LayoutTests/mhtml ? You mean storing the generated MHTML with missing iframes to verify it loads as expected, right? While trying to do this I noticed a big problem of this change which I hadn't realized before: when the last frame we will not write the footer and then have a completely invalid file. Currently header and footer are being written by the renderer and that is a problem in this case. While thinking about optimizing MHTML generation I had thought about moving these two parts to the browser and this problem provides another reason to do so. I will have to investigate it further and will probably need another patch to land before this one. What are your thoughts on this? > 2) does the mhtml generated in the fallback case render properly in IE and Edge? The best answer I can give right now is that we don't care. Seriously! :) We are currently only using the Offline generated files internally to Chrome/Offline (saved files are stored in Chrome private folder on Android). Once we get to implement the sharing of offlined pages we will have to worry about that. > 3) what does Chromium do when opening an mhtml file with some missing parts? > will any network requests be made (to fetch http://example.com/foo if there is > no corresponding mhtml part)? what if the missing part was referred to using a > content-id URI - will the dialog pointed out in https://crbug.com/570353 show up > (I vaguely [and possibly incorrectly] recall that maybe the dialog behavior is > different on Android vs on desktop platforms...). > > Maybe the questions above shouldn't affect the CL under review, because the > dangling URIs are a known, preexisting issue (https://crbug.com/570353). At the > same time, the CL under review does seem to make the dangling URI case more > common (not just dangling-CID case, but also a dangling-HTTP case). I did not personally test the quality of half-loaded saved pages but as I mentioned they are being looked at and something useful is frequently the case. I was told by my team that no network requests are made when opening an MHTML file for security reasons. I wasn't aware of that issue. Indeed this change will make MTHML files with dangling cid: URIs more common. But I'd say it should be solved as a separate issue. Also as a data point: I tried opening the attached mhtml file and I didn't get an external protocol dialog. https://codereview.chromium.org/2540483002/diff/1/content/browser/download/mh... File content/browser/download/mhtml_generation_browsertest.cc (right): https://codereview.chromium.org/2540483002/diff/1/content/browser/download/mh... content/browser/download/mhtml_generation_browsertest.cc:254: : BrowserMessageFilter(FrameMsgStart), web_contents_(web_contents) {} On 2016/11/28 22:15:37, Łukasz Anforowicz wrote: > nit: DCHECK(web_contents_) ? Done. https://codereview.chromium.org/2540483002/diff/1/content/browser/download/mh... content/browser/download/mhtml_generation_browsertest.cc:260: bool OnMessageReceived(const IPC::Message& message) override { On 2016/11/28 22:15:37, Łukasz Anforowicz wrote: > nit: DCHECK_CURRENTLY_ON(BrowserThread::IO) ? Done. https://codereview.chromium.org/2540483002/diff/1/content/browser/download/mh... content/browser/download/mhtml_generation_browsertest.cc:276: void RemoveAllChildFrames() { On 2016/11/28 22:15:37, Łukasz Anforowicz wrote: > nit: DCHECK_CURRENTLY_ON(BrowserThread::UI) ? > > (this is not very important, but I think it does help with readability - with > the DCHECK it is immediately obvious which method is called on which thread > [without the DCHECK it is also pretty straightforward but requires extra > thinking]). Agreed and done. Also added these checks to the other filter class. https://codereview.chromium.org/2540483002/diff/1/content/browser/download/mh... content/browser/download/mhtml_generation_browsertest.cc:284: root->RemoveChild(root->child_at(root->child_count() - 1)); On 2016/11/28 22:15:37, Łukasz Anforowicz wrote: > Is it okay to instead call root->RemoveChild(0)? This would simplify the code a > little bit, while also removing all child frames (although in a different, > reversed order). If the order of removal is important, then can you please add > a comment saying this + explaining why this is important? This was just my ODD of always removing vector elements from its end. :) But done anyway (with just a tiny bit of pain). https://codereview.chromium.org/2540483002/diff/1/content/browser/download/mh... content/browser/download/mhtml_generation_browsertest.cc:297: NavigateToURL(shell(), On 2016/11/28 22:15:37, Łukasz Anforowicz wrote: > nit: ASSERT_TRUE(NavigateToURL(...) > > (this will also make alexmos@ happy I think - see > https://chromium.googlesource.com/chromium/src/+/4df0d565e3cd9258f0a5fc4517bf... > :-) Done for all calls of this function. https://codereview.chromium.org/2540483002/diff/1/content/browser/download/mh... content/browser/download/mhtml_generation_browsertest.cc:329: NavigateToURL(shell(), On 2016/11/28 22:15:37, Łukasz Anforowicz wrote: > nit: ASSERT_TRUE(NavigateToURL(...) Done.
On 2016/11/29 02:11:20, carlosk wrote: > To make sure this is clear: I am not changing the default behavior of MHTML > generation. Only the users/callers that actively set the new bool parameter to > ignore missing frames will get different results. Only Offline pages do that > with this patch. Are we able to distinguish in the UMA data, whether MHTML generation was triggered by the user VS by the offline pages feature? I guess after this CL lands, the remaining failures would have to be triggered by something outside of offline pages. I wonder if these other cases might also benefit from the change in behavior implemented by the CL under review. > In the Offline context we were seeing frequent failures on some specific > websites, like TechCrunch for instance (see https://crbug.com/655723). Once I > had the results of the failure reason from the patch this change is based off I > confirmed what was our suspicion that iframes disappearing during the load were > causing the failures. These are less important iframes (ads, general > notifications, etc) that should not cause the whole saving operation to fail. Out of curiosity - do you know if changing when the save happens might have changed the frequency of the problem? (i.e. maybe frames are more likely to disappear shortly after a load) > > On 2016/11/28 22:15:37, Łukasz Anforowicz wrote: > > I've provided feedback on some small nits, but a bigger question I had was: > how > > the generated mhtml file will behave/render in the fallback case? > > With Offline we already save pages midway through a load and many times these > pages are acceptable enough. We still don't have a systematic test to verify the > "acceptability" of a semi-loaded saved page -- a not simple problem to solve -- > but we're working on it. > > Also keep in mind that our focus users are from EM markets where terrible > connectivity is the norm. They are already used to interacting with half-loaded > pages and having something useful enough for offline usage is better than having > nothing. Ack. > > > 1) Is it worth adding the mhtml from the fallback (removed frames) case to > > third_party/WebKit/LayoutTests/mhtml ? > > You mean storing the generated MHTML with missing iframes to verify it loads as > expected, right? > Yes. This would also be a regression test for making sure the dialog mentioned below doesn't show up (in case of a dangling CID URI). > While trying to do this I noticed a big problem of this change which I hadn't > realized before: when the last frame we will not write the footer and then have > a completely invalid file. Currently header and footer are being written by the > renderer and that is a problem in this case. While thinking about optimizing > MHTML generation I had thought about moving these two parts to the browser and > this problem provides another reason to do so. > > I will have to investigate it further and will probably need another patch to > land before this one. What are your thoughts on this? > Thanks for catching the problem. Some random notes: - Writing to the file in the renderer process is mostly done out of inertia - this is the way it was always done and there was no good reason to change this. - Writing to the file from the browser process sounds good. It sounds especially good for short, bounded snippets of the file (like the final mhtml boundary). It is trickier for unbounded file parts (like mhtml snipets), because of limits on the IPC size. > > 2) does the mhtml generated in the fallback case render properly in IE and > Edge? > > The best answer I can give right now is that we don't care. Seriously! :) Isn't advancing *open* (and interoperable) web platform part of Chromium agenda? :-P > We are currently only using the Offline generated files internally to > Chrome/Offline (saved files are stored in Chrome private folder on Android). > Once we get to implement the sharing of offlined pages we will have to worry > about that. Hmm.... okay... I guess we can take care of this at that point. > > 3) what does Chromium do when opening an mhtml file with some missing parts? > > will any network requests be made (to fetch http://example.com/foo if there is > > no corresponding mhtml part)? what if the missing part was referred to using > a > > content-id URI - will the dialog pointed out in https://crbug.com/570353 show > up > > (I vaguely [and possibly incorrectly] recall that maybe the dialog behavior is > > different on Android vs on desktop platforms...). > > > > Maybe the questions above shouldn't affect the CL under review, because the > > dangling URIs are a known, preexisting issue (https://crbug.com/570353). At > the > > same time, the CL under review does seem to make the dangling URI case more > > common (not just dangling-CID case, but also a dangling-HTTP case). > > I did not personally test the quality of half-loaded saved pages but as I > mentioned they are being looked at and something useful is frequently the case. > > I was told by my team that no network requests are made when opening an MHTML > file for security reasons. Oh, okay. I knew that there were discussions about this, but I didn't know that we already made sure that network requests are blocked in case of MHTML. > > I wasn't aware of that issue. Indeed this change will make MTHML files with > dangling cid: URIs more common. But I'd say it should be solved as a separate > issue. Also as a data point: I tried opening the attached mhtml file and I > didn't get an external protocol dialog. > Thanks for checking. Maybe I forgot or didn't realize that this was taken care of... > https://codereview.chromium.org/2540483002/diff/1/content/browser/download/mh... > File content/browser/download/mhtml_generation_browsertest.cc (right): > > https://codereview.chromium.org/2540483002/diff/1/content/browser/download/mh... > content/browser/download/mhtml_generation_browsertest.cc:254: : > BrowserMessageFilter(FrameMsgStart), web_contents_(web_contents) {} > On 2016/11/28 22:15:37, Łukasz Anforowicz wrote: > > nit: DCHECK(web_contents_) ? > > Done. > > https://codereview.chromium.org/2540483002/diff/1/content/browser/download/mh... > content/browser/download/mhtml_generation_browsertest.cc:260: bool > OnMessageReceived(const IPC::Message& message) override { > On 2016/11/28 22:15:37, Łukasz Anforowicz wrote: > > nit: DCHECK_CURRENTLY_ON(BrowserThread::IO) ? > > Done. > > https://codereview.chromium.org/2540483002/diff/1/content/browser/download/mh... > content/browser/download/mhtml_generation_browsertest.cc:276: void > RemoveAllChildFrames() { > On 2016/11/28 22:15:37, Łukasz Anforowicz wrote: > > nit: DCHECK_CURRENTLY_ON(BrowserThread::UI) ? > > > > (this is not very important, but I think it does help with readability - with > > the DCHECK it is immediately obvious which method is called on which thread > > [without the DCHECK it is also pretty straightforward but requires extra > > thinking]). > > Agreed and done. Also added these checks to the other filter class. > > https://codereview.chromium.org/2540483002/diff/1/content/browser/download/mh... > content/browser/download/mhtml_generation_browsertest.cc:284: > root->RemoveChild(root->child_at(root->child_count() - 1)); > On 2016/11/28 22:15:37, Łukasz Anforowicz wrote: > > Is it okay to instead call root->RemoveChild(0)? This would simplify the code > a > > little bit, while also removing all child frames (although in a different, > > reversed order). If the order of removal is important, then can you please > add > > a comment saying this + explaining why this is important? > > This was just my ODD of always removing vector elements from its end. :) > > But done anyway (with just a tiny bit of pain). > > https://codereview.chromium.org/2540483002/diff/1/content/browser/download/mh... > content/browser/download/mhtml_generation_browsertest.cc:297: > NavigateToURL(shell(), > On 2016/11/28 22:15:37, Łukasz Anforowicz wrote: > > nit: ASSERT_TRUE(NavigateToURL(...) > > > > (this will also make alexmos@ happy I think - see > > > https://chromium.googlesource.com/chromium/src/+/4df0d565e3cd9258f0a5fc4517bf... > > :-) > > Done for all calls of this function. > > https://codereview.chromium.org/2540483002/diff/1/content/browser/download/mh... > content/browser/download/mhtml_generation_browsertest.cc:329: > NavigateToURL(shell(), > On 2016/11/28 22:15:37, Łukasz Anforowicz wrote: > > nit: ASSERT_TRUE(NavigateToURL(...) > > Done. https://codereview.chromium.org/2540483002/diff/1/content/browser/download/mh... File content/browser/download/mhtml_generation_browsertest.cc (right): https://codereview.chromium.org/2540483002/diff/1/content/browser/download/mh... content/browser/download/mhtml_generation_browsertest.cc:284: root->RemoveChild(root->child_at(root->child_count() - 1)); On 2016/11/29 02:11:20, carlosk wrote: > On 2016/11/28 22:15:37, Łukasz Anforowicz wrote: > > Is it okay to instead call root->RemoveChild(0)? This would simplify the code > a > > little bit, while also removing all child frames (although in a different, > > reversed order). If the order of removal is important, then can you please > add > > a comment saying this + explaining why this is important? > > This was just my ODD of always removing vector elements from its end. :) > > But done anyway (with just a tiny bit of pain). Thanks. I see now that the simpler code might be O(N^2) because it needs to shift the remaining elements with each removal. Hmmm... maybe this is fine (test code with only a few elements) but I think I now understand why you initially did this in a different way.
Description was changed from ========== Allow Offline MHTML save operations to succeed even with missing frames. This change adds a new configuration parameter to MHTMLGenerationParams to allow for ignoring missing frames when saving a page to MHTML. This makes the saving process more robust at the cost of possible loss of page consistency/content. For the case of Offline usage, having *a* page saved is much more useful than having *none*. Some sites have pages with very dynamic DOMs, frames are coming and going frequently, causing a high save failure rate. This change should improve these situations. New tests were also added to test both the previously untested case of failing when a frame disappears and the newly introduced case of ignoring the same situation. BUG=655708,655723 ========== to ========== Allow Offline MHTML save operations to succeed even with missing frames. This change adds a new configuration parameter to MHTMLGenerationParams to allow for ignoring missing frames when saving a page to MHTML. This makes the saving process more robust at the cost of possible loss of page consistency/content. For the case of Offline usage, having *a* page saved is much more useful than having *none*. Some sites have pages with very dynamic DOMs, frames are coming and going frequently, causing a high save failure rate. This change should improve these situations. New tests were also added to test both the previously untested case of failing when a frame disappears and the newly introduced case of ignoring the same situation. BUG=655708,655723 ==========
fgorski@chromium.org changed reviewers: - fgorski@chromium.org |