Chromium Code Reviews
Side by Side Diff: Source/web/WebPageSerializerImpl.cpp

Issue 68613003: Merges the two different page serializers (Closed) Base URL: https://chromium.googlesource.com/chromium/blink.git@master
Patch Set: Remove newline after XML decl Created 7 years ago
(Empty)
1 /*
2 * Copyright (C) 2009 Google Inc. All rights reserved.
3 *
4 * Redistribution and use in source and binary forms, with or without
5 * modification, are permitted provided that the following conditions are
6 * met:
7 *
8 * * Redistributions of source code must retain the above copyright
9 * notice, this list of conditions and the following disclaimer.
10 * * Redistributions in binary form must reproduce the above
11 * copyright notice, this list of conditions and the following disclaimer
12 * in the documentation and/or other materials provided with the
13 * distribution.
14 * * Neither the name of Google Inc. nor the names of its
15 * contributors may be used to endorse or promote products derived from
16 * this software without specific prior written permission.
17 *
18 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
19 * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
20 * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
21 * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
22 * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
23 * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
24 * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
25 * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
26 * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
27 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
28 * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
29 */
30
31 // How to handle the base tag better.
32 // Current status:
33 // At present, the normal way we handle the base tag is:
34 // a) Links that have corresponding locally saved files, such as savable
35 // CSS and JavaScript files, are rewritten as relative URLs that point to
36 // the locally saved file. These links cannot be resolved as absolute
37 // file URLs because, if they were, moving the saved files from one
38 // directory to another would turn those file URLs into
39 // dead links.
40 // b) Links that have no corresponding locally saved files, such as links
41 // in A and AREA tags, are resolved as absolute URLs.
42 // c) We comment out all base tags when serializing the DOM for the page.
43 // Firefox also handles the base tag this way.
44 //
45 // Problem:
46 // This approach cannot handle the following situation:
47 // the base tag is written by JavaScript.
48 // For example, the page "www.yahoo.com" uses
49 // "document.write('<base href="http://www.yahoo.com/"...');" to set the base
50 // URL of the page while it loads. Suppose we save "www.yahoo.com" as complete
51 // HTML to "c:\yahoo.htm". When we later load the saved page, the JavaScript
52 // inserts a base tag
53 // <base href="http://www.yahoo.com/"...> into the DOM, so all URLs that point
54 // to locally saved resource files are resolved as
55 // "http://www.yahoo.com/yahoo_files/...", which prevents the saved resource
56 // files from loading correctly. The page also renders badly because the
57 // saved sub-resource files (such as CSS and JavaScript files) and sub-frame
58 // files cannot be fetched.
59 // Firefox, IE and WebKit-based browsers currently all have this problem.
60 //
61 // Solution:
62 // The solution is to comment out the old base tag and write a new base tag,
63 // <base href="." ...>, after the commented-out one. WebKit always uses the
64 // latest "href" attribute of a base tag to set the document's base
65 // URL. Based on this behavior, when we encounter a base tag, we comment it out
66 // and write a new base tag <base href="."> after the commented-out base tag.
67 // The newly added base tag helps the engine locate the correct base URL for
68 // loading locally saved resource files. We should also inherit the base
69 // target value from the document object when appending the new base tag.
70 // If the original document has multiple base tags, we comment out every old
71 // base tag and append a new base tag after each one, because we do not know
72 // whether the old base tags are original content or were added by JavaScript.
73 // If they were added by JavaScript, the script(s) will still insert base
74 // tag(s) into the DOM when the saved page loads, so the newly added base
75 // tag(s) override the incorrect base URL and make sure we always load the
76 // correct locally saved resource files.
77
78 #include "config.h"
79 #include "WebPageSerializerImpl.h"
80
81 #include "DOMUtilitiesPrivate.h"
82 #include "HTMLNames.h"
83 #include "WebFrameImpl.h"
84 #include "core/dom/Document.h"
85 #include "core/dom/DocumentType.h"
86 #include "core/dom/Element.h"
87 #include "core/editing/markup.h"
88 #include "core/html/HTMLAllCollection.h"
89 #include "core/html/HTMLElement.h"
90 #include "core/html/HTMLFormElement.h"
91 #include "core/html/HTMLHtmlElement.h"
92 #include "core/html/HTMLMetaElement.h"
93 #include "core/loader/DocumentLoader.h"
94 #include "core/loader/FrameLoader.h"
95 #include "public/platform/WebVector.h"
96 #include "wtf/text/TextEncoding.h"
97
98 using namespace WebCore;
99
100 namespace blink {
101
102 // Maximum length of the data buffer used to temporarily hold generated
103 // HTML content. This is a soft limit, which may be exceeded if a very large
104 // contiguous string is found in the page.
105 static const unsigned dataBufferCapacity = 65536;
106
107 WebPageSerializerImpl::SerializeDomParam::SerializeDomParam(const KURL& url,
108 const WTF::TextEncoding& textEncoding,
109 Document* document,
110 const String& directoryName)
111 : url(url)
112 , textEncoding(textEncoding)
113 , document(document)
114 , directoryName(directoryName)
115 , isHTMLDocument(document->isHTMLDocument())
116 , haveSeenDocType(false)
117 , haveAddedCharsetDeclaration(false)
118 , skipMetaElement(0)
119 , isInScriptOrStyleTag(false)
120 , haveAddedXMLProcessingDirective(false)
121 , haveAddedContentsBeforeEnd(false)
122 {
123 }
124
125 String WebPageSerializerImpl::preActionBeforeSerializeOpenTag(
126 const Element* element, SerializeDomParam* param, bool* needSkip)
127 {
128 StringBuilder result;
129
130 *needSkip = false;
131 if (param->isHTMLDocument) {
132 // Skip the open tag of the original META tag that declares a charset, since
133 // we have already written a META with the correct charset declaration after
134 // serializing the open tag of the HEAD element.
135 if (element->hasTagName(HTMLNames::metaTag)) {
136 const HTMLMetaElement* meta = toHTMLMetaElement(element);
137 // Check whether the META tag has declared charset or not.
138 String equiv = meta->httpEquiv();
139 if (equalIgnoringCase(equiv, "content-type")) {
140 String content = meta->content();
141 if (content.length() && content.contains("charset", false)) {
142 // Found a META tag that declares a charset; skip it when
143 // serializing the DOM.
144 param->skipMetaElement = element;
145 *needSkip = true;
146 }
147 }
148 } else if (isHTMLHtmlElement(element)) {
149 // Do some work before processing the open tag of the HTML element.
150 // First we add doc type declaration if original document has it.
151 if (!param->haveSeenDocType) {
152 param->haveSeenDocType = true;
153 result.append(createMarkup(param->document->doctype()));
154 }
155
156 // Add MOTW declaration before html tag.
157 // See http://msdn2.microsoft.com/en-us/library/ms537628(VS.85).aspx .
158 result.append(WebPageSerializer::generateMarkOfTheWebDeclaration(param->url));
159 } else if (element->hasTagName(HTMLNames::baseTag)) {
160 // Comment the BASE tag when serializing dom.
161 result.append("<!--");
162 }
163 } else {
164 // Write XML declaration.
165 if (!param->haveAddedXMLProcessingDirective) {
166 param->haveAddedXMLProcessingDirective = true;
167 // Get encoding info.
168 String xmlEncoding = param->document->xmlEncoding();
169 if (xmlEncoding.isEmpty())
170 xmlEncoding = param->document->encodingName();
171 if (xmlEncoding.isEmpty())
172 xmlEncoding = UTF8Encoding().name();
173 result.append("<?xml version=\"");
174 result.append(param->document->xmlVersion());
175 result.append("\" encoding=\"");
176 result.append(xmlEncoding);
177 if (param->document->xmlStandalone())
178 result.append("\" standalone=\"yes");
179 result.append("\"?>\n");
180 }
181 // Add doc type declaration if original document has it.
182 if (!param->haveSeenDocType) {
183 param->haveSeenDocType = true;
184 result.append(createMarkup(param->document->doctype()));
185 }
186 }
187 return result.toString();
188 }
189
190 String WebPageSerializerImpl::postActionAfterSerializeOpenTag(
191 const Element* element, SerializeDomParam* param)
192 {
193 StringBuilder result;
194
195 param->haveAddedContentsBeforeEnd = false;
196 if (!param->isHTMLDocument)
197 return result.toString();
198 // Check after processing the open tag of HEAD element
199 if (!param->haveAddedCharsetDeclaration
200 && element->hasTagName(HTMLNames::headTag)) {
201 param->haveAddedCharsetDeclaration = true;
202 // Check meta element. WebKit only pre-parses the first 512 bytes
203 // of the document. If the whole <HEAD> is larger and the meta is at the
204 // end of the head, the page is not decoded correctly
205 // because of this issue. So when we serialize the DOM, we need to
206 // make sure the meta is the first child of the head tag.
207 // See http://bugs.webkit.org/show_bug.cgi?id=16621.
208 // First we generate new content for writing correct META element.
209 result.append(WebPageSerializer::generateMetaCharsetDeclaration(
210 String(param->textEncoding.name())));
211
212 param->haveAddedContentsBeforeEnd = true;
213 // Each META that declares a charset is found and skipped
214 // in preActionBeforeSerializeOpenTag.
215 } else if (element->hasTagName(HTMLNames::scriptTag)
216 || element->hasTagName(HTMLNames::styleTag)) {
217 param->isInScriptOrStyleTag = true;
218 }
219
220 return result.toString();
221 }
222
223 String WebPageSerializerImpl::preActionBeforeSerializeEndTag(
224 const Element* element, SerializeDomParam* param, bool* needSkip)
225 {
226 String result;
227
228 *needSkip = false;
229 if (!param->isHTMLDocument)
230 return result;
231 // Skip the end tag of the original META tag that declares a charset.
232 // There is no need to check whether it is a META tag, since we guarantee
233 // that skipMetaElement is a META tag if it is not 0.
234 if (param->skipMetaElement == element)
235 *needSkip = true;
236 else if (element->hasTagName(HTMLNames::scriptTag)
237 || element->hasTagName(HTMLNames::styleTag)) {
238 ASSERT(param->isInScriptOrStyleTag);
239 param->isInScriptOrStyleTag = false;
240 }
241
242 return result;
243 }
244
245 // After we finish serializing the end tag of an element, we give the target
246 // element a chance to do some post work to add additional data.
247 String WebPageSerializerImpl::postActionAfterSerializeEndTag(
248 const Element* element, SerializeDomParam* param)
249 {
250 StringBuilder result;
251
252 if (!param->isHTMLDocument)
253 return result.toString();
254 // Comment the BASE tag when serializing DOM.
255 if (element->hasTagName(HTMLNames::baseTag)) {
256 result.append("-->");
257 // Append a new base tag declaration.
258 result.append(WebPageSerializer::generateBaseTagDeclaration(
259 param->document->baseTarget()));
260 }
261
262 return result.toString();
263 }
264
265 void WebPageSerializerImpl::saveHTMLContentToBuffer(
266 const String& result, SerializeDomParam* param)
267 {
268 m_dataBuffer.append(result);
269 encodeAndFlushBuffer(WebPageSerializerClient::CurrentFrameIsNotFinished,
270 param,
271 DoNotForceFlush);
272 }
273
274 void WebPageSerializerImpl::encodeAndFlushBuffer(
275 WebPageSerializerClient::PageSerializationStatus status,
276 SerializeDomParam* param,
277 FlushOption flushOption)
278 {
279 // The data buffer is not over capacity and no forced flush was requested.
280 if (flushOption != ForceFlush && m_dataBuffer.length() <= dataBufferCapacity)
281 return;
282
283 String content = m_dataBuffer.toString();
284 m_dataBuffer.clear();
285
286 CString encodedContent = param->textEncoding.normalizeAndEncode(content, WTF::EntitiesForUnencodables);
287
288 // Send result to the client.
289 m_client->didSerializeDataForFrame(param->url,
290 WebCString(encodedContent.data(), encodedContent.length()),
291 status);
292 }
293
294 void WebPageSerializerImpl::openTagToString(Element* element,
295 SerializeDomParam* param)
296 {
297 bool needSkip;
298 StringBuilder result;
299 // Do pre action for open tag.
300 result.append(preActionBeforeSerializeOpenTag(element, param, &needSkip));
301 if (needSkip)
302 return;
303 // Add open tag
304 result.append('<');
305 result.append(element->nodeName().lower());
306 // Go through all attributes and serialize them.
307 if (element->hasAttributes()) {
308 unsigned numAttrs = element->attributeCount();
309 for (unsigned i = 0; i < numAttrs; i++) {
310 result.append(' ');
311 // Add attribute pair
312 const Attribute *attribute = element->attributeItem(i);
313 result.append(attribute->name().toString());
314 result.appendLiteral("=\"");
315 if (!attribute->value().isEmpty()) {
316 const String& attrValue = attribute->value();
317
318 // Check whether we need to replace some resource links
319 // with local resource paths.
320 const QualifiedName& attrName = attribute->name();
321 if (elementHasLegalLinkAttribute(element, attrName)) {
322 // Links that start with "javascript:" are left unchanged.
323 if (attrValue.startsWith("javascript:", false))
324 result.append(attrValue);
325 else {
326 // Get the absolute link
327 WebFrameImpl* subFrame = WebFrameImpl::fromFrameOwnerElement(element);
328 String completeURL = subFrame ? subFrame->frame()->document()->url() :
329 param->document->completeURL(attrValue);
330 // Check whether we have a local file for this link.
331 if (m_localLinks.contains(completeURL)) {
332 if (!param->directoryName.isEmpty()) {
333 result.appendLiteral("./");
334 result.append(param->directoryName);
335 result.append('/');
336 }
337 result.append(m_localLinks.get(completeURL));
338 } else
339 result.append(completeURL);
340 }
341 } else {
342 if (param->isHTMLDocument)
343 result.append(m_htmlEntities.convertEntitiesInString(attrValue));
344 else
345 result.append(m_xmlEntities.convertEntitiesInString(attrValue));
346 }
347 }
348 result.append('\"');
349 }
350 }
351
352 // Do post action for open tag.
353 String addedContents = postActionAfterSerializeOpenTag(element, param);
354 // Complete the open tag for element when it has child/children.
355 if (element->hasChildNodes() || param->haveAddedContentsBeforeEnd)
356 result.append('>');
357 // Append the added contents generated in the post action of the open tag.
358 result.append(addedContents);
359 // Save the result to data buffer.
360 saveHTMLContentToBuffer(result.toString(), param);
361 }
362
363 // Serialize the end tag of a specified element.
364 void WebPageSerializerImpl::endTagToString(Element* element,
365 SerializeDomParam* param)
366 {
367 bool needSkip;
368 StringBuilder result;
369 // Do pre action for end tag.
370 result.append(preActionBeforeSerializeEndTag(element, param, &needSkip));
371 if (needSkip)
372 return;
373 // Write end tag when element has child/children.
374 if (element->hasChildNodes() || param->haveAddedContentsBeforeEnd) {
375 result.appendLiteral("</");
376 result.append(element->nodeName().lower());
377 result.append('>');
378 } else {
379 // Check whether we have to write end tag for empty element.
380 if (param->isHTMLDocument) {
381 result.append('>');
382 // FIXME: This code is horribly wrong. WebPageSerializerImpl must die.
383 if (!element->isHTMLElement() || !toHTMLElement(element)->ieForbidsInsertHTML()) {
384 // We need to write end tag when it is required.
385 result.appendLiteral("</");
386 result.append(element->nodeName().lower());
387 result.append('>');
388 }
389 } else {
390 // For XML-based documents.
391 result.appendLiteral(" />");
392 }
393 }
394 // Do post action for end tag.
395 result.append(postActionAfterSerializeEndTag(element, param));
396 // Save the result to data buffer.
397 saveHTMLContentToBuffer(result.toString(), param);
398 }
399
400 void WebPageSerializerImpl::buildContentForNode(Node* node,
401 SerializeDomParam* param)
402 {
403 switch (node->nodeType()) {
404 case Node::ELEMENT_NODE:
405 // Process open tag of element.
406 openTagToString(toElement(node), param);
407 // Walk through the child nodes and process each one.
408 for (Node *child = node->firstChild(); child; child = child->nextSibling())
409 buildContentForNode(child, param);
410 // Process end tag of element.
411 endTagToString(toElement(node), param);
412 break;
413 case Node::TEXT_NODE:
414 saveHTMLContentToBuffer(createMarkup(node), param);
415 break;
416 case Node::ATTRIBUTE_NODE:
417 case Node::DOCUMENT_NODE:
418 case Node::DOCUMENT_FRAGMENT_NODE:
419 // Should not exist.
420 ASSERT_NOT_REACHED();
421 break;
422 // Document type node can be in DOM?
423 case Node::DOCUMENT_TYPE_NODE:
424 param->haveSeenDocType = true; // Falls through to the default case.
425 default:
426 // For other type node, call default action.
427 saveHTMLContentToBuffer(createMarkup(node), param);
428 break;
429 }
430 }
431
432 WebPageSerializerImpl::WebPageSerializerImpl(WebFrame* frame,
433 bool recursiveSerialization,
434 WebPageSerializerClient* client,
435 const WebVector<WebURL>& links,
436 const WebVector<WebString>& localPaths,
437 const WebString& localDirectoryName)
438 : m_client(client)
439 , m_recursiveSerialization(recursiveSerialization)
440 , m_framesCollected(false)
441 , m_localDirectoryName(localDirectoryName)
442 , m_htmlEntities(false)
443 , m_xmlEntities(true)
444 {
445 // Must specify available webframe.
446 ASSERT(frame);
447 m_specifiedWebFrameImpl = toWebFrameImpl(frame);
448 // Make sure we have non 0 client.
449 ASSERT(client);
450 // Build local resources map.
451 ASSERT(links.size() == localPaths.size());
452 for (size_t i = 0; i < links.size(); i++) {
453 KURL url = links[i];
454 ASSERT(!m_localLinks.contains(url.string()));
455 m_localLinks.set(url.string(), localPaths[i]);
456 }
457
458 ASSERT(m_dataBuffer.isEmpty());
459 }
460
461 void WebPageSerializerImpl::collectTargetFrames()
462 {
463 ASSERT(!m_framesCollected);
464 m_framesCollected = true;
465
466 // First, process main frame.
467 m_frames.append(m_specifiedWebFrameImpl);
468 // Return now if user only needs to serialize specified frame, not including
469 // all sub-frames.
470 if (!m_recursiveSerialization)
471 return;
472 // Collect all frames inside the specified frame.
473 for (int i = 0; i < static_cast<int>(m_frames.size()); ++i) {
474 WebFrameImpl* currentFrame = m_frames[i];
475 // Get current using document.
476 Document* currentDoc = currentFrame->frame()->document();
477 // Go through sub-frames.
478 RefPtr<HTMLCollection> all = currentDoc->all();
479
480 for (unsigned i = 0; Node* node = all->item(i); i++) {
481 if (!node->isHTMLElement())
482 continue;
483 Element* element = toElement(node);
484 WebFrameImpl* webFrame =
485 WebFrameImpl::fromFrameOwnerElement(element);
486 if (webFrame)
487 m_frames.append(webFrame);
488 }
489 }
490 }
491
492 bool WebPageSerializerImpl::serialize()
493 {
494 if (!m_framesCollected)
495 collectTargetFrames();
496
497 bool didSerialization = false;
498 KURL mainURL = m_specifiedWebFrameImpl->frame()->document()->url();
499
500 for (unsigned i = 0; i < m_frames.size(); ++i) {
501 WebFrameImpl* webFrame = m_frames[i];
502 Document* document = webFrame->frame()->document();
503 const KURL& url = document->url();
504
505 if (!url.isValid() || !m_localLinks.contains(url.string()))
506 continue;
507
508 didSerialization = true;
509
510 const WTF::TextEncoding& textEncoding = document->encoding().isValid() ? document->encoding() : UTF8Encoding();
511 String directoryName = url == mainURL ? m_localDirectoryName : "";
512
513 SerializeDomParam param(url, textEncoding, document, directoryName);
514
515 Element* documentElement = document->documentElement();
516 if (documentElement)
517 buildContentForNode(documentElement, &param);
518
519 encodeAndFlushBuffer(WebPageSerializerClient::CurrentFrameIsFinished, &param, ForceFlush);
520 }
521
522 ASSERT(m_dataBuffer.isEmpty());
523 m_client->didSerializeDataForFrame(KURL(), WebCString("", 0), WebPageSerializerClient::AllFramesAreFinished);
524 return didSerialization;
525 }
526
527 } // namespace blink
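
Illustrative sketch (not part of the patch): the header comment of this file describes commenting out each base tag and writing a new <base href="."> after it. The standalone snippet below mimics that idea on a plain string so the resulting markup is easy to see. It is only a rough approximation under stated assumptions: the real serializer walks the DOM rather than scanning text, the helper names findBaseTag and commentOutBaseTags are hypothetical, and the exact text produced by generateBaseTagDeclaration is not shown in this file, so the replacement format used here is an assumption.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the index of the next "<base" tag at or after
// `from` (case-insensitive), or std::string::npos if there is none. The
// character after "<base" must be whitespace, '/' or '>', so that tags such
// as <basefont> are not matched.
std::size_t findBaseTag(const std::string& html, std::size_t from)
{
    static const std::string needle = "<base";
    for (std::size_t i = from; i + needle.size() < html.size(); ++i) {
        bool match = true;
        for (std::size_t j = 0; j < needle.size(); ++j) {
            if (std::tolower(static_cast<unsigned char>(html[i + j])) != needle[j]) {
                match = false;
                break;
            }
        }
        if (!match)
            continue;
        const char next = html[i + needle.size()];
        if (next == '>' || next == '/' || std::isspace(static_cast<unsigned char>(next)))
            return i;
    }
    return std::string::npos;
}

// Hypothetical helper: wraps every base tag in an HTML comment and appends a
// new base tag pointing at "." right after it. Assumes each base tag is
// closed by '>' and that no attribute value contains '>'.
std::string commentOutBaseTags(const std::string& html, const std::string& baseTarget)
{
    // Assumed replacement format; the target is inherited when non-empty.
    const std::string replacement = baseTarget.empty()
        ? "<base href=\".\">"
        : "<base href=\".\" target=\"" + baseTarget + "\">";

    std::string result;
    std::size_t pos = 0;
    while (pos < html.size()) {
        const std::size_t start = findBaseTag(html, pos);
        const std::size_t end = start == std::string::npos
            ? std::string::npos
            : html.find('>', start);
        if (end == std::string::npos) {
            // No further complete base tag: copy the rest unchanged.
            result.append(html, pos, std::string::npos);
            break;
        }
        result.append(html, pos, start - pos);      // text before the tag
        result += "<!--";
        result.append(html, start, end - start + 1); // the original tag
        result += "-->";
        result += replacement;                       // the overriding tag
        pos = end + 1;
    }
    return result;
}
```

Under these assumptions, calling commentOutBaseTags on <head><base href="http://www.yahoo.com/"></head> with an empty target yields <head><!--<base href="http://www.yahoo.com/">--><base href="."></head>. A base tag later inserted by script would still need a following <base href="."> to be overridden, which is why the serializer appends one after every base tag it encounters.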