Chromium Code Reviews
chromiumcodereview-hr@appspot.gserviceaccount.com (chromiumcodereview-hr) | Please choose your nickname with Settings | Help | Chromium Project | Gerrit Changes | Sign out
(367)

Unified Diff: Source/web/WebPageSerializerImpl.cpp

Issue 1295063002: Revert "Remove old serializer" It caused a regression, crbug.com/510422 (Closed) Base URL: svn://svn.chromium.org/blink/trunk
Patch Set: Created 5 years, 4 months ago
Use n/p to move between diff chunks; N/P to move between comments. Draft comments are only viewable by you.
Jump to:
View side-by-side diff with in-line comments
Download patch
« no previous file with comments | « Source/web/WebPageSerializerImpl.h ('k') | Source/web/tests/WebPageSerializerTest.cpp » ('j') | no next file with comments »
Expand Comments ('e') | Collapse Comments ('c') | Show Comments Hide Comments ('s')
Index: Source/web/WebPageSerializerImpl.cpp
diff --git a/Source/web/WebPageSerializerImpl.cpp b/Source/web/WebPageSerializerImpl.cpp
new file mode 100644
index 0000000000000000000000000000000000000000..53a421db9cd5b5cfaeac2cdf96615fc3d9ae73b8
--- /dev/null
+++ b/Source/web/WebPageSerializerImpl.cpp
@@ -0,0 +1,520 @@
+/*
+ * Copyright (C) 2009 Google Inc. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following disclaimer
+ * in the documentation and/or other materials provided with the
+ * distribution.
+ * * Neither the name of Google Inc. nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+// How we handle the base tag better.
+// Current status:
+// At now the normal way we use to handling base tag is
+// a) For those links which have corresponding local saved files, such as
+// savable CSS, JavaScript files, they will be written to relative URLs which
+// point to local saved file. Why those links can not be resolved as absolute
+// file URLs, because if they are resolved as absolute URLs, after moving the
+// file location from one directory to another directory, the file URLs will
+// be dead links.
+// b) For those links which have not corresponding local saved files, such as
+// links in A, AREA tags, they will be resolved as absolute URLs.
+// c) We comment all base tags when serialzing DOM for the page.
+// FireFox also uses above way to handle base tag.
+//
+// Problem:
+// This way can not handle the following situation:
+// the base tag is written by JavaScript.
+// For example. The page "www.yahoo.com" use
+// "document.write('<base href="http://www.yahoo.com/"...');" to setup base URL
+// of page when loading page. So when saving page as completed-HTML, we assume
+// that we save "www.yahoo.com" to "c:\yahoo.htm". After then we load the saved
+// completed-HTML page, then the JavaScript will insert a base tag
+// <base href="http://www.yahoo.com/"...> to DOM, so all URLs which point to
+// local saved resource files will be resolved as
+// "http://www.yahoo.com/yahoo_files/...", which will cause all saved resource
+// files can not be loaded correctly. Also the page will be rendered ugly since
+// all saved sub-resource files (such as CSS, JavaScript files) and sub-frame
+// files can not be fetched.
+// Now FireFox, IE and WebKit based Browser all have this problem.
+//
+// Solution:
+// My solution is that we comment old base tag and write new base tag:
+// <base href="." ...> after the previous commented base tag. In WebKit, it
+// always uses the latest "href" attribute of base tag to set document's base
+// URL. Based on this behavior, when we encounter a base tag, we comment it and
+// write a new base tag <base href="."> after the previous commented base tag.
+// The new added base tag can help engine to locate correct base URL for
+// correctly loading local saved resource files. Also I think we need to inherit
+// the base target value from document object when appending new base tag.
+// If there are multiple base tags in original document, we will comment all old
+// base tags and append new base tag after each old base tag because we do not
+// know those old base tags are original content or added by JavaScript. If
+// they are added by JavaScript, it means when loading saved page, the script(s)
+// will still insert base tag(s) to DOM, so the new added base tag(s) can
+// override the incorrect base URL and make sure we alway load correct local
+// saved resource files.
+
+#include "config.h"
+#include "web/WebPageSerializerImpl.h"
+
+#include "core/HTMLNames.h"
+#include "core/dom/Document.h"
+#include "core/dom/DocumentType.h"
+#include "core/dom/Element.h"
+#include "core/editing/Serialization.h"
+#include "core/html/HTMLAllCollection.h"
+#include "core/html/HTMLElement.h"
+#include "core/html/HTMLFormElement.h"
+#include "core/html/HTMLHtmlElement.h"
+#include "core/html/HTMLMetaElement.h"
+#include "core/loader/DocumentLoader.h"
+#include "core/loader/FrameLoader.h"
+#include "public/platform/WebVector.h"
+#include "web/WebLocalFrameImpl.h"
+#include "wtf/text/TextEncoding.h"
+
+namespace blink {
+
+// Maximum length of data buffer which is used to temporary save generated
+// html content data. This is a soft limit which might be passed if a very large
+// contegious string is found in the page.
+static const unsigned dataBufferCapacity = 65536;
+
+WebPageSerializerImpl::SerializeDomParam::SerializeDomParam(const KURL& url,
+ const WTF::TextEncoding& textEncoding,
+ Document* document,
+ const String& directoryName)
+ : url(url)
+ , textEncoding(textEncoding)
+ , document(document)
+ , directoryName(directoryName)
+ , isHTMLDocument(document->isHTMLDocument())
+ , haveSeenDocType(false)
+ , haveAddedCharsetDeclaration(false)
+ , skipMetaElement(nullptr)
+ , isInScriptOrStyleTag(false)
+ , haveAddedXMLProcessingDirective(false)
+ , haveAddedContentsBeforeEnd(false)
+{
+}
+
+String WebPageSerializerImpl::preActionBeforeSerializeOpenTag(
+ const Element* element, SerializeDomParam* param, bool* needSkip)
+{
+ StringBuilder result;
+
+ *needSkip = false;
+ if (param->isHTMLDocument) {
+ // Skip the open tag of original META tag which declare charset since we
+ // have overrided the META which have correct charset declaration after
+ // serializing open tag of HEAD element.
+ ASSERT(element);
+ if (isHTMLMetaElement(*element)) {
+ const HTMLMetaElement& meta = toHTMLMetaElement(*element);
+ // Check whether the META tag has declared charset or not.
+ String equiv = meta.httpEquiv();
+ if (equalIgnoringCase(equiv, "content-type")) {
+ String content = meta.content();
+ if (content.length() && content.contains("charset", TextCaseInsensitive)) {
+ // Find META tag declared charset, we need to skip it when
+ // serializing DOM.
+ param->skipMetaElement = element;
+ *needSkip = true;
+ }
+ }
+ } else if (isHTMLHtmlElement(*element)) {
+ // Check something before processing the open tag of HEAD element.
+ // First we add doc type declaration if original document has it.
+ if (!param->haveSeenDocType) {
+ param->haveSeenDocType = true;
+ result.append(createMarkup(param->document->doctype()));
+ }
+
+ // Add MOTW declaration before html tag.
+ // See http://msdn2.microsoft.com/en-us/library/ms537628(VS.85).aspx.
+ result.append(WebPageSerializer::generateMarkOfTheWebDeclaration(param->url));
+ } else if (isHTMLBaseElement(*element)) {
+ // Comment the BASE tag when serializing dom.
+ result.appendLiteral("<!--");
+ }
+ } else {
+ // Write XML declaration.
+ if (!param->haveAddedXMLProcessingDirective) {
+ param->haveAddedXMLProcessingDirective = true;
+ // Get encoding info.
+ String xmlEncoding = param->document->xmlEncoding();
+ if (xmlEncoding.isEmpty())
+ xmlEncoding = param->document->encodingName();
+ if (xmlEncoding.isEmpty())
+ xmlEncoding = UTF8Encoding().name();
+ result.appendLiteral("<?xml version=\"");
+ result.append(param->document->xmlVersion());
+ result.appendLiteral("\" encoding=\"");
+ result.append(xmlEncoding);
+ if (param->document->xmlStandalone())
+ result.appendLiteral("\" standalone=\"yes");
+ result.appendLiteral("\"?>\n");
+ }
+ // Add doc type declaration if original document has it.
+ if (!param->haveSeenDocType) {
+ param->haveSeenDocType = true;
+ result.append(createMarkup(param->document->doctype()));
+ }
+ }
+ return result.toString();
+}
+
+String WebPageSerializerImpl::postActionAfterSerializeOpenTag(
+ const Element* element, SerializeDomParam* param)
+{
+ StringBuilder result;
+
+ param->haveAddedContentsBeforeEnd = false;
+ if (!param->isHTMLDocument)
+ return result.toString();
+ // Check after processing the open tag of HEAD element
+ if (!param->haveAddedCharsetDeclaration
+ && isHTMLHeadElement(*element)) {
+ param->haveAddedCharsetDeclaration = true;
+ // Check meta element. WebKit only pre-parse the first 512 bytes
+ // of the document. If the whole <HEAD> is larger and meta is the
+ // end of head part, then this kind of pages aren't decoded correctly
+ // because of this issue. So when we serialize the DOM, we need to
+ // make sure the meta will in first child of head tag.
+ // See http://bugs.webkit.org/show_bug.cgi?id=16621.
+ // First we generate new content for writing correct META element.
+ result.append(WebPageSerializer::generateMetaCharsetDeclaration(
+ String(param->textEncoding.name())));
+
+ param->haveAddedContentsBeforeEnd = true;
+ // Will search each META which has charset declaration, and skip them all
+ // in PreActionBeforeSerializeOpenTag.
+ } else if (isHTMLScriptElement(*element) || isHTMLScriptElement(*element)) {
+ param->isInScriptOrStyleTag = true;
+ }
+
+ return result.toString();
+}
+
+String WebPageSerializerImpl::preActionBeforeSerializeEndTag(
+ const Element* element, SerializeDomParam* param, bool* needSkip)
+{
+ String result;
+
+ *needSkip = false;
+ if (!param->isHTMLDocument)
+ return result;
+ // Skip the end tag of original META tag which declare charset.
+ // Need not to check whether it's META tag since we guarantee
+ // skipMetaElement is definitely META tag if it's not 0.
+ if (param->skipMetaElement == element) {
+ *needSkip = true;
+ } else if (isHTMLScriptElement(*element) || isHTMLScriptElement(*element)) {
+ ASSERT(param->isInScriptOrStyleTag);
+ param->isInScriptOrStyleTag = false;
+ }
+
+ return result;
+}
+
+// After we finish serializing end tag of a element, we give the target
+// element a chance to do some post work to add some additional data.
+String WebPageSerializerImpl::postActionAfterSerializeEndTag(
+ const Element* element, SerializeDomParam* param)
+{
+ StringBuilder result;
+
+ if (!param->isHTMLDocument)
+ return result.toString();
+ // Comment the BASE tag when serializing DOM.
+ if (isHTMLBaseElement(*element)) {
+ result.appendLiteral("-->");
+ // Append a new base tag declaration.
+ result.append(WebPageSerializer::generateBaseTagDeclaration(
+ param->document->baseTarget()));
+ }
+
+ return result.toString();
+}
+
+void WebPageSerializerImpl::saveHTMLContentToBuffer(
+ const String& result, SerializeDomParam* param)
+{
+ m_dataBuffer.append(result);
+ encodeAndFlushBuffer(WebPageSerializerClient::CurrentFrameIsNotFinished,
+ param,
+ DoNotForceFlush);
+}
+
+void WebPageSerializerImpl::encodeAndFlushBuffer(
+ WebPageSerializerClient::PageSerializationStatus status,
+ SerializeDomParam* param,
+ FlushOption flushOption)
+{
+ // Data buffer is not full nor do we want to force flush.
+ if (flushOption != ForceFlush && m_dataBuffer.length() <= dataBufferCapacity)
+ return;
+
+ String content = m_dataBuffer.toString();
+ m_dataBuffer.clear();
+
+ CString encodedContent = param->textEncoding.normalizeAndEncode(content, WTF::EntitiesForUnencodables);
+
+ // Send result to the client.
+ m_client->didSerializeDataForFrame(param->url,
+ WebCString(encodedContent.data(), encodedContent.length()),
+ status);
+}
+
+void WebPageSerializerImpl::openTagToString(Element* element,
+ SerializeDomParam* param)
+{
+ bool needSkip;
+ StringBuilder result;
+ // Do pre action for open tag.
+ result.append(preActionBeforeSerializeOpenTag(element, param, &needSkip));
+ if (needSkip)
+ return;
+ // Add open tag
+ result.append('<');
+ result.append(element->nodeName().lower());
+ // Go through all attributes and serialize them.
+ AttributeCollection attributes = element->attributes();
+ AttributeCollection::iterator end = attributes.end();
+ for (AttributeCollection::iterator it = attributes.begin(); it != end; ++it) {
+ result.append(' ');
+ // Add attribute pair
+ result.append(it->name().toString());
+ result.appendLiteral("=\"");
+ if (!it->value().isEmpty()) {
+ const String& attrValue = it->value();
+
+ // Check whether we need to replace some resource links
+ // with local resource paths.
+ const QualifiedName& attrName = it->name();
+ if (element->hasLegalLinkAttribute(attrName)) {
+ // For links start with "javascript:", we do not change it.
+ if (attrValue.startsWith("javascript:", TextCaseInsensitive)) {
+ result.append(attrValue);
+ } else {
+ // Get the absolute link
+ WebLocalFrameImpl* subFrame = WebLocalFrameImpl::fromFrameOwnerElement(element);
+ String completeURL = subFrame ? subFrame->frame()->document()->url() :
+ param->document->completeURL(attrValue);
+ // Check whether we have local files for those link.
+ if (m_localLinks.contains(completeURL)) {
+ if (!param->directoryName.isEmpty()) {
+ result.appendLiteral("./");
+ result.append(param->directoryName);
+ result.append('/');
+ }
+ result.append(m_localLinks.get(completeURL));
+ } else {
+ result.append(completeURL);
+ }
+ }
+ } else {
+ if (param->isHTMLDocument)
+ result.append(m_htmlEntities.convertEntitiesInString(attrValue));
+ else
+ result.append(m_xmlEntities.convertEntitiesInString(attrValue));
+ }
+ }
+ result.append('\"');
+ }
+
+ // Do post action for open tag.
+ String addedContents = postActionAfterSerializeOpenTag(element, param);
+ // Complete the open tag for element when it has child/children.
+ if (element->hasChildren() || param->haveAddedContentsBeforeEnd)
+ result.append('>');
+ // Append the added contents generate in post action of open tag.
+ result.append(addedContents);
+ // Save the result to data buffer.
+ saveHTMLContentToBuffer(result.toString(), param);
+}
+
+// Serialize end tag of an specified element.
+void WebPageSerializerImpl::endTagToString(Element* element,
+ SerializeDomParam* param)
+{
+ bool needSkip;
+ StringBuilder result;
+ // Do pre action for end tag.
+ result.append(preActionBeforeSerializeEndTag(element, param, &needSkip));
+ if (needSkip)
+ return;
+ // Write end tag when element has child/children.
+ if (element->hasChildren() || param->haveAddedContentsBeforeEnd) {
+ result.appendLiteral("</");
+ result.append(element->nodeName().lower());
+ result.append('>');
+ } else {
+ // Check whether we have to write end tag for empty element.
+ if (param->isHTMLDocument) {
+ result.append('>');
+ // FIXME: This code is horribly wrong. WebPageSerializerImpl must die.
+ if (!element->isHTMLElement() || !toHTMLElement(element)->ieForbidsInsertHTML()) {
+ // We need to write end tag when it is required.
+ result.appendLiteral("</");
+ result.append(element->nodeName().lower());
+ result.append('>');
+ }
+ } else {
+ // For xml base document.
+ result.appendLiteral(" />");
+ }
+ }
+ // Do post action for end tag.
+ result.append(postActionAfterSerializeEndTag(element, param));
+ // Save the result to data buffer.
+ saveHTMLContentToBuffer(result.toString(), param);
+}
+
+void WebPageSerializerImpl::buildContentForNode(Node* node,
+ SerializeDomParam* param)
+{
+ switch (node->nodeType()) {
+ case Node::ELEMENT_NODE:
+ // Process open tag of element.
+ openTagToString(toElement(node), param);
+ // Walk through the children nodes and process it.
+ for (Node *child = node->firstChild(); child; child = child->nextSibling())
+ buildContentForNode(child, param);
+ // Process end tag of element.
+ endTagToString(toElement(node), param);
+ break;
+ case Node::TEXT_NODE:
+ saveHTMLContentToBuffer(createMarkup(node), param);
+ break;
+ case Node::ATTRIBUTE_NODE:
+ case Node::DOCUMENT_NODE:
+ case Node::DOCUMENT_FRAGMENT_NODE:
+ // Should not exist.
+ ASSERT_NOT_REACHED();
+ break;
+ // Document type node can be in DOM?
+ case Node::DOCUMENT_TYPE_NODE:
+ param->haveSeenDocType = true;
+ default:
+ // For other type node, call default action.
+ saveHTMLContentToBuffer(createMarkup(node), param);
+ break;
+ }
+}
+
+WebPageSerializerImpl::WebPageSerializerImpl(WebFrame* frame,
+ bool recursiveSerialization,
+ WebPageSerializerClient* client,
+ const WebVector<WebURL>& links,
+ const WebVector<WebString>& localPaths,
+ const WebString& localDirectoryName)
+ : m_client(client)
+ , m_recursiveSerialization(recursiveSerialization)
+ , m_framesCollected(false)
+ , m_localDirectoryName(localDirectoryName)
+ , m_htmlEntities(false)
+ , m_xmlEntities(true)
+{
+ // Must specify available webframe.
+ ASSERT(frame);
+ m_specifiedWebLocalFrameImpl = toWebLocalFrameImpl(frame);
+ // Make sure we have non 0 client.
+ ASSERT(client);
+ // Build local resources map.
+ ASSERT(links.size() == localPaths.size());
+ for (size_t i = 0; i < links.size(); i++) {
+ KURL url = links[i];
+ ASSERT(!m_localLinks.contains(url.string()));
+ m_localLinks.set(url.string(), localPaths[i]);
+ }
+
+ ASSERT(m_dataBuffer.isEmpty());
+}
+
+void WebPageSerializerImpl::collectTargetFrames()
+{
+ ASSERT(!m_framesCollected);
+ m_framesCollected = true;
+
+ // First, process main frame.
+ m_frames.append(m_specifiedWebLocalFrameImpl);
+ // Return now if user only needs to serialize specified frame, not including
+ // all sub-frames.
+ if (!m_recursiveSerialization)
+ return;
+ // Collect all frames inside the specified frame.
+ for (WebLocalFrameImpl* frame : m_frames) {
+ // Get current using document.
+ Document* currentDoc = frame->frame()->document();
+ // Go through sub-frames.
+ RefPtrWillBeRawPtr<HTMLAllCollection> all = currentDoc->all();
+
+ for (unsigned i = 0; Element* element = all->item(i); ++i) {
+ if (!element->isHTMLElement())
+ continue;
+ WebLocalFrameImpl* webFrame =
+ WebLocalFrameImpl::fromFrameOwnerElement(element);
+ if (webFrame)
+ m_frames.append(webFrame);
+ }
+ }
+}
+
+bool WebPageSerializerImpl::serialize()
+{
+ if (!m_framesCollected)
+ collectTargetFrames();
+
+ bool didSerialization = false;
+ KURL mainURL = m_specifiedWebLocalFrameImpl->frame()->document()->url();
+
+ for (unsigned i = 0; i < m_frames.size(); ++i) {
+ WebLocalFrameImpl* webFrame = m_frames[i];
+ Document* document = webFrame->frame()->document();
+ const KURL& url = document->url();
+
+ if (!url.isValid() || !m_localLinks.contains(url.string()))
+ continue;
+
+ didSerialization = true;
+
+ const WTF::TextEncoding& textEncoding = document->encoding().isValid() ? document->encoding() : UTF8Encoding();
+ String directoryName = url == mainURL ? m_localDirectoryName : "";
+
+ SerializeDomParam param(url, textEncoding, document, directoryName);
+
+ Element* documentElement = document->documentElement();
+ if (documentElement)
+ buildContentForNode(documentElement, &param);
+
+ encodeAndFlushBuffer(WebPageSerializerClient::CurrentFrameIsFinished, &param, ForceFlush);
+ }
+
+ ASSERT(m_dataBuffer.isEmpty());
+ m_client->didSerializeDataForFrame(KURL(), WebCString("", 0), WebPageSerializerClient::AllFramesAreFinished);
+ return didSerialization;
+}
+
+} // namespace blink
« no previous file with comments | « Source/web/WebPageSerializerImpl.h ('k') | Source/web/tests/WebPageSerializerTest.cpp » ('j') | no next file with comments »

Powered by Google App Engine
This is Rietveld 408576698