java/org/chromium/distiller/extractors/embeds/ImageExtractor.java - Issue 2020403002: Add support for figure element

Side by Side Diff: java/org/chromium/distiller/extractors/embeds/ImageExtractor.java

Issue 2020403002: Add support for figure element (Closed) Base URL: https://github.com/chromium/dom-distiller.git@master

Patch Set: comments addressed Created 4 years, 6 months ago


OLD	NEW
1 // Copyright 2015 The Chromium Authors. All rights reserved.	1 // Copyright 2015 The Chromium Authors. All rights reserved.

2 // Use of this source code is governed by a BSD-style license that can be	2 // Use of this source code is governed by a BSD-style license that can be

3 // found in the LICENSE file.	3 // found in the LICENSE file.

4	4

5 package org.chromium.distiller.extractors.embeds;	5 package org.chromium.distiller.extractors.embeds;

6	6

7 import com.google.gwt.dom.client.Element;	7 import com.google.gwt.dom.client.Element;

8 import com.google.gwt.dom.client.ImageElement;	8 import com.google.gwt.dom.client.ImageElement;

	9 import com.google.gwt.dom.client.NodeList;

	10 import org.chromium.distiller.webdocument.WebFigure;

9 import org.chromium.distiller.webdocument.WebImage;	11 import org.chromium.distiller.webdocument.WebImage;

10	12

11 import java.util.HashSet;	13 import java.util.HashSet;

12 import java.util.Set;	14 import java.util.Set;

13	15

14 /**	16 /**

15 * This class treats images as another type of embed and provides heuristics for lead image	17 * This class treats images as another type of embed and provides heuristics for lead image

16 * candidacy.	18 * candidacy.

17 */	19 */

18 public class ImageExtractor implements EmbedExtractor {	20 public class ImageExtractor implements EmbedExtractor {

19 private static final Set<String> relevantTags = new HashSet<>();	21 private static final Set<String> relevantTags = new HashSet<>();

	22 private String src;

	23 private int width;

	24 private int height;

	25

20 static {	26 static {

21 // TODO(mdjones): Add "DIV" to this list for css images and possibly captions.	27 // TODO(mdjones): Add "DIV" to this list for css images and possibly captions.

22 relevantTags.add("IMG");	28 relevantTags.add("IMG");

	29 relevantTags.add("FIGURE");

23 }	30 }

24	31

25 @Override	32 @Override

26 public Set<String> getRelevantTagNames() {	33 public Set<String> getRelevantTagNames() {

27 return relevantTags;	34 return relevantTags;

28 }	35 }

29	36

30 @Override	37 @Override

31 public WebImage extract(Element e) {	38 public WebImage extract(Element e) {

32 if (!relevantTags.contains(e.getTagName())) {	39 if (!relevantTags.contains(e.getTagName())) {

33 return null;	40 return null;

34 }	41 }

35 String imgSrc = "";

36 // Getting OffSetWidth/Height as default values, even they are	42 // Getting OffSetWidth/Height as default values, even they are

37 // affected by padding, border, etc.	43 // affected by padding, border, etc.

38 int width = e.getOffsetWidth();	44 width = e.getOffsetWidth();

39 int height = e.getOffsetHeight();	45 height = e.getOffsetHeight();

	46 src = "";

	47

40 if ("IMG".equals(e.getTagName())) {	48 if ("IMG".equals(e.getTagName())) {

41 // This will get the absolute URL of the image and	49 extractImageAttributes(ImageElement.as(e));

42 // the displayed image dimension.	50 return new WebImage(e, width, height, src);

43 ImageElement imageElement = ImageElement.as(e);	51 } else if ("FIGURE".equals(e.getTagName())) {

44 imgSrc = imageElement.getSrc();	52 Element img = getFirstElementByTagName(e, "IMG");

45 // As an ImageElement is manipulated here, it is possible	53 if (img != null) {

46 // to get the real dimensions.	54 String caption = "";

47 width = imageElement.getWidth();	55 extractImageAttributes(ImageElement.as(img));

48 height = imageElement.getHeight();	56 Element cap = getFirstElementByTagName(e, "FIGCAPTION");
	wychen 2016/06/02 23:48:49
	Sadly some web sites don't follow the spec.
	For example, this site uses <figure><div><p> to put the caption.
	http://www.appledaily.com.tw/realtimenews/article/new/20150506/605427/
	This one uses <figure><address>.
	http://www.thewire.com/entertainment/2014/07/guardians-of-the-galaxy-brings-b...
	This one uses <figure><div>. Search for "Soros Fund Management".
	http://www.washingtontimes.com/news/2015/jan/14/george-soros-funds-ferguson-p...
	We can be more tolerant by trying harder when there's no <figcaption>.

	marcelorcorrea 2016/06/06 20:38:30
	I see your point. I thought about doing that too, but then we decided to just look for figcaption in order to follow the spec.
	Do you have any suggestions on what we could do here? Get the whole innerText if no figcaption is found?

	wychen 2016/06/06 21:49:05
	Sounds good.
	57 if (cap != null) {

	58 caption = cap.getInnerText();
	wychen 2016/06/02 23:48:48
	Some sites put non-caption elements into <figcaption>. Search for "enlarge" here.
	http://arstechnica.com/gadgets/2014/02/the-2014-google-tracker-everything-we-...
	It's <figcaption><div><a href="large-img">Enlarge</a>actual caption</div>

	wychen 2016/06/02 23:48:49
	Another issue: image credit could contain a link. Only keeping plain text is less than ideal in this case.
	Search for "caption-credit" in the source here:
	http://arstechnica.com/gadgets/2014/02/the-2014-google-tracker-everything-we-...

	marcelorcorrea 2016/06/06 20:38:30
	Do you think it would be better if we kept the link too?

	wychen 2016/06/06 21:49:05
	I'd like to keep the link, but retaining the DOM tree in other cases seems messy. How about keeping the whole DOM structure within <figcaption> only when it contains links, otherwise use innerText as is?
	59 }

	60 return new WebFigure(img, width, height, src, caption);

	61 }

49 }	62 }

	63 return null;

	64 }

50	65

51 return new WebImage(e, width, height, imgSrc);	66 private void extractImageAttributes(ImageElement img) {

	67 src = img.getSrc();

	68 width = img.getWidth();

	69 height = img.getHeight();

	70 }

	71

	72 private Element getFirstElementByTagName(Element e, String tagName) {

	73 NodeList<Element> elements = e.getElementsByTagName(tagName);

	74 if (elements.getLength() > 0) {

	75 return elements.getItem(0);

	76 }

	77 return null;

52 }	78 }

53 }	79 }
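
The two suggestions agreed on in the review comments (fall back to the figure's whole innerText when there is no <figcaption>, and keep the caption's DOM structure only when it contains a link) could be sketched roughly as follows. This is not code from the patch: SketchNode, CaptionSketch, Caption, and chooseCaption are hypothetical names, and SketchNode is a plain-Java stand-in for the GWT Element so that the logic can run outside a browser.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a DOM node; this is NOT the GWT Element API. */
final class SketchNode {
    final String tag;   // upper-case tag name, or "#text" for text nodes
    final String text;  // content for "#text" nodes, "" for elements
    final List<SketchNode> children = new ArrayList<>();

    private SketchNode(String tag, String text) {
        this.tag = tag;
        this.text = text;
    }

    static SketchNode text(String s) {
        return new SketchNode("#text", s);
    }

    static SketchNode el(String tag, SketchNode... kids) {
        SketchNode n = new SketchNode(tag, "");
        for (SketchNode k : kids) {
            n.children.add(k);
        }
        return n;
    }

    /** Concatenated text of this subtree, a rough analogue of innerText. */
    String innerText() {
        if (tag.equals("#text")) {
            return text;
        }
        StringBuilder sb = new StringBuilder();
        for (SketchNode c : children) {
            sb.append(c.innerText());
        }
        return sb.toString();
    }

    /** First descendant with the given tag in document order, or null. */
    SketchNode firstByTag(String wanted) {
        for (SketchNode c : children) {
            if (c.tag.equals(wanted)) {
                return c;
            }
            SketchNode hit = c.firstByTag(wanted);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }
}

final class CaptionSketch {
    /** Caption text plus a flag saying whether the markup should be retained. */
    static final class Caption {
        final String text;
        final boolean keepMarkup;

        Caption(String text, boolean keepMarkup) {
            this.text = text;
            this.keepMarkup = keepMarkup;
        }
    }

    static Caption chooseCaption(SketchNode figure) {
        SketchNode cap = figure.firstByTag("FIGCAPTION");
        if (cap == null) {
            // No <figcaption>: fall back to the text of the whole <figure>.
            return new Caption(figure.innerText().trim(), false);
        }
        // Retain the DOM structure only when the caption contains a link.
        boolean hasLink = cap.firstByTag("A") != null;
        return new Caption(cap.innerText().trim(), hasLink);
    }
}
```

The sketch only decides which text to use and whether the markup is worth keeping. It does not filter non-caption elements such as an "Enlarge" link inside <figcaption>, which the first comment on line 58 raises as a separate problem.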
