|
|
Created:
4 years, 6 months ago by marcelorcorrea Modified:
4 years, 4 months ago CC:
dalmirsilva Base URL:
https://github.com/chromium/dom-distiller.git@master Target Ref:
refs/heads/master Visibility:
Public. |
DescriptionAdd support for figure element
Created the WebFigure class that represents a figure element,
containing an image and optionally a caption.
This class is basically a placeholder for an image and a caption,
so it extends WebImage and handles the caption.
Contributors=marcelorcorrea@hp.com; dalmirsilva@hp.com
** Score changes:
https://x20web.corp.google.com/~wychen/domdistillerscore/figure/cjk.html
Average precision: 0.950 → 0.948
https://x20web.corp.google.com/~wychen/domdistillerscore/figure/knowledge.html
Average recall: 0.973 → 0.973+
https://x20web.corp.google.com/~wychen/domdistillerscore/figure/multi-page.html
Average precision: 0.745 → 0.744
Average recall: 0.563 → 0.563
https://x20web.corp.google.com/~wychen/domdistillerscore/figure/page-links.html
Average precision: 0.918 → 0.918-
Average recall: 0.946 → 0.947+
https://x20web.corp.google.com/~wychen/domdistillerscore/figure/reader-golden.html
Average precision: 0.947 → 0.945
Average recall: 0.975 → 0.976
Overall, more things are extracted, so recall improved,
but precision regressed.
BUG=613374
R=mdjones@chromium.org, wychen@chromium.org
Committed: 365c44e85bbba9c237827444cef6df66e56f3689
Patch Set 1 #
Total comments: 9
Patch Set 2 : wychen's comments addressed #
Total comments: 4
Patch Set 3 : comments addressed #
Total comments: 11
Patch Set 4 : merged from master #
Total comments: 4
Patch Set 5 : wychen's comments addressed #Patch Set 6 : added support for lazily-loaded images in figures #
Total comments: 8
Patch Set 7 : comments addressed #
Total comments: 1
Patch Set 8 : Handling plain texts with generateOutputFromTree. #
Total comments: 5
Patch Set 9 : rebase & comments addressed #
Total comments: 2
Patch Set 10 : mdjones' comments addressed #
Messages
Total messages: 37 (7 generated)
Add support for figure element.
Thanks for the patch! https://codereview.chromium.org/2020403002/diff/1/java/org/chromium/distiller... File java/org/chromium/distiller/extractors/embeds/ImageExtractor.java (right): https://codereview.chromium.org/2020403002/diff/1/java/org/chromium/distiller... java/org/chromium/distiller/extractors/embeds/ImageExtractor.java:10: import org.chromium.distiller.webdocument.WebFigure; Forget to upload this one? https://codereview.chromium.org/2020403002/diff/1/java/org/chromium/distiller... java/org/chromium/distiller/extractors/embeds/ImageExtractor.java:54: if (img != null) { If a malformed <figure> contains no <img>s, an empty WebImage is still returned. https://codereview.chromium.org/2020403002/diff/1/java/org/chromium/distiller... java/org/chromium/distiller/extractors/embeds/ImageExtractor.java:72: private Element getElementByTagName(Element e, String tagName) { getFirstElementByTagName? https://codereview.chromium.org/2020403002/diff/1/javatests/org/chromium/dist... File javatests/org/chromium/distiller/EmbedExtractorTest.java (right): https://codereview.chromium.org/2020403002/diff/1/javatests/org/chromium/dist... javatests/org/chromium/distiller/EmbedExtractorTest.java:386: image.setSrc("http://wwww.someimage.com/image.jpeg"); We should use example.com for the domain. https://codereview.chromium.org/2020403002/diff/1/javatests/org/chromium/dist... javatests/org/chromium/distiller/EmbedExtractorTest.java:407: public void testFigureWithCaption() { Add one more test without <img>?
Sorry, wychen! Forgot, indeed ! :P WebFigure added. Thanks! https://codereview.chromium.org/2020403002/diff/1/java/org/chromium/distiller... File java/org/chromium/distiller/extractors/embeds/ImageExtractor.java (right): https://codereview.chromium.org/2020403002/diff/1/java/org/chromium/distiller... java/org/chromium/distiller/extractors/embeds/ImageExtractor.java:54: if (img != null) { On 2016/05/31 21:28:04, wychen wrote: > If a malformed <figure> contains no <img>s, an empty WebImage is still returned. Done. https://codereview.chromium.org/2020403002/diff/1/java/org/chromium/distiller... java/org/chromium/distiller/extractors/embeds/ImageExtractor.java:72: private Element getElementByTagName(Element e, String tagName) { On 2016/05/31 21:28:04, wychen wrote: > getFirstElementByTagName? Done. https://codereview.chromium.org/2020403002/diff/1/javatests/org/chromium/dist... File javatests/org/chromium/distiller/EmbedExtractorTest.java (right): https://codereview.chromium.org/2020403002/diff/1/javatests/org/chromium/dist... javatests/org/chromium/distiller/EmbedExtractorTest.java:386: image.setSrc("http://wwww.someimage.com/image.jpeg"); On 2016/05/31 21:28:04, wychen wrote: > We should use http://example.com for the domain. Done. https://codereview.chromium.org/2020403002/diff/1/javatests/org/chromium/dist... javatests/org/chromium/distiller/EmbedExtractorTest.java:407: public void testFigureWithCaption() { On 2016/05/31 21:28:04, wychen wrote: > Add one more test without <img>? Done.
How does this look in the viewer? Do we need to update the CSS? https://codereview.chromium.org/2020403002/diff/20001/java/org/chromium/disti... File java/org/chromium/distiller/extractors/embeds/ImageExtractor.java (right): https://codereview.chromium.org/2020403002/diff/20001/java/org/chromium/disti... java/org/chromium/distiller/extractors/embeds/ImageExtractor.java:23: private String caption; This doesn't need to be in class scope. https://codereview.chromium.org/2020403002/diff/20001/java/org/chromium/disti... File java/org/chromium/distiller/webdocument/WebFigure.java (right): https://codereview.chromium.org/2020403002/diff/20001/java/org/chromium/disti... java/org/chromium/distiller/webdocument/WebFigure.java:21: cap.setInnerHTML(figCaption); The caption is obtained from innerText, but set as innerHTML? If we want to keep text only, probably wrap inside <p> here. If we want to keep rich structure, sanitize it here.
Done. https://codereview.chromium.org/2020403002/diff/20001/java/org/chromium/disti... File java/org/chromium/distiller/extractors/embeds/ImageExtractor.java (right): https://codereview.chromium.org/2020403002/diff/20001/java/org/chromium/disti... java/org/chromium/distiller/extractors/embeds/ImageExtractor.java:23: private String caption; On 2016/05/31 22:14:46, wychen wrote: > This doesn't need to be in class scope. Done. https://codereview.chromium.org/2020403002/diff/20001/java/org/chromium/disti... File java/org/chromium/distiller/webdocument/WebFigure.java (right): https://codereview.chromium.org/2020403002/diff/20001/java/org/chromium/disti... java/org/chromium/distiller/webdocument/WebFigure.java:21: cap.setInnerHTML(figCaption); On 2016/05/31 22:14:46, wychen wrote: > The caption is obtained from innerText, but set as innerHTML? If we want to keep > text only, probably wrap inside <p> here. If we want to keep rich structure, > sanitize it here. Right. We've changed to set text only since according to this: https://www.w3.org/TR/html5/grouping-content.html#the-figure-element wouldn't make sense to preserve HTML. Usually the figure element is used like this: <figure> <div> <img src="http://img source"> <figcaption>example caption</figcaption> </div> </figure>
On 2016/05/31 22:14:46, wychen wrote: > How does this look in the viewer? Do we need to update the CSS? > We're not complete sure, but the figure element looks to be slightly shifted to the right due to its margins. We are building a new version of chromium with this customized dom-distiller version to check if it is ok. Thanks for the review, wychen!
Regarding how it looks like in the preview, it looks really good! See: https://s3-us-west-1.amazonaws.com/printx/public/Screen+Shot+2016-06-01+at+3....
I've run this through our internal dataset and found some issues. After this CL, some extra elements are extracted for this web page: http://www.chinatimes.com/realtimenews/20150506002032-260408 https://codereview.chromium.org/2020403002/diff/40001/java/org/chromium/disti... File java/org/chromium/distiller/extractors/embeds/ImageExtractor.java (right): https://codereview.chromium.org/2020403002/diff/40001/java/org/chromium/disti... java/org/chromium/distiller/extractors/embeds/ImageExtractor.java:56: Element cap = getFirstElementByTagName(e, "FIGCAPTION"); Sadly some web sites don't follow the spec. For example, this site use <figure><div><p> to put the caption. http://www.appledaily.com.tw/realtimenews/article/new/20150506/605427/ This one uses <figure><address> http://www.thewire.com/entertainment/2014/07/guardians-of-the-galaxy-brings-b... This one uses <figure><div>. Search for "Soros Fund Management". http://www.washingtontimes.com/news/2015/jan/14/george-soros-funds-ferguson-p... We can be more tolerant by trying harder when there's no <figcation>. https://codereview.chromium.org/2020403002/diff/40001/java/org/chromium/disti... java/org/chromium/distiller/extractors/embeds/ImageExtractor.java:58: caption = cap.getInnerText(); Some sites put non-caption elements into <figcaption>. Search for "enlarge" here. http://arstechnica.com/gadgets/2014/02/the-2014-google-tracker-everything-we-... It's <figcaption><div><a href="large-img">Enlarge</a>actual caption</div> https://codereview.chromium.org/2020403002/diff/40001/java/org/chromium/disti... java/org/chromium/distiller/extractors/embeds/ImageExtractor.java:58: caption = cap.getInnerText(); Another issue: image credit could contain a link. Only keeping plain text is less than ideal in this case. Search for "caption-credit" in the source here: http://arstechnica.com/gadgets/2014/02/the-2014-google-tracker-everything-we-... https://codereview.chromium.org/2020403002/diff/40001/java/org/chromium/disti... File java/org/chromium/distiller/webdocument/WebFigure.java (right): https://codereview.chromium.org/2020403002/diff/40001/java/org/chromium/disti... java/org/chromium/distiller/webdocument/WebFigure.java:16: public String generateOutput(boolean textOnly) { This doesn't work for text-only mode. Add this as the first line: if (textOnly) return figCaption; It might make sense to add one test for this. https://codereview.chromium.org/2020403002/diff/40001/java/org/chromium/disti... java/org/chromium/distiller/webdocument/WebFigure.java:18: fig.setInnerHTML(super.generateOutput(textOnly)); After that, textOnly can only be false here.
On 2016/06/02 23:48:49, wychen wrote: > I've run this through our internal dataset and found some issues. > > After this CL, some extra elements are extracted for this web page: > http://www.chinatimes.com/realtimenews/20150506002032-260408 > I did some tests, and it looks like it is a heuristic issue. When the figure logic is disabled, dom distiller marks the figcaption as content. When it is enabled, this figcaption is skipped for the heuristics, so dom distiller marks the social media links as content. I'll try to look why this is happening.
https://codereview.chromium.org/2020403002/diff/40001/java/org/chromium/disti... File java/org/chromium/distiller/extractors/embeds/ImageExtractor.java (right): https://codereview.chromium.org/2020403002/diff/40001/java/org/chromium/disti... java/org/chromium/distiller/extractors/embeds/ImageExtractor.java:56: Element cap = getFirstElementByTagName(e, "FIGCAPTION"); On 2016/06/02 23:48:49, wychen wrote: > Sadly some web sites don't follow the spec. > > For example, this site use <figure><div><p> to put the caption. > http://www.appledaily.com.tw/realtimenews/article/new/20150506/605427/ > > This one uses <figure><address> > http://www.thewire.com/entertainment/2014/07/guardians-of-the-galaxy-brings-b... > > This one uses <figure><div>. Search for "Soros Fund Management". > http://www.washingtontimes.com/news/2015/jan/14/george-soros-funds-ferguson-p... > > We can be more tolerant by trying harder when there's no <figcation>. I see your point. I thought about doing that too, but then we decided to just look for figcaption in order to follow the spec. Do you have any suggestions on what we could do here? get the whole innerText if non figcaption is found? https://codereview.chromium.org/2020403002/diff/40001/java/org/chromium/disti... java/org/chromium/distiller/extractors/embeds/ImageExtractor.java:58: caption = cap.getInnerText(); On 2016/06/02 23:48:49, wychen wrote: > Another issue: image credit could contain a link. Only keeping plain text is > less than ideal in this case. > > Search for "caption-credit" in the source here: > http://arstechnica.com/gadgets/2014/02/the-2014-google-tracker-everything-we-... Do you think it would be better if we kept the link too? https://codereview.chromium.org/2020403002/diff/40001/java/org/chromium/disti... File java/org/chromium/distiller/webdocument/WebFigure.java (right): https://codereview.chromium.org/2020403002/diff/40001/java/org/chromium/disti... java/org/chromium/distiller/webdocument/WebFigure.java:16: public String generateOutput(boolean textOnly) { On 2016/06/02 23:48:49, wychen wrote: > This doesn't work for text-only mode. Add this as the first line: > > if (textOnly) return figCaption; > > It might make sense to add one test for this. Acknowledged. https://codereview.chromium.org/2020403002/diff/40001/java/org/chromium/disti... java/org/chromium/distiller/webdocument/WebFigure.java:18: fig.setInnerHTML(super.generateOutput(textOnly)); On 2016/06/02 23:48:49, wychen wrote: > After that, textOnly can only be false here. Acknowledged.
https://codereview.chromium.org/2020403002/diff/40001/java/org/chromium/disti... File java/org/chromium/distiller/extractors/embeds/ImageExtractor.java (right): https://codereview.chromium.org/2020403002/diff/40001/java/org/chromium/disti... java/org/chromium/distiller/extractors/embeds/ImageExtractor.java:56: Element cap = getFirstElementByTagName(e, "FIGCAPTION"); On 2016/06/06 20:38:30, marcelorcorrea wrote: > On 2016/06/02 23:48:49, wychen wrote: > > Sadly some web sites don't follow the spec. > > > > For example, this site use <figure><div><p> to put the caption. > > http://www.appledaily.com.tw/realtimenews/article/new/20150506/605427/ > > > > This one uses <figure><address> > > > http://www.thewire.com/entertainment/2014/07/guardians-of-the-galaxy-brings-b... > > > > This one uses <figure><div>. Search for "Soros Fund Management". > > > http://www.washingtontimes.com/news/2015/jan/14/george-soros-funds-ferguson-p... > > > > We can be more tolerant by trying harder when there's no <figcation>. > > I see your point. I thought about doing that too, but then we decided to just > look for figcaption in order to follow the spec. > Do you have any suggestions on what we could do here? get the whole innerText if > non figcaption is found? Sounds good. https://codereview.chromium.org/2020403002/diff/40001/java/org/chromium/disti... java/org/chromium/distiller/extractors/embeds/ImageExtractor.java:58: caption = cap.getInnerText(); On 2016/06/06 20:38:30, marcelorcorrea wrote: > On 2016/06/02 23:48:49, wychen wrote: > > Another issue: image credit could contain a link. Only keeping plain text is > > less than ideal in this case. > > > > Search for "caption-credit" in the source here: > > > http://arstechnica.com/gadgets/2014/02/the-2014-google-tracker-everything-we-... > > Do you think it would be better if we kept the link too? I'd like to keep the link, but retaining the DOM tree in other cases seems messy. How about keeping the whole DOM structure within <figcaption> only when it contains links, otherwise use innerText as is?
We also need to consider compatibility with ContentExtractor.getImageUrls() and LeadImageFinder. In that case, would keeping just WebImage instead of adding WebFigure simpler?
On 2016/06/07 00:56:08, wychen wrote: > We also need to consider compatibility with ContentExtractor.getImageUrls() and > LeadImageFinder. In that case, would keeping just WebImage instead of adding > WebFigure simpler? Ah. Never mind. It should work as is.
Thanks for formating my part. https://codereview.chromium.org/2020403002/diff/60001/java/org/chromium/disti... File java/org/chromium/distiller/extractors/embeds/ImageExtractor.java (right): https://codereview.chromium.org/2020403002/diff/60001/java/org/chromium/disti... java/org/chromium/distiller/extractors/embeds/ImageExtractor.java:10: import com.google.gwt.dom.client.NodeList; import should be sorted. https://codereview.chromium.org/2020403002/diff/60001/java/org/chromium/disti... java/org/chromium/distiller/extractors/embeds/ImageExtractor.java:90: private void extractImageAttributes(ImageElement img) { Let's support lazily-loaded images in <figure> as well.
https://codereview.chromium.org/2020403002/diff/60001/java/org/chromium/disti... File java/org/chromium/distiller/extractors/embeds/ImageExtractor.java (right): https://codereview.chromium.org/2020403002/diff/60001/java/org/chromium/disti... java/org/chromium/distiller/extractors/embeds/ImageExtractor.java:10: import com.google.gwt.dom.client.NodeList; On 2016/06/07 16:27:20, wychen wrote: > import should be sorted. Done. https://codereview.chromium.org/2020403002/diff/60001/java/org/chromium/disti... java/org/chromium/distiller/extractors/embeds/ImageExtractor.java:90: private void extractImageAttributes(ImageElement img) { On 2016/06/07 16:27:20, wychen wrote: > Let's support lazily-loaded images in <figure> as well. Done.
https://codereview.chromium.org/2020403002/diff/100001/java/org/chromium/dist... File java/org/chromium/distiller/extractors/embeds/ImageExtractor.java (right): https://codereview.chromium.org/2020403002/diff/100001/java/org/chromium/dist... java/org/chromium/distiller/extractors/embeds/ImageExtractor.java:12: import org.chromium.distiller.LogUtil; The sorting is case sensitive. Capital ones come first. https://codereview.chromium.org/2020403002/diff/100001/java/org/chromium/dist... java/org/chromium/distiller/extractors/embeds/ImageExtractor.java:59: cap.getInnerHTML() : cap.getInnerText(); This innerText should be escaped. Or a vulnerability like XSS could be introduced here. Be very careful about this. Getting innerHTML doesn't look right. For example, links don't get converted to absolute urls. You can check how WebTable handles this. https://codereview.chromium.org/2020403002/diff/100001/javatests/org/chromium... File javatests/org/chromium/distiller/EmbedExtractorTest.java (right): https://codereview.chromium.org/2020403002/diff/100001/javatests/org/chromium... javatests/org/chromium/distiller/EmbedExtractorTest.java:487: Element anchor = Document.get().createAnchorElement(); Put an href and see if it gets converted. Also, put other attributes and see if it gets cleaned. https://codereview.chromium.org/2020403002/diff/100001/javatests/org/chromium... javatests/org/chromium/distiller/EmbedExtractorTest.java:511: figcaption.setInnerHTML("This is a caption"); Put this inside other elements, and make sure it's not retained.
thanks for reviewing it! https://codereview.chromium.org/2020403002/diff/100001/java/org/chromium/dist... File java/org/chromium/distiller/extractors/embeds/ImageExtractor.java (right): https://codereview.chromium.org/2020403002/diff/100001/java/org/chromium/dist... java/org/chromium/distiller/extractors/embeds/ImageExtractor.java:59: cap.getInnerHTML() : cap.getInnerText(); On 2016/06/07 17:39:19, wychen wrote: > This innerText should be escaped. Or a vulnerability like XSS could be > introduced here. Be very careful about this. > > Getting innerHTML doesn't look right. For example, links don't get converted to > absolute urls. You can check how WebTable handles this. Done. https://codereview.chromium.org/2020403002/diff/100001/javatests/org/chromium... File javatests/org/chromium/distiller/EmbedExtractorTest.java (right): https://codereview.chromium.org/2020403002/diff/100001/javatests/org/chromium... javatests/org/chromium/distiller/EmbedExtractorTest.java:487: Element anchor = Document.get().createAnchorElement(); On 2016/06/07 17:39:19, wychen wrote: > Put an href and see if it gets converted. > > Also, put other attributes and see if it gets cleaned. Done. https://codereview.chromium.org/2020403002/diff/100001/javatests/org/chromium... javatests/org/chromium/distiller/EmbedExtractorTest.java:511: figcaption.setInnerHTML("This is a caption"); On 2016/06/07 17:39:19, wychen wrote: > Put this inside other elements, and make sure it's not retained. Done.
https://codereview.chromium.org/2020403002/diff/100001/java/org/chromium/dist... File java/org/chromium/distiller/extractors/embeds/ImageExtractor.java (right): https://codereview.chromium.org/2020403002/diff/100001/java/org/chromium/dist... java/org/chromium/distiller/extractors/embeds/ImageExtractor.java:59: cap.getInnerHTML() : cap.getInnerText(); On 2016/06/08 14:34:15, marcelorcorrea wrote: > On 2016/06/07 17:39:19, wychen wrote: > > This innerText should be escaped. Or a vulnerability like XSS could be > > introduced here. Be very careful about this. > > > > Getting innerHTML doesn't look right. For example, links don't get converted > to > > absolute urls. You can check how WebTable handles this. > > Done. WebTable stores an element, and use DomUtil.generateOutputFromTree() when generating its output. I think that looks more robust than doing innerHTML/innerText with manual escaping. https://codereview.chromium.org/2020403002/diff/120001/java/org/chromium/dist... File java/org/chromium/distiller/webdocument/WebFigure.java (right): https://codereview.chromium.org/2020403002/diff/120001/java/org/chromium/dist... java/org/chromium/distiller/webdocument/WebFigure.java:18: if (textOnly) return figCaption; Now this doesn't work with these escaping.
Hi, What is the status of this CL?
On 2016/07/24 23:26:43, wychen wrote: > Hi, > > What is the status of this CL? Hello wychen! Sorry for the delay. I'm not sure if I got it, sorry. :( Are you suggesting that we could store the whole figure element like WebTable does and then use DomUtil.generateOutputFromTree() when returning its output? Or just for the captions? Wouldn't this keep unwanted elements as well? like spans, div e etc ... ? <figure> <span> <img src"..." /> <figcaption><div>...</div></figcaption> </span> </figure> Sorry if i'm totally mistaken!
On 2016/07/28 20:25:23, marcelorcorrea wrote: > On 2016/07/24 23:26:43, wychen wrote: > > Hi, > > > > What is the status of this CL? > > Hello wychen! > Sorry for the delay. > I'm not sure if I got it, sorry. :( > Are you suggesting that we could store the whole figure element like WebTable > does and then use DomUtil.generateOutputFromTree() when returning its output? > Or just for the captions? > Wouldn't this keep unwanted elements as well? > like spans, div e etc ... ? > <figure> > <span> > <img src"..." /> > <figcaption><div>...</div></figcaption> > </span> > </figure> > > Sorry if i'm totally mistaken! Just the captions, and only if we need more than plain text. Your current implementation with innerHTML/innerText is correct except for textOnly mode, but using innerHTML/innerText is considered smelly. If we don't handle these carefully, vulnerabilities could be introduced.
Wychen comments addressed
Thanks. We are getting close. https://codereview.chromium.org/2020403002/diff/140001/java/org/chromium/dist... File java/org/chromium/distiller/extractors/embeds/ImageExtractor.java (right): https://codereview.chromium.org/2020403002/diff/140001/java/org/chromium/dist... java/org/chromium/distiller/extractors/embeds/ImageExtractor.java:57: Element figcaption = Document.get().createElement("FIGCAPTION"); If we need to keep the structure, directly pass the <figcation> element (cap) to WebFigure(). Otherwise, create a new <figcaption> and use setInnerText(). This way, we don't use setInnerHTML, and avoid unnecessary copying. This piece of code is executed for every <figure>, even if they are not considered content, so we better keep it fast. https://codereview.chromium.org/2020403002/diff/140001/java/org/chromium/dist... File java/org/chromium/distiller/webdocument/WebFigure.java (right): https://codereview.chromium.org/2020403002/diff/140001/java/org/chromium/dist... java/org/chromium/distiller/webdocument/WebFigure.java:8: * WebFigure represents a html figure tag which contains an image and ... represents a figure element, containing an image and optionally a caption. https://codereview.chromium.org/2020403002/diff/140001/java/org/chromium/dist... java/org/chromium/distiller/webdocument/WebFigure.java:31: * and just handle he caption since an html figure is basically a the https://codereview.chromium.org/2020403002/diff/140001/java/org/chromium/dist... java/org/chromium/distiller/webdocument/WebFigure.java:39: figure.setInnerHTML(super.generateOutput(false)); Could you rebase and add to WebImage: protected ImageElement getProcessedNode() Then use that element here, instead of calling setInnerHTML? https://codereview.chromium.org/2020403002/diff/140001/java/org/chromium/dist... java/org/chromium/distiller/webdocument/WebFigure.java:41: Element container = Document.get().createElement("DIV"); We could just use: figure.appendChild(caption)
Patchset #9 (id:160001) has been deleted
Patchset #9 (id:180001) has been deleted
wychen's comments addressed again :D! Thanks for reviewing it!
lgtm. mdjones@, can you also take a look? Thanks!
lgtm. mdjones@, can you also take a look? Thanks!
lgtm % nits https://codereview.chromium.org/2020403002/diff/200001/java/org/chromium/dist... File java/org/chromium/distiller/extractors/embeds/ImageExtractor.java (right): https://codereview.chromium.org/2020403002/diff/200001/java/org/chromium/dist... java/org/chromium/distiller/extractors/embeds/ImageExtractor.java:60: figcaption = getFirstElementByTagName(cap, "A") != null ? Can you add an explanation for this? Why are we looking for an anchor before creating a caption? https://codereview.chromium.org/2020403002/diff/200001/java/org/chromium/dist... File java/org/chromium/distiller/webdocument/WebFigure.java (right): https://codereview.chromium.org/2020403002/diff/200001/java/org/chromium/dist... java/org/chromium/distiller/webdocument/WebFigure.java:27: @Override Move @Override under documentation.
Could you help update the CL description so that we can land it? Thanks!
Description was changed from ========== Add support for figure element In order to preserve caption for figure elements we added the FIGURE tag in the relevant tag list inside ImageExtractor. When there is a figure element we detect it, try to find the inner image (getting only the fist one) then find the FIGCAPTION tag. If there is a FIGCAPTION we use its inner text to build a (also created) WebFigure. A WebFigure is a WebImage and contains a caption (String). We override the generate output method to include the caption. Contributors=marcelorcorrea@hp.com; dalmirsilva@hp.com BUG=613374 R=wychen@chromium.org, mdjones@chromium.org ========== to ========== Add support for figure element Created the WebFigure class that represents a figure element, containing an image and optionally a caption. This class is basically a placeholder for an image and a caption. So it extends WebImage and handles the caption. Contributors=marcelorcorrea@hp.com; dalmirsilva@hp.com BUG=613374 R=wychen@chromium.org, mdjones@chromium.org ==========
On 2016/08/06 07:09:05, wychen wrote: > Could you help update the CL description so that we can land it? Thanks! Done, do you think it's better now? Thanks!
Description was changed from ========== Add support for figure element Created the WebFigure class that represents a figure element, containing an image and optionally a caption. This class is basically a placeholder for an image and a caption. So it extends WebImage and handles the caption. Contributors=marcelorcorrea@hp.com; dalmirsilva@hp.com BUG=613374 R=wychen@chromium.org, mdjones@chromium.org ========== to ========== Add support for figure element Created the WebFigure class that represents a figure element, containing an image and optionally a caption. This class is basically a placeholder for an image and a caption, so it extends WebImage and handles the caption. Contributors=marcelorcorrea@hp.com; dalmirsilva@hp.com BUG=613374 R=wychen@chromium.org, mdjones@chromium.org ==========
Description was changed from ========== Add support for figure element Created the WebFigure class that represents a figure element, containing an image and optionally a caption. This class is basically a placeholder for an image and a caption, so it extends WebImage and handles the caption. Contributors=marcelorcorrea@hp.com; dalmirsilva@hp.com BUG=613374 R=wychen@chromium.org, mdjones@chromium.org ========== to ========== Add support for figure element Created the WebFigure class that represents a figure element, containing an image and optionally a caption. This class is basically a placeholder for an image and a caption, so it extends WebImage and handles the caption. Contributors=marcelorcorrea@hp.com; dalmirsilva@hp.com ** Score changes: https://x20web.corp.google.com/~wychen/domdistillerscore/figure/cjk.html Average precision: 0.950 → 0.948 https://x20web.corp.google.com/~wychen/domdistillerscore/figure/knowledge.html Average recall: 0.973 → 0.973+ https://x20web.corp.google.com/~wychen/domdistillerscore/figure/multi-page.html Average precision: 0.745 → 0.744 Average recall: 0.563 → 0.563 https://x20web.corp.google.com/~wychen/domdistillerscore/figure/page-links.html Average precision: 0.918 → 0.918- Average recall: 0.946 → 0.947+ https://x20web.corp.google.com/~wychen/domdistillerscore/figure/reader-golden... Average precision: 0.947 → 0.945 Average recall: 0.975 → 0.976 Overall, more things are extracted, so recall improved, but precision regressed. BUG=613374 R=wychen@chromium.org, mdjones@chromium.org ==========
Description was changed from ========== Add support for figure element Created the WebFigure class that represents a figure element, containing an image and optionally a caption. This class is basically a placeholder for an image and a caption, so it extends WebImage and handles the caption. Contributors=marcelorcorrea@hp.com; dalmirsilva@hp.com ** Score changes: https://x20web.corp.google.com/~wychen/domdistillerscore/figure/cjk.html Average precision: 0.950 → 0.948 https://x20web.corp.google.com/~wychen/domdistillerscore/figure/knowledge.html Average recall: 0.973 → 0.973+ https://x20web.corp.google.com/~wychen/domdistillerscore/figure/multi-page.html Average precision: 0.745 → 0.744 Average recall: 0.563 → 0.563 https://x20web.corp.google.com/~wychen/domdistillerscore/figure/page-links.html Average precision: 0.918 → 0.918- Average recall: 0.946 → 0.947+ https://x20web.corp.google.com/~wychen/domdistillerscore/figure/reader-golden... Average precision: 0.947 → 0.945 Average recall: 0.975 → 0.976 Overall, more things are extracted, so recall improved, but precision regressed. BUG=613374 R=wychen@chromium.org, mdjones@chromium.org ========== to ========== Add support for figure element Created the WebFigure class that represents a figure element, containing an image and optionally a caption. This class is basically a placeholder for an image and a caption, so it extends WebImage and handles the caption. Contributors=marcelorcorrea@hp.com; dalmirsilva@hp.com ** Score changes: https://x20web.corp.google.com/~wychen/domdistillerscore/figure/cjk.html Average precision: 0.950 → 0.948 https://x20web.corp.google.com/~wychen/domdistillerscore/figure/knowledge.html Average recall: 0.973 → 0.973+ https://x20web.corp.google.com/~wychen/domdistillerscore/figure/multi-page.html Average precision: 0.745 → 0.744 Average recall: 0.563 → 0.563 https://x20web.corp.google.com/~wychen/domdistillerscore/figure/page-links.html Average precision: 0.918 → 0.918- Average recall: 0.946 → 0.947+ https://x20web.corp.google.com/~wychen/domdistillerscore/figure/reader-golden... Average precision: 0.947 → 0.945 Average recall: 0.975 → 0.976 Overall, more things are extracted, so recall improved, but precision regressed. BUG=613374 R=wychen@chromium.org, mdjones@chromium.org ==========
Description was changed from ========== Add support for figure element Created the WebFigure class that represents a figure element, containing an image and optionally a caption. This class is basically a placeholder for an image and a caption, so it extends WebImage and handles the caption. Contributors=marcelorcorrea@hp.com; dalmirsilva@hp.com ** Score changes: https://x20web.corp.google.com/~wychen/domdistillerscore/figure/cjk.html Average precision: 0.950 → 0.948 https://x20web.corp.google.com/~wychen/domdistillerscore/figure/knowledge.html Average recall: 0.973 → 0.973+ https://x20web.corp.google.com/~wychen/domdistillerscore/figure/multi-page.html Average precision: 0.745 → 0.744 Average recall: 0.563 → 0.563 https://x20web.corp.google.com/~wychen/domdistillerscore/figure/page-links.html Average precision: 0.918 → 0.918- Average recall: 0.946 → 0.947+ https://x20web.corp.google.com/~wychen/domdistillerscore/figure/reader-golden... Average precision: 0.947 → 0.945 Average recall: 0.975 → 0.976 Overall, more things are extracted, so recall improved, but precision regressed. BUG=613374 R=wychen@chromium.org, mdjones@chromium.org ========== to ========== Add support for figure element Created the WebFigure class that represents a figure element, containing an image and optionally a caption. This class is basically a placeholder for an image and a caption, so it extends WebImage and handles the caption. Contributors=marcelorcorrea@hp.com; dalmirsilva@hp.com ** Score changes: https://x20web.corp.google.com/~wychen/domdistillerscore/figure/cjk.html Average precision: 0.950 → 0.948 https://x20web.corp.google.com/~wychen/domdistillerscore/figure/knowledge.html Average recall: 0.973 → 0.973+ https://x20web.corp.google.com/~wychen/domdistillerscore/figure/multi-page.html Average precision: 0.745 → 0.744 Average recall: 0.563 → 0.563 https://x20web.corp.google.com/~wychen/domdistillerscore/figure/page-links.html Average precision: 0.918 → 0.918- Average recall: 0.946 → 0.947+ https://x20web.corp.google.com/~wychen/domdistillerscore/figure/reader-golden... Average precision: 0.947 → 0.945 Average recall: 0.975 → 0.976 Overall, more things are extracted, so recall improved, but precision regressed. BUG=613374 R=mdjones@chromium.org, wychen@chromium.org Committed: 365c44e85bbba9c237827444cef6df66e56f3689 ==========
Message was sent while issue was closed.
Committed patchset #10 (id:220001) manually as 365c44e85bbba9c237827444cef6df66e56f3689 (presubmit successful). |