|
|
Created:
5 years, 2 months ago by marcelorcorrea Modified:
4 years, 6 months ago CC:
mlongaray, dalmirsilva Base URL:
https://github.com/chromium/dom-distiller.git@master Target Ref:
refs/heads/master Visibility:
Public. |
DescriptionDiscard hidden articles when using fast path
Dom Distiller returns an empty content string for certain webpages.
The shortcut logic that Dom Distiller uses when some pages have one or
more <article> or <element itemscope itemtype="Article"> can lead to
erroneous content.
Some web pages have hidden articles, and these are being considered as
valid elements.
In order to fix this, when Dom Distiller is searching for article tags,
it will now discard hidden article elements.
Contributors=marcelorcorrea@hp.com, mlongaray@hp.com
BUG=544962, 616954
R=mdjones@chromium.org, wychen@chromium.org
Committed: 54d05ba208e431963af489785edf036f9fda3cb7
Patch Set 1 #
Total comments: 1
Patch Set 2 : Fixed inconsistent indentation #
Total comments: 5
Patch Set 3 : Comments addressed #
Total comments: 1
Patch Set 4 : Fixed indentation. #Patch Set 5 : Added accurate visibility test #
Total comments: 1
Patch Set 6 : wychen's comments addressed #
Total comments: 5
Patch Set 7 : Comments addressed & master rebased #
Total comments: 18
Patch Set 8 : Method and unit tests names changed #
Total comments: 7
Patch Set 9 : isVisibleByOffset() method improved. #
Total comments: 10
Patch Set 10 : nit fixed #
Total comments: 1
Patch Set 11 : nitpick fixed 2 #
Total comments: 2
Patch Set 12 : nitpick 3 fixed #
Total comments: 4
Patch Set 13 : nit fixed 4 #
Messages
Total messages: 50 (4 generated)
Thanks for the patch. Can you provide some example web sites that have hidden articles? https://codereview.chromium.org/1411603004/diff/1/javatests/org/chromium/dist... File javatests/org/chromium/distiller/ContentExtractorTest.java (right): https://codereview.chromium.org/1411603004/diff/1/javatests/org/chromium/dist... javatests/org/chromium/distiller/ContentExtractorTest.java:650: "<h1>" + CONTENT_TEXT + "</h1>" + nit: inconsistent indentation around here.
On 2015/10/21 17:21:20, wychen wrote: > Thanks for the patch. Can you provide some example web sites that have hidden > articles? > > https://codereview.chromium.org/1411603004/diff/1/javatests/org/chromium/dist... > File javatests/org/chromium/distiller/ContentExtractorTest.java (right): > > https://codereview.chromium.org/1411603004/diff/1/javatests/org/chromium/dist... > javatests/org/chromium/distiller/ContentExtractorTest.java:650: "<h1>" + > CONTENT_TEXT + "</h1>" + > nit: inconsistent indentation around here. Hello wychen, Sure. This webpage here: http://www.freep.com/story/news/nation/2015/09/30/woman-accused-tossing-newbo... have two <article> elements, one of them is hidden and doesn't any have content. And this webpage here: http://www.australia.com/en-us/places/great-barrier-reef/top-10-things-to-do.... have two items with itemtype="NewsArticle", one of them is hidden and doesn't have content either. Thanks!
https://codereview.chromium.org/1411603004/diff/20001/java/org/chromium/disti... File java/org/chromium/distiller/ContentExtractor.java (right): https://codereview.chromium.org/1411603004/diff/20001/java/org/chromium/disti... java/org/chromium/distiller/ContentExtractor.java:194: for (int i = 0; i < nodeList.getLength(); i ++) { nit: i++ https://codereview.chromium.org/1411603004/diff/20001/java/org/chromium/disti... File java/org/chromium/distiller/DomUtil.java (right): https://codereview.chromium.org/1411603004/diff/20001/java/org/chromium/disti... java/org/chromium/distiller/DomUtil.java:174: public static Node getNearestCommonAncestor(final NodeList ns) { Is this used any more? https://codereview.chromium.org/1411603004/diff/20001/javatests/org/chromium/... File javatests/org/chromium/distiller/ContentExtractorTest.java (right): https://codereview.chromium.org/1411603004/diff/20001/javatests/org/chromium/... javatests/org/chromium/distiller/ContentExtractorTest.java:569: "<div itemscope itemtype=\"http://schema.org/Article\" style=\"display:none\">" + article + "</div>"; Lines < 100 char please https://codereview.chromium.org/1411603004/diff/20001/javatests/org/chromium/... File javatests/org/chromium/distiller/DomUtilTest.java (right): https://codereview.chromium.org/1411603004/diff/20001/javatests/org/chromium/... javatests/org/chromium/distiller/DomUtilTest.java:7: import com.google.gwt.dom.client.NodeList; This belongs in the import section below. https://codereview.chromium.org/1411603004/diff/20001/javatests/org/chromium/... javatests/org/chromium/distiller/DomUtilTest.java:116: assertEquals(div2, DomUtil.getNearestCommonAncestor(TestUtil.nodeListToList(elementNodeList))); Line length
comments addressed
https://codereview.chromium.org/1411603004/diff/40001/javatests/org/chromium/... File javatests/org/chromium/distiller/ContentExtractorTest.java (right): https://codereview.chromium.org/1411603004/diff/40001/javatests/org/chromium/... javatests/org/chromium/distiller/ContentExtractorTest.java:654: "<h1>" + CONTENT_TEXT + "</h1>" + nit: You could probably match the indentation style of testOnlyProcessOGArticleNestedWithNestedHiddenArticleElement().
Fixed. I'm very sorry for the indentation issues. :\
On 2015/10/21 17:33:13, marcelorcorrea wrote: > On 2015/10/21 17:21:20, wychen wrote: > > Thanks for the patch. Can you provide some example web sites that have hidden > > articles? > > > > > https://codereview.chromium.org/1411603004/diff/1/javatests/org/chromium/dist... > > File javatests/org/chromium/distiller/ContentExtractorTest.java (right): > > > > > https://codereview.chromium.org/1411603004/diff/1/javatests/org/chromium/dist... > > javatests/org/chromium/distiller/ContentExtractorTest.java:650: "<h1>" + > > CONTENT_TEXT + "</h1>" + > > nit: inconsistent indentation around here. > > Hello wychen, > Sure. > This webpage here: > http://www.freep.com/story/news/nation/2015/09/30/woman-accused-tossing-newbo... > have two <article> elements, one of them is hidden and doesn't any have content. > And this webpage here: > http://www.australia.com/en-us/places/great-barrier-reef/top-10-things-to-do.... > have two items with itemtype="NewsArticle", one of them is hidden and doesn't > have content either. > > Thanks! The isVisible() checking is fast but not accurate. For example, display=='none' and opacity==0 is not inherited, but still affect the visibility. For example, the article element would be regarded as visible in the following html. <div style='display: none'><article>not this one</article></div> Have you met cases like this? I'm wondering whether it makes sense to use an accurate visibility test for this case.
On 2015/10/27 23:15:12, wychen wrote: > The isVisible() checking is fast but not accurate. For example, display=='none' > and opacity==0 is not inherited, but still affect the visibility. > For example, the article element would be regarded as visible in the following > html. > > <div style='display: none'><article>not this one</article></div> > > Have you met cases like this? I'm wondering whether it makes sense to use an > accurate visibility test for this case. The problem "display" not inherited can be solved by using getBoundingClientRect(), but we might want some profiling before making it the default mechanism. No idea how to solve opacity, but it seems less important.
Hello Wychen. I've never seen this case you mentioned, but I think this is something that can eventually happen. I've submitted a new patch that handles the "display" issue. I couldn't use getBoundingClientRect() because Element object doesn't have this method. So, I was able to get its offsets to check the element visibility. Thanks! Marcelo
On 2015/10/28 20:33:33, marcelorcorrea wrote: > Hello Wychen. > > I've never seen this case you mentioned, but I think this is something that can > eventually happen. > > I've submitted a new patch that handles the "display" issue. > > I couldn't use getBoundingClientRect() because Element object doesn't have this > method. So, I was able to get its offsets to check the element visibility. > > Thanks! > Marcelo offsetWidth/Height seems cheap enough to be put to the isVisible() function. I'll need to check whether this changes the score in the eval system first.
Actually I tried to add the offsetWidth/Height check into the isVisible() method, but it broke some unit tests. The reason was that in unit tests (or even in real cases), we usually add the html to be tested into a container (a div for example). So, the catch here is that elements without any children or hidden children will have offsetWidth/Height zero, or when a container has 'special tags', such as, an image. This image will have offsetWidth/Height zero either. e.g: <div> (this div has offsets (or boundingClientRect) zero, therefore, considered as not visible) <div style="display:none;"> text </div> </div> or: <div> <img src="http://www.somelink.com/image.jpg"> (this img has offsets (or boundingClientRect) zero, therefore, considered as not visible) </div> Since we just want not-hidden article elements, we can test the article's parent offset. That's why I created a separated function for it. As for the necessity of this extra check (offset thing), i'm not sure either if it's really necessary. Anyway, please let me know whether the score changes or not, and if there anything else I can do.
On 2015/10/29 15:36:58, marcelorcorrea wrote: > Actually I tried to add the offsetWidth/Height check into the isVisible() > method, but it broke some unit tests. > The reason was that in unit tests (or even in real cases), we usually add the > html to be tested into a container (a div for example). > So, the catch here is that elements without any children or hidden children will > have offsetWidth/Height zero, or when a container has 'special tags', such as, > an image. This image will have offsetWidth/Height zero either. > e.g: > <div> (this div has offsets (or boundingClientRect) zero, therefore, considered > as not visible) > <div style="display:none;"> text </div> > </div> > or: > <div> > <img src="http://www.somelink.com/image.jpg"> (this img has offsets (or > boundingClientRect) zero, therefore, considered as not visible) > </div> > Since we just want not-hidden article elements, we can test the article's parent > offset. That's why I created a separated function for it. Got it. In DomConverter.visitElement(), we already skip the subtree when seeing an invisible element, so I guess we are good there. I'll need to check PagingLinksFinder though, because it's not tree traversal there. > As for the necessity of this extra check (offset thing), i'm not sure either if > it's really necessary. > Anyway, please let me know whether the score changes or not, and if there > anything else I can do. The score doesn't change. I think the only change is that the likelihood of taking the fast path is increased.
https://codereview.chromium.org/1411603004/diff/80001/javatests/org/chromium/... File javatests/org/chromium/distiller/ContentExtractorTest.java (right): https://codereview.chromium.org/1411603004/diff/80001/javatests/org/chromium/... javatests/org/chromium/distiller/ContentExtractorTest.java:499: public void testOnlyProcessArticleElement() { It might be better to directly test getArticleElement() instead of indirectly through ContentExtractor. Would you mind do the following tasks? - Move getArticleElement() and the functions it depends on to DomUtil as public static functions. - Move all the getArticleElement() tests here to DomUtilTest. - Instead of doing assertExtractor(), do this: private Element getArticleElement(String html) { mBody.setInnerHTML(html); return DomUtil.getArticleElement(mRoot); } then assertNull(getArticleElement(html)); or assertNotNull(getArticleElement(htmlArticle));
getArticleElement() method moved to DomUtil class.
https://codereview.chromium.org/1411603004/diff/100001/java/org/chromium/dist... File java/org/chromium/distiller/DomUtil.java (right): https://codereview.chromium.org/1411603004/diff/100001/java/org/chromium/dist... java/org/chromium/distiller/DomUtil.java:109: return !(e.getOffsetHeight() <= 0 || e.getOffsetWidth() <= 0); I amended the current isVisible() implementation in this CL: https://codereview.chromium.org/1508963003/ You might want to rebase on top of that. https://codereview.chromium.org/1411603004/diff/100001/javatests/org/chromium... File javatests/org/chromium/distiller/DomUtilTest.java (right): https://codereview.chromium.org/1411603004/diff/100001/javatests/org/chromium... javatests/org/chromium/distiller/DomUtilTest.java:385: final String article = "<p>" + CONTENT_TEXT + "</p><p>" + CONTENT_TEXT + "</p>"; The reason I inserted these text was for ContentExtractor to recognize them as content. If we only test getArticleElement(), the html can be simplified to empty elements. For example, <h1></h1><article></article> should work. https://codereview.chromium.org/1411603004/diff/100001/javatests/org/chromium... javatests/org/chromium/distiller/DomUtilTest.java:391: assertNotNull(getArticleElement(htmlArticle)); Sorry for changing my mind back and forth. It might make sense to assert the correct element is returned, instead of just not null. https://codereview.chromium.org/1411603004/diff/100001/javatests/org/chromium... javatests/org/chromium/distiller/DomUtilTest.java:523: assertNotNull(getArticleElement(htmlArticle)); Ditto. Making sure which element is picked makes even more sense for these complicated cases.
Hello wychen That CL is not on master branch yet, can you push it so I can rebase and work on these fixes? Or should I rebase from your CL manually? Thanks!
Hello, I've applied your solution manually, tested our patch and it worked as expected. So, as soon as you push your cl, I'll rebase and submit this cl with the modifications you suggested.
Hi there, We have a pending CL here that's being blocked by another one (https://codereview.chromium.org/1508963003/). Will that CL be landing any time soon? Please take a look at it and advise what should be our next step. Thanks!
On 2016/04/18 14:15:10, mlongaray wrote: > Hi there, > We have a pending CL here that's being blocked by another one > (https://codereview.chromium.org/1508963003/). Will that CL be landing any time > soon? > Please take a look at it and advise what should be our next step. Thanks! Sorry for the delay. The other CL still has issues with validation. It might be best to amend this CL without rebasing it.
https://codereview.chromium.org/1411603004/diff/100001/java/org/chromium/dist... File java/org/chromium/distiller/DomUtil.java (right): https://codereview.chromium.org/1411603004/diff/100001/java/org/chromium/dist... java/org/chromium/distiller/DomUtil.java:109: return !(e.getOffsetHeight() <= 0 || e.getOffsetWidth() <= 0); On 2015/12/10 23:33:36, wychen wrote: > I amended the current isVisible() implementation in this CL: > https://codereview.chromium.org/1508963003/ > > You might want to rebase on top of that. Let's just use the logic from there without rebasing for now.
It's a bit hard to review if rebasing and intended modifications are mixed up.
On 2016/06/01 08:46:30, wychen wrote: > https://codereview.chromium.org/1411603004/diff/100001/java/org/chromium/dist... > File java/org/chromium/distiller/DomUtil.java (right): > > https://codereview.chromium.org/1411603004/diff/100001/java/org/chromium/dist... > java/org/chromium/distiller/DomUtil.java:109: return !(e.getOffsetHeight() <= 0 > || e.getOffsetWidth() <= 0); > On 2015/12/10 23:33:36, wychen wrote: > > I amended the current isVisible() implementation in this CL: > > https://codereview.chromium.org/1508963003/ > > > > You might want to rebase on top of that. > > Let's just use the logic from there without rebasing for now. Thanks for replying, wychen. Since it's been a while I submitted our last patch set, I rebased it on top of master and pushed another patch set addressing some concerns you had (https://codereview.chromium.org/1411603004/#msg16). However, I'm not sure I understood what you said by "use the logic from there". Should I apply your CL manually here? Or even only update the isVisible() method? If that's not the case, is there something still pending? Thanks!
On 2016/06/01 21:18:14, wychen wrote: > It's a bit hard to review if rebasing and intended modifications are mixed up. Yes, I could notice that. If it helps on anything, I was able to only see our intended modifications by setting "Left Patch Set" field as "Base" and "Right Patch Set" as "Patch Set 7". How should we proceed?
On 2016/06/01 21:35:39, marcelorcorrea wrote: > On 2016/06/01 21:18:14, wychen wrote: > > It's a bit hard to review if rebasing and intended modifications are mixed up. > > Yes, I could notice that. If it helps on anything, I was able to only see our > intended modifications by setting "Left Patch Set" field as "Base" and "Right > Patch Set" as "Patch Set 7". > > How should we proceed? Let's keep it as is for now, as I'm commenting on patch set 7. When you want to rebase/merge next time, it is advised to upload right after rebasing, and upload another patch set specifically showing your own changes.
On 2016/06/01 21:55:49, wychen wrote: > On 2016/06/01 21:35:39, marcelorcorrea wrote: > > On 2016/06/01 21:18:14, wychen wrote: > > > It's a bit hard to review if rebasing and intended modifications are mixed > up. > > > > Yes, I could notice that. If it helps on anything, I was able to only see our > > intended modifications by setting "Left Patch Set" field as "Base" and "Right > > Patch Set" as "Patch Set 7". > > > > How should we proceed? > > Let's keep it as is for now, as I'm commenting on patch set 7. > > When you want to rebase/merge next time, it is advised to upload right after > rebasing, and upload another patch set specifically showing your own changes. I see. My mistake, sorry for that.
On 2016/06/01 21:30:39, marcelorcorrea wrote: > However, I'm not sure I understood what you said by "use the logic from there". > Should I apply your CL manually here? Or even only update the isVisible() > method? Never mind. The difference is not that important in this use case. Let's keep it as is.
https://codereview.chromium.org/1411603004/diff/120001/java/org/chromium/dist... File java/org/chromium/distiller/DomUtil.java (right): https://codereview.chromium.org/1411603004/diff/120001/java/org/chromium/dist... java/org/chromium/distiller/DomUtil.java:108: public static boolean isVisibleByItsOffset(Element e) { Rename to IsVisibleByOffset()? https://codereview.chromium.org/1411603004/diff/120001/javatests/org/chromium... File javatests/org/chromium/distiller/DomUtilTest.java (right): https://codereview.chromium.org/1411603004/diff/120001/javatests/org/chromium... javatests/org/chromium/distiller/DomUtilTest.java:426: "<h1>" + CONTENT_TEXT + "</h1>" + All these tests should work without putting CONTENT_TEXT in the tags. https://codereview.chromium.org/1411603004/diff/120001/javatests/org/chromium... javatests/org/chromium/distiller/DomUtilTest.java:460: public void testOnlyProcessArticleElementMultipleWithHiddenArticleElement() { This seems a bit redundant. https://codereview.chromium.org/1411603004/diff/120001/javatests/org/chromium... javatests/org/chromium/distiller/DomUtilTest.java:474: public void testOnlyProcessOGArticle() { My bad. I mixed up open graph with schema.org back then. testOnlyProcessSchemaOrgArticle() Apply to all the schema.org tests below. https://codereview.chromium.org/1411603004/diff/120001/javatests/org/chromium... javatests/org/chromium/distiller/DomUtilTest.java:533: public void testOnlyProcessOGArticleNewsWithHiddenArticleElement() { I think we only need to test one variation of schema.org articles with hidden articles. https://codereview.chromium.org/1411603004/diff/120001/javatests/org/chromium... javatests/org/chromium/distiller/DomUtilTest.java:575: public void testOnlyProcessOGArticleBlogWithHiddenArticleElement() { ditto https://codereview.chromium.org/1411603004/diff/120001/javatests/org/chromium... javatests/org/chromium/distiller/DomUtilTest.java:622: public void testOnlyProcessOGArticleNestedWithNestedHiddenArticleElement() { Keep this, since this seems complicated enough. https://codereview.chromium.org/1411603004/diff/120001/javatests/org/chromium... javatests/org/chromium/distiller/DomUtilTest.java:650: public void testOnlyProcessOGArticleNestedWithHiddenArticleElement() { Also keep this.
https://codereview.chromium.org/1411603004/diff/120001/java/org/chromium/dist... File java/org/chromium/distiller/DomUtil.java (right): https://codereview.chromium.org/1411603004/diff/120001/java/org/chromium/dist... java/org/chromium/distiller/DomUtil.java:108: public static boolean isVisibleByItsOffset(Element e) { On 2016/06/02 05:56:01, wychen wrote: > Rename to IsVisibleByOffset()? Done. https://codereview.chromium.org/1411603004/diff/120001/javatests/org/chromium... File javatests/org/chromium/distiller/DomUtilTest.java (right): https://codereview.chromium.org/1411603004/diff/120001/javatests/org/chromium... javatests/org/chromium/distiller/DomUtilTest.java:426: "<h1>" + CONTENT_TEXT + "</h1>" + On 2016/06/02 05:56:01, wychen wrote: > All these tests should work without putting CONTENT_TEXT in the tags. We tried using only empty elements but didn't work because DomUtil.isVisible() method doesn't consider these elements as visible. That's why we kept CONTENT_TEXT inside of the elements. https://codereview.chromium.org/1411603004/diff/120001/javatests/org/chromium... javatests/org/chromium/distiller/DomUtilTest.java:460: public void testOnlyProcessArticleElementMultipleWithHiddenArticleElement() { On 2016/06/02 05:56:01, wychen wrote: > This seems a bit redundant. Done. https://codereview.chromium.org/1411603004/diff/120001/javatests/org/chromium... javatests/org/chromium/distiller/DomUtilTest.java:474: public void testOnlyProcessOGArticle() { On 2016/06/02 05:56:01, wychen wrote: > My bad. I mixed up open graph with http://schema.org back then. > testOnlyProcessSchemaOrgArticle() > > Apply to all the http://schema.org tests below. Done. https://codereview.chromium.org/1411603004/diff/120001/javatests/org/chromium... javatests/org/chromium/distiller/DomUtilTest.java:533: public void testOnlyProcessOGArticleNewsWithHiddenArticleElement() { On 2016/06/02 05:56:01, wychen wrote: > I think we only need to test one variation of http://schema.org articles with hidden > articles. Done. https://codereview.chromium.org/1411603004/diff/120001/javatests/org/chromium... javatests/org/chromium/distiller/DomUtilTest.java:575: public void testOnlyProcessOGArticleBlogWithHiddenArticleElement() { On 2016/06/02 05:56:01, wychen wrote: > ditto Done. https://codereview.chromium.org/1411603004/diff/120001/javatests/org/chromium... javatests/org/chromium/distiller/DomUtilTest.java:622: public void testOnlyProcessOGArticleNestedWithNestedHiddenArticleElement() { On 2016/06/02 05:56:01, wychen wrote: > Keep this, since this seems complicated enough. Done. https://codereview.chromium.org/1411603004/diff/120001/javatests/org/chromium... javatests/org/chromium/distiller/DomUtilTest.java:650: public void testOnlyProcessOGArticleNestedWithHiddenArticleElement() { On 2016/06/02 05:56:01, wychen wrote: > Also keep this. Done.
https://codereview.chromium.org/1411603004/diff/120001/javatests/org/chromium... File javatests/org/chromium/distiller/DomUtilTest.java (right): https://codereview.chromium.org/1411603004/diff/120001/javatests/org/chromium... javatests/org/chromium/distiller/DomUtilTest.java:426: "<h1>" + CONTENT_TEXT + "</h1>" + On 2016/06/03 16:21:57, marcelorcorrea wrote: > On 2016/06/02 05:56:01, wychen wrote: > > All these tests should work without putting CONTENT_TEXT in the tags. > > We tried using only empty elements but didn't work because DomUtil.isVisible() > method doesn't consider these elements as visible. > That's why we kept CONTENT_TEXT inside of the elements. Ah. So it's about test again. Let's use the content of isVisible() below and put into our isVisibleByOffset(). https://codereview.chromium.org/1508963003/ You can read the comment there for more details. https://codereview.chromium.org/1411603004/diff/140001/java/org/chromium/dist... File java/org/chromium/distiller/DomUtil.java (right): https://codereview.chromium.org/1411603004/diff/140001/java/org/chromium/dist... java/org/chromium/distiller/DomUtil.java:140: public static List<Element> getVisibleElements( nit: space. https://codereview.chromium.org/1411603004/diff/140001/java/org/chromium/dist... java/org/chromium/distiller/DomUtil.java:141: NodeList<Element> nodeList) { Nitpick: Why wrap here? I think we are using 100char line limit. https://codereview.chromium.org/1411603004/diff/140001/javatests/org/chromium... File javatests/org/chromium/distiller/DomUtilTest.java (right): https://codereview.chromium.org/1411603004/diff/140001/javatests/org/chromium... javatests/org/chromium/distiller/DomUtilTest.java:116: currDiv.getChild(0))); nit: wrapping. https://codereview.chromium.org/1411603004/diff/140001/javatests/org/chromium... javatests/org/chromium/distiller/DomUtilTest.java:397: "<div>Some Text</div>" + "Some Text" can also be gone after using the new isVisibleByOffset() logic.
This CL seems to fix https://bugs.chromium.org/p/chromium/issues/detail?id=616954 as well. If so, it can be added to the bug list.
Description was changed from ========== Discard hidden articles when using fast path Dom Distiller returns an empty content string for certain webpages. The shortcut logic that Dom Distiller uses when some pages have one or more <article> or <element itemscope itemtype="Article"> can lead to a erroneous content. Some web pages have hidden articles, and this ones are being considered as valid elements. In order to fix this, when Dom Distiller is searching for articles tags, it will now discard hidden article elements. BUG=544962 R=wychen@chromium.org, mdjones@chromium.org ========== to ========== Discard hidden articles when using fast path Dom Distiller returns an empty content string for certain webpages. The shortcut logic that Dom Distiller uses when some pages have one or more <article> or <element itemscope itemtype="Article"> can lead to a erroneous content. Some web pages have hidden articles, and this ones are being considered as valid elements. In order to fix this, when Dom Distiller is searching for articles tags, it will now discard hidden article elements. Contributors=marcelorcorrea@hp.com, mlongaray@hp.com BUG=544962, 616954 R=wychen@chromium.org, mdjones@chromium.org ==========
On 2016/06/04 00:19:04, wychen wrote: > This CL seems to fix > https://bugs.chromium.org/p/chromium/issues/detail?id=616954 as well. If so, it > can be added to the bug list. Hello wychen! Yes, this cl also fixes it. I added the bug number in description.
https://codereview.chromium.org/1411603004/diff/120001/javatests/org/chromium... File javatests/org/chromium/distiller/DomUtilTest.java (right): https://codereview.chromium.org/1411603004/diff/120001/javatests/org/chromium... javatests/org/chromium/distiller/DomUtilTest.java:426: "<h1>" + CONTENT_TEXT + "</h1>" + On 2016/06/03 16:49:53, wychen wrote: > On 2016/06/03 16:21:57, marcelorcorrea wrote: > > On 2016/06/02 05:56:01, wychen wrote: > > > All these tests should work without putting CONTENT_TEXT in the tags. > > > > We tried using only empty elements but didn't work because DomUtil.isVisible() > > method doesn't consider these elements as visible. > > That's why we kept CONTENT_TEXT inside of the elements. > > Ah. So it's about test again. > > Let's use the content of isVisible() below and put into our isVisibleByOffset(). > https://codereview.chromium.org/1508963003/ > You can read the comment there for more details. Done. https://codereview.chromium.org/1411603004/diff/140001/java/org/chromium/dist... File java/org/chromium/distiller/DomUtil.java (right): https://codereview.chromium.org/1411603004/diff/140001/java/org/chromium/dist... java/org/chromium/distiller/DomUtil.java:140: public static List<Element> getVisibleElements( On 2016/06/03 16:49:53, wychen wrote: > nit: space. Done. https://codereview.chromium.org/1411603004/diff/140001/java/org/chromium/dist... java/org/chromium/distiller/DomUtil.java:141: NodeList<Element> nodeList) { On 2016/06/03 16:49:53, wychen wrote: > Nitpick: Why wrap here? I think we are using 100char line limit. Done. https://codereview.chromium.org/1411603004/diff/140001/javatests/org/chromium... File javatests/org/chromium/distiller/DomUtilTest.java (right): https://codereview.chromium.org/1411603004/diff/140001/javatests/org/chromium... javatests/org/chromium/distiller/DomUtilTest.java:116: currDiv.getChild(0))); On 2016/06/03 16:49:53, wychen wrote: > nit: wrapping. Done.
https://codereview.chromium.org/1411603004/diff/160001/java/org/chromium/dist... File java/org/chromium/distiller/DomUtil.java (right): https://codereview.chromium.org/1411603004/diff/160001/java/org/chromium/dist... java/org/chromium/distiller/DomUtil.java:146: public static List<Element> getVisibleElements(NodeList<Element> nodeList) { nit: I meant the double space public static List<Element> https://codereview.chromium.org/1411603004/diff/160001/javatests/org/chromium... File javatests/org/chromium/distiller/DomUtilTest.java (right): https://codereview.chromium.org/1411603004/diff/160001/javatests/org/chromium... javatests/org/chromium/distiller/DomUtilTest.java:19: private static final String CONTENT_TEXT = "Lorem Ipsum Lorem Ipsum Lorem Ipsum."; Is this still used? https://codereview.chromium.org/1411603004/diff/160001/javatests/org/chromium... File javatests/org/chromium/distiller/webdocument/DomConverterTest.java (right): https://codereview.chromium.org/1411603004/diff/160001/javatests/org/chromium... javatests/org/chromium/distiller/webdocument/DomConverterTest.java:49: String html = "<div style=\"display:none\">invisible parent" + Why changing this file? These are necessary for the changes of isVisible(), which is not here. https://codereview.chromium.org/1411603004/diff/160001/javatests/org/chromium... javatests/org/chromium/distiller/webdocument/DomConverterTest.java:133: mBody.appendChild(container); Good catch! How did you find this?
https://codereview.chromium.org/1411603004/diff/160001/java/org/chromium/dist... File java/org/chromium/distiller/DomUtil.java (right): https://codereview.chromium.org/1411603004/diff/160001/java/org/chromium/dist... java/org/chromium/distiller/DomUtil.java:146: public static List<Element> getVisibleElements(NodeList<Element> nodeList) { On 2016/06/06 18:11:22, wychen wrote: > nit: I meant the double space > public static List<Element> Done. https://codereview.chromium.org/1411603004/diff/160001/javatests/org/chromium... File javatests/org/chromium/distiller/DomUtilTest.java (right): https://codereview.chromium.org/1411603004/diff/160001/javatests/org/chromium... javatests/org/chromium/distiller/DomUtilTest.java:19: private static final String CONTENT_TEXT = "Lorem Ipsum Lorem Ipsum Lorem Ipsum."; On 2016/06/06 18:11:22, wychen wrote: > Is this still used? Done. https://codereview.chromium.org/1411603004/diff/160001/javatests/org/chromium... File javatests/org/chromium/distiller/webdocument/DomConverterTest.java (right): https://codereview.chromium.org/1411603004/diff/160001/javatests/org/chromium... javatests/org/chromium/distiller/webdocument/DomConverterTest.java:49: String html = "<div style=\"display:none\">invisible parent" + On 2016/06/06 18:11:22, wychen wrote: > Why changing this file? These are necessary for the changes of isVisible(), > which is not here. Done. https://codereview.chromium.org/1411603004/diff/160001/javatests/org/chromium... javatests/org/chromium/distiller/webdocument/DomConverterTest.java:133: mBody.appendChild(container); On 2016/06/06 18:11:23, wychen wrote: > Good catch! How did you find this? Actually, I copied from your CL. Sorry for changing that!
https://codereview.chromium.org/1411603004/diff/160001/javatests/org/chromium... File javatests/org/chromium/distiller/webdocument/DomConverterTest.java (right): https://codereview.chromium.org/1411603004/diff/160001/javatests/org/chromium... javatests/org/chromium/distiller/webdocument/DomConverterTest.java:133: mBody.appendChild(container); On 2016/06/06 19:21:59, marcelorcorrea wrote: > On 2016/06/06 18:11:23, wychen wrote: > > Good catch! How did you find this? > > Actually, I copied from your CL. > Sorry for changing that! lol. I totally forget this. Well, since we don't have to touch this file for this CL, I'd rather keep this line unchanged for now. https://codereview.chromium.org/1411603004/diff/160001/javatests/org/chromium... javatests/org/chromium/distiller/webdocument/DomConverterTest.java:156: mBody.appendChild(container); ditto https://codereview.chromium.org/1411603004/diff/180001/java/org/chromium/dist... File java/org/chromium/distiller/DomUtil.java (right): https://codereview.chromium.org/1411603004/diff/180001/java/org/chromium/dist... java/org/chromium/distiller/DomUtil.java:115: || e.getOffsetWidth() != 0; Nitpick: rewrap this function to 100char limit.
https://codereview.chromium.org/1411603004/diff/200001/java/org/chromium/dist... File java/org/chromium/distiller/DomUtil.java (right): https://codereview.chromium.org/1411603004/diff/200001/java/org/chromium/dist... java/org/chromium/distiller/DomUtil.java:110: // Using offsetParent alone wouldn't work because it's also null when position is fixed. nit: double space. null when
https://codereview.chromium.org/1411603004/diff/200001/java/org/chromium/dist... File java/org/chromium/distiller/DomUtil.java (right): https://codereview.chromium.org/1411603004/diff/200001/java/org/chromium/dist... java/org/chromium/distiller/DomUtil.java:110: // Using offsetParent alone wouldn't work because it's also null when position is fixed. nit: double space. null when
https://codereview.chromium.org/1411603004/diff/200001/java/org/chromium/dist... File java/org/chromium/distiller/DomUtil.java (right): https://codereview.chromium.org/1411603004/diff/200001/java/org/chromium/dist... java/org/chromium/distiller/DomUtil.java:110: // Using offsetParent alone wouldn't work because it's also null when position is fixed. On 2016/06/07 13:43:11, wychen wrote: > nit: double space. > null when Done.
lgtm with nits. https://codereview.chromium.org/1411603004/diff/220001/javatests/org/chromium... File javatests/org/chromium/distiller/DomUtilTest.java (right): https://codereview.chromium.org/1411603004/diff/220001/javatests/org/chromium... javatests/org/chromium/distiller/DomUtilTest.java:19: nit: extra line https://codereview.chromium.org/1411603004/diff/220001/javatests/org/chromium... javatests/org/chromium/distiller/DomUtilTest.java:117: nit: extra line
https://codereview.chromium.org/1411603004/diff/220001/javatests/org/chromium... File javatests/org/chromium/distiller/DomUtilTest.java (right): https://codereview.chromium.org/1411603004/diff/220001/javatests/org/chromium... javatests/org/chromium/distiller/DomUtilTest.java:19: On 2016/06/07 16:20:08, wychen wrote: > nit: extra line Done. https://codereview.chromium.org/1411603004/diff/220001/javatests/org/chromium... javatests/org/chromium/distiller/DomUtilTest.java:117: On 2016/06/07 16:20:08, wychen wrote: > nit: extra line Done.
Fix typos in cl description, otherwise lgtm. "and this ones" -> "and these" "to a erroneous" -> "to erroneous" "for articles tags" -> "for article tags"
Description was changed from ========== Discard hidden articles when using fast path Dom Distiller returns an empty content string for certain webpages. The shortcut logic that Dom Distiller uses when some pages have one or more <article> or <element itemscope itemtype="Article"> can lead to a erroneous content. Some web pages have hidden articles, and this ones are being considered as valid elements. In order to fix this, when Dom Distiller is searching for articles tags, it will now discard hidden article elements. Contributors=marcelorcorrea@hp.com, mlongaray@hp.com BUG=544962, 616954 R=wychen@chromium.org, mdjones@chromium.org ========== to ========== Discard hidden articles when using fast path Dom Distiller returns an empty content string for certain webpages. The shortcut logic that Dom Distiller uses when some pages have one or more <article> or <element itemscope itemtype="Article"> can lead to erroneous content. Some web pages have hidden articles, and these are being considered as valid elements. In order to fix this, when Dom Distiller is searching for article tags, it will now discard hidden article elements. Contributors=marcelorcorrea@hp.com, mlongaray@hp.com BUG=544962, 616954 R=wychen@chromium.org, mdjones@chromium.org ==========
On 2016/06/07 23:02:40, mdjones wrote: > Fix typos in cl description, otherwise lgtm. > > "and this ones" -> "and these" > "to a erroneous" -> "to erroneous" > "for articles tags" -> "for article tags" Done. Sorry for the typos. :\
Description was changed from ========== Discard hidden articles when using fast path Dom Distiller returns an empty content string for certain webpages. The shortcut logic that Dom Distiller uses when some pages have one or more <article> or <element itemscope itemtype="Article"> can lead to erroneous content. Some web pages have hidden articles, and these are being considered as valid elements. In order to fix this, when Dom Distiller is searching for article tags, it will now discard hidden article elements. Contributors=marcelorcorrea@hp.com, mlongaray@hp.com BUG=544962, 616954 R=wychen@chromium.org, mdjones@chromium.org ========== to ========== Discard hidden articles when using fast path Dom Distiller returns an empty content string for certain webpages. The shortcut logic that Dom Distiller uses when some pages have one or more <article> or <element itemscope itemtype="Article"> can lead to erroneous content. Some web pages have hidden articles, and these are being considered as valid elements. In order to fix this, when Dom Distiller is searching for article tags, it will now discard hidden article elements. Contributors=marcelorcorrea@hp.com, mlongaray@hp.com BUG=544962, 616954 R=wychen@chromium.org, mdjones@chromium.org ==========
No change in eval scores. Landing.
Description was changed from ========== Discard hidden articles when using fast path Dom Distiller returns an empty content string for certain webpages. The shortcut logic that Dom Distiller uses when some pages have one or more <article> or <element itemscope itemtype="Article"> can lead to erroneous content. Some web pages have hidden articles, and these are being considered as valid elements. In order to fix this, when Dom Distiller is searching for article tags, it will now discard hidden article elements. Contributors=marcelorcorrea@hp.com, mlongaray@hp.com BUG=544962, 616954 R=wychen@chromium.org, mdjones@chromium.org ========== to ========== Discard hidden articles when using fast path Dom Distiller returns an empty content string for certain webpages. The shortcut logic that Dom Distiller uses when some pages have one or more <article> or <element itemscope itemtype="Article"> can lead to erroneous content. Some web pages have hidden articles, and these are being considered as valid elements. In order to fix this, when Dom Distiller is searching for article tags, it will now discard hidden article elements. Contributors=marcelorcorrea@hp.com, mlongaray@hp.com BUG=544962, 616954 R=mdjones@chromium.org, wychen@chromium.org Committed: 54d05ba208e431963af489785edf036f9fda3cb7 ==========
Message was sent while issue was closed.
Committed patchset #13 (id:240001) manually as 54d05ba208e431963af489785edf036f9fda3cb7 (presubmit successful). |