Chromium Code Reviews
chromiumcodereview-hr@appspot.gserviceaccount.com (chromiumcodereview-hr) | Please choose your nickname with Settings | Help | Chromium Project | Gerrit Changes | Sign out
(110)

Issue 1705123002: Add support for Schema.org/Recipe

Created:
4 years, 10 months ago by dalmirsilva
Modified:
4 years, 5 months ago
CC:
marcelorcorrea
Base URL:
https://github.com/chromium/dom-distiller.git@master
Target Ref:
refs/heads/master
Visibility:
Public.

Description

Add support for Schema.org/Recipe There is a large number of web pages around internet which uses some sort of structure or schema. In these cases, extracting the relevant content should not be an homeric task as it normaly is. DomDistiller should take advantage of this by supporting those structured data schemas, instead of blindness looking around for relevant content. It should take the shortcut. DomDistiller parses the whole document once for each structured data parser (which are 3). But it uses these parses only to extract some minor information such as title and author. The idea is to use these parsers to extract the content itself. This CL aims to support content extraction from these parsers. If any of them can retrieve the content from a page, we use it instead of executing the normal flow. Also, the idea is to allow the parsers to be extensible and gradatively support different kind of content types. In this CL, we have added support for schema.org/Recipe, which currently appears to be a relevant issue. There are a lot of recipe webpages that follow the schema.org convention, and in most of these pages DomDistiller performes a very poor job by removing some key relevant content, such as ingredients or prep times, or even discarding the whole content. Contributors=marcelorcorrea@hp.com; dalmirsilva@hp.com BUG=397173 R=wychen@chromium.org, mdjones@chromium.org

Patch Set 1 #

Total comments: 6

Patch Set 2 : wychen's comments #

Patch Set 3 : merged from master #

Patch Set 4 : activate only for English pages #

Total comments: 9

Patch Set 5 : Merged with Master #

Patch Set 6 : wychen's comments addressed #

Total comments: 5
Unified diffs Side-by-side diffs Delta from patch set Stats (+729 lines, -55 lines) Patch
M java/org/chromium/distiller/ContentExtractor.java View 1 2 3 4 1 chunk +7 lines, -0 lines 2 comments Download
M java/org/chromium/distiller/DomUtil.java View 1 2 3 4 5 3 chunks +59 lines, -0 lines 2 comments Download
M java/org/chromium/distiller/IEReadingViewParser.java View 1 2 1 chunk +5 lines, -0 lines 0 comments Download
A java/org/chromium/distiller/MarkupGenerator.java View 3 4 1 chunk +76 lines, -0 lines 0 comments Download
M java/org/chromium/distiller/MarkupParser.java View 1 2 2 chunks +13 lines, -0 lines 0 comments Download
M java/org/chromium/distiller/OpenGraphProtocolParserAccessor.java View 1 2 1 chunk +5 lines, -0 lines 0 comments Download
M java/org/chromium/distiller/SchemaOrgParser.java View 1 2 12 chunks +197 lines, -15 lines 0 comments Download
M java/org/chromium/distiller/SchemaOrgParserAccessor.java View 1 2 3 4 5 3 chunks +33 lines, -0 lines 1 comment Download
M javatests/org/chromium/distiller/DomUtilTest.java View 1 2 3 4 2 chunks +47 lines, -0 lines 0 comments Download
M javatests/org/chromium/distiller/SchemaOrgParserAccessorTest.java View 1 2 3 18 chunks +287 lines, -40 lines 0 comments Download

Messages

Total messages: 30 (7 generated)
wychen
This looks really interesting. I'm just curious how popular this og:Recipe is, and how we ...
4 years, 10 months ago (2016-02-18 00:42:06 UTC) #3
marcelorcorrea
Hello wychen, Thanks for your reply. When you say "og:Recipe", I believe you mean Schema.org/Recipe, ...
4 years, 10 months ago (2016-02-18 15:06:27 UTC) #4
dalmirdasilva
On 2016/02/18 15:06:27, marcelorcorrea wrote: > Hello wychen, > Thanks for your reply. When you ...
4 years, 10 months ago (2016-02-24 17:58:42 UTC) #5
dalmirdasilva
+Marcelo.
4 years, 10 months ago (2016-02-24 17:59:47 UTC) #7
wychen
I did some benchmark on our dataset. This CL is called "recipe", and the lazy ...
4 years, 9 months ago (2016-03-09 07:29:03 UTC) #8
dalmirsilva
Nice benchmarks, thanks for sharing the result with us! It is not completely unexpected that ...
4 years, 9 months ago (2016-03-14 20:30:50 UTC) #9
wychen
On 2016/03/14 20:30:50, dalmirsilva wrote: > Nice benchmarks, thanks for sharing the result with us! ...
4 years, 9 months ago (2016-03-14 22:40:01 UTC) #10
wychen
https://codereview.chromium.org/1705123002/diff/1/java/org/chromium/distiller/ContentExtractor.java File java/org/chromium/distiller/ContentExtractor.java (right): https://codereview.chromium.org/1705123002/diff/1/java/org/chromium/distiller/ContentExtractor.java#newcode90 java/org/chromium/distiller/ContentExtractor.java:90: String structuredData = parser.getStructuredData(); Might make sense to measure ...
4 years, 9 months ago (2016-03-14 22:58:42 UTC) #11
dalmirsilva
That is a interesting idea, but we didn't use it because there is no guarantee ...
4 years, 8 months ago (2016-03-28 19:11:18 UTC) #12
wychen
I think it worth a try to make some heuristics to extract the original language ...
4 years, 8 months ago (2016-03-28 20:29:41 UTC) #13
dalmirsilva
Hello, wychen! We've tried to label the elements in a new filter. But it didn't ...
4 years, 8 months ago (2016-03-30 17:48:46 UTC) #14
dalmirsilva
Hello, wychen! We've tried to label the elements in a new filter. But it didn't ...
4 years, 8 months ago (2016-03-30 17:49:31 UTC) #15
wychen
On 2016/03/30 17:49:31, dalmirsilva wrote: > Hello, wychen! > > We've tried to label the ...
4 years, 8 months ago (2016-04-07 00:16:38 UTC) #16
dalmirdasilva
On 2016/04/07 00:16:38, wychen wrote: > On 2016/03/30 17:49:31, dalmirsilva wrote: > > Hello, wychen! ...
4 years, 7 months ago (2016-05-12 16:16:02 UTC) #17
wychen
On 2016/05/12 16:16:02, dalmirdasilva wrote: > The approach of labeling itemprops elements as VERY_LIKELY_RELEVANT has ...
4 years, 7 months ago (2016-05-12 22:51:57 UTC) #18
dalmirsilva
Hi, wychen! Yes, we agree with your concerns about regression issues. Therefore, taking in consideration ...
4 years, 7 months ago (2016-05-19 14:19:02 UTC) #20
wychen
Hi, It looks you have merged or rebased between patch set 2 and 3. One ...
4 years, 7 months ago (2016-05-20 09:17:46 UTC) #21
dalmirsilva
We deleted the previous patch set, rebased, submitted, applied our changes and then submitted again. ...
4 years, 7 months ago (2016-05-23 18:57:34 UTC) #24
wychen
Now that the concern about language is fixed, have you tried fixing the speed regression? ...
4 years, 6 months ago (2016-05-31 21:56:34 UTC) #25
wychen
Since we are talking about language detection, it might also solve this bug here: https://bugs.chromium.org/p/chromium/issues/detail?id=482217
4 years, 6 months ago (2016-06-03 23:58:05 UTC) #26
dalmirsilva
Hello wychen! About the speed regression: we took a look on this issue and we ...
4 years, 5 months ago (2016-07-06 17:53:34 UTC) #28
wychen
On 2016/07/06 17:53:34, dalmirsilva wrote: > Hello wychen! > > About the speed regression: we ...
4 years, 5 months ago (2016-07-24 22:59:04 UTC) #29
wychen
4 years, 5 months ago (2016-07-24 23:06:34 UTC) #30
https://codereview.chromium.org/1705123002/diff/100001/java/org/chromium/dist...
File java/org/chromium/distiller/DomUtil.java (right):

https://codereview.chromium.org/1705123002/diff/100001/java/org/chromium/dist...
java/org/chromium/distiller/DomUtil.java:481: public static String
getLanguage(Element root) {
On 2016/07/06 17:53:34, dalmirsilva wrote:
> On 2016/05/31 21:56:34, wychen wrote:
> > If the language is specified in http header, instead of inside html, does
this
> > still gets it? Sadly it might not be possible to test in our unit test
> > environment.
> 
> Unfortunately it doesn't get it.  We couldn't find a a solution for getting
the
> content-language from the http-header without making extra requests, since Dom
> Distiller is manipulating the DOM already loaded and ready for work.

Got it. I guess this is our technical limitation.

https://codereview.chromium.org/1705123002/diff/160001/java/org/chromium/dist...
File java/org/chromium/distiller/ContentExtractor.java (right):

https://codereview.chromium.org/1705123002/diff/160001/java/org/chromium/dist...
java/org/chromium/distiller/ContentExtractor.java:89: 
nit: extra line

https://codereview.chromium.org/1705123002/diff/160001/java/org/chromium/dist...
java/org/chromium/distiller/ContentExtractor.java:92: LogUtil.addTimingInfo(now,
mTimingInfo, "parser.getStructuredData()");
Maybe just "getStructuredData" for consistency.

https://codereview.chromium.org/1705123002/diff/160001/java/org/chromium/dist...
File java/org/chromium/distiller/DomUtil.java (right):

https://codereview.chromium.org/1705123002/diff/160001/java/org/chromium/dist...
java/org/chromium/distiller/DomUtil.java:508: result.add(matchResult.getGroup(1)
+ " year(s)");
Can we handle plural forms?

https://codereview.chromium.org/1705123002/diff/160001/java/org/chromium/dist...
java/org/chromium/distiller/DomUtil.java:545: NodeList<Element> languages =
DomUtil.querySelectorAll(root, query);
Would it be faster if we only handle <head> instead of <html>? I haven't tried.

https://codereview.chromium.org/1705123002/diff/160001/java/org/chromium/dist...
File java/org/chromium/distiller/SchemaOrgParserAccessor.java (right):

https://codereview.chromium.org/1705123002/diff/160001/java/org/chromium/dist...
java/org/chromium/distiller/SchemaOrgParserAccessor.java:22: private static
List<SchemaOrgParser.Type> supportedTypes;
I suspect this still causes some slowness.
Would lazy initialization make it faster?

Powered by Google App Engine
This is Rietveld 408576698