Created: 4 years, 10 months ago by dalmirsilva
Modified: 4 years, 5 months ago
CC: marcelorcorrea
Base URL: https://github.com/chromium/dom-distiller.git@master
Target Ref: refs/heads/master
Visibility: Public.
Description
Add support for Schema.org/Recipe
There are a large number of web pages on the internet that use some
sort of structure or schema. In these cases, extracting the relevant
content should not be the Herculean task it normally is.
DomDistiller should take advantage of this by supporting those
structured data schemas instead of blindly searching for
relevant content. It should take the shortcut.
DomDistiller parses the whole document once for each structured data
parser (there are three of them), but it uses these parsers only to
extract minor information such as the title and author. The idea is to
use these parsers to extract the content itself.
This CL aims to support content extraction through these parsers. If any
of them can retrieve the content from a page, we use it instead of
executing the normal flow.
The idea is also to make the parsers extensible so that they gradually
support different kinds of content types.
In this CL, we have added support for schema.org/Recipe, which currently
appears to be a relevant case. There are a lot of recipe
webpages that follow the schema.org convention, and on most of these
pages DomDistiller performs very poorly, removing key
content such as ingredients or prep times, or even discarding
the whole content.
Contributors=marcelorcorrea@hp.com; dalmirsilva@hp.com
BUG=397173
R=wychen@chromium.org, mdjones@chromium.org
Patch Set 1 #
Total comments: 6
Patch Set 2 : wychen's comments #
Patch Set 3 : merged from master #
Patch Set 4 : activate only for English pages #
Total comments: 9
Patch Set 5 : Merged with Master #
Patch Set 6 : wychen's comments addressed #
Total comments: 5
Messages
Total messages: 30 (7 generated)
Description was changed (initial reviewer list: R=wychen@chromium.org, kuan@chromium.org, jochen@chromium.org, bengr@chromium.org, nyquist@chromium.org, gene@chromium.org); the description text is identical to the one above.
wychen@chromium.org changed reviewers: + mdjones@chromium.org - bengr@chromium.org, gene@chromium.org, jochen@chromium.org, kuan@chromium.org, nyquist@chromium.org
This looks really interesting. I'm just curious how popular this og:Recipe is, and how we are going to test this against the real-world corpus. BTW, the og parser is a bit slow; it's probably time to convert to lazy evaluation, and to make sure this doesn't slow down non-recipe pages.
Hello wychen,
Thanks for your reply. When you say "og:Recipe", I believe you mean Schema.org/Recipe, correct?
As for the popularity of these recipes, we have some sample recipe web sites that are heavily visited and follow the Schema.org conventions. For example:
http://allrecipes.com/recipe/34159/spicy-oven-fried-chicken/?internalSource=r...
http://www.getmecooking.com/recipes
http://www.recipe.com/quick-chicken-tortilla-bake/
http://www.bettycrocker.com/recipes/slow-cooker-beef-bourguignon/19bdb920-7f6...
http://www.aspicyperspective.com/oven-chicken-wings-apple-onion-dip/2
http://peachesplease.com/angel-food-cake-french-toast/
As for non-recipe web pages, I don't think this solution would impact their extraction performance, since we are using a mechanism that is already in place. We have just added some fields that we want to keep when the parser is running; in this case, the recipe fields.
Regarding making the parsers lazy: how about creating another Bug/CL, since this one is already somewhat extensive to review and doesn't affect the current flow?
On 2016/02/18 15:06:27, marcelorcorrea wrote:
> Hello wychen,
> Thanks for your reply. When you say "og:Recipe", I believe you mean
> Schema.org/Recipe, correct?
> As for the popularity of these recipes, we have some sample recipe web sites
> that are heavily visited and follow the Schema.org conventions.
> For example:
> http://allrecipes.com/recipe/34159/spicy-oven-fried-chicken/?internalSource=r...
> http://www.getmecooking.com/recipes
> http://www.recipe.com/quick-chicken-tortilla-bake/
> http://www.bettycrocker.com/recipes/slow-cooker-beef-bourguignon/19bdb920-7f6...
> http://www.aspicyperspective.com/oven-chicken-wings-apple-onion-dip/2
> http://peachesplease.com/angel-food-cake-french-toast/
>
> As for non-recipe web pages, I don't think this solution would impact their
> extraction performance, since we are using a mechanism that is already in
> place. We have just added some fields that we want to keep when the parser is
> running; in this case, the recipe fields.
>
> Regarding making the parsers lazy: how about creating another Bug/CL, since
> this one is already somewhat extensive to review and doesn't affect the
> current flow?

Wychen, I agree with Marcelo: this CL doesn't impact performance, since right now all parsers are being run anyway. I saw you created a new bug to make Schema.org lazily parsed, which is an interesting feature; we can work on it separately. However, just to make it clearer: today we ask the parsers for 'title', 'author', etc., which could be found by any parser other than Schema.org (and if found, Schema.org doesn't need to be executed). But since this specific CL adds support for content extraction, and that is only implemented in Schema.org (the others return empty), we will always execute Schema.org until we implement it in the other parsers as well.
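The flow being discussed here, asking each markup parser for structured content and falling back to the heuristic pipeline only when all of them return empty, can be sketched roughly as follows. The interface and method names are illustrative only and do not match the actual dom-distiller API.

```java
import java.util.List;

public class StructuredExtractorSketch {
    // Hypothetical minimal interface; the real parsers expose more accessors.
    interface MarkupParser {
        String getStructuredData(); // "" when the parser has nothing to offer
    }

    static String extractContent(List<MarkupParser> parsers, String heuristicResult) {
        for (MarkupParser parser : parsers) {
            String structured = parser.getStructuredData();
            if (structured != null && !structured.isEmpty()) {
                return structured; // shortcut: skip the heuristic pipeline
            }
        }
        return heuristicResult; // normal flow when no parser recognized the page
    }
}
```

As the message above notes, because only the Schema.org parser currently implements `getStructuredData()`, this loop would reach it on every page until the other parsers implement it too.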
dalmirdasilva@gmail.com changed reviewers: + dalmirdasilva@gmail.com
+Marcelo.
I did some benchmarks on our dataset. This CL is called "recipe", and the lazy evaluation is called "lazy". "lazyrecipe" merges the two, and "lazy-oglater" just changes the order. The git tree is shown below:

* 20508eb - (9 hours ago) og after schema.org (lazy-oglater)
| * 19cccd4 - (11 hours ago) fix merging (lazyrecipe)
| * 0021308 - (12 hours ago) Merge branch 'recipe' into lazyrecipe
| |\
|/ /
| * 513d543 - (13 days ago) Add support for Schema.org/Recipe (recipe)
* | 0f30775 - (12 hours ago) Lazy evaluation of Parsers (lazy)
|/
* 9101ca4 - (13 days ago) Fix spelling in comments (master)

Total time:
0f30775: Time = 99.781189 +- 0.162174 ms, averaged over 100 runs
19cccd4: Time = 101.354524 +- 0.159249 ms, averaged over 100 runs
20508eb: Time = 100.806103 +- 0.172367 ms, averaged over 100 runs
513d543: Time = 101.290721 +- 0.166323 ms, averaged over 100 runs
9101ca4: Time = 99.978039 +- 0.140684 ms, averaged over 100 runs

"recipe" is 1.3% slower than TOT, and "lazy" is 0.2% faster. "lazyrecipe" is slower than "recipe": since the OpenGraph parser returns "" from getStructuredData(), the schema.org parser is always forced to run, so we gained nothing by making it lazy. Running schema.org first and then OpenGraph could avoid this, but that turned out to be slower.

SchemaOrgParserAccessor time:
0f30775: Time = 0.083602 +- 0.000120 ms, averaged over 100 runs
19cccd4: Time = 1.382156 +- 0.002023 ms, averaged over 100 runs
20508eb: Time = 0.270443 +- 0.000334 ms, averaged over 100 runs
513d543: Time = 4.210656 +- 0.006692 ms, averaged over 100 runs
9101ca4: Time = 3.511269 +- 0.005107 ms, averaged over 100 runs

If the parser were lazy, this should be almost zero. For "lazyrecipe" it is much higher than zero, so it might be worth investigating what went wrong. This can be observed on all the entries in our dataset, so I guess you could reproduce and debug this locally. I suspect the supportedTypes initialization, but I haven't tried it.
OpenGraphProtocolParser time:
0f30775: Time = 0.275534 +- 0.000350 ms, averaged over 100 runs
19cccd4: Time = 0.275540 +- 0.000363 ms, averaged over 100 runs
20508eb: Time = 0.089264 +- 0.000153 ms, averaged over 100 runs
513d543: Time = 6.659252 +- 0.009092 ms, averaged over 100 runs
9101ca4: Time = 6.649062 +- 0.008637 ms, averaged over 100 runs

All as expected here.

OpenGraphProtocolParser.parse time:
0f30775: Time = 5.987436 +- 0.008167 ms, averaged over 100 runs
19cccd4: Time = 5.433146 +- 0.007546 ms, averaged over 100 runs
20508eb: Time = 4.670875 +- 0.007086 ms, averaged over 100 runs
513d543: Time = 6.470541 +- 0.009614 ms, averaged over 100 runs
9101ca4: Time = 6.460412 +- 0.009248 ms, averaged over 100 runs

SchemaOrgParser.parse time:
0f30775: Time = 1.551841 +- 0.003426 ms, averaged over 100 runs
19cccd4: Time = 2.138156 +- 0.003962 ms, averaged over 100 runs
20508eb: Time = 2.385430 +- 0.004932 ms, averaged over 100 runs
513d543: Time = 2.133204 +- 0.004898 ms, averaged over 100 runs
9101ca4: Time = 2.011774 +- 0.003853 ms, averaged over 100 runs

We can see some saved parse time with lazy eval, as expected.
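For readers unfamiliar with the `Time = X +- Y ms, averaged over N runs` format in these benchmarks: it reports a mean plus an uncertainty. Assuming the ± figure is the standard error of the mean (an assumption; the actual benchmark harness is not shown in this thread), it can be computed like this:

```java
public class BenchStats {
    /** Returns {mean, standard error of the mean} for a series of timings. */
    static double[] meanAndStderr(double[] samples) {
        int n = samples.length;
        double sum = 0;
        for (double s : samples) sum += s;
        double mean = sum / n;
        double sq = 0;
        for (double s : samples) sq += (s - mean) * (s - mean);
        double variance = sq / (n - 1);          // unbiased sample variance
        double stderr = Math.sqrt(variance / n); // standard error of the mean
        return new double[] {mean, stderr};
    }
}
```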
Nice benchmarks, thanks for sharing the results with us!
It is not completely unexpected that "recipe" is slower, as this feature adds some new routines to the code. But even so, it is still worth considering the following points:

1. Nowadays there are not many cases where we can really be sure about the Dom Distiller results for a given page. We have a lot of examples of missing content, etc. Therefore, correctly parsing recipes (and in the future other kinds of content, like Articles) is a really valuable thing. This feature adds value to Dom Distiller, so it is reasonable that it could cost a little bit in performance.

2. When we ask the parsers for structured data, we only run the Schema.org parser if there are microdata properties (ITEMPROP, ITEMSCOPE) in the document. So the impact on sites that don't support Schema.org is not that big. But for those that are Schema.org/Recipe, the impact would be positive: it is faster to run the parser than to run all the heuristics. Therefore, to be fair, we need to include some examples of Schema.org/Recipe in the test set. The number of examples should be proportional to the real-life number of such sites.

3. Even running the parsers lazily, there is still a high probability of executing all parsers, because the getMarkupInfo method is called. Sometimes more than one parser is needed because the previous one wasn't able to retrieve some content (it returns an empty string).

4. This CL adds the infrastructure needed to let us extend the functionality to other structured content such as NewsArticle, Article, Movies, etc.
On 2016/03/14 20:30:50, dalmirsilva wrote:
> Nice benchmarks, thanks for sharing the results with us!
> It is not completely unexpected that "recipe" is slower, as this feature adds
> some new routines to the code. But even so, it is still worth considering the
> following points:
>
> 1. Nowadays there are not many cases where we can really be sure about the
> Dom Distiller results for a given page. We have a lot of examples of
> missing content, etc. Therefore, correctly parsing recipes
> (and in the future other kinds of content, like Articles) is a really
> valuable thing. This feature adds value to Dom Distiller, so it is
> reasonable that it could cost a little bit in performance.

I do agree that adding support for structured recipes is useful. It's just that we want to minimize the performance impact, especially for unrelated pages (pages which are not recipes). For changes that benefit a huge percentage of pages, we can justify a >1% performance regression. Pages with schema.org/Recipe seem to be a niche market, so we have tighter criteria for a performance regression. I don't have the numbers at hand right now, but I guess it's <5% of all distillable pages. Maybe it's higher for pages that people tend to print. The speed difference might not be easily noticeable on desktop, but we certainly don't want to make distillation slower on mobile devices than it already is. As a start, it might be worth investigating why the SchemaOrgParserAccessor time for "lazyrecipe" is much higher than zero.

> 2. When we ask the parsers for structured data, we only run the
> Schema.org parser if there are microdata properties (ITEMPROP,
> ITEMSCOPE) in the document. So the impact on sites that don't
> support Schema.org is not that big. But for those that are
> Schema.org/Recipe, the impact would be positive: it is faster to run
> the parser than to run all the heuristics. Therefore, to be fair, we need
> to include some examples of Schema.org/Recipe in the test set.
> The number of examples should be proportional to the real-life number of
> such sites.

In the dataset I tested, none of the pages have schema.org/Recipe. As I described in crbug.com/593457, we do need to have a dataset specific to performance benchmarking. Code size also has a negative impact overall, since the generated JS code needs to be parsed and compiled by V8, even if that code path is not taken. Pre-compiling the JS code is one of the things we want to do but are short-handed for. Ref: crbug.com/594777.

> 3. Even running the parsers lazily, there is still a high
> probability of executing all parsers, because the getMarkupInfo method
> is called. Sometimes more than one parser is needed because the
> previous one wasn't able to retrieve some content
> (it returns an empty string).

I think running schema.org first and then Open Graph could be faster, but our dataset is biased: all the entries contain Open Graph tags, so running Open Graph first is faster for this dataset. After the eval system is open sourced (crbug.com/594779), maybe it will be possible for you guys to run benchmarks.

> 4. This CL adds the infrastructure needed to let us extend the
> functionality to other structured content such as NewsArticle,
> Article, Movies, etc.
https://codereview.chromium.org/1705123002/diff/1/java/org/chromium/distiller...
File java/org/chromium/distiller/ContentExtractor.java (right):

https://codereview.chromium.org/1705123002/diff/1/java/org/chromium/distiller...
java/org/chromium/distiller/ContentExtractor.java:90: String structuredData = parser.getStructuredData();
It might make sense to measure the time spent in this section and record it in TimingInfo.

https://codereview.chromium.org/1705123002/diff/1/java/org/chromium/distiller...
File java/org/chromium/distiller/DomUtil.java (right):

https://codereview.chromium.org/1705123002/diff/1/java/org/chromium/distiller...
java/org/chromium/distiller/DomUtil.java:438: public static String formatDuration(String duration) {
I18n and l10n might be difficult, and singular/plural forms are also tricky; these "(s)" suffixes are not appealing. To get around these issues, why don't we just keep what's shown in the tag instead of generating the text from the structured data?

https://codereview.chromium.org/1705123002/diff/1/java/org/chromium/distiller...
File java/org/chromium/distiller/MarkupGenerator.java (right):

https://codereview.chromium.org/1705123002/diff/1/java/org/chromium/distiller...
java/org/chromium/distiller/MarkupGenerator.java:15: createElement("p", "Author: ", recipe.author) +
L10n issues again. This would look weird on non-English recipe pages.

https://codereview.chromium.org/1705123002/diff/1/java/org/chromium/distiller...
File java/org/chromium/distiller/SchemaOrgParser.java (right):

https://codereview.chromium.org/1705123002/diff/1/java/org/chromium/distiller...
java/org/chromium/distiller/SchemaOrgParser.java:113: return !mStringProperties.containsKey(name) ? "" : DomUtil.join(mStringProperties.get(name).toArray(), ", ");
Possible l10n issues here as well: not all languages use ",". But this is less of an issue than the other l10n problems.

https://codereview.chromium.org/1705123002/diff/1/java/org/chromium/distiller...
File java/org/chromium/distiller/SchemaOrgParserAccessor.java (right):

https://codereview.chromium.org/1705123002/diff/1/java/org/chromium/distiller...
java/org/chromium/distiller/SchemaOrgParserAccessor.java:37: static {
Static sections tend to have performance issues in GWT. Be careful with this.

https://codereview.chromium.org/1705123002/diff/1/javatests/org/chromium/dist...
File javatests/org/chromium/distiller/SchemaOrgParserAccessorTest.java (right):

https://codereview.chromium.org/1705123002/diff/1/javatests/org/chromium/dist...
javatests/org/chromium/distiller/SchemaOrgParserAccessorTest.java:546: "<div id=\"1\" itemscope itemtype=\"http://schema.org/Recipe\">" +
There is a lot of duplicated text here. Maybe save the common parts in strings and concatenate them when needed?
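For context on the `formatDuration` discussion above: schema.org duration values such as `prepTime` are ISO 8601 durations (e.g. `PT15M`). A minimal parser for the common `PTnHnM` shape might look like the sketch below; the class and method names are hypothetical, and the `(s)` suffixes illustrate exactly the pluralization/l10n problem the comment raises.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DurationSketch {
    // Handles only the PT..H..M subset of ISO 8601 durations, as a sketch.
    private static final Pattern ISO_DURATION =
            Pattern.compile("PT(?:(\\d+)H)?(?:(\\d+)M)?");

    /** Formats "PT1H30M" as "1 hour(s) 30 minute(s)"; returns "" if unparsable. */
    static String format(String duration) {
        Matcher m = ISO_DURATION.matcher(duration);
        if (!m.matches()) return "";
        StringBuilder out = new StringBuilder();
        if (m.group(1) != null) out.append(m.group(1)).append(" hour(s)");
        if (m.group(2) != null) {
            if (out.length() > 0) out.append(' ');
            out.append(m.group(2)).append(" minute(s)");
        }
        return out.toString();
    }
}
```

Keeping the text already shown in the tag, as suggested, sidesteps both the parsing and the pluralization issues.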
That is an interesting idea, but we didn't use it because there is no guarantee that they are the same tag. The tag that presents the content and the one that holds the time property could be different, since that is not required by the Schema.org spec; see this example: http://schema.org/prepTime (microdata tab).
Properly implementing i18n and l10n seems to be a complex task. It would require such potentially huge dictionaries that it would increase the output size and probably impact performance. Also, this support would have to be done in a different CL, to build the infrastructure needed by this one.
Do you have any suggestion on what we could do here? Use the content from the tag (which may occasionally be wrong), implement i18n and l10n (maybe in another CL first), or keep it as it is for now, flagged as a TODO? Or something else?
This is the same problem discussed in the other comment. Maybe here we could replace the words with little icons (base64) representing Creator, Prep Time, and so on, avoiding the need for words at all.
I think it is worth a try to use some heuristics to extract the original language first, and to use the generated message in English only as a last resort. Have you tried labeling these elements VERY_LIKELY_CONTENT and using the filter pipeline instead of directly generating the content from metadata? I think it is possible to tune the pipeline to make this work reasonably well.
Hello, wychen!
We've tried to label the elements in a new filter, but it didn't seem to work, because boilerpipe works only with TextBlocks, which keep only text nodes. So unless we change what information is preserved inside the TextBlocks, there is no way to access the elements' schema.org attributes inside a filter.
We also tried to add the VERY_LIKELY_CONTENT label while the DOM is being walked, in ElementAction, before boilerpipe is executed. But we had no success either; boilerpipe runs its natural flow, "ignoring" those labels.
This approach appears to be unfruitful so far.
On 2016/03/30 17:49:31, dalmirsilva wrote:
> Hello, wychen!
>
> We've tried to label the elements in a new filter, but it didn't seem to work,
> because boilerpipe works only with TextBlocks, which keep only text nodes.
> So unless we change what information is preserved inside the TextBlocks, there
> is no way to access the elements' schema.org attributes inside a filter.
>
> We also tried to add the VERY_LIKELY_CONTENT label while the DOM is being
> walked, in ElementAction, before boilerpipe is executed. But we had no success
> either; boilerpipe runs its natural flow, "ignoring" those labels.
>
> This approach appears to be unfruitful so far.

Sorry to hear it's unfruitful. It should still be possible to make the second approach (labeling in ElementAction) work reasonably well.
On 2016/04/07 00:16:38, wychen wrote:
> Sorry to hear it's unfruitful. It should still be possible to make the second
> approach (labeling in ElementAction) work reasonably well.

The approach of labeling itemprop elements as VERY_LIKELY_RELEVANT has the disadvantage of keeping only the value of a property, not its name, which is a problem very similar to the one in our previous approach.
Let's look at an example:

<li>
  <p>Prep time</p>
  <time itemprop="prepTime" datetime="PT15M">
    <span>15</span> m
  </time>
</li>

The element we would label, <time>, doesn't carry the entire information we want to keep. The value from the <p> tag above needs to be used to give semantic meaning to the information.
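The pairing problem in that example can be made concrete: from the element carrying the `itemprop`, one has to walk to a sibling to recover the human-readable label. A rough standalone illustration using the JDK's XML DOM follows; the real code runs against the GWT DOM, so this is only a sketch of the idea, and `labelFor` is a hypothetical helper.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class ItempropLabelSketch {
    /** Finds the element carrying the itemprop and returns its preceding sibling's text. */
    static String labelFor(String xml, String itemprop) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            String found = findLabel(doc.getDocumentElement(), itemprop);
            return found == null ? "" : found;
        } catch (Exception e) {
            return "";
        }
    }

    private static String findLabel(Element e, String itemprop) {
        if (itemprop.equals(e.getAttribute("itemprop"))) {
            // Walk backwards past non-element nodes to the label element, if any.
            Node prev = e.getPreviousSibling();
            while (prev != null && prev.getNodeType() != Node.ELEMENT_NODE) {
                prev = prev.getPreviousSibling();
            }
            return prev == null ? "" : prev.getTextContent().trim();
        }
        for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.ELEMENT_NODE) {
                String found = findLabel((Element) c, itemprop);
                if (found != null) return found;
            }
        }
        return null; // itemprop not found in this subtree
    }
}
```

As the messages note, nothing in the spec guarantees the label is the immediately preceding sibling, which is why a heuristic like this is fragile.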
On 2016/05/12 16:16:02, dalmirdasilva wrote:
> The approach of labeling itemprop elements as VERY_LIKELY_RELEVANT has
> the disadvantage of keeping only the value of a property, not its name,
> which is a problem very similar to the one in our previous approach.
>
> Let's look at an example:
>
> <li>
>   <p>Prep time</p>
>   <time itemprop="prepTime" datetime="PT15M">
>     <span>15</span> m
>   </time>
> </li>
>
> The element we would label, <time>, doesn't carry the entire
> information we want to keep. The value from the <p> tag above needs to
> be used to give semantic meaning to the information.

This is indeed challenging. The largest road blocker from my perspective is the i18n issue. We certainly don't want regressions. In most cases recipe pages don't distill well, and this CL as it stands is helpful when the page language is English. However, for other languages, or for pages where distillation already works, this CL is a regression. One example is here:
http://www.christinesrecipes.com/2016/05/baked-honey-lemon-chicken-drumsticks...
Description was changed: the reviewer line was updated from R=wychen@chromium.org, kuan@chromium.org, jochen@chromium.org, bengr@chromium.org, nyquist@chromium.org, gene@chromium.org to R=wychen@chromium.org, mdjones@chromium.org; the description text is otherwise unchanged.
Hi, wychen!
Yes, we agree with your concerns about regression issues. Therefore, considering that English pages are the majority of pages on the internet and this feature is indeed helpful, we propose to use the specialised extractor only for English *for now*. We created a simple heuristic (strictly speaking) inside DomUtil that tries to identify the language of the page, and we use this information to activate the Schema.org extractor only if the page is in English.
Apart from that, we think it will be important to create another CL to implement the i18n and l10n features; doing it in this CL would make it too coupled. After adding support for i18n and l10n, we will be able to expand the Schema.org support to other languages, since the needed infrastructure will be available. There are other places in the Dom Distiller code where i18n would help increase the accuracy of the heuristics, such as the pagination algorithm, which looks for "next" and "previous" words, etc.
Looking forward to hearing more from you. Best regards!
Hi, it looks like you have merged or rebased between patch sets 2 and 3. One trick to make code review easier is to upload a patch set right after merging or rebasing, and another one showing your changes. This way, the diff between patch sets is much more understandable.
Patchset #3 (id:40001) has been deleted
Patchset #3 (id:60001) has been deleted
We deleted the previous patch set, rebased, submitted, applied our changes and then submitted again. Wychen, please let us know if we are good now or if we have to do something else. Best regards!
Now that the concern about language is addressed, have you tried fixing the speed regression?

https://codereview.chromium.org/1705123002/diff/100001/java/org/chromium/dist...
File java/org/chromium/distiller/DomUtil.java (right):

https://codereview.chromium.org/1705123002/diff/100001/java/org/chromium/dist...
java/org/chromium/distiller/DomUtil.java:481: public static String getLanguage(Element root) {
If the language is specified in the HTTP header instead of inside the HTML, does this still get it? Sadly, it might not be possible to test that in our unit test environment.

https://codereview.chromium.org/1705123002/diff/100001/java/org/chromium/dist...
java/org/chromium/distiller/DomUtil.java:484: NodeList<Element> metas = root.getElementsByTagName("META");
Using "META[HTTP-EQUIV="content-language" i][CONTENT],META[NAME="language" i][CONTENT]" in querySelectorAll() might be faster.

https://codereview.chromium.org/1705123002/diff/100001/java/org/chromium/dist...
File java/org/chromium/distiller/SchemaOrgParserAccessor.java (right):

https://codereview.chromium.org/1705123002/diff/100001/java/org/chromium/dist...
java/org/chromium/distiller/SchemaOrgParserAccessor.java:203: init();
Only init() when the page is in English.

https://codereview.chromium.org/1705123002/diff/100001/java/org/chromium/dist...
java/org/chromium/distiller/SchemaOrgParserAccessor.java:205: if (DomUtil.getLanguage(mRoot).contains(ENGLISH_LANGUAGE)) {
Strictly speaking, the lang code should "start with" en, not just contain it.
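The last comment above ("start with" rather than "contains") suggests a normalization step roughly like the following. The class and method names are illustrative, not the actual dom-distiller code; the false-positive example assumes ISO 639-2 "wen" (Sorbian languages), which `contains("en")` would wrongly accept.

```java
import java.util.Locale;

public class LanguageCheckSketch {
    /** True when the page language code denotes English (e.g. "en", "en-US", "EN_gb"). */
    static boolean isEnglish(String langCode) {
        if (langCode == null) return false;
        String norm = langCode.trim().toLowerCase(Locale.ROOT);
        // "starts with", not "contains": e.g. "wen" (Sorbian) contains "en" but is not English.
        return norm.equals("en") || norm.startsWith("en-") || norm.startsWith("en_");
    }
}
```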
Since we are talking about language detection, it might also solve this bug here: https://bugs.chromium.org/p/chromium/issues/detail?id=482217
Patchset #5 (id:120001) has been deleted
Hello wychen! About the speed regression: we took a look on this issue and we found out that the HashSet for the SupportedTypes when instantiated, was taking a little more time than expected. We're not sure why, but we replaced by an ArrayList and the speed improved considerably. Could you maybe run against your dataset and check if the speed was improved ? Thanks! https://codereview.chromium.org/1705123002/diff/100001/java/org/chromium/dist... File java/org/chromium/distiller/DomUtil.java (right): https://codereview.chromium.org/1705123002/diff/100001/java/org/chromium/dist... java/org/chromium/distiller/DomUtil.java:481: public static String getLanguage(Element root) { On 2016/05/31 21:56:34, wychen wrote: > If the language is specified in http header, instead of inside html, does this > still gets it? Sadly it might not be possible to test in our unit test > environment. Unfortunately it doesn't get it. We couldn't find a a solution for getting the content-language from the http-header without making extra requests, since Dom Distiller is manipulating the DOM already loaded and ready for work. https://codereview.chromium.org/1705123002/diff/100001/java/org/chromium/dist... java/org/chromium/distiller/DomUtil.java:484: NodeList<Element> metas = root.getElementsByTagName("META"); On 2016/05/31 21:56:34, wychen wrote: > Using "META[HTTP-EQUIV="content-language" i][CONTENT],META[NAME="language" > i][CONTENT]" in querySelectorAll() might be faster. Done. https://codereview.chromium.org/1705123002/diff/100001/java/org/chromium/dist... File java/org/chromium/distiller/SchemaOrgParserAccessor.java (right): https://codereview.chromium.org/1705123002/diff/100001/java/org/chromium/dist... java/org/chromium/distiller/SchemaOrgParserAccessor.java:203: init(); On 2016/05/31 21:56:34, wychen wrote: > Only init() when the page is in English. Done. https://codereview.chromium.org/1705123002/diff/100001/java/org/chromium/dist... 
java/org/chromium/distiller/SchemaOrgParserAccessor.java:205: if (DomUtil.getLanguage(mRoot).contains(ENGLISH_LANGUAGE)) {
On 2016/05/31 21:56:34, wychen wrote:
> Strictly speaking, the lang code should "start with" en, not just contain it.

Done.
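The "starts with" check wychen asks for can be sketched as follows. This is an illustrative helper, not DomDistiller's actual code — `LanguageCheck` and `isEnglish` are hypothetical names; the point is that the predicate must match "en", "en-US", or "en_GB", but not a value that merely contains the substring "en":

```java
// Hypothetical sketch of the "lang code starts with en" predicate.
class LanguageCheck {
    static boolean isEnglish(String lang) {
        if (lang == null) return false;
        String normalized = lang.trim().toLowerCase();
        // Accept bare "en" or "en" followed by a subtag separator,
        // so e.g. "french" (which contains "en") does not match.
        return normalized.equals("en")
                || normalized.startsWith("en-")
                || normalized.startsWith("en_");
    }
}
```

This mirrors the basic language-range matching idea from BCP 47: a tag matches "en" when it equals "en" or begins with "en" followed by a subtag separator.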
On 2016/07/06 17:53:34, dalmirsilva wrote:
> Hello wychen!
>
> About the speed regression: we took a look on this issue and we found out that
> the HashSet for the SupportedTypes when instantiated, was taking a little more
> time than expected. We're not sure why, but we replaced by an ArrayList and the
> speed improved considerably.
> Could you maybe run against your dataset and check if the speed was improved ?
> Thanks!

Sorry for the late reply. The benchmark result is:

master:
IEReadingViewParser: Time = 0.050441 +- 0.000207 ms, averaged over 100 runs
OpenGraphProtocolParser: Time = 0.196501 +- 0.000858 ms, averaged over 100 runs
OpenGraphProtocolParser.checkRequired: Time = 0.324879 +- 0.001523 ms, averaged over 100 runs
OpenGraphProtocolParser.findPrefixes: Time = 1.172565 +- 0.005483 ms, averaged over 100 runs
OpenGraphProtocolParser.imageParser.verify: Time = 0.021933 +- 0.000204 ms, averaged over 100 runs
OpenGraphProtocolParser.parse: Time = 5.541612 +- 0.026904 ms, averaged over 100 runs
OpenGraphProtocolParser.parseMetaTags: Time = 2.039917 +- 0.009878 ms, averaged over 100 runs
Pagination: Time = 6.696991 +- 0.031547 ms, averaged over 100 runs
SchemaOrgParser.parse: Time = 1.425413 +- 0.007714 ms, averaged over 100 runs
SchemaOrgParserAccessor: Time = 0.075311 +- 0.000298 ms, averaged over 100 runs
article_processing: Time = 18.895528 +- 0.094584 ms, averaged over 100 runs
document_construction: Time = 53.293085 +- 0.347719 ms, averaged over 100 runs
formatting: Time = 4.862504 +- 0.026562 ms, averaged over 100 runs
markup_parsing: Time = 0.564774 +- 0.002356 ms, averaged over 100 runs
total: Time = 100.571858 +- 0.560194 ms, averaged over 100 runs

patch set 6:
IEReadingViewParser: Time = 0.054895 +- 0.000451 ms, averaged over 100 runs
OpenGraphProtocolParser: Time = 0.197528 +- 0.000957 ms, averaged over 100 runs
OpenGraphProtocolParser.checkRequired: Time = 0.328472 +- 0.001791 ms, averaged over 100 runs
OpenGraphProtocolParser.findPrefixes: Time = 1.175623 +- 0.006301 ms, averaged over 100 runs
OpenGraphProtocolParser.imageParser.verify: Time = 0.023860 +- 0.000374 ms, averaged over 100 runs
OpenGraphProtocolParser.parse: Time = 5.457404 +- 0.029574 ms, averaged over 100 runs
OpenGraphProtocolParser.parseMetaTags: Time = 2.085397 +- 0.011321 ms, averaged over 100 runs
Pagination: Time = 6.648639 +- 0.039150 ms, averaged over 100 runs
SchemaOrgParser.parse: Time = 2.063012 +- 0.013322 ms, averaged over 100 runs
SchemaOrgParserAccessor: Time = 0.364177 +- 0.002039 ms, averaged over 100 runs
article_processing: Time = 18.868166 +- 0.110768 ms, averaged over 100 runs
document_construction: Time = 52.912835 +- 0.367409 ms, averaged over 100 runs
formatting: Time = 4.872729 +- 0.029880 ms, averaged over 100 runs
markup_parsing: Time = 0.854773 +- 0.004386 ms, averaged over 100 runs
parser.getStructuredData(): Time = 3.447513 +- 0.020407 ms, averaged over 100 runs
total: Time = 101.996992 +- 0.633859 ms, averaged over 100 runs

I think aiming for <1% speed regression still makes sense, which we haven't met yet. BTW, after https://codereview.chromium.org/2108833002/, the utilization of the recipe handling added in this CL would be essentially 0, since recipe pages usually don't trigger Clank reader mode. Unless we manually change the triggering model, low usage would make the speed-regression threshold tighter.
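The HashSet-to-ArrayList swap discussed above rests on a simple observation: for a handful of elements, a linear `contains` scan avoids the construction and hashing overhead of a HashSet. A minimal sketch (the class and the type strings are hypothetical, not DomDistiller's actual identifiers):

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch: for a tiny, fixed set of supported schema.org
// types, a linear scan over a List avoids HashSet construction and
// hashCode() calls entirely.
class SupportedTypes {
    private static final List<String> TYPES =
            Arrays.asList("Article", "Recipe", "ImageObject", "Person");

    static boolean isSupported(String type) {
        // O(n) lookup, but n is 4, so equality checks are cheaper
        // than hashing at this scale.
        return TYPES.contains(type);
    }
}
```

Whether this actually wins depends on set size and call frequency, which is why re-running the benchmark against the dataset is the right check.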
https://codereview.chromium.org/1705123002/diff/100001/java/org/chromium/dist...
File java/org/chromium/distiller/DomUtil.java (right):

https://codereview.chromium.org/1705123002/diff/100001/java/org/chromium/dist...
java/org/chromium/distiller/DomUtil.java:481: public static String getLanguage(Element root) {
On 2016/07/06 17:53:34, dalmirsilva wrote:
> On 2016/05/31 21:56:34, wychen wrote:
> > If the language is specified in http header, instead of inside html, does this
> > still get it? Sadly it might not be possible to test in our unit test
> > environment.
>
> Unfortunately it doesn't get it. We couldn't find a solution for getting the
> content-language from the http-header without making extra requests, since Dom
> Distiller is manipulating the DOM already loaded and ready for work.

Got it. I guess this is our technical limitation.

https://codereview.chromium.org/1705123002/diff/160001/java/org/chromium/dist...
File java/org/chromium/distiller/ContentExtractor.java (right):

https://codereview.chromium.org/1705123002/diff/160001/java/org/chromium/dist...
java/org/chromium/distiller/ContentExtractor.java:89:
nit: extra line

https://codereview.chromium.org/1705123002/diff/160001/java/org/chromium/dist...
java/org/chromium/distiller/ContentExtractor.java:92: LogUtil.addTimingInfo(now, mTimingInfo, "parser.getStructuredData()");
Maybe just "getStructuredData" for consistency.

https://codereview.chromium.org/1705123002/diff/160001/java/org/chromium/dist...
java/org/chromium/distiller/DomUtil.java:508: result.add(matchResult.getGroup(1) + " year(s)");
Can we handle plural forms?

https://codereview.chromium.org/1705123002/diff/160001/java/org/chromium/dist...
java/org/chromium/distiller/DomUtil.java:545: NodeList<Element> languages = DomUtil.querySelectorAll(root, query);
Would it be faster if we only handle <head> instead of <html>? I haven't tried.

https://codereview.chromium.org/1705123002/diff/160001/java/org/chromium/dist...
File java/org/chromium/distiller/SchemaOrgParserAccessor.java (right):

https://codereview.chromium.org/1705123002/diff/160001/java/org/chromium/dist...
java/org/chromium/distiller/SchemaOrgParserAccessor.java:22: private static List<SchemaOrgParser.Type> supportedTypes;
I suspect this still causes some slowness. Would lazy initialization make it faster?
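The lazy initialization wychen suggests for `supportedTypes` can be sketched as below. The names are hypothetical, not the actual SchemaOrgParserAccessor code; the idea is that the list is built only on first access, so pages that never reach the schema.org path pay nothing for it. DomDistiller runs as single-threaded compiled JavaScript (GWT), so this sketch assumes no synchronization is needed:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of lazy initialization for a static
// supported-types list: deferred to first use, cached thereafter.
class LazySupportedTypes {
    private static List<String> sSupportedTypes;

    static List<String> get() {
        if (sSupportedTypes == null) {
            // Built once, on the first call only.
            sSupportedTypes = new ArrayList<String>();
            sSupportedTypes.add("Recipe");
        }
        return sSupportedTypes;
    }
}
```

If the accessor were ever used from plain multi-threaded Java, the null-check would need a holder-class or synchronized form instead.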