Created: 4 years, 10 months ago by dalmirsilva
Modified: 4 years, 5 months ago
CC: marcelorcorrea
Base URL: https://github.com/chromium/dom-distiller.git@master
Target Ref: refs/heads/master
Visibility: Public.
Description
Add support for Schema.org/Recipe
There are a large number of web pages on the internet that use some
sort of structure or schema. In these cases, extracting the relevant
content should not be the Herculean task it normally is.
DomDistiller should take advantage of this by supporting those
structured data schemas instead of blindly searching for
relevant content. It should take the shortcut.
DomDistiller parses the whole document once for each structured data
parser (there are three of them), but it uses these parsers only to
extract minor information such as the title and author. The idea is to
use these parsers to extract the content itself.
This CL aims to support content extraction through these parsers. If any
of them can retrieve the content from a page, we use it instead of
executing the normal flow.
The idea is also to make the parsers extensible so that they gradually
support different kinds of content types.
In this CL, we have added support for schema.org/Recipe, which currently
appears to be a relevant case. There are a lot of recipe
webpages that follow the schema.org convention, and on most of these
pages DomDistiller performs very poorly, removing key
content such as ingredients or prep times, or even discarding
the whole content.
Contributors=marcelorcorrea@hp.com; dalmirsilva@hp.com
BUG=397173
R=wychen@chromium.org, mdjones@chromium.org
Patch Set 1 #
Total comments: 6
Patch Set 2 : wychen's comments #
Patch Set 3 : merged from master #
Patch Set 4 : activate only for English pages #
Total comments: 9
Patch Set 5 : Merged with Master #
Patch Set 6 : wychen's comments addressed #
Total comments: 5
Messages
Total messages: 30 (7 generated)
Description was changed (initial reviewer list: R=wychen@chromium.org, kuan@chromium.org, jochen@chromium.org, bengr@chromium.org, nyquist@chromium.org, gene@chromium.org); the description text is identical to the one above.
wychen@chromium.org changed reviewers: + mdjones@chromium.org - bengr@chromium.org, gene@chromium.org, jochen@chromium.org, kuan@chromium.org, nyquist@chromium.org
This looks really interesting. I'm just curious how popular this og:Recipe is, and how we are going to test this against the real-world corpus. BTW, the og parser is a bit slow; it's probably time to convert to lazy evaluation, and to make sure this doesn't slow down non-recipe pages.
Hello wychen,
Thanks for your reply. When you say "og:Recipe", I believe you mean Schema.org/Recipe, correct?
As for the popularity of these recipes, we have some sample recipe web sites that are heavily visited and follow the Schema.org conventions. For example:
http://allrecipes.com/recipe/34159/spicy-oven-fried-chicken/?internalSource=r...
http://www.getmecooking.com/recipes
http://www.recipe.com/quick-chicken-tortilla-bake/
http://www.bettycrocker.com/recipes/slow-cooker-beef-bourguignon/19bdb920-7f6...
http://www.aspicyperspective.com/oven-chicken-wings-apple-onion-dip/2
http://peachesplease.com/angel-food-cake-french-toast/
As for non-recipe web pages, I don't think this solution would impact their extraction performance, since we are using a mechanism that is already in place. We have just added some fields that we want to keep when the parser is running; in this case, the recipe fields.
Regarding making the parsers lazy: how about creating another Bug/CL, since this one is already somewhat extensive to review and doesn't affect the current flow?
On 2016/02/18 15:06:27, marcelorcorrea wrote:
> Hello wychen,
> Thanks for your reply. When you say "og:Recipe", I believe you mean
> Schema.org/Recipe, correct?
> As for the popularity of these recipes, we have some sample recipe web sites
> that are heavily visited and follow the Schema.org conventions.
> For example:
> http://allrecipes.com/recipe/34159/spicy-oven-fried-chicken/?internalSource=r...
> http://www.getmecooking.com/recipes
> http://www.recipe.com/quick-chicken-tortilla-bake/
> http://www.bettycrocker.com/recipes/slow-cooker-beef-bourguignon/19bdb920-7f6...
> http://www.aspicyperspective.com/oven-chicken-wings-apple-onion-dip/2
> http://peachesplease.com/angel-food-cake-french-toast/
>
> As for non-recipe web pages, I don't think this solution would impact their
> extraction performance, since we are using a mechanism that is already in
> place. We have just added some fields that we want to keep when the parser is
> running; in this case, the recipe fields.
>
> Regarding making the parsers lazy: how about creating another Bug/CL, since
> this one is already somewhat extensive to review and doesn't affect the
> current flow?

Wychen, I agree with Marcelo: this CL doesn't impact performance, since right now all parsers are being run anyway. I saw you created a new bug to make Schema.org lazily parsed, which is an interesting feature; we can work on it separately. However, just to make it clearer: today we ask the parsers for 'title', 'author', etc., which could be found by any parser other than Schema.org (and if found, Schema.org doesn't need to be executed). But since this specific CL adds support for content extraction, and that is only implemented in Schema.org (the others return empty), we will always execute Schema.org until we implement it in the other parsers as well.
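The flow being discussed here, asking each markup parser for structured content and falling back to the heuristic pipeline only when all of them return empty, can be sketched roughly as follows. The interface and method names are illustrative only and do not match the actual dom-distiller API.

```java
import java.util.List;

public class StructuredExtractorSketch {
    // Hypothetical minimal interface; the real parsers expose more accessors.
    interface MarkupParser {
        String getStructuredData(); // "" when the parser has nothing to offer
    }

    static String extractContent(List<MarkupParser> parsers, String heuristicResult) {
        for (MarkupParser parser : parsers) {
            String structured = parser.getStructuredData();
            if (structured != null && !structured.isEmpty()) {
                return structured; // shortcut: skip the heuristic pipeline
            }
        }
        return heuristicResult; // normal flow when no parser recognized the page
    }
}
```

As the message above notes, because only the Schema.org parser currently implements `getStructuredData()`, this loop would reach it on every page until the other parsers implement it too.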
dalmirdasilva@gmail.com changed reviewers: + dalmirdasilva@gmail.com
+Marcelo.
I did some benchmarks on our dataset. This CL is called "recipe", and the lazy evaluation is called "lazy". "lazyrecipe" merges the two, and "lazy-oglater" just changes the order. The git tree is shown below:

* 20508eb - (9 hours ago) og after schema.org (lazy-oglater)
| * 19cccd4 - (11 hours ago) fix merging (lazyrecipe)
| * 0021308 - (12 hours ago) Merge branch 'recipe' into lazyrecipe
| |\
|/ /
| * 513d543 - (13 days ago) Add support for Schema.org/Recipe (recipe)
* | 0f30775 - (12 hours ago) Lazy evaluation of Parsers (lazy)
|/
* 9101ca4 - (13 days ago) Fix spelling in comments (master)

Total time:
0f30775: Time = 99.781189 +- 0.162174 ms, averaged over 100 runs
19cccd4: Time = 101.354524 +- 0.159249 ms, averaged over 100 runs
20508eb: Time = 100.806103 +- 0.172367 ms, averaged over 100 runs
513d543: Time = 101.290721 +- 0.166323 ms, averaged over 100 runs
9101ca4: Time = 99.978039 +- 0.140684 ms, averaged over 100 runs

"recipe" is 1.3% slower than TOT, and "lazy" is 0.2% faster. "lazyrecipe" is slower than "recipe": since the OpenGraph parser returns "" from getStructuredData(), the schema.org parser is always forced to run, so we gained nothing by making it lazy. Running schema.org first and then OpenGraph could avoid this, but that turned out to be slower.

SchemaOrgParserAccessor time:
0f30775: Time = 0.083602 +- 0.000120 ms, averaged over 100 runs
19cccd4: Time = 1.382156 +- 0.002023 ms, averaged over 100 runs
20508eb: Time = 0.270443 +- 0.000334 ms, averaged over 100 runs
513d543: Time = 4.210656 +- 0.006692 ms, averaged over 100 runs
9101ca4: Time = 3.511269 +- 0.005107 ms, averaged over 100 runs

If the parser were lazy, this should be almost zero. For "lazyrecipe" it is much higher than zero, so it might be worth investigating what went wrong. This can be observed on all the entries in our dataset, so I guess you could reproduce and debug this locally. I suspect the supportedTypes initialization, but I haven't tried it.
OpenGraphProtocolParser time:
0f30775: Time = 0.275534 +- 0.000350 ms, averaged over 100 runs
19cccd4: Time = 0.275540 +- 0.000363 ms, averaged over 100 runs
20508eb: Time = 0.089264 +- 0.000153 ms, averaged over 100 runs
513d543: Time = 6.659252 +- 0.009092 ms, averaged over 100 runs
9101ca4: Time = 6.649062 +- 0.008637 ms, averaged over 100 runs

All as expected here.

OpenGraphProtocolParser.parse time:
0f30775: Time = 5.987436 +- 0.008167 ms, averaged over 100 runs
19cccd4: Time = 5.433146 +- 0.007546 ms, averaged over 100 runs
20508eb: Time = 4.670875 +- 0.007086 ms, averaged over 100 runs
513d543: Time = 6.470541 +- 0.009614 ms, averaged over 100 runs
9101ca4: Time = 6.460412 +- 0.009248 ms, averaged over 100 runs

SchemaOrgParser.parse time:
0f30775: Time = 1.551841 +- 0.003426 ms, averaged over 100 runs
19cccd4: Time = 2.138156 +- 0.003962 ms, averaged over 100 runs
20508eb: Time = 2.385430 +- 0.004932 ms, averaged over 100 runs
513d543: Time = 2.133204 +- 0.004898 ms, averaged over 100 runs
9101ca4: Time = 2.011774 +- 0.003853 ms, averaged over 100 runs

We can see some saved parse time with lazy eval, as expected.
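For readers unfamiliar with the `Time = X +- Y ms, averaged over N runs` format in these benchmarks: it reports a mean plus an uncertainty. Assuming the ± figure is the standard error of the mean (an assumption; the actual benchmark harness is not shown in this thread), it can be computed like this:

```java
public class BenchStats {
    /** Returns {mean, standard error of the mean} for a series of timings. */
    static double[] meanAndStderr(double[] samples) {
        int n = samples.length;
        double sum = 0;
        for (double s : samples) sum += s;
        double mean = sum / n;
        double sq = 0;
        for (double s : samples) sq += (s - mean) * (s - mean);
        double variance = sq / (n - 1);          // unbiased sample variance
        double stderr = Math.sqrt(variance / n); // standard error of the mean
        return new double[] {mean, stderr};
    }
}
```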
Nice benchmarks, thanks for sharing the results with us!
It is not completely unexpected that "recipe" is slower, as this feature adds some new routines to the code. But even so, it is still worth considering the following points:

1. Nowadays there are not many cases where we can really be sure about the Dom Distiller results for a given page. We have a lot of examples of missing content, etc. Therefore, correctly parsing recipes (and in the future other kinds of content, like Articles) is a really valuable thing. This feature adds value to Dom Distiller, so it is reasonable that it could cost a little bit in performance.

2. When we ask the parsers for structured data, we only run the Schema.org parser if there are microdata properties (ITEMPROP, ITEMSCOPE) in the document. So the impact on sites that don't support Schema.org is not that big. But for those that are Schema.org/Recipe, the impact would be positive: it is faster to run the parser than to run all the heuristics. Therefore, to be fair, we need to include some examples of Schema.org/Recipe in the test set. The number of examples should be proportional to the real-life number of such sites.

3. Even running the parsers lazily, there is still a high probability of executing all parsers, because the getMarkupInfo method is called. Sometimes more than one parser is needed because the previous one wasn't able to retrieve some content (it returns an empty string).

4. This CL adds the infrastructure needed to let us extend the functionality to other structured content such as NewsArticle, Article, Movies, etc.
On 2016/03/14 20:30:50, dalmirsilva wrote:
> Nice benchmarks, thanks for sharing the results with us!
> It is not completely unexpected that "recipe" is slower, as this feature adds
> some new routines to the code. But even so, it is still worth considering the
> following points:
>
> 1. Nowadays there are not many cases where we can really be sure about the
> Dom Distiller results for a given page. We have a lot of examples of
> missing content, etc. Therefore, correctly parsing recipes
> (and in the future other kinds of content, like Articles) is a really
> valuable thing. This feature adds value to Dom Distiller, so it is
> reasonable that it could cost a little bit in performance.

I do agree that adding support for structured recipes is useful. It's just that we want to minimize the performance impact, especially for unrelated pages (pages which are not recipes). For changes that benefit a huge percentage of pages, we can justify a >1% performance regression. Pages with schema.org/Recipe seem to be a niche market, so we have tighter criteria for a performance regression. I don't have the numbers at hand right now, but I guess it's <5% of all distillable pages. Maybe it's higher for pages that people tend to print. The speed difference might not be easily noticeable on desktop, but we certainly don't want to make distillation slower on mobile devices than it already is. As a start, it might be worth investigating why the SchemaOrgParserAccessor time for "lazyrecipe" is much higher than zero.

> 2. When we ask the parsers for structured data, we only run the
> Schema.org parser if there are microdata properties (ITEMPROP,
> ITEMSCOPE) in the document. So the impact on sites that don't
> support Schema.org is not that big. But for those that are
> Schema.org/Recipe, the impact would be positive: it is faster to run
> the parser than to run all the heuristics. Therefore, to be fair, we need
> to include some examples of Schema.org/Recipe in the test set.
> The number of examples should be proportional to the real-life number of
> such sites.

In the dataset I tested, none of the pages have schema.org/Recipe. As I described in crbug.com/593457, we do need to have a dataset specific to performance benchmarking. Code size also has a negative impact overall, since the generated JS code needs to be parsed and compiled by V8, even if that code path is not taken. Pre-compiling the JS code is one of the things we want to do but are short-handed for. Ref: crbug.com/594777.

> 3. Even running the parsers lazily, there is still a high
> probability of executing all parsers, because the getMarkupInfo method
> is called. Sometimes more than one parser is needed because the
> previous one wasn't able to retrieve some content
> (it returns an empty string).

I think running schema.org first and then Open Graph could be faster, but our dataset is biased: all the entries contain Open Graph tags, so running Open Graph first is faster for this dataset. After the eval system is open sourced (crbug.com/594779), maybe it will be possible for you guys to run benchmarks.

> 4. This CL adds the infrastructure needed to let us extend the
> functionality to other structured content such as NewsArticle,
> Article, Movies, etc.
https://codereview.chromium.org/1705123002/diff/1/java/org/chromium/distiller...
File java/org/chromium/distiller/ContentExtractor.java (right):

https://codereview.chromium.org/1705123002/diff/1/java/org/chromium/distiller...
java/org/chromium/distiller/ContentExtractor.java:90: String structuredData = parser.getStructuredData();
It might make sense to measure the time spent in this section and record it in TimingInfo.

https://codereview.chromium.org/1705123002/diff/1/java/org/chromium/distiller...
File java/org/chromium/distiller/DomUtil.java (right):

https://codereview.chromium.org/1705123002/diff/1/java/org/chromium/distiller...
java/org/chromium/distiller/DomUtil.java:438: public static String formatDuration(String duration) {
I18n and l10n might be difficult, and singular/plural forms are also tricky; these "(s)" suffixes are not appealing. To get around these issues, why don't we just keep what's shown in the tag instead of generating the text from the structured data?

https://codereview.chromium.org/1705123002/diff/1/java/org/chromium/distiller...
File java/org/chromium/distiller/MarkupGenerator.java (right):

https://codereview.chromium.org/1705123002/diff/1/java/org/chromium/distiller...
java/org/chromium/distiller/MarkupGenerator.java:15: createElement("p", "Author: ", recipe.author) +
L10n issues again. This would look weird on non-English recipe pages.

https://codereview.chromium.org/1705123002/diff/1/java/org/chromium/distiller...
File java/org/chromium/distiller/SchemaOrgParser.java (right):

https://codereview.chromium.org/1705123002/diff/1/java/org/chromium/distiller...
java/org/chromium/distiller/SchemaOrgParser.java:113: return !mStringProperties.containsKey(name) ? "" : DomUtil.join(mStringProperties.get(name).toArray(), ", ");
Possible l10n issues here as well: not all languages use ",". But this is less of an issue than the other l10n problems.

https://codereview.chromium.org/1705123002/diff/1/java/org/chromium/distiller...
File java/org/chromium/distiller/SchemaOrgParserAccessor.java (right):

https://codereview.chromium.org/1705123002/diff/1/java/org/chromium/distiller...
java/org/chromium/distiller/SchemaOrgParserAccessor.java:37: static {
Static sections tend to have performance issues in GWT. Be careful with this.

https://codereview.chromium.org/1705123002/diff/1/javatests/org/chromium/dist...
File javatests/org/chromium/distiller/SchemaOrgParserAccessorTest.java (right):

https://codereview.chromium.org/1705123002/diff/1/javatests/org/chromium/dist...
javatests/org/chromium/distiller/SchemaOrgParserAccessorTest.java:546: "<div id=\"1\" itemscope itemtype=\"http://schema.org/Recipe\">" +
There is a lot of duplicated text here. Maybe save the common parts in strings and concatenate them when needed?
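For context on the `formatDuration` discussion above: schema.org duration values such as `prepTime` are ISO 8601 durations (e.g. `PT15M`). A minimal parser for the common `PTnHnM` shape might look like the sketch below; the class and method names are hypothetical, and the `(s)` suffixes illustrate exactly the pluralization/l10n problem the comment raises.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DurationSketch {
    // Handles only the PT..H..M subset of ISO 8601 durations, as a sketch.
    private static final Pattern ISO_DURATION =
            Pattern.compile("PT(?:(\\d+)H)?(?:(\\d+)M)?");

    /** Formats "PT1H30M" as "1 hour(s) 30 minute(s)"; returns "" if unparsable. */
    static String format(String duration) {
        Matcher m = ISO_DURATION.matcher(duration);
        if (!m.matches()) return "";
        StringBuilder out = new StringBuilder();
        if (m.group(1) != null) out.append(m.group(1)).append(" hour(s)");
        if (m.group(2) != null) {
            if (out.length() > 0) out.append(' ');
            out.append(m.group(2)).append(" minute(s)");
        }
        return out.toString();
    }
}
```

Keeping the text already shown in the tag, as suggested, sidesteps both the parsing and the pluralization issues.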
That is an interesting idea, but we didn't use it because there is no guarantee that they are the same tag. The tag that presents the content and the one that holds the time property could be different, since that is not required by the Schema.org spec; see this example: http://schema.org/prepTime (microdata tab).
Properly implementing i18n and l10n seems to be a complex task. It would require such potentially huge dictionaries that it would increase the output size and probably impact performance. Also, this support would have to be done in a different CL, to build the infrastructure needed by this one.
Do you have any suggestion on what we could do here? Use the content from the tag (which may occasionally be wrong), implement i18n and l10n (maybe in another CL first), or keep it as it is for now, flagged as a TODO? Or something else?
This is the same problem discussed in the other comment. Maybe here we could replace the words with little icons (base64) representing Creator, Prep Time, and so on, avoiding the need for words at all.
I think it is worth a try to use some heuristics to extract the original language first, and to use the generated message in English only as a last resort. Have you tried labeling these elements VERY_LIKELY_CONTENT and using the filter pipeline instead of directly generating the content from metadata? I think it is possible to tune the pipeline to make this work reasonably well.
Hello, wychen!
We've tried to label the elements in a new filter, but it didn't seem to work, because boilerpipe works only with TextBlocks, which keep only text nodes. So unless we change what information is preserved inside the TextBlocks, there is no way to access the elements' schema.org attributes inside a filter.
We also tried to add the VERY_LIKELY_CONTENT label while the DOM is being walked, in ElementAction, before boilerpipe is executed. But we had no success either; boilerpipe runs its natural flow, "ignoring" those labels.
This approach appears to be unfruitful so far.
On 2016/03/30 17:49:31, dalmirsilva wrote:
> Hello, wychen!
>
> We've tried to label the elements in a new filter, but it didn't seem to work,
> because boilerpipe works only with TextBlocks, which keep only text nodes.
> So unless we change what information is preserved inside the TextBlocks, there
> is no way to access the elements' schema.org attributes inside a filter.
>
> We also tried to add the VERY_LIKELY_CONTENT label while the DOM is being
> walked, in ElementAction, before boilerpipe is executed. But we had no success
> either; boilerpipe runs its natural flow, "ignoring" those labels.
>
> This approach appears to be unfruitful so far.

Sorry to hear it's unfruitful. It should still be possible to make the second approach (labeling in ElementAction) work reasonably well.
On 2016/04/07 00:16:38, wychen wrote:
> Sorry to hear it's unfruitful. It should still be possible to make the second
> approach (labeling in ElementAction) work reasonably well.

The approach of labeling itemprop elements as VERY_LIKELY_RELEVANT has the disadvantage of keeping only the value of a property, not its name, which is a problem very similar to the one in our previous approach.
Let's look at an example:

<li>
  <p>Prep time</p>
  <time itemprop="prepTime" datetime="PT15M">
    <span>15</span> m
  </time>
</li>

The element we would label, <time>, doesn't carry the entire information we want to keep. The value from the <p> tag above needs to be used to give semantic meaning to the information.
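The pairing problem in that example can be made concrete: from the element carrying the `itemprop`, one has to walk to a sibling to recover the human-readable label. A rough standalone illustration using the JDK's XML DOM follows; the real code runs against the GWT DOM, so this is only a sketch of the idea, and `labelFor` is a hypothetical helper.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class ItempropLabelSketch {
    /** Finds the element carrying the itemprop and returns its preceding sibling's text. */
    static String labelFor(String xml, String itemprop) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            String found = findLabel(doc.getDocumentElement(), itemprop);
            return found == null ? "" : found;
        } catch (Exception e) {
            return "";
        }
    }

    private static String findLabel(Element e, String itemprop) {
        if (itemprop.equals(e.getAttribute("itemprop"))) {
            // Walk backwards past non-element nodes to the label element, if any.
            Node prev = e.getPreviousSibling();
            while (prev != null && prev.getNodeType() != Node.ELEMENT_NODE) {
                prev = prev.getPreviousSibling();
            }
            return prev == null ? "" : prev.getTextContent().trim();
        }
        for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.ELEMENT_NODE) {
                String found = findLabel((Element) c, itemprop);
                if (found != null) return found;
            }
        }
        return null; // itemprop not found in this subtree
    }
}
```

As the messages note, nothing in the spec guarantees the label is the immediately preceding sibling, which is why a heuristic like this is fragile.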
On 2016/05/12 16:16:02, dalmirdasilva wrote:
> The approach of labeling itemprop elements as VERY_LIKELY_RELEVANT has
> the disadvantage of keeping only the value of a property, not its name,
> which is a problem very similar to the one in our previous approach.
>
> Let's look at an example:
>
> <li>
>   <p>Prep time</p>
>   <time itemprop="prepTime" datetime="PT15M">
>     <span>15</span> m
>   </time>
> </li>
>
> The element we would label, <time>, doesn't carry the entire
> information we want to keep. The value from the <p> tag above needs to
> be used to give semantic meaning to the information.

This is indeed challenging. The largest road blocker from my perspective is the i18n issue. We certainly don't want regressions. In most cases recipe pages don't distill well, and this CL as it stands is helpful when the page language is English. However, for other languages, or for pages where distillation already works, this CL is a regression. One example is here:
http://www.christinesrecipes.com/2016/05/baked-honey-lemon-chicken-drumsticks...
Description was changed: the reviewer line was updated from R=wychen@chromium.org, kuan@chromium.org, jochen@chromium.org, bengr@chromium.org, nyquist@chromium.org, gene@chromium.org to R=wychen@chromium.org, mdjones@chromium.org; the description text is otherwise unchanged.
Hi, wychen!
Yes, we agree with your concerns about regression issues. Therefore, considering that English pages are the majority of pages on the internet and this feature is indeed helpful, we propose to use the specialised extractor only for English *for now*. We created a simple heuristic (strictly speaking) inside DomUtil that tries to identify the language of the page, and we use this information to activate the Schema.org extractor only if the page is in English.
Apart from that, we think it will be important to create another CL to implement the i18n and l10n features; doing it in this CL would make it too coupled. After adding support for i18n and l10n, we will be able to expand the Schema.org support to other languages, since the needed infrastructure will be available. There are other places in the Dom Distiller code where i18n would help increase the accuracy of the heuristics, such as the pagination algorithm, which looks for "next" and "previous" words, etc.
Looking forward to hearing more from you. Best regards!
Hi, it looks like you have merged or rebased between patch sets 2 and 3. One trick to make code review easier is to upload a patch set right after merging or rebasing, and another one showing your changes. This way, the diff between patch sets is much more understandable.
Patchset #3 (id:40001) has been deleted
Patchset #3 (id:60001) has been deleted
We deleted the previous patch set, rebased, submitted, applied our changes and then submitted again. Wychen, please let us know if we are good now or if we have to do something else. Best regards!
Now that the concern about language is addressed, have you tried fixing the speed regression?

https://codereview.chromium.org/1705123002/diff/100001/java/org/chromium/dist...
File java/org/chromium/distiller/DomUtil.java (right):

https://codereview.chromium.org/1705123002/diff/100001/java/org/chromium/dist...
java/org/chromium/distiller/DomUtil.java:481: public static String getLanguage(Element root) {
If the language is specified in the HTTP header instead of inside the HTML, does this still get it? Sadly, it might not be possible to test that in our unit test environment.

https://codereview.chromium.org/1705123002/diff/100001/java/org/chromium/dist...
java/org/chromium/distiller/DomUtil.java:484: NodeList<Element> metas = root.getElementsByTagName("META");
Using "META[HTTP-EQUIV="content-language" i][CONTENT],META[NAME="language" i][CONTENT]" in querySelectorAll() might be faster.

https://codereview.chromium.org/1705123002/diff/100001/java/org/chromium/dist...
File java/org/chromium/distiller/SchemaOrgParserAccessor.java (right):

https://codereview.chromium.org/1705123002/diff/100001/java/org/chromium/dist...
java/org/chromium/distiller/SchemaOrgParserAccessor.java:203: init();
Only init() when the page is in English.

https://codereview.chromium.org/1705123002/diff/100001/java/org/chromium/dist...
java/org/chromium/distiller/SchemaOrgParserAccessor.java:205: if (DomUtil.getLanguage(mRoot).contains(ENGLISH_LANGUAGE)) {
Strictly speaking, the lang code should "start with" en, not just contain it.
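The last comment above ("start with" rather than "contains") suggests a normalization step roughly like the following. The class and method names are illustrative, not the actual dom-distiller code; the false-positive example assumes ISO 639-2 "wen" (Sorbian languages), which `contains("en")` would wrongly accept.

```java
import java.util.Locale;

public class LanguageCheckSketch {
    /** True when the page language code denotes English (e.g. "en", "en-US", "EN_gb"). */
    static boolean isEnglish(String langCode) {
        if (langCode == null) return false;
        String norm = langCode.trim().toLowerCase(Locale.ROOT);
        // "starts with", not "contains": e.g. "wen" (Sorbian) contains "en" but is not English.
        return norm.equals("en") || norm.startsWith("en-") || norm.startsWith("en_");
    }
}
```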
Since we are talking about language detection, it might also solve this bug here: https://bugs.chromium.org/p/chromium/issues/detail?id=482217
Patchset #5 (id:120001) has been deleted
Hello wychen! About the speed regression: we took a look on this issue and we found out that the HashSet for the SupportedTypes when instantiated, was taking a little more time than expected. We're not sure why, but we replaced by an ArrayList and the speed improved considerably. Could you maybe run against your dataset and check if the speed was improved ? Thanks! https://codereview.chromium.org/1705123002/diff/100001/java/org/chromium/dist... File java/org/chromium/distiller/DomUtil.java (right): https://codereview.chromium.org/1705123002/diff/100001/java/org/chromium/dist... java/org/chromium/distiller/DomUtil.java:481: public static String getLanguage(Element root) { On 2016/05/31 21:56:34, wychen wrote: > If the language is specified in http header, instead of inside html, does this > still gets it? Sadly it might not be possible to test in our unit test > environment. Unfortunately it doesn't get it. We couldn't find a a solution for getting the content-language from the http-header without making extra requests, since Dom Distiller is manipulating the DOM already loaded and ready for work. https://codereview.chromium.org/1705123002/diff/100001/java/org/chromium/dist... java/org/chromium/distiller/DomUtil.java:484: NodeList<Element> metas = root.getElementsByTagName("META"); On 2016/05/31 21:56:34, wychen wrote: > Using "META[HTTP-EQUIV="content-language" i][CONTENT],META[NAME="language" > i][CONTENT]" in querySelectorAll() might be faster. Done. https://codereview.chromium.org/1705123002/diff/100001/java/org/chromium/dist... File java/org/chromium/distiller/SchemaOrgParserAccessor.java (right): https://codereview.chromium.org/1705123002/diff/100001/java/org/chromium/dist... java/org/chromium/distiller/SchemaOrgParserAccessor.java:203: init(); On 2016/05/31 21:56:34, wychen wrote: > Only init() when the page is in English. Done. https://codereview.chromium.org/1705123002/diff/100001/java/org/chromium/dist... 
java/org/chromium/distiller/SchemaOrgParserAccessor.java:205: if (DomUtil.getLanguage(mRoot).contains(ENGLISH_LANGUAGE)) {
On 2016/05/31 21:56:34, wychen wrote:
> Strictly speaking, the lang code should "start with" en, not just contain it.

Done.
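The "starts with" check wychen asks for can be sketched as follows. This is an illustrative helper, not DomDistiller's actual code — `LanguageCheck` and `isEnglish` are hypothetical names; the point is that the predicate must match "en", "en-US", or "en_GB", but not a value that merely contains the substring "en":

```java
// Hypothetical sketch of the "lang code starts with en" predicate.
class LanguageCheck {
    static boolean isEnglish(String lang) {
        if (lang == null) return false;
        String normalized = lang.trim().toLowerCase();
        // Accept bare "en" or "en" followed by a subtag separator,
        // so e.g. "french" (which contains "en") does not match.
        return normalized.equals("en")
                || normalized.startsWith("en-")
                || normalized.startsWith("en_");
    }
}
```

This mirrors the basic language-range matching idea from BCP 47: a tag matches "en" when it equals "en" or begins with "en" followed by a subtag separator.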
On 2016/07/06 17:53:34, dalmirsilva wrote:
> Hello wychen!
>
> About the speed regression: we took a look on this issue and we found out that
> the HashSet for the SupportedTypes when instantiated, was taking a little more
> time than expected. We're not sure why, but we replaced by an ArrayList and the
> speed improved considerably.
> Could you maybe run against your dataset and check if the speed was improved ?
> Thanks!

Sorry for the late reply. The benchmark result is:

master:
IEReadingViewParser: Time = 0.050441 +- 0.000207 ms, averaged over 100 runs
OpenGraphProtocolParser: Time = 0.196501 +- 0.000858 ms, averaged over 100 runs
OpenGraphProtocolParser.checkRequired: Time = 0.324879 +- 0.001523 ms, averaged over 100 runs
OpenGraphProtocolParser.findPrefixes: Time = 1.172565 +- 0.005483 ms, averaged over 100 runs
OpenGraphProtocolParser.imageParser.verify: Time = 0.021933 +- 0.000204 ms, averaged over 100 runs
OpenGraphProtocolParser.parse: Time = 5.541612 +- 0.026904 ms, averaged over 100 runs
OpenGraphProtocolParser.parseMetaTags: Time = 2.039917 +- 0.009878 ms, averaged over 100 runs
Pagination: Time = 6.696991 +- 0.031547 ms, averaged over 100 runs
SchemaOrgParser.parse: Time = 1.425413 +- 0.007714 ms, averaged over 100 runs
SchemaOrgParserAccessor: Time = 0.075311 +- 0.000298 ms, averaged over 100 runs
article_processing: Time = 18.895528 +- 0.094584 ms, averaged over 100 runs
document_construction: Time = 53.293085 +- 0.347719 ms, averaged over 100 runs
formatting: Time = 4.862504 +- 0.026562 ms, averaged over 100 runs
markup_parsing: Time = 0.564774 +- 0.002356 ms, averaged over 100 runs
total: Time = 100.571858 +- 0.560194 ms, averaged over 100 runs

patch set 6:
IEReadingViewParser: Time = 0.054895 +- 0.000451 ms, averaged over 100 runs
OpenGraphProtocolParser: Time = 0.197528 +- 0.000957 ms, averaged over 100 runs
OpenGraphProtocolParser.checkRequired: Time = 0.328472 +- 0.001791 ms, averaged over 100 runs
OpenGraphProtocolParser.findPrefixes: Time = 1.175623 +- 0.006301 ms, averaged over 100 runs
OpenGraphProtocolParser.imageParser.verify: Time = 0.023860 +- 0.000374 ms, averaged over 100 runs
OpenGraphProtocolParser.parse: Time = 5.457404 +- 0.029574 ms, averaged over 100 runs
OpenGraphProtocolParser.parseMetaTags: Time = 2.085397 +- 0.011321 ms, averaged over 100 runs
Pagination: Time = 6.648639 +- 0.039150 ms, averaged over 100 runs
SchemaOrgParser.parse: Time = 2.063012 +- 0.013322 ms, averaged over 100 runs
SchemaOrgParserAccessor: Time = 0.364177 +- 0.002039 ms, averaged over 100 runs
article_processing: Time = 18.868166 +- 0.110768 ms, averaged over 100 runs
document_construction: Time = 52.912835 +- 0.367409 ms, averaged over 100 runs
formatting: Time = 4.872729 +- 0.029880 ms, averaged over 100 runs
markup_parsing: Time = 0.854773 +- 0.004386 ms, averaged over 100 runs
parser.getStructuredData(): Time = 3.447513 +- 0.020407 ms, averaged over 100 runs
total: Time = 101.996992 +- 0.633859 ms, averaged over 100 runs

I think aiming for <1% speed regression still makes sense, which we haven't met yet. BTW, after https://codereview.chromium.org/2108833002/, the utilization of the recipe handling added in this CL would be essentially 0, since recipe pages usually don't trigger Clank reader mode. Unless we manually change the triggering model, low usage would make the speed-regression threshold tighter.
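The HashSet-to-ArrayList swap discussed above rests on a simple observation: for a handful of elements, a linear `contains` scan avoids the construction and hashing overhead of a HashSet. A minimal sketch (the class and the type strings are hypothetical, not DomDistiller's actual identifiers):

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch: for a tiny, fixed set of supported schema.org
// types, a linear scan over a List avoids HashSet construction and
// hashCode() calls entirely.
class SupportedTypes {
    private static final List<String> TYPES =
            Arrays.asList("Article", "Recipe", "ImageObject", "Person");

    static boolean isSupported(String type) {
        // O(n) lookup, but n is 4, so equality checks are cheaper
        // than hashing at this scale.
        return TYPES.contains(type);
    }
}
```

Whether this actually wins depends on set size and call frequency, which is why re-running the benchmark against the dataset is the right check.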
https://codereview.chromium.org/1705123002/diff/100001/java/org/chromium/dist...
File java/org/chromium/distiller/DomUtil.java (right):

https://codereview.chromium.org/1705123002/diff/100001/java/org/chromium/dist...
java/org/chromium/distiller/DomUtil.java:481: public static String getLanguage(Element root) {
On 2016/07/06 17:53:34, dalmirsilva wrote:
> On 2016/05/31 21:56:34, wychen wrote:
> > If the language is specified in http header, instead of inside html, does this
> > still get it? Sadly it might not be possible to test in our unit test
> > environment.
>
> Unfortunately it doesn't get it. We couldn't find a solution for getting the
> content-language from the http-header without making extra requests, since Dom
> Distiller is manipulating the DOM already loaded and ready for work.

Got it. I guess this is our technical limitation.

https://codereview.chromium.org/1705123002/diff/160001/java/org/chromium/dist...
File java/org/chromium/distiller/ContentExtractor.java (right):

https://codereview.chromium.org/1705123002/diff/160001/java/org/chromium/dist...
java/org/chromium/distiller/ContentExtractor.java:89:
nit: extra line

https://codereview.chromium.org/1705123002/diff/160001/java/org/chromium/dist...
java/org/chromium/distiller/ContentExtractor.java:92: LogUtil.addTimingInfo(now, mTimingInfo, "parser.getStructuredData()");
Maybe just "getStructuredData" for consistency.

https://codereview.chromium.org/1705123002/diff/160001/java/org/chromium/dist...
java/org/chromium/distiller/DomUtil.java:508: result.add(matchResult.getGroup(1) + " year(s)");
Can we handle plural forms?

https://codereview.chromium.org/1705123002/diff/160001/java/org/chromium/dist...
java/org/chromium/distiller/DomUtil.java:545: NodeList<Element> languages = DomUtil.querySelectorAll(root, query);
Would it be faster if we only handle <head> instead of <html>? I haven't tried.

https://codereview.chromium.org/1705123002/diff/160001/java/org/chromium/dist...
File java/org/chromium/distiller/SchemaOrgParserAccessor.java (right):

https://codereview.chromium.org/1705123002/diff/160001/java/org/chromium/dist...
java/org/chromium/distiller/SchemaOrgParserAccessor.java:22: private static List<SchemaOrgParser.Type> supportedTypes;
I suspect this still causes some slowness. Would lazy initialization make it faster?
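The lazy initialization wychen suggests for `supportedTypes` can be sketched as below. The names are hypothetical, not the actual SchemaOrgParserAccessor code; the idea is that the list is built only on first access, so pages that never reach the schema.org path pay nothing for it. DomDistiller runs as single-threaded compiled JavaScript (GWT), so this sketch assumes no synchronization is needed:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of lazy initialization for a static
// supported-types list: deferred to first use, cached thereafter.
class LazySupportedTypes {
    private static List<String> sSupportedTypes;

    static List<String> get() {
        if (sSupportedTypes == null) {
            // Built once, on the first call only.
            sSupportedTypes = new ArrayList<String>();
            sSupportedTypes.add("Recipe");
        }
        return sSupportedTypes;
    }
}
```

If the accessor were ever used from plain multi-threaded Java, the null-check would need a holder-class or synchronized form instead.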