|
|
Descriptionrecognize and parse Schema.org Markup
1) webpages tags are marked up with Schema.org vocabulary, along with microdata
format, using the following attributes in any element, including <html> tag:
a) itemscope: the element and its children are about a particular item
b) itemtype: the type of item as defined by Schema.org
c) itemprop: the property name as defined by itemtype
types currently supported: ImageObject, Article, Person, Organization.
2) implement SchemaOrgParser
- for each supported type, the class recognizes and parses each element as
per specs in (1) in a one-time parsing during object instantiation
- if a type is unsupported, its children are still parsed for top-level
(i.e. non-itemprop) items, but not its itemprop attributes.
3) implement SchemaOrgParserAccessor
- impl MarkupParser.Parser interface for SchemaOrgParser, added to list
of parsers
- add tests to test interface by accessing SchemaOrgParser.
BUG=364356
R=cjhopman@chromium.org
Committed: 8fe6964
Patch Set 1 #
Total comments: 18
Patch Set 2 : addressed all comments #
Total comments: 14
Patch Set 3 : addressed comments, impl some more specs #Patch Set 4 : rm dbg info #Patch Set 5 : fixed bug with nested types, added test #Patch Set 6 : fix to not parse "itemprop" for unsupported parent #Patch Set 7 : fine-tune prev bug fix #
Total comments: 21
Patch Set 8 : addressed all comments #Patch Set 9 : addressed missed-out comments #
Total comments: 22
Patch Set 10 : addressed comments #Patch Set 11 : rm unused props in image, mv fn to class-level #Patch Set 12 : rm 1 more unused prop in image #
Total comments: 4
Patch Set 13 : addressed comments #
Messages
Total messages: 15 (0 generated)
chris, based on your reviews for the IEReadingViewParser cl about using non-map approach to store property values, i've changed the implementation in this cl from EnumMap to [] of specific types. i thought of using HashMap for faster performance during string search, but i worry that might be an overkill memory-wise when the number of properties are very small for certain Schema.Org types. if you think HashMap might be better, i'll change accordingly. ptal. thx.
https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client... File src/com/dom_distiller/client/MarkupParser.java (right): https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client... src/com/dom_distiller/client/MarkupParser.java:148: if (schema != null) mParsers.add(schema); new Foo() never returns null. https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client... File src/com/dom_distiller/client/SchemaOrgParser.java (right): https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client... src/com/dom_distiller/client/SchemaOrgParser.java:146: private final Map<Type, String> mTypeUrls = new EnumMap<Type, String>(Type.class); This appears to only be used to lookup a type for a string, so it should probably be a Map<String, Type> https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client... src/com/dom_distiller/client/SchemaOrgParser.java:264: private void parse(Element root, ThingItem currItem) { This function and its uses will be simplified if currItem is the closest ancestor item of the element passed in. (I.e. it is never the item started at the element passed in). https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client... src/com/dom_distiller/client/SchemaOrgParser.java:266: for (int i = 0; i < allElems.getLength(); i++) { The way that the tree is parsed is hard for me to follow. Why not walk the tree recursively rather than iterating over all the elements? (The current approach also has bad worst case behavior) https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client... src/com/dom_distiller/client/SchemaOrgParser.java:303: if (!e.hasAttribute("ITEMSCOPE") || !e.hasAttribute("ITEMTYPE")) return null; This should be a different function so you don't have to pass around a one-element boolean array to get the result of this check. https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client... src/com/dom_distiller/client/SchemaOrgParser.java:327: parse(e, newItem); This parse() call makes it harder for me to reason about the tree parsing, could we move this recursion out of this function (and then into parse() directly)? (and actually if change parse to recurse over the tree like my comment above, both cases would advance the same way through the recursion, just with different args). https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client... src/com/dom_distiller/client/SchemaOrgParser.java:334: Iterator<Map.Entry<Type, String>> iter = typeUrls.iterator(); I think you can do: for (Map.Entry<Type, String> typeUrl : mTypeUrls.entrySet()) { if (typeUrl...) return ...; } https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client... src/com/dom_distiller/client/SchemaOrgParser.java:345: private void extractProperty(Element e, ThingItem currItem, ThingItem embeddedItem) { This function does a lot (and most of what it does isn't reflected in the name). It gets the itemprop and then adds the embedded item to current item or gets the element's property value and then adds that property to the current item. This would be clearer as two functions: getItemprop(Element) getPropertyValue(Element) and then handle the putItem/putString separately elsewhere. If you keep this function close to what it currently is, the name should at least indicate that it will add the extracted property to currItem.
i've addressed all comments in patch set 2. ptal. thx. https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client... File src/com/dom_distiller/client/MarkupParser.java (right): https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client... src/com/dom_distiller/client/MarkupParser.java:148: if (schema != null) mParsers.add(schema); On 2014/04/17 17:35:26, cjhopman wrote: > new Foo() never returns null. Done. repeated for ie. https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client... File src/com/dom_distiller/client/SchemaOrgParser.java (right): https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client... src/com/dom_distiller/client/SchemaOrgParser.java:146: private final Map<Type, String> mTypeUrls = new EnumMap<Type, String>(Type.class); On 2014/04/17 17:35:26, cjhopman wrote: > This appears to only be used to lookup a type for a string, so it should > probably be a Map<String, Type> Done. https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client... src/com/dom_distiller/client/SchemaOrgParser.java:264: private void parse(Element root, ThingItem currItem) { On 2014/04/17 17:35:26, cjhopman wrote: > This function and its uses will be simplified if currItem is the closest > ancestor item of the element passed in. (I.e. it is never the item started at > the element passed in). Done. https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client... src/com/dom_distiller/client/SchemaOrgParser.java:266: for (int i = 0; i < allElems.getLength(); i++) { On 2014/04/17 17:35:26, cjhopman wrote: > The way that the tree is parsed is hard for me to follow. Why not walk the tree > recursively rather than iterating over all the elements? (The current approach > also has bad worst case behavior) Done. what's the worst case behavior? https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client... src/com/dom_distiller/client/SchemaOrgParser.java:303: if (!e.hasAttribute("ITEMSCOPE") || !e.hasAttribute("ITEMTYPE")) return null; On 2014/04/17 17:35:26, cjhopman wrote: > This should be a different function so you don't have to pass around a > one-element boolean array to get the result of this check. Done. https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client... src/com/dom_distiller/client/SchemaOrgParser.java:327: parse(e, newItem); On 2014/04/17 17:35:26, cjhopman wrote: > This parse() call makes it harder for me to reason about the tree parsing, could > we move this recursion out of this function (and then into parse() directly)? > (and actually if change parse to recurse over the tree like my comment above, > both cases would advance the same way through the recursion, just with different > args). Done. i'm not sure if i code it the way u want regarding "both cases would advance the same way through the recursion, just with different args". https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client... src/com/dom_distiller/client/SchemaOrgParser.java:334: Iterator<Map.Entry<Type, String>> iter = typeUrls.iterator(); On 2014/04/17 17:35:26, cjhopman wrote: > I think you can do: > > for (Map.Entry<Type, String> typeUrl : mTypeUrls.entrySet()) { > if (typeUrl...) return ...; > } Done. since it's now a HashMap of <String, Type>, i just need to use containsKey() and get(). https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client... src/com/dom_distiller/client/SchemaOrgParser.java:345: private void extractProperty(Element e, ThingItem currItem, ThingItem embeddedItem) { On 2014/04/17 17:35:26, cjhopman wrote: > This function does a lot (and most of what it does isn't reflected in the name). > It gets the itemprop and then adds the embedded item to current item or gets the > element's property value and then adds that property to the current item. > > This would be clearer as two functions: > getItemprop(Element) > getPropertyValue(Element) > > and then handle the putItem/putString separately elsewhere. > > If you keep this function close to what it currently is, the name should at > least indicate that it will add the extracted property to currItem. Done.
https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client... File src/com/dom_distiller/client/SchemaOrgParser.java (right): https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client... src/com/dom_distiller/client/SchemaOrgParser.java:266: for (int i = 0; i < allElems.getLength(); i++) { On 2014/04/18 00:19:03, kuan wrote: > On 2014/04/17 17:35:26, cjhopman wrote: > > The way that the tree is parsed is hard for me to follow. Why not walk the > tree > > recursively rather than iterating over all the elements? (The current approach > > also has bad worst case behavior) > > Done. what's the worst case behavior? The previous version had O(n^2) worst case complexity (this would happen if you had a bunch of nested elements with each one starting a new nested item). https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client... src/com/dom_distiller/client/SchemaOrgParser.java:327: parse(e, newItem); On 2014/04/18 00:19:03, kuan wrote: > On 2014/04/17 17:35:26, cjhopman wrote: > > This parse() call makes it harder for me to reason about the tree parsing, > could > > we move this recursion out of this function (and then into parse() directly)? > > (and actually if change parse to recurse over the tree like my comment above, > > both cases would advance the same way through the recursion, just with > different > > args). > > Done. i'm not sure if i code it the way u want regarding "both cases would > advance the same way through the recursion, just with different args". See the new comment in parse() for what I meant by that. https://codereview.chromium.org/240073007/diff/20001/src/com/dom_distiller/cl... File src/com/dom_distiller/client/SchemaOrgParser.java (right): https://codereview.chromium.org/240073007/diff/20001/src/com/dom_distiller/cl... src/com/dom_distiller/client/SchemaOrgParser.java:65: protected final Type mType; nit: I think all the fields here could be private instead of protected and then the methods could all just drop protected and have the default (or public) access. https://codereview.chromium.org/240073007/diff/20001/src/com/dom_distiller/cl... src/com/dom_distiller/client/SchemaOrgParser.java:66: protected final Element mRoot; mRoot looks unused. https://codereview.chromium.org/240073007/diff/20001/src/com/dom_distiller/cl... src/com/dom_distiller/client/SchemaOrgParser.java:142: private final List<ThingItem> mItemScopes; this can be static and initialized in a static initializer. https://codereview.chromium.org/240073007/diff/20001/src/com/dom_distiller/cl... src/com/dom_distiller/client/SchemaOrgParser.java:143: private Element mRoot = null; is this used? https://codereview.chromium.org/240073007/diff/20001/src/com/dom_distiller/cl... src/com/dom_distiller/client/SchemaOrgParser.java:258: newItem = createItemForElement(e); It looks like we might handle nesting of elements involving unsupported types incorrectly. Consider the following cases itemscope article itemprop itemscope unsupported itemprop itemscope image /itemprop /itemscope /itemscope itemscope unsupported itemprop itemscope image /itemscope /itemscope itemscope unsupported itemscope article /itemscope /itemscope (that last one we handle correctly I think, but to do so we have to make sure not to skip the children of unsupported items). https://codereview.chromium.org/240073007/diff/20001/src/com/dom_distiller/cl... src/com/dom_distiller/client/SchemaOrgParser.java:282: Clarification of what I meant with "both cases would advance the same way through the recursion, just with different args": This last part could just iterate over children and then there only needs to be one recursive parse() call (instead of 3): for (Node n : e.getChildNodes()) { if (!n.getNodeType() == Node.ELEMENT_NODE) continue; Element child = Element.as(n); parse(child, currItem); } where currItem is newItem if this element started a new scope or parentItem if not. https://codereview.chromium.org/240073007/diff/20001/test/com/dom_distiller/c... File test/com/dom_distiller/client/SchemaOrgParserTest.java (right): https://codereview.chromium.org/240073007/diff/20001/test/com/dom_distiller/c... test/com/dom_distiller/client/SchemaOrgParserTest.java:33: setItemProp(h, "description"); Note: the following comment applies to all of these test cases You can construct an html string for the structure that you're building and then use setInnerHTML() (instead of programmatically creating the structure). Pro: It's easier to see/understand what structure you are building Con: It's harder to refer to values inside that structure (things like 'expectedTitle' might end up being specified in two places) I'm just trying to think about ways to make it more clear what the structure of the tree you are building is. If you don't like the html string approach, here's some other possibilities: how about moving all the appendChild calls to be next to each other (basically in the order that they would be in the html string)? I guess this would require unique names for each thing you build here. maybe even something as simple as a comment at the top of the function describing the structure would be fine.
i've addressed all comments, plus implemented some more of the specs: - multiple names in itemprop attribute - multiple values for itemprop: only keep the first value - extract author when <a> or <link> has rel="author" and added tests for them. ptal. thx. https://codereview.chromium.org/240073007/diff/20001/src/com/dom_distiller/cl... File src/com/dom_distiller/client/SchemaOrgParser.java (right): https://codereview.chromium.org/240073007/diff/20001/src/com/dom_distiller/cl... src/com/dom_distiller/client/SchemaOrgParser.java:65: protected final Type mType; On 2014/04/18 01:17:01, cjhopman wrote: > nit: I think all the fields here could be private instead of protected and then > the methods could all just drop protected and have the default (or public) > access. Done. https://codereview.chromium.org/240073007/diff/20001/src/com/dom_distiller/cl... src/com/dom_distiller/client/SchemaOrgParser.java:66: protected final Element mRoot; On 2014/04/18 01:17:01, cjhopman wrote: > mRoot looks unused. Done. https://codereview.chromium.org/240073007/diff/20001/src/com/dom_distiller/cl... src/com/dom_distiller/client/SchemaOrgParser.java:142: private final List<ThingItem> mItemScopes; On 2014/04/18 01:17:01, cjhopman wrote: > this can be static and initialized in a static initializer. different instances of SchemaOrgParser have different mItemScopes - if i make it static, the testcases would fail. mTypeUrls, however, is the same across all instances of SchemaOrgParser, so i made it static and initialized its entries in a static initializer block. what i was on that, i also renamed all static variables to use "s" prefix. https://codereview.chromium.org/240073007/diff/20001/src/com/dom_distiller/cl... src/com/dom_distiller/client/SchemaOrgParser.java:143: private Element mRoot = null; On 2014/04/18 01:17:01, cjhopman wrote: > is this used? Done. https://codereview.chromium.org/240073007/diff/20001/src/com/dom_distiller/cl... src/com/dom_distiller/client/SchemaOrgParser.java:258: newItem = createItemForElement(e); On 2014/04/18 01:17:01, cjhopman wrote: > It looks like we might handle nesting of elements involving unsupported types > incorrectly. Consider the following cases > > > itemscope article > itemprop itemscope unsupported > itemprop itemscope image > /itemprop > /itemscope > /itemscope > > > itemscope unsupported > itemprop itemscope image > /itemscope > /itemscope > > itemscope unsupported > itemscope article > /itemscope > /itemscope > > (that last one we handle correctly I think, but to do so we have to make sure > not to skip the children of unsupported items). > > > > Done. before, i coded it based on the assumption, incorrect apparently, that all children of unsupported types should be ignored. new code passes ur cases, which are tests now. https://codereview.chromium.org/240073007/diff/20001/src/com/dom_distiller/cl... src/com/dom_distiller/client/SchemaOrgParser.java:282: On 2014/04/18 01:17:01, cjhopman wrote: > Clarification of what I meant with "both cases would > advance the same way through the recursion, just with different args": > > This last part could just iterate over children and then there only needs to be > one recursive parse() call (instead of 3): > > for (Node n : e.getChildNodes()) { > if (!n.getNodeType() == Node.ELEMENT_NODE) continue; > Element child = Element.as(n); > parse(child, currItem); > } > > where currItem is newItem if this element started a new scope or parentItem if > not. Done. fyi, can't use forEach for NodeList. https://codereview.chromium.org/240073007/diff/20001/test/com/dom_distiller/c... File test/com/dom_distiller/client/SchemaOrgParserTest.java (right): https://codereview.chromium.org/240073007/diff/20001/test/com/dom_distiller/c... test/com/dom_distiller/client/SchemaOrgParserTest.java:33: setItemProp(h, "description"); On 2014/04/18 01:17:01, cjhopman wrote: > Note: the following comment applies to all of these test cases > > You can construct an html string for the structure that you're building and then > use setInnerHTML() (instead of programmatically creating the structure). > > Pro: It's easier to see/understand what structure you are building > Con: It's harder to refer to values inside that structure (things like > 'expectedTitle' might end up being specified in two places) > > > I'm just trying to think about ways to make it more clear what the structure of > the tree you are building is. If you don't like the html string approach, here's > some other possibilities: > > how about moving all the appendChild calls to be next to each other (basically > in the order that they would be in the html string)? I guess this would require > unique names for each thing you build here. > > maybe even something as simple as a comment at the top of the function > describing the structure would be fine. Done. i changed to using html string and setInnerHTML(), with the perk of using expected*.
i discovered a bug with nested types, fixed it and added test in patch set 5. in my previous update, i forgot to mention that i also added more tags to extract property value, as per spec.
https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c... File src/com/dom_distiller/client/SchemaOrgParser.java (right): https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:36: public class SchemaOrgParser implements MarkupParser.Parser { Can we split this class into two parts: 1. A SchemaOrgParser (that is basically all the the parse() step and all that it does) 2. A MarkupParser.Parser that has all the logic for the getXxx() calls and uses (1) https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:161: sTypeUrls.put("http://schema.org/Article", Type.ARTICLE); We should probably recognize schema.org/NewsArticle and /BlobPosting as articles too (and maybe all the other subtypes of article) https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:163: sTypeUrls.put("http://schema.org/Organization", Type.ORGANIZATION); There are a whole bunch of subtypes of Organization that we will ignore (though a lot of them we probably don't care about). https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:202: if (title.isEmpty()) title = findStringProperty(NAME_PROP); This seems to mean that we will get the first item with a name. I really don't think we want to return the name of a person or image or organization as the title. I think we should probably be getting the title only from the article item. https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:210: return mItemScopes.get(0).mType.toString(); Does schema.org specify that this is a good way to get a type of a page? I don't think that they really have this concept and picking the first item doesn't seem correct since it's just picking the marked up thing that happens to appear first in the html. I think we either just have to always return nothing or maybe return some sort of article type if there is an article item. https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:215: return findStringProperty(URL_PROP); I don't really understand what we expect the url from getUrl() to be. I think the only thing that makes sense for this would be for it to be the url of an article item (not a person/organization/etc). https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:219: public MarkupParser.Image[] getImages() { We should be careful about what this returns, we only want to include images that are part of the article. For example, we don't want this list to include a picture of the author or organization. The images that we know are part of the article would include: the article item's image property all image items with the article as their associatedArticle property Things that we could probably assume are part of the article: all image items https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:240: return findStringProperty(DESCRIPTION_PROP); Again, this should probably only be the description from an article item. Same for author and copyright. https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:245: return findStringProperty(PUBLISHER_PROP); I think we would only want the publisher property from an article, not other types (maybe we may also want to look at the copyrightHolder or other properties of article) https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:474: // FAILY_NAME_PROP delimited by a whitespace. s/FAILY/FAMILY
i've addressed all comments, ptal at delta between patch sets 9 and 7. thx. in case u're wondering abt the name SchemaOrgParserAccessor, i'm thinking of renaming MarkupParser.Parser interface to MarkupParser.Accessor in the next cl ('cos i don't want to change the other parsers here). what do u think? https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c... File src/com/dom_distiller/client/SchemaOrgParser.java (right): https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:36: public class SchemaOrgParser implements MarkupParser.Parser { On 2014/04/21 16:52:22, cjhopman wrote: > Can we split this class into two parts: > > 1. A SchemaOrgParser (that is basically all the the parse() step and all that it > does) > 2. A MarkupParser.Parser that has all the logic for the getXxx() calls and uses > (1) Done. https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:161: sTypeUrls.put("http://schema.org/Article", Type.ARTICLE); On 2014/04/21 16:52:22, cjhopman wrote: > We should probably recognize schema.org/NewsArticle and /BlobPosting as articles > too (and maybe all the other subtypes of article) Done. https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:163: sTypeUrls.put("http://schema.org/Organization", Type.ORGANIZATION); On 2014/04/21 16:52:22, cjhopman wrote: > There are a whole bunch of subtypes of Organization that we will ignore (though > a lot of them we probably don't care about). i added more subtypes, but i'm not sure if they're enough or excess. https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:202: if (title.isEmpty()) title = findStringProperty(NAME_PROP); On 2014/04/21 16:52:22, cjhopman wrote: > This seems to mean that we will get the first item with a name. I really don't > think we want to return the name of a person or image or organization as the > title. I think we should probably be getting the title only from the article > item. Done. https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:210: return mItemScopes.get(0).mType.toString(); On 2014/04/21 16:52:22, cjhopman wrote: > Does http://schema.org specify that this is a good way to get a type of a page? I don't > think that they really have this concept and picking the first item doesn't seem > correct since it's just picking the marked up thing that happens to appear first > in the html. > > I think we either just have to always return nothing or maybe return some sort > of article type if there is an article item. Done. for the record, as per our off-line discussion: after this cl, i intend to consolidate and standardize the types returned by the 3 parsers in MarkupParser, added todo. https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:215: return findStringProperty(URL_PROP); On 2014/04/21 16:52:22, cjhopman wrote: > I don't really understand what we expect the url from getUrl() to be. > > I think the only thing that makes sense for this would be for it to be the url > of an article item (not a person/organization/etc). Done. https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:219: public MarkupParser.Image[] getImages() { On 2014/04/21 16:52:22, cjhopman wrote: > We should be careful about what this returns, we only want to include images > that are part of the article. For example, we don't want this list to include a > picture of the author or organization. > > The images that we know are part of the article would include: > the article item's image property > all image items with the article as their associatedArticle property > > Things that we could probably assume are part of the article: > all image items Done. per our discussion off-line, i won't impl the associatedArticle property yet; it requires pretty complicated parsing and resolution of itemref attribute. so we have the article item's image property and all the ImageObject's in this cl. however, i did impl the "associatedMedia" and "encoding" properties of article, which take precedence over the article's "image" property and forces ImageItem's representativeOfPage property to true. also added test. https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:240: return findStringProperty(DESCRIPTION_PROP); On 2014/04/21 16:52:22, cjhopman wrote: > Again, this should probably only be the description from an article item. Same > for author and copyright. Done. https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:245: return findStringProperty(PUBLISHER_PROP); On 2014/04/21 16:52:22, cjhopman wrote: > I think we would only want the publisher property from an article, not other > types (maybe we may also want to look at the copyrightHolder or other properties > of article) Done. https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:474: // FAILY_NAME_PROP delimited by a whitespace. On 2014/04/21 16:52:22, cjhopman wrote: > s/FAILY/FAMILY Done.
just a gentle reminder...
https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c... File src/com/dom_distiller/client/SchemaOrgParser.java (right): https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:321: // If <a> or <link> tags specify rel="author", extract it. Where does this rel="author" stuff come from? I can't find anything about it on schema.org or in the Microdata Dom API spec. https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c... File src/com/dom_distiller/client/SchemaOrgParser.java (right): https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:73: private final String[] mStringPropertyNames; How this all works together is rather confusing to me. It think that a ThingItem instance should just have a map<String, String> mStringProperties and map<String, ThingItem> mItemProperties. Whether or not a certain property is supported by some ThingItem type should be a class-level thing (and they should share those supported by the base ThingItem class). https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:78: // |stringPropertyNames| and |itemPropertyNames| are names of properties that this nit: Use javadoc comment format https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:97: MarkupParser.Article getArticle() { Let's get rid of these functions that only make sense on certain types (and so the caller should only call them on the right types). So: getArticle and isImageRepresentativeOfPage should be removed. https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:110: // Otherwise, repeat for |mItemProperties|. This is weird that we go looking into the properties of other items and treating them like string properties on this item. Where do we use this and is there a better way of doing it? https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:192: sTagAttributesMap = new HashMap<String, String[]>(); nit: move this (and the sTagAttributesMap declaration) down by the function that uses it. https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:200: sTagAttributesMap.put("A", new String[] { "HREF", "REL" }); I can't find documentation anywhere that says to use the rel= attribute. https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:222: // Returns the first item that has the requested property value. I don't think that this is something we would ever want to do. The only time we use it is to find the headline which is a property only of articles. In that case we should be iterating over the articles and checking each of them for a headline. https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:232: ThingItem findFirstArticle() { This should return an ArticleItem https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:425: imageItem.putStringValue(REPRESENTATIVE_PROP, "true", true); This is strange, and only seems to work if I do getImage() on the article before processing image types. I'd prefer to just remove setting representativeOfPage on this item and handle it in a separate change. https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:466: String getStringProperty(String propertyName) { Overriding getStringProperty like this is a little odd to me. I think it would be a lot easier to reason about if ThingItem's string property stuff acting just like a map. If it is common to want a different "name" concept than the schema.org name property, ThingItem can have a getName function that is overridable (and the default is just the name property).
i've addressed all comments in patch set 10. ptal. thx. https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c... File src/com/dom_distiller/client/SchemaOrgParser.java (right): https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:73: private final String[] mStringPropertyNames; On 2014/04/25 20:52:34, cjhopman wrote: > How this all works together is rather confusing to me. > > It think that a ThingItem instance should just have a map<String, String> > mStringProperties and map<String, ThingItem> mItemProperties. Whether or not a > certain property is supported by some ThingItem type should be a class-level > thing (and they should share those supported by the base ThingItem class). Done. i was using map b4, but changed to arrays when u didn't want map in the ie parser. https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:78: // |stringPropertyNames| and |itemPropertyNames| are names of properties that this On 2014/04/25 20:52:34, cjhopman wrote: > nit: Use javadoc comment format comments r not needed now. https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:97: MarkupParser.Article getArticle() { On 2014/04/25 20:52:34, cjhopman wrote: > Let's get rid of these functions that only make sense on certain types (and so > the caller should only call them on the right types). So: getArticle and > isImageRepresentativeOfPage should be removed. Done. i was doing it this way to avoid casting ThingItem, which i thought was discouraged as per the review for ie parser. https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:110: // Otherwise, repeat for |mItemProperties|. On 2014/04/25 20:52:34, cjhopman wrote: > This is weird that we go looking into the properties of other items and treating > them like string properties on this item. Where do we use this and is there a > better way of doing it? Done. author, publisher and copyright holder can be a simple name of string type, or a schema.org person, or a schema.org organization. i was trying to shield the callers from having to deal with string and item property types. in any case, i've removed this and changed the impl. https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:192: sTagAttributesMap = new HashMap<String, String[]>(); On 2014/04/25 20:52:34, cjhopman wrote: > nit: move this (and the sTagAttributesMap declaration) down by the function that > uses it. Done. https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:200: sTagAttributesMap.put("A", new String[] { "HREF", "REL" }); On 2014/04/25 20:52:34, cjhopman wrote: > I can't find documentation anywhere that says to use the rel= attribute. http://schema.org/author, or from http://schema.org/Article, look for "author" property. https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:222: // Returns the first item that has the requested property value. On 2014/04/25 20:52:34, cjhopman wrote: > I don't think that this is something we would ever want to do. The only time we > use it is to find the headline which is a property only of articles. In that > case we should be iterating over the articles and checking each of them for a > headline. Done. Image also has headline, but i assume we only care abt Article. https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:232: ThingItem findFirstArticle() { On 2014/04/25 20:52:34, cjhopman wrote: > This should return an ArticleItem Done. https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:425: imageItem.putStringValue(REPRESENTATIVE_PROP, "true", true); On 2014/04/25 20:52:34, cjhopman wrote: > This is strange, and only seems to work if I do getImage() on the article before > processing image types. > > I'd prefer to just remove setting representativeOfPage on this item and handle > it in a separate change. Done. i've removed setting representativeOfPage and handle the logic in getImages(), not sure if that's what u mean by "in a separate change". https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:466: String getStringProperty(String propertyName) { On 2014/04/25 20:52:34, cjhopman wrote: > Overriding getStringProperty like this is a little odd to me. I think it would > be a lot easier to reason about if ThingItem's string property stuff acting just > like a map. > > If it is common to want a different "name" concept than the http://schema.org name > property, ThingItem can have a getName function that is overridable (and the > default is just the name property). Done. i've made getStringProperty final, but implemented the logic differently from ur "getName" suggestion.
https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c... File src/com/dom_distiller/client/SchemaOrgParser.java (right): https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:200: sTagAttributesMap.put("A", new String[] { "HREF", "REL" }); On 2014/04/29 00:23:10, kuan wrote: > On 2014/04/25 20:52:34, cjhopman wrote: > > I can't find documentation anywhere that says to use the rel= attribute. > > http://schema.org/author, or from http://schema.org/Article, look for "author" > property. That's interesting. Maybe you could add a comment somehere pointing to that. https://codereview.chromium.org/240073007/diff/220001/src/com/dom_distiller/c... File src/com/dom_distiller/client/SchemaOrgParser.java (right): https://codereview.chromium.org/240073007/diff/220001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:422: sTagAttributesMap.put("LINK", new String[] { "HREF", "REL" }); I think this would be clearer if this was just a map from the element type to the normal (i.e. non-rel-author) place to find the value. Then just completely handle the rel-author stuff in getAuthorFromRelAttribute(). https://codereview.chromium.org/240073007/diff/220001/src/com/dom_distiller/c... File src/com/dom_distiller/client/SchemaOrgParserAccessor.java (right): https://codereview.chromium.org/240073007/diff/220001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParserAccessor.java:35: if (item.getType() == SchemaOrgParser.Type.ARTICLE) { Iterating through the articles seems pretty common. We could add a getArticleItems() to the parser (then we could do the cast in there instead). You could probably drop the findFirstArticle then and just get the first item in the list returned by getArticleItems. Probably could do the same for images even though I think we only iterate through those once.
i've addressed all comments in patch set 13. ptal. thx. https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c... File src/com/dom_distiller/client/SchemaOrgParser.java (right): https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:200: sTagAttributesMap.put("A", new String[] { "HREF", "REL" }); On 2014/04/29 17:04:19, cjhopman wrote: > On 2014/04/29 00:23:10, kuan wrote: > > On 2014/04/25 20:52:34, cjhopman wrote: > > > I can't find documentation anywhere that says to use the rel= attribute. > > > > http://schema.org/author, or from http://schema.org/Article, look for "author" > > property. > > That's interesting. Maybe you could add a comment somehere pointing to that. > Done. comment is in parse() where the rel attribute is extracted. https://codereview.chromium.org/240073007/diff/220001/src/com/dom_distiller/c... File src/com/dom_distiller/client/SchemaOrgParser.java (right): https://codereview.chromium.org/240073007/diff/220001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParser.java:422: sTagAttributesMap.put("LINK", new String[] { "HREF", "REL" }); On 2014/04/29 17:04:19, cjhopman wrote: > I think this would be clearer if this was just a map from the element type to > the normal (i.e. non-rel-author) place to find the value. > > Then just completely handle the rel-author stuff in getAuthorFromRelAttribute(). Done. hehe.. back then when i changed to include "rel" in the map, i thought maybe it'd be clearer to not include it, but decided to leave it to see what u think :) https://codereview.chromium.org/240073007/diff/220001/src/com/dom_distiller/c... File src/com/dom_distiller/client/SchemaOrgParserAccessor.java (right): https://codereview.chromium.org/240073007/diff/220001/src/com/dom_distiller/c... src/com/dom_distiller/client/SchemaOrgParserAccessor.java:35: if (item.getType() == SchemaOrgParser.Type.ARTICLE) { On 2014/04/29 17:04:19, cjhopman wrote: > Iterating through the articles seems pretty common. We could add a > getArticleItems() to the parser (then we could do the cast in there instead). > You could probably drop the findFirstArticle then and just get the first item in > the list returned by getArticleItems. > > Probably could do the same for images even though I think we only iterate > through those once. Done.
lgtm
Message was sent while issue was closed.
Committed patchset #13 manually as r8fe6964 (presubmit successful). |