Issue 240073007: recognize and parse Schema.org Markup

Issue 240073007: recognize and parse Schema.org Markup (Closed)

Created:
6 years, 8 months ago by kuan

Modified:
6 years, 7 months ago

Reviewers:
cjhopman

Base URL:
https://code.google.com/p/dom-distiller/@master

Visibility:
Public.

Description

recognize and parse Schema.org Markup 1) webpages tags are marked up with Schema.org vocabulary, along with microdata format, using the following attributes in any element, including <html> tag: a) itemscope: the element and its children are about a particular item b) itemtype: the type of item as defined by Schema.org c) itemprop: the property name as defined by itemtype types currently supported: ImageObject, Article, Person, Organization. 2) implement SchemaOrgParser - for each supported type, the class recognizes and parses each element as per specs in (1) in a one-time parsing during object instantiation - if a type is unsupported, its children are still parsed for top-level (i.e. non-itemprop) items, but not its itemprop attributes. 3) implement SchemaOrgParserAccessor - impl MarkupParser.Parser interface for SchemaOrgParser, added to list of parsers - add tests to test interface by accessing SchemaOrgParser. BUG=364356 R=cjhopman@chromium.org Committed: 8fe6964

Patch Set 1 #

Total comments: 18

Patch Set 2 : addressed all comments #

Total comments: 14

Patch Set 3 : addressed comments, impl some more specs #

Patch Set 4 : rm dbg info #

Patch Set 5 : fixed bug with nested types, added test #

Patch Set 6 : fix to not parse "itemprop" for unsupported parent #

Patch Set 7 : fine-tune prev bug fix #

Total comments: 21

Patch Set 8 : addressed all comments #

Patch Set 9 : addressed missed-out comments #

Total comments: 22

Patch Set 10 : addressed comments #

Patch Set 11 : rm unused props in image, mv fn to class-level #

Patch Set 12 : rm 1 more unused prop in image #

Total comments: 4

Patch Set 13 : addressed comments #

Created: 6 years, 7 months ago

Download [raw] [tar.bz2]

	Unified diffs	Side-by-side diffs	Delta from patch set	Stats (+1244 lines, -2 lines)			Patch
M	src/com/dom_distiller/client/MarkupParser.java	View	1 2 3 4 5 6 7 8	1 chunk	+2 lines, -2 lines	0 comments	Download
A	src/com/dom_distiller/client/SchemaOrgParser.java	View	1 2 3 4 5 6 7 8 9 10 11 12	1 chunk	+470 lines, -0 lines	0 comments	Download
A	src/com/dom_distiller/client/SchemaOrgParserAccessor.java	View	1 2 3 4 5 6 7 8 9 10 11 12	1 chunk	+170 lines, -0 lines	0 comments	Download
A	test/com/dom_distiller/client/SchemaOrgParserAccessorTest.java	View	1 2 3 4 5 6 7 8 9 10 11 12	1 chunk	+596 lines, -0 lines	0 comments	Download
M	test/com/dom_distiller/client/TestUtil.java	View		1 chunk	+6 lines, -0 lines	0 comments	Download

Messages

Total messages: 15 (0 generated)

Expand Messages | Collapse Messages

kuan

chris, based on your reviews for the IEReadingViewParser cl about using non-map approach to store ...

6 years, 8 months ago (2014-04-17 16:19:15 UTC) #1

cjhopman

https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client/MarkupParser.java File src/com/dom_distiller/client/MarkupParser.java (right): https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client/MarkupParser.java#newcode148 src/com/dom_distiller/client/MarkupParser.java:148: if (schema != null) mParsers.add(schema); new Foo() never returns ...

6 years, 8 months ago (2014-04-17 17:35:26 UTC) #2

https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client...
File src/com/dom_distiller/client/MarkupParser.java (right):

https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client...
src/com/dom_distiller/client/MarkupParser.java:148: if (schema != null)
mParsers.add(schema);
new Foo() never returns null.

https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client...
File src/com/dom_distiller/client/SchemaOrgParser.java (right):

https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client...
src/com/dom_distiller/client/SchemaOrgParser.java:146: private final Map<Type,
String> mTypeUrls = new EnumMap<Type, String>(Type.class);
This appears to only be used to lookup a type for a string, so it should
probably be a Map<String, Type>

https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client...
src/com/dom_distiller/client/SchemaOrgParser.java:264: private void
parse(Element root, ThingItem currItem) {
This function and its uses will be simplified if currItem is the closest
ancestor item of the element passed in. (I.e. it is never the item started at
the element passed in).

https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client...
src/com/dom_distiller/client/SchemaOrgParser.java:266: for (int i = 0; i <
allElems.getLength(); i++) {
The way that the tree is parsed is hard for me to follow. Why not walk the tree
recursively rather than iterating over all the elements? (The current approach
also has bad worst case behavior)

https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client...
src/com/dom_distiller/client/SchemaOrgParser.java:303: if
(!e.hasAttribute("ITEMSCOPE") || !e.hasAttribute("ITEMTYPE")) return null;
This should be a different function so you don't have to pass around a
one-element boolean array to get the result of this check.

https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client...
src/com/dom_distiller/client/SchemaOrgParser.java:327: parse(e, newItem);
This parse() call makes it harder for me to reason about the tree parsing, could
we move this recursion out of this function (and then into parse() directly)?
(and actually if change parse to recurse over the tree like my comment above,
both cases would advance the same way through the recursion, just with different
args).

https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client...
src/com/dom_distiller/client/SchemaOrgParser.java:334: Iterator<Map.Entry<Type,
String>> iter = typeUrls.iterator();
I think you can do:

for (Map.Entry<Type, String> typeUrl : mTypeUrls.entrySet()) {
  if (typeUrl...) return ...;
}

https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client...
src/com/dom_distiller/client/SchemaOrgParser.java:345: private void
extractProperty(Element e, ThingItem currItem, ThingItem embeddedItem) {
This function does a lot (and most of what it does isn't reflected in the name).
It gets the itemprop and then adds the embedded item to current item or gets the
element's property value and then adds that property to the current item.

This would be clearer as two functions:
getItemprop(Element)
getPropertyValue(Element)

and then handle the putItem/putString separately elsewhere.

If you keep this function close to what it currently is, the name should at
least indicate that it will add the extracted property to currItem.

kuan

i've addressed all comments in patch set 2. ptal. thx. https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client/MarkupParser.java File src/com/dom_distiller/client/MarkupParser.java (right): https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client/MarkupParser.java#newcode148 ...

6 years, 8 months ago (2014-04-18 00:19:03 UTC) #3

i've addressed all comments in patch set 2.  ptal.  thx.

https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client...
File src/com/dom_distiller/client/MarkupParser.java (right):

https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client...
src/com/dom_distiller/client/MarkupParser.java:148: if (schema != null)
mParsers.add(schema);
On 2014/04/17 17:35:26, cjhopman wrote:
> new Foo() never returns null.

Done.  repeated for ie.

https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client...
File src/com/dom_distiller/client/SchemaOrgParser.java (right):

https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client...
src/com/dom_distiller/client/SchemaOrgParser.java:146: private final Map<Type,
String> mTypeUrls = new EnumMap<Type, String>(Type.class);
On 2014/04/17 17:35:26, cjhopman wrote:
> This appears to only be used to lookup a type for a string, so it should
> probably be a Map<String, Type>

Done.

https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client...
src/com/dom_distiller/client/SchemaOrgParser.java:264: private void
parse(Element root, ThingItem currItem) {
On 2014/04/17 17:35:26, cjhopman wrote:
> This function and its uses will be simplified if currItem is the closest
> ancestor item of the element passed in. (I.e. it is never the item started at
> the element passed in). 

Done.

https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client...
src/com/dom_distiller/client/SchemaOrgParser.java:266: for (int i = 0; i <
allElems.getLength(); i++) {
On 2014/04/17 17:35:26, cjhopman wrote:
> The way that the tree is parsed is hard for me to follow. Why not walk the
tree
> recursively rather than iterating over all the elements? (The current approach
> also has bad worst case behavior)

Done.  what's the worst case behavior?

https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client...
src/com/dom_distiller/client/SchemaOrgParser.java:303: if
(!e.hasAttribute("ITEMSCOPE") || !e.hasAttribute("ITEMTYPE")) return null;
On 2014/04/17 17:35:26, cjhopman wrote:
> This should be a different function so you don't have to pass around a
> one-element boolean array to get the result of this check.

Done.

https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client...
src/com/dom_distiller/client/SchemaOrgParser.java:327: parse(e, newItem);
On 2014/04/17 17:35:26, cjhopman wrote:
> This parse() call makes it harder for me to reason about the tree parsing,
could
> we move this recursion out of this function (and then into parse() directly)?
> (and actually if change parse to recurse over the tree like my comment above,
> both cases would advance the same way through the recursion, just with
different
> args).

Done.  i'm not sure if i code it the way u want regarding "both cases would
advance the same way through the recursion, just with different args".

https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client...
src/com/dom_distiller/client/SchemaOrgParser.java:334: Iterator<Map.Entry<Type,
String>> iter = typeUrls.iterator();
On 2014/04/17 17:35:26, cjhopman wrote:
> I think you can do:
> 
> for (Map.Entry<Type, String> typeUrl : mTypeUrls.entrySet()) {
>   if (typeUrl...) return ...;
> }

Done.  since it's now a HashMap of <String, Type>, i just need to use
containsKey() and get().

https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client...
src/com/dom_distiller/client/SchemaOrgParser.java:345: private void
extractProperty(Element e, ThingItem currItem, ThingItem embeddedItem) {
On 2014/04/17 17:35:26, cjhopman wrote:
> This function does a lot (and most of what it does isn't reflected in the
name).
> It gets the itemprop and then adds the embedded item to current item or gets
the
> element's property value and then adds that property to the current item.
> 
> This would be clearer as two functions:
> getItemprop(Element)
> getPropertyValue(Element)
> 
> and then handle the putItem/putString separately elsewhere.
> 
> If you keep this function close to what it currently is, the name should at
> least indicate that it will add the extracted property to currItem.

Done.

cjhopman

https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client/SchemaOrgParser.java File src/com/dom_distiller/client/SchemaOrgParser.java (right): https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client/SchemaOrgParser.java#newcode266 src/com/dom_distiller/client/SchemaOrgParser.java:266: for (int i = 0; i < allElems.getLength(); i++) ...

6 years, 8 months ago (2014-04-18 01:17:01 UTC) #4

https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client...
File src/com/dom_distiller/client/SchemaOrgParser.java (right):

https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client...
src/com/dom_distiller/client/SchemaOrgParser.java:266: for (int i = 0; i <
allElems.getLength(); i++) {
On 2014/04/18 00:19:03, kuan wrote:
> On 2014/04/17 17:35:26, cjhopman wrote:
> > The way that the tree is parsed is hard for me to follow. Why not walk the
> tree
> > recursively rather than iterating over all the elements? (The current
approach
> > also has bad worst case behavior)
> 
> Done.  what's the worst case behavior?

The previous version had O(n^2) worst case complexity (this would happen if you
had a bunch of nested elements with each one starting a new nested item).

https://codereview.chromium.org/240073007/diff/1/src/com/dom_distiller/client...
src/com/dom_distiller/client/SchemaOrgParser.java:327: parse(e, newItem);
On 2014/04/18 00:19:03, kuan wrote:
> On 2014/04/17 17:35:26, cjhopman wrote:
> > This parse() call makes it harder for me to reason about the tree parsing,
> could
> > we move this recursion out of this function (and then into parse()
directly)?
> > (and actually if change parse to recurse over the tree like my comment
above,
> > both cases would advance the same way through the recursion, just with
> different
> > args).
> 
> Done.  i'm not sure if i code it the way u want regarding "both cases would
> advance the same way through the recursion, just with different args".

See the new comment in parse() for what I meant by that.

https://codereview.chromium.org/240073007/diff/20001/src/com/dom_distiller/cl...
File src/com/dom_distiller/client/SchemaOrgParser.java (right):

https://codereview.chromium.org/240073007/diff/20001/src/com/dom_distiller/cl...
src/com/dom_distiller/client/SchemaOrgParser.java:65: protected final Type
mType;
nit: I think all the fields here could be private instead of protected and then
the methods could all just drop protected and have the default (or public)
access.

https://codereview.chromium.org/240073007/diff/20001/src/com/dom_distiller/cl...
src/com/dom_distiller/client/SchemaOrgParser.java:66: protected final Element
mRoot;
mRoot looks unused.

https://codereview.chromium.org/240073007/diff/20001/src/com/dom_distiller/cl...
src/com/dom_distiller/client/SchemaOrgParser.java:142: private final
List<ThingItem> mItemScopes;
this can be static and initialized in a static initializer.

https://codereview.chromium.org/240073007/diff/20001/src/com/dom_distiller/cl...
src/com/dom_distiller/client/SchemaOrgParser.java:143: private Element mRoot =
null;
is this used?

https://codereview.chromium.org/240073007/diff/20001/src/com/dom_distiller/cl...
src/com/dom_distiller/client/SchemaOrgParser.java:258: newItem =
createItemForElement(e);
It looks like we might handle nesting of elements involving unsupported types
incorrectly. Consider the following cases

itemscope article
  itemprop itemscope unsupported
    itemprop itemscope image
    /itemprop
  /itemscope
/itemscope

itemscope unsupported
  itemprop itemscope image
  /itemscope
/itemscope

itemscope unsupported
  itemscope article
  /itemscope
/itemscope

(that last one we handle correctly I think, but to do so we have to make sure
not to skip the children of unsupported items).

https://codereview.chromium.org/240073007/diff/20001/src/com/dom_distiller/cl...
src/com/dom_distiller/client/SchemaOrgParser.java:282: 
Clarification of what I meant with "both cases would
advance the same way through the recursion, just with different args":

This last part could just iterate over children and then there only needs to be
one recursive parse() call (instead of 3):

for (Node n : e.getChildNodes()) {
 if (!n.getNodeType() == Node.ELEMENT_NODE) continue;
 Element child = Element.as(n);
 parse(child, currItem);
}

where currItem is newItem if this element started a new scope or parentItem if
not.

https://codereview.chromium.org/240073007/diff/20001/test/com/dom_distiller/c...
File test/com/dom_distiller/client/SchemaOrgParserTest.java (right):

https://codereview.chromium.org/240073007/diff/20001/test/com/dom_distiller/c...
test/com/dom_distiller/client/SchemaOrgParserTest.java:33: setItemProp(h,
"description");
Note: the following comment applies to all of these test cases

You can construct an html string for the structure that you're building and then
use setInnerHTML() (instead of programmatically creating the structure).

Pro: It's easier to see/understand what structure you are building
Con: It's harder to refer to values inside that structure (things like
'expectedTitle' might end up being specified in two places)

I'm just trying to think about ways to make it more clear what the structure of
the tree you are building is. If you don't like the html string approach, here's
some other possibilities:

how about moving all the appendChild calls to be next to each other (basically
in the order that they would be in the html string)? I guess this would require
unique names for each thing you build here.

maybe even something as simple as a comment at the top of the function
describing the structure would be fine.

kuan

i've addressed all comments, plus implemented some more of the specs: - multiple names in ...

6 years, 8 months ago (2014-04-18 23:34:37 UTC) #5

i've addressed all comments, plus implemented some more of the specs:
- multiple names in itemprop attribute
- multiple values for itemprop: only keep the first value
- extract author when <a> or <link> has rel="author"
and added tests for them.

ptal.  thx.

https://codereview.chromium.org/240073007/diff/20001/src/com/dom_distiller/cl...
File src/com/dom_distiller/client/SchemaOrgParser.java (right):

https://codereview.chromium.org/240073007/diff/20001/src/com/dom_distiller/cl...
src/com/dom_distiller/client/SchemaOrgParser.java:65: protected final Type
mType;
On 2014/04/18 01:17:01, cjhopman wrote:
> nit: I think all the fields here could be private instead of protected and
then
> the methods could all just drop protected and have the default (or public)
> access.

Done.

https://codereview.chromium.org/240073007/diff/20001/src/com/dom_distiller/cl...
src/com/dom_distiller/client/SchemaOrgParser.java:66: protected final Element
mRoot;
On 2014/04/18 01:17:01, cjhopman wrote:
> mRoot looks unused.

Done.

https://codereview.chromium.org/240073007/diff/20001/src/com/dom_distiller/cl...
src/com/dom_distiller/client/SchemaOrgParser.java:142: private final
List<ThingItem> mItemScopes;
On 2014/04/18 01:17:01, cjhopman wrote:
> this can be static and initialized in a static initializer.

different instances of SchemaOrgParser have different mItemScopes - if i make it
static, the testcases would fail.
mTypeUrls, however, is the same across all instances of SchemaOrgParser, so i
made it static and initialized its entries in a static initializer block.
what i was on that, i also renamed all static variables to use "s" prefix.

https://codereview.chromium.org/240073007/diff/20001/src/com/dom_distiller/cl...
src/com/dom_distiller/client/SchemaOrgParser.java:143: private Element mRoot =
null;
On 2014/04/18 01:17:01, cjhopman wrote:
> is this used?

Done.

https://codereview.chromium.org/240073007/diff/20001/src/com/dom_distiller/cl...
src/com/dom_distiller/client/SchemaOrgParser.java:258: newItem =
createItemForElement(e);
On 2014/04/18 01:17:01, cjhopman wrote:
> It looks like we might handle nesting of elements involving unsupported types
> incorrectly. Consider the following cases
> 
> 
> itemscope article
>   itemprop itemscope unsupported
>     itemprop itemscope image
>     /itemprop
>   /itemscope
> /itemscope
> 
> 
> itemscope unsupported
>   itemprop itemscope image
>   /itemscope
> /itemscope
> 
> itemscope unsupported
>   itemscope article
>   /itemscope
> /itemscope
> 
> (that last one we handle correctly I think, but to do so we have to make sure
> not to skip the children of unsupported items).
> 
> 
> 
> 

Done.  before, i coded it based on the assumption, incorrect apparently, that
all children of unsupported types should be ignored.
new code passes ur cases, which are tests now.

https://codereview.chromium.org/240073007/diff/20001/src/com/dom_distiller/cl...
src/com/dom_distiller/client/SchemaOrgParser.java:282: 
On 2014/04/18 01:17:01, cjhopman wrote:
> Clarification of what I meant with "both cases would
> advance the same way through the recursion, just with different args":
> 
> This last part could just iterate over children and then there only needs to
be
> one recursive parse() call (instead of 3):
> 
> for (Node n : e.getChildNodes()) {
>  if (!n.getNodeType() == Node.ELEMENT_NODE) continue;
>  Element child = Element.as(n);
>  parse(child, currItem);
> }
> 
> where currItem is newItem if this element started a new scope or parentItem if
> not.

Done.  fyi, can't use forEach for NodeList.

https://codereview.chromium.org/240073007/diff/20001/test/com/dom_distiller/c...
File test/com/dom_distiller/client/SchemaOrgParserTest.java (right):

https://codereview.chromium.org/240073007/diff/20001/test/com/dom_distiller/c...
test/com/dom_distiller/client/SchemaOrgParserTest.java:33: setItemProp(h,
"description");
On 2014/04/18 01:17:01, cjhopman wrote:
> Note: the following comment applies to all of these test cases
> 
> You can construct an html string for the structure that you're building and
then
> use setInnerHTML() (instead of programmatically creating the structure).
> 
> Pro: It's easier to see/understand what structure you are building
> Con: It's harder to refer to values inside that structure (things like
> 'expectedTitle' might end up being specified in two places)
> 
> 
> I'm just trying to think about ways to make it more clear what the structure
of
> the tree you are building is. If you don't like the html string approach,
here's
> some other possibilities:
> 
> how about moving all the appendChild calls to be next to each other (basically
> in the order that they would be in the html string)? I guess this would
require
> unique names for each thing you build here.
> 
> maybe even something as simple as a comment at the top of the function
> describing the structure would be fine.

Done.  i changed to using html string and setInnerHTML(), with the perk of using
expected*.

kuan

i discovered a bug with nested types, fixed it and added test in patch set ...

6 years, 8 months ago (2014-04-19 15:01:58 UTC) #6

cjhopman

https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/client/SchemaOrgParser.java File src/com/dom_distiller/client/SchemaOrgParser.java (right): https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/client/SchemaOrgParser.java#newcode36 src/com/dom_distiller/client/SchemaOrgParser.java:36: public class SchemaOrgParser implements MarkupParser.Parser { Can we split ...

6 years, 8 months ago (2014-04-21 16:52:22 UTC) #7

https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c...
File src/com/dom_distiller/client/SchemaOrgParser.java (right):

https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParser.java:36: public class
SchemaOrgParser implements MarkupParser.Parser {
Can we split this class into two parts:

1. A SchemaOrgParser (that is basically all the the parse() step and all that it
does)
2. A MarkupParser.Parser that has all the logic for the getXxx() calls and uses
(1)

https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParser.java:161:
sTypeUrls.put("http://schema.org/Article", Type.ARTICLE);
We should probably recognize schema.org/NewsArticle and /BlobPosting as articles
too (and maybe all the other subtypes of article)

https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParser.java:163:
sTypeUrls.put("http://schema.org/Organization", Type.ORGANIZATION);
There are a whole bunch of subtypes of Organization that we will ignore (though
a lot of them we probably don't care about).

https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParser.java:202: if (title.isEmpty())
title = findStringProperty(NAME_PROP);
This seems to mean that we will get the first item with a name. I really don't
think we want to return the name of a person or image or organization as the
title. I think we should probably be getting the title only from the article
item.

https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParser.java:210: return
mItemScopes.get(0).mType.toString();
Does schema.org specify that this is a good way to get a type of a page? I don't
think that they really have this concept and picking the first item doesn't seem
correct since it's just picking the marked up thing that happens to appear first
in the html. 

I think we either just have to always return nothing or maybe return some sort
of article type if there is an article item.

https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParser.java:215: return
findStringProperty(URL_PROP);
I don't really understand what we expect the url from getUrl() to be.

I think the only thing that makes sense for this would be for it to be the url
of an article item (not a person/organization/etc).

https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParser.java:219: public
MarkupParser.Image[] getImages() {
We should be careful about what this returns, we only want to include images
that are part of the article. For example, we don't want this list to include a
picture of the author or organization.

The images that we know are part of the article would include:
the article item's image property
all image items with the article as their associatedArticle property

Things that we could probably assume are part of the article:
all image items

https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParser.java:240: return
findStringProperty(DESCRIPTION_PROP);
Again, this should probably only be the description from an article item. Same
for author and copyright.

https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParser.java:245: return
findStringProperty(PUBLISHER_PROP);
I think we would only want the publisher property from an article, not other
types (maybe we may also want to look at the copyrightHolder or other properties
of article)

https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParser.java:474: // FAILY_NAME_PROP
delimited by a whitespace.
s/FAILY/FAMILY

kuan

i've addressed all comments, ptal at delta between patch sets 9 and 7. thx. in ...

6 years, 8 months ago (2014-04-23 15:32:35 UTC) #8

i've addressed all comments, ptal at delta between patch sets 9 and 7.   thx.

in case u're wondering abt the name SchemaOrgParserAccessor, i'm thinking of
renaming MarkupParser.Parser interface to MarkupParser.Accessor in the next cl
('cos i don't want to change the other parsers here).  what do u think?

https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c...
File src/com/dom_distiller/client/SchemaOrgParser.java (right):

https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParser.java:36: public class
SchemaOrgParser implements MarkupParser.Parser {
On 2014/04/21 16:52:22, cjhopman wrote:
> Can we split this class into two parts:
> 
> 1. A SchemaOrgParser (that is basically all the the parse() step and all that
it
> does)
> 2. A MarkupParser.Parser that has all the logic for the getXxx() calls and
uses
> (1)

Done.

https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParser.java:161:
sTypeUrls.put("http://schema.org/Article", Type.ARTICLE);
On 2014/04/21 16:52:22, cjhopman wrote:
> We should probably recognize schema.org/NewsArticle and /BlobPosting as
articles
> too (and maybe all the other subtypes of article)

Done.

https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParser.java:163:
sTypeUrls.put("http://schema.org/Organization", Type.ORGANIZATION);
On 2014/04/21 16:52:22, cjhopman wrote:
> There are a whole bunch of subtypes of Organization that we will ignore
(though
> a lot of them we probably don't care about).

i added more subtypes, but i'm not sure if they're enough or excess.

https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParser.java:202: if (title.isEmpty())
title = findStringProperty(NAME_PROP);
On 2014/04/21 16:52:22, cjhopman wrote:
> This seems to mean that we will get the first item with a name. I really don't
> think we want to return the name of a person or image or organization as the
> title. I think we should probably be getting the title only from the article
> item.

Done.

https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParser.java:210: return
mItemScopes.get(0).mType.toString();
On 2014/04/21 16:52:22, cjhopman wrote:
> Does http://schema.org specify that this is a good way to get a type of a
page? I don't
> think that they really have this concept and picking the first item doesn't
seem
> correct since it's just picking the marked up thing that happens to appear
first
> in the html. 
> 
> I think we either just have to always return nothing or maybe return some sort
> of article type if there is an article item.

Done.  for the record, as per our off-line discussion: after this cl, i intend
to consolidate and standardize the types returned by the 3 parsers in
MarkupParser, added todo.

https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParser.java:215: return
findStringProperty(URL_PROP);
On 2014/04/21 16:52:22, cjhopman wrote:
> I don't really understand what we expect the url from getUrl() to be.
> 
> I think the only thing that makes sense for this would be for it to be the url
> of an article item (not a person/organization/etc).

Done.

https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParser.java:219: public
MarkupParser.Image[] getImages() {
On 2014/04/21 16:52:22, cjhopman wrote:
> We should be careful about what this returns, we only want to include images
> that are part of the article. For example, we don't want this list to include
a
> picture of the author or organization.
> 
> The images that we know are part of the article would include:
> the article item's image property
> all image items with the article as their associatedArticle property
> 
> Things that we could probably assume are part of the article:
> all image items

Done.

per our discussion off-line, i won't impl the associatedArticle property yet; it
requires pretty complicated parsing and resolution of itemref attribute.  so we
have the article item's image property and all the ImageObject's in this cl.

however, i did impl the "associatedMedia" and "encoding" properties of article,
which take precedence over the article's "image" property and forces ImageItem's
representativeOfPage property to true.  also added test.

https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParser.java:240: return
findStringProperty(DESCRIPTION_PROP);
On 2014/04/21 16:52:22, cjhopman wrote:
> Again, this should probably only be the description from an article item. Same
> for author and copyright.

Done.

https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParser.java:245: return
findStringProperty(PUBLISHER_PROP);
On 2014/04/21 16:52:22, cjhopman wrote:
> I think we would only want the publisher property from an article, not other
> types (maybe we may also want to look at the copyrightHolder or other
properties
> of article)

Done.

https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParser.java:474: // FAILY_NAME_PROP
delimited by a whitespace.
On 2014/04/21 16:52:22, cjhopman wrote:
> s/FAILY/FAMILY

Done.

cjhopman

https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/client/SchemaOrgParser.java File src/com/dom_distiller/client/SchemaOrgParser.java (right): https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/client/SchemaOrgParser.java#newcode321 src/com/dom_distiller/client/SchemaOrgParser.java:321: // If <a> or <link> tags specify rel="author", extract ...

6 years, 8 months ago (2014-04-25 20:52:34 UTC) #10

https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c...
File src/com/dom_distiller/client/SchemaOrgParser.java (right):

https://codereview.chromium.org/240073007/diff/120001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParser.java:321: // If <a> or <link> tags
specify rel="author", extract it.
Where does this rel="author" stuff come from? I can't find anything about it on
schema.org or in the Microdata Dom API spec.

https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c...
File src/com/dom_distiller/client/SchemaOrgParser.java (right):

https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParser.java:73: private final String[]
mStringPropertyNames;
How this all works together is rather confusing to me. 

It think that a ThingItem instance should just have a map<String, String>
mStringProperties and map<String, ThingItem> mItemProperties. Whether or not a
certain property is supported by some ThingItem type should be a class-level
thing (and they should share those supported by the base ThingItem class).

https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParser.java:78: // |stringPropertyNames|
and |itemPropertyNames| are names of properties that this
nit: Use javadoc comment format

https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParser.java:97: MarkupParser.Article
getArticle() {
Let's get rid of these functions that only make sense on certain types (and so
the caller should only call them on the right types). So: getArticle and
isImageRepresentativeOfPage should be removed.

https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParser.java:110: // Otherwise, repeat for
|mItemProperties|.
This is weird that we go looking into the properties of other items and treating
them like string properties on this item. Where do we use this and is there a
better way of doing it?

https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParser.java:192: sTagAttributesMap = new
HashMap<String, String[]>();
nit: move this (and the sTagAttributesMap declaration) down by the function that
uses it.

https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParser.java:200:
sTagAttributesMap.put("A", new String[] { "HREF", "REL" });
I can't find documentation anywhere that says to use the rel= attribute.

https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParser.java:222: // Returns the first item
that has the requested property value.
I don't think that this is something we would ever want to do. The only time we
use it is to find the headline which is a property only of articles. In that
case we should be iterating over the articles and checking each of them for a
headline.

https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParser.java:232: ThingItem
findFirstArticle() {
This should return an ArticleItem

https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParser.java:425:
imageItem.putStringValue(REPRESENTATIVE_PROP, "true", true);
This is strange, and only seems to work if I do getImage() on the article before
processing image types.

I'd prefer to just remove setting representativeOfPage on this item and handle
it in a separate change.

https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParser.java:466: String
getStringProperty(String propertyName) {
Overriding getStringProperty like this is a little odd to me. I think it would
be a lot easier to reason about if ThingItem's string property stuff acting just
like a map.

If it is common to want a different "name" concept than the schema.org name
property, ThingItem can have a getName function that is overridable (and the
default is just the name property).

kuan

i've addressed all comments in patch set 10. ptal. thx. https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/client/SchemaOrgParser.java File src/com/dom_distiller/client/SchemaOrgParser.java (right): https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/client/SchemaOrgParser.java#newcode73 ...

6 years, 7 months ago (2014-04-29 00:23:10 UTC) #11

i've addressed all comments in patch set 10.  ptal.  thx.

https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c...
File src/com/dom_distiller/client/SchemaOrgParser.java (right):

https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParser.java:73: private final String[]
mStringPropertyNames;
On 2014/04/25 20:52:34, cjhopman wrote:
> How this all works together is rather confusing to me. 
> 
> It think that a ThingItem instance should just have a map<String, String>
> mStringProperties and map<String, ThingItem> mItemProperties. Whether or not a
> certain property is supported by some ThingItem type should be a class-level
> thing (and they should share those supported by the base ThingItem class).

Done.  i was using map b4, but changed to arrays when u didn't want map in the
ie parser.

https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParser.java:78: // |stringPropertyNames|
and |itemPropertyNames| are names of properties that this
On 2014/04/25 20:52:34, cjhopman wrote:
> nit: Use javadoc comment format

comments r not needed now.

https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParser.java:97: MarkupParser.Article
getArticle() {
On 2014/04/25 20:52:34, cjhopman wrote:
> Let's get rid of these functions that only make sense on certain types (and so
> the caller should only call them on the right types). So: getArticle and
> isImageRepresentativeOfPage should be removed.

Done.  i was doing it this way to avoid casting ThingItem, which i thought was
discouraged as per the review for ie parser.

https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParser.java:110: // Otherwise, repeat for
|mItemProperties|.
On 2014/04/25 20:52:34, cjhopman wrote:
> This is weird that we go looking into the properties of other items and
treating
> them like string properties on this item. Where do we use this and is there a
> better way of doing it?

Done.  author, publisher and copyright holder can be a simple name of string
type, or a schema.org person, or a schema.org organization.  i was trying to
shield the callers from having to deal with string and item property types.  in
any case, i've removed this and changed the impl.

https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParser.java:192: sTagAttributesMap = new
HashMap<String, String[]>();
On 2014/04/25 20:52:34, cjhopman wrote:
> nit: move this (and the sTagAttributesMap declaration) down by the function
that
> uses it.

Done.

https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParser.java:200:
sTagAttributesMap.put("A", new String[] { "HREF", "REL" });
On 2014/04/25 20:52:34, cjhopman wrote:
> I can't find documentation anywhere that says to use the rel= attribute.

http://schema.org/author, or from http://schema.org/Article, look for "author"
property.

https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParser.java:222: // Returns the first item
that has the requested property value.
On 2014/04/25 20:52:34, cjhopman wrote:
> I don't think that this is something we would ever want to do. The only time
we
> use it is to find the headline which is a property only of articles. In that
> case we should be iterating over the articles and checking each of them for a
> headline.

Done.  Image also has headline, but i assume we only care abt Article.

https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParser.java:232: ThingItem
findFirstArticle() {
On 2014/04/25 20:52:34, cjhopman wrote:
> This should return an ArticleItem

Done.

https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParser.java:425:
imageItem.putStringValue(REPRESENTATIVE_PROP, "true", true);
On 2014/04/25 20:52:34, cjhopman wrote:
> This is strange, and only seems to work if I do getImage() on the article
before
> processing image types.
> 
> I'd prefer to just remove setting representativeOfPage on this item and handle
> it in a separate change.

Done.  i've removed setting representativeOfPage and handle the logic in
getImages(), not sure if that's what u mean by "in a separate change".

https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParser.java:466: String
getStringProperty(String propertyName) {
On 2014/04/25 20:52:34, cjhopman wrote:
> Overriding getStringProperty like this is a little odd to me. I think it would
> be a lot easier to reason about if ThingItem's string property stuff acting
just
> like a map.
> 
> If it is common to want a different "name" concept than the http://schema.org
name
> property, ThingItem can have a getName function that is overridable (and the
> default is just the name property).

Done.  i've made getStringProperty final, but implemented the logic differently
from ur "getName" suggestion.

cjhopman

https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/client/SchemaOrgParser.java File src/com/dom_distiller/client/SchemaOrgParser.java (right): https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/client/SchemaOrgParser.java#newcode200 src/com/dom_distiller/client/SchemaOrgParser.java:200: sTagAttributesMap.put("A", new String[] { "HREF", "REL" }); On 2014/04/29 ...

6 years, 7 months ago (2014-04-29 17:04:18 UTC) #12

kuan

i've addressed all comments in patch set 13. ptal. thx. https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/client/SchemaOrgParser.java File src/com/dom_distiller/client/SchemaOrgParser.java (right): https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/client/SchemaOrgParser.java#newcode200 ...

6 years, 7 months ago (2014-04-29 23:26:42 UTC) #13

i've addressed all comments in patch set 13.  ptal.  thx.

https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c...
File src/com/dom_distiller/client/SchemaOrgParser.java (right):

https://codereview.chromium.org/240073007/diff/160001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParser.java:200:
sTagAttributesMap.put("A", new String[] { "HREF", "REL" });
On 2014/04/29 17:04:19, cjhopman wrote:
> On 2014/04/29 00:23:10, kuan wrote:
> > On 2014/04/25 20:52:34, cjhopman wrote:
> > > I can't find documentation anywhere that says to use the rel= attribute.
> > 
> > http://schema.org/author, or from http://schema.org/Article, look for
"author"
> > property.
> 
> That's interesting. Maybe you could add a comment somehere pointing to that.
> 

Done.  comment is in parse() where the rel attribute is extracted.

https://codereview.chromium.org/240073007/diff/220001/src/com/dom_distiller/c...
File src/com/dom_distiller/client/SchemaOrgParser.java (right):

https://codereview.chromium.org/240073007/diff/220001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParser.java:422:
sTagAttributesMap.put("LINK", new String[] { "HREF", "REL" });
On 2014/04/29 17:04:19, cjhopman wrote:
> I think this would be clearer if this was just a map from the element type to
> the normal (i.e. non-rel-author) place to find the value.
> 
> Then just completely handle the rel-author stuff in
getAuthorFromRelAttribute().

Done.  hehe.. back then when i changed to include "rel" in the map, i thought
maybe it'd be clearer to not include it, but decided to leave it to see what u
think :)

https://codereview.chromium.org/240073007/diff/220001/src/com/dom_distiller/c...
File src/com/dom_distiller/client/SchemaOrgParserAccessor.java (right):

https://codereview.chromium.org/240073007/diff/220001/src/com/dom_distiller/c...
src/com/dom_distiller/client/SchemaOrgParserAccessor.java:35: if (item.getType()
== SchemaOrgParser.Type.ARTICLE) {
On 2014/04/29 17:04:19, cjhopman wrote:
> Iterating through the articles seems pretty common. We could add a
> getArticleItems() to the parser (then we could do the cast in there instead).
> You could probably drop the findFirstArticle then and just get the first item
in
> the list returned by getArticleItems.
> 
> Probably could do the same for images even though I think we only iterate
> through those once.

Done.

kuan

6 years, 7 months ago (2014-04-30 04:55:45 UTC) #15

Message was sent while issue was closed.

Committed patchset #13 manually as r8fe6964 (presubmit successful).

Expand Messages | Collapse Messages