Issue 1178633002: implement parser for new pagination algorithm

kuan

kuan@chromium.org changed reviewers: + cjhopman@chromium.org

5 years, 6 months ago (2015-06-10 21:01:24 UTC) #1

cjhopman

https://codereview.chromium.org/1178633002/diff/1/java/org/chromium/distiller/PageParameterParser.java File java/org/chromium/distiller/PageParameterParser.java (right): https://codereview.chromium.org/1178633002/diff/1/java/org/chromium/distiller/PageParameterParser.java#newcode88 java/org/chromium/distiller/PageParameterParser.java:88: NodeList<Element> allLinks = root.getElementsByTagName("A"); What do you think of ...

5 years, 6 months ago (2015-06-26 00:35:25 UTC) #3

kuan

i've addressed all comments in patch set 2. ptal. thx. https://codereview.chromium.org/1178633002/diff/1/java/org/chromium/distiller/PageParameterParser.java File java/org/chromium/distiller/PageParameterParser.java (right): https://codereview.chromium.org/1178633002/diff/1/java/org/chromium/distiller/PageParameterParser.java#newcode88 ...

5 years, 5 months ago (2015-07-13 22:57:04 UTC) #4

cjhopman

https://codereview.chromium.org/1178633002/diff/20001/java/org/chromium/distiller/PageParameterParser.java File java/org/chromium/distiller/PageParameterParser.java (right): https://codereview.chromium.org/1178633002/diff/20001/java/org/chromium/distiller/PageParameterParser.java#newcode105 java/org/chromium/distiller/PageParameterParser.java:105: Node parentWrapper = null; What's this parent wrapper? I ...

5 years, 4 months ago (2015-07-29 01:07:53 UTC) #5

https://codereview.chromium.org/1178633002/diff/20001/java/org/chromium/disti...
File java/org/chromium/distiller/PageParameterParser.java (right):

https://codereview.chromium.org/1178633002/diff/20001/java/org/chromium/disti...
java/org/chromium/distiller/PageParameterParser.java:105: Node parentWrapper =
null;
What's this parent wrapper? I don't recall that being in the previous changes. I
don't understand what it is for.

https://codereview.chromium.org/1178633002/diff/20001/java/org/chromium/disti...
java/org/chromium/distiller/PageParameterParser.java:151: if (width == 0 ||
height == 0 || !DomUtil.isVisible(link)) return null;
It seems odd that invisible links are handled here. Does that mean that in a
sequence of numbered links we treat an invisible link the same as one that is
visible but with non-number text?

https://codereview.chromium.org/1178633002/diff/20001/java/org/chromium/disti...
java/org/chromium/distiller/PageParameterParser.java:183: private boolean
checkForPrevSiblingWithText(Node start) {
I'm having difficulty understanding both the way that previous number siblings
are found and the way that next number siblings are found.

I would expect it to look something like:

node = getPrevNode(node); // or node = getNextNode(node);
if node is text:
  if is number:
    add it
  elif is non-number text:
    done
elif is anchor:  // the prevNode case doesn't need this one
  if is number:
    add it
  else:
    done

https://codereview.chromium.org/1178633002/diff/20001/java/org/chromium/disti...
java/org/chromium/distiller/PageParameterParser.java:262: * Add
PageParamInfo.PageInfo for a non-link with numeric text.
It looks like the text doesn't have to be strictly numeric. Is that correct?
Could you expand here on what sort of text is accepted/rejected?

https://codereview.chromium.org/1178633002/diff/20001/java/org/chromium/disti...
java/org/chromium/distiller/PageParameterParser.java:309: private void
addValidLink(AnchorElement link, AnchorElement baseAnchor) {
probably rename this to addLinkIfValid() since it doesn't always add the link.

but really, it seems odd that this might not add the link, but then it doesn't
tell the caller whether it added it or not. The callers don't care if it is
added or not? Or is it that they will only be passing valid links that will
always get added?

kuan

i've addressed all comments in patch set 3, which also includes fixes for urls in ...

5 years, 4 months ago (2015-07-30 16:47:01 UTC) #6

i've addressed all comments in patch set 3, which also includes fixes for urls
in new dataset.  ptal.  thx.

https://codereview.chromium.org/1178633002/diff/20001/java/org/chromium/disti...
File java/org/chromium/distiller/PageParameterParser.java (right):

https://codereview.chromium.org/1178633002/diff/20001/java/org/chromium/disti...
java/org/chromium/distiller/PageParameterParser.java:105: Node parentWrapper =
null;
On 2015/07/29 01:07:53, cjhopman wrote:
> What's this parent wrapper? I don't recall that being in the previous changes.
I
> don't understand what it is for.

i had it in the previous change, and attempted to explain it @107 (now 108); the
latest patch does include some fixes for urls in new dataset.
some urls wrap its pagination links in a div and/or span.  such links will not
have siblings; instead their parent wrappers would have siblings that contain
the pagination links.  i call it a parent wrapper because it's the topmost
parent that simply wraps the link without adding extra text.  if i don't get
this parent wrapper, i'd not get the adjacent text before or after the link it
contains.

https://codereview.chromium.org/1178633002/diff/20001/java/org/chromium/disti...
java/org/chromium/distiller/PageParameterParser.java:151: if (width == 0 ||
height == 0 || !DomUtil.isVisible(link)) return null;
On 2015/07/29 01:07:53, cjhopman wrote:
> It seems odd that invisible links are handled here. Does that mean that in a
> sequence of numbered links we treat an invisible link the same as one that is
> visible but with non-number text?

invisible links need to be ignored.  i do this here because this fn is where i
decide if a link is to be considered for possible pagination.

i guess the answer to ur question wld be yes - an invisible link should be
ignored, as is a link with non-number text.

https://codereview.chromium.org/1178633002/diff/20001/java/org/chromium/disti...
java/org/chromium/distiller/PageParameterParser.java:183: private boolean
checkForPrevSiblingWithText(Node start) {
On 2015/07/29 01:07:53, cjhopman wrote:
> I'm having difficulty understanding both the way that previous number siblings
> are found and the way that next number siblings are found.
> 
> I would expect it to look something like:
> 
> node = getPrevNode(node); // or node = getNextNode(node);
> if node is text:
>   if is number:
>     add it
>   elif is non-number text:
>     done
> elif is anchor:  // the prevNode case doesn't need this one
>   if is number:
>     add it
>   else:
>     done
>   

i initially had the check for previous and next number siblings in one function,
but split it when i changed the algo as per ur suggestion in your previous
review, because of the different terminating conditions and the type of siblings
processed.
1) checking for previous number siblings stops when the sibling is a
link/document, or contains text, or there's no more sibling.
2) checking for next number siblings stops when the sibling is a document, or
there's no more sibling - links are specifically handled here.
more often than not, (2) would theoretically take more iterations than (1).
the algo is further complicated by the parent wrapper consideration, and more
variations in urls in new dataset.

https://codereview.chromium.org/1178633002/diff/20001/java/org/chromium/disti...
java/org/chromium/distiller/PageParameterParser.java:262: * Add
PageParamInfo.PageInfo for a non-link with numeric text.
On 2015/07/29 01:07:53, cjhopman wrote:
> It looks like the text doesn't have to be strictly numeric. Is that correct?
> Could you expand here on what sort of text is accepted/rejected?

Done.  renamed fn too.

https://codereview.chromium.org/1178633002/diff/20001/java/org/chromium/disti...
java/org/chromium/distiller/PageParameterParser.java:309: private void
addValidLink(AnchorElement link, AnchorElement baseAnchor) {
On 2015/07/29 01:07:53, cjhopman wrote:
> probably rename this to addLinkIfValid() since it doesn't always add the link.
> 
> but really, it seems odd that this might not add the link, but then it doesn't
> tell the caller whether it added it or not. The callers don't care if it is
> added or not? Or is it that they will only be passing valid links that will
> always get added?

Done.  this fn is created simply to prevent duplication, 'cos it's called twice
when checking for next number siblings.  the caller doesn't care if link is
added, it just cares that link has been processed.

cjhopman

https://codereview.chromium.org/1178633002/diff/20001/java/org/chromium/distiller/PageParameterParser.java File java/org/chromium/distiller/PageParameterParser.java (right): https://codereview.chromium.org/1178633002/diff/20001/java/org/chromium/distiller/PageParameterParser.java#newcode105 java/org/chromium/distiller/PageParameterParser.java:105: Node parentWrapper = null; On 2015/07/30 16:47:00, kuan wrote: ...

5 years, 4 months ago (2015-08-04 21:58:41 UTC) #7

kuan

https://codereview.chromium.org/1178633002/diff/20001/java/org/chromium/distiller/PageParameterParser.java File java/org/chromium/distiller/PageParameterParser.java (right): https://codereview.chromium.org/1178633002/diff/20001/java/org/chromium/distiller/PageParameterParser.java#newcode105 java/org/chromium/distiller/PageParameterParser.java:105: Node parentWrapper = null; On 2015/08/04 21:58:41, cjhopman wrote: ...

5 years, 4 months ago (2015-08-04 22:38:37 UTC) #8

https://codereview.chromium.org/1178633002/diff/20001/java/org/chromium/disti...
File java/org/chromium/distiller/PageParameterParser.java (right):

https://codereview.chromium.org/1178633002/diff/20001/java/org/chromium/disti...
java/org/chromium/distiller/PageParameterParser.java:105: Node parentWrapper =
null;
On 2015/08/04 21:58:41, cjhopman wrote:
> On 2015/07/30 16:47:00, kuan wrote:
> > On 2015/07/29 01:07:53, cjhopman wrote:
> > > What's this parent wrapper? I don't recall that being in the previous
> changes.
> > I
> > > don't understand what it is for.
> > 
> > i had it in the previous change, and attempted to explain it @107 (now 108);
> the
> > latest patch does include some fixes for urls in new dataset.
> > some urls wrap its pagination links in a div and/or span.  such links will
not
> > have siblings; instead their parent wrappers would have siblings that
contain
> > the pagination links.  i call it a parent wrapper because it's the topmost
> > parent that simply wraps the link without adding extra text.  if i don't get
> > this parent wrapper, i'd not get the adjacent text before or after the link
it
> > contains.
> 
> But why the parent wrapper thing? Why not just walk backwards/forwards in the
> tree until you hit a bad node (non-number text/anchor basically)? This parent
> wrapper approach just seems wrong and complicated and fails on a lot of cases
> that I think we would want to accept.

what do u mean by "backwards/forwards in the tree"?  breadth-wise or depth-wise?
getting previous/next siblings is already walking breadth-wise, which yields
nothing.  hence, i hv to do upwards (i.e. depth-wise) to look for the parent
wrapper.
e.g. at http://www.doctoroz.com/article/andrew-weil-5-health-essentials (in
page-links-golden-data.sstable), each pagination link is a child of a <li>.  the
pagination <a> has no sibling.  if i don't get its parent wrapper i.e. the <li>,
i won't know all the pagination links are adjacent to each other.

kuan

i've addressed all comments in patch set 4, which also includes fixes for urls in ...

5 years, 4 months ago (2015-08-11 19:09:39 UTC) #9

i've addressed all comments in patch set 4, which also includes fixes for urls
in new dataset.  ptal.  thx.

https://codereview.chromium.org/1178633002/diff/20001/java/org/chromium/disti...
File java/org/chromium/distiller/PageParameterParser.java (right):

https://codereview.chromium.org/1178633002/diff/20001/java/org/chromium/disti...
java/org/chromium/distiller/PageParameterParser.java:105: Node parentWrapper =
null;
On 2015/08/04 22:38:37, kuan wrote:
> On 2015/08/04 21:58:41, cjhopman wrote:
> > On 2015/07/30 16:47:00, kuan wrote:
> > > On 2015/07/29 01:07:53, cjhopman wrote:
> > > > What's this parent wrapper? I don't recall that being in the previous
> > changes.
> > > I
> > > > don't understand what it is for.
> > > 
> > > i had it in the previous change, and attempted to explain it @107 (now
108);
> > the
> > > latest patch does include some fixes for urls in new dataset.
> > > some urls wrap its pagination links in a div and/or span.  such links will
> not
> > > have siblings; instead their parent wrappers would have siblings that
> contain
> > > the pagination links.  i call it a parent wrapper because it's the topmost
> > > parent that simply wraps the link without adding extra text.  if i don't
get
> > > this parent wrapper, i'd not get the adjacent text before or after the
link
> it
> > > contains.
> > 
> > But why the parent wrapper thing? Why not just walk backwards/forwards in
the
> > tree until you hit a bad node (non-number text/anchor basically)? This
parent
> > wrapper approach just seems wrong and complicated and fails on a lot of
cases
> > that I think we would want to accept.
> 
> what do u mean by "backwards/forwards in the tree"?  breadth-wise or
depth-wise?
> getting previous/next siblings is already walking breadth-wise, which yields
> nothing.  hence, i hv to do upwards (i.e. depth-wise) to look for the parent
> wrapper.
> e.g. at http://www.doctoroz.com/article/andrew-weil-5-health-essentials (in
> page-links-golden-data.sstable), each pagination link is a child of a <li>. 
the
> pagination <a> has no sibling.  if i don't get its parent wrapper i.e. the
<li>,
> i won't know all the pagination links are adjacent to each other.

Done.

kuan

kuan@chromium.org changed reviewers: + wychen@chromium.org - cjhopman@chromium.org

5 years, 4 months ago (2015-08-18 21:47:18 UTC) #10

wychen

https://codereview.chromium.org/1178633002/diff/60001/java/org/chromium/distiller/PageParameterParser.java File java/org/chromium/distiller/PageParameterParser.java (right): https://codereview.chromium.org/1178633002/diff/60001/java/org/chromium/distiller/PageParameterParser.java#newcode50 java/org/chromium/distiller/PageParameterParser.java:50: * Parses the document to collect outlinks with digital ...

5 years, 3 months ago (2015-09-21 23:08:03 UTC) #12

kuan

i've addressed all comments in patch set 5. ptal. thx. https://chromiumcodereview.appspot.com/1178633002/diff/60001/java/org/chromium/distiller/PageParameterParser.java File java/org/chromium/distiller/PageParameterParser.java (right): https://chromiumcodereview.appspot.com/1178633002/diff/60001/java/org/chromium/distiller/PageParameterParser.java#newcode50 ...

5 years, 2 months ago (2015-10-02 15:59:17 UTC) #13

i've addressed all comments in patch set 5.  ptal.  thx.

https://chromiumcodereview.appspot.com/1178633002/diff/60001/java/org/chromiu...
File java/org/chromium/distiller/PageParameterParser.java (right):

https://chromiumcodereview.appspot.com/1178633002/diff/60001/java/org/chromiu...
java/org/chromium/distiller/PageParameterParser.java:50: * Parses the document
to collect outlinks with digital anchor text and numeric text around
On 2015/09/21 23:08:03, wychen wrote:
> Does digital mean the same thing as numeric?

Done.

https://chromiumcodereview.appspot.com/1178633002/diff/60001/java/org/chromiu...
java/org/chromium/distiller/PageParameterParser.java:83: if (sHrefCleaner ==
null) sHrefCleaner = RegExp.compile("\\/$");
On 2015/09/21 23:08:03, wychen wrote:
> Is this faster than eager initialization? If these RegExp objects are always
> used, eager initialization would be simpler, and might even be faster.

sHrefCleaner is always used, so i've changed to init it at declaration.
sInvalidParentWrapper is not, so i've moved its initialization to just before
its usage.
btw, this init-when-needed approach was what chris wanted for the previous
changelists - he didn't want unnecessary static initializations. so i stuck to
it here.

https://chromiumcodereview.appspot.com/1178633002/diff/60001/java/org/chromiu...
java/org/chromium/distiller/PageParameterParser.java:137: * "javascript:void"
links with numeric text are considered valid links to be added.
On 2015/09/21 23:08:03, wychen wrote:
> nit: not necessarily void.

Done.

https://chromiumcodereview.appspot.com/1178633002/diff/60001/java/org/chromiu...
java/org/chromium/distiller/PageParameterParser.java:163: if
(!isPlainPageNumber(number)) return null;
On 2015/09/21 23:08:03, wychen wrote:
> Since most links aren't numbers, can we use this as early termination?
> For example, use /^[()[]{}\d]+$/ to test innerText in the beginning of the
while
> loop.
> Need profiling to justify though.

i can't move it to beginning of while loop - it needs to stay in this function
so that it's executed every time this fn is called from multiple places.
if i move it to near beginning of this fn (i.e. aft isVisible(link)), profiling
multiple times using the new dataset seems to show:
- slightly better average performance
- 10-18 individual urls are worse, the rest of the 200+ urls are better or same.
i'm doing as u suggest based on the slightly better average perf results.

https://chromiumcodereview.appspot.com/1178633002/diff/60001/java/org/chromiu...
java/org/chromium/distiller/PageParameterParser.java:239: // given node.
On 2015/09/21 23:08:03, wychen wrote:
> nit: indentation

Done.

https://chromiumcodereview.appspot.com/1178633002/diff/60001/java/org/chromiu...
java/org/chromium/distiller/PageParameterParser.java:326: return
Style.Cursor.valueOf(style.getCursor().toUpperCase()) == Style.Cursor.TEXT;
On 2015/09/21 23:08:03, wychen wrote:
> Even if the cursor style is different, the link is still clickable, right?
It's
> just not obvious to the user that it's a link. Do you have examples that
require
> this checking?

no, the link is not clickable - it behaves like regular text.
in page-links-golden-data.sstable,
http://nymag.com/arts/books/features/mann-gulch-norman-maclean-2014-9/, the page
"1" is a link but has cursor=text and is not clickable.

https://chromiumcodereview.appspot.com/1178633002/diff/60001/java/org/chromiu...
java/org/chromium/distiller/PageParameterParser.java:338: if
(linkHref.isEmpty()) return "";
On 2015/09/21 23:08:03, wychen wrote:
> If href="", it means the current URL. What's the reason to override it?
> If there's a pattern that links with onclick event tend to have empty href,
> maybe leave a comment.

anchors w/out "href" attr are not considered pagination links, so they need to
be discarded right away.
Done.  added a comment.

https://chromiumcodereview.appspot.com/1178633002/diff/60001/java/org/chromiu...
java/org/chromium/distiller/PageParameterParser.java:347: linkText =
linkText.replaceAll("\\s\\{2,\\}", " ");
On 2015/09/21 23:08:03, wychen wrote:
> Why is this necessary?

the original code has this, so i follow suit.  however, removing it didn't break
the dataset, so it's now removed.

https://chromiumcodereview.appspot.com/1178633002/diff/60001/javatests/org/ch...
File javatests/org/chromium/distiller/PageParameterParserTest.java (right):

https://chromiumcodereview.appspot.com/1178633002/diff/60001/javatests/org/ch...
javatests/org/chromium/distiller/PageParameterParserTest.java:177: public void
testParentSibling0() {
On 2015/09/21 23:08:03, wychen wrote:
> Should we add tests for things like this to test separators?
> <div>
> 1 | <a>2</a> | <a>3</a>
> </div>
> And probably one more test for complicated nested structure.

Done.  fyi, i already had testPuncationSibling() to test ',' as separator.

kuan

5 years, 2 months ago (2015-10-05 20:14:47 UTC) #15

Message was sent while issue was closed.

Committed patchset #5 (id:80001) manually as
36a509ca42a49285ad4a770b38a8558f102c3337 (presubmit successful).

Issue 1178633002: implement parser for new pagination algorithm (Closed)

Description

Patch Set 1 #

Patch Set 2 : addr chris's comments #

Patch Set 3 : addr chris's comments, fixes for dataset urls #

Patch Set 4 : addr chris's comments, fixes for dataset #

Patch Set 5 : addr wychen's comments #

Messages