|
|
Descriptionimplement parser for new pagination algorithm
PageParameterParser
- parses a document to collect groups of adjacent plain text numbers and
outlinks with digital anchor text
- calls PageParameterDetector to detect pagination URLs
- add tests.
ParsedUrl
- add fns to get and set hash
- add test.
BUG=464123
R=wychen@chromium.org
Committed: 36a509ca42a49285ad4a770b38a8558f102c3337
Patch Set 1 #
Total comments: 2
Patch Set 2 : addr chris's comments #
Total comments: 13
Patch Set 3 : addr chris's comments, fixes for dataset urls #Patch Set 4 : addr chris's comments, fixes for dataset #
Total comments: 18
Patch Set 5 : addr wychen's comments #
Messages
Total messages: 15 (2 generated)
kuan@chromium.org changed reviewers: + cjhopman@chromium.org
https://codereview.chromium.org/1178633002/diff/1/java/org/chromium/distiller... File java/org/chromium/distiller/PageParameterParser.java (right): https://codereview.chromium.org/1178633002/diff/1/java/org/chromium/distiller... java/org/chromium/distiller/PageParameterParser.java:88: NodeList<Element> allLinks = root.getElementsByTagName("A"); What do you think of this alternative approach: allLinks = root.gEBTN("A"); idx = 0; while idx < allLinks.size: while idx < allLinks.size && !isNumber(allLinks[idx]): idx++; <find+add previous number siblings> <add allLinks[idx]> idx += 1 + <find+add next number siblings and return number of links added> That is, find a link that is a number, and then find all its siblings and skip ahead in the links list.
i've addressed all comments in patch set 2. ptal. thx. https://codereview.chromium.org/1178633002/diff/1/java/org/chromium/distiller... File java/org/chromium/distiller/PageParameterParser.java (right): https://codereview.chromium.org/1178633002/diff/1/java/org/chromium/distiller... java/org/chromium/distiller/PageParameterParser.java:88: NodeList<Element> allLinks = root.getElementsByTagName("A"); On 2015/06/26 00:35:25, cjhopman wrote: > What do you think of this alternative approach: > > allLinks = root.gEBTN("A"); > idx = 0; > while idx < allLinks.size: > while idx < allLinks.size && !isNumber(allLinks[idx]): idx++; > <find+add previous number siblings> > <add allLinks[idx]> > idx += 1 + <find+add next number siblings and return number of links added> > > That is, find a link that is a number, and then find all its siblings and skip > ahead in the links list. Done.
https://codereview.chromium.org/1178633002/diff/20001/java/org/chromium/disti... File java/org/chromium/distiller/PageParameterParser.java (right): https://codereview.chromium.org/1178633002/diff/20001/java/org/chromium/disti... java/org/chromium/distiller/PageParameterParser.java:105: Node parentWrapper = null; What's this parent wrapper? I don't recall that being in the previous changes. I don't understand what it is for. https://codereview.chromium.org/1178633002/diff/20001/java/org/chromium/disti... java/org/chromium/distiller/PageParameterParser.java:151: if (width == 0 || height == 0 || !DomUtil.isVisible(link)) return null; It seems odd that invisible links are handled here. Does that mean that in a sequence of numbered links we treat an invisible link the same as one that is visible but with non-number text? https://codereview.chromium.org/1178633002/diff/20001/java/org/chromium/disti... java/org/chromium/distiller/PageParameterParser.java:183: private boolean checkForPrevSiblingWithText(Node start) { I'm having difficulty understanding both the way that previous number siblings are found and the way that next number siblings are found. I would expect it to look something like: node = getPrevNode(node); // or node = getNextNode(node); if node is text: if is number: add it elif is non-number text: done elif is anchor: // the prevNode case doesn't need this one if is number: add it else: done https://codereview.chromium.org/1178633002/diff/20001/java/org/chromium/disti... java/org/chromium/distiller/PageParameterParser.java:262: * Add PageParamInfo.PageInfo for a non-link with numeric text. It looks like the text doesn't have to be strictly numeric. Is that correct? Could you expand here on what sort of text is accepted/rejected? https://codereview.chromium.org/1178633002/diff/20001/java/org/chromium/disti... java/org/chromium/distiller/PageParameterParser.java:309: private void addValidLink(AnchorElement link, AnchorElement baseAnchor) { probably rename this to addLinkIfValid() since it doesn't always add the link. but really, it seems odd that this might not add the link, but then it doesn't tell the caller whether it added it or not. The callers don't care if it is added or not? Or is it that they will only be passing valid links that will always get added?
i've addressed all comments in patch set 3, which also includes fixes for urls in new dataset. ptal. thx. https://codereview.chromium.org/1178633002/diff/20001/java/org/chromium/disti... File java/org/chromium/distiller/PageParameterParser.java (right): https://codereview.chromium.org/1178633002/diff/20001/java/org/chromium/disti... java/org/chromium/distiller/PageParameterParser.java:105: Node parentWrapper = null; On 2015/07/29 01:07:53, cjhopman wrote: > What's this parent wrapper? I don't recall that being in the previous changes. I > don't understand what it is for. i had it in the previous change, and attempted to explain it @107 (now 108); the latest patch does include some fixes for urls in new dataset. some urls wrap its pagination links in a div and/or span. such links will not have siblings; instead their parent wrappers would have siblings that contain the pagination links. i call it a parent wrapper because it's the topmost parent that simply wraps the link without adding extra text. if i don't get this parent wrapper, i'd not get the adjacent text before or after the link it contains. https://codereview.chromium.org/1178633002/diff/20001/java/org/chromium/disti... java/org/chromium/distiller/PageParameterParser.java:151: if (width == 0 || height == 0 || !DomUtil.isVisible(link)) return null; On 2015/07/29 01:07:53, cjhopman wrote: > It seems odd that invisible links are handled here. Does that mean that in a > sequence of numbered links we treat an invisible link the same as one that is > visible but with non-number text? invisible links need to be ignored. i do this here because this fn is where i decide if a link is to be considered for possible pagination. i guess the answer to ur question wld be yes - an invisible link should be ignored, as is a link with non-number text. https://codereview.chromium.org/1178633002/diff/20001/java/org/chromium/disti... java/org/chromium/distiller/PageParameterParser.java:183: private boolean checkForPrevSiblingWithText(Node start) { On 2015/07/29 01:07:53, cjhopman wrote: > I'm having difficulty understanding both the way that previous number siblings > are found and the way that next number siblings are found. > > I would expect it to look something like: > > node = getPrevNode(node); // or node = getNextNode(node); > if node is text: > if is number: > add it > elif is non-number text: > done > elif is anchor: // the prevNode case doesn't need this one > if is number: > add it > else: > done > i initially had the check for previous and next number siblings in one function, but split it when i changed the algo as per ur suggestion in your previous review, because of the different terminating conditions and the type of siblings processed. 1) checking for previous number siblings stops when the sibling is a link/document, or contains text, or there's no more sibling. 2) checking for next number siblings stops when the sibling is a document, or there's no more sibling - links are specifically handled here. more often than not, (2) would theoretically take more iterations than (1). the algo is further complicated by the parent wrapper consideration, and more variations in urls in new dataset. https://codereview.chromium.org/1178633002/diff/20001/java/org/chromium/disti... java/org/chromium/distiller/PageParameterParser.java:262: * Add PageParamInfo.PageInfo for a non-link with numeric text. On 2015/07/29 01:07:53, cjhopman wrote: > It looks like the text doesn't have to be strictly numeric. Is that correct? > Could you expand here on what sort of text is accepted/rejected? Done. renamed fn too. https://codereview.chromium.org/1178633002/diff/20001/java/org/chromium/disti... java/org/chromium/distiller/PageParameterParser.java:309: private void addValidLink(AnchorElement link, AnchorElement baseAnchor) { On 2015/07/29 01:07:53, cjhopman wrote: > probably rename this to addLinkIfValid() since it doesn't always add the link. > > but really, it seems odd that this might not add the link, but then it doesn't > tell the caller whether it added it or not. The callers don't care if it is > added or not? Or is it that they will only be passing valid links that will > always get added? Done. this fn is created simply to prevent duplication, 'cos it's called twice when checking for next number siblings. the caller doesn't care if link is added, it just cares that link has been processed.
https://codereview.chromium.org/1178633002/diff/20001/java/org/chromium/disti... File java/org/chromium/distiller/PageParameterParser.java (right): https://codereview.chromium.org/1178633002/diff/20001/java/org/chromium/disti... java/org/chromium/distiller/PageParameterParser.java:105: Node parentWrapper = null; On 2015/07/30 16:47:00, kuan wrote: > On 2015/07/29 01:07:53, cjhopman wrote: > > What's this parent wrapper? I don't recall that being in the previous changes. > I > > don't understand what it is for. > > i had it in the previous change, and attempted to explain it @107 (now 108); the > latest patch does include some fixes for urls in new dataset. > some urls wrap its pagination links in a div and/or span. such links will not > have siblings; instead their parent wrappers would have siblings that contain > the pagination links. i call it a parent wrapper because it's the topmost > parent that simply wraps the link without adding extra text. if i don't get > this parent wrapper, i'd not get the adjacent text before or after the link it > contains. But why the parent wrapper thing? Why not just walk backwards/forwards in the tree until you hit a bad node (non-number text/anchor basically)? This parent wrapper approach just seems wrong and complicated and fails on a lot of cases that I think we would want to accept.
https://codereview.chromium.org/1178633002/diff/20001/java/org/chromium/disti... File java/org/chromium/distiller/PageParameterParser.java (right): https://codereview.chromium.org/1178633002/diff/20001/java/org/chromium/disti... java/org/chromium/distiller/PageParameterParser.java:105: Node parentWrapper = null; On 2015/08/04 21:58:41, cjhopman wrote: > On 2015/07/30 16:47:00, kuan wrote: > > On 2015/07/29 01:07:53, cjhopman wrote: > > > What's this parent wrapper? I don't recall that being in the previous > changes. > > I > > > don't understand what it is for. > > > > i had it in the previous change, and attempted to explain it @107 (now 108); > the > > latest patch does include some fixes for urls in new dataset. > > some urls wrap its pagination links in a div and/or span. such links will not > > have siblings; instead their parent wrappers would have siblings that contain > > the pagination links. i call it a parent wrapper because it's the topmost > > parent that simply wraps the link without adding extra text. if i don't get > > this parent wrapper, i'd not get the adjacent text before or after the link it > > contains. > > But why the parent wrapper thing? Why not just walk backwards/forwards in the > tree until you hit a bad node (non-number text/anchor basically)? This parent > wrapper approach just seems wrong and complicated and fails on a lot of cases > that I think we would want to accept. what do u mean by "backwards/forwards in the tree"? breadth-wise or depth-wise? getting previous/next siblings is already walking breadth-wise, which yields nothing. hence, i hv to do upwards (i.e. depth-wise) to look for the parent wrapper. e.g. at http://www.doctoroz.com/article/andrew-weil-5-health-essentials (in page-links-golden-data.sstable), each pagination link is a child of a <li>. the pagination <a> has no sibling. if i don't get its parent wrapper i.e. the <li>, i won't know all the pagination links are adjacent to each other.
i've addressed all comments in patch set 4, which also includes fixes for urls in new dataset. ptal. thx. https://codereview.chromium.org/1178633002/diff/20001/java/org/chromium/disti... File java/org/chromium/distiller/PageParameterParser.java (right): https://codereview.chromium.org/1178633002/diff/20001/java/org/chromium/disti... java/org/chromium/distiller/PageParameterParser.java:105: Node parentWrapper = null; On 2015/08/04 22:38:37, kuan wrote: > On 2015/08/04 21:58:41, cjhopman wrote: > > On 2015/07/30 16:47:00, kuan wrote: > > > On 2015/07/29 01:07:53, cjhopman wrote: > > > > What's this parent wrapper? I don't recall that being in the previous > > changes. > > > I > > > > don't understand what it is for. > > > > > > i had it in the previous change, and attempted to explain it @107 (now 108); > > the > > > latest patch does include some fixes for urls in new dataset. > > > some urls wrap its pagination links in a div and/or span. such links will > not > > > have siblings; instead their parent wrappers would have siblings that > contain > > > the pagination links. i call it a parent wrapper because it's the topmost > > > parent that simply wraps the link without adding extra text. if i don't get > > > this parent wrapper, i'd not get the adjacent text before or after the link > it > > > contains. > > > > But why the parent wrapper thing? Why not just walk backwards/forwards in the > > tree until you hit a bad node (non-number text/anchor basically)? This parent > > wrapper approach just seems wrong and complicated and fails on a lot of cases > > that I think we would want to accept. > > what do u mean by "backwards/forwards in the tree"? breadth-wise or depth-wise? > getting previous/next siblings is already walking breadth-wise, which yields > nothing. hence, i hv to do upwards (i.e. depth-wise) to look for the parent > wrapper. > e.g. at http://www.doctoroz.com/article/andrew-weil-5-health-essentials (in > page-links-golden-data.sstable), each pagination link is a child of a <li>. the > pagination <a> has no sibling. if i don't get its parent wrapper i.e. the <li>, > i won't know all the pagination links are adjacent to each other. Done.
kuan@chromium.org changed reviewers: + wychen@chromium.org - cjhopman@chromium.org
chris's gone. wei-yin, pls help take a look. thx!
https://codereview.chromium.org/1178633002/diff/60001/java/org/chromium/disti... File java/org/chromium/distiller/PageParameterParser.java (right): https://codereview.chromium.org/1178633002/diff/60001/java/org/chromium/disti... java/org/chromium/distiller/PageParameterParser.java:50: * Parses the document to collect outlinks with digital anchor text and numeric text around Does digital mean the same thing as numeric? https://codereview.chromium.org/1178633002/diff/60001/java/org/chromium/disti... java/org/chromium/distiller/PageParameterParser.java:83: if (sHrefCleaner == null) sHrefCleaner = RegExp.compile("\\/$"); Is this faster than eager initialization? If these RegExp objects are always used, eager initialization would be simpler, and might even be faster. https://codereview.chromium.org/1178633002/diff/60001/java/org/chromium/disti... java/org/chromium/distiller/PageParameterParser.java:137: * "javascript:void" links with numeric text are considered valid links to be added. nit: not necessarily void. https://codereview.chromium.org/1178633002/diff/60001/java/org/chromium/disti... java/org/chromium/distiller/PageParameterParser.java:163: if (!isPlainPageNumber(number)) return null; Since most links aren't numbers, can we use this as early termination? For example, use /^[()[]{}\d]+$/ to test innerText in the beginning of the while loop. Need profiling to justify though. https://codereview.chromium.org/1178633002/diff/60001/java/org/chromium/disti... java/org/chromium/distiller/PageParameterParser.java:239: // given node. nit: indentation https://codereview.chromium.org/1178633002/diff/60001/java/org/chromium/disti... java/org/chromium/distiller/PageParameterParser.java:326: return Style.Cursor.valueOf(style.getCursor().toUpperCase()) == Style.Cursor.TEXT; Even if the cursor style is different, the link is still clickable, right? It's just not obvious to the user that it's a link. Do you have examples that require this checking? https://codereview.chromium.org/1178633002/diff/60001/java/org/chromium/disti... java/org/chromium/distiller/PageParameterParser.java:338: if (linkHref.isEmpty()) return ""; If href="", it means the current URL. What's the reason to override it? If there's a pattern that links with onclick event tend to have empty href, maybe leave a comment. https://codereview.chromium.org/1178633002/diff/60001/java/org/chromium/disti... java/org/chromium/distiller/PageParameterParser.java:347: linkText = linkText.replaceAll("\\s\\{2,\\}", " "); Why is this necessary? https://codereview.chromium.org/1178633002/diff/60001/javatests/org/chromium/... File javatests/org/chromium/distiller/PageParameterParserTest.java (right): https://codereview.chromium.org/1178633002/diff/60001/javatests/org/chromium/... javatests/org/chromium/distiller/PageParameterParserTest.java:177: public void testParentSibling0() { Should we add tests for things like this to test separators? <div> 1 | <a>2</a> | <a>3</a> </div> And probably one more test for complicated nested structure.
i've addressed all comments in patch set 5. ptal. thx. https://chromiumcodereview.appspot.com/1178633002/diff/60001/java/org/chromiu... File java/org/chromium/distiller/PageParameterParser.java (right): https://chromiumcodereview.appspot.com/1178633002/diff/60001/java/org/chromiu... java/org/chromium/distiller/PageParameterParser.java:50: * Parses the document to collect outlinks with digital anchor text and numeric text around On 2015/09/21 23:08:03, wychen wrote: > Does digital mean the same thing as numeric? Done. https://chromiumcodereview.appspot.com/1178633002/diff/60001/java/org/chromiu... java/org/chromium/distiller/PageParameterParser.java:83: if (sHrefCleaner == null) sHrefCleaner = RegExp.compile("\\/$"); On 2015/09/21 23:08:03, wychen wrote: > Is this faster than eager initialization? If these RegExp objects are always > used, eager initialization would be simpler, and might even be faster. sHrefCleaner is always used, so i've changed to init it at declaration. sInvalidParentWrapper is not, so i've moved its initialization to just before its usage. btw, this init-when-needed approach was what chris wanted for the previous changelists - he didn't want unnecessary static initializations. so i stuck to it here. https://chromiumcodereview.appspot.com/1178633002/diff/60001/java/org/chromiu... java/org/chromium/distiller/PageParameterParser.java:137: * "javascript:void" links with numeric text are considered valid links to be added. On 2015/09/21 23:08:03, wychen wrote: > nit: not necessarily void. Done. https://chromiumcodereview.appspot.com/1178633002/diff/60001/java/org/chromiu... java/org/chromium/distiller/PageParameterParser.java:163: if (!isPlainPageNumber(number)) return null; On 2015/09/21 23:08:03, wychen wrote: > Since most links aren't numbers, can we use this as early termination? > For example, use /^[()[]{}\d]+$/ to test innerText in the beginning of the while > loop. > Need profiling to justify though. i can't move it to beginning of while loop - it needs to stay in this function so that it's executed every time this fn is called from multiple places. if i move it to near beginning of this fn (i.e. aft isVisible(link)), profiling multiple times using the new dataset seems to show: - slightly better average performance - 10-18 individual urls are worse, the rest of the 200+ urls are better or same. i'm doing as u suggest based on the slightly better average perf results. https://chromiumcodereview.appspot.com/1178633002/diff/60001/java/org/chromiu... java/org/chromium/distiller/PageParameterParser.java:239: // given node. On 2015/09/21 23:08:03, wychen wrote: > nit: indentation Done. https://chromiumcodereview.appspot.com/1178633002/diff/60001/java/org/chromiu... java/org/chromium/distiller/PageParameterParser.java:326: return Style.Cursor.valueOf(style.getCursor().toUpperCase()) == Style.Cursor.TEXT; On 2015/09/21 23:08:03, wychen wrote: > Even if the cursor style is different, the link is still clickable, right? It's > just not obvious to the user that it's a link. Do you have examples that require > this checking? no, the link is not clickable - it behaves like regular text. in page-links-golden-data.sstable, http://nymag.com/arts/books/features/mann-gulch-norman-maclean-2014-9/, the page "1" is a link but has cursor=text and is not clickable. https://chromiumcodereview.appspot.com/1178633002/diff/60001/java/org/chromiu... java/org/chromium/distiller/PageParameterParser.java:338: if (linkHref.isEmpty()) return ""; On 2015/09/21 23:08:03, wychen wrote: > If href="", it means the current URL. What's the reason to override it? > If there's a pattern that links with onclick event tend to have empty href, > maybe leave a comment. anchors w/out "href" attr are not considered pagination links, so they need to be discarded right away. Done. added a comment. https://chromiumcodereview.appspot.com/1178633002/diff/60001/java/org/chromiu... java/org/chromium/distiller/PageParameterParser.java:347: linkText = linkText.replaceAll("\\s\\{2,\\}", " "); On 2015/09/21 23:08:03, wychen wrote: > Why is this necessary? the original code has this, so i follow suit. however, removing it didn't break the dataset, so it's now removed. https://chromiumcodereview.appspot.com/1178633002/diff/60001/javatests/org/ch... File javatests/org/chromium/distiller/PageParameterParserTest.java (right): https://chromiumcodereview.appspot.com/1178633002/diff/60001/javatests/org/ch... javatests/org/chromium/distiller/PageParameterParserTest.java:177: public void testParentSibling0() { On 2015/09/21 23:08:03, wychen wrote: > Should we add tests for things like this to test separators? > <div> > 1 | <a>2</a> | <a>3</a> > </div> > And probably one more test for complicated nested structure. Done. fyi, i already had testPuncationSibling() to test ',' as separator.
lgtm
Message was sent while issue was closed.
Committed patchset #5 (id:80001) manually as 36a509ca42a49285ad4a770b38a8558f102c3337 (presubmit successful). |