Chromium Code Reviews
chromiumcodereview-hr@appspot.gserviceaccount.com (chromiumcodereview-hr) | Please choose your nickname with Settings | Help | Chromium Project | Gerrit Changes | Sign out
(139)

Issue 290993004: Fix final content and title extraction. (Closed)

Created:
6 years, 7 months ago by Yaron
Modified:
6 years, 7 months ago
Reviewers:
cjhopman, kuan
CC:
kuan
Base URL:
https://code.google.com/p/dom-distiller/@master
Visibility:
Public.

Description

Fix final content and title extraction. The code in ContentExtractor had a slight bug in that it didn't drop elements that weren't classified as Content. This means that elements which remained in the dom classified as Boilerplate but with labels "Maybe Content" etc, would make it into the final output. Relatedly, we were often merging the <title> and promoting it to "content". Updated so we don't merge the title with subsequent blocks and ignore whether it's content BUG=375449 R=cjhopman@chromium.org Committed: 88806c2

Patch Set 1 #

Patch Set 2 : changed title handling, added test #

Total comments: 4

Patch Set 3 : #

Unified diffs Side-by-side diffs Delta from patch set Stats (+73 lines, -6 lines) Patch
M boilerpipe-core/src/main/de/l3s/boilerpipe/filters/heuristics/BlockProximityFusion.java View 1 4 chunks +10 lines, -5 lines 0 comments Download
M src/com/dom_distiller/client/ContentExtractor.java View 1 2 2 chunks +3 lines, -1 line 0 comments Download
A test/com/dom_distiller/client/ContentExtractorTest.java View 1 2 1 chunk +60 lines, -0 lines 0 comments Download

Messages

Total messages: 8 (0 generated)
Yaron
Could use an eval run on this one to see if it drastically breaks things. ...
6 years, 7 months ago (2014-05-20 21:02:11 UTC) #1
Yaron
On 2014/05/20 21:02:11, Yaron wrote: > Could use an eval run on this one to ...
6 years, 7 months ago (2014-05-21 22:29:06 UTC) #2
Yaron
trying again, this time a test. Local testing seems to do better
6 years, 7 months ago (2014-05-22 00:22:35 UTC) #3
cjhopman
https://codereview.chromium.org/290993004/diff/20001/src/com/dom_distiller/client/ContentExtractor.java File src/com/dom_distiller/client/ContentExtractor.java (right): https://codereview.chromium.org/290993004/diff/20001/src/com/dom_distiller/client/ContentExtractor.java#newcode76 src/com/dom_distiller/client/ContentExtractor.java:76: LogUtil.logToConsole(clonedSubtree.getNodeType() + " node:" + Node.ELEMENT_NODE); probably don't need ...
6 years, 7 months ago (2014-05-22 00:27:21 UTC) #4
Yaron
https://codereview.chromium.org/290993004/diff/20001/src/com/dom_distiller/client/ContentExtractor.java File src/com/dom_distiller/client/ContentExtractor.java (right): https://codereview.chromium.org/290993004/diff/20001/src/com/dom_distiller/client/ContentExtractor.java#newcode76 src/com/dom_distiller/client/ContentExtractor.java:76: LogUtil.logToConsole(clonedSubtree.getNodeType() + " node:" + Node.ELEMENT_NODE); On 2014/05/22 00:27:21, ...
6 years, 7 months ago (2014-05-22 17:05:24 UTC) #5
cjhopman
lgtm
6 years, 7 months ago (2014-05-22 19:01:36 UTC) #6
kuan
fyi: i patched this in, and tried to distill 2 of the urls i was ...
6 years, 7 months ago (2014-05-22 19:13:34 UTC) #7
Yaron
6 years, 7 months ago (2014-05-22 20:14:38 UTC) #8
Message was sent while issue was closed.
Committed patchset #3 manually as r88806c2 (presubmit successful).

Powered by Google App Engine
This is Rietveld 408576698