Issue 290993004: Fix final content and title extraction.

Issue 290993004: Fix final content and title extraction. (Closed)

Created:
6 years, 7 months ago by Yaron

Modified:
6 years, 7 months ago

Reviewers:
cjhopman, kuan

CC:
kuan

Base URL:
https://code.google.com/p/dom-distiller/@master

Visibility:
Public.

More Reviews

Description

Fix final content and title extraction. The code in ContentExtractor had a slight bug in that it didn't drop elements that weren't classified as Content. This means that elements which remained in the dom classified as Boilerplate but with labels "Maybe Content" etc, would make it into the final output. Relatedly, we were often merging the <title> and promoting it to "content". Updated so we don't merge the title with subsequent blocks and ignore whether it's content BUG=375449 R=cjhopman@chromium.org Committed: 88806c2

Patch Set 1 #

Patch Set 2 : changed title handling, added test #

Total comments: 4

Patch Set 3 : #

Created: 6 years, 7 months ago

Download [raw] [tar.bz2]

	Unified diffs	Side-by-side diffs	Delta from patch set	Stats (+73 lines, -6 lines)			Patch
M	boilerpipe-core/src/main/de/l3s/boilerpipe/filters/heuristics/BlockProximityFusion.java	View	1	4 chunks	+10 lines, -5 lines	0 comments	Download
M	src/com/dom_distiller/client/ContentExtractor.java	View	1 2	2 chunks	+3 lines, -1 line	0 comments	Download
A	test/com/dom_distiller/client/ContentExtractorTest.java	View	1 2	1 chunk	+60 lines, -0 lines	0 comments	Download

Messages

Total messages: 8 (0 generated)

Expand Messages | Collapse Messages

Yaron

Could use an eval run on this one to see if it drastically breaks things. ...

6 years, 7 months ago (2014-05-20 21:02:11 UTC) #1

Yaron

On 2014/05/20 21:02:11, Yaron wrote: > Could use an eval run on this one to ...

6 years, 7 months ago (2014-05-21 22:29:06 UTC) #2

Yaron

trying again, this time a test. Local testing seems to do better

6 years, 7 months ago (2014-05-22 00:22:35 UTC) #3

cjhopman

https://codereview.chromium.org/290993004/diff/20001/src/com/dom_distiller/client/ContentExtractor.java File src/com/dom_distiller/client/ContentExtractor.java (right): https://codereview.chromium.org/290993004/diff/20001/src/com/dom_distiller/client/ContentExtractor.java#newcode76 src/com/dom_distiller/client/ContentExtractor.java:76: LogUtil.logToConsole(clonedSubtree.getNodeType() + " node:" + Node.ELEMENT_NODE); probably don't need ...

6 years, 7 months ago (2014-05-22 00:27:21 UTC) #4

Yaron

6 years, 7 months ago (2014-05-22 17:05:24 UTC) #5

kuan

fyi: i patched this in, and tried to distill 2 of the urls i was ...

6 years, 7 months ago (2014-05-22 19:13:34 UTC) #7

Yaron

6 years, 7 months ago (2014-05-22 20:14:38 UTC) #8

Message was sent while issue was closed.

Committed patchset #3 manually as r88806c2 (presubmit successful).

Expand Messages | Collapse Messages