Description
Fix final content and title extraction.
The code in ContentExtractor had a bug: it didn't drop elements that
weren't classified as Content. As a result, elements that remained in
the DOM classified as Boilerplate, but carrying labels such as
"Maybe Content", would make it into the final output.
Relatedly, we were often merging the <title> with subsequent blocks and
promoting it to "content". Updated so that we no longer merge the title
with subsequent blocks, and we ignore whether it's classified as content.
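
The two fixes described above can be sketched roughly as follows. This is a language-neutral illustration in Python, not the code from the patch: `TextBlock`, `final_output`, `merge_adjacent`, and the label strings "TITLE" and "MAYBE_CONTENT" are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class TextBlock:
    """Hypothetical stand-in for a classified block of page text."""
    text: str
    is_content: bool = False
    labels: set = field(default_factory=set)


def final_output(blocks):
    """Keep only blocks classified as content; a label alone never qualifies."""
    return [b.text for b in blocks if b.is_content]


def merge_adjacent(blocks):
    """Merge neighbouring blocks with the same classification, never the title."""
    merged = []
    for block in blocks:
        prev = merged[-1] if merged else None
        can_merge = (
            prev is not None
            and prev.is_content == block.is_content
            # Assumed rule: a title block stays on its own, whatever its class.
            and "TITLE" not in prev.labels
            and "TITLE" not in block.labels
        )
        if can_merge:
            prev.text += "\n" + block.text
            prev.labels |= block.labels
        else:
            # Copy so the caller's blocks are not mutated by later merges.
            merged.append(TextBlock(block.text, block.is_content, set(block.labels)))
    return merged
```

In this sketch a Boilerplate block labelled "MAYBE_CONTENT" is dropped by `final_output` because only the classification is consulted, and a block labelled "TITLE" is never fused with the paragraph that follows it.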
BUG=375449
R=cjhopman@chromium.org
Committed: 88806c2
Patch Set 1
Patch Set 2: changed title handling, added test
Total comments: 4
Patch Set 3
Messages
Total messages: 8 (0 generated)