Issue 1568723002: Improve extraction of accessible text from PDF.

dmazzoni

dmazzoni@chromium.org changed reviewers: + jbreiden@google.com, raymes@chromium.org

4 years, 11 months ago (2016-01-06 22:33:33 UTC) #1

dmazzoni

Ready for an initial look. This loses support for links, but I'm okay with that ...

4 years, 11 months ago (2016-01-06 22:33:33 UTC) #2

dmazzoni

dmazzoni@chromium.org changed reviewers: + thestig@chromium.org

4 years, 11 months ago (2016-01-06 22:36:04 UTC) #3

dmazzoni

+thestig as an alternate reviewer since he's also cc'd on this.

4 years, 11 months ago (2016-01-06 22:36:05 UTC) #4

jbreiden

There is a fundamental question here. The original bug report was garbled ChromeVox output on ...

4 years, 11 months ago (2016-01-07 00:58:11 UTC) #6

jbreiden

I patched in this code (manually) and took a look at a small suite of ...

4 years, 11 months ago (2016-01-07 18:10:09 UTC) #7

jbreiden

I patched in this code (manually) and took a look at a small suite of ...

4 years, 11 months ago (2016-01-07 18:10:10 UTC) #8

jbreiden

I took a few screenshots. Here you can see English looks pretty good (many people ...

4 years, 11 months ago (2016-01-07 18:43:11 UTC) #9

dmazzoni

On 2016/01/07 00:58:11, jbreiden wrote: > not appear to overlap. So in some sense the ...

4 years, 11 months ago (2016-01-07 20:46:46 UTC) #10

dmazzoni

On 2016/01/07 18:10:10, jbreiden wrote: > How can I enable text-to-speech on a Linux > ...

4 years, 11 months ago (2016-01-07 20:53:26 UTC) #11

dmazzoni

On 2016/01/07 18:43:11, jbreiden wrote: > I took a few screenshots. Here you can see ...

4 years, 11 months ago (2016-01-07 21:00:03 UTC) #12

jbreiden

Does it make sense to refactor or share code with text selection? See PDFiumEngine::GetSelectedText() and ...

4 years, 11 months ago (2016-01-07 21:59:17 UTC) #13

Lei Zhang

I'll let jbreiden verify the functionality is good. So is FPDFText_GetBoundedText() fundamentally broken, or can ...

4 years, 11 months ago (2016-01-08 04:01:36 UTC) #14

jbreiden

* On Linux, --enable-speech-dispatcher gave a very crude rendition of Hebrew, spelling the words out ...

4 years, 11 months ago (2016-01-09 01:48:52 UTC) #15

dmazzoni

https://codereview.chromium.org/1568723002/diff/1/pdf/pdfium/pdfium_page.cc File pdf/pdfium/pdfium_page.cc (right): https://codereview.chromium.org/1568723002/diff/1/pdf/pdfium/pdfium_page.cc#newcode37 pdf/pdfium/pdfium_page.cc:37: pp::Rect PageRectToGViewRect(const pp::Rect &input, FPDF_PAGE page) { On 2016/01/08 ...

4 years, 11 months ago (2016-01-11 19:58:02 UTC) #16

dmazzoni

On 2016/01/09 01:48:52, jbreiden wrote: > * As discussed, this changelist makes a ChromeVox visual ...

4 years, 11 months ago (2016-01-11 20:17:08 UTC) #17

dmazzoni

The Hebrew example is working better now. Lei caught a bug where I wasn't using ...

4 years, 11 months ago (2016-01-11 23:07:39 UTC) #18

jbreiden

Much better on most Hebrew and Arabic. However, I'm seeing a regression on single line ...

4 years, 11 months ago (2016-01-12 01:25:51 UTC) #19

jbreiden

This changelist eats the final line of every PDF. It's just much more obvious with ...

4 years, 11 months ago (2016-01-12 01:29:17 UTC) #20

jbreiden

https://codereview.chromium.org/1568723002/diff/20001/pdf/pdfium/pdfium_page.cc File pdf/pdfium/pdfium_page.cc (right): https://codereview.chromium.org/1568723002/diff/20001/pdf/pdfium/pdfium_page.cc#newcode56 pdf/pdfium/pdfium_page.cc:56: VLOG(9) << "xml-invalid-rectangle"; There is no XML here, so ...

4 years, 11 months ago (2016-01-13 20:58:32 UTC) #21

dmazzoni

Fixed the last-line issue. Thanks. https://codereview.chromium.org/1568723002/diff/20001/pdf/pdfium/pdfium_page.cc File pdf/pdfium/pdfium_page.cc (right): https://codereview.chromium.org/1568723002/diff/20001/pdf/pdfium/pdfium_page.cc#newcode56 pdf/pdfium/pdfium_page.cc:56: VLOG(9) << "xml-invalid-rectangle"; On ...

4 years, 11 months ago (2016-01-13 23:05:05 UTC) #22

jbreiden

4 years, 11 months ago (2016-01-14 01:07:56 UTC) #23

We depended on null-terminated strings? Yikes. Thanks for figuring this out.

https://codereview.chromium.org/1568723002/diff/40001/pdf/pdfium/pdfium_page.cc
File pdf/pdfium/pdfium_page.cc (right):

https://codereview.chromium.org/1568723002/diff/40001/pdf/pdfium/pdfium_page....
pdf/pdfium/pdfium_page.cc:48: if (max_x < min_x)
Hmmm.... this works best for left-to-right languages. See how
jbreiden3.mtv/results/ara.pdf has bad justification? Something like this should
fix it.

int w = abs(max_x - min_x);  // width
int h = abs(max_y - min_y);  // height

if (max_y < min_y)
  std::swap(min_y, max_y);

// Declare the rect once so it is still in scope after the if/else.
pp::Rect output_rect;
if (max_x < min_x)
  output_rect = pp::Rect(max_x - w, max_y - h, w, h);
else
  output_rect = pp::Rect(min_x, min_y, w, h);

But honestly, Arabic and Hebrew in PDF are a mess, and there is a lot of
variation. So getting this working beautifully for Google Books does not mean
we will do well with arbitrary Hebrew and Arabic PDFs. E.g. I think the
suggested code change above will work nicely and not harm anything, but it is
hard to be sure.

Maybe it is best just to leave the code alone and add a comment
mentioning our weakness. Something like:

// This code left-justifies text in the bounding box, which
// makes sense for left-to-right languages such as English.
// This is not correct for a right-to-left language such as
// Arabic.

https://codereview.chromium.org/1568723002/diff/40001/pdf/pdfium/pdfium_page....
pdf/pdfium/pdfium_page.cc:62: if (right < left)
Same as above.

https://codereview.chromium.org/1568723002/diff/40001/pdf/pdfium/pdfium_page....
pdf/pdfium/pdfium_page.cc:231: is_intraword_linebreak =
!OverlapsOnYAxis(char_rect, next_char_rect);
Maybe a comment somewhere in this function mentioning that we are assuming
horizontal text and will do badly with vertical text such as Japanese or
Chinese. 

This is not a regression (old ChromeVox was also a disaster) but we might want
to get this working some day in the future and a comment may be useful.
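
A minimal sketch of the kind of check discussed above, to make the horizontal-text assumption concrete. The Rect struct and this body of OverlapsOnYAxis are hypothetical stand-ins, not the code under review: two characters are treated as being on the same line when their vertical extents intersect, which only makes sense when lines run horizontally.

```cpp
#include <cassert>

// Hypothetical stand-in for the rect type used in the patch.
struct Rect {
  int x;
  int y;
  int width;
  int height;
};

// Returns true if the vertical extents of |a| and |b| intersect.
// Assumes horizontal text: characters on one line share a y range.
// For vertical text (e.g. Japanese or Chinese set top to bottom),
// consecutive characters in the same line would not overlap here.
bool OverlapsOnYAxis(const Rect& a, const Rect& b) {
  assert(a.height >= 0 && b.height >= 0);
  return a.y < b.y + b.height && b.y < a.y + a.height;
}
```

With this definition, a negated result between a character and the next one is what would signal a line break in the middle of a word.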

jbreiden

or something like that. Hard to write correct code here without trial and error.

4 years, 11 months ago (2016-01-14 02:18:13 UTC) #24

Lei Zhang

https://codereview.chromium.org/1568723002/diff/40001/pdf/pdfium/pdfium_page.cc File pdf/pdfium/pdfium_page.cc (right): https://codereview.chromium.org/1568723002/diff/40001/pdf/pdfium/pdfium_page.cc#newcode41 pdf/pdfium/pdfium_page.cc:41: int min_x, min_y; One var per line please. https://codereview.chromium.org/1568723002/diff/40001/pdf/pdfium/pdfium_page.cc#newcode70 ...

4 years, 11 months ago (2016-01-14 02:45:06 UTC) #25

Lei Zhang

https://codereview.chromium.org/1568723002/diff/40001/pdf/pdfium/pdfium_page.cc File pdf/pdfium/pdfium_page.cc (right): https://codereview.chromium.org/1568723002/diff/40001/pdf/pdfium/pdfium_page.cc#newcode211 pdf/pdfium/pdfium_page.cc:211: for (int i = 0; i <= chars_count; i++) ...

4 years, 11 months ago (2016-01-14 02:47:58 UTC) #26

jbreiden

I am happy with functionality and do not need to see further work on Hebrew/Arabic ...

4 years, 11 months ago (2016-01-14 03:31:18 UTC) #27

dmazzoni

4 years, 11 months ago (2016-01-21 23:10:43 UTC) #28

Thanks for your patience, I got sidetracked. Ready for another look.

https://codereview.chromium.org/1568723002/diff/40001/pdf/pdfium/pdfium_page.cc
File pdf/pdfium/pdfium_page.cc (right):

https://codereview.chromium.org/1568723002/diff/40001/pdf/pdfium/pdfium_page....
pdf/pdfium/pdfium_page.cc:41: int min_x, min_y;
On 2016/01/14 02:45:06, Lei Zhang (OOO) wrote:
> One var per line please.

Done.

https://codereview.chromium.org/1568723002/diff/40001/pdf/pdfium/pdfium_page....
pdf/pdfium/pdfium_page.cc:48: if (max_x < min_x)
On 2016/01/14 01:07:56, jbreiden wrote:
> Hmmm.... this works best for left-to-right languages. See how

No, the issue here is that a pp::Rect isn't allowed to be "backwards", i.e. with
a left coordinate that's larger than its right coordinate.

This code is correct no matter what the text direction.

However, in order to *display* the resulting text properly, we need to
extract the text direction and pass that as part of the data structure.

FWIW, I think this will be moot once I switch to implementing accessibility
natively. The way it will work then is that we'll just export the bounds of
every character, and the user will see the full rendered PDF visually with an
accessible bounding box for whatever range they have selected.
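
A minimal sketch of the normalization described above, assuming a plain origin-plus-size rect. The Rect struct and RectFromCorners are hypothetical names for illustration and are not pp::Rect or the code in this patch. The point is that swapping the coordinates only orders the corners; it carries no information about text direction.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-in for a rect that must not be "backwards".
struct Rect {
  int x;
  int y;
  int width;
  int height;
};

// Builds a rect from two opposite corners given in any order, so the
// result always has a non-negative width and height.
Rect RectFromCorners(int x1, int y1, int x2, int y2) {
  const int left = std::min(x1, x2);
  const int top = std::min(y1, y2);
  const int right = std::max(x1, x2);
  const int bottom = std::max(y1, y2);
  Rect rect{left, top, right - left, bottom - top};
  assert(rect.width >= 0 && rect.height >= 0);
  return rect;
}
```

The same rect comes out whether the corners arrive left-to-right or right-to-left, which is why direction would have to travel separately in the data structure.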

https://codereview.chromium.org/1568723002/diff/40001/pdf/pdfium/pdfium_page....
pdf/pdfium/pdfium_page.cc:62: if (right < left)
On 2016/01/14 01:07:56, jbreiden wrote:
> Same as above.

Same answer. We have to swap or it doesn't work.

https://codereview.chromium.org/1568723002/diff/40001/pdf/pdfium/pdfium_page....
pdf/pdfium/pdfium_page.cc:70: // This is the character Pfdium inserts where a
word is broken across lines.
On 2016/01/14 02:45:06, Lei Zhang (OOO) wrote:
> PDFium

Done.

https://codereview.chromium.org/1568723002/diff/40001/pdf/pdfium/pdfium_page....
pdf/pdfium/pdfium_page.cc:211: for (int i = 0; i <= chars_count; i++) {
On 2016/01/14 02:47:58, Lei Zhang (OOO) wrote:
> On 2016/01/14 02:45:06, Lei Zhang wrote:
> > This went it i < chars_count in patch set 2 and back to <= in patch set 3.
> > Do we need to go past the last char or not?
> 
> "This went to"

This is on purpose. When i == chars_count, we pretend the character is a newline
in order to flush the final line.

I added a comment so it's clear this isn't a bug.
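
The sentinel idea can be sketched roughly as follows. This is an illustration over a plain std::string, with a hypothetical SplitLines helper, and not the loop in pdfium_page.cc: running the index one past the last character and treating that position as a newline means the last line gets emitted even when the text has no trailing newline.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Splits |text| into non-empty lines. The loop deliberately uses <= so that
// at i == text.size() the position is treated as a newline, which flushes
// whatever has accumulated for the final line.
std::vector<std::string> SplitLines(const std::string& text) {
  std::vector<std::string> lines;
  std::string current;
  for (std::size_t i = 0; i <= text.size(); ++i) {
    const char c = (i == text.size()) ? '\n' : text[i];
    if (c == '\n') {
      if (!current.empty())
        lines.push_back(current);
      current.clear();
    } else {
      current.push_back(c);
    }
  }
  // The sentinel guarantees nothing is left unflushed.
  assert(current.empty());
  return lines;
}
```

With i < text.size() instead, any text not ending in a newline would lose its last line, which matches the symptom reported earlier in this review.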

https://codereview.chromium.org/1568723002/diff/40001/pdf/pdfium/pdfium_page....
pdf/pdfium/pdfium_page.cc:231: is_intraword_linebreak =
!OverlapsOnYAxis(char_rect, next_char_rect);
On 2016/01/14 01:07:56, jbreiden wrote:
> Maybe a comment somewhere in this function mentioning that we are assuming
> horizontal text and will do badly with vertical text such as Japanese or
> Chinese. 
> 
> This is not a regression (old ChromeVox was also a disaster) but we might want
> to get this working some day in the future and a comment may be useful.

Added a link to a bug.

https://codereview.chromium.org/1568723002/diff/40001/pdf/pdfium/pdfium_page....
pdf/pdfium/pdfium_page.cc:234: base::IsUnicodeWhitespace(character) ||
On 2016/01/14 02:45:06, Lei Zhang (OOO) wrote:
> funny indentation

Done.

https://codereview.chromium.org/1568723002/diff/40001/pdf/pdfium/pdfium_page....
pdf/pdfium/pdfium_page.cc:242: if (IsEol(character) || is_intraword_linebreak) {
On 2016/01/14 02:45:06, Lei Zhang (OOO) wrote:
> You can also check |is_intraword_linebreak| first here.

Done.

https://codereview.chromium.org/1568723002/diff/40001/pdf/pdfium/pdfium_page....
pdf/pdfium/pdfium_page.cc:248: base::DictionaryValue* line_node = new
base::DictionaryValue();
On 2016/01/14 02:45:06, Lei Zhang (OOO) wrote:
> Can you create/initialize the Values in order? |text_node| -> |text_nodes| ->
> |line_node|.

Done.

chromium-reviews

Not related to this review, but hit me up if you need to check any ...

4 years, 11 months ago (2016-01-22 09:03:25 UTC) #29

dmazzoni

> > Not related to this review, but hit me up if you need to ...

4 years, 11 months ago (2016-01-22 17:50:50 UTC) #30

chromium-reviews

Sounds great. Please make sure to read the code supporting copy-paste if you haven't already. ...

4 years, 10 months ago (2016-01-30 03:40:13 UTC) #31

Lei Zhang

lgtm https://codereview.chromium.org/1568723002/diff/60001/pdf/pdfium/pdfium_page.cc File pdf/pdfium/pdfium_page.cc (right): https://codereview.chromium.org/1568723002/diff/60001/pdf/pdfium/pdfium_page.cc#newcode294 pdf/pdfium/pdfium_page.cc:294: node->Set(kPageTextBox, text); // Takes ownership of |text| Given ...

4 years, 10 months ago (2016-02-04 02:32:50 UTC) #33

dmazzoni

The patchset sent to the CQ was uploaded after l-g-t-m from thestig@chromium.org Link to the ...

4 years, 10 months ago (2016-02-04 17:15:25 UTC) #35