Description
Add text dump to pdfium_test.
This CL adds a '--txt' flag to pdfium_test. When provided, the text from the
PDF is extracted and written into page-numbered text files. The files are
written in UTF32-LE, as that's what FPDFText_GetUnicode() provides.
Committed: https://pdfium.googlesource.com/pdfium/+/b63068f04681f7ade9c062a442977c660e3503d0
Patch Set 1 #
Total comments: 4
Patch Set 2 : Review fixes #
Total comments: 2
Patch Set 3 : Review feedback #
Total comments: 2
Patch Set 4 : Fix help message error #
Messages
Total messages: 21 (8 generated)
dsinclair@chromium.org changed reviewers: + halcanary@chromium.org, thestig@chromium.org
PTAL.
Description was changed from ========== Add text dump to pdfium_text. This Cl adds a '--text' flag to pdfium_test. When provided the text from the PDF will be extracted and written into page numbered text files. The files are written in UTF32-LE as that's what FPDFText_GetUnicode() provides. ========== to ========== Add text dump to pdfium_test. This Cl adds a '--text' flag to pdfium_test. When provided the text from the PDF will be extracted and written into page numbered text files. The files are written in UTF32-LE as that's what FPDFText_GetUnicode() provides. ==========
halcanary@google.com changed reviewers: + halcanary@google.com
https://codereview.chromium.org/2060983005/diff/1/samples/pdfium_test.cc
File samples/pdfium_test.cc (right):
https://codereview.chromium.org/2060983005/diff/1/samples/pdfium_test.cc#newc...
samples/pdfium_test.cc:138: fwrite(&c, sizeof(wchar_t), 1, fp);
sizeof(wchar_t) is "compiler-dependent and therefore not very portable"
https://codereview.chromium.org/2060983005/diff/1/samples/pdfium_test.cc
File samples/pdfium_test.cc (right):
https://codereview.chromium.org/2060983005/diff/1/samples/pdfium_test.cc#newc...
samples/pdfium_test.cc:132: unsigned char bom[] = {0xFF, 0xFE, 0x00, 0x00};
What is the output format? UTF-32-LE?
On 2016/06/14 19:27:57, Hal Canary wrote:
> https://codereview.chromium.org/2060983005/diff/1/samples/pdfium_test.cc
> File samples/pdfium_test.cc (right):
>
> https://codereview.chromium.org/2060983005/diff/1/samples/pdfium_test.cc#newc...
> samples/pdfium_test.cc:132: unsigned char bom[] = {0xFF, 0xFE, 0x00, 0x00};
> What is the output format? UTF-32-LE?
Oh, I see now that that's what you say in the description. please add that to `--help`.
Description was changed from ========== Add text dump to pdfium_test. This Cl adds a '--text' flag to pdfium_test. When provided the text from the PDF will be extracted and written into page numbered text files. The files are written in UTF32-LE as that's what FPDFText_GetUnicode() provides. ========== to ========== Add text dump to pdfium_test. This Cl adds a '--txt' flag to pdfium_test. When provided the text from the PDF will be extracted and written into page numbered text files. The files are written in UTF32-LE as that's what FPDFText_GetUnicode() provides. ==========
https://codereview.chromium.org/2060983005/diff/1/samples/pdfium_test.cc
File samples/pdfium_test.cc (right):
https://codereview.chromium.org/2060983005/diff/1/samples/pdfium_test.cc#newc...
samples/pdfium_test.cc:132: unsigned char bom[] = {0xFF, 0xFE, 0x00, 0x00};
On 2016/06/14 19:27:57, Hal Canary wrote:
> What is the output format? UTF-32-LE?
Yup, UTF32-LE.
https://codereview.chromium.org/2060983005/diff/1/samples/pdfium_test.cc#newc...
samples/pdfium_test.cc:138: fwrite(&c, sizeof(wchar_t), 1, fp);
On 2016/06/14 19:23:12, Hal Canary wrote:
> sizeof(wchar_t) is "compiler-dependent and therefore not very portable"
Done.
On 2016/06/14 19:28:43, Hal Canary wrote:
> On 2016/06/14 19:27:57, Hal Canary wrote:
> > https://codereview.chromium.org/2060983005/diff/1/samples/pdfium_test.cc
> > File samples/pdfium_test.cc (right):
> >
> > https://codereview.chromium.org/2060983005/diff/1/samples/pdfium_test.cc#newc...
> > samples/pdfium_test.cc:132: unsigned char bom[] = {0xFF, 0xFE, 0x00, 0x00};
> > What is the output format? UTF-32-LE?
>
> Oh, I see now that that's what you say in the description. please add that to
> `--help`.
Done, added help and changed the flag to --txt to better match the other flags we provide.
https://codereview.chromium.org/2060983005/diff/20001/samples/pdfium_test.cc
File samples/pdfium_test.cc (right):
https://codereview.chromium.org/2060983005/diff/20001/samples/pdfium_test.cc#...
samples/pdfium_test.cc:133: unsigned char bom[] = {0xFF, 0xFE, 0x00, 0x00};
I would prefer:
    uint32_t bom = 0x0000FEFF;
    fwrite(&bom, sizeof(bom), 1, fp);
    uint32_t c = FPDFText_GetUnicode(textpage, i);
    fwrite(&c, sizeof(c), 1, fp);
Patchset #3 (id:40001) has been deleted
https://codereview.chromium.org/2060983005/diff/20001/samples/pdfium_test.cc
File samples/pdfium_test.cc (right):
https://codereview.chromium.org/2060983005/diff/20001/samples/pdfium_test.cc#...
samples/pdfium_test.cc:133: unsigned char bom[] = {0xFF, 0xFE, 0x00, 0x00};
On 2016/06/14 20:54:11, Hal Canary wrote:
> I would prefer:
>
> uint32_t bom = 0x0000FEFF;
> fwrite(&bom, sizeof(bom), 1, fp);
>
> uint32_t c = FPDFText_GetUnicode(textpage, i);
> fwrite(&c, sizeof(c), 1, fp);
>
Done.
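For readers following the BOM discussion: writing a uint32_t with fwrite() emits bytes in the host's native order, so the result is UTF32-LE only on a little-endian machine. The sketch below is not the code from this patch. It is a minimal, host-independent illustration that serializes each 32-bit value byte by byte, least significant byte first. The helper names AppendUint32Le and EncodeUtf32LeWithBom are hypothetical, and the code points are passed in directly rather than read through FPDFText_GetUnicode().

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of pdfium_test): appends one 32-bit value
// as four bytes, least significant byte first, regardless of host order.
void AppendUint32Le(std::vector<unsigned char>* out, uint32_t value) {
  for (int shift = 0; shift < 32; shift += 8)
    out->push_back(static_cast<unsigned char>((value >> shift) & 0xFF));
}

// Hypothetical helper: encodes the byte-order mark U+FEFF followed by each
// code point as UTF-32 little-endian.
std::vector<unsigned char> EncodeUtf32LeWithBom(
    const std::vector<uint32_t>& code_points) {
  std::vector<unsigned char> out;
  AppendUint32Le(&out, 0x0000FEFF);  // serializes as FF FE 00 00
  for (uint32_t cp : code_points)
    AppendUint32Le(&out, cp);
  return out;
}
```

On a little-endian host this produces the same bytes as the fwrite() form suggested in the review, so the difference only matters if the tool is built for a big-endian target.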
ping.
lgtm with nit.
https://codereview.chromium.org/2060983005/diff/60001/samples/pdfium_test.cc
File samples/pdfium_test.cc (right):
https://codereview.chromium.org/2060983005/diff/60001/samples/pdfium_test.cc#...
samples/pdfium_test.cc:791: " --txt - write page text in UTF32-LE <pdf-name.<page-number>.txt\n"
<pdf-name>.<page-number>.txt
https://codereview.chromium.org/2060983005/diff/60001/samples/pdfium_test.cc
File samples/pdfium_test.cc (right):
https://codereview.chromium.org/2060983005/diff/60001/samples/pdfium_test.cc#...
samples/pdfium_test.cc:791: " --txt - write page text in UTF32-LE <pdf-name.<page-number>.txt\n"
On 2016/06/16 14:32:15, Hal Canary wrote:
> <pdf-name>.<page-number>.txt
Done.
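As an illustration of the corrected help text, the <pdf-name>.<page-number>.txt pattern could be assembled as in the sketch below. This is an assumption about the naming scheme based only on the help string. The function name TextDumpFileName is hypothetical, and whether pdf-name keeps its extension or pages are numbered from 0 or 1 is not stated in this thread.

```cpp
#include <string>

// Hypothetical helper (not taken from pdfium_test.cc): builds an output
// name of the form <pdf-name>.<page-number>.txt.
std::string TextDumpFileName(const std::string& pdf_name, int page_number) {
  return pdf_name + "." + std::to_string(page_number) + ".txt";
}
```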
The CQ bit was checked by dsinclair@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from halcanary@google.com Link to the patchset: https://codereview.chromium.org/2060983005/#ps80001 (title: "Fix help message error")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/2060983005/80001
Message was sent while issue was closed.
Description was changed from ========== Add text dump to pdfium_test. This Cl adds a '--txt' flag to pdfium_test. When provided the text from the PDF will be extracted and written into page numbered text files. The files are written in UTF32-LE as that's what FPDFText_GetUnicode() provides. ========== to ========== Add text dump to pdfium_test. This Cl adds a '--txt' flag to pdfium_test. When provided the text from the PDF will be extracted and written into page numbered text files. The files are written in UTF32-LE as that's what FPDFText_GetUnicode() provides. Committed: https://pdfium.googlesource.com/pdfium/+/b63068f04681f7ade9c062a442977c660e35... ==========
Message was sent while issue was closed.
Committed patchset #4 (id:80001) as https://pdfium.googlesource.com/pdfium/+/b63068f04681f7ade9c062a442977c660e35...