|
|
DescriptionCheck whether the annotation content is empty using CFX_WideString.
CFX_ByteString cannot properly check whether the contents string is empty
because the first two bytes of text strings encoded in Unicode are always ASCII
254 followed by 255. So if we get contents in CFX_ByteString, the length will
always be 2.
Also, roll DEPS for testing/corpus to 608bf04.
Committed: https://pdfium.googlesource.com/pdfium/+/3451c0900cbbc06d82a07bca3670ac87ec36f52e
Patch Set 1 #Patch Set 2 : Rebased #Patch Set 3 : Check whether the annotation content is empty using CFX_WideString. #Messages
Total messages: 31 (12 generated)
jaepark@google.com changed reviewers: + dsinclair@chromium.org, thestig@chromium.org
The test case for this is in https://codereview.chromium.org/2301793002 .
Can you be more specific about why CFX_ByteString cannot properly check for emptiness?
Description was changed from ========== Check whether the annotation content is empty using CFX_WideString. CFX_ByteString cannot properly check whether the contents string is empty. ========== to ========== Check whether the annotation content is empty using CFX_WideString. CFX_ByteString cannot properly check whether the contents string is empty because the first two bytes of text strings encoded in Unicode are always ASCII 254 followed by 255. So the length of the CFX_ByteString will always be 2. ==========
Description was changed from ========== Check whether the annotation content is empty using CFX_WideString. CFX_ByteString cannot properly check whether the contents string is empty because the first two bytes of text strings encoded in Unicode are always ASCII 254 followed by 255. So the length of the CFX_ByteString will always be 2. ========== to ========== Check whether the annotation content is empty using CFX_WideString. CFX_ByteString cannot properly check whether the contents string is empty because the first two bytes of text strings encoded in Unicode are always ASCII 254 followed by 255. So the length of the CFX_ByteString will always be 2. ==========
Description was changed from ========== Check whether the annotation content is empty using CFX_WideString. CFX_ByteString cannot properly check whether the contents string is empty because the first two bytes of text strings encoded in Unicode are always ASCII 254 followed by 255. So the length of the CFX_ByteString will always be 2. ========== to ========== Check whether the annotation content is empty using CFX_WideString. CFX_ByteString cannot properly check whether the contents string is empty because the first two bytes of text strings encoded in Unicode are always ASCII 254 followed by 255. So if we get contents in CFX_ByteString, the length will always be 2. ==========
On 2016/09/01 01:17:52, Lei Zhang wrote: > Can you be more specific about why CFX_ByteString cannot properly check for > emptiness? Updated in the description.
On 2016/09/01 01:50:35, jaepark wrote: > On 2016/09/01 01:17:52, Lei Zhang wrote: > > Can you be more specific about why CFX_ByteString cannot properly check for > > emptiness? > > Updated in the description. We have a unicode BOM at the beginning of every wide string? That seems .... broken. This feels like the bug is somewhere else and we shouldn't be storing the BOM with every string.
On 2016/09/01 14:21:29, dsinclair wrote: > > We have a unicode BOM at the beginning of every wide string? That seems .... > broken. This feels like the bug is somewhere else and we shouldn't be storing > the BOM with every string. CPDF_Dictionary::GetUnicodeTextBy() decodes the encoded text so that it removes unicode BOM. However, there is unicode BOM for some fields in a dictionary, like "Contents" or "T". Are you suggesting that we store decoded text when parsing the content into the dictionary?
On 2016/09/01 17:20:27, jaepark wrote: > On 2016/09/01 14:21:29, dsinclair wrote: > > > > We have a unicode BOM at the beginning of every wide string? That seems .... > > broken. This feels like the bug is somewhere else and we shouldn't be storing > > the BOM with every string. > > CPDF_Dictionary::GetUnicodeTextBy() decodes the encoded text so that it removes > unicode BOM. > However, there is unicode BOM for some fields in a dictionary, like "Contents" > or "T". > Are you suggesting that we store decoded text when parsing the content into the > dictionary? It seems strange that we'd store the BOM for some strings and not for others. Feels like we should be stripping the BOM everywhere and storing the text in a consistent format internally. Lei, does that make sense to you?
On 2016/09/01 17:22:16, dsinclair wrote: > It seems strange that we'd store the BOM for some strings and not for others. > Feels like we should be stripping the BOM everywhere and storing the text in a > consistent format internally. > > Lei, does that make sense to you? Who is "we"? If a $user saved a PDF using $pdf_reader and it has BOMs, and other PDF viewers handle that fine, we should too, right?
On 2016/09/01 21:18:39, Lei Zhang wrote: > On 2016/09/01 17:22:16, dsinclair wrote: > > It seems strange that we'd store the BOM for some strings and not for others. > > Feels like we should be stripping the BOM everywhere and storing the text in a > > consistent format internally. > > > > Lei, does that make sense to you? > > Who is "we"? If a $user saved a PDF using $pdf_reader and it has BOMs, and other > PDF viewers handle that fine, we should too, right? We is PDFium. If the PDF has BOM's we should, internally, be removing them when we stick the string into a WideString. We should be using the BOM to make sure we're doing the right conversion interally (I think UTF16-BE is what we use?). There is no reason to keep the BOM on the string after we've converted into a wide string.
On 2016/09/06 12:53:14, dsinclair wrote: > We is PDFium. If the PDF has BOM's we should, internally, be removing them when > we stick the string into a WideString. We should be using the BOM to make sure > we're doing the right conversion interally (I think UTF16-BE is what we use?). > > There is no reason to keep the BOM on the string after we've converted into a > wide string. Are you suggesting we change CPDF_String? It's hard to predict what effects that would have...
On 2016/09/07 01:47:14, Lei Zhang wrote: > On 2016/09/06 12:53:14, dsinclair wrote: > > We is PDFium. If the PDF has BOM's we should, internally, be removing them > when > > we stick the string into a WideString. We should be using the BOM to make sure > > we're doing the right conversion interally (I think UTF16-BE is what we use?). > > > > There is no reason to keep the BOM on the string after we've converted into a > > wide string. > > Are you suggesting we change CPDF_String? It's hard to predict what effects that > would have... No, CPDF_String shouldn't know where the string contents came from, I think we should be putting the right value in there to begin with. Otherwise, how will we ever know what strings need the BOM skipped and which don't? We should fix the parser to do the same thing in all cases.
On 2016/09/07 02:20:36, dsinclair wrote: > On 2016/09/07 01:47:14, Lei Zhang wrote: > > On 2016/09/06 12:53:14, dsinclair wrote: > > > We is PDFium. If the PDF has BOM's we should, internally, be removing them > > when > > > we stick the string into a WideString. We should be using the BOM to make > sure > > > we're doing the right conversion interally (I think UTF16-BE is what we > use?). > > > > > > There is no reason to keep the BOM on the string after we've converted into > a > > > wide string. > > > > Are you suggesting we change CPDF_String? It's hard to predict what effects > that > > would have... > > No, CPDF_String shouldn't know where the string contents came from, I think we > should be putting the right value in there to begin with. Otherwise, how will we > ever know what strings need the BOM skipped and which don't? We should fix the > parser to do the same thing in all cases. My question is: * We have strings that do not have BOMs * We have strings that do have BOMs * On a case-by-case basis in the code we'll have to skip the first 2 characters * I'm guessing these strings are read straight from the PDF files, is that correct? * When we read the string from the file, we can strip the BOM and convert to the format that CPDF_String expects * Then, we always have strings without BOMs internally * Then, we never have to do the magic skip 2 characters dance and everythings consistent. This just feels like a bit of inconsistency in the code that is going to cause problems in the future.
On 2016/09/07 12:59:34, dsinclair wrote: > On 2016/09/07 02:20:36, dsinclair wrote: > > On 2016/09/07 01:47:14, Lei Zhang wrote: > > > On 2016/09/06 12:53:14, dsinclair wrote: > > > > We is PDFium. If the PDF has BOM's we should, internally, be removing them > > > when > > > > we stick the string into a WideString. We should be using the BOM to make > > sure > > > > we're doing the right conversion interally (I think UTF16-BE is what we > > use?). > > > > > > > > There is no reason to keep the BOM on the string after we've converted > into > > a > > > > wide string. > > > > > > Are you suggesting we change CPDF_String? It's hard to predict what effects > > that > > > would have... > > > > No, CPDF_String shouldn't know where the string contents came from, I think we > > should be putting the right value in there to begin with. Otherwise, how will > we > > ever know what strings need the BOM skipped and which don't? We should fix the > > parser to do the same thing in all cases. > > > My question is: > * We have strings that do not have BOMs > * We have strings that do have BOMs > * On a case-by-case basis in the code we'll have to skip the first 2 > characters > * I'm guessing these strings are read straight from the PDF files, is that > correct? > * When we read the string from the file, we can strip the BOM and convert to > the format that CPDF_String expects > * Then, we always have strings without BOMs internally > * Then, we never have to do the magic skip 2 characters dance and everythings > consistent. > > This just feels like a bit of inconsistency in the code that is going to cause > problems in the future. Makes sense, shall we file a bug? I think it's outside the scope of Jae's annotation work, so we may end up dropping this CL.
On 2016/09/07 22:13:50, Lei Zhang wrote: > On 2016/09/07 12:59:34, dsinclair wrote: > > On 2016/09/07 02:20:36, dsinclair wrote: > > > On 2016/09/07 01:47:14, Lei Zhang wrote: > > > > On 2016/09/06 12:53:14, dsinclair wrote: > > > > > We is PDFium. If the PDF has BOM's we should, internally, be removing > them > > > > when > > > > > we stick the string into a WideString. We should be using the BOM to > make > > > sure > > > > > we're doing the right conversion interally (I think UTF16-BE is what we > > > use?). > > > > > > > > > > There is no reason to keep the BOM on the string after we've converted > > into > > > a > > > > > wide string. > > > > > > > > Are you suggesting we change CPDF_String? It's hard to predict what > effects > > > that > > > > would have... > > > > > > No, CPDF_String shouldn't know where the string contents came from, I think > we > > > should be putting the right value in there to begin with. Otherwise, how > will > > we > > > ever know what strings need the BOM skipped and which don't? We should fix > the > > > parser to do the same thing in all cases. > > > > > > My question is: > > * We have strings that do not have BOMs > > * We have strings that do have BOMs > > * On a case-by-case basis in the code we'll have to skip the first 2 > > characters > > * I'm guessing these strings are read straight from the PDF files, is that > > correct? > > * When we read the string from the file, we can strip the BOM and convert to > > the format that CPDF_String expects > > * Then, we always have strings without BOMs internally > > * Then, we never have to do the magic skip 2 characters dance and > everythings > > consistent. > > > > This just feels like a bit of inconsistency in the code that is going to cause > > problems in the future. > > Makes sense, shall we file a bug? I think it's outside the scope of Jae's > annotation work, so we may end up dropping this CL. Sounds good, Jae, can you file a bug for this and reference which strings you know of have a BOM at the beginning of them to give us a starting point?
On 2016/09/08 12:52:49, dsinclair wrote: > > Sounds good, Jae, can you file a bug for this and reference which strings you > know of have a BOM at the beginning of them to give us a starting point? Done. https://bugs.chromium.org/p/pdfium/issues/detail?id=593
Rebased and added a comment with TODO. Test case is in https://codereview.chromium.org/2301793002 .
lgtm
lgtm
Description was changed from ========== Check whether the annotation content is empty using CFX_WideString. CFX_ByteString cannot properly check whether the contents string is empty because the first two bytes of text strings encoded in Unicode are always ASCII 254 followed by 255. So if we get contents in CFX_ByteString, the length will always be 2. ========== to ========== Check whether the annotation content is empty using CFX_WideString. CFX_ByteString cannot properly check whether the contents string is empty because the first two bytes of text strings encoded in Unicode are always ASCII 254 followed by 255. So if we get contents in CFX_ByteString, the length will always be 2. Also, roll DEPS for testing/corpus to 608bf04. ==========
The CQ bit was checked by jaepark@google.com to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
The CQ bit was checked by jaepark@google.com
The patchset sent to the CQ was uploaded after l-g-t-m from thestig@chromium.org, dsinclair@chromium.org Link to the patchset: https://codereview.chromium.org/2293403003/#ps40001 (title: "Check whether the annotation content is empty using CFX_WideString.")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Message was sent while issue was closed.
Description was changed from ========== Check whether the annotation content is empty using CFX_WideString. CFX_ByteString cannot properly check whether the contents string is empty because the first two bytes of text strings encoded in Unicode are always ASCII 254 followed by 255. So if we get contents in CFX_ByteString, the length will always be 2. Also, roll DEPS for testing/corpus to 608bf04. ========== to ========== Check whether the annotation content is empty using CFX_WideString. CFX_ByteString cannot properly check whether the contents string is empty because the first two bytes of text strings encoded in Unicode are always ASCII 254 followed by 255. So if we get contents in CFX_ByteString, the length will always be 2. Also, roll DEPS for testing/corpus to 608bf04. Committed: https://pdfium.googlesource.com/pdfium/+/3451c0900cbbc06d82a07bca3670ac87ec36... ==========
Message was sent while issue was closed.
Committed patchset #3 (id:40001) as https://pdfium.googlesource.com/pdfium/+/3451c0900cbbc06d82a07bca3670ac87ec36... |