|
|
Created:
5 years ago by Wei Li Modified:
5 years ago CC:
pdfium-reviews_googlegroups.com, kai_jing Base URL:
https://pdfium.googlesource.com/pdfium.git@master Target Ref:
refs/heads/master Visibility:
Public. |
Descriptionfix for stream object reading
Loosen a check for earlier version of PDF files. When the bytes with
specified length are followed by 'endstream' keyword, even if there is
no EOL marker before the keyword, it signals the end of stream.
BUG=551258
R=jun_fang@foxitsoftware.com, tsepez@chromium.org
Committed: https://pdfium.googlesource.com/pdfium/+/0ff66089c87ab6e3adaaff0ec69728ce7a8d8299
Patch Set 1 : #
Messages
Total messages: 18 (3 generated)
Patchset #1 (id:1) has been deleted
weili@chromium.org changed reviewers: + jun_fang@foxitsoftware.com, tsepez@chromium.org
PTAL, thanks.
On 2015/12/02 20:25:52, Wei wrote: > PTAL, thanks. To Jun for review. We made some changes here recently to be more "correct", so its not clear if this is what we should do.
On 2015/12/02 20:25:52, Wei wrote: > PTAL, thanks. LGTM.
On 2015/12/02 20:28:21, Tom Sepez wrote: > On 2015/12/02 20:25:52, Wei wrote: > > PTAL, thanks. > > To Jun for review. We made some changes here recently to be more "correct", so > its not clear if this is what we should do. It's caused by different behaviors defined in varied pdf specifications. For example, in pdf 1.4, any bytes can precede endstream. In pdf 1.7, the bytes before endstream should be EOL markers. Unfortunately, pdfium doesn't handle the differences. Currently, Pdfium targets on pdf 1.7. pdf 1.7 is stricter on parsing endobj than the older versions.
On 2015/12/03 12:20:39, jun_fang wrote: > On 2015/12/02 20:28:21, Tom Sepez wrote: > > On 2015/12/02 20:25:52, Wei wrote: > > > PTAL, thanks. > > > > To Jun for review. We made some changes here recently to be more "correct", > so > > its not clear if this is what we should do. > > It's caused by different behaviors defined in varied pdf specifications. > For example, in pdf 1.4, any bytes can precede endstream. In pdf 1.7, > the bytes before endstream should be EOL markers. Unfortunately, pdfium > doesn't handle the differences. Currently, Pdfium targets on pdf 1.7. > pdf 1.7 is stricter on parsing endobj than the older versions. Thank you, Jun, for the review. I think that pdfium targets on pdf 1.7 should mean it supports pdf 1.7 and earlier versions. Unless we explicitly drop support on some versions, pdfium should keep backward compatibility. In the implementation, it is indeed tricky to handle some corner cases.
lgtm
On 2015/12/03 18:04:33, Wei wrote: > On 2015/12/03 12:20:39, jun_fang wrote: > > On 2015/12/02 20:28:21, Tom Sepez wrote: > > > On 2015/12/02 20:25:52, Wei wrote: > > > > PTAL, thanks. > > > > > > To Jun for review. We made some changes here recently to be more "correct", > > so > > > its not clear if this is what we should do. > > > > It's caused by different behaviors defined in varied pdf specifications. > > For example, in pdf 1.4, any bytes can precede endstream. In pdf 1.7, > > the bytes before endstream should be EOL markers. Unfortunately, pdfium > > doesn't handle the differences. Currently, Pdfium targets on pdf 1.7. > > pdf 1.7 is stricter on parsing endobj than the older versions. > > Thank you, Jun, for the review. > > I think that pdfium targets on pdf 1.7 should mean it supports pdf 1.7 and > earlier > versions. Unless we explicitly drop support on some versions, pdfium should keep > > backward compatibility. In the implementation, it is indeed tricky to handle > some corner cases. Hi Wei, You are right. Pdfium should process the differences based on the version of pdf files. However, it doesn't. In theory, this patch isn't a perfect solution. It only changed the current implementation to follow older versions. It breaks pdf1.7.
On 2015/12/03 23:16:28, jun_fang wrote: > On 2015/12/03 18:04:33, Wei wrote: > > On 2015/12/03 12:20:39, jun_fang wrote: > > > On 2015/12/02 20:28:21, Tom Sepez wrote: > > > > On 2015/12/02 20:25:52, Wei wrote: > > > > > PTAL, thanks. > > > > > > > > To Jun for review. We made some changes here recently to be more > "correct", > > > so > > > > its not clear if this is what we should do. > > > > > > It's caused by different behaviors defined in varied pdf specifications. > > > For example, in pdf 1.4, any bytes can precede endstream. In pdf 1.7, > > > the bytes before endstream should be EOL markers. Unfortunately, pdfium > > > doesn't handle the differences. Currently, Pdfium targets on pdf 1.7. > > > pdf 1.7 is stricter on parsing endobj than the older versions. > > > > Thank you, Jun, for the review. > > > > I think that pdfium targets on pdf 1.7 should mean it supports pdf 1.7 and > > earlier > > versions. Unless we explicitly drop support on some versions, pdfium should > keep > > > > backward compatibility. In the implementation, it is indeed tricky to handle > > some corner cases. > > Hi Wei, > > You are right. Pdfium should process the differences based on the version > of pdf files. However, it doesn't. In theory, this patch isn't a perfect > solution. It only changed the current implementation to follow older versions. > It breaks pdf1.7. Just to clarify, this patch doesn't break pdf 1.7. Any file conforming to pdf 1.7 should be parsed correctly since line 2352 already processed EOL marker. If a pdf file in version 1.7 doesn't have EOL marker before 'endstream' (not conforming to the spec), but its length of bytes correctly ends right before the keyword 'endstream', should we allow it? This is the only thing changed for pdf 1.7 file handling.
On 2015/12/04 00:26:32, Wei wrote: > On 2015/12/03 23:16:28, jun_fang wrote: > > On 2015/12/03 18:04:33, Wei wrote: > > > On 2015/12/03 12:20:39, jun_fang wrote: > > > > On 2015/12/02 20:28:21, Tom Sepez wrote: > > > > > On 2015/12/02 20:25:52, Wei wrote: > > > > > > PTAL, thanks. > > > > > > > > > > To Jun for review. We made some changes here recently to be more > > "correct", > > > > so > > > > > its not clear if this is what we should do. > > > > > > > > It's caused by different behaviors defined in varied pdf specifications. > > > > For example, in pdf 1.4, any bytes can precede endstream. In pdf 1.7, > > > > the bytes before endstream should be EOL markers. Unfortunately, pdfium > > > > doesn't handle the differences. Currently, Pdfium targets on pdf 1.7. > > > > pdf 1.7 is stricter on parsing endobj than the older versions. > > > > > > Thank you, Jun, for the review. > > > > > > I think that pdfium targets on pdf 1.7 should mean it supports pdf 1.7 and > > > earlier > > > versions. Unless we explicitly drop support on some versions, pdfium should > > keep > > > > > > backward compatibility. In the implementation, it is indeed tricky to handle > > > > some corner cases. > > > > Hi Wei, > > > > You are right. Pdfium should process the differences based on the version > > of pdf files. However, it doesn't. In theory, this patch isn't a perfect > > solution. It only changed the current implementation to follow older versions. > > It breaks pdf1.7. > > Just to clarify, this patch doesn't break pdf 1.7. Any file conforming to pdf > 1.7 should be parsed correctly since line 2352 already processed EOL marker. > > If a pdf file in version 1.7 doesn't have EOL marker before 'endstream' (not > conforming to the spec), but its length of bytes correctly ends right before the > keyword 'endstream', should we allow it? This is the only thing changed for pdf > 1.7 file handling. BTW, you are right, ideally we should have differentiated parsing based on versions.
On 2015/12/04 00:26:32, Wei wrote: > On 2015/12/03 23:16:28, jun_fang wrote: > > On 2015/12/03 18:04:33, Wei wrote: > > > On 2015/12/03 12:20:39, jun_fang wrote: > > > > On 2015/12/02 20:28:21, Tom Sepez wrote: > > > > > On 2015/12/02 20:25:52, Wei wrote: > > > > > > PTAL, thanks. > > > > > > > > > > To Jun for review. We made some changes here recently to be more > > "correct", > > > > so > > > > > its not clear if this is what we should do. > > > > > > > > It's caused by different behaviors defined in varied pdf specifications. > > > > For example, in pdf 1.4, any bytes can precede endstream. In pdf 1.7, > > > > the bytes before endstream should be EOL markers. Unfortunately, pdfium > > > > doesn't handle the differences. Currently, Pdfium targets on pdf 1.7. > > > > pdf 1.7 is stricter on parsing endobj than the older versions. > > > > > > Thank you, Jun, for the review. > > > > > > I think that pdfium targets on pdf 1.7 should mean it supports pdf 1.7 and > > > earlier > > > versions. Unless we explicitly drop support on some versions, pdfium should > > keep > > > > > > backward compatibility. In the implementation, it is indeed tricky to handle > > > > some corner cases. > > > > Hi Wei, > > > > You are right. Pdfium should process the differences based on the version > > of pdf files. However, it doesn't. In theory, this patch isn't a perfect > > solution. It only changed the current implementation to follow older versions. > > It breaks pdf1.7. > > Just to clarify, this patch doesn't break pdf 1.7. Any file conforming to pdf > 1.7 should be parsed correctly since line 2352 already processed EOL marker. > > If a pdf file in version 1.7 doesn't have EOL marker before 'endstream' (not > conforming to the spec), but its length of bytes correctly ends right before the > keyword 'endstream', should we allow it? This is the only thing changed for pdf > 1.7 file handling. Hi Wei, Let me give you an example to explain the difference. stream ... ... endstreamtest For this example, pdf1.4 and pdf1.7 define different behaviors. In this case, this patch only follows pdf1.4. It breaks pdf1.7.
On 2015/12/04 00:37:53, Wei wrote: > On 2015/12/04 00:26:32, Wei wrote: > > On 2015/12/03 23:16:28, jun_fang wrote: > > > On 2015/12/03 18:04:33, Wei wrote: > > > > On 2015/12/03 12:20:39, jun_fang wrote: > > > > > On 2015/12/02 20:28:21, Tom Sepez wrote: > > > > > > On 2015/12/02 20:25:52, Wei wrote: > > > > > > > PTAL, thanks. > > > > > > > > > > > > To Jun for review. We made some changes here recently to be more > > > "correct", > > > > > so > > > > > > its not clear if this is what we should do. > > > > > > > > > > It's caused by different behaviors defined in varied pdf specifications. > > > > > > For example, in pdf 1.4, any bytes can precede endstream. In pdf 1.7, > > > > > the bytes before endstream should be EOL markers. Unfortunately, pdfium > > > > > doesn't handle the differences. Currently, Pdfium targets on pdf 1.7. > > > > > pdf 1.7 is stricter on parsing endobj than the older versions. > > > > > > > > Thank you, Jun, for the review. > > > > > > > > I think that pdfium targets on pdf 1.7 should mean it supports pdf 1.7 and > > > > earlier > > > > versions. Unless we explicitly drop support on some versions, pdfium > should > > > keep > > > > > > > > backward compatibility. In the implementation, it is indeed tricky to > handle > > > > > > some corner cases. > > > > > > Hi Wei, > > > > > > You are right. Pdfium should process the differences based on the version > > > of pdf files. However, it doesn't. In theory, this patch isn't a perfect > > > solution. It only changed the current implementation to follow older > versions. > > > It breaks pdf1.7. > > > > Just to clarify, this patch doesn't break pdf 1.7. Any file conforming to pdf > > 1.7 should be parsed correctly since line 2352 already processed EOL marker. > > > > If a pdf file in version 1.7 doesn't have EOL marker before 'endstream' (not > > conforming to the spec), but its length of bytes correctly ends right before > the > > keyword 'endstream', should we allow it? This is the only thing changed for > pdf > > 1.7 file handling. > > BTW, you are right, ideally we should have differentiated parsing based on > versions. The root cause of this issue is that there is a conflicting on parsing "endstream" in pdf specification level.
On 2015/12/04 00:45:56, jun_fang wrote: > On 2015/12/04 00:37:53, Wei wrote: > > On 2015/12/04 00:26:32, Wei wrote: > > > On 2015/12/03 23:16:28, jun_fang wrote: > > > > On 2015/12/03 18:04:33, Wei wrote: > > > > > On 2015/12/03 12:20:39, jun_fang wrote: > > > > > > On 2015/12/02 20:28:21, Tom Sepez wrote: > > > > > > > On 2015/12/02 20:25:52, Wei wrote: > > > > > > > > PTAL, thanks. > > > > > > > > > > > > > > To Jun for review. We made some changes here recently to be more > > > > "correct", > > > > > > so > > > > > > > its not clear if this is what we should do. > > > > > > > > > > > > It's caused by different behaviors defined in varied pdf > specifications. > > > > > > > > For example, in pdf 1.4, any bytes can precede endstream. In pdf 1.7, > > > > > > the bytes before endstream should be EOL markers. Unfortunately, > pdfium > > > > > > doesn't handle the differences. Currently, Pdfium targets on pdf 1.7. > > > > > > pdf 1.7 is stricter on parsing endobj than the older versions. > > > > > > > > > > Thank you, Jun, for the review. > > > > > > > > > > I think that pdfium targets on pdf 1.7 should mean it supports pdf 1.7 > and > > > > > earlier > > > > > versions. Unless we explicitly drop support on some versions, pdfium > > should > > > > keep > > > > > > > > > > backward compatibility. In the implementation, it is indeed tricky to > > handle > > > > > > > > some corner cases. > > > > > > > > Hi Wei, > > > > > > > > You are right. Pdfium should process the differences based on the version > > > > of pdf files. However, it doesn't. In theory, this patch isn't a perfect > > > > solution. It only changed the current implementation to follow older > > versions. > > > > It breaks pdf1.7. > > > > > > Just to clarify, this patch doesn't break pdf 1.7. Any file conforming to > pdf > > > 1.7 should be parsed correctly since line 2352 already processed EOL marker. > > > > > > > If a pdf file in version 1.7 doesn't have EOL marker before 'endstream' (not > > > conforming to the spec), but its length of bytes correctly ends right before > > the > > > keyword 'endstream', should we allow it? This is the only thing changed for > > pdf > > > 1.7 file handling. > > > > BTW, you are right, ideally we should have differentiated parsing based on > > versions. > > The root cause of this issue is that there is a conflicting on parsing > "endstream" > in pdf specification level. Anyway, if the length is valid, there is a string "endstream" at start_pos + length. No matter whether it's a whole string. It's a very high possibility to say that it's a keyword "endstream". That's why I signed off this patch. Please go ahead to land this patch first. Thanks for the clarification.
On 2015/12/04 00:41:37, jun_fang wrote: > On 2015/12/04 00:26:32, Wei wrote: > > On 2015/12/03 23:16:28, jun_fang wrote: > > > On 2015/12/03 18:04:33, Wei wrote: > > > > On 2015/12/03 12:20:39, jun_fang wrote: > > > > > On 2015/12/02 20:28:21, Tom Sepez wrote: > > > > > > On 2015/12/02 20:25:52, Wei wrote: > > > > > > > PTAL, thanks. > > > > > > > > > > > > To Jun for review. We made some changes here recently to be more > > > "correct", > > > > > so > > > > > > its not clear if this is what we should do. > > > > > > > > > > It's caused by different behaviors defined in varied pdf specifications. > > > > > > For example, in pdf 1.4, any bytes can precede endstream. In pdf 1.7, > > > > > the bytes before endstream should be EOL markers. Unfortunately, pdfium > > > > > doesn't handle the differences. Currently, Pdfium targets on pdf 1.7. > > > > > pdf 1.7 is stricter on parsing endobj than the older versions. > > > > > > > > Thank you, Jun, for the review. > > > > > > > > I think that pdfium targets on pdf 1.7 should mean it supports pdf 1.7 and > > > > earlier > > > > versions. Unless we explicitly drop support on some versions, pdfium > should > > > keep > > > > > > > > backward compatibility. In the implementation, it is indeed tricky to > handle > > > > > > some corner cases. > > > > > > Hi Wei, > > > > > > You are right. Pdfium should process the differences based on the version > > > of pdf files. However, it doesn't. In theory, this patch isn't a perfect > > > solution. It only changed the current implementation to follow older > versions. > > > It breaks pdf1.7. > > > > Just to clarify, this patch doesn't break pdf 1.7. Any file conforming to pdf > > 1.7 should be parsed correctly since line 2352 already processed EOL marker. > > > > If a pdf file in version 1.7 doesn't have EOL marker before 'endstream' (not > > conforming to the spec), but its length of bytes correctly ends right before > the > > keyword 'endstream', should we allow it? This is the only thing changed for > pdf > > 1.7 file handling. > > Hi Wei, > > Let me give you an example to explain the difference. > > stream > ... > ... > endstreamtest > > For this example, pdf1.4 and pdf1.7 define different behaviors. > In this case, this patch only follows pdf1.4. It breaks pdf1.7. Let me repeat this: this doesn't break pdf 1.7. Any files conforming to pdf 1.7 should still be parsed correctly. AND the files conforming to earlier versions of pdf should be parsed correctly now. If you have any legit pdf 1.7 file/example that can not be parsed correctly, pls let me know. thanks. As for how to handle files not conforming to the standard, we need to find a time to discuss this in person as there are more cases involved.
On 2015/12/04 01:29:14, Wei wrote: > On 2015/12/04 00:41:37, jun_fang wrote: > > On 2015/12/04 00:26:32, Wei wrote: > > > On 2015/12/03 23:16:28, jun_fang wrote: > > > > On 2015/12/03 18:04:33, Wei wrote: > > > > > On 2015/12/03 12:20:39, jun_fang wrote: > > > > > > On 2015/12/02 20:28:21, Tom Sepez wrote: > > > > > > > On 2015/12/02 20:25:52, Wei wrote: > > > > > > > > PTAL, thanks. > > > > > > > > > > > > > > To Jun for review. We made some changes here recently to be more > > > > "correct", > > > > > > so > > > > > > > its not clear if this is what we should do. > > > > > > > > > > > > It's caused by different behaviors defined in varied pdf > specifications. > > > > > > > > For example, in pdf 1.4, any bytes can precede endstream. In pdf 1.7, > > > > > > the bytes before endstream should be EOL markers. Unfortunately, > pdfium > > > > > > doesn't handle the differences. Currently, Pdfium targets on pdf 1.7. > > > > > > pdf 1.7 is stricter on parsing endobj than the older versions. > > > > > > > > > > Thank you, Jun, for the review. > > > > > > > > > > I think that pdfium targets on pdf 1.7 should mean it supports pdf 1.7 > and > > > > > earlier > > > > > versions. Unless we explicitly drop support on some versions, pdfium > > should > > > > keep > > > > > > > > > > backward compatibility. In the implementation, it is indeed tricky to > > handle > > > > > > > > some corner cases. > > > > > > > > Hi Wei, > > > > > > > > You are right. Pdfium should process the differences based on the version > > > > of pdf files. However, it doesn't. In theory, this patch isn't a perfect > > > > solution. It only changed the current implementation to follow older > > versions. > > > > It breaks pdf1.7. > > > > > > Just to clarify, this patch doesn't break pdf 1.7. Any file conforming to > pdf > > > 1.7 should be parsed correctly since line 2352 already processed EOL marker. > > > > > > > If a pdf file in version 1.7 doesn't have EOL marker before 'endstream' (not > > > conforming to the spec), but its length of bytes correctly ends right before > > the > > > keyword 'endstream', should we allow it? This is the only thing changed for > > pdf > > > 1.7 file handling. > > > > Hi Wei, > > > > Let me give you an example to explain the difference. > > > > stream > > ... > > ... > > endstreamtest > > > > For this example, pdf1.4 and pdf1.7 define different behaviors. > > In this case, this patch only follows pdf1.4. It breaks pdf1.7. > > Let me repeat this: this doesn't break pdf 1.7. Any files conforming to pdf 1.7 > should still be parsed correctly. AND the files conforming to earlier versions > of pdf should be parsed correctly now. > > If you have any legit pdf 1.7 file/example that can not be parsed correctly, pls > let me know. thanks. > > As for how to handle files not conforming to the standard, we need to find a > time to discuss this in person as there are more cases involved. Please go ahead. Let me try to create a test file for you later.
Description was changed from ========== fix for stream object reading Loosen a check for earlier version of PDF files. When the bytes with specified length are followed by 'endstream' keyword, even if there is no EOL marker before the keyword, it signals the end of stream. BUG=551258 ========== to ========== fix for stream object reading Loosen a check for earlier version of PDF files. When the bytes with specified length are followed by 'endstream' keyword, even if there is no EOL marker before the keyword, it signals the end of stream. BUG=551258 R=jun_fang@foxitsoftware.com, tsepez@chromium.org Committed: https://pdfium.googlesource.com/pdfium/+/0ff66089c87ab6e3adaaff0ec69728ce7a8d... ==========
Message was sent while issue was closed.
Committed patchset #1 (id:20001) manually as 0ff66089c87ab6e3adaaff0ec69728ce7a8d8299 (presubmit successful). |