|
|
DescriptionTraverse PDF page tree only once in CPDF_Document Try 3
Now, we do not start traversal from where we were at, but from the top.
This makes the code less prone to bugs, as now there is no need to call
methods to recursively fix things. This will save a lot of time when
the trees are rather flat, as in the PDF file in the bug. It can still
be slow, for instance if we have a chain of page nodes, and the last
in the chain contains all of the pages (this is artificial).
Try 2 at https://codereview.chromium.org/2442403002/
Also added test where Try 2 would have failed.
Tested the pdf from the bug on my Mac:
With this CL: load in 21 seconds
Without this CL: did not load in 4 minutes, got tired of waiting
BUG=chromium:638513
Committed: https://pdfium.googlesource.com/pdfium/+/ec64cee9acccd0d1e574bbbd8aa91b08c1cf254f
Patch Set 1 #Patch Set 2 : Rebase and test #
Total comments: 8
Patch Set 3 : Rebase, nits #
Total comments: 17
Patch Set 4 : Comments #
Messages
Total messages: 30 (21 generated)
Description was changed from ========== Traverse PDF page tree only once in CPDF_Document Try 3 Now, we do not start traversal from where we were at, but from the top. This makes the code less prone to bugs, as now there is no need to call methods to recursively fix things. This will save a lot of time when the trees are rather flat, as in the PDF file in the bug. It can still be slow, for instance if we have a chain of page nodes, and the last in the chain contains all of the pages (this is artificial). Try 2 at https://codereview.chromium.org/2442403002/ Tested the pdf from the bug on my Mac: With this CL: load in 21 seconds Without this CL: did not load in 4 minutes, got tired of waiting BUG=chromium:638513 ========== to ========== Traverse PDF page tree only once in CPDF_Document Try 3 Now, we do not start traversal from where we were at, but from the top. This makes the code less prone to bugs, as now there is no need to call methods to recursively fix things. This will save a lot of time when the trees are rather flat, as in the PDF file in the bug. It can still be slow, for instance if we have a chain of page nodes, and the last in the chain contains all of the pages (this is artificial). Try 2 at https://codereview.chromium.org/2442403002/ Also added test where Try 2 would have failed. Tested the pdf from the bug on my Mac: With this CL: load in 21 seconds Without this CL: did not load in 4 minutes, got tired of waiting BUG=chromium:638513 ==========
npm@chromium.org changed reviewers: + dsinclair@chromium.org, thestig@chromium.org, tsepez@chromium.org
The CQ bit was checked by npm@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: linux on master.tryserver.client.pdfium (JOB_FAILED, https://build.chromium.org/p/tryserver.client.pdfium/builders/linux/builds/2626)
The CQ bit was checked by npm@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
PTAL. Third time the charm? ...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
I'm still having a hard time getting my tiny brain around all of this and I'm trying to think how I would simplify it. https://codereview.chromium.org/2470803003/diff/20001/core/fpdfapi/parser/cpd... File core/fpdfapi/parser/cpdf_document.cpp (right): https://codereview.chromium.org/2470803003/diff/20001/core/fpdfapi/parser/cpd... core/fpdfapi/parser/cpdf_document.cpp:339: m_iLastPageTraversed(-1), nit: without actually looking, my hunch is that its easier to track the next untraversed page than the last traversed one. https://codereview.chromium.org/2470803003/diff/20001/core/fpdfapi/parser/cpd... core/fpdfapi/parser/cpdf_document.cpp:406: int& nPagesToGo, I can't help but think this might be easier if we had current page number and an absolute page number to stop at rather than counting down and maybe missing a place in some branch. Here's a suggestion. Maybe this needs a helper function that's responsible for flattening, and a top-level function that just gets the value out of the flattened tree. https://codereview.chromium.org/2470803003/diff/20001/core/fpdfapi/parser/cpd... core/fpdfapi/parser/cpdf_document.cpp:427: nPagesToGo--; I worry about this going negative here ... https://codereview.chromium.org/2470803003/diff/20001/core/fpdfapi/parser/cpd... core/fpdfapi/parser/cpdf_document.cpp:672: int CPDF_Document::InsertDeletePDFPage(CPDF_Dictionary* pPages, can we do this refactoring first as a separate CL? I like these being methods and removing the |this|, but its hard to see what else changed.
I'm sorry that the code is hard to understand, although I do think that this version is an improvement over the previous two tries! Anyways, I'm unsure how to make this any simpler, but I am welcome to suggestions, and also I'll be happy to discuss and explain how this works. For now, I'll detach the refactor part and do some small changes. https://codereview.chromium.org/2470803003/diff/20001/core/fpdfapi/parser/cpd... File core/fpdfapi/parser/cpdf_document.cpp (right): https://codereview.chromium.org/2470803003/diff/20001/core/fpdfapi/parser/cpd... core/fpdfapi/parser/cpdf_document.cpp:339: m_iLastPageTraversed(-1), On 2016/11/02 17:37:37, Tom Sepez wrote: > nit: without actually looking, my hunch is that its easier to track the next > untraversed page than the last traversed one. If that makes the code clearer, I can do that. It is just a shift by 1 :) https://codereview.chromium.org/2470803003/diff/20001/core/fpdfapi/parser/cpd... core/fpdfapi/parser/cpdf_document.cpp:406: int& nPagesToGo, On 2016/11/02 17:37:37, Tom Sepez wrote: > I can't help but think this might be easier if we had current page number and an > absolute page number to stop at rather than counting down and maybe missing a > place in some branch. > > Here's a suggestion. Maybe this needs a helper function that's responsible for > flattening, and a top-level function that just gets the value out of the > flattened tree. > > If I understand correctly what you mean by current/absolute, current is iPage - nPagesToGo + 1, and absolute is iPage. I can change the meaning of nPagestoGo to be the current page, if that makes things more clear. As for the suggestion, I don't understand exactly what you mean. The problem I see is we would have to 'partially flatten' the tree up to the page we're supposed to stop at. And I don't see how doing that without losing track of what we have flattened and what's still missing will be easier than the current implementation. https://codereview.chromium.org/2470803003/diff/20001/core/fpdfapi/parser/cpd... core/fpdfapi/parser/cpdf_document.cpp:427: nPagesToGo--; On 2016/11/02 17:37:37, Tom Sepez wrote: > I worry about this going negative here ... Yes, this can go negative. This wouldn't be a problem, other than we processed some more nodes than was needed. But this can be avoided by adding a check before the end of the for, so we break when we are about to go negative. https://codereview.chromium.org/2470803003/diff/20001/core/fpdfapi/parser/cpd... core/fpdfapi/parser/cpdf_document.cpp:672: int CPDF_Document::InsertDeletePDFPage(CPDF_Dictionary* pPages, On 2016/11/02 17:37:37, Tom Sepez wrote: > can we do this refactoring first as a separate CL? I like these being methods > and removing the |this|, but its hard to see what else changed. Sounds good, will do.
The CQ bit was checked by npm@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
I think this works, LGTM, but see if Lei has any issues. https://codereview.chromium.org/2470803003/diff/40001/core/fpdfapi/parser/cpd... File core/fpdfapi/parser/cpdf_document.cpp (right): https://codereview.chromium.org/2470803003/diff/40001/core/fpdfapi/parser/cpd... core/fpdfapi/parser/cpdf_document.cpp:700: // Tree will change, so reset tree transversal variables nit: maybe a ResetTraversal() helper method that does both of these together, and lose the comment? https://codereview.chromium.org/2470803003/diff/40001/core/fpdfapi/parser/cpd... File core/fpdfapi/parser/cpdf_document.h (right): https://codereview.chromium.org/2470803003/diff/40001/core/fpdfapi/parser/cpd... core/fpdfapi/parser/cpdf_document.h:108: CPDF_Dictionary* TraversePDFPages(int iPage, int& nPagesToGo, int level); nit: the style guide would say int* nPagesToGo. https://codereview.chromium.org/2470803003/diff/40001/core/fpdfapi/parser/cpd... core/fpdfapi/parser/cpdf_document.h:135: // of the child being processed. nit: // of the child being processed within the dictionary's /Kids array.
https://codereview.chromium.org/2470803003/diff/40001/core/fpdfapi/parser/cpd... File core/fpdfapi/parser/cpdf_document.cpp (right): https://codereview.chromium.org/2470803003/diff/40001/core/fpdfapi/parser/cpd... core/fpdfapi/parser/cpdf_document.cpp:457: if (static_cast<int>(m_pTreeTraversal.size()) != level + 1 || Just noticed this 'if' might be causing confusion. It is sufficient to have if (nPagesToGo == 0) because, if the child did not finish, then it must have hit a break, so nPagesToGo must have reached 0. I guess I was being extra cautious to make sure to only continue in the for loop when I'm sure the m_pTreeeTraversal[level].second is increased. But I can simplify the if, WDYT?
https://codereview.chromium.org/2470803003/diff/40001/core/fpdfapi/parser/cpd... File core/fpdfapi/parser/cpdf_document.cpp (right): https://codereview.chromium.org/2470803003/diff/40001/core/fpdfapi/parser/cpd... core/fpdfapi/parser/cpdf_document.cpp:404: // When this is called, m_pTreeTraversal[level] exists Move documentation to the header, or make it an ASSERT(). https://codereview.chromium.org/2470803003/diff/40001/core/fpdfapi/parser/cpd... core/fpdfapi/parser/cpdf_document.cpp:449: if (static_cast<int>(m_pTreeTraversal.size()) == level + 1) Should |level| be a size_t since it's an index into an array, so you don't need all these casts? https://codereview.chromium.org/2470803003/diff/40001/core/fpdfapi/parser/cpd... File core/fpdfapi/parser/cpdf_document.h (right): https://codereview.chromium.org/2470803003/diff/40001/core/fpdfapi/parser/cpd... core/fpdfapi/parser/cpdf_document.h:134: // Vector of nodes to know current position in the page tree. Int is the index Maybe mention the index into this vector is the level, and what both values in the pair represent. https://codereview.chromium.org/2470803003/diff/40001/core/fpdfapi/parser/cpd... core/fpdfapi/parser/cpdf_document.h:136: std::vector<std::pair<CPDF_Dictionary*, int>> m_pTreeTraversal; Should the int also be a size_t here? https://codereview.chromium.org/2470803003/diff/40001/core/fpdfapi/parser/cpd... File core/fpdfapi/parser/cpdf_document_unittest.cpp (right): https://codereview.chromium.org/2470803003/diff/40001/core/fpdfapi/parser/cpd... core/fpdfapi/parser/cpdf_document_unittest.cpp:125: ASSERT_TRUE(page->GetObjectFor("PageNumbering")); Since you are not grabbing the actual object here, use KeyExist() instead? I guess this is copy-pasta, so feel free to fix it all later.
The CQ bit was checked by npm@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
https://codereview.chromium.org/2470803003/diff/40001/core/fpdfapi/parser/cpd... File core/fpdfapi/parser/cpdf_document.cpp (right): https://codereview.chromium.org/2470803003/diff/40001/core/fpdfapi/parser/cpd... core/fpdfapi/parser/cpdf_document.cpp:404: // When this is called, m_pTreeTraversal[level] exists On 2016/11/04 01:38:52, Lei Zhang wrote: > Move documentation to the header, or make it an ASSERT(). Done. https://codereview.chromium.org/2470803003/diff/40001/core/fpdfapi/parser/cpd... core/fpdfapi/parser/cpdf_document.cpp:449: if (static_cast<int>(m_pTreeTraversal.size()) == level + 1) On 2016/11/04 01:38:52, Lei Zhang wrote: > Should |level| be a size_t since it's an index into an array, so you don't need > all these casts? Yes, I remember I tried this and failed, maybe forgot to change the other to size_t too. Done. https://codereview.chromium.org/2470803003/diff/40001/core/fpdfapi/parser/cpd... core/fpdfapi/parser/cpdf_document.cpp:700: // Tree will change, so reset tree transversal variables On 2016/11/03 22:31:00, Tom Sepez wrote: > nit: maybe a ResetTraversal() helper method that does both of these together, > and lose the comment? Done. https://codereview.chromium.org/2470803003/diff/40001/core/fpdfapi/parser/cpd... File core/fpdfapi/parser/cpdf_document.h (right): https://codereview.chromium.org/2470803003/diff/40001/core/fpdfapi/parser/cpd... core/fpdfapi/parser/cpdf_document.h:108: CPDF_Dictionary* TraversePDFPages(int iPage, int& nPagesToGo, int level); On 2016/11/03 22:31:00, Tom Sepez wrote: > nit: the style guide would say int* nPagesToGo. Done. https://codereview.chromium.org/2470803003/diff/40001/core/fpdfapi/parser/cpd... core/fpdfapi/parser/cpdf_document.h:134: // Vector of nodes to know current position in the page tree. Int is the index On 2016/11/04 01:38:52, Lei Zhang wrote: > Maybe mention the index into this vector is the level, and what both values in > the pair represent. Done. https://codereview.chromium.org/2470803003/diff/40001/core/fpdfapi/parser/cpd... core/fpdfapi/parser/cpdf_document.h:135: // of the child being processed. On 2016/11/03 22:31:00, Tom Sepez wrote: > nit: // of the child being processed within the dictionary's /Kids array. Done. https://codereview.chromium.org/2470803003/diff/40001/core/fpdfapi/parser/cpd... core/fpdfapi/parser/cpdf_document.h:136: std::vector<std::pair<CPDF_Dictionary*, int>> m_pTreeTraversal; On 2016/11/04 01:38:52, Lei Zhang wrote: > Should the int also be a size_t here? Done. https://codereview.chromium.org/2470803003/diff/40001/core/fpdfapi/parser/cpd... File core/fpdfapi/parser/cpdf_document_unittest.cpp (right): https://codereview.chromium.org/2470803003/diff/40001/core/fpdfapi/parser/cpd... core/fpdfapi/parser/cpdf_document_unittest.cpp:125: ASSERT_TRUE(page->GetObjectFor("PageNumbering")); On 2016/11/04 01:38:52, Lei Zhang wrote: > Since you are not grabbing the actual object here, use KeyExist() instead? I > guess this is copy-pasta, so feel free to fix it all later. Done.
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
The CQ bit was checked by npm@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from tsepez@chromium.org Link to the patchset: https://codereview.chromium.org/2470803003/#ps60001 (title: "Comments")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Message was sent while issue was closed.
Description was changed from ========== Traverse PDF page tree only once in CPDF_Document Try 3 Now, we do not start traversal from where we were at, but from the top. This makes the code less prone to bugs, as now there is no need to call methods to recursively fix things. This will save a lot of time when the trees are rather flat, as in the PDF file in the bug. It can still be slow, for instance if we have a chain of page nodes, and the last in the chain contains all of the pages (this is artificial). Try 2 at https://codereview.chromium.org/2442403002/ Also added test where Try 2 would have failed. Tested the pdf from the bug on my Mac: With this CL: load in 21 seconds Without this CL: did not load in 4 minutes, got tired of waiting BUG=chromium:638513 ========== to ========== Traverse PDF page tree only once in CPDF_Document Try 3 Now, we do not start traversal from where we were at, but from the top. This makes the code less prone to bugs, as now there is no need to call methods to recursively fix things. This will save a lot of time when the trees are rather flat, as in the PDF file in the bug. It can still be slow, for instance if we have a chain of page nodes, and the last in the chain contains all of the pages (this is artificial). Try 2 at https://codereview.chromium.org/2442403002/ Also added test where Try 2 would have failed. Tested the pdf from the bug on my Mac: With this CL: load in 21 seconds Without this CL: did not load in 4 minutes, got tired of waiting BUG=chromium:638513 Committed: https://pdfium.googlesource.com/pdfium/+/ec64cee9acccd0d1e574bbbd8aa91b08c1cf... ==========
Message was sent while issue was closed.
Committed patchset #4 (id:60001) as https://pdfium.googlesource.com/pdfium/+/ec64cee9acccd0d1e574bbbd8aa91b08c1cf... |