|
|
DescriptionTraverse PDF page tree only once in CPDF_Document
In our current implementation of CPDF_Document::GetPage, we traverse
the PDF page tree until we find the index we are looking for. This is
slow when we do calls GetPage(0), GetPage(1), ... since in this case
the page tree will be traversed n times if there are n pages. This CL
makes sure the page tree is only traversed once.
Time to load the PDF from the bug below in chrome official build:
Before this CL: 1 minute 40 seconds
After this CL: 5 seconds
BUG=chromium:638513
Committed: https://pdfium.googlesource.com/pdfium/+/7c29e27dae139a205755c1a29b7f3ac8b36ec0da
Patch Set 1 #
Total comments: 6
Patch Set 2 : Traverse until needed #Patch Set 3 : Nits #
Total comments: 7
Patch Set 4 : Move methods from namespace to private #Messages
Total messages: 28 (17 generated)
The CQ bit was checked by npm@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: linux_xfa on master.tryserver.client.pdfium (JOB_FAILED, https://build.chromium.org/p/tryserver.client.pdfium/builders/linux_xfa/build...) win_xfa on master.tryserver.client.pdfium (JOB_FAILED, https://build.chromium.org/p/tryserver.client.pdfium/builders/win_xfa/builds/...)
npm@chromium.org changed reviewers: + dsinclair@chromium.org, thestig@chromium.org, tsepez@chromium.org
PTAL. I think corpus tests will pass after https://codereview.chromium.org/2419793004/ So please ignore red bots for now.
https://codereview.chromium.org/2414423002/diff/1/core/fpdfapi/parser/cpdf_do... File core/fpdfapi/parser/cpdf_document.cpp (right): https://codereview.chromium.org/2414423002/diff/1/core/fpdfapi/parser/cpdf_do... core/fpdfapi/parser/cpdf_document.cpp:507: int nPages = pKid->GetIntegerFor("Count"); I think we have to deal with the situation where Count does not agree with sizeof /Kids array, which is why they had a separate parameter.
https://codereview.chromium.org/2414423002/diff/1/core/fpdfapi/parser/cpdf_do... File core/fpdfapi/parser/cpdf_document.cpp (right): https://codereview.chromium.org/2414423002/diff/1/core/fpdfapi/parser/cpdf_do... core/fpdfapi/parser/cpdf_document.cpp:507: int nPages = pKid->GetIntegerFor("Count"); On 2016/10/14 19:01:16, Tom Sepez wrote: > I think we have to deal with the situation where Count does not agree with > sizeof /Kids array, which is why they had a separate parameter. Well, it is not clear to me how we should behave in that case. But in fact I think this will behave similar in that case: if Count > sizeOf, some pages indices will be skipped if Count < sizeOf, all the pages in the Kids array will be used. But then iPage only increases by Count. So if there's anything after, the m_PageList values for those will be overriden. Before this CL: if Count > sizeOf, some pages indices will be skipped if Count < sizeOf, the excess pages will be ignored.
https://codereview.chromium.org/2414423002/diff/1/core/fpdfapi/parser/cpdf_do... File core/fpdfapi/parser/cpdf_document.cpp (right): https://codereview.chromium.org/2414423002/diff/1/core/fpdfapi/parser/cpdf_do... core/fpdfapi/parser/cpdf_document.cpp:507: int nPages = pKid->GetIntegerFor("Count"); On 2016/10/14 19:31:25, npm wrote: > On 2016/10/14 19:01:16, Tom Sepez wrote: > > I think we have to deal with the situation where Count does not agree with > > sizeof /Kids array, which is why they had a separate parameter. > > Well, it is not clear to me how we should behave in that case. > But in fact I think this will behave similar in that case: > > if Count > sizeOf, some pages indices will be skipped > if Count < sizeOf, all the pages in the Kids array will be used. > But then iPage only increases by Count. So if there's anything after, > the m_PageList values for those will be overriden. > > Before this CL: > if Count > sizeOf, some pages indices will be skipped > if Count < sizeOf, the excess pages will be ignored. Should we increase iPage by the min(count, sizeof Kids array) to fix the issue where we either skip or ignore? https://codereview.chromium.org/2414423002/diff/1/core/fpdfapi/parser/cpdf_do... core/fpdfapi/parser/cpdf_document.cpp:539: TraversePDFPages(pPages, 0, 0); This may have to do more work then needed in some cases. If we have a 10k page document and we just want page 1 we'll have to walk all 10k pages the first time instead of just the first page.
The CQ bit was checked by npm@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
PTAL https://codereview.chromium.org/2414423002/diff/1/core/fpdfapi/parser/cpdf_do... File core/fpdfapi/parser/cpdf_document.cpp (right): https://codereview.chromium.org/2414423002/diff/1/core/fpdfapi/parser/cpdf_do... core/fpdfapi/parser/cpdf_document.cpp:507: int nPages = pKid->GetIntegerFor("Count"); On 2016/10/17 13:26:37, dsinclair wrote: > On 2016/10/14 19:31:25, npm wrote: > > On 2016/10/14 19:01:16, Tom Sepez wrote: > > > I think we have to deal with the situation where Count does not agree with > > > sizeof /Kids array, which is why they had a separate parameter. > > > > Well, it is not clear to me how we should behave in that case. > > But in fact I think this will behave similar in that case: > > > > if Count > sizeOf, some pages indices will be skipped > > if Count < sizeOf, all the pages in the Kids array will be used. > > But then iPage only increases by Count. So if there's anything after, > > the m_PageList values for those will be overriden. > > > > Before this CL: > > if Count > sizeOf, some pages indices will be skipped > > if Count < sizeOf, the excess pages will be ignored. > > > Should we increase iPage by the min(count, sizeof Kids array) to fix the issue > where we either skip or ignore? We were not doing this before. I think this might not be the right thing to do: we should use Count to keep track of the page indices, that is its purpose. https://codereview.chromium.org/2414423002/diff/1/core/fpdfapi/parser/cpdf_do... core/fpdfapi/parser/cpdf_document.cpp:539: TraversePDFPages(pPages, 0, 0); On 2016/10/17 13:26:37, dsinclair wrote: > This may have to do more work then needed in some cases. If we have a 10k page > document and we just want page 1 we'll have to walk all 10k pages the first time > instead of just the first page. Changed to traverse only until needed. https://codereview.chromium.org/2414423002/diff/40001/core/fpdfapi/parser/cpd... File core/fpdfapi/parser/cpdf_document.cpp (right): https://codereview.chromium.org/2414423002/diff/40001/core/fpdfapi/parser/cpd... core/fpdfapi/parser/cpdf_document.cpp:491: if (nPagesToGo != 1) Before nPagesToGo was the number of pages left minus 1. So compare with 1 instead of 0. https://codereview.chromium.org/2414423002/diff/40001/core/fpdfapi/parser/cpd... core/fpdfapi/parser/cpdf_document.cpp:502: bool shouldFinish = pPages->GetIntegerFor("Count") <= nPagesToGo; This will make sure that the page is popped properly even when the number of leafs does not match Count. https://codereview.chromium.org/2414423002/diff/40001/core/fpdfapi/parser/cpd... core/fpdfapi/parser/cpdf_document.cpp:504: for (size_t i = lastProc->second + 1; i < pKidList->GetCount(); i++) { lastProc->second is the index of the last completely processed kid, which initially is -1, and should be pKidList->GetCount()-1 when done.
Can you update the description with some numbers on the performance gain? https://codereview.chromium.org/2414423002/diff/40001/core/fpdfapi/parser/cpd... File core/fpdfapi/parser/cpdf_document.cpp (right): https://codereview.chromium.org/2414423002/diff/40001/core/fpdfapi/parser/cpd... core/fpdfapi/parser/cpdf_document.cpp:491: if (nPagesToGo != 1) On 2016/10/17 21:10:23, npm wrote: > Before nPagesToGo was the number of pages left minus 1. So compare with 1 > instead of 0. So, this is check if we're on the page we're looking for? https://codereview.chromium.org/2414423002/diff/40001/core/fpdfapi/parser/cpd... File core/fpdfapi/parser/cpdf_document.h (right): https://codereview.chromium.org/2414423002/diff/40001/core/fpdfapi/parser/cpd... core/fpdfapi/parser/cpdf_document.h:107: int m_iLastPageTraversed; Should these two members be private?
Description was changed from ========== Traverse PDF page tree only once in CPDF_Document In our current implementation of CPDF_Document::GetPage, we traverse the PDF page tree until we find the index we are looking for. This is slow when we do calls GetPage(0), GetPage(1), ... since in this case the page tree will be traversed n times if there are n pages. This CL makes sure the page tree is only traversed once. BUG=chromium:638513 ========== to ========== Traverse PDF page tree only once in CPDF_Document In our current implementation of CPDF_Document::GetPage, we traverse the PDF page tree until we find the index we are looking for. This is slow when we do calls GetPage(0), GetPage(1), ... since in this case the page tree will be traversed n times if there are n pages. This CL makes sure the page tree is only traversed once. Time to load the PDF from the bug below in release build: Before this CL: 1 minute 40 seconds After this CL: 5 seconds BUG=chromium:638513 ==========
Updated with performance https://codereview.chromium.org/2414423002/diff/40001/core/fpdfapi/parser/cpd... File core/fpdfapi/parser/cpdf_document.cpp (right): https://codereview.chromium.org/2414423002/diff/40001/core/fpdfapi/parser/cpd... core/fpdfapi/parser/cpdf_document.cpp:491: if (nPagesToGo != 1) On 2016/10/18 13:35:59, dsinclair wrote: > On 2016/10/17 21:10:23, npm wrote: > > Before nPagesToGo was the number of pages left minus 1. So compare with 1 > > instead of 0. > > So, this is check if we're on the page we're looking for? The if(!pKidList) will only be entered when the root node is actually page. Notice that we don't add nodes without Kids into the stack in this method. If the root has no kids, it has to be a single page. https://codereview.chromium.org/2414423002/diff/40001/core/fpdfapi/parser/cpd... File core/fpdfapi/parser/cpdf_document.h (right): https://codereview.chromium.org/2414423002/diff/40001/core/fpdfapi/parser/cpd... core/fpdfapi/parser/cpdf_document.h:107: int m_iLastPageTraversed; On 2016/10/18 13:35:59, dsinclair wrote: > Should these two members be private? Since I needed to reset them in namespace method InsertDeletePDFPage, I made them public
Description was changed from ========== Traverse PDF page tree only once in CPDF_Document In our current implementation of CPDF_Document::GetPage, we traverse the PDF page tree until we find the index we are looking for. This is slow when we do calls GetPage(0), GetPage(1), ... since in this case the page tree will be traversed n times if there are n pages. This CL makes sure the page tree is only traversed once. Time to load the PDF from the bug below in release build: Before this CL: 1 minute 40 seconds After this CL: 5 seconds BUG=chromium:638513 ========== to ========== Traverse PDF page tree only once in CPDF_Document In our current implementation of CPDF_Document::GetPage, we traverse the PDF page tree until we find the index we are looking for. This is slow when we do calls GetPage(0), GetPage(1), ... since in this case the page tree will be traversed n times if there are n pages. This CL makes sure the page tree is only traversed once. Time to load the PDF from the bug below in chrome official build: Before this CL: 1 minute 40 seconds After this CL: 5 seconds BUG=chromium:638513 ==========
The CQ bit was checked by npm@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
The CQ bit was checked by dsinclair@chromium.org
lgtm
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
Message was sent while issue was closed.
Description was changed from ========== Traverse PDF page tree only once in CPDF_Document In our current implementation of CPDF_Document::GetPage, we traverse the PDF page tree until we find the index we are looking for. This is slow when we do calls GetPage(0), GetPage(1), ... since in this case the page tree will be traversed n times if there are n pages. This CL makes sure the page tree is only traversed once. Time to load the PDF from the bug below in chrome official build: Before this CL: 1 minute 40 seconds After this CL: 5 seconds BUG=chromium:638513 ========== to ========== Traverse PDF page tree only once in CPDF_Document In our current implementation of CPDF_Document::GetPage, we traverse the PDF page tree until we find the index we are looking for. This is slow when we do calls GetPage(0), GetPage(1), ... since in this case the page tree will be traversed n times if there are n pages. This CL makes sure the page tree is only traversed once. Time to load the PDF from the bug below in chrome official build: Before this CL: 1 minute 40 seconds After this CL: 5 seconds BUG=chromium:638513 Committed: https://pdfium.googlesource.com/pdfium/+/7c29e27dae139a205755c1a29b7f3ac8b36e... ==========
Message was sent while issue was closed.
Committed patchset #4 (id:60001) as https://pdfium.googlesource.com/pdfium/+/7c29e27dae139a205755c1a29b7f3ac8b36e...
Message was sent while issue was closed.
A revert of this CL (patchset #4 id:60001) has been created in https://codereview.chromium.org/2430313006/ by dsinclair@chromium.org. The reason for reverting is: Possible cause of crbug.com/657897 reverting to find out. BUG=657897. |