|
|
Created:
6 years, 10 months ago by reveman Modified:
6 years, 10 months ago Reviewers:
vmpstr CC:
chromium-reviews, cc-bugs_chromium.org Base URL:
svn://svn.chromium.org/chrome/trunk/src Visibility:
Public. |
Descriptioncc: Reuse the same TaskGraph and completed tasks vector.
This improves performance by avoiding unnecessary resizing of
vectors. RasterWorkerPoolPerfTest.ScheduleTasks is ~25% faster
for ImageRasterWorkerPool with this change.
BUG=246546
Committed: https://src.chromium.org/viewvc/chrome?view=rev&revision=250305
Patch Set 1 #
Total comments: 4
Patch Set 2 : update comments #
Messages
Total messages: 11 (0 generated)
https://codereview.chromium.org/144463012/diff/1/cc/resources/pixel_buffer_ra... File cc/resources/pixel_buffer_raster_worker_pool.h (right): https://codereview.chromium.org/144463012/diff/1/cc/resources/pixel_buffer_ra... cc/resources/pixel_buffer_raster_worker_pool.h:92: // Avoid unnecessary heap allocations by reusing the same graph and This comment kind of describes the patch, instead of describing what these variables do. Could you instead change it to something like "Graph to use when scheduling tasks, and a vector to gather completed tasks"? https://codereview.chromium.org/144463012/diff/1/cc/resources/pixel_buffer_ra... cc/resources/pixel_buffer_raster_worker_pool.h:95: internal::Task::Vector completed_tasks_; I'm kind of conflicted about this change. What I mean to say is that I suspect this is optimizing the perftest, but makes little difference in practice. I mean completed_tasks_ really saves a local std::vector, which isn't that horribly expensive. It's also called in CheckForCompletedTasks: something we normally call once every 6ms? Do you know what the actual time savings are per call, at least approximately?
https://codereview.chromium.org/144463012/diff/1/cc/resources/pixel_buffer_ra... File cc/resources/pixel_buffer_raster_worker_pool.h (right): https://codereview.chromium.org/144463012/diff/1/cc/resources/pixel_buffer_ra... cc/resources/pixel_buffer_raster_worker_pool.h:92: // Avoid unnecessary heap allocations by reusing the same graph and On 2014/02/10 18:00:19, vmpstr wrote: > This comment kind of describes the patch, instead of describing what these > variables do. Could you instead change it to something like "Graph to use when > scheduling tasks, and a vector to gather completed tasks"? Done. https://codereview.chromium.org/144463012/diff/1/cc/resources/pixel_buffer_ra... cc/resources/pixel_buffer_raster_worker_pool.h:95: internal::Task::Vector completed_tasks_; On 2014/02/10 18:00:19, vmpstr wrote: > I'm kind of conflicted about this change. What I mean to say is that I suspect > this is optimizing the perftest, but makes little difference in practice. I mean > completed_tasks_ really saves a local std::vector, which isn't that horribly > expensive. It's also called in CheckForCompletedTasks: something we normally > call once every 6ms? > > Do you know what the actual time savings are per call, at least approximately? The performance tests reflect real usage quite well now and this optimization is definitely worth it imo. The cost we're avoiding is having to create+grow+delete all these vectors with each use. It's hard to give you a savings per call number but what I can tell is that the heap allocation functions used to process these vectors are number 2 after ref-counting when profiling this code. Eliminating them makes a big difference and if it wasn't for the current ref-counting bottleneck this would provide a much greater percentage improvement. FYI, I have another patch that does the same at the TileManager to RasterWorkerPool interaction level but decided it was better to do that as a follow up.
On 2014/02/10 19:13:26, David Reveman wrote: > https://codereview.chromium.org/144463012/diff/1/cc/resources/pixel_buffer_ra... > File cc/resources/pixel_buffer_raster_worker_pool.h (right): > > https://codereview.chromium.org/144463012/diff/1/cc/resources/pixel_buffer_ra... > cc/resources/pixel_buffer_raster_worker_pool.h:92: // Avoid unnecessary heap > allocations by reusing the same graph and > On 2014/02/10 18:00:19, vmpstr wrote: > > This comment kind of describes the patch, instead of describing what these > > variables do. Could you instead change it to something like "Graph to use when > > scheduling tasks, and a vector to gather completed tasks"? > > Done. > > https://codereview.chromium.org/144463012/diff/1/cc/resources/pixel_buffer_ra... > cc/resources/pixel_buffer_raster_worker_pool.h:95: internal::Task::Vector > completed_tasks_; > On 2014/02/10 18:00:19, vmpstr wrote: > > I'm kind of conflicted about this change. What I mean to say is that I suspect > > this is optimizing the perftest, but makes little difference in practice. I > mean > > completed_tasks_ really saves a local std::vector, which isn't that horribly > > expensive. It's also called in CheckForCompletedTasks: something we normally > > call once every 6ms? > > > > Do you know what the actual time savings are per call, at least approximately? > > The performance tests reflect real usage quite well now and this optimization is > definitely worth it imo. The cost we're avoiding is having to create+grow+delete > all these vectors with each use. It's hard to give you a savings per call number > but what I can tell is that the heap allocation functions used to process these > vectors are number 2 after ref-counting when profiling this code. Eliminating > them makes a big difference and if it wasn't for the current ref-counting > bottleneck this would provide a much greater percentage improvement. > Ok, I can buy that this actually has savings in the fact that the vector retains its capacity which allows us to allocate less (or nothing) next time around. lgtm. Re: time per call, I meant the perftest.. you said there's 25% increase. What's the run_time / number_of_runs before and after? I'm curious if we're talking about 1ms -> 0.75ms or something like 0.010ms -> 0.007ms (on whatever machine is fine, I just want to get a gut feeling for the speed of the whole schedule) > FYI, I have another patch that does the same at the TileManager to > RasterWorkerPool interaction level but decided it was better to do that as a > follow up.
On 2014/02/10 19:40:15, vmpstr wrote: > On 2014/02/10 19:13:26, David Reveman wrote: > > > https://codereview.chromium.org/144463012/diff/1/cc/resources/pixel_buffer_ra... > > File cc/resources/pixel_buffer_raster_worker_pool.h (right): > > > > > https://codereview.chromium.org/144463012/diff/1/cc/resources/pixel_buffer_ra... > > cc/resources/pixel_buffer_raster_worker_pool.h:92: // Avoid unnecessary heap > > allocations by reusing the same graph and > > On 2014/02/10 18:00:19, vmpstr wrote: > > > This comment kind of describes the patch, instead of describing what these > > > variables do. Could you instead change it to something like "Graph to use > when > > > scheduling tasks, and a vector to gather completed tasks"? > > > > Done. > > > > > https://codereview.chromium.org/144463012/diff/1/cc/resources/pixel_buffer_ra... > > cc/resources/pixel_buffer_raster_worker_pool.h:95: internal::Task::Vector > > completed_tasks_; > > On 2014/02/10 18:00:19, vmpstr wrote: > > > I'm kind of conflicted about this change. What I mean to say is that I > suspect > > > this is optimizing the perftest, but makes little difference in practice. I > > mean > > > completed_tasks_ really saves a local std::vector, which isn't that horribly > > > expensive. It's also called in CheckForCompletedTasks: something we normally > > > call once every 6ms? > > > > > > Do you know what the actual time savings are per call, at least > approximately? > > > > The performance tests reflect real usage quite well now and this optimization > is > > definitely worth it imo. The cost we're avoiding is having to > create+grow+delete > > all these vectors with each use. It's hard to give you a savings per call > number > > but what I can tell is that the heap allocation functions used to process > these > > vectors are number 2 after ref-counting when profiling this code. Eliminating > > them makes a big difference and if it wasn't for the current ref-counting > > bottleneck this would provide a much greater percentage improvement. > > > > Ok, I can buy that this actually has savings in the fact that the vector retains > its capacity which allows us to allocate less (or nothing) next time around. > lgtm. > > Re: time per call, I meant the perftest.. you said there's 25% increase. What's > the run_time / number_of_runs before and after? I'm curious if we're talking > about 1ms -> 0.75ms or something like 0.010ms -> 0.007ms (on whatever machine is > fine, I just want to get a gut feeling for the speed of the whole schedule) ~200000 -> 250000 run/s, so 0.005ms -> 0.004ms on my workstation. as it's heap allocation related, I expect the same or better on Android. > > > FYI, I have another patch that does the same at the TileManager to > > RasterWorkerPool interaction level but decided it was better to do that as a > > follow up.
> ~200000 -> 250000 run/s, so 0.005ms -> 0.004ms on my workstation. as it's heap > allocation related, I expect the same or better on Android. > That's a 1us savings on a call that, again, is not that frequent in non-perftest environment. I mean, I'm happy with the fact that it's faster, but I fear that this will lock us in into a particular micro-optimized solution. That is, adding a trace event or something in this function will probably invalidate the savings and make the perfbots graph look like it's a regression. We had patches before that removed some heap allocations (https://codereview.chromium.org/38553002/ comes to mind), but the consensus was that it might not be worth it. Why the change of heart? :) > > > > > FYI, I have another patch that does the same at the TileManager to > > > RasterWorkerPool interaction level but decided it was better to do that as a > > > follow up.
On 2014/02/10 22:02:26, vmpstr wrote: > > ~200000 -> 250000 run/s, so 0.005ms -> 0.004ms on my workstation. as it's heap > > allocation related, I expect the same or better on Android. > > > > That's a 1us savings on a call that, again, is not that frequent in non-perftest > environment. I mean, I'm happy with the fact that it's faster, but I fear that > this will lock us in into a particular micro-optimized solution. That is, adding > a trace event or something in this function will probably invalidate the savings > and make the perfbots graph look like it's a regression. > > We had patches before that removed some heap allocations > (https://codereview.chromium.org/38553002/ comes to mind), but the consensus was > that it might not be worth it. Why the change of heart? :) Profiling. It shows that all these heap allocations add up as we're doing them at every level and they are likely costing us more in a non-perftest environment (where other heap allocations also happen) than the other way around. They are also more expensive on mobile than desktop. This is also not just about performance. I've generally been against this kind of optimizations when they made the code more complicated and we had no data demonstrating their usefulness. Profiling of our new "useful" perftests show that they are definitely worth it and we now have a pattern of swapping between vectors that grow to their desired size at the core TaskGraphRunner level that I find clean and doesn't add additional complexity to the code. I think we should apply the same pattern throughout the whole worker-pool/tile-manager system. Even in places where it doesn't make a difference in performance, just for the sake of consistency. In general profiling is the key. ~90% of the CPU time spent in RWP and TGR used to be heap allocations and ref counting. I'm starting at the bottom (TGR) and removing it one level at a time. Each level makes a small difference but together it changes everything. > > > > > > > > FYI, I have another patch that does the same at the TileManager to > > > > RasterWorkerPool interaction level but decided it was better to do that as > a > > > > follow up.
On 2014/02/10 23:12:03, David Reveman wrote: > On 2014/02/10 22:02:26, vmpstr wrote: > > > ~200000 -> 250000 run/s, so 0.005ms -> 0.004ms on my workstation. as it's > heap > > > allocation related, I expect the same or better on Android. > > > > > > > That's a 1us savings on a call that, again, is not that frequent in > non-perftest > > environment. I mean, I'm happy with the fact that it's faster, but I fear that > > this will lock us in into a particular micro-optimized solution. That is, > adding > > a trace event or something in this function will probably invalidate the > savings > > and make the perfbots graph look like it's a regression. > > > > We had patches before that removed some heap allocations > > (https://codereview.chromium.org/38553002/ comes to mind), but the consensus > was > > that it might not be worth it. Why the change of heart? :) > > Profiling. It shows that all these heap allocations add up as we're doing them > at every level and they are likely costing us more in a non-perftest environment > (where other heap allocations also happen) than the other way around. They are > also more expensive on mobile than desktop. > > This is also not just about performance. I've generally been against this kind > of optimizations when they made the code more complicated and we had no data > demonstrating their usefulness. Profiling of our new "useful" perftests show > that they are definitely worth it and we now have a pattern of swapping between > vectors that grow to their desired size at the core TaskGraphRunner level that I > find clean and doesn't add additional complexity to the code. I think we should > apply the same pattern throughout the whole worker-pool/tile-manager system. > Even in places where it doesn't make a difference in performance, just for the > sake of consistency. > > In general profiling is the key. ~90% of the CPU time spent in RWP and TGR used > to be heap allocations and ref counting. I'm starting at the bottom (TGR) and > removing it one level at a time. Each level makes a small difference but > together it changes everything. > > > > > > > > > > > > FYI, I have another patch that does the same at the TileManager to > > > > > RasterWorkerPool interaction level but decided it was better to do that > as > > a > > > > > follow up. sgtm.
The CQ bit was checked by reveman@chromium.org
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/reveman@chromium.org/144463012/60001
Message was sent while issue was closed.
Change committed as 250305 |