Description
cc: Use pointers for heaps in tile manager raster tile priority queue.
This patch changes what we heapify in the tile manager raster tile priority
queue: instead of the objects themselves, we now heapify pointers to those
objects. The reasoning is that since make_heap et al. copy elements around to
maintain heap order, it is significantly cheaper to move pointers around than
whole objects.
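For readers skimming the CL, here is a minimal standalone sketch of the general technique (the type and field names below are simplified stand-ins, not the actual cc classes): the large objects stay in one owning vector, and a second vector of raw pointers is what gets heapified, so make_heap/push_heap/pop_heap only ever shuffle pointer-sized elements.

#include <algorithm>
#include <cstddef>
#include <vector>

// "Queue" stands in for PairedPictureLayerQueue; the real code compares tile
// priorities via RasterOrderComparator rather than a plain int.
struct Queue {
  int priority;
  char bulky_state[256];  // placeholder for the iterators/vectors inside
};

struct QueuePtrLess {
  bool operator()(const Queue* a, const Queue* b) const {
    return a->priority < b->priority;  // dereference; only pointers get moved
  }
};

int main() {
  std::vector<Queue> owned(100);  // owns the large objects; must not reallocate
                                  // while pointers into it are live
  std::vector<Queue*> heap;       // the heap holds pointers only
  heap.reserve(owned.size());
  for (std::size_t i = 0; i < owned.size(); ++i) {
    owned[i].priority = static_cast<int>(i);
    heap.push_back(&owned[i]);
  }

  QueuePtrLess less;
  std::make_heap(heap.begin(), heap.end(), less);
  std::pop_heap(heap.begin(), heap.end(), less);  // swaps pointer-sized elements
  Queue* top = heap.back();
  heap.pop_back();
  (void)top;
  return 0;
}

The caveat, which the discussion below touches on, is that the owning container must stay put (no reallocation) while the pointer heap is live.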
Perf results:
Before:
tile_manager_raster_tile_queue_construct:
2= 1348545 runs/s
10= 1297846.75 runs/s
50= 1098756.25 runs/s
tile_manager_raster_tile_queue_construct_and_iterate:
2_16= 74439.5546875 runs/s
2_32= 36854.30078125 runs/s
2_64= 17582.240234375 runs/s
2_128= 8018.2802734375 runs/s
After:
tile_manager_raster_tile_queue_construct:
2= 1294338.75 runs/s (-4%)
10= 1254695 runs/s (-3%)
50= 1047427.375 runs/s (-5%)
tile_manager_raster_tile_queue_construct_and_iterate:
2_16= 298292.03125 runs/s (+300%)
2_32= 176878.578125 runs/s (+380%)
2_64= 93562.8515625 runs/s (+430%)
2_128= 43368.1796875 runs/s (+440%)
R=reveman
Committed: https://src.chromium.org/viewvc/chrome?view=rev&revision=287378
Patch Set 1
Patch Set 2: ScopedPtrVector
Messages
PTAL. Note that the tiling-level raster iterator doesn't have vectors, so the objects here should all be in contiguous memory.

I attempted to do this two other ways (unsuccessfully):

1. Using ScopedPtrVector and pushing make_scoped_ptr(new Paired...);
2. Using std::vector<scoped_refptr<Paired...> > and pushing make_scoped_refptr(new Paired...);

Both of those failed, since make_heap is quite particular about what elements it is willing to accept. ScopedPtrVector failed because make_heap didn't like ScopedPtrVector::iterator. The scoped_refptr version failed because (and I quote):

error: no matching conversion for functional-style cast from 'scoped_refptr<cc::RasterTilePriorityQueue::PairedPictureLayerQueue>' to '_ValueType' (aka 'scoped_refptr<cc::RasterTilePriorityQueue::PairedPictureLayerQueue>') :\

So with that, I think this is a fairly simple approach that gives us quite a bit of savings on iteration. Understandably, construction has a small regression.
https://codereview.chromium.org/431433005/diff/1/cc/resources/raster_tile_pri...
File cc/resources/raster_tile_priority_queue.cc (right):

https://codereview.chromium.org/431433005/diff/1/cc/resources/raster_tile_pri...
cc/resources/raster_tile_priority_queue.cc:80: TreePriority tree_priority) {
That this function isn't faster makes me very suspicious. Why is the greatest saving not coming from make_heap? Once you start iterating you'll end up with a non-empty PairedPictureLayerQueue::returned_shared_tiles. Is that a significant factor in why this patch makes iterating cheaper?

Another general observation about these results: the improvement is 300% to 440%. The reason this is faster would be that less data has to be copied when moving elements around in the heap. Reducing the size of PairedPictureLayerQueue to 1/4 of its current size should have the same effect. What's the current size of PairedPictureLayerQueue? It would be interesting to see a breakdown of each of the members. I would not be surprised if we could reduce its size significantly. That would not only give us the heap performance we're looking for but also improve the performance of these iterators in other respects, such as construction.
https://codereview.chromium.org/431433005/diff/1/cc/resources/raster_tile_pri...
File cc/resources/raster_tile_priority_queue.cc (right):

https://codereview.chromium.org/431433005/diff/1/cc/resources/raster_tile_pri...
cc/resources/raster_tile_priority_queue.cc:80: TreePriority tree_priority) {
On 2014/07/30 16:04:09, reveman wrote:
> That this function isn't faster makes me very suspicious. Why is the greatest
> saving not coming from make_heap? Once you start iterating you'll end up with
> a non-empty PairedPictureLayerQueue::returned_shared_tiles. Is that a
> significant factor in why this patch makes iterating cheaper?

One possible explanation is that reallocations in paired_queues_heap_, coupled with an extra iteration, are what prevents us from seeing savings here.

Another, more likely, explanation is that make_heap does no work in the perftests, since the layers initially have equal priority (every layer has a NOW-bin tile at the top). So the cost we're seeing here is just from the loop construction. I did an experiment where I kept the code the same and just added an extra loop to push paired layer pointers into a temp vector, and I see the same type of regression.

> Another general observation about these results: the improvement is 300% to
> 440%. The reason this is faster would be that less data has to be copied when
> moving elements around in the heap. Reducing the size of
> PairedPictureLayerQueue to 1/4 of its current size should have the same
> effect. What's the current size of PairedPictureLayerQueue? It would be
> interesting to see a breakdown of each of the members. I would not be
> surprised if we could reduce its size significantly. That would not only give
> us the heap performance we're looking for but also improve the performance of
> these iterators in other respects, such as construction.

I don't know that 1/4 of the current size is the magic number here. Just from moving objects around (ignoring pointer indirection when accessing elements), wouldn't we have to reduce the layer iterator to the size of a single pointer? I agree that reducing the size is always a good thing, but at the same time I'm wary of the fact that our performance profile would depend heavily on the size of the layer iterator (note that it has the tiling iterators as members as well, so the performance would depend on the size of all the iterator layers below this). That makes it difficult for us to change any meaningful part of the iterator (i.e. adding new variables, caching something) without negatively affecting performance.

In other words, I'd rather see us take a black-box approach to this performance and not have to assume anything about the size of the iterators we're using. Note that the same general statements we normally make when talking about iterators still apply: they are fairly cheap to create and are appropriate for stack creation, since they aren't large. IMO, talking about particular sizes and relying on the characteristic that if we halve the size, our performance doubles encourages micro-optimizations that favor tiny object sizes over clarity.
For reference, here's what the layer raster iterator contains:

enum IteratorType { LOW_RES, HIGH_RES, NUM_ITERATORS };
struct IterationStage {
  IteratorType iterator_type;
  TilePriority::PriorityBin tile_type;
};

PictureLayerImpl* layer_;
int current_stage_;
IterationStage stages_[4];
PictureLayerTiling::TilingRasterTileIterator iterators_[NUM_ITERATORS];

That's probably not the optimal size (we could, for instance, remove the stages_ array in favor of determining the next stage computationally), but for the sake of clarity I think this is pretty streamlined. Wdyt?
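To get the kind of member-by-member breakdown asked for above, a throwaway snippet along these lines can be compiled and run. The types here are stand-ins that only mirror the member list quoted above (in particular, FakeTilingIterator's size is a guess, and the real TilingRasterTileIterator is almost certainly larger), so the printed numbers are illustrative rather than the real cc sizes; substituting the actual cc types inside a unittest would give the real ones.

#include <cstdio>

// Stand-ins mirroring the member list above; NOT the real cc types.
enum IteratorType { LOW_RES, HIGH_RES, NUM_ITERATORS };
enum PriorityBin { NOW, SOON, EVENTUALLY };  // stand-in for TilePriority::PriorityBin

struct IterationStage {
  IteratorType iterator_type;
  PriorityBin tile_type;
};

struct FakeTilingIterator {  // size is a guess; the real iterator holds more state
  void* tiling;
  int phase;
  int index_x;
  int index_y;
};

struct FakeLayerRasterTileIterator {
  void* layer_;                                  // PictureLayerImpl*
  int current_stage_;
  IterationStage stages_[4];
  FakeTilingIterator iterators_[NUM_ITERATORS];
};

int main() {
  std::printf("stages_[4]:     %u bytes\n",
              static_cast<unsigned>(sizeof(IterationStage) * 4));
  std::printf("iterators_[%d]:  %u bytes\n", static_cast<int>(NUM_ITERATORS),
              static_cast<unsigned>(sizeof(FakeTilingIterator) * NUM_ITERATORS));
  std::printf("whole iterator: %u bytes\n",
              static_cast<unsigned>(sizeof(FakeLayerRasterTileIterator)));
  return 0;
}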
Btw, you probably need to add a make_heap function to ScopedPtrVector (just like the one we have for sorting) to use that.

https://codereview.chromium.org/431433005/diff/1/cc/resources/raster_tile_pri...
File cc/resources/raster_tile_priority_queue.cc (right):

https://codereview.chromium.org/431433005/diff/1/cc/resources/raster_tile_pri...
cc/resources/raster_tile_priority_queue.cc:80: TreePriority tree_priority) {
On 2014/07/30 16:45:51, vmpstr wrote:
> On 2014/07/30 16:04:09, reveman wrote:
> > That this function isn't faster makes me very suspicious. Why is the
> > greatest saving not coming from make_heap? Once you start iterating you'll
> > end up with a non-empty PairedPictureLayerQueue::returned_shared_tiles. Is
> > that a significant factor in why this patch makes iterating cheaper?
>
> One possible explanation is that reallocations in paired_queues_heap_, coupled
> with an extra iteration, are what prevents us from seeing savings here.
>
> Another, more likely, explanation is that make_heap does no work in the
> perftests, since the layers initially have equal priority (every layer has a
> NOW-bin tile at the top). So the cost we're seeing here is just from the loop
> construction. I did an experiment where I kept the code the same and just
> added an extra loop to push paired layer pointers into a temp vector, and I
> see the same type of regression.

Ok, let's start by fixing the perftests then.

> > Another general observation about these results: the improvement is 300% to
> > 440%. The reason this is faster would be that less data has to be copied
> > when moving elements around in the heap. Reducing the size of
> > PairedPictureLayerQueue to 1/4 of its current size should have the same
> > effect. What's the current size of PairedPictureLayerQueue? It would be
> > interesting to see a breakdown of each of the members. I would not be
> > surprised if we could reduce its size significantly. That would not only
> > give us the heap performance we're looking for but also improve the
> > performance of these iterators in other respects, such as construction.
>
> I don't know that 1/4 of the current size is the magic number here. Just from
> moving objects around (ignoring pointer indirection when accessing elements),
> wouldn't we have to reduce the layer iterator to the size of a single pointer?

Yes, no idea what the actual number is, but I'd guess that the difference becomes insignificant well before you reach the size of a pointer. It's hard to say, as moving some data around might be free unless it causes a cache miss. You need to access the data even if you use pointers, so we might get the same cache misses and the pointers might not buy us much at all.

> I agree that reducing the size is always a good thing, but at the same time
> I'm wary of the fact that our performance profile would depend heavily on the
> size of the layer iterator (note that it has the tiling iterators as members
> as well, so the performance would depend on the size of all the iterator
> layers below this). That makes it difficult for us to change any meaningful
> part of the iterator (i.e. adding new variables, caching something) without
> negatively affecting performance.
>
> In other words, I'd rather see us take a black-box approach to this
> performance and not have to assume anything about the size of the iterators
> we're using. Note that the same general statements we normally make when
> talking about iterators still apply: they are fairly cheap to create and are
> appropriate for stack creation, since they aren't large. IMO, talking about
> particular sizes and relying on the characteristic that if we halve the size,
> our performance doubles encourages micro-optimizations that favor tiny object
> sizes over clarity.

Not making assumptions about the size makes some sense. Although, I think that reducing the size of these objects is one of the most useful optimizations we can do: it improves performance not only when accessing these objects but also when constructing and moving them. That said, we should be careful about making changes that reduce readability.

> For reference, here's what the layer raster iterator contains:
>
> enum IteratorType { LOW_RES, HIGH_RES, NUM_ITERATORS };
> struct IterationStage {
>   IteratorType iterator_type;
>   TilePriority::PriorityBin tile_type;
> };
>
> PictureLayerImpl* layer_;
> int current_stage_;
> IterationStage stages_[4];
> PictureLayerTiling::TilingRasterTileIterator iterators_[NUM_ITERATORS];
>
> That's probably not the optimal size (we could, for instance, remove the
> stages_ array in favor of determining the next stage computationally), but
> for the sake of clarity I think this is pretty streamlined.

I actually think it's a great idea to remove that array and compute the stage as we iterate instead of at construction time. I think it can improve readability if done right.

> Wdyt?

I'm not completely against this patch. I just need some more data to be convinced that this is a good idea. After updating the perftests, can you do some profiling and post the results to this CL?
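As one possible reading of "compute the stage as we iterate": the per-instance stages_[4] member could be replaced by a lookup shared across all iterators (or a switch over the stage index), so each iterator no longer carries its own copy of the table. The sketch below is illustrative only; it reuses stand-in enums, and the fixed HIGH_RES/NOW, LOW_RES/NOW, HIGH_RES/SOON, HIGH_RES/EVENTUALLY order is an assumption, whereas the real order in cc depends on tree priority.

// Illustrative sketch with stand-in enums; the stage ordering is an assumption.
enum IteratorType { LOW_RES, HIGH_RES, NUM_ITERATORS };
enum PriorityBin { NOW, SOON, EVENTUALLY };

struct Stage {
  IteratorType iterator_type;
  PriorityBin tile_type;
};

// One shared table (or a switch) instead of a stages_[4] array copied into
// every LayerRasterTileIterator instance.
inline Stage StageForIndex(int index) {
  static const Stage kStages[] = {
    { HIGH_RES, NOW },
    { LOW_RES, NOW },
    { HIGH_RES, SOON },
    { HIGH_RES, EVENTUALLY }
  };
  return kStages[index];
}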
> I'm not completely against this patch. I just need some more data to be
> convinced that this is a good idea. After updating the perftests, can you do
> some profiling and post the results to this CL?

I reran the fixed perftests with and without this patch; here are the results:

Before:
[ RUN ] TileManagerPerfTest.RasterTileQueueConstruct
*RESULT tile_manager_raster_tile_queue_construct: 2= 1295742.375 runs/s
*RESULT tile_manager_raster_tile_queue_construct: 10= 271312.28125 runs/s
*RESULT tile_manager_raster_tile_queue_construct: 50= 33427.76171875 runs/s
[ OK ] TileManagerPerfTest.RasterTileQueueConstruct (6028 ms)
[ RUN ] TileManagerPerfTest.RasterTileQueueConstructAndIterate
*RESULT tile_manager_raster_tile_queue_construct_and_iterate: 2_16= 104820.0703125 runs/s
*RESULT tile_manager_raster_tile_queue_construct_and_iterate: 2_32= 52762.65234375 runs/s
*RESULT tile_manager_raster_tile_queue_construct_and_iterate: 2_64= 25451.34765625 runs/s
*RESULT tile_manager_raster_tile_queue_construct_and_iterate: 2_128= 11645.6328125 runs/s
*RESULT tile_manager_raster_tile_queue_construct_and_iterate: 10_16= 65317.77734375 runs/s
*RESULT tile_manager_raster_tile_queue_construct_and_iterate: 10_32= 36263.56640625 runs/s
*RESULT tile_manager_raster_tile_queue_construct_and_iterate: 10_64= 19423.873046875 runs/s
*RESULT tile_manager_raster_tile_queue_construct_and_iterate: 10_128= 9491.0283203125 runs/s
*RESULT tile_manager_raster_tile_queue_construct_and_iterate: 50_16= 21731.697265625 runs/s
*RESULT tile_manager_raster_tile_queue_construct_and_iterate: 50_32= 16152.537109375 runs/s
*RESULT tile_manager_raster_tile_queue_construct_and_iterate: 50_64= 11045.8193359375 runs/s
*RESULT tile_manager_raster_tile_queue_construct_and_iterate: 50_128= 6502.76611328125 runs/s
[ OK ] TileManagerPerfTest.RasterTileQueueConstructAndIterate (24199 ms)

After:
[ RUN ] TileManagerPerfTest.RasterTileQueueConstruct
*RESULT tile_manager_raster_tile_queue_construct: 2= 1360116.625 runs/s
*RESULT tile_manager_raster_tile_queue_construct: 10= 303830.90625 runs/s
*RESULT tile_manager_raster_tile_queue_construct: 50= 37095.8828125 runs/s
[ OK ] TileManagerPerfTest.RasterTileQueueConstruct (6027 ms)
[ RUN ] TileManagerPerfTest.RasterTileQueueConstructAndIterate
*RESULT tile_manager_raster_tile_queue_construct_and_iterate: 2_16= 337294.65625 runs/s
*RESULT tile_manager_raster_tile_queue_construct_and_iterate: 2_32= 204357.953125 runs/s
*RESULT tile_manager_raster_tile_queue_construct_and_iterate: 2_64= 105868.625 runs/s
*RESULT tile_manager_raster_tile_queue_construct_and_iterate: 2_128= 46488.0234375 runs/s
*RESULT tile_manager_raster_tile_queue_construct_and_iterate: 10_16= 137583 runs/s
*RESULT tile_manager_raster_tile_queue_construct_and_iterate: 10_32= 89561.90625 runs/s
*RESULT tile_manager_raster_tile_queue_construct_and_iterate: 10_64= 52665.28515625 runs/s
*RESULT tile_manager_raster_tile_queue_construct_and_iterate: 10_128= 27036.296875 runs/s
*RESULT tile_manager_raster_tile_queue_construct_and_iterate: 50_16= 28990.201171875 runs/s
*RESULT tile_manager_raster_tile_queue_construct_and_iterate: 50_32= 24570.134765625 runs/s
*RESULT tile_manager_raster_tile_queue_construct_and_iterate: 50_64= 19634.06640625 runs/s
*RESULT tile_manager_raster_tile_queue_construct_and_iterate: 50_128= 12886.591796875 runs/s
[ OK ] TileManagerPerfTest.RasterTileQueueConstructAndIterate (24187 ms)

Construct now also improves, and construct-and-iterate is still a substantial improvement.
Can you explain what your objection to this patch is?
Here are the profile numbers for the ConstructAndIterate test (I included everything >1%, pruning at the layer raster tile iterator level):

Before:
RunConstructAndIterate 99%
RasterTilePriorityQueue::Pop 77.2%
adjust heap (from push/pop heap) 22.7%
PairedPictureLayerQueue::Pop 11.1%
LayerRasterTileIterator::operator++ 7.1%
std::vector::insert 1.1%
free 11.1%
memmove 9.4%
new 7.5%
std::vector::operator= 7.2%
RasterOrderComparator::operator() 3.1%
LayerRasterTileIterator::~LayerRasterTileIterator 1.3%
BuildRasterQueue 17.7%
RasterTilePriorityQueue::Build 16.8%
LayerRasterTileIterator::LayerRasterTileIterator 8.9%
adjust heap (from make heap) 3.0%
std::vector::insert 2.8%
RasterTilePriorityQueue::~RasterTilePriorityQueue 1.2%

After:
RunConstructAndIterate 100%
RasterTilePriorityQueue::Pop 56.3%
PairedPictureLayerQueue::Pop 33.6%
LayerRasterTileIterator::operator++ 19.8%
std::vector::insert 8.2%
LayerRasterTileIterator::operator bool 1.7%
LayerRasterTileIterator::operator* 1.2%
RasterOrderComparator::operator() 16.0%
LayerRasterTileIterator::operator bool 5.0%
LayerRasterTileIterator::operator* 3.7%
BuildRasterQueue 33.4%
RasterTilePriority::Build 31.8%
LayerRasterTileIterator::LayerRasterTileIterator 18.2%
std::vector<PairedPictureLayerQueue>::insert 5.8%
std::vector<PairedPictureLayerQueue*>::insert 2.3%
adjust heap (from make heap) 2.1%
GetPairedPictureLayers 1.3%
RasterTilePriorityQueue::~RasterTilePriorityQueue 3.0%
free 2.5%
RasterTilePriorityQueue::Top 1.9%
(LapTimer::NextLap 1.8%)
RasterTilePriorityQueue::IsEmpty 1.4%
On 2014/08/01 17:52:47, vmpstr wrote:
> Here are the profile numbers for the ConstructAndIterate test (I included
> everything >1%, pruning at the layer raster tile iterator level):
>
> Before:
> RunConstructAndIterate 99%
> RasterTilePriorityQueue::Pop 77.2%
> adjust heap (from push/pop heap) 22.7%
> PairedPictureLayerQueue::Pop 11.1%
> LayerRasterTileIterator::operator++ 7.1%
> std::vector::insert 1.1%
> free 11.1%
> memmove 9.4%
> new 7.5%

I assume free/new is from returned_tiles_?

> std::vector::operator= 7.2%
> RasterOrderComparator::operator() 3.1%
> LayerRasterTileIterator::~LayerRasterTileIterator 1.3%
> BuildRasterQueue 17.7%
> RasterTilePriorityQueue::Build 16.8%
> LayerRasterTileIterator::LayerRasterTileIterator 8.9%
> adjust heap (from make heap) 3.0%
> std::vector::insert 2.8%
> RasterTilePriorityQueue::~RasterTilePriorityQueue 1.2%
>
> After:
> RunConstructAndIterate 100%
> RasterTilePriorityQueue::Pop 56.3%
> PairedPictureLayerQueue::Pop 33.6%
> LayerRasterTileIterator::operator++ 19.8%
> std::vector::insert 8.2%
> LayerRasterTileIterator::operator bool 1.7%
> LayerRasterTileIterator::operator* 1.2%
> RasterOrderComparator::operator() 16.0%
> LayerRasterTileIterator::operator bool 5.0%
> LayerRasterTileIterator::operator* 3.7%
> BuildRasterQueue 33.4%
> RasterTilePriority::Build 31.8%
> LayerRasterTileIterator::LayerRasterTileIterator 18.2%
> std::vector<PairedPictureLayerQueue>::insert 5.8%
> std::vector<PairedPictureLayerQueue*>::insert 2.3%
> adjust heap (from make heap) 2.1%
> GetPairedPictureLayers 1.3%
> RasterTilePriorityQueue::~RasterTilePriorityQueue 3.0%
> free 2.5%
> RasterTilePriorityQueue::Top 1.9%
> (LapTimer::NextLap 1.8%)
> RasterTilePriorityQueue::IsEmpty 1.4%

I think of this change as a last resort to be applied once we're done with all the optimizations that don't add complexity. This might be a significant win right now, but it's not clear that it will be in the future. This patch actually adds some more work, too. That happens to work out for the better right now, but maybe it doesn't after we've made other optimizations. Maybe we end up with the increased complexity from this patch without any benefit.

ScopedPtrVector would be a bit better, but I'd prefer to hold off on that too for a bit longer.

Here's what I'd like us to do before we land this:
1. Remove returned_tiles_ if possible.
2. Get a breakdown of the LayerRasterTileIterator struct to see if any low-hanging fruit is left that would significantly reduce the size.
On 2014/08/01 19:21:41, reveman wrote:
> On 2014/08/01 17:52:47, vmpstr wrote:
> > Here are the profile numbers for the ConstructAndIterate test (I included
> > everything >1%, pruning at the layer raster tile iterator level):
> >
> > Before:
> > RunConstructAndIterate 99%
> > RasterTilePriorityQueue::Pop 77.2%
> > adjust heap (from push/pop heap) 22.7%
> > PairedPictureLayerQueue::Pop 11.1%
> > LayerRasterTileIterator::operator++ 7.1%
> > std::vector::insert 1.1%
> > free 11.1%
> > memmove 9.4%
> > new 7.5%
>
> I assume free/new is from returned_tiles_?

I'm not sure. After this change, new, memmove, and free don't show up at these high levels, but the patch doesn't touch the returned-tiles logic.

> > std::vector::operator= 7.2%
> > RasterOrderComparator::operator() 3.1%
> > LayerRasterTileIterator::~LayerRasterTileIterator 1.3%
> > BuildRasterQueue 17.7%
> > RasterTilePriorityQueue::Build 16.8%
> > LayerRasterTileIterator::LayerRasterTileIterator 8.9%
> > adjust heap (from make heap) 3.0%
> > std::vector::insert 2.8%
> > RasterTilePriorityQueue::~RasterTilePriorityQueue 1.2%
> >
> > After:
> > RunConstructAndIterate 100%
> > RasterTilePriorityQueue::Pop 56.3%
> > PairedPictureLayerQueue::Pop 33.6%
> > LayerRasterTileIterator::operator++ 19.8%
> > std::vector::insert 8.2%
> > LayerRasterTileIterator::operator bool 1.7%
> > LayerRasterTileIterator::operator* 1.2%
> > RasterOrderComparator::operator() 16.0%
> > LayerRasterTileIterator::operator bool 5.0%
> > LayerRasterTileIterator::operator* 3.7%
> > BuildRasterQueue 33.4%
> > RasterTilePriority::Build 31.8%
> > LayerRasterTileIterator::LayerRasterTileIterator 18.2%
> > std::vector<PairedPictureLayerQueue>::insert 5.8%
> > std::vector<PairedPictureLayerQueue*>::insert 2.3%
> > adjust heap (from make heap) 2.1%
> > GetPairedPictureLayers 1.3%
> > RasterTilePriorityQueue::~RasterTilePriorityQueue 3.0%
> > free 2.5%
> > RasterTilePriorityQueue::Top 1.9%
> > (LapTimer::NextLap 1.8%)
> > RasterTilePriorityQueue::IsEmpty 1.4%
>
> I think of this change as a last resort to be applied once we're done with all
> the optimizations that don't add complexity. This might be a significant win
> right now, but it's not clear that it will be in the future. This patch
> actually adds some more work, too. That happens to work out for the better
> right now, but maybe it doesn't after we've made other optimizations. Maybe we
> end up with the increased complexity from this patch without any benefit.

I'm afraid that this will end up being a necessary change. For instance, I applied the eviction version of this patch on top of the layer eviction patch (i.e. assuming everything on the layer side lands), and I still see about a 100% gain. In the end I think it's very hard to beat shuffling pointers around by reducing the size of the object.

> ScopedPtrVector would be a bit better, but I'd prefer to hold off on that too
> for a bit longer.
>
> Here's what I'd like us to do before we land this:
> 1. Remove returned_tiles_ if possible.
> 2. Get a breakdown of the LayerRasterTileIterator struct to see if any
> low-hanging fruit is left that would significantly reduce the size.

I think bringing the size of the layer iterator down will definitely help, but I think this patch will still be significantly better; see my previous comment about the eviction iterators.
We should find out why these new, memmove, and free calls go away with this change, as that might lead us to a more worthwhile optimization. You can land this for now if you like, but I'd like to make sure that we re-evaluate the usefulness later. LGTM.

https://codereview.chromium.org/431433005/diff/1/cc/resources/raster_tile_pri...
File cc/resources/raster_tile_priority_queue.h (right):

https://codereview.chromium.org/431433005/diff/1/cc/resources/raster_tile_pri...
cc/resources/raster_tile_priority_queue.h:51: std::vector<PairedPictureLayerQueue*> paired_queues_heap_;
Please use ScopedPtrVector unless that is significantly slower. Also, please add a note about this possibly being unnecessary.
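For context on the earlier "add a make_heap function to ScopedPtrVector" suggestion, here is a rough sketch of the shape such a container-plus-heap-helper could take. This is a simplified stand-in written purely for illustration, not cc::ScopedPtrVector's actual interface, and it assumes the comparator takes raw pointers (as RasterOrderComparator would after this patch).

#include <algorithm>
#include <cstddef>
#include <vector>

// Simplified stand-in for an owning pointer vector with heap helpers.
template <typename T>
class OwningPtrVector {
 public:
  OwningPtrVector() {}
  ~OwningPtrVector() {
    for (std::size_t i = 0; i < data_.size(); ++i)
      delete data_[i];
  }

  void push_back(T* item) { data_.push_back(item); }  // takes ownership
  T* front() const { return data_.front(); }
  bool empty() const { return data_.empty(); }

  // Heap helpers over the raw pointers; Compare is called with T* arguments.
  template <typename Compare>
  void make_heap(const Compare& comp) {
    std::make_heap(data_.begin(), data_.end(), comp);
  }
  template <typename Compare>
  void pop_heap_and_delete(const Compare& comp) {
    std::pop_heap(data_.begin(), data_.end(), comp);
    delete data_.back();
    data_.pop_back();
  }

 private:
  std::vector<T*> data_;

  // Copying is deliberately not supported (ownership is unique).
  OwningPtrVector(const OwningPtrVector&);
  OwningPtrVector& operator=(const OwningPtrVector&);
};

The point of wrapping the heap calls inside the container (rather than exposing iterators) is that ownership and heap order live in one place, which is roughly what the ScopedPtrVector patch set aims for.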
On 2014/08/04 14:37:16, reveman wrote:
> We should find out why these new, memmove, and free calls go away with this
> change, as that might lead us to a more worthwhile optimization. You can land
> this for now if you like, but I'd like to make sure that we re-evaluate the
> usefulness later. LGTM.

I've added a comment to ensure we revisit this.

The new/free/memmove operations happen because we're not actually heapifying layer iterators; we're heapifying PairedPictureLayerQueue objects. So I think it is the shared_tiles vector in those objects that causes these memory operations: basically, we keep copying a vector around. I think the two TODOs in the header (removing shared_tiles, and removing ScopedPtrVector) are definitely the next things to address. I would still like to move forward with landing this, since as a first pass I'd like to address the largest inefficiencies we have here and revisit the smaller, more correct gains later.
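To make the "we keep copying a vector around" point concrete, here is a small self-contained illustration (stand-in names again, not the real cc types): when a heap element contains a std::vector member, every element assignment performed by make_heap/push_heap/pop_heap drags that vector along. In a C++03-style build those are full copies, which is where allocation/free/memmove traffic like the profile above can come from; with C++11 moves the vectors are moved rather than reallocated, but the elements are still far larger than a pointer.

#include <algorithm>
#include <cstddef>
#include <vector>

// Stand-in for PairedPictureLayerQueue: a small key plus a vector member
// (analogous to returned_shared_tiles).
struct PairedQueue {
  int priority;
  std::vector<int> returned_tiles;
};

struct ByValueLess {
  bool operator()(const PairedQueue& a, const PairedQueue& b) const {
    return a.priority < b.priority;
  }
};

int main() {
  std::vector<PairedQueue> heap(64);
  for (std::size_t i = 0; i < heap.size(); ++i) {
    heap[i].priority = static_cast<int>((i * 7) % 64);
    heap[i].returned_tiles.resize(16);  // this vector rides along with every element
  }

  // Each sift-up/down in these calls assigns whole PairedQueue objects
  // (copies in C++03, moves in C++11), including their vector members;
  // with a heap of PairedQueue* only pointers would move.
  std::make_heap(heap.begin(), heap.end(), ByValueLess());
  std::pop_heap(heap.begin(), heap.end(), ByValueLess());
  heap.pop_back();
  return 0;
}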
https://codereview.chromium.org/431433005/diff/1/cc/resources/raster_tile_pri...
File cc/resources/raster_tile_priority_queue.h (right):

https://codereview.chromium.org/431433005/diff/1/cc/resources/raster_tile_pri...
cc/resources/raster_tile_priority_queue.h:51: std::vector<PairedPictureLayerQueue*> paired_queues_heap_;
On 2014/08/04 14:37:16, reveman wrote:
> Please use ScopedPtrVector unless that is significantly slower. Also, please
> add a note about this possibly being unnecessary.

Done.
The CQ bit was checked by vmpstr@chromium.org
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/vmpstr@chromium.org/431433005/20001
Message was sent while issue was closed.
Change committed as 287378