|
|
Created:
6 years, 11 months ago by mtklein Modified:
6 years, 11 months ago CC:
skia-review_googlegroups.com Base URL:
https://skia.googlesource.com/skia.git@off-by-one Visibility:
Public. |
DescriptionEliminate useless NULL push by making fIndexedData 0-based.
Depends on http://crrev.com/134223002
Testing: out/Debug/dm && out/Debug/tests && echo ok
BUG=skia:1979
Committed: http://code.google.com/p/skia/source/detail?r=13028
Patch Set 1 #
Messages
Total messages: 10 (0 generated)
LGTM, assuming all tests pass.
LGTM.
On 2014/01/10 18:21:36, Dominik Grewe wrote: > LGTM. Btw, do you have any data on how much of a difference this makes in terms of performance? Or how often you save a malloc for empty dictionaries?
On 2014/01/10 18:24:08, Dominik Grewe wrote: > On 2014/01/10 18:21:36, Dominik Grewe wrote: > > LGTM. > > Btw, do you have any data on how much of a difference this makes in terms of > performance? Or how often you save a malloc for empty dictionaries? I wouldn't get excited about it. Before this CL, I recorded 64 SKPs 900 times each (a total of 57,600 recordings), and each of SkPaint, SkMatrix, and SkRegion's allocations in the SkFlatDictionary constructor were visible, each allocating 1.73MB over all runs. (This is 32B per dictionary instantiation.) In total, sk_realloc_throw (which is what adding to a TDArray boils down to) allocated 105.70MB in 657006 calls, about 11.5 calls per recording. In contrast, with this CL, I did the same thing and saw no calls to sk_realloc_throw from the constructor, which is expected. In total sk_realloc_throw allocated 102.50MB (3% improvement) in 559800 calls, about 9.7 calls per recording (16% improvement). Code under SkFlatDictionary accounted for 12140 of ~29000 ms of runtime. Of that 12140ms, before this CL we spent 123ms in sk_realloc_throw. After this CL, we spent 121ms out of a total of 11987ms spent under SkFlatDictionary in sk_realloc_throw. So I see no signficant speed change: reallocating this dictionary is about 1% (120/12000) of the time we spend under SkFlatDictionary, which itself accounts for about 40% (12/29) of the time spent recording. At best we can move the overall needle 0.4% tweaking this, but I think we have not moved it at all.
On 2014/01/10 19:04:02, mtklein wrote: > On 2014/01/10 18:24:08, Dominik Grewe wrote: > > On 2014/01/10 18:21:36, Dominik Grewe wrote: > > > LGTM. > > > > Btw, do you have any data on how much of a difference this makes in terms of > > performance? Or how often you save a malloc for empty dictionaries? > > I wouldn't get excited about it. > > Before this CL, I recorded 64 SKPs 900 times each (a total of 57,600 > recordings), and each of SkPaint, SkMatrix, and SkRegion's allocations in the > SkFlatDictionary constructor were visible, each allocating 1.73MB over all runs. > (This is 32B per dictionary instantiation.) In total, sk_realloc_throw (which > is what adding to a TDArray boils down to) allocated 105.70MB in 657006 calls, > about 11.5 calls per recording. > > In contrast, with this CL, I did the same thing and saw no calls to > sk_realloc_throw from the constructor, which is expected. In total > sk_realloc_throw allocated 102.50MB (3% improvement) in 559800 calls, about 9.7 > calls per recording (16% improvement). > > Code under SkFlatDictionary accounted for 12140 of ~29000 ms of runtime. Of > that 12140ms, before this CL we spent 123ms in sk_realloc_throw. After this CL, > we spent 121ms out of a total of 11987ms spent under SkFlatDictionary in > sk_realloc_throw. So I see no signficant speed change: reallocating this > dictionary is about 1% (120/12000) of the time we spend under SkFlatDictionary, > which itself accounts for about 40% (12/29) of the time spent recording. At > best we can move the overall needle 0.4% tweaking this, but I think we have not > moved it at all. Thanks for the analysis. I'm surprised that sk_realloc_throw only accounts for such a small percentage of SkFlatDictionary. In our perf profiles malloc came up quite a bit. Maybe that's due to different platforms (I guess you were running on Linux?)? And I guess we also used different data sets. On the other hand it means that there is significant scope in tuning the other operations of the dictionary, right? Was it mainly the paint dictionary that contributed to the 12140ms?
On 2014/01/10 19:17:35, Dominik Grewe wrote: > On 2014/01/10 19:04:02, mtklein wrote: > > On 2014/01/10 18:24:08, Dominik Grewe wrote: > > > On 2014/01/10 18:21:36, Dominik Grewe wrote: > > > > LGTM. > > > > > > Btw, do you have any data on how much of a difference this makes in terms of > > > performance? Or how often you save a malloc for empty dictionaries? > > > > I wouldn't get excited about it. > > > > Before this CL, I recorded 64 SKPs 900 times each (a total of 57,600 > > recordings), and each of SkPaint, SkMatrix, and SkRegion's allocations in the > > SkFlatDictionary constructor were visible, each allocating 1.73MB over all > runs. > > (This is 32B per dictionary instantiation.) In total, sk_realloc_throw > (which > > is what adding to a TDArray boils down to) allocated 105.70MB in 657006 calls, > > about 11.5 calls per recording. > > > > In contrast, with this CL, I did the same thing and saw no calls to > > sk_realloc_throw from the constructor, which is expected. In total > > sk_realloc_throw allocated 102.50MB (3% improvement) in 559800 calls, about > 9.7 > > calls per recording (16% improvement). > > > > Code under SkFlatDictionary accounted for 12140 of ~29000 ms of runtime. Of > > that 12140ms, before this CL we spent 123ms in sk_realloc_throw. After this > CL, > > we spent 121ms out of a total of 11987ms spent under SkFlatDictionary in > > sk_realloc_throw. So I see no signficant speed change: reallocating this > > dictionary is about 1% (120/12000) of the time we spend under > SkFlatDictionary, > > which itself accounts for about 40% (12/29) of the time spent recording. At > > best we can move the overall needle 0.4% tweaking this, but I think we have > not > > moved it at all. > > Thanks for the analysis. > I'm surprised that sk_realloc_throw only accounts for such a small percentage of > SkFlatDictionary. In our perf profiles malloc came up quite a bit. Maybe that's > due to different platforms (I guess you were running on Linux?)? And I guess we > also used different data sets. > > On the other hand it means that there is significant scope in tuning the other > operations of the dictionary, right? Was it mainly the paint dictionary that > contributed to the 12140ms? Yeah, the thing to keep in mind here is that realloc is not malloc. They're counted separately. Here's what I see as top slow things, from a run on my Mac: - ~29000ms total - 11484 under SkFlatDictionary Time under SkFlatDictionary breaks down into: - 1786 SkPaint::flatten - 1659 SkFlatDictionary<SkPaint>::findAndReturnMutableFlat - 1233 SkFlatDictionary<SkPaint>::resetScratch - 707 SkPtrSet::add (this is recording/deduping typefaces) - 537 SkWriter32::reset - 351 SkOrderedBuffer::writeByteArray - 223 SkWriter32::reservePad - 223 SkWriter32::peek32 - 205 SkOrderedWriteBuffer::writeFlattenable - 203 SkWriter32::writePad - 192 SkTDynamicHash::add - 184 collapse_save_clip_restore (a picture recording optimization) - 147 SkOrderedBuffer::writeTypeface - 122 match (determining if picture record optimizations apply) - 121 sk_realloc_throw Obviously the list tails on down from there, but I think you get the idea of the relative importance of sk_realloc_throw (~ SkTDArray::append). These are flat values, that is it measures the time spent in these methods themselves but not in the time spent in methods they call. If I remember right, Tom was looking at a page with a very small picture representation, trying to get its cost down as close to zero as possible. The pictures I'm using are more ordinary web pages. I personally would focus on the top 3 or 4 methods. (If you're interested, I used Instruments's Time Profiler to profile out/Release/bench_record, which lives at crrev.com/132573002 until I can get it submitted.)
CQ is trying da patch. Follow status at https://skia-tree-status.appspot.com/cq/mtklein@google.com/134283002/1
Message was sent while issue was closed.
Change committed as 13028
Message was sent while issue was closed.
On 2014/01/10 19:43:20, mtklein wrote: > On 2014/01/10 19:17:35, Dominik Grewe wrote: > Obviously the list tails on down from there, but I think you get the idea of the > relative importance of sk_realloc_throw (~ SkTDArray::append). These are flat > values, that is it measures the time spent in these methods themselves but not > in the time spent in methods they call. Yeah, that profile is really inconsistent with what we've seen on our target webpages on Android, but it's a different data set. Dominik is going to try to reproduce your numbers with the new benchmark on Linux & Android (or I will, but I'm flying to NYC several days before he flies to MTV). > If I remember right, Tom was looking at a page with a very small picture > representation, trying to get its cost down as close to zero as possible. The > pictures I'm using are more ordinary web pages. One of the emerging frequent containers for content is infinite lists - think Facebook. This ends up giving us lots of small pictures which are frequently rerecorded. So yes, it's not what you're going to see in static content, but for the next couple of quarters we in London are going to be focused on animations, lists, and other things that are badly janky on mobile today. > I personally would focus on the top 3 or 4 methods. sk_*alloc_throw() gets inlined on Linux/Android; we see a higher-level caller in Skia, and then tcmalloc / dlmalloc. For our use cases, the top methods are definitely dlmalloc and dlfree, and depending on the profiling setup their top callers on Android include: SkFlatDictionary<*>::SkFlatDictionary SkFlatDictionary<*>::findAndReturnMutableFlat SkPictureStateTree::SkPictureStateTree SkPicturePlayback::SkPicturePlayback SkDeque::allocateBlock SkBBoxHierarchyRecord::~SkBBoxHierarchyRecord SkPicturePlayback::~SkPicturePlayback SkCanvas::internalRestore Tom |