Issue 134283002: Eliminate useless NULL push by making fIndexedData 0-based.

Issue 134283002: Eliminate useless NULL push by making fIndexedData 0-based. (Closed)

Created:
6 years, 11 months ago by mtklein

Modified:
6 years, 11 months ago

Reviewers:
Dominik Grewe, Tom Hudson, hal.canary

CC:
skia-review_googlegroups.com

Base URL:
https://skia.googlesource.com/skia.git@off-by-one

Visibility:
Public.

More Reviews

Description

Eliminate useless NULL push by making fIndexedData 0-based. Depends on http://crrev.com/134223002 Testing: out/Debug/dm && out/Debug/tests && echo ok BUG=skia:1979 Committed: http://code.google.com/p/skia/source/detail?r=13028

Patch Set 1 #

Created: 6 years, 11 months ago

Download [raw] [tar.bz2]

		Unified diffs	Side-by-side diffs	Delta from patch set	Stats (+9 lines, -10 lines)			Patch
	M	src/core/SkPictureFlat.h	View		6 chunks	+9 lines, -10 lines	0 comments	Download

Messages

Total messages: 10 (0 generated)

Expand Messages | Collapse Messages

Dominik Grewe

On 2014/01/10 18:21:36, Dominik Grewe wrote: > LGTM. Btw, do you have any data on ...

6 years, 11 months ago (2014-01-10 18:24:08 UTC) #4

mtklein

On 2014/01/10 18:24:08, Dominik Grewe wrote: > On 2014/01/10 18:21:36, Dominik Grewe wrote: > > ...

6 years, 11 months ago (2014-01-10 19:04:02 UTC) #5

Dominik Grewe

On 2014/01/10 19:04:02, mtklein wrote: > On 2014/01/10 18:24:08, Dominik Grewe wrote: > > On ...

6 years, 11 months ago (2014-01-10 19:17:35 UTC) #6

On 2014/01/10 19:04:02, mtklein wrote:
> On 2014/01/10 18:24:08, Dominik Grewe wrote:
> > On 2014/01/10 18:21:36, Dominik Grewe wrote:
> > > LGTM.
> > 
> > Btw, do you have any data on how much of a difference this makes in terms of
> > performance? Or how often you save a malloc for empty dictionaries?
> 
> I wouldn't get excited about it.
> 
> Before this CL, I recorded 64 SKPs 900 times each (a total of 57,600
> recordings), and each of SkPaint, SkMatrix, and SkRegion's allocations in the
> SkFlatDictionary constructor were visible, each allocating 1.73MB over all
runs.
>  (This is 32B per dictionary instantiation.)  In total, sk_realloc_throw
(which
> is what adding to a TDArray boils down to) allocated 105.70MB in 657006 calls,
> about 11.5 calls per recording.
> 
> In contrast, with this CL, I did the same thing and saw no calls to
> sk_realloc_throw from the constructor, which is expected.  In total
> sk_realloc_throw allocated 102.50MB (3% improvement) in 559800 calls, about
9.7
> calls per recording (16% improvement).
> 
> Code under SkFlatDictionary accounted for 12140 of ~29000 ms of runtime.  Of
> that 12140ms, before this CL we spent 123ms in sk_realloc_throw.  After this
CL,
> we spent 121ms out of a total of 11987ms spent under SkFlatDictionary in
> sk_realloc_throw.  So I see no signficant speed change: reallocating this
> dictionary is about 1% (120/12000) of the time we spend under
SkFlatDictionary,
> which itself accounts for about 40% (12/29) of the time spent recording.  At
> best we can move the overall needle 0.4% tweaking this, but I think we have
not
> moved it at all.

Thanks for the analysis.
I'm surprised that sk_realloc_throw only accounts for such a small percentage of
SkFlatDictionary. In our perf profiles malloc came up quite a bit. Maybe that's
due to different platforms (I guess you were running on Linux?)? And I guess we
also used different data sets.

On the other hand it means that there is significant scope in tuning the other
operations of the dictionary, right? Was it mainly the paint dictionary that
contributed to the 12140ms?

mtklein

On 2014/01/10 19:17:35, Dominik Grewe wrote: > On 2014/01/10 19:04:02, mtklein wrote: > > On ...

6 years, 11 months ago (2014-01-10 19:43:20 UTC) #7

On 2014/01/10 19:17:35, Dominik Grewe wrote:
> On 2014/01/10 19:04:02, mtklein wrote:
> > On 2014/01/10 18:24:08, Dominik Grewe wrote:
> > > On 2014/01/10 18:21:36, Dominik Grewe wrote:
> > > > LGTM.
> > > 
> > > Btw, do you have any data on how much of a difference this makes in terms
of
> > > performance? Or how often you save a malloc for empty dictionaries?
> > 
> > I wouldn't get excited about it.
> > 
> > Before this CL, I recorded 64 SKPs 900 times each (a total of 57,600
> > recordings), and each of SkPaint, SkMatrix, and SkRegion's allocations in
the
> > SkFlatDictionary constructor were visible, each allocating 1.73MB over all
> runs.
> >  (This is 32B per dictionary instantiation.)  In total, sk_realloc_throw
> (which
> > is what adding to a TDArray boils down to) allocated 105.70MB in 657006
calls,
> > about 11.5 calls per recording.
> > 
> > In contrast, with this CL, I did the same thing and saw no calls to
> > sk_realloc_throw from the constructor, which is expected.  In total
> > sk_realloc_throw allocated 102.50MB (3% improvement) in 559800 calls, about
> 9.7
> > calls per recording (16% improvement).
> > 
> > Code under SkFlatDictionary accounted for 12140 of ~29000 ms of runtime.  Of
> > that 12140ms, before this CL we spent 123ms in sk_realloc_throw.  After this
> CL,
> > we spent 121ms out of a total of 11987ms spent under SkFlatDictionary in
> > sk_realloc_throw.  So I see no signficant speed change: reallocating this
> > dictionary is about 1% (120/12000) of the time we spend under
> SkFlatDictionary,
> > which itself accounts for about 40% (12/29) of the time spent recording.  At
> > best we can move the overall needle 0.4% tweaking this, but I think we have
> not
> > moved it at all.
> 
> Thanks for the analysis.
> I'm surprised that sk_realloc_throw only accounts for such a small percentage
of
> SkFlatDictionary. In our perf profiles malloc came up quite a bit. Maybe
that's
> due to different platforms (I guess you were running on Linux?)? And I guess
we
> also used different data sets.
> 
> On the other hand it means that there is significant scope in tuning the other
> operations of the dictionary, right? Was it mainly the paint dictionary that
> contributed to the 12140ms?

Yeah, the thing to keep in mind here is that realloc is not malloc.  They're
counted separately.

Here's what I see as top slow things, from a run on my Mac:
  - ~29000ms total
  - 11484 under SkFlatDictionary
Time under SkFlatDictionary breaks down into:
  - 1786 SkPaint::flatten
  - 1659 SkFlatDictionary<SkPaint>::findAndReturnMutableFlat
  - 1233 SkFlatDictionary<SkPaint>::resetScratch
  -  707 SkPtrSet::add  (this is recording/deduping typefaces)
  -  537 SkWriter32::reset
  -  351 SkOrderedBuffer::writeByteArray
  -  223 SkWriter32::reservePad
  -  223 SkWriter32::peek32
  -  205 SkOrderedWriteBuffer::writeFlattenable
  -  203 SkWriter32::writePad
  -  192 SkTDynamicHash::add
  -  184 collapse_save_clip_restore (a picture recording optimization)
  -  147 SkOrderedBuffer::writeTypeface
  -  122 match  (determining if picture record optimizations apply)
  -  121 sk_realloc_throw
Obviously the list tails on down from there, but I think you get the idea of the
relative importance of sk_realloc_throw (~ SkTDArray::append).  These are flat
values, that is it measures the time spent in these methods themselves but not
in the time spent in methods they call.

If I remember right, Tom was looking at a page with a very small picture
representation, trying to get its cost down as close to zero as possible.  The
pictures I'm using are more ordinary web pages.

I personally would focus on the top 3 or 4 methods.

(If you're interested, I used Instruments's Time Profiler to profile
out/Release/bench_record, which lives at crrev.com/132573002 until I can get it
submitted.)

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://skia-tree-status.appspot.com/cq/mtklein@google.com/134283002/1

6 years, 11 months ago (2014-01-10 19:46:08 UTC) #8

tomhudson

6 years, 11 months ago (2014-01-13 10:03:40 UTC) #10

Message was sent while issue was closed.

On 2014/01/10 19:43:20, mtklein wrote:
> On 2014/01/10 19:17:35, Dominik Grewe wrote:

> Obviously the list tails on down from there, but I think you get the idea of
the
> relative importance of sk_realloc_throw (~ SkTDArray::append).  These are flat
> values, that is it measures the time spent in these methods themselves but not
> in the time spent in methods they call.

Yeah, that profile is really inconsistent with what we've seen on our target
webpages on Android, but it's a different data set. Dominik is going to try to
reproduce your numbers with the new benchmark on Linux & Android (or I will, but
I'm flying to NYC several days before he flies to MTV).

> If I remember right, Tom was looking at a page with a very small picture
> representation, trying to get its cost down as close to zero as possible.  The
> pictures I'm using are more ordinary web pages.

One of the emerging frequent containers for content is infinite lists - think
Facebook. This ends up giving us lots of small pictures which are frequently
rerecorded. So yes, it's not what you're going to see in static content, but for
the next couple of quarters we in London are going to be focused on animations,
lists, and other things that are badly janky on mobile today.

> I personally would focus on the top 3 or 4 methods.

sk_*alloc_throw() gets inlined on Linux/Android; we see a higher-level caller in
Skia, and then tcmalloc / dlmalloc. For our use cases, the top methods are
definitely dlmalloc and dlfree, and depending on the profiling setup their top
callers on Android include:

  SkFlatDictionary<*>::SkFlatDictionary
  SkFlatDictionary<*>::findAndReturnMutableFlat
  SkPictureStateTree::SkPictureStateTree
  SkPicturePlayback::SkPicturePlayback
  SkDeque::allocateBlock

  SkBBoxHierarchyRecord::~SkBBoxHierarchyRecord
  SkPicturePlayback::~SkPicturePlayback
  SkCanvas::internalRestore

Tom

Expand Messages | Collapse Messages