Issue 57823003: Add SK_PREFETCH and use in SkBlurImageFilter.

Issue 57823003: Add SK_PREFETCH and use in SkBlurImageFilter. (Closed)

Created:
7 years, 1 month ago by mtklein

Modified:
7 years, 1 month ago

Reviewers:
Nico, Stephen White, senorblanco, reed1

CC:
skia-review_googlegroups.com, tomhudson

Base URL:
https://skia.googlecode.com/svn/trunk

Visibility:
Public.

More Reviews

Description

Add SK_PREFETCH and use in SkBlurImageFilter. Relative speed is 1.2-1.6x on desktop, 1.0-1.2x on Nexus 4. (Division remains the bottleneck, now more so.) BUG= Committed: http://code.google.com/p/skia/source/detail?r=12129

Patch Set 1 #

Total comments: 1

Patch Set 2 : sync #

Patch Set 3 : slimmer change #

Total comments: 1

Created: 7 years, 1 month ago

Download [raw] [tar.bz2]

		Unified diffs	Side-by-side diffs	Delta from patch set	Stats (+11 lines, -0 lines)			Patch
	M	include/core/SkPostConfig.h	View		1 chunk	+9 lines, -0 lines	1 comment	Download
	M	src/effects/SkBlurImageFilter.cpp	View	1 2	1 chunk	+2 lines, -0 lines	0 comments	Download

Messages

Total messages: 12 (0 generated)

Expand Messages | Collapse Messages

mtklein

On 2013/11/04 15:48:35, mtklein wrote: After the fixed point change, this CL is now +10% ...

7 years, 1 month ago (2013-11-04 16:16:14 UTC) #2

Stephen White

https://codereview.chromium.org/57823003/diff/1/src/effects/SkBlurImageFilter.cpp File src/effects/SkBlurImageFilter.cpp (right): https://codereview.chromium.org/57823003/diff/1/src/effects/SkBlurImageFilter.cpp#newcode127 src/effects/SkBlurImageFilter.cpp:127: SK_PREFETCH(front); // This step by srcStride seems to be ...

7 years, 1 month ago (2013-11-04 16:24:17 UTC) #4

mtklein

On 2013/11/04 16:24:17, Stephen White wrote: > https://codereview.chromium.org/57823003/diff/1/src/effects/SkBlurImageFilter.cpp > File src/effects/SkBlurImageFilter.cpp (right): > > https://codereview.chromium.org/57823003/diff/1/src/effects/SkBlurImageFilter.cpp#newcode127 ...

7 years, 1 month ago (2013-11-04 16:34:14 UTC) #5

Stephen White

On 2013/11/04 16:34:14, mtklein wrote: > On 2013/11/04 16:24:17, Stephen White wrote: > > > ...

7 years, 1 month ago (2013-11-04 16:38:43 UTC) #6

On 2013/11/04 16:34:14, mtklein wrote:
> On 2013/11/04 16:24:17, Stephen White wrote:
> >
>
https://codereview.chromium.org/57823003/diff/1/src/effects/SkBlurImageFilter...
> > File src/effects/SkBlurImageFilter.cpp (right):
> > 
> >
>
https://codereview.chromium.org/57823003/diff/1/src/effects/SkBlurImageFilter...
> > src/effects/SkBlurImageFilter.cpp:127: SK_PREFETCH(front);  // This step by
> > srcStride seems to be hard to predict.
> > I'm a little leery of prefetch instructions, due to this Intel lore: "On
some
> > older processors, under very bandwidth-sensitive workloads, ineffective
> hardware
> > prefetches can hurt the system by actually using bandwidth that is needed by
> > your application." But I don't have any hard data.
> > 
> > Could you characterize the impact of this change with and without the
> prefetch?
> 
> Uh, that quote's exactly on the money, but I think you may be reading it
> backwards.  Any hardware prefetches being done here are indeed useless, so
we're
> adding in an application-level prefetch hint to tell it the right load to
make.

Sorry, I didn't mean the reordering; that's fine. I meant the call to
SK_PREFETCH(), which seems to be __builtin_prefetch, which says: "This function
is used to minimize cache-miss latency by moving data into a cache before it is
accessed. You can insert calls to __builtin_prefetch into code for which you
know addresses of data in memory that is likely to be accessed soon. If the
target supports them, data prefetch instructions will be generated. If the
prefetch is done early enough before the access then the data will be in the
cache by the time it is accessed."

That sounds like it generates a prefetch instruction, or am I reading that
wrong?

> I'm not exactly sure what you were looking to compare here.  I've rewritten
the
> change to make it more obvious that the only functional change here is the
> prefetch.
> 
> Performance is still in the 1.12-1.16x range.

Stephen White

On 2013/11/04 16:38:43, Stephen White wrote: > On 2013/11/04 16:34:14, mtklein wrote: > > On ...

7 years, 1 month ago (2013-11-04 16:52:24 UTC) #7

On 2013/11/04 16:38:43, Stephen White wrote:
> On 2013/11/04 16:34:14, mtklein wrote:
> > On 2013/11/04 16:24:17, Stephen White wrote:
> > >
> >
>
https://codereview.chromium.org/57823003/diff/1/src/effects/SkBlurImageFilter...
> > > File src/effects/SkBlurImageFilter.cpp (right):
> > > 
> > >
> >
>
https://codereview.chromium.org/57823003/diff/1/src/effects/SkBlurImageFilter...
> > > src/effects/SkBlurImageFilter.cpp:127: SK_PREFETCH(front);  // This step
by
> > > srcStride seems to be hard to predict.
> > > I'm a little leery of prefetch instructions, due to this Intel lore: "On
> some
> > > older processors, under very bandwidth-sensitive workloads, ineffective
> > hardware
> > > prefetches can hurt the system by actually using bandwidth that is needed
by
> > > your application." But I don't have any hard data.
> > > 
> > > Could you characterize the impact of this change with and without the
> > prefetch?
> > 
> > Uh, that quote's exactly on the money, but I think you may be reading it
> > backwards.  Any hardware prefetches being done here are indeed useless, so
> we're
> > adding in an application-level prefetch hint to tell it the right load to
> make.
> 
> Sorry, I didn't mean the reordering; that's fine. I meant the call to
> SK_PREFETCH(), which seems to be __builtin_prefetch, which says: "This
function
> is used to minimize cache-miss latency by moving data into a cache before it
is
> accessed. You can insert calls to __builtin_prefetch into code for which you
> know addresses of data in memory that is likely to be accessed soon. If the
> target supports them, data prefetch instructions will be generated. If the
> prefetch is done early enough before the access then the data will be in the
> cache by the time it is accessed."
> 
> That sounds like it generates a prefetch instruction, or am I reading that
> wrong?

In case my Monday-brain is still not being clear: I thought perhaps the
reordering alone would be enough to get the hardware prefetchers to do the right
thing, without the need for SK_PREFETCH, so I was wondering what the perf would
be with the hoisting alone, and no SK_PREFETCH.

> > I'm not exactly sure what you were looking to compare here.  I've rewritten
> the
> > change to make it more obvious that the only functional change here is the
> > prefetch.
> > 
> > Performance is still in the 1.12-1.16x range.

mtklein

On 2013/11/04 16:38:43, Stephen White wrote: > On 2013/11/04 16:34:14, mtklein wrote: > > On ...

7 years, 1 month ago (2013-11-04 16:54:36 UTC) #8

On 2013/11/04 16:38:43, Stephen White wrote:
> On 2013/11/04 16:34:14, mtklein wrote:
> > On 2013/11/04 16:24:17, Stephen White wrote:
> > >
> >
>
https://codereview.chromium.org/57823003/diff/1/src/effects/SkBlurImageFilter...
> > > File src/effects/SkBlurImageFilter.cpp (right):
> > > 
> > >
> >
>
https://codereview.chromium.org/57823003/diff/1/src/effects/SkBlurImageFilter...
> > > src/effects/SkBlurImageFilter.cpp:127: SK_PREFETCH(front);  // This step
by
> > > srcStride seems to be hard to predict.
> > > I'm a little leery of prefetch instructions, due to this Intel lore: "On
> some
> > > older processors, under very bandwidth-sensitive workloads, ineffective
> > hardware
> > > prefetches can hurt the system by actually using bandwidth that is needed
by
> > > your application." But I don't have any hard data.
> > > 
> > > Could you characterize the impact of this change with and without the
> > prefetch?
> > 
> > Uh, that quote's exactly on the money, but I think you may be reading it
> > backwards.  Any hardware prefetches being done here are indeed useless, so
> we're
> > adding in an application-level prefetch hint to tell it the right load to
> make.
> 
> Sorry, I didn't mean the reordering; that's fine. I meant the call to
> SK_PREFETCH(), which seems to be __builtin_prefetch, which says: "This
function
> is used to minimize cache-miss latency by moving data into a cache before it
is
> accessed. You can insert calls to __builtin_prefetch into code for which you
> know addresses of data in memory that is likely to be accessed soon. If the
> target supports them, data prefetch instructions will be generated. If the
> prefetch is done early enough before the access then the data will be in the
> cache by the time it is accessed."
> 
> That sounds like it generates a prefetch instruction, or am I reading that
> wrong?
> 
> > I'm not exactly sure what you were looking to compare here.  I've rewritten
> the
> > change to make it more obvious that the only functional change here is the
> > prefetch.
> > 
> > Performance is still in the 1.12-1.16x range.

Yeah, that's exactly right.  We're doing the fetch just as soon as we know the
next address, to give it as much time as possible to get that cache line loaded.

That document you were reading before was trying to explain that hardware
prefetches (automatic) aren't always a win, and that sometimes software
prefetches (using prefetch instructions) can be better than the automatic
hardware prefetches.  That's what's happening here: any prefetches being done by
the CPU aren't fetching the right cache line.  We know what address we want, so
we issue a software prefetch instruction to help it out.  The movs later on will
find the cache populated (or more realistically, somewhat closer to being
populated).

Most of the time hardware prefetches are just the most lovely thing in the
world.  Any time you do a looped *ptr++, you're totally cool.  The hardware can
detect that and prefetch cache lines as you're walking along way better than we
can hint.  In this case, I'm actually a little surprised that the hardware isn't
able to cotton onto this +=srcStride.  Seems straightforward.

Stephen White

LGTM. I'm surprised that the hardware prefetchers are so bad, but data wins. Thanks for ...

7 years, 1 month ago (2013-11-04 17:13:50 UTC) #9

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://skia-tree-status.appspot.com/cq/mtklein@google.com/57823003/20003

7 years, 1 month ago (2013-11-05 13:23:24 UTC) #10

Nico

7 years, 1 month ago (2013-11-12 05:23:09 UTC) #12

Message was sent while issue was closed.

fyi

https://codereview.chromium.org/57823003/diff/20003/include/core/SkPostConfig.h
File include/core/SkPostConfig.h (right):

https://codereview.chromium.org/57823003/diff/20003/include/core/SkPostConfig...
include/core/SkPostConfig.h:384: #if defined(__clang__) || defined(__GNUC__)
clang defines __GNUC__ too, no need to check for it separately

Expand Messages | Collapse Messages