|
|
Created:
7 years, 1 month ago by mtklein Modified:
7 years, 1 month ago CC:
skia-review_googlegroups.com, tomhudson Base URL:
https://skia.googlecode.com/svn/trunk Visibility:
Public. |
DescriptionAdd SK_PREFETCH and use in SkBlurImageFilter.
Relative speed is 1.2-1.6x on desktop, 1.0-1.2x on Nexus 4.
(Division remains the bottleneck, now more so.)
BUG=
Committed: http://code.google.com/p/skia/source/detail?r=12129
Patch Set 1 #
Total comments: 1
Patch Set 2 : sync #Patch Set 3 : slimmer change #
Total comments: 1
Messages
Total messages: 12 (0 generated)
On 2013/11/04 15:48:35, mtklein wrote: After the fixed point change, this CL is now +10% across the board.
lgtm
https://codereview.chromium.org/57823003/diff/1/src/effects/SkBlurImageFilter... File src/effects/SkBlurImageFilter.cpp (right): https://codereview.chromium.org/57823003/diff/1/src/effects/SkBlurImageFilter... src/effects/SkBlurImageFilter.cpp:127: SK_PREFETCH(front); // This step by srcStride seems to be hard to predict. I'm a little leery of prefetch instructions, due to this Intel lore: "On some older processors, under very bandwidth-sensitive workloads, ineffective hardware prefetches can hurt the system by actually using bandwidth that is needed by your application." But I don't have any hard data. Could you characterize the impact of this change with and without the prefetch?
On 2013/11/04 16:24:17, Stephen White wrote: > https://codereview.chromium.org/57823003/diff/1/src/effects/SkBlurImageFilter... > File src/effects/SkBlurImageFilter.cpp (right): > > https://codereview.chromium.org/57823003/diff/1/src/effects/SkBlurImageFilter... > src/effects/SkBlurImageFilter.cpp:127: SK_PREFETCH(front); // This step by > srcStride seems to be hard to predict. > I'm a little leery of prefetch instructions, due to this Intel lore: "On some > older processors, under very bandwidth-sensitive workloads, ineffective hardware > prefetches can hurt the system by actually using bandwidth that is needed by > your application." But I don't have any hard data. > > Could you characterize the impact of this change with and without the prefetch? Uh, that quote's exactly on the money, but I think you may be reading it backwards. Any hardware prefetches being done here are indeed useless, so we're adding in an application-level prefetch hint to tell it the right load to make. I'm not exactly sure what you were looking to compare here. I've rewritten the change to make it more obvious that the only functional change here is the prefetch. Performance is still in the 1.12-1.16x range.
On 2013/11/04 16:34:14, mtklein wrote: > On 2013/11/04 16:24:17, Stephen White wrote: > > > https://codereview.chromium.org/57823003/diff/1/src/effects/SkBlurImageFilter... > > File src/effects/SkBlurImageFilter.cpp (right): > > > > > https://codereview.chromium.org/57823003/diff/1/src/effects/SkBlurImageFilter... > > src/effects/SkBlurImageFilter.cpp:127: SK_PREFETCH(front); // This step by > > srcStride seems to be hard to predict. > > I'm a little leery of prefetch instructions, due to this Intel lore: "On some > > older processors, under very bandwidth-sensitive workloads, ineffective > hardware > > prefetches can hurt the system by actually using bandwidth that is needed by > > your application." But I don't have any hard data. > > > > Could you characterize the impact of this change with and without the > prefetch? > > Uh, that quote's exactly on the money, but I think you may be reading it > backwards. Any hardware prefetches being done here are indeed useless, so we're > adding in an application-level prefetch hint to tell it the right load to make. Sorry, I didn't mean the reordering; that's fine. I meant the call to SK_PREFETCH(), which seems to be __builtin_prefetch, which says: "This function is used to minimize cache-miss latency by moving data into a cache before it is accessed. You can insert calls to __builtin_prefetch into code for which you know addresses of data in memory that is likely to be accessed soon. If the target supports them, data prefetch instructions will be generated. If the prefetch is done early enough before the access then the data will be in the cache by the time it is accessed." That sounds like it generates a prefetch instruction, or am I reading that wrong? > I'm not exactly sure what you were looking to compare here. I've rewritten the > change to make it more obvious that the only functional change here is the > prefetch. > > Performance is still in the 1.12-1.16x range.
On 2013/11/04 16:38:43, Stephen White wrote: > On 2013/11/04 16:34:14, mtklein wrote: > > On 2013/11/04 16:24:17, Stephen White wrote: > > > > > > https://codereview.chromium.org/57823003/diff/1/src/effects/SkBlurImageFilter... > > > File src/effects/SkBlurImageFilter.cpp (right): > > > > > > > > > https://codereview.chromium.org/57823003/diff/1/src/effects/SkBlurImageFilter... > > > src/effects/SkBlurImageFilter.cpp:127: SK_PREFETCH(front); // This step by > > > srcStride seems to be hard to predict. > > > I'm a little leery of prefetch instructions, due to this Intel lore: "On > some > > > older processors, under very bandwidth-sensitive workloads, ineffective > > hardware > > > prefetches can hurt the system by actually using bandwidth that is needed by > > > your application." But I don't have any hard data. > > > > > > Could you characterize the impact of this change with and without the > > prefetch? > > > > Uh, that quote's exactly on the money, but I think you may be reading it > > backwards. Any hardware prefetches being done here are indeed useless, so > we're > > adding in an application-level prefetch hint to tell it the right load to > make. > > Sorry, I didn't mean the reordering; that's fine. I meant the call to > SK_PREFETCH(), which seems to be __builtin_prefetch, which says: "This function > is used to minimize cache-miss latency by moving data into a cache before it is > accessed. You can insert calls to __builtin_prefetch into code for which you > know addresses of data in memory that is likely to be accessed soon. If the > target supports them, data prefetch instructions will be generated. If the > prefetch is done early enough before the access then the data will be in the > cache by the time it is accessed." > > That sounds like it generates a prefetch instruction, or am I reading that > wrong? In case my Monday-brain is still not being clear: I thought perhaps the reordering alone would be enough to get the hardware prefetchers to do the right thing, without the need for SK_PREFETCH, so I was wondering what the perf would be with the hoisting alone, and no SK_PREFETCH. > > I'm not exactly sure what you were looking to compare here. I've rewritten > the > > change to make it more obvious that the only functional change here is the > > prefetch. > > > > Performance is still in the 1.12-1.16x range.
On 2013/11/04 16:38:43, Stephen White wrote: > On 2013/11/04 16:34:14, mtklein wrote: > > On 2013/11/04 16:24:17, Stephen White wrote: > > > > > > https://codereview.chromium.org/57823003/diff/1/src/effects/SkBlurImageFilter... > > > File src/effects/SkBlurImageFilter.cpp (right): > > > > > > > > > https://codereview.chromium.org/57823003/diff/1/src/effects/SkBlurImageFilter... > > > src/effects/SkBlurImageFilter.cpp:127: SK_PREFETCH(front); // This step by > > > srcStride seems to be hard to predict. > > > I'm a little leery of prefetch instructions, due to this Intel lore: "On > some > > > older processors, under very bandwidth-sensitive workloads, ineffective > > hardware > > > prefetches can hurt the system by actually using bandwidth that is needed by > > > your application." But I don't have any hard data. > > > > > > Could you characterize the impact of this change with and without the > > prefetch? > > > > Uh, that quote's exactly on the money, but I think you may be reading it > > backwards. Any hardware prefetches being done here are indeed useless, so > we're > > adding in an application-level prefetch hint to tell it the right load to > make. > > Sorry, I didn't mean the reordering; that's fine. I meant the call to > SK_PREFETCH(), which seems to be __builtin_prefetch, which says: "This function > is used to minimize cache-miss latency by moving data into a cache before it is > accessed. You can insert calls to __builtin_prefetch into code for which you > know addresses of data in memory that is likely to be accessed soon. If the > target supports them, data prefetch instructions will be generated. If the > prefetch is done early enough before the access then the data will be in the > cache by the time it is accessed." > > That sounds like it generates a prefetch instruction, or am I reading that > wrong? > > > I'm not exactly sure what you were looking to compare here. I've rewritten > the > > change to make it more obvious that the only functional change here is the > > prefetch. > > > > Performance is still in the 1.12-1.16x range. Yeah, that's exactly right. We're doing the fetch just as soon as we know the next address, to give it as much time as possible to get that cache line loaded. That document you were reading before was trying to explain that hardware prefetches (automatic) aren't always a win, and that sometimes software prefetches (using prefetch instructions) can be better than the automatic hardware prefetches. That's what's happening here: any prefetches being done by the CPU aren't fetching the right cache line. We know what address we want, so we issue a software prefetch instruction to help it out. The movs later on will find the cache populated (or more realistically, somewhat closer to being populated). Most of the time hardware prefetches are just the most lovely thing in the world. Any time you do a looped *ptr++, you're totally cool. The hardware can detect that and prefetch cache lines as you're walking along way better than we can hint. In this case, I'm actually a little surprised that the hardware isn't able to cotton onto this +=srcStride. Seems straightforward.
LGTM. I'm surprised that the hardware prefetchers are so bad, but data wins. Thanks for the extra investigation!
CQ is trying da patch. Follow status at https://skia-tree-status.appspot.com/cq/mtklein@google.com/57823003/20003
Message was sent while issue was closed.
Change committed as 12129
Message was sent while issue was closed.
fyi https://codereview.chromium.org/57823003/diff/20003/include/core/SkPostConfig.h File include/core/SkPostConfig.h (right): https://codereview.chromium.org/57823003/diff/20003/include/core/SkPostConfig... include/core/SkPostConfig.h:384: #if defined(__clang__) || defined(__GNUC__) clang defines __GNUC__ too, no need to check for it separately |