Issue 105893003: NEON fast path for box blur

Issue 105893003: NEON fast path for box blur (Closed)

Created:
7 years ago by zheng.xu

Modified:
6 years, 11 months ago

Reviewers:
Stephen White, reed1, mtklein, kevin.petit, kevin.petit.not.used.account

CC:
skia-review_googlegroups.com

Base URL:
https://skia.googlecode.com/svn/trunk

Visibility:
Public.

More Reviews

Description

NEON fast path for box blur Calculate 8 channels in parallel by using 16-bits to store each channel. Due to the limitation of VQRDMULH, (int16 * int16 * 2 + 0x8000) >> 16, the fast path can only support kernelSize < 128. 8 significant bits are kept at least in each stage, the final error should less-equal than 1. Pre-fetching memory for X-direction read. In fact pre-fetching memory doesn't help much for Y direction read, since it is a waste to load a cache line for only read 8 bytes.(I left it there to keep the symmetry. pre-fetch is cheap :) ) bench data on Nexus 10 before: running bench [640 480] blur_image_filter_large_10.00_10.00 8888: cmsecs = 25081.48 running bench [640 480] blur_image_filter_small_10.00_10.00 8888: cmsecs = 25038.04 running bench [640 480] blur_image_filter_large_1.00_1.00 8888: cmsecs = 25209.04 running bench [640 480] blur_image_filter_small_1.00_1.00 8888: cmsecs = 24928.01 running bench [640 480] blur_image_filter_large_0.00_1.00 8888: cmsecs = 17160.98 running bench [640 480] blur_image_filter_large_0.00_10.00 8888: cmsecs = 17924.11 running bench [640 480] blur_image_filter_large_1.00_0.00 8888: cmsecs = 14609.19 running bench [640 480] blur_image_filter_large_10.00_0.00 8888: cmsecs = 14625.91 after: running bench [640 480] blur_image_filter_large_10.00_10.00 8888: cmsecs = 14848.42 running bench [640 480] blur_image_filter_small_10.00_10.00 8888: cmsecs = 16037.29 running bench [640 480] blur_image_filter_large_1.00_1.00 8888: cmsecs = 14819.55 running bench [640 480] blur_image_filter_small_1.00_1.00 8888: cmsecs = 14563.69 running bench [640 480] blur_image_filter_large_0.00_1.00 8888: cmsecs = 11905.34 running bench [640 480] blur_image_filter_large_0.00_10.00 8888: cmsecs = 11883.85 running bench [640 480] blur_image_filter_large_1.00_0.00 8888: cmsecs = 9576.51 running bench [640 480] blur_image_filter_large_10.00_0.00 8888: cmsecs = 9793.84 BUG= Committed: http://code.google.com/p/skia/source/detail?r=13036

Patch Set 1 #

Total comments: 20

Patch Set 2 : fix coding style issues #

Patch Set 3 : add gm and bench cases with large sigma to cover the box blur path with kernelSize >= 128 #

Patch Set 4 : pre-fetch data for X direction in the code path kernelSize > 128 #

Patch Set 5 : #

Total comments: 8

Patch Set 6 : fix a typo #

Total comments: 2

Patch Set 7 : remove useless prefetch. #

Total comments: 6

Patch Set 8 : #

Patch Set 9 : #

Total comments: 2

Created: 7 years ago

Download [raw] [tar.bz2]

	Unified diffs	Side-by-side diffs	Delta from patch set	Stats (+116 lines, -8 lines)			Patch
M	bench/BlurImageFilterBench.cpp	View	1 2 3 4 5 6 7	2 chunks	+3 lines, -0 lines	0 comments	Download
M	expectations/gm/ignored-tests.txt	View	1 2 3 4 5 6 7	1 chunk	+9 lines, -0 lines	2 comments	Download
M	gm/imageblur.cpp	View	1 2 3 4 5 6 7	3 chunks	+14 lines, -5 lines	0 comments	Download
M	src/opts/SkBlurImage_opts_neon.cpp	View	1 2 3 4 5 6 7 8	3 chunks	+90 lines, -3 lines	0 comments	Download

Messages

Total messages: 26 (0 generated)

Expand Messages | Collapse Messages

Stephen White

Thanks for the patch! Does this provide results which match the existing path pixel-for-pixel? So ...

7 years ago (2013-12-12 05:30:30 UTC) #2

zheng.xu

On 2013/12/12 05:30:30, Stephen White wrote: > Thanks for the patch! > > Does this ...

7 years ago (2013-12-12 05:58:45 UTC) #3

On 2013/12/12 05:30:30, Stephen White wrote:
> Thanks for the patch!
> 
> Does this provide results which match the existing path pixel-for-pixel? So
far,
> all the image filter optimizations have had that property, and I'd like to
keep
> it if possible.
> 
>
https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne...
> File src/opts/SkBlurImage_opts_neon.cpp (right):
> 
>
https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne...
> src/opts/SkBlurImage_opts_neon.cpp:27: inline uint8x8_t load2pixels(const
> SkPMColor* src, int srcStride) {
> should be separated by underscores; see
>
https://sites.google.com/site/skiadocs/developer-documentation/contributing-c...,
> so load_2_pixels()
> 
>
https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne...
> src/opts/SkBlurImage_opts_neon.cpp:30: SK_PREFETCH(src + 16);
> Are you sure this offset is correct? src is SkPMColor*, so unless I've done my
> pointer math wrong, I think this would be prefetching 64 bytes ahead. Is that
> what you had in mind?
> 
> At any rate, I think we should check that the prefetches are actually doing
> anything in either X or Y, and skip them if they aren't. There are X-only and
> Y-only benches which should show the effect.
> 
>
https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne...
> src/opts/SkBlurImage_opts_neon.cpp:43: inline void store2pixels(uint16x8_t
> result16x8, SkPMColor* dst, int dstStride) {
> store_2_pixels()
> 
>
https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne...
> src/opts/SkBlurImage_opts_neon.cpp:58: void SkFastBoxBlur_NEON(const
SkPMColor*
> src, int srcStride, SkPMColor* dst, int kernelSize,
> I'd prefer to avoid names like "fast" (since we often come up with faster
> algorithms later). Let's name it by what it does, perhaps
> SkDoubleRowBoxBlur_NEON.
> 
>
https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne...
> src/opts/SkBlurImage_opts_neon.cpp:69: height &= ~1;
> Let's just pass in pointers to src, dst and height. Then the loop can become:
> 
> for (; *height > 0; *height -= 2)
> ...
> 
> and we can omit the &= -1 here and below.
> 
>
https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne...
> src/opts/SkBlurImage_opts_neon.cpp:85: val16x8 =
> vreinterpretq_u16_s16(vqrdmulhq_s16(
> Please declare this variable here where it's used, not up top. And since
you'll
> see the type in context that way, I wouldn't bother naming it with the type.
> E.g.,
> 
> uint16x8_t result = ...;
> store2pixels<dstDirection>(result, ...);
> 
>
https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne...
> src/opts/SkBlurImage_opts_neon.cpp:90: val8x8 = load2pixels<srcDirection>(sptr
-
> leftOffset * srcStrideX, srcStride);
> Same here.
> 
>
https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne...
> src/opts/SkBlurImage_opts_neon.cpp:94: val8x8 = load2pixels<srcDirection>(sptr
+
> (rightOffset + 1) * srcStrideX, srcStride);
> Same here.
> 
>
https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne...
> src/opts/SkBlurImage_opts_neon.cpp:131: if (kernelSize < 128)
> You should add a GM and a bench for the old case, so that it still gets
> exercised (kernel size exactly 128, say).
> 
>
https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne...
> src/opts/SkBlurImage_opts_neon.cpp:134: leftOffset, rightOffset, width,
height);
> See above: if we pass in &src &dst and &height here, we don't need to repeat
the
> computations below.

It doesn't provide results match the existing path pixel-for-pixel. Since in
Skia, we use 3-pass box blur to approximate Gausscian Blur to reduce the time
complexity, so I think it should be acceptable.
But I don't know whether it will break the auto test or not.

zheng.xu

https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_neon.cpp File src/opts/SkBlurImage_opts_neon.cpp (right): https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_neon.cpp#newcode27 src/opts/SkBlurImage_opts_neon.cpp:27: inline uint8x8_t load2pixels(const SkPMColor* src, int srcStride) { On ...

7 years ago (2013-12-12 08:15:37 UTC) #4

https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne...
File src/opts/SkBlurImage_opts_neon.cpp (right):

https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne...
src/opts/SkBlurImage_opts_neon.cpp:27: inline uint8x8_t load2pixels(const
SkPMColor* src, int srcStride) {
On 2013/12/12 05:30:30, Stephen White wrote:
> should be separated by underscores; see
>
https://sites.google.com/site/skiadocs/developer-documentation/contributing-c...,
> so load_2_pixels()

Done.

https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne...
src/opts/SkBlurImage_opts_neon.cpp:30: SK_PREFETCH(src + 16);
On 2013/12/12 05:30:30, Stephen White wrote:
> Are you sure this offset is correct? src is SkPMColor*, so unless I've done my
> pointer math wrong, I think this would be prefetching 64 bytes ahead. Is that
> what you had in mind?
> 
> At any rate, I think we should check that the prefetches are actually doing
> anything in either X or Y, and skip them if they aren't. There are X-only and
> Y-only benches which should show the effect.

Yes, I actually mean 64 bytes ahead, that's 8 loops. Because it need time to let
the cache system to pre-fetch data in parallel.

I've done some tests locally. 10% faster by adding 2 prefetchs in (srcDirection
== kX). adding prefetch in the else branch doesn't affect the bench data

I left the pre-fetch in (srcDirection == kY), only for keep the symmetry. Do you
mean we'd better remove them?

https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne...
src/opts/SkBlurImage_opts_neon.cpp:43: inline void store2pixels(uint16x8_t
result16x8, SkPMColor* dst, int dstStride) {
On 2013/12/12 05:30:30, Stephen White wrote:
> store_2_pixels()

Done.

https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne...
src/opts/SkBlurImage_opts_neon.cpp:58: void SkFastBoxBlur_NEON(const SkPMColor*
src, int srcStride, SkPMColor* dst, int kernelSize,
On 2013/12/12 05:30:30, Stephen White wrote:
> I'd prefer to avoid names like "fast" (since we often come up with faster
> algorithms later). Let's name it by what it does, perhaps
> SkDoubleRowBoxBlur_NEON.

Done.

https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne...
src/opts/SkBlurImage_opts_neon.cpp:69: height &= ~1;
On 2013/12/12 05:30:30, Stephen White wrote:
> Let's just pass in pointers to src, dst and height. Then the loop can become:
> 
> for (; *height > 0; *height -= 2)
> ...
> 
> and we can omit the &= -1 here and below.

Done.

https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne...
src/opts/SkBlurImage_opts_neon.cpp:85: val16x8 =
vreinterpretq_u16_s16(vqrdmulhq_s16(
On 2013/12/12 05:30:30, Stephen White wrote:
> Please declare this variable here where it's used, not up top. And since
you'll
> see the type in context that way, I wouldn't bother naming it with the type.
> E.g.,
> 
> uint16x8_t result = ...;
> store2pixels<dstDirection>(result, ...);

Done.

https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne...
src/opts/SkBlurImage_opts_neon.cpp:90: val8x8 = load2pixels<srcDirection>(sptr -
leftOffset * srcStrideX, srcStride);
On 2013/12/12 05:30:30, Stephen White wrote:
> Same here.

Done.

https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne...
src/opts/SkBlurImage_opts_neon.cpp:94: val8x8 = load2pixels<srcDirection>(sptr +
(rightOffset + 1) * srcStrideX, srcStride);
On 2013/12/12 05:30:30, Stephen White wrote:
> Same here.

Done.

https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne...
src/opts/SkBlurImage_opts_neon.cpp:131: if (kernelSize < 128)
On 2013/12/12 05:30:30, Stephen White wrote:
> You should add a GM and a bench for the old case, so that it still gets
> exercised (kernel size exactly 128, say).

I think there is no API can set the box kernel size. box kernel size is nearly 2
times sigma. We can make a blur case with sigma 80.

https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne...
src/opts/SkBlurImage_opts_neon.cpp:134: leftOffset, rightOffset, width, height);
On 2013/12/12 05:30:30, Stephen White wrote:
> See above: if we pass in &src &dst and &height here, we don't need to repeat
the
> computations below.

Done.

zheng.xu

It doesn't provide results match the existing path pixel-for-pixel. Since in Skia, we use 3-pass ...

7 years ago (2013-12-12 08:29:25 UTC) #5

zheng.xu

bench data for patch set 4: before: running bench [640 480] blur_image_filter_large_80.00_80.00 8888: cmsecs = ...

7 years ago (2013-12-12 08:32:07 UTC) #6

kevin.petit.not.used.account

I've tested the patch: 5 mismatching GMs on my Chromebook, errors never exceed 2. I ...

7 years ago (2013-12-12 15:29:31 UTC) #7

zheng.xu

>> pre-fetch is cheap :) > Be careful with that assumption, especially if testing only ...

7 years ago (2013-12-13 04:31:58 UTC) #8

>> pre-fetch is cheap :)

> Be careful with that assumption, especially if testing only on one device.  
> It has bitten me several times in the past ;-). You should really avoid
prefetching unless you can prove an interesting gain.

I've test it on A7,A9,A15. prefetch in X-direction shows about 10% performance
gain.
Prefetch in Y-direction has no impact on performance. Here, it just prefetch
data for next run, so it should have no bad impact.

I'm also struggling on whether we should remove the meanless prefetch in
Y-direction. But it is already there in previous implementation.

https://codereview.chromium.org/105893003/diff/80001/src/opts/SkBlurImage_opt...
File src/opts/SkBlurImage_opts_neon.cpp (right):

https://codereview.chromium.org/105893003/diff/80001/src/opts/SkBlurImage_opt...
src/opts/SkBlurImage_opts_neon.cpp:83: uint16x8_t resultPixles =
vreinterpretq_u16_s16(vqrdmulhq_s16(
On 2013/12/12 15:29:32, kevin.petit.arm wrote:
> typo: "Pixles" => "Pixels". Otherwise, nice use of vqrdmulh :-).

Done.

https://codereview.chromium.org/105893003/diff/80001/src/opts/SkBlurImage_opt...
src/opts/SkBlurImage_opts_neon.cpp:88: sum = vsubw_u8(sum,
On 2013/12/12 15:29:32, kevin.petit.arm wrote:
> stray space character at the end of the line

Done.

https://codereview.chromium.org/105893003/diff/80001/src/opts/SkBlurImage_opt...
src/opts/SkBlurImage_opts_neon.cpp:92: sum = vaddw_u8(sum,
On 2013/12/12 15:29:32, kevin.petit.arm wrote:
> stray space character at the end of the line

Done.

https://codereview.chromium.org/105893003/diff/80001/src/opts/SkBlurImage_opt...
src/opts/SkBlurImage_opts_neon.cpp:134: for (; height > 0; height--) {
On 2013/12/12 15:29:32, kevin.petit.arm wrote:
> That means the last row in the case of an odd count will be processed in a
> different way with different precision/rounding. Is this a problem?

I think it is not. Because this algorithm is already an approximation for
Gausscian Blur, the small differences should be acceptable.

kevin.petit.not.used.account

On 2013/12/13 04:31:58, zheng.xu wrote: > >> pre-fetch is cheap :) > > > Be ...

7 years ago (2013-12-13 15:58:47 UTC) #9

Stephen White

On 2013/12/19 12:01:23, zheng.xu wrote: > @Stephen White, any other comments? Since this will require ...

7 years ago (2013-12-19 15:17:20 UTC) #11

Stephen White

https://codereview.chromium.org/105893003/diff/90001/src/opts/SkBlurImage_opts_neon.cpp File src/opts/SkBlurImage_opts_neon.cpp (right): https://codereview.chromium.org/105893003/diff/90001/src/opts/SkBlurImage_opts_neon.cpp#newcode37 src/opts/SkBlurImage_opts_neon.cpp:37: return vld1_u8((uint8_t*)src); Looks like you removed the actual fetch ...

7 years ago (2013-12-19 15:27:03 UTC) #12

zheng.xu

7 years ago (2013-12-20 04:29:54 UTC) #13

Stephen White

On 2013/12/12 08:29:25, zheng.xu wrote: > It doesn't provide results match the existing path pixel-for-pixel. ...

6 years, 11 months ago (2014-01-09 15:05:52 UTC) #15

zheng.xu

On 2014/01/09 15:05:52, Stephen White wrote: > On 2013/12/12 08:29:25, zheng.xu wrote: > > It ...

6 years, 11 months ago (2014-01-10 11:24:42 UTC) #16

On 2014/01/09 15:05:52, Stephen White wrote:
> On 2013/12/12 08:29:25, zheng.xu wrote:
> > It doesn't provide results match the existing path pixel-for-pixel. Since in
> > Skia, we use 3-pass box blur to approximate Gausscian Blur to reduce the
time
> > complexity, so I think it should be acceptable.
> > 
> > But I don't know whether it will break the auto test or not.
> > 
> > BTW, I also checked the GPU path. It uses another algorithm to approximate
> > Gausscian Blur. The output also can not match the CPU path pixel-by-pixel. I
> > guess this might not be an issue.
> 
> In Blink, we keep separate CPU and GPU results for layout tests, so that
> difference is manageable. (The GPU is different out of necessity, since the
> sliding window box blur is impossible to implement on GPU). However, we only
> keep one set of CPU results. If the hardware config changes (e.g., non-NEON ->
> NEON), it's nice to know that we don't have to rebaseline the results. So
until
> now I've made sure that all CPU paths are an exact match, whether NEON/SSE2 or
> x86/ARM.  OTOH, we don't run a full slate of layout tests on ARM platforms in
> Blink anyway, so it's not a huge deal at this point. It's just a precedent I'd
> rather not set if possible.
> 
> Out of curiosity, would it be possible to keep the dual-row parallelism and
> still give results which are an exact match (perhaps by using a 128-bit
> register?) I'm guessing the answer is no, but I thought I'd ask.

No, it is impossible.
This patch was made with the assumption that small errors on the image output is
acceptable. By using 16-bit values, we can store 8 channels in a 128-bit
register. Accordingly, scale(1/kernelSize) will also store in 16-bit. Even we
set limitation on the max kernelSize, it still cannot be the exactly the same
with 32-bit values.

It will break the auto test if we need to keep one set of CPU results.

Thanks for your introduction on the test rules that I didn't know before.

reed1

Skia testing *does* allow for small pixel differences, based on CPU optimizations (e.g. SSE or ...

6 years, 11 months ago (2014-01-10 14:54:56 UTC) #17

Stephen White

Actually, I think we should take this patch. I was just wondering if there was ...

6 years, 11 months ago (2014-01-10 15:13:38 UTC) #18

Stephen White

LGTM https://codereview.chromium.org/105893003/diff/150001/expectations/gm/ignored-tests.txt File expectations/gm/ignored-tests.txt (right): https://codereview.chromium.org/105893003/diff/150001/expectations/gm/ignored-tests.txt#newcode65 expectations/gm/ignored-tests.txt:65: # blur related GM results need rebaslining Nit: ...

6 years, 11 months ago (2014-01-10 15:14:22 UTC) #19

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://skia-tree-status.appspot.com/cq/zheng.xu@arm.com/105893003/150001

6 years, 11 months ago (2014-01-13 08:00:56 UTC) #20

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://skia-tree-status.appspot.com/cq/zheng.xu@arm.com/105893003/150001

6 years, 11 months ago (2014-01-13 08:11:18 UTC) #21

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://skia-tree-status.appspot.com/cq/zheng.xu@arm.com/105893003/150001

6 years, 11 months ago (2014-01-13 08:16:36 UTC) #22

zheng.xu

https://codereview.chromium.org/105893003/diff/150001/expectations/gm/ignored-tests.txt File expectations/gm/ignored-tests.txt (right): https://codereview.chromium.org/105893003/diff/150001/expectations/gm/ignored-tests.txt#newcode65 expectations/gm/ignored-tests.txt:65: # blur related GM results need rebaslining On 2014/01/10 ...

6 years, 11 months ago (2014-01-13 11:27:18 UTC) #24

zheng.xu

2 verylargebitmap gm cases failed recently. But it should have no relationship with this patch. ...

6 years, 11 months ago (2014-01-13 12:18:42 UTC) #25

Stephen White

6 years, 11 months ago (2014-01-13 17:59:15 UTC) #26

Message was sent while issue was closed.

On 2014/01/13 12:18:42, zheng.xu wrote:
> 2 verylargebitmap gm cases failed recently. But it should have no relationship
> with this patch.
> compared to succeeded/failure-ignored verylargebitmap cases on other builders
or
> backends, it looks like the expected image is wrong or the tests should be
> ignored.
> 
> But I have no idea, why these 2 failure occur after this patch was committed.

I believe those tests are flaky, especially on the Win8 bots:
https://code.google.com/p/skia/issues/detail?id=1978

Expand Messages | Collapse Messages