| 
 | 
 | 
 Chromium Code Reviews
 Chromium Code Reviews Issue 
            105893003:
    NEON fast path for box blur  (Closed)
    
  
    Issue 
            105893003:
    NEON fast path for box blur  (Closed) 
  | Created: 7 years ago by zheng.xu Modified: 6 years, 11 months ago CC: skia-review_googlegroups.com Base URL: https://skia.googlecode.com/svn/trunk Visibility: Public. | DescriptionNEON fast path for box blur
Calculate 8 channels in parallel by using 16-bits to store each channel. Due to the limitation of VQRDMULH, (int16 * int16 * 2 + 0x8000) >> 16, the fast path can only support kernelSize < 128.
8 significant bits are kept at least in each stage, the final error should less-equal than 1.
Pre-fetching memory for X-direction read. In fact pre-fetching memory doesn't help much for Y direction read, since it is a waste to load a cache line for only read 8 bytes.(I left it there to keep the symmetry. pre-fetch is cheap :) )
bench data on Nexus 10
before:
running bench [640 480]      blur_image_filter_large_10.00_10.00   8888:  cmsecs =  25081.48
running bench [640 480]      blur_image_filter_small_10.00_10.00   8888:  cmsecs =  25038.04
running bench [640 480]        blur_image_filter_large_1.00_1.00   8888:  cmsecs =  25209.04
running bench [640 480]        blur_image_filter_small_1.00_1.00   8888:  cmsecs =  24928.01
running bench [640 480]        blur_image_filter_large_0.00_1.00   8888:  cmsecs =  17160.98
running bench [640 480]       blur_image_filter_large_0.00_10.00   8888:  cmsecs =  17924.11
running bench [640 480]        blur_image_filter_large_1.00_0.00   8888:  cmsecs =  14609.19
running bench [640 480]       blur_image_filter_large_10.00_0.00   8888:  cmsecs =  14625.91
after:
running bench [640 480]      blur_image_filter_large_10.00_10.00   8888:  cmsecs =  14848.42
running bench [640 480]      blur_image_filter_small_10.00_10.00   8888:  cmsecs =  16037.29
running bench [640 480]        blur_image_filter_large_1.00_1.00   8888:  cmsecs =  14819.55
running bench [640 480]        blur_image_filter_small_1.00_1.00   8888:  cmsecs =  14563.69
running bench [640 480]        blur_image_filter_large_0.00_1.00   8888:  cmsecs =  11905.34
running bench [640 480]       blur_image_filter_large_0.00_10.00   8888:  cmsecs =  11883.85
running bench [640 480]        blur_image_filter_large_1.00_0.00   8888:  cmsecs =   9576.51
running bench [640 480]       blur_image_filter_large_10.00_0.00   8888:  cmsecs =   9793.84
BUG=
Committed: http://code.google.com/p/skia/source/detail?r=13036
   Patch Set 1 #
      Total comments: 20
      
     Patch Set 2 : fix coding style issues #Patch Set 3 : add gm and bench cases with large sigma to cover the box blur path with kernelSize >= 128 #Patch Set 4 : pre-fetch data for X direction in the code path kernelSize > 128 #Patch Set 5 : #
      Total comments: 8
      
     Patch Set 6 : fix a typo #
      Total comments: 2
      
     Patch Set 7 : remove useless prefetch. #
      Total comments: 6
      
     Patch Set 8 : #Patch Set 9 : #
      Total comments: 2
      
     
 Messages
    Total messages: 26 (0 generated)
     
 
 Thanks for the patch! Does this provide results which match the existing path pixel-for-pixel? So far, all the image filter optimizations have had that property, and I'd like to keep it if possible. https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne... File src/opts/SkBlurImage_opts_neon.cpp (right): https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne... src/opts/SkBlurImage_opts_neon.cpp:27: inline uint8x8_t load2pixels(const SkPMColor* src, int srcStride) { should be separated by underscores; see https://sites.google.com/site/skiadocs/developer-documentation/contributing-c..., so load_2_pixels() https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne... src/opts/SkBlurImage_opts_neon.cpp:30: SK_PREFETCH(src + 16); Are you sure this offset is correct? src is SkPMColor*, so unless I've done my pointer math wrong, I think this would be prefetching 64 bytes ahead. Is that what you had in mind? At any rate, I think we should check that the prefetches are actually doing anything in either X or Y, and skip them if they aren't. There are X-only and Y-only benches which should show the effect. https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne... src/opts/SkBlurImage_opts_neon.cpp:43: inline void store2pixels(uint16x8_t result16x8, SkPMColor* dst, int dstStride) { store_2_pixels() https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne... src/opts/SkBlurImage_opts_neon.cpp:58: void SkFastBoxBlur_NEON(const SkPMColor* src, int srcStride, SkPMColor* dst, int kernelSize, I'd prefer to avoid names like "fast" (since we often come up with faster algorithms later). Let's name it by what it does, perhaps SkDoubleRowBoxBlur_NEON. https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne... src/opts/SkBlurImage_opts_neon.cpp:69: height &= ~1; Let's just pass in pointers to src, dst and height. Then the loop can become: for (; *height > 0; *height -= 2) ... and we can omit the &= -1 here and below. https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne... src/opts/SkBlurImage_opts_neon.cpp:85: val16x8 = vreinterpretq_u16_s16(vqrdmulhq_s16( Please declare this variable here where it's used, not up top. And since you'll see the type in context that way, I wouldn't bother naming it with the type. E.g., uint16x8_t result = ...; store2pixels<dstDirection>(result, ...); https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne... src/opts/SkBlurImage_opts_neon.cpp:90: val8x8 = load2pixels<srcDirection>(sptr - leftOffset * srcStrideX, srcStride); Same here. https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne... src/opts/SkBlurImage_opts_neon.cpp:94: val8x8 = load2pixels<srcDirection>(sptr + (rightOffset + 1) * srcStrideX, srcStride); Same here. https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne... src/opts/SkBlurImage_opts_neon.cpp:131: if (kernelSize < 128) You should add a GM and a bench for the old case, so that it still gets exercised (kernel size exactly 128, say). https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne... src/opts/SkBlurImage_opts_neon.cpp:134: leftOffset, rightOffset, width, height); See above: if we pass in &src &dst and &height here, we don't need to repeat the computations below. 
 On 2013/12/12 05:30:30, Stephen White wrote: > Thanks for the patch! > > Does this provide results which match the existing path pixel-for-pixel? So far, > all the image filter optimizations have had that property, and I'd like to keep > it if possible. > > https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne... > File src/opts/SkBlurImage_opts_neon.cpp (right): > > https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne... > src/opts/SkBlurImage_opts_neon.cpp:27: inline uint8x8_t load2pixels(const > SkPMColor* src, int srcStride) { > should be separated by underscores; see > https://sites.google.com/site/skiadocs/developer-documentation/contributing-c..., > so load_2_pixels() > > https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne... > src/opts/SkBlurImage_opts_neon.cpp:30: SK_PREFETCH(src + 16); > Are you sure this offset is correct? src is SkPMColor*, so unless I've done my > pointer math wrong, I think this would be prefetching 64 bytes ahead. Is that > what you had in mind? > > At any rate, I think we should check that the prefetches are actually doing > anything in either X or Y, and skip them if they aren't. There are X-only and > Y-only benches which should show the effect. > > https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne... > src/opts/SkBlurImage_opts_neon.cpp:43: inline void store2pixels(uint16x8_t > result16x8, SkPMColor* dst, int dstStride) { > store_2_pixels() > > https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne... > src/opts/SkBlurImage_opts_neon.cpp:58: void SkFastBoxBlur_NEON(const SkPMColor* > src, int srcStride, SkPMColor* dst, int kernelSize, > I'd prefer to avoid names like "fast" (since we often come up with faster > algorithms later). Let's name it by what it does, perhaps > SkDoubleRowBoxBlur_NEON. > > https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne... > src/opts/SkBlurImage_opts_neon.cpp:69: height &= ~1; > Let's just pass in pointers to src, dst and height. Then the loop can become: > > for (; *height > 0; *height -= 2) > ... > > and we can omit the &= -1 here and below. > > https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne... > src/opts/SkBlurImage_opts_neon.cpp:85: val16x8 = > vreinterpretq_u16_s16(vqrdmulhq_s16( > Please declare this variable here where it's used, not up top. And since you'll > see the type in context that way, I wouldn't bother naming it with the type. > E.g., > > uint16x8_t result = ...; > store2pixels<dstDirection>(result, ...); > > https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne... > src/opts/SkBlurImage_opts_neon.cpp:90: val8x8 = load2pixels<srcDirection>(sptr - > leftOffset * srcStrideX, srcStride); > Same here. > > https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne... > src/opts/SkBlurImage_opts_neon.cpp:94: val8x8 = load2pixels<srcDirection>(sptr + > (rightOffset + 1) * srcStrideX, srcStride); > Same here. > > https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne... > src/opts/SkBlurImage_opts_neon.cpp:131: if (kernelSize < 128) > You should add a GM and a bench for the old case, so that it still gets > exercised (kernel size exactly 128, say). > > https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne... > src/opts/SkBlurImage_opts_neon.cpp:134: leftOffset, rightOffset, width, height); > See above: if we pass in &src &dst and &height here, we don't need to repeat the > computations below. It doesn't provide results match the existing path pixel-for-pixel. Since in Skia, we use 3-pass box blur to approximate Gausscian Blur to reduce the time complexity, so I think it should be acceptable. But I don't know whether it will break the auto test or not. 
 https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne... File src/opts/SkBlurImage_opts_neon.cpp (right): https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne... src/opts/SkBlurImage_opts_neon.cpp:27: inline uint8x8_t load2pixels(const SkPMColor* src, int srcStride) { On 2013/12/12 05:30:30, Stephen White wrote: > should be separated by underscores; see > https://sites.google.com/site/skiadocs/developer-documentation/contributing-c..., > so load_2_pixels() Done. https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne... src/opts/SkBlurImage_opts_neon.cpp:30: SK_PREFETCH(src + 16); On 2013/12/12 05:30:30, Stephen White wrote: > Are you sure this offset is correct? src is SkPMColor*, so unless I've done my > pointer math wrong, I think this would be prefetching 64 bytes ahead. Is that > what you had in mind? > > At any rate, I think we should check that the prefetches are actually doing > anything in either X or Y, and skip them if they aren't. There are X-only and > Y-only benches which should show the effect. Yes, I actually mean 64 bytes ahead, that's 8 loops. Because it need time to let the cache system to pre-fetch data in parallel. I've done some tests locally. 10% faster by adding 2 prefetchs in (srcDirection == kX). adding prefetch in the else branch doesn't affect the bench data I left the pre-fetch in (srcDirection == kY), only for keep the symmetry. Do you mean we'd better remove them? https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne... src/opts/SkBlurImage_opts_neon.cpp:43: inline void store2pixels(uint16x8_t result16x8, SkPMColor* dst, int dstStride) { On 2013/12/12 05:30:30, Stephen White wrote: > store_2_pixels() Done. https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne... src/opts/SkBlurImage_opts_neon.cpp:58: void SkFastBoxBlur_NEON(const SkPMColor* src, int srcStride, SkPMColor* dst, int kernelSize, On 2013/12/12 05:30:30, Stephen White wrote: > I'd prefer to avoid names like "fast" (since we often come up with faster > algorithms later). Let's name it by what it does, perhaps > SkDoubleRowBoxBlur_NEON. Done. https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne... src/opts/SkBlurImage_opts_neon.cpp:69: height &= ~1; On 2013/12/12 05:30:30, Stephen White wrote: > Let's just pass in pointers to src, dst and height. Then the loop can become: > > for (; *height > 0; *height -= 2) > ... > > and we can omit the &= -1 here and below. Done. https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne... src/opts/SkBlurImage_opts_neon.cpp:85: val16x8 = vreinterpretq_u16_s16(vqrdmulhq_s16( On 2013/12/12 05:30:30, Stephen White wrote: > Please declare this variable here where it's used, not up top. And since you'll > see the type in context that way, I wouldn't bother naming it with the type. > E.g., > > uint16x8_t result = ...; > store2pixels<dstDirection>(result, ...); Done. https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne... src/opts/SkBlurImage_opts_neon.cpp:90: val8x8 = load2pixels<srcDirection>(sptr - leftOffset * srcStrideX, srcStride); On 2013/12/12 05:30:30, Stephen White wrote: > Same here. Done. https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne... src/opts/SkBlurImage_opts_neon.cpp:94: val8x8 = load2pixels<srcDirection>(sptr + (rightOffset + 1) * srcStrideX, srcStride); On 2013/12/12 05:30:30, Stephen White wrote: > Same here. Done. https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne... src/opts/SkBlurImage_opts_neon.cpp:131: if (kernelSize < 128) On 2013/12/12 05:30:30, Stephen White wrote: > You should add a GM and a bench for the old case, so that it still gets > exercised (kernel size exactly 128, say). I think there is no API can set the box kernel size. box kernel size is nearly 2 times sigma. We can make a blur case with sigma 80. https://codereview.chromium.org/105893003/diff/1/src/opts/SkBlurImage_opts_ne... src/opts/SkBlurImage_opts_neon.cpp:134: leftOffset, rightOffset, width, height); On 2013/12/12 05:30:30, Stephen White wrote: > See above: if we pass in &src &dst and &height here, we don't need to repeat the > computations below. Done. 
 It doesn't provide results match the existing path pixel-for-pixel. Since in Skia, we use 3-pass box blur to approximate Gausscian Blur to reduce the time complexity, so I think it should be acceptable. But I don't know whether it will break the auto test or not. BTW, I also checked the GPU path. It uses another algorithm to approximate Gausscian Blur. The output also can not match the CPU path pixel-by-pixel. I guess this might not be an issue. 
 bench data for patch set 4: before: running bench [640 480] blur_image_filter_large_80.00_80.00 8888: cmsecs = 25463.62 after: running bench [640 480] blur_image_filter_large_80.00_80.00 8888: cmsecs = 24480.74 
 I've tested the patch: 5 mismatching GMs on my Chromebook, errors never exceed 2. I get similar (very nice) speed figures. > pre-fetch is cheap :) Be careful with that assumption, especially if testing only on one device. It has bitten me several times in the past ;-). You should really avoid prefetching unless you can prove an interesting gain. https://codereview.chromium.org/105893003/diff/80001/src/opts/SkBlurImage_opt... File src/opts/SkBlurImage_opts_neon.cpp (right): https://codereview.chromium.org/105893003/diff/80001/src/opts/SkBlurImage_opt... src/opts/SkBlurImage_opts_neon.cpp:83: uint16x8_t resultPixles = vreinterpretq_u16_s16(vqrdmulhq_s16( typo: "Pixles" => "Pixels". Otherwise, nice use of vqrdmulh :-). https://codereview.chromium.org/105893003/diff/80001/src/opts/SkBlurImage_opt... src/opts/SkBlurImage_opts_neon.cpp:88: sum = vsubw_u8(sum, stray space character at the end of the line https://codereview.chromium.org/105893003/diff/80001/src/opts/SkBlurImage_opt... src/opts/SkBlurImage_opts_neon.cpp:92: sum = vaddw_u8(sum, stray space character at the end of the line https://codereview.chromium.org/105893003/diff/80001/src/opts/SkBlurImage_opt... src/opts/SkBlurImage_opts_neon.cpp:134: for (; height > 0; height--) { That means the last row in the case of an odd count will be processed in a different way with different precision/rounding. Is this a problem? 
 >> pre-fetch is cheap :) > Be careful with that assumption, especially if testing only on one device. > It has bitten me several times in the past ;-). You should really avoid prefetching unless you can prove an interesting gain. I've test it on A7,A9,A15. prefetch in X-direction shows about 10% performance gain. Prefetch in Y-direction has no impact on performance. Here, it just prefetch data for next run, so it should have no bad impact. I'm also struggling on whether we should remove the meanless prefetch in Y-direction. But it is already there in previous implementation. https://codereview.chromium.org/105893003/diff/80001/src/opts/SkBlurImage_opt... File src/opts/SkBlurImage_opts_neon.cpp (right): https://codereview.chromium.org/105893003/diff/80001/src/opts/SkBlurImage_opt... src/opts/SkBlurImage_opts_neon.cpp:83: uint16x8_t resultPixles = vreinterpretq_u16_s16(vqrdmulhq_s16( On 2013/12/12 15:29:32, kevin.petit.arm wrote: > typo: "Pixles" => "Pixels". Otherwise, nice use of vqrdmulh :-). Done. https://codereview.chromium.org/105893003/diff/80001/src/opts/SkBlurImage_opt... src/opts/SkBlurImage_opts_neon.cpp:88: sum = vsubw_u8(sum, On 2013/12/12 15:29:32, kevin.petit.arm wrote: > stray space character at the end of the line Done. https://codereview.chromium.org/105893003/diff/80001/src/opts/SkBlurImage_opt... src/opts/SkBlurImage_opts_neon.cpp:92: sum = vaddw_u8(sum, On 2013/12/12 15:29:32, kevin.petit.arm wrote: > stray space character at the end of the line Done. https://codereview.chromium.org/105893003/diff/80001/src/opts/SkBlurImage_opt... src/opts/SkBlurImage_opts_neon.cpp:134: for (; height > 0; height--) { On 2013/12/12 15:29:32, kevin.petit.arm wrote: > That means the last row in the case of an odd count will be processed in a > different way with different precision/rounding. Is this a problem? I think it is not. Because this algorithm is already an approximation for Gausscian Blur, the small differences should be acceptable. 
 On 2013/12/13 04:31:58, zheng.xu wrote: > >> pre-fetch is cheap :) > > > Be careful with that assumption, especially if testing only on one device. > > It has bitten me several times in the past ;-). You should really avoid > prefetching unless you can prove an interesting gain. > > I've test it on A7,A9,A15. Great! You should be reasonably safe :-). > Prefetch in Y-direction has no impact on performance. Here, it just prefetch > data for next run, so it should have no bad impact. That's not completely true. For something to go in the cache, something else has to be evicted. If prefetching doesn't bring a measurable boost, then it's useless at best and you should remove it. It's always good to remove code, isn't it? > I'm also struggling on whether we should remove the meanless prefetch in > Y-direction. But it is already there in previous implementation. To me it's a no-brainer. No measurable boost: it goes away. > On 2013/12/12 15:29:32, kevin.petit.arm wrote: > > That means the last row in the case of an odd count will be processed in a > > different way with different precision/rounding. Is this a problem? > > I think it is not. Because this algorithm is already an approximation for > Gausscian Blur, the small differences should be acceptable. Ok, I'll leave the conclusion to graphics experts :-). 
 @Stephen White, any other comments? 
 On 2013/12/19 12:01:23, zheng.xu wrote: > @Stephen White, any other comments? Since this will require rebaselining tests, you should add a line to expectations/gm/ignored-tests.txt along with this patch, to prevent the bots going red. 
 https://codereview.chromium.org/105893003/diff/90001/src/opts/SkBlurImage_opt... File src/opts/SkBlurImage_opts_neon.cpp (right): https://codereview.chromium.org/105893003/diff/90001/src/opts/SkBlurImage_opt... src/opts/SkBlurImage_opts_neon.cpp:37: return vld1_u8((uint8_t*)src); Looks like you removed the actual fetch along with the prefetch. Does this pass tests? (I'm surprised it doesn't give you a compiler warning, actually.) https://codereview.chromium.org/105893003/diff/110001/bench/BlurImageFilterBe... File bench/BlurImageFilterBench.cpp (right): https://codereview.chromium.org/105893003/diff/110001/bench/BlurImageFilterBe... bench/BlurImageFilterBench.cpp:23: #define BLUR_SIGMA_SUPERLARGE 80.0f Nit: SUPERLARGE -> HUGE https://codereview.chromium.org/105893003/diff/110001/gm/imageblur.cpp File gm/imageblur.cpp (right): https://codereview.chromium.org/105893003/diff/110001/gm/imageblur.cpp#newcode27 gm/imageblur.cpp:27: name.printf("imageblur_%.2f_%.2f", This will cause the tools to consider the imageblur and imageblur_24.00_0.00 to be different tests. Could you pass the name in the GM constructor for now, and call the old one "imageblur" and the new one "imageblur_large"? That way, we'll at least have continuity of coverage over this change. https://codereview.chromium.org/105893003/diff/110001/src/opts/SkBlurImage_op... File src/opts/SkBlurImage_opts_neon.cpp (right): https://codereview.chromium.org/105893003/diff/110001/src/opts/SkBlurImage_op... src/opts/SkBlurImage_opts_neon.cpp:30: // 10% faster by add these 2 prefetches Nit: add -> adding 
 https://codereview.chromium.org/105893003/diff/90001/src/opts/SkBlurImage_opt... File src/opts/SkBlurImage_opts_neon.cpp (right): https://codereview.chromium.org/105893003/diff/90001/src/opts/SkBlurImage_opt... src/opts/SkBlurImage_opts_neon.cpp:37: return vld1_u8((uint8_t*)src); On 2013/12/19 15:27:03, Stephen White wrote: > Looks like you removed the actual fetch along with the prefetch. Does this pass > tests? (I'm surprised it doesn't give you a compiler warning, actually.) Sorry for this stupid mistake. It should result a compiler error. I didn't catch this, because something goes wrong with my local environment when I sync the source to build machine. https://codereview.chromium.org/105893003/diff/110001/bench/BlurImageFilterBe... File bench/BlurImageFilterBench.cpp (right): https://codereview.chromium.org/105893003/diff/110001/bench/BlurImageFilterBe... bench/BlurImageFilterBench.cpp:23: #define BLUR_SIGMA_SUPERLARGE 80.0f On 2013/12/19 15:27:03, Stephen White wrote: > Nit: SUPERLARGE -> HUGE Done. https://codereview.chromium.org/105893003/diff/110001/gm/imageblur.cpp File gm/imageblur.cpp (right): https://codereview.chromium.org/105893003/diff/110001/gm/imageblur.cpp#newcode27 gm/imageblur.cpp:27: name.printf("imageblur_%.2f_%.2f", On 2013/12/19 15:27:03, Stephen White wrote: > This will cause the tools to consider the imageblur and imageblur_24.00_0.00 to > be different tests. Could you pass the name in the GM constructor for now, and > call the old one "imageblur" and the new one "imageblur_large"? That way, we'll > at least have continuity of coverage over this change. Done. https://codereview.chromium.org/105893003/diff/110001/src/opts/SkBlurImage_op... File src/opts/SkBlurImage_opts_neon.cpp (right): https://codereview.chromium.org/105893003/diff/110001/src/opts/SkBlurImage_op... src/opts/SkBlurImage_opts_neon.cpp:30: // 10% faster by add these 2 prefetches On 2013/12/19 15:27:03, Stephen White wrote: > Nit: add -> adding Done. 
 Stephen White: ping :) 
 On 2013/12/12 08:29:25, zheng.xu wrote: > It doesn't provide results match the existing path pixel-for-pixel. Since in > Skia, we use 3-pass box blur to approximate Gausscian Blur to reduce the time > complexity, so I think it should be acceptable. > > But I don't know whether it will break the auto test or not. > > BTW, I also checked the GPU path. It uses another algorithm to approximate > Gausscian Blur. The output also can not match the CPU path pixel-by-pixel. I > guess this might not be an issue. In Blink, we keep separate CPU and GPU results for layout tests, so that difference is manageable. (The GPU is different out of necessity, since the sliding window box blur is impossible to implement on GPU). However, we only keep one set of CPU results. If the hardware config changes (e.g., non-NEON -> NEON), it's nice to know that we don't have to rebaseline the results. So until now I've made sure that all CPU paths are an exact match, whether NEON/SSE2 or x86/ARM. OTOH, we don't run a full slate of layout tests on ARM platforms in Blink anyway, so it's not a huge deal at this point. It's just a precedent I'd rather not set if possible. Out of curiosity, would it be possible to keep the dual-row parallelism and still give results which are an exact match (perhaps by using a 128-bit register?) I'm guessing the answer is no, but I thought I'd ask. 
 On 2014/01/09 15:05:52, Stephen White wrote: > On 2013/12/12 08:29:25, zheng.xu wrote: > > It doesn't provide results match the existing path pixel-for-pixel. Since in > > Skia, we use 3-pass box blur to approximate Gausscian Blur to reduce the time > > complexity, so I think it should be acceptable. > > > > But I don't know whether it will break the auto test or not. > > > > BTW, I also checked the GPU path. It uses another algorithm to approximate > > Gausscian Blur. The output also can not match the CPU path pixel-by-pixel. I > > guess this might not be an issue. > > In Blink, we keep separate CPU and GPU results for layout tests, so that > difference is manageable. (The GPU is different out of necessity, since the > sliding window box blur is impossible to implement on GPU). However, we only > keep one set of CPU results. If the hardware config changes (e.g., non-NEON -> > NEON), it's nice to know that we don't have to rebaseline the results. So until > now I've made sure that all CPU paths are an exact match, whether NEON/SSE2 or > x86/ARM. OTOH, we don't run a full slate of layout tests on ARM platforms in > Blink anyway, so it's not a huge deal at this point. It's just a precedent I'd > rather not set if possible. > > Out of curiosity, would it be possible to keep the dual-row parallelism and > still give results which are an exact match (perhaps by using a 128-bit > register?) I'm guessing the answer is no, but I thought I'd ask. No, it is impossible. This patch was made with the assumption that small errors on the image output is acceptable. By using 16-bit values, we can store 8 channels in a 128-bit register. Accordingly, scale(1/kernelSize) will also store in 16-bit. Even we set limitation on the max kernelSize, it still cannot be the exactly the same with 32-bit values. It will break the auto test if we need to keep one set of CPU results. Thanks for your introduction on the test rules that I didn't know before. 
 Skia testing *does* allow for small pixel differences, based on CPU optimizations (e.g. SSE or NEON). If you can show a measurable gain for a small diff, perhaps we should find a way to make such a thing acceptable to blink's testing. 
 Actually, I think we should take this patch. I was just wondering if there was a way of getting some of the perf benefit while preserving exact matches. There are already non-matching NEON vs ARM paths in other areas of the code outside of image filters. The Blink issue is not a problem today, since we don't have a full slate of layout tests on Android anyway, and would only be a problem if we change hardware on the Blink side. We can test both paths in Skia, and Blink can just run whichever path the hardware supports. If that hardware changes, we'll have to rebaseline, but I don't imagine we'll be using a non-NEON supporting CPU any time soon anyway. 
 LGTM https://codereview.chromium.org/105893003/diff/150001/expectations/gm/ignored... File expectations/gm/ignored-tests.txt (right): https://codereview.chromium.org/105893003/diff/150001/expectations/gm/ignored... expectations/gm/ignored-tests.txt:65: # blur related GM results need rebaslining Nit: rebaselining 
 CQ is trying da patch. Follow status at https://skia-tree-status.appspot.com/cq/zheng.xu@arm.com/105893003/150001 
 CQ is trying da patch. Follow status at https://skia-tree-status.appspot.com/cq/zheng.xu@arm.com/105893003/150001 
 CQ is trying da patch. Follow status at https://skia-tree-status.appspot.com/cq/zheng.xu@arm.com/105893003/150001 
 
            
              
                Message was sent while issue was closed.
              
            
             Change committed as 13036 
 
            
              
                Message was sent while issue was closed.
              
            
             https://codereview.chromium.org/105893003/diff/150001/expectations/gm/ignored... File expectations/gm/ignored-tests.txt (right): https://codereview.chromium.org/105893003/diff/150001/expectations/gm/ignored... expectations/gm/ignored-tests.txt:65: # blur related GM results need rebaslining On 2014/01/10 15:14:22, Stephen White wrote: > Nit: rebaselining cl for rebaselining https://codereview.chromium.org/136333004/ 
 
            
              
                Message was sent while issue was closed.
              
            
             2 verylargebitmap gm cases failed recently. But it should have no relationship with this patch. compared to succeeded/failure-ignored verylargebitmap cases on other builders or backends, it looks like the expected image is wrong or the tests should be ignored. But I have no idea, why these 2 failure occur after this patch was committed. 
 
            
              
                Message was sent while issue was closed.
              
            
             On 2014/01/13 12:18:42, zheng.xu wrote: > 2 verylargebitmap gm cases failed recently. But it should have no relationship > with this patch. > compared to succeeded/failure-ignored verylargebitmap cases on other builders or > backends, it looks like the expected image is wrong or the tests should be > ignored. > > But I have no idea, why these 2 failure occur after this patch was committed. I believe those tests are flaky, especially on the Win8 bots: https://code.google.com/p/skia/issues/detail?id=1978 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
