|
|
Chromium Code Reviews|
Created:
8 years, 3 months ago by DaleCurtis Modified:
8 years, 2 months ago CC:
chromium-reviews, feature-media-reviews_chromium.org, trchen, Johann Base URL:
svn://svn.chromium.org/chrome/trunk/src Visibility:
Public. |
DescriptionAdd ARM NEON intrinsic optimizations for SincResampler.
On an exynos board these yielded an ~2.3x speedup:
Benchmarking 50000000 iterations:
Convolve_C took 5682.71ms.
Convolve_NEON(unaligned) took 2451.18ms; which is 2.32x faster than Convolve_C.
Convolve_NEON (aligned) took 2397.01ms; which is 2.37x faster than Convolve_C and 1.02x faster than Convolve_NEON (unaligned).
BUG=none
TEST=try bot, fischman.
Committed: https://src.chromium.org/viewvc/chrome?view=rev&revision=158870
Patch Set 1 #
Total comments: 14
Patch Set 2 : Comments. #
Total comments: 10
Patch Set 3 : Clean up. #Patch Set 4 : Use multiply-accumulate intrinsics. #
Total comments: 3
Patch Set 5 : Use exclusive-or. #Patch Set 6 : Fix NE issue for ARM. #
Messages
Total messages: 23 (0 generated)
https://codereview.chromium.org/10960023/diff/1/media/base/sinc_resampler_uni... File media/base/sinc_resampler_unittest.cc (right): https://codereview.chromium.org/10960023/diff/1/media/base/sinc_resampler_uni... media/base/sinc_resampler_unittest.cc:115: double result2 = result; I don't like this b/c it'll make the test claim to pass (but be a no-op) on other archs, and will silently regress if defines get changed, etc. I think it's much safer to #else #error Don't build this on this platform! at l.123 (etc) and indeed only build this test on appropriate targets. https://codereview.chromium.org/10960023/diff/1/media/base/sinc_resampler_uni... media/base/sinc_resampler_unittest.cc:120: #elif defined(ARCH_CPU_ARM_FAMILY) && defined(__ARM_NEON__) && defined(USE_NEON) Here and elsewhere, I think __ARM_NEON__ is unnecessary given how USE_NEON comes into being. https://codereview.chromium.org/10960023/diff/1/media/base/sinc_resampler_uni... media/base/sinc_resampler_unittest.cc:154: int convolve_iterations = 50000000; what now? https://codereview.chromium.org/10960023/diff/1/media/base/sinc_resampler_uni... media/base/sinc_resampler_unittest.cc:200: #elif defined(ARCH_CPU_ARM_FAMILY) && defined(__ARM_NEON__) && defined(USE_NEON) Like with the original SSE CL, IWBN to include benchmark results in the CL description. https://codereview.chromium.org/10960023/diff/1/media/media.gyp File media/media.gyp (right): https://codereview.chromium.org/10960023/diff/1/media/media.gyp#newcode330 media/media.gyp:330: ['arm_neon == 1', { You need this same block in the media_unittests target, otherwise your new benchmark code isn't very useful :)
Looks like you got some 'splainin' to do. Due to rietveld limits, splitting this response in two. Here is the result of running Sinc* tests on an exynos board, with $GYP_DEFINES of: disable_nacl=1 enable_svg=0 chromeos=1 use_official_google_api_keys=1 target_arch=arm armv7=1 arm_float_abi=hard v8_use_arm_eabi_hardfloat=true linux_use_tcmalloc=0 arm_neon=1 proprietary_codecs=1 ffmpeg_branding=Chrome enable_smooth_scrolling=1 python_ver=2.6 swig_defines=-DOS_CHROMEOS linux_sandbox_path=/opt/google/chrome/chrome-sandbox remove_webcore_debug_symbols=1 component=shared_library First, the result of running the patch as written (so the .cc file is built with USE_NEON and the test isn't): $ ./Debug/media_unittests --gtest_filter=Sinc* Note: Google Test filter = Sinc* [==========] Running 40 tests from 2 test cases. [----------] Global test environment set-up. [----------] 4 tests from SincResamplerTest [ RUN ] SincResamplerTest.ChunkedResample [ OK ] SincResamplerTest.ChunkedResample (18 ms) [ RUN ] SincResamplerTest.Flush ../../media/base/sinc_resampler_unittest.cc:83: Failure Expected: (resampled_destination[0]) != (0), actual: 0 vs 0 [ FAILED ] SincResamplerTest.Flush (1 ms) [ RUN ] SincResamplerTest.Convolve ../../media/base/sinc_resampler_unittest.cc:140: Failure The difference between result2 and result is 0.010386750102043152, which exceeds kEpsilon, where result2 evaluates to 0.17454755306243896, result evaluates to 0.16416080296039581, and kEpsilon evaluates to 4.9999999999999998e-08. [ FAILED ] SincResamplerTest.Convolve (1 ms) [ RUN ] SincResamplerTest.ConvolveBenchmark Benchmarking 50000000 iterations: Convolve_C took 19921.69ms. [ OK ] SincResamplerTest.ConvolveBenchmark (19923 ms) [----------] 4 tests from SincResamplerTest (19944 ms total) [----------] 36 tests from SincResamplerTest/SincResamplerTest [ RUN ] SincResamplerTest/SincResamplerTest.Resample/0 [ OK ] SincResamplerTest/SincResamplerTest.Resample/0 (34 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/1 [ OK ] SincResamplerTest/SincResamplerTest.Resample/1 (34 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/2 [ OK ] SincResamplerTest/SincResamplerTest.Resample/2 (36 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/3 [ OK ] SincResamplerTest/SincResamplerTest.Resample/3 (38 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/4 [ OK ] SincResamplerTest/SincResamplerTest.Resample/4 (39 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/5 [ OK ] SincResamplerTest/SincResamplerTest.Resample/5 (42 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/6 [ OK ] SincResamplerTest/SincResamplerTest.Resample/6 (41 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/7 [ OK ] SincResamplerTest/SincResamplerTest.Resample/7 (47 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/8 [ OK ] SincResamplerTest/SincResamplerTest.Resample/8 (64 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/9 [ OK ] SincResamplerTest/SincResamplerTest.Resample/9 (36 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/10 [ OK ] SincResamplerTest/SincResamplerTest.Resample/10 (37 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/11 [ OK ] SincResamplerTest/SincResamplerTest.Resample/11 (39 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/12 [ OK ] SincResamplerTest/SincResamplerTest.Resample/12 (40 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/13 [ OK ] SincResamplerTest/SincResamplerTest.Resample/13 (42 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/14 [ OK ] SincResamplerTest/SincResamplerTest.Resample/14 (44 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/15 [ OK ] SincResamplerTest/SincResamplerTest.Resample/15 (45 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/16 [ OK ] SincResamplerTest/SincResamplerTest.Resample/16 (51 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/17 [ OK ] SincResamplerTest/SincResamplerTest.Resample/17 (66 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/18 [ OK ] SincResamplerTest/SincResamplerTest.Resample/18 (71 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/19 [ OK ] SincResamplerTest/SincResamplerTest.Resample/19 (72 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/20 [ OK ] SincResamplerTest/SincResamplerTest.Resample/20 (73 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/21 [ OK ] SincResamplerTest/SincResamplerTest.Resample/21 (75 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/22 [ OK ] SincResamplerTest/SincResamplerTest.Resample/22 (76 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/23 [ OK ] SincResamplerTest/SincResamplerTest.Resample/23 (78 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/24 [ OK ] SincResamplerTest/SincResamplerTest.Resample/24 (79 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/25 [ OK ] SincResamplerTest/SincResamplerTest.Resample/25 (88 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/26 [ OK ] SincResamplerTest/SincResamplerTest.Resample/26 (99 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/27 [ OK ] SincResamplerTest/SincResamplerTest.Resample/27 (139 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/28 [ OK ] SincResamplerTest/SincResamplerTest.Resample/28 (140 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/29 [ OK ] SincResamplerTest/SincResamplerTest.Resample/29 (141 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/30 [ OK ] SincResamplerTest/SincResamplerTest.Resample/30 (143 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/31 [ OK ] SincResamplerTest/SincResamplerTest.Resample/31 (145 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/32 [ OK ] SincResamplerTest/SincResamplerTest.Resample/32 (147 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/33 [ OK ] SincResamplerTest/SincResamplerTest.Resample/33 (149 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/34 [ OK ] SincResamplerTest/SincResamplerTest.Resample/34 (158 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/35 [ OK ] SincResamplerTest/SincResamplerTest.Resample/35 (176 ms) [----------] 36 tests from SincResamplerTest/SincResamplerTest (2844 ms total) [----------] Global test environment tear-down [==========] 40 tests from 2 test cases ran. (22789 ms total) [ PASSED ] 38 tests. [ FAILED ] 2 tests, listed below: [ FAILED ] SincResamplerTest.Flush [ FAILED ] SincResamplerTest.Convolve 2 FAILED TESTS YOU HAVE 8 DISABLED TESTS
Then the same cmdline with both the test & .cc file built with USE_NEON: Note: Google Test filter = Sinc* [==========] Running 40 tests from 2 test cases. [----------] Global test environment set-up. [----------] 4 tests from SincResamplerTest [ RUN ] SincResamplerTest.ChunkedResample [ OK ] SincResamplerTest.ChunkedResample (1 ms) [ RUN ] SincResamplerTest.Flush ../../media/base/sinc_resampler_unittest.cc:83: Failure Expected: (resampled_destination[0]) != (0), actual: 0 vs 0 [ FAILED ] SincResamplerTest.Flush (1 ms) [ RUN ] SincResamplerTest.Convolve [ OK ] SincResamplerTest.Convolve (1 ms) [ RUN ] SincResamplerTest.ConvolveBenchmark Benchmarking 50000000 iterations: Convolve_C took 19556.11ms. Convolve_NEON (unaligned) took 20231.97ms; which is 0.97x faster than Convolve_C. Convolve_NEON (aligned) took 20211.57ms; which is 0.97x faster than Convolve_C and 1.00x faster than Convolve_SSE (unaligned). [ OK ] SincResamplerTest.ConvolveBenchmark (60000 ms) [----------] 4 tests from SincResamplerTest (60005 ms total) [----------] 36 tests from SincResamplerTest/SincResamplerTest [ RUN ] SincResamplerTest/SincResamplerTest.Resample/0 [ OK ] SincResamplerTest/SincResamplerTest.Resample/0 (34 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/1 [ OK ] SincResamplerTest/SincResamplerTest.Resample/1 (34 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/2 [ OK ] SincResamplerTest/SincResamplerTest.Resample/2 (35 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/3 [ OK ] SincResamplerTest/SincResamplerTest.Resample/3 (37 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/4 [ OK ] SincResamplerTest/SincResamplerTest.Resample/4 (39 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/5 [ OK ] SincResamplerTest/SincResamplerTest.Resample/5 (41 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/6 [ OK ] SincResamplerTest/SincResamplerTest.Resample/6 (42 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/7 [ OK ] SincResamplerTest/SincResamplerTest.Resample/7 (48 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/8 [ OK ] SincResamplerTest/SincResamplerTest.Resample/8 (64 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/9 [ OK ] SincResamplerTest/SincResamplerTest.Resample/9 (37 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/10 [ OK ] SincResamplerTest/SincResamplerTest.Resample/10 (38 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/11 [ OK ] SincResamplerTest/SincResamplerTest.Resample/11 (39 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/12 [ OK ] SincResamplerTest/SincResamplerTest.Resample/12 (41 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/13 [ OK ] SincResamplerTest/SincResamplerTest.Resample/13 (42 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/14 [ OK ] SincResamplerTest/SincResamplerTest.Resample/14 (44 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/15 [ OK ] SincResamplerTest/SincResamplerTest.Resample/15 (45 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/16 [ OK ] SincResamplerTest/SincResamplerTest.Resample/16 (51 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/17 [ OK ] SincResamplerTest/SincResamplerTest.Resample/17 (67 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/18 [ OK ] SincResamplerTest/SincResamplerTest.Resample/18 (71 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/19 [ OK ] SincResamplerTest/SincResamplerTest.Resample/19 (72 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/20 [ OK ] SincResamplerTest/SincResamplerTest.Resample/20 (74 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/21 [ OK ] SincResamplerTest/SincResamplerTest.Resample/21 (75 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/22 [ OK ] SincResamplerTest/SincResamplerTest.Resample/22 (77 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/23 [ OK ] SincResamplerTest/SincResamplerTest.Resample/23 (79 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/24 [ OK ] SincResamplerTest/SincResamplerTest.Resample/24 (79 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/25 [ OK ] SincResamplerTest/SincResamplerTest.Resample/25 (88 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/26 [ OK ] SincResamplerTest/SincResamplerTest.Resample/26 (99 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/27 [ OK ] SincResamplerTest/SincResamplerTest.Resample/27 (139 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/28 [ OK ] SincResamplerTest/SincResamplerTest.Resample/28 (145 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/29 [ OK ] SincResamplerTest/SincResamplerTest.Resample/29 (143 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/30 [ OK ] SincResamplerTest/SincResamplerTest.Resample/30 (144 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/31 [ OK ] SincResamplerTest/SincResamplerTest.Resample/31 (145 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/32 [ OK ] SincResamplerTest/SincResamplerTest.Resample/32 (147 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/33 [ OK ] SincResamplerTest/SincResamplerTest.Resample/33 (148 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/34 [ OK ] SincResamplerTest/SincResamplerTest.Resample/34 (157 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/35 [ OK ] SincResamplerTest/SincResamplerTest.Resample/35 (175 ms) [----------] 36 tests from SincResamplerTest/SincResamplerTest (2849 ms total) [----------] Global test environment tear-down [==========] 40 tests from 2 test cases ran. (62856 ms total) [ PASSED ] 39 tests. [ FAILED ] 1 test, listed below: [ FAILED ] SincResamplerTest.Flush 1 FAILED TEST YOU HAVE 8 DISABLED TESTS
Fast work Ami! Thanks! Too bad the optimizations aren't :) Have to look at the disassembled objects, but my guesses are: - I used the wrong instructions. - auto vectorization is better than me. Maybe I should email Mans!
On 2012/09/21 05:40:04, DaleCurtis wrote: > Fast work Ami! Thanks! Too bad the optimizations aren't :) > > Have to look at the disassembled objects, but my guesses are: > - I used the wrong instructions. > - auto vectorization is better than me. > > Maybe I should email Mans! Ugh, under the hood it looks like the compiler is turning those 3 intrinsic loads into no less than 36 loads (!!!) and an equal number of stores. I have no idea what it's trying to do, but it pretty horrible looking: http://pastebin.com/RzbM1hHe Maybe I'll see if Frank or someone else on the Chrome team knows what's going on.
On 2012/09/21 17:49:36, DaleCurtis wrote: > On 2012/09/21 05:40:04, DaleCurtis wrote: > > Fast work Ami! Thanks! Too bad the optimizations aren't :) > > > > Have to look at the disassembled objects, but my guesses are: > > - I used the wrong instructions. > > - auto vectorization is better than me. > > > > Maybe I should email Mans! > > Ugh, under the hood it looks like the compiler is turning those 3 intrinsic > loads into no less than 36 loads (!!!) and an equal number of stores. I have no > idea what it's trying to do, but it pretty horrible looking: > > http://pastebin.com/RzbM1hHe > > Maybe I'll see if Frank or someone else on the Chrome team knows what's going > on. What compiler flags do you use? If I only use the minimal flags to support neon (-mfloat-abi=softfp -mfpu=neon) then I got the same crap as you pasted. Below is what I get with an additional -O2: 00000000 <Convolve_NEON>: 0: e92d0030 push {r4, r5} 4: eddf6b1f vldr d22, [pc, #124] ; 88 <Convolve_NEON+0x88> 8: eddf7b20 vldr d23, [pc, #128] ; 90 <Convolve_NEON+0x90> c: edddab02 vldr d26, [sp, #8] 10: f26681f6 vorr q12, q11, q11 14: e3a03000 mov r3, #0 18: e0805003 add r5, r0, r3 1c: e0814003 add r4, r1, r3 20: e082c003 add ip, r2, r3 24: f4650a8f vld1.32 {d16-d17}, [r5] 28: f4642a8f vld1.32 {d18-d19}, [r4] 2c: f46c4a8f vld1.32 {d20-d21}, [ip] 30: e2833010 add r3, r3, #16 34: f3402df2 vmul.f32 q9, q8, q9 38: e3530a01 cmp r3, #4096 ; 0x1000 3c: f3400df4 vmul.f32 q8, q8, q10 40: f2466de2 vadd.f32 q11, q11, q9 44: f2488de0 vadd.f32 q12, q12, q8 48: 1afffff2 bne 18 <Convolve_NEON+0x18> 4c: eef72b00 vmov.f64 d18, #112 ; 0x70 50: eef77bea vcvt.f32.f64 s15, d26 54: ee722bea vsub.f64 d18, d18, d26 58: f3fc0c47 vdup.32 q8, d7[1] 5c: f3480df0 vmul.f32 q8, q12, q8 60: eef77be2 vcvt.f32.f64 s15, d18 64: f3fc2c47 vdup.32 q9, d7[1] 68: f3466df2 vmul.f32 q11, q11, q9 6c: f2460de0 vadd.f32 q8, q11, q8 70: f26121b1 vorr d18, d17, d17 74: f2420da0 vadd.f32 d16, d18, d16 78: f3400da0 vpadd.f32 d16, d16, d16 7c: ee100b90 vmov.32 r0, d16[0] 80: e8bd0030 pop {r4, r5} 84: e12fff1e bx lr
On 2012/09/21 21:24:18, trchen wrote:
> What compiler flags do you use? If I only use the minimal flags to support
neon
> (-mfloat-abi=softfp -mfpu=neon) then I got the same crap as you pasted. Below
is
> what I get with an additional -O2:
Agreed, opt seems to clean this up a lot. Pulling the function into a C file
gave me similar output to trchen, but unrolled.
> 00000000 <Convolve_NEON>:
...
> 18: e0805003 add r5, r0, r3
> 1c: e0814003 add r4, r1, r3
> 20: e082c003 add ip, r2, r3
> 24: f4650a8f vld1.32 {d16-d17}, [r5]
> 28: f4642a8f vld1.32 {d18-d19}, [r4]
> 2c: f46c4a8f vld1.32 {d20-d21}, [ip]
...
One obvious thing I see the compiler missing is post-increment. The loads should
have ! at the end.
On 2012/09/21 23:07:53, Johann wrote:
> On 2012/09/21 21:24:18, trchen wrote:
> > What compiler flags do you use? If I only use the minimal flags to support
> neon
> > (-mfloat-abi=softfp -mfpu=neon) then I got the same crap as you pasted.
Below
> is
> > what I get with an additional -O2:
>
> Agreed, opt seems to clean this up a lot. Pulling the function into a C file
> gave me similar output to trchen, but unrolled.
>
> > 00000000 <Convolve_NEON>:
> ...
> > 18: e0805003 add r5, r0, r3
> > 1c: e0814003 add r4, r1, r3
> > 20: e082c003 add ip, r2, r3
> > 24: f4650a8f vld1.32 {d16-d17}, [r5]
> > 28: f4642a8f vld1.32 {d18-d19}, [r4]
> > 2c: f46c4a8f vld1.32 {d20-d21}, [ip]
> ...
>
> One obvious thing I see the compiler missing is post-increment. The loads
should
> have ! at the end.
Doh, that explains the poor disassembly maybe not the poor results though. Ami,
what optimization settings did you run with? Are those numbers from a
branding=Chrome release build?
The gyp defines I used are in the reviewlog. But this was a Debug build. Oops! Will rerun with Release when I get a chance.
Yay! 2x speedup! See code comment below, and note there was a failure in SRT.Flush. Rebuilding for release (still with my local hack to USE_NEON in building the test.cc): $ ./Release/media_unittests --gtest_filter=Sinc* Note: Google Test filter = Sinc* [==========] Running 40 tests from 2 test cases. [----------] Global test environment set-up. [----------] 4 tests from SincResamplerTest [ RUN ] SincResamplerTest.ChunkedResample [ OK ] SincResamplerTest.ChunkedResample (1 ms) [ RUN ] SincResamplerTest.Flush ../../media/base/sinc_resampler_unittest.cc:83: Failure Expected: (resampled_destination[0]) != (0), actual: 0 vs 0 [ FAILED ] SincResamplerTest.Flush (1 ms) [ RUN ] SincResamplerTest.Convolve [ OK ] SincResamplerTest.Convolve (1 ms) [ RUN ] SincResamplerTest.ConvolveBenchmark Benchmarking 50000000 iterations: Convolve_C took 5810.99ms. Convolve_NEON (unaligned) took 2900.72ms; which is 2.00x faster than Convolve_C. Convolve_NEON (aligned) took 2869.72ms; which is 2.02x faster than Convolve_C and 1.01x faster than Convolve_SSE (unaligned). [ OK ] SincResamplerTest.ConvolveBenchmark (11582 ms) [----------] 4 tests from SincResamplerTest (11586 ms total) [----------] 36 tests from SincResamplerTest/SincResamplerTest [ RUN ] SincResamplerTest/SincResamplerTest.Resample/0 [ OK ] SincResamplerTest/SincResamplerTest.Resample/0 (14 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/1 [ OK ] SincResamplerTest/SincResamplerTest.Resample/1 (14 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/2 [ OK ] SincResamplerTest/SincResamplerTest.Resample/2 (15 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/3 [ OK ] SincResamplerTest/SincResamplerTest.Resample/3 (16 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/4 [ OK ] SincResamplerTest/SincResamplerTest.Resample/4 (17 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/5 [ OK ] SincResamplerTest/SincResamplerTest.Resample/5 (22 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/6 [ OK ] SincResamplerTest/SincResamplerTest.Resample/6 (20 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/7 [ OK ] SincResamplerTest/SincResamplerTest.Resample/7 (25 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/8 [ OK ] SincResamplerTest/SincResamplerTest.Resample/8 (39 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/9 [ OK ] SincResamplerTest/SincResamplerTest.Resample/9 (15 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/10 [ OK ] SincResamplerTest/SincResamplerTest.Resample/10 (15 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/11 [ OK ] SincResamplerTest/SincResamplerTest.Resample/11 (16 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/12 [ OK ] SincResamplerTest/SincResamplerTest.Resample/12 (17 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/13 [ OK ] SincResamplerTest/SincResamplerTest.Resample/13 (18 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/14 [ OK ] SincResamplerTest/SincResamplerTest.Resample/14 (21 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/15 [ OK ] SincResamplerTest/SincResamplerTest.Resample/15 (21 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/16 [ OK ] SincResamplerTest/SincResamplerTest.Resample/16 (26 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/17 [ OK ] SincResamplerTest/SincResamplerTest.Resample/17 (40 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/18 [ OK ] SincResamplerTest/SincResamplerTest.Resample/18 (27 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/19 [ OK ] SincResamplerTest/SincResamplerTest.Resample/19 (27 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/20 [ OK ] SincResamplerTest/SincResamplerTest.Resample/20 (28 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/21 [ OK ] SincResamplerTest/SincResamplerTest.Resample/21 (29 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/22 [ OK ] SincResamplerTest/SincResamplerTest.Resample/22 (30 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/23 [ OK ] SincResamplerTest/SincResamplerTest.Resample/23 (32 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/24 [ OK ] SincResamplerTest/SincResamplerTest.Resample/24 (35 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/25 [ OK ] SincResamplerTest/SincResamplerTest.Resample/25 (42 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/26 [ OK ] SincResamplerTest/SincResamplerTest.Resample/26 (51 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/27 [ OK ] SincResamplerTest/SincResamplerTest.Resample/27 (51 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/28 [ OK ] SincResamplerTest/SincResamplerTest.Resample/28 (52 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/29 [ OK ] SincResamplerTest/SincResamplerTest.Resample/29 (52 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/30 [ OK ] SincResamplerTest/SincResamplerTest.Resample/30 (54 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/31 [ OK ] SincResamplerTest/SincResamplerTest.Resample/31 (57 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/32 [ OK ] SincResamplerTest/SincResamplerTest.Resample/32 (58 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/33 [ OK ] SincResamplerTest/SincResamplerTest.Resample/33 (59 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/34 [ OK ] SincResamplerTest/SincResamplerTest.Resample/34 (69 ms) [ RUN ] SincResamplerTest/SincResamplerTest.Resample/35 [ OK ] SincResamplerTest/SincResamplerTest.Resample/35 (83 ms) [----------] 36 tests from SincResamplerTest/SincResamplerTest (1212 ms total) [----------] Global test environment tear-down [==========] 40 tests from 2 test cases ran. (12799 ms total) [ PASSED ] 39 tests. [ FAILED ] 1 test, listed below: [ FAILED ] SincResamplerTest.Flush 1 FAILED TEST YOU HAVE 8 DISABLED TESTS chronos@localhost ~/chrome $ https://codereview.chromium.org/10960023/diff/1/media/base/sinc_resampler_uni... File media/base/sinc_resampler_unittest.cc (right): https://codereview.chromium.org/10960023/diff/1/media/base/sinc_resampler_uni... media/base/sinc_resampler_unittest.cc:208: double total_time_sse_unaligned_ms = there are a bunch of variables here and below that include "sse" in the name, probably erroneously. You could fix that with s/sse/neon/g, but probably better to instead dedup the analysis code against the version above (and just keep the actual dispatch in #if's). For that matter, you could #define your way to a single codepath entirely: #if ...SSE... #define CONVOLVE_OPT Convolve_SSE #elif ...NEON... #define CONVOLVE_OPT Convolve_NEON #else #error Boom! #endif ... https://codereview.chromium.org/10960023/diff/1/media/base/sinc_resampler_uni... media/base/sinc_resampler_unittest.cc:224: " Convolve_C and %.2fx faster than Convolve_SSE (unaligned).\n", s/SSE/NEON/, I think :)
Thanks for the benchmarks Ami. Comments addressed. https://codereview.chromium.org/10960023/diff/1/media/base/sinc_resampler_uni... File media/base/sinc_resampler_unittest.cc (right): https://codereview.chromium.org/10960023/diff/1/media/base/sinc_resampler_uni... media/base/sinc_resampler_unittest.cc:115: double result2 = result; On 2012/09/21 04:41:41, Ami Fischman wrote: > I don't like this b/c it'll make the test claim to pass (but be a no-op) on > other archs, and will silently regress if defines get changed, etc. > > I think it's much safer to > #else > #error Don't build this on this platform! > at l.123 (etc) and indeed only build this test on appropriate targets. Done. https://codereview.chromium.org/10960023/diff/1/media/base/sinc_resampler_uni... media/base/sinc_resampler_unittest.cc:120: #elif defined(ARCH_CPU_ARM_FAMILY) && defined(__ARM_NEON__) && defined(USE_NEON) On 2012/09/21 04:41:41, Ami Fischman wrote: > Here and elsewhere, I think __ARM_NEON__ is unnecessary given how USE_NEON comes > into being. Done. https://codereview.chromium.org/10960023/diff/1/media/base/sinc_resampler_uni... media/base/sinc_resampler_unittest.cc:154: int convolve_iterations = 50000000; On 2012/09/21 04:41:41, Ami Fischman wrote: > what now? Was just for benchmarking on the try bot, but sadly the trybots don't run the unit tests :( https://codereview.chromium.org/10960023/diff/1/media/base/sinc_resampler_uni... media/base/sinc_resampler_unittest.cc:200: #elif defined(ARCH_CPU_ARM_FAMILY) && defined(__ARM_NEON__) && defined(USE_NEON) On 2012/09/21 04:41:41, Ami Fischman wrote: > Like with the original SSE CL, IWBN to include benchmark results in the CL > description. Done. https://codereview.chromium.org/10960023/diff/1/media/base/sinc_resampler_uni... media/base/sinc_resampler_unittest.cc:208: double total_time_sse_unaligned_ms = On 2012/09/22 02:38:27, Ami Fischman wrote: > there are a bunch of variables here and below that include "sse" in the name, > probably erroneously. > > You could fix that with s/sse/neon/g, but probably better to instead dedup the > analysis code against the version above (and just keep the actual dispatch in > #if's). For that matter, you could #define your way to a single codepath > entirely: > #if ...SSE... > #define CONVOLVE_OPT Convolve_SSE > #elif ...NEON... > #define CONVOLVE_OPT Convolve_NEON > #else > #error Boom! > #endif > ... Done. https://codereview.chromium.org/10960023/diff/1/media/base/sinc_resampler_uni... media/base/sinc_resampler_unittest.cc:224: " Convolve_C and %.2fx faster than Convolve_SSE (unaligned).\n", On 2012/09/22 02:38:27, Ami Fischman wrote: > s/SSE/NEON/, I think :) Done. https://codereview.chromium.org/10960023/diff/1/media/media.gyp File media/media.gyp (right): https://codereview.chromium.org/10960023/diff/1/media/media.gyp#newcode330 media/media.gyp:330: ['arm_neon == 1', { On 2012/09/21 04:41:41, Ami Fischman wrote: > You need this same block in the media_unittests target, otherwise your new > benchmark code isn't very useful :) Done.
LGTM % nits https://codereview.chromium.org/10960023/diff/7003/media/base/sinc_resampler_... File media/base/sinc_resampler_unittest.cc (right): https://codereview.chromium.org/10960023/diff/7003/media/base/sinc_resampler_... media/base/sinc_resampler_unittest.cc:127: #error This test should only be compiled when SSE or NEON is available. This'd be a lot clearer if the #if test was reversed, since then the endif could be real close to the #if. https://codereview.chromium.org/10960023/diff/7003/media/base/sinc_resampler_... media/base/sinc_resampler_unittest.cc:127: #error This test should only be compiled when SSE or NEON is available. I wonder if all our bots have one or the other. I guess we'll find out! https://codereview.chromium.org/10960023/diff/7003/media/base/sinc_resampler_... media/base/sinc_resampler_unittest.cc:135: #if defined(ARCH_CPU_X86_FAMILY) && defined(__SSE__) you could avoid repeating these by defining an OPTIMIZATION_TYPE macro (SSE or NEON) as part of the #error-triggering #if, and then use that here. https://codereview.chromium.org/10960023/diff/7003/media/base/sinc_resampler_... media/base/sinc_resampler_unittest.cc:139: #elif defined(ARCH_CPU_ARM_FAMILY) && defined(__ARM_NEON__) && defined(USE_NEON) drop arm_neon https://codereview.chromium.org/10960023/diff/7003/media/base/sinc_resampler_... media/base/sinc_resampler_unittest.cc:183: #define CONVOLVE_FUNC Convolve_NEON Could as well go in the first #if, to save repetition?
https://codereview.chromium.org/10960023/diff/7003/media/base/sinc_resampler_... File media/base/sinc_resampler_unittest.cc (right): https://codereview.chromium.org/10960023/diff/7003/media/base/sinc_resampler_... media/base/sinc_resampler_unittest.cc:127: #error This test should only be compiled when SSE or NEON is available. On 2012/09/24 20:04:16, Ami Fischman wrote: > This'd be a lot clearer if the #if test was reversed, since then the endif could > be real close to the #if. I don't follow. Can you elaborate? You mean #if !neon && !sse #error #elif... #elif #endif? If so, I don't think that's cleaner. https://codereview.chromium.org/10960023/diff/7003/media/base/sinc_resampler_... media/base/sinc_resampler_unittest.cc:127: #error This test should only be compiled when SSE or NEON is available. On 2012/09/24 20:04:16, Ami Fischman wrote: > I wonder if all our bots have one or the other. I guess we'll find out! The #if || check @100 prevents any issues here :) https://codereview.chromium.org/10960023/diff/7003/media/base/sinc_resampler_... media/base/sinc_resampler_unittest.cc:183: #define CONVOLVE_FUNC Convolve_NEON On 2012/09/24 20:04:16, Ami Fischman wrote: > Could as well go in the first #if, to save repetition? I don't see how, the first if is a || check. We don't want to error out if SSE and NEON are not available, just skip the extra benchmark.
https://codereview.chromium.org/10960023/diff/7003/media/base/sinc_resampler_... File media/base/sinc_resampler_unittest.cc (right): https://codereview.chromium.org/10960023/diff/7003/media/base/sinc_resampler_... media/base/sinc_resampler_unittest.cc:135: #if defined(ARCH_CPU_X86_FAMILY) && defined(__SSE__) On 2012/09/24 20:04:16, Ami Fischman wrote: > you could avoid repeating these by defining an OPTIMIZATION_TYPE macro (SSE or > NEON) as part of the #error-triggering #if, and then use that here. Split out into a common section for clarity. https://codereview.chromium.org/10960023/diff/7003/media/base/sinc_resampler_... media/base/sinc_resampler_unittest.cc:139: #elif defined(ARCH_CPU_ARM_FAMILY) && defined(__ARM_NEON__) && defined(USE_NEON) On 2012/09/24 20:04:16, Ami Fischman wrote: > drop arm_neon Done.
LGTM
On 2012/09/24 20:24:53, Ami Fischman wrote: > LGTM Ami, I optimized the loop according to feedback on the mailing list. When you have some free time, can you rerun the benchmarks? Thanks.
trchen or johann, would either of you mind reviewing the ARM NEON code for correctness? Everything seems to work fine, but a more informed review would be great :) Thanks!
LGTM http://codereview.chromium.org/10960023/diff/13001/media/base/sinc_resampler.cc File media/base/sinc_resampler.cc (right): http://codereview.chromium.org/10960023/diff/13001/media/base/sinc_resampler.... media/base/sinc_resampler.cc:318: float32x4_t m_sums2 = vmovq_n_f32(0); For some reason it looks like it's assembling with an actual load instruction (per the last disassembly I saw). Should be able to use VEOR as mentioned by JF. It might be messy because I don't see one in arm_neon.h that takes float32x4
http://codereview.chromium.org/10960023/diff/13001/media/base/sinc_resampler.cc File media/base/sinc_resampler.cc (right): http://codereview.chromium.org/10960023/diff/13001/media/base/sinc_resampler.... media/base/sinc_resampler.cc:318: float32x4_t m_sums2 = vmovq_n_f32(0); On 2012/09/25 19:02:51, Johann wrote: > For some reason it looks like it's assembling with an actual load instruction > (per the last disassembly I saw). Should be able to use VEOR as mentioned by JF. > It might be messy because I don't see one in arm_neon.h that takes float32x4 Done. Old: 7d0: eddf 2b19 vldr d18, [pc, #100] ; 838 <_Z13Convolve_NEONPKfS0_S0_d+0x68> 7d4: eddf 3b1a vldr d19, [pc, #104] ; 840 <_Z13Convolve_NEONPKfS0_S0_d+0x70> 7d8: ef62 61f2 vorr q11, q9, q9 7dc: f100 0380 add.w r3, r0, #128 ; 0x80 7e0: eddd 8b00 vldr d24, [sp] New: 7cc: 2300 movs r3, #0 7ce: eddd 8b00 vldr d24, [sp] 7d2: eea2 3b90 vdup.32 q9, r3 7d6: f100 0380 add.w r3, r0, #128 ; 0x80 7da: ff42 21f2 veor q9, q9, q9 7de: ef62 61f2 vorr q11, q9, q9 The new code seems to be "wasting" time zeroing twice though (vdup.32 and veor), but I suspect that's still better than two extra loads.
http://codereview.chromium.org/10960023/diff/13001/media/base/sinc_resampler.cc File media/base/sinc_resampler.cc (right): http://codereview.chromium.org/10960023/diff/13001/media/base/sinc_resampler.... media/base/sinc_resampler.cc:318: float32x4_t m_sums2 = vmovq_n_f32(0); On 2012/09/25 20:30:01, DaleCurtis wrote: > On 2012/09/25 19:02:51, Johann wrote: > > For some reason it looks like it's assembling with an actual load instruction > > (per the last disassembly I saw). Should be able to use VEOR as mentioned by > JF. > > It might be messy because I don't see one in arm_neon.h that takes float32x4 > > Done. Old: > 7d0: eddf 2b19 vldr d18, [pc, #100] ; 838 <_Z13Convolve_NEONPKfS0_S0_d+0x68> > 7d4: eddf 3b1a vldr d19, [pc, #104] ; 840 <_Z13Convolve_NEONPKfS0_S0_d+0x70> > 7d8: ef62 61f2 vorr q11, q9, q9 > 7dc: f100 0380 add.w r3, r0, #128 ; 0x80 > 7e0: eddd 8b00 vldr d24, [sp] > > New: > 7cc: 2300 movs r3, #0 > 7ce: eddd 8b00 vldr d24, [sp] > 7d2: eea2 3b90 vdup.32 q9, r3 > 7d6: f100 0380 add.w r3, r0, #128 ; 0x80 > 7da: ff42 21f2 veor q9, q9, q9 > 7de: ef62 61f2 vorr q11, q9, q9 > > The new code seems to be "wasting" time zeroing twice though (vdup.32 and veor), > but I suspect that's still better than two extra loads. Actually this triggers an uninitialized error and doesn't appear to be any faster per benchmarks: Convolve_C took 5642.36ms. Convolve_NEON(unaligned) took 2421.80ms; which is 2.33x faster than Convolve_C. Convolve_NEON (aligned) took 2415.05ms; which is 2.34x faster than Convolve_C and 1.00x faster than Convolve_NEON (unaligned). So rather than adding a #pragma ignore for the uninitialized variable, I'll just use vmovq and hope future compilers do something smarter.
Sounds reasonable to me On Tue, Sep 25, 2012 at 2:27 PM, <dalecurtis@chromium.org> wrote: > So rather than adding a #pragma ignore for the uninitialized variable, > I'll just use vmovq and hope future compilers do something smarter.
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/dalecurtis@chromium.org/10960023/26001
CQ is trying da patch. Follow status at https://chromium-status.appspot.com/cq/dalecurtis@chromium.org/10960023/26001 |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
