|
|
DescriptionOptimized premultiplying swizzles for NEON
Improves decode performance for RGBA encoded PNGs.
Swizzle Time on Nexus 9 (with clang):
SwapPremul 0.44x
Premul 0.44x
Decode Time On Nexus 9 (with clang):
ZeroInit Decodes 0.85x
Regular Decodes 0.86x
Swizzle Time on Nexus 6P (with clang)
SwapPremul 0.14x
Premul 0.14x
Decode Time On Nexus 6P (with clang):
ZeroInit Decodes 0.93x
Regular Decodes 0.95x
Notes:
ZeroInit means memory is zero initialized, and we do not write to
memory for large sections of zero pixels (memory use opt for Android).
A profile on Nexus 9 shows that the premultiplication step of PNG
decoding is now ~5% of decode time (down from ~20%).
BUG=skia:4767
GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1577703006
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
Committed: https://skia.googlesource.com/skia/+/3a24f459582f2665f0e66bd35a0d8f46a1c4c72f
Patch Set 1 : #
Total comments: 14
Patch Set 2 : Response to comments #
Total comments: 20
Patch Set 3 : Fixed loop bugs #
Total comments: 4
Patch Set 4 : Refactor portable code #
Total comments: 2
Patch Set 5 : Remove unnecessary if statement #
Depends on Patchset: Messages
Total messages: 26 (10 generated)
Description was changed from ========== Optimized premultiplying swizzles for NEON Improves decode performance for RGBA encoded PNGs. On Nexus 9 (with clang): ZeroInit Decodes +18% (+31% cumulative) Regular Decodes +17% (+36%) On Nexus 6P (with clang): ZeroInit Decodes +6% (+11%) Regular Decodes +8% (+11%) Notes: ZeroInit means memory is zero initialized, and we do not write to memory for large sections of zero pixels (memory use opt for Android). The cumulative improvement measures how much performance has improved since work started on 4767. We recently improved the baseline by inlining the original code. BUG=skia:4767 ========== to ========== Optimized premultiplying swizzles for NEON Improves decode performance for RGBA encoded PNGs. On Nexus 9 (with clang): ZeroInit Decodes +18% (+31% cumulative) Regular Decodes +17% (+36%) On Nexus 6P (with clang): ZeroInit Decodes +6% (+11%) Regular Decodes +8% (+11%) Notes: ZeroInit means memory is zero initialized, and we do not write to memory for large sections of zero pixels (memory use opt for Android). The cumulative improvement measures how much performance has improved since work started on 4767. We recently improved the baseline by inlining the original code. BUG=skia:4767 GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ==========
Patchset #1 (id:1) has been deleted
msarett@google.com changed reviewers: + mtklein@google.com, scroggo@google.com
https://codereview.chromium.org/1577703006/diff/20001/src/opts/SkOpts_neon.cpp File src/opts/SkOpts_neon.cpp (right): https://codereview.chromium.org/1577703006/diff/20001/src/opts/SkOpts_neon.cp... src/opts/SkOpts_neon.cpp:52: premul_xxxa = sk_neon::premul_xxxa; This actually doesn't work. In order to run the optimized code, I had to modify the default in SkOpts.cpp. I'm trying to figure out why Init_neon() seems to not be run. https://codereview.chromium.org/1577703006/diff/20001/src/opts/SkSwizzler_neon.h File src/opts/SkSwizzler_neon.h (right): https://codereview.chromium.org/1577703006/diff/20001/src/opts/SkSwizzler_neo... src/opts/SkSwizzler_neon.h:127: rgba.val[0] = result_blues; This is identical to premul_xxxa with the exception of the swap of blue and red here. Can we somehow share code?
I think I know why Init_neon() doesn't run. It's a silly oversight from when we thought everything would be written in Sk4px-style code. Nothing you've done here. I'll send you a CL to fix that (and add Init_sse2() for similar reasons).
On 2016/01/13 14:08:48, mtklein wrote: > I think I know why Init_neon() doesn't run. It's a silly oversight from when we > thought everything would be written in Sk4px-style code. Nothing you've done > here. > > I'll send you a CL to fix that (and add Init_sse2() for similar reasons). Ooops, got myself confused. Let's stick to our guns. This should do the trick: 1) rename SkSwizzler_neon.h to SkSwizzler_opts.h 2) switch from namespace sk_neon to namespace SK_OPTS_NS 3) move the portable swizzler code here to, e.g. #if defined(SK_ARM_HAS_NEON) ... neon code ... #else ... portable code with same function names ... #endif 4) include your header in SkOpts.cpp (after #define SK_OPTS_NS sk_default, like the others) 5) include your header in SkOpts_neon.cpp (after #define SK_OPTS_NS sk_neon, like the others)
Does +18% mean the runtime is 0.847x? I find these things usually easier to think/talk about in runtime multipliers, where <1 is a win and >1 a loss.
https://codereview.chromium.org/1577703006/diff/20001/src/opts/SkSwizzler_neon.h File src/opts/SkSwizzler_neon.h (right): https://codereview.chromium.org/1577703006/diff/20001/src/opts/SkSwizzler_neo... src/opts/SkSwizzler_neon.h:127: rgba.val[0] = result_blues; On 2016/01/13 14:05:38, msarett wrote: > This is identical to premul_xxxa with the exception of the swap of blue and red > here. Can we somehow share code? Yes, template <bool kSwapRB> on the function, then if (kSwapRB) { ... }.
https://codereview.chromium.org/1577703006/diff/20001/src/opts/SkSwizzler_neon.h File src/opts/SkSwizzler_neon.h (right): https://codereview.chromium.org/1577703006/diff/20001/src/opts/SkSwizzler_neo... src/opts/SkSwizzler_neon.h:17: static void premul_xxxa(uint32_t dst[], const uint32_t src[], int count) { I think some of the clarity of this code is getting swamped by the comments. Going to mark a few of them up with suggestions. https://codereview.chromium.org/1577703006/diff/20001/src/opts/SkSwizzler_neo... src/opts/SkSwizzler_neon.h:24: uint8x8x4_t rgba = vld4_u8(src8); Load 8 pixels. https://codereview.chromium.org/1577703006/diff/20001/src/opts/SkSwizzler_neo... src/opts/SkSwizzler_neon.h:26: // These should be free operations. I don't think you need any comment here. I'd suggest uint8x8_t r = rgba.val[0], g = rgba.val[1], b = rgba.val[2], a = rgba.val[3]; We all know what r,g,b,a are, and this keeps them short and equal-length. https://codereview.chromium.org/1577703006/diff/20001/src/opts/SkSwizzler_neo... src/opts/SkSwizzler_neon.h:35: uint16x8_t product_reds = vmull_u8(reds, alphas); I'd suggest: // Premultiply. r = scale(r, a); g = scale(g, a); b = scale(b, a); where // (x*y+127)/255, i.e. scale a byte by another. static uint8x8_t scale(uint8x8_t x, uint8x8_t y) { ... } possibly using this as a sub-routine: // (x+127)/255. static uint8x8_t div255_round(uint16x8_t x) { ... } We probably don't need to prove that div255() does div255(), as our tests and Gold will do that, but if you want to prove it here, let's squirrel it away inside a div255_round() method so we don't have to keep reading it every time we read this code. https://codereview.chromium.org/1577703006/diff/20001/src/opts/SkSwizzler_neo... src/opts/SkSwizzler_neon.h:63: // Store 8 premultiplied pixels. Good from here on.
Description was changed from ========== Optimized premultiplying swizzles for NEON Improves decode performance for RGBA encoded PNGs. On Nexus 9 (with clang): ZeroInit Decodes +18% (+31% cumulative) Regular Decodes +17% (+36%) On Nexus 6P (with clang): ZeroInit Decodes +6% (+11%) Regular Decodes +8% (+11%) Notes: ZeroInit means memory is zero initialized, and we do not write to memory for large sections of zero pixels (memory use opt for Android). The cumulative improvement measures how much performance has improved since work started on 4767. We recently improved the baseline by inlining the original code. BUG=skia:4767 GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ========== to ========== Optimized premultiplying swizzles for NEON Improves decode performance for RGBA encoded PNGs. Decode Time On Nexus 9 (with clang): ZeroInit Decodes 0.85x (0.77x cumulative) Regular Decodes 0.86x (0.75x) Decode Time On Nexus 6P (with clang): ZeroInit Decodes 0.93x (0.90x) Regular Decodes 0.95x (0.91x) Notes: ZeroInit means memory is zero initialized, and we do not write to memory for large sections of zero pixels (memory use opt for Android). The cumulative improvement measures how much performance has improved since work started on 4767. We recently improved the baseline by inlining the original code. A profile on Nexus 9 shows that the premultiplication step of PNG decoding is now ~5% of decode time (down from ~20%). BUG=skia:4767 GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ==========
"Ooops, got myself confused. Let's stick to our guns. This should do the trick: 1) rename SkSwizzler_neon.h to SkSwizzler_opts.h 2) switch from namespace sk_neon to namespace SK_OPTS_NS 3) move the portable swizzler code here to, e.g. #if defined(SK_ARM_HAS_NEON) ... neon code ... #else ... portable code with same function names ... #endif 4) include your header in SkOpts.cpp (after #define SK_OPTS_NS sk_default, like the others) 5) include your header in SkOpts_neon.cpp (after #define SK_OPTS_NS sk_neon, like the others)" Done. "Does +18% mean the runtime is 0.847x? I find these things usually easier to think/talk about in runtime multipliers, where <1 is a win and >1 a loss." SGTM. I've updated the CL description. Also adding the results of profiling, which I forgot before. https://codereview.chromium.org/1577703006/diff/20001/src/opts/SkSwizzler_neon.h File src/opts/SkSwizzler_neon.h (right): https://codereview.chromium.org/1577703006/diff/20001/src/opts/SkSwizzler_neo... src/opts/SkSwizzler_neon.h:17: static void premul_xxxa(uint32_t dst[], const uint32_t src[], int count) { On 2016/01/13 15:20:01, mtklein wrote: > I think some of the clarity of this code is getting swamped by the comments. > > Going to mark a few of them up with suggestions. Acknowledged. https://codereview.chromium.org/1577703006/diff/20001/src/opts/SkSwizzler_neo... src/opts/SkSwizzler_neon.h:24: uint8x8x4_t rgba = vld4_u8(src8); On 2016/01/13 15:20:01, mtklein wrote: > Load 8 pixels. Done. https://codereview.chromium.org/1577703006/diff/20001/src/opts/SkSwizzler_neo... src/opts/SkSwizzler_neon.h:26: // These should be free operations. On 2016/01/13 15:20:01, mtklein wrote: > I don't think you need any comment here. I'd suggest > > uint8x8_t r = rgba.val[0], > g = rgba.val[1], > b = rgba.val[2], > a = rgba.val[3]; > > We all know what r,g,b,a are, and this keeps them short and equal-length. Done. https://codereview.chromium.org/1577703006/diff/20001/src/opts/SkSwizzler_neo... src/opts/SkSwizzler_neon.h:35: uint16x8_t product_reds = vmull_u8(reds, alphas); On 2016/01/13 15:20:01, mtklein wrote: > I'd suggest: > > // Premultiply. > r = scale(r, a); > g = scale(g, a); > b = scale(b, a); > > where > > // (x*y+127)/255, i.e. scale a byte by another. > static uint8x8_t scale(uint8x8_t x, uint8x8_t y) { ... } > > possibly using this as a sub-routine: > > // (x+127)/255. > static uint8x8_t div255_round(uint16x8_t x) { ... } > > We probably don't need to prove that div255() does div255(), as our tests and > Gold will do that, but if you want to prove it here, let's squirrel it away > inside a div255_round() method so we don't have to keep reading it every time we > read this code. I think the refactoring makes things clearer. My preference is to keep the comments. I agree that it is cluttered as is, but it's be nice if someone other than you or me can understand what's going on. (And also I may forget that proof in year.) Gold will verify that it works, but I'd also like to explain why it works. https://codereview.chromium.org/1577703006/diff/20001/src/opts/SkSwizzler_neo... src/opts/SkSwizzler_neon.h:63: // Store 8 premultiplied pixels. On 2016/01/13 15:20:01, mtklein wrote: > Good from here on. Acknowledged. https://codereview.chromium.org/1577703006/diff/20001/src/opts/SkSwizzler_neo... src/opts/SkSwizzler_neon.h:127: rgba.val[0] = result_blues; On 2016/01/13 14:33:17, mtklein wrote: > On 2016/01/13 14:05:38, msarett wrote: > > This is identical to premul_xxxa with the exception of the swap of blue and > red > > here. Can we somehow share code? > > Yes, template <bool kSwapRB> on the function, then if (kSwapRB) { ... }. Done.
I'll defer to Mike's approval here. I'm excited about the speed-ups!
https://codereview.chromium.org/1577703006/diff/40001/src/core/SkOpts.cpp File src/core/SkOpts.cpp (right): https://codereview.chromium.org/1577703006/diff/40001/src/core/SkOpts.cpp#new... src/core/SkOpts.cpp:84: decltype( premul_xxxa) premul_xxxa = sk_default::premul_xxxa<false>; I'd prefer to add shims to SkSwizzler_opts.h to keep the names the same. It's easier to debug/profile if we see premul_swaprb_xxxa than premul_xxxa<1>. static void premul_xxxa(...) { premul_xxxa_should_swaprb<false>(...); } static void premul_swaprb_xxxa(...) { premul_xxxa_should_swaprb<true>(...); } _should_swaprb() ought to inline into both, and I can't imagine seeing any effect on speed even if it didn't. https://codereview.chromium.org/1577703006/diff/40001/src/opts/SkSwizzler_opts.h File src/opts/SkSwizzler_opts.h (right): https://codereview.chromium.org/1577703006/diff/40001/src/opts/SkSwizzler_opt... src/opts/SkSwizzler_opts.h:15: // These variable names in these functions just pretend the input is RGBA. You're going to the dark side! BGRA is the format to prefer, because it matches SkColor. Written as a literal, it's 0xAARRGGBB, which is what we're used to writing and breaking apart with shifts. I admit this is pretty arbitrary, but the arbitrariness is why I'd like to stick to the convention of writing things as if they were BGRA. It's nice to hop back and forth between files without twisting your brain. https://codereview.chromium.org/1577703006/diff/40001/src/opts/SkSwizzler_opt... src/opts/SkSwizzler_opts.h:20: // Rounded divide by 255 (x + 127) / 255 add a comma after the first 255? https://codereview.chromium.org/1577703006/diff/40001/src/opts/SkSwizzler_opt... src/opts/SkSwizzler_opts.h:21: // result = (x + 127) / 255 Let's move the rest of these implementation-focused comments to just inside the function body. https://codereview.chromium.org/1577703006/diff/40001/src/opts/SkSwizzler_opt... src/opts/SkSwizzler_opts.h:35: // NEON supports. ...supports in one instruction. ? https://codereview.chromium.org/1577703006/diff/40001/src/opts/SkSwizzler_opt... src/opts/SkSwizzler_opts.h:44: // Scale a byte by another (x * y + 127) / 255 comma after another? https://codereview.chromium.org/1577703006/diff/40001/src/opts/SkSwizzler_opt... src/opts/SkSwizzler_opts.h:53: const uint8_t* src8 = (const uint8_t*) src; Generally I prefer to cast the pointers inside each load and store, so that all the constants we use match up. It takes thought to see that src8 += 32 is correct, but while src += 8 is a bit clearer. https://codereview.chromium.org/1577703006/diff/40001/src/opts/SkSwizzler_opt... src/opts/SkSwizzler_opts.h:54: for (; i < count; i += 8) { I think this walks i too far for, e.g. count == 15. I've found the safest way to get these things right is to decrement count: while (count >= 8) { count -= 8; } while (count > 0) { count--; } I further usually write that last one as while (count --> 0) { } because it both looks cute and is a pretty strong indication that count will not be referred to inside the loop. https://codereview.chromium.org/1577703006/diff/40001/src/opts/SkSwizzler_opt... src/opts/SkSwizzler_opts.h:85: dst[i] = SkPremultiplyARGBInline(src8[3], src8[0], src8[1], src8[2]); dst[i] = ... ... dst++ this steps dst forward twice. I've found the most foolproof way to write these methods never needs an 'i': decrement count, increment dst and src. https://codereview.chromium.org/1577703006/diff/40001/src/opts/SkSwizzler_opt... src/opts/SkSwizzler_opts.h:94: static void premul_xxxa(uint32_t dst[], const uint32_t src[], int count) { This change is fine by me if you like, but it's not necessary once you add shims.
https://codereview.chromium.org/1577703006/diff/40001/src/core/SkOpts.cpp File src/core/SkOpts.cpp (right): https://codereview.chromium.org/1577703006/diff/40001/src/core/SkOpts.cpp#new... src/core/SkOpts.cpp:84: decltype( premul_xxxa) premul_xxxa = sk_default::premul_xxxa<false>; On 2016/01/13 17:04:36, mtklein wrote: > I'd prefer to add shims to SkSwizzler_opts.h to keep the names the same. It's > easier to debug/profile if we see premul_swaprb_xxxa than premul_xxxa<1>. > > static void premul_xxxa(...) { premul_xxxa_should_swaprb<false>(...); } > static void premul_swaprb_xxxa(...) { premul_xxxa_should_swaprb<true>(...); } > > _should_swaprb() ought to inline into both, and I can't imagine seeing any > effect on speed even if it didn't. Agreed on all counts! Done. https://codereview.chromium.org/1577703006/diff/40001/src/opts/SkSwizzler_opts.h File src/opts/SkSwizzler_opts.h (right): https://codereview.chromium.org/1577703006/diff/40001/src/opts/SkSwizzler_opt... src/opts/SkSwizzler_opts.h:15: // These variable names in these functions just pretend the input is RGBA. On 2016/01/13 17:04:37, mtklein wrote: > You're going to the dark side! > > BGRA is the format to prefer, because it matches SkColor. Written as a literal, > it's 0xAARRGGBB, which is what we're used to writing and breaking apart with > shifts. > > I admit this is pretty arbitrary, but the arbitrariness is why I'd like to stick > to the convention of writing things as if they were BGRA. It's nice to hop back > and forth between files without twisting your brain. SGTM. I just wanted NEON and default to be the same. Didn't know it mattered which one I picked :). https://codereview.chromium.org/1577703006/diff/40001/src/opts/SkSwizzler_opt... src/opts/SkSwizzler_opts.h:20: // Rounded divide by 255 (x + 127) / 255 On 2016/01/13 17:04:36, mtklein wrote: > add a comma after the first 255? Done. https://codereview.chromium.org/1577703006/diff/40001/src/opts/SkSwizzler_opt... src/opts/SkSwizzler_opts.h:21: // result = (x + 127) / 255 On 2016/01/13 17:04:36, mtklein wrote: > Let's move the rest of these implementation-focused comments to just inside the > function body. Done. https://codereview.chromium.org/1577703006/diff/40001/src/opts/SkSwizzler_opt... src/opts/SkSwizzler_opts.h:35: // NEON supports. On 2016/01/13 17:04:37, mtklein wrote: > ...supports in one instruction. > ? Done. https://codereview.chromium.org/1577703006/diff/40001/src/opts/SkSwizzler_opt... src/opts/SkSwizzler_opts.h:44: // Scale a byte by another (x * y + 127) / 255 On 2016/01/13 17:04:37, mtklein wrote: > comma after another? Done. https://codereview.chromium.org/1577703006/diff/40001/src/opts/SkSwizzler_opt... src/opts/SkSwizzler_opts.h:53: const uint8_t* src8 = (const uint8_t*) src; On 2016/01/13 17:04:36, mtklein wrote: > Generally I prefer to cast the pointers inside each load and store, so that all > the constants we use match up. It takes thought to see that src8 += 32 is > correct, but while src += 8 is a bit clearer. Done. https://codereview.chromium.org/1577703006/diff/40001/src/opts/SkSwizzler_opt... src/opts/SkSwizzler_opts.h:54: for (; i < count; i += 8) { On 2016/01/13 17:04:36, mtklein wrote: > I think this walks i too far for, e.g. count == 15. > > I've found the safest way to get these things right is to decrement count: > > while (count >= 8) { > > count -= 8; > } > while (count > 0) { > > count--; > } > > I further usually write that last one as > > while (count --> 0) { > > } > > because it both looks cute and is a pretty strong indication that count will not > be referred to inside the loop. Done. Apologies for the messy job porting this code. I've verified that perf is unaffected and tested again (this time with the opts turned on :)). https://codereview.chromium.org/1577703006/diff/40001/src/opts/SkSwizzler_opt... src/opts/SkSwizzler_opts.h:85: dst[i] = SkPremultiplyARGBInline(src8[3], src8[0], src8[1], src8[2]); On 2016/01/13 17:04:37, mtklein wrote: > dst[i] = ... > ... > dst++ > > this steps dst forward twice. > > I've found the most foolproof way to write these methods never needs an 'i': > decrement count, increment dst and src. Done. https://codereview.chromium.org/1577703006/diff/40001/src/opts/SkSwizzler_opt... src/opts/SkSwizzler_opts.h:94: static void premul_xxxa(uint32_t dst[], const uint32_t src[], int count) { On 2016/01/13 17:04:37, mtklein wrote: > This change is fine by me if you like, but it's not necessary once you add > shims. Acknowledged. Changing back to the original.
https://codereview.chromium.org/1577703006/diff/60001/src/opts/SkSwizzler_opts.h File src/opts/SkSwizzler_opts.h (right): https://codereview.chromium.org/1577703006/diff/60001/src/opts/SkSwizzler_opt... src/opts/SkSwizzler_opts.h:55: uint8x8_t b = bgra.val[0], It'd be nice to harmonize the order we list things now that we've got the names straight. Up here we write things in b,g,r,a (0->3) order, but in the portable code we write things in a,r,g,b (3->0) order. Either works for me, but as you might guess, I like a,r,g,b (3->0). https://codereview.chromium.org/1577703006/diff/60001/src/opts/SkSwizzler_opt... src/opts/SkSwizzler_opts.h:81: while (count --> 0) { This loop can't be correct. It's not looking at kSwapRB. (We can't call SkPremultiplyARGBInline... that's always returning SkPMColor order.) Why don't we pull the portable code up top as premul_xxxa_portable, etc, then from in here you can do your NEON loop, then just // Call portable code to finish up the tail of [0,8) pixels. auto proc = kSwapRB ? premul_swaprb_xxxa_portable : premul_xxxa_portable; proc(dst, src, count);
Patchset #4 (id:80001) has been deleted
Patchset #4 (id:100001) has been deleted
Description was changed from ========== Optimized premultiplying swizzles for NEON Improves decode performance for RGBA encoded PNGs. Decode Time On Nexus 9 (with clang): ZeroInit Decodes 0.85x (0.77x cumulative) Regular Decodes 0.86x (0.75x) Decode Time On Nexus 6P (with clang): ZeroInit Decodes 0.93x (0.90x) Regular Decodes 0.95x (0.91x) Notes: ZeroInit means memory is zero initialized, and we do not write to memory for large sections of zero pixels (memory use opt for Android). The cumulative improvement measures how much performance has improved since work started on 4767. We recently improved the baseline by inlining the original code. A profile on Nexus 9 shows that the premultiplication step of PNG decoding is now ~5% of decode time (down from ~20%). BUG=skia:4767 GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ========== to ========== Optimized premultiplying swizzles for NEON Improves decode performance for RGBA encoded PNGs. Swizzle Time on Nexus 9 (with clang): SwapPremul 0.44x Premul 0.44x Decode Time On Nexus 9 (with clang): ZeroInit Decodes 0.85x Regular Decodes 0.86x Swizzle Time on Nexus 6P (with clang) SwapPremul 0.14x Premul 0.14x Decode Time On Nexus 6P (with clang): ZeroInit Decodes 0.93x Regular Decodes 0.95x Notes: ZeroInit means memory is zero initialized, and we do not write to memory for large sections of zero pixels (memory use opt for Android). A profile on Nexus 9 shows that the premultiplication step of PNG decoding is now ~5% of decode time (down from ~20%). BUG=skia:4767 GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ==========
Changed the commit message to measure performance improvement of: (1) Just the swizzle step. (2) The full decode. https://codereview.chromium.org/1577703006/diff/60001/src/opts/SkSwizzler_opts.h File src/opts/SkSwizzler_opts.h (right): https://codereview.chromium.org/1577703006/diff/60001/src/opts/SkSwizzler_opt... src/opts/SkSwizzler_opts.h:55: uint8x8_t b = bgra.val[0], On 2016/01/13 21:04:17, mtklein wrote: > It'd be nice to harmonize the order we list things now that we've got the names > straight. > > Up here we write things in b,g,r,a (0->3) order, but in the portable code we > write things in a,r,g,b (3->0) order. Either works for me, but as you might > guess, I like a,r,g,b (3->0). Done. https://codereview.chromium.org/1577703006/diff/60001/src/opts/SkSwizzler_opt... src/opts/SkSwizzler_opts.h:81: while (count --> 0) { On 2016/01/13 21:04:17, mtklein wrote: > This loop can't be correct. It's not looking at kSwapRB. (We can't call > SkPremultiplyARGBInline... that's always returning SkPMColor order.) > > Why don't we pull the portable code up top as premul_xxxa_portable, etc, then > from in here you can do your NEON loop, then just > > // Call portable code to finish up the tail of [0,8) pixels. > auto proc = kSwapRB ? premul_swaprb_xxxa_portable : premul_xxxa_portable; > proc(dst, src, count); Done.
lgtm https://codereview.chromium.org/1577703006/diff/120001/src/opts/SkSwizzler_op... File src/opts/SkSwizzler_opts.h (right): https://codereview.chromium.org/1577703006/diff/120001/src/opts/SkSwizzler_op... src/opts/SkSwizzler_opts.h:113: if (count > 0) { This check isn't really needed? I'm okay to keep it here if it makes you feel better / to act as a comment.
https://codereview.chromium.org/1577703006/diff/120001/src/opts/SkSwizzler_op... File src/opts/SkSwizzler_opts.h (right): https://codereview.chromium.org/1577703006/diff/120001/src/opts/SkSwizzler_op... src/opts/SkSwizzler_opts.h:113: if (count > 0) { On 2016/01/13 21:45:34, mtklein wrote: > This check isn't really needed? I'm okay to keep it here if it makes you feel > better / to act as a comment. Done.
The CQ bit was checked by msarett@google.com
The patchset sent to the CQ was uploaded after l-g-t-m from mtklein@google.com Link to the patchset: https://codereview.chromium.org/1577703006/#ps140001 (title: "Remove unnecessary if statement")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1577703006/140001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1577703006/140001
Message was sent while issue was closed.
Description was changed from ========== Optimized premultiplying swizzles for NEON Improves decode performance for RGBA encoded PNGs. Swizzle Time on Nexus 9 (with clang): SwapPremul 0.44x Premul 0.44x Decode Time On Nexus 9 (with clang): ZeroInit Decodes 0.85x Regular Decodes 0.86x Swizzle Time on Nexus 6P (with clang) SwapPremul 0.14x Premul 0.14x Decode Time On Nexus 6P (with clang): ZeroInit Decodes 0.93x Regular Decodes 0.95x Notes: ZeroInit means memory is zero initialized, and we do not write to memory for large sections of zero pixels (memory use opt for Android). A profile on Nexus 9 shows that the premultiplication step of PNG decoding is now ~5% of decode time (down from ~20%). BUG=skia:4767 GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ========== to ========== Optimized premultiplying swizzles for NEON Improves decode performance for RGBA encoded PNGs. Swizzle Time on Nexus 9 (with clang): SwapPremul 0.44x Premul 0.44x Decode Time On Nexus 9 (with clang): ZeroInit Decodes 0.85x Regular Decodes 0.86x Swizzle Time on Nexus 6P (with clang) SwapPremul 0.14x Premul 0.14x Decode Time On Nexus 6P (with clang): ZeroInit Decodes 0.93x Regular Decodes 0.95x Notes: ZeroInit means memory is zero initialized, and we do not write to memory for large sections of zero pixels (memory use opt for Android). A profile on Nexus 9 shows that the premultiplication step of PNG decoding is now ~5% of decode time (down from ~20%). BUG=skia:4767 GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Committed: https://skia.googlesource.com/skia/+/3a24f459582f2665f0e66bd35a0d8f46a1c4c72f ==========
Message was sent while issue was closed.
Committed patchset #5 (id:140001) as https://skia.googlesource.com/skia/+/3a24f459582f2665f0e66bd35a0d8f46a1c4c72f |