Issue 1577703006: Optimized premultiplying swizzles for NEON

msarett

Description was changed from ========== Optimized premultiplying swizzles for NEON Improves decode performance for RGBA ...

4 years, 11 months ago (2016-01-13 14:00:20 UTC) #1

msarett

msarett@google.com changed reviewers: + mtklein@google.com, scroggo@google.com

4 years, 11 months ago (2016-01-13 14:05:37 UTC) #3

msarett

https://codereview.chromium.org/1577703006/diff/20001/src/opts/SkOpts_neon.cpp File src/opts/SkOpts_neon.cpp (right): https://codereview.chromium.org/1577703006/diff/20001/src/opts/SkOpts_neon.cpp#newcode52 src/opts/SkOpts_neon.cpp:52: premul_xxxa = sk_neon::premul_xxxa; This actually doesn't work. In order ...

4 years, 11 months ago (2016-01-13 14:05:38 UTC) #4

mtklein

I think I know why Init_neon() doesn't run. It's a silly oversight from when we ...

4 years, 11 months ago (2016-01-13 14:08:48 UTC) #5

mtklein

On 2016/01/13 14:08:48, mtklein wrote: > I think I know why Init_neon() doesn't run. It's ...

4 years, 11 months ago (2016-01-13 14:28:01 UTC) #6

mtklein

Does +18% mean the runtime is 0.847x? I find these things usually easier to think/talk ...

4 years, 11 months ago (2016-01-13 14:29:54 UTC) #7

mtklein

https://codereview.chromium.org/1577703006/diff/20001/src/opts/SkSwizzler_neon.h File src/opts/SkSwizzler_neon.h (right): https://codereview.chromium.org/1577703006/diff/20001/src/opts/SkSwizzler_neon.h#newcode127 src/opts/SkSwizzler_neon.h:127: rgba.val[0] = result_blues; On 2016/01/13 14:05:38, msarett wrote: > ...

4 years, 11 months ago (2016-01-13 14:33:18 UTC) #8

mtklein

https://codereview.chromium.org/1577703006/diff/20001/src/opts/SkSwizzler_neon.h File src/opts/SkSwizzler_neon.h (right): https://codereview.chromium.org/1577703006/diff/20001/src/opts/SkSwizzler_neon.h#newcode17 src/opts/SkSwizzler_neon.h:17: static void premul_xxxa(uint32_t dst[], const uint32_t src[], int count) ...

4 years, 11 months ago (2016-01-13 15:20:01 UTC) #9

msarett

Description was changed from ========== Optimized premultiplying swizzles for NEON Improves decode performance for RGBA ...

4 years, 11 months ago (2016-01-13 16:05:08 UTC) #10

msarett

"Ooops, got myself confused. Let's stick to our guns. This should do the trick: 1) ...

4 years, 11 months ago (2016-01-13 16:17:16 UTC) #11

"Ooops, got myself confused.   Let's stick to our guns.  This should do the
trick:

1)  rename SkSwizzler_neon.h to SkSwizzler_opts.h
2)  switch from namespace sk_neon to namespace SK_OPTS_NS
3)  move the portable swizzler code here to, e.g.

#if defined(SK_ARM_HAS_NEON)
   ... neon code ...
#else
   ... portable code with same function names ...
#endif

4)  include your header in SkOpts.cpp (after #define SK_OPTS_NS sk_default, like
the others)
5)  include your header in SkOpts_neon.cpp (after #define SK_OPTS_NS sk_neon,
like the others)"

Done.

"Does +18% mean the runtime is 0.847x?   I find these things usually easier to
think/talk about in runtime multipliers, where <1 is a win and >1 a loss."

SGTM.  I've updated the CL description.  Also adding the
results of profiling, which I forgot before.

https://codereview.chromium.org/1577703006/diff/20001/src/opts/SkSwizzler_neon.h
File src/opts/SkSwizzler_neon.h (right):

https://codereview.chromium.org/1577703006/diff/20001/src/opts/SkSwizzler_neo...
src/opts/SkSwizzler_neon.h:17: static void premul_xxxa(uint32_t dst[], const
uint32_t src[], int count) {
On 2016/01/13 15:20:01, mtklein wrote:
> I think some of the clarity of this code is getting swamped by the comments.
> 
> Going to mark a few of them up with suggestions.

Acknowledged.

https://codereview.chromium.org/1577703006/diff/20001/src/opts/SkSwizzler_neo...
src/opts/SkSwizzler_neon.h:24: uint8x8x4_t rgba = vld4_u8(src8);
On 2016/01/13 15:20:01, mtklein wrote:
> Load 8 pixels.

Done.

https://codereview.chromium.org/1577703006/diff/20001/src/opts/SkSwizzler_neo...
src/opts/SkSwizzler_neon.h:26: // These should be free operations.
On 2016/01/13 15:20:01, mtklein wrote:
> I don't think you need any comment here.  I'd suggest
> 
> uint8x8_t r = rgba.val[0],
>           g = rgba.val[1],
>           b = rgba.val[2],
>           a = rgba.val[3];
> 
> We all know what r,g,b,a are, and this keeps them short and equal-length.

Done.

https://codereview.chromium.org/1577703006/diff/20001/src/opts/SkSwizzler_neo...
src/opts/SkSwizzler_neon.h:35: uint16x8_t product_reds = vmull_u8(reds, alphas);
On 2016/01/13 15:20:01, mtklein wrote:
> I'd suggest:
> 
> // Premultiply.
> r = scale(r, a);
> g = scale(g, a);
> b = scale(b, a);
> 
> where
> 
> // (x*y+127)/255, i.e. scale a byte by another.
> static uint8x8_t scale(uint8x8_t x, uint8x8_t y) { ... }
> 
> possibly using this as a sub-routine:
> 
> // (x+127)/255.
> static uint8x8_t div255_round(uint16x8_t x) { ... }
> 
> We probably don't need to prove that div255() does div255(), as our tests and
> Gold will do that, but if you want to prove it here, let's squirrel it away
> inside a div255_round() method so we don't have to keep reading it every time
we
> read this code.

I think the refactoring makes things clearer.

My preference is to keep the comments.  I agree that it is cluttered as is, but
it's be nice if someone other than you or me can understand what's going on. 
(And also I may forget that proof in year.)  Gold will verify that it works, but
I'd also like to explain why it works.

https://codereview.chromium.org/1577703006/diff/20001/src/opts/SkSwizzler_neo...
src/opts/SkSwizzler_neon.h:63: // Store 8 premultiplied pixels.
On 2016/01/13 15:20:01, mtklein wrote:
> Good from here on.

Acknowledged.

https://codereview.chromium.org/1577703006/diff/20001/src/opts/SkSwizzler_neo...
src/opts/SkSwizzler_neon.h:127: rgba.val[0] = result_blues;
On 2016/01/13 14:33:17, mtklein wrote:
> On 2016/01/13 14:05:38, msarett wrote:
> > This is identical to premul_xxxa with the exception of the swap of blue and
> red
> > here.  Can we somehow share code?
> 
> Yes, template <bool kSwapRB> on the function, then if (kSwapRB) { ... }.

Done.

scroggo

I'll defer to Mike's approval here. I'm excited about the speed-ups!

4 years, 11 months ago (2016-01-13 17:02:21 UTC) #12

mtklein

https://codereview.chromium.org/1577703006/diff/40001/src/core/SkOpts.cpp File src/core/SkOpts.cpp (right): https://codereview.chromium.org/1577703006/diff/40001/src/core/SkOpts.cpp#newcode84 src/core/SkOpts.cpp:84: decltype( premul_xxxa) premul_xxxa = sk_default::premul_xxxa<false>; I'd prefer to add ...

4 years, 11 months ago (2016-01-13 17:04:37 UTC) #13

msarett

https://codereview.chromium.org/1577703006/diff/40001/src/core/SkOpts.cpp File src/core/SkOpts.cpp (right): https://codereview.chromium.org/1577703006/diff/40001/src/core/SkOpts.cpp#newcode84 src/core/SkOpts.cpp:84: decltype( premul_xxxa) premul_xxxa = sk_default::premul_xxxa<false>; On 2016/01/13 17:04:36, mtklein ...

4 years, 11 months ago (2016-01-13 18:52:05 UTC) #14

https://codereview.chromium.org/1577703006/diff/40001/src/core/SkOpts.cpp
File src/core/SkOpts.cpp (right):

https://codereview.chromium.org/1577703006/diff/40001/src/core/SkOpts.cpp#new...
src/core/SkOpts.cpp:84: decltype(       premul_xxxa)        premul_xxxa =
sk_default::premul_xxxa<false>;
On 2016/01/13 17:04:36, mtklein wrote:
> I'd prefer to add shims to SkSwizzler_opts.h to keep the names the same.  It's
> easier to debug/profile if we see premul_swaprb_xxxa than premul_xxxa<1>.
> 
> static void       premul_xxxa(...) { premul_xxxa_should_swaprb<false>(...); }
> static void premul_swaprb_xxxa(...) { premul_xxxa_should_swaprb<true>(...); }
> 
>  _should_swaprb() ought to inline into both, and I can't imagine seeing any
> effect on speed even if it didn't.

Agreed on all counts!  Done.

https://codereview.chromium.org/1577703006/diff/40001/src/opts/SkSwizzler_opts.h
File src/opts/SkSwizzler_opts.h (right):

https://codereview.chromium.org/1577703006/diff/40001/src/opts/SkSwizzler_opt...
src/opts/SkSwizzler_opts.h:15: // These variable names in these functions just
pretend the input is RGBA.
On 2016/01/13 17:04:37, mtklein wrote:
> You're going to the dark side!
> 
> BGRA is the format to prefer, because it matches SkColor.  Written as a
literal,
> it's 0xAARRGGBB, which is what we're used to writing and breaking apart with
> shifts.
> 
> I admit this is pretty arbitrary, but the arbitrariness is why I'd like to
stick
> to the convention of writing things as if they were BGRA.  It's nice to hop
back
> and forth between files without twisting your brain.

SGTM.  I just wanted NEON and default to be the same.  Didn't know it mattered
which one I picked :).

https://codereview.chromium.org/1577703006/diff/40001/src/opts/SkSwizzler_opt...
src/opts/SkSwizzler_opts.h:20: // Rounded divide by 255 (x + 127) / 255
On 2016/01/13 17:04:36, mtklein wrote:
> add a comma after the first 255?

Done.

https://codereview.chromium.org/1577703006/diff/40001/src/opts/SkSwizzler_opt...
src/opts/SkSwizzler_opts.h:21: //    result = (x + 127) / 255
On 2016/01/13 17:04:36, mtklein wrote:
> Let's move the rest of these implementation-focused comments to just inside
the
> function body.

Done.

https://codereview.chromium.org/1577703006/diff/40001/src/opts/SkSwizzler_opt...
src/opts/SkSwizzler_opts.h:35: //    NEON supports.
On 2016/01/13 17:04:37, mtklein wrote:
> ...supports in one instruction.
> ?

Done.

https://codereview.chromium.org/1577703006/diff/40001/src/opts/SkSwizzler_opt...
src/opts/SkSwizzler_opts.h:44: // Scale a byte by another (x * y + 127) / 255
On 2016/01/13 17:04:37, mtklein wrote:
> comma after another?

Done.

https://codereview.chromium.org/1577703006/diff/40001/src/opts/SkSwizzler_opt...
src/opts/SkSwizzler_opts.h:53: const uint8_t* src8 = (const uint8_t*) src;
On 2016/01/13 17:04:36, mtklein wrote:
> Generally I prefer to cast the pointers inside each load and store, so that
all
> the constants we use match up.  It takes thought to see that src8 += 32 is
> correct, but while src += 8 is a bit clearer.

Done.

https://codereview.chromium.org/1577703006/diff/40001/src/opts/SkSwizzler_opt...
src/opts/SkSwizzler_opts.h:54: for (; i < count; i += 8) {
On 2016/01/13 17:04:36, mtklein wrote:
> I think this walks i too far for, e.g. count == 15.
> 
> I've found the safest way to get these things right is to decrement count:
> 
> while (count >= 8) {
> 
>     count -= 8;
> }
> while (count > 0) {
> 
>     count--;
> }
> 
> I  further usually write that last one as 
> 
> while (count --> 0) {
> 
> }
> 
> because it both looks cute and is a pretty strong indication that count will
not
> be referred to inside the loop.

Done.  Apologies for the messy job porting this code.

I've verified that perf is unaffected and tested again (this time with the opts
turned on :)).

https://codereview.chromium.org/1577703006/diff/40001/src/opts/SkSwizzler_opt...
src/opts/SkSwizzler_opts.h:85: dst[i] = SkPremultiplyARGBInline(src8[3],
src8[0], src8[1], src8[2]);
On 2016/01/13 17:04:37, mtklein wrote:
> dst[i] = ...
> ...
> dst++
> 
> this steps dst forward twice.
> 
> I've found the most foolproof way to write these methods never needs an 'i':
> decrement count, increment dst and src.

Done.

https://codereview.chromium.org/1577703006/diff/40001/src/opts/SkSwizzler_opt...
src/opts/SkSwizzler_opts.h:94: static void premul_xxxa(uint32_t dst[], const
uint32_t src[], int count) {
On 2016/01/13 17:04:37, mtklein wrote:
> This change is fine by me if you like, but it's not necessary once you add
> shims.

Acknowledged.  Changing back to the original.

mtklein

https://codereview.chromium.org/1577703006/diff/60001/src/opts/SkSwizzler_opts.h File src/opts/SkSwizzler_opts.h (right): https://codereview.chromium.org/1577703006/diff/60001/src/opts/SkSwizzler_opts.h#newcode55 src/opts/SkSwizzler_opts.h:55: uint8x8_t b = bgra.val[0], It'd be nice to harmonize ...

4 years, 11 months ago (2016-01-13 21:04:17 UTC) #15

msarett

Description was changed from ========== Optimized premultiplying swizzles for NEON Improves decode performance for RGBA ...

4 years, 11 months ago (2016-01-13 21:29:21 UTC) #18

msarett

Changed the commit message to measure performance improvement of: (1) Just the swizzle step. (2) ...

4 years, 11 months ago (2016-01-13 21:30:27 UTC) #19

mtklein

lgtm https://codereview.chromium.org/1577703006/diff/120001/src/opts/SkSwizzler_opts.h File src/opts/SkSwizzler_opts.h (right): https://codereview.chromium.org/1577703006/diff/120001/src/opts/SkSwizzler_opts.h#newcode113 src/opts/SkSwizzler_opts.h:113: if (count > 0) { This check isn't ...

4 years, 11 months ago (2016-01-13 21:45:34 UTC) #20

msarett

https://codereview.chromium.org/1577703006/diff/120001/src/opts/SkSwizzler_opts.h File src/opts/SkSwizzler_opts.h (right): https://codereview.chromium.org/1577703006/diff/120001/src/opts/SkSwizzler_opts.h#newcode113 src/opts/SkSwizzler_opts.h:113: if (count > 0) { On 2016/01/13 21:45:34, mtklein ...

4 years, 11 months ago (2016-01-13 22:11:30 UTC) #21

msarett

The patchset sent to the CQ was uploaded after l-g-t-m from mtklein@google.com Link to the ...

4 years, 11 months ago (2016-01-13 22:11:37 UTC) #23

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1577703006/140001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1577703006/140001

4 years, 11 months ago (2016-01-13 22:11:44 UTC) #24

commit-bot: I haz the power

Description was changed from ========== Optimized premultiplying swizzles for NEON Improves decode performance for RGBA ...

4 years, 11 months ago (2016-01-13 22:32:04 UTC) #25

Message was sent while issue was closed.

Description was changed from

==========
Optimized premultiplying swizzles for NEON

Improves decode performance for RGBA encoded PNGs.

Swizzle Time on Nexus 9 (with clang):
SwapPremul 0.44x
Premul     0.44x
Decode Time On Nexus 9 (with clang):
ZeroInit Decodes 0.85x
Regular  Decodes 0.86x

Swizzle Time on Nexus 6P (with clang)
SwapPremul 0.14x
Premul     0.14x
Decode Time On Nexus 6P (with clang):
ZeroInit Decodes 0.93x
Regular  Decodes 0.95x

Notes:
ZeroInit means memory is zero initialized, and we do not write to
memory for large sections of zero pixels (memory use opt for Android).

A profile on Nexus 9 shows that the premultiplication step of PNG
decoding is now ~5% of decode time (down from ~20%).

BUG=skia:4767
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
==========

to

==========
Optimized premultiplying swizzles for NEON

Improves decode performance for RGBA encoded PNGs.

Swizzle Time on Nexus 9 (with clang):
SwapPremul 0.44x
Premul     0.44x
Decode Time On Nexus 9 (with clang):
ZeroInit Decodes 0.85x
Regular  Decodes 0.86x

Swizzle Time on Nexus 6P (with clang)
SwapPremul 0.14x
Premul     0.14x
Decode Time On Nexus 6P (with clang):
ZeroInit Decodes 0.93x
Regular  Decodes 0.95x

Notes:
ZeroInit means memory is zero initialized, and we do not write to
memory for large sections of zero pixels (memory use opt for Android).

A profile on Nexus 9 shows that the premultiplication step of PNG
decoding is now ~5% of decode time (down from ~20%).

BUG=skia:4767
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

Committed:
https://skia.googlesource.com/skia/+/3a24f459582f2665f0e66bd35a0d8f46a1c4c72f
==========

commit-bot: I haz the power

4 years, 11 months ago (2016-01-13 22:32:05 UTC) #26

Message was sent while issue was closed.

Committed patchset #5 (id:140001) as
https://skia.googlesource.com/skia/+/3a24f459582f2665f0e66bd35a0d8f46a1c4c72f

Issue 1577703006: Optimized premultiplying swizzles for NEON (Closed)

Description

Patch Set 1 : #

Patch Set 2 : Response to comments #

Patch Set 3 : Fixed loop bugs #

Patch Set 4 : Refactor portable code #

Patch Set 5 : Remove unnecessary if statement #

Messages