Issue 1601883002: Add SSSE3 Optimizations for premul and swap (Closed)

Created:
4 years, 11 months ago by msarett

Modified:
4 years, 11 months ago

Reviewers:
mtklein

CC:
reviews_skia.org

Base URL:
https://skia.googlesource.com/skia.git@f-and-x

Target Ref:
refs/heads/master

Project:
skia

Visibility:
Public.

Description

Add SSSE3 Optimizations for premul and swap

Improves decode performance for RGBA pngs.

Swizzler Time on z620 (clang):
SwapPremul 0.24x
Premul     0.24x
Swap       0.37x
Decode Time on z620 (clang):
Premul   ZeroInit Decodes 0.88x
Unpremul ZeroInit Decodes 0.94x
Premul   Regular  Decodes 0.91x
Unpremul Regular  Decodes 0.98x

Swizzler Time in Dell Venue 8 (gcc):
SwapPremul 0.14x
Premul     0.14x
Swap       0.08x
Decode Time on Dell Venue 8 (gcc):
Premul   ZeroInit Decodes 0.79x
Premul   Regular  Decodes 0.77x

Note:
ZeroInit means memory is zero initialized, and we do not write to
memory for large sections of zero pixels (memory use opt for Android).

BUG=skia:4767
GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1601883002
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

Committed: https://skia.googlesource.com/skia/+/53b9d29b973f2828624f097bf110f1c7acc4b593

Patch Set 1

Patch Set 2

Total comments: 11

Patch Set 3 : Faster repacking, style, comments

Total comments: 2

Patch Set 4 : Use shared proc

Total comments: 3

Patch Set 5 : Move constants into premul8 proc

Created: 4 years, 11 months ago

Stats (+101 lines, -0 lines)

File	Chunks	Lines	Comments
M src/opts/SkOpts_ssse3.cpp	2 chunks	+5 lines, -0 lines	0 comments
M src/opts/SkSwizzler_opts.h	1 chunk	+96 lines, -0 lines	0 comments

Depends on Patchset:

Issue 1582083005 Patch 20001

Messages

Total messages: 15 (5 generated)

msarett

Description was changed from ========== Add SSSE3 Optimizations for premul and swap Improves decode performance ...

4 years, 11 months ago (2016-01-18 20:32:06 UTC) #1

msarett

https://codereview.chromium.org/1601883002/diff/20001/src/opts/SkSwizzler_opts.h File src/opts/SkSwizzler_opts.h (right): https://codereview.chromium.org/1601883002/diff/20001/src/opts/SkSwizzler_opts.h#newcode180 src/opts/SkSwizzler_opts.h:180: static void premul_xxxa_should_swaprb(uint32_t dst[], const uint32_t src[], int count) ...

4 years, 11 months ago (2016-01-18 20:35:05 UTC) #3

mtklein

4 years, 11 months ago (2016-01-19 15:59:15 UTC) #4

https://codereview.chromium.org/1601883002/diff/20001/src/opts/SkSwizzler_opts.h
File src/opts/SkSwizzler_opts.h (right):

https://codereview.chromium.org/1601883002/diff/20001/src/opts/SkSwizzler_opt...
src/opts/SkSwizzler_opts.h:196: // argb_argb_argb_argb -> aaaa_rrrr_gggg_bbbb
Let's kick some of these comments a little bit higher-level:

// We'll load 8 pixels into 4 registers, each holding a 16-bit component plane.

// First just load the 8 interlaced pixels.
__m128i lo = _mm_loadu_si128(... +0), // bgrabgra bgrabgra
        hi = _mm_loadu_si128(... +4); // BGRABGRA BGRABGRA

// Swizzle them to 8-bit planar.
lo = _mm_shuffle_epi8(lo, planar);    // bbbbgggg rrrraaaa
hi = _mm_shuffle_epi8(hi, planar);    // BBBBGGGG RRRRAAAA
__m128i bg = _mm_unpacklo(...),       // bbbbBBBB ggggGGGG
        ra = _mm_unpackhi(...);       // rrrrRRRR aaaaAAAA

// Unpack to 16-bit planar in four registers.
__m128i b = _mm_unpacklo(...),        // b_b_b_b_ B_B_B_B_
        ...;

// OK, premultiply!  (x+127)/255 == ((x+128)*257)>>16 for 0 <= x <= 255*255.
...
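The rounding identity quoted in the comment above can be checked exhaustively in scalar code. The following is a minimal sketch, not the code under review; the helper names (`div255_exact`, `div255_fast`, `div255_forms_agree`, `premul_component`) are made up for illustration. In the vector version, the add and the multiply-high would presumably correspond to `_mm_add_epi16` and `_mm_mulhi_epu16`, as in the line quoted at src/opts/SkSwizzler_opts.h:214.

```cpp
#include <cassert>
#include <cstdint>

// Reference result: division by 255, rounded to nearest.
constexpr uint32_t div255_exact(uint32_t x) { return (x + 127) / 255; }

// Multiply-and-shift form from the review comment.
// Hypothetical helper name; only valid for a product of two bytes.
inline uint32_t div255_fast(uint32_t x) {
    assert(x <= 255u * 255u);
    return ((x + 128) * 257) >> 16;
}

// Returns true when both forms agree for every x in [0, 255*255].
inline bool div255_forms_agree() {
    for (uint32_t x = 0; x <= 255u * 255u; ++x) {
        if (div255_exact(x) != div255_fast(x)) return false;
    }
    return true;
}

// Premultiply one 8-bit component by an 8-bit alpha.
inline uint8_t premul_component(uint8_t c, uint8_t a) {
    return static_cast<uint8_t>(div255_fast(uint32_t(c) * a));
}
```

The intermediate value (x+128) never exceeds 65153, so it still fits in an unsigned 16-bit lane, which is what lets the vector code stay in 16-bit arithmetic until the multiply-high.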

https://codereview.chromium.org/1601883002/diff/20001/src/opts/SkSwizzler_opt...
src/opts/SkSwizzler_opts.h:214: r =
_mm_mulhi_epu16(_mm_add_epi16(_mm_mullo_epi16(a, r), _128), _257);
This may be a matter of personal preference, but you might consider:

auto scale = [](__m128i x, __m128i y) { return _mm_mulhi_epu16(...); };
r = scale(r,a);
g = scale(g,a);
b = scale(b,a);

https://codereview.chromium.org/1601883002/diff/20001/src/opts/SkSwizzler_opt...
src/opts/SkSwizzler_opts.h:218: // aaaa_rrrr_aaaa_rrrr
I think we can do this repacking as something like:

__m128i bg = b | (g << 8),
        ra = r | (a << 8),
        lo = unpacklo_epi16(bg, ra),
        hi = unpackhi_epi16(bg, ra);

if (kSwapRB) {
    lo = shuffle_epi8(lo, swapRB)
    hi = shuffle_epi8(hi, swapRB)
}
storeu_si128(... +0, lo)
storeu_si128(... +4, hi)

Does that work?  I think that makes the non-swapRB path a bit shorter, and the
swapRB path no longer.
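As a scalar model of the repacking suggested above: each 16-bit lane holds one component in its low byte, so OR-ing in a neighbour shifted left by 8 builds a `bg` and an `ra` half-pixel, and interleaving those 16-bit lanes (what `unpacklo_epi16` / `unpackhi_epi16` do for the low and high four lanes) yields packed pixels with B in the lowest byte. The types `Plane` and `Pixels` and the function `repack_bgra` are illustrative names, not part of the patch.

```cpp
#include <array>
#include <cstdint>

using Plane  = std::array<uint16_t, 8>;  // one component per 16-bit lane, value in the low byte
using Pixels = std::array<uint32_t, 8>;  // 8 packed pixels, B in the lowest byte

// Scalar stand-in for: bg = b | (g << 8); ra = r | (a << 8);
// lo = unpacklo_epi16(bg, ra); hi = unpackhi_epi16(bg, ra).
// Pixels 0..3 correspond to `lo`, pixels 4..7 to `hi`.
inline Pixels repack_bgra(const Plane& b, const Plane& g,
                          const Plane& r, const Plane& a) {
    Pixels out{};
    for (int i = 0; i < 8; ++i) {
        const uint16_t bg = static_cast<uint16_t>(b[i] | (g[i] << 8));
        const uint16_t ra = static_cast<uint16_t>(r[i] | (a[i] << 8));
        // Interleaving 16-bit lanes puts bg in the low half and ra in the high half.
        out[i] = uint32_t(bg) | (uint32_t(ra) << 16);
    }
    return out;
}
```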

https://codereview.chromium.org/1601883002/diff/20001/src/opts/SkSwizzler_opt...
src/opts/SkSwizzler_opts.h:239: if (count >= 4) {
Reminder to self to circle back here when we're happy with n >= 8.

https://codereview.chromium.org/1601883002/diff/20001/src/opts/SkSwizzler_opt...
src/opts/SkSwizzler_opts.h:295: const __m128i swapRB = _mm_set_epi8(15, 12, 13,
14, 11, 8, 9, 10, 7, 4, 5, 6, 3, 0, 1, 2);
I often find it's easier to read these if you use _mm_setr_foo, so that the
indices go in ascending order:
   _mm_setr_epi8(2,1,0,3, 6,5,4,7, 10,9,8,11, 14,13,12,15);

If you do like them as you've written, they're perfectly fine.
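The two spellings of the swapRB mask describe the same 16 bytes: `_mm_set_epi8` takes its arguments from byte 15 down to byte 0, while `_mm_setr_epi8` takes them from byte 0 up. A small scalar sketch makes that concrete; `Bytes16`, `from_set_order`, and `shuffle_bytes` are hypothetical helpers that only model the intrinsics, they are not Skia code.

```cpp
#include <array>
#include <cstdint>

using Bytes16 = std::array<uint8_t, 16>;

// Arguments as written for _mm_set_epi8 (byte 15 first) converted to
// memory order (byte 0 first), which is the order _mm_setr_epi8 uses.
inline Bytes16 from_set_order(const Bytes16& high_to_low) {
    Bytes16 out{};
    for (int i = 0; i < 16; ++i) out[i] = high_to_low[15 - i];
    return out;
}

// Scalar model of _mm_shuffle_epi8: out[i] = src[mask[i] & 15],
// or 0 when the high bit of mask[i] is set.
inline Bytes16 shuffle_bytes(const Bytes16& src, const Bytes16& mask) {
    Bytes16 out{};
    for (int i = 0; i < 16; ++i) {
        out[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
    }
    return out;
}

// The mask as originally written (set order) and as suggested (setr order).
static const Bytes16 kSwapRB_set  = {15,12,13,14, 11,8,9,10, 7,4,5,6, 3,0,1,2};
static const Bytes16 kSwapRB_setr = {2,1,0,3, 6,5,4,7, 10,9,8,11, 14,13,12,15};
```

Read in setr order, each group of four is visibly "swap bytes 0 and 2, keep bytes 1 and 3", which is the readability point being made.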

msarett

4 years, 11 months ago (2016-01-19 17:34:38 UTC) #5

https://codereview.chromium.org/1601883002/diff/20001/src/opts/SkSwizzler_opts.h
File src/opts/SkSwizzler_opts.h (right):

https://codereview.chromium.org/1601883002/diff/20001/src/opts/SkSwizzler_opt...
src/opts/SkSwizzler_opts.h:196: // argb_argb_argb_argb -> aaaa_rrrr_gggg_bbbb
On 2016/01/19 15:59:14, mtklein wrote:
> Let's kick some of these comments a little bit higher-level:
> 
> // We'll load 8 pixels into 4 registers, each holding a 16-bit component plane.
> 
> // First just load the 8 interlaced pixels.
> __m128i lo = _mm_loadu_si128(... +0), // bgrabgra bgrabgra
>         hi = _mm_loadu_si128(... +4); // BGRABGRA BGRABGRA
> 
> // Swizzle them to 8-bit planar.
> lo = _mm_shuffle_epi8(lo, planar);    // bbbbgggg rrrraaaa
> hi = _mm_shuffle_epi8(hi, planar);    // BBBBGGGG RRRRAAAA
> __m128i bg = _mm_unpacklo(...),       // bbbbBBBB ggggGGGG
>         ra = _mm_unpackhi(...);       // rrrrRRRR aaaaAAAA
> 
> // Unpack to 16-bit planar in four registers.
> __m128i b = _mm_unpacklo(...),        // b_b_b_b_ B_B_B_B_
>         ...;
> 
> // OK, premultiply!  (x+127)/255 == ((x+128)*257)>>16 for 0 <= x <= 255*255.
> ...

Done.

Ugggh, for some reason I thought the rest of this file was written as ARGB (so I
did that on purpose).  But I'm realizing now that it's BGRA.  And I agree that
BGRA is easier to think about.

https://codereview.chromium.org/1601883002/diff/20001/src/opts/SkSwizzler_opt...
src/opts/SkSwizzler_opts.h:214: r =
_mm_mulhi_epu16(_mm_add_epi16(_mm_mullo_epi16(a, r), _128), _257);
On 2016/01/19 15:59:14, mtklein wrote:
> This may be a matter of personal preference, but you might consider:
> 
> auto scale = [](__m128i x, __m128i y) { return _mm_mulhi_epu16(...); };
> r = scale(r,a);
> g = scale(g,a);
> b = scale(b,a);

Leaving as is, though I'm kind of indifferent.

I needed to pass references to _128 and _257 to "scale" in order to get it to
compile, and I found it a bit confusing.
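One plausible reading of the compile problem is ordinary C++ lambda capture: a lambda written with an empty capture list `[]` cannot odr-use local variables such as the `_128` and `_257` vectors, so they have to be captured (for example by reference) or passed as arguments. Below is a scalar sketch under that assumption; `Lanes`, `splat`, and `premul_planes` are illustrative stand-ins for `__m128i` values and the real proc.

```cpp
#include <array>
#include <cstdint>

using Lanes = std::array<uint16_t, 8>;  // scalar stand-in for a __m128i of 16-bit lanes

inline Lanes splat(uint16_t v) {
    Lanes l;
    l.fill(v);
    return l;
}

// Premultiplies three component planes by alpha, in place.
inline void premul_planes(Lanes& r, Lanes& g, Lanes& b, const Lanes& a) {
    const Lanes k128 = splat(128);
    const Lanes k257 = splat(257);

    // With `[]` here the body could not read k128 or k257; capturing them
    // by reference makes the lambda well-formed.
    auto scale = [&k128, &k257](const Lanes& c, const Lanes& alpha) {
        Lanes out{};
        for (int i = 0; i < 8; ++i) {
            const uint32_t x = uint32_t(c[i]) * alpha[i] + k128[i];
            out[i] = static_cast<uint16_t>((x * k257[i]) >> 16);
        }
        return out;
    };

    r = scale(r, a);
    g = scale(g, a);
    b = scale(b, a);
}
```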

https://codereview.chromium.org/1601883002/diff/20001/src/opts/SkSwizzler_opt...
src/opts/SkSwizzler_opts.h:218: // aaaa_rrrr_aaaa_rrrr
On 2016/01/19 15:59:14, mtklein wrote:
> I think we can do this repacking as something like:
> 
> __m128i bg = b | (g << 8),
>         ra = r | (a << 8),
>         lo = unpacklo_epi16(bg, ra),
>         hi = unpackhi_epi16(bg, ra);
> 
> if (kSwapRB) {
>     lo = shuffle_epi8(lo, swapRB)
>     hi = shuffle_epi8(hi, swapRB)
> }
> storeu_si128(... +0, lo)
> storeu_si128(... +4, hi)
> 
> Does that work?  I think that makes the non-swapRB path a bit shorter, and the
> swapRB path no longer.

Yes this is better!

Let's even swap BR in the "swizzle to planar step".  Then it is the same cost as
not-swapping.
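The idea of folding the swap into the planar shuffle can be sketched with byte masks: shuffling with swapRB and then with the planar mask is the same as one shuffle with a composed mask, so the swap costs nothing extra. The mask values below are assumptions chosen to match the comments in this thread (BGRA pixels, planar order b, g, r, a); `Mask`, `pshufb_model`, and `compose_masks` are hypothetical names.

```cpp
#include <array>
#include <cstdint>

using Mask = std::array<uint8_t, 16>;

// Scalar model of _mm_shuffle_epi8 for masks whose entries are all in 0..15.
inline Mask pshufb_model(const Mask& src, const Mask& mask) {
    Mask out{};
    for (int i = 0; i < 16; ++i) out[i] = src[mask[i] & 0x0F];
    return out;
}

// Returns the single mask equivalent to shuffling by `first`, then by `second`.
inline Mask compose_masks(const Mask& first, const Mask& second) {
    return pshufb_model(first, second);
}

// All masks are listed in byte-0-first (setr) order.
static const Mask kSwapRB       = {2,1,0,3, 6,5,4,7, 10,9,8,11, 14,13,12,15};
static const Mask kPlanar       = {0,4,8,12, 1,5,9,13, 2,6,10,14, 3,7,11,15};
static const Mask kPlanarSwapRB = {2,6,10,14, 1,5,9,13, 0,4,8,12, 3,7,11,15};
```

Selecting between `kPlanar` and `kPlanarSwapRB` once, outside the loop, would keep the per-pixel work identical for the swapping and non-swapping paths.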

https://codereview.chromium.org/1601883002/diff/20001/src/opts/SkSwizzler_opt...
src/opts/SkSwizzler_opts.h:295: const __m128i swapRB = _mm_set_epi8(15, 12, 13,
14, 11, 8, 9, 10, 7, 4, 5, 6, 3, 0, 1, 2);
On 2016/01/19 15:59:15, mtklein wrote:
> I often find it's easier to read these if you use _mm_setr_foo, so that the
> indices go in ascending order:
>    _mm_setr_epi8(2,1,0,3, 6,5,4,7, 10,9,8,11, 14,13,12,15);
> 
> If you do like them as you've written, they're perfectly fine.

I think you're right.

mtklein

https://codereview.chromium.org/1601883002/diff/40001/src/opts/SkSwizzler_opts.h File src/opts/SkSwizzler_opts.h (right): https://codereview.chromium.org/1601883002/diff/40001/src/opts/SkSwizzler_opts.h#newcode230 src/opts/SkSwizzler_opts.h:230: if (count >= 4) { OK, now that we've ...

4 years, 11 months ago (2016-01-19 18:28:30 UTC) #6

msarett

Per our conversation in person, when calling premul8(lo, zeros), the compiler is smart enough to ...

4 years, 11 months ago (2016-01-19 19:17:43 UTC) #8

mtklein

lgtm https://codereview.chromium.org/1601883002/diff/70005/src/opts/SkSwizzler_opts.h File src/opts/SkSwizzler_opts.h (right): https://codereview.chromium.org/1601883002/diff/70005/src/opts/SkSwizzler_opts.h#newcode191 src/opts/SkSwizzler_opts.h:191: auto premul8 = [&zeros, &_128, &_257, &planar](__m128i* lo, ...

4 years, 11 months ago (2016-01-19 20:15:02 UTC) #9

msarett

https://codereview.chromium.org/1601883002/diff/70005/src/opts/SkSwizzler_opts.h File src/opts/SkSwizzler_opts.h (right): https://codereview.chromium.org/1601883002/diff/70005/src/opts/SkSwizzler_opts.h#newcode191 src/opts/SkSwizzler_opts.h:191: auto premul8 = [&zeros, &_128, &_257, &planar](__m128i* lo, __m128i* ...

4 years, 11 months ago (2016-01-19 21:02:39 UTC) #10

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1601883002/90001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1601883002/90001

4 years, 11 months ago (2016-01-19 21:03:35 UTC) #13

commit-bot: I haz the power

4 years, 11 months ago (2016-01-19 21:18:00 UTC) #14

Message was sent while issue was closed.

Description was changed from

==========
Add SSSE3 Optimizations for premul and swap

Improves decode performance for RGBA pngs.

Swizzler Time on z620 (clang):
SwapPremul 0.24x
Premul     0.24x
Swap       0.37x
Decode Time on z620 (clang):
Premul   ZeroInit Decodes 0.88x
Unpremul ZeroInit Decodes 0.94x
Premul   Regular  Decodes 0.91x
Unpremul Regular  Decodes 0.98x

Swizzler Time in Dell Venue 8 (gcc):
SwapPremul 0.14x
Premul     0.14x
Swap       0.08x
Decode Time on Dell Venue 8 (gcc):
Premul   ZeroInit Decodes 0.79x
Premul   Regular  Decodes 0.77x

Note:
ZeroInit means memory is zero initialized, and we do not write to
memory for large sections of zero pixels (memory use opt for Android).

BUG=skia:4767
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
==========

to

==========
Add SSSE3 Optimizations for premul and swap

Improves decode performance for RGBA pngs.

Swizzler Time on z620 (clang):
SwapPremul 0.24x
Premul     0.24x
Swap       0.37x
Decode Time on z620 (clang):
Premul   ZeroInit Decodes 0.88x
Unpremul ZeroInit Decodes 0.94x
Premul   Regular  Decodes 0.91x
Unpremul Regular  Decodes 0.98x

Swizzler Time in Dell Venue 8 (gcc):
SwapPremul 0.14x
Premul     0.14x
Swap       0.08x
Decode Time on Dell Venue 8 (gcc):
Premul   ZeroInit Decodes 0.79x
Premul   Regular  Decodes 0.77x

Note:
ZeroInit means memory is zero initialized, and we do not write to
memory for large sections of zero pixels (memory use opt for Android).

BUG=skia:4767
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

Committed:
https://skia.googlesource.com/skia/+/53b9d29b973f2828624f097bf110f1c7acc4b593
==========

commit-bot: I haz the power

4 years, 11 months ago (2016-01-19 21:18:01 UTC) #15

Message was sent while issue was closed.

Committed patchset #5 (id:90001) as
https://skia.googlesource.com/skia/+/53b9d29b973f2828624f097bf110f1c7acc4b593