|
|
DescriptionAdd SSSE3 Optimizations for premul and swap
Improves deocde performance for RGBA pngs.
Swizzler Time on z620 (clang):
SwapPremul 0.24x
Premul 0.24x
Swap 0.37x
Decode Time on z620 (clang):
Premul ZeroInit Decodes 0.88x
Unpremul ZeroInit Decodes 0.94x
Premul Regular Decodes 0.91x
Unpremul Regular Decodes 0.98x
Swizzler Time in Dell Venue 8 (gcc):
SwapPremul 0.14x
Premul 0.14x
Swap 0.08x
Decode Time on Dell Venus 8 (gcc):
Premul ZeroInit Decodes 0.79x
Premul Regular Decodes 0.77x
Note:
ZeroInit means memory is zero initialized, and we do not write to
memory for large sections of zero pixels (memory use opt for Android).
BUG=skia:4767
GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1601883002
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
Committed: https://skia.googlesource.com/skia/+/53b9d29b973f2828624f097bf110f1c7acc4b593
Patch Set 1 #Patch Set 2 : #
Total comments: 11
Patch Set 3 : Faster repacking, style, comments #
Total comments: 2
Patch Set 4 : Use shared proc #
Total comments: 3
Patch Set 5 : Move constants into premul8 proc #
Depends on Patchset: Messages
Total messages: 15 (5 generated)
Description was changed from ========== Add SSSE3 Optimizations for premul and swap Improves deocde performance for RGBA pngs. Swizzler Time on z620 (clang): SwapPremul 0.24x Premul 0.24x Swap 0.37x Decode Time on z620 (clang): Premul ZeroInit Decodes 0.88x Unpremul ZeroInit Decodes 0.94x Premul Regular Decodes 0.91x Unpremul Regular Decodes 0.98x Swizzler Time in Dell Venue 8 (gcc): SwapPremul 0.14x Premul 0.14x Swap 0.08x Decode Time on Dell Venus 8 (gcc): Premul ZeroInit Decodes 0.79x Premul Regular Decodes 0.77x Note: ZeroInit means memory is zero initialized, and we do not write to memory for large sections of zero pixels (memory use opt for Android). BUG=skia:4767 ========== to ========== Add SSSE3 Optimizations for premul and swap Improves deocde performance for RGBA pngs. Swizzler Time on z620 (clang): SwapPremul 0.24x Premul 0.24x Swap 0.37x Decode Time on z620 (clang): Premul ZeroInit Decodes 0.88x Unpremul ZeroInit Decodes 0.94x Premul Regular Decodes 0.91x Unpremul Regular Decodes 0.98x Swizzler Time in Dell Venue 8 (gcc): SwapPremul 0.14x Premul 0.14x Swap 0.08x Decode Time on Dell Venus 8 (gcc): Premul ZeroInit Decodes 0.79x Premul Regular Decodes 0.77x Note: ZeroInit means memory is zero initialized, and we do not write to memory for large sections of zero pixels (memory use opt for Android). BUG=skia:4767 GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ==========
msarett@google.com changed reviewers: + mtklein@google.com
https://codereview.chromium.org/1601883002/diff/20001/src/opts/SkSwizzler_opts.h File src/opts/SkSwizzler_opts.h (right): https://codereview.chromium.org/1601883002/diff/20001/src/opts/SkSwizzler_opt... src/opts/SkSwizzler_opts.h:180: static void premul_xxxa_should_swaprb(uint32_t dst[], const uint32_t src[], int count) { There are a lot of different ways to implement this. Specifically with the unpacking and packing of pixels. I'm not certain this is the fastest way, but it's a start. https://codereview.chromium.org/1601883002/diff/20001/src/opts/SkSwizzler_opt... src/opts/SkSwizzler_opts.h:212: // (x + 127) / 255 == ((x + 128) * 257) >> 16 for 0 <= x <= 255 * 255 Thanks to Mike for this insight.
https://codereview.chromium.org/1601883002/diff/20001/src/opts/SkSwizzler_opts.h File src/opts/SkSwizzler_opts.h (right): https://codereview.chromium.org/1601883002/diff/20001/src/opts/SkSwizzler_opt... src/opts/SkSwizzler_opts.h:196: // argb_argb_argb_argb -> aaaa_rrrr_gggg_bbbb Let's kick some of these comments a little bit higher-level: // We'll load 8 pixels into 4 registers, each holding a 16-bit component plane. // First just load the 8 interlaced pixels. __m128i lo = _mm_loadu_si128(... +0), // bgrabgra bgrabgra hi = _mm_loadu_si128(... +4); // BGRABGRA BGRABGRA // Swizzle them to 8-bit planar. lo = _mm_shuffle_epi8(lo, planar); // bbbbgggg rrrraaaa hi = _mm_shuffle_epi8(hi, planar); // BBBBGGGG RRRRAAAA __m128i bg = _mm_unpacklo(...), // bbbbBBBB ggggGGGG ra = _mm_unpackhi(...); // rrrrRRRR aaaaAAAA // Unpack to 16-bit planar in four registers. __m128i b = _mm_unpacklo(...), // b_b_b_b_ B_B_B_B_ ...; // OK, premultiply! (x+127)/255 == ((x+128)*257)>>16 for 0 <= x <= 255*255. ... https://codereview.chromium.org/1601883002/diff/20001/src/opts/SkSwizzler_opt... src/opts/SkSwizzler_opts.h:214: r = _mm_mulhi_epu16(_mm_add_epi16(_mm_mullo_epi16(a, r), _128), _257); This may be a matter of personal preference, but you might consider: auto scale = [](__m128i x, __m128i y) { return _mm_mulhi_epu16(...); }; r = scale(r,a); g = scale(g,a); b = scale(b,a); https://codereview.chromium.org/1601883002/diff/20001/src/opts/SkSwizzler_opt... src/opts/SkSwizzler_opts.h:218: // aaaa_rrrr_aaaa_rrrr I think we can do this repacking as something like: __m128i bg = b | (g << 8)), ra = r | (a << 8)), lo = unpacklo_epi16(bg, ra), hi = unpackhi_epi16(bg, ra); if (kSwapRB) { lo = shuffle_epi8(lo, swapRB) hi = shuffle_epi8(hi, swapRB) } storeu_si128(... +0, lo) storeu_si128(... +4, hi) Does that work? I think that makes the non-swapRB path a bit shorter, and the swapRB path no longer. https://codereview.chromium.org/1601883002/diff/20001/src/opts/SkSwizzler_opt... src/opts/SkSwizzler_opts.h:239: if (count >= 4) { Reminder to self to circle back here when we're happy with n >= 8. https://codereview.chromium.org/1601883002/diff/20001/src/opts/SkSwizzler_opt... src/opts/SkSwizzler_opts.h:295: const __m128i swapRB = _mm_set_epi8(15, 12, 13, 14, 11, 8, 9, 10, 7, 4, 5, 6, 3, 0, 1, 2); I often find it's easier to read these if you use _mm_setr_foo, so that the indices go in ascending order: _mm_setr_epi8(2,1,0,3, 6,5,4,7, 10,9,8,11, 14,13,12,15); If you do like them as you've written, they're perfectly fine.
https://codereview.chromium.org/1601883002/diff/20001/src/opts/SkSwizzler_opts.h File src/opts/SkSwizzler_opts.h (right): https://codereview.chromium.org/1601883002/diff/20001/src/opts/SkSwizzler_opt... src/opts/SkSwizzler_opts.h:196: // argb_argb_argb_argb -> aaaa_rrrr_gggg_bbbb On 2016/01/19 15:59:14, mtklein wrote: > Let's kick some of these comments a little bit higher-level: > > // We'll load 8 pixels into 4 registers, each holding a 16-bit component plane. > > // First just load the 8 interlaced pixels. > __m128i lo = _mm_loadu_si128(... +0), // bgrabgra bgrabgra > hi = _mm_loadu_si128(... +4); // BGRABGRA BGRABGRA > > // Swizzle them to 8-bit planar. > lo = _mm_shuffle_epi8(lo, planar); // bbbbgggg rrrraaaa > hi = _mm_shuffle_epi8(hi, planar); // BBBBGGGG RRRRAAAA > __m128i bg = _mm_unpacklo(...), // bbbbBBBB ggggGGGG > ra = _mm_unpackhi(...); // rrrrRRRR aaaaAAAA > > // Unpack to 16-bit planar in four registers. > __m128i b = _mm_unpacklo(...), // b_b_b_b_ B_B_B_B_ > ...; > > // OK, premultiply! (x+127)/255 == ((x+128)*257)>>16 for 0 <= x <= 255*255. > ... Done. Ugggh, for some reason I thought the rest of this file was written as ARGB (so I did that on purpose). But I'm realizing now that it's BGRA. And I agree that BGRA is easier to think about. https://codereview.chromium.org/1601883002/diff/20001/src/opts/SkSwizzler_opt... src/opts/SkSwizzler_opts.h:214: r = _mm_mulhi_epu16(_mm_add_epi16(_mm_mullo_epi16(a, r), _128), _257); On 2016/01/19 15:59:14, mtklein wrote: > This may be a matter of personal preference, but you might consider: > > auto scale = [](__m128i x, __m128i y) { return _mm_mulhi_epu16(...); }; > r = scale(r,a); > g = scale(g,a); > b = scale(b,a); Leaving as is, though I'm kind of indifferent. I needed to pass references to _128 and _257 to "scale" in order to get it to compile, and I found it a bit confusing. https://codereview.chromium.org/1601883002/diff/20001/src/opts/SkSwizzler_opt... src/opts/SkSwizzler_opts.h:218: // aaaa_rrrr_aaaa_rrrr On 2016/01/19 15:59:14, mtklein wrote: > I think we can do this repacking as something like: > > __m128i bg = b | (g << 8)), > ra = r | (a << 8)), > lo = unpacklo_epi16(bg, ra), > hi = unpackhi_epi16(bg, ra); > > if (kSwapRB) { > lo = shuffle_epi8(lo, swapRB) > hi = shuffle_epi8(hi, swapRB) > } > storeu_si128(... +0, lo) > storeu_si128(... +4, hi) > > Does that work? I think that makes the non-swapRB path a bit shorter, and the > swapRB path no longer. Yes this is better! Let's even swap BR in the "swizzle to planar step". Then it is the same cost as not-swapping. https://codereview.chromium.org/1601883002/diff/20001/src/opts/SkSwizzler_opt... src/opts/SkSwizzler_opts.h:295: const __m128i swapRB = _mm_set_epi8(15, 12, 13, 14, 11, 8, 9, 10, 7, 4, 5, 6, 3, 0, 1, 2); On 2016/01/19 15:59:15, mtklein wrote: > I often find it's easier to read these if you use _mm_setr_foo, so that the > indices go in ascending order: > _mm_setr_epi8(2,1,0,3, 6,5,4,7, 10,9,8,11, 14,13,12,15); > > If you do like them as you've written, they're perfectly fine. I think you're right.
https://codereview.chromium.org/1601883002/diff/40001/src/opts/SkSwizzler_opts.h File src/opts/SkSwizzler_opts.h (right): https://codereview.chromium.org/1601883002/diff/40001/src/opts/SkSwizzler_opt... src/opts/SkSwizzler_opts.h:230: if (count >= 4) { OK, now that we've got count >= 8 in shape, let's do something like this? auto premul8 = [](__m128i* lo, __m128i* hi) { // Swizzle them to 8-bit planar. *lo = _mm_shuffle_epi8(*lo, planar); ... *lo = _mm_unpacklo_epi16(bg, ra); *hi = _mm_unpackhi_epi16(bg, ra); }; while (n >= 8) { __m128i lo = _mm_loadu_si128(...), hi = _mm_loadu_si128(....); premul8(&lo, &hi); _mm_storeu_si128(..., lo); _mm_storeu_si128(..., hi); ... } if (n >= 4) { __m128i lo = _mm_loadu_si128(...); hi = _mm_setzero_si128(); premul8(&lo, &hi); _mm_storeu_si128(..., lo); ... } handle n <= 3
Patchset #4 (id:60001) has been deleted
Per our conversation in person, when calling premul8(lo, zeros), the compiler is smart enough to remove the extra _mm_unpackhi_epi16() at the bottom of the proc, but does do an extra _mm_shuffle_epi8() at the start of the proc. Still, let's not worry too much about the tail. https://codereview.chromium.org/1601883002/diff/40001/src/opts/SkSwizzler_opts.h File src/opts/SkSwizzler_opts.h (right): https://codereview.chromium.org/1601883002/diff/40001/src/opts/SkSwizzler_opt... src/opts/SkSwizzler_opts.h:230: if (count >= 4) { On 2016/01/19 18:28:30, mtklein wrote: > OK, now that we've got count >= 8 in shape, let's do something like this? > > auto premul8 = [](__m128i* lo, __m128i* hi) { > // Swizzle them to 8-bit planar. > *lo = _mm_shuffle_epi8(*lo, planar); > ... > *lo = _mm_unpacklo_epi16(bg, ra); > *hi = _mm_unpackhi_epi16(bg, ra); > }; > > while (n >= 8) { > __m128i lo = _mm_loadu_si128(...), > hi = _mm_loadu_si128(....); > premul8(&lo, &hi); > _mm_storeu_si128(..., lo); > _mm_storeu_si128(..., hi); > ... > } > if (n >= 4) { > __m128i lo = _mm_loadu_si128(...); > hi = _mm_setzero_si128(); > premul8(&lo, &hi); > _mm_storeu_si128(..., lo); > ... > } > handle n <= 3 Done.
lgtm https://codereview.chromium.org/1601883002/diff/70005/src/opts/SkSwizzler_opts.h File src/opts/SkSwizzler_opts.h (right): https://codereview.chromium.org/1601883002/diff/70005/src/opts/SkSwizzler_opt... src/opts/SkSwizzler_opts.h:191: auto premul8 = [&zeros, &_128, &_257, &planar](__m128i* lo, __m128i* hi) { Just out of curiosity, what happens to the codegen / perf if we just move these constants inside? I'd hope it's unaffected? I don't personally mind writing this as auto premul8 = [&] { ... }; but I will never object to listing out each closed over variable. https://codereview.chromium.org/1601883002/diff/70005/src/opts/SkSwizzler_opt... src/opts/SkSwizzler_opts.h:191: auto premul8 = [&zeros, &_128, &_257, &planar](__m128i* lo, __m128i* hi) { The comments inside the while loop are now probably not necessary. Our SSE code works with interlaced pixels more or less by default, as that's the memory layout. If you want to leave a note about interlacing vs. planar, let's put it here on premul8: // *lo and *hi should point to interlaced pixels (bgra bgra ...) and will be left that way.
https://codereview.chromium.org/1601883002/diff/70005/src/opts/SkSwizzler_opts.h File src/opts/SkSwizzler_opts.h (right): https://codereview.chromium.org/1601883002/diff/70005/src/opts/SkSwizzler_opt... src/opts/SkSwizzler_opts.h:191: auto premul8 = [&zeros, &_128, &_257, &planar](__m128i* lo, __m128i* hi) { On 2016/01/19 20:15:02, mtklein wrote: > The comments inside the while loop are now probably not necessary. Our SSE code > works with interlaced pixels more or less by default, as that's the memory > layout. If you want to leave a note about interlacing vs. planar, let's put it > here on premul8: > > // *lo and *hi should point to interlaced pixels (bgra bgra ...) and will be > left that way. Codegen is unaffected by moving the constants inside the function. Comments also removed.
The CQ bit was checked by mtklein@google.com
lgtm
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1601883002/90001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1601883002/90001
Message was sent while issue was closed.
Description was changed from ========== Add SSSE3 Optimizations for premul and swap Improves deocde performance for RGBA pngs. Swizzler Time on z620 (clang): SwapPremul 0.24x Premul 0.24x Swap 0.37x Decode Time on z620 (clang): Premul ZeroInit Decodes 0.88x Unpremul ZeroInit Decodes 0.94x Premul Regular Decodes 0.91x Unpremul Regular Decodes 0.98x Swizzler Time in Dell Venue 8 (gcc): SwapPremul 0.14x Premul 0.14x Swap 0.08x Decode Time on Dell Venus 8 (gcc): Premul ZeroInit Decodes 0.79x Premul Regular Decodes 0.77x Note: ZeroInit means memory is zero initialized, and we do not write to memory for large sections of zero pixels (memory use opt for Android). BUG=skia:4767 GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ========== to ========== Add SSSE3 Optimizations for premul and swap Improves deocde performance for RGBA pngs. Swizzler Time on z620 (clang): SwapPremul 0.24x Premul 0.24x Swap 0.37x Decode Time on z620 (clang): Premul ZeroInit Decodes 0.88x Unpremul ZeroInit Decodes 0.94x Premul Regular Decodes 0.91x Unpremul Regular Decodes 0.98x Swizzler Time in Dell Venue 8 (gcc): SwapPremul 0.14x Premul 0.14x Swap 0.08x Decode Time on Dell Venus 8 (gcc): Premul ZeroInit Decodes 0.79x Premul Regular Decodes 0.77x Note: ZeroInit means memory is zero initialized, and we do not write to memory for large sections of zero pixels (memory use opt for Android). BUG=skia:4767 GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Committed: https://skia.googlesource.com/skia/+/53b9d29b973f2828624f097bf110f1c7acc4b593 ==========
Message was sent while issue was closed.
Committed patchset #5 (id:90001) as https://skia.googlesource.com/skia/+/53b9d29b973f2828624f097bf110f1c7acc4b593 |