|
|
Created:
5 years, 10 months ago by henrik.smiding Modified:
5 years, 10 months ago CC:
reviews_skia.org Base URL:
https://skia.googlesource.com/skia.git@master Target Ref:
refs/heads/master Project:
skia Visibility:
Public. |
DescriptionAdd SSE optimization of Color32A_D565
Adds an SSE4.1 version of the Color32A_D565 function.
Performance improvement in the following benchmarks:
Xfermode_SrcOver - ~100%
luma_colorfilter_large - ~150%
luma_colorfilter_small - ~60%
tablebench - ~10%
chart_bw - ~10%
(Measured on a Atom Silvermont core)
Signed-off-by: Henrik Smiding <henrik.smiding@intel.com>
Committed: https://skia.googlesource.com/skia/+/4e65473069b3a9382292577875366bd86c5777c8
Patch Set 1 #
Total comments: 12
Patch Set 2 : Simplified code #Patch Set 3 : Fixed VS warning #
Messages
Total messages: 20 (8 generated)
henrik.smiding@intel.com changed reviewers: + mtklein@google.com, reed@google.com
henrik.smiding@intel.com changed reviewers: + joakim.landberg@intel.com
Please have a look. I tried adding a bunch of trybots, but they all went purple. Not sure what that means. Are they deprecated, or did I loose my trybot powers?
On 2015/01/30 14:47:53, henrik.smiding wrote: > Please have a look. > I tried adding a bunch of trybots, but they all went purple. Not sure what that > means. Are they deprecated, or did I loose my trybot powers? 'internal Buildbot error'
The CQ bit was checked by mtklein@google.com
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/892623002/1
Note for Reviewers: The CQ is waiting for an approval. If you believe that the CL is not ready yet, or if you would like to L-G-T-M with comments then please uncheck the CQ checkbox. Waiting for LGTM from valid reviewer(s) till 2015-01-30 22:33 UTC
The CQ bit was unchecked by mtklein@google.com
I've been looking at this a bit, and it looks like the current scalar implementation of Color32A_D565 is not very good quality (even ignoring the fact that we're not dithering). You may want to table this CL for a little while, because I think I'm about to move the baseline out from underneath it. I'm hesitant to vectorize a low-quality algorithm that appears to only be low quality out of concern for speed. It seems straightforward to improve the precision of the blend we're doing (to perfect), and then if that's still too slow, vectorize that fixed algorithm. I believe the correct algorithm for Color32A_D565 is: for (int i = 0; i < count; i++) { dst[i] = SkSrcOver32To16(src, dst[i]); } Clearly there's some work there inside SkSrcOver32To16 that can be hoisted out of the loop. https://codereview.chromium.org/892623002/diff/1/src/core/SkBlitRow_D16.cpp File src/core/SkBlitRow_D16.cpp (right): https://codereview.chromium.org/892623002/diff/1/src/core/SkBlitRow_D16.cpp#n... src/core/SkBlitRow_D16.cpp:215: static void Color32_D565(uint16_t dst[], SkPMColor src, int count, int x, int y) { The edits to this file seem unambiguously good and orthogonal to the rest of the change. Why don't you spin it off so we can land it while reviewing the rest? https://codereview.chromium.org/892623002/diff/1/src/opts/SkBlitRow_opts_SSE4... File src/opts/SkBlitRow_opts_SSE4.cpp (right): https://codereview.chromium.org/892623002/diff/1/src/opts/SkBlitRow_opts_SSE4... src/opts/SkBlitRow_opts_SSE4.cpp:74: SkASSERT((((size_t)dst) & 0x01) == 0); Seems paranoid? If we need to keep this, let's write it as SkASSERT(SkIsAlign2((uintptr_t)dst)); https://codereview.chromium.org/892623002/diff/1/src/opts/SkBlitRow_opts_SSE4... src/opts/SkBlitRow_opts_SSE4.cpp:82: if (count >= 8) { This logic seems complex, and I'm worried we're overfitting to the benchmarks or machine we're using to test. Why don't we start by stripping all the smarts away and just use unaligned memory access? We can watch perf.skia.org to see its performance impact across a variety of machines, and then refine this if needed. uint32_t src_expand = ...; unsigned scale = ... ; const __m128i src_expand_wide = ...; const __m128i scale_wide = ...; const __m128i mask_green = ...; __m128i* dst_wide = (__m128i*)dst; while (count >= 8) { __m128i pixels = _mm_loadu_si128(dst_wide); ... _mm_storeu_si128(dst_wide++, pixels); } while (count --> 0) { ... } https://codereview.chromium.org/892623002/diff/1/src/opts/SkBlitRow_opts_SSE4... src/opts/SkBlitRow_opts_SSE4.cpp:91: if ((count >= 64) || ((((size_t)dst) & 0x0F) == 0)) { How'd you pick 64? https://codereview.chromium.org/892623002/diff/1/src/opts/SkBlitRow_opts_SSE4... src/opts/SkBlitRow_opts_SSE4.cpp:94: uint32_t dst_expand = SkExpand_rgb_16(*dst) * scale; You've got a lot of repeated code here. If it needs to stick around, which I hope it doesn't, please factor the scalar and the 8x version out into their own functions, e.g. static void Color32A_D565_1x(uint16_t* dst, unsigned scale, uint32_t src_expand); static void Color32A_D565_8x(__m128i* dst, __m128i scale, __m128i src_expand); I'm somewhat surprised we don't have that 1x variant extracted as its own method in SkColorPriv.h already. Might make sense to do so if we can peg down the right name for it, then add the 8x variant to SkColor_opts_SSE2.h. https://codereview.chromium.org/892623002/diff/1/src/opts/SkBlitRow_opts_SSE4... src/opts/SkBlitRow_opts_SSE4.cpp:112: pixels_high = _mm_mullo_epi32(pixels_high, scale_wide); It doesn't seem like SSE4 is essential to the speed of this code. Am I thinking right it's just _mm_mullo_epi32 forcing us to do this in >=SSE4.1, and _mm_hadd_epi16 forcing us to do it in >=SSSE3? It seems like a shame to not implement this as an _SSE2.cpp optimization instead, and then come back with an SSSE3 or SSE4 variant only if they're really that much faster than the SSE2 code. The bulk of our x86 users do have SSE2, but not SSSE3 or SSE4.
https://codereview.chromium.org/892623002/diff/1/src/core/SkBlitRow_D16.cpp File src/core/SkBlitRow_D16.cpp (right): https://codereview.chromium.org/892623002/diff/1/src/core/SkBlitRow_D16.cpp#n... src/core/SkBlitRow_D16.cpp:215: static void Color32_D565(uint16_t dst[], SkPMColor src, int count, int x, int y) { On 2015/01/30 18:17:36, mtklein wrote: > The edits to this file seem unambiguously good and orthogonal to the rest of the > change. Why don't you spin it off so we can land it while reviewing the rest? Done. https://codereview.chromium.org/892623002/diff/1/src/opts/SkBlitRow_opts_SSE4... File src/opts/SkBlitRow_opts_SSE4.cpp (right): https://codereview.chromium.org/892623002/diff/1/src/opts/SkBlitRow_opts_SSE4... src/opts/SkBlitRow_opts_SSE4.cpp:74: SkASSERT((((size_t)dst) & 0x01) == 0); On 2015/01/30 18:17:37, mtklein wrote: > Seems paranoid? If we need to keep this, let's write it as > > SkASSERT(SkIsAlign2((uintptr_t)dst)); I thought so too, but I actually found it in the new Color32A_D565_neon function. Will remove it. https://codereview.chromium.org/892623002/diff/1/src/opts/SkBlitRow_opts_SSE4... src/opts/SkBlitRow_opts_SSE4.cpp:82: if (count >= 8) { On 2015/01/30 18:17:37, mtklein wrote: > This logic seems complex, and I'm worried we're overfitting to the benchmarks or > machine we're using to test. > > Why don't we start by stripping all the smarts away and just use unaligned > memory access? We can watch http://perf.skia.org to see its performance impact across > a variety of machines, and then refine this if needed. > > uint32_t src_expand = ...; > unsigned scale = ... ; > > const __m128i src_expand_wide = ...; > const __m128i scale_wide = ...; > const __m128i mask_green = ...; > > __m128i* dst_wide = (__m128i*)dst; > while (count >= 8) { > __m128i pixels = _mm_loadu_si128(dst_wide); > ... > _mm_storeu_si128(dst_wide++, pixels); > } > while (count --> 0) { > ... > } I did this optimization based on reports that it was used heavily in Android WPS Office application. I actually added the unaligned loop at the last minute, to improve performance of smaller widths (8-63 pixels). I would prefer it if we kept the aligned loop, since aligned memory access is still faster. Especially on Atom/Silvermont cores. https://codereview.chromium.org/892623002/diff/1/src/opts/SkBlitRow_opts_SSE4... src/opts/SkBlitRow_opts_SSE4.cpp:91: if ((count >= 64) || ((((size_t)dst) & 0x0F) == 0)) { On 2015/01/30 18:17:37, mtklein wrote: > How'd you pick 64? I measured different widths with skia bench, using both aligned and unaligned loops running on both aligned and unaligned start addresses. The sweet-spot for the Silvermont core was somewhere between 64 and 96 pixels. I will remove this 'magic number' in the next patch, along with the unaligned loop. https://codereview.chromium.org/892623002/diff/1/src/opts/SkBlitRow_opts_SSE4... src/opts/SkBlitRow_opts_SSE4.cpp:94: uint32_t dst_expand = SkExpand_rgb_16(*dst) * scale; On 2015/01/30 18:17:37, mtklein wrote: > You've got a lot of repeated code here. If it needs to stick around, which I > hope it doesn't, please factor the scalar and the 8x version out into their own > functions, e.g. > > static void Color32A_D565_1x(uint16_t* dst, unsigned scale, uint32_t > src_expand); > static void Color32A_D565_8x(__m128i* dst, __m128i scale, __m128i src_expand); > > I'm somewhat surprised we don't have that 1x variant extracted as its own method > in SkColorPriv.h already. Might make sense to do so if we can peg down the > right name for it, then add the 8x variant to SkColor_opts_SSE2.h. I'll re-factor the code a bit. When you're satisfied I'll check if it's possible to move some of it to other files. Moving to header files as inline methods should not affect performance at least. https://codereview.chromium.org/892623002/diff/1/src/opts/SkBlitRow_opts_SSE4... src/opts/SkBlitRow_opts_SSE4.cpp:112: pixels_high = _mm_mullo_epi32(pixels_high, scale_wide); On 2015/01/30 18:17:37, mtklein wrote: > It doesn't seem like SSE4 is essential to the speed of this code. Am I thinking > right it's just _mm_mullo_epi32 forcing us to do this in >=SSE4.1, and > _mm_hadd_epi16 forcing us to do it in >=SSSE3? It seems like a shame to not > implement this as an _SSE2.cpp optimization instead, and then come back with an > SSSE3 or SSE4 variant only if they're really that much faster than the SSE2 > code. The bulk of our x86 users do have SSE2, but not SSSE3 or SSE4. I added the _mm_hadd instruction at the last minute, replacing something like seven instructions. It made the code shorter and clearer, but since hadd is using micro-code, it didn't give that much on the performance side (on a Silvermont that is). _mm_mullo_epi32 is the only instruction that makes four 32-bit multiplications in one go. The only other instruction that does 32-bit multiplications in SSE2, only does two, requiring two calls to slow mul-instructions, and some shuffling/shifting on top of that. Since the multiplication step is the only really 'heavy' instruction in the loop, doubling it would impact performance. Also, not that far from the C-code's 1 mul/pixel. How do you know that most of your users only have SSE2? Do the majority of people sit on 10+ year old computers? I recall that SSE4 is as old as the entire ARMv7 architecture. I'll push a new version of this patch, and then write and measure a SSE2-only version. But I predict twice the code, and noticeable less performance (given ideal memory pre-fetching). But if it's still faster than the C-code I'll push a patch.
Oh neat, nice to hear there is a real user. That sort of changes everything. Thanks for simplifying the code. This lgtm. > How do you know that most of your users only have SSE2? Do the majority of > people sit on 10+ year old computers? I recall that SSE4 is as old as the entire > ARMv7 architecture. Yeah, they're not using 10+ year old computers (or we'd have to worry about even SSE2), but they certainly are using 7+ year old computers that don't support SSE4. > I'll push a new version of this patch, and then write and measure a SSE2-only > version. But I predict twice the code, and noticeable less performance (given > ideal memory pre-fetching). But if it's still faster than the C-code I'll push a > patch. All SGTM. It sounds like having an SSE2 and SSE4.1 version would be a pretty good maintainability sweet spot.
The CQ bit was checked by mtklein@google.com
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/892623002/20001
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: Build-Win-VS2013-x86-Debug-Trybot on client.skia.compile (JOB_FAILED, http://build.chromium.org/p/client.skia.compile/builders/Build-Win-VS2013-x86...)
New patchsets have been uploaded after l-g-t-m from mtklein@google.com
The CQ bit was checked by mtklein@google.com
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/892623002/40001
On 2015/02/10 16:17:16, mtklein wrote: > Oh neat, nice to hear there is a real user. That sort of changes everything. > > Thanks for simplifying the code. This lgtm. > > > How do you know that most of your users only have SSE2? Do the majority of > > people sit on 10+ year old computers? I recall that SSE4 is as old as the > entire > > ARMv7 architecture. > Yeah, they're not using 10+ year old computers (or we'd have to worry about even > SSE2), > but they certainly are using 7+ year old computers that don't support SSE4. > > > I'll push a new version of this patch, and then write and measure a SSE2-only > > version. But I predict twice the code, and noticeable less performance (given > > ideal memory pre-fetching). But if it's still faster than the C-code I'll push > a > > patch. > > All SGTM. It sounds like having an SSE2 and SSE4.1 version would be a pretty > good maintainability sweet spot. I've pushed a new version that should fix the warning in windows VS-builds. If you could add the relevant trybots... Reverting the hadd instruction back to the 'old' code for SSE2 proved useless, since the old code used _mm_packus_epi32 which is SSE4.1. Also, doing four multiplications for eight pixels is a waste. I'll instead try to rewrite the SSE2 version from scratch, using 16-bit multiplications on the separate R, G, and B subpixels. Like the Neon version does. Will mean more code to setup src_expand, though.
Message was sent while issue was closed.
Committed patchset #3 (id:40001) as https://skia.googlesource.com/skia/+/4e65473069b3a9382292577875366bd86c5777c8 |