Issue 892623002: Add SSE optimization of Color32A_D565

Issue 892623002: Add SSE optimization of Color32A_D565 (Closed)

Created:
5 years, 10 months ago by henrik.smiding

Modified:
5 years, 10 months ago

Reviewers:
mtklein, joakim.landberg, reed1

CC:
reviews_skia.org

Base URL:
https://skia.googlesource.com/skia.git@master

Target Ref:
refs/heads/master

Project:
skia

Visibility:
Public.

More Reviews

Description

Add SSE optimization of Color32A_D565 Adds an SSE4.1 version of the Color32A_D565 function. Performance improvement in the following benchmarks: Xfermode_SrcOver - ~100% luma_colorfilter_large - ~150% luma_colorfilter_small - ~60% tablebench - ~10% chart_bw - ~10% (Measured on a Atom Silvermont core) Signed-off-by: Henrik Smiding <henrik.smiding@intel.com>; Committed: https://skia.googlesource.com/skia/+/4e65473069b3a9382292577875366bd86c5777c8

Patch Set 1 #

Total comments: 12

Patch Set 2 : Simplified code #

Patch Set 3 : Fixed VS warning #

Created: 5 years, 10 months ago

Download [raw] [tar.bz2]

	Unified diffs	Side-by-side diffs	Delta from patch set	Stats (+93 lines, -6 lines)			Patch
M	src/opts/SkBlitRow_opts_SSE4.h	View		1 chunk	+3 lines, -0 lines	0 comments	Download
M	src/opts/SkBlitRow_opts_SSE4.cpp	View	1 2	2 chunks	+77 lines, -2 lines	0 comments	Download
M	src/opts/opts_check_x86.cpp	View	1	2 chunks	+13 lines, -4 lines	0 comments	Download

Messages

Total messages: 20 (8 generated)

Expand Messages | Collapse Messages | Show Generated Messages | Hide Generated Messages

henrik.smiding

Please have a look. I tried adding a bunch of trybots, but they all went ...

5 years, 10 months ago (2015-01-30 14:47:53 UTC) #3

henrik.smiding

On 2015/01/30 14:47:53, henrik.smiding wrote: > Please have a look. > I tried adding a ...

5 years, 10 months ago (2015-01-30 15:01:40 UTC) #4

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/892623002/1

5 years, 10 months ago (2015-01-30 16:33:36 UTC) #6

commit-bot: I haz the power

Note for Reviewers: The CQ is waiting for an approval. If you believe that the ...

5 years, 10 months ago (2015-01-30 16:33:37 UTC) #7

mtklein

I've been looking at this a bit, and it looks like the current scalar implementation ...

5 years, 10 months ago (2015-01-30 18:17:37 UTC) #9

I've been looking at this a bit, and it looks like the current scalar
implementation of Color32A_D565 is not very good quality (even ignoring the fact
that we're not dithering).  You may want to table this CL for a little while,
because I think I'm about to move the baseline out from underneath it.

I'm hesitant to vectorize a low-quality algorithm that appears to only be low
quality out of concern for speed.  It seems straightforward to improve the
precision of the blend we're doing (to perfect), and then if that's still too
slow, vectorize that fixed algorithm.

I believe the correct algorithm for Color32A_D565 is:
  for (int i = 0; i < count; i++) {
    dst[i] = SkSrcOver32To16(src, dst[i]);
  }
Clearly there's some work there inside SkSrcOver32To16 that can be hoisted out
of the loop.

https://codereview.chromium.org/892623002/diff/1/src/core/SkBlitRow_D16.cpp
File src/core/SkBlitRow_D16.cpp (right):

https://codereview.chromium.org/892623002/diff/1/src/core/SkBlitRow_D16.cpp#n...
src/core/SkBlitRow_D16.cpp:215: static void Color32_D565(uint16_t dst[],
SkPMColor src, int count, int x, int y) {
The edits to this file seem unambiguously good and orthogonal to the rest of the
change.  Why don't you spin it off so we can land it while reviewing the rest?

https://codereview.chromium.org/892623002/diff/1/src/opts/SkBlitRow_opts_SSE4...
File src/opts/SkBlitRow_opts_SSE4.cpp (right):

https://codereview.chromium.org/892623002/diff/1/src/opts/SkBlitRow_opts_SSE4...
src/opts/SkBlitRow_opts_SSE4.cpp:74: SkASSERT((((size_t)dst) & 0x01) == 0);
Seems paranoid?  If we need to keep this, let's write it as

SkASSERT(SkIsAlign2((uintptr_t)dst));

https://codereview.chromium.org/892623002/diff/1/src/opts/SkBlitRow_opts_SSE4...
src/opts/SkBlitRow_opts_SSE4.cpp:82: if (count >= 8) {
This logic seems complex, and I'm worried we're overfitting to the benchmarks or
machine we're using to test.

Why don't we start by stripping all the smarts away and just use unaligned
memory access?  We can watch perf.skia.org to see its performance impact across
a variety of machines, and then refine this if needed.

uint32_t src_expand = ...;
unsigned scale = ... ;

const __m128i src_expand_wide = ...;
const __m128i scale_wide = ...;
const __m128i mask_green = ...;

__m128i* dst_wide = (__m128i*)dst;
while (count >= 8) {
  __m128i pixels = _mm_loadu_si128(dst_wide);
  ...
  _mm_storeu_si128(dst_wide++, pixels);
}
while (count --> 0) {
  ...
}

https://codereview.chromium.org/892623002/diff/1/src/opts/SkBlitRow_opts_SSE4...
src/opts/SkBlitRow_opts_SSE4.cpp:91: if ((count >= 64) || ((((size_t)dst) &
0x0F) == 0)) {
How'd you pick 64?

https://codereview.chromium.org/892623002/diff/1/src/opts/SkBlitRow_opts_SSE4...
src/opts/SkBlitRow_opts_SSE4.cpp:94: uint32_t dst_expand = SkExpand_rgb_16(*dst)
* scale;
You've got a lot of repeated code here.  If it needs to stick around, which I
hope it doesn't, please factor the scalar and the 8x version out into their own
functions, e.g.

static void Color32A_D565_1x(uint16_t* dst, unsigned scale, uint32_t
src_expand);
static void Color32A_D565_8x(__m128i* dst, __m128i scale, __m128i src_expand);

I'm somewhat surprised we don't have that 1x variant extracted as its own method
in SkColorPriv.h already.  Might make sense to do so if we can peg down the
right name for it, then add the 8x variant to SkColor_opts_SSE2.h.

https://codereview.chromium.org/892623002/diff/1/src/opts/SkBlitRow_opts_SSE4...
src/opts/SkBlitRow_opts_SSE4.cpp:112: pixels_high = _mm_mullo_epi32(pixels_high,
scale_wide);
It doesn't seem like SSE4 is essential to the speed of this code.  Am I thinking
right it's just _mm_mullo_epi32 forcing us to do this in >=SSE4.1, and
_mm_hadd_epi16 forcing us to do it in >=SSSE3?  It seems like a shame to not
implement this as an _SSE2.cpp optimization instead, and then come back with an
SSSE3 or SSE4 variant only if they're really that much faster than the SSE2
code.  The bulk of our x86 users do have SSE2, but not SSSE3 or SSE4.

henrik.smiding

https://codereview.chromium.org/892623002/diff/1/src/core/SkBlitRow_D16.cpp File src/core/SkBlitRow_D16.cpp (right): https://codereview.chromium.org/892623002/diff/1/src/core/SkBlitRow_D16.cpp#newcode215 src/core/SkBlitRow_D16.cpp:215: static void Color32_D565(uint16_t dst[], SkPMColor src, int count, int ...

5 years, 10 months ago (2015-02-10 15:11:52 UTC) #10

https://codereview.chromium.org/892623002/diff/1/src/core/SkBlitRow_D16.cpp
File src/core/SkBlitRow_D16.cpp (right):

https://codereview.chromium.org/892623002/diff/1/src/core/SkBlitRow_D16.cpp#n...
src/core/SkBlitRow_D16.cpp:215: static void Color32_D565(uint16_t dst[],
SkPMColor src, int count, int x, int y) {
On 2015/01/30 18:17:36, mtklein wrote:
> The edits to this file seem unambiguously good and orthogonal to the rest of
the
> change.  Why don't you spin it off so we can land it while reviewing the rest?

Done.

https://codereview.chromium.org/892623002/diff/1/src/opts/SkBlitRow_opts_SSE4...
File src/opts/SkBlitRow_opts_SSE4.cpp (right):

https://codereview.chromium.org/892623002/diff/1/src/opts/SkBlitRow_opts_SSE4...
src/opts/SkBlitRow_opts_SSE4.cpp:74: SkASSERT((((size_t)dst) & 0x01) == 0);
On 2015/01/30 18:17:37, mtklein wrote:
> Seems paranoid?  If we need to keep this, let's write it as
> 
> SkASSERT(SkIsAlign2((uintptr_t)dst));

I thought so too, but I actually found it in the new Color32A_D565_neon
function.
Will remove it.

https://codereview.chromium.org/892623002/diff/1/src/opts/SkBlitRow_opts_SSE4...
src/opts/SkBlitRow_opts_SSE4.cpp:82: if (count >= 8) {
On 2015/01/30 18:17:37, mtklein wrote:
> This logic seems complex, and I'm worried we're overfitting to the benchmarks
or
> machine we're using to test.
> 
> Why don't we start by stripping all the smarts away and just use unaligned
> memory access?  We can watch http://perf.skia.org to see its performance
impact across
> a variety of machines, and then refine this if needed.
> 
> uint32_t src_expand = ...;
> unsigned scale = ... ;
> 
> const __m128i src_expand_wide = ...;
> const __m128i scale_wide = ...;
> const __m128i mask_green = ...;
> 
> __m128i* dst_wide = (__m128i*)dst;
> while (count >= 8) {
>   __m128i pixels = _mm_loadu_si128(dst_wide);
>   ...
>   _mm_storeu_si128(dst_wide++, pixels);
> }
> while (count --> 0) {
>   ...
> }

I did this optimization based on reports that it was used heavily in Android WPS
Office application.

I actually added the unaligned loop at the last minute, to improve performance
of smaller widths (8-63 pixels).
I would prefer it if we kept the aligned loop, since aligned memory access is
still faster. Especially on Atom/Silvermont cores.

https://codereview.chromium.org/892623002/diff/1/src/opts/SkBlitRow_opts_SSE4...
src/opts/SkBlitRow_opts_SSE4.cpp:91: if ((count >= 64) || ((((size_t)dst) &
0x0F) == 0)) {
On 2015/01/30 18:17:37, mtklein wrote:
> How'd you pick 64?

I measured different widths with skia bench, using both aligned and unaligned
loops running on both aligned and unaligned start addresses. The sweet-spot for
the Silvermont core was somewhere between 64 and 96 pixels.
I will remove this 'magic number' in the next patch, along with the unaligned
loop.

https://codereview.chromium.org/892623002/diff/1/src/opts/SkBlitRow_opts_SSE4...
src/opts/SkBlitRow_opts_SSE4.cpp:94: uint32_t dst_expand = SkExpand_rgb_16(*dst)
* scale;
On 2015/01/30 18:17:37, mtklein wrote:
> You've got a lot of repeated code here.  If it needs to stick around, which I
> hope it doesn't, please factor the scalar and the 8x version out into their
own
> functions, e.g.
> 
> static void Color32A_D565_1x(uint16_t* dst, unsigned scale, uint32_t
> src_expand);
> static void Color32A_D565_8x(__m128i* dst, __m128i scale, __m128i src_expand);
> 
> I'm somewhat surprised we don't have that 1x variant extracted as its own
method
> in SkColorPriv.h already.  Might make sense to do so if we can peg down the
> right name for it, then add the 8x variant to SkColor_opts_SSE2.h.

I'll re-factor the code a bit. When you're satisfied I'll check if it's possible
to move some of it to other files. Moving to header files as inline methods
should not affect performance at least.

https://codereview.chromium.org/892623002/diff/1/src/opts/SkBlitRow_opts_SSE4...
src/opts/SkBlitRow_opts_SSE4.cpp:112: pixels_high = _mm_mullo_epi32(pixels_high,
scale_wide);
On 2015/01/30 18:17:37, mtklein wrote:
> It doesn't seem like SSE4 is essential to the speed of this code.  Am I
thinking
> right it's just _mm_mullo_epi32 forcing us to do this in >=SSE4.1, and
> _mm_hadd_epi16 forcing us to do it in >=SSSE3?  It seems like a shame to not
> implement this as an _SSE2.cpp optimization instead, and then come back with
an
> SSSE3 or SSE4 variant only if they're really that much faster than the SSE2
> code.  The bulk of our x86 users do have SSE2, but not SSSE3 or SSE4.

I added the _mm_hadd instruction at the last minute, replacing something like
seven instructions. It made the code shorter and clearer, but since hadd is
using micro-code, it didn't give that much on the performance side (on a
Silvermont that is).

_mm_mullo_epi32 is the only instruction that makes four 32-bit multiplications
in one go. 
The only other instruction that does 32-bit multiplications in SSE2, only does
two, requiring two calls to slow mul-instructions, and some shuffling/shifting
on top of that. Since the multiplication step is the only really 'heavy'
instruction in the loop, doubling it would impact performance. Also, not that
far from the C-code's 1 mul/pixel.

How do you know that most of your users only have SSE2? Do the majority of
people sit on 10+ year old computers? I recall that SSE4 is as old as the entire
ARMv7 architecture.

I'll push a new version of this patch, and then write and measure a SSE2-only
version. But I predict twice the code, and noticeable less performance (given
ideal memory pre-fetching). But if it's still faster than the C-code I'll push a
patch.

mtklein

Oh neat, nice to hear there is a real user. That sort of changes everything. ...

5 years, 10 months ago (2015-02-10 16:17:16 UTC) #11

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/892623002/20001

5 years, 10 months ago (2015-02-10 16:18:59 UTC) #13

commit-bot: I haz the power

Try jobs failed on following builders: Build-Win-VS2013-x86-Debug-Trybot on client.skia.compile (JOB_FAILED, http://build.chromium.org/p/client.skia.compile/builders/Build-Win-VS2013-x86-Debug-Trybot/builds/2379)

5 years, 10 months ago (2015-02-10 16:46:51 UTC) #15

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/892623002/40001

5 years, 10 months ago (2015-02-10 17:27:40 UTC) #18

henrik.smiding

On 2015/02/10 16:17:16, mtklein wrote: > Oh neat, nice to hear there is a real ...

5 years, 10 months ago (2015-02-10 17:30:41 UTC) #19

commit-bot: I haz the power

5 years, 10 months ago (2015-02-10 17:42:36 UTC) #20

Message was sent while issue was closed.

Committed patchset #3 (id:40001) as
https://skia.googlesource.com/skia/+/4e65473069b3a9382292577875366bd86c5777c8

Expand Messages | Collapse Messages | Show Generated Messages | Hide Generated Messages