Issue 1432903002: float xfermodes (burn, dodge, softlight) in Sk8f, possibly using AVX.

mtklein_C

Description was changed from ========== float xfermodes (burn, dodge, softlight) in Sk8f, possibly using AVX. ...

5 years, 1 month ago (2015-11-09 19:04:02 UTC) #1

mtklein_C

Description was changed from ========== float xfermodes (burn, dodge, softlight) in Sk8f, possibly using AVX. ...

5 years, 1 month ago (2015-11-09 19:04:10 UTC) #2

mtklein_C

The CQ bit was checked by mtklein@chromium.org to run a CQ dry run

5 years, 1 month ago (2015-11-09 19:11:43 UTC) #3

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1432903002/1 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1432903002/1

5 years, 1 month ago (2015-11-09 19:11:52 UTC) #4

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

5 years, 1 month ago (2015-11-09 19:13:51 UTC) #5

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: Build-Win-MSVC-x86_64-Debug-Trybot on client.skia.compile (JOB_FAILED, http://build.chromium.org/p/client.skia.compile/builders/Build-Win-MSVC-x86_64-Debug-Trybot/builds/4179)

5 years, 1 month ago (2015-11-09 19:13:52 UTC) #6

mtklein_C

Description was changed from ========== float xfermodes (burn, dodge, softlight) in Sk8f, possibly using AVX. ...

5 years, 1 month ago (2015-11-09 19:14:22 UTC) #7

mtklein_C

The CQ bit was checked by mtklein@chromium.org to run a CQ dry run

5 years, 1 month ago (2015-11-09 19:16:31 UTC) #8

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1432903002/20001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1432903002/20001

5 years, 1 month ago (2015-11-09 19:16:43 UTC) #9

mtklein_C

Description was changed from ========== float xfermodes (burn, dodge, softlight) in Sk8f, possibly using AVX. ...

5 years, 1 month ago (2015-11-09 19:24:54 UTC) #10

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

5 years, 1 month ago (2015-11-09 19:27:31 UTC) #11

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

5 years, 1 month ago (2015-11-09 19:27:32 UTC) #12

mtklein_C

Description was changed from ========== float xfermodes (burn, dodge, softlight) in Sk8f, possibly using AVX. ...

5 years, 1 month ago (2015-11-09 19:42:07 UTC) #13

Description was changed from

==========
float xfermodes (burn, dodge, softlight) in Sk8f, possibly using AVX.

Xfermode_ColorDodge_aa	10.3ms -> 7.85ms	0.76x
 Xfermode_SoftLight_aa	13.8ms -> 10.2ms	0.74x
 Xfermode_ColorBurn_aa	10.7ms -> 7.82ms	0.73x
    Xfermode_SoftLight	33.6ms -> 23.2ms	0.69x
   Xfermode_ColorDodge	  25ms -> 16.5ms	0.66x
    Xfermode_ColorBurn	26.1ms -> 16.6ms	0.63x

I made the SkNx(T) constructors implicit to make writing math expressions
simpler.
This allows us to write expressions like
  Sk4f v;
  ...
  v = v*4;
rather than 
  Sk4f v;
  ...
  v = v * Sk4f(4);

As written it only works when the constant is on the right-hand side,
so expressions like `(Sk4f(1) - da)` have to stay for now.  I plan on
following up with a CL that lets those become `(1 - da)` too.

BUG=skia:4117
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

COMMIT=false
Depends on https://codereview.chromium.org/1431143002
==========

to

==========
float xfermodes (burn, dodge, softlight) in Sk8f, possibly using AVX.

Xfermode_ColorDodge_aa	10.3ms -> 7.85ms	0.76x
 Xfermode_SoftLight_aa	13.8ms -> 10.2ms	0.74x
 Xfermode_ColorBurn_aa	10.7ms -> 7.82ms	0.73x
    Xfermode_SoftLight	33.6ms -> 23.2ms	0.69x
   Xfermode_ColorDodge	  25ms -> 16.5ms	0.66x
    Xfermode_ColorBurn	26.1ms -> 16.6ms	0.63x

Ought to be no pixel diffs:
https://gold.skia.org/search2?issue=1432903002&unt=true&query=source_type%3Dg...

Incidental stuff:

I made the SkNx(T) constructors implicit to make writing math expressions
simpler.
This allows us to write expressions like
  Sk4f v;
  ...
  v = v*4;
rather than 
  Sk4f v;
  ...
  v = v * Sk4f(4);

As written it only works when the constant is on the right-hand side,
so expressions like `(Sk4f(1) - da)` have to stay for now.  I plan on
following up with a CL that lets those become `(1 - da)` too.


BUG=skia:4117
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

COMMIT=false
Depends on https://codereview.chromium.org/1431143002
==========

mtklein_C

mtklein@chromium.org changed reviewers: + msarett@google.com

5 years, 1 month ago (2015-11-09 19:42:52 UTC) #14

msarett

"I made the SkNx(T) constructors implicit to make writing math expressions simpler. This allows us ...

5 years, 1 month ago (2015-11-09 23:25:06 UTC) #16

"I made the SkNx(T) constructors implicit to make writing math expressions
simpler.
This allows us to write expressions like
  Sk4f v;
  ...
  v = v*4;
rather than 
  Sk4f v;
  ...
  v = v * Sk4f(4);"

I'm not sure I'm convinced that this is better than before.

v = v * Sk4f(4);
Seems to indicate that vector math is happening, while the alternative causes me
to pause and try to figure our what's going on.

On the other hand, it's nice to get rid of T(4) and other similar expressions. 
Also, this seems to be keeping with overriding +, *, -, etc.

https://codereview.chromium.org/1432903002/diff/20001/src/opts/SkNx_avx.h
File src/opts/SkNx_avx.h (right):

https://codereview.chromium.org/1432903002/diff/20001/src/opts/SkNx_avx.h#new...
src/opts/SkNx_avx.h:32: __m256i fix32 =
_mm256_insertf128_si256(_mm256_castsi128_si256(lo32), hi32, 1);
Seems annoying to do all the unpacking in 128-bit and then combine to 256-bits. 
But I guess we don't get 256-bit integer unpack/shuffle instructions until AVX2.

https://codereview.chromium.org/1432903002/diff/20001/src/opts/SkXfermode_opts.h
File src/opts/SkXfermode_opts.h (right):

https://codereview.chromium.org/1432903002/diff/20001/src/opts/SkXfermode_opt...
src/opts/SkXfermode_opts.h:126: static inline Sk8f alphas(const Sk8f& f) {
return SkNx_shuffle<3,3,3,3,7,7,7,7>(f); }
Where is shuffle defined for AVX?

Oh, it looks like it uses SkNx's default implementation with kth().  Is a
platform specific implementation a TODO?  Or can we not do better than the
default?

https://codereview.chromium.org/1432903002/diff/20001/src/opts/SkXfermode_opt...
src/opts/SkXfermode_opts.h:254: void xfer32(SkPMColor dst[], const SkPMColor
src[], int n, const SkAlpha aa[]) const override {
It looks like you've deleted some private helper functions and pulled them into
the implementation of this function?  Which allows you to efficiently combine
Sk8f and Sk4f?

https://codereview.chromium.org/1432903002/diff/20001/src/opts/SkXfermode_opt...
src/opts/SkXfermode_opts.h:257: auto d = Sk8f::FromBytes((const uint8_t*)dst) *
(1.0f/255),
nit:  Any reason not to write 1.0f / 255.0f?

https://codereview.chromium.org/1432903002/diff/20001/src/opts/SkXfermode_opt...
src/opts/SkXfermode_opts.h:289: dst[i] = SkPixel32ToPixel16(dst32);             
       // Repack dst to 565 and store.
This seems slow?  Although not different from before.

mtklein

On 2015/11/09 at 23:25:06, msarett wrote: > "I made the SkNx(T) constructors implicit to make ...

5 years, 1 month ago (2015-11-10 00:06:56 UTC) #17

mtklein

mtklein@google.com changed reviewers: + mtklein@google.com

5 years, 1 month ago (2015-11-10 00:22:04 UTC) #18

mtklein

https://codereview.chromium.org/1432903002/diff/20001/src/opts/SkNx_avx.h File src/opts/SkNx_avx.h (right): https://codereview.chromium.org/1432903002/diff/20001/src/opts/SkNx_avx.h#newcode32 src/opts/SkNx_avx.h:32: __m256i fix32 = _mm256_insertf128_si256(_mm256_castsi128_si256(lo32), hi32, 1); On 2015/11/09 at ...

5 years, 1 month ago (2015-11-10 00:22:05 UTC) #19

https://codereview.chromium.org/1432903002/diff/20001/src/opts/SkNx_avx.h
File src/opts/SkNx_avx.h (right):

https://codereview.chromium.org/1432903002/diff/20001/src/opts/SkNx_avx.h#new...
src/opts/SkNx_avx.h:32: __m256i fix32 =
_mm256_insertf128_si256(_mm256_castsi128_si256(lo32), hi32, 1);
On 2015/11/09 at 23:25:06, msarett wrote:
> Seems annoying to do all the unpacking in 128-bit and then combine to
256-bits.  But I guess we don't get 256-bit integer unpack/shuffle instructions
until AVX2.

Right.  Even then, the AVX and AVX2 packing, unpacking, and shuffling
instructions can't cross from the low 128 bits to the high 128 bits or the other
way around.  They basically only work on 2 128-bit arguments in parallel. 
They're depressingly disappointing.  E.g., take a look at the pseudocode for
vpshufb... at first you think, that looks weird, why is it written so
complicated... and then you're like, oh, it just sucks that's why:
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=3868,474...

The only unpacking instruction that's really useful for us and that works in the
way you'd hope is vpmovzxbd to zero-extend 8 bytes to 8 ints, but that's AVX2. 
That'd get us right from fix8 to fix32 in this code (and I think can even take a
memory argument to read from, compiling the entire function up to __m256i fix32
= ...; into a single instruction).

https://codereview.chromium.org/1432903002/diff/20001/src/opts/SkXfermode_opts.h
File src/opts/SkXfermode_opts.h (right):

https://codereview.chromium.org/1432903002/diff/20001/src/opts/SkXfermode_opt...
src/opts/SkXfermode_opts.h:126: static inline Sk8f alphas(const Sk8f& f) {
return SkNx_shuffle<3,3,3,3,7,7,7,7>(f); }
On 2015/11/09 at 23:25:06, msarett wrote:
> Where is shuffle defined for AVX?
> 
> Oh, it looks like it uses SkNx's default implementation with kth().  Is a
platform specific implementation a TODO?  Or can we not do better than the
default?

Right, default implementation.  This compiles into one instruction, vpermilps
$255, %ymmN, %ymmM.  (Shuffle element 3 into each of lanes 0,1,2,3 and element
3+4=7 into 4,5,6,7).   Pretty hard to beat.

https://codereview.chromium.org/1432903002/diff/20001/src/opts/SkXfermode_opt...
src/opts/SkXfermode_opts.h:254: void xfer32(SkPMColor dst[], const SkPMColor
src[], int n, const SkAlpha aa[]) const override {
On 2015/11/09 at 23:25:06, msarett wrote:
> It looks like you've deleted some private helper functions and pulled them
into the implementation of this function?  Which allows you to efficiently
combine Sk8f and Sk4f?

Yep, just moving things around.

https://codereview.chromium.org/1432903002/diff/20001/src/opts/SkXfermode_opt...
src/opts/SkXfermode_opts.h:257: auto d = Sk8f::FromBytes((const uint8_t*)dst) *
(1.0f/255),
On 2015/11/09 at 23:25:06, msarett wrote:
> nit:  Any reason not to write 1.0f / 255.0f?

Nope, either way works the same and gets to the same compile time constant.  I'd
write 1.0/255, but MSVC complains that 1.0/255 != 1.0f/255.  Gotta get an 'f' in
there somewhere.

https://codereview.chromium.org/1432903002/diff/20001/src/opts/SkXfermode_opt...
src/opts/SkXfermode_opts.h:289: dst[i] = SkPixel32ToPixel16(dst32);             
       // Repack dst to 565 and store.
On 2015/11/09 at 23:25:06, msarett wrote:
> This seems slow?  Although not different from before.

Right.  No different from before.  Just always 1 pixel at a time.  Seems like a
fine compromise for 565.

msarett

lgtm https://codereview.chromium.org/1432903002/diff/20001/src/opts/SkNx_avx.h File src/opts/SkNx_avx.h (right): https://codereview.chromium.org/1432903002/diff/20001/src/opts/SkNx_avx.h#newcode32 src/opts/SkNx_avx.h:32: __m256i fix32 = _mm256_insertf128_si256(_mm256_castsi128_si256(lo32), hi32, 1); On 2015/11/10 ...

5 years, 1 month ago (2015-11-10 14:54:16 UTC) #20

lgtm

https://codereview.chromium.org/1432903002/diff/20001/src/opts/SkNx_avx.h
File src/opts/SkNx_avx.h (right):

https://codereview.chromium.org/1432903002/diff/20001/src/opts/SkNx_avx.h#new...
src/opts/SkNx_avx.h:32: __m256i fix32 =
_mm256_insertf128_si256(_mm256_castsi128_si256(lo32), hi32, 1);
On 2015/11/10 00:22:05, mtklein wrote:
> On 2015/11/09 at 23:25:06, msarett wrote:
> > Seems annoying to do all the unpacking in 128-bit and then combine to
> 256-bits.  But I guess we don't get 256-bit integer unpack/shuffle
instructions
> until AVX2.
> 
> Right.  Even then, the AVX and AVX2 packing, unpacking, and shuffling
> instructions can't cross from the low 128 bits to the high 128 bits or the
other
> way around.  They basically only work on 2 128-bit arguments in parallel. 
> They're depressingly disappointing.  E.g., take a look at the pseudocode for
> vpshufb... at first you think, that looks weird, why is it written so
> complicated... and then you're like, oh, it just sucks that's why:
>
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=3868,474...
>

Good point, forgot about that.
 
> The only unpacking instruction that's really useful for us and that works in
the
> way you'd hope is vpmovzxbd to zero-extend 8 bytes to 8 ints, but that's AVX2.

> That'd get us right from fix8 to fix32 in this code (and I think can even take
a
> memory argument to read from, compiling the entire function up to __m256i
fix32
> = ...; into a single instruction).

Cool!

https://codereview.chromium.org/1432903002/diff/20001/src/opts/SkXfermode_opts.h
File src/opts/SkXfermode_opts.h (right):

https://codereview.chromium.org/1432903002/diff/20001/src/opts/SkXfermode_opt...
src/opts/SkXfermode_opts.h:126: static inline Sk8f alphas(const Sk8f& f) {
return SkNx_shuffle<3,3,3,3,7,7,7,7>(f); }
On 2015/11/10 00:22:05, mtklein wrote:
> On 2015/11/09 at 23:25:06, msarett wrote:
> > Where is shuffle defined for AVX?
> > 
> > Oh, it looks like it uses SkNx's default implementation with kth().  Is a
> platform specific implementation a TODO?  Or can we not do better than the
> default?
> 
> Right, default implementation.  This compiles into one instruction, vpermilps
> $255, %ymmN, %ymmM.  (Shuffle element 3 into each of lanes 0,1,2,3 and element
> 3+4=7 into 4,5,6,7).   Pretty hard to beat.

Great, cool instruction!

I'm impressed by the compiler.  Is having the shared code the benefit of
obscuring this?

https://codereview.chromium.org/1432903002/diff/20001/src/opts/SkXfermode_opt...
src/opts/SkXfermode_opts.h:254: void xfer32(SkPMColor dst[], const SkPMColor
src[], int n, const SkAlpha aa[]) const override {
On 2015/11/10 00:22:05, mtklein wrote:
> On 2015/11/09 at 23:25:06, msarett wrote:
> > It looks like you've deleted some private helper functions and pulled them
> into the implementation of this function?  Which allows you to efficiently
> combine Sk8f and Sk4f?
> 
> Yep, just moving things around.

Acknowledged.

mtklein

https://codereview.chromium.org/1432903002/diff/20001/src/opts/SkXfermode_opts.h File src/opts/SkXfermode_opts.h (right): https://codereview.chromium.org/1432903002/diff/20001/src/opts/SkXfermode_opts.h#newcode126 src/opts/SkXfermode_opts.h:126: static inline Sk8f alphas(const Sk8f& f) { return SkNx_shuffle<3,3,3,3,7,7,7,7>(f); ...

5 years, 1 month ago (2015-11-11 19:27:13 UTC) #21

mtklein

Description was changed from ========== float xfermodes (burn, dodge, softlight) in Sk8f, possibly using AVX. ...

5 years, 1 month ago (2015-11-11 19:27:27 UTC) #22

Description was changed from

==========
float xfermodes (burn, dodge, softlight) in Sk8f, possibly using AVX.

Xfermode_ColorDodge_aa	10.3ms -> 7.85ms	0.76x
 Xfermode_SoftLight_aa	13.8ms -> 10.2ms	0.74x
 Xfermode_ColorBurn_aa	10.7ms -> 7.82ms	0.73x
    Xfermode_SoftLight	33.6ms -> 23.2ms	0.69x
   Xfermode_ColorDodge	  25ms -> 16.5ms	0.66x
    Xfermode_ColorBurn	26.1ms -> 16.6ms	0.63x

Ought to be no pixel diffs:
https://gold.skia.org/search2?issue=1432903002&unt=true&query=source_type%3Dg...

Incidental stuff:

I made the SkNx(T) constructors implicit to make writing math expressions
simpler.
This allows us to write expressions like
  Sk4f v;
  ...
  v = v*4;
rather than 
  Sk4f v;
  ...
  v = v * Sk4f(4);

As written it only works when the constant is on the right-hand side,
so expressions like `(Sk4f(1) - da)` have to stay for now.  I plan on
following up with a CL that lets those become `(1 - da)` too.


BUG=skia:4117
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

COMMIT=false
Depends on https://codereview.chromium.org/1431143002
==========

to

==========
float xfermodes (burn, dodge, softlight) in Sk8f, possibly using AVX.

Xfermode_ColorDodge_aa	10.3ms -> 7.85ms	0.76x
 Xfermode_SoftLight_aa	13.8ms -> 10.2ms	0.74x
 Xfermode_ColorBurn_aa	10.7ms -> 7.82ms	0.73x
    Xfermode_SoftLight	33.6ms -> 23.2ms	0.69x
   Xfermode_ColorDodge	  25ms -> 16.5ms	0.66x
    Xfermode_ColorBurn	26.1ms -> 16.6ms	0.63x

Ought to be no pixel diffs:
https://gold.skia.org/search2?issue=1432903002&unt=true&query=source_type%3Dg...

Incidental stuff:

I made the SkNx(T) constructors implicit to make writing math expressions
simpler.
This allows us to write expressions like
  Sk4f v;
  ...
  v = v*4;
rather than 
  Sk4f v;
  ...
  v = v * Sk4f(4);

As written it only works when the constant is on the right-hand side,
so expressions like `(Sk4f(1) - da)` have to stay for now.  I plan on
following up with a CL that lets those become `(1 - da)` too.


BUG=skia:4117
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
==========

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1432903002/20001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1432903002/20001

5 years, 1 month ago (2015-11-11 19:27:39 UTC) #24

commit-bot: I haz the power

5 years, 1 month ago (2015-11-11 19:39:13 UTC) #25

Message was sent while issue was closed.

Committed patchset #2 (id:20001) as
https://skia.googlesource.com/skia/+/084db25d47dbad3ffbd7d15c04b63d344b351f90

Issue 1432903002: float xfermodes (burn, dodge, softlight) in Sk8f, possibly using AVX. (Closed)

Description

Patch Set 1 #

Patch Set 2 : 1.0f/255 #

Messages