Issue 1891513002: skcpu: sse4.1 floor, f16c f16<->f32

mtklein_C

Description was changed from ========== skcpu: sse4.1 floor, f16c f16<->f32 BUG=skia: ========== to ========== skcpu: ...

4 years, 8 months ago (2016-04-13 20:32:04 UTC) #1

mtklein

Description was changed from ========== skcpu: sse4.1 floor, f16c f16<->f32 BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1891513002 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ========== ...

4 years, 8 months ago (2016-04-13 20:40:59 UTC) #2

mtklein

Description was changed from ========== skcpu: sse4.1 floor, f16c f16<->f32 - floor with roundps is ...

4 years, 8 months ago (2016-04-13 22:23:21 UTC) #3

mtklein

Description was changed from ========== skcpu: sse4.1 floor, f16c f16<->f32 - floor with roundps is ...

4 years, 8 months ago (2016-04-13 22:26:22 UTC) #4

Description was changed from

==========
skcpu: sse4.1 floor, f16c f16<->f32

  -  floor with roundps is about 4.5x faster when available
  -  f16 srcover_n is similar to but a little faster than the version in
https://codereview.chromium.org/1884683002.  This new one fuses the dst
load/stores into the f16<->f32 conversions:

+0x180	    movups              (%r15), %xmm1
+0x184	    vcvtph2ps           (%rbx), %xmm2
+0x189	    movaps              %xmm1, %xmm3
+0x18c	    shufps              $255, %xmm3, %xmm3
+0x190	    movaps              %xmm0, %xmm4
+0x193	    subps               %xmm3, %xmm4
+0x196	    mulps               %xmm2, %xmm4
+0x199	    addps               %xmm1, %xmm4
+0x19c	    vcvtps2ph           $0, %xmm4, (%rbx)
+0x1a2	    addq                $16, %r15
+0x1a6	    addq                $8, %rbx
+0x1aa	    decl                %r14d
+0x1ad	    jne                 +0x180

BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
==========

to

==========
skcpu: sse4.1 floor, f16c f16<->f32

  -  floor with roundps is about 4.5x faster when available
  -  f16 srcover_n is similar to but a little faster than the version in
https://codereview.chromium.org/1884683002.  This new one fuses the dst
load/stores into the f16<->f32 conversions:

+0x180	    movups              (%r15), %xmm1
+0x184	    vcvtph2ps           (%rbx), %xmm2
+0x189	    movaps              %xmm1, %xmm3
+0x18c	    shufps              $255, %xmm3, %xmm3
+0x190	    movaps              %xmm0, %xmm4
+0x193	    subps               %xmm3, %xmm4
+0x196	    mulps               %xmm2, %xmm4
+0x199	    addps               %xmm1, %xmm4
+0x19c	    vcvtps2ph           $0, %xmm4, (%rbx)
+0x1a2	    addq                $16, %r15
+0x1a6	    addq                $8, %rbx
+0x1aa	    decl                %r14d
+0x1ad	    jne                 +0x180

COMMIT=false
Depends on https://codereview.chromium.org/1890483002/

BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
==========
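For readers who do not read AT&T-syntax assembly, the per-pixel arithmetic of the loop quoted in the description can be sketched in scalar C++. This is only an illustration, not the code in the patch. It assumes that %xmm0 holds 1.0f in every lane, that pixels are premultiplied with alpha in lane 3 (which is what `shufps $255` broadcasts), and it keeps dst as float rather than f16, so the vcvtph2ps/vcvtps2ph conversions are left out. The names `Pixel` and `srcover_reference` are hypothetical.

```cpp
#include <array>
#include <cstddef>

// Hypothetical pixel type: four premultiplied float channels, alpha in lane 3.
using Pixel = std::array<float, 4>;

// Scalar reference for the loop body: dst = src + dst * (1 - src.alpha).
// In the real loop dst is stored as f16 and converted on load and store;
// this sketch keeps dst as float to show only the blend math.
void srcover_reference(Pixel* dst, const Pixel* src, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        // shufps $255 broadcasts src alpha; subps computes 1 - alpha.
        const float inv_alpha = 1.0f - src[i][3];
        for (std::size_t c = 0; c < 4; ++c) {
            // mulps scales dst by (1 - alpha); addps adds src.
            dst[i][c] = src[i][c] + dst[i][c] * inv_alpha;
        }
    }
}
```

The pointer strides in the assembly match this layout: the src pointer advances by 16 bytes (four floats) and the dst pointer by 8 bytes (four halfs) per iteration.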

mtklein_C

mtklein@chromium.org changed reviewers: + fmalita@chromium.org, herb@google.com, reed@google.com

4 years, 8 months ago (2016-04-13 22:31:05 UTC) #5

mtklein_C

Here's how we can use SkCpu to detect support for fast floor() and half<->float operations ...

4 years, 8 months ago (2016-04-13 22:31:06 UTC) #6
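The comment above is truncated in this log, so the actual SkCpu interface is not shown. As a rough sketch of the general idea (pick a fast routine at runtime when the CPU supports it, otherwise use a baseline), something like the following could work. Everything here is an assumption for illustration: the `CpuFeature` bits, `choose_floor`, and both floor functions are hypothetical names, `std::floor` merely stands in for a roundps-based path, and the baseline shows a common truncate-and-correct emulation that is only valid for inputs that fit in int32_t.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical feature bits; not the real SkCpu API.
enum CpuFeature : uint32_t { kSSE41 = 1u << 0, kF16C = 1u << 1 };

// Baseline floor without a dedicated rounding instruction:
// truncate toward zero, then subtract 1 if truncation rounded up.
// Only valid when x fits in int32_t.
float floor_baseline(float x) {
    const float truncated = static_cast<float>(static_cast<int32_t>(x));
    return truncated > x ? truncated - 1.0f : truncated;
}

// Stand-in for the single-instruction (roundps) path.
float floor_fast(float x) { return std::floor(x); }

using FloorFn = float (*)(float);

// Select an implementation once, based on the detected feature mask.
FloorFn choose_floor(uint32_t features) {
    return (features & kSSE41) ? floor_fast : floor_baseline;
}
```

A caller would compute the feature mask once at startup and keep the returned function pointer, so the per-call cost is a single indirect call.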

mtklein_C

Description was changed from ========== skcpu: sse4.1 floor, f16c f16<->f32 - floor with roundps is ...

4 years, 8 months ago (2016-04-13 22:32:55 UTC) #7

Description was changed to

==========
skcpu: sse4.1 floor, f16c f16<->f32

  -  floor with roundps is about 4.5x faster when available
  -  f16 srcover_n is similar to but a little faster than the version in
https://codereview.chromium.org/1884683002.  This new one fuses the dst
load/stores into the f16<->f32 conversions:

+0x180	    movups              (%r15), %xmm1
+0x184	    vcvtph2ps           (%rbx), %xmm2
+0x189	    movaps              %xmm1, %xmm3
+0x18c	    shufps              $255, %xmm3, %xmm3
+0x190	    movaps              %xmm0, %xmm4
+0x193	    subps               %xmm3, %xmm4
+0x196	    mulps               %xmm2, %xmm4
+0x199	    addps               %xmm1, %xmm4
+0x19c	    vcvtps2ph           $0, %xmm4, (%rbx)
+0x1a2	    addq                $16, %r15
+0x1a6	    addq                $8, %rbx
+0x1aa	    decl                %r14d
+0x1ad	    jne                 +0x180

If we decide to land this it'd be a good idea to convert most or all users of
SkFloatToHalf_01 and SkHalfToFloat_01 over to the pointer-based versions.

COMMIT=false
Depends on https://codereview.chromium.org/1890483002/

BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
==========

mtklein

The CQ bit was checked by mtklein@google.com to run a CQ dry run

4 years, 8 months ago (2016-04-14 18:06:41 UTC) #10

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1891513002/160001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1891513002/160001

4 years, 8 months ago (2016-04-14 18:06:49 UTC) #11

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 8 months ago (2016-04-14 18:09:18 UTC) #12

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: Build-Win-MSVC-x86-Debug-Trybot on client.skia.compile (JOB_FAILED, http://build.chromium.org/p/client.skia.compile/builders/Build-Win-MSVC-x86-Debug-Trybot/builds/7875) Build-Win-MSVC-x86_64-Debug-Trybot on ...

4 years, 8 months ago (2016-04-14 18:09:19 UTC) #13

mtklein

The CQ bit was checked by mtklein@google.com to run a CQ dry run

4 years, 8 months ago (2016-04-14 18:53:55 UTC) #14

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1891513002/180001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1891513002/180001

4 years, 8 months ago (2016-04-14 18:54:01 UTC) #15

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 8 months ago (2016-04-14 19:13:42 UTC) #16

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

4 years, 8 months ago (2016-04-14 19:13:43 UTC) #17

mtklein

The patchset sent to the CQ was uploaded after l-g-t-m from herb@google.com, fmalita@chromium.org Link to ...

4 years, 8 months ago (2016-04-14 19:22:41 UTC) #19

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 8 months ago (2016-04-14 19:22:42 UTC) #20

commit-bot: I haz the power

COMMIT=false detected. CQ is abandoning the patch.

4 years, 8 months ago (2016-04-14 19:22:44 UTC) #21

mtklein

Description was changed from ========== skcpu: sse4.1 floor, f16c f16<->f32 - floor with roundps is ...

4 years, 8 months ago (2016-04-14 19:26:28 UTC) #22

Description was changed to

==========
skcpu: sse4.1 floor, f16c f16<->f32

  -  floor with roundps is about 4.5x faster when available
  -  f16 srcover_n is similar to but a little faster than the version in
https://codereview.chromium.org/1884683002.  This new one fuses the dst
load/stores into the f16<->f32 conversions:

+0x180	    movups              (%r15), %xmm1
+0x184	    vcvtph2ps           (%rbx), %xmm2
+0x189	    movaps              %xmm1, %xmm3
+0x18c	    shufps              $255, %xmm3, %xmm3
+0x190	    movaps              %xmm0, %xmm4
+0x193	    subps               %xmm3, %xmm4
+0x196	    mulps               %xmm2, %xmm4
+0x199	    addps               %xmm1, %xmm4
+0x19c	    vcvtps2ph           $0, %xmm4, (%rbx)
+0x1a2	    addq                $16, %r15
+0x1a6	    addq                $8, %rbx
+0x1aa	    decl                %r14d
+0x1ad	    jne                 +0x180

If we decide to land this it'd be a good idea to convert most or all users of
SkFloatToHalf_01 and SkHalfToFloat_01 over to the pointer-based versions.

BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
==========

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1891513002/180001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1891513002/180001

4 years, 8 months ago (2016-04-14 19:26:43 UTC) #24

commit-bot: I haz the power

Description was changed from ========== skcpu: sse4.1 floor, f16c f16<->f32 - floor with roundps is ...

4 years, 8 months ago (2016-04-14 19:27:40 UTC) #25

Message was sent while issue was closed.

Description was changed to

==========
skcpu: sse4.1 floor, f16c f16<->f32

  -  floor with roundps is about 4.5x faster when available
  -  f16 srcover_n is similar to but a little faster than the version in
https://codereview.chromium.org/1884683002.  This new one fuses the dst
load/stores into the f16<->f32 conversions:

+0x180	    movups              (%r15), %xmm1
+0x184	    vcvtph2ps           (%rbx), %xmm2
+0x189	    movaps              %xmm1, %xmm3
+0x18c	    shufps              $255, %xmm3, %xmm3
+0x190	    movaps              %xmm0, %xmm4
+0x193	    subps               %xmm3, %xmm4
+0x196	    mulps               %xmm2, %xmm4
+0x199	    addps               %xmm1, %xmm4
+0x19c	    vcvtps2ph           $0, %xmm4, (%rbx)
+0x1a2	    addq                $16, %r15
+0x1a6	    addq                $8, %rbx
+0x1aa	    decl                %r14d
+0x1ad	    jne                 +0x180

If we decide to land this it'd be a good idea to convert most or all users of
SkFloatToHalf_01 and SkHalfToFloat_01 over to the pointer-based versions.

BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

Committed:
https://skia.googlesource.com/skia/+/cbe3c1af987d622ea67ef560d855b41bb14a0ce9
==========

commit-bot: I haz the power

Committed patchset #10 (id:180001) as https://skia.googlesource.com/skia/+/cbe3c1af987d622ea67ef560d855b41bb14a0ce9

4 years, 8 months ago (2016-04-14 19:27:42 UTC) #26

mtklein

A revert of this CL (patchset #10 id:180001) has been created in https://codereview.chromium.org/1891993002/ by mtklein@google.com. ...

4 years, 8 months ago (2016-04-14 23:22:58 UTC) #27

mtklein

The patchset sent to the CQ was uploaded after l-g-t-m from herb@google.com, fmalita@chromium.org Link to ...

4 years, 8 months ago (2016-04-15 13:08:50 UTC) #29

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1891513002/200001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1891513002/200001

4 years, 8 months ago (2016-04-15 13:09:02 UTC) #30

commit-bot: I haz the power

Description was changed from ========== skcpu: sse4.1 floor, f16c f16<->f32 - floor with roundps is ...

4 years, 8 months ago (2016-04-15 13:18:41 UTC) #31

Message was sent while issue was closed.

Description was changed to

==========
skcpu: sse4.1 floor, f16c f16<->f32

  -  floor with roundps is about 4.5x faster when available
  -  f16 srcover_n is similar to but a little faster than the version in
https://codereview.chromium.org/1884683002.  This new one fuses the dst
load/stores into the f16<->f32 conversions:

+0x180	    movups              (%r15), %xmm1
+0x184	    vcvtph2ps           (%rbx), %xmm2
+0x189	    movaps              %xmm1, %xmm3
+0x18c	    shufps              $255, %xmm3, %xmm3
+0x190	    movaps              %xmm0, %xmm4
+0x193	    subps               %xmm3, %xmm4
+0x196	    mulps               %xmm2, %xmm4
+0x199	    addps               %xmm1, %xmm4
+0x19c	    vcvtps2ph           $0, %xmm4, (%rbx)
+0x1a2	    addq                $16, %r15
+0x1a6	    addq                $8, %rbx
+0x1aa	    decl                %r14d
+0x1ad	    jne                 +0x180

If we decide to land this it'd be a good idea to convert most or all users of
SkFloatToHalf_01 and SkHalfToFloat_01 over to the pointer-based versions.

BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot

Committed:
https://skia.googlesource.com/skia/+/cbe3c1af987d622ea67ef560d855b41bb14a0ce9

Committed:
https://skia.googlesource.com/skia/+/3faf74b8364491ca806f523fbb1d8a97be592663
==========

commit-bot: I haz the power

Committed patchset #11 (id:200001) as https://skia.googlesource.com/skia/+/3faf74b8364491ca806f523fbb1d8a97be592663

4 years, 8 months ago (2016-04-15 13:18:42 UTC) #32

mtklein

A revert of this CL (patchset #11 id:200001) has been created in https://codereview.chromium.org/1897433002/ by mtklein@google.com. ...

4 years, 8 months ago (2016-04-15 15:37:06 UTC) #33

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1891513002/200001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1891513002/200001

4 years, 8 months ago (2016-04-19 21:10:18 UTC) #35

commit-bot: I haz the power

Description was changed from ========== skcpu: sse4.1 floor, f16c f16<->f32 - floor with roundps is ...

4 years, 8 months ago (2016-04-19 21:21:34 UTC) #36

commit-bot: I haz the power

Committed patchset #11 (id:200001) as https://skia.googlesource.com/skia/+/244a65350e52c9438931ecdc05a4913f29d343bc

4 years, 8 months ago (2016-04-19 21:21:35 UTC) #37

Message was sent while issue was closed.

Issue 1891513002: skcpu: sse4.1 floor, f16c f16<->f32 (Closed)


Patch Set 1
Patch Set 2: more hacking, mem addr versions
Patch Set 3: rebase
Patch Set 4: drop expect
Patch Set 5: simpler
Patch Set 6: rebase
Patch Set 7: typo
Patch Set 8: another typo
Patch Set 9: another
Patch Set 10: typo
Patch Set 11: invert #ifs
