Issue 2150343002: Add a bench to measure the best way to pack from int to uint16_t with SSE.

mtklein_C

Description was changed from ========== Add a bench to measure the best way to pack ...

4 years, 5 months ago (2016-07-15 13:34:36 UTC) #1

mtklein_C

Description was changed from ========== Add a bench to measure the best way to pack ...

4 years, 5 months ago (2016-07-15 13:36:00 UTC) #2

mtklein_C

Description was changed from ========== Add a bench to measure the best way to pack ...

4 years, 5 months ago (2016-07-15 13:37:15 UTC) #3

mtklein_C

The CQ bit was checked by mtklein@chromium.org to run a CQ dry run

4 years, 5 months ago (2016-07-15 13:37:17 UTC) #4

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2150343002/20001

4 years, 5 months ago (2016-07-15 13:37:22 UTC) #5

mtklein_C

The CQ bit was checked by mtklein@chromium.org to run a CQ dry run

4 years, 5 months ago (2016-07-15 14:08:28 UTC) #6

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2150343002/80001

4 years, 5 months ago (2016-07-15 14:08:37 UTC) #7

mtklein_C

Description was changed from ========== Add a bench to measure the best way to pack ...

4 years, 5 months ago (2016-07-15 14:11:08 UTC) #8

mtklein_C

Description was changed from ========== Add a bench to measure the best way to pack ...

4 years, 5 months ago (2016-07-15 14:12:38 UTC) #9

mtklein_C

Description was changed from ========== Add a bench to measure the best way to pack ...

4 years, 5 months ago (2016-07-15 14:13:59 UTC) #10

Description was changed from

==========
Add a bench to measure the best way to pack from int to uint16_t with SSE.

I measured relative runtimes on my laptop:

   pack_int_uint16_t_ss…
   1036  …e41 1x  …se3 1.01x  …e2_b 3.01x  …e2_a 3.02x

I've run into Clang problems with the actual _mm_packus_epi32 instruction, I
think,
so I'm going to exercise a little cowardice and leave that option disabled for
now.

The ssse3 version probably looks a little faster than it will be in practice.
We'll usually need to load its mask, which here is hoisted out of the bench
loop.

The two sse2 variants are close enough in that I'm tie breaking them on other
concerns: the <<16, >>16 version doesn't need any scratch registers or to load
any constants, so it wins.

BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2150343002
CQ_INCLUDE_TRYBOTS=master.client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot,Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-Fast-Trybot
==========

to

==========
Add a bench to measure the best way to pack from int to uint16_t with SSE.

I measured relative runtimes on my laptop:

   pack_int_uint16_t_ss…
   1036  …e41 1x  …se3 1.01x  …e2_b 3.01x  …e2_a 3.02x

I've run into Clang problems with the actual _mm_packus_epi32 instruction, I
think,
so I'm going to exercise a little cowardice and leave that option disabled for
now.

The ssse3 version probably looks a little faster than it will be in practice.
We'll usually need to load its mask, which here is hoisted out of the bench
loop.

The two sse2 variants are close enough in speed that I'm tie breaking them on
other
concerns: the <<16, >>16 version doesn't need any scratch registers or to load
any constants, so it wins.

BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2150343002
CQ_INCLUDE_TRYBOTS=master.client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot,Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-Fast-Trybot
==========
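
For readers following along, here is a minimal sketch of what the "<<16, >>16" SSE2 approach in the description plausibly looks like. It is an assumption based on the description, not the code from the patch: the function names are hypothetical, and it takes "pack" to mean keeping the low 16 bits of each 32-bit lane. The idea is that sign-extending the low 16 bits first makes the signed-saturating `_mm_packs_epi32` pass the bits through unchanged, with no scratch register and no constant to load.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the patch): pack four int32 lanes into four
// uint16_t values by keeping the low 16 bits of each lane, using only SSE2.
inline std::array<uint16_t, 4> pack_shift_sse2(const std::array<int32_t, 4>& in) {
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in.data()));
    // <<16 then arithmetic >>16 sign-extends the low 16 bits of each lane,
    // so every lane now fits in int16_t and the signed pack cannot saturate.
    v = _mm_srai_epi32(_mm_slli_epi32(v, 16), 16);
    // _mm_packs_epi32 narrows with signed saturation; since nothing saturates,
    // the low 16 bits of each lane come through bit-for-bit.
    v = _mm_packs_epi32(v, v);
    std::array<uint16_t, 4> out;
    // The four packed values sit in the low 64 bits of the register.
    _mm_storel_epi64(reinterpret_cast<__m128i*>(out.data()), v);
    return out;
}

// Small usage check covering values on both sides of the int16_t sign boundary.
inline void pack_shift_sse2_self_check() {
    const std::array<uint16_t, 4> got = pack_shift_sse2({{0, 1, 0x8000, 0xFFFF}});
    assert(got[0] == 0 && got[1] == 1 && got[2] == 0x8000 && got[3] == 0xFFFF);
}
```

Note that this sketch truncates rather than clamps: an input lane above 65535 keeps only its low 16 bits. That differs from `_mm_packus_epi32`, which saturates to the unsigned 16-bit range.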

mtklein_C

Description was changed from ========== Add a bench to measure the best way to pack ...

4 years, 5 months ago (2016-07-15 14:14:15 UTC) #11

mtklein_C

mtklein@chromium.org changed reviewers: + msarett@google.com

4 years, 5 months ago (2016-07-15 14:24:20 UTC) #12

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 5 months ago (2016-07-15 14:25:52 UTC) #14

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: Build-Win-MSVC-x86_64-Debug-Trybot on master.client.skia.compile (JOB_FAILED, http://build.chromium.org/p/client.skia.compile/builders/Build-Win-MSVC-x86_64-Debug-Trybot/builds/9845)

4 years, 5 months ago (2016-07-15 14:25:53 UTC) #15

mtklein_C

The CQ bit was checked by mtklein@chromium.org to run a CQ dry run

4 years, 5 months ago (2016-07-15 14:27:24 UTC) #16

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2150343002/100001

4 years, 5 months ago (2016-07-15 14:27:28 UTC) #17

msarett

lgtm https://codereview.chromium.org/2150343002/diff/80001/src/opts/SkNx_sse.h File src/opts/SkNx_sse.h (right): https://codereview.chromium.org/2150343002/diff/80001/src/opts/SkNx_sse.h#newcode328 src/opts/SkNx_sse.h:328: // TODO: This seems to be causing code ...

4 years, 5 months ago (2016-07-15 14:29:15 UTC) #18

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2150343002/100001

4 years, 5 months ago (2016-07-15 14:32:24 UTC) #21

commit-bot: I haz the power

Description was changed from ========== Add a bench to measure the best way to pack ...

4 years, 5 months ago (2016-07-15 14:45:56 UTC) #22

Message was sent while issue was closed.

commit-bot: I haz the power

4 years, 5 months ago (2016-07-15 14:45:57 UTC) #23

Message was sent while issue was closed.

Committed patchset #6 (id:100001) as
https://skia.googlesource.com/skia/+/036e1831e05ae3a6ec9bcd30cb24f6b1a49a3541

Issue 2150343002: Add a bench to measure the best way to pack from int to uint16_t with SSE. (Closed)

Patch Set 1 #

Patch Set 2 : naming #

Patch Set 3 : typo #

Patch Set 4 : static -> anonymous #

Patch Set 5 : merge #

Patch Set 6 : so tired of this MSVC... #
