Issue 1813263002: custom ssse3 srcover_n_srgb_bw, about 1.8x faster

mtklein_C

Description was changed from ========== custom ssse3 srcover_n_srgb_bw, about 1.8x speedup This is a little ...

4 years, 9 months ago (2016-03-18 16:15:44 UTC) #1

mtklein_C

Description was changed from ========== custom ssse3 srcover_n_srgb_bw, about 1.8x speedup This is a little ...

4 years, 9 months ago (2016-03-18 16:22:47 UTC) #2

mtklein_C

Description was changed from ========== custom ssse3 srcover_n_srgb_bw, about 1.8x speedup This is a little ...

4 years, 9 months ago (2016-03-18 16:23:02 UTC) #3

Description was changed from

==========
custom ssse3 srcover_n_srgb_bw, about 1.8x speedup

This is a little demo of the sorts of speedups we can get from working in planar
format, or even just a mini-planar of 4 pixels at a time like we're doing here.

I chose this blit by running
  $ out/Release/nanobench --config srgb --match skp
and looking for the hottest sRGB-related method.
After this CL, src_1 and src_n become hotter than srcover_n.  They can probably
get a similar treatment.

We transpose three times in this function:
   - dst after reading, as part of the zero-extension and conversion to floats
   - src after reading, _MM_TRANSPOSE4_PS
   - result before writing, the last _mm_shuffle_epi8
If we changed our buffer format to a mini-planar format like rrrr gggg bbbb
aaaa, we could eliminate the src transpose and get another small speedup, to
right around 2x.

This code leans pretty heavily on SSSE3, so if we want it to speed up
Windows+Linux Chrome, it'll eventually want to go behind a function pointer.

BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
==========

to

==========
custom ssse3 srcover_n_srgb_bw, about 1.8x speedup

This is a little demo of the sorts of speedups we can get from working in planar
format, or even just a mini-planar of 4 pixels at a time like I'm doing here.

I chose this blit by running
  $ out/Release/nanobench --config srgb --match skp
and looking for the hottest sRGB-related method.
After this CL, src_1 and src_n become hotter than srcover_n.  They can probably
get a similar treatment.

We transpose three times in this function:
   - dst after reading, as part of the zero-extension and conversion to floats
   - src after reading, _MM_TRANSPOSE4_PS
   - result before writing, the last _mm_shuffle_epi8
If we changed our buffer format to a mini-planar format like rrrr gggg bbbb
aaaa, we could eliminate the src transpose and get another small speedup, to
right around 2x.

This code leans pretty heavily on SSSE3, so if we want it to speed up
Windows+Linux Chrome, it'll eventually want to go behind a function pointer.

BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
==========
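
For readers unfamiliar with the "mini-planar" idea in the description above, here is a minimal portable sketch of the layout change, not the SSSE3 code from the CL. The names `MiniPlanar4`, `load_transposed`, and `store_transposed` are hypothetical; the CL itself performs the equivalent steps with intrinsics such as `_MM_TRANSPOSE4_PS` and `_mm_shuffle_epi8`.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical container: four pixels stored one channel per array,
// i.e. "rrrr gggg bbbb aaaa" instead of "rgba rgba rgba rgba".
struct MiniPlanar4 {
    std::array<float, 4> r, g, b, a;
};

// Read 4 interleaved 8-bit RGBA pixels (16 bytes), zero-extend each channel
// to float, and transpose into the planar layout in one pass.
MiniPlanar4 load_transposed(const uint8_t* px) {
    MiniPlanar4 p{};
    for (int i = 0; i < 4; i++) {
        p.r[i] = px[4 * i + 0];
        p.g[i] = px[4 * i + 1];
        p.b[i] = px[4 * i + 2];
        p.a[i] = px[4 * i + 3];
    }
    return p;
}

// Inverse step: transpose back to interleaved bytes before writing.
// Assumes every value is already in [0, 255].
void store_transposed(const MiniPlanar4& p, uint8_t* px) {
    auto to_byte = [](float v) {
        assert(v >= 0.0f && v <= 255.0f);
        return static_cast<uint8_t>(v);
    };
    for (int i = 0; i < 4; i++) {
        px[4 * i + 0] = to_byte(p.r[i]);
        px[4 * i + 1] = to_byte(p.g[i]);
        px[4 * i + 2] = to_byte(p.b[i]);
        px[4 * i + 3] = to_byte(p.a[i]);
    }
}
```

With a buffer that is already stored as rrrr gggg bbbb aaaa, the load for that buffer needs no transpose at all, which is the saving the description attributes to changing the buffer format.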

mtklein_C

Description was changed from ========== custom ssse3 srcover_n_srgb_bw, about 1.8x speedup This is a little ...

4 years, 9 months ago (2016-03-18 16:23:43 UTC) #4

Description was changed to

==========
custom ssse3 srcover_n_srgb_bw, about 1.8x speedup

This is a little demo of the sorts of speedups we can get from working in planar
format, or even just a mini-planar of 4 pixels at a time like I'm doing here.

I chose this blit by running
  $ out/Release/nanobench --config srgb --match skp
and looking for the hottest sRGB-related method.
After this CL, src_1 and src_n become hotter than srcover_n.  They can probably
get a similar treatment.

We transpose three times in this function:
   - dst after reading, as part of the zero-extension and conversion to float
   - src after reading, _MM_TRANSPOSE4_PS (which expands to 8 cheap
instructions)
   - result before writing, the last _mm_shuffle_epi8
If we changed our buffer format to a mini-planar format like rrrr gggg bbbb
aaaa, we could eliminate the src transpose and get another small speedup, to
right around 2x.

This code leans pretty heavily on SSSE3, so if we want it to speed up
Windows+Linux Chrome, it'll eventually want to go behind a function pointer.

BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
==========

mtklein_C

Description was changed from ========== custom ssse3 srcover_n_srgb_bw, about 1.8x speedup This is a little ...

4 years, 9 months ago (2016-03-18 16:29:05 UTC) #5

Description was changed to

==========
custom ssse3 srcover_n_srgb_bw, about 1.8x speedup

This is a little demo of the sorts of speedups we can get from working in planar
format, or even just a mini-planar of 4 pixels at a time like I'm doing here.

I chose this blit by running
  $ out/Release/nanobench --config srgb --match skp
and looking for the hottest sRGB-related method.
After this CL, src_1 and src_n become hotter than srcover_n.  They can probably
get a similar treatment.

We transpose three times in this function:
   - dst after reading, as part of the zero-extension and conversion to float
   - src after reading, _MM_TRANSPOSE4_PS (which expands to 8 cheap
instructions)
   - result before writing, the last _mm_shuffle_epi8
If we changed our buffer format to a mini-planar format like rrrr gggg bbbb
aaaa, we could eliminate the src transpose and get another small speedup, to
right around 2x.

This code leans pretty heavily on SSSE3, so if we want it to speed up
Windows+Linux Chrome, it'll eventually want to go behind a function pointer.

This also appears to fix what looks like 8-bit wraparound in a few GMs, e.g.
hairmodes.  This is something we'd better look into...

BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
==========
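
The description only says the old output "looks like" 8-bit wraparound; the cause is not confirmed in this thread. As a general illustration of what that symptom means (and not a reconstruction of the actual bug), the sketch below contrasts a store that keeps only the low byte with one that clamps first. Both function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Keeps only the low 8 bits of the rounded value, so a result slightly above
// 255 wraps around to a small (dark) value.
uint8_t store_wrapping(float v) {
    assert(!std::isnan(v));
    return static_cast<uint8_t>(static_cast<int>(std::lround(v)) & 0xFF);
}

// Clamps to [0, 255] before narrowing, so an overshoot saturates at 255.
uint8_t store_clamped(float v) {
    assert(!std::isnan(v));
    const float c = std::min(std::max(v, 0.0f), 255.0f);
    return static_cast<uint8_t>(std::lround(c));
}
```

For an input of 260.0f the wrapping store yields 4 while the clamped store yields 255, which is why wraparound tends to show up as dark speckles in what should be the brightest pixels.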

mtklein_C

Description was changed from ========== custom ssse3 srcover_n_srgb_bw, about 1.8x speedup This is a little ...

4 years, 9 months ago (2016-03-18 16:30:17 UTC) #6

Description was changed to

==========
custom ssse3 srcover_n_srgb_bw, about 1.8x speedup

This is a little demo of the sorts of speedups we can get from working in planar
format, or even just a mini-planar of 4 pixels at a time like I'm doing here.

I chose this blit by running
  $ out/Release/nanobench --config srgb --match skp
and looking for the hottest sRGB-related method.
After this CL, src_1 and src_n become hotter than srcover_n.  They can probably
get a similar treatment.

We transpose three times in this function:
   - dst after reading, as part of the zero-extension and conversion to float
   - src after reading, _MM_TRANSPOSE4_PS (which expands to 8 cheap
instructions)
   - result before writing, the last _mm_shuffle_epi8
If we changed our buffer format to a mini-planar format like rrrr gggg bbbb
aaaa, we could eliminate the src transpose and get another small speedup, to
right around 2x.

This code leans pretty heavily on SSSE3, so if we want it to speed up
Windows+Linux Chrome, it'll eventually want to go behind a function pointer.

This also appears to fix what looks like overflowin a few GMs, most noticeably
in hairmodes.  This is something we'd better look into...

BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
==========

mtklein_C

Description was changed from ========== custom ssse3 srcover_n_srgb_bw, about 1.8x speedup This is a little ...

4 years, 9 months ago (2016-03-18 16:30:25 UTC) #7

Description was changed to

==========
custom ssse3 srcover_n_srgb_bw, about 1.8x speedup

This is a little demo of the sorts of speedups we can get from working in planar
format, or even just a mini-planar of 4 pixels at a time like I'm doing here.

I chose this blit by running
  $ out/Release/nanobench --config srgb --match skp
and looking for the hottest sRGB-related method.
After this CL, src_1 and src_n become hotter than srcover_n.  They can probably
get a similar treatment.

We transpose three times in this function:
   - dst after reading, as part of the zero-extension and conversion to float
   - src after reading, _MM_TRANSPOSE4_PS (which expands to 8 cheap
instructions)
   - result before writing, the last _mm_shuffle_epi8
If we changed our buffer format to a mini-planar format like rrrr gggg bbbb
aaaa, we could eliminate the src transpose and get another small speedup, to
right around 2x.

This code leans pretty heavily on SSSE3, so if we want it to speed up
Windows+Linux Chrome, it'll eventually want to go behind a function pointer.

This also appears to fix what looks like overflow in a few GMs, most noticeably
in hairmodes.  This is something we'd better look into...

BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
==========
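
As a scalar point of reference for what a src-over blit onto an sRGB destination computes, here is a one-pixel sketch. It is written under several assumptions that the thread does not confirm: the source is a premultiplied linear float pixel (the hypothetical `PM4f`), the destination is 8-bit with sRGB-encoded color and linear alpha, and the exact sRGB transfer functions are used. The committed code may well use a different pixel layout or cheaper approximations.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Exact sRGB transfer functions on [0, 1].
float srgb_to_linear(float s) {
    return s <= 0.04045f ? s / 12.92f
                         : std::pow((s + 0.055f) / 1.055f, 2.4f);
}
float linear_to_srgb(float l) {
    return l <= 0.0031308f ? l * 12.92f
                           : 1.055f * std::pow(l, 1.0f / 2.4f) - 0.055f;
}

// Hypothetical source pixel: premultiplied, linear, each channel in [0, 1].
struct PM4f { float r, g, b, a; };

// dst holds r, g, b, a bytes. Color bytes are sRGB-encoded; alpha is linear.
void srcover_srgb_scalar(uint8_t dst[4], const PM4f& src) {
    assert(src.a >= 0.0f && src.a <= 1.0f);
    const float inv = 1.0f - src.a;
    auto clamp01 = [](float v) { return std::min(std::max(v, 0.0f), 1.0f); };

    // Decode dst to linear, blend in linear space, re-encode to sRGB.
    auto blend = [&](uint8_t d, float s) {
        const float lin = clamp01(s + srgb_to_linear(d / 255.0f) * inv);
        return static_cast<uint8_t>(std::lround(linear_to_srgb(lin) * 255.0f));
    };
    dst[0] = blend(dst[0], src.r);
    dst[1] = blend(dst[1], src.g);
    dst[2] = blend(dst[2], src.b);

    // Alpha is not gamma-encoded, so it blends directly.
    const float a = clamp01(src.a + (dst[3] / 255.0f) * inv);
    dst[3] = static_cast<uint8_t>(std::lround(a * 255.0f));
}
```

The 4-pixel version described in the CL performs the same three stages (decode dst, blend, encode) on four pixels at once, with the transposes moving data between interleaved and per-channel form.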

mtklein_C

The CQ bit was checked by mtklein@chromium.org to run a CQ dry run

4 years, 9 months ago (2016-03-18 16:34:02 UTC) #8

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1813263002/20001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1813263002/20001

4 years, 9 months ago (2016-03-18 16:34:08 UTC) #9

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 9 months ago (2016-03-18 16:43:19 UTC) #10

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

4 years, 9 months ago (2016-03-18 16:43:20 UTC) #11

mtklein_C

mtklein@chromium.org changed reviewers: + herb@google.com, reed@google.com

4 years, 9 months ago (2016-03-18 16:44:11 UTC) #12

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1813263002/20001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1813263002/20001

4 years, 9 months ago (2016-03-18 18:06:48 UTC) #16

commit-bot: I haz the power

Description was changed from ========== custom ssse3 srcover_n_srgb_bw, about 1.8x speedup This is a little ...

4 years, 9 months ago (2016-03-18 18:07:48 UTC) #17

Message was sent while issue was closed.

Description was changed to

==========
custom ssse3 srcover_n_srgb_bw, about 1.8x speedup

This is a little demo of the sorts of speedups we can get from working in planar
format, or even just a mini-planar of 4 pixels at a time like I'm doing here.

I chose this blit by running
  $ out/Release/nanobench --config srgb --match skp
and looking for the hottest sRGB-related method.
After this CL, src_1 and src_n become hotter than srcover_n.  They can probably
get a similar treatment.

We transpose three times in this function:
   - dst after reading, as part of the zero-extension and conversion to float
   - src after reading, _MM_TRANSPOSE4_PS (which expands to 8 cheap
instructions)
   - result before writing, the last _mm_shuffle_epi8
If we changed our buffer format to a mini-planar format like rrrr gggg bbbb
aaaa, we could eliminate the src transpose and get another small speedup, to
right around 2x.

This code leans pretty heavily on SSSE3, so if we want it to speed up
Windows+Linux Chrome, it'll eventually want to go behind a function pointer.

This also appears to fix what looks like overflow in a few GMs, most noticeably
in hairmodes.  This is something we'd better look into...

BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...

Committed:
https://skia.googlesource.com/skia/+/dbd94e2bb265b34c2d9bf82624909fef84a7217e
==========
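
On the point about going "behind a function pointer": a minimal sketch of runtime dispatch might look like the following. Everything here is hypothetical (`BlitProc`, `blit_portable`, `blit_ssse3`, `choose_blit`, `cpu_has_ssse3`), and the two blit bodies are plain-copy placeholders rather than real src-over code. The only real API used is the GCC/Clang builtin `__builtin_cpu_supports`, guarded so that other compilers and architectures fall back to the portable path.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

using BlitProc = void (*)(uint32_t* dst, const uint32_t* src, size_t n);

// Placeholder bodies: a real build would put the SSSE3 version in a file
// compiled with SSSE3 enabled and keep a portable fallback alongside it.
static void blit_portable(uint32_t* dst, const uint32_t* src, size_t n) {
    for (size_t i = 0; i < n; i++) { dst[i] = src[i]; }
}
static void blit_ssse3(uint32_t* dst, const uint32_t* src, size_t n) {
    for (size_t i = 0; i < n; i++) { dst[i] = src[i]; }
}

// Runtime CPU check; reports false wherever the builtin is unavailable.
static bool cpu_has_ssse3() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("ssse3") != 0;
#else
    return false;
#endif
}

// Pick an implementation from a capability flag.
static BlitProc choose_blit(bool has_ssse3) {
    return has_ssse3 ? blit_ssse3 : blit_portable;
}

// Callers go through the pointer, which is resolved once on first use.
void blit(uint32_t* dst, const uint32_t* src, size_t n) {
    static const BlitProc proc = choose_blit(cpu_has_ssse3());
    assert(proc != nullptr);
    proc(dst, src, n);
}
```

Keeping the selection in one small function means the hot loop pays for a single indirect call per blit rather than a CPU check per pixel.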

commit-bot: I haz the power

4 years, 9 months ago (2016-03-18 18:07:49 UTC) #18

Message was sent while issue was closed.

Committed patchset #2 (id:20001) as
https://skia.googlesource.com/skia/+/dbd94e2bb265b34c2d9bf82624909fef84a7217e

Issue 1813263002: custom ssse3 srcover_n_srgb_bw, about 1.8x faster (Closed)

Patch Set 1 #

Patch Set 2 : undo #