Description
custom ssse3 srcover_n_srgb_bw, about 1.8x speedup
This is a little demo of the sorts of speedups we can get from working in planar format, or even just a mini-planar of 4 pixels at a time like I'm doing here.
I chose this blit by running
$ out/Release/nanobench --config srgb --match skp
and looking for the hottest sRGB-related method.
After this CL, src_1 and src_n become hotter than srcover_n. They can probably get a similar treatment.
We transpose three times in this function:
- dst after reading, as part of the zero-extension and conversion to float
- src after reading, _MM_TRANSPOSE4_PS (which expands to 8 cheap instructions)
- result before writing, the last _mm_shuffle_epi8
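The data movement behind those three transposes can be sketched in portable scalar C++. This is an illustration only, not the code in this CL: the names Pixel4x4, transpose4x4, load_bytes_planar and store_planar_bytes are hypothetical, and the plain loops stand in for what the SSE/SSSE3 shuffles do in a handful of instructions.

```cpp
#include <array>
#include <cstdint>

// Four rows of four floats. Depending on context a row is either one pixel
// (r g b a) or one channel across four pixels (rrrr, gggg, ...).
using Pixel4x4 = std::array<std::array<float, 4>, 4>;

// Scalar stand-in for _MM_TRANSPOSE4_PS: rows become columns, so
// rgba rgba rgba rgba turns into rrrr gggg bbbb aaaa (and back again).
Pixel4x4 transpose4x4(const Pixel4x4& in) {
    Pixel4x4 out{};
    for (int row = 0; row < 4; row++) {
        for (int col = 0; col < 4; col++) {
            out[col][row] = in[row][col];
        }
    }
    return out;
}

// Read 4 packed 8-bit RGBA pixels, zero-extend each byte to float, and
// transpose in the same pass, as the description says happens for dst.
Pixel4x4 load_bytes_planar(const uint8_t rgba[16]) {
    Pixel4x4 planar{};
    for (int px = 0; px < 4; px++) {
        for (int ch = 0; ch < 4; ch++) {
            planar[ch][px] = static_cast<float>(rgba[4 * px + ch]);
        }
    }
    return planar;
}

// Transpose back to interleaved bytes before writing. Assumes every value
// is already in [0, 255]; rounds to nearest by adding 0.5 and truncating.
void store_planar_bytes(const Pixel4x4& planar, uint8_t rgba[16]) {
    for (int px = 0; px < 4; px++) {
        for (int ch = 0; ch < 4; ch++) {
            rgba[4 * px + ch] = static_cast<uint8_t>(planar[ch][px] + 0.5f);
        }
    }
}
```

Loading with load_bytes_planar and then calling store_planar_bytes round-trips the 16 bytes unchanged, which is a quick sanity check that the two transposes are inverses.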
If we changed our buffer format to a mini-planar format like rrrr gggg bbbb aaaa, we could eliminate the src transpose and get another small speedup, to right around 2x.
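To make the mini-planar idea concrete, here is a rough sketch of src-over on one block of 4 pixels that is already stored as rrrr gggg bbbb aaaa. The Planar4 struct and srcover_planar4 are hypothetical names, not anything in Skia, and the sketch assumes premultiplied linear floats in [0, 1]. The sRGB decode and encode steps are left out.

```cpp
// Hypothetical mini-planar block: 4 pixels stored channel by channel.
struct Planar4 {
    float r[4];
    float g[4];
    float b[4];
    float a[4];
};

// Premultiplied src-over: dst = src + dst * (1 - src_alpha).
// Because src is already planar, no transpose is needed before the math;
// each statement in the loop maps onto one 4-wide multiply-add.
inline void srcover_planar4(const Planar4& src, Planar4* dst) {
    for (int i = 0; i < 4; i++) {
        const float inv_a = 1.0f - src.a[i];
        dst->r[i] = src.r[i] + dst->r[i] * inv_a;
        dst->g[i] = src.g[i] + dst->g[i] * inv_a;
        dst->b[i] = src.b[i] + dst->b[i] * inv_a;
        dst->a[i] = src.a[i] + dst->a[i] * inv_a;
    }
}
```

With this layout the four channel arrays are exactly the four registers the blend wants, which is why the src transpose would drop out.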
This code leans pretty heavily on SSSE3, so if we want it to speed up Windows+Linux Chrome, it'll eventually want to go behind a function pointer.
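For reference, "behind a function pointer" could look roughly like the sketch below. Everything here is hypothetical (BlitProc, blit_portable, blit_ssse3_placeholder, cpu_has_ssse3, choose_blit), and both blit bodies are placeholder copies rather than real blending. The only real API used is the GCC/Clang builtin __builtin_cpu_supports, guarded so other compilers and CPUs fall back to the portable path.

```cpp
#include <cstdint>

// Hypothetical signature: process n src pixels into n dst pixels.
using BlitProc = void (*)(uint32_t* dst, const uint32_t* src, int n);

// Portable fallback. Placeholder body: a plain copy, not real src-over.
static void blit_portable(uint32_t* dst, const uint32_t* src, int n) {
    for (int i = 0; i < n; i++) {
        dst[i] = src[i];
    }
}

// Stand-in for the SSSE3 path. A real version would live in a file built
// with SSSE3 enabled; this placeholder just copies as well.
static void blit_ssse3_placeholder(uint32_t* dst, const uint32_t* src, int n) {
    for (int i = 0; i < n; i++) {
        dst[i] = src[i];
    }
}

// Runtime CPU check. Only GCC and Clang on x86 are handled here; anything
// else reports false and so takes the portable path.
static bool cpu_has_ssse3() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("ssse3") != 0;
#else
    return false;
#endif
}

// Pick the implementation once; callers hold on to the returned pointer.
static BlitProc choose_blit(bool has_ssse3) {
    return has_ssse3 ? blit_ssse3_placeholder : blit_portable;
}
```

A caller would do something like `BlitProc proc = choose_blit(cpu_has_ssse3());` once at startup and then call `proc(dst, src, n)` per span, so the SSSE3 instructions are only reached on CPUs that report support for them.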
This also appears to fix what looks like overflow in a few GMs, most noticeably in hairmodes. This is something we'd better look into...
BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1813263002
Committed: https://skia.googlesource.com/skia/+/dbd94e2bb265b34c2d9bf82624909fef84a7217e
Patch Set 1
Patch Set 2: undo
Messages
Total messages: 18 (12 generated)
The CQ bit was checked by mtklein@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1813263002/20001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1813263002/20001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
mtklein@chromium.org changed reviewers: + herb@google.com, reed@google.com
lgtm
The CQ bit was checked by mtklein@google.com
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1813263002/20001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1813263002/20001
Message was sent while issue was closed.
Committed patchset #2 (id:20001) as https://skia.googlesource.com/skia/+/dbd94e2bb265b34c2d9bf82624909fef84a7217e