|
|
Created:
4 years ago by xiangze.zhang Modified:
4 years ago CC:
reviews_skia.org Target Ref:
refs/heads/master Project:
skia Visibility:
Public. |
DescriptionAdd AVX2 version of ConvolveVertically
ConvolveVertically time is reduced about 60% using haswell cpu.
Nanobench results:
before after
bitmap_scale_filter_64_256 611us 302us
bitmap_scale_filter_80_90 101us 64.9us
bitmap_scale_filter_30_90 82.3us 51.4us
bitmap_scale_filter_10_90 73.6us 42.4us
BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2526733002
CQ_INCLUDE_TRYBOTS=skia.primary:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD
Committed: https://skia.googlesource.com/skia/+/4adac2e32a99b9e266b41b356c7bbc2ea2c8dd31
Patch Set 1 #
Total comments: 16
Patch Set 2 : Address comments #Patch Set 3 : Rebase #
Messages
Total messages: 30 (12 generated)
Description was changed from ========== Add AVX2 version of ConvolveVertically ConvolveVertically time is reduced about 60% using haswell cpu. Nanobench results: before after bitmap_scale_filter_64_256 611us 302us bitmap_scale_filter_80_90 101us 64.9us bitmap_scale_filter_30_90 82.3us 51.4us bitmap_scale_filter_10_90 73.6us 42.4us BUG=skia: ========== to ========== Add AVX2 version of ConvolveVertically ConvolveVertically time is reduced about 60% using haswell cpu. Nanobench results: before after bitmap_scale_filter_64_256 611us 302us bitmap_scale_filter_80_90 101us 64.9us bitmap_scale_filter_30_90 82.3us 51.4us bitmap_scale_filter_10_90 73.6us 42.4us BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2526733002 CQ_INCLUDE_TRYBOTS=master.client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ==========
xiangze.zhang@intel.com changed reviewers: + mtklein@chromium.org
PTAL.
Hi! I'm sorry I haven't responded on these two CLs yet. They're important to me and I promise to review soon.
https://codereview.chromium.org/2526733002/diff/1/src/opts/SkBitmapFilter_opts.h File src/opts/SkBitmapFilter_opts.h (right): https://codereview.chromium.org/2526733002/diff/1/src/opts/SkBitmapFilter_opt... src/opts/SkBitmapFilter_opts.h:14: #include <immintrin.h> I think we can just use immintrin.h for everything >= SSE2. https://codereview.chromium.org/2526733002/diff/1/src/opts/SkBitmapFilter_opt... src/opts/SkBitmapFilter_opts.h:26: void ComputeCoefficientRow(SkConvolutionFilter1D::ConvolutionFixed filterValue, const unsigned char* sourceDataRows, Typically we name static functions with lower_case_and_underscores(), and pass mutated values as pointers instead of references so that it's clear they're being mutated at the call site. In my experience this doesn't have any effect on code generation, especially with SK_ALWAYS_INLINE. https://codereview.chromium.org/2526733002/diff/1/src/opts/SkBitmapFilter_opt... src/opts/SkBitmapFilter_opts.h:29: __m256i pixels = _mm256_lddqu_si256(reinterpret_cast<const __m256i *>(sourceDataRows)); I have always shied away from lddqu, instead just using loadu. Is it important here? Not sure I see any difference when benching. https://codereview.chromium.org/2526733002/diff/1/src/opts/SkBitmapFilter_opt... src/opts/SkBitmapFilter_opts.h:33: __m256i Yt1 = _mm256_unpacklo_epi8(pixels, zero); Does Yt1 mean ymm, temporary #1? We usually find this sort of code a lot easier to read and maintain if you give each value some sort of meaningful name (like pixels_03_16bit) then never change it, letting the compiler handle register allocation. It can also help readability if you can find ways to vertically align analogous parts of register or operation names. E.g. __m256i pixels_03_16bit = _mm256_unpacklo_epi8(pixels, zero); __m256i scaled_03_lo = _mm256_mullo_epi16(pixels_03_16bit, coefs), scaled_03_hi = _mm256_mulhi_epi16(pixels_03_16bit, coefs); Y01 = _mm256_add_epi32(Y01, _mm256_unpacklo_epi16(scaled_03_lo, scaled_03_hi)); Y23 = _mm256_add_epi32(Y23, _mm256_unpackhi_epi16(scaled_03_lo, scaled_03_hi)); ... https://codereview.chromium.org/2526733002/diff/1/src/opts/SkBitmapFilter_opt... src/opts/SkBitmapFilter_opts.h:84: ComputeCoefficientRow(filterValues[filterY], sourceDataRows[filterY] + outX * 4, Y01, Y23, Y45, Y67); This might line up neater with a couple "+ 0". https://codereview.chromium.org/2526733002/diff/1/src/opts/SkBitmapFilter_opt... src/opts/SkBitmapFilter_opts.h:123: // Make sure the value of alpha channel is always larger than maximum This comment might be better moved up just under if (hasAlpha) { https://codereview.chromium.org/2526733002/diff/1/src/opts/SkBitmapFilter_opt... src/opts/SkBitmapFilter_opts.h:157: Yt2 = _mm256_max_epu8(Yt1, Yt0); // Max of r and g. This is the sort of code that using more variables with better names can help make readable. E.g. if (hasAlpha) { // If alpha is less than r,g, or b, set it to their max. auto max_rg = _mm256_max_epu8(pixels, _mm256_srli_epi32(pixels, 8)), max_rgb = _mm256_max_epu8(max_rg, _mm256_srli_epi32(pixels, 16)); pixels = _mm256_max_epu8(pixels, _mm256_slli_epi32(max_rgb, 24)); } else { // Force opaque. pixels = _mm256_or_si256(pixels, _mm256_set1_epi32(0xff000000)); } https://codereview.chromium.org/2526733002/diff/1/src/opts/SkBitmapFilter_opt... src/opts/SkBitmapFilter_opts.h:167: for (int i = width; i < pixelWidth; i++) { I'd have expected a call to _mm256_maskstore_epi32()... is this way faster?
Thanks for your review and I have addressed most of the comments. I also merged the code that deal with the left pixels into the loop since most code are repeated. For the last comment, I don't know whether I should make changes? https://codereview.chromium.org/2526733002/diff/1/src/opts/SkBitmapFilter_opts.h File src/opts/SkBitmapFilter_opts.h (right): https://codereview.chromium.org/2526733002/diff/1/src/opts/SkBitmapFilter_opt... src/opts/SkBitmapFilter_opts.h:14: #include <immintrin.h> On 2016/12/05 18:42:46, mtklein_C wrote: > I think we can just use immintrin.h for everything >= SSE2. Done. https://codereview.chromium.org/2526733002/diff/1/src/opts/SkBitmapFilter_opt... src/opts/SkBitmapFilter_opts.h:26: void ComputeCoefficientRow(SkConvolutionFilter1D::ConvolutionFixed filterValue, const unsigned char* sourceDataRows, On 2016/12/05 18:42:45, mtklein_C wrote: > Typically we name static functions with lower_case_and_underscores(), and pass > mutated values as pointers instead of references so that it's clear they're > being mutated at the call site. In my experience this doesn't have any effect > on code generation, especially with SK_ALWAYS_INLINE. Done. https://codereview.chromium.org/2526733002/diff/1/src/opts/SkBitmapFilter_opt... src/opts/SkBitmapFilter_opts.h:29: __m256i pixels = _mm256_lddqu_si256(reinterpret_cast<const __m256i *>(sourceDataRows)); On 2016/12/05 18:42:45, mtklein_C wrote: > I have always shied away from lddqu, instead just using loadu. Is it important > here? Not sure I see any difference when benching. Done. https://codereview.chromium.org/2526733002/diff/1/src/opts/SkBitmapFilter_opt... src/opts/SkBitmapFilter_opts.h:33: __m256i Yt1 = _mm256_unpacklo_epi8(pixels, zero); On 2016/12/05 18:42:45, mtklein_C wrote: > Does Yt1 mean ymm, temporary #1? > > We usually find this sort of code a lot easier to read and maintain if you give > each value some sort of meaningful name (like pixels_03_16bit) then never change > it, letting the compiler handle register allocation. It can also help > readability if you can find ways to vertically align analogous parts of register > or operation names. > > E.g. > > __m256i pixels_03_16bit = _mm256_unpacklo_epi8(pixels, zero); > > __m256i scaled_03_lo = _mm256_mullo_epi16(pixels_03_16bit, coefs), > scaled_03_hi = _mm256_mulhi_epi16(pixels_03_16bit, coefs); > > Y01 = _mm256_add_epi32(Y01, _mm256_unpacklo_epi16(scaled_03_lo, scaled_03_hi)); > Y23 = _mm256_add_epi32(Y23, _mm256_unpackhi_epi16(scaled_03_lo, scaled_03_hi)); > ... Done. https://codereview.chromium.org/2526733002/diff/1/src/opts/SkBitmapFilter_opt... src/opts/SkBitmapFilter_opts.h:84: ComputeCoefficientRow(filterValues[filterY], sourceDataRows[filterY] + outX * 4, Y01, Y23, Y45, Y67); On 2016/12/05 18:42:46, mtklein_C wrote: > This might line up neater with a couple "+ 0". Done. https://codereview.chromium.org/2526733002/diff/1/src/opts/SkBitmapFilter_opt... src/opts/SkBitmapFilter_opts.h:123: // Make sure the value of alpha channel is always larger than maximum On 2016/12/05 18:42:46, mtklein_C wrote: > This comment might be better moved up just under if (hasAlpha) { Done. https://codereview.chromium.org/2526733002/diff/1/src/opts/SkBitmapFilter_opt... src/opts/SkBitmapFilter_opts.h:157: Yt2 = _mm256_max_epu8(Yt1, Yt0); // Max of r and g. On 2016/12/05 18:42:45, mtklein_C wrote: > This is the sort of code that using more variables with better names can help > make readable. E.g. > > if (hasAlpha) { > // If alpha is less than r,g, or b, set it to their max. > auto max_rg = _mm256_max_epu8(pixels, _mm256_srli_epi32(pixels, 8)), > max_rgb = _mm256_max_epu8(max_rg, _mm256_srli_epi32(pixels, 16)); > pixels = _mm256_max_epu8(pixels, _mm256_slli_epi32(max_rgb, 24)); > } else { > // Force opaque. > pixels = _mm256_or_si256(pixels, _mm256_set1_epi32(0xff000000)); > } Done. https://codereview.chromium.org/2526733002/diff/1/src/opts/SkBitmapFilter_opt... src/opts/SkBitmapFilter_opts.h:167: for (int i = width; i < pixelWidth; i++) { On 2016/12/05 18:42:46, mtklein_C wrote: > I'd have expected a call to _mm256_maskstore_epi32()... is this way faster? I tried something like this: __m256i mask = _mm256_setr_epi32(0xffffffff, 0, 0, 0, 0, 0, 0, 0); for (int i = outX + 1; i < pixelWidth; i++) { __m256i shift = _mm256_setr_epi32(0, 0, 1, 2, 3, 4, 5, 6); mask = _mm256_permutevar8x32_epi32(mask, shift); } _mm256_maskstore_epi32(reinterpret_cast<int*>(outRow), mask, accum); And nanobench time becomes a little longer. It is weird that bitmap_scale_filter_64_256 time also change from 306us to 314us, which should have no pixels left to go through this code path.
Thanks for the updates. lgtm! Good call on merging the tail code into the main loop... it's all very easy to read now. > https://codereview.chromium.org/2526733002/diff/1/src/opts/SkBitmapFilter_opt... > src/opts/SkBitmapFilter_opts.h:167: for (int i = width; i < pixelWidth; i++) { > On 2016/12/05 18:42:46, mtklein_C wrote: > > I'd have expected a call to _mm256_maskstore_epi32()... is this way faster? > > I tried something like this: > __m256i mask = _mm256_setr_epi32(0xffffffff, 0, 0, 0, 0, 0, 0, 0); > for (int i = outX + 1; i < pixelWidth; i++) { > __m256i shift = _mm256_setr_epi32(0, 0, 1, 2, 3, 4, 5, 6); > mask = _mm256_permutevar8x32_epi32(mask, shift); > } > _mm256_maskstore_epi32(reinterpret_cast<int*>(outRow), mask, accum); > > And nanobench time becomes a little longer. > It is weird that bitmap_scale_filter_64_256 time also change from 306us to 314us, which should have no pixels left to go through this code path. Weird. Thanks for investigating. Let's go with this as it is.
The CQ bit was checked by mtklein@chromium.org
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot on master.client.skia (JOB_FAILED, no build URL)
Ah, the bot name has changed a little since you uploaded this. I'll fix it...
Description was changed from ========== Add AVX2 version of ConvolveVertically ConvolveVertically time is reduced about 60% using haswell cpu. Nanobench results: before after bitmap_scale_filter_64_256 611us 302us bitmap_scale_filter_80_90 101us 64.9us bitmap_scale_filter_30_90 82.3us 51.4us bitmap_scale_filter_10_90 73.6us 42.4us BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2526733002 CQ_INCLUDE_TRYBOTS=master.client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ========== to ========== Add AVX2 version of ConvolveVertically ConvolveVertically time is reduced about 60% using haswell cpu. Nanobench results: before after bitmap_scale_filter_64_256 611us 302us bitmap_scale_filter_80_90 101us 64.9us bitmap_scale_filter_30_90 82.3us 51.4us bitmap_scale_filter_10_90 73.6us 42.4us BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2526733002 CQ_INCLUDE_TRYBOTS=skia.primary:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD ==========
The CQ bit was checked by mtklein@chromium.org
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Failed to apply patch for src/opts/SkOpts_hsw.cpp: While running git apply --index -p1; error: patch failed: src/opts/SkOpts_hsw.cpp:8 error: src/opts/SkOpts_hsw.cpp: patch does not apply Patch: src/opts/SkOpts_hsw.cpp Index: src/opts/SkOpts_hsw.cpp diff --git a/src/opts/SkOpts_hsw.cpp b/src/opts/SkOpts_hsw.cpp index 8c31af5b8373d60d6dd5a772d24e44c26acdac34..f7af52ef40f68c3217ebca34d634018fce855026 100644 --- a/src/opts/SkOpts_hsw.cpp +++ b/src/opts/SkOpts_hsw.cpp @@ -8,11 +8,13 @@ #include "SkOpts.h" #define SK_OPTS_NS hsw +#include "SkBitmapFilter_opts.h" #include "SkRasterPipeline_opts.h" namespace SkOpts { void Init_hsw() { compile_pipeline = hsw::compile_pipeline; + convolve_vertically = hsw::convolve_vertically; } }
The CQ bit was checked by xiangze.zhang@intel.com
The patchset sent to the CQ was uploaded after l-g-t-m from mtklein@chromium.org Link to the patchset: https://codereview.chromium.org/2526733002/#ps40001 (title: "Rebase")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
CQ is committing da patch. Bot data: {"patchset_id": 40001, "attempt_start_ts": 1481160398575720, "parent_rev": "45aac57ac6195880930441656a0988453f16c3db", "commit_rev": "4adac2e32a99b9e266b41b356c7bbc2ea2c8dd31"}
Message was sent while issue was closed.
Description was changed from ========== Add AVX2 version of ConvolveVertically ConvolveVertically time is reduced about 60% using haswell cpu. Nanobench results: before after bitmap_scale_filter_64_256 611us 302us bitmap_scale_filter_80_90 101us 64.9us bitmap_scale_filter_30_90 82.3us 51.4us bitmap_scale_filter_10_90 73.6us 42.4us BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2526733002 CQ_INCLUDE_TRYBOTS=skia.primary:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD ========== to ========== Add AVX2 version of ConvolveVertically ConvolveVertically time is reduced about 60% using haswell cpu. Nanobench results: before after bitmap_scale_filter_64_256 611us 302us bitmap_scale_filter_80_90 101us 64.9us bitmap_scale_filter_30_90 82.3us 51.4us bitmap_scale_filter_10_90 73.6us 42.4us BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2526733002 CQ_INCLUDE_TRYBOTS=skia.primary:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD Committed: https://skia.googlesource.com/skia/+/4adac2e32a99b9e266b41b356c7bbc2ea2c8dd31 ==========
Message was sent while issue was closed.
Committed patchset #3 (id:40001) as https://skia.googlesource.com/skia/+/4adac2e32a99b9e266b41b356c7bbc2ea2c8dd31
Message was sent while issue was closed.
On 2016/12/08 at 01:54:08, commit-bot wrote: > Committed patchset #3 (id:40001) as https://skia.googlesource.com/skia/+/4adac2e32a99b9e266b41b356c7bbc2ea2c8dd31 Very weird that the first round of bots passed but the patch couldn't land without a rebase. I think our new Gerrit code review system is a lot better about figuring out how to rebase / merge as needed when there are no real conflicts... future CLs should be easier on skia-review.googlesource.com.
Message was sent while issue was closed.
On 2016/12/08 01:58:51, mtklein_C wrote: > On 2016/12/08 at 01:54:08, commit-bot wrote: > > Committed patchset #3 (id:40001) as > https://skia.googlesource.com/skia/+/4adac2e32a99b9e266b41b356c7bbc2ea2c8dd31 > > Very weird that the first round of bots passed but the patch couldn't land > without a rebase. I think our new Gerrit code review system is a lot better > about figuring out how to rebase / merge as needed when there are no real > conflicts... future CLs should be easier on http://skia-review.googlesource.com. I tried Gerrit before and failed to upload cl completely because of network issues. My environment needs proxy to connect to Internet. Git cl upload timed out at somewhere about adding reviewers. Seems the script did not respect http_proxy at some point.
Message was sent while issue was closed.
mtklein@chromium.org changed reviewers: + rmistry@google.com
Message was sent while issue was closed.
Hmm. When I have Gerrit problems I always run to Ravi (rmistry@google.com) to ask for help. Maybe he can help with your proxy problems. As a team we're trying to get everything Skia-related moved over to Gerrit, but if you keep writing changes like this, I'd be happy to review them on Rietveld, over email, by telegraph... any way you can get the patch to me. :)
Message was sent while issue was closed.
On 2016/12/08 02:23:12, xiangze.zhang wrote: > On 2016/12/08 01:58:51, mtklein_C wrote: > > On 2016/12/08 at 01:54:08, commit-bot wrote: > > > Committed patchset #3 (id:40001) as > > https://skia.googlesource.com/skia/+/4adac2e32a99b9e266b41b356c7bbc2ea2c8dd31 > > > > Very weird that the first round of bots passed but the patch couldn't land > > without a rebase. I think our new Gerrit code review system is a lot better > > about figuring out how to rebase / merge as needed when there are no real > > conflicts... future CLs should be easier on > http://skia-review.googlesource.com. > > I tried Gerrit before and failed to upload cl completely because of network > issues. > > My environment needs proxy to connect to Internet. Git cl upload timed out at > somewhere about adding reviewers. Seems the script did not respect http_proxy at > some point. Can you please file a bug at https://bugs.chromium.org/p/chromium/issues/entry and link it here? If uploading to Rietveld works for you then we need to make sure it works for Gerrit as well.
Message was sent while issue was closed.
On 2016/12/08 12:40:24, rmistry wrote: > On 2016/12/08 02:23:12, xiangze.zhang wrote: > > On 2016/12/08 01:58:51, mtklein_C wrote: > > > On 2016/12/08 at 01:54:08, commit-bot wrote: > > > > Committed patchset #3 (id:40001) as > > > > https://skia.googlesource.com/skia/+/4adac2e32a99b9e266b41b356c7bbc2ea2c8dd31 > > > > > > Very weird that the first round of bots passed but the patch couldn't land > > > without a rebase. I think our new Gerrit code review system is a lot better > > > about figuring out how to rebase / merge as needed when there are no real > > > conflicts... future CLs should be easier on > > http://skia-review.googlesource.com. > > > > I tried Gerrit before and failed to upload cl completely because of network > > issues. > > > > My environment needs proxy to connect to Internet. Git cl upload timed out at > > somewhere about adding reviewers. Seems the script did not respect http_proxy > at > > some point. > > Can you please file a bug at https://bugs.chromium.org/p/chromium/issues/entry > and link it here? > If uploading to Rietveld works for you then we need to make sure it works for > Gerrit as well. I have filed a bug: https://bugs.chromium.org/p/chromium/issues/detail?id=672729 Thanks for your help! |