Issue 2526733002: Add AVX2 version of ConvolveVertically

xiangze.zhang

Description was changed from ========== Add AVX2 version of ConvolveVertically ConvolveVertically time is reduced about ...

4 years ago (2016-11-23 06:24:56 UTC) #1

xiangze.zhang

xiangze.zhang@intel.com changed reviewers: + mtklein@chromium.org

4 years ago (2016-11-23 06:28:49 UTC) #2

mtklein_C

Hi! I'm sorry I haven't responded on these two CLs yet. They're important to me ...

4 years ago (2016-11-29 20:53:51 UTC) #5

mtklein_C

https://codereview.chromium.org/2526733002/diff/1/src/opts/SkBitmapFilter_opts.h File src/opts/SkBitmapFilter_opts.h (right): https://codereview.chromium.org/2526733002/diff/1/src/opts/SkBitmapFilter_opts.h#newcode14 src/opts/SkBitmapFilter_opts.h:14: #include <immintrin.h> I think we can just use immintrin.h ...

4 years ago (2016-12-05 18:42:46 UTC) #6

https://codereview.chromium.org/2526733002/diff/1/src/opts/SkBitmapFilter_opts.h
File src/opts/SkBitmapFilter_opts.h (right):

https://codereview.chromium.org/2526733002/diff/1/src/opts/SkBitmapFilter_opt...
src/opts/SkBitmapFilter_opts.h:14: #include <immintrin.h>
I think we can just use immintrin.h for everything >= SSE2.

https://codereview.chromium.org/2526733002/diff/1/src/opts/SkBitmapFilter_opt...
src/opts/SkBitmapFilter_opts.h:26: void
ComputeCoefficientRow(SkConvolutionFilter1D::ConvolutionFixed filterValue, const
unsigned char* sourceDataRows,
Typically we name static functions with lower_case_and_underscores(), and pass
mutated values as pointers instead of references so that it's clear they're
being mutated at the call site.   In my experience this doesn't have any effect
on code generation, especially with SK_ALWAYS_INLINE.

https://codereview.chromium.org/2526733002/diff/1/src/opts/SkBitmapFilter_opt...
src/opts/SkBitmapFilter_opts.h:29: __m256i pixels =
_mm256_lddqu_si256(reinterpret_cast<const __m256i *>(sourceDataRows));
I have always shied away from lddqu, instead just using loadu.  Is it important
here?  Not sure I see any difference when benching.

https://codereview.chromium.org/2526733002/diff/1/src/opts/SkBitmapFilter_opt...
src/opts/SkBitmapFilter_opts.h:33: __m256i Yt1 = _mm256_unpacklo_epi8(pixels,
zero);
Does Yt1 mean ymm, temporary #1?

We usually find this sort of code a lot easier to read and maintain if you give
each value some sort of meaningful name (like pixels_03_16bit) then never change
it, letting the compiler handle register allocation.  It can also help
readability if you can find ways to vertically align analogous parts of register
or operation names.

E.g.

__m256i pixels_03_16bit = _mm256_unpacklo_epi8(pixels, zero);

__m256i scaled_03_lo = _mm256_mullo_epi16(pixels_03_16bit, coefs),
        scaled_03_hi = _mm256_mulhi_epi16(pixels_03_16bit, coefs);

Y01 = _mm256_add_epi32(Y01, _mm256_unpacklo_epi16(scaled_03_lo, scaled_03_hi));
Y23 = _mm256_add_epi32(Y23, _mm256_unpackhi_epi16(scaled_03_lo, scaled_03_hi));
...

https://codereview.chromium.org/2526733002/diff/1/src/opts/SkBitmapFilter_opt...
src/opts/SkBitmapFilter_opts.h:84: ComputeCoefficientRow(filterValues[filterY],
sourceDataRows[filterY] + outX * 4, Y01, Y23, Y45, Y67);
This might line up neater with a couple "+ 0".

https://codereview.chromium.org/2526733002/diff/1/src/opts/SkBitmapFilter_opt...
src/opts/SkBitmapFilter_opts.h:123: // Make sure the value of alpha channel is
always larger than maximum
This comment might be better moved up just under if (hasAlpha) {

https://codereview.chromium.org/2526733002/diff/1/src/opts/SkBitmapFilter_opt...
src/opts/SkBitmapFilter_opts.h:157: Yt2 = _mm256_max_epu8(Yt1, Yt0);  // Max of
r and g.
This is the sort of code that using more variables with better names can help
make readable.  E.g.

if (hasAlpha) {
  // If alpha is less than r,g, or b, set it to their max.
  auto max_rg  = _mm256_max_epu8(pixels, _mm256_srli_epi32(pixels,  8)),
       max_rgb = _mm256_max_epu8(max_rg, _mm256_srli_epi32(pixels, 16));
  pixels = _mm256_max_epu8(pixels, _mm256_slli_epi32(max_rgb, 24));
} else {
  // Force opaque.
  pixels = _mm256_or_si256(pixels, _mm256_set1_epi32(0xff000000));
}

https://codereview.chromium.org/2526733002/diff/1/src/opts/SkBitmapFilter_opt...
src/opts/SkBitmapFilter_opts.h:167: for (int i = width; i < pixelWidth; i++) {
I'd have expected a call to _mm256_maskstore_epi32()... is this way faster?

xiangze.zhang

Thanks for your review and I have addressed most of the comments. I also merged ...

4 years ago (2016-12-07 11:17:01 UTC) #7

Thanks for your review and I have addressed most of the comments. I also merged
the code that deal with the left pixels into the loop since most code are
repeated. For the last comment, I don't know whether I should make changes?

https://codereview.chromium.org/2526733002/diff/1/src/opts/SkBitmapFilter_opts.h
File src/opts/SkBitmapFilter_opts.h (right):

https://codereview.chromium.org/2526733002/diff/1/src/opts/SkBitmapFilter_opt...
src/opts/SkBitmapFilter_opts.h:14: #include <immintrin.h>
On 2016/12/05 18:42:46, mtklein_C wrote:
> I think we can just use immintrin.h for everything >= SSE2.

Done.

https://codereview.chromium.org/2526733002/diff/1/src/opts/SkBitmapFilter_opt...
src/opts/SkBitmapFilter_opts.h:26: void
ComputeCoefficientRow(SkConvolutionFilter1D::ConvolutionFixed filterValue, const
unsigned char* sourceDataRows,
On 2016/12/05 18:42:45, mtklein_C wrote:
> Typically we name static functions with lower_case_and_underscores(), and pass
> mutated values as pointers instead of references so that it's clear they're
> being mutated at the call site.   In my experience this doesn't have any
effect
> on code generation, especially with SK_ALWAYS_INLINE.

Done.

https://codereview.chromium.org/2526733002/diff/1/src/opts/SkBitmapFilter_opt...
src/opts/SkBitmapFilter_opts.h:29: __m256i pixels =
_mm256_lddqu_si256(reinterpret_cast<const __m256i *>(sourceDataRows));
On 2016/12/05 18:42:45, mtklein_C wrote:
> I have always shied away from lddqu, instead just using loadu.  Is it
important
> here?  Not sure I see any difference when benching.

Done.

https://codereview.chromium.org/2526733002/diff/1/src/opts/SkBitmapFilter_opt...
src/opts/SkBitmapFilter_opts.h:33: __m256i Yt1 = _mm256_unpacklo_epi8(pixels,
zero);
On 2016/12/05 18:42:45, mtklein_C wrote:
> Does Yt1 mean ymm, temporary #1?
> 
> We usually find this sort of code a lot easier to read and maintain if you
give
> each value some sort of meaningful name (like pixels_03_16bit) then never
change
> it, letting the compiler handle register allocation.  It can also help
> readability if you can find ways to vertically align analogous parts of
register
> or operation names.
> 
> E.g.
> 
> __m256i pixels_03_16bit = _mm256_unpacklo_epi8(pixels, zero);
> 
> __m256i scaled_03_lo = _mm256_mullo_epi16(pixels_03_16bit, coefs),
>         scaled_03_hi = _mm256_mulhi_epi16(pixels_03_16bit, coefs);
> 
> Y01 = _mm256_add_epi32(Y01, _mm256_unpacklo_epi16(scaled_03_lo,
scaled_03_hi));
> Y23 = _mm256_add_epi32(Y23, _mm256_unpackhi_epi16(scaled_03_lo,
scaled_03_hi));
> ...

Done.

https://codereview.chromium.org/2526733002/diff/1/src/opts/SkBitmapFilter_opt...
src/opts/SkBitmapFilter_opts.h:84: ComputeCoefficientRow(filterValues[filterY],
sourceDataRows[filterY] + outX * 4, Y01, Y23, Y45, Y67);
On 2016/12/05 18:42:46, mtklein_C wrote:
> This might line up neater with a couple "+ 0".

Done.

https://codereview.chromium.org/2526733002/diff/1/src/opts/SkBitmapFilter_opt...
src/opts/SkBitmapFilter_opts.h:123: // Make sure the value of alpha channel is
always larger than maximum
On 2016/12/05 18:42:46, mtklein_C wrote:
> This comment might be better moved up just under if (hasAlpha) {

Done.

https://codereview.chromium.org/2526733002/diff/1/src/opts/SkBitmapFilter_opt...
src/opts/SkBitmapFilter_opts.h:157: Yt2 = _mm256_max_epu8(Yt1, Yt0);  // Max of
r and g.
On 2016/12/05 18:42:45, mtklein_C wrote:
> This is the sort of code that using more variables with better names can help
> make readable.  E.g.
> 
> if (hasAlpha) {
>   // If alpha is less than r,g, or b, set it to their max.
>   auto max_rg  = _mm256_max_epu8(pixels, _mm256_srli_epi32(pixels,  8)),
>        max_rgb = _mm256_max_epu8(max_rg, _mm256_srli_epi32(pixels, 16));
>   pixels = _mm256_max_epu8(pixels, _mm256_slli_epi32(max_rgb, 24));
> } else {
>   // Force opaque.
>   pixels = _mm256_or_si256(pixels, _mm256_set1_epi32(0xff000000));
> }

Done.

https://codereview.chromium.org/2526733002/diff/1/src/opts/SkBitmapFilter_opt...
src/opts/SkBitmapFilter_opts.h:167: for (int i = width; i < pixelWidth; i++) {
On 2016/12/05 18:42:46, mtklein_C wrote:
> I'd have expected a call to _mm256_maskstore_epi32()... is this way faster?

I tried something like this:
    __m256i mask = _mm256_setr_epi32(0xffffffff, 0, 0, 0, 0, 0, 0, 0);
    for (int i = outX + 1; i < pixelWidth; i++) {
        __m256i shift = _mm256_setr_epi32(0, 0, 1, 2, 3, 4, 5, 6);
        mask = _mm256_permutevar8x32_epi32(mask, shift);
    }
    _mm256_maskstore_epi32(reinterpret_cast<int*>(outRow), mask, accum);

And nanobench time becomes a little longer.
It is weird that bitmap_scale_filter_64_256 time also change from 306us to
314us, which should have no pixels left to go through this code path.

mtklein_C

Thanks for the updates. lgtm! Good call on merging the tail code into the main ...

4 years ago (2016-12-07 17:36:06 UTC) #8

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2526733002/20001

4 years ago (2016-12-07 17:36:16 UTC) #10

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years ago (2016-12-07 17:36:46 UTC) #11

commit-bot: I haz the power

Try jobs failed on following builders: Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot on master.client.skia (JOB_FAILED, no build URL)

4 years ago (2016-12-07 17:36:47 UTC) #12

mtklein_C

Ah, the bot name has changed a little since you uploaded this. I'll fix it...

4 years ago (2016-12-07 17:43:31 UTC) #13

mtklein_C

Description was changed from ========== Add AVX2 version of ConvolveVertically ConvolveVertically time is reduced about ...

4 years ago (2016-12-07 17:43:42 UTC) #14

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2526733002/20001

4 years ago (2016-12-07 17:43:57 UTC) #16

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years ago (2016-12-07 18:05:44 UTC) #17

commit-bot: I haz the power

Failed to apply patch for src/opts/SkOpts_hsw.cpp: While running git apply --index -p1; error: patch failed: ...

4 years ago (2016-12-07 18:05:45 UTC) #18

xiangze.zhang

The patchset sent to the CQ was uploaded after l-g-t-m from mtklein@chromium.org Link to the ...

4 years ago (2016-12-08 01:26:38 UTC) #20

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2526733002/40001

4 years ago (2016-12-08 01:26:48 UTC) #21

commit-bot: I haz the power

CQ is committing da patch. Bot data: {"patchset_id": 40001, "attempt_start_ts": 1481160398575720, "parent_rev": "45aac57ac6195880930441656a0988453f16c3db", "commit_rev": "4adac2e32a99b9e266b41b356c7bbc2ea2c8dd31"}

4 years ago (2016-12-08 01:54:05 UTC) #22

commit-bot: I haz the power

Description was changed from ========== Add AVX2 version of ConvolveVertically ConvolveVertically time is reduced about ...

4 years ago (2016-12-08 01:54:07 UTC) #23

commit-bot: I haz the power

Committed patchset #3 (id:40001) as https://skia.googlesource.com/skia/+/4adac2e32a99b9e266b41b356c7bbc2ea2c8dd31

4 years ago (2016-12-08 01:54:08 UTC) #24

mtklein_C

On 2016/12/08 at 01:54:08, commit-bot wrote: > Committed patchset #3 (id:40001) as https://skia.googlesource.com/skia/+/4adac2e32a99b9e266b41b356c7bbc2ea2c8dd31 Very weird ...

4 years ago (2016-12-08 01:58:51 UTC) #25

xiangze.zhang

On 2016/12/08 01:58:51, mtklein_C wrote: > On 2016/12/08 at 01:54:08, commit-bot wrote: > > Committed ...

4 years ago (2016-12-08 02:23:12 UTC) #26

mtklein_C

Hmm. When I have Gerrit problems I always run to Ravi (rmistry@google.com) to ask for ...

4 years ago (2016-12-08 02:27:32 UTC) #28

rmistry

On 2016/12/08 02:23:12, xiangze.zhang wrote: > On 2016/12/08 01:58:51, mtklein_C wrote: > > On 2016/12/08 ...

4 years ago (2016-12-08 12:40:24 UTC) #29

xiangze.zhang

4 years ago (2016-12-09 07:31:37 UTC) #30

Message was sent while issue was closed.

On 2016/12/08 12:40:24, rmistry wrote:
> On 2016/12/08 02:23:12, xiangze.zhang wrote:
> > On 2016/12/08 01:58:51, mtklein_C wrote:
> > > On 2016/12/08 at 01:54:08, commit-bot wrote:
> > > > Committed patchset #3 (id:40001) as
> > >
> https://skia.googlesource.com/skia/+/4adac2e32a99b9e266b41b356c7bbc2ea2c8dd31
> > > 
> > > Very weird that the first round of bots passed but the patch couldn't land
> > > without a rebase.  I think our new Gerrit code review system is a lot
better
> > > about figuring out how to rebase / merge as needed when there are no real
> > > conflicts... future CLs should be easier on
> > http://skia-review.googlesource.com.
> > 
> > I tried Gerrit before and failed to upload cl completely because of network
> > issues. 
> > 
> > My environment needs proxy to connect to Internet. Git cl upload timed out
at 
> > somewhere about adding reviewers. Seems the script did not respect
http_proxy
> at
> > some point.
> 
> Can you please file a bug at https://bugs.chromium.org/p/chromium/issues/entry
> and link it here?
> If uploading to Rietveld works for you then we need to make sure it works for
> Gerrit as well.

I have filed a bug: https://bugs.chromium.org/p/chromium/issues/detail?id=672729
Thanks for your help!

Issue 2526733002: Add AVX2 version of ConvolveVertically (Closed)

Description

Patch Set 1 #

Patch Set 2 : Address comments #

Patch Set 3 : Rebase #

Messages