Issue 1213723002: Optimize RGB16 blitV functions with NEON for ARM platform.

yang.zhang

yang.zhang@linaro.org changed reviewers: + bero@linaro.org, caryclark@google.com, djsollen@google.com, mtklein@google.com, reed@google.com

5 years, 6 months ago (2015-06-26 05:34:29 UTC) #1

yang.zhang

Hi all I have optimized RGB16 blitV functions with NEON for ARM platform. Could you ...

5 years, 6 months ago (2015-06-26 05:34:30 UTC) #2

reed1

Can we achieve this sort of speed-up using SkNx instead of custom assembly?

5 years, 6 months ago (2015-06-26 13:33:25 UTC) #3

mtklein

5 years, 6 months ago (2015-06-26 14:05:50 UTC) #4

https://codereview.chromium.org/1213723002/diff/1/src/opts/SkBlitMask_opts_ar...
File src/opts/SkBlitMask_opts_arm_neon.cpp (right):

https://codereview.chromium.org/1213723002/diff/1/src/opts/SkBlitMask_opts_ar...
src/opts/SkBlitMask_opts_arm_neon.cpp:2: * Copyright 2016 The Android Open
Source Project
Let's put 2013 (file created) or 2015 (now) here.

https://codereview.chromium.org/1213723002/diff/1/src/opts/SkBlitMask_opts_ar...
src/opts/SkBlitMask_opts_arm_neon.cpp:268: uint32x4_t vsrc32, vscale5;
Does writing it like this recover any of the slowdown when height is 1-7?

if (height >= 8) { 
    <setup>
    while (height >= 8) {
       <blit 8 rows>
    }
}
while (height --> 0) {
   <blit 1 row>
}
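
The structure suggested above can be sketched in portable C++. This is only a scalar model, not the patch itself: `blend_pixel`, `next_row`, and `blit_column` are hypothetical names, the blend is a placeholder, and the 8-row batch is written as an inner loop where the NEON version would use one vector operation. The returned setup count is there only to make it visible that short columns skip the setup.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical placeholder for the per-pixel blend; the real blitter blends a
// source color into an RGB565 destination pixel.
static inline uint16_t blend_pixel(uint16_t dst) {
    return static_cast<uint16_t>(dst ^ 0xFFFFu);
}

// Advance a pixel pointer by one row, where the stride is given in bytes.
static inline uint16_t* next_row(uint16_t* p, size_t rowBytes) {
    return reinterpret_cast<uint16_t*>(reinterpret_cast<char*>(p) + rowBytes);
}

// Blit one column of `height` pixels. Returns how many times setup ran.
int blit_column(uint16_t* device, size_t deviceRB, int height) {
    int setups = 0;
    if (height >= 8) {
        ++setups;  // <setup>: e.g. broadcasting constants into vector registers
        while (height >= 8) {
            for (int j = 0; j < 8; ++j) {  // <blit 8 rows>: one vector's worth
                *device = blend_pixel(*device);
                device = next_row(device, deviceRB);
            }
            height -= 8;
        }
    }
    while (height-- > 0) {  // <blit 1 row>: scalar tail, no setup needed
        *device = blend_pixel(*device);
        device = next_row(device, deviceRB);
    }
    return setups;
}
```

With this shape, a column of height 1-7 goes straight to the scalar tail and never pays for the vector setup.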

https://codereview.chromium.org/1213723002/diff/1/src/opts/SkBlitMask_opts_ar...
src/opts/SkBlitMask_opts_arm_neon.cpp:272: uint16x8x2_t vdst32;
I'd prefer if you could move the declarations of these variables closer to
where they're first used.

This one in particular is easy to get confused about without a type... it seems
by name like it'd be uint32x4_t.

https://codereview.chromium.org/1213723002/diff/1/src/opts/SkBlitMask_opts_ar...
src/opts/SkBlitMask_opts_arm_neon.cpp:280: vmaskq_g16 =
vdupq_n_u16(SK_G16_MASK_IN_PLACE);
Why do we make four masks here when we can use vand / vbic with two?

https://codereview.chromium.org/1213723002/diff/1/src/opts/SkBlitMask_opts_ar...
src/opts/SkBlitMask_opts_arm_neon.cpp:288: vdev = vld1q_lane_u16(device, vdev,
0);
This code (and the stores) might read more clearly as a loop?

for (int j = 0; j < 8; j++) {
   vdev = vld1q_lane_u16(device, vdev, j);
   device = (uint16_t*)((char*)device + deviceRB);
}

Or does vld1q_lane_u16 require the lane be a compile-time constant?
If so I might write it out like this:

// vld1q_lane_u16 requires lane to be a compile-time constant, so no for-loop.
#define LOAD(row) \
    vdev = vld1q_lane_u16(device, vdev, row);  \
    device = (uint16_t*)((char*)device + deviceRB)
LOAD(0); LOAD(1); LOAD(2); LOAD(3);
LOAD(4); LOAD(5); LOAD(6); LOAD(7);
#undef LOAD

Using a macro makes it clear that the repetition is intentional and every step
is identical, and it is a bit more compact.
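
For readers without an ARM toolchain, the macro pattern can be tried out in portable C++. This is a sketch under assumptions: `Vec8` and `load_lane` are hypothetical stand-ins for `uint16x8_t` and `vld1q_lane_u16`, with the lane passed as a template parameter to mimic the intrinsic's requirement that the lane index be a compile-time constant.

```cpp
#include <cstddef>
#include <cstdint>

// Portable stand-in for uint16x8_t; not the real NEON type.
struct Vec8 { uint16_t lane[8]; };

// Stand-in for vld1q_lane_u16. The lane is a template parameter, so it has to
// be a compile-time constant, which is why a runtime for-loop index won't do.
template <int Lane>
static inline Vec8 load_lane(const uint16_t* src, Vec8 v) {
    static_assert(Lane >= 0 && Lane < 8, "lane out of range");
    v.lane[Lane] = *src;
    return v;
}

// Gather one pixel from each of 8 consecutive rows into a single vector.
Vec8 gather_column(const uint16_t* device, size_t deviceRB) {
    Vec8 vdev = {};
#define LOAD(row)                                   \
    vdev = load_lane<row>(device, vdev);            \
    device = reinterpret_cast<const uint16_t*>(     \
        reinterpret_cast<const char*>(device) + deviceRB)
    LOAD(0); LOAD(1); LOAD(2); LOAD(3);
    LOAD(4); LOAD(5); LOAD(6); LOAD(7);
#undef LOAD
    return vdev;
}
```

The matching store macro would follow the same shape, writing lane `row` back to each row in turn.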

https://codereview.chromium.org/1213723002/diff/1/src/opts/SkBlitMask_opts_ar...
src/opts/SkBlitMask_opts_arm_neon.cpp:349: void
SkRGB16BlitterBlitH_neon(uint16_t* device,
Let's leave this out until it's used?

mtklein

On 2015/06/26 13:33:25, reed1 wrote: > Can we achieve this sort of speed-up using SkNx ...

5 years, 6 months ago (2015-06-26 14:18:38 UTC) #5

yang.zhang

5 years, 5 months ago (2015-06-29 07:25:57 UTC) #6

I have updated this patch according to your comments. Please check it.

https://codereview.chromium.org/1213723002/diff/1/src/opts/SkBlitMask_opts_ar...
File src/opts/SkBlitMask_opts_arm_neon.cpp (right):

https://codereview.chromium.org/1213723002/diff/1/src/opts/SkBlitMask_opts_ar...
src/opts/SkBlitMask_opts_arm_neon.cpp:2: * Copyright 2016 The Android Open
Source Project
On 2015/06/26 14:05:50, mtklein wrote:
> Let's put 2013 (file created) or 2015 (now) here.

Done.

https://codereview.chromium.org/1213723002/diff/1/src/opts/SkBlitMask_opts_ar...
src/opts/SkBlitMask_opts_arm_neon.cpp:268: uint32x4_t vsrc32, vscale5;
On 2015/06/26 14:05:50, mtklein wrote:
> Does writing it like this recover any of the slowdown when height is 1-7?
> 
> if (height >= 8) { 
>     <setup>
>     while (height >= 8) {
>        <blit 8 rows>
>     }
> }
> while (height --> 0) {
>    <blit 1 row>
> }

Yeah. The setup code may affect the cases where height is 1 to 7.

https://codereview.chromium.org/1213723002/diff/1/src/opts/SkBlitMask_opts_ar...
src/opts/SkBlitMask_opts_arm_neon.cpp:272: uint16x8x2_t vdst32;
On 2015/06/26 14:05:50, mtklein wrote:
> I'd prefer if you could move the declarations of these variables closer to
> where they're first used.
> 
> This one in particular is easy to get confused about without a type... it
> seems by name like it'd be uint32x4_t.

Done.

https://codereview.chromium.org/1213723002/diff/1/src/opts/SkBlitMask_opts_ar...
src/opts/SkBlitMask_opts_arm_neon.cpp:280: vmaskq_g16 =
vdupq_n_u16(SK_G16_MASK_IN_PLACE);
On 2015/06/26 14:05:50, mtklein wrote:
> Why do we make four masks here when we can use vand / vbic with two?

Done.

https://codereview.chromium.org/1213723002/diff/1/src/opts/SkBlitMask_opts_ar...
src/opts/SkBlitMask_opts_arm_neon.cpp:288: vdev = vld1q_lane_u16(device, vdev,
0);
On 2015/06/26 14:05:50, mtklein wrote:
> This code (and the stores) might read more clearly as a loop?
> 
> for (int j = 0; j < 8; j++) {
>    vdev = vld1q_lane_u16(device, vdev, j);
>    device = (uint16_t*)((char*)device + deviceRB);
> }
> 
> Or does vld1q_lane_u16 require the lane be a compile-time constant?
> If so I might write it out like this:
> 
> // vld1q_lane_u16 requires lane to be a compile-time constant, so no for-loop.
> #define LOAD(row) \
>     vdev = vld1q_lane_u16(device, vdev, row);  \
>     device = (uint16_t*)((char*)device + deviceRB)
> LOAD(0); LOAD(1); LOAD(2); LOAD(3);
> LOAD(4); LOAD(5); LOAD(6); LOAD(7);
> #undef LOAD
> 
> Using a macro makes it clear that the repetition is intentional and every step
> is identical, and it is a bit more compact.

Done.

mtklein

https://codereview.chromium.org/1213723002/diff/40001/src/opts/SkBlitMask_opts_arm_neon.cpp File src/opts/SkBlitMask_opts_arm_neon.cpp (right): https://codereview.chromium.org/1213723002/diff/40001/src/opts/SkBlitMask_opts_arm_neon.cpp#newcode282 src/opts/SkBlitMask_opts_arm_neon.cpp:282: uint16x8_t vmaskq_g16 = vdupq_n_u16(SK_G16_MASK_IN_PLACE); Oh, I was actually asking ...

5 years, 5 months ago (2015-06-29 17:16:17 UTC) #7

yang.zhang

5 years, 5 months ago (2015-06-30 04:51:53 UTC) #8

https://codereview.chromium.org/1213723002/diff/40001/src/opts/SkBlitMask_opt...
File src/opts/SkBlitMask_opts_arm_neon.cpp (right):

https://codereview.chromium.org/1213723002/diff/40001/src/opts/SkBlitMask_opt...
src/opts/SkBlitMask_opts_arm_neon.cpp:282: uint16x8_t vmaskq_g16 =
vdupq_n_u16(SK_G16_MASK_IN_PLACE);
On 2015/06/29 17:16:17, mtklein wrote:
> Oh, I was actually asking about reducing the four masks to two the other way,
> but given what you've done here I think it can just be one!
> 
> What I meant was, use a single mask with vandq, or vbicq when you'd use ~mask:
> 
>     uint16x8_t greenMask = vdupq_n_u16(SK_G16_MASK_IN_PLACE);
>     ...
> 
>     uint16x8x2_t vdst = vzipq_u16(vbicq_u16(vdev, greenMask), 
>                                   vandq_u16(vdev, greenMask));
>     ...

Yeah. The results are the same, but I don't think there is any difference in
performance. Besides using a single mask, is there any other benefit?

https://codereview.chromium.org/1213723002/diff/40001/src/opts/SkBlitMask_opt...
src/opts/SkBlitMask_opts_arm_neon.cpp:298: uint16x8x2_t vdst = vzipq_u16((vdev &
vmaskq_ng16), (vdev & vmaskq_g16));
On 2015/06/29 17:16:17, mtklein wrote:
> Remind me, why do we need to zip these together?  Aren't the operations done
> to _hi and _lo always the same?
> 
> Can't we just operate on two vectors without zipping them, one with red and
> blue, the other with just green?
> 
> uint16x8_t rb = vbicq_u16(vdev, greenMask),
>             g = vandq_u16(vdev, greenMask);
> ...

Here, I used the vzip instruction to implement the following operation.

C implementation:
((c & SK_G16_MASK_IN_PLACE) << 16) | (c & ~SK_G16_MASK_IN_PLACE)

another NEON implementation:
uint32x4_t dev_lo = vmovl_u16(vget_low_u16(vdev));
uint32x4_t dev_hi = vmovl_u16(vget_high_u16(vdev));
// unpack them in 32 bits
dev_lo = (dev_lo & vmask_ng16) | vshlq_n_u32(dev_lo & vmask_g16, 16);
dev_hi = (dev_hi & vmask_ng16) | vshlq_n_u32(dev_hi & vmask_g16, 16);

I think that using the vzip instruction is better because fewer instructions are
needed.
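
The equivalence claimed here can be checked with a scalar model. This is a sketch under assumptions: `kGreenMask` is assumed to equal `SK_G16_MASK_IN_PLACE` (0x07E0, the six green bits of RGB565), `zip16` models the lane interleaving of `vzipq_u16`, and each zipped pair is combined as a little-endian 32-bit lane. The names `expand_scalar`, `zip16`, and `expand_zip` are hypothetical.

```cpp
#include <cstdint>

// Assumed value of SK_G16_MASK_IN_PLACE: green occupies bits 5..10 of RGB565.
constexpr uint16_t kGreenMask = 0x07E0;

// The scalar form quoted in the review: green moves to the high half-word,
// red and blue stay in the low half-word.
constexpr uint32_t expand_scalar(uint16_t c) {
    return (static_cast<uint32_t>(c & kGreenMask) << 16) |
           static_cast<uint32_t>(c & ~kGreenMask);
}

// Model of vzipq_u16(a, b): interleave 16-bit lanes as a0,b0,a1,b1,...
void zip16(const uint16_t a[8], const uint16_t b[8], uint16_t out[16]) {
    for (int i = 0; i < 8; ++i) {
        out[2 * i] = a[i];
        out[2 * i + 1] = b[i];
    }
}

// Expand 8 pixels by zipping (pixel & ~green) with (pixel & green), then
// reading each 16-bit pair as one little-endian 32-bit lane.
void expand_zip(const uint16_t px[8], uint32_t out[8]) {
    uint16_t rb[8], g[8], z[16];
    for (int i = 0; i < 8; ++i) {
        rb[i] = static_cast<uint16_t>(px[i] & ~kGreenMask);  // what vbic would keep
        g[i] = static_cast<uint16_t>(px[i] & kGreenMask);    // what vand would keep
    }
    zip16(rb, g, z);
    for (int i = 0; i < 8; ++i) {
        out[i] = z[2 * i] | (static_cast<uint32_t>(z[2 * i + 1]) << 16);
    }
}
```

Because the green half lands in the upper 16 bits of each 32-bit lane, the zip does the job of both the widening move and the shift by 16 in the alternative implementation.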

mtklein

5 years, 5 months ago (2015-06-30 12:42:56 UTC) #10

lgtm

https://codereview.chromium.org/1213723002/diff/40001/src/opts/SkBlitMask_opt...
File src/opts/SkBlitMask_opts_arm_neon.cpp (right):

https://codereview.chromium.org/1213723002/diff/40001/src/opts/SkBlitMask_opt...
src/opts/SkBlitMask_opts_arm_neon.cpp:282: uint16x8_t vmaskq_g16 =
vdupq_n_u16(SK_G16_MASK_IN_PLACE);
On 2015/06/30 04:51:53, yang.zhang wrote:
> On 2015/06/29 17:16:17, mtklein wrote:
> > Oh, I was actually asking about reducing the four masks to two the other
> > way, but given what you've done here I think it can just be one!
> > 
> > What I meant was, use a single mask with vandq, or vbicq when you'd use
> > ~mask:
> > 
> >     uint16x8_t greenMask = vdupq_n_u16(SK_G16_MASK_IN_PLACE);
> >     ...
> > 
> >     uint16x8x2_t vdst = vzipq_u16(vbicq_u16(vdev, greenMask), 
> >                                   vandq_u16(vdev, greenMask));
> >     ...
> 
> Yeah. The results are the same, but I don't think there is any difference in
> performance. Besides using a single mask, is there any other benefit?

Oh, just seemed tidier.  I agree it's not a big deal either way.

https://codereview.chromium.org/1213723002/diff/40001/src/opts/SkBlitMask_opt...
src/opts/SkBlitMask_opts_arm_neon.cpp:298: uint16x8x2_t vdst = vzipq_u16((vdev &
vmaskq_ng16), (vdev & vmaskq_g16));
On 2015/06/30 04:51:53, yang.zhang wrote:
> On 2015/06/29 17:16:17, mtklein wrote:
> > Remind me, why do we need to zip these together?  Aren't the operations done
> > to _hi and _lo always the same?
> > 
> > Can't we just operate on two vectors without zipping them, one with red and
> > blue, the other with just green?
> > 
> > uint16x8_t rb = vbicq_u16(vdev, greenMask),
> >             g = vandq_u16(vdev, greenMask);
> > ...
> Here, I used the vzip instruction to implement the following operation.
> 
> C implementation:
> ((c & SK_G16_MASK_IN_PLACE) << 16) | (c & ~SK_G16_MASK_IN_PLACE)
> 
> another NEON implementation:
> uint32x4_t dev_lo = vmovl_u16(vget_low_u16(vdev));
> uint32x4_t dev_hi = vmovl_u16(vget_high_u16(vdev));
> // unpack them in 32 bits
> dev_lo = (dev_lo & vmask_ng16) | vshlq_n_u32(dev_lo & vmask_g16, 16);
> dev_hi = (dev_hi & vmask_ng16) | vshlq_n_u32(dev_hi & vmask_g16, 16);
> 
> I think that using the vzip instruction is better because fewer instructions are
> needed.

sgtm

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1213723002/40001

5 years, 5 months ago (2015-06-30 12:43:09 UTC) #11

commit-bot: I haz the power

The author yang.zhang@linaro.org has not signed Google Contributor License Agreement. Please visit https://cla.developers.google.com to sign ...

5 years, 5 months ago (2015-06-30 12:43:11 UTC) #12

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

5 years, 5 months ago (2015-06-30 12:43:51 UTC) #13

mtklein

On 2015/06/30 12:43:52, commit-bot: I haz the power wrote: > Exceeded global retry quota Hmm, ...

5 years, 5 months ago (2015-06-30 12:46:10 UTC) #15

yang.zhang

On 2015/06/30 12:46:10, mtklein wrote: > On 2015/06/30 12:43:52, commit-bot: I haz the power wrote: ...

5 years, 5 months ago (2015-07-01 07:17:10 UTC) #16

yang.zhang

On 2015/06/30 12:43:11, commit-bot: I haz the power wrote: > The author mailto:yang.zhang@linaro.org has not ...

5 years, 5 months ago (2015-07-01 10:26:17 UTC) #17

reed1

I think you need to rebase locally, and then re-upload. SkBlitMask_opts_arm_neon.cpp already has a ...

5 years, 5 months ago (2015-07-01 13:12:27 UTC) #18

yang.zhang

On 2015/07/01 13:12:27, reed1 wrote: > I think you need to rebase locally, and then ...

5 years, 5 months ago (2015-07-02 07:14:17 UTC) #19

yang.zhang

Hi all Currently, I'm already in the list of AOSP CLA. Is it OK?

5 years, 5 months ago (2015-07-06 08:57:59 UTC) #20

mtklein

On 2015/07/06 08:57:59, yang.zhang wrote: > Hi all > > Currently, I'm already in the ...

5 years, 5 months ago (2015-07-06 14:22:55 UTC) #21

yang.zhang

On 2015/07/06 14:22:55, mtklein wrote: > On 2015/07/06 08:57:59, yang.zhang wrote: > > Hi all ...

5 years, 5 months ago (2015-07-14 07:10:58 UTC) #22

mtklein

The patchset sent to the CQ was uploaded after l-g-t-m from mtklein@google.com Link to the ...

5 years, 5 months ago (2015-07-14 11:53:01 UTC) #24

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1213723002/60001

5 years, 5 months ago (2015-07-14 11:53:07 UTC) #25

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

5 years, 5 months ago (2015-07-14 11:54:28 UTC) #26

commit-bot: I haz the power

Try jobs failed on following builders: skia_presubmit-Trybot on client.skia.fyi (JOB_FAILED, http://build.chromium.org/p/client.skia.fyi/builders/skia_presubmit-Trybot/builds/1083)

5 years, 5 months ago (2015-07-14 11:54:29 UTC) #27

mtklein

On 2015/07/14 11:54:29, commit-bot: I haz the power wrote: > Try jobs failed on following ...

5 years, 5 months ago (2015-07-14 11:55:44 UTC) #28

yang.zhang

On 2015/07/14 11:55:44, mtklein wrote: > On 2015/07/14 11:54:29, commit-bot: I haz the power wrote: ...

5 years, 5 months ago (2015-07-15 07:13:03 UTC) #29

mtklein

The patchset sent to the CQ was uploaded after l-g-t-m from mtklein@google.com Link to the ...

5 years, 5 months ago (2015-07-15 13:46:17 UTC) #31

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1213723002/80001

5 years, 5 months ago (2015-07-15 13:46:25 UTC) #32

commit-bot: I haz the power

5 years, 5 months ago (2015-07-15 14:07:34 UTC) #33

Message was sent while issue was closed.

Committed patchset #5 (id:80001) as
https://skia.googlesource.com/skia/+/dc77b3591841bf1e70ed45455490d688e5d4e6f9

Issue 1213723002: Optimize RGB16 blitV functions with NEON for ARM platform. (Closed)

Description

Patch Set 1 #

Patch Set 2 : Modify variables definition #

Patch Set 3 : Add macro define for data load/store #

Patch Set 4 : Remove the copyright #

Patch Set 5 : Adding AUTHORS #
