|
|
Created:
5 years, 6 months ago by yang.zhang Modified:
5 years, 5 months ago CC:
reviews_skia.org Base URL:
https://skia.googlesource.com/skia.git@master Target Ref:
refs/heads/master Project:
skia Visibility:
Public. |
DescriptionOptimize RGB16 blitV functions with NEON for ARM platform.
Here are some performance resultsi on Nexus 9:
SkRGB16BlitterBlitV_neon:
+--------+-----------+
|height | C/NEON |
+--------+-----------+
|1 | 0.765230 |
+--------+-----------+
|8 | 1.273330 |
+--------+-----------+
|18 | 1.441462 |
+--------+-----------+
|32 | 1.627798 |
+--------+-----------+
|76 | 1.683131 |
+--------+-----------+
|85 | 1.679456 |
+--------+-----------+
|120 | 1.721311 |
+--------+-----------+
|128 | 1.725482 |
+--------+-----------+
|512 | 1.784117 |
+--------+-----------+
BUG=skia:
Committed: https://skia.googlesource.com/skia/+/dc77b3591841bf1e70ed45455490d688e5d4e6f9
Patch Set 1 #
Total comments: 11
Patch Set 2 : Modify varibles definition #Patch Set 3 : Add macro define for data load/store #
Total comments: 6
Patch Set 4 : Remove the copyright #Patch Set 5 : Adding AUTHORS #
Messages
Total messages: 33 (8 generated)
yang.zhang@linaro.org changed reviewers: + bero@linaro.org, caryclark@google.com, djsollen@google.com, mtklein@google.com, reed@google.com
Hi all I have optimized RGB16 blitV functions with NEON for ARM platform. Could you help to review it? Regards Yang
Can we achieve this sort of speed-up using SkNx instead of custom assembly?
https://codereview.chromium.org/1213723002/diff/1/src/opts/SkBlitMask_opts_ar... File src/opts/SkBlitMask_opts_arm_neon.cpp (right): https://codereview.chromium.org/1213723002/diff/1/src/opts/SkBlitMask_opts_ar... src/opts/SkBlitMask_opts_arm_neon.cpp:2: * Copyright 2016 The Android Open Source Project Let's put 2013 (file created) or 2015 (now) here. https://codereview.chromium.org/1213723002/diff/1/src/opts/SkBlitMask_opts_ar... src/opts/SkBlitMask_opts_arm_neon.cpp:268: uint32x4_t vsrc32, vscale5; Does writing it like this recover any of the slowdown when height is 1-7? if (height >= 8) { <setup> while (height >= 8) { <blit 8 rows> } } while (height --> 0) { <blit 1 row> } https://codereview.chromium.org/1213723002/diff/1/src/opts/SkBlitMask_opts_ar... src/opts/SkBlitMask_opts_arm_neon.cpp:272: uint16x8x2_t vdst32; I'd prefer if if you could move the declarations of these variables closer to where they're first used. This one in particular is easy to get confused about without a type... it seems by name like it'd be uint32x4_t. https://codereview.chromium.org/1213723002/diff/1/src/opts/SkBlitMask_opts_ar... src/opts/SkBlitMask_opts_arm_neon.cpp:280: vmaskq_g16 = vdupq_n_u16(SK_G16_MASK_IN_PLACE); Why do we make four masks here when we can use vand / vbic with two? https://codereview.chromium.org/1213723002/diff/1/src/opts/SkBlitMask_opts_ar... src/opts/SkBlitMask_opts_arm_neon.cpp:288: vdev = vld1q_lane_u16(device, vdev, 0); This code (and the stores) might read more clearly as a loop? for (int j = 0; j < 8; j++) { vdev = vldq_lane_u16(device, vdev, j); device = (uint16_t*)((char*)device + deviceRB); } Or does vldq_lane_u16 require the lane be a compile-time constant? If so I might write it out like this: // vldq1_lane_u16 requires lane to be a compile-time constant, so no for-loop. #define LOAD(row) \ vdev = vld1q_lane_u16(device, vdev, row); \ device = (uint16_t*)((char*)device + deviceRB) LOAD(0); LOAD(1); LOAD(2); LOAD(3); LOAD(4); LOAD(5); LOAD(6); LOAD(7); #undef LOAD Using macros to make it clear that the repetition is intentional and all identical and being a bit more compact. https://codereview.chromium.org/1213723002/diff/1/src/opts/SkBlitMask_opts_ar... src/opts/SkBlitMask_opts_arm_neon.cpp:349: void SkRGB16BlitterBlitH_neon(uint16_t* device, Let's leave this out until it's used?
On 2015/06/26 13:33:25, reed1 wrote: > Can we achieve this sort of speed-up using SkNx instead of custom assembly? Not right out of the box yet. I need to first figure out how to handle 565. Generally I'm tempted to just deemphasize the importance of 565, and so convert up to 8888, do everything there, then convert back. But I feel like I need to compare that against a good-faith 565 tuned set of methods first. This code in particular is also somewhat tricky in that it's working between 16-bit and 32-bit, which is again something I'd need to flesh out a bit first. Though, I may not need to: I have to do some scribbling to check my understanding, but I don't think this 32-bit math is strictly necessary, but rather it allows us to manipulate red+blue together without clobbering each other in a sort of SIMD^2 manner. If that's the case, it might be less work to handle 565.
I have updated this patch according to your comments. Please check it. https://codereview.chromium.org/1213723002/diff/1/src/opts/SkBlitMask_opts_ar... File src/opts/SkBlitMask_opts_arm_neon.cpp (right): https://codereview.chromium.org/1213723002/diff/1/src/opts/SkBlitMask_opts_ar... src/opts/SkBlitMask_opts_arm_neon.cpp:2: * Copyright 2016 The Android Open Source Project On 2015/06/26 14:05:50, mtklein wrote: > Let's put 2013 (file created) or 2015 (now) here. Done. https://codereview.chromium.org/1213723002/diff/1/src/opts/SkBlitMask_opts_ar... src/opts/SkBlitMask_opts_arm_neon.cpp:268: uint32x4_t vsrc32, vscale5; On 2015/06/26 14:05:50, mtklein wrote: > Does writing it like this recover any of the slowdown when height is 1-7? > > if (height >= 8) { > <setup> > while (height >= 8) { > <blit 8 rows> > } > } > while (height --> 0) { > <blit 1 row> > } Yeah. The setup code may have an effect on the cases with height 1~7. https://codereview.chromium.org/1213723002/diff/1/src/opts/SkBlitMask_opts_ar... src/opts/SkBlitMask_opts_arm_neon.cpp:272: uint16x8x2_t vdst32; On 2015/06/26 14:05:50, mtklein wrote: > I'd prefer if if you could move the declarations of these variables closer to > where they're first used. > > This one in particular is easy to get confused about without a type... it seems > by name like it'd be uint32x4_t. Done. https://codereview.chromium.org/1213723002/diff/1/src/opts/SkBlitMask_opts_ar... src/opts/SkBlitMask_opts_arm_neon.cpp:280: vmaskq_g16 = vdupq_n_u16(SK_G16_MASK_IN_PLACE); On 2015/06/26 14:05:50, mtklein wrote: > Why do we make four masks here when we can use vand / vbic with two? Done. https://codereview.chromium.org/1213723002/diff/1/src/opts/SkBlitMask_opts_ar... src/opts/SkBlitMask_opts_arm_neon.cpp:288: vdev = vld1q_lane_u16(device, vdev, 0); On 2015/06/26 14:05:50, mtklein wrote: > This code (and the stores) might read more clearly as a loop? > > for (int j = 0; j < 8; j++) { > vdev = vldq_lane_u16(device, vdev, j); > device = (uint16_t*)((char*)device + deviceRB); > } > > Or does vldq_lane_u16 require the lane be a compile-time constant? > If so I might write it out like this: > > // vldq1_lane_u16 requires lane to be a compile-time constant, so no for-loop. > #define LOAD(row) \ > vdev = vld1q_lane_u16(device, vdev, row); \ > device = (uint16_t*)((char*)device + deviceRB) > LOAD(0); LOAD(1); LOAD(2); LOAD(3); > LOAD(4); LOAD(5); LOAD(6); LOAD(7); > #undef LOAD > > Using macros to make it clear that the repetition is intentional and all > identical and being a bit more compact. Done.
https://codereview.chromium.org/1213723002/diff/40001/src/opts/SkBlitMask_opt... File src/opts/SkBlitMask_opts_arm_neon.cpp (right): https://codereview.chromium.org/1213723002/diff/40001/src/opts/SkBlitMask_opt... src/opts/SkBlitMask_opts_arm_neon.cpp:282: uint16x8_t vmaskq_g16 = vdupq_n_u16(SK_G16_MASK_IN_PLACE); Oh, I was actually asking about reducing the four masks to two the other way, but given what you've done here I think it can just be one! What I meant was, use a single mask with vandq, or vbicq when you'd use ~mask: uint16x8_t greenMask = vdupq_n_u16(SK_G16_MASK_IN_PLACE); ... uint16x8x2_t vdst = vzipq_u16(vbicq_u16(vdev, greenMask), vandq_u16(vdev, greenMask)); ... https://codereview.chromium.org/1213723002/diff/40001/src/opts/SkBlitMask_opt... src/opts/SkBlitMask_opts_arm_neon.cpp:298: uint16x8x2_t vdst = vzipq_u16((vdev & vmaskq_ng16), (vdev & vmaskq_g16)); Remind me, why do we need to zip these together? Aren't the operations done to _hi and _lo always the same? Can't we just operate on two vectors without zipping them, one with red and blue, the other with just green? uint16x8_t rb = vbicq_u16(vdev, greenMask), g = vandq_u16(vdev, greenMask); ...
https://codereview.chromium.org/1213723002/diff/40001/src/opts/SkBlitMask_opt... File src/opts/SkBlitMask_opts_arm_neon.cpp (right): https://codereview.chromium.org/1213723002/diff/40001/src/opts/SkBlitMask_opt... src/opts/SkBlitMask_opts_arm_neon.cpp:282: uint16x8_t vmaskq_g16 = vdupq_n_u16(SK_G16_MASK_IN_PLACE); On 2015/06/29 17:16:17, mtklein wrote: > Oh, I was actually asking about reducing the four masks to two the other way, > but given what you've done here I think it can just be one! > > What I meant was, use a single mask with vandq, or vbicq when you'd use ~mask: > > uint16x8_t greenMask = vdupq_n_u16(SK_G16_MASK_IN_PLACE); > ... > > uint16x8x2_t vdst = vzipq_u16(vbicq_u16(vdev, greenMask), > vandq_u16(vdev, greenMask)); > ... Yeah. The results are the same. But I think there isn't difference on performance. Besides using a single mask, is there any other benefit? https://codereview.chromium.org/1213723002/diff/40001/src/opts/SkBlitMask_opt... src/opts/SkBlitMask_opts_arm_neon.cpp:298: uint16x8x2_t vdst = vzipq_u16((vdev & vmaskq_ng16), (vdev & vmaskq_g16)); On 2015/06/29 17:16:17, mtklein wrote: > Remind me, why do we need to zip these together? Aren't the operations done to > _hi and _lo always the same? > > Can't we just operate on two vectors without zipping them, one with red and > blue, the other with just green? > > uint16x8_t rb = vbicq_u16(vdev, greenMask), > g = vandq_u16(vdev, greenMask); > ... Here, I used vzip instruction to implement the following operations. C implementation: ((c & SK_G16_MASK_IN_PLACE) << 16) | (c & ~SK_G16_MASK_IN_PLACE) another NEON implementation: uint32x4_t dev_lo = vmovl_u16(vget_low_u16(vdev)); uint32x4_t dev_hi = vmovl_u16(vget_high_u16(vdev)); // unpack them in 32 bits dev_lo = (dev_lo & vmask_ng16) | vshlq_n_u32(dev_lo & vmask_g16, 16); dev_hi = (dev_hi & vmask_ng16) | vshlq_n_u32(dev_hi & vmask_g16, 16); I think that using vzip instruction is better because less instructions are needed.
The CQ bit was checked by mtklein@google.com
lgtm https://codereview.chromium.org/1213723002/diff/40001/src/opts/SkBlitMask_opt... File src/opts/SkBlitMask_opts_arm_neon.cpp (right): https://codereview.chromium.org/1213723002/diff/40001/src/opts/SkBlitMask_opt... src/opts/SkBlitMask_opts_arm_neon.cpp:282: uint16x8_t vmaskq_g16 = vdupq_n_u16(SK_G16_MASK_IN_PLACE); On 2015/06/30 04:51:53, yang.zhang wrote: > On 2015/06/29 17:16:17, mtklein wrote: > > Oh, I was actually asking about reducing the four masks to two the other way, > > but given what you've done here I think it can just be one! > > > > What I meant was, use a single mask with vandq, or vbicq when you'd use ~mask: > > > > uint16x8_t greenMask = vdupq_n_u16(SK_G16_MASK_IN_PLACE); > > ... > > > > uint16x8x2_t vdst = vzipq_u16(vbicq_u16(vdev, greenMask), > > vandq_u16(vdev, greenMask)); > > ... > > Yeah. The results are the same. But I think there isn't difference on > performance. Besides using a single mask, is there any other benefit? Oh, just seemed tidier. I agree it's not a big deal either way. https://codereview.chromium.org/1213723002/diff/40001/src/opts/SkBlitMask_opt... src/opts/SkBlitMask_opts_arm_neon.cpp:298: uint16x8x2_t vdst = vzipq_u16((vdev & vmaskq_ng16), (vdev & vmaskq_g16)); On 2015/06/30 04:51:53, yang.zhang wrote: > On 2015/06/29 17:16:17, mtklein wrote: > > Remind me, why do we need to zip these together? Aren't the operations done > to > > _hi and _lo always the same? > > > > Can't we just operate on two vectors without zipping them, one with red and > > blue, the other with just green? > > > > uint16x8_t rb = vbicq_u16(vdev, greenMask), > > g = vandq_u16(vdev, greenMask); > > ... > Here, I used vzip instruction to implement the following operations. > > C implementation: > ((c & SK_G16_MASK_IN_PLACE) << 16) | (c & ~SK_G16_MASK_IN_PLACE) > > another NEON implementation: > uint32x4_t dev_lo = vmovl_u16(vget_low_u16(vdev)); > uint32x4_t dev_hi = vmovl_u16(vget_high_u16(vdev)); > // unpack them in 32 bits > dev_lo = (dev_lo & vmask_ng16) | vshlq_n_u32(dev_lo & vmask_g16, 16); > dev_hi = (dev_hi & vmask_ng16) | vshlq_n_u32(dev_hi & vmask_g16, 16); > > I think that using vzip instruction is better because less instructions are > needed. sgtm
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1213723002/40001
The author yang.zhang@linaro.org has not signed Google Contributor License Agreement. Please visit https://cla.developers.google.com to sign and manage CLA.
The CQ bit was unchecked by commit-bot@chromium.org
Exceeded global retry quota
On 2015/06/30 12:43:52, commit-bot: I haz the power wrote: > Exceeded global retry quota Hmm, looks like someone else snuck in and added a copyright header to BlitMask_opts_arm_neon.cpp. Better just rebase on top of that.
On 2015/06/30 12:46:10, mtklein wrote: > On 2015/06/30 12:43:52, commit-bot: I haz the power wrote: > > Exceeded global retry quota > > Hmm, looks like someone else snuck in and added a copyright header to > BlitMask_opts_arm_neon.cpp. Better just rebase on top of that. The reason why I added a copyright header to BlitMask_opts_arm_neon.cpp is that I can't summit this patch for review without the copyright header. When I used git cl upload, the error log is "miss the copyright header in BlitMask_opts_arm_neon.cpp".
On 2015/06/30 12:43:11, commit-bot: I haz the power wrote: > The author mailto:yang.zhang@linaro.org has not signed Google Contributor License > Agreement. Please visit https://cla.developers.google.com to sign and manage > CLA. Currently, I'm already in the list of AOSP CLA. Is it OK?
I think you need to rebase locally, and then re-upload. Sk BlitMask_opts_arm_neon.cpp already has a copyright, so your diff won't apply.
On 2015/07/01 13:12:27, reed1 wrote: > I think you need to rebase locally, and then re-upload. Sk > BlitMask_opts_arm_neon.cpp already has a copyright, so your diff won't apply. done.
Hi all Currently, I'm already in the list of AOSP CLA. Is it OK?
On 2015/07/06 08:57:59, yang.zhang wrote: > Hi all > > Currently, I'm already in the list of AOSP CLA. Is it OK? I don't think so. Seems like everything is switching to this one centralized CLA.
On 2015/07/06 14:22:55, mtklein wrote: > On 2015/07/06 08:57:59, yang.zhang wrote: > > Hi all > > > > Currently, I'm already in the list of AOSP CLA. Is it OK? > > I don't think so. Seems like everything is switching to this one centralized > CLA. Hi mtklein The CLA is ok. Could you help to check it?
The CQ bit was checked by mtklein@google.com
The patchset sent to the CQ was uploaded after l-g-t-m from mtklein@google.com Link to the patchset: https://codereview.chromium.org/1213723002/#ps60001 (title: "Remove the copyright")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1213723002/60001
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: skia_presubmit-Trybot on client.skia.fyi (JOB_FAILED, http://build.chromium.org/p/client.skia.fyi/builders/skia_presubmit-Trybot/bu...)
On 2015/07/14 11:54:29, commit-bot: I haz the power wrote: > Try jobs failed on following builders: > skia_presubmit-Trybot on client.skia.fyi (JOB_FAILED, > http://build.chromium.org/p/client.skia.fyi/builders/skia_presubmit-Trybot/bu...) Ooof, one more thing to do: The email yang.zhang@linaro.org is not in Skia's AUTHORS file. Issue owner, this CL must include an addition to the Skia AUTHORS file.
On 2015/07/14 11:55:44, mtklein wrote: > On 2015/07/14 11:54:29, commit-bot: I haz the power wrote: > > Try jobs failed on following builders: > > skia_presubmit-Trybot on client.skia.fyi (JOB_FAILED, > > > http://build.chromium.org/p/client.skia.fyi/builders/skia_presubmit-Trybot/bu...) > > Ooof, one more thing to do: > The email mailto:yang.zhang@linaro.org is not in Skia's AUTHORS file. > Issue owner, this CL must include an addition to the Skia AUTHORS file. I have added Linaro <*@linaro.org> to AUTHORS file.
The CQ bit was checked by mtklein@google.com
The patchset sent to the CQ was uploaded after l-g-t-m from mtklein@google.com Link to the patchset: https://codereview.chromium.org/1213723002/#ps80001 (title: "Adding AUTHORS")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1213723002/80001
Message was sent while issue was closed.
Committed patchset #5 (id:80001) as https://skia.googlesource.com/skia/+/dc77b3591841bf1e70ed45455490d688e5d4e6f9 |