|
|
Descriptionskia: blend32_16_row for neon version
This includes blend32_16_row neon implementation
for aarch32 and aarch64.
For performance,
blend32_16_row is called in following tests in nanobench.
- Xfermode_SrcOver
- tablebench
- rotated_rects_bw_alternating_transparent_and_opaque_srcover
- rotated_rects_bw_changing_transparent_srcover
- rotated_rects_bw_same_transparent_srcover
- luma_colorfilter_large
- luma_colorfilter_small
- chart_bw
I can see perf increase in following two tests, especially. For others, looks
similar.
For each, I tried to run two times.
1) Xfermode_SrcOver
<org>
- D/skia ( 2000): 3M 57 17.3µs 17.4µs 17.4µs 17.7µs 1%
█▃▂▃▂▂▂▁▃▂ 565 Xfermode_SrcOver
- D/skia ( 1915): 3M 70 13.5µs 16.9µs 16.7µs 18.8µs 9%
▆█▄▅█▁▅▅▆▄ 565 Xfermode_SrcOver
<new>
- D/skia ( 2000): 3M 8 11.6µs 11.8µs 12.1µs 14.4µs 7%
▃█▁▁▂▁▁▁▂▂ 565 Xfermode_SrcOver
- D/skia ( 2004): 3M 62 10.3µs 12.9µs 13µs 15.2µs 11%
█▅▅▆▁▅▅▅▇▃ 565 Xfermode_SrcOver
2)
luma_colorfilter_large
<org>
- D/skia ( 2000): 159M 8 136µs 136µs 136µs 139µs 1%
█▃▁▂▁▁▁▁▁▁ 565 luma_colorfilter_large
- D/skia ( 1915): 158M 2 135µs 177µs 182µs 269µs 22%
▆▃█▁▁▃▃▃▃▃ 565 luma_colorfilter_large
<new>
- D/skia ( 2000): 157M 5 84.2µs 85.3µs 87.5µs 110µs 9%
█▁▂▁▁▁▁▁▁▁ 565 luma_colorfilter_large
- D/skia ( 2004): 159M 6 84.7µs 110µs 112µs 144µs 18%
█▄▇▁▁▄▃▄▄▆ 565 luma_colorfilter_large
Committed: https://skia.googlesource.com/skia/+/402448d6818cab9d7b7633a0c18fcf574c915357
Patch Set 1 #
Total comments: 3
Patch Set 2 : skia: blend32_16_row for neon version #Patch Set 3 : skia: blend32_16_row for neon version #Patch Set 4 : skia: blend32_16_row for neon version #
Total comments: 5
Patch Set 5 : skia: blend32_16_row for neon version #
Total comments: 4
Patch Set 6 : skia: blend32_16_row for neon version #
Messages
Total messages: 40 (3 generated)
mlee@nvidia.com changed reviewers: + bsalomon@chromium.org, bsalomon@google.com, djsollen@chromium.org, djsollen@google.com, reed@chromium.org, reed@google.com
Add reviewers. I verified skia_gm for aarch32 neon and aarch64 neon.
On 2015/01/14 10:33:37, mlee wrote: > Add reviewers. > > I verified skia_gm for aarch32 neon and aarch64 neon. hmm...I try to add "what reed ran in his change" and "what mtklein helped to add in previous my change" by myself but they are still purple.
https://codereview.chromium.org/847363002/diff/1/src/core/SkBlitter_RGB16.cpp File src/core/SkBlitter_RGB16.cpp (right): https://codereview.chromium.org/847363002/diff/1/src/core/SkBlitter_RGB16.cpp... src/core/SkBlitter_RGB16.cpp:534: for (int count = 0; count < 8; count++) { We shouldn't need to do this here. You can just do it in the assembly using the vdup instructions http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0491g/BABDBBJB... Then you shouldn't need to update the proc to pass an array instead of a single value.
On 2015/01/14 14:57:54, djsollen wrote: Hi, Below is the reason why I prepare eight RGBA pixels in memory. :: src pixel is RGBA and this is to prepare eight src pixles, which are arrached in memory as R, G, B, A, R, G, B, A ... ld4 command pull data from memory and simultaneously separate values into different registers: : V24 has 8 red (each one has 8bit), V25 has 8 green, v26 has 8 blue, v27 has 8 alpha. I have no idea if we can prepare this using DUP command. With DUP, I can replicate src to 128bit register, but it doesn't sepate values into different registers. Let me know if I am missing something. If maintaining to load single src rather than src[] is necessary as original code does, I can do prepare 8 src pixels array in neon function inside, but this will be done in every function call, which can cause overhead. Because of that, I prepared it in constructor, directly. Let me know your thought once again. As I mentioned, if you want to maintain current code's single src load, I will move this array preparation into neon function inside. Thanks, Min-wuk Lee.
load single src, so there's no change in arguments for ColorProc16. I see there's no actual perf change compared to previous my patch set. i.e. having srcs array for 8 pixels in neon function doesn't have actual overhead. I verify skia_gm for aarch32 and aarch64 as well with patch set 2. Please review. https://codereview.chromium.org/847363002/diff/1/src/core/SkBlitter_RGB16.cpp File src/core/SkBlitter_RGB16.cpp (right): https://codereview.chromium.org/847363002/diff/1/src/core/SkBlitter_RGB16.cpp... src/core/SkBlitter_RGB16.cpp:534: for (int count = 0; count < 8; count++) { On 2015/01/14 14:57:54, djsollen wrote: Hi, djsollen. Thanks for the review. This is the reason why I prepare eight RGBA pixels in memory. src pixel is RGBA and this is in order to prepare eight src pixles, which are arrached in memory as R, G, B, A, R, G, B, A ... R, G, B, A. "ld4" command pull data from memory and simultaneously separate values into different four 128 bit registers: : V24 has 8 red componets(each one has 8bit), V25 has 8 green, v26 has 8 blue, v27 has 8 alpha. I have no idea if we can prepare this using DUP command, though. As I understand, with DUP, I can replicate src to 128bit register, but it doesn't separate values into different registers. Let me know if I am missing something. If maintaining to load single src rather than src[] is necessary as original code does, I can do prepare 8 src pixels array in neon function inside, but this will be done in every function call, which can be overhead. Because of that, I've prepared it in constructor. Let me know your thought once again. As I mentioned, if you want to maintain original single src load, I will move this array preparation into neon function inside. Thanks, Min-wuk Lee.
Please review patch set 2 and if this is fine, can someone trigger trybot? Thanks.
Hi, Djsollen. I had more time to consider your comment today. I think you might mean scaling scalar single src to 32x4x4 vector using dup command. I tried, like below attached. With that, I can pass skia_gm on aarch32 but am not able to pass on aarch64. I am not sure if getting vaddr with <SkPMColor *> cast on "vector variable" is right or not. Also, vdup's result is vector, whereas I want to expand single scalar src to sequence scalar 8 srcs. i.e. This is for preparing new src. That's why I have current patch set 2 style using for loop. Please give your opinion. =====What I tried on the top of patch set 2 === --- a/src/opts/SkBlitRow_opts_arm_neon.cpp +++ b/src/opts/SkBlitRow_opts_arm_neon.cpp @@ -377,7 +377,8 @@ static uint32_t pmcolor_to_expand16(SkPMColor c) { void Color32A_D565_neon(uint16_t dst[], SkPMColor src, int count, int x, int y) { uint32_t src_expand; unsigned scale; - SkPMColor srcs[8]; + uint32x4_t srcs_0; + uint32x4x2_t srcs; SkPMColor* src0; if (count <= 0) return; @@ -404,10 +405,10 @@ void Color32A_D565_neon(uint16_t dst[], SkPMColor src, int count, int x, int y) break; } - for (int k = 0; k < 8; k++) { - srcs[k] = src; - } - src0 = srcs; + srcs_0 = vdupq_n_u32(src); + srcs.val[0] = srcs_0; + srcs.val[1] = srcs_0; + src0 = (SkPMColor*)(&srcs); #ifdef SK_CPU_ARM64 asm (
Patch set 3: Limit this to little endian.
Now, I understand Djsollen's previous comment: from src, get r, g, b, a component, and duplicate each component's 8bit to 8x8 using vdup_n_u8, then, there will be four uint8x8_t. Am I right? I try to use DUP Vd.<T>, Wn in assembly code, directly. In four 32 bit variables, each component's 8bit will be there. If I use 8B option, it will replicate low order bits from 32 bits register, so will be 8bitx8 for each component. However, the problem that I face is I can't define variable which will be interpreted to Wn in aarch64. Whatever I prepare uint32_t, uint16_t, uint8_t, it will be interpreted to Xn register in assembly part. - uint32_t src_a; - src_a = SkGetPackedA32(src); In assembly, write - dup v24.8b %[src_a] \n\t Then, I meet “operant mismatch error, like dup v24.8b, x15”. Since my code is written mostly as assembly code, if I need to use dup, I need to use neon helper functions, like vdup_n_u8, which may a lot of update. Can I have current assembly based code and limit this to little endian? If this is acceptable, I still believe there's minor overhead to prepare 8 src pixels and load these with interleaving by single ld4 command. Thanks.
Hi, I removed assembly based code instead this patchset uses ARM NEON intrinsics. Removed little endian limitation as well. Passed skia_gm for aarch32, aarch64. Please review again and trigger trybot if this is fine for you. Thank you!
mlee@nvidia.com changed reviewers: + junov@google.com
Ptla. Any feedback on this?
https://codereview.chromium.org/847363002/diff/1/src/core/SkBlitter_RGB16.cpp File src/core/SkBlitter_RGB16.cpp (right): https://codereview.chromium.org/847363002/diff/1/src/core/SkBlitter_RGB16.cpp... src/core/SkBlitter_RGB16.cpp:534: for (int count = 0; count < 8; count++) { On 2015/01/14 14:57:54, djsollen wrote: Done in patch set 4.
I'm going to defer to mike on the changes to the blitter's constructor, but otherwise things are looking pretty good. I saw the numbers you posted on 0xbenchmark 2D for drawrect and drawarc, but do you see comparable results when running Skia's nanobench tool with the --config 565 option? https://codereview.chromium.org/847363002/diff/60001/src/opts/SkBlitRow_opts_... File src/opts/SkBlitRow_opts_arm_neon.cpp (right): https://codereview.chromium.org/847363002/diff/60001/src/opts/SkBlitRow_opts_... src/opts/SkBlitRow_opts_arm_neon.cpp:1793: #if 0 remove the #if 0 block and replace with... Color32A_D565_neon, // Color32_D565, Color32A_D565_neon, // Color32A_D565, Color32A_D565_neon, // Color32_D565_Dither, Color32A_D565_neon // Color32A_D565_Dither
https://codereview.chromium.org/847363002/diff/60001/src/core/SkBlitter_RGB16... File src/core/SkBlitter_RGB16.cpp (right): https://codereview.chromium.org/847363002/diff/60001/src/core/SkBlitter_RGB16... src/core/SkBlitter_RGB16.cpp:553: // TODO: respect fDoDither Why is this a todo? If we need a todo comment, perhaps we need it inside the factory, but not here. I think you can set the correct flags here. https://codereview.chromium.org/847363002/diff/60001/src/core/SkBlitter_RGB16... src/core/SkBlitter_RGB16.cpp:554: flags |= SkBlitRow::kSrcPixelAlpha_Flag; If we're going to always set the alpha flag, lets at least add an assert that color's alpha is not 0xFF.
https://codereview.chromium.org/847363002/diff/60001/src/core/SkBlitter_RGB16... File src/core/SkBlitter_RGB16.cpp (right): https://codereview.chromium.org/847363002/diff/60001/src/core/SkBlitter_RGB16... src/core/SkBlitter_RGB16.cpp:556: fColorProc16 = SkBlitRow::PlatformColorFactory565(flags); Oops. This is mistake and that's why there's test failure in x86 and arm_without_neon platform. This should be chagned to SkBlitRow::ColorFactory16(flags). I will do that in next patch set update. https://codereview.chromium.org/847363002/diff/60001/src/opts/SkBlitRow_opts_... File src/opts/SkBlitRow_opts_arm_neon.cpp (right): https://codereview.chromium.org/847363002/diff/60001/src/opts/SkBlitRow_opts_... src/opts/SkBlitRow_opts_arm_neon.cpp:1793: #if 0 On 2015/01/21 14:36:53, djsollen wrote: > remove the #if 0 block and replace with... > > Color32A_D565_neon, // Color32_D565, > Color32A_D565_neon, // Color32A_D565, > Color32A_D565_neon, // Color32_D565_Dither, > Color32A_D565_neon // Color32A_D565_Dither Done. https://codereview.chromium.org/847363002/diff/80001/src/core/SkBlitter_RGB16... File src/core/SkBlitter_RGB16.cpp (right): https://codereview.chromium.org/847363002/diff/80001/src/core/SkBlitter_RGB16... src/core/SkBlitter_RGB16.cpp:555: Removed todo comment, but I try to set flag here. Please see if this is right condition to set flags. As I see from reed previous change, global alpha flag is ignored here. https://codereview.chromium.org/847363002/diff/80001/src/core/SkBlitter_RGB16... src/core/SkBlitter_RGB16.cpp:559: fColorProc16 = SkBlitRow::ColorFactory16(flags); Compared to previous patchset, we should have ColorFactory16, here. Fixed. This is why previous patchset trybot showed segmentation fault in ARM_no_neon and non-ARM like X86.
The S4 bot is known to have problems and the error there is not the result of your patch.
On 2015/01/22 14:30:49, djsollen wrote: > The S4 bot is known to have problems and the error there is not the result of > your patch. Thanks for the info. And this run for S4 is just passed.
Hi, Djsollen. I ran nanobech with following commend, but I foudn that blend32_16_row code itself is not called. ./platform_tools/android/bin/android_run_skia --release nanobench --config 565 Let me know if there's remained. Thanks.
Sounds like we should update the benches. Does your new code get called when we run dm?
Hi Reed. I dont run dm by myself.(but saw try testbot is doing) I saw this is called in skim_gm. Thanks.
I have been writing a guide to running DM. You can read the work-in-progress version here: https://skia.org/dev/contrib/testing?cl=868983002 If we're going to land this code, we need to know it's tested for correctness and performance. Generally that means we have it covered by some through unit test or a GM that draws using it on at least one of our bots, and we have some benchmark that measures its speed on at least one of our bots. If we already have these, that's great; please identify the tests and benchmarks that exercise the code and post up some indication of the performance change. If we don't have tests or benchmarks, you need to add them. How have you been testing this code?
On 2015/01/23 14:21:27, mtklein wrote: > I have been writing a guide to running DM. You can read the work-in-progress > version here: > https://skia.org/dev/contrib/testing?cl=868983002 The doc landed, so now https://skia.org/dev/contrib/testing or less site/dev/contrib/testing.md from a skia checkout.
Hi, Mtklein. With below step, I've checked in performance and correctness. - I shared some part of data though e-mail when I asked trybot trigger, but looks like you were not there in that mail thread. For Performance, I see perf increase in android 0xbenchmark. - There's 2D test. Its subtest drawrect, drawarc, drawcircle2 use blend32_16_row. I used the latest 0xbenchmark from google playstore. 1.tegra aarch64 1) landscape based 1440x810 DrawRect : 43.4 => 53.4 (23%) in avg DrawArc : almost no change 2) portrait based 810x1440 DrawRect: 39.0 => 50.3 (28.9%) in avg DrawArc : 53.6 => 58.8 (9.7%) in avg 2. tegra-k1 aarch32 portrait display, 1200x1920 panel DrawCircle2 : 34.2 => 36.4 (6.4%) DrawRect : 21.9 => 39.0 (78%) in avg DrawArc : 36.7 => 49.9 (36%) in avg For Correctness, I used skim_gm. Below is the step that I followed. I tested on both aarch32 and aarch64. In original code, skia_gm -w /data/skia_temp adb pull /data/skia_temp to my desktop then, with new my code, adb push /data/skia_temp from my desktop to device skia_gm -r /data/skia_temp test is passed with below result 01-09 06:31:25.541 D/skia ( 2130): Ran 1335 tests: NoGpuContext=0 IntentionallySkipped=32 RenderModeMismatch=0 GeneratePdfFailed=0 ExpectationsMismatch=0 MissingExpectations=0 WritingReferenceImage=0
Hi. If we already have these, that's great ==> In correct standpoints, I see several images were drawn by this code in skia_gm, so skia_gm (golden master) is good. For perf standpoints, I will try to run full nanobench and find if there's any test which calls blend32_16_row, next Monday. (I am in Korea and I left the office. :-) ) - Previously, I ran with --config 565 option. Two questions, I have. Q1. how to compare nanobench result? Can I compare mean value between org and new? Q2. Is there any other skia bench in the tree that I can try? Thank you.
Just see https://skia.org/dev/contrib/testing. Will try dm for correctness as well.
> Q1. how to compare nanobench result? Can I compare mean value between org and > new? I tend to use bin/c and bin/compare to do performance comparisons. Read through them both before running them though to make sure you understand what they're doing. > Q2. Is there any other skia bench in the tree that I can try? nanobench links in all our benchmarks.
I ran full nanobench, but I see neither SkRGB16_Blitter::blitH nor SkRGB16_Blitter::blitRect is called. Also, ran dm, but it doesn't call, either. To be frank with you, I am not familiar to write 1) bench/2) correctness check app (additionally, to write exact bench app), yet. If these are needed, can someone help to add proper apps? As I mentioned, from my side, I've used 0xbenchmark and golden master (skia/gm) and check perf/correctness respectively and have confirmed benefit of this change. Thanks.
sorry for confusion. I did mistake and added log in my android branch. :-). After I add debug log in skia branch's blend32_16_row, they are called many places in nanobench and dm. I will update result once again.
For correctness. dm is passed from my side. Below is the step that I've checked. 1) adb push resources from PC to device (/data/respath) 2) ./platform_tools/android/bin/android_run_skia dm -w /data/dm_org1 --resourcePath /data/respath 3) ./platform_tools/android/bin/android_run_skia dm -w /data/dm_org2 --resourcePath /data/respath 4) adb pull /data/dm_org1 and /data/dm_org2 to PC and compare directories. - see that all png files are matched but dm.json is not matched between outputs through consecutive dm runnings even I use same source. (why ... Is that a bug?) 5) Now, apply this patch & build 6) ./platform_tools/android/bin/android_run_skia dm -w /data/dm_new --resourcePath /data/respath 7) adb pull /data/dm_new and compare org directory. dm.json is not matched but all png files are matched.
For performance, blend32_16_row is called in following tests in nanobench. - Xfermode_SrcOver - tablebench - rotated_rects_bw_alternating_transparent_and_opaque_srcover - rotated_rects_bw_changing_transparent_srcover - rotated_rects_bw_same_transparent_srcover - luma_colorfilter_large - luma_colorfilter_small - chart_bw I can see perf increase in following two tests, especially. For others, looks similar. For each, I tried to run two times. 1) Xfermode_SrcOver <org> - D/skia ( 2000): 3M 57 17.3µs 17.4µs 17.4µs 17.7µs 1% █▃▂▃▂▂▂▁▃▂ 565 Xfermode_SrcOver - D/skia ( 1915): 3M 70 13.5µs 16.9µs 16.7µs 18.8µs 9% ▆█▄▅█▁▅▅▆▄ 565 Xfermode_SrcOver <new> - D/skia ( 2000): 3M 8 11.6µs 11.8µs 12.1µs 14.4µs 7% ▃█▁▁▂▁▁▁▂▂ 565 Xfermode_SrcOver - D/skia ( 2004): 3M 62 10.3µs 12.9µs 13µs 15.2µs 11% █▅▅▆▁▅▅▅▇▃ 565 Xfermode_SrcOver 2) luma_colorfilter_large <org> - D/skia ( 2000): 159M 8 136µs 136µs 136µs 139µs 1% █▃▁▂▁▁▁▁▁▁ 565 luma_colorfilter_large - D/skia ( 1915): 158M 2 135µs 177µs 182µs 269µs 22% ▆▃█▁▁▃▃▃▃▃ 565 luma_colorfilter_large <new> - D/skia ( 2000): 157M 5 84.2µs 85.3µs 87.5µs 110µs 9% █▁▂▁▁▁▁▁▁▁ 565 luma_colorfilter_large - D/skia ( 2004): 159M 6 84.7µs 110µs 112µs 144µs 18% █▄▇▁▁▄▃▄▄▆ 565 luma_colorfilter_large
Please take look at. Let me know if anything is remained. Thank you.
lgtm
please address nits, but lgtm after that https://codereview.chromium.org/847363002/diff/80001/src/core/SkBlitter_RGB16... File src/core/SkBlitter_RGB16.cpp (right): https://codereview.chromium.org/847363002/diff/80001/src/core/SkBlitter_RGB16... src/core/SkBlitter_RGB16.cpp:553: if (SkGetPackedA32(fSrcColor32) < 0xFF) nit: Skia always uses { } for if-statements https://codereview.chromium.org/847363002/diff/80001/src/core/SkBlitter_RGB16... src/core/SkBlitter_RGB16.cpp:556: if (fDoDither) same nit about { }
Hi, Reed. Update new patch set, fixing two minor concerns. Thanks.
lgtm
The CQ bit was checked by djsollen@google.com
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/847363002/100001
Message was sent while issue was closed.
Committed patchset #6 (id:100001) as https://skia.googlesource.com/skia/+/402448d6818cab9d7b7633a0c18fcf574c915357 |