|
|
Created:
6 years, 11 months ago by qiankun Modified:
6 years, 10 months ago CC:
skia-review_googlegroups.com Base URL:
https://skia.googlesource.com/skia.git@SkSrcOver32To16SSE Visibility:
Public. |
DescriptionSSE2 implementation of S32A_D565_Opaque
microbenchmark of S32A_D565_Opaque() shows a 3x speedup after SSE optimization with various count on i7-3770.
BUG=
Committed: http://code.google.com/p/skia/source/detail?r=13495
Patch Set 1 #Patch Set 2 : handle 8 pixels at a time #
Total comments: 1
Patch Set 3 : add include guards #Patch Set 4 : refine code #
Total comments: 2
Patch Set 5 : fix nits #
Messages
Total messages: 15 (0 generated)
Could you help to review this?
On 2014/01/27 06:50:56, qiankun wrote: > Could you help to review this? ping.
I can't click on the try-bot-failures :( What cased them to turn red?
On 2014/02/07 13:27:24, reed1 wrote: > I can't click on the try-bot-failures :( What cased them to turn red? Ah, I see they've upgraded the hover error message from "-1" to "failure". How helpful. This is what happens when someone without permission to run the trybots tries to run the trybots.
1. This is code that is probably never taken by any of our known clients, so it will get precious little testing by Google. 2. If the skia bots (gm and bench) are not offended, then neither am I 3. I defer to mtk if the structure of the "opts" looks maintainable.
I found this code very readable. Thank you! Can you update the CL with some macro motivation for doing this? Is there a profile we're seeing with the existing S32A_D565_Opaque showing up hot? Is there an existing GM that exercises this code path? Is that 3x improvement from one of our existing benches, or something you wrote up? I'd like us to be able to reproduce this benefit in the future in case we try to change it. Adding platform-specific code can sometimes be a tricky balance of increased performance versus maintenance cost. I want to make sure we're confident that this is not only faster, but that it remains faster, and that it matters that it's faster. https://codereview.chromium.org/138163013/diff/30001/src/opts/SkColor_opts_SS... File src/opts/SkColor_opts_SSE2.h (right): https://codereview.chromium.org/138163013/diff/30001/src/opts/SkColor_opts_SS... src/opts/SkColor_opts_SSE2.h:9: Can you add include guards? #ifndef SkColor_opts_SSE2_DEFINED #define SkColor_opts_SSE2_DEFINED ... #endif//SkColor_opts_SSE2_DEFINED
On 2014/02/07 15:15:45, mtklein wrote: > I found this code very readable. Thank you! > > Can you update the CL with some macro motivation for doing this? Is there a > profile we're seeing with the existing S32A_D565_Opaque showing up hot? Is > there an existing GM that exercises this code path? > Thanks for review:) I did this work because I saw an ARM NEON version for various 565 format operations but no SSE version. Then I started trying to implement the SSE version. Honestly, I didn't profile the hot spot. Some existing GM and bench cover this code path. For GM, there are 100 cases which call S32A_D565_Opaque(). I will list covered benchmarks with performance data later. > Is that 3x improvement from one of our existing benches, or something you wrote > up? I'd like us to be able to reproduce this benefit in the future in case we > try to change it. > The 3x performance speedup is from my own code which purely runs S32A_D565_Opaque and measure the improvement. > Adding platform-specific code can sometimes be a tricky balance of increased > performance versus maintenance cost. I want to make sure we're confident that > this is not only faster, but that it remains faster, and that it matters that > it's faster. > > https://codereview.chromium.org/138163013/diff/30001/src/opts/SkColor_opts_SS... > File src/opts/SkColor_opts_SSE2.h (right): > > https://codereview.chromium.org/138163013/diff/30001/src/opts/SkColor_opts_SS... > src/opts/SkColor_opts_SSE2.h:9: > Can you add include guards? > > #ifndef SkColor_opts_SSE2_DEFINED > #define SkColor_opts_SSE2_DEFINED > ... > #endif//SkColor_opts_SSE2_DEFINED Done.
Here are the data for S32A_D565_Opaque before and after SSE optimization. Most benchmarks benefit from SSE but also some benchmarks slow down. Not very clear about the slow down. before after verts 4145.15 3478.78 16.08% repeatTile_4444_A 1472.30 558.78 62.05% repeatTile_4444_A 920.01 379.32 58.77% maskshader 5459.76 5436.38 0.43% perlinnoise 1234.56 1212.10 1.82% morph_1.50_dilate 2798.59 2932.71 -4.79% morph_1.50_erode 2852.74 2989.41 -4.79% morph_10_dilate 7864.32 7421.04 5.64% morph_10_erode 7815.56 7523.87 3.73% morph_2_dilate 3059.58 3190.25 -4.27% morph_2_erode 3115.25 3250.82 -4.35% merge_large 644.57 815.76 -26.56% merge_small 634.80 805.72 -26.92% matrixconvolution 25649.96 24866.87 3.05% magnifier_large 4551.15 4012.97 11.83% magnifier_small 4354.89 3995.91 8.24% lightingspotlitspecular_large 37140.07 37069.45 0.19% lightingspotlitspecular_small 37074.21 37056.69 0.05% lightingdistantlitspecular_large 32920.18 32039.25 2.68% lightingdistantlitspecular_small 32967.53 32038.66 2.82% lightingpointlitspecular_large 35728.09 34827.63 2.52% lightingpointlitspecular_small 35688.85 34815.83 2.45% lightingspotlitdiffuse_large 14382.98 13519.62 6.00% lightingspotlitdiffuse_small 14410.10 13500.87 6.31% lightingdistantlitdiffuse_large 8511.08 7621.06 10.46% lightingdistantlitdiffuse_small 8522.36 7615.85 10.64% lightingpointlitdiffuse_large 13127.32 12326.10 6.10% lightingpointlitdiffuse_small 13107.61 12387.82 5.49% gradient_create_alpha 4.37 4.37 0.00% gradient_conical_clamp_3color 34391.64 27128.89 21.12% gradient_conical_clamp_hicolor 34434.24 27060.91 21.41% gradient_conical_clamp 34407.68 27060.27 21.35% game_trans_full_atlas_drawVerts_aa 14.84 11.73 20.98% game_trans_full_atlas_aa 8.83 4.38 50.38% game_rot_full_aa 51.25 39.65 22.62% game_trans_aligned_full_aa 20.96 10.64 49.21% game_trans_full_aa 20.88 10.62 49.11% game_scale_full_aa 45.11 33.53 25.68% game_rot_partial_aa 63.83 51.41 19.45% game_trans_aligned_partial_aa 22.25 11.93 46.38% game_trans_partial_aa 22.80 12.49 45.23% game_scale_partial_aa 46.95 34.88 25.71% displacement_full_large 506.55 528.44 -4.32% displacement_alpha_large 451.18 480.47 -6.49% displacement_zero_large 476.39 504.76 -5.96% displacement_full_small 68.08 68.16 -0.12% displacement_alpha_small 65.31 64.08 1.88% displacement_zero_small 67.35 66.63 1.07% colorfilter_gray_large 3830.02 3778.47 1.35% colorfilter_blue_large 1506.63 1460.42 3.07% colorfilter_bright_large 3165.70 3125.79 1.26% colorfilter_bright_blue_large 3882.88 3833.86 1.26% colorfilter_blue_bright_large 3873.56 3832.56 1.06% colorfilter_gray_bright_large 3768.96 3767.09 0.05% colorfilter_bright_gray_large 6243.51 6139.57 1.66% colorfilter_dim_bright_large 30355.94 30260.34 0.31% colorfilter_gray_small 3816.05 3999.72 -4.81% colorfilter_blue_small 1261.13 1450.05 -14.98% colorfilter_bright_small 3144.01 3329.31 -5.89% colorfilter_bright_blue_small 3822.40 4031.87 -5.48% colorfilter_blue_bright_small 3818.07 4056.30 -6.24% colorfilter_gray_bright_small 3737.92 3922.84 -4.95% colorfilter_bright_gray_small 6364.39 6563.86 -3.13% colorfilter_dim_bright_small 30656.30 31817.58 -3.79% blur_image_filter_large_10.00_10.00 4671.72 4581.88 1.92% blur_image_filter_small_10.00_10.00 4433.36 4558.15 -2.81% blur_image_filter_large_1.00_1.00 4677.10 4627.84 1.05% blur_image_filter_small_1.00_1.00 4446.53 4591.37 -3.26% blur_image_filter_large_0.00_1.00 3149.41 3097.60 1.64% blur_image_filter_large_0.00_10.00 3129.05 3065.47 2.03% blur_image_filter_large_1.00_0.00 2847.48 2788.85 2.06% blur_image_filter_large_10.00_0.00 2782.12 2719.84 2.24% bitmap_4444_A_source_stripes_three 41.54 17.51 57.85% bitmap_4444_A_source_stripes_two 32.26 17.15 46.84% bitmap_4444_A_source_transparent 9.32 15.62 -67.57% bitmap_4444_A_source_opaque 60.04 18.60 69.02% bitmap_4444_A_scale_rotate_bicubic 2302.31 2293.13 0.40% bitmap_4444_A_scale_bicubic 47.18 25.72 45.47% bitmap_4444_A_scale_rotate_bilerp 97.36 80.56 17.26% bitmap_4444_A_scale_bilerp 76.60 54.03 29.46% bitmap_4444_A 37.68 17.18 54.41% bicubic_2.5x2.5 1540.65 1514.56 1.69% bicubic_10x2.5 6059.16 5965.20 1.55% bicubic_2.5x10 6103.93 6015.64 1.45% bicubic_10x10 24159.99 23799.29 1.49%
On 2014/02/08 09:02:01, qiankun wrote: > Here are the data for S32A_D565_Opaque before and after SSE optimization. Most > benchmarks benefit from SSE but also some benchmarks slow down. Not very clear > about the slow down. > before after > verts 4145.15 3478.78 16.08% > > > repeatTile_4444_A 1472.30 558.78 62.05% > repeatTile_4444_A 920.01 379.32 58.77% > maskshader 5459.76 5436.38 0.43% > perlinnoise 1234.56 1212.10 1.82% > morph_1.50_dilate 2798.59 2932.71 -4.79% > morph_1.50_erode 2852.74 2989.41 -4.79% > morph_10_dilate 7864.32 7421.04 5.64% > morph_10_erode 7815.56 7523.87 3.73% > morph_2_dilate 3059.58 3190.25 -4.27% > morph_2_erode 3115.25 3250.82 -4.35% > merge_large 644.57 815.76 -26.56% > merge_small 634.80 805.72 -26.92% > matrixconvolution 25649.96 24866.87 3.05% > magnifier_large 4551.15 4012.97 11.83% > magnifier_small 4354.89 3995.91 8.24% > lightingspotlitspecular_large 37140.07 37069.45 0.19% > lightingspotlitspecular_small 37074.21 37056.69 0.05% > lightingdistantlitspecular_large 32920.18 32039.25 2.68% > lightingdistantlitspecular_small 32967.53 32038.66 2.82% > lightingpointlitspecular_large 35728.09 34827.63 2.52% > lightingpointlitspecular_small 35688.85 34815.83 2.45% > lightingspotlitdiffuse_large 14382.98 13519.62 6.00% > lightingspotlitdiffuse_small 14410.10 13500.87 6.31% > lightingdistantlitdiffuse_large 8511.08 7621.06 10.46% > lightingdistantlitdiffuse_small 8522.36 7615.85 10.64% > lightingpointlitdiffuse_large 13127.32 12326.10 6.10% > lightingpointlitdiffuse_small 13107.61 12387.82 5.49% > gradient_create_alpha 4.37 4.37 0.00% > gradient_conical_clamp_3color 34391.64 27128.89 21.12% > gradient_conical_clamp_hicolor 34434.24 27060.91 21.41% > gradient_conical_clamp 34407.68 27060.27 21.35% > game_trans_full_atlas_drawVerts_aa 14.84 11.73 20.98% > game_trans_full_atlas_aa 8.83 4.38 50.38% > game_rot_full_aa 51.25 39.65 22.62% > game_trans_aligned_full_aa 20.96 10.64 49.21% > game_trans_full_aa 20.88 10.62 49.11% > game_scale_full_aa 45.11 33.53 25.68% > game_rot_partial_aa 63.83 51.41 19.45% > game_trans_aligned_partial_aa 22.25 11.93 46.38% > game_trans_partial_aa 22.80 12.49 45.23% > game_scale_partial_aa 46.95 34.88 25.71% > displacement_full_large 506.55 528.44 -4.32% > displacement_alpha_large 451.18 480.47 -6.49% > displacement_zero_large 476.39 504.76 -5.96% > displacement_full_small 68.08 68.16 -0.12% > displacement_alpha_small 65.31 64.08 1.88% > displacement_zero_small 67.35 66.63 1.07% > colorfilter_gray_large 3830.02 3778.47 1.35% > colorfilter_blue_large 1506.63 1460.42 3.07% > colorfilter_bright_large 3165.70 3125.79 1.26% > colorfilter_bright_blue_large 3882.88 3833.86 1.26% > colorfilter_blue_bright_large 3873.56 3832.56 1.06% > colorfilter_gray_bright_large 3768.96 3767.09 0.05% > > > colorfilter_bright_gray_large 6243.51 6139.57 1.66% > colorfilter_dim_bright_large 30355.94 30260.34 0.31% > colorfilter_gray_small 3816.05 3999.72 -4.81% > colorfilter_blue_small 1261.13 1450.05 -14.98% > colorfilter_bright_small 3144.01 3329.31 -5.89% > colorfilter_bright_blue_small 3822.40 4031.87 -5.48% > colorfilter_blue_bright_small 3818.07 4056.30 -6.24% > colorfilter_gray_bright_small 3737.92 3922.84 -4.95% > colorfilter_bright_gray_small 6364.39 6563.86 -3.13% > colorfilter_dim_bright_small 30656.30 31817.58 -3.79% > blur_image_filter_large_10.00_10.00 4671.72 4581.88 1.92% > blur_image_filter_small_10.00_10.00 4433.36 4558.15 -2.81% > blur_image_filter_large_1.00_1.00 4677.10 4627.84 1.05% > blur_image_filter_small_1.00_1.00 4446.53 4591.37 -3.26% > blur_image_filter_large_0.00_1.00 3149.41 3097.60 1.64% > blur_image_filter_large_0.00_10.00 3129.05 3065.47 2.03% > blur_image_filter_large_1.00_0.00 2847.48 2788.85 2.06% > blur_image_filter_large_10.00_0.00 2782.12 2719.84 2.24% > bitmap_4444_A_source_stripes_three 41.54 17.51 57.85% > bitmap_4444_A_source_stripes_two 32.26 17.15 46.84% > bitmap_4444_A_source_transparent 9.32 15.62 -67.57% > bitmap_4444_A_source_opaque 60.04 18.60 69.02% > bitmap_4444_A_scale_rotate_bicubic 2302.31 2293.13 0.40% > bitmap_4444_A_scale_bicubic 47.18 25.72 45.47% > bitmap_4444_A_scale_rotate_bilerp 97.36 80.56 17.26% > bitmap_4444_A_scale_bilerp 76.60 54.03 29.46% > bitmap_4444_A 37.68 17.18 54.41% > bicubic_2.5x2.5 1540.65 1514.56 1.69% > bicubic_10x2.5 6059.16 5965.20 1.55% > bicubic_2.5x10 6103.93 6015.64 1.45% > bicubic_10x10 24159.99 23799.29 1.49% So, wait, do any of these actually run S32A_D565? These numbers vary wildly, and I'm not sure I'm seeing at anything but noise. Does changing S32A_D565 really affect, say, gradient_conical_clamp_3color or bitmap_4444_A_source_transparent? It's hard to believe.
On 2014/02/13 20:04:53, mtklein wrote: > So, wait, do any of these actually run S32A_D565? These numbers vary wildly, > and I'm not sure I'm seeing at anything but noise. Does changing S32A_D565 > really affect, say, gradient_conical_clamp_3color or > bitmap_4444_A_source_transparent? It's hard to believe. Yes, all the benchmarks I listed actually run S32A_D565_Opaque. I optimized the code by skipping some zero value colors. Now almost all the benchmark wills benefit from this patch. Please see the data. before after verts 4155.55 3490.33 16.01% repeatTile_4444_A 1470.27 573.27 61.01% repeatTile_4444_A 913.72 329.12 63.98% maskshader 5467.79 5598.29 -2.39% perlinnoise 1329.07 1212.15 8.80% morph_1.50_dilate 2813.06 2655.17 5.61% morph_1.50_erode 2867.85 2714.29 5.35% morph_10_dilate 7823.79 8282.93 -5.87% morph_10_erode 7678.59 8854.89 -15.32% morph_2_dilate 3074.54 2917.20 5.12% morph_2_erode 3132.14 2976.99 4.95% merge_large 661.84 535.20 19.13% merge_small 651.77 524.69 19.50% matrixconvolution 25748.44 24858.70 3.46% magnifier_large 4537.08 3911.96 13.78% magnifier_small 4353.51 3823.04 12.18% lightingspotlitspecular_large 37147.44 36843.84 0.82% lightingspotlitspecular_small 37141.43 36824.22 0.85% lightingdistantlitspecular_large 32925.50 32007.45 2.79% lightingdistantlitspecular_small 32919.84 31995.53 2.81% lightingpointlitspecular_large 35697.66 34810.21 2.49% lightingpointlitspecular_small 35710.30 34804.24 2.54% lightingspotlitdiffuse_large 14502.46 13507.13 6.86% lightingspotlitdiffuse_small 14417.11 13500.88 6.36% lightingdistantlitdiffuse_large 8545.33 7635.77 10.64% lightingdistantlitdiffuse_small 8528.96 7632.00 10.52% lightingpointlitdiffuse_large 13123.68 12294.99 6.31% lightingpointlitdiffuse_small 13123.86 12289.67 6.36% gradient_create_alpha 4.38 4.36 0.46% gradient_conical_clamp_3color 34357.89 27297.29 20.55% gradient_conical_clamp_hicolor 34327.11 27291.99 20.49% gradient_conical_clamp 34308.99 27266.42 20.53% game_trans_full_atlas_drawVerts_aa 14.80 11.86 19.86% game_trans_full_atlas_aa 8.87 4.47 49.62% game_rot_full_aa 51.44 39.41 23.39% game_trans_aligned_full_aa 20.88 8.65 58.56% game_trans_full_aa 20.91 8.69 58.44% game_scale_full_aa 45.57 30.97 32.03% game_rot_partial_aa 63.76 51.25 19.63% game_trans_aligned_partial_aa 22.23 9.84 55.74% game_trans_partial_aa 22.82 10.39 54.46% game_scale_partial_aa 48.03 32.83 31.64% displacement_full_large 503.10 471.06 6.37% displacement_alpha_large 452.04 424.36 6.12% displacement_zero_large 476.88 443.35 7.03% displacement_full_small 68.23 62.04 9.08% displacement_alpha_small 65.39 60.96 6.77% displacement_zero_small 67.24 60.65 9.81% colorfilter_gray_large 3817.72 3550.62 7.00% colorfilter_blue_large 1516.62 1240.65 18.20% colorfilter_bright_large 3181.34 2899.33 8.86% colorfilter_bright_blue_large 3880.84 3600.19 7.23% colorfilter_blue_bright_large 3879.39 3601.28 7.17% colorfilter_gray_bright_large 3785.02 3509.16 7.29% colorfilter_bright_gray_large 6190.72 5901.19 4.68% colorfilter_dim_bright_large 30621.98 28989.46 5.33% colorfilter_gray_small 3823.52 3720.76 2.69% colorfilter_blue_small 1285.20 1164.51 9.39% colorfilter_bright_small 3123.89 3012.74 3.56% colorfilter_bright_blue_small 3827.49 3714.09 2.96% colorfilter_blue_bright_small 3821.74 3712.08 2.87% colorfilter_gray_bright_small 3748.20 3641.08 2.86% colorfilter_bright_gray_small 6375.14 6281.16 1.47% colorfilter_dim_bright_small 30873.09 30310.03 1.82% blur_image_filter_large_10.00_10.00 4687.52 4408.70 5.95% blur_image_filter_small_10.00_10.00 4431.49 4387.22 1.00% blur_image_filter_large_1.00_1.00 4676.81 4392.01 6.09% blur_image_filter_small_1.00_1.00 4443.95 4385.26 1.32% blur_image_filter_large_0.00_1.00 3165.63 2867.16 9.43% blur_image_filter_large_0.00_10.00 3145.95 2835.62 9.86% blur_image_filter_large_1.00_0.00 2841.56 2560.65 9.89% blur_image_filter_large_10.00_0.00 2785.27 2487.47 10.69% bitmap_4444_A_source_stripes_three 41.48 17.99 56.63% bitmap_4444_A_source_stripes_two 32.27 17.62 45.41% bitmap_4444_A_source_transparent 9.32 3.67 60.66% bitmap_4444_A_source_opaque 59.99 19.06 68.22% bitmap_4444_A_scale_rotate_bicubic 2302.50 2294.24 0.36% bitmap_4444_A_scale_bicubic 47.27 21.62 54.26% bitmap_4444_A_scale_rotate_bilerp 97.27 78.64 19.16% bitmap_4444_A_scale_bilerp 76.94 50.86 33.90% bitmap_4444_A 37.66 12.60 66.54% bicubic_2.5x2.5 1530.19 1507.68 1.47% bicubic_10x2.5 6002.23 5933.84 1.14% bicubic_2.5x10 6065.07 5992.08 1.20% bicubic_10x10 23924.40 23822.76 0.42%
Sorry this has taken so long to review. LGTM. I've marked two small nits; please just fix them up and reupload before submitting. https://codereview.chromium.org/138163013/diff/290001/src/opts/SkBlitRow_opts... File src/opts/SkBlitRow_opts_SSE2.cpp (right): https://codereview.chromium.org/138163013/diff/290001/src/opts/SkBlitRow_opts... src/opts/SkBlitRow_opts_SSE2.cpp:879: const __m128i *s = reinterpret_cast<const __m128i*>(src); Please put the * with the __m128i https://codereview.chromium.org/138163013/diff/290001/src/opts/SkBlitRow_opts... src/opts/SkBlitRow_opts_SSE2.cpp:880: __m128i *d = reinterpret_cast<__m128i*>(dst); Same.
The CQ bit was checked by qiankun.miao@intel.com
CQ is trying da patch. Follow status at https://skia-tree-status.appspot.com/cq/qiankun.miao@intel.com/138163013/370001
Message was sent while issue was closed.
Change committed as 13495
Message was sent while issue was closed.
Filed https://code.google.com/p/skia/issues/detail?id=2192 ('performance regression after "SSE2 implementation of S32A_D565_Opaque"') |