|
|
DescriptionDo loads and math in parallel in SkColorXform_opts
Note that baselines have changed a little since I
recently started using clang.
201295.jpg on HP z620 (300x280)
Skia Xform sRGB Dst Before 0.378 ms
Skia Xform sRGB Dst After 0.322 ms
1.17x
Skia Xform 2.2 Dst Before 0.428 ms
Skia Xform 2.2 Dst After 0.395 ms
1.08x
QCMS Xform 0.418 ms
sRGB Dst vs QCMS 1.30x
2.2 Dst vs QCMS 1.06x
--------------------------------------------
Nexus 6P:
Skia Xform sRGB Dst Before 1.58 ms
Skia Xform sRGB Dst After 1.43 ms
Skia Xform 2.2 Dst Before 2.69 ms
Skia Xform 2.2 Dst After 2.62 ms
Dell Venue 8:
Skia Xform sRGB Dst Before 2.78 ms
Skia Xform sRGB Dst After 2.74 ms
Skia Xform 2.2 Dst Before 3.73 ms
Skia Xform 2.2 Dst After 3.64 ms
BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2081933005
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
Committed: https://skia.googlesource.com/skia/+/9bba21530d7566494533788f4934848ea9318080
Patch Set 1 #
Total comments: 2
Patch Set 2 : Use lambda fns to share code #
Total comments: 2
Patch Set 3 : Fuse gamma_transform and store #Patch Set 4 : Fix bug when length < 4 #
Total comments: 2
Patch Set 5 : Don't close over rgba #Messages
Total messages: 34 (16 generated)
Description was changed from ========== Load and math in parallel in SkColorXform_opts BUG=skia: ========== to ========== Load and math in parallel in SkColorXform_opts BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2081933005 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ==========
Description was changed from ========== Load and math in parallel in SkColorXform_opts BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2081933005 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ========== to ========== Load and math in parallel in SkColorXform_opts Note that baselines have changed a little since I recently started using clang. 201295.jpg on HP z620 (300x280) Skia Xform sRGB Dst Before 0.378 ms Skia Xform sRGB Dst After 0.320 ms 1.18x Skia Xform 2.2 Dst Before 0.428 ms Skia Xform 2.2 Dst After 0.395 ms 1.08x QCMS Xform 0.418 ms sRGB Dst vs QCMS 1.31x 2.2 Dst vs QCMS 1.06x BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2081933005 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ==========
Description was changed from ========== Load and math in parallel in SkColorXform_opts Note that baselines have changed a little since I recently started using clang. 201295.jpg on HP z620 (300x280) Skia Xform sRGB Dst Before 0.378 ms Skia Xform sRGB Dst After 0.320 ms 1.18x Skia Xform 2.2 Dst Before 0.428 ms Skia Xform 2.2 Dst After 0.395 ms 1.08x QCMS Xform 0.418 ms sRGB Dst vs QCMS 1.31x 2.2 Dst vs QCMS 1.06x BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2081933005 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ========== to ========== Do loads and math in parallel in SkColorXform_opts Note that baselines have changed a little since I recently started using clang. 201295.jpg on HP z620 (300x280) Skia Xform sRGB Dst Before 0.378 ms Skia Xform sRGB Dst After 0.320 ms 1.18x Skia Xform 2.2 Dst Before 0.428 ms Skia Xform 2.2 Dst After 0.395 ms 1.08x QCMS Xform 0.418 ms sRGB Dst vs QCMS 1.31x 2.2 Dst vs QCMS 1.06x BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2081933005 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ==========
msarett@google.com changed reviewers: + mtklein@google.com
Three thoughts: - Generally suspicious of that these sorts of optimizations are very architecture and micro-architecture dependent. If we do this, let's make sure it's faster or no slower across a good handful of different x86 and ARM chips. - Generally suspicious that this is something the compiler ought to be able to do for us when it thinks it's a good idea. Can you see what happens if we tag the arguments SK_RESTRICT? Seems like this is not a legal transformation if src and dst alias each other, which I figure is why the compiler won't do it itself. - Generally suspicious that this is something the CPU ought to be able to do for us across loops, at least on x86. Can you see what IACA (https://software.intel.com/en-us/articles/intel-architecture-code-analyzer) has to say about the throughput bottleneck before and after this change? https://codereview.chromium.org/2081933005/diff/1/src/opts/SkColorXform_opts.h File src/opts/SkColorXform_opts.h (right): https://codereview.chromium.org/2081933005/diff/1/src/opts/SkColorXform_opts.... src/opts/SkColorXform_opts.h:197: reds = Sk4f{linear_from_curve[(src[0] >> 0) & 0xFF], Let's make this a lambda to share here and in the loop? auto load_next_4 = [&reds, &greens, &blues, &src, &len] { reds = Sk4f{ ... }; greens = Sk4f{ ... }; blues = Sk4f{ ... }; src += 4; len -= 4; }; if (len >= 4) { load_next_4(); } while (len >= 4) { ... load_next_4(); } Same deal with the other two steps even, so that we ultimately see the loop as if (len >= 4) { load_next_4(); while (len >= 4) { transform_4(); load_next_4(); store_4(); } transform_4(); store_4(); } (old loop equivalent would be this) while (len >= 4) { load_next_4(); transform_4(); store_4(); }
> - Generally suspicious that this is something the compiler ought to be able to > do for us when it thinks it's a good idea. Can you see what happens if we tag > the arguments SK_RESTRICT? Seems like this is not a legal transformation if src > and dst alias each other, which I figure is why the compiler won't do it itself. SK_RESTRICT doesn't have an effect on my desktop. So I guess the compiler isn't figuring out that it should do this (or maybe it is correct that it shouldn't). FWIW, I don't want to use SK_RESTRICT because I want to allow this transform to be in place. I guess the behavior becomes harder to predict if the client passes us aliased, but offset, src and dst. I claim that is the client's fault. Still following up on your other comments.
Oh, I was confused. Of course this is correct when src and dst are the same array to do it in-place.
Description was changed from ========== Do loads and math in parallel in SkColorXform_opts Note that baselines have changed a little since I recently started using clang. 201295.jpg on HP z620 (300x280) Skia Xform sRGB Dst Before 0.378 ms Skia Xform sRGB Dst After 0.320 ms 1.18x Skia Xform 2.2 Dst Before 0.428 ms Skia Xform 2.2 Dst After 0.395 ms 1.08x QCMS Xform 0.418 ms sRGB Dst vs QCMS 1.31x 2.2 Dst vs QCMS 1.06x BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2081933005 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ========== to ========== Do loads and math in parallel in SkColorXform_opts Note that baselines have changed a little since I recently started using clang. 201295.jpg on HP z620 (300x280) Skia Xform sRGB Dst Before 0.378 ms Skia Xform sRGB Dst After 0.320 ms 1.18x Skia Xform 2.2 Dst Before 0.428 ms Skia Xform 2.2 Dst After 0.395 ms 1.08x QCMS Xform 0.418 ms sRGB Dst vs QCMS 1.31x 2.2 Dst vs QCMS 1.06x -------------------------------------------- Nexus 6P: Skia Xform sRGB Dst Before 1.58 ms Skia Xform sRGB Dst After 1.43 ms Skia Xform 2.2 Dst Before 2.69 ms Skia Xform 2.2 Dst After 2.62 ms Dell Venue 8: Skia Xform sRGB Dst Before 2.78 ms Skia Xform sRGB Dst After 2.74 ms Skia Xform 2.2 Dst Before 3.73 ms Skia Xform 2.2 Dst After 3.64 ms BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2081933005 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ==========
Description was changed from ========== Do loads and math in parallel in SkColorXform_opts Note that baselines have changed a little since I recently started using clang. 201295.jpg on HP z620 (300x280) Skia Xform sRGB Dst Before 0.378 ms Skia Xform sRGB Dst After 0.320 ms 1.18x Skia Xform 2.2 Dst Before 0.428 ms Skia Xform 2.2 Dst After 0.395 ms 1.08x QCMS Xform 0.418 ms sRGB Dst vs QCMS 1.31x 2.2 Dst vs QCMS 1.06x -------------------------------------------- Nexus 6P: Skia Xform sRGB Dst Before 1.58 ms Skia Xform sRGB Dst After 1.43 ms Skia Xform 2.2 Dst Before 2.69 ms Skia Xform 2.2 Dst After 2.62 ms Dell Venue 8: Skia Xform sRGB Dst Before 2.78 ms Skia Xform sRGB Dst After 2.74 ms Skia Xform 2.2 Dst Before 3.73 ms Skia Xform 2.2 Dst After 3.64 ms BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2081933005 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ========== to ========== Do loads and math in parallel in SkColorXform_opts Note that baselines have changed a little since I recently started using clang. 201295.jpg on HP z620 (300x280) Skia Xform sRGB Dst Before 0.378 ms Skia Xform sRGB Dst After 0.322 ms 1.17x Skia Xform 2.2 Dst Before 0.428 ms Skia Xform 2.2 Dst After 0.386 ms 1.11x QCMS Xform 0.418 ms sRGB Dst vs QCMS 1.30x 2.2 Dst vs QCMS 1.08x -------------------------------------------- Nexus 6P: Skia Xform sRGB Dst Before 1.58 ms Skia Xform sRGB Dst After 1.43 ms Skia Xform 2.2 Dst Before 2.69 ms Skia Xform 2.2 Dst After 2.62 ms Dell Venue 8: Skia Xform sRGB Dst Before 2.78 ms Skia Xform sRGB Dst After 2.74 ms Skia Xform 2.2 Dst Before 3.73 ms Skia Xform 2.2 Dst After 3.64 ms BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2081933005 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ==========
> - Generally suspicious of that these sorts of optimizations are very > architecture and micro-architecture dependent. If we do this, let's make sure > it's faster or no slower across a good handful of different x86 and ARM chips. I’ve added performance results on Nexus 6P and Dell Venue 8. I saw small improvements, though not as much as I saw on my Desktop. My main takeaway is a little bit of concern is that we are slow on mobile/tablet in general... > - Generally suspicious that this is something the CPU ought to be able to do > for us across loops, at least on x86. Can you see what IACA > (https://software.intel.com/en-us/articles/intel-architecture-code-analyzer) has > to say about the throughput bottleneck before and after this change? Very cool tool, thanks for sharing. The “throughput test” seems to be unaffected. The critical path is straight through all the mul/rsqrt instructions (port 0), which doesn’t change based on the implementation. I saw an improvement (90 -> 83 cycles) on the latency test, which looks to be related to the chain of dependencies. In the old impl, we have to wait on reds, greens, and blues to be ready before beginning to hammer port 0. In the new impl, we hit the first mul of the critical path right away. Possibly, the reason the compiler can’t do this for us is that it is worried about strange aliasing of src and dst... But we can take advantage because we know that we don't care. https://codereview.chromium.org/2081933005/diff/1/src/opts/SkColorXform_opts.h File src/opts/SkColorXform_opts.h (right): https://codereview.chromium.org/2081933005/diff/1/src/opts/SkColorXform_opts.... src/opts/SkColorXform_opts.h:197: reds = Sk4f{linear_from_curve[(src[0] >> 0) & 0xFF], On 2016/06/22 14:46:54, mtklein wrote: > Let's make this a lambda to share here and in the loop? > > auto load_next_4 = [&reds, &greens, &blues, &src, &len] { > reds = Sk4f{ ... }; > greens = Sk4f{ ... }; > blues = Sk4f{ ... }; > src += 4; > len -= 4; > }; > > if (len >= 4) { > load_next_4(); > } > > while (len >= 4) { > ... > load_next_4(); > } > > Same deal with the other two steps even, so that we ultimately see the loop as > > if (len >= 4) { > load_next_4(); > while (len >= 4) { > transform_4(); > load_next_4(); > store_4(); > } > transform_4(); > store_4(); > } > > (old loop equivalent would be this) > while (len >= 4) { > load_next_4(); > transform_4(); > store_4(); > } Nice! I like how this looks.
https://codereview.chromium.org/2081933005/diff/10001/src/opts/SkColorXform_o... File src/opts/SkColorXform_opts.h (right): https://codereview.chromium.org/2081933005/diff/10001/src/opts/SkColorXform_o... src/opts/SkColorXform_opts.h:246: gamma_transform_4(); Why not fuse gamma_transform_4() into store_4()? They're always called one after the other, and the fused linear->stored-gamma would mirror load_next_4()'s stored-gamma->linear. I'll also admit part of my request here is that I find it hard to read gamut_transform_4() followed by gamma_transform_4(). I know they're different, but every time I read it they look like duplicated code.
https://codereview.chromium.org/2081933005/diff/10001/src/opts/SkColorXform_o... File src/opts/SkColorXform_opts.h (right): https://codereview.chromium.org/2081933005/diff/10001/src/opts/SkColorXform_o... src/opts/SkColorXform_opts.h:246: gamma_transform_4(); On 2016/06/22 19:33:34, mtklein wrote: > Why not fuse gamma_transform_4() into store_4()? They're always called one > after the other, and the fused linear->stored-gamma would mirror load_next_4()'s > stored-gamma->linear. > > I'll also admit part of my request here is that I find it hard to read > gamut_transform_4() followed by gamma_transform_4(). I know they're different, > but every time I read it they look like duplicated code. Yup, looks good this way.
lgtm
The CQ bit was checked by msarett@google.com to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/2081933005/30001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot on client.skia (JOB_FAILED, http://build.chromium.org/p/client.skia/builders/Test-Ubuntu-GCC-GCE-CPU-AVX2...)
Description was changed from ========== Do loads and math in parallel in SkColorXform_opts Note that baselines have changed a little since I recently started using clang. 201295.jpg on HP z620 (300x280) Skia Xform sRGB Dst Before 0.378 ms Skia Xform sRGB Dst After 0.322 ms 1.17x Skia Xform 2.2 Dst Before 0.428 ms Skia Xform 2.2 Dst After 0.386 ms 1.11x QCMS Xform 0.418 ms sRGB Dst vs QCMS 1.30x 2.2 Dst vs QCMS 1.08x -------------------------------------------- Nexus 6P: Skia Xform sRGB Dst Before 1.58 ms Skia Xform sRGB Dst After 1.43 ms Skia Xform 2.2 Dst Before 2.69 ms Skia Xform 2.2 Dst After 2.62 ms Dell Venue 8: Skia Xform sRGB Dst Before 2.78 ms Skia Xform sRGB Dst After 2.74 ms Skia Xform 2.2 Dst Before 3.73 ms Skia Xform 2.2 Dst After 3.64 ms BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2081933005 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ========== to ========== Do loads and math in parallel in SkColorXform_opts Note that baselines have changed a little since I recently started using clang. 201295.jpg on HP z620 (300x280) Skia Xform sRGB Dst Before 0.378 ms Skia Xform sRGB Dst After 0.322 ms 1.17x Skia Xform 2.2 Dst Before 0.428 ms Skia Xform 2.2 Dst After 0.395 ms 1.08x QCMS Xform 0.418 ms sRGB Dst vs QCMS 1.30x 2.2 Dst vs QCMS 1.06x -------------------------------------------- Nexus 6P: Skia Xform sRGB Dst Before 1.58 ms Skia Xform sRGB Dst After 1.43 ms Skia Xform 2.2 Dst Before 2.69 ms Skia Xform 2.2 Dst After 2.62 ms Dell Venue 8: Skia Xform sRGB Dst Before 2.78 ms Skia Xform sRGB Dst After 2.74 ms Skia Xform 2.2 Dst Before 3.73 ms Skia Xform 2.2 Dst After 3.64 ms BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2081933005 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ==========
Please take another look! I've refactored to fix a bug that occurs when length < 4. Annoyingly, performance changes every time I touch this code... maybe I want to find a bigger image to bench with. https://codereview.chromium.org/2081933005/diff/50001/src/opts/SkColorXform_o... File src/opts/SkColorXform_opts.h (right): https://codereview.chromium.org/2081933005/diff/50001/src/opts/SkColorXform_o... src/opts/SkColorXform_opts.h:224: return color_xform_RGB1_tail<linear_from_curve, linear_to_curve>(dst, src, len, matrix); I could also use a goto, but I'm guessing that's frowned upon...
Patchset #4 (id:50001) has been deleted
The CQ bit was checked by msarett@google.com to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/2081933005/70001
lgtm https://codereview.chromium.org/2081933005/diff/70001/src/opts/SkColorXform_o... File src/opts/SkColorXform_opts.h (right): https://codereview.chromium.org/2081933005/diff/70001/src/opts/SkColorXform_o... src/opts/SkColorXform_opts.h:220: Sk4i rgba; I think we can move rgba into the lambda. No need to close over it... it's just a temporary.
https://codereview.chromium.org/2081933005/diff/70001/src/opts/SkColorXform_o... File src/opts/SkColorXform_opts.h (right): https://codereview.chromium.org/2081933005/diff/70001/src/opts/SkColorXform_o... src/opts/SkColorXform_opts.h:220: Sk4i rgba; On 2016/06/22 21:29:29, mtklein wrote: > I think we can move rgba into the lambda. No need to close over it... it's just > a temporary. Done.
The CQ bit was checked by msarett@google.com to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/2081933005/90001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
The CQ bit was checked by msarett@google.com
The patchset sent to the CQ was uploaded after l-g-t-m from mtklein@google.com Link to the patchset: https://codereview.chromium.org/2081933005/#ps90001 (title: "Don't close over rgba")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/2081933005/90001
Message was sent while issue was closed.
Description was changed from ========== Do loads and math in parallel in SkColorXform_opts Note that baselines have changed a little since I recently started using clang. 201295.jpg on HP z620 (300x280) Skia Xform sRGB Dst Before 0.378 ms Skia Xform sRGB Dst After 0.322 ms 1.17x Skia Xform 2.2 Dst Before 0.428 ms Skia Xform 2.2 Dst After 0.395 ms 1.08x QCMS Xform 0.418 ms sRGB Dst vs QCMS 1.30x 2.2 Dst vs QCMS 1.06x -------------------------------------------- Nexus 6P: Skia Xform sRGB Dst Before 1.58 ms Skia Xform sRGB Dst After 1.43 ms Skia Xform 2.2 Dst Before 2.69 ms Skia Xform 2.2 Dst After 2.62 ms Dell Venue 8: Skia Xform sRGB Dst Before 2.78 ms Skia Xform sRGB Dst After 2.74 ms Skia Xform 2.2 Dst Before 3.73 ms Skia Xform 2.2 Dst After 3.64 ms BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2081933005 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ========== to ========== Do loads and math in parallel in SkColorXform_opts Note that baselines have changed a little since I recently started using clang. 201295.jpg on HP z620 (300x280) Skia Xform sRGB Dst Before 0.378 ms Skia Xform sRGB Dst After 0.322 ms 1.17x Skia Xform 2.2 Dst Before 0.428 ms Skia Xform 2.2 Dst After 0.395 ms 1.08x QCMS Xform 0.418 ms sRGB Dst vs QCMS 1.30x 2.2 Dst vs QCMS 1.06x -------------------------------------------- Nexus 6P: Skia Xform sRGB Dst Before 1.58 ms Skia Xform sRGB Dst After 1.43 ms Skia Xform 2.2 Dst Before 2.69 ms Skia Xform 2.2 Dst After 2.62 ms Dell Venue 8: Skia Xform sRGB Dst Before 2.78 ms Skia Xform sRGB Dst After 2.74 ms Skia Xform 2.2 Dst Before 3.73 ms Skia Xform 2.2 Dst After 3.64 ms BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search?issue=2081933005 CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Committed: https://skia.googlesource.com/skia/+/9bba21530d7566494533788f4934848ea9318080 ==========
Message was sent while issue was closed.
Committed patchset #5 (id:90001) as https://skia.googlesource.com/skia/+/9bba21530d7566494533788f4934848ea9318080 |