|
|
Descriptionbetter NEON div255
We were doing (x+127)/255 = ((x+128) + (x+128)>>8)>>8 in three instructions:
1) x += 128
2) shift x right 8 bits
3) add x and x>>8 together, then shift right more 8 bits
Now do it as two instructions:
1) shift (x+128) right 8 bits
2) add x and (x+128)>>8 and 128 all together, then shift right 8 more bits
On ARM this will be a 5-10% speedup for SrcATop, DstATop, Xor, Multiply, Difference, HardLight, Darken, and Lighten xfermodes. When we have a mask (e.g. text), *all* xfermodes except Plus will get a similar boost.
This should mean now that (a*b).div255() is the same speed as a.approxMulDiv255(b) on both x86 and ARM, and of course it's perfect instead of approximate. So we should eliminate approxMulDiv255(), but I'll leave it to another CL, as it'll need Blink rebaselines.
This CL should not change GMs or Blink.
https://gold.skia.org/search2?issue=1502843002&unt=true&query=source_type%3Dgm&master=false
BUG=skia:
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;client.skia.android:Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Debug-Trybot,Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot
Committed: https://skia.googlesource.com/skia/+/9d344069c5d45a936823d3f84999898383a026cd
Patch Set 1 #Patch Set 2 : ) #Patch Set 3 : comment #Messages
Total messages: 27 (18 generated)
Description was changed from ========== better NEON div255 We were doing (x+127)/255 = ((x+128) + (x+128)>>8)>>8 in three instructions: 1) x += 128 2) shift x right 8 bits 3) add x and x>>8 together, then shift right more 8 bits Now do it as two instructions: 1) shift (x+128) right 8 bits 2) add x and x>>8 and 128 all together, then cshift right 8 more bits BUG=skia: ========== to ========== better NEON div255 We were doing (x+127)/255 = ((x+128) + (x+128)>>8)>>8 in three instructions: 1) x += 128 2) shift x right 8 bits 3) add x and x>>8 together, then shift right more 8 bits Now do it as two instructions: 1) shift (x+128) right 8 bits 2) add x and x>>8 and 128 all together, then cshift right 8 more bits BUG=skia: CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ==========
Description was changed from ========== better NEON div255 We were doing (x+127)/255 = ((x+128) + (x+128)>>8)>>8 in three instructions: 1) x += 128 2) shift x right 8 bits 3) add x and x>>8 together, then shift right more 8 bits Now do it as two instructions: 1) shift (x+128) right 8 bits 2) add x and x>>8 and 128 all together, then cshift right 8 more bits BUG=skia: CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ========== to ========== better NEON div255 We were doing (x+127)/255 = ((x+128) + (x+128)>>8)>>8 in three instructions: 1) x += 128 2) shift x right 8 bits 3) add x and x>>8 together, then shift right more 8 bits Now do it as two instructions: 1) shift (x+128) right 8 bits 2) add x and x>>8 and 128 all together, then shift right 8 more bits BUG=skia: CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ==========
Description was changed from ========== better NEON div255 We were doing (x+127)/255 = ((x+128) + (x+128)>>8)>>8 in three instructions: 1) x += 128 2) shift x right 8 bits 3) add x and x>>8 together, then shift right more 8 bits Now do it as two instructions: 1) shift (x+128) right 8 bits 2) add x and x>>8 and 128 all together, then shift right 8 more bits BUG=skia: CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ========== to ========== better NEON div255 We were doing (x+127)/255 = ((x+128) + (x+128)>>8)>>8 in three instructions: 1) x += 128 2) shift x right 8 bits 3) add x and x>>8 together, then shift right more 8 bits Now do it as two instructions: 1) shift (x+128) right 8 bits 2) add x and x>>8 and 128 all together, then shift right 8 more bits This should mean now that (a*b).div255() is the same speed as a.approxMulDiv255(b) on both x86 and ARM, and of course it's perfect instead of approximate. So we should eliminate approxMulDiv255(), but I'll leave it to another CL, as it'll need Blink rebaselines. This CL should not change GMs or Blink. BUG=skia: CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ==========
The CQ bit was checked by mtklein@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1502843002/40001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1502843002/40001
Description was changed from ========== better NEON div255 We were doing (x+127)/255 = ((x+128) + (x+128)>>8)>>8 in three instructions: 1) x += 128 2) shift x right 8 bits 3) add x and x>>8 together, then shift right more 8 bits Now do it as two instructions: 1) shift (x+128) right 8 bits 2) add x and x>>8 and 128 all together, then shift right 8 more bits This should mean now that (a*b).div255() is the same speed as a.approxMulDiv255(b) on both x86 and ARM, and of course it's perfect instead of approximate. So we should eliminate approxMulDiv255(), but I'll leave it to another CL, as it'll need Blink rebaselines. This CL should not change GMs or Blink. BUG=skia: CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ========== to ========== better NEON div255 We were doing (x+127)/255 = ((x+128) + (x+128)>>8)>>8 in three instructions: 1) x += 128 2) shift x right 8 bits 3) add that and x together, then shift right more 8 bits Now do it as two instructions: 1) shift (x+128) right 8 bits 2) add that and x and 128 all together, then shift right 8 more bits This should mean now that (a*b).div255() is the same speed as a.approxMulDiv255(b) on both x86 and ARM, and of course it's perfect instead of approximate. So we should eliminate approxMulDiv255(), but I'll leave it to another CL, as it'll need Blink rebaselines. This CL should not change GMs or Blink. BUG=skia: CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ==========
Description was changed from ========== better NEON div255 We were doing (x+127)/255 = ((x+128) + (x+128)>>8)>>8 in three instructions: 1) x += 128 2) shift x right 8 bits 3) add that and x together, then shift right more 8 bits Now do it as two instructions: 1) shift (x+128) right 8 bits 2) add that and x and 128 all together, then shift right 8 more bits This should mean now that (a*b).div255() is the same speed as a.approxMulDiv255(b) on both x86 and ARM, and of course it's perfect instead of approximate. So we should eliminate approxMulDiv255(), but I'll leave it to another CL, as it'll need Blink rebaselines. This CL should not change GMs or Blink. BUG=skia: CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ========== to ========== better NEON div255 We were doing (x+127)/255 = ((x+128) + (x+128)>>8)>>8 in three instructions: 1) x += 128 2) shift x right 8 bits 3) add x and x>>8 together, then shift right more 8 bits Now do it as two instructions: 1) shift (x+128) right 8 bits 2) add x and x>>8 and 128 all together, then shift right 8 more bits On ARM this will be a 5-10% speedup for SrcATop, DstATop, Xor, Multiply, Difference, HardLight, Darken, and Lighten xfermodes. When we have a mask (e.g. text), *all* xfermodes except Plus will get a similar boost. This should mean now that (a*b).div255() is the same speed as a.approxMulDiv255(b) on both x86 and ARM, and of course it's perfect instead of approximate. So we should eliminate approxMulDiv255(), but I'll leave it to another CL, as it'll need Blink rebaselines. This CL should not change GMs or Blink. BUG=skia: CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ==========
Description was changed from ========== better NEON div255 We were doing (x+127)/255 = ((x+128) + (x+128)>>8)>>8 in three instructions: 1) x += 128 2) shift x right 8 bits 3) add x and x>>8 together, then shift right more 8 bits Now do it as two instructions: 1) shift (x+128) right 8 bits 2) add x and x>>8 and 128 all together, then shift right 8 more bits On ARM this will be a 5-10% speedup for SrcATop, DstATop, Xor, Multiply, Difference, HardLight, Darken, and Lighten xfermodes. When we have a mask (e.g. text), *all* xfermodes except Plus will get a similar boost. This should mean now that (a*b).div255() is the same speed as a.approxMulDiv255(b) on both x86 and ARM, and of course it's perfect instead of approximate. So we should eliminate approxMulDiv255(), but I'll leave it to another CL, as it'll need Blink rebaselines. This CL should not change GMs or Blink. BUG=skia: CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ========== to ========== better NEON div255 We were doing (x+127)/255 = ((x+128) + (x+128)>>8)>>8 in three instructions: 1) x += 128 2) shift x right 8 bits 3) add x and x>>8 together, then shift right more 8 bits Now do it as two instructions: 1) shift (x+128) right 8 bits 2) add x and x>>8 and 128 all together, then shift right 8 more bits On ARM this will be a 5-10% speedup for SrcATop, DstATop, Xor, Multiply, Difference, HardLight, Darken, and Lighten xfermodes. When we have a mask (e.g. text), *all* xfermodes except Plus will get a similar boost. This should mean now that (a*b).div255() is the same speed as a.approxMulDiv255(b) on both x86 and ARM, and of course it's perfect instead of approximate. So we should eliminate approxMulDiv255(), but I'll leave it to another CL, as it'll need Blink rebaselines. This CL should not change GMs or Blink. BUG=skia: CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;client.skia.android:Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Debug-Trybot,Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot ==========
The CQ bit was unchecked by mtklein@chromium.org
The CQ bit was checked by mtklein@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1502843002/40001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1502843002/40001
Description was changed from ========== better NEON div255 We were doing (x+127)/255 = ((x+128) + (x+128)>>8)>>8 in three instructions: 1) x += 128 2) shift x right 8 bits 3) add x and x>>8 together, then shift right more 8 bits Now do it as two instructions: 1) shift (x+128) right 8 bits 2) add x and x>>8 and 128 all together, then shift right 8 more bits On ARM this will be a 5-10% speedup for SrcATop, DstATop, Xor, Multiply, Difference, HardLight, Darken, and Lighten xfermodes. When we have a mask (e.g. text), *all* xfermodes except Plus will get a similar boost. This should mean now that (a*b).div255() is the same speed as a.approxMulDiv255(b) on both x86 and ARM, and of course it's perfect instead of approximate. So we should eliminate approxMulDiv255(), but I'll leave it to another CL, as it'll need Blink rebaselines. This CL should not change GMs or Blink. BUG=skia: CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;client.skia.android:Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Debug-Trybot,Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot ========== to ========== better NEON div255 We were doing (x+127)/255 = ((x+128) + (x+128)>>8)>>8 in three instructions: 1) x += 128 2) shift x right 8 bits 3) add x and x>>8 together, then shift right more 8 bits Now do it as two instructions: 1) shift (x+128) right 8 bits 2) add x and x>>8 and 128 all together, then shift right 8 more bits On ARM this will be a 5-10% speedup for SrcATop, DstATop, Xor, Multiply, Difference, HardLight, Darken, and Lighten xfermodes. When we have a mask (e.g. text), *all* xfermodes except Plus will get a similar boost. This should mean now that (a*b).div255() is the same speed as a.approxMulDiv255(b) on both x86 and ARM, and of course it's perfect instead of approximate. So we should eliminate approxMulDiv255(), but I'll leave it to another CL, as it'll need Blink rebaselines. This CL should not change GMs or Blink. https://gold.skia.org/search2?issue=1502843002&unt=true&query=source_type%3Dg... BUG=skia: CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;client.skia.android:Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Debug-Trybot,Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot ==========
mtklein@chromium.org changed reviewers: + reed@google.com
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot on client.skia.android (JOB_TIMED_OUT, no build URL)
Description was changed from ========== better NEON div255 We were doing (x+127)/255 = ((x+128) + (x+128)>>8)>>8 in three instructions: 1) x += 128 2) shift x right 8 bits 3) add x and x>>8 together, then shift right more 8 bits Now do it as two instructions: 1) shift (x+128) right 8 bits 2) add x and x>>8 and 128 all together, then shift right 8 more bits On ARM this will be a 5-10% speedup for SrcATop, DstATop, Xor, Multiply, Difference, HardLight, Darken, and Lighten xfermodes. When we have a mask (e.g. text), *all* xfermodes except Plus will get a similar boost. This should mean now that (a*b).div255() is the same speed as a.approxMulDiv255(b) on both x86 and ARM, and of course it's perfect instead of approximate. So we should eliminate approxMulDiv255(), but I'll leave it to another CL, as it'll need Blink rebaselines. This CL should not change GMs or Blink. https://gold.skia.org/search2?issue=1502843002&unt=true&query=source_type%3Dg... BUG=skia: CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;client.skia.android:Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Debug-Trybot,Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot ========== to ========== better NEON div255 We were doing (x+127)/255 = ((x+128) + (x+128)>>8)>>8 in three instructions: 1) x += 128 2) shift x right 8 bits 3) add x and x>>8 together, then shift right more 8 bits Now do it as two instructions: 1) shift (x+128) right 8 bits 2) add x and (x+128)>>8 and 128 all together, then shift right 8 more bits On ARM this will be a 5-10% speedup for SrcATop, DstATop, Xor, Multiply, Difference, HardLight, Darken, and Lighten xfermodes. When we have a mask (e.g. text), *all* xfermodes except Plus will get a similar boost. This should mean now that (a*b).div255() is the same speed as a.approxMulDiv255(b) on both x86 and ARM, and of course it's perfect instead of approximate. So we should eliminate approxMulDiv255(), but I'll leave it to another CL, as it'll need Blink rebaselines. This CL should not change GMs or Blink. https://gold.skia.org/search2?issue=1502843002&unt=true&query=source_type%3Dg... BUG=skia: CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;client.skia.android:Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Debug-Trybot,Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot ==========
The CQ bit was checked by mtklein@google.com to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1502843002/40001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1502843002/40001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot on client.skia.android (JOB_TIMED_OUT, no build URL)
reed@google.com changed reviewers: + msarett@google.com
lgtm
The CQ bit was checked by mtklein@google.com
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1502843002/40001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1502843002/40001
Message was sent while issue was closed.
Description was changed from ========== better NEON div255 We were doing (x+127)/255 = ((x+128) + (x+128)>>8)>>8 in three instructions: 1) x += 128 2) shift x right 8 bits 3) add x and x>>8 together, then shift right more 8 bits Now do it as two instructions: 1) shift (x+128) right 8 bits 2) add x and (x+128)>>8 and 128 all together, then shift right 8 more bits On ARM this will be a 5-10% speedup for SrcATop, DstATop, Xor, Multiply, Difference, HardLight, Darken, and Lighten xfermodes. When we have a mask (e.g. text), *all* xfermodes except Plus will get a similar boost. This should mean now that (a*b).div255() is the same speed as a.approxMulDiv255(b) on both x86 and ARM, and of course it's perfect instead of approximate. So we should eliminate approxMulDiv255(), but I'll leave it to another CL, as it'll need Blink rebaselines. This CL should not change GMs or Blink. https://gold.skia.org/search2?issue=1502843002&unt=true&query=source_type%3Dg... BUG=skia: CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;client.skia.android:Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Debug-Trybot,Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot ========== to ========== better NEON div255 We were doing (x+127)/255 = ((x+128) + (x+128)>>8)>>8 in three instructions: 1) x += 128 2) shift x right 8 bits 3) add x and x>>8 together, then shift right more 8 bits Now do it as two instructions: 1) shift (x+128) right 8 bits 2) add x and (x+128)>>8 and 128 all together, then shift right 8 more bits On ARM this will be a 5-10% speedup for SrcATop, DstATop, Xor, Multiply, Difference, HardLight, Darken, and Lighten xfermodes. When we have a mask (e.g. text), *all* xfermodes except Plus will get a similar boost. This should mean now that (a*b).div255() is the same speed as a.approxMulDiv255(b) on both x86 and ARM, and of course it's perfect instead of approximate. So we should eliminate approxMulDiv255(), but I'll leave it to another CL, as it'll need Blink rebaselines. This CL should not change GMs or Blink. https://gold.skia.org/search2?issue=1502843002&unt=true&query=source_type%3Dg... BUG=skia: CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;client.skia.android:Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Debug-Trybot,Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot Committed: https://skia.googlesource.com/skia/+/9d344069c5d45a936823d3f84999898383a026cd ==========
Message was sent while issue was closed.
Committed patchset #3 (id:40001) as https://skia.googlesource.com/skia/+/9d344069c5d45a936823d3f84999898383a026cd |