Issue 1502843002: better NEON div255

mtklein_C

Description was changed from ========== better NEON div255 We were doing (x+127)/255 = ((x+128) + ...

5 years ago (2015-12-06 01:35:05 UTC) #1

mtklein

Description was changed from ========== better NEON div255 We were doing (x+127)/255 = ((x+128) + ...

5 years ago (2015-12-06 01:37:28 UTC) #2

mtklein

Description was changed from ========== better NEON div255 We were doing (x+127)/255 = ((x+128) + ...

5 years ago (2015-12-06 01:39:45 UTC) #3

mtklein_C

The CQ bit was checked by mtklein@chromium.org to run a CQ dry run

5 years ago (2015-12-06 01:40:09 UTC) #4

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1502843002/40001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1502843002/40001

5 years ago (2015-12-06 01:40:14 UTC) #5

mtklein_C

Description was changed from ========== better NEON div255 We were doing (x+127)/255 = ((x+128) + ...

5 years ago (2015-12-06 01:40:46 UTC) #6

mtklein

Description was changed from ========== better NEON div255 We were doing (x+127)/255 = ((x+128) + ...

5 years ago (2015-12-06 01:48:06 UTC) #7

mtklein_C

Description was changed from ========== better NEON div255 We were doing (x+127)/255 = ((x+128) + ...

5 years ago (2015-12-06 01:50:59 UTC) #8

Description was changed from

==========
better NEON div255

We were doing (x+127)/255 = ((x+128) + ((x+128)>>8))>>8 in three instructions:
    1) x += 128
    2) shift x right 8 bits
    3) add x and x>>8 together, then shift right 8 more bits

Now do it as two instructions:
    1) shift (x+128) right 8 bits
    2) add x and x>>8 and 128 all together, then shift right 8 more bits

On ARM this will be a 5-10% speedup for SrcATop, DstATop, Xor, Multiply,
Difference, HardLight, Darken, and Lighten xfermodes.  When we have a mask (e.g.
text), *all* xfermodes except Plus will get a similar boost.

This should mean now that (a*b).div255() is the same speed as
a.approxMulDiv255(b) on both x86 and ARM, and of course it's perfect instead of
approximate.  So we should eliminate approxMulDiv255(), but I'll leave it to
another CL, as it'll need Blink rebaselines.

This CL should not change GMs or Blink.

BUG=skia:
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
==========

to

==========
better NEON div255

We were doing (x+127)/255 = ((x+128) + ((x+128)>>8))>>8 in three instructions:
    1) x += 128
    2) shift x right 8 bits
    3) add x and x>>8 together, then shift right 8 more bits

Now do it as two instructions:
    1) shift (x+128) right 8 bits
    2) add x and x>>8 and 128 all together, then shift right 8 more bits

On ARM this will be a 5-10% speedup for SrcATop, DstATop, Xor, Multiply,
Difference, HardLight, Darken, and Lighten xfermodes.  When we have a mask (e.g.
text), *all* xfermodes except Plus will get a similar boost.

This should mean now that (a*b).div255() is the same speed as
a.approxMulDiv255(b) on both x86 and ARM, and of course it's perfect instead of
approximate.  So we should eliminate approxMulDiv255(), but I'll leave it to
another CL, as it'll need Blink rebaselines.

This CL should not change GMs or Blink.

BUG=skia:
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;client.skia.android:Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Debug-Trybot,Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot
==========

mtklein_C

The CQ bit was checked by mtklein@chromium.org to run a CQ dry run

5 years ago (2015-12-06 01:51:09 UTC) #10

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1502843002/40001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1502843002/40001

5 years ago (2015-12-06 01:51:13 UTC) #11

mtklein_C

Description was changed from ========== better NEON div255 We were doing (x+127)/255 = ((x+128) + ...

5 years ago (2015-12-06 01:51:21 UTC) #12

Description was changed from

==========
better NEON div255

We were doing (x+127)/255 = ((x+128) + ((x+128)>>8))>>8 in three instructions:
    1) x += 128
    2) shift x right 8 bits
    3) add x and x>>8 together, then shift right 8 more bits

Now do it as two instructions:
    1) shift (x+128) right 8 bits
    2) add x and x>>8 and 128 all together, then shift right 8 more bits

On ARM this will be a 5-10% speedup for SrcATop, DstATop, Xor, Multiply,
Difference, HardLight, Darken, and Lighten xfermodes.  When we have a mask (e.g.
text), *all* xfermodes except Plus will get a similar boost.

This should mean now that (a*b).div255() is the same speed as
a.approxMulDiv255(b) on both x86 and ARM, and of course it's perfect instead of
approximate.  So we should eliminate approxMulDiv255(), but I'll leave it to
another CL, as it'll need Blink rebaselines.

This CL should not change GMs or Blink.

BUG=skia:
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;client.skia.android:Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Debug-Trybot,Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot
==========

to

==========
better NEON div255

We were doing (x+127)/255 = ((x+128) + ((x+128)>>8))>>8 in three instructions:
    1) x += 128
    2) shift x right 8 bits
    3) add x and x>>8 together, then shift right 8 more bits

Now do it as two instructions:
    1) shift (x+128) right 8 bits
    2) add x and x>>8 and 128 all together, then shift right 8 more bits

On ARM this will be a 5-10% speedup for SrcATop, DstATop, Xor, Multiply,
Difference, HardLight, Darken, and Lighten xfermodes.  When we have a mask (e.g.
text), *all* xfermodes except Plus will get a similar boost.

This should mean now that (a*b).div255() is the same speed as
a.approxMulDiv255(b) on both x86 and ARM, and of course it's perfect instead of
approximate.  So we should eliminate approxMulDiv255(), but I'll leave it to
another CL, as it'll need Blink rebaselines.

This CL should not change GMs or Blink.
https://gold.skia.org/search2?issue=1502843002&unt=true&query=source_type%3Dg...

BUG=skia:
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;client.skia.android:Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Debug-Trybot,Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot
==========

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

5 years ago (2015-12-06 03:51:20 UTC) #15

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot on client.skia.android (JOB_TIMED_OUT, no build URL)

5 years ago (2015-12-06 03:51:21 UTC) #16

mtklein

Description was changed from ========== better NEON div255 We were doing (x+127)/255 = ((x+128) + ...

5 years ago (2015-12-06 15:41:00 UTC) #17

Description was changed from

==========
better NEON div255

We were doing (x+127)/255 = ((x+128) + ((x+128)>>8))>>8 in three instructions:
    1) x += 128
    2) shift x right 8 bits
    3) add x and x>>8 together, then shift right 8 more bits

Now do it as two instructions:
    1) shift (x+128) right 8 bits
    2) add x and x>>8 and 128 all together, then shift right 8 more bits

On ARM this will be a 5-10% speedup for SrcATop, DstATop, Xor, Multiply,
Difference, HardLight, Darken, and Lighten xfermodes.  When we have a mask (e.g.
text), *all* xfermodes except Plus will get a similar boost.

This should mean now that (a*b).div255() is the same speed as
a.approxMulDiv255(b) on both x86 and ARM, and of course it's perfect instead of
approximate.  So we should eliminate approxMulDiv255(), but I'll leave it to
another CL, as it'll need Blink rebaselines.

This CL should not change GMs or Blink.
https://gold.skia.org/search2?issue=1502843002&unt=true&query=source_type%3Dg...

BUG=skia:
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;client.skia.android:Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Debug-Trybot,Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot
==========

to

==========
better NEON div255

We were doing (x+127)/255 = ((x+128) + ((x+128)>>8))>>8 in three instructions:
    1) x += 128
    2) shift x right 8 bits
    3) add x and x>>8 together, then shift right 8 more bits

Now do it as two instructions:
    1) shift (x+128) right 8 bits
    2) add x and (x+128)>>8 and 128 all together, then shift right 8 more bits

On ARM this will be a 5-10% speedup for SrcATop, DstATop, Xor, Multiply,
Difference, HardLight, Darken, and Lighten xfermodes.  When we have a mask (e.g.
text), *all* xfermodes except Plus will get a similar boost.

This should mean now that (a*b).div255() is the same speed as
a.approxMulDiv255(b) on both x86 and ARM, and of course it's perfect instead of
approximate.  So we should eliminate approxMulDiv255(), but I'll leave it to
another CL, as it'll need Blink rebaselines.

This CL should not change GMs or Blink.
https://gold.skia.org/search2?issue=1502843002&unt=true&query=source_type%3Dg...

BUG=skia:
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;client.skia.android:Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Debug-Trybot,Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot
==========

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1502843002/40001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1502843002/40001

5 years ago (2015-12-06 15:41:36 UTC) #19

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

5 years ago (2015-12-06 17:41:41 UTC) #20

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot on client.skia.android (JOB_TIMED_OUT, no build URL)

5 years ago (2015-12-06 17:41:41 UTC) #21

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1502843002/40001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1502843002/40001

5 years ago (2015-12-07 14:58:25 UTC) #25

commit-bot: I haz the power

Description was changed from ========== better NEON div255 We were doing (x+127)/255 = ((x+128) + ...

5 years ago (2015-12-07 16:21:14 UTC) #26

Message was sent while issue was closed.

Description was changed from

==========
better NEON div255

We were doing (x+127)/255 = ((x+128) + ((x+128)>>8))>>8 in three instructions:
    1) x += 128
    2) shift x right 8 bits
    3) add x and x>>8 together, then shift right 8 more bits

Now do it as two instructions:
    1) shift (x+128) right 8 bits
    2) add x and (x+128)>>8 and 128 all together, then shift right 8 more bits

On ARM this will be a 5-10% speedup for SrcATop, DstATop, Xor, Multiply,
Difference, HardLight, Darken, and Lighten xfermodes.  When we have a mask (e.g.
text), *all* xfermodes except Plus will get a similar boost.

This should mean now that (a*b).div255() is the same speed as
a.approxMulDiv255(b) on both x86 and ARM, and of course it's perfect instead of
approximate.  So we should eliminate approxMulDiv255(), but I'll leave it to
another CL, as it'll need Blink rebaselines.

This CL should not change GMs or Blink.
https://gold.skia.org/search2?issue=1502843002&unt=true&query=source_type%3Dg...

BUG=skia:
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;client.skia.android:Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Debug-Trybot,Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot
==========

to

==========
better NEON div255

We were doing (x+127)/255 = ((x+128) + ((x+128)>>8))>>8 in three instructions:
    1) x += 128
    2) shift x right 8 bits
    3) add x and x>>8 together, then shift right 8 more bits

Now do it as two instructions:
    1) shift (x+128) right 8 bits
    2) add x and (x+128)>>8 and 128 all together, then shift right 8 more bits

On ARM this will be a 5-10% speedup for SrcATop, DstATop, Xor, Multiply,
Difference, HardLight, Darken, and Lighten xfermodes.  When we have a mask (e.g.
text), *all* xfermodes except Plus will get a similar boost.

This should mean now that (a*b).div255() is the same speed as
a.approxMulDiv255(b) on both x86 and ARM, and of course it's perfect instead of
approximate.  So we should eliminate approxMulDiv255(), but I'll leave it to
another CL, as it'll need Blink rebaselines.

This CL should not change GMs or Blink.
https://gold.skia.org/search2?issue=1502843002&unt=true&query=source_type%3Dg...

BUG=skia:
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;client.skia.android:Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Debug-Trybot,Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot

Committed:
https://skia.googlesource.com/skia/+/9d344069c5d45a936823d3f84999898383a026cd
==========
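
For readers following the arithmetic in the description, here is a minimal scalar sketch of the three-step and two-step shapes, checked against a plain divide. It is not the code from the CL. The helper names are hypothetical, and each "rounding" helper only stands in for a single fused instruction; this thread does not say which NEON instructions the CL uses. 32-bit temporaries are used so the sketch does not depend on 16-bit overflow behaviour.

```cpp
#include <cassert>
#include <cstdint>

// Reference: rounded division by 255 written with a real divide.
inline uint32_t div255_reference(uint32_t x) { return (x + 127) / 255; }

// Old shape: three steps, one per instruction in the description.
inline uint32_t div255_three_steps(uint32_t x) {
    uint32_t t = x + 128;   // 1) x += 128
    uint32_t s = t >> 8;    // 2) shift right 8 bits
    return (t + s) >> 8;    // 3) add, then shift right 8 more bits
}

// Hypothetical stand-ins for single fused rounding instructions (assumption).
inline uint32_t rounding_shift_right_8(uint32_t x) { return (x + 128) >> 8; }
inline uint32_t rounding_add_shift_right_8(uint32_t a, uint32_t b) {
    return (a + b + 128) >> 8;
}

// New shape: two steps.
inline uint32_t div255_two_steps(uint32_t x) {
    uint32_t s = rounding_shift_right_8(x);    // 1) (x+128) >> 8
    return rounding_add_shift_right_8(x, s);   // 2) (x + s + 128) >> 8
}

// Counts disagreements with the reference over every product of two bytes.
inline int count_div255_mismatches() {
    int bad = 0;
    for (uint32_t x = 0; x <= 255u * 255u; ++x) {
        if (div255_three_steps(x) != div255_reference(x)) ++bad;
        if (div255_two_steps(x) != div255_reference(x)) ++bad;
    }
    return bad;
}

// Aborts if either shape ever differs from the reference.
inline void check_div255() { assert(count_div255_mismatches() == 0); }
```

Calling check_div255() from a test passes silently: both shapes agree with (x+127)/255 for every x in [0, 255*255], which is the range a product of two 8-bit values can take.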

commit-bot: I haz the power

5 years ago (2015-12-07 16:21:15 UTC) #27

Message was sent while issue was closed.

Committed patchset #3 (id:40001) as
https://skia.googlesource.com/skia/+/9d344069c5d45a936823d3f84999898383a026cd
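
To illustrate the "perfect instead of approximate" remark in the description, here is a hedged sketch comparing an exact rounded multiply-divide with one possible approximation. The approximate formula, (a*b + a) >> 8, is an assumed stand-in chosen for illustration only; this thread does not show how approxMulDiv255() is implemented, and all function names below are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Exact: round(a*b / 255) via the shift-and-add form from the description.
inline uint32_t mul_div255_exact(uint32_t a, uint32_t b) {
    uint32_t t = a * b + 128;
    return (t + (t >> 8)) >> 8;
}

// ASSUMED approximation, not taken from the CL: (a*b + a) / 256.
inline uint32_t mul_div255_approx(uint32_t a, uint32_t b) {
    return (a * b + a) >> 8;
}

// Largest absolute difference between the two over all byte pairs.
inline int max_abs_error() {
    int worst = 0;
    for (uint32_t a = 0; a <= 255; ++a) {
        for (uint32_t b = 0; b <= 255; ++b) {
            int d = std::abs(static_cast<int>(mul_div255_exact(a, b)) -
                             static_cast<int>(mul_div255_approx(a, b)));
            if (d > worst) worst = d;
        }
    }
    return worst;
}
```

Under this assumed formula the two agree whenever either input is 0 or 255, but can differ by one elsewhere. For example, a=1, b=128 gives 1 from the exact form and 0 from the approximation. A difference like that is the kind of thing that would force rebaselines if callers switched from the approximate form to the exact one.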

Issue 1502843002: better NEON div255 (Closed)
