Description
NEON f32 <-> f16 and f32 <-> u16
Adds f32 <-> f16 ARMv7 and ARMv8 NEON code.
Also adds NEON f32 <-> u16 code to make the comparison fair.
The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we use a tiny amount of inline assembly.
The ARMv7 half -> float is different enough from the SSE version that it does not make sense to use SkNx.
Still TODO:
ARMv7 float -> half. Naively translating the SSE version results in 0x0000 where we'd expect a denormal output.
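One way to pin down the denormal problem above is to compare any vector path against a scalar reference over all 65536 half bit patterns. The following is a minimal sketch of such a reference, not code from this CL: the names half_to_float_ref, float_to_half_ref and count_roundtrip_mismatches are hypothetical, and it assumes the default round-to-nearest-even floating-point mode.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Scalar reference conversions for IEEE 754 binary16 <-> binary32.
// Hypothetical helpers for testing only; not part of Skia.

static float half_to_float_ref(uint16_t h) {
    uint32_t sign = (h >> 15) & 1;
    uint32_t exp  = (h >> 10) & 0x1f;
    uint32_t man  = h & 0x3ff;
    float mag;
    if (exp == 0) {
        // Zero or denormal: value is man * 2^-24.
        mag = std::ldexp(static_cast<float>(man), -24);
    } else if (exp == 31) {
        mag = man ? std::numeric_limits<float>::quiet_NaN()
                  : std::numeric_limits<float>::infinity();
    } else {
        // Normal: (1024 + man) * 2^(exp - 15 - 10).
        mag = std::ldexp(static_cast<float>(man | 0x400), static_cast<int>(exp) - 25);
    }
    return sign ? -mag : mag;
}

static uint16_t float_to_half_ref(float f) {
    uint16_t sign = std::signbit(f) ? 0x8000 : 0;
    if (std::isnan(f)) { return sign | 0x7e00; }
    float a = std::fabs(f);
    if (a >= 65520.0f) { return sign | 0x7c00; }  // rounds up to infinity
    if (a < std::ldexp(1.0f, -14)) {
        // Denormal half: count units of 2^-24, rounding to nearest even.
        // A result of 1024 is the smallest normal, which is the right carry.
        return sign | static_cast<uint16_t>(std::nearbyint(std::ldexp(a, 24)));
    }
    int e;
    float m = std::frexp(a, &e);  // a = m * 2^e with m in [0.5, 1)
    // m * 2048 lies in [1024, 2048); subtracting 1024 drops the implicit bit.
    uint32_t man = static_cast<uint32_t>(std::nearbyint(std::ldexp(m, 11))) - 1024;
    uint32_t exp = static_cast<uint32_t>(e - 1 + 15);
    // If rounding carried man up to 1024, the add bumps the exponent field.
    return sign | static_cast<uint16_t>((exp << 10) + man);
}

// Round-trips every non-NaN half through float and back; returns the
// number of bit patterns that do not come back unchanged.
static int count_roundtrip_mismatches() {
    int bad = 0;
    for (uint32_t h = 0; h < 65536; h++) {
        bool is_nan = ((h >> 10) & 0x1f) == 31 && (h & 0x3ff) != 0;
        if (is_nan) { continue; }
        uint16_t bits = static_cast<uint16_t>(h);
        if (float_to_half_ref(half_to_float_ref(bits)) != bits) { bad++; }
    }
    return bad;
}
```

With this reference, std::ldexp(1.0f, -24) converts to 0x0001, the smallest half denormal. A float -> half path that produces 0x0000 for that input shows the symptom described in the TODO.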
BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1700473003
CQ_EXTRA_TRYBOTS=client.skia.android:Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot,Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Release-Trybot;client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
Committed: https://skia.googlesource.com/skia/+/be8c19e8d3deac9b9585c44b9a423912dd00a75a
Patch Set 1
Patch Set 2 : ARMv7 support too
Patch Set 3 : fixes
Patch Set 4 : q
Patch Set 5 : tweak
Patch Set 6 : f32 <-> u16
Patch Set 7 : back off from ARMv7
Patch Set 8 : armv8 asm
Total comments: 6
Messages
Total messages: 76 (52 generated)
Description was changed from ========== On ARMv8, we definitely have NEON f16 <-> f32 instructions. BUG=skia: ========== to ========== On ARMv8, we definitely have NEON f16 <-> f32 instructions. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ==========
Description was changed from ========== On ARMv8, we definitely have NEON f16 <-> f32 instructions. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ========== to ========== On ARMv8, we definitely have NEON f16 <-> f32 instructions. ... unfortunately GCC 4.9 doesn't seem to know that. We could work around that with inline assembly, but I don't feel like pandering to old compilers. Instead, check for Clang. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ==========
Description was changed from ========== On ARMv8, we definitely have NEON f16 <-> f32 instructions. ... unfortunately GCC 4.9 doesn't seem to know that. We could work around that with inline assembly, but I don't feel like pandering to old compilers. Instead, check for Clang. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ========== to ========== On ARMv8, we definitely have NEON f16 <-> f32 instructions. ... unfortunately GCC 4.9 doesn't seem to know that. We could work around that with inline assembly, but I don't feel like pandering to old compilers. Instead, check for Clang. This means that on ARMv8, half-float is a faster format to work with than uint16_t (timed on N5x): 1425.29 xferu64_bw_1_opaque_u16 7712.89 xferu64_bw_1_alpha_u16 10338.13 xferu64_aa_1_opaque_u16 13750.49 xferu64_aa_1_alpha_u16 1112.06 xferu64_bw_1_opaque_f16 6070.07 xferu64_bw_1_alpha_f16 8789.06 xferu64_aa_1_opaque_f16 11975.83 xferu64_aa_1_alpha_f16 BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ==========
Description was changed from ========== On ARMv8, we definitely have NEON f16 <-> f32 instructions. ... unfortunately GCC 4.9 doesn't seem to know that. We could work around that with inline assembly, but I don't feel like pandering to old compilers. Instead, check for Clang. This means that on ARMv8, half-float is a faster format to work with than uint16_t (timed on N5x): 1425.29 xferu64_bw_1_opaque_u16 7712.89 xferu64_bw_1_alpha_u16 10338.13 xferu64_aa_1_opaque_u16 13750.49 xferu64_aa_1_alpha_u16 1112.06 xferu64_bw_1_opaque_f16 6070.07 xferu64_bw_1_alpha_f16 8789.06 xferu64_aa_1_opaque_f16 11975.83 xferu64_aa_1_alpha_f16 BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ========== to ========== On ARMv8, we definitely have NEON f16 <-> f32 instructions. ... unfortunately GCC 4.9 doesn't seem to know that. We could work around that with inline assembly, but I don't feel like pandering to old compilers. Instead, check for Clang. This means that on ARMv8, half-float is a faster storage format than uint16_t (timed on N5x): 1425.29 xferu64_bw_1_opaque_u16 7712.89 xferu64_bw_1_alpha_u16 10338.13 xferu64_aa_1_opaque_u16 13750.49 xferu64_aa_1_alpha_u16 1112.06 xferu64_bw_1_opaque_f16 6070.07 xferu64_bw_1_alpha_f16 8789.06 xferu64_aa_1_opaque_f16 11975.83 xferu64_aa_1_alpha_f16 BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ==========
mtklein@chromium.org changed reviewers: + reed@google.com
The CQ bit was checked by mtklein@chromium.org
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1700473003/1 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1700473003/1
Note for Reviewers: The CQ is waiting for an approval. If you believe that the CL is not ready yet, or if you would like to L-G-T-M with comments then please uncheck the CQ checkbox. Waiting for LGTM from valid reviewer(s) till 2016-02-13 04:52 UTC
Description was changed from ========== On ARMv8, we definitely have NEON f16 <-> f32 instructions. ... unfortunately GCC 4.9 doesn't seem to know that. We could work around that with inline assembly, but I don't feel like pandering to old compilers. Instead, check for Clang. This means that on ARMv8, half-float is a faster storage format than uint16_t (timed on N5x): 1425.29 xferu64_bw_1_opaque_u16 7712.89 xferu64_bw_1_alpha_u16 10338.13 xferu64_aa_1_opaque_u16 13750.49 xferu64_aa_1_alpha_u16 1112.06 xferu64_bw_1_opaque_f16 6070.07 xferu64_bw_1_alpha_f16 8789.06 xferu64_aa_1_opaque_f16 11975.83 xferu64_aa_1_alpha_f16 BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ========== to ========== On ARMv8, we definitely have NEON f16 <-> f32 instructions. ... unfortunately GCC 4.9 doesn't seem to know that. We could work around that with inline assembly, but I don't feel like pandering to old compilers. Instead, check for Clang. This means that on ARMv8, half-float is a faster storage format than uint16_t (timed on N5x): 1425.29 xferu64_bw_1_opaque_u16 7712.89 xferu64_bw_1_alpha_u16 10338.13 xferu64_aa_1_opaque_u16 13750.49 xferu64_aa_1_alpha_u16 1112.06 xferu64_bw_1_opaque_f16 6070.07 xferu64_bw_1_alpha_f16 8789.06 xferu64_aa_1_opaque_f16 11975.83 xferu64_aa_1_alpha_f16 GCC, for comparison: 1108.07 xferu64_bw_1_opaque_u16 53033.72 xferu64_bw_1_alpha_u16 56324.06 xferu64_aa_1_opaque_u16 63194.09 xferu64_aa_1_alpha_u16 629.98 xferu64_bw_1_opaque_f16 95098.56 xferu64_bw_1_alpha_f16 109346.14 xferu64_aa_1_opaque_f16 106094.29 xferu64_aa_1_alpha_f16 BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ==========
The CQ bit was unchecked by commit-bot@chromium.org
No LGTM from a valid reviewer yet. Please ask for an LGTM from a full Skia committer
Description was changed from ========== On ARMv8, we definitely have NEON f16 <-> f32 instructions. ... unfortunately GCC 4.9 doesn't seem to know that. We could work around that with inline assembly, but I don't feel like pandering to old compilers. Instead, check for Clang. This means that on ARMv8, half-float is a faster storage format than uint16_t (timed on N5x): 1425.29 xferu64_bw_1_opaque_u16 7712.89 xferu64_bw_1_alpha_u16 10338.13 xferu64_aa_1_opaque_u16 13750.49 xferu64_aa_1_alpha_u16 1112.06 xferu64_bw_1_opaque_f16 6070.07 xferu64_bw_1_alpha_f16 8789.06 xferu64_aa_1_opaque_f16 11975.83 xferu64_aa_1_alpha_f16 GCC, for comparison: 1108.07 xferu64_bw_1_opaque_u16 53033.72 xferu64_bw_1_alpha_u16 56324.06 xferu64_aa_1_opaque_u16 63194.09 xferu64_aa_1_alpha_u16 629.98 xferu64_bw_1_opaque_f16 95098.56 xferu64_bw_1_alpha_f16 109346.14 xferu64_aa_1_opaque_f16 106094.29 xferu64_aa_1_alpha_f16 BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ========== to ========== On ARMv8, we definitely have NEON f16 <-> f32 instructions. ... unfortunately GCC 4.9 doesn't seem to know that. We could work around that with inline assembly, but I don't feel like pandering to old compilers. Instead, check for Clang, falling back on ARMv7-compatible NEON for GCC. This means that on ARMv8, half-float is a faster storage format than uint16_t (timed on N5x). Clang: 1425.29 xferu64_bw_1_opaque_u16 7712.89 xferu64_bw_1_alpha_u16 10338.13 xferu64_aa_1_opaque_u16 13750.49 xferu64_aa_1_alpha_u16 1112.06 xferu64_bw_1_opaque_f16 6070.07 xferu64_bw_1_alpha_f16 8789.06 xferu64_aa_1_opaque_f16 11975.83 xferu64_aa_1_alpha_f16 GCC: 597.99 ? 
xferu64_bw_1_opaque_u16 nonrendering 53036.52 xferu64_bw_1_alpha_u16 nonrendering 56328.17 xferu64_aa_1_opaque_u16 nonrendering 63196.74 xferu64_aa_1_alpha_u16 nonrendering 575.16 xferu64_bw_1_opaque_f16 nonrendering 8866.49 xferu64_bw_1_alpha_f16 nonrendering 11050.74 xferu64_aa_1_opaque_f16 nonrendering 14128.42 xferu64_aa_1_alpha_f16 nonrendering BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ==========
Description was changed from ========== On ARMv8, we definitely have NEON f16 <-> f32 instructions. ... unfortunately GCC 4.9 doesn't seem to know that. We could work around that with inline assembly, but I don't feel like pandering to old compilers. Instead, check for Clang, falling back on ARMv7-compatible NEON for GCC. This means that on ARMv8, half-float is a faster storage format than uint16_t (timed on N5x). Clang: 1425.29 xferu64_bw_1_opaque_u16 7712.89 xferu64_bw_1_alpha_u16 10338.13 xferu64_aa_1_opaque_u16 13750.49 xferu64_aa_1_alpha_u16 1112.06 xferu64_bw_1_opaque_f16 6070.07 xferu64_bw_1_alpha_f16 8789.06 xferu64_aa_1_opaque_f16 11975.83 xferu64_aa_1_alpha_f16 GCC: 597.99 ? xferu64_bw_1_opaque_u16 nonrendering 53036.52 xferu64_bw_1_alpha_u16 nonrendering 56328.17 xferu64_aa_1_opaque_u16 nonrendering 63196.74 xferu64_aa_1_alpha_u16 nonrendering 575.16 xferu64_bw_1_opaque_f16 nonrendering 8866.49 xferu64_bw_1_alpha_f16 nonrendering 11050.74 xferu64_aa_1_opaque_f16 nonrendering 14128.42 xferu64_aa_1_alpha_f16 nonrendering BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ========== to ========== On ARMv8, we definitely have NEON f16 <-> f32 instructions. ... unfortunately GCC 4.9 doesn't seem to know that. We could work around that with inline assembly, but I don't feel like pandering to old compilers. Instead, check for Clang, falling back on ARMv7-compatible NEON for GCC. This means that on ARMv8, half-float is a faster storage format than uint16_t (timed on N5x). 
Clang: 1425.29 xferu64_bw_1_opaque_u16 7712.89 xferu64_bw_1_alpha_u16 10338.13 xferu64_aa_1_opaque_u16 13750.49 xferu64_aa_1_alpha_u16 1112.06 xferu64_bw_1_opaque_f16 6070.07 xferu64_bw_1_alpha_f16 8789.06 xferu64_aa_1_opaque_f16 11975.83 xferu64_aa_1_alpha_f16 GCC: 597.99 xferu64_bw_1_opaque_u16 53036.52 xferu64_bw_1_alpha_u16 56328.17 xferu64_aa_1_opaque_u16 63196.74 xferu64_aa_1_alpha_u16 575.16 xferu64_bw_1_opaque_f16 8866.49 xferu64_bw_1_alpha_f16 11050.74 xferu64_aa_1_opaque_f16 14128.42 xferu64_aa_1_alpha_f16 BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ==========
Description was changed from ========== On ARMv8, we definitely have NEON f16 <-> f32 instructions. ... unfortunately GCC 4.9 doesn't seem to know that. We could work around that with inline assembly, but I don't feel like pandering to old compilers. Instead, check for Clang, falling back on ARMv7-compatible NEON for GCC. This means that on ARMv8, half-float is a faster storage format than uint16_t (timed on N5x). Clang: 1425.29 xferu64_bw_1_opaque_u16 7712.89 xferu64_bw_1_alpha_u16 10338.13 xferu64_aa_1_opaque_u16 13750.49 xferu64_aa_1_alpha_u16 1112.06 xferu64_bw_1_opaque_f16 6070.07 xferu64_bw_1_alpha_f16 8789.06 xferu64_aa_1_opaque_f16 11975.83 xferu64_aa_1_alpha_f16 GCC: 597.99 xferu64_bw_1_opaque_u16 53036.52 xferu64_bw_1_alpha_u16 56328.17 xferu64_aa_1_opaque_u16 63196.74 xferu64_aa_1_alpha_u16 575.16 xferu64_bw_1_opaque_f16 8866.49 xferu64_bw_1_alpha_f16 11050.74 xferu64_aa_1_opaque_f16 14128.42 xferu64_aa_1_alpha_f16 BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ========== to ========== On ARMv8, we definitely have NEON f16 <-> f32 instructions. ... unfortunately GCC 4.9 doesn't seem to know that. We could work around that with inline assembly, but I don't feel like pandering to old compilers. Instead, check for Clang, falling back to ARMv7-compatible NEON for ARMv8 GCC and, of course, ARMv7. This means that on ARMv8, half-float is a faster storage format than uint16_t (timed on N5x). 
Clang: 1425.29 xferu64_bw_1_opaque_u16 7712.89 xferu64_bw_1_alpha_u16 10338.13 xferu64_aa_1_opaque_u16 13750.49 xferu64_aa_1_alpha_u16 1112.06 xferu64_bw_1_opaque_f16 6070.07 xferu64_bw_1_alpha_f16 8789.06 xferu64_aa_1_opaque_f16 11975.83 xferu64_aa_1_alpha_f16 GCC: 597.99 xferu64_bw_1_opaque_u16 53036.52 xferu64_bw_1_alpha_u16 56328.17 xferu64_aa_1_opaque_u16 63196.74 xferu64_aa_1_alpha_u16 575.16 xferu64_bw_1_opaque_f16 8866.49 xferu64_bw_1_alpha_f16 11050.74 xferu64_aa_1_opaque_f16 14128.42 xferu64_aa_1_alpha_f16 BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ==========
Description was changed from ========== On ARMv8, we definitely have NEON f16 <-> f32 instructions. ... unfortunately GCC 4.9 doesn't seem to know that. We could work around that with inline assembly, but I don't feel like pandering to old compilers. Instead, check for Clang, falling back to ARMv7-compatible NEON for ARMv8 GCC and, of course, ARMv7. This means that on ARMv8, half-float is a faster storage format than uint16_t (timed on N5x). Clang: 1425.29 xferu64_bw_1_opaque_u16 7712.89 xferu64_bw_1_alpha_u16 10338.13 xferu64_aa_1_opaque_u16 13750.49 xferu64_aa_1_alpha_u16 1112.06 xferu64_bw_1_opaque_f16 6070.07 xferu64_bw_1_alpha_f16 8789.06 xferu64_aa_1_opaque_f16 11975.83 xferu64_aa_1_alpha_f16 GCC: 597.99 xferu64_bw_1_opaque_u16 53036.52 xferu64_bw_1_alpha_u16 56328.17 xferu64_aa_1_opaque_u16 63196.74 xferu64_aa_1_alpha_u16 575.16 xferu64_bw_1_opaque_f16 8866.49 xferu64_bw_1_alpha_f16 11050.74 xferu64_aa_1_opaque_f16 14128.42 xferu64_aa_1_alpha_f16 BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ========== to ========== On ARMv8, we definitely have NEON f16 <-> f32 instructions. ... unfortunately GCC 4.9 doesn't seem to know that. We could work around that with inline assembly, but I don't feel like pandering to old compilers. Instead, check for Clang, falling back to ARMv7-compatible NEON for ARMv8 GCC and, of course, ARMv7. This means that on ARMv8, half-float is a faster storage format than uint16_t. Though some of that is due to garbage code generation for uint16_t code on GCC. 
Clang: 1425.29 xferu64_bw_1_opaque_u16 7712.89 xferu64_bw_1_alpha_u16 10338.13 xferu64_aa_1_opaque_u16 13750.49 xferu64_aa_1_alpha_u16 1112.06 xferu64_bw_1_opaque_f16 6070.07 xferu64_bw_1_alpha_f16 8789.06 xferu64_aa_1_opaque_f16 11975.83 xferu64_aa_1_alpha_f16 GCC: 597.99 xferu64_bw_1_opaque_u16 53036.52 xferu64_bw_1_alpha_u16 56328.17 xferu64_aa_1_opaque_u16 63196.74 xferu64_aa_1_alpha_u16 575.16 xferu64_bw_1_opaque_f16 8866.49 xferu64_bw_1_alpha_f16 11050.74 xferu64_aa_1_opaque_f16 14128.42 xferu64_aa_1_alpha_f16 BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ==========
Description was changed from ========== On ARMv8, we definitely have NEON f16 <-> f32 instructions. ... unfortunately GCC 4.9 doesn't seem to know that. We could work around that with inline assembly, but I don't feel like pandering to old compilers. Instead, check for Clang, falling back to ARMv7-compatible NEON for ARMv8 GCC and, of course, ARMv7. This means that on ARMv8, half-float is a faster storage format than uint16_t. Though some of that is due to garbage code generation for uint16_t code on GCC. Clang: 1425.29 xferu64_bw_1_opaque_u16 7712.89 xferu64_bw_1_alpha_u16 10338.13 xferu64_aa_1_opaque_u16 13750.49 xferu64_aa_1_alpha_u16 1112.06 xferu64_bw_1_opaque_f16 6070.07 xferu64_bw_1_alpha_f16 8789.06 xferu64_aa_1_opaque_f16 11975.83 xferu64_aa_1_alpha_f16 GCC: 597.99 xferu64_bw_1_opaque_u16 53036.52 xferu64_bw_1_alpha_u16 56328.17 xferu64_aa_1_opaque_u16 63196.74 xferu64_aa_1_alpha_u16 575.16 xferu64_bw_1_opaque_f16 8866.49 xferu64_bw_1_alpha_f16 11050.74 xferu64_aa_1_opaque_f16 14128.42 xferu64_aa_1_alpha_f16 BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ========== to ========== On ARMv8, we definitely have NEON f16 <-> f32 instructions. ... unfortunately GCC 4.9 doesn't seem to know that. We could work around that with inline assembly, but I don't feel like pandering to old compilers. Instead, check for Clang, falling back to ARMv7-compatible NEON for ARMv8 GCC and, of course, ARMv7. This means that on ARMv8, half-float is a faster storage format than uint16_t. Though some of that is due to garbage code generation for the uint16_t case on GCC. 
Clang: 1425.29 xferu64_bw_1_opaque_u16 7712.89 xferu64_bw_1_alpha_u16 10338.13 xferu64_aa_1_opaque_u16 13750.49 xferu64_aa_1_alpha_u16 1112.06 xferu64_bw_1_opaque_f16 6070.07 xferu64_bw_1_alpha_f16 8789.06 xferu64_aa_1_opaque_f16 11975.83 xferu64_aa_1_alpha_f16 GCC: 597.99 xferu64_bw_1_opaque_u16 53036.52 xferu64_bw_1_alpha_u16 56328.17 xferu64_aa_1_opaque_u16 63196.74 xferu64_aa_1_alpha_u16 575.16 xferu64_bw_1_opaque_f16 8866.49 xferu64_bw_1_alpha_f16 11050.74 xferu64_aa_1_opaque_f16 14128.42 xferu64_aa_1_alpha_f16 BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ==========
Description was changed from ========== On ARMv8, we definitely have NEON f16 <-> f32 instructions. ... unfortunately GCC 4.9 doesn't seem to know that. We could work around that with inline assembly, but I don't feel like pandering to old compilers. Instead, check for Clang, falling back to ARMv7-compatible NEON for ARMv8 GCC and, of course, ARMv7. This means that on ARMv8, half-float is a faster storage format than uint16_t. Though some of that is due to garbage code generation for the uint16_t case on GCC. Clang: 1425.29 xferu64_bw_1_opaque_u16 7712.89 xferu64_bw_1_alpha_u16 10338.13 xferu64_aa_1_opaque_u16 13750.49 xferu64_aa_1_alpha_u16 1112.06 xferu64_bw_1_opaque_f16 6070.07 xferu64_bw_1_alpha_f16 8789.06 xferu64_aa_1_opaque_f16 11975.83 xferu64_aa_1_alpha_f16 GCC: 597.99 xferu64_bw_1_opaque_u16 53036.52 xferu64_bw_1_alpha_u16 56328.17 xferu64_aa_1_opaque_u16 63196.74 xferu64_aa_1_alpha_u16 575.16 xferu64_bw_1_opaque_f16 8866.49 xferu64_bw_1_alpha_f16 11050.74 xferu64_aa_1_opaque_f16 14128.42 xferu64_aa_1_alpha_f16 BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ========== to ========== On ARMv8, we definitely have NEON f16 <-> f32 instructions. ... unfortunately GCC 4.9 doesn't seem to know that. We could work around that with inline assembly, but I don't feel like pandering to old compilers. Instead, check for Clang, falling back to ARMv7-compatible NEON for ARMv8 GCC and, of course, ARMv7. This means that on ARMv8, half-float is a faster storage format than uint16_t. Though some of that is due to garbage code generation for the uint16_t case on GCC. 
Clang: 1113.04 xferu64_bw_1_opaque_u16 7707.76 xferu64_bw_1_alpha_u16 10333.98 xferu64_aa_1_opaque_u16 13723.14 xferu64_aa_1_alpha_u16 1112.06 xferu64_bw_1_opaque_f16 6059.57 xferu64_bw_1_alpha_f16 8778.08 xferu64_aa_1_opaque_f16 11973.88 xferu64_aa_1_alpha_f16 GCC: 597.99 xferu64_bw_1_opaque_u16 53036.52 xferu64_bw_1_alpha_u16 56328.17 xferu64_aa_1_opaque_u16 63196.74 xferu64_aa_1_alpha_u16 575.16 xferu64_bw_1_opaque_f16 8866.49 xferu64_bw_1_alpha_f16 11050.74 xferu64_aa_1_opaque_f16 14128.42 xferu64_aa_1_alpha_f16 BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ==========
Description was changed from ========== On ARMv8, we definitely have NEON f16 <-> f32 instructions. ... unfortunately GCC 4.9 doesn't seem to know that. We could work around that with inline assembly, but I don't feel like pandering to old compilers. Instead, check for Clang, falling back to ARMv7-compatible NEON for ARMv8 GCC and, of course, ARMv7. This means that on ARMv8, half-float is a faster storage format than uint16_t. Though some of that is due to garbage code generation for the uint16_t case on GCC. Clang: 1113.04 xferu64_bw_1_opaque_u16 7707.76 xferu64_bw_1_alpha_u16 10333.98 xferu64_aa_1_opaque_u16 13723.14 xferu64_aa_1_alpha_u16 1112.06 xferu64_bw_1_opaque_f16 6059.57 xferu64_bw_1_alpha_f16 8778.08 xferu64_aa_1_opaque_f16 11973.88 xferu64_aa_1_alpha_f16 GCC: 597.99 xferu64_bw_1_opaque_u16 53036.52 xferu64_bw_1_alpha_u16 56328.17 xferu64_aa_1_opaque_u16 63196.74 xferu64_aa_1_alpha_u16 575.16 xferu64_bw_1_opaque_f16 8866.49 xferu64_bw_1_alpha_f16 11050.74 xferu64_aa_1_opaque_f16 14128.42 xferu64_aa_1_alpha_f16 BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ========== to ========== On ARMv8, we definitely have NEON f16 <-> f32 instructions. ... unfortunately GCC 4.9 doesn't seem to know that. We could work around that with inline assembly, but I don't feel like pandering to old compilers. Instead, check for Clang, falling back to ARMv7-compatible NEON for ARMv8 GCC and, of course, ARMv7. This means that on ARMv8, half-float is a faster storage format than uint16_t. Though some of that is due to garbage code generation for the uint16_t case on GCC. 
N5x, Clang: 1113.04 xferu64_bw_1_opaque_u16 7707.76 xferu64_bw_1_alpha_u16 10333.98 xferu64_aa_1_opaque_u16 13723.14 xferu64_aa_1_alpha_u16 1112.06 xferu64_bw_1_opaque_f16 6059.57 xferu64_bw_1_alpha_f16 8778.08 xferu64_aa_1_opaque_f16 11973.88 xferu64_aa_1_alpha_f16 N5x, GCC: 597.99 xferu64_bw_1_opaque_u16 53036.52 xferu64_bw_1_alpha_u16 56328.17 xferu64_aa_1_opaque_u16 63196.74 xferu64_aa_1_alpha_u16 575.16 xferu64_bw_1_opaque_f16 8866.49 xferu64_bw_1_alpha_f16 11050.74 xferu64_aa_1_opaque_f16 14128.42 xferu64_aa_1_alpha_f16 BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ==========
Description was changed from ========== On ARMv8, we definitely have NEON f16 <-> f32 instructions. ... unfortunately GCC 4.9 doesn't seem to know that. We could work around that with inline assembly, but I don't feel like pandering to old compilers. Instead, check for Clang, falling back to ARMv7-compatible NEON for ARMv8 GCC and, of course, ARMv7. This means that on ARMv8, half-float is a faster storage format than uint16_t. Though some of that is due to garbage code generation for the uint16_t case on GCC. N5x, Clang: 1113.04 xferu64_bw_1_opaque_u16 7707.76 xferu64_bw_1_alpha_u16 10333.98 xferu64_aa_1_opaque_u16 13723.14 xferu64_aa_1_alpha_u16 1112.06 xferu64_bw_1_opaque_f16 6059.57 xferu64_bw_1_alpha_f16 8778.08 xferu64_aa_1_opaque_f16 11973.88 xferu64_aa_1_alpha_f16 N5x, GCC: 597.99 xferu64_bw_1_opaque_u16 53036.52 xferu64_bw_1_alpha_u16 56328.17 xferu64_aa_1_opaque_u16 63196.74 xferu64_aa_1_alpha_u16 575.16 xferu64_bw_1_opaque_f16 8866.49 xferu64_bw_1_alpha_f16 11050.74 xferu64_aa_1_opaque_f16 14128.42 xferu64_aa_1_alpha_f16 BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ========== to ========== On ARMv8, we definitely have NEON f16 <-> f32 instructions. ... unfortunately GCC 4.9 doesn't seem to know that. We could work around that with inline assembly, but I don't feel like pandering to old compilers. Instead, check for Clang, falling back to ARMv7-compatible NEON for ARMv8 GCC and, of course, ARMv7. This means that on ARMv8, half-float is a faster storage format than uint16_t. Though some of that is due to garbage code generation for the uint16_t case on GCC. 
N5x, Clang: 1113.04 xferu64_bw_1_opaque_u16 7707.76 xferu64_bw_1_alpha_u16 10333.98 xferu64_aa_1_opaque_u16 13723.14 xferu64_aa_1_alpha_u16 1112.06 xferu64_bw_1_opaque_f16 6059.57 xferu64_bw_1_alpha_f16 8778.08 xferu64_aa_1_opaque_f16 11973.88 xferu64_aa_1_alpha_f16 N5x, GCC: 597.99 xferu64_bw_1_opaque_u16 53036.52 xferu64_bw_1_alpha_u16 56328.17 xferu64_aa_1_opaque_u16 63196.74 xferu64_aa_1_alpha_u16 575.16 xferu64_bw_1_opaque_f16 8866.49 xferu64_bw_1_alpha_f16 11050.74 xferu64_aa_1_opaque_f16 14128.42 xferu64_aa_1_alpha_f16 N5, GCC 1028.12 xferu64_bw_1_opaque_u16 38204.10 xferu64_bw_1_alpha_u16 44265.87 xferu64_aa_1_opaque_u16 46950.93 xferu64_aa_1_alpha_u16 911.87 xferu64_bw_1_opaque_f16 11553.22 xferu64_bw_1_alpha_f16 15076.66 xferu64_aa_1_opaque_f16 20457.03 xferu64_aa_1_alpha_f16 BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ==========
Description was changed from ========== On ARMv8, we definitely have NEON f16 <-> f32 instructions. ... unfortunately GCC 4.9 doesn't seem to know that. We could work around that with inline assembly, but I don't feel like pandering to old compilers. Instead, check for Clang, falling back to ARMv7-compatible NEON for ARMv8 GCC and, of course, ARMv7. This means that on ARMv8, half-float is a faster storage format than uint16_t. Though some of that is due to garbage code generation for the uint16_t case on GCC. N5x, Clang: 1113.04 xferu64_bw_1_opaque_u16 7707.76 xferu64_bw_1_alpha_u16 10333.98 xferu64_aa_1_opaque_u16 13723.14 xferu64_aa_1_alpha_u16 1112.06 xferu64_bw_1_opaque_f16 6059.57 xferu64_bw_1_alpha_f16 8778.08 xferu64_aa_1_opaque_f16 11973.88 xferu64_aa_1_alpha_f16 N5x, GCC: 597.99 xferu64_bw_1_opaque_u16 53036.52 xferu64_bw_1_alpha_u16 56328.17 xferu64_aa_1_opaque_u16 63196.74 xferu64_aa_1_alpha_u16 575.16 xferu64_bw_1_opaque_f16 8866.49 xferu64_bw_1_alpha_f16 11050.74 xferu64_aa_1_opaque_f16 14128.42 xferu64_aa_1_alpha_f16 N5, GCC 1028.12 xferu64_bw_1_opaque_u16 38204.10 xferu64_bw_1_alpha_u16 44265.87 xferu64_aa_1_opaque_u16 46950.93 xferu64_aa_1_alpha_u16 911.87 xferu64_bw_1_opaque_f16 11553.22 xferu64_bw_1_alpha_f16 15076.66 xferu64_aa_1_opaque_f16 20457.03 xferu64_aa_1_alpha_f16 BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ========== to ========== On ARMv8, we definitely have NEON f16 <-> f32 instructions. ... unfortunately GCC 4.9 doesn't seem to know that. We could work around that with inline assembly, but I don't feel like pandering to old compilers. Instead, check for Clang, falling back to ARMv7-compatible NEON for ARMv8 GCC and, of course, ARMv7. This means that on ARMv8, half-float is a faster storage format than uint16_t. Though some of that is due to garbage code generation for the uint16_t case on GCC. 
N5x (ARMv8), Clang: 1113.04 xferu64_bw_1_opaque_u16 7707.76 xferu64_bw_1_alpha_u16 10333.98 xferu64_aa_1_opaque_u16 13723.14 xferu64_aa_1_alpha_u16 1112.06 xferu64_bw_1_opaque_f16 6059.57 xferu64_bw_1_alpha_f16 8778.08 xferu64_aa_1_opaque_f16 11973.88 xferu64_aa_1_alpha_f16 N5x (ARMv8), GCC: 597.99 xferu64_bw_1_opaque_u16 53036.52 xferu64_bw_1_alpha_u16 56328.17 xferu64_aa_1_opaque_u16 63196.74 xferu64_aa_1_alpha_u16 575.16 xferu64_bw_1_opaque_f16 8866.49 xferu64_bw_1_alpha_f16 11050.74 xferu64_aa_1_opaque_f16 14128.42 xferu64_aa_1_alpha_f16 N5 (ARMv7), Clang N5 (ARMv7), GCC 1028.12 xferu64_bw_1_opaque_u16 38204.10 xferu64_bw_1_alpha_u16 44265.87 xferu64_aa_1_opaque_u16 46950.93 xferu64_aa_1_alpha_u16 911.87 xferu64_bw_1_opaque_f16 11553.22 xferu64_bw_1_alpha_f16 15076.66 xferu64_aa_1_opaque_f16 20457.03 xferu64_aa_1_alpha_f16 BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ==========
Description was changed from ========== On ARMv8, we definitely have NEON f16 <-> f32 instructions. ... unfortunately GCC 4.9 doesn't seem to know that. We could work around that with inline assembly, but I don't feel like pandering to old compilers. Instead, check for Clang, falling back to ARMv7-compatible NEON for ARMv8 GCC and, of course, ARMv7. This means that on ARMv8, half-float is a faster storage format than uint16_t. Though some of that is due to garbage code generation for the uint16_t case on GCC. N5x (ARMv8), Clang: 1113.04 xferu64_bw_1_opaque_u16 7707.76 xferu64_bw_1_alpha_u16 10333.98 xferu64_aa_1_opaque_u16 13723.14 xferu64_aa_1_alpha_u16 1112.06 xferu64_bw_1_opaque_f16 6059.57 xferu64_bw_1_alpha_f16 8778.08 xferu64_aa_1_opaque_f16 11973.88 xferu64_aa_1_alpha_f16 N5x (ARMv8), GCC: 597.99 xferu64_bw_1_opaque_u16 53036.52 xferu64_bw_1_alpha_u16 56328.17 xferu64_aa_1_opaque_u16 63196.74 xferu64_aa_1_alpha_u16 575.16 xferu64_bw_1_opaque_f16 8866.49 xferu64_bw_1_alpha_f16 11050.74 xferu64_aa_1_opaque_f16 14128.42 xferu64_aa_1_alpha_f16 N5 (ARMv7), Clang N5 (ARMv7), GCC 1028.12 xferu64_bw_1_opaque_u16 38204.10 xferu64_bw_1_alpha_u16 44265.87 xferu64_aa_1_opaque_u16 46950.93 xferu64_aa_1_alpha_u16 911.87 xferu64_bw_1_opaque_f16 11553.22 xferu64_bw_1_alpha_f16 15076.66 xferu64_aa_1_opaque_f16 20457.03 xferu64_aa_1_alpha_f16 BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ========== to ========== On ARMv8, we definitely have NEON f16 <-> f32 instructions. ... unfortunately GCC 4.9 doesn't seem to know that. We could work around that with inline assembly, but I don't feel like pandering to old compilers. Instead, check for Clang, falling back to ARMv7-compatible NEON for ARMv8 GCC and, of course, ARMv7. This means that on ARMv8, half-float is a faster storage format than uint16_t. Though some of that is due to garbage code generation for the uint16_t case on GCC. 
N5x (ARMv8), Clang: 1113.04 xferu64_bw_1_opaque_u16 7707.76 xferu64_bw_1_alpha_u16 10333.98 xferu64_aa_1_opaque_u16 13723.14 xferu64_aa_1_alpha_u16 1112.06 xferu64_bw_1_opaque_f16 6059.57 xferu64_bw_1_alpha_f16 8778.08 xferu64_aa_1_opaque_f16 11973.88 xferu64_aa_1_alpha_f16 N5x (ARMv8), GCC: 597.99 xferu64_bw_1_opaque_u16 53036.52 xferu64_bw_1_alpha_u16 56328.17 xferu64_aa_1_opaque_u16 63196.74 xferu64_aa_1_alpha_u16 575.16 xferu64_bw_1_opaque_f16 8866.49 xferu64_bw_1_alpha_f16 11050.74 xferu64_aa_1_opaque_f16 14128.42 xferu64_aa_1_alpha_f16 N5 (ARMv7), Clang 470.13 xferu64_bw_1_opaque_u16 17775.88 xferu64_bw_1_alpha_u16 20440.19 xferu64_aa_1_opaque_u16 25235.11 xferu64_aa_1_alpha_u16 464.99 xferu64_bw_1_opaque_f16 10631.84 xferu64_bw_1_alpha_f16 13293.95 xferu64_aa_1_opaque_f16 18150.39 xferu64_aa_1_alpha_f16 N5 (ARMv7), GCC 1028.12 xferu64_bw_1_opaque_u16 38204.10 xferu64_bw_1_alpha_u16 44265.87 xferu64_aa_1_opaque_u16 46950.93 xferu64_aa_1_alpha_u16 911.87 xferu64_bw_1_opaque_f16 11553.22 xferu64_bw_1_alpha_f16 15076.66 xferu64_aa_1_opaque_f16 20457.03 xferu64_aa_1_alpha_f16 BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ==========
Description was changed from ========== On ARMv8, we definitely have NEON f16 <-> f32 instructions. ... unfortunately GCC 4.9 doesn't seem to know that. We could work around that with inline assembly, but I don't feel like pandering to old compilers. Instead, check for Clang, falling back to ARMv7-compatible NEON for ARMv8 GCC and, of course, ARMv7. This means that on ARMv8, half-float is a faster storage format than uint16_t. Though some of that is due to garbage code generation for the uint16_t case on GCC. N5x (ARMv8), Clang: 1113.04 xferu64_bw_1_opaque_u16 7707.76 xferu64_bw_1_alpha_u16 10333.98 xferu64_aa_1_opaque_u16 13723.14 xferu64_aa_1_alpha_u16 1112.06 xferu64_bw_1_opaque_f16 6059.57 xferu64_bw_1_alpha_f16 8778.08 xferu64_aa_1_opaque_f16 11973.88 xferu64_aa_1_alpha_f16 N5x (ARMv8), GCC: 597.99 xferu64_bw_1_opaque_u16 53036.52 xferu64_bw_1_alpha_u16 56328.17 xferu64_aa_1_opaque_u16 63196.74 xferu64_aa_1_alpha_u16 575.16 xferu64_bw_1_opaque_f16 8866.49 xferu64_bw_1_alpha_f16 11050.74 xferu64_aa_1_opaque_f16 14128.42 xferu64_aa_1_alpha_f16 N5 (ARMv7), Clang 470.13 xferu64_bw_1_opaque_u16 17775.88 xferu64_bw_1_alpha_u16 20440.19 xferu64_aa_1_opaque_u16 25235.11 xferu64_aa_1_alpha_u16 464.99 xferu64_bw_1_opaque_f16 10631.84 xferu64_bw_1_alpha_f16 13293.95 xferu64_aa_1_opaque_f16 18150.39 xferu64_aa_1_alpha_f16 N5 (ARMv7), GCC 1028.12 xferu64_bw_1_opaque_u16 38204.10 xferu64_bw_1_alpha_u16 44265.87 xferu64_aa_1_opaque_u16 46950.93 xferu64_aa_1_alpha_u16 911.87 xferu64_bw_1_opaque_f16 11553.22 xferu64_bw_1_alpha_f16 15076.66 xferu64_aa_1_opaque_f16 20457.03 xferu64_aa_1_alpha_f16 BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ========== to ========== NEON f32 <-> f16 and f32 <-> u16 Adds f32 <-> f16 ARMv7 and ARMv8 NEON code. To make it a fair comparison, also adds NEON f32 <-> u16 code, which was just TODO. 
The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we fall back on my ARMv7 version there. The ARMv7 version is different enough from the SSE version that it does not make sense to use SkNx. In all cases, f16 is at least competitive with u16. Nexus 5 (ARMv7), GCC: 10218.75 xferu64_bw_1_alpha_u16 nonrendering 12868.90 xferu64_aa_1_opaque_u16 nonrendering 19093.02 xferu64_aa_1_alpha_u16 nonrendering 11520.75 xferu64_bw_1_alpha_f16 nonrendering 15064.45 xferu64_aa_1_opaque_f16 nonrendering 20384.28 xferu64_aa_1_alpha_f16 nonrendering Nexus 5 (ARMv7), Clang: Nexus 5x (ARMv8), GCC: Nexus 5x (ARMv8), Clang: BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ==========
Description was changed from ========== NEON f32 <-> f16 and f32 <-> u16 Adds f32 <-> f16 ARMv7 and ARMv8 NEON code. To make it a fair comparison, also adds NEON f32 <-> u16 code, which was just TODO. The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we fall back on my ARMv7 version there. The ARMv7 version is different enough from the SSE version that it does not make sense to use SkNx. In all cases, f16 is at least competitive with u16. Nexus 5 (ARMv7), GCC: 10218.75 xferu64_bw_1_alpha_u16 nonrendering 12868.90 xferu64_aa_1_opaque_u16 nonrendering 19093.02 xferu64_aa_1_alpha_u16 nonrendering 11520.75 xferu64_bw_1_alpha_f16 nonrendering 15064.45 xferu64_aa_1_opaque_f16 nonrendering 20384.28 xferu64_aa_1_alpha_f16 nonrendering Nexus 5 (ARMv7), Clang: Nexus 5x (ARMv8), GCC: Nexus 5x (ARMv8), Clang: BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ========== to ========== NEON f32 <-> f16 and f32 <-> u16 Adds f32 <-> f16 ARMv7 and ARMv8 NEON code. To make it a fair comparison, also adds NEON f32 <-> u16 code, which was a TODO. The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we fall back on my ARMv7 version there. The ARMv7 version is different enough from the SSE version that it does not make sense to use SkNx. In all cases, f16 is at least competitive with u16. Nexus 5 (ARMv7), GCC: 10218.75 xferu64_bw_1_alpha_u16 nonrendering 12868.90 xferu64_aa_1_opaque_u16 nonrendering 19093.02 xferu64_aa_1_alpha_u16 nonrendering 11520.75 xferu64_bw_1_alpha_f16 nonrendering 15064.45 xferu64_aa_1_opaque_f16 nonrendering 20384.28 xferu64_aa_1_alpha_f16 nonrendering Nexus 5 (ARMv7), Clang: Nexus 5x (ARMv8), GCC: Nexus 5x (ARMv8), Clang: BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ==========
Description was changed from ========== NEON f32 <-> f16 and f32 <-> u16 Adds f32 <-> f16 ARMv7 and ARMv8 NEON code. To make it a fair comparison, also adds NEON f32 <-> u16 code, which was a TODO. The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we fall back on my ARMv7 version there. The ARMv7 version is different enough from the SSE version that it does not make sense to use SkNx. In all cases, f16 is at least competitive with u16. Nexus 5 (ARMv7), GCC: 10218.75 xferu64_bw_1_alpha_u16 nonrendering 12868.90 xferu64_aa_1_opaque_u16 nonrendering 19093.02 xferu64_aa_1_alpha_u16 nonrendering 11520.75 xferu64_bw_1_alpha_f16 nonrendering 15064.45 xferu64_aa_1_opaque_f16 nonrendering 20384.28 xferu64_aa_1_alpha_f16 nonrendering Nexus 5 (ARMv7), Clang: Nexus 5x (ARMv8), GCC: Nexus 5x (ARMv8), Clang: BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ========== to ========== NEON f32 <-> f16 and f32 <-> u16 Adds f32 <-> f16 ARMv7 and ARMv8 NEON code. To make it a fair comparison, also adds NEON f32 <-> u16 code, which was a TODO. The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we fall back on my ARMv7 version there. The ARMv7 version is different enough from the SSE version that it does not make sense to use SkNx. In all cases, f16 is at least competitive with u16. Nexus 5 (ARMv7), GCC: 10218.75 xferu64_bw_1_alpha_u16 nonrendering 12868.90 xferu64_aa_1_opaque_u16 nonrendering 19093.02 xferu64_aa_1_alpha_u16 nonrendering 11520.75 xferu64_bw_1_alpha_f16 nonrendering 15064.45 xferu64_aa_1_opaque_f16 nonrendering 20384.28 xferu64_aa_1_alpha_f16 nonrendering Nexus 5 (ARMv7), Clang: Nexus 5x (ARMv8), GCC: Nexus 5x (ARMv8), Clang: BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ==========
Description was changed from ========== NEON f32 <-> f16 and f32 <-> u16 Adds f32 <-> f16 ARMv7 and ARMv8 NEON code. To make it a fair comparison, also adds NEON f32 <-> u16 code, which was a TODO. The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we fall back on my ARMv7 version there. The ARMv7 version is different enough from the SSE version that it does not make sense to use SkNx. In all cases, f16 is at least competitive with u16. Nexus 5 (ARMv7), GCC: 10218.75 xferu64_bw_1_alpha_u16 nonrendering 12868.90 xferu64_aa_1_opaque_u16 nonrendering 19093.02 xferu64_aa_1_alpha_u16 nonrendering 11520.75 xferu64_bw_1_alpha_f16 nonrendering 15064.45 xferu64_aa_1_opaque_f16 nonrendering 20384.28 xferu64_aa_1_alpha_f16 nonrendering Nexus 5 (ARMv7), Clang: Nexus 5x (ARMv8), GCC: Nexus 5x (ARMv8), Clang: BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ========== to ========== NEON f32 <-> f16 and f32 <-> u16 Adds f32 <-> f16 ARMv7 and ARMv8 NEON code. To make it a fair comparison, also adds NEON f32 <-> u16 code, which was a TODO. The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we fall back on my ARMv7 version there. The ARMv7 version is different enough from the SSE version that it does not make sense to use SkNx. f16 is at least competitive with u16, and faster with proper ARMv8. Nexus 5 (ARMv7), GCC: 10218.75 xferu64_bw_1_alpha_u16 nonrendering 12868.90 xferu64_aa_1_opaque_u16 nonrendering 19093.02 xferu64_aa_1_alpha_u16 nonrendering 11520.75 xferu64_bw_1_alpha_f16 nonrendering 15064.45 xferu64_aa_1_opaque_f16 nonrendering 20384.28 xferu64_aa_1_alpha_f16 nonrendering Nexus 5 (ARMv7), Clang: Nexus 5x (ARMv8), GCC: Nexus 5x (ARMv8), Clang: BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... 
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ==========
Description was changed from ========== NEON f32 <-> f16 and f32 <-> u16 Adds f32 <-> f16 ARMv7 and ARMv8 NEON code. To make it a fair comparison, also adds NEON f32 <-> u16 code, which was a TODO. The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we fall back on my ARMv7 version there. The ARMv7 version is different enough from the SSE version that it does not make sense to use SkNx. f16 is at least competitive with u16, and faster with proper ARMv8. Nexus 5 (ARMv7), GCC: 10218.75 xferu64_bw_1_alpha_u16 nonrendering 12868.90 xferu64_aa_1_opaque_u16 nonrendering 19093.02 xferu64_aa_1_alpha_u16 nonrendering 11520.75 xferu64_bw_1_alpha_f16 nonrendering 15064.45 xferu64_aa_1_opaque_f16 nonrendering 20384.28 xferu64_aa_1_alpha_f16 nonrendering Nexus 5 (ARMv7), Clang: Nexus 5x (ARMv8), GCC: Nexus 5x (ARMv8), Clang: BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ========== to ========== NEON f32 <-> f16 and f32 <-> u16 Adds f32 <-> f16 ARMv7 and ARMv8 NEON code. For a fair comparison, also adds NEON f32 <-> u16 code, which was a TODO. The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we fall back on my ARMv7 version there. The ARMv7 version is different enough from the SSE version that it does not make sense to use SkNx. f16 is at least competitive with u16, and faster with proper ARMv8. Nexus 5 (ARMv7), GCC: 10218.75 xferu64_bw_1_alpha_u16 nonrendering 12868.90 xferu64_aa_1_opaque_u16 nonrendering 19093.02 xferu64_aa_1_alpha_u16 nonrendering 11520.75 xferu64_bw_1_alpha_f16 nonrendering 15064.45 xferu64_aa_1_opaque_f16 nonrendering 20384.28 xferu64_aa_1_alpha_f16 nonrendering Nexus 5 (ARMv7), Clang: Nexus 5x (ARMv8), GCC: Nexus 5x (ARMv8), Clang: BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... 
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ==========
Description was changed from ========== NEON f32 <-> f16 and f32 <-> u16 Adds f32 <-> f16 ARMv7 and ARMv8 NEON code. For a fair comparison, also adds NEON f32 <-> u16 code, which was a TODO. The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we fall back on my ARMv7 version there. The ARMv7 version is different enough from the SSE version that it does not make sense to use SkNx. f16 is at least competitive with u16, and faster with proper ARMv8. Nexus 5 (ARMv7), GCC: 10218.75 xferu64_bw_1_alpha_u16 nonrendering 12868.90 xferu64_aa_1_opaque_u16 nonrendering 19093.02 xferu64_aa_1_alpha_u16 nonrendering 11520.75 xferu64_bw_1_alpha_f16 nonrendering 15064.45 xferu64_aa_1_opaque_f16 nonrendering 20384.28 xferu64_aa_1_alpha_f16 nonrendering Nexus 5 (ARMv7), Clang: Nexus 5x (ARMv8), GCC: Nexus 5x (ARMv8), Clang: BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ========== to ========== NEON f32 <-> f16 and f32 <-> u16 Adds f32 <-> f16 ARMv7 and ARMv8 NEON code. For a fair comparison, also adds NEON f32 <-> u16 code, which was a TODO. The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we fall back on my ARMv7 version there. The ARMv7 version is different enough from the SSE version that it does not make sense to use SkNx. f16 is at least competitive with u16, and faster with proper ARMv8. Nexus 5 (ARMv7), GCC: 10218.75 xferu64_bw_1_alpha_u16 nonrendering 12868.90 xferu64_aa_1_opaque_u16 nonrendering 19093.02 xferu64_aa_1_alpha_u16 nonrendering 11520.75 xferu64_bw_1_alpha_f16 nonrendering 15064.45 xferu64_aa_1_opaque_f16 nonrendering 20384.28 xferu64_aa_1_alpha_f16 nonrendering Nexus 5 (ARMv7), Clang: 17812.26 xferu64_bw_1_alpha_u16 nonrendering 20440.92 xferu64_aa_1_opaque_u16 nonrendering 25239.75 ! 
xferu64_aa_1_alpha_u16 nonrendering 10631.35 xferu64_bw_1_alpha_f16 nonrendering 13285.64 xferu64_aa_1_opaque_f16 nonrendering 18147.22 xferu64_aa_1_alpha_f16 nonrendering Nexus 5x (ARMv8), GCC: Nexus 5x (ARMv8), Clang: BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ==========
Description was changed from ========== NEON f32 <-> f16 and f32 <-> u16 Adds f32 <-> f16 ARMv7 and ARMv8 NEON code. For a fair comparison, also adds NEON f32 <-> u16 code, which was a TODO. The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we fall back on my ARMv7 version there. The ARMv7 version is different enough from the SSE version that it does not make sense to use SkNx. f16 is at least competitive with u16, and faster with proper ARMv8. Nexus 5 (ARMv7), GCC: 10218.75 xferu64_bw_1_alpha_u16 nonrendering 12868.90 xferu64_aa_1_opaque_u16 nonrendering 19093.02 xferu64_aa_1_alpha_u16 nonrendering 11520.75 xferu64_bw_1_alpha_f16 nonrendering 15064.45 xferu64_aa_1_opaque_f16 nonrendering 20384.28 xferu64_aa_1_alpha_f16 nonrendering Nexus 5 (ARMv7), Clang: 17812.26 xferu64_bw_1_alpha_u16 nonrendering 20440.92 xferu64_aa_1_opaque_u16 nonrendering 25239.75 ! xferu64_aa_1_alpha_u16 nonrendering 10631.35 xferu64_bw_1_alpha_f16 nonrendering 13285.64 xferu64_aa_1_opaque_f16 nonrendering 18147.22 xferu64_aa_1_alpha_f16 nonrendering Nexus 5x (ARMv8), GCC: Nexus 5x (ARMv8), Clang: BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ========== to ========== NEON f32 <-> f16 and f32 <-> u16 Adds f32 <-> f16 ARMv7 and ARMv8 NEON code. For a fair comparison, also adds NEON f32 <-> u16 code, which was a TODO. The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we fall back on my ARMv7 version there. The ARMv7 version is different enough from the SSE version that it does not make sense to use SkNx. f16 is at least competitive with u16. 
Nexus 5 (ARMv7), GCC: 10218.75 xferu64_bw_1_alpha_u16 nonrendering 12868.90 xferu64_aa_1_opaque_u16 nonrendering 19093.02 xferu64_aa_1_alpha_u16 nonrendering 11520.75 xferu64_bw_1_alpha_f16 nonrendering 15064.45 xferu64_aa_1_opaque_f16 nonrendering 20384.28 xferu64_aa_1_alpha_f16 nonrendering Nexus 5 (ARMv7), Clang: 17812.26 xferu64_bw_1_alpha_u16 nonrendering 20440.92 xferu64_aa_1_opaque_u16 nonrendering 25239.75 ! xferu64_aa_1_alpha_u16 nonrendering 10631.35 xferu64_bw_1_alpha_f16 nonrendering 13285.64 xferu64_aa_1_opaque_f16 nonrendering 18147.22 xferu64_aa_1_alpha_f16 nonrendering Nexus 5x (ARMv8), GCC: Nexus 5x (ARMv8), Clang: BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ==========
Description was changed from ========== NEON f32 <-> f16 and f32 <-> u16 Adds f32 <-> f16 ARMv7 and ARMv8 NEON code. For a fair comparison, also adds NEON f32 <-> u16 code, which was a TODO. The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we fall back on my ARMv7 version there. The ARMv7 version is different enough from the SSE version that it does not make sense to use SkNx. f16 is at least competitive with u16. Nexus 5 (ARMv7), GCC: 10218.75 xferu64_bw_1_alpha_u16 nonrendering 12868.90 xferu64_aa_1_opaque_u16 nonrendering 19093.02 xferu64_aa_1_alpha_u16 nonrendering 11520.75 xferu64_bw_1_alpha_f16 nonrendering 15064.45 xferu64_aa_1_opaque_f16 nonrendering 20384.28 xferu64_aa_1_alpha_f16 nonrendering Nexus 5 (ARMv7), Clang: 17812.26 xferu64_bw_1_alpha_u16 nonrendering 20440.92 xferu64_aa_1_opaque_u16 nonrendering 25239.75 ! xferu64_aa_1_alpha_u16 nonrendering 10631.35 xferu64_bw_1_alpha_f16 nonrendering 13285.64 xferu64_aa_1_opaque_f16 nonrendering 18147.22 xferu64_aa_1_alpha_f16 nonrendering Nexus 5x (ARMv8), GCC: Nexus 5x (ARMv8), Clang: BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ========== to ========== NEON f32 <-> f16 and f32 <-> u16 Adds f32 <-> f16 ARMv7 and ARMv8 NEON code. For a fair comparison, also adds NEON f32 <-> u16 code, which was a TODO. The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we fall back on my ARMv7 version there. The ARMv7 version is different enough from the SSE version that it does not make sense to use SkNx. f16 is at least competitive with u16. 
Nexus 5 (ARMv7), GCC: 10218.75 xferu64_bw_1_alpha_u16 nonrendering 12868.90 xferu64_aa_1_opaque_u16 nonrendering 19093.02 xferu64_aa_1_alpha_u16 nonrendering 11520.75 xferu64_bw_1_alpha_f16 nonrendering 15064.45 xferu64_aa_1_opaque_f16 nonrendering 20384.28 xferu64_aa_1_alpha_f16 nonrendering Nexus 5 (ARMv7), Clang: 17812.26 xferu64_bw_1_alpha_u16 nonrendering 20440.92 xferu64_aa_1_opaque_u16 nonrendering 25239.75 ! xferu64_aa_1_alpha_u16 nonrendering 10631.35 xferu64_bw_1_alpha_f16 nonrendering 13285.64 xferu64_aa_1_opaque_f16 nonrendering 18147.22 xferu64_aa_1_alpha_f16 nonrendering Nexus 5x (ARMv8), GCC: 8604.82 ! xferu64_bw_1_alpha_u16 nonrendering 12658.99 xferu64_aa_1_opaque_u16 nonrendering 14555.23 xferu64_aa_1_alpha_u16 nonrendering 8876.97 xferu64_bw_1_alpha_f16 nonrendering 11141.55 ? xferu64_aa_1_opaque_f16 nonrendering 14257.30 xferu64_aa_1_alpha_f16 nonrendering Nexus 5x (ARMv8), Clang: BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ==========
Description was changed from ========== NEON f32 <-> f16 and f32 <-> u16 Adds f32 <-> f16 ARMv7 and ARMv8 NEON code. For a fair comparison, also adds NEON f32 <-> u16 code, which was a TODO. The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we fall back on my ARMv7 version there. The ARMv7 version is different enough from the SSE version that it does not make sense to use SkNx. f16 is at least competitive with u16. Nexus 5 (ARMv7), GCC: 10218.75 xferu64_bw_1_alpha_u16 nonrendering 12868.90 xferu64_aa_1_opaque_u16 nonrendering 19093.02 xferu64_aa_1_alpha_u16 nonrendering 11520.75 xferu64_bw_1_alpha_f16 nonrendering 15064.45 xferu64_aa_1_opaque_f16 nonrendering 20384.28 xferu64_aa_1_alpha_f16 nonrendering Nexus 5 (ARMv7), Clang: 17812.26 xferu64_bw_1_alpha_u16 nonrendering 20440.92 xferu64_aa_1_opaque_u16 nonrendering 25239.75 ! xferu64_aa_1_alpha_u16 nonrendering 10631.35 xferu64_bw_1_alpha_f16 nonrendering 13285.64 xferu64_aa_1_opaque_f16 nonrendering 18147.22 xferu64_aa_1_alpha_f16 nonrendering Nexus 5x (ARMv8), GCC: 8604.82 ! xferu64_bw_1_alpha_u16 nonrendering 12658.99 xferu64_aa_1_opaque_u16 nonrendering 14555.23 xferu64_aa_1_alpha_u16 nonrendering 8876.97 xferu64_bw_1_alpha_f16 nonrendering 11141.55 ? xferu64_aa_1_opaque_f16 nonrendering 14257.30 xferu64_aa_1_alpha_f16 nonrendering Nexus 5x (ARMv8), Clang: BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ========== to ========== NEON f32 <-> f16 and f32 <-> u16 Adds f32 <-> f16 ARMv7 and ARMv8 NEON code. For a fair comparison, also adds NEON f32 <-> u16 code, which was a TODO. The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we fall back on my ARMv7 version there. The ARMv7 version is different enough from the SSE version that it does not make sense to use SkNx. f16 is at least competitive with u16. 
Nexus 5 (ARMv7), GCC: 10218.75 xferu64_bw_1_alpha_u16 nonrendering 12868.90 xferu64_aa_1_opaque_u16 nonrendering 19093.02 xferu64_aa_1_alpha_u16 nonrendering 11520.75 xferu64_bw_1_alpha_f16 nonrendering 15064.45 xferu64_aa_1_opaque_f16 nonrendering 20384.28 xferu64_aa_1_alpha_f16 nonrendering Nexus 5 (ARMv7), Clang: 17812.26 xferu64_bw_1_alpha_u16 nonrendering 20440.92 xferu64_aa_1_opaque_u16 nonrendering 25239.75 ! xferu64_aa_1_alpha_u16 nonrendering 10631.35 xferu64_bw_1_alpha_f16 nonrendering 13285.64 xferu64_aa_1_opaque_f16 nonrendering 18147.22 xferu64_aa_1_alpha_f16 nonrendering Nexus 5x (ARMv8), GCC: 8604.82 ! xferu64_bw_1_alpha_u16 nonrendering 12658.99 xferu64_aa_1_opaque_u16 nonrendering 14555.23 xferu64_aa_1_alpha_u16 nonrendering 8876.97 xferu64_bw_1_alpha_f16 nonrendering 11141.55 ? xferu64_aa_1_opaque_f16 nonrendering 14257.30 xferu64_aa_1_alpha_f16 nonrendering Nexus 5x (ARMv8), Clang: 7795.90 ? xferu64_bw_1_alpha_u16 nonrendering 10327.39 xferu64_aa_1_opaque_u16 nonrendering 13880.62 xferu64_aa_1_alpha_u16 nonrendering 6064.70 xferu64_bw_1_alpha_f16 nonrendering 8782.47 xferu64_aa_1_opaque_f16 nonrendering 11970.70 xferu64_aa_1_alpha_f16 nonrendering BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ==========
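To put a number on "f16 is at least competitive with u16", here is a small sketch that divides each f16 timing by the matching u16 timing. The figures are copied from the Nexus 5 GCC and Nexus 5x Clang tables above, and lower is assumed to be better, as for timings. `BenchPair` and `f16_over_u16` are names made up for this illustration and are not part of the patch.

```cpp
#include <array>

// One u16/f16 pair of timings for the same xfer benchmark.
// Hypothetical helper type, used only for this illustration.
struct BenchPair {
    const char* name;
    double u16;  // timing for the uint16_t storage path
    double f16;  // timing for the half-float storage path
};

// Nexus 5 (ARMv7), GCC: figures copied from the description.
static const std::array<BenchPair, 3> kNexus5Gcc = {{
    {"xferu64_bw_1_alpha",  10218.75, 11520.75},
    {"xferu64_aa_1_opaque", 12868.90, 15064.45},
    {"xferu64_aa_1_alpha",  19093.02, 20384.28},
}};

// Nexus 5x (ARMv8), Clang: figures copied from the description.
static const std::array<BenchPair, 3> kNexus5xClang = {{
    {"xferu64_bw_1_alpha",   7795.90,  6064.70},
    {"xferu64_aa_1_opaque", 10327.39,  8782.47},
    {"xferu64_aa_1_alpha",  13880.62, 11970.70},
}};

// Ratio below 1.0 means f16 was faster than u16 for that benchmark.
static double f16_over_u16(const BenchPair& p) {
    return p.f16 / p.u16;
}
```

Read this way, f16 comes out somewhat slower than u16 on the ARMv7 GCC build and faster on the ARMv8 Clang build.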
The CQ bit was checked by mtklein@google.com to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1700473003/100001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1700473003/100001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
Description was changed from ========== NEON f32 <-> f16 and f32 <-> u16 Adds f32 <-> f16 ARMv7 and ARMv8 NEON code. For a fair comparison, also adds NEON f32 <-> u16 code, which was a TODO. The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we fall back on my ARMv7 version there. The ARMv7 version is different enough from the SSE version that it does not make sense to use SkNx. f16 is at least competitive with u16. Nexus 5 (ARMv7), GCC: 10218.75 xferu64_bw_1_alpha_u16 nonrendering 12868.90 xferu64_aa_1_opaque_u16 nonrendering 19093.02 xferu64_aa_1_alpha_u16 nonrendering 11520.75 xferu64_bw_1_alpha_f16 nonrendering 15064.45 xferu64_aa_1_opaque_f16 nonrendering 20384.28 xferu64_aa_1_alpha_f16 nonrendering Nexus 5 (ARMv7), Clang: 17812.26 xferu64_bw_1_alpha_u16 nonrendering 20440.92 xferu64_aa_1_opaque_u16 nonrendering 25239.75 ! xferu64_aa_1_alpha_u16 nonrendering 10631.35 xferu64_bw_1_alpha_f16 nonrendering 13285.64 xferu64_aa_1_opaque_f16 nonrendering 18147.22 xferu64_aa_1_alpha_f16 nonrendering Nexus 5x (ARMv8), GCC: 8604.82 ! xferu64_bw_1_alpha_u16 nonrendering 12658.99 xferu64_aa_1_opaque_u16 nonrendering 14555.23 xferu64_aa_1_alpha_u16 nonrendering 8876.97 xferu64_bw_1_alpha_f16 nonrendering 11141.55 ? xferu64_aa_1_opaque_f16 nonrendering 14257.30 xferu64_aa_1_alpha_f16 nonrendering Nexus 5x (ARMv8), Clang: 7795.90 ? xferu64_bw_1_alpha_u16 nonrendering 10327.39 xferu64_aa_1_opaque_u16 nonrendering 13880.62 xferu64_aa_1_alpha_u16 nonrendering 6064.70 xferu64_bw_1_alpha_f16 nonrendering 8782.47 xferu64_aa_1_opaque_f16 nonrendering 11970.70 xferu64_aa_1_alpha_f16 nonrendering BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ========== to ========== NEON f32 <-> f16 and f32 <-> u16 Adds f32 <-> f16 ARMv7 and ARMv8 NEON code. Also adds NEON f32 <-> u16 code to make the comparison fair. 
The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we fall back on my ARMv7 version there. The ARMv7 version is different enough from the SSE version that it does not make sense to use SkNx. f16 is at least competitive with u16. Nexus 5 (ARMv7), GCC: 10218.75 xferu64_bw_1_alpha_u16 nonrendering 12868.90 xferu64_aa_1_opaque_u16 nonrendering 19093.02 xferu64_aa_1_alpha_u16 nonrendering 11520.75 xferu64_bw_1_alpha_f16 nonrendering 15064.45 xferu64_aa_1_opaque_f16 nonrendering 20384.28 xferu64_aa_1_alpha_f16 nonrendering Nexus 5 (ARMv7), Clang: 17812.26 xferu64_bw_1_alpha_u16 nonrendering 20440.92 xferu64_aa_1_opaque_u16 nonrendering 25239.75 ! xferu64_aa_1_alpha_u16 nonrendering 10631.35 xferu64_bw_1_alpha_f16 nonrendering 13285.64 xferu64_aa_1_opaque_f16 nonrendering 18147.22 xferu64_aa_1_alpha_f16 nonrendering Nexus 5x (ARMv8), GCC: 8604.82 ! xferu64_bw_1_alpha_u16 nonrendering 12658.99 xferu64_aa_1_opaque_u16 nonrendering 14555.23 xferu64_aa_1_alpha_u16 nonrendering 8876.97 xferu64_bw_1_alpha_f16 nonrendering 11141.55 ? xferu64_aa_1_opaque_f16 nonrendering 14257.30 xferu64_aa_1_alpha_f16 nonrendering Nexus 5x (ARMv8), Clang: 7795.90 ? xferu64_bw_1_alpha_u16 nonrendering 10327.39 xferu64_aa_1_opaque_u16 nonrendering 13880.62 xferu64_aa_1_alpha_u16 nonrendering 6064.70 xferu64_bw_1_alpha_f16 nonrendering 8782.47 xferu64_aa_1_opaque_f16 nonrendering 11970.70 xferu64_aa_1_alpha_f16 nonrendering BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ==========
Description was changed from ========== NEON f32 <-> f16 and f32 <-> u16 Adds f32 <-> f16 ARMv7 and ARMv8 NEON code. Also adds NEON f32 <-> u16 code to make the comparison fair. The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we fall back on my ARMv7 version there. The ARMv7 version is different enough from the SSE version that it does not make sense to use SkNx. f16 is at least competitive with u16. Nexus 5 (ARMv7), GCC: 10218.75 xferu64_bw_1_alpha_u16 nonrendering 12868.90 xferu64_aa_1_opaque_u16 nonrendering 19093.02 xferu64_aa_1_alpha_u16 nonrendering 11520.75 xferu64_bw_1_alpha_f16 nonrendering 15064.45 xferu64_aa_1_opaque_f16 nonrendering 20384.28 xferu64_aa_1_alpha_f16 nonrendering Nexus 5 (ARMv7), Clang: 17812.26 xferu64_bw_1_alpha_u16 nonrendering 20440.92 xferu64_aa_1_opaque_u16 nonrendering 25239.75 ! xferu64_aa_1_alpha_u16 nonrendering 10631.35 xferu64_bw_1_alpha_f16 nonrendering 13285.64 xferu64_aa_1_opaque_f16 nonrendering 18147.22 xferu64_aa_1_alpha_f16 nonrendering Nexus 5x (ARMv8), GCC: 8604.82 ! xferu64_bw_1_alpha_u16 nonrendering 12658.99 xferu64_aa_1_opaque_u16 nonrendering 14555.23 xferu64_aa_1_alpha_u16 nonrendering 8876.97 xferu64_bw_1_alpha_f16 nonrendering 11141.55 ? xferu64_aa_1_opaque_f16 nonrendering 14257.30 xferu64_aa_1_alpha_f16 nonrendering Nexus 5x (ARMv8), Clang: 7795.90 ? xferu64_bw_1_alpha_u16 nonrendering 10327.39 xferu64_aa_1_opaque_u16 nonrendering 13880.62 xferu64_aa_1_alpha_u16 nonrendering 6064.70 xferu64_bw_1_alpha_f16 nonrendering 8782.47 xferu64_aa_1_opaque_f16 nonrendering 11970.70 xferu64_aa_1_alpha_f16 nonrendering BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ========== to ========== NEON f32 <-> f16 and f32 <-> u16 Adds f32 <-> f16 ARMv7 and ARMv8 NEON code. Also adds NEON f32 <-> u16 code to make the comparison fair. 
The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we fall back on my ARMv7 version there. The ARMv7 version is different enough from the SSE version that it does not make sense to use SkNx. f16 is at least competitive with u16. Nexus 5 (ARMv7), GCC: 10218.75 xferu64_bw_1_alpha_u16 nonrendering 12868.90 xferu64_aa_1_opaque_u16 nonrendering 19093.02 xferu64_aa_1_alpha_u16 nonrendering 11520.75 xferu64_bw_1_alpha_f16 nonrendering 15064.45 xferu64_aa_1_opaque_f16 nonrendering 20384.28 xferu64_aa_1_alpha_f16 nonrendering Nexus 5 (ARMv7), Clang: 17812.26 xferu64_bw_1_alpha_u16 nonrendering 20440.92 xferu64_aa_1_opaque_u16 nonrendering 25239.75 ! xferu64_aa_1_alpha_u16 nonrendering 10631.35 xferu64_bw_1_alpha_f16 nonrendering 13285.64 xferu64_aa_1_opaque_f16 nonrendering 18147.22 xferu64_aa_1_alpha_f16 nonrendering Nexus 5x (ARMv8), GCC: 8604.82 ! xferu64_bw_1_alpha_u16 nonrendering 12658.99 xferu64_aa_1_opaque_u16 nonrendering 14555.23 xferu64_aa_1_alpha_u16 nonrendering 8876.97 xferu64_bw_1_alpha_f16 nonrendering 11141.55 ? xferu64_aa_1_opaque_f16 nonrendering 14257.30 xferu64_aa_1_alpha_f16 nonrendering Nexus 5x (ARMv8), Clang: 7795.90 ? xferu64_bw_1_alpha_u16 nonrendering 10327.39 xferu64_aa_1_opaque_u16 nonrendering 13880.62 xferu64_aa_1_alpha_u16 nonrendering 6064.70 xferu64_bw_1_alpha_f16 nonrendering 8782.47 xferu64_aa_1_opaque_f16 nonrendering 11970.70 xferu64_aa_1_alpha_f16 nonrendering BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;client.skia.android:Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot,Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Release-Trybot ==========
The CQ bit was checked by mtklein@google.com to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1700473003/100001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1700473003/100001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot on client.skia.android (JOB_FAILED, http://build.chromium.org/p/client.skia.android/builders/Test-Android-GCC-Nex...)
lgtm
reed@google.com changed reviewers: + msarett@google.com
Description was changed from
==========
NEON f32 <-> f16 and f32 <-> u16
Adds f32 <-> f16 ARMv7 and ARMv8 NEON code.
Also adds NEON f32 <-> u16 code to make the comparison fair.
The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we fall back on my ARMv7 version there.
The ARMv7 version is different enough from the SSE version that it does not make sense to use SkNx.
f16 is at least competitive with u16.
Nexus 5 (ARMv7), GCC:
  10218.75    xferu64_bw_1_alpha_u16   nonrendering
  12868.90    xferu64_aa_1_opaque_u16  nonrendering
  19093.02    xferu64_aa_1_alpha_u16   nonrendering
  11520.75    xferu64_bw_1_alpha_f16   nonrendering
  15064.45    xferu64_aa_1_opaque_f16  nonrendering
  20384.28    xferu64_aa_1_alpha_f16   nonrendering
Nexus 5 (ARMv7), Clang:
  17812.26    xferu64_bw_1_alpha_u16   nonrendering
  20440.92    xferu64_aa_1_opaque_u16  nonrendering
  25239.75 !  xferu64_aa_1_alpha_u16   nonrendering
  10631.35    xferu64_bw_1_alpha_f16   nonrendering
  13285.64    xferu64_aa_1_opaque_f16  nonrendering
  18147.22    xferu64_aa_1_alpha_f16   nonrendering
Nexus 5x (ARMv8), GCC:
   8604.82 !  xferu64_bw_1_alpha_u16   nonrendering
  12658.99    xferu64_aa_1_opaque_u16  nonrendering
  14555.23    xferu64_aa_1_alpha_u16   nonrendering
   8876.97    xferu64_bw_1_alpha_f16   nonrendering
  11141.55 ?  xferu64_aa_1_opaque_f16  nonrendering
  14257.30    xferu64_aa_1_alpha_f16   nonrendering
Nexus 5x (ARMv8), Clang:
   7795.90 ?  xferu64_bw_1_alpha_u16   nonrendering
  10327.39    xferu64_aa_1_opaque_u16  nonrendering
  13880.62    xferu64_aa_1_alpha_u16   nonrendering
   6064.70    xferu64_bw_1_alpha_f16   nonrendering
   8782.47    xferu64_aa_1_opaque_f16  nonrendering
  11970.70    xferu64_aa_1_alpha_f16   nonrendering
BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;client.skia.android:Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot,Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Release-Trybot
==========
to
==========
NEON f32 <-> f16 and f32 <-> u16
Adds f32 <-> f16 ARMv7 and ARMv8 NEON code.
Also adds NEON f32 <-> u16 code to make the comparison fair.
The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we fall back on my ARMv7 version there.
The ARMv7 version is different enough from the SSE version that it does not make sense to use SkNx.
f16 is at least competitive with u16.
Nexus 5 (ARMv7), GCC:
  10218.75    xferu64_bw_1_alpha_u16   nonrendering
  12868.90    xferu64_aa_1_opaque_u16  nonrendering
  19093.02    xferu64_aa_1_alpha_u16   nonrendering
  11520.75    xferu64_bw_1_alpha_f16   nonrendering
  15064.45    xferu64_aa_1_opaque_f16  nonrendering
  20384.28    xferu64_aa_1_alpha_f16   nonrendering
Nexus 5 (ARMv7), Clang:
  17812.26    xferu64_bw_1_alpha_u16   nonrendering
  20440.92    xferu64_aa_1_opaque_u16  nonrendering
  25239.75 !  xferu64_aa_1_alpha_u16   nonrendering
  10631.35    xferu64_bw_1_alpha_f16   nonrendering
  13285.64    xferu64_aa_1_opaque_f16  nonrendering
  18147.22    xferu64_aa_1_alpha_f16   nonrendering
Nexus 5x (ARMv8), GCC:
   8604.82 !  xferu64_bw_1_alpha_u16   nonrendering
  12658.99    xferu64_aa_1_opaque_u16  nonrendering
  14555.23    xferu64_aa_1_alpha_u16   nonrendering
   8876.97    xferu64_bw_1_alpha_f16   nonrendering
  11141.55 ?  xferu64_aa_1_opaque_f16  nonrendering
  14257.30    xferu64_aa_1_alpha_f16   nonrendering
Nexus 5x (ARMv8), Clang:
   7795.90 ?  xferu64_bw_1_alpha_u16   nonrendering
  10327.39    xferu64_aa_1_opaque_u16  nonrendering
  13880.62    xferu64_aa_1_alpha_u16   nonrendering
   6064.70    xferu64_bw_1_alpha_f16   nonrendering
   8782.47    xferu64_aa_1_opaque_f16  nonrendering
  11970.70    xferu64_aa_1_alpha_f16   nonrendering
BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
CQ_EXTRA_TRYBOTS=client.skia.android:Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot,Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Release-Trybot;client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
==========
The CQ bit was checked by mtklein@google.com
The patchset sent to the CQ was uploaded after l-g-t-m from reed@google.com Link to the patchset: https://codereview.chromium.org/1700473003/#ps120001 (title: "back off from ARMv7")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1700473003/120001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1700473003/120001
The CQ bit was unchecked by mtklein@google.com
Description was changed from
==========
NEON f32 <-> f16 and f32 <-> u16
Adds f32 <-> f16 ARMv7 and ARMv8 NEON code.
Also adds NEON f32 <-> u16 code to make the comparison fair.
The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we fall back on my ARMv7 version there.
The ARMv7 version is different enough from the SSE version that it does not make sense to use SkNx.
f16 is at least competitive with u16.
Nexus 5 (ARMv7), GCC:
  10218.75    xferu64_bw_1_alpha_u16   nonrendering
  12868.90    xferu64_aa_1_opaque_u16  nonrendering
  19093.02    xferu64_aa_1_alpha_u16   nonrendering
  11520.75    xferu64_bw_1_alpha_f16   nonrendering
  15064.45    xferu64_aa_1_opaque_f16  nonrendering
  20384.28    xferu64_aa_1_alpha_f16   nonrendering
Nexus 5 (ARMv7), Clang:
  17812.26    xferu64_bw_1_alpha_u16   nonrendering
  20440.92    xferu64_aa_1_opaque_u16  nonrendering
  25239.75 !  xferu64_aa_1_alpha_u16   nonrendering
  10631.35    xferu64_bw_1_alpha_f16   nonrendering
  13285.64    xferu64_aa_1_opaque_f16  nonrendering
  18147.22    xferu64_aa_1_alpha_f16   nonrendering
Nexus 5x (ARMv8), GCC:
   8604.82 !  xferu64_bw_1_alpha_u16   nonrendering
  12658.99    xferu64_aa_1_opaque_u16  nonrendering
  14555.23    xferu64_aa_1_alpha_u16   nonrendering
   8876.97    xferu64_bw_1_alpha_f16   nonrendering
  11141.55 ?  xferu64_aa_1_opaque_f16  nonrendering
  14257.30    xferu64_aa_1_alpha_f16   nonrendering
Nexus 5x (ARMv8), Clang:
   7795.90 ?  xferu64_bw_1_alpha_u16   nonrendering
  10327.39    xferu64_aa_1_opaque_u16  nonrendering
  13880.62    xferu64_aa_1_alpha_u16   nonrendering
   6064.70    xferu64_bw_1_alpha_f16   nonrendering
   8782.47    xferu64_aa_1_opaque_f16  nonrendering
  11970.70    xferu64_aa_1_alpha_f16   nonrendering
BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
CQ_EXTRA_TRYBOTS=client.skia.android:Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot,Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Release-Trybot;client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
==========
to
==========
NEON f32 <-> f16 and f32 <-> u16
Adds f32 <-> f16 ARMv7 and ARMv8 NEON code.
Also adds NEON f32 <-> u16 code to make the comparison fair.
The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we use a tiny amount of inline assembly.
The ARMv7 half -> float is different enough from the SSE version that it does not make sense to use SkNx.
Still TODO:
ARMv7 float -> half. Naively translating the SSE version results in 0x0000 where we'd expect a denormal output.
BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
CQ_EXTRA_TRYBOTS=client.skia.android:Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot,Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Release-Trybot;client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
==========
The CQ bit was checked by mtklein@google.com to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1700473003/140001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1700473003/140001
Description was changed from ========== NEON f32 <-> f16 and f32 <-> u16 Adds f32 <-> f16 ARMv7 and ARMv8 NEON code. Also adds NEON f32 <-> u16 code to make the comparison fair. The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we use a tiny amount of inline assembly. The ARMv7 half -> float is different enough from the SSE version that it does not make sense to use SkNx. Still TODO: ARMv7 float -> half. Naively translating the SSE version results in 0x0000 where we'd expect a denormal output. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... CQ_EXTRA_TRYBOTS=client.skia.android:Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot,Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Release-Trybot;client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ========== to ========== NEON f32 <-> f16 and f32 <-> u16 Adds f32 <-> f16 ARMv7 and ARMv8 NEON code. Also adds NEON f32 <-> u16 code to make the comparison fair. The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we use a tiny amount of inline assembly. The ARMv7 half -> float is different enough from the SSE version that it does not make sense to use SkNx. Still TODO: ARMv7 float -> half. Naively translating the SSE version results in 0x0000 where we'd expect a denormal output. Speed summary: ARMv8, GCC: f16 is about 20% faster than u16 ARMv8, clang: BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... CQ_EXTRA_TRYBOTS=client.skia.android:Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot,Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Release-Trybot;client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ==========
Description was changed from ========== NEON f32 <-> f16 and f32 <-> u16 Adds f32 <-> f16 ARMv7 and ARMv8 NEON code. Also adds NEON f32 <-> u16 code to make the comparison fair. The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we use a tiny amount of inline assembly. The ARMv7 half -> float is different enough from the SSE version that it does not make sense to use SkNx. Still TODO: ARMv7 float -> half. Naively translating the SSE version results in 0x0000 where we'd expect a denormal output. Speed summary: ARMv8, GCC: f16 is about 20% faster than u16 ARMv8, clang: BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... CQ_EXTRA_TRYBOTS=client.skia.android:Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot,Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Release-Trybot;client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ========== to ========== NEON f32 <-> f16 and f32 <-> u16 Adds f32 <-> f16 ARMv7 and ARMv8 NEON code. Also adds NEON f32 <-> u16 code to make the comparison fair. The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we use a tiny amount of inline assembly. The ARMv7 half -> float is different enough from the SSE version that it does not make sense to use SkNx. Still TODO: ARMv7 float -> half. Naively translating the SSE version results in 0x0000 where we'd expect a denormal output. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... CQ_EXTRA_TRYBOTS=client.skia.android:Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot,Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Release-Trybot;client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ==========
This is probably a good time to take a(nother) look. I've changed two things: 1) inline assembly lets us be fast on ARMv8 no matter the compiler; 2) the ARMv7 float -> half code was wrong for denormal outputs, so I've removed it for this CL. Will follow up.
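For readers following the denormal issue, here is a minimal scalar sketch in C++ of a multiply-and-shift float -> half conversion. The name `float_to_half_trunc` is hypothetical, and the multiply step is an assumption suggested by the float-typed `rebias` constant in SkHalf.h, not a quote of the SSE code. It assumes a finite, non-negative input within half range and truncates rather than rounds. A denormal half result depends on the multiply producing a denormal float. ARMv7 NEON arithmetic flushes denormals to zero, which would turn that intermediate into 0 and the output into 0x0000. That may be the failure the description's TODO refers to, though the thread does not say so.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical scalar sketch, not Skia's API.
// Assumes a finite input in [0, 65504]. Truncates, so the result can be 1 bit too small.
static uint16_t float_to_half_trunc(float f) {
    // 2^-112 as a float: biased exponent 127 - (127 - 15) = 15, zero mantissa.
    const uint32_t rebias_bits = (127u - (127u - 15u)) << 23;
    float rebias;
    std::memcpy(&rebias, &rebias_bits, sizeof rebias);

    // Scaling moves the float exponent into half's bias. For tiny inputs the
    // product is a denormal float, whose mantissa lines up with a denormal half.
    float scaled = f * rebias;

    uint32_t bits;
    std::memcpy(&bits, &scaled, sizeof bits);
    return static_cast<uint16_t>(bits >> 13);  // drop the 13 extra mantissa bits
}
```

With denormal support enabled, an input of 2^-24 (the smallest positive half) scales to 2^-136, a denormal float with bit pattern 0x2000, and the shift yields 0x0001.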
I realize I've partially reviewed code that was already there :).
https://codereview.chromium.org/1700473003/diff/140001/src/core/SkHalf.h
File src/core/SkHalf.h (right):
https://codereview.chromium.org/1700473003/diff/140001/src/core/SkHalf.h#newc...
src/core/SkHalf.h:55: norm = vreinterpretq_f32_u32(vaddq_u32(vshlq_n_u32(h, 13),
Is this faster than vcvtq_n_f32_u32? Or does vcvtq_n_f32_u32() not work for some reason?
https://codereview.chromium.org/1700473003/diff/140001/src/core/SkHalf.h#newc...
src/core/SkHalf.h:106: // This doesn't round, so it can be 1 bit too small.
Would an "add" before the "right shift" allow you to round? How much more costly is this?
https://codereview.chromium.org/1700473003/diff/140001/src/core/SkHalf.h#newc...
src/core/SkHalf.h:107: const __m128 rebias = _mm_castsi128_ps(_mm_set1_epi32((127 - (127-15)) << 23));
I find this more clear as (15 << 23)
https://codereview.chromium.org/1700473003/diff/140001/src/core/SkHalf.h
File src/core/SkHalf.h (right):
https://codereview.chromium.org/1700473003/diff/140001/src/core/SkHalf.h#newc...
src/core/SkHalf.h:55: norm = vreinterpretq_f32_u32(vaddq_u32(vshlq_n_u32(h, 13),
On 2016/02/17 19:46:38, msarett wrote:
> Is this faster than vcvtq_n_f32_u32? Or does vcvtq_n_f32_u32() not work for
> some reason?
Remember, we're always doing both. vcvtq_n_f32_u32(...) is correct when the input is denormalized. vaddq_u32(vshlq_n_u32(... ), ...) is correct when it's not.
https://codereview.chromium.org/1700473003/diff/140001/src/core/SkHalf.h#newc...
src/core/SkHalf.h:106: // This doesn't round, so it can be 1 bit too small.
On 2016/02/17 19:46:38, msarett wrote:
> Would an "add" before the "right shift" allow you to round? How much more
> costly is this?
I think so, but I'm not quite sure what to add when yet. Will be following up here. It's not super important we get this perfectly precise.
https://codereview.chromium.org/1700473003/diff/140001/src/core/SkHalf.h#newc...
src/core/SkHalf.h:107: const __m128 rebias = _mm_castsi128_ps(_mm_set1_epi32((127 - (127-15)) << 23));
On 2016/02/17 19:46:38, msarett wrote:
> I find this more clear as (15 << 23)
This is meant to parallel the (127 + (127-15)). Seeing 127 and 15 here makes sense in the context of floating point exponent biases.
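The "always doing both" point in the reply above can be illustrated with a scalar sketch in C++. This is not the NEON code under review: `half_to_float_sketch` is a hypothetical name, the added constant `(127 - 15) << 23` is an assumption (the second argument of `vaddq_u32` is cut off in the quoted line), and the multiply by 2^-24 stands in for a fixed-point conversion with 24 fractional bits. It handles only non-negative, finite halfs, so no sign bit and no inf/NaN.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical scalar sketch, not Skia's API.
// Assumes a non-negative, finite half (h <= 0x7BFF).
static float half_to_float_sketch(uint16_t h) {
    // Path 1, right for normal inputs: move exponent and mantissa up 13 bits,
    // then add the difference between the float and half exponent biases.
    uint32_t norm_bits = (static_cast<uint32_t>(h) << 13) + ((127u - 15u) << 23);
    float norm;
    std::memcpy(&norm, &norm_bits, sizeof norm);

    // Path 2, right for denormal inputs: with a zero exponent field, the bits
    // are a fixed-point number with 24 fractional bits, i.e. h * 2^-24.
    float denorm = static_cast<float>(h) * (1.0f / 16777216.0f);

    // Both paths are computed; select per value. Halfs below 0x0400 are
    // zero or denormal.
    return (h < 0x0400) ? denorm : norm;
}
```

A SIMD version would compute both results for every lane and blend them with a comparison mask, which is why neither path can simply replace the other.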
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot on client.skia.android (JOB_FAILED, http://build.chromium.org/p/client.skia.android/builders/Test-Android-GCC-Nex...)
On 2016/02/17 20:09:00, commit-bot: I haz the power wrote:
> Dry run: Try jobs failed on following builders:
> Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot on client.skia.android
> (JOB_FAILED,
> http://build.chromium.org/p/client.skia.android/builders/Test-Android-GCC-Nex...)
I'm gonna count this run as a success. The _01 tests passed.
The CQ bit was checked by mtklein@google.com to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1700473003/140001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1700473003/140001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
Gonna get this baking... happy to follow up / evolve it.
The CQ bit was checked by mtklein@google.com
The patchset sent to the CQ was uploaded after l-g-t-m from reed@google.com Link to the patchset: https://codereview.chromium.org/1700473003/#ps140001 (title: "armv8 asm")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1700473003/140001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1700473003/140001
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot on client.skia.android (JOB_TIMED_OUT, no build URL)
The CQ bit was checked by mtklein@google.com
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1700473003/140001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1700473003/140001
Message was sent while issue was closed.
Description was changed from ========== NEON f32 <-> f16 and f32 <-> u16 Adds f32 <-> f16 ARMv7 and ARMv8 NEON code. Also adds NEON f32 <-> u16 code to make the comparison fair. The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we use a tiny amount of inline assembly. The ARMv7 half -> float is different enough from the SSE version that it does not make sense to use SkNx. Still TODO: ARMv7 float -> half. Naively translating the SSE version results in 0x0000 where we'd expect a denormal output. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... CQ_EXTRA_TRYBOTS=client.skia.android:Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot,Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Release-Trybot;client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot ========== to ========== NEON f32 <-> f16 and f32 <-> u16 Adds f32 <-> f16 ARMv7 and ARMv8 NEON code. Also adds NEON f32 <-> u16 code to make the comparison fair. The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we use a tiny amount of inline assembly. The ARMv7 half -> float is different enough from the SSE version that it does not make sense to use SkNx. Still TODO: ARMv7 float -> half. Naively translating the SSE version results in 0x0000 where we'd expect a denormal output. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... CQ_EXTRA_TRYBOTS=client.skia.android:Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot,Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Release-Trybot;client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot Committed: https://skia.googlesource.com/skia/+/be8c19e8d3deac9b9585c44b9a423912dd00a75a ==========
Message was sent while issue was closed.
Committed patchset #8 (id:140001) as https://skia.googlesource.com/skia/+/be8c19e8d3deac9b9585c44b9a423912dd00a75a