Chromium Code Reviews
Description
NEON f32 <-> f16 and f32 <-> u16
Adds f32 <-> f16 ARMv7 and ARMv8 NEON code.
Also adds NEON f32 <-> u16 code to make the comparison fair.
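An f32 <-> u16 conversion of this kind treats each 16-bit channel as a normalized value, where 0 maps to 0.0f and 65535 maps to 1.0f. The sketch below assumes that unorm interpretation and round-to-nearest when converting back. `u16_to_f32x4` and `f32_to_u16x4` are hypothetical names, not the functions added in this CL.

```cpp
#include <cstdint>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    #include <arm_neon.h>
#endif

// Hypothetical helpers (not Skia API): 4 lanes of 16-bit unorm <-> float in [0,1].
inline void u16_to_f32x4(const uint16_t src[4], float dst[4]) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint32x4_t  wide = vmovl_u16(vld1_u16(src));              // widen u16 -> u32
    float32x4_t f    = vcvtq_f32_u32(wide);                   // u32 -> f32
    vst1q_f32(dst, vmulq_f32(f, vdupq_n_f32(1.0f / 65535)));  // scale to [0,1]
#else
    for (int i = 0; i < 4; i++) {
        dst[i] = src[i] * (1.0f / 65535);
    }
#endif
}

inline void f32_to_u16x4(const float src[4], uint16_t dst[4]) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t f = vld1q_f32(src);
    f = vminq_f32(vmaxq_f32(f, vdupq_n_f32(0.0f)), vdupq_n_f32(1.0f));  // clamp to [0,1]
    f = vmlaq_f32(vdupq_n_f32(0.5f), f, vdupq_n_f32(65535.0f));         // f*65535 + 0.5
    vst1_u16(dst, vmovn_u32(vcvtq_u32_f32(f)));                          // truncate, then narrow
#else
    for (int i = 0; i < 4; i++) {
        float v = src[i];
        if (!(v > 0.0f)) { v = 0.0f; }              // clamp low (also maps NaN to 0)
        if (v > 1.0f)    { v = 1.0f; }              // clamp high
        dst[i] = (uint16_t)(v * 65535.0f + 0.5f);   // round to nearest
    }
#endif
}
```

The NEON path needs the explicit clamp because `vmovn_u32` keeps only the low 16 bits of each lane. Without the clamp, an out-of-range value would wrap instead of saturating.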
The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we use a tiny amount of inline assembly.
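Here is a sketch of that inline-assembly workaround, assuming an AArch64 target. `FCVTL` widens four halves to four floats, and `FCVTN` narrows four floats back to halves, so the missing `vcvt_*_f16` intrinsics are not needed. The helper names are hypothetical. The scalar fallbacks exist only so the sketch also compiles and runs on other targets. This is not the code committed in this CL.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>
#if defined(__aarch64__)
    #include <arm_neon.h>
#endif

// Portable fallbacks (hypothetical), exact for half -> float, round-to-nearest-even back.
inline float half_to_float_scalar(uint16_t h) {
    int e = (h >> 10) & 0x1f, m = h & 0x3ff;
    float mag = e == 0  ? std::ldexp((float)m, -24)             // zero / denormal
              : e == 31 ? (m ? NAN : INFINITY)                   // inf / NaN
              : std::ldexp((float)(m | 0x400), e - 25);          // normal
    return (h & 0x8000) ? -mag : mag;
}

inline uint16_t float_to_half_scalar(float f) {
    uint32_t x;
    std::memcpy(&x, &f, sizeof x);
    uint16_t sign = (x >> 16) & 0x8000;
    uint32_t ax   = x & 0x7fffffff;
    if (ax > 0x7f800000)  { return sign | 0x7e00; }              // NaN -> quiet NaN
    if (ax >= 0x477ff000) { return sign | 0x7c00; }              // overflow or inf -> inf
    if (ax < 0x38800000) {                                        // below 2^-14: denormal or zero
        return sign | (uint16_t)std::nearbyint(std::fabs(f) * 16777216.0f);  // |f| * 2^24
    }
    uint32_t rounded = ax + 0xfff + ((ax >> 13) & 1);            // round to nearest even
    return sign | (uint16_t)((rounded >> 13) - (112 << 10));     // rebias exponent 127 -> 15
}

// Four lanes at a time; on AArch64, emit FCVTL / FCVTN directly via inline asm.
inline void halves_to_floats(const uint16_t src[4], float dst[4]) {
#if defined(__aarch64__)
    uint16x4_t  h = vld1_u16(src);
    float32x4_t f;
    asm("fcvtl %0.4s, %1.4h" : "=w"(f) : "w"(h));   // "w": NEON/FP register
    vst1q_f32(dst, f);
#else
    for (int i = 0; i < 4; i++) { dst[i] = half_to_float_scalar(src[i]); }
#endif
}

inline void floats_to_halves(const float src[4], uint16_t dst[4]) {
#if defined(__aarch64__)
    float32x4_t f = vld1q_f32(src);
    uint16x4_t  h;
    asm("fcvtn %0.4h, %1.4s" : "=w"(h) : "w"(f));
    vst1_u16(dst, h);
#else
    for (int i = 0; i < 4; i++) { dst[i] = float_to_half_scalar(src[i]); }
#endif
}
```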
The ARMv7 half -> float is different enough from the SSE version that it does not make sense to use SkNx.
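One widely used SIMD-friendly half -> float technique works in two steps. It first moves the half's exponent and mantissa bits into float position. It then rebiases the exponent with a single multiply by 2^112, that is, 2^(127-15). The source doesn't say which method the SSE or ARMv7 code uses, so treat this scalar sketch as an illustration only. `half_to_float_finite` is a hypothetical name, and infinities and NaNs are not handled.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper: half bits -> float, valid for finite halves only.
inline float half_to_float_finite(uint16_t h) {
    uint32_t sign = uint32_t(h & 0x8000) << 16;  // half sign bit -> float bit 31
    uint32_t bits = uint32_t(h & 0x7fff) << 13;  // exponent + mantissa into float position
    float f;
    std::memcpy(&f, &bits, sizeof f);
    f *= std::ldexp(1.0f, 127 - 15);             // multiply by 2^112 to rebias the exponent
    std::memcpy(&bits, &f, sizeof bits);
    bits |= sign;                                // reattach the sign
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```

For half denormals, the intermediate value is itself a float denormal. The trick therefore yields 0 wherever float arithmetic flushes denormals to zero, and ARMv7 NEON floating-point arithmetic always does.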
Still TODO:
ARMv7 float -> half. Naively translating the SSE version results in 0x0000 where we'd expect a denormal output.
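The expected result can be stated precisely. A float below the smallest normal half (2^-14) should round to the nearest denormal half, so 2^-24 should become 0x0001 rather than 0x0000. The sketch below is a scalar round-to-nearest-even reference that could serve as a check for a vector version. `float_to_half_ref` is a hypothetical name, and this is not code from the CL.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical reference: float -> half bits, round-to-nearest-even, denormals kept.
inline uint16_t float_to_half_ref(float f) {
    uint32_t x;
    std::memcpy(&x, &f, sizeof x);
    uint16_t sign = (x >> 16) & 0x8000;
    uint32_t ax   = x & 0x7fffffff;
    if (ax > 0x7f800000)  { return sign | 0x7e00; }  // NaN -> quiet NaN
    if (ax >= 0x477ff000) { return sign | 0x7c00; }  // >= 65520 (or inf) rounds to inf
    if (ax < 0x38800000) {
        // Below 2^-14 the result is a half denormal (or zero): its mantissa is
        // |f| * 2^24 rounded to an integer. The scaling is exact, and nearbyint
        // rounds to nearest even under the default floating-point environment.
        return sign | (uint16_t)std::nearbyint(std::fabs(f) * 16777216.0f);
    }
    // Normal range: round the 13 dropped mantissa bits to nearest even, then
    // rebias the exponent from 127 to 15 (a carry may bump the exponent).
    uint32_t rounded = ax + 0xfff + ((ax >> 13) & 1);
    return sign | (uint16_t)((rounded >> 13) - (112 << 10));
}
```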
BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1700473003
CQ_EXTRA_TRYBOTS=client.skia.android:Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot,Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Release-Trybot;client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
Committed: https://skia.googlesource.com/skia/+/be8c19e8d3deac9b9585c44b9a423912dd00a75a
Patch Set 1
Patch Set 2: ARMv7 support too
Patch Set 3: fixes
Patch Set 4: q
Patch Set 5: tweak
Patch Set 6: f32 <-> u16
Patch Set 7: back off from ARMv7
Patch Set 8: armv8 asm
Total comments: 6
Messages
Total messages: 76 (52 generated)
Description was changed from
==========
On ARMv8, we definitely have NEON f16 <-> f32 instructions.
BUG=skia:
==========
to
==========
On ARMv8, we definitely have NEON f16 <-> f32 instructions.
BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
==========
Description was changed to
==========
On ARMv8, we definitely have NEON f16 <-> f32 instructions.
... unfortunately GCC 4.9 doesn't seem to know that. We could
work around that with inline assembly, but I don't feel like
pandering to old compilers. Instead, check for Clang.
BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
==========
Description was changed to
==========
On ARMv8, we definitely have NEON f16 <-> f32 instructions.
... unfortunately GCC 4.9 doesn't seem to know that. We could
work around that with inline assembly, but I don't feel like
pandering to old compilers. Instead, check for Clang.
BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
==========
Description was changed to
==========
On ARMv8, we definitely have NEON f16 <-> f32 instructions.
... unfortunately GCC 4.9 doesn't seem to know that. We could
work around that with inline assembly, but I don't feel like
pandering to old compilers. Instead, check for Clang.
This means that on ARMv8, half-float is a faster format to work
with than uint16_t (timed on N5x):
1425.29 xferu64_bw_1_opaque_u16
7712.89 xferu64_bw_1_alpha_u16
10338.13 xferu64_aa_1_opaque_u16
13750.49 xferu64_aa_1_alpha_u16
1112.06 xferu64_bw_1_opaque_f16
6070.07 xferu64_bw_1_alpha_f16
8789.06 xferu64_aa_1_opaque_f16
11975.83 xferu64_aa_1_alpha_f16
BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
==========
Description was changed to
==========
On ARMv8, we definitely have NEON f16 <-> f32 instructions.
... unfortunately GCC 4.9 doesn't seem to know that. We could
work around that with inline assembly, but I don't feel like
pandering to old compilers. Instead, check for Clang.
This means that on ARMv8, half-float is a faster storage format
than uint16_t (timed on N5x):
1425.29 xferu64_bw_1_opaque_u16
7712.89 xferu64_bw_1_alpha_u16
10338.13 xferu64_aa_1_opaque_u16
13750.49 xferu64_aa_1_alpha_u16
1112.06 xferu64_bw_1_opaque_f16
6070.07 xferu64_bw_1_alpha_f16
8789.06 xferu64_aa_1_opaque_f16
11975.83 xferu64_aa_1_alpha_f16
BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
==========
mtklein@chromium.org changed reviewers: + reed@google.com
The CQ bit was checked by mtklein@chromium.org
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1700473003/1 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1700473003/1
Note for Reviewers: The CQ is waiting for an approval. If you believe that the CL is not ready yet, or if you would like to L-G-T-M with comments then please uncheck the CQ checkbox. Waiting for LGTM from valid reviewer(s) till 2016-02-13 04:52 UTC
Description was changed to
==========
On ARMv8, we definitely have NEON f16 <-> f32 instructions.
... unfortunately GCC 4.9 doesn't seem to know that. We could
work around that with inline assembly, but I don't feel like
pandering to old compilers. Instead, check for Clang.
This means that on ARMv8, half-float is a faster storage format
than uint16_t (timed on N5x):
1425.29 xferu64_bw_1_opaque_u16
7712.89 xferu64_bw_1_alpha_u16
10338.13 xferu64_aa_1_opaque_u16
13750.49 xferu64_aa_1_alpha_u16
1112.06 xferu64_bw_1_opaque_f16
6070.07 xferu64_bw_1_alpha_f16
8789.06 xferu64_aa_1_opaque_f16
11975.83 xferu64_aa_1_alpha_f16
GCC, for comparison:
1108.07 xferu64_bw_1_opaque_u16
53033.72 xferu64_bw_1_alpha_u16
56324.06 xferu64_aa_1_opaque_u16
63194.09 xferu64_aa_1_alpha_u16
629.98 xferu64_bw_1_opaque_f16
95098.56 xferu64_bw_1_alpha_f16
109346.14 xferu64_aa_1_opaque_f16
106094.29 xferu64_aa_1_alpha_f16
BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
==========
The CQ bit was unchecked by commit-bot@chromium.org
No LGTM from a valid reviewer yet. Please ask for an LGTM from a full Skia committer
Description was changed to
==========
On ARMv8, we definitely have NEON f16 <-> f32 instructions.
... unfortunately GCC 4.9 doesn't seem to know that. We could
work around that with inline assembly, but I don't feel like
pandering to old compilers. Instead, check for Clang, falling back
on ARMv7-compatible NEON for GCC.
This means that on ARMv8, half-float is a faster storage format
than uint16_t (timed on N5x).
Clang:
1425.29 xferu64_bw_1_opaque_u16
7712.89 xferu64_bw_1_alpha_u16
10338.13 xferu64_aa_1_opaque_u16
13750.49 xferu64_aa_1_alpha_u16
1112.06 xferu64_bw_1_opaque_f16
6070.07 xferu64_bw_1_alpha_f16
8789.06 xferu64_aa_1_opaque_f16
11975.83 xferu64_aa_1_alpha_f16
GCC:
597.99 ? xferu64_bw_1_opaque_u16 nonrendering
53036.52 xferu64_bw_1_alpha_u16 nonrendering
56328.17 xferu64_aa_1_opaque_u16 nonrendering
63196.74 xferu64_aa_1_alpha_u16 nonrendering
575.16 xferu64_bw_1_opaque_f16 nonrendering
8866.49 xferu64_bw_1_alpha_f16 nonrendering
11050.74 xferu64_aa_1_opaque_f16 nonrendering
14128.42 xferu64_aa_1_alpha_f16 nonrendering
BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
==========
Description was changed to
==========
On ARMv8, we definitely have NEON f16 <-> f32 instructions.
... unfortunately GCC 4.9 doesn't seem to know that. We could
work around that with inline assembly, but I don't feel like
pandering to old compilers. Instead, check for Clang, falling back
on ARMv7-compatible NEON for GCC.
This means that on ARMv8, half-float is a faster storage format
than uint16_t (timed on N5x).
Clang:
1425.29 xferu64_bw_1_opaque_u16
7712.89 xferu64_bw_1_alpha_u16
10338.13 xferu64_aa_1_opaque_u16
13750.49 xferu64_aa_1_alpha_u16
1112.06 xferu64_bw_1_opaque_f16
6070.07 xferu64_bw_1_alpha_f16
8789.06 xferu64_aa_1_opaque_f16
11975.83 xferu64_aa_1_alpha_f16
GCC:
597.99 xferu64_bw_1_opaque_u16
53036.52 xferu64_bw_1_alpha_u16
56328.17 xferu64_aa_1_opaque_u16
63196.74 xferu64_aa_1_alpha_u16
575.16 xferu64_bw_1_opaque_f16
8866.49 xferu64_bw_1_alpha_f16
11050.74 xferu64_aa_1_opaque_f16
14128.42 xferu64_aa_1_alpha_f16
BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
==========
Description was changed to
==========
On ARMv8, we definitely have NEON f16 <-> f32 instructions.
... unfortunately GCC 4.9 doesn't seem to know that. We could
work around that with inline assembly, but I don't feel like
pandering to old compilers. Instead, check for Clang, falling back
on ARMv7-compatible NEON for GCC.
This means that on ARMv8, half-float is a faster storage format
than uint16_t (timed on N5x).
Clang:
1425.29 xferu64_bw_1_opaque_u16
7712.89 xferu64_bw_1_alpha_u16
10338.13 xferu64_aa_1_opaque_u16
13750.49 xferu64_aa_1_alpha_u16
1112.06 xferu64_bw_1_opaque_f16
6070.07 xferu64_bw_1_alpha_f16
8789.06 xferu64_aa_1_opaque_f16
11975.83 xferu64_aa_1_alpha_f16
GCC:
597.99 xferu64_bw_1_opaque_u16
53036.52 xferu64_bw_1_alpha_u16
56328.17 xferu64_aa_1_opaque_u16
63196.74 xferu64_aa_1_alpha_u16
575.16 xferu64_bw_1_opaque_f16
8866.49 xferu64_bw_1_alpha_f16
11050.74 xferu64_aa_1_opaque_f16
14128.42 xferu64_aa_1_alpha_f16
BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
==========
Description was changed to
==========
On ARMv8, we definitely have NEON f16 <-> f32 instructions.
... unfortunately GCC 4.9 doesn't seem to know that. We could
work around that with inline assembly, but I don't feel like
pandering to old compilers. Instead, check for Clang, falling back
to ARMv7-compatible NEON for ARMv8 GCC and, of course, ARMv7.
This means that on ARMv8, half-float is a faster storage format
than uint16_t (timed on N5x).
Clang:
1425.29 xferu64_bw_1_opaque_u16
7712.89 xferu64_bw_1_alpha_u16
10338.13 xferu64_aa_1_opaque_u16
13750.49 xferu64_aa_1_alpha_u16
1112.06 xferu64_bw_1_opaque_f16
6070.07 xferu64_bw_1_alpha_f16
8789.06 xferu64_aa_1_opaque_f16
11975.83 xferu64_aa_1_alpha_f16
GCC:
597.99 xferu64_bw_1_opaque_u16
53036.52 xferu64_bw_1_alpha_u16
56328.17 xferu64_aa_1_opaque_u16
63196.74 xferu64_aa_1_alpha_u16
575.16 xferu64_bw_1_opaque_f16
8866.49 xferu64_bw_1_alpha_f16
11050.74 xferu64_aa_1_opaque_f16
14128.42 xferu64_aa_1_alpha_f16
BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
==========
Description was changed to
==========
On ARMv8, we definitely have NEON f16 <-> f32 instructions.
... unfortunately GCC 4.9 doesn't seem to know that. We could
work around that with inline assembly, but I don't feel like
pandering to old compilers. Instead, check for Clang, falling back
to ARMv7-compatible NEON for ARMv8 GCC and, of course, ARMv7.
This means that on ARMv8, half-float is a faster storage format
than uint16_t. Though some of that is due to garbage code generation
for uint16_t code on GCC.
Clang:
1425.29 xferu64_bw_1_opaque_u16
7712.89 xferu64_bw_1_alpha_u16
10338.13 xferu64_aa_1_opaque_u16
13750.49 xferu64_aa_1_alpha_u16
1112.06 xferu64_bw_1_opaque_f16
6070.07 xferu64_bw_1_alpha_f16
8789.06 xferu64_aa_1_opaque_f16
11975.83 xferu64_aa_1_alpha_f16
GCC:
597.99 xferu64_bw_1_opaque_u16
53036.52 xferu64_bw_1_alpha_u16
56328.17 xferu64_aa_1_opaque_u16
63196.74 xferu64_aa_1_alpha_u16
575.16 xferu64_bw_1_opaque_f16
8866.49 xferu64_bw_1_alpha_f16
11050.74 xferu64_aa_1_opaque_f16
14128.42 xferu64_aa_1_alpha_f16
BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
==========
Description was changed to
==========
On ARMv8, we definitely have NEON f16 <-> f32 instructions.
... unfortunately GCC 4.9 doesn't seem to know that. We could
work around that with inline assembly, but I don't feel like
pandering to old compilers. Instead, check for Clang, falling back
to ARMv7-compatible NEON for ARMv8 GCC and, of course, ARMv7.
This means that on ARMv8, half-float is a faster storage format
than uint16_t. Though some of that is due to garbage code generation
for the uint16_t case on GCC.
Clang:
1425.29 xferu64_bw_1_opaque_u16
7712.89 xferu64_bw_1_alpha_u16
10338.13 xferu64_aa_1_opaque_u16
13750.49 xferu64_aa_1_alpha_u16
1112.06 xferu64_bw_1_opaque_f16
6070.07 xferu64_bw_1_alpha_f16
8789.06 xferu64_aa_1_opaque_f16
11975.83 xferu64_aa_1_alpha_f16
GCC:
597.99 xferu64_bw_1_opaque_u16
53036.52 xferu64_bw_1_alpha_u16
56328.17 xferu64_aa_1_opaque_u16
63196.74 xferu64_aa_1_alpha_u16
575.16 xferu64_bw_1_opaque_f16
8866.49 xferu64_bw_1_alpha_f16
11050.74 xferu64_aa_1_opaque_f16
14128.42 xferu64_aa_1_alpha_f16
BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
==========
Description was changed to
==========
On ARMv8, we definitely have NEON f16 <-> f32 instructions.
... unfortunately GCC 4.9 doesn't seem to know that. We could
work around that with inline assembly, but I don't feel like
pandering to old compilers. Instead, check for Clang, falling back
to ARMv7-compatible NEON for ARMv8 GCC and, of course, ARMv7.
This means that on ARMv8, half-float is a faster storage format
than uint16_t. Though some of that is due to garbage code generation
for the uint16_t case on GCC.
Clang:
1113.04 xferu64_bw_1_opaque_u16
7707.76 xferu64_bw_1_alpha_u16
10333.98 xferu64_aa_1_opaque_u16
13723.14 xferu64_aa_1_alpha_u16
1112.06 xferu64_bw_1_opaque_f16
6059.57 xferu64_bw_1_alpha_f16
8778.08 xferu64_aa_1_opaque_f16
11973.88 xferu64_aa_1_alpha_f16
GCC:
597.99 xferu64_bw_1_opaque_u16
53036.52 xferu64_bw_1_alpha_u16
56328.17 xferu64_aa_1_opaque_u16
63196.74 xferu64_aa_1_alpha_u16
575.16 xferu64_bw_1_opaque_f16
8866.49 xferu64_bw_1_alpha_f16
11050.74 xferu64_aa_1_opaque_f16
14128.42 xferu64_aa_1_alpha_f16
BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
==========
Description was changed to
==========
On ARMv8, we definitely have NEON f16 <-> f32 instructions.
... unfortunately GCC 4.9 doesn't seem to know that. We could
work around that with inline assembly, but I don't feel like
pandering to old compilers. Instead, check for Clang, falling back
to ARMv7-compatible NEON for ARMv8 GCC and, of course, ARMv7.
This means that on ARMv8, half-float is a faster storage format
than uint16_t. Though some of that is due to garbage code generation
for the uint16_t case on GCC.
N5x, Clang:
1113.04 xferu64_bw_1_opaque_u16
7707.76 xferu64_bw_1_alpha_u16
10333.98 xferu64_aa_1_opaque_u16
13723.14 xferu64_aa_1_alpha_u16
1112.06 xferu64_bw_1_opaque_f16
6059.57 xferu64_bw_1_alpha_f16
8778.08 xferu64_aa_1_opaque_f16
11973.88 xferu64_aa_1_alpha_f16
N5x, GCC:
597.99 xferu64_bw_1_opaque_u16
53036.52 xferu64_bw_1_alpha_u16
56328.17 xferu64_aa_1_opaque_u16
63196.74 xferu64_aa_1_alpha_u16
575.16 xferu64_bw_1_opaque_f16
8866.49 xferu64_bw_1_alpha_f16
11050.74 xferu64_aa_1_opaque_f16
14128.42 xferu64_aa_1_alpha_f16
BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
==========
Description was changed to
==========
On ARMv8, we definitely have NEON f16 <-> f32 instructions.
... unfortunately GCC 4.9 doesn't seem to know that. We could
work around that with inline assembly, but I don't feel like
pandering to old compilers. Instead, check for Clang, falling back
to ARMv7-compatible NEON for ARMv8 GCC and, of course, ARMv7.
This means that on ARMv8, half-float is a faster storage format
than uint16_t. Though some of that is due to garbage code generation
for the uint16_t case on GCC.
N5x, Clang:
1113.04 xferu64_bw_1_opaque_u16
7707.76 xferu64_bw_1_alpha_u16
10333.98 xferu64_aa_1_opaque_u16
13723.14 xferu64_aa_1_alpha_u16
1112.06 xferu64_bw_1_opaque_f16
6059.57 xferu64_bw_1_alpha_f16
8778.08 xferu64_aa_1_opaque_f16
11973.88 xferu64_aa_1_alpha_f16
N5x, GCC:
597.99 xferu64_bw_1_opaque_u16
53036.52 xferu64_bw_1_alpha_u16
56328.17 xferu64_aa_1_opaque_u16
63196.74 xferu64_aa_1_alpha_u16
575.16 xferu64_bw_1_opaque_f16
8866.49 xferu64_bw_1_alpha_f16
11050.74 xferu64_aa_1_opaque_f16
14128.42 xferu64_aa_1_alpha_f16
N5, GCC:
1028.12 xferu64_bw_1_opaque_u16
38204.10 xferu64_bw_1_alpha_u16
44265.87 xferu64_aa_1_opaque_u16
46950.93 xferu64_aa_1_alpha_u16
911.87 xferu64_bw_1_opaque_f16
11553.22 xferu64_bw_1_alpha_f16
15076.66 xferu64_aa_1_opaque_f16
20457.03 xferu64_aa_1_alpha_f16
BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
==========
Description was changed to
==========
On ARMv8, we definitely have NEON f16 <-> f32 instructions.
... unfortunately GCC 4.9 doesn't seem to know that. We could
work around that with inline assembly, but I don't feel like
pandering to old compilers. Instead, check for Clang, falling back
to ARMv7-compatible NEON for ARMv8 GCC and, of course, ARMv7.
This means that on ARMv8, half-float is a faster storage format
than uint16_t. Though some of that is due to garbage code generation
for the uint16_t case on GCC.
N5x (ARMv8), Clang:
1113.04 xferu64_bw_1_opaque_u16
7707.76 xferu64_bw_1_alpha_u16
10333.98 xferu64_aa_1_opaque_u16
13723.14 xferu64_aa_1_alpha_u16
1112.06 xferu64_bw_1_opaque_f16
6059.57 xferu64_bw_1_alpha_f16
8778.08 xferu64_aa_1_opaque_f16
11973.88 xferu64_aa_1_alpha_f16
N5x (ARMv8), GCC:
597.99 xferu64_bw_1_opaque_u16
53036.52 xferu64_bw_1_alpha_u16
56328.17 xferu64_aa_1_opaque_u16
63196.74 xferu64_aa_1_alpha_u16
575.16 xferu64_bw_1_opaque_f16
8866.49 xferu64_bw_1_alpha_f16
11050.74 xferu64_aa_1_opaque_f16
14128.42 xferu64_aa_1_alpha_f16
N5 (ARMv7), Clang:
N5 (ARMv7), GCC:
1028.12 xferu64_bw_1_opaque_u16
38204.10 xferu64_bw_1_alpha_u16
44265.87 xferu64_aa_1_opaque_u16
46950.93 xferu64_aa_1_alpha_u16
911.87 xferu64_bw_1_opaque_f16
11553.22 xferu64_bw_1_alpha_f16
15076.66 xferu64_aa_1_opaque_f16
20457.03 xferu64_aa_1_alpha_f16
BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
==========
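To read the tables above as relative costs, divide each f16 time by the matching u16 time. A ratio below 1 means the f16 variant took less time. The small sketch below does this arithmetic for the N5x (Clang and GCC) and N5 (GCC) numbers in this version of the description. The timings are copied verbatim, and the benchmark names are shortened by dropping the `xferu64_` prefix and the format suffix. `Pair`, `timings`, and `f16_over_u16` are hypothetical names.

```cpp
#include <cstdio>
#include <vector>

// One benchmark timed with both storage formats (numbers copied from the description).
struct Pair {
    const char* config;
    const char* bench;
    double u16, f16;
};

inline std::vector<Pair> timings() {
    return {
        {"N5x Clang", "bw_1_opaque",  1113.04,  1112.06},
        {"N5x Clang", "bw_1_alpha",   7707.76,  6059.57},
        {"N5x Clang", "aa_1_opaque", 10333.98,  8778.08},
        {"N5x Clang", "aa_1_alpha",  13723.14, 11973.88},
        {"N5x GCC",   "bw_1_opaque",   597.99,   575.16},
        {"N5x GCC",   "bw_1_alpha",  53036.52,  8866.49},
        {"N5x GCC",   "aa_1_opaque", 56328.17, 11050.74},
        {"N5x GCC",   "aa_1_alpha",  63196.74, 14128.42},
        {"N5 GCC",    "bw_1_opaque",  1028.12,   911.87},
        {"N5 GCC",    "bw_1_alpha",  38204.10, 11553.22},
        {"N5 GCC",    "aa_1_opaque", 44265.87, 15076.66},
        {"N5 GCC",    "aa_1_alpha",  46950.93, 20457.03},
    };
}

// Ratio < 1.0 means the f16 benchmark ran in less time than its u16 counterpart.
inline double f16_over_u16(const Pair& p) { return p.f16 / p.u16; }

inline void print_ratios() {
    for (const Pair& p : timings()) {
        std::printf("%-10s %-12s f16/u16 = %.3f\n", p.config, p.bench, f16_over_u16(p));
    }
}
```

On these numbers, no f16 row is slower than its u16 counterpart. The two are nearly identical for bw_1_opaque on N5x Clang. The gap is largest under GCC, where the description attributes part of it to poor code generation for the uint16_t case.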
Description was changed to
==========
On ARMv8, we definitely have NEON f16 <-> f32 instructions.
... unfortunately GCC 4.9 doesn't seem to know that. We could
work around that with inline assembly, but I don't feel like
pandering to old compilers. Instead, check for Clang, falling back
to ARMv7-compatible NEON for ARMv8 GCC and, of course, ARMv7.
This means that on ARMv8, half-float is a faster storage format
than uint16_t. Though some of that is due to garbage code generation
for the uint16_t case on GCC.
N5x (ARMv8), Clang:
1113.04 xferu64_bw_1_opaque_u16
7707.76 xferu64_bw_1_alpha_u16
10333.98 xferu64_aa_1_opaque_u16
13723.14 xferu64_aa_1_alpha_u16
1112.06 xferu64_bw_1_opaque_f16
6059.57 xferu64_bw_1_alpha_f16
8778.08 xferu64_aa_1_opaque_f16
11973.88 xferu64_aa_1_alpha_f16
N5x (ARMv8), GCC:
597.99 xferu64_bw_1_opaque_u16
53036.52 xferu64_bw_1_alpha_u16
56328.17 xferu64_aa_1_opaque_u16
63196.74 xferu64_aa_1_alpha_u16
575.16 xferu64_bw_1_opaque_f16
8866.49 xferu64_bw_1_alpha_f16
11050.74 xferu64_aa_1_opaque_f16
14128.42 xferu64_aa_1_alpha_f16
N5 (ARMv7), Clang:
470.13 xferu64_bw_1_opaque_u16
17775.88 xferu64_bw_1_alpha_u16
20440.19 xferu64_aa_1_opaque_u16
25235.11 xferu64_aa_1_alpha_u16
464.99 xferu64_bw_1_opaque_f16
10631.84 xferu64_bw_1_alpha_f16
13293.95 xferu64_aa_1_opaque_f16
18150.39 xferu64_aa_1_alpha_f16
N5 (ARMv7), GCC:
1028.12 xferu64_bw_1_opaque_u16
38204.10 xferu64_bw_1_alpha_u16
44265.87 xferu64_aa_1_opaque_u16
46950.93 xferu64_aa_1_alpha_u16
911.87 xferu64_bw_1_opaque_f16
11553.22 xferu64_bw_1_alpha_f16
15076.66 xferu64_aa_1_opaque_f16
20457.03 xferu64_aa_1_alpha_f16
BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
==========
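The "check for Clang, falling back" approach can be sketched with a preprocessor guard. This is a minimal sketch and not the CL's code. `halves_to_floats4` and `half_to_float_finite` are hypothetical names. The fast path assumes Clang's `arm_neon.h` on AArch64 (`vcvt_f32_f16`, `vreinterpret_f16_u16`). The fallback is portable scalar C++ instead of the ARMv7 NEON path the description uses, so the sketch builds anywhere.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#if defined(__aarch64__) && defined(__clang__)
  #include <arm_neon.h>
#endif

// Hypothetical scalar fallback: IEEE half -> float for finite inputs only
// (Inf and NaN come out as large finite values).
static inline float half_to_float_finite(uint16_t h) {
    // Line up the 5-bit exponent and 10-bit mantissa with the float fields.
    uint32_t bits = static_cast<uint32_t>(h & 0x7fff) << 13;
    float f;
    std::memcpy(&f, &bits, sizeof f);
    // Rebias the exponent from 15 to 127 by scaling with 2^112. Half denormals
    // become float normals here, provided float denormals are not flushed.
    f *= std::ldexp(1.0f, 112);
    std::memcpy(&bits, &f, sizeof bits);
    bits |= static_cast<uint32_t>(h & 0x8000) << 16;  // reattach the sign
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Hypothetical 4-wide entry point mirroring the Clang check in the description.
static inline void halves_to_floats4(const uint16_t src[4], float dst[4]) {
#if defined(__aarch64__) && defined(__clang__)
    // AArch64 + Clang: hardware half -> float conversion via the intrinsic.
    float16x4_t h = vreinterpret_f16_u16(vld1_u16(src));
    vst1q_f32(dst, vcvt_f32_f16(h));
#else
    // Everything else in this sketch: portable scalar loop.
    for (int i = 0; i < 4; i++) {
        dst[i] = half_to_float_finite(src[i]);
    }
#endif
}
```

The scalar fallback multiplies by 2^112 because 112 = 127 - 15, the difference between the float and half exponent biases. That covers zeros, normals, and denormals. The hardware `vcvt_f32_f16` path also handles Inf and NaN correctly.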
Description was changed to
==========
NEON f32 <-> f16 and f32 <-> u16
Adds f32 <-> f16 ARMv7 and ARMv8 NEON code.
To make it a fair comparison, also adds NEON f32 <-> u16 code, which was just
TODO.
The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest,
so we fall back on my ARMv7 version there. The ARMv7 version is different
enough from the SSE version that it does not make sense to use SkNx.
In all cases, f16 is at least competitive with u16.
Nexus 5 (ARMv7), GCC:
10218.75 xferu64_bw_1_alpha_u16 nonrendering
12868.90 xferu64_aa_1_opaque_u16 nonrendering
19093.02 xferu64_aa_1_alpha_u16 nonrendering
11520.75 xferu64_bw_1_alpha_f16 nonrendering
15064.45 xferu64_aa_1_opaque_f16 nonrendering
20384.28 xferu64_aa_1_alpha_f16 nonrendering
Nexus 5 (ARMv7), Clang:
Nexus 5x (ARMv8), GCC:
Nexus 5x (ARMv8), Clang:
BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
==========
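When checking a NEON float -> half path, denormal outputs are the easy thing to get wrong, so a scalar reference with round-to-nearest-even is useful to compare against. The following sketches a common bit-manipulation approach. It is an assumption and not taken from this CL. `float_to_half_rtne` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical scalar reference: float -> IEEE half, round-to-nearest-even.
// Handles normals, denormals, overflow to Inf, and NaN.
static inline uint16_t float_to_half_rtne(float x) {
    uint32_t u;
    std::memcpy(&u, &x, sizeof u);
    const uint32_t sign = u & 0x80000000u;
    u ^= sign;  // work on |x|

    uint16_t out;
    if (u >= (143u << 23)) {
        // |x| >= 65536 (or Inf/NaN): NaN -> quiet NaN, everything else -> Inf.
        out = (u > (255u << 23)) ? 0x7e00 : 0x7c00;
    } else if (u < (113u << 23)) {
        // |x| < 2^-14: the result is a half denormal or zero. Adding 0.5f lines
        // the 10 half mantissa bits up at the bottom of the float mantissa, and
        // the FPU's round-to-nearest-even does the rounding for us.
        float f;
        std::memcpy(&f, &u, sizeof f);
        f += 0.5f;
        uint32_t v;
        std::memcpy(&v, &f, sizeof v);
        out = static_cast<uint16_t>(v - 0x3f000000u);  // remove the bits of 0.5f
    } else {
        // Normal half: rebias the exponent (127 -> 15), then round the 13
        // dropped mantissa bits to nearest, ties to even.
        const uint32_t mant_odd = (u >> 13) & 1;
        u -= 112u << 23;
        u += 0xfff + mant_odd;
        out = static_cast<uint16_t>(u >> 13);
    }
    return static_cast<uint16_t>(out | (sign >> 16));
}
```

Feeding a vector implementation inputs below 2^-14 (the smallest normal half) and comparing against this reference should flag a path that flushes those values to 0x0000.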
Description was changed to
==========
NEON f32 <-> f16 and f32 <-> u16
Adds f32 <-> f16 ARMv7 and ARMv8 NEON code.
To make it a fair comparison, also adds NEON f32 <-> u16 code, which was a TODO.
The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we fall back on my ARMv7 version there.
The ARMv7 version is different enough from the SSE version that it does not make sense to use SkNx.
In all cases, f16 is at least competitive with u16.
Nexus 5 (ARMv7), GCC:
10218.75 xferu64_bw_1_alpha_u16 nonrendering
12868.90 xferu64_aa_1_opaque_u16 nonrendering
19093.02 xferu64_aa_1_alpha_u16 nonrendering
11520.75 xferu64_bw_1_alpha_f16 nonrendering
15064.45 xferu64_aa_1_opaque_f16 nonrendering
20384.28 xferu64_aa_1_alpha_f16 nonrendering
Nexus 5 (ARMv7), Clang:
Nexus 5x (ARMv8), GCC:
Nexus 5x (ARMv8), Clang:
BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
==========
Description was changed to
==========
NEON f32 <-> f16 and f32 <-> u16
Adds f32 <-> f16 ARMv7 and ARMv8 NEON code.
To make it a fair comparison, also adds NEON f32 <-> u16 code, which was a TODO.
The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we fall back on my ARMv7 version there.
The ARMv7 version is different enough from the SSE version that it does not make sense to use SkNx.
In all cases, f16 is at least competitive with u16.
Nexus 5 (ARMv7), GCC:
10218.75 xferu64_bw_1_alpha_u16 nonrendering
12868.90 xferu64_aa_1_opaque_u16 nonrendering
19093.02 xferu64_aa_1_alpha_u16 nonrendering
11520.75 xferu64_bw_1_alpha_f16 nonrendering
15064.45 xferu64_aa_1_opaque_f16 nonrendering
20384.28 xferu64_aa_1_alpha_f16 nonrendering
Nexus 5 (ARMv7), Clang:
Nexus 5x (ARMv8), GCC:
Nexus 5x (ARMv8), Clang:
BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
==========
Description was changed to
==========
NEON f32 <-> f16 and f32 <-> u16
Adds f32 <-> f16 ARMv7 and ARMv8 NEON code.
To make it a fair comparison, also adds NEON f32 <-> u16 code, which was a TODO.
The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we fall back on my ARMv7 version there.
The ARMv7 version is different enough from the SSE version that it does not make sense to use SkNx.
f16 is at least competitive with u16, and faster with proper ARMv8.
Nexus 5 (ARMv7), GCC:
10218.75 xferu64_bw_1_alpha_u16 nonrendering
12868.90 xferu64_aa_1_opaque_u16 nonrendering
19093.02 xferu64_aa_1_alpha_u16 nonrendering
11520.75 xferu64_bw_1_alpha_f16 nonrendering
15064.45 xferu64_aa_1_opaque_f16 nonrendering
20384.28 xferu64_aa_1_alpha_f16 nonrendering
Nexus 5 (ARMv7), Clang:
Nexus 5x (ARMv8), GCC:
Nexus 5x (ARMv8), Clang:
BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
==========
Description was changed to
==========
NEON f32 <-> f16 and f32 <-> u16
Adds f32 <-> f16 ARMv7 and ARMv8 NEON code.
For a fair comparison, also adds NEON f32 <-> u16 code, which was a TODO.
The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we fall back on my ARMv7 version there.
The ARMv7 version is different enough from the SSE version that it does not make sense to use SkNx.
f16 is at least competitive with u16, and faster with proper ARMv8.
Nexus 5 (ARMv7), GCC:
10218.75 xferu64_bw_1_alpha_u16 nonrendering
12868.90 xferu64_aa_1_opaque_u16 nonrendering
19093.02 xferu64_aa_1_alpha_u16 nonrendering
11520.75 xferu64_bw_1_alpha_f16 nonrendering
15064.45 xferu64_aa_1_opaque_f16 nonrendering
20384.28 xferu64_aa_1_alpha_f16 nonrendering
Nexus 5 (ARMv7), Clang:
Nexus 5x (ARMv8), GCC:
Nexus 5x (ARMv8), Clang:
BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
==========
Description was changed to
==========
NEON f32 <-> f16 and f32 <-> u16
Adds f32 <-> f16 ARMv7 and ARMv8 NEON code.
For a fair comparison, also adds NEON f32 <-> u16 code, which was a TODO.
The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we fall back on my ARMv7 version there.
The ARMv7 version is different enough from the SSE version that it does not make sense to use SkNx.
f16 is at least competitive with u16, and faster with proper ARMv8.
Nexus 5 (ARMv7), GCC:
10218.75 xferu64_bw_1_alpha_u16 nonrendering
12868.90 xferu64_aa_1_opaque_u16 nonrendering
19093.02 xferu64_aa_1_alpha_u16 nonrendering
11520.75 xferu64_bw_1_alpha_f16 nonrendering
15064.45 xferu64_aa_1_opaque_f16 nonrendering
20384.28 xferu64_aa_1_alpha_f16 nonrendering
Nexus 5 (ARMv7), Clang:
17812.26 xferu64_bw_1_alpha_u16 nonrendering
20440.92 xferu64_aa_1_opaque_u16 nonrendering
25239.75 ! xferu64_aa_1_alpha_u16 nonrendering
10631.35 xferu64_bw_1_alpha_f16 nonrendering
13285.64 xferu64_aa_1_opaque_f16 nonrendering
18147.22 xferu64_aa_1_alpha_f16 nonrendering
Nexus 5x (ARMv8), GCC:
Nexus 5x (ARMv8), Clang:
BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
==========
Description was changed to
==========
NEON f32 <-> f16 and f32 <-> u16
Adds f32 <-> f16 ARMv7 and ARMv8 NEON code.
For a fair comparison, also adds NEON f32 <-> u16 code, which was a TODO.
The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we fall back on my ARMv7 version there.
The ARMv7 version is different enough from the SSE version that it does not make sense to use SkNx.
f16 is at least competitive with u16.
Nexus 5 (ARMv7), GCC:
10218.75 xferu64_bw_1_alpha_u16 nonrendering
12868.90 xferu64_aa_1_opaque_u16 nonrendering
19093.02 xferu64_aa_1_alpha_u16 nonrendering
11520.75 xferu64_bw_1_alpha_f16 nonrendering
15064.45 xferu64_aa_1_opaque_f16 nonrendering
20384.28 xferu64_aa_1_alpha_f16 nonrendering
Nexus 5 (ARMv7), Clang:
17812.26 xferu64_bw_1_alpha_u16 nonrendering
20440.92 xferu64_aa_1_opaque_u16 nonrendering
25239.75 ! xferu64_aa_1_alpha_u16 nonrendering
10631.35 xferu64_bw_1_alpha_f16 nonrendering
13285.64 xferu64_aa_1_opaque_f16 nonrendering
18147.22 xferu64_aa_1_alpha_f16 nonrendering
Nexus 5x (ARMv8), GCC:
Nexus 5x (ARMv8), Clang:
BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
==========
Description was changed to
==========
NEON f32 <-> f16 and f32 <-> u16
Adds f32 <-> f16 ARMv7 and ARMv8 NEON code.
For a fair comparison, also adds NEON f32 <-> u16 code, which was a TODO.
The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we fall back on my ARMv7 version there.
The ARMv7 version is different enough from the SSE version that it does not make sense to use SkNx.
f16 is at least competitive with u16.
Nexus 5 (ARMv7), GCC:
10218.75 xferu64_bw_1_alpha_u16 nonrendering
12868.90 xferu64_aa_1_opaque_u16 nonrendering
19093.02 xferu64_aa_1_alpha_u16 nonrendering
11520.75 xferu64_bw_1_alpha_f16 nonrendering
15064.45 xferu64_aa_1_opaque_f16 nonrendering
20384.28 xferu64_aa_1_alpha_f16 nonrendering
Nexus 5 (ARMv7), Clang:
17812.26 xferu64_bw_1_alpha_u16 nonrendering
20440.92 xferu64_aa_1_opaque_u16 nonrendering
25239.75 ! xferu64_aa_1_alpha_u16 nonrendering
10631.35 xferu64_bw_1_alpha_f16 nonrendering
13285.64 xferu64_aa_1_opaque_f16 nonrendering
18147.22 xferu64_aa_1_alpha_f16 nonrendering
Nexus 5x (ARMv8), GCC:
8604.82 ! xferu64_bw_1_alpha_u16 nonrendering
12658.99 xferu64_aa_1_opaque_u16 nonrendering
14555.23 xferu64_aa_1_alpha_u16 nonrendering
8876.97 xferu64_bw_1_alpha_f16 nonrendering
11141.55 ? xferu64_aa_1_opaque_f16 nonrendering
14257.30 xferu64_aa_1_alpha_f16 nonrendering
Nexus 5x (ARMv8), Clang:
BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
==========
Description was changed to
==========
NEON f32 <-> f16 and f32 <-> u16
Adds f32 <-> f16 ARMv7 and ARMv8 NEON code.
For a fair comparison, also adds NEON f32 <-> u16 code, which was a TODO.
The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we fall back on my ARMv7 version there.
The ARMv7 version is different enough from the SSE version that it does not make sense to use SkNx.
f16 is at least competitive with u16.
Nexus 5 (ARMv7), GCC:
10218.75 xferu64_bw_1_alpha_u16 nonrendering
12868.90 xferu64_aa_1_opaque_u16 nonrendering
19093.02 xferu64_aa_1_alpha_u16 nonrendering
11520.75 xferu64_bw_1_alpha_f16 nonrendering
15064.45 xferu64_aa_1_opaque_f16 nonrendering
20384.28 xferu64_aa_1_alpha_f16 nonrendering
Nexus 5 (ARMv7), Clang:
17812.26 xferu64_bw_1_alpha_u16 nonrendering
20440.92 xferu64_aa_1_opaque_u16 nonrendering
25239.75 ! xferu64_aa_1_alpha_u16 nonrendering
10631.35 xferu64_bw_1_alpha_f16 nonrendering
13285.64 xferu64_aa_1_opaque_f16 nonrendering
18147.22 xferu64_aa_1_alpha_f16 nonrendering
Nexus 5x (ARMv8), GCC:
8604.82 ! xferu64_bw_1_alpha_u16 nonrendering
12658.99 xferu64_aa_1_opaque_u16 nonrendering
14555.23 xferu64_aa_1_alpha_u16 nonrendering
8876.97 xferu64_bw_1_alpha_f16 nonrendering
11141.55 ? xferu64_aa_1_opaque_f16 nonrendering
14257.30 xferu64_aa_1_alpha_f16 nonrendering
Nexus 5x (ARMv8), Clang:
7795.90 ? xferu64_bw_1_alpha_u16 nonrendering
10327.39 xferu64_aa_1_opaque_u16 nonrendering
13880.62 xferu64_aa_1_alpha_u16 nonrendering
6064.70 xferu64_bw_1_alpha_f16 nonrendering
8782.47 xferu64_aa_1_opaque_f16 nonrendering
11970.70 xferu64_aa_1_alpha_f16 nonrendering
BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
==========
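With all four configurations filled in, the "at least competitive" claim can be checked by dividing each u16 timing by the matching f16 one. This is a short sketch that uses only the xferu64_bw_1_alpha rows. It assumes lower numbers are faster, which is consistent with the Nexus 5x Clang f16 numbers being both smaller and described elsewhere in the thread as faster. The names are illustrative only.

```cpp
#include <cassert>
#include <cstdio>

// xferu64_bw_1_alpha timings copied from the message above; lower is assumed faster.
struct U16VsF16 {
    const char* config;
    double u16;
    double f16;
};

static const U16VsF16 kBw1Alpha[] = {
    {"Nexus 5 (ARMv7), GCC",   10218.75, 11520.75},
    {"Nexus 5 (ARMv7), Clang", 17812.26, 10631.35},
    {"Nexus 5x (ARMv8), GCC",   8604.82,  8876.97},
    {"Nexus 5x (ARMv8), Clang", 7795.90,  6064.70},
};

// Greater than 1.0 means the f16 variant ran faster than the u16 variant.
static double u16_over_f16(const U16VsF16& row) {
    return row.u16 / row.f16;
}

static void print_ratios() {
    for (const U16VsF16& row : kBw1Alpha) {
        std::printf("%-24s u16/f16 = %.2f\n", row.config, u16_over_f16(row));
    }
}
```

print_ratios() prints 0.89, 1.68, 0.97 and 1.29 in table order. f16 is about 13% slower on Nexus 5 GCC and about 3% slower on Nexus 5x GCC, and clearly faster with Clang on both devices.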
The CQ bit was checked by mtklein@google.com to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1700473003/100001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1700473003/100001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
Description was changed to
==========
NEON f32 <-> f16 and f32 <-> u16
Adds f32 <-> f16 ARMv7 and ARMv8 NEON code.
Also adds NEON f32 <-> u16 code to make the comparison fair.
The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we fall back on my ARMv7 version there.
The ARMv7 version is different enough from the SSE version that it does not make sense to use SkNx.
f16 is at least competitive with u16.
Nexus 5 (ARMv7), GCC:
10218.75 xferu64_bw_1_alpha_u16 nonrendering
12868.90 xferu64_aa_1_opaque_u16 nonrendering
19093.02 xferu64_aa_1_alpha_u16 nonrendering
11520.75 xferu64_bw_1_alpha_f16 nonrendering
15064.45 xferu64_aa_1_opaque_f16 nonrendering
20384.28 xferu64_aa_1_alpha_f16 nonrendering
Nexus 5 (ARMv7), Clang:
17812.26 xferu64_bw_1_alpha_u16 nonrendering
20440.92 xferu64_aa_1_opaque_u16 nonrendering
25239.75 ! xferu64_aa_1_alpha_u16 nonrendering
10631.35 xferu64_bw_1_alpha_f16 nonrendering
13285.64 xferu64_aa_1_opaque_f16 nonrendering
18147.22 xferu64_aa_1_alpha_f16 nonrendering
Nexus 5x (ARMv8), GCC:
8604.82 ! xferu64_bw_1_alpha_u16 nonrendering
12658.99 xferu64_aa_1_opaque_u16 nonrendering
14555.23 xferu64_aa_1_alpha_u16 nonrendering
8876.97 xferu64_bw_1_alpha_f16 nonrendering
11141.55 ? xferu64_aa_1_opaque_f16 nonrendering
14257.30 xferu64_aa_1_alpha_f16 nonrendering
Nexus 5x (ARMv8), Clang:
7795.90 ? xferu64_bw_1_alpha_u16 nonrendering
10327.39 xferu64_aa_1_opaque_u16 nonrendering
13880.62 xferu64_aa_1_alpha_u16 nonrendering
6064.70 xferu64_bw_1_alpha_f16 nonrendering
8782.47 xferu64_aa_1_opaque_f16 nonrendering
11970.70 xferu64_aa_1_alpha_f16 nonrendering
BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
==========
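On the u16 side of the comparison, one plausible NEON encoding is to clamp, scale by 65535, add 0.5, and truncate. This sketch assumes u16 means unorm16 (0.0 maps to 0 and 1.0 maps to 65535), which the thread does not spell out. The function names are hypothetical. The NEON branch uses only ARMv7-era intrinsics, so it should build for both ARMv7 and AArch64. Other targets get a scalar loop.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__ARM_NEON)
  #include <arm_neon.h>
#endif

// Hypothetical helpers; assumes u16 is unorm16 (0.0f <-> 0, 1.0f <-> 65535)
// and finite inputs.
static inline void floats_to_u16x4(const float src[4], uint16_t dst[4]) {
#if defined(__ARM_NEON)
    float32x4_t f = vld1q_f32(src);
    // Clamp to [0, 1] so the unsigned convert cannot overflow.
    f = vminq_f32(vmaxq_f32(f, vdupq_n_f32(0.0f)), vdupq_n_f32(1.0f));
    // 0.5f + f * 65535.0f, then truncate: round-half-up for non-negative input.
    f = vmlaq_f32(vdupq_n_f32(0.5f), f, vdupq_n_f32(65535.0f));
    vst1_u16(dst, vmovn_u32(vcvtq_u32_f32(f)));
#else
    for (int i = 0; i < 4; i++) {
        float f = src[i] < 0.0f ? 0.0f : (src[i] > 1.0f ? 1.0f : src[i]);
        dst[i] = static_cast<uint16_t>(f * 65535.0f + 0.5f);
    }
#endif
}

static inline void u16x4_to_floats(const uint16_t src[4], float dst[4]) {
    // Both branches multiply by the same reciprocal so they agree bit-for-bit.
    const float kScale = 1.0f / 65535.0f;
#if defined(__ARM_NEON)
    uint32x4_t wide = vmovl_u16(vld1_u16(src));  // zero-extend to 32 bits
    vst1q_f32(dst, vmulq_f32(vcvtq_f32_u32(wide), vdupq_n_f32(kScale)));
#else
    for (int i = 0; i < 4; i++) {
        dst[i] = static_cast<float>(src[i]) * kScale;
    }
#endif
}
```

Adding 0.5 before the truncating `vcvtq_u32_f32` is a cheap way to get rounding on ARMv7, which lacks the round-to-nearest float-to-int conversions that ARMv8 adds.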
Description was changed to
==========
NEON f32 <-> f16 and f32 <-> u16
Adds f32 <-> f16 ARMv7 and ARMv8 NEON code.
Also adds NEON f32 <-> u16 code to make the comparison fair.
The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we fall back on my ARMv7 version there.
The ARMv7 version is different enough from the SSE version that it does not make sense to use SkNx.
f16 is at least competitive with u16.
Nexus 5 (ARMv7), GCC:
10218.75 xferu64_bw_1_alpha_u16 nonrendering
12868.90 xferu64_aa_1_opaque_u16 nonrendering
19093.02 xferu64_aa_1_alpha_u16 nonrendering
11520.75 xferu64_bw_1_alpha_f16 nonrendering
15064.45 xferu64_aa_1_opaque_f16 nonrendering
20384.28 xferu64_aa_1_alpha_f16 nonrendering
Nexus 5 (ARMv7), Clang:
17812.26 xferu64_bw_1_alpha_u16 nonrendering
20440.92 xferu64_aa_1_opaque_u16 nonrendering
25239.75 ! xferu64_aa_1_alpha_u16 nonrendering
10631.35 xferu64_bw_1_alpha_f16 nonrendering
13285.64 xferu64_aa_1_opaque_f16 nonrendering
18147.22 xferu64_aa_1_alpha_f16 nonrendering
Nexus 5x (ARMv8), GCC:
8604.82 ! xferu64_bw_1_alpha_u16 nonrendering
12658.99 xferu64_aa_1_opaque_u16 nonrendering
14555.23 xferu64_aa_1_alpha_u16 nonrendering
8876.97 xferu64_bw_1_alpha_f16 nonrendering
11141.55 ? xferu64_aa_1_opaque_f16 nonrendering
14257.30 xferu64_aa_1_alpha_f16 nonrendering
Nexus 5x (ARMv8), Clang:
7795.90 ? xferu64_bw_1_alpha_u16 nonrendering
10327.39 xferu64_aa_1_opaque_u16 nonrendering
13880.62 xferu64_aa_1_alpha_u16 nonrendering
6064.70 xferu64_bw_1_alpha_f16 nonrendering
8782.47 xferu64_aa_1_opaque_f16 nonrendering
11970.70 xferu64_aa_1_alpha_f16 nonrendering
BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
CQ_EXTRA_TRYBOTS=client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot;client.skia.android:Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot,Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Release-Trybot
==========
The CQ bit was checked by mtklein@google.com to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1700473003/100001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1700473003/100001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot on client.skia.android (JOB_FAILED, http://build.chromium.org/p/client.skia.android/builders/Test-Android-GCC-Nex...)
lgtm
reed@google.com changed reviewers: + msarett@google.com
Description was changed: the CQ_EXTRA_TRYBOTS line was reordered to list client.skia.android first. The rest of the description, including the benchmark results, is unchanged from the version above.
The CQ bit was checked by mtklein@google.com
The patchset sent to the CQ was uploaded after l-g-t-m from reed@google.com Link to the patchset: https://codereview.chromium.org/1700473003/#ps120001 (title: "back off from ARMv7")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1700473003/120001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1700473003/120001
The CQ bit was unchecked by mtklein@google.com
Description was changed from the version above to ==========
NEON f32 <-> f16 and f32 <-> u16
Adds f32 <-> f16 ARMv7 and ARMv8 NEON code.
Also adds NEON f32 <-> u16 code to make the comparison fair.
The NDK GCC does not support the ARMv8 NEON intrinsics needed to go fastest, so we use a tiny amount of inline assembly.
The ARMv7 half -> float is different enough from the SSE version that it does not make sense to use SkNx.
Still TODO:
ARMv7 float -> half. Naively translating the SSE version results in 0x0000 where we'd expect a denormal output.
BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
CQ_EXTRA_TRYBOTS=client.skia.android:Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot,Test-Android-GCC-Nexus9-CPU-Denver-Arm64-Release-Trybot;client.skia:Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-SKNX_NO_SIMD-Trybot
==========
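The following scalar sketch models the NEON-style half -> float idea. It is based on mtklein's later reply in this thread: the NEON code always computes two results, a shift-and-add result that is correct for normal inputs and a vcvtq_n_f32_u32 fixed-point result that is correct for denormal inputs. Several details are assumptions. The function and helper names are hypothetical. The count of 24 fractional bits is derived here, since a half denormal equals mantissa * 2^-24. Infinities and NaNs are ignored. This is not the SkHalf.h code.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static float bits_to_float(uint32_t u) { float f; std::memcpy(&f, &u, sizeof f); return f; }
static uint32_t float_to_bits(float f) { uint32_t u; std::memcpy(&u, &f, sizeof u); return u; }

// Hypothetical scalar model of a "compute both, then select" half -> float.
// Ignores infinities and NaNs (half exponent field 31).
float half_to_float_both_paths(uint16_t h) {
    const uint32_t sign = static_cast<uint32_t>(h & 0x8000u) << 16;
    const uint32_t em   = h & 0x7fffu;  // exponent and mantissa bits

    // Normal inputs: move the bits into float position and rebias the
    // exponent from half's bias (15) to float's bias (127).
    const uint32_t norm = (em << 13) + ((127u - 15u) << 23);

    // Denormal inputs: the value is em * 2^-24, i.e. em read as a fixed-point
    // number with 24 fractional bits (what vcvtq_n_f32_u32(x, 24) computes).
    const uint32_t denorm = float_to_bits(static_cast<float>(em) / 16777216.0f);

    // Both results are always computed; choose one per input.
    return bits_to_float(sign | (em < 0x0400u ? denorm : norm));
}
```

In SIMD form, the final ternary would become a lane-wise blend, such as a mask from vcltq_u32 fed to vbslq_u32. Computing both paths unconditionally avoids per-lane branches.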
The CQ bit was checked by mtklein@google.com to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1700473003/140001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1700473003/140001
Description was changed: an incomplete "Speed summary: ARMv8, GCC: f16 is about 20% faster than u16 ARMv8, clang:" section was added after the Still TODO note. The description is otherwise unchanged.
Description was changed: the incomplete Speed summary section was removed again. The description is otherwise unchanged.
This is probably a good time to take a(nother) look. I've changed two things: 1) inline assembly lets us be fast on ARMv8 no matter the compiler; 2) the ARMv7 float -> half code was wrong for denormal outputs, so I've removed it for this CL. Will follow up.
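Why might a direct port produce 0x0000? One plausible explanation, not stated in the thread, is as follows. The SSE code presumably scales by the rebias constant (2^-112) quoted in the review, so half-denormal outputs pass through the float denormal range. ARMv7 NEON arithmetic always flushes denormal results to zero. The sketch below models that path with a hypothetical flush_to_zero switch. Like the code under review, it truncates rather than rounds.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static float b2f(uint32_t u) { float f; std::memcpy(&f, &u, sizeof f); return f; }
static uint32_t f2b(float f) { uint32_t u; std::memcpy(&u, &f, sizeof u); return u; }

// Hypothetical multiply-based float -> half. Assumes |f| is finite and at most
// 65504; truncates instead of rounding. flush_to_zero emulates hardware that
// flushes denormal results to zero.
uint16_t float_to_half_rebias(float f, bool flush_to_zero) {
    const float rebias = b2f((127u - (127u - 15u)) << 23);  // 2^-112
    const uint32_t sign = (f2b(f) >> 16) & 0x8000u;
    const float magnitude = b2f(f2b(f) & 0x7fffffffu);

    // Scaling by 2^-112 turns float's exponent field into half's exponent field.
    // Half-denormal outputs land in the float denormal range here.
    uint32_t scaled = f2b(magnitude * rebias);
    if (flush_to_zero && (scaled & 0x7f800000u) == 0) {
        scaled = 0;  // what a flush-to-zero unit would produce
    }
    return static_cast<uint16_t>(sign | (scaled >> 13));
}
```

With flush_to_zero set, the smallest half denormal (2^-24) comes out as 0x0000 instead of 0x0001. Normal outputs such as 2^-14 (0x0400) are unaffected.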
I realize I've partially reviewed code that was already there :).

https://codereview.chromium.org/1700473003/diff/140001/src/core/SkHalf.h
File src/core/SkHalf.h (right):

https://codereview.chromium.org/1700473003/diff/140001/src/core/SkHalf.h#newc...
src/core/SkHalf.h:55: norm = vreinterpretq_f32_u32(vaddq_u32(vshlq_n_u32(h, 13),
Is this faster than vcvtq_n_f32_u32? Or does vcvtq_n_f32_u32() not work for some reason?

https://codereview.chromium.org/1700473003/diff/140001/src/core/SkHalf.h#newc...
src/core/SkHalf.h:106: // This doesn't round, so it can be 1 bit too small.
Would an "add" before the "right shift" allow you to round? How much more costly is this?

https://codereview.chromium.org/1700473003/diff/140001/src/core/SkHalf.h#newc...
src/core/SkHalf.h:107: const __m128 rebias = _mm_castsi128_ps(_mm_set1_epi32((127 - (127-15)) << 23));
I find this more clear as (15 << 23).
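On the (15 << 23) suggestion: both spellings produce the same bit pattern. mtklein's reply pairs this constant with (127 + (127-15)) << 23, and read as floats the two constants are 2^-112 and 2^112. The sketch below checks the equality at compile time. It also shows one way the 2^112 constant could drive a multiply-based half -> float conversion. That use is an assumption about the SSE path. It only works where float denormals are not flushed, which is the x86 SSE default. The names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static float b2f(uint32_t u) { float f; std::memcpy(&f, &u, sizeof f); return f; }
static uint32_t f2b(float f) { uint32_t u; std::memcpy(&u, &f, sizeof u); return u; }

// The two rebias constants from the review, as float bit patterns.
constexpr uint32_t kToHalf   = (127u - (127u - 15u)) << 23;  // exponent field 15  -> 2^-112
constexpr uint32_t kFromHalf = (127u + (127u - 15u)) << 23;  // exponent field 239 -> 2^+112
static_assert(kToHalf == (15u << 23), "both spellings are the same bit pattern");

// Hypothetical multiply-based half -> float: read the half's exponent and
// mantissa as float bits, then scale by 2^112. Assumes float denormals are not
// flushed (half denormals become float denormals first); ignores inf/NaN.
float half_to_float_mul(uint16_t h) {
    const uint32_t sign = static_cast<uint32_t>(h & 0x8000u) << 16;
    const float mag = b2f(static_cast<uint32_t>(h & 0x7fffu) << 13) * b2f(kFromHalf);
    return b2f(sign | f2b(mag));
}
```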
https://codereview.chromium.org/1700473003/diff/140001/src/core/SkHalf.h
File src/core/SkHalf.h (right):

https://codereview.chromium.org/1700473003/diff/140001/src/core/SkHalf.h#newc...
src/core/SkHalf.h:55: norm = vreinterpretq_f32_u32(vaddq_u32(vshlq_n_u32(h, 13),
On 2016/02/17 19:46:38, msarett wrote:
> Is this faster than vcvtq_n_f32_u32? Or does vcvtq_n_f32_u32() not work for
> some reason?

Remember, we're always doing both. vcvtq_n_f32_u32(...) is correct when the input is denormalized. vaddq_u32(vshlq_n_u32(...), ...) is correct when it's not.

https://codereview.chromium.org/1700473003/diff/140001/src/core/SkHalf.h#newc...
src/core/SkHalf.h:106: // This doesn't round, so it can be 1 bit too small.
On 2016/02/17 19:46:38, msarett wrote:
> Would an "add" before the "right shift" allow you to round? How much more
> costly is this?

I think so, but I'm not quite sure what to add when yet. Will be following up here. It's not super important we get this perfectly precise.

https://codereview.chromium.org/1700473003/diff/140001/src/core/SkHalf.h#newc...
src/core/SkHalf.h:107: const __m128 rebias = _mm_castsi128_ps(_mm_set1_epi32((127 - (127-15)) << 23));
On 2016/02/17 19:46:38, msarett wrote:
> I find this more clear as (15 << 23)

This is meant to parallel the (127 + (127-15)). Seeing 127 and 15 here makes sense in the context of floating point exponent biases.
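On the rounding question, adding before the right shift does work. One standard choice is round-to-nearest-even: add 0xfff (just under half an output ulp) plus the lowest bit that will be kept, then shift. This is a sketch, not necessarily what Skia adopted. The helper names are hypothetical. The sketch also assumes positive inputs whose result is a normal half, so the scaling by 2^-112 is exact.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static float b2f(uint32_t u) { float f; std::memcpy(&f, &u, sizeof f); return f; }
static uint32_t f2b(float f) { uint32_t u; std::memcpy(&u, &f, sizeof u); return u; }

// Scale so the float exponent field matches half's (multiply by 2^-112).
// Hypothetical helper; assumes f > 0 and that the result is a normal half.
static uint32_t scaled_bits(float f) {
    return f2b(f * b2f(15u << 23));
}

// Truncating shift: drops the low 13 mantissa bits, so it can be 1 bit too small.
uint16_t half_truncate(float f) {
    return static_cast<uint16_t>(scaled_bits(f) >> 13);
}

// Round to nearest, ties to even: add just under half an output ulp (0xfff)
// plus the lowest kept bit, then shift. A carry out of the mantissa correctly
// bumps the exponent (e.g. just below 2.0 rounds up to 2.0).
uint16_t half_round_even(float f) {
    const uint32_t s = scaled_bits(f);
    return static_cast<uint16_t>((s + 0x0fffu + ((s >> 13) & 1u)) >> 13);
}
```

In SIMD form, this adds a few integer operations per vector (a shift, an AND, and two adds) before the existing shift.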
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot on client.skia.android (JOB_FAILED, http://build.chromium.org/p/client.skia.android/builders/Test-Android-GCC-Nex...)
On 2016/02/17 20:09:00, commit-bot: I haz the power wrote:
> Dry run: Try jobs failed on following builders:
> Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot on client.skia.android
> (JOB_FAILED,
> http://build.chromium.org/p/client.skia.android/builders/Test-Android-GCC-Nex...)

I'm gonna count this run as a success. The _01 tests passed.
The CQ bit was checked by mtklein@google.com to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1700473003/140001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1700473003/140001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
Gonna get this baking... happy to follow up / evolve it.
The CQ bit was checked by mtklein@google.com
The patchset sent to the CQ was uploaded after l-g-t-m from reed@google.com Link to the patchset: https://codereview.chromium.org/1700473003/#ps140001 (title: "armv8 asm")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1700473003/140001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1700473003/140001
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: Test-Android-GCC-Nexus5-CPU-NEON-Arm7-Release-Trybot on client.skia.android (JOB_TIMED_OUT, no build URL)
The CQ bit was checked by mtklein@google.com
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1700473003/140001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1700473003/140001
Message was sent while issue was closed.
Description was changed: "Committed: https://skia.googlesource.com/skia/+/be8c19e8d3deac9b9585c44b9a423912dd00a75a" was appended. The description is otherwise unchanged.
Message was sent while issue was closed.
Committed patchset #8 (id:140001) as https://skia.googlesource.com/skia/+/be8c19e8d3deac9b9585c44b9a423912dd00a75a
