src/core/Sk4x_neon.h - Issue 975303003: 4x library for NEON

Side by Side Diff: src/core/Sk4x_neon.h

Issue 975303003: 4x library for NEON (Closed) Base URL: https://skia.googlesource.com/skia.git@master

Patch Set: Created 5 years, 9 months ago

OLD	NEW
(Empty)
	1 // It is important _not_ to put header guards here.

	2 // This file will be intentionally included three times.

	3

	4 #if defined(SK4X_PREAMBLE)

	5 #include <arm_neon.h>

	6

	7 // Template metaprogramming to map scalar types to vector types.

	8 template <typename T> struct SkScalarToSIMD;

	9 template <> struct SkScalarToSIMD<float> { typedef float32x4_t Type; };

	10 template <> struct SkScalarToSIMD<int32_t> { typedef int32x4_t Type; };

	11

	12 #elif defined(SK4X_PRIVATE)

	13 Sk4x(float32x4_t);

	14 Sk4x(int32x4_t);

	15

	16 typename SkScalarToSIMD<T>::Type fVec;

	17

	18 #else

	19

	20 // Vector Constructors

	21 template <> inline Sk4f::Sk4x(int32x4_t v) : fVec(vcvtq_f32_s32(v)) {}
	mtklein 2015/03/04 16:48:53 In the SSE code, we use these constructors (which are private, thankfully) for easy bit casts (like reinterpret). Here it looks like you're using them for cast conversions. Though it looks to me that you're not using them at all, so perhaps we can just drop them?
	msarett 2015/03/04 18:41:12 Yeah I agree, I think originally, I did drop them. If I remember correctly, the lack of these constructors was the reason for the linking error I ran into?
	22 template <> inline Sk4f::Sk4x(float32x4_t v) : fVec(v) {}

	23 template <> inline Sk4i::Sk4x(int32x4_t v) : fVec(v) {}

	24 template <> inline Sk4i::Sk4x(float32x4_t v) : fVec(vcvtq_s32_f32(v)) {}

	25

	26 // Generic Methods

	27 template <typename T> Sk4x<T>::Sk4x() {}

	28 template <typename T> Sk4x<T>::Sk4x(const Sk4x& other) { *this = other; }

	29 template <typename T> Sk4x<T>& Sk4x<T>::operator=(const Sk4x<T>& other) {

	30 fVec = other.fVec;

	31 return *this;

	32 }

	33

	34 // Sk4f Methods

	35 #define M(...) template <> inline __VA_ARGS__ Sk4f::

	36

	37 M() Sk4x(float v) : fVec(vdupq_n_f32(v)) {}

	38 M() Sk4x(float a, float b, float c, float d) {

	39 // NEON lacks an intrinsic to make this easy. It is recommended to avoid

	40 // this constructor unless it is absolutely necessary.

	41

	42 // I am choosing to use the set lane intrinsics. Particularly, in the case

	43 // of floating point, it is likely that the values are already in the right

	44 // register file, so this may be the best approach. However, I am not

	45 // certain that this is the fastest approach and experimentation might be

	46 // useful.
	mtklein 2015/03/04 16:48:53 This SGTM. I think this constructor should be rare in practice. We'll mostly be using LoadAligned(). We could even try removing this constructor in a follow up and see what stinks.
	47 fVec = vsetq_lane_f32(a, fVec, 0);

	48 fVec = vsetq_lane_f32(b, fVec, 1);

	49 fVec = vsetq_lane_f32(c, fVec, 2);

	50 fVec = vsetq_lane_f32(d, fVec, 3);

	51 }

	52

	53 // As far as I can tell, it's not possible to provide an alignment hint to

	54 // NEON using intrinsics. However, I think it is possible at the assembly

	55 // level if we want to get into that.
	mtklein 2015/03/04 16:48:52 Right. I think people typically end up writing their own vld1qa_f32(). Sounds like a good follow up.
	msarett 2015/03/04 18:41:12 Cool I'll add a TODO.
	56 M(Sk4f) Load (const float fs[4]) { return vld1q_f32(fs); }

	57 M(Sk4f) LoadAligned(const float fs[4]) { return vld1q_f32(fs); }

	58 M(void) store (float fs[4]) const { vst1q_f32(fs, fVec); }

	59 M(void) storeAligned(float fs[4]) const { vst1q_f32 (fs, fVec); }
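On the alignment-hint note above: one compiler-level option, separate from the hand-written vld1qa_f32() mentioned in the review, is the GCC/Clang builtin `__builtin_assume_aligned`. The sketch below is a portable illustration only. `Float4` and `load4_assume_aligned` are hypothetical names that do not appear in the patch, and nothing guarantees that a compiler will turn this promise into an alignment-qualified NEON load.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-in for a 4-lane float vector (not the patch's Sk4f).
struct Float4 { float lanes[4]; };

// Hypothetical helper: load four floats while promising 16-byte alignment.
// The promise is only made on GCC/Clang; elsewhere this is a plain copy.
inline Float4 load4_assume_aligned(const float* p) {
#if defined(__GNUC__) || defined(__clang__)
    // Assumption: the caller really passes a 16-byte-aligned pointer.
    p = static_cast<const float*>(__builtin_assume_aligned(p, 16));
#endif
    Float4 out;
    std::memcpy(out.lanes, p, sizeof(out.lanes));
    return out;
}
```

Passing a pointer that is not actually 16-byte aligned would make the promise false, so the unaligned Load() path would still need a separate implementation.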

	60

	61 template <>

	62 M(Sk4i) reinterpret<Sk4i>() const { return vreinterpretq_s32_f32(fVec); }

	63

	64 template <>

	65 M(Sk4i) cast<Sk4i>() const { return vcvtq_s32_f32(fVec); }

	66

	67 // We're going to skip allTrue(), anyTrue(), and bit-manipulators

	68 // for Sk4f. Code that calls them probably does so accidentally.

	69 // Ask msarett or mtklein to fill these in if you really need them.

	70 M(Sk4f) add (const Sk4f& o) const { return vaddq_f32(fVec, o.fVec); }

	71 M(Sk4f) subtract(const Sk4f& o) const { return vsubq_f32(fVec, o.fVec); }

	72 M(Sk4f) multiply(const Sk4f& o) const { return vmulq_f32(fVec, o.fVec); }

	73 M(Sk4f) divide (const Sk4f& o) const { return vmulq_f32(fVec, vrecpeq_f32(o.fVec)); }
	mtklein 2015/03/04 16:48:53 TODO: how many vrecpsq_f32 should we use here?
	msarett 2015/03/04 18:41:12 I can't figure out how to factor out the calls to vrecpsq_f32 without adding reciprocal as its own member function. Do we want to do this?
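To make the "how many vrecpsq_f32" question concrete, here is a scalar model of the refinement step. vrecpsq_f32(a, b) computes 2 - a*b per lane, so x * vrecps(d, x) is one Newton-Raphson step toward 1/d. The helper names below are hypothetical and the code runs on plain floats rather than NEON registers; it only illustrates how quickly the error shrinks per step, not what the patch should use.

```cpp
#include <cassert>  // used only by the checks that accompany this sketch
#include <cmath>

// Scalar stand-in for one lane of vrecpsq_f32(a, b): 2 - a*b.
inline float recps_model(float a, float b) { return 2.0f - a * b; }

// Hypothetical helper: refine a rough estimate of 1/d with `steps`
// Newton-Raphson iterations, x <- x * (2 - d*x).
inline float refine_reciprocal(float d, float estimate, int steps) {
    float x = estimate;
    for (int i = 0; i < steps; ++i) {
        x = x * recps_model(d, x);
    }
    return x;
}
```

Each step roughly squares the relative error, so starting from an estimate that is 10% off (0.3 for 1/3), one step leaves about 1% error and two steps about 0.01%.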
	74 // TODO: Maybe it would be useful to provide a simple reciprocal as well?

	75 M(Sk4f) rsqrt() const { return vrsqrteq_f32(fVec); }
	mtklein 2015/03/04 16:48:53 TODO: how many vrsqrtsq_f32 should we use here?
	76 M(Sk4f) sqrt() const { return vrecpeq_f32(vrsqrteq_f32(fVec)); }
	mtklein 2015/03/04 16:48:53 Won't we always end up with a better answer using this->multiply(this->rsqrt()) ?
	msarett 2015/03/04 18:41:12 I'm not sure exactly what you mean. The issue that I have run into is that NEON does not provide a sqrt instruction. They only provide reciprocal sqrt and reciprocal. Also, it might be worth mentioning that I read a little more about the accuracy of these instructions. rsqrte and recpe provides "estimates" of these values. rsqrts and recps (unused) provide Newton-Raphson steps for these estimates to converge on the true result. If we have accuracy issues, we may need to add these iteration steps.
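The two ideas in this thread can be combined in a scalar model: refine the rsqrt estimate with Newton-Raphson steps, then form sqrt(d) as d * rsqrt(d), as in the this->multiply(this->rsqrt()) suggestion. vrsqrtsq_f32(a, b) computes (3 - a*b) / 2 per lane. The function names are hypothetical and the code uses plain floats, so treat it as a sketch of the arithmetic rather than a proposed implementation.

```cpp
#include <cassert>  // used only by the checks that accompany this sketch
#include <cmath>

// Scalar stand-in for one lane of vrsqrtsq_f32(a, b): (3 - a*b) / 2.
inline float rsqrts_model(float a, float b) { return (3.0f - a * b) * 0.5f; }

// Hypothetical helper: refine an estimate of 1/sqrt(d) with `steps`
// Newton-Raphson iterations, x <- x * (3 - d*x*x) / 2.
inline float refine_rsqrt(float d, float estimate, int steps) {
    float x = estimate;
    for (int i = 0; i < steps; ++i) {
        x = x * rsqrts_model(d * x, x);
    }
    return x;
}

// sqrt(d) = d * (1/sqrt(d)). Note: d == 0 needs separate handling,
// because 0 times an infinite rsqrt is NaN.
inline float sqrt_via_rsqrt(float d, float rsqrt_estimate, int steps) {
    return d * refine_rsqrt(d, rsqrt_estimate, steps);
}
```

This form needs only one estimate instruction (rsqrte) plus a multiply, whereas the recpe(rsqrte(x)) form in the patch chains two estimates and so compounds their errors.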
	77

	78 M(Sk4i) equal (const Sk4f& o) const { return vreinterpretq_s32_u32(vceqq_f32(fVec, o.fVec)); }

	79 M(Sk4i) notEqual (const Sk4f& o) const { return vreinterpretq_s32_u32(vmvnq_u32(vceqq_f32(fVec, o.fVec))); }

	80 M(Sk4i) lessThan (const Sk4f& o) const { return vreinterpretq_s32_u32(vcltq_f32(fVec, o.fVec)); }

	81 M(Sk4i) greaterThan (const Sk4f& o) const { return vreinterpretq_s32_u32(vcgtq_f32(fVec, o.fVec)); }

	82 M(Sk4i) lessThanEqual (const Sk4f& o) const { return vreinterpretq_s32_u32(vcleq_f32(fVec, o.fVec)); }

	83 M(Sk4i) greaterThanEqual(const Sk4f& o) const { return vreinterpretq_s32_u32(vcgeq_f32(fVec, o.fVec)); }

	84

	85 M(Sk4f) Min(const Sk4f& a, const Sk4f& b) { return vminq_f32(a.fVec, b.fVec); }

	86 M(Sk4f) Max(const Sk4f& a, const Sk4f& b) { return vmaxq_f32(a.fVec, b.fVec); }

	87

	88 // These shuffle operations are implemented more efficiently with SSE.

	89 // NEON has efficient zip, unzip, and transpose, but it is more costly to

	90 // exploit zip and unzip in order to shuffle.

	91 M(Sk4f) zwxy() const {

	92 float32x4x2_t zip = vzipq_f32(fVec, vdupq_n_f32(0.0));

	93 return vuzpq_f32(zip.val[1], zip.val[0]).val[0];

	94 }

	95 // Note that XYAB and ZWCD share code. If both are needed, they could be
	mtklein 2015/03/04 16:48:52 A lot of these shuffles were sort of exploratory. It'll be hard for us to find the intersection of useful, fast-in-SSE, and fast-in-NEON. I'm hoping we can evolve this over time as we find better ways to zip and shuffle. Should be, though, that because these are all inlined, it's probably not much worse to call XYAB() then ZWCD() than if we implemented them together. Constant subexpression elimination ought to skip the redundant work. (I have not tested this.)
	96 // implemented more efficiently together. Also, ABXY and CDZW are available

	97 // as well.

	98 M(Sk4f) XYAB(const Sk4f& xyzw, const Sk4f& abcd) {

	99 float32x4x2_t xayb_zcwd = vzipq_f32(xyzw.fVec, abcd.fVec);

	100 float32x4x2_t axby_czdw = vzipq_f32(abcd.fVec, xyzw.fVec);

	101 return vuzpq_f32(xayb_zcwd.val[0], axby_czdw.val[0]).val[0];

	102 }

	103 M(Sk4f) ZWCD(const Sk4f& xyzw, const Sk4f& abcd) {

	104 float32x4x2_t xayb_zcwd = vzipq_f32(xyzw.fVec, abcd.fVec);

	105 float32x4x2_t axby_czdw = vzipq_f32(abcd.fVec, xyzw.fVec);

	106 return vuzpq_f32(xayb_zcwd.val[1], axby_czdw.val[1]).val[0];

	107 }
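The lane bookkeeping in zwxy(), XYAB() and ZWCD() is easier to follow with a scalar model of zip and unzip. The sketch below models vzipq_f32 (interleave) and vuzpq_f32 (de-interleave) on std::array and replays the three shuffles exactly as the patch composes them. `Vec4`, `Vec4x2` and the `*_model` functions are hypothetical names introduced for illustration.

```cpp
#include <array>
#include <cassert>  // used only by the checks that accompany this sketch

using Vec4 = std::array<float, 4>;
struct Vec4x2 { Vec4 val[2]; };

// Model of vzipq_f32: interleave the lanes of a and b.
inline Vec4x2 zip_model(const Vec4& a, const Vec4& b) {
    Vec4x2 r;
    r.val[0] = Vec4{a[0], b[0], a[1], b[1]};
    r.val[1] = Vec4{a[2], b[2], a[3], b[3]};
    return r;
}

// Model of vuzpq_f32: val[0] takes even lanes, val[1] takes odd lanes.
inline Vec4x2 uzp_model(const Vec4& a, const Vec4& b) {
    Vec4x2 r;
    r.val[0] = Vec4{a[0], a[2], b[0], b[2]};
    r.val[1] = Vec4{a[1], a[3], b[1], b[3]};
    return r;
}

inline Vec4 zwxy_model(const Vec4& v) {
    Vec4x2 zip = zip_model(v, Vec4{0.0f, 0.0f, 0.0f, 0.0f});
    return uzp_model(zip.val[1], zip.val[0]).val[0];
}

inline Vec4 xyab_model(const Vec4& xyzw, const Vec4& abcd) {
    Vec4x2 xayb_zcwd = zip_model(xyzw, abcd);
    Vec4x2 axby_czdw = zip_model(abcd, xyzw);
    return uzp_model(xayb_zcwd.val[0], axby_czdw.val[0]).val[0];
}

inline Vec4 zwcd_model(const Vec4& xyzw, const Vec4& abcd) {
    Vec4x2 xayb_zcwd = zip_model(xyzw, abcd);
    Vec4x2 axby_czdw = zip_model(abcd, xyzw);
    return uzp_model(xayb_zcwd.val[1], axby_czdw.val[1]).val[0];
}
```

In this model, val[1] of the final unzip holds ABXY (or CDZW), which matches the code comment that those variants are available from the same intermediate values.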

	108

	109 // Sk4i Methods

	110 #undef M

	111 #define M(...) template <> inline __VA_ARGS__ Sk4i::

	112

	113 M() Sk4x(int32_t v) : fVec(vdupq_n_s32(v)) {}

	114 M() Sk4x(int32_t a, int32_t b, int32_t c, int32_t d) {

	115 // NEON lacks an intrinsic to make this easy. It is recommended to avoid

	116 // this constructor unless it is absolutely necessary.

	117

	118 // There are a few different implementation strategies.

	119

	120 // uint64_t ab_i = ((uint32_t) a) | (((uint64_t) b) << 32);

	121 // uint64_t cd_i = ((uint32_t) c) | (((uint64_t) d) << 32);

	122 // int32x2_t ab = vcreate_s32(ab_i);

	123 // int32x2_t cd = vcreate_s32(cd_i);

	124 // fVec = vcombine_s32(ab, cd);

	125 // This might not be a bad idea for the integer case. Either way I think,

	126 // we will need to move values from general registers to NEON registers.

	127

	128 // I am choosing to use the set lane intrinsics. I am not certain that

	129 // this is the fastest approach. It may be useful to try the above code

	130 // for integers.

	131 fVec = vsetq_lane_s32(a, fVec, 0);

	132 fVec = vsetq_lane_s32(b, fVec, 1);

	133 fVec = vsetq_lane_s32(c, fVec, 2);

	134 fVec = vsetq_lane_s32(d, fVec, 3);

	135 }
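The commented-out vcreate_s32 alternative depends on packing two 32-bit lanes into one 64-bit value, with the first lane in the low half. A portable sketch of just that packing step follows. `pack_lanes` and `lane` are hypothetical helpers, and the low-half-is-lane-0 ordering is an assumption about a little-endian NEON target. Each value is cast to uint32_t before widening so that a negative lane cannot smear sign bits into its neighbour.

```cpp
#include <cassert>  // used only by the checks that accompany this sketch
#include <cstdint>

// Hypothetical helper: pack two 32-bit lanes into 64 bits, `lo` in bits 0..31
// and `hi` in bits 32..63.
inline uint64_t pack_lanes(int32_t lo, int32_t hi) {
    return static_cast<uint64_t>(static_cast<uint32_t>(lo)) |
           (static_cast<uint64_t>(static_cast<uint32_t>(hi)) << 32);
}

// Hypothetical helper: read lane 0 or lane 1 back out of the packed value.
inline int32_t lane(uint64_t packed, int index) {
    return static_cast<int32_t>(static_cast<uint32_t>(packed >> (32 * index)));
}
```

Two such packed values would then feed vcreate_s32 and vcombine_s32 as in the commented-out code; whether that beats four vsetq_lane_s32 calls is left open in the patch and is not settled by this sketch.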

	136

	137 // As far as I can tell, it's not possible to provide an alignment hint to

	138 // NEON using intrinsics. However, I think it is possible at the assembly

	139 // level if we want to get into that.

	140 M(Sk4i) Load (const int32_t is[4]) { return vld1q_s32(is); }

	141 M(Sk4i) LoadAligned(const int32_t is[4]) { return vld1q_s32(is); }

	142 M(void) store (int32_t is[4]) const { vst1q_s32(is, fVec); }

	143 M(void) storeAligned(int32_t is[4]) const { vst1q_s32 (is, fVec); }

	144

	145 template <>

	146 M(Sk4f) reinterpret<Sk4f>() const { return vreinterpretq_f32_s32(fVec); }

	147

	148 template <>

	149 M(Sk4f) cast<Sk4f>() const { return vcvtq_f32_s32(fVec); }

	150

	151 M(bool) allTrue() const {

	152 // TODO: There has to be a better way to implement this.
	mtklein 2015/03/04 16:48:53 I actually don't think there is. movemask is a pretty unique feature of SSE.
	153 int32_t a = vgetq_lane_s32(fVec, 0);

	154 int32_t b = vgetq_lane_s32(fVec, 1);

	155 int32_t c = vgetq_lane_s32(fVec, 2);

	156 int32_t d = vgetq_lane_s32(fVec, 3);

	157 return a & b & c & d;

	158 }

	159 M(bool) anyTrue() const {

	160 // TODO: There has to be a better way to implement this.

	161 int32_t a = vgetq_lane_s32(fVec, 0);

	162 int32_t b = vgetq_lane_s32(fVec, 1);

	163 int32_t c = vgetq_lane_s32(fVec, 2);

	164 int32_t d = vgetq_lane_s32(fVec, 3);

	165 return a | b | c | d;

	166 }

	167

	168 M(Sk4i) bitNot() const { return vmvnq_s32(fVec); }

	169 M(Sk4i) bitAnd(const Sk4i& o) const { return vandq_s32(fVec, o.fVec); }

	170 M(Sk4i) bitOr (const Sk4i& o) const { return vorrq_s32(fVec, o.fVec); }

	171

	172 M(Sk4i) equal (const Sk4i& o) const { return vreinterpretq_s32_u32(vceqq_s32(fVec, o.fVec)); }

	173 M(Sk4i) notEqual (const Sk4i& o) const { return vreinterpretq_s32_u32(vmvnq_u32(vceqq_s32(fVec, o.fVec))); }

	174 M(Sk4i) lessThan (const Sk4i& o) const { return vreinterpretq_s32_u32(vcltq_s32(fVec, o.fVec)); }

	175 M(Sk4i) greaterThan (const Sk4i& o) const { return vreinterpretq_s32_u32(vcgtq_s32(fVec, o.fVec)); }

	176 M(Sk4i) lessThanEqual (const Sk4i& o) const { return vreinterpretq_s32_u32(vcleq_s32(fVec, o.fVec)); }

	177 M(Sk4i) greaterThanEqual(const Sk4i& o) const { return vreinterpretq_s32_u32(vcgeq_s32(fVec, o.fVec)); }

	178

	179 M(Sk4i) add (const Sk4i& o) const { return vaddq_s32(fVec, o.fVec); }

	180 M(Sk4i) subtract(const Sk4i& o) const { return vsubq_s32(fVec, o.fVec); }

	181 M(Sk4i) multiply(const Sk4i& o) const { return vmulq_s32(fVec, o.fVec); }

	182 // NEON does not have integer reciprocal, sqrt, or division.

	183 M(Sk4i) Min(const Sk4i& a, const Sk4i& b) { return vminq_s32(a.fVec, b.fVec); }

	184 M(Sk4i) Max(const Sk4i& a, const Sk4i& b) { return vmaxq_s32(a.fVec, b.fVec); }

	185

	186 // These shuffle operations are implemented more efficiently with SSE.

	187 // NEON has efficient zip, unzip, and transpose, but it is more costly to

	188 // exploit zip and unzip in order to shuffle.

	189 M(Sk4i) zwxy() const {

	190 int32x4x2_t zip = vzipq_s32(fVec, vdupq_n_s32(0.0));

	191 return vuzpq_s32(zip.val[1], zip.val[0]).val[0];

	192 }

	193 // Note that XYAB and ZWCD share code. If both are needed, they could be

	194 // implemented more efficiently together. Also, ABXY and CDZW are available

	195 // as well.

	196 M(Sk4i) XYAB(const Sk4i& xyzw, const Sk4i& abcd) {

	197 int32x4x2_t xayb_zcwd = vzipq_s32(xyzw.fVec, abcd.fVec);

	198 int32x4x2_t axby_czdw = vzipq_s32(abcd.fVec, xyzw.fVec);

	199 return vuzpq_s32(xayb_zcwd.val[0], axby_czdw.val[0]).val[0];

	200 }

	201 M(Sk4i) ZWCD(const Sk4i& xyzw, const Sk4i& abcd) {

	202 int32x4x2_t xayb_zcwd = vzipq_s32(xyzw.fVec, abcd.fVec);

	203 int32x4x2_t axby_czdw = vzipq_s32(abcd.fVec, xyzw.fVec);

	204 return vuzpq_s32(xayb_zcwd.val[1], axby_czdw.val[1]).val[0];

	205 }

	206

	207 #undef M

	208

	209 #endif
