src/opts/SkPMFloat_SSSE3.h - Issue 976493002: Add SSSE3 implementation for SkPMFloat, with faster get() and set().

Side by Side Diff: src/opts/SkPMFloat_SSSE3.h

Issue 976493002: Add SSSE3 implementation for SkPMFloat, with faster get() and set(). (Closed) Base URL: https://skia.googlesource.com/skia.git@master

Patch Set: rebase Created 5 years, 9 months ago

OLD	NEW
(Empty)
	1 #include "SkColorPriv.h"

	2 #include <tmmintrin.h>

	3

	4 // For set(), we widen our 8 bit components (fix8) to 8-bit components in 32 bits (fix8_32),

	5 // then convert those to floats.

	6

	7 // get() does the opposite, working from floats to 8-bit-in-32-bits, then back to packed 8 bit.

	8

	9 // clamped() is the same as _SSE2: floats to 8-in-32, to 8-in-16, to packed 8 bit, with

	10 // _mm_packus_epi16() both clamping and narrowing.

	11

	12 inline void SkPMFloat::set(SkPMColor c) {

	13 SkPMColorAssert(c);

	14 const int _ = 255; // _ means to zero that byte.

	15 __m128i fix8 = _mm_set_epi32(0,0,0,c),

	16 fix8_32 = _mm_shuffle_epi8(fix8, _mm_set_epi8(_,_,_,3, _,_,_,2, _,_,_,1, _,_,_,0));

	17 _mm_store_ps(fColor, _mm_cvtepi32_ps(fix8_32));
	msarett 2015/03/04 18:06:12
	As far as instructions, it looks to me like you have taken advantage of the best ones.

	I'm pretty sure that my main thoughts for improvements are the same things you are already thinking.

	Ideally, the code would be structured so that the convert hardware is always busy (and the speed of the convert is the limiting factor). If we pass in a whole array of SkPMColor, setting the constant for the shuffle would be a one time overhead. We could perform four shuffles and four converts for every one vector load.

	If we unroll the loop appropriately to limit dependencies it is possible that the load, shuffle, and convert could be performed in parallel (though I haven't been able to confirm that the shuffle and convert use different hardware). This might get us closer to the optimum of only being constrained by the convert step.

	mtklein 2015/03/04 18:10:24
	+1, though keep in mind this code is inlined.
	Setting the constant for the shuffles _is_ a one time overhead if we call SkPMFloat::set() in a loop. This code is inlined and those non-loop-dependent ops are hoisted up.
	18 SkASSERT(this->isValid());

	19 }
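
The "four shuffles and four converts for every one vector load" idea from the comment on line 17 could look roughly like the sketch below. It is an illustration only, not part of this patch: `widen4` and the `SKETCH_SSSE3` macro are hypothetical names, unaligned loads and stores are assumed for simplicity, and nothing here has been benchmarked.

```cpp
#include <tmmintrin.h>  // SSSE3: _mm_shuffle_epi8
#include <cassert>
#include <cstdint>

// Hypothetical macro: lets GCC/Clang build this one function with SSSE3
// enabled even when the translation unit is not compiled with -mssse3.
#if defined(__GNUC__) && !defined(__SSSE3__)
#define SKETCH_SSSE3 __attribute__((target("ssse3")))
#else
#define SKETCH_SSSE3
#endif

// Hypothetical helper: widen four packed 32-bit pixels into 16 floats,
// four floats per pixel, one float per 8-bit component.
SKETCH_SSSE3
inline void widen4(const uint32_t* src, float* dst) {
    assert(src != nullptr && dst != nullptr);

    // Same role as "_ = 255" in the patch: the high bit is set, so
    // _mm_shuffle_epi8 writes zero to that byte.
    const char _ = static_cast<char>(0xFF);

    // One vector load brings in all four pixels.
    const __m128i px = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));

    // One mask per pixel. Each mask moves that pixel's four bytes into the
    // low byte of the four 32-bit lanes and zeroes everything else.
    const __m128i m0 = _mm_set_epi8(_,_,_, 3, _,_,_, 2, _,_,_, 1, _,_,_, 0);
    const __m128i m1 = _mm_set_epi8(_,_,_, 7, _,_,_, 6, _,_,_, 5, _,_,_, 4);
    const __m128i m2 = _mm_set_epi8(_,_,_,11, _,_,_,10, _,_,_, 9, _,_,_, 8);
    const __m128i m3 = _mm_set_epi8(_,_,_,15, _,_,_,14, _,_,_,13, _,_,_,12);

    // Four shuffles and four int-to-float converts, independent of each other.
    _mm_storeu_ps(dst +  0, _mm_cvtepi32_ps(_mm_shuffle_epi8(px, m0)));
    _mm_storeu_ps(dst +  4, _mm_cvtepi32_ps(_mm_shuffle_epi8(px, m1)));
    _mm_storeu_ps(dst +  8, _mm_cvtepi32_ps(_mm_shuffle_epi8(px, m2)));
    _mm_storeu_ps(dst + 12, _mm_cvtepi32_ps(_mm_shuffle_epi8(px, m3)));
}
```

Whether this beats calling the inlined set() in a loop is an open question; as noted in the reply, the compiler already hoists the shuffle constant in that case.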

	20

	21 inline SkPMColor SkPMFloat::get() const {

	22 SkASSERT(this->isValid());

	23 const int _ = 255; // _ means to zero that byte.

	24 __m128i fix8_32 = _mm_cvtps_epi32(_mm_load_ps(fColor)), // _mm_cvtps_epi32 rounds for us!

	25 fix8 = _mm_shuffle_epi8(fix8_32, _mm_set_epi8(_,_,_,_, _,_,_,_, _,_,_,_, 12,8,4,0));

	26 SkPMColor c = _mm_cvtsi128_si32(fix8);

	27 SkPMColorAssert(c);

	28 return c;

	29 }
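
For readers unfamiliar with the `_` convention in the shuffle masks, here is a small scalar model of what `_mm_shuffle_epi8` does with the get() mask. It is a sketch for illustration: `Bytes`, `shuffle_model`, and `get_mask` are made-up names, and the model uses plain arrays in memory order rather than real vector registers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Bytes = std::array<uint8_t, 16>;

// Scalar model of _mm_shuffle_epi8: for each output byte, a mask byte with
// its high bit set produces zero; otherwise the low four bits of the mask
// byte select which source byte to copy.
inline Bytes shuffle_model(const Bytes& src, const Bytes& mask) {
    Bytes out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
    }
    return out;
}

// The get() mask in memory order (byte 0 first). _mm_set_epi8 lists its
// arguments from byte 15 down to byte 0, so this reads as the reverse of
// the argument list in the patch.
inline Bytes get_mask() {
    Bytes m;
    m.fill(255);  // 255 has the high bit set: "zero that byte".
    m[0] = 0;     // output byte 0 comes from the low byte of 32-bit lane 0
    m[1] = 4;     // output byte 1 comes from the low byte of lane 1
    m[2] = 8;     // output byte 2 comes from the low byte of lane 2
    m[3] = 12;    // output byte 3 comes from the low byte of lane 3
    return m;
}
```

With this mask, the low bytes of the four 32-bit lanes end up packed into the first four bytes, which is what `_mm_cvtsi128_si32` then extracts as the SkPMColor.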

	30

	31 inline SkPMColor SkPMFloat::clamped() const {

	32 __m128i fix8_32 = _mm_cvtps_epi32(_mm_load_ps(fColor)), // _mm_cvtps_epi32 rounds for us!

	33 fix8_16 = _mm_packus_epi16(fix8_32, fix8_32),
	msarett 2015/03/04 18:06:12
	For lack of anything to comment on, it might be more readable to use _mm_packus_epi32 here. But as far as I can tell, the behavior and performance will be identical.

	mtklein 2015/03/04 18:10:24
	Yeah. I'd have done that, just for the readability, but sadly _mm_packus_epi32 is SSE 4.1. We've only got _mm_packus_epi16 to play with until then.
	34 fix8 = _mm_packus_epi16(fix8_16, fix8_16);

	35 SkPMColor c = _mm_cvtsi128_si32(fix8);

	36 SkPMColorAssert(c);

	37 return c;

	38 }
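
To see why applying `_mm_packus_epi16` twice to 32-bit lanes clamps to [0, 255], the following scalar model traces a single lane through both packs. This is a sketch under assumptions: `packus16` and `clamp_lane` are hypothetical names, and the model only tracks one lane, not the byte layout of the full register.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of one lane of _mm_packus_epi16: a signed 16-bit input is
// saturated to an unsigned 8-bit output.
inline uint8_t packus16(int16_t v) {
    if (v < 0)   return 0;
    if (v > 255) return 255;
    return static_cast<uint8_t>(v);
}

// Model of the two packs in clamped() for a single 32-bit lane.
inline uint8_t clamp_lane(int32_t v) {
    const uint32_t bits = static_cast<uint32_t>(v);

    // The first pack sees the 32-bit lane as two signed 16-bit halves.
    const int16_t lo = static_cast<int16_t>(bits & 0xFFFFu);
    const int16_t hi = static_cast<int16_t>(bits >> 16);

    // Each half saturates to one byte; the two bytes sit side by side and
    // form one 16-bit lane for the second pack.
    const uint16_t mid_bits =
        static_cast<uint16_t>(packus16(lo) | (packus16(hi) << 8));

    // The second pack saturates that 16-bit lane to the final byte.
    return packus16(static_cast<int16_t>(mid_bits));
}
```

In this model the result matches a true clamp for any input in [-32768, 32767]: for example, -5 gives 0 and 300 gives 255. Larger magnitudes can go wrong, since 40000 has a low half that reads as negative and comes out as 0, so the trick relies on the converted floats staying within signed 16-bit range.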
