src/opts/SkPMFloat_SSE2.h - Issue 973603002: Make SkPMFloats store floats in [0,255] instead of [0,1].

Side by Side Diff: src/opts/SkPMFloat_SSE2.h

Issue 973603002: Make SkPMFloats store floats in [0,255] instead of [0,1]. (Closed) Base URL: https://skia.googlesource.com/skia.git@master

Patch Set: restore comment Created 5 years, 9 months ago

OLD	NEW
1 #include "SkColorPriv.h"	1 #include "SkColorPriv.h"

2 #include <emmintrin.h>	2 #include <emmintrin.h>

3	3

4 // For set(), we widen our 8 bit components (fix8) to 8-bit components in 16 bits (fix8_16),	4 // For set(), we widen our 8 bit components (fix8) to 8-bit components in 16 bits (fix8_16),

5 // then widen those to 8-bit-in-32-bits (fix8_32), convert those to floats (scaled),	5 // then widen those to 8-bit-in-32-bits (fix8_32), and finally convert those to floats.

6 // then finally scale those down from [0.0f, 255.0f] to [0.0f, 1.0f] into fColor.

7	6

8 // get() and clamped() do the opposite, working from [0.0f, 1.0f] floats to [0.0f, 255.0f],	7 // get() and clamped() do the opposite, working from floats to 8-bit-in-32-bit,

9 // to 8-bit-in-32-bit, to 8-bit-in-16-bit, back down to 8-bit components.	8 // to 8-bit-in-16-bit, back down to 8-bit components.

10 // _mm_packus_epi16() gives us clamping for free while narrowing.	9 // _mm_packus_epi16() gives us clamping for free while narrowing.

11	10

12 inline void SkPMFloat::set(SkPMColor c) {	11 inline void SkPMFloat::set(SkPMColor c) {

13 SkPMColorAssert(c);	12 SkPMColorAssert(c);

14 __m128i fix8 = _mm_set_epi32(0,0,0,c),	13 __m128i fix8 = _mm_set_epi32(0,0,0,c),

15 fix8_16 = _mm_unpacklo_epi8 (fix8, _mm_setzero_si128()),	14 fix8_16 = _mm_unpacklo_epi8 (fix8, _mm_setzero_si128()),

16 fix8_32 = _mm_unpacklo_epi16(fix8_16, _mm_setzero_si128());	15 fix8_32 = _mm_unpacklo_epi16(fix8_16, _mm_setzero_si128());

17 __m128 scaled = _mm_cvtepi32_ps(fix8_32);	16 _mm_store_ps(fColor, _mm_cvtepi32_ps(fix8_32));
msarett 2015/03/03 14:34:28
I think we might be able to improve performance a little bit if we convert directly from 8-bit or 16-bit ints to floats (not 100% sure, because I'm having trouble finding documentation on the latency and throughput of the applicable instructions).

Two intrinsics that might be useful:
    __m128 _mm_cvtpi8_ps (__m64 a)   // Converts 8-bit signed integers in lower half of vector to floats
    __m128 _mm_cvtpi16_ps (__m64 a)  // Converts 16-bit signed integers (there is also an unsigned version) to floats

The 16-bit version (either signed or unsigned) would certainly be workable and would save the 16-bit to 32-bit step. The 8-bit version is more interesting because it would be faster, but the issue is that there is only a signed version, and we need unsigned. We might write something like this:
    __m64 fix8 = _mm_set_pi32(0, c);
    __m128 fix8_float = _mm_cvtpi8_ps(fix8);
    _mm_store_ps(fColor, fix8_float);

The issue would be that input color components in the range [128, 255] would be converted to the range [-128, -1]. I think we could trick our way around this by adding 128 before and after the conversion (I can draw out how this might work). It's not clear if this would be faster than a 16-bit version or not, but it might be worth trying.

mtklein 2015/03/03 15:02:17
The reason I've shied away from intrinsics like _mm_cvtpi8_ps is that they don't compile to single instructions. That'll be why you can't find latency and throughput numbers on them: they're implemented as compound operations that may vary from compiler to compiler. Looking at my xmmintrin.h from GCC and Clang, they do seem to vary, and they're quite a lot more work than what we've got here.

It's certainly worth a try locally to see what code's generated and how it performs. I just tried myself with this code
    _mm_store_ps(fColor, _mm_cvtpu8_ps(_mm_set_pi32(0, c)));
and it ran about 4x slower (at about 4x the instructions) than my current best. (As far as I can tell, _mm_cvtpu8_ps is what you'd want for unsigned 8 -> float conversions. But again, it's slow.)

As far as int->float conversions go, I really only see two options: cvtpi2ps (_mm_cvtpi32_ps) to convert 2 at a time with SSE, or cvtdq2ps (_mm_cvtepi32_ps) to convert 4 at a time with SSE2. (AVX lets us go 8 at a time of course, but that's not really realistic to target right now.)
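For reference, the unpack-then-convert path that the patch keeps can be sketched on its own, outside of Skia. This is only an illustration: `widen_to_floats` is a hypothetical name, a plain `uint32_t` stands in for `SkPMColor`, and an unaligned store replaces the aligned store into `fColor`.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for SkPMFloat::set(): widen four packed 8-bit
// components to 32-bit lanes, then convert all four lanes to float.
// The results stay in [0, 255]; no scaling to [0, 1] is applied.
inline void widen_to_floats(uint32_t c, float out[4]) {
    assert(out != nullptr);
    const __m128i zero = _mm_setzero_si128();
    __m128i fix8    = _mm_cvtsi32_si128(static_cast<int>(c));  // 4 bytes in the low 32 bits
    __m128i fix8_16 = _mm_unpacklo_epi8 (fix8,    zero);       // 8-bit values in 16-bit lanes
    __m128i fix8_32 = _mm_unpacklo_epi16(fix8_16, zero);       // 8-bit values in 32-bit lanes
    _mm_storeu_ps(out, _mm_cvtepi32_ps(fix8_32));              // 4 ints -> 4 floats
}
```

Interleaving with zero is what makes the widening unsigned, so components in [128, 255] need no bias correction on this path.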
18 _mm_store_ps(fColor, _mm_mul_ps(scaled, _mm_set1_ps(1.0f/255.0f)));

19 SkASSERT(this->isValid());	17 SkASSERT(this->isValid());
msarett 2015/03/03 14:34:28
I'm starting another comment for another train of thought. No matter what strategy we choose, we are going to have to pay a price for the conversion from ints to floats (latency = 3 cycles for _mm_cvtepi32_ps). But some of this can be offset by the high throughput of this instruction (throughput = 1 cycle for _mm_cvtepi32_ps). However, because we are only converting one pixel at a time with each call to this function, it is likely that the latency is being paid every time and that we are not taking advantage of the throughput. There is a chance that if we were to convert an array of pixels in a single function call and use a partly unrolled loop, we could see a performance improvement.

mtklein 2015/03/03 15:02:17
Yep, totally agree. We're thinking the next logical place to go is to do conversions of 4 pixels at a time, which also lets us do 128-bit reads and writes of SkPMColors, and lets us get more work out of the _mm_pack instructions.
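A rough sketch of what a 4-pixels-at-a-time conversion might look like follows. It is not code from this patch: `widen4_to_floats` is a hypothetical helper, `uint32_t` stands in for `SkPMColor`, and unaligned loads and stores are used so the sketch makes no alignment assumptions.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstdint>

// Hypothetical helper: convert 4 packed 32-bit pixels (16 bytes) into
// 16 floats in [0, 255] using one 128-bit load and four conversions.
inline void widen4_to_floats(const uint32_t px[4], float out[16]) {
    assert(px != nullptr && out != nullptr);
    const __m128i zero = _mm_setzero_si128();
    __m128i fix8 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(px));
    __m128i lo16 = _mm_unpacklo_epi8(fix8, zero);  // pixels 0 and 1, 16-bit lanes
    __m128i hi16 = _mm_unpackhi_epi8(fix8, zero);  // pixels 2 and 3, 16-bit lanes
    // The four conversions are independent of each other, so their
    // latencies can overlap instead of being paid once per call.
    _mm_storeu_ps(out +  0, _mm_cvtepi32_ps(_mm_unpacklo_epi16(lo16, zero)));  // pixel 0
    _mm_storeu_ps(out +  4, _mm_cvtepi32_ps(_mm_unpackhi_epi16(lo16, zero)));  // pixel 1
    _mm_storeu_ps(out +  8, _mm_cvtepi32_ps(_mm_unpacklo_epi16(hi16, zero)));  // pixel 2
    _mm_storeu_ps(out + 12, _mm_cvtepi32_ps(_mm_unpackhi_epi16(hi16, zero)));  // pixel 3
}
```

Whether this is actually faster than four single-pixel calls would need to be measured; the sketch only shows how the 128-bit read and the unpack steps line up.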
20 }	18 }

21	19

22 inline SkPMColor SkPMFloat::get() const {	20 inline SkPMColor SkPMFloat::get() const {

23 SkASSERT(this->isValid());	21 SkASSERT(this->isValid());

24 return this->clamped(); // At the moment, we don't know anything faster.	22 return this->clamped(); // At the moment, we don't know anything faster.

25 }	23 }

26	24

27 inline SkPMColor SkPMFloat::clamped() const {	25 inline SkPMColor SkPMFloat::clamped() const {

28 __m128 scaled = _mm_mul_ps(_mm_load_ps(fColor), _mm_set1_ps(255.0f));	26 __m128i fix8_32 = _mm_cvtps_epi32(_mm_load_ps(fColor)),

29 __m128i fix8_32 = _mm_cvtps_epi32(scaled),

30 fix8_16 = _mm_packus_epi16(fix8_32, fix8_32),	27 fix8_16 = _mm_packus_epi16(fix8_32, fix8_32),

31 fix8 = _mm_packus_epi16(fix8_16, fix8_16);	28 fix8 = _mm_packus_epi16(fix8_16, fix8_16);

32 SkPMColor c = _mm_cvtsi128_si32(fix8);	29 SkPMColor c = _mm_cvtsi128_si32(fix8);

33 SkPMColorAssert(c);	30 SkPMColorAssert(c);

34 return c;	31 return c;

35 }	32 }
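The narrowing direction in the new clamped() can be illustrated in isolation as below. This is a sketch under assumptions: `pack_clamped` is a hypothetical name, `uint32_t` replaces `SkPMColor`, an unaligned load replaces the aligned load of `fColor`, and the inputs are assumed to stay within the signed 16-bit range after rounding.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for SkPMFloat::clamped(): convert four floats to
// ints, then narrow twice with unsigned saturation to get packed bytes.
inline uint32_t pack_clamped(const float in[4]) {
    assert(in != nullptr);
    // Round each float to the nearest 32-bit integer (default rounding mode).
    __m128i fix8_32 = _mm_cvtps_epi32(_mm_loadu_ps(in));
    // Each 32-bit lane is read as two 16-bit halves and each half is
    // saturated to [0, 255]; the result holds one clamped value per 16-bit lane.
    __m128i fix8_16 = _mm_packus_epi16(fix8_32, fix8_32);
    // Narrow the 16-bit lanes down to bytes.
    __m128i fix8    = _mm_packus_epi16(fix8_16, fix8_16);
    // The low 32 bits now hold the four packed components.
    return static_cast<uint32_t>(_mm_cvtsi128_si32(fix8));
}
```

For example, the inputs {-5.0f, 0.0f, 128.0f, 300.0f} pack to 0xFF800000: the negative component clamps to 0 and the one above 255 clamps to 255.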
