src/opts/SkPMFloat_SSE2.h - Issue 973603002: Make SkPMFloats store floats in [0,255] instead of [0,1].

Unified Diff: src/opts/SkPMFloat_SSE2.h

Issue 973603002: Make SkPMFloats store floats in [0,255] instead of [0,1]. (Closed) Base URL: https://skia.googlesource.com/skia.git@master

Patch Set: restore comment Created 5 years, 10 months ago


Index: src/opts/SkPMFloat_SSE2.h

diff --git a/src/opts/SkPMFloat_SSE2.h b/src/opts/SkPMFloat_SSE2.h

index c3dbbf2cb50d655fb77d840a7ace132f5cec7cb7..f28608079f3a4ac9a41a6c4bcc5c941261c4dbf9 100644

--- a/src/opts/SkPMFloat_SSE2.h

+++ b/src/opts/SkPMFloat_SSE2.h

@@ -2,11 +2,10 @@

#include <emmintrin.h>

// For set(), we widen our 8 bit components (fix8) to 8-bit components in 16 bits (fix8_16),

-// then widen those to 8-bit-in-32-bits (fix8_32), convert those to floats (scaled),

-// then finally scale those down from [0.0f, 255.0f] to [0.0f, 1.0f] into fColor.

+// then widen those to 8-bit-in-32-bits (fix8_32), and finally convert those to floats.

-// get() and clamped() do the opposite, working from [0.0f, 1.0f] floats to [0.0f, 255.0f],

-// to 8-bit-in-32-bit, to 8-bit-in-16-bit, back down to 8-bit components.

+// get() and clamped() do the opposite, working from floats to 8-bit-in-32-bit,

+// to 8-bit-in-16-bit, back down to 8-bit components.

// _mm_packus_epi16() gives us clamping for free while narrowing.

inline void SkPMFloat::set(SkPMColor c) {

@@ -14,8 +13,7 @@ inline void SkPMFloat::set(SkPMColor c) {

__m128i fix8 = _mm_set_epi32(0,0,0,c),

fix8_16 = _mm_unpacklo_epi8 (fix8, _mm_setzero_si128()),

fix8_32 = _mm_unpacklo_epi16(fix8_16, _mm_setzero_si128());

- __m128 scaled = _mm_cvtepi32_ps(fix8_32);

- _mm_store_ps(fColor, _mm_mul_ps(scaled, _mm_set1_ps(1.0f/255.0f)));

+ _mm_store_ps(fColor, _mm_cvtepi32_ps(fix8_32));

msarett 2015/03/03 14:34:28 I think we might be able to improve performance a

mtklein 2015/03/03 15:02:17

The reason I've shied away from intrinsics like _mm_cvtpi8_ps is that they don't compile to single instructions. That'll be why you can't find latency and throughput numbers on them: they're implemented as compound operations that may vary from compiler to compiler. Looking at my xmmintrin.h from GCC and Clang, they do seem to vary, and they're quite a lot more work than what we've got here.

It's certainly worth a try locally to see what code's generated and how it performs. I just tried myself with this code

_mm_store_ps(fColor, _mm_cvtpu8_ps(_mm_set_pi32(0, c)));

and it ran about 4x slower (at about 4x the instructions) than my current best. (As far as I can tell, _mm_cvtpu8_ps is what you'd want for unsigned 8 -> float conversions. But again, it's slow.)

As far as int->float conversions go, I really only see two options: cvtpi2ps (_mm_cvtpi32_ps) to convert 2 at a time from SSE, or cvtdq2ps (_mm_cvtepi32_ps) to convert 4 at a time from SSE2. (AVX lets us go 8 at a time of course, but that's not really realistic to target right now.)
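For anyone who wants to try the 4-at-a-time path locally, here is a minimal standalone sketch of the widening sequence that the new set() uses. It is not the Skia code itself: widen_to_floats is a hypothetical free function, a plain uint32_t stands in for SkPMColor, and an unaligned store is used so the output buffer needs no special alignment.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Hypothetical stand-in for SkPMFloat::set(): widen four packed 8-bit
// components into four floats in [0, 255], with no 1/255 scaling step.
inline void widen_to_floats(uint32_t c, float out[4]) {
    const __m128i zero = _mm_setzero_si128();
    // Place the 32-bit color in the low lane, upper lanes zero.
    __m128i fix8    = _mm_set_epi32(0, 0, 0, static_cast<int>(c));
    // Interleave with zero bytes: each 8-bit component becomes 16 bits.
    __m128i fix8_16 = _mm_unpacklo_epi8(fix8, zero);
    // Interleave with zero words: each component becomes 32 bits.
    __m128i fix8_32 = _mm_unpacklo_epi16(fix8_16, zero);
    // Convert four 32-bit ints to four floats in one step.
    _mm_storeu_ps(out, _mm_cvtepi32_ps(fix8_32));
}
```

Calling widen_to_floats(0x80FF4000u, out) should leave the lowest byte in out[0] and the highest in out[3], since the unpack steps preserve byte order within the low lane.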

SkASSERT(this->isValid());

msarett 2015/03/03 14:34:28 I'm starting another comment for another train of

mtklein 2015/03/03 15:02:17 Yep, totally agree. We're thinking the next logic

}

@@ -25,8 +23,7 @@ inline SkPMColor SkPMFloat::get() const {

}

inline SkPMColor SkPMFloat::clamped() const {

- __m128 scaled = _mm_mul_ps(_mm_load_ps(fColor), _mm_set1_ps(255.0f));

- __m128i fix8_32 = _mm_cvtps_epi32(scaled),

+ __m128i fix8_32 = _mm_cvtps_epi32(_mm_load_ps(fColor)),

fix8_16 = _mm_packus_epi16(fix8_32, fix8_32),

fix8 = _mm_packus_epi16(fix8_16, fix8_16);

SkPMColor c = _mm_cvtsi128_si32(fix8);
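As a rough illustration of the "clamping for free" remark in the header comment, the narrowing steps of clamped() can be exercised in isolation. This is only a sketch under assumptions: narrow_clamped is a hypothetical name, uint32_t stands in for SkPMColor, and the input is a plain float array rather than fColor.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Hypothetical stand-in for SkPMFloat::clamped(): convert four floats that
// are nominally in [0, 255] back to four packed 8-bit components.
inline uint32_t narrow_clamped(const float in[4]) {
    // Round each float to a 32-bit int (current rounding mode).
    __m128i fix8_32 = _mm_cvtps_epi32(_mm_loadu_ps(in));
    // Treat the register as 16-bit lanes and pack to bytes with unsigned
    // saturation: negative lanes become 0, lanes above 255 become 255.
    __m128i fix8_16 = _mm_packus_epi16(fix8_32, fix8_32);
    // Pack once more to squeeze out the interleaved zero bytes.
    __m128i fix8    = _mm_packus_epi16(fix8_16, fix8_16);
    // The low 32 bits now hold the four components.
    return static_cast<uint32_t>(_mm_cvtsi128_si32(fix8));
}
```

With an input of {-5.0f, 64.0f, 300.0f, 255.0f} the out-of-range components saturate to 0 and 255 respectively. Note that this relies on the rounded values fitting in a signed 16-bit lane; a component above 32767 would be read as negative by the first pack and end up as 0 rather than 255.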
