|
|
DescriptionSkColorCubeFilter: require alpha == 0xFF.
This is about a 12% improvement on my desktop, from 134 to 118ms on our bench.
BUG=skia:
Committed: https://skia.googlesource.com/skia/+/d1c6b7c5007b5c609b44a9cdfe95ef64a5a8f29f
Patch Set 1 #Patch Set 2 : fixes #Patch Set 3 : just perf #
Total comments: 5
Messages
Total messages: 30 (7 generated)
The CQ bit was checked by mtklein@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1295873004/20001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1295873004/20001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
The CQ bit was checked by mtklein@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1295873004/40001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1295873004/40001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
mtklein@chromium.org changed reviewers: + noel@chromium.org, reed@google.com
lgtm
LGTM++ Awesome patch mtklein. I expect this to match qcms tetrahedral SSE perf-wise on win, an outstanding result. https://codereview.chromium.org/1295873004/diff/40001/include/effects/SkColor... File include/effects/SkColorCubeFilter.h (left): https://codereview.chromium.org/1295873004/diff/40001/include/effects/SkColor... include/effects/SkColorCubeFilter.h:21: */ Yeap, https://code.google.com/p/chromium/codesearch#chromium/src/third_party/qcms/s... 1222 https://codereview.chromium.org/1295873004/diff/40001/src/opts/SkPMFloat_none.h File src/opts/SkPMFloat_none.h (right): https://codereview.chromium.org/1295873004/diff/40001/src/opts/SkPMFloat_none... src/opts/SkPMFloat_none.h:43: SkGetPackedB32(c) * inv255); Yes, that's a better function name, and the SkGetPackedR32 -> SkColorGetR fix for lines 41-43 is https://codereview.chromium.org/1296383006 and done / submitted, so all good here. https://codereview.chromium.org/1295873004/diff/40001/src/opts/SkPMFloat_sse.h File src/opts/SkPMFloat_sse.h (left): https://codereview.chromium.org/1295873004/diff/40001/src/opts/SkPMFloat_sse.... src/opts/SkPMFloat_sse.h:59: fix8_32 = _mm_or_si128(fix8_32, _mm_set_epi32(0xFF,0,0,0)); // Force alpha to 1. Very cool; less instructions generally means more perf. Aside: there is one other way I've thought of implementing this entire function: const __m128 __color = _mm_set_ps((float)SkColorGetR(c), (float)SkColorGetR(g), (float)SkColorGetB(c), 1.0f); // FIXME SkPMFloat pmf = Sk4f(_mm_mul_ps(__color, _mm_set1_ps(1.0f/255))); SkASSERT(pmf.isValid()); return pmf; Would need a fix to get the __color R G B byte order correct for platform PMColor order. But with many fewer SSE instructions, it might be faster, or it might not -- it would depend on the latency of _mm_set_ps() and maybe the C++ compiler.
mtklein@google.com changed reviewers: + mtklein@google.com
https://codereview.chromium.org/1295873004/diff/40001/src/opts/SkPMFloat_sse.h File src/opts/SkPMFloat_sse.h (left): https://codereview.chromium.org/1295873004/diff/40001/src/opts/SkPMFloat_sse.... src/opts/SkPMFloat_sse.h:59: fix8_32 = _mm_or_si128(fix8_32, _mm_set_epi32(0xFF,0,0,0)); // Force alpha to 1. On 2015/08/20 01:43:29, noel gordon wrote: > Very cool; less instructions generally means more perf. > > Aside: there is one other way I've thought of implementing this entire function: > > const __m128 __color = _mm_set_ps((float)SkColorGetR(c), (float)SkColorGetR(g), > (float)SkColorGetB(c), 1.0f); // FIXME > SkPMFloat pmf = Sk4f(_mm_mul_ps(__color, _mm_set1_ps(1.0f/255))); > SkASSERT(pmf.isValid()); > return pmf; > > Would need a fix to get the __color R G B byte order correct for platform > PMColor order. But with many fewer SSE instructions, it might be faster, or it > might not -- it would depend on the latency of _mm_set_ps() and maybe the C++ > compiler. Yeah, _mm_set_ps doesn't really correspond to any particular instruction, so I've always preferred writing things out myself rather than having to hope GCC, Clang, and MSVC all generate something good. But it's worth trying out. If we find we can trust _mm_set_ps, it'd make lots of code quite a bit clearer. I'm not particularly optimistic about this one though, as it'd involve three cvtsi2ss calls rather than a single cvtdq2ps, and a bunch of serial shifts to move components around.
The CQ bit was checked by mtklein@google.com
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1295873004/40001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1295873004/40001
Message was sent while issue was closed.
Committed patchset #3 (id:40001) as https://skia.googlesource.com/skia/+/d1c6b7c5007b5c609b44a9cdfe95ef64a5a8f29f
Message was sent while issue was closed.
On 2015/08/20 01:56:54, commit-bot: I haz the power wrote: > Committed patchset #3 (id:40001) as > https://skia.googlesource.com/skia/+/d1c6b7c5007b5c609b44a9cdfe95ef64a5a8f29f I'm going to be out for a few days, but if you're feeling eager, the implementation I was leaning toward trying next requires SSE 4.1: #if SK_CPU_SSE_LEVEL >= SK_CPU_SSE_LEVEL_SSE41 && defined(SK_PMCOLOR_IS_BGRA) _mm_mul_ps(_mm_set1_ps(1.0f/255), _mm_cvtepi32_ps(_mm_cvtepu8_epi32(_mm_cvtsi32_si128((int)c)))) #else ... ibid ... #endif I've seen at least Clang fuse _mm_cvtepu8_epi32(_mm_cvtsi32_si128((int)c)) into a single pmovzxbd instruction with a memory argument, so I think this could be as few as 3 instructions (again, only when SkPMColor is BGRA): pmovzxbd xmm0, &c cvtdq2ps xmm0, xmm0 mulps xmm0, &<a static constant with 4x 1.0f/255>
Message was sent while issue was closed.
On 2015/08/20 01:55:17, mtklein wrote: > https://codereview.chromium.org/1295873004/diff/40001/src/opts/SkPMFloat_sse.h > File src/opts/SkPMFloat_sse.h (left): > > https://codereview.chromium.org/1295873004/diff/40001/src/opts/SkPMFloat_sse.... > src/opts/SkPMFloat_sse.h:59: fix8_32 = _mm_or_si128(fix8_32, > _mm_set_epi32(0xFF,0,0,0)); // Force alpha to 1. > On 2015/08/20 01:43:29, noel gordon wrote: > > Very cool; less instructions generally means more perf. > > > > Aside: there is one other way I've thought of implementing this entire > function: > > > > const __m128 __color = _mm_set_ps((float)SkColorGetR(c), > (float)SkColorGetR(g), > > (float)SkColorGetB(c), 1.0f); // FIXME > > SkPMFloat pmf = Sk4f(_mm_mul_ps(__color, _mm_set1_ps(1.0f/255))); > > SkASSERT(pmf.isValid()); > > return pmf; > > > > Would need a fix to get the __color R G B byte order correct for platform > > PMColor order. But with many fewer SSE instructions, it might be faster, or > it > > might not -- it would depend on the latency of _mm_set_ps() and maybe the C++ > > compiler. > > Yeah, _mm_set_ps doesn't really correspond to any particular instruction, Ah yes, it's a compiler intrinsic so .. > so I've always preferred writing things out myself rather than having to hope GCC, > Clang, and MSVC all generate something good. this indeed makes sense. > But it's worth trying out. If we > find we can trust _mm_set_ps, it'd make lots of code quite a bit clearer. One thing we found when doing the qcms tetra SSE, was that MSVC _mm_set_ps() was awesome compared to writing it out long hand. However, the long hand was better on Linux and Mac. Since most chrome users are in win, we preferred _mm_set_ps() to avoid #ifdef-ing the code up the wahzoo. > I'm not particularly optimistic about this one though, as it'd involve three > cvtsi2ss calls rather than a single cvtdq2ps, and a bunch of serial shifts to > move components around. When measuring win, I recall I quickly tried the _mm_set_ps() idea. Shorter code, and I expected to see a perf improvement compared to long hand way of SkPMFloat. I _did not_ see any improvement in MSVC 2013 at all (from memory, was doing other measurements for you guys at that time so I dropped _mm_set_ps()) The result suggested to me that the long hand way in Skia is very fast / good. @Radu is this something we could re-investigate for QCMS? eg., compare _mm_set_ps(), vs a backport of the Skia readers from SkPMFloat into QCMS to confirm whether _mm_set_ps() is the right option?
Message was sent while issue was closed.
On 2015/08/20 02:07:26, mtklein wrote: > On 2015/08/20 01:56:54, commit-bot: I haz the power wrote: > > Committed patchset #3 (id:40001) as > > https://skia.googlesource.com/skia/+/d1c6b7c5007b5c609b44a9cdfe95ef64a5a8f29f > > I'm going to be out for a few days, Thanks for letting me know. Radu will out for the next 10 days also. We'll circle back on these issues when you both are back. Appreciate the work you guys have done on this, win32 was 510ms when we started, and now it's something like ~120ms I expect. > but if you're feeling eager, > the implementation I was leaning toward trying next requires SSE 4.1: > > #if SK_CPU_SSE_LEVEL >= SK_CPU_SSE_LEVEL_SSE41 && defined(SK_PMCOLOR_IS_BGRA) > _mm_mul_ps(_mm_set1_ps(1.0f/255), > _mm_cvtepi32_ps(_mm_cvtepu8_epi32(_mm_cvtsi32_si128((int)c)))) > #else > ... ibid ... > #endif > > I've seen at least Clang fuse _mm_cvtepu8_epi32(_mm_cvtsi32_si128((int)c)) into > a single pmovzxbd instruction with a memory argument, so I think this could be > as few as 3 instructions (again, only when SkPMColor is BGRA): > pmovzxbd xmm0, &c > cvtdq2ps xmm0, xmm0 > mulps xmm0, &<a static constant with 4x 1.0f/255> Great info, and that does look rocking fast. @Radu looks like another thing we could try for QCMS.
Message was sent while issue was closed.
> This is about a 12% improvement on my desktop, from 134 to 118ms on our bench. Confirm: for a 33 cube, on my mac air, I'm seeing 146ms (before), 129ms (after).
Message was sent while issue was closed.
https://codereview.chromium.org/1295873004/diff/40001/src/opts/SkColorCubeFil... File src/opts/SkColorCubeFilter_opts.h (right): https://codereview.chromium.org/1295873004/diff/40001/src/opts/SkColorCubeFil... src/opts/SkColorCubeFilter_opts.h:56: for (int x = 0; x < 2; ++x) { Possible followup: unroll this loop. Mac air goes from 129ms -> 119ms for me in that case.
Message was sent while issue was closed.
On 2015/08/20 07:10:07, noel gordon wrote: > https://codereview.chromium.org/1295873004/diff/40001/src/opts/SkColorCubeFil... > src/opts/SkColorCubeFilter_opts.h:56: for (int x = 0; x < 2; ++x) { > Possible followup: unroll this loop. Mac air goes from 129ms -> 119ms for me in > that case. mtklein@ you still there?
Message was sent while issue was closed.
On 2015/08/27 06:20:24, noel gordon wrote: > On 2015/08/20 07:10:07, noel gordon wrote: > > > > https://codereview.chromium.org/1295873004/diff/40001/src/opts/SkColorCubeFil... > > src/opts/SkColorCubeFilter_opts.h:56: for (int x = 0; x < 2; ++x) { > > Possible followup: unroll this loop. Mac air goes from 129ms -> 119ms for me > in > > that case. > > mtklein@ you still there? Yes?
Message was sent while issue was closed.
> Yes? #22 do/did you confirm?
Message was sent while issue was closed.
On 2015/08/27 12:41:37, noel gordon wrote: > > Yes? > > #22 do/did you confirm? Oddly, no. Also on a Mac, I just tried both manually unrolling (small regression) and #pragma unroll (no effect). If you'd like to try it I'm still game. We can watch the perf bots (https://perf.skia.org/#3984) to see if it's a win, and revert it if not. I think we've still got opportunities here to simply do less work. Here are a few I've thought of that should all be pursuable independently: 1) swizzle the color cube once into SkPMColors (in the filters ctor, or earlier) so that we don't have to worry about that here. That'd let us focus on optimizing SkPMFloat::FromPMColor() rather than having to split our attention between that an FromOpaqueColor(); 2) FromPMColor() can be made faster with SSE 4.1 using pmovzxbd as I outlined above. A quick hack up makes this look like a 5-10% speedup, and it'll benefit a few other parts of Skia too. 3) Parameterize SkPMFloat by a divisor bias. E.g. SkPMFloat<1> would be today's SkPMFloat, and SkPMFloat<255> would be the SkPMFloat of yore that worked with float components in range [0.0f, 255.0f]. This lets us skip a multiply on both sides of the conversions. Usually the math ends up more complicated in 255-space (and it's weird to think about "8-bit floats") but it looks like this use case is pretty neutral to a bias, so <255> might be a nice win.
Message was sent while issue was closed.
On 2015/08/27 13:07:06, mtklein wrote: > On 2015/08/27 12:41:37, noel gordon wrote: > > > Yes? > > > > #22 do/did you confirm? > > Oddly, no. Also on a Mac, I just tried both manually unrolling (small > regression) and #pragma unroll (no effect). > If you'd like to try it I'm still game. We can watch the perf bots > (https://perf.skia.org/#3984) to see if it's a win, and revert it if not. I measured on a mac air, it has an integrated GPU (shares the memory bus with the CPU). Perhaps your mac has a discrete GPU (has it's own memory bus), like a mac pro or similar?
Message was sent while issue was closed.
> I measured on a mac air, it has an integrated GPU (shares the memory bus with > the CPU). Perhaps your mac has a discrete GPU (has it's own memory bus), like a > mac pro or similar? Does that matter? We're entirely in software here. No GPU involved in these tests. But yes, I've got a MBP locked into Intel HD Graphics 4000 mode.
Message was sent while issue was closed.
> 3) Parameterize SkPMFloat by a divisor bias. E.g. SkPMFloat<1> would be > today's SkPMFloat, and SkPMFloat<255> would be the SkPMFloat of yore that worked > with float components in range [0.0f, 255.0f]. This lets us skip a multiply on > both sides of the conversions. Usually the math ends up more complicated in > 255-space (and it's weird to think about "8-bit floats") but it looks like this > use case is pretty neutral to a bias, so <255> might be a nice win. I hacked up this idea and it looks like we'll see another ~15% speedup. Going to try to turn it into a real CL.
Message was sent while issue was closed.
On 2015/08/27 14:00:51, mtklein wrote: > > I measured on a mac air, it has an integrated GPU (shares the memory bus with > > the CPU). Perhaps your mac has a discrete GPU (has it's own memory bus), like > a > > mac pro or similar? > > Does that matter? We're entirely in software here. No GPU involved in these > tests. > > But yes, I've got a MBP locked into Intel HD Graphics 4000 mode. No it shouldn't matter, but I do see the improvement on a mac air. Anyway, with the change in bias CL that was discussed in another CL, we get a good speed improvement. My result is: if you combine that CL with a loop unroll, there is a speed _regression_. |