|
|
DescriptionReally use SSE4 (and SSSE3) in SkBlurImage_SSE4
We don't seem to be making good use of the available instruction set.
SSE4.1 gives us an easy way to unpack a pixel into an __m128i, and
SSSE3 gave us an easy way to do the reverse.
This should be bit-perfect and about a 10% speedup.
BUG=skia:
Committed: https://skia.googlesource.com/skia/+/e0cab96599764f7611de015f558a6b22162c3eda
Patch Set 1 #Patch Set 2 : epu #Patch Set 3 : todos #Patch Set 4 : another note #Patch Set 5 : note #Patch Set 6 : win? #Patch Set 7 : 127 #Patch Set 8 : 0 #Messages
Total messages: 21 (8 generated)
mtklein@chromium.org changed reviewers: + reed@google.com
Hate to move the bar on you, but I couldn't help but notice this code wasn't really fully upgraded to SSE4.
The CQ bit was checked by mtklein@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1123263003/80001
lgtm
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: Build-Win-MSVC-x86_64-Debug-Trybot on client.skia.compile (JOB_FAILED, http://build.chromium.org/p/client.skia.compile/builders/Build-Win-MSVC-x86_6...)
The CQ bit was checked by mtklein@google.com
The patchset sent to the CQ was uploaded after l-g-t-m from reed@google.com Link to the patchset: https://codereview.chromium.org/1123263003/#ps100001 (title: "win?")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1123263003/100001
The CQ bit was unchecked by commit-bot@chromium.org
Try jobs failed on following builders: Build-Win-MSVC-x86-Debug-Trybot on client.skia.compile (JOB_FAILED, http://build.chromium.org/p/client.skia.compile/builders/Build-Win-MSVC-x86-D...) Build-Win-MSVC-x86_64-Debug-Trybot on client.skia.compile (JOB_FAILED, http://build.chromium.org/p/client.skia.compile/builders/Build-Win-MSVC-x86_6...)
The CQ bit was checked by mtklein@google.com
The patchset sent to the CQ was uploaded after l-g-t-m from reed@google.com Link to the patchset: https://codereview.chromium.org/1123263003/#ps140001 (title: "0")
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1123263003/140001
Message was sent while issue was closed.
Committed patchset #8 (id:140001) as https://skia.googlesource.com/skia/+/e0cab96599764f7611de015f558a6b22162c3eda
Message was sent while issue was closed.
On 2015/05/06 20:22:08, commit-bot: I haz the power wrote: > Committed patchset #8 (id:140001) as > https://skia.googlesource.com/skia/+/e0cab96599764f7611de015f558a6b22162c3eda Stumbled upon this patch. See that you replaced a shift and two pack instructions with a set_epi8 and shuffle_epi8. Are you sure this part gives better performance on average? At least on a Atom Silvermont, shuffle_epi8 has micro code and takes 5 cycles, compared to Haswell where it only takes 1 cycle.
Message was sent while issue was closed.
On 2015/07/01 09:16:08, henrik.smiding wrote: > On 2015/05/06 20:22:08, commit-bot: I haz the power wrote: > > Committed patchset #8 (id:140001) as > > https://skia.googlesource.com/skia/+/e0cab96599764f7611de015f558a6b22162c3eda > > Stumbled upon this patch. See that you replaced a shift and two pack > instructions with a set_epi8 and shuffle_epi8. Yes, though generally that set_epi8 is not really interesting to consider... it's a load from static memory that's hoisted out of the loop. It's usually just sitting around in an xmm register. > Are you sure this part gives > better performance on average? At least on a Atom Silvermont, shuffle_epi8 has > micro code and takes 5 cycles, compared to Haswell where it only takes 1 cycle. I don't remember any performance alerts triggering, but we can time it on an Atom again if you like. I'll freely admit that most of our x86 test machines fall in the Nehalem-Haswell range, and even in that range mostly Sandy Bridge or Ivy Bridge. It doesn't look like we even currently have an Atom CPU performance test bot, though that's something we could change. Are there docs like https://software.intel.com/sites/landingpage/IntrinsicsGuide/ that include Atom instruction costs?
Message was sent while issue was closed.
On 2015/07/01 13:09:46, mtklein wrote: > On 2015/07/01 09:16:08, henrik.smiding wrote: > > On 2015/05/06 20:22:08, commit-bot: I haz the power wrote: > > > Committed patchset #8 (id:140001) as > > > > https://skia.googlesource.com/skia/+/e0cab96599764f7611de015f558a6b22162c3eda > > > > Stumbled upon this patch. See that you replaced a shift and two pack > > instructions with a set_epi8 and shuffle_epi8. > > Yes, though generally that set_epi8 is not really interesting to consider... > it's a load from static memory that's hoisted out of the loop. > It's usually just sitting around in an xmm register. > > > Are you sure this part gives > > better performance on average? At least on a Atom Silvermont, shuffle_epi8 has > > micro code and takes 5 cycles, compared to Haswell where it only takes 1 > cycle. > > I don't remember any performance alerts triggering, but we can time it on an > Atom again if you like. > > I'll freely admit that most of our x86 test machines fall in the Nehalem-Haswell > range, and even in that range mostly Sandy Bridge or Ivy Bridge. > It doesn't look like we even currently have an Atom CPU performance test bot, > though that's something we could change. > > Are there docs like > https://software.intel.com/sites/landingpage/IntrinsicsGuide/ that include Atom > instruction costs? Sorry, I totally missed this. Yes, it would be great if you could get some x86 tablet running Silvermont and use as a test bot. The Atom CPUs has a few instructions that uses micro code. Especially with SSE4 instructions. I found this out the hard way myself. I don't think they'll ever map intrinsics to cpu cycles directly. I usually take the corresponding assembly instruction and do a lookup in Agner Fog's table for Silvermont. Found here http://www.agner.org/optimize/instruction_tables.pdf If the column μops is >1 then you might want to benchmark on Silvermont HW, or find an alternative.
Message was sent while issue was closed.
On 2015/08/05 15:09:12, henrik.smiding wrote: > On 2015/07/01 13:09:46, mtklein wrote: > > On 2015/07/01 09:16:08, henrik.smiding wrote: > > > On 2015/05/06 20:22:08, commit-bot: I haz the power wrote: > > > > Committed patchset #8 (id:140001) as > > > > > > https://skia.googlesource.com/skia/+/e0cab96599764f7611de015f558a6b22162c3eda > > > > > > Stumbled upon this patch. See that you replaced a shift and two pack > > > instructions with a set_epi8 and shuffle_epi8. > > > > Yes, though generally that set_epi8 is not really interesting to consider... > > it's a load from static memory that's hoisted out of the loop. > > It's usually just sitting around in an xmm register. > > > > > Are you sure this part gives > > > better performance on average? At least on a Atom Silvermont, shuffle_epi8 > has > > > micro code and takes 5 cycles, compared to Haswell where it only takes 1 > > cycle. > > > > I don't remember any performance alerts triggering, but we can time it on an > > Atom again if you like. > > > > I'll freely admit that most of our x86 test machines fall in the > Nehalem-Haswell > > range, and even in that range mostly Sandy Bridge or Ivy Bridge. > > It doesn't look like we even currently have an Atom CPU performance test bot, > > though that's something we could change. > > > > Are there docs like > > https://software.intel.com/sites/landingpage/IntrinsicsGuide/ that include > Atom > > instruction costs? > > Sorry, I totally missed this. > > Yes, it would be great if you could get some x86 tablet running Silvermont and > use as a test bot. The Atom CPUs has a few instructions that uses micro code. > Especially with SSE4 instructions. I found this out the hard way myself. > > I don't think they'll ever map intrinsics to cpu cycles directly. I usually take > the corresponding assembly instruction and do a lookup in Agner Fog's table for > Silvermont. Found here http://www.agner.org/optimize/instruction_tables.pdf > If the column μops is >1 then you might want to benchmark on Silvermont HW, or > find an alternative. Going to try to add a Silvermont perf bot.
Message was sent while issue was closed.
On 2015/08/05 16:02:38, mtklein wrote: > On 2015/08/05 15:09:12, henrik.smiding wrote: > > On 2015/07/01 13:09:46, mtklein wrote: > > > On 2015/07/01 09:16:08, henrik.smiding wrote: > > > > On 2015/05/06 20:22:08, commit-bot: I haz the power wrote: > > > > > Committed patchset #8 (id:140001) as > > > > > > > > > https://skia.googlesource.com/skia/+/e0cab96599764f7611de015f558a6b22162c3eda > > > > > > > > Stumbled upon this patch. See that you replaced a shift and two pack > > > > instructions with a set_epi8 and shuffle_epi8. > > > > > > Yes, though generally that set_epi8 is not really interesting to consider... > > > it's a load from static memory that's hoisted out of the loop. > > > It's usually just sitting around in an xmm register. > > > > > > > Are you sure this part gives > > > > better performance on average? At least on a Atom Silvermont, shuffle_epi8 > > has > > > > micro code and takes 5 cycles, compared to Haswell where it only takes 1 > > > cycle. > > > > > > I don't remember any performance alerts triggering, but we can time it on an > > > Atom again if you like. > > > > > > I'll freely admit that most of our x86 test machines fall in the > > Nehalem-Haswell > > > range, and even in that range mostly Sandy Bridge or Ivy Bridge. > > > It doesn't look like we even currently have an Atom CPU performance test > bot, > > > though that's something we could change. > > > > > > Are there docs like > > > https://software.intel.com/sites/landingpage/IntrinsicsGuide/ that include > > Atom > > > instruction costs? > > > > Sorry, I totally missed this. > > > > Yes, it would be great if you could get some x86 tablet running Silvermont and > > use as a test bot. The Atom CPUs has a few instructions that uses micro code. > > Especially with SSE4 instructions. I found this out the hard way myself. > > > > I don't think they'll ever map intrinsics to cpu cycles directly. I usually > take > > the corresponding assembly instruction and do a lookup in Agner Fog's table > for > > Silvermont. Found here http://www.agner.org/optimize/instruction_tables.pdf > > If the column μops is >1 then you might want to benchmark on Silvermont HW, or > > find an alternative. > > Going to try to add a Silvermont perf bot. Right, so we now have a Silvermont perf bot, and I've been testing with one on my desk. If we switch from shuffle_epi8 back to srli_epi32/packs_epi32/packus_epi16 to repack the pixel, we speed up between 2% and 18% on Silvermont, depending on the size of the blur. This same change regresses my desktop by 2%-11%, again depending on the size of the blur. That's very awkward. I'm not keen to have to complicate the code by considering microarchitecture, and I'm similarly not eager to set a precedent that we should allow ourselves to care. We've been down this route with NEON (where _every_ processor performs completely differently) and it's not proved manageable. I don't want to sound like I'm shutting the door completely here, but the momentum under src/opts is going the other way, favoring less, clearer code over eeking out every bit of performance. This particularly holds for operations like blurs, where switching to Ganesh gives a better speedup than we could really ever offer with a CPU. |