Issue 1123263003: Really use SSE4 (and SSSE3) in SkBlurImage_SSE4 - Code Review

Chromium Code Reviews

chromiumcodereview-hr@appspot.gserviceaccount.com (chromiumcodereview-hr) | Please choose your nickname with Settings | Help | Chromium Project | Gerrit Changes | Sign out

(863)

My Issues | Starred Open | Closed | All

Issue 1123263003: Really use SSE4 (and SSSE3) in SkBlurImage_SSE4 (Closed)

Created:
5 years, 7 months ago by mtklein_C

Modified:
5 years, 4 months ago

Reviewers:
mtklein, reed1

CC:
reviews_skia.org

Base URL:
https://skia.googlesource.com/skia.git@master

Target Ref:
refs/heads/master

Project:
skia

Visibility:
Public.

More Reviews

Description

Really use SSE4 (and SSSE3) in SkBlurImage_SSE4 We don't seem to be making good use of the available instruction set. SSE4.1 gives us an easy way to unpack a pixel into an __m128i, and SSSE3 gave us an easy way to do the reverse. This should be bit-perfect and about a 10% speedup. BUG=skia: Committed: https://skia.googlesource.com/skia/+/e0cab96599764f7611de015f558a6b22162c3eda

Patch Set 1 #

Patch Set 2 : epu #

Patch Set 3 : todos #

Patch Set 4 : another note #

Patch Set 5 : note #

Patch Set 6 : win? #

Patch Set 7 : 127 #

Patch Set 8 : 0 #

Created: 5 years, 7 months ago

Download [raw] [tar.bz2]

		Unified diffs	Side-by-side diffs	Delta from patch set	Stats (+16 lines, -21 lines)			Patch
	M	src/opts/SkBlurImage_opts_SSE4.cpp	View	1 2 3 4 5 6 7	4 chunks	+16 lines, -21 lines	0 comments	Download

Messages

Total messages: 21 (8 generated)

Expand Messages | Collapse Messages | Show Generated Messages | Hide Generated Messages

mtklein_C

Hate to move the bar on you, but I couldn't help but notice this code ...

5 years, 7 months ago (2015-05-06 19:48:23 UTC) #2

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1123263003/80001

5 years, 7 months ago (2015-05-06 19:48:44 UTC) #4

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: Build-Win-MSVC-x86_64-Debug-Trybot on client.skia.compile (JOB_FAILED, http://build.chromium.org/p/client.skia.compile/builders/Build-Win-MSVC-x86_64-Debug-Trybot/builds/914)

5 years, 7 months ago (2015-05-06 19:52:00 UTC) #7

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1123263003/100001

5 years, 7 months ago (2015-05-06 19:53:38 UTC) #10

commit-bot: I haz the power

Try jobs failed on following builders: Build-Win-MSVC-x86-Debug-Trybot on client.skia.compile (JOB_FAILED, http://build.chromium.org/p/client.skia.compile/builders/Build-Win-MSVC-x86-Debug-Trybot/builds/904) Build-Win-MSVC-x86_64-Debug-Trybot on client.skia.compile (JOB_FAILED, ...

5 years, 7 months ago (2015-05-06 19:57:04 UTC) #12

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1123263003/140001

5 years, 7 months ago (2015-05-06 20:01:56 UTC) #15

commit-bot: I haz the power

Committed patchset #8 (id:140001) as https://skia.googlesource.com/skia/+/e0cab96599764f7611de015f558a6b22162c3eda

5 years, 7 months ago (2015-05-06 20:22:08 UTC) #16

henrik.smiding

On 2015/05/06 20:22:08, commit-bot: I haz the power wrote: > Committed patchset #8 (id:140001) as ...

5 years, 5 months ago (2015-07-01 09:16:08 UTC) #17

mtklein

On 2015/07/01 09:16:08, henrik.smiding wrote: > On 2015/05/06 20:22:08, commit-bot: I haz the power wrote: ...

5 years, 5 months ago (2015-07-01 13:09:46 UTC) #18

henrik.smiding

On 2015/07/01 13:09:46, mtklein wrote: > On 2015/07/01 09:16:08, henrik.smiding wrote: > > On 2015/05/06 ...

5 years, 4 months ago (2015-08-05 15:09:12 UTC) #19

Message was sent while issue was closed.

On 2015/07/01 13:09:46, mtklein wrote:
> On 2015/07/01 09:16:08, henrik.smiding wrote:
> > On 2015/05/06 20:22:08, commit-bot: I haz the power wrote:
> > > Committed patchset #8 (id:140001) as
> > >
> https://skia.googlesource.com/skia/+/e0cab96599764f7611de015f558a6b22162c3eda
> > 
> > Stumbled upon this patch. See that you replaced a shift and two pack
> > instructions with a set_epi8 and shuffle_epi8. 
> 
> Yes, though generally that set_epi8 is not really interesting to consider...
> it's a load from static memory that's hoisted out of the loop. 
> It's usually just sitting around in an xmm register.
> 
> > Are you sure this part gives
> > better performance on average? At least on a Atom Silvermont, shuffle_epi8
has
> > micro code and takes 5 cycles, compared to Haswell where it only takes 1
> cycle.
> 
> I don't remember any performance alerts triggering, but we can time it on an
> Atom again if you like.
> 
> I'll freely admit that most of our x86 test machines fall in the
Nehalem-Haswell
> range, and even in that range mostly Sandy Bridge or Ivy Bridge.
> It doesn't look like we even currently have an Atom CPU performance test bot,
> though that's something we could change.
> 
> Are there docs like
> https://software.intel.com/sites/landingpage/IntrinsicsGuide/ that include
Atom
> instruction costs?

Sorry, I totally missed this.

Yes, it would be great if you could get some x86 tablet running Silvermont and
use as a test bot. The Atom CPUs has a few instructions that uses micro code.
Especially with SSE4 instructions. I found this out the hard way myself.

I don't think they'll ever map intrinsics to cpu cycles directly. I usually take
the corresponding assembly instruction and do a lookup in Agner Fog's table for
Silvermont. Found here http://www.agner.org/optimize/instruction_tables.pdf
If the column μops is >1 then you might want to benchmark on Silvermont HW, or
find an alternative.

mtklein

On 2015/08/05 15:09:12, henrik.smiding wrote: > On 2015/07/01 13:09:46, mtklein wrote: > > On 2015/07/01 ...

5 years, 4 months ago (2015-08-05 16:02:38 UTC) #20

Message was sent while issue was closed.

On 2015/08/05 15:09:12, henrik.smiding wrote:
> On 2015/07/01 13:09:46, mtklein wrote:
> > On 2015/07/01 09:16:08, henrik.smiding wrote:
> > > On 2015/05/06 20:22:08, commit-bot: I haz the power wrote:
> > > > Committed patchset #8 (id:140001) as
> > > >
> >
https://skia.googlesource.com/skia/+/e0cab96599764f7611de015f558a6b22162c3eda
> > > 
> > > Stumbled upon this patch. See that you replaced a shift and two pack
> > > instructions with a set_epi8 and shuffle_epi8. 
> > 
> > Yes, though generally that set_epi8 is not really interesting to consider...
> > it's a load from static memory that's hoisted out of the loop. 
> > It's usually just sitting around in an xmm register.
> > 
> > > Are you sure this part gives
> > > better performance on average? At least on a Atom Silvermont, shuffle_epi8
> has
> > > micro code and takes 5 cycles, compared to Haswell where it only takes 1
> > cycle.
> > 
> > I don't remember any performance alerts triggering, but we can time it on an
> > Atom again if you like.
> > 
> > I'll freely admit that most of our x86 test machines fall in the
> Nehalem-Haswell
> > range, and even in that range mostly Sandy Bridge or Ivy Bridge.
> > It doesn't look like we even currently have an Atom CPU performance test
bot,
> > though that's something we could change.
> > 
> > Are there docs like
> > https://software.intel.com/sites/landingpage/IntrinsicsGuide/ that include
> Atom
> > instruction costs?
> 
> Sorry, I totally missed this.
> 
> Yes, it would be great if you could get some x86 tablet running Silvermont and
> use as a test bot. The Atom CPUs has a few instructions that uses micro code.
> Especially with SSE4 instructions. I found this out the hard way myself.
> 
> I don't think they'll ever map intrinsics to cpu cycles directly. I usually
take
> the corresponding assembly instruction and do a lookup in Agner Fog's table
for
> Silvermont. Found here http://www.agner.org/optimize/instruction_tables.pdf
> If the column μops is >1 then you might want to benchmark on Silvermont HW, or
> find an alternative.

Going to try to add a Silvermont perf bot.

mtklein

5 years, 4 months ago (2015-08-06 17:27:36 UTC) #21

Message was sent while issue was closed.

On 2015/08/05 16:02:38, mtklein wrote:
> On 2015/08/05 15:09:12, henrik.smiding wrote:
> > On 2015/07/01 13:09:46, mtklein wrote:
> > > On 2015/07/01 09:16:08, henrik.smiding wrote:
> > > > On 2015/05/06 20:22:08, commit-bot: I haz the power wrote:
> > > > > Committed patchset #8 (id:140001) as
> > > > >
> > >
> https://skia.googlesource.com/skia/+/e0cab96599764f7611de015f558a6b22162c3eda
> > > > 
> > > > Stumbled upon this patch. See that you replaced a shift and two pack
> > > > instructions with a set_epi8 and shuffle_epi8. 
> > > 
> > > Yes, though generally that set_epi8 is not really interesting to
consider...
> > > it's a load from static memory that's hoisted out of the loop. 
> > > It's usually just sitting around in an xmm register.
> > > 
> > > > Are you sure this part gives
> > > > better performance on average? At least on a Atom Silvermont,
shuffle_epi8
> > has
> > > > micro code and takes 5 cycles, compared to Haswell where it only takes 1
> > > cycle.
> > > 
> > > I don't remember any performance alerts triggering, but we can time it on
an
> > > Atom again if you like.
> > > 
> > > I'll freely admit that most of our x86 test machines fall in the
> > Nehalem-Haswell
> > > range, and even in that range mostly Sandy Bridge or Ivy Bridge.
> > > It doesn't look like we even currently have an Atom CPU performance test
> bot,
> > > though that's something we could change.
> > > 
> > > Are there docs like
> > > https://software.intel.com/sites/landingpage/IntrinsicsGuide/ that include
> > Atom
> > > instruction costs?
> > 
> > Sorry, I totally missed this.
> > 
> > Yes, it would be great if you could get some x86 tablet running Silvermont
and
> > use as a test bot. The Atom CPUs has a few instructions that uses micro
code.
> > Especially with SSE4 instructions. I found this out the hard way myself.
> > 
> > I don't think they'll ever map intrinsics to cpu cycles directly. I usually
> take
> > the corresponding assembly instruction and do a lookup in Agner Fog's table
> for
> > Silvermont. Found here http://www.agner.org/optimize/instruction_tables.pdf
> > If the column μops is >1 then you might want to benchmark on Silvermont HW,
or
> > find an alternative.
> 
> Going to try to add a Silvermont perf bot.

Right, so we now have a Silvermont perf bot, and I've been testing with one on
my desk.

If we switch from shuffle_epi8 back to srli_epi32/packs_epi32/packus_epi16 to
repack the pixel, 
we speed up between 2% and 18% on Silvermont, depending on the size of the blur.
 This same
change regresses my desktop by 2%-11%, again depending on the size of the blur.

That's very awkward.  I'm not keen to have to complicate the code by considering
microarchitecture,
and I'm similarly not eager to set a precedent that we should allow ourselves to
care.  We've been
down this route with NEON (where _every_ processor performs completely
differently) and it's not
proved manageable.

I don't want to sound like I'm shutting the door completely here, but the
momentum under src/opts
is going the other way, favoring less, clearer code over eeking out every bit of
performance.
This particularly holds for operations like blurs, where switching to Ganesh
gives a better
speedup than we could really ever offer with a CPU.

Expand Messages | Collapse Messages | Show Generated Messages | Hide Generated Messages

Powered by Google App Engine

This is Rietveld 408576698