Chromium Code Reviews
chromiumcodereview-hr@appspot.gserviceaccount.com (chromiumcodereview-hr) | Please choose your nickname with Settings | Help | Chromium Project | Gerrit Changes | Sign out
(863)

Issue 1123263003: Really use SSE4 (and SSSE3) in SkBlurImage_SSE4 (Closed)

Created:
5 years, 7 months ago by mtklein_C
Modified:
5 years, 4 months ago
Reviewers:
mtklein, reed1
CC:
reviews_skia.org
Base URL:
https://skia.googlesource.com/skia.git@master
Target Ref:
refs/heads/master
Project:
skia
Visibility:
Public.

Description

Really use SSE4 (and SSSE3) in SkBlurImage_SSE4 We don't seem to be making good use of the available instruction set. SSE4.1 gives us an easy way to unpack a pixel into an __m128i, and SSSE3 gave us an easy way to do the reverse. This should be bit-perfect and about a 10% speedup. BUG=skia: Committed: https://skia.googlesource.com/skia/+/e0cab96599764f7611de015f558a6b22162c3eda

Patch Set 1 #

Patch Set 2 : epu #

Patch Set 3 : todos #

Patch Set 4 : another note #

Patch Set 5 : note #

Patch Set 6 : win? #

Patch Set 7 : 127 #

Patch Set 8 : 0 #

Unified diffs Side-by-side diffs Delta from patch set Stats (+16 lines, -21 lines) Patch
M src/opts/SkBlurImage_opts_SSE4.cpp View 1 2 3 4 5 6 7 4 chunks +16 lines, -21 lines 0 comments Download

Messages

Total messages: 21 (8 generated)
mtklein_C
Hate to move the bar on you, but I couldn't help but notice this code ...
5 years, 7 months ago (2015-05-06 19:48:23 UTC) #2
commit-bot: I haz the power
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1123263003/80001
5 years, 7 months ago (2015-05-06 19:48:44 UTC) #4
reed1
lgtm
5 years, 7 months ago (2015-05-06 19:50:04 UTC) #5
commit-bot: I haz the power
Dry run: Try jobs failed on following builders: Build-Win-MSVC-x86_64-Debug-Trybot on client.skia.compile (JOB_FAILED, http://build.chromium.org/p/client.skia.compile/builders/Build-Win-MSVC-x86_64-Debug-Trybot/builds/914)
5 years, 7 months ago (2015-05-06 19:52:00 UTC) #7
commit-bot: I haz the power
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1123263003/100001
5 years, 7 months ago (2015-05-06 19:53:38 UTC) #10
commit-bot: I haz the power
Try jobs failed on following builders: Build-Win-MSVC-x86-Debug-Trybot on client.skia.compile (JOB_FAILED, http://build.chromium.org/p/client.skia.compile/builders/Build-Win-MSVC-x86-Debug-Trybot/builds/904) Build-Win-MSVC-x86_64-Debug-Trybot on client.skia.compile (JOB_FAILED, ...
5 years, 7 months ago (2015-05-06 19:57:04 UTC) #12
commit-bot: I haz the power
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1123263003/140001
5 years, 7 months ago (2015-05-06 20:01:56 UTC) #15
commit-bot: I haz the power
Committed patchset #8 (id:140001) as https://skia.googlesource.com/skia/+/e0cab96599764f7611de015f558a6b22162c3eda
5 years, 7 months ago (2015-05-06 20:22:08 UTC) #16
henrik.smiding
On 2015/05/06 20:22:08, commit-bot: I haz the power wrote: > Committed patchset #8 (id:140001) as ...
5 years, 5 months ago (2015-07-01 09:16:08 UTC) #17
mtklein
On 2015/07/01 09:16:08, henrik.smiding wrote: > On 2015/05/06 20:22:08, commit-bot: I haz the power wrote: ...
5 years, 5 months ago (2015-07-01 13:09:46 UTC) #18
henrik.smiding
On 2015/07/01 13:09:46, mtklein wrote: > On 2015/07/01 09:16:08, henrik.smiding wrote: > > On 2015/05/06 ...
5 years, 4 months ago (2015-08-05 15:09:12 UTC) #19
mtklein
On 2015/08/05 15:09:12, henrik.smiding wrote: > On 2015/07/01 13:09:46, mtklein wrote: > > On 2015/07/01 ...
5 years, 4 months ago (2015-08-05 16:02:38 UTC) #20
mtklein
5 years, 4 months ago (2015-08-06 17:27:36 UTC) #21
Message was sent while issue was closed.
On 2015/08/05 16:02:38, mtklein wrote:
> On 2015/08/05 15:09:12, henrik.smiding wrote:
> > On 2015/07/01 13:09:46, mtklein wrote:
> > > On 2015/07/01 09:16:08, henrik.smiding wrote:
> > > > On 2015/05/06 20:22:08, commit-bot: I haz the power wrote:
> > > > > Committed patchset #8 (id:140001) as
> > > > >
> > >
> https://skia.googlesource.com/skia/+/e0cab96599764f7611de015f558a6b22162c3eda
> > > > 
> > > > Stumbled upon this patch. See that you replaced a shift and two pack
> > > > instructions with a set_epi8 and shuffle_epi8. 
> > > 
> > > Yes, though generally that set_epi8 is not really interesting to
consider...
> > > it's a load from static memory that's hoisted out of the loop. 
> > > It's usually just sitting around in an xmm register.
> > > 
> > > > Are you sure this part gives
> > > > better performance on average? At least on a Atom Silvermont,
shuffle_epi8
> > has
> > > > micro code and takes 5 cycles, compared to Haswell where it only takes 1
> > > cycle.
> > > 
> > > I don't remember any performance alerts triggering, but we can time it on
an
> > > Atom again if you like.
> > > 
> > > I'll freely admit that most of our x86 test machines fall in the
> > Nehalem-Haswell
> > > range, and even in that range mostly Sandy Bridge or Ivy Bridge.
> > > It doesn't look like we even currently have an Atom CPU performance test
> bot,
> > > though that's something we could change.
> > > 
> > > Are there docs like
> > > https://software.intel.com/sites/landingpage/IntrinsicsGuide/ that include
> > Atom
> > > instruction costs?
> > 
> > Sorry, I totally missed this.
> > 
> > Yes, it would be great if you could get some x86 tablet running Silvermont
and
> > use as a test bot. The Atom CPUs has a few instructions that uses micro
code.
> > Especially with SSE4 instructions. I found this out the hard way myself.
> > 
> > I don't think they'll ever map intrinsics to cpu cycles directly. I usually
> take
> > the corresponding assembly instruction and do a lookup in Agner Fog's table
> for
> > Silvermont. Found here http://www.agner.org/optimize/instruction_tables.pdf
> > If the column μops is >1 then you might want to benchmark on Silvermont HW,
or
> > find an alternative.
> 
> Going to try to add a Silvermont perf bot.

Right, so we now have a Silvermont perf bot, and I've been testing with one on
my desk.

If we switch from shuffle_epi8 back to srli_epi32/packs_epi32/packus_epi16 to
repack the pixel, 
we speed up between 2% and 18% on Silvermont, depending on the size of the blur.
 This same
change regresses my desktop by 2%-11%, again depending on the size of the blur.

That's very awkward.  I'm not keen to have to complicate the code by considering
microarchitecture,
and I'm similarly not eager to set a precedent that we should allow ourselves to
care.  We've been
down this route with NEON (where _every_ processor performs completely
differently) and it's not
proved manageable.

I don't want to sound like I'm shutting the door completely here, but the
momentum under src/opts
is going the other way, favoring less, clearer code over eeking out every bit of
performance.
This particularly holds for operations like blurs, where switching to Ganesh
gives a better
speedup than we could really ever offer with a CPU.

Powered by Google App Engine
This is Rietveld 408576698