Issue 1657503002: Look beyond SSE2 for Paeth

mtklein_C

Description was changed from ========== Look beyond SSE2 for Paeth You can break this CL ...

4 years, 10 months ago (2016-01-31 15:28:20 UTC) #1

mtklein

The CQ bit was checked by mtklein@google.com to run a CQ dry run

4 years, 10 months ago (2016-01-31 15:48:43 UTC) #2

mtklein

Description was changed from ========== Look beyond SSE2 for Paeth You can break this CL ...

4 years, 10 months ago (2016-01-31 15:48:54 UTC) #3

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1657503002/40001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1657503002/40001

4 years, 10 months ago (2016-01-31 15:48:55 UTC) #4

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 10 months ago (2016-01-31 15:52:55 UTC) #5

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-Shared-Trybot on client.skia (JOB_FAILED, http://build.chromium.org/p/client.skia/builders/Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-Shared-Trybot/builds/5690)

4 years, 10 months ago (2016-01-31 15:52:55 UTC) #6

mtklein_C

Description was changed from ========== Look beyond SSE2 for Paeth You can break this CL ...

4 years, 10 months ago (2016-01-31 15:54:36 UTC) #7

mtklein_C

Description was changed from ========== Look beyond SSE2 for Paeth You can break this CL ...

4 years, 10 months ago (2016-01-31 15:54:58 UTC) #8

mtklein_C

Description was changed from ========== Look beyond SSE2 for Paeth You can break this CL ...

4 years, 10 months ago (2016-01-31 15:56:44 UTC) #9

mtklein_C

Description was changed from ========== Look beyond SSE2 for Paeth You can break this CL ...

4 years, 10 months ago (2016-01-31 15:57:17 UTC) #10

Description was changed from

==========
Look beyond SSE2 for Paeth

You can break this CL down into three steps.  Steps 2 and 3 depend on 1.

    Step 1: go to a 16-bit impl.  Speed ~unaffected.
    Step 2: use SSSE3 16-bit abs.  ~20% speedup to Paeth.
    Step 3: use SSE4.1 blendv, total ~25% speedup to Paeth.

Overall this can improve PNG decoding by around 10% end-to-end.

I would feel most comfortable landing this only after we have a bot exercising
the SSE4.1 code, either by moving this stuff behind a function pointer
(simulating Chrome/Clank) or by adding a builder with at least SSE4.1 at compile
time (simulating an Android system build).  We've got plenty of bots building
with SSSE3 at compile time to test that path.

On the other hand, the non-SSE4.1 code is actually more complicated than the
SSE4.1 code... I might be convinced to just land this.

BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dimage&master=false...
==========

to

==========
Look beyond SSE2 for Paeth

You can break this CL down into three steps.  Steps 2 and 3 depend on 1.

    Step 1: go to a 16-bit impl.  Speed ~unaffected.
    Step 2: use SSSE3 16-bit abs.  ~20% speedup to Paeth.
    Step 3: use SSE4.1 blendv, total ~25% speedup to Paeth.

Overall this can improve PNG decoding by around 8% end-to-end.

I would feel most comfortable landing this only after we have a bot exercising
the SSE4.1 code, either by moving this stuff behind a function pointer
(simulating Chrome/Clank) or by adding a builder with at least SSE4.1 at compile
time (simulating an Android system build).  We've got plenty of bots building
with SSSE3 at compile time to test that path.

On the other hand, the non-SSE4.1 code is actually more complicated than the
SSE4.1 code... I might be convinced to just land this.

BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dimage&master=false...
==========
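For readers unfamiliar with the filter, here is a minimal scalar sketch of the Paeth predictor as defined by the PNG specification. It is not the CL's SIMD code, and the function names are hypothetical. It shows why step 1 moves to 16 bits: with p = a + b - c, the distance |p - c| = |a + b - 2c| can reach 510, which does not fit in a byte.

```cpp
#include <cstdint>
#include <cstdlib>

// Scalar reference for the PNG Paeth predictor.
// a = left, b = above, c = upper-left, all reconstructed bytes.
// (Hypothetical helper for illustration; not taken from SkPngFilters.cpp.)
static uint8_t paeth_predict(uint8_t a, uint8_t b, uint8_t c) {
    // With p = a + b - c, the three distances simplify as follows.
    // They are stored in int16_t to mirror the 16-bit lanes described above.
    const int16_t pa = static_cast<int16_t>(std::abs(b - c));          // |p - a|
    const int16_t pb = static_cast<int16_t>(std::abs(a - c));          // |p - b|
    const int16_t pc = static_cast<int16_t>(std::abs(a + b - 2 * c));  // |p - c|, up to 510

    // Ties are broken in the order a, b, c, as the PNG spec requires.
    if (pa <= pb && pa <= pc) return a;
    if (pb <= pc) return b;
    return c;
}

// Undo the filter for one byte: the stored value plus the prediction,
// with the sum wrapping to 8 bits.
static uint8_t paeth_unfilter(uint8_t filtered, uint8_t a, uint8_t b, uint8_t c) {
    return static_cast<uint8_t>(filtered + paeth_predict(a, b, c));
}
```

Steps 2 and 3 correspond to the `std::abs` calls and the two `if` selections here: a 16-bit SIMD absolute value replaces the former, and a blend instruction replaces the latter.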

mtklein_C

Description was changed from ========== Look beyond SSE2 for Paeth You can break this CL ...

4 years, 10 months ago (2016-01-31 15:58:10 UTC) #11

mtklein_C

mtklein@chromium.org changed reviewers: + msarett@google.com - mtklein@google.com

4 years, 10 months ago (2016-01-31 16:11:32 UTC) #12

mtklein_C

There are failing bots, so clearly this is not correct, but I thought you might ...

4 years, 10 months ago (2016-01-31 16:11:34 UTC) #13

mtklein_C

The CQ bit was checked by mtklein@chromium.org to run a CQ dry run

4 years, 10 months ago (2016-01-31 16:18:18 UTC) #14

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1657503002/60001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1657503002/60001

4 years, 10 months ago (2016-01-31 16:18:26 UTC) #15

mtklein_C

On 2016/01/31 at 16:11:34, mtklein_C wrote: > There are failing bots, so clearly this is ...

4 years, 10 months ago (2016-01-31 16:20:43 UTC) #16

mtklein_C

Description was changed from ========== Look beyond SSE2 for Paeth You can break this CL ...

4 years, 10 months ago (2016-01-31 16:21:55 UTC) #17

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 10 months ago (2016-01-31 16:30:10 UTC) #18

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

4 years, 10 months ago (2016-01-31 16:30:11 UTC) #19

msarett

I like that this is faster and easier to understand. I agree that we need ...

4 years, 10 months ago (2016-02-01 15:17:34 UTC) #20

mtklein

mtklein@google.com changed reviewers: + mtklein@google.com

4 years, 10 months ago (2016-02-01 15:28:15 UTC) #21

mtklein

https://codereview.chromium.org/1657503002/diff/60001/src/codec/SkPngFilters.cpp File src/codec/SkPngFilters.cpp (right): https://codereview.chromium.org/1657503002/diff/60001/src/codec/SkPngFilters.cpp#newcode100 src/codec/SkPngFilters.cpp:100: return _mm_blendv_epi8(e,t,c); On 2016/02/01 15:17:34, msarett wrote: > Is ...

4 years, 10 months ago (2016-02-01 15:28:15 UTC) #22
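As background for the `_mm_blendv_epi8(e,t,c)` line under discussion: that intrinsic returns `t` where the mask `c` is set and `e` elsewhere. Without SSE4.1 the usual substitute is an and/andnot/or sequence. The scalar sketch below models that substitute on a single 16-bit lane. The helper names are hypothetical, and it assumes the mask is all ones or all zeros per lane, as SIMD compares produce.

```cpp
#include <cstdint>

// Bitwise select on one 16-bit lane: take bits of `t` where `mask` is 1
// and bits of `e` where it is 0. This is the and/andnot/or pattern that
// stands in for a blend instruction. (Hypothetical helper for illustration.)
static uint16_t select_bits(uint16_t e, uint16_t t, uint16_t mask) {
    return static_cast<uint16_t>((mask & t) | (~mask & e));
}

// Example use: a branch-free minimum built from a compare mask and a select.
static int16_t min_via_select(int16_t x, int16_t y) {
    // A SIMD compare yields all ones (true) or all zeros (false) per lane.
    const uint16_t mask = (x < y) ? 0xFFFF : 0x0000;
    return static_cast<int16_t>(
        select_bits(static_cast<uint16_t>(y), static_cast<uint16_t>(x), mask));
}
```

The blend form is one operation where the fallback is three, which fits the description's remark that the non-SSE4.1 code is the more complicated of the two.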

msarett

https://codereview.chromium.org/1657503002/diff/60001/src/codec/SkPngFilters.cpp File src/codec/SkPngFilters.cpp (right): https://codereview.chromium.org/1657503002/diff/60001/src/codec/SkPngFilters.cpp#newcode100 src/codec/SkPngFilters.cpp:100: return _mm_blendv_epi8(e,t,c); On 2016/02/01 15:28:15, mtklein wrote: > On ...

4 years, 10 months ago (2016-02-01 15:49:41 UTC) #23

mtklein

https://codereview.chromium.org/1657503002/diff/60001/src/codec/SkPngFilters.cpp File src/codec/SkPngFilters.cpp (right): https://codereview.chromium.org/1657503002/diff/60001/src/codec/SkPngFilters.cpp#newcode142 src/codec/SkPngFilters.cpp:142: d = _mm_add_epi8(d, nearest); // Note `_epi8`: we need ...

4 years, 10 months ago (2016-02-01 16:04:02 UTC) #24

https://codereview.chromium.org/1657503002/diff/60001/src/codec/SkPngFilters.cpp
File src/codec/SkPngFilters.cpp (right):

https://codereview.chromium.org/1657503002/diff/60001/src/codec/SkPngFilters....
src/codec/SkPngFilters.cpp:142: d = _mm_add_epi8(d, nearest);  // Note `_epi8`: we need addition to wrap modulo 255.
On 2016/02/01 15:49:41, msarett wrote:
> On 2016/02/01 15:28:15, mtklein wrote:
> > On 2016/02/01 15:17:34, msarett wrote:
> > > Can we call packus(nearest) before this add?
> > > 
> > > That might make the use of epi8 clearer.  Also that would involve not needing to
> > > unpack d to 16 bit at the start?
> > 
> > We need to unpack to be able to do the middle bits (sub_epi16...abs_i16).
> > That in turn means 'a' wants to start unpacked (or we'd need to re-unpack it).
> > That in turn means we want to leave d unpacked.
> > 
> > Luckily, we know all of {a,b,c,d} are really 8 bit, with zeros in the upper
> > bytes of each 16-bit lane, so the 8-bit add here doesn't really have any chance
> > of going wrong.
> > 
> > Let me hack around a bit on the alternative you suggest here, which is to load d
> > packed, but unpack it when moving it to a.  I think that'd work fine.
> 
> Oh I see, "a" is set to "d", and "a" must be unpacked.
> 
> Either approach is fine by me, since it looks like my suggestion won't save any
> instructions anyway.

Just tried this out.  Correctness-wise what you suggested works fine, and it's
the same number of instructions.  It does change how the instructions can
schedule, and doesn't work out in our favor: IACA says the old code had 14
instructions on a 13.9 cycle critical path, and the new code 15 instructions on
a 14.3 cycle critical path.  Nanobench agrees, measuring the new way as just a
little slower, 2-3%, right in line with IACA's predicted 2.8%.

(I also tried moving d's load<4>() around... couldn't beat where it is for
speed, and I like seeing all the data roll over together.)
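To illustrate the point about using an 8-bit add on unpacked data, here is a scalar model of a single 16-bit lane. It is a sketch with hypothetical helper names, not the intrinsics themselves. When both inputs have a zero upper byte, a byte-wise add wraps the low byte (byte arithmetic wraps at 256, so results stay in 0..255) and leaves the upper byte zero. A lane-wide 16-bit add would instead carry into the upper byte.

```cpp
#include <cstdint>

// Model of a byte-wise add applied to one 16-bit lane: the low and high
// bytes are added independently and each wraps on its own.
// (Hypothetical helper for illustration.)
static uint16_t add_bytes_in_lane(uint16_t x, uint16_t y) {
    const uint8_t lo = static_cast<uint8_t>((x & 0xFF) + (y & 0xFF));
    const uint8_t hi = static_cast<uint8_t>((x >> 8) + (y >> 8));
    return static_cast<uint16_t>((hi << 8) | lo);
}

// Model of a lane-wide 16-bit add, for comparison: a carry out of the
// low byte lands in the high byte instead of being discarded.
static uint16_t add_whole_lane(uint16_t x, uint16_t y) {
    return static_cast<uint16_t>(x + y);
}
```

For example, with inputs 200 and 100 the byte-wise model gives 44 with a clean upper byte, while the lane-wide add gives 300, which would need an extra mask or pack step to bring back to 8 bits.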

msarett

Looks good to me. But what is the situation with the bots? I don't want ...

4 years, 10 months ago (2016-02-01 16:12:55 UTC) #25

mtklein

If we land as is, we'll test SSE2-only (Linux+Windows bots) and SSSE3 (Mac+Android bots), but ...

4 years, 10 months ago (2016-02-01 16:22:42 UTC) #26
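For context on why a given bot exercises one path and not another: when the instruction set is chosen at compile time, only the branch matching the build flags is ever compiled and run. A minimal sketch, assuming the GCC/Clang predefined macros `__SSE2__`, `__SSSE3__`, and `__SSE4_1__` (MSVC does not define these); the function name is hypothetical and this is not Skia's actual dispatch code.

```cpp
#include <string>

// Report which code path this translation unit was compiled for.
// Only one branch survives preprocessing, so a builder without
// SSE4.1 in its flags never compiles or tests the SSE4.1 branch.
// (Hypothetical helper; assumes GCC/Clang feature macros.)
static std::string paeth_path_compiled_in() {
#if defined(__SSE4_1__)
    return "sse4.1";
#elif defined(__SSSE3__)
    return "ssse3";
#elif defined(__SSE2__)
    return "sse2";
#else
    return "portable";
#endif
}
```

Building the same file with different flags (for instance `-mssse3` versus `-msse4.1` on GCC or Clang) changes the string this returns, which is the sense in which a bot "builds with SSSE3 at compile time".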

msarett

On 2016/02/01 16:22:42, mtklein wrote: > If we land as is, we'll test SSE2-only (Linux+Windows ...

4 years, 10 months ago (2016-02-01 16:27:12 UTC) #27

mtklein

On 2016/02/01 16:27:12, msarett wrote: > On 2016/02/01 16:22:42, mtklein wrote: > > If we ...

4 years, 10 months ago (2016-02-01 16:31:56 UTC) #28

mtklein

On 2016/02/01 16:48:56, msarett wrote: > That works for me as well :). Done.

4 years, 10 months ago (2016-02-01 19:30:38 UTC) #30

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1657503002/80001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1657503002/80001

4 years, 10 months ago (2016-02-01 20:08:40 UTC) #33

commit-bot: I haz the power

Description was changed from ========== Look beyond SSE2 for Paeth You can break this CL ...

4 years, 10 months ago (2016-02-01 20:20:35 UTC) #34

commit-bot: I haz the power

4 years, 10 months ago (2016-02-01 20:20:36 UTC) #35

Message was sent while issue was closed.

Committed patchset #5 (id:80001) as
https://skia.googlesource.com/skia/+/b21c752eb3d55970ac45daaf3fd2cbda39c7658a

Issue 1657503002: Look beyond SSE2 for Paeth (Closed)

Patch Set 1 #

Patch Set 2 : comments #

Patch Set 3 : more comments #

Patch Set 4 : typo #

Patch Set 5 : kill sse4.1 #
