Issue 1573943002: sketch hooking into PNG_FILTER_OPTIMIZATIONS

mtklein_C

Description was changed from ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS BUG=skia: ========== to ========== sketch hooking ...

4 years, 11 months ago (2016-01-11 16:29:46 UTC) #1

mtklein_C

The CQ bit was checked by mtklein@chromium.org to run a CQ dry run

4 years, 11 months ago (2016-01-11 22:55:34 UTC) #2

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1573943002/40001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1573943002/40001

4 years, 11 months ago (2016-01-11 22:55:43 UTC) #3

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 11 months ago (2016-01-11 23:01:28 UTC) #4

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-Shared-Trybot on client.skia (JOB_FAILED, http://build.chromium.org/p/client.skia/builders/Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-Shared-Trybot/builds/5191)

4 years, 11 months ago (2016-01-11 23:01:28 UTC) #5

mtklein_C

The CQ bit was checked by mtklein@chromium.org to run a CQ dry run

4 years, 11 months ago (2016-01-26 18:56:24 UTC) #6

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1573943002/170001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1573943002/170001

4 years, 11 months ago (2016-01-26 18:56:36 UTC) #7

mtklein_C

Description was changed from ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1573943002 ========== to ========== ...

4 years, 11 months ago (2016-01-26 18:56:56 UTC) #8

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 11 months ago (2016-01-26 18:57:28 UTC) #9

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-Shared-Trybot on client.skia (JOB_FAILED, http://build.chromium.org/p/client.skia/builders/Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-Shared-Trybot/builds/5566) Build-Ubuntu-Clang-x86_64-Debug-Trybot on ...

4 years, 11 months ago (2016-01-26 18:57:28 UTC) #10

mtklein_C

Description was changed from ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS local timing says this 4-byte Paeth ...

4 years, 11 months ago (2016-01-26 18:58:45 UTC) #11

mtklein_C

The CQ bit was checked by mtklein@chromium.org to run a CQ dry run

4 years, 11 months ago (2016-01-26 18:59:41 UTC) #12

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1573943002/190001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1573943002/190001

4 years, 11 months ago (2016-01-26 18:59:45 UTC) #13

mtklein_C

Description was changed from ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS Local timing says this 4-byte Paeth ...

4 years, 11 months ago (2016-01-26 19:00:05 UTC) #14

mtklein_C

The CQ bit was checked by mtklein@chromium.org to run a CQ dry run

4 years, 11 months ago (2016-01-26 19:02:31 UTC) #15

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1573943002/210001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1573943002/210001

4 years, 11 months ago (2016-01-26 19:02:39 UTC) #16

mtklein_C

Description was changed from ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS Local timing says this 4-byte Paeth ...

4 years, 11 months ago (2016-01-26 19:08:40 UTC) #17

mtklein_C

Description was changed from ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS Local timing says this 4-byte Paeth ...

4 years, 11 months ago (2016-01-26 19:10:13 UTC) #18

mtklein_C

Description was changed from ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS Local timing says this 4-byte Paeth ...

4 years, 11 months ago (2016-01-26 19:11:00 UTC) #19

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 11 months ago (2016-01-26 19:13:14 UTC) #20

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

4 years, 11 months ago (2016-01-26 19:13:15 UTC) #21

mtklein_C

Description was changed from ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS Local timing says this 4-byte Paeth ...

4 years, 11 months ago (2016-01-26 19:13:40 UTC) #22

mtklein_C

mtklein@chromium.org changed reviewers: + msarett@google.com

4 years, 11 months ago (2016-01-26 19:15:09 UTC) #23

mtklein_C

Matt, can you run whatever perf tests you do on the other PNG-decoding codec changes? ...

4 years, 11 months ago (2016-01-26 19:15:09 UTC) #24

mtklein_C

Description was changed from ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS Local timing says this 4-byte Paeth ...

4 years, 11 months ago (2016-01-26 19:39:01 UTC) #25

mtklein_C

Description was changed from ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS Local timing says this 4-byte Paeth ...

4 years, 11 months ago (2016-01-26 19:45:47 UTC) #26

mtklein_C

Description was changed from ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS Local timing says this 4-byte Paeth ...

4 years, 11 months ago (2016-01-26 20:24:18 UTC) #27

Description was changed from

==========
sketch hooking into PNG_FILTER_OPTIMIZATIONS

Local timing says this 4-byte Paeth function takes about 0.3x the time the
serial libpng code does, dropping from ~10 cycles per byte to ~2.9.

bpp=4 is mainly an easy demo.  This approach can work for any bpp up to 16, 1
pixel at a time, at roughly the same cost per pixel.  Doing more than 1 pixel at
a time is a tricky math problem I have yet to attempt to solve.

Everything here can be trivially downgraded to MMX, supporting bpp up to 8.  It
seems to be a little slower (~3.5 cycles per byte), but it would make the code
compatible with every x86 that can still power on.

Doing things naively in 16-bit looks almost as fast, but this looks a touch
faster.

BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
==========

to

==========
sketch hooking into PNG_FILTER_OPTIMIZATIONS

Local timing says this 4-byte Paeth function takes about 0.3x the time the
serial libpng code does, dropping from ~10 cycles per byte to ~2.9.

bpp=4 is mainly an easy demo.  This approach can work for any bpp up to 16, 1
pixel at a time, at roughly the same cost per pixel.  Doing more than 1 pixel at
a time is a tricky math problem I have yet to attempt to solve.

Everything here can be trivially downgraded to MMX, supporting bpp up to 8.  It
seems to be a little slower (~3.5 cycles per byte), but it would make the code
compatible with every x86 that can still power on.

I've tried four approaches:
  - this way;
  - doing things naively in 16-bit;
  - a 16-bit version that requires division by 3 (i.e. mulhi_epu16(..., 0x5580));
  - a mostly 8-bit version of the same.

They're all fine, but this one is consistently the fastest I've measured.
I'd be happy to settle on the naive 16-bit version too, which would have a very
clear implementation that's only minorly slower than this version.

BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
==========
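
For readers who don't have the filter in their head, here is a minimal scalar sketch of Paeth defiltering as the PNG specification defines it. It is not the code in this patch and not libpng's implementation; the function names are made up for illustration. It only shows the per-byte work that the SIMD version does one pixel at a time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Paeth predictor from the PNG specification: choose whichever of
// a (left), b (above), c (upper-left) is closest to p = a + b - c,
// breaking ties in the order a, b, c.
inline uint8_t paeth_predict(uint8_t a, uint8_t b, uint8_t c) {
    int p  = int(a) + int(b) - int(c);
    int pa = std::abs(p - int(a));
    int pb = std::abs(p - int(b));
    int pc = std::abs(p - int(c));
    if (pa <= pb && pa <= pc) return a;
    if (pb <= pc) return b;
    return c;
}

// Undo the Paeth filter on one row, in place. `row` holds filtered bytes,
// `prev` is the already-reconstructed previous row, `bpp` is bytes per pixel.
// (Hypothetical helper name, not a libpng or Skia function.)
inline void paeth_defilter_row(uint8_t* row, const uint8_t* prev,
                               size_t n, size_t bpp) {
    assert(bpp >= 1);
    for (size_t i = 0; i < n; i++) {
        uint8_t a = i >= bpp ? row[i - bpp]  : 0;  // left neighbour
        uint8_t b = prev[i];                       // above
        uint8_t c = i >= bpp ? prev[i - bpp] : 0;  // upper-left
        row[i] = uint8_t(row[i] + paeth_predict(a, b, c));
    }
}
```

Each output byte depends on the byte `bpp` positions to its left, which is the serial dependency that makes "more than 1 pixel at a time" hard.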

mtklein_C

Description was changed from ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS Local timing says this 4-byte Paeth ...

4 years, 11 months ago (2016-01-26 20:24:48 UTC) #28

Description was changed from

==========
sketch hooking into PNG_FILTER_OPTIMIZATIONS

Local timing says this 4-byte Paeth function takes about 0.3x the time the
serial libpng code does, dropping from ~10 cycles per byte to ~2.9.

bpp=4 is mainly an easy demo.  This approach can work for any bpp up to 16, 1
pixel at a time, at roughly the same cost per pixel.  Doing more than 1 pixel at
a time is a tricky math problem I have yet to attempt to solve.

Everything here can be trivially downgraded to MMX, supporting bpp up to 8.  It
seems to be a little slower (~3.5 cycles per byte), but it would make the code
compatible with every x86 that can still power on.

I've tried four approaches:
  - this way;
  - doing things naively in 16-bit;
  - a 16-bit version that requires division by 3 (i.e. mulhi_epu16(..., 0x5580));
  - a mostly 8-bit version of the same.

They're all fine, but this one is consistently the fastest I've measured.
I'd be happy to settle on the naive 16-bit version too, which would have a very
clear implementation that's only minorly slower than this version.

BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
==========

to

==========
sketch hooking into PNG_FILTER_OPTIMIZATIONS

Local timing says this 4-byte Paeth function takes about 0.3x the time the
serial libpng code does, dropping from ~10 cycles per byte to ~2.9.

bpp=4 is mainly an easy demo.  This approach can work for any bpp up to 16, 1
pixel at a time, at roughly the same cost per pixel.  Doing more than 1 pixel at
a time is a tricky math problem I have yet to attempt to solve.

Everything here can be trivially downgraded to MMX, supporting bpp up to 8.  It
seems to be a little slower (~3.5 cycles per byte), but it would make the code
compatible with every x86 that can still power on.

I've tried four approaches:
  - this way;
  - doing things naively in 16-bit;
  - a 16-bit version that requires division by 3 (i.e. mulhi_epu16(..., 0x5580));
  - a mostly 8-bit version of the same.

They're all fine, but this one is consistently the fastest I've measured.
I'd be happy to settle on the naive 16-bit version too, which would have a very
clear implementation that's only minorly slower than this version.  The other
two are way more complicated, and would require us to draw some serious ASCII
diagrams to explain.

BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
==========
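
The "division by 3 (i.e. mulhi_epu16(..., 0x5580))" item can be modelled in scalar code. The sketch below assumes the usual meaning of `_mm_mulhi_epu16`, the high 16 bits of an unsigned 16x16-bit product per lane, and checks where multiplying by 0x5580 agrees with integer division by 3. The description does not say which quantity gets divided, so the helper names and the range check are illustrative only.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of one lane of _mm_mulhi_epu16(x, k): the high 16 bits
// of the 32-bit product of two unsigned 16-bit values.
inline uint16_t mulhi_u16(uint16_t x, uint16_t k) {
    return uint16_t((uint32_t(x) * uint32_t(k)) >> 16);
}

// x/3 via the constant quoted in the description. The assert records the
// input range over which this sketch expects an exact result.
inline uint16_t div3_small(uint16_t x) {
    assert(x <= 511);
    return mulhi_u16(x, 0x5580);
}

// Smallest x for which the multiply-high result differs from x/3,
// or 65536 if they agree for every 16-bit input.
inline uint32_t first_div3_mismatch() {
    for (uint32_t x = 0; x < 65536; x++) {
        if (mulhi_u16(uint16_t(x), 0x5580) != x / 3) return x;
    }
    return 65536;
}
```

In this model the constant is exact for inputs 0 through 511 and first disagrees at 512, so it is safe for values that fit in 9 bits, such as the sum of two bytes.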

msarett

Here's a few thoughts. "Matt, can you run whatever perf tests you do on the ...

4 years, 11 months ago (2016-01-26 22:35:02 UTC) #29

Here's a few thoughts.

"Matt, can you run whatever perf tests you do on the other PNG-decoding codec
changes?"

I'm happy to.  Generally I just run CodecBench (commenting out a few lines in
nanobench.cpp to only test kN32 and to not skip the images).  I do need to pull
together a set of PNGs that use the paeth filter.

"How do you test correctness?  I've just been waiting on trybot results to show
up in Gold, though even there I'm not sure I know how to get at image results.
They end up there, right?"

I don't really know how to make the trybots upload to Gold (or whether that's
even possible).  But any images that you change/break will show up here when you
commit:
https://gold.skia.org/list?query=source_type%3Dimage

Generally I just run dm and look at them.  And then if there are issues I can't
see, Gold catches them later.

"Everything here can be trivially downgraded to MMX."

Good to know.  Do you see a reason to optimize MMX?


I'll work on pulling together some PNGs.  Also, I really need to put my thinking
cap on and understand the code :).

https://codereview.chromium.org/1573943002/diff/210001/src/codec/SkPngCodec.cpp
File src/codec/SkPngCodec.cpp (right):

https://codereview.chromium.org/1573943002/diff/210001/src/codec/SkPngCodec.c...
src/codec/SkPngCodec.cpp:19: #if defined(__SSE2__)
Is there another file we can put this in?

Seems to add complicated code to a file that's already pretty long.  And maybe
I'm looking too far ahead, but I'm guessing that there'll be more filter opts to
come.  Ooh, maybe src/opts/SkPngFilters_opt?

https://codereview.chromium.org/1573943002/diff/210001/third_party/libpng/png...
File third_party/libpng/pnglibconf.h (right):

https://codereview.chromium.org/1573943002/diff/210001/third_party/libpng/png...
third_party/libpng/pnglibconf.h:214: #if defined(__SSE2__)
Wow!  We can hook into libpng without upstreaming anything?!  Maybe upstreaming
is still something to consider, but it's nice not to depend on it!
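
A rough mock of the hooking pattern under discussion, for anyone following along: the config header defines a macro naming a function, and the decoder calls that function after filling in its default filter table so the hook can overwrite selected entries. Everything below (`MockDecoder`, `MOCK_FILTER_OPTIMIZATIONS`, the function names) is a stand-in invented for this sketch, not libpng's real types or this patch's code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for a decoder's filter table. These are NOT libpng's real types.
using FilterFn = void (*)(uint8_t* row, const uint8_t* prev, size_t n, size_t bpp);
enum { kSub = 0, kUp = 1, kAvg = 2, kPaeth = 3, kFilterCount = 4 };
struct MockDecoder {
    FilterFn read_filter[kFilterCount];
};

// Portable Sub defilter for any bpp.
inline void sub_generic(uint8_t* row, const uint8_t*, size_t n, size_t bpp) {
    for (size_t i = bpp; i < n; i++) row[i] = uint8_t(row[i] + row[i - bpp]);
}

// Placeholder for a specialised bpp=4 routine; a real one would use SIMD.
inline void sub_bpp4(uint8_t* row, const uint8_t*, size_t n, size_t bpp) {
    assert(bpp == 4);
    for (size_t i = 4; i < n; i++) row[i] = uint8_t(row[i] + row[i - 4]);
}

// The hook: overwrite only the table entries we have fast versions for.
inline void install_fast_filters(MockDecoder* d, size_t bpp) {
    if (bpp == 4) d->read_filter[kSub] = sub_bpp4;
}

// The config header would define this macro to name the hook function.
#define MOCK_FILTER_OPTIMIZATIONS install_fast_filters

// Decoder setup: install defaults, then give the hook a chance to replace them.
inline void init_filters(MockDecoder* d, size_t bpp) {
    for (int i = 0; i < kFilterCount; i++) d->read_filter[i] = nullptr;
    d->read_filter[kSub] = sub_generic;
#ifdef MOCK_FILTER_OPTIMIZATIONS
    MOCK_FILTER_OPTIMIZATIONS(d, bpp);
#endif
}
```

Because the macro lives in a config header we already own, the library source itself stays untouched, which is the property being celebrated above.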

mtklein

mtklein@google.com changed reviewers: + mtklein@google.com

4 years, 11 months ago (2016-01-26 22:52:16 UTC) #30

mtklein

https://codereview.chromium.org/1573943002/diff/210001/src/codec/SkPngCodec.cpp File src/codec/SkPngCodec.cpp (right): https://codereview.chromium.org/1573943002/diff/210001/src/codec/SkPngCodec.cpp#newcode19 src/codec/SkPngCodec.cpp:19: #if defined(__SSE2__) On 2016/01/26 22:35:02, msarett wrote: > Is ...

4 years, 11 months ago (2016-01-26 22:52:17 UTC) #31

mtklein_C

The CQ bit was checked by mtklein@chromium.org to run a CQ dry run

4 years, 11 months ago (2016-01-27 00:18:03 UTC) #32

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1573943002/250001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1573943002/250001

4 years, 11 months ago (2016-01-27 00:18:12 UTC) #33

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 11 months ago (2016-01-27 00:19:02 UTC) #34

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Debug-Trybot on client.skia (JOB_FAILED, http://build.chromium.org/p/client.skia/builders/Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Debug-Trybot/builds/5577) Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-Shared-Trybot on ...

4 years, 11 months ago (2016-01-27 00:19:03 UTC) #35

mtklein_C

https://codereview.chromium.org/1573943002/diff/210001/src/codec/SkPngCodec.cpp File src/codec/SkPngCodec.cpp (right): https://codereview.chromium.org/1573943002/diff/210001/src/codec/SkPngCodec.cpp#newcode19 src/codec/SkPngCodec.cpp:19: #if defined(__SSE2__) I've moved things to a new file, ...

4 years, 11 months ago (2016-01-27 00:21:26 UTC) #36

mtklein_C

The CQ bit was checked by mtklein@chromium.org to run a CQ dry run

4 years, 11 months ago (2016-01-27 00:22:29 UTC) #37

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1573943002/270001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1573943002/270001

4 years, 11 months ago (2016-01-27 00:22:32 UTC) #38

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 11 months ago (2016-01-27 00:33:33 UTC) #39

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

4 years, 11 months ago (2016-01-27 00:33:34 UTC) #40

mtklein_C

The CQ bit was checked by mtklein@chromium.org to run a CQ dry run

4 years, 11 months ago (2016-01-27 00:35:20 UTC) #41

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1573943002/290001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1573943002/290001

4 years, 11 months ago (2016-01-27 00:35:31 UTC) #42

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 11 months ago (2016-01-27 00:39:14 UTC) #43

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-Shared-Trybot on client.skia (JOB_FAILED, http://build.chromium.org/p/client.skia/builders/Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-Shared-Trybot/builds/5574)

4 years, 11 months ago (2016-01-27 00:39:15 UTC) #44

mtklein

The CQ bit was checked by mtklein@google.com to run a CQ dry run

4 years, 11 months ago (2016-01-27 00:39:59 UTC) #45

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1573943002/290001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1573943002/290001

4 years, 11 months ago (2016-01-27 00:40:04 UTC) #46

commit-bot: I haz the power

Dry run: Try jobs failed on following builders: Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-Shared-Trybot on client.skia (JOB_FAILED, http://build.chromium.org/p/client.skia/builders/Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-Shared-Trybot/builds/5575)

4 years, 11 months ago (2016-01-27 00:43:25 UTC) #48

mtklein_C

The CQ bit was checked by mtklein@chromium.org to run a CQ dry run

4 years, 11 months ago (2016-01-27 00:43:44 UTC) #50

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1573943002/270001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1573943002/270001

4 years, 11 months ago (2016-01-27 00:43:54 UTC) #51

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 11 months ago (2016-01-27 00:43:57 UTC) #52

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

4 years, 11 months ago (2016-01-27 00:43:57 UTC) #53

mtklein_C

Description was changed from ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS Local timing says this 4-byte Paeth ...

4 years, 11 months ago (2016-01-27 01:07:43 UTC) #54

Description was changed from

==========
sketch hooking into PNG_FILTER_OPTIMIZATIONS

Local timing says this 4-byte Paeth function takes about 0.3x the time the
serial libpng code does, dropping from ~10 cycles per byte to ~2.9.

bpp=4 is mainly an easy demo.  This approach can work for any bpp up to 16, 1
pixel at a time, at roughly the same cost per pixel.  Doing more than 1 pixel at
a time is a tricky math problem I have yet to attempt to solve.

Everything here can be trivially downgraded to MMX, supporting bpp up to 8.  It
seems to be a little slower (~3.5 cycles per byte), but it would make the code
compatible with every x86 that can still power on.

I've tried four approaches:
  - this way;
  - doing things naively in 16-bit;
  - a 16-bit version that requires division by 3 (i.e. mulhi_epu16(..., 0x5580));
  - a mostly 8-bit version of the same.

They're all fine, but this one is consistently the fastest I've measured.
I'd be happy to settle on the naive 16-bit version too, which would have a very
clear implementation that's only minorly slower than this version.  The other
two are way more complicated, and would require us to draw some serious ASCII
diagrams to explain.

BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
==========

to

==========
sketch hooking into PNG_FILTER_OPTIMIZATIONS

Local timing says this 4-byte Paeth function takes about 0.3x the time the
serial libpng code does, dropping from ~10 cycles per byte to ~2.9.

bpp=4 is mainly an easy demo.  This approach can work for any bpp up to 16, 1
pixel at a time, at roughly the same cost per pixel.  Doing more than 1 pixel at
a time is a tricky math problem I have yet to attempt to solve.

Everything here can be trivially downgraded to MMX, supporting bpp up to 8.  It
seems to be a little slower (~3.5 cycles per byte), but it would make the code
compatible with every x86 that can still power on.

I've tried four approaches:
  - this way;
  - doing things naively in 16-bit;
  - a 16-bit version that requires division by 3 (i.e. mulhi_epu16(..., 0x5580));
  - a mostly 8-bit version of the same.

They're all fine, but this one is consistently the fastest I've measured.
I'd be happy to settle on the naive 16-bit version too, which would have a very
clear implementation that's only minorly slower than this version.  The other
two are way more complicated, and would require us to draw some serious ASCII
diagrams to explain.

I have learned that the .skp serialization tests (serialize-8888) have a nice
side effect of testing the correctness of these filters!

BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
==========

mtklein

The CQ bit was checked by mtklein@google.com to run a CQ dry run

4 years, 11 months ago (2016-01-27 13:27:24 UTC) #55

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1573943002/310001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1573943002/310001

4 years, 11 months ago (2016-01-27 13:27:30 UTC) #56

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 11 months ago (2016-01-27 13:38:36 UTC) #57

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

4 years, 11 months ago (2016-01-27 13:38:37 UTC) #58

mtklein

The CQ bit was checked by mtklein@google.com to run a CQ dry run

4 years, 11 months ago (2016-01-27 13:52:40 UTC) #59

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1573943002/330001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1573943002/330001

4 years, 11 months ago (2016-01-27 13:52:51 UTC) #60

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 11 months ago (2016-01-27 14:03:11 UTC) #61

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

4 years, 11 months ago (2016-01-27 14:03:12 UTC) #62

mtklein_C

The CQ bit was checked by mtklein@chromium.org to run a CQ dry run

4 years, 11 months ago (2016-01-27 14:12:53 UTC) #63

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1573943002/350001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1573943002/350001

4 years, 11 months ago (2016-01-27 14:13:06 UTC) #64

mtklein_C

I think I've got Paeth, Avg, and Sub all correct and usefully faster than the ...

4 years, 11 months ago (2016-01-27 14:19:27 UTC) #65

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 11 months ago (2016-01-27 14:24:37 UTC) #66

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

4 years, 11 months ago (2016-01-27 14:24:37 UTC) #67

mtklein_C

The CQ bit was checked by mtklein@chromium.org to run a CQ dry run

4 years, 11 months ago (2016-01-27 20:18:35 UTC) #68

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1573943002/390001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1573943002/390001

4 years, 11 months ago (2016-01-27 20:18:42 UTC) #69

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

4 years, 11 months ago (2016-01-27 20:31:04 UTC) #70

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

4 years, 11 months ago (2016-01-27 20:31:05 UTC) #71

msarett

LGTM Woohoo! https://codereview.chromium.org/1573943002/diff/350001/src/codec/SkPngFilters.cpp File src/codec/SkPngFilters.cpp (right): https://codereview.chromium.org/1573943002/diff/350001/src/codec/SkPngFilters.cpp#newcode79 src/codec/SkPngFilters.cpp:79: // SSE 4.1+ would be: return _mm_blendv_epi8(e,t,c); ...

4 years, 11 months ago (2016-01-27 20:56:11 UTC) #72
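
Context for the `_mm_blendv_epi8(e,t,c)` comment: without SSE 4.1, the usual way to express "if c then t else e" per byte is a mask-and-merge. The sketch below is a portable scalar model of that idiom, not the code in SkPngFilters.cpp; the function names are hypothetical. On SSE2 the same expression is normally built from `_mm_and_si128`, `_mm_andnot_si128` and `_mm_or_si128`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Per-byte select: where the mask byte is 0xFF take t, where it is 0x00 take e.
// Unlike a blendv, which looks only at each byte's top bit, this form needs
// every mask byte to be all-ones or all-zeros.
inline uint8_t select_byte(uint8_t c, uint8_t t, uint8_t e) {
    assert(c == 0x00 || c == 0xFF);
    return uint8_t((c & t) | (~c & e));
}

// Apply the select across n bytes, as a vector instruction would across lanes.
inline void select_bytes(const uint8_t* c, const uint8_t* t, const uint8_t* e,
                         uint8_t* out, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = select_byte(c[i], t[i], e[i]);
}
```

Comparison instructions such as `_mm_cmpgt_epi16` already produce all-ones or all-zeros lanes, so their results can be fed straight in as the mask.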

mtklein

Description was changed from ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS Local timing says this 4-byte Paeth ...

4 years, 11 months ago (2016-01-27 20:57:18 UTC) #73

Description was changed from

==========
sketch hooking into PNG_FILTER_OPTIMIZATIONS

Local timing says this 4-byte Paeth function takes about 0.3x the time the
serial libpng code does, dropping from ~10 cycles per byte to ~2.9.

bpp=4 is mainly an easy demo.  This approach can work for any bpp up to 16, 1
pixel at a time, at roughly the same cost per pixel.  Doing more than 1 pixel at
a time is a tricky math problem I have yet to attempt to solve.

Everything here can be trivially downgraded to MMX, supporting bpp up to 8.  It
seems to be a little slower (~3.5 cycles per byte), but it would make the code
compatible with every x86 that can still power on.

I've tried four approaches:
  - this way;
  - doing things naively in 16-bit;
  - a 16-bit version that requires division by 3 (i.e. mulhi_epu16(..., 0x5580));
  - a mostly 8-bit version of the same.

They're all fine, but this one is consistently the fastest I've measured.
I'd be happy to settle on the naive 16-bit version too, which would have a very
clear implementation that's only minorly slower than this version.  The other
two are way more complicated, and would require us to draw some serious ASCII
diagrams to explain.

I have learned that the .skp serialization tests (serialize-8888) have a nice
side effect of testing the correctness of these filters!

BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
==========

to

==========
sketch hooking into PNG_FILTER_OPTIMIZATIONS

Local timing says this 4-byte Paeth function takes about 0.3x the time the
serial libpng code does, dropping from ~10 cycles per byte to ~2.9.

bpp=4 is mainly an easy demo.  This approach can work for any bpp up to 16, 1
pixel at a time, at roughly the same cost per pixel.  Doing more than 1 pixel at
a time is a tricky math problem I have yet to attempt to solve.

Everything here can be trivially downgraded to MMX, supporting bpp up to 8.  It
seems to be a little slower (~3.5 cycles per byte), but it would make the code
compatible with every x86 that can still power on.

I've tried four approaches:
  - this way;
  - doing things naively in 16-bit;
  - a 16-bit version that requires division by 3 (i.e. mulhi_epu16(..., 0x5580));
  - a mostly 8-bit version of the same.

They're all fine, but this one is consistently the fastest I've measured.
I'd be happy to settle on the naive 16-bit version too, which would have a very
clear implementation that's only minorly slower than this version.  The other
two are way more complicated, and would require us to draw some serious ASCII
diagrams to explain.

I have learned that the .skp serialization tests (serialize-8888) have a nice
side effect of testing the correctness of these filters!

(Since writing the description above, I've bumped things up to {Paeth,Sub,Avg} x
{ 3 bpp, 4 bpp }.)

BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
==========
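
Since the patch now also covers Sub and Avg at 3 and 4 bpp, here is a scalar reference for those two filters as the PNG specification defines them. It is a sketch for comparison purposes, not this patch's code; the helper names are hypothetical. Something like this is what an optimized version would be checked against.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Undo the Sub filter in place: each byte adds the reconstructed byte
// one pixel (bpp bytes) to its left. The first pixel is left unchanged.
inline void sub_defilter_row(uint8_t* row, size_t n, size_t bpp) {
    assert(bpp >= 1);
    for (size_t i = bpp; i < n; i++) {
        row[i] = uint8_t(row[i] + row[i - bpp]);
    }
}

// Undo the Average filter in place: each byte adds floor((left + above) / 2),
// with the sum formed in a wider type so it cannot overflow 8 bits.
inline void avg_defilter_row(uint8_t* row, const uint8_t* prev,
                             size_t n, size_t bpp) {
    assert(bpp >= 1);
    for (size_t i = 0; i < n; i++) {
        unsigned a = i >= bpp ? row[i - bpp] : 0;  // left neighbour
        unsigned b = prev[i];                      // above
        row[i] = uint8_t(row[i] + ((a + b) >> 1));
    }
}
```

Passing `bpp` as 3 or 4 covers the combinations listed in the description.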

mtklein

https://codereview.chromium.org/1573943002/diff/350001/src/codec/SkPngFilters.cpp File src/codec/SkPngFilters.cpp (right): https://codereview.chromium.org/1573943002/diff/350001/src/codec/SkPngFilters.cpp#newcode79 src/codec/SkPngFilters.cpp:79: // SSE 4.1+ would be: return _mm_blendv_epi8(e,t,c); On 2016/01/27 ...

4 years, 11 months ago (2016-01-27 21:00:16 UTC) #74

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1573943002/390001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1573943002/390001

4 years, 11 months ago (2016-01-27 21:00:52 UTC) #76

commit-bot: I haz the power

Description was changed from ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS Local timing says this 4-byte Paeth ...

4 years, 11 months ago (2016-01-27 21:01:43 UTC) #77

Message was sent while issue was closed.

Description was changed from

==========
sketch hooking into PNG_FILTER_OPTIMIZATIONS

Local timing says this 4-byte Paeth function takes about 0.3x the time the
serial libpng code does, dropping from ~10 cycles per byte to ~2.9.

bpp=4 is mainly an easy demo.  This approach can work for any bpp up to 16, 1
pixel at a time, at roughly the same cost per pixel.  Doing more than 1 pixel at
a time is a tricky math problem I have yet to attempt to solve.

Everything here can be trivially downgraded to MMX, supporting bpp up to 8.  It
seems to be a little slower (~3.5 cycles per byte), but it would make the code
compatible with every x86 that can still power on.

I've tried four approaches:
  - this way;
  - doing things naively in 16-bit;
  - a 16-bit version that requires division by 3 (i.e. mulhi_epu16(..., 0x5580));
  - a mostly 8-bit version of the same.

They're all fine, but this one is consistently the fastest I've measured.
I'd be happy to settle on the naive 16-bit version too, which would have a very
clear implementation that's only minorly slower than this version.  The other
two are way more complicated, and would require us to draw some serious ASCII
diagrams to explain.

I have learned that the .skp serialization tests (serialize-8888) have a nice
side effect of testing the correctness of these filters!

(Since writing the description above, I've bumped things up to {Paeth,Sub,Avg} x
{ 3 bpp, 4 bpp }.)

BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...
==========

to

==========
sketch hooking into PNG_FILTER_OPTIMIZATIONS

Local timing says this 4-byte Paeth function takes about 0.3x the time the
serial libpng code does, dropping from ~10 cycles per byte to ~2.9.

bpp=4 is mainly an easy demo.  This approach can work for any bpp up to 16, 1
pixel at a time, at roughly the same cost per pixel.  Doing more than 1 pixel at
a time is a tricky math problem I have yet to attempt to solve.

Everything here can be trivially downgraded to MMX, supporting bpp up to 8.  It
seems to be a little slower (~3.5 cycles per byte), but it would make the code
compatible with every x86 that can still power on.

I've tried four approaches:
  - this way;
  - doing things naively in 16-bit;
  - a 16-bit version that requires division by 3 (i.e. mulhi_epu16(..., 0x5580));
  - a mostly 8-bit version of the same.

They're all fine, but this one is consistently the fastest I've measured.
I'd be happy to settle on the naive 16-bit version too, which would have a very
clear implementation that's only minorly slower than this version.  The other
two are way more complicated, and would require us to draw some serious ASCII
diagrams to explain.

I have learned that the .skp serialization tests (serialize-8888) have a nice
side effect of testing the correctness of these filters!

(Since writing the description above, I've bumped things up to {Paeth,Sub,Avg} x
{ 3 bpp, 4 bpp }.)

BUG=skia:
GOLD_TRYBOT_URL=
https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is...

Committed:
https://skia.googlesource.com/skia/+/372d65cc6ee743944a9fb58a734b3f1eb253b015
==========

commit-bot: I haz the power

4 years, 11 months ago (2016-01-27 21:01:44 UTC) #78

Message was sent while issue was closed.

Committed patchset #20 (id:390001) as
https://skia.googlesource.com/skia/+/372d65cc6ee743944a9fb58a734b3f1eb253b015

Issue 1573943002: sketch hooking into PNG_FILTER_OPTIMIZATIONS (Closed)

Patch Set 1

Patch Set 2 : working?

Patch Set 3 : simpler

Patch Set 4 : 8-bit

Patch Set 5 : whoops, ++

Patch Set 6 : avoid cmplt_epi8, refine comments

Patch Set 7 : faster to use correct p for a+b

Patch Set 8 : missing const

Patch Set 9 : tweak comments

Patch Set 10 : start fresh

Patch Set 11 : typo

Patch Set 12 : awkwording

Patch Set 13 : move

Patch Set 14 : quick hackup

Patch Set 15 : types

Patch Set 16 : update comments

Patch Set 17 : another try at 8-bit Avg

Patch Set 18 : dont bother with Up

Patch Set 19 : bpp=3

Patch Set 20 : asserts
