|
|
Descriptionsketch hooking into PNG_FILTER_OPTIMIZATIONS
Local timing says this 4-byte Paeth function takes about 0.3x the time the serial libpng code does, dropping from ~10 cycles per byte to ~2.9.
bpp=4 is mainly an easy demo. This approach can work for any bpp up to 16, 1 pixel at a time, at roughly the same cost per pixel. Doing more than 1 pixel at a time is a tricky math problem I have yet to attempt to solve.
Everything here can be trivially downgraded to MMX, supporting bpp up to 8. It seems to be a little slower (~3.5 cycles per byte), but it would make the code compatible with every x86 that can still power on.
I've tried four approaches:
- this way;
- doing things naively in 16-bit;
- a 16-bit version that requires division by 3 (i.e. mulhi_epu16(..., 0x5580) );
- a mostly 8-bit version of the same.
They're all fine, but this one is consistently the fastest I've measured.
I'd be happy to settle on the naive 16-bit version too, which would have a very clear implementation that's only minorly slower than this version. The other two are way more complicated, and would require us to draw some serious ASCII diagrams to explain.
I have learned that the .skp serialization tests (serialize-8888) have a nice side effect of testing the correctness of these filters!
(Since writing the description above, I've bumped things up to {Paeth,Sub,Avg} x { 3 bpp, 4 bpp }.)
BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&issue=1573943002
Committed: https://skia.googlesource.com/skia/+/372d65cc6ee743944a9fb58a734b3f1eb253b015
Patch Set 1 #Patch Set 2 : working? #Patch Set 3 : simpler #Patch Set 4 : 8-bit #Patch Set 5 : whoops, ++ #Patch Set 6 : avoid cmplt_epi8, refine comments #Patch Set 7 : faster to use correct p for a+b #Patch Set 8 : missing const #Patch Set 9 : tweak comments #Patch Set 10 : start fresh #Patch Set 11 : typo #Patch Set 12 : awkwording #
Total comments: 5
Patch Set 13 : move #Patch Set 14 : quick hackup #Patch Set 15 : types #Patch Set 16 : update comments #Patch Set 17 : another try at 8-bit Avg #Patch Set 18 : dont bother with Up #
Total comments: 2
Patch Set 19 : bpp=3 #Patch Set 20 : asserts #
Total comments: 1
Messages
Total messages: 78 (44 generated)
Description was changed from ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS BUG=skia: ========== to ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ==========
The CQ bit was checked by mtklein@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1573943002/40001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1573943002/40001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-Shared-Trybot on client.skia (JOB_FAILED, http://build.chromium.org/p/client.skia/builders/Test-Ubuntu-GCC-GCE-CPU-AVX2...)
The CQ bit was checked by mtklein@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1573943002/170001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1573943002/170001
Description was changed from ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ========== to ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS local timing says this 4-byte Paeth function runs in about 0.3x the time the existing serial code does. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ==========
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-Shared-Trybot on client.skia (JOB_FAILED, http://build.chromium.org/p/client.skia/builders/Test-Ubuntu-GCC-GCE-CPU-AVX2...) Build-Ubuntu-Clang-x86_64-Debug-Trybot on client.skia.compile (JOB_FAILED, http://build.chromium.org/p/client.skia.compile/builders/Build-Ubuntu-Clang-x...) Build-Ubuntu-GCC-x86_64-Release-Trybot on client.skia.compile (JOB_FAILED, http://build.chromium.org/p/client.skia.compile/builders/Build-Ubuntu-GCC-x86...)
Description was changed from ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS local timing says this 4-byte Paeth function runs in about 0.3x the time the existing serial code does. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ========== to ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS Local timing says this 4-byte Paeth function runs in about 0.3x the time the existing serial code does. bpp=4 is mainly an easy demo. This solution can work for any bpp up to 16, 1 pixel at a time, at roughly the same cost per pixel. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ==========
The CQ bit was checked by mtklein@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1573943002/190001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1573943002/190001
Description was changed from ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS Local timing says this 4-byte Paeth function runs in about 0.3x the time the existing serial code does. bpp=4 is mainly an easy demo. This solution can work for any bpp up to 16, 1 pixel at a time, at roughly the same cost per pixel. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ========== to ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS Local timing says this 4-byte Paeth function takes about 0.3x the time the existing serial code does. bpp=4 is mainly an easy demo. This solution can work for any bpp up to 16, 1 pixel at a time, at roughly the same cost per pixel. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ==========
The CQ bit was checked by mtklein@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1573943002/210001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1573943002/210001
Description was changed from ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS Local timing says this 4-byte Paeth function takes about 0.3x the time the existing serial code does. bpp=4 is mainly an easy demo. This solution can work for any bpp up to 16, 1 pixel at a time, at roughly the same cost per pixel. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ========== to ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS Local timing says this 4-byte Paeth function takes about 0.3x the time the existing serial code does, from ~10 cycles per byte to ~3. bpp=4 is mainly an easy demo. This solution can work for any bpp up to 16, 1 pixel at a time, at roughly the same cost per pixel. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ==========
Description was changed from ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS Local timing says this 4-byte Paeth function takes about 0.3x the time the existing serial code does, from ~10 cycles per byte to ~3. bpp=4 is mainly an easy demo. This solution can work for any bpp up to 16, 1 pixel at a time, at roughly the same cost per pixel. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ========== to ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS Local timing says this 4-byte Paeth function takes about 0.3x the time the existing serial code does, from ~10 cycles per byte to ~2.9. bpp=4 is mainly an easy demo. This solution can work for any bpp up to 16, 1 pixel at a time, at roughly the same cost per pixel. Everything here can actually be expressed with MMX, supporting bpp up to 8. It seems to be a little slower (~3.5 cycles per byte), but it would make the code compatible with essentially every x86 ever. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ==========
Description was changed from ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS Local timing says this 4-byte Paeth function takes about 0.3x the time the existing serial code does, from ~10 cycles per byte to ~2.9. bpp=4 is mainly an easy demo. This solution can work for any bpp up to 16, 1 pixel at a time, at roughly the same cost per pixel. Everything here can actually be expressed with MMX, supporting bpp up to 8. It seems to be a little slower (~3.5 cycles per byte), but it would make the code compatible with essentially every x86 ever. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ========== to ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS Local timing says this 4-byte Paeth function takes about 0.3x the time the existing serial code does, from ~10 cycles per byte to ~2.9. bpp=4 is mainly an easy demo. This solution can work for any bpp up to 16, 1 pixel at a time, at roughly the same cost per pixel. Everything here can be trivially downgraded to MMX, supporting bpp up to 8. It seems to be a little slower (~3.5 cycles per byte), but it would make the code compatible with every x86 that can still power on. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ==========
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
Description was changed from ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS Local timing says this 4-byte Paeth function takes about 0.3x the time the existing serial code does, from ~10 cycles per byte to ~2.9. bpp=4 is mainly an easy demo. This solution can work for any bpp up to 16, 1 pixel at a time, at roughly the same cost per pixel. Everything here can be trivially downgraded to MMX, supporting bpp up to 8. It seems to be a little slower (~3.5 cycles per byte), but it would make the code compatible with every x86 that can still power on. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ========== to ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS Local timing says this 4-byte Paeth function takes about 0.3x the time the serial libpng code does, dropping from ~10 cycles per byte to ~2.9. bpp=4 is mainly an easy demo. This approach can work for any bpp up to 16, 1 pixel at a time, at roughly the same cost per pixel. Doing more than 1 pixel at a time is a tricky math problem I have yet to attempt to solve. Everything here can be trivially downgraded to MMX, supporting bpp up to 8. It seems to be a little slower (~3.5 cycles per byte), but it would make the code compatible with every x86 that can still power on. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ==========
mtklein@chromium.org changed reviewers: + msarett@google.com
Matt, can you run whatever perf tests you do on the other PNG-decoding codec changes? How do you test correctness? I've just been waiting on trybot results to show up in Gold, though even there I'm not sure I know how to get at image results. They end up there right?
Description was changed from ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS Local timing says this 4-byte Paeth function takes about 0.3x the time the serial libpng code does, dropping from ~10 cycles per byte to ~2.9. bpp=4 is mainly an easy demo. This approach can work for any bpp up to 16, 1 pixel at a time, at roughly the same cost per pixel. Doing more than 1 pixel at a time is a tricky math problem I have yet to attempt to solve. Everything here can be trivially downgraded to MMX, supporting bpp up to 8. It seems to be a little slower (~3.5 cycles per byte), but it would make the code compatible with every x86 that can still power on. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ========== to ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS Local timing says this 4-byte Paeth function takes about 0.3x the time the serial libpng code does, dropping from ~10 cycles per byte to ~2.9. bpp=4 is mainly an easy demo. This approach can work for any bpp up to 16, 1 pixel at a time, at roughly the same cost per pixel. Doing more than 1 pixel at a time is a tricky math problem I have yet to attempt to solve. Everything here can be trivially downgraded to MMX, supporting bpp up to 8. It seems to be a little slower (~3.5 cycles per byte), but it would make the code compatible with every x86 that can still power on. Unpacking and doing things in 16-bit looks just as fast, and a lot simpler code. COMMIT=false BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ==========
Description was changed from ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS Local timing says this 4-byte Paeth function takes about 0.3x the time the serial libpng code does, dropping from ~10 cycles per byte to ~2.9. bpp=4 is mainly an easy demo. This approach can work for any bpp up to 16, 1 pixel at a time, at roughly the same cost per pixel. Doing more than 1 pixel at a time is a tricky math problem I have yet to attempt to solve. Everything here can be trivially downgraded to MMX, supporting bpp up to 8. It seems to be a little slower (~3.5 cycles per byte), but it would make the code compatible with every x86 that can still power on. Unpacking and doing things in 16-bit looks just as fast, and a lot simpler code. COMMIT=false BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ========== to ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS Local timing says this 4-byte Paeth function takes about 0.3x the time the serial libpng code does, dropping from ~10 cycles per byte to ~2.9. bpp=4 is mainly an easy demo. This approach can work for any bpp up to 16, 1 pixel at a time, at roughly the same cost per pixel. Doing more than 1 pixel at a time is a tricky math problem I have yet to attempt to solve. Everything here can be trivially downgraded to MMX, supporting bpp up to 8. It seems to be a little slower (~3.5 cycles per byte), but it would make the code compatible with every x86 that can still power on. Doing things naively in 16-bit looks almost as fast, but this looks a touch faster. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ==========
Description was changed from ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS Local timing says this 4-byte Paeth function takes about 0.3x the time the serial libpng code does, dropping from ~10 cycles per byte to ~2.9. bpp=4 is mainly an easy demo. This approach can work for any bpp up to 16, 1 pixel at a time, at roughly the same cost per pixel. Doing more than 1 pixel at a time is a tricky math problem I have yet to attempt to solve. Everything here can be trivially downgraded to MMX, supporting bpp up to 8. It seems to be a little slower (~3.5 cycles per byte), but it would make the code compatible with every x86 that can still power on. Doing things naively in 16-bit looks almost as fast, but this looks a touch faster. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ========== to ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS Local timing says this 4-byte Paeth function takes about 0.3x the time the serial libpng code does, dropping from ~10 cycles per byte to ~2.9. bpp=4 is mainly an easy demo. This approach can work for any bpp up to 16, 1 pixel at a time, at roughly the same cost per pixel. Doing more than 1 pixel at a time is a tricky math problem I have yet to attempt to solve. Everything here can be trivially downgraded to MMX, supporting bpp up to 8. It seems to be a little slower (~3.5 cycles per byte), but it would make the code compatible with every x86 that can still power on. I've tried four approaches: - this way; - doing things naively in 16-bit; - a 16-bit version that requires division by 3 (i.e. mulhi_epu16(..., 0x5580) ); - a mostly 8-bit version of the same. They're all fine, but this one is consistently the fastest I've measured. I'd be happy to settle on the naive 16-bit version too, which would have a very clear implementation that's only minorly slower than this version. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ==========
Description was changed from ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS Local timing says this 4-byte Paeth function takes about 0.3x the time the serial libpng code does, dropping from ~10 cycles per byte to ~2.9. bpp=4 is mainly an easy demo. This approach can work for any bpp up to 16, 1 pixel at a time, at roughly the same cost per pixel. Doing more than 1 pixel at a time is a tricky math problem I have yet to attempt to solve. Everything here can be trivially downgraded to MMX, supporting bpp up to 8. It seems to be a little slower (~3.5 cycles per byte), but it would make the code compatible with every x86 that can still power on. I've tried four approaches: - this way; - doing things naively in 16-bit; - a 16-bit version that requires division by 3 (i.e. mulhi_epu16(..., 0x5580) ); - a mostly 8-bit version of the same. They're all fine, but this one is consistently the fastest I've measured. I'd be happy to settle on the naive 16-bit version too, which would have a very clear implementation that's only minorly slower than this version. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ========== to ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS Local timing says this 4-byte Paeth function takes about 0.3x the time the serial libpng code does, dropping from ~10 cycles per byte to ~2.9. bpp=4 is mainly an easy demo. This approach can work for any bpp up to 16, 1 pixel at a time, at roughly the same cost per pixel. Doing more than 1 pixel at a time is a tricky math problem I have yet to attempt to solve. Everything here can be trivially downgraded to MMX, supporting bpp up to 8. It seems to be a little slower (~3.5 cycles per byte), but it would make the code compatible with every x86 that can still power on. I've tried four approaches: - this way; - doing things naively in 16-bit; - a 16-bit version that requires division by 3 (i.e. mulhi_epu16(..., 0x5580) ); - a mostly 8-bit version of the same. They're all fine, but this one is consistently the fastest I've measured. I'd be happy to settle on the naive 16-bit version too, which would have a very clear implementation that's only minorly slower than this version. The other two are way more complicated, and would require us to draw some serious ASCII diagrams to explain. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ==========
Here's a few thoughts. "Matt, can you run whatever perf tests you do on the other PNG-decoding codec changes?" I'm happy to. Generally I just run CodecBench (commenting out a few lines in nanobench.cpp to only test kN32 and to not skip the images). I do need to pull together a set of PNGs that use the paeth filter. "How do you test correctness? I've just been waiting on trybot results to show up in Gold, though even there I'm not sure I know how to get at image results. They end up there right?" I don't really know how to make the trybots upload to Gold (or that it is/isn't possible). But any images that you change/break will show up here when you commit: https://gold.skia.org/list?query=source_type%3Dimage Generally I just run dm and look at them. And then if there are issues I can't see, Gold catches them later. "Everything here can be trivially downgraded to MMX." Good to know. Do you see a reason to optimize MMX? I'll work on pulling together some PNGs. Also, I really need to put my thinking cap on and understand the code :). https://codereview.chromium.org/1573943002/diff/210001/src/codec/SkPngCodec.cpp File src/codec/SkPngCodec.cpp (right): https://codereview.chromium.org/1573943002/diff/210001/src/codec/SkPngCodec.c... src/codec/SkPngCodec.cpp:19: #if defined(__SSE2__) Is there another file we can put this in? Seems to add complicated to code a file that's already pretty long. And maybe I'm looking too far ahead, but I'm guessing that there'll be more filter opts to come - Ooh maybe src/opts/SkPngFilters_opt? https://codereview.chromium.org/1573943002/diff/210001/third_party/libpng/png... File third_party/libpng/pnglibconf.h (right): https://codereview.chromium.org/1573943002/diff/210001/third_party/libpng/png... third_party/libpng/pnglibconf.h:214: #if defined(__SSE2__) Wow! We can hook into libpng without upstreaming anything?! Maybe this is still something to consider, but it's nice to not depend on it!
mtklein@google.com changed reviewers: + mtklein@google.com
https://codereview.chromium.org/1573943002/diff/210001/src/codec/SkPngCodec.cpp File src/codec/SkPngCodec.cpp (right): https://codereview.chromium.org/1573943002/diff/210001/src/codec/SkPngCodec.c... src/codec/SkPngCodec.cpp:19: #if defined(__SSE2__) On 2016/01/26 22:35:02, msarett wrote: > Is there another file we can put this in? > > Seems to add complicated to code a file that's already pretty long. And maybe > I'm looking too far ahead, but I'm guessing that there'll be more filter opts to > come - Ooh maybe src/opts/SkPngFilters_opt? I do think there will be more of these. We'll definitely want Paeth for bpp=3 in addition to this one for bpp=4. Left and Avg should be similar to Paeth, but a little faster and rather easier to read and write. Up is _so_ easy to write that we'll want to do that just because it'll be so fun. A single implementation for Up can handle all bpp, and can handle any number of pixels at a time. I'll leave that one for you. :) As long as we're working on one pixel at a time, bpp only matters as far as loading and storing. So in terms of speed / implementation difficultly, things break down into a couple classes by bpp: {4,8}: fastest {2,3}: not bad 6: annoying It's probably difficult to improve anything but Up for bpp=1 (notice that this code is not below 0.25x). I figure we don't care much about 16-bit component formats (bpp=2,6,8)? https://codereview.chromium.org/1573943002/diff/210001/third_party/libpng/png... File third_party/libpng/pnglibconf.h (right): https://codereview.chromium.org/1573943002/diff/210001/third_party/libpng/png... third_party/libpng/pnglibconf.h:214: #if defined(__SSE2__) On 2016/01/26 22:35:02, msarett wrote: > Wow! We can hook into libpng without upstreaming anything?! Maybe this is > still something to consider, but it's nice to not depend on it! Yep.
The CQ bit was checked by mtklein@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1573943002/250001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1573943002/250001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Debug-Trybot on client.skia (JOB_FAILED, http://build.chromium.org/p/client.skia/builders/Test-Ubuntu-GCC-GCE-CPU-AVX2...) Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-Shared-Trybot on client.skia (JOB_FAILED, http://build.chromium.org/p/client.skia/builders/Test-Ubuntu-GCC-GCE-CPU-AVX2...) Build-Ubuntu-Clang-x86_64-Debug-Trybot on client.skia.compile (JOB_FAILED, http://build.chromium.org/p/client.skia.compile/builders/Build-Ubuntu-Clang-x...)
https://codereview.chromium.org/1573943002/diff/210001/src/codec/SkPngCodec.cpp File src/codec/SkPngCodec.cpp (right): https://codereview.chromium.org/1573943002/diff/210001/src/codec/SkPngCodec.c... src/codec/SkPngCodec.cpp:19: #if defined(__SSE2__) I've moved things to a new file, src/codec/SkPngFilters.cpp. I've taken the opportunity to implement a few more filters. I'm not confident that these new ones are any of: 1. faster, 2. usefully faster, or 3. correct. But I figured you might like to see how I'd evolve the file if we do write those. Still TBD are load3() / store3() and consequently the bpp=3 variants of Sub, Avg, and Paeth.
The CQ bit was checked by mtklein@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1573943002/270001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1573943002/270001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
The CQ bit was checked by mtklein@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1573943002/290001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1573943002/290001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-Shared-Trybot on client.skia (JOB_FAILED, http://build.chromium.org/p/client.skia/builders/Test-Ubuntu-GCC-GCE-CPU-AVX2...)
The CQ bit was checked by mtklein@google.com to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1573943002/290001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1573943002/290001
Patchset #16 (id:290001) has been deleted
Dry run: Try jobs failed on following builders: Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-Shared-Trybot on client.skia (JOB_FAILED, http://build.chromium.org/p/client.skia/builders/Test-Ubuntu-GCC-GCE-CPU-AVX2...)
The CQ bit was unchecked by mtklein@chromium.org
The CQ bit was checked by mtklein@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1573943002/270001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1573943002/270001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
Description was changed from ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS Local timing says this 4-byte Paeth function takes about 0.3x the time the serial libpng code does, dropping from ~10 cycles per byte to ~2.9. bpp=4 is mainly an easy demo. This approach can work for any bpp up to 16, 1 pixel at a time, at roughly the same cost per pixel. Doing more than 1 pixel at a time is a tricky math problem I have yet to attempt to solve. Everything here can be trivially downgraded to MMX, supporting bpp up to 8. It seems to be a little slower (~3.5 cycles per byte), but it would make the code compatible with every x86 that can still power on. I've tried four approaches: - this way; - doing things naively in 16-bit; - a 16-bit version that requires division by 3 (i.e. mulhi_epu16(..., 0x5580) ); - a mostly 8-bit version of the same. They're all fine, but this one is consistently the fastest I've measured. I'd be happy to settle on the naive 16-bit version too, which would have a very clear implementation that's only minorly slower than this version. The other two are way more complicated, and would require us to draw some serious ASCII diagrams to explain. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ========== to ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS Local timing says this 4-byte Paeth function takes about 0.3x the time the serial libpng code does, dropping from ~10 cycles per byte to ~2.9. bpp=4 is mainly an easy demo. This approach can work for any bpp up to 16, 1 pixel at a time, at roughly the same cost per pixel. Doing more than 1 pixel at a time is a tricky math problem I have yet to attempt to solve. Everything here can be trivially downgraded to MMX, supporting bpp up to 8. It seems to be a little slower (~3.5 cycles per byte), but it would make the code compatible with every x86 that can still power on. I've tried four approaches: - this way; - doing things naively in 16-bit; - a 16-bit version that requires division by 3 (i.e. mulhi_epu16(..., 0x5580) ); - a mostly 8-bit version of the same. They're all fine, but this one is consistently the fastest I've measured. I'd be happy to settle on the naive 16-bit version too, which would have a very clear implementation that's only minorly slower than this version. The other two are way more complicated, and would require us to draw some serious ASCII diagrams to explain. I have learned that the .skp serialization tests (serialize-8888) have a nice side effect of testing the correctness of these filters! BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ==========
The CQ bit was checked by mtklein@google.com to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1573943002/310001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1573943002/310001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
The CQ bit was checked by mtklein@google.com to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1573943002/330001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1573943002/330001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
The CQ bit was checked by mtklein@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1573943002/350001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1573943002/350001
I think I've got Paeth, Avg, and Sub all correct and usefully faster than the libpng versions. libpng's Up autovectorizes pretty perfectly, so it's probably not worth re-implementing. I just ran a small experiment comparing runs with the same input and equal time spent in png_read_IDAT_data as a control. Here's how the costs changed: Paeth: 0.37x runtime Avg: 0.46x Sub: 0.23x --------------------- Overall decode: 0.73x, with almost all of that win coming from a faster Paeth.
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
The CQ bit was checked by mtklein@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1573943002/390001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1573943002/390001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
LGTM Woohoo! https://codereview.chromium.org/1573943002/diff/350001/src/codec/SkPngFilters... File src/codec/SkPngFilters.cpp (right): https://codereview.chromium.org/1573943002/diff/350001/src/codec/SkPngFilters... src/codec/SkPngFilters.cpp:79: // SSE 4.1+ would be: return _mm_blendv_epi8(e,t,c); Why not add a #if here? https://codereview.chromium.org/1573943002/diff/390001/src/codec/SkPngFilters... File src/codec/SkPngFilters.cpp (right): https://codereview.chromium.org/1573943002/diff/390001/src/codec/SkPngFilters... src/codec/SkPngFilters.cpp:123: // and the predictor remains the same. (Again, brute force.) I've also confirmed this with brute force. Though not completely understanding why bugs me a little bit.
Description was changed from ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS Local timing says this 4-byte Paeth function takes about 0.3x the time the serial libpng code does, dropping from ~10 cycles per byte to ~2.9. bpp=4 is mainly an easy demo. This approach can work for any bpp up to 16, 1 pixel at a time, at roughly the same cost per pixel. Doing more than 1 pixel at a time is a tricky math problem I have yet to attempt to solve. Everything here can be trivially downgraded to MMX, supporting bpp up to 8. It seems to be a little slower (~3.5 cycles per byte), but it would make the code compatible with every x86 that can still power on. I've tried four approaches: - this way; - doing things naively in 16-bit; - a 16-bit version that requires division by 3 (i.e. mulhi_epu16(..., 0x5580) ); - a mostly 8-bit version of the same. They're all fine, but this one is consistently the fastest I've measured. I'd be happy to settle on the naive 16-bit version too, which would have a very clear implementation that's only minorly slower than this version. The other two are way more complicated, and would require us to draw some serious ASCII diagrams to explain. I have learned that the .skp serialization tests (serialize-8888) have a nice side effect of testing the correctness of these filters! BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ========== to ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS Local timing says this 4-byte Paeth function takes about 0.3x the time the serial libpng code does, dropping from ~10 cycles per byte to ~2.9. bpp=4 is mainly an easy demo. This approach can work for any bpp up to 16, 1 pixel at a time, at roughly the same cost per pixel. Doing more than 1 pixel at a time is a tricky math problem I have yet to attempt to solve. Everything here can be trivially downgraded to MMX, supporting bpp up to 8. It seems to be a little slower (~3.5 cycles per byte), but it would make the code compatible with every x86 that can still power on. I've tried four approaches: - this way; - doing things naively in 16-bit; - a 16-bit version that requires division by 3 (i.e. mulhi_epu16(..., 0x5580) ); - a mostly 8-bit version of the same. They're all fine, but this one is consistently the fastest I've measured. I'd be happy to settle on the naive 16-bit version too, which would have a very clear implementation that's only minorly slower than this version. The other two are way more complicated, and would require us to draw some serious ASCII diagrams to explain. I have learned that the .skp serialization tests (serialize-8888) have a nice side effect of testing the correctness of these filters! (Since writing the description above, I've bumped things up to {Paeth,Sub,Avg} x { 3 bpp, 4 bpp }.) BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ==========
https://codereview.chromium.org/1573943002/diff/350001/src/codec/SkPngFilters... File src/codec/SkPngFilters.cpp (right): https://codereview.chromium.org/1573943002/diff/350001/src/codec/SkPngFilters... src/codec/SkPngFilters.cpp:79: // SSE 4.1+ would be: return _mm_blendv_epi8(e,t,c); On 2016/01/27 20:56:10, msarett wrote: > Why not add a #if here? Mostly because we won't have a bot to test it, though I do acknowledge that x86 Android might benefit. Let's circle back once we see how this works / performs?
The CQ bit was checked by mtklein@google.com
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1573943002/390001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1573943002/390001
Message was sent while issue was closed.
Description was changed from ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS Local timing says this 4-byte Paeth function takes about 0.3x the time the serial libpng code does, dropping from ~10 cycles per byte to ~2.9. bpp=4 is mainly an easy demo. This approach can work for any bpp up to 16, 1 pixel at a time, at roughly the same cost per pixel. Doing more than 1 pixel at a time is a tricky math problem I have yet to attempt to solve. Everything here can be trivially downgraded to MMX, supporting bpp up to 8. It seems to be a little slower (~3.5 cycles per byte), but it would make the code compatible with every x86 that can still power on. I've tried four approaches: - this way; - doing things naively in 16-bit; - a 16-bit version that requires division by 3 (i.e. mulhi_epu16(..., 0x5580) ); - a mostly 8-bit version of the same. They're all fine, but this one is consistently the fastest I've measured. I'd be happy to settle on the naive 16-bit version too, which would have a very clear implementation that's only minorly slower than this version. The other two are way more complicated, and would require us to draw some serious ASCII diagrams to explain. I have learned that the .skp serialization tests (serialize-8888) have a nice side effect of testing the correctness of these filters! (Since writing the description above, I've bumped things up to {Paeth,Sub,Avg} x { 3 bpp, 4 bpp }.) BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... ========== to ========== sketch hooking into PNG_FILTER_OPTIMIZATIONS Local timing says this 4-byte Paeth function takes about 0.3x the time the serial libpng code does, dropping from ~10 cycles per byte to ~2.9. bpp=4 is mainly an easy demo. This approach can work for any bpp up to 16, 1 pixel at a time, at roughly the same cost per pixel. Doing more than 1 pixel at a time is a tricky math problem I have yet to attempt to solve. Everything here can be trivially downgraded to MMX, supporting bpp up to 8. It seems to be a little slower (~3.5 cycles per byte), but it would make the code compatible with every x86 that can still power on. I've tried four approaches: - this way; - doing things naively in 16-bit; - a 16-bit version that requires division by 3 (i.e. mulhi_epu16(..., 0x5580) ); - a mostly 8-bit version of the same. They're all fine, but this one is consistently the fastest I've measured. I'd be happy to settle on the naive 16-bit version too, which would have a very clear implementation that's only minorly slower than this version. The other two are way more complicated, and would require us to draw some serious ASCII diagrams to explain. I have learned that the .skp serialization tests (serialize-8888) have a nice side effect of testing the correctness of these filters! (Since writing the description above, I've bumped things up to {Paeth,Sub,Avg} x { 3 bpp, 4 bpp }.) BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dgm&master=false&is... Committed: https://skia.googlesource.com/skia/+/372d65cc6ee743944a9fb58a734b3f1eb253b015 ==========
Message was sent while issue was closed.
Committed patchset #20 (id:390001) as https://skia.googlesource.com/skia/+/372d65cc6ee743944a9fb58a734b3f1eb253b015 |