Description
Look beyond SSE2 for Paeth
You can break this CL down into three steps. Steps 2 and 3 depend on 1.
Step 1: go to a 16-bit impl. Speed ~unaffected.
Step 2: use SSSE3 16-bit abs. ~20% speedup to Paeth.
Step 3: use SSE4.1 blendv, total ~25% speedup to Paeth.
Overall this can improve PNG decoding by around 8% end-to-end.
I would feel most comfortable landing this only after we have a bot exercising the SSE4.1 code, either by moving this stuff behind a function pointer (simulating Chrome/Clank) or by adding a builder with at least SSE4.1 at compile time (simulating an Android system build). We've got plenty of bots building with SSSE3 at compile time to test that path.
BUG=skia:
GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dimage&master=false&issue=1657503002
Committed: https://skia.googlesource.com/skia/+/b21c752eb3d55970ac45daaf3fd2cbda39c7658a
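For readers following Step 1, here is a minimal scalar sketch of the standard PNG Paeth predictor. It is not the CL's SIMD code, and the names `paeth_predict` and `paeth_unfilter` are made up for illustration. It shows why the intermediates want 16 bits: the term a + b - 2*c can range over [-510, 510], which does not fit in a byte.

```cpp
#include <cstdint>
#include <cstdlib>

// Scalar PNG Paeth predictor: a = left, b = above, c = upper-left.
// With p = a + b - c, the three distances below are |p - a|, |p - b|, |p - c|.
// They are held in 16 bits because a + b - 2*c can reach +/-510.
static uint8_t paeth_predict(uint8_t a, uint8_t b, uint8_t c) {
    int16_t pa = static_cast<int16_t>(std::abs(b - c));
    int16_t pb = static_cast<int16_t>(std::abs(a - c));
    int16_t pc = static_cast<int16_t>(std::abs(a + b - 2 * c));
    if (pa <= pb && pa <= pc) return a;  // ties prefer a, then b, then c
    if (pb <= pc) return b;
    return c;
}

// Reconstruct one byte: filtered value plus prediction, wrapping modulo 256.
static uint8_t paeth_unfilter(uint8_t filtered, uint8_t a, uint8_t b, uint8_t c) {
    return static_cast<uint8_t>(filtered + paeth_predict(a, b, c));
}
```

The absolute values here are what Step 2 would compute with a 16-bit abs, and the final choice among a, b and c is the kind of select that Step 3 would express with blendv.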
Patch Set 1
Patch Set 2 : comments
Patch Set 3 : more comments
Patch Set 4 : typo
Total comments: 7
Patch Set 5 : kill sse4.1
Messages
Total messages: 35 (16 generated)
The CQ bit was checked by mtklein@google.com to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1657503002/40001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1657503002/40001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: Test-Ubuntu-GCC-GCE-CPU-AVX2-x86_64-Release-Shared-Trybot on client.skia (JOB_FAILED, http://build.chromium.org/p/client.skia/builders/Test-Ubuntu-GCC-GCE-CPU-AVX2...)
Description was changed from ========== Look beyond SSE2 for Paeth You can break this CL down into three steps. Steps 2 and 3 depend on 1. Step 1: go to a 16-bit impl. Speed ~unaffected. Step 2: use SSSE3 16-bit abs. ~20% speedup. Step 3: use SSE4.1 blendv, total ~25% speedup. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dimage&master=false... ========== to ========== Look beyond SSE2 for Paeth You can break this CL down into three steps. Steps 2 and 3 depend on 1. Step 1: go to a 16-bit impl. Speed ~unaffected. Step 2: use SSSE3 16-bit abs. ~20% speedup. Step 3: use SSE4.1 blendv, total ~25% speedup. I would feel most comfortable landing this only after we have a bot exercising the SSE4.1 code, either by moving this stuff behind a function pointer (simulating Chrome/Clank) or by adding a builder with at least SSE4.1 at compile time (simulating an Android system build). We've got plenty of bots building with SSSE3 at compile time to test that path. On the other hand, the non-SSE4.1 code is actually more complicated than the SSE4.1 code... I might be convinced to just land this. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dimage&master=false... ==========
Description was changed from ========== Look beyond SSE2 for Paeth You can break this CL down into three steps. Steps 2 and 3 depend on 1. Step 1: go to a 16-bit impl. Speed ~unaffected. Step 2: use SSSE3 16-bit abs. ~20% speedup to Paeth. Step 3: use SSE4.1 blendv, total ~25% speedup to Paeth. I would feel most comfortable landing this only after we have a bot exercising the SSE4.1 code, either by moving this stuff behind a function pointer (simulating Chrome/Clank) or by adding a builder with at least SSE4.1 at compile time (simulating an Android system build). We've got plenty of bots building with SSSE3 at compile time to test that path. On the other hand, the non-SSE4.1 code is actually more complicated than the SSE4.1 code... I might be convinced to just land this. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dimage&master=false... ========== to ========== Look beyond SSE2 for Paeth You can break this CL down into three steps. Steps 2 and 3 depend on 1. Step 1: go to a 16-bit impl. Speed ~unaffected. Step 2: use SSSE3 16-bit abs. ~20% speedup to Paeth. Step 3: use SSE4.1 blendv, total ~25% speedup to Paeth. Overall this can improve PNG decoding by around 10% end-to-end. I would feel most comfortable landing this only after we have a bot exercising the SSE4.1 code, either by moving this stuff behind a function pointer (simulating Chrome/Clank) or by adding a builder with at least SSE4.1 at compile time (simulating an Android system build). We've got plenty of bots building with SSSE3 at compile time to test that path. On the other hand, the non-SSE4.1 code is actually more complicated than the SSE4.1 code... I might be convinced to just land this. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dimage&master=false... ==========
Description was changed from ========== Look beyond SSE2 for Paeth You can break this CL down into three steps. Steps 2 and 3 depend on 1. Step 1: go to a 16-bit impl. Speed ~unaffected. Step 2: use SSSE3 16-bit abs. ~20% speedup to Paeth. Step 3: use SSE4.1 blendv, total ~25% speedup to Paeth. Overall this can improve PNG decoding by around 10% end-to-end. I would feel most comfortable landing this only after we have a bot exercising the SSE4.1 code, either by moving this stuff behind a function pointer (simulating Chrome/Clank) or by adding a builder with at least SSE4.1 at compile time (simulating an Android system build). We've got plenty of bots building with SSSE3 at compile time to test that path. On the other hand, the non-SSE4.1 code is actually more complicated than the SSE4.1 code... I might be convinced to just land this. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dimage&master=false... ========== to ========== Look beyond SSE2 for Paeth You can break this CL down into three steps. Steps 2 and 3 depend on 1. Step 1: go to a 16-bit impl. Speed ~unaffected. Step 2: use SSSE3 16-bit abs. ~20% speedup to Paeth. Step 3: use SSE4.1 blendv, total ~25% speedup to Paeth. Overall this can improve PNG decoding by around 8% end-to-end. I would feel most comfortable landing this only after we have a bot exercising the SSE4.1 code, either by moving this stuff behind a function pointer (simulating Chrome/Clank) or by adding a builder with at least SSE4.1 at compile time (simulating an Android system build). We've got plenty of bots building with SSSE3 at compile time to test that path. On the other hand, the non-SSE4.1 code is actually more complicated than the SSE4.1 code... I might be convinced to just land this. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dimage&master=false... ==========
Description was changed from ========== Look beyond SSE2 for Paeth You can break this CL down into three steps. Steps 2 and 3 depend on 1. Step 1: go to a 16-bit impl. Speed ~unaffected. Step 2: use SSSE3 16-bit abs. ~20% speedup to Paeth. Step 3: use SSE4.1 blendv, total ~25% speedup to Paeth. Overall this can improve PNG decoding by around 8% end-to-end. I would feel most comfortable landing this only after we have a bot exercising the SSE4.1 code, either by moving this stuff behind a function pointer (simulating Chrome/Clank) or by adding a builder with at least SSE4.1 at compile time (simulating an Android system build). We've got plenty of bots building with SSSE3 at compile time to test that path. On the other hand, the non-SSE4.1 code is actually more complicated than the SSE4.1 code... I might be convinced to just land this. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dimage&master=false... ========== to ========== Look beyond SSE2 for Paeth You can break this CL down into three steps. Steps 2 and 3 depend on 1. Step 1: go to a 16-bit impl. Speed ~unaffected. Step 2: use SSSE3 16-bit abs. ~20% speedup to Paeth. Step 3: use SSE4.1 blendv, total ~25% speedup to Paeth. Overall this can improve PNG decoding by around 8% end-to-end. I would feel most comfortable landing this only after we have a bot exercising the SSE4.1 code, either by moving this stuff behind a function pointer (simulating Chrome/Clank) or by adding a builder with at least SSE4.1 at compile time (simulating an Android system build). We've got plenty of bots building with SSSE3 at compile time to test that path. COMMIT=false BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dimage&master=false... ==========
mtklein@chromium.org changed reviewers: + msarett@google.com - mtklein@google.com
There are failing bots, so clearly this is not correct, but I thought you might like an early look at what I'm trying to do to speed up Paeth a bit more. On my machine with an arbitrary test image, this brings sk_paeth4_sse2 from ~17.5% of decode cost to ~12.5%. (Which makes inflate go from ~75% of the decode cost to ~80%.)
The CQ bit was checked by mtklein@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1657503002/60001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1657503002/60001
On 2016/01/31 at 16:11:34, mtklein_C wrote:
> There are failing bots, so clearly this is not correct, but I thought you might like an early look at what I'm trying to do to speed up Paeth a bit more.
>
> On my machine with an arbitrary test image, this brings sk_paeth4_sse2 from ~17.5% of decode cost to ~12.5%. (Which makes inflate go from ~75% of the decode cost to ~80%.)

I think I've found the correctness problem. It was a typo, which sadly also had the effect of making some code artificially dead. Now that that code is live again, the performance in the correct version will be a little worse than I've quoted above, more like 17.5% -> 13.5%, maybe a max of 7% end-to-end speedup.
Description was changed from ========== Look beyond SSE2 for Paeth You can break this CL down into three steps. Steps 2 and 3 depend on 1. Step 1: go to a 16-bit impl. Speed ~unaffected. Step 2: use SSSE3 16-bit abs. ~20% speedup to Paeth. Step 3: use SSE4.1 blendv, total ~25% speedup to Paeth. Overall this can improve PNG decoding by around 8% end-to-end. I would feel most comfortable landing this only after we have a bot exercising the SSE4.1 code, either by moving this stuff behind a function pointer (simulating Chrome/Clank) or by adding a builder with at least SSE4.1 at compile time (simulating an Android system build). We've got plenty of bots building with SSSE3 at compile time to test that path. COMMIT=false BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dimage&master=false... ========== to ========== Look beyond SSE2 for Paeth You can break this CL down into three steps. Steps 2 and 3 depend on 1. Step 1: go to a 16-bit impl. Speed ~unaffected. Step 2: use SSSE3 16-bit abs. ~20% speedup to Paeth. Step 3: use SSE4.1 blendv, total ~25% speedup to Paeth. Overall this can improve PNG decoding by around 8% end-to-end. I would feel most comfortable landing this only after we have a bot exercising the SSE4.1 code, either by moving this stuff behind a function pointer (simulating Chrome/Clank) or by adding a builder with at least SSE4.1 at compile time (simulating an Android system build). We've got plenty of bots building with SSSE3 at compile time to test that path. BUG=skia: GOLD_TRYBOT_URL= https://gold.skia.org/search2?unt=true&query=source_type%3Dimage&master=false... ==========
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
I like that this is faster and easier to understand. I agree that we need a bot testing SSE 4.1 before we land. Is this something you expect to happen soon? It'd be nice to have a bot each for SSE2, SSSE3, SSE4.1.

https://codereview.chromium.org/1657503002/diff/60001/src/codec/SkPngFilters.cpp
File src/codec/SkPngFilters.cpp (right):

https://codereview.chromium.org/1657503002/diff/60001/src/codec/SkPngFilters....
src/codec/SkPngFilters.cpp:100: return _mm_blendv_epi8(e,t,c);
Is this more clear as blendv_epi16?

https://codereview.chromium.org/1657503002/diff/60001/src/codec/SkPngFilters....
src/codec/SkPngFilters.cpp:142: d = _mm_add_epi8(d, nearest); // Note `_epi8`: we need addition to wrap modulo 255.
Can we call packus(nearest) before this add? That might make the use of epi8 clearer. Also that would involve not needing to unpack d to 16 bit at the start?
mtklein@google.com changed reviewers: + mtklein@google.com
https://codereview.chromium.org/1657503002/diff/60001/src/codec/SkPngFilters.cpp
File src/codec/SkPngFilters.cpp (right):

https://codereview.chromium.org/1657503002/diff/60001/src/codec/SkPngFilters....
src/codec/SkPngFilters.cpp:100: return _mm_blendv_epi8(e,t,c);
On 2016/02/01 15:17:34, msarett wrote:
> Is this more clear as blendv_epi16?

Yes, but there's no such instruction. In practice, by the time we call this we're fine to operate byte-by-byte.

https://codereview.chromium.org/1657503002/diff/60001/src/codec/SkPngFilters....
src/codec/SkPngFilters.cpp:142: d = _mm_add_epi8(d, nearest); // Note `_epi8`: we need addition to wrap modulo 255.
On 2016/02/01 15:17:34, msarett wrote:
> Can we call packus(nearest) before this add?
>
> That might make the use of epi8 clearer. Also that would involve not needing to unpack d to 16 bit at the start?

We need to unpack to be able to do the middle bits (sub_epi16...abs_i16). That in turn means 'a' wants to start unpacked (or we'd need to re-unpack it). That in turn means we want to leave d unpacked.

Luckily, we know all of {a,b,c,d} are really 8 bit, with zeros in the upper bytes of each 16-bit lane, so the 8-bit add here doesn't really have any chance of going wrong.

Let me hack around a bit on the alternative you suggest here, which is to load d packed, but unpack it when moving it to a. I think that'd work fine.
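The two points in this reply can be checked with a small scalar emulation of a single 16-bit lane. This is only a sketch: `add_bytes_in_lane` and `select_lane` are hypothetical helpers, not functions from the CL, and the and/andnot/or select is a common stand-in for a blend that is assumed here, not taken from the patch. An 8-bit add never carries into the neighbouring byte, so with a zero upper byte the low byte simply wraps modulo 256 (the byte range is 0..255) and the upper byte stays zero. A select whose mask is all ones or all zeros across the lane picks the same source for both bytes, so doing it byte-by-byte gives the 16-bit answer.

```cpp
#include <cstdint>

// Emulates one 16-bit lane of a per-byte add such as _mm_add_epi8:
// each byte is added independently, so no carry crosses between bytes.
static uint16_t add_bytes_in_lane(uint16_t x, uint16_t y) {
    uint8_t lo = static_cast<uint8_t>((x & 0xFF) + (y & 0xFF));  // wraps modulo 256
    uint8_t hi = static_cast<uint8_t>((x >> 8) + (y >> 8));
    return static_cast<uint16_t>((hi << 8) | lo);
}

// Emulates one 16-bit lane of a bitwise select: t where mask is set, else e.
// mask is expected to be 0x0000 or 0xFFFF, as a 16-bit compare would produce.
static uint16_t select_lane(uint16_t mask, uint16_t t, uint16_t e) {
    return static_cast<uint16_t>((mask & t) | (~mask & e));
}
```

For example, adding the lanes 0x00C8 (200) and 0x0064 (100) gives 0x002C (44), which is 300 modulo 256 with the upper byte still zero.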
https://codereview.chromium.org/1657503002/diff/60001/src/codec/SkPngFilters.cpp
File src/codec/SkPngFilters.cpp (right):

https://codereview.chromium.org/1657503002/diff/60001/src/codec/SkPngFilters....
src/codec/SkPngFilters.cpp:100: return _mm_blendv_epi8(e,t,c);
On 2016/02/01 15:28:15, mtklein wrote:
> On 2016/02/01 15:17:34, msarett wrote:
> > Is this more clear as blendv_epi16?
>
> Yes, but there's no such instruction. In practice, by the time we call this we're fine to operate byte-by-byte.

Ah of course :)

https://codereview.chromium.org/1657503002/diff/60001/src/codec/SkPngFilters....
src/codec/SkPngFilters.cpp:142: d = _mm_add_epi8(d, nearest); // Note `_epi8`: we need addition to wrap modulo 255.
On 2016/02/01 15:28:15, mtklein wrote:
> On 2016/02/01 15:17:34, msarett wrote:
> > Can we call packus(nearest) before this add?
> >
> > That might make the use of epi8 clearer. Also that would involve not needing to unpack d to 16 bit at the start?
>
> We need to unpack to be able to do the middle bits (sub_epi16...abs_i16). That in turn means 'a' wants to start unpacked (or we'd need to re-unpack it). That in turn means we want to leave d unpacked.
>
> Luckily, we know all of {a,b,c,d} are really 8 bit, with zeros in the upper bytes of each 16-bit lane, so the 8-bit add here doesn't really have any chance of going wrong.
>
> Let me hack around a bit on the alternative you suggest here, which is to load d packed, but unpack it when moving it to a. I think that'd work fine.

Oh I see, "a" is set to "d", and "a" must be unpacked.

Either approach is fine by me, since it looks like my suggestion won't save any instructions anyway.
https://codereview.chromium.org/1657503002/diff/60001/src/codec/SkPngFilters.cpp
File src/codec/SkPngFilters.cpp (right):

https://codereview.chromium.org/1657503002/diff/60001/src/codec/SkPngFilters....
src/codec/SkPngFilters.cpp:142: d = _mm_add_epi8(d, nearest); // Note `_epi8`: we need addition to wrap modulo 255.
On 2016/02/01 15:49:41, msarett wrote:
> On 2016/02/01 15:28:15, mtklein wrote:
> > On 2016/02/01 15:17:34, msarett wrote:
> > > Can we call packus(nearest) before this add?
> > >
> > > That might make the use of epi8 clearer. Also that would involve not needing to unpack d to 16 bit at the start?
> >
> > We need to unpack to be able to do the middle bits (sub_epi16...abs_i16). That in turn means 'a' wants to start unpacked (or we'd need to re-unpack it). That in turn means we want to leave d unpacked.
> >
> > Luckily, we know all of {a,b,c,d} are really 8 bit, with zeros in the upper bytes of each 16-bit lane, so the 8-bit add here doesn't really have any chance of going wrong.
> >
> > Let me hack around a bit on the alternative you suggest here, which is to load d packed, but unpack it when moving it to a. I think that'd work fine.
>
> Oh I see, "a" is set to "d", and "a" must be unpacked.
>
> Either approach is fine by me, since it looks like my suggestion won't save any instructions anyway.

Just tried this out. Correctness-wise what you suggested works fine, and it's the same number of instructions.

It does change how the instructions can schedule, and doesn't work out in our favor: IACA says the old code had 14 instructions on a 13.9 cycle critical path, and the new code 15 instructions on a 14.3 cycle critical path. Nanobench agrees, measuring the new way as just a little slower, 2-3%, right in line with IACA's predicted 2.8%.

(I also tried moving d's load<4>() around... couldn't beat where it is for speed, and I like seeing all the data roll over together.)
Looks good to me.

But what is the situation with the bots? I don't want to ship code that we aren't testing.
If we land as is, we'll test SSE2-only (Linux+Windows bots) and SSSE3 (Mac+Android bots), but have no testing for the SSE4.1 path.

If I put this behind runtime CPU checks, we'll gain testing for the SSE4.1 path, but in practice will lose coverage for the SSE2-only path. We don't have any bots that support SSE2 but not SSSE3.

Either way, Android will use the SSE4.1 path on Silvermont and newer chips, and the SSSE3 path on older x86 chips. This makes me lean towards runtime CPU checks---i.e. move these to SkOpts---so that testing matches our (first? only?) use case.
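A rough sketch of what "runtime CPU checks" behind a function pointer could look like is below. Everything in it is hypothetical: `CpuFeatures`, `PaethFn`, `choose_paeth` and the three placeholder routines are illustrative names, not SkOpts APIs, and a real build would fill the flags from CPUID and compile each routine with its own instruction set enabled.

```cpp
#include <cstdint>

// Hypothetical feature flags; a real build would populate these from CPUID.
struct CpuFeatures {
    bool ssse3;
    bool sse41;
};

// Hypothetical per-instruction-set entry points. Each returns a label so the
// sketch stays portable; real versions would unfilter a row of pixels.
using PaethFn = const char* (*)();
static const char* paeth_sse2()  { return "sse2"; }
static const char* paeth_ssse3() { return "ssse3"; }
static const char* paeth_sse41() { return "sse4.1"; }

// Pick the most capable routine the CPU supports once, then call through
// the returned pointer for every row.
static PaethFn choose_paeth(const CpuFeatures& cpu) {
    if (cpu.sse41) return paeth_sse41;
    if (cpu.ssse3) return paeth_ssse3;
    return paeth_sse2;  // baseline on x86-64
}
```

The coverage trade-off described above falls out of `choose_paeth`: on a bot whose CPU reports SSE4.1, the pointer always resolves to the SSE4.1 routine, so the SSE2-only fallback would only run on hardware that lacks SSSE3.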
On 2016/02/01 16:22:42, mtklein wrote:
> If we land as is, we'll test SSE2-only (Linux+Windows bots) and SSSE3 (Mac+Android bots), but have no testing for the SSE4.1 path.
>
> If I put this behind runtime CPU checks, we'll gain testing for the SSE4.1 path, but in practice will lose coverage for the SSE2-only path. We don't have any bots that support SSE2 but not SSSE3.
>
> Either way, Android will use the SSE4.1 path on Silvermont and newer chips, and the SSSE3 path on older x86 chips. This makes me lean towards runtime CPU checks---i.e. move these to SkOpts---so that testing matches our (first? only?) use case.

Agreed. Runtime checks sgtm. Though I would have preferred to keep this code in src/codec, I think this is the best way to handle it.
On 2016/02/01 16:27:12, msarett wrote:
> On 2016/02/01 16:22:42, mtklein wrote:
> > If we land as is, we'll test SSE2-only (Linux+Windows bots) and SSSE3 (Mac+Android bots), but have no testing for the SSE4.1 path.
> >
> > If I put this behind runtime CPU checks, we'll gain testing for the SSE4.1 path, but in practice will lose coverage for the SSE2-only path. We don't have any bots that support SSE2 but not SSSE3.
> >
> > Either way, Android will use the SSE4.1 path on Silvermont and newer chips, and the SSSE3 path on older x86 chips. This makes me lean towards runtime CPU checks---i.e. move these to SkOpts---so that testing matches our (first? only?) use case.
>
> Agreed. Runtime checks sgtm. Though I would have preferred to keep this code in src/codec, I think this is the best way to handle it.

An alternative is to break this into two steps: land the 16-bit conversion and SSSE3 optimization now (well tested, can stay in src/codec), then consider the SSE4.1 change separately. IIRC the bigger win was SSE2->SSSE3.
That works for me as well :).
On 2016/02/01 16:48:56, msarett wrote:
> That works for me as well :).

Done.
lgtm
The CQ bit was checked by mtklein@google.com
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1657503002/80001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1657503002/80001
Message was sent while issue was closed.
Committed patchset #5 (id:80001) as https://skia.googlesource.com/skia/+/b21c752eb3d55970ac45daaf3fd2cbda39c7658a