|
|
Created:
3 years, 10 months ago by adenilson.cavalcanti Modified:
3 years, 4 months ago CC:
chromium-reviews, esprehn Target Ref:
refs/pending/heads/master Project:
chromium Visibility:
Public. |
DescriptionNEON implementation for Adler32
The checksum is calculated in the uncompressed PNG data and can be made
much faster by using SIMD.
Tests in ARMv8 yielded an improvement of about 3x (e.g. walltime
was 350ms x 125ms for a 4096x4096 bytes executed 30 times).
This alone yields a performance boost for PNG decoding ranging
from 5% to 18% depending on a few factors (SoC, battery status, etc).
BUG=688601
Patch Set 1 #
Total comments: 2
Patch Set 2 : BUILD.gn fix, style. #Patch Set 3 : Rebased with master, added NEON code in a patch, updated README #Patch Set 4 : Typo. #Patch Set 5 : Rebased with master #
Total comments: 2
Patch Set 6 : Coding style. #
Messages
Total messages: 45 (32 generated)
The CQ bit was checked by cavalcantii@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
cavalcantii@chromium.org changed reviewers: + fmalita@chromium.org, msarett@chromium.org
https://codereview.chromium.org/2676493007/diff/1/third_party/zlib/BUILD.gn File third_party/zlib/BUILD.gn (right): https://codereview.chromium.org/2676493007/diff/1/third_party/zlib/BUILD.gn#n... third_party/zlib/BUILD.gn:64: "neon_adler32.h", Forgot this one, got remove it. https://codereview.chromium.org/2676493007/diff/1/third_party/zlib/neon_adler... File third_party/zlib/neon_adler32.c (right): https://codereview.chromium.org/2676493007/diff/1/third_party/zlib/neon_adler... third_party/zlib/neon_adler32.c:99: if (i + n > len) Add parenthesis i.e. if ((i + n) > len)
The CQ bit was checked by cavalcantii@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: win_chromium_rel_ng on master.tryserver.chromium.win (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_rel_...)
The CQ bit was checked by cavalcantii@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was checked by cavalcantii@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: android_compile_dbg on master.tryserver.chromium.android (JOB_FAILED, https://build.chromium.org/p/tryserver.chromium.android/builders/android_comp...) linux_android_rel_ng on master.tryserver.chromium.android (JOB_FAILED, https://build.chromium.org/p/tryserver.chromium.android/builders/linux_androi...) mac_chromium_compile_dbg_ng on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_comp...) mac_chromium_rel_ng on master.tryserver.chromium.mac (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_...) win_chromium_compile_dbg_ng on master.tryserver.chromium.win (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_comp...)
The CQ bit was checked by cavalcantii@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
cavalcantii@chromium.org changed reviewers: + esprehn@chromium.org, schenney@google.com
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: chromeos_daisy_chromium_compile_only_ng on master.tryserver.chromium.linux (JOB_TIMED_OUT, no build URL) linux_chromium_asan_rel_ng on master.tryserver.chromium.linux (JOB_TIMED_OUT, no build URL) linux_chromium_chromeos_rel_ng on master.tryserver.chromium.linux (JOB_TIMED_OUT, no build URL) linux_chromium_rel_ng on master.tryserver.chromium.linux (JOB_TIMED_OUT, no build URL) linux_chromium_tsan_rel_ng on master.tryserver.chromium.linux (JOB_TIMED_OUT, no build URL)
Nice speedup! Deferring actual Neon review to msarett. (the .patch + checked-in code duplication seems odd, but is in line with other local zlib modifications) https://codereview.chromium.org/2676493007/diff/80001/third_party/zlib/adler32.c File third_party/zlib/adler32.c (right): https://codereview.chromium.org/2676493007/diff/80001/third_party/zlib/adler3... third_party/zlib/adler32.c:72: if (len > 31) { nit: local code style doesn't seem to require braces for single-line conditional blocks. https://codereview.chromium.org/2676493007/diff/80001/third_party/zlib/adler3... third_party/zlib/adler32.c:73: return NEON_adler32(adler, buf, len); general nit: local code style is 4x-space indentation.
Have you suggested this patch to upstream (madler/zlib)? Do you have a macrobenchmark for anything that uses Adler32 (ex: png decoding)?
The CQ bit was checked by cavalcantii@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: win_chromium_rel_ng on master.tryserver.chromium.win (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_rel_...)
@florin Thanks for the review, I just uploaded an updated patch with the coding style fixes. @matt I opened an issue suggesting ARM optimizations 15 days ago on madler's project in github (https://github.com/madler/zlib/issues/216). As I had no reply, next I wrote an email to Mark Adler on Feb 7th (14 days ago). Then I added a comment on the issue, pointing to this patch 2 weeks ago. So far no reply. Finally, about 8 days ago, I sent him the link to the zlib report I wrote ('A compressed story of zlibs'). No reply. Concerning benchmarking, I had used a test to measure the execution time (https://gist.github.com/Adenilson/f1e175d364bdad2753fde34361e5e650) comparing the zlib implementation X NEON implementation. It was part of the test suite I showed running on the day we had the discussion at BlinkOn. Just out of curiosity, today I executed the test biding the process to the specific cores and got this (vanilla x NEON): CPU Vanilla NEON little: 530ms 279ms BIG: 348ms 113ms So it seems to be a 3x improvement while running in a big core (A72) and almost 2x while running in a little core (A53). While running on Chrome, the test case I created (http://codepen.io/Savago/pen/VPeQaX) showed an improvement around 15-20% running in a Nexus 4. The traces are available on: https://bugs.chromium.org/p/chromium/issues/detail?id=688601 More precisely it was 18% improvement on CPU selftime and 30% on wall durations. YMMV depending on many factors.
On 2017/02/21 21:20:51, cavalcantii1 wrote: > @florin > > Thanks for the review, I just uploaded an updated patch with the coding style > fixes. > > @matt > > I opened an issue suggesting ARM optimizations 15 days ago on madler's project > in github (https://github.com/madler/zlib/issues/216). > > As I had no reply, next I wrote an email to Mark Adler on Feb 7th (14 days ago). > Then I added a comment on the issue, pointing to this patch 2 weeks ago. > > So far no reply. > > Finally, about 8 days ago, I sent him the link to the zlib report I wrote ('A > compressed story of zlibs'). No reply. > That makes me think that madler/zlib is not such a healthy project for us to depend on. And I don't think it makes sense to increase our maintainability burden by forking this library further than we already have. Have any of the other projects been more receptive to your patches? I'm particularly interested in jtkukunas/zlib - since we have already borrowed some SIMD opts from there, it seems like switching to jtkukunas/zlib might be a natural way to begin this project. > Concerning benchmarking, I had used a test to measure the execution time > (https://gist.github.com/Adenilson/f1e175d364bdad2753fde34361e5e650) comparing > the zlib implementation X NEON implementation. It was part of the test suite I > showed running on the day we had the discussion at BlinkOn. > > Just out of curiosity, today I executed the test biding the process to the > specific cores and got this (vanilla x NEON): > > CPU Vanilla NEON > little: 530ms 279ms > BIG: 348ms 113ms > > So it seems to be a 3x improvement while running in a big core (A72) and almost > 2x while running in a little core (A53). > > While running on Chrome, the test case I created > (http://codepen.io/Savago/pen/VPeQaX) showed an improvement around 15-20% > running in a Nexus 4. > > The traces are available on: > https://bugs.chromium.org/p/chromium/issues/detail?id=688601 > > More precisely it was 18% improvement on CPU selftime and 30% on wall durations. > YMMV depending on many factors.
> That makes me think that madler/zlib is not such a healthy project for us to > depend on. And I don't think it makes sense to increase our maintainability > burden by forking this library further than we already have. > madler/zlib has featured new releases on: a) 31 Dec 2016: version 1.2.9 b) 2 Jan 2017: version 1.2.10 c) 15 Jan 2017: version 1.2.11 Further details: https://github.com/madler/zlib/blob/master/ChangeLog It may be the simple case that they are not interested in platform specific optimizations. > Have any of the other projects been more receptive to your patches? I'm > particularly interested in jtkukunas/zlib - since we have already borrowed some > SIMD opts from there, it seems like switching to jtkukunas/zlib might be a > natural way to begin this project. > As I've reported on the report, I contacted 4 different zlib projects and jtkukunas never replied (opened the issue 15 days ago): https://github.com/jtkukunas/zlib/issues/17 That being said, they are lagging behind in 3 releases from madler and the last commit on github dates way back to Jun 20th 2016 (8 months ago). That is potentially troubling because madler's recent releases featured important bug fixes. The project that seemed more receptive to platform specific patches was zlib-ng. My primary concern about zlib-ng is that they are still to release a stable version.
On 2017/02/22 01:16:57, cavalcantii1 wrote: > > That makes me think that madler/zlib is not such a healthy project for us to > > depend on. And I don't think it makes sense to increase our maintainability > > burden by forking this library further than we already have. > > > > madler/zlib has featured new releases on: > a) 31 Dec 2016: version 1.2.9 > b) 2 Jan 2017: version 1.2.10 > c) 15 Jan 2017: version 1.2.11 > > Further details: > https://github.com/madler/zlib/blob/master/ChangeLog > > It may be the simple case that they are not interested in platform specific > optimizations. > > > Have any of the other projects been more receptive to your patches? I'm > > particularly interested in jtkukunas/zlib - since we have already borrowed > some > > SIMD opts from there, it seems like switching to jtkukunas/zlib might be a > > natural way to begin this project. > > > > As I've reported on the report, I contacted 4 different zlib projects and > jtkukunas never replied (opened the issue 15 days ago): > https://github.com/jtkukunas/zlib/issues/17 > > That being said, they are lagging behind in 3 releases from madler and the last > commit on github dates way back to Jun 20th 2016 (8 months ago). > > That is potentially troubling because madler's recent releases featured > important bug fixes. > > The project that seemed more receptive to platform specific patches was zlib-ng. > My primary concern about zlib-ng is that they are still to release a stable > version. Just read through the issue trackers on various zlib github projects. zlib-ng seems like a clear winner to me. I think it might make sense to consider: "what would it take to switch chromium to zlib-ng?" as a first step - then optimize from there.
On 2017/02/22 18:19:51, msarett1 wrote: > On 2017/02/22 01:16:57, cavalcantii1 wrote: > > The project that seemed more receptive to platform specific patches was > zlib-ng. > > My primary concern about zlib-ng is that they are still to release a stable > > version. > > Just read through the issue trackers on various zlib github projects. > > zlib-ng seems like a clear winner to me. I think it might make sense to > consider: "what would it take to switch chromium to zlib-ng?" as a first step - > then optimize from there. When looking at the git logs of https://github.com/madler/zlib the only patches accepted for the past ~10 years are bug fixes, performance is almost not mentioned in any of these patches. I know for sure that Intel's engineers have tried from day one to push jtkukunas/zlib to madler/zlib and that has not worked out, as madler/zlib is mostly focused on the stability. I understand the position madler is taking when considering the security implications that a patch introducing new bugs to zlib may have. On zlib-ng: I agree that it seems to be the best run open-source zlib: it regularly merges against madler/zlib, integrates the performance related patches from all the other zlibs, and seems to have the best performance. Chromium can have its own security and validation process when updating the zlib, a process similar to updating the clang compilers. I would recommend moving this patch to zlib-ng who expressed interest in integrating performance related changes.
The CQ bit was checked by cavalcantii@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
> I would recommend moving this patch to zlib-ng who expressed interest > in integrating performance related changes. I discussed the subject with Kristian Rosbach (zlib-ng main maintainer) and he agreed to land this optimization separated from the ARM branch that is in development.
On 2017/03/28 22:20:45, cavalcantii1 wrote: > > I would recommend moving this patch to zlib-ng who expressed interest > > in integrating performance related changes. > > I discussed the subject with Kristian Rosbach (zlib-ng main maintainer) and he > agreed to land this optimization separated from the ARM branch that is in > development. That's great :).
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: win_chromium_x64_rel_ng on master.tryserver.chromium.win (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.win/builders/win_chromium_x64_...)
esprehn@chromium.org changed reviewers: + dglazkov@chromium.org - esprehn@chromium.org
This was upstreamed to zlib-ng: https://github.com/Dead2/zlib-ng/commit/ec02ecf104e1d3f1836a908a359f20aa93494df5
Description was changed from ========== NEON implementation for Adler32 The checksum is calculated in the uncompressed PNG data and can be made much faster by using SIMD. Tests in ARMv8 yielded an improvement of about 3x (e.g. walltime was 350ms x 125ms for a 4096x4096 bytes executed 30 times). BUG=688601 ========== to ========== NEON implementation for Adler32 The checksum is calculated in the uncompressed PNG data and can be made much faster by using SIMD. Tests in ARMv8 yielded an improvement of about 3x (e.g. walltime was 350ms x 125ms for a 4096x4096 bytes executed 30 times). BUG=688601 ==========
cavalcantii@chromium.org changed reviewers: - msarett@chromium.org
Description was changed from ========== NEON implementation for Adler32 The checksum is calculated in the uncompressed PNG data and can be made much faster by using SIMD. Tests in ARMv8 yielded an improvement of about 3x (e.g. walltime was 350ms x 125ms for a 4096x4096 bytes executed 30 times). BUG=688601 ========== to ========== NEON implementation for Adler32 The checksum is calculated in the uncompressed PNG data and can be made much faster by using SIMD. Tests in ARMv8 yielded an improvement of about 3x (e.g. walltime was 350ms x 125ms for a 4096x4096 bytes executed 30 times). This alone yields a performance boost for PNG decoding ranging from 5% to 18% depending on a few factors (SoC, battery status, etc). BUG=688601 ========== |