|
|
Created:
4 years, 9 months ago by Primiano Tucci (use gerrit) Modified:
4 years, 9 months ago Reviewers:
Nico CC:
chromium-reviews, wfh+watch_chromium.org, Dai Mikurube (NOT FULLTIME) Base URL:
https://chromium.googlesource.com/chromium/src.git@master Target Ref:
refs/pending/heads/master Project:
chromium Visibility:
Public. |
DescriptionImprove performance of the malloc shim layer
This fixes the perf regression introduced by crrev.com/1781573002
As speculated, crbug.com/593344 (NoBarrier_Load being double-fenced)
causes a visible perf regression, as it causes the addition of two
fences in the malloc fast path.
This CL adds a workaraound that falls-back on a raw volatile ptr read
on Linux+Clang, relying on the fact that on the architectures we care
about a load of an aligned pointer is intrinsically atomic [1,2].
Perf regression:
https://chromeperf.appspot.com/report?sid=1237900c90f9c5e5320a87af8d9e8828fbd1794af07f0a4a335f1fc2a45f120a&start_rev=379480&end_rev=380417
I verified manually that the regression goes away on cc_perftests with
this patch.
[1] Chapter 7 of Part 3A - System Programming Guide
http://download.intel.com/design/processor/manuals/253668.pdf
[2] A3.5.3 Atomicity in the ARM architecture, ARM Architecture Reference Manual, ARMv7-A and ARMv7-R edition
http://liris.cnrs.fr/~mmrissa/lib/exe/fetch.php?media=armv7-a-r-manual.pdf
BUG=550886, 593872
TEST=cc_perftestsfast --gtest_filter=*PrepareTiles*
Committed: https://crrev.com/d76421171daa1327f8e1c10ee710ebf911a90d2f
Cr-Commit-Position: refs/heads/master@{#380600}
Patch Set 1 #Patch Set 2 : . #
Total comments: 5
Messages
Total messages: 22 (9 generated)
primiano@chromium.org changed reviewers: + thakis@chromium.org
As partially expected, bummer! a barriered NoBarrier in malloc is pretty bad and our perf waterfall realized that See the linked perf dashboard URL.
Description was changed from ========== Improve performances of the malloc shim layer As speculated, crbug.com/593344 (NoBarrier_Load being double-fenced) causes a visible perf regression, as it causes the addition of two fences in the malloc fast path. This CL adds a workaraound that falls-back on a raw volatile ptr read on Linux+Clang, relying on the fact that on the architectures we care about a load of an aligned pointer is intrinsically atomic [1,2]. Perf regression: https://chromeperf.appspot.com/report?sid=1237900c90f9c5e5320a87af8d9e8828fbd... I verified manually that the regression goes away on cc_perftests with this patch. [1] Chapter 7 of Part 3A - System Programming Guide http://download.intel.com/design/processor/manuals/253668.pdf [2] A3.5.3 Atomicity in the ARM architecture, ARM Architecture Reference Manual, ARMv7-A and ARMv7-R edition http://liris.cnrs.fr/~mmrissa/lib/exe/fetch.php?media=armv7-a-r-manual.pdf BUG=550886 TEST=cc_perftestsfast --gtest_filter=*PrepareTiles* ========== to ========== Improve performances of the malloc shim layer This fixes the perf regression introduced by crrev.com/1781573002 As speculated, crbug.com/593344 (NoBarrier_Load being double-fenced) causes a visible perf regression, as it causes the addition of two fences in the malloc fast path. This CL adds a workaraound that falls-back on a raw volatile ptr read on Linux+Clang, relying on the fact that on the architectures we care about a load of an aligned pointer is intrinsically atomic [1,2]. Perf regression: https://chromeperf.appspot.com/report?sid=1237900c90f9c5e5320a87af8d9e8828fbd... I verified manually that the regression goes away on cc_perftests with this patch. [1] Chapter 7 of Part 3A - System Programming Guide http://download.intel.com/design/processor/manuals/253668.pdf [2] A3.5.3 Atomicity in the ARM architecture, ARM Architecture Reference Manual, ARMv7-A and ARMv7-R edition http://liris.cnrs.fr/~mmrissa/lib/exe/fetch.php?media=armv7-a-r-manual.pdf BUG=550886 TEST=cc_perftestsfast --gtest_filter=*PrepareTiles* ==========
Description was changed from ========== Improve performances of the malloc shim layer This fixes the perf regression introduced by crrev.com/1781573002 As speculated, crbug.com/593344 (NoBarrier_Load being double-fenced) causes a visible perf regression, as it causes the addition of two fences in the malloc fast path. This CL adds a workaraound that falls-back on a raw volatile ptr read on Linux+Clang, relying on the fact that on the architectures we care about a load of an aligned pointer is intrinsically atomic [1,2]. Perf regression: https://chromeperf.appspot.com/report?sid=1237900c90f9c5e5320a87af8d9e8828fbd... I verified manually that the regression goes away on cc_perftests with this patch. [1] Chapter 7 of Part 3A - System Programming Guide http://download.intel.com/design/processor/manuals/253668.pdf [2] A3.5.3 Atomicity in the ARM architecture, ARM Architecture Reference Manual, ARMv7-A and ARMv7-R edition http://liris.cnrs.fr/~mmrissa/lib/exe/fetch.php?media=armv7-a-r-manual.pdf BUG=550886 TEST=cc_perftestsfast --gtest_filter=*PrepareTiles* ========== to ========== Improve performance of the malloc shim layer This fixes the perf regression introduced by crrev.com/1781573002 As speculated, crbug.com/593344 (NoBarrier_Load being double-fenced) causes a visible perf regression, as it causes the addition of two fences in the malloc fast path. This CL adds a workaraound that falls-back on a raw volatile ptr read on Linux+Clang, relying on the fact that on the architectures we care about a load of an aligned pointer is intrinsically atomic [1,2]. Perf regression: https://chromeperf.appspot.com/report?sid=1237900c90f9c5e5320a87af8d9e8828fbd... I verified manually that the regression goes away on cc_perftests with this patch. [1] Chapter 7 of Part 3A - System Programming Guide http://download.intel.com/design/processor/manuals/253668.pdf [2] A3.5.3 Atomicity in the ARM architecture, ARM Architecture Reference Manual, ARMv7-A and ARMv7-R edition http://liris.cnrs.fr/~mmrissa/lib/exe/fetch.php?media=armv7-a-r-manual.pdf BUG=550886 TEST=cc_perftestsfast --gtest_filter=*PrepareTiles* ==========
The CQ bit was checked by primiano@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1777363002/20001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1777363002/20001
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: Try jobs failed on following builders: linux_chromium_chromeos_rel_ng on tryserver.chromium.linux (JOB_FAILED, http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_...)
And these are the results from the perf trybot for the pagecycler, showing the 3-9% improvement here, which seems to match the perf trybot of the revert perf trybot of this CL: http://storage.googleapis.com/chromium-telemetry/html-results/results-2016-03... perf trybot for the revert: http://storage.googleapis.com/chromium-telemetry/html-results/results-2016-03...
lgtm, sounds like you checked that the assembly this produces makes sense. https://codereview.chromium.org/1777363002/diff/20001/base/allocator/allocato... File base/allocator/allocator_shim.cc (right): https://codereview.chromium.org/1777363002/diff/20001/base/allocator/allocato... base/allocator/allocator_shim.cc:75: // TODO(primiano): Just use NoBarrier_Load once crbug.com/593344 is fixed. (filed crbug.com/593874 today and made 3344 blocked on it) https://codereview.chromium.org/1777363002/diff/20001/base/allocator/allocato... base/allocator/allocator_shim.cc:79: #if defined(OS_LINUX) && defined(__clang__) Do we want to check for x86 too?
Description was changed from ========== Improve performance of the malloc shim layer This fixes the perf regression introduced by crrev.com/1781573002 As speculated, crbug.com/593344 (NoBarrier_Load being double-fenced) causes a visible perf regression, as it causes the addition of two fences in the malloc fast path. This CL adds a workaraound that falls-back on a raw volatile ptr read on Linux+Clang, relying on the fact that on the architectures we care about a load of an aligned pointer is intrinsically atomic [1,2]. Perf regression: https://chromeperf.appspot.com/report?sid=1237900c90f9c5e5320a87af8d9e8828fbd... I verified manually that the regression goes away on cc_perftests with this patch. [1] Chapter 7 of Part 3A - System Programming Guide http://download.intel.com/design/processor/manuals/253668.pdf [2] A3.5.3 Atomicity in the ARM architecture, ARM Architecture Reference Manual, ARMv7-A and ARMv7-R edition http://liris.cnrs.fr/~mmrissa/lib/exe/fetch.php?media=armv7-a-r-manual.pdf BUG=550886 TEST=cc_perftestsfast --gtest_filter=*PrepareTiles* ========== to ========== Improve performance of the malloc shim layer This fixes the perf regression introduced by crrev.com/1781573002 As speculated, crbug.com/593344 (NoBarrier_Load being double-fenced) causes a visible perf regression, as it causes the addition of two fences in the malloc fast path. This CL adds a workaraound that falls-back on a raw volatile ptr read on Linux+Clang, relying on the fact that on the architectures we care about a load of an aligned pointer is intrinsically atomic [1,2]. Perf regression: https://chromeperf.appspot.com/report?sid=1237900c90f9c5e5320a87af8d9e8828fbd... I verified manually that the regression goes away on cc_perftests with this patch. [1] Chapter 7 of Part 3A - System Programming Guide http://download.intel.com/design/processor/manuals/253668.pdf [2] A3.5.3 Atomicity in the ARM architecture, ARM Architecture Reference Manual, ARMv7-A and ARMv7-R edition http://liris.cnrs.fr/~mmrissa/lib/exe/fetch.php?media=armv7-a-r-manual.pdf BUG=550886,593872 TEST=cc_perftestsfast --gtest_filter=*PrepareTiles* ==========
https://codereview.chromium.org/1777363002/diff/20001/base/allocator/allocato... File base/allocator/allocator_shim.cc (right): https://codereview.chromium.org/1777363002/diff/20001/base/allocator/allocato... base/allocator/allocator_shim.cc:75: // TODO(primiano): Just use NoBarrier_Load once crbug.com/593344 is fixed. On 2016/03/10 22:50:00, Nico wrote: > (filed crbug.com/593874 today and made 3344 blocked on it) Seen that. Thanks. https://codereview.chromium.org/1777363002/diff/20001/base/allocator/allocato... base/allocator/allocator_shim.cc:79: #if defined(OS_LINUX) && defined(__clang__) On 2016/03/10 22:50:00, Nico wrote: > Do we want to check for x86 too? Especially for x86, which is where I saw the bug :)
The CQ bit was checked by primiano@chromium.org
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1777363002/20001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1777363002/20001
On 2016/03/10 22:50:00, Nico wrote: > lgtm, sounds like you checked that the assembly this produces makes sense. Yup, the difference I see with this patch on ShimMalloc is literally just the two mfences around the mov which loads g_chain_head http://pastebin.com/Xm3Fqrh3 (well and a padding nopw ... variable length opcodes, yay)
Message was sent while issue was closed.
Description was changed from ========== Improve performance of the malloc shim layer This fixes the perf regression introduced by crrev.com/1781573002 As speculated, crbug.com/593344 (NoBarrier_Load being double-fenced) causes a visible perf regression, as it causes the addition of two fences in the malloc fast path. This CL adds a workaraound that falls-back on a raw volatile ptr read on Linux+Clang, relying on the fact that on the architectures we care about a load of an aligned pointer is intrinsically atomic [1,2]. Perf regression: https://chromeperf.appspot.com/report?sid=1237900c90f9c5e5320a87af8d9e8828fbd... I verified manually that the regression goes away on cc_perftests with this patch. [1] Chapter 7 of Part 3A - System Programming Guide http://download.intel.com/design/processor/manuals/253668.pdf [2] A3.5.3 Atomicity in the ARM architecture, ARM Architecture Reference Manual, ARMv7-A and ARMv7-R edition http://liris.cnrs.fr/~mmrissa/lib/exe/fetch.php?media=armv7-a-r-manual.pdf BUG=550886,593872 TEST=cc_perftestsfast --gtest_filter=*PrepareTiles* ========== to ========== Improve performance of the malloc shim layer This fixes the perf regression introduced by crrev.com/1781573002 As speculated, crbug.com/593344 (NoBarrier_Load being double-fenced) causes a visible perf regression, as it causes the addition of two fences in the malloc fast path. This CL adds a workaraound that falls-back on a raw volatile ptr read on Linux+Clang, relying on the fact that on the architectures we care about a load of an aligned pointer is intrinsically atomic [1,2]. Perf regression: https://chromeperf.appspot.com/report?sid=1237900c90f9c5e5320a87af8d9e8828fbd... I verified manually that the regression goes away on cc_perftests with this patch. [1] Chapter 7 of Part 3A - System Programming Guide http://download.intel.com/design/processor/manuals/253668.pdf [2] A3.5.3 Atomicity in the ARM architecture, ARM Architecture Reference Manual, ARMv7-A and ARMv7-R edition http://liris.cnrs.fr/~mmrissa/lib/exe/fetch.php?media=armv7-a-r-manual.pdf BUG=550886,593872 TEST=cc_perftestsfast --gtest_filter=*PrepareTiles* ==========
Message was sent while issue was closed.
Committed patchset #2 (id:20001)
Message was sent while issue was closed.
Description was changed from ========== Improve performance of the malloc shim layer This fixes the perf regression introduced by crrev.com/1781573002 As speculated, crbug.com/593344 (NoBarrier_Load being double-fenced) causes a visible perf regression, as it causes the addition of two fences in the malloc fast path. This CL adds a workaraound that falls-back on a raw volatile ptr read on Linux+Clang, relying on the fact that on the architectures we care about a load of an aligned pointer is intrinsically atomic [1,2]. Perf regression: https://chromeperf.appspot.com/report?sid=1237900c90f9c5e5320a87af8d9e8828fbd... I verified manually that the regression goes away on cc_perftests with this patch. [1] Chapter 7 of Part 3A - System Programming Guide http://download.intel.com/design/processor/manuals/253668.pdf [2] A3.5.3 Atomicity in the ARM architecture, ARM Architecture Reference Manual, ARMv7-A and ARMv7-R edition http://liris.cnrs.fr/~mmrissa/lib/exe/fetch.php?media=armv7-a-r-manual.pdf BUG=550886,593872 TEST=cc_perftestsfast --gtest_filter=*PrepareTiles* ========== to ========== Improve performance of the malloc shim layer This fixes the perf regression introduced by crrev.com/1781573002 As speculated, crbug.com/593344 (NoBarrier_Load being double-fenced) causes a visible perf regression, as it causes the addition of two fences in the malloc fast path. This CL adds a workaraound that falls-back on a raw volatile ptr read on Linux+Clang, relying on the fact that on the architectures we care about a load of an aligned pointer is intrinsically atomic [1,2]. Perf regression: https://chromeperf.appspot.com/report?sid=1237900c90f9c5e5320a87af8d9e8828fbd... I verified manually that the regression goes away on cc_perftests with this patch. [1] Chapter 7 of Part 3A - System Programming Guide http://download.intel.com/design/processor/manuals/253668.pdf [2] A3.5.3 Atomicity in the ARM architecture, ARM Architecture Reference Manual, ARMv7-A and ARMv7-R edition http://liris.cnrs.fr/~mmrissa/lib/exe/fetch.php?media=armv7-a-r-manual.pdf BUG=550886,593872 TEST=cc_perftestsfast --gtest_filter=*PrepareTiles* Committed: https://crrev.com/d76421171daa1327f8e1c10ee710ebf911a90d2f Cr-Commit-Position: refs/heads/master@{#380600} ==========
Message was sent while issue was closed.
Patchset 2 (id:??) landed as https://crrev.com/d76421171daa1327f8e1c10ee710ebf911a90d2f Cr-Commit-Position: refs/heads/master@{#380600}
Message was sent while issue was closed.
https://codereview.chromium.org/1777363002/diff/20001/base/allocator/allocato... File base/allocator/allocator_shim.cc (right): https://codereview.chromium.org/1777363002/diff/20001/base/allocator/allocato... base/allocator/allocator_shim.cc:79: #if defined(OS_LINUX) && defined(__clang__) On 2016/03/11 00:11:08, Primiano wrote: > On 2016/03/10 22:50:00, Nico wrote: > > Do we want to check for x86 too? > > Especially for x86, which is where I saw the bug :) I mean on ARM fences are pretty different, so I'm not sure we want the volatile cast there. (And no idea about mips.)
Message was sent while issue was closed.
On 2016/03/11 15:42:04, Nico wrote: > I mean on ARM fences are pretty different, so I'm not sure we want the volatile > cast there. Uhm I should re-check ARM, but I thought that volatile was not supposed to give you extra fences, just avoid extra compiler optimizations? IIRC there is no specific memory-order semantic associated with that, but it might be one of those gray areas of compiler lands? From a more practical viewpoint, the reason why I kept the volatile, honestly, is because base::subtle::NoBarrier_Load takes a volatile Atomic* , so I thought was safer to match that behavior. Anyways, the story I learned today is that if there is some residual perf regression the dashboard will tell. I already have ~20 graphs screaming at me (https://chromeperf.appspot.com/group_report?bug_id=593872) Fortunately they seem to be all recovering after this fix. > (And no idea about mips.) Ha I thought that we support mips only on OS_ANDROID, but looking at common.gypi it seems there is the case. I wonder if CrOS uses clang though. (I tried to search and didn't find any non-android mips bot around, unless they are hidden behind some mysterious codename ¯\_(ツ)_/¯)
Message was sent while issue was closed.
CrOS doesn't use clang yet, but they're looking at it. (Sounds like they won't use Chromium's clang at the start though.) |