|
|
Created:
4 years, 11 months ago by Mircea Trofin Modified:
4 years, 11 months ago CC:
v8-reviews_googlegroups.com Base URL:
https://chromium.googlesource.com/v8/v8.git@master Target Ref:
refs/pending/heads/master Project:
v8 Visibility:
Public. |
Description[turbofan] avoid xchg instruction on Intel
On Intel, xchg stalls the pipeline. We use xchg to implement swap
moves. In a separate exploration, the presence of xchg in a very hot
loop, due to a change in register allocation, lead to over 20% regression.
Simply changing that instruction with push/mov/pop (almost) eliminated
the regression.
In light of that, I removed uses of xchg. This leads to more instructions,
though. That is particularly problematic for long cycles, which, today,
we translate to successions of swaps.
I plan to address this cycle issue in a separate change. For now, the
goal is to unblock the initial work that lead here.
Committed: https://crrev.com/7440edae1da2b78a37e6fe3a558e249dc33444aa
Cr-Commit-Position: refs/heads/master@{#33282}
Patch Set 1 #Patch Set 2 : #Patch Set 3 : #Patch Set 4 : #Patch Set 5 : #Patch Set 6 : #
Total comments: 1
Messages
Total messages: 19 (7 generated)
Description was changed from ========== no xchg BUG= ========== to ========== On Intel, xchg stalls the pipeline. We use xchg to implement swap moves. In a separate exploration, the presence of xchg in a very hot loop, due to a change in register allocation, lead to over 20% regression. Simply changing that instruction with push/mov/pop (almost) eliminated the regression. In light of that, I removed uses of xchg. This leads to more instructions, though. That is particularly problematic for long cycles, which, today, we translate to successions of swaps. I plan to address this cycle issue in a separate change. For now, the goal is to unblock the initial work that lead here. BUG= ==========
mtrofin@chromium.org changed reviewers: + bmeurer@chromium.org, jarin@chromium.org
Description was changed from ========== On Intel, xchg stalls the pipeline. We use xchg to implement swap moves. In a separate exploration, the presence of xchg in a very hot loop, due to a change in register allocation, lead to over 20% regression. Simply changing that instruction with push/mov/pop (almost) eliminated the regression. In light of that, I removed uses of xchg. This leads to more instructions, though. That is particularly problematic for long cycles, which, today, we translate to successions of swaps. I plan to address this cycle issue in a separate change. For now, the goal is to unblock the initial work that lead here. BUG= ========== to ========== [turbofan] avoid xchg instruction on Intel On Intel, xchg stalls the pipeline. We use xchg to implement swap moves. In a separate exploration, the presence of xchg in a very hot loop, due to a change in register allocation, lead to over 20% regression. Simply changing that instruction with push/mov/pop (almost) eliminated the regression. In light of that, I removed uses of xchg. This leads to more instructions, though. That is particularly problematic for long cycles, which, today, we translate to successions of swaps. I plan to address this cycle issue in a separate change. For now, the goal is to unblock the initial work that lead here. ==========
lgtm
The CQ bit was checked by mtrofin@chromium.org
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1580233003/100001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1580233003/100001
Message was sent while issue was closed.
Description was changed from ========== [turbofan] avoid xchg instruction on Intel On Intel, xchg stalls the pipeline. We use xchg to implement swap moves. In a separate exploration, the presence of xchg in a very hot loop, due to a change in register allocation, lead to over 20% regression. Simply changing that instruction with push/mov/pop (almost) eliminated the regression. In light of that, I removed uses of xchg. This leads to more instructions, though. That is particularly problematic for long cycles, which, today, we translate to successions of swaps. I plan to address this cycle issue in a separate change. For now, the goal is to unblock the initial work that lead here. ========== to ========== [turbofan] avoid xchg instruction on Intel On Intel, xchg stalls the pipeline. We use xchg to implement swap moves. In a separate exploration, the presence of xchg in a very hot loop, due to a change in register allocation, lead to over 20% regression. Simply changing that instruction with push/mov/pop (almost) eliminated the regression. In light of that, I removed uses of xchg. This leads to more instructions, though. That is particularly problematic for long cycles, which, today, we translate to successions of swaps. I plan to address this cycle issue in a separate change. For now, the goal is to unblock the initial work that lead here. ==========
Message was sent while issue was closed.
Committed patchset #6 (id:100001)
Message was sent while issue was closed.
Description was changed from ========== [turbofan] avoid xchg instruction on Intel On Intel, xchg stalls the pipeline. We use xchg to implement swap moves. In a separate exploration, the presence of xchg in a very hot loop, due to a change in register allocation, lead to over 20% regression. Simply changing that instruction with push/mov/pop (almost) eliminated the regression. In light of that, I removed uses of xchg. This leads to more instructions, though. That is particularly problematic for long cycles, which, today, we translate to successions of swaps. I plan to address this cycle issue in a separate change. For now, the goal is to unblock the initial work that lead here. ========== to ========== [turbofan] avoid xchg instruction on Intel On Intel, xchg stalls the pipeline. We use xchg to implement swap moves. In a separate exploration, the presence of xchg in a very hot loop, due to a change in register allocation, lead to over 20% regression. Simply changing that instruction with push/mov/pop (almost) eliminated the regression. In light of that, I removed uses of xchg. This leads to more instructions, though. That is particularly problematic for long cycles, which, today, we translate to successions of swaps. I plan to address this cycle issue in a separate change. For now, the goal is to unblock the initial work that lead here. Committed: https://crrev.com/7440edae1da2b78a37e6fe3a558e249dc33444aa Cr-Commit-Position: refs/heads/master@{#33282} ==========
Message was sent while issue was closed.
Patchset 6 (id:??) landed as https://crrev.com/7440edae1da2b78a37e6fe3a558e249dc33444aa Cr-Commit-Position: refs/heads/master@{#33282}
Message was sent while issue was closed.
On 2016/01/14 05:14:55, commit-bot: I haz the power wrote: > Patchset 6 (id:??) landed as > https://crrev.com/7440edae1da2b78a37e6fe3a558e249dc33444aa > Cr-Commit-Position: refs/heads/master@{#33282} Could you please share the benchmarks or your exploration? xchg only has performance penalty if one of the operand is memory. Thanks -Weiliang
Message was sent while issue was closed.
On 2016/01/15 03:18:40, Weiliang wrote: > On 2016/01/14 05:14:55, commit-bot: I haz the power wrote: > > Patchset 6 (id:??) landed as > > https://crrev.com/7440edae1da2b78a37e6fe3a558e249dc33444aa > > Cr-Commit-Position: refs/heads/master@{#33282} > > Could you please share the benchmarks or your exploration? xchg only has > performance penalty if one of the operand is memory. > > Thanks > -Weiliang The benchmark is Emscripten's copy.js I prepared this to simplify the study: https://codereview.chromium.org/1596593002 Build d8 for ia32. You can enable the disassembler and compare the output, you should only notice the difference due to xchg in the codegen. assuming linux, run: perf stat d8 [one of the flags in the hacky CL quoted above] copy.js Here's what I see: push/mov/pop 1132.803889 task-clock (msec) # 1.000 CPUs utilized 169 context-switches # 0.149 K/sec 19 cpu-migrations # 0.017 K/sec 3,298 page-faults # 0.003 M/sec 3,144,992,895 cycles # 2.776 GHz 1,014,747,038 stalled-cycles-frontend # 32.27% frontend cycles idle <not supported> stalled-cycles-backend 8,639,439,901 instructions # 2.75 insns per cycle # 0.12 stalled cycles per insn 901,088,586 branches # 795.450 M/sec 1,072,717 branch-misses # 0.12% of all branches 1.132646215 seconds time elapsed xchg 1391.608997 task-clock (msec) # 1.000 CPUs utilized 179 context-switches # 0.129 K/sec 23 cpu-migrations # 0.017 K/sec 3,295 page-faults # 0.002 M/sec 3,895,382,601 cycles # 2.799 GHz 1,763,437,652 stalled-cycles-frontend # 45.27% frontend cycles idle <not supported> stalled-cycles-backend 8,516,505,308 instructions # 2.19 insns per cycle # 0.21 stalled cycles per insn 901,420,178 branches # 647.754 M/sec 1,081,349 branch-misses # 0.12% of all branches 1.391312205 seconds time elapsed mov/mov/mov 1056.847689 task-clock (msec) # 1.000 CPUs utilized 163 context-switches # 0.154 K/sec 18 cpu-migrations # 0.017 K/sec 3,844 page-faults # 0.004 M/sec 2,894,961,063 cycles # 2.739 GHz 764,176,642 stalled-cycles-frontend # 26.40% frontend cycles idle <not supported> stalled-cycles-backend 8,640,787,395 instructions # 2.98 insns per cycle # 0.09 stalled cycles per insn 901,303,752 branches # 852.823 M/sec 1,087,127 branch-misses # 0.12% of all branches 1.057128515 seconds time elapsed xor/xor/xor 1056.847689 task-clock (msec) # 1.000 CPUs utilized 163 context-switches # 0.154 K/sec 18 cpu-migrations # 0.017 K/sec 3,844 page-faults # 0.004 M/sec 2,894,961,063 cycles # 2.739 GHz 764,176,642 stalled-cycles-frontend # 26.40% frontend cycles idle <not supported> stalled-cycles-backend 8,640,787,395 instructions # 2.98 insns per cycle # 0.09 stalled cycles per insn 901,303,752 branches # 852.823 M/sec 1,087,127 branch-misses # 0.12% of all branches 1.057128515 seconds time elapsed
Message was sent while issue was closed.
On 2016/01/15 18:09:38, Mircea Trofin wrote: > On 2016/01/15 03:18:40, Weiliang wrote: > > On 2016/01/14 05:14:55, commit-bot: I haz the power wrote: > > > Patchset 6 (id:??) landed as > > > https://crrev.com/7440edae1da2b78a37e6fe3a558e249dc33444aa > > > Cr-Commit-Position: refs/heads/master@{#33282} > > > > Could you please share the benchmarks or your exploration? xchg only has > > performance penalty if one of the operand is memory. > > > > Thanks > > -Weiliang > > The benchmark is Emscripten's copy.js > > I prepared this to simplify the study: > https://codereview.chromium.org/1596593002 > > Build d8 for ia32. You can enable the disassembler and compare the output, > you should only notice the difference due to xchg in the codegen. > > assuming linux, run: > > perf stat d8 [one of the flags in the hacky CL quoted above] copy.js > > > Here's what I see: > > push/mov/pop > 1132.803889 task-clock (msec) # 1.000 CPUs utilized > 169 context-switches # 0.149 K/sec > 19 cpu-migrations # 0.017 K/sec > 3,298 page-faults # 0.003 M/sec > 3,144,992,895 cycles # 2.776 GHz > 1,014,747,038 stalled-cycles-frontend # 32.27% frontend cycles idle > <not supported> stalled-cycles-backend > 8,639,439,901 instructions # 2.75 insns per cycle > # 0.12 stalled cycles per insn > 901,088,586 branches # 795.450 M/sec > 1,072,717 branch-misses # 0.12% of all branches > > 1.132646215 seconds time elapsed > > xchg > 1391.608997 task-clock (msec) # 1.000 CPUs utilized > 179 context-switches # 0.129 K/sec > 23 cpu-migrations # 0.017 K/sec > 3,295 page-faults # 0.002 M/sec > 3,895,382,601 cycles # 2.799 GHz > 1,763,437,652 stalled-cycles-frontend # 45.27% frontend cycles idle > <not supported> stalled-cycles-backend > 8,516,505,308 instructions # 2.19 insns per cycle > # 0.21 stalled cycles per insn > 901,420,178 branches # 647.754 M/sec > 1,081,349 branch-misses # 0.12% of all branches > > 1.391312205 seconds time elapsed > > mov/mov/mov > 1056.847689 task-clock (msec) # 1.000 CPUs utilized > 163 context-switches # 0.154 K/sec > 18 cpu-migrations # 0.017 K/sec > 3,844 page-faults # 0.004 M/sec > 2,894,961,063 cycles # 2.739 GHz > 764,176,642 stalled-cycles-frontend # 26.40% frontend cycles idle > <not supported> stalled-cycles-backend > 8,640,787,395 instructions # 2.98 insns per cycle > # 0.09 stalled cycles per insn > 901,303,752 branches # 852.823 M/sec > 1,087,127 branch-misses # 0.12% of all branches > > 1.057128515 seconds time elapsed > > > xor/xor/xor > 1056.847689 task-clock (msec) # 1.000 CPUs utilized > 163 context-switches # 0.154 K/sec > 18 cpu-migrations # 0.017 K/sec > 3,844 page-faults # 0.004 M/sec > 2,894,961,063 cycles # 2.739 GHz > 764,176,642 stalled-cycles-frontend # 26.40% frontend cycles idle > <not supported> stalled-cycles-backend > 8,640,787,395 instructions # 2.98 insns per cycle > # 0.09 stalled cycles per insn > 901,303,752 branches # 852.823 M/sec > 1,087,127 branch-misses # 0.12% of all branches > > 1.057128515 seconds time elapsed Let me know if you have trouble finding Emscripten/copy.js. I wanted to include a copy here, so we all look at the exact same thing, but I can't see an obvious way.
Message was sent while issue was closed.
On 2016/01/15 18:11:23, Mircea Trofin wrote: > On 2016/01/15 18:09:38, Mircea Trofin wrote: > > On 2016/01/15 03:18:40, Weiliang wrote: > > > On 2016/01/14 05:14:55, commit-bot: I haz the power wrote: > > > > Patchset 6 (id:??) landed as > > > > https://crrev.com/7440edae1da2b78a37e6fe3a558e249dc33444aa > > > > Cr-Commit-Position: refs/heads/master@{#33282} > > > > > > Could you please share the benchmarks or your exploration? xchg only has > > > performance penalty if one of the operand is memory. > > > > > > Thanks > > > -Weiliang > > > > The benchmark is Emscripten's copy.js > > > > I prepared this to simplify the study: > > https://codereview.chromium.org/1596593002 > > > > Build d8 for ia32. You can enable the disassembler and compare the output, > > you should only notice the difference due to xchg in the codegen. > > > > assuming linux, run: > > > > perf stat d8 [one of the flags in the hacky CL quoted above] copy.js > > > > > > Here's what I see: > > > > push/mov/pop > > 1132.803889 task-clock (msec) # 1.000 CPUs utilized > > > 169 context-switches # 0.149 K/sec > > > 19 cpu-migrations # 0.017 K/sec > > > 3,298 page-faults # 0.003 M/sec > > > 3,144,992,895 cycles # 2.776 GHz > > > 1,014,747,038 stalled-cycles-frontend # 32.27% frontend cycles idle > > > <not supported> stalled-cycles-backend > > 8,639,439,901 instructions # 2.75 insns per cycle > > > # 0.12 stalled cycles per > insn > > 901,088,586 branches # 795.450 M/sec > > > 1,072,717 branch-misses # 0.12% of all branches > > > > > 1.132646215 seconds time elapsed > > > > xchg > > 1391.608997 task-clock (msec) # 1.000 CPUs utilized > > > 179 context-switches # 0.129 K/sec > > > 23 cpu-migrations # 0.017 K/sec > > > 3,295 page-faults # 0.002 M/sec > > > 3,895,382,601 cycles # 2.799 GHz > > > 1,763,437,652 stalled-cycles-frontend # 45.27% frontend cycles idle > > > <not supported> stalled-cycles-backend > > 8,516,505,308 instructions # 2.19 insns per cycle > > > # 0.21 stalled cycles per > insn > > 901,420,178 branches # 647.754 M/sec > > > 1,081,349 branch-misses # 0.12% of all branches > > > > > 1.391312205 seconds time elapsed > > > > mov/mov/mov > > 1056.847689 task-clock (msec) # 1.000 CPUs utilized > > > 163 context-switches # 0.154 K/sec > > > 18 cpu-migrations # 0.017 K/sec > > > 3,844 page-faults # 0.004 M/sec > > > 2,894,961,063 cycles # 2.739 GHz > > > 764,176,642 stalled-cycles-frontend # 26.40% frontend cycles idle > > > <not supported> stalled-cycles-backend > > 8,640,787,395 instructions # 2.98 insns per cycle > > > # 0.09 stalled cycles per > insn > > 901,303,752 branches # 852.823 M/sec > > > 1,087,127 branch-misses # 0.12% of all branches > > > > > 1.057128515 seconds time elapsed > > > > > > xor/xor/xor > > 1056.847689 task-clock (msec) # 1.000 CPUs utilized > > > 163 context-switches # 0.154 K/sec > > > 18 cpu-migrations # 0.017 K/sec > > > 3,844 page-faults # 0.004 M/sec > > > 2,894,961,063 cycles # 2.739 GHz > > > 764,176,642 stalled-cycles-frontend # 26.40% frontend cycles idle > > > <not supported> stalled-cycles-backend > > 8,640,787,395 instructions # 2.98 insns per cycle > > > # 0.09 stalled cycles per > insn > > 901,303,752 branches # 852.823 M/sec > > > 1,087,127 branch-misses # 0.12% of all branches > > > > > 1.057128515 seconds time elapsed > > Let me know if you have trouble finding Emscripten/copy.js. I wanted to include > a copy here, > so we all look at the exact same thing, but I can't see an obvious way. ERRATA. for xor, the result is: 1496.887932 task-clock (msec) # 1.000 CPUs utilized 141 context-switches # 0.094 K/sec 23 cpu-migrations # 0.015 K/sec 17,674 page-faults # 0.012 M/sec 4,132,609,079 cycles # 2.761 GHz 1,950,546,566 stalled-cycles-frontend # 47.20% frontend cycles idle <not supported> stalled-cycles-backend 8,782,658,141 instructions # 2.13 insns per cycle # 0.22 stalled cycles per insn 927,779,981 branches # 619.806 M/sec 1,177,290 branch-misses # 0.13% of all branches 1.496329604 seconds time elapsed
Message was sent while issue was closed.
yosin@chromium.org changed reviewers: + yosin@chromium.org
Message was sent while issue was closed.
https://codereview.chromium.org/1580233003/diff/100001/src/compiler/ia32/code... File src/compiler/ia32/code-generator-ia32.cc (right): https://codereview.chromium.org/1580233003/diff/100001/src/compiler/ia32/code... src/compiler/ia32/code-generator-ia32.cc:1619: __ push(src); Just a curiosity, why don't you use XOR-technique for swapping registers? Is PUSH/POP faster than three-XOR?
Message was sent while issue was closed.
On 2016/01/20 01:48:18, Yosi_UTC9 wrote: > https://codereview.chromium.org/1580233003/diff/100001/src/compiler/ia32/code... > File src/compiler/ia32/code-generator-ia32.cc (right): > > https://codereview.chromium.org/1580233003/diff/100001/src/compiler/ia32/code... > src/compiler/ia32/code-generator-ia32.cc:1619: __ push(src); > Just a curiosity, why don't you use XOR-technique for swapping registers? Is > PUSH/POP faster than three-XOR? Yes, see the notes above. You can experiment with the choices by applying locally the patch (copied from above) https://codereview.chromium.org/1596593002
Message was sent while issue was closed.
On 2016/01/20 at 01:50:45, mtrofin wrote: > On 2016/01/20 01:48:18, Yosi_UTC9 wrote: > > https://codereview.chromium.org/1580233003/diff/100001/src/compiler/ia32/code... > > File src/compiler/ia32/code-generator-ia32.cc (right): > > > > https://codereview.chromium.org/1580233003/diff/100001/src/compiler/ia32/code... > > src/compiler/ia32/code-generator-ia32.cc:1619: __ push(src); > > Just a curiosity, why don't you use XOR-technique for swapping registers? Is > > PUSH/POP faster than three-XOR? > > Yes, see the notes above. > > You can experiment with the choices by applying locally the patch (copied from above) > https://codereview.chromium.org/1596593002 Thanks for quick response! Intel does amazing job. ;-) |