Issue 1580233003: [turbofan] avoid xchg instruction on Intel

Issue 1580233003: [turbofan] avoid xchg instruction on Intel (Closed)

Created:
4 years, 11 months ago by Mircea Trofin

Modified:
4 years, 11 months ago

Reviewers:
Benedikt Meurer, Jarin, yosin_UTC9

CC:
v8-reviews_googlegroups.com

Base URL:
https://chromium.googlesource.com/v8/v8.git@master

Target Ref:
refs/pending/heads/master

Project:
v8

Visibility:
Public.

More Reviews

Description

[turbofan] avoid xchg instruction on Intel On Intel, xchg stalls the pipeline. We use xchg to implement swap moves. In a separate exploration, the presence of xchg in a very hot loop, due to a change in register allocation, lead to over 20% regression. Simply changing that instruction with push/mov/pop (almost) eliminated the regression. In light of that, I removed uses of xchg. This leads to more instructions, though. That is particularly problematic for long cycles, which, today, we translate to successions of swaps. I plan to address this cycle issue in a separate change. For now, the goal is to unblock the initial work that lead here. Committed: https://crrev.com/7440edae1da2b78a37e6fe3a558e249dc33444aa Cr-Commit-Position: refs/heads/master@{#33282}

Patch Set 1 #

Patch Set 2 : #

Patch Set 3 : #

Patch Set 4 : #

Patch Set 5 : #

Patch Set 6 : #

Total comments: 1

Created: 4 years, 11 months ago

Download [raw] [tar.bz2]

	Unified diffs	Side-by-side diffs	Delta from patch set	Stats (+41 lines, -10 lines)			Patch
M	src/compiler/ia32/code-generator-ia32.cc	View	1 2 3 4 5	1 chunk	+11 lines, -2 lines	1 comment	Download
M	src/compiler/x64/code-generator-x64.cc	View	1 2 3 4 5	2 chunks	+18 lines, -4 lines	0 comments	Download
M	src/crankshaft/ia32/lithium-gap-resolver-ia32.cc	View	1 2	1 chunk	+3 lines, -1 line	0 comments	Download
M	src/crankshaft/x64/lithium-gap-resolver-x64.cc	View	1 2 3	1 chunk	+3 lines, -1 line	0 comments	Download
M	src/ia32/code-stubs-ia32.cc	View	1 2	1 chunk	+3 lines, -1 line	0 comments	Download
M	src/x64/code-stubs-x64.cc	View	1 2	1 chunk	+3 lines, -1 line	0 comments	Download

Messages

Total messages: 19 (7 generated)

Expand Messages | Collapse Messages | Show Generated Messages | Hide Generated Messages

commit-bot: I haz the power

CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/1580233003/100001 View timeline at https://chromium-cq-status.appspot.com/patch-timeline/1580233003/100001

4 years, 11 months ago (2016-01-14 04:54:27 UTC) #7

commit-bot: I haz the power

Description was changed from ========== [turbofan] avoid xchg instruction on Intel On Intel, xchg stalls ...

4 years, 11 months ago (2016-01-14 05:13:56 UTC) #8

commit-bot: I haz the power

Description was changed from ========== [turbofan] avoid xchg instruction on Intel On Intel, xchg stalls ...

4 years, 11 months ago (2016-01-14 05:14:54 UTC) #10

commit-bot: I haz the power

Patchset 6 (id:??) landed as https://crrev.com/7440edae1da2b78a37e6fe3a558e249dc33444aa Cr-Commit-Position: refs/heads/master@{#33282}

4 years, 11 months ago (2016-01-14 05:14:55 UTC) #11

Weiliang

On 2016/01/14 05:14:55, commit-bot: I haz the power wrote: > Patchset 6 (id:??) landed as ...

4 years, 11 months ago (2016-01-15 03:18:40 UTC) #12

Mircea Trofin

On 2016/01/15 03:18:40, Weiliang wrote: > On 2016/01/14 05:14:55, commit-bot: I haz the power wrote: ...

4 years, 11 months ago (2016-01-15 18:09:38 UTC) #13

Message was sent while issue was closed.

On 2016/01/15 03:18:40, Weiliang wrote:
> On 2016/01/14 05:14:55, commit-bot: I haz the power wrote:
> > Patchset 6 (id:??) landed as
> > https://crrev.com/7440edae1da2b78a37e6fe3a558e249dc33444aa
> > Cr-Commit-Position: refs/heads/master@{#33282}
> 
> Could you please share the benchmarks or your exploration? xchg only has
> performance penalty if one of the operand is memory.　
> 
> Thanks
> -Weiliang

The benchmark is Emscripten's copy.js

I prepared this to simplify the study:
https://codereview.chromium.org/1596593002

Build d8 for ia32. You can enable the disassembler and compare the output,
you should only notice the difference due to xchg in the codegen.

assuming linux, run:

perf stat d8 [one of the flags in the hacky CL quoted above] copy.js 


Here's what I see:

push/mov/pop
       1132.803889 task-clock (msec)         #    1.000 CPUs utilized          
               169 context-switches          #    0.149 K/sec                  
                19 cpu-migrations            #    0.017 K/sec                  
             3,298 page-faults               #    0.003 M/sec                  
     3,144,992,895 cycles                    #    2.776 GHz                    
     1,014,747,038 stalled-cycles-frontend   #   32.27% frontend cycles idle   
   <not supported> stalled-cycles-backend  
     8,639,439,901 instructions              #    2.75  insns per cycle        
                                             #    0.12  stalled cycles per insn
       901,088,586 branches                  #  795.450 M/sec                  
         1,072,717 branch-misses             #    0.12% of all branches        

       1.132646215 seconds time elapsed

xchg
       1391.608997 task-clock (msec)         #    1.000 CPUs utilized          
               179 context-switches          #    0.129 K/sec                  
                23 cpu-migrations            #    0.017 K/sec                  
             3,295 page-faults               #    0.002 M/sec                  
     3,895,382,601 cycles                    #    2.799 GHz                    
     1,763,437,652 stalled-cycles-frontend   #   45.27% frontend cycles idle   
   <not supported> stalled-cycles-backend  
     8,516,505,308 instructions              #    2.19  insns per cycle        
                                             #    0.21  stalled cycles per insn
       901,420,178 branches                  #  647.754 M/sec                  
         1,081,349 branch-misses             #    0.12% of all branches        

       1.391312205 seconds time elapsed

mov/mov/mov
       1056.847689 task-clock (msec)         #    1.000 CPUs utilized          
               163 context-switches          #    0.154 K/sec                  
                18 cpu-migrations            #    0.017 K/sec                  
             3,844 page-faults               #    0.004 M/sec                  
     2,894,961,063 cycles                    #    2.739 GHz                    
       764,176,642 stalled-cycles-frontend   #   26.40% frontend cycles idle   
   <not supported> stalled-cycles-backend  
     8,640,787,395 instructions              #    2.98  insns per cycle        
                                             #    0.09  stalled cycles per insn
       901,303,752 branches                  #  852.823 M/sec                  
         1,087,127 branch-misses             #    0.12% of all branches        

       1.057128515 seconds time elapsed


xor/xor/xor
       1056.847689 task-clock (msec)         #    1.000 CPUs utilized          
               163 context-switches          #    0.154 K/sec                  
                18 cpu-migrations            #    0.017 K/sec                  
             3,844 page-faults               #    0.004 M/sec                  
     2,894,961,063 cycles                    #    2.739 GHz                    
       764,176,642 stalled-cycles-frontend   #   26.40% frontend cycles idle   
   <not supported> stalled-cycles-backend  
     8,640,787,395 instructions              #    2.98  insns per cycle        
                                             #    0.09  stalled cycles per insn
       901,303,752 branches                  #  852.823 M/sec                  
         1,087,127 branch-misses             #    0.12% of all branches        

       1.057128515 seconds time elapsed

Mircea Trofin

On 2016/01/15 18:09:38, Mircea Trofin wrote: > On 2016/01/15 03:18:40, Weiliang wrote: > > On ...

4 years, 11 months ago (2016-01-15 18:11:23 UTC) #14

Message was sent while issue was closed.

On 2016/01/15 18:09:38, Mircea Trofin wrote:
> On 2016/01/15 03:18:40, Weiliang wrote:
> > On 2016/01/14 05:14:55, commit-bot: I haz the power wrote:
> > > Patchset 6 (id:??) landed as
> > > https://crrev.com/7440edae1da2b78a37e6fe3a558e249dc33444aa
> > > Cr-Commit-Position: refs/heads/master@{#33282}
> > 
> > Could you please share the benchmarks or your exploration? xchg only has
> > performance penalty if one of the operand is memory.　
> > 
> > Thanks
> > -Weiliang
> 
> The benchmark is Emscripten's copy.js
> 
> I prepared this to simplify the study:
> https://codereview.chromium.org/1596593002
> 
> Build d8 for ia32. You can enable the disassembler and compare the output,
> you should only notice the difference due to xchg in the codegen.
> 
> assuming linux, run:
> 
> perf stat d8 [one of the flags in the hacky CL quoted above] copy.js 
> 
> 
> Here's what I see:
> 
> push/mov/pop
>        1132.803889 task-clock (msec)         #    1.000 CPUs utilized         

>                169 context-switches          #    0.149 K/sec                 

>                 19 cpu-migrations            #    0.017 K/sec                 

>              3,298 page-faults               #    0.003 M/sec                 

>      3,144,992,895 cycles                    #    2.776 GHz                   

>      1,014,747,038 stalled-cycles-frontend   #   32.27% frontend cycles idle  

>    <not supported> stalled-cycles-backend  
>      8,639,439,901 instructions              #    2.75  insns per cycle       

>                                              #    0.12  stalled cycles per
insn
>        901,088,586 branches                  #  795.450 M/sec                 

>          1,072,717 branch-misses             #    0.12% of all branches       

> 
>        1.132646215 seconds time elapsed
> 
> xchg
>        1391.608997 task-clock (msec)         #    1.000 CPUs utilized         

>                179 context-switches          #    0.129 K/sec                 

>                 23 cpu-migrations            #    0.017 K/sec                 

>              3,295 page-faults               #    0.002 M/sec                 

>      3,895,382,601 cycles                    #    2.799 GHz                   

>      1,763,437,652 stalled-cycles-frontend   #   45.27% frontend cycles idle  

>    <not supported> stalled-cycles-backend  
>      8,516,505,308 instructions              #    2.19  insns per cycle       

>                                              #    0.21  stalled cycles per
insn
>        901,420,178 branches                  #  647.754 M/sec                 

>          1,081,349 branch-misses             #    0.12% of all branches       

> 
>        1.391312205 seconds time elapsed
> 
> mov/mov/mov
>        1056.847689 task-clock (msec)         #    1.000 CPUs utilized         

>                163 context-switches          #    0.154 K/sec                 

>                 18 cpu-migrations            #    0.017 K/sec                 

>              3,844 page-faults               #    0.004 M/sec                 

>      2,894,961,063 cycles                    #    2.739 GHz                   

>        764,176,642 stalled-cycles-frontend   #   26.40% frontend cycles idle  

>    <not supported> stalled-cycles-backend  
>      8,640,787,395 instructions              #    2.98  insns per cycle       

>                                              #    0.09  stalled cycles per
insn
>        901,303,752 branches                  #  852.823 M/sec                 

>          1,087,127 branch-misses             #    0.12% of all branches       

> 
>        1.057128515 seconds time elapsed
> 
> 
> xor/xor/xor
>        1056.847689 task-clock (msec)         #    1.000 CPUs utilized         

>                163 context-switches          #    0.154 K/sec                 

>                 18 cpu-migrations            #    0.017 K/sec                 

>              3,844 page-faults               #    0.004 M/sec                 

>      2,894,961,063 cycles                    #    2.739 GHz                   

>        764,176,642 stalled-cycles-frontend   #   26.40% frontend cycles idle  

>    <not supported> stalled-cycles-backend  
>      8,640,787,395 instructions              #    2.98  insns per cycle       

>                                              #    0.09  stalled cycles per
insn
>        901,303,752 branches                  #  852.823 M/sec                 

>          1,087,127 branch-misses             #    0.12% of all branches       

> 
>        1.057128515 seconds time elapsed

Let me know if you have trouble finding Emscripten/copy.js. I wanted to include
a copy here, 
so we all look at the exact same thing, but I can't see an obvious way.

Mircea Trofin

On 2016/01/15 18:11:23, Mircea Trofin wrote: > On 2016/01/15 18:09:38, Mircea Trofin wrote: > > ...

4 years, 11 months ago (2016-01-15 21:20:59 UTC) #15

Message was sent while issue was closed.

On 2016/01/15 18:11:23, Mircea Trofin wrote:
> On 2016/01/15 18:09:38, Mircea Trofin wrote:
> > On 2016/01/15 03:18:40, Weiliang wrote:
> > > On 2016/01/14 05:14:55, commit-bot: I haz the power wrote:
> > > > Patchset 6 (id:??) landed as
> > > > https://crrev.com/7440edae1da2b78a37e6fe3a558e249dc33444aa
> > > > Cr-Commit-Position: refs/heads/master@{#33282}
> > > 
> > > Could you please share the benchmarks or your exploration? xchg only has
> > > performance penalty if one of the operand is memory.　
> > > 
> > > Thanks
> > > -Weiliang
> > 
> > The benchmark is Emscripten's copy.js
> > 
> > I prepared this to simplify the study:
> > https://codereview.chromium.org/1596593002
> > 
> > Build d8 for ia32. You can enable the disassembler and compare the output,
> > you should only notice the difference due to xchg in the codegen.
> > 
> > assuming linux, run:
> > 
> > perf stat d8 [one of the flags in the hacky CL quoted above] copy.js 
> > 
> > 
> > Here's what I see:
> > 
> > push/mov/pop
> >        1132.803889 task-clock (msec)         #    1.000 CPUs utilized       
 
> 
> >                169 context-switches          #    0.149 K/sec               
 
> 
> >                 19 cpu-migrations            #    0.017 K/sec               
 
> 
> >              3,298 page-faults               #    0.003 M/sec               
 
> 
> >      3,144,992,895 cycles                    #    2.776 GHz                 
 
> 
> >      1,014,747,038 stalled-cycles-frontend   #   32.27% frontend cycles idle
 
> 
> >    <not supported> stalled-cycles-backend  
> >      8,639,439,901 instructions              #    2.75  insns per cycle     
 
> 
> >                                              #    0.12  stalled cycles per
> insn
> >        901,088,586 branches                  #  795.450 M/sec               
 
> 
> >          1,072,717 branch-misses             #    0.12% of all branches     
 
> 
> > 
> >        1.132646215 seconds time elapsed
> > 
> > xchg
> >        1391.608997 task-clock (msec)         #    1.000 CPUs utilized       
 
> 
> >                179 context-switches          #    0.129 K/sec               
 
> 
> >                 23 cpu-migrations            #    0.017 K/sec               
 
> 
> >              3,295 page-faults               #    0.002 M/sec               
 
> 
> >      3,895,382,601 cycles                    #    2.799 GHz                 
 
> 
> >      1,763,437,652 stalled-cycles-frontend   #   45.27% frontend cycles idle
 
> 
> >    <not supported> stalled-cycles-backend  
> >      8,516,505,308 instructions              #    2.19  insns per cycle     
 
> 
> >                                              #    0.21  stalled cycles per
> insn
> >        901,420,178 branches                  #  647.754 M/sec               
 
> 
> >          1,081,349 branch-misses             #    0.12% of all branches     
 
> 
> > 
> >        1.391312205 seconds time elapsed
> > 
> > mov/mov/mov
> >        1056.847689 task-clock (msec)         #    1.000 CPUs utilized       
 
> 
> >                163 context-switches          #    0.154 K/sec               
 
> 
> >                 18 cpu-migrations            #    0.017 K/sec               
 
> 
> >              3,844 page-faults               #    0.004 M/sec               
 
> 
> >      2,894,961,063 cycles                    #    2.739 GHz                 
 
> 
> >        764,176,642 stalled-cycles-frontend   #   26.40% frontend cycles idle
 
> 
> >    <not supported> stalled-cycles-backend  
> >      8,640,787,395 instructions              #    2.98  insns per cycle     
 
> 
> >                                              #    0.09  stalled cycles per
> insn
> >        901,303,752 branches                  #  852.823 M/sec               
 
> 
> >          1,087,127 branch-misses             #    0.12% of all branches     
 
> 
> > 
> >        1.057128515 seconds time elapsed
> > 
> > 
> > xor/xor/xor
> >        1056.847689 task-clock (msec)         #    1.000 CPUs utilized       
 
> 
> >                163 context-switches          #    0.154 K/sec               
 
> 
> >                 18 cpu-migrations            #    0.017 K/sec               
 
> 
> >              3,844 page-faults               #    0.004 M/sec               
 
> 
> >      2,894,961,063 cycles                    #    2.739 GHz                 
 
> 
> >        764,176,642 stalled-cycles-frontend   #   26.40% frontend cycles idle
 
> 
> >    <not supported> stalled-cycles-backend  
> >      8,640,787,395 instructions              #    2.98  insns per cycle     
 
> 
> >                                              #    0.09  stalled cycles per
> insn
> >        901,303,752 branches                  #  852.823 M/sec               
 
> 
> >          1,087,127 branch-misses             #    0.12% of all branches     
 
> 
> > 
> >        1.057128515 seconds time elapsed
> 
> Let me know if you have trouble finding Emscripten/copy.js. I wanted to
include
> a copy here, 
> so we all look at the exact same thing, but I can't see an obvious way.

ERRATA.

for xor, the result is:
       1496.887932 task-clock (msec)         #    1.000 CPUs utilized          
               141 context-switches          #    0.094 K/sec                  
                23 cpu-migrations            #    0.015 K/sec                  
            17,674 page-faults               #    0.012 M/sec                  
     4,132,609,079 cycles                    #    2.761 GHz                    
     1,950,546,566 stalled-cycles-frontend   #   47.20% frontend cycles idle   
   <not supported> stalled-cycles-backend  
     8,782,658,141 instructions              #    2.13  insns per cycle        
                                             #    0.22  stalled cycles per insn
       927,779,981 branches                  #  619.806 M/sec                  
         1,177,290 branch-misses             #    0.13% of all branches        

       1.496329604 seconds time elapsed

yosin_UTC9

https://codereview.chromium.org/1580233003/diff/100001/src/compiler/ia32/code-generator-ia32.cc File src/compiler/ia32/code-generator-ia32.cc (right): https://codereview.chromium.org/1580233003/diff/100001/src/compiler/ia32/code-generator-ia32.cc#newcode1619 src/compiler/ia32/code-generator-ia32.cc:1619: __ push(src); Just a curiosity, why don't you use ...

4 years, 11 months ago (2016-01-20 01:48:18 UTC) #17

Mircea Trofin

On 2016/01/20 01:48:18, Yosi_UTC9 wrote: > https://codereview.chromium.org/1580233003/diff/100001/src/compiler/ia32/code-generator-ia32.cc > File src/compiler/ia32/code-generator-ia32.cc (right): > > https://codereview.chromium.org/1580233003/diff/100001/src/compiler/ia32/code-generator-ia32.cc#newcode1619 > ...

4 years, 11 months ago (2016-01-20 01:50:45 UTC) #18

yosin_UTC9

4 years, 11 months ago (2016-01-20 02:06:31 UTC) #19

Message was sent while issue was closed.

On 2016/01/20 at 01:50:45, mtrofin wrote:
> On 2016/01/20 01:48:18, Yosi_UTC9 wrote:
> >
https://codereview.chromium.org/1580233003/diff/100001/src/compiler/ia32/code...
> > File src/compiler/ia32/code-generator-ia32.cc (right):
> > 
> >
https://codereview.chromium.org/1580233003/diff/100001/src/compiler/ia32/code...
> > src/compiler/ia32/code-generator-ia32.cc:1619: __ push(src);
> > Just a curiosity, why don't you use XOR-technique for swapping registers? Is
> > PUSH/POP faster than three-XOR?
> 
> Yes, see the notes above. 
> 
> You can experiment with the choices by applying locally the patch (copied from
above)
> https://codereview.chromium.org/1596593002

Thanks for quick response!
Intel does amazing job. ;-)

Expand Messages | Collapse Messages | Show Generated Messages | Hide Generated Messages