Issue 2951793003: [wasm] Implement remaining SIMD x64 compare ops, unops.

gdeepti

The CQ bit was checked by gdeepti@chromium.org to run a CQ dry run

3 years, 6 months ago (2017-06-21 02:17:16 UTC) #1

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2951793003/1

3 years, 6 months ago (2017-06-21 02:17:25 UTC) #2

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 6 months ago (2017-06-21 02:47:55 UTC) #3

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

3 years, 6 months ago (2017-06-21 02:47:56 UTC) #4

gdeepti

The CQ bit was checked by gdeepti@chromium.org to run a CQ dry run

3 years, 6 months ago (2017-06-21 03:34:07 UTC) #5

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2951793003/20001

3 years, 6 months ago (2017-06-21 03:34:12 UTC) #6

gdeepti

The CQ bit was checked by gdeepti@chromium.org to run a CQ dry run

3 years, 6 months ago (2017-06-21 03:38:23 UTC) #7

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2951793003/40001

3 years, 6 months ago (2017-06-21 03:38:24 UTC) #8

gdeepti

https://codereview.chromium.org/2951793003/diff/40001/src/compiler/x64/code-generator-x64.cc File src/compiler/x64/code-generator-x64.cc (right): https://codereview.chromium.org/2951793003/diff/40001/src/compiler/x64/code-generator-x64.cc#newcode2284 src/compiler/x64/code-generator-x64.cc:2284: __ pxor(src, kScratchDoubleReg); This is similar to the code ...

3 years, 6 months ago (2017-06-21 03:44:01 UTC) #9

gdeepti

Description was changed from ========== [wasm] Implement remmaining SIMD x64 compare ops, unops. Ops Implemented: ...

3 years, 6 months ago (2017-06-21 04:15:20 UTC) #10

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 6 months ago (2017-06-21 04:16:58 UTC) #11

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

3 years, 6 months ago (2017-06-21 04:16:58 UTC) #12

bbudge

https://codereview.chromium.org/2951793003/diff/40001/src/compiler/x64/code-generator-x64.cc File src/compiler/x64/code-generator-x64.cc (right): https://codereview.chromium.org/2951793003/diff/40001/src/compiler/x64/code-generator-x64.cc#newcode2284 src/compiler/x64/code-generator-x64.cc:2284: __ pxor(src, kScratchDoubleReg); On 2017/06/21 03:44:01, gdeepti wrote: > ...

3 years, 6 months ago (2017-06-22 16:29:50 UTC) #14

gdeepti

https://codereview.chromium.org/2951793003/diff/40001/src/compiler/x64/code-generator-x64.cc File src/compiler/x64/code-generator-x64.cc (right): https://codereview.chromium.org/2951793003/diff/40001/src/compiler/x64/code-generator-x64.cc#newcode2284 src/compiler/x64/code-generator-x64.cc:2284: __ pxor(src, kScratchDoubleReg); On 2017/06/22 16:29:49, bbudge wrote: > ...

3 years, 6 months ago (2017-06-22 17:26:49 UTC) #15

bbudge

A general question: can you reorder operands in the instruction selector so we can avoid ...

3 years, 6 months ago (2017-06-22 17:30:10 UTC) #17

zvi

On 2017/06/22 17:30:10, bbudge wrote: > A general question: can you reorder operands in the ...

3 years, 6 months ago (2017-06-24 23:03:05 UTC) #18

zvi

https://codereview.chromium.org/2951793003/diff/40001/src/compiler/x64/code-generator-x64.cc File src/compiler/x64/code-generator-x64.cc (right): https://codereview.chromium.org/2951793003/diff/40001/src/compiler/x64/code-generator-x64.cc#newcode2284 src/compiler/x64/code-generator-x64.cc:2284: __ pxor(src, kScratchDoubleReg); On 2017/06/22 17:26:48, gdeepti wrote: > ...

3 years, 6 months ago (2017-06-24 23:04:09 UTC) #19

https://codereview.chromium.org/2951793003/diff/40001/src/compiler/x64/code-g...
File src/compiler/x64/code-generator-x64.cc (right):

https://codereview.chromium.org/2951793003/diff/40001/src/compiler/x64/code-g...
src/compiler/x64/code-generator-x64.cc:2284: __ pxor(src, kScratchDoubleReg);
On 2017/06/22 17:26:48, gdeepti wrote:
> On 2017/06/22 16:29:49, bbudge wrote:
> > On 2017/06/21 03:44:01, gdeepti wrote:
> > > This is similar to the code Clang generates for unsigned compares, but is
> > worse
> > > than it should be, because we do not have S128Consts. Given that, would it
> be
> > > better to implement this using pminu(d/w/b) instructions as for I32x4GeU?
> > 
> > I think we'll have to support S128 constants at some point. They're in the
> spec
> > proposal, and may be helpful in other contexts too, e.g. irregular shuffle
> masks
> > on ARM and MIPS.
> > 
> > These constants could be created by setting all 1's, then left shift by 31.
It
> > might be slightly faster.
> > 
> > However, it does seem simpler to use pmaxud (SSE 4.1?) here: a > b => max(a,
> b)
> > == a
> > movd(scratch, dst)
> > pmaxud(scratch, src)
> > pcmpeqd(dst, scratch)
> > 
> > Similarly, pmaxub, pmaxus for other formats.
> 
> I like the simplicity of the pminud/pmaxud approach to implement the GeU/GtU
> operations, I added both approaches to poll feedback, as gcc/clang both
generate
> code in terms of pxor/psubd, but for our use case pmaxu/pminu(d/w/b) seems to
be
> better, or at least that till we have S128Constants. I'll use pmaxud/pminud
and 
> add a TODO to reconsider when we have S128Constants. 

I agree that the pmax/pmin option is preferable for 128-bit or larger operands
(pmin/pmax MMX throughput was reduced for mmx-types in recent processor
generations so it is better to avoid these).

Here is a comparison of IACA reports for two alternatives in the 'tie-braker'
case. The throughput is the same, but you save on code size and one micro-op
less:

Block Throughput: 2.00 Cycles
 | Num Of |                    Ports pressure in cycles                     |   
|
 |  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |   
|

---------------------------------------------------------------------------------
 |   1    |           | 0.3 |           |           |     | 0.6 |     |     | CP
| vpcmpgtd xmm0, xmm1, xmm0
 |   1    |           | 0.7 |           |           |     | 0.3 |     |     |   
| vpcmpeqd xmm1, xmm1, xmm1
 |   1    | 0.8       | 0.1 |           |           |     | 0.2 |     |     | CP
| vpxor xmm0, xmm0, xmm1
 Total Num Of Uops: 3



Block Throughput: 2.00 Cycles
 | Num Of |                    Ports pressure in cycles                     |   
|
 |  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |   
|

---------------------------------------------------------------------------------
 |   1    |           | 1.0 |           |           |     |     |     |     | CP
| vpmaxsd xmm2, xmm0, xmm1
 |   1    |           |     |           |           |     | 1.0 |     |     | CP
| vpcmpeqd xmm0, xmm0, xmm2
 Total Num Of Uops: 2

gdeepti

The CQ bit was checked by gdeepti@chromium.org to run a CQ dry run

3 years, 5 months ago (2017-06-27 20:03:05 UTC) #20

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2951793003/60001

3 years, 5 months ago (2017-06-27 20:03:10 UTC) #21

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 5 months ago (2017-06-27 20:29:01 UTC) #22

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

3 years, 5 months ago (2017-06-27 20:29:02 UTC) #23

gdeepti

https://codereview.chromium.org/2951793003/diff/40001/src/compiler/x64/code-generator-x64.cc File src/compiler/x64/code-generator-x64.cc (right): https://codereview.chromium.org/2951793003/diff/40001/src/compiler/x64/code-generator-x64.cc#newcode2254 src/compiler/x64/code-generator-x64.cc:2254: __ movaps(kScratchDoubleReg, i.InputSimd128Register(1)); On 2017/06/22 17:30:10, bbudge wrote: > ...

3 years, 5 months ago (2017-06-27 20:47:42 UTC) #24

https://codereview.chromium.org/2951793003/diff/40001/src/compiler/x64/code-g...
File src/compiler/x64/code-generator-x64.cc (right):

https://codereview.chromium.org/2951793003/diff/40001/src/compiler/x64/code-g...
src/compiler/x64/code-generator-x64.cc:2254: __ movaps(kScratchDoubleReg,
i.InputSimd128Register(1));
On 2017/06/22 17:30:10, bbudge wrote:
> Can you use pminsd here to save 1 instruction?

Done.

https://codereview.chromium.org/2951793003/diff/40001/src/compiler/x64/code-g...
src/compiler/x64/code-generator-x64.cc:2284: __ pxor(src, kScratchDoubleReg);
On 2017/06/24 23:04:09, zvi wrote:
> On 2017/06/22 17:26:48, gdeepti wrote:
> > On 2017/06/22 16:29:49, bbudge wrote:
> > > On 2017/06/21 03:44:01, gdeepti wrote:
> > > > This is similar to the code Clang generates for unsigned compares, but
is
> > > worse
> > > > than it should be, because we do not have S128Consts. Given that, would
it
> > be
> > > > better to implement this using pminu(d/w/b) instructions as for
I32x4GeU?
> > > 
> > > I think we'll have to support S128 constants at some point. They're in the
> > spec
> > > proposal, and may be helpful in other contexts too, e.g. irregular shuffle
> > masks
> > > on ARM and MIPS.
> > > 
> > > These constants could be created by setting all 1's, then left shift by
31.
> It
> > > might be slightly faster.
> > > 
> > > However, it does seem simpler to use pmaxud (SSE 4.1?) here: a > b =>
max(a,
> > b)
> > > == a
> > > movd(scratch, dst)
> > > pmaxud(scratch, src)
> > > pcmpeqd(dst, scratch)
> > > 
> > > Similarly, pmaxub, pmaxus for other formats.

Bill: 
This returns true for the == case as well, using pmax + complement, does this
look reasonable?

> > 
> > I like the simplicity of the pminud/pmaxud approach to implement the GeU/GtU
> > operations, I added both approaches to poll feedback, as gcc/clang both
> generate
> > code in terms of pxor/psubd, but for our use case pmaxu/pminu(d/w/b) seems
to
> be
> > better, or at least that till we have S128Constants. I'll use pmaxud/pminud
> and 
> > add a TODO to reconsider when we have S128Constants. 
> 
> I agree that the pmax/pmin option is preferable for 128-bit or larger operands
> (pmin/pmax MMX throughput was reduced for mmx-types in recent processor
> generations so it is better to avoid these).
> 
> Here is a comparison of IACA reports for two alternatives in the 'tie-braker'
> case. The throughput is the same, but you save on code size and one micro-op
> less:
> 
> Block Throughput: 2.00 Cycles
>  | Num Of |                    Ports pressure in cycles                     | 
 
> |
>  |  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  | 
 
> |
> 
>
---------------------------------------------------------------------------------
>  |   1    |           | 0.3 |           |           |     | 0.6 |     |     |
CP
> | vpcmpgtd xmm0, xmm1, xmm0
>  |   1    |           | 0.7 |           |           |     | 0.3 |     |     | 
 
> | vpcmpeqd xmm1, xmm1, xmm1
>  |   1    | 0.8       | 0.1 |           |           |     | 0.2 |     |     |
CP
> | vpxor xmm0, xmm0, xmm1
>  Total Num Of Uops: 3
> 
> 
> 
> Block Throughput: 2.00 Cycles
>  | Num Of |                    Ports pressure in cycles                     | 
 
> |
>  |  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  | 
 
> |
> 
>
---------------------------------------------------------------------------------
>  |   1    |           | 1.0 |           |           |     |     |     |     |
CP
> | vpmaxsd xmm2, xmm0, xmm1
>  |   1    |           |     |           |           |     | 1.0 |     |     |
CP
> | vpcmpeqd xmm0, xmm0, xmm2
>  Total Num Of Uops: 2

Zvi: 

Thanks for the detailed comparison. Reasons for using MMX instructions here:
The v8 assembler does not fully support the AVX instruction set - my assumption
here is that there is a possibility of encountering AVX/SSE transition penalties
if AVX instructions are used incosistently in the code generator. Is this
assumption correct? 

In the future when we have full assembler support for AVX instructions, the
macroassembler should be able to chose at runtime based on CPU support.

zvi

https://codereview.chromium.org/2951793003/diff/40001/src/compiler/x64/code-generator-x64.cc File src/compiler/x64/code-generator-x64.cc (right): https://codereview.chromium.org/2951793003/diff/40001/src/compiler/x64/code-generator-x64.cc#newcode2284 src/compiler/x64/code-generator-x64.cc:2284: __ pxor(src, kScratchDoubleReg); On 2017/06/27 20:47:42, gdeepti wrote: > ...

3 years, 5 months ago (2017-06-27 22:30:32 UTC) #25

https://codereview.chromium.org/2951793003/diff/40001/src/compiler/x64/code-g...
File src/compiler/x64/code-generator-x64.cc (right):

https://codereview.chromium.org/2951793003/diff/40001/src/compiler/x64/code-g...
src/compiler/x64/code-generator-x64.cc:2284: __ pxor(src, kScratchDoubleReg);
On 2017/06/27 20:47:42, gdeepti wrote:
> On 2017/06/24 23:04:09, zvi wrote:
> > On 2017/06/22 17:26:48, gdeepti wrote:
> > > On 2017/06/22 16:29:49, bbudge wrote:
> > > > On 2017/06/21 03:44:01, gdeepti wrote:
> > > > > This is similar to the code Clang generates for unsigned compares, but
> is
> > > > worse
> > > > > than it should be, because we do not have S128Consts. Given that,
would
> it
> > > be
> > > > > better to implement this using pminu(d/w/b) instructions as for
> I32x4GeU?
> > > > 
> > > > I think we'll have to support S128 constants at some point. They're in
the
> > > spec
> > > > proposal, and may be helpful in other contexts too, e.g. irregular
shuffle
> > > masks
> > > > on ARM and MIPS.
> > > > 
> > > > These constants could be created by setting all 1's, then left shift by
> 31.
> > It
> > > > might be slightly faster.
> > > > 
> > > > However, it does seem simpler to use pmaxud (SSE 4.1?) here: a > b =>
> max(a,
> > > b)
> > > > == a
> > > > movd(scratch, dst)
> > > > pmaxud(scratch, src)
> > > > pcmpeqd(dst, scratch)
> > > > 
> > > > Similarly, pmaxub, pmaxus for other formats.
> > > 
> > > I like the simplicity of the pminud/pmaxud approach to implement the
GeU/GtU
> > > operations, I added both approaches to poll feedback, as gcc/clang both
> > generate
> > > code in terms of pxor/psubd, but for our use case pmaxu/pminu(d/w/b) seems
> to
> > be
> > > better, or at least that till we have S128Constants. I'll use
pmaxud/pminud
> > and 
> > > add a TODO to reconsider when we have S128Constants. 
> > 
> > I agree that the pmax/pmin option is preferable for 128-bit or larger
operands
> > (pmin/pmax MMX throughput was reduced for mmx-types in recent processor
> > generations so it is better to avoid these).
> > 
> > Here is a comparison of IACA reports for two alternatives in the
'tie-braker'
> > case. The throughput is the same, but you save on code size and one micro-op
> > less:
> > 
> > Block Throughput: 2.00 Cycles
> >  | Num Of |                    Ports pressure in cycles                    
| 
>  
> > |
> >  |  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7 
| 
>  
> > |
> > 
> >
>
---------------------------------------------------------------------------------
> >  |   1    |           | 0.3 |           |           |     | 0.6 |     |    
|
> CP
> > | vpcmpgtd xmm0, xmm1, xmm0
> >  |   1    |           | 0.7 |           |           |     | 0.3 |     |    
| 
>  
> > | vpcmpeqd xmm1, xmm1, xmm1
> >  |   1    | 0.8       | 0.1 |           |           |     | 0.2 |     |    
|
> CP
> > | vpxor xmm0, xmm0, xmm1
> >  Total Num Of Uops: 3
> > 
> > 
> > 
> > Block Throughput: 2.00 Cycles
> >  | Num Of |                    Ports pressure in cycles                    
| 
>  
> > |
> >  |  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7 
| 
>  
> > |
> > 
> >
>
---------------------------------------------------------------------------------
> >  |   1    |           | 1.0 |           |           |     |     |     |    
|
> CP
> > | vpmaxsd xmm2, xmm0, xmm1
> >  |   1    |           |     |           |           |     | 1.0 |     |    
|
> CP
> > | vpcmpeqd xmm0, xmm0, xmm2
> >  Total Num Of Uops: 2
> 
> Zvi: 
> 
> Thanks for the detailed comparison. Reasons for using MMX instructions here:
> The v8 assembler does not fully support the AVX instruction set - my
assumption
> here is that there is a possibility of encountering AVX/SSE transition
penalties
> if AVX instructions are used incosistently in the code generator. Is this
> assumption correct? 
> 
> In the future when we have full assembler support for AVX instructions, the
> macroassembler should be able to chose at runtime based on CPU support. 

I didn't mean to say you are selecting MMX instructions here - it appears to be
SSE which is fine. MMX is the older first generation SIMD instruction set that
used 64-bit registers. IIUC, the code under discussion here is targeting SSE
which shares the same execution resources as the 128-bit AVX instructions, so in
that respect SSE is not inferior. I'm sorry if my side-note about MMX mislead
you.

You are right that SSE and AVX code should not be mixed and in some cases you
can hit high penalties if you do not clear the upper 128-bits of the YMM
registers upon transition from AVX to SSE.

One thing i was wondering about:
AVX introduces non-destructive sources which will make the 'if (dst.is(src))'
checks unnecessary because with AVX the binary operations will not share an
operand for source and dest. Of course the register allocator is free to assign
the same register to source and dest if source is killed. So will there need to
be a redesign of these interfaces for AVX?

gdeepti

https://codereview.chromium.org/2951793003/diff/40001/src/compiler/x64/code-generator-x64.cc File src/compiler/x64/code-generator-x64.cc (right): https://codereview.chromium.org/2951793003/diff/40001/src/compiler/x64/code-generator-x64.cc#newcode2284 src/compiler/x64/code-generator-x64.cc:2284: __ pxor(src, kScratchDoubleReg); On 2017/06/27 22:30:32, zvi wrote: > ...

3 years, 5 months ago (2017-06-27 23:23:37 UTC) #26

https://codereview.chromium.org/2951793003/diff/40001/src/compiler/x64/code-g...
File src/compiler/x64/code-generator-x64.cc (right):

https://codereview.chromium.org/2951793003/diff/40001/src/compiler/x64/code-g...
src/compiler/x64/code-generator-x64.cc:2284: __ pxor(src, kScratchDoubleReg);
On 2017/06/27 22:30:32, zvi wrote:
> On 2017/06/27 20:47:42, gdeepti wrote:
> > On 2017/06/24 23:04:09, zvi wrote:
> > > On 2017/06/22 17:26:48, gdeepti wrote:
> > > > On 2017/06/22 16:29:49, bbudge wrote:
> > > > > On 2017/06/21 03:44:01, gdeepti wrote:
> > > > > > This is similar to the code Clang generates for unsigned compares,
but
> > is
> > > > > worse
> > > > > > than it should be, because we do not have S128Consts. Given that,
> would
> > it
> > > > be
> > > > > > better to implement this using pminu(d/w/b) instructions as for
> > I32x4GeU?
> > > > > 
> > > > > I think we'll have to support S128 constants at some point. They're in
> the
> > > > spec
> > > > > proposal, and may be helpful in other contexts too, e.g. irregular
> shuffle
> > > > masks
> > > > > on ARM and MIPS.
> > > > > 
> > > > > These constants could be created by setting all 1's, then left shift
by
> > 31.
> > > It
> > > > > might be slightly faster.
> > > > > 
> > > > > However, it does seem simpler to use pmaxud (SSE 4.1?) here: a > b =>
> > max(a,
> > > > b)
> > > > > == a
> > > > > movd(scratch, dst)
> > > > > pmaxud(scratch, src)
> > > > > pcmpeqd(dst, scratch)
> > > > > 
> > > > > Similarly, pmaxub, pmaxus for other formats.
> > > > 
> > > > I like the simplicity of the pminud/pmaxud approach to implement the
> GeU/GtU
> > > > operations, I added both approaches to poll feedback, as gcc/clang both
> > > generate
> > > > code in terms of pxor/psubd, but for our use case pmaxu/pminu(d/w/b)
seems
> > to
> > > be
> > > > better, or at least that till we have S128Constants. I'll use
> pmaxud/pminud
> > > and 
> > > > add a TODO to reconsider when we have S128Constants. 
> > > 
> > > I agree that the pmax/pmin option is preferable for 128-bit or larger
> operands
> > > (pmin/pmax MMX throughput was reduced for mmx-types in recent processor
> > > generations so it is better to avoid these).
> > > 
> > > Here is a comparison of IACA reports for two alternatives in the
> 'tie-braker'
> > > case. The throughput is the same, but you save on code size and one
micro-op
> > > less:
> > > 
> > > Block Throughput: 2.00 Cycles
> > >  | Num Of |                    Ports pressure in cycles                   

> | 
> >  
> > > |
> > >  |  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7

> | 
> >  
> > > |
> > > 
> > >
> >
>
---------------------------------------------------------------------------------
> > >  |   1    |           | 0.3 |           |           |     | 0.6 |     |   

> |
> > CP
> > > | vpcmpgtd xmm0, xmm1, xmm0
> > >  |   1    |           | 0.7 |           |           |     | 0.3 |     |   

> | 
> >  
> > > | vpcmpeqd xmm1, xmm1, xmm1
> > >  |   1    | 0.8       | 0.1 |           |           |     | 0.2 |     |   

> |
> > CP
> > > | vpxor xmm0, xmm0, xmm1
> > >  Total Num Of Uops: 3
> > > 
> > > 
> > > 
> > > Block Throughput: 2.00 Cycles
> > >  | Num Of |                    Ports pressure in cycles                   

> | 
> >  
> > > |
> > >  |  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7

> | 
> >  
> > > |
> > > 
> > >
> >
>
---------------------------------------------------------------------------------
> > >  |   1    |           | 1.0 |           |           |     |     |     |   

> |
> > CP
> > > | vpmaxsd xmm2, xmm0, xmm1
> > >  |   1    |           |     |           |           |     | 1.0 |     |   

> |
> > CP
> > > | vpcmpeqd xmm0, xmm0, xmm2
> > >  Total Num Of Uops: 2
> > 
> > Zvi: 
> > 
> > Thanks for the detailed comparison. Reasons for using MMX instructions here:
> > The v8 assembler does not fully support the AVX instruction set - my
> assumption
> > here is that there is a possibility of encountering AVX/SSE transition
> penalties
> > if AVX instructions are used incosistently in the code generator. Is this
> > assumption correct? 
> > 
> > In the future when we have full assembler support for AVX instructions, the
> > macroassembler should be able to chose at runtime based on CPU support. 
> 
> I didn't mean to say you are selecting MMX instructions here - it appears to
be
> SSE which is fine. MMX is the older first generation SIMD instruction set that
> used 64-bit registers. IIUC, the code under discussion here is targeting SSE
> which shares the same execution resources as the 128-bit AVX instructions, so
in
> that respect SSE is not inferior. I'm sorry if my side-note about MMX mislead
> you.

Sorry for the confusion, I remembered stale state where we were using MMX
encodings, that is not the case anymore, we are using SSE encodings now. 

> 
> You are right that SSE and AVX code should not be mixed and in some cases you
> can hit high penalties if you do not clear the upper 128-bits of the YMM
> registers upon transition from AVX to SSE.

Thanks for clarifying! 

> 
> One thing i was wondering about:
> AVX introduces non-destructive sources which will make the 'if (dst.is(src))'
> checks unnecessary because with AVX the binary operations will not share an
> operand for source and dest. Of course the register allocator is free to
assign
> the same register to source and dest if source is killed. So will there need
to
> be a redesign of these interfaces for AVX?

We will have to take a look at these again for AVX, but it should hopefully be
straightforward to fix. Currently we have an optimization for all SIMD binops in
the instruction-selector that handles destructive sources, that will need to be
disabled or handled per operation.

bbudge

https://codereview.chromium.org/2951793003/diff/60001/src/compiler/x64/code-generator-x64.cc File src/compiler/x64/code-generator-x64.cc (right): https://codereview.chromium.org/2951793003/diff/60001/src/compiler/x64/code-generator-x64.cc#newcode2195 src/compiler/x64/code-generator-x64.cc:2195: __ psubd(dst, kScratchDoubleReg); Could you use PSIGNB/W/D instructions here? ...

3 years, 5 months ago (2017-06-28 00:05:08 UTC) #27

gdeepti

The CQ bit was checked by gdeepti@chromium.org to run a CQ dry run

3 years, 5 months ago (2017-06-28 21:19:10 UTC) #28

commit-bot: I haz the power

Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2951793003/80001

3 years, 5 months ago (2017-06-28 21:19:16 UTC) #29

gdeepti

https://codereview.chromium.org/2951793003/diff/60001/src/compiler/x64/code-generator-x64.cc File src/compiler/x64/code-generator-x64.cc (right): https://codereview.chromium.org/2951793003/diff/60001/src/compiler/x64/code-generator-x64.cc#newcode2195 src/compiler/x64/code-generator-x64.cc:2195: __ psubd(dst, kScratchDoubleReg); On 2017/06/28 00:05:08, bbudge wrote: > ...

3 years, 5 months ago (2017-06-28 21:28:17 UTC) #30

commit-bot: I haz the power

The CQ bit was unchecked by commit-bot@chromium.org

3 years, 5 months ago (2017-06-28 21:44:47 UTC) #31

commit-bot: I haz the power

Dry run: This issue passed the CQ dry run.

3 years, 5 months ago (2017-06-28 21:44:48 UTC) #32

bbudge

LGTM sorry to mis-direct you on GtU. https://codereview.chromium.org/2951793003/diff/60001/src/compiler/x64/code-generator-x64.cc File src/compiler/x64/code-generator-x64.cc (right): https://codereview.chromium.org/2951793003/diff/60001/src/compiler/x64/code-generator-x64.cc#newcode2281 src/compiler/x64/code-generator-x64.cc:2281: __ pxor(dst, ...

3 years, 5 months ago (2017-06-29 01:35:30 UTC) #33

gdeepti

gdeepti@chromium.org changed reviewers: + titzer@chromium.org

3 years, 5 months ago (2017-06-29 01:40:52 UTC) #34

gdeepti

+ mtrofin@/titzer@ for OWNERS in src/compiler. https://codereview.chromium.org/2951793003/diff/60001/src/compiler/x64/code-generator-x64.cc File src/compiler/x64/code-generator-x64.cc (right): https://codereview.chromium.org/2951793003/diff/60001/src/compiler/x64/code-generator-x64.cc#newcode2281 src/compiler/x64/code-generator-x64.cc:2281: __ pxor(dst, kScratchDoubleReg); ...

3 years, 5 months ago (2017-06-29 01:40:53 UTC) #35

titzer

On 2017/06/29 01:40:53, gdeepti wrote: > + mtrofin@/titzer@ for OWNERS in src/compiler. > > https://codereview.chromium.org/2951793003/diff/60001/src/compiler/x64/code-generator-x64.cc ...

3 years, 5 months ago (2017-06-29 08:52:10 UTC) #36

commit-bot: I haz the power

CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.org/2951793003/80001

3 years, 5 months ago (2017-06-29 16:05:55 UTC) #38

commit-bot: I haz the power

CQ is committing da patch. Bot data: {"patchset_id": 80001, "attempt_start_ts": 1498752348320490, "parent_rev": "36cfdf9e5783a1f1adddec9f4d0bfc56962853fe", "commit_rev": "3a9cd454237c55cea0cf0597b06d71e13cf1c894"}

3 years, 5 months ago (2017-06-29 16:07:29 UTC) #39

commit-bot: I haz the power

Description was changed from ========== [wasm] Implement remaining SIMD x64 compare ops, unops. Ops Implemented: ...

3 years, 5 months ago (2017-06-29 16:07:35 UTC) #40

commit-bot: I haz the power

3 years, 5 months ago (2017-06-29 16:07:36 UTC) #41

Message was sent while issue was closed.

Committed patchset #5 (id:80001) as
https://chromium.googlesource.com/v8/v8/+/3a9cd454237c55cea0cf0597b06d71e13cf...

Issue 2951793003: [wasm] Implement remaining SIMD x64 compare ops, unops. (Closed)

Description

Patch Set 1 #

Patch Set 2 : Cleanup #

Patch Set 3 : Fix bug #

Patch Set 4 : Use pmin/pmax instructions #

Patch Set 5 : Bill's review #

Messages