|
|
Description[wasm] Implement remaining SIMD x64 compare ops, unops.
Ops Implemented: I32x4Neg, I32x4GtS, I32x4GeS, I32x4GtU, I32x4GeU,
I16x8Neg, I16x8GtS, I16x8GeS, I16x8GtU, I16x8GeU
I8x16Neg, I8x16GtS, I8x16GeS, I8x16GtU, I8x16GeU
S128Not
BUG=v8:6020
R=bbudge@chromium.org, zvi.rackover@intel.com, mtrofin@chromium.org
Review-Url: https://codereview.chromium.org/2951793003
Cr-Commit-Position: refs/heads/master@{#46329}
Committed: https://chromium.googlesource.com/v8/v8/+/3a9cd454237c55cea0cf0597b06d71e13cf1c894
Patch Set 1 #Patch Set 2 : Cleanup #Patch Set 3 : Fix bug #
Total comments: 9
Patch Set 4 : Use pmin/pmax instructions #
Total comments: 6
Patch Set 5 : Bill's review #
Messages
Total messages: 41 (23 generated)
The CQ bit was checked by gdeepti@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
The CQ bit was checked by gdeepti@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was checked by gdeepti@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
https://codereview.chromium.org/2951793003/diff/40001/src/compiler/x64/code-g... File src/compiler/x64/code-generator-x64.cc (right): https://codereview.chromium.org/2951793003/diff/40001/src/compiler/x64/code-g... src/compiler/x64/code-generator-x64.cc:2284: __ pxor(src, kScratchDoubleReg); This is similar to the code Clang generates for unsigned compares, but is worse than it should be, because we do not have S128Consts. Given that, would it be better to implement this using pminu(d/w/b) instructions as for I32x4GeU?
Description was changed from ========== [wasm] Implement remmaining SIMD x64 compare ops, unops. Ops Implemented: I32x4Neg, I32x4GtS, I32x4GeS, I32x4GtU, I32x4GeU, I16x8Neg, I16x8GtS, I16x8GeS, I16x8GtU, I16x8GeU I8x16Neg, I8x16GtS, I8x16GeS, I8x16GtU, I8x16GeU S128Not BUG=v8:6020 R=bbudge@chromium.org, zvi.rackover@intel.com, mtrofin@chromium.org ========== to ========== [wasm] Implement remaining SIMD x64 compare ops, unops. Ops Implemented: I32x4Neg, I32x4GtS, I32x4GeS, I32x4GtU, I32x4GeU, I16x8Neg, I16x8GtS, I16x8GeS, I16x8GtU, I16x8GeU I8x16Neg, I8x16GtS, I8x16GeS, I8x16GtU, I8x16GeU S128Not BUG=v8:6020 R=bbudge@chromium.org, zvi.rackover@intel.com, mtrofin@chromium.org ==========
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
https://codereview.chromium.org/2951793003/diff/40001/src/compiler/x64/code-g... File src/compiler/x64/code-generator-x64.cc (right): https://codereview.chromium.org/2951793003/diff/40001/src/compiler/x64/code-g... src/compiler/x64/code-generator-x64.cc:2284: __ pxor(src, kScratchDoubleReg); On 2017/06/21 03:44:01, gdeepti wrote: > This is similar to the code Clang generates for unsigned compares, but is worse > than it should be, because we do not have S128Consts. Given that, would it be > better to implement this using pminu(d/w/b) instructions as for I32x4GeU? I think we'll have to support S128 constants at some point. They're in the spec proposal, and may be helpful in other contexts too, e.g. irregular shuffle masks on ARM and MIPS. These constants could be created by setting all 1's, then left shift by 31. It might be slightly faster. However, it does seem simpler to use pmaxud (SSE 4.1?) here: a > b => max(a, b) == a movd(scratch, dst) pmaxud(scratch, src) pcmpeqd(dst, scratch) Similarly, pmaxub, pmaxus for other formats.
https://codereview.chromium.org/2951793003/diff/40001/src/compiler/x64/code-g... File src/compiler/x64/code-generator-x64.cc (right): https://codereview.chromium.org/2951793003/diff/40001/src/compiler/x64/code-g... src/compiler/x64/code-generator-x64.cc:2284: __ pxor(src, kScratchDoubleReg); On 2017/06/22 16:29:49, bbudge wrote: > On 2017/06/21 03:44:01, gdeepti wrote: > > This is similar to the code Clang generates for unsigned compares, but is > worse > > than it should be, because we do not have S128Consts. Given that, would it be > > better to implement this using pminu(d/w/b) instructions as for I32x4GeU? > > I think we'll have to support S128 constants at some point. They're in the spec > proposal, and may be helpful in other contexts too, e.g. irregular shuffle masks > on ARM and MIPS. > > These constants could be created by setting all 1's, then left shift by 31. It > might be slightly faster. > > However, it does seem simpler to use pmaxud (SSE 4.1?) here: a > b => max(a, b) > == a > movd(scratch, dst) > pmaxud(scratch, src) > pcmpeqd(dst, scratch) > > Similarly, pmaxub, pmaxus for other formats. I like the simplicity of the pminud/pmaxud approach to implement the GeU/GtU operations, I added both approaches to poll feedback, as gcc/clang both generate code in terms of pxor/psubd, but for our use case pmaxu/pminu(d/w/b) seems to be better, or at least that till we have S128Constants. I'll use pmaxud/pminud and add a TODO to reconsider when we have S128Constants.
A general question: can you reorder operands in the instruction selector so we can avoid having to negate the result when using pmin/pmax? https://codereview.chromium.org/2951793003/diff/40001/src/compiler/x64/code-g... File src/compiler/x64/code-generator-x64.cc (right): https://codereview.chromium.org/2951793003/diff/40001/src/compiler/x64/code-g... src/compiler/x64/code-generator-x64.cc:2254: __ movaps(kScratchDoubleReg, i.InputSimd128Register(1)); Can you use pminsd here to save 1 instruction?
On 2017/06/22 17:30:10, bbudge wrote: > A general question: can you reorder operands in the instruction selector so we > can avoid having to negate the result when using pmin/pmax? > > https://codereview.chromium.org/2951793003/diff/40001/src/compiler/x64/code-g... > File src/compiler/x64/code-generator-x64.cc (right): > > https://codereview.chromium.org/2951793003/diff/40001/src/compiler/x64/code-g... > src/compiler/x64/code-generator-x64.cc:2254: __ movaps(kScratchDoubleReg, > i.InputSimd128Register(1)); > Can you use pminsd here to save 1 instruction?
https://codereview.chromium.org/2951793003/diff/40001/src/compiler/x64/code-g... File src/compiler/x64/code-generator-x64.cc (right): https://codereview.chromium.org/2951793003/diff/40001/src/compiler/x64/code-g... src/compiler/x64/code-generator-x64.cc:2284: __ pxor(src, kScratchDoubleReg); On 2017/06/22 17:26:48, gdeepti wrote: > On 2017/06/22 16:29:49, bbudge wrote: > > On 2017/06/21 03:44:01, gdeepti wrote: > > > This is similar to the code Clang generates for unsigned compares, but is > > worse > > > than it should be, because we do not have S128Consts. Given that, would it > be > > > better to implement this using pminu(d/w/b) instructions as for I32x4GeU? > > > > I think we'll have to support S128 constants at some point. They're in the > spec > > proposal, and may be helpful in other contexts too, e.g. irregular shuffle > masks > > on ARM and MIPS. > > > > These constants could be created by setting all 1's, then left shift by 31. It > > might be slightly faster. > > > > However, it does seem simpler to use pmaxud (SSE 4.1?) here: a > b => max(a, > b) > > == a > > movd(scratch, dst) > > pmaxud(scratch, src) > > pcmpeqd(dst, scratch) > > > > Similarly, pmaxub, pmaxus for other formats. > > I like the simplicity of the pminud/pmaxud approach to implement the GeU/GtU > operations, I added both approaches to poll feedback, as gcc/clang both generate > code in terms of pxor/psubd, but for our use case pmaxu/pminu(d/w/b) seems to be > better, or at least that till we have S128Constants. I'll use pmaxud/pminud and > add a TODO to reconsider when we have S128Constants. I agree that the pmax/pmin option is preferable for 128-bit or larger operands (pmin/pmax MMX throughput was reduced for mmx-types in recent processor generations so it is better to avoid these). Here is a comparison of IACA reports for two alternatives in the 'tie-braker' case. The throughput is the same, but you save on code size and one micro-op less: Block Throughput: 2.00 Cycles | Num Of | Ports pressure in cycles | | | Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | | --------------------------------------------------------------------------------- | 1 | | 0.3 | | | | 0.6 | | | CP | vpcmpgtd xmm0, xmm1, xmm0 | 1 | | 0.7 | | | | 0.3 | | | | vpcmpeqd xmm1, xmm1, xmm1 | 1 | 0.8 | 0.1 | | | | 0.2 | | | CP | vpxor xmm0, xmm0, xmm1 Total Num Of Uops: 3 Block Throughput: 2.00 Cycles | Num Of | Ports pressure in cycles | | | Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | | --------------------------------------------------------------------------------- | 1 | | 1.0 | | | | | | | CP | vpmaxsd xmm2, xmm0, xmm1 | 1 | | | | | | 1.0 | | | CP | vpcmpeqd xmm0, xmm0, xmm2 Total Num Of Uops: 2
The CQ bit was checked by gdeepti@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
https://codereview.chromium.org/2951793003/diff/40001/src/compiler/x64/code-g... File src/compiler/x64/code-generator-x64.cc (right): https://codereview.chromium.org/2951793003/diff/40001/src/compiler/x64/code-g... src/compiler/x64/code-generator-x64.cc:2254: __ movaps(kScratchDoubleReg, i.InputSimd128Register(1)); On 2017/06/22 17:30:10, bbudge wrote: > Can you use pminsd here to save 1 instruction? Done. https://codereview.chromium.org/2951793003/diff/40001/src/compiler/x64/code-g... src/compiler/x64/code-generator-x64.cc:2284: __ pxor(src, kScratchDoubleReg); On 2017/06/24 23:04:09, zvi wrote: > On 2017/06/22 17:26:48, gdeepti wrote: > > On 2017/06/22 16:29:49, bbudge wrote: > > > On 2017/06/21 03:44:01, gdeepti wrote: > > > > This is similar to the code Clang generates for unsigned compares, but is > > > worse > > > > than it should be, because we do not have S128Consts. Given that, would it > > be > > > > better to implement this using pminu(d/w/b) instructions as for I32x4GeU? > > > > > > I think we'll have to support S128 constants at some point. They're in the > > spec > > > proposal, and may be helpful in other contexts too, e.g. irregular shuffle > > masks > > > on ARM and MIPS. > > > > > > These constants could be created by setting all 1's, then left shift by 31. > It > > > might be slightly faster. > > > > > > However, it does seem simpler to use pmaxud (SSE 4.1?) here: a > b => max(a, > > b) > > > == a > > > movd(scratch, dst) > > > pmaxud(scratch, src) > > > pcmpeqd(dst, scratch) > > > > > > Similarly, pmaxub, pmaxus for other formats. Bill: This returns true for the == case as well, using pmax + complement, does this look reasonable? > > > > I like the simplicity of the pminud/pmaxud approach to implement the GeU/GtU > > operations, I added both approaches to poll feedback, as gcc/clang both > generate > > code in terms of pxor/psubd, but for our use case pmaxu/pminu(d/w/b) seems to > be > > better, or at least that till we have S128Constants. I'll use pmaxud/pminud > and > > add a TODO to reconsider when we have S128Constants. > > I agree that the pmax/pmin option is preferable for 128-bit or larger operands > (pmin/pmax MMX throughput was reduced for mmx-types in recent processor > generations so it is better to avoid these). > > Here is a comparison of IACA reports for two alternatives in the 'tie-braker' > case. The throughput is the same, but you save on code size and one micro-op > less: > > Block Throughput: 2.00 Cycles > | Num Of | Ports pressure in cycles | > | > | Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | > | > > --------------------------------------------------------------------------------- > | 1 | | 0.3 | | | | 0.6 | | | CP > | vpcmpgtd xmm0, xmm1, xmm0 > | 1 | | 0.7 | | | | 0.3 | | | > | vpcmpeqd xmm1, xmm1, xmm1 > | 1 | 0.8 | 0.1 | | | | 0.2 | | | CP > | vpxor xmm0, xmm0, xmm1 > Total Num Of Uops: 3 > > > > Block Throughput: 2.00 Cycles > | Num Of | Ports pressure in cycles | > | > | Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | > | > > --------------------------------------------------------------------------------- > | 1 | | 1.0 | | | | | | | CP > | vpmaxsd xmm2, xmm0, xmm1 > | 1 | | | | | | 1.0 | | | CP > | vpcmpeqd xmm0, xmm0, xmm2 > Total Num Of Uops: 2 Zvi: Thanks for the detailed comparison. Reasons for using MMX instructions here: The v8 assembler does not fully support the AVX instruction set - my assumption here is that there is a possibility of encountering AVX/SSE transition penalties if AVX instructions are used incosistently in the code generator. Is this assumption correct? In the future when we have full assembler support for AVX instructions, the macroassembler should be able to chose at runtime based on CPU support.
https://codereview.chromium.org/2951793003/diff/40001/src/compiler/x64/code-g... File src/compiler/x64/code-generator-x64.cc (right): https://codereview.chromium.org/2951793003/diff/40001/src/compiler/x64/code-g... src/compiler/x64/code-generator-x64.cc:2284: __ pxor(src, kScratchDoubleReg); On 2017/06/27 20:47:42, gdeepti wrote: > On 2017/06/24 23:04:09, zvi wrote: > > On 2017/06/22 17:26:48, gdeepti wrote: > > > On 2017/06/22 16:29:49, bbudge wrote: > > > > On 2017/06/21 03:44:01, gdeepti wrote: > > > > > This is similar to the code Clang generates for unsigned compares, but > is > > > > worse > > > > > than it should be, because we do not have S128Consts. Given that, would > it > > > be > > > > > better to implement this using pminu(d/w/b) instructions as for > I32x4GeU? > > > > > > > > I think we'll have to support S128 constants at some point. They're in the > > > spec > > > > proposal, and may be helpful in other contexts too, e.g. irregular shuffle > > > masks > > > > on ARM and MIPS. > > > > > > > > These constants could be created by setting all 1's, then left shift by > 31. > > It > > > > might be slightly faster. > > > > > > > > However, it does seem simpler to use pmaxud (SSE 4.1?) here: a > b => > max(a, > > > b) > > > > == a > > > > movd(scratch, dst) > > > > pmaxud(scratch, src) > > > > pcmpeqd(dst, scratch) > > > > > > > > Similarly, pmaxub, pmaxus for other formats. > > > > > > I like the simplicity of the pminud/pmaxud approach to implement the GeU/GtU > > > operations, I added both approaches to poll feedback, as gcc/clang both > > generate > > > code in terms of pxor/psubd, but for our use case pmaxu/pminu(d/w/b) seems > to > > be > > > better, or at least that till we have S128Constants. I'll use pmaxud/pminud > > and > > > add a TODO to reconsider when we have S128Constants. > > > > I agree that the pmax/pmin option is preferable for 128-bit or larger operands > > (pmin/pmax MMX throughput was reduced for mmx-types in recent processor > > generations so it is better to avoid these). > > > > Here is a comparison of IACA reports for two alternatives in the 'tie-braker' > > case. The throughput is the same, but you save on code size and one micro-op > > less: > > > > Block Throughput: 2.00 Cycles > > | Num Of | Ports pressure in cycles | > > > | > > | Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | > > > | > > > > > --------------------------------------------------------------------------------- > > | 1 | | 0.3 | | | | 0.6 | | | > CP > > | vpcmpgtd xmm0, xmm1, xmm0 > > | 1 | | 0.7 | | | | 0.3 | | | > > > | vpcmpeqd xmm1, xmm1, xmm1 > > | 1 | 0.8 | 0.1 | | | | 0.2 | | | > CP > > | vpxor xmm0, xmm0, xmm1 > > Total Num Of Uops: 3 > > > > > > > > Block Throughput: 2.00 Cycles > > | Num Of | Ports pressure in cycles | > > > | > > | Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | > > > | > > > > > --------------------------------------------------------------------------------- > > | 1 | | 1.0 | | | | | | | > CP > > | vpmaxsd xmm2, xmm0, xmm1 > > | 1 | | | | | | 1.0 | | | > CP > > | vpcmpeqd xmm0, xmm0, xmm2 > > Total Num Of Uops: 2 > > Zvi: > > Thanks for the detailed comparison. Reasons for using MMX instructions here: > The v8 assembler does not fully support the AVX instruction set - my assumption > here is that there is a possibility of encountering AVX/SSE transition penalties > if AVX instructions are used incosistently in the code generator. Is this > assumption correct? > > In the future when we have full assembler support for AVX instructions, the > macroassembler should be able to chose at runtime based on CPU support. I didn't mean to say you are selecting MMX instructions here - it appears to be SSE which is fine. MMX is the older first generation SIMD instruction set that used 64-bit registers. IIUC, the code under discussion here is targeting SSE which shares the same execution resources as the 128-bit AVX instructions, so in that respect SSE is not inferior. I'm sorry if my side-note about MMX mislead you. You are right that SSE and AVX code should not be mixed and in some cases you can hit high penalties if you do not clear the upper 128-bits of the YMM registers upon transition from AVX to SSE. One thing i was wondering about: AVX introduces non-destructive sources which will make the 'if (dst.is(src))' checks unnecessary because with AVX the binary operations will not share an operand for source and dest. Of course the register allocator is free to assign the same register to source and dest if source is killed. So will there need to be a redesign of these interfaces for AVX?
https://codereview.chromium.org/2951793003/diff/40001/src/compiler/x64/code-g... File src/compiler/x64/code-generator-x64.cc (right): https://codereview.chromium.org/2951793003/diff/40001/src/compiler/x64/code-g... src/compiler/x64/code-generator-x64.cc:2284: __ pxor(src, kScratchDoubleReg); On 2017/06/27 22:30:32, zvi wrote: > On 2017/06/27 20:47:42, gdeepti wrote: > > On 2017/06/24 23:04:09, zvi wrote: > > > On 2017/06/22 17:26:48, gdeepti wrote: > > > > On 2017/06/22 16:29:49, bbudge wrote: > > > > > On 2017/06/21 03:44:01, gdeepti wrote: > > > > > > This is similar to the code Clang generates for unsigned compares, but > > is > > > > > worse > > > > > > than it should be, because we do not have S128Consts. Given that, > would > > it > > > > be > > > > > > better to implement this using pminu(d/w/b) instructions as for > > I32x4GeU? > > > > > > > > > > I think we'll have to support S128 constants at some point. They're in > the > > > > spec > > > > > proposal, and may be helpful in other contexts too, e.g. irregular > shuffle > > > > masks > > > > > on ARM and MIPS. > > > > > > > > > > These constants could be created by setting all 1's, then left shift by > > 31. > > > It > > > > > might be slightly faster. > > > > > > > > > > However, it does seem simpler to use pmaxud (SSE 4.1?) here: a > b => > > max(a, > > > > b) > > > > > == a > > > > > movd(scratch, dst) > > > > > pmaxud(scratch, src) > > > > > pcmpeqd(dst, scratch) > > > > > > > > > > Similarly, pmaxub, pmaxus for other formats. > > > > > > > > I like the simplicity of the pminud/pmaxud approach to implement the > GeU/GtU > > > > operations, I added both approaches to poll feedback, as gcc/clang both > > > generate > > > > code in terms of pxor/psubd, but for our use case pmaxu/pminu(d/w/b) seems > > to > > > be > > > > better, or at least that till we have S128Constants. I'll use > pmaxud/pminud > > > and > > > > add a TODO to reconsider when we have S128Constants. > > > > > > I agree that the pmax/pmin option is preferable for 128-bit or larger > operands > > > (pmin/pmax MMX throughput was reduced for mmx-types in recent processor > > > generations so it is better to avoid these). > > > > > > Here is a comparison of IACA reports for two alternatives in the > 'tie-braker' > > > case. The throughput is the same, but you save on code size and one micro-op > > > less: > > > > > > Block Throughput: 2.00 Cycles > > > | Num Of | Ports pressure in cycles > | > > > > > | > > > | Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 > | > > > > > | > > > > > > > > > --------------------------------------------------------------------------------- > > > | 1 | | 0.3 | | | | 0.6 | | > | > > CP > > > | vpcmpgtd xmm0, xmm1, xmm0 > > > | 1 | | 0.7 | | | | 0.3 | | > | > > > > > | vpcmpeqd xmm1, xmm1, xmm1 > > > | 1 | 0.8 | 0.1 | | | | 0.2 | | > | > > CP > > > | vpxor xmm0, xmm0, xmm1 > > > Total Num Of Uops: 3 > > > > > > > > > > > > Block Throughput: 2.00 Cycles > > > | Num Of | Ports pressure in cycles > | > > > > > | > > > | Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 > | > > > > > | > > > > > > > > > --------------------------------------------------------------------------------- > > > | 1 | | 1.0 | | | | | | > | > > CP > > > | vpmaxsd xmm2, xmm0, xmm1 > > > | 1 | | | | | | 1.0 | | > | > > CP > > > | vpcmpeqd xmm0, xmm0, xmm2 > > > Total Num Of Uops: 2 > > > > Zvi: > > > > Thanks for the detailed comparison. Reasons for using MMX instructions here: > > The v8 assembler does not fully support the AVX instruction set - my > assumption > > here is that there is a possibility of encountering AVX/SSE transition > penalties > > if AVX instructions are used incosistently in the code generator. Is this > > assumption correct? > > > > In the future when we have full assembler support for AVX instructions, the > > macroassembler should be able to chose at runtime based on CPU support. > > I didn't mean to say you are selecting MMX instructions here - it appears to be > SSE which is fine. MMX is the older first generation SIMD instruction set that > used 64-bit registers. IIUC, the code under discussion here is targeting SSE > which shares the same execution resources as the 128-bit AVX instructions, so in > that respect SSE is not inferior. I'm sorry if my side-note about MMX mislead > you. Sorry for the confusion, I remembered stale state where we were using MMX encodings, that is not the case anymore, we are using SSE encodings now. > > You are right that SSE and AVX code should not be mixed and in some cases you > can hit high penalties if you do not clear the upper 128-bits of the YMM > registers upon transition from AVX to SSE. Thanks for clarifying! > > One thing i was wondering about: > AVX introduces non-destructive sources which will make the 'if (dst.is(src))' > checks unnecessary because with AVX the binary operations will not share an > operand for source and dest. Of course the register allocator is free to assign > the same register to source and dest if source is killed. So will there need to > be a redesign of these interfaces for AVX? We will have to take a look at these again for AVX, but it should hopefully be straightforward to fix. Currently we have an optimization for all SIMD binops in the instruction-selector that handles destructive sources, that will need to be disabled or handled per operation.
https://codereview.chromium.org/2951793003/diff/60001/src/compiler/x64/code-g... File src/compiler/x64/code-generator-x64.cc (right): https://codereview.chromium.org/2951793003/diff/60001/src/compiler/x64/code-g... src/compiler/x64/code-generator-x64.cc:2195: __ psubd(dst, kScratchDoubleReg); Could you use PSIGNB/W/D instructions here? e.g. if (dst.is(src)) { __ pcmpeqd(kScratchDoubleReg, kScratchDoubleReg); // all 1's __ psignd(dst, kScratchDoubleReg); } else { ... http://www.felixcloutier.com/x86/PSIGNB:PSIGNW:PSIGND.html https://codereview.chromium.org/2951793003/diff/60001/src/compiler/x64/code-g... src/compiler/x64/code-generator-x64.cc:2281: __ pxor(dst, kScratchDoubleReg); I'm still wondering if you can switch the operands in instruction-selector-x64.cc for GtU ops. Then you can do an equivalent operation in 2 instructions using pminu. 'src' would be input 0, dst would be input 1, and dst would be DefineSameAsFirst.
The CQ bit was checked by gdeepti@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
https://codereview.chromium.org/2951793003/diff/60001/src/compiler/x64/code-g... File src/compiler/x64/code-generator-x64.cc (right): https://codereview.chromium.org/2951793003/diff/60001/src/compiler/x64/code-g... src/compiler/x64/code-generator-x64.cc:2195: __ psubd(dst, kScratchDoubleReg); On 2017/06/28 00:05:08, bbudge wrote: > Could you use PSIGNB/W/D instructions here? e.g. > > if (dst.is(src)) { > __ pcmpeqd(kScratchDoubleReg, kScratchDoubleReg); // all 1's > __ psignd(dst, kScratchDoubleReg); > } else { > ... > > http://www.felixcloutier.com/x86/PSIGNB:PSIGNW:PSIGND.html Done. https://codereview.chromium.org/2951793003/diff/60001/src/compiler/x64/code-g... src/compiler/x64/code-generator-x64.cc:2281: __ pxor(dst, kScratchDoubleReg); On 2017/06/28 00:05:08, bbudge wrote: > I'm still wondering if you can switch the operands in > instruction-selector-x64.cc for GtU ops. Then you can do an equivalent operation > in 2 instructions using pminu. 'src' would be input 0, dst would be input 1, and > dst would be DefineSameAsFirst. Not sure if I'm missing something, but I think the problem using pmax/pmin and pcmpeq instructions is that they return true for the == case. I32x4GtU(0, 0) = > will return true instead of false whatever the ordering of the operands.
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
LGTM sorry to mis-direct you on GtU. https://codereview.chromium.org/2951793003/diff/60001/src/compiler/x64/code-g... File src/compiler/x64/code-generator-x64.cc (right): https://codereview.chromium.org/2951793003/diff/60001/src/compiler/x64/code-g... src/compiler/x64/code-generator-x64.cc:2281: __ pxor(dst, kScratchDoubleReg); On 2017/06/28 21:28:17, gdeepti wrote: > On 2017/06/28 00:05:08, bbudge wrote: > > I'm still wondering if you can switch the operands in > > instruction-selector-x64.cc for GtU ops. Then you can do an equivalent > operation > > in 2 instructions using pminu. 'src' would be input 0, dst would be input 1, > and > > dst would be DefineSameAsFirst. > > Not sure if I'm missing something, but I think the problem using pmax/pmin and > pcmpeq instructions is that they return true for the == case. I32x4GtU(0, 0) = > > will return true instead of false whatever the ordering of the operands. No, you're not. This appears to be the best we can do and the Web seems to agree.
gdeepti@chromium.org changed reviewers: + titzer@chromium.org
+ mtrofin@/titzer@ for OWNERS in src/compiler. https://codereview.chromium.org/2951793003/diff/60001/src/compiler/x64/code-g... File src/compiler/x64/code-generator-x64.cc (right): https://codereview.chromium.org/2951793003/diff/60001/src/compiler/x64/code-g... src/compiler/x64/code-generator-x64.cc:2281: __ pxor(dst, kScratchDoubleReg); On 2017/06/29 01:35:30, bbudge wrote: > On 2017/06/28 21:28:17, gdeepti wrote: > > On 2017/06/28 00:05:08, bbudge wrote: > > > I'm still wondering if you can switch the operands in > > > instruction-selector-x64.cc for GtU ops. Then you can do an equivalent > > operation > > > in 2 instructions using pminu. 'src' would be input 0, dst would be input 1, > > and > > > dst would be DefineSameAsFirst. > > > > Not sure if I'm missing something, but I think the problem using pmax/pmin and > > pcmpeq instructions is that they return true for the == case. I32x4GtU(0, 0) = > > > > will return true instead of false whatever the ordering of the operands. > > No, you're not. This appears to be the best we can do and the Web seems to > agree. Thanks for confirming!
On 2017/06/29 01:40:53, gdeepti wrote: > + mtrofin@/titzer@ for OWNERS in src/compiler. > > https://codereview.chromium.org/2951793003/diff/60001/src/compiler/x64/code-g... > File src/compiler/x64/code-generator-x64.cc (right): > > https://codereview.chromium.org/2951793003/diff/60001/src/compiler/x64/code-g... > src/compiler/x64/code-generator-x64.cc:2281: __ pxor(dst, kScratchDoubleReg); > On 2017/06/29 01:35:30, bbudge wrote: > > On 2017/06/28 21:28:17, gdeepti wrote: > > > On 2017/06/28 00:05:08, bbudge wrote: > > > > I'm still wondering if you can switch the operands in > > > > instruction-selector-x64.cc for GtU ops. Then you can do an equivalent > > > operation > > > > in 2 instructions using pminu. 'src' would be input 0, dst would be input > 1, > > > and > > > > dst would be DefineSameAsFirst. > > > > > > Not sure if I'm missing something, but I think the problem using pmax/pmin > and > > > pcmpeq instructions is that they return true for the == case. I32x4GtU(0, 0) > = > > > > > > will return true instead of false whatever the ordering of the operands. > > > > No, you're not. This appears to be the best we can do and the Web seems to > > agree. > > Thanks for confirming! lgtm
The CQ bit was checked by gdeepti@chromium.org
CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
CQ is committing da patch. Bot data: {"patchset_id": 80001, "attempt_start_ts": 1498752348320490, "parent_rev": "36cfdf9e5783a1f1adddec9f4d0bfc56962853fe", "commit_rev": "3a9cd454237c55cea0cf0597b06d71e13cf1c894"}
Message was sent while issue was closed.
Description was changed from ========== [wasm] Implement remaining SIMD x64 compare ops, unops. Ops Implemented: I32x4Neg, I32x4GtS, I32x4GeS, I32x4GtU, I32x4GeU, I16x8Neg, I16x8GtS, I16x8GeS, I16x8GtU, I16x8GeU I8x16Neg, I8x16GtS, I8x16GeS, I8x16GtU, I8x16GeU S128Not BUG=v8:6020 R=bbudge@chromium.org, zvi.rackover@intel.com, mtrofin@chromium.org ========== to ========== [wasm] Implement remaining SIMD x64 compare ops, unops. Ops Implemented: I32x4Neg, I32x4GtS, I32x4GeS, I32x4GtU, I32x4GeU, I16x8Neg, I16x8GtS, I16x8GeS, I16x8GtU, I16x8GeU I8x16Neg, I8x16GtS, I8x16GeS, I8x16GtU, I8x16GeU S128Not BUG=v8:6020 R=bbudge@chromium.org, zvi.rackover@intel.com, mtrofin@chromium.org Review-Url: https://codereview.chromium.org/2951793003 Cr-Commit-Position: refs/heads/master@{#46329} Committed: https://chromium.googlesource.com/v8/v8/+/3a9cd454237c55cea0cf0597b06d71e13cf... ==========
Message was sent while issue was closed.
Committed patchset #5 (id:80001) as https://chromium.googlesource.com/v8/v8/+/3a9cd454237c55cea0cf0597b06d71e13cf... |