|
|
Created:
6 years, 5 months ago by regis Modified:
6 years, 5 months ago CC:
reviews_dartlang.org, vm-dev_dartlang.org Visibility:
Public. |
DescriptionOptimize the multiplication of two 32-bit unsigned integers.
R=johnmccutchan@google.com, srdjan@google.com, vegorov@google.com
Committed: https://code.google.com/p/dart/source/detail?r=38400
Patch Set 1 #
Total comments: 2
Patch Set 2 : #
Total comments: 13
Patch Set 3 : #
Total comments: 2
Patch Set 4 : #
Total comments: 5
Patch Set 5 : #
Messages
Total messages: 16 (0 generated)
https://codereview.chromium.org/396733006/diff/1/runtime/vm/intermediate_lang... File runtime/vm/intermediate_language_ia32.cc (right): https://codereview.chromium.org/396733006/diff/1/runtime/vm/intermediate_lang... runtime/vm/intermediate_language_ia32.cc:5937: __ shrl(right_hi, Immediate(30)); Maybe I misunderstand, but... Why test right_hi? Surely it is zero from the orl/jnz test? The product is in EDX:EAX which is aliased to left_hi:left_lo and out_hi:out_lo. Should shrl be removed? The flags generated by testl flags are killed by shrl, and doesn't shrl kill the high word of the result?
https://codereview.chromium.org/396733006/diff/1/runtime/vm/intermediate_lang... File runtime/vm/intermediate_language_ia32.cc (right): https://codereview.chromium.org/396733006/diff/1/runtime/vm/intermediate_lang... runtime/vm/intermediate_language_ia32.cc:5937: __ shrl(right_hi, Immediate(30)); On 2014/07/16 00:58:27, sra1 wrote: > Maybe I misunderstand, but... > > Why test right_hi? Surely it is zero from the orl/jnz test? > The product is in EDX:EAX which is aliased to left_hi:left_lo and out_hi:out_lo. > > Should shrl be removed? > The flags generated by testl flags are killed by shrl, and doesn't shrl kill the > high word of the result? Oops, the shrl is left over from a previous version and is redundant here. Yes, it should be removed. Also, it should be out_hi, not right_hi. I should have tested with --unbox_mints=false out_hi is guaranteed to be EDX and it contains the upper 32-bit of the result. I've added asserts. Thanks!
LGTM with comments https://codereview.chromium.org/396733006/diff/20001/runtime/vm/flow_graph_op... File runtime/vm/flow_graph_optimizer.cc (right): https://codereview.chromium.org/396733006/diff/20001/runtime/vm/flow_graph_op... runtime/vm/flow_graph_optimizer.cc:2041: HasOnlyTwoOf(ic_data, kInt32x4Cid)) { How do you prevent repeated deoptimizations? Maybe check 'ic_data.HasDeoptReason' as seen above? https://codereview.chromium.org/396733006/diff/20001/runtime/vm/flow_graph_op... runtime/vm/flow_graph_optimizer.cc:2042: operands_type = kInt32x4Cid; Optional alternative (something like this): } else if (HasOnlyTwoOf(ic_data, kInt32x4Cid)) { if (op_kind == Token::kMUL) { return; } operands_type = kInt32x4Cid; }
The Uint32 work isn't complete. Will discuss offline. https://codereview.chromium.org/396733006/diff/20001/runtime/vm/intermediate_... File runtime/vm/intermediate_language_arm.cc (right): https://codereview.chromium.org/396733006/diff/20001/runtime/vm/intermediate_... runtime/vm/intermediate_language_arm.cc:6356: __ smull(out, IP, left, right); Have you tested this code path? https://codereview.chromium.org/396733006/diff/20001/runtime/vm/intermediate_... File runtime/vm/intermediate_language_ia32.cc (right): https://codereview.chromium.org/396733006/diff/20001/runtime/vm/intermediate_... runtime/vm/intermediate_language_ia32.cc:6255: __ mull(right); // Result in EDX:EAX, CF set if EDX != 0. Uint32 operations are only used in places where we know we don't care about overflow.
LGTM after removing overflow check in BinaryUint32Op.
not lgtm lets not add optimistic special cases that have no disabling mechanism and that *must* be covered by uint32 work - this will just accumulate technical debt and muddy performance picture. https://codereview.chromium.org/396733006/diff/20001/runtime/vm/intermediate_... File runtime/vm/intermediate_language_ia32.cc (right): https://codereview.chromium.org/396733006/diff/20001/runtime/vm/intermediate_... runtime/vm/intermediate_language_ia32.cc:5939: __ j(NOT_ZERO, deopt); I am very wary of this kind of optimistic optimizations that have no built-in mechanism for disabling them. If somebody writes multiplication that from time to time multiplies smth non uint32-ish then their code will be stuck in the deoptimization loop. We saw example of why such situations are not desirable with Uint32List loads. Additionally this feels somewhat backwards to me: should not we just ensure that bigint arithmetic never emits this kind of multiplication? We started UInt32 support precisely for this reason. If it does not hit the code that we are writing ourselves then maybe we should fix it first.
Thanks all. PTAL https://codereview.chromium.org/396733006/diff/20001/runtime/vm/flow_graph_op... File runtime/vm/flow_graph_optimizer.cc (right): https://codereview.chromium.org/396733006/diff/20001/runtime/vm/flow_graph_op... runtime/vm/flow_graph_optimizer.cc:2041: HasOnlyTwoOf(ic_data, kInt32x4Cid)) { On 2014/07/16 16:55:23, srdjan wrote: > How do you prevent repeated deoptimizations? Maybe check > 'ic_data.HasDeoptReason' as seen above? It's done above in the same way as for ADD and SUB. https://codereview.chromium.org/396733006/diff/20001/runtime/vm/flow_graph_op... runtime/vm/flow_graph_optimizer.cc:2042: operands_type = kInt32x4Cid; On 2014/07/16 16:55:23, srdjan wrote: > Optional alternative (something like this): > > } else if (HasOnlyTwoOf(ic_data, kInt32x4Cid)) { > if (op_kind == Token::kMUL) { > return; > } > operands_type = kInt32x4Cid; > } Done, but assert not kMUL, rather than return false. https://codereview.chromium.org/396733006/diff/20001/runtime/vm/intermediate_... File runtime/vm/intermediate_language_arm.cc (right): https://codereview.chromium.org/396733006/diff/20001/runtime/vm/intermediate_... runtime/vm/intermediate_language_arm.cc:6356: __ smull(out, IP, left, right); On 2014/07/16 18:09:48, Cutch wrote: > Have you tested this code path? Yes, I have ran all the tests, including the RSA benchmark with --unbox_mints=true and false. I have not tested the path when we do not have ARMv7, however. But I changed this code not to check for overflow and can use the MUL instruction available on all supported ARM versions. https://codereview.chromium.org/396733006/diff/20001/runtime/vm/intermediate_... File runtime/vm/intermediate_language_ia32.cc (right): https://codereview.chromium.org/396733006/diff/20001/runtime/vm/intermediate_... runtime/vm/intermediate_language_ia32.cc:5939: __ j(NOT_ZERO, deopt); On 2014/07/16 19:51:22, Vyacheslav Egorov (Google wrote: > I am very wary of this kind of optimistic optimizations that have no built-in > mechanism for disabling them. If somebody writes multiplication that from time > to time multiplies smth non uint32-ish then their code will be stuck in the > deoptimization loop. > > We saw example of why such situations are not desirable with Uint32List loads. > > Additionally this feels somewhat backwards to me: should not we just ensure that > bigint arithmetic never emits this kind of multiplication? We started UInt32 > support precisely for this reason. If it does not hit the code that we are > writing ourselves then maybe we should fix it first. Fair enough. I have added a range check, so that this code is only generated if the inputs are guaranteed to be uint32. In other words, the code does not check the inputs anymore. However, the code can still deopt if the result is too large, but I think this is OK and no different than deopt for addition or subtraction. The new bigint implementation is not yes complete and critical parts will be hand coded in assembly. For now, 16-bit positive smis are multiplied and result in positive mints. In order for these to be handled by unit32 ops, corresponding mint ops must exist to be replaced. This is why I had to implement this mint code. https://codereview.chromium.org/396733006/diff/20001/runtime/vm/intermediate_... runtime/vm/intermediate_language_ia32.cc:6255: __ mull(right); // Result in EDX:EAX, CF set if EDX != 0. On 2014/07/16 18:09:48, Cutch wrote: > Uint32 operations are only used in places where we know we don't care about > overflow. I have removed the overflow check.
LGTM to unblock you. Please fix ARM before landing and revert changes related to range checking in optimizer. Fair point about needing MintOperation right now if we want our narrowing pass to find it. We can make that pass to operate on "non-lowered" arithmetic though. Ideally given two integers x and y stuff like (x * y) & 0xFFFFFFFF should always result in Uint32 operation independent of whether x * y can be optimistically lowered to Mint operation or not. https://codereview.chromium.org/396733006/diff/20001/runtime/vm/intermediate_... File runtime/vm/intermediate_language_arm.cc (right): https://codereview.chromium.org/396733006/diff/20001/runtime/vm/intermediate_... runtime/vm/intermediate_language_arm.cc:6142: __ smull(out_lo, out_hi, left_lo, right_lo); should not this be umul given that we look at hi? (smul and umul match at lo part but not on hi). I feel like this miscomputes result of 0xFFFFFFFF * 0xFFFFFFFF which would be 1 after smul but should not be. Please write a pure Dart test on this. https://codereview.chromium.org/396733006/diff/40001/runtime/vm/flow_graph_op... File runtime/vm/flow_graph_optimizer.cc (right): https://codereview.chromium.org/396733006/diff/40001/runtime/vm/flow_graph_op... runtime/vm/flow_graph_optimizer.cc:2057: if ((left_range == NULL) || Unfortunately range information is not available at this point yet so this check would always fail. You would have to resort for runtime check.
On 2014/07/16 23:51:35, Vyacheslav Egorov (Google wrote: > LGTM to unblock you. > > Please fix ARM before landing and revert changes related to range checking in > optimizer. > > Fair point about needing MintOperation right now if we want our narrowing pass > to find it. We can make that pass to operate on "non-lowered" arithmetic though. > > > Ideally given two integers x and y stuff like (x * y) & 0xFFFFFFFF should always > result in Uint32 operation independent of whether x * y can be optimistically > lowered to Mint operation or not. > > https://codereview.chromium.org/396733006/diff/20001/runtime/vm/intermediate_... > File runtime/vm/intermediate_language_arm.cc (right): > > https://codereview.chromium.org/396733006/diff/20001/runtime/vm/intermediate_... > runtime/vm/intermediate_language_arm.cc:6142: __ smull(out_lo, out_hi, left_lo, > right_lo); > should not this be umul given that we look at hi? (smul and umul match at lo > part but not on hi). > > I feel like this miscomputes result of 0xFFFFFFFF * 0xFFFFFFFF which would be 1 > after smul but should not be. > > Please write a pure Dart test on this. > > https://codereview.chromium.org/396733006/diff/40001/runtime/vm/flow_graph_op... > File runtime/vm/flow_graph_optimizer.cc (right): > > https://codereview.chromium.org/396733006/diff/40001/runtime/vm/flow_graph_op... > runtime/vm/flow_graph_optimizer.cc:2057: if ((left_range == NULL) || > Unfortunately range information is not available at this point yet so this check > would always fail. You would have to resort for runtime check. I'll chat with John to see what we can do for uint32 operations. For now, I have changed the mint multiplication to support two signed 32-bit inputs, rather than two positive 32-bit inputs. It does not make a huge difference, but it may be less surprising. Let me know what you think. This works almost as well for me in the bigint implementation. There is one case that is not covered anymore in the square function where I need to multiply a 32-bit unsigned digit by 2. Either we can solve that with better uint32 ops or by detecting that one operand is a power of two. Ideally, the mint multiplication should calculate the 128-bit product and deopt if it does not fit in 64-bit, but that is expensive. I'd prefer to rely on bigint for arbitrary large inputs. I think that using the BSR instruction (CLZ on ARM) to calculate the sum of the highest bits of the inputs and detect overflow is also too complicated. Please, let me know if you are more comfortable with this current version. Thanks, Regis
https://codereview.chromium.org/396733006/diff/20001/runtime/vm/intermediate_... File runtime/vm/intermediate_language_arm.cc (right): https://codereview.chromium.org/396733006/diff/20001/runtime/vm/intermediate_... runtime/vm/intermediate_language_arm.cc:6142: __ smull(out_lo, out_hi, left_lo, right_lo); On 2014/07/16 23:51:35, Vyacheslav Egorov (Google wrote: > should not this be umul given that we look at hi? (smul and umul match at lo > part but not on hi). > > I feel like this miscomputes result of 0xFFFFFFFF * 0xFFFFFFFF which would be 1 > after smul but should not be. > > Please write a pure Dart test on this. You are right. This should have been an unsigned multiplication as on IA32. In the next revision, it is back to a signed multiplication. See below. I agree that tests would be nice, but it is really tricky to target optimized code. The code could have called deopt and the result is still correct. Any tips are welcome. https://codereview.chromium.org/396733006/diff/40001/runtime/vm/flow_graph_op... File runtime/vm/flow_graph_optimizer.cc (right): https://codereview.chromium.org/396733006/diff/40001/runtime/vm/flow_graph_op... runtime/vm/flow_graph_optimizer.cc:2057: if ((left_range == NULL) || On 2014/07/16 23:51:35, Vyacheslav Egorov (Google wrote: > Unfortunately range information is not available at this point yet so this check > would always fail. You would have to resort for runtime check. You are right. I removed this range check.
LGTM https://codereview.chromium.org/396733006/diff/20001/runtime/vm/intermediate_... File runtime/vm/intermediate_language_arm.cc (right): https://codereview.chromium.org/396733006/diff/20001/runtime/vm/intermediate_... runtime/vm/intermediate_language_arm.cc:6142: __ smull(out_lo, out_hi, left_lo, right_lo); > I agree that tests would be nice, but it is really tricky to target optimized > code. The code could have called deopt and the result is still correct. Any tips > are welcome. Well this particular case would not call a deopt right? 0xFFFFFFFF has hi == 0 so the first deopt will not happen. Result of multplication with smull would have hi == 0 too so the second deopt would not happen either. I just wanted to have a regression test for cases that miscomputes. If deopt happens we are fine :) https://codereview.chromium.org/396733006/diff/60001/runtime/vm/intermediate_... File runtime/vm/intermediate_language_ia32.cc (right): https://codereview.chromium.org/396733006/diff/60001/runtime/vm/intermediate_... runtime/vm/intermediate_language_ia32.cc:5924: Register temp = locs()->temp(0).reg(); Please add TODO here and in another place use range analysis to remove this check whenever possible
https://codereview.chromium.org/396733006/diff/60001/runtime/vm/flow_graph_op... File runtime/vm/flow_graph_optimizer.cc (right): https://codereview.chromium.org/396733006/diff/60001/runtime/vm/flow_graph_op... runtime/vm/flow_graph_optimizer.cc:2041: ASSERT(op_kind != Token::kMUL); Why can this be guaranteed?
lgtm https://codereview.chromium.org/396733006/diff/60001/runtime/vm/flow_graph_op... File runtime/vm/flow_graph_optimizer.cc (right): https://codereview.chromium.org/396733006/diff/60001/runtime/vm/flow_graph_op... runtime/vm/flow_graph_optimizer.cc:2041: ASSERT(op_kind != Token::kMUL); On 2014/07/18 16:32:54, srdjan wrote: > Why can this be guaranteed? Because Int32x4 doesn't have a multiply operation.
Thanks! https://codereview.chromium.org/396733006/diff/60001/runtime/vm/flow_graph_op... File runtime/vm/flow_graph_optimizer.cc (right): https://codereview.chromium.org/396733006/diff/60001/runtime/vm/flow_graph_op... runtime/vm/flow_graph_optimizer.cc:2041: ASSERT(op_kind != Token::kMUL); On 2014/07/18 16:39:39, Cutch wrote: > On 2014/07/18 16:32:54, srdjan wrote: > > Why can this be guaranteed? > > Because Int32x4 doesn't have a multiply operation. I have added a comment. https://codereview.chromium.org/396733006/diff/60001/runtime/vm/intermediate_... File runtime/vm/intermediate_language_ia32.cc (right): https://codereview.chromium.org/396733006/diff/60001/runtime/vm/intermediate_... runtime/vm/intermediate_language_ia32.cc:5924: Register temp = locs()->temp(0).reg(); On 2014/07/18 11:15:35, Vyacheslav Egorov (Google wrote: > Please add TODO here and in another place use range analysis to remove this > check whenever possible Done.
Message was sent while issue was closed.
Committed patchset #5 manually as r38400 (presubmit successful). |