Created: 6 years, 3 months ago by brianderson
Modified: 6 years, 2 months ago
CC: chromium-reviews, erikwright+watch_chromium.org, jar (doing other things)
Base URL: https://chromium.googlesource.com/chromium/src.git@master
Project: chromium
Visibility: Public
Description

Avoid 64-bit divide when sampling timestamp
On some lower-end platforms, 64-bit integer division is
emulated or is significantly slower than 32-bit integer
division even if not emulated.
When converting ns to us in TimeTicks::Now(), we don't
need the range of a 64-bit integer because we are starting
with a 32-bit integer and making it smaller. There is no
need to convert it to 64 bits before the division.
BUG=none
Committed: https://crrev.com/738fcb80812fefc1811d7a0954d853a9cf294310
Cr-Commit-Position: refs/heads/master@{#297287}
Patch Set 1 #
Patch Set 2 : formatting #
Total comments: 2

Messages
Total messages: 20 (4 generated)
brianderson@chromium.org changed reviewers: + ajwong@chromium.org, willchan@chromium.org
On 2014/09/29 19:06:04, brianderson wrote:

Can you give a rationale for this?
I stumbled upon this when investigating something @jar was looking at where TimeTicks::Now() was potentially adding significant overhead. It didn't end up being the root issue, but we might as well optimize it anyway given how much more we are relying on timestamps throughout the system for scheduling or debugging purposes.

On some lower-end platforms, 64-bit integer division is emulated or is significantly slower than 32-bit integer division even if not emulated.

In this case, we don't need the range of a 64-bit integer because we are starting with a 32-bit integer and making it smaller. There is no need to convert it to 64 bits before the division.
On 2014/09/29 19:37:12, brianderson wrote:
> I stumbled upon this when investigating something @jar was looking at where
> TimeTicks::Now() was potentially adding significant overhead. It didn't end up
> being the root issue, but we might as well optimize it anyway given how much
> more we are relying on timestamps throughout the system for scheduling or
> debugging purposes.
>
> On some lower-end platforms, 64-bit integer division is emulated or is
> significantly slower than 32-bit integer division even if not emulated.
>
> In this case, we don't need the range of a 64-bit integer because we are
> starting with a 32-bit integer and making it smaller. There is no need to
> convert it to 64 bits before the division.

Sounds good. :) Can you dump that into the commit message? It's a good habit to have the motivation in the commit log. I'll LG after that.
Thanks, ptal.
LGTM
The CQ bit was checked by brianderson@chromium.org
CQ is trying da patch. Follow status at https://chromium-cq-status.appspot.com/patch-status/600013003/20001
Message was sent while issue was closed.
Committed patchset #2 (id:20001) as 4d8eb2ac774e64e77eaf793a4f3cd1cc7781b039
Message was sent while issue was closed.
Patchset 2 (id:??) landed as https://crrev.com/738fcb80812fefc1811d7a0954d853a9cf294310 Cr-Commit-Position: refs/heads/master@{#297287}
Message was sent while issue was closed.
jamesr@chromium.org changed reviewers: + jamesr@chromium.org
Message was sent while issue was closed.
https://codereview.chromium.org/600013003/diff/20001/base/time/time_posix.cc
File base/time/time_posix.cc (right):

https://codereview.chromium.org/600013003/diff/20001/base/time/time_posix.cc#...
base/time/time_posix.cc:101: (static_cast<int64>(ts.tv_nsec / base::Time::kNanosecondsPerMicrosecond));
i'm pretty sure this doesn't actually help, since kNanosecondsPerMicrosecond is:

static const int64 kNanosecondsPerMicrosecond = 1000;

so thanks to type promotion this is still a 64 bit divide.
Message was sent while issue was closed.
Yeah, on my box this makes no difference in the way ClockNow is compiled. Both with and without this patch, using gcc, the disassembly looks like:

00000000 <_ZN12_GLOBAL__N_18ClockNowEi>:
   0:	b530      	push	{r4, r5, lr}
   2:	b083      	sub	sp, #12
   4:	4604      	mov	r4, r0
   6:	4608      	mov	r0, r1
   8:	4669      	mov	r1, sp
   a:	f7ff fffe 	bl	0 <clock_gettime>
   e:	2200      	movs	r2, #0
  10:	2300      	movs	r3, #0
  12:	b988      	cbnz	r0, 38 <_ZN12_GLOBAL__N_18ClockNowEi+0x38>
  14:	f644 50d3 	movw	r0, #19923	; 0x4dd3
  18:	f244 2540 	movw	r5, #16960	; 0x4240
  1c:	f2c1 0062 	movt	r0, #4194	; 0x1062
  20:	f2c0 050f 	movt	r5, #15
  24:	9a01      	ldr	r2, [sp, #4]
  26:	fb82 0100 	smull	r0, r1, r2, r0
  2a:	17d2      	asrs	r2, r2, #31
  2c:	ebc2 12a1 	rsb	r2, r2, r1, asr #6
  30:	9900      	ldr	r1, [sp, #0]
  32:	17d3      	asrs	r3, r2, #31
  34:	fbc5 2301 	smlal	r2, r3, r5, r1
  38:	4620      	mov	r0, r4
  3a:	e9c4 2300 	strd	r2, r3, [r4]
  3e:	b003      	add	sp, #12
  40:	bd30      	pop	{r4, r5, pc}
  42:	bf00      	nop

clang compiles this slightly differently but i think it's doing the same thing. With a bit of tweaking:

diff --git a/base/time/time_posix.cc b/base/time/time_posix.cc
index ad00c51..237bdb5 100644
--- a/base/time/time_posix.cc
+++ b/base/time/time_posix.cc
@@ -97,8 +97,8 @@ base::TimeTicks ClockNow(clockid_t clk_id) {
   }

   absolute_micro =
-      (static_cast<int64>(ts.tv_sec) * base::Time::kMicrosecondsPerSecond) +
-      (static_cast<int64>(ts.tv_nsec / base::Time::kNanosecondsPerMicrosecond));
+      (static_cast<uint64_t>(ts.tv_sec) * base::Time::kMicrosecondsPerSecond) +
+      (static_cast<uint32_t>(ts.tv_nsec) / 1000u);

   return base::TimeTicks::FromInternalValue(absolute_micro);

I can get GCC to produce this instead:

00000000 <_ZN12_GLOBAL__N_18ClockNowEi>:
   0:	b510      	push	{r4, lr}
   2:	b082      	sub	sp, #8
   4:	4604      	mov	r4, r0
   6:	4608      	mov	r0, r1
   8:	4669      	mov	r1, sp
   a:	f7ff fffe 	bl	0 <clock_gettime>
   e:	2200      	movs	r2, #0
  10:	2300      	movs	r3, #0
  12:	b978      	cbnz	r0, 34 <_ZN12_GLOBAL__N_18ClockNowEi+0x34>
  14:	f644 52d3 	movw	r2, #19923	; 0x4dd3
  18:	f2c1 0262 	movt	r2, #4194	; 0x1062
  1c:	f244 2140 	movw	r1, #16960	; 0x4240
  20:	9b01      	ldr	r3, [sp, #4]
  22:	f2c0 010f 	movt	r1, #15
  26:	fba3 2302 	umull	r2, r3, r3, r2
  2a:	099a      	lsrs	r2, r3, #6
  2c:	2300      	movs	r3, #0
  2e:	9800      	ldr	r0, [sp, #0]
  30:	fbc1 2300 	smlal	r2, r3, r1, r0
  34:	4620      	mov	r0, r4
  36:	e9c4 2300 	strd	r2, r3, [r4]
  3a:	b002      	add	sp, #8
  3c:	bd10      	pop	{r4, pc}
  3e:	bf00      	nop

which seems a little better but doesn't benchmark any faster (or maybe I just suck at benchmarking, my perf seems busted so I'm not sure exactly what I'm hitting). :/
Message was sent while issue was closed.
pkasting@chromium.org changed reviewers: + pkasting@chromium.org
Message was sent while issue was closed.
https://codereview.chromium.org/600013003/diff/20001/base/time/time_posix.cc
File base/time/time_posix.cc (right):

https://codereview.chromium.org/600013003/diff/20001/base/time/time_posix.cc#...
base/time/time_posix.cc:101: (static_cast<int64>(ts.tv_nsec / base::Time::kNanosecondsPerMicrosecond));
On 2014/09/29 23:37:31, jamesr wrote:
> i'm pretty sure this doesn't actually help, since kNanosecondsPerMicrosecond is:
>
> static const int64 kNanosecondsPerMicrosecond = 1000;
>
> so thanks to type promotion this is still a 64 bit divide.

This. I'm in the process of cleaning up our 64-bit math around time constants in various places in the codebase, and frankly the right change here is probably to simply remove both casts entirely. Neither one will have any effect on the generated code, no matter how you parenthesize things.
Message was sent while issue was closed.
This should be a 32-bit operation where that's faster, however, since tv_nsec is always less than 1,000,000,000 and so the result of the division will definitely fit (easily) in 32 bits.
Message was sent while issue was closed.
Thanks for catching that, James.

Sad that actually making it a 32-bit divide doesn't really make a measurable difference on your machine. Were you testing on your desktop? If so, I can try on mobile tomorrow.

I've seen enough opcode latency charts to know that 64-bit divides have higher latencies than 32-bit divides in theory, so I'd still like to avoid 64-bit division where it makes sense. I wonder if other operations in a directed perf test of this function are somehow hiding the division latency.

Only a few of the Time constants actually need 64 bits to avoid overflow, but I'd be wary of just making the others 32 bits in case some code relies on implicit conversions. Would it make sense to add separate 32-bit versions of the constants?
Message was sent while issue was closed.
On 2014/09/30 01:25:00, brianderson wrote:
> Thanks for catching that, James.
>
> Sad that actually making it a 32-bit divide doesn't really make a measurable
> difference on your machine. Were you testing on your desktop? If so, I can try
> on mobile tomorrow.

Phone (Nexus 4)

> I've seen enough opcode latency charts to know that 64-bit divides have higher
> latencies than 32-bit divides in theory, so I'd still like to avoid 64-bit
> division where it makes sense. I wonder if other operations in a directed perf
> test of this function are somehow hiding the division latency.

I'd agree we should avoid 64-bit divides where possible - they have shown up as crazy slow - I'm just not sure manipulating this function with its constants and whatnot is actually producing better code.

> Only a few of the Time constants actually need 64 bits to avoid overflow, but
> I'd be wary of just making the others 32 bits in case some code relies on
> implicit conversions. Would it make sense to add separate 32-bit versions of
> the constants?

I don't think we really need a separate typed constant for the number of microseconds in a nanosecond.
Message was sent while issue was closed.
On 2014/09/30 01:25:00, brianderson wrote:
> I've seen enough opcode latency charts to know that 64-bit divides have higher
> latencies than 32-bit divides in theory, so I'd still like to avoid 64-bit
> division where it makes sense.

Please don't try to fix any other cases of this unless you have clear evidence that the fix you're landing actually improves perf measurably. We currently have a ridiculous number of places in our code that do a mix of 32- and 64-bit calculations with time constants. These produce tons of warnings in MSVC, which is part of why the warning in question has been disabled. In trying to fix these, it's often unclear whether doing the operation in 32 bits is safe. Every place you add more 32-bit operations, it adds more places where either it's not necessarily obvious to the reader that the operation is safe, or it IS safe now but could become unsafe someday. The right answer is to make sure we do most of these calculations in 64 bits and only use 32 bits for the truly performance-critical exceptions, which should be very rare, and measurable.

> Only a few of the Time constants actually need 64 bits to avoid overflow, but
> I'd be wary of just making the others 32 bits in case some code relies on
> implicit conversions.

You are right to be worried. I don't think such a change would necessarily be safe. You *might* be OK as long as you leave most of the other Time functions that return 64-bit quantities as 64 bits, but it'd be almost impossible to verify the safety of such a change.

> Would it make sense to add separate 32-bit versions of the constants?

See comments above. Don't do this unless you need to do it in order to fix specific perf issues. In general, we should not be attempting to purposefully lower the math we do with Time-related data to 32 bits.