Created: 3 years, 6 months ago by hans
Modified: 3 years, 5 months ago
Reviewers: jochen (gone - plz use gerrit), Michael Starzinger, Michael Lippautz, rmcilroy, Nico, vogelheim, marja
CC: Hannes Payer (out of office), marja+watch_chromium.org, v8-reviews_googlegroups.com, Yang
Target Ref: refs/heads/master
Project: v8
Visibility: Public.
Description
Make some functions that are hit during renderer startup available for inlining
This is towards closing the perf gap between the MSVC build (which uses link-
time optimization) and Clang (where LTO isn't ready on Windows yet). We did
a study (see bug) to see which non-inlined functions are hit a lot during render
start-up, and which would be inlined during LTO. This should benefit performance
in all builds which currently don't use LTO (Android, Linux, Mac) as well as
the Win/Clang build.
The binary size of chrome_child.dll increases by 2KB with this.
BUG=chromium:728324
CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.linux:linux_chromium_compile_dbg_ng;master.tryserver.chromium.mac:mac_chromium_compile_dbg_ng
Review-Url: https://codereview.chromium.org/2950993002
Cr-Commit-Position: refs/heads/master@{#46229}
Committed: https://chromium.googlesource.com/v8/v8/+/777da354d20286d39048f1421d89fa109e38b9e1
Patch Set 1 #
Total comments: 2
Patch Set 2 : don't expose bytecode-traits.h in bytecodes.h #
Total comments: 3
Patch Set 3 : rebase #
Messages
Total messages: 66 (22 generated)
hans@chromium.org changed reviewers: + mstarzinger@chromium.org, vogelheim@chromium.org
Please take a look.
The CQ bit was checked by hans@chromium.org to run a CQ dry run
Dry run: CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Dry run: This issue passed the CQ dry run.
rmcilroy@chromium.org changed reviewers: + rmcilroy@chromium.org
https://codereview.chromium.org/2950993002/diff/1/src/interpreter/bytecodes.h
File src/interpreter/bytecodes.h (right):
https://codereview.chromium.org/2950993002/diff/1/src/interpreter/bytecodes.h...
src/interpreter/bytecodes.h:14: #include "src/interpreter/bytecode-traits.h"
I'd prefer not to expose bytecode-traits in bytecodes.h - bytecode-traits is an icky internal file and bytecodes.h is one of the headers that is used outside of interpreter/. Could you instead make kOperandSizes a static const field initialized at the top of bytecode.cc and just inline the reading of that field here?
https://codereview.chromium.org/2950993002/diff/1/src/interpreter/bytecodes.h
File src/interpreter/bytecodes.h (right):
https://codereview.chromium.org/2950993002/diff/1/src/interpreter/bytecodes.h...
src/interpreter/bytecodes.h:14: #include "src/interpreter/bytecode-traits.h"
On 2017/06/20 23:20:32, rmcilroy wrote:
> I'd prefer not to expose bytecode-traits in bytecodes.h - bytecode-traits is an
> icky internal file and bytecodes.h is one of the headers that is used outside of
> interpreter/. Could you instead make kOperandSizes a static const field
> initialized at the top of bytecode.cc and just inline the reading of that field
> here?

Sounds good to me. There already is a kOperandSizes, so I called it kOperandKindSizes.
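The table-behind-an-inline-accessor pattern rmcilroy suggests can be sketched as follows. This is a hypothetical, heavily simplified reconstruction, not V8's actual code: the enum values, the sizes, and the class layout here are illustrative only.

```cpp
// --- bytecodes.h (sketch) ---
#include <cstdint>

// Illustrative operand kinds; V8's real set is generated via macros.
enum class OperandKind : uint8_t { kByte, kShort, kQuad };

class Bytecodes {
 public:
  // Inline accessor: call sites compile down to a single table load,
  // without the header ever including the internal bytecode-traits.h.
  static int SizeOfOperandKind(OperandKind kind) {
    return kOperandKindSizes[static_cast<int>(kind)];
  }

 private:
  // Declared here, defined (and in the real CL, computed from the
  // traits machinery) in the .cc file.
  static const int kOperandKindSizes[];
};

// --- bytecodes.cc (sketch) ---
// This is where bytecode-traits.h would be included to populate the
// table; here the sizes are simply spelled out.
const int Bytecodes::kOperandKindSizes[] = {1, 2, 4};
```

The effect is the one discussed in the thread: the "icky" traits header stays private to the .cc file, while callers of the accessor avoid the overhead of an out-of-line function call.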
vogelheim@chromium.org changed reviewers: + jochen@chromium.org
Generally, this looks good, and 2kB code size certainly looks like an acceptable trade-off.

The main concern I have is code health: We already have a huge amount of code in our headers and have done several clean-ups to get our header size down, and this will increase it even further. @Jochen: Do you have an opinion?

Pro:
- Hans diligently measured this, and this has a provable positive impact on renderer start-up, for a very modest code size penalty.

Con:
- Goes against our current ideas about code health.
- Seems to be for a temporary problem (note: "(where LTO isn't ready on Windows yet)" in the CL description), and I think it's safe to say that no-one will remove this code once the actual reason for it (lack of link-time optimization on this compiler & platform) has been rectified.

https://codereview.chromium.org/2950993002/diff/20001/src/heap/objects-visiti...
File src/heap/objects-visiting.h (right):
https://codereview.chromium.org/2950993002/diff/20001/src/heap/objects-visiti...
src/heap/objects-visiting.h:89: static inline VisitorId GetVisitorId(int instance_type, int instance_size,
[Here & elsewhere] Could we use the V8_INLINE macro?
what's the expected difference in startup from this change?

https://codereview.chromium.org/2950993002/diff/20001/src/heap/objects-visiti...
File src/heap/objects-visiting.h (right):
https://codereview.chromium.org/2950993002/diff/20001/src/heap/objects-visiti...
src/heap/objects-visiting.h:89: static inline VisitorId GetVisitorId(int instance_type, int instance_size,
On 2017/06/21 at 10:18:02, vogelheim wrote:
> [Here & elsewhere] Could we use the V8_INLINE macro?

V8_INLINE expands to __attribute__((always_inline)) - that might be acceptable, but we should re-measure the impact then.
Re code health: I looked at v8's slow build time a while ago, and back then the main issue was that some basic v8 header (and thereby every cc file in v8) included many c++ standard headers. This cl doesn't affect that property (it didn't add many includes in headers iirc -- and even if it did, it doesn't add stdlib includes) (but phone).

On 2017/06/21 10:24:21, jochen wrote:
> what's the expected difference in startup from this change?

We don't know. We do know that LTO builds of chrome are faster (but also unusably large), and these are some of the functions in v8 called most during renderer startup that also happen to be inlined with lto. We hope to land this and then look at UMA to see what changed. (See bug.) If it doesn't help, we can revert this again.

> https://codereview.chromium.org/2950993002/diff/20001/src/heap/objects-visiti...
> File src/heap/objects-visiting.h (right):
>
> https://codereview.chromium.org/2950993002/diff/20001/src/heap/objects-visiti...
> src/heap/objects-visiting.h:89: static inline VisitorId GetVisitorId(int
> instance_type, int instance_size,
> On 2017/06/21 at 10:18:02, vogelheim wrote:
> > [Here & elsewhere] Could we use the V8_INLINE macro?
>
> V8_INLINE expands to __attribute__((always_inline)) - that might be acceptable,
> but we should re-meassure the impact then

I think we should let the compiler decide if inlining is worth it.
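For context on the V8_INLINE exchange above, a force-inline macro of that shape typically looks roughly like the sketch below. The MY_V8_INLINE name and the exact #if chain are illustrative reconstructions, not V8's actual definition; the point is the contrast with plain `inline`, which (as Nico argues) leaves the decision to the compiler's heuristics.

```cpp
// Hypothetical reconstruction of a V8_INLINE-style macro. Plain `inline`
// is only a hint; always_inline / __forceinline make the compiler inline
// the function even when its inlining budget would otherwise decline.
#if defined(__GNUC__) || defined(__clang__)
#define MY_V8_INLINE inline __attribute__((always_inline))
#elif defined(_MSC_VER)
#define MY_V8_INLINE __forceinline
#else
#define MY_V8_INLINE inline
#endif

// Example: a tiny function where the call overhead would rival the body,
// the kind of candidate this CL targets.
MY_V8_INLINE int SquareForInlining(int x) { return x * x; }
```

This is the trade-off being weighed: forcing inlining guarantees the call disappears but bypasses the compiler's cost model, whereas the CL's `static inline` approach merely makes the body available and lets each call site be decided individually.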
Can you quantify the impact of inlining the Scavenger slow path? ScavengeObjectSlow should result in proper visitation and copying which should both dominate the call here.
On 2017/06/20 23:49:08, hans wrote: > https://codereview.chromium.org/2950993002/diff/1/src/interpreter/bytecodes.h > File src/interpreter/bytecodes.h (right): > > https://codereview.chromium.org/2950993002/diff/1/src/interpreter/bytecodes.h... > src/interpreter/bytecodes.h:14: #include "src/interpreter/bytecode-traits.h" > On 2017/06/20 23:20:32, rmcilroy wrote: > > I'd prefer not to expose bytecode-traits in bytecodes.h - bytecode-traits is > an > > icky internal file and bytecodes.h is one of the headers that is used outside > of > > interpreter/. Could you instead make kOperandSizes a static const field > > initialized at the top of bytecode.cc and just inline the reading of that > field > > here? > > Sounds good to me. There already is a kOperandSizes, so I called it > kOperandKindSizes. Looks good, thanks. I'll let other reviewers approve for the overall CL.
marja@chromium.org changed reviewers: + marja@chromium.org
Q1: How much impact does this CL have?

Q2 (more meta): If de-inlining the ast part (see comment below) is important, how come no benchmark tanked when I did the opposite change? Are we missing benchmarks?

https://codereview.chromium.org/2950993002/diff/20001/src/ast/scopes.h
File src/ast/scopes.h (right):
https://codereview.chromium.org/2950993002/diff/20001/src/ast/scopes.h#newcode8
src/ast/scopes.h:8: #include "src/ast/ast.h"
The changes in this file are exactly reversing the ones I made a while back...
Ah, noticed thakis's answer. The current problem with build times we're trying to address is the "I touch something trivial in v8 -> I need to rebuild the whole v8" problem, not high build times as such. For fixing that, it's necessary to make the #includes not be out of control as they are now. (Ofc if we solved the C++ header thing thakis is mentioning, maybe rebuilding all of v8 would be less of a problem?) And I noticed you already answered the question about the impact and promised to report back and potentially revert if this was not relevant. So ast lgtm, you'll still need owners reviews from others.
parsing/ lgtm too
I realize this is trading a bit of build time and header complexity against performance. In a future where we're using LTO on all platforms, this won't be necessary. In a way, this is a temporary problem yes, but it's a step towards getting to using Clang on Windows, which is a step towards Clang and LTO everywhere.

In terms of expected impact, we don't know exactly yet because we're targeting a UMA metric rather than a local benchmark. What we do know is the number of times these functions are called during render start and NTP bring-up, and that they do get inlined in LTO builds. Many of them also look like obviously good candidates based on their source.

> Can you quantify the impact of inlining the Scavenger slow path?
>
> ScavengeObjectSlow should result in proper visitation and copying which should both dominate the call here.

It was called 80931 times in my measurement, and it gets inlined (at least in some cases) in an LTO build. Even if the body dominates the call, the call still takes time, and we're going after a "death by a thousand cuts" situation here.

I think Nico addressed most comments already. Is there anything I missed? We'd like this to go in the next dev release so we can check back on that UMA metric :-)
vogelheim: owner ping for snapshot/ mstarzinger: owner ping for heap/ and interpreter/ As I mentioned before, we're keen to get this in for the next Dev release.
mstarzinger@chromium.org changed reviewers: + mlippautz@chromium.org
Sorry for delay, wasn't aware I was a required reviewer on this. LGTM on "interpreter", for "heap" I'll defer to Michael Lippautz.
Are you solely evaluating based on UMA call counts here? Call counts do not necessarily relate to execution times. You might push the compiler over some inlining budget (because the current code base is already too aggressive with inlining) and regress some other metric.

How are you determining the impact of this CL based on UMA? There are dozens of CLs pushed every day that change inlining behavior. Even within a busy roll it could already be hard to nail this down, not even talking about a whole Canary release.
On 2017/06/22 15:55:48, Michael Lippautz wrote:
> Are you solely evaluating based on UMA call counts here? Call counts do not
> necessarily related to execution times. You might push the compiler over some
> inlining budget (because the current code base is already too aggressive with
> inlining) and regress some other metric.

The call counts are from a local instrumented build. These functions were found based on those counts, combined with the information about what functions are inlined in LTO builds. We're trying to analyze what's different between the MSVC and Clang build, and the data points to these functions.

We use the UMA metric to evaluate (NewTabPage.Loadtime) if this works or not. Since the binary size was small, I don't expect I drastically changed the inlining decisions.

> How are you determining the impact of this CL based on UMA? There are dozens of
> CLs pushed every day that change inlining behavior. Even within a busy roll it
> could already be hard to nail this down, not even talking about a whole Canary
> release.

We're looking at the difference between MSVC and Clang builds on the UMA metric, and that doesn't seem to move much between releases.

> (vogelheim wrote, but it didn't go to rietveld for some reason)
> I'm still not particularly happy, since this trades a permanent maintenance
> burden vs a temporary, potential performance gain.

I understand your concern, but I really don't think these are too bad. Looking at the functions, most of them are obvious inlining wins. For example, inlining SizeOfOperand should result in a single load, which is significantly smaller than the overhead of doing the call. I think until the use of LTO is more wide-spread, this is something performance-focused C++ code has to live with.

> How much does it gain when trying this locally? I understand you're "targeting
> a UMA metric", but surely you've measured this locally at least once, on some
> machine, right?

I see 2 ms drop of the median time to refresh the NTP (50 samples), but it's pretty noisy so it's hard to tell (also I'm not sure if some things might be cached during the reload).

> What is the 'success' criterion that would determine whether this stays in or
> not?

That it shrinks the difference between MSVC and Clang builds on NewTabPage.Loadtime in a measurable way. If we get 10ms from this and the inlinings we've landed in chrome, we'll be very happy.
I am fine with inlining in the Scavenger in this CL.

The approach in general does not scale though. We cannot just prematurely optimize based on the assumption that it will move the needle on UMA. What if UMA does not improve? Are you going to revert this or are the V8 devs going to be stuck with this change?

I have been looking at NewTabPage.LoadTime for Win dev now. We are in the range of 1100ms. 2ms is roughly 1 permille on this metric. Where can I see the clang build on UMA?

How do the 2KB relate to the overall size?

On 2017/06/23 03:23:07, hans wrote:
> On 2017/06/22 15:55:48, Michael Lippautz wrote:
> > Are you solely evaluating based on UMA call counts here? Call counts do not
> > necessarily related to execution times. You might push the compiler over some
> > inlining budget (because the current code base is already too aggressive with
> > inlining) and regress some other metric.
>
> The call counts are from a local instrumented build. These functions were found
> based on those counts, combined with the information about what functions are
> inlined in LTO builds. We're trying to analyze what's different between the MSVC
> and Clang build, and the data points to these functions.
>
> We use the UMA metric to evaluate (NewTabPage.Loadtime) if this works or not.
> Since the binary size was small, I don't expect I drastically changed the
> inlining decisions.
>
> > How are you determining the impact of this CL based on UMA? There are dozens of
> > CLs pushed every day that change inlining behavior. Even within a busy roll it
> > could already be hard to nail this down, not even talking about a whole Canary
> > release.
>
> We're looking at the difference between MSVC and Clang builds on the UMA metric,
> and that doesn't seem to move much between releases.
>
> > (vogelheim wrote, but it didn't go to rietveld for some reason)
> > I'm still not particularly happy, since this trades a permanent maintenance
> > burden vs a temporary, potential performance gain.
>
> I understand your concern, but I really don't think these are too bad. Looking
> at the functions, most of them are obvious inlining wins. For example, inlining
> SizeOfOperand should result in a single load, which is significantly smaller
> than the overhead of doing the call. I think until the use of LTO is more
> wide-spread, this is something performance-focused C++ code has to live with.
>
> > How much does it gain when trying this locally? I understand you're "targeting
> > a UMA metric", but surely you've measured this locally at least once, on some
> > machine, right?
>
> I see 2 ms drop of the median time to refresh the NTP (50 samples), but it's
> pretty noisy so it's hard to tell (also I'm not sure if some things might be
> cached during the reload).
>
> > What is the 'success' criterion that would determine whether this stays in or
> > not?
>
> That it shrinks the difference between MSVC and Clang builds on
> NewTabPage.Loadtime in a measureable way. If we get 10ms from this and the
> inlinings we've landed in chrome, we'll be very happy.
On 2017/06/23 09:15:01, Michael Lippautz wrote: > I am fine with inlining in the Scavenger in this CL. > > The approach in general does not scale though. We cannot just prematurely > optimize based on the assumption that it will move the needle on UMA. What if > UMA does not improve? Are you going to revert this or are the V8 devs going to > be stuck with this change? > > I have been looking at NewTabPage.LoadTime for Win dev now. We are in the range > of 1100ms. 2ms is roughly 1 permille on this metric. Where can is see the clang > build on UMA? > > How are the 2KB relating to the overall size? > > On 2017/06/23 03:23:07, hans wrote: > > On 2017/06/22 15:55:48, Michael Lippautz wrote: > > > Are you solely evaluating based on UMA call counts here? Call counts do not > > > necessarily related to execution times. You might push the compiler over > some > > > inlining budget (because the current code base is already too aggressive > with > > > inlining) and regress some other metric. > > > > The call counts are from a local instrumented build. These functions were > found > > based on those counts, combined with the information about what functions are > > inlined in LTO builds. We're trying to analyze what's different between the > MSVC > > and Clang build, and the data points to these functions. > > > > We use the UMA metric to evaluate (NewTabPage.Loadtime) if this works or not. > > Since the binary size was small, I don't expect I drastically changed the > > inlining decisions. > > > > > How are you determining the impact of this CL based on UMA? There are dozens > > of > > > CLs pushed every day that change inlining behavior. Even within a busy roll > it > > > could already be hard to nail this down, not even talking about a whole > Canary > > > release. > > > > We're looking at the difference between MSVC and Clang builds on the UMA > metric, > > and that doesn't seem to move much between releases. 
> > > > > (vogelheim wrote, but it didn't go to rietveld for some reason) > > > I'm still not particularly happy, since this trades a permanent maintenance > > burden vs a temporary, potential performance gain. > > > > I understand your concern, but I really don't think these are too bad. Looking > > at the functions, most of them are obvious inlining wins. For example, > inlining > > SizeOfOperand should result in a single load, which is significantly smaller > > than the overhead of doing the call. I think until the use of LTO is more > > wide-spread, this is something performance-focused C++ code has to live with. > > > > > How much does it gain when trying this locally? I understand you're > "targeting > > a UMA metric", but surely you've measured this locally at least once, on some > > machine, right? > > > > I see 2 ms drop of the median time to refresh the NTP (50 samples), but it's > > pretty noisy so it's hard to tell (also I'm not sure if some things might be > > cached during the reload). > > > > > What is the 'success' criterion that would determine whether this stays in > or > > not? > > > > That it shrinks the difference between MSVC and Clang builds on > > NewTabPage.Loadtime in a measureable way. If we get 10ms from this and the > > inlinings we've landed in chrome, we'll be very happy. ... which brings us back to one of my previous questions: would it be possible to see this effect on a smaller, less noisy benchmark? Don't we run the relevant benchmarks on Win, or why haven't we seen any regressions?
On 2017/06/23 09:15:01, Michael Lippautz wrote:
> I am fine with inlining in the Scavenger in this CL.
>
> The approach in general does not scale though. We cannot just prematurely
> optimize based on the assumption that it will move the needle on UMA. What if
> UMA does not improve? Are you going to revert this or are the V8 devs going to
> be stuck with this change?

If it doesn't improve at all, I will revert if you prefer.

> I have been looking at NewTabPage.LoadTime for Win dev now. We are in the range
> of 1100ms. 2ms is roughly 1 permille on this metric. Where can is see the clang
> build on UMA?

Most of the 1100ms is network time. The easiest way to see the Clang numbers is to look at the reports sent to g/lexan; the latest one is here: https://groups.google.com/a/google.com/d/msg/lexan/84o6mNvbSEs/ZPsQP6VsDQAJ

What we're looking at is the difference between MSVC and Clang builds; that is 62ms on the last build, so 2ms is 3.2% of that.

> How are the 2KB relating to the overall size?

chrome_child.dll is in the 50MB ball-park, so 2KB is about 0.0039%.
On 2017/06/23 11:04:22, marja wrote: > > > That it shrinks the difference between MSVC and Clang builds on > > > NewTabPage.Loadtime in a measureable way. If we get 10ms from this and the > > > inlinings we've landed in chrome, we'll be very happy. > > ... which brings us back to one of my previous questions: would it be possible > to see this effect on a smaller, less noisy benchmark? Don't we run the relevant > benchmarks on Win, or why haven't we seen any regressions? I'm not aware of one, but I'm also not very familiar with V8's benchmarks. The NTP is not a Javascript-heavy web page; we don't think we're spending a lot of time in V8 in absolute terms, but it accounts for the majority of the CPU time when loading the page. And on Windows, I expect all benchmarks are built with MSVC using LTO which already inlines these functions.
mlippautz, vogelheim: I believe I have approval from the others. If there is more I can do to address your concerns, please let me know.
thakis@chromium.org changed reviewers: + thakis@chromium.org
v8 folks: As said above, if this doesn't help, we'll gladly revert this. If it does help, it'll improve the life of people using chrome on any platform (including android), somewhere between 1-2 billion humans. I'd appreciate it if we could land this, get numbers from UMA, and then revert this again if it doesn't help. From a code health PoV, only the change in ast.h seems dubious since it adds an include. If you feel very strongly about that file, we can remove that change from this CL.
lgtm for what it's worth The approach itself does not look good to me and I will not approve of any future CLs that have non-trivial code changes.
On 2017/06/23 16:05:31, Michael Lippautz wrote: > lgtm for what it's worth Thanks for the lg. > The approach itself does not look good to me and I will not approve of any > future CLs that have non-trivial code changes. I'd like to understand what you dislike about the approach. We ran benchmarks, looked at differences in code, came up with a data-supported hypothesis and now we're doing an experiment to confirm our hypothesis. To me, this seems like a very reasonable approach. What is your problem with it?
On 2017/06/23 16:09:57, Nico wrote:
> On 2017/06/23 16:05:31, Michael Lippautz wrote:
> > lgtm for what it's worth
>
> Thanks for the lg.
>
> > The approach itself does not look good to me and I will not approve of any
> > future CLs that have non-trivial code changes.
>
> I'd like to understand what you dislike about the approach. We ran benchmarks,
> looked at differences in code, came up with a data-supported hypothesis and now
> we're doing an experiment to confirm our hypothesis. To me, this seems like a
> very reasonable approach. What is your problem with it?

In essence, I'd like to see the data (you retrieved locally) on a benchmark that is accessible to *all* of us somewhere in an automated fashion. Otherwise nothing is preventing us from unintentionally regressing during refactorings or architectural changes -- which there will be some in the future. Without a graph or alerts, we will regress at some point and unless somebody repeats the experiment it will go by unnoticed.

Now, in particular for this change, I would expect that we actually have this coverage in another form. All microbenchmarks execute a ton of scavenges and we should see an improvement on single items on Octane (e.g. typescript) on non-LTO builds if that change had any impact. That's where we should look first. If it doesn't have any impact there, then there is no way that it will have noticeable impact when loading the NTP. UMA is nice for a global view but if I understood correctly you only have the MSVC roll there and you cannot compare against clang. So you end up looking at the page load time numbers for MSVC and I honestly cannot see you pinpoint this CL to a regression *or* improvement there. I would be happy if you could prove me wrong and show me the impact of this particular CL on the metric.
One more thing: I just saw that we ship the clang build and there should be a filter for it somewhere I guess? What am I not seeing?
vogelheim, I think we need your approval too.

On 2017/06/23 16:38:57, Michael Lippautz wrote:
> On 2017/06/23 16:09:57, Nico wrote:
> > On 2017/06/23 16:05:31, Michael Lippautz wrote:
> > > lgtm for what it's worth
> >
> > Thanks for the lg.
> >
> > > The approach itself does not look good to me and I will not approve of any
> > > future CLs that have non-trivial code changes.
> >
> > I'd like to understand what you dislike about the approach. We ran benchmarks,
> > looked at differences in code, came up with a data-supported hypothesis and now
> > we're doing an experiment to confirm our hypothesis. To me, this seems like a
> > very reasonable approach. What is your problem with it?
>
> In essence, I'd like to see the data (you retrieved locally) on a benchmark that
> is accessible to *all* of us somewhere in an automated fashion. Otherwise
> nothing is preventing us from unintentionally regressing during refactorings or
> architectural changes -- which there will be some in the future. Without graph
> or alerts, we will regress at some point and unless somebody repeats the
> experiment it will go by unnoticed.

Sure, if the experiment shows that this helps, then we'll look into making something like that. But most of these ideas don't work out, and building these benchmarks takes time. So doing them only for experiments that do work out saves lots of unnecessary busywork.

> Now, in particular for this change, I would expect that we actually have this
> coverage in another form. All microbenchmarks execute a ton of scavenges and we
> should see an improvement on single items on Octane (e.g. typescript) on non-LTO
> builds if that change had any impact. That's where we should look first. If it
> doesn't have any impact there, then there is no way that it will have noticable
> impact when loading the NTP. UMA is nice for a global view but if I understood
> correctly you only have the MSVC roll there and you cannot compare against
> clang. So you end up looking at the page load time numbers MSCV and I honestly
> cannot see you pinpoint this CL to a regression *or* improvement there. I would
> be happy if you could prove me wrong and show me the impact of this particular
> CL on the metric.

We ship both a clang and an MSVC build to the chrome/win dev channel (50% each).
With mlippautz's help, I ran Typescript-octane2.1 locally; two runs (each took about 5 minutes) with and without this patch (64-bit linux clang build):

hwennborg@hwennborg2:/work/v8/v8-perf/benchmarks/Octane2.1$ sudo /work/v8/cpu.sh fast
[sudo] password for hwennborg:
Setting CPU frequency governor to "performance"
hwennborg@hwennborg2:/work/v8/v8-perf/benchmarks/Octane2.1$ /work/v8/out/release/d8 run-some.js -- typescript
Typescript-octane2.1(Score): 430
hwennborg@hwennborg2:/work/v8/v8-perf/benchmarks/Octane2.1$ /work/v8/out/release/d8 run-some.js -- typescript
Typescript-octane2.1(Score): 435
hwennborg@hwennborg2:/work/v8/v8-perf/benchmarks/Octane2.1$ echo without inlining...
without inlining...
hwennborg@hwennborg2:/work/v8/v8-perf/benchmarks/Octane2.1$ /work/v8/out/release/d8 run-some.js -- typescript
Typescript-octane2.1(Score): 423
hwennborg@hwennborg2:/work/v8/v8-perf/benchmarks/Octane2.1$ /work/v8/out/release/d8 run-some.js -- typescript
Typescript-octane2.1(Score): 428

The scores with inlining are higher, which speaks in favour of this patch, but I suppose I don't know what the expected noise level is.

I'll rebase the patch and send perf tryjobs too, but IIUC those can be pretty noisy.
On 2017/06/23 18:06:37, hans wrote: > With mlippautz's help, I ran Typescript-octane2.1 locally; two runs (each took > about 5 minutes) with and without this patch (64-bit linux clang build): > > hwennborg@hwennborg2:/work/v8/v8-perf/benchmarks/Octane2.1$ sudo /work/v8/cpu.sh > fast > [sudo] password for hwennborg: > Setting CPU frequency governor to "performance" > hwennborg@hwennborg2:/work/v8/v8-perf/benchmarks/Octane2.1$ > /work/v8/out/release/d8 run-some.js -- typescript > Typescript-octane2.1(Score): 430 > hwennborg@hwennborg2:/work/v8/v8-perf/benchmarks/Octane2.1$ > /work/v8/out/release/d8 run-some.js -- typescript > Typescript-octane2.1(Score): 435 > hwennborg@hwennborg2:/work/v8/v8-perf/benchmarks/Octane2.1$ echo without > inlining... > without inlining... > hwennborg@hwennborg2:/work/v8/v8-perf/benchmarks/Octane2.1$ > /work/v8/out/release/d8 run-some.js -- typescript > Typescript-octane2.1(Score): 423 > hwennborg@hwennborg2:/work/v8/v8-perf/benchmarks/Octane2.1$ > /work/v8/out/release/d8 run-some.js -- typescript > Typescript-octane2.1(Score): 428 > > > The scores with inlining are higher, which speaks in favour of this patch, but I > suppose I don't know what the expected noise level is. > > I'll rebase the patch and send perf tryjobs too, but IIUC those can be pretty > noisy. If I read the numbers on the perfbot correctly (typescript, v8_linux64_perf_try), the patch is measured as ~2.5% slower on average. But the differences between patch/non-patch is less than the differences between runs, so this is pretty much all noise. As an aside, the Typescript scores on the perfbot are in the >20k range, and even on my Chromebox it's 22k, while you got values of ~430. I do wonder what exactly you're measuring there.
lgtm

The "lgtm" is more due to attrition - I got pinged to "stamp the CL" Friday night, while at home - than to the change actually looking good to me.

In particular
- I share Michael's concerns in full,
- it seems the expected best-case result for the V8-side changes is unlikely to exceed the measurement noise,
- the main motivating benchmark - closing the gap between MSVC and clang builds - is a vanity benchmark without user relevance, and for the user-relevant benchmark (empty tab page) we expect only ~0.2%.
On 2017/06/23 20:09:48, vogelheim wrote:
> lgtm

Thanks!

> The "lgtm" is more due to attrition - I got pinged to "stamp the CL" Friday
> night, while at home - than to the change actually looking good to me.

I'd recommend not to lg CLs because of "attrition" though. That doesn't lead to productive long-term interactions, and we are definitely interested in having those going forward. I'd also recommend to not look at gtalk while at home :-)

I figured if you're still around, and since you said "basically ok" 2.5 days ago, I'd give pinging you a shot. I do want to point out that this CL has been in review for 3 days now.

> In particular
> - I share Michael's concerns in full,

I tried addressing the concern above in comment 34. Was that not convincing?

> - it seems the expected best-case result for the V8-side changes is unlikely to
> exceed the measurement noise,

Then we'll revert this.

> - the main motivating benchmark - closing the gap between MSVC and clang builds
> - is a vanity benchmark without user relevance,

Not so! If this helps, it will help on all platforms that use clang – which is Android, Mac, and Linux, in addition to chrome/win/clang. And closing this gap is our last blocker for switching chrome/win to clang.
On 2017/06/23 20:00:11, vogelheim wrote:
> If I read the numbers on the perfbot correctly (typescript,
> v8_linux64_perf_try), the patch is measured as ~2.5% slower on average. But the
> differences between patch/non-patch is less than the differences between runs,
> so this is pretty much all noise.

Right. I ran tryjobs for a null-change (https://codereview.chromium.org/2958593002) which improved the TypeScript part of Octane by 2.4%, so I suppose the noise level is pretty high here :-)

> As an aside, the Typescript scores on the perfbot are in the >20k range, and
> even on my Chromebox it's 22k, while you got values of ~430. I do wonder what
> exactly you're measuring there.

Huh, this is my out/release/args.gn:

is_debug = false
target_cpu = "x64"
v8_target_cpu = "arm64"
use_goma = true

I thought I was doing default release builds, but that "arm64" doesn't look right :-( Let's replace that with "x64". Sorry about that; this stuff is not so easy for a project outsider.

New runs:

$ for i in `seq 1 50`; do /work/v8/out/release/d8.inline run-some.js -- typescript ; done
Typescript-octane2.1(Score): 37568
Typescript-octane2.1(Score): 36223
Typescript-octane2.1(Score): 39366
Typescript-octane2.1(Score): 39208
Typescript-octane2.1(Score): 38452
Typescript-octane2.1(Score): 36809
Typescript-octane2.1(Score): 39586
Typescript-octane2.1(Score): 37959
Typescript-octane2.1(Score): 37779
Typescript-octane2.1(Score): 35458
Typescript-octane2.1(Score): 36275
Typescript-octane2.1(Score): 36788
Typescript-octane2.1(Score): 36692
Typescript-octane2.1(Score): 39293
Typescript-octane2.1(Score): 38005
Typescript-octane2.1(Score): 35019
Typescript-octane2.1(Score): 36692
Typescript-octane2.1(Score): 36883
Typescript-octane2.1(Score): 36703
Typescript-octane2.1(Score): 39414
Typescript-octane2.1(Score): 39945
Typescript-octane2.1(Score): 36556
Typescript-octane2.1(Score): 37982
Typescript-octane2.1(Score): 38118
Typescript-octane2.1(Score): 39378
Typescript-octane2.1(Score): 39016
Typescript-octane2.1(Score): 37847
Typescript-octane2.1(Score): 39233
Typescript-octane2.1(Score): 38874
Typescript-octane2.1(Score): 36745
Typescript-octane2.1(Score): 38267
Typescript-octane2.1(Score): 35408
Typescript-octane2.1(Score): 37023
Typescript-octane2.1(Score): 38580
Typescript-octane2.1(Score): 40491
Typescript-octane2.1(Score): 36735
Typescript-octane2.1(Score): 37381
Typescript-octane2.1(Score): 39390
Typescript-octane2.1(Score): 37447
Typescript-octane2.1(Score): 35438
Typescript-octane2.1(Score): 35566
Typescript-octane2.1(Score): 36608
Typescript-octane2.1(Score): 37546
Typescript-octane2.1(Score): 35786
Typescript-octane2.1(Score): 35467
Typescript-octane2.1(Score): 38210
Typescript-octane2.1(Score): 36535
Typescript-octane2.1(Score): 36809
Typescript-octane2.1(Score): 40008
Typescript-octane2.1(Score): 37524

$ for i in `seq 1 50`; do /work/v8/out/release/d8.vanilla run-some.js -- typescript ; done
Typescript-octane2.1(Score): 36441
Typescript-octane2.1(Score): 38615
Typescript-octane2.1(Score): 36682
Typescript-octane2.1(Score): 37370
Typescript-octane2.1(Score): 36745
Typescript-octane2.1(Score): 36368
Typescript-octane2.1(Score): 35947
Typescript-octane2.1(Score): 37195
Typescript-octane2.1(Score): 35968
Typescript-octane2.1(Score): 37414
Typescript-octane2.1(Score): 37304
Typescript-octane2.1(Score): 37141
Typescript-octane2.1(Score): 37087
Typescript-octane2.1(Score): 39709
Typescript-octane2.1(Score): 37447
Typescript-octane2.1(Score): 36213
Typescript-octane2.1(Score): 34762
Typescript-octane2.1(Score): 37914
Typescript-octane2.1(Score): 35696
Typescript-octane2.1(Score): 35686
Typescript-octane2.1(Score): 38909
Typescript-octane2.1(Score): 39697
Typescript-octane2.1(Score): 37326
Typescript-octane2.1(Score): 39821
Typescript-octane2.1(Score): 36080
Typescript-octane2.1(Score): 36285
Typescript-octane2.1(Score): 38897
Typescript-octane2.1(Score): 37768
Typescript-octane2.1(Score): 36969
Typescript-octane2.1(Score): 38267
Typescript-octane2.1(Score): 36524
Typescript-octane2.1(Score): 34424
Typescript-octane2.1(Score): 36357
Typescript-octane2.1(Score): 37066
Typescript-octane2.1(Score): 38510
Typescript-octane2.1(Score): 38371
Typescript-octane2.1(Score): 38050
Typescript-octane2.1(Score): 35656
Typescript-octane2.1(Score): 38591
Typescript-octane2.1(Score): 37746
Typescript-octane2.1(Score): 36121
Typescript-octane2.1(Score): 38615
Typescript-octane2.1(Score): 37044
Typescript-octane2.1(Score): 40121
Typescript-octane2.1(Score): 39341
Typescript-octane2.1(Score): 37033
Typescript-octane2.1(Score): 37304
Typescript-octane2.1(Score): 37513
Typescript-octane2.1(Score): 35606
Typescript-octane2.1(Score): 35988

The median is 37535 with my change and 37168 without, but the standard deviation is over 1000. My change is well in the noise, which is to be expected I suppose.

On 2017/06/23 20:09:48, vogelheim wrote:
> The "lgtm" is more due to attrition - I got pinged to "stamp the CL" Friday
> night, while at home - than to the change actually looking good to me.
>
> In particular
> - I share Michael's concerns in full,
> - it seems the expected best-case result for the V8-side changes is unlikely to
> exceed the measurement noise,

That's the problem with death by a thousand cuts.

> - the main motivating benchmark - closing the gap between MSVC and clang builds
> - is a vanity benchmark without user relevance, and for the user-relevant
> benchmark (empty tab page) we expect only ~0.2%.

That's one way to look at it I suppose. Another way is that this is a real-world web page that's loaded billions of times per day where our instrumentation shows that V8 is the bottle-neck and that these functions benefit from inlining.

I'm sorry if this patch has been ill received; I certainly didn't intend to annoy you with it, and believe me I would have been much happier to show some hard benchmark numbers.
Assuming all goes well and this rolls in before the next dev cut, we should have numbers on the UMA metric by the end of next week and will report back. Thanks.
The CQ bit was checked by hans@chromium.org
The patchset sent to the CQ was uploaded after l-g-t-m from mstarzinger@chromium.org, mlippautz@chromium.org, marja@chromium.org

Link to the patchset: https://codereview.chromium.org/2950993002/#ps40001 (title: "rebase")
CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
CQ is committing da patch. Bot data: {"patchset_id": 40001, "attempt_start_ts": 1498250574619930, "parent_rev": "ee0e295d8e829b5c3fc1225b09c7c886e00305f6", "commit_rev": "d00d52be1fce9c1bf5558c8b26bf984efd09e65b"}
Message was sent while issue was closed.
Description was changed from

==========
Make some functions that are hit during renderer startup available for inlining

This is towards closing the perf gap between the MSVC build (which uses link-
time optimization) and Clang (where LTO isn't ready on Windows yet). We did
a study (see bug) to see which non-inlined functions are hit a lot during render
start-up, and which would be inlined during LTO. This should benefit performance
in all builds which currently don't use LTO (Android, Linux, Mac) as well as
the Win/Clang build.

The binary size of chrome_child.dll increases by 2KB with this.

BUG=chromium:728324
==========

to

==========
Make some functions that are hit during renderer startup available for inlining

This is towards closing the perf gap between the MSVC build (which uses link-
time optimization) and Clang (where LTO isn't ready on Windows yet). We did
a study (see bug) to see which non-inlined functions are hit a lot during render
start-up, and which would be inlined during LTO. This should benefit performance
in all builds which currently don't use LTO (Android, Linux, Mac) as well as
the Win/Clang build.

The binary size of chrome_child.dll increases by 2KB with this.

BUG=chromium:728324

Review-Url: https://codereview.chromium.org/2950993002
Cr-Commit-Position: refs/heads/master@{#46191}
Committed: https://chromium.googlesource.com/v8/v8/+/d00d52be1fce9c1bf5558c8b26bf984efd0...
==========
Message was sent while issue was closed.
Committed patchset #3 (id:40001) as https://chromium.googlesource.com/v8/v8/+/d00d52be1fce9c1bf5558c8b26bf984efd0...
Message was sent while issue was closed.
A revert of this CL (patchset #3 id:40001) has been created in https://codereview.chromium.org/2955793002/ by machenbach@chromium.org.

The reason for reverting is: Blocks roll: https://codereview.chromium.org/2954833002/

E.g.:
https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_com...
https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium...

Please include those chromium trybots on reland. Maybe missing symbol export?
Message was sent while issue was closed.
Description was changed from

==========
Make some functions that are hit during renderer startup available for inlining

This is towards closing the perf gap between the MSVC build (which uses link-
time optimization) and Clang (where LTO isn't ready on Windows yet). We did
a study (see bug) to see which non-inlined functions are hit a lot during render
start-up, and which would be inlined during LTO. This should benefit performance
in all builds which currently don't use LTO (Android, Linux, Mac) as well as
the Win/Clang build.

The binary size of chrome_child.dll increases by 2KB with this.

BUG=chromium:728324

Review-Url: https://codereview.chromium.org/2950993002
Cr-Commit-Position: refs/heads/master@{#46191}
Committed: https://chromium.googlesource.com/v8/v8/+/d00d52be1fce9c1bf5558c8b26bf984efd0...
==========

to

==========
Make some functions that are hit during renderer startup available for inlining

This is towards closing the perf gap between the MSVC build (which uses link-
time optimization) and Clang (where LTO isn't ready on Windows yet). We did
a study (see bug) to see which non-inlined functions are hit a lot during render
start-up, and which would be inlined during LTO. This should benefit performance
in all builds which currently don't use LTO (Android, Linux, Mac) as well as
the Win/Clang build.

The binary size of chrome_child.dll increases by 2KB with this.

BUG=chromium:728324
==========
Description was changed from

==========
Make some functions that are hit during renderer startup available for inlining

This is towards closing the perf gap between the MSVC build (which uses link-
time optimization) and Clang (where LTO isn't ready on Windows yet). We did
a study (see bug) to see which non-inlined functions are hit a lot during render
start-up, and which would be inlined during LTO. This should benefit performance
in all builds which currently don't use LTO (Android, Linux, Mac) as well as
the Win/Clang build.

The binary size of chrome_child.dll increases by 2KB with this.

BUG=chromium:728324
==========

to

==========
Make some functions that are hit during renderer startup available for inlining

This is towards closing the perf gap between the MSVC build (which uses link-
time optimization) and Clang (where LTO isn't ready on Windows yet). We did
a study (see bug) to see which non-inlined functions are hit a lot during render
start-up, and which would be inlined during LTO. This should benefit performance
in all builds which currently don't use LTO (Android, Linux, Mac) as well as
the Win/Clang build.

The binary size of chrome_child.dll increases by 2KB with this.

BUG=chromium:728324
CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.linux:linux_chromium_compile_dbg_ng,master.tryserver.chromium.mac:mac_chromium_compile_dbg_ng
==========
On 2017/06/25 20:29:39, Michael Achenbach wrote:
> A revert of this CL (patchset #3 id:40001) has been created in
> https://codereview.chromium.org/2955793002/ by mailto:machenbach@chromium.org.
>
> The reason for reverting is: Blocks roll:
> https://codereview.chromium.org/2954833002/
>
> E.g.:
> https://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_com...
> https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium...
>
> Please include those chromium trybots on reland. Maybe missing symbol export?

Yup, missing export. Sorry for the breakage. That was fixed in https://codereview.chromium.org/2954353002/
The CQ bit was checked by hans@chromium.org
CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
The CQ bit was unchecked by commit-bot@chromium.org
Your CL can not be processed by CQ because of: * Failed to parse additional trybots: master (or bucket) and trybuilders must be separated by : in `master.tryserver.chromium.linux:linux_chromium_compile_dbg_ng,master.tryserver.chromium.mac:mac_chromium_compile_dbg_ng`. Correct syntax is "trymaster:bot1,bot2;trybucket:bot3,bot4".
Description was changed from

==========
Make some functions that are hit during renderer startup available for inlining

This is towards closing the perf gap between the MSVC build (which uses link-
time optimization) and Clang (where LTO isn't ready on Windows yet). We did
a study (see bug) to see which non-inlined functions are hit a lot during render
start-up, and which would be inlined during LTO. This should benefit performance
in all builds which currently don't use LTO (Android, Linux, Mac) as well as
the Win/Clang build.

The binary size of chrome_child.dll increases by 2KB with this.

BUG=chromium:728324
CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.linux:linux_chromium_compile_dbg_ng,master.tryserver.chromium.mac:mac_chromium_compile_dbg_ng
==========

to

==========
Make some functions that are hit during renderer startup available for inlining

This is towards closing the perf gap between the MSVC build (which uses link-
time optimization) and Clang (where LTO isn't ready on Windows yet). We did
a study (see bug) to see which non-inlined functions are hit a lot during render
start-up, and which would be inlined during LTO. This should benefit performance
in all builds which currently don't use LTO (Android, Linux, Mac) as well as
the Win/Clang build.

The binary size of chrome_child.dll increases by 2KB with this.

BUG=chromium:728324
CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.linux:linux_chromium_compile_dbg_ng;master.tryserver.chromium.mac:mac_chromium_compile_dbg_ng
==========
The CQ bit was checked by hans@chromium.org
CQ is trying da patch. Follow status at: https://chromium-cq-status.appspot.com/v2/patch-status/codereview.chromium.or...
CQ is committing da patch. Bot data: {"patchset_id": 40001, "attempt_start_ts": 1498498901604130, "parent_rev": "a2f51f779025adcd32fa72664591246edfcc3b9b", "commit_rev": "777da354d20286d39048f1421d89fa109e38b9e1"}
Message was sent while issue was closed.
Description was changed from

==========
Make some functions that are hit during renderer startup available for inlining

This is towards closing the perf gap between the MSVC build (which uses link-
time optimization) and Clang (where LTO isn't ready on Windows yet). We did
a study (see bug) to see which non-inlined functions are hit a lot during render
start-up, and which would be inlined during LTO. This should benefit performance
in all builds which currently don't use LTO (Android, Linux, Mac) as well as
the Win/Clang build.

The binary size of chrome_child.dll increases by 2KB with this.

BUG=chromium:728324
CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.linux:linux_chromium_compile_dbg_ng;master.tryserver.chromium.mac:mac_chromium_compile_dbg_ng
==========

to

==========
Make some functions that are hit during renderer startup available for inlining

This is towards closing the perf gap between the MSVC build (which uses link-
time optimization) and Clang (where LTO isn't ready on Windows yet). We did
a study (see bug) to see which non-inlined functions are hit a lot during render
start-up, and which would be inlined during LTO. This should benefit performance
in all builds which currently don't use LTO (Android, Linux, Mac) as well as
the Win/Clang build.

The binary size of chrome_child.dll increases by 2KB with this.

BUG=chromium:728324
CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.linux:linux_chromium_compile_dbg_ng;master.tryserver.chromium.mac:mac_chromium_compile_dbg_ng

Review-Url: https://codereview.chromium.org/2950993002
Cr-Commit-Position: refs/heads/master@{#46229}
Committed: https://chromium.googlesource.com/v8/v8/+/777da354d20286d39048f1421d89fa109e3...
==========
Message was sent while issue was closed.
Committed patchset #3 (id:40001) as https://chromium.googlesource.com/v8/v8/+/777da354d20286d39048f1421d89fa109e3...
Message was sent while issue was closed.
To follow up: since this got reverted initially (sorry, I should have watched the roller), it was only rolled into Chromium in #482434, and so it didn't make this week's dev channel (which was cut from #482153).

That means we won't have any UMA data for it until next week, but the silver lining is that it should be easier to see the impact of this patch, as my other inlining patches landed in this week's release.
Message was sent while issue was closed.
On 2017/06/28 21:35:27, hans wrote:
> To follow up: since this got reverted initially (sorry, I should have watched
> the roller), it was only rolled into Chromium in #482434, and so it didn't make
> this week's dev channel (which was cut from #482153).
>
> That means we won't have any UMA data for it until next week, but the silver
> lining is that it should be easier to see the impact of this patch, as my other
> inlining patches landed in this week's release.

We already have >400k samples for Windows Canaries containing this CL. How can I filter for MSVC vs clang on the UMA timeline?
Message was sent while issue was closed.
It took a week longer than expected because there was no Dev channel release on the week of 3 July. (If you're at Google, the numbers are at go/61.0.3153.2-uma)

Since most of us were on vacation, this V8 change is the only CL targeting the MSVC/Clang gap in that release. What we saw is that on SessionRestore.ForegroundTabFirstLoaded and Scheduling.Renderer.DrawDuration2, Clang-built Chrome is now faster than MSVC-built Chrome on the median. We can't tell with certainty that this is because of the V8 patch, but since we didn't do any other patches in this release, and since Clang-built Chrome was previously slower on those metrics, we'd like to keep this if possible.

Having said that, I promised to reconsider these inlinings when we had metrics. To re-cap, these are the functions that were inlined, along with the number of times they were called before inlining during NTP load:

205153 Scanner::ScanTemplateContinuation
151841 VariableProxy::VariableProxy(AsRawString, ...)
140454 Scope::NewUnresolved
 80931 Scavenger::ScavengeObjectsSlow
 57994 Variable::Variable(Scope, AstRawString, ...)
 45411 SnapshotByteSource::CopyRaw
 37636 Heap::AllocateFixedArray
 32558 StaticVisitorBase::GetVisitorId
 30137 ByteCodes::SizeOfOperand

I think some of these make a lot of sense on their own, but if you think some of them need to be reverted, I'll do so. Scope::NewUnresolved was the only one that required including another header (ast.h in scopes.h). StaticVisitorBase::GetVisitorId is a rather ugly switch statement, and Scavenger::ScavengeObjectsSlow looks designed to be out-of-line.

I'm also in Munich this week so I can discuss this in person if you'd like.