Description
Add blink_perf.html_to_dom benchmark.
For background and motivation see bugs.
BUG=625986, 595492
CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.perf:android_s5_perf_cq;master.tryserver.chromium.perf:linux_perf_cq;master.tryserver.chromium.perf:mac_retina_perf_cq;master.tryserver.chromium.perf:winx64_10_perf_cq
Patch Set 1 #
Patch Set 2: Changes to the original benchmark #
Messages
Total messages: 29 (6 generated)
Description was changed from ========== Add blink_perf.html_to_dom benchmark. For background and motivation see bugs. BUG=625986, 595492 ========== to ========== Add blink_perf.html_to_dom benchmark. For background and motivation see bugs. BUG=625986, 595492 CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.perf:android_s5_perf_cq;master.tryserver.chromium.perf:linux_perf_cq;master.tryserver.chromium.perf:mac_retina_perf_cq;master.tryserver.chromium.perf:winx64_10_perf_cq ==========
ulan@chromium.org changed reviewers: + esprehn@chromium.org
I moved the HTML parsing benchmark to blink PerformanceTests as you suggested. I tried to keep the workload as close to the original as possible (the diff between PS1 and PS2 shows modifications to the benchmark). To achieve that I had to modify the runner to support benchmarks that produce multiple results in one run. WDYT?

Log of running the benchmark:
> tools/perf/run_benchmark blink_perf.html_to_dom --reset-results
[ RUN      ] html-benchmark.html
Description: This benchmark parses subset of HTML and adds parsed data to DOM. See http://crbug.com/595492 for more information.
values innerHTML 27.055000000000007 ms
nodes innerHTML 19601
values real-bindings 60.269999999999996 ms
nodes real-bindings 19601
values fake-bindings 30.67500000000001 ms
nodes fake-bindings 19601
[       OK ] html-benchmark.html (3537 ms)
[  PASSED  ] 1 test.
Pages: [html-benchmark.html]
RESULT fake-bindings: html-benchmark.html= 30.675 ms
*RESULT fake-bindings: fake-bindings= 30.675 ms
RESULT innerHTML: html-benchmark.html= 27.055 ms
*RESULT innerHTML: innerHTML= 27.055 ms
RESULT real-bindings: html-benchmark.html= 60.27 ms
*RESULT real-bindings: real-bindings= 60.27 ms
RESULT telemetry_page_measurement_results: num_failed= 0 count
RESULT telemetry_page_measurement_results: num_errored= 0 count
This looks really awesome, will the dashboard show all three output lines? I'm happy with this, but I want to sync with Nat first. :)
On 2016/07/28 21:48:23, esprehn wrote:
> This looks really awesome, will the dashboard show all three output lines?

I think so, but Ulan, if you can paste the json that's generated with --output-format=chartjson somewhere, I could double-check.

> I'm happy with this, but I want to sync with Nat first. :)
On 2016/07/29 01:59:11, sullivan wrote:
> On 2016/07/28 21:48:23, esprehn wrote:
> > This looks really awesome, will the dashboard show all three output lines?
>
> I think so, but Ulan, if you can paste the json that's generated with
> --output-format=chartjson somewhere, I could double-check.

tools/perf/results-chart.json:
{
  "trace_rerun_options": [],
  "format_version": "0.1",
  "benchmark_description": null,
  "charts": {
    "real-bindings": {
      "html-benchmark.html": {"std": 0.0, "name": "real-bindings", "type": "list_of_scalar_values", "important": true, "values": [68.885], "units": "ms", "page_id": 0},
      "summary": {"std": 0.0, "name": "real-bindings", "important": true, "values": [68.885], "units": "ms", "type": "list_of_scalar_values"}
    },
    "fake-bindings": {
      "html-benchmark.html": {"std": 0.0, "name": "fake-bindings", "type": "list_of_scalar_values", "important": true, "values": [37.08000000000001], "units": "ms", "page_id": 0},
      "summary": {"std": 0.0, "name": "fake-bindings", "important": true, "values": [37.08000000000001], "units": "ms", "type": "list_of_scalar_values"}
    },
    "innerHTML": {
      "html-benchmark.html": {"std": 0.0, "name": "innerHTML", "type": "list_of_scalar_values", "important": true, "values": [30.35000000000001], "units": "ms", "page_id": 0},
      "summary": {"std": 0.0, "name": "innerHTML", "important": true, "values": [30.35000000000001], "units": "ms", "type": "list_of_scalar_values"}
    }
  },
  "benchmark_metadata": {"rerun_options": [], "type": "telemetry_benchmark", "name": "blink_perf.html_to_dom", "description": null},
  "next_version": "0.2",
  "benchmark_name": "blink_perf.html_to_dom"
}
Thanks, Ulan! Yes, this will show 3 charts.
Okay great, is this ready to land?
On 2016/08/15 16:13:18, esprehn wrote:
> Okay great, is this ready to land?

Yes. Could you please take a look?
lgtm
ulan@chromium.org changed reviewers: + nednguyen@google.com
Ned, could you please take a look at the tools/perf changes as an owner?

The main change is replacing the manual parsing of the blink_perf results lines with a regexp and extending the line format with an optional metric name.
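For illustration, here is a minimal sketch of the kind of regexp this refers to. It is not the actual code in this CL; the group names and the helper function are made up, and the real pattern in tools/perf may differ:

import re

# Accepts both the legacy results line and the extended form with an optional
# metric name, e.g.:
#   "values 27.05 28.10 ms"        (legacy: no metric name)
#   "values innerHTML 27.055 ms"   (extended: optional metric name)
#   "nodes innerHTML 19601"        (counts, no unit)
_RESULT_LINE_RE = re.compile(
    r'^(?P<key>values|nodes)'               # result type
    r'(?:\s+(?P<metric>[A-Za-z][\w-]*))?'   # optional metric name (the extension)
    r'(?P<samples>(?:\s+[\d.]+)+)'          # one or more numeric samples
    r'(?:\s+(?P<units>[A-Za-z/%]+))?\s*$')  # optional unit, e.g. "ms"

def parse_result_line(line):
  """Returns (key, metric name or None, samples, units or None), or None."""
  match = _RESULT_LINE_RE.match(line.strip())
  if not match:
    return None
  samples = [float(v) for v in match.group('samples').split()]
  return match.group('key'), match.group('metric'), samples, match.group('units')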
On 2016/09/01 08:27:45, ulan wrote:
> Ned, could you please take a look at the tools/perf changes as an owner?
>
> The main change is replacing the manual parsing of the blink_perf results lines
> with a regexp and extending the line format with an optional metric name.

esprehn@: what was the result of your syncing with Nat on this benchmark?

ulan@: can you do a few runs of this and let me know:
1) What is the total cycle time of the BlinkHTMLToDOM benchmark on Android?
2) What is the noise level of the metrics on desktop & mobile?
nduca@chromium.org changed reviewers: + haraken@chromium.org, nduca@chromium.org
why'd we move this from a tbmv2 benchmark to the blink_perf benchmark? I don't think we resolved the questions I raised, just moved the problem and in so doing kinda punted? I really think we need Elliot or someone seasoned with TBMv2 from Tokyo to provide direction and work with speed infra folks such as myself and Ned to come up with a vision here.
On 2016/09/01 at 22:07:46, nduca wrote:
> why'd we move this from a tbmv2 benchmark to the blink_perf benchmark? I don't think we resolved the questions I raised, just moved the problem and in so doing kinda punted? I really think we need Elliot or someone seasoned with TBMv2 from Tokyo to provide direction and work with speed infra folks such as myself and Ned to come up with a vision here.

I'd rather not block on tracing. The blink-style micro benchmarks are very valuable, and having a trace of what this script is doing is not super helpful: there are no trace points inside of what it's doing, and we can't add them without seriously regressing performance.
On 2016/09/01 22:10:20, esprehn wrote:
> On 2016/09/01 at 22:07:46, nduca wrote:
> > why'd we move this from a tbmv2 benchmark to the blink_perf benchmark? I don't
> > think we resolved the questions I raised, just moved the problem and in so doing
> > kinda punted? I really think we need Elliot or someone seasoned with TBMv2 from
> > Tokyo to provide direction and work with speed infra folks such as myself and
> > Ned to come up with a vision here.
>
> I'd rather not block on tracing. The blink-style micro benchmarks are very
> valuable, and having a trace of what this script is doing is not super helpful:
> there are no trace points inside of what it's doing, and we can't add them
> without seriously regressing performance.

The non-tracing-based benchmark is the technical debt we are trying to get rid of. I am not convinced that adding trace points will seriously regress performance, and even if it does, it won't be as bad compared to the timing done in JS.

On another note, a technique to avoid over-tracing for events that have high frequency is this:

[ U ]
[x] [x] [x] [x] [x]

Assuming [x] refers to a very short running function (like 0.001 ms short), to avoid blowing up the trace-event overhead, you can manually measure the wall-time/cpu-time of [x] in the parent trace event U and add these data to the trace event of U as extra args. So a trace event U would look something like:

{ title: "U", start: ..., end: ..., args: {X_count: 5, X_total_time: 0.005ms}}

The reason we insist on tracing is that we have infra that captures the bot's performance state and combines that with the Chrome trace, so we can debug cases where metrics regress because of bot environment changes. If we don't do this the tracing way, it will be hard for us to keep supporting this benchmark running stably in the long term.
This is not blocking on tracing. This is about the architecture team leads having a conversation with speed benchmarking experts about what this benchmark is, whether it's a one-off, what other content might fit in here in the future, and how we understand it relative to the many other benchmarks that teams create. Kentaro did a great job in the memory benchmarks and Kouhei in the loading-dev benchmarks finding a good middle ground between "moving quickly" and "long term," and I'd really like to understand why we can't do that here.

I wrote a pretty long email to you more than a month ago explaining ways we could move this forward that were reconciled with our vision, trying to help move this forward. I'm confused why we can't just do something great here...
There was this attempt at doing it with the tracing stuff: https://codereview.chromium.org/2119413003

- I don't want a WPR; that's not compatible with other browsers and is difficult to run and profile with Instruments.
- There's no way to run the benchmark in that issue against Safari and Firefox, which is critical here.

Ned tells me we can somehow use the python-based benchmark thing over there, but not use a WPR, and still retain a simple way to load the file off disk and see the benchmark run in other browsers. If that's true we can use the tracing benchmark system. He said he'll help next week. :)
Description was changed from ========== Add blink_perf.html_to_dom benchmark. For background and motivation see bugs. BUG=625986, 595492 CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.perf:android_s5_perf_cq;master.tryserver.chromium.perf:linux_perf_cq;master.tryserver.chromium.perf:mac_retina_perf_cq;master.tryserver.chromium.perf:winx64_10_perf_cq ========== to ========== Add blink_perf.html_to_dom benchmark. For background and motivation see bugs. BUG=625986, 595492 ==========
Description was changed from ========== Add blink_perf.html_to_dom benchmark. For background and motivation see bugs. BUG=625986, 595492 ========== to ========== Add blink_perf.html_to_dom benchmark. For background and motivation see bugs. BUG=625986, 595492 CQ_INCLUDE_TRYBOTS=master.tryserver.chromium.perf:android_s5_perf_cq;master.tryserver.chromium.perf:linux_perf_cq;master.tryserver.chromium.perf:mac_retina_perf_cq;master.tryserver.chromium.perf:winx64_10_perf_cq ==========
> 1) What is the total cycle time of the BlinkHTMLToDOM benchmark on Android?

About 7 seconds on Nexus5 and 3 seconds on desktop.

> 2) What is the noise level of the metrics on desktop & mobile?

3 runs on Nexus5:
blink_perf.html_to_dom:fake-bindings ms 95.62 ± 11.47%
blink_perf.html_to_dom:innerHTML ms 129.85 ± 8.83%
blink_perf.html_to_dom:real-bindings ms 255.73 ± 4.95%

3 runs on desktop:
blink_perf.html_to_dom:fake-bindings ms 26.52 ± 13.34%
blink_perf.html_to_dom:innerHTML ms 24.89 ± 13.62%
blink_perf.html_to_dom:real-bindings ms 48.47 ± 8.66%

> The non-tracing-based benchmark is the technical debt we are trying to get rid of.

Does it mean that the existing blink_perf, Octane, Kraken, and JetStream benchmarks are going to be converted to TBMv2 in the future?

> you can manually measure wall-time/cpu-time of [x] and add these data to the trace
> event of U as extra args. So a trace event U would look something like:
> { title: "U", start: ..., end: ..., args: {X_count: 5, X_total_time: 0.005ms}}

Then we would have to add a benchmark-specific metric that can handle X_count and X_total_time?
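A quick sketch of how the noise numbers above can be computed, assuming the ± figure is the sample standard deviation across runs expressed as a percentage of the mean (the per-run values in the example are made up, not the actual three runs):

import statistics

def noise_level(per_run_ms):
  # Mean and relative sample standard deviation (in percent) across runs.
  mean = statistics.mean(per_run_ms)
  return mean, 100.0 * statistics.stdev(per_run_ms) / mean

mean, noise = noise_level([85.1, 95.0, 106.7])  # hypothetical per-run results
print('%.2f ms +- %.2f%%' % (mean, noise))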
On 2016/09/02 14:24:13, ulan wrote:
> > 1) What is the total cycle time of the BlinkHTMLToDOM benchmark on Android?
>
> About 7 seconds on Nexus5 and 3 seconds on desktop.
>
> > 2) What is the noise level of the metrics on desktop & mobile?
>
> 3 runs on Nexus5:
> blink_perf.html_to_dom:fake-bindings ms 95.62 ± 11.47%
> blink_perf.html_to_dom:innerHTML ms 129.85 ± 8.83%
> blink_perf.html_to_dom:real-bindings ms 255.73 ± 4.95%
>
> 3 runs on desktop:
> blink_perf.html_to_dom:fake-bindings ms 26.52 ± 13.34%
> blink_perf.html_to_dom:innerHTML ms 24.89 ± 13.62%
> blink_perf.html_to_dom:real-bindings ms 48.47 ± 8.66%
>
> > The non-tracing-based benchmark is the technical debt we are trying to get rid of.
>
> Does it mean that the existing blink_perf, Octane, Kraken, and JetStream benchmarks
> are going to be converted to TBMv2 in the future?

Correct. I think we're going to use V8's runtime callstats metric for those.

> > you can manually measure wall-time/cpu-time of [x] and add these data to the trace
> > event of U as extra args. So a trace event U would look something like:
> > { title: "U", start: ..., end: ..., args: {X_count: 5, X_total_time: 0.005ms}}
>
> Then we would have to add a benchmark-specific metric that can handle X_count
> and X_total_time?

Correct, if X here is specific to bindings and is not covered by runtime callstats already.
On 2016/09/02 14:25:58, nednguyen wrote: > On 2016/09/02 14:24:13, ulan wrote: > > > 1) What is the total cycle time of BlinkHTMLToDOM benchmark on Android? > > About 7 seconds on Nexus5 and 3 seconds on desktop. > > > > > 2) What is the noise level of the metrics on desktop & mobile? > > 3 runs on Nexus5: > > blink_perf.html_to_dom:fake-bindings ms 95.62 ± 11.47% > > blink_perf.html_to_dom:innerHTML ms 129.85 ± 8.83% > > blink_perf.html_to_dom:real-bindings ms 255.73 ± 4.95% > > > > 3 runs on desktop: > > blink_perf.html_to_dom:fake-bindings ms 26.52 ± 13.34% > > blink_perf.html_to_dom:innerHTML ms 24.89 ± 13.62% > > blink_perf.html_to_dom:real-bindings ms 48.47 ± 8.66% > > > > > The non tracing based benchmark is the technical debt we are trying to get > rid > > off. > > Does it mean that the existing blink_perf, Octane, Kraken, JetStream > benchmarks > > are going to be converted to tbmv2 in future? > > Correct. I think we gonna use the v8's runtime callstats metric for those. I think the TBMv2 versions of the benchmarks should match the results of the existing non-tracing benchmarks. For example, http://chromium.github.io/octane computes scores for 17 line items and then combines the scores to get the total score. I would expect Octane TBMv2 to have 17 lines items and the total score too. Otherwise, we cannot remove the old Octane without losing coverage of what people outside Chrome team are measuring. Since the score computation is Octane specific and is not covered by v8 runtime stats, we will probably end up having Octane specific TBMv2 metric? > > Then we would have to add a benchmark specific metric that can handle X_count > > and X_total_time? > Correct, if X here is specific to binding and are not covered by runtime > callstats already. I would like to avoid benchmark specific metrics, because I like the idea that metrics are generic and can be run on any trace. Looks like we have to choose between the following options for a non-TBM benchmark: 1. Add TBMv2 version of the benchmark without the benchmark specific scoring. In this case the new benchmark would use Chrome specific metric (e.g. v8 runtime callstats), and we would go out of sync with the rest of the world, because they will continue to track the benchmark scores. I doubt that this option is feasible for V8 and Blink teams. 2. Add TBMv2 version of the benchmark with the benchmark specific scoring. This would lead to many custom metrics. The existing benchmark code will be split between perf/benchmark and catapult/tracing. 3. Keep supporting non-TBM benchmarks. What option do you think is the best? Is there another option I missed?
On 2016/09/05 12:43:33, ulan wrote:
> On 2016/09/02 14:25:58, nednguyen wrote:
> > On 2016/09/02 14:24:13, ulan wrote:
> > > > 1) What is the total cycle time of the BlinkHTMLToDOM benchmark on Android?
> > >
> > > About 7 seconds on Nexus5 and 3 seconds on desktop.
> > >
> > > > 2) What is the noise level of the metrics on desktop & mobile?
> > >
> > > 3 runs on Nexus5:
> > > blink_perf.html_to_dom:fake-bindings ms 95.62 ± 11.47%
> > > blink_perf.html_to_dom:innerHTML ms 129.85 ± 8.83%
> > > blink_perf.html_to_dom:real-bindings ms 255.73 ± 4.95%
> > >
> > > 3 runs on desktop:
> > > blink_perf.html_to_dom:fake-bindings ms 26.52 ± 13.34%
> > > blink_perf.html_to_dom:innerHTML ms 24.89 ± 13.62%
> > > blink_perf.html_to_dom:real-bindings ms 48.47 ± 8.66%
> > >
> > > > The non-tracing-based benchmark is the technical debt we are trying to get rid of.
> > >
> > > Does it mean that the existing blink_perf, Octane, Kraken, and JetStream benchmarks
> > > are going to be converted to TBMv2 in the future?
> >
> > Correct. I think we're going to use V8's runtime callstats metric for those.
>
> I think the TBMv2 versions of the benchmarks should match the results of the
> existing non-tracing benchmarks.
>
> For example, http://chromium.github.io/octane computes scores for 17 line items
> and then combines the scores to get the total score. I would expect Octane TBMv2
> to have the 17 line items and the total score too. Otherwise, we cannot remove
> the old Octane without losing coverage of what people outside the Chrome team
> are measuring.
>
> Since the score computation is Octane-specific and is not covered by the V8
> runtime stats, we will probably end up having an Octane-specific TBMv2 metric?
>
> > > Then we would have to add a benchmark-specific metric that can handle X_count
> > > and X_total_time?
> >
> > Correct, if X here is specific to bindings and is not covered by runtime
> > callstats already.
>
> I would like to avoid benchmark-specific metrics, because I like the idea that
> metrics are generic and can be run on any trace.
>
> Looks like we have to choose between the following options for a non-TBM benchmark:
>
> 1. Add a TBMv2 version of the benchmark without the benchmark-specific scoring.
>    In this case the new benchmark would use a Chrome-specific metric (e.g. V8
>    runtime callstats), and we would go out of sync with the rest of the world,
>    because they will continue to track the benchmark scores. I doubt that this
>    option is feasible for the V8 and Blink teams.
>
> 2. Add a TBMv2 version of the benchmark with the benchmark-specific scoring.
>    This would lead to many custom metrics. The existing benchmark code would be
>    split between perf/benchmark and catapult/tracing.
>
> 3. Keep supporting non-TBM benchmarks.
>
> Which option do you think is the best? Is there another option I missed?

There is also the option of supporting both benchmark-specific metrics & V8 metrics. The way I think we can architect it is as follows:

1) Create a TBMv2 metric that can parse special trace events that encode "benchmark-specific metrics" (like the binding benchmark's numbers in this case) and put them into TBMv2 values.

2) Add an API to action_runner that allows you to add benchmark-specific metrics parsed from the page. So you would write an html_to_dom story as follows:

class HTMLToDomPage(page.Page):
  def RunPageInteractions(self, action_runner):
    action_runner.WaitForJavaScriptExpression('testRunner.isDone')
    # ... parse the metrics output by the page ...
    action_runner.AddStorySpecificMetric(name='...', value=...)
    # ^ uses the performance.mark or performance.measure APIs to put those
    #   values into the Chrome trace.

Then with this approach, we can enable both the benchmark's specific metrics & other V8 metrics.
On 2016/09/02 at 00:14:36, esprehn wrote:
> There was this attempt at doing it with the tracing stuff:
> https://codereview.chromium.org/2119413003
>
> - I don't want a WPR; that's not compatible with other browsers and is difficult to run and profile with Instruments.
> - There's no way to run the benchmark in that issue against Safari and Firefox, which is critical here.

I think we need to step back on the requirements you have here for this benchmark. Elliot, my understanding is that your primary goal is to speed up bindings and track that performance over time. I understand that you also have a secondary goal of a comparative benchmark. However, we have had numerous conversations about comparative benchmarking over the years and in every case concluded that focusing on "chrome / chrome" competition, i.e. us competing with ourselves, should always be our primary goal. The main time to do cross-browser benchmarking is when we seek to shift a vendor that has resisted shifting via other means.

If you feel that bindings needs cross-browser benchmarking for this use case, then I think you need to write something into a doc explaining your rationale. Then we can review it with Dimitri and see if this indeed does make sense in this case. Right now, I'm not sold on that particular step in your chain of requirements, so everything else in my head is a bit stuck.
On 2016/09/22 at 15:24:36, nduca wrote:
> On 2016/09/02 at 00:14:36, esprehn wrote:
> > There was this attempt at doing it with the tracing stuff:
> > https://codereview.chromium.org/2119413003
> >
> > - I don't want a WPR; that's not compatible with other browsers and is difficult to run and profile with Instruments.
> > - There's no way to run the benchmark in that issue against Safari and Firefox, which is critical here.
>
> I think we need to step back on the requirements you have here for this benchmark. Elliot, my understanding is that your primary goal is to speed up bindings and track that performance over time. I understand that you also have a secondary goal of a comparative benchmark. However, we have had numerous conversations about comparative benchmarking over the years and in every case concluded that focusing on "chrome / chrome" competition, i.e. us competing with ourselves, should always be our primary goal.

I don't agree, and I don't think others do either. See the massive gains made from Animometer.

> The main time to do cross-browser benchmarking is when we seek to shift a vendor that has resisted shifting via other means.

I don't think that's true.

> If you feel that bindings needs cross-browser benchmarking for this use case, then I think you need to write something into a doc explaining your rationale. Then we can review it with Dimitri and see if this indeed does make sense in this case. Right now, I'm not sold on that particular step in your chain of requirements, so everything else in my head is a bit stuck.

You've been blocking progress now for months, which is not acceptable; I want to land this benchmark for tracking our own progress. We need to move on here and start adding more benchmarks, for example ones for postMessage. jbroman@ is landing lots of code for that now. Cross-browser testing there is also critical: we were so slow before, but how does jbroman@ know when he's done optimizing if he can't tell when we're as fast as or faster than Safari?

I'm happy to meet about this, but I'd respectfully ask that you allow my team to go forward landing benchmarks for our own tracking. You can ignore these benchmarks, that's fine, but we need them. :)
I'm sorry you feel blocked. I think I've raised pretty good questions, hoping that maybe you'd kick off a design doc or a VC so we could all get on the same page. That's what happened when, for instance, we raised good questions about page cycler v1: Kouhei stepped up, booked a VC, we all had a great chat and aired a ton of stuff, and out of that emerged a great doc for pcv2. Nobody got blocked, and everyone walked away happy. Is the mistake here that I didn't book a VC for us to all work through this?

If you think I dropped the ball here, then I'm super sorry for any offense caused, and I'd ask you to hear my perspective: Annie, Ned and I handle about 30 different urgent benchmarking design problems a quarter from different parts of Chrome. That's a lot in aggregate, and each one is unique, so we often rely on the authors of the CLs or their TLs to take the initiative to work through the issues we raise and get to consensus with us. Usually the authors assume that we've got something valid to say, because we do a ton of benchmarking as our day-to-day job, and want to talk to us to get advice, so it doesn't usually end up in a situation of frustration like this one.

What I see here is that we've all got our point of view, and for some reason our points of view have stayed separated for long enough that everyone feels frustrated. We are frustrated here, too, because we like to help people and we don't yet understand how you've reached the conclusions you've reached. We'd like to understand.

But, if it does help de-escalate the situation, I could rubberstamp this patch. That seems like something you want, and what I want more than anything is to get to a happy place. In parallel, I will book a VC for mid next week so we can chat. I am in Europe then, so I will have to make it a relatively early meeting compared to our usual chats. :)

Best,
- Nat
Nat, I am sorry that I did not schedule a VC much earlier. Could you please add me to the upcoming meeting?

In the meantime I would appreciate it if you could rubber-stamp this CL. Please note that it is using the _existing_ blink_perf framework and thus is not adding new problems; there are already dozens of blink_perf benchmarks. If the existing benchmarks are converted to TBMv2, then converting this benchmark by following the template will be trivial (I can do that part).

However, I doubt that there is a clean solution for converting JS-based benchmarks to TBMv2: see my concerns above about the proliferation of benchmark-specific TBMv2 metrics to compute scores for Octane, Kraken, blink_perf, etc. Converting to TBMv2 would also make it harder to compare results between browsers, which is critical for these benchmarks. With that in mind I don't see great value in converting these benchmarks to TBMv2.