Created: 6 years, 5 months ago by picksi1
Modified: 6 years, 3 months ago
Reviewers: jdduke (slow), nduca, Sami, Rick Byers, eseidel, picksi, ariblue, Yufeng Shen (Slow to review), tonyg
CC: chromium-reviews, telemetry+watch_chromium.org
Base URL: https://chromium.googlesource.com/chromium/src.git@master
Project: chromium
Visibility: Public.
Description: Addition of first_gesture_scroll_latency metric.
BUG=303215
Patch Set 1
Total comments: 16
Patch Set 2: Updates from code reviews
Patch Set 3: Addition of unit tests
Total comments: 11
Patch Set 4: Addition of GSU latency metric
Total comments: 2
Patch Set 5: Metric measures only first GSU event
Patch Set 6: Removing time to stabilise metric
Patch Set 7: Tidy-up of unneeded imports & whitespace
Total comments: 1
Patch Set 8: Renaming of new metric
Total comments: 2
Messages
Total messages: 38 (0 generated)
I've added two new metrics to help measure blink scheduling improvements:

1. time_for_input_latency_to_stabilize. This measures, in ms, how quickly the input latency falls to (below) the mean input latency - i.e. when a page is loading it is slow and user input events get badly delayed. This is a measure of how long the user needs to wait before they get a decent response from their input (e.g. scrolling). As and when we break up early monolithic tasks (e.g. v8 parsing) this metric will get lower.

2. input_event_latency_discrepancy_when_stable. Once (1) above knows when the input latency has fallen to a reasonable level we can then calculate the latency discrepancy, ignoring the early page-loading noise. This 'noise' currently overwhelms the business-as-usual latency variations. As we prioritize user input events (and other scheduling improvements) this metric will reduce.
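For readers following the thread, here is a minimal sketch of the stabilization idea being discussed (not the actual patch; the (start_time, latency) pair representation, the run length of 3, and the use of the mean are illustrative, and later patch sets switch to the 90th percentile before the metric is dropped entirely):

  import numpy

  def TimeForInputLatencyToStabilize(events, run_length=3):
    """Ms from the first event until `run_length` consecutive events fall
    below the mean latency; None if the latency never stabilizes."""
    events = sorted(events, key=lambda e: e[0])  # sort by start_time
    mean_latency = numpy.mean([latency for _, latency in events])
    first_time = events[0][0]
    run_start_times = []
    for start_time, latency in events:
      if latency >= mean_latency:
        run_start_times = []  # run broken; start counting again
        continue
      run_start_times.append(start_time)
      if len(run_start_times) >= run_length:
        return run_start_times[0] - first_time
    return None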
https://codereview.chromium.org/397663002/diff/1/tools/telemetry/telemetry/ut...
File tools/telemetry/telemetry/util/statistics.py (right):

https://codereview.chromium.org/397663002/diff/1/tools/telemetry/telemetry/ut...
tools/telemetry/telemetry/util/statistics.py:206: the mean stabilization time. This gives us a metric to measure how
Do you want the mean? I guess I'd figured you'd just use some fixed constant for expected latency time and then count until more than N ticks in a row are below the expected latency. Mean might work great. I just don't quite understand the reasoning behind mean, vs. median vs. constant vs. Nth percentile, etc. here.

https://codereview.chromium.org/397663002/diff/1/tools/telemetry/telemetry/ut...
tools/telemetry/telemetry/util/statistics.py:217: for start_latency in start_latency_list:
Pythonistas might have you write: total = sum(zip(*start_latency_list)[1])

https://codereview.chromium.org/397663002/diff/1/tools/telemetry/telemetry/ut...
tools/telemetry/telemetry/util/statistics.py:220: mean = total / len(start_latency_list)
Or just mean = numpy.mean(start_latency_list)

https://codereview.chromium.org/397663002/diff/1/tools/telemetry/telemetry/ut...
tools/telemetry/telemetry/util/statistics.py:232: sorted_data = sorted(start_latency_list, compare_start_time)
sorted(start_latency_list, key=lambda a: a[0]) or key=operator.itemgetter(0) if you want to get more fancy. :) Then you don't even need your own compare_start_time.

https://codereview.chromium.org/397663002/diff/1/tools/telemetry/telemetry/ut...
tools/telemetry/telemetry/util/statistics.py:240: Below_Mean_Run_Length = 3
Did you mean to start with a capital?

https://codereview.chromium.org/397663002/diff/1/tools/telemetry/telemetry/ut...
tools/telemetry/telemetry/util/statistics.py:242: for event in sorted_data:
This loop feels like it could be much shorter. You're trying to find the first pair of events which are below the mean, correct?

  run_start_times = []
  for event in sorted_data:
    latency = event[1]
    # Oops, this event is above our threshold, start again.
    if latency >= mean:
      run_start_times = []
      continue
    if len(run_start_times) > search_number:
      return (run_start_times[0] - first_event_time)
    run_start_times.append(event[0])

is a re-write of your code, slightly shorter (your comments are probably helpful however).

https://codereview.chromium.org/397663002/diff/1/tools/telemetry/telemetry/ut...
tools/telemetry/telemetry/util/statistics.py:250:
continue here, instead of the else, is probably clearer.

https://codereview.chromium.org/397663002/diff/1/tools/telemetry/telemetry/ut...
tools/telemetry/telemetry/util/statistics.py:292: event_time = event[0]
It's lame that these events are tuples. Not your fault of course. But makes for very unreadable code.
https://codereview.chromium.org/397663002/diff/1/tools/telemetry/telemetry/ut...
File tools/telemetry/telemetry/util/statistics.py (right):

https://codereview.chromium.org/397663002/diff/1/tools/telemetry/telemetry/ut...
tools/telemetry/telemetry/util/statistics.py:206: the mean stabilization time. This gives us a metric to measure how
I selected mean as an easy to calculate value! I'll change this to use the 90th percentile as that better reflects the initial peak that we are trying to time.

https://codereview.chromium.org/397663002/diff/1/tools/telemetry/telemetry/ut...
tools/telemetry/telemetry/util/statistics.py:217: for start_latency in start_latency_list:
This code has gone now that I'm using the 90th percentile.

https://codereview.chromium.org/397663002/diff/1/tools/telemetry/telemetry/ut...
tools/telemetry/telemetry/util/statistics.py:220: mean = total / len(start_latency_list)
This code has gone now that I'm using the 90th percentile.

https://codereview.chromium.org/397663002/diff/1/tools/telemetry/telemetry/ut...
tools/telemetry/telemetry/util/statistics.py:232: sorted_data = sorted(start_latency_list, compare_start_time)
On 2014/07/15 21:35:15, eseidel wrote:
> sorted(start_latency_list, key=lambda a: a[0]) or key=operator.itemgetter(0) if
> you want to get more fancy. :) Then you don't even need your own
> compare_start_time.
Done.

https://codereview.chromium.org/397663002/diff/1/tools/telemetry/telemetry/ut...
tools/telemetry/telemetry/util/statistics.py:240: Below_Mean_Run_Length = 3
On 2014/07/15 21:35:15, eseidel wrote:
> Did you mean to start with a capital?
Done.

https://codereview.chromium.org/397663002/diff/1/tools/telemetry/telemetry/ut...
tools/telemetry/telemetry/util/statistics.py:242: for event in sorted_data:
I like your suggestion, it's elegant! Done.

https://codereview.chromium.org/397663002/diff/1/tools/telemetry/telemetry/ut...
tools/telemetry/telemetry/util/statistics.py:250:
On 2014/07/15 21:35:15, eseidel wrote:
> continue here, instead of the else is probably clearer.
Done.

https://codereview.chromium.org/397663002/diff/1/tools/telemetry/telemetry/ut...
tools/telemetry/telemetry/util/statistics.py:292: event_time = event[0]
Tuples *were* my fault, I was unhappy with this code for the same reason as you. I've not done much Python so didn't know better! I've subsequently learnt about namedtuples and I've changed this throughout.
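A minimal sketch of the kind of change being described here (the type and field names are illustrative, not necessarily those used in the actual patch):

  import collections
  import operator

  # Replacing bare (start_time, latency) tuples with a named type makes the
  # downstream code self-describing.
  InputEventDetails = collections.namedtuple('InputEventDetails',
                                             ['start_time', 'latency'])

  events = [InputEventDetails(start_time=12.0, latency=48.0),
            InputEventDetails(start_time=4.0, latency=180.0)]

  # eseidel's suggestion: sort with a key function instead of a custom
  # comparator.
  sorted_events = sorted(events, key=operator.attrgetter('start_time'))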
Addition of unit tests and a small fix revealed by unit-tests.

https://codereview.chromium.org/397663002/diff/40001/tools/telemetry/telemetr...
File tools/telemetry/telemetry/util/statistics.py (right):

https://codereview.chromium.org/397663002/diff/40001/tools/telemetry/telemetr...
tools/telemetry/telemetry/util/statistics.py:271:
The unit-tests revealed an assumption that the event list was already sorted by event start_time. The 3 lines updated below get the minimum start_time, rather than the [0]'th start time.
Seems OK to me. I don't have power in this directory. We'd have to check the OWNERS file to see who does.

https://codereview.chromium.org/397663002/diff/40001/tools/telemetry/telemetr...
File tools/telemetry/telemetry/util/statistics.py (right):

https://codereview.chromium.org/397663002/diff/40001/tools/telemetry/telemetr...
tools/telemetry/telemetry/util/statistics.py:227: key = lambda event:event.start_time)
I don't think there are normally = around argument setters, but then again I don't feel like I know chromium python style very well.
Rick, can you have a peek since you had some Strong Opinions about this on the design doc? :)
miletus@ should really weigh in on this since it changes and extends some of his input latency stuff.

I've been trying to figure out how this is different/better than simply reporting the latency of the first GSU event. Conceptually it's the latency of the first GSU that matters to the user - that's how long from the time they started dragging their finger to the time scrolling was free to run smoothly. Note that the trace viewer UI is deceiving here - it doesn't visualize the ORIGINAL_COMPONENT entries (which is what's used in telemetry to mark the start of the GSU) - you can quickly approximate this by looking at the total time of a matching pair of InputLatency::TouchMove and InputLatency::GestureScrollUpdate events. So in practice I don't think there's really a bunch of different input events whose latency we're waiting to stabilize, but exactly ONE important GSU event and potentially a whole bunch of related supporting events (TouchMoves, etc.) whose latency is all tied to the latency of this special GSU.

One important way your approach is better is that it can be used to capture page load jank for scenarios other than just GestureScroll. In particular, for pages that implement their own scrolling/dragging in javascript based on touchmove events. In that case we really do need some measure of all the different events. Is the 90th percentile measure really what we want though? What if the page load jank exceeds 10% of the total gesture time? Then we'd choose some relatively arbitrary point within the jank as the cutoff. Another option would be to report just the max input event latency (the number of high-latency events is already accounted for to some extent in mean_input_event_latency).

Are you planning on adding a new pageset specifically for measuring the blink scheduler improvements? Measurements of page load jank will be highly sensitive to the amount of delay we insert between page load and scroll start, and this isn't terribly predictable today (many page sets poll with a decaying delay until some condition is true before starting the gesture). I'm a little worried we'll get lots of noise in smoothness from this metric. We might need to reload a page ~10 times and take the average to get something remotely stable. Perhaps this needs to be a new measurement instead of part of smoothness?

Can you share some results from running our existing smoothness tests with this metric (maybe with a page_repeat of >= 10 so we can see how stable it is compared to our other metrics)?

Also your BUG= in the CL description appears to have the wrong bug number.

Anyway, if you've got data that shows this metric is useful (relatively stable between runs, anecdotally discriminates well sites we know to be problematic on load, etc.) then I'm OK with this CL. This is a tough problem and we shouldn't block progress on coming up with a perfect solution. But I think our choice of metric is very important to this work, and historically it's hard to come up with metrics that are this complex and still useful in practice.

https://codereview.chromium.org/397663002/diff/40001/tools/telemetry/telemetr...
File tools/telemetry/telemetry/util/statistics.py (right):

https://codereview.chromium.org/397663002/diff/40001/tools/telemetry/telemetr...
tools/telemetry/telemetry/util/statistics.py:218: latency_list = GetLatenciesAfterTime(event_details_list, 0)
nit: this would be shorter and clearer to me (and remove the need for a comment) if you just did the map inline rather than (ab)use your helper. Eg.
"latency_list = [e.latency for e in event_details_list]"

https://codereview.chromium.org/397663002/diff/40001/tools/telemetry/telemetr...
tools/telemetry/telemetry/util/statistics.py:260: """" Given a list of InputEventDetails tuples and an time return a
nit: s/an/a/

https://codereview.chromium.org/397663002/diff/40001/tools/telemetry/telemetr...
tools/telemetry/telemetry/util/statistics.py:278: for event_details in event_details_list:
I believe you can write these 4 lines more concisely in python with a list comprehension, eg:
return [e.latency for e in event_details if e.start_time >= absolute_stabilization_time]

https://codereview.chromium.org/397663002/diff/40001/tools/telemetry/telemetr...
File tools/telemetry/telemetry/web_perf/metrics/rendering_stats.py (right):

https://codereview.chromium.org/397663002/diff/40001/tools/telemetry/telemetr...
tools/telemetry/telemetry/web_perf/metrics/rendering_stats.py:57: def FindEventStartAndEnd(event):
nit: This function is very specific to input events / LatencyInfo. Maybe call it 'FindInputEventStartAndEnd' or even just 'FindInputEventBounds'.
My concern is that time_for_input_latency_to_stabilize is not a very straightforward metric and one has to think about what it tries to capture and why we choose this metric. If I were to do it, I would probably just go with something like the average of first 3 input events' latency. But I don't mean to block this one. I would love to see how this metric value changes/improves when we have a real CL that improves the main thread input responsiveness. On 2014/07/22 21:04:29, Rick Byers wrote: > miletus@ should really weight in on this since it changes and extends some of > his input latency stuff. > > I've been trying to figure out how this is different/better than simply > reporting the latency of the first GSU event. Conceptually it's the latency of > the first GSU that matters to the user - that's how long from the time they > started dragging their finger to the time scrolling was free to run smoothly. > Note that the trace viewer UI is deceiving here - it doesn't visualize the > ORIGINAL_COMPONENT entries (which is what's used in telemetry to mark the start > of the GSU) - you can quickly approximate this by looking at the total time of a > matching pair of InputLatency::TouchMove and InputLatency::GestureScrollUpdate > events. So in practice I don't think there's really a bunch of different input > events whose latency we're waiting to stabilize, but exactly ONE important GSU > event and potentially a whole bunch of related supporting events (TouchMoves, > etc.) whose latency are all related to the latency of this special GSU. > > One important way your approach is better is that it can be used to capture page > load jank for scenarios other than just GestureScroll. In particular, for pages > that implement their own scrolling/dragging in javascript based on touchmove > events. In that case we really do need some measure of the all the different > events. Is the 90th percentile measure really what we want though? What if the > page load jank exceeds 10% of the total gesture time, then we'd choose some > relatively arbitrary point within the jank as the cutoff. Another option would > be to report just the max input event latency (the number of high-latency events > is already accounted for to some extent in mean_input_event_latency). > > Are you planning on adding a new pageset specifically for measuring the blink > scheduler improvements? Measurements of page load jank will be highly sensitive > to the amount of delay we insert between page load and scroll start, and this > isn't terribly predictable today (many page sets poll with a decaying delay > until some condition is true before starting the gesture). I'm a little worried > we'll get lots of noise in smoothness from this metric. We might need to reload > a page ~10 times and take the average to get something remotely stable. Perhaps > this needs to be a new measurement instead of part of smoothness? > > Can you share some results from running our existing smoothness tests with this > metric (maybe with a page_repeat of >= 10 so we can see how stable it is > compared to our other metrics)? > > Also your BUG= in the CL description appears to have the wrong bug number. > > Anyway if you've got data that shows this metric is useful (relatively stable > between runs, anecdotally discriminates well sites we know to be problematic on > load, etc.) then I'm OK with this CL. This is a tough problem and we shouldn't > block progress on coming up with a perfect solution. 
But I think our choice of > metric is very important to this work, and historically it's hard to come up > with metrics that are this complex and still useful in practice. > > https://codereview.chromium.org/397663002/diff/40001/tools/telemetry/telemetr... > File tools/telemetry/telemetry/util/statistics.py (right): > > https://codereview.chromium.org/397663002/diff/40001/tools/telemetry/telemetr... > tools/telemetry/telemetry/util/statistics.py:218: latency_list = > GetLatenciesAfterTime(event_details_list, 0) > nit: this would be shorter and clearer to me (and remove the need for a comment) > if you just did the map inline rather than (ab)use your helper. Eg. > "latency_list = [e.latency for e in event_details_list]" > > https://codereview.chromium.org/397663002/diff/40001/tools/telemetry/telemetr... > tools/telemetry/telemetry/util/statistics.py:260: """" Given a list of > InputEventDetails tuples and an time return a > nit: s/an/a/ > > https://codereview.chromium.org/397663002/diff/40001/tools/telemetry/telemetr... > tools/telemetry/telemetry/util/statistics.py:278: for event_details in > event_details_list: > I believe you can write these 4 lines more concisely in python with list > comprehension, eg: > > return [e.latency for e in event_details if e.start_time >= > absolute_stabilization_time] > > https://codereview.chromium.org/397663002/diff/40001/tools/telemetry/telemetr... > File tools/telemetry/telemetry/web_perf/metrics/rendering_stats.py (right): > > https://codereview.chromium.org/397663002/diff/40001/tools/telemetry/telemetr... > tools/telemetry/telemetry/web_perf/metrics/rendering_stats.py:57: def > FindEventStartAndEnd(event): > nit: This function is very specific to input events / LatencyInfo. Maybe call > it 'FindInputEventStartAndEnd' or even just 'FindInputEventBounds'.
On 2014/07/22 21:33:40, Yufeng Shen wrote: > My concern is that time_for_input_latency_to_stabilize is not a very > straightforward > metric and one has to think about what it tries to capture and why we choose > this > metric. If I were to do it, I would probably just go with something like the > average > of first 3 input events' latency. > > But I don't mean to block this one. I would love to see how this metric value > changes/improves > when we have a real CL that improves the main thread input responsiveness. Would it hurt to include both the stabilize time as well as the GSU response time (or even the average of the first 3 events as you said?)
Thanks for all the feedback!

I'll make the code changes suggested by Rick.
I'll add another metric to capture the average of the first 3 input events' latency.
I'll gather some metrics on good vs bad sites to show consistency and that the metric reflects the differing UX of these sites.

This metric is different from reporting just the first latency as there is a continuous period during which the input events all get delayed; this extended period of slow responsiveness lasts longer than a single event's latency as some input events "sneak through the cracks" and will get processed during the slow period. When I've added the new first_event_latency we'll be able to see if the two metrics are correlated enough to make one of them redundant.
On 2014/07/23 10:31:04, picksi1 wrote:
> Thanks for all the feedback!
> I'll make the code changes suggested by Rick
> I'll add another metric to capture the average of the first 3 input event's latency
> I'll gather some metrics on good Vs bad sites to show consistency and that the metric reflects the differing UX of these sites

Perfect, it's this latter step I think is most valuable - we can debate forever about what should be the right approach, but if you can show data for any metric that's stable and does a good job discriminating the cases we care about then I'm happy :-)

> This metric is different from reporting just the first latency as there is a continuous period during which the input events all get delayed, this extended period of slow responsiveness lasts longer than a single event's latency as some input events "sneak through the cracks" and will get processed during the slow-period.

Are we talking just about a single scroll gesture (as typically done by our smoothness measurement) here? If so then what you describe should be impossible. Once we get an ACK back for the touchstart and first touchmove (which are both prerequisites for sending the first GestureScrollUpdate), then the scroll should be running freely on impl without blocking on blink at all. Can you show me a smoothness trace with a "delayed" input event that happens AFTER the first GestureScrollUpdate? If we can find ANY examples of that, then we've either got a nasty bug we need to fix, or we're just suffering from core contention (but if that exists, it's probably just a bug somewhere). My "slow_handler" test in tough_scheduling_cases.py is designed to test exactly that and so should be an idealized test for your metric (there's a large synthetic delay enabled for blink event handling, but the scroll runs very quickly after starting). You can see a synthetic demo of this at http://www.rbyers.net/janky-touch-scroll.html with "do lots of work on the main thread" and "enable empty touchmove handler" enabled - once scrolling starts it's very smooth even though the blink thread is badly blocked.

If you're looking at manual traces (as opposed to ones generated by telemetry) then there are some ways you can see something like what you describe:
- Attempting multiple scrolls (a valid use case, but not something we're currently testing in telemetry AFAIK)
- Hitting the limits during a scroll (eg. scrolling a little in the opposite direction at the start)
If these are the scenarios you're concerned with then we'll need to create a new PageAction and new/updated page sets to cover these scenarios.

> When I've added the new first_event_latency we'll be able to see if the two metrics are correlated enough to make one of them redundant.

Note that first_event_latency will need to be careful to look at the right event. The first couple touchmove events in a sequence may be uninteresting (if they didn't exceed the slop distance then they won't be forwarded to the renderer and so have a very low latency). There's also some other events (eg. GestureFlingCancel, GestureShowPress, GestureTapCancel) that may not always be forwarded to blink and so could confuse a naive 'average of first 3 input event' measurement. I'd suggest looking specifically at only the first GestureScrollUpdate event - that's the one that determines how long it takes for scrolling to start.
Thanks for this, there's lots of useful information here! I hadn't appreciated the way that GestureScrollUpdate worked, so I might have been talking nonsense. I have been doing a lot of manual traces with rapid changes in direction of scroll, which have no doubt colored my vision as to what is going on under-the-hood. Creating a new PageAction that has some abrupt scroll changes sounds like a good next step, as big latencies on these are very obvious and annoying as a user. I'll do as you suggest and produce a metric that measures the average of the first 3 GestureScrollUpdate events.
On 2014/07/23 14:22:21, Rick Byers wrote: > On 2014/07/23 10:31:04, picksi1 wrote: > > Thanks for all the feedback! > > > > I'll make the code changes suggested by Rick > > I'll add another metric to capture the average of the first 3 input event's > > latency > > I'll gather some metrics on good Vs bad sites to show consistency and that the > > metric reflects the differing UX of these sites > > Perfect, it's this latter step I think is most valuable - we can debate forever > about what should be the right approach, but if you can show data for any metric > that's stable and does a good job discriminating the cases we care about then > I'm happy :-) > > > This metric is different from reporting just the first latency as there is a > > continuous period during which the input events all get delayed, this extended > > period of slow responsiveness lasts longer than a single event's latency as > some > > input events "sneak through the cracks" and will get processed during the > > slow-period. > > Are we talking just about a single scroll gesture (as typically done by our > smoothness measurement) here? If so then what you describe should be > impossible. Once we get an ACK back for the touchstart and first touchmove > (which are both pre-requisities for sending the first GestureScrollUpdate), then > the scroll should be running freely on impl without blocking on blink at all. > Can you show me a smoothness trace with a "delayed" input event that happens > AFTER the first GestureScrollUpdate? If we can find ANY examples of that, then > we've either got a nasty bug we need to fix, or we're just suffering from core > contention (which the bothat exists,it's probably just a bug somewhere. My > "slow_handler" test in tough_scheduling_cases.py is designed to test exactly > that and so should be an idealized test for your metric (there's a large > synthetic delay enabled for blink event handling, but the scroll runs very > quickly after starting). You can see a synthetic demo of this at > http://www.rbyers.net/janky-touch-scroll.html with "do lots of work on the main > thread" and "enable empty touchmove handler" enabled - once scrolling starts > it's very smooth even though the blink thread is badly blocked. > > If you're looking at manual traces (as opposed to ones generated by telemetry) > then there are some ways you can see something like what you describe: > - Attempting multiple scrolls (a valid use case, but not something we're > currently testing in telemetry AFAIK) > - Hitting the limits during a scroll (eg. scrolling a little in the opposite > direction at the start) > If these are the scenarios you're concerned with then we'll need to create a new > PageAction and new/updated page sets to cover these scenarios. > > > When I've added the new first_event_latency we'll be able to see if > > the two metrics are correlated enough to make one of them redundant. > > Note that first_event_latency will need to be careful to look at the right > event. The first couple touchmove events in a sequence may be uninteresting (if > they didn't exceed the slop distance then they won't be forwarded to the > renderer and so have a very low latency). There's also some other events (eg. > GestureFlingCancel, GestureShowPress, GestureTapCancel) that may not always be > forwarded to blink and so could confuse a naive 'average of first 3 input event' > measurement. 
> I'd suggest looking specifically at only the first GestureScrollUpdate event - that's the one that determines how long it takes for scrolling to start.

Sorry, I wasn't making it clear that when I say "first 3 input events' latency" I meant the latency for the first 3 input events that have caused a swap buffer. If it is an impl scroll page, then it would be the first 3 GestureScrollUpdates. If the page has a touch handler which does real work, then it is the first 3 touch move latencies (or touch begin latency + 2 touch move latencies).
On 2014/07/23 15:29:28, picksi1 wrote: > Thanks for this, there's lots of useful information here! I hadn't appreciated > the way that GestureScrollUpdate worked, so I might have been talking nonsense. > I have been doing a lot of manual traces with rapid changes in direction of > scroll, which have no doubt colored my vision as to what is going on > under-the-hood. Creating a new PageAction that has some abrupt scroll changes > sounds like a good next step, as big latencies on these are very obvious and > annoying as a user. When you say "rapid changing of direction" you mean within one scroll gesture (i.e. without lifting your finger) without hitting the top/bottom of the page? An automated test for that also wouldn't cause any blocking on blink. We block on blink only at the start of the scroll and when a scroll update cannot be performed because we've reached the end (in any axis) of the document. If we want to focus on a scenario beyond just "how long did a single scroll take to start" I'd suggest maybe a series of multiple short/fast scroll gestures (scroll down a bit, lift, scroll down a bit more, lift, etc.). That's a fairly reasonable user scenario, gives us multiple data points within one page load, etc... > I'll do as you suggest and produce a metric that measure the average of the > first 3 GestureScrollUpdate events.
I've added a metric to report the average of the first 3 GSU events. I'll gather and post some metrics to show it in action.

https://codereview.chromium.org/397663002/diff/40001/tools/telemetry/telemetr...
File tools/telemetry/telemetry/util/statistics.py (right):

https://codereview.chromium.org/397663002/diff/40001/tools/telemetry/telemetr...
tools/telemetry/telemetry/util/statistics.py:218: latency_list = GetLatenciesAfterTime(event_details_list, 0)
On 2014/07/22 21:04:29, Rick Byers wrote:
> nit: this would be shorter and clearer to me (and remove the need for a comment)
> if you just did the map inline rather than (ab)use your helper. Eg.
> "latency_list = [e.latency for e in event_details_list]"
Done.

https://codereview.chromium.org/397663002/diff/40001/tools/telemetry/telemetr...
tools/telemetry/telemetry/util/statistics.py:227: key = lambda event:event.start_time)
I'm not sure what you mean here? Please clarify.

https://codereview.chromium.org/397663002/diff/40001/tools/telemetry/telemetr...
tools/telemetry/telemetry/util/statistics.py:260: """" Given a list of InputEventDetails tuples and an time return a
On 2014/07/22 21:04:29, Rick Byers wrote:
> nit: s/an/a/
Done.

https://codereview.chromium.org/397663002/diff/40001/tools/telemetry/telemetr...
tools/telemetry/telemetry/util/statistics.py:271:
On 2014/07/21 14:06:31, picksi wrote:
> The unit-tests revealed an assumption that the event list was already sorted by
> event start_time. The 3 lines updated below get the minimum start_time, rather
> than the [0]'th start time.
Done.

https://codereview.chromium.org/397663002/diff/40001/tools/telemetry/telemetr...
tools/telemetry/telemetry/util/statistics.py:278: for event_details in event_details_list:
Done. This is at the top end of complexity allowed by the style guide but it does make the code clearer.

https://codereview.chromium.org/397663002/diff/60001/tools/telemetry/telemetr...
File tools/telemetry/telemetry/web_perf/metrics/rendering_stats_unittest.py (right):

https://codereview.chromium.org/397663002/diff/60001/tools/telemetry/telemetr...
tools/telemetry/telemetry/web_perf/metrics/rendering_stats_unittest.py:15:
I added GetInputEventStartAndLatency (below) which ended up being too long a line, so decided to refactor the imports into a list. There seems to be no guidance on this in the Style Guide...

https://codereview.chromium.org/397663002/diff/60001/tools/telemetry/telemetr...
tools/telemetry/telemetry/web_perf/metrics/rendering_stats_unittest.py:209:
Due to a change in rendering_stats.py the code now does (in effect) (x/1000) + (y/1000) instead of (x+y)/1000. This resulted in differences around the 12th decimal place for lists containing latencies (due to rounding errors), which resulted in asserts firing. I added the following helper to fix this - I'm not sure that it's safe directly comparing floats anyway.
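A rough sketch of the kind of tolerance-based helper being described (the method name, tolerance, and list-of-floats shape are illustrative; the actual helper in the patch may differ):

  import unittest

  class LatencyListAssertions(unittest.TestCase):
    def assertLatencyListsAlmostEqual(self, expected, actual, places=7):
      # Computing (x/1000) + (y/1000) rather than (x+y)/1000 introduces tiny
      # floating-point differences, so compare with a tolerance instead of
      # exact equality.
      self.assertEqual(len(expected), len(actual))
      for e, a in zip(expected, actual):
        self.assertAlmostEqual(e, a, places=places)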
I've run smoothness tests with a repeat of 10, the results are here: https://x20web.corp.google.com/~picksi/results.html I don't know which pages are known to be bad and which good, so I can't judge the effectiveness of the metric. I do note, however, that sports.yahoo.com has the highest input_event_latency_discrepancy and also has the highest time_for_input_latency_to_stabilize and highest first_3_GSU_events_latency_mean, so there does appear to be some reasonable correlation.
how does this relate to frame queueing durations metric?
+ariblue --- ari, you and @picksi should probably become good friends ^_^
On 2014/07/28 14:18:28, picksi1 wrote: > I've run smoothness tests with a repeat of 10, the results are here: > > https://x20web.corp.google.com/~picksi/results.html > > I don't know which pages are known to be bad and which good, so I can't judge > the effectiveness of the metric. I do note, however, that http://sports.yahoo.com has > the highest input_event_latency_discrepancy and also has the highest > time_for_input_latency_to_stabilize and highest first_3_GSU_events_latency_mean, > so there does appear to be some reasonable correlation. Thanks Simon. They both seem rather noisy - notice that many are +- >100%, compared to mean_input_event_latency where the variation across runs was <30%. first_3_GSU_events_latency does seem substantially less noisy than time_for_input_latency_to_stabilize. I agree there's some decent correlation between the two though. What if you look only at the first GSU event latency (instead of averaging the first 3) - that should be a stronger discriminator. Yufeng, were you suggesting the first 3 input events that caused a swap so that we can handle both scroll and touchmove cases? If he's looking just at GSU events, then looking only at the first one should be best, right? Alternately if we want this to be agnostic (work for touch-scroll, mouse, and touch-drag) then I agree something like the average of the first 3 swapping events could work. There's some bugs ('JankyDueToBlinkScheduler' or whatever) with specific problematic sites that are motivating the scheduling work. Those were the "known bad" sites I was talking about. Also the synthetic test cases in the touch_scheduler_cases pageset - the "slow touch handler" cases there are synthetic worst-case for scroll starting. I suspect both of your heuristics should do a good job discriminating against those. Can you post the results against touch_scheduler_cases?
On 2014/07/29 19:56:51, Rick Byers wrote: > On 2014/07/28 14:18:28, picksi1 wrote: > > I've run smoothness tests with a repeat of 10, the results are here: > > > > https://x20web.corp.google.com/~picksi/results.html > > > > I don't know which pages are known to be bad and which good, so I can't judge > > the effectiveness of the metric. I do note, however, that > http://sports.yahoo.com has > > the highest input_event_latency_discrepancy and also has the highest > > time_for_input_latency_to_stabilize and highest > first_3_GSU_events_latency_mean, > > so there does appear to be some reasonable correlation. > > Thanks Simon. They both seem rather noisy - notice that many are +- >100%, > compared to mean_input_event_latency where the variation across runs was <30%. > first_3_GSU_events_latency does seem substantially less noisy than > time_for_input_latency_to_stabilize. I agree there's some decent correlation > between the two though. What if you look only at the first GSU event latency > (instead of averaging the first 3) - that should be a stronger discriminator. > > Yufeng, were you suggesting the first 3 input events that caused a swap so that > we can handle both scroll and touchmove cases? If he's looking just at GSU > events, then looking only at the first one should be best, right? Alternately > if we want this to be agnostic (work for touch-scroll, mouse, and touch-drag) > then I agree something like the average of the first 3 swapping events could > work. > My reasoning behind "1/3 input/GSU events that cause swap": 1) 1 vs 3: this is intended to accommodate the unpredictability of main thread's work. If we can repro consistently the timing in our tests that the first event is blocked by predictable main thread work, then 1 event is ok. But, if we can't consistently repro that, e.g. main thread happens to be idle, first event comes and gets handled, then main thread becomes busy again and blocks the next event, then in this case I think choosing 3 events is likely more robust in capturing the case of main thread blockage. 2) input vs GSU : if we are only focusing on scrolling, then looking only at GSU is fine. If we also care about say touch handling in main thread, then we should be more generic in choosing any input event that cause swap. > There's some bugs ('JankyDueToBlinkScheduler' or whatever) with specific > problematic sites that are motivating the scheduling work. Those were the > "known bad" sites I was talking about. Also the synthetic test cases in the > touch_scheduler_cases pageset - the "slow touch handler" cases there are > synthetic worst-case for scroll starting. I suspect both of your heuristics > should do a good job discriminating against those. Can you post the results > against touch_scheduler_cases?
On 2014/07/29 20:15:30, Yufeng Shen wrote: > On 2014/07/29 19:56:51, Rick Byers wrote: > > On 2014/07/28 14:18:28, picksi1 wrote: > > > I've run smoothness tests with a repeat of 10, the results are here: > > > > > > https://x20web.corp.google.com/~picksi/results.html > > > > > > I don't know which pages are known to be bad and which good, so I can't > judge > > > the effectiveness of the metric. I do note, however, that > > http://sports.yahoo.com has > > > the highest input_event_latency_discrepancy and also has the highest > > > time_for_input_latency_to_stabilize and highest > > first_3_GSU_events_latency_mean, > > > so there does appear to be some reasonable correlation. > > > > Thanks Simon. They both seem rather noisy - notice that many are +- >100%, > > compared to mean_input_event_latency where the variation across runs was <30%. > > > first_3_GSU_events_latency does seem substantially less noisy than > > time_for_input_latency_to_stabilize. I agree there's some decent correlation > > between the two though. What if you look only at the first GSU event latency > > (instead of averaging the first 3) - that should be a stronger discriminator. > > > > Yufeng, were you suggesting the first 3 input events that caused a swap so > that > > we can handle both scroll and touchmove cases? If he's looking just at GSU > > events, then looking only at the first one should be best, right? Alternately > > if we want this to be agnostic (work for touch-scroll, mouse, and touch-drag) > > then I agree something like the average of the first 3 swapping events could > > work. > > > > My reasoning behind "1/3 input/GSU events that cause swap": > > 1) 1 vs 3: this is intended to accommodate the unpredictability of main thread's > work. If we can repro consistently the timing in our tests that the first event > is blocked by predictable main thread work, then 1 event is ok. But, if we can't > consistently repro that, e.g. main thread happens to be idle, first event comes > and gets handled, then main thread becomes busy again and blocks the next event, > then in this case I think choosing 3 events is likely more robust in capturing > the case of main thread blockage. > OK, Rick corrected me on this offline. For impl scrolling, once the first GSU starts, then main thread will no longer block the scrolling since we handle the following GSUs in the impl thread. The reason we see not only just the first GSU, but the first few GSUs all have large input latency is because we coalesce multiple touch moves in the browser and only send one coalesced touch move to the renderer. If the main thread is busy, the ack for the coalesced touch move gets delayed and so all the individual touch move gets delayed ack. And when those touch moves turn into GSUs, those GSUs will have large input latency. > 2) input vs GSU : if we are only focusing on scrolling, then looking only at GSU > is fine. If we also care about say touch handling in main thread, then we should > be more generic in choosing any input event that cause swap. > > > > There's some bugs ('JankyDueToBlinkScheduler' or whatever) with specific > > problematic sites that are motivating the scheduling work. Those were the > > "known bad" sites I was talking about. Also the synthetic test cases in the > > touch_scheduler_cases pageset - the "slow touch handler" cases there are > > synthetic worst-case for scroll starting. I suspect both of your heuristics > > should do a good job discriminating against those. 
Can you post the results > > against touch_scheduler_cases?
I've run tough_scheduling_cases with a repeat of 10 and have put the results here:
https://x20web.corp.google.com/~picksi/GSU1and3ToughSchedulingCases.html

I measured both the 1st GSU latency (74.34ms) and the average of the first 3 GSUs' latency (56.25ms) to see if there was a significant difference (it looks like the 1st GSU has a longer latency, which is unsurprising). I've updated the code to report just the 1st GSU for now, but have left the code that calculates the average in there (it calculates an average of one event at the moment!) in case we need to return to an average at some point... I can remove this if you think it's bloating the code with "YAGNI" work.
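For readers skimming the thread, a minimal sketch of what "report just the 1st GSU" amounts to (assuming a list of per-event records with start_time and latency fields, as in the earlier namedtuple sketch; names are illustrative):

  def FirstGestureScrollUpdateLatency(gsu_events):
    """Latency of the earliest GestureScrollUpdate, or None if there were none.

    gsu_events is assumed to hold the InputLatency::GestureScrollUpdate
    records for a single interaction.
    """
    if not gsu_events:
      return None
    first = min(gsu_events, key=lambda e: e.start_time)
    return first.latency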
any response to #17?
Queuing_duration is an average of how long input events spend waiting in a task queue before they get processed (something we want the blink scheduler to drive down). Latency is the time taken from when an input event is received to when its effect is visible to the user. Both these metrics are averages across the frame's lifetime (there are others on this thread more knowledgeable than me about this, please correct me if I'm wrong!)

The new metrics I'm adding are based on the 'early' latency data. Eric was keen to unblock scrolling as soon as possible when a new page is loaded. These metrics are aiming to quantify that delay so we have some way of assessing if our scheduling work is improving the situation, i.e. giving control to the user sooner.

As part of another related CL, Dominik has moved the start of data collection earlier (https://codereview.chromium.org/392613002/) - without this patch the first second or so of a page's life goes untracked, which hides some of the early unresponsiveness.
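As a rough illustration of the distinction between the two measures (purely hypothetical timestamps and field names, not telemetry's actual data model):

  # Each record: when the event arrived, when its handler started running,
  # and when its effect reached the screen (all in ms).
  events = [
      {'arrived': 0.0,  'handler_start': 90.0, 'visible': 120.0},
      {'arrived': 16.0, 'handler_start': 95.0, 'visible': 125.0},
  ]

  # Queueing duration: time spent waiting in the task queue before processing.
  mean_queueing = sum(e['handler_start'] - e['arrived'] for e in events) / len(events)

  # Input event latency: from arrival until the effect is visible to the user.
  mean_latency = sum(e['visible'] - e['arrived'] for e in events) / len(events)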
On 2014/07/30 14:46:57, picksi1 wrote: > I've run tough_scheduling_cases with a repeat of 10 and have put the results > here: > https://x20web.corp.google.com/~picksi/GSU1and3ToughSchedulingCases.html > > I measured both the 1st GSU latency and (74.34ms) the average of the first 3 > GSUs latency (56.25ms) to see if there was a significant difference (it looks > like the 1st GSU has a longer latency, which is unsurprising). I've updated the > code to report just the 1st GSU for now, but have left the code that calculates > the average in there (it calculates an average of one event at the moment!) in > case we need to return to an average at some point... I can remove this if you > think it's bloating the code with "YAGNI" work. The 1st GSU latency sounds fantastic. Don't have strong opinions about the averaging code but if it's unused probably best to remove it. lgtm as a non-owner who is not super familiar with this code and doesn't have authority here but nonetheless wants to see these metrics go live.
Looking at the tough_scheduling_cases results, there are some interesting points:

1) Look at first_1_GSU_events_latency_mean.empty_touch_handler.html?slow_handler. That reports almost exactly 600ms, which is the latency we've added in this test (each handleInputEvent is delayed by 200ms and there's touchstart, gesturetapdown, and touchmove before the scroll can start). There's also essentially no variation between runs (~1%). So this is doing a very good job of capturing the user's perception of how long the scroll takes to start in this simplest possible case.

2) time_for_input_latency_to_stabilize.empty_touch_handler.html?slow_handler also seems to be doing a good job (630ms +-4%).

3) As expected, first_n_GSU_events_latency doesn't capture touch event handling cases like touch_handler_scrolling.html - it just reports 0. This is fine - we reliably get no such signals from such pages, but that shouldn't mess anything up. And mean_input_event_latency is the right metric for such a page when the jank is steady.

4) time_for_input_latency_to_stabilize also doesn't seem to reliably capture these cases (since the latency is consistent throughout the sequence, there is no 'stabilize'), so the results are meaningless and very noisy. I'm worried that if we add this metric we won't be able to alert on it reliably due to cases like this. Perhaps it should report 0 if there's not enough variation in the latency? Check out touch_handler_scrolling.html?slow_handler and super_slow_handler - as far as I can tell the result is meaningless here with tons of variability.

5) queueing_durations.empty_touch_handler.html?slow_handler also appears to capture the jank on the thread on average, but this isn't directly correlated to the user experience of scrolling because most of the jank does not block the scroll. It's also extremely noisy for some reason - it seems randomly to be either 0, 200, 400 or 600ms.

6) first_3_GSU_latency shows similar patterns but the effect is muted (as a result of being averaged with 2 low-latency GSUs). I don't see any benefit of this over first_1_GSU_latency.

Taken together this suggests to me that we should be focusing on the first GSU latency (although I'd suggest simplifying the name to just gesture_scroll_start_latency). I'd suggest landing that first and watching its performance on the bots. If we can figure out how to eliminate noise from time_for_input_latency_to_stabilize when it's not meaningful then adding that would help give coverage for touch event handling cases during page load. I'm not sure this adds enough value on top of the other metrics to justify the complexity though.

Rick
I have removed time to stabilize and renamed the GSU metric as you suggested.
Thanks Simon, this LGTM (with nit). Also please update the CL description.

https://codereview.chromium.org/397663002/diff/120001/tools/telemetry/telemet...
File tools/telemetry/telemetry/web_perf/metrics/smoothness.py (right):

https://codereview.chromium.org/397663002/diff/120001/tools/telemetry/telemet...
tools/telemetry/telemetry/web_perf/metrics/smoothness.py:47: 'gesture_scroll_latency',
nit: you want something like gesture_scroll_start_latency. It's the "start" that's really unique about this metric.
On 2014/08/01 12:42:02, Rick Byers wrote: > Thanks Simon, this LGTM (with nit). > > Also please update the CL description. > > https://codereview.chromium.org/397663002/diff/120001/tools/telemetry/telemet... > File tools/telemetry/telemetry/web_perf/metrics/smoothness.py (right): > > https://codereview.chromium.org/397663002/diff/120001/tools/telemetry/telemet... > tools/telemetry/telemetry/web_perf/metrics/smoothness.py:47: > 'gesture_scroll_latency', > nit: you want something like gesture_scroll_start_latency. It's the "start" > that's really unique about this metric. I've called it 'first_gesture_scroll_latency', as that's what it is being reported!
... and I've changed the Issue name & CL description
On 2014/08/01 16:44:10, picksi1 wrote: > ... and I've changed the Issue name & CL description Thanks, still LGTM
On 2014/08/01 at 17:48:49, rbyers wrote: > On 2014/08/01 16:44:10, picksi1 wrote: > > ... and I've changed the Issue name & CL description > > Thanks, still LGTM
Er, i'd like to clarify something here: i'm not a drive-by reviewer on this. I'd like to lg this before it lands.

As it stands, I'm not happy with the lack of cohesion between this and the queueing metrics introduced by ari. I'm also not wild about the ad-hoc solution to quiescence. As a benchmark in isolation of all the rest of our benchmark and metric code, this seems good. But when you view it in the total-set-of-metrics-and-benchmarks-we-collect around smoothness, this brings up some interesting challenges that we should pro-actively discuss instead of gloss over.

As such, I'd encourage us to split this patch into two parts:

1. Figuring out how to land a metric for input event queueing delay. That part of the patch would discuss how input queueing delays relate to begin frame delays, and whether we should have some more generic notion of main-thread-responsiveness-to-rendering-intent. That part of the patch can probably proceed rather quickly, as I suspect proper unification of input and frame production requires https://docs.google.com/a/chromium.org/document/d/13xtO-_NSSnNZRRS1Xq3xGNKZaw...

2. Figuring out how to report some types of metrics for before-and-after quiescence, and having a very pro-active discussion about what this is relative to the load and domcontentready and network idle states, which are the typical indicators of such a state. fps-before-quiescence, thread time before quiescence, etc. are all very interesting. I don't think every single one of those metrics should have to do one-off code in order to support such a distinction.
ah, @$@)(#*, i didn't see you removed the quiescence stuff. I really like that btw, let's talk about how to add that as a core concept in telemetry throughout, it's important stuff. apologies for replying before reading the current patch. you need unit tests for this. then i can lg.
https://codereview.chromium.org/397663002/diff/140001/tools/telemetry/telemet... File tools/telemetry/telemetry/web_perf/metrics/smoothness.py (right): https://codereview.chromium.org/397663002/diff/140001/tools/telemetry/telemet... tools/telemetry/telemetry/web_perf/metrics/smoothness.py:38: input_start_and_latency_GSU = FlattenList(stats.input_start_and_latency_GSU) specifically should we have tests for the smoothness metric part? or do we not currently unit test the integration point?
https://codereview.chromium.org/397663002/diff/140001/tools/telemetry/telemet... File tools/telemetry/telemetry/web_perf/metrics/smoothness.py (right): https://codereview.chromium.org/397663002/diff/140001/tools/telemetry/telemet... tools/telemetry/telemetry/web_perf/metrics/smoothness.py:38: input_start_and_latency_GSU = FlattenList(stats.input_start_and_latency_GSU) On 2014/08/01 19:21:26, nduca wrote: > specifically should we have tests for the smoothness metric part? or do we not > currently unit test the integration point? Hmm, looks like the smoothness metric lacks unit tests. Coverage would be nice, maybe in a follow-up patch?
On 2014/08/09 01:58:50, jdduke wrote: > https://codereview.chromium.org/397663002/diff/140001/tools/telemetry/telemet... > File tools/telemetry/telemetry/web_perf/metrics/smoothness.py (right): > > https://codereview.chromium.org/397663002/diff/140001/tools/telemetry/telemet... > tools/telemetry/telemetry/web_perf/metrics/smoothness.py:38: > input_start_and_latency_GSU = FlattenList(stats.input_start_and_latency_GSU) > On 2014/08/01 19:21:26, nduca wrote: > > specifically should we have tests for the smoothness metric part? or do we not > > currently unit test the integration point? > > Hmm, looks like the smoothness metric lacks unit tests. Coverage would be nice, > maybe in a follow-up patch? Looks like the bug id is incorrect.
On 2014/08/12 18:55:14, tdresser wrote: > On 2014/08/09 01:58:50, jdduke wrote: > > > https://codereview.chromium.org/397663002/diff/140001/tools/telemetry/telemet... > > File tools/telemetry/telemetry/web_perf/metrics/smoothness.py (right): > > > > > https://codereview.chromium.org/397663002/diff/140001/tools/telemetry/telemet... > > tools/telemetry/telemetry/web_perf/metrics/smoothness.py:38: > > input_start_and_latency_GSU = FlattenList(stats.input_start_and_latency_GSU) > > On 2014/08/01 19:21:26, nduca wrote: > > > specifically should we have tests for the smoothness metric part? or do we > not > > > currently unit test the integration point? > > > > Hmm, looks like the smoothness metric lacks unit tests. Coverage would be > nice, > > maybe in a follow-up patch? > > Looks like the bug id is incorrect. FYI, I've moved this patch over to https://codereview.chromium.org/467343002/.