Chromium Code Reviews

Side by Side Diff: docs/speed/good_toplevel_metrics.md

Issue 2973213002: Add documentation on authoring metrics. (Closed)
Patch Set: Created 3 years, 5 months ago
1 # Properties of a Good Top Level Metric
2
3 When defining a top level metric, there are several desirable properties which are frequently in tension. This document attempts to roughly outline the desirable properties we should keep in mind when defining a metric. Also see the document on improving the actionability of a top level metric via [diagnostic metrics](diagnostic_metrics.md).
4
5 [TOC]
6
7 ## Representative
8
9 Top level metrics are how we understand our product’s high level behavior, and if they don’t correlate with user experience, our understanding is misaligned with the real world. However, measuring representativeness is costly. In the long term, we can use ablation studies (in the browser or in partnership with representative sites), or user studies to confirm representativeness. In the short term, we use our intuition in defining the metric, and carefully measure the metric implementation’s accuracy.
10
11 These metrics would ideally also correlate strongly with business value, making it easy to motivate site owners to optimize them.
12
13 ## Accurate
14
15 When we first come up with a metric, we have a concept in mind of what the metric is trying to measure. The accuracy of a metric implementation is how closely the metric implementation aligns to our conceptual model of what we’re trying to measure.
16
17 For example, First Contentful Paint was created to measure the first time we paint something the user might actually care about. Our current implementation looks at when the browser first painted any text, image, non-white canvas or SVG. The accuracy of this metric is determined by how often the first thing painted which the user cares about is text, image, canvas or SVG.
18
19 To evaluate how accurate a metric is, there’s no substitute for manual evaluation. Ideally, this evaluation phase would be performed by multiple people, with little knowledge of the metric in question.
20
21 To initially evaluate the accuracy of a point-in-time metric:
22
23 * Gather a bunch of samples of pages where we can compute our metric.
24 * Get a group of people unfamiliar with the proposed metric implementations to identify what they believe is the correct point in time for each sample.
25 * Measure the variability of the hand-picked points in time. If this amount of variability is deemed too high, we’ll need to come up with a more specific metric, which is easier to hand evaluate.
26 * Measure the error between the implementation results and the hand-picked results. Ideally, our error measurement would be more forgiving in cases where humans were unsure of the correct point in time. We don’t have a concrete plan here yet; one very rough possibility is sketched below.
sullivan 2017/07/13 13:12:27 I really liked the analysis ksakamoto did here: ht
tdresser 2017/07/20 16:16:38 ksakamoto's results haven't been reproducible, so
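One very rough, purely illustrative way this comparison could start (the helper names are hypothetical, and this is not a settled plan) is to look at the spread of the hand-picked times for each sample and the implementation’s absolute error against their median:

```python
import statistics

def label_spread(hand_times):
    """How much the hand-picked times for one sample disagree; a large
    spread suggests the metric concept is too vague to hand evaluate."""
    return statistics.stdev(hand_times)

def implementation_error(metric_time, hand_times):
    """Absolute error of the metric implementation against the median
    of the hand-picked times for one sample."""
    return abs(metric_time - statistics.median(hand_times))

print(label_spread([1200, 1250, 1190]))                 # humans roughly agree
print(implementation_error(1500, [1200, 1250, 1190]))   # fired 300ms too late
```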
27
28 To initially evaluate the accuracy of a quality-of-experience metric, we rely heavily on human intuition:
29
30 * Gather a bunch of samples of pages where we can compute our metric.
31 * Get a group of people unfamiliar with the proposed metric implementations to sort the samples by their estimated quality of experience.
32 * Use [Spearman's rank-order correlation](https://statistics.laerd.com/statistical-guides/spearmans-rank-order-correlation-statistical-guide.php) to examine how well correlated the different orderings are. If they aren’t deemed consistent enough, we’ll need to come up with a more specific metric, which is easier to hand evaluate.
33 * Use the metric implementation to sort the samples.
34 * Use [Spearman's rank-order correlation](https://statistics.laerd.com/statistical-guides/spearmans-rank-order-correlation-statistical-guide.php) to evaluate how similar the metric implementation’s ordering is to the hand ordering (see the sketch after this list).
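As a minimal sketch of that last step, assuming each sample has a hand-assigned rank and a raw metric value (all names and numbers below are illustrative), SciPy’s `spearmanr` computes the rank-order correlation directly:

```python
from scipy.stats import spearmanr

# One entry per sample: the rank humans assigned (1 = best experience)
# and the value the metric implementation produced for that sample.
human_ranks   = [1, 2, 3, 4, 5, 6]
metric_values = [180, 260, 240, 900, 1100, 1500]  # e.g. milliseconds

rho, p_value = spearmanr(human_ranks, metric_values)
print(f"Spearman's rho = {rho:.2f} (p = {p_value:.3f})")
```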
35
36 ## Stable
37
38 A metric is stable if the result doesn’t vary much between successive runs on similar input. This can be quantitatively evaluated, ideally using Chrome Trace Processor and cluster telemetry on the top 10k sites. Eventually we hope to have a concrete threshold for a specific spread metric here, but for now, we gather the stability data, and analyze it by hand.
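One illustrative candidate for such a spread metric (not a settled choice) is the standard deviation normalized by the median, computed per site across repeated runs:

```python
import statistics

def spread(values):
    """Coefficient-of-variation-style spread of one site's metric values
    across repeated runs; smaller means the metric is more stable."""
    med = statistics.median(values)
    return statistics.stdev(values) / med if med else float("inf")

# e.g. five load-time samples (ms) for the same page
print(spread([1210, 1180, 1250, 1195, 1230]))
```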
39
40 Different domains have different amounts of inherent instability - for example, when measuring page load performance using a real network, the network injects significant variability. We can’t avoid this, but we can try to implement metrics which minimize instability, and don’t exaggerate the instability inherent in the system.
41
42 ## Interpretable
43
44 A metric is interpretable if the numbers it produces are easy to understand, especially for individuals without strong domain knowledge. For example, point-in-time metrics tend to be easy to explain, even if their implementations are complicated (see "Simplicity"): it’s easy to communicate what First Meaningful Paint is, even if how we compute it is very complicated. Conversely, something like [SpeedIndex](https://sites.google.com/a/webpagetest.org/docs/using-webpagetest/metrics/speed-index) is somewhat difficult to explain and [hard to reason about](https://docs.google.com/document/d/14K3HTKN7tyROlYQhSiFP89TT-Ddg2aId9uyEsWj5UAY/edit) - it’s the average time at which things were displayed on the page.
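For concreteness, SpeedIndex is roughly the integral over time of how visually incomplete the page still is; a toy sketch, assuming a hypothetical `completeness(t)` curve in [0, 1], might look like:

```python
def speed_index(completeness, end_ms):
    """Roughly: sum, over each millisecond up to `end_ms`, of how visually
    incomplete the page still is (completeness 0.0 = blank, 1.0 = done)."""
    return sum(1.0 - completeness(t) for t in range(end_ms))
```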
45
46 Metrics which are easy to interpret are often easier to evaluate. For example, First Meaningful Paint can be evaluated by comparing hand-picked first meaningful paint times to the results of a given approach for computing first meaningful paint. SpeedIndex is more complicated to evaluate - we’d need to use the approach given [above](#Accurate) for quality of experience metrics.
47
48 ## Simple
49
50 A metric is simple if the way it’s computed is easy to understand. There’s a strong correlation between being simple and being interpretable, but there are counterexamples, such as FMP being interpretable, but not simple.
51
52 A simple metric is less likely to have been overfit during the metric development / evaluation phase, and has other obvious advantages (easier to maintain, often faster to execute, less likely to contain bugs).
53
54 One crude way of quantifying simplicity is to measure the number of tunable parameters. For example, we can look at two ways of aggregating Frame Throughput. We could look at the average Frame Throughput defined over all animations during the pageview. Alternatively, we could look for the 300ms window with the worst average Frame Throughput. The second approach has one additional parameter, and is thus strictly more complex.
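To make the comparison concrete, here is a rough sketch of the second, windowed aggregation, assuming all we have is a list of frame presentation timestamps in seconds (the helper name is illustrative):

```python
def worst_window_throughput(frame_times, window_s=0.3):
    """Lowest frames-per-second observed over any `window_s`-second span
    starting at a frame; `window_s` is the extra tunable parameter."""
    worst = float("inf")
    for i, start in enumerate(frame_times):
        frames = sum(1 for t in frame_times[i:] if t < start + window_s)
        worst = min(worst, frames / window_s)
    return worst
```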
55
56 ## Elastic
57
58 A good metric is [elastic](https://en.wikipedia.org/wiki/Elasticity_of_a_function), that is, a small change in the input (the page) results in a small change in the output.
59
60 In a continuous integration environment, you want to know whether or not a given code change resulted in metric improvements or regressions. Non-elastic metrics often obscure changes, making it hard to justify small but meaningful improvements, or allowing small but meaningful regressions to slip by. Elastic metrics also generally have lower variability.
61
62 This is frequently at odds with the interpretability requirement. For example, First Meaningful Paint is easier to interpret than SpeedIndex, but is non-elastic.
63
64 If your metric involves thresholds (such as the 50ms task length threshold in TTI), or heuristics (looking at the largest jump in the number of layout objects in FMP), it’s likely to be non-elastic.
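A toy illustration of why (this is not Chrome’s actual TTI logic, just a hypothetical threshold-based metric): a 2ms change to a single task can move the reported value by several seconds.

```python
def toy_threshold_metric(tasks, threshold_ms=50):
    """End time of the last task longer than `threshold_ms`;
    `tasks` is a list of (start_ms, duration_ms) pairs."""
    long_ends = [s + d for s, d in tasks if d > threshold_ms]
    return max(long_ends, default=0)

print(toy_threshold_metric([(0, 30), (8000, 49)]))  # 0    - no "long" tasks
print(toy_threshold_metric([(0, 30), (8000, 51)]))  # 8051 - 2ms longer, ~8s jump
```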
65
66 ## Realtime
67
68 We’d like to have metrics which we can compute in realtime. For example, if we’re measuring First Meaningful Paint, we’d like to know when First Meaningful Paint occurred *at the time it occurred*. This isn’t always attainable, but when possible, it avoids some classes of [survivorship bias](https://en.wikipedia.org/wiki/Survivorship_bias), which makes metrics easier to analyze.
69
70 ## Example
71
72 [Time to Consistently Interactive](https://docs.google.com/document/d/1GGiI9-7KeY3TPqS3YT271upUVimo-XiL5mwWorDUD4c/edit):
73
74 * Representative
75     * We should eventually do an ablation study, similar to the page load ablation study [here](https://docs.google.com/document/d/1wpu8aqZIUVgjNm9zBP9gU_swx5ODleH1s2Kueo1pIfc/edit#).
76
77 * Accurate
78     * Summary [here](https://docs.google.com/document/d/1GGiI9-7KeY3TPqS3YT271upUVimo-XiL5mwWorDUD4c/edit#heading=h.iqlwzaf6lqrh), analysis [here](https://docs.google.com/document/d/1pZsTKqcBUb1pc49J89QbZDisCmHLpMyUqElOwYqTpSI/edit#bookmark=id.4euqu19nka18). Overall, based on manual investigation of 25 sites, our approach fired uncontroversially at the right time 64% of the time, and possibly too late the other 36% of the time. We split TTI in two to allow this metric to be quite pessimistic about when TTI fires, so we’re happy with when this fires for all 25 sites. A few issues with this research:
79         * Ideally someone less familiar with our approach would have performed the evaluation.
80         * Ideally we’d have looked at more than 25 sites.
81 * Stable
82     * Analysis [here](https://docs.google.com/document/d/1GGiI9-7KeY3TPqS3YT271upUVimo-XiL5mwWorDUD4c/edit#heading=h.27s41u6tkfzj).
83 * Interpretable
84     * Time to Consistently Interactive is easy to explain. We report the start of the first 5 second window where the network is roughly idle and no tasks are greater than 50ms long (a simplified sketch of this search appears at the end of this document).
85 * Elastic
86     * Time to Consistently Interactive is generally non-elastic. We’re investigating another metric which will quantify how busy the main thread is between FMP and TTI, which should be a nice elastic proxy metric for TTI.
87 * Simple
88     * Time To Consistently Interactive has a reasonable amount of complexity, but is much simpler than Time to First Interactive. Time to Consistently Interactive has 3 parameters:
89         * Number of allowable requests during network idle (currently 2).
90         * Length of allowable tasks during main thread idle (currently 50ms).
91         * Window length (currently 5 seconds).
92 * Realtime
93     * Time To Consistently Interactive is definitely not realtime, as it needs to wait until it’s seen 5 seconds of idle time before declaring that we became interactive at the start of the 5 second window.
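For illustration only, a heavily simplified sketch of the window search described above (this is not Chrome’s implementation; the `in_flight` helper and the coarse sampling are assumptions of the sketch):

```python
def toy_consistently_interactive(long_tasks, in_flight, trace_end_s,
                                 window_s=5.0, max_requests=2):
    """Find the start of the first `window_s`-second span with no tasks
    longer than 50ms and at most `max_requests` network requests in
    flight. `long_tasks` is a list of (start_s, end_s) intervals for
    tasks > 50ms; `in_flight(t)` returns requests in flight at time t."""
    t = 0.0
    while t + window_s <= trace_end_s:
        no_long_tasks = all(end <= t or start >= t + window_s
                            for start, end in long_tasks)
        network_idle = all(in_flight(t + dt) <= max_requests
                           for dt in (0.0, window_s / 2, window_s))  # coarse
        if no_long_tasks and network_idle:
            return t  # interactive at the start of the quiet window
        t += 0.1
    return None  # never became consistently interactive in this trace
```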
