Chromium Code Reviews
chromiumcodereview-hr@appspot.gserviceaccount.com (chromiumcodereview-hr) | Please choose your nickname with Settings | Help | Chromium Project | Gerrit Changes | Sign out
(463)

Side by Side Diff: site/dev/contrib/simd.md

Issue 1330083002: Add simd.md to document Skia SIMD code / plans. (Closed) Base URL: https://skia.googlesource.com/skia.git@master
Patch Set: draft Created 5 years, 3 months ago
Use n/p to move between diff chunks; N/P to move between comments. Draft comments are only viewable by you.
Jump to:
View unified diff | Download patch
« no previous file with comments | « no previous file | no next file » | no next file with comments »
Toggle Intra-line Diffs ('i') | Expand Comments ('e') | Collapse Comments ('c') | Show Comments Hide Comments ('s')
OLDNEW
(Empty)
1 Skia's New Approach to SIMD
2 ===========================
3
4 Most hot software paths in Skia are implemented with processor-specific SIMD ins tructions. For graphics performance, the parallelism from SIMD is essential: th ere is simply no realistic way to eek the same performance out of portable C++ c ode as we can from the SSE family of instruction sets on x86 or from NEON on ARM or from MIPS32's DSP instructions. Depending on the particular code path and m ath involved, we see 2, 4, 8, or even ~16x performance increases over portable c ode when really exploiting the processor-specific SIMD instructions.
5
6 But the SIMD code we've piled up over the years has some serious problems. It's often quite low-level, with poor factoring leading to verbose, bug prone, and d ifficult to read code. SIMD instrinsic types and functions take a good long whi le to get used to reading, let alone writing, and assembly is generally just a c omplete non-starter. SIMD coverage of Skia methods is not dense: a particular d rawing routine might be specialized for NEON but not for SSE, or might have a MI PS DSP implementation but no NEON. Even when we have full instruction set cover age, the implementations of these specialized routines may not produce identical results, either when compared with each other or with our portable fallback cod e. The SIMD implementations are often simply incorrect, but the code is so frag ile and difficult to understand, we can't fix it. There are long lived bugs in our tracker involving crashes and buffer under- and overflows that we simply can not fix because no one on the team understands the code involved. And finally, to top it all off, the code isn't always even really that fast.
7
8 This all needs to change. I want Skia developers to be able to write correct, c lear, and fast code, and in software rendering, SIMD is the only way to get "fas t". This document outlines a new vision for how Skia will use SIMD instructions with no compromises, writing clear code _once_ that runs quickly on all platfor ms we support.
9
10 The Plan
11 --------
12
13 We're going to wrap low-level platform-specific instrinsics with zero-cost abstr actions with interfaces matching Skia's higher-level-but-still-quite-low-level u se cases. Skia code will write to this interface _once_, which then compiles to efficient SSE, NEON, or portable code (MIPS is quite TBD, for now group it conc eptually under portable code) via platform-specific backends. The key here is t o find the right sweet spot of abstraction that allows us to express the graphic al concepts we want in Skia while allowing each of those platform-specific backe nds flexibility to implement those concepts as efficiently as possible.
14
15 While Skia uses a mix of float, 32-bit, 16-bit, and 8-bit integer SIMD instructi ons, 32-bit integers fall quite behind the rest in usage. Since we tend to oper ate on 8888 ARGB values, 8-bit SIMD tends to be the most natural and fastest app roach, but when multiplication gets involved (essentially all the time), 16-bit SIMD inevitably gets tangled in there. For some operations like division, squar e roots, or math with high range or precision requirements, we expand our 8-bit pixel components up to floats, and working with a single pixel as a 4-float vect or becomes most natural. This plan focuses on how we'll deal with these majorit y cases: floats, and 8- and 16-bit integers.
16
17 `SkNf` for floats
18 ---------------
19
20 Wrapping floats with an API that allows efficient implementation on SSE and NEON is by far the easiest task involved here. Both SSE and NEON naturally work wit h 128-bit vectors of 4 floats, and they have a near 1-to-1 correspondence betwee n operations. Indeed, the correspondence is so close that it's tempting to solv e this problem by picking one set of intrinsics, e.g. NEON, and just `#define`in g portable and SSE implementations of NEON:
21
22 #define float32x4_t __m128
23 #define vmulq_f32 _mm_mul_ps
24 #define vaddq_f32 _mm_add_ps
25 #define vld1q_f32 _mm_loadu_ps
26 #define vst1q_f32 _mm_storeu_ps
27 ...
28
29 This temptation starts to break down when you notice:
30
31 - there are operations that don't quite correspond, e.g. `_mm_movemask_ps`; an d
32 - math written with either SSE or NEON instrinsics is still very hard to read; and
33 - sometimes we want to work with 4 floats, but sometimes 2, maybe even 8, etc.
34
35 So we use a wrapper class `SkNf<N>`, parameterized on N, how many floats the vec tor contains, constrained at compile time to be a power of 2. `SkNf` provides a ll the methods you'd expect on vector of N floats: loading and storing from floa t arrays, all the usual arithmetic operators, min and max, low and high precisio n reciprocal and sqrt, all the usual comparison operators, and a `.thenElse()` m ethod acting as a non-branching ternary `?:` operator. To support Skia's main g raphic needs, `SkNf` can also load and store from a vector of N _bytes_, convert ing up to a float when loading and rounding down to [0,255] when storing.
36
37 As a convenience, `SkNf<N>` has two default implementations: `SkNf<1>` performs all these operations on a single float, and the generic `SkNf<N>` simply recurse s onto two `SkNf<N/2>`. This allows our different backends to inject specialiat ions where most natural: the portable backend does nothing, so all `SkNf<N>` rec urse down to the default `SkNf<1>`; the NEON backend specializes `SkNf<2>` with `float32x2_t` and 64-bit SIMD methods, and `SkNf<4>` with `float32x4_t` and 128 -bit SIMD methods; the SSE backend specializes both `SkNf<4>` and `SkNf<2>` to u se the full or lower half of an `__m128` vector, respectively. A future AVX bac kend could simply drop in an `SkNf<8>` specialization.
38
39 Our most common float use cases are working with 2D coordinates and with 4-float -component pixels. Since these are so common, we've made simple typedefs for th ese two use cases, `Sk2f` and `Sk4f`, and also versions reminding you that it ca n work with vectors of `SkScalar` (a Skia-specific float typedef) too: `Sk2s`, ` Sk4s`.
40
41 `SkNf` in practice
42 ----------------
43
44 To date we have implemented several parts of Skia using Sk4f:
45
46 1. `SkColorMatrixFilter`
47 2. `SkRadialGradient`
48 3. `SkColorCubeFilter`
49 4. Three complicated `SkXfermode` subclasses: `ColorBurn`, `ColorDodge`, and ` SoftLight`.
50
51 In all these cases, we have been able to write a single implementation, producin g the same results cross-platform. The first three of those sites using Sk4f ar e entirely newly vectorized, and run much faster than the previous portable impl ementations. The 3 Sk4f transfermodes replaced portable, SSE, and NEON implemen tations which all produced different results, and the Sk4f versions are all fast er than their predecessors.
52
53 `SkColorCubeFilter` stands out as a particularly good example of how and why to use Sk4f over custom platform-specific intrinsics. Starting from some portable code and a rather slow SSE-only sketch, a Google Chromium dev, an Intel contribu tor, and I worked together to write an Sk4f version that's more than twice as fa st as the original, and runs fast on _both_ x86 and ARM.
54
55 `SkPx` for 8- and 16-bit fixed point math
56 ----------------------------------------
57
58 Building an abstraction layer over 8- and 16-bit fixed point math has proven to be quite a challenge. In fixed point, NEON and SSE again have some overlap, and they could probably be implemented in terms of each other if you were willing t o sacrifice performance on SSE in favor of NEON or vice versa. But unlike with floats, where `SkNf` is really a pretty thin veneer over very similar operations , to really get the best performance out of each fixed point instruction set you need to work in rather different idioms.
59
60 `SkPx`, our latest approach (there have been alpha `Sk16b` and beta `Sk4px` pred ecessors) to 8- and 16-bit SIMD tries to abstract over those idioms to again al low Skia developers to write one piece of clear graphics code that different bac kends can translate into their native intrinsics idiomatically.
61
62 `SkPx` is really a family of three related types:
63
64 1. `SkPx` itself represents between 1 and `SkPx::N` 8888 ARGB pixels, where `S kPx::N` is a backend-specific compile-time power of 2.
65 2. `SkPx::Wide` represents those same pixels, but with 16-bits of space per co mponent.
66 3. `SkPx::Alpha` represents the alpha channels of those same pixels.
67
68 `SkPx`, `Wide` and `Alpha` create a somewhat complicated algebra of operations e ntirely motivated by the graphical operations we need to perform. Here are some examples:
69
70 SkPx::LoadN(const uint32_t*) -> SkPx // Load full cruising-speed SkPx.
71 SkPx::Load(n, const uint32_t*) -> SkPx // For the 0<n<N ragged tail.
72
73 SkPx.storeN(uint32_t*) // Store a full SkPx.
74 SkPx.store(n, uint32_t*) // For the ragged 0<n<N tail.
75
76 SkPx + SkPx -> SkPx
77 SkPx - SkPx -> SkPx
78 SkPx.saturatedAdd(SkPx) -> SkPx
79
80 SkPx.alpha() -> Alpha // Extract alpha channels.
81 Alpha::LoadN(const uint8_t*) -> Alpha // Like SkPx loads, in 8-bit steps.
82 Alpha::Load(n, const uint8_t*) -> Alpha
83
84 SkPx.widenLo() -> Wide // argb -> 0a0r0g0b
85 SkPx.widenHi() -> Wide // argb -> a0r0g0b0
86 SkPx.widenLoHi() -> Wide // argb -> aarrggbb
87
88 Wide + Wide -> Wide
89 Wide - Wide -> Wide
90 Wide << bits -> Wide
91 Wide >> bits -> Wide
92
93 SkPx * Alpha -> Wide // 8 x 8 -> 16 bit
94 Wide.div255() -> SkPx // 16-bit -> 8 bit
95
96 // A faster approximation of (SkPx * Alpha).div255().
97 SkPx.approxMulDiv255(Alpha) -> SkPx
98
99 We allow each `SkPx` backend to choose how it physically represents `SkPx`, `SkP x::Wide`, and `SkPx::Alpha` and to choose any power of two as its `SkPx::N` swee t spot. Code working with SkPx typically runs a loop like this:
100
101 while (n >= SkPx::N) {
102 // Apply some_function() to SkPx::N pixels.
103 some_function(SkPx::LoadN(src), SkPx::LoadN(dst)).storeN(dst);
104 src += SkPx::N; dst += SkPx::N; n -= SkPx::N;
105 }
106 if (n > 0) {
107 // Finish up the tail of 0<n<N pixels.
108 some_function(SkPx::Load(n, src), SkPx::Load(n, dst)).store(n, dst);
109 }
110
111 The portable code is of course the simplest place to start looking at implementa tion details: its `SkPx` is just `uint8_t[4]`, its `SkPx::Wide` `uint16_t[4]`, a nd its `SkPx::Alpha` just `uint8_t`. Its preferred number of pixels to work wit h is `SkPx::N = 1`. (Amusingly, GCC and Clang seem pretty good about autovector izing this backend using 32-bit math, which typically ends up within ~2x of the best we can do ourselves.)
112
113 The most important difference between SSE and NEON when working in fixed point i s that SSE works most naturally with 4 interlaced pixels at a time (argbargbargb argb), while NEON works most naturally with 8 planar pixels at a time (aaaaaaaa, rrrrrrrr, gggggggg, bbbbbbbb). Trying to jam one of these instruction sets int o the other's idiom ends up somewhere between not quite optimal (working with in terlaced pixels in NEON) and ridiculously inefficient (trying to work with plana r pixels in SSE).
114
115 So `SkPx`'s SSE backend sets N to 4 pixels, stores them interlaced in an `__m128 i`, representing `Wide` as two `__m128i` and `Alpha` as an `__m128i` with each p ixel's alpha component replicated four times. SkPx's NEON backend works with 8 planar pixels, loading them with `vld4_u8` into an `uint8x8x4_t` struct of 4 8-c omponent `uint8x8_t` planes. `Alpha` is just a single `uint8x8_t` 8-component p lane, and `Wide` is NEON's natural choice, `uint16x8x4_t`.
116
117 (It's fun to speculate what an AVX2 backend might look like. Do we make `SkPx` declare it wants to work with 8 pixels at a time, or leave it at 4? Does `SkPx` become `__m256i`, or maybe only `SkPx::Wide` does? What's the best way to repr esent `Alpha`? And of course, what about AVX-512?)
118
119 Keeping `Alpha` as a single dense `uint8x8_t` plane allows the NEON backend to b e much more efficient with operations involving `Alpha`. We'd love to do this i n SSE too, where we store `Alpha` somewhat inefficiently with each alpha compone nt replicated 4 times, but SSE simply doesn't expose efficient ways to transpose interlaced pixels into planar pixels and vice versa. We could write them ourse lves, but only as rather complex compound operations that slow things down more than they help.
120
121 These details will inevitably change over time. The important takeaway here is, to really work at peak throughput in SIMD fixed point, you need to work with th e idiom of the instruction set, and `SkPx` is a design that can present a consis tent interface to abstract away backend details for you.
122
123 `SkPx` in practice
124 ----------------
125
126 I am in the process of rolling out `SkPx`. Some Skia code is already using its precursor, `Sk4px`, which is a bit like `SkPx` that forces `N=4` and restricts t he layout to always use interlaced pixels: i.e. fine for SSE, not great for NEON .
127
128 1. All ~20 other `SkXfermode` subclasses that are not implemented with `SkNf`.
129 2. SkBlitRow::Color32
130 3. SkBlitMask::BlitColor
131
132 I can certainly say that the `Sk4px` and `SkPx` implementations of these methods are clearer, less buggy, and that all the `SkXfermode` implementations sped up at least 2x when porting from custom per-platform intrinsics. `Sk4px` has lead to some pretty bad performance regressions that `SkPx` is designed to avoid. Th is is an area of active experiementation and iteration.
133
134 In Summary
135 ----------
136
137 I am confident that Skia developers soon will be able to write single, clear, ma intainable, and of course _fast_, graphical algorithms using `SkNf` and `SkPx`. As I have been porting our algorithms, I have perversely enjoyed replacing tho usands of lines of unmaintainable code with usually mere dozens of readable code .
138
139 I'm also confident that if you're looking to use floats, `SkNf` is ready. Do no t write NEON or SSE SIMD code if you're looking to use floats, and do not accept external contributions that do so. Use `SkNf` instead.
140
141 `SkPx` is less proven, and while its design and early tests look promising, it's still at the stage where we should try it aware that we might need to fall back on hand-written SSE or NEON.
OLDNEW
« no previous file with comments | « no previous file | no next file » | no next file with comments »

Powered by Google App Engine
This is Rietveld 408576698