OLD | NEW |
---|---|
(Empty) | |
1 Register allocation in Subzero | |
2 ============================== | |
3 | |
4 Introduction | |
5 ------------ | |
6 | |
7 `Subzero | |
8 <https://chromium.googlesource.com/native_client/pnacl-subzero/+/master/docs/DES IGN.rst>`_ | |
9 is a fast code generator that translates architecture-independent `PNaCl bitcode | |
10 <https://developer.chrome.com/native-client/reference/pnacl-bitcode-abi>`_ into | |
11 architecture-specific machine code. PNaCl bitcode is LLVM bitcode that has been | |
12 simplified (e.g. weird-width primitive types like 57-bit integers are not | |
13 allowed) and has had architecture-independent optimizations already applied. | |
14 Subzero aims to generate high-quality code as fast as practical, and as such | |
15 Subzero needs to make tradeoffs between compilation speed and output code | |
16 quality. | |
17 | |
18 In Subzero, we have found register allocation to be one of the most important | |
19 optimizations toward achieving the best code quality, which is in tension with | |
20 register allocation's reputation for being complex and expensive. Linear-scan | |
21 register allocation is a modern favorite for getting fairly good register | |
22 allocation at relatively low cost. Subzero uses linear-scan for its core | |
23 register allocator, with a few internal modifications to improve its | |
24 effectiveness (specifically: register preference, register overlap, and register | |
25 aliases). Subzero also does a few high-level things on top of its core register | |
26 allocator to improve overall effectiveness (specifically: repeat until | |
27 convergence, delayed phi lowering, and local live range splitting). | |
28 | |
29 What we describe here are techniques that have worked well for Subzero, in the | |
30 context of its particular intermediate representation (IR) and compilation | |
31 strategy. Some of these techniques may not be applicable to another compiler, | |
32 depending on its particular IR and compilation strategy. Some key concepts in | |
33 Subzero are the following: | |
34 | |
35 - Subzero's ``Variable`` operand is an operand that resides either on the stack | |
36 or in a physical register. A Variable can be tagged as *must-have-register* | |
37 or *must-not-have-register*, but its default is *may-have-register*. All uses | |
38 of the Variable map to the same physical register or stack location. | |
39 | |
40 - Basic lowering is done before register allocation. Lowering is the process of | |
41 translating PNaCl bitcode instructions into native instructions. Sometimes a | |
42 native instruction, like the x86 ``add`` instruction, allows one of its | |
43 Variable operands to be either in a physical register or on the stack, in | |
44 which case the lowering is relatively simple. But if the lowered instruction | |
45 requires the operand to be in a physical register, we generate code that | |
46 copies the Variable into a *must-have-register* temporary, and then use that | |
47 temporary in the lowered instruction. | |
48 | |
49 - Instructions within a basic block appear in a linearized order (as opposed to | |
50 something like a directed acyclic graph of dependency edges). | |
51 | |
52 - An instruction has 0 or 1 *dest* Variables and an arbitrary (but usually | |
53 small) number of *source* Variables. Assuming SSA form, the instruction | |
54 begins the live range of the dest Variable, and may end the live range of one | |
55 or more of the source Variables. | |
56 | |
57 Executive summary | |
58 ----------------- | |
59 | |
60 - Liveness analysis and live range construction are prerequisites for register | |
61 allocation. Without careful attention, they can be potentially very | |
62 expensive, especially when the number of variables and basic blocks gets very | |
63 large. Subzero uses some clever data structures to take advantage of the | |
64 sparsity of the data, resulting in stable performance as function size scales | |
65 up. This means that the proportion of compilation time spent on liveness | |
66 analysis stays roughly the same. | |
67 | |
68 - The core of Subzero's register allocator is taken from `Mössenböck and | |
69 Pfeiffer's paper <ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe02.PDF>`_ on | |
70 linear-scan register allocation. | |
71 | |
72 - We enhance the core algorithm with a good automatic preference mechanism when | |
73 more than one register is available, to try to minimize register shuffling. | |
74 | |
75 - We also enhance it to allow overlapping live ranges to share the same | |
76 register, when one variable is recognized as a read-only copy of the other. | |
77 | |
78 - We deal with register aliasing in a clean and general fashion. Register | |
79 aliasing is when e.g. 16-bit architectural registers share some bits with | |
80 their 32-bit counterparts, or 64-bit registers are actually pairs of 32-bit | |
81 registers. | |
82 | |
83 - We improve register allocation by rerunning the algorithm on likely candidates | |
84 that didn't get registers during the previous iteration, without imposing much | |
85 additional cost. | |
86 | |
87 - The main register allocation is done before phi lowering, because phi lowering | |
88 imposes early and unnecessary ordering constraints on the resulting | |
89 assigments, which create spurious interferences in live ranges. | |
90 | |
91 - Within each basic block, we aggressively split each variable's live range at | |
92 every use, so that portions of the live range can get registers even if the | |
93 whole live range can't. Doing this separately for each basic block avoids | |
94 merge complications, and keeps liveness analysis and register allocation fast | |
95 by fitting well into the sparsity optimizations of their data structures. | |
96 | |
97 Liveness analysis | |
98 ----------------- | |
99 | |
100 With respect to register allocation, the main purpose of liveness analysis is to | |
101 calculate the live range of each variable. The live range is represented as a | |
102 set of instruction number ranges. Instruction numbers within a basic block must | |
103 be monotonically increasing, and the instruction ranges of two different basic | |
104 blocks must not overlap. | |
105 | |
106 Basic liveness | |
107 ^^^^^^^^^^^^^^ | |
108 | |
109 Liveness analysis is a straightforward dataflow algorithm. For each basic | |
110 block, we keep track of the live-in and live-out set, i.e. the set of variables | |
111 that are live coming into or going out of the basic block. Processing of a | |
112 basic block starts by initializing a temporary set as the union of the live-in | |
113 sets of the basic block's successor blocks. (This basic block's live-out set is | |
114 captured as the initial value of the temporary set.) Then each instruction of | |
115 the basic block is processed in reverse order. All the source variables of the | |
116 instruction are marked as live, by adding them to the temporary set, and the | |
117 dest variable of the instruction (if any) is marked as not-live, by removing it | |
118 from the temporary set. When we finish processing all of the block's | |
119 instructions, we add/union the temporary set into the basic block's live-in set. | |
120 If this changes the basic block's live-in set, then we mark all of this basic | |
121 block's predecessor blocks to be reprocessed. Then we repeat for other basic | |
122 blocks until convergence, i.e. no more basic blocks are marked to be | |
123 reprocessed. If basic blocks are processed in reverse topological order, then | |
124 the number of times each basic block need to be reprocessed is generally its | |
125 loop nest depth. | |
126 | |
127 The result of this pass is the live-in and live-out set for each basic block. | |
128 | |
129 With so many set operations, choice of data structure is crucial for | |
130 performance. We tried a few options, and found that a simple dense bit vector | |
131 works best. This keeps the per-instruction cost very low. However, we found | |
132 that when the function gets very large, merging and copying bit vectors at basic | |
133 block boundaries dominates the cost. This is due to the number of variables | |
134 (hence the bit vector size) and the number of basic blocks getting large. | |
135 | |
136 A simple enhancement brought this under control in Subzero. It turns out that | |
137 the vast majority of variables are referenced, and therefore live, only in a | |
138 single basic block. This is largely due to the SSA form of PNaCl bitcode. To | |
139 take advantage of this, we partition variables by single-block versus | |
140 multi-block liveness. Multi-block variables get lower-numbered bit vector | |
141 indexes, and single-block variables get higher-number indexes. Single-block bit | |
142 vector indexes are reused across different basic blocks. As such, the size of | |
143 live-in and live-out bit vectors is limited to the number of multi-block | |
144 variables, and the temporary set's size can be limited to that plus the largest | |
145 number of single-block variables across all basic blocks. | |
146 | |
147 For the purpose of live range construction, we also need to track definitions | |
148 (LiveBegin) and last-uses (LiveEnd) of variables used within instructions of the | |
149 basic block. These are easy to detect while processing the instructions; data | |
150 structure choices are described below. | |
151 | |
152 Live range construction | |
153 ^^^^^^^^^^^^^^^^^^^^^^^ | |
154 | |
155 After the live-in and live-out sets are calculated, we construct each variable's | |
156 live range (as an ordered list of instruction ranges, described above). We do | |
157 this by considering the live-in and live-out sets, combined with LiveBegin and | |
158 LiveEnd information. This is done separately for each basic block. | |
159 | |
160 As before, we need to take advantage of sparsity of variable uses across basic | |
161 blocks, to avoid overly copying/merging data structures. The following is what | |
162 worked well for Subzero (after trying several other options). | |
163 | |
164 The basic liveness pass, described above, keeps track of when a variable's live | |
165 range begins or ends within the block. LiveBegin and LiveEnd are unordered | |
166 vectors where each element is a pair of the variable and the instruction number, | |
167 representing that the particular variable's live range begins or ends at the | |
168 particular instruction. When the liveness pass finds a variable whose live | |
169 range begins or ends, it appends and entry to LiveBegin or LiveEnd. | |
170 | |
171 During live range construction, the LiveBegin and LiveEnd vectors are sorted by | |
172 variable number. Then we iterate across both vectors in parallel. If a | |
173 variable appears in both LiveBegin and LiveEnd, then its live range is entirely | |
174 within this block. If it appears in only LiveBegin, then its live range starts | |
175 here and extends through the end of the block. If it appears in only LiveEnd, | |
176 then its live range starts at the beginning of the block and ends here. (Note | |
177 that this only covers the live range within this block, and this process is | |
178 repeated across all blocks.) | |
179 | |
180 It is also possible that a variable is live within this block but its live range | |
181 does not begin or end in this block. These variables are identified simply by | |
182 taking the intersection of the live-in and live-out sets. | |
183 | |
184 As a result of these data structures, performance of liveness analysis and live | |
185 range construction tend to be stable across small, medium, and large functions, | |
186 as measured by a fairly consistent proportion of total compilation time spent on | |
187 the liveness passes. | |
188 | |
189 Linear-scan register allocation | |
190 ------------------------------- | |
191 | |
192 The basis of Subzero's register allocator is the allocator described by | |
193 Hanspeter Mössenböck and Michael Pfeiffer in `Linear Scan Register Allocation in | |
194 the Context of SSA Form and Register Constraints | |
195 <ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe02.PDF>`_. It allows live ranges to | |
196 be a union of intervals rather than a single conservative interval, and it | |
197 allows pre-coloring of variables with specific physical registers. | |
198 | |
199 The paper suggests an approach of aggressively coalescing variables across Phi | |
200 instructions (i.e., trying to force Phi source and dest variables to have the | |
201 same register assignment), but we omit that in favor of the more natural | |
202 preference mechanism described below. | |
203 | |
204 We found the paper quite remarkable in that a straightforward implementation of | |
205 its pseudo-code led to an entirely correct register allocator. The only thing | |
206 we found in the specification that was even close to a mistake is that it was | |
207 too aggressive in evicting inactive ranges in the "else" clause of the | |
208 AssignMemLoc routine. An inactive range only needs to be evicted if it actually | |
209 overlaps the current range being considered, whereas the paper evicts it | |
210 unconditionally. (Search for "original paper" in Subzero's register allocator | |
211 source code.) | |
212 | |
213 Register preference | |
214 ------------------- | |
215 | |
216 The linear-scan algorithm from the paper talks about choosing an available | |
217 register, but isn't specific on how to choose among several available registers. | |
218 The simplest approach is to just choose the first available register, e.g. the | |
219 lowest-numbered register. Often a better choice is possible. | |
220 | |
221 Specifically, if variable ``V`` is assigned in an instruction ``V=f(S1,S2,...)`` | |
222 with source variables ``S1,S2,...``, and that instruction ends the live range of | |
223 one of those source variables ``Sn``, and ``Sn`` was assigned a register, then | |
224 ``Sn``'s register is usually a good choice for ``V``. This is especially true | |
225 when the instruction is a simple assignment, because an assignment where the | |
226 dest and source variables end up with the same register can be trivially elided, | |
227 reducing the amount of register-shuffling code. | |
228 | |
229 This requires being able to find and inspect a variable's defining instruction, | |
230 which is not an assumption in the original paper but is easily tracked in | |
231 practice. | |
232 | |
233 Register overlap | |
234 ---------------- | |
235 | |
236 Because Subzero does register allocation after basic lowering, the lowering has | |
237 to be prepared for the possibility of any given program variable not getting a | |
238 physical register. It does this by introducing *must-have-register* temporary | |
239 variables, and copies the program variable into the temporary to ensure that | |
240 register requirements in the target instruction are met. | |
241 | |
242 In many cases, the program variable and temporary variable have overlapping live | |
243 ranges, and would be forced to have different registers even if the temporary | |
244 variable is effectively a read-only copy of the program variable. We recognize | |
245 this when the program variable has no definitions within the temporary | |
246 variable's live range, and the temporary variable has no definitions within the | |
247 program variable's live range with the exception of the copy assignment. | |
248 | |
249 This analysis is done as part of register preference detection. | |
250 | |
251 The main impact on the linear-scan implementation is that instead of | |
252 setting/resetting a boolean flag to indicate whether a register is free or in | |
253 use, we increment/decrement a number-of-uses counter. | |
254 | |
255 Register aliases | |
256 ---------------- | |
257 | |
258 Sometimes registers of different register classes partially overlap. For | |
259 example, in x86, registers ``al`` and ``ah`` alias ``ax`` (though they don't | |
260 alias each other), and all three alias ``eax`` and ``rax``. And in ARM, | |
261 registers ``s0`` and ``s1`` (which are single-precision floating-point | |
262 registers) alias ``d0`` (double-precision floating-point), and registers ``d0`` | |
263 and ``d1`` alias ``q0`` (128-bit vector register). The linear-scan paper | |
264 doesn't address this issue; it assumes that all registers are independent. A | |
265 simple solution is to essentially avoid aliasing by disallowing a subset of the | |
266 registers, but there is obviously a reduction in code quality when e.g. half of | |
267 the registers are taken away. | |
268 | |
269 Subzero handles this more elegantly. For each register, we keep a bitmask | |
270 indicating which registers alias/conflict with it. For example, in x86, | |
271 ``ah``'s alias set is ``ah``, ``ax``, ``eax``, and ``rax`` (but not ``al``), and | |
272 in ARM, ``s1``'s alias set is ``s1``, ``d0``, and ``q0``. Whenever we want to | |
273 mark a register as being used or released, we do the same for all of its | |
274 aliases. | |
275 | |
276 Before implementing this, we were a bit apprehensive about the compile-time | |
277 performance impact. Instead of setting one bit in a bit vector or decrementing | |
278 one counter, this generally needs to be done in a loop that iterates over all | |
279 aliases. Fortunately, this seemed to make very little difference in | |
280 performance, as the bulk of the cost ends up being in live range overlap | |
281 computations, which are not affected by register aliasing. | |
282 | |
283 Repeat until convergence | |
284 ------------------------ | |
285 | |
286 Sometimes the linear-scan algorithm makes a register assignment only to later | |
287 revoke it in favor of a higher-priority variable, but it turns out that a | |
288 different initial register choice would not have been revoked. For relatively | |
289 low compile-time cost, we can give those variables another chance. | |
290 | |
291 During register allocation, we keep track of the revoked variables and then do | |
292 another round of register allocation targeted only to that set. We repeat until | |
293 no new register assignments are made, which is usually just a handful of | |
294 successively cheaper iterations. | |
295 | |
296 Another approach would be to repeat register allocation for *all* variables that | |
297 haven't had a register assigned (rather than variables that got a register that | |
298 was subsequently revoked), but our experience is that this greatly increases | |
299 compile-time cost, with little or no code quality gain. | |
300 | |
301 Delayed Phi lowering | |
302 -------------------- | |
303 | |
304 The linear-scan algorithm works for phi instructions as well as regular | |
305 instructions, but it is tempting to lower phi instructions into non-SSA | |
306 assignments before register allocation, so that register allocation need only | |
307 happen once. | |
308 | |
309 Unfortunately, simple phi lowering imposes an arbitrary ordering on the | |
310 resulting assignments that can cause artificial overlap/interference between | |
311 lowered assignments, and can lead to worse register allocation decisions. As a | |
312 simple example, consider these two phi instructions which are semantically | |
313 unordered:: | |
314 | |
315 A = phi(B) // B's live range ends here | |
316 C = phi(D) // D's live range ends here | |
317 | |
318 A straightforward lowering might yield:: | |
319 | |
320 A = B // end of B's live range | |
321 C = D // end of D's live range | |
322 | |
323 The potential problem here is that A and D's live ranges overlap, and so they | |
324 are prevented from having the same register. Swapping the order of lowered | |
325 assignments fixes that (but then B and C would overlap), but we can't really | |
326 know which is better until after register allocation. | |
327 | |
328 Subzero deals with this by doing the main register allocation before phi | |
329 lowering, followed by phi lowering, and finally a special register allocation | |
330 pass limited to the new lowered assignments. | |
331 | |
332 Phi lowering considers the phi operands separately for each predecessor edge, | |
333 and starts by finding a topological sort of the Phi instructions, such that | |
334 assignments can be executed in that order without violating dependencies on | |
335 registers or stack locations. If a topological sort is not possible due to a | |
336 cycle, the cycle is broken by introducing a temporary, e.g. ``A=B;B=C;C=A`` can | |
337 become ``T=A;A=B;B=C;C=T``. The topological order is tuned to favor freeing up | |
338 registers early to reduce register pressure. | |
339 | |
340 It then lowers the linearized assignments into machine instructions (which may | |
341 require extra physical registers e.g. to copy from one stack location to | |
342 another), and finally runs the register allocator limited to those instructions. | |
343 | |
344 In rare cases, the register allocator may fail on some *must-have-register* | |
345 variable, because register pressure is too high to satisfy requirements arising | |
346 from cycle-breaking temporaries and registers required for stack-to-stack | |
347 copies. In this case, we have to find a register with no active uses within the | |
348 variable's live range, and actively spill/restore that register around the live | |
349 range. This makes the code quality suffer and may be slow to implement | |
350 depending on compiler data structures, but in practice we find the situation to | |
351 be vanishingly rare and so not really worth optimizing. | |
352 | |
353 Local live range splitting | |
354 -------------------------- | |
355 | |
356 The basic linear-scan algorithm has an "all-or-nothing" policy: a variable gets | |
357 a register for its entire live range, or not at all. This is unfortunate when a | |
358 variable has many uses close together, but ultimately a large enough live range | |
359 to prevent register assignment. Another bad example is on x86 where all vector | |
360 and floating-point registers (the ``xmm`` registers) are killed by call | |
361 instructions, per the x86 call ABI, so such variables are completely prevented | |
362 from having a register when their live ranges contain a call instruction. | |
363 | |
364 The general solution here is some policy for splitting live ranges. A variable | |
365 can be split into multiple copies and each can be register-allocated separately. | |
366 The complication comes in finding a sane policy for where and when to split | |
367 variables such that complexity doesn't explode, and how to join the different | |
368 values at merge points. | |
369 | |
370 Subzero implements aggressive block-local splitting of variables. We make a | |
371 pass over the instructions in a block. Every time a variable ``V`` is | |
John
2016/08/24 14:13:58
I don't understand how this leads to a chain
V''
Jim Stichnoth
2016/08/25 13:28:40
I was a bit lazy writing this paragraph, and so I'
| |
372 encountered in an instruction, we add an assignment ``V'=V``, adjacent to the | |
373 instruction, and replace all subsequent uses of ``V`` in the block with ``V'``. | |
374 | |
375 This leads to a chain of copies of ``V`` through the block, linked by assignment | |
376 instructions. The live ranges of these copies are usually much smaller, and | |
377 more likely to get registers. In fact, because of the preference mechanism | |
378 described above, they are likely to get the same register whenever possible. | |
379 | |
380 One obvious question comes up: won't this proliferation of new variables cause | |
381 an explosion in the running time of liveness analysis and register allocation? | |
382 As it turns out, not really. | |
383 | |
384 First, for register allocation, the cost tends to be dominated by live range | |
385 overlap computations, whose cost is roughly proportional to the size of the live | |
386 range. All the new variable copies' live ranges sum up to the original | |
387 variable's live range, so the cost isn't vastly greater. | |
388 | |
389 Second, for liveness analysis, the cost is dominated by merging bit vectors | |
390 corresponding to the set of variables that have multi-block liveness. All the | |
391 new copies are guaranteed to be single-block, so the main additional cost is | |
392 that of processing the new assignments. | |
393 | |
394 There's one other key issue here. The original variable and all of its copies | |
395 need to be "linked", in the sense that all of these variables that get a stack | |
396 slot (because they didn't get a register) are guaranteed to have the same stack | |
397 slot. This way, we can avoid generating any code related to ``V'=V`` when | |
398 neither gets a register. In addition, we can elide instructions that write a | |
399 value to a stack slot that is linked to another stack slot, because that is | |
400 guaranteed to be just rewriting the same value to the stack. | |
OLD | NEW |