OLD | NEW |
(Empty) | |
| 1 Register allocation in Subzero |
| 2 ============================== |
| 3 |
| 4 Introduction |
| 5 ------------ |
| 6 |
| 7 `Subzero |
| 8 <https://chromium.googlesource.com/native_client/pnacl-subzero/+/master/docs/DES
IGN.rst>`_ |
| 9 is a fast code generator that translates architecture-independent `PNaCl bitcode |
| 10 <https://developer.chrome.com/native-client/reference/pnacl-bitcode-abi>`_ into |
| 11 architecture-specific machine code. PNaCl bitcode is LLVM bitcode that has been |
| 12 simplified (e.g. weird-width primitive types like 57-bit integers are not |
| 13 allowed) and has had architecture-independent optimizations already applied. |
| 14 Subzero aims to generate high-quality code as fast as practical, and as such |
| 15 Subzero needs to make tradeoffs between compilation speed and output code |
| 16 quality. |
| 17 |
| 18 In Subzero, we have found register allocation to be one of the most important |
| 19 optimizations toward achieving the best code quality, which is in tension with |
| 20 register allocation's reputation for being complex and expensive. Linear-scan |
| 21 register allocation is a modern favorite for getting fairly good register |
| 22 allocation at relatively low cost. Subzero uses linear-scan for its core |
| 23 register allocator, with a few internal modifications to improve its |
| 24 effectiveness (specifically: register preference, register overlap, and register |
| 25 aliases). Subzero also does a few high-level things on top of its core register |
| 26 allocator to improve overall effectiveness (specifically: repeat until |
| 27 convergence, delayed phi lowering, and local live range splitting). |
| 28 |
| 29 What we describe here are techniques that have worked well for Subzero, in the |
| 30 context of its particular intermediate representation (IR) and compilation |
| 31 strategy. Some of these techniques may not be applicable to another compiler, |
| 32 depending on its particular IR and compilation strategy. Some key concepts in |
| 33 Subzero are the following: |
| 34 |
| 35 - Subzero's ``Variable`` operand is an operand that resides either on the stack |
| 36 or in a physical register. A Variable can be tagged as *must-have-register* |
| 37 or *must-not-have-register*, but its default is *may-have-register*. All uses |
| 38 of the Variable map to the same physical register or stack location. |
| 39 |
| 40 - Basic lowering is done before register allocation. Lowering is the process of |
| 41 translating PNaCl bitcode instructions into native instructions. Sometimes a |
| 42 native instruction, like the x86 ``add`` instruction, allows one of its |
| 43 Variable operands to be either in a physical register or on the stack, in |
| 44 which case the lowering is relatively simple. But if the lowered instruction |
| 45 requires the operand to be in a physical register, we generate code that |
| 46 copies the Variable into a *must-have-register* temporary, and then use that |
| 47 temporary in the lowered instruction. |
| 48 |
| 49 - Instructions within a basic block appear in a linearized order (as opposed to |
| 50 something like a directed acyclic graph of dependency edges). |
| 51 |
| 52 - An instruction has 0 or 1 *dest* Variables and an arbitrary (but usually |
| 53 small) number of *source* Variables. Assuming SSA form, the instruction |
| 54 begins the live range of the dest Variable, and may end the live range of one |
| 55 or more of the source Variables. |
| 56 |
| 57 Executive summary |
| 58 ----------------- |
| 59 |
| 60 - Liveness analysis and live range construction are prerequisites for register |
| 61 allocation. Without careful attention, they can be potentially very |
| 62 expensive, especially when the number of variables and basic blocks gets very |
| 63 large. Subzero uses some clever data structures to take advantage of the |
| 64 sparsity of the data, resulting in stable performance as function size scales |
| 65 up. This means that the proportion of compilation time spent on liveness |
| 66 analysis stays roughly the same. |
| 67 |
| 68 - The core of Subzero's register allocator is taken from `Mössenböck and |
| 69 Pfeiffer's paper <ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe02.PDF>`_ on |
| 70 linear-scan register allocation. |
| 71 |
| 72 - We enhance the core algorithm with a good automatic preference mechanism when |
| 73 more than one register is available, to try to minimize register shuffling. |
| 74 |
| 75 - We also enhance it to allow overlapping live ranges to share the same |
| 76 register, when one variable is recognized as a read-only copy of the other. |
| 77 |
| 78 - We deal with register aliasing in a clean and general fashion. Register |
| 79 aliasing is when e.g. 16-bit architectural registers share some bits with |
| 80 their 32-bit counterparts, or 64-bit registers are actually pairs of 32-bit |
| 81 registers. |
| 82 |
| 83 - We improve register allocation by rerunning the algorithm on likely candidates |
| 84 that didn't get registers during the previous iteration, without imposing much |
| 85 additional cost. |
| 86 |
| 87 - The main register allocation is done before phi lowering, because phi lowering |
| 88 imposes early and unnecessary ordering constraints on the resulting |
| 89 assigments, which create spurious interferences in live ranges. |
| 90 |
| 91 - Within each basic block, we aggressively split each variable's live range at |
| 92 every use, so that portions of the live range can get registers even if the |
| 93 whole live range can't. Doing this separately for each basic block avoids |
| 94 merge complications, and keeps liveness analysis and register allocation fast |
| 95 by fitting well into the sparsity optimizations of their data structures. |
| 96 |
| 97 Liveness analysis |
| 98 ----------------- |
| 99 |
| 100 With respect to register allocation, the main purpose of liveness analysis is to |
| 101 calculate the live range of each variable. The live range is represented as a |
| 102 set of instruction number ranges. Instruction numbers within a basic block must |
| 103 be monotonically increasing, and the instruction ranges of two different basic |
| 104 blocks must not overlap. |
| 105 |
| 106 Basic liveness |
| 107 ^^^^^^^^^^^^^^ |
| 108 |
| 109 Liveness analysis is a straightforward dataflow algorithm. For each basic |
| 110 block, we keep track of the live-in and live-out set, i.e. the set of variables |
| 111 that are live coming into or going out of the basic block. Processing of a |
| 112 basic block starts by initializing a temporary set as the union of the live-in |
| 113 sets of the basic block's successor blocks. (This basic block's live-out set is |
| 114 captured as the initial value of the temporary set.) Then each instruction of |
| 115 the basic block is processed in reverse order. All the source variables of the |
| 116 instruction are marked as live, by adding them to the temporary set, and the |
| 117 dest variable of the instruction (if any) is marked as not-live, by removing it |
| 118 from the temporary set. When we finish processing all of the block's |
| 119 instructions, we add/union the temporary set into the basic block's live-in set. |
| 120 If this changes the basic block's live-in set, then we mark all of this basic |
| 121 block's predecessor blocks to be reprocessed. Then we repeat for other basic |
| 122 blocks until convergence, i.e. no more basic blocks are marked to be |
| 123 reprocessed. If basic blocks are processed in reverse topological order, then |
| 124 the number of times each basic block need to be reprocessed is generally its |
| 125 loop nest depth. |
| 126 |
| 127 The result of this pass is the live-in and live-out set for each basic block. |
| 128 |
| 129 With so many set operations, choice of data structure is crucial for |
| 130 performance. We tried a few options, and found that a simple dense bit vector |
| 131 works best. This keeps the per-instruction cost very low. However, we found |
| 132 that when the function gets very large, merging and copying bit vectors at basic |
| 133 block boundaries dominates the cost. This is due to the number of variables |
| 134 (hence the bit vector size) and the number of basic blocks getting large. |
| 135 |
| 136 A simple enhancement brought this under control in Subzero. It turns out that |
| 137 the vast majority of variables are referenced, and therefore live, only in a |
| 138 single basic block. This is largely due to the SSA form of PNaCl bitcode. To |
| 139 take advantage of this, we partition variables by single-block versus |
| 140 multi-block liveness. Multi-block variables get lower-numbered bit vector |
| 141 indexes, and single-block variables get higher-number indexes. Single-block bit |
| 142 vector indexes are reused across different basic blocks. As such, the size of |
| 143 live-in and live-out bit vectors is limited to the number of multi-block |
| 144 variables, and the temporary set's size can be limited to that plus the largest |
| 145 number of single-block variables across all basic blocks. |
| 146 |
| 147 For the purpose of live range construction, we also need to track definitions |
| 148 (LiveBegin) and last-uses (LiveEnd) of variables used within instructions of the |
| 149 basic block. These are easy to detect while processing the instructions; data |
| 150 structure choices are described below. |
| 151 |
| 152 Live range construction |
| 153 ^^^^^^^^^^^^^^^^^^^^^^^ |
| 154 |
| 155 After the live-in and live-out sets are calculated, we construct each variable's |
| 156 live range (as an ordered list of instruction ranges, described above). We do |
| 157 this by considering the live-in and live-out sets, combined with LiveBegin and |
| 158 LiveEnd information. This is done separately for each basic block. |
| 159 |
| 160 As before, we need to take advantage of sparsity of variable uses across basic |
| 161 blocks, to avoid overly copying/merging data structures. The following is what |
| 162 worked well for Subzero (after trying several other options). |
| 163 |
| 164 The basic liveness pass, described above, keeps track of when a variable's live |
| 165 range begins or ends within the block. LiveBegin and LiveEnd are unordered |
| 166 vectors where each element is a pair of the variable and the instruction number, |
| 167 representing that the particular variable's live range begins or ends at the |
| 168 particular instruction. When the liveness pass finds a variable whose live |
| 169 range begins or ends, it appends and entry to LiveBegin or LiveEnd. |
| 170 |
| 171 During live range construction, the LiveBegin and LiveEnd vectors are sorted by |
| 172 variable number. Then we iterate across both vectors in parallel. If a |
| 173 variable appears in both LiveBegin and LiveEnd, then its live range is entirely |
| 174 within this block. If it appears in only LiveBegin, then its live range starts |
| 175 here and extends through the end of the block. If it appears in only LiveEnd, |
| 176 then its live range starts at the beginning of the block and ends here. (Note |
| 177 that this only covers the live range within this block, and this process is |
| 178 repeated across all blocks.) |
| 179 |
| 180 It is also possible that a variable is live within this block but its live range |
| 181 does not begin or end in this block. These variables are identified simply by |
| 182 taking the intersection of the live-in and live-out sets. |
| 183 |
| 184 As a result of these data structures, performance of liveness analysis and live |
| 185 range construction tend to be stable across small, medium, and large functions, |
| 186 as measured by a fairly consistent proportion of total compilation time spent on |
| 187 the liveness passes. |
| 188 |
| 189 Linear-scan register allocation |
| 190 ------------------------------- |
| 191 |
| 192 The basis of Subzero's register allocator is the allocator described by |
| 193 Hanspeter Mössenböck and Michael Pfeiffer in `Linear Scan Register Allocation in |
| 194 the Context of SSA Form and Register Constraints |
| 195 <ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe02.PDF>`_. It allows live ranges to |
| 196 be a union of intervals rather than a single conservative interval, and it |
| 197 allows pre-coloring of variables with specific physical registers. |
| 198 |
| 199 The paper suggests an approach of aggressively coalescing variables across Phi |
| 200 instructions (i.e., trying to force Phi source and dest variables to have the |
| 201 same register assignment), but we omit that in favor of the more natural |
| 202 preference mechanism described below. |
| 203 |
| 204 We found the paper quite remarkable in that a straightforward implementation of |
| 205 its pseudo-code led to an entirely correct register allocator. The only thing |
| 206 we found in the specification that was even close to a mistake is that it was |
| 207 too aggressive in evicting inactive ranges in the "else" clause of the |
| 208 AssignMemLoc routine. An inactive range only needs to be evicted if it actually |
| 209 overlaps the current range being considered, whereas the paper evicts it |
| 210 unconditionally. (Search for "original paper" in Subzero's register allocator |
| 211 source code.) |
| 212 |
| 213 Register preference |
| 214 ------------------- |
| 215 |
| 216 The linear-scan algorithm from the paper talks about choosing an available |
| 217 register, but isn't specific on how to choose among several available registers. |
| 218 The simplest approach is to just choose the first available register, e.g. the |
| 219 lowest-numbered register. Often a better choice is possible. |
| 220 |
| 221 Specifically, if variable ``V`` is assigned in an instruction ``V=f(S1,S2,...)`` |
| 222 with source variables ``S1,S2,...``, and that instruction ends the live range of |
| 223 one of those source variables ``Sn``, and ``Sn`` was assigned a register, then |
| 224 ``Sn``'s register is usually a good choice for ``V``. This is especially true |
| 225 when the instruction is a simple assignment, because an assignment where the |
| 226 dest and source variables end up with the same register can be trivially elided, |
| 227 reducing the amount of register-shuffling code. |
| 228 |
| 229 This requires being able to find and inspect a variable's defining instruction, |
| 230 which is not an assumption in the original paper but is easily tracked in |
| 231 practice. |
| 232 |
| 233 Register overlap |
| 234 ---------------- |
| 235 |
| 236 Because Subzero does register allocation after basic lowering, the lowering has |
| 237 to be prepared for the possibility of any given program variable not getting a |
| 238 physical register. It does this by introducing *must-have-register* temporary |
| 239 variables, and copies the program variable into the temporary to ensure that |
| 240 register requirements in the target instruction are met. |
| 241 |
| 242 In many cases, the program variable and temporary variable have overlapping live |
| 243 ranges, and would be forced to have different registers even if the temporary |
| 244 variable is effectively a read-only copy of the program variable. We recognize |
| 245 this when the program variable has no definitions within the temporary |
| 246 variable's live range, and the temporary variable has no definitions within the |
| 247 program variable's live range with the exception of the copy assignment. |
| 248 |
| 249 This analysis is done as part of register preference detection. |
| 250 |
| 251 The main impact on the linear-scan implementation is that instead of |
| 252 setting/resetting a boolean flag to indicate whether a register is free or in |
| 253 use, we increment/decrement a number-of-uses counter. |
| 254 |
| 255 Register aliases |
| 256 ---------------- |
| 257 |
| 258 Sometimes registers of different register classes partially overlap. For |
| 259 example, in x86, registers ``al`` and ``ah`` alias ``ax`` (though they don't |
| 260 alias each other), and all three alias ``eax`` and ``rax``. And in ARM, |
| 261 registers ``s0`` and ``s1`` (which are single-precision floating-point |
| 262 registers) alias ``d0`` (double-precision floating-point), and registers ``d0`` |
| 263 and ``d1`` alias ``q0`` (128-bit vector register). The linear-scan paper |
| 264 doesn't address this issue; it assumes that all registers are independent. A |
| 265 simple solution is to essentially avoid aliasing by disallowing a subset of the |
| 266 registers, but there is obviously a reduction in code quality when e.g. half of |
| 267 the registers are taken away. |
| 268 |
| 269 Subzero handles this more elegantly. For each register, we keep a bitmask |
| 270 indicating which registers alias/conflict with it. For example, in x86, |
| 271 ``ah``'s alias set is ``ah``, ``ax``, ``eax``, and ``rax`` (but not ``al``), and |
| 272 in ARM, ``s1``'s alias set is ``s1``, ``d0``, and ``q0``. Whenever we want to |
| 273 mark a register as being used or released, we do the same for all of its |
| 274 aliases. |
| 275 |
| 276 Before implementing this, we were a bit apprehensive about the compile-time |
| 277 performance impact. Instead of setting one bit in a bit vector or decrementing |
| 278 one counter, this generally needs to be done in a loop that iterates over all |
| 279 aliases. Fortunately, this seemed to make very little difference in |
| 280 performance, as the bulk of the cost ends up being in live range overlap |
| 281 computations, which are not affected by register aliasing. |
| 282 |
| 283 Repeat until convergence |
| 284 ------------------------ |
| 285 |
| 286 Sometimes the linear-scan algorithm makes a register assignment only to later |
| 287 revoke it in favor of a higher-priority variable, but it turns out that a |
| 288 different initial register choice would not have been revoked. For relatively |
| 289 low compile-time cost, we can give those variables another chance. |
| 290 |
| 291 During register allocation, we keep track of the revoked variables and then do |
| 292 another round of register allocation targeted only to that set. We repeat until |
| 293 no new register assignments are made, which is usually just a handful of |
| 294 successively cheaper iterations. |
| 295 |
| 296 Another approach would be to repeat register allocation for *all* variables that |
| 297 haven't had a register assigned (rather than variables that got a register that |
| 298 was subsequently revoked), but our experience is that this greatly increases |
| 299 compile-time cost, with little or no code quality gain. |
| 300 |
| 301 Delayed Phi lowering |
| 302 -------------------- |
| 303 |
| 304 The linear-scan algorithm works for phi instructions as well as regular |
| 305 instructions, but it is tempting to lower phi instructions into non-SSA |
| 306 assignments before register allocation, so that register allocation need only |
| 307 happen once. |
| 308 |
| 309 Unfortunately, simple phi lowering imposes an arbitrary ordering on the |
| 310 resulting assignments that can cause artificial overlap/interference between |
| 311 lowered assignments, and can lead to worse register allocation decisions. As a |
| 312 simple example, consider these two phi instructions which are semantically |
| 313 unordered:: |
| 314 |
| 315 A = phi(B) // B's live range ends here |
| 316 C = phi(D) // D's live range ends here |
| 317 |
| 318 A straightforward lowering might yield:: |
| 319 |
| 320 A = B // end of B's live range |
| 321 C = D // end of D's live range |
| 322 |
| 323 The potential problem here is that A and D's live ranges overlap, and so they |
| 324 are prevented from having the same register. Swapping the order of lowered |
| 325 assignments fixes that (but then B and C would overlap), but we can't really |
| 326 know which is better until after register allocation. |
| 327 |
| 328 Subzero deals with this by doing the main register allocation before phi |
| 329 lowering, followed by phi lowering, and finally a special register allocation |
| 330 pass limited to the new lowered assignments. |
| 331 |
| 332 Phi lowering considers the phi operands separately for each predecessor edge, |
| 333 and starts by finding a topological sort of the Phi instructions, such that |
| 334 assignments can be executed in that order without violating dependencies on |
| 335 registers or stack locations. If a topological sort is not possible due to a |
| 336 cycle, the cycle is broken by introducing a temporary, e.g. ``A=B;B=C;C=A`` can |
| 337 become ``T=A;A=B;B=C;C=T``. The topological order is tuned to favor freeing up |
| 338 registers early to reduce register pressure. |
| 339 |
| 340 It then lowers the linearized assignments into machine instructions (which may |
| 341 require extra physical registers e.g. to copy from one stack location to |
| 342 another), and finally runs the register allocator limited to those instructions. |
| 343 |
| 344 In rare cases, the register allocator may fail on some *must-have-register* |
| 345 variable, because register pressure is too high to satisfy requirements arising |
| 346 from cycle-breaking temporaries and registers required for stack-to-stack |
| 347 copies. In this case, we have to find a register with no active uses within the |
| 348 variable's live range, and actively spill/restore that register around the live |
| 349 range. This makes the code quality suffer and may be slow to implement |
| 350 depending on compiler data structures, but in practice we find the situation to |
| 351 be vanishingly rare and so not really worth optimizing. |
| 352 |
| 353 Local live range splitting |
| 354 -------------------------- |
| 355 |
| 356 The basic linear-scan algorithm has an "all-or-nothing" policy: a variable gets |
| 357 a register for its entire live range, or not at all. This is unfortunate when a |
| 358 variable has many uses close together, but ultimately a large enough live range |
| 359 to prevent register assignment. Another bad example is on x86 where all vector |
| 360 and floating-point registers (the ``xmm`` registers) are killed by call |
| 361 instructions, per the x86 call ABI, so such variables are completely prevented |
| 362 from having a register when their live ranges contain a call instruction. |
| 363 |
| 364 The general solution here is some policy for splitting live ranges. A variable |
| 365 can be split into multiple copies and each can be register-allocated separately. |
| 366 The complication comes in finding a sane policy for where and when to split |
| 367 variables such that complexity doesn't explode, and how to join the different |
| 368 values at merge points. |
| 369 |
| 370 Subzero implements aggressive block-local splitting of variables. Each basic |
| 371 block is handled separately and independently. Within the block, we maintain a |
| 372 table ``T`` that maps each variable ``V`` to its split version ``T[V]``, with |
| 373 every variable ``V``'s initial state set (implicitly) as ``T[V]=V``. For each |
| 374 instruction in the block, and for each *may-have-register* variable ``V`` in the |
| 375 instruction, we do the following: |
| 376 |
| 377 - Replace all uses of ``V`` in the instruction with ``T[V]``. |
| 378 |
| 379 - Create a new split variable ``V'``. |
| 380 |
| 381 - Add a new assignment ``V'=T[V]``, placing it adjacent to (either immediately |
| 382 before or immediately after) the current instruction. |
| 383 |
| 384 - Update ``T[V]`` to be ``V'``. |
| 385 |
| 386 This leads to a chain of copies of ``V`` through the block, linked by assignment |
| 387 instructions. The live ranges of these copies are usually much smaller, and |
| 388 more likely to get registers. In fact, because of the preference mechanism |
| 389 described above, they are likely to get the same register whenever possible. |
| 390 |
| 391 One obvious question comes up: won't this proliferation of new variables cause |
| 392 an explosion in the running time of liveness analysis and register allocation? |
| 393 As it turns out, not really. |
| 394 |
| 395 First, for register allocation, the cost tends to be dominated by live range |
| 396 overlap computations, whose cost is roughly proportional to the size of the live |
| 397 range. All the new variable copies' live ranges sum up to the original |
| 398 variable's live range, so the cost isn't vastly greater. |
| 399 |
| 400 Second, for liveness analysis, the cost is dominated by merging bit vectors |
| 401 corresponding to the set of variables that have multi-block liveness. All the |
| 402 new copies are guaranteed to be single-block, so the main additional cost is |
| 403 that of processing the new assignments. |
| 404 |
| 405 There's one other key issue here. The original variable and all of its copies |
| 406 need to be "linked", in the sense that all of these variables that get a stack |
| 407 slot (because they didn't get a register) are guaranteed to have the same stack |
| 408 slot. This way, we can avoid generating any code related to ``V'=V`` when |
| 409 neither gets a register. In addition, we can elide instructions that write a |
| 410 value to a stack slot that is linked to another stack slot, because that is |
| 411 guaranteed to be just rewriting the same value to the stack. |
OLD | NEW |