docs/REGALLOC.rst - Issue 2277493003: Subzero: Add a new document describing the register allocator.

Issue 2277493003: Subzero: Add a new document describing the register allocator. (Closed) Base URL: https://chromium.googlesource.com/native_client/pnacl-subzero.git@master

Patch Set: Improve block-local splitting description Created 4 years, 3 months ago

OLD	NEW
(Empty)
	1 Register allocation in Subzero

	2 ==============================

	3

	4 Introduction

	5 ------------

	6

	7 `Subzero

	8 <https://chromium.googlesource.com/native_client/pnacl-subzero/+/master/docs/DESIGN.rst>`_

	9 is a fast code generator that translates architecture-independent `PNaCl bitcode

	10 <https://developer.chrome.com/native-client/reference/pnacl-bitcode-abi>`_ into

	11 architecture-specific machine code. PNaCl bitcode is LLVM bitcode that has been

	12 simplified (e.g. weird-width primitive types like 57-bit integers are not

	13 allowed) and has had architecture-independent optimizations already applied.

	14 Subzero aims to generate high-quality code as fast as practical, and as such

	15 Subzero needs to make tradeoffs between compilation speed and output code

	16 quality.

	17

	18 In Subzero, we have found register allocation to be one of the most important

	19 optimizations toward achieving the best code quality, which is in tension with

	20 register allocation's reputation for being complex and expensive. Linear-scan

	21 register allocation is a modern favorite for getting fairly good register

	22 allocation at relatively low cost. Subzero uses linear-scan for its core

	23 register allocator, with a few internal modifications to improve its

	24 effectiveness (specifically: register preference, register overlap, and register

	25 aliases). Subzero also does a few high-level things on top of its core register

	26 allocator to improve overall effectiveness (specifically: repeat until

	27 convergence, delayed phi lowering, and local live range splitting).

	28

	29 What we describe here are techniques that have worked well for Subzero, in the

	30 context of its particular intermediate representation (IR) and compilation

	31 strategy. Some of these techniques may not be applicable to another compiler,

	32 depending on its particular IR and compilation strategy. Some key concepts in

	33 Subzero are the following:

	34

	35 - Subzero's ``Variable`` operand is an operand that resides either on the stack

	36 or in a physical register. A Variable can be tagged as must-have-register

	37 or must-not-have-register, but its default is may-have-register. All uses

	38 of the Variable map to the same physical register or stack location.

	39

	40 - Basic lowering is done before register allocation. Lowering is the process of

	41 translating PNaCl bitcode instructions into native instructions. Sometimes a

	42 native instruction, like the x86 ``add`` instruction, allows one of its

	43 Variable operands to be either in a physical register or on the stack, in

	44 which case the lowering is relatively simple. But if the lowered instruction

	45 requires the operand to be in a physical register, we generate code that

	46 copies the Variable into a must-have-register temporary, and then use that

	47 temporary in the lowered instruction.

	48

	49 - Instructions within a basic block appear in a linearized order (as opposed to

	50 something like a directed acyclic graph of dependency edges).

	51

	52 - An instruction has 0 or 1 dest Variables and an arbitrary (but usually

	53 small) number of source Variables. Assuming SSA form, the instruction

	54 begins the live range of the dest Variable, and may end the live range of one

	55 or more of the source Variables.

	56

	57 Executive summary

	58 -----------------

	59

	60 - Liveness analysis and live range construction are prerequisites for register

	61 allocation. Without careful attention, they can become very

	62 expensive, especially when the number of variables and basic blocks gets very

	63 large. Subzero uses some clever data structures to take advantage of the

	64 sparsity of the data, resulting in stable performance as function size scales

	65 up. This means that the proportion of compilation time spent on liveness

	66 analysis stays roughly the same.

	67

	68 - The core of Subzero's register allocator is taken from `Mössenböck and

	69 Pfeiffer's paper <ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe02.PDF>`_ on

	70 linear-scan register allocation.

	71

	72 - We enhance the core algorithm with a good automatic preference mechanism when

	73 more than one register is available, to try to minimize register shuffling.

	74

	75 - We also enhance it to allow overlapping live ranges to share the same

	76 register, when one variable is recognized as a read-only copy of the other.

	77

	78 - We deal with register aliasing in a clean and general fashion. Register

	79 aliasing is when e.g. 16-bit architectural registers share some bits with

	80 their 32-bit counterparts, or 64-bit registers are actually pairs of 32-bit

	81 registers.

	82

	83 - We improve register allocation by rerunning the algorithm on likely candidates

	84 that didn't get registers during the previous iteration, without imposing much

	85 additional cost.

	86

	87 - The main register allocation is done before phi lowering, because phi lowering

	88 imposes early and unnecessary ordering constraints on the resulting

	89 assignments, which create spurious interferences in live ranges.

	90

	91 - Within each basic block, we aggressively split each variable's live range at

	92 every use, so that portions of the live range can get registers even if the

	93 whole live range can't. Doing this separately for each basic block avoids

	94 merge complications, and keeps liveness analysis and register allocation fast

	95 by fitting well into the sparsity optimizations of their data structures.

	96

	97 Liveness analysis

	98 -----------------

	99

	100 With respect to register allocation, the main purpose of liveness analysis is to

	101 calculate the live range of each variable. The live range is represented as a

	102 set of instruction number ranges. Instruction numbers within a basic block must

	103 be monotonically increasing, and the instruction ranges of two different basic

	104 blocks must not overlap.

	105

	106 Basic liveness

	107 ^^^^^^^^^^^^^^

	108

	109 Liveness analysis is a straightforward dataflow algorithm. For each basic

	110 block, we keep track of the live-in and live-out set, i.e. the set of variables

	111 that are live coming into or going out of the basic block. Processing of a

	112 basic block starts by initializing a temporary set as the union of the live-in

	113 sets of the basic block's successor blocks. (This basic block's live-out set is

	114 captured as the initial value of the temporary set.) Then each instruction of

	115 the basic block is processed in reverse order. All the source variables of the

	116 instruction are marked as live, by adding them to the temporary set, and the

	117 dest variable of the instruction (if any) is marked as not-live, by removing it

	118 from the temporary set. When we finish processing all of the block's

	119 instructions, we add/union the temporary set into the basic block's live-in set.

	120 If this changes the basic block's live-in set, then we mark all of this basic

	121 block's predecessor blocks to be reprocessed. Then we repeat for other basic

	122 blocks until convergence, i.e. no more basic blocks are marked to be

	123 reprocessed. If basic blocks are processed in reverse topological order, then

	124 the number of times each basic block needs to be reprocessed is generally its

	125 loop nest depth.
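
The dataflow pass described above can be sketched as a small worklist loop. This is a minimal illustration under stated assumptions, not Subzero's implementation: the instruction encoding as ``(dest, sources)`` tuples and the name ``compute_liveness`` are hypothetical, and plain Python sets stand in for the bit vectors discussed below.

```python
from collections import deque


def compute_liveness(blocks, succs):
    """blocks: {name: [(dest_or_None, [sources]), ...]} in program order.
    succs: {name: [successor names]}.
    Returns (live_in, live_out), each {name: set of variables}."""
    preds = {name: [] for name in blocks}
    for name, targets in succs.items():
        for target in targets:
            preds[target].append(name)

    live_in = {name: set() for name in blocks}
    live_out = {name: set() for name in blocks}

    # Visit blocks in reverse of the given order; this stands in for a
    # reverse topological order when blocks are listed entry-first.
    worklist = deque(reversed(list(blocks)))
    queued = set(blocks)

    while worklist:
        name = worklist.popleft()
        queued.discard(name)

        # The temporary set starts as the union of the successors' live-in
        # sets, which is also this block's live-out set.
        live = set()
        for target in succs.get(name, []):
            live |= live_in[target]
        live_out[name] = set(live)

        # Walk instructions backwards: the dest becomes not-live, the
        # sources become live.
        for dest, sources in reversed(blocks[name]):
            if dest is not None:
                live.discard(dest)
            live.update(sources)

        # If live-in grows, the predecessors must be reprocessed.
        if not live <= live_in[name]:
            live_in[name] |= live
            for pred in preds[name]:
                if pred not in queued:
                    queued.add(pred)
                    worklist.append(pred)

    return live_in, live_out
```

In this toy encoding, a block that only reads ``a`` and loops back to itself is reprocessed once after its live-in set first changes, consistent with the loop nest depth observation.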

	126

	127 The result of this pass is the live-in and live-out set for each basic block.

	128

	129 With so many set operations, choice of data structure is crucial for

	130 performance. We tried a few options, and found that a simple dense bit vector

	131 works best. This keeps the per-instruction cost very low. However, we found

	132 that when the function gets very large, merging and copying bit vectors at basic

	133 block boundaries dominates the cost. This is due to the number of variables

	134 (hence the bit vector size) and the number of basic blocks getting large.

	135

	136 A simple enhancement brought this under control in Subzero. It turns out that

	137 the vast majority of variables are referenced, and therefore live, only in a

	138 single basic block. This is largely due to the SSA form of PNaCl bitcode. To

	139 take advantage of this, we partition variables by single-block versus

	140 multi-block liveness. Multi-block variables get lower-numbered bit vector

	141 indexes, and single-block variables get higher-numbered indexes. Single-block bit

	142 vector indexes are reused across different basic blocks. As such, the size of

	143 live-in and live-out bit vectors is limited to the number of multi-block

	144 variables, and the temporary set's size can be limited to that plus the largest

	145 number of single-block variables across all basic blocks.
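
The index partitioning can be illustrated with a short sketch. The function name ``assign_liveness_indexes`` and the input shape (a map from block name to the variables it references) are assumptions made for the example; only the numbering scheme comes from the text.

```python
def assign_liveness_indexes(refs_by_block):
    """refs_by_block: {block name: set of variable names referenced there}.
    Returns (index, num_multi, temp_size)."""
    # Record which blocks reference each variable.
    blocks_of = {}
    for block, variables in refs_by_block.items():
        for var in variables:
            blocks_of.setdefault(var, set()).add(block)

    # Multi-block variables get the low indexes 0..num_multi-1.
    multi = sorted(v for v, bs in blocks_of.items() if len(bs) > 1)
    index = {var: i for i, var in enumerate(multi)}
    num_multi = len(multi)

    # Single-block variables get indexes starting at num_multi, and the
    # same index range is reused in every block.
    max_single = 0
    for block, variables in refs_by_block.items():
        singles = sorted(v for v in variables if len(blocks_of[v]) == 1)
        for offset, var in enumerate(singles):
            index[var] = num_multi + offset
        max_single = max(max_single, len(singles))

    # Live-in/live-out vectors need num_multi bits; the per-block temporary
    # set needs num_multi + max_single bits.
    return index, num_multi, num_multi + max_single
```

With three blocks sharing two variables and holding at most two block-local temporaries each, the live-in and live-out vectors need only 2 bits and the temporary set only 4, regardless of how many blocks there are.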

	146

	147 For the purpose of live range construction, we also need to track definitions

	148 (LiveBegin) and last-uses (LiveEnd) of variables used within instructions of the

	149 basic block. These are easy to detect while processing the instructions; data

	150 structure choices are described below.

	151

	152 Live range construction

	153 ^^^^^^^^^^^^^^^^^^^^^^^

	154

	155 After the live-in and live-out sets are calculated, we construct each variable's

	156 live range (as an ordered list of instruction ranges, described above). We do

	157 this by considering the live-in and live-out sets, combined with LiveBegin and

	158 LiveEnd information. This is done separately for each basic block.

	159

	160 As before, we need to take advantage of sparsity of variable uses across basic

	161 blocks, to avoid excessive copying and merging of data structures. The following is what

	162 worked well for Subzero (after trying several other options).

	163

	164 The basic liveness pass, described above, keeps track of when a variable's live

	165 range begins or ends within the block. LiveBegin and LiveEnd are unordered

	166 vectors where each element is a pair of the variable and the instruction number,

	167 representing that the particular variable's live range begins or ends at the

	168 particular instruction. When the liveness pass finds a variable whose live

	169 range begins or ends, it appends an entry to LiveBegin or LiveEnd.

	170

	171 During live range construction, the LiveBegin and LiveEnd vectors are sorted by

	172 variable number. Then we iterate across both vectors in parallel. If a

	173 variable appears in both LiveBegin and LiveEnd, then its live range is entirely

	174 within this block. If it appears in only LiveBegin, then its live range starts

	175 here and extends through the end of the block. If it appears in only LiveEnd,

	176 then its live range starts at the beginning of the block and ends here. (Note

	177 that this only covers the live range within this block, and this process is

	178 repeated across all blocks.)

	179

	180 It is also possible that a variable is live within this block but its live range

	181 does not begin or end in this block. These variables are identified simply by

	182 taking the intersection of the live-in and live-out sets.
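
As a rough sketch of the per-block construction, the following merges sorted LiveBegin and LiveEnd vectors and then adds the live-through variables. The function name, the tuple encoding, and the integer variable numbers are assumptions for illustration; the sketch also assumes each variable has at most one begin and one end entry per block.

```python
def block_live_ranges(first, last, live_in, live_out, live_begin, live_end):
    """first/last: first and last instruction numbers of the block.
    live_begin/live_end: unordered lists of (variable number, inst number).
    Returns {variable: (start, end)} for this block only."""
    begins = sorted(live_begin)  # sort by variable number
    ends = sorted(live_end)
    ranges = {}
    i = j = 0
    # Walk both sorted vectors in parallel.
    while i < len(begins) or j < len(ends):
        bvar = begins[i][0] if i < len(begins) else None
        evar = ends[j][0] if j < len(ends) else None
        if evar is None or (bvar is not None and bvar < evar):
            # Only in LiveBegin: live from the definition to the block end.
            ranges[bvar] = (begins[i][1], last)
            i += 1
        elif bvar is None or evar < bvar:
            # Only in LiveEnd: live from the block start to the last use.
            ranges[evar] = (first, ends[j][1])
            j += 1
        else:
            # In both: the range lies entirely within this block.
            ranges[bvar] = (begins[i][1], ends[j][1])
            i += 1
            j += 1
    # Live-through variables: live-in and live-out, with no begin or end here.
    for var in live_in & live_out:
        ranges.setdefault(var, (first, last))
    return ranges
```

The per-block segments returned here would then be appended to each variable's overall live range as the blocks are visited in instruction-number order.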

	183

	184 As a result of these data structures, performance of liveness analysis and live

	185 range construction tends to be stable across small, medium, and large functions,

	186 as measured by a fairly consistent proportion of total compilation time spent on

	187 the liveness passes.

	188

	189 Linear-scan register allocation

	190 -------------------------------

	191

	192 The basis of Subzero's register allocator is the allocator described by

	193 Hanspeter Mössenböck and Michael Pfeiffer in `Linear Scan Register Allocation in

	194 the Context of SSA Form and Register Constraints

	195 <ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe02.PDF>`_. It allows live ranges to

	196 be a union of intervals rather than a single conservative interval, and it

	197 allows pre-coloring of variables with specific physical registers.

	198

	199 The paper suggests an approach of aggressively coalescing variables across Phi

	200 instructions (i.e., trying to force Phi source and dest variables to have the

	201 same register assignment), but we omit that in favor of the more natural

	202 preference mechanism described below.

	203

	204 We found the paper quite remarkable in that a straightforward implementation of

	205 its pseudo-code led to an entirely correct register allocator. The only thing

	206 we found in the specification that was even close to a mistake is that it was

	207 too aggressive in evicting inactive ranges in the "else" clause of the

	208 AssignMemLoc routine. An inactive range only needs to be evicted if it actually

	209 overlaps the current range being considered, whereas the paper evicts it

	210 unconditionally. (Search for "original paper" in Subzero's register allocator

	211 source code.)
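
To make the overlap condition concrete, here is a minimal sketch of an overlap test between two live ranges, each represented as a sorted list of disjoint intervals. The half-open ``(start, end)`` encoding and the name ``ranges_overlap`` are assumptions for the example, not taken from Subzero's source.

```python
def ranges_overlap(a, b):
    """a, b: sorted lists of disjoint half-open (start, end) intervals.
    Returns True if any interval of a intersects any interval of b."""
    i = j = 0
    while i < len(a) and j < len(b):
        a_start, a_end = a[i]
        b_start, b_end = b[j]
        if a_end <= b_start:
            i += 1  # a's interval lies entirely before b's
        elif b_end <= a_start:
            j += 1  # b's interval lies entirely before a's
        else:
            return True
    return False
```

An inactive range with a hole, such as ``[(0, 4), (10, 14)]``, does not overlap a current range of ``[(4, 10)]``, so under the condition described above it would not need to be evicted. A single conservative interval ``[(0, 14)]`` would report an overlap.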

	212

	213 Register preference

	214 -------------------

	215

	216 The linear-scan algorithm from the paper talks about choosing an available

	217 register, but isn't specific on how to choose among several available registers.

	218 The simplest approach is to just choose the first available register, e.g. the

	219 lowest-numbered register. Often a better choice is possible.

	220

	221 Specifically, if variable ``V`` is assigned in an instruction ``V=f(S1,S2,...)``

	222 with source variables ``S1,S2,...``, and that instruction ends the live range of

	223 one of those source variables ``Sn``, and ``Sn`` was assigned a register, then

	224 ``Sn``'s register is usually a good choice for ``V``. This is especially true

	225 when the instruction is a simple assignment, because an assignment where the

	226 dest and source variables end up with the same register can be trivially elided,

	227 reducing the amount of register-shuffling code.

	228

	229 This requires being able to find and inspect a variable's defining instruction,

	230 which is not an assumption in the original paper but is easily tracked in

	231 practice.
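
A minimal sketch of the preference rule follows. The data shapes are hypothetical: ``def_inst`` is a dictionary describing the variable's defining instruction, and ``free_regs`` is assumed to already contain the register of any source whose live range ends at that instruction.

```python
def choose_register(free_regs, def_inst, assigned):
    """free_regs: free register names, in default (lowest-first) order.
    def_inst: {"sources": [vars], "ends": set of sources whose live range
              ends at the defining instruction}.
    assigned: {variable: register} for variables that already have one.
    Returns the chosen register, or None if nothing is free."""
    # Prefer the register of a source variable that dies at the definition.
    for src in def_inst["sources"]:
        reg = assigned.get(src)
        if src in def_inst["ends"] and reg in free_regs:
            return reg
    # Otherwise fall back to the first available register.
    return free_regs[0] if free_regs else None
```

When the defining instruction is a plain assignment and the preferred register is chosen, source and dest share a register and the assignment can be elided.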

	232

	233 Register overlap

	234 ----------------

	235

	236 Because Subzero does register allocation after basic lowering, the lowering has

	237 to be prepared for the possibility of any given program variable not getting a

	238 physical register. It does this by introducing must-have-register temporary

	239 variables and copying the program variable into the temporary to ensure that

	240 register requirements in the target instruction are met.

	241

	242 In many cases, the program variable and temporary variable have overlapping live

	243 ranges, and would be forced to have different registers even if the temporary

	244 variable is effectively a read-only copy of the program variable. We recognize

	245 this when the program variable has no definitions within the temporary

	246 variable's live range, and the temporary variable has no definitions within the

	247 program variable's live range with the exception of the copy assignment.

	248

	249 This analysis is done as part of register preference detection.

	250

	251 The main impact on the linear-scan implementation is that instead of

	252 setting/resetting a boolean flag to indicate whether a register is free or in

	253 use, we increment/decrement a number-of-uses counter.

	254

	255 Register aliases

	256 ----------------

	257

	258 Sometimes registers of different register classes partially overlap. For

	259 example, in x86, registers ``al`` and ``ah`` alias ``ax`` (though they don't

	260 alias each other), and all three alias ``eax`` and ``rax``. And in ARM,

	261 registers ``s0`` and ``s1`` (which are single-precision floating-point

	262 registers) alias ``d0`` (double-precision floating-point), and registers ``d0``

	263 and ``d1`` alias ``q0`` (128-bit vector register). The linear-scan paper

	264 doesn't address this issue; it assumes that all registers are independent. A

	265 simple solution is to essentially avoid aliasing by disallowing a subset of the

	266 registers, but there is obviously a reduction in code quality when e.g. half of

	267 the registers are taken away.

	268

	269 Subzero handles this more elegantly. For each register, we keep a bitmask

	270 indicating which registers alias/conflict with it. For example, in x86,

	271 ``ah``'s alias set is ``ah``, ``ax``, ``eax``, and ``rax`` (but not ``al``), and

	272 in ARM, ``s1``'s alias set is ``s1``, ``d0``, and ``q0``. Whenever we want to

	273 mark a register as being used or released, we do the same for all of its

	274 aliases.
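
The alias bitmask and the number-of-uses counter from the previous section can be combined in one small sketch. The class name ``RegisterFile``, the register numbering, and the omission of ``rax`` are assumptions made to keep the example short.

```python
class RegisterFile:
    """Tracks register use with per-register counters and alias bitmasks."""

    def __init__(self, alias_masks):
        # alias_masks[r] has bit k set if register k conflicts with r.
        # Every register is assumed to alias itself.
        self.alias_masks = alias_masks
        self.use_counts = [0] * len(alias_masks)

    def _aliases(self, reg):
        mask = self.alias_masks[reg]
        idx = 0
        while mask:
            if mask & 1:
                yield idx
            mask >>= 1
            idx += 1

    def acquire(self, reg):
        # Marking a register as used marks all of its aliases as used.
        for r in self._aliases(reg):
            self.use_counts[r] += 1

    def release(self, reg):
        for r in self._aliases(reg):
            assert self.use_counts[r] > 0
            self.use_counts[r] -= 1

    def is_free(self, reg):
        return self.use_counts[reg] == 0


# Hypothetical numbering for a slice of the x86 register file.
AL, AH, AX, EAX = 0, 1, 2, 3
X86_MASKS = [
    0b1101,  # al aliases al, ax, eax (not ah)
    0b1110,  # ah aliases ah, ax, eax (not al)
    0b1111,  # ax aliases al, ah, ax, eax
    0b1111,  # eax aliases al, ah, ax, eax
]
```

A counter rather than a boolean matters here for two reasons: overlapping variables may share a register, and a wide register such as ``eax`` stays busy until every narrower alias in use has been released.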

	275

	276 Before implementing this, we were a bit apprehensive about the compile-time

	277 performance impact. Instead of setting one bit in a bit vector or decrementing

	278 one counter, this generally needs to be done in a loop that iterates over all

	279 aliases. Fortunately, this seemed to make very little difference in

	280 performance, as the bulk of the cost ends up being in live range overlap

	281 computations, which are not affected by register aliasing.

	282

	283 Repeat until convergence

	284 ------------------------

	285

	286 Sometimes the linear-scan algorithm makes a register assignment only to later

	287 revoke it in favor of a higher-priority variable, but it turns out that a

	288 different initial register choice would not have been revoked. For relatively

	289 low compile-time cost, we can give those variables another chance.

	290

	291 During register allocation, we keep track of the revoked variables and then do

	292 another round of register allocation targeted only to that set. We repeat until

	293 no new register assignments are made, which is usually just a handful of

	294 successively cheaper iterations.
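
The driver loop for this strategy might look roughly like the sketch below. The callback ``run_pass`` is a hypothetical stand-in for one round of linear-scan allocation; it is assumed to return the new assignments it made and the set of variables whose registers were revoked during that round.

```python
def allocate_until_convergence(all_vars, run_pass):
    """run_pass(candidates, assigned) -> (new_assignments, revoked).
    Returns (assigned, number of rounds run)."""
    assigned = {}
    candidates = set(all_vars)
    rounds = 0
    while candidates:
        rounds += 1
        new_assignments, revoked = run_pass(candidates, assigned)
        assigned.update(new_assignments)
        # Stop as soon as a round makes no new register assignments.
        if not new_assignments:
            break
        # The next round targets only the revoked variables, not every
        # variable that lacks a register.
        candidates = {v for v in revoked if v not in assigned}
    return assigned, rounds


def scripted_pass(script):
    """Test helper: replays a fixed list of (new_assignments, revoked)."""
    results = iter(script)

    def run(candidates, assigned):
        return next(results)

    return run
```

Because each round considers only the previously revoked set, the candidate set shrinks or stays the same from one round to the next.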

	295

	296 Another approach would be to repeat register allocation for all variables that

	297 haven't had a register assigned (rather than variables that got a register that

	298 was subsequently revoked), but our experience is that this greatly increases

	299 compile-time cost, with little or no code quality gain.

	300

	301 Delayed Phi lowering

	302 --------------------

	303

	304 The linear-scan algorithm works for phi instructions as well as regular

	305 instructions, but it is tempting to lower phi instructions into non-SSA

	306 assignments before register allocation, so that register allocation need only

	307 happen once.

	308

	309 Unfortunately, simple phi lowering imposes an arbitrary ordering on the

	310 resulting assignments that can cause artificial overlap/interference between

	311 lowered assignments, and can lead to worse register allocation decisions. As a

	312 simple example, consider these two phi instructions which are semantically

	313 unordered::

	314

	315 A = phi(B) // B's live range ends here

	316 C = phi(D) // D's live range ends here

	317

	318 A straightforward lowering might yield::

	319

	320 A = B // end of B's live range

	321 C = D // end of D's live range

	322

	323 The potential problem here is that A and D's live ranges overlap, and so they

	324 are prevented from having the same register. Swapping the order of lowered

	325 assignments fixes that (but then B and C would overlap), but we can't really

	326 know which is better until after register allocation.

	327

	328 Subzero deals with this by doing the main register allocation before phi

	329 lowering, followed by phi lowering, and finally a special register allocation

	330 pass limited to the new lowered assignments.

	331

	332 Phi lowering considers the phi operands separately for each predecessor edge,

	333 and starts by finding a topological sort of the Phi instructions, such that

	334 assignments can be executed in that order without violating dependencies on

	335 registers or stack locations. If a topological sort is not possible due to a

	336 cycle, the cycle is broken by introducing a temporary, e.g. ``A=B;B=C;C=A`` can

	337 become ``T=A;A=B;B=C;C=T``. The topological order is tuned to favor freeing up

	338 registers early to reduce register pressure.
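
The ordering and cycle-breaking step can be sketched as follows for the assignments on one predecessor edge. This is a simplified illustration: the name ``sequentialize`` and the temporary name ``T`` are assumptions, and the sketch does not attempt the tuning toward freeing registers early.

```python
def sequentialize(copies, temp="T"):
    """copies: list of (dest, src) pairs that are semantically parallel,
    with each dest appearing at most once.
    Returns an ordered list of (dest, src) assignments."""
    pending = {dest: src for dest, src in copies if dest != src}
    ordered = []
    while pending:
        # An assignment is safe to emit once no other pending assignment
        # still needs to read the old value of its dest.
        ready = [d for d in pending if d not in pending.values()]
        if ready:
            dest = ready[0]
            ordered.append((dest, pending.pop(dest)))
        else:
            # Every pending dest is still read by another assignment, so
            # there is a cycle. Save one dest in the temporary and redirect
            # its readers to the temporary.
            dest = next(iter(pending))
            ordered.append((temp, dest))
            for key, src in pending.items():
                if src == dest:
                    pending[key] = temp
    return ordered
```

Applied to the cyclic example ``A=B;B=C;C=A``, this sketch produces ``T=A;A=B;B=C;C=T``, while independent assignments such as ``A=B;C=D`` are emitted unchanged.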

	339

	340 It then lowers the linearized assignments into machine instructions (which may

	341 require extra physical registers e.g. to copy from one stack location to

	342 another), and finally runs the register allocator limited to those instructions.

	343

	344 In rare cases, the register allocator may fail on some must-have-register

	345 variable, because register pressure is too high to satisfy requirements arising

	346 from cycle-breaking temporaries and registers required for stack-to-stack

	347 copies. In this case, we have to find a register with no active uses within the

	348 variable's live range, and actively spill/restore that register around the live

	349 range. This makes the code quality suffer and may be slow to implement

	350 depending on compiler data structures, but in practice we find the situation to

	351 be vanishingly rare and so not really worth optimizing.

	352

	353 Local live range splitting

	354 --------------------------

	355

	356 The basic linear-scan algorithm has an "all-or-nothing" policy: a variable gets

	357 a register for its entire live range, or not at all. This is unfortunate when a

	358 variable has many uses close together, but ultimately a large enough live range

	359 to prevent register assignment. Another bad example is on x86 where all vector

	360 and floating-point registers (the ``xmm`` registers) are killed by call

	361 instructions, per the x86 call ABI, so such variables are completely prevented

	362 from having a register when their live ranges contain a call instruction.

	363

	364 The general solution here is some policy for splitting live ranges. A variable

	365 can be split into multiple copies and each can be register-allocated separately.

	366 The complication comes in finding a sane policy for where and when to split

	367 variables such that complexity doesn't explode, and how to join the different

	368 values at merge points.

	369

	370 Subzero implements aggressive block-local splitting of variables. Each basic

	371 block is handled separately and independently. Within the block, we maintain a

	372 table ``T`` that maps each variable ``V`` to its split version ``T[V]``, with

	373 every variable ``V``'s initial state set (implicitly) as ``T[V]=V``. For each

	374 instruction in the block, and for each may-have-register variable ``V`` in the

	375 instruction, we do the following:

	376

	377 - Replace all uses of ``V`` in the instruction with ``T[V]``.

	378

	379 - Create a new split variable ``V'``.

	380

	381 - Add a new assignment ``V'=T[V]``, placing it adjacent to (either immediately

	382 before or immediately after) the current instruction.

	383

	384 - Update ``T[V]`` to be ``V'``.

	385

	386 This leads to a chain of copies of ``V`` through the block, linked by assignment

	387 instructions. The live ranges of these copies are usually much smaller, and

	388 more likely to get registers. In fact, because of the preference mechanism

	389 described above, they are likely to get the same register whenever possible.
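
The splitting steps can be sketched on a toy instruction list. Several simplifications are assumed: instructions are ``(dest, sources)`` tuples, only source operands are split, the new assignment is always placed immediately after the instruction, and the split names (``a'1``, ``a'2``, ...) are made up for the example.

```python
def split_block(instructions, splittable):
    """instructions: [(dest_or_None, [sources]), ...] for one basic block.
    splittable: set of may-have-register variables to split.
    Returns the rewritten instruction list."""
    table = {}    # T: maps V to its current split version; absent means V
    counter = {}  # number of splits created so far for each variable
    out = []
    for dest, sources in instructions:
        # Replace all uses of V in the instruction with T[V].
        out.append((dest, [table.get(s, s) for s in sources]))
        handled = set()
        for var in sources:
            if var not in splittable or var in handled:
                continue
            handled.add(var)
            # Create a new split variable V'.
            counter[var] = counter.get(var, 0) + 1
            split = f"{var}'{counter[var]}"
            # Add the assignment V' = T[V] right after the instruction.
            out.append((split, [table.get(var, var)]))
            # Update T[V] to be V'.
            table[var] = split
    return out
```

In the output, each copy of ``a`` is live only between two consecutive uses, which is what makes the individual pieces more likely to get registers. The final copy in the chain has no later use in this toy example.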

	390

	391 One obvious question comes up: won't this proliferation of new variables cause

	392 an explosion in the running time of liveness analysis and register allocation?

	393 As it turns out, not really.

	394

	395 First, for register allocation, the cost tends to be dominated by live range

	396 overlap computations, whose cost is roughly proportional to the size of the live

	397 range. All the new variable copies' live ranges sum up to the original

	398 variable's live range, so the cost isn't vastly greater.

	399

	400 Second, for liveness analysis, the cost is dominated by merging bit vectors

	401 corresponding to the set of variables that have multi-block liveness. All the

	402 new copies are guaranteed to be single-block, so the main additional cost is

	403 that of processing the new assignments.

	404

	405 There's one other key issue here. The original variable and all of its copies

	406 need to be "linked", in the sense that all of these variables that get a stack

	407 slot (because they didn't get a register) are guaranteed to have the same stack

	408 slot. This way, we can avoid generating any code related to ``V'=V`` when

	409 neither gets a register. In addition, we can elide instructions that write a

	410 value to a stack slot that is linked to another stack slot, because that is

	411 guaranteed to be just rewriting the same value to the stack.
