Index: docs/REGALLOC.rst |
diff --git a/docs/REGALLOC.rst b/docs/REGALLOC.rst |
new file mode 100644 |
index 0000000000000000000000000000000000000000..4ab7eb51302f79a81d033c6b12f3226bb81e8ec9 |
--- /dev/null |
+++ b/docs/REGALLOC.rst |
@@ -0,0 +1,400 @@ |
+Register allocation in Subzero |
+============================== |
+ |
+Introduction |
+------------ |
+ |
+`Subzero |
+<https://chromium.googlesource.com/native_client/pnacl-subzero/+/master/docs/DESIGN.rst>`_ |
+is a fast code generator that translates architecture-independent `PNaCl bitcode |
+<https://developer.chrome.com/native-client/reference/pnacl-bitcode-abi>`_ into |
+architecture-specific machine code. PNaCl bitcode is LLVM bitcode that has been |
+simplified (e.g. weird-width primitive types like 57-bit integers are not |
+allowed) and has had architecture-independent optimizations already applied. |
+Subzero aims to generate high-quality code as fast as practical, and as such |
+Subzero needs to make tradeoffs between compilation speed and output code |
+quality. |
+ |
+In Subzero, we have found register allocation to be one of the most important |
+optimizations toward achieving the best code quality, which is in tension with |
+register allocation's reputation for being complex and expensive. Linear-scan |
+register allocation is a modern favorite for getting fairly good register |
+allocation at relatively low cost. Subzero uses linear-scan for its core |
+register allocator, with a few internal modifications to improve its |
+effectiveness (specifically: register preference, register overlap, and register |
+aliases). Subzero also does a few high-level things on top of its core register |
+allocator to improve overall effectiveness (specifically: repeat until |
+convergence, delayed phi lowering, and local live range splitting). |
+ |
+What we describe here are techniques that have worked well for Subzero, in the |
+context of its particular intermediate representation (IR) and compilation |
+strategy. Some of these techniques may not be applicable to another compiler, |
+depending on its particular IR and compilation strategy. Some key concepts in |
+Subzero are the following: |
+ |
+- Subzero's ``Variable`` operand is an operand that resides either on the stack |
+ or in a physical register. A Variable can be tagged as *must-have-register* |
+ or *must-not-have-register*, but its default is *may-have-register*. All uses |
+ of the Variable map to the same physical register or stack location. |
+ |
+- Basic lowering is done before register allocation. Lowering is the process of |
+ translating PNaCl bitcode instructions into native instructions. Sometimes a |
+ native instruction, like the x86 ``add`` instruction, allows one of its |
+ Variable operands to be either in a physical register or on the stack, in |
+ which case the lowering is relatively simple. But if the lowered instruction |
+ requires the operand to be in a physical register, we generate code that |
+ copies the Variable into a *must-have-register* temporary, and then use that |
+ temporary in the lowered instruction. |
+ |
+- Instructions within a basic block appear in a linearized order (as opposed to |
+ something like a directed acyclic graph of dependency edges). |
+ |
+- An instruction has 0 or 1 *dest* Variables and an arbitrary (but usually |
+ small) number of *source* Variables. Assuming SSA form, the instruction |
+ begins the live range of the dest Variable, and may end the live range of one |
+ or more of the source Variables. |
+ |
+Executive summary |
+----------------- |
+ |
+- Liveness analysis and live range construction are prerequisites for register |
+ allocation. Without careful attention, they can be potentially very |
+ expensive, especially when the number of variables and basic blocks gets very |
+ large. Subzero uses some clever data structures to take advantage of the |
+ sparsity of the data, resulting in stable performance as function size scales |
+ up. This means that the proportion of compilation time spent on liveness |
+ analysis stays roughly the same. |
+ |
+- The core of Subzero's register allocator is taken from `Mössenböck and |
+ Pfeiffer's paper <ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe02.PDF>`_ on |
+ linear-scan register allocation. |
+ |
+- We enhance the core algorithm with a good automatic preference mechanism when |
+ more than one register is available, to try to minimize register shuffling. |
+ |
+- We also enhance it to allow overlapping live ranges to share the same |
+ register, when one variable is recognized as a read-only copy of the other. |
+ |
+- We deal with register aliasing in a clean and general fashion. Register |
+ aliasing is when e.g. 16-bit architectural registers share some bits with |
+ their 32-bit counterparts, or 64-bit registers are actually pairs of 32-bit |
+ registers. |
+ |
+- We improve register allocation by rerunning the algorithm on likely candidates |
+ that didn't get registers during the previous iteration, without imposing much |
+ additional cost. |
+ |
+- The main register allocation is done before phi lowering, because phi lowering |
+ imposes early and unnecessary ordering constraints on the resulting |
+  assignments, which create spurious interference between live ranges.
+ |
+- Within each basic block, we aggressively split each variable's live range at |
+ every use, so that portions of the live range can get registers even if the |
+ whole live range can't. Doing this separately for each basic block avoids |
+ merge complications, and keeps liveness analysis and register allocation fast |
+ by fitting well into the sparsity optimizations of their data structures. |
+ |
+Liveness analysis |
+----------------- |
+ |
+With respect to register allocation, the main purpose of liveness analysis is to |
+calculate the live range of each variable. The live range is represented as a |
+set of instruction number ranges. Instruction numbers within a basic block must |
+be monotonically increasing, and the instruction ranges of two different basic |
+blocks must not overlap. |
+ |
+Basic liveness |
+^^^^^^^^^^^^^^ |
+ |
+Liveness analysis is a straightforward dataflow algorithm. For each basic |
+block, we keep track of the live-in and live-out set, i.e. the set of variables |
+that are live coming into or going out of the basic block. Processing of a |
+basic block starts by initializing a temporary set as the union of the live-in |
+sets of the basic block's successor blocks. (This basic block's live-out set is |
+captured as the initial value of the temporary set.) Then each instruction of |
+the basic block is processed in reverse order. All the source variables of the |
+instruction are marked as live, by adding them to the temporary set, and the |
+dest variable of the instruction (if any) is marked as not-live, by removing it |
+from the temporary set. When we finish processing all of the block's |
+instructions, we add/union the temporary set into the basic block's live-in set. |
+If this changes the basic block's live-in set, then we mark all of this basic |
+block's predecessor blocks to be reprocessed. Then we repeat for other basic |
+blocks until convergence, i.e. no more basic blocks are marked to be |
+reprocessed. If basic blocks are processed in reverse topological order, then |
+the number of times each basic block needs to be reprocessed is generally its
+loop nest depth. |
+ |
+The result of this pass is the live-in and live-out set for each basic block. |
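+
+For concreteness, here is a minimal C++ sketch of the dataflow loop described
+above, using simplified stand-in structures (``Inst``, ``Block``, and plain
+``std::vector<bool>`` bit vectors) rather than Subzero's actual classes::
+
+  #include <cstddef>
+  #include <deque>
+  #include <vector>
+
+  // Simplified stand-ins, not Subzero's IR classes.
+  struct Inst {
+    std::vector<int> Srcs; // source variable numbers
+    int Dest = -1;         // dest variable number, or -1 if none
+  };
+  struct Block {
+    std::vector<Inst> Insts;
+    std::vector<int> Preds, Succs;     // block indexes
+    std::vector<bool> LiveIn, LiveOut; // one bit per tracked variable
+  };
+
+  void computeLiveness(std::vector<Block> &Blocks, std::size_t NumVars) {
+    for (Block &B : Blocks) {
+      B.LiveIn.assign(NumVars, false);
+      B.LiveOut.assign(NumVars, false);
+    }
+    std::deque<int> Work;
+    for (int I = (int)Blocks.size() - 1; I >= 0; --I)
+      Work.push_back(I); // roughly reverse topological order converges fastest
+    while (!Work.empty()) {
+      Block &B = Blocks[Work.front()];
+      Work.pop_front();
+      // The temporary set starts as the union of the successors' live-in
+      // sets, which is also this block's live-out set.
+      std::vector<bool> Live(NumVars, false);
+      for (int S : B.Succs)
+        for (std::size_t V = 0; V < NumVars; ++V)
+          if (Blocks[S].LiveIn[V])
+            Live[V] = true;
+      B.LiveOut = Live;
+      // Process instructions in reverse: the dest is killed, sources are live.
+      for (auto It = B.Insts.rbegin(); It != B.Insts.rend(); ++It) {
+        if (It->Dest >= 0)
+          Live[It->Dest] = false;
+        for (int S : It->Srcs)
+          Live[S] = true;
+      }
+      // Union into the live-in set; if it grew, reprocess the predecessors.
+      bool Changed = false;
+      for (std::size_t V = 0; V < NumVars; ++V)
+        if (Live[V] && !B.LiveIn[V]) {
+          B.LiveIn[V] = true;
+          Changed = true;
+        }
+      if (Changed)
+        for (int P : B.Preds)
+          Work.push_back(P);
+    }
+  }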
+ |
+With so many set operations, choice of data structure is crucial for |
+performance. We tried a few options, and found that a simple dense bit vector |
+works best. This keeps the per-instruction cost very low. However, we found |
+that when the function gets very large, merging and copying bit vectors at basic |
+block boundaries dominates the cost. This is due to the number of variables |
+(hence the bit vector size) and the number of basic blocks getting large. |
+ |
+A simple enhancement brought this under control in Subzero. It turns out that |
+the vast majority of variables are referenced, and therefore live, only in a |
+single basic block. This is largely due to the SSA form of PNaCl bitcode. To |
+take advantage of this, we partition variables by single-block versus |
+multi-block liveness. Multi-block variables get lower-numbered bit vector |
+indexes, and single-block variables get higher-numbered indexes. Single-block bit
+vector indexes are reused across different basic blocks. As such, the size of |
+live-in and live-out bit vectors is limited to the number of multi-block |
+variables, and the temporary set's size can be limited to that plus the largest |
+number of single-block variables across all basic blocks. |
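+
+A sketch of the index assignment, using an illustrative ``VarInfo`` structure
+rather than Subzero's ``Variable`` class::
+
+  #include <cstddef>
+  #include <vector>
+
+  struct VarInfo {
+    bool IsMultiBlock;             // referenced in more than one basic block
+    int HomeBlock;                 // meaningful only when !IsMultiBlock
+    std::size_t LivenessIndex = 0; // position in the liveness bit vectors
+  };
+
+  // Returns the number of multi-block variables, which is all that the
+  // live-in and live-out bit vectors need to cover.
+  std::size_t assignLivenessIndexes(std::vector<VarInfo> &Vars,
+                                    std::size_t NumBlocks) {
+    std::size_t NumMultiBlock = 0;
+    for (VarInfo &V : Vars)
+      if (V.IsMultiBlock)
+        V.LivenessIndex = NumMultiBlock++;
+    // Single-block indexes restart at NumMultiBlock within each block, so
+    // they are reused across blocks and never appear in live-in/live-out.
+    std::vector<std::size_t> NextPerBlock(NumBlocks, NumMultiBlock);
+    for (VarInfo &V : Vars)
+      if (!V.IsMultiBlock)
+        V.LivenessIndex = NextPerBlock[V.HomeBlock]++;
+    return NumMultiBlock;
+  }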
+ |
+For the purpose of live range construction, we also need to track definitions |
+(LiveBegin) and last-uses (LiveEnd) of variables used within instructions of the |
+basic block. These are easy to detect while processing the instructions; data |
+structure choices are described below. |
+ |
+Live range construction |
+^^^^^^^^^^^^^^^^^^^^^^^ |
+ |
+After the live-in and live-out sets are calculated, we construct each variable's |
+live range (as an ordered list of instruction ranges, described above). We do |
+this by considering the live-in and live-out sets, combined with LiveBegin and |
+LiveEnd information. This is done separately for each basic block. |
+ |
+As before, we need to take advantage of sparsity of variable uses across basic |
+blocks, to avoid excessive copying and merging of data structures. The following is what
+worked well for Subzero (after trying several other options). |
+ |
+The basic liveness pass, described above, keeps track of when a variable's live |
+range begins or ends within the block. LiveBegin and LiveEnd are unordered |
+vectors where each element is a pair of the variable and the instruction number, |
+representing that the particular variable's live range begins or ends at the |
+particular instruction. When the liveness pass finds a variable whose live |
+range begins or ends, it appends an entry to LiveBegin or LiveEnd.
+ |
+During live range construction, the LiveBegin and LiveEnd vectors are sorted by |
+variable number. Then we iterate across both vectors in parallel. If a |
+variable appears in both LiveBegin and LiveEnd, then its live range is entirely |
+within this block. If it appears in only LiveBegin, then its live range starts |
+here and extends through the end of the block. If it appears in only LiveEnd, |
+then its live range starts at the beginning of the block and ends here. (Note |
+that this only covers the live range within this block, and this process is |
+repeated across all blocks.) |
+ |
+It is also possible that a variable is live within this block but its live range |
+does not begin or end in this block. These variables are identified simply by |
+taking the intersection of the live-in and live-out sets. |
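+
+The following sketch shows the per-block merge of the sorted LiveBegin and
+LiveEnd vectors; the types are illustrative, and the live-in/live-out
+intersection case just mentioned is handled separately::
+
+  #include <algorithm>
+  #include <cstddef>
+  #include <utility>
+  #include <vector>
+
+  using VarNum = int;
+  using InstNum = int;
+  using Mark = std::pair<VarNum, InstNum>; // (variable, instruction number)
+
+  struct LiveRange {
+    // Ordered list of instruction-number intervals.
+    std::vector<std::pair<InstNum, InstNum>> Segments;
+    void addSegment(InstNum Begin, InstNum End) {
+      Segments.push_back({Begin, End});
+    }
+  };
+
+  // Build the within-block segments for a block whose instructions are
+  // numbered [FirstNum, LastNum].
+  void buildBlockRanges(std::vector<Mark> LiveBegin, std::vector<Mark> LiveEnd,
+                        InstNum FirstNum, InstNum LastNum,
+                        std::vector<LiveRange> &Ranges) {
+    auto ByVar = [](const Mark &A, const Mark &B) { return A.first < B.first; };
+    std::sort(LiveBegin.begin(), LiveBegin.end(), ByVar);
+    std::sort(LiveEnd.begin(), LiveEnd.end(), ByVar);
+    std::size_t IB = 0, IE = 0;
+    while (IB < LiveBegin.size() || IE < LiveEnd.size()) {
+      bool HaveB = IB < LiveBegin.size();
+      bool HaveE = IE < LiveEnd.size();
+      if (HaveB && (!HaveE || LiveBegin[IB].first < LiveEnd[IE].first)) {
+        // Begins here and is still live at the bottom of the block.
+        Ranges[LiveBegin[IB].first].addSegment(LiveBegin[IB].second, LastNum);
+        ++IB;
+      } else if (HaveE && (!HaveB || LiveEnd[IE].first < LiveBegin[IB].first)) {
+        // Live at the top of the block, and its last use is here.
+        Ranges[LiveEnd[IE].first].addSegment(FirstNum, LiveEnd[IE].second);
+        ++IE;
+      } else {
+        // Begins and ends within this block.
+        Ranges[LiveBegin[IB].first].addSegment(LiveBegin[IB].second,
+                                               LiveEnd[IE].second);
+        ++IB;
+        ++IE;
+      }
+    }
+  }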
+ |
+As a result of these data structures, performance of liveness analysis and live |
+range construction tend to be stable across small, medium, and large functions, |
+as measured by a fairly consistent proportion of total compilation time spent on |
+the liveness passes. |
+ |
+Linear-scan register allocation |
+------------------------------- |
+ |
+The basis of Subzero's register allocator is the allocator described by |
+Hanspeter Mössenböck and Michael Pfeiffer in `Linear Scan Register Allocation in |
+the Context of SSA Form and Register Constraints |
+<ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe02.PDF>`_. It allows live ranges to |
+be a union of intervals rather than a single conservative interval, and it |
+allows pre-coloring of variables with specific physical registers. |
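+
+As a sketch of what this implies for the representation, a live range can be
+kept as an ordered list of intervals, and the overlap test that linear scan
+performs repeatedly becomes a linear walk over two such lists (illustrative
+code, not Subzero's actual ``LiveRange`` class)::
+
+  #include <cstddef>
+  #include <utility>
+  #include <vector>
+
+  struct LiveRange {
+    // Sorted, non-overlapping [Begin, End) instruction-number intervals.
+    std::vector<std::pair<int, int>> Segments;
+
+    bool overlaps(const LiveRange &Other) const {
+      std::size_t I = 0, J = 0;
+      while (I < Segments.size() && J < Other.Segments.size()) {
+        const auto &A = Segments[I];
+        const auto &B = Other.Segments[J];
+        if (A.second <= B.first)
+          ++I; // A ends before B starts
+        else if (B.second <= A.first)
+          ++J; // B ends before A starts
+        else
+          return true; // the intervals intersect
+      }
+      return false;
+    }
+  };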
+ |
+The paper suggests an approach of aggressively coalescing variables across Phi |
+instructions (i.e., trying to force Phi source and dest variables to have the |
+same register assignment), but we omit that in favor of the more natural |
+preference mechanism described below. |
+ |
+We found the paper quite remarkable in that a straightforward implementation of |
+its pseudo-code led to an entirely correct register allocator. The only thing |
+we found in the specification that was even close to a mistake is that it was |
+too aggressive in evicting inactive ranges in the "else" clause of the |
+AssignMemLoc routine. An inactive range only needs to be evicted if it actually |
+overlaps the current range being considered, whereas the paper evicts it |
+unconditionally. (Search for "original paper" in Subzero's register allocator |
+source code.) |
+ |
+Register preference |
+------------------- |
+ |
+The linear-scan algorithm from the paper talks about choosing an available |
+register, but isn't specific about how to choose among several available registers.
+The simplest approach is to just choose the first available register, e.g. the |
+lowest-numbered register. Often a better choice is possible. |
+ |
+Specifically, if variable ``V`` is assigned in an instruction ``V=f(S1,S2,...)`` |
+with source variables ``S1,S2,...``, and that instruction ends the live range of |
+one of those source variables ``Sn``, and ``Sn`` was assigned a register, then |
+``Sn``'s register is usually a good choice for ``V``. This is especially true |
+when the instruction is a simple assignment, because an assignment where the |
+dest and source variables end up with the same register can be trivially elided, |
+reducing the amount of register-shuffling code. |
+ |
+This requires being able to find and inspect a variable's defining instruction, |
+which is not an assumption in the original paper but is easily tracked in |
+practice. |
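+
+A sketch of the preference heuristic, with illustrative types rather than
+Subzero's API, and ignoring the overlap mechanism described in the next
+section::
+
+  #include <cstddef>
+  #include <vector>
+
+  struct Variable {
+    int RegNum = -1; // -1 means no register assigned
+    bool hasReg() const { return RegNum >= 0; }
+  };
+  struct Inst {
+    std::vector<Variable *> Srcs;
+    std::vector<bool> EndsSrcRange; // parallel to Srcs: does this use end the range?
+  };
+
+  // Prefer the register of a source whose live range ends at the defining
+  // instruction; otherwise fall back to the first free register.
+  int chooseRegister(const Inst &DefInst, const std::vector<bool> &Free) {
+    for (std::size_t I = 0; I < DefInst.Srcs.size(); ++I) {
+      const Variable *Src = DefInst.Srcs[I];
+      if (DefInst.EndsSrcRange[I] && Src->hasReg() && Free[Src->RegNum])
+        return Src->RegNum;
+    }
+    for (int R = 0; R < (int)Free.size(); ++R)
+      if (Free[R])
+        return R;
+    return -1; // no register available; caller must evict or spill
+  }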
+ |
+Register overlap |
+---------------- |
+ |
+Because Subzero does register allocation after basic lowering, the lowering has |
+to be prepared for the possibility of any given program variable not getting a |
+physical register. It does this by introducing a *must-have-register* temporary
+variable and copying the program variable into the temporary, to ensure that the
+register requirements of the target instruction are met.
+ |
+In many cases, the program variable and temporary variable have overlapping live |
+ranges, and would be forced to have different registers even if the temporary |
+variable is effectively a read-only copy of the program variable. We recognize |
+this when the program variable has no definitions within the temporary |
+variable's live range, and the temporary variable has no definitions within the |
+program variable's live range with the exception of the copy assignment. |
+ |
+This analysis is done as part of register preference detection. |
+ |
+The main impact on the linear-scan implementation is that instead of |
+setting/resetting a boolean flag to indicate whether a register is free or in |
+use, we increment/decrement a number-of-uses counter. |
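+
+A small sketch of that change (illustrative bookkeeping, not Subzero's actual
+data structures)::
+
+  #include <vector>
+
+  // Per-register use counts instead of free/in-use booleans, so that two
+  // variables allowed to overlap (a read-only copy and its source) can both
+  // hold the same register at the same time.
+  struct RegUsage {
+    std::vector<int> NumUses; // one counter per physical register
+
+    explicit RegUsage(int NumRegs) : NumUses(NumRegs, 0) {}
+    void acquire(int Reg) { ++NumUses[Reg]; }
+    void release(int Reg) { --NumUses[Reg]; }
+    bool isFree(int Reg) const { return NumUses[Reg] == 0; }
+  };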
+ |
+Register aliases |
+---------------- |
+ |
+Sometimes registers of different register classes partially overlap. For |
+example, in x86, registers ``al`` and ``ah`` alias ``ax`` (though they don't |
+alias each other), and all three alias ``eax`` and ``rax``. And in ARM, |
+registers ``s0`` and ``s1`` (which are single-precision floating-point |
+registers) alias ``d0`` (double-precision floating-point), and registers ``d0`` |
+and ``d1`` alias ``q0`` (128-bit vector register). The linear-scan paper |
+doesn't address this issue; it assumes that all registers are independent. A |
+simple solution is to essentially avoid aliasing by disallowing a subset of the |
+registers, but there is obviously a reduction in code quality when e.g. half of |
+the registers are taken away. |
+ |
+Subzero handles this more elegantly. For each register, we keep a bitmask |
+indicating which registers alias/conflict with it. For example, in x86, |
+``ah``'s alias set is ``ah``, ``ax``, ``eax``, and ``rax`` (but not ``al``), and |
+in ARM, ``s1``'s alias set is ``s1``, ``d0``, and ``q0``. Whenever we want to |
+mark a register as being used or released, we do the same for all of its |
+aliases. |
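+
+A sketch of this bookkeeping, assuming at most 64 registers so that a
+``uint64_t`` can hold an alias mask (the actual representation may differ)::
+
+  #include <cstdint>
+  #include <utility>
+  #include <vector>
+
+  struct AliasedRegUsage {
+    // AliasMask[R] has bit A set if register A conflicts with register R
+    // (including R itself); the masks are symmetric.
+    std::vector<std::uint64_t> AliasMask;
+    std::vector<int> NumUses; // use counts, as in the overlap scheme above
+
+    explicit AliasedRegUsage(std::vector<std::uint64_t> Masks)
+        : AliasMask(std::move(Masks)), NumUses(AliasMask.size(), 0) {}
+
+    void acquire(int Reg) { forEachAlias(Reg, +1); }
+    void release(int Reg) { forEachAlias(Reg, -1); }
+    bool isFree(int Reg) const { return NumUses[Reg] == 0; }
+
+  private:
+    void forEachAlias(int Reg, int Delta) {
+      std::uint64_t Mask = AliasMask[Reg];
+      for (int A = 0; Mask != 0; ++A, Mask >>= 1)
+        if (Mask & 1)
+          NumUses[A] += Delta;
+    }
+  };
+
+For example, ``ah``'s mask would have the bits for ``ah``, ``ax``, ``eax``, and
+``rax`` set, so acquiring ``ah`` makes ``isFree`` report ``eax`` as unavailable.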
+ |
+Before implementing this, we were a bit apprehensive about the compile-time |
+performance impact. Instead of setting one bit in a bit vector or decrementing |
+one counter, this generally needs to be done in a loop that iterates over all |
+aliases. Fortunately, this seemed to make very little difference in |
+performance, as the bulk of the cost ends up being in live range overlap |
+computations, which are not affected by register aliasing. |
+ |
+Repeat until convergence |
+------------------------ |
+ |
+Sometimes the linear-scan algorithm makes a register assignment only to later |
+revoke it in favor of a higher-priority variable, but it turns out that a |
+different initial register choice would not have been revoked. For relatively |
+low compile-time cost, we can give those variables another chance. |
+ |
+During register allocation, we keep track of the revoked variables and then do |
+another round of register allocation targeted only to that set. We repeat until |
+no new register assignments are made, which is usually just a handful of |
+successively cheaper iterations. |
+ |
+Another approach would be to repeat register allocation for *all* variables that |
+haven't had a register assigned (rather than variables that got a register that |
+was subsequently revoked), but our experience is that this greatly increases |
+compile-time cost, with little or no code quality gain. |
+ |
+Delayed Phi lowering |
+-------------------- |
+ |
+The linear-scan algorithm works for phi instructions as well as regular |
+instructions, but it is tempting to lower phi instructions into non-SSA |
+assignments before register allocation, so that register allocation need only |
+happen once. |
+ |
+Unfortunately, simple phi lowering imposes an arbitrary ordering on the |
+resulting assignments that can cause artificial overlap/interference between |
+lowered assignments, and can lead to worse register allocation decisions. As a |
+simple example, consider these two phi instructions which are semantically |
+unordered:: |
+ |
+ A = phi(B) // B's live range ends here |
+ C = phi(D) // D's live range ends here |
+ |
+A straightforward lowering might yield:: |
+ |
+ A = B // end of B's live range |
+ C = D // end of D's live range |
+ |
+The potential problem here is that A and D's live ranges overlap, and so they |
+are prevented from having the same register. Swapping the order of lowered |
+assignments fixes that (but then B and C would overlap), but we can't really |
+know which is better until after register allocation. |
+ |
+Subzero deals with this by doing the main register allocation before phi |
+lowering, followed by phi lowering, and finally a special register allocation |
+pass limited to the new lowered assignments. |
+ |
+Phi lowering considers the phi operands separately for each predecessor edge, |
+and starts by finding a topological sort of the Phi instructions, such that |
+assignments can be executed in that order without violating dependencies on |
+registers or stack locations. If a topological sort is not possible due to a |
+cycle, the cycle is broken by introducing a temporary, e.g. ``A=B;B=C;C=A`` can |
+become ``T=A;A=B;B=C;C=T``. The topological order is tuned to favor freeing up |
+registers early to reduce register pressure. |
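+
+The following sketch sequentializes one edge's parallel copies in that spirit:
+copies whose destination is no longer needed are emitted first, and a
+remaining cycle is broken with a temporary. It is illustrative only (it works
+on names, and omits the register-pressure tuning mentioned above)::
+
+  #include <cstddef>
+  #include <string>
+  #include <utility>
+  #include <vector>
+
+  // A parallel copy Dest <- Src; destinations are assumed to be distinct.
+  using Copy = std::pair<std::string, std::string>;
+
+  std::vector<Copy> sequentialize(std::vector<Copy> Pending) {
+    std::vector<Copy> Out;
+    int TempCount = 0;
+    while (!Pending.empty()) {
+      bool Progress = false;
+      // Emit any copy whose destination is not read by a remaining copy.
+      for (std::size_t I = 0; I < Pending.size(); ++I) {
+        bool DestIsRead = false;
+        for (std::size_t J = 0; J < Pending.size(); ++J)
+          if (J != I && Pending[J].second == Pending[I].first)
+            DestIsRead = true;
+        if (!DestIsRead) {
+          Out.push_back(Pending[I]);
+          Pending.erase(Pending.begin() + I);
+          Progress = true;
+          break;
+        }
+      }
+      if (!Progress) {
+        // Only cycles remain: save one still-needed value in a temporary and
+        // redirect the copies that read it, after which the cycle unravels.
+        std::string Temp = "T" + std::to_string(TempCount++);
+        std::string Saved = Pending.front().second;
+        Out.push_back({Temp, Saved});
+        for (Copy &C : Pending)
+          if (C.second == Saved)
+            C.second = Temp;
+      }
+    }
+    return Out;
+  }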
+ |
+It then lowers the linearized assignments into machine instructions (which may |
+require extra physical registers e.g. to copy from one stack location to |
+another), and finally runs the register allocator limited to those instructions. |
+ |
+In rare cases, the register allocator may fail on some *must-have-register* |
+variable, because register pressure is too high to satisfy requirements arising |
+from cycle-breaking temporaries and registers required for stack-to-stack |
+copies. In this case, we have to find a register with no active uses within the |
+variable's live range, and actively spill/restore that register around the live |
+range. This makes the code quality suffer and may be slow to implement |
+depending on compiler data structures, but in practice we find the situation to |
+be vanishingly rare and so not really worth optimizing. |
+ |
+Local live range splitting |
+-------------------------- |
+ |
+The basic linear-scan algorithm has an "all-or-nothing" policy: a variable gets |
+a register for its entire live range, or not at all. This is unfortunate when a |
+variable has many uses close together, but ultimately a large enough live range |
+to prevent register assignment. Another bad example is on x86 where all vector |
+and floating-point registers (the ``xmm`` registers) are killed by call |
+instructions, per the x86 call ABI, so such variables are completely prevented |
+from having a register when their live ranges contain a call instruction. |
+ |
+The general solution here is some policy for splitting live ranges. A variable |
+can be split into multiple copies and each can be register-allocated separately. |
+The complication comes in finding a sane policy for where and when to split |
+variables such that complexity doesn't explode, and how to join the different |
+values at merge points. |
+ |
+Subzero implements aggressive block-local splitting of variables. We make a |
+pass over the instructions in a block. Every time a variable ``V`` is |
+encountered in an instruction, we add an assignment ``V'=V``, adjacent to the |
+instruction, and replace all subsequent uses of ``V`` in the block with ``V'``. |
+ |
+This leads to a chain of copies of ``V`` through the block, linked by assignment |
+instructions. The live ranges of these copies are usually much smaller, and |
+more likely to get registers. In fact, because of the preference mechanism |
+described above, they are likely to get the same register whenever possible. |
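+
+As an illustrative sketch on a toy, name-based IR (not Subzero's actual data
+structures), the splitting pass looks roughly like this::
+
+  #include <string>
+  #include <utility>
+  #include <vector>
+
+  struct Inst {
+    std::string Dest; // empty when the instruction has no dest
+    std::vector<std::string> Srcs;
+  };
+
+  std::vector<Inst> splitLocalRanges(const std::vector<Inst> &Block) {
+    std::vector<Inst> Out;
+    // Maps an original variable name to its most recent block-local copy.
+    std::vector<std::pair<std::string, std::string>> Latest;
+    auto lookup = [&](const std::string &V) {
+      for (const auto &P : Latest)
+        if (P.first == V)
+          return P.second;
+      return V;
+    };
+    auto record = [&](const std::string &V, const std::string &Copy) {
+      for (auto &P : Latest)
+        if (P.first == V) {
+          P.second = Copy;
+          return;
+        }
+      Latest.push_back({V, Copy});
+    };
+    int Version = 0;
+    for (Inst I : Block) {
+      std::vector<std::string> Origs = I.Srcs;
+      // Uses refer to the most recent copy of each variable.
+      for (std::string &S : I.Srcs)
+        S = lookup(S);
+      Out.push_back(I);
+      // After the instruction, split each used variable: emit Copy = V (its
+      // latest copy) and make later uses in this block refer to Copy.
+      for (const std::string &Orig : Origs) {
+        std::string Copy = Orig + "_s" + std::to_string(Version++);
+        Out.push_back({Copy, {lookup(Orig)}});
+        record(Orig, Copy);
+      }
+      // A new definition of a variable supersedes its earlier copies.
+      if (!I.Dest.empty())
+        record(I.Dest, I.Dest);
+    }
+    return Out;
+  }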
+ |
+One obvious question comes up: won't this proliferation of new variables cause |
+an explosion in the running time of liveness analysis and register allocation? |
+As it turns out, not really. |
+ |
+First, for register allocation, the cost tends to be dominated by live range |
+overlap computations, whose cost is roughly proportional to the size of the live |
+range. All the new variable copies' live ranges sum up to the original |
+variable's live range, so the cost isn't vastly greater. |
+ |
+Second, for liveness analysis, the cost is dominated by merging bit vectors |
+corresponding to the set of variables that have multi-block liveness. All the |
+new copies are guaranteed to be single-block, so the main additional cost is |
+that of processing the new assignments. |
+ |
+There's one other key issue here. The original variable and all of its copies |
+need to be "linked", in the sense that all of these variables that get a stack |
+slot (because they didn't get a register) are guaranteed to have the same stack |
+slot. This way, we can avoid generating any code related to ``V'=V`` when |
+neither gets a register. In addition, we can elide instructions that write a |
+value to a stack slot that is linked to another stack slot, because that is |
+guaranteed to be just rewriting the same value to the stack. |
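+
+A sketch of the linkage and the resulting copy elision, with illustrative
+structures::
+
+  struct Var {
+    int RegNum = -1;         // -1: no register; the variable lives on the stack
+    Var *LinkedTo = nullptr; // variable whose stack slot this one shares
+    bool hasReg() const { return RegNum >= 0; }
+    Var *root() { return LinkedTo ? LinkedTo->root() : this; }
+  };
+
+  // A lowered assignment Dest = Src can be dropped entirely when neither side
+  // got a register and both are linked to the same stack slot.
+  bool canElideCopy(Var &Dest, Var &Src) {
+    return !Dest.hasReg() && !Src.hasReg() && Dest.root() == Src.root();
+  }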