Chromium Code Reviews
| Index: DESIGN.rst |
| diff --git a/DESIGN.rst b/DESIGN.rst |
| new file mode 100644 |
| index 0000000000000000000000000000000000000000..a43384d8022150679ecae383601d6155a75fbc8a |
| --- /dev/null |
| +++ b/DESIGN.rst |
| @@ -0,0 +1,1280 @@ |
| +Design of the Subzero fast code generator |
| +========================================= |
| + |
| +Introduction |
| +------------ |
| + |
| +The `Portable Native Client (PNaCl) <http://gonacl.com>`_ project includes |
| +compiler technology based on `LLVM <http://llvm.org/>`_. The web program |
| +developer uses the PNaCl toolchain to compile their application to |
| +architecture-neutral PNaCl bitcode (a pexe file), using as much |
| +architecture-neutral optimization as possible. The pexe file is downloaded to |
| +the user's browser where the PNaCl translator (a component of Chrome) compiles |
| +the pexe file to sandboxed native code. The translator uses |
| +architecture-specific optimizations as much as practical to generate good native |
| +code. |
| + |
| +The native code can be cached by the browser to avoid repeating translation on |
| +future page loads. However, first-time user experience can be hampered by long |
| +translation times. The LLVM-based PNaCl translator is pretty slow, even when |
| +using ``-O0`` to minimize optimizations, so delays are especially noticeable on |
| +slow browser platforms such as ARM-based Chromebooks. |
| + |
| +Translator slowness can be mitigated or hidden in a number of ways. |
| + |
| +- Parallel translation. However, slow machines where this matters most, |
| + e.g. ARM chromebooks, are likely to have fewer cores to parallelize across. |
| + |
| +- Streaming translation, i.e. start translating as soon as the download starts. |
| + This doesn't help much when translation speed is 10x slower than download |
| + speed, or when the pexe file is already cached but the translated binary was |
| + flushed from the cache. |
| + |
| +- Arrange the web page such that translation is done in parallel with |
| + downloading large assets. Arrange the web page to distract the user with cat |
| + videos while translation is in progress. Or, improve translator performance |
| + to something more reasonable. |
| + |
| +Goals |
| +===== |
| + |
| +Translation speed |
| +----------------- |
| + |
| +We'd like to be able to translate a pexe as fast as download speed. Any faster |
| +is in a sense wasted effort. Download speed varies greatly, but we'll |
| +arbitrarily say 1 MB/sec. We'll pick an ARM A15 platform as the example of a |
| +slow machine. We observe a 3x single-thread performance difference between A15 |
| +and a z620 workstation, and aggressively assume a pexe file could be compressed |
| +to 50% on the web server before network transport, so we set the translation |
| +speed goal to 6 MB/sec on a z620 workstation. |
| + |
| +Currently, at the ``-O0`` level, the LLVM-based PNaCl translator translates at |
| +1/10 the target rate. The ``-O2`` mode takes 3x as long as the ``-O0`` mode. |
| + |
| +In other words, Subzero's goal is to improve over LLVM's translation speed by |
| +10x. |
| + |
| +Code quality |
| +------------ |
| + |
| +Subzero's initial goal is to produce code that meets or exceeds LLVM's ``-O0`` |
| +code quality. The stretch goal is to approach LLVM ``-O2`` code quality. On |
| +average, LLVM ``-O2`` performs twice as well as LLVM ``-O0``. |
| + |
| +Translator size |
| +--------------- |
| + |
| +The current LLVM-based translator binary (pnacl-llc) is about 10MB in size. We |
| +think 1MB is a more reasonable size, thus we target a 10x reduction in binary |
| +size. |
| + |
| +For development, Subzero can be built for all target architectures, and all |
| +debugging and diagnostic options enabled. For a smaller translator, we restrict |
| +to a single target architecture, and define a MINIMAL build where unnecessary |
| +features are compiled out. |
| + |
| +Memory footprint |
| +---------------- |
| + |
| +The current LLVM-based translator suffers from an issue in which some |
| +function-specific data has to be retained in memory until all translation |
| +completes, and therefore the memory footprint grows without bound. Large pexes |
| +can lead to the translator process holding hundreds of MB of memory by the end. |
| + |
| +Subzero should maintain a stable memory footprint throughout translation. A |
| +specific limit is impractical to set, since there is no practical bound on a |
| +single function's size, but the footprint should be "reasonable", without |
| +leaks or inexorable growth. (We use ASAN builds to |
| +test for leaks.) |
| + |
| +Multithreaded translation |
| +------------------------- |
| + |
| +It should be practical to translate different functions concurrently and see |
| +good scalability. Some locking may be needed, such as accessing output buffers |
| +or constant pools, but that should be fairly minimal. In contrast, LLVM was |
| +only designed for module-level parallelism, and as such, the PNaCl translator |
| +internally splits a pexe file into several modules for concurrent translation. |
| +All output needs to be deterministic regardless of the level of multithreading, |
| +i.e. functions and data should always be output in the same order. |
| + |
| +Target architectures |
| +-------------------- |
| + |
| +Initial target architectures are x86-32, x86-64, ARM32, and MIPS32. Future |
| +targets include ARM64 and MIPS64, though these targets lack NaCl support |
| +including a sandbox model or a validator. |
| + |
| +The first implementation is for x86-32, because it was expected to be |
| +particularly challenging, and thus more likely to draw out any design problems |
| +early: |
| + |
| +- There are a number of special cases, asymmetries, and warts in the x86 |
| + instruction set. |
| + |
| +- Complex addressing modes may be leveraged for better code quality. |
| + |
| +- 64-bit integer operations have to be lowered into longer sequences of 32-bit |
| + operations. |
| + |
| +- Paucity of physical registers may reveal code quality issues early in the |
| + design. |
| + |
| +Detailed design |
| +=============== |
| + |
| +Intermediate representation - ICE |
| +--------------------------------- |
| + |
| +Subzero's IR is called ICE. It is designed to be reasonably similar to LLVM's |
| +IR, which is reflected in the pexe bitcode structure. |
| + |
| +A function is represented by the ``Ice::Cfg`` class (CFG = control flow graph). |
| +Its key contents include: |
| + |
| +- A list of ``CfgNode`` pointers, generally held in topological order. |
| + |
| +- A list of ``Variable`` pointers corresponding to local variables used in the |
| + function plus compiler-generated temporaries. |
| + |
| +A basic block is represented by the ``Ice::CfgNode`` class. Its key contents |
| +include: |
| + |
| +- A linear list of instructions, in the same style as LLVM. The last |
| + instruction of the list is always a terminator instruction: branch, switch, |
| + return, unreachable. |
| + |
| +- A list of Phi instructions, also in the same style as LLVM. They are held as |
| + a linear list for convenience, though per Phi semantics, they are executed "in |
| + parallel" without dependencies on each other. |
| + |
| +- An unordered list of ``CfgNode`` pointers corresponding to incoming edges, and |
| + another list for outgoing edges. |
| + |
| +- The node's unique, 0-based index into the Cfg's node list. |
| + |
| +An instruction is represented by the ``Ice::Inst`` class. Its key contents |
| +include: |
| + |
| +- A list of source operands. |
| + |
| +- Its destination variable, if the instruction produces a result in an |
| + ``Ice::Variable``. |
| + |
| +- A bitvector indicating which variables' live ranges this instruction ends. |
| + This is computed during liveness analysis. |
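
The container relationships above can be sketched as a minimal C++ skeleton.
All field and method shapes here are illustrative simplifications, not
Subzero's actual ``Ice::`` interfaces:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative skeleton of the IR containers described above; the real
// Ice::Cfg / Ice::CfgNode / Ice::Inst classes differ in detail.
struct Variable {
  std::size_t Index; // unique 0-based index into the Cfg's variable list
};

struct Inst {
  std::vector<Variable *> Srcs;      // source operands (Variables only here)
  Variable *Dest = nullptr;          // destination, if the instruction has one
  std::vector<bool> LiveRangesEnded; // which variables' live ranges end here
};

struct CfgNode {
  std::size_t Index;        // unique 0-based index into the Cfg's node list
  std::vector<Inst *> Phis; // phi instructions, executed "in parallel"
  std::vector<Inst *> Insts; // linear list; the last is a terminator
  std::vector<CfgNode *> InEdges, OutEdges; // incoming and outgoing edges
};

struct Cfg {
  std::vector<CfgNode *> Nodes;      // generally held in topological order
  std::vector<Variable *> Variables; // locals plus compiler temporaries
};
```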
| + |
| +Instruction kinds are divided into high-level ICE instructions and low-level |
| +ICE instructions. High-level instructions consist of the PNaCl/LLVM bitcode |
| +instruction kinds. Each target architecture implementation extends the |
| +instruction space with its own set of low-level instructions. Generally, |
| +low-level instructions correspond to individual machine instructions. The |
| +high-level ICE instruction space includes a few additional instruction kinds |
| +that are not part of LLVM but are generally useful (e.g., an Assignment |
| +instruction), or are useful across targets (e.g., BundleLock and BundleUnlock |
| +instructions for sandboxing). |
| + |
| +Specifically, high-level ICE instructions that derive from LLVM are the |
| +following: |
| + |
| +- Alloca: allocate data on the stack |
| + |
| +- Arithmetic: binary operations of the form A = B op C |
| + |
| +- Br: conditional or unconditional branch |
| + |
| +- Call: function call |
| + |
| +- Cast: unary type-conversion operations |
| + |
| +- ExtractElement: extract a scalar element from a vector-type value |
| + |
| +- Fcmp: floating-point comparison |
| + |
| +- Icmp: integer comparison |
| + |
| +- IntrinsicCall: call a known intrinsic |
| + |
| +- InsertElement: insert a scalar element into a vector-type value |
| + |
| +- Load: load a value from memory |
| + |
| +- Phi: implement the SSA phi node |
| + |
| +- Ret: return from the function |
| + |
| +- Select: essentially the C operation of the form X = C ? Y : Z |
| + |
| +- Store: store a value into memory |
| + |
| +- Switch: generalized branch to multiple possible locations |
| + |
| +- Unreachable: indicate that this portion of the code is unreachable |
| + |
| +The additional high-level ICE instructions are the following: |
| + |
| +- Assign: a simple ``A=B`` assignment. This is useful for e.g. lowering Phi |
| + instructions to non-SSA assignments, before lowering to machine code. |
| + |
| +- BundleLock, BundleUnlock. These are markers used for sandboxing, but are |
| + common across all targets and so they are elevated to the high-level |
| + instruction set. |
| + |
| +- FakeDef, FakeUse, FakeKill. These are tools used to preserve consistency in |
| + liveness analysis, elevated to the high-level because they are used by all |
| + targets. They are described in more detail at the end of this section. |
| + |
| +- JumpTable: this represents the result of switch optimization analysis, where |
| + some switch instructions may use jump tables instead of cascading |
| + compare/branches. |
| + |
| +An operand is represented by the ``Ice::Operand`` class. In high-level ICE, an |
| +operand is either an ``Ice::Constant`` or an ``Ice::Variable``. Constants |
| +include integer constants, floating-point constants, Undef (an unspecified |
| +constant of a particular type), and symbol constants (essentially addresses of |
| +globals). A variable represents a value allocated on the stack (though not |
| +including alloca-derived storage). Among other things, a variable holds its |
| +unique, 0-based index into the Cfg's variable list. |
| + |
| +Each target can extend the ``Constant`` and ``Variable`` classes for its own |
| +needs. In addition, the ``Operand`` class may be extended, e.g. to define an |
| +x86 ``MemOperand`` that encodes a base register, an index register, an index |
| +register shift amount, and a constant offset. |
| + |
| +Register allocation and liveness analysis are restricted to Variable operands. |
| +Because of the importance of register allocation to code quality, and the |
| +translation-time cost of liveness analysis, Variable operands get some special |
| +treatment in ICE. Most notably, a frequent pattern in Subzero is to iterate |
| +across all the Variables of an instruction. An instruction holds a list of |
| +operands, but an operand may contain 0, 1, or more Variables. As such, the |
| +``Operand`` class specially holds a list of Variables contained within, for |
| +quick access. |
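
The "list of contained Variables" idea can be sketched as follows. The real
``Ice::Operand`` exposes similar accessors, but the class and constructor
shapes here (including the ``MemOperand`` layout from the x86 example) are
assumptions for illustration only:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Variable;

// Every Operand exposes the Variables it contains, so passes can iterate
// over an instruction's Variables without knowing each operand's kind.
struct Operand {
  std::vector<Variable *> Vars; // contained Variables, for quick access
  std::size_t getNumVars() const { return Vars.size(); }
  Variable *getVar(std::size_t I) const { return Vars[I]; }
};

struct Variable : Operand {
  Variable() { Vars.push_back(this); } // a Variable contains itself
};

struct Constant : Operand {}; // constants contain no Variables

// x86-style memory operand: base + index<<shift + offset.
struct MemOperand : Operand {
  MemOperand(Variable *Base, Variable *Index, int Shift, int Offset) {
    if (Base)
      Vars.push_back(Base); // up to two contained Variables
    if (Index)
      Vars.push_back(Index);
    (void)Shift;
    (void)Offset;
  }
};
```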
| + |
| +A Subzero transformation pass may work by deleting an existing instruction and |
| +replacing it with zero or more new instructions. Instead of actually deleting |
| +the existing instruction, we generally mark it as deleted and insert the new |
| +instructions right after the deleted instruction. When printing the IR for |
| +debugging, this is a big help because it makes it much more clear how the |
| +non-deleted instructions came about. |
| + |
| +Subzero has a few special instructions to help with liveness analysis |
| +consistency. |
| + |
| +- The FakeDef instruction gives a fake definition of some variable. For |
| + example, on x86-32, a divide instruction defines both ``eax`` and ``edx`` but |
| + an ICE instruction can represent only one destination variable. This is |
| + similar for multiply instructions, and for function calls that return a 64-bit |
| + integer result in the edx:eax pair. Also, using the ``xor eax, eax`` trick to |
| + set ``eax`` to 0 requires an initial FakeDef of ``eax``. |
| + |
| +- The FakeUse instruction registers a use of a variable, typically to prevent an |
| + earlier assignment to that variable from being dead-code eliminated. For |
| + example, lowering an operation like ``x=cc?y:z`` may be done using x86's |
| + conditional move (cmov) instruction: ``mov x, z; cmov_cc x, y``. Without a |
| + FakeUse of ``x`` between the two instructions, the liveness analysis pass may |
| + dead-code eliminate the first instruction. |
| + |
| +- The FakeKill instruction is added after a call instruction, and is a quick way |
| + of indicating that caller-save registers are invalidated. |
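
The cmov and divide cases above can be sketched in ICE-style pseudocode
(register names and the FakeUse/FakeDef placement follow the text; the exact
ICE syntax differs):

```
// x = cc ? y : z, lowered with a conditional move
mov x, z           // initial assignment to x
FakeUse(x)         // keeps the mov from being dead-code eliminated
cmov_cc x, y       // conditionally overwrites x

// 32-bit divide: the machine instruction defines both eax and edx,
// but an ICE instruction can name only one destination
div t              // ICE destination: eax (quotient)
FakeDef(edx)       // make the definition of edx (remainder) explicit
```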
| + |
| +Overview of lowering |
| +-------------------- |
| + |
| +In general, translation goes like this: |
| + |
| +- Parse the next function from the pexe file and construct a Cfg consisting of |
| + high-level ICE. |
| + |
| +- Do analysis passes and transformation passes on the high-level ICE, as |
| + desired. |
| + |
| +- Lower each high-level ICE instruction into a sequence of zero or more |
| + low-level ICE instructions. Each high-level instruction is generally lowered |
| + independently, though the target lowering is allowed to look ahead in the |
| + CfgNode's instruction list if desired. |
| + |
| +- Do more analysis and transformation passes on the low-level ICE, as desired. |
| + |
| +- Assemble the low-level Cfg into an ELF object file (alternatively, a textual |
| + assembly file that is later assembled by some external tool). |
| + |
| +- Repeat for all functions, and also produce object code for data such as global |
| + initializers and internal constant pools. |
| + |
| +Currently there are two optimization levels: ``O2`` and ``Om1``. For ``O2``, |
| +the intention is to apply all available optimizations to get the best code |
| +quality (though the initial code quality goal is measured against LLVM's ``O0`` |
| +code quality). For ``Om1``, the intention is to apply as few optimizations as |
| +possible and produce code as quickly as possible, accepting poor code quality. |
| +``Om1`` is short for "O-minus-one", i.e. "worse than O0", or in other words, |
| +"sub-zero". |
| + |
| +Debuggability of generated code was never considered in the design, but so far, |
| +Subzero doesn't really do transformations that would obfuscate debugging. The |
| +main thing might be that register allocation (including stack slot coalescing |
| +for stack-allocated variables whose live ranges don't overlap) may render a |
| +variable's value unobtainable after its live range ends. This would not be an |
| +issue for ``Om1`` since it doesn't register-allocate program-level variables, |
| +nor does it coalesce stack slots. |
| + |
| +Our experience so far is that ``Om1`` translates twice as fast as ``O2``, but |
| +produces code with one third the code quality. ``Om1`` is good for testing and |
| +debugging -- during translation, it tends to expose errors in the basic lowering |
| +that might otherwise have been hidden by the register allocator or other |
| +optimization passes. It also helps determine whether a code correctness problem |
| +is a fundamental problem in the basic lowering, or an error in another |
| +optimization pass. |
| + |
| +The implementation of target lowering also controls the recipe of passes used |
| +for ``Om1`` and ``O2`` translation. For example, address mode inference may |
| +only be relevant for x86. |
| + |
| +Pexe parsing |
| +------------ |
| + |
| +Subzero includes an integrated PNaCl pexe bitcode parser. It parses the pexe |
| +file function by function, ultimately constructing an ICE Cfg for each function. |
| +After a function is parsed, its Cfg is handed off to the translation phase. The |
| +bitcode parser also parses global initializer data and hands it off to be |
| +translated to data sections in the object file. |
| + |
| +Subzero has another parsing strategy for testing/debugging. LLVM libraries can |
| +be used to parse a module into LLVM IR (though very slowly relative to Subzero |
| +native parsing). Then we iterate across the LLVM IR and construct high-level |
| +ICE, handing off each Cfg to the translation phase. |
| + |
| +Lowering strategy |
| +----------------- |
| + |
| +The core of Subzero's lowering from high-level ICE to low-level ICE is to lower |
| +each high-level instruction down to a sequence of low-level target-specific |
| +instructions, in a largely context-free setting. That is, each high-level |
| +instruction conceptually has a simple template expansion into low-level |
| +instructions, and lowering can in theory be done in any order. This may sound |
| +like a small effort, but quite a large number of templates may be needed because |
| +of the number of PNaCl types and instruction variants. Furthermore, there may |
| +be optimized templates, e.g. to take advantage of operator commutativity. |
| + |
| +The key idea for a lowering template is to produce valid low-level instructions |
| +that are guaranteed to meet address mode and other structural requirements of |
| +the instruction set. For example, on x86, the source operand of an integer |
| +store instruction must be an immediate or a physical register; a shift |
| +instruction's shift amount must be an immediate or in register ``cl``; a |
| +function's integer return value is in ``eax``; most x86 instructions are |
| +two-operand, in contrast to corresponding three-operand high-level instructions; |
| +etc. |
| + |
| +Because target lowering runs before register allocation, there is no way to know |
| +whether a given ``Ice::Variable`` operand lives on the stack or in a physical |
| +register. When the low-level instruction calls for a physical register operand, |
| +the target lowering can create an infinite-weight Variable. This tells the |
| +register allocator to assign infinite weight when making decisions, effectively |
| +guaranteeing some physical register. Variables can also be pre-colored to a |
| +specific physical register (``cl`` in the shift example above), which also gives |
| +infinite weight. |
| + |
| +To illustrate, consider a high-level arithmetic instruction on 32-bit integer |
| +operands:: |
| + |
| + A = B + C |
| + |
| +X86 target lowering might produce the following:: |
| + |
| + T.inf = B // mov instruction |
| + T.inf += C // add instruction |
| + A = T.inf // mov instruction |
| + |
| +Here, ``T.inf`` is an infinite-weight temporary. As long as ``T.inf`` has a |
| +physical register, the three lowered instructions are all encodable regardless |
| +of whether ``B`` and ``C`` are physical registers, memory, or immediates, and |
| +whether ``A`` is a physical register or in memory. |
| + |
| +In this example, ``A`` must be a Variable and one may be tempted to simplify the |
| +lowering sequence by setting ``A`` as infinite-weight and using:: |
| + |
| + A = B // mov instruction |
| + A += C // add instruction |
| + |
| +This has two problems. First, if the original instruction was actually ``A = |
| +B + A``, the result would be incorrect. Second, assigning ``A`` a physical |
| +register applies throughout ``A``'s entire live range. This is probably not |
| +what is intended, and may ultimately lead to a failure to allocate a register |
| +for an infinite-weight variable. |
| + |
| +This style of lowering leads to many temporaries being generated, so in ``O2`` |
| +mode, we rely on the register allocator to clean things up. For example, in the |
| +example above, if ``B`` ends up getting a physical register and its live range |
| +ends at this instruction, the register allocator is likely to reuse that |
| +register for ``T.inf``. This leads to ``T.inf=B`` being a redundant register |
| +copy, which is removed as an emission-time peephole optimization. |
| + |
| +O2 lowering |
| +----------- |
| + |
| +Currently, the ``O2`` lowering recipe is the following: |
| + |
| +- Address mode inference |
| + |
| +- Read-modify-write (RMW) transformation |
| + |
| +- Basic liveness analysis |
| + |
| +- Load optimization |
| + |
| +- Target lowering |
| + |
| +- Full liveness analysis |
| + |
| +- Register allocation |
| + |
| +- Phi instruction lowering (advanced) |
| + |
| +- Post-phi lowering register allocation |
| + |
| +- Branch optimization |
| + |
| +These passes are described in more detail below. |
| + |
| +Om1 lowering |
| +------------ |
| + |
| +Currently, the ``Om1`` lowering recipe is the following: |
| + |
| +- Phi instruction lowering (simple) |
| + |
| +- Target lowering |
| + |
| +- Register allocation (infinite-weight and pre-colored only) |
| + |
| +Optimization passes |
| +------------------- |
| + |
| +Liveness analysis |
| +^^^^^^^^^^^^^^^^^ |
| + |
| +Liveness analysis is a standard dataflow optimization, implemented as follows. |
| +For each node (basic block), its live-out set is computed as the union of the |
| +live-in sets of its successor nodes. Then the node's instructions are processed |
| +in reverse order, updating the live set, until the beginning of the node is |
| +reached, and the node's live-in set is recorded. If this iteration has changed |
| +the node's live-in set, the node's predecessors are marked for reprocessing. |
| +This continues until no more nodes need reprocessing. If nodes are processed in |
| +reverse topological order, the number of iterations over the Cfg is generally |
| +equal to the maximum loop nest depth. |
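
The worklist iteration above can be sketched in C++. This sketch summarizes
each node with Use/Def bit vectors (packed into a ``uint64_t`` over a small
variable universe) in place of the per-instruction reverse walk; the ``Node``
fields are hypothetical simplifications of ``CfgNode``:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Node {
  std::vector<int> Preds, Succs; // edges, as indices into the node list
  uint64_t Use = 0; // variables read before any write in this node
  uint64_t Def = 0; // variables written in this node
  uint64_t LiveIn = 0, LiveOut = 0;
};

void computeLiveness(std::vector<Node> &Nodes) {
  std::vector<bool> InWork(Nodes.size(), true);
  std::vector<int> Work;
  // Seed so that nodes pop in reverse (topological) order, exits first.
  for (std::size_t I = 0; I < Nodes.size(); ++I)
    Work.push_back((int)I);
  while (!Work.empty()) {
    int N = Work.back();
    Work.pop_back();
    InWork[N] = false;
    Node &Nd = Nodes[N];
    // Live-out = union of the successors' live-in sets.
    uint64_t Out = 0;
    for (int S : Nd.Succs)
      Out |= Nodes[S].LiveIn;
    Nd.LiveOut = Out;
    // Live-in = Use U (LiveOut - Def), the node-level summary of the
    // reverse instruction walk.
    uint64_t In = Nd.Use | (Out & ~Nd.Def);
    if (In != Nd.LiveIn) {
      Nd.LiveIn = In;
      // The live-in set changed: predecessors need reprocessing.
      for (int P : Nd.Preds)
        if (!InWork[P]) {
          InWork[P] = true;
          Work.push_back(P);
        }
    }
  }
}
```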
| + |
| +To implement this, each node records its live-in and live-out sets, initialized |
| +to the empty set. Each instruction records which of its Variables' live ranges |
| +end in that instruction, initialized to the empty set. A side effect of |
| +liveness analysis is dead instruction elimination. Each instruction can be |
| +marked as tentatively dead, and after the algorithm converges, the tentatively |
| +dead instructions are permanently deleted. |
| + |
| +Optionally, after this liveness analysis completes, we can do live range |
| +construction, in which we calculate the live range of each variable in terms of |
| +instruction numbers. A live range is represented as a union of segments, where |
| +the segment endpoints are instruction numbers. Instruction numbers are required |
| +to be unique across the Cfg, and monotonically increasing within a basic block. |
| +As a union of segments, live ranges can contain "gaps" and are therefore |
| +precise. Because of SSA properties, a variable's live range can start at most |
| +once in a basic block, and can end at most once in a basic block. Liveness |
| +analysis keeps track of which variable/instruction tuples begin live ranges and |
| +end live ranges, and combined with live-in and live-out sets, we can efficiently |
| +build up live ranges of all variables across all basic blocks. |
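
A live range as a union of segments might be sketched like this, assuming
segments arrive in increasing instruction-number order (names and method
shapes are illustrative, not Subzero's actual live range class):

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// A live range is a sorted union of [Start, End) segments keyed by
// instruction number; gaps between segments make the range precise.
struct LiveRange {
  std::vector<std::pair<int, int>> Segments;

  void addSegment(int Start, int End) {
    // Grow the last segment if contiguous, otherwise record a gap.
    if (!Segments.empty() && Segments.back().second == Start)
      Segments.back().second = End;
    else
      Segments.push_back({Start, End});
  }

  bool containsInst(int Num) const {
    for (const auto &S : Segments)
      if (S.first <= Num && Num < S.second)
        return true;
    return false;
  }

  bool overlaps(const LiveRange &Other) const {
    // Two sorted segment lists can be compared with a linear sweep.
    std::size_t I = 0, J = 0;
    while (I < Segments.size() && J < Other.Segments.size()) {
      if (Segments[I].second <= Other.Segments[J].first)
        ++I;
      else if (Other.Segments[J].second <= Segments[I].first)
        ++J;
      else
        return true;
    }
    return false;
  }
};
```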
| + |
| +A lot of care is taken to try to make liveness analysis fast and efficient. |
| +Because of the lowering strategy, the number of variables is generally |
| +proportional to the number of instructions, leading to an O(N^2) complexity |
| +algorithm if implemented naively. To improve things based on sparsity, we note |
| +that most variables are "local" and referenced in at most one basic block (in |
| +contrast to the "global" variables with multi-block usage), and therefore cannot |
| +be live across basic blocks. Therefore, the live-in and live-out sets, |
| +typically represented as bit vectors, can be limited to the set of global |
| +variables, and the intra-block liveness bit vector can be compacted to hold the |
| +global variables plus the local variables for that block. |
| + |
| +Register allocation |
| +^^^^^^^^^^^^^^^^^^^ |
| + |
| +Subzero implements a simple linear-scan register allocator, based on the |
| +allocator described by Hanspeter Mössenböck and Michael Pfeiffer in `Linear Scan |
| +Register Allocation in the Context of SSA Form and Register Constraints |
| +<ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe02.PDF>`_. This allocator has |
| +several nice features: |
| + |
| +- Live ranges are represented as unions of segments, as described above, rather |
| + than a single start/end tuple. |
| + |
| +- It allows pre-coloring of variables with specific physical registers. |
| + |
| +- It applies equally well to pre-lowered Phi instructions. |
| + |
| +The paper suggests an approach of aggressively coalescing variables across Phi |
| +instructions (i.e., trying to force Phi source and destination variables to have |
| +the same register assignment), but we reject that in favor of the more natural |
| +preference mechanism described below. |
| + |
| +We enhance the algorithm in the paper with the capability of automatic inference |
| +of register preference, and with the capability of allowing overlapping live |
| +ranges to safely share the same register in certain circumstances. If we are |
| +considering register allocation for variable ``A``, and ``A`` has a single |
| +defining instruction ``A=B+C``, then the preferred register for ``A``, if |
| +available, would be the register assigned to ``B`` or ``C``, if any, provided |
| +that ``B`` or ``C``'s live range does not overlap ``A``'s live range. In this |
| +way we infer a good register preference for ``A``. |
| + |
| +We allow overlapping live ranges to get the same register in certain cases. |
| +Suppose a high-level instruction like:: |
| + |
| + A = unary_op(B) |
| + |
| +has been target-lowered like:: |
| + |
| + T.inf = B |
| + A = unary_op(T.inf) |
| + |
| +Further, assume that ``B``'s live range continues beyond this instruction |
| +sequence, and that ``B`` has already been assigned some register. Normally, we |
| +might want to infer ``B``'s register as a good candidate for ``T.inf``, but it |
| +turns out that ``T.inf`` and ``B``'s live ranges overlap, requiring them to have |
| +different registers. But ``T.inf`` is just a read-only copy of ``B`` that is |
| +guaranteed to be in a register, so in theory these overlapping live ranges could |
| +safely have the same register. Our implementation allows this overlap as long |
| +as ``T.inf`` is never modified within ``B``'s live range, and ``B`` is never |
| +modified within ``T.inf``'s live range. |
| + |
| +Subzero's register allocator can be run in 3 configurations. |
| + |
| +- Normal mode. All Variables are considered for register allocation. It |
| + requires full liveness analysis and live range construction as a prerequisite. |
| + This is used by ``O2`` lowering. |
| + |
| +- Minimal mode. Only infinite-weight or pre-colored Variables are considered. |
| + All other Variables are stack-allocated. It does not require liveness |
| + analysis; instead, it quickly scans the instructions and records first |
| + definitions and last uses of all relevant Variables, using that to construct a |
| + single-segment live range. Although this includes most of the Variables, the |
| + live ranges are mostly simple, short, and rarely overlapping, which the |
| + register allocator handles efficiently. This is used by ``Om1`` lowering. |
| + |
| +- Post-phi lowering mode. Advanced phi lowering is done after normal-mode |
| + register allocation, and may result in new infinite-weight Variables that need |
| + registers. One would like to just run something like minimal mode to assign |
| + registers to the new Variables while respecting existing register allocation |
| + decisions. However, it sometimes happens that there are no free registers. |
| + In this case, some register needs to be forcibly spilled to the stack and |
| + temporarily reassigned to the new Variable, and reloaded at the end of the new |
| + Variable's live range. The register must be one that has no explicit |
| + references during the Variable's live range. Since Subzero currently doesn't |
| + track def/use chains (though it does record the CfgNode where a Variable is |
| + defined), we just do a brute-force search across the CfgNode's instruction |
| + list for the instruction numbers of interest. This situation happens very |
| + rarely, so there's little point for now in improving its performance. |
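
The single-segment case handled by minimal mode can be illustrated with a
generic Poletto/Sarkar-style linear scan. This is a sketch of the general
technique, not Subzero's allocator: each range is first-definition to
last-use, and anything that cannot get a register stays on the stack:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct LiveRange {
  int Var, Start, End;
  int Reg = -1; // assigned register, or -1 for a stack slot
};

void linearScan(std::vector<LiveRange> &Ranges, int NumRegs) {
  std::sort(Ranges.begin(), Ranges.end(),
            [](const LiveRange &A, const LiveRange &B) {
              return A.Start < B.Start;
            });
  std::vector<LiveRange *> Active; // ranges currently holding a register
  std::vector<int> Free;
  for (int R = NumRegs - 1; R >= 0; --R)
    Free.push_back(R);
  for (LiveRange &LR : Ranges) {
    // Expire active ranges that ended before this one starts, freeing
    // their registers for reuse.
    for (std::size_t I = 0; I < Active.size();) {
      if (Active[I]->End < LR.Start) {
        Free.push_back(Active[I]->Reg);
        Active.erase(Active.begin() + I);
      } else {
        ++I;
      }
    }
    if (!Free.empty()) {
      LR.Reg = Free.back();
      Free.pop_back();
      Active.push_back(&LR);
    } // else: no register available; leave LR stack-allocated
  }
}
```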
| + |
| +The basic linear-scan algorithm may, as it proceeds, rescind an early register |
| +allocation decision, leaving that Variable to be stack-allocated. Sometimes it |
| +turns out that the Variable could have been given a different register |
| +without conflict, but by this time it's too late. The literature recognizes |
| +this situation and describes "second-chance bin-packing", which Subzero can do. |
| +We can rerun the register allocator in a mode that respects existing register |
| +allocation decisions, and sometimes it finds new non-conflicting opportunities. |
| +In fact, we can repeatedly run the register allocator until convergence. |
| +Unfortunately, in the current implementation, these subsequent register |
| +allocation passes end up being extremely expensive. This is because of the |
| +treatment of the "unhandled pre-colored" Variable set, which is normally very |
| +small but ends up being quite large on subsequent passes. Its performance can |
| +probably be made acceptable with a better choice of data structures, but for now |
| +this second-chance mechanism is disabled. |
| + |
| +Basic phi lowering |
| +^^^^^^^^^^^^^^^^^^ |
| + |
| +The simplest phi lowering strategy works as follows (this is how LLVM ``-O0`` |
| +implements it). Consider this example:: |
| + |
| + L1: |
| + ... |
| + br L3 |
| + L2: |
| + ... |
| + br L3 |
| + L3: |
| + A = phi [B, L1], [C, L2] |
| + X = phi [Y, L1], [Z, L2] |
| + |
| +For each destination of a phi instruction, we can create a temporary and insert |
| +the temporary's assignment at the end of the predecessor block:: |
| + |
| + L1: |
| + ... |
| + A' = B |
| + X' = Y |
| + br L3 |
| + L2: |
| + ... |
| + A' = C |
| + X' = Z |
| + br L3 |
| + L3: |
| + A = A' |
| + X = X' |
| + |
| +This transformation is very simple and reliable. It can be done before target |
| +lowering and register allocation, and it easily avoids the classic lost-copy and |
| +related problems. ``Om1`` lowering uses this strategy. |
| + |
| +However, it has the disadvantage of initializing temporaries even for branches |
| +not taken, though that could be mitigated by splitting non-critical edges and |
| +putting assignments in the edge-split nodes. Another problem is that without |
| +extra machinery, the assignments to ``A``, ``A'``, ``X``, and ``X'`` are given a |
| +specific ordering even though phi semantics are that the assignments are |
| +parallel or unordered. This sometimes imposes false live range overlaps and |
| +leads to poorer register allocation. |
| + |
| +Advanced phi lowering |
| +^^^^^^^^^^^^^^^^^^^^^ |
| + |
| +``O2`` lowering defers phi lowering until after register allocation to avoid the |
| +problem of false live range overlaps. It works as follows. We split each |
| +incoming edge and move the (parallel) phi assignments into the split nodes. We |
| +linearize each set of assignments by finding a safe, topological ordering of the |
| +assignments, respecting register assignments as well. For example:: |
| + |
| + A = B |
| + X = Y |
| + |
| +Normally these assignments could be executed in either order, but if ``B`` and |
| +``X`` are assigned the same physical register, we would want to use the above |
| +ordering. Dependency cycles are broken by introducing a temporary. For |
| +example:: |
| + |
| + A = B |
| + B = A |
| + |
| +Here, a temporary breaks the cycle:: |
| + |
| + t = A |
| + A = B |
| + B = t |
| + |
| +Finally, we use the existing target lowering to lower the assignments in this |
| +basic block, and once that is done for all basic blocks, we run the post-phi |
| +variant of register allocation on the edge-split basic blocks. |
| + |
| +When computing a topological order, we try to first schedule assignments whose |
| +source has a physical register, and last schedule assignments whose destination |
| +has a physical register. This helps reduce register pressure. |
| + |
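| +The linearization step can be sketched as follows (a minimal illustration, not |
| +Subzero's actual implementation; the register-preference scheduling heuristic |
| +described above is omitted): |
| + |
```python
def sequentialize(moves, temp="t"):
    """Linearize parallel moves [(dest, src), ...], breaking cycles
    with a single temporary, as in the A=B / B=A example above."""
    pending = {d: s for d, s in moves if d != s}   # drop self-moves
    out = []
    while pending:
        srcs = set(pending.values())
        # A move is safe once its destination is no longer needed
        # as a source by any other pending move.
        ready = [d for d in pending if d not in srcs]
        if ready:
            d = ready[0]
            out.append((d, pending.pop(d)))
        else:
            # Every pending dest is still needed as a source: a cycle.
            # Save one destination in the temporary, then redirect reads.
            d = next(iter(pending))
            out.append((temp, d))
            pending = {x: (temp if s == d else s)
                       for x, s in pending.items()}
    return out
```
| + |
| +For the cyclic case, ``sequentialize([("A", "B"), ("B", "A")])`` yields exactly |
| +the ``t = A; A = B; B = t`` sequence shown above. |
| + |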
| +X86 address mode inference |
| +^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| + |
| +We try to take advantage of the x86 addressing mode that includes a base |
| +register, an index register, an index register scale amount, and an immediate |
| +offset. We do this through simple pattern matching. Starting with a load or |
| +store instruction where the address is a variable, we initialize the base |
| +register to that variable, and look up the instruction where that variable is |
| +defined. If that is an add instruction of two variables and the index register |
| +hasn't been set, we replace the base and index register with those two |
| +variables. If instead it is an add instruction of a variable and a constant, we |
| +replace the base register with the variable and add the constant to the |
| +immediate offset. |
| + |
| +There are several more patterns that can be matched. This pattern matching |
| +continues on the load or store instruction until no more matches are found. |
| +Because a program typically has few load and store instructions (not to be |
| +confused with instructions that manipulate stack variables), this address mode |
| +inference pass is fast. |
| + |
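| +The core of the matching loop might look like this (a simplified sketch that |
| +assumes SSA-style single definitions; scale matching and the other patterns |
| +mentioned above are omitted): |
| + |
```python
def infer_addr(var, defs):
    """Fold an add-chain rooted at `var` into (base, index, offset).
    `defs` maps a variable to its defining instruction, e.g.
    ("add", operand_a, operand_b), with int operands as constants."""
    base, index, offset = var, None, 0
    while True:
        inst = defs.get(base)           # acyclic under SSA
        if not inst or inst[0] != "add":
            break
        _, a, b = inst
        if isinstance(b, int):          # base = a + const
            base, offset = a, offset + b
        elif index is None:             # base = a + b: claim index reg
            base, index = a, b
        else:
            break                       # both base and index taken
    return base, index, offset
```
| + |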
| +X86 read-modify-write inference |
| +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| + |
| +A reasonably common bitcode pattern is a non-atomic update of a memory |
| +location:: |
| + |
| + x = load addr |
| + y = add x, 1 |
| + store y, addr |
| + |
| +On x86, with good register allocation, the Subzero passes described above |
| +generate code with only this quality:: |
| + |
| +  movl (%ebx), %eax |
| +  addl $1, %eax |
| +  movl %eax, (%ebx) |
| + |
| +However, x86 allows for this kind of code:: |
| + |
| +  addl $1, (%ebx) |
| + |
| +which requires fewer instructions, but perhaps more importantly, requires fewer |
| +physical registers. |
| + |
| +It's also important to note that this transformation only makes sense if the |
| +store instruction ends ``x``'s live range. |
| + |
| +Subzero's ``O2`` recipe includes an early pass to find read-modify-write (RMW) |
| +opportunities via simple pattern matching. The only problem is that it is run |
| +before liveness analysis, which is needed to determine whether ``x``'s live |
| +range ends after the RMW. Since liveness analysis is one of the most expensive |
| +passes, it's not attractive to run it an extra time just for RMW analysis. |
| +Instead, we essentially generate both the RMW and the non-RMW versions, and |
| +then during lowering, the RMW version deletes itself if it finds ``x`` still |
| +live. |
| + |
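| +The choice made at lowering time, once liveness is known, can be illustrated |
| +with a hypothetical helper (register and instruction choices here are |
| +illustrative only, using AT&T syntax): |
| + |
```python
def lower_update(addr_reg, const, x_live_after_store):
    """Pick the RMW form only when the store ends x's live range;
    otherwise keep the load/op/store sequence."""
    if x_live_after_store:
        return [f"movl ({addr_reg}), %eax",
                f"addl ${const}, %eax",
                f"movl %eax, ({addr_reg})"]
    # x is dead after the store: the single RMW instruction suffices.
    return [f"addl ${const}, ({addr_reg})"]
```
| + |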
| +X86 compare-branch inference |
| +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| + |
| +In the LLVM instruction set, the compare/branch pattern works like this:: |
| + |
| + cond = icmp eq a, b |
| + br cond, target |
| + |
| +The result of the icmp instruction is a single bit, and a conditional branch |
| +tests that bit. By contrast, most target architectures use this pattern:: |
| + |
| + cmp a, b // implicitly sets various bits of FLAGS register |
| + br eq, target // branch on a particular FLAGS bit |
| + |
| +A naive lowering sequence conditionally sets ``cond`` to 0 or 1, then tests |
| +``cond`` and conditionally branches. Subzero has a pass that identifies |
| +boolean-based operations like this and folds them into a single |
| +compare/branch-like operation. It is set up for more than just cmp/br though. |
| +Boolean producers include icmp (integer compare), fcmp (floating-point compare), |
| +and trunc (integer truncation when the destination has bool type). Boolean |
| +consumers include branch, select (the ternary operator from the C language), and |
| +sign-extend and zero-extend when the source has bool type. |
| + |
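| +A sketch of the folding for the simplest cmp/br case (a real pass must also |
| +verify that the branch is the only consumer of the icmp result; the instruction |
| +tuples here are invented for illustration): |
| + |
```python
def fold_cmp_br(insts):
    """Fold an icmp whose result feeds the immediately following branch
    into a single compare/branch pair."""
    out = []
    i = 0
    while i < len(insts):
        cur = insts[i]
        nxt = insts[i + 1] if i + 1 < len(insts) else None
        if (cur[0] == "icmp" and nxt and nxt[0] == "br"
                and nxt[1] == cur[1]):       # br tests the icmp result
            _, _, cond, a, b = cur           # ("icmp", dest, cond, a, b)
            _, _, target = nxt               # ("br", cond_var, target)
            out.append(("cmp", a, b))        # sets FLAGS implicitly
            out.append(("br." + cond, target))
            i += 2
        else:
            out.append(cur)
            i += 1
    return out
```
| + |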
| +Sandboxing |
| +^^^^^^^^^^ |
| + |
| +This is not actually a separate pass, but is integrated into lowering and |
| +assembly. |
| + |
| +- Indirect branches, including the ret instruction, are masked to a bundle |
| + boundary and bundle-locked. |
| + |
| +- Call instructions are aligned to the end of the bundle so that the return |
| + address is bundle-aligned. |
| + |
| +- Indirect branch targets, including function entry and targets in a switch |
| + statement jump table, are bundle-aligned. |
| + |
| +- The intrinsic for reading the thread pointer is inlined appropriately. |
| + |
| +- For x86-64, non-stack memory accesses are with respect to the reserved sandbox |
| + base register. We reduce the aggressiveness of address mode inference to |
| + leave room for the sandbox base register during lowering. There are no memory |
| + sandboxing changes for x86-32. |
| + |
| +Code emission |
| +------------- |
| + |
| +Subzero's integrated assembler is derived from Dart's assembler code. There is |
| +a pass that iterates through the low-level ICE instructions and invokes the |
| +relevant assembler functions. Placeholders are added for later fixup of branch |
| +target offsets. (Backward branches use short offsets if possible; forward |
| +branches always use long offsets.) The assembler emits into a staging buffer. |
| +Once emission into the staging buffer for a function is complete, the data is |
| +emitted to the output file as an ELF object file, and metadata such as |
| +relocations, symbol table, and string table, are accumulated for emission at the |
| +end. Global data initializers are emitted similarly. A key point is that at |
| +this point, the staging buffer can be deallocated, and only a minimum of data |
| +needs to be held until the end. |
| + |
| +As a debugging alternative, Subzero can emit textual assembly code which can |
| +then be run through an external assembler. This is of course super slow, but |
| +quite valuable when bringing up a new target. |
| + |
| +As another debugging option, the staging buffer can be emitted as textual |
| +assembly, primarily in the form of ".byte" lines. This allows the assembler to |
| +be tested separately from the ELF related code. |
| + |
| +Memory management |
| +----------------- |
| + |
| +Where possible, we allocate from a CfgLocalAllocator which derives from LLVM's |
| +BumpPtrAllocator. This is an arena-style allocator where objects allocated from |
| +the arena are never actually freed; instead, when the Cfg translation completes |
| +and the Cfg is deleted, the entire arena memory is reclaimed at once. |
| + |
| +Instructions are probably the most heavily allocated complex class in Subzero. |
| +We represent an instruction list as an intrusive doubly linked list, allocate |
| +all instructions from the CfgLocalAllocator, and make sure each instruction |
| +subclass is essentially POD with a trivial destructor. This way, when the Cfg |
| +is finished, we don't need to individually deallocate every instruction. We do |
| +the same for Variables, which is probably the second most heavily allocated |
| +class. |
| + |
| +There are some situations where passes need to use some STL container class. |
| +Subzero has a way of using the CfgLocalAllocator as the container allocator if |
| +this is needed. |
| + |
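| +The arena discipline can be illustrated with a toy bump-pointer allocator (a |
| +sketch of the idea only, not LLVM's BumpPtrAllocator API): |
| + |
```python
class Arena:
    """Bump-pointer arena: individual frees are no-ops, and deleting
    the Cfg reclaims the whole arena in one O(1) step."""
    def __init__(self, size=1 << 20):
        self.buf = bytearray(size)
        self.top = 0

    def alloc(self, nbytes, align=8):
        # Round the bump pointer up to the requested alignment.
        start = (self.top + align - 1) & ~(align - 1)
        if start + nbytes > len(self.buf):
            raise MemoryError("arena exhausted")
        self.top = start + nbytes
        return start            # offset of the new object

    def reset(self):
        # The "Cfg is deleted" moment: no per-object destructors run.
        self.top = 0
```
| + |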
| +Multithreaded translation |
| +------------------------- |
| + |
| +Subzero is designed to be able to translate functions in parallel. With the |
| +``-threads=N`` command-line option, there is a 3-stage producer-consumer |
| +pipeline: |
| + |
| +- A single thread parses the pexe file and produces a sequence of work units. A |
| + work unit can be either a fully constructed Cfg, or a set of global |
| + initializers. The work unit includes its sequence number denoting its parse |
| + order. Each work unit is added to the translation queue. |
| + |
| +- There are N translation threads that draw work units from the translation |
| + queue and lower them into assembler buffers. Each assembler buffer is added |
| + to the emitter queue, tagged with its sequence number. The Cfg and its |
| + CfgLocalAllocator are disposed of at this point. |
| + |
| +- A single thread draws assembler buffers from the emitter queue and appends to |
| + the output file. It uses the sequence numbers to reintegrate the assembler |
| + buffers according to the original parse order, such that output order is |
| + always deterministic. |
| + |
| +Note that ``-threads=N`` actually creates N+2 threads. For the special case of |
| +N=0, execution is entirely sequential -- the same thread parses, translates, and |
| +emits, one function at a time. This is useful for performance measurements. |
| + |
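| +The pipeline and its sequence-number reintegration can be sketched like this |
| +(a simplified model using Python threads and queues; the real implementation |
| +differs in its details): |
| + |
```python
import queue, threading

def run_pipeline(work_units, n_threads=3):
    """3-stage pipeline: one parser, N translators, one emitter that
    restores parse order using sequence numbers."""
    trans_q, emit_q = queue.Queue(), queue.Queue()

    def parser():
        for seq, unit in enumerate(work_units):
            trans_q.put((seq, unit))
        for _ in range(n_threads):
            trans_q.put(None)                  # one poison pill per thread

    def translator():
        while (item := trans_q.get()) is not None:
            seq, unit = item
            emit_q.put((seq, f"asm({unit})"))  # stand-in for lowering
        emit_q.put(None)

    threads = [threading.Thread(target=parser)]
    threads += [threading.Thread(target=translator) for _ in range(n_threads)]
    for t in threads:
        t.start()
    output, pending, next_seq, done = [], {}, 0, 0
    while done < n_threads:
        item = emit_q.get()
        if item is None:
            done += 1
            continue
        pending[item[0]] = item[1]
        while next_seq in pending:             # reintegrate in parse order
            output.append(pending.pop(next_seq))
            next_seq += 1
    for t in threads:
        t.join()
    return output
```
| + |
| +Whatever order the translator threads finish in, the emitter's buffering makes |
| +the output order deterministic. |
| + |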
| +Currently, parsing takes about 11% of total sequential time. If translation |
| +ever becomes fast and scalable enough that parsing turns into the bottleneck, |
| +it should be possible to make parsing multithreaded as well. |
| + |
| +Internally, all shared, mutable data is held in the GlobalContext object, and |
| +access to each field is guarded by a mutex. |
| + |
| +Security |
| +-------- |
| + |
| +Sandboxed translator |
| +^^^^^^^^^^^^^^^^^^^^ |
| + |
| +When running inside the browser, the Subzero translator executes as sandboxed, |
| +untrusted code that is initially checked by the validator, just like the |
| +LLVM-based pnacl-llc translator. As such, the Subzero binary should be no more |
| +or less secure than the translator it replaces, from the point of view of the |
| +Chrome sandbox. |
| + |
| +Code diversification |
| +^^^^^^^^^^^^^^^^^^^^ |
| + |
| +Return-oriented programming (ROP) is a now-common technique for starting with |
| +e.g. a known buffer overflow situation and launching it into a deeper exploit. |
| +The attacker scans the executable looking for ROP gadgets, which are short |
| +sequences of code that happen to load known values into known registers and then |
| +return. If the attacker can manage to overwrite parts of the stack, he can |
| +overwrite it with carefully chosen return addresses such that certain ROP |
| +gadgets are effectively chained together to set up the register state as |
| +desired, finally returning to some code that manages to do something nasty based |
| +on those register values. |
| + |
| +If there is a popular pexe with a large install base, the attacker could run |
| +Subzero on it and scan the executable for suitable ROP gadgets to use as part of |
| +a potential exploit. Note that if the trusted validator is working correctly, |
| +these ROP gadgets are limited to starting at a bundle boundary and cannot use |
| +the trick of finding a gadget that happens to begin inside another instruction. |
| +All the same, gadgets with these constraints still exist and the attacker has |
| +access to them. |
| + |
| +Subzero can mitigate these attacks to some degree through code diversification. |
| +Specifically, we can apply some randomness to the code generation that makes ROP |
| +gadgets less predictable. This randomness can have some compile-time cost, it |
| +can affect the code quality, and some diversifications may be more effective |
| +than others. |
| + |
| +To evaluate diversification effectiveness, we use a third-party ROP gadget |
| +finder and limit its results to bundle-aligned addresses. For a given |
| +diversification technique, we run it with a number of different random seeds, |
| +find ROP gadgets for each version, and determine how persistent each ROP gadget |
| +is across the different versions. A gadget is persistent if the same gadget is |
| +found at the same code address. The best diversifications are ones with low |
| +gadget persistence rates. |
| + |
| +Subzero implements 7 different diversification techniques. Below is a |
| +discussion of each technique, its effectiveness, and its cost. |
| + |
| +Constant blinding |
| +~~~~~~~~~~~~~~~~~ |
| + |
| +Here, we prevent attackers from controlling large immediates in the text |
| +(executable) section. A random cookie is generated for each function, and if |
| +the constant exceeds a specified threshold, the constant is obfuscated with the |
| +cookie and equivalent code is generated. For example, instead of this x86 |
| +instruction:: |
| + |
| + mov $0x11223344, <%Reg/Mem> |
| + |
| +the following code might be generated:: |
| + |
| + mov $(0x11223344+Cookie), %temp |
| + lea -Cookie(%temp), %temp |
| + mov %temp, <%Reg/Mem> |
| + |
| +The ``lea`` instruction is used rather than e.g. add/sub or xor, to prevent |
| +unintended effects on the flags register. |
| + |
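| +The arithmetic of the transformation can be sketched as follows (the choice of |
| +temporary register and the exact emission here are illustrative assumptions): |
| + |
```python
MASK = 0xFFFFFFFF            # x86-32 arithmetic wraps at 32 bits

def blind(value, cookie, temp="%ecx"):
    """Replace a large immediate with (value + cookie) plus a
    compensating lea; the raw value never appears in the text."""
    blinded = (value + cookie) & MASK
    insts = [f"movl ${blinded:#010x}, {temp}",
             f"leal -{cookie:#x}({temp}), {temp}"]
    recovered = (blinded - cookie) & MASK    # what the lea computes
    return insts, recovered
```
| + |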
| +This transformation has almost no effect on translation time, and about 1% |
| +impact on code quality, depending on the threshold chosen. It does little to |
| +reduce gadget persistence, but it does remove a lot of potential opportunities |
| +to construct intra-instruction ROP gadgets (which an attacker could use if a |
| +validator bug were discovered). |
| + |
| +Constant pooling |
| +~~~~~~~~~~~~~~~~ |
| + |
| +This is similar to constant blinding, in that large immediates are removed from |
| +the text section. In this case, each unique constant above the threshold is |
| +stored in a read-only data section and the constant is accessed via a memory |
| +load. For the above example, the following code might be generated:: |
| + |
| + mov $Label$1, %temp |
| + mov %temp, <%Reg/Mem> |
| + |
| +This has a similarly small impact on translation time and ROP gadget |
| +persistence, and a smaller (better) impact on code quality. This is because it |
| +uses fewer instructions, and in some cases good register allocation leads to no |
| +increase in instruction count. Note that this still gives an attacker some |
| +limited amount of control over some text section values, unless we randomize the |
| +constant pool layout. |
| + |
| +Static data reordering |
| +~~~~~~~~~~~~~~~~~~~~~~ |
| + |
| +This transformation limits the attacker's ability to control bits in global data |
| +address references. It simply permutes the order in memory of global variables |
| +and internal constant pool entries. For the constant pool, we only permute |
| +within a type (i.e., emit a randomized list of ints, followed by a randomized |
| +list of floats, etc.) to maintain good packing in the face of alignment |
| +constraints. |
| + |
| +As might be expected, this has no impact on code quality, translation time, or |
| +ROP gadget persistence (though as above, it limits opportunities for |
| +intra-instruction ROP gadgets with a broken validator). |
| + |
| +Basic block reordering |
| +~~~~~~~~~~~~~~~~~~~~~~ |
| + |
| +Here, we randomize the order of basic blocks within a function, with the |
| +constraint that we still want to maintain a topological order as much as |
| +possible, to avoid making the code too branchy. |
| + |
| +This has no impact on code quality, and about 1% impact on translation time, due |
| +to a separate pass to recompute layout. It ends up having a huge effect on ROP |
| +gadget persistence, tied for best with nop insertion, reducing ROP gadget |
| +persistence to less than 5%. |
| + |
| +Function reordering |
| +~~~~~~~~~~~~~~~~~~~ |
| + |
| +Here, we permute the order that functions are emitted, primarily to shift ROP |
| +gadgets around to less predictable locations. It may also change call address |
| +offsets in case the attacker was trying to control that offset in the code. |
| + |
| +To control latency and memory footprint, we don't arbitrarily permute functions. |
| +Instead, for some relatively small value of N, we queue up N assembler buffers, |
| +and then emit the N functions in random order, and repeat until all functions |
| +are emitted. |
| + |
| +Function reordering has no impact on translation time or code quality. |
| +Measurements indicate that it reduces ROP gadget persistence to about 15%. |
| + |
| +Nop insertion |
| +~~~~~~~~~~~~~ |
| + |
| +This diversification randomly adds a nop instruction after each regular |
| +instruction, with some probability. Nop instructions of different lengths may |
| +be selected. Nop instructions are never added inside a bundle_lock region. |
| +Note that when sandboxing is enabled, nop instructions are already being added |
| +for bundle alignment, so the diversification nop instructions may simply be |
| +taking the place of alignment nop instructions, though distributed differently |
| +through the bundle. |
| + |
| +In Subzero's current implementation, nop insertion adds 3-5% to the |
| +translation time, but this is probably because it is implemented as a separate |
| +pass that adds actual nop instructions to the IR. The overhead would probably |
| +be a lot less if it were integrated into the assembler pass. The code quality |
| +is also reduced by 3-5%, making nop insertion the most expensive of the |
| +diversification techniques. |
| + |
| +Nop insertion is very effective in reducing ROP gadget persistence, at the same |
| +level as basic block randomization. But given nop insertion's impact on |
| +translation time and code quality, one would most likely prefer to use basic |
| +block randomization instead. |
| + |
| +Register allocation randomization |
| +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| + |
| +In this diversification, the register allocator tries to make different but |
| +mostly functionally equivalent choices, while maintaining stable code quality. |
| + |
| +A naive approach would be the following. Whenever the allocator has more than |
| +one choice for assigning a register, choose randomly among those options. And |
| +whenever there are no registers available and there is a tie for the |
| +lowest-weight variable, randomly select one of the lowest-weight variables to |
| +evict. Because of the one-pass nature of the linear-scan algorithm, this |
| +randomization strategy can have a large impact on which variables are ultimately |
| +assigned registers, with a corresponding large impact on code quality. |
| + |
| +Instead, we choose an approach that tries to keep code quality stable regardless |
| +of the random seed. We partition the set of physical registers into equivalence |
| +classes. If a register is pre-colored in the function (i.e., referenced |
| +explicitly by name), it forms its own equivalence class. The remaining |
| +registers are partitioned according to their combination of attributes such as |
| +integer versus floating-point, 8-bit versus 32-bit, caller-save versus |
| +callee-saved, etc. Each equivalence class is randomly permuted, and the |
| +complete permutation is applied to the final register assignments. |
| + |
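| +The permutation scheme can be sketched like this (a simplified model in which |
| +the equivalence classes are assumed to be computed already from the register |
| +attributes, and pre-colored registers are simply left out of every class): |
| + |
```python
import random

def permute_registers(assignment, classes, seed):
    """Apply an independent random permutation within each register
    equivalence class to the final variable-to-register assignments."""
    rng = random.Random(seed)
    mapping = {}
    for cls in classes:          # e.g. all 32-bit callee-saved GPRs
        shuffled = cls[:]
        rng.shuffle(shuffled)
        mapping.update(zip(cls, shuffled))
    # Registers outside every class (pre-colored) map to themselves.
    return {var: mapping.get(reg, reg) for var, reg in assignment.items()}
```
| + |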
| +Register randomization reduces ROP gadget persistence to about 10% on average, |
| +though there tends to be fairly high variance across functions and applications. |
| +This probably has to do with the set of restrictions in the x86-32 instruction |
| +set and ABI, such as few general-purpose registers, eax used for return values, |
| +edx used for division, ecx used for shifting, etc. As intended, register |
| +randomization has no impact on code quality, and a slight (0.5%) impact on |
| +translation time due to an extra scan over the variables to identify pre-colored |
| +registers. |
| + |
| +Fuzzing |
| +^^^^^^^ |
| + |
| +We have started fuzzing the pexe files input to Subzero. Most of the problems |
| +found so far are ones most appropriately handled in the parser. However, there |
| +have been a couple that have identified problems in the lowering, or otherwise |
| +inappropriately triggered assertion failures and fatal errors. We continue to |
| +dig into this area. |
| + |
| +Chrome integration |
| +------------------ |
| + |
| +Currently Subzero is available in Chrome for the x86-32 architecture, but under |
| +a flag. When the flag is enabled, Subzero is used when the pexe developer's |
| +manifest specifies the ``O0`` optimization level. |
| + |
| +The next step is to remove the flag, i.e. invoke Subzero as the only translator |
| +for ``O0``-specified manifest files. |
| + |
| +Ultimately, Subzero might produce code rivaling LLVM ``O2`` quality, in which |
| +case Subzero could be used for all PNaCl translation. |
| + |
| +Command line options |
| +-------------------- |
| + |
| +Subzero has a number of command-line options for debugging and diagnostics. |
| +Among the more interesting are the following. |
| + |
| +- Using the ``-verbose`` flag, Subzero will dump the Cfg, or produce other |
| + diagnostic output, with various levels of detail after each pass. Instruction |
| + numbers can be printed or suppressed. Deleted instructions can be printed or |
| + suppressed (they are retained in the instruction list, as discussed earlier, |
| + because they can help explain how lower-level instructions originated). |
| + Liveness information can be printed when available. Details of register |
| + allocation can be printed as register allocator decisions are made. And more. |
| + |
| +- Running Subzero with any level of verbosity produces an enormous amount of |
| + output. When debugging a single function, verbose output can be suppressed |
| + except for a particular function. The ``-verbose-focus`` flag suppresses |
| + verbose output except for the specified function. |
| + |
| +- Subzero has a ``-timing`` option that prints a breakdown of pass-level timing |
| + at exit. Timing markers can be placed in the Subzero source code to demarcate |
| + logical operations or passes of interest. Basic timing information plus |
| + call-stack type timing information is printed at the end. |
| + |
| +- Along with ``-timing``, the user can instead get a report on the overall |
| + translation time for each function, to help focus on timing outliers. Also, |
| + ``-timing-focus`` limits the ``-timing`` reporting to a single function, |
| + instead of aggregating pass timing across all functions. |
| + |
| +- The ``-szstats`` option reports various statistics on each function, such as |
| + stack frame size, static instruction count, etc. It may be helpful to track |
| + these stats over time as Subzero is improved, as an approximate measure of |
| + code quality. |
| + |
| +- The flag ``-asm-verbose``, in conjunction with emitting textual assembly |
| +  output, annotates the assembly output with register-focused liveness |
| + information. In particular, each basic block is annotated with which |
| + registers are live-in and live-out, and each instruction is annotated with |
| + which registers' and stack locations' live ranges end at that instruction. |
| + This is really useful when studying the generated code to find opportunities |
| + for code quality improvements. |
| + |
| +Testing and debugging |
| +--------------------- |
| + |
| +LLVM lit tests |
| +^^^^^^^^^^^^^^ |
| + |
| +For basic testing, Subzero uses LLVM's lit framework for running tests. We have |
| +a suite of hundreds of small functions where we test for particular assembly |
| +code patterns across different target architectures. |
| + |
| +Cross tests |
| +^^^^^^^^^^^ |
| + |
| +Unfortunately, the lit tests don't do a great job of precisely testing the |
| +correctness of the output. Much better are the cross tests, which are execution |
| +tests that compare Subzero and llc translated bitcode across a wide variety of |
| +interesting inputs. Each cross test consists of a set of C, C++, and/or |
| +low-level bitcode files. The C and C++ source files are compiled down to |
| +bitcode. The bitcode files are translated by llc and also by Subzero. Subzero |
| +mangles global symbol names with a special prefix to avoid duplicate symbol |
| +errors. A driver program invokes both versions on a large set of interesting |
| +inputs, and reports when the Subzero and llc results differ. Cross tests turn |
| +out to be an excellent way of testing the basic lowering patterns, but they are |
| +less useful for testing more global things like liveness analysis and register |
| +allocation. |
| + |
| +Bisection debugging |
| +^^^^^^^^^^^^^^^^^^^ |
| + |
| +Sometimes with a new application, Subzero will end up producing incorrect code |
| +that either crashes at runtime or otherwise produces the wrong results. When |
| +this happens, we need to narrow it down to a single function (or small set of |
| +functions) that yield incorrect behavior. For this, we have a bisection |
| +debugging framework. Here, we initially translate the entire application once |
| +with Subzero and once with llc. We then use objdump to selectively weaken |
| +symbols based on a whitelist or blacklist provided on the command line. The two |
| +object files can then be linked together without link errors, with the desired |
| +version of each method "winning". Then the binary is tested, and bisection |
| +proceeds based on whether the binary produces correct output. |
| + |
| +When the bisection completes, we are left with a minimal set of |
| +Subzero-translated functions that cause the failure. Usually it is a single |
| +function, though sometimes it might require a combination of several functions |
| +to cause a failure; this may be due to an incorrect call ABI, for example. |
| +However, Murphy's Law implies that the single failing function is enormous and |
| +impractical to debug. In that case, we can restart the bisection, explicitly |
| +blacklisting the enormous function, and try to find another candidate to debug. |
| +(Future work is to automate this to find all minimal sets of functions, so that |
| +debugging can focus on the simplest example.) |
| + |
| +Fuzz testing |
| +^^^^^^^^^^^^ |
| + |
| +As described above, we try to find internal Subzero bugs using fuzz testing |
| +techniques. |
| + |
| +Sanitizers |
| +^^^^^^^^^^ |
| + |
| +We regularly run Subzero with asan and tsan. So far, multithreading has been |
| +simple enough that tsan hasn't found any bugs, but asan has found at least one |
| +memory leak which was subsequently fixed. |
| + |
| +Current status |
| +============== |
| + |
| +Target architectures |
| +-------------------- |
| + |
| +Subzero is currently more or less complete for the x86-32 target. It has been |
| +refactored and extended to handle x86-64 as well, and that is mostly complete at |
| +this point. |
| + |
| +ARM32 work is in progress and is mostly complete. It currently lacks the |
| +testing level of x86, at least in part because Subzero's register allocator |
| +needs modifications to handle ARM's aliasing of floating point and vector |
| +registers. Specifically, a 64-bit register is actually a gang of two |
| +consecutive and aligned 32-bit registers, and a 128-bit register is a gang of 4 |
| +consecutive and aligned 32-bit registers. ARM64 work has not started; the whole |
| +ARM64 sandbox model would need to be defined. |
| + |
| +An external contributor is adding MIPS support, in large part by following the |
| +ARM work. |
| + |
| +Translator performance |
| +---------------------- |
| + |
| +Single-threaded translation speed is currently about 5x the llc translation |
| +speed. For a large pexe, the time breaks down as: |
| + |
| +- 11% for parsing and initial IR building |
| + |
| +- 4% for emitting to /dev/null |
| + |
| +- 27% for liveness analysis (two liveness passes plus live range construction) |
| + |
| +- 15% for linear-scan register allocation |
| + |
| +- 9% for basic lowering |
| + |
| +- 10% for advanced phi lowering |
| + |
| +- ~11% for other minor analysis |
| + |
| +- ~10% measurement overhead to acquire these numbers |
| + |
| +Some improvements could undoubtedly be made, but it will be hard to reach 10x |
| +the llc translation speed while keeping acceptable code quality. With ``-Om1`` |
| +(lack of) optimization, we do actually achieve roughly 10x llc translation |
| +speed, but code quality drops by a factor of 3. |
| + |
| +Code quality |
| +------------ |
| + |
| +Measured across 16 components of spec2k, Subzero's code quality is uniformly |
| +better than llc ``-O0`` code quality, and in many cases solidly between llc |
| +``-O0`` and ``-O2``. |
| + |
| +Translator size |
| +--------------- |
| + |
| +When built in MINIMAL mode, the x86-64 native translator size for the x86-32 |
| +target is about 700 KB, not including the size of functions referenced in |
| +dynamically-linked libraries. The sandboxed version of Subzero is a bit over 1 |
| +MB, and it is statically linked and also includes nop padding for bundling as |
| +well as indirect branch masking. |
| + |
| +Translator memory footprint |
| +--------------------------- |
| + |
| +It's hard to draw firm conclusions about memory footprint, since the footprint |
| +is at least proportional to the input function size, and there is no real limit |
| +on the size of functions in the pexe file. |
| + |
| +That said, we looked at the memory footprint over time as Subzero translated |
| +pnacl-llc.pexe, which is the largest pexe (7.2 MB) at our disposal. One of |
| +LLVM's libraries that Subzero uses can report the current malloc heap usage. |
| +With single-threaded translation, Subzero tends to hover around 15 MB of memory |
| +usage. There are a couple of monstrous functions where Subzero grows to around |
| +100 MB, but then it drops back down after those functions finish translating. |
| +In contrast, llc grows larger and larger throughout translation, reaching |
| +several hundred MB by the time it completes. |
| + |
| +It's a bit more interesting when we enable multithreaded translation. When |
| +there are N translation threads, Subzero implements a policy that limits the |
| +size of the translation queue to N entries -- if it is "full" when the parser |
| +tries to add a new Cfg, the parser blocks until one of the translation threads |
| +removes a Cfg. This means the number of in-memory Cfgs can (and generally does) |
| +reach 2*N+1, and so the memory footprint rises in proportion to the number of |
| +threads. Adding to the pressure is the observation that the monstrous functions |
| +also take proportionally longer time to translate, so there's a good chance many |
| +of the monstrous functions will be active at the same time with multithreaded |
| +translation. As a result, for N=32, Subzero's memory footprint peaks at about |
| +260 MB, but drops back down as the large functions finish translating. |
| + |
| +If this peak memory size becomes a problem, it might be possible for the parser |
| +to resequence the functions to try to spread out the larger functions, or to |
| +throttle the translation queue to prevent too many in-flight large functions. |
| + |
| +Translator scalability |
| +---------------------- |
| + |
| +Currently scalability is "not very good". Multiple translation threads lead to |
| +faster translation, but not to the degree desired. We haven't dug in to |
| +investigate yet. |
| + |
| +There are a few areas to investigate. First, there may be contention on the |
| +constant pool, which all threads access, and which requires locked access even |
| +for reading. This could be mitigated by keeping a Cfg-local cache of the most |
| +common constants. |
| + |
| +Second, there may be contention on memory allocation. While almost all Cfg |
| +objects are allocated from the Cfg-local allocator, some passes use temporary |
| +STL containers that use the default allocator, which may require global locking. |
| +This could be mitigated by switching these to the Cfg-local allocator. |
| + |
| +Third, multithreading may make the default allocator strategy more expensive. |
| +In a single-threaded environment, a pass will allocate its containers, run the |
| +pass, and deallocate the containers. This results in stack-like allocation |
| +behavior and makes the heap free list easier to manage, with less heap |
| +fragmentation. But when multithreading is added, the allocations and |
| +deallocations become much less stack-like, making allocation and deallocation |
| +operations individually more expensive. Again, this could be mitigated by |
| +switching these to the Cfg-local allocator. |