| Index: DESIGN.rst
|
| diff --git a/DESIGN.rst b/DESIGN.rst
|
| new file mode 100644
|
| index 0000000000000000000000000000000000000000..3d324f64ba81798b5a7a038b24f265bbe292254b
|
| --- /dev/null
|
| +++ b/DESIGN.rst
|
| @@ -0,0 +1,1494 @@
|
| +Design of the Subzero fast code generator
|
| +=========================================
|
| +
|
| +Introduction
|
| +------------
|
| +
|
| +The `Portable Native Client (PNaCl) <http://gonacl.com>`_ project includes
|
| +compiler technology based on `LLVM <http://llvm.org/>`_. The developer uses the
|
| +PNaCl toolchain to compile their application to architecture-neutral PNaCl
|
| +bitcode (a ``.pexe`` file), using as much architecture-neutral optimization as
|
| +possible. The ``.pexe`` file is downloaded to the user's browser where the
|
| +PNaCl translator (a component of Chrome) compiles the ``.pexe`` file to
|
| +`sandboxed
|
| +<https://developer.chrome.com/native-client/reference/sandbox_internals/index>`_
|
| +native code. The translator uses architecture-specific optimizations as much as
|
| +practical to generate good native code.
|
| +
|
| +The native code can be cached by the browser to avoid repeating translation on
|
| +future page loads. However, first-time user experience is hampered by long
|
| +translation times. The LLVM-based PNaCl translator is pretty slow, even when
|
| +using ``-O0`` to minimize optimizations, so delays are especially noticeable on
|
| +slow browser platforms such as ARM-based Chromebooks.
|
| +
|
| +Translator slowness can be mitigated or hidden in a number of ways.
|
| +
|
| +- Parallel translation. However, slow machines where this matters most, e.g.
|
| + ARM-based Chromebooks, are likely to have fewer cores to parallelize across,
|
| +  and are likely to have less memory available for multiple translation threads to
|
| + use.
|
| +
|
| +- Streaming translation, i.e. start translating as soon as the download starts.
|
| + This doesn't help much when translation speed is 10× slower than download
|
| + speed, or the ``.pexe`` file is already cached while the translated binary was
|
| + flushed from the cache.
|
| +
|
| +- Arrange the web page such that translation is done in parallel with
|
| + downloading large assets.
|
| +
|
| +- Arrange the web page to distract the user with `cat videos
|
| + <https://www.youtube.com/watch?v=tLt5rBfNucc>`_ while translation is in
|
| + progress.
|
| +
|
| +Or, improve translator performance to something more reasonable.
|
| +
|
| +This document describes Subzero's attempt to improve translation speed by an
|
| +order of magnitude while rivaling LLVM's code quality. Subzero does this
|
| +through minimal IR layering, lean data structures and passes, and a careful
|
| +selection of fast optimization passes. It has two optimization recipes: full
|
| +optimizations (``O2``) and minimal optimizations (``Om1``). The recipes are the
|
| +following (described in more detail below):
|
| +
|
| ++-----------------------------------+-----------------------+
|
| +| O2 recipe | Om1 recipe |
|
| ++===================================+=======================+
|
| +| Parse .pexe file | Parse .pexe file |
|
| ++-----------------------------------+-----------------------+
|
| +| Loop nest analysis | |
|
| ++-----------------------------------+-----------------------+
|
| +| Address mode inference | |
|
| ++-----------------------------------+-----------------------+
|
| +| Read-modify-write (RMW) transform | |
|
| ++-----------------------------------+-----------------------+
|
| +| Basic liveness analysis | |
|
| ++-----------------------------------+-----------------------+
|
| +| Load optimization | |
|
| ++-----------------------------------+-----------------------+
|
| +| | Phi lowering (simple) |
|
| ++-----------------------------------+-----------------------+
|
| +| Target lowering | Target lowering |
|
| ++-----------------------------------+-----------------------+
|
| +| Full liveness analysis | |
|
| ++-----------------------------------+-----------------------+
|
| +| Register allocation | |
|
| ++-----------------------------------+-----------------------+
|
| +| Phi lowering (advanced) | |
|
| ++-----------------------------------+-----------------------+
|
| +| Post-phi register allocation | |
|
| ++-----------------------------------+-----------------------+
|
| +| Branch optimization | |
|
| ++-----------------------------------+-----------------------+
|
| +| Code emission | Code emission |
|
| ++-----------------------------------+-----------------------+
|
| +
|
| +Goals
|
| +=====
|
| +
|
| +Translation speed
|
| +-----------------
|
| +
|
| +We'd like to be able to translate a ``.pexe`` file as fast as download speed.
|
| +Any faster is in a sense wasted effort. Download speed varies greatly, but
|
| +we'll arbitrarily say 1 MB/sec. We'll pick the ARM A15 CPU as the example of a
|
| +slow machine. We observe a 3× single-thread performance difference between A15
|
| +and a high-end x86 Xeon E5-2690 based workstation, and aggressively assume a
|
| +``.pexe`` file could be compressed to 50% on the web server using gzip transport
|
| +compression, so we set the translation speed goal to 6 MB/sec on the high-end
|
| +Xeon workstation.
|
| +
|
| +Currently, at the ``-O0`` level, the LLVM-based PNaCl translator translates at
|
| +⅒ the target rate. The ``-O2`` mode takes 3× as long as the ``-O0`` mode.
|
| +
|
| +In other words, Subzero's goal is to improve over LLVM's translation speed by
|
| +10×.
|
| +
|
| +Code quality
|
| +------------
|
| +
|
| +Subzero's initial goal is to produce code that meets or exceeds LLVM's ``-O0``
|
| +code quality. The stretch goal is to approach LLVM ``-O2`` code quality. On
|
| +average, code generated by LLVM ``-O2`` runs twice as fast as code from ``-O0``.
|
| +
|
| +It's important to note that the quality of Subzero-generated code depends on
|
| +target-neutral optimizations and simplifications being run beforehand in the
|
| +developer environment. The ``.pexe`` file reflects these optimizations. For
|
| +example, Subzero assumes that the basic blocks are ordered topologically where
|
| +possible (which makes liveness analysis converge fastest), and Subzero does not
|
| +do any function inlining because it should already have been done.
|
| +
|
| +Translator size
|
| +---------------
|
| +
|
| +The current LLVM-based translator binary (``pnacl-llc``) is about 10 MB in size.
|
| +We think 1 MB is a more reasonable size -- especially for a component that
|
| +is distributed to a billion Chrome users. Thus we target a 10× reduction in
|
| +binary size.
|
| +
|
| +For development, Subzero can be built for all target architectures, and all
|
| +debugging and diagnostic options enabled. For a smaller translator, we restrict
|
| +to a single target architecture, and define a ``MINIMAL`` build where
|
| +unnecessary features are compiled out.
|
| +
|
| +Subzero leverages some data structures from LLVM's ``ADT`` and ``Support``
|
| +include directories, which have little impact on translator size. It also uses
|
| +some of LLVM's bitcode decoding code (for binary-format ``.pexe`` files), again
|
| +with little size impact. In non-``MINIMAL`` builds, the translator size is much
|
| +larger due to including code for parsing text-format bitcode files and forming
|
| +LLVM IR.
|
| +
|
| +Memory footprint
|
| +----------------
|
| +
|
| +The current LLVM-based translator suffers from an issue in which some
|
| +function-specific data has to be retained in memory until all translation
|
| +completes, and therefore the memory footprint grows without bound. Large
|
| +``.pexe`` files can lead to the translator process holding hundreds of MB of
|
| +memory by the end. The translator runs in a separate process, so this memory
|
| +growth doesn't *directly* affect other processes, but it does dirty the physical
|
| +memory and contributes to a perception of bloat and sometimes a reality of
|
| +out-of-memory tab killing, especially noticeable on weaker systems.
|
| +
|
| +Subzero should maintain a stable memory footprint throughout translation. It's
|
| +not practical to set a specific limit, because there is no practical limit
|
| +on a single function's size, but the footprint should be
|
| +"reasonable" and be proportional to the largest input function size, not the
|
| +total ``.pexe`` file size. Simply put, Subzero should not have memory leaks or
|
| +inexorable memory growth. (We use ASAN builds to test for leaks.)
|
| +
|
| +Multithreaded translation
|
| +-------------------------
|
| +
|
| +It should be practical to translate different functions concurrently and see
|
| +good scalability. Some locking may be needed, such as accessing output buffers
|
| +or constant pools, but that should be fairly minimal. In contrast, LLVM was
|
| +only designed for module-level parallelism, and as such, the PNaCl translator
|
| +internally splits a ``.pexe`` file into several modules for concurrent
|
| +translation. All output needs to be deterministic regardless of the level of
|
| +multithreading, i.e. functions and data should always be output in the same
|
| +order.
|
| +
|
| +Target architectures
|
| +--------------------
|
| +
|
| +Initial target architectures are x86-32, x86-64, ARM32, and MIPS32. Future
|
| +targets include ARM64 and MIPS64, though these targets currently lack NaCl
|
| +support, including a sandbox model and a validator.
|
| +
|
| +The first implementation is for x86-32, because it was expected to be
|
| +particularly challenging, and thus more likely to draw out any design problems
|
| +early:
|
| +
|
| +- There are a number of special cases, asymmetries, and warts in the x86
|
| + instruction set.
|
| +
|
| +- Complex addressing modes may be leveraged for better code quality.
|
| +
|
| +- 64-bit integer operations have to be lowered into longer sequences of 32-bit
|
| + operations.
|
| +
|
| +- Paucity of physical registers may reveal code quality issues early in the
|
| + design.
|
| +
|
| +Detailed design
|
| +===============
|
| +
|
| +Intermediate representation - ICE
|
| +---------------------------------
|
| +
|
| +Subzero's IR is called ICE. It is designed to be reasonably similar to LLVM's
|
| +IR, which is reflected in the ``.pexe`` file's bitcode structure. It has a
|
| +representation of global variables and initializers, and a set of functions.
|
| +Each function contains a list of basic blocks, and each basic block contains a
|
| +list of instructions. Instructions that operate on stack and register variables
|
| +do so using static single assignment (SSA) form.
|
| +
|
| +The ``.pexe`` file is translated one function at a time (or in parallel by
|
| +multiple translation threads). The recipe for optimization passes depends on
|
| +the specific target and optimization level, and is described in detail below.
|
| +Global variables (types and initializers) are simply and directly translated to
|
| +object code, without any meaningful attempts at optimization.
|
| +
|
| +A function's control flow graph (CFG) is represented by the ``Ice::Cfg`` class.
|
| +Its key contents include:
|
| +
|
| +- A list of ``CfgNode`` pointers, generally held in topological order.
|
| +
|
| +- A list of ``Variable`` pointers corresponding to local variables used in the
|
| + function plus compiler-generated temporaries.
|
| +
|
| +A basic block is represented by the ``Ice::CfgNode`` class. Its key contents
|
| +include:
|
| +
|
| +- A linear list of instructions, in the same style as LLVM. The last
|
| + instruction of the list is always a terminator instruction: branch, switch,
|
| + return, unreachable.
|
| +
|
| +- A list of Phi instructions, also in the same style as LLVM. They are held as
|
| + a linear list for convenience, though per Phi semantics, they are executed "in
|
| + parallel" without dependencies on each other.
|
| +
|
| +- An unordered list of ``CfgNode`` pointers corresponding to incoming edges, and
|
| + another list for outgoing edges.
|
| +
|
| +- The node's unique, 0-based index into the CFG's node list.
|
| +
|
| +An instruction is represented by the ``Ice::Inst`` class. Its key contents
|
| +include:
|
| +
|
| +- A list of source operands.
|
| +
|
| +- Its destination variable, if the instruction produces a result in an
|
| + ``Ice::Variable``.
|
| +
|
| +- A bitvector indicating which variables' live ranges this instruction ends.
|
| + This is computed during liveness analysis.
|
| +
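These containment relationships can be sketched as a minimal C++ illustration (the real ``Ice::Cfg``, ``Ice::CfgNode``, and ``Ice::Inst`` classes carry many more members and target-specific subclasses; the member names here are simplified assumptions):

```cpp
#include <cstddef>
#include <vector>

// Minimal sketch of the ICE containment hierarchy described above.
struct Variable {
  std::size_t Index; // unique 0-based index into the CFG's variable list
};

struct Inst {
  std::vector<Variable *> Srcs;      // source operands
  Variable *Dest = nullptr;          // destination, if the instruction has one
  std::vector<bool> LiveRangesEnded; // filled in by liveness analysis
};

struct CfgNode {
  std::size_t Index;                 // unique 0-based index into the node list
  std::vector<Inst> Phis;            // executed "in parallel"
  std::vector<Inst> Insts;           // last instruction is always a terminator
  std::vector<CfgNode *> InEdges, OutEdges;
};

struct Cfg {
  std::vector<CfgNode *> Nodes;      // generally kept in topological order
  std::vector<Variable *> Variables; // locals plus compiler temporaries
};
```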
|
| +Instruction kinds are divided into high-level ICE instructions and low-level
|
| +ICE instructions. High-level instructions consist of the PNaCl/LLVM bitcode
|
| +instruction kinds. Each target architecture implementation extends the
|
| +instruction space with its own set of low-level instructions. Generally,
|
| +low-level instructions correspond to individual machine instructions. The
|
| +high-level ICE instruction space includes a few additional instruction kinds
|
| +that are not part of LLVM but are generally useful (e.g., an Assignment
|
| +instruction), or are useful across targets (e.g., BundleLock and BundleUnlock
|
| +instructions for sandboxing).
|
| +
|
| +Specifically, high-level ICE instructions that derive from LLVM (but with PNaCl
|
| +ABI restrictions as documented in the `PNaCl Bitcode Reference Manual
|
| +<https://developer.chrome.com/native-client/reference/pnacl-bitcode-abi>`_) are
|
| +the following:
|
| +
|
| +- Alloca: allocate data on the stack
|
| +
|
| +- Arithmetic: binary operations of the form ``A = B op C``
|
| +
|
| +- Br: conditional or unconditional branch
|
| +
|
| +- Call: function call
|
| +
|
| +- Cast: unary type-conversion operations
|
| +
|
| +- ExtractElement: extract a scalar element from a vector-type value
|
| +
|
| +- Fcmp: floating-point comparison
|
| +
|
| +- Icmp: integer comparison
|
| +
|
| +- IntrinsicCall: call a known intrinsic
|
| +
|
| +- InsertElement: insert a scalar element into a vector-type value
|
| +
|
| +- Load: load a value from memory
|
| +
|
| +- Phi: implement the SSA phi node
|
| +
|
| +- Ret: return from the function
|
| +
|
| +- Select: essentially the C language operation of the form ``X = C ? Y : Z``
|
| +
|
| +- Store: store a value into memory
|
| +
|
| +- Switch: generalized branch to multiple possible locations
|
| +
|
| +- Unreachable: indicate that this portion of the code is unreachable
|
| +
|
| +The additional high-level ICE instructions are the following:
|
| +
|
| +- Assign: a simple ``A=B`` assignment. This is useful for e.g. lowering Phi
|
| + instructions to non-SSA assignments, before lowering to machine code.
|
| +
|
| +- BundleLock, BundleUnlock. These are markers used for sandboxing, but are
|
| + common across all targets and so they are elevated to the high-level
|
| + instruction set.
|
| +
|
| +- FakeDef, FakeUse, FakeKill. These are tools used to preserve consistency in
|
| + liveness analysis, elevated to the high-level because they are used by all
|
| + targets. They are described in more detail at the end of this section.
|
| +
|
| +- JumpTable: this represents the result of switch optimization analysis, where
|
| + some switch instructions may use jump tables instead of cascading
|
| + compare/branches.
|
| +
|
| +An operand is represented by the ``Ice::Operand`` class. In high-level ICE, an
|
| +operand is either an ``Ice::Constant`` or an ``Ice::Variable``. Constants
|
| +include scalar integer constants, scalar floating point constants, Undef (an
|
| +unspecified constant of a particular scalar or vector type), and symbol
|
| +constants (essentially addresses of globals). Note that the PNaCl ABI does not
|
| +include vector-type constants besides Undef, and as such, Subzero (so far) has
|
| +no reason to represent vector-type constants internally. A variable represents
|
| +a value allocated on the stack (though not including alloca-derived storage).
|
| +Among other things, a variable holds its unique, 0-based index into the CFG's
|
| +variable list.
|
| +
|
| +Each target can extend the ``Constant`` and ``Variable`` classes for its own
|
| +needs. In addition, the ``Operand`` class may be extended, e.g. to define an
|
| +x86 ``MemOperand`` that encodes a base register, an index register, an index
|
| +register shift amount, and a constant offset.
|
| +
|
| +Register allocation and liveness analysis are restricted to Variable operands.
|
| +Because of the importance of register allocation to code quality, and the
|
| +translation-time cost of liveness analysis, Variable operands get some special
|
| +treatment in ICE. Most notably, a frequent pattern in Subzero is to iterate
|
| +across all the Variables of an instruction. An instruction holds a list of
|
| +operands, but an operand may contain 0, 1, or more Variables. As such, the
|
| +``Operand`` class specially holds a list of Variables contained within, for
|
| +quick access.
|
| +
|
| +A Subzero transformation pass may work by deleting an existing instruction and
|
| +replacing it with zero or more new instructions. Instead of actually deleting
|
| +the existing instruction, we generally mark it as deleted and insert the new
|
| +instructions right after the deleted instruction. When printing the IR for
|
| +debugging, this is a big help because it makes it much more clear how the
|
| +non-deleted instructions came about.
|
| +
|
| +Subzero has a few special instructions to help with liveness analysis
|
| +consistency.
|
| +
|
| +- The FakeDef instruction gives a fake definition of some variable. For
|
| + example, on x86-32, a divide instruction defines both ``%eax`` and ``%edx``
|
| + but an ICE instruction can represent only one destination variable. This is
|
| + similar for multiply instructions, and for function calls that return a 64-bit
|
| + integer result in the ``%edx:%eax`` pair. Also, using the ``xor %eax, %eax``
|
| + trick to set ``%eax`` to 0 requires an initial FakeDef of ``%eax``.
|
| +
|
| +- The FakeUse instruction registers a use of a variable, typically to prevent an
|
| + earlier assignment to that variable from being dead-code eliminated. For
|
| + example, lowering an operation like ``x=cc?y:z`` may be done using x86's
|
| + conditional move (cmov) instruction: ``mov z, x; cmov_cc y, x``. Without a
|
| + FakeUse of ``x`` between the two instructions, the liveness analysis pass may
|
| + dead-code eliminate the first instruction.
|
| +
|
| +- The FakeKill instruction is added after a call instruction, and is a quick way
|
| + of indicating that caller-save registers are invalidated.
|
| +
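To see why FakeUse matters, consider a toy dead-code eliminator applied to the cmov lowering above (a sketch with hypothetical types, not Subzero's actual classes; variables are plain integers):

```cpp
#include <set>
#include <vector>

// Toy model: variables are integers; an instruction writes Dest (or -1 for
// none, e.g. a FakeUse) and reads Srcs.
struct ToyInst {
  int Dest;
  std::vector<int> Srcs;
  bool Deleted = false;
};

// Backward scan: an instruction whose result is never read before being
// redefined is tentatively dead.
void eliminateDeadCode(std::vector<ToyInst> &Insts) {
  std::set<int> Used;
  for (auto It = Insts.rbegin(); It != Insts.rend(); ++It) {
    if (It->Dest >= 0 && !Used.count(It->Dest)) {
      It->Deleted = true;     // no later use of the result
      continue;
    }
    if (It->Dest >= 0)
      Used.erase(It->Dest);   // the definition kills earlier liveness
    for (int Src : It->Srcs)
      Used.insert(Src);
  }
}
```

With variables ``x=0``, ``y=1``, ``z=2``, the lowering ``mov z,x; cmov y,x; ret x`` is modeled as ``{0,{2}}, {0,{1}}, {-1,{0}}`` (the cmov, like the corresponding low-level ICE instruction, does not list ``x`` as a source), and the pass deletes the initial ``mov``. Modeling a FakeUse of ``x`` as ``{-1,{0}}`` between the two instructions preserves it.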
|
| +Pexe parsing
|
| +------------
|
| +
|
| +Subzero includes an integrated PNaCl bitcode parser for ``.pexe`` files. It
|
| +parses the ``.pexe`` file function by function, ultimately constructing an ICE
|
| +CFG for each function. After a function is parsed, its CFG is handed off to the
|
| +translation phase. The bitcode parser also parses global initializer data and
|
| +hands it off to be translated to data sections in the object file.
|
| +
|
| +Subzero has another parsing strategy for testing/debugging. LLVM libraries can
|
| +be used to parse a module into LLVM IR (though very slowly relative to Subzero
|
| +native parsing). Then we iterate across the LLVM IR and construct high-level
|
| +ICE, handing off each CFG to the translation phase.
|
| +
|
| +Overview of lowering
|
| +--------------------
|
| +
|
| +In general, translation goes like this:
|
| +
|
| +- Parse the next function from the ``.pexe`` file and construct a CFG consisting
|
| + of high-level ICE.
|
| +
|
| +- Do analysis passes and transformation passes on the high-level ICE, as
|
| + desired.
|
| +
|
| +- Lower each high-level ICE instruction into a sequence of zero or more
|
| + low-level ICE instructions. Each high-level instruction is generally lowered
|
| + independently, though the target lowering is allowed to look ahead in the
|
| + CfgNode's instruction list if desired.
|
| +
|
| +- Do more analysis and transformation passes on the low-level ICE, as desired.
|
| +
|
| +- Assemble the low-level CFG into an ELF object file (alternatively, a textual
|
| + assembly file that is later assembled by some external tool).
|
| +
|
| +- Repeat for all functions, and also produce object code for data such as global
|
| + initializers and internal constant pools.
|
| +
|
| +Currently there are two optimization levels: ``O2`` and ``Om1``. For ``O2``,
|
| +the intention is to apply all available optimizations to get the best code
|
| +quality (though the initial code quality goal is measured against LLVM's ``O0``
|
| +code quality). For ``Om1``, the intention is to apply as few optimizations as
|
| +possible and produce code as quickly as possible, accepting poor code quality.
|
| +``Om1`` is short for "O-minus-one", i.e. "worse than O0", or in other words,
|
| +"sub-zero".
|
| +
|
| +High-level debuggability of generated code is so far not a design requirement.
|
| +Subzero doesn't really do transformations that would obfuscate debugging; the
|
| +main thing might be that register allocation (including stack slot coalescing
|
| +for stack-allocated variables whose live ranges don't overlap) may render a
|
| +variable's value unobtainable after its live range ends. This would not be an
|
| +issue for ``Om1`` since it doesn't register-allocate program-level variables,
|
| +nor does it coalesce stack slots. That said, fully supporting debuggability
|
| +would require a few additions:
|
| +
|
| +- DWARF support would need to be added to Subzero's ELF file emitter. Subzero
|
| + propagates global symbol names, local variable names, and function-internal
|
| + label names that are present in the ``.pexe`` file. This would allow a
|
| + debugger to map addresses back to symbols in the ``.pexe`` file.
|
| +
|
| +- To map ``.pexe`` file symbols back to meaningful source-level symbol names,
|
| + file names, line numbers, etc., Subzero would need to handle `LLVM bitcode
|
| + metadata <http://llvm.org/docs/LangRef.html#metadata>`_ and ``llvm.dbg``
|
| +  `intrinsics <http://llvm.org/docs/LangRef.html#dbg-intrinsics>`_.
|
| +
|
| +- The PNaCl toolchain explicitly strips all this from the ``.pexe`` file, and so
|
| + the toolchain would need to be modified to preserve it.
|
| +
|
| +Our experience so far is that ``Om1`` translates twice as fast as ``O2``, but
|
| +produces code with one third the code quality. ``Om1`` is good for testing and
|
| +debugging -- during translation, it tends to expose errors in the basic lowering
|
| +that might otherwise have been hidden by the register allocator or other
|
| +optimization passes. It also helps determine whether a code correctness problem
|
| +is a fundamental problem in the basic lowering, or an error in another
|
| +optimization pass.
|
| +
|
| +The implementation of target lowering also controls the recipe of passes used
|
| +for ``Om1`` and ``O2`` translation. For example, address mode inference may
|
| +only be relevant for x86.
|
| +
|
| +Lowering strategy
|
| +-----------------
|
| +
|
| +The core of Subzero's lowering from high-level ICE to low-level ICE is to lower
|
| +each high-level instruction down to a sequence of low-level target-specific
|
| +instructions, in a largely context-free setting. That is, each high-level
|
| +instruction conceptually has a simple template expansion into low-level
|
| +instructions, and lowering can in theory be done in any order. This may sound
|
| +like a small effort, but quite a large number of templates may be needed because
|
| +of the number of PNaCl types and instruction variants. Furthermore, there may
|
| +be optimized templates, e.g. to take advantage of operator commutativity (for
|
| +example, ``x=x+1`` might allow a better lowering than ``x=1+x``). This is
|
| +similar to other template-based approaches in fast code generation or
|
| +interpretation, though some decisions are deferred until after some global
|
| +analysis passes, mostly related to register allocation, stack slot assignment,
|
| +and specific choice of instruction variant and addressing mode.
|
| +
|
| +The key idea for a lowering template is to produce valid low-level instructions
|
| +that are guaranteed to meet address mode and other structural requirements of
|
| +the instruction set. For example, on x86, the source operand of an integer
|
| +store instruction must be an immediate or a physical register; a shift
|
| +instruction's shift amount must be an immediate or in register ``%cl``; a
|
| +function's integer return value is in ``%eax``; most x86 instructions are
|
| +two-operand, in contrast to corresponding three-operand high-level instructions;
|
| +etc.
|
| +
|
| +Because target lowering runs before register allocation, there is no way to know
|
| +whether a given ``Ice::Variable`` operand lives on the stack or in a physical
|
| +register. When the low-level instruction calls for a physical register operand,
|
| +the target lowering can create an infinite-weight Variable. This tells the
|
| +register allocator to assign infinite weight when making decisions, effectively
|
| +guaranteeing some physical register. Variables can also be pre-colored to a
|
| +specific physical register (``cl`` in the shift example above), which also gives
|
| +infinite weight.
|
| +
|
| +To illustrate, consider a high-level arithmetic instruction on 32-bit integer
|
| +operands::
|
| +
|
| + A = B + C
|
| +
|
| +X86 target lowering might produce the following::
|
| +
|
| + T.inf = B // mov instruction
|
| + T.inf += C // add instruction
|
| + A = T.inf // mov instruction
|
| +
|
| +Here, ``T.inf`` is an infinite-weight temporary. As long as ``T.inf`` has a
|
| +physical register, the three lowered instructions are all encodable regardless
|
| +of whether ``B`` and ``C`` are physical registers, memory, or immediates, and
|
| +whether ``A`` is a physical register or in memory.
|
| +
|
| +In this example, ``A`` must be a Variable and one may be tempted to simplify the
|
| +lowering sequence by setting ``A`` as infinite-weight and using::
|
| +
|
| + A = B // mov instruction
|
| + A += C // add instruction
|
| +
|
| +This has two problems. First, if the original instruction was actually ``A =
|
| +B + A``, the result would be incorrect. Second, assigning ``A`` a physical
|
| +register applies throughout ``A``'s entire live range. This is probably not
|
| +what is intended, and may ultimately lead to a failure to allocate a register
|
| +for an infinite-weight variable.
|
| +
|
| +This style of lowering leads to many temporaries being generated, so in ``O2``
|
| +mode, we rely on the register allocator to clean things up. For example, in the
|
| +example above, if ``B`` ends up getting a physical register and its live range
|
| +ends at this instruction, the register allocator is likely to reuse that
|
| +register for ``T.inf``. This leads to ``T.inf=B`` being a redundant register
|
| +copy, which is removed as an emission-time peephole optimization.
|
| +
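That emission-time peephole amounts to skipping self-moves; a hypothetical emitter helper (not Subzero's actual code) might look like:

```cpp
#include <string>

// Skip register-to-register moves whose source and destination are the same
// physical register; these arise when the allocator reuses B's register for
// T.inf in the pattern above. Returns whether anything was emitted.
bool emitMov(const std::string &Dst, const std::string &Src,
             std::string &Out) {
  if (Dst == Src)
    return false;             // redundant copy: emit nothing
  Out += "mov " + Dst + ", " + Src + "\n";
  return true;
}
```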
|
| +O2 lowering
|
| +-----------
|
| +
|
| +Currently, the ``O2`` lowering recipe is the following:
|
| +
|
| +- Loop nest analysis
|
| +
|
| +- Address mode inference
|
| +
|
| +- Read-modify-write (RMW) transformation
|
| +
|
| +- Basic liveness analysis
|
| +
|
| +- Load optimization
|
| +
|
| +- Target lowering
|
| +
|
| +- Full liveness analysis
|
| +
|
| +- Register allocation
|
| +
|
| +- Phi instruction lowering (advanced)
|
| +
|
| +- Post-phi lowering register allocation
|
| +
|
| +- Branch optimization
|
| +
|
| +These passes are described in more detail below.
|
| +
|
| +Om1 lowering
|
| +------------
|
| +
|
| +Currently, the ``Om1`` lowering recipe is the following:
|
| +
|
| +- Phi instruction lowering (simple)
|
| +
|
| +- Target lowering
|
| +
|
| +- Register allocation (infinite-weight and pre-colored only)
|
| +
|
| +Optimization passes
|
| +-------------------
|
| +
|
| +Liveness analysis
|
| +^^^^^^^^^^^^^^^^^
|
| +
|
| +Liveness analysis is a standard dataflow optimization, implemented as follows.
|
| +For each node (basic block), its live-out set is computed as the union of the
|
| +live-in sets of its successor nodes. Then the node's instructions are processed
|
| +in reverse order, updating the live set, until the beginning of the node is
|
| +reached, and the node's live-in set is recorded. If this iteration has changed
|
| +the node's live-in set, the node's predecessors are marked for reprocessing.
|
| +This continues until no more nodes need reprocessing. If nodes are processed in
|
| +reverse topological order, the number of iterations over the CFG is generally
|
| +equal to the maximum loop nest depth.
|
| +
|
| +To implement this, each node records its live-in and live-out sets, initialized
|
| +to the empty set. Each instruction records which of its Variables' live ranges
|
| +end in that instruction, initialized to the empty set. A side effect of
|
| +liveness analysis is dead instruction elimination. Each instruction can be
|
| +marked as tentatively dead, and after the algorithm converges, the tentatively
|
| +dead instructions are permanently deleted.
|
| +
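The worklist algorithm above can be sketched as follows, modeling variables as bit positions and summarizing each node by its use-before-def and def sets (the names and structure are illustrative, not Subzero's implementation):

```cpp
#include <bitset>
#include <cstddef>
#include <deque>
#include <vector>

constexpr std::size_t MaxVars = 64;
using LiveSet = std::bitset<MaxVars>;

struct Node {
  LiveSet Use, Def;              // use-before-def set and def set of the node
  std::vector<int> Preds, Succs; // predecessor and successor node indices
  LiveSet LiveIn, LiveOut;       // results, initialized to the empty set
};

// Backward dataflow: LiveOut = union of successors' LiveIn, and
// LiveIn = Use | (LiveOut & ~Def). Nodes are revisited until a fixed point.
void computeLiveness(std::vector<Node> &Nodes) {
  std::deque<int> Work;
  // Seeding in reverse topological order generally converges fastest.
  for (int I = static_cast<int>(Nodes.size()) - 1; I >= 0; --I)
    Work.push_back(I);
  while (!Work.empty()) {
    Node &N = Nodes[Work.front()];
    Work.pop_front();
    N.LiveOut.reset();
    for (int S : N.Succs)
      N.LiveOut |= Nodes[S].LiveIn;
    LiveSet In = N.Use | (N.LiveOut & ~N.Def);
    if (In != N.LiveIn) {
      N.LiveIn = In;
      for (int P : N.Preds)
        Work.push_back(P);       // predecessors must be reprocessed
    }
  }
}
```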
|
| +Optionally, after this liveness analysis completes, we can do live range
|
| +construction, in which we calculate the live range of each variable in terms of
|
| +instruction numbers. A live range is represented as a union of segments, where
|
| +the segment endpoints are instruction numbers. Instruction numbers are required
|
| +to be unique across the CFG, and monotonically increasing within a basic block.
|
| +As a union of segments, live ranges can contain "gaps" and are therefore
|
| +precise. Because of SSA properties, a variable's live range can start at most
|
| +once in a basic block, and can end at most once in a basic block. Liveness
|
| +analysis keeps track of which variable/instruction tuples begin live ranges and
|
| +end live ranges, and combined with live-in and live-out sets, we can efficiently
|
| +build up live ranges of all variables across all basic blocks.
|
| +
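A segment-based live range with a precise overlap test might be sketched like this (an illustration; Subzero's actual live range representation differs in detail):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// A live range is a union of half-open [Start, End) instruction-number
// segments, kept sorted; gaps between segments are not part of the range.
struct LiveRange {
  std::vector<std::pair<int, int>> Segments;

  void addSegment(int Start, int End) { Segments.push_back({Start, End}); }

  // Merge-style walk over the two sorted segment lists.
  bool overlaps(const LiveRange &Other) const {
    std::size_t I = 0, J = 0;
    while (I < Segments.size() && J < Other.Segments.size()) {
      if (Segments[I].second <= Other.Segments[J].first)
        ++I;
      else if (Other.Segments[J].second <= Segments[I].first)
        ++J;
      else
        return true;           // the segments intersect
    }
    return false;
  }
};
```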
|
| +A lot of care is taken to try to make liveness analysis fast and efficient.
|
| +Because of the lowering strategy, the number of variables is generally
|
| +proportional to the number of instructions, leading to an O(N^2) complexity
|
| +algorithm if implemented naively. To improve things based on sparsity, we note
|
| +that most variables are "local" and referenced in at most one basic block (in
|
| +contrast to the "global" variables with multi-block usage), and therefore cannot
|
| +be live across basic blocks. Therefore, the live-in and live-out sets,
|
| +typically represented as bit vectors, can be limited to the set of global
|
| +variables, and the intra-block liveness bit vector can be compacted to hold the
|
| +global variables plus the local variables for that block.
|
| +
|
| +Register allocation
|
| +^^^^^^^^^^^^^^^^^^^
|
| +
|
| +Subzero implements a simple linear-scan register allocator, based on the
|
| +allocator described by Hanspeter Mössenböck and Michael Pfeiffer in `Linear Scan
|
| +Register Allocation in the Context of SSA Form and Register Constraints
|
| +<ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe02.PDF>`_. This allocator has
|
| +several nice features:
|
| +
|
| +- Live ranges are represented as unions of segments, as described above, rather
|
| + than a single start/end tuple.
|
| +
|
| +- It allows pre-coloring of variables with specific physical registers.
|
| +
|
| +- It applies equally well to pre-lowered Phi instructions.
|
| +
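For reference, the basic linear-scan loop that these features extend can be sketched as follows, using single-segment ranges and a simplistic spill policy for brevity (Subzero's allocator instead works on the segment unions described above, with weights):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Interval {
  int Start, End;      // single [Start, End) segment for brevity
  int Reg = -1;        // assigned register, or -1 if left on the stack
};

// Classic linear scan: walk intervals by start point, expire intervals that
// have ended to reclaim their registers, and leave an interval spilled when
// no register is free (a real allocator would evict by weight).
void linearScan(std::vector<Interval *> &Intervals, int NumRegs) {
  std::sort(Intervals.begin(), Intervals.end(),
            [](const Interval *A, const Interval *B) {
              return A->Start < B->Start;
            });
  std::vector<bool> Free(NumRegs, true);
  std::vector<Interval *> Active;
  for (Interval *Cur : Intervals) {
    // Expire intervals whose live range ended before Cur starts.
    for (std::size_t I = 0; I < Active.size();) {
      if (Active[I]->End <= Cur->Start) {
        Free[Active[I]->Reg] = true;
        Active.erase(Active.begin() + I);
      } else {
        ++I;
      }
    }
    for (int R = 0; R < NumRegs; ++R) {
      if (Free[R]) {
        Free[R] = false;
        Cur->Reg = R;
        Active.push_back(Cur);
        break;
      }
    }
  }
}
```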
|
| +The paper suggests an approach of aggressively coalescing variables across Phi
|
| +instructions (i.e., trying to force Phi source and destination variables to have
|
| +the same register assignment), but we reject that in favor of the more natural
|
| +preference mechanism described below.
|
| +
|
| +We enhance the algorithm in the paper with the capability of automatic inference
|
| +of register preference, and with the capability of allowing overlapping live
|
| +ranges to safely share the same register in certain circumstances. If we are
|
| +considering register allocation for variable ``A``, and ``A`` has a single
|
| +defining instruction ``A=B+C``, then the preferred register for ``A``, if
|
| +available, would be the register assigned to ``B`` or ``C``, if any, provided
|
| +that ``B`` or ``C``'s live range does not overlap ``A``'s live range. In this
|
| +way we infer a good register preference for ``A``.
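The preference inference can be sketched as follows (a hypothetical Python rendering, not Subzero's C++ code): scan the sources of the single defining instruction and take the first one whose assigned register exists and whose live range does not overlap the destination's:

```python
def infer_preference(defining_srcs, var, reg_of, overlaps):
    """defining_srcs: source variables of var's single defining instruction.
    reg_of: var -> assigned register or None.
    overlaps: (x, y) -> True if the two live ranges overlap.
    Returns a preferred register for `var`, or None if no safe candidate."""
    for src in defining_srcs:
        r = reg_of(src)
        if r is not None and not overlaps(src, var):
            return r
    return None
```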
|
| +
|
| +We allow overlapping live ranges to get the same register in certain cases.
|
| +Suppose a high-level instruction like::
|
| +
|
| + A = unary_op(B)
|
| +
|
| +has been target-lowered like::
|
| +
|
| + T.inf = B
|
| + A = unary_op(T.inf)
|
| +
|
| +Further, assume that ``B``'s live range continues beyond this instruction
|
| +sequence, and that ``B`` has already been assigned some register. Normally, we
|
| +might want to infer ``B``'s register as a good candidate for ``T.inf``, but it
|
| +turns out that ``T.inf`` and ``B``'s live ranges overlap, requiring them to have
|
| +different registers. But ``T.inf`` is just a read-only copy of ``B`` that is
|
| +guaranteed to be in a register, so in theory these overlapping live ranges could
|
| +safely have the same register. Our implementation allows this overlap as long
|
| +as ``T.inf`` is never modified within ``B``'s live range, and ``B`` is never
|
| +modified within ``T.inf``'s live range.
|
| +
|
| +Subzero's register allocator can be run in 3 configurations.
|
| +
|
| +- Normal mode. All Variables are considered for register allocation. It
|
| + requires full liveness analysis and live range construction as a prerequisite.
|
| + This is used by ``O2`` lowering.
|
| +
|
| +- Minimal mode. Only infinite-weight or pre-colored Variables are considered.
|
| + All other Variables are stack-allocated. It does not require liveness
|
| + analysis; instead, it quickly scans the instructions and records first
|
| + definitions and last uses of all relevant Variables, using that to construct a
|
| + single-segment live range. Although this includes most of the Variables, the
|
| + live ranges are mostly simple, short, and rarely overlapping, which the
|
| + register allocator handles efficiently. This is used by ``Om1`` lowering.
|
| +
|
| +- Post-phi lowering mode. Advanced phi lowering is done after normal-mode
|
| + register allocation, and may result in new infinite-weight Variables that need
|
| + registers. One would like to just run something like minimal mode to assign
|
| + registers to the new Variables while respecting existing register allocation
|
| + decisions. However, it sometimes happens that there are no free registers.
|
| + In this case, some register needs to be forcibly spilled to the stack and
|
| + temporarily reassigned to the new Variable, and reloaded at the end of the new
|
| + Variable's live range. The register must be one that has no explicit
|
| + references during the Variable's live range. Since Subzero currently doesn't
|
| + track def/use chains (though it does record the CfgNode where a Variable is
|
| + defined), we just do a brute-force search across the CfgNode's instruction
|
| + list for the instruction numbers of interest. This situation happens very
|
| + rarely, so there's little point for now in improving its performance.
|
| +
|
| +The basic linear-scan algorithm may, as it proceeds, rescind an early register
|
| +allocation decision, leaving that Variable to be stack-allocated. In some of
|
| +these cases, it turns out that the Variable could have been given a different
|
| +register without conflict, but by then it's too late. The literature recognizes
|
| +this situation and describes "second-chance bin-packing", which Subzero can do.
|
| +We can rerun the register allocator in a mode that respects existing register
|
| +allocation decisions, and sometimes it finds new non-conflicting opportunities.
|
| +In fact, we can repeatedly run the register allocator until convergence.
|
| +Unfortunately, in the current implementation, these subsequent register
|
| +allocation passes end up being extremely expensive. This is because of the
|
| +treatment of the "unhandled pre-colored" Variable set, which is normally very
|
| +small but ends up being quite large on subsequent passes. Its performance can
|
| +probably be made acceptable with a better choice of data structures, but for now
|
| +this second-chance mechanism is disabled.
|
| +
|
| +Future work is to implement LLVM's `Greedy
|
| +<http://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html>`_
|
| +register allocator as a replacement for the basic linear-scan algorithm, given
|
| +LLVM's experience with its improvement in code quality. (The blog post claims
|
| +that the Greedy allocator also improved maintainability because a lot of hacks
|
| +could be removed, but Subzero is probably not yet to that level of hacks, and is
|
| +less likely to see that particular benefit.)
|
| +
|
| +Basic phi lowering
|
| +^^^^^^^^^^^^^^^^^^
|
| +
|
| +The simplest phi lowering strategy works as follows (this is how LLVM ``-O0``
|
| +implements it). Consider this example::
|
| +
|
| + L1:
|
| + ...
|
| + br L3
|
| + L2:
|
| + ...
|
| + br L3
|
| + L3:
|
| + A = phi [B, L1], [C, L2]
|
| + X = phi [Y, L1], [Z, L2]
|
| +
|
| +For each destination of a phi instruction, we can create a temporary and insert
|
| +the temporary's assignment at the end of the predecessor block::
|
| +
|
| + L1:
|
| + ...
|
| + A' = B
|
| + X' = Y
|
| + br L3
|
| + L2:
|
| + ...
|
| + A' = C
|
| + X' = Z
|
| + br L3
|
| + L3:
|
| + A = A'
|
| + X = X'
|
| +
|
| +This transformation is very simple and reliable. It can be done before target
|
| +lowering and register allocation, and it easily avoids the classic lost-copy and
|
| +related problems. ``Om1`` lowering uses this strategy.
|
| +
|
| +However, it has the disadvantage of initializing temporaries even for branches
|
| +not taken, though that could be mitigated by splitting non-critical edges and
|
| +putting assignments in the edge-split nodes. Another problem is that without
|
| +extra machinery, the assignments to ``A``, ``A'``, ``X``, and ``X'`` are given a
|
| +specific ordering even though phi semantics are that the assignments are
|
| +parallel or unordered. This sometimes imposes false live range overlaps and
|
| +leads to poorer register allocation.
|
| +
|
| +Advanced phi lowering
|
| +^^^^^^^^^^^^^^^^^^^^^
|
| +
|
| +``O2`` lowering defers phi lowering until after register allocation to avoid the
|
| +problem of false live range overlaps. It works as follows. We split each
|
| +incoming edge and move the (parallel) phi assignments into the split nodes. We
|
| +linearize each set of assignments by finding a safe, topological ordering of the
|
| +assignments, respecting register assignments as well. For example::
|
| +
|
| + A = B
|
| + X = Y
|
| +
|
| +Normally these assignments could be executed in either order, but if ``B`` and
|
| +``X`` are assigned the same physical register, we would want to use the above
|
| +ordering. Dependency cycles are broken by introducing a temporary. For
|
| +example::
|
| +
|
| + A = B
|
| + B = A
|
| +
|
| +Here, a temporary breaks the cycle::
|
| +
|
| + t = A
|
| + A = B
|
| + B = t
|
| +
|
| +Finally, we use the existing target lowering to lower the assignments in this
|
| +basic block, and once that is done for all basic blocks, we run the post-phi
|
| +variant of register allocation on the edge-split basic blocks.
|
| +
|
| +When computing a topological order, we try to first schedule assignments whose
|
| +source has a physical register, and last schedule assignments whose destination
|
| +has a physical register. This helps reduce register pressure.
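This linearization is the classic parallel-move sequentialization problem. A Python sketch (illustrative only, not Subzero's actual code) that emits "ready" moves first (those whose destination no pending move still reads) and breaks cycles with a temporary:

```python
def serialize_parallel_moves(moves):
    """moves: list of (dst, src) pairs with distinct dsts, semantically
    parallel.  Returns a sequential list of assignments, introducing a
    fresh 'tmp' location to break dependency cycles."""
    pending = list(moves)
    out = []
    while pending:
        srcs = {s for _, s in pending}
        ready = [m for m in pending if m[0] not in srcs]
        if ready:
            for m in ready:
                out.append(m)
                pending.remove(m)
        else:                         # every dst is still read: a cycle
            dst, src = pending.pop(0)
            out.append(("tmp", dst))  # save the about-to-be-clobbered value
            out.append((dst, src))
            # redirect remaining reads of `dst` to the saved copy
            pending = [(d, "tmp" if s == dst else s) for d, s in pending]
    return out
```

For the ``A = B; B = A`` cycle above, this produces exactly the ``t = A; A = B; B = t`` sequence shown.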
|
| +
|
| +X86 address mode inference
|
| +^^^^^^^^^^^^^^^^^^^^^^^^^^
|
| +
|
| +We try to take advantage of the x86 addressing mode that includes a base
|
| +register, an index register, an index register scale amount, and an immediate
|
| +offset. We do this through simple pattern matching. Starting with a load or
|
| +store instruction where the address is a variable, we initialize the base
|
| +register to that variable, and look up the instruction where that variable is
|
| +defined. If that is an add instruction of two variables and the index register
|
| +hasn't been set, we replace the base and index register with those two
|
| +variables. If instead it is an add instruction of a variable and a constant, we
|
| +replace the base register with the variable and add the constant to the
|
| +immediate offset.
|
| +
|
| +There are several more patterns that can be matched. This pattern matching
|
| +continues on the load or store instruction until no more matches are found.
|
| +Because a program typically has few load and store instructions (not to be
|
| +confused with instructions that manipulate stack variables), this address mode
|
| +inference pass is fast.
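The two patterns described above can be sketched like this (hypothetical Python, with scale matching and the other patterns omitted; the instruction encodings are simplified tuples, not Subzero's IR):

```python
def infer_addr_mode(addr_var, def_of):
    """def_of: var -> ('add', a, b) | ('add_const', a, k) | None.
    Folds addr_var's def chain into an x86-style (base, index, offset)."""
    base, index, offset = addr_var, None, 0
    changed = True
    while changed:
        changed = False
        d = def_of(base)
        if d and d[0] == "add" and index is None:
            base, index = d[1], d[2]      # base+index from add of two vars
            changed = True
        elif d and d[0] == "add_const":
            base, offset = d[1], offset + d[2]  # fold constant into offset
            changed = True
    return base, index, offset
```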
|
| +
|
| +X86 read-modify-write inference
|
| +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
| +
|
| +A reasonably common bitcode pattern is a non-atomic update of a memory
|
| +location::
|
| +
|
| + x = load addr
|
| + y = add x, 1
|
| + store y, addr
|
| +
|
| +On x86, with good register allocation, the Subzero passes described above
|
| +generate code with only this quality::
|
| +
|
| + mov [%ebx], %eax
|
| + add $1, %eax
|
| + mov %eax, [%ebx]
|
| +
|
| +However, x86 allows for this kind of code::
|
| +
|
| + add $1, [%ebx]
|
| +
|
| +which requires fewer instructions, but perhaps more importantly, requires fewer
|
| +physical registers.
|
| +
|
| +It's also important to note that this transformation only makes sense if the
|
| +store instruction ends ``x``'s live range.
|
| +
|
| +Subzero's ``O2`` recipe includes an early pass to find read-modify-write (RMW)
|
| +opportunities via simple pattern matching. The only problem is that it is run
|
| +before liveness analysis, which is needed to determine whether ``x``'s live
|
| +range ends after the RMW. Since liveness analysis is one of the most expensive
|
| +passes, it's not attractive to run it an extra time just for RMW analysis.
|
| +Instead, we essentially generate both the RMW and the non-RMW versions, and then
|
| +during lowering, the RMW version deletes itself if it finds ``x`` still live.
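The pattern match itself can be sketched as a scan for load/modify/store triples on the same address (a hypothetical Python rendering over simplified instruction tuples; as the text notes, the liveness check on ``x`` is deferred to lowering):

```python
def find_rmw(instrs):
    """instrs: tuples ('load', dst, addr), (op, dst, src0, src1),
    ('store', src, addr).  Returns indices of load/modify/store triples
    that are candidates for folding into a single RMW instruction."""
    matches = []
    for i in range(len(instrs) - 2):
        ld, op, st = instrs[i], instrs[i + 1], instrs[i + 2]
        if (ld[0] == "load" and st[0] == "store"
                and ld[2] == st[2]                      # same address
                and op[0] in ("add", "sub", "and", "or", "xor")
                and op[2] == ld[1]                      # modifies loaded value
                and st[1] == op[1]):                    # stores the result
            matches.append(i)
    return matches
```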
|
| +
|
| +X86 compare-branch inference
|
| +^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
| +
|
| +In the LLVM instruction set, the compare/branch pattern works like this::
|
| +
|
| + cond = icmp eq a, b
|
| + br cond, target
|
| +
|
| +The result of the icmp instruction is a single bit, and a conditional branch
|
| +tests that bit. By contrast, most target architectures use this pattern::
|
| +
|
| + cmp a, b // implicitly sets various bits of FLAGS register
|
| + br eq, target // branch on a particular FLAGS bit
|
| +
|
| +A naive lowering sequence conditionally sets ``cond`` to 0 or 1, then tests
|
| +``cond`` and conditionally branches. Subzero has a pass that identifies
|
| +boolean-based operations like this and folds them into a single
|
| +compare/branch-like operation. It is set up for more than just cmp/br though.
|
| +Boolean producers include icmp (integer compare), fcmp (floating-point compare),
|
| +and trunc (integer truncation when the destination has bool type). Boolean
|
| +consumers include branch, select (the ternary operator from the C language), and
|
| +sign-extend and zero-extend when the source has bool type.
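For the simplest producer/consumer pair, the folding can be sketched like this (hypothetical Python over simplified tuples, not Subzero's C++; the ``use_count`` check ensures the boolean has no other consumers, so the producer can be deleted):

```python
def fold_cmp_br(instrs, use_count):
    """Fold ('icmp', dst, cond, a, b) immediately followed by
    ('br', dst, target) into one ('cmp_br', cond, a, b, target)
    when dst has no other uses."""
    out, i = [], 0
    while i < len(instrs):
        ins = instrs[i]
        nxt = instrs[i + 1] if i + 1 < len(instrs) else None
        if (ins[0] == "icmp" and nxt and nxt[0] == "br"
                and nxt[1] == ins[1] and use_count(ins[1]) == 1):
            out.append(("cmp_br", ins[2], ins[3], ins[4], nxt[2]))
            i += 2
        else:
            out.append(ins)
            i += 1
    return out
```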
|
| +
|
| +Sandboxing
|
| +^^^^^^^^^^
|
| +
|
| +Native Client's sandbox model uses software fault isolation (SFI) to provide
|
| +safety when running untrusted code in a browser or other environment. Subzero
|
| +implements Native Client's `sandboxing
|
| +<https://developer.chrome.com/native-client/reference/sandbox_internals/index>`_
|
| +to enable Subzero-translated executables to be run inside Chrome. Subzero also
|
| +provides a fairly simple framework for investigating alternative sandbox models
|
| +or other restrictions on the sandbox model.
|
| +
|
| +Sandboxing in Subzero is not actually implemented as a separate pass, but is
|
| +integrated into lowering and assembly.
|
| +
|
| +- Indirect branches, including the ret instruction, are masked to a bundle
|
| + boundary and bundle-locked.
|
| +
|
| +- Call instructions are aligned to the end of the bundle so that the return
|
| + address is bundle-aligned.
|
| +
|
| +- Indirect branch targets, including function entry and targets in a switch
|
| + statement jump table, are bundle-aligned.
|
| +
|
| +- The intrinsic for reading the thread pointer is inlined appropriately.
|
| +
|
| +- For x86-64, non-stack memory accesses are with respect to the reserved sandbox
|
| + base register. We reduce the aggressiveness of address mode inference to
|
| + leave room for the sandbox base register during lowering. There are no memory
|
| + sandboxing changes for x86-32.
|
| +
|
| +Code emission
|
| +-------------
|
| +
|
| +Subzero's integrated assembler is derived from Dart's `assembler code
|
| +<https://github.com/dart-lang/sdk/tree/master/runtime/vm>`_. There is a pass
|
| +that iterates through the low-level ICE instructions and invokes the relevant
|
| +assembler functions. Placeholders are added for later fixup of branch target
|
| +offsets. (Backward branches use short offsets if possible; forward branches
|
| +generally use long offsets unless it is an intra-block branch of "known" short
|
| +length.) The assembler emits into a staging buffer. Once emission into the
|
| +staging buffer for a function is complete, the data is emitted to the output
|
| +file as an ELF object file, and metadata such as relocations, symbol table, and
|
| +string table, are accumulated for emission at the end. Global data initializers
|
| +are emitted similarly. A key point is that at this point, the staging buffer
|
| +can be deallocated, and only a minimum of data needs to be held until the end.
|
| +
|
| +As a debugging alternative, Subzero can emit textual assembly code which can
|
| +then be run through an external assembler. This is of course super slow, but
|
| +quite valuable when bringing up a new target.
|
| +
|
| +As another debugging option, the staging buffer can be emitted as textual
|
| +assembly, primarily in the form of ".byte" lines. This allows the assembler to
|
| +be tested separately from the ELF related code.
|
| +
|
| +Memory management
|
| +-----------------
|
| +
|
| +Where possible, we allocate from a ``CfgLocalAllocator`` which derives from
|
| +LLVM's ``BumpPtrAllocator``. This is an arena-style allocator where objects
|
| +allocated from the arena are never actually freed; instead, when the CFG
|
| +translation completes and the CFG is deleted, the entire arena memory is
|
| +reclaimed at once. This style of allocation works well in an environment like a
|
| +compiler where there are distinct phases with only easily-identifiable objects
|
| +living across phases. It frees the developer from having to manage object
|
| +deletion, and it amortizes deletion costs across just a single arena deletion at
|
| +the end of the phase. Furthermore, it helps scalability by allocating entirely
|
| +from thread-local memory pools, and minimizing global locking of the heap.
|
| +
|
| +Instructions are probably the most heavily allocated complex class in Subzero.
|
| +We represent an instruction list as an intrusive doubly linked list, allocate
|
| +all instructions from the ``CfgLocalAllocator``, and we make sure each
|
| +instruction subclass is basically `POD
|
| +<http://en.cppreference.com/w/cpp/concept/PODType>`_ (Plain Old Data) with a
|
| +trivial destructor. This way, when the CFG is finished, we don't need to
|
| +individually deallocate every instruction. We do the same for Variables, which
|
| +is probably the second most popular complex class.
|
| +
|
| +There are some situations where passes need to use some `STL container class
|
| +<http://en.cppreference.com/w/cpp/container>`_. Subzero has a way of using the
|
| +``CfgLocalAllocator`` as the container allocator if this is needed.
|
| +
|
| +Multithreaded translation
|
| +-------------------------
|
| +
|
| +Subzero is designed to be able to translate functions in parallel. With the
|
| +``-threads=N`` command-line option, there is a 3-stage producer-consumer
|
| +pipeline:
|
| +
|
| +- A single thread parses the ``.pexe`` file and produces a sequence of work
|
| + units. A work unit can be either a fully constructed CFG, or a set of global
|
| + initializers. The work unit includes its sequence number denoting its parse
|
| + order. Each work unit is added to the translation queue.
|
| +
|
| +- There are N translation threads that draw work units from the translation
|
| + queue and lower them into assembler buffers. Each assembler buffer is added
|
| + to the emitter queue, tagged with its sequence number. The CFG and its
|
| + ``CfgLocalAllocator`` are disposed of at this point.
|
| +
|
| +- A single thread draws assembler buffers from the emitter queue and appends to
|
| + the output file. It uses the sequence numbers to reintegrate the assembler
|
| + buffers according to the original parse order, such that output order is
|
| + always deterministic.
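The emitter's reintegration step amounts to buffering out-of-order arrivals until the next expected sequence number appears. A minimal sketch (illustrative Python, not Subzero's C++ implementation):

```python
import heapq

def deterministic_emit(completed, emit):
    """completed: iterable of (seq, buf) pairs in arbitrary completion
    order (seq numbers 0, 1, 2, ... assigned at parse time).
    Calls emit(buf) strictly in seq order, buffering early arrivals."""
    heap, next_seq = [], 0
    for seq, buf in completed:
        heapq.heappush(heap, (seq, buf))
        while heap and heap[0][0] == next_seq:
            emit(heapq.heappop(heap)[1])
            next_seq += 1
```

This is what makes the output order independent of translation-thread scheduling.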
|
| +
|
| +This means that with ``-threads=N``, there are actually ``N+1`` spawned threads
|
| +for a total of ``N+2`` execution threads, taking the parser and emitter threads
|
| +into account. For the special case of ``N=0``, execution is entirely sequential
|
| +-- the same thread parses, translates, and emits, one function at a time. This
|
| +is useful for performance measurements.
|
| +
|
| +Ideally, we would like to get near-linear scalability as the number of
|
| +translation threads increases. We expect that ``-threads=1`` should be slightly
|
| +faster than ``-threads=0`` as the small amount of time spent parsing and
|
| +emitting is done largely in parallel with translation. With perfect
|
| +scalability, we would see ``-threads=N`` translating ``N`` times as fast as
|
| +``-threads=1``, up until the point where parsing or emitting becomes the
|
| +bottleneck, or ``N+2`` exceeds the number of CPU cores. In reality, memory
|
| +performance would become a bottleneck and efficiency might peak at, say, 75%.
|
| +
|
| +Currently, parsing takes about 11% of total sequential time. If translation
|
| +scalability ever gets so fast and awesomely scalable that parsing becomes a
|
| +bottleneck, it should be possible to make parsing multithreaded as well.
|
| +
|
| +Internally, all shared, mutable data is held in the GlobalContext object, and
|
| +access to each field is guarded by a mutex.
|
| +
|
| +Security
|
| +--------
|
| +
|
| +Subzero includes a number of security features in the generated code, as well as
|
| +in the Subzero translator itself, which run on top of the existing Native Client
|
| +sandbox as well as Chrome's OS-level sandbox.
|
| +
|
| +Sandboxed translator
|
| +^^^^^^^^^^^^^^^^^^^^
|
| +
|
| +When running inside the browser, the Subzero translator executes as sandboxed,
|
| +untrusted code that is initially checked by the validator, just like the
|
| +LLVM-based ``pnacl-llc`` translator. As such, the Subzero binary should be no
|
| +more or less secure than the translator it replaces, from the point of view of
|
| +the Chrome sandbox. That said, Subzero is much smaller than ``pnacl-llc`` and
|
| +was designed from the start with security in mind, so one expects fewer attacker
|
| +opportunities here.
|
| +
|
| +Code diversification
|
| +^^^^^^^^^^^^^^^^^^^^
|
| +
|
| +`Return-oriented programming
|
| +<https://en.wikipedia.org/wiki/Return-oriented_programming>`_ (ROP) is a
|
| +now-common technique for starting with e.g. a known buffer overflow situation
|
| +and launching it into a deeper exploit. The attacker scans the executable
|
| +looking for ROP gadgets, which are short sequences of code that happen to load
|
| +known values into known registers and then return. An attacker who manages to
|
| +overwrite parts of the stack can overwrite it with carefully chosen return
|
| +addresses such that certain ROP gadgets are effectively chained together to set
|
| +up the register state as desired, finally returning to some code that manages to
|
| +do something nasty based on those register values.
|
| +
|
| +If there is a popular ``.pexe`` with a large install base, the attacker could
|
| +run Subzero on it and scan the executable for suitable ROP gadgets to use as
|
| +part of a potential exploit. Note that if the trusted validator is working
|
| +correctly, these ROP gadgets are limited to starting at a bundle boundary and
|
| +cannot use the trick of finding a gadget that happens to begin inside another
|
| +instruction. All the same, gadgets with these constraints still exist and the
|
| +attacker has access to them. This is the attack model we focus most on --
|
| +protecting the user against misuse of a "trusted" developer's application, as
|
| +opposed to mischief from a malicious ``.pexe`` file.
|
| +
|
| +Subzero can mitigate these attacks to some degree through code diversification.
|
| +Specifically, we can apply some randomness to the code generation that makes ROP
|
| +gadgets less predictable. This randomness can have some compile-time cost, and
|
| +it can affect the code quality; and some diversifications may be more effective
|
| +than others. A more detailed treatment of hardening techniques may be found in
|
| +the Matasano report "`Attacking Clientside JIT Compilers
|
| +<https://www.nccgroup.trust/globalassets/resources/us/presentations/documents/attacking_clientside_jit_compilers_paper.pdf>`_".
|
| +
|
| +To evaluate diversification effectiveness, we use a third-party ROP gadget
|
| +finder and limit its results to bundle-aligned addresses. For a given
|
| +diversification technique, we run it with a number of different random seeds,
|
| +find ROP gadgets for each version, and determine how persistent each ROP gadget
|
| +is across the different versions. A gadget is persistent if the same gadget is
|
| +found at the same code address. The best diversifications are ones with low
|
| +gadget persistence rates.
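The persistence metric can be computed as a simple set intersection across the diversified builds (a hypothetical sketch; the actual gadget finder and its output format are external tools):

```python
def persistence_rate(gadget_sets):
    """gadget_sets: one set of (address, gadget_bytes) pairs per random
    seed.  A gadget persists if the identical bytes appear at the
    identical address in every diversified build; lower is better."""
    baseline = gadget_sets[0]
    persistent = set.intersection(*map(set, gadget_sets))
    return len(persistent) / len(baseline) if baseline else 0.0
```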
|
| +
|
| +Subzero implements 7 different diversification techniques. Below is a
|
| +discussion of each technique, its effectiveness, and its cost. The discussions
|
| +of cost and effectiveness are for a single diversification technique; the
|
| +translation-time costs for multiple techniques are additive, but the effects of
|
| +multiple techniques on code quality and effectiveness are not yet known.
|
| +
|
| +In Subzero's implementation, each randomization is "repeatable" in a sense.
|
| +Each pass that includes a randomization option gets its own private instance of
|
| +a random number generator (RNG). The RNG is seeded with a combination of a
|
| +global seed, the pass ID, and the function's sequence number. The global seed
|
| +is designed to be different across runs (perhaps based on the current time), but
|
| +for debugging, the global seed can be set to a specific value and the results
|
| +will be repeatable.
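The seeding scheme might look like the following sketch (hypothetical Python; the exact mixing function Subzero uses may differ, but the key property is that the seed is a deterministic function of global seed, pass ID, and function sequence number):

```python
import hashlib

def pass_rng_seed(global_seed, pass_id, func_seq):
    """Derive a repeatable per-pass, per-function RNG seed by hashing
    the three components together."""
    data = f"{global_seed}:{pass_id}:{func_seq}".encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "little")
```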
|
| +
|
| +Subzero-generated code is subject to diversification once per translation, and
|
| +then Chrome caches the diversified binary for subsequent executions. An
|
| +attacker may attempt to run the binary multiple times hoping for
|
| +higher-probability combinations of ROP gadgets. When the attacker guesses
|
| +wrong, a likely outcome is an application crash. Chrome throttles creation of
|
| +crashy processes, which reduces the likelihood of the attacker eventually gaining
|
| +a foothold.
|
| +
|
| +Constant blinding
|
| +~~~~~~~~~~~~~~~~~
|
| +
|
| +Here, we prevent attackers from controlling large immediates in the text
|
| +(executable) section. A random cookie is generated for each function, and if
|
| +the constant exceeds a specified threshold, the constant is obfuscated with the
|
| +cookie and equivalent code is generated. For example, instead of this x86
|
| +instruction::
|
| +
|
| + mov $0x11223344, <%Reg/Mem>
|
| +
|
| +the following code might be generated::
|
| +
|
| + mov $(0x11223344+Cookie), %temp
|
| + lea -Cookie(%temp), %temp
|
| + mov %temp, <%Reg/Mem>
|
| +
|
| +The ``lea`` instruction is used rather than e.g. ``add``/``sub`` or ``xor``, to
|
| +prevent unintended effects on the flags register.
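The arithmetic behind the blinding is simple modular cancellation: the text section contains ``value + cookie`` and ``-cookie``, which recombine to ``value`` mod 2^32 at run time. A sketch (hypothetical names; threshold and cookie handling are simplified):

```python
def blind_constant(value, cookie, threshold=0xFFFF, mask=0xFFFFFFFF):
    """If `value` is large enough to be attacker-useful, split it into
    (value + cookie) and a -cookie displacement so the raw immediate
    never appears in the text section."""
    if value <= threshold:
        return ("mov_imm", value)                  # small: leave as-is
    blinded = (value + cookie) & mask
    return ("mov_lea", blinded, (-cookie) & mask)  # lea disp(%temp), %temp
```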
|
| +
|
| +This transformation has almost no effect on translation time, and about 1%
|
| +impact on code quality, depending on the threshold chosen. It does little to
|
| +reduce gadget persistence, but it does remove a lot of potential opportunities
|
| +to construct intra-instruction ROP gadgets (which an attacker could use only if
|
| +a validator bug were discovered, since the Native Client sandbox and associated
|
| +validator force returns and other indirect branches to be to bundle-aligned
|
| +addresses).
|
| +
|
| +Constant pooling
|
| +~~~~~~~~~~~~~~~~
|
| +
|
| +This is similar to constant blinding, in that large immediates are removed from
|
| +the text section. In this case, each unique constant above the threshold is
|
| +stored in a read-only data section and the constant is accessed via a memory
|
| +load. For the above example, the following code might be generated::
|
| +
|
| + mov $Label$1, %temp
|
| + mov %temp, <%Reg/Mem>
|
| +
|
| +This has a similarly small impact on translation time and ROP gadget
|
| +persistence, and a smaller (better) impact on code quality. This is because it
|
| +uses fewer instructions, and in some cases good register allocation leads to no
|
| +increase in instruction count. Note that this still gives an attacker some
|
| +limited amount of control over some text section values, unless we randomize the
|
| +constant pool layout.
|
| +
|
| +Static data reordering
|
| +~~~~~~~~~~~~~~~~~~~~~~
|
| +
|
| +This transformation limits the attacker's ability to control bits in global data
|
| +address references. It simply permutes the order in memory of global variables
|
| +and internal constant pool entries. For the constant pool, we only permute
|
| +within a type (i.e., emit a randomized list of ints, followed by a randomized
|
| +list of floats, etc.) to maintain good packing in the face of alignment
|
| +constraints.
|
| +
|
| +As might be expected, this has no impact on code quality, translation time, or
|
| +ROP gadget persistence (though as above, it limits opportunities for
|
| +intra-instruction ROP gadgets with a broken validator).
|
| +
|
| +Basic block reordering
|
| +~~~~~~~~~~~~~~~~~~~~~~
|
| +
|
| +Here, we randomize the order of basic blocks within a function, with the
|
| +constraint that we still want to maintain a topological order as much as
|
| +possible, to avoid making the code too branchy.
|
| +
|
| +This has no impact on code quality, and about 1% impact on translation time, due
|
| +to a separate pass to recompute layout. It ends up having a huge effect on ROP
|
| +gadget persistence, tied for best with nop insertion, reducing ROP gadget
|
| +persistence to less than 5%.
|
| +
|
| +Function reordering
|
| +~~~~~~~~~~~~~~~~~~~
|
| +
|
| +Here, we permute the order that functions are emitted, primarily to shift ROP
|
| +gadgets around to less predictable locations. It may also change call address
|
| +offsets in case the attacker was trying to control that offset in the code.
|
| +
|
| +To control latency and memory footprint, we don't arbitrarily permute functions.
|
| +Instead, for some relatively small value of N, we queue up N assembler buffers,
|
| +and then emit the N functions in random order, and repeat until all functions
|
| +are emitted.
|
| +
|
| +Function reordering has no impact on translation time or code quality.
|
| +Measurements indicate that it reduces ROP gadget persistence to about 15%.
|
| +
|
| +Nop insertion
|
| +~~~~~~~~~~~~~
|
| +
|
| +This diversification randomly adds a nop instruction after each regular
|
| +instruction, with some probability. Nop instructions of different lengths may
|
| +be selected. Nop instructions are never added inside a bundle_lock region.
|
| +Note that when sandboxing is enabled, nop instructions are already being added
|
| +for bundle alignment, so the diversification nop instructions may simply be
|
| +taking the place of alignment nop instructions, though distributed differently
|
| +through the bundle.
|
| +
|
| +In Subzero's current implementation, nop insertion adds 3-5% to the
|
| +translation time, but this is probably because it is implemented as a separate
|
| +pass that adds actual nop instructions to the IR. The overhead would probably
|
| +be a lot less if it were integrated into the assembler pass. The code quality
|
| +is also reduced by 3-5%, making nop insertion the most expensive of the
|
| +diversification techniques.
|
| +
|
| +Nop insertion is very effective in reducing ROP gadget persistence, at the same
|
| +level as basic block randomization (less than 5%). But given nop insertion's
|
| +impact on translation time and code quality, one would most likely prefer to use
|
| +basic block randomization instead (though the combined effects of the different
|
| +diversification techniques have not yet been studied).
|
| +
|
| +Register allocation randomization
|
| +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
| +
|
| +In this diversification, the register allocator tries to make different but
|
| +mostly functionally equivalent choices, while maintaining stable code quality.
|
| +
|
| +A naive approach would be the following. Whenever the allocator has more than
|
| +one choice for assigning a register, choose randomly among those options. And
|
| +whenever there are no registers available and there is a tie for the
|
| +lowest-weight variable, randomly select one of the lowest-weight variables to
|
| +evict. Because of the one-pass nature of the linear-scan algorithm, this
|
| +randomization strategy can have a large impact on which variables are ultimately
|
| +assigned registers, with a corresponding large impact on code quality.
|
| +
|
| +Instead, we choose an approach that tries to keep code quality stable regardless
|
| +of the random seed. We partition the set of physical registers into equivalence
|
| +classes. If a register is pre-colored in the function (i.e., referenced
|
| +explicitly by name), it forms its own equivalence class. The remaining
|
| +registers are partitioned according to their combination of attributes such as
|
| +integer versus floating-point, 8-bit versus 32-bit, caller-save versus
|
| +callee-saved, etc. Each equivalence class is randomly permuted, and the
|
| +complete permutation is applied to the final register assignments.
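The partition-and-permute step can be sketched as follows (illustrative Python with hypothetical attribute encoding, not Subzero's C++): pre-colored registers become singleton classes and therefore map to themselves, while each remaining class is shuffled internally.

```python
import random

def permute_registers(regs, attrs_of, precolored, seed):
    """Partition registers into equivalence classes by attribute tuple
    (pre-colored registers form singleton classes), shuffle each class,
    and return the combined old -> new register mapping."""
    rng = random.Random(seed)
    classes = {}
    for r in regs:
        key = ("precolored", r) if r in precolored else attrs_of(r)
        classes.setdefault(key, []).append(r)
    mapping = {}
    for members in classes.values():
        shuffled = members[:]
        rng.shuffle(shuffled)           # permute within the class only
        mapping.update(zip(members, shuffled))
    return mapping
```

Because swaps only happen between interchangeable registers, applying the mapping to the final assignments cannot change code quality, only the bytes an attacker sees.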
|
| +
|
| +Register randomization reduces ROP gadget persistence to about 10% on average,
|
| +though there tends to be fairly high variance across functions and applications.
|
| +This probably has to do with the set of restrictions in the x86-32 instruction
|
| +set and ABI, such as few general-purpose registers, ``%eax`` used for return
|
| +values, ``%edx`` used for division, ``%cl`` used for shifting, etc. As
|
| +intended, register randomization has no impact on code quality, and a slight
|
| +(0.5%) impact on translation time due to an extra scan over the variables to
|
| +identify pre-colored registers.
|
| +
|
| +Fuzzing
|
| +^^^^^^^
|
| +
|
| +We have started fuzz-testing the ``.pexe`` files input to Subzero, using a
|
| +combination of `afl-fuzz <http://lcamtuf.coredump.cx/afl/>`_, LLVM's `libFuzzer
|
| +<http://llvm.org/docs/LibFuzzer.html>`_, and custom tooling. The purpose is to
|
| +find and fix cases where Subzero crashes or otherwise ungracefully fails on
|
| +unexpected inputs, and to do so automatically over a large range of unexpected
|
| +inputs. By fixing bugs that arise from fuzz testing, we reduce the possibility
|
| +of an attacker exploiting these bugs.
|
| +
|
| +Most of the problems found so far are ones most appropriately handled in the
|
| +parser. However, a couple have identified problems in the lowering, or have
|
| +otherwise inappropriately triggered assertion failures and fatal errors. We
|
| +continue to dig into this area.
|
| +
|
| +Future security work
|
| +^^^^^^^^^^^^^^^^^^^^
|
| +
|
| +Subzero is well-positioned to explore other future security enhancements, e.g.:
|
| +
|
| +- Tightening the Native Client sandbox. ABI changes, such as the previous work
|
| + on `hiding the sandbox base address
|
| + <https://docs.google.com/document/d/1eskaI4353XdsJQFJLRnZzb_YIESQx4gNRzf31dqXVG8>`_
|
| + in x86-64, are easy to experiment with in Subzero.
|
| +
|
| +- Making the executable code section read-only. This would prevent a PNaCl
|
| + application from inspecting its own binary and trying to find ROP gadgets even
|
| + after code diversification has been performed. It may still be susceptible to
|
| + `blind ROP <http://www.scs.stanford.edu/brop/bittau-brop.pdf>`_ attacks,
|
| +  but security is still improved overall.
|
| +
|
| +- Instruction selection diversification. It may be possible to lower a given
|
| + instruction in several largely equivalent ways, which gives more opportunities
|
| + for code randomization.
|
| +
|
| +Chrome integration
|
| +------------------
|
| +
|
| +Currently Subzero is available in Chrome for the x86-32 architecture, but under
|
| +a flag. When the flag is enabled, Subzero is used when the `manifest file
|
| +<https://developer.chrome.com/native-client/reference/nacl-manifest-format>`_
|
| +linking to the ``.pexe`` file specifies the ``O0`` optimization level.
|
| +
|
| +The next step is to remove the flag, i.e. invoke Subzero as the only translator
|
| +for ``O0``-specified manifest files.
|
| +
|
| +Ultimately, Subzero might produce code rivaling LLVM ``O2`` quality, in which
|
| +case Subzero could be used for all PNaCl translation.
|
| +
|
| +Command line options
|
| +--------------------
|
| +
|
| +Subzero has a number of command-line options for debugging and diagnostics.
|
| +Among the more interesting are the following.
|
| +
|
| +- Using the ``-verbose`` flag, Subzero will dump the CFG, or produce other
|
| + diagnostic output, with various levels of detail after each pass. Instruction
|
| + numbers can be printed or suppressed. Deleted instructions can be printed or
|
| + suppressed (they are retained in the instruction list, as discussed earlier,
|
| + because they can help explain how lower-level instructions originated).
|
| + Liveness information can be printed when available. Details of register
|
| + allocation can be printed as register allocator decisions are made. And more.
|
| +
|
| +- Running Subzero with any level of verbosity produces an enormous amount of
|
| +  output. When debugging a single function, the ``-verbose-focus`` flag
|
| +  suppresses verbose output except for the specified function.
|
| +
|
| +- Subzero has a ``-timing`` option that prints a breakdown of pass-level timing
|
| + at exit. Timing markers can be placed in the Subzero source code to demarcate
|
| + logical operations or passes of interest. Basic timing information plus
|
| + call-stack type timing information is printed at the end.
|
| +
|
| +- In addition to ``-timing``, the user can get a report on the overall
|
| + translation time for each function, to help focus on timing outliers. Also,
|
| + ``-timing-focus`` limits the ``-timing`` reporting to a single function,
|
| + instead of aggregating pass timing across all functions.
|
| +
|
| +- The ``-szstats`` option reports various statistics on each function, such as
|
| + stack frame size, static instruction count, etc. It may be helpful to track
|
| + these stats over time as Subzero is improved, as an approximate measure of
|
| + code quality.
|
| +
|
| +- The flag ``-asm-verbose``, in conjunction with emitting textual assembly
|
| +  output, annotates the assembly output with register-focused liveness
|
| + information. In particular, each basic block is annotated with which
|
| + registers are live-in and live-out, and each instruction is annotated with
|
| + which registers' and stack locations' live ranges end at that instruction.
|
| + This is really useful when studying the generated code to find opportunities
|
| + for code quality improvements.
|
| +
|
| +Testing and debugging
|
| +---------------------
|
| +
|
| +LLVM lit tests
|
| +^^^^^^^^^^^^^^
|
| +
|
| +For basic testing, Subzero uses LLVM's `lit
|
| +<http://llvm.org/docs/CommandGuide/lit.html>`_ framework for running tests. We
|
| +have a suite of hundreds of small functions where we test for particular
|
| +assembly code patterns across different target architectures.
|
| +
|
| +Cross tests
|
| +^^^^^^^^^^^
|
| +
|
| +Unfortunately, the lit tests don't do a great job of precisely testing the
|
| +correctness of the output. Much better are the cross tests, which are execution
|
| +tests that compare Subzero and ``pnacl-llc`` translated bitcode across a wide
|
| +variety of interesting inputs. Each cross test consists of a set of C, C++,
|
| +and/or low-level bitcode files. The C and C++ source files are compiled down to
|
| +bitcode. The bitcode files are translated by ``pnacl-llc`` and also by Subzero.
|
| +Subzero mangles global symbol names with a special prefix to avoid duplicate
|
| +symbol errors. A driver program invokes both versions on a large set of
|
| +interesting inputs, and reports when the Subzero and ``pnacl-llc`` results
|
| +differ. Cross tests turn out to be an excellent way of testing the basic
|
| +lowering patterns, but they are less useful for testing more global things like
|
| +liveness analysis and register allocation.
|
| +
|
| +Bisection debugging
|
| +^^^^^^^^^^^^^^^^^^^
|
| +
|
| +Sometimes with a new application, Subzero will end up producing incorrect code
|
| +that either crashes at runtime or otherwise produces the wrong results. When
|
| +this happens, we need to narrow it down to a single function (or small set of
|
| +functions) that yield incorrect behavior. For this, we have a bisection
|
| +debugging framework. Here, we initially translate the entire application once
|
| +with Subzero and once with ``pnacl-llc``. We then use ``objcopy`` to
|
| +selectively weaken symbols based on a whitelist or blacklist provided on the
|
| +command line. The two object files can then be linked together without link
|
| +errors, with the desired version of each method "winning". Then the binary is
|
| +tested, and bisection proceeds based on whether the binary produces correct
|
| +output.
|
| +
|
| +When the bisection completes, we are left with a minimal set of
|
| +Subzero-translated functions that cause the failure. Usually it is a single
|
| +function, though sometimes it might require a combination of several functions
|
| +to cause a failure; this may be due to an incorrect call ABI, for example.
|
| +However, Murphy's Law implies that the single failing function is enormous and
|
| +impractical to debug. In that case, we can restart the bisection, explicitly
|
| +blacklisting the enormous function, and try to find another candidate to debug.
|
| +(Future work is to automate this to find all minimal sets of functions, so that
|
| +debugging can focus on the simplest example.)
|
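| +The halving search can be sketched as follows. This is an illustrative
|
| +Python sketch; ``binary_fails`` is a hypothetical stand-in for the
|
| +link-and-test step described above:
|
```python
def bisect_failing(functions, binary_fails):
    """Narrow a failure down to a small set of Subzero-translated functions.

    binary_fails(subset) is assumed to link a binary that takes Subzero's
    code for `subset` and pnacl-llc's code for everything else, run it,
    and report whether the output is wrong."""
    candidates = list(functions)
    while len(candidates) > 1:
        half = len(candidates) // 2
        first, second = candidates[:half], candidates[half:]
        if binary_fails(first):
            candidates = first
        elif binary_fails(second):
            candidates = second
        else:
            # Neither half fails on its own: the failure needs functions
            # from both halves (e.g. an incorrect call ABI), so simple
            # halving stops with the current combination.
            break
    return candidates

# Toy run: the failure is caused by the single function "f3".
found = bisect_failing(["f%d" % i for i in range(8)], lambda s: "f3" in s)
assert found == ["f3"]
```
| +
|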
| +
|
| +Fuzz testing
|
| +^^^^^^^^^^^^
|
| +
|
| +As described above, we try to find internal Subzero bugs using fuzz testing
|
| +techniques.
|
| +
|
| +Sanitizers
|
| +^^^^^^^^^^
|
| +
|
| +Subzero can be built with `AddressSanitizer
|
| +<http://clang.llvm.org/docs/AddressSanitizer.html>`_ (ASan) or `ThreadSanitizer
|
| +<http://clang.llvm.org/docs/ThreadSanitizer.html>`_ (TSan) support. This is
|
| +done using something as simple as ``make ASAN=1`` or ``make TSAN=1``. So far,
|
| +multithreading has been simple enough that TSan hasn't found any bugs, but ASan
|
| +has found at least one memory leak which was subsequently fixed.
|
| +`UndefinedBehaviorSanitizer
|
| +<http://clang.llvm.org/docs/UsersManual.html#controlling-code-generation>`_
|
| +(UBSan) support is in progress. `Control flow integrity sanitization
|
| +<http://clang.llvm.org/docs/ControlFlowIntegrity.html>`_ is also under
|
| +consideration.
|
| +
|
| +Current status
|
| +==============
|
| +
|
| +Target architectures
|
| +--------------------
|
| +
|
| +Subzero is currently more or less complete for the x86-32 target. It has been
|
| +refactored and extended to handle x86-64 as well, and that is mostly complete at
|
| +this point.
|
| +
|
| +ARM32 work is in progress. It currently lacks the testing level of x86, at
|
| +least in part because Subzero's register allocator needs modifications to handle
|
| +ARM's aliasing of floating point and vector registers. Specifically, a 64-bit
|
| +register is actually a gang of two consecutive and aligned 32-bit registers, and
|
| +a 128-bit register is a gang of four consecutive and aligned 32-bit registers.
|
| +ARM64 work has not started; when it does, it will be native-only, since the
|
| +Native Client sandbox model, validator, and other tools have never been
|
| +defined for ARM64.
|
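| +In terms of ARM's register naming, the aliasing can be illustrated by the
|
| +following Python sketch, which is restricted to the S-aliased subset of the
|
| +register file (``D0``-``D15`` and ``Q0``-``Q7``):
|
```python
def aliased_s_regs(reg):
    """Return the indices of the 32-bit S registers that `reg` overlaps:
    D<k> aliases S<2k> and S<2k+1>; Q<k> aliases S<4k>..S<4k+3>."""
    kind, num = reg[0], int(reg[1:])
    if kind == "S":
        return [num]
    if kind == "D":
        return [2 * num, 2 * num + 1]
    if kind == "Q":
        return list(range(4 * num, 4 * num + 4))
    raise ValueError("unknown register " + reg)

assert aliased_s_regs("D1") == [2, 3]
assert aliased_s_regs("Q1") == [4, 5, 6, 7]
# A register allocator must treat D1 and S2 as conflicting, since they
# share an underlying 32-bit register:
assert set(aliased_s_regs("D1")) & set(aliased_s_regs("S2"))
```
| +
|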
| +
|
| +An external contributor is adding MIPS support, in large part by following the
|
| +ARM work.
|
| +
|
| +Translator performance
|
| +----------------------
|
| +
|
| +Single-threaded translation speed is currently about 5× the ``pnacl-llc``
|
| +translation speed. For a large ``.pexe`` file, the time breaks down as:
|
| +
|
| +- 11% for parsing and initial IR building
|
| +
|
| +- 4% for emitting to /dev/null
|
| +
|
| +- 27% for liveness analysis (two liveness passes plus live range construction)
|
| +
|
| +- 15% for linear-scan register allocation
|
| +
|
| +- 9% for basic lowering
|
| +
|
| +- 10% for advanced phi lowering
|
| +
|
| +- ~11% for other minor analysis
|
| +
|
| +- ~10% measurement overhead to acquire these numbers
|
| +
|
| +Some improvements could undoubtedly be made, but it will be hard to increase the
|
| +speed to 10× that of ``pnacl-llc`` while keeping acceptable code quality. With
|
| +``-Om1`` (lack of) optimization, we do actually achieve roughly 10×
|
| +``pnacl-llc`` translation speed, but code quality drops by a factor of 3.
|
| +
|
| +Code quality
|
| +------------
|
| +
|
| +Measured across 16 components of spec2k, Subzero's code quality is uniformly
|
| +better than ``pnacl-llc`` ``-O0`` code quality, and in many cases solidly
|
| +between ``pnacl-llc`` ``-O0`` and ``-O2``.
|
| +
|
| +Translator size
|
| +---------------
|
| +
|
| +When built in MINIMAL mode, the x86-64 native translator size for the x86-32
|
| +target is about 700 KB, not including the size of functions referenced in
|
| +dynamically-linked libraries. The sandboxed version of Subzero is a bit over 1
|
| +MB, and it is statically linked and also includes nop padding for bundling as
|
| +well as indirect branch masking.
|
| +
|
| +Translator memory footprint
|
| +---------------------------
|
| +
|
| +It's hard to draw firm conclusions about memory footprint, since the footprint
|
| +is at least proportional to the input function size, and there is no real limit
|
| +on the size of functions in the ``.pexe`` file.
|
| +
|
| +That said, we looked at the memory footprint over time as Subzero translated
|
| +``pnacl-llc.pexe``, which is the largest ``.pexe`` file (7.2 MB) at our
|
| +disposal. One of LLVM's libraries that Subzero uses can report the current
|
| +malloc heap usage. With single-threaded translation, Subzero tends to hover
|
| +around 15 MB of memory usage. There are a couple of monstrous functions where
|
| +Subzero grows to around 100 MB, but then it drops back down after those
|
| +functions finish translating. In contrast, ``pnacl-llc`` grows larger and
|
| +larger throughout translation, reaching several hundred MB by the time it
|
| +completes.
|
| +
|
| +It's a bit more interesting when we enable multithreaded translation. When
|
| +there are N translation threads, Subzero implements a policy that limits the
|
| +size of the translation queue to N entries -- if it is "full" when the parser
|
| +tries to add a new CFG, the parser blocks until one of the translation threads
|
| +removes a CFG. This means the number of in-memory CFGs can (and generally does)
|
| +reach 2*N+1, and so the memory footprint rises in proportion to the number of
|
| +threads. Adding to the pressure is the observation that the monstrous functions
|
| +also take proportionally longer to translate, so there's a good chance many
|
| +of the monstrous functions will be active at the same time with multithreaded
|
| +translation. As a result, for N=32, Subzero's memory footprint peaks at about
|
| +260 MB, but drops back down as the large functions finish translating.
|
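| +The queue policy can be sketched with a bounded queue. This illustrative
|
| +Python sketch shows the structure (the parser blocks when the queue is
|
| +full), not Subzero's actual code:
|
```python
import queue
import threading

N = 4                             # number of translation threads
work = queue.Queue(maxsize=N)     # parser blocks once N CFGs are queued
translated = []

def translator():
    while True:
        cfg = work.get()          # one of at most 2*N+1 in-memory CFGs:
        if cfg is None:           # N queued, N in flight, 1 being parsed
            break
        translated.append(cfg)    # stand-in for translating the CFG
        work.task_done()

threads = [threading.Thread(target=translator) for _ in range(N)]
for t in threads:
    t.start()
for cfg_id in range(20):          # the "parser" produces 20 CFGs; put()
    work.put(cfg_id)              # blocks whenever the queue is full
work.join()
for _ in threads:
    work.put(None)                # poison pills to shut the workers down
for t in threads:
    t.join()
assert sorted(translated) == list(range(20))
```
| +
|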
| +
|
| +If this peak memory size becomes a problem, it might be possible for the parser
|
| +to resequence the functions to try to spread out the larger functions, or to
|
| +throttle the translation queue to prevent too many in-flight large functions.
|
| +It may also be possible to throttle based on memory pressure signaling from
|
| +Chrome.
|
| +
|
| +Translator scalability
|
| +----------------------
|
| +
|
| +Currently scalability is "not very good". Multiple translation threads lead to
|
| +faster translation, but not to the degree desired. We haven't dug in to
|
| +investigate yet.
|
| +
|
| +There are a few areas to investigate. First, there may be contention on the
|
| +constant pool, which all threads access, and which requires locked access even
|
| +for reading. This could be mitigated by keeping a CFG-local cache of the most
|
| +common constants.
|
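| +The proposed mitigation can be sketched as follows. This is an illustrative
|
| +Python sketch with hypothetical class names, counting lock acquisitions to
|
| +show the effect of a CFG-local cache:
|
```python
import threading

class GlobalConstantPool:
    """Shared pool of interned constants; every lookup takes the lock,
    even for reading."""
    def __init__(self):
        self._lock = threading.Lock()
        self._pool = {}
        self.lock_acquisitions = 0
    def intern(self, value):
        with self._lock:
            self.lock_acquisitions += 1
            return self._pool.setdefault(value, value)

class CfgLocalConstantCache:
    """Per-CFG front cache: repeated lookups of the same constant skip
    the global lock entirely."""
    def __init__(self, pool):
        self._pool = pool
        self._cache = {}
    def intern(self, value):
        if value not in self._cache:
            self._cache[value] = self._pool.intern(value)
        return self._cache[value]

pool = GlobalConstantPool()
cache = CfgLocalConstantCache(pool)
for _ in range(1000):
    cache.intern(0)               # the same common constant, many times
assert pool.lock_acquisitions == 1
```
| +
|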
| +
|
| +Second, there may be contention on memory allocation. While almost all CFG
|
| +objects are allocated from the CFG-local allocator, some passes use temporary
|
| +STL containers that use the default allocator, which may require global locking.
|
| +This could be mitigated by switching these to the CFG-local allocator.
|
| +
|
| +Third, multithreading may make the default allocator strategy more expensive.
|
| +In a single-threaded environment, a pass will allocate its containers, run the
|
| +pass, and deallocate the containers. This results in stack-like allocation
|
| +behavior and makes the heap free list easier to manage, with less heap
|
| +fragmentation. But when multithreading is added, the allocations and
|
| +deallocations become much less stack-like, making allocation and deallocation
|
| +operations individually more expensive. Again, this could be mitigated by
|
| +switching these to the CFG-local allocator.
|
|
|