| Index: DESIGN.rst
|
| diff --git a/DESIGN.rst b/DESIGN.rst
|
| deleted file mode 100644
|
| index 3d324f64ba81798b5a7a038b24f265bbe292254b..0000000000000000000000000000000000000000
|
| --- a/DESIGN.rst
|
| +++ /dev/null
|
| @@ -1,1494 +0,0 @@
|
| -Design of the Subzero fast code generator
|
| -=========================================
|
| -
|
| -Introduction
|
| -------------
|
| -
|
| -The `Portable Native Client (PNaCl) <http://gonacl.com>`_ project includes
|
| -compiler technology based on `LLVM <http://llvm.org/>`_. The developer uses the
|
| -PNaCl toolchain to compile their application to architecture-neutral PNaCl
|
| -bitcode (a ``.pexe`` file), using as much architecture-neutral optimization as
|
| -possible. The ``.pexe`` file is downloaded to the user's browser where the
|
| -PNaCl translator (a component of Chrome) compiles the ``.pexe`` file to
|
| -`sandboxed
|
| -<https://developer.chrome.com/native-client/reference/sandbox_internals/index>`_
|
| -native code. The translator uses architecture-specific optimizations as much as
|
| -practical to generate good native code.
|
| -
|
| -The native code can be cached by the browser to avoid repeating translation on
|
| -future page loads. However, first-time user experience is hampered by long
|
| -translation times. The LLVM-based PNaCl translator is pretty slow, even when
|
| -using ``-O0`` to minimize optimizations, so delays are especially noticeable on
|
| -slow browser platforms such as ARM-based Chromebooks.
|
| -
|
| -Translator slowness can be mitigated or hidden in a number of ways.
|
| -
|
| -- Parallel translation. However, slow machines where this matters most, e.g.
|
| - ARM-based Chromebooks, are likely to have fewer cores to parallelize across,
|
| -  and are likely to have less memory available for multiple translation
|
| -  threads to use.
|
| -
|
| -- Streaming translation, i.e. start translating as soon as the download starts.
|
| - This doesn't help much when translation speed is 10× slower than download
|
| - speed, or the ``.pexe`` file is already cached while the translated binary was
|
| - flushed from the cache.
|
| -
|
| -- Arrange the web page such that translation is done in parallel with
|
| - downloading large assets.
|
| -
|
| -- Arrange the web page to distract the user with `cat videos
|
| - <https://www.youtube.com/watch?v=tLt5rBfNucc>`_ while translation is in
|
| - progress.
|
| -
|
| -Or, improve translator performance to something more reasonable.
|
| -
|
| -This document describes Subzero's attempt to improve translation speed by an
|
| -order of magnitude while rivaling LLVM's code quality. Subzero does this
|
| -through minimal IR layering, lean data structures and passes, and a careful
|
| -selection of fast optimization passes. It has two optimization recipes: full
|
| -optimizations (``O2``) and minimal optimizations (``Om1``). The recipes are the
|
| -following (described in more detail below):
|
| -
|
| -+-----------------------------------+-----------------------+
|
| -| O2 recipe | Om1 recipe |
|
| -+===================================+=======================+
|
| -| Parse .pexe file | Parse .pexe file |
|
| -+-----------------------------------+-----------------------+
|
| -| Loop nest analysis | |
|
| -+-----------------------------------+-----------------------+
|
| -| Address mode inference | |
|
| -+-----------------------------------+-----------------------+
|
| -| Read-modify-write (RMW) transform | |
|
| -+-----------------------------------+-----------------------+
|
| -| Basic liveness analysis | |
|
| -+-----------------------------------+-----------------------+
|
| -| Load optimization | |
|
| -+-----------------------------------+-----------------------+
|
| -| | Phi lowering (simple) |
|
| -+-----------------------------------+-----------------------+
|
| -| Target lowering | Target lowering |
|
| -+-----------------------------------+-----------------------+
|
| -| Full liveness analysis | |
|
| -+-----------------------------------+-----------------------+
|
| -| Register allocation | |
|
| -+-----------------------------------+-----------------------+
|
| -| Phi lowering (advanced) | |
|
| -+-----------------------------------+-----------------------+
|
| -| Post-phi register allocation | |
|
| -+-----------------------------------+-----------------------+
|
| -| Branch optimization | |
|
| -+-----------------------------------+-----------------------+
|
| -| Code emission | Code emission |
|
| -+-----------------------------------+-----------------------+
|
| -
|
| -Goals
|
| -=====
|
| -
|
| -Translation speed
|
| ------------------
|
| -
|
| -We'd like to be able to translate a ``.pexe`` file as fast as download speed.
|
| -Any faster is in a sense wasted effort. Download speed varies greatly, but
|
| -we'll arbitrarily say 1 MB/sec. We'll pick the ARM A15 CPU as the example of a
|
| -slow machine. We observe a 3× single-thread performance difference between A15
|
| -and a high-end x86 Xeon E5-2690 based workstation, and aggressively assume a
|
| -``.pexe`` file could be compressed to 50% on the web server using gzip transport
|
| -compression. Keeping up with the download therefore means translating 2 MB/sec
|
| -of uncompressed bitcode on the A15; applying the 3× factor, we set the
|
| -translation speed goal to 6 MB/sec on the high-end Xeon workstation.
|
| -
|
| -Currently, at the ``-O0`` level, the LLVM-based PNaCl translator translates at
|
| -⅒ the target rate. The ``-O2`` mode takes 3× as long as the ``-O0`` mode.
|
| -
|
| -In other words, Subzero's goal is to improve over LLVM's translation speed by
|
| -10×.
|
| -
|
| -Code quality
|
| -------------
|
| -
|
| -Subzero's initial goal is to produce code that meets or exceeds LLVM's ``-O0``
|
| -code quality. The stretch goal is to approach LLVM ``-O2`` code quality. On
|
| -average, LLVM ``-O2`` performs twice as well as LLVM ``-O0``.
|
| -
|
| -It's important to note that the quality of Subzero-generated code depends on
|
| -target-neutral optimizations and simplifications being run beforehand in the
|
| -developer environment. The ``.pexe`` file reflects these optimizations. For
|
| -example, Subzero assumes that the basic blocks are ordered topologically where
|
| -possible (which makes liveness analysis converge fastest), and Subzero does not
|
| -do any function inlining because it should already have been done.
|
| -
|
| -Translator size
|
| ----------------
|
| -
|
| -The current LLVM-based translator binary (``pnacl-llc``) is about 10 MB in size.
|
| -We think 1 MB is a more reasonable size -- especially for such a component that
|
| -is distributed to a billion Chrome users. Thus we target a 10× reduction in
|
| -binary size.
|
| -
|
| -For development, Subzero can be built for all target architectures, and all
|
| -debugging and diagnostic options enabled. For a smaller translator, we restrict
|
| -to a single target architecture, and define a ``MINIMAL`` build where
|
| -unnecessary features are compiled out.
|
| -
|
| -Subzero leverages some data structures from LLVM's ``ADT`` and ``Support``
|
| -include directories, which have little impact on translator size. It also uses
|
| -some of LLVM's bitcode decoding code (for binary-format ``.pexe`` files), again
|
| -with little size impact. In non-``MINIMAL`` builds, the translator size is much
|
| -larger due to including code for parsing text-format bitcode files and forming
|
| -LLVM IR.
|
| -
|
| -Memory footprint
|
| -----------------
|
| -
|
| -The current LLVM-based translator suffers from an issue in which some
|
| -function-specific data has to be retained in memory until all translation
|
| -completes, and therefore the memory footprint grows without bound. Large
|
| -``.pexe`` files can lead to the translator process holding hundreds of MB of
|
| -memory by the end. The translator runs in a separate process, so this memory
|
| -growth doesn't *directly* affect other processes, but it does dirty the physical
|
| -memory and contributes to a perception of bloat and sometimes a reality of
|
| -out-of-memory tab killing, especially noticeable on weaker systems.
|
| -
|
| -Subzero should maintain a stable memory footprint throughout translation. It's
|
| -not practical to set a specific limit, because there is no practical limit on
|
| -a single function's size, but the footprint should be
|
| -"reasonable" and be proportional to the largest input function size, not the
|
| -total ``.pexe`` file size. Simply put, Subzero should not have memory leaks or
|
| -inexorable memory growth. (We use ASAN builds to test for leaks.)
|
| -
|
| -Multithreaded translation
|
| --------------------------
|
| -
|
| -It should be practical to translate different functions concurrently and see
|
| -good scalability. Some locking may be needed, such as accessing output buffers
|
| -or constant pools, but that should be fairly minimal. In contrast, LLVM was
|
| -only designed for module-level parallelism, and as such, the PNaCl translator
|
| -internally splits a ``.pexe`` file into several modules for concurrent
|
| -translation. All output needs to be deterministic regardless of the level of
|
| -multithreading, i.e. functions and data should always be output in the same
|
| -order.
|
| -
|
| -Target architectures
|
| ---------------------
|
| -
|
| -Initial target architectures are x86-32, x86-64, ARM32, and MIPS32. Future
|
| -targets include ARM64 and MIPS64, though these targets lack NaCl support,
|
| -including a sandbox model and validator.
|
| -
|
| -The first implementation is for x86-32, because it was expected to be
|
| -particularly challenging, and thus more likely to draw out any design problems
|
| -early:
|
| -
|
| -- There are a number of special cases, asymmetries, and warts in the x86
|
| - instruction set.
|
| -
|
| -- Complex addressing modes may be leveraged for better code quality.
|
| -
|
| -- 64-bit integer operations have to be lowered into longer sequences of 32-bit
|
| - operations.
|
| -
|
| -- Paucity of physical registers may reveal code quality issues early in the
|
| - design.
|
| -
|
| -Detailed design
|
| -===============
|
| -
|
| -Intermediate representation - ICE
|
| ----------------------------------
|
| -
|
| -Subzero's IR is called ICE. It is designed to be reasonably similar to LLVM's
|
| -IR, which is reflected in the ``.pexe`` file's bitcode structure. It has a
|
| -representation of global variables and initializers, and a set of functions.
|
| -Each function contains a list of basic blocks, and each basic block contains a
|
| -list of instructions. Instructions that operate on stack and register variables
|
| -do so using static single assignment (SSA) form.
|
| -
|
| -The ``.pexe`` file is translated one function at a time (or in parallel by
|
| -multiple translation threads). The recipe for optimization passes depends on
|
| -the specific target and optimization level, and is described in detail below.
|
| -Global variables (types and initializers) are simply and directly translated to
|
| -object code, without any meaningful attempts at optimization.
|
| -
|
| -A function's control flow graph (CFG) is represented by the ``Ice::Cfg`` class.
|
| -Its key contents include:
|
| -
|
| -- A list of ``CfgNode`` pointers, generally held in topological order.
|
| -
|
| -- A list of ``Variable`` pointers corresponding to local variables used in the
|
| - function plus compiler-generated temporaries.
|
| -
|
| -A basic block is represented by the ``Ice::CfgNode`` class. Its key contents
|
| -include:
|
| -
|
| -- A linear list of instructions, in the same style as LLVM. The last
|
| - instruction of the list is always a terminator instruction: branch, switch,
|
| - return, unreachable.
|
| -
|
| -- A list of Phi instructions, also in the same style as LLVM. They are held as
|
| - a linear list for convenience, though per Phi semantics, they are executed "in
|
| - parallel" without dependencies on each other.
|
| -
|
| -- An unordered list of ``CfgNode`` pointers corresponding to incoming edges, and
|
| - another list for outgoing edges.
|
| -
|
| -- The node's unique, 0-based index into the CFG's node list.
|
| -
|
| -An instruction is represented by the ``Ice::Inst`` class. Its key contents
|
| -include:
|
| -
|
| -- A list of source operands.
|
| -
|
| -- Its destination variable, if the instruction produces a result in an
|
| - ``Ice::Variable``.
|
| -
|
| -- A bitvector indicating which variables' live ranges this instruction ends.
|
| - This is computed during liveness analysis.
|
| -
|
| -Instruction kinds are divided into high-level ICE instructions and low-level
|
| -ICE instructions. High-level instructions consist of the PNaCl/LLVM bitcode
|
| -instruction kinds. Each target architecture implementation extends the
|
| -instruction space with its own set of low-level instructions. Generally,
|
| -low-level instructions correspond to individual machine instructions. The
|
| -high-level ICE instruction space includes a few additional instruction kinds
|
| -that are not part of LLVM but are generally useful (e.g., an Assignment
|
| -instruction), or are useful across targets (e.g., BundleLock and BundleUnlock
|
| -instructions for sandboxing).
|
| -
|
| -Specifically, high-level ICE instructions that derive from LLVM (but with PNaCl
|
| -ABI restrictions as documented in the `PNaCl Bitcode Reference Manual
|
| -<https://developer.chrome.com/native-client/reference/pnacl-bitcode-abi>`_) are
|
| -the following:
|
| -
|
| -- Alloca: allocate data on the stack
|
| -
|
| -- Arithmetic: binary operations of the form ``A = B op C``
|
| -
|
| -- Br: conditional or unconditional branch
|
| -
|
| -- Call: function call
|
| -
|
| -- Cast: unary type-conversion operations
|
| -
|
| -- ExtractElement: extract a scalar element from a vector-type value
|
| -
|
| -- Fcmp: floating-point comparison
|
| -
|
| -- Icmp: integer comparison
|
| -
|
| -- IntrinsicCall: call a known intrinsic
|
| -
|
| -- InsertElement: insert a scalar element into a vector-type value
|
| -
|
| -- Load: load a value from memory
|
| -
|
| -- Phi: implement the SSA phi node
|
| -
|
| -- Ret: return from the function
|
| -
|
| -- Select: essentially the C language operation of the form ``X = C ? Y : Z``
|
| -
|
| -- Store: store a value into memory
|
| -
|
| -- Switch: generalized branch to multiple possible locations
|
| -
|
| -- Unreachable: indicate that this portion of the code is unreachable
|
| -
|
| -The additional high-level ICE instructions are the following:
|
| -
|
| -- Assign: a simple ``A=B`` assignment. This is useful for e.g. lowering Phi
|
| - instructions to non-SSA assignments, before lowering to machine code.
|
| -
|
| -- BundleLock, BundleUnlock. These are markers used for sandboxing, but are
|
| - common across all targets and so they are elevated to the high-level
|
| - instruction set.
|
| -
|
| -- FakeDef, FakeUse, FakeKill. These are tools used to preserve consistency in
|
| - liveness analysis, elevated to the high-level because they are used by all
|
| - targets. They are described in more detail at the end of this section.
|
| -
|
| -- JumpTable: this represents the result of switch optimization analysis, where
|
| - some switch instructions may use jump tables instead of cascading
|
| - compare/branches.
|
| -
|
| -An operand is represented by the ``Ice::Operand`` class. In high-level ICE, an
|
| -operand is either an ``Ice::Constant`` or an ``Ice::Variable``. Constants
|
| -include scalar integer constants, scalar floating point constants, Undef (an
|
| -unspecified constant of a particular scalar or vector type), and symbol
|
| -constants (essentially addresses of globals). Note that the PNaCl ABI does not
|
| -include vector-type constants besides Undef, and as such, Subzero (so far) has
|
| -no reason to represent vector-type constants internally. A variable represents
|
| -a value allocated on the stack (though not including alloca-derived storage).
|
| -Among other things, a variable holds its unique, 0-based index into the CFG's
|
| -variable list.
|
| -
|
| -Each target can extend the ``Constant`` and ``Variable`` classes for its own
|
| -needs. In addition, the ``Operand`` class may be extended, e.g. to define an
|
| -x86 ``MemOperand`` that encodes a base register, an index register, an index
|
| -register shift amount, and a constant offset.
|
| -
|
| -Register allocation and liveness analysis are restricted to Variable operands.
|
| -Because of the importance of register allocation to code quality, and the
|
| -translation-time cost of liveness analysis, Variable operands get some special
|
| -treatment in ICE. Most notably, a frequent pattern in Subzero is to iterate
|
| -across all the Variables of an instruction. An instruction holds a list of
|
| -operands, but an operand may contain 0, 1, or more Variables. As such, the
|
| -``Operand`` class specially holds a list of Variables contained within, for
|
| -quick access.
|
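| -
|
| -The relationships above can be summarized with a simplified C++ skeleton of
|
| -the core classes. This is an illustrative sketch only, not Subzero's actual
|
| -headers; the real classes in the ``Ice`` namespace carry many more members and
|
| -target-specific hooks::
|
| -
|
| -  #include <cstdint>
|
| -  #include <vector>
|
| -
|
| -  struct Variable;
|
| -
|
| -  struct Operand {
|
| -    // Every operand exposes the Variables it contains, for quick iteration.
|
| -    std::vector<Variable *> ContainedVars;
|
| -  };
|
| -
|
| -  struct Variable : Operand {
|
| -    uint32_t Index;      // Unique, 0-based index into the CFG's variable list.
|
| -    int32_t RegNum = -1; // Physical register assigned later, or -1 for none.
|
| -  };
|
| -
|
| -  struct Inst {
|
| -    std::vector<Operand *> Srcs;       // Source operands.
|
| -    Variable *Dest = nullptr;          // Destination, if any.
|
| -    std::vector<bool> LiveRangesEnded; // One bit per Variable whose live
|
| -                                       // range ends at this instruction.
|
| -    bool Deleted = false;              // Marked deleted rather than erased.
|
| -  };
|
| -
|
| -  struct CfgNode {
|
| -    uint32_t Index;                  // Unique, 0-based index in the node list.
|
| -    std::vector<Inst *> Phis;        // Phis, executed "in parallel".
|
| -    std::vector<Inst *> Insts;       // Ends with a terminator instruction.
|
| -    std::vector<CfgNode *> InEdges;  // Incoming edges (predecessors).
|
| -    std::vector<CfgNode *> OutEdges; // Outgoing edges (successors).
|
| -  };
|
| -
|
| -  struct Cfg {
|
| -    std::vector<CfgNode *> Nodes;      // Generally in topological order.
|
| -    std::vector<Variable *> Variables; // Locals plus compiler temporaries.
|
| -  };
|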
| -
|
| -A Subzero transformation pass may work by deleting an existing instruction and
|
| -replacing it with zero or more new instructions. Instead of actually deleting
|
| -the existing instruction, we generally mark it as deleted and insert the new
|
| -instructions right after the deleted instruction. When printing the IR for
|
| -debugging, this is a big help because it makes it much more clear how the
|
| -non-deleted instructions came about.
|
| -
|
| -Subzero has a few special instructions to help with liveness analysis
|
| -consistency.
|
| -
|
| -- The FakeDef instruction gives a fake definition of some variable. For
|
| - example, on x86-32, a divide instruction defines both ``%eax`` and ``%edx``
|
| - but an ICE instruction can represent only one destination variable. This is
|
| - similar for multiply instructions, and for function calls that return a 64-bit
|
| - integer result in the ``%edx:%eax`` pair. Also, using the ``xor %eax, %eax``
|
| - trick to set ``%eax`` to 0 requires an initial FakeDef of ``%eax``.
|
| -
|
| -- The FakeUse instruction registers a use of a variable, typically to prevent an
|
| - earlier assignment to that variable from being dead-code eliminated. For
|
| - example, lowering an operation like ``x=cc?y:z`` may be done using x86's
|
| - conditional move (cmov) instruction: ``mov z, x; cmov_cc y, x``. Without a
|
| - FakeUse of ``x`` between the two instructions, the liveness analysis pass may
|
| - dead-code eliminate the first instruction.
|
| -
|
| -- The FakeKill instruction is added after a call instruction, and is a quick way
|
| - of indicating that caller-save registers are invalidated.
|
| -
|
| -Pexe parsing
|
| -------------
|
| -
|
| -Subzero includes an integrated PNaCl bitcode parser for ``.pexe`` files. It
|
| -parses the ``.pexe`` file function by function, ultimately constructing an ICE
|
| -CFG for each function. After a function is parsed, its CFG is handed off to the
|
| -translation phase. The bitcode parser also parses global initializer data and
|
| -hands it off to be translated to data sections in the object file.
|
| -
|
| -Subzero has another parsing strategy for testing/debugging. LLVM libraries can
|
| -be used to parse a module into LLVM IR (though very slowly relative to Subzero
|
| -native parsing). Then we iterate across the LLVM IR and construct high-level
|
| -ICE, handing off each CFG to the translation phase.
|
| -
|
| -Overview of lowering
|
| ---------------------
|
| -
|
| -In general, translation goes like this:
|
| -
|
| -- Parse the next function from the ``.pexe`` file and construct a CFG consisting
|
| - of high-level ICE.
|
| -
|
| -- Do analysis passes and transformation passes on the high-level ICE, as
|
| - desired.
|
| -
|
| -- Lower each high-level ICE instruction into a sequence of zero or more
|
| - low-level ICE instructions. Each high-level instruction is generally lowered
|
| - independently, though the target lowering is allowed to look ahead in the
|
| - CfgNode's instruction list if desired.
|
| -
|
| -- Do more analysis and transformation passes on the low-level ICE, as desired.
|
| -
|
| -- Assemble the low-level CFG into an ELF object file (alternatively, a textual
|
| - assembly file that is later assembled by some external tool).
|
| -
|
| -- Repeat for all functions, and also produce object code for data such as global
|
| - initializers and internal constant pools.
|
| -
|
| -Currently there are two optimization levels: ``O2`` and ``Om1``. For ``O2``,
|
| -the intention is to apply all available optimizations to get the best code
|
| -quality (though the initial code quality goal is measured against LLVM's ``O0``
|
| -code quality). For ``Om1``, the intention is to apply as few optimizations as
|
| -possible and produce code as quickly as possible, accepting poor code quality.
|
| -``Om1`` is short for "O-minus-one", i.e. "worse than O0", or in other words,
|
| -"sub-zero".
|
| -
|
| -High-level debuggability of generated code is so far not a design requirement.
|
| -Subzero doesn't really do transformations that would obfuscate debugging; the
|
| -main thing might be that register allocation (including stack slot coalescing
|
| -for stack-allocated variables whose live ranges don't overlap) may render a
|
| -variable's value unobtainable after its live range ends. This would not be an
|
| -issue for ``Om1`` since it doesn't register-allocate program-level variables,
|
| -nor does it coalesce stack slots. That said, fully supporting debuggability
|
| -would require a few additions:
|
| -
|
| -- DWARF support would need to be added to Subzero's ELF file emitter. Subzero
|
| - propagates global symbol names, local variable names, and function-internal
|
| - label names that are present in the ``.pexe`` file. This would allow a
|
| - debugger to map addresses back to symbols in the ``.pexe`` file.
|
| -
|
| -- To map ``.pexe`` file symbols back to meaningful source-level symbol names,
|
| - file names, line numbers, etc., Subzero would need to handle `LLVM bitcode
|
| - metadata <http://llvm.org/docs/LangRef.html#metadata>`_ and ``llvm.dbg``
|
| -  `intrinsics <http://llvm.org/docs/LangRef.html#dbg-intrinsics>`_.
|
| -
|
| -- The PNaCl toolchain explicitly strips all this from the ``.pexe`` file, and so
|
| - the toolchain would need to be modified to preserve it.
|
| -
|
| -Our experience so far is that ``Om1`` translates twice as fast as ``O2``, but
|
| -produces code with one third the code quality. ``Om1`` is good for testing and
|
| -debugging -- during translation, it tends to expose errors in the basic lowering
|
| -that might otherwise have been hidden by the register allocator or other
|
| -optimization passes. It also helps determine whether a code correctness problem
|
| -is a fundamental problem in the basic lowering, or an error in another
|
| -optimization pass.
|
| -
|
| -The implementation of target lowering also controls the recipe of passes used
|
| -for ``Om1`` and ``O2`` translation. For example, address mode inference may
|
| -only be relevant for x86.
|
| -
|
| -Lowering strategy
|
| ------------------
|
| -
|
| -The core of Subzero's lowering from high-level ICE to low-level ICE is to lower
|
| -each high-level instruction down to a sequence of low-level target-specific
|
| -instructions, in a largely context-free setting. That is, each high-level
|
| -instruction conceptually has a simple template expansion into low-level
|
| -instructions, and lowering can in theory be done in any order. This may sound
|
| -like a small effort, but quite a large number of templates may be needed because
|
| -of the number of PNaCl types and instruction variants. Furthermore, there may
|
| -be optimized templates, e.g. to take advantage of operator commutativity (for
|
| -example, ``x=x+1`` might allow a better lowering than ``x=1+x``). This is
|
| -similar to other template-based approaches in fast code generation or
|
| -interpretation, though some decisions are deferred until after some global
|
| -analysis passes, mostly related to register allocation, stack slot assignment,
|
| -and specific choice of instruction variant and addressing mode.
|
| -
|
| -The key idea for a lowering template is to produce valid low-level instructions
|
| -that are guaranteed to meet address mode and other structural requirements of
|
| -the instruction set. For example, on x86, the source operand of an integer
|
| -store instruction must be an immediate or a physical register; a shift
|
| -instruction's shift amount must be an immediate or in register ``%cl``; a
|
| -function's integer return value is in ``%eax``; most x86 instructions are
|
| -two-operand, in contrast to corresponding three-operand high-level instructions;
|
| -etc.
|
| -
|
| -Because target lowering runs before register allocation, there is no way to know
|
| -whether a given ``Ice::Variable`` operand lives on the stack or in a physical
|
| -register. When the low-level instruction calls for a physical register operand,
|
| -the target lowering can create an infinite-weight Variable. This tells the
|
| -register allocator to assign infinite weight when making decisions, effectively
|
| -guaranteeing some physical register. Variables can also be pre-colored to a
|
| -specific physical register (``cl`` in the shift example above), which also gives
|
| -infinite weight.
|
| -
|
| -To illustrate, consider a high-level arithmetic instruction on 32-bit integer
|
| -operands::
|
| -
|
| - A = B + C
|
| -
|
| -X86 target lowering might produce the following::
|
| -
|
| - T.inf = B // mov instruction
|
| - T.inf += C // add instruction
|
| - A = T.inf // mov instruction
|
| -
|
| -Here, ``T.inf`` is an infinite-weight temporary. As long as ``T.inf`` has a
|
| -physical register, the three lowered instructions are all encodable regardless
|
| -of whether ``B`` and ``C`` are physical registers, memory, or immediates, and
|
| -whether ``A`` is a physical register or in memory.
|
| -
|
| -In this example, ``A`` must be a Variable and one may be tempted to simplify the
|
| -lowering sequence by setting ``A`` as infinite-weight and using::
|
| -
|
| - A = B // mov instruction
|
| - A += C // add instruction
|
| -
|
| -This has two problems. First, if the original instruction was actually ``A =
|
| -B + A``, the result would be incorrect. Second, assigning ``A`` a physical
|
| -register applies throughout ``A``'s entire live range. This is probably not
|
| -what is intended, and may ultimately lead to a failure to allocate a register
|
| -for an infinite-weight variable.
|
| -
|
| -This style of lowering leads to many temporaries being generated, so in ``O2``
|
| -mode, we rely on the register allocator to clean things up. For example, in the
|
| -example above, if ``B`` ends up getting a physical register and its live range
|
| -ends at this instruction, the register allocator is likely to reuse that
|
| -register for ``T.inf``. This leads to ``T.inf=B`` being a redundant register
|
| -copy, which is removed as an emission-time peephole optimization.
|
| -
|
| -O2 lowering
|
| ------------
|
| -
|
| -Currently, the ``O2`` lowering recipe is the following:
|
| -
|
| -- Loop nest analysis
|
| -
|
| -- Address mode inference
|
| -
|
| -- Read-modify-write (RMW) transformation
|
| -
|
| -- Basic liveness analysis
|
| -
|
| -- Load optimization
|
| -
|
| -- Target lowering
|
| -
|
| -- Full liveness analysis
|
| -
|
| -- Register allocation
|
| -
|
| -- Phi instruction lowering (advanced)
|
| -
|
| -- Post-phi lowering register allocation
|
| -
|
| -- Branch optimization
|
| -
|
| -These passes are described in more detail below.
|
| -
|
| -Om1 lowering
|
| -------------
|
| -
|
| -Currently, the ``Om1`` lowering recipe is the following:
|
| -
|
| -- Phi instruction lowering (simple)
|
| -
|
| -- Target lowering
|
| -
|
| -- Register allocation (infinite-weight and pre-colored only)
|
| -
|
| -Optimization passes
|
| --------------------
|
| -
|
| -Liveness analysis
|
| -^^^^^^^^^^^^^^^^^
|
| -
|
| -Liveness analysis is a standard dataflow optimization, implemented as follows.
|
| -For each node (basic block), its live-out set is computed as the union of the
|
| -live-in sets of its successor nodes. Then the node's instructions are processed
|
| -in reverse order, updating the live set, until the beginning of the node is
|
| -reached, and the node's live-in set is recorded. If this iteration has changed
|
| -the node's live-in set, the node's predecessors are marked for reprocessing.
|
| -This continues until no more nodes need reprocessing. If nodes are processed in
|
| -reverse topological order, the number of iterations over the CFG is generally
|
| -equal to the maximum loop nest depth.
|
| -
|
| -To implement this, each node records its live-in and live-out sets, initialized
|
| -to the empty set. Each instruction records which of its Variables' live ranges
|
| -end in that instruction, initialized to the empty set. A side effect of
|
| -liveness analysis is dead instruction elimination. Each instruction can be
|
| -marked as tentatively dead, and after the algorithm converges, the tentatively
|
| -dead instructions are permanently deleted.
|
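| -
|
| -The worklist structure of this algorithm can be sketched as follows. This is
|
| -a minimal illustration over plain bit vectors, not the actual Subzero code;
|
| -the per-instruction reverse walk (and the tentative dead-instruction marking)
|
| -is abstracted behind the ``Transfer`` callback::
|
| -
|
| -  #include <deque>
|
| -  #include <vector>
|
| -
|
| -  using BitVector = std::vector<bool>;
|
| -
|
| -  struct Node {
|
| -    std::vector<int> Preds, Succs; // Predecessor/successor node indices.
|
| -    BitVector LiveIn, LiveOut;     // One bit per tracked variable, pre-sized.
|
| -  };
|
| -
|
| -  // Transfer maps a node's live-out set to its live-in set by walking the
|
| -  // node's instructions in reverse order.
|
| -  template <typename TransferFn>
|
| -  void computeLiveness(std::vector<Node> &Nodes, TransferFn Transfer) {
|
| -    std::deque<int> Work;
|
| -    std::vector<bool> InWork(Nodes.size(), true);
|
| -    // Seed in reverse topological order so that, for typical CFGs, the number
|
| -    // of passes is roughly the maximum loop nest depth.
|
| -    for (int I = (int)Nodes.size() - 1; I >= 0; --I)
|
| -      Work.push_back(I);
|
| -    while (!Work.empty()) {
|
| -      int N = Work.front();
|
| -      Work.pop_front();
|
| -      InWork[N] = false;
|
| -      Node &Cur = Nodes[N];
|
| -      // Live-out is the union of the successors' live-in sets.
|
| -      for (size_t B = 0; B < Cur.LiveOut.size(); ++B) {
|
| -        bool Bit = false;
|
| -        for (int S : Cur.Succs)
|
| -          Bit = Bit || Nodes[S].LiveIn[B];
|
| -        Cur.LiveOut[B] = Bit;
|
| -      }
|
| -      BitVector NewLiveIn = Transfer(Cur);
|
| -      if (NewLiveIn != Cur.LiveIn) {
|
| -        Cur.LiveIn = NewLiveIn;
|
| -        // The live-in set changed; predecessors must be reprocessed.
|
| -        for (int P : Cur.Preds)
|
| -          if (!InWork[P]) {
|
| -            Work.push_back(P);
|
| -            InWork[P] = true;
|
| -          }
|
| -      }
|
| -    }
|
| -  }
|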
| -
|
| -Optionally, after this liveness analysis completes, we can do live range
|
| -construction, in which we calculate the live range of each variable in terms of
|
| -instruction numbers. A live range is represented as a union of segments, where
|
| -the segment endpoints are instruction numbers. Instruction numbers are required
|
| -to be unique across the CFG, and monotonically increasing within a basic block.
|
| -As a union of segments, live ranges can contain "gaps" and are therefore
|
| -precise. Because of SSA properties, a variable's live range can start at most
|
| -once in a basic block, and can end at most once in a basic block. Liveness
|
| -analysis keeps track of which variable/instruction tuples begin live ranges and
|
| -end live ranges, and combined with live-in and live-out sets, we can efficiently
|
| -build up live ranges of all variables across all basic blocks.
|
| -
|
| -A lot of care is taken to try to make liveness analysis fast and efficient.
|
| -Because of the lowering strategy, the number of variables is generally
|
| -proportional to the number of instructions, leading to an O(N^2) complexity
|
| -algorithm if implemented naively. To improve things based on sparsity, we note
|
| -that most variables are "local" and referenced in at most one basic block (in
|
| -contrast to the "global" variables with multi-block usage), and therefore cannot
|
| -be live across basic blocks. Therefore, the live-in and live-out sets,
|
| -typically represented as bit vectors, can be limited to the set of global
|
| -variables, and the intra-block liveness bit vector can be compacted to hold the
|
| -global variables plus the local variables for that block.
|
| -
|
| -Register allocation
|
| -^^^^^^^^^^^^^^^^^^^
|
| -
|
| -Subzero implements a simple linear-scan register allocator, based on the
|
| -allocator described by Hanspeter Mössenböck and Michael Pfeiffer in `Linear Scan
|
| -Register Allocation in the Context of SSA Form and Register Constraints
|
| -<ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe02.PDF>`_. This allocator has
|
| -several nice features:
|
| -
|
| -- Live ranges are represented as unions of segments, as described above, rather
|
| - than a single start/end tuple.
|
| -
|
| -- It allows pre-coloring of variables with specific physical registers.
|
| -
|
| -- It applies equally well to pre-lowered Phi instructions.
|
| -
|
| -The paper suggests an approach of aggressively coalescing variables across Phi
|
| -instructions (i.e., trying to force Phi source and destination variables to have
|
| -the same register assignment), but we reject that in favor of the more natural
|
| -preference mechanism described below.
|
| -
|
| -We enhance the algorithm in the paper with the capability of automatic inference
|
| -of register preference, and with the capability of allowing overlapping live
|
| -ranges to safely share the same register in certain circumstances. If we are
|
| -considering register allocation for variable ``A``, and ``A`` has a single
|
| -defining instruction ``A=B+C``, then the preferred register for ``A``, if
|
| -available, would be the register assigned to ``B`` or ``C``, if any, provided
|
| -that ``B`` or ``C``'s live range does not overlap ``A``'s live range. In this
|
| -way we infer a good register preference for ``A``.
|
| -
|
| -We allow overlapping live ranges to get the same register in certain cases.
|
| -Suppose a high-level instruction like::
|
| -
|
| - A = unary_op(B)
|
| -
|
| -has been target-lowered like::
|
| -
|
| - T.inf = B
|
| - A = unary_op(T.inf)
|
| -
|
| -Further, assume that ``B``'s live range continues beyond this instruction
|
| -sequence, and that ``B`` has already been assigned some register. Normally, we
|
| -might want to infer ``B``'s register as a good candidate for ``T.inf``, but it
|
| -turns out that ``T.inf`` and ``B``'s live ranges overlap, requiring them to have
|
| -different registers. But ``T.inf`` is just a read-only copy of ``B`` that is
|
| -guaranteed to be in a register, so in theory these overlapping live ranges could
|
| -safely have the same register. Our implementation allows this overlap as long
|
| -as ``T.inf`` is never modified within ``B``'s live range, and ``B`` is never
|
| -modified within ``T.inf``'s live range.
|
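| -
|
| -For illustration, the core of a linear-scan allocator can be sketched as
|
| -follows, using single-segment live ranges and a first-free-register policy
|
| -for brevity; the real allocator works on unions of segments and also handles
|
| -pre-coloring, infinite weights, eviction, and the preference and overlap
|
| -rules described above::
|
| -
|
| -  #include <algorithm>
|
| -  #include <vector>
|
| -
|
| -  struct Interval {
|
| -    int VarIndex;   // The variable this live range belongs to.
|
| -    int Start, End; // Instruction numbers (a single segment, for brevity).
|
| -    int Reg = -1;   // Assigned physical register, or -1 for a stack slot.
|
| -  };
|
| -
|
| -  // Scans intervals in order of increasing start point, assigning a free
|
| -  // register when one is available and otherwise leaving the variable on the
|
| -  // stack (a real allocator would instead evict a lower-weight active range).
|
| -  void linearScan(std::vector<Interval> &Intervals, int NumRegs) {
|
| -    std::sort(Intervals.begin(), Intervals.end(),
|
| -              [](const Interval &A, const Interval &B) {
|
| -                return A.Start < B.Start;
|
| -              });
|
| -    std::vector<Interval *> Active; // Ranges currently holding a register.
|
| -    std::vector<bool> Free(NumRegs, true);
|
| -    for (Interval &Cur : Intervals) {
|
| -      // Expire active intervals that ended before the current start point.
|
| -      for (size_t I = 0; I < Active.size();) {
|
| -        if (Active[I]->End < Cur.Start) {
|
| -          Free[Active[I]->Reg] = true;
|
| -          Active.erase(Active.begin() + I);
|
| -        } else {
|
| -          ++I;
|
| -        }
|
| -      }
|
| -      for (int R = 0; R < NumRegs; ++R) {
|
| -        if (Free[R]) {
|
| -          Cur.Reg = R;
|
| -          Free[R] = false;
|
| -          Active.push_back(&Cur);
|
| -          break;
|
| -        }
|
| -      }
|
| -    }
|
| -  }
|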
| -
|
| -Subzero's register allocator can be run in 3 configurations.
|
| -
|
| -- Normal mode. All Variables are considered for register allocation. It
|
| - requires full liveness analysis and live range construction as a prerequisite.
|
| - This is used by ``O2`` lowering.
|
| -
|
| -- Minimal mode. Only infinite-weight or pre-colored Variables are considered.
|
| - All other Variables are stack-allocated. It does not require liveness
|
| - analysis; instead, it quickly scans the instructions and records first
|
| - definitions and last uses of all relevant Variables, using that to construct a
|
| - single-segment live range. Although this includes most of the Variables, the
|
| - live ranges are mostly simple, short, and rarely overlapping, which the
|
| - register allocator handles efficiently. This is used by ``Om1`` lowering.
|
| -
|
| -- Post-phi lowering mode. Advanced phi lowering is done after normal-mode
|
| - register allocation, and may result in new infinite-weight Variables that need
|
| - registers. One would like to just run something like minimal mode to assign
|
| - registers to the new Variables while respecting existing register allocation
|
| - decisions. However, it sometimes happens that there are no free registers.
|
| - In this case, some register needs to be forcibly spilled to the stack and
|
| - temporarily reassigned to the new Variable, and reloaded at the end of the new
|
| - Variable's live range. The register must be one that has no explicit
|
| - references during the Variable's live range. Since Subzero currently doesn't
|
| - track def/use chains (though it does record the CfgNode where a Variable is
|
| - defined), we just do a brute-force search across the CfgNode's instruction
|
| - list for the instruction numbers of interest. This situation happens very
|
| - rarely, so there's little point for now in improving its performance.
|
| -
|
| -The basic linear-scan algorithm may, as it proceeds, rescind an early register
|
| -allocation decision, leaving that Variable to be stack-allocated. Some of these
|
| -times, it turns out that the Variable could have been given a different register
|
| -without conflict, but by this time it's too late. The literature recognizes
|
| -this situation and describes "second-chance bin-packing", which Subzero can do.
|
| -We can rerun the register allocator in a mode that respects existing register
|
| -allocation decisions, and sometimes it finds new non-conflicting opportunities.
|
| -In fact, we can repeatedly run the register allocator until convergence.
|
| -Unfortunately, in the current implementation, these subsequent register
|
| -allocation passes end up being extremely expensive. This is because of the
|
| -treatment of the "unhandled pre-colored" Variable set, which is normally very
|
| -small but ends up being quite large on subsequent passes. Its performance can
|
| -probably be made acceptable with a better choice of data structures, but for now
|
| -this second-chance mechanism is disabled.
|
| -
|
| -Future work is to implement LLVM's `Greedy
|
| -<http://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html>`_
|
| -register allocator as a replacement for the basic linear-scan algorithm, given
|
| -LLVM's experience with its improvement in code quality. (The blog post claims
|
| -that the Greedy allocator also improved maintainability because a lot of hacks
|
| -could be removed, but Subzero is probably not yet to that level of hacks, and is
|
| -less likely to see that particular benefit.)
|
| -
|
| -Basic phi lowering
|
| -^^^^^^^^^^^^^^^^^^
|
| -
|
| -The simplest phi lowering strategy works as follows (this is how LLVM ``-O0``
|
| -implements it). Consider this example::
|
| -
|
| - L1:
|
| - ...
|
| - br L3
|
| - L2:
|
| - ...
|
| - br L3
|
| - L3:
|
| - A = phi [B, L1], [C, L2]
|
| - X = phi [Y, L1], [Z, L2]
|
| -
|
| -For each destination of a phi instruction, we can create a temporary and insert
|
| -the temporary's assignment at the end of the predecessor block::
|
| -
|
| - L1:
|
| - ...
|
| - A' = B
|
| - X' = Y
|
| - br L3
|
| - L2:
|
| - ...
|
| - A' = C
|
| - X' = Z
|
| - br L3
|
| -  L3:
|
| - A = A'
|
| - X = X'
|
| -
|
| -This transformation is very simple and reliable. It can be done before target
|
| -lowering and register allocation, and it easily avoids the classic lost-copy and
|
| -related problems. ``Om1`` lowering uses this strategy.
|
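| -
|
| -In code, the simple strategy amounts to one fresh temporary per phi
|
| -destination plus an assignment appended to each predecessor. The following
|
| -sketch uses simplified stand-in structures rather than the real ICE classes::
|
| -
|
| -  #include <deque>
|
| -  #include <vector>
|
| -
|
| -  struct Var { int Id; };                    // Negative Ids mark temporaries.
|
| -  struct PhiArg { Var *Src; int PredNode; }; // Incoming value per predecessor.
|
| -  struct Phi { Var *Dest; std::vector<PhiArg> Args; };
|
| -  struct Assign { Var *Dest, *Src; };
|
| -
|
| -  struct Node {
|
| -    std::vector<Phi> Phis;
|
| -    std::vector<Assign> Assigns; // Stand-in for the instruction list; appended
|
| -                                 // assignments go before the terminator.
|
| -  };
|
| -
|
| -  // For each phi "Dest = phi [Src_i, Pred_i]": create a temporary Dest',
|
| -  // assign "Dest' = Src_i" at the end of every predecessor Pred_i, and
|
| -  // replace the phi itself with a plain "Dest = Dest'" in the join block.
|
| -  void lowerPhisSimple(Node &Join, std::vector<Node> &Nodes,
|
| -                       std::deque<Var> &Temps) {
|
| -    // A deque keeps pointers to earlier temporaries valid as new ones are
|
| -    // added.
|
| -    for (const Phi &P : Join.Phis) {
|
| -      Temps.push_back(Var{-(int)Temps.size() - 1}); // Fresh temporary Dest'.
|
| -      Var *Temp = &Temps.back();
|
| -      for (const PhiArg &Arg : P.Args)
|
| -        Nodes[Arg.PredNode].Assigns.push_back(Assign{Temp, Arg.Src});
|
| -      Join.Assigns.insert(Join.Assigns.begin(), Assign{P.Dest, Temp});
|
| -    }
|
| -    Join.Phis.clear();
|
| -  }
|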
| -
|
| -However, it has the disadvantage of initializing temporaries even for branches
|
| -not taken, though that could be mitigated by splitting non-critical edges and
|
| -putting assignments in the edge-split nodes. Another problem is that without
|
| -extra machinery, the assignments to ``A``, ``A'``, ``X``, and ``X'`` are given a
|
| -specific ordering even though phi semantics are that the assignments are
|
| -parallel or unordered. This sometimes imposes false live range overlaps and
|
| -leads to poorer register allocation.
|
| -
|
| -Advanced phi lowering
|
| -^^^^^^^^^^^^^^^^^^^^^
|
| -
|
| -``O2`` lowering defers phi lowering until after register allocation to avoid the
|
| -problem of false live range overlaps. It works as follows. We split each
|
| -incoming edge and move the (parallel) phi assignments into the split nodes. We
|
| -linearize each set of assignments by finding a safe, topological ordering of the
|
| -assignments, respecting register assignments as well. For example::
|
| -
|
| - A = B
|
| - X = Y
|
| -
|
| -Normally these assignments could be executed in either order, but if ``B`` and
|
| -``X`` are assigned the same physical register, we would want to use the above
|
| -ordering. Dependency cycles are broken by introducing a temporary. For
|
| -example::
|
| -
|
| - A = B
|
| - B = A
|
| -
|
| -Here, a temporary breaks the cycle::
|
| -
|
| - t = A
|
| - A = B
|
| - B = t
|
| -
|
| -Finally, we use the existing target lowering to lower the assignments in this
|
| -basic block, and once that is done for all basic blocks, we run the post-phi
|
| -variant of register allocation on the edge-split basic blocks.
|
| -
|
| -When computing a topological order, we try to first schedule assignments whose
|
| -source has a physical register, and last schedule assignments whose destination
|
| -has a physical register. This helps reduce register pressure.
|
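| -
|
| -Linearizing one edge's parallel assignments is an instance of the classic
|
| -parallel-move problem. A compact sketch follows, under some simplifying
|
| -assumptions: each location is named by a small integer (standing for a
|
| -register or stack slot), ``TempLoc`` is a location used by no pending move,
|
| -and the register-aware scheduling preferences described above are ignored::
|
| -
|
| -  #include <vector>
|
| -
|
| -  struct Move { int Dest, Src; };
|
| -
|
| -  // Orders the parallel moves so that no destination is overwritten before it
|
| -  // has been read, introducing TempLoc to break dependency cycles.
|
| -  std::vector<Move> linearizeParallelMoves(std::vector<Move> Pending,
|
| -                                           int TempLoc) {
|
| -    std::vector<Move> Out;
|
| -    while (!Pending.empty()) {
|
| -      bool Progress = false;
|
| -      for (size_t I = 0; I < Pending.size(); ++I) {
|
| -        // A move is safe to emit if no other pending move still reads its
|
| -        // destination.
|
| -        bool DestIsRead = false;
|
| -        for (size_t J = 0; J < Pending.size(); ++J)
|
| -          if (J != I && Pending[J].Src == Pending[I].Dest)
|
| -            DestIsRead = true;
|
| -        if (!DestIsRead) {
|
| -          Out.push_back(Pending[I]);
|
| -          Pending.erase(Pending.begin() + I);
|
| -          Progress = true;
|
| -          break;
|
| -        }
|
| -      }
|
| -      if (!Progress) {
|
| -        // Every remaining destination is still read by another move, so the
|
| -        // remaining moves form cycles.  Save one source in the temporary and
|
| -        // redirect its readers there, which breaks that cycle.
|
| -        int Saved = Pending.front().Src;
|
| -        Out.push_back(Move{TempLoc, Saved});
|
| -        for (Move &M : Pending)
|
| -          if (M.Src == Saved)
|
| -            M.Src = TempLoc;
|
| -      }
|
| -    }
|
| -    return Out;
|
| -  }
|
| -
|
| -On the ``A = B; B = A`` example above, this produces ``t = B; B = A; A = t``,
|
| -matching the cycle-breaking shown earlier except that the other element of
|
| -the cycle may be the one copied into the temporary first.
|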
| -
|
| -X86 address mode inference
|
| -^^^^^^^^^^^^^^^^^^^^^^^^^^
|
| -
|
| -We try to take advantage of the x86 addressing mode that includes a base
|
| -register, an index register, an index register scale amount, and an immediate
|
| -offset. We do this through simple pattern matching. Starting with a load or
|
| -store instruction where the address is a variable, we initialize the base
|
| -register to that variable, and look up the instruction where that variable is
|
| -defined. If that is an add instruction of two variables and the index register
|
| -hasn't been set, we replace the base and index register with those two
|
| -variables. If instead it is an add instruction of a variable and a constant, we
|
| -replace the base register with the variable and add the constant to the
|
| -immediate offset.
|
| -
|
| -There are several more patterns that can be matched. This pattern matching
|
| -continues on the load or store instruction until no more matches are found.
|
| -Because a program typically has few load and store instructions (not to be
|
| -confused with instructions that manipulate stack variables), this address mode
|
| -inference pass is fast.
|
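| -
|
| -The shape of that matching loop is roughly the following. This is a sketch
|
| -over simplified stand-in structures; the real pass works directly on ICE
|
| -instructions and matches several more patterns than the two shown here::
|
| -
|
| -  #include <cstdint>
|
| -  #include <functional>
|
| -
|
| -  struct Var;
|
| -
|
| -  // How the current base variable was defined, for the two patterns matched.
|
| -  enum class DefKind { Other, AddVarVar, AddVarConst };
|
| -  struct Def {
|
| -    DefKind Kind = DefKind::Other;
|
| -    Var *A = nullptr, *B = nullptr; // Variable operands of the defining add.
|
| -    int32_t Imm = 0;                // Constant operand, for AddVarConst.
|
| -  };
|
| -
|
| -  // The x86 address being built up: Base + Index*Scale + Offset.
|
| -  struct Addr {
|
| -    Var *Base = nullptr, *Index = nullptr;
|
| -    int Scale = 1;
|
| -    int32_t Offset = 0;
|
| -  };
|
| -
|
| -  // Starting from a load or store whose address is A.Base, repeatedly fold
|
| -  // the base's defining instruction into the address until nothing matches.
|
| -  void inferAddressMode(Addr &A, const std::function<Def(Var *)> &LookupDef) {
|
| -    for (;;) {
|
| -      Def D = LookupDef(A.Base);
|
| -      if (D.Kind == DefKind::AddVarVar && A.Index == nullptr) {
|
| -        A.Base = D.A; // base = x + y  ==>  base x, index y (scale 1)
|
| -        A.Index = D.B;
|
| -      } else if (D.Kind == DefKind::AddVarConst) {
|
| -        A.Base = D.A; // base = x + imm  ==>  base x, offset += imm
|
| -        A.Offset += D.Imm;
|
| -      } else {
|
| -        break;        // No more patterns match.
|
| -      }
|
| -    }
|
| -  }
|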
| -
|
| -X86 read-modify-write inference
|
| -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
| -
|
| -A reasonably common bitcode pattern is a non-atomic update of a memory
|
| -location::
|
| -
|
| - x = load addr
|
| - y = add x, 1
|
| - store y, addr
|
| -
|
| -On x86, with good register allocation, the Subzero passes described above
|
| -generate code with only this quality::
|
| -
|
| - mov [%ebx], %eax
|
| - add $1, %eax
|
| - mov %eax, [%ebx]
|
| -
|
| -However, x86 allows for this kind of code::
|
| -
|
| - add $1, [%ebx]
|
| -
|
| -which requires fewer instructions, but perhaps more importantly, requires fewer
|
| -physical registers.
|
| -
|
| -It's also important to note that this transformation only makes sense if the
|
| -store instruction ends ``x``'s live range.
|
| -
|
| -Subzero's ``O2`` recipe includes an early pass to find read-modify-write (RMW)
|
| -opportunities via simple pattern matching. The only problem is that it is run
|
| -before liveness analysis, which is needed to determine whether ``x``'s live
|
| -range ends after the RMW. Since liveness analysis is one of the most expensive
|
| -passes, it's not attractive to run it an extra time just for RMW analysis.
|
| -Instead, we essentially generate both the RMW and the non-RMW versions, and then
|
| -during lowering, the RMW version deletes itself if it finds ``x`` still live.
|
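| -
|
| -The basic shape of the pattern match can be sketched as follows, over a
|
| -simplified three-instruction window. A real implementation would presumably
|
| -also need to check operand types and that nothing between the load and the
|
| -store can write the same memory location; only the structural match is shown
|
| -here::
|
| -
|
| -  #include <vector>
|
| -
|
| -  enum class Op { Load, Add, Store, Other };
|
| -  struct SimpleInst {
|
| -    Op Kind = Op::Other;
|
| -    int Dest = -1; // Variable defined (Load/Add).
|
| -    int Src0 = -1; // Add: left operand; Store: the value stored.
|
| -    int Src1 = -1; // Add: right operand.
|
| -    int Addr = -1; // Load/Store: the address variable.
|
| -    bool RMWCandidate = false;
|
| -  };
|
| -
|
| -  // Marks "x = load addr; y = add x, C; store y, addr" triples as RMW
|
| -  // candidates.  Whether a candidate is actually used is decided later,
|
| -  // during lowering, once the liveness of x is known.
|
| -  void findRMWCandidates(std::vector<SimpleInst> &Insts) {
|
| -    for (size_t I = 0; I + 2 < Insts.size(); ++I) {
|
| -      const SimpleInst &Ld = Insts[I];
|
| -      const SimpleInst &Ar = Insts[I + 1];
|
| -      SimpleInst &St = Insts[I + 2];
|
| -      if (Ld.Kind == Op::Load && Ar.Kind == Op::Add && St.Kind == Op::Store &&
|
| -          Ar.Src0 == Ld.Dest && // The add consumes the loaded value...
|
| -          St.Src0 == Ar.Dest && // ...the store writes the add's result...
|
| -          St.Addr == Ld.Addr)   // ...back to the same address.
|
| -        St.RMWCandidate = true;
|
| -    }
|
| -  }
|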
| -
|
| -X86 compare-branch inference
|
| -^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
| -
|
| -In the LLVM instruction set, the compare/branch pattern works like this::
|
| -
|
| - cond = icmp eq a, b
|
| - br cond, target
|
| -
|
| -The result of the icmp instruction is a single bit, and a conditional branch
|
| -tests that bit. By contrast, most target architectures use this pattern::
|
| -
|
| - cmp a, b // implicitly sets various bits of FLAGS register
|
| - br eq, target // branch on a particular FLAGS bit
|
| -
|
| -A naive lowering sequence conditionally sets ``cond`` to 0 or 1, then tests
|
| -``cond`` and conditionally branches. Subzero has a pass that identifies
|
| -boolean-based operations like this and folds them into a single
|
| -compare/branch-like operation. It is set up for more than just cmp/br though.
|
| -Boolean producers include icmp (integer compare), fcmp (floating-point compare),
|
| -and trunc (integer truncation when the destination has bool type). Boolean
|
| -consumers include branch, select (the ternary operator from the C language), and
|
| -sign-extend and zero-extend when the source has bool type.
|
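| -
|
| -For the simplest icmp/br case, the folding can be sketched as follows, again
|
| -over a simplified instruction model; it assumes the boolean is produced
|
| -immediately before its only consumer, which a real implementation would have
|
| -to verify::
|
| -
|
| -  #include <vector>
|
| -
|
| -  enum class Kind { Icmp, CondBr, FusedCmpBr, Other };
|
| -  struct BInst {
|
| -    Kind K = Kind::Other;
|
| -    int Dest = -1; // Boolean variable defined by an Icmp.
|
| -    int Cond = -1; // Boolean variable consumed by a CondBr.
|
| -    bool Deleted = false;
|
| -  };
|
| -
|
| -  // Folds "b = icmp ...; br b, target" into a single fused compare/branch so
|
| -  // that the boolean never needs to be materialized in a register.
|
| -  void foldCompareBranch(std::vector<BInst> &Insts) {
|
| -    for (size_t I = 0; I + 1 < Insts.size(); ++I) {
|
| -      BInst &Producer = Insts[I];
|
| -      BInst &Consumer = Insts[I + 1];
|
| -      if (Producer.K == Kind::Icmp && Consumer.K == Kind::CondBr &&
|
| -          Consumer.Cond == Producer.Dest) {
|
| -        Producer.Deleted = true;
|
| -        Consumer.K = Kind::FusedCmpBr;
|
| -      }
|
| -    }
|
| -  }
|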
| -
|
| -Sandboxing
|
| -^^^^^^^^^^
|
| -
|
| -Native Client's sandbox model uses software fault isolation (SFI) to provide
|
| -safety when running untrusted code in a browser or other environment. Subzero
|
| -implements Native Client's `sandboxing
|
| -<https://developer.chrome.com/native-client/reference/sandbox_internals/index>`_
|
| -to enable Subzero-translated executables to be run inside Chrome. Subzero also
|
| -provides a fairly simple framework for investigating alternative sandbox models
|
| -or other restrictions on the sandbox model.
|
| -
|
| -Sandboxing in Subzero is not actually implemented as a separate pass, but is
|
| -integrated into lowering and assembly.
|
| -
|
| -- Indirect branches, including the ret instruction, are masked to a bundle
|
| - boundary and bundle-locked.
|
| -
|
| -- Call instructions are aligned to the end of the bundle so that the return
|
| - address is bundle-aligned.
|
| -
|
| -- Indirect branch targets, including function entry and targets in a switch
|
| - statement jump table, are bundle-aligned.
|
| -
|
| -- The intrinsic for reading the thread pointer is inlined appropriately.
|
| -
|
| -- For x86-64, non-stack memory accesses are with respect to the reserved sandbox
|
| - base register. We reduce the aggressiveness of address mode inference to
|
| - leave room for the sandbox base register during lowering. There are no memory
|
| - sandboxing changes for x86-32.
|
| -
|
| -Code emission
|
| --------------
|
| -
|
| -Subzero's integrated assembler is derived from Dart's `assembler code
|
| -<https://github.com/dart-lang/sdk/tree/master/runtime/vm>`_. There is a pass
|
| -that iterates through the low-level ICE instructions and invokes the relevant
|
| -assembler functions. Placeholders are added for later fixup of branch target
|
| -offsets. (Backward branches use short offsets if possible; forward branches
|
| -generally use long offsets unless it is an intra-block branch of "known" short
|
| -length.) The assembler emits into a staging buffer. Once emission into the
|
| -staging buffer for a function is complete, the data is emitted to the output
|
| -file as an ELF object file, and metadata such as relocations, symbol table, and
|
| -string table, are accumulated for emission at the end. Global data initializers
|
| -are emitted similarly. A key point is that at this point, the staging buffer
|
| -can be deallocated, and only a minimum of data needs to be held until the end.
|
| -
|
| -As a debugging alternative, Subzero can emit textual assembly code which can
|
| -then be run through an external assembler. This is of course super slow, but
|
| -quite valuable when bringing up a new target.
|
| -
|
| -As another debugging option, the staging buffer can be emitted as textual
|
| -assembly, primarily in the form of ".byte" lines. This allows the assembler to
|
| -be tested separately from the ELF related code.
|
| -
|
| -Memory management
|
| ------------------
|
| -
|
| -Where possible, we allocate from a ``CfgLocalAllocator`` which derives from
|
| -LLVM's ``BumpPtrAllocator``. This is an arena-style allocator where objects
|
| -allocated from the arena are never actually freed; instead, when the CFG
|
| -translation completes and the CFG is deleted, the entire arena memory is
|
| -reclaimed at once. This style of allocation works well in an environment like a
|
| -compiler where there are distinct phases with only easily-identifiable objects
|
| -living across phases. It frees the developer from having to manage object
|
| -deletion, and it amortizes deletion costs across just a single arena deletion at
|
| -the end of the phase. Furthermore, it helps scalability by allocating entirely
|
| -from thread-local memory pools, and minimizing global locking of the heap.
|
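| -
|
| -To illustrate the idea, an arena of this kind is little more than a bump
|
| -pointer into a list of large slabs. The sketch below is not the actual
|
| -``CfgLocalAllocator``; it assumes allocation sizes no larger than a slab and
|
| -alignments no stricter than ``std::max_align_t``::
|
| -
|
| -  #include <cstddef>
|
| -  #include <cstdint>
|
| -  #include <new>
|
| -  #include <vector>
|
| -
|
| -  // Objects are carved off large slabs and never freed individually;
|
| -  // destroying the arena releases everything at once.
|
| -  class Arena {
|
| -    static constexpr size_t SlabSize = 64 * 1024;
|
| -    std::vector<char *> Slabs;
|
| -    char *Cur = nullptr;
|
| -    size_t Remaining = 0;
|
| -
|
| -  public:
|
| -    void *allocate(size_t Size, size_t Align = alignof(std::max_align_t)) {
|
| -      // Align the bump pointer, then check whether the current slab fits.
|
| -      std::uintptr_t P = reinterpret_cast<std::uintptr_t>(Cur);
|
| -      size_t Pad = (Align - P % Align) % Align;
|
| -      if (Cur == nullptr || Pad + Size > Remaining) {
|
| -        Slabs.push_back(static_cast<char *>(::operator new(SlabSize)));
|
| -        Cur = Slabs.back();
|
| -        Remaining = SlabSize;
|
| -        Pad = 0; // A fresh slab from operator new is suitably aligned.
|
| -      }
|
| -      void *Result = Cur + Pad;
|
| -      Cur += Pad + Size;
|
| -      Remaining -= Pad + Size;
|
| -      return Result;
|
| -    }
|
| -    ~Arena() {
|
| -      for (char *S : Slabs)
|
| -        ::operator delete(S);
|
| -    }
|
| -  };
|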
| -
|
| -Instructions are probably the most heavily allocated complex class in Subzero.
|
| -We represent an instruction list as an intrusive doubly linked list, allocate
|
| -all instructions from the ``CfgLocalAllocator``, and we make sure each
|
| -instruction subclass is basically `POD
|
| -<http://en.cppreference.com/w/cpp/concept/PODType>`_ (Plain Old Data) with a
|
| -trivial destructor. This way, when the CFG is finished, we don't need to
|
| -individually deallocate every instruction. We do the same for Variables, which
|
| -are probably the second most heavily allocated complex class.
|
| -
|
| -There are some situations where passes need to use some `STL container class
|
| -<http://en.cppreference.com/w/cpp/container>`_. Subzero has a way of using the
|
| -``CfgLocalAllocator`` as the container allocator if this is needed.
|
| -
|
| -Multithreaded translation
|
| --------------------------
|
| -
|
| -Subzero is designed to be able to translate functions in parallel. With the
|
| -``-threads=N`` command-line option, there is a 3-stage producer-consumer
|
| -pipeline:
|
| -
|
| -- A single thread parses the ``.pexe`` file and produces a sequence of work
|
| - units. A work unit can be either a fully constructed CFG, or a set of global
|
| - initializers. The work unit includes its sequence number denoting its parse
|
| - order. Each work unit is added to the translation queue.
|
| -
|
| -- There are N translation threads that draw work units from the translation
|
| - queue and lower them into assembler buffers. Each assembler buffer is added
|
| - to the emitter queue, tagged with its sequence number. The CFG and its
|
| - ``CfgLocalAllocator`` are disposed of at this point.
|
| -
|
| -- A single thread draws assembler buffers from the emitter queue and appends to
|
| - the output file. It uses the sequence numbers to reintegrate the assembler
|
| - buffers according to the original parse order, such that output order is
|
| - always deterministic.
|
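| -
|
| -The deterministic-output requirement comes down to the emitter stage putting
|
| -buffers back into parse order. The following sketch shows that step with
|
| -simple stand-ins for the assembler buffers; in the real pipeline it runs on
|
| -the single emitter thread, so it needs no locking beyond the queue itself::
|
| -
|
| -  #include <cstdint>
|
| -  #include <map>
|
| -  #include <string>
|
| -  #include <vector>
|
| -
|
| -  // Stand-in for an assembler buffer tagged with its parse-order sequence
|
| -  // number.
|
| -  struct EmitterItem {
|
| -    uint32_t Sequence;
|
| -    std::string Bytes;
|
| -  };
|
| -
|
| -  // Collects translated buffers (arriving in any order from the translation
|
| -  // threads) and appends them to the output strictly in parse order.
|
| -  class ReorderingEmitter {
|
| -    uint32_t NextExpected = 0;
|
| -    std::map<uint32_t, std::string> Pending; // Buffers that arrived early.
|
| -    std::vector<std::string> Output;         // Stand-in for the output file.
|
| -
|
| -  public:
|
| -    void add(EmitterItem Item) {
|
| -      Pending.emplace(Item.Sequence, std::move(Item.Bytes));
|
| -      // Flush everything now contiguous with what was already emitted.
|
| -      for (auto It = Pending.find(NextExpected); It != Pending.end();
|
| -           It = Pending.find(NextExpected)) {
|
| -        Output.push_back(std::move(It->second));
|
| -        Pending.erase(It);
|
| -        ++NextExpected;
|
| -      }
|
| -    }
|
| -    const std::vector<std::string> &output() const { return Output; }
|
| -  };
|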
| -
|
| -This means that with ``-threads=N``, there are actually ``N+1`` spawned threads
|
| -for a total of ``N+2`` execution threads, taking the parser and emitter threads
|
| -into account. For the special case of ``N=0``, execution is entirely sequential
|
| --- the same thread parses, translates, and emits, one function at a time. This
|
| -is useful for performance measurements.
|
| -
|
| -Ideally, we would like to get near-linear scalability as the number of
|
| -translation threads increases. We expect that ``-threads=1`` should be slightly
|
| -faster than ``-threads=0`` as the small amount of time spent parsing and
|
| -emitting is done largely in parallel with translation. With perfect
|
| -scalability, we would see ``-threads=N`` translating ``N`` times as fast as
|
| -``-threads=1``, up until the point where parsing or emitting becomes the
|
| -bottleneck, or ``N+2`` exceeds the number of CPU cores. In reality, memory
|
| -performance would become a bottleneck and efficiency might peak at, say, 75%.
|
| -
|
| -Currently, parsing takes about 11% of total sequential time. If translation
|
| -scalability ever improves to the point that parsing becomes the
|
| -bottleneck, it should be possible to make parsing multithreaded as well.
|
| -
|
| -Internally, all shared, mutable data is held in the GlobalContext object, and
|
| -access to each field is guarded by a mutex.
|
| -
|
| -Security
|
| ---------
|
| -
|
| -Subzero includes a number of security features in the generated code, as well as
|
| -in the Subzero translator itself, which run on top of the existing Native Client
|
| -sandbox as well as Chrome's OS-level sandbox.
|
| -
|
| -Sandboxed translator
|
| -^^^^^^^^^^^^^^^^^^^^
|
| -
|
| -When running inside the browser, the Subzero translator executes as sandboxed,
|
| -untrusted code that is initially checked by the validator, just like the
|
| -LLVM-based ``pnacl-llc`` translator. As such, the Subzero binary should be no
|
| -more or less secure than the translator it replaces, from the point of view of
|
| -the Chrome sandbox. That said, Subzero is much smaller than ``pnacl-llc`` and
|
| -was designed from the start with security in mind, so one expects fewer attacker
|
| -opportunities here.
|
| -
|
| -Code diversification
|
| -^^^^^^^^^^^^^^^^^^^^
|
| -
|
| -`Return-oriented programming
|
| -<https://en.wikipedia.org/wiki/Return-oriented_programming>`_ (ROP) is a
|
| -now-common technique for starting with e.g. a known buffer overflow situation
|
| -and launching it into a deeper exploit. The attacker scans the executable
|
| -looking for ROP gadgets, which are short sequences of code that happen to load
|
| -known values into known registers and then return. An attacker who manages to
|
| -overwrite parts of the stack can overwrite it with carefully chosen return
|
| -addresses such that certain ROP gadgets are effectively chained together to set
|
| -up the register state as desired, finally returning to some code that manages to
|
| -do something nasty based on those register values.
|
| -
|
| -If there is a popular ``.pexe`` with a large install base, the attacker could
|
| -run Subzero on it and scan the executable for suitable ROP gadgets to use as
|
| -part of a potential exploit. Note that if the trusted validator is working
|
| -correctly, these ROP gadgets are limited to starting at a bundle boundary and
|
| -cannot use the trick of finding a gadget that happens to begin inside another
|
| -instruction. All the same, gadgets with these constraints still exist and the
|
| -attacker has access to them. This is the attack model we focus most on --
|
| -protecting the user against misuse of a "trusted" developer's application, as
|
| -opposed to mischief from a malicious ``.pexe`` file.
|
| -
|
| -Subzero can mitigate these attacks to some degree through code diversification.
|
| -Specifically, we can apply some randomness to the code generation that makes ROP
|
| -gadgets less predictable. This randomness can have some compile-time cost, and
|
| -it can affect the code quality; and some diversifications may be more effective
|
| -than others. A more detailed treatment of hardening techniques may be found in
|
| -the Matasano report "`Attacking Clientside JIT Compilers
|
| -<https://www.nccgroup.trust/globalassets/resources/us/presentations/documents/attacking_clientside_jit_compilers_paper.pdf>`_".
|
| -
|
| -To evaluate diversification effectiveness, we use a third-party ROP gadget
|
| -finder and limit its results to bundle-aligned addresses. For a given
|
| -diversification technique, we run it with a number of different random seeds,
|
| -find ROP gadgets for each version, and determine how persistent each ROP gadget
|
| -is across the different versions. A gadget is persistent if the same gadget is
|
| -found at the same code address. The best diversifications are ones with low
|
| -gadget persistence rates.
|
| -
|
| -Subzero implements 7 different diversification techniques. Below is a
|
| -discussion of each technique, its effectiveness, and its cost. The discussions
|
| -of cost and effectiveness are for a single diversification technique; the
|
| -translation-time costs for multiple techniques are additive, but the effects of
|
| -multiple techniques on code quality and effectiveness are not yet known.
|
| -
|
| -In Subzero's implementation, each randomization is "repeatable" in a sense.
|
| -Each pass that includes a randomization option gets its own private instance of
|
| -a random number generator (RNG). The RNG is seeded with a combination of a
|
| -global seed, the pass ID, and the function's sequence number. The global seed
|
| -is designed to be different across runs (perhaps based on the current time), but
|
| -for debugging, the global seed can be set to a specific value and the results
|
| -will be repeatable.
|
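| -
|
| -That seeding scheme can be sketched as follows, using a standard-library RNG
|
| -and an arbitrary mixing function rather than whatever Subzero actually uses::
|
| -
|
| -  #include <cstdint>
|
| -  #include <random>
|
| -
|
| -  // Derives a per-pass, per-function RNG: repeatable for a fixed global seed,
|
| -  // yet different across passes and functions.
|
| -  std::mt19937_64 makePassRNG(uint64_t GlobalSeed, uint64_t PassID,
|
| -                              uint64_t FuncSequenceNumber) {
|
| -    uint64_t Seed = GlobalSeed;
|
| -    Seed = Seed * 1099511628211ULL + PassID;
|
| -    Seed = Seed * 1099511628211ULL + FuncSequenceNumber;
|
| -    return std::mt19937_64(Seed);
|
| -  }
|
| -
|
| -  // Example use: decide whether to insert a nop with probability Percent/100.
|
| -  bool shouldInsertNop(std::mt19937_64 &RNG, unsigned Percent) {
|
| -    return std::uniform_int_distribution<unsigned>(0, 99)(RNG) < Percent;
|
| -  }
|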
| -
|
| -Subzero-generated code is subject to diversification once per translation, and
|
| -then Chrome caches the diversified binary for subsequent executions. An
|
| -attacker may attempt to run the binary multiple times hoping for
|
| -higher-probability combinations of ROP gadgets. When the attacker guesses
|
| -wrong, a likely outcome is an application crash. Chrome throttles creation of
|
| -crashy processes which reduces the likelihood of the attacker eventually gaining
|
| -a foothold.
|
| -
|
| -Constant blinding
|
| -~~~~~~~~~~~~~~~~~
|
| -
|
| -Here, we prevent attackers from controlling large immediates in the text
|
| -(executable) section. A random cookie is generated for each function, and if
|
| -the constant exceeds a specified threshold, the constant is obfuscated with the
|
| -cookie and equivalent code is generated. For example, instead of this x86
|
| -instruction::
|
| -
|
| - mov $0x11223344, <%Reg/Mem>
|
| -
|
| -the following code might be generated::
|
| -
|
| - mov $(0x11223344+Cookie), %temp
|
| - lea -Cookie(%temp), %temp
|
| - mov %temp, <%Reg/Mem>
|
| -
|
| -The ``lea`` instruction is used rather than e.g. ``add``/``sub`` or ``xor``, to
|
| -prevent unintended effects on the flags register.
|
| -
|
| -This transformation has almost no effect on translation time, and about 1%
|
| -impact on code quality, depending on the threshold chosen. It does little to
|
| -reduce gadget persistence, but it does remove a lot of potential opportunities
|
| -to construct intra-instruction ROP gadgets (which an attacker could use only if
|
| -a validator bug were discovered, since the Native Client sandbox and associated
|
| -validator force returns and other indirect branches to be to bundle-aligned
|
| -addresses).
|
| -
|
| -Constant pooling
|
| -~~~~~~~~~~~~~~~~
|
| -
|
| -This is similar to constant blinding, in that large immediates are removed from
|
| -the text section. In this case, each unique constant above the threshold is
|
| -stored in a read-only data section and the constant is accessed via a memory
|
| -load. For the above example, the following code might be generated::
|
| -
|
| - mov Label$1, %temp
|
| - mov %temp, <%Reg/Mem>
|
| -
|
| -This has a similarly small impact on translation time and ROP gadget
|
| -persistence, and a smaller (better) impact on code quality. This is because it
|
| -uses fewer instructions, and in some cases good register allocation leads to no
|
| -increase in instruction count. Note that this still gives an attacker some
|
| -limited amount of control over some text section values, unless we randomize the
|
| -constant pool layout.
|
| -
|
| -Static data reordering
|
| -~~~~~~~~~~~~~~~~~~~~~~
|
| -
|
| -This transformation limits the attacker's ability to control bits in global data
|
| -address references. It simply permutes the order in memory of global variables
|
| -and internal constant pool entries. For the constant pool, we only permute
|
| -within a type (i.e., emit a randomized list of ints, followed by a randomized
|
| -list of floats, etc.) to maintain good packing in the face of alignment
|
| -constraints.
|
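| -
|
| -The per-type shuffle can be pictured roughly as follows (a sketch;
|
| -``PoolEntry`` and its fields are stand-ins for Subzero's own constant pool
|
| -representation)::
|
| -
|
| -  // Sketch: permute constant pool entries within each type so that
|
| -  // alignment-friendly packing is preserved.
|
| -  #include <algorithm>
|
| -  #include <map>
|
| -  #include <random>
|
| -  #include <string>
|
| -  #include <vector>
|
| -
|
| -  struct PoolEntry {
|
| -    std::string Type; // e.g. "i32", "f32", "f64"
|
| -    // ... the constant's value and emitted label would live here
|
| -  };
|
| -
|
| -  void shufflePool(std::vector<PoolEntry> &Pool, std::mt19937_64 &RNG) {
|
| -    // Bucket entries by type, shuffle each bucket, then emit the buckets in
|
| -    // a fixed type order so that same-sized entries stay adjacent.
|
| -    std::map<std::string, std::vector<PoolEntry>> Buckets;
|
| -    for (const PoolEntry &E : Pool)
|
| -      Buckets[E.Type].push_back(E);
|
| -    Pool.clear();
|
| -    for (auto &TypeAndBucket : Buckets) {
|
| -      std::shuffle(TypeAndBucket.second.begin(), TypeAndBucket.second.end(),
|
| -                   RNG);
|
| -      Pool.insert(Pool.end(), TypeAndBucket.second.begin(),
|
| -                  TypeAndBucket.second.end());
|
| -    }
|
| -  }
|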
| -
|
| -As might be expected, this has no impact on code quality, translation time, or
|
| -ROP gadget persistence (though as above, it limits opportunities for
|
| -intra-instruction ROP gadgets with a broken validator).
|
| -
|
| -Basic block reordering
|
| -~~~~~~~~~~~~~~~~~~~~~~
|
| -
|
| -Here, we randomize the order of basic blocks within a function, with the
|
| -constraint that we still want to maintain a topological order as much as
|
| -possible, to avoid making the code too branchy.
|
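| -
|
| -One way to get a randomized but still roughly topological layout is to
|
| -repeatedly pick at random among the blocks whose forward-edge predecessors
|
| -have already been placed, as in this sketch (the block representation is
|
| -illustrative, and back edges are assumed to have been excluded so that the
|
| -graph is acyclic)::
|
| -
|
| -  // Sketch: randomized topological layout of basic blocks.  The entry block
|
| -  // is assumed to be the unique block with no forward-edge predecessors, so
|
| -  // it is always placed first.
|
| -  #include <cstddef>
|
| -  #include <random>
|
| -  #include <vector>
|
| -
|
| -  struct Block {
|
| -    std::vector<std::size_t> Succs; // successors via forward edges only
|
| -  };
|
| -
|
| -  std::vector<std::size_t> randomLayout(const std::vector<Block> &Blocks,
|
| -                                        std::mt19937_64 &RNG) {
|
| -    std::vector<std::size_t> PredCount(Blocks.size(), 0), Ready, Layout;
|
| -    for (const Block &B : Blocks)
|
| -      for (std::size_t S : B.Succs)
|
| -        ++PredCount[S];
|
| -    for (std::size_t i = 0; i < Blocks.size(); ++i)
|
| -      if (PredCount[i] == 0)
|
| -        Ready.push_back(i);
|
| -    while (!Ready.empty()) {
|
| -      // Choose randomly among the blocks that are ready to be placed.
|
| -      std::uniform_int_distribution<std::size_t> Pick(0, Ready.size() - 1);
|
| -      std::size_t Idx = Pick(RNG);
|
| -      std::size_t B = Ready[Idx];
|
| -      Ready.erase(Ready.begin() + Idx);
|
| -      Layout.push_back(B);
|
| -      for (std::size_t S : Blocks[B].Succs)
|
| -        if (--PredCount[S] == 0)
|
| -          Ready.push_back(S);
|
| -    }
|
| -    return Layout;
|
| -  }
|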
| -
|
| -This has no impact on code quality, and about 1% impact on translation time, due
|
| -to a separate pass to recompute layout. It ends up having a huge effect on ROP
|
| -gadget persistence, tied for best with nop insertion, reducing ROP gadget
|
| -persistence to less than 5%.
|
| -
|
| -Function reordering
|
| -~~~~~~~~~~~~~~~~~~~
|
| -
|
| -Here, we permute the order that functions are emitted, primarily to shift ROP
|
| -gadgets around to less predictable locations. It may also change call address
|
| -offsets in case the attacker was trying to control that offset in the code.
|
| -
|
| -To control latency and memory footprint, we don't arbitrarily permute functions.
|
| -Instead, for some relatively small value of N, we queue up N assembler buffers,
|
| -then emit those N functions in random order, and repeat until all functions
|
| -are emitted.
|
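| -
|
| -In code, this windowed shuffle amounts to roughly the following sketch
|
| -(``AsmBuffer`` and ``emit`` are placeholders for Subzero's assembler buffer
|
| -and output machinery)::
|
| -
|
| -  // Sketch: window-based function reordering.  Buffer up to N finished
|
| -  // functions, emit them in random order, and repeat.
|
| -  #include <algorithm>
|
| -  #include <cstddef>
|
| -  #include <random>
|
| -  #include <string>
|
| -  #include <vector>
|
| -
|
| -  struct AsmBuffer {
|
| -    std::string Bytes; // stand-in for one assembled function
|
| -  };
|
| -
|
| -  // Stand-in for writing one function's bytes to the output file.
|
| -  void emit(const AsmBuffer &Buf) { (void)Buf; }
|
| -
|
| -  void emitReordered(const std::vector<AsmBuffer> &Funcs, std::size_t N,
|
| -                     std::mt19937_64 &RNG) {
|
| -    std::vector<AsmBuffer> Window;
|
| -    auto Flush = [&]() {
|
| -      std::shuffle(Window.begin(), Window.end(), RNG);
|
| -      for (const AsmBuffer &Buf : Window)
|
| -        emit(Buf);
|
| -      Window.clear();
|
| -    };
|
| -    for (const AsmBuffer &Buf : Funcs) {
|
| -      Window.push_back(Buf);
|
| -      if (Window.size() == N)
|
| -        Flush();
|
| -    }
|
| -    Flush(); // emit any leftover functions in random order
|
| -  }
|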
| -
|
| -Function reordering has no impact on translation time or code quality.
|
| -Measurements indicate that it reduces ROP gadget persistence to about 15%.
|
| -
|
| -Nop insertion
|
| -~~~~~~~~~~~~~
|
| -
|
| -This diversification randomly adds a nop instruction after each regular
|
| -instruction, with some probability. Nop instructions of different lengths may
|
| -be selected. Nop instructions are never added inside a ``bundle_lock`` region.
|
| -Note that when sandboxing is enabled, nop instructions are already being added
|
| -for bundle alignment, so the diversification nop instructions may simply be
|
| -taking the place of alignment nop instructions, though distributed differently
|
| -through the bundle.
|
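| -
|
| -Conceptually, the pass looks something like this sketch (the instruction
|
| -representation and probability handling are placeholders, not Subzero's IR)::
|
| -
|
| -  // Sketch: with probability P, insert a nop of random length after each
|
| -  // instruction, skipping bundle_lock regions.
|
| -  #include <random>
|
| -  #include <utility>
|
| -  #include <vector>
|
| -
|
| -  struct Inst {
|
| -    bool InBundleLockRegion = false;
|
| -    int NopLength = 0; // nonzero marks an inserted nop of that many bytes
|
| -  };
|
| -
|
| -  void insertNops(std::vector<Inst> &Insts, double P, std::mt19937_64 &RNG) {
|
| -    std::bernoulli_distribution Coin(P);
|
| -    std::uniform_int_distribution<int> Len(1, 3); // e.g. 1- to 3-byte nops
|
| -    std::vector<Inst> Out;
|
| -    for (const Inst &I : Insts) {
|
| -      Out.push_back(I);
|
| -      if (!I.InBundleLockRegion && Coin(RNG)) {
|
| -        Inst Nop;
|
| -        Nop.NopLength = Len(RNG);
|
| -        Out.push_back(Nop);
|
| -      }
|
| -    }
|
| -    Insts = std::move(Out);
|
| -  }
|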
| -
|
| -In Subzero's current implementation, nop insertion adds 3-5% to the
|
| -translation time, but this is probably because it is implemented as a separate
|
| -pass that adds actual nop instructions to the IR. The overhead would probably
|
| -be a lot less if it were integrated into the assembler pass. The code quality
|
| -is also reduced by 3-5%, making nop insertion the most expensive of the
|
| -diversification techniques.
|
| -
|
| -Nop insertion is very effective in reducing ROP gadget persistence, at the same
|
| -level as basic block randomization (less than 5%). But given nop insertion's
|
| -impact on translation time and code quality, one would most likely prefer to use
|
| -basic block randomization instead (though the combined effects of the different
|
| -diversification techniques have not yet been studied).
|
| -
|
| -Register allocation randomization
|
| -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
| -
|
| -In this diversification, the register allocator tries to make different but
|
| -mostly functionally equivalent choices, while maintaining stable code quality.
|
| -
|
| -A naive approach would be the following. Whenever the allocator has more than
|
| -one choice for assigning a register, choose randomly among those options. And
|
| -whenever there are no registers available and there is a tie for the
|
| -lowest-weight variable, randomly select one of the lowest-weight variables to
|
| -evict. Because of the one-pass nature of the linear-scan algorithm, this
|
| -randomization strategy can have a large impact on which variables are ultimately
|
| -assigned registers, with a corresponding large impact on code quality.
|
| -
|
| -Instead, we choose an approach that tries to keep code quality stable regardless
|
| -of the random seed. We partition the set of physical registers into equivalence
|
| -classes. If a register is pre-colored in the function (i.e., referenced
|
| -explicitly by name), it forms its own equivalence class. The remaining
|
| -registers are partitioned according to their combination of attributes such as
|
| -integer versus floating-point, 8-bit versus 32-bit, caller-saved versus
|
| -callee-saved, etc. Each equivalence class is randomly permuted, and the
|
| -complete permutation is applied to the final register assignments.
|
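| -
|
| -The permutation step can be sketched as follows (the register and class
|
| -representations are simplified; a pre-colored register would get a unique
|
| -class key, so it maps to itself)::
|
| -
|
| -  // Sketch: build a permutation of the physical registers that only shuffles
|
| -  // registers within the same equivalence class, then apply it to the
|
| -  // allocator's final assignments.
|
| -  #include <algorithm>
|
| -  #include <cstddef>
|
| -  #include <map>
|
| -  #include <random>
|
| -  #include <string>
|
| -  #include <vector>
|
| -
|
| -  // Key describing a register's attribute combination, e.g. "i32/caller".
|
| -  using ClassKey = std::string;
|
| -
|
| -  std::vector<std::size_t> makePermutation(const std::vector<ClassKey> &Classes,
|
| -                                           std::mt19937_64 &RNG) {
|
| -    std::map<ClassKey, std::vector<std::size_t>> Members;
|
| -    for (std::size_t Reg = 0; Reg < Classes.size(); ++Reg)
|
| -      Members[Classes[Reg]].push_back(Reg);
|
| -    std::vector<std::size_t> Perm(Classes.size());
|
| -    for (auto &KeyAndRegs : Members) {
|
| -      std::vector<std::size_t> Shuffled = KeyAndRegs.second;
|
| -      std::shuffle(Shuffled.begin(), Shuffled.end(), RNG);
|
| -      // Same class implies same attributes, so the swap is safe.
|
| -      for (std::size_t i = 0; i < Shuffled.size(); ++i)
|
| -        Perm[KeyAndRegs.second[i]] = Shuffled[i];
|
| -    }
|
| -    return Perm;
|
| -  }
|
| -
|
| -  // Usage: a variable assigned register R is rewritten to use Perm[R].
|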
| -
|
| -Register randomization reduces ROP gadget persistence to about 10% on average,
|
| -though there tends to be fairly high variance across functions and applications.
|
| -This probably has to do with the set of restrictions in the x86-32 instruction
|
| -set and ABI, such as few general-purpose registers, ``%eax`` used for return
|
| -values, ``%edx`` used for division, ``%cl`` used for shifting, etc. As
|
| -intended, register randomization has no impact on code quality, and a slight
|
| -(0.5%) impact on translation time due to an extra scan over the variables to
|
| -identify pre-colored registers.
|
| -
|
| -Fuzzing
|
| -^^^^^^^
|
| -
|
| -We have started fuzz-testing the ``.pexe`` files input to Subzero, using a
|
| -combination of `afl-fuzz <http://lcamtuf.coredump.cx/afl/>`_, LLVM's `libFuzzer
|
| -<http://llvm.org/docs/LibFuzzer.html>`_, and custom tooling. The purpose is to
|
| -find and fix cases where Subzero crashes or otherwise ungracefully fails on
|
| -unexpected inputs, and to do so automatically over a large range of unexpected
|
| -inputs. By fixing bugs that arise from fuzz testing, we reduce the possibility
|
| -of an attacker exploiting these bugs.
|
| -
|
| -Most of the problems found so far are ones most appropriately handled in the
|
| -parser. However, there have been a couple that have identified problems in the
|
| -lowering, or otherwise inappropriately triggered assertion failures and fatal
|
| -errors. We continue to dig into this area.
|
| -
|
| -Future security work
|
| -^^^^^^^^^^^^^^^^^^^^
|
| -
|
| -Subzero is well-positioned to explore other future security enhancements, e.g.:
|
| -
|
| -- Tightening the Native Client sandbox. ABI changes, such as the previous work
|
| - on `hiding the sandbox base address
|
| - <https://docs.google.com/document/d/1eskaI4353XdsJQFJLRnZzb_YIESQx4gNRzf31dqXVG8>`_
|
| - in x86-64, are easy to experiment with in Subzero.
|
| -
|
| -- Making the executable code section read-only. This would prevent a PNaCl
|
| - application from inspecting its own binary and trying to find ROP gadgets even
|
| - after code diversification has been performed. It may still be susceptible to
|
| - `blind ROP <http://www.scs.stanford.edu/brop/bittau-brop.pdf>`_ attacks,
|
| - but security is still improved overall.
|
| -
|
| -- Instruction selection diversification. It may be possible to lower a given
|
| - instruction in several largely equivalent ways, which gives more opportunities
|
| - for code randomization.
|
| -
|
| -Chrome integration
|
| -------------------
|
| -
|
| -Currently Subzero is available in Chrome for the x86-32 architecture, but under
|
| -a flag. When the flag is enabled, Subzero is used when the `manifest file
|
| -<https://developer.chrome.com/native-client/reference/nacl-manifest-format>`_
|
| -linking to the ``.pexe`` file specifies the ``O0`` optimization level.
|
| -
|
| -The next step is to remove the flag, i.e. invoke Subzero as the only translator
|
| -for ``O0``-specified manifest files.
|
| -
|
| -Ultimately, Subzero might produce code rivaling LLVM ``O2`` quality, in which
|
| -case Subzero could be used for all PNaCl translation.
|
| -
|
| -Command line options
|
| ---------------------
|
| -
|
| -Subzero has a number of command-line options for debugging and diagnostics.
|
| -Among the more interesting are the following.
|
| -
|
| -- Using the ``-verbose`` flag, Subzero will dump the CFG, or produce other
|
| - diagnostic output, with various levels of detail after each pass. Instruction
|
| - numbers can be printed or suppressed. Deleted instructions can be printed or
|
| - suppressed (they are retained in the instruction list, as discussed earlier,
|
| - because they can help explain how lower-level instructions originated).
|
| - Liveness information can be printed when available. Details of register
|
| - allocation can be printed as register allocator decisions are made. And more.
|
| -
|
| -- Running Subzero with any level of verbosity produces an enormous amount of
|
| - output. When debugging a single function, the ``-verbose-focus`` flag can be
|
| - used to suppress verbose output except for the specified function.
|
| -
|
| -- Subzero has a ``-timing`` option that prints a breakdown of pass-level timing
|
| - at exit. Timing markers can be placed in the Subzero source code to demarcate
|
| - logical operations or passes of interest. Basic timing information plus
|
| - call-stack type timing information is printed at the end.
|
| -
|
| -- Along with ``-timing``, the user can instead get a report on the overall
|
| - translation time for each function, to help focus on timing outliers. Also,
|
| - ``-timing-focus`` limits the ``-timing`` reporting to a single function,
|
| - instead of aggregating pass timing across all functions.
|
| -
|
| -- The ``-szstats`` option reports various statistics on each function, such as
|
| - stack frame size, static instruction count, etc. It may be helpful to track
|
| - these stats over time as Subzero is improved, as an approximate measure of
|
| - code quality.
|
| -
|
| -- The flag ``-asm-verbose``, in conjunction with emitting textual assembly
|
| - output, annotates the assembly output with register-focused liveness
|
| - information. In particular, each basic block is annotated with which
|
| - registers are live-in and live-out, and each instruction is annotated with
|
| - which registers' and stack locations' live ranges end at that instruction.
|
| - This is really useful when studying the generated code to find opportunities
|
| - for code quality improvements.
|
| -
|
| -Testing and debugging
|
| ----------------------
|
| -
|
| -LLVM lit tests
|
| -^^^^^^^^^^^^^^
|
| -
|
| -For basic testing, Subzero uses LLVM's `lit
|
| -<http://llvm.org/docs/CommandGuide/lit.html>`_ framework for running tests. We
|
| -have a suite of hundreds of small functions where we test for particular
|
| -assembly code patterns across different target architectures.
|
| -
|
| -Cross tests
|
| -^^^^^^^^^^^
|
| -
|
| -Unfortunately, the lit tests don't do a great job of precisely testing the
|
| -correctness of the output. Much better are the cross tests, which are execution
|
| -tests that compare Subzero and ``pnacl-llc`` translated bitcode across a wide
|
| -variety of interesting inputs. Each cross test consists of a set of C, C++,
|
| -and/or low-level bitcode files. The C and C++ source files are compiled down to
|
| -bitcode. The bitcode files are translated by ``pnacl-llc`` and also by Subzero.
|
| -Subzero mangles global symbol names with a special prefix to avoid duplicate
|
| -symbol errors. A driver program invokes both versions on a large set of
|
| -interesting inputs, and reports when the Subzero and ``pnacl-llc`` results
|
| -differ. Cross tests turn out to be an excellent way of testing the basic
|
| -lowering patterns, but they are less useful for testing more global things like
|
| -liveness analysis and register allocation.
|
| -
|
| -Bisection debugging
|
| -^^^^^^^^^^^^^^^^^^^
|
| -
|
| -Sometimes with a new application, Subzero will end up producing incorrect code
|
| -that either crashes at runtime or otherwise produces the wrong results. When
|
| -this happens, we need to narrow it down to a single function (or small set of
|
| -functions) that yield incorrect behavior. For this, we have a bisection
|
| -debugging framework. Here, we initially translate the entire application once
|
| -with Subzero and once with ``pnacl-llc``. We then use ``objcopy`` to
|
| -selectively weaken symbols based on a whitelist or blacklist provided on the
|
| -command line. The two object files can then be linked together without link
|
| -errors, with the desired version of each method "winning". Then the binary is
|
| -tested, and bisection proceeds based on whether the binary produces correct
|
| -output.
|
| -
|
| -When the bisection completes, we are left with a minimal set of
|
| -Subzero-translated functions that cause the failure. Usually it is a single
|
| -function, though sometimes it might require a combination of several functions
|
| -to cause a failure; this may be due to an incorrect call ABI, for example.
|
| -However, Murphy's Law implies that the single failing function is enormous and
|
| -impractical to debug. In that case, we can restart the bisection, explicitly
|
| -blacklisting the enormous function, and try to find another candidate to debug.
|
| -(Future work is to automate this to find all minimal sets of functions, so that
|
| -debugging can focus on the simplest example.)
|
| -
|
| -Fuzz testing
|
| -^^^^^^^^^^^^
|
| -
|
| -As described above, we try to find internal Subzero bugs using fuzz testing
|
| -techniques.
|
| -
|
| -Sanitizers
|
| -^^^^^^^^^^
|
| -
|
| -Subzero can be built with `AddressSanitizer
|
| -<http://clang.llvm.org/docs/AddressSanitizer.html>`_ (ASan) or `ThreadSanitizer
|
| -<http://clang.llvm.org/docs/ThreadSanitizer.html>`_ (TSan) support. This is
|
| -done using something as simple as ``make ASAN=1`` or ``make TSAN=1``. So far,
|
| -multithreading has been simple enough that TSan hasn't found any bugs, but ASan
|
| -has found at least one memory leak which was subsequently fixed.
|
| -`UndefinedBehaviorSanitizer
|
| -<http://clang.llvm.org/docs/UsersManual.html#controlling-code-generation>`_
|
| -(UBSan) support is in progress. `Control flow integrity sanitization
|
| -<http://clang.llvm.org/docs/ControlFlowIntegrity.html>`_ is also under
|
| -consideration.
|
| -
|
| -Current status
|
| -==============
|
| -
|
| -Target architectures
|
| ---------------------
|
| -
|
| -Subzero is currently more or less complete for the x86-32 target. It has been
|
| -refactored and extended to handle x86-64 as well, and that is mostly complete at
|
| -this point.
|
| -
|
| -ARM32 work is in progress. It currently lacks the testing level of x86, at
|
| -least in part because Subzero's register allocator needs modifications to handle
|
| -ARM's aliasing of floating point and vector registers. Specifically, a 64-bit
|
| -register is actually a gang of two consecutive and aligned 32-bit registers, and
|
| -a 128-bit register is a gang of four consecutive and aligned 32-bit registers.
|
| -ARM64 work has not started; when it does, it will be native-only, since the
|
| -Native Client sandbox model, validator, and other tools were never defined
|
| -for ARM64.
|
| -
|
| -An external contributor is adding MIPS support, largely by following the
|
| -ARM work.
|
| -
|
| -Translator performance
|
| -----------------------
|
| -
|
| -Single-threaded translation speed is currently about 5× the ``pnacl-llc``
|
| -translation speed. For a large ``.pexe`` file, the time breaks down as:
|
| -
|
| -- 11% for parsing and initial IR building
|
| -
|
| -- 4% for emitting to /dev/null
|
| -
|
| -- 27% for liveness analysis (two liveness passes plus live range construction)
|
| -
|
| -- 15% for linear-scan register allocation
|
| -
|
| -- 9% for basic lowering
|
| -
|
| -- 10% for advanced phi lowering
|
| -
|
| -- ~11% for other minor analysis
|
| -
|
| -- ~10% measurement overhead to acquire these numbers
|
| -
|
| -Some improvements could undoubtedly be made, but it will be hard to increase the
|
| -speed to 10× that of ``pnacl-llc`` while keeping acceptable code quality. With
|
| -``-Om1`` (lack of) optimization, we do actually achieve roughly 10×
|
| -``pnacl-llc`` translation speed, but code quality drops by a factor of 3.
|
| -
|
| -Code quality
|
| -------------
|
| -
|
| -Measured across 16 components of spec2k, Subzero's code quality is uniformly
|
| -better than ``pnacl-llc`` ``-O0`` code quality, and in many cases solidly
|
| -between ``pnacl-llc`` ``-O0`` and ``-O2``.
|
| -
|
| -Translator size
|
| ----------------
|
| -
|
| -When built in MINIMAL mode, the x86-64 native translator size for the x86-32
|
| -target is about 700 KB, not including the size of functions referenced in
|
| -dynamically-linked libraries. The sandboxed version of Subzero is a bit over 1
|
| -MB, and it is statically linked and also includes nop padding for bundling as
|
| -well as indirect branch masking.
|
| -
|
| -Translator memory footprint
|
| ----------------------------
|
| -
|
| -It's hard to draw firm conclusions about memory footprint, since the footprint
|
| -is at least proportional to the input function size, and there is no real limit
|
| -on the size of functions in the ``.pexe`` file.
|
| -
|
| -That said, we looked at the memory footprint over time as Subzero translated
|
| -``pnacl-llc.pexe``, which is the largest ``.pexe`` file (7.2 MB) at our
|
| -disposal. One of LLVM's libraries that Subzero uses can report the current
|
| -malloc heap usage. With single-threaded translation, Subzero tends to hover
|
| -around 15 MB of memory usage. There are a couple of monstrous functions where
|
| -Subzero grows to around 100 MB, but then it drops back down after those
|
| -functions finish translating. In contrast, ``pnacl-llc`` grows larger and
|
| -larger throughout translation, reaching several hundred MB by the time it
|
| -completes.
|
| -
|
| -It's a bit more interesting when we enable multithreaded translation. When
|
| -there are N translation threads, Subzero implements a policy that limits the
|
| -size of the translation queue to N entries -- if it is "full" when the parser
|
| -tries to add a new CFG, the parser blocks until one of the translation threads
|
| -removes a CFG. This means the number of in-memory CFGs can (and generally does)
|
| -reach 2*N+1 (N queued, N being translated, and one being parsed), and so the
|
| -memory footprint rises in proportion to the number of
|
| -threads. Adding to the pressure is the observation that the monstrous functions
|
| -also take proportionally longer time to translate, so there's a good chance many
|
| -of the monstrous functions will be active at the same time with multithreaded
|
| -translation. As a result, for N=32, Subzero's memory footprint peaks at about
|
| -260 MB, but drops back down as the large functions finish translating.
|
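| -
|
| -The queue policy is essentially a bounded producer/consumer queue, as in the
|
| -sketch below (simplified; shutdown handling and Subzero's actual types are
|
| -omitted)::
|
| -
|
| -  // Sketch: translation queue bounded at N entries.  The parser (producer)
|
| -  // blocks when the queue is full; translation threads (consumers) block
|
| -  // when it is empty.
|
| -  #include <condition_variable>
|
| -  #include <cstddef>
|
| -  #include <deque>
|
| -  #include <mutex>
|
| -  #include <utility>
|
| -
|
| -  template <typename CfgPtr> class BoundedQueue {
|
| -    std::mutex Mtx;
|
| -    std::condition_variable NotFull, NotEmpty;
|
| -    std::deque<CfgPtr> Items;
|
| -    const std::size_t MaxSize;
|
| -
|
| -  public:
|
| -    explicit BoundedQueue(std::size_t N) : MaxSize(N) {}
|
| -    void push(CfgPtr Func) { // called by the parser thread
|
| -      std::unique_lock<std::mutex> Lock(Mtx);
|
| -      NotFull.wait(Lock, [this] { return Items.size() < MaxSize; });
|
| -      Items.push_back(std::move(Func));
|
| -      NotEmpty.notify_one();
|
| -    }
|
| -    CfgPtr pop() { // called by each translation thread
|
| -      std::unique_lock<std::mutex> Lock(Mtx);
|
| -      NotEmpty.wait(Lock, [this] { return !Items.empty(); });
|
| -      CfgPtr Func = std::move(Items.front());
|
| -      Items.pop_front();
|
| -      NotFull.notify_one();
|
| -      return Func;
|
| -    }
|
| -  };
|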
| -
|
| -If this peak memory size becomes a problem, it might be possible for the parser
|
| -to resequence the functions to try to spread out the larger functions, or to
|
| -throttle the translation queue to prevent too many in-flight large functions.
|
| -It may also be possible to throttle based on memory pressure signaling from
|
| -Chrome.
|
| -
|
| -Translator scalability
|
| -----------------------
|
| -
|
| -Currently scalability is "not very good". Multiple translation threads lead to
|
| -faster translation, but not to the degree desired. We haven't dug in to
|
| -investigate yet.
|
| -
|
| -There are a few areas to investigate. First, there may be contention on the
|
| -constant pool, which all threads access, and which requires locked access even
|
| -for reading. This could be mitigated by keeping a CFG-local cache of the most
|
| -common constants.
|
| -
|
| -Second, there may be contention on memory allocation. While almost all CFG
|
| -objects are allocated from the CFG-local allocator, some passes use temporary
|
| -STL containers that use the default allocator, which may require global locking.
|
| -This could be mitigated by switching these to the CFG-local allocator.
|
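| -
|
| -For example, a pass-local container could be pointed at the CFG-local arena
|
| -through a small allocator adapter, roughly as sketched below (the ``Arena``
|
| -interface shown here is illustrative, not Subzero's actual allocator)::
|
| -
|
| -  // Sketch: a minimal C++ allocator adapter that carves memory out of a
|
| -  // CFG-local arena, avoiding the (globally locked) default allocator.
|
| -  #include <cstddef>
|
| -  #include <vector>
|
| -
|
| -  struct Arena {
|
| -    // Trivial stand-in: a real arena bump-allocates from large slabs and
|
| -    // frees everything at once when the CFG is destroyed.
|
| -    void *allocate(std::size_t Size, std::size_t /*Align*/) {
|
| -      return ::operator new(Size);
|
| -    }
|
| -  };
|
| -
|
| -  template <typename T> class ArenaAlloc {
|
| -    template <typename U> friend class ArenaAlloc;
|
| -    Arena *A;
|
| -
|
| -  public:
|
| -    using value_type = T;
|
| -    explicit ArenaAlloc(Arena *TheArena) : A(TheArena) {}
|
| -    template <typename U>
|
| -    ArenaAlloc(const ArenaAlloc<U> &Other) : A(Other.A) {}
|
| -    T *allocate(std::size_t N) {
|
| -      return static_cast<T *>(A->allocate(N * sizeof(T), alignof(T)));
|
| -    }
|
| -    void deallocate(T *, std::size_t) {} // arena memory is freed all at once
|
| -    bool operator==(const ArenaAlloc &O) const { return A == O.A; }
|
| -    bool operator!=(const ArenaAlloc &O) const { return A != O.A; }
|
| -  };
|
| -
|
| -  // Usage inside a pass, assuming the CFG exposes its arena:
|
| -  //   ArenaAlloc<int> Alloc(TheCfgArena);
|
| -  //   std::vector<int, ArenaAlloc<int>> Temp(Alloc);
|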
| -
|
| -Third, multithreading may make the default allocator strategy more expensive.
|
| -In a single-threaded environment, a pass will allocate its containers, run the
|
| -pass, and deallocate the containers. This results in stack-like allocation
|
| -behavior and makes the heap free list easier to manage, with less heap
|
| -fragmentation. But when multithreading is added, the allocations and
|
| -deallocations become much less stack-like, making allocation and deallocation
|
| -operations individually more expensive. Again, this could be mitigated by
|
| -switching these to the CFG-local allocator.
|
|
|