Index: DESIGN.rst |
diff --git a/DESIGN.rst b/DESIGN.rst |
deleted file mode 100644 |
index 3d324f64ba81798b5a7a038b24f265bbe292254b..0000000000000000000000000000000000000000 |
--- a/DESIGN.rst |
+++ /dev/null |
@@ -1,1494 +0,0 @@ |
-Design of the Subzero fast code generator |
-========================================= |
- |
-Introduction |
------------- |
- |
-The `Portable Native Client (PNaCl) <http://gonacl.com>`_ project includes |
-compiler technology based on `LLVM <http://llvm.org/>`_. The developer uses the |
-PNaCl toolchain to compile their application to architecture-neutral PNaCl |
-bitcode (a ``.pexe`` file), using as much architecture-neutral optimization as |
-possible. The ``.pexe`` file is downloaded to the user's browser where the |
-PNaCl translator (a component of Chrome) compiles the ``.pexe`` file to |
-`sandboxed |
-<https://developer.chrome.com/native-client/reference/sandbox_internals/index>`_ |
-native code. The translator uses architecture-specific optimizations as much as |
-practical to generate good native code. |
- |
-The native code can be cached by the browser to avoid repeating translation on |
-future page loads. However, first-time user experience is hampered by long |
-translation times. The LLVM-based PNaCl translator is pretty slow, even when |
-using ``-O0`` to minimize optimizations, so delays are especially noticeable on |
-slow browser platforms such as ARM-based Chromebooks. |
- |
-Translator slowness can be mitigated or hidden in a number of ways. |
- |
-- Parallel translation. However, slow machines where this matters most, e.g. |
- ARM-based Chromebooks, are likely to have fewer cores to parallelize across, |
-  and are likely to have less memory available for multiple translation
-  threads to use.
- |
-- Streaming translation, i.e. start translating as soon as the download starts. |
-  This doesn't help much when translation speed is 10× slower than download
-  speed, or when the ``.pexe`` file is already cached but the translated binary
-  was flushed from the cache.
- |
-- Arrange the web page such that translation is done in parallel with |
- downloading large assets. |
- |
-- Arrange the web page to distract the user with `cat videos |
- <https://www.youtube.com/watch?v=tLt5rBfNucc>`_ while translation is in |
- progress. |
- |
-Or, improve translator performance to something more reasonable. |
- |
-This document describes Subzero's attempt to improve translation speed by an |
-order of magnitude while rivaling LLVM's code quality. Subzero does this |
-through minimal IR layering, lean data structures and passes, and a careful |
-selection of fast optimization passes. It has two optimization recipes: full |
-optimizations (``O2``) and minimal optimizations (``Om1``). The recipes are the |
-following (described in more detail below): |
- |
-+-----------------------------------+-----------------------+ |
-| O2 recipe | Om1 recipe | |
-+===================================+=======================+ |
-| Parse .pexe file | Parse .pexe file | |
-+-----------------------------------+-----------------------+ |
-| Loop nest analysis | | |
-+-----------------------------------+-----------------------+ |
-| Address mode inference | | |
-+-----------------------------------+-----------------------+ |
-| Read-modify-write (RMW) transform | | |
-+-----------------------------------+-----------------------+ |
-| Basic liveness analysis | | |
-+-----------------------------------+-----------------------+ |
-| Load optimization | | |
-+-----------------------------------+-----------------------+ |
-| | Phi lowering (simple) | |
-+-----------------------------------+-----------------------+ |
-| Target lowering | Target lowering | |
-+-----------------------------------+-----------------------+ |
-| Full liveness analysis | | |
-+-----------------------------------+-----------------------+ |
-| Register allocation | | |
-+-----------------------------------+-----------------------+ |
-| Phi lowering (advanced) | | |
-+-----------------------------------+-----------------------+ |
-| Post-phi register allocation | | |
-+-----------------------------------+-----------------------+ |
-| Branch optimization | | |
-+-----------------------------------+-----------------------+ |
-| Code emission | Code emission | |
-+-----------------------------------+-----------------------+ |
- |
-Goals |
-===== |
- |
-Translation speed |
------------------ |
- |
-We'd like to be able to translate a ``.pexe`` file as fast as download speed. |
-Any faster is in a sense wasted effort. Download speed varies greatly, but |
-we'll arbitrarily say 1 MB/sec. We'll pick the ARM A15 CPU as the example of a |
-slow machine. We observe a 3× single-thread performance difference between A15 |
-and a high-end x86 Xeon E5-2690 based workstation, and aggressively assume a |
-``.pexe`` file could be compressed to 50% on the web server using gzip transport |
-compression, so we set the translation speed goal to 6 MB/sec on the high-end
-Xeon workstation (1 MB/sec download × 2 for decompression × 3 for the
-single-thread CPU difference).
- |
-Currently, at the ``-O0`` level, the LLVM-based PNaCl translator translates at
-⅒ of the target rate. The ``-O2`` mode takes 3× as long as the ``-O0`` mode.
- |
-In other words, Subzero's goal is to improve over LLVM's translation speed by |
-10×. |
- |
-Code quality |
------------- |
- |
-Subzero's initial goal is to produce code that meets or exceeds LLVM's ``-O0`` |
-code quality. The stretch goal is to approach LLVM ``-O2`` code quality. On |
-average, LLVM ``-O2`` performs twice as well as LLVM ``-O0``. |
- |
-It's important to note that the quality of Subzero-generated code depends on |
-target-neutral optimizations and simplifications being run beforehand in the |
-developer environment. The ``.pexe`` file reflects these optimizations. For |
-example, Subzero assumes that the basic blocks are ordered topologically where |
-possible (which makes liveness analysis converge fastest), and Subzero does not |
-do any function inlining because it should already have been done. |
- |
-Translator size |
---------------- |
- |
-The current LLVM-based translator binary (``pnacl-llc``) is about 10 MB in size. |
-We think 1 MB is a more reasonable size -- especially for a component that is
-distributed to a billion Chrome users. Thus we target a 10× reduction in
-binary size. |
- |
-For development, Subzero can be built for all target architectures, with all
-debugging and diagnostic options enabled. For a smaller translator, we restrict
-to a single target architecture, and define a ``MINIMAL`` build where |
-unnecessary features are compiled out. |
- |
-Subzero leverages some data structures from LLVM's ``ADT`` and ``Support`` |
-include directories, which have little impact on translator size. It also uses |
-some of LLVM's bitcode decoding code (for binary-format ``.pexe`` files), again |
-with little size impact. In non-``MINIMAL`` builds, the translator size is much |
-larger due to including code for parsing text-format bitcode files and forming |
-LLVM IR. |
- |
-Memory footprint |
----------------- |
- |
-The current LLVM-based translator suffers from an issue in which some |
-function-specific data has to be retained in memory until all translation |
-completes, and therefore the memory footprint grows without bound. Large |
-``.pexe`` files can lead to the translator process holding hundreds of MB of |
-memory by the end. The translator runs in a separate process, so this memory |
-growth doesn't *directly* affect other processes, but it does dirty the physical |
-memory and contributes to a perception of bloat and sometimes a reality of |
-out-of-memory tab killing, especially noticeable on weaker systems. |
- |
-Subzero should maintain a stable memory footprint throughout translation. It's |
-not practical to set a specific limit, because there is no practical limit on a
-single function's size, but the footprint should be
-"reasonable" and be proportional to the largest input function size, not the |
-total ``.pexe`` file size. Simply put, Subzero should not have memory leaks or |
-inexorable memory growth. (We use ASAN builds to test for leaks.) |
- |
-Multithreaded translation |
-------------------------- |
- |
-It should be practical to translate different functions concurrently and see |
-good scalability. Some locking may be needed, such as accessing output buffers |
-or constant pools, but that should be fairly minimal. In contrast, LLVM was |
-only designed for module-level parallelism, and as such, the PNaCl translator |
-internally splits a ``.pexe`` file into several modules for concurrent |
-translation. All output needs to be deterministic regardless of the level of |
-multithreading, i.e. functions and data should always be output in the same |
-order. |
- |
-Target architectures |
--------------------- |
- |
-Initial target architectures are x86-32, x86-64, ARM32, and MIPS32. Future |
-targets include ARM64 and MIPS64, though these targets lack NaCl support,
-including a sandbox model and validator.
- |
-The first implementation is for x86-32, because it was expected to be |
-particularly challenging, and thus more likely to draw out any design problems |
-early: |
- |
-- There are a number of special cases, asymmetries, and warts in the x86 |
- instruction set. |
- |
-- Complex addressing modes may be leveraged for better code quality. |
- |
-- 64-bit integer operations have to be lowered into longer sequences of 32-bit |
- operations. |
- |
-- Paucity of physical registers may reveal code quality issues early in the |
- design. |
- |
-Detailed design |
-=============== |
- |
-Intermediate representation - ICE |
---------------------------------- |
- |
-Subzero's IR is called ICE. It is designed to be reasonably similar to LLVM's |
-IR, which is reflected in the ``.pexe`` file's bitcode structure. It has a |
-representation of global variables and initializers, and a set of functions. |
-Each function contains a list of basic blocks, and each basic block contains a
-list of instructions. Instructions that operate on stack and register variables |
-do so using static single assignment (SSA) form. |
- |
-The ``.pexe`` file is translated one function at a time (or in parallel by |
-multiple translation threads). The recipe for optimization passes depends on |
-the specific target and optimization level, and is described in detail below. |
-Global variables (types and initializers) are simply and directly translated to |
-object code, without any meaningful attempts at optimization. |
- |
-A function's control flow graph (CFG) is represented by the ``Ice::Cfg`` class. |
-Its key contents include: |
- |
-- A list of ``CfgNode`` pointers, generally held in topological order. |
- |
-- A list of ``Variable`` pointers corresponding to local variables used in the |
- function plus compiler-generated temporaries. |
- |
-A basic block is represented by the ``Ice::CfgNode`` class. Its key contents |
-include: |
- |
-- A linear list of instructions, in the same style as LLVM. The last |
- instruction of the list is always a terminator instruction: branch, switch, |
- return, unreachable. |
- |
-- A list of Phi instructions, also in the same style as LLVM. They are held as |
- a linear list for convenience, though per Phi semantics, they are executed "in |
- parallel" without dependencies on each other. |
- |
-- An unordered list of ``CfgNode`` pointers corresponding to incoming edges, and |
- another list for outgoing edges. |
- |
-- The node's unique, 0-based index into the CFG's node list. |
- |
-An instruction is represented by the ``Ice::Inst`` class. Its key contents |
-include: |
- |
-- A list of source operands. |
- |
-- Its destination variable, if the instruction produces a result in an |
- ``Ice::Variable``. |
- |
-- A bitvector indicating which variables' live ranges this instruction ends. |
- This is computed during liveness analysis. |
- |
-Instruction kinds are divided into high-level ICE instructions and low-level
-ICE instructions. High-level instructions consist of the PNaCl/LLVM bitcode |
-instruction kinds. Each target architecture implementation extends the |
-instruction space with its own set of low-level instructions. Generally, |
-low-level instructions correspond to individual machine instructions. The |
-high-level ICE instruction space includes a few additional instruction kinds |
-that are not part of LLVM but are generally useful (e.g., an Assignment |
-instruction), or are useful across targets (e.g., BundleLock and BundleUnlock |
-instructions for sandboxing). |
- |
-Specifically, high-level ICE instructions that derive from LLVM (but with PNaCl |
-ABI restrictions as documented in the `PNaCl Bitcode Reference Manual |
-<https://developer.chrome.com/native-client/reference/pnacl-bitcode-abi>`_) are |
-the following: |
- |
-- Alloca: allocate data on the stack |
- |
-- Arithmetic: binary operations of the form ``A = B op C`` |
- |
-- Br: conditional or unconditional branch |
- |
-- Call: function call |
- |
-- Cast: unary type-conversion operations |
- |
-- ExtractElement: extract a scalar element from a vector-type value |
- |
-- Fcmp: floating-point comparison |
- |
-- Icmp: integer comparison |
- |
-- IntrinsicCall: call a known intrinsic |
- |
-- InsertElement: insert a scalar element into a vector-type value |
- |
-- Load: load a value from memory |
- |
-- Phi: implement the SSA phi node |
- |
-- Ret: return from the function |
- |
-- Select: essentially the C language operation of the form ``X = C ? Y : Z`` |
- |
-- Store: store a value into memory |
- |
-- Switch: generalized branch to multiple possible locations |
- |
-- Unreachable: indicate that this portion of the code is unreachable |
- |
-The additional high-level ICE instructions are the following: |
- |
-- Assign: a simple ``A=B`` assignment. This is useful for e.g. lowering Phi |
- instructions to non-SSA assignments, before lowering to machine code. |
- |
-- BundleLock, BundleUnlock. These are markers used for sandboxing, but are |
- common across all targets and so they are elevated to the high-level |
- instruction set. |
- |
-- FakeDef, FakeUse, FakeKill. These are tools used to preserve consistency in |
- liveness analysis, elevated to the high-level because they are used by all |
- targets. They are described in more detail at the end of this section. |
- |
-- JumpTable: this represents the result of switch optimization analysis, where |
- some switch instructions may use jump tables instead of cascading |
- compare/branches. |
- |
-An operand is represented by the ``Ice::Operand`` class. In high-level ICE, an |
-operand is either an ``Ice::Constant`` or an ``Ice::Variable``. Constants |
-include scalar integer constants, scalar floating point constants, Undef (an |
-unspecified constant of a particular scalar or vector type), and symbol |
-constants (essentially addresses of globals). Note that the PNaCl ABI does not |
-include vector-type constants besides Undef, and as such, Subzero (so far) has |
-no reason to represent vector-type constants internally. A variable represents |
-a value allocated on the stack (though not including alloca-derived storage). |
-Among other things, a variable holds its unique, 0-based index into the CFG's |
-variable list. |
- |
-Each target can extend the ``Constant`` and ``Variable`` classes for its own |
-needs. In addition, the ``Operand`` class may be extended, e.g. to define an |
-x86 ``MemOperand`` that encodes a base register, an index register, an index |
-register shift amount, and a constant offset. |
- |
-Register allocation and liveness analysis are restricted to Variable operands. |
-Because of the importance of register allocation to code quality, and the |
-translation-time cost of liveness analysis, Variable operands get some special |
-treatment in ICE. Most notably, a frequent pattern in Subzero is to iterate |
-across all the Variables of an instruction. An instruction holds a list of |
-operands, but an operand may contain 0, 1, or more Variables. As such, the |
-``Operand`` class specially holds a list of Variables contained within, for |
-quick access. |
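-
-As a rough illustration of this access pattern (hypothetical class shapes, not
-Subzero's actual definitions), every Operand kind might expose the Variables it
-contains, so a pass can walk an instruction's Variables without knowing the
-concrete operand kinds::
-
-  #include <cstddef>
-  #include <vector>
-
-  class Variable;
-
-  class Operand {
-  public:
-    virtual ~Operand() = default;
-    // Each concrete operand kind reports the Variables it contains (0, 1, or more).
-    std::size_t getNumVars() const { return Vars.size(); }
-    Variable *getVar(std::size_t I) const { return Vars[I]; }
-
-  protected:
-    std::vector<Variable *> Vars; // filled in by each concrete operand kind
-  };
-
-  class Inst {
-  public:
-    // Visit every Variable referenced by this instruction's source operands.
-    template <typename F> void forEachSrcVar(F Visit) const {
-      for (Operand *Src : Srcs)
-        for (std::size_t I = 0; I < Src->getNumVars(); ++I)
-          Visit(Src->getVar(I));
-    }
-
-  private:
-    std::vector<Operand *> Srcs;
-  };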
- |
-A Subzero transformation pass may work by deleting an existing instruction and |
-replacing it with zero or more new instructions. Instead of actually deleting |
-the existing instruction, we generally mark it as deleted and insert the new |
-instructions right after the deleted instruction. When printing the IR for |
-debugging, this is a big help because it makes it much more clear how the |
-non-deleted instructions came about. |
- |
-Subzero has a few special instructions to help with liveness analysis |
-consistency. |
- |
-- The FakeDef instruction gives a fake definition of some variable. For |
- example, on x86-32, a divide instruction defines both ``%eax`` and ``%edx`` |
- but an ICE instruction can represent only one destination variable. This is |
- similar for multiply instructions, and for function calls that return a 64-bit |
- integer result in the ``%edx:%eax`` pair. Also, using the ``xor %eax, %eax`` |
- trick to set ``%eax`` to 0 requires an initial FakeDef of ``%eax``. |
- |
-- The FakeUse instruction registers a use of a variable, typically to prevent an |
- earlier assignment to that variable from being dead-code eliminated. For |
- example, lowering an operation like ``x=cc?y:z`` may be done using x86's |
- conditional move (cmov) instruction: ``mov z, x; cmov_cc y, x``. Without a |
- FakeUse of ``x`` between the two instructions, the liveness analysis pass may |
- dead-code eliminate the first instruction. |
- |
-- The FakeKill instruction is added after a call instruction, and is a quick way |
- of indicating that caller-save registers are invalidated. |
- |
-Pexe parsing |
------------- |
- |
-Subzero includes an integrated PNaCl bitcode parser for ``.pexe`` files. It |
-parses the ``.pexe`` file function by function, ultimately constructing an ICE |
-CFG for each function. After a function is parsed, its CFG is handed off to the |
-translation phase. The bitcode parser also parses global initializer data and |
-hands it off to be translated to data sections in the object file. |
- |
-Subzero has another parsing strategy for testing/debugging. LLVM libraries can |
-be used to parse a module into LLVM IR (though very slowly relative to Subzero |
-native parsing). Then we iterate across the LLVM IR and construct high-level |
-ICE, handing off each CFG to the translation phase. |
- |
-Overview of lowering |
--------------------- |
- |
-In general, translation goes like this: |
- |
-- Parse the next function from the ``.pexe`` file and construct a CFG consisting |
- of high-level ICE. |
- |
-- Do analysis passes and transformation passes on the high-level ICE, as |
- desired. |
- |
-- Lower each high-level ICE instruction into a sequence of zero or more |
- low-level ICE instructions. Each high-level instruction is generally lowered |
- independently, though the target lowering is allowed to look ahead in the |
- CfgNode's instruction list if desired. |
- |
-- Do more analysis and transformation passes on the low-level ICE, as desired. |
- |
-- Assemble the low-level CFG into an ELF object file (alternatively, a textual |
- assembly file that is later assembled by some external tool). |
- |
-- Repeat for all functions, and also produce object code for data such as global |
- initializers and internal constant pools. |
- |
-Currently there are two optimization levels: ``O2`` and ``Om1``. For ``O2``, |
-the intention is to apply all available optimizations to get the best code |
-quality (though the initial code quality goal is measured against LLVM's ``O0`` |
-code quality). For ``Om1``, the intention is to apply as few optimizations as |
-possible and produce code as quickly as possible, accepting poor code quality. |
-``Om1`` is short for "O-minus-one", i.e. "worse than O0", or in other words, |
-"sub-zero". |
- |
-High-level debuggability of generated code is so far not a design requirement. |
-Subzero doesn't really do transformations that would obfuscate debugging; the |
-main thing might be that register allocation (including stack slot coalescing |
-for stack-allocated variables whose live ranges don't overlap) may render a |
-variable's value unobtainable after its live range ends. This would not be an |
-issue for ``Om1`` since it doesn't register-allocate program-level variables, |
-nor does it coalesce stack slots. That said, fully supporting debuggability |
-would require a few additions: |
- |
-- DWARF support would need to be added to Subzero's ELF file emitter. Subzero |
- propagates global symbol names, local variable names, and function-internal |
- label names that are present in the ``.pexe`` file. This would allow a |
- debugger to map addresses back to symbols in the ``.pexe`` file. |
- |
-- To map ``.pexe`` file symbols back to meaningful source-level symbol names, |
- file names, line numbers, etc., Subzero would need to handle `LLVM bitcode |
- metadata <http://llvm.org/docs/LangRef.html#metadata>`_ and ``llvm.dbg`` |
-  `intrinsics <http://llvm.org/docs/LangRef.html#dbg-intrinsics>`_.
- |
-- The PNaCl toolchain explicitly strips all this from the ``.pexe`` file, and so |
- the toolchain would need to be modified to preserve it. |
- |
-Our experience so far is that ``Om1`` translates twice as fast as ``O2``, but |
-produces code with one third the code quality. ``Om1`` is good for testing and |
-debugging -- during translation, it tends to expose errors in the basic lowering |
-that might otherwise have been hidden by the register allocator or other |
-optimization passes. It also helps determine whether a code correctness problem |
-is a fundamental problem in the basic lowering, or an error in another |
-optimization pass. |
- |
-The implementation of target lowering also controls the recipe of passes used |
-for ``Om1`` and ``O2`` translation. For example, address mode inference may |
-only be relevant for x86. |
- |
-Lowering strategy |
------------------ |
- |
-The core of Subzero's lowering from high-level ICE to low-level ICE is to lower |
-each high-level instruction down to a sequence of low-level target-specific |
-instructions, in a largely context-free setting. That is, each high-level |
-instruction conceptually has a simple template expansion into low-level |
-instructions, and lowering can in theory be done in any order. This may sound |
-like a small effort, but quite a large number of templates may be needed because |
-of the number of PNaCl types and instruction variants. Furthermore, there may |
-be optimized templates, e.g. to take advantage of operator commutativity (for |
-example, ``x=x+1`` might allow a better lowering than ``x=1+x``). This is
-similar to other template-based approaches in fast code generation or |
-interpretation, though some decisions are deferred until after some global |
-analysis passes, mostly related to register allocation, stack slot assignment, |
-and specific choice of instruction variant and addressing mode. |
- |
-The key idea for a lowering template is to produce valid low-level instructions |
-that are guaranteed to meet address mode and other structural requirements of |
-the instruction set. For example, on x86, the source operand of an integer |
-store instruction must be an immediate or a physical register; a shift |
-instruction's shift amount must be an immediate or in register ``%cl``; a |
-function's integer return value is in ``%eax``; most x86 instructions are |
-two-operand, in contrast to corresponding three-operand high-level instructions; |
-etc. |
- |
-Because target lowering runs before register allocation, there is no way to know |
-whether a given ``Ice::Variable`` operand lives on the stack or in a physical |
-register. When the low-level instruction calls for a physical register operand, |
-the target lowering can create an infinite-weight Variable. This tells the |
-register allocator to assign infinite weight when making decisions, effectively |
-guaranteeing some physical register. Variables can also be pre-colored to a |
-specific physical register (``cl`` in the shift example above), which also gives |
-infinite weight. |
- |
-To illustrate, consider a high-level arithmetic instruction on 32-bit integer |
-operands:: |
- |
- A = B + C |
- |
-X86 target lowering might produce the following:: |
- |
- T.inf = B // mov instruction |
- T.inf += C // add instruction |
- A = T.inf // mov instruction |
- |
-Here, ``T.inf`` is an infinite-weight temporary. As long as ``T.inf`` has a |
-physical register, the three lowered instructions are all encodable regardless |
-of whether ``B`` and ``C`` are physical registers, memory, or immediates, and |
-whether ``A`` is a physical register or in memory. |
- |
-In this example, ``A`` must be a Variable and one may be tempted to simplify the |
-lowering sequence by setting ``A`` as infinite-weight and using:: |
- |
- A = B // mov instruction |
- A += C // add instruction |
- |
-This has two problems. First, if the original instruction was actually ``A = |
-B + A``, the result would be incorrect. Second, assigning ``A`` a physical |
-register applies throughout ``A``'s entire live range. This is probably not |
-what is intended, and may ultimately lead to a failure to allocate a register |
-for an infinite-weight variable. |
- |
-This style of lowering leads to many temporaries being generated, so in ``O2`` |
-mode, we rely on the register allocator to clean things up. For example, in the |
-example above, if ``B`` ends up getting a physical register and its live range |
-ends at this instruction, the register allocator is likely to reuse that |
-register for ``T.inf``. This leads to ``T.inf=B`` being a redundant register |
-copy, which is removed as an emission-time peephole optimization. |
- |
-O2 lowering |
------------ |
- |
-Currently, the ``O2`` lowering recipe is the following: |
- |
-- Loop nest analysis |
- |
-- Address mode inference |
- |
-- Read-modify-write (RMW) transformation |
- |
-- Basic liveness analysis |
- |
-- Load optimization |
- |
-- Target lowering |
- |
-- Full liveness analysis |
- |
-- Register allocation |
- |
-- Phi instruction lowering (advanced) |
- |
-- Post-phi lowering register allocation |
- |
-- Branch optimization |
- |
-These passes are described in more detail below. |
- |
-Om1 lowering |
------------- |
- |
-Currently, the ``Om1`` lowering recipe is the following: |
- |
-- Phi instruction lowering (simple) |
- |
-- Target lowering |
- |
-- Register allocation (infinite-weight and pre-colored only) |
- |
-Optimization passes |
-------------------- |
- |
-Liveness analysis |
-^^^^^^^^^^^^^^^^^ |
- |
-Liveness analysis is a standard dataflow optimization, implemented as follows. |
-For each node (basic block), its live-out set is computed as the union of the |
-live-in sets of its successor nodes. Then the node's instructions are processed |
-in reverse order, updating the live set, until the beginning of the node is |
-reached, and the node's live-in set is recorded. If this iteration has changed |
-the node's live-in set, the node's predecessors are marked for reprocessing. |
-This continues until no more nodes need reprocessing. If nodes are processed in |
-reverse topological order, the number of iterations over the CFG is generally |
-equal to the maximum loop nest depth. |
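-
-A minimal sketch of this backward dataflow loop, using simplified node and
-per-instruction def/use sets rather than Subzero's real classes::
-
-  #include <bitset>
-  #include <cstddef>
-  #include <deque>
-  #include <vector>
-
-  constexpr std::size_t MaxVars = 1024;
-  using LiveSet = std::bitset<MaxVars>; // one bit per variable
-
-  struct Node {
-    std::vector<int> Preds, Succs;           // predecessor/successor node indices
-    std::vector<LiveSet> InstDefs, InstUses; // per-instruction defs/uses, in order
-    LiveSet LiveIn, LiveOut;
-  };
-
-  // Iterate until no node's live-in set changes.  RevTopo lists every node
-  // once, in reverse topological order, so the loop converges quickly.
-  void computeLiveness(std::vector<Node> &Nodes, const std::vector<int> &RevTopo) {
-    std::deque<int> Work(RevTopo.begin(), RevTopo.end());
-    std::vector<bool> InWork(Nodes.size(), true);
-    while (!Work.empty()) {
-      const int N = Work.front();
-      Work.pop_front();
-      InWork[N] = false;
-      Node &Nd = Nodes[N];
-      // Live-out is the union of the successors' live-in sets.
-      Nd.LiveOut.reset();
-      for (int S : Nd.Succs)
-        Nd.LiveOut |= Nodes[S].LiveIn;
-      // Walk the instructions in reverse: Live = (Live - defs) | uses.
-      LiveSet Live = Nd.LiveOut;
-      for (std::size_t I = Nd.InstDefs.size(); I-- > 0;) {
-        Live &= ~Nd.InstDefs[I];
-        Live |= Nd.InstUses[I];
-      }
-      if (Live != Nd.LiveIn) {
-        Nd.LiveIn = Live;
-        for (int P : Nd.Preds) // predecessors need reprocessing
-          if (!InWork[P]) {
-            Work.push_back(P);
-            InWork[P] = true;
-          }
-      }
-    }
-  }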
- |
-To implement this, each node records its live-in and live-out sets, initialized |
-to the empty set. Each instruction records which of its Variables' live ranges |
-end in that instruction, initialized to the empty set. A side effect of |
-liveness analysis is dead instruction elimination. Each instruction can be |
-marked as tentatively dead, and after the algorithm converges, the tentatively |
-dead instructions are permanently deleted. |
- |
-Optionally, after this liveness analysis completes, we can do live range |
-construction, in which we calculate the live range of each variable in terms of |
-instruction numbers. A live range is represented as a union of segments, where |
-the segment endpoints are instruction numbers. Instruction numbers are required |
-to be unique across the CFG, and monotonically increasing within a basic block. |
-As a union of segments, live ranges can contain "gaps" and are therefore |
-precise. Because of SSA properties, a variable's live range can start at most |
-once in a basic block, and can end at most once in a basic block. Liveness |
-analysis keeps track of which variable/instruction tuples begin live ranges and |
-end live ranges, and combined with live-in and live-out sets, we can efficiently |
-build up live ranges of all variables across all basic blocks. |
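-
-For instance, a live range might be modeled as a sorted list of half-open
-segments of instruction numbers, with an overlap test (a sketch, not Subzero's
-actual representation)::
-
-  #include <cstddef>
-  #include <cstdint>
-  #include <vector>
-
-  // A live range as a union of [Start, End) segments of instruction numbers,
-  // kept sorted and merged so that gaps are represented precisely.
-  struct LiveRange {
-    struct Segment {
-      uint32_t Start, End;
-    };
-    std::vector<Segment> Segments;
-
-    // Assumes segments are added in increasing order of Start.
-    void addSegment(uint32_t Start, uint32_t End) {
-      if (!Segments.empty() && Segments.back().End == Start)
-        Segments.back().End = End; // extend the previous segment
-      else
-        Segments.push_back({Start, End});
-    }
-
-    // Two ranges overlap iff some pair of segments intersects.  Both lists are
-    // sorted, so a single merge-style walk suffices.
-    bool overlaps(const LiveRange &Other) const {
-      std::size_t I = 0, J = 0;
-      while (I < Segments.size() && J < Other.Segments.size()) {
-        const Segment &A = Segments[I], &B = Other.Segments[J];
-        if (A.End <= B.Start)
-          ++I;
-        else if (B.End <= A.Start)
-          ++J;
-        else
-          return true;
-      }
-      return false;
-    }
-  };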
- |
-A lot of care is taken to try to make liveness analysis fast and efficient. |
-Because of the lowering strategy, the number of variables is generally |
-proportional to the number of instructions, leading to an O(N^2) complexity |
-algorithm if implemented naively. To improve things based on sparsity, we note |
-that most variables are "local" and referenced in at most one basic block (in |
-contrast to the "global" variables with multi-block usage), and therefore cannot |
-be live across basic blocks. Therefore, the live-in and live-out sets, |
-typically represented as bit vectors, can be limited to the set of global |
-variables, and the intra-block liveness bit vector can be compacted to hold the |
-global variables plus the local variables for that block. |
- |
-Register allocation |
-^^^^^^^^^^^^^^^^^^^ |
- |
-Subzero implements a simple linear-scan register allocator, based on the |
-allocator described by Hanspeter Mössenböck and Michael Pfeiffer in `Linear Scan |
-Register Allocation in the Context of SSA Form and Register Constraints |
-<ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe02.PDF>`_. This allocator has |
-several nice features: |
- |
-- Live ranges are represented as unions of segments, as described above, rather |
- than a single start/end tuple. |
- |
-- It allows pre-coloring of variables with specific physical registers. |
- |
-- It applies equally well to pre-lowered Phi instructions. |
- |
-The paper suggests an approach of aggressively coalescing variables across Phi |
-instructions (i.e., trying to force Phi source and destination variables to have |
-the same register assignment), but we reject that in favor of the more natural |
-preference mechanism described below. |
- |
-We enhance the algorithm in the paper with the capability of automatic inference |
-of register preference, and with the capability of allowing overlapping live |
-ranges to safely share the same register in certain circumstances. If we are |
-considering register allocation for variable ``A``, and ``A`` has a single |
-defining instruction ``A=B+C``, then the preferred register for ``A``, if |
-available, would be the register assigned to ``B`` or ``C``, if any, provided |
-that ``B`` or ``C``'s live range does not overlap ``A``'s live range. In this |
-way we infer a good register preference for ``A``. |
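-
-A minimal sketch of that preference computation (hypothetical names, not
-Subzero's API)::
-
-  // Preference inference for a variable A whose single defining instruction is
-  // "A = B op C".  NoRegister and the fields here are illustrative placeholders.
-  constexpr int NoRegister = -1;
-
-  struct Var {
-    int AssignedReg = NoRegister; // physical register, if already allocated
-  };
-
-  // Returns a preferred physical register for A, or NoRegister if none applies.
-  int inferPreference(const Var &B, const Var &C, bool AOverlapsB, bool AOverlapsC) {
-    // Prefer a source operand's register, but only if that source's live range
-    // does not overlap A's live range (otherwise they must differ anyway).
-    if (B.AssignedReg != NoRegister && !AOverlapsB)
-      return B.AssignedReg;
-    if (C.AssignedReg != NoRegister && !AOverlapsC)
-      return C.AssignedReg;
-    return NoRegister;
-  }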
- |
-We allow overlapping live ranges to get the same register in certain cases. |
-Suppose a high-level instruction like:: |
- |
- A = unary_op(B) |
- |
-has been target-lowered like:: |
- |
- T.inf = B |
- A = unary_op(T.inf) |
- |
-Further, assume that ``B``'s live range continues beyond this instruction |
-sequence, and that ``B`` has already been assigned some register. Normally, we |
-might want to infer ``B``'s register as a good candidate for ``T.inf``, but it |
-turns out that ``T.inf`` and ``B``'s live ranges overlap, requiring them to have |
-different registers. But ``T.inf`` is just a read-only copy of ``B`` that is |
-guaranteed to be in a register, so in theory these overlapping live ranges could |
-safely have the same register. Our implementation allows this overlap as long |
-as ``T.inf`` is never modified within ``B``'s live range, and ``B`` is never |
-modified within ``T.inf``'s live range. |
- |
-Subzero's register allocator can be run in 3 configurations. |
- |
-- Normal mode. All Variables are considered for register allocation. It |
- requires full liveness analysis and live range construction as a prerequisite. |
- This is used by ``O2`` lowering. |
- |
-- Minimal mode. Only infinite-weight or pre-colored Variables are considered. |
- All other Variables are stack-allocated. It does not require liveness |
- analysis; instead, it quickly scans the instructions and records first |
- definitions and last uses of all relevant Variables, using that to construct a |
- single-segment live range. Although this includes most of the Variables, the |
- live ranges are mostly simple, short, and rarely overlapping, which the |
- register allocator handles efficiently. This is used by ``Om1`` lowering. |
- |
-- Post-phi lowering mode. Advanced phi lowering is done after normal-mode |
- register allocation, and may result in new infinite-weight Variables that need |
- registers. One would like to just run something like minimal mode to assign |
- registers to the new Variables while respecting existing register allocation |
- decisions. However, it sometimes happens that there are no free registers. |
- In this case, some register needs to be forcibly spilled to the stack and |
- temporarily reassigned to the new Variable, and reloaded at the end of the new |
- Variable's live range. The register must be one that has no explicit |
- references during the Variable's live range. Since Subzero currently doesn't |
- track def/use chains (though it does record the CfgNode where a Variable is |
- defined), we just do a brute-force search across the CfgNode's instruction |
- list for the instruction numbers of interest. This situation happens very |
- rarely, so there's little point for now in improving its performance. |
- |
-The basic linear-scan algorithm may, as it proceeds, rescind an early register |
-allocation decision, leaving that Variable to be stack-allocated. Sometimes it
-turns out that the Variable could have been given a different register
-without conflict, but by this time it's too late. The literature recognizes |
-this situation and describes "second-chance bin-packing", which Subzero can do. |
-We can rerun the register allocator in a mode that respects existing register |
-allocation decisions, and sometimes it finds new non-conflicting opportunities. |
-In fact, we can repeatedly run the register allocator until convergence. |
-Unfortunately, in the current implementation, these subsequent register |
-allocation passes end up being extremely expensive. This is because of the |
-treatment of the "unhandled pre-colored" Variable set, which is normally very |
-small but ends up being quite large on subsequent passes. Its performance can |
-probably be made acceptable with a better choice of data structures, but for now |
-this second-chance mechanism is disabled. |
- |
-Future work is to implement LLVM's `Greedy |
-<http://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html>`_ |
-register allocator as a replacement for the basic linear-scan algorithm, given |
-LLVM's experience with its improvement in code quality. (The blog post claims |
-that the Greedy allocator also improved maintainability because a lot of hacks |
-could be removed, but Subzero is probably not yet to that level of hacks, and is |
-less likely to see that particular benefit.) |
- |
-Basic phi lowering |
-^^^^^^^^^^^^^^^^^^ |
- |
-The simplest phi lowering strategy works as follows (this is how LLVM ``-O0`` |
-implements it). Consider this example:: |
- |
- L1: |
- ... |
- br L3 |
- L2: |
- ... |
- br L3 |
- L3: |
- A = phi [B, L1], [C, L2] |
- X = phi [Y, L1], [Z, L2] |
- |
-For each destination of a phi instruction, we can create a temporary and insert |
-the temporary's assignment at the end of the predecessor block:: |
- |
- L1: |
- ... |
- A' = B |
- X' = Y |
- br L3 |
- L2: |
- ... |
- A' = C |
- X' = Z |
- br L3 |
-  L3:
- A = A' |
- X = X' |
- |
-This transformation is very simple and reliable. It can be done before target |
-lowering and register allocation, and it easily avoids the classic lost-copy and |
-related problems. ``Om1`` lowering uses this strategy. |
- |
-However, it has the disadvantage of initializing temporaries even for branches |
-not taken, though that could be mitigated by splitting non-critical edges and |
-putting assignments in the edge-split nodes. Another problem is that without |
-extra machinery, the assignments to ``A``, ``A'``, ``X``, and ``X'`` are given a |
-specific ordering even though phi semantics are that the assignments are |
-parallel or unordered. This sometimes imposes false live range overlaps and |
-leads to poorer register allocation. |
- |
-Advanced phi lowering |
-^^^^^^^^^^^^^^^^^^^^^ |
- |
-``O2`` lowering defers phi lowering until after register allocation to avoid the |
-problem of false live range overlaps. It works as follows. We split each |
-incoming edge and move the (parallel) phi assignments into the split nodes. We |
-linearize each set of assignments by finding a safe, topological ordering of the |
-assignments, respecting register assignments as well. For example:: |
- |
- A = B |
- X = Y |
- |
-Normally these assignments could be executed in either order, but if ``B`` and |
-``X`` are assigned the same physical register, we would want to use the above |
-ordering. Dependency cycles are broken by introducing a temporary. For |
-example:: |
- |
- A = B |
- B = A |
- |
-Here, a temporary breaks the cycle:: |
- |
- t = A |
- A = B |
- B = t |
- |
-Finally, we use the existing target lowering to lower the assignments in this |
-basic block, and once that is done for all basic blocks, we run the post-phi |
-variant of register allocation on the edge-split basic blocks. |
- |
-When computing a topological order, we try to first schedule assignments whose |
-source has a physical register, and last schedule assignments whose destination |
-has a physical register. This helps reduce register pressure. |
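-
-The linearization itself can be sketched as follows (plain strings stand in for
-operands; Subzero additionally weighs the register assignments as described
-above when picking the order)::
-
-  #include <cstddef>
-  #include <string>
-  #include <utility>
-  #include <vector>
-
-  using Move = std::pair<std::string, std::string>; // {Dst, Src}
-
-  // Turn a set of parallel assignments into an equivalent sequential list,
-  // introducing temporaries only when a dependency cycle forces it.
-  std::vector<Move> sequentialize(std::vector<Move> Moves) {
-    std::vector<Move> Out;
-    int TempCount = 0;
-    while (!Moves.empty()) {
-      bool Progress = false;
-      for (std::size_t I = 0; I < Moves.size(); ++I) {
-        const std::string Dst = Moves[I].first;
-        // A move is safe to emit once no remaining move still reads its Dst.
-        bool DstStillRead = false;
-        for (std::size_t J = 0; J < Moves.size(); ++J)
-          if (J != I && Moves[J].second == Dst)
-            DstStillRead = true;
-        if (!DstStillRead) {
-          Out.push_back(Moves[I]);
-          Moves.erase(Moves.begin() + I);
-          Progress = true;
-          break;
-        }
-      }
-      if (Progress)
-        continue;
-      // No move is safe, so a dependency cycle exists.  Save one destination's
-      // old value in a fresh temporary, emit that move, and redirect its
-      // remaining readers to the temporary.  This always preserves semantics
-      // and guarantees progress.
-      const std::string Temp = "t" + std::to_string(TempCount++);
-      const Move M = Moves.front();
-      Moves.erase(Moves.begin());
-      Out.push_back({Temp, M.first}); // Temp = Dst (save the old value)
-      Out.push_back(M);               // Dst = Src
-      for (Move &Rest : Moves)
-        if (Rest.second == M.first)
-          Rest.second = Temp;
-    }
-    return Out;
-  }
-
-For the cyclic example above, this produces exactly ``t = A; A = B; B = t``.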
- |
-X86 address mode inference |
-^^^^^^^^^^^^^^^^^^^^^^^^^^ |
- |
-We try to take advantage of the x86 addressing mode that includes a base |
-register, an index register, an index register scale amount, and an immediate |
-offset. We do this through simple pattern matching. Starting with a load or |
-store instruction where the address is a variable, we initialize the base |
-register to that variable, and look up the instruction where that variable is |
-defined. If that is an add instruction of two variables and the index register |
-hasn't been set, we replace the base and index register with those two |
-variables. If instead it is an add instruction of a variable and a constant, we |
-replace the base register with the variable and add the constant to the |
-immediate offset. |
- |
-There are several more patterns that can be matched. This pattern matching |
-continues on the load or store instruction until no more matches are found. |
-Because a program typically has few load and store instructions (not to be |
-confused with instructions that manipulate stack variables), this address mode |
-inference pass is fast. |
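-
-In outline, the matching loop might look like this (hypothetical types, not
-Subzero's implementation, and covering only the two patterns described above)::
-
-  #include <cstdint>
-
-  struct Var;
-
-  // Hypothetical summary of a variable's single defining instruction.
-  struct Def {
-    enum Kind { None, AddVarVar, AddVarConst } K = None;
-    Var *A = nullptr, *B = nullptr; // operands for Var + Var
-    Var *V = nullptr;               // variable operand for Var + Const
-    int32_t C = 0;                  // constant operand for Var + Const
-  };
-
-  struct Var {
-    Def DefiningInst;
-  };
-
-  // The x86 address being built up: Base + Index*Scale + Offset.
-  struct AddrMode {
-    Var *Base = nullptr;
-    Var *Index = nullptr;
-    int Scale = 1;
-    int32_t Offset = 0;
-  };
-
-  // Repeatedly fold the Base variable's defining instruction into the address
-  // until no more patterns match.  (The real pass matches more patterns, e.g.
-  // shifts and multiplies feeding the index register and its scale.)
-  void inferAddrMode(AddrMode &AM) {
-    bool Changed = true;
-    while (Changed && AM.Base != nullptr) {
-      Changed = false;
-      const Def &D = AM.Base->DefiningInst;
-      if (D.K == Def::AddVarVar && AM.Index == nullptr) {
-        AM.Base = D.A;  // base := first addend
-        AM.Index = D.B; // index := second addend; scale stays 1
-        Changed = true;
-      } else if (D.K == Def::AddVarConst) {
-        AM.Offset += D.C; // fold the constant into the displacement
-        AM.Base = D.V;
-        Changed = true;
-      }
-    }
-  }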
- |
-X86 read-modify-write inference |
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
- |
-A reasonably common bitcode pattern is a non-atomic update of a memory |
-location:: |
- |
- x = load addr |
- y = add x, 1 |
- store y, addr |
- |
-On x86, with good register allocation, the Subzero passes described above |
-generate code that is only this good::
- |
- mov [%ebx], %eax |
- add $1, %eax |
- mov %eax, [%ebx] |
- |
-However, x86 allows for this kind of code:: |
- |
- add $1, [%ebx] |
- |
-which requires fewer instructions, but perhaps more importantly, requires fewer |
-physical registers. |
- |
-It's also important to note that this transformation only makes sense if the |
-store instruction ends ``x``'s live range. |
- |
-Subzero's ``O2`` recipe includes an early pass to find read-modify-write (RMW) |
-opportunities via simple pattern matching. The only problem is that it is run |
-before liveness analysis, which is needed to determine whether ``x``'s live |
-range ends after the RMW. Since liveness analysis is one of the most expensive |
-passes, it's not attractive to run it an extra time just for RMW analysis. |
-Instead, we essentially generate both the RMW and the non-RMW versions, and then |
-during lowering, the RMW version deletes itself if it finds ``x`` still live.
- |
-X86 compare-branch inference |
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
- |
-In the LLVM instruction set, the compare/branch pattern works like this:: |
- |
- cond = icmp eq a, b |
- br cond, target |
- |
-The result of the icmp instruction is a single bit, and a conditional branch |
-tests that bit. By contrast, most target architectures use this pattern:: |
- |
- cmp a, b // implicitly sets various bits of FLAGS register |
- br eq, target // branch on a particular FLAGS bit |
- |
-A naive lowering sequence conditionally sets ``cond`` to 0 or 1, then tests |
-``cond`` and conditionally branches. Subzero has a pass that identifies |
-boolean-based operations like this and folds them into a single |
-compare/branch-like operation. It is set up for more than just cmp/br though. |
-Boolean producers include icmp (integer compare), fcmp (floating-point compare), |
-and trunc (integer truncation when the destination has bool type). Boolean |
-consumers include branch, select (the ternary operator from the C language), and |
-sign-extend and zero-extend when the source has bool type. |
- |
-Sandboxing |
-^^^^^^^^^^ |
- |
-Native Client's sandbox model uses software fault isolation (SFI) to provide |
-safety when running untrusted code in a browser or other environment. Subzero |
-implements Native Client's `sandboxing |
-<https://developer.chrome.com/native-client/reference/sandbox_internals/index>`_ |
-to enable Subzero-translated executables to be run inside Chrome. Subzero also |
-provides a fairly simple framework for investigating alternative sandbox models |
-or other restrictions on the sandbox model. |
- |
-Sandboxing in Subzero is not actually implemented as a separate pass, but is |
-integrated into lowering and assembly. |
- |
-- Indirect branches, including the ret instruction, are masked to a bundle |
- boundary and bundle-locked. |
- |
-- Call instructions are aligned to the end of the bundle so that the return |
- address is bundle-aligned. |
- |
-- Indirect branch targets, including function entry and targets in a switch |
- statement jump table, are bundle-aligned. |
- |
-- The intrinsic for reading the thread pointer is inlined appropriately. |
- |
-- For x86-64, non-stack memory accesses are with respect to the reserved sandbox |
- base register. We reduce the aggressiveness of address mode inference to |
- leave room for the sandbox base register during lowering. There are no memory |
- sandboxing changes for x86-32. |
- |
-Code emission |
-------------- |
- |
-Subzero's integrated assembler is derived from Dart's `assembler code |
-<https://github.com/dart-lang/sdk/tree/master/runtime/vm>`_. There is a pass
-that iterates through the low-level ICE instructions and invokes the relevant |
-assembler functions. Placeholders are added for later fixup of branch target |
-offsets. (Backward branches use short offsets if possible; forward branches |
-generally use long offsets unless it is an intra-block branch of "known" short |
-length.) The assembler emits into a staging buffer. Once emission into the |
-staging buffer for a function is complete, the data is emitted to the output |
-file as an ELF object file, and metadata such as relocations, symbol table, and |
-string table, are accumulated for emission at the end. Global data initializers |
-are emitted similarly. A key point is that at this point, the staging buffer |
-can be deallocated, and only a minimum of data needs to be held until the end.
- |
-As a debugging alternative, Subzero can emit textual assembly code which can |
-then be run through an external assembler. This is of course super slow, but |
-quite valuable when bringing up a new target. |
- |
-As another debugging option, the staging buffer can be emitted as textual |
-assembly, primarily in the form of ".byte" lines. This allows the assembler to |
-be tested separately from the ELF related code. |
- |
-Memory management |
------------------ |
- |
-Where possible, we allocate from a ``CfgLocalAllocator`` which derives from |
-LLVM's ``BumpPtrAllocator``. This is an arena-style allocator where objects |
-allocated from the arena are never actually freed; instead, when the CFG |
-translation completes and the CFG is deleted, the entire arena memory is |
-reclaimed at once. This style of allocation works well in an environment like a |
-compiler where there are distinct phases with only easily-identifiable objects |
-living across phases. It frees the developer from having to manage object |
-deletion, and it amortizes deletion costs across just a single arena deletion at |
-the end of the phase. Furthermore, it helps scalability by allocating entirely |
-from thread-local memory pools, and minimizing global locking of the heap. |
- |
-Instructions are probably the most heavily allocated complex class in Subzero. |
-We represent an instruction list as an intrusive doubly linked list, allocate |
-all instructions from the ``CfgLocalAllocator``, and we make sure each |
-instruction subclass is basically `POD |
-<http://en.cppreference.com/w/cpp/concept/PODType>`_ (Plain Old Data) with a |
-trivial destructor. This way, when the CFG is finished, we don't need to |
-individually deallocate every instruction. We do the same for Variables, which
-is probably the second most heavily allocated complex class.
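-
-A toy version of this allocation style (Subzero builds on LLVM's
-``BumpPtrAllocator``; this sketch only illustrates bump allocation combined
-with trivially destructible, intrusively linked objects)::
-
-  #include <cstddef>
-  #include <new>
-  #include <vector>
-
-  // Toy arena: carve allocations out of large slabs and release everything at
-  // once when the arena is destroyed, instead of deleting objects individually.
-  class Arena {
-  public:
-    static constexpr std::size_t SlabSize = 64 * 1024;
-    ~Arena() {
-      for (char *Slab : Slabs)
-        ::operator delete(Slab);
-    }
-    // Assumes Size <= SlabSize.
-    void *allocate(std::size_t Size, std::size_t Align) {
-      std::size_t P = (Cur + Align - 1) & ~(Align - 1);
-      if (Slabs.empty() || P + Size > SlabSize) {
-        Slabs.push_back(static_cast<char *>(::operator new(SlabSize)));
-        P = 0;
-      }
-      Cur = P + Size;
-      return Slabs.back() + P;
-    }
-
-  private:
-    std::vector<char *> Slabs;
-    std::size_t Cur = 0;
-  };
-
-  // Objects allocated this way must be trivially destructible, because their
-  // destructors are never run.
-  struct InstSketch {
-    int Opcode;
-    InstSketch *Prev, *Next; // intrusive doubly linked list
-  };
-
-  InstSketch *newInst(Arena &A, int Opcode) {
-    return new (A.allocate(sizeof(InstSketch), alignof(InstSketch)))
-        InstSketch{Opcode, nullptr, nullptr};
-  }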
- |
-There are some situations where passes need to use some `STL container class |
-<http://en.cppreference.com/w/cpp/container>`_. Subzero has a way of using the |
-``CfgLocalAllocator`` as the container allocator if this is needed. |
- |
-Multithreaded translation |
-------------------------- |
- |
-Subzero is designed to be able to translate functions in parallel. With the |
-``-threads=N`` command-line option, there is a 3-stage producer-consumer |
-pipeline: |
- |
-- A single thread parses the ``.pexe`` file and produces a sequence of work |
- units. A work unit can be either a fully constructed CFG, or a set of global |
- initializers. The work unit includes its sequence number denoting its parse |
- order. Each work unit is added to the translation queue. |
- |
-- There are N translation threads that draw work units from the translation |
- queue and lower them into assembler buffers. Each assembler buffer is added |
- to the emitter queue, tagged with its sequence number. The CFG and its |
- ``CfgLocalAllocator`` are disposed of at this point. |
- |
-- A single thread draws assembler buffers from the emitter queue and appends to |
- the output file. It uses the sequence numbers to reintegrate the assembler |
- buffers according to the original parse order, such that output order is |
- always deterministic. |
- |
-This means that with ``-threads=N``, there are actually ``N+1`` spawned threads |
-for a total of ``N+2`` execution threads, taking the parser and emitter threads |
-into account. For the special case of ``N=0``, execution is entirely sequential |
--- the same thread parses, translates, and emits, one function at a time. This |
-is useful for performance measurements. |
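-
-The emitter's reintegration step can be sketched as follows (simplified; the
-real pipeline coordinates through Subzero's own queues and buffer types)::
-
-  #include <cstdint>
-  #include <map>
-  #include <string>
-  #include <utility>
-
-  // Translated buffers arrive in arbitrary order, tagged with their parse-order
-  // sequence number, but are appended to the output strictly in sequence order
-  // so the output is deterministic.
-  class Emitter {
-  public:
-    void add(uint32_t Sequence, std::string Buffer) {
-      Pending.emplace(Sequence, std::move(Buffer));
-      flushReady();
-    }
-    const std::string &output() const { return Output; }
-
-  private:
-    void flushReady() {
-      // Emit every buffer whose turn has come, then stop at the first gap.
-      auto It = Pending.find(Next);
-      while (It != Pending.end()) {
-        Output += It->second;
-        Pending.erase(It);
-        It = Pending.find(++Next);
-      }
-    }
-    uint32_t Next = 0;                       // next sequence number to emit
-    std::map<uint32_t, std::string> Pending; // buffers waiting for their turn
-    std::string Output;                      // stands in for the object file
-  };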
- |
-Ideally, we would like to get near-linear scalability as the number of |
-translation threads increases. We expect that ``-threads=1`` should be slightly |
-faster than ``-threads=0`` as the small amount of time spent parsing and |
-emitting is done largely in parallel with translation. With perfect |
-scalability, we see ``-threads=N`` translating ``N`` times as fast as |
-``-threads=1``, up until the point where parsing or emitting becomes the |
-bottleneck, or ``N+2`` exceeds the number of CPU cores. In reality, memory |
-performance would become a bottleneck and efficiency might peak at, say, 75%. |
- |
-Currently, parsing takes about 11% of total sequential time. If translation
-ever becomes fast and scalable enough that parsing is the bottleneck, it should
-be possible to make parsing multithreaded as well.
- |
-Internally, all shared, mutable data is held in the GlobalContext object, and |
-access to each field is guarded by a mutex. |
- |
-Security |
--------- |
- |
-Subzero includes a number of security features in the generated code, as well as |
-in the Subzero translator itself, which run on top of the existing Native Client |
-sandbox as well as Chrome's OS-level sandbox. |
- |
-Sandboxed translator |
-^^^^^^^^^^^^^^^^^^^^ |
- |
-When running inside the browser, the Subzero translator executes as sandboxed, |
-untrusted code that is initially checked by the validator, just like the |
-LLVM-based ``pnacl-llc`` translator. As such, the Subzero binary should be no |
-more or less secure than the translator it replaces, from the point of view of |
-the Chrome sandbox. That said, Subzero is much smaller than ``pnacl-llc`` and |
-was designed from the start with security in mind, so one expects fewer attacker |
-opportunities here. |
- |
-Code diversification |
-^^^^^^^^^^^^^^^^^^^^ |
- |
-`Return-oriented programming |
-<https://en.wikipedia.org/wiki/Return-oriented_programming>`_ (ROP) is a |
-now-common technique for turning e.g. a known buffer overflow into a deeper
-exploit. The attacker scans the executable
-looking for ROP gadgets, which are short sequences of code that happen to load |
-known values into known registers and then return. An attacker who manages to |
-overwrite parts of the stack can overwrite it with carefully chosen return |
-addresses such that certain ROP gadgets are effectively chained together to set |
-up the register state as desired, finally returning to some code that manages to |
-do something nasty based on those register values. |
- |
-If there is a popular ``.pexe`` with a large install base, the attacker could |
-run Subzero on it and scan the executable for suitable ROP gadgets to use as |
-part of a potential exploit. Note that if the trusted validator is working |
-correctly, these ROP gadgets are limited to starting at a bundle boundary and |
-cannot use the trick of finding a gadget that happens to begin inside another |
-instruction. All the same, gadgets with these constraints still exist and the |
-attacker has access to them. This is the attack model we focus most on -- |
-protecting the user against misuse of a "trusted" developer's application, as |
-opposed to mischief from a malicious ``.pexe`` file. |
- |
-Subzero can mitigate these attacks to some degree through code diversification. |
-Specifically, we can apply some randomness to the code generation that makes ROP |
-gadgets less predictable. This randomness can have some compile-time cost and
-can affect code quality, and some diversifications may be more effective
-than others. A more detailed treatment of hardening techniques may be found in |
-the Matasano report "`Attacking Clientside JIT Compilers |
-<https://www.nccgroup.trust/globalassets/resources/us/presentations/documents/attacking_clientside_jit_compilers_paper.pdf>`_". |
- |
-To evaluate diversification effectiveness, we use a third-party ROP gadget |
-finder and limit its results to bundle-aligned addresses. For a given |
-diversification technique, we run it with a number of different random seeds, |
-find ROP gadgets for each version, and determine how persistent each ROP gadget |
-is across the different versions. A gadget is persistent if the same gadget is |
-found at the same code address. The best diversifications are ones with low |
-gadget persistence rates. |
- |
-Subzero implements 7 different diversification techniques. Below is a |
-discussion of each technique, its effectiveness, and its cost. The discussions |
-of cost and effectiveness are for a single diversification technique; the |
-translation-time costs for multiple techniques are additive, but the effects of |
-multiple techniques on code quality and effectiveness are not yet known. |
- |
-In Subzero's implementation, each randomization is "repeatable" in a sense. |
-Each pass that includes a randomization option gets its own private instance of |
-a random number generator (RNG). The RNG is seeded with a combination of a |
-global seed, the pass ID, and the function's sequence number. The global seed |
-is designed to be different across runs (perhaps based on the current time), but |
-for debugging, the global seed can be set to a specific value and the results |
-will be repeatable. |
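-
-For example, the per-pass seed might be derived along these lines (a sketch;
-the exact mixing Subzero uses is not specified here)::
-
-  #include <cstdint>
-  #include <random>
-
-  // Derive a repeatable per-pass, per-function RNG from the global seed, a pass
-  // identifier, and the function's parse-order sequence number.  With a fixed
-  // global seed, every run makes the same "random" choices.
-  std::mt19937_64 makePassRNG(uint64_t GlobalSeed, uint64_t PassID,
-                              uint64_t FuncSequence) {
-    std::seed_seq Seq{GlobalSeed, PassID, FuncSequence};
-    return std::mt19937_64(Seq);
-  }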
- |
-Subzero-generated code is subject to diversification once per translation, and |
-then Chrome caches the diversified binary for subsequent executions. An |
-attacker may attempt to run the binary multiple times hoping for |
-higher-probability combinations of ROP gadgets. When the attacker guesses |
-wrong, a likely outcome is an application crash. Chrome throttles creation of |
-crashy processes, which reduces the likelihood of the attacker eventually
-gaining a foothold.
- |
-Constant blinding |
-~~~~~~~~~~~~~~~~~ |
- |
-Here, we prevent attackers from controlling large immediates in the text |
-(executable) section. A random cookie is generated for each function, and if |
-the constant exceeds a specified threshold, the constant is obfuscated with the |
-cookie and equivalent code is generated. For example, instead of this x86 |
-instruction:: |
- |
- mov $0x11223344, <%Reg/Mem> |
- |
-the following code might be generated:: |
- |
- mov $(0x11223344+Cookie), %temp |
- lea -Cookie(%temp), %temp |
- mov %temp, <%Reg/Mem> |
- |
-The ``lea`` instruction is used rather than e.g. ``add``/``sub`` or ``xor``, to |
-prevent unintended effects on the flags register. |
- |
-This transformation has almost no effect on translation time, and about 1% |
-impact on code quality, depending on the threshold chosen. It does little to |
-reduce gadget persistence, but it does remove a lot of potential opportunities |
-to construct intra-instruction ROP gadgets (which an attacker could use only if |
-a validator bug were discovered, since the Native Client sandbox and associated |
-validator force returns and other indirect branches to be to bundle-aligned |
-addresses). |
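-
-In outline, the decision and rewrite look like this (printing assembly text for
-illustration; the threshold value here is an assumption, not Subzero's)::
-
-  #include <cstdint>
-  #include <cstdio>
-
-  constexpr uint32_t BlindingThreshold = 0xFFFF; // assumed threshold
-
-  // Immediates above the threshold are materialized as (Imm + Cookie) and then
-  // corrected with an lea, which does not disturb the FLAGS register.
-  void emitImmediate(uint32_t Imm, uint32_t Cookie) {
-    if (Imm <= BlindingThreshold) {
-      std::printf("  mov $0x%x, <%%Reg/Mem>\n", Imm);
-      return;
-    }
-    const uint32_t Blinded = Imm + Cookie; // wrapping arithmetic keeps this exact
-    std::printf("  mov $0x%x, %%temp\n", Blinded);
-    std::printf("  lea -0x%x(%%temp), %%temp\n", Cookie);
-    std::printf("  mov %%temp, <%%Reg/Mem>\n");
-  }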
- |
-Constant pooling |
-~~~~~~~~~~~~~~~~ |
- |
-This is similar to constant blinding, in that large immediates are removed from |
-the text section. In this case, each unique constant above the threshold is |
-stored in a read-only data section and the constant is accessed via a memory |
-load. For the above example, the following code might be generated:: |
- |
- mov $Label$1, %temp |
- mov %temp, <%Reg/Mem> |
- |
-This has a similarly small impact on translation time and ROP gadget |
-persistence, and a smaller (better) impact on code quality. This is because it |
-uses fewer instructions, and in some cases good register allocation leads to no |
-increase in instruction count. Note that this still gives an attacker some |
-limited amount of control over some text section values, unless we randomize the |
-constant pool layout. |
- |
-Static data reordering |
-~~~~~~~~~~~~~~~~~~~~~~ |
- |
-This transformation limits the attacker's ability to control bits in global data |
-address references. It simply permutes the order in memory of global variables |
-and internal constant pool entries. For the constant pool, we only permute |
-within a type (i.e., emit a randomized list of ints, followed by a randomized |
-list of floats, etc.) to maintain good packing in the face of alignment |
-constraints. |
- |
-As might be expected, this has no impact on code quality, translation time, or |
-ROP gadget persistence (though as above, it limits opportunities for |
-intra-instruction ROP gadgets with a broken validator). |
- |
-Basic block reordering |
-~~~~~~~~~~~~~~~~~~~~~~ |
- |
-Here, we randomize the order of basic blocks within a function, with the |
-constraint that we still want to maintain a topological order as much as |
-possible, to avoid making the code too branchy. |
- |
-This has no impact on code quality, and about 1% impact on translation time, due |
-to a separate pass to recompute layout. It ends up having a huge effect on ROP |
-gadget persistence, tied for best with nop insertion, reducing ROP gadget |
-persistence to less than 5%. |
- |
-Function reordering |
-~~~~~~~~~~~~~~~~~~~ |
- |
-Here, we permute the order that functions are emitted, primarily to shift ROP |
-gadgets around to less predictable locations. It may also change call address |
-offsets in case the attacker was trying to control that offset in the code. |
- |
-To control latency and memory footprint, we don't arbitrarily permute functions. |
-Instead, for some relatively small value of N, we queue up N assembler buffers,
-emit those N functions in random order, and repeat until all functions are
-emitted.
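-
-The windowed shuffle might look like the following sketch; the buffer type and
-the ``emitFunction`` hook are assumptions made for illustration, not Subzero's
-actual emitter interface::
-
-  #include <algorithm>
-  #include <cstddef>
-  #include <cstdint>
-  #include <random>
-  #include <utility>
-  #include <vector>
-
-  struct AssemblerBuffer { /* encoded machine code for one function */ };
-
-  // Placeholder for the real backend hook that writes one function's code.
-  void emitFunction(const AssemblerBuffer &Buf) { (void)Buf; }
-
-  // Collect up to WindowSize finished functions, flush them in random order,
-  // and repeat. This bounds memory use while still moving functions around.
-  class ReorderingEmitter {
-  public:
-    ReorderingEmitter(size_t WindowSize, uint64_t Seed)
-        : WindowSize(WindowSize), Rng(Seed) {}
-
-    void add(AssemblerBuffer Buf) {
-      Window.push_back(std::move(Buf));
-      if (Window.size() == WindowSize)
-        flush();
-    }
-
-    void finish() { flush(); } // Emit whatever is still buffered at the end.
-
-  private:
-    void flush() {
-      std::shuffle(Window.begin(), Window.end(), Rng);
-      for (const AssemblerBuffer &Buf : Window)
-        emitFunction(Buf);
-      Window.clear();
-    }
-
-    size_t WindowSize;
-    std::mt19937_64 Rng;
-    std::vector<AssemblerBuffer> Window;
-  };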
- |
-Function reordering has no impact on translation time or code quality. |
-Measurements indicate that it reduces ROP gadget persistence to about 15%. |
- |
-Nop insertion |
-~~~~~~~~~~~~~ |
- |
-This diversification randomly adds a nop instruction after each regular |
-instruction, with some probability. Nop instructions of different lengths may |
-be selected. Nop instructions are never added inside a bundle_lock region. |
-Note that when sandboxing is enabled, nop instructions are already being added |
-for bundle alignment, so the diversification nop instructions may simply be |
-taking the place of alignment nop instructions, though distributed differently |
-through the bundle. |
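-
-A sketch of such a pass is shown below, assuming a hypothetical ``Instr``
-record with a nop flag and a bundle_lock marker; Subzero's actual IR types
-differ::
-
-  #include <cstddef>
-  #include <cstdint>
-  #include <random>
-  #include <vector>
-
-  struct Instr {
-    bool IsNop;
-    int SizeBytes;
-    bool InBundleLockRegion; // Never insert nops inside bundle_lock regions.
-  };
-
-  Instr makeNop(int Len) { return {true, Len, false}; }
-
-  // After each regular instruction, insert a nop with probability Prob, with
-  // the nop length chosen uniformly from NopLengths (e.g. 1-, 2-, 3-byte nops).
-  std::vector<Instr> insertNops(const std::vector<Instr> &Input, double Prob,
-                                const std::vector<int> &NopLengths,
-                                uint64_t Seed) {
-    std::mt19937_64 Rng(Seed);
-    std::bernoulli_distribution Flip(Prob);
-    std::uniform_int_distribution<size_t> PickLen(0, NopLengths.size() - 1);
-    std::vector<Instr> Output;
-    for (const Instr &I : Input) {
-      Output.push_back(I);
-      if (!I.InBundleLockRegion && Flip(Rng))
-        Output.push_back(makeNop(NopLengths[PickLen(Rng)]));
-    }
-    return Output;
-  }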
- |
-In Subzero's current implementation, nop insertion adds 3-5% to the
-translation time, but this is probably because it is implemented as a separate |
-pass that adds actual nop instructions to the IR. The overhead would probably |
-be a lot less if it were integrated into the assembler pass. The code quality |
-is also reduced by 3-5%, making nop insertion the most expensive of the |
-diversification techniques. |
- |
-Nop insertion is very effective in reducing ROP gadget persistence, at the same |
-level as basic block randomization (less than 5%). But given nop insertion's |
-impact on translation time and code quality, one would most likely prefer to use |
-basic block randomization instead (though the combined effects of the different |
-diversification techniques have not yet been studied). |
- |
-Register allocation randomization |
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
- |
-In this diversification, the register allocator tries to make different but |
-mostly functionally equivalent choices, while maintaining stable code quality. |
- |
-A naive approach would be the following. Whenever the allocator has more than |
-one choice for assigning a register, choose randomly among those options. And |
-whenever there are no registers available and there is a tie for the |
-lowest-weight variable, randomly select one of the lowest-weight variables to |
-evict. Because of the one-pass nature of the linear-scan algorithm, this |
-randomization strategy can have a large impact on which variables are ultimately |
-assigned registers, with a corresponding large impact on code quality. |
- |
-Instead, we choose an approach that tries to keep code quality stable regardless |
-of the random seed. We partition the set of physical registers into equivalence |
-classes. If a register is pre-colored in the function (i.e., referenced |
-explicitly by name), it forms its own equivalence class. The remaining |
-registers are partitioned according to their combination of attributes such as |
-integer versus floating-point, 8-bit versus 32-bit, caller-save versus |
-callee-saved, etc. Each equivalence class is randomly permuted, and the |
-complete permutation is applied to the final register assignments. |
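-
-The sketch below shows the equivalence-class permutation under simplifying
-assumptions (dense register numbering and a handful of boolean attributes);
-Subzero's real attribute set and data structures are richer::
-
-  #include <algorithm>
-  #include <cstddef>
-  #include <cstdint>
-  #include <map>
-  #include <random>
-  #include <tuple>
-  #include <vector>
-
-  struct RegInfo {
-    int Num;           // Physical register number, assumed dense in [0, N).
-    bool IsPrecolored; // Explicitly referenced by name in the function.
-    bool IsFloat;      // Attributes that define the equivalence class.
-    bool Is8Bit;
-    bool IsCalleeSaved;
-  };
-
-  // Builds a permutation Perm such that register r is renamed to Perm[r].
-  // Pre-colored registers map to themselves; every other register is shuffled
-  // only within the group sharing its attributes, keeping code quality stable.
-  std::vector<int> buildRegisterPermutation(const std::vector<RegInfo> &Regs,
-                                            uint64_t Seed) {
-    std::mt19937_64 Rng(Seed);
-    std::map<std::tuple<bool, bool, bool>, std::vector<int>> Classes;
-    std::vector<int> Perm(Regs.size());
-    for (const RegInfo &R : Regs) {
-      if (R.IsPrecolored)
-        Perm[R.Num] = R.Num; // A singleton equivalence class.
-      else
-        Classes[std::make_tuple(R.IsFloat, R.Is8Bit, R.IsCalleeSaved)]
-            .push_back(R.Num);
-    }
-    for (auto &Entry : Classes) {
-      std::vector<int> Shuffled = Entry.second;
-      std::shuffle(Shuffled.begin(), Shuffled.end(), Rng);
-      for (size_t I = 0; I < Entry.second.size(); ++I)
-        Perm[Entry.second[I]] = Shuffled[I];
-    }
-    return Perm;
-  }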
- |
-Register randomization reduces ROP gadget persistence to about 10% on average, |
-though there tends to be fairly high variance across functions and applications. |
-This probably has to do with the set of restrictions in the x86-32 instruction |
-set and ABI, such as few general-purpose registers, ``%eax`` used for return |
-values, ``%edx`` used for division, ``%cl`` used for shifting, etc. As |
-intended, register randomization has no impact on code quality, and a slight |
-(0.5%) impact on translation time due to an extra scan over the variables to |
-identify pre-colored registers. |
- |
-Fuzzing |
-^^^^^^^ |
- |
-We have started fuzz-testing the ``.pexe`` files input to Subzero, using a |
-combination of `afl-fuzz <http://lcamtuf.coredump.cx/afl/>`_, LLVM's `libFuzzer |
-<http://llvm.org/docs/LibFuzzer.html>`_, and custom tooling. The purpose is to |
-find and fix cases where Subzero crashes or otherwise ungracefully fails on |
-unexpected inputs, and to do so automatically over a large range of unexpected |
-inputs. By fixing bugs that arise from fuzz testing, we reduce the possibility |
-of an attacker exploiting these bugs. |
- |
-Most of the problems found so far are best handled in the parser. However, a
-few have identified problems in the lowering, or have inappropriately triggered
-assertion failures and fatal errors. We continue to dig into this area.
- |
-Future security work |
-^^^^^^^^^^^^^^^^^^^^ |
- |
-Subzero is well-positioned to explore other future security enhancements, e.g.: |
- |
-- Tightening the Native Client sandbox. ABI changes, such as the previous work |
- on `hiding the sandbox base address |
- <https://docs.google.com/document/d/1eskaI4353XdsJQFJLRnZzb_YIESQx4gNRzf31dqXVG8>`_ |
- in x86-64, are easy to experiment with in Subzero. |
- |
-- Making the executable code section read-only. This would prevent a PNaCl |
- application from inspecting its own binary and trying to find ROP gadgets even |
- after code diversification has been performed. It may still be susceptible to |
-  `blind ROP <http://www.scs.stanford.edu/brop/bittau-brop.pdf>`_ attacks, but
-  security is still overall improved.
- |
-- Instruction selection diversification. It may be possible to lower a given |
- instruction in several largely equivalent ways, which gives more opportunities |
- for code randomization. |
- |
-Chrome integration |
------------------- |
- |
-Currently Subzero is available in Chrome for the x86-32 architecture, but under |
-a flag. When the flag is enabled, Subzero is used when the `manifest file |
-<https://developer.chrome.com/native-client/reference/nacl-manifest-format>`_ |
-linking to the ``.pexe`` file specifies the ``O0`` optimization level. |
- |
-The next step is to remove the flag, i.e. invoke Subzero as the only translator |
-for ``O0``-specified manifest files. |
- |
-Ultimately, Subzero might produce code rivaling LLVM ``O2`` quality, in which |
-case Subzero could be used for all PNaCl translation. |
- |
-Command line options |
--------------------- |
- |
-Subzero has a number of command-line options for debugging and diagnostics. |
-Among the more interesting are the following. |
- |
-- Using the ``-verbose`` flag, Subzero will dump the CFG, or produce other |
- diagnostic output, with various levels of detail after each pass. Instruction |
- numbers can be printed or suppressed. Deleted instructions can be printed or |
- suppressed (they are retained in the instruction list, as discussed earlier, |
- because they can help explain how lower-level instructions originated). |
- Liveness information can be printed when available. Details of register |
- allocation can be printed as register allocator decisions are made. And more. |
- |
-- Running Subzero with any level of verbosity produces an enormous amount of
-  output. When debugging a single function, the ``-verbose-focus`` flag
-  suppresses verbose output except for the specified function.
- |
-- Subzero has a ``-timing`` option that prints a breakdown of pass-level timing |
- at exit. Timing markers can be placed in the Subzero source code to demarcate |
- logical operations or passes of interest. Basic timing information plus |
- call-stack type timing information is printed at the end. |
- |
-- In addition to ``-timing``, the user can get a report on the overall
-  translation time for each function, to help focus on timing outliers. Also,
-  ``-timing-focus`` limits the ``-timing`` reporting to a single function,
-  instead of aggregating pass timing across all functions.
- |
-- The ``-szstats`` option reports various statistics on each function, such as |
- stack frame size, static instruction count, etc. It may be helpful to track |
- these stats over time as Subzero is improved, as an approximate measure of |
- code quality. |
- |
-- The flag ``-asm-verbose``, in conjunction with emitting textual assembly
-  output, annotates the assembly output with register-focused liveness
- information. In particular, each basic block is annotated with which |
- registers are live-in and live-out, and each instruction is annotated with |
- which registers' and stack locations' live ranges end at that instruction. |
- This is really useful when studying the generated code to find opportunities |
- for code quality improvements. |
- |
-Testing and debugging |
---------------------- |
- |
-LLVM lit tests |
-^^^^^^^^^^^^^^ |
- |
-For basic testing, Subzero uses LLVM's `lit |
-<http://llvm.org/docs/CommandGuide/lit.html>`_ framework for running tests. We |
-have a suite of hundreds of small functions where we test for particular |
-assembly code patterns across different target architectures. |
- |
-Cross tests |
-^^^^^^^^^^^ |
- |
-Unfortunately, the lit tests don't do a great job of precisely testing the |
-correctness of the output. Much better are the cross tests, which are execution |
-tests that compare Subzero and ``pnacl-llc`` translated bitcode across a wide |
-variety of interesting inputs. Each cross test consists of a set of C, C++, |
-and/or low-level bitcode files. The C and C++ source files are compiled down to |
-bitcode. The bitcode files are translated by ``pnacl-llc`` and also by Subzero. |
-Subzero mangles global symbol names with a special prefix to avoid duplicate |
-symbol errors. A driver program invokes both versions on a large set of |
-interesting inputs, and reports when the Subzero and ``pnacl-llc`` results |
-differ. Cross tests turn out to be an excellent way of testing the basic |
-lowering patterns, but they are less useful for testing more global things like |
-liveness analysis and register allocation. |
- |
-Bisection debugging |
-^^^^^^^^^^^^^^^^^^^ |
- |
-Sometimes with a new application, Subzero will end up producing incorrect code |
-that either crashes at runtime or otherwise produces the wrong results. When |
-this happens, we need to narrow it down to a single function (or small set of |
-functions) that yield incorrect behavior. For this, we have a bisection |
-debugging framework. Here, we initially translate the entire application once |
-with Subzero and once with ``pnacl-llc``. We then use ``objcopy`` to
-selectively weaken symbols based on a whitelist or blacklist provided on the |
-command line. The two object files can then be linked together without link |
-errors, with the desired version of each method "winning". Then the binary is |
-tested, and bisection proceeds based on whether the binary produces correct |
-output. |
- |
-When the bisection completes, we are left with a minimal set of |
-Subzero-translated functions that cause the failure. Usually it is a single |
-function, though sometimes it might require a combination of several functions |
-to cause a failure; this may be due to an incorrect call ABI, for example. |
-However, Murphy's Law implies that the single failing function is enormous and |
-impractical to debug. In that case, we can restart the bisection, explicitly |
-blacklisting the enormous function, and try to find another candidate to debug. |
-(Future work is to automate this to find all minimal sets of functions, so that |
-debugging can focus on the simplest example.) |
- |
-Fuzz testing |
-^^^^^^^^^^^^ |
- |
-As described above, we try to find internal Subzero bugs using fuzz testing |
-techniques. |
- |
-Sanitizers |
-^^^^^^^^^^ |
- |
-Subzero can be built with `AddressSanitizer |
-<http://clang.llvm.org/docs/AddressSanitizer.html>`_ (ASan) or `ThreadSanitizer |
-<http://clang.llvm.org/docs/ThreadSanitizer.html>`_ (TSan) support. This is |
-done using something as simple as ``make ASAN=1`` or ``make TSAN=1``. So far, |
-multithreading has been simple enough that TSan hasn't found any bugs, but ASan |
-has found at least one memory leak which was subsequently fixed. |
-`UndefinedBehaviorSanitizer |
-<http://clang.llvm.org/docs/UsersManual.html#controlling-code-generation>`_ |
-(UBSan) support is in progress. `Control flow integrity sanitization |
-<http://clang.llvm.org/docs/ControlFlowIntegrity.html>`_ is also under |
-consideration. |
- |
-Current status |
-============== |
- |
-Target architectures |
--------------------- |
- |
-Subzero is currently more or less complete for the x86-32 target. It has been |
-refactored and extended to handle x86-64 as well, and that is mostly complete at |
-this point. |
- |
-ARM32 work is in progress. It currently lacks the testing level of x86, at |
-least in part because Subzero's register allocator needs modifications to handle |
-ARM's aliasing of floating point and vector registers. Specifically, a 64-bit |
-register is actually a gang of two consecutive and aligned 32-bit registers, and |
-a 128-bit register is a gang of four consecutive and aligned 32-bit registers.
-ARM64 work has not started; when it does, it will be native-only, since the
-Native Client sandbox model, validator, and other tools were never defined for
-ARM64.
- |
-An external contributor is adding MIPS support, in large part by following the
-ARM32 work.
- |
-Translator performance |
----------------------- |
- |
-Single-threaded translation speed is currently about 5× the ``pnacl-llc`` |
-translation speed. For a large ``.pexe`` file, the time breaks down as: |
- |
-- 11% for parsing and initial IR building |
- |
-- 4% for emitting to /dev/null |
- |
-- 27% for liveness analysis (two liveness passes plus live range construction) |
- |
-- 15% for linear-scan register allocation |
- |
-- 9% for basic lowering |
- |
-- 10% for advanced phi lowering |
- |
-- ~11% for other minor analysis |
- |
-- ~10% measurement overhead to acquire these numbers |
- |
-Some improvements could undoubtedly be made, but it will be hard to increase the |
-speed to 10× the ``pnacl-llc`` speed while keeping acceptable code quality. With
-``-Om1`` (lack of) optimization, we do actually achieve roughly 10× |
-``pnacl-llc`` translation speed, but code quality drops by a factor of 3. |
- |
-Code quality |
------------- |
- |
-Measured across 16 components of spec2k, Subzero's code quality is uniformly |
-better than ``pnacl-llc`` ``-O0`` code quality, and in many cases solidly |
-between ``pnacl-llc`` ``-O0`` and ``-O2``. |
- |
-Translator size |
---------------- |
- |
-When built in MINIMAL mode, the x86-64 native translator size for the x86-32 |
-target is about 700 KB, not including the size of functions referenced in |
-dynamically-linked libraries. The sandboxed version of Subzero is a bit over 1 |
-MB, and it is statically linked and also includes nop padding for bundling as |
-well as indirect branch masking. |
- |
-Translator memory footprint |
---------------------------- |
- |
-It's hard to draw firm conclusions about memory footprint, since the footprint |
-is at least proportional to the input function size, and there is no real limit |
-on the size of functions in the ``.pexe`` file. |
- |
-That said, we looked at the memory footprint over time as Subzero translated |
-``pnacl-llc.pexe``, which is the largest ``.pexe`` file (7.2 MB) at our |
-disposal. One of LLVM's libraries that Subzero uses can report the current |
-malloc heap usage. With single-threaded translation, Subzero tends to hover |
-around 15 MB of memory usage. There are a couple of monstrous functions where |
-Subzero grows to around 100 MB, but then it drops back down after those |
-functions finish translating. In contrast, ``pnacl-llc`` grows larger and |
-larger throughout translation, reaching several hundred MB by the time it |
-completes. |
- |
-It's a bit more interesting when we enable multithreaded translation. When |
-there are N translation threads, Subzero implements a policy that limits the |
-size of the translation queue to N entries -- if it is "full" when the parser |
-tries to add a new CFG, the parser blocks until one of the translation threads |
-removes a CFG. This means the number of in-memory CFGs can (and generally does)
-reach 2*N+1 (N waiting in the queue, N being translated by the worker threads,
-plus the one the parser is building), and so the memory footprint rises in
-proportion to the number of
-threads. Adding to the pressure is the observation that the monstrous functions |
-also take proportionally longer time to translate, so there's a good chance many |
-of the monstrous functions will be active at the same time with multithreaded |
-translation. As a result, for N=32, Subzero's memory footprint peaks at about |
-260 MB, but drops back down as the large functions finish translating. |
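-
-A simplified sketch of that producer/consumer policy (hypothetical names, and
-no end-of-stream handling) might look like this::
-
-  #include <condition_variable>
-  #include <cstddef>
-  #include <deque>
-  #include <memory>
-  #include <mutex>
-
-  struct Cfg {}; // Stand-in for a parsed, not-yet-translated function.
-
-  // Bounded handoff queue between the parser (producer) and the translation
-  // worker threads (consumers). push() stalls the parser whenever the queue
-  // already holds MaxSize CFGs, capping the translator's memory footprint.
-  class BoundedTranslationQueue {
-  public:
-    explicit BoundedTranslationQueue(size_t MaxSize) : MaxSize(MaxSize) {}
-
-    void push(std::unique_ptr<Cfg> Func) {
-      std::unique_lock<std::mutex> Lock(Mutex);
-      NotFull.wait(Lock, [this] { return Queue.size() < MaxSize; });
-      Queue.push_back(std::move(Func));
-      NotEmpty.notify_one();
-    }
-
-    std::unique_ptr<Cfg> pop() {
-      std::unique_lock<std::mutex> Lock(Mutex);
-      NotEmpty.wait(Lock, [this] { return !Queue.empty(); });
-      std::unique_ptr<Cfg> Func = std::move(Queue.front());
-      Queue.pop_front();
-      NotFull.notify_one();
-      return Func;
-    }
-
-  private:
-    const size_t MaxSize;
-    std::mutex Mutex;
-    std::condition_variable NotFull, NotEmpty;
-    std::deque<std::unique_ptr<Cfg>> Queue;
-  };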
- |
-If this peak memory size becomes a problem, it might be possible for the parser |
-to resequence the functions to try to spread out the larger functions, or to |
-throttle the translation queue to prevent too many in-flight large functions. |
-It may also be possible to throttle based on memory pressure signaling from |
-Chrome. |
- |
-Translator scalability |
----------------------- |
- |
-Currently scalability is "not very good". Multiple translation threads lead to |
-faster translation, but not to the degree desired. We haven't dug in to |
-investigate yet. |
- |
-There are a few areas to investigate. First, there may be contention on the |
-constant pool, which all threads access, and which requires locked access even |
-for reading. This could be mitigated by keeping a CFG-local cache of the most |
-common constants. |
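-
-Such a cache could be as simple as the sketch below, where only a miss
-consults the locked global pool; the ``GlobalConstantPool`` interface shown is
-an assumption for illustration::
-
-  #include <cstdint>
-  #include <memory>
-  #include <mutex>
-  #include <unordered_map>
-
-  struct Constant { uint64_t Value; };
-
-  // Stand-in for the shared, locked pool that all translation threads use.
-  class GlobalConstantPool {
-  public:
-    Constant *getOrAdd(uint64_t Value) {
-      std::lock_guard<std::mutex> Lock(Mutex);
-      std::unique_ptr<Constant> &Slot = Pool[Value];
-      if (!Slot)
-        Slot.reset(new Constant{Value});
-      return Slot.get();
-    }
-
-  private:
-    std::mutex Mutex;
-    std::unordered_map<uint64_t, std::unique_ptr<Constant>> Pool;
-  };
-
-  // Per-CFG front-end cache: repeated uses of a constant within one function
-  // never touch the global lock after the first lookup.
-  class LocalConstantCache {
-  public:
-    explicit LocalConstantCache(GlobalConstantPool &Pool) : Pool(Pool) {}
-
-    Constant *get(uint64_t Value) {
-      auto It = Cache.find(Value);
-      if (It != Cache.end())
-        return It->second;
-      Constant *C = Pool.getOrAdd(Value); // Takes the global lock once.
-      Cache.emplace(Value, C);
-      return C;
-    }
-
-  private:
-    GlobalConstantPool &Pool;
-    std::unordered_map<uint64_t, Constant *> Cache;
-  };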
- |
-Second, there may be contention on memory allocation. While almost all CFG |
-objects are allocated from the CFG-local allocator, some passes use temporary |
-STL containers that use the default allocator, which may require global locking. |
-This could be mitigated by switching these to the CFG-local allocator. |
- |
-Third, multithreading may make the default allocator strategy more expensive. |
-In a single-threaded environment, a pass will allocate its containers, run the |
-pass, and deallocate the containers. This results in stack-like allocation |
-behavior and makes the heap free list easier to manage, with less heap |
-fragmentation. But when multithreading is added, the allocations and |
-deallocations become much less stack-like, making allocation and deallocation |
-operations individually more expensive. Again, this could be mitigated by |
-switching these to the CFG-local allocator. |