Chromium Code Reviews

Unified Diff: DESIGN.rst

Issue 1309073003: Subzero: Add a detailed design document. (Closed) Base URL: https://chromium.googlesource.com/native_client/pnacl-subzero.git@master
Patch Set: Formatting, AT&T syntax, and minor changes Created 5 years, 4 months ago
Index: DESIGN.rst
diff --git a/DESIGN.rst b/DESIGN.rst
new file mode 100644
index 0000000000000000000000000000000000000000..823bf840981d718450d66695cd24ad5f0ef37659
--- /dev/null
+++ b/DESIGN.rst
@@ -0,0 +1,1280 @@
+Design of the Subzero fast code generator
+=========================================
+
+Introduction
+------------
+
+The `Portable Native Client (PNaCl) <http://gonacl.com>`_ project includes
+compiler technology based on `LLVM <http://llvm.org/>`_. The developer uses
+the PNaCl toolchain to compile their application to architecture-neutral PNaCl
+bitcode (a ``.pexe`` file), using as much architecture-neutral optimization as
+possible. The ``.pexe`` file is downloaded to the user's browser where the
+PNaCl translator (a component of Chrome) compiles the ``.pexe`` file to
+sandboxed native code. The translator applies architecture-specific
+optimizations as much as practical to generate good native code.
+
+The native code can be cached by the browser to avoid repeating translation on
+future page loads. However, first-time user experience is hampered by long
+translation times. The LLVM-based PNaCl translator is slow, even when
+using ``-O0`` to minimize optimizations, so delays are especially noticeable on
+slow browser platforms such as ARM-based Chromebooks.
+
+Translator slowness can be mitigated or hidden in a number of ways.
+
+- Parallel translation. However, the slow machines where this matters most,
+ e.g. ARM-based Chromebooks, are likely to have fewer cores to parallelize
+ across, and less RAM available.
+
+- Streaming translation, i.e. start translating as soon as the download starts.
+ This doesn't help much when translation speed is 10× slower than download
+ speed, or when the ``.pexe`` file is already cached but the translated binary
+ was flushed from the cache.
+
+- Arrange the web page such that translation is done in parallel with
+ downloading large assets, or distract the user with cat videos while
+ translation is in progress. Or, improve translator performance to something
+ more reasonable.
+
+Goals
+=====
+
+Translation speed
+-----------------
+
+We'd like to be able to translate a ``.pexe`` file as fast as download speed.
+Any faster is in a sense wasted effort. Download speed varies greatly, but
+we'll arbitrarily say 1 MB/sec. We'll pick the ARM A15 CPU as the example of a
+slow machine. We observe a 3× single-thread performance difference between an
+A15 and a high-end x86 Xeon E5-2690 workstation, and aggressively assume a
+``.pexe`` file could be compressed to 50% on the web server using gzip
+transport compression, so we set the translation speed goal to 6 MB/sec on the
+x86 workstation.
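The arithmetic behind the 6 MB/sec goal can be sketched as a back-of-the-envelope check (the numbers come from the assumptions above; this is illustrative, not part of Subzero):

```python
# Back-of-the-envelope derivation of the 6 MB/sec translation speed goal.
download_speed_mb_s = 1.0   # arbitrarily assumed download speed, MB/sec
slowdown_vs_x86 = 3.0       # observed A15 vs. x86 single-thread ratio
compression_ratio = 0.5     # assume gzip shrinks the .pexe to 50%

# Effective rate at which uncompressed .pexe bytes arrive on the slow machine:
arrival_rate = download_speed_mb_s / compression_ratio  # 2 MB/sec of pexe bytes

# To keep up on the A15, the x86 workstation must translate 3x faster:
x86_goal = arrival_rate * slowdown_vs_x86
print(x86_goal)  # 6.0
```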
+
+Currently, at the ``-O0`` level, the LLVM-based PNaCl translator translates at
+1/10 the target rate. The ``-O2`` mode takes 3× as long as the ``-O0`` mode.
+
+In other words, Subzero's goal is to improve over LLVM's translation speed by
+10x.
+
+Code quality
+------------
+
+Subzero's initial goal is to produce code that meets or exceeds LLVM's ``-O0``
+code quality. The stretch goal is to approach LLVM ``-O2`` code quality. On
+average, LLVM ``-O2`` code performs twice as well as LLVM ``-O0`` code. Note
+that the ``.pexe`` file has already had architecture-neutral optimizations
+applied, which works in Subzero's favor.
+
+Translator size
+---------------
+
+The current LLVM-based translator binary (``pnacl-llc``) is about 10 MB in
+size. We think 1 MB is a more reasonable size, thus we target a 10× reduction
+in binary size.
+
+For development, Subzero can be built with all target architectures and all
+debugging and diagnostic options enabled. For a smaller translator, we
+restrict the build to a single target architecture, and define a MINIMAL build
+where unnecessary features are compiled out, in pursuit of the size goal.
+
+Memory footprint
+----------------
+
+The current LLVM-based translator suffers from an issue in which some
+function-specific data has to be retained in memory until all translation
+completes, and therefore the memory footprint grows without bound. Large
+``.pexe`` files can lead to the translator process holding hundreds of MB of
+memory by the end.
+
+Subzero should maintain a stable memory footprint throughout translation:
+roughly O(largest-function-size) rather than O(total-function-size). It's not
+really practical to set a specific limit, because there is no practical limit
+on a single function's size, but the footprint should be "reasonable" and not
+have leaks or inexorable growth. (We use ASAN builds to test for leaks.)
+
+Multithreaded translation
+-------------------------
+
+It should be practical to translate different functions concurrently and see
+good scalability. Some locking may be needed, such as accessing output buffers
+or constant pools, but that should be fairly minimal. In contrast, LLVM was
+only designed for module-level parallelism, and as such, the PNaCl translator
+internally splits a pexe file into several modules for concurrent translation.
+All output needs to be deterministic regardless of the level of multithreading,
+i.e. functions and data should always be output in the same order.
+
+Target architectures
+--------------------
+
+Initial target architectures are x86-32, x86-64, ARM32, and MIPS32. Future
+targets include ARM64 and MIPS64, though NaCl support for these targets,
+including a sandbox model and a validator, does not yet exist.
+
+The first implementation is for x86-32, because it was expected to be
+particularly challenging, and thus more likely to draw out any design problems
+early:
+
+- There are a number of special cases, asymmetries, and warts in the x86
+ instruction set.
+
+- Complex addressing modes may be leveraged for better code quality.
+
+- 64-bit integer operations have to be lowered into longer sequences of 32-bit
+ operations.
+
+- Paucity of physical registers may reveal code quality issues early in the
+ design.
+
+Detailed design
+===============
+
+Intermediate representation - ICE
+---------------------------------
+
+Subzero's IR is called ICE. Like LLVM's IR, it is SSA-based, and it is
+designed to be reasonably similar to LLVM's IR, which is reflected in the
+``.pexe`` bitcode structure.
+
+A function's control flow graph (CFG) is represented by the ``Ice::Cfg``
+class. Its key contents include:
+
+- A list of ``CfgNode`` pointers, generally held in topological order.
+
+- A list of ``Variable`` pointers corresponding to local variables used in the
+ function plus compiler-generated temporaries.
+
+A basic block is represented by the ``Ice::CfgNode`` class. Its key contents
+include:
+
+- A linear list of instructions, in the same style as LLVM. The last
+ instruction of the list is always a terminator instruction: branch, switch,
+ return, unreachable.
+
+- A list of Phi instructions, also in the same style as LLVM. They are held as
+ a linear list for convenience, though per Phi semantics, they are executed "in
+ parallel" without dependencies on each other.
+
+- An unordered list of ``CfgNode`` pointers corresponding to incoming edges, and
+ another list for outgoing edges.
+
+- The node's unique, 0-based index into the CFG's node list.
+
+An instruction is represented by the ``Ice::Inst`` class. Its key contents
+include:
+
+- A list of source operands.
+
+- Its destination variable, if the instruction produces a result in an
+ ``Ice::Variable``.
+
+- A bitvector indicating which variables' live ranges this instruction ends.
+ This is computed during liveness analysis.
+
+Instruction kinds are divided into high-level ICE instructions and low-level
+ICE instructions. High-level instructions consist of the PNaCl/LLVM bitcode
+instruction kinds. Each target architecture implementation extends the
+instruction space with its own set of low-level instructions. Generally,
+low-level instructions correspond to individual machine instructions. The
+high-level ICE instruction space includes a few additional instruction kinds
+that are not part of LLVM but are generally useful (e.g., an Assignment
+instruction), or are useful across targets (e.g., BundleLock and BundleUnlock
+instructions for sandboxing).
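The structure described above can be sketched with the following minimal Python model. The class and field names mirror the C++ classes described in the text, but this is an illustrative sketch of the relationships, not Subzero's actual implementation:

```python
class Operand:
    """Base operand; tracks the Variables contained within, for quick access."""
    def __init__(self):
        self.vars = []

class Variable(Operand):
    def __init__(self, index):
        super().__init__()
        self.vars = [self]        # a Variable contains exactly itself
        self.index = index        # unique, 0-based index into the CFG's list

class Constant(Operand):
    def __init__(self, value):
        super().__init__()       # no Variables inside a Constant
        self.value = value

class Inst:
    def __init__(self, dest, srcs):
        self.dest = dest          # destination Variable, or None
        self.srcs = list(srcs)    # list of source Operands
        self.deleted = False      # "deleted" instructions are only marked

    def variables(self):
        """The frequent pattern: iterate all Variables of an instruction."""
        for src in self.srcs:
            yield from src.vars

class CfgNode:
    def __init__(self, index):
        self.index = index        # unique, 0-based index into the node list
        self.phis = []            # Phi instructions, logically parallel
        self.insts = []           # linear list; last one is a terminator
        self.in_edges = []        # unordered predecessor CfgNodes
        self.out_edges = []       # unordered successor CfgNodes

class Cfg:
    def __init__(self):
        self.nodes = []           # generally held in topological order
        self.variables = []       # locals plus compiler-made temporaries
```

For example, a high-level ``A = B + C`` instruction would hold ``A`` as its dest and ``[B, C]`` as its source operands, and iterating its Variables yields ``B`` and ``C``.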
+
+Specifically, high-level ICE instructions that derive from LLVM are the
+following:
+
+- Alloca: allocate data on the stack
+
+- Arithmetic: binary operations of the form ``A = B op C``
+
+- Br: conditional or unconditional branch
+
+- Call: function call
+
+- Cast: unary type-conversion operations
+
+- ExtractElement: extract a scalar element from a vector-type value
+
+- Fcmp: floating-point comparison
+
+- Icmp: integer comparison
+
+- IntrinsicCall: call a known intrinsic
+
+- InsertElement: insert a scalar element into a vector-type value
+
+- Load: load a value from memory
+
+- Phi: implement the SSA phi node
+
+- Ret: return from the function
+
+- Select: essentially the C language operation of the form ``X = C ? Y : Z``
+
+- Store: store a value into memory
+
+- Switch: generalized branch to multiple possible locations
+
+- Unreachable: indicate that this portion of the code is unreachable
+
+The additional high-level ICE instructions are the following:
+
+- Assign: a simple ``A=B`` assignment. This is useful for e.g. lowering Phi
+ instructions to non-SSA assignments, before lowering to machine code.
+
+- BundleLock, BundleUnlock. These are markers used for sandboxing, but are
+ common across all targets and so they are elevated to the high-level
+ instruction set.
+
+- FakeDef, FakeUse, FakeKill. These are tools used to preserve consistency in
+ liveness analysis, elevated to the high-level because they are used by all
+ targets. They are described in more detail at the end of this section.
+
+- JumpTable: this represents the result of switch optimization analysis, where
+ some switch instructions may use jump tables instead of cascading
+ compare/branches.
+
+An operand is represented by the ``Ice::Operand`` class. In high-level ICE, an
+operand is either an ``Ice::Constant`` or an ``Ice::Variable``. Constants
+include scalar integer constants, scalar floating point constants, Undef (an
+unspecified constant of a particular type), and symbol constants (essentially
+addresses of globals). A variable represents a value allocated on the stack
+(though not including alloca-derived storage). Among other things, a variable
+holds its unique, 0-based index into the CFG's variable list.
+
+Each target can extend the ``Constant`` and ``Variable`` classes for its own
+needs. In addition, the ``Operand`` class may be extended, e.g. to define an
+x86 ``MemOperand`` that encodes a base register, an index register, an index
+register shift amount, and a constant offset.
+
+Register allocation and liveness analysis are restricted to Variable operands.
+Because of the importance of register allocation to code quality, and the
+translation-time cost of liveness analysis, Variable operands get some special
+treatment in ICE. Most notably, a frequent pattern in Subzero is to iterate
+across all the Variables of an instruction. An instruction holds a list of
+operands, but an operand may contain 0, 1, or more Variables. As such, the
+``Operand`` class specially holds a list of Variables contained within, for
+quick access.
+
+A Subzero transformation pass may work by deleting an existing instruction and
+replacing it with zero or more new instructions. Instead of actually deleting
+the existing instruction, we generally mark it as deleted and insert the new
+instructions right after the deleted instruction. When printing the IR for
+debugging, this is a big help because it makes it much more clear how the
+non-deleted instructions came about.
+
+Subzero has a few special instructions to help with liveness analysis
+consistency.
+
+- The FakeDef instruction gives a fake definition of some variable. For
+ example, on x86-32, a divide instruction defines both ``%eax`` and ``%edx``
+ but an ICE instruction can represent only one destination variable. This is
+ similar for multiply instructions, and for function calls that return a 64-bit
+ integer result in the ``%edx:%eax`` pair. Also, using the ``xor %eax, %eax``
+ trick to set ``%eax`` to 0 requires an initial FakeDef of ``%eax``.
+
+- The FakeUse instruction registers a use of a variable, typically to prevent an
+ earlier assignment to that variable from being dead-code eliminated. For
+ example, lowering an operation like ``x=cc?y:z`` may be done using x86's
+ conditional move (cmov) instruction: ``mov z, x; cmov_cc y, x``. Without a
+ FakeUse of ``x`` between the two instructions, the liveness analysis pass may
+ dead-code eliminate the first instruction.
+
+- The FakeKill instruction is added after a call instruction, and is a quick way
+ of indicating that caller-save registers are invalidated.
+
+Overview of lowering
+--------------------
+
+In general, translation goes like this:
+
+- Parse the next function from the ``.pexe`` file and construct a CFG
+ consisting of high-level ICE.
+
+- Do analysis passes and transformation passes on the high-level ICE, as
+ desired.
+
+- Lower each high-level ICE instruction into a sequence of zero or more
+ low-level ICE instructions. Each high-level instruction is generally lowered
+ independently, though the target lowering is allowed to look ahead in the
+ CfgNode's instruction list if desired.
+
+- Do more analysis and transformation passes on the low-level ICE, as desired.
+
+- Assemble the low-level CFG into an ELF object file (alternatively, a textual
+ assembly file that is later assembled by some external tool).
+
+- Repeat for all functions, and also produce object code for data such as global
+ initializers and internal constant pools.
+
+Currently there are two optimization levels: ``O2`` and ``Om1``. For ``O2``,
+the intention is to apply all available optimizations to get the best code
+quality (though the initial code quality goal is measured against LLVM's ``O0``
+code quality). For ``Om1``, the intention is to apply as few optimizations as
+possible and produce code as quickly as possible, accepting poor code quality.
+``Om1`` is short for "O-minus-one", i.e. "worse than O0", or in other words,
+"sub-zero".
+
+Debuggability of generated code has not been a focus of the design so far, but
+Subzero doesn't really do transformations that would obfuscate debugging. The
+main exception is that register allocation (including stack slot coalescing
+for stack-allocated variables whose live ranges don't overlap) may render a
+variable's value unobtainable after its live range ends. This is not an issue
+for ``Om1`` since it doesn't register-allocate program-level variables, nor
+does it coalesce stack slots.
+
+Our experience so far is that ``Om1`` translates twice as fast as ``O2``, but
+produces code with one third the code quality. ``Om1`` is good for testing and
+debugging -- during translation, it tends to expose errors in the basic lowering
+that might otherwise have been hidden by the register allocator or other
+optimization passes. It also helps determine whether a code correctness problem
+is a fundamental problem in the basic lowering, or an error in another
+optimization pass.
+
+The implementation of target lowering also controls the recipe of passes used
+for ``Om1`` and ``O2`` translation. For example, address mode inference may
+only be relevant for x86.
+
+Pexe parsing
+------------
+
+Subzero includes an integrated PNaCl pexe bitcode parser. It parses the pexe
+file function by function, ultimately constructing an ICE CFG for each
+function. After a function is parsed, its CFG is handed off to the translation
+phase. The
+bitcode parser also parses global initializer data and hands it off to be
+translated to data sections in the object file.
+
+Subzero has another parsing strategy for testing/debugging. LLVM libraries can
+be used to parse a module into LLVM IR (though very slowly relative to Subzero
+native parsing). Then we iterate across the LLVM IR and construct high-level
+ICE, handing off each CFG to the translation phase.
+
+Lowering strategy
+-----------------
+
+The core of Subzero's lowering from high-level ICE to low-level ICE is to lower
+each high-level instruction down to a sequence of low-level target-specific
+instructions, in a largely context-free setting (sometimes called a "splat" or
+"template" style of code generation). That is, each high-level instruction
+conceptually has a simple template expansion into low-level
+instructions, and lowering can in theory be done in any order. This may sound
+like a small effort, but quite a large number of templates may be needed because
+of the number of PNaCl types and instruction variants. Furthermore, there may
+be optimized templates, e.g. to take advantage of operator commutativity.
+
+The key idea for a lowering template is to produce valid low-level instructions
+that are guaranteed to meet address mode and other structural requirements of
+the instruction set. For example, on x86, the source operand of an integer
+store instruction must be an immediate or a physical register; a shift
+instruction's shift amount must be an immediate or in register ``%cl``; a
+function's integer return value is in ``%eax``; most x86 instructions are
+two-operand, in contrast to corresponding three-operand high-level instructions;
+etc.
+
+Because target lowering runs before register allocation, there is no way to know
+whether a given ``Ice::Variable`` operand lives on the stack or in a physical
+register. When the low-level instruction calls for a physical register operand,
+the target lowering can create an infinite-weight Variable. This tells the
+register allocator to assign infinite weight when making decisions, effectively
+guaranteeing some physical register. Variables can also be pre-colored to a
+specific physical register (``%cl`` in the shift example above), which also
+gives the variable infinite weight.
+
+To illustrate, consider a high-level arithmetic instruction on 32-bit integer
+operands::
+
+ A = B + C
+
+X86 target lowering might produce the following::
+
+ T.inf = B // mov instruction
+ T.inf += C // add instruction
+ A = T.inf // mov instruction
+
+Here, ``T.inf`` is an infinite-weight temporary. As long as ``T.inf`` has a
+physical register, the three lowered instructions are all encodable regardless
+of whether ``B`` and ``C`` are physical registers, memory, or immediates, and
+whether ``A`` is a physical register or in memory.
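The template expansion above can be sketched as follows. This is a hypothetical illustration of the lowering recipe, not Subzero's C++ lowering code; the ``Var`` class and instruction tuples are assumptions for the example:

```python
INF = float("inf")

class Var:
    def __init__(self, name, weight=1.0):
        self.name, self.weight = name, weight

def lower_add(a, b, c, out):
    """Template expansion for the high-level 'A = B + C' (32-bit int).

    T.inf gets infinite weight so the register allocator is guaranteed to
    place it in a physical register, making each low-level instruction
    encodable regardless of where A, B, and C live (register, memory,
    or, for the sources, an immediate).
    """
    t = Var("T.inf", weight=INF)
    out.append(("mov", t, b))   # T.inf = B
    out.append(("add", t, c))   # T.inf += C
    out.append(("mov", a, t))   # A = T.inf
    return t

insts = []
t = lower_add(Var("A"), Var("B"), Var("C"), insts)
print([op for op, _, _ in insts])  # ['mov', 'add', 'mov']
```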
+
+In this example, ``A`` must be a Variable and one may be tempted to simplify the
+lowering sequence by setting ``A`` as infinite-weight and using::
+
+ A = B // mov instruction
+ A += C // add instruction
+
+This has two problems. First, if the original instruction was actually ``A =
+B + A``, the result would be incorrect. Second, assigning ``A`` a physical
+register applies throughout ``A``'s entire live range. This is probably not
+what is intended, and may ultimately lead to a failure to allocate a register
+for an infinite-weight variable.
+
+This style of lowering leads to many temporaries being generated, so in ``O2``
+mode, we rely on the register allocator to clean things up. For example, in the
+example above, if ``B`` ends up getting a physical register and its live range
+ends at this instruction, the register allocator is likely to reuse that
+register for ``T.inf``. This leads to ``T.inf=B`` being a redundant register
+copy, which is removed as an emission-time peephole optimization.
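The emission-time peephole for redundant copies can be sketched like this. The instruction tuples and the ``reg_of`` assignment map are illustrative assumptions; the point is only that a ``mov`` whose source and destination received the same register is dropped at emission:

```python
def emit(insts, reg_of):
    """Emit instructions, skipping redundant register-to-register moves.

    reg_of maps variable names to assigned registers (None means the
    variable lives on the stack). A sketch of the emission-time peephole,
    not Subzero's actual emitter.
    """
    out = []
    for op, dst, src in insts:
        if (op == "mov" and reg_of.get(dst) is not None
                and reg_of.get(dst) == reg_of.get(src)):
            continue            # e.g. T.inf = B where both share a register
        out.append((op, dst, src))
    return out

# If B's live range ends at this lowered sequence, the allocator is likely
# to reuse B's register for T.inf, making the first mov redundant:
lowered = [("mov", "T.inf", "B"), ("add", "T.inf", "C"), ("mov", "A", "T.inf")]
regs = {"T.inf": "eax", "B": "eax", "A": "eax", "C": None}
print(emit(lowered, regs))  # [('add', 'T.inf', 'C')]
```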
+
+O2 lowering
+-----------
+
+Currently, the ``O2`` lowering recipe is the following:
+
+- Address mode inference
+
+- Read-modify-write (RMW) transformation
+
+- Basic liveness analysis
+
+- Load optimization
+
+- Target lowering
+
+- Full liveness analysis
+
+- Register allocation
+
+- Phi instruction lowering (advanced)
+
+- Post-phi lowering register allocation
+
+- Branch optimization
+
+These passes are described in more detail below.
+
+Om1 lowering
+------------
+
+Currently, the ``Om1`` lowering recipe is the following:
+
+- Phi instruction lowering (simple)
+
+- Target lowering
+
+- Register allocation (infinite-weight and pre-colored only)
+
+Optimization passes
+-------------------
+
+Liveness analysis
+^^^^^^^^^^^^^^^^^
+
+Liveness analysis is a standard dataflow optimization, implemented as follows.
+For each node (basic block), its live-out set is computed as the union of the
+live-in sets of its successor nodes. Then the node's instructions are processed
+in reverse order, updating the live set, until the beginning of the node is
+reached, and the node's live-in set is recorded. If this iteration has changed
+the node's live-in set, the node's predecessors are marked for reprocessing.
+This continues until no more nodes need reprocessing. If nodes are processed in
+reverse topological order, the number of iterations over the CFG is generally
+equal to the maximum loop nest depth.
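The worklist iteration described above can be sketched as follows. The node ids and the per-node ``uses``/``defs`` summary sets are illustrative assumptions, not Subzero's API:

```python
def liveness(nodes, succs, preds, uses, defs):
    """Backward liveness analysis with a worklist.

    nodes: node ids in topological order.
    uses[n]: variables read in n before any write; defs[n]: variables
    written in n. Returns per-node (live_in, live_out) sets.
    """
    live_in = {n: set() for n in nodes}
    live_out = {n: set() for n in nodes}
    work = list(nodes)          # pop() yields reverse topological order
    while work:
        n = work.pop()
        # live-out is the union of the successors' live-in sets.
        live_out[n] = set().union(*(live_in[s] for s in succs[n])) if succs[n] else set()
        new_in = uses[n] | (live_out[n] - defs[n])
        if new_in != live_in[n]:
            live_in[n] = new_in
            # A changed live-in set means predecessors need reprocessing.
            for p in preds[n]:
                if p not in work:
                    work.append(p)
    return live_in, live_out
```

For a straight-line CFG ``0 → 1 → 2`` where node 0 defines ``x``, node 1 uses ``x`` and defines ``y``, and node 2 uses ``y``, this computes ``live_out[0] = {x}`` and ``live_out[1] = {y}`` in a single pass.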
+
+To implement this, each node records its live-in and live-out sets, initialized
+to the empty set. Each instruction records which of its Variables' live ranges
+end in that instruction, initialized to the empty set. A side effect of
+liveness analysis is dead instruction elimination. Each instruction can be
+marked as tentatively dead, and after the algorithm converges, the tentatively
+dead instructions are permanently deleted.
+
+Optionally, after this liveness analysis completes, we can do live range
+construction, in which we calculate the live range of each variable in terms of
+instruction numbers. A live range is represented as a union of segments, where
+the segment endpoints are instruction numbers. Instruction numbers are required
+to be unique across the CFG, and monotonically increasing within a basic block.
+As a union of segments, live ranges can contain "gaps" and are therefore
+precise. Because of SSA properties, a variable's live range can start at most
+once in a basic block, and can end at most once in a basic block. Liveness
+analysis keeps track of which variable/instruction tuples begin live ranges and
+end live ranges, and combined with live-in and live-out sets, we can efficiently
+build up live ranges of all variables across all basic blocks.
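A union-of-segments live range, with the overlap test the register allocator needs, might look like the following minimal sketch (half-open ``[start, end)`` segments of instruction numbers are an assumption of this sketch):

```python
class LiveRange:
    """A live range as a union of [start, end) segments of instruction
    numbers, so ranges can contain gaps and remain precise."""

    def __init__(self):
        self.segments = []              # kept sorted by start

    def add(self, start, end):
        self.segments.append((start, end))
        self.segments.sort()

    def overlaps(self, other):
        """Walk both sorted segment lists in tandem, like a merge."""
        i = j = 0
        a, b = self.segments, other.segments
        while i < len(a) and j < len(b):
            s1, e1 = a[i]
            s2, e2 = b[j]
            if s1 < e2 and s2 < e1:     # the two segments intersect
                return True
            if e1 <= e2:
                i += 1
            else:
                j += 1
        return False
```

A range with a gap, e.g. segments ``[1,5)`` and ``[10,12)``, does not overlap a range living entirely in the gap, e.g. ``[5,10)``, which is exactly the precision the gaps buy.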
+
+A lot of care is taken to try to make liveness analysis fast and efficient.
+Because of the lowering strategy, the number of variables is generally
+proportional to the number of instructions, leading to an O(N^2) complexity
+algorithm if implemented naively. To improve things based on sparsity, we note
+that most variables are "local" and referenced in at most one basic block (in
+contrast to the "global" variables with multi-block usage), and therefore cannot
+be live across basic blocks. Therefore, the live-in and live-out sets,
+typically represented as bit vectors, can be limited to the set of global
+variables, and the intra-block liveness bit vector can be compacted to hold the
+global variables plus the local variables for that block.
+
+Register allocation
+^^^^^^^^^^^^^^^^^^^
+
+Subzero implements a simple linear-scan register allocator, based on the
+allocator described by Hanspeter Mössenböck and Michael Pfeiffer in `Linear Scan
+Register Allocation in the Context of SSA Form and Register Constraints
+<ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe02.PDF>`_. This allocator has
+several nice features:
+
+- Live ranges are represented as unions of segments, as described above, rather
+ than a single start/end tuple.
+
+- It allows pre-coloring of variables with specific physical registers.
+
+- It applies equally well to pre-lowered Phi instructions.
+
+The paper suggests an approach of aggressively coalescing variables across Phi
+instructions (i.e., trying to force Phi source and destination variables to have
+the same register assignment), but we reject that in favor of the more natural
+preference mechanism described below.
+
+We enhance the algorithm in the paper with automatic inference of register
+preference, and with the ability to let overlapping live ranges safely share
+the same register in certain circumstances. If we are
+considering register allocation for variable ``A``, and ``A`` has a single
+defining instruction ``A=B+C``, then the preferred register for ``A``, if
+available, would be the register assigned to ``B`` or ``C``, if any, provided
+that ``B`` or ``C``'s live range does not overlap ``A``'s live range. In this
+way we infer a good register preference for ``A``.
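A stripped-down linear scan with the preference mechanism might be sketched as follows. This works on single-segment ``(start, end)`` ranges and ignores weights, pre-coloring, and eviction; the ``prefer`` hint stands in for the inference described above:

```python
def linear_scan(intervals, num_regs, prefer=None):
    """Minimal linear-scan sketch over (start, end) live ranges.

    intervals: {var: (start, end)}, half-open instruction-number ranges.
    prefer: {var: register} hints, honored when the register is free.
    Returns {var: register or None}; None means spilled to the stack.
    """
    prefer = prefer or {}
    assignment, active = {}, []          # active: (end, var) with a register
    for var, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        active = [(e, v) for (e, v) in active if e > start]   # expire old
        in_use = {assignment[v] for (_, v) in active}
        free = [r for r in range(num_regs) if r not in in_use]
        if not free:
            assignment[var] = None       # no register available: spill
            continue
        want = prefer.get(var)
        assignment[var] = want if want in free else free[0]
        active.append((end, var))
    return assignment

# Example: B's range ends where A's begins, so preferring B's register for
# A lets A inherit it (the inference above would supply this hint when A is
# defined as A = B + C and the ranges don't overlap).
regs = linear_scan({"B": (0, 2), "C": (0, 3), "A": (2, 6)},
                   num_regs=2, prefer={"A": 0})
print(regs)  # {'B': 0, 'C': 1, 'A': 0}
```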
+
+We allow overlapping live ranges to get the same register in certain cases.
+Suppose a high-level instruction like::
+
+ A = unary_op(B)
+
+has been target-lowered like::
+
+ T.inf = B
+ A = unary_op(T.inf)
+
+Further, assume that ``B``'s live range continues beyond this instruction
+sequence, and that ``B`` has already been assigned some register. Normally, we
+might want to infer ``B``'s register as a good candidate for ``T.inf``, but it
+turns out that ``T.inf`` and ``B``'s live ranges overlap, requiring them to have
+different registers. But ``T.inf`` is just a read-only copy of ``B`` that is
+guaranteed to be in a register, so in theory these overlapping live ranges could
+safely have the same register. Our implementation allows this overlap as long
+as ``T.inf`` is never modified within ``B``'s live range, and ``B`` is never
+modified within ``T.inf``'s live range.
+
+Subzero's register allocator can be run in three configurations.
+
+- Normal mode. All Variables are considered for register allocation. It
+ requires full liveness analysis and live range construction as a prerequisite.
+ This is used by ``O2`` lowering.
+
+- Minimal mode. Only infinite-weight or pre-colored Variables are considered.
+ All other Variables are stack-allocated. It does not require liveness
+ analysis; instead, it quickly scans the instructions and records first
+ definitions and last uses of all relevant Variables, using that to construct a
+ single-segment live range. Although this includes most of the Variables, the
+ live ranges are mostly simple, short, and rarely overlapping, which the
+ register allocator handles efficiently. This is used by ``Om1`` lowering.
+
+- Post-phi lowering mode. Advanced phi lowering is done after normal-mode
+ register allocation, and may result in new infinite-weight Variables that need
+ registers. One would like to just run something like minimal mode to assign
+ registers to the new Variables while respecting existing register allocation
+ decisions. However, it sometimes happens that there are no free registers.
+ In this case, some register needs to be forcibly spilled to the stack and
+ temporarily reassigned to the new Variable, and reloaded at the end of the new
+ Variable's live range. The register must be one that has no explicit
+ references during the Variable's live range. Since Subzero currently doesn't
+ track def/use chains (though it does record the CfgNode where a Variable is
+ defined), we just do a brute-force search across the CfgNode's instruction
+ list for the instruction numbers of interest. This situation happens very
+ rarely, so there's little point for now in improving its performance.
+
+The basic linear-scan algorithm may, as it proceeds, rescind an early register
+allocation decision, leaving that Variable to be stack-allocated. Sometimes it
+turns out that the Variable could have been given a different register without
+conflict, but by then it's too late. The literature recognizes this situation
+and describes "second-chance bin-packing", which Subzero can do. We can rerun
+the register allocator in a mode that respects existing register allocation
+decisions, and sometimes it finds new non-conflicting opportunities. In fact,
+we can repeatedly run the register allocator until convergence.
+Unfortunately, in the current implementation, these subsequent register
+allocation passes end up being extremely expensive. This is because of the
+treatment of the "unhandled pre-colored" Variable set, which is normally very
+small but ends up being quite large on subsequent passes. Its performance can
+probably be made acceptable with a better choice of data structures, but for
+now this second-chance mechanism is disabled.
+
+Basic phi lowering
+^^^^^^^^^^^^^^^^^^
+
+The simplest phi lowering strategy works as follows (this is how LLVM ``-O0``
+implements it). Consider this example::
+
+ L1:
+ ...
+ br L3
+ L2:
+ ...
+ br L3
+ L3:
+ A = phi [B, L1], [C, L2]
+ X = phi [Y, L1], [Z, L2]
+
+For each destination of a phi instruction, we can create a temporary and insert
+the temporary's assignment at the end of the predecessor block::
+
+ L1:
+ ...
+ A' = B
+ X' = Y
+ br L3
+ L2:
+ ...
+ A' = C
+ X' = Z
+ br L3
+  L3:
+ A = A'
+ X = X'
+
+This transformation is very simple and reliable. It can be done before target
+lowering and register allocation, and it easily avoids the classic lost-copy and
+related problems. ``Om1`` lowering uses this strategy.
+
+However, it has the disadvantage of initializing temporaries even for branches
+not taken, though that could be mitigated by splitting non-critical edges and
+putting assignments in the edge-split nodes. Another problem is that without
+extra machinery, the assignments to ``A``, ``A'``, ``X``, and ``X'`` are given a
+specific ordering even though phi semantics are that the assignments are
+parallel or unordered. This sometimes imposes false live range overlaps and
+leads to poorer register allocation.
+
+Advanced phi lowering
+^^^^^^^^^^^^^^^^^^^^^
+
+``O2`` lowering defers phi lowering until after register allocation to avoid the
+problem of false live range overlaps. It works as follows. We split each
+incoming edge and move the (parallel) phi assignments into the split nodes. We
+linearize each set of assignments by finding a safe, topological ordering of the
+assignments, respecting register assignments as well. For example::
+
+ A = B
+ X = Y
+
+Normally these assignments could be executed in either order, but if ``B`` and
+``X`` are assigned the same physical register, we would want to use the above
+ordering. Dependency cycles are broken by introducing a temporary. For
+example::
+
+ A = B
+ B = A
+
+Here, a temporary breaks the cycle::
+
+ t = A
+ A = B
+ B = t
+
+Finally, we use the existing target lowering to lower the assignments in this
+basic block, and once that is done for all basic blocks, we run the post-phi
+variant of register allocation on the edge-split basic blocks.
+
+When computing a topological order, we try to first schedule assignments whose
+source has a physical register, and last schedule assignments whose destination
+has a physical register. This helps reduce register pressure.
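
The linearization step can be sketched as follows. This is a minimal Python
model (Subzero itself is written in C++): a move is a ``(dest, src)`` pair,
and register conflicts are modeled simply by name equality, which is a
simplification of what the real pass does on lowered IR.

```python
def serialize_parallel_moves(moves):
    """Order parallel (dest, src) moves so that executing them sequentially
    preserves parallel-assignment semantics; cycles are broken with a
    temporary, as described above."""
    pending = [(d, s) for d, s in moves if d != s]
    out = []
    num_temps = 0
    while pending:
        ready = None
        for i, (d, s) in enumerate(pending):
            # A move is safe to emit if no other pending move still reads
            # its destination.
            if all(s2 != d for j, (_, s2) in enumerate(pending) if j != i):
                ready = i
                break
        if ready is not None:
            out.append(pending.pop(ready))
        else:
            # Every destination is still read by another move: a cycle.
            # Copy one source into a temporary and redirect its readers.
            _, s = pending[0]
            t = "t%d" % num_temps
            num_temps += 1
            out.append((t, s))
            pending = [(d2, t if s2 == s else s2) for d2, s2 in pending]
    return out
```

For the swap ``A = B; B = A`` this emits ``t0 = B; B = A; A = t0``.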
+
+X86 address mode inference
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+We try to take advantage of the x86 addressing mode that includes a base
+register, an index register, an index register scale amount, and an immediate
+offset. We do this through simple pattern matching. Starting with a load or
+store instruction where the address is a variable, we initialize the base
+register to that variable, and look up the instruction where that variable is
+defined. If that is an add instruction of two variables and the index register
+hasn't been set, we replace the base and index register with those two
+variables. If instead it is an add instruction of a variable and a constant, we
+replace the base register with the variable and add the constant to the
+immediate offset.
+
+There are several more patterns that can be matched. This pattern matching
+continues on the load or store instruction until no more matches are found.
+Because a program typically has few load and store instructions (not to be
+confused with instructions that manipulate stack variables), this address mode
+inference pass is fast.
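
The matching loop can be sketched as follows. This is an illustrative Python
model, not Subzero's code: instructions are ``("add", op1, op2)`` tuples in a
def map, and only the two patterns described above are matched.

```python
def infer_address_mode(addr_var, defs):
    """Greedily fold `add` instructions feeding the address variable into
    an x86-style base + index*scale + offset mode.  `defs` maps an SSA
    variable to the instruction defining it, so the walk terminates."""
    base, index, scale, offset = addr_var, None, 1, 0
    while True:
        inst = defs.get(base)
        if inst is None:
            break  # base is not defined by a foldable instruction
        op, a, b = inst
        if op == "add" and isinstance(b, int):
            # base = a + constant  ->  fold the constant into the offset
            base, offset = a, offset + b
        elif op == "add" and index is None:
            # base = a + b  ->  adopt b as the index register
            base, index = a, b
        else:
            break
    return base, index, scale, offset
```

For ``t1 = x + y; t2 = t1 + 8``, a load from ``t2`` folds into the single
address ``8(x, y, 1)``.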
+
+X86 read-modify-write inference
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+A reasonably common bitcode pattern is a non-atomic update of a memory
+location::
+
+ x = load addr
+ y = add x, 1
+ store y, addr
+
+On x86, with good register allocation, the Subzero passes described above
+generate code with only this quality::
+
+  mov (%ebx), %eax
+  add $1, %eax
+  mov %eax, (%ebx)
+
+However, x86 allows for this kind of code::
+
+  add $1, (%ebx)
+
+which requires fewer instructions, but perhaps more importantly, requires fewer
+physical registers.
+
+It's also important to note that this transformation only makes sense if the
+store instruction ends ``x``'s live range.
+
+Subzero's ``O2`` recipe includes an early pass to find read-modify-write (RMW)
+opportunities via simple pattern matching. The only problem is that it is run
+before liveness analysis, which is needed to determine whether ``x``'s live
+range ends after the RMW. Since liveness analysis is one of the most expensive
+passes, it's not attractive to run it an extra time just for RMW analysis.
+Instead, we essentially generate both the RMW and the non-RMW versions, and then
+during lowering, the RMW version deletes itself if it finds ``x`` still live.
+
+X86 compare-branch inference
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In the LLVM instruction set, the compare/branch pattern works like this::
+
+ cond = icmp eq a, b
+ br cond, target
+
+The result of the icmp instruction is a single bit, and a conditional branch
+tests that bit. By contrast, most target architectures use this pattern::
+
+ cmp a, b // implicitly sets various bits of FLAGS register
+ br eq, target // branch on a particular FLAGS bit
+
+A naive lowering sequence conditionally sets ``cond`` to 0 or 1, then tests
+``cond`` and conditionally branches. Subzero has a pass that identifies
+boolean-based operations like this and folds them into a single
+compare/branch-like operation. It is set up for more than just cmp/br though.
+Boolean producers include icmp (integer compare), fcmp (floating-point compare),
+and trunc (integer truncation when the destination has bool type). Boolean
+consumers include branch, select (the ternary operator from the C language), and
+sign-extend and zero-extend when the source has bool type.
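
The simplest producer/consumer case, cmp/br, can be sketched as a peephole
over an instruction list. This is an illustrative Python model (instructions
as tuples, the boolean variable's name standing in for the single-bit
result), not Subzero's actual folding pass:

```python
def count_uses(insts, name):
    """Count operand occurrences of `name`, excluding its single def."""
    occurrences = sum(1 for ins in insts for opnd in ins[1:] if opnd == name)
    return occurrences - 1

def fold_compare_branch(insts):
    """Fold an icmp whose only consumer is the immediately following
    conditional branch into a single compare/branch pseudo-instruction."""
    out, i = [], 0
    while i < len(insts):
        cur = insts[i]
        nxt = insts[i + 1] if i + 1 < len(insts) else None
        if (cur[0] == "icmp" and nxt is not None and nxt[0] == "br"
                and nxt[1] == cur[1]           # branch tests the icmp result
                and count_uses(insts, cur[1]) == 1):
            _, dest, cond, a, b = cur          # ("icmp", dest, cond, a, b)
            out.append(("cmp_br", cond, a, b, nxt[2]))
            i += 2                             # consume both instructions
        else:
            out.append(cur)
            i += 1
    return out
```

If the boolean has any other consumer (say, a select), the pair is left alone
and lowered separately.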
+
+Sandboxing
+^^^^^^^^^^
+
+This is not actually a separate pass, but is integrated into lowering and
+assembly.
+
+- Indirect branches, including the ret instruction, are masked to a bundle
+ boundary and bundle-locked.
+
+- Call instructions are aligned to the end of the bundle so that the return
+ address is bundle-aligned.
+
+- Indirect branch targets, including function entry and targets in a switch
+ statement jump table, are bundle-aligned.
+
+- The intrinsic for reading the thread pointer is inlined appropriately.
+
+- For x86-64, non-stack memory accesses are with respect to the reserved sandbox
+ base register. We reduce the aggressiveness of address mode inference to
+ leave room for the sandbox base register during lowering. There are no memory
+ sandboxing changes for x86-32.
+
+Code emission
+-------------
+
+Subzero's integrated assembler is derived from Dart's assembler code. There is
+a pass that iterates through the low-level ICE instructions and invokes the
+relevant assembler functions. Placeholders are added for later fixup of branch
+target offsets. (Backward branches use short offsets if possible; forward
+branches always use long offsets.) The assembler emits into a staging buffer.
+Once emission into the staging buffer for a function is complete, the data is
+emitted to the output file as an ELF object file, and metadata such as
+relocations, symbol table, and string table, are accumulated for emission at the
+end. Global data initializers are emitted similarly. A key point is that at
+this point, the staging buffer can be deallocated, and only a minimum of data
+needs to be held until the end.
+
+As a debugging alternative, Subzero can emit textual assembly code which can
+then be run through an external assembler. This is of course super slow, but
+quite valuable when bringing up a new target.
+
+As another debugging option, the staging buffer can be emitted as textual
+assembly, primarily in the form of ``.byte`` lines. This allows the assembler to
+be tested separately from the ELF related code.
+
+Memory management
+-----------------
+
+Where possible, we allocate from a ``CfgLocalAllocator``, which derives from
+LLVM's ``BumpPtrAllocator``. This is an arena-style allocator where objects
+allocated from the arena are never actually freed; instead, when the Cfg
+translation completes and the Cfg is deleted, the entire arena memory is
+reclaimed at once.
+
+Instructions are probably the most heavily allocated complex class in Subzero.
+We represent an instruction list as an intrusive doubly linked list, allocate
+all instructions from the ``CfgLocalAllocator``, and make sure each instruction
+subclass is basically POD with a trivial destructor. This way, when the Cfg is
+finished, we don't need to individually deallocate every instruction. We do
+the same for Variables, probably the second most heavily allocated complex
+class.
+
+There are some situations where passes need to use some STL container class.
+Subzero has a way of using the ``CfgLocalAllocator`` as the container allocator
+if this is needed.
+
+Multithreaded translation
+-------------------------
+
+Subzero is designed to be able to translate functions in parallel. With the
+``-threads=N`` command-line option, there is a 3-stage producer-consumer
+pipeline:
+
+- A single thread parses the pexe file and produces a sequence of work units. A
+ work unit can be either a fully constructed Cfg, or a set of global
+ initializers. The work unit includes its sequence number denoting its parse
+ order. Each work unit is added to the translation queue.
+
+- There are N translation threads that draw work units from the translation
+ queue and lower them into assembler buffers. Each assembler buffer is added
+ to the emitter queue, tagged with its sequence number. The Cfg and its
+ CfgLocalAllocator are disposed of at this point.
+
+- A single thread draws assembler buffers from the emitter queue and appends to
+ the output file. It uses the sequence numbers to reintegrate the assembler
+ buffers according to the original parse order, such that output order is
+ always deterministic.
+
+Note that ``-threads=N`` actually creates N+2 threads. For the special case of
+N=0, execution is entirely sequential -- the same thread parses, translates, and
+emits, one function at a time. This is useful for performance measurements.
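
The emitter thread's sequence-number reintegration can be sketched as a
small Python model. The class and its bookkeeping are illustrative, not
Subzero's actual emitter code:

```python
import heapq

class OrderedEmitter:
    """Accept (sequence_number, buffer) pairs in any arrival order and
    release buffers strictly in parse order, holding out-of-order
    arrivals until their predecessors have been emitted."""
    def __init__(self):
        self.next_seq = 0      # next sequence number to emit
        self.held = []         # min-heap of (seq, buffer) not yet emittable
        self.emitted = []      # stands in for appending to the output file

    def add(self, seq, buf):
        heapq.heappush(self.held, (seq, buf))
        # Drain every buffer that is now contiguous with the output.
        while self.held and self.held[0][0] == self.next_seq:
            _, ready = heapq.heappop(self.held)
            self.emitted.append(ready)
            self.next_seq += 1
```

However the translation threads happen to finish, the emitted order matches
the parse order, so the output file is deterministic.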
+
+Currently, parsing takes about 11% of total sequential time. If translation
+ever becomes fast and scalable enough that parsing becomes the bottleneck, it
+should be possible to make parsing multithreaded as well.
+
+Internally, all shared, mutable data is held in the GlobalContext object, and
+access to each field is guarded by a mutex.
+
+Security
+--------
+
+Sandboxed translator
+^^^^^^^^^^^^^^^^^^^^
+
+When running inside the browser, the Subzero translator executes as sandboxed,
+untrusted code that is initially checked by the validator, just like the
+LLVM-based pnacl-llc translator. As such, the Subzero binary should be no more
+or less secure than the translator it replaces, from the point of view of the
+Chrome sandbox.
+
+Code diversification
+^^^^^^^^^^^^^^^^^^^^
+
+Return-oriented programming (ROP) is a now-common technique for turning a known
+vulnerability, such as a buffer overflow, into a deeper exploit.
+The attacker scans the executable looking for ROP gadgets, which are short
+sequences of code that happen to load known values into known registers and then
+return. An attacker who can overwrite parts of the stack can
+overwrite it with carefully chosen return addresses such that certain ROP
+gadgets are effectively chained together to set up the register state as
+desired, finally returning to some code that manages to do something nasty based
+on those register values.
+
+If there is a popular pexe with a large install base, the attacker could run
+Subzero on it and scan the executable for suitable ROP gadgets to use as part of
+a potential exploit. Note that if the trusted validator is working correctly,
+these ROP gadgets are limited to starting at a bundle boundary and cannot use
+the trick of finding a gadget that happens to begin inside another instruction.
+All the same, gadgets with these constraints still exist and the attacker has
+access to them.
+
+Subzero can mitigate these attacks to some degree through code diversification.
+Specifically, we can apply some randomness to the code generation that makes ROP
+gadgets less predictable. This randomness can have some compile-time cost, and
+it can affect the code quality; and some diversifications may be more effective
+than others.
+
+To evaluate diversification effectiveness, we use a third-party ROP gadget
+finder and limit its results to bundle-aligned addresses. For a given
+diversification technique, we run it with a number of different random seeds,
+find ROP gadgets for each version, and determine how persistent each ROP gadget
+is across the different versions. A gadget is persistent if the same gadget is
+found at the same code address. The best diversifications are ones with low
+gadget persistence rates.
+
+Subzero implements 7 different diversification techniques. Below is a
+discussion of each technique, its effectiveness, and its cost.
+
+Constant blinding
+~~~~~~~~~~~~~~~~~
+
+Here, we prevent attackers from controlling large immediates in the text
+(executable) section. A random cookie is generated for each function, and if
+the constant exceeds a specified threshold, the constant is obfuscated with the
+cookie and equivalent code is generated. For example, instead of this x86
+instruction::
+
+ mov $0x11223344, <%Reg/Mem>
+
+the following code might be generated::
+
+ mov $(0x11223344+Cookie), %temp
+ lea -Cookie(%temp), %temp
+ mov %temp, <%Reg/Mem>
+
+The ``lea`` instruction is used rather than e.g. ``add``/``sub`` or ``xor``, to
+prevent unintended effects on the flags register.
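
The arithmetic can be sketched with a small Python model using 32-bit
wraparound. The pseudo-instruction names, the ``dst``/``tmp`` registers, and
the threshold value are illustrative, not Subzero's IR:

```python
MASK32 = 0xFFFFFFFF

def blind_constant(value, cookie, threshold=0xFFFF):
    """Emit a sequence that materializes `value` without embedding it in
    the text section when it exceeds `threshold`.  The lea-style step is
    used for un-blinding because lea does not modify the flags register."""
    if value <= threshold:
        return [("mov_imm", "dst", value)]    # small constants stay as-is
    blinded = (value + cookie) & MASK32
    return [("mov_imm", "tmp", blinded),
            ("lea_disp", "dst", "tmp", (-cookie) & MASK32)]

def evaluate(seq):
    """Tiny interpreter to check that the sequence reproduces the value."""
    regs = {}
    for ins in seq:
        if ins[0] == "mov_imm":
            regs[ins[1]] = ins[2]
        else:  # ("lea_disp", dst, src, disp)
            regs[ins[1]] = (regs[ins[2]] + ins[3]) & MASK32
    return regs["dst"]
```

Because addition and subtraction of the cookie cancel modulo 2^32, the
original value is reproduced while never appearing in the emitted bytes.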
+
+This transformation has almost no effect on translation time, and about 1%
+impact on code quality, depending on the threshold chosen. It does little to
+reduce gadget persistence, but it does remove a lot of potential opportunities
+to construct intra-instruction ROP gadgets (which an attacker could use if a
+validator bug were discovered).
+
+Constant pooling
+~~~~~~~~~~~~~~~~
+
+This is similar to constant blinding, in that large immediates are removed from
+the text section. In this case, each unique constant above the threshold is
+stored in a read-only data section and the constant is accessed via a memory
+load. For the above example, the following code might be generated::
+
+ mov $Label$1, %temp
+ mov %temp, <%Reg/Mem>
+
+This has a similarly small impact on translation time and ROP gadget
+persistence, and a smaller (better) impact on code quality. This is because it
+uses fewer instructions, and in some cases good register allocation leads to no
+increase in instruction count. Note that this still gives an attacker some
+limited amount of control over some text section values, unless we randomize the
+constant pool layout.
+
+Static data reordering
+~~~~~~~~~~~~~~~~~~~~~~
+
+This transformation limits the attacker's ability to control bits in global data
+address references. It simply permutes the order in memory of global variables
+and internal constant pool entries. For the constant pool, we only permute
+within a type (i.e., emit a randomized list of ints, followed by a randomized
+list of floats, etc.) to maintain good packing in the face of alignment
+constraints.
+
+As might be expected, this has no impact on code quality, translation time, or
+ROP gadget persistence (though as above, it limits opportunities for
+intra-instruction ROP gadgets with a broken validator).
+
+Basic block reordering
+~~~~~~~~~~~~~~~~~~~~~~
+
+Here, we randomize the order of basic blocks within a function, with the
+constraint that we still want to maintain a topological order as much as
+possible, to avoid making the code too branchy.
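
One way to implement this is a topological sort that chooses randomly among
the ready blocks. The Python sketch below assumes an acyclic successor map;
real CFGs have back edges, which need extra handling that is omitted here:

```python
import random

def randomized_block_layout(succs, seed):
    """Emit basic blocks in a seed-dependent order that still respects a
    topological order of the successor graph (assumed acyclic here)."""
    rng = random.Random(seed)
    indegree = {block: 0 for block in succs}
    for block in succs:
        for succ in succs[block]:
            indegree[succ] += 1
    ready = [b for b in succs if indegree[b] == 0]
    layout = []
    while ready:
        # Any ready block keeps the layout topological; pick one at random.
        block = ready.pop(rng.randrange(len(ready)))
        layout.append(block)
        for succ in succs[block]:
            indegree[succ] -= 1
            if indegree[succ] == 0:
                ready.append(succ)
    return layout
```

Different seeds may swap the order of independent blocks (such as the two
arms of an if), while the entry and join blocks stay put.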
+
+This has no impact on code quality, and about 1% impact on translation time, due
+to a separate pass to recompute layout. It ends up having a huge effect on ROP
+gadget persistence, tied for best with nop insertion, reducing ROP gadget
+persistence to less than 5%.
+
+Function reordering
+~~~~~~~~~~~~~~~~~~~
+
+Here, we permute the order that functions are emitted, primarily to shift ROP
+gadgets around to less predictable locations. It may also change call address
+offsets in case the attacker was trying to control that offset in the code.
+
+To control latency and memory footprint, we don't arbitrarily permute functions.
+Instead, for some relatively small value of N, we queue up N assembler buffers,
+and then emit the N functions in random order, and repeat until all functions
+are emitted.
+
+Function reordering has no impact on translation time or code quality.
+Measurements indicate that it reduces ROP gadget persistence to about 15%.
+
+Nop insertion
+~~~~~~~~~~~~~
+
+This diversification randomly adds a nop instruction after each regular
+instruction, with some probability. Nop instructions of different lengths may
+be selected. Nop instructions are never added inside a ``bundle_lock`` region.
+Note that when sandboxing is enabled, nop instructions are already being added
+for bundle alignment, so the diversification nop instructions may simply be
+taking the place of alignment nop instructions, though distributed differently
+through the bundle.
+
+In Subzero's current implementation, nop insertion adds 3-5% to the
+translation time, but this is probably because it is implemented as a separate
+pass that adds actual nop instructions to the IR. The overhead would probably
+be a lot less if it were integrated into the assembler pass. The code quality
+is also reduced by 3-5%, making nop insertion the most expensive of the
+diversification techniques.
+
+Nop insertion is very effective in reducing ROP gadget persistence, at the same
+level as basic block randomization. But given nop insertion's impact on
+translation time and code quality, one would most likely prefer to use basic
+block randomization instead.
+
+Register allocation randomization
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In this diversification, the register allocator tries to make different but
+mostly functionally equivalent choices, while maintaining stable code quality.
+
+A naive approach would be the following. Whenever the allocator has more than
+one choice for assigning a register, choose randomly among those options. And
+whenever there are no registers available and there is a tie for the
+lowest-weight variable, randomly select one of the lowest-weight variables to
+evict. Because of the one-pass nature of the linear-scan algorithm, this
+randomization strategy can have a large impact on which variables are ultimately
+assigned registers, with a corresponding large impact on code quality.
+
+Instead, we choose an approach that tries to keep code quality stable regardless
+of the random seed. We partition the set of physical registers into equivalence
+classes. If a register is pre-colored in the function (i.e., referenced
+explicitly by name), it forms its own equivalence class. The remaining
+registers are partitioned according to their combination of attributes such as
+integer versus floating-point, 8-bit versus 32-bit, caller-save versus
+callee-saved, etc. Each equivalence class is randomly permuted, and the
+complete permutation is applied to the final register assignments.
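
The permutation step can be sketched as follows (a Python illustration; the
register names and class partition are examples, not Subzero's tables):

```python
import random

def permute_registers(reg_classes, precolored, seed):
    """Build a permutation that remaps registers only within their
    equivalence class and leaves pre-colored registers fixed, so code
    quality is unaffected while assignments vary with the seed."""
    rng = random.Random(seed)
    perm = {reg: reg for cls in reg_classes for reg in cls}
    for cls in reg_classes:
        movable = [reg for reg in cls if reg not in precolored]
        shuffled = movable[:]
        rng.shuffle(shuffled)
        perm.update(zip(movable, shuffled))
    return perm
```

The resulting map is then applied to the allocator's final assignments, so
every register constraint the allocator satisfied is still satisfied.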
+
+Register randomization reduces ROP gadget persistence to about 10% on average,
+though there tends to be fairly high variance across functions and applications.
+This probably has to do with the set of restrictions in the x86-32 instruction
+set and ABI, such as few general-purpose registers, ``%eax`` used for return
+values, ``%edx`` used for division, ``%cl`` used for shifting, etc. As
+intended, register randomization has no impact on code quality, and a slight
+(0.5%) impact on translation time due to an extra scan over the variables to
+identify pre-colored registers.
+
+Fuzzing
+^^^^^^^
+
+We have started fuzzing the pexe files input to Subzero. Most of the problems
+found so far are ones most appropriately handled in the parser. However, there
+have been a couple that have identified problems in the lowering, or otherwise
+inappropriately triggered assertion failures and fatal errors. We continue to
+dig into this area.
+
+Chrome integration
+------------------
+
+Currently Subzero is available in Chrome for the x86-32 architecture, but under
+a flag. When the flag is enabled, Subzero is used when the pexe developer's
+manifest specifies the ``O0`` optimization level.
+
+The next step is to remove the flag, i.e. invoke Subzero as the only translator
+for ``O0``-specified manifest files.
+
+Ultimately, Subzero might produce code rivaling LLVM ``O2`` quality, in which
+case Subzero could be used for all PNaCl translation.
+
+Command line options
+--------------------
+
+Subzero has a number of command-line options for debugging and diagnostics.
+Among the more interesting are the following.
+
+- Using the ``-verbose`` flag, Subzero will dump the Cfg, or produce other
+ diagnostic output, with various levels of detail after each pass. Instruction
+ numbers can be printed or suppressed. Deleted instructions can be printed or
+ suppressed (they are retained in the instruction list, as discussed earlier,
+ because they can help explain how lower-level instructions originated).
+ Liveness information can be printed when available. Details of register
+ allocation can be printed as register allocator decisions are made. And more.
+
+- Running Subzero with any level of verbosity produces an enormous amount of
+ output. When debugging a single function, verbose output can be suppressed
+ except for a particular function. The ``-verbose-focus`` flag suppresses
+ verbose output except for the specified function.
+
+- Subzero has a ``-timing`` option that prints a breakdown of pass-level timing
+ at exit. Timing markers can be placed in the Subzero source code to demarcate
+ logical operations or passes of interest. Basic timing information plus
+ call-stack type timing information is printed at the end.
+
+- Along with ``-timing``, the user can instead get a report on the overall
+ translation time for each function, to help focus on timing outliers. Also,
+ ``-timing-focus`` limits the ``-timing`` reporting to a single function,
+ instead of aggregating pass timing across all functions.
+
+- The ``-szstats`` option reports various statistics on each function, such as
+ stack frame size, static instruction count, etc. It may be helpful to track
+ these stats over time as Subzero is improved, as an approximate measure of
+ code quality.
+
+- The flag ``-asm-verbose``, in conjunction with emitting textual assembly
+  output, annotates the assembly output with register-focused liveness
+ information. In particular, each basic block is annotated with which
+ registers are live-in and live-out, and each instruction is annotated with
+ which registers' and stack locations' live ranges end at that instruction.
+ This is really useful when studying the generated code to find opportunities
+ for code quality improvements.
+
+Testing and debugging
+---------------------
+
+LLVM lit tests
+^^^^^^^^^^^^^^
+
+For basic testing, Subzero uses LLVM's lit framework for running tests. We have
+a suite of hundreds of small functions where we test for particular assembly
+code patterns across different target architectures.
+
+Cross tests
+^^^^^^^^^^^
+
+Unfortunately, the lit tests don't do a great job of precisely testing the
+correctness of the output. Much better are the cross tests, which are execution
+tests that compare Subzero and llc translated bitcode across a wide variety of
+interesting inputs. Each cross test consists of a set of C, C++, and/or
+low-level bitcode files. The C and C++ source files are compiled down to
+bitcode. The bitcode files are translated by llc and also by Subzero. Subzero
+mangles global symbol names with a special prefix to avoid duplicate symbol
+errors. A driver program invokes both versions on a large set of interesting
+inputs, and reports when the Subzero and llc results differ. Cross tests turn
+out to be an excellent way of testing the basic lowering patterns, but they are
+less useful for testing more global things like liveness analysis and register
+allocation.
+
+Bisection debugging
+^^^^^^^^^^^^^^^^^^^
+
+Sometimes with a new application, Subzero will end up producing incorrect code
+that either crashes at runtime or otherwise produces the wrong results. When
+this happens, we need to narrow it down to a single function (or small set of
+functions) that yield incorrect behavior. For this, we have a bisection
+debugging framework. Here, we initially translate the entire application once
+with Subzero and once with llc. We then use objcopy to selectively weaken
+symbols based on a whitelist or blacklist provided on the command line. The two
+object files can then be linked together without link errors, with the desired
+version of each method "winning". Then the binary is tested, and bisection
+proceeds based on whether the binary produces correct output.
+
+When the bisection completes, we are left with a minimal set of
+Subzero-translated functions that cause the failure. Usually it is a single
+function, though sometimes it might require a combination of several functions
+to cause a failure; this may be due to an incorrect call ABI, for example.
+However, Murphy's Law implies that the single failing function is enormous and
+impractical to debug. In that case, we can restart the bisection, explicitly
+blacklisting the enormous function, and try to find another candidate to debug.
+(Future work is to automate this to find all minimal sets of functions, so that
+debugging can focus on the simplest example.)
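
The narrowing itself is an ordinary bisection. Assuming a single bad function
and a ``fails(subset)`` predicate that stands in for relinking with that
subset taken from the Subzero build and running the test, it can be sketched
as:

```python
def bisect_bad_function(funcs, fails):
    """Narrow a failure to one function, assuming exactly one function in
    `funcs` is miscompiled and `fails` is deterministic."""
    candidates = list(funcs)
    while len(candidates) > 1:
        half = candidates[:len(candidates) // 2]
        # Keep whichever half still reproduces the failure.
        candidates = half if fails(half) else candidates[len(candidates) // 2:]
    return candidates[0]
```

When several functions must conspire to fail, neither half reproduces the
failure on its own, which is the cue to fall back to whitelist/blacklist
experiments as described above.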
+
+Fuzz testing
+^^^^^^^^^^^^
+
+As described above, we try to find internal Subzero bugs using fuzz testing
+techniques.
+
+Sanitizers
+^^^^^^^^^^
+
+We regularly run Subzero with asan and tsan. So far, multithreading has been
+simple enough that tsan hasn't found any bugs, but asan has found at least one
+memory leak which was subsequently fixed.
+
+Current status
+==============
+
+Target architectures
+--------------------
+
+Subzero is currently more or less complete for the x86-32 target. It has been
+refactored and extended to handle x86-64 as well, and that is mostly complete at
+this point.
+
+ARM32 work is in progress. It currently lacks the testing level of x86, at
+least in part because Subzero's register allocator needs modifications to handle
+ARM's aliasing of floating point and vector registers. Specifically, a 64-bit
+register is actually a gang of two consecutive and aligned 32-bit registers, and
+a 128-bit register is a gang of 4 consecutive and aligned 32-bit registers.
+ARM64 work has not started; when it does, it will be native-only, since the
+Native Client sandbox model, validator, and other tools were never defined for
+ARM64.
+
+An external contributor is adding MIPS support, largely by following the ARM
+work.
+
+Translator performance
+----------------------
+
+Single-threaded translation speed is currently about 5x the llc translation
+speed. For a large pexe, the time breaks down as:
+
+- 11% for parsing and initial IR building
+
+- 4% for emitting to /dev/null
+
+- 27% for liveness analysis (two liveness passes plus live range construction)
+
+- 15% for linear-scan register allocation
+
+- 9% for basic lowering
+
+- 10% for advanced phi lowering
+
+- ~11% for other minor analysis
+
+- ~10% measurement overhead to acquire these numbers
+
+Some improvements could undoubtedly be made, but it will be hard to increase the
+speed to 10x of llc while keeping acceptable code quality. With ``-Om1`` (lack
+of) optimization, we do actually achieve roughly 10x llc translation speed, but
+code quality drops by a factor of 3.
+
+Code quality
+------------
+
+Measured across 16 components of spec2k, Subzero's code quality is uniformly
+better than llc ``-O0`` code quality, and in many cases solidly between llc
+``-O0`` and ``-O2``.
+
+Translator size
+---------------
+
+When built in MINIMAL mode, the x86-64 native translator size for the x86-32
+target is about 700 KB, not including the size of functions referenced in
+dynamically-linked libraries. The sandboxed version of Subzero is a bit over
+1 MB; it is statically linked, and it includes nop padding for bundling as
+well as indirect branch masking.
+
+Translator memory footprint
+---------------------------
+
+It's hard to draw firm conclusions about memory footprint, since the footprint
+is at least proportional to the input function size, and there is no real limit
+on the size of functions in the ``.pexe`` file.
+
+That said, we looked at the memory footprint over time as Subzero translated
+``pnacl-llc.pexe``, which is the largest ``.pexe`` (7.2 MB) at our disposal.
+One of LLVM's libraries that Subzero uses can report the current malloc heap
+usage.
+With single-threaded translation, Subzero tends to hover around 15 MB of memory
+usage. There are a couple of monstrous functions where Subzero grows to around
+100 MB, but then it drops back down after those functions finish translating.
+In contrast, llc grows larger and larger throughout translation, reaching
+several hundred MB by the time it completes.
+
+It's a bit more interesting when we enable multithreaded translation. When
+there are N translation threads, Subzero implements a policy that limits the
+size of the translation queue to N entries: if it is "full" when the parser
+tries to add a new Cfg, the parser blocks until one of the translation threads
+removes a Cfg. This means the number of in-memory Cfgs can (and generally
+does) reach ``2*N+1``, and so the memory footprint rises in proportion to the
+number of threads. Adding to the pressure is the observation that the
+monstrous functions also take proportionally longer to translate, so there's a
+good chance many of them will be active at the same time with multithreaded
+translation. As a result, for N=32, Subzero's memory footprint peaks at about
+260 MB, but drops back down as the large functions finish translating.
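+The queue policy described above is in essence a standard bounded
+producer/consumer queue; the following sketch (illustrative names, not
+Subzero's actual implementation) shows the blocking behavior:

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <utility>

// Sketch of a bounded translation queue (names are illustrative, not
// Subzero's actual classes).  The parser blocks in push() when MaxSize
// entries are already queued; translation threads block in pop() when the
// queue is empty.
template <typename Cfg> class BoundedQueue {
public:
  explicit BoundedQueue(size_t MaxSize) : MaxSize(MaxSize) {}

  // Called by the parser thread; blocks while the queue is "full".
  void push(Cfg Item) {
    std::unique_lock<std::mutex> Lock(Mtx);
    NotFull.wait(Lock, [this] { return Queue.size() < MaxSize; });
    Queue.push_back(std::move(Item));
    NotEmpty.notify_one();
  }

  // Called by a translation thread; blocks while the queue is empty.
  Cfg pop() {
    std::unique_lock<std::mutex> Lock(Mtx);
    NotEmpty.wait(Lock, [this] { return !Queue.empty(); });
    Cfg Item = std::move(Queue.front());
    Queue.pop_front();
    NotFull.notify_one(); // wake the parser if it was blocked
    return Item;
  }

private:
  const size_t MaxSize;
  std::mutex Mtx;
  std::condition_variable NotFull, NotEmpty;
  std::deque<Cfg> Queue;
};
```

+With N queued Cfgs, N Cfgs being translated, and one Cfg being built by the
+parser, up to ``2*N+1`` Cfgs can be live at once, matching the footprint
+behavior described above.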
+
+If this peak memory size becomes a problem, it might be possible for the
+parser to resequence the functions to try to spread out the larger functions,
+or to throttle the translation queue to prevent too many in-flight large
+functions. It might also be useful to react to a memory pressure signal from
+the browser by throttling translation.
+
+Translator scalability
+----------------------
+
+Currently scalability is "not very good". Multiple translation threads lead to
+faster translation, but not to the degree desired. We haven't dug in to
+investigate yet.
+
+There are a few areas to investigate. First, there may be contention on the
+constant pool, which all threads access, and which requires locked access even
+for reading. This could be mitigated by keeping a Cfg-local cache of the most
+common constants.
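+One way to picture that mitigation (all names here are hypothetical, not
+Subzero's actual classes): a per-Cfg cache sits in front of the locked global
+pool, so repeated lookups of the same constant within a function never touch
+the lock after the first access:

```cpp
#include <cstdint>
#include <memory>
#include <mutex>
#include <unordered_map>

// Hypothetical sketch, not Subzero's actual interface.
struct Constant {
  explicit Constant(int64_t V) : Value(V) {}
  int64_t Value;
};

// Global, shared pool: interns constants, but every access pays for a lock.
class GlobalConstantPool {
public:
  const Constant *getOrAdd(int64_t Value) {
    std::lock_guard<std::mutex> Lock(Mtx);
    std::unique_ptr<Constant> &Slot = Pool[Value];
    if (!Slot)
      Slot = std::make_unique<Constant>(Value);
    return Slot.get();
  }

private:
  std::mutex Mtx;
  std::unordered_map<int64_t, std::unique_ptr<Constant>> Pool;
};

// Per-Cfg cache: only the first lookup of each constant takes the global
// lock; subsequent lookups are lock-free within the Cfg.
class CfgConstantCache {
public:
  explicit CfgConstantCache(GlobalConstantPool &Pool) : Pool(Pool) {}
  const Constant *get(int64_t Value) {
    auto It = Cache.find(Value);
    if (It != Cache.end())
      return It->second; // hit: no global lock taken
    const Constant *C = Pool.getOrAdd(Value);
    Cache.emplace(Value, C);
    return C;
  }

private:
  GlobalConstantPool &Pool;
  std::unordered_map<int64_t, const Constant *> Cache;
};
```

+Since the global pool still does the interning, two Cfgs looking up the same
+value get the same ``Constant`` pointer, preserving pooling semantics.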
+
+Second, there may be contention on memory allocation. While almost all Cfg
+objects are allocated from the Cfg-local allocator, some passes use temporary
+STL containers that use the default allocator, which may require global locking.
+This could be mitigated by switching these to the Cfg-local allocator.
+
+Third, multithreading may make the default allocator strategy more expensive.
+In a single-threaded environment, a pass will allocate its containers, run the
+pass, and deallocate the containers. This results in stack-like allocation
+behavior and makes the heap free list easier to manage, with less heap
+fragmentation. But when multithreading is added, the allocations and
+deallocations become much less stack-like, making allocation and deallocation
+operations individually more expensive. Again, this could be mitigated by
+switching these to the Cfg-local allocator.