Design of the Subzero fast code generator
=========================================

Introduction
------------

The `Portable Native Client (PNaCl) <http://gonacl.com>`_ project includes
compiler technology based on `LLVM <http://llvm.org/>`_. The web program
developer uses the PNaCl toolchain to compile their application to
architecture-neutral PNaCl bitcode (a pexe file), using as much
architecture-neutral optimization as possible. The pexe file is downloaded to
the user's browser where the PNaCl translator (a component of Chrome) compiles
the pexe file to sandboxed native code. The translator uses
architecture-specific optimizations as much as practical to generate good native
code.

The native code can be cached by the browser to avoid repeating translation on
future page loads. However, first-time user experience can be hampered by long
translation times. The LLVM-based PNaCl translator is quite slow, even when
using ``-O0`` to minimize optimizations, so delays are especially noticeable on
slow browser platforms such as ARM-based Chromebooks.

Translator slowness can be mitigated or hidden in a number of ways.

- Parallel translation. However, the slow machines where this matters most,
  e.g. ARM Chromebooks, are likely to have fewer cores to parallelize across.

- Streaming translation, i.e. starting translation as soon as the download
  starts. This doesn't help much when translation speed is 10x slower than
  download speed, or when the pexe file is already cached but the translated
  binary was flushed from the cache.

- Arranging the web page such that translation is done in parallel with
  downloading large assets. Arranging the web page to distract the user with
  cat videos while translation is in progress. Or, improving translator
  performance to something more reasonable.

Goals
=====

Translation speed
-----------------

We'd like to be able to translate a pexe as fast as download speed. Any faster
is in a sense wasted effort. Download speed varies greatly, but we'll
arbitrarily say 1 MB/sec. We'll pick an ARM A15 platform as the example of a
slow machine. We observe a 3x single-thread performance difference between the
A15 and a z620 workstation, and aggressively assume a pexe file could be
compressed to 50% on the web server before network transport, so we set the
translation speed goal to 6 MB/sec on a z620 workstation.
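
The arithmetic behind that 6 MB/sec figure can be made explicit. The sketch
below simply restates the assumptions from the paragraph above (1 MB/sec
download, 2x decompression, 3x single-thread ratio); the variable names are
illustrative:

```python
# Back-of-the-envelope derivation of the 6 MB/sec translation-speed goal.
download_speed_mb_s = 1.0   # assumed network download speed
compression_ratio = 2.0     # pexe assumed compressed to 50% for transport
a15_vs_z620_ratio = 3.0     # z620 single-thread perf relative to an A15

# The A15 must translate decompressed pexe bytes as fast as they arrive.
a15_goal = download_speed_mb_s * compression_ratio   # 2 MB/sec on the A15

# The equivalent goal, measured on the faster z620 workstation.
z620_goal = a15_goal * a15_vs_z620_ratio             # 6 MB/sec on the z620

print(z620_goal)  # 6.0
```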

Currently, at the ``-O0`` level, the LLVM-based PNaCl translator translates at
1/10 the target rate. The ``-O2`` mode takes 3x as long as the ``-O0`` mode.

In other words, Subzero's goal is to improve over LLVM's translation speed by
10x.

Code quality
------------

Subzero's initial goal is to produce code that meets or exceeds LLVM's ``-O0``
code quality. The stretch goal is to approach LLVM ``-O2`` code quality. On
average, LLVM ``-O2`` performs twice as well as LLVM ``-O0``.

Translator size
---------------

The current LLVM-based translator binary (pnacl-llc) is about 10MB in size. We
think 1MB is a more reasonable size, so we target a 10x reduction in binary
size.

For development, Subzero can be built for all target architectures, with all
debugging and diagnostic options enabled. For a smaller translator, we restrict
the build to a single target architecture, and define a MINIMAL build where
unnecessary features are compiled out.

Memory footprint
----------------

The current LLVM-based translator suffers from an issue in which some
function-specific data has to be retained in memory until all translation
completes, and therefore the memory footprint grows without bound. Large pexes
can lead to the translator process holding hundreds of MB of memory by the end.

Subzero should maintain a stable memory footprint throughout translation. It's
hard to set a specific limit, because there is no practical limit on a single
function's size, but the footprint should be "reasonable", without leaks or
inexorable growth. (We use ASAN builds to test for leaks.)

Multithreaded translation
-------------------------

It should be practical to translate different functions concurrently and see
good scalability. Some locking may be needed, such as accessing output buffers
or constant pools, but that should be fairly minimal. In contrast, LLVM was
only designed for module-level parallelism, and as such, the PNaCl translator
internally splits a pexe file into several modules for concurrent translation.
All output needs to be deterministic regardless of the level of multithreading,
i.e. functions and data should always be output in the same order.

Target architectures
--------------------

Initial target architectures are x86-32, x86-64, ARM32, and MIPS32. Future
targets include ARM64 and MIPS64, though these targets lack NaCl support,
including a sandbox model or a validator.

The first implementation is for x86-32, because it was expected to be
particularly challenging, and thus more likely to draw out any design problems
early:

- There are a number of special cases, asymmetries, and warts in the x86
  instruction set.

- Complex addressing modes may be leveraged for better code quality.

- 64-bit integer operations have to be lowered into longer sequences of 32-bit
  operations.

- Paucity of physical registers may reveal code quality issues early in the
  design.

Detailed design
===============

Intermediate representation - ICE
---------------------------------

Subzero's IR is called ICE. It is designed to be reasonably similar to LLVM's
IR, which is reflected in the pexe bitcode structure.

A function is represented by the ``Ice::Cfg`` class (CFG = control flow graph).
Its key contents include:

- A list of ``CfgNode`` pointers, generally held in topological order.

- A list of ``Variable`` pointers corresponding to local variables used in the
  function plus compiler-generated temporaries.

A basic block is represented by the ``Ice::CfgNode`` class. Its key contents
include:

- A linear list of instructions, in the same style as LLVM. The last
  instruction of the list is always a terminator instruction: branch, switch,
  return, unreachable.

- A list of Phi instructions, also in the same style as LLVM. They are held as
  a linear list for convenience, though per Phi semantics, they are executed
  "in parallel" without dependencies on each other.

- An unordered list of ``CfgNode`` pointers corresponding to incoming edges,
  and another list for outgoing edges.

- The node's unique, 0-based index into the Cfg's node list.

An instruction is represented by the ``Ice::Inst`` class. Its key contents
include:

- A list of source operands.

- Its destination variable, if the instruction produces a result in an
  ``Ice::Variable``.

- A bitvector indicating which variables' live ranges this instruction ends.
  This is computed during liveness analysis.

Instruction kinds are divided into high-level ICE instructions and low-level
ICE instructions. High-level instructions consist of the PNaCl/LLVM bitcode
instruction kinds. Each target architecture implementation extends the
instruction space with its own set of low-level instructions. Generally,
low-level instructions correspond to individual machine instructions. The
high-level ICE instruction space includes a few additional instruction kinds
that are not part of LLVM but are generally useful (e.g., an Assignment
instruction), or are useful across targets (e.g., BundleLock and BundleUnlock
instructions for sandboxing).

Specifically, the high-level ICE instructions that derive from LLVM are the
following:

- Alloca: allocate data on the stack

- Arithmetic: binary operations of the form A = B op C

- Br: conditional or unconditional branch

- Call: function call

- Cast: unary type-conversion operations

- ExtractElement: extract a scalar element from a vector-type value

- Fcmp: floating-point comparison

- Icmp: integer comparison

- IntrinsicCall: call a known intrinsic

- InsertElement: insert a scalar element into a vector-type value

- Load: load a value from memory

- Phi: implement the SSA phi node

- Ret: return from the function

- Select: essentially the C operation of the form X = C ? Y : Z

- Store: store a value into memory

- Switch: generalized branch to multiple possible locations

- Unreachable: indicate that this portion of the code is unreachable

The additional high-level ICE instructions are the following:

- Assign: a simple ``A=B`` assignment. This is useful for e.g. lowering Phi
  instructions to non-SSA assignments, before lowering to machine code.

- BundleLock, BundleUnlock. These are markers used for sandboxing, but are
  common across all targets and so they are elevated to the high-level
  instruction set.

- FakeDef, FakeUse, FakeKill. These are tools used to preserve consistency in
  liveness analysis, elevated to the high-level instruction set because they
  are used by all targets. They are described in more detail at the end of
  this section.

- JumpTable: this represents the result of switch optimization analysis, where
  some switch instructions may use jump tables instead of cascading
  compare/branches.

An operand is represented by the ``Ice::Operand`` class. In high-level ICE, an
operand is either an ``Ice::Constant`` or an ``Ice::Variable``. Constants
include integer constants, floating point constants, Undef (an unspecified
constant of a particular type), and symbol constants (essentially addresses of
globals). A variable represents a value allocated on the stack (though not
including alloca-derived storage). Among other things, a variable holds its
unique, 0-based index into the Cfg's variable list.

Each target can extend the ``Constant`` and ``Variable`` classes for its own
needs. In addition, the ``Operand`` class may be extended, e.g. to define an
x86 ``MemOperand`` that encodes a base register, an index register, an index
register shift amount, and a constant offset.

Register allocation and liveness analysis are restricted to Variable operands.
Because of the importance of register allocation to code quality, and the
translation-time cost of liveness analysis, Variable operands get some special
treatment in ICE. Most notably, a frequent pattern in Subzero is to iterate
across all the Variables of an instruction. An instruction holds a list of
operands, but an operand may contain 0, 1, or more Variables. As such, the
``Operand`` class specially holds a list of the Variables contained within,
for quick access.
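
To make the Variables-within-operands pattern concrete, here is a minimal
Python sketch (Subzero itself is C++; the class shapes and the ``get_vars``
name are illustrative, not the real API) of an x86-style memory operand that
exposes the Variables it contains:

```python
class Operand:
    """Base operand: exposes contained Variables for quick iteration."""
    def get_vars(self):
        return []

class Variable(Operand):
    def __init__(self, index, name):
        self.index = index  # unique 0-based index into the Cfg's variable list
        self.name = name
    def get_vars(self):
        return [self]

class Constant(Operand):
    def __init__(self, value):
        self.value = value  # contains no Variables

class MemOperand(Operand):
    """x86-style address: base + index * 2**shift + offset."""
    def __init__(self, base=None, index=None, shift=0, offset=0):
        self.base, self.index = base, index
        self.shift, self.offset = shift, offset
    def get_vars(self):
        # A single operand may contain 0, 1, or 2 Variables.
        return [v for v in (self.base, self.index) if v is not None]

b = Variable(0, "b")
i = Variable(1, "i")
mem = MemOperand(base=b, index=i, shift=2, offset=8)
print([v.name for v in mem.get_vars()])  # ['b', 'i']
```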

A Subzero transformation pass may work by deleting an existing instruction and
replacing it with zero or more new instructions. Instead of actually deleting
the existing instruction, we generally mark it as deleted and insert the new
instructions right after the deleted instruction. When printing the IR for
debugging, this is a big help because it makes it much clearer how the
non-deleted instructions came about.

Subzero has a few special instructions to help with liveness analysis
consistency.

- The FakeDef instruction gives a fake definition of some variable. For
  example, on x86-32, a divide instruction defines both ``eax`` and ``edx``
  but an ICE instruction can represent only one destination variable. This is
  similar for multiply instructions, and for function calls that return a
  64-bit integer result in the ``edx:eax`` pair. Also, using the ``xor eax,
  eax`` trick to set ``eax`` to 0 requires an initial FakeDef of ``eax``.

- The FakeUse instruction registers a use of a variable, typically to prevent
  an earlier assignment to that variable from being dead-code eliminated. For
  example, lowering an operation like ``x=cc?y:z`` may be done using x86's
  conditional move (cmov) instruction: ``mov x, z; cmov_cc x, y``. Without a
  FakeUse of ``x`` between the two instructions, the liveness analysis pass
  may dead-code eliminate the first instruction.

- The FakeKill instruction is added after a call instruction, and is a quick
  way of indicating that caller-save registers are invalidated.

Overview of lowering
--------------------

In general, translation goes like this:

- Parse the next function from the pexe file and construct a Cfg consisting of
  high-level ICE.

- Do analysis passes and transformation passes on the high-level ICE, as
  desired.

- Lower each high-level ICE instruction into a sequence of zero or more
  low-level ICE instructions. Each high-level instruction is generally lowered
  independently, though the target lowering is allowed to look ahead in the
  CfgNode's instruction list if desired.

- Do more analysis and transformation passes on the low-level ICE, as desired.

- Assemble the low-level Cfg into an ELF object file (alternatively, a textual
  assembly file that is later assembled by some external tool).

- Repeat for all functions, and also produce object code for data such as
  global initializers and internal constant pools.

Currently there are two optimization levels: ``O2`` and ``Om1``. For ``O2``,
the intention is to apply all available optimizations to get the best code
quality (though the initial code quality goal is measured against LLVM's
``O0`` code quality). For ``Om1``, the intention is to apply as few
optimizations as possible and produce code as quickly as possible, accepting
poor code quality. ``Om1`` is short for "O-minus-one", i.e. "worse than O0",
or in other words, "sub-zero".

Debuggability of generated code was never considered in the design, but so far,
Subzero doesn't really do transformations that would obfuscate debugging. The
main thing might be that register allocation (including stack slot coalescing
for stack-allocated variables whose live ranges don't overlap) may render a
variable's value unobtainable after its live range ends. This would not be an
issue for ``Om1`` since it doesn't register-allocate program-level variables,
nor does it coalesce stack slots.

Our experience so far is that ``Om1`` translates twice as fast as ``O2``, but
produces code with one third the code quality. ``Om1`` is good for testing and
debugging -- during translation, it tends to expose errors in the basic
lowering that might otherwise have been hidden by the register allocator or
other optimization passes. It also helps determine whether a code correctness
problem is a fundamental problem in the basic lowering, or an error in another
optimization pass.

The implementation of target lowering also controls the recipe of passes used
for ``Om1`` and ``O2`` translation. For example, address mode inference may
only be relevant for x86.

Pexe parsing
------------

Subzero includes an integrated PNaCl pexe bitcode parser. It parses the pexe
file function by function, ultimately constructing an ICE Cfg for each
function. After a function is parsed, its Cfg is handed off to the translation
phase. The bitcode parser also parses global initializer data and hands it off
to be translated to data sections in the object file.

Subzero has another parsing strategy for testing/debugging. LLVM libraries can
be used to parse a module into LLVM IR (though very slowly relative to
Subzero's native parsing). Then we iterate across the LLVM IR and construct
high-level ICE, handing off each Cfg to the translation phase.

Lowering strategy
-----------------

The core of Subzero's lowering from high-level ICE to low-level ICE is to lower
each high-level instruction down to a sequence of low-level target-specific
instructions, in a largely context-free setting. That is, each high-level
instruction conceptually has a simple template expansion into low-level
instructions, and lowering can in theory be done in any order. This may sound
like a small effort, but quite a large number of templates may be needed
because of the number of PNaCl types and instruction variants. Furthermore,
there may be optimized templates, e.g. to take advantage of operator
commutativity.

The key idea for a lowering template is to produce valid low-level instructions
that are guaranteed to meet address mode and other structural requirements of
the instruction set. For example, on x86, the source operand of an integer
store instruction must be an immediate or a physical register; a shift
instruction's shift amount must be an immediate or in register ``cl``; a
function's integer return value is in ``eax``; most x86 instructions are
two-operand, in contrast to corresponding three-operand high-level
instructions; etc.

Because target lowering runs before register allocation, there is no way to
know whether a given ``Ice::Variable`` operand lives on the stack or in a
physical register. When the low-level instruction calls for a physical
register operand, the target lowering can create an infinite-weight Variable.
This tells the register allocator to assign infinite weight when making
decisions, effectively guaranteeing some physical register. Variables can also
be pre-colored to a specific physical register (``cl`` in the shift example
above), which also gives infinite weight.

To illustrate, consider a high-level arithmetic instruction on 32-bit integer
operands::

    A = B + C

X86 target lowering might produce the following::

    T.inf = B  // mov instruction
    T.inf += C // add instruction
    A = T.inf  // mov instruction

Here, ``T.inf`` is an infinite-weight temporary. As long as ``T.inf`` has a
physical register, the three lowered instructions are all encodable regardless
of whether ``B`` and ``C`` are physical registers, memory, or immediates, and
whether ``A`` is a physical register or in memory.

In this example, ``A`` must be a Variable and one may be tempted to simplify
the lowering sequence by setting ``A`` as infinite-weight and using::

    A = B  // mov instruction
    A += C // add instruction

This has two problems. First, if the original instruction was actually
``A = B + A``, the result would be incorrect. Second, assigning ``A`` a
physical register applies throughout ``A``'s entire live range. This is
probably not what is intended, and may ultimately lead to a failure to allocate
a register for an infinite-weight variable.

This style of lowering leads to many temporaries being generated, so in ``O2``
mode, we rely on the register allocator to clean things up. For example, in
the example above, if ``B`` ends up getting a physical register and its live
range ends at this instruction, the register allocator is likely to reuse that
register for ``T.inf``. This leads to ``T.inf = B`` being a redundant register
copy, which is removed as an emission-time peephole optimization.
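
That emission-time peephole can be sketched as a simple filter over the lowered
instructions. This is an illustrative model, not Subzero's emitter; ``reg_of``
stands in for the register allocator's assignments (``None`` meaning a stack
slot):

```python
def drop_redundant_copies(instrs, reg_of):
    """Remove 'mov dst, src' where both sides landed in the same register."""
    out = []
    for op, dst, src in instrs:
        same_reg = (reg_of.get(dst) is not None
                    and reg_of.get(dst) == reg_of.get(src))
        if op == "mov" and same_reg:
            continue  # dst and src share a register: the copy is a no-op
        out.append((op, dst, src))
    return out

# B's live range ended, so the allocator reused its register for T.inf.
reg_of = {"B": "eax", "T.inf": "eax", "A": None}  # A stays on the stack
lowered = [("mov", "T.inf", "B"), ("add", "T.inf", "C"), ("mov", "A", "T.inf")]
print(drop_redundant_copies(lowered, reg_of))
# [('add', 'T.inf', 'C'), ('mov', 'A', 'T.inf')]
```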

O2 lowering
-----------

Currently, the ``O2`` lowering recipe is the following:

- Address mode inference

- Read-modify-write (RMW) transformation

- Basic liveness analysis

- Load optimization

- Target lowering

- Full liveness analysis

- Register allocation

- Phi instruction lowering (advanced)

- Post-phi lowering register allocation

- Branch optimization

These passes are described in more detail below.

Om1 lowering
------------

Currently, the ``Om1`` lowering recipe is the following:

- Phi instruction lowering (simple)

- Target lowering

- Register allocation (infinite-weight and pre-colored only)

Optimization passes
-------------------

Liveness analysis
^^^^^^^^^^^^^^^^^

Liveness analysis is a standard dataflow optimization, implemented as follows.
For each node (basic block), its live-out set is computed as the union of the
live-in sets of its successor nodes. Then the node's instructions are processed
in reverse order, updating the live set, until the beginning of the node is
reached, and the node's live-in set is recorded. If this iteration has changed
the node's live-in set, the node's predecessors are marked for reprocessing.
This continues until no more nodes need reprocessing. If nodes are processed
in reverse topological order, the number of iterations over the Cfg is
generally equal to the maximum loop nest depth.
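
The iteration described above is a standard backward dataflow worklist loop,
sketched here in Python. The node representation (``succs``, ``preds``, and
per-instruction ``defs``/``uses`` sets) is an illustrative encoding, not
Subzero's data structures:

```python
def compute_liveness(nodes):
    """Backward liveness. nodes is a list in topological order; each node is
    a dict with 'succs' and 'preds' (lists of node indices) and 'instrs', a
    list of (defs, uses) set pairs. Returns (live_in, live_out) per node."""
    live_in = [set() for _ in nodes]
    live_out = [set() for _ in nodes]
    # Seed the worklist in reverse topological order.
    worklist = list(reversed(range(len(nodes))))
    while worklist:
        n = worklist.pop(0)
        # live-out is the union of the successors' live-in sets.
        out = set()
        for s in nodes[n]["succs"]:
            out |= live_in[s]
        live_out[n] = out
        # Walk the instructions in reverse, updating the live set.
        live = set(out)
        for defs, uses in reversed(nodes[n]["instrs"]):
            live -= defs  # a definition ends liveness (going backward)
            live |= uses  # a use begins liveness
        if live != live_in[n]:
            live_in[n] = live
            # The live-in set changed: reprocess the predecessors.
            for p in nodes[n]["preds"]:
                if p not in worklist:
                    worklist.append(p)
    return live_in, live_out

# Two-node example: node 0 defines x, node 1 uses it.
nodes = [
    {"succs": [1], "preds": [], "instrs": [({"x"}, set())]},
    {"succs": [], "preds": [0], "instrs": [(set(), {"x"})]},
]
live_in, live_out = compute_liveness(nodes)
print(live_out[0], live_in[1])  # {'x'} {'x'}
```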

To implement this, each node records its live-in and live-out sets, initialized
to the empty set. Each instruction records which of its Variables' live ranges
end in that instruction, initialized to the empty set. A side effect of
liveness analysis is dead instruction elimination. Each instruction can be
marked as tentatively dead, and after the algorithm converges, the tentatively
dead instructions are permanently deleted.

Optionally, after this liveness analysis completes, we can do live range
construction, in which we calculate the live range of each variable in terms
of instruction numbers. A live range is represented as a union of segments,
where the segment endpoints are instruction numbers. Instruction numbers are
required to be unique across the Cfg, and monotonically increasing within a
basic block. As a union of segments, live ranges can contain "gaps" and are
therefore precise. Because of SSA properties, a variable's live range can
start at most once in a basic block, and can end at most once in a basic
block. Liveness analysis keeps track of which variable/instruction tuples
begin live ranges and end live ranges, and combined with live-in and live-out
sets, we can efficiently build up the live ranges of all variables across all
basic blocks.
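
A union-of-segments live range can be modeled with a small helper. This is an
illustrative sketch; the half-open ``[start, end)`` segment convention is an
assumption, not necessarily Subzero's:

```python
class LiveRange:
    """Union of [start, end) segments keyed by instruction number."""
    def __init__(self):
        self.segments = []  # kept sorted and non-overlapping

    def add_segment(self, start, end):
        self.segments.append((start, end))
        self.segments.sort()
        merged = [self.segments[0]]
        for s, e in self.segments[1:]:
            ls, le = merged[-1]
            if s <= le:                # adjacent or overlapping: merge
                merged[-1] = (ls, max(le, e))
            else:
                merged.append((s, e))  # a "gap" remains between segments
        self.segments = merged

    def overlaps(self, other):
        # Two half-open segments overlap iff each starts before the other ends.
        return any(s1 < e2 and s2 < e1
                   for s1, e1 in self.segments
                   for s2, e2 in other.segments)

a = LiveRange()
a.add_segment(10, 20)
a.add_segment(30, 40)  # the gap between 20 and 30 keeps the range precise
b = LiveRange()
b.add_segment(22, 28)  # fits entirely inside a's gap
print(a.overlaps(b))  # False
```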

A lot of care is taken to make liveness analysis fast and efficient. Because
of the lowering strategy, the number of variables is generally proportional to
the number of instructions, leading to an O(N^2) algorithm if implemented
naively. To improve things based on sparsity, we note that most variables are
"local", referenced in at most one basic block (in contrast to the "global"
variables with multi-block usage), and therefore cannot be live across basic
blocks. Therefore, the live-in and live-out sets, typically represented as bit
vectors, can be limited to the set of global variables, and the intra-block
liveness bit vector can be compacted to hold the global variables plus the
local variables for that block.

Register allocation
^^^^^^^^^^^^^^^^^^^

Subzero implements a simple linear-scan register allocator, based on the
allocator described by Hanspeter Mössenböck and Michael Pfeiffer in `Linear
Scan Register Allocation in the Context of SSA Form and Register Constraints
<ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe02.PDF>`_. This allocator has
several nice features:

- Live ranges are represented as unions of segments, as described above,
  rather than a single start/end tuple.

- It allows pre-coloring of variables with specific physical registers.

- It applies equally well to pre-lowered Phi instructions.

The paper suggests an approach of aggressively coalescing variables across Phi
instructions (i.e., trying to force Phi source and destination variables to
have the same register assignment), but we reject that in favor of the more
natural preference mechanism described below.
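
For orientation, here is the textbook single-segment linear-scan loop in
Python. It is a deliberate simplification of the Mössenböck-Pfeiffer allocator
described above: it omits segmented live ranges, pre-coloring, weights, and
spill heuristics, and the data shapes are illustrative:

```python
def linear_scan(ranges, registers):
    """Textbook linear scan over single-segment live ranges.
    ranges: dict name -> (start, end); returns dict name -> register, with
    None meaning the variable is spilled to the stack."""
    alloc, active, free = {}, [], list(registers)
    # Visit live ranges in order of increasing start point.
    for name, (start, end) in sorted(ranges.items(), key=lambda kv: kv[1][0]):
        # Expire active ranges that ended before this one starts,
        # returning their registers to the free pool.
        for other in [o for o in active if ranges[o][1] <= start]:
            active.remove(other)
            free.append(alloc[other])
        if free:
            alloc[name] = free.pop()
            active.append(name)
        else:
            alloc[name] = None  # no register available: stack-allocate
    return alloc

ranges = {"a": (0, 4), "b": (1, 6), "c": (5, 8)}
print(linear_scan(ranges, ["eax", "ecx"]))
# {'a': 'ecx', 'b': 'eax', 'c': 'ecx'}  -- c reuses a's expired register
```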

We enhance the algorithm in the paper with automatic inference of register
preference, and with the ability to let overlapping live ranges safely share
the same register in certain circumstances. If we are considering register
allocation for variable ``A``, and ``A`` has a single defining instruction
``A=B+C``, then the preferred register for ``A``, if available, would be the
register assigned to ``B`` or ``C``, if any, provided that ``B`` or ``C``'s
live range does not overlap ``A``'s live range. In this way we infer a good
register preference for ``A``.

We allow overlapping live ranges to get the same register in certain cases.
Suppose a high-level instruction like::

    A = unary_op(B)

has been target-lowered like::

    T.inf = B
    A = unary_op(T.inf)

Further, assume that ``B``'s live range continues beyond this instruction
sequence, and that ``B`` has already been assigned some register. Normally, we
might want to infer ``B``'s register as a good candidate for ``T.inf``, but it
turns out that ``T.inf`` and ``B``'s live ranges overlap, requiring them to
have different registers. But ``T.inf`` is just a read-only copy of ``B`` that
is guaranteed to be in a register, so in theory these overlapping live ranges
could safely have the same register. Our implementation allows this overlap as
long as ``T.inf`` is never modified within ``B``'s live range, and ``B`` is
never modified within ``T.inf``'s live range.

Subzero's register allocator can be run in 3 configurations.

- Normal mode. All Variables are considered for register allocation. It
  requires full liveness analysis and live range construction as a
  prerequisite. This is used by ``O2`` lowering.

- Minimal mode. Only infinite-weight or pre-colored Variables are considered.
  All other Variables are stack-allocated. It does not require liveness
  analysis; instead, it quickly scans the instructions and records first
  definitions and last uses of all relevant Variables, using that to construct
  a single-segment live range. Although this includes most of the Variables,
  the live ranges are mostly simple, short, and rarely overlapping, which the
  register allocator handles efficiently. This is used by ``Om1`` lowering.

- Post-phi lowering mode. Advanced phi lowering is done after normal-mode
  register allocation, and may result in new infinite-weight Variables that
  need registers. One would like to just run something like minimal mode to
  assign registers to the new Variables while respecting existing register
  allocation decisions. However, it sometimes happens that there are no free
  registers. In this case, some register needs to be forcibly spilled to the
  stack and temporarily reassigned to the new Variable, and reloaded at the
  end of the new Variable's live range. The register must be one that has no
  explicit references during the Variable's live range. Since Subzero
  currently doesn't track def/use chains (though it does record the CfgNode
  where a Variable is defined), we just do a brute-force search across the
  CfgNode's instruction list for the instruction numbers of interest. This
  situation happens very rarely, so there's little point for now in improving
  its performance.

The basic linear-scan algorithm may, as it proceeds, rescind an early register
allocation decision, leaving that Variable to be stack-allocated. Sometimes it
turns out that the Variable could have been given a different register without
conflict, but by that time it's too late. The literature recognizes this
situation and describes "second-chance bin-packing", which Subzero can do. We
can rerun the register allocator in a mode that respects existing register
allocation decisions, and sometimes it finds new non-conflicting
opportunities. In fact, we can repeatedly run the register allocator until
convergence. Unfortunately, in the current implementation, these subsequent
register allocation passes end up being extremely expensive. This is because
of the treatment of the "unhandled pre-colored" Variable set, which is
normally very small but ends up being quite large on subsequent passes. Its
performance can probably be made acceptable with a better choice of data
structures, but for now this second-chance mechanism is disabled.
| 588 | |
| 589 Basic phi lowering | |
| 590 ^^^^^^^^^^^^^^^^^^ | |
| 591 | |
| 592 The simplest phi lowering strategy works as follows (this is how LLVM ``-O0`` | |
| 593 implements it). Consider this example:: | |
| 594 | |
| 595 L1: | |
| 596 ... | |
| 597 br L3 | |
| 598 L2: | |
| 599 ... | |
| 600 br L3 | |
| 601 L3: | |
| 602 A = phi [B, L1], [C, L2] | |
| 603 X = phi [Y, L1], [Z, L2] | |
| 604 | |
| 605 For each destination of a phi instruction, we can create a temporary and insert | |
| 606 the temporary's assignment at the end of the predecessor block:: | |
| 607 | |
| 608 L1: | |
| 609 ... | |
| 610 A' = B | |
| 611 X' = Y | |
| 612 br L3 | |
| 613 L2: | |
| 614 ... | |
| 615 A' = C | |
| 616 X' = Z | |
| 617 br L3 | |
| 618 L3: | |
| 619 A = A' | |
| 620 X = X' | |
| 621 | |
| 622 This transformation is very simple and reliable. It can be done before target | |
| 623 lowering and register allocation, and it easily avoids the classic lost-copy and | |
| 624 related problems. ``Om1`` lowering uses this strategy. | |
| 625 | |
| 626 However, it has the disadvantage of initializing temporaries even for branches | |
| 627 not taken, though that could be mitigated by splitting non-critical edges and | |
| 628 putting assignments in the edge-split nodes. Another problem is that without | |
| 629 extra machinery, the assignments to ``A``, ``A'``, ``X``, and ``X'`` are given a | |
| 630 specific ordering even though phi semantics are that the assignments are | |
| 631 parallel or unordered. This sometimes imposes false live range overlaps and | |
| 632 leads to poorer register allocation. | |
| 633 | |
| 634 Advanced phi lowering | |
| 635 ^^^^^^^^^^^^^^^^^^^^^ | |
| 636 | |
| 637 ``O2`` lowering defers phi lowering until after register allocation to avoid the | |
| 638 problem of false live range overlaps. It works as follows. We split each | |
| 639 incoming edge and move the (parallel) phi assignments into the split nodes. We | |
| 640 linearize each set of assignments by finding a safe, topological ordering of the | |
| 641 assignments, respecting register assignments as well. For example:: | |
| 642 | |
| 643 A = B | |
| 644 X = Y | |
| 645 | |
| 646 Normally these assignments could be executed in either order, but if ``B`` and | |
| 647 ``X`` are assigned the same physical register, we would want to use the above | |
| 648 ordering. Dependency cycles are broken by introducing a temporary. For | |
| 649 example:: | |
| 650 | |
| 651 A = B | |
| 652 B = A | |
| 653 | |
| 654 Here, a temporary breaks the cycle:: | |
| 655 | |
| 656 t = A | |
| 657 A = B | |
| 658 B = t | |
| 659 | |
| 660 Finally, we use the existing target lowering to lower the assignments in this | |
| 661 basic block, and once that is done for all basic blocks, we run the post-phi | |
| 662 variant of register allocation on the edge-split basic blocks. | |
| 663 | |
| 664 When computing a topological order, we try to first schedule assignments whose | |
| 665 source has a physical register, and last schedule assignments whose destination | |
| 666 has a physical register. This helps reduce register pressure. | |
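The linearization scheme above can be sketched in Python. This is a hypothetical model (variable names as strings, a fixed temporary ``t``), not Subzero's actual C++ implementation, and it ignores physical-register aliasing between distinct variables:

```python
def sequentialize(copies):
    """Order a set of parallel copies; copies is a list of (dest, src)
    pairs with distinct dests.  Cycles are broken with a temporary 't'."""
    pending = {d: s for d, s in copies if d != s}  # drop self-copies
    out = []
    while pending:
        # A copy is safe to emit when no pending copy still reads its dest.
        ready = [d for d in pending if d not in pending.values()]
        if ready:
            d = ready[0]
            out.append((d, pending.pop(d)))
        else:
            # Every dest is still read by some other copy: a cycle.
            # Save one dest in the temporary and redirect its readers.
            d = next(iter(pending))
            out.append(('t', d))
            pending = {x: ('t' if s == d else s) for x, s in pending.items()}
    return out
```

For the ``A = B; B = A`` cycle this produces ``t = A; A = B; B = t``, matching the example above.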
| 667 | |
| 668 X86 address mode inference | |
| 669 ^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| 670 | |
| 671 We try to take advantage of the x86 addressing mode that includes a base | |
| 672 register, an index register, an index register scale amount, and an immediate | |
| 673 offset. We do this through simple pattern matching. Starting with a load or | |
| 674 store instruction where the address is a variable, we initialize the base | |
| 675 register to that variable, and look up the instruction where that variable is | |
| 676 defined. If that is an add instruction of two variables and the index register | |
| 677 hasn't been set, we replace the base and index register with those two | |
| 678 variables. If instead it is an add instruction of a variable and a constant, we | |
| 679 replace the base register with the variable and add the constant to the | |
| 680 immediate offset. | |
| 681 | |
| 682 There are several more patterns that can be matched. This pattern matching | |
| 683 continues on the load or store instruction until no more matches are found. | |
| 684 Because a program typically has few load and store instructions (not to be | |
| 685 confused with instructions that manipulate stack variables), this address mode | |
| 686 inference pass is fast. | |
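The core of the matching loop can be sketched as follows. The IR encoding here (a ``defs`` map from a variable to a tuple describing its defining instruction) is invented for illustration and is not Subzero's real representation; only the two add patterns from the text are handled, and scale matching (from multiplies/shifts) is omitted:

```python
def infer_addr_mode(addr_var, defs):
    """Fold add instructions into an x86 (base, index, scale, offset) tuple.
    defs maps a variable to ('add', var1, var2) or ('addimm', var, const)."""
    base, index, scale, offset = addr_var, None, 1, 0
    changed = True
    while changed:
        changed = False
        inst = defs.get(base)
        if inst is None:
            break
        if inst[0] == 'add' and index is None:
            # base = var1 + var2: use them as base and index registers.
            base, index = inst[1], inst[2]
            changed = True
        elif inst[0] == 'addimm':
            # base = var + const: fold the constant into the offset.
            base = inst[1]
            offset += inst[2]
            changed = True
    return base, index, scale, offset
```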
| 687 | |
| 688 X86 read-modify-write inference | |
| 689 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| 690 | |
| 691 A reasonably common bitcode pattern is a non-atomic update of a memory | |
| 692 location:: | |
| 693 | |
| 694 x = load addr | |
| 695 y = add x, 1 | |
| 696 store y, addr | |
| 697 | |
| 698 On x86, with good register allocation, the Subzero passes described above | |
| 699 generate code of only this quality:: | |
| 700 | |
| 701 mov eax, [ebx] | |
| 702 add eax, 1 | |
| 703 mov [ebx], eax | |
| 704 | |
| 705 However, x86 allows for this kind of code:: | |
| 706 | |
| 707 add [ebx], 1 | |
| 708 | |
| 709 which requires fewer instructions, but perhaps more importantly, requires fewer | |
| 710 physical registers. | |
| 711 | |
| 712 It's also important to note that this transformation only makes sense if the | |
| 713 store instruction ends ``x``'s live range. | |
| 714 | |
| 715 Subzero's ``O2`` recipe includes an early pass to find read-modify-write (RMW) | |
| 716 opportunities via simple pattern matching. The only problem is that it is run | |
| 717 before liveness analysis, which is needed to determine whether ``x``'s live | |
| 718 range ends after the RMW. Since liveness analysis is one of the most expensive | |
| 719 passes, it's not attractive to run it an extra time just for RMW analysis. | |
| 720 Instead, we essentially generate both the RMW and the non-RMW versions, and then | |
| 721 during lowering, the RMW version deletes itself if it finds ``x`` still live. | |
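A sketch of the pattern match, using an invented tuple IR rather than Subzero's real instruction classes; ``live_after`` stands in for the liveness information that, in Subzero, only becomes available later:

```python
def match_rmw(insts, live_after):
    """Match a load/op/store window against the RMW pattern.
    insts: [('load', x, addr), (op, y, x, operand), ('store', y, addr)].
    Returns a fused pseudo-instruction, or None if the pattern or the
    liveness condition fails."""
    if len(insts) != 3:
        return None
    i0, i1, i2 = insts
    if (i0[0] == 'load' and i2[0] == 'store'
            and i1[0] in ('add', 'sub', 'and', 'or', 'xor')
            and i1[2] == i0[1]           # op reads the loaded value
            and i2[1] == i1[1]           # store writes the op result
            and i2[2] == i0[2]           # same address
            and i0[1] not in live_after  # x's live range ends here
            and i1[1] not in live_after):
        return (i1[0] + '_rmw', i0[2], i1[3])   # e.g. add [addr], operand
    return None
```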
| 722 | |
| 723 X86 compare-branch inference | |
| 724 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| 725 | |
| 726 In the LLVM instruction set, the compare/branch pattern works like this:: | |
| 727 | |
| 728 cond = icmp eq a, b | |
| 729 br cond, target | |
| 730 | |
| 731 The result of the icmp instruction is a single bit, and a conditional branch | |
| 732 tests that bit. By contrast, most target architectures use this pattern:: | |
| 733 | |
| 734 cmp a, b // implicitly sets various bits of FLAGS register | |
| 735 br eq, target // branch on a particular FLAGS bit | |
| 736 | |
| 737 A naive lowering sequence conditionally sets ``cond`` to 0 or 1, then tests | |
| 738 ``cond`` and conditionally branches. Subzero has a pass that identifies | |
| 739 boolean-based operations like this and folds them into a single | |
| 740 compare/branch-like operation. It is set up for more than just cmp/br though. | |
| 741 Boolean producers include icmp (integer compare), fcmp (floating-point compare), | |
| 742 and trunc (integer truncation when the destination has bool type). Boolean | |
| 743 consumers include branch, select (the ternary operator from the C language), and | |
| 744 sign-extend and zero-extend when the source has bool type. | |
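For the cmp/br case specifically, the folding condition can be sketched like this (invented tuple IR; ``uses`` counts the consumers of each bool result -- the assumption here is that folding is only applied when the branch is the sole consumer):

```python
def fold_cmp_br(icmp, br, uses):
    """Fold an icmp producer into its branch consumer.
    icmp = ('icmp', dest, cond, a, b); br = ('br', flag, target)."""
    _, dest, cond, a, b = icmp
    _, flag, target = br
    if flag == dest and uses[dest] == 1:
        # Emit the target-style pattern: cmp a, b; br cond, target.
        return ('cmp_br', cond, a, b, target)
    return None
```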
| 745 | |
| 746 Sandboxing | |
| 747 ^^^^^^^^^^ | |
| 748 | |
| 749 This is not actually a separate pass, but is integrated into lowering and | |
| 750 assembly. | |
| 751 | |
| 752 - Indirect branches, including the ret instruction, are masked to a bundle | |
| 753 boundary and bundle-locked. | |
| 754 | |
| 755 - Call instructions are aligned to the end of the bundle so that the return | |
| 756 address is bundle-aligned. | |
| 757 | |
| 758 - Indirect branch targets, including function entry and targets in a switch | |
| 759 statement jump table, are bundle-aligned. | |
| 760 | |
| 761 - The intrinsic for reading the thread pointer is inlined appropriately. | |
| 762 | |
| 763 - For x86-64, non-stack memory accesses are with respect to the reserved sandbox | |
| 764 base register. We reduce the aggressiveness of address mode inference to | |
| 765 leave room for the sandbox base register during lowering. There are no memory | |
| 766 sandboxing changes for x86-32. | |
| 767 | |
| 768 Code emission | |
| 769 ------------- | |
| 770 | |
| 771 Subzero's integrated assembler is derived from Dart's assembler code. There is | |
| 772 a pass that iterates through the low-level ICE instructions and invokes the | |
| 773 relevant assembler functions. Placeholders are added for later fixup of branch | |
| 774 target offsets. (Backward branches use short offsets if possible; forward | |
| 775 branches always use long offsets.) The assembler emits into a staging buffer. | |
| 776 Once emission into the staging buffer for a function is complete, the data is | |
| 777 emitted to the output file as an ELF object file, and metadata such as | |
| 778 relocations, symbol table, and string table, are accumulated for emission at the | |
| 779 end. Global data initializers are emitted similarly. A key point is that at | |
| 780 this point, the staging buffer can be deallocated, and only a minimum of data | |
| 781 needs to be held until the end. | |
| 782 | |
| 783 As a debugging alternative, Subzero can emit textual assembly code which can | |
| 784 then be run through an external assembler. This is of course super slow, but | |
| 785 quite valuable when bringing up a new target. | |
| 786 | |
| 787 As another debugging option, the staging buffer can be emitted as textual | |
| 788 assembly, primarily in the form of ".byte" lines. This allows the assembler to | |
| 789 be tested separately from the ELF related code. | |
| 790 | |
| 791 Memory management | |
| 792 ----------------- | |
| 793 | |
| 794 Where possible, we allocate from a CfgLocalAllocator which derives from LLVM's | |
| 795 BumpPtrAllocator. This is an arena-style allocator where objects allocated from | |
| 796 the arena are never actually freed; instead, when the Cfg translation completes | |
| 797 and the Cfg is deleted, the entire arena memory is reclaimed at once. | |
| 798 | |
| 799 Instructions are probably the most heavily allocated complex class in Subzero. | |
| 800 We represent an instruction list as an intrusive doubly linked list, allocate | |
| 801 all instructions from the CfgLocalAllocator, and we make sure each instruction | |
| 802 subclass is basically POD with a trivial destructor. This way, when the Cfg is | |
| 803 finished, we don't need to individually deallocate every instruction. We do | |
| 804 the same for Variables, probably the second most heavily allocated complex class. | |
| 805 | |
| 806 There are some situations where passes need to use some STL container class. | |
| 807 Subzero has a way of using the CfgLocalAllocator as the container allocator if | |
| 808 this is needed. | |
| 809 | |
| 810 Multithreaded translation | |
| 811 ------------------------- | |
| 812 | |
| 813 Subzero is designed to be able to translate functions in parallel. With the | |
| 814 ``-threads=N`` command-line option, there is a 3-stage producer-consumer | |
| 815 pipeline: | |
| 816 | |
| 817 - A single thread parses the pexe file and produces a sequence of work units. A | |
| 818 work unit can be either a fully constructed Cfg, or a set of global | |
| 819 initializers. The work unit includes its sequence number denoting its parse | |
| 820 order. Each work unit is added to the translation queue. | |
| 821 | |
| 822 - There are N translation threads that draw work units from the translation | |
| 823 queue and lower them into assembler buffers. Each assembler buffer is added | |
| 824 to the emitter queue, tagged with its sequence number. The Cfg and its | |
| 825 CfgLocalAllocator are disposed of at this point. | |
| 826 | |
| 827 - A single thread draws assembler buffers from the emitter queue and appends to | |
| 828 the output file. It uses the sequence numbers to reintegrate the assembler | |
| 829 buffers according to the original parse order, such that output order is | |
| 830 always deterministic. | |
| 831 | |
| 832 Note that ``-threads=N`` actually creates N+2 threads. For the special case of | |
| 833 N=0, execution is entirely sequential -- the same thread parses, translates, and | |
| 834 emits, one function at a time. This is useful for performance measurements. | |
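The emitter thread's reordering logic can be sketched independently of the threading: assembler buffers arrive in whatever order translation finishes, and a small reorder buffer holds early arrivals until their sequence number comes up (hypothetical helper, not Subzero's actual code):

```python
def emit_in_order(arrivals):
    """arrivals: iterable of (seq, buffer) pairs in completion order.
    Returns the buffers in parse (sequence-number) order."""
    out, held, next_seq = [], {}, 0
    for seq, buf in arrivals:
        held[seq] = buf
        # Drain every buffer that is now contiguous with the output.
        while next_seq in held:
            out.append(held.pop(next_seq))
            next_seq += 1
    return out
```

Because output depends only on sequence numbers, the emitted file is deterministic regardless of how translation threads interleave.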
| 835 | |
| 836 Currently, parsing takes about 11% of total sequential time. If translation | |
| 837 ever becomes fast and scalable enough that parsing becomes a | |
| 838 bottleneck, it should be possible to make parsing multithreaded as well. | |
| 839 | |
| 840 Internally, all shared, mutable data is held in the GlobalContext object, and | |
| 841 access to each field is guarded by a mutex. | |
| 842 | |
| 843 Security | |
| 844 -------- | |
| 845 | |
| 846 Sandboxed translator | |
| 847 ^^^^^^^^^^^^^^^^^^^^ | |
| 848 | |
| 849 When running inside the browser, the Subzero translator executes as sandboxed, | |
| 850 untrusted code that is initially checked by the validator, just like the | |
| 851 LLVM-based pnacl-llc translator. As such, the Subzero binary should be no more | |
| 852 or less secure than the translator it replaces, from the point of view of the | |
| 853 Chrome sandbox. | |
| 854 | |
| 855 Code diversification | |
| 856 ^^^^^^^^^^^^^^^^^^^^ | |
| 857 | |
| 858 Return-oriented programming (ROP) is a now-common technique for starting with | |
| 859 e.g. a known buffer overflow situation and launching it into a deeper exploit. | |
| 860 The attacker scans the executable looking for ROP gadgets, which are short | |
| 861 sequences of code that happen to load known values into known registers and then | |
| 862 return. If the attacker can manage to overwrite parts of the stack, he can | |
| 863 overwrite it with carefully chosen return addresses such that certain ROP | |
| 864 gadgets are effectively chained together to set up the register state as | |
| 865 desired, finally returning to some code that manages to do something nasty based | |
| 866 on those register values. | |
| 867 | |
| 868 If there is a popular pexe with a large install base, the attacker could run | |
| 869 Subzero on it and scan the executable for suitable ROP gadgets to use as part of | |
| 870 a potential exploit. Note that if the trusted validator is working correctly, | |
| 871 these ROP gadgets are limited to starting at a bundle boundary and cannot use | |
| 872 the trick of finding a gadget that happens to begin inside another instruction. | |
| 873 All the same, gadgets with these constraints still exist and the attacker has | |
| 874 access to them. | |
| 875 | |
| 876 Subzero can mitigate these attacks to some degree through code diversification. | |
| 877 Specifically, we can apply some randomness to the code generation that makes ROP | |
| 878 gadgets less predictable. This randomness can have some compile-time cost, and | |
| 879 it can affect the code quality; and some diversifications may be more effective | |
| 880 than others. | |
| 881 | |
| 882 To evaluate diversification effectiveness, we use a third-party ROP gadget | |
| 883 finder and limit its results to bundle-aligned addresses. For a given | |
| 884 diversification technique, we run it with a number of different random seeds, | |
| 885 find ROP gadgets for each version, and determine how persistent each ROP gadget | |
| 886 is across the different versions. A gadget is persistent if the same gadget is | |
| 887 found at the same code address. The best diversifications are ones with low | |
| 888 gadget persistence rates. | |
| 889 | |
| 890 Subzero implements 7 different diversification techniques. Below is a | |
| 891 discussion of each technique, its effectiveness, and its cost. | |
| 892 | |
| 893 Constant blinding | |
| 894 ~~~~~~~~~~~~~~~~~ | |
| 895 | |
| 896 Here, we prevent attackers from controlling large immediates in the text | |
| 897 (executable) section. A random cookie is generated for each function, and if | |
| 898 the constant exceeds a specified threshold, the constant is obfuscated with the | |
| 899 cookie and equivalent code is generated. For example, instead of this x86 | |
| 900 instruction:: | |
| 901 | |
| 902 mov $0x11223344, <%Reg/Mem> | |
| 903 | |
| 904 the following code might be generated:: | |
| 905 | |
| 906 mov $(0x11223344+Cookie), %temp | |
| 907 lea -Cookie(%temp), %temp | |
| 908 mov %temp, <%Reg/Mem> | |
| 909 | |
| 910 The ``lea`` instruction is used rather than e.g. add/sub or xor, to prevent | |
| 911 unintended effects on the flags register. | |
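The arithmetic of the transformation is easy to model. This sketch uses invented helper names and a simplified instruction encoding; the point is only that the blinded sequence computes the original constant without it ever appearing as an immediate:

```python
MASK = 0xFFFFFFFF  # 32-bit wraparound assumed

def blind(constant, cookie, threshold=0xFFFF):
    """Return a (simplified) instruction list that materializes constant.
    Small constants are left alone; large ones are obfuscated."""
    if constant <= threshold:
        return [('mov_imm', constant)]
    return [('mov_imm', (constant + cookie) & MASK),  # mov $(C+cookie), %temp
            ('lea_sub', cookie)]                      # lea -cookie(%temp), %temp

def evaluate(insts):
    """Interpret the simplified instruction list to check equivalence."""
    val = 0
    for op, imm in insts:
        val = imm if op == 'mov_imm' else (val - imm) & MASK
    return val
```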
| 912 | |
| 913 This transformation has almost no effect on translation time, and about 1% | |
| 914 impact on code quality, depending on the threshold chosen. It does little to | |
| 915 reduce gadget persistence, but it does remove a lot of potential opportunities | |
| 916 to construct intra-instruction ROP gadgets (which an attacker could use if a | |
| 917 validator bug were discovered). | |
| 918 | |
| 919 Constant pooling | |
| 920 ~~~~~~~~~~~~~~~~ | |
| 921 | |
| 922 This is similar to constant blinding, in that large immediates are removed from | |
| 923 the text section. In this case, each unique constant above the threshold is | |
| 924 stored in a read-only data section and the constant is accessed via a memory | |
| 925 load. For the above example, the following code might be generated:: | |
| 926 | |
| 927 mov $Label$1, %temp | |
| 928 mov %temp, <%Reg/Mem> | |
| 929 | |
| 930 This has a similarly small impact on translation time and ROP gadget | |
| 931 persistence, and a smaller (better) impact on code quality. This is because it | |
| 932 uses fewer instructions, and in some cases good register allocation leads to no | |
| 933 increase in instruction count. Note that this still gives an attacker some | |
| 934 limited amount of control over some text section values, unless we randomize the | |
| 935 constant pool layout. | |
| 936 | |
| 937 Static data reordering | |
| 938 ~~~~~~~~~~~~~~~~~~~~~~ | |
| 939 | |
| 940 This transformation limits the attacker's ability to control bits in global data | |
| 941 address references. It simply permutes the order in memory of global variables | |
| 942 and internal constant pool entries. For the constant pool, we only permute | |
| 943 within a type (i.e., emit a randomized list of ints, followed by a randomized | |
| 944 list of floats, etc.) to maintain good packing in the face of alignment | |
| 945 constraints. | |
| 946 | |
| 947 As might be expected, this has no impact on code quality, translation time, or | |
| 948 ROP gadget persistence (though as above, it limits opportunities for | |
| 949 intra-instruction ROP gadgets with a broken validator). | |
| 950 | |
| 951 Basic block reordering | |
| 952 ~~~~~~~~~~~~~~~~~~~~~~ | |
| 953 | |
| 954 Here, we randomize the order of basic blocks within a function, with the | |
| 955 constraint that we still want to maintain a topological order as much as | |
| 956 possible, to avoid making the code too branchy. | |
| 957 | |
| 958 This has no impact on code quality, and about 1% impact on translation time, due | |
| 959 to a separate pass to recompute layout. It ends up having a huge effect on ROP | |
| 960 gadget persistence, tied for best with nop insertion, reducing ROP gadget | |
| 961 persistence to less than 5%. | |
| 962 | |
| 963 Function reordering | |
| 964 ~~~~~~~~~~~~~~~~~~~ | |
| 965 | |
| 966 Here, we permute the order that functions are emitted, primarily to shift ROP | |
| 967 gadgets around to less predictable locations. It may also change call address | |
| 968 offsets in case the attacker was trying to control that offset in the code. | |
| 969 | |
| 970 To control latency and memory footprint, we don't arbitrarily permute functions. | |
| 971 Instead, for some relatively small value of N, we queue up N assembler buffers, | |
| 972 and then emit the N functions in random order, and repeat until all functions | |
| 973 are emitted. | |
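The windowed shuffle can be sketched as follows (hypothetical helper; Subzero operates on assembler buffers rather than the opaque items used here). Memory stays bounded because at most N buffers are held at a time:

```python
import random

def reorder_functions(functions, n, seed):
    """Emit functions in windows of size n, shuffling within each window."""
    rng = random.Random(seed)
    out = []
    for i in range(0, len(functions), n):
        window = functions[i:i + n]
        rng.shuffle(window)  # permute only within the current window
        out.extend(window)
    return out
```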
| 974 | |
| 975 Function reordering has no impact on translation time or code quality. | |
| 976 Measurements indicate that it reduces ROP gadget persistence to about 15%. | |
| 977 | |
| 978 Nop insertion | |
| 979 ~~~~~~~~~~~~~ | |
| 980 | |
| 981 This diversification randomly adds a nop instruction after each regular | |
| 982 instruction, with some probability. Nop instructions of different lengths may | |
| 983 be selected. Nop instructions are never added inside a bundle_lock region. | |
| 984 Note that when sandboxing is enabled, nop instructions are already being added | |
| 985 for bundle alignment, so the diversification nop instructions may simply be | |
| 986 taking the place of alignment nop instructions, though distributed differently | |
| 987 through the bundle. | |
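A sketch of the insertion policy, using string markers for instructions and ``bundle_lock``/``bundle_unlock`` pseudo-instructions; the real pass works on Subzero IR instructions:

```python
import random

def insert_nops(insts, p, seed, nop_lengths=(1, 2, 3)):
    """After each instruction outside a bundle_lock region, insert a nop
    of random length with probability p."""
    rng = random.Random(seed)
    out, locked = [], False
    for inst in insts:
        out.append(inst)
        if inst == 'bundle_lock':
            locked = True
        elif inst == 'bundle_unlock':
            locked = False
        if (not locked and inst not in ('bundle_lock', 'bundle_unlock')
                and rng.random() < p):
            out.append('nop%d' % rng.choice(nop_lengths))
    return out
```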
| 988 | |
| 989 In Subzero's current implementation, nop insertion adds 3-5% to the | |
| 990 translation time, but this is probably because it is implemented as a separate | |
| 991 pass that adds actual nop instructions to the IR. The overhead would probably | |
| 992 be a lot less if it were integrated into the assembler pass. The code quality | |
| 993 is also reduced by 3-5%, making nop insertion the most expensive of the | |
| 994 diversification techniques. | |
| 995 | |
| 996 Nop insertion is very effective in reducing ROP gadget persistence, at the same | |
| 997 level as basic block randomization. But given nop insertion's impact on | |
| 998 translation time and code quality, one would most likely prefer to use basic | |
| 999 block randomization instead. | |
| 1000 | |
| 1001 Register allocation randomization | |
| 1002 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
| 1003 | |
| 1004 In this diversification, the register allocator tries to make different but | |
| 1005 mostly functionally equivalent choices, while maintaining stable code quality. | |
| 1006 | |
| 1007 A naive approach would be the following. Whenever the allocator has more than | |
| 1008 one choice for assigning a register, choose randomly among those options. And | |
| 1009 whenever there are no registers available and there is a tie for the | |
| 1010 lowest-weight variable, randomly select one of the lowest-weight variables to | |
| 1011 evict. Because of the one-pass nature of the linear-scan algorithm, this | |
| 1012 randomization strategy can have a large impact on which variables are ultimately | |
| 1013 assigned registers, with a corresponding large impact on code quality. | |
| 1014 | |
| 1015 Instead, we choose an approach that tries to keep code quality stable regardless | |
| 1016 of the random seed. We partition the set of physical registers into equivalence | |
| 1017 classes. If a register is pre-colored in the function (i.e., referenced | |
| 1018 explicitly by name), it forms its own equivalence class. The remaining | |
| 1019 registers are partitioned according to their combination of attributes such as | |
| 1020 integer versus floating-point, 8-bit versus 32-bit, caller-save versus | |
| 1021 callee-saved, etc. Each equivalence class is randomly permuted, and the | |
| 1022 complete permutation is applied to the final register assignments. | |
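The equivalence-class permutation can be sketched like this (invented attribute encoding; Subzero's real register attributes are richer than the single key used here):

```python
import random

def make_permutation(registers, attrs, precolored, seed):
    """Build a register->register permutation that only permutes within
    equivalence classes.  Pre-colored registers form singleton classes."""
    rng = random.Random(seed)
    classes = {}
    for r in registers:
        key = ('precolored', r) if r in precolored else attrs[r]
        classes.setdefault(key, []).append(r)
    perm = {}
    for members in classes.values():
        shuffled = members[:]
        rng.shuffle(shuffled)  # permute within the class only
        perm.update(zip(members, shuffled))
    return perm
```

Applying ``perm`` to the final register assignments changes which physical register each variable lands in without changing which variables got registers, which is why code quality stays stable.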
| 1023 | |
| 1024 Register randomization reduces ROP gadget persistence to about 10% on average, | |
| 1025 though there tends to be fairly high variance across functions and applications. | |
| 1026 This probably has to do with the set of restrictions in the x86-32 instruction | |
| 1027 set and ABI, such as few general-purpose registers, eax used for return values, | |
| 1028 edx used for division, ecx used for shifting, etc. As intended, register | |
| 1029 randomization has no impact on code quality, and a slight (0.5%) impact on | |
| 1030 translation time due to an extra scan over the variables to identify pre-colored | |
| 1031 registers. | |
| 1032 | |
| 1033 Fuzzing | |
| 1034 ^^^^^^^ | |
| 1035 | |
| 1036 We have started fuzzing the pexe files input to Subzero. Most of the problems | |
| 1037 found so far are ones most appropriately handled in the parser. However, there | |
| 1038 have been a couple that have identified problems in the lowering, or otherwise | |
| 1039 inappropriately triggered assertion failures and fatal errors. We continue to | |
| 1040 dig into this area. | |
| 1041 | |
| 1042 Chrome integration | |
| 1043 ------------------ | |
| 1044 | |
| 1045 Currently Subzero is available in Chrome for the x86-32 architecture, but under | |
| 1046 a flag. When the flag is enabled, Subzero is used when the pexe developer's | |
| 1047 manifest specifies the ``O0`` optimization level. | |
| 1048 | |
| 1049 The next step is to remove the flag, i.e. invoke Subzero as the only translator | |
| 1050 for ``O0``-specified manifest files. | |
| 1051 | |
| 1052 Ultimately, Subzero might produce code rivaling LLVM ``O2`` quality, in which | |
| 1053 case Subzero could be used for all PNaCl translation. | |
| 1054 | |
| 1055 Command line options | |
| 1056 -------------------- | |
| 1057 | |
| 1058 Subzero has a number of command-line options for debugging and diagnostics. | |
| 1059 Among the more interesting are the following. | |
| 1060 | |
| 1061 - Using the ``-verbose`` flag, Subzero will dump the Cfg, or produce other | |
| 1062 diagnostic output, with various levels of detail after each pass. Instruction | |
| 1063 numbers can be printed or suppressed. Deleted instructions can be printed or | |
| 1064 suppressed (they are retained in the instruction list, as discussed earlier, | |
| 1065 because they can help explain how lower-level instructions originated). | |
| 1066 Liveness information can be printed when available. Details of register | |
| 1067 allocation can be printed as register allocator decisions are made. And more. | |
| 1068 | |
| 1069 - Running Subzero with any level of verbosity produces an enormous amount of | |
| 1070 output. When debugging a single function, verbose output can be suppressed | |
| 1071 except for a particular function. The ``-verbose-focus`` flag suppresses | |
| 1072 verbose output except for the specified function. | |
| 1073 | |
| 1074 - Subzero has a ``-timing`` option that prints a breakdown of pass-level timing | |
| 1075 at exit. Timing markers can be placed in the Subzero source code to demarcate | |
| 1076 logical operations or passes of interest. Basic timing information plus | |
| 1077 call-stack type timing information is printed at the end. | |
| 1078 | |
| 1079 - Along with ``-timing``, the user can instead get a report on the overall | |
| 1080 translation time for each function, to help focus on timing outliers. Also, | |
| 1081 ``-timing-focus`` limits the ``-timing`` reporting to a single function, | |
| 1082 instead of aggregating pass timing across all functions. | |
| 1083 | |
| 1084 - The ``-szstats`` option reports various statistics on each function, such as | |
| 1085 stack frame size, static instruction count, etc. It may be helpful to track | |
| 1086 these stats over time as Subzero is improved, as an approximate measure of | |
| 1087 code quality. | |
| 1088 | |
| 1089 - The flag ``-asm-verbose``, in conjunction with emitting textual assembly | |
| 1090 output, annotates the assembly output with register-focused liveness | |
| 1091 information. In particular, each basic block is annotated with which | |
| 1092 registers are live-in and live-out, and each instruction is annotated with | |
| 1093 which registers' and stack locations' live ranges end at that instruction. | |
| 1094 This is really useful when studying the generated code to find opportunities | |
| 1095 for code quality improvements. | |
| 1096 | |
| 1097 Testing and debugging | |
| 1098 --------------------- | |
| 1099 | |
| 1100 LLVM lit tests | |
| 1101 ^^^^^^^^^^^^^^ | |
| 1102 | |
| 1103 For basic testing, Subzero uses LLVM's lit framework for running tests. We have | |
| 1104 a suite of hundreds of small functions where we test for particular assembly | |
| 1105 code patterns across different target architectures. | |
| 1106 | |
| 1107 Cross tests | |
| 1108 ^^^^^^^^^^^ | |
| 1109 | |
| 1110 Unfortunately, the lit tests don't do a great job of precisely testing the | |
| 1111 correctness of the output. Much better are the cross tests, which are execution | |
| 1112 tests that compare Subzero and llc translated bitcode across a wide variety of | |
| 1113 interesting inputs. Each cross test consists of a set of C, C++, and/or | |
| 1114 low-level bitcode files. The C and C++ source files are compiled down to | |
| 1115 bitcode. The bitcode files are translated by llc and also by Subzero. Subzero | |
| 1116 mangles global symbol names with a special prefix to avoid duplicate symbol | |
| 1117 errors. A driver program invokes both versions on a large set of interesting | |
| 1118 inputs, and reports when the Subzero and llc results differ. Cross tests turn | |
| 1119 out to be an excellent way of testing the basic lowering patterns, but they are | |
| 1120 less useful for testing more global things like liveness analysis and register | |
| 1121 allocation. | |
| 1122 | |
| 1123 Bisection debugging | |
| 1124 ^^^^^^^^^^^^^^^^^^^ | |
| 1125 | |
| 1126 Sometimes with a new application, Subzero will end up producing incorrect code | |
| 1127 that either crashes at runtime or otherwise produces the wrong results. When | |
| 1128 this happens, we need to narrow it down to a single function (or small set of | |
| 1129 functions) that yield incorrect behavior. For this, we have a bisection | |
| 1130 debugging framework. Here, we initially translate the entire application once | |
| 1131 with Subzero and once with llc. We then use objdump to selectively weaken | |
| 1132 symbols based on a whitelist or blacklist provided on the command line. The two | |
| 1133 object files can then be linked together without link errors, with the desired | |
| 1134 version of each method "winning". Then the binary is tested, and bisection | |
| 1135 proceeds based on whether the binary produces correct output. | |
| 1136 | |
| 1137 When the bisection completes, we are left with a minimal set of | |
| 1138 Subzero-translated functions that cause the failure. Usually it is a single | |
| 1139 function, though sometimes it might require a combination of several functions | |
| 1140 to cause a failure; this may be due to an incorrect call ABI, for example. | |
| 1141 However, Murphy's Law implies that the single failing function is enormous and | |
| 1142 impractical to debug. In that case, we can restart the bisection, explicitly | |
| 1143 blacklisting the enormous function, and try to find another candidate to debug. | |
| 1144 (Future work is to automate this to find all minimal sets of functions, so that | |
| 1145 debugging can focus on the simplest example.) | |
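The search itself is ordinary delta debugging. A minimal sketch, assuming a ``test`` callback that links with the given whitelist of Subzero-translated functions and reports whether the failure reproduces (this naive halving misses failures that require functions from both halves, which is why the manual restart described above is sometimes needed):

```python
def bisect_functions(funcs, test):
    """Narrow a failing set of functions to a smaller failing subset."""
    assert test(funcs)  # the full Subzero-translated set must fail
    changed = True
    while changed and len(funcs) > 1:
        changed = False
        half = len(funcs) // 2
        for part in (funcs[:half], funcs[half:]):
            if test(part):
                funcs = part  # this half alone reproduces the failure
                changed = True
                break
    return funcs
```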
| 1146 | |
| 1147 Fuzz testing | |
| 1148 ^^^^^^^^^^^^ | |
| 1149 | |
| 1150 As described above, we try to find internal Subzero bugs using fuzz testing | |
| 1151 techniques. | |
| 1152 | |
| 1153 Sanitizers | |
| 1154 ^^^^^^^^^^ | |
| 1155 | |
| 1156 We regularly run Subzero with asan and tsan. So far, multithreading has been | |
| 1157 simple enough that tsan hasn't found any bugs, but asan has found at least one | |
| 1158 memory leak which was subsequently fixed. | |
| 1159 | |
| 1160 Current status | |
| 1161 ============== | |
| 1162 | |
| 1163 Target architectures | |
| 1164 -------------------- | |
| 1165 | |
| 1166 Subzero is currently more or less complete for the x86-32 target. It has been | |
| 1167 refactored and extended to handle x86-64 as well, and that is mostly complete at | |
| 1168 this point. | |
| 1169 | |
| 1170 ARM32 work is in progress and is mostly complete. It currently lacks the | |
| 1171 testing level of x86, at least in part because Subzero's register allocator | |
| 1172 needs modifications to handle ARM's aliasing of floating point and vector | |
| 1173 registers. Specifically, a 64-bit register is actually a gang of two | |
| 1174 consecutive and aligned 32-bit registers, and a 128-bit register is a gang of | |
| 1175 four consecutive and aligned 32-bit registers. ARM64 work has not started; the | |
| 1175 consecutive and aligned 32-bit registers. ARM64 work has not started; the whole | |
| 1176 ARM64 sandbox model would need to be defined. | |
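Using the standard ARM VFP numbering (an illustration of the aliasing described above, not code from Subzero's source), the overlap can be sketched as follows; note that on real hardware only D0-D15 have S-register aliases:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Sketch of ARM32 VFP register aliasing: a 64-bit D register is a gang of
// two consecutive aligned 32-bit S registers, and a 128-bit Q register is
// a gang of four.
std::array<uint32_t, 2> sAliasesOfD(uint32_t D) {
  return {2 * D, 2 * D + 1}; // Dn overlaps S(2n) and S(2n+1)
}

std::array<uint32_t, 4> sAliasesOfQ(uint32_t Q) {
  return {4 * Q, 4 * Q + 1, 4 * Q + 2, 4 * Q + 3}; // Qn overlaps 4 S regs
}

// Consequence for register allocation: assigning S(k) also (partially)
// occupies the containing D and Q registers, so the allocator must treat
// them as conflicting.
uint32_t dContaining(uint32_t S) { return S / 2; }
uint32_t qContaining(uint32_t S) { return S / 4; }
```

This is the conflict structure the register allocator has to model: live ranges in S(2), S(3), D(1), and Q(0) all mutually interfere even though they name three different register classes.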
| 1177 | |
| 1178 An external contributor is adding MIPS support, in large part by following the | |
| 1179 ARM work. | |
| 1180 | |
| 1181 Translator performance | |
| 1182 ---------------------- | |
| 1183 | |
| 1184 Single-threaded translation speed is currently about 5x the llc translation | |
| 1185 speed. For a large pexe, the time breaks down as: | |
| 1186 | |
| 1187 - 11% for parsing and initial IR building | |
| 1188 | |
| 1189 - 4% for emitting to /dev/null | |
| 1190 | |
| 1191 - 27% for liveness analysis (two liveness passes plus live range construction) | |
| 1192 | |
| 1193 - 15% for linear-scan register allocation | |
| 1194 | |
| 1195 - 9% for basic lowering | |
| 1196 | |
| 1197 - 10% for advanced phi lowering | |
| 1198 | |
| 1199 - ~11% for other minor analysis | |
| 1200 | |
| 1201 - ~10% measurement overhead to acquire these numbers | |
| 1202 | |
| 1203 Some improvements could undoubtedly be made, but it will be hard to increase the | |
| 1204 speed to 10x that of llc while keeping acceptable code quality. With ``-Om1`` (lack | |
| 1205 of) optimization, we do actually achieve roughly 10x llc translation speed, but | |
| 1206 code quality drops by a factor of 3. | |
| 1207 | |
| 1208 Code quality | |
| 1209 ------------ | |
| 1210 | |
| 1211 Measured across 16 components of spec2k, Subzero's code quality is uniformly | |
| 1212 better than llc ``-O0`` code quality, and in many cases solidly between llc | |
| 1213 ``-O0`` and ``-O2``. | |
| 1214 | |
| 1215 Translator size | |
| 1216 --------------- | |
| 1217 | |
| 1218 When built in MINIMAL mode, the x86-64 native translator size for the x86-32 | |
| 1219 target is about 700 KB, not including the size of functions referenced in | |
| 1220 dynamically-linked libraries. The sandboxed version of Subzero is a bit over 1 | |
| 1221 MB, and it is statically linked and also includes nop padding for bundling as | |
| 1222 well as indirect branch masking. | |
| 1223 | |
| 1224 Translator memory footprint | |
| 1225 --------------------------- | |
| 1226 | |
| 1227 It's hard to draw firm conclusions about memory footprint, since the footprint | |
| 1228 is at least proportional to the input function size, and there is no real limit | |
| 1229 on the size of functions in the pexe file. | |
| 1230 | |
| 1231 That said, we looked at the memory footprint over time as Subzero translated | |
| 1232 pnacl-llc.pexe, which is the largest pexe (7.2 MB) at our disposal. One of | |
| 1233 LLVM's libraries that Subzero uses can report the current malloc heap usage. | |
| 1234 With single-threaded translation, Subzero tends to hover around 15 MB of memory | |
| 1235 usage. There are a couple of monstrous functions where Subzero grows to around | |
| 1236 100 MB, but then it drops back down after those functions finish translating. | |
| 1237 In contrast, llc grows larger and larger throughout translation, reaching | |
| 1238 several hundred MB by the time it completes. | |
| 1239 | |
| 1240 It's a bit more interesting when we enable multithreaded translation. When | |
| 1241 there are N translation threads, Subzero implements a policy that limits the | |
| 1242 size of the translation queue to N entries: if it is "full" when the parser | |
| 1243 tries to add a new Cfg, the parser blocks until one of the translation threads | |
| 1244 removes a Cfg. This means the number of in-memory Cfgs can (and generally does) | |
| 1245 reach 2*N+1, and so the memory footprint rises in proportion to the number of | |
| 1246 threads. Adding to the pressure is the observation that the monstrous functions | |
| 1247 also take proportionally longer time to translate, so there's a good chance many | |
| 1248 of the monstrous functions will be active at the same time with multithreaded | |
| 1249 translation. As a result, for N=32, Subzero's memory footprint peaks at about | |
| 1250 260 MB, but drops back down as the large functions finish translating. | |
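The queue policy described above can be sketched as a bounded producer/consumer queue. This is a minimal sketch with hypothetical names (``Cfg`` here is an empty stand-in struct; Subzero's real implementation differs in detail): the parser blocks in ``blockingPush`` whenever N Cfgs are already waiting.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <memory>
#include <mutex>

struct Cfg {}; // stand-in for Subzero's per-function IR container

// Sketch of a translation queue capped at MaxSize entries. The parser
// thread blocks in blockingPush() when the queue is full; translation
// worker threads drain it via blockingPop().
class BoundedTranslationQueue {
public:
  explicit BoundedTranslationQueue(size_t MaxSize) : MaxSize(MaxSize) {}

  void blockingPush(std::unique_ptr<Cfg> Func) {
    std::unique_lock<std::mutex> Lock(Mutex);
    // Parser waits here until a worker removes an entry.
    NotFull.wait(Lock, [this] { return Queue.size() < MaxSize; });
    Queue.push_back(std::move(Func));
    NotEmpty.notify_one();
  }

  std::unique_ptr<Cfg> blockingPop() {
    std::unique_lock<std::mutex> Lock(Mutex);
    NotEmpty.wait(Lock, [this] { return !Queue.empty(); });
    std::unique_ptr<Cfg> Func = std::move(Queue.front());
    Queue.pop_front();
    NotFull.notify_one();
    return Func;
  }

private:
  const size_t MaxSize;
  std::mutex Mutex;
  std::condition_variable NotFull, NotEmpty;
  std::deque<std::unique_ptr<Cfg>> Queue;
};
```

With N workers each holding one in-flight Cfg, N more queued, and one being built by the blocked parser, this policy yields the 2*N+1 bound on in-memory Cfgs noted above.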
| 1251 | |
| 1252 If this peak memory size becomes a problem, it might be possible for the parser | |
| 1253 to resequence the functions to try to spread out the larger functions, or to | |
| 1254 throttle the translation queue to prevent too many in-flight large functions. | |
| 1255 | |
| 1256 Translator scalability | |
| 1257 ---------------------- | |
| 1258 | |
| 1259 Currently scalability is "not very good". Multiple translation threads lead to | |
| 1260 faster translation, but not to the degree desired. We haven't dug in to | |
| 1261 investigate yet. | |
| 1262 | |
| 1263 There are a few areas to investigate. First, there may be contention on the | |
| 1264 constant pool, which all threads access, and which requires locked access even | |
| 1265 for reading. This could be mitigated by keeping a Cfg-local cache of the most | |
| 1266 common constants. | |
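A minimal sketch of that mitigation, with hypothetical names: a per-Cfg map sits in front of a locked global pool, so repeated lookups of the same constant within one function skip the global lock entirely after the first access.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Stand-in for Subzero's interned constant objects.
struct Constant { int64_t Value; };

// Global constant pool: every lookup takes the lock, even for reads.
class GlobalConstantPool {
public:
  Constant *getOrAdd(int64_t Value) {
    std::lock_guard<std::mutex> Lock(Mutex);
    auto It = Pool.find(Value);
    if (It != Pool.end())
      return It->second;
    Constant *C = new Constant{Value}; // pooled objects live forever
    Pool.emplace(Value, C);
    return C;
  }

private:
  std::mutex Mutex;
  std::unordered_map<int64_t, Constant *> Pool;
};

// Per-Cfg cache: only the owning translation thread touches it, so the
// fast path needs no locking and still returns the shared pooled pointer.
class CfgConstantCache {
public:
  explicit CfgConstantCache(GlobalConstantPool &Global) : Global(Global) {}

  Constant *get(int64_t Value) {
    auto It = Cache.find(Value);
    if (It != Cache.end())
      return It->second; // lock-free fast path
    Constant *C = Global.getOrAdd(Value); // slow path: one locked lookup
    Cache.emplace(Value, C);
    return C;
  }

private:
  GlobalConstantPool &Global;
  std::unordered_map<int64_t, Constant *> Cache;
};
```

Since the most common constants (small integers, zero, one) recur heavily within a function, even a small per-Cfg cache should absorb most of the lock traffic.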
| 1267 | |
| 1268 Second, there may be contention on memory allocation. While almost all Cfg | |
| 1269 objects are allocated from the Cfg-local allocator, some passes use temporary | |
| 1270 STL containers that use the default allocator, which may require global locking. | |
| 1271 This could be mitigated by switching these to the Cfg-local allocator. | |
| 1272 | |
| 1273 Third, multithreading may make the default allocator strategy more expensive. | |
| 1274 In a single-threaded environment, a pass will allocate its containers, run the | |
| 1275 pass, and deallocate the containers. This results in stack-like allocation | |
| 1276 behavior and makes the heap free list easier to manage, with less heap | |
| 1277 fragmentation. But when multithreading is added, the allocations and | |
| 1278 deallocations become much less stack-like, making allocation and deallocation | |
| 1279 operations individually more expensive. Again, this could be mitigated by | |
| 1280 switching these to the Cfg-local allocator. | |