Chromium Code Reviews

Side by Side Diff: DESIGN.rst

Issue 1309073003: Subzero: Add a detailed design document. (Closed) Base URL: https://chromium.googlesource.com/native_client/pnacl-subzero.git@master
Patch Set: Code review changes Created 5 years, 3 months ago
1 Design of the Subzero fast code generator
2 =========================================
3
4 Introduction
5 ------------
6
7 The `Portable Native Client (PNaCl) <http://gonacl.com>`_ project includes
8 compiler technology based on `LLVM <http://llvm.org/>`_. The developer uses the
9 PNaCl toolchain to compile their application to architecture-neutral PNaCl
10 bitcode (a ``.pexe`` file), using as much architecture-neutral optimization as
11 possible. The ``.pexe`` file is downloaded to the user's browser where the
12 PNaCl translator (a component of Chrome) compiles the ``.pexe`` file to
13 `sandboxed
14 <https://developer.chrome.com/native-client/reference/sandbox_internals/index>`_
15 native code. The translator uses architecture-specific optimizations as much as
16 practical to generate good native code.
17
18 The native code can be cached by the browser to avoid repeating translation on
19 future page loads. However, first-time user experience is hampered by long
20 translation times. The LLVM-based PNaCl translator is pretty slow, even when
21 using ``-O0`` to minimize optimizations, so delays are especially noticeable on
22 slow browser platforms such as ARM-based Chromebooks.
23
24 Translator slowness can be mitigated or hidden in a number of ways.
25
26 - Parallel translation. However, slow machines where this matters most, e.g.
27 ARM-based Chromebooks, are likely to have fewer cores to parallelize across,
28 and are likely to have less memory available for multiple translation threads to
29 use.
30
31 - Streaming translation, i.e. start translating as soon as the download starts.
32 This doesn't help much when translation speed is 10× slower than download
33 speed, or when the ``.pexe`` file is already cached but the translated binary was
34 flushed from the cache.
35
36 - Arrange the web page such that translation is done in parallel with
37 downloading large assets.
38
39 - Arrange the web page to distract the user with `cat videos
40 <https://www.youtube.com/watch?v=tLt5rBfNucc>`_ while translation is in
41 progress.
42
43 Or, improve translator performance to something more reasonable.
44
45 This document describes Subzero's attempt to improve translation speed by an
46 order of magnitude while rivaling LLVM's code quality. Subzero does this
47 through minimal IR layering, lean data structures and passes, and a careful
48 selection of fast optimization passes. It has two optimization recipes: full
49 optimizations (``O2``) and minimal optimizations (``Om1``). The recipes are the
50 following (described in more detail below):
51
52 +-----------------------------------+-----------------------+
53 | O2 recipe | Om1 recipe |
54 +===================================+=======================+
55 | Parse .pexe file | Parse .pexe file |
56 +-----------------------------------+-----------------------+
57 | Address mode inference | |
58 +-----------------------------------+-----------------------+
59 | Read-modify-write (RMW) transform | |
60 +-----------------------------------+-----------------------+
61 | Basic liveness analysis | |
62 +-----------------------------------+-----------------------+
63 | Load optimization | |
64 +-----------------------------------+-----------------------+
65 | | Phi lowering (simple) |
66 +-----------------------------------+-----------------------+
67 | Target lowering | Target lowering |
68 +-----------------------------------+-----------------------+
69 | Full liveness analysis | |
70 +-----------------------------------+-----------------------+
71 | Register allocation | |
72 +-----------------------------------+-----------------------+
73 | Phi lowering (advanced) | |
74 +-----------------------------------+-----------------------+
75 | Post-phi register allocation | |
76 +-----------------------------------+-----------------------+
77 | Branch optimization | |
78 +-----------------------------------+-----------------------+
79 | Code emission | Code emission |
80 +-----------------------------------+-----------------------+
81
82 Goals
83 =====
84
85 Translation speed
86 -----------------
87
88 We'd like to be able to translate a ``.pexe`` file as fast as download speed.
89 Any faster is in a sense wasted effort. Download speed varies greatly, but
90 we'll arbitrarily say 1 MB/sec. We'll pick the ARM A15 CPU as the example of a
91 slow machine. We observe a 3× single-thread performance difference between A15
92 and a high-end x86 Xeon E5-2690 based workstation, and aggressively assume a
93 ``.pexe`` file could be compressed to 50% on the web server using gzip transport
94 compression, so we set the translation speed goal to 6 MB/sec on the Xeon
95 workstation.
96
97 Currently, at the ``-O0`` level, the LLVM-based PNaCl translator translates at
98 ⅒ the target rate. The ``-O2`` mode takes 3× as long as the ``-O0`` mode.
99
100 In other words, Subzero's goal is to improve over LLVM's translation speed by
101 10×.
102
103 Code quality
104 ------------
105
106 Subzero's initial goal is to produce code that meets or exceeds LLVM's ``-O0``
107 code quality. The stretch goal is to approach LLVM ``-O2`` code quality. On
108 average, LLVM ``-O2`` performs twice as well as LLVM ``-O0``.
109
110 It's important to note that the quality of Subzero-generated code depends on
111 target-neutral optimizations and simplifications being run beforehand in the
112 developer environment. The ``.pexe`` file reflects these optimizations. For
113 example, Subzero assumes that the basic blocks are ordered topologically where
114 possible (which makes liveness analysis converge fastest), and Subzero does not
115 do any function inlining because it should already have been done.
116
117 Translator size
118 ---------------
119
120 The current LLVM-based translator binary (``pnacl-llc``) is about 10 MB in size.
121 We think 1 MB is a more reasonable size -- especially for a component that is
122 distributed to over a billion Chrome users. Thus we target a 10× reduction
123 in binary size.
124
125 For development, Subzero can be built for all target architectures, and all
126 debugging and diagnostic options enabled. For a smaller translator, we restrict
127 to a single target architecture, and define a MINIMAL build where unnecessary
128 features are compiled out.
129
130 Memory footprint
131 ----------------
132
133 The current LLVM-based translator suffers from an issue in which some
134 function-specific data has to be retained in memory until all translation
135 completes, and therefore the memory footprint grows without bound. Large
136 ``.pexe`` files can lead to the translator process holding hundreds of MB of
137 memory by the end. The translator runs in a separate process, so this memory
138 growth doesn't *directly* affect other processes, but it does dirty the physical
139 memory and contributes to a perception of bloat, and in the extreme case can
140 lead to out-of-memory conditions, especially noticeable on weaker systems.
141
142 Subzero should maintain a stable memory footprint throughout translation. It's
143 not practical to set a specific limit, because there is no practical limit
144 on a single function's size, but the footprint should be
145 "reasonable" and be proportional to the largest input function size, not the
146 total ``.pexe`` file size. Simply put, Subzero should not have memory leaks or
147 inexorable memory growth. (We use ASAN builds to test for leaks.)
148
149 Multithreaded translation
150 -------------------------
151
152 It should be practical to translate different functions concurrently and see
153 good scalability. Some locking may be needed, such as accessing output buffers
154 or constant pools, but that should be fairly minimal. In contrast, LLVM was
155 only designed for module-level parallelism, and as such, the PNaCl translator
156 internally splits a ``.pexe`` file into several modules for concurrent
157 translation. All output needs to be deterministic regardless of the level of
158 multithreading, i.e. functions and data should always be output in the same
159 order.
160
161 Target architectures
162 --------------------
163
164 Initial target architectures are x86-32, x86-64, ARM32, and MIPS32. Future
165 targets include ARM64 and MIPS64, though these targets currently lack NaCl
166 support, including a sandbox model and a validator.
167
168 The first implementation is for x86-32, because it was expected to be
169 particularly challenging, and thus more likely to draw out any design problems
170 early:
171
172 - There are a number of special cases, asymmetries, and warts in the x86
173 instruction set.
174
175 - Complex addressing modes may be leveraged for better code quality.
176
177 - 64-bit integer operations have to be lowered into longer sequences of 32-bit
178 operations.
179
180 - Paucity of physical registers may reveal code quality issues early in the
181 design.
182
183 Detailed design
184 ===============
185
186 Intermediate representation - ICE
187 ---------------------------------
188
189 Subzero's IR is called ICE. It is designed to be reasonably similar to LLVM's
190 IR, which is reflected in the ``.pexe`` file's bitcode structure. It has a
191 representation of global variables and initializers, and a set of functions.
192 Each function contains a list of basic blocks, and each basic block contains a
193 list of instructions. Instructions that operate on stack and register variables
194 do so using static single assignment (SSA) form.
195
196 The ``.pexe`` file is translated one function at a time (or in parallel by
197 multiple translation threads). The recipe for optimization passes depends on
198 the specific target and optimization level, and is described in detail below.
199 Global variables (types and initializers) are simply and directly translated to
200 object code, without any meaningful attempts at optimization.
201
202 A function's control flow graph (CFG) is represented by the ``Ice::Cfg`` class.
203 Its key contents include:
204
205 - A list of ``CfgNode`` pointers, generally held in topological order.
206
207 - A list of ``Variable`` pointers corresponding to local variables used in the
208 function plus compiler-generated temporaries.
209
210 A basic block is represented by the ``Ice::CfgNode`` class. Its key contents
211 include:
212
213 - A linear list of instructions, in the same style as LLVM. The last
214 instruction of the list is always a terminator instruction: branch, switch,
215 return, unreachable.
216
217 - A list of Phi instructions, also in the same style as LLVM. They are held as
218 a linear list for convenience, though per Phi semantics, they are executed "in
219 parallel" without dependencies on each other.
220
221 - An unordered list of ``CfgNode`` pointers corresponding to incoming edges, and
222 another list for outgoing edges.
223
224 - The node's unique, 0-based index into the CFG's node list.
225
226 An instruction is represented by the ``Ice::Inst`` class. Its key contents
227 include:
228
229 - A list of source operands.
230
231 - Its destination variable, if the instruction produces a result in an
232 ``Ice::Variable``.
233
234 - A bitvector indicating which variables' live ranges this instruction ends.
235 This is computed during liveness analysis.
236
237 Instruction kinds are divided into high-level ICE instructions and low-level
238 ICE instructions. High-level instructions consist of the PNaCl/LLVM bitcode
239 instruction kinds. Each target architecture implementation extends the
240 instruction space with its own set of low-level instructions. Generally,
241 low-level instructions correspond to individual machine instructions. The
242 high-level ICE instruction space includes a few additional instruction kinds
243 that are not part of LLVM but are generally useful (e.g., an Assignment
244 instruction), or are useful across targets (e.g., BundleLock and BundleUnlock
245 instructions for sandboxing).
246
247 Specifically, high-level ICE instructions that derive from LLVM (but with PNaCl
248 ABI restrictions as documented in the `PNaCl Bitcode Reference Manual
249 <https://developer.chrome.com/native-client/reference/pnacl-bitcode-abi>`_) are
250 the following:
251
252 - Alloca: allocate data on the stack
253
254 - Arithmetic: binary operations of the form ``A = B op C``
255
256 - Br: conditional or unconditional branch
257
258 - Call: function call
259
260 - Cast: unary type-conversion operations
261
262 - ExtractElement: extract a scalar element from a vector-type value
263
264 - Fcmp: floating-point comparison
265
266 - Icmp: integer comparison
267
268 - IntrinsicCall: call a known intrinsic
269
270 - InsertElement: insert a scalar element into a vector-type value
271
272 - Load: load a value from memory
273
274 - Phi: implement the SSA phi node
275
276 - Ret: return from the function
277
278 - Select: essentially the C language operation of the form ``X = C ? Y : Z``
279
280 - Store: store a value into memory
281
282 - Switch: generalized branch to multiple possible locations
283
284 - Unreachable: indicate that this portion of the code is unreachable
285
286 The additional high-level ICE instructions are the following:
287
288 - Assign: a simple ``A=B`` assignment. This is useful for e.g. lowering Phi
289 instructions to non-SSA assignments, before lowering to machine code.
290
291 - BundleLock, BundleUnlock. These are markers used for sandboxing, but are
292 common across all targets and so they are elevated to the high-level
293 instruction set.
294
295 - FakeDef, FakeUse, FakeKill. These are tools used to preserve consistency in
296 liveness analysis, elevated to the high-level because they are used by all
297 targets. They are described in more detail at the end of this section.
298
299 - JumpTable: this represents the result of switch optimization analysis, where
300 some switch instructions may use jump tables instead of cascading
301 compare/branches.
302
303 An operand is represented by the ``Ice::Operand`` class. In high-level ICE, an
304 operand is either an ``Ice::Constant`` or an ``Ice::Variable``. Constants
305 include scalar integer constants, scalar floating point constants, Undef (an
306 unspecified constant of a particular scalar or vector type), and symbol
307 constants (essentially addresses of globals). Note that the PNaCl ABI does not
308 include vector-type constants besides Undef, and as such, Subzero (so far) has
309 no reason to represent vector-type constants internally. A variable represents
310 a value allocated on the stack (though not including alloca-derived storage).
311 Among other things, a variable holds its unique, 0-based index into the CFG's
312 variable list.
313
314 Each target can extend the ``Constant`` and ``Variable`` classes for its own
315 needs. In addition, the ``Operand`` class may be extended, e.g. to define an
316 x86 ``MemOperand`` that encodes a base register, an index register, an index
317 register shift amount, and a constant offset.
318
319 Register allocation and liveness analysis are restricted to Variable operands.
320 Because of the importance of register allocation to code quality, and the
321 translation-time cost of liveness analysis, Variable operands get some special
322 treatment in ICE. Most notably, a frequent pattern in Subzero is to iterate
323 across all the Variables of an instruction. An instruction holds a list of
324 operands, but an operand may contain 0, 1, or more Variables. As such, the
325 ``Operand`` class specially holds a list of Variables contained within, for
326 quick access.
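The flattening pattern above can be sketched as follows; the class and method names are illustrative only, not Subzero's actual C++ API:

```python
class Variable:
    """A register/stack variable with its unique 0-based CFG index."""
    def __init__(self, index):
        self.index = index

class MemOperand:
    """An x86-style memory operand: base + index*scale + offset.
    The contained Variables are pre-flattened for quick iteration."""
    def __init__(self, base=None, index=None, offset=0):
        self.offset = offset
        self.vars = [v for v in (base, index) if v is not None]

class Inst:
    """An instruction with a destination and a list of source operands."""
    def __init__(self, dest, srcs):
        self.dest = dest
        self.srcs = srcs

    def src_variables(self):
        # Iterate every Variable contained in the source operands without
        # re-inspecting each operand's internal structure.
        for op in self.srcs:
            yield from (op.vars if isinstance(op, MemOperand) else [op])
```

A pass that wants every Variable an instruction reads can then loop over ``src_variables()`` directly, regardless of how deeply the Variables are nested inside operands.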
327
328 A Subzero transformation pass may work by deleting an existing instruction and
329 replacing it with zero or more new instructions. Instead of actually deleting
330 the existing instruction, we generally mark it as deleted and insert the new
331 instructions right after the deleted instruction. When printing the IR for
332 debugging, this is a big help because it makes it much clearer how the
333 non-deleted instructions came about.
334
335 Subzero has a few special instructions to help with liveness analysis
336 consistency.
337
338 - The FakeDef instruction gives a fake definition of some variable. For
339 example, on x86-32, a divide instruction defines both ``%eax`` and ``%edx``
340 but an ICE instruction can represent only one destination variable. This is
341 similar for multiply instructions, and for function calls that return a 64-bit
342 integer result in the ``%edx:%eax`` pair. Also, using the ``xor %eax, %eax``
343 trick to set ``%eax`` to 0 requires an initial FakeDef of ``%eax``.
344
345 - The FakeUse instruction registers a use of a variable, typically to prevent an
346 earlier assignment to that variable from being dead-code eliminated. For
347 example, lowering an operation like ``x=cc?y:z`` may be done using x86's
348 conditional move (cmov) instruction: ``mov z, x; cmov_cc y, x``. Without a
349 FakeUse of ``x`` between the two instructions, the liveness analysis pass may
350 dead-code eliminate the first instruction.
351
352 - The FakeKill instruction is added after a call instruction, and is a quick way
353 of indicating that caller-save registers are invalidated.
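The cmov example above, with its protective FakeUse, can be sketched as a lowering helper (the helper and instruction names are hypothetical, not Subzero's API):

```python
def lower_select_cmov(dest, cond, src_true, src_false, emit):
    """Lower 'dest = cond ? src_true : src_false' via x86-style cmov.

    The FakeUse of dest between the two instructions keeps liveness
    analysis from dead-code eliminating the initial mov, whose value
    is observable whenever the cmov condition does not hold.
    """
    emit('mov', dest, src_false)          # dest = src_false
    emit('FakeUse', dest)                 # register a use of dest here
    emit('cmov_' + cond, dest, src_true)  # conditionally overwrite dest
```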
354
355 Pexe parsing
356 ------------
357
358 Subzero includes an integrated PNaCl bitcode parser for ``.pexe`` files. It
359 parses the ``.pexe`` file function by function, ultimately constructing an ICE
360 CFG for each function. After a function is parsed, its CFG is handed off to the
361 translation phase. The bitcode parser also parses global initializer data and
362 hands it off to be translated to data sections in the object file.
363
364 Subzero has another parsing strategy for testing/debugging. LLVM libraries can
365 be used to parse a module into LLVM IR (though very slowly relative to Subzero
366 native parsing). Then we iterate across the LLVM IR and construct high-level
367 ICE, handing off each CFG to the translation phase.
368
369 Overview of lowering
370 --------------------
371
372 In general, translation goes like this:
373
374 - Parse the next function from the ``.pexe`` file and construct a CFG consisting
375 of high-level ICE.
376
377 - Do analysis passes and transformation passes on the high-level ICE, as
378 desired.
379
380 - Lower each high-level ICE instruction into a sequence of zero or more
381 low-level ICE instructions. Each high-level instruction is generally lowered
382 independently, though the target lowering is allowed to look ahead in the
383 CfgNode's instruction list if desired.
384
385 - Do more analysis and transformation passes on the low-level ICE, as desired.
386
387 - Assemble the low-level CFG into an ELF object file (alternatively, a textual
388 assembly file that is later assembled by some external tool).
389
390 - Repeat for all functions, and also produce object code for data such as global
391 initializers and internal constant pools.
392
393 Currently there are two optimization levels: ``O2`` and ``Om1``. For ``O2``,
394 the intention is to apply all available optimizations to get the best code
395 quality (though the initial code quality goal is measured against LLVM's ``O0``
396 code quality). For ``Om1``, the intention is to apply as few optimizations as
397 possible and produce code as quickly as possible, accepting poor code quality.
398 ``Om1`` is short for "O-minus-one", i.e. "worse than O0", or in other words,
399 "sub-zero".
400
401 High-level debuggability of generated code is so far not a design requirement,
402 but Subzero doesn't really do transformations that would obfuscate debugging.
403 The main thing might be that register allocation (including stack slot
404 coalescing for stack-allocated variables whose live ranges don't overlap) may
405 render a variable's value unobtainable after its live range ends. This would
406 not be an issue for ``Om1`` since it doesn't register-allocate program-level
407 variables, nor does it coalesce stack slots. Adding debugging support would
408 primarily involve adding DWARF support to the ELF file emitter. Subzero
409 propagates global symbol names, local variable names, and function-internal
410 label names that are present in the ``.pexe`` file, but the PNaCl toolchain
411 would need to not strip these symbols to make it most useful.
412
413 Our experience so far is that ``Om1`` translates twice as fast as ``O2``, but
414 produces code with one third the code quality. ``Om1`` is good for testing and
415 debugging -- during translation, it tends to expose errors in the basic lowering
416 that might otherwise have been hidden by the register allocator or other
417 optimization passes. It also helps determine whether a code correctness problem
418 is a fundamental problem in the basic lowering, or an error in another
419 optimization pass.
420
421 The implementation of target lowering also controls the recipe of passes used
422 for ``Om1`` and ``O2`` translation. For example, address mode inference may
423 only be relevant for x86.
424
425 Lowering strategy
426 -----------------
427
428 The core of Subzero's lowering from high-level ICE to low-level ICE is to lower
429 each high-level instruction down to a sequence of low-level target-specific
430 instructions, in a largely context-free setting. That is, each high-level
431 instruction conceptually has a simple template expansion into low-level
432 instructions, and lowering can in theory be done in any order. This may sound
433 like a small effort, but quite a large number of templates may be needed because
434 of the number of PNaCl types and instruction variants. Furthermore, there may
435 be optimized templates, e.g. to take advantage of operator commutativity. This
436 is similar to other template-based approaches in fast code generation or
437 interpretation, though some decisions are deferred until after some global
438 analysis passes, mostly related to register allocation, stack slot assignment,
439 and specific choice of instruction variant and addressing mode.
440
441 The key idea for a lowering template is to produce valid low-level instructions
442 that are guaranteed to meet address mode and other structural requirements of
443 the instruction set. For example, on x86, the source operand of an integer
444 store instruction must be an immediate or a physical register; a shift
445 instruction's shift amount must be an immediate or in register ``%cl``; a
446 function's integer return value is in ``%eax``; most x86 instructions are
447 two-operand, in contrast to corresponding three-operand high-level instructions;
448 etc.
449
450 Because target lowering runs before register allocation, there is no way to know
451 whether a given ``Ice::Variable`` operand lives on the stack or in a physical
452 register. When the low-level instruction calls for a physical register operand,
453 the target lowering can create an infinite-weight Variable. This tells the
454 register allocator to assign infinite weight when making decisions, effectively
455 guaranteeing some physical register. Variables can also be pre-colored to a
456 specific physical register (``cl`` in the shift example above), which also gives
457 infinite weight.
458
459 To illustrate, consider a high-level arithmetic instruction on 32-bit integer
460 operands::
461
462 A = B + C
463
464 X86 target lowering might produce the following::
465
466 T.inf = B // mov instruction
467 T.inf += C // add instruction
468 A = T.inf // mov instruction
469
470 Here, ``T.inf`` is an infinite-weight temporary. As long as ``T.inf`` has a
471 physical register, the three lowered instructions are all encodable regardless
472 of whether ``B`` and ``C`` are physical registers, memory, or immediates, and
473 whether ``A`` is a physical register or in memory.
474
475 In this example, ``A`` must be a Variable and one may be tempted to simplify the
476 lowering sequence by setting ``A`` as infinite-weight and using::
477
478 A = B // mov instruction
479 A += C // add instruction
480
481 This has two problems. First, if the original instruction was actually ``A =
482 B + A``, the result would be incorrect. Second, assigning ``A`` a physical
483 register applies throughout ``A``'s entire live range. This is probably not
484 what is intended, and may ultimately lead to a failure to allocate a register
485 for an infinite-weight variable.
486
487 This style of lowering leads to many temporaries being generated, so in ``O2``
488 mode, we rely on the register allocator to clean things up. For example, in the
489 example above, if ``B`` ends up getting a physical register and its live range
490 ends at this instruction, the register allocator is likely to reuse that
491 register for ``T.inf``. This leads to ``T.inf=B`` being a redundant register
492 copy, which is removed as an emission-time peephole optimization.
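As a sketch of this lowering style, the add template could look like the following; ``new_temp`` and ``emit`` are hypothetical helpers standing in for whatever the target lowering actually provides:

```python
import math

def lower_add(dest, src_b, src_c, new_temp, emit):
    """Lower high-level 'dest = src_b + src_c' to two-operand x86-style code.

    The infinite-weight temporary tells the register allocator that t must
    end up in a physical register, so all three instructions are encodable
    no matter where dest, src_b, and src_c live.
    """
    t = new_temp(weight=math.inf)  # T.inf in the example above
    emit('mov', t, src_b)          # T.inf = B
    emit('add', t, src_c)          # T.inf += C
    emit('mov', dest, t)           # A = T.inf
```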
493
494 O2 lowering
495 -----------
496
497 Currently, the ``O2`` lowering recipe is the following:
498
499 - Address mode inference
500
501 - Read-modify-write (RMW) transformation
502
503 - Basic liveness analysis
504
505 - Load optimization
506
507 - Target lowering
508
509 - Full liveness analysis
510
511 - Register allocation
512
513 - Phi instruction lowering (advanced)
514
515 - Post-phi lowering register allocation
516
517 - Branch optimization
518
519 These passes are described in more detail below.
520
521 Om1 lowering
522 ------------
523
524 Currently, the ``Om1`` lowering recipe is the following:
525
526 - Phi instruction lowering (simple)
527
528 - Target lowering
529
530 - Register allocation (infinite-weight and pre-colored only)
531
532 Optimization passes
533 -------------------
534
535 Liveness analysis
536 ^^^^^^^^^^^^^^^^^
537
538 Liveness analysis is a standard dataflow optimization, implemented as follows.
539 For each node (basic block), its live-out set is computed as the union of the
540 live-in sets of its successor nodes. Then the node's instructions are processed
541 in reverse order, updating the live set, until the beginning of the node is
542 reached, and the node's live-in set is recorded. If this iteration has changed
543 the node's live-in set, the node's predecessors are marked for reprocessing.
544 This continues until no more nodes need reprocessing. If nodes are processed in
545 reverse topological order, the number of iterations over the CFG is generally
546 equal to the maximum loop nest depth.
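The worklist iteration just described can be sketched as follows, with the per-instruction reverse scan collapsed into per-block ``use``/``def`` sets (all names are illustrative):

```python
def block_liveness(nodes, succs, preds, use, defs):
    """Backward dataflow liveness over basic blocks (a sketch).

    use[n]: variables read in n before any write there; defs[n]: variables
    written in n. Seeding the worklist in reverse topological order keeps
    the number of passes near the maximum loop nest depth.
    """
    live_in = {n: frozenset() for n in nodes}
    live_out = {n: frozenset() for n in nodes}
    worklist = list(reversed(nodes))
    while worklist:
        n = worklist.pop(0)
        # Live-out is the union of the successors' live-in sets.
        live_out[n] = frozenset().union(*(live_in[s] for s in succs[n]))
        new_in = use[n] | (live_out[n] - defs[n])
        if new_in != live_in[n]:
            live_in[n] = new_in
            # A changed live-in set means predecessors need reprocessing.
            worklist.extend(p for p in preds[n] if p not in worklist)
    return live_in, live_out
```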
547
548 To implement this, each node records its live-in and live-out sets, initialized
549 to the empty set. Each instruction records which of its Variables' live ranges
550 end in that instruction, initialized to the empty set. A side effect of
551 liveness analysis is dead instruction elimination. Each instruction can be
552 marked as tentatively dead, and after the algorithm converges, the tentatively
553 dead instructions are permanently deleted.
554
555 Optionally, after this liveness analysis completes, we can do live range
556 construction, in which we calculate the live range of each variable in terms of
557 instruction numbers. A live range is represented as a union of segments, where
558 the segment endpoints are instruction numbers. Instruction numbers are required
559 to be unique across the CFG, and monotonically increasing within a basic block.
560 As a union of segments, live ranges can contain "gaps" and are therefore
561 precise. Because of SSA properties, a variable's live range can start at most
562 once in a basic block, and can end at most once in a basic block. Liveness
563 analysis keeps track of which variable/instruction tuples begin live ranges and
564 end live ranges, and combined with live-in and live-out sets, we can efficiently
565 build up live ranges of all variables across all basic blocks.
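A live range as a union of segments, together with the overlap test the register allocator needs, might look like this sketch (not Subzero's actual class):

```python
class LiveRange:
    """A union of half-open [start, end) instruction-number segments.

    Because a range is a union of segments, it can contain gaps, so two
    variables whose uses interleave without truly overlapping are not
    falsely reported as conflicting.
    """
    def __init__(self):
        self.segments = []

    def add_segment(self, start, end):
        self.segments.append((start, end))
        self.segments.sort()

    def overlaps(self, other):
        # Two ranges overlap iff any pair of their segments intersects.
        return any(a0 < b1 and b0 < a1
                   for a0, a1 in self.segments
                   for b0, b1 in other.segments)
```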
566
567 A lot of care is taken to try to make liveness analysis fast and efficient.
568 Because of the lowering strategy, the number of variables is generally
569 proportional to the number of instructions, leading to an O(N^2) complexity
570 algorithm if implemented naively. To improve things based on sparsity, we note
571 that most variables are "local" and referenced in at most one basic block (in
572 contrast to the "global" variables with multi-block usage), and therefore cannot
573 be live across basic blocks. Therefore, the live-in and live-out sets,
574 typically represented as bit vectors, can be limited to the set of global
575 variables, and the intra-block liveness bit vector can be compacted to hold the
576 global variables plus the local variables for that block.
577
578 Register allocation
579 ^^^^^^^^^^^^^^^^^^^
580
581 Subzero implements a simple linear-scan register allocator, based on the
582 allocator described by Hanspeter Mössenböck and Michael Pfeiffer in `Linear Scan
583 Register Allocation in the Context of SSA Form and Register Constraints
584 <ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe02.PDF>`_. This allocator has
585 several nice features:
586
587 - Live ranges are represented as unions of segments, as described above, rather
588 than a single start/end tuple.
589
590 - It allows pre-coloring of variables with specific physical registers.
591
592 - It applies equally well to pre-lowered Phi instructions.
593
594 The paper suggests an approach of aggressively coalescing variables across Phi
595 instructions (i.e., trying to force Phi source and destination variables to have
596 the same register assignment), but we reject that in favor of the more natural
597 preference mechanism described below.
598
599 We enhance the algorithm in the paper with the capability of automatic inference
600 of register preference, and with the capability of allowing overlapping live
601 ranges to safely share the same register in certain circumstances. If we are
602 considering register allocation for variable ``A``, and ``A`` has a single
603 defining instruction ``A=B+C``, then the preferred register for ``A``, if
604 available, would be the register assigned to ``B`` or ``C``, if any, provided
605 that ``B`` or ``C``'s live range does not overlap ``A``'s live range. In this
606 way we infer a good register preference for ``A``.
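The preference inference can be sketched as follows; live ranges are simplified to single half-open intervals, and all names are illustrative:

```python
def ranges_overlap(a, b):
    """Half-open [start, end) interval overlap."""
    return a[0] < b[1] and b[0] < a[1]

def infer_preference(var, sources, assignment, live_range):
    """Prefer a register already assigned to a source operand of var's
    single defining instruction, provided that source's live range does
    not overlap var's; otherwise report no preference."""
    for src in sources:
        reg = assignment.get(src)
        if reg is not None and not ranges_overlap(live_range[src],
                                                  live_range[var]):
            return reg
    return None  # no safe preference; allocate normally
```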

We allow overlapping live ranges to get the same register in certain cases.
Suppose a high-level instruction like::

  A = unary_op(B)

has been target-lowered like::

  T.inf = B
  A = unary_op(T.inf)

Further, assume that ``B``'s live range continues beyond this instruction
sequence, and that ``B`` has already been assigned some register. Normally, we
might want to infer ``B``'s register as a good candidate for ``T.inf``, but it
turns out that ``T.inf`` and ``B``'s live ranges overlap, requiring them to have
different registers. But ``T.inf`` is just a read-only copy of ``B`` that is
guaranteed to be in a register, so in theory these overlapping live ranges could
safely have the same register. Our implementation allows this overlap as long
as ``T.inf`` is never modified within ``B``'s live range, and ``B`` is never
modified within ``T.inf``'s live range.

Subzero's register allocator can be run in three configurations.

- Normal mode. All Variables are considered for register allocation. It
  requires full liveness analysis and live range construction as a
  prerequisite. This is used by ``O2`` lowering.

- Minimal mode. Only infinite-weight or pre-colored Variables are considered.
  All other Variables are stack-allocated. It does not require liveness
  analysis; instead, it quickly scans the instructions and records first
  definitions and last uses of all relevant Variables, using that to construct
  a single-segment live range. Although this includes most of the Variables,
  the live ranges are mostly simple, short, and rarely overlapping, which the
  register allocator handles efficiently. This is used by ``Om1`` lowering.

- Post-phi lowering mode. Advanced phi lowering is done after normal-mode
  register allocation, and may result in new infinite-weight Variables that
  need registers. One would like to just run something like minimal mode to
  assign registers to the new Variables while respecting existing register
  allocation decisions. However, it sometimes happens that there are no free
  registers. In this case, some register needs to be forcibly spilled to the
  stack, temporarily reassigned to the new Variable, and reloaded at the end of
  the new Variable's live range. The register must be one that has no explicit
  references during the Variable's live range. Since Subzero currently doesn't
  track def/use chains (though it does record the CfgNode where a Variable is
  defined), we just do a brute-force search across the CfgNode's instruction
  list for the instruction numbers of interest. This situation happens very
  rarely, so there's little point for now in improving its performance.

The basic linear-scan algorithm may, as it proceeds, rescind an early register
allocation decision, leaving that Variable to be stack-allocated. Sometimes it
turns out that the Variable could have been given a different register without
conflict, but by that point it's too late. The literature recognizes this
situation and describes "second-chance bin-packing", which Subzero can do. We
can rerun the register allocator in a mode that respects existing register
allocation decisions, and sometimes it finds new non-conflicting opportunities.
In fact, we can repeatedly run the register allocator until convergence.
Unfortunately, in the current implementation, these subsequent register
allocation passes end up being extremely expensive. This is because of the
treatment of the "unhandled pre-colored" Variable set, which is normally very
small but ends up being quite large on subsequent passes. Its performance can
probably be made acceptable with a better choice of data structures, but for
now this second-chance mechanism is disabled.

Future work is to implement LLVM's `Greedy
<http://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html>`_
register allocator as a replacement for the basic linear-scan algorithm, given
LLVM's experience with its improvement in code quality. (The blog post claims
that the Greedy allocator also improved maintainability because a lot of hacks
could be removed, but Subzero probably hasn't yet reached that level of hacks,
and is less likely to see that particular benefit.)

Basic phi lowering
^^^^^^^^^^^^^^^^^^

The simplest phi lowering strategy works as follows (this is how LLVM ``-O0``
implements it). Consider this example::

  L1:
    ...
    br L3
  L2:
    ...
    br L3
  L3:
    A = phi [B, L1], [C, L2]
    X = phi [Y, L1], [Z, L2]

For each destination of a phi instruction, we can create a temporary and insert
the temporary's assignment at the end of the predecessor block::

  L1:
    ...
    A' = B
    X' = Y
    br L3
  L2:
    ...
    A' = C
    X' = Z
    br L3
  L3:
    A = A'
    X = X'

This transformation is very simple and reliable. It can be done before target
lowering and register allocation, and it easily avoids the classic lost-copy and
related problems. ``Om1`` lowering uses this strategy.

However, it has the disadvantage of initializing temporaries even for branches
not taken, though that could be mitigated by splitting non-critical edges and
putting assignments in the edge-split nodes. Another problem is that without
extra machinery, the assignments to ``A``, ``A'``, ``X``, and ``X'`` are given a
specific ordering even though phi semantics are that the assignments are
parallel or unordered. This sometimes imposes false live range overlaps and
leads to poorer register allocation.
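
The transformation above can be sketched as a small function. This is a
simplified Python model using tuples for instructions, with hypothetical
names; it reproduces the example's primed temporaries:

```python
def lower_phis_basic(phis):
    """Basic phi lowering: split each phi 'dest = phi [src, pred], ...'
    into per-predecessor assignments to a fresh temporary, plus a final
    'dest = temp' assignment in the phi's own block.

    phis: list of (dest, [(src, pred), ...]).
    Returns (pred_assignments, block_assignments), where pred_assignments
    maps each predecessor label to the (dest, src) copies to append at
    the end of that block.
    """
    pred_assignments = {}   # predecessor label -> list of (dest, src)
    block_assignments = []  # assignments for the phi's own block
    for dest, incoming in phis:
        temp = dest + "'"   # fresh temporary, e.g. A -> A'
        for src, pred in incoming:
            pred_assignments.setdefault(pred, []).append((temp, src))
        block_assignments.append((dest, temp))
    return pred_assignments, block_assignments
```

Running it on the two phis of the example yields ``A' = B; X' = Y`` for
``L1``, ``A' = C; X' = Z`` for ``L2``, and ``A = A'; X = X'`` for ``L3``.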

Advanced phi lowering
^^^^^^^^^^^^^^^^^^^^^

``O2`` lowering defers phi lowering until after register allocation to avoid the
problem of false live range overlaps. It works as follows. We split each
incoming edge and move the (parallel) phi assignments into the split nodes. We
linearize each set of assignments by finding a safe, topological ordering of the
assignments, respecting register assignments as well. For example::

  A = B
  X = Y

Normally these assignments could be executed in either order, but if ``B`` and
``X`` are assigned the same physical register, we would want to use the above
ordering. Dependency cycles are broken by introducing a temporary. For
example::

  A = B
  B = A

Here, a temporary breaks the cycle::

  t = A
  A = B
  B = t

Finally, we use the existing target lowering to lower the assignments in this
basic block, and once that is done for all basic blocks, we run the post-phi
variant of register allocation on the edge-split basic blocks.

When computing a topological order, we try to first schedule assignments whose
source has a physical register, and last schedule assignments whose destination
has a physical register. This helps reduce register pressure.
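
The linearization of parallel assignments, including cycle breaking, can be
sketched as follows. This is a simplified Python model that ignores the
physical-register scheduling heuristic and uses a single hypothetical
temporary ``t``:

```python
def sequentialize(moves):
    """Order a set of parallel moves (dest, src) into a safe sequence.

    A move is safe to emit when its destination is not read by any
    remaining move.  When no move is safe, every destination is still
    needed as a source (a cycle), so we save one destination into a
    temporary before overwriting it.
    """
    moves = list(moves)
    out = []
    while moves:
        pending_srcs = {src for _, src in moves}
        ready = [m for m in moves if m[0] not in pending_srcs]
        if ready:
            m = ready[0]
            moves.remove(m)
            out.append(m)
        else:
            # Cycle: save moves[0]'s destination, then redirect all
            # remaining reads of it to the temporary.
            dest, _ = moves[0]
            out.append(("t", dest))
            moves = [(d, "t" if s == dest else s) for d, s in moves]
    return out
```

On the cyclic example ``A = B; B = A`` this produces exactly the
``t = A; A = B; B = t`` sequence shown above.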

X86 address mode inference
^^^^^^^^^^^^^^^^^^^^^^^^^^

We try to take advantage of the x86 addressing mode that includes a base
register, an index register, an index register scale amount, and an immediate
offset. We do this through simple pattern matching. Starting with a load or
store instruction where the address is a variable, we initialize the base
register to that variable, and look up the instruction where that variable is
defined. If that is an add instruction of two variables and the index register
hasn't been set, we replace the base and index register with those two
variables. If instead it is an add instruction of a variable and a constant, we
replace the base register with the variable and add the constant to the
immediate offset.

There are several more patterns that can be matched. This pattern matching
continues on the load or store instruction until no more matches are found.
Because a program typically has relatively few load and store instructions (not
to be confused with instructions that manipulate stack variables), this address
mode inference pass is fast.
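
The two patterns described above can be sketched like this. This is a
simplified Python model with a hypothetical instruction encoding; the
scale-by-shift and other patterns are omitted:

```python
def infer_addr_mode(addr_var, defs):
    """Fold simple add chains into an x86-style base+index+offset address.

    addr_var: the variable holding the address of a load/store.
    defs: maps a variable to its single defining instruction, either
    ("add", var1, var2) or ("addimm", var, const).
    """
    base, index, offset = addr_var, None, 0
    changed = True
    while changed:
        changed = False
        inst = defs.get(base)
        if inst is None:
            break
        if inst[0] == "add" and index is None:
            # base = var1 + var2  ==>  use var1 as base, var2 as index.
            _, b, i = inst
            base, index = b, i
            changed = True
        elif inst[0] == "addimm":
            # base = var + const  ==>  fold const into the offset.
            _, b, imm = inst
            base = b
            offset += imm
            changed = True
    return base, index, offset
```

For an address computed as ``a2 = p + i; a1 = a2 + 8``, the matcher folds the
load address into base ``p``, index ``i``, offset ``8``.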

X86 read-modify-write inference
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A reasonably common bitcode pattern is a non-atomic update of a memory
location::

  x = load addr
  y = add x, 1
  store y, addr

On x86, with good register allocation, the Subzero passes described above
generate code with only this quality::

  mov [%ebx], %eax
  add $1, %eax
  mov %eax, [%ebx]

However, x86 allows for this kind of code::

  add $1, [%ebx]

which requires fewer instructions, but perhaps more importantly, requires fewer
physical registers.

It's also important to note that this transformation only makes sense if the
store instruction ends ``x``'s live range.

Subzero's ``O2`` recipe includes an early pass to find read-modify-write (RMW)
opportunities via simple pattern matching. The only problem is that it is run
before liveness analysis, which is needed to determine whether ``x``'s live
range ends after the RMW. Since liveness analysis is one of the most expensive
passes, it's not attractive to run it an extra time just for RMW analysis.
Instead, we essentially generate both the RMW and the non-RMW versions, and then
during lowering, the RMW version deletes itself if it finds ``x`` still live.
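
The legality check for the RMW fold can be sketched as follows. This is a
simplified Python model with a hypothetical IR encoding; the real pass matches
more operations than ``add`` and works on Subzero's instruction classes:

```python
def find_rmw(instrs, live_out):
    """Find load/op/store triples eligible for x86 read-modify-write.

    instrs: list of ("load", x, addr), ("add", y, x, imm),
    ("store", y, addr) tuples.
    live_out: set of variables still live after the store; the loaded
    value (and the op result) must die at the store for the fold to be
    legal.
    """
    found = []
    for i in range(len(instrs) - 2):
        a, b, c = instrs[i], instrs[i + 1], instrs[i + 2]
        if (a[0] == "load" and b[0] == "add" and c[0] == "store"
                and b[2] == a[1]           # op reads the loaded value
                and c[1] == b[1]           # store writes the op result
                and c[2] == a[2]           # same address
                and a[1] not in live_out   # loaded value dies here
                and b[1] not in live_out):
            found.append(("rmw_add", c[2], b[3]))
    return found
```

If the loaded value is still live after the store, the pattern is rejected,
mirroring the self-deleting RMW version described above.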

X86 compare-branch inference
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the LLVM instruction set, the compare/branch pattern works like this::

  cond = icmp eq a, b
  br cond, target

The result of the icmp instruction is a single bit, and a conditional branch
tests that bit. By contrast, most target architectures use this pattern::

  cmp a, b       // implicitly sets various bits of FLAGS register
  br eq, target  // branch on a particular FLAGS bit

A naive lowering sequence conditionally sets ``cond`` to 0 or 1, then tests
``cond`` and conditionally branches. Subzero has a pass that identifies
boolean-based operations like this and folds them into a single
compare/branch-like operation. It is set up for more than just cmp/br, though.
Boolean producers include icmp (integer compare), fcmp (floating-point compare),
and trunc (integer truncation when the destination has bool type). Boolean
consumers include branch, select (the ternary operator from the C language), and
sign-extend and zero-extend when the source has bool type.
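
The cmp/br case of the folding pass can be sketched as follows. This is a
simplified Python model with a hypothetical IR encoding; a real pass must also
verify that the boolean has no other consumers before folding:

```python
def fold_compare_branch(instrs):
    """Fold an icmp immediately consumed by the next branch into a
    single compare/branch operation.

    instrs: list of ("icmp", dest, cond, a, b) and ("br", var, target).
    Assumes (for simplicity) that the icmp result has no other uses.
    """
    out = []
    i = 0
    while i < len(instrs):
        cur = instrs[i]
        nxt = instrs[i + 1] if i + 1 < len(instrs) else None
        if (nxt is not None and cur[0] == "icmp" and nxt[0] == "br"
                and nxt[1] == cur[1]):       # branch tests the icmp result
            _, _, cond, a, b = cur
            out.append(("cmp_br", cond, a, b, nxt[2]))
            i += 2
        else:
            out.append(cur)
            i += 1
    return out
```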

Sandboxing
^^^^^^^^^^

Native Client's sandbox model uses software fault isolation (SFI) to provide
safety when running untrusted code in a browser or other environment.
Subzero implements Native Client's `sandboxing
<https://developer.chrome.com/native-client/reference/sandbox_internals/index>`_
to enable Subzero-translated executables to be run inside Chrome. Subzero also
provides a fairly simple framework for investigating alternative sandbox models
or other restrictions on the sandbox model.

Sandboxing in Subzero is not actually implemented as a separate pass, but is
integrated into lowering and assembly.

- Indirect branches, including the ret instruction, are masked to a bundle
  boundary and bundle-locked.

- Call instructions are aligned to the end of the bundle so that the return
  address is bundle-aligned.

- Indirect branch targets, including function entry and targets in a switch
  statement jump table, are bundle-aligned.

- The intrinsic for reading the thread pointer is inlined appropriately.

- For x86-64, non-stack memory accesses are with respect to the reserved
  sandbox base register. We reduce the aggressiveness of address mode inference
  to leave room for the sandbox base register during lowering. There are no
  memory sandboxing changes for x86-32.

Code emission
-------------

Subzero's integrated assembler is derived from Dart's `assembler code
<https://github.com/dart-lang/sdk/tree/master/runtime/vm>`_. There is a pass
that iterates through the low-level ICE instructions and invokes the relevant
assembler functions. Placeholders are added for later fixup of branch target
offsets. (Backward branches use short offsets if possible; forward branches
always use long offsets.) The assembler emits into a staging buffer. Once
emission into the staging buffer for a function is complete, the data is emitted
to the output file as an ELF object file, and metadata such as relocations, the
symbol table, and the string table are accumulated for emission at the end.
Global data initializers are emitted similarly. A key point is that at this
point, the staging buffer can be deallocated, and only a minimum of data needs
to be held until the end.

As a debugging alternative, Subzero can emit textual assembly code which can
then be run through an external assembler. This is of course super slow, but
quite valuable when bringing up a new target.

As another debugging option, the staging buffer can be emitted as textual
assembly, primarily in the form of ``.byte`` lines. This allows the assembler
to be tested separately from the ELF-related code.

Memory management
-----------------

Where possible, we allocate from a ``CfgLocalAllocator`` which derives from
LLVM's ``BumpPtrAllocator``. This is an arena-style allocator where objects
allocated from the arena are never actually freed; instead, when the CFG
translation completes and the CFG is deleted, the entire arena memory is
reclaimed at once. This style of allocation works well in an environment like a
compiler where there are distinct phases with only easily-identifiable objects
living across phases. It frees the developer from having to manage object
deletion, and it amortizes deletion costs across just a single arena deletion at
the end of the phase. Furthermore, it helps scalability by allocating entirely
from thread-local memory pools, and by minimizing global locking of the heap.
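
The bump-pointer idea behind ``BumpPtrAllocator`` can be sketched as a toy
model over a byte buffer (offsets stand in for pointers; this is an
illustration, not LLVM's implementation):

```python
class Arena:
    """Bump-pointer arena: allocation just advances an offset, and
    everything is reclaimed at once when the arena is reset/dropped."""

    def __init__(self, size):
        self.buf = bytearray(size)
        self.offset = 0

    def alloc(self, nbytes, align=8):
        # Round the current offset up to the requested alignment, then bump.
        start = (self.offset + align - 1) & ~(align - 1)
        if start + nbytes > len(self.buf):
            raise MemoryError("arena exhausted")
        self.offset = start + nbytes
        return start  # offset standing in for a pointer

    def reset(self):
        # "Free" every object in O(1); no per-object destructors run,
        # which is why arena-allocated objects should be trivially
        # destructible (POD-like).
        self.offset = 0
```

The O(1) ``reset`` is the whole point: per-object deallocation is replaced by
one bulk reclaim at the end of the translation phase.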

Instructions are probably the most heavily allocated complex class in Subzero.
We represent an instruction list as an intrusive doubly linked list, allocate
all instructions from the ``CfgLocalAllocator``, and make sure each instruction
subclass is basically `POD
<http://en.cppreference.com/w/cpp/concept/PODType>`_ (Plain Old Data) with a
trivial destructor. This way, when the CFG is finished, we don't need to
individually deallocate every instruction. We do the same for Variables, which
are probably the second most heavily allocated complex class.

There are some situations where passes need to use some `STL container class
<http://en.cppreference.com/w/cpp/container>`_. Subzero has a way of using the
``CfgLocalAllocator`` as the container allocator if this is needed.

Multithreaded translation
-------------------------

Subzero is designed to be able to translate functions in parallel. With the
``-threads=N`` command-line option, there is a 3-stage producer-consumer
pipeline:

- A single thread parses the ``.pexe`` file and produces a sequence of work
  units. A work unit can be either a fully constructed CFG, or a set of global
  initializers. The work unit includes its sequence number denoting its parse
  order. Each work unit is added to the translation queue.

- There are N translation threads that draw work units from the translation
  queue and lower them into assembler buffers. Each assembler buffer is added
  to the emitter queue, tagged with its sequence number. The CFG and its
  ``CfgLocalAllocator`` are disposed of at this point.

- A single thread draws assembler buffers from the emitter queue and appends
  them to the output file. It uses the sequence numbers to reintegrate the
  assembler buffers according to the original parse order, such that output
  order is always deterministic.

This means that with ``-threads=N``, there are actually ``N+1`` spawned threads
for a total of ``N+2`` execution threads, taking the parser and emitter threads
into account. For the special case of ``N=0``, execution is entirely sequential
-- the same thread parses, translates, and emits, one function at a time. This
is useful for performance measurements.
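
The pipeline structure can be sketched as follows. This is a simplified Python
model with hypothetical names; for brevity the "emitter" stage here runs after
the translators finish rather than concurrently, but the sequence-number
reordering is the same:

```python
import queue
import threading

def run_pipeline(work, translate, n_threads):
    """3-stage pipeline: a parser (the caller), N translator threads,
    and an emitter that reorders results by sequence number so the
    output order is deterministic regardless of thread interleaving."""
    trans_q, emit_q = queue.Queue(), queue.Queue()

    def translator():
        while True:
            item = trans_q.get()
            if item is None:        # sentinel: no more work
                break
            seq, unit = item
            emit_q.put((seq, translate(unit)))

    threads = [threading.Thread(target=translator) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for seq, unit in enumerate(work):   # "parser" tags units by parse order
        trans_q.put((seq, unit))
    for _ in threads:
        trans_q.put(None)
    for t in threads:
        t.join()

    # Emitter: reassemble according to the original parse order.
    results = {}
    while not emit_q.empty():
        seq, out = emit_q.get()
        results[seq] = out
    return [results[i] for i in range(len(work))]
```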

Ideally, we would like to get near-linear scalability as the number of
translation threads increases. We expect that ``-threads=1`` should be slightly
faster than ``-threads=0``, as the small amount of time spent parsing and
emitting is done largely in parallel with translation. With perfect
scalability, ``-threads=N`` would translate ``N`` times as fast as
``-threads=1``, up until the point where parsing or emitting becomes the
bottleneck, or ``N+2`` exceeds the number of CPU cores. In reality, memory
performance would become a bottleneck and efficiency might peak at, say, 75%.

Currently, parsing takes about 11% of total sequential time. If translation
ever becomes fast and scalable enough that parsing turns into a bottleneck, it
should be possible to make parsing multithreaded as well.

Internally, all shared, mutable data is held in the GlobalContext object, and
access to each field is guarded by a mutex.

Security
--------

Subzero includes a number of security features in the generated code, as well as
in the Subzero translator itself, which run on top of the existing Native Client
sandbox as well as Chrome's OS-level sandbox.

Sandboxed translator
^^^^^^^^^^^^^^^^^^^^

When running inside the browser, the Subzero translator executes as sandboxed,
untrusted code that is initially checked by the validator, just like the
LLVM-based ``pnacl-llc`` translator. As such, the Subzero binary should be no
more or less secure than the translator it replaces, from the point of view of
the Chrome sandbox. That said, Subzero is much smaller than ``pnacl-llc`` and
was designed from the start with security in mind, so one expects fewer attacker
opportunities here.

Code diversification
^^^^^^^^^^^^^^^^^^^^

`Return-oriented programming
<https://en.wikipedia.org/wiki/Return-oriented_programming>`_ (ROP) is a
now-common technique for starting with e.g. a known buffer overflow situation
and launching it into a deeper exploit. The attacker scans the executable
looking for ROP gadgets, which are short sequences of code that happen to load
known values into known registers and then return. An attacker who manages to
overwrite parts of the stack can overwrite it with carefully chosen return
addresses such that certain ROP gadgets are effectively chained together to set
up the register state as desired, finally returning to some code that manages to
do something nasty based on those register values.

If there is a popular ``.pexe`` with a large install base, the attacker could
run Subzero on it and scan the executable for suitable ROP gadgets to use as
part of a potential exploit. Note that if the trusted validator is working
correctly, these ROP gadgets are limited to starting at a bundle boundary and
cannot use the trick of finding a gadget that happens to begin inside another
instruction. All the same, gadgets with these constraints still exist and the
attacker has access to them. This is the attack model we focus on most --
protecting the user against misuse of a "trusted" developer's application, as
opposed to mischief from a malicious ``.pexe`` file.

Subzero can mitigate these attacks to some degree through code diversification.
Specifically, we can apply some randomness to the code generation that makes ROP
gadgets less predictable. This randomness can have some compile-time cost, and
it can affect the code quality; and some diversifications may be more effective
than others. A more detailed treatment of hardening techniques may be found in
the Matasano report "`Attacking Clientside JIT Compilers
<https://www.nccgroup.trust/globalassets/resources/us/presentations/documents/attacking_clientside_jit_compilers_paper.pdf>`_".

To evaluate diversification effectiveness, we use a third-party ROP gadget
finder and limit its results to bundle-aligned addresses. For a given
diversification technique, we run it with a number of different random seeds,
find ROP gadgets for each version, and determine how persistent each ROP gadget
is across the different versions. A gadget is persistent if the same gadget is
found at the same code address. The best diversifications are ones with low
gadget persistence rates.

Subzero implements 7 different diversification techniques. Below is a
discussion of each technique, its effectiveness, and its cost. The discussions
of cost and effectiveness are for a single diversification technique; the
translation-time costs for multiple techniques are additive, but the effects of
multiple techniques on code quality and effectiveness are not yet known.

In Subzero's implementation, each randomization is "repeatable" in a sense.
Each pass that includes a randomization option gets its own private instance of
a random number generator (RNG). The RNG is seeded with a combination of a
global seed, the pass ID, and the function's sequence number. The global seed
is designed to be different across runs (perhaps based on the current time), but
for debugging, the global seed can be set to a specific value and the results
will be repeatable.
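
The seeding scheme can be sketched as follows. Hashing is just one way to
combine the three inputs; Subzero's actual mixing function may differ, and the
names here are hypothetical:

```python
import hashlib

def pass_seed(global_seed, pass_id, func_seq):
    """Derive a per-pass, per-function RNG seed from the global seed,
    the pass ID, and the function's sequence number.  With a fixed
    global seed the result is repeatable; different passes and
    functions get independent streams."""
    data = f"{global_seed}:{pass_id}:{func_seq}".encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "little")
```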

Subzero-generated code is subject to diversification once per translation, and
then Chrome caches the diversified binary for subsequent executions. An
attacker may attempt to run the binary multiple times hoping for
higher-probability combinations of ROP gadgets. When the attacker guesses
wrong, a likely outcome is an application crash. Chrome throttles the creation
of crashy processes, which reduces the likelihood of the attacker eventually
gaining a foothold.

Constant blinding
~~~~~~~~~~~~~~~~~

Here, we prevent attackers from controlling large immediates in the text
(executable) section. A random cookie is generated for each function, and if
the constant exceeds a specified threshold, the constant is obfuscated with the
cookie and equivalent code is generated. For example, instead of this x86
instruction::

  mov $0x11223344, <%Reg/Mem>

the following code might be generated::

  mov $(0x11223344+Cookie), %temp
  lea -Cookie(%temp), %temp
  mov %temp, <%Reg/Mem>

The ``lea`` instruction is used rather than e.g. ``add``/``sub`` or ``xor``, to
prevent unintended effects on the flags register.
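
The arithmetic of the transformation can be sketched with a tiny model; the
instruction names are hypothetical stand-ins for the x86 sequence above, with
values kept to 32 bits:

```python
MASK = 0xFFFFFFFF

def blind(constant, cookie, threshold=0xFFFF):
    """Rewrite a large immediate as (constant + cookie) followed by a
    compensating lea-style subtraction, so the raw constant never
    appears in the text section.  Small constants are left alone."""
    if constant <= threshold:
        return [("mov_imm", constant)]
    return [
        ("mov_imm", (constant + cookie) & MASK),  # obfuscated immediate
        ("lea_sub", cookie),                      # temp -= cookie, no flags
    ]

def evaluate(prog):
    """Interpret the tiny instruction list to recover the final value."""
    val = 0
    for op, arg in prog:
        if op == "mov_imm":
            val = arg
        elif op == "lea_sub":
            val = (val - arg) & MASK
    return val
```

Round-tripping through ``blind``/``evaluate`` recovers the original constant,
which is exactly the invariant the lowered ``mov``/``lea`` pair preserves.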

This transformation has almost no effect on translation time, and about 1%
impact on code quality, depending on the threshold chosen. It does little to
reduce gadget persistence, but it does remove a lot of potential opportunities
to construct intra-instruction ROP gadgets (which an attacker could use only if
a validator bug were discovered, since the Native Client sandbox and associated
validator force returns and other indirect branches to be to bundle-aligned
addresses).

Constant pooling
~~~~~~~~~~~~~~~~

This is similar to constant blinding, in that large immediates are removed from
the text section. In this case, each unique constant above the threshold is
stored in a read-only data section and the constant is accessed via a memory
load. For the above example, the following code might be generated::

  mov $Label$1, %temp
  mov %temp, <%Reg/Mem>

This has a similarly small impact on translation time and ROP gadget
persistence, and a smaller (better) impact on code quality. This is because it
uses fewer instructions, and in some cases good register allocation leads to no
increase in instruction count. Note that this still gives an attacker some
limited amount of control over some text section values, unless we randomize the
constant pool layout.

Static data reordering
~~~~~~~~~~~~~~~~~~~~~~

This transformation limits the attacker's ability to control bits in global data
address references. It simply permutes the order in memory of global variables
and internal constant pool entries. For the constant pool, we only permute
within a type (i.e., emit a randomized list of ints, followed by a randomized
list of floats, etc.) to maintain good packing in the face of alignment
constraints.

As might be expected, this has no impact on code quality, translation time, or
ROP gadget persistence (though as above, it limits opportunities for
intra-instruction ROP gadgets with a broken validator).

Basic block reordering
~~~~~~~~~~~~~~~~~~~~~~

Here, we randomize the order of basic blocks within a function, with the
constraint that we still want to maintain a topological order as much as
possible, to avoid making the code too branchy.

This has no impact on code quality, and about 1% impact on translation time, due
to a separate pass to recompute layout. It ends up having a huge effect on ROP
gadget persistence, tied for best with nop insertion, reducing ROP gadget
persistence to less than 5%.

Function reordering
~~~~~~~~~~~~~~~~~~~

Here, we permute the order in which functions are emitted, primarily to shift
ROP gadgets around to less predictable locations. It may also change call
address offsets in case the attacker was trying to control that offset in the
code.

To control latency and memory footprint, we don't arbitrarily permute functions.
Instead, for some relatively small value of N, we queue up N assembler buffers,
then emit the N functions in random order, and repeat until all functions are
emitted.

Function reordering has no impact on translation time or code quality.
Measurements indicate that it reduces ROP gadget persistence to about 15%.

Nop insertion
~~~~~~~~~~~~~

This diversification randomly adds a nop instruction after each regular
instruction, with some probability. Nop instructions of different lengths may
be selected. Nop instructions are never added inside a bundle_lock region.
Note that when sandboxing is enabled, nop instructions are already being added
for bundle alignment, so the diversification nop instructions may simply be
taking the place of alignment nop instructions, though distributed differently
through the bundle.

In Subzero's current implementation, nop insertion adds 3-5% to the translation
time, but this is probably because it is implemented as a separate pass that
adds actual nop instructions to the IR. The overhead would probably be a lot
less if it were integrated into the assembler pass. The code quality is also
reduced by 3-5%, making nop insertion the most expensive of the diversification
techniques.

Nop insertion is very effective in reducing ROP gadget persistence, at the same
level as basic block randomization (less than 5%). But given nop insertion's
impact on translation time and code quality, one would most likely prefer to use
basic block randomization instead (though the combined effects of the different
diversification techniques have not yet been studied).

Register allocation randomization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In this diversification, the register allocator tries to make different but
mostly functionally equivalent choices, while maintaining stable code quality.

A naive approach would be the following. Whenever the allocator has more than
one choice for assigning a register, choose randomly among those options. And
whenever there are no registers available and there is a tie for the
lowest-weight variable, randomly select one of the lowest-weight variables to
evict. Because of the one-pass nature of the linear-scan algorithm, this
randomization strategy can have a large impact on which variables are ultimately
assigned registers, with a corresponding large impact on code quality.

Instead, we choose an approach that tries to keep code quality stable regardless
of the random seed. We partition the set of physical registers into equivalence
classes. If a register is pre-colored in the function (i.e., referenced
explicitly by name), it forms its own equivalence class. The remaining
registers are partitioned according to their combination of attributes such as
integer versus floating-point, 8-bit versus 32-bit, caller-save versus
callee-save, etc. Each equivalence class is randomly permuted, and the
complete permutation is applied to the final register assignments.
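
The permutation step can be sketched as follows. This is a simplified Python
model with a hypothetical equivalence-class layout; pre-colored registers form
singleton classes and therefore stay fixed:

```python
import random

def permute_registers(assignment, classes, seed):
    """Apply a random permutation within each register equivalence class
    to the final register assignments.

    assignment: maps variable -> assigned physical register.
    classes: list of register lists; registers are only ever swapped
    with other members of their own class.
    """
    rng = random.Random(seed)   # repeatable for a given seed
    mapping = {}
    for regs in classes:
        shuffled = regs[:]
        rng.shuffle(shuffled)
        mapping.update(zip(regs, shuffled))
    return {var: mapping[reg] for var, reg in assignment.items()}
```

Since registers only move within their class, properties like caller-save
versus callee-save are preserved, which is what keeps code quality stable.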

Register randomization reduces ROP gadget persistence to about 10% on average,
though there tends to be fairly high variance across functions and applications.
This probably has to do with the set of restrictions in the x86-32 instruction
set and ABI, such as few general-purpose registers, ``%eax`` used for return
values, ``%edx`` used for division, ``%cl`` used for shifting, etc. As
intended, register randomization has no impact on code quality, and a slight
(0.5%) impact on translation time due to an extra scan over the variables to
identify pre-colored registers.

Fuzzing
^^^^^^^

We have started fuzz-testing the ``.pexe`` files input to Subzero, using a
combination of `afl-fuzz <http://lcamtuf.coredump.cx/afl/>`_, LLVM's `libFuzzer
<http://llvm.org/docs/LibFuzzer.html>`_, and custom tooling. The purpose is to
find and fix cases where Subzero crashes or otherwise ungracefully fails on
unexpected inputs, and to do so automatically over a large range of unexpected
inputs. By fixing bugs that arise from fuzz testing, we reduce the possibility
of an attacker exploiting these bugs.

Most of the problems found so far are ones most appropriately handled in the
parser. However, there have been a couple that have identified problems in the
lowering, or otherwise inappropriately triggered assertion failures and fatal
errors. We continue to dig into this area.

Future security work
^^^^^^^^^^^^^^^^^^^^

Subzero is well-positioned to explore other future security enhancements, e.g.:

- Tightening the Native Client sandbox. ABI changes, such as the previous work
  on `hiding the sandbox base address
  <https://docs.google.com/document/d/1eskaI4353XdsJQFJLRnZzb_YIESQx4gNRzf31dqXVG8>`_
  in x86-64, are easy to experiment with in Subzero.

- Making the executable code section read-only. This would prevent a PNaCl
  application from inspecting its own binary and trying to find ROP gadgets
  even after code diversification has been performed.

- Instruction selection diversification. It may be possible to lower a given
  instruction in several largely equivalent ways, which gives more
  opportunities for code randomization.
1216
1217 Chrome integration
1218 ------------------
1219
1220 Currently Subzero is available in Chrome for the x86-32 architecture, but under
1221 a flag. When the flag is enabled, Subzero is used when the `manifest file
1222 <https://developer.chrome.com/native-client/reference/nacl-manifest-format>`_
1223 linking to the ``.pexe`` file specifies the ``O0`` optimization level.
1224
1225 The next step is to remove the flag, i.e. invoke Subzero as the only translator
1226 for ``O0``-specified manifest files.
1227
1228 Ultimately, Subzero might produce code rivaling LLVM ``O2`` quality, in which
1229 case Subzero could be used for all PNaCl translation.
1230
1231 Command line options
1232 --------------------
1233
1234 Subzero has a number of command-line options for debugging and diagnostics.
1235 Among the more interesting are the following.
1236
1237 - Using the ``-verbose`` flag, Subzero will dump the CFG, or produce other
1238 diagnostic output, with various levels of detail after each pass. Instruction
1239 numbers can be printed or suppressed. Deleted instructions can be printed or
1240 suppressed (they are retained in the instruction list, as discussed earlier,
1241 because they can help explain how lower-level instructions originated).
1242 Liveness information can be printed when available. Details of register
1243 allocation can be printed as register allocator decisions are made. And more.
1244
1245 - Running Subzero with any level of verbosity produces an enormous amount of
1246   output.  When debugging a single function, the ``-verbose-focus`` flag
1247   suppresses verbose output except for the specified function.
1249
1250 - Subzero has a ``-timing`` option that prints a breakdown of pass-level timing
1251 at exit. Timing markers can be placed in the Subzero source code to demarcate
1252 logical operations or passes of interest. Basic timing information plus
1253   call-stack-style timing information is printed at the end.
1254
1255 - Along with ``-timing``, the user can instead get a report on the overall
1256 translation time for each function, to help focus on timing outliers. Also,
1257 ``-timing-focus`` limits the ``-timing`` reporting to a single function,
1258 instead of aggregating pass timing across all functions.
1259
1260 - The ``-szstats`` option reports various statistics on each function, such as
1261 stack frame size, static instruction count, etc. It may be helpful to track
1262 these stats over time as Subzero is improved, as an approximate measure of
1263 code quality.
1264
1265 - The flag ``-asm-verbose``, in conjunction with emitting textual assembly
1266   output, annotates the assembly output with register-focused liveness
1267 information. In particular, each basic block is annotated with which
1268 registers are live-in and live-out, and each instruction is annotated with
1269 which registers' and stack locations' live ranges end at that instruction.
1270 This is really useful when studying the generated code to find opportunities
1271 for code quality improvements.
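The timing-marker mechanism behind ``-timing`` can be sketched with a context manager that accumulates wall time per marker and reports a breakdown at exit. This is an illustrative Python model, not Subzero's C++ implementation; the marker names are hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

_totals = defaultdict(float)

@contextmanager
def timed(marker):
    # Demarcates a pass or logical operation of interest, charging
    # its wall time to the named marker.
    start = time.perf_counter()
    try:
        yield
    finally:
        _totals[marker] += time.perf_counter() - start

def report():
    # Pass-level breakdown, largest contributors first, with each
    # marker's share of total measured time.
    total = sum(_totals.values()) or 1.0
    return [(name, t, 100.0 * t / total)
            for name, t in sorted(_totals.items(), key=lambda kv: -kv[1])]
```

Usage amounts to wrapping each pass: ``with timed("liveness"): run_liveness(cfg)``, then emitting ``report()`` when translation finishes.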
1272
1273 Testing and debugging
1274 ---------------------
1275
1276 LLVM lit tests
1277 ^^^^^^^^^^^^^^
1278
1279 For basic testing, Subzero uses LLVM's `lit
1280 <http://llvm.org/docs/CommandGuide/lit.html>`_ framework for running tests. We
1281 have a suite of hundreds of small functions where we test for particular
1282 assembly code patterns across different target architectures.
1283
1284 Cross tests
1285 ^^^^^^^^^^^
1286
1287 Unfortunately, the lit tests don't do a great job of precisely testing the
1288 correctness of the output. Much better are the cross tests, which are execution
1289 tests that compare Subzero and ``pnacl-llc`` translated bitcode across a wide
1290 variety of interesting inputs. Each cross test consists of a set of C, C++,
1291 and/or low-level bitcode files. The C and C++ source files are compiled down to
1292 bitcode. The bitcode files are translated by ``pnacl-llc`` and also by Subzero.
1293 Subzero mangles global symbol names with a special prefix to avoid duplicate
1294 symbol errors. A driver program invokes both versions on a large set of
1295 interesting inputs, and reports when the Subzero and ``pnacl-llc`` results
1296 differ. Cross tests turn out to be an excellent way of testing the basic
1297 lowering patterns, but they are less useful for testing more global things like
1298 liveness analysis and register allocation.
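In miniature, the driver amounts to differential testing: run the reference and the candidate on the same inputs and report every disagreement. The sketch below is an illustrative Python model in which a deliberately broken 8-bit ``abs`` lowering stands in for a Subzero bug.

```python
def cross_test(reference, candidate, inputs):
    # Run both "translations" on every input; any disagreement is a
    # candidate lowering bug worth investigating.
    return [(x, reference(x), candidate(x))
            for x in inputs if reference(x) != candidate(x)]

def reference_abs(x):
    # The pnacl-llc-translated behavior, assumed correct.
    return x if x >= 0 else -x

def wrapped_abs8(x):
    # Models a buggy lowering that truncates to 8-bit two's
    # complement: abs(-128) wraps back around to -128.
    r = x if x >= 0 else -x
    r &= 0xFF
    return r - 256 if r >= 128 else r
```

Running ``cross_test`` over the full 8-bit range pinpoints exactly the edge case where the two translations diverge.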
1299
1300 Bisection debugging
1301 ^^^^^^^^^^^^^^^^^^^
1302
1303 Sometimes with a new application, Subzero will end up producing incorrect code
1304 that either crashes at runtime or otherwise produces the wrong results. When
1305 this happens, we need to narrow it down to a single function (or small set of
1306 functions) that yield incorrect behavior. For this, we have a bisection
1307 debugging framework. Here, we initially translate the entire application once
1308 with Subzero and once with ``pnacl-llc``.  We then use ``objcopy`` to
1309 selectively weaken symbols based on a whitelist or blacklist provided on the
1310 command line. The two object files can then be linked together without link
1311 errors, with the desired version of each method "winning". Then the binary is
1312 tested, and bisection proceeds based on whether the binary produces correct
1313 output.
1314
1315 When the bisection completes, we are left with a minimal set of
1316 Subzero-translated functions that cause the failure. Usually it is a single
1317 function, though sometimes it might require a combination of several functions
1318 to cause a failure; this may be due to an incorrect call ABI, for example.
1319 However, Murphy's Law implies that the single failing function is enormous and
1320 impractical to debug. In that case, we can restart the bisection, explicitly
1321 blacklisting the enormous function, and try to find another candidate to debug.
1322 (Future work is to automate this to find all minimal sets of functions, so that
1323 debugging can focus on the simplest example.)
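Assuming a single culprit function, the bisection reduces to a binary search over the whitelist. In this illustrative Python sketch, ``binary_fails`` is a hypothetical stand-in for linking the hybrid binary and running the test.

```python
def bisect_failure(functions, binary_fails):
    # binary_fails(subset) models: link a binary in which 'subset'
    # uses the Subzero translation (the rest use pnacl-llc), run it,
    # and report whether it misbehaves.  Assumes a single culprit;
    # multi-function failures (e.g. a call ABI bug) need more care.
    assert binary_fails(functions), "all-Subzero binary must fail"
    candidates = list(functions)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        # Keep whichever half still reproduces the failure.
        candidates = half if binary_fails(half) else candidates[len(half):]
    return candidates[0]
```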
1324
1325 Fuzz testing
1326 ^^^^^^^^^^^^
1327
1328 As described above, we try to find internal Subzero bugs using fuzz testing
1329 techniques.
1330
1331 Sanitizers
1332 ^^^^^^^^^^
1333
1334 Subzero can be built with `AddressSanitizer
1335 <http://clang.llvm.org/docs/AddressSanitizer.html>`_ (ASan) and/or
1336 `ThreadSanitizer <http://clang.llvm.org/docs/ThreadSanitizer.html>`_ (TSan)
1337 support. This is done using something as simple as ``make ASAN=1 TSAN=1``. So
1338 far, multithreading has been simple enough that TSan hasn't found any bugs, but
1339 ASan has found at least one memory leak which was subsequently fixed.
1340 `UndefinedBehaviorSanitizer
1341 <http://clang.llvm.org/docs/UsersManual.html#controlling-code-generation>`_
1342 (UBSan) support is in progress.
1343
1344 Current status
1345 ==============
1346
1347 Target architectures
1348 --------------------
1349
1350 Subzero is currently more or less complete for the x86-32 target. It has been
1351 refactored and extended to handle x86-64 as well, and that is mostly complete at
1352 this point.
1353
1354 ARM32 work is in progress. It currently lacks the testing level of x86, at
1355 least in part because Subzero's register allocator needs modifications to handle
1356 ARM's aliasing of floating point and vector registers. Specifically, a 64-bit
1357 register is actually a gang of two consecutive and aligned 32-bit registers, and
1358 a 128-bit register is a gang of four consecutive and aligned 32-bit registers.
1359 ARM64 work has not started; when it does, it will be native-only, since the
1360 Native Client sandbox model, validator, and other tools were never defined for ARM64.
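The aliasing constraint can be modeled as overlapping sets of underlying 32-bit register units. The following is a simplified Python sketch (it ignores, for example, that on real ARM hardware only ``d0``-``d15`` have S-register aliases):

```python
def units(reg):
    # Map a register name to the set of underlying 32-bit units:
    # s<n> is unit n; d<n> spans units 2n..2n+1; q<n> spans 4n..4n+3.
    kind, n = reg[0], int(reg[1:])
    width = {"s": 1, "d": 2, "q": 4}[kind]
    return set(range(width * n, width * n + width))

def conflict(a, b):
    # Registers that share any 32-bit unit cannot be assigned to
    # simultaneously live ranges; the register allocator must treat
    # an assignment to one as clobbering the others.
    return bool(units(a) & units(b))
```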
1361
1362 An external contributor is adding MIPS support, in large part by following the
1363 ARM work.
1364
1365 Translator performance
1366 ----------------------
1367
1368 Single-threaded translation speed is currently about 5× the ``pnacl-llc``
1369 translation speed. For a large ``.pexe`` file, the time breaks down as:
1370
1371 - 11% for parsing and initial IR building
1372
1373 - 4% for emitting to /dev/null
1374
1375 - 27% for liveness analysis (two liveness passes plus live range construction)
1376
1377 - 15% for linear-scan register allocation
1378
1379 - 9% for basic lowering
1380
1381 - 10% for advanced phi lowering
1382
1383 - ~11% for other minor analysis
1384
1385 - ~10% measurement overhead to acquire these numbers
1386
1387 Some improvements could undoubtedly be made, but it will be hard to increase the
1388 speed to 10× that of ``pnacl-llc`` while keeping acceptable code quality.  With
1389 ``-Om1`` (lack of) optimization, we do actually achieve roughly 10×
1390 ``pnacl-llc`` translation speed, but code quality drops by a factor of 3.
1391
1392 Code quality
1393 ------------
1394
1395 Measured across 16 components of spec2k, Subzero's code quality is uniformly
1396 better than ``pnacl-llc`` ``-O0`` code quality, and in many cases solidly
1397 between ``pnacl-llc`` ``-O0`` and ``-O2``.
1398
1399 Translator size
1400 ---------------
1401
1402 When built in MINIMAL mode, the x86-64 native translator size for the x86-32
1403 target is about 700 KB, not including the size of functions referenced in
1404 dynamically-linked libraries. The sandboxed version of Subzero is a bit over 1
1405 MB, and it is statically linked and also includes nop padding for bundling as
1406 well as indirect branch masking.
1407
1408 Translator memory footprint
1409 ---------------------------
1410
1411 It's hard to draw firm conclusions about memory footprint, since the footprint
1412 is at least proportional to the input function size, and there is no real limit
1413 on the size of functions in the ``.pexe`` file.
1414
1415 That said, we looked at the memory footprint over time as Subzero translated
1416 ``pnacl-llc.pexe``, which is the largest ``.pexe`` file (7.2 MB) at our
1417 disposal. One of LLVM's libraries that Subzero uses can report the current
1418 malloc heap usage. With single-threaded translation, Subzero tends to hover
1419 around 15 MB of memory usage. There are a couple of monstrous functions where
1420 Subzero grows to around 100 MB, but then it drops back down after those
1421 functions finish translating. In contrast, ``pnacl-llc`` grows larger and
1422 larger throughout translation, reaching several hundred MB by the time it
1423 completes.
1424
1425 It's a bit more interesting when we enable multithreaded translation. When
1426 there are N translation threads, Subzero implements a policy that limits the
1427 size of the translation queue to N entries -- if it is "full" when the parser
1428 tries to add a new CFG, the parser blocks until one of the translation threads
1429 removes a CFG. This means the number of in-memory CFGs can (and generally does)
1430 reach 2*N+1, and so the memory footprint rises in proportion to the number of
1431 threads. Adding to the pressure is the observation that the monstrous functions
1432 also take proportionally longer time to translate, so there's a good chance many
1433 of the monstrous functions will be active at the same time with multithreaded
1434 translation. As a result, for N=32, Subzero's memory footprint peaks at about
1435 260 MB, but drops back down as the large functions finish translating.
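The throttling policy can be sketched with a bounded producer/consumer queue: the parser thread blocks on ``put`` once N CFGs are in flight. This is an illustrative Python model, with a string standing in for a translated CFG.

```python
import queue
import threading

def translate_all(cfgs, num_threads):
    # The parser thread produces CFGs; worker threads consume them.
    # The bounded queue makes the parser block while it is full,
    # capping the number of in-memory CFGs.
    work = queue.Queue(maxsize=num_threads)
    results, lock = [], threading.Lock()

    def worker():
        while True:
            cfg = work.get()
            if cfg is None:       # poison pill: no more work
                return
            translated = "translated(%s)" % cfg  # stand-in for lowering
            with lock:
                results.append(translated)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for cfg in cfgs:              # the "parser" adds CFGs in order
        work.put(cfg)             # blocks while the queue is full
    for _ in threads:
        work.put(None)
    for t in threads:
        t.join()
    return results
```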
1436
1437 If this peak memory size becomes a problem, it might be possible for the parser
1438 to resequence the functions to try to spread out the larger functions, or to
1439 throttle the translation queue to prevent too many in-flight large functions.
1440 It may also be possible to throttle based on memory pressure signaling from
1441 Chrome.
1442
1443 Translator scalability
1444 ----------------------
1445
1446 Currently scalability is "not very good". Multiple translation threads lead to
1447 faster translation, but not to the degree desired. We haven't dug in to
1448 investigate yet.
1449
1450 There are a few areas to investigate. First, there may be contention on the
1451 constant pool, which all threads access, and which requires locked access even
1452 for reading. This could be mitigated by keeping a CFG-local cache of the most
1453 common constants.
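That mitigation can be sketched as a two-level lookup: an unlocked CFG-local cache in front of the locked global pool. Illustrative Python, not Subzero's actual data structures:

```python
import threading

class GlobalConstantPool:
    # Shared across all translation threads; every lookup takes the
    # lock, even when the constant is already present.
    def __init__(self):
        self._lock = threading.Lock()
        self._pool = {}

    def get(self, value):
        with self._lock:
            return self._pool.setdefault(value, ("const", value))

class CfgLocalCache:
    # Per-CFG front-end cache: repeated lookups of common constants
    # hit the unlocked local dict and skip the global lock entirely.
    def __init__(self, shared):
        self._shared = shared
        self._cache = {}

    def get(self, value):
        entry = self._cache.get(value)
        if entry is None:
            entry = self._cache[value] = self._shared.get(value)
        return entry
```

Both caches still resolve to the single pooled object, so pointer-equality assumptions about constants continue to hold.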
1454
1455 Second, there may be contention on memory allocation. While almost all CFG
1456 objects are allocated from the CFG-local allocator, some passes use temporary
1457 STL containers that use the default allocator, which may require global locking.
1458 This could be mitigated by switching these to the CFG-local allocator.
1459
1460 Third, multithreading may make the default allocator strategy more expensive.
1461 In a single-threaded environment, a pass will allocate its containers, run the
1462 pass, and deallocate the containers. This results in stack-like allocation
1463 behavior and makes the heap free list easier to manage, with less heap
1464 fragmentation. But when multithreading is added, the allocations and
1465 deallocations become much less stack-like, making allocation and deallocation
1466 operations individually more expensive. Again, this could be mitigated by
1467 switching these to the CFG-local allocator.
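The proposed fix mirrors the CFG-local allocator's behavior: a per-CFG arena whose objects are released in one step when the CFG finishes, restoring the stack-like pattern regardless of what other threads are doing. A minimal Python sketch:

```python
class CfgArena:
    # Per-CFG arena: passes allocate freely during translation, and
    # everything is released in one step when the CFG is finished,
    # independent of other threads' allocation activity.
    def __init__(self):
        self._objects = []

    def alloc(self, obj):
        self._objects.append(obj)
        return obj

    def size(self):
        return len(self._objects)

    def reset(self):
        # Bulk free at the end of translation: no per-object
        # deallocation, so no free-list churn or fragmentation.
        self._objects.clear()
```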