Design of the Subzero fast code generator
=========================================

Introduction
------------

The `Portable Native Client (PNaCl) <http://gonacl.com>`_ project includes
compiler technology based on `LLVM <http://llvm.org/>`_.  The developer uses the
PNaCl toolchain to compile their application to architecture-neutral PNaCl
bitcode (a ``.pexe`` file), using as much architecture-neutral optimization as
possible.  The ``.pexe`` file is downloaded to the user's browser where the
PNaCl translator (a component of Chrome) compiles the ``.pexe`` file to
`sandboxed
<https://developer.chrome.com/native-client/reference/sandbox_internals/index>`_
native code.  The translator uses architecture-specific optimizations as much as
practical to generate good native code.

The native code can be cached by the browser to avoid repeating translation on
future page loads.  However, first-time user experience is hampered by long
translation times.  The LLVM-based PNaCl translator is pretty slow, even when
using ``-O0`` to minimize optimizations, so delays are especially noticeable on
slow browser platforms such as ARM-based Chromebooks.

Translator slowness can be mitigated or hidden in a number of ways.

- Parallel translation.  However, slow machines where this matters most, e.g.
  ARM-based Chromebooks, are likely to have fewer cores to parallelize across,
  and are likely to have less memory available for multiple translation threads
  to use.

- Streaming translation, i.e. start translating as soon as the download starts.
  This doesn't help much when translation speed is 10× slower than download
  speed, or the ``.pexe`` file is already cached while the translated binary was
  flushed from the cache.

- Arrange the web page such that translation is done in parallel with
  downloading large assets.

- Arrange the web page to distract the user with `cat videos
  <https://www.youtube.com/watch?v=tLt5rBfNucc>`_ while translation is in
  progress.

Or, improve translator performance to something more reasonable.

This document describes Subzero's attempt to improve translation speed by an
order of magnitude while rivaling LLVM's code quality.  Subzero does this
through minimal IR layering, lean data structures and passes, and a careful
selection of fast optimization passes.  It has two optimization recipes: full
optimizations (``O2``) and minimal optimizations (``Om1``).  The recipes are the
following (described in more detail below):

+-----------------------------------+-----------------------+
| O2 recipe                         | Om1 recipe            |
+===================================+=======================+
| Parse .pexe file                  | Parse .pexe file      |
+-----------------------------------+-----------------------+
| Loop nest analysis                |                       |
+-----------------------------------+-----------------------+
| Address mode inference            |                       |
+-----------------------------------+-----------------------+
| Read-modify-write (RMW) transform |                       |
+-----------------------------------+-----------------------+
| Basic liveness analysis           |                       |
+-----------------------------------+-----------------------+
| Load optimization                 |                       |
+-----------------------------------+-----------------------+
|                                   | Phi lowering (simple) |
+-----------------------------------+-----------------------+
| Target lowering                   | Target lowering       |
+-----------------------------------+-----------------------+
| Full liveness analysis            |                       |
+-----------------------------------+-----------------------+
| Register allocation               |                       |
+-----------------------------------+-----------------------+
| Phi lowering (advanced)           |                       |
+-----------------------------------+-----------------------+
| Post-phi register allocation      |                       |
+-----------------------------------+-----------------------+
| Branch optimization               |                       |
+-----------------------------------+-----------------------+
| Code emission                     | Code emission         |
+-----------------------------------+-----------------------+

Goals
=====

Translation speed
-----------------

We'd like to be able to translate a ``.pexe`` file as fast as download speed.
Any faster is in a sense wasted effort.  Download speed varies greatly, but
we'll arbitrarily say 1 MB/sec.  We'll pick the ARM A15 CPU as the example of a
slow machine.  We observe a 3× single-thread performance difference between A15
and a high-end x86 Xeon E5-2690 based workstation, and aggressively assume a
``.pexe`` file could be compressed to 50% on the web server using gzip transport
compression, so we set the translation speed goal to 6 MB/sec on the high-end
Xeon workstation.

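Spelled out, the arithmetic behind the 6 MB/sec figure is::

    1 MB/sec download rate
      × 2   (gzip compresses the .pexe to ~50%, so 1 MB downloaded is 2 MB of bitcode)
      × 3   (Xeon single-thread performance relative to the A15)
      = 6 MB/sec translation rate needed on the Xeon workstation
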
Currently, at the ``-O0`` level, the LLVM-based PNaCl translator translates at
⅒ the target rate.  The ``-O2`` mode takes 3× as long as the ``-O0`` mode.

In other words, Subzero's goal is to improve over LLVM's translation speed by
10×.

Code quality
------------

Subzero's initial goal is to produce code that meets or exceeds LLVM's ``-O0``
code quality.  The stretch goal is to approach LLVM ``-O2`` code quality.  On
average, LLVM ``-O2`` performs twice as well as LLVM ``-O0``.

It's important to note that the quality of Subzero-generated code depends on
target-neutral optimizations and simplifications being run beforehand in the
developer environment.  The ``.pexe`` file reflects these optimizations.  For
example, Subzero assumes that the basic blocks are ordered topologically where
possible (which makes liveness analysis converge fastest), and Subzero does not
do any function inlining because it should already have been done.

Translator size
---------------

The current LLVM-based translator binary (``pnacl-llc``) is about 10 MB in size.
We think 1 MB is a more reasonable size -- especially for a component that is
distributed to a billion Chrome users.  Thus we target a 10× reduction in
binary size.

For development, Subzero can be built for all target architectures, with all
debugging and diagnostic options enabled.  For a smaller translator, we restrict
to a single target architecture, and define a ``MINIMAL`` build where
unnecessary features are compiled out.

Subzero leverages some data structures from LLVM's ``ADT`` and ``Support``
include directories, which have little impact on translator size.  It also uses
some of LLVM's bitcode decoding code (for binary-format ``.pexe`` files), again
with little size impact.  In non-``MINIMAL`` builds, the translator size is much
larger due to including code for parsing text-format bitcode files and forming
LLVM IR.

Memory footprint
----------------

The current LLVM-based translator suffers from an issue in which some
function-specific data has to be retained in memory until all translation
completes, and therefore the memory footprint grows without bound.  Large
``.pexe`` files can lead to the translator process holding hundreds of MB of
memory by the end.  The translator runs in a separate process, so this memory
growth doesn't *directly* affect other processes, but it does dirty the physical
memory and contributes to a perception of bloat and sometimes a reality of
out-of-memory tab killing, especially noticeable on weaker systems.

Subzero should maintain a stable memory footprint throughout translation.  It's
not practical to set a specific limit, since there is no real limit on a single
function's size, but the footprint should be "reasonable" and be proportional to
the largest input function size, not the total ``.pexe`` file size.  Simply put,
Subzero should not have memory leaks or inexorable memory growth.  (We use ASAN
builds to test for leaks.)

Multithreaded translation
-------------------------

It should be practical to translate different functions concurrently and see
good scalability.  Some locking may be needed, such as when accessing output
buffers or constant pools, but that should be fairly minimal.  In contrast, LLVM
was only designed for module-level parallelism, and as such, the PNaCl
translator internally splits a ``.pexe`` file into several modules for
concurrent translation.  All output needs to be deterministic regardless of the
level of multithreading, i.e. functions and data should always be output in the
same order.

Target architectures
--------------------

Initial target architectures are x86-32, x86-64, ARM32, and MIPS32.  Future
targets include ARM64 and MIPS64, though these targets lack NaCl support,
including a sandbox model or a validator.

The first implementation is for x86-32, because it was expected to be
particularly challenging, and thus more likely to draw out any design problems
early:

- There are a number of special cases, asymmetries, and warts in the x86
  instruction set.

- Complex addressing modes may be leveraged for better code quality.

- 64-bit integer operations have to be lowered into longer sequences of 32-bit
  operations.

- Paucity of physical registers may reveal code quality issues early in the
  design.

Detailed design
===============

Intermediate representation - ICE
---------------------------------

Subzero's IR is called ICE.  It is designed to be reasonably similar to LLVM's
IR, which is reflected in the ``.pexe`` file's bitcode structure.  It has a
representation of global variables and initializers, and a set of functions.
Each function contains a list of basic blocks, and each basic block contains a
list of instructions.  Instructions that operate on stack and register variables
do so using static single assignment (SSA) form.

The ``.pexe`` file is translated one function at a time (or in parallel by
multiple translation threads).  The recipe for optimization passes depends on
the specific target and optimization level, and is described in detail below.
Global variables (types and initializers) are simply and directly translated to
object code, without any meaningful attempts at optimization.

A function's control flow graph (CFG) is represented by the ``Ice::Cfg`` class.
Its key contents include:

- A list of ``CfgNode`` pointers, generally held in topological order.

- A list of ``Variable`` pointers corresponding to local variables used in the
  function plus compiler-generated temporaries.

A basic block is represented by the ``Ice::CfgNode`` class.  Its key contents
include:

- A linear list of instructions, in the same style as LLVM.  The last
  instruction of the list is always a terminator instruction: branch, switch,
  return, unreachable.

- A list of Phi instructions, also in the same style as LLVM.  They are held as
  a linear list for convenience, though per Phi semantics, they are executed "in
  parallel" without dependencies on each other.

- An unordered list of ``CfgNode`` pointers corresponding to incoming edges, and
  another list for outgoing edges.

- The node's unique, 0-based index into the CFG's node list.

An instruction is represented by the ``Ice::Inst`` class.  Its key contents
include:

- A list of source operands.

- Its destination variable, if the instruction produces a result in an
  ``Ice::Variable``.

- A bitvector indicating which variables' live ranges this instruction ends.
  This is computed during liveness analysis.

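To make the shape of these data structures concrete, here is a simplified C++
sketch of the relationships just described.  The member names are invented for
illustration and are not Subzero's actual declarations::

    #include <cstdint>
    #include <vector>

    // Illustrative stand-ins for the Cfg / CfgNode / Inst / Variable / Operand
    // relationships described above; member names are made up for this sketch.
    struct Variable;

    struct Operand {
      // Every operand knows which Variables it contains, so passes can walk an
      // instruction's Variables without knowing the concrete operand kind.
      std::vector<Variable *> ContainedVars;
    };

    struct Variable : Operand {
      uint32_t Index;  // unique, 0-based index into the CFG's variable list
    };

    struct Inst {
      std::vector<Operand *> Srcs;        // source operands
      Variable *Dest = nullptr;           // destination, if any
      std::vector<bool> LiveRangesEnded;  // filled in by liveness analysis
      bool Deleted = false;               // instructions are marked, not erased
    };

    struct CfgNode {
      uint32_t Index;                 // unique, 0-based index into the node list
      std::vector<Inst *> Phis;       // phi instructions, executed "in parallel"
      std::vector<Inst *> Insts;      // linear list; last one is a terminator
      std::vector<CfgNode *> InEdges, OutEdges;
    };

    struct Cfg {
      std::vector<CfgNode *> Nodes;       // generally in topological order
      std::vector<Variable *> Variables;  // locals plus generated temporaries
    };

    // The common pattern of iterating over all Variables used by an instruction:
    inline void visitSrcVariables(const Inst &I, void (*Visit)(Variable *)) {
      for (Operand *Src : I.Srcs)
        for (Variable *Var : Src->ContainedVars)
          Visit(Var);
    }
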
Instruction kinds are divided into high-level ICE instructions and low-level
ICE instructions.  High-level instructions consist of the PNaCl/LLVM bitcode
instruction kinds.  Each target architecture implementation extends the
instruction space with its own set of low-level instructions.  Generally,
low-level instructions correspond to individual machine instructions.  The
high-level ICE instruction space includes a few additional instruction kinds
that are not part of LLVM but are generally useful (e.g., an Assignment
instruction), or are useful across targets (e.g., BundleLock and BundleUnlock
instructions for sandboxing).

Specifically, high-level ICE instructions that derive from LLVM (but with PNaCl
ABI restrictions as documented in the `PNaCl Bitcode Reference Manual
<https://developer.chrome.com/native-client/reference/pnacl-bitcode-abi>`_) are
the following:

- Alloca: allocate data on the stack

- Arithmetic: binary operations of the form ``A = B op C``

- Br: conditional or unconditional branch

- Call: function call

- Cast: unary type-conversion operations

- ExtractElement: extract a scalar element from a vector-type value

- Fcmp: floating-point comparison

- Icmp: integer comparison

- IntrinsicCall: call a known intrinsic

- InsertElement: insert a scalar element into a vector-type value

- Load: load a value from memory

- Phi: implement the SSA phi node

- Ret: return from the function

- Select: essentially the C language operation of the form ``X = C ? Y : Z``

- Store: store a value into memory

- Switch: generalized branch to multiple possible locations

- Unreachable: indicate that this portion of the code is unreachable

The additional high-level ICE instructions are the following:

- Assign: a simple ``A=B`` assignment.  This is useful for e.g. lowering Phi
  instructions to non-SSA assignments, before lowering to machine code.

- BundleLock, BundleUnlock.  These are markers used for sandboxing, but are
  common across all targets and so they are elevated to the high-level
  instruction set.

- FakeDef, FakeUse, FakeKill.  These are tools used to preserve consistency in
  liveness analysis, elevated to the high-level because they are used by all
  targets.  They are described in more detail at the end of this section.

- JumpTable: this represents the result of switch optimization analysis, where
  some switch instructions may use jump tables instead of cascading
  compare/branches.

An operand is represented by the ``Ice::Operand`` class.  In high-level ICE, an
operand is either an ``Ice::Constant`` or an ``Ice::Variable``.  Constants
include scalar integer constants, scalar floating point constants, Undef (an
unspecified constant of a particular scalar or vector type), and symbol
constants (essentially addresses of globals).  Note that the PNaCl ABI does not
include vector-type constants besides Undef, and as such, Subzero (so far) has
no reason to represent vector-type constants internally.  A variable represents
a value allocated on the stack (though not including alloca-derived storage).
Among other things, a variable holds its unique, 0-based index into the CFG's
variable list.

Each target can extend the ``Constant`` and ``Variable`` classes for its own
needs.  In addition, the ``Operand`` class may be extended, e.g. to define an
x86 ``MemOperand`` that encodes a base register, an index register, an index
register shift amount, and a constant offset.

Register allocation and liveness analysis are restricted to Variable operands.
Because of the importance of register allocation to code quality, and the
translation-time cost of liveness analysis, Variable operands get some special
treatment in ICE.  Most notably, a frequent pattern in Subzero is to iterate
across all the Variables of an instruction.  An instruction holds a list of
operands, but an operand may contain 0, 1, or more Variables.  As such, the
``Operand`` class specially holds a list of Variables contained within, for
quick access.

A Subzero transformation pass may work by deleting an existing instruction and
replacing it with zero or more new instructions.  Instead of actually deleting
the existing instruction, we generally mark it as deleted and insert the new
instructions right after the deleted instruction.  When printing the IR for
debugging, this is a big help because it makes it much clearer how the
non-deleted instructions came about.

Subzero has a few special instructions to help with liveness analysis
consistency.

- The FakeDef instruction gives a fake definition of some variable.  For
  example, on x86-32, a divide instruction defines both ``%eax`` and ``%edx``
  but an ICE instruction can represent only one destination variable.  This is
  similar for multiply instructions, and for function calls that return a 64-bit
  integer result in the ``%edx:%eax`` pair.  Also, using the ``xor %eax, %eax``
  trick to set ``%eax`` to 0 requires an initial FakeDef of ``%eax``.

- The FakeUse instruction registers a use of a variable, typically to prevent an
  earlier assignment to that variable from being dead-code eliminated.  For
  example, lowering an operation like ``x=cc?y:z`` may be done using x86's
  conditional move (cmov) instruction: ``mov z, x; cmov_cc y, x``.  Without a
  FakeUse of ``x`` between the two instructions, the liveness analysis pass may
  dead-code eliminate the first instruction.

- The FakeKill instruction is added after a call instruction, and is a quick way
  of indicating that caller-save registers are invalidated.

Pexe parsing
------------

Subzero includes an integrated PNaCl bitcode parser for ``.pexe`` files.  It
parses the ``.pexe`` file function by function, ultimately constructing an ICE
CFG for each function.  After a function is parsed, its CFG is handed off to the
translation phase.  The bitcode parser also parses global initializer data and
hands it off to be translated to data sections in the object file.

Subzero has another parsing strategy for testing/debugging.  LLVM libraries can
be used to parse a module into LLVM IR (though very slowly relative to Subzero
native parsing).  Then we iterate across the LLVM IR and construct high-level
ICE, handing off each CFG to the translation phase.

Overview of lowering
--------------------

In general, translation goes like this:

- Parse the next function from the ``.pexe`` file and construct a CFG consisting
  of high-level ICE.

- Do analysis passes and transformation passes on the high-level ICE, as
  desired.

- Lower each high-level ICE instruction into a sequence of zero or more
  low-level ICE instructions.  Each high-level instruction is generally lowered
  independently, though the target lowering is allowed to look ahead in the
  CfgNode's instruction list if desired.

- Do more analysis and transformation passes on the low-level ICE, as desired.

- Assemble the low-level CFG into an ELF object file (alternatively, a textual
  assembly file that is later assembled by some external tool).

- Repeat for all functions, and also produce object code for data such as global
  initializers and internal constant pools.

Currently there are two optimization levels: ``O2`` and ``Om1``.  For ``O2``,
the intention is to apply all available optimizations to get the best code
quality (though the initial code quality goal is measured against LLVM's ``O0``
code quality).  For ``Om1``, the intention is to apply as few optimizations as
possible and produce code as quickly as possible, accepting poor code quality.
``Om1`` is short for "O-minus-one", i.e. "worse than O0", or in other words,
"sub-zero".

High-level debuggability of generated code is so far not a design requirement.
Subzero doesn't really do transformations that would obfuscate debugging; the
main thing might be that register allocation (including stack slot coalescing
for stack-allocated variables whose live ranges don't overlap) may render a
variable's value unobtainable after its live range ends.  This would not be an
issue for ``Om1`` since it doesn't register-allocate program-level variables,
nor does it coalesce stack slots.  That said, fully supporting debuggability
would require a few additions:

- DWARF support would need to be added to Subzero's ELF file emitter.  Subzero
  propagates global symbol names, local variable names, and function-internal
  label names that are present in the ``.pexe`` file.  This would allow a
  debugger to map addresses back to symbols in the ``.pexe`` file.

- To map ``.pexe`` file symbols back to meaningful source-level symbol names,
  file names, line numbers, etc., Subzero would need to handle `LLVM bitcode
  metadata <http://llvm.org/docs/LangRef.html#metadata>`_ and ``llvm.dbg``
  `intrinsics <http://llvm.org/docs/LangRef.html#dbg-intrinsics>`_.

- The PNaCl toolchain explicitly strips all this from the ``.pexe`` file, and so
  the toolchain would need to be modified to preserve it.

Our experience so far is that ``Om1`` translates twice as fast as ``O2``, but
produces code with one third the code quality.  ``Om1`` is good for testing and
debugging -- during translation, it tends to expose errors in the basic lowering
that might otherwise have been hidden by the register allocator or other
optimization passes.  It also helps determine whether a code correctness problem
is a fundamental problem in the basic lowering, or an error in another
optimization pass.

The implementation of target lowering also controls the recipe of passes used
for ``Om1`` and ``O2`` translation.  For example, address mode inference may
only be relevant for x86.

Lowering strategy
-----------------

The core of Subzero's lowering from high-level ICE to low-level ICE is to lower
each high-level instruction down to a sequence of low-level target-specific
instructions, in a largely context-free setting.  That is, each high-level
instruction conceptually has a simple template expansion into low-level
instructions, and lowering can in theory be done in any order.  This may sound
like a small effort, but quite a large number of templates may be needed because
of the number of PNaCl types and instruction variants.  Furthermore, there may
be optimized templates, e.g. to take advantage of operator commutativity (for
example, ``x=x+1`` might allow a better lowering than ``x=1+x``).  This is
similar to other template-based approaches in fast code generation or
interpretation, though some decisions are deferred until after some global
analysis passes, mostly related to register allocation, stack slot assignment,
and specific choice of instruction variant and addressing mode.

The key idea for a lowering template is to produce valid low-level instructions
that are guaranteed to meet address mode and other structural requirements of
the instruction set.  For example, on x86, the source operand of an integer
store instruction must be an immediate or a physical register; a shift
instruction's shift amount must be an immediate or in register ``%cl``; a
function's integer return value is in ``%eax``; most x86 instructions are
two-operand, in contrast to corresponding three-operand high-level instructions;
etc.

Because target lowering runs before register allocation, there is no way to know
whether a given ``Ice::Variable`` operand lives on the stack or in a physical
register.  When the low-level instruction calls for a physical register operand,
the target lowering can create an infinite-weight Variable.  This tells the
register allocator to assign infinite weight when making decisions, effectively
guaranteeing some physical register.  Variables can also be pre-colored to a
specific physical register (``cl`` in the shift example above), which also gives
infinite weight.

To illustrate, consider a high-level arithmetic instruction on 32-bit integer
operands::

    A = B + C

X86 target lowering might produce the following::

    T.inf = B  // mov instruction
    T.inf += C // add instruction
    A = T.inf  // mov instruction

Here, ``T.inf`` is an infinite-weight temporary.  As long as ``T.inf`` has a
physical register, the three lowered instructions are all encodable regardless
of whether ``B`` and ``C`` are physical registers, memory, or immediates, and
whether ``A`` is a physical register or in memory.

In this example, ``A`` must be a Variable and one may be tempted to simplify the
lowering sequence by setting ``A`` as infinite-weight and using::

    A = B  // mov instruction
    A += C // add instruction

This has two problems.  First, if the original instruction was actually ``A =
B + A``, the result would be incorrect.  Second, assigning ``A`` a physical
register applies throughout ``A``'s entire live range.  This is probably not
what is intended, and may ultimately lead to a failure to allocate a register
for an infinite-weight variable.

This style of lowering leads to many temporaries being generated, so in ``O2``
mode, we rely on the register allocator to clean things up.  In the example
above, if ``B`` ends up getting a physical register and its live range ends at
this instruction, the register allocator is likely to reuse that register for
``T.inf``.  This leads to ``T.inf=B`` being a redundant register copy, which is
removed as an emission-time peephole optimization.

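As a rough illustration of how such a template might be written, the following
sketch emits the three-instruction sequence for a 32-bit add.  The types and
helper names are invented for this example; Subzero's actual lowering code has
a different interface, but the emitted shape is the same::

    #include <string>
    #include <vector>

    // Toy model of the context-free lowering template for "A = B + C".
    struct Var { std::string Name; bool InfiniteWeight; };
    struct LoweredInst { std::string Text; };

    struct LoweringContext {
      std::vector<LoweredInst> Out;
      int NextTemp = 0;

      Var makeInfiniteWeightTemp() {
        // The register allocator is obliged to give this Variable a register.
        return Var{"T" + std::to_string(NextTemp++) + ".inf", true};
      }
      void emit(const std::string &Text) { Out.push_back({Text}); }
    };

    // Lower "Dest = Src0 + Src1" without knowing whether any of the operands
    // will end up in registers, on the stack, or as immediates.
    void lowerAdd32(LoweringContext &Ctx, const std::string &Dest,
                    const std::string &Src0, const std::string &Src1) {
      Var T = Ctx.makeInfiniteWeightTemp();
      Ctx.emit("mov " + T.Name + ", " + Src0);   // T.inf = B
      Ctx.emit("add " + T.Name + ", " + Src1);   // T.inf += C
      Ctx.emit("mov " + Dest + ", " + T.Name);   // A = T.inf
      // If the register allocator later gives Dest's or Src0's register to T,
      // the redundant moves are removed by the emission-time peephole.
    }

The point is that the template never needs to know where ``B``, ``C``, or ``A``
actually live; the infinite-weight temporary absorbs that uncertainty.
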
O2 lowering
-----------

Currently, the ``O2`` lowering recipe is the following:

- Loop nest analysis

- Address mode inference

- Read-modify-write (RMW) transformation

- Basic liveness analysis

- Load optimization

- Target lowering

- Full liveness analysis

- Register allocation

- Phi instruction lowering (advanced)

- Post-phi lowering register allocation

- Branch optimization

These passes are described in more detail below.

Om1 lowering
------------

Currently, the ``Om1`` lowering recipe is the following:

- Phi instruction lowering (simple)

- Target lowering

- Register allocation (infinite-weight and pre-colored only)

Optimization passes
-------------------

Liveness analysis
^^^^^^^^^^^^^^^^^

Liveness analysis is a standard dataflow analysis, implemented as follows.
For each node (basic block), its live-out set is computed as the union of the
live-in sets of its successor nodes.  Then the node's instructions are processed
in reverse order, updating the live set, until the beginning of the node is
reached, and the node's live-in set is recorded.  If this iteration has changed
the node's live-in set, the node's predecessors are marked for reprocessing.
This continues until no more nodes need reprocessing.  If nodes are processed in
reverse topological order, the number of iterations over the CFG is generally
equal to the maximum loop nest depth.

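The following self-contained sketch shows the shape of that iteration, using
plain ``std::vector<bool>`` bit vectors and integer variable indices; Subzero's
actual implementation uses compacted per-block vectors and other refinements
described below::

    #include <cstddef>
    #include <vector>

    struct LiveInst {
      std::vector<std::size_t> Srcs;   // variable indices read
      std::vector<std::size_t> Dests;  // variable indices written
    };
    struct LiveNode {
      std::vector<LiveInst> Insts;            // in forward order
      std::vector<std::size_t> Preds, Succs;  // node indices
      std::vector<bool> LiveIn, LiveOut;
    };

    void computeLiveness(std::vector<LiveNode> &Nodes, std::size_t NumVars) {
      for (auto &N : Nodes) {
        N.LiveIn.assign(NumVars, false);
        N.LiveOut.assign(NumVars, false);
      }
      // Process nodes in reverse index order, which approximates reverse
      // topological order given the CFG's topologically sorted node list.
      std::vector<bool> NeedsWork(Nodes.size(), true);
      bool Changed = true;
      while (Changed) {
        Changed = false;
        for (std::size_t I = Nodes.size(); I-- > 0;) {
          if (!NeedsWork[I])
            continue;
          NeedsWork[I] = false;
          LiveNode &N = Nodes[I];
          // LiveOut = union of successors' LiveIn.
          std::vector<bool> Live(NumVars, false);
          for (std::size_t S : N.Succs)
            for (std::size_t V = 0; V != NumVars; ++V)
              if (Nodes[S].LiveIn[V]) Live[V] = true;
          N.LiveOut = Live;
          // Walk instructions backward: kill destinations, then gen sources.
          for (std::size_t J = N.Insts.size(); J-- > 0;) {
            for (std::size_t D : N.Insts[J].Dests) Live[D] = false;
            for (std::size_t S : N.Insts[J].Srcs) Live[S] = true;
          }
          if (Live != N.LiveIn) {
            N.LiveIn = Live;
            for (std::size_t P : N.Preds) NeedsWork[P] = true;  // reprocess
            Changed = true;
          }
        }
      }
    }
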
To implement this, each node records its live-in and live-out sets, initialized
to the empty set.  Each instruction records which of its Variables' live ranges
end in that instruction, initialized to the empty set.  A side effect of
liveness analysis is dead instruction elimination.  Each instruction can be
marked as tentatively dead, and after the algorithm converges, the tentatively
dead instructions are permanently deleted.

Optionally, after this liveness analysis completes, we can do live range
construction, in which we calculate the live range of each variable in terms of
instruction numbers.  A live range is represented as a union of segments, where
the segment endpoints are instruction numbers.  Instruction numbers are required
to be unique across the CFG, and monotonically increasing within a basic block.
As a union of segments, live ranges can contain "gaps" and are therefore
precise.  Because of SSA properties, a variable's live range can start at most
once in a basic block, and can end at most once in a basic block.  Liveness
analysis keeps track of which variable/instruction tuples begin live ranges and
end live ranges, and combined with live-in and live-out sets, we can efficiently
build up live ranges of all variables across all basic blocks.

A lot of care is taken to try to make liveness analysis fast and efficient.
Because of the lowering strategy, the number of variables is generally
proportional to the number of instructions, leading to an O(N^2) complexity
algorithm if implemented naively.  To improve things based on sparsity, we note
that most variables are "local" and referenced in at most one basic block (in
contrast to the "global" variables with multi-block usage), and therefore cannot
be live across basic blocks.  Therefore, the live-in and live-out sets,
typically represented as bit vectors, can be limited to the set of global
variables, and the intra-block liveness bit vector can be compacted to hold the
global variables plus the local variables for that block.

Register allocation
^^^^^^^^^^^^^^^^^^^

Subzero implements a simple linear-scan register allocator, based on the
allocator described by Hanspeter Mössenböck and Michael Pfeiffer in `Linear Scan
Register Allocation in the Context of SSA Form and Register Constraints
<ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe02.PDF>`_.  This allocator has
several nice features:

- Live ranges are represented as unions of segments, as described above, rather
  than a single start/end tuple.

- It allows pre-coloring of variables with specific physical registers.

- It applies equally well to pre-lowered Phi instructions.

The paper suggests an approach of aggressively coalescing variables across Phi
instructions (i.e., trying to force Phi source and destination variables to have
the same register assignment), but we reject that in favor of the more natural
preference mechanism described below.

We enhance the algorithm in the paper with automatic inference of register
preference, and with the ability to let overlapping live ranges safely share
the same register in certain circumstances.  If we are considering register
allocation for variable ``A``, and ``A`` has a single defining instruction
``A=B+C``, then the preferred register for ``A``, if available, would be the
register assigned to ``B`` or ``C``, if any, provided that ``B`` or ``C``'s
live range does not overlap ``A``'s live range.  In this way we infer a good
register preference for ``A``.

We allow overlapping live ranges to get the same register in certain cases.
Suppose a high-level instruction like::

    A = unary_op(B)

has been target-lowered like::

    T.inf = B
    A = unary_op(T.inf)

Further, assume that ``B``'s live range continues beyond this instruction
sequence, and that ``B`` has already been assigned some register.  Normally, we
might want to infer ``B``'s register as a good candidate for ``T.inf``, but it
turns out that ``T.inf`` and ``B``'s live ranges overlap, requiring them to have
different registers.  But ``T.inf`` is just a read-only copy of ``B`` that is
guaranteed to be in a register, so in theory these overlapping live ranges could
safely have the same register.  Our implementation allows this overlap as long
as ``T.inf`` is never modified within ``B``'s live range, and ``B`` is never
modified within ``T.inf``'s live range.

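A compact sketch of the preference-inference check is shown below; the
segment-based live ranges make the overlap test straightforward, and the
overlap-sharing rule above builds on the same machinery plus a check that
neither variable is modified inside the other's range.  The types and names
here are illustrative, not Subzero's actual classes::

    #include <cstdint>
    #include <initializer_list>
    #include <vector>

    struct Segment { int32_t Begin, End; };          // [Begin, End) interval
    struct LiveRange { std::vector<Segment> Segs; };  // union of segments

    inline bool overlaps(const LiveRange &A, const LiveRange &B) {
      for (const Segment &X : A.Segs)
        for (const Segment &Y : B.Segs)
          if (X.Begin < Y.End && Y.Begin < X.End)
            return true;
      return false;
    }

    struct RAVariable {
      LiveRange Range;
      int32_t Reg = -1;                           // assigned register, or -1
      const RAVariable *SingleDefSrc0 = nullptr;  // e.g. B in "A = B + C"
      const RAVariable *SingleDefSrc1 = nullptr;  // e.g. C in "A = B + C"
    };

    // Preference: if A's one defining instruction is "A = B op C" and B (or C)
    // already has a register whose live range doesn't overlap A's, prefer it.
    inline int32_t inferPreference(const RAVariable &A) {
      for (const RAVariable *Src : {A.SingleDefSrc0, A.SingleDefSrc1})
        if (Src && Src->Reg >= 0 && !overlaps(Src->Range, A.Range))
          return Src->Reg;
      return -1;  // no usable preference
    }
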
Subzero's register allocator can be run in 3 configurations.

- Normal mode.  All Variables are considered for register allocation.  It
  requires full liveness analysis and live range construction as a prerequisite.
  This is used by ``O2`` lowering.

- Minimal mode.  Only infinite-weight or pre-colored Variables are considered.
  All other Variables are stack-allocated.  It does not require liveness
  analysis; instead, it quickly scans the instructions and records first
  definitions and last uses of all relevant Variables, using that to construct a
  single-segment live range.  Although this includes most of the Variables, the
  live ranges are mostly simple, short, and rarely overlapping, which the
  register allocator handles efficiently.  This is used by ``Om1`` lowering.

- Post-phi lowering mode.  Advanced phi lowering is done after normal-mode
  register allocation, and may result in new infinite-weight Variables that need
  registers.  One would like to just run something like minimal mode to assign
  registers to the new Variables while respecting existing register allocation
  decisions.  However, it sometimes happens that there are no free registers.
  In this case, some register needs to be forcibly spilled to the stack and
  temporarily reassigned to the new Variable, and reloaded at the end of the new
  Variable's live range.  The register must be one that has no explicit
  references during the Variable's live range.  Since Subzero currently doesn't
  track def/use chains (though it does record the CfgNode where a Variable is
  defined), we just do a brute-force search across the CfgNode's instruction
  list for the instruction numbers of interest.  This situation happens very
  rarely, so there's little point for now in improving its performance.

The basic linear-scan algorithm may, as it proceeds, rescind an early register
allocation decision, leaving that Variable to be stack-allocated.  Sometimes it
turns out that the Variable could have been given a different register without
conflict, but by this time it's too late.  The literature recognizes this
situation and describes "second-chance bin-packing", which Subzero can do.  We
can rerun the register allocator in a mode that respects existing register
allocation decisions, and sometimes it finds new non-conflicting opportunities.
In fact, we can repeatedly run the register allocator until convergence.
Unfortunately, in the current implementation, these subsequent register
allocation passes end up being extremely expensive.  This is because of the
treatment of the "unhandled pre-colored" Variable set, which is normally very
small but ends up being quite large on subsequent passes.  Its performance can
probably be made acceptable with a better choice of data structures, but for now
this second-chance mechanism is disabled.

Future work is to implement LLVM's `Greedy
<http://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html>`_
register allocator as a replacement for the basic linear-scan algorithm, given
LLVM's experience with its improvement in code quality.  (The blog post claims
that the Greedy allocator also improved maintainability because a lot of hacks
could be removed, but Subzero is probably not yet to that level of hacks, and is
less likely to see that particular benefit.)

Basic phi lowering
^^^^^^^^^^^^^^^^^^

The simplest phi lowering strategy works as follows (this is how LLVM ``-O0``
implements it).  Consider this example::

    L1:
      ...
      br L3
    L2:
      ...
      br L3
    L3:
      A = phi [B, L1], [C, L2]
      X = phi [Y, L1], [Z, L2]

For each destination of a phi instruction, we can create a temporary and insert
the temporary's assignment at the end of the predecessor block::

    L1:
      ...
      A' = B
      X' = Y
      br L3
    L2:
      ...
      A' = C
      X' = Z
      br L3
    L3:
      A = A'
      X = X'

This transformation is very simple and reliable.  It can be done before target
lowering and register allocation, and it easily avoids the classic lost-copy and
related problems.  ``Om1`` lowering uses this strategy.

However, it has the disadvantage of initializing temporaries even for branches
not taken, though that could be mitigated by splitting non-critical edges and
putting assignments in the edge-split nodes.  Another problem is that without
extra machinery, the assignments to ``A``, ``A'``, ``X``, and ``X'`` are given a
specific ordering even though phi semantics are that the assignments are
parallel or unordered.  This sometimes imposes false live range overlaps and
leads to poorer register allocation.

Advanced phi lowering
^^^^^^^^^^^^^^^^^^^^^

``O2`` lowering defers phi lowering until after register allocation to avoid the
problem of false live range overlaps.  It works as follows.  We split each
incoming edge and move the (parallel) phi assignments into the split nodes.  We
linearize each set of assignments by finding a safe, topological ordering of the
assignments, respecting register assignments as well.  For example::

    A = B
    X = Y

Normally these assignments could be executed in either order, but if ``B`` and
``X`` are assigned the same physical register, we would want to use the above
ordering.  Dependency cycles are broken by introducing a temporary.  For
example::

    A = B
    B = A

Here, a temporary breaks the cycle::

    t = A
    A = B
    B = t

Finally, we use the existing target lowering to lower the assignments in this
basic block, and once that is done for all basic blocks, we run the post-phi
variant of register allocation on the edge-split basic blocks.

When computing a topological order, we try to first schedule assignments whose
source has a physical register, and last schedule assignments whose destination
has a physical register.  This helps reduce register pressure.

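The linearization itself follows the standard parallel-move scheme: emit any
assignment whose destination is not still needed as a source, and when only
cycles remain, save one destination in a temporary exactly as in the example
above.  A self-contained sketch (ignoring the register-based scheduling
heuristic) might look like this::

    #include <cstddef>
    #include <string>
    #include <vector>

    struct Move { std::string Dest, Src; };

    std::vector<Move> linearize(std::vector<Move> Pending) {
      std::vector<Move> Out;
      int TempCount = 0;
      while (!Pending.empty()) {
        bool Progress = false;
        for (std::size_t I = 0; I < Pending.size(); ++I) {
          bool DestIsRead = false;
          for (std::size_t J = 0; J < Pending.size(); ++J)
            if (J != I && Pending[J].Src == Pending[I].Dest)
              DestIsRead = true;
          if (!DestIsRead) {   // safe: nothing still needs the old value
            Out.push_back(Pending[I]);
            Pending.erase(Pending.begin() + I);
            Progress = true;
            break;
          }
        }
        if (!Progress) {       // only cycles remain; break one with a temp
          std::string Temp = "t" + std::to_string(TempCount++);
          Out.push_back({Temp, Pending[0].Dest});  // save the clobbered value
          for (std::size_t J = 1; J < Pending.size(); ++J)
            if (Pending[J].Src == Pending[0].Dest)
              Pending[J].Src = Temp;               // readers use the saved copy
        }
      }
      return Out;
    }

For the ``A = B; B = A`` cycle this produces exactly the ``t = A; A = B; B = t``
sequence shown above.
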
X86 address mode inference
^^^^^^^^^^^^^^^^^^^^^^^^^^

We try to take advantage of the x86 addressing mode that includes a base
register, an index register, an index register scale amount, and an immediate
offset.  We do this through simple pattern matching.  Starting with a load or
store instruction where the address is a variable, we initialize the base
register to that variable, and look up the instruction where that variable is
defined.  If that is an add instruction of two variables and the index register
hasn't been set, we replace the base and index register with those two
variables.  If instead it is an add instruction of a variable and a constant, we
replace the base register with the variable and add the constant to the
immediate offset.

There are several more patterns that can be matched.  This pattern matching
continues on the load or store instruction until no more matches are found.
Because a program typically has few load and store instructions (not to be
confused with instructions that manipulate stack variables), this address mode
inference pass is fast.

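In outline, the matching loop looks something like the following sketch, which
folds an add-of-variables into base+index and an add-of-constant into the
offset.  The operand and instruction types are stand-ins rather than Subzero's
actual classes::

    #include <cstdint>

    struct AddrVar;
    struct DefInst {
      enum Kind { AddVars, AddVarConst, Other } K = Other;
      AddrVar *Src0 = nullptr;   // for AddVars / AddVarConst
      AddrVar *Src1 = nullptr;   // for AddVars
      int32_t Imm = 0;           // for AddVarConst
    };
    struct AddrVar { DefInst *Def = nullptr; };  // defining instruction, if known

    struct MemOperand {
      AddrVar *Base = nullptr;
      AddrVar *Index = nullptr;
      int32_t Shift = 0;         // index is scaled by 1 << Shift
      int32_t Offset = 0;
    };

    // Starting from "load/store through Addr", repeatedly fold the defining
    // instruction of the base variable into the addressing mode.
    MemOperand inferAddressMode(AddrVar *Addr) {
      MemOperand M;
      M.Base = Addr;
      bool Changed = true;
      while (Changed) {
        Changed = false;
        DefInst *Def = M.Base ? M.Base->Def : nullptr;
        if (!Def)
          break;
        if (Def->K == DefInst::AddVars && M.Index == nullptr) {
          M.Base = Def->Src0;    // base := one addend
          M.Index = Def->Src1;   // index := the other addend (scale 1)
          Changed = true;
        } else if (Def->K == DefInst::AddVarConst) {
          M.Base = Def->Src0;    // fold the constant into the offset
          M.Offset += Def->Imm;
          Changed = true;
        }
        // The real pass matches several more patterns (e.g. folding a shift
        // into the index scale); they follow the same shape.
      }
      return M;
    }
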
X86 read-modify-write inference
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A reasonably common bitcode pattern is a non-atomic update of a memory
location::

    x = load addr
    y = add x, 1
    store y, addr

On x86, with good register allocation, the Subzero passes described above
generate code of only this quality::

    mov [%ebx], %eax
    add $1, %eax
    mov %eax, [%ebx]

However, x86 allows for this kind of code::

    add $1, [%ebx]

which requires fewer instructions, but perhaps more importantly, requires fewer
physical registers.

It's also important to note that this transformation only makes sense if the
store instruction ends ``x``'s live range.

Subzero's ``O2`` recipe includes an early pass to find read-modify-write (RMW)
opportunities via simple pattern matching.  The only problem is that it is run
before liveness analysis, which is needed to determine whether ``x``'s live
range ends after the RMW.  Since liveness analysis is one of the most expensive
passes, it's not attractive to run it an extra time just for RMW analysis.
Instead, we essentially generate both the RMW and the non-RMW versions, and then
during lowering, the RMW version deletes itself if it finds ``x`` still live.

X86 compare-branch inference
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the LLVM instruction set, the compare/branch pattern works like this::

    cond = icmp eq a, b
    br cond, target

The result of the icmp instruction is a single bit, and a conditional branch
tests that bit.  By contrast, most target architectures use this pattern::

    cmp a, b  // implicitly sets various bits of FLAGS register
    br eq, target  // branch on a particular FLAGS bit

A naive lowering sequence conditionally sets ``cond`` to 0 or 1, then tests
``cond`` and conditionally branches.  Subzero has a pass that identifies
boolean-based operations like this and folds them into a single
compare/branch-like operation.  It is set up for more than just cmp/br, though.
Boolean producers include icmp (integer compare), fcmp (floating-point compare),
and trunc (integer truncation when the destination has bool type).  Boolean
consumers include branch, select (the ternary operator from the C language), and
sign-extend and zero-extend when the source has bool type.

Sandboxing
^^^^^^^^^^

Native Client's sandbox model uses software fault isolation (SFI) to provide
safety when running untrusted code in a browser or other environment.  Subzero
implements Native Client's `sandboxing
<https://developer.chrome.com/native-client/reference/sandbox_internals/index>`_
to enable Subzero-translated executables to be run inside Chrome.  Subzero also
provides a fairly simple framework for investigating alternative sandbox models
or other restrictions on the sandbox model.

Sandboxing in Subzero is not actually implemented as a separate pass, but is
integrated into lowering and assembly.

- Indirect branches, including the ret instruction, are masked to a bundle
  boundary and bundle-locked.

- Call instructions are aligned to the end of the bundle so that the return
  address is bundle-aligned.

- Indirect branch targets, including function entry and targets in a switch
  statement jump table, are bundle-aligned.

- The intrinsic for reading the thread pointer is inlined appropriately.

- For x86-64, non-stack memory accesses are with respect to the reserved sandbox
  base register.  We reduce the aggressiveness of address mode inference to
  leave room for the sandbox base register during lowering.  There are no memory
  sandboxing changes for x86-32.

Code emission
-------------

Subzero's integrated assembler is derived from Dart's `assembler code
<https://github.com/dart-lang/sdk/tree/master/runtime/vm>`_.  There is a pass
that iterates through the low-level ICE instructions and invokes the relevant
assembler functions.  Placeholders are added for later fixup of branch target
offsets.  (Backward branches use short offsets if possible; forward branches
generally use long offsets unless it is an intra-block branch of "known" short
length.)  The assembler emits into a staging buffer.  Once emission into the
staging buffer for a function is complete, the data is emitted to the output
file as an ELF object file, and metadata such as relocations, the symbol table,
and the string table are accumulated for emission at the end.  Global data
initializers are emitted similarly.  A key point is that at this point, the
staging buffer can be deallocated, and only a minimum of data needs to be held
until the end.

As a debugging alternative, Subzero can emit textual assembly code which can
then be run through an external assembler.  This is of course super slow, but
quite valuable when bringing up a new target.

As another debugging option, the staging buffer can be emitted as textual
assembly, primarily in the form of ".byte" lines.  This allows the assembler to
be tested separately from the ELF-related code.

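The placeholder-and-fixup idea for branch targets can be sketched as follows:
emit a zero displacement, remember where it went in the staging buffer, and
patch it once the target's offset is known.  This is an illustrative
stand-alone example, not Subzero's assembler interface::

    #include <cstdint>
    #include <vector>

    class StagingBuffer {
    public:
      struct Fixup { std::size_t Pos; };   // where a rel32 needs patching

      std::size_t size() const { return Bytes.size(); }
      void emit8(uint8_t B) { Bytes.push_back(B); }

      // Emit "jmp rel32" with a placeholder displacement; return the fixup.
      Fixup emitJmpPlaceholder() {
        emit8(0xE9);                        // x86 jmp rel32 opcode
        Fixup F{Bytes.size()};
        for (int I = 0; I < 4; ++I) emit8(0);   // placeholder displacement
        return F;
      }

      // Once the target label is bound at TargetOffset, patch the displacement.
      void patch(Fixup F, std::size_t TargetOffset) {
        int32_t Rel = static_cast<int32_t>(TargetOffset) -
                      static_cast<int32_t>(F.Pos + 4);  // relative to next insn
        for (int I = 0; I < 4; ++I)
          Bytes[F.Pos + I] = static_cast<uint8_t>((Rel >> (8 * I)) & 0xFF);
      }

    private:
      std::vector<uint8_t> Bytes;
    };
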
|   912 Memory management |  | 
|   913 ----------------- |  | 
|   914  |  | 
|   915 Where possible, we allocate from a ``CfgLocalAllocator`` which derives from |  | 
|   916 LLVM's ``BumpPtrAllocator``.  This is an arena-style allocator where objects |  | 
|   917 allocated from the arena are never actually freed; instead, when the CFG |  | 
|   918 translation completes and the CFG is deleted, the entire arena memory is |  | 
|   919 reclaimed at once.  This style of allocation works well in an environment like a |  | 
|   920 compiler where there are distinct phases with only easily-identifiable objects |  | 
|   921 living across phases.  It frees the developer from having to manage object |  | 
|   922 deletion, and it amortizes deletion costs across just a single arena deletion at |  | 
|   923 the end of the phase.  Furthermore, it helps scalability by allocating entirely |  | 
|   924 from thread-local memory pools, and minimizing global locking of the heap. |  | 

Instructions are probably the most heavily allocated complex class in Subzero.
We represent an instruction list as an intrusive doubly linked list, allocate
all instructions from the ``CfgLocalAllocator``, and make sure each instruction
subclass is basically `POD
<http://en.cppreference.com/w/cpp/concept/PODType>`_ (Plain Old Data) with a
trivial destructor.  This way, when the CFG is finished, we don't need to
individually deallocate every instruction.  We do the same for Variables,
probably the second most heavily allocated complex class.

There are some situations where passes need to use some `STL container class
<http://en.cppreference.com/w/cpp/container>`_.  Subzero has a way of using the
``CfgLocalAllocator`` as the container allocator if this is needed.
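
The following sketch shows the shape of that idea: a minimal STL-compatible
allocator that hands out arena memory and makes ``deallocate`` a no-op.  The
``Arena`` and ``ArenaAllocator`` names are hypothetical stand-ins, not Subzero's
real types::

  #include <cstddef>
  #include <vector>

  // Tiny arena stand-in so the example is self-contained: individual blocks
  // are only released when the arena itself is destroyed.
  struct Arena {
    std::vector<char *> Blocks;
    char *allocate(std::size_t N) {
      char *P = new char[N];
      Blocks.push_back(P);
      return P;
    }
    ~Arena() {
      for (char *P : Blocks)
        delete[] P;
    }
  };

  // Minimal allocator adapter: containers draw from the arena and skip
  // per-element deallocation, matching the CFG-local lifetime model.
  template <class T> struct ArenaAllocator {
    using value_type = T;
    Arena *A;
    explicit ArenaAllocator(Arena *APtr) : A(APtr) {}
    template <class U> ArenaAllocator(const ArenaAllocator<U> &O) : A(O.A) {}
    T *allocate(std::size_t N) {
      return reinterpret_cast<T *>(A->allocate(N * sizeof(T)));
    }
    void deallocate(T *, std::size_t) {}  // no-op; the arena reclaims en masse
    template <class U> bool operator==(const ArenaAllocator<U> &O) const {
      return A == O.A;
    }
    template <class U> bool operator!=(const ArenaAllocator<U> &O) const {
      return A != O.A;
    }
  };

  int main() {
    Arena A;
    ArenaAllocator<int> Alloc(&A);
    std::vector<int, ArenaAllocator<int>> V(Alloc);
    V.push_back(42);  // backing storage comes from, and dies with, the arena
    return 0;
  }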

Multithreaded translation
-------------------------

Subzero is designed to be able to translate functions in parallel.  With the
``-threads=N`` command-line option, there is a 3-stage producer-consumer
pipeline (the emitter's reordering step is sketched after this list):

- A single thread parses the ``.pexe`` file and produces a sequence of work
  units.  A work unit can be either a fully constructed CFG, or a set of global
  initializers.  The work unit includes its sequence number denoting its parse
  order.  Each work unit is added to the translation queue.

- There are N translation threads that draw work units from the translation
  queue and lower them into assembler buffers.  Each assembler buffer is added
  to the emitter queue, tagged with its sequence number.  The CFG and its
  ``CfgLocalAllocator`` are disposed of at this point.

- A single thread draws assembler buffers from the emitter queue and appends to
  the output file.  It uses the sequence numbers to reintegrate the assembler
  buffers according to the original parse order, such that output order is
  always deterministic.
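
The subtle part is the final stage: translation threads finish in arbitrary
order, but the emitter must write functions in parse order.  Below is a minimal
sketch of that reordering logic, with threads and queues elided; the
``OrderedEmitter`` name and the use of ``std::cout`` as the "output file" are
illustrative assumptions::

  #include <cstdint>
  #include <iostream>
  #include <map>
  #include <string>

  // Buffers completed work keyed by sequence number and emits it strictly in
  // parse order, so the output is deterministic regardless of thread timing.
  class OrderedEmitter {
    std::map<uint64_t, std::string> Pending;
    uint64_t NextSeq = 0;  // next sequence number to emit

  public:
    void add(uint64_t Seq, std::string Buffer) {
      Pending.emplace(Seq, std::move(Buffer));
      // Drain everything that is now contiguous with what was already emitted.
      for (auto It = Pending.find(NextSeq); It != Pending.end();
           It = Pending.find(++NextSeq)) {
        std::cout << It->second;  // append to the output file
        Pending.erase(It);
      }
    }
  };

  int main() {
    OrderedEmitter E;
    E.add(1, "function_1\n");  // arrives early, held back
    E.add(0, "function_0\n");  // emits 0, then the buffered 1
    E.add(2, "function_2\n");
    return 0;
  }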

This means that with ``-threads=N``, there are actually ``N+1`` spawned threads
for a total of ``N+2`` execution threads, taking the parser and emitter threads
into account.  For the special case of ``N=0``, execution is entirely sequential
-- the same thread parses, translates, and emits, one function at a time.  This
is useful for performance measurements.

Ideally, we would like to get near-linear scalability as the number of
translation threads increases.  We expect that ``-threads=1`` should be slightly
faster than ``-threads=0``, as the small amount of time spent parsing and
emitting is done largely in parallel with translation.  With perfect
scalability, ``-threads=N`` would translate ``N`` times as fast as
``-threads=1``, up until the point where parsing or emitting becomes the
bottleneck, or ``N+2`` exceeds the number of CPU cores.  In reality, memory
performance would become a bottleneck and efficiency might peak at, say, 75%.

Currently, parsing takes about 11% of total sequential time.  If translation
ever becomes so fast and awesomely scalable that parsing becomes the bottleneck,
it should be possible to make parsing multithreaded as well.

Internally, all shared, mutable data is held in the GlobalContext object, and
access to each field is guarded by a mutex.
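
A sketch of that pattern follows; the field and method names are illustrative
and not GlobalContext's actual interface::

  #include <cstddef>
  #include <mutex>
  #include <string>
  #include <vector>

  // Each piece of shared, mutable state is paired with its own mutex, and all
  // access goes through small locked accessors.
  class GlobalContextSketch {
    std::mutex StringPoolLock;
    std::vector<std::string> StringPool;

  public:
    std::size_t internString(std::string S) {
      std::lock_guard<std::mutex> Guard(StringPoolLock);
      StringPool.push_back(std::move(S));
      return StringPool.size() - 1;
    }
  };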

Security
--------

Subzero includes a number of security features in the generated code, as well as
in the Subzero translator itself, which run on top of the existing Native Client
sandbox as well as Chrome's OS-level sandbox.

Sandboxed translator
^^^^^^^^^^^^^^^^^^^^

When running inside the browser, the Subzero translator executes as sandboxed,
untrusted code that is initially checked by the validator, just like the
LLVM-based ``pnacl-llc`` translator.  As such, the Subzero binary should be no
more or less secure than the translator it replaces, from the point of view of
the Chrome sandbox.  That said, Subzero is much smaller than ``pnacl-llc`` and
was designed from the start with security in mind, so one expects fewer attacker
opportunities here.

Code diversification
^^^^^^^^^^^^^^^^^^^^

`Return-oriented programming
<https://en.wikipedia.org/wiki/Return-oriented_programming>`_ (ROP) is a
now-common technique for turning e.g. a known buffer overflow into a deeper
exploit.  The attacker scans the executable looking for ROP gadgets, which are
short sequences of code that happen to load known values into known registers
and then return.  An attacker who manages to overwrite parts of the stack can
overwrite it with carefully chosen return addresses such that certain ROP
gadgets are effectively chained together to set up the register state as
desired, finally returning to some code that manages to do something nasty based
on those register values.

If there is a popular ``.pexe`` with a large install base, the attacker could
run Subzero on it and scan the executable for suitable ROP gadgets to use as
part of a potential exploit.  Note that if the trusted validator is working
correctly, these ROP gadgets are limited to starting at a bundle boundary and
cannot use the trick of finding a gadget that happens to begin inside another
instruction.  All the same, gadgets with these constraints still exist and the
attacker has access to them.  This is the attack model we focus on most --
protecting the user against misuse of a "trusted" developer's application, as
opposed to mischief from a malicious ``.pexe`` file.

Subzero can mitigate these attacks to some degree through code diversification.
Specifically, we can apply some randomness to the code generation that makes ROP
gadgets less predictable.  This randomness can have some compile-time cost, and
it can affect the code quality; and some diversifications may be more effective
than others.  A more detailed treatment of hardening techniques may be found in
the Matasano report "`Attacking Clientside JIT Compilers
<https://www.nccgroup.trust/globalassets/resources/us/presentations/documents/attacking_clientside_jit_compilers_paper.pdf>`_".

To evaluate diversification effectiveness, we use a third-party ROP gadget
finder and limit its results to bundle-aligned addresses.  For a given
diversification technique, we run it with a number of different random seeds,
find ROP gadgets for each version, and determine how persistent each ROP gadget
is across the different versions.  A gadget is persistent if the same gadget is
found at the same code address.  The best diversifications are ones with low
gadget persistence rates.
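
As a sketch of the metric itself, persistence can be computed by checking which
(address, gadget) pairs from a baseline build survive unchanged in every
differently-seeded build.  This is illustrative measurement tooling, not part of
Subzero::

  #include <cstddef>
  #include <cstdint>
  #include <set>
  #include <utility>
  #include <vector>

  // A gadget is identified by its code address plus a hash of its bytes.
  using Gadget = std::pair<uint32_t, uint64_t>;

  // Fraction of baseline gadgets found, unchanged and at the same address, in
  // every diversified build.  Lower is better.
  double persistenceRate(const std::set<Gadget> &Baseline,
                         const std::vector<std::set<Gadget>> &Diversified) {
    if (Baseline.empty())
      return 0.0;
    std::size_t Persistent = 0;
    for (const Gadget &G : Baseline) {
      bool InAll = true;
      for (const std::set<Gadget> &Run : Diversified)
        InAll = InAll && Run.count(G) != 0;
      if (InAll)
        ++Persistent;
    }
    return double(Persistent) / double(Baseline.size());
  }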

Subzero implements 7 different diversification techniques.  Below is a
discussion of each technique, its effectiveness, and its cost.  The discussions
of cost and effectiveness are for a single diversification technique; the
translation-time costs for multiple techniques are additive, but the effects of
multiple techniques on code quality and effectiveness are not yet known.

In Subzero's implementation, each randomization is "repeatable" in a sense.
Each pass that includes a randomization option gets its own private instance of
a random number generator (RNG).  The RNG is seeded with a combination of a
global seed, the pass ID, and the function's sequence number.  The global seed
is designed to be different across runs (perhaps based on the current time), but
for debugging, the global seed can be set to a specific value and the results
will be repeatable.
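
A sketch of how such a per-pass, per-function seed might be combined (the
function name and the use of ``std::seed_seq``/``std::mt19937_64`` are
illustrative; Subzero's actual RNG wiring differs in detail)::

  #include <cstdint>
  #include <random>

  // Combine the global seed, a pass identifier, and the function's sequence
  // number so that each (pass, function) pair gets its own repeatable stream.
  std::mt19937_64 makePassRNG(uint64_t GlobalSeed, uint64_t PassID,
                              uint64_t FnSequenceNumber) {
    std::seed_seq Seq{GlobalSeed, PassID, FnSequenceNumber};
    return std::mt19937_64(Seq);
  }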

Subzero-generated code is subject to diversification once per translation, and
then Chrome caches the diversified binary for subsequent executions.  An
attacker may attempt to run the binary multiple times hoping for
higher-probability combinations of ROP gadgets.  When the attacker guesses
wrong, a likely outcome is an application crash.  Chrome throttles the creation
of crashy processes, which reduces the likelihood of the attacker eventually
gaining a foothold.

Constant blinding
~~~~~~~~~~~~~~~~~

Here, we prevent attackers from controlling large immediates in the text
(executable) section.  A random cookie is generated for each function, and if
the constant exceeds a specified threshold, the constant is obfuscated with the
cookie and equivalent code is generated.  For example, instead of this x86
instruction::

    mov $0x11223344, <%Reg/Mem>

the following code might be generated::

    mov $(0x11223344+Cookie), %temp
    lea -Cookie(%temp), %temp
    mov %temp, <%Reg/Mem>

The ``lea`` instruction is used rather than e.g. ``add``/``sub`` or ``xor``, to
prevent unintended effects on the flags register.

This transformation has almost no effect on translation time, and about 1%
impact on code quality, depending on the threshold chosen.  It does little to
reduce gadget persistence, but it does remove a lot of potential opportunities
to construct intra-instruction ROP gadgets (which an attacker could use only if
a validator bug were discovered, since the Native Client sandbox and associated
validator force returns and other indirect branches to be to bundle-aligned
addresses).

Constant pooling
~~~~~~~~~~~~~~~~

This is similar to constant blinding, in that large immediates are removed from
the text section.  In this case, each unique constant above the threshold is
stored in a read-only data section and the constant is accessed via a memory
load.  For the above example, the following code might be generated::
    mov $Label$1, %temp
    mov %temp, <%Reg/Mem>

This has a similarly small impact on translation time and ROP gadget
persistence, and a smaller (better) impact on code quality.  This is because it
uses fewer instructions, and in some cases good register allocation leads to no
increase in instruction count.  Note that this still gives an attacker limited
control over some text section values, unless we randomize the constant pool
layout.

Static data reordering
~~~~~~~~~~~~~~~~~~~~~~

This transformation limits the attacker's ability to control bits in global data
address references.  It simply permutes the order in memory of global variables
and internal constant pool entries.  For the constant pool, we only permute
within a type (i.e., emit a randomized list of ints, followed by a randomized
list of floats, etc.) to maintain good packing in the face of alignment
constraints.

As might be expected, this has no impact on code quality, translation time, or
ROP gadget persistence (though as above, it limits opportunities for
intra-instruction ROP gadgets with a broken validator).

Basic block reordering
~~~~~~~~~~~~~~~~~~~~~~

Here, we randomize the order of basic blocks within a function, with the
constraint that we still want to maintain a topological order as much as
possible, to avoid making the code too branchy.

This has no impact on code quality, and about 1% impact on translation time, due
to a separate pass to recompute layout.  It ends up having a huge effect on ROP
gadget persistence, tied for best with nop insertion, reducing persistence to
less than 5%.

Function reordering
~~~~~~~~~~~~~~~~~~~

Here, we permute the order in which functions are emitted, primarily to shift
ROP gadgets around to less predictable locations.  It may also change call
address offsets in case the attacker was trying to control that offset in the
code.

To control latency and memory footprint, we don't arbitrarily permute functions.
Instead, for some relatively small value of N, we queue up N assembler buffers,
then emit those N functions in random order, repeating until all functions are
emitted.
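
The windowed shuffle described above might look like the following sketch; the
``WindowedReorderer`` name, the choice of ``std::string`` for an assembler
buffer, and the flush policy are all illustrative assumptions::

  #include <algorithm>
  #include <cstddef>
  #include <cstdint>
  #include <random>
  #include <string>
  #include <vector>

  // Queue up to N finished functions, then flush them in a random order,
  // bounding both output latency and memory footprint.
  class WindowedReorderer {
    std::size_t N;
    std::mt19937_64 RNG;
    std::vector<std::string> Window;  // queued assembler buffers

    void flush() {
      std::shuffle(Window.begin(), Window.end(), RNG);
      for (const std::string &Buf : Window)
        (void)Buf;  // append Buf to the output file here
      Window.clear();
    }

  public:
    WindowedReorderer(std::size_t N, uint64_t Seed) : N(N), RNG(Seed) {}
    void add(std::string Buffer) {
      Window.push_back(std::move(Buffer));
      if (Window.size() == N)
        flush();
    }
    void finish() { flush(); }  // emit any functions still queued
  };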

Function reordering has no impact on translation time or code quality.
Measurements indicate that it reduces ROP gadget persistence to about 15%.

Nop insertion
~~~~~~~~~~~~~

This diversification randomly adds a nop instruction after each regular
instruction, with some probability.  Nop instructions of different lengths may
be selected.  Nop instructions are never added inside a ``bundle_lock`` region.
Note that when sandboxing is enabled, nop instructions are already being added
for bundle alignment, so the diversification nop instructions may simply be
taking the place of alignment nop instructions, though distributed differently
through the bundle.
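
A sketch of the per-instruction decision follows; the function name, the
probability parameter, and the 1-4 byte nop lengths are illustrative::

  #include <cstddef>
  #include <random>
  #include <vector>

  // NopAfter[i] is the length in bytes of the nop to insert after instruction
  // i (0 means no nop).  Instructions inside bundle_lock regions are skipped.
  std::vector<int> planNops(const std::vector<bool> &InBundleLock, double P,
                            std::mt19937_64 &RNG) {
    std::bernoulli_distribution InsertNop(P);
    std::uniform_int_distribution<int> NopLength(1, 4);
    std::vector<int> NopAfter(InBundleLock.size(), 0);
    for (std::size_t I = 0; I < InBundleLock.size(); ++I)
      if (!InBundleLock[I] && InsertNop(RNG))
        NopAfter[I] = NopLength(RNG);
    return NopAfter;
  }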

In Subzero's current implementation, nop insertion adds 3-5% to the
translation time, but this is probably because it is implemented as a separate
pass that adds actual nop instructions to the IR.  The overhead would probably
be a lot less if it were integrated into the assembler pass.  The code quality
is also reduced by 3-5%, making nop insertion the most expensive of the
diversification techniques.

Nop insertion is very effective in reducing ROP gadget persistence, at the same
level as basic block randomization (less than 5%).  But given nop insertion's
impact on translation time and code quality, one would most likely prefer to use
basic block randomization instead (though the combined effects of the different
diversification techniques have not yet been studied).

Register allocation randomization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In this diversification, the register allocator tries to make different but
mostly functionally equivalent choices, while maintaining stable code quality.

A naive approach would be the following.  Whenever the allocator has more than
one choice for assigning a register, choose randomly among those options.  And
whenever there are no registers available and there is a tie for the
lowest-weight variable, randomly select one of the lowest-weight variables to
evict.  Because of the one-pass nature of the linear-scan algorithm, this
randomization strategy can have a large impact on which variables are ultimately
assigned registers, with a corresponding large impact on code quality.

Instead, we choose an approach that tries to keep code quality stable regardless
of the random seed.  We partition the set of physical registers into equivalence
classes.  If a register is pre-colored in the function (i.e., referenced
explicitly by name), it forms its own equivalence class.  The remaining
registers are partitioned according to their combination of attributes such as
integer versus floating-point, 8-bit versus 32-bit, caller-saved versus
callee-saved, etc.  Each equivalence class is randomly permuted, and the
complete permutation is applied to the final register assignments.
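
A sketch of that permutation step: registers are grouped by an attribute key,
each group is shuffled independently, and the resulting mapping is applied to
the allocator's output.  The string-based attribute encoding is an illustrative
simplification::

  #include <algorithm>
  #include <cstddef>
  #include <map>
  #include <numeric>
  #include <random>
  #include <string>
  #include <vector>

  // Build a permutation of physical register numbers that only maps registers
  // onto other registers in the same equivalence class.  ClassOf[Reg] encodes
  // the attribute combination, e.g. "i32 callee-saved"; a pre-colored register
  // would get a unique key, making it a singleton class (and a fixed point).
  std::vector<int> makeRegisterPermutation(
      const std::vector<std::string> &ClassOf, std::mt19937_64 &RNG) {
    std::map<std::string, std::vector<int>> Classes;
    for (int Reg = 0; Reg < int(ClassOf.size()); ++Reg)
      Classes[ClassOf[Reg]].push_back(Reg);

    std::vector<int> Perm(ClassOf.size());
    std::iota(Perm.begin(), Perm.end(), 0);
    for (auto &Entry : Classes) {
      std::vector<int> Shuffled = Entry.second;
      std::shuffle(Shuffled.begin(), Shuffled.end(), RNG);
      for (std::size_t I = 0; I < Shuffled.size(); ++I)
        Perm[Entry.second[I]] = Shuffled[I];  // stays within the class
    }
    return Perm;
  }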

Register randomization reduces ROP gadget persistence to about 10% on average,
though there tends to be fairly high variance across functions and applications.
This probably has to do with the set of restrictions in the x86-32 instruction
set and ABI, such as few general-purpose registers, ``%eax`` used for return
values, ``%edx`` used for division, ``%cl`` used for shifting, etc.  As
intended, register randomization has no impact on code quality, and only a
slight (0.5%) impact on translation time due to an extra scan over the variables
to identify pre-colored registers.

Fuzzing
^^^^^^^

We have started fuzz-testing the ``.pexe`` files input to Subzero, using a
combination of `afl-fuzz <http://lcamtuf.coredump.cx/afl/>`_, LLVM's `libFuzzer
<http://llvm.org/docs/LibFuzzer.html>`_, and custom tooling.  The purpose is to
find and fix cases where Subzero crashes or otherwise ungracefully fails on
unexpected inputs, and to do so automatically over a large range of unexpected
inputs.  By fixing bugs that arise from fuzz testing, we reduce the possibility
of an attacker exploiting these bugs.

Most of the problems found so far are ones most appropriately handled in the
parser.  However, there have been a couple that have identified problems in the
lowering, or that otherwise inappropriately triggered assertion failures and
fatal errors.  We continue to dig into this area.

Future security work
^^^^^^^^^^^^^^^^^^^^

Subzero is well-positioned to explore other future security enhancements, e.g.:

- Tightening the Native Client sandbox.  ABI changes, such as the previous work
  on `hiding the sandbox base address
  <https://docs.google.com/document/d/1eskaI4353XdsJQFJLRnZzb_YIESQx4gNRzf31dqXVG8>`_
  in x86-64, are easy to experiment with in Subzero.

- Making the executable code section read-only.  This would prevent a PNaCl
  application from inspecting its own binary and trying to find ROP gadgets even
  after code diversification has been performed.  It may still be susceptible to
  `blind ROP <http://www.scs.stanford.edu/brop/bittau-brop.pdf>`_ attacks, but
  security is still improved overall.

- Instruction selection diversification.  It may be possible to lower a given
  instruction in several largely equivalent ways, which gives more opportunities
  for code randomization.
Chrome integration
------------------

Currently Subzero is available in Chrome for the x86-32 architecture, but under
a flag.  When the flag is enabled, Subzero is used when the `manifest file
<https://developer.chrome.com/native-client/reference/nacl-manifest-format>`_
linking to the ``.pexe`` file specifies the ``O0`` optimization level.

The next step is to remove the flag, i.e. invoke Subzero as the only translator
for ``O0``-specified manifest files.

Ultimately, Subzero might produce code rivaling LLVM ``O2`` quality, in which
case Subzero could be used for all PNaCl translation.

Command line options
--------------------

Subzero has a number of command-line options for debugging and diagnostics.
Among the more interesting are the following.

- Using the ``-verbose`` flag, Subzero will dump the CFG, or produce other
  diagnostic output, with various levels of detail after each pass.  Instruction
  numbers can be printed or suppressed.  Deleted instructions can be printed or
  suppressed (they are retained in the instruction list, as discussed earlier,
  because they can help explain how lower-level instructions originated).
  Liveness information can be printed when available.  Details of register
  allocation can be printed as register allocator decisions are made.  And more.

- Running Subzero with any level of verbosity produces an enormous amount of
  output.  When debugging a single function, verbose output can be suppressed
  except for a particular function.  The ``-verbose-focus`` flag suppresses
  verbose output except for the specified function.

- Subzero has a ``-timing`` option that prints a breakdown of pass-level timing
  at exit.  Timing markers can be placed in the Subzero source code to demarcate
  logical operations or passes of interest.  Basic timing information plus
  call-stack type timing information is printed at the end.

- Along with ``-timing``, the user can instead get a report on the overall
  translation time for each function, to help focus on timing outliers.  Also,
  ``-timing-focus`` limits the ``-timing`` reporting to a single function,
  instead of aggregating pass timing across all functions.

- The ``-szstats`` option reports various statistics on each function, such as
  stack frame size, static instruction count, etc.  It may be helpful to track
  these stats over time as Subzero is improved, as an approximate measure of
  code quality.

- The flag ``-asm-verbose``, in conjunction with emitting textual assembly
  output, annotates the assembly output with register-focused liveness
  information.  In particular, each basic block is annotated with which
  registers are live-in and live-out, and each instruction is annotated with
  which registers' and stack locations' live ranges end at that instruction.
  This is really useful when studying the generated code to find opportunities
  for code quality improvements.
Testing and debugging
---------------------

LLVM lit tests
^^^^^^^^^^^^^^

For basic testing, Subzero uses LLVM's `lit
<http://llvm.org/docs/CommandGuide/lit.html>`_ framework for running tests.  We
have a suite of hundreds of small functions where we test for particular
assembly code patterns across different target architectures.

Cross tests
^^^^^^^^^^^

Unfortunately, the lit tests don't do a great job of precisely testing the
correctness of the output.  Much better are the cross tests, which are execution
tests that compare Subzero-translated and ``pnacl-llc``-translated bitcode
across a wide variety of interesting inputs.  Each cross test consists of a set
of C, C++, and/or low-level bitcode files.  The C and C++ source files are
compiled down to bitcode.  The bitcode files are translated by ``pnacl-llc`` and
also by Subzero.  Subzero mangles global symbol names with a special prefix to
avoid duplicate symbol errors.  A driver program invokes both versions on a
large set of interesting inputs, and reports when the Subzero and ``pnacl-llc``
results differ.  Cross tests turn out to be an excellent way of testing the
basic lowering patterns, but they are less useful for testing more global things
like liveness analysis and register allocation.
Bisection debugging
^^^^^^^^^^^^^^^^^^^

Sometimes with a new application, Subzero will end up producing incorrect code
that either crashes at runtime or otherwise produces the wrong results.  When
this happens, we need to narrow it down to a single function (or small set of
functions) that yields incorrect behavior.  For this, we have a bisection
debugging framework.  Here, we initially translate the entire application once
with Subzero and once with ``pnacl-llc``.  We then use ``objcopy`` to
selectively weaken symbols based on a whitelist or blacklist provided on the
command line.  The two object files can then be linked together without link
errors, with the desired version of each method "winning".  Then the binary is
tested, and bisection proceeds based on whether the binary produces correct
output.

When the bisection completes, we are left with a minimal set of
Subzero-translated functions that cause the failure.  Usually it is a single
function, though sometimes it might require a combination of several functions
to cause a failure; this may be due to an incorrect call ABI, for example.
However, Murphy's Law implies that the single failing function is enormous and
impractical to debug.  In that case, we can restart the bisection, explicitly
blacklisting the enormous function, and try to find another candidate to debug.
(Future work is to automate this to find all minimal sets of functions, so that
debugging can focus on the simplest example.)
Fuzz testing
^^^^^^^^^^^^

As described above, we try to find internal Subzero bugs using fuzz testing
techniques.

Sanitizers
^^^^^^^^^^

Subzero can be built with `AddressSanitizer
<http://clang.llvm.org/docs/AddressSanitizer.html>`_ (ASan) or `ThreadSanitizer
<http://clang.llvm.org/docs/ThreadSanitizer.html>`_ (TSan) support.  This is
done using something as simple as ``make ASAN=1`` or ``make TSAN=1``.  So far,
multithreading has been simple enough that TSan hasn't found any bugs, but ASan
has found at least one memory leak, which was subsequently fixed.
`UndefinedBehaviorSanitizer
<http://clang.llvm.org/docs/UsersManual.html#controlling-code-generation>`_
(UBSan) support is in progress.  `Control flow integrity sanitization
<http://clang.llvm.org/docs/ControlFlowIntegrity.html>`_ is also under
consideration.
Current status
==============

Target architectures
--------------------

Subzero is currently more or less complete for the x86-32 target.  It has been
refactored and extended to handle x86-64 as well, and that work is mostly
complete at this point.

ARM32 work is in progress.  It currently lacks the testing level of x86, at
least in part because Subzero's register allocator needs modifications to handle
ARM's aliasing of floating-point and vector registers.  Specifically, a 64-bit
register is actually a gang of two consecutive and aligned 32-bit registers, and
a 128-bit register is a gang of 4 consecutive and aligned 32-bit registers.
ARM64 work has not started; when it does, it will be native-only, since the
Native Client sandbox model, validator, and other tools have never been defined
for that architecture.

An external contributor is adding MIPS support, in large part by following the
ARM32 work.
Translator performance
----------------------

Single-threaded translation speed is currently about 5× the ``pnacl-llc``
translation speed.  For a large ``.pexe`` file, the time breaks down as:

- 11% for parsing and initial IR building

- 4% for emitting to /dev/null

- 27% for liveness analysis (two liveness passes plus live range construction)

- 15% for linear-scan register allocation

- 9% for basic lowering

- 10% for advanced phi lowering

- ~11% for other minor analysis

- ~10% measurement overhead to acquire these numbers

Some improvements could undoubtedly be made, but it will be hard to reach 10×
the ``pnacl-llc`` translation speed while keeping acceptable code quality.  With
``-Om1`` (lack of) optimization, we do actually achieve roughly 10× the
``pnacl-llc`` translation speed, but code quality drops by a factor of 3.
Code quality
------------

Measured across 16 components of spec2k, Subzero's code quality is uniformly
better than ``pnacl-llc`` ``-O0`` code quality, and in many cases solidly
between ``pnacl-llc`` ``-O0`` and ``-O2``.

Translator size
---------------

When built in MINIMAL mode, the x86-64 native translator size for the x86-32
target is about 700 KB, not including the size of functions referenced in
dynamically-linked libraries.  The sandboxed version of Subzero is a bit over 1
MB; it is statically linked and also includes nop padding for bundling as well
as indirect branch masking.

Translator memory footprint
---------------------------

It's hard to draw firm conclusions about memory footprint, since the footprint
is at least proportional to the input function size, and there is no real limit
on the size of functions in the ``.pexe`` file.
That said, we looked at the memory footprint over time as Subzero translated
``pnacl-llc.pexe``, which is the largest ``.pexe`` file (7.2 MB) at our
disposal.  One of LLVM's libraries that Subzero uses can report the current
malloc heap usage.  With single-threaded translation, Subzero tends to hover
around 15 MB of memory usage.  There are a couple of monstrous functions where
Subzero grows to around 100 MB, but then it drops back down after those
functions finish translating.  In contrast, ``pnacl-llc`` grows larger and
larger throughout translation, reaching several hundred MB by the time it
completes.

It's a bit more interesting when we enable multithreaded translation.  When
there are N translation threads, Subzero implements a policy that limits the
size of the translation queue to N entries -- if it is "full" when the parser
tries to add a new CFG, the parser blocks until one of the translation threads
removes a CFG.  This means the number of in-memory CFGs can (and generally does)
reach 2*N+1 (N waiting in the queue, N being translated, and one being built by
the parser), and so the memory footprint rises in proportion to the number of
threads.  Adding to the pressure, the monstrous functions also take
proportionally longer to translate, so there's a good chance many of them will
be active at the same time with multithreaded translation.  As a result, for
N=32, Subzero's memory footprint peaks at about 260 MB, but drops back down as
the large functions finish translating.

If this peak memory size becomes a problem, it might be possible for the parser
to resequence the functions to try to spread out the larger functions, or to
throttle the translation queue to prevent too many in-flight large functions.
It may also be possible to throttle based on memory pressure signaling from
Chrome.

Translator scalability
----------------------

Currently scalability is "not very good".  Multiple translation threads lead to
faster translation, but not to the degree desired.  We haven't dug in to
investigate yet.

There are a few areas to investigate.  First, there may be contention on the
constant pool, which all threads access, and which requires locked access even
for reading.  This could be mitigated by keeping a CFG-local cache of the most
common constants.
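
One possible shape for that mitigation, as a sketch (the names are hypothetical
and this is not implemented in Subzero)::

  #include <cstdint>
  #include <unordered_map>

  struct Constant;  // whatever the global pool hands back

  // Assumed to exist elsewhere: a lookup in the shared pool that must take a
  // lock even for reads.
  Constant *lookupGlobalConstant(int64_t Value);

  // CFG-local cache: repeated requests for the same constant within one
  // function hit unshared memory instead of the locked global pool.
  class CfgConstantCache {
    std::unordered_map<int64_t, Constant *> Cache;

  public:
    Constant *get(int64_t Value) {
      auto It = Cache.find(Value);
      if (It != Cache.end())
        return It->second;
      Constant *C = lookupGlobalConstant(Value);  // locked, global
      Cache.emplace(Value, C);
      return C;
    }
  };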

Second, there may be contention on memory allocation.  While almost all CFG
objects are allocated from the CFG-local allocator, some passes use temporary
STL containers that use the default allocator, which may require global locking.
This could be mitigated by switching these to the CFG-local allocator.

Third, multithreading may make the default allocator strategy more expensive.
In a single-threaded environment, a pass will allocate its containers, run the
pass, and deallocate the containers.  This results in stack-like allocation
behavior and makes the heap free list easier to manage, with less heap
fragmentation.  But when multithreading is added, the allocations and
deallocations become much less stack-like, making allocation and deallocation
operations individually more expensive.  Again, this could be mitigated by
switching these to the CFG-local allocator.
| OLD | NEW |