Chromium Code Reviews
Side by Side Diff: DESIGN.rst

Issue 1309073003: Subzero: Add a detailed design document. (Closed) Base URL: https://chromium.googlesource.com/native_client/pnacl-subzero.git@master
Patch Set: More code review changes Created 5 years, 3 months ago
1 Design of the Subzero fast code generator
2 =========================================
3
4 Introduction
5 ------------
6
7 The `Portable Native Client (PNaCl) <http://gonacl.com>`_ project includes
8 compiler technology based on `LLVM <http://llvm.org/>`_. The developer uses the
9 PNaCl toolchain to compile their application to architecture-neutral PNaCl
10 bitcode (a ``.pexe`` file), using as much architecture-neutral optimization as
11 possible. The ``.pexe`` file is downloaded to the user's browser where the
12 PNaCl translator (a component of Chrome) compiles the ``.pexe`` file to
13 `sandboxed
14 <https://developer.chrome.com/native-client/reference/sandbox_internals/index>`_
15 native code. The translator uses architecture-specific optimizations as much as
16 practical to generate good native code.
17
18 The native code can be cached by the browser to avoid repeating translation on
19 future page loads. However, first-time user experience is hampered by long
20 translation times. The LLVM-based PNaCl translator is pretty slow, even when
21 using ``-O0`` to minimize optimizations, so delays are especially noticeable on
22 slow browser platforms such as ARM-based Chromebooks.
23
24 Translator slowness can be mitigated or hidden in a number of ways.
25
26 - Parallel translation. However, slow machines where this matters most, e.g.
27 ARM-based Chromebooks, are likely to have fewer cores to parallelize across,
28 and are likely to have less memory available for multiple translation
29 threads to use.
30
31 - Streaming translation, i.e. start translating as soon as the download starts.
32 This doesn't help much when translation is 10× slower than download speed,
33 or when the ``.pexe`` file is already cached but the translated binary was
34 flushed from the cache.
35
36 - Arrange the web page such that translation is done in parallel with
37 downloading large assets.
38
39 - Arrange the web page to distract the user with `cat videos
40 <https://www.youtube.com/watch?v=tLt5rBfNucc>`_ while translation is in
41 progress.
42
43 Or, improve translator performance to something more reasonable.
44
45 This document describes Subzero's attempt to improve translation speed by an
46 order of magnitude while rivaling LLVM's code quality. Subzero does this
47 through minimal IR layering, lean data structures and passes, and a careful
48 selection of fast optimization passes. It has two optimization recipes: full
49 optimizations (``O2``) and minimal optimizations (``Om1``). The recipes are the
50 following (described in more detail below):
51
52 +-----------------------------------+-----------------------+
53 | O2 recipe | Om1 recipe |
54 +===================================+=======================+
55 | Parse .pexe file | Parse .pexe file |
56 +-----------------------------------+-----------------------+
57 | Address mode inference | |
58 +-----------------------------------+-----------------------+
59 | Read-modify-write (RMW) transform | |
60 +-----------------------------------+-----------------------+
61 | Basic liveness analysis | |
62 +-----------------------------------+-----------------------+
63 | Load optimization | |
64 +-----------------------------------+-----------------------+
65 | | Phi lowering (simple) |
66 +-----------------------------------+-----------------------+
67 | Target lowering | Target lowering |
68 +-----------------------------------+-----------------------+
69 | Full liveness analysis | |
70 +-----------------------------------+-----------------------+
71 | Register allocation | |
72 +-----------------------------------+-----------------------+
73 | Phi lowering (advanced) | |
74 +-----------------------------------+-----------------------+
75 | Post-phi register allocation | |
76 +-----------------------------------+-----------------------+
77 | Branch optimization | |
78 +-----------------------------------+-----------------------+
79 | Code emission | Code emission |
80 +-----------------------------------+-----------------------+
81
82 Goals
83 =====
84
85 Translation speed
86 -----------------
87
88 We'd like to be able to translate a ``.pexe`` file as fast as download speed.
89 Any faster is in a sense wasted effort. Download speed varies greatly, but
90 we'll arbitrarily say 1 MB/sec. We'll pick the ARM A15 CPU as the example of a
91 slow machine. We observe a 3× single-thread performance difference between A15
92 and a high-end x86 Xeon E5-2690 based workstation, and aggressively assume a
93 ``.pexe`` file could be compressed to 50% on the web server using gzip transport
94 compression, so we set the translation speed goal to 6 MB/sec on the high-end
95 Xeon workstation.
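
A quick sketch of the arithmetic behind the 6 MB/sec figure; the rate, ratio,
and compression numbers below are the assumptions stated above, not
measurements:

```python
# Back-of-the-envelope derivation of the 6 MB/sec Xeon goal, using the
# assumed numbers from the text above (rough estimates, not measurements).
download_rate_mb_s = 1.0  # assumed download speed, MB/sec
compression_ratio = 0.5   # .pexe assumed gzip-compressed to 50% in transport
cpu_ratio = 3.0           # observed A15 vs. Xeon single-thread slowdown

# The slow machine must translate the uncompressed bitcode at download
# speed; scaling by the CPU ratio gives the goal on the fast workstation.
xeon_goal_mb_s = download_rate_mb_s / compression_ratio * cpu_ratio
print(xeon_goal_mb_s)  # 6.0
```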
96
97 Currently, at the ``-O0`` level, the LLVM-based PNaCl translator translates at
98 ⅒ the target rate. The ``-O2`` mode takes 3× as long as the ``-O0`` mode.
99
100 In other words, Subzero's goal is to improve over LLVM's translation speed by
101 10×.
102
103 Code quality
104 ------------
105
106 Subzero's initial goal is to produce code that meets or exceeds LLVM's ``-O0``
107 code quality. The stretch goal is to approach LLVM ``-O2`` code quality. On
108 average, LLVM ``-O2`` performs twice as well as LLVM ``-O0``.
109
110 It's important to note that the quality of Subzero-generated code depends on
111 target-neutral optimizations and simplifications being run beforehand in the
112 developer environment. The ``.pexe`` file reflects these optimizations. For
113 example, Subzero assumes that the basic blocks are ordered topologically where
114 possible (which makes liveness analysis converge fastest), and Subzero does not
115 do any function inlining because it should already have been done.
116
117 Translator size
118 ---------------
119
120 The current LLVM-based translator binary (``pnacl-llc``) is about 10 MB in size.
121 We think 1 MB is a more reasonable size -- especially for a component that is
122 distributed to a billion Chrome users. Thus we target a 10× reduction in
123 binary size.
124
125 For development, Subzero can be built for all target architectures, and all
126 debugging and diagnostic options enabled. For a smaller translator, we restrict
127 to a single target architecture, and define a ``MINIMAL`` build where
128 unnecessary features are compiled out.
129
130 Subzero leverages some data structures from LLVM's ``ADT`` and ``Support``
131 include directories, which have little impact on translator size. It also uses
132 some of LLVM's bitcode decoding code (for binary-format ``.pexe`` files), again
133 with little size impact. In non-``MINIMAL`` builds, the translator size is much
134 larger due to including code for parsing text-format bitcode files and forming
135 LLVM IR.
136
137 Memory footprint
138 ----------------
139
140 The current LLVM-based translator suffers from an issue in which some
141 function-specific data has to be retained in memory until all translation
142 completes, and therefore the memory footprint grows without bound. Large
143 ``.pexe`` files can lead to the translator process holding hundreds of MB of
144 memory by the end. The translator runs in a separate process, so this memory
145 growth doesn't *directly* affect other processes, but it does dirty the physical
146 memory and contributes to a perception of bloat and sometimes a reality of
147 out-of-memory tab killing, especially noticeable on weaker systems.
148
149 Subzero should maintain a stable memory footprint throughout translation. It's
150 not practical to set a specific limit, because there is no practical bound
151 on a single function's size, but the footprint should be
152 "reasonable" and be proportional to the largest input function size, not the
153 total ``.pexe`` file size. Simply put, Subzero should not have memory leaks or
154 inexorable memory growth. (We use ASAN builds to test for leaks.)
155
156 Multithreaded translation
157 -------------------------
158
159 It should be practical to translate different functions concurrently and see
160 good scalability. Some locking may be needed, such as accessing output buffers
161 or constant pools, but that should be fairly minimal. In contrast, LLVM was
162 only designed for module-level parallelism, and as such, the PNaCl translator
163 internally splits a ``.pexe`` file into several modules for concurrent
164 translation. All output needs to be deterministic regardless of the level of
165 multithreading, i.e. functions and data should always be output in the same
166 order.
167
168 Target architectures
169 --------------------
170
171 Initial target architectures are x86-32, x86-64, ARM32, and MIPS32. Future
172 targets include ARM64 and MIPS64, though these targets currently lack NaCl
173 support, including a sandbox model and a validator.
174
175 The first implementation is for x86-32, because it was expected to be
176 particularly challenging, and thus more likely to draw out any design problems
177 early:
178
179 - There are a number of special cases, asymmetries, and warts in the x86
180 instruction set.
181
182 - Complex addressing modes may be leveraged for better code quality.
183
184 - 64-bit integer operations have to be lowered into longer sequences of 32-bit
185 operations.
186
187 - Paucity of physical registers may reveal code quality issues early in the
188 design.
189
190 Detailed design
191 ===============
192
193 Intermediate representation - ICE
194 ---------------------------------
195
196 Subzero's IR is called ICE. It is designed to be reasonably similar to LLVM's
197 IR, which is reflected in the ``.pexe`` file's bitcode structure. It has a
198 representation of global variables and initializers, and a set of functions.
199 Each function contains a list of basic blocks, and each basic block contains a
200 list of instructions. Instructions that operate on stack and register variables
201 do so using static single assignment (SSA) form.
202
203 The ``.pexe`` file is translated one function at a time (or in parallel by
204 multiple translation threads). The recipe for optimization passes depends on
205 the specific target and optimization level, and is described in detail below.
206 Global variables (types and initializers) are simply and directly translated to
207 object code, without any meaningful attempts at optimization.
208
209 A function's control flow graph (CFG) is represented by the ``Ice::Cfg`` class.
210 Its key contents include:
211
212 - A list of ``CfgNode`` pointers, generally held in topological order.
213
214 - A list of ``Variable`` pointers corresponding to local variables used in the
215 function plus compiler-generated temporaries.
216
217 A basic block is represented by the ``Ice::CfgNode`` class. Its key contents
218 include:
219
220 - A linear list of instructions, in the same style as LLVM. The last
221 instruction of the list is always a terminator instruction: branch, switch,
222 return, unreachable.
223
224 - A list of Phi instructions, also in the same style as LLVM. They are held as
225 a linear list for convenience, though per Phi semantics, they are executed "in
226 parallel" without dependencies on each other.
227
228 - An unordered list of ``CfgNode`` pointers corresponding to incoming edges, and
229 another list for outgoing edges.
230
231 - The node's unique, 0-based index into the CFG's node list.
232
233 An instruction is represented by the ``Ice::Inst`` class. Its key contents
234 include:
235
236 - A list of source operands.
237
238 - Its destination variable, if the instruction produces a result in an
239 ``Ice::Variable``.
240
241 - A bitvector indicating which variables' live ranges this instruction ends.
242 This is computed during liveness analysis.
243
244 Instruction kinds are divided into high-level ICE instructions and low-level
245 ICE instructions. High-level instructions consist of the PNaCl/LLVM bitcode
246 instruction kinds. Each target architecture implementation extends the
247 instruction space with its own set of low-level instructions. Generally,
248 low-level instructions correspond to individual machine instructions. The
249 high-level ICE instruction space includes a few additional instruction kinds
250 that are not part of LLVM but are generally useful (e.g., an Assignment
251 instruction), or are useful across targets (e.g., BundleLock and BundleUnlock
252 instructions for sandboxing).
253
254 Specifically, high-level ICE instructions that derive from LLVM (but with PNaCl
255 ABI restrictions as documented in the `PNaCl Bitcode Reference Manual
256 <https://developer.chrome.com/native-client/reference/pnacl-bitcode-abi>`_) are
257 the following:
258
259 - Alloca: allocate data on the stack
260
261 - Arithmetic: binary operations of the form ``A = B op C``
262
263 - Br: conditional or unconditional branch
264
265 - Call: function call
266
267 - Cast: unary type-conversion operations
268
269 - ExtractElement: extract a scalar element from a vector-type value
270
271 - Fcmp: floating-point comparison
272
273 - Icmp: integer comparison
274
275 - IntrinsicCall: call a known intrinsic
276
277 - InsertElement: insert a scalar element into a vector-type value
278
279 - Load: load a value from memory
280
281 - Phi: implement the SSA phi node
282
283 - Ret: return from the function
284
285 - Select: essentially the C language operation of the form ``X = C ? Y : Z``
286
287 - Store: store a value into memory
288
289 - Switch: generalized branch to multiple possible locations
290
291 - Unreachable: indicate that this portion of the code is unreachable
292
293 The additional high-level ICE instructions are the following:
294
295 - Assign: a simple ``A=B`` assignment. This is useful for e.g. lowering Phi
296 instructions to non-SSA assignments, before lowering to machine code.
297
298 - BundleLock, BundleUnlock. These are markers used for sandboxing, but are
299 common across all targets and so they are elevated to the high-level
300 instruction set.
301
302 - FakeDef, FakeUse, FakeKill. These are tools used to preserve consistency in
303 liveness analysis, elevated to the high-level because they are used by all
304 targets. They are described in more detail at the end of this section.
305
306 - JumpTable: this represents the result of switch optimization analysis, where
307 some switch instructions may use jump tables instead of cascading
308 compare/branches.
309
310 An operand is represented by the ``Ice::Operand`` class. In high-level ICE, an
311 operand is either an ``Ice::Constant`` or an ``Ice::Variable``. Constants
312 include scalar integer constants, scalar floating point constants, Undef (an
313 unspecified constant of a particular scalar or vector type), and symbol
314 constants (essentially addresses of globals). Note that the PNaCl ABI does not
315 include vector-type constants besides Undef, and as such, Subzero (so far) has
316 no reason to represent vector-type constants internally. A variable represents
317 a value allocated on the stack (though not including alloca-derived storage).
318 Among other things, a variable holds its unique, 0-based index into the CFG's
319 variable list.
320
321 Each target can extend the ``Constant`` and ``Variable`` classes for its own
322 needs. In addition, the ``Operand`` class may be extended, e.g. to define an
323 x86 ``MemOperand`` that encodes a base register, an index register, an index
324 register shift amount, and a constant offset.
325
326 Register allocation and liveness analysis are restricted to Variable operands.
327 Because of the importance of register allocation to code quality, and the
328 translation-time cost of liveness analysis, Variable operands get some special
329 treatment in ICE. Most notably, a frequent pattern in Subzero is to iterate
330 across all the Variables of an instruction. An instruction holds a list of
331 operands, but an operand may contain 0, 1, or more Variables. As such, the
332 ``Operand`` class specially holds a list of Variables contained within, for
333 quick access.
334
335 A Subzero transformation pass may work by deleting an existing instruction and
336 replacing it with zero or more new instructions. Instead of actually deleting
337 the existing instruction, we generally mark it as deleted and insert the new
338 instructions right after the deleted instruction. When printing the IR for
339 debugging, this is a big help because it makes it much clearer how the
340 non-deleted instructions came about.
341
342 Subzero has a few special instructions to help with liveness analysis
343 consistency.
344
345 - The FakeDef instruction gives a fake definition of some variable. For
346 example, on x86-32, a divide instruction defines both ``%eax`` and ``%edx``
347 but an ICE instruction can represent only one destination variable. The same
348 applies to multiply instructions, and to function calls that return a 64-bit
349 integer result in the ``%edx:%eax`` pair. Also, using the ``xor %eax, %eax``
350 trick to set ``%eax`` to 0 requires an initial FakeDef of ``%eax``.
351
352 - The FakeUse instruction registers a use of a variable, typically to prevent an
353 earlier assignment to that variable from being dead-code eliminated. For
354 example, lowering an operation like ``x=cc?y:z`` may be done using x86's
355 conditional move (cmov) instruction: ``mov z, x; cmov_cc y, x``. Without a
356 FakeUse of ``x`` between the two instructions, the liveness analysis pass may
357 dead-code eliminate the first instruction.
358
359 - The FakeKill instruction is added after a call instruction, and is a quick way
360 of indicating that caller-save registers are invalidated.
361
362 Pexe parsing
363 ------------
364
365 Subzero includes an integrated PNaCl bitcode parser for ``.pexe`` files. It
366 parses the ``.pexe`` file function by function, ultimately constructing an ICE
367 CFG for each function. After a function is parsed, its CFG is handed off to the
368 translation phase. The bitcode parser also parses global initializer data and
369 hands it off to be translated to data sections in the object file.
370
371 Subzero has another parsing strategy for testing/debugging. LLVM libraries can
372 be used to parse a module into LLVM IR (though very slowly relative to Subzero
373 native parsing). Then we iterate across the LLVM IR and construct high-level
374 ICE, handing off each CFG to the translation phase.
375
376 Overview of lowering
377 --------------------
378
379 In general, translation goes like this:
380
381 - Parse the next function from the ``.pexe`` file and construct a CFG consisting
382 of high-level ICE.
383
384 - Do analysis passes and transformation passes on the high-level ICE, as
385 desired.
386
387 - Lower each high-level ICE instruction into a sequence of zero or more
388 low-level ICE instructions. Each high-level instruction is generally lowered
389 independently, though the target lowering is allowed to look ahead in the
390 CfgNode's instruction list if desired.
391
392 - Do more analysis and transformation passes on the low-level ICE, as desired.
393
394 - Assemble the low-level CFG into an ELF object file (alternatively, a textual
395 assembly file that is later assembled by some external tool).
396
397 - Repeat for all functions, and also produce object code for data such as global
398 initializers and internal constant pools.
399
400 Currently there are two optimization levels: ``O2`` and ``Om1``. For ``O2``,
401 the intention is to apply all available optimizations to get the best code
402 quality (though the initial code quality goal is measured against LLVM's ``O0``
403 code quality). For ``Om1``, the intention is to apply as few optimizations as
404 possible and produce code as quickly as possible, accepting poor code quality.
405 ``Om1`` is short for "O-minus-one", i.e. "worse than O0", or in other words,
406 "sub-zero".
407
408 High-level debuggability of generated code is so far not a design requirement,
409 but Subzero doesn't really do transformations that would obfuscate debugging.
410 The main thing might be that register allocation (including stack slot
411 coalescing for stack-allocated variables whose live ranges don't overlap) may
412 render a variable's value unobtainable after its live range ends. This would
413 not be an issue for ``Om1`` since it doesn't register-allocate program-level
414 variables, nor does it coalesce stack slots. Adding debugging support would
415 primarily involve adding DWARF support to the ELF file emitter. Subzero
416 propagates global symbol names, local variable names, and function-internal
417 label names that are present in the ``.pexe`` file, but to be most useful,
418 the PNaCl toolchain would need to not strip these symbols.
419
420 Our experience so far is that ``Om1`` translates twice as fast as ``O2``, but
421 produces code with only one third the quality. ``Om1`` is good for testing and
422 debugging -- during translation, it tends to expose errors in the basic lowering
423 that might otherwise have been hidden by the register allocator or other
424 optimization passes. It also helps determine whether a code correctness problem
425 is a fundamental problem in the basic lowering, or an error in another
426 optimization pass.
427
428 The implementation of target lowering also controls the recipe of passes used
429 for ``Om1`` and ``O2`` translation. For example, address mode inference may
430 only be relevant for x86.
431
432 Lowering strategy
433 -----------------
434
435 The core of Subzero's lowering from high-level ICE to low-level ICE is to lower
436 each high-level instruction down to a sequence of low-level target-specific
437 instructions, in a largely context-free setting. That is, each high-level
438 instruction conceptually has a simple template expansion into low-level
439 instructions, and lowering can in theory be done in any order. This may sound
440 like a small effort, but quite a large number of templates may be needed because
441 of the number of PNaCl types and instruction variants. Furthermore, there may
442 be optimized templates, e.g. to take advantage of operator commutativity. This
443 is similar to other template-based approaches in fast code generation or
444 interpretation, though some decisions are deferred until after some global
445 analysis passes, mostly related to register allocation, stack slot assignment,
446 and specific choice of instruction variant and addressing mode.
447
448 The key idea for a lowering template is to produce valid low-level instructions
449 that are guaranteed to meet address mode and other structural requirements of
450 the instruction set. For example, on x86, the source operand of an integer
451 store instruction must be an immediate or a physical register; a shift
452 instruction's shift amount must be an immediate or in register ``%cl``; a
453 function's integer return value is in ``%eax``; most x86 instructions are
454 two-operand, in contrast to corresponding three-operand high-level instructions;
455 etc.
456
457 Because target lowering runs before register allocation, there is no way to know
458 whether a given ``Ice::Variable`` operand lives on the stack or in a physical
459 register. When the low-level instruction calls for a physical register operand,
460 the target lowering can create an infinite-weight Variable. This tells the
461 register allocator to assign infinite weight when making decisions, effectively
462 guaranteeing some physical register. Variables can also be pre-colored to a
463 specific physical register (``cl`` in the shift example above), which also gives
464 infinite weight.
465
466 To illustrate, consider a high-level arithmetic instruction on 32-bit integer
467 operands::
468
469 A = B + C
470
471 X86 target lowering might produce the following::
472
473 T.inf = B // mov instruction
474 T.inf += C // add instruction
475 A = T.inf // mov instruction
476
477 Here, ``T.inf`` is an infinite-weight temporary. As long as ``T.inf`` has a
478 physical register, the three lowered instructions are all encodable regardless
479 of whether ``B`` and ``C`` are physical registers, memory, or immediates, and
480 whether ``A`` is a physical register or in memory.
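
The lowering above can be sketched as a tiny template function. This is a toy
Python sketch, not Subzero's C++ implementation; the tuple instruction encoding
and the ``new_temp`` helper are hypothetical stand-ins:

```python
# Toy sketch of a context-free lowering template for "A = B + C" on x86.
# Routing through an infinite-weight temporary keeps every emitted
# instruction encodable regardless of where A, B, and C end up living.
counter = 0

def new_temp():
    # Hypothetical helper: creates a fresh infinite-weight temporary.
    global counter
    counter += 1
    return "T.inf%d" % counter

def lower_arith_add(a, b, c):
    t = new_temp()          # must be assigned a physical register
    return [
        ("mov", t, b),      # T.inf = B
        ("add", t, c),      # T.inf += C
        ("mov", a, t),      # A = T.inf
    ]

insts = lower_arith_add("A", "B", "C")
```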
481
482 In this example, ``A`` must be a Variable and one may be tempted to simplify the
483 lowering sequence by setting ``A`` as infinite-weight and using::
484
485 A = B // mov instruction
486 A += C // add instruction
487
488 This has two problems. First, if the original instruction was actually ``A =
489 B + A``, the result would be incorrect. Second, assigning ``A`` a physical
490 register applies throughout ``A``'s entire live range. This is probably not
491 what is intended, and may ultimately lead to a failure to allocate a register
492 for an infinite-weight variable.
493
494 This style of lowering leads to many temporaries being generated, so in ``O2``
495 mode, we rely on the register allocator to clean things up. In the example
496 above, if ``B`` ends up getting a physical register and its live range
497 ends at this instruction, the register allocator is likely to reuse that
498 register for ``T.inf``. This leads to ``T.inf=B`` being a redundant register
499 copy, which is removed as an emission-time peephole optimization.
500
501 O2 lowering
502 -----------
503
504 Currently, the ``O2`` lowering recipe is the following:
505
506 - Address mode inference
507
508 - Read-modify-write (RMW) transformation
509
510 - Basic liveness analysis
511
512 - Load optimization
513
514 - Target lowering
515
516 - Full liveness analysis
517
518 - Register allocation
519
520 - Phi instruction lowering (advanced)
521
522 - Post-phi lowering register allocation
523
524 - Branch optimization
525
526 These passes are described in more detail below.
527
528 Om1 lowering
529 ------------
530
531 Currently, the ``Om1`` lowering recipe is the following:
532
533 - Phi instruction lowering (simple)
534
535 - Target lowering
536
537 - Register allocation (infinite-weight and pre-colored only)
538
539 Optimization passes
540 -------------------
541
542 Liveness analysis
543 ^^^^^^^^^^^^^^^^^
544
545 Liveness analysis is a standard dataflow optimization, implemented as follows.
546 For each node (basic block), its live-out set is computed as the union of the
547 live-in sets of its successor nodes. Then the node's instructions are processed
548 in reverse order, updating the live set, until the beginning of the node is
549 reached, and the node's live-in set is recorded. If this iteration has changed
550 the node's live-in set, the node's predecessors are marked for reprocessing.
551 This continues until no more nodes need reprocessing. If nodes are processed in
552 reverse topological order, the number of iterations over the CFG is generally
553 equal to the maximum loop nest depth.
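
The iteration described above can be sketched as a worklist loop. This is a
toy Python version with use/def sets summarized per block and a hypothetical
diamond-shaped CFG, not Subzero's per-instruction C++ implementation:

```python
# Toy backward liveness analysis over a CFG. Each node has a use/def set
# summary; live-out = union of successors' live-in, and
# live-in = use | (live-out - def). Nodes are reprocessed until fixpoint.
def liveness(nodes, succ, use, defs):
    live_in = {n: set() for n in nodes}
    live_out = {n: set() for n in nodes}
    worklist = list(reversed(nodes))  # reverse topological order converges fast
    while worklist:
        n = worklist.pop(0)
        live_out[n] = (set().union(*(live_in[s] for s in succ[n]))
                       if succ[n] else set())
        new_in = use[n] | (live_out[n] - defs[n])
        if new_in != live_in[n]:
            live_in[n] = new_in
            # Mark predecessors for reprocessing.
            for p in nodes:
                if n in succ[p] and p not in worklist:
                    worklist.append(p)
    return live_in, live_out

# Hypothetical diamond CFG: entry -> {left, right} -> exit.
nodes = ["entry", "left", "right", "exit"]
succ = {"entry": ["left", "right"], "left": ["exit"],
        "right": ["exit"], "exit": []}
use = {"entry": set(), "left": {"x"}, "right": {"y"}, "exit": {"z"}}
defs = {"entry": {"x", "y"}, "left": {"z"}, "right": {"z"}, "exit": set()}
live_in, live_out = liveness(nodes, succ, use, defs)
```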
554
555 To implement this, each node records its live-in and live-out sets, initialized
556 to the empty set. Each instruction records which of its Variables' live ranges
557 end in that instruction, initialized to the empty set. A side effect of
558 liveness analysis is dead instruction elimination. Each instruction can be
559 marked as tentatively dead, and after the algorithm converges, the tentatively
560 dead instructions are permanently deleted.
561
562 Optionally, after this liveness analysis completes, we can do live range
563 construction, in which we calculate the live range of each variable in terms of
564 instruction numbers. A live range is represented as a union of segments, where
565 the segment endpoints are instruction numbers. Instruction numbers are required
566 to be unique across the CFG, and monotonically increasing within a basic block.
567 As a union of segments, live ranges can contain "gaps" and are therefore
568 precise. Because of SSA properties, a variable's live range can start at most
569 once in a basic block, and can end at most once in a basic block. Liveness
570 analysis keeps track of which variable/instruction tuples begin live ranges and
571 end live ranges, and combined with live-in and live-out sets, we can efficiently
572 build up live ranges of all variables across all basic blocks.
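
With live ranges held as unions of instruction-number segments, the precision
claim is easy to see: a variable that is dead in a gap does not conflict with
one that lives only inside that gap. A small sketch (the half-open
``[start, end)`` segment encoding is an assumption of the sketch):

```python
# Live ranges as unions of half-open [start, end) instruction-number
# segments. Gaps between segments are genuinely dead, so the overlap
# test is precise rather than conservative.
def overlaps(range_a, range_b):
    return any(a_start < b_end and b_start < a_end
               for a_start, a_end in range_a
               for b_start, b_end in range_b)

x = [(10, 14), (20, 26)]  # live in two segments, dead in the gap [14, 20)
y = [(14, 20)]            # lives entirely inside x's gap: no conflict
z = [(12, 16)]            # straddles x's first segment and y's segment
```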
573
574 A lot of care is taken to try to make liveness analysis fast and efficient.
575 Because of the lowering strategy, the number of variables is generally
576 proportional to the number of instructions, leading to an O(N^2) complexity
577 algorithm if implemented naively. To improve things based on sparsity, we note
578 that most variables are "local" and referenced in at most one basic block (in
579 contrast to the "global" variables with multi-block usage), and therefore cannot
580 be live across basic blocks. Therefore, the live-in and live-out sets,
581 typically represented as bit vectors, can be limited to the set of global
582 variables, and the intra-block liveness bit vector can be compacted to hold the
583 global variables plus the local variables for that block.
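
The local/global split can be computed in one scan over the CFG; a minimal
sketch, with blocks reduced to lists of referenced variable names:

```python
# Classify variables as block-local or global by counting the distinct
# blocks that reference them; only globals need bits in the cross-block
# live-in and live-out vectors.
def classify(blocks):
    seen_in = {}
    for block_index, variables in enumerate(blocks):
        for v in variables:
            seen_in.setdefault(v, set()).add(block_index)
    local = {v for v, bs in seen_in.items() if len(bs) == 1}
    global_ = {v for v, bs in seen_in.items() if len(bs) > 1}
    return local, global_

# Block 0 references a and b; block 1 references b and c: only b is global.
local_vars, global_vars = classify([["a", "b"], ["b", "c"]])
```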
584
585 Register allocation
586 ^^^^^^^^^^^^^^^^^^^
587
588 Subzero implements a simple linear-scan register allocator, based on the
589 allocator described by Hanspeter Mössenböck and Michael Pfeiffer in `Linear Scan
590 Register Allocation in the Context of SSA Form and Register Constraints
591 <ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe02.PDF>`_. This allocator has
592 several nice features:
593
594 - Live ranges are represented as unions of segments, as described above, rather
595 than a single start/end tuple.
596
597 - It allows pre-coloring of variables with specific physical registers.
598
599 - It applies equally well to pre-lowered Phi instructions.
600
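A stripped-down linear scan over single-segment live ranges conveys the
flavor; Subzero's allocator extends this with segment unions, pre-coloring,
and the preference and overlap rules described below. The encoding here is a
toy, not the actual implementation:

```python
# Minimal linear-scan register allocation over single-segment live ranges
# (toy sketch). Ranges are (name, start, end) with end exclusive; variables
# are visited in order of increasing start.
def linear_scan(ranges, num_regs):
    free = list(range(num_regs))
    active = []          # (end, reg, name), kept sorted by end
    assignment = {}
    for name, start, end in sorted(ranges, key=lambda r: r[1]):
        # Expire active ranges that ended at or before this start, and
        # return their registers to the free pool.
        while active and active[0][0] <= start:
            _, reg, _ = active.pop(0)
            free.append(reg)
        if free:
            reg = free.pop(0)
            assignment[name] = reg
            active.append((end, reg, name))
            active.sort()
        else:
            assignment[name] = None  # no register free: spill to a stack slot
    return assignment

# Four variables competing for two registers; "d" gets spilled.
assignment = linear_scan(
    [("a", 0, 4), ("b", 1, 3), ("c", 3, 6), ("d", 2, 5)], num_regs=2)
```
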
601 The paper suggests an approach of aggressively coalescing variables across Phi
602 instructions (i.e., trying to force Phi source and destination variables to have
603 the same register assignment), but we reject that in favor of the more natural
604 preference mechanism described below.
605
606 We enhance the algorithm in the paper with automatic inference of register
607 preference, and with the ability to let overlapping live ranges safely share
608 the same register in certain circumstances. If we are
609 considering register allocation for variable ``A``, and ``A`` has a single
610 defining instruction ``A=B+C``, then the preferred register for ``A``, if
611 available, would be the register assigned to ``B`` or ``C``, if any, provided
612 that ``B`` or ``C``'s live range does not overlap ``A``'s live range. In this
613 way we infer a good register preference for ``A``.
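
The preference rule amounts to a small check at allocation time; a sketch
using single-segment ``[start, end)`` ranges and hypothetical names:

```python
# Infer a register preference for A from its single defining instruction
# A = B op C: prefer a source operand's register when that source's live
# range does not overlap A's, so reusing the register is safe. The names
# and the single-segment range encoding are assumptions of this sketch.
def ranges_overlap(r1, r2):
    return r1[0] < r2[1] and r2[0] < r1[1]

def preferred_register(a_range, sources):
    # sources: list of (live_range, assigned_register_or_None).
    for src_range, src_reg in sources:
        if src_reg is not None and not ranges_overlap(a_range, src_range):
            return src_reg
    return None

# First source (register 5) overlaps A's range and is skipped; second
# source's range (register 3) ends exactly where A's begins, so prefer 3.
pref = preferred_register((10, 20), [((2, 12), 5), ((4, 10), 3)])
```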
614
615 We allow overlapping live ranges to get the same register in certain cases.
616 Suppose a high-level instruction like::
617
618 A = unary_op(B)
619
620 has been target-lowered like::
621
622 T.inf = B
623 A = unary_op(T.inf)
624
625 Further, assume that ``B``'s live range continues beyond this instruction
626 sequence, and that ``B`` has already been assigned some register. Normally, we
627 might want to infer ``B``'s register as a good candidate for ``T.inf``, but it
628 turns out that ``T.inf`` and ``B``'s live ranges overlap, requiring them to have
629 different registers. But ``T.inf`` is just a read-only copy of ``B`` that is
630 guaranteed to be in a register, so in theory these overlapping live ranges could
631 safely have the same register. Our implementation allows this overlap as long
632 as ``T.inf`` is never modified within ``B``'s live range, and ``B`` is never
633 modified within ``T.inf``'s live range.
634
635 Subzero's register allocator can be run in three configurations:
636
637 - Normal mode. All Variables are considered for register allocation. It
638 requires full liveness analysis and live range construction as a prerequisite.
639 This is used by ``O2`` lowering.
640
641 - Minimal mode. Only infinite-weight or pre-colored Variables are considered.
642 All other Variables are stack-allocated. It does not require liveness
643 analysis; instead, it quickly scans the instructions and records first
644 definitions and last uses of all relevant Variables, using that to construct a
645 single-segment live range. Although this includes most of the Variables, the
646 live ranges are mostly simple, short, and rarely overlapping, which the
647 register allocator handles efficiently. This is used by ``Om1`` lowering.
648
649 - Post-phi lowering mode. Advanced phi lowering is done after normal-mode
650 register allocation, and may result in new infinite-weight Variables that need
651 registers. One would like to just run something like minimal mode to assign
652 registers to the new Variables while respecting existing register allocation
653 decisions. However, it sometimes happens that there are no free registers.
654 In this case, some register needs to be forcibly spilled to the stack and
655 temporarily reassigned to the new Variable, and reloaded at the end of the new
656 Variable's live range. The register must be one that has no explicit
657 references during the Variable's live range. Since Subzero currently doesn't
658 track def/use chains (though it does record the CfgNode where a Variable is
659 defined), we just do a brute-force search across the CfgNode's instruction
660 list for the instruction numbers of interest. This situation happens very
661 rarely, so there's little point for now in improving its performance.
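The minimal-mode scan could be sketched like this, using an illustrative ``(dest, sources)`` instruction encoding rather than Subzero's IR:

```python
def minimal_live_ranges(instructions):
    """One quick scan recording each variable's first definition and last
    use, yielding a single-segment live range per variable (an Om1-style
    sketch; each instruction is a (dest_or_None, [sources]) pair)."""
    first_def, last_use = {}, {}
    for num, (dest, srcs) in enumerate(instructions):
        for s in srcs:
            last_use[s] = num              # keep overwriting: the last use wins
        if dest is not None and dest not in first_def:
            first_def[dest] = num
    every = set(first_def) | set(last_use)
    # Variables never defined here (e.g. arguments) are live from entry (0).
    return {v: (first_def.get(v, 0), last_use.get(v, first_def.get(v, 0)))
            for v in every}

prog = [("t1", ["a"]),          # t1 = use(a)
        ("t2", ["t1", "b"]),    # t2 = use(t1, b)
        (None, ["t2"])]         # use(t2), e.g. a return
print(minimal_live_ranges(prog)["t1"])  # (0, 1)
```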
662
663 The basic linear-scan algorithm may, as it proceeds, rescind an early register
664 allocation decision, leaving that Variable to be stack-allocated. Some of these
665 times, it turns out that the Variable could have been given a different register
666 without conflict, but by this time it's too late. The literature recognizes
667 this situation and describes "second-chance bin-packing", which Subzero can do.
668 We can rerun the register allocator in a mode that respects existing register
669 allocation decisions, and sometimes it finds new non-conflicting opportunities.
670 In fact, we can repeatedly run the register allocator until convergence.
671 Unfortunately, in the current implementation, these subsequent register
672 allocation passes end up being extremely expensive. This is because of the
673 treatment of the "unhandled pre-colored" Variable set, which is normally very
674 small but ends up being quite large on subsequent passes. Its performance can
675 probably be made acceptable with a better choice of data structures, but for now
676 this second-chance mechanism is disabled.
677
678 Future work is to implement LLVM's `Greedy
679 <http://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html>`_
680 register allocator as a replacement for the basic linear-scan algorithm, given
681 LLVM's experience with its improvement in code quality. (The blog post claims
682 that the Greedy allocator also improved maintainability because a lot of hacks
683 could be removed, but Subzero is probably not yet to that level of hacks, and is
684 less likely to see that particular benefit.)
685
686 Basic phi lowering
687 ^^^^^^^^^^^^^^^^^^
688
689 The simplest phi lowering strategy works as follows (this is how LLVM ``-O0``
690 implements it). Consider this example::
691
692 L1:
693 ...
694 br L3
695 L2:
696 ...
697 br L3
698 L3:
699 A = phi [B, L1], [C, L2]
700 X = phi [Y, L1], [Z, L2]
701
702 For each destination of a phi instruction, we can create a temporary and insert
703 the temporary's assignment at the end of the predecessor block::
704
705 L1:
706 ...
707 A' = B
708 X' = Y
709 br L3
710 L2:
711 ...
712 A' = C
713 X' = Z
714 br L3
715 L3:
716 A = A'
717 X = X'
718
719 This transformation is very simple and reliable. It can be done before target
720 lowering and register allocation, and it easily avoids the classic lost-copy and
721 related problems. ``Om1`` lowering uses this strategy.
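The transformation can be sketched as follows, assuming a simplified block representation (label mapping to phis and code) rather than Subzero's CFG classes:

```python
def lower_phis_basic(blocks):
    """Insert temp assignments at the end of each predecessor and replace
    each phi with a copy from its temp.  Blocks map a label to
    {"phis": [(dest, [(src, pred), ...])], "code": [...instr tuples...]},
    where each code list ends with its branch/return instruction."""
    for blk in blocks.values():
        for dest, operands in blk["phis"]:
            temp = dest + "'"
            for src, pred in operands:
                # Insert just before the predecessor's terminating branch.
                blocks[pred]["code"].insert(-1, (temp, src))
            blk["code"].insert(0, (dest, temp))
        blk["phis"] = []
    return blocks

# The L1/L2/L3 example from the text:
blocks = {
    "L1": {"phis": [], "code": [("br", "L3")]},
    "L2": {"phis": [], "code": [("br", "L3")]},
    "L3": {"phis": [("A", [("B", "L1"), ("C", "L2")]),
                    ("X", [("Y", "L1"), ("Z", "L2")])],
           "code": [("ret",)]},
}
lower_phis_basic(blocks)
print(blocks["L1"]["code"])  # [("A'", 'B'), ("X'", 'Y'), ('br', 'L3')]
```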
722
723 However, it has the disadvantage of initializing temporaries even for branches
724 not taken, though that could be mitigated by splitting non-critical edges and
725 putting assignments in the edge-split nodes. Another problem is that without
726 extra machinery, the assignments to ``A``, ``A'``, ``X``, and ``X'`` are given a
727 specific ordering even though phi semantics are that the assignments are
728 parallel or unordered. This sometimes imposes false live range overlaps and
729 leads to poorer register allocation.
730
731 Advanced phi lowering
732 ^^^^^^^^^^^^^^^^^^^^^
733
734 ``O2`` lowering defers phi lowering until after register allocation to avoid the
735 problem of false live range overlaps. It works as follows. We split each
736 incoming edge and move the (parallel) phi assignments into the split nodes. We
737 linearize each set of assignments by finding a safe, topological ordering of the
738 assignments, respecting register assignments as well. For example::
739
740 A = B
741 X = Y
742
743 Normally these assignments could be executed in either order, but if ``B`` and
744 ``X`` are assigned the same physical register, we would want to use the above
745 ordering. Dependency cycles are broken by introducing a temporary. For
746 example::
747
748 A = B
749 B = A
750
751 Here, a temporary breaks the cycle::
752
753 t = A
754 A = B
755 B = t
756
757 Finally, we use the existing target lowering to lower the assignments in this
758 basic block, and once that is done for all basic blocks, we run the post-phi
759 variant of register allocation on the edge-split basic blocks.
760
761 When computing a topological order, we try to first schedule assignments whose
762 source has a physical register, and last schedule assignments whose destination
763 has a physical register. This helps reduce register pressure.
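One way to sketch this linearization, with a hypothetical ``loc`` map from names to assigned locations and a single scratch temporary (names are illustrative, not Subzero's):

```python
def serialize_moves(moves, loc):
    """Linearize parallel assignments (dst, src) so no location is clobbered
    before its old value is read, breaking cycles with one temporary "t"."""
    loc = dict(loc)
    loc["t"] = "<scratch>"               # assumed free temporary location
    pending, out = list(moves), []
    while pending:
        for i, (dst, src) in enumerate(pending):
            # Safe to emit if no other pending move still reads dst's location.
            if not any(loc[s] == loc[dst]
                       for j, (_, s) in enumerate(pending) if j != i):
                out.append((dst, src))
                pending.pop(i)
                break
        else:
            # Every destination is still read somewhere: a dependency cycle.
            dst, _ = pending[0]
            out.append(("t", dst))       # save the value about to be clobbered
            pending = [(d, "t" if loc[s] == loc[dst] else s)
                       for d, s in pending]
    return out

# The A = B; B = A swap from the text becomes t = A; A = B; B = t.
print(serialize_moves([("A", "B"), ("B", "A")], {"A": "eax", "B": "ecx"}))
```

The same safety test also forces ``A = B`` before ``X = Y`` when ``B`` and ``X`` share a physical register, as in the earlier example.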
764
765 X86 address mode inference
766 ^^^^^^^^^^^^^^^^^^^^^^^^^^
767
768 We try to take advantage of the x86 addressing mode that includes a base
769 register, an index register, an index register scale amount, and an immediate
770 offset. We do this through simple pattern matching. Starting with a load or
771 store instruction where the address is a variable, we initialize the base
772 register to that variable, and look up the instruction where that variable is
773 defined. If that is an add instruction of two variables and the index register
774 hasn't been set, we replace the base and index register with those two
775 variables. If instead it is an add instruction of a variable and a constant, we
776 replace the base register with the variable and add the constant to the
777 immediate offset.
778
779 There are several more patterns that can be matched. This pattern matching
780 continues on the load or store instruction until no more matches are found.
781 Because a program typically has few load and store instructions (not to be
782 confused with instructions that manipulate stack variables), this address mode
783 inference pass is fast.
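The matching loop might be sketched like this, ignoring the scale factor and using an illustrative ``defs`` map from variables to their defining instructions:

```python
def infer_addr_mode(addr_var, defs):
    """Greedy pattern match toward an x86 base+index+offset address mode.
    `defs` maps a variable to its defining instruction: ("add", v1, v2)
    or ("addimm", v, const).  Scale matching is omitted for brevity."""
    base, index, offset = addr_var, None, 0
    changed = True
    while changed:
        changed = False
        d = defs.get(base)
        if d is None:
            break
        if d[0] == "add" and index is None:
            base, index = d[1], d[2]             # base+index from add of two vars
            changed = True
        elif d[0] == "addimm":
            base, offset = d[1], offset + d[2]   # fold constant into the offset
            changed = True
    return base, index, offset

# a = b + c;  p = a + 8;  load [p]   =>   [b + c + 8]
print(infer_addr_mode("p", {"a": ("add", "b", "c"),
                            "p": ("addimm", "a", 8)}))  # ('b', 'c', 8)
```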
784
785 X86 read-modify-write inference
786 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
787
788 A reasonably common bitcode pattern is a non-atomic update of a memory
789 location::
790
791 x = load addr
792 y = add x, 1
793 store y, addr
794
795 On x86, with good register allocation, the Subzero passes described above
796 generate code with only this quality::
797
798 mov [%ebx], %eax
799 add $1, %eax
800 mov %eax, [%ebx]
801
802 However, x86 allows for this kind of code::
803
804 add $1, [%ebx]
805
806 which requires fewer instructions, but perhaps more importantly, requires fewer
807 physical registers.
808
809 It's also important to note that this transformation only makes sense if the
810 store instruction ends ``x``'s live range.
811
812 Subzero's ``O2`` recipe includes an early pass to find read-modify-write (RMW)
813 opportunities via simple pattern matching. The only problem is that it is run
814 before liveness analysis, which is needed to determine whether ``x``'s live
815 range ends after the RMW. Since liveness analysis is one of the most expensive
816 passes, it's not attractive to run it an extra time just for RMW analysis.
817 Instead, we essentially generate both the RMW and the non-RMW versions, and then
818 during lowering, the RMW version deletes itself if it finds ``x`` still live.
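The pattern match itself could look roughly like this. The tuple encoding is illustrative (the real pass works on ICE instructions and, as described, defers the liveness check to lowering):

```python
def find_rmw(instrs, uses):
    """Match  x = load addr;  y = op(x, c);  store y, addr  triples.
    `uses[v]` counts uses of v; the rewrite is legal only when the store
    ends x's (and y's) live range, i.e. neither has other uses.  Tuple
    shapes: ("load", x, addr), ("op", y, x, name, c), ("store", y, addr)."""
    found = []
    for a, b, c in zip(instrs, instrs[1:], instrs[2:]):
        if (a[0] == "load" and b[0] == "op" and c[0] == "store"
                and b[2] == a[1]          # op reads the loaded value
                and c[1] == b[1]          # store writes the op result
                and c[2] == a[2]          # same address
                and uses[a[1]] == 1 and uses[b[1]] == 1):
            found.append((b[3], b[4], a[2]))   # e.g. add $1, [addr]
    return found

instrs = [("load", "x", "addr"),
          ("op", "y", "x", "add", 1),
          ("store", "y", "addr")]
print(find_rmw(instrs, {"x": 1, "y": 1}))  # [('add', 1, 'addr')]
```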
819
820 X86 compare-branch inference
821 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
822
823 In the LLVM instruction set, the compare/branch pattern works like this::
824
825 cond = icmp eq a, b
826 br cond, target
827
828 The result of the icmp instruction is a single bit, and a conditional branch
829 tests that bit. By contrast, most target architectures use this pattern::
830
831 cmp a, b // implicitly sets various bits of FLAGS register
832 br eq, target // branch on a particular FLAGS bit
833
834 A naive lowering sequence conditionally sets ``cond`` to 0 or 1, then tests
835 ``cond`` and conditionally branches. Subzero has a pass that identifies
836 boolean-based operations like this and folds them into a single
837 compare/branch-like operation. It is set up for more than just cmp/br, though.
838 Boolean producers include icmp (integer compare), fcmp (floating-point compare),
839 and trunc (integer truncation when the destination has bool type). Boolean
840 consumers include branch, select (the ternary operator from the C language), and
841 sign-extend and zero-extend when the source has bool type.
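A sketch of the simplest cmp/br case on an illustrative tuple IR (the real pass also verifies that the boolean has no other consumers before folding):

```python
def fold_compare_branch(instrs):
    """Fold  cond = icmp cc, a, b  immediately followed by  br cond, target
    into a single compare/branch pseudo-op."""
    out, skip = [], False
    for cur, nxt in zip(instrs, instrs[1:] + [None]):
        if skip:
            skip = False
            continue
        if (cur[0] == "icmp" and nxt is not None
                and nxt[0] == "br" and nxt[1] == cur[1]):   # br tests icmp's dest
            _, _, cc, a, b = cur
            out.append(("cmp_br", cc, a, b, nxt[2]))
            skip = True
        else:
            out.append(cur)
    return out

prog = [("icmp", "cond", "eq", "a", "b"), ("br", "cond", "L1")]
print(fold_compare_branch(prog))  # [('cmp_br', 'eq', 'a', 'b', 'L1')]
```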
842
843 Sandboxing
844 ^^^^^^^^^^
845
846 Native Client's sandbox model uses software fault isolation (SFI) to provide
847 safety when running untrusted code in a browser or other environment. Subzero
848 implements Native Client's `sandboxing
849 <https://developer.chrome.com/native-client/reference/sandbox_internals/index>`_
850 to enable Subzero-translated executables to be run inside Chrome. Subzero also
851 provides a fairly simple framework for investigating alternative sandbox models
852 or other restrictions on the sandbox model.
853
854 Sandboxing in Subzero is not actually implemented as a separate pass, but is
855 integrated into lowering and assembly.
856
857 - Indirect branches, including the ret instruction, are masked to a bundle
858 boundary and bundle-locked.
859
860 - Call instructions are aligned to the end of the bundle so that the return
861 address is bundle-aligned.
862
863 - Indirect branch targets, including function entry and targets in a switch
864 statement jump table, are bundle-aligned.
865
866 - The intrinsic for reading the thread pointer is inlined appropriately.
867
868 - For x86-64, non-stack memory accesses are with respect to the reserved sandbox
869 base register. We reduce the aggressiveness of address mode inference to
870 leave room for the sandbox base register during lowering. There are no memory
871 sandboxing changes for x86-32.
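Conceptually, the masking applied to indirect branch targets clears the low address bits so the branch can only land on a bundle boundary, as in this sketch (Native Client's x86 bundle size is 32 bytes; the lowering emits the equivalent masking ``and`` bundle-locked with the branch itself):

```python
BUNDLE_SIZE = 32  # Native Client's x86 bundle size in bytes

def mask_to_bundle(addr, bundle_size=BUNDLE_SIZE):
    """Clear the low bits of an indirect branch target so it can only
    land on a bundle boundary.  (Conceptual sketch of the masking.)"""
    return addr & ~(bundle_size - 1)

print(hex(mask_to_bundle(0x1234567D)))  # 0x12345660
```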
872
873 Code emission
874 -------------
875
876 Subzero's integrated assembler is derived from Dart's `assembler code
877 <https://github.com/dart-lang/sdk/tree/master/runtime/vm>`_. There is a pass
878 that iterates through the low-level ICE instructions and invokes the relevant
879 assembler functions. Placeholders are added for later fixup of branch target
880 offsets. (Backward branches use short offsets if possible; forward branches
881 always use long offsets.) The assembler emits into a staging buffer. Once
882 emission into the staging buffer for a function is complete, the data is emitted
883 to the output file as an ELF object file, and metadata such as relocations,
884 symbol table, and string table, are accumulated for emission at the end. Global
885 data initializers are emitted similarly. A key point is that at this point, the
886 staging buffer can be deallocated, and only a minimum of data needs to be held
887 until the end.
888
889 As a debugging alternative, Subzero can emit textual assembly code which can
890 then be run through an external assembler. This is of course super slow, but
891 quite valuable when bringing up a new target.
892
893 As another debugging option, the staging buffer can be emitted as textual
894 assembly, primarily in the form of ".byte" lines. This allows the assembler to
895 be tested separately from the ELF-related code.
896
897 Memory management
898 -----------------
899
900 Where possible, we allocate from a ``CfgLocalAllocator`` which derives from
901 LLVM's ``BumpPtrAllocator``. This is an arena-style allocator where objects
902 allocated from the arena are never actually freed; instead, when the CFG
903 translation completes and the CFG is deleted, the entire arena memory is
904 reclaimed at once. This style of allocation works well in an environment like a
905 compiler where there are distinct phases with only easily-identifiable objects
906 living across phases. It frees the developer from having to manage object
907 deletion, and it amortizes deletion costs across just a single arena deletion at
908 the end of the phase. Furthermore, it helps scalability by allocating entirely
909 from thread-local memory pools, and minimizing global locking of the heap.
910
911 Instructions are probably the most heavily allocated complex class in Subzero.
912 We represent an instruction list as an intrusive doubly linked list, allocate
913 all instructions from the ``CfgLocalAllocator``, and we make sure each
914 instruction subclass is basically `POD
915 <http://en.cppreference.com/w/cpp/concept/PODType>`_ (Plain Old Data) with a
916 trivial destructor. This way, when the CFG is finished, we don't need to
917 individually deallocate every instruction. We do the same for Variables,
918 which are probably the second most heavily allocated complex class.
919
920 There are some situations where passes need to use some `STL container class
921 <http://en.cppreference.com/w/cpp/container>`_. Subzero has a way of using the
922 ``CfgLocalAllocator`` as the container allocator if this is needed.
923
924 Multithreaded translation
925 -------------------------
926
927 Subzero is designed to be able to translate functions in parallel. With the
928 ``-threads=N`` command-line option, there is a 3-stage producer-consumer
929 pipeline:
930
931 - A single thread parses the ``.pexe`` file and produces a sequence of work
932 units. A work unit can be either a fully constructed CFG, or a set of global
933 initializers. The work unit includes its sequence number denoting its parse
934 order. Each work unit is added to the translation queue.
935
936 - There are N translation threads that draw work units from the translation
937 queue and lower them into assembler buffers. Each assembler buffer is added
938 to the emitter queue, tagged with its sequence number. The CFG and its
939 ``CfgLocalAllocator`` are disposed of at this point.
940
941 - A single thread draws assembler buffers from the emitter queue and appends to
942 the output file. It uses the sequence numbers to reintegrate the assembler
943 buffers according to the original parse order, such that output order is
944 always deterministic.
945
946 This means that with ``-threads=N``, there are actually ``N+1`` spawned threads
947 for a total of ``N+2`` execution threads, taking the parser and emitter threads
948 into account. For the special case of ``N=0``, execution is entirely sequential
949 -- the same thread parses, translates, and emits, one function at a time. This
950 is useful for performance measurements.
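The emitter thread's sequence-number reintegration can be sketched with a min-heap (illustrative; the actual queue types in Subzero differ):

```python
import heapq

def emit_in_order(finished):
    """Reintegrate translated work units into original parse order using
    their sequence numbers, emitting each unit as soon as the next
    expected number arrives."""
    heap, next_seq, out = [], 0, []
    for seq, payload in finished:        # arrival order from the N translators
        heapq.heappush(heap, (seq, payload))
        while heap and heap[0][0] == next_seq:
            out.append(heapq.heappop(heap)[1])
            next_seq += 1
    return out

# Translation threads finish out of order; emission order stays deterministic.
print(emit_in_order([(2, "f2"), (0, "f0"), (1, "f1")]))  # ['f0', 'f1', 'f2']
```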
951
952 Ideally, we would like to get near-linear scalability as the number of
953 translation threads increases. We expect that ``-threads=1`` should be slightly
954 faster than ``-threads=0`` as the small amount of time spent parsing and
955 emitting is done largely in parallel with translation. With perfect
956 scalability, we see ``-threads=N`` translating ``N`` times as fast as
957 ``-threads=1``, up until the point where parsing or emitting becomes the
958 bottleneck, or ``N+2`` exceeds the number of CPU cores. In reality, memory
959 performance would become a bottleneck and efficiency might peak at, say, 75%.
960
961 Currently, parsing takes about 11% of total sequential time. If translation
962 scalability ever gets so fast and awesomely scalable that parsing becomes a
963 bottleneck, it should be possible to make parsing multithreaded as well.
964
965 Internally, all shared, mutable data is held in the GlobalContext object, and
966 access to each field is guarded by a mutex.
967
968 Security
969 --------
970
971 Subzero includes a number of security features in the generated code, as well as
972 in the Subzero translator itself, which run on top of the existing Native Client
973 sandbox as well as Chrome's OS-level sandbox.
974
975 Sandboxed translator
976 ^^^^^^^^^^^^^^^^^^^^
977
978 When running inside the browser, the Subzero translator executes as sandboxed,
979 untrusted code that is initially checked by the validator, just like the
980 LLVM-based ``pnacl-llc`` translator. As such, the Subzero binary should be no
981 more or less secure than the translator it replaces, from the point of view of
982 the Chrome sandbox. That said, Subzero is much smaller than ``pnacl-llc`` and
983 was designed from the start with security in mind, so one expects fewer attacker
984 opportunities here.
985
986 Code diversification
987 ^^^^^^^^^^^^^^^^^^^^
988
989 `Return-oriented programming
990 <https://en.wikipedia.org/wiki/Return-oriented_programming>`_ (ROP) is a
991 now-common technique for escalating e.g. a known buffer overflow into a
992 deeper exploit. The attacker scans the executable
993 looking for ROP gadgets, which are short sequences of code that happen to load
994 known values into known registers and then return. An attacker who manages to
995 overwrite parts of the stack can overwrite it with carefully chosen return
996 addresses such that certain ROP gadgets are effectively chained together to set
997 up the register state as desired, finally returning to some code that manages to
998 do something nasty based on those register values.
999
1000 If there is a popular ``.pexe`` with a large install base, the attacker could
1001 run Subzero on it and scan the executable for suitable ROP gadgets to use as
1002 part of a potential exploit. Note that if the trusted validator is working
1003 correctly, these ROP gadgets are limited to starting at a bundle boundary and
1004 cannot use the trick of finding a gadget that happens to begin inside another
1005 instruction. All the same, gadgets with these constraints still exist and the
1006 attacker has access to them. This is the attack model we focus most on --
1007 protecting the user against misuse of a "trusted" developer's application, as
1008 opposed to mischief from a malicious ``.pexe`` file.
1009
1010 Subzero can mitigate these attacks to some degree through code diversification.
1011 Specifically, we can apply some randomness to the code generation that makes ROP
1012 gadgets less predictable. This randomness can have some compile-time cost, and
1013 it can affect the code quality; and some diversifications may be more effective
1014 than others. A more detailed treatment of hardening techniques may be found in
1015 the Matasano report "`Attacking Clientside JIT Compilers
1016 <https://www.nccgroup.trust/globalassets/resources/us/presentations/documents/attacking_clientside_jit_compilers_paper.pdf>`_".
1017
1018 To evaluate diversification effectiveness, we use a third-party ROP gadget
1019 finder and limit its results to bundle-aligned addresses. For a given
1020 diversification technique, we run it with a number of different random seeds,
1021 find ROP gadgets for each version, and determine how persistent each ROP gadget
1022 is across the different versions. A gadget is persistent if the same gadget is
1023 found at the same code address. The best diversifications are ones with low
1024 gadget persistence rates.
1025
1026 Subzero implements 7 different diversification techniques. Below is a
1027 discussion of each technique, its effectiveness, and its cost. The discussions
1028 of cost and effectiveness are for a single diversification technique; the
1029 translation-time costs for multiple techniques are additive, but the effects of
1030 multiple techniques on code quality and effectiveness are not yet known.
1031
1032 In Subzero's implementation, each randomization is "repeatable" in a sense.
1033 Each pass that includes a randomization option gets its own private instance of
1034 a random number generator (RNG). The RNG is seeded with a combination of a
1035 global seed, the pass ID, and the function's sequence number. The global seed
1036 is designed to be different across runs (perhaps based on the current time), but
1037 for debugging, the global seed can be set to a specific value and the results
1038 will be repeatable.
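A sketch of such repeatable seeding follows; the hash-based mixing function here is an assumption for illustration, not Subzero's exact scheme:

```python
import hashlib
import random

def pass_rng(global_seed, pass_id, func_seq):
    """Derive a private RNG for one (pass, function) pair from the global
    seed, the pass ID, and the function's sequence number, so that pinning
    the global seed makes every randomized pass repeatable."""
    key = f"{global_seed}:{pass_id}:{func_seq}".encode()
    seed = int.from_bytes(hashlib.sha256(key).digest()[:8], "little")
    return random.Random(seed)

r1 = pass_rng(1234, "nop-insertion", 7)
r2 = pass_rng(1234, "nop-insertion", 7)
print(r1.random() == r2.random())  # True: same inputs give the same stream
```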
1039
1040 Subzero-generated code is subject to diversification once per translation, and
1041 then Chrome caches the diversified binary for subsequent executions. An
1042 attacker may attempt to run the binary multiple times hoping for
1043 higher-probability combinations of ROP gadgets. When the attacker guesses
1044 wrong, a likely outcome is an application crash. Chrome throttles creation of
1045 crashy processes which reduces the likelihood of the attacker eventually gaining
1046 a foothold.
1047
1048 Constant blinding
1049 ~~~~~~~~~~~~~~~~~
1050
1051 Here, we prevent attackers from controlling large immediates in the text
1052 (executable) section. A random cookie is generated for each function, and if
1053 the constant exceeds a specified threshold, the constant is obfuscated with the
1054 cookie and equivalent code is generated. For example, instead of this x86
1055 instruction::
1056
1057 mov $0x11223344, <%Reg/Mem>
1058
1059 the following code might be generated::
1060
1061 mov $(0x11223344+Cookie), %temp
1062 lea -Cookie(%temp), %temp
1063 mov %temp, <%Reg/Mem>
1064
1065 The ``lea`` instruction is used rather than e.g. ``add``/``sub`` or ``xor``, to
1066 prevent unintended effects on the flags register.
1067
1068 This transformation has almost no effect on translation time, and about 1%
1069 impact on code quality, depending on the threshold chosen. It does little to
1070 reduce gadget persistence, but it does remove a lot of potential opportunities
1071 to construct intra-instruction ROP gadgets (which an attacker could use only if
1072 a validator bug were discovered, since the Native Client sandbox and associated
1073 validator force returns and other indirect branches to be to bundle-aligned
1074 addresses).
1075
1076 Constant pooling
1077 ~~~~~~~~~~~~~~~~
1078
1079 This is similar to constant blinding, in that large immediates are removed from
1080 the text section. In this case, each unique constant above the threshold is
1081 stored in a read-only data section and the constant is accessed via a memory
1082 load. For the above example, the following code might be generated::
1083
1084 mov $Label$1, %temp
1085 mov %temp, <%Reg/Mem>
1086
1087 This has a similarly small impact on translation time and ROP gadget
1088 persistence, and a smaller (better) impact on code quality. This is because it
1089 uses fewer instructions, and in some cases good register allocation leads to no
1090 increase in instruction count. Note that this still gives an attacker some
1091 limited amount of control over some text section values, unless we randomize the
1092 constant pool layout.
1093
1094 Static data reordering
1095 ~~~~~~~~~~~~~~~~~~~~~~
1096
1097 This transformation limits the attacker's ability to control bits in global data
1098 address references. It simply permutes the order in memory of global variables
1099 and internal constant pool entries. For the constant pool, we only permute
1100 within a type (i.e., emit a randomized list of ints, followed by a randomized
1101 list of floats, etc.) to maintain good packing in the face of alignment
1102 constraints.
1103
1104 As might be expected, this has no impact on code quality, translation time, or
1105 ROP gadget persistence (though as above, it limits opportunities for
1106 intra-instruction ROP gadgets with a broken validator).
1107
1108 Basic block reordering
1109 ~~~~~~~~~~~~~~~~~~~~~~
1110
1111 Here, we randomize the order of basic blocks within a function, with the
1112 constraint that we still want to maintain a topological order as much as
1113 possible, to avoid making the code too branchy.
1114
1115 This has no impact on code quality, and about 1% impact on translation time, due
1116 to a separate pass to recompute layout. It ends up having a huge effect on ROP
1117 gadget persistence, tied for best with nop insertion, reducing ROP gadget
1118 persistence to less than 5%.
1119
1120 Function reordering
1121 ~~~~~~~~~~~~~~~~~~~
1122
1123 Here, we permute the order that functions are emitted, primarily to shift ROP
1124 gadgets around to less predictable locations. It may also change call address
1125 offsets in case the attacker was trying to control that offset in the code.
1126
1127 To control latency and memory footprint, we don't arbitrarily permute functions.
1128 Instead, for some relatively small value of N, we queue up N assembler buffers,
1129 and then emit the N functions in random order, and repeat until all functions
1130 are emitted.
1131
1132 Function reordering has no impact on translation time or code quality.
1133 Measurements indicate that it reduces ROP gadget persistence to about 15%.
1134
1135 Nop insertion
1136 ~~~~~~~~~~~~~
1137
1138 This diversification randomly adds a nop instruction after each regular
1139 instruction, with some probability. Nop instructions of different lengths may
1140 be selected. Nop instructions are never added inside a ``bundle_lock`` region.
1141 Note that when sandboxing is enabled, nop instructions are already being added
1142 for bundle alignment, so the diversification nop instructions may simply be
1143 taking the place of alignment nop instructions, though distributed differently
1144 through the bundle.
1145
1146 In Subzero's current implementation, nop insertion adds 3-5% to the
1147 translation time, but this is probably because it is implemented as a separate
1148 pass that adds actual nop instructions to the IR. The overhead would probably
1149 be a lot less if it were integrated into the assembler pass. The code quality
1150 is also reduced by 3-5%, making nop insertion the most expensive of the
1151 diversification techniques.
1152
1153 Nop insertion is very effective in reducing ROP gadget persistence, at the same
1154 level as basic block randomization (less than 5%). But given nop insertion's
1155 impact on translation time and code quality, one would most likely prefer to use
1156 basic block randomization instead (though the combined effects of the different
1157 diversification techniques have not yet been studied).
1158
1159 Register allocation randomization
1160 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1161
1162 In this diversification, the register allocator tries to make different but
1163 mostly functionally equivalent choices, while maintaining stable code quality.
1164
1165 A naive approach would be the following. Whenever the allocator has more than
1166 one choice for assigning a register, choose randomly among those options. And
1167 whenever there are no registers available and there is a tie for the
1168 lowest-weight variable, randomly select one of the lowest-weight variables to
1169 evict. Because of the one-pass nature of the linear-scan algorithm, this
1170 randomization strategy can have a large impact on which variables are ultimately
1171 assigned registers, with a corresponding large impact on code quality.
1172
1173 Instead, we choose an approach that tries to keep code quality stable regardless
1174 of the random seed. We partition the set of physical registers into equivalence
1175 classes. If a register is pre-colored in the function (i.e., referenced
1176 explicitly by name), it forms its own equivalence class. The remaining
1177 registers are partitioned according to their combination of attributes such as
1178 integer versus floating-point, 8-bit versus 32-bit, caller-save versus
1179 callee-saved, etc. Each equivalence class is randomly permuted, and the
1180 complete permutation is applied to the final register assignments.
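A sketch of the equivalence-class permutation, with illustrative register names and attribute tuples:

```python
import random
from collections import defaultdict

def permute_registers(regs, precolored, seed):
    """Partition registers into equivalence classes by their attribute
    tuple (pre-colored registers form singleton classes), shuffle within
    each class, and return the old->new permutation."""
    rng = random.Random(seed)
    classes = defaultdict(list)
    for reg, attrs in regs.items():
        key = ("precolored", reg) if reg in precolored else attrs
        classes[key].append(reg)
    perm = {}
    for members in classes.values():
        shuffled = members[:]
        rng.shuffle(shuffled)
        perm.update(zip(members, shuffled))
    return perm

regs = {"eax": ("int", "caller"), "ecx": ("int", "caller"),
        "edx": ("int", "caller"), "ebx": ("int", "callee")}
perm = permute_registers(regs, precolored={"edx"}, seed=42)
print(perm["edx"])  # edx: a pre-colored register always maps to itself
```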
1181
1182 Register randomization reduces ROP gadget persistence to about 10% on average,
1183 though there tends to be fairly high variance across functions and applications.
1184 This probably has to do with the set of restrictions in the x86-32 instruction
1185 set and ABI, such as few general-purpose registers, ``%eax`` used for return
1186 values, ``%edx`` used for division, ``%cl`` used for shifting, etc. As
1187 intended, register randomization has no impact on code quality, and a slight
1188 (0.5%) impact on translation time due to an extra scan over the variables to
1189 identify pre-colored registers.
1190
1191 Fuzzing
1192 ^^^^^^^
1193
1194 We have started fuzz-testing the ``.pexe`` files input to Subzero, using a
1195 combination of `afl-fuzz <http://lcamtuf.coredump.cx/afl/>`_, LLVM's `libFuzzer
1196 <http://llvm.org/docs/LibFuzzer.html>`_, and custom tooling. The purpose is to
1197 find and fix cases where Subzero crashes or otherwise ungracefully fails on
1198 unexpected inputs, and to do so automatically over a large range of unexpected
1199 inputs. By fixing bugs that arise from fuzz testing, we reduce the possibility
1200 of an attacker exploiting these bugs.
1201
1202 Most of the problems found so far are ones most appropriately handled in the
1203 parser. However, there have been a couple that have identified problems in the
1204 lowering, or otherwise inappropriately triggered assertion failures and fatal
1205 errors. We continue to dig into this area.
1206
1207 Future security work
1208 ^^^^^^^^^^^^^^^^^^^^
1209
1210 Subzero is well-positioned to explore other future security enhancements, e.g.:
1211
1212 - Tightening the Native Client sandbox. ABI changes, such as the previous work
1213 on `hiding the sandbox base address
1214 <https://docs.google.com/document/d/1eskaI4353XdsJQFJLRnZzb_YIESQx4gNRzf31dqXVG8>`_
1215 in x86-64, are easy to experiment with in Subzero.
1216
1217 - Making the executable code section read-only. This would prevent a PNaCl
1218 application from inspecting its own binary and trying to find ROP gadgets even
1219 after code diversification has been performed. It may still be susceptible to
1220 `blind ROP <http://www.scs.stanford.edu/brop/bittau-brop.pdf>`_ attacks,
1221 but security is still improved overall.
1222
1223 - Instruction selection diversification. It may be possible to lower a given
1224 instruction in several largely equivalent ways, which gives more opportunities
1225 for code randomization.
1226
1227 Chrome integration
1228 ------------------
1229
1230 Currently Subzero is available in Chrome for the x86-32 architecture, but under
1231 a flag. When the flag is enabled, Subzero is used when the `manifest file
1232 <https://developer.chrome.com/native-client/reference/nacl-manifest-format>`_
1233 linking to the ``.pexe`` file specifies the ``O0`` optimization level.
1234
1235 The next step is to remove the flag, i.e. invoke Subzero as the only translator
1236 for ``O0``-specified manifest files.
1237
1238 Ultimately, Subzero might produce code rivaling LLVM ``O2`` quality, in which
1239 case Subzero could be used for all PNaCl translation.
1240
1241 Command line options
1242 --------------------
1243
1244 Subzero has a number of command-line options for debugging and diagnostics.
1245 Among the more interesting are the following.
1246
1247 - Using the ``-verbose`` flag, Subzero will dump the CFG, or produce other
1248 diagnostic output, with various levels of detail after each pass. Instruction
1249 numbers can be printed or suppressed. Deleted instructions can be printed or
1250 suppressed (they are retained in the instruction list, as discussed earlier,
1251 because they can help explain how lower-level instructions originated).
1252 Liveness information can be printed when available. Details of register
1253 allocation can be printed as register allocator decisions are made. And more.
1254
1255 - Running Subzero with any level of verbosity produces an enormous amount of
1256 output. When debugging a single function, most of that output is noise, so
1257 the ``-verbose-focus`` flag suppresses verbose output except for the
1258 specified function.
1259
1260 - Subzero has a ``-timing`` option that prints a breakdown of pass-level timing
1261 at exit. Timing markers can be placed in the Subzero source code to demarcate
1262 logical operations or passes of interest. Basic timing information plus
1263 call-stack type timing information is printed at the end.
1264
1265 - Along with ``-timing``, the user can also request a report on the overall
1266 translation time for each function, to help focus on timing outliers. Also,
1267 ``-timing-focus`` limits the ``-timing`` reporting to a single function,
1268 instead of aggregating pass timing across all functions.
1269
1270 - The ``-szstats`` option reports various statistics on each function, such as
1271 stack frame size, static instruction count, etc. It may be helpful to track
1272 these stats over time as Subzero is improved, as an approximate measure of
1273 code quality.
1274
1275 - The flag ``-asm-verbose``, in conjunction with emitting textual assembly
1276 output, annotates the assembly output with register-focused liveness
1277 information. In particular, each basic block is annotated with which
1278 registers are live-in and live-out, and each instruction is annotated with
1279 which registers' and stack locations' live ranges end at that instruction.
1280 This is really useful when studying the generated code to find opportunities
1281 for code quality improvements.
1282
1283 Testing and debugging
1284 ---------------------
1285
1286 LLVM lit tests
1287 ^^^^^^^^^^^^^^
1288
1289 For basic testing, Subzero uses LLVM's `lit
1290 <http://llvm.org/docs/CommandGuide/lit.html>`_ framework for running tests. We
1291 have a suite of hundreds of small functions where we test for particular
1292 assembly code patterns across different target architectures.
1293
1294 Cross tests
1295 ^^^^^^^^^^^
1296
1297 Unfortunately, the lit tests don't do a great job of precisely testing the
1298 correctness of the output. Much better are the cross tests, which are execution
1299 tests that compare Subzero and ``pnacl-llc`` translated bitcode across a wide
1300 variety of interesting inputs. Each cross test consists of a set of C, C++,
1301 and/or low-level bitcode files. The C and C++ source files are compiled down to
1302 bitcode. The bitcode files are translated by ``pnacl-llc`` and also by Subzero.
1303 Subzero mangles global symbol names with a special prefix to avoid duplicate
1304 symbol errors. A driver program invokes both versions on a large set of
1305 interesting inputs, and reports when the Subzero and ``pnacl-llc`` results
1306 differ. Cross tests turn out to be an excellent way of testing the basic
1307 lowering patterns, but they are less useful for testing more global things like
1308 liveness analysis and register allocation.
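
The driver loop described above can be sketched roughly as follows, using
hypothetical Python stand-ins for the two translated versions of a function;
the real cross tests link native code from ``pnacl-llc`` and Subzero into one
binary:

```python
# Hypothetical sketch of a cross-test driver: run the same inputs through a
# reference implementation (standing in for the pnacl-llc-translated code) and
# a test implementation (standing in for the Subzero-translated code, whose
# symbols would be mangled with a prefix), and report any differences.
def reference_impl(x):
    return x * x  # stand-in for the pnacl-llc-translated function

def subzero_impl(x):
    return x * x  # stand-in for the Subzero-translated function

def cross_test(inputs):
    """Return the list of inputs where the two implementations disagree."""
    return [x for x in inputs if reference_impl(x) != subzero_impl(x)]

mismatches = cross_test(range(-100, 100))
print("PASS" if not mismatches else f"FAIL on {mismatches[:5]}")
```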
1309
1310 Bisection debugging
1311 ^^^^^^^^^^^^^^^^^^^
1312
1313 Sometimes with a new application, Subzero will end up producing incorrect code
1314 that either crashes at runtime or otherwise produces the wrong results. When
1315 this happens, we need to narrow it down to a single function (or small set of
1316 functions) that yield incorrect behavior. For this, we have a bisection
1317 debugging framework. Here, we initially translate the entire application once
1318 with Subzero and once with ``pnacl-llc``. We then use ``objcopy`` to
1319 selectively weaken symbols based on a whitelist or blacklist provided on the
1320 command line. The two object files can then be linked together without link
1321 errors, with the desired version of each method "winning". Then the binary is
1322 tested, and bisection proceeds based on whether the binary produces correct
1323 output.
1324
1325 When the bisection completes, we are left with a minimal set of
1326 Subzero-translated functions that cause the failure. Usually it is a single
1327 function, though sometimes it might require a combination of several functions
1328 to cause a failure; this may be due to an incorrect call ABI, for example.
1329 However, Murphy's Law implies that the single failing function is enormous and
1330 impractical to debug. In that case, we can restart the bisection, explicitly
1331 blacklisting the enormous function, and try to find another candidate to debug.
1332 (Future work is to automate this to find all minimal sets of functions, so that
1333 debugging can focus on the simplest example.)
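
The bisection search itself, under the simplifying assumption of a single
culprit function, can be sketched as follows (``fails`` is a hypothetical
stand-in for the link-run-check step described above):

```python
# Hypothetical sketch of the bisection search over Subzero-translated
# functions. fails(subset) stands in for: link a binary where exactly the
# functions in `subset` take the Subzero version (the rest take the
# pnacl-llc version), run it, and return True if the output is wrong.
# This simple version assumes a single culprit; the real framework also
# handles whitelists, blacklists, and multi-function failure sets.
def bisect(functions, fails):
    candidates = list(functions)
    while len(candidates) > 1:
        half = candidates[:len(candidates) // 2]
        # If the first half alone reproduces the failure, recurse into it;
        # otherwise the culprit must be in the second half.
        candidates = half if fails(half) else candidates[len(half):]
    return candidates[0]

# Usage: pretend function "f17" is the one that is miscompiled.
funcs = [f"f{i}" for i in range(32)]
culprit = bisect(funcs, lambda subset: "f17" in subset)
print(culprit)  # → f17
```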
1334
1335 Fuzz testing
1336 ^^^^^^^^^^^^
1337
1338 As described above, we try to find internal Subzero bugs using fuzz testing
1339 techniques.
1340
1341 Sanitizers
1342 ^^^^^^^^^^
1343
1344 Subzero can be built with `AddressSanitizer
1345 <http://clang.llvm.org/docs/AddressSanitizer.html>`_ (ASan) and/or
1346 `ThreadSanitizer <http://clang.llvm.org/docs/ThreadSanitizer.html>`_ (TSan)
1347 support. This is done using something as simple as ``make ASAN=1`` or ``make TSAN=1``. So
1348 far, multithreading has been simple enough that TSan hasn't found any bugs, but
1349 ASan has found at least one memory leak which was subsequently fixed.
1350 `UndefinedBehaviorSanitizer
1351 <http://clang.llvm.org/docs/UsersManual.html#controlling-code-generation>`_
1352 (UBSan) support is in progress.
1353
1354 Current status
1355 ==============
1356
1357 Target architectures
1358 --------------------
1359
1360 Subzero is currently more or less complete for the x86-32 target. It has been
1361 refactored and extended to handle x86-64 as well, and that is mostly complete at
1362 this point.
1363
1364 ARM32 work is in progress. It currently lacks the testing level of x86, at
1365 least in part because Subzero's register allocator needs modifications to handle
1366 ARM's aliasing of floating point and vector registers. Specifically, a 64-bit
1367 register is actually a gang of two consecutive and aligned 32-bit registers, and
1368 a 128-bit register is a gang of 4 consecutive and aligned 32-bit registers.
1369 ARM64 work has not started; when it does, it will be native-only, since the
1370 Native Client sandbox model, validator, and other tools were never defined for ARM64.
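
The ARM32 register ganging described above can be modeled with a small sketch
(hypothetical Python, not Subzero's actual register representation): each
64-bit D register overlaps a pair of consecutive, aligned 32-bit S registers,
and each 128-bit Q register overlaps four.

```python
# Minimal sketch (not Subzero's data structures) of ARM32 register aliasing:
# D<n> overlaps S<2n> and S<2n+1>, and Q<n> overlaps S<4n>..S<4n+3>, so a
# register allocator must treat overlapping registers as mutually conflicting.
def s_aliases(reg):
    """Return the set of 32-bit S-register numbers a register overlaps."""
    kind, num = reg[0], int(reg[1:])
    if kind == "S":
        return {num}
    if kind == "D":  # 64-bit: a gang of two consecutive, aligned S registers
        return {2 * num, 2 * num + 1}
    if kind == "Q":  # 128-bit: a gang of four consecutive, aligned S registers
        return {4 * num + i for i in range(4)}
    raise ValueError(reg)

def conflicts(a, b):
    """Two registers conflict if their S-register footprints overlap."""
    return bool(s_aliases(a) & s_aliases(b))

print(conflicts("D1", "S3"))  # → True  (D1 covers S2 and S3)
print(conflicts("Q0", "D2"))  # → False (Q0 covers S0..S3; D2 covers S4, S5)
```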
1371
1372 An external contributor is adding MIPS support, in large part by following the
1373 ARM work.
1374
1375 Translator performance
1376 ----------------------
1377
1378 Single-threaded translation speed is currently about 5× the ``pnacl-llc``
1379 translation speed. For a large ``.pexe`` file, the time breaks down as:
1380
1381 - 11% for parsing and initial IR building
1382
1383 - 4% for emitting to /dev/null
1384
1385 - 27% for liveness analysis (two liveness passes plus live range construction)
1386
1387 - 15% for linear-scan register allocation
1388
1389 - 9% for basic lowering
1390
1391 - 10% for advanced phi lowering
1392
1393 - ~11% for other minor analysis
1394
1395 - ~10% measurement overhead to acquire these numbers
1396
1397 Some improvements could undoubtedly be made, but it will be hard to increase the
1398 speed to 10× that of ``pnacl-llc`` while keeping acceptable code quality. With
1399 ``-Om1`` (lack of) optimization, we do actually achieve roughly 10×
1400 ``pnacl-llc`` translation speed, but code quality drops by a factor of 3.
1401
1402 Code quality
1403 ------------
1404
1405 Measured across 16 components of spec2k, Subzero's code quality is uniformly
1406 better than ``pnacl-llc`` ``-O0`` code quality, and in many cases solidly
1407 between ``pnacl-llc`` ``-O0`` and ``-O2``.
1408
1409 Translator size
1410 ---------------
1411
1412 When built in MINIMAL mode, the x86-64 native translator size for the x86-32
1413 target is about 700 KB, not including the size of functions referenced in
1414 dynamically-linked libraries. The sandboxed version of Subzero is a bit over 1
1415 MB, and it is statically linked and also includes nop padding for bundling as
1416 well as indirect branch masking.
1417
1418 Translator memory footprint
1419 ---------------------------
1420
1421 It's hard to draw firm conclusions about memory footprint, since the footprint
1422 is at least proportional to the input function size, and there is no real limit
1423 on the size of functions in the ``.pexe`` file.
1424
1425 That said, we looked at the memory footprint over time as Subzero translated
1426 ``pnacl-llc.pexe``, which is the largest ``.pexe`` file (7.2 MB) at our
1427 disposal. One of LLVM's libraries that Subzero uses can report the current
1428 malloc heap usage. With single-threaded translation, Subzero tends to hover
1429 around 15 MB of memory usage. There are a couple of monstrous functions where
1430 Subzero grows to around 100 MB, but then it drops back down after those
1431 functions finish translating. In contrast, ``pnacl-llc`` grows larger and
1432 larger throughout translation, reaching several hundred MB by the time it
1433 completes.
1434
1435 It's a bit more interesting when we enable multithreaded translation. When
1436 there are N translation threads, Subzero implements a policy that limits the
1437 size of the translation queue to N entries -- if it is "full" when the parser
1438 tries to add a new CFG, the parser blocks until one of the translation threads
1439 removes a CFG. This means the number of in-memory CFGs can (and generally does)
1440 reach ``2*N+1``, and so the memory footprint rises in proportion to the number of
1441 threads. Adding to the pressure is the observation that the monstrous functions
1442 also take proportionally longer time to translate, so there's a good chance many
1443 of the monstrous functions will be active at the same time with multithreaded
1444 translation. As a result, for N=32, Subzero's memory footprint peaks at about
1445 260 MB, but drops back down as the large functions finish translating.
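
The throttling policy can be sketched with a bounded queue (a hypothetical
Python model; Subzero's actual implementation is in C++):

```python
import queue
import threading

# Hypothetical sketch of the translation-queue throttling policy: with N
# translation threads, the queue holds at most N CFGs, so the parser blocks
# on put() whenever the workers fall behind. Counting the CFGs being
# translated and the one the parser is building, up to 2*N+1 CFGs can be
# in memory at once.
N = 4
work = queue.Queue(maxsize=N)  # bounded queue -> parser blocks when full
translated = []

def translator():
    while True:
        cfg = work.get()
        if cfg is None:  # sentinel: no more work for this thread
            break
        translated.append(f"code({cfg})")  # stand-in for lowering the CFG

threads = [threading.Thread(target=translator) for _ in range(N)]
for t in threads:
    t.start()
for i in range(20):      # the parser producing CFGs
    work.put(f"cfg{i}")  # blocks whenever N CFGs are already queued
for _ in threads:
    work.put(None)
for t in threads:
    t.join()
print(len(translated))  # → 20
```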
1446
1447 If this peak memory size becomes a problem, it might be possible for the parser
1448 to resequence the functions to try to spread out the larger functions, or to
1449 throttle the translation queue to prevent too many in-flight large functions.
1450 It may also be possible to throttle based on memory pressure signaling from
1451 Chrome.
1452
1453 Translator scalability
1454 ----------------------
1455
1456 Currently scalability is "not very good". Multiple translation threads lead to
1457 faster translation, but not to the degree desired. We haven't dug in to
1458 investigate yet.
1459
1460 There are a few areas to investigate. First, there may be contention on the
1461 constant pool, which all threads access, and which requires locked access even
1462 for reading. This could be mitigated by keeping a CFG-local cache of the most
1463 common constants.
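
The proposed mitigation, a CFG-local cache in front of the locked constant
pool, might look like the following sketch (hypothetical Python; the class
names are illustrative, not Subzero's):

```python
import threading

# Hypothetical sketch: a global constant pool that requires a lock even for
# reads, fronted by a per-CFG cache of the most common constants so that hot
# lookups avoid the lock entirely.
class ConstantPool:
    def __init__(self):
        self._lock = threading.Lock()
        self._pool = {}

    def get(self, value):
        """Intern `value`, returning a canonical constant object."""
        with self._lock:  # contended: every lookup takes the global lock
            return self._pool.setdefault(value, ("const", value))

class CfgLocalPool:
    def __init__(self, shared):
        self._shared = shared
        self._local = {}  # CFG-local cache: accessed by one thread, no lock

    def get(self, value):
        const = self._local.get(value)
        if const is None:  # miss: fall back to the locked global pool
            const = self._local[value] = self._shared.get(value)
        return const

shared = ConstantPool()
local = CfgLocalPool(shared)
a = local.get(42)
b = local.get(42)  # served from the local cache, no lock taken
print(a is b)  # → True
```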
1464
1465 Second, there may be contention on memory allocation. While almost all CFG
1466 objects are allocated from the CFG-local allocator, some passes use temporary
1467 STL containers that use the default allocator, which may require global locking.
1468 This could be mitigated by switching these to the CFG-local allocator.
1469
1470 Third, multithreading may make the default allocator strategy more expensive.
1471 In a single-threaded environment, a pass will allocate its containers, run the
1472 pass, and deallocate the containers. This results in stack-like allocation
1473 behavior and makes the heap free list easier to manage, with less heap
1474 fragmentation. But when multithreading is added, the allocations and
1475 deallocations become much less stack-like, making allocation and deallocation
1476 operations individually more expensive. Again, this could be mitigated by
1477 switching these to the CFG-local allocator.