Design of the Subzero fast code generator
=========================================

Introduction
------------

The `Portable Native Client (PNaCl) <http://gonacl.com>`_ project includes
compiler technology based on `LLVM <http://llvm.org/>`_. The web program
developer uses the PNaCl toolchain to compile their application to
architecture-neutral PNaCl bitcode (a pexe file), using as much
architecture-neutral optimization as possible. The pexe file is downloaded to
the user's browser where the PNaCl translator (a component of Chrome) compiles
the pexe file to sandboxed native code. The translator uses
architecture-specific optimizations as much as practical to generate good native
code.

The native code can be cached by the browser to avoid repeating translation on
future page loads. However, first-time user experience can be hampered by long
translation times. The LLVM-based PNaCl translator is quite slow, even when
using ``-O0`` to minimize optimizations, so delays are especially noticeable on
slow browser platforms such as ARM-based Chromebooks.

Translator slowness can be mitigated or hidden in a number of ways.

- Parallel translation. However, the slow machines where this matters most,
  e.g. ARM Chromebooks, are likely to have fewer cores to parallelize across.

- Streaming translation, i.e. starting translation as soon as the download
  starts. This doesn't help much when translation speed is 10x slower than
  download speed, or when the pexe file is already cached but the translated
  binary was flushed from the cache.

- Arranging the web page such that translation is done in parallel with
  downloading large assets. Arranging the web page to distract the user with
  cat videos while translation is in progress. Or, improving translator
  performance to something more reasonable.

Goals
=====

Translation speed
-----------------

We'd like to be able to translate a pexe as fast as download speed. Any faster
is in a sense wasted effort. Download speed varies greatly, but we'll
arbitrarily say 1 MB/sec. We'll pick an ARM A15 platform as the example of a
slow machine. We observe a 3x single-thread performance difference between the
A15 and a z620 workstation, and aggressively assume a pexe file could be
compressed to 50% on the web server before network transport, so we set the
translation speed goal to 6 MB/sec on a z620 workstation.
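
The arithmetic behind that 6 MB/sec figure can be made explicit. The sketch
below simply restates the assumptions from the paragraph above (1 MB/sec
download, 2x decompression, 3x single-thread ratio); the variable names are
illustrative:

```python
# Back-of-the-envelope derivation of the 6 MB/sec translation-speed goal.
download_speed_mb_s = 1.0   # assumed network download speed
compression_ratio = 2.0     # pexe assumed compressed to 50% for transport
a15_vs_z620_ratio = 3.0     # z620 single-thread perf relative to an A15

# The A15 must translate decompressed pexe bytes as fast as they arrive.
a15_goal = download_speed_mb_s * compression_ratio   # 2 MB/sec on the A15

# The equivalent goal, measured on the faster z620 workstation.
z620_goal = a15_goal * a15_vs_z620_ratio             # 6 MB/sec on the z620

print(z620_goal)  # 6.0
```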

Currently, at the ``-O0`` level, the LLVM-based PNaCl translator translates at
1/10 the target rate. The ``-O2`` mode takes 3x as long as the ``-O0`` mode.

In other words, Subzero's goal is to improve over LLVM's translation speed by
10x.

Code quality
------------

Subzero's initial goal is to produce code that meets or exceeds LLVM's ``-O0``
code quality. The stretch goal is to approach LLVM ``-O2`` code quality. On
average, LLVM ``-O2`` performs twice as well as LLVM ``-O0``.

Translator size
---------------

The current LLVM-based translator binary (pnacl-llc) is about 10MB in size. We
think 1MB is a more reasonable size, so we target a 10x reduction in binary
size.

For development, Subzero can be built for all target architectures, with all
debugging and diagnostic options enabled. For a smaller translator, we restrict
the build to a single target architecture, and define a MINIMAL build where
unnecessary features are compiled out.

Memory footprint
----------------

The current LLVM-based translator suffers from an issue in which some
function-specific data has to be retained in memory until all translation
completes, and therefore the memory footprint grows without bound. Large pexes
can lead to the translator process holding hundreds of MB of memory by the end.

Subzero should maintain a stable memory footprint throughout translation. It's
hard to set a specific limit, because there is no practical limit on a single
function's size, but the footprint should be "reasonable", without leaks or
inexorable growth. (We use ASAN builds to test for leaks.)

Multithreaded translation
-------------------------

It should be practical to translate different functions concurrently and see
good scalability. Some locking may be needed, such as accessing output buffers
or constant pools, but that should be fairly minimal. In contrast, LLVM was
only designed for module-level parallelism, and as such, the PNaCl translator
internally splits a pexe file into several modules for concurrent translation.
All output needs to be deterministic regardless of the level of multithreading,
i.e. functions and data should always be output in the same order.

Target architectures
--------------------

Initial target architectures are x86-32, x86-64, ARM32, and MIPS32. Future
targets include ARM64 and MIPS64, though these targets lack NaCl support,
including a sandbox model or a validator.

The first implementation is for x86-32, because it was expected to be
particularly challenging, and thus more likely to draw out any design problems
early:

- There are a number of special cases, asymmetries, and warts in the x86
  instruction set.

- Complex addressing modes may be leveraged for better code quality.

- 64-bit integer operations have to be lowered into longer sequences of 32-bit
  operations.

- Paucity of physical registers may reveal code quality issues early in the
  design.

Detailed design
===============

Intermediate representation - ICE
---------------------------------

Subzero's IR is called ICE. It is designed to be reasonably similar to LLVM's
IR, which is reflected in the pexe bitcode structure.

A function is represented by the ``Ice::Cfg`` class (CFG = control flow graph).
Its key contents include:

- A list of ``CfgNode`` pointers, generally held in topological order.

- A list of ``Variable`` pointers corresponding to local variables used in the
  function plus compiler-generated temporaries.

A basic block is represented by the ``Ice::CfgNode`` class. Its key contents
include:

- A linear list of instructions, in the same style as LLVM. The last
  instruction of the list is always a terminator instruction: branch, switch,
  return, unreachable.

- A list of Phi instructions, also in the same style as LLVM. They are held as
  a linear list for convenience, though per Phi semantics, they are executed
  "in parallel" without dependencies on each other.

- An unordered list of ``CfgNode`` pointers corresponding to incoming edges,
  and another list for outgoing edges.

- The node's unique, 0-based index into the Cfg's node list.

An instruction is represented by the ``Ice::Inst`` class. Its key contents
include:

- A list of source operands.

- Its destination variable, if the instruction produces a result in an
  ``Ice::Variable``.

- A bitvector indicating which variables' live ranges this instruction ends.
  This is computed during liveness analysis.

Instruction kinds are divided into high-level ICE instructions and low-level
ICE instructions. High-level instructions consist of the PNaCl/LLVM bitcode
instruction kinds. Each target architecture implementation extends the
instruction space with its own set of low-level instructions. Generally,
low-level instructions correspond to individual machine instructions. The
high-level ICE instruction space includes a few additional instruction kinds
that are not part of LLVM but are generally useful (e.g., an Assignment
instruction), or are useful across targets (e.g., BundleLock and BundleUnlock
instructions for sandboxing).

Specifically, the high-level ICE instructions that derive from LLVM are the
following:

- Alloca: allocate data on the stack

- Arithmetic: binary operations of the form A = B op C

- Br: conditional or unconditional branch

- Call: function call

- Cast: unary type-conversion operations

- ExtractElement: extract a scalar element from a vector-type value

- Fcmp: floating-point comparison

- Icmp: integer comparison

- IntrinsicCall: call a known intrinsic

- InsertElement: insert a scalar element into a vector-type value

- Load: load a value from memory

- Phi: implement the SSA phi node

- Ret: return from the function

- Select: essentially the C operation of the form X = C ? Y : Z

- Store: store a value into memory

- Switch: generalized branch to multiple possible locations

- Unreachable: indicate that this portion of the code is unreachable

The additional high-level ICE instructions are the following:

- Assign: a simple ``A=B`` assignment. This is useful for e.g. lowering Phi
  instructions to non-SSA assignments, before lowering to machine code.

- BundleLock, BundleUnlock. These are markers used for sandboxing, but are
  common across all targets and so they are elevated to the high-level
  instruction set.

- FakeDef, FakeUse, FakeKill. These are tools used to preserve consistency in
  liveness analysis, elevated to the high-level instruction set because they
  are used by all targets. They are described in more detail at the end of
  this section.

- JumpTable: this represents the result of switch optimization analysis, where
  some switch instructions may use jump tables instead of cascading
  compare/branches.

An operand is represented by the ``Ice::Operand`` class. In high-level ICE, an
operand is either an ``Ice::Constant`` or an ``Ice::Variable``. Constants
include integer constants, floating point constants, Undef (an unspecified
constant of a particular type), and symbol constants (essentially addresses of
globals). A variable represents a value allocated on the stack (though not
including alloca-derived storage). Among other things, a variable holds its
unique, 0-based index into the Cfg's variable list.

Each target can extend the ``Constant`` and ``Variable`` classes for its own
needs. In addition, the ``Operand`` class may be extended, e.g. to define an
x86 ``MemOperand`` that encodes a base register, an index register, an index
register shift amount, and a constant offset.

Register allocation and liveness analysis are restricted to Variable operands.
Because of the importance of register allocation to code quality, and the
translation-time cost of liveness analysis, Variable operands get some special
treatment in ICE. Most notably, a frequent pattern in Subzero is to iterate
across all the Variables of an instruction. An instruction holds a list of
operands, but an operand may contain 0, 1, or more Variables. As such, the
``Operand`` class specially holds a list of the Variables contained within,
for quick access.
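
To make the Variables-within-operands pattern concrete, here is a minimal
Python sketch (Subzero itself is C++; the class shapes and the ``get_vars``
name are illustrative, not the real API) of an x86-style memory operand that
exposes the Variables it contains:

```python
class Operand:
    """Base operand: exposes contained Variables for quick iteration."""
    def get_vars(self):
        return []

class Variable(Operand):
    def __init__(self, index, name):
        self.index = index  # unique 0-based index into the Cfg's variable list
        self.name = name
    def get_vars(self):
        return [self]

class Constant(Operand):
    def __init__(self, value):
        self.value = value  # contains no Variables

class MemOperand(Operand):
    """x86-style address: base + index * 2**shift + offset."""
    def __init__(self, base=None, index=None, shift=0, offset=0):
        self.base, self.index = base, index
        self.shift, self.offset = shift, offset
    def get_vars(self):
        # A single operand may contain 0, 1, or 2 Variables.
        return [v for v in (self.base, self.index) if v is not None]

b = Variable(0, "b")
i = Variable(1, "i")
mem = MemOperand(base=b, index=i, shift=2, offset=8)
print([v.name for v in mem.get_vars()])  # ['b', 'i']
```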

A Subzero transformation pass may work by deleting an existing instruction and
replacing it with zero or more new instructions. Instead of actually deleting
the existing instruction, we generally mark it as deleted and insert the new
instructions right after the deleted instruction. When printing the IR for
debugging, this is a big help because it makes it much clearer how the
non-deleted instructions came about.

Subzero has a few special instructions to help with liveness analysis
consistency.

- The FakeDef instruction gives a fake definition of some variable. For
  example, on x86-32, a divide instruction defines both ``eax`` and ``edx``
  but an ICE instruction can represent only one destination variable. This is
  similar for multiply instructions, and for function calls that return a
  64-bit integer result in the ``edx:eax`` pair. Also, using the ``xor eax,
  eax`` trick to set ``eax`` to 0 requires an initial FakeDef of ``eax``.

- The FakeUse instruction registers a use of a variable, typically to prevent
  an earlier assignment to that variable from being dead-code eliminated. For
  example, lowering an operation like ``x=cc?y:z`` may be done using x86's
  conditional move (cmov) instruction: ``mov x, z; cmov_cc x, y``. Without a
  FakeUse of ``x`` between the two instructions, the liveness analysis pass
  may dead-code eliminate the first instruction.

- The FakeKill instruction is added after a call instruction, and is a quick
  way of indicating that caller-save registers are invalidated.

Overview of lowering
--------------------

In general, translation goes like this:

- Parse the next function from the pexe file and construct a Cfg consisting of
  high-level ICE.

- Do analysis passes and transformation passes on the high-level ICE, as
  desired.

- Lower each high-level ICE instruction into a sequence of zero or more
  low-level ICE instructions. Each high-level instruction is generally lowered
  independently, though the target lowering is allowed to look ahead in the
  CfgNode's instruction list if desired.

- Do more analysis and transformation passes on the low-level ICE, as desired.

- Assemble the low-level Cfg into an ELF object file (alternatively, a textual
  assembly file that is later assembled by some external tool).

- Repeat for all functions, and also produce object code for data such as
  global initializers and internal constant pools.

Currently there are two optimization levels: ``O2`` and ``Om1``. For ``O2``,
the intention is to apply all available optimizations to get the best code
quality (though the initial code quality goal is measured against LLVM's
``O0`` code quality). For ``Om1``, the intention is to apply as few
optimizations as possible and produce code as quickly as possible, accepting
poor code quality. ``Om1`` is short for "O-minus-one", i.e. "worse than O0",
or in other words, "sub-zero".

Debuggability of generated code was never considered in the design, but so far,
Subzero doesn't really do transformations that would obfuscate debugging. The
main thing might be that register allocation (including stack slot coalescing
for stack-allocated variables whose live ranges don't overlap) may render a
variable's value unobtainable after its live range ends. This would not be an
issue for ``Om1`` since it doesn't register-allocate program-level variables,
nor does it coalesce stack slots.

Our experience so far is that ``Om1`` translates twice as fast as ``O2``, but
produces code with one third the code quality. ``Om1`` is good for testing and
debugging -- during translation, it tends to expose errors in the basic
lowering that might otherwise have been hidden by the register allocator or
other optimization passes. It also helps determine whether a code correctness
problem is a fundamental problem in the basic lowering, or an error in another
optimization pass.

The implementation of target lowering also controls the recipe of passes used
for ``Om1`` and ``O2`` translation. For example, address mode inference may
only be relevant for x86.

Pexe parsing
------------

Subzero includes an integrated PNaCl pexe bitcode parser. It parses the pexe
file function by function, ultimately constructing an ICE Cfg for each
function. After a function is parsed, its Cfg is handed off to the translation
phase. The bitcode parser also parses global initializer data and hands it off
to be translated to data sections in the object file.

Subzero has another parsing strategy for testing/debugging. LLVM libraries can
be used to parse a module into LLVM IR (though very slowly relative to
Subzero's native parsing). Then we iterate across the LLVM IR and construct
high-level ICE, handing off each Cfg to the translation phase.

Lowering strategy
-----------------

The core of Subzero's lowering from high-level ICE to low-level ICE is to lower
each high-level instruction down to a sequence of low-level target-specific
instructions, in a largely context-free setting. That is, each high-level
instruction conceptually has a simple template expansion into low-level
instructions, and lowering can in theory be done in any order. This may sound
like a small effort, but quite a large number of templates may be needed
because of the number of PNaCl types and instruction variants. Furthermore,
there may be optimized templates, e.g. to take advantage of operator
commutativity.

The key idea for a lowering template is to produce valid low-level instructions
that are guaranteed to meet address mode and other structural requirements of
the instruction set. For example, on x86, the source operand of an integer
store instruction must be an immediate or a physical register; a shift
instruction's shift amount must be an immediate or in register ``cl``; a
function's integer return value is in ``eax``; most x86 instructions are
two-operand, in contrast to corresponding three-operand high-level
instructions; etc.

Because target lowering runs before register allocation, there is no way to
know whether a given ``Ice::Variable`` operand lives on the stack or in a
physical register. When the low-level instruction calls for a physical
register operand, the target lowering can create an infinite-weight Variable.
This tells the register allocator to assign infinite weight when making
decisions, effectively guaranteeing some physical register. Variables can also
be pre-colored to a specific physical register (``cl`` in the shift example
above), which also gives infinite weight.

To illustrate, consider a high-level arithmetic instruction on 32-bit integer
operands::

    A = B + C

X86 target lowering might produce the following::

    T.inf = B  // mov instruction
    T.inf += C // add instruction
    A = T.inf  // mov instruction

Here, ``T.inf`` is an infinite-weight temporary. As long as ``T.inf`` has a
physical register, the three lowered instructions are all encodable regardless
of whether ``B`` and ``C`` are physical registers, memory, or immediates, and
whether ``A`` is a physical register or in memory.

In this example, ``A`` must be a Variable and one may be tempted to simplify
the lowering sequence by setting ``A`` as infinite-weight and using::

    A = B  // mov instruction
    A += C // add instruction

This has two problems. First, if the original instruction was actually
``A = B + A``, the result would be incorrect. Second, assigning ``A`` a
physical register applies throughout ``A``'s entire live range. This is
probably not what is intended, and may ultimately lead to a failure to allocate
a register for an infinite-weight variable.

This style of lowering leads to many temporaries being generated, so in ``O2``
mode, we rely on the register allocator to clean things up. For example, in
the example above, if ``B`` ends up getting a physical register and its live
range ends at this instruction, the register allocator is likely to reuse that
register for ``T.inf``. This leads to ``T.inf = B`` being a redundant register
copy, which is removed as an emission-time peephole optimization.
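
That emission-time peephole can be sketched as a simple filter over the lowered
instructions. This is an illustrative model, not Subzero's emitter; ``reg_of``
stands in for the register allocator's assignments (``None`` meaning a stack
slot):

```python
def drop_redundant_copies(instrs, reg_of):
    """Remove 'mov dst, src' where both sides landed in the same register."""
    out = []
    for op, dst, src in instrs:
        same_reg = (reg_of.get(dst) is not None
                    and reg_of.get(dst) == reg_of.get(src))
        if op == "mov" and same_reg:
            continue  # dst and src share a register: the copy is a no-op
        out.append((op, dst, src))
    return out

# B's live range ended, so the allocator reused its register for T.inf.
reg_of = {"B": "eax", "T.inf": "eax", "A": None}  # A stays on the stack
lowered = [("mov", "T.inf", "B"), ("add", "T.inf", "C"), ("mov", "A", "T.inf")]
print(drop_redundant_copies(lowered, reg_of))
# [('add', 'T.inf', 'C'), ('mov', 'A', 'T.inf')]
```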

O2 lowering
-----------

Currently, the ``O2`` lowering recipe is the following:

- Address mode inference

- Read-modify-write (RMW) transformation

- Basic liveness analysis

- Load optimization

- Target lowering

- Full liveness analysis

- Register allocation

- Phi instruction lowering (advanced)

- Post-phi lowering register allocation

- Branch optimization

These passes are described in more detail below.

Om1 lowering
------------

Currently, the ``Om1`` lowering recipe is the following:

- Phi instruction lowering (simple)

- Target lowering

- Register allocation (infinite-weight and pre-colored only)

Optimization passes
-------------------

Liveness analysis
^^^^^^^^^^^^^^^^^

Liveness analysis is a standard dataflow optimization, implemented as follows.
For each node (basic block), its live-out set is computed as the union of the
live-in sets of its successor nodes. Then the node's instructions are processed
in reverse order, updating the live set, until the beginning of the node is
reached, and the node's live-in set is recorded. If this iteration has changed
the node's live-in set, the node's predecessors are marked for reprocessing.
This continues until no more nodes need reprocessing. If nodes are processed
in reverse topological order, the number of iterations over the Cfg is
generally equal to the maximum loop nest depth.
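
The iteration described above is a standard backward dataflow worklist loop,
sketched here in Python. The node representation (``succs``, ``preds``, and
per-instruction ``defs``/``uses`` sets) is an illustrative encoding, not
Subzero's data structures:

```python
def compute_liveness(nodes):
    """Backward liveness. nodes is a list in topological order; each node is
    a dict with 'succs' and 'preds' (lists of node indices) and 'instrs', a
    list of (defs, uses) set pairs. Returns (live_in, live_out) per node."""
    live_in = [set() for _ in nodes]
    live_out = [set() for _ in nodes]
    # Seed the worklist in reverse topological order.
    worklist = list(reversed(range(len(nodes))))
    while worklist:
        n = worklist.pop(0)
        # live-out is the union of the successors' live-in sets.
        out = set()
        for s in nodes[n]["succs"]:
            out |= live_in[s]
        live_out[n] = out
        # Walk the instructions in reverse, updating the live set.
        live = set(out)
        for defs, uses in reversed(nodes[n]["instrs"]):
            live -= defs  # a definition ends liveness (going backward)
            live |= uses  # a use begins liveness
        if live != live_in[n]:
            live_in[n] = live
            # The live-in set changed: reprocess the predecessors.
            for p in nodes[n]["preds"]:
                if p not in worklist:
                    worklist.append(p)
    return live_in, live_out

# Two-node example: node 0 defines x, node 1 uses it.
nodes = [
    {"succs": [1], "preds": [], "instrs": [({"x"}, set())]},
    {"succs": [], "preds": [0], "instrs": [(set(), {"x"})]},
]
live_in, live_out = compute_liveness(nodes)
print(live_out[0], live_in[1])  # {'x'} {'x'}
```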

To implement this, each node records its live-in and live-out sets, initialized
to the empty set. Each instruction records which of its Variables' live ranges
end in that instruction, initialized to the empty set. A side effect of
liveness analysis is dead instruction elimination. Each instruction can be
marked as tentatively dead, and after the algorithm converges, the tentatively
dead instructions are permanently deleted.

Optionally, after this liveness analysis completes, we can do live range
construction, in which we calculate the live range of each variable in terms
of instruction numbers. A live range is represented as a union of segments,
where the segment endpoints are instruction numbers. Instruction numbers are
required to be unique across the Cfg, and monotonically increasing within a
basic block. As a union of segments, live ranges can contain "gaps" and are
therefore precise. Because of SSA properties, a variable's live range can
start at most once in a basic block, and can end at most once in a basic
block. Liveness analysis keeps track of which variable/instruction tuples
begin live ranges and end live ranges, and combined with live-in and live-out
sets, we can efficiently build up the live ranges of all variables across all
basic blocks.
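
A union-of-segments live range can be modeled with a small helper. This is an
illustrative sketch; the half-open ``[start, end)`` segment convention is an
assumption, not necessarily Subzero's:

```python
class LiveRange:
    """Union of [start, end) segments keyed by instruction number."""
    def __init__(self):
        self.segments = []  # kept sorted and non-overlapping

    def add_segment(self, start, end):
        self.segments.append((start, end))
        self.segments.sort()
        merged = [self.segments[0]]
        for s, e in self.segments[1:]:
            ls, le = merged[-1]
            if s <= le:                # adjacent or overlapping: merge
                merged[-1] = (ls, max(le, e))
            else:
                merged.append((s, e))  # a "gap" remains between segments
        self.segments = merged

    def overlaps(self, other):
        # Two half-open segments overlap iff each starts before the other ends.
        return any(s1 < e2 and s2 < e1
                   for s1, e1 in self.segments
                   for s2, e2 in other.segments)

a = LiveRange()
a.add_segment(10, 20)
a.add_segment(30, 40)  # the gap between 20 and 30 keeps the range precise
b = LiveRange()
b.add_segment(22, 28)  # fits entirely inside a's gap
print(a.overlaps(b))  # False
```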

A lot of care is taken to make liveness analysis fast and efficient. Because
of the lowering strategy, the number of variables is generally proportional to
the number of instructions, leading to an O(N^2) algorithm if implemented
naively. To improve things based on sparsity, we note that most variables are
"local", referenced in at most one basic block (in contrast to the "global"
variables with multi-block usage), and therefore cannot be live across basic
blocks. Therefore, the live-in and live-out sets, typically represented as bit
vectors, can be limited to the set of global variables, and the intra-block
liveness bit vector can be compacted to hold the global variables plus the
local variables for that block.

Register allocation
^^^^^^^^^^^^^^^^^^^

Subzero implements a simple linear-scan register allocator, based on the
allocator described by Hanspeter Mössenböck and Michael Pfeiffer in `Linear
Scan Register Allocation in the Context of SSA Form and Register Constraints
<ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe02.PDF>`_. This allocator has
several nice features:

- Live ranges are represented as unions of segments, as described above,
  rather than a single start/end tuple.

- It allows pre-coloring of variables with specific physical registers.

- It applies equally well to pre-lowered Phi instructions.

The paper suggests an approach of aggressively coalescing variables across Phi
instructions (i.e., trying to force Phi source and destination variables to
have the same register assignment), but we reject that in favor of the more
natural preference mechanism described below.
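
For orientation, here is the textbook single-segment linear-scan loop in
Python. It is a deliberate simplification of the Mössenböck-Pfeiffer allocator
described above: it omits segmented live ranges, pre-coloring, weights, and
spill heuristics, and the data shapes are illustrative:

```python
def linear_scan(ranges, registers):
    """Textbook linear scan over single-segment live ranges.
    ranges: dict name -> (start, end); returns dict name -> register, with
    None meaning the variable is spilled to the stack."""
    alloc, active, free = {}, [], list(registers)
    # Visit live ranges in order of increasing start point.
    for name, (start, end) in sorted(ranges.items(), key=lambda kv: kv[1][0]):
        # Expire active ranges that ended before this one starts,
        # returning their registers to the free pool.
        for other in [o for o in active if ranges[o][1] <= start]:
            active.remove(other)
            free.append(alloc[other])
        if free:
            alloc[name] = free.pop()
            active.append(name)
        else:
            alloc[name] = None  # no register available: stack-allocate
    return alloc

ranges = {"a": (0, 4), "b": (1, 6), "c": (5, 8)}
print(linear_scan(ranges, ["eax", "ecx"]))
# {'a': 'ecx', 'b': 'eax', 'c': 'ecx'}  -- c reuses a's expired register
```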

We enhance the algorithm in the paper with automatic inference of register
preference, and with the ability to let overlapping live ranges safely share
the same register in certain circumstances. If we are considering register
allocation for variable ``A``, and ``A`` has a single defining instruction
``A=B+C``, then the preferred register for ``A``, if available, would be the
register assigned to ``B`` or ``C``, if any, provided that ``B`` or ``C``'s
live range does not overlap ``A``'s live range. In this way we infer a good
register preference for ``A``.

We allow overlapping live ranges to get the same register in certain cases.
Suppose a high-level instruction like::

    A = unary_op(B)

has been target-lowered like::

    T.inf = B
    A = unary_op(T.inf)

Further, assume that ``B``'s live range continues beyond this instruction
sequence, and that ``B`` has already been assigned some register. Normally, we
might want to infer ``B``'s register as a good candidate for ``T.inf``, but it
turns out that ``T.inf`` and ``B``'s live ranges overlap, requiring them to
have different registers. But ``T.inf`` is just a read-only copy of ``B`` that
is guaranteed to be in a register, so in theory these overlapping live ranges
could safely have the same register. Our implementation allows this overlap as
long as ``T.inf`` is never modified within ``B``'s live range, and ``B`` is
never modified within ``T.inf``'s live range.

Subzero's register allocator can be run in 3 configurations.

- Normal mode. All Variables are considered for register allocation. It
  requires full liveness analysis and live range construction as a
  prerequisite. This is used by ``O2`` lowering.

- Minimal mode. Only infinite-weight or pre-colored Variables are considered.
  All other Variables are stack-allocated. It does not require liveness
  analysis; instead, it quickly scans the instructions and records first
  definitions and last uses of all relevant Variables, using that to construct
  a single-segment live range. Although this includes most of the Variables,
  the live ranges are mostly simple, short, and rarely overlapping, which the
  register allocator handles efficiently. This is used by ``Om1`` lowering.

- Post-phi lowering mode. Advanced phi lowering is done after normal-mode
  register allocation, and may result in new infinite-weight Variables that
  need registers. One would like to just run something like minimal mode to
  assign registers to the new Variables while respecting existing register
  allocation decisions. However, it sometimes happens that there are no free
  registers. In this case, some register needs to be forcibly spilled to the
  stack and temporarily reassigned to the new Variable, and reloaded at the
  end of the new Variable's live range. The register must be one that has no
  explicit references during the Variable's live range. Since Subzero
  currently doesn't track def/use chains (though it does record the CfgNode
  where a Variable is defined), we just do a brute-force search across the
  CfgNode's instruction list for the instruction numbers of interest. This
  situation happens very rarely, so there's little point for now in improving
  its performance.

The basic linear-scan algorithm may, as it proceeds, rescind an early register
allocation decision, leaving that Variable to be stack-allocated. Sometimes it
turns out that the Variable could have been given a different register without
conflict, but by that time it's too late. The literature recognizes this
situation and describes "second-chance bin-packing", which Subzero can do. We
can rerun the register allocator in a mode that respects existing register
allocation decisions, and sometimes it finds new non-conflicting
opportunities. In fact, we can repeatedly run the register allocator until
convergence. Unfortunately, in the current implementation, these subsequent
register allocation passes end up being extremely expensive. This is because
of the treatment of the "unhandled pre-colored" Variable set, which is
normally very small but ends up being quite large on subsequent passes. Its
performance can probably be made acceptable with a better choice of data
structures, but for now this second-chance mechanism is disabled.
| 588 | |
| 589 Basic phi lowering | |
| 590 ^^^^^^^^^^^^^^^^^^ | |
| 591 | |
| 592 The simplest phi lowering strategy works as follows (this is how LLVM ``-O0`` | |
| 593 implements it). Consider this example:: | |
| 594 | |
| 595 L1: | |
| 596 ... | |
| 597 br L3 | |
| 598 L2: | |
| 599 ... | |
| 600 br L3 | |
| 601 L3: | |
| 602 A = phi [B, L1], [C, L2] | |
| 603 X = phi [Y, L1], [Z, L2] | |
| 604 | |
| 605 For each destination of a phi instruction, we can create a temporary and insert | |
| 606 the temporary's assignment at the end of the predecessor block:: | |
| 607 | |
| 608 L1: | |
| 609 ... | |
| 610 A' = B | |
| 611 X' = Y | |
| 612 br L3 | |
| 613 L2: | |
| 614 ... | |
| 615 A' = C | |
| 616 X' = Z | |
| 617 br L3 | |
| 618 L3: | |
| 619 A = A' | |
| 620 X = X' | |
| 621 | |
| 622 This transformation is very simple and reliable. It can be done before target | |
| 623 lowering and register allocation, and it easily avoids the classic lost-copy and | |
| 624 related problems. ``Om1`` lowering uses this strategy. | |
| 625 | |
| 626 However, it has the disadvantage of initializing temporaries even for branches | |
| 627 not taken, though that could be mitigated by splitting non-critical edges and | |
| 628 putting assignments in the edge-split nodes. Another problem is that without | |
| 629 extra machinery, the assignments to ``A``, ``A'``, ``X``, and ``X'`` are given a | |
| 630 specific ordering even though phi semantics are that the assignments are | |
| 631 parallel or unordered. This sometimes imposes false live range overlaps and | |
| 632 leads to poorer register allocation. | |
| 633 | |
| 634 Advanced phi lowering | |
| 635 ^^^^^^^^^^^^^^^^^^^^^ | |
| 636 | |
| 637 ``O2`` lowering defers phi lowering until after register allocation to avoid the | |
| 638 problem of false live range overlaps. It works as follows. We split each | |
| 639 incoming edge and move the (parallel) phi assignments into the split nodes. We | |
| 640 linearize each set of assignments by finding a safe, topological ordering of the | |
| 641 assignments, respecting register assignments as well. For example:: | |
| 642 | |
| 643 A = B | |
| 644 X = Y | |
| 645 | |
| 646 Normally these assignments could be executed in either order, but if ``B`` and | |
| 647 ``X`` are assigned the same physical register, we would want to use the above | |
| 648 ordering. Dependency cycles are broken by introducing a temporary. For | |
| 649 example:: | |
| 650 | |
| 651 A = B | |
| 652 B = A | |
| 653 | |
| 654 Here, a temporary breaks the cycle:: | |
| 655 | |
| 656 t = A | |
| 657 A = B | |
| 658 B = t | |
| 659 | |
| 660 Finally, we use the existing target lowering to lower the assignments in this | |
| 661 basic block, and once that is done for all basic blocks, we run the post-phi | |
| 662 variant of register allocation on the edge-split basic blocks. | |
| 663 | |
| 664 When computing a topological order, we try to first schedule assignments whose | |
| 665 source has a physical register, and last schedule assignments whose destination | |
| 666 has a physical register. This helps reduce register pressure. | |
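The linearization scheme above can be sketched in Python. This is a hypothetical model (variable names as strings, a fixed temporary ``t``), not Subzero's actual C++ implementation, and it ignores physical-register aliasing between distinct variables:

```python
def sequentialize(copies):
    """Order a set of parallel copies; copies is a list of (dest, src)
    pairs with distinct dests.  Cycles are broken with a temporary 't'."""
    pending = {d: s for d, s in copies if d != s}  # drop self-copies
    out = []
    while pending:
        # A copy is safe to emit when no pending copy still reads its dest.
        ready = [d for d in pending if d not in pending.values()]
        if ready:
            d = ready[0]
            out.append((d, pending.pop(d)))
        else:
            # Every dest is still read by some other copy: a cycle.
            # Save one dest in the temporary and redirect its readers.
            d = next(iter(pending))
            out.append(('t', d))
            pending = {x: ('t' if s == d else s) for x, s in pending.items()}
    return out
```

For the ``A = B; B = A`` cycle this produces ``t = A; A = B; B = t``, matching the example above.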
| 667 | |
| 668 X86 address mode inference | |
| 669 ^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| 670 | |
| 671 We try to take advantage of the x86 addressing mode that includes a base | |
| 672 register, an index register, an index register scale amount, and an immediate | |
| 673 offset. We do this through simple pattern matching. Starting with a load or | |
| 674 store instruction where the address is a variable, we initialize the base | |
| 675 register to that variable, and look up the instruction where that variable is | |
| 676 defined. If that is an add instruction of two variables and the index register | |
| 677 hasn't been set, we replace the base and index register with those two | |
| 678 variables. If instead it is an add instruction of a variable and a constant, we | |
| 679 replace the base register with the variable and add the constant to the | |
| 680 immediate offset. | |
| 681 | |
| 682 There are several more patterns that can be matched. This pattern matching | |
| 683 continues on the load or store instruction until no more matches are found. | |
| 684 Because a program typically has few load and store instructions (not to be | |
| 685 confused with instructions that manipulate stack variables), this address mode | |
| 686 inference pass is fast. | |
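The core of the matching loop can be sketched as follows. The IR encoding here (a ``defs`` map from a variable to a tuple describing its defining instruction) is invented for illustration and is not Subzero's real representation; only the two add patterns from the text are handled, and scale matching (from multiplies/shifts) is omitted:

```python
def infer_addr_mode(addr_var, defs):
    """Fold add instructions into an x86 (base, index, scale, offset) tuple.
    defs maps a variable to ('add', var1, var2) or ('addimm', var, const)."""
    base, index, scale, offset = addr_var, None, 1, 0
    changed = True
    while changed:
        changed = False
        inst = defs.get(base)
        if inst is None:
            break
        if inst[0] == 'add' and index is None:
            # base = var1 + var2: use them as base and index registers.
            base, index = inst[1], inst[2]
            changed = True
        elif inst[0] == 'addimm':
            # base = var + const: fold the constant into the offset.
            base = inst[1]
            offset += inst[2]
            changed = True
    return base, index, scale, offset
```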
| 687 | |
| 688 X86 read-modify-write inference | |
| 689 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| 690 | |
| 691 A reasonably common bitcode pattern is a non-atomic update of a memory | |
| 692 location:: | |
| 693 | |
| 694 x = load addr | |
| 695 y = add x, 1 | |
| 696 store y, addr | |
| 697 | |
| 698 On x86, with good register allocation, the Subzero passes described above | |
| 699 generate code of only this quality:: | |
| 700 | |
| 701 mov eax, [ebx] | |
| 702 add eax, 1 | |
| 703 mov [ebx], eax | |
| 704 | |
| 705 However, x86 allows for this kind of code:: | |
| 706 | |
| 707 add [ebx], 1 | |
| 708 | |
| 709 which requires fewer instructions, but perhaps more importantly, requires fewer | |
| 710 physical registers. | |
| 711 | |
| 712 It's also important to note that this transformation only makes sense if the | |
| 713 store instruction ends ``x``'s live range. | |
| 714 | |
| 715 Subzero's ``O2`` recipe includes an early pass to find read-modify-write (RMW) | |
| 716 opportunities via simple pattern matching. The only problem is that it is run | |
| 717 before liveness analysis, which is needed to determine whether ``x``'s live | |
| 718 range ends after the RMW. Since liveness analysis is one of the most expensive | |
| 719 passes, it's not attractive to run it an extra time just for RMW analysis. | |
| 720 Instead, we essentially generate both the RMW and the non-RMW versions, and then | |
| 721 during lowering, the RMW version deletes itself if it finds ``x`` still live. | |
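A sketch of the pattern match, using an invented tuple IR rather than Subzero's real instruction classes; ``live_after`` stands in for the liveness information that, in Subzero, only becomes available later:

```python
def match_rmw(insts, live_after):
    """Match a load/op/store window against the RMW pattern.
    insts: [('load', x, addr), (op, y, x, operand), ('store', y, addr)].
    Returns a fused pseudo-instruction, or None if the pattern or the
    liveness condition fails."""
    if len(insts) != 3:
        return None
    i0, i1, i2 = insts
    if (i0[0] == 'load' and i2[0] == 'store'
            and i1[0] in ('add', 'sub', 'and', 'or', 'xor')
            and i1[2] == i0[1]           # op reads the loaded value
            and i2[1] == i1[1]           # store writes the op result
            and i2[2] == i0[2]           # same address
            and i0[1] not in live_after  # x's live range ends here
            and i1[1] not in live_after):
        return (i1[0] + '_rmw', i0[2], i1[3])   # e.g. add [addr], operand
    return None
```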
| 722 | |
| 723 X86 compare-branch inference | |
| 724 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| 725 | |
| 726 In the LLVM instruction set, the compare/branch pattern works like this:: | |
| 727 | |
| 728 cond = icmp eq a, b | |
| 729 br cond, target | |
| 730 | |
| 731 The result of the icmp instruction is a single bit, and a conditional branch | |
| 732 tests that bit. By contrast, most target architectures use this pattern:: | |
| 733 | |
| 734 cmp a, b // implicitly sets various bits of FLAGS register | |
| 735 br eq, target // branch on a particular FLAGS bit | |
| 736 | |
| 737 A naive lowering sequence conditionally sets ``cond`` to 0 or 1, then tests | |
| 738 ``cond`` and conditionally branches. Subzero has a pass that identifies | |
| 739 boolean-based operations like this and folds them into a single | |
| 740 compare/branch-like operation. It is set up for more than just cmp/br though. | |
| 741 Boolean producers include icmp (integer compare), fcmp (floating-point compare), | |
| 742 and trunc (integer truncation when the destination has bool type). Boolean | |
| 743 consumers include branch, select (the ternary operator from the C language), and | |
| 744 sign-extend and zero-extend when the source has bool type. | |
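For the cmp/br case specifically, the folding condition can be sketched like this (invented tuple IR; ``uses`` counts the consumers of each bool result -- the assumption here is that folding is only applied when the branch is the sole consumer):

```python
def fold_cmp_br(icmp, br, uses):
    """Fold an icmp producer into its branch consumer.
    icmp = ('icmp', dest, cond, a, b); br = ('br', flag, target)."""
    _, dest, cond, a, b = icmp
    _, flag, target = br
    if flag == dest and uses[dest] == 1:
        # Emit the target-style pattern: cmp a, b; br cond, target.
        return ('cmp_br', cond, a, b, target)
    return None
```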
| 745 | |
| 746 Sandboxing | |
| 747 ^^^^^^^^^^ | |
| 748 | |
| 749 This is not actually a separate pass, but is integrated into lowering and | |
| 750 assembly. | |
| 751 | |
| 752 - Indirect branches, including the ret instruction, are masked to a bundle | |
| 753 boundary and bundle-locked. | |
| 754 | |
| 755 - Call instructions are aligned to the end of the bundle so that the return | |
| 756 address is bundle-aligned. | |
| 757 | |
| 758 - Indirect branch targets, including function entry and targets in a switch | |
| 759 statement jump table, are bundle-aligned. | |
| 760 | |
| 761 - The intrinsic for reading the thread pointer is inlined appropriately. | |
| 762 | |
| 763 - For x86-64, non-stack memory accesses are with respect to the reserved sandbox | |
| 764 base register. We reduce the aggressiveness of address mode inference to | |
| 765 leave room for the sandbox base register during lowering. There are no memory | |
| 766 sandboxing changes for x86-32. | |
| 767 | |
| 768 Code emission | |
| 769 ------------- | |
| 770 | |
| 771 Subzero's integrated assembler is derived from Dart's assembler code. There is | |
| 772 a pass that iterates through the low-level ICE instructions and invokes the | |
| 773 relevant assembler functions. Placeholders are added for later fixup of branch | |
| 774 target offsets. (Backward branches use short offsets if possible; forward | |
| 775 branches always use long offsets.) The assembler emits into a staging buffer. | |
| 776 Once emission into the staging buffer for a function is complete, the data is | |
| 777 emitted to the output file as an ELF object file, and metadata such as | |
| 778 relocations, symbol table, and string table, are accumulated for emission at the | |
| 779 end. Global data initializers are emitted similarly. A key point is that at | |
| 780 this point, the staging buffer can be deallocated, and only a minimum of data | |
| 781 needs to be held until the end. | |
| 782 | |
| 783 As a debugging alternative, Subzero can emit textual assembly code which can | |
| 784 then be run through an external assembler. This is of course super slow, but | |
| 785 quite valuable when bringing up a new target. | |
| 786 | |
| 787 As another debugging option, the staging buffer can be emitted as textual | |
| 788 assembly, primarily in the form of ".byte" lines. This allows the assembler to | |
| 789 be tested separately from the ELF related code. | |
| 790 | |
| 791 Memory management | |
| 792 ----------------- | |
| 793 | |
| 794 Where possible, we allocate from a CfgLocalAllocator which derives from LLVM's | |
| 795 BumpPtrAllocator. This is an arena-style allocator where objects allocated from | |
| 796 the arena are never actually freed; instead, when the Cfg translation completes | |
| 797 and the Cfg is deleted, the entire arena memory is reclaimed at once. | |
| 798 | |
| 799 Instructions are probably the most heavily allocated complex class in Subzero. | |
| 800 We represent an instruction list as an intrusive doubly linked list, allocate | |
| 801 all instructions from the CfgLocalAllocator, and we make sure each instruction | |
| 802 subclass is basically POD with a trivial destructor. This way, when the Cfg is | |
| 803 finished, we don't need to individually deallocate every instruction. We do | |
| 804 the same for Variables, probably the second most heavily allocated complex class. | |
| 805 | |
| 806 There are some situations where passes need to use some STL container class. | |
| 807 Subzero has a way of using the CfgLocalAllocator as the container allocator if | |
| 808 this is needed. | |
| 809 | |
| 810 Multithreaded translation | |
| 811 ------------------------- | |
| 812 | |
| 813 Subzero is designed to be able to translate functions in parallel. With the | |
| 814 ``-threads=N`` command-line option, there is a 3-stage producer-consumer | |
| 815 pipeline: | |
| 816 | |
| 817 - A single thread parses the pexe file and produces a sequence of work units. A | |
| 818 work unit can be either a fully constructed Cfg, or a set of global | |
| 819 initializers. The work unit includes its sequence number denoting its parse | |
| 820 order. Each work unit is added to the translation queue. | |
| 821 | |
| 822 - There are N translation threads that draw work units from the translation | |
| 823 queue and lower them into assembler buffers. Each assembler buffer is added | |
| 824 to the emitter queue, tagged with its sequence number. The Cfg and its | |
| 825 CfgLocalAllocator are disposed of at this point. | |
| 826 | |
| 827 - A single thread draws assembler buffers from the emitter queue and appends to | |
| 828 the output file. It uses the sequence numbers to reintegrate the assembler | |
| 829 buffers according to the original parse order, such that output order is | |
| 830 always deterministic. | |
| 831 | |
| 832 Note that ``-threads=N`` actually creates N+2 threads. For the special case of | |
| 833 N=0, execution is entirely sequential -- the same thread parses, translates, and | |
| 834 emits, one function at a time. This is useful for performance measurements. | |
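The emitter thread's reordering logic can be sketched independently of the threading: assembler buffers arrive in whatever order translation finishes, and a small reorder buffer holds early arrivals until their sequence number comes up (hypothetical helper, not Subzero's actual code):

```python
def emit_in_order(arrivals):
    """arrivals: iterable of (seq, buffer) pairs in completion order.
    Returns the buffers in parse (sequence-number) order."""
    out, held, next_seq = [], {}, 0
    for seq, buf in arrivals:
        held[seq] = buf
        # Drain every buffer that is now contiguous with the output.
        while next_seq in held:
            out.append(held.pop(next_seq))
            next_seq += 1
    return out
```

Because output depends only on sequence numbers, the emitted file is deterministic regardless of how translation threads interleave.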
| 835 | |
| 836 Currently, parsing takes about 11% of total sequential time. If translation | |
| 837 ever becomes fast and scalable enough that parsing becomes a | |
| 838 bottleneck, it should be possible to make parsing multithreaded as well. | |
| 839 | |
| 840 Internally, all shared, mutable data is held in the GlobalContext object, and | |
| 841 access to each field is guarded by a mutex. | |
| 842 | |
| 843 Security | |
| 844 -------- | |
| 845 | |
| 846 Sandboxed translator | |
| 847 ^^^^^^^^^^^^^^^^^^^^ | |
| 848 | |
| 849 When running inside the browser, the Subzero translator executes as sandboxed, | |
| 850 untrusted code that is initially checked by the validator, just like the | |
| 851 LLVM-based pnacl-llc translator. As such, the Subzero binary should be no more | |
| 852 or less secure than the translator it replaces, from the point of view of the | |
| 853 Chrome sandbox. | |
| 854 | |
| 855 Code diversification | |
| 856 ^^^^^^^^^^^^^^^^^^^^ | |
| 857 | |
| 858 Return-oriented programming (ROP) is a now-common technique for starting with | |
| 859 e.g. a known buffer overflow situation and launching it into a deeper exploit. | |
| 860 The attacker scans the executable looking for ROP gadgets, which are short | |
| 861 sequences of code that happen to load known values into known registers and then | |
| 862 return. If the attacker can manage to overwrite parts of the stack, he can | |
| 863 overwrite it with carefully chosen return addresses such that certain ROP | |
| 864 gadgets are effectively chained together to set up the register state as | |
| 865 desired, finally returning to some code that manages to do something nasty based | |
| 866 on those register values. | |
| 867 | |
| 868 If there is a popular pexe with a large install base, the attacker could run | |
| 869 Subzero on it and scan the executable for suitable ROP gadgets to use as part of | |
| 870 a potential exploit. Note that if the trusted validator is working correctly, | |
| 871 these ROP gadgets are limited to starting at a bundle boundary and cannot use | |
| 872 the trick of finding a gadget that happens to begin inside another instruction. | |
| 873 All the same, gadgets with these constraints still exist and the attacker has | |
| 874 access to them. | |
| 875 | |
| 876 Subzero can mitigate these attacks to some degree through code diversification. | |
| 877 Specifically, we can apply some randomness to the code generation that makes ROP | |
| 878 gadgets less predictable. This randomness can have some compile-time cost, and | |
| 879 it can affect the code quality; and some diversifications may be more effective | |
| 880 than others. | |
| 881 | |
| 882 To evaluate diversification effectiveness, we use a third-party ROP gadget | |
| 883 finder and limit its results to bundle-aligned addresses. For a given | |
| 884 diversification technique, we run it with a number of different random seeds, | |
| 885 find ROP gadgets for each version, and determine how persistent each ROP gadget | |
| 886 is across the different versions. A gadget is persistent if the same gadget is | |
| 887 found at the same code address. The best diversifications are ones with low | |
| 888 gadget persistence rates. | |
| 889 | |
| 890 Subzero implements 7 different diversification techniques. Below is a | |
| 891 discussion of each technique, its effectiveness, and its cost. | |
| 892 | |
| 893 Constant blinding | |
| 894 ~~~~~~~~~~~~~~~~~ | |
| 895 | |
| 896 Here, we prevent attackers from controlling large immediates in the text | |
| 897 (executable) section. A random cookie is generated for each function, and if | |
| 898 the constant exceeds a specified threshold, the constant is obfuscated with the | |
| 899 cookie and equivalent code is generated. For example, instead of this x86 | |
| 900 instruction:: | |
| 901 | |
| 902 mov $0x11223344, <%Reg/Mem> | |
| 903 | |
| 904 the following code might be generated:: | |
| 905 | |
| 906 mov $(0x11223344+Cookie), %temp | |
| 907 lea -Cookie(%temp), %temp | |
| 908 mov %temp, <%Reg/Mem> | |
| 909 | |
| 910 The ``lea`` instruction is used rather than e.g. add/sub or xor, to prevent | |
| 911 unintended effects on the flags register. | |
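The arithmetic of the transformation is easy to model. This sketch uses invented helper names and a simplified instruction encoding; the point is only that the blinded sequence computes the original constant without it ever appearing as an immediate:

```python
MASK = 0xFFFFFFFF  # 32-bit wraparound assumed

def blind(constant, cookie, threshold=0xFFFF):
    """Return a (simplified) instruction list that materializes constant.
    Small constants are left alone; large ones are obfuscated."""
    if constant <= threshold:
        return [('mov_imm', constant)]
    return [('mov_imm', (constant + cookie) & MASK),  # mov $(C+cookie), %temp
            ('lea_sub', cookie)]                      # lea -cookie(%temp), %temp

def evaluate(insts):
    """Interpret the simplified instruction list to check equivalence."""
    val = 0
    for op, imm in insts:
        val = imm if op == 'mov_imm' else (val - imm) & MASK
    return val
```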
| 912 | |
| 913 This transformation has almost no effect on translation time, and about 1% | |
| 914 impact on code quality, depending on the threshold chosen. It does little to | |
| 915 reduce gadget persistence, but it does remove a lot of potential opportunities | |
| 916 to construct intra-instruction ROP gadgets (which an attacker could use if a | |
| 917 validator bug were discovered). | |
| 918 | |
| 919 Constant pooling | |
| 920 ~~~~~~~~~~~~~~~~ | |
| 921 | |
| 922 This is similar to constant blinding, in that large immediates are removed from | |
| 923 the text section. In this case, each unique constant above the threshold is | |
| 924 stored in a read-only data section and the constant is accessed via a memory | |
| 925 load. For the above example, the following code might be generated:: | |
| 926 | |
| 927 mov $Label$1, %temp | |
| 928 mov %temp, <%Reg/Mem> | |
| 929 | |
| 930 This has a similarly small impact on translation time and ROP gadget | |
| 931 persistence, and a smaller (better) impact on code quality. This is because it | |
| 932 uses fewer instructions, and in some cases good register allocation leads to no | |
| 933 increase in instruction count. Note that this still gives an attacker some | |
| 934 limited amount of control over some text section values, unless we randomize the | |
| 935 constant pool layout. | |
| 936 | |
| 937 Static data reordering | |
| 938 ~~~~~~~~~~~~~~~~~~~~~~ | |
| 939 | |
| 940 This transformation limits the attacker's ability to control bits in global data | |
| 941 address references. It simply permutes the order in memory of global variables | |
| 942 and internal constant pool entries. For the constant pool, we only permute | |
| 943 within a type (i.e., emit a randomized list of ints, followed by a randomized | |
| 944 list of floats, etc.) to maintain good packing in the face of alignment | |
| 945 constraints. | |
| 946 | |
| 947 As might be expected, this has no impact on code quality, translation time, or | |
| 948 ROP gadget persistence (though as above, it limits opportunities for | |
| 949 intra-instruction ROP gadgets with a broken validator). | |
| 950 | |
| 951 Basic block reordering | |
| 952 ~~~~~~~~~~~~~~~~~~~~~~ | |
| 953 | |
| 954 Here, we randomize the order of basic blocks within a function, with the | |
| 955 constraint that we still want to maintain a topological order as much as | |
| 956 possible, to avoid making the code too branchy. | |
| 957 | |
| 958 This has no impact on code quality, and about 1% impact on translation time, due | |
| 959 to a separate pass to recompute layout. It ends up having a huge effect on ROP | |
| 960 gadget persistence, tied for best with nop insertion, reducing ROP gadget | |
| 961 persistence to less than 5%. | |
| 962 | |
| 963 Function reordering | |
| 964 ~~~~~~~~~~~~~~~~~~~ | |
| 965 | |
| 966 Here, we permute the order that functions are emitted, primarily to shift ROP | |
| 967 gadgets around to less predictable locations. It may also change call address | |
| 968 offsets in case the attacker was trying to control that offset in the code. | |
| 969 | |
| 970 To control latency and memory footprint, we don't arbitrarily permute functions. | |
| 971 Instead, for some relatively small value of N, we queue up N assembler buffers, | |
| 972 and then emit the N functions in random order, and repeat until all functions | |
| 973 are emitted. | |
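The windowed shuffle can be sketched as follows (hypothetical helper; Subzero operates on assembler buffers rather than the opaque items used here). Memory stays bounded because at most N buffers are held at a time:

```python
import random

def reorder_functions(functions, n, seed):
    """Emit functions in windows of size n, shuffling within each window."""
    rng = random.Random(seed)
    out = []
    for i in range(0, len(functions), n):
        window = functions[i:i + n]
        rng.shuffle(window)  # permute only within the current window
        out.extend(window)
    return out
```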
| 974 | |
| 975 Function reordering has no impact on translation time or code quality. | |
| 976 Measurements indicate that it reduces ROP gadget persistence to about 15%. | |
| 977 | |
| 978 Nop insertion | |
| 979 ~~~~~~~~~~~~~ | |
| 980 | |
| 981 This diversification randomly adds a nop instruction after each regular | |
| 982 instruction, with some probability. Nop instructions of different lengths may | |
| 983 be selected. Nop instructions are never added inside a bundle_lock region. | |
| 984 Note that when sandboxing is enabled, nop instructions are already being added | |
| 985 for bundle alignment, so the diversification nop instructions may simply be | |
| 986 taking the place of alignment nop instructions, though distributed differently | |
| 987 through the bundle. | |
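A sketch of the insertion policy, using string markers for instructions and ``bundle_lock``/``bundle_unlock`` pseudo-instructions; the real pass works on Subzero IR instructions:

```python
import random

def insert_nops(insts, p, seed, nop_lengths=(1, 2, 3)):
    """After each instruction outside a bundle_lock region, insert a nop
    of random length with probability p."""
    rng = random.Random(seed)
    out, locked = [], False
    for inst in insts:
        out.append(inst)
        if inst == 'bundle_lock':
            locked = True
        elif inst == 'bundle_unlock':
            locked = False
        if (not locked and inst not in ('bundle_lock', 'bundle_unlock')
                and rng.random() < p):
            out.append('nop%d' % rng.choice(nop_lengths))
    return out
```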
| 988 | |
| 989 In Subzero's current implementation, nop insertion adds 3-5% to the | |
| 990 translation time, but this is probably because it is implemented as a separate | |
| 991 pass that adds actual nop instructions to the IR. The overhead would probably | |
| 992 be a lot less if it were integrated into the assembler pass. The code quality | |
| 993 is also reduced by 3-5%, making nop insertion the most expensive of the | |
| 994 diversification techniques. | |
| 995 | |
| 996 Nop insertion is very effective in reducing ROP gadget persistence, at the same | |
| 997 level as basic block randomization. But given nop insertion's impact on | |
| 998 translation time and code quality, one would most likely prefer to use basic | |
| 999 block randomization instead. | |
| 1000 | |
| 1001 Register allocation randomization | |
| 1002 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
| 1003 | |
| 1004 In this diversification, the register allocator tries to make different but | |
| 1005 mostly functionally equivalent choices, while maintaining stable code quality. | |
| 1006 | |
| 1007 A naive approach would be the following. Whenever the allocator has more than | |
| 1008 one choice for assigning a register, choose randomly among those options. And | |
| 1009 whenever there are no registers available and there is a tie for the | |
| 1010 lowest-weight variable, randomly select one of the lowest-weight variables to | |
| 1011 evict. Because of the one-pass nature of the linear-scan algorithm, this | |
| 1012 randomization strategy can have a large impact on which variables are ultimately | |
| 1013 assigned registers, with a corresponding large impact on code quality. | |
| 1014 | |
| 1015 Instead, we choose an approach that tries to keep code quality stable regardless | |
| 1016 of the random seed. We partition the set of physical registers into equivalence | |
| 1017 classes. If a register is pre-colored in the function (i.e., referenced | |
| 1018 explicitly by name), it forms its own equivalence class. The remaining | |
| 1019 registers are partitioned according to their combination of attributes such as | |
| 1020 integer versus floating-point, 8-bit versus 32-bit, caller-save versus | |
| 1021 callee-saved, etc. Each equivalence class is randomly permuted, and the | |
| 1022 complete permutation is applied to the final register assignments. | |
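The equivalence-class permutation can be sketched like this (invented attribute encoding; Subzero's real register attributes are richer than the single key used here):

```python
import random

def make_permutation(registers, attrs, precolored, seed):
    """Build a register->register permutation that only permutes within
    equivalence classes.  Pre-colored registers form singleton classes."""
    rng = random.Random(seed)
    classes = {}
    for r in registers:
        key = ('precolored', r) if r in precolored else attrs[r]
        classes.setdefault(key, []).append(r)
    perm = {}
    for members in classes.values():
        shuffled = members[:]
        rng.shuffle(shuffled)  # permute within the class only
        perm.update(zip(members, shuffled))
    return perm
```

Applying ``perm`` to the final register assignments changes which physical register each variable lands in without changing which variables got registers, which is why code quality stays stable.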
| 1023 | |
| 1024 Register randomization reduces ROP gadget persistence to about 10% on average, | |
| 1025 though there tends to be fairly high variance across functions and applications. | |
| 1026 This probably has to do with the set of restrictions in the x86-32 instruction | |
| 1027 set and ABI, such as few general-purpose registers, eax used for return values, | |
| 1028 edx used for division, ecx used for shifting, etc. As intended, register | |
| 1029 randomization has no impact on code quality, and a slight (0.5%) impact on | |
| 1030 translation time due to an extra scan over the variables to identify pre-colored | |
| 1031 registers. | |
| 1032 | |
| 1033 Fuzzing | |
| 1034 ^^^^^^^ | |
| 1035 | |
| 1036 We have started fuzzing the pexe files input to Subzero. Most of the problems | |
| 1037 found so far are ones most appropriately handled in the parser. However, there | |
| 1038 have been a couple that have identified problems in the lowering, or otherwise | |
| 1039 inappropriately triggered assertion failures and fatal errors. We continue to | |
| 1040 dig into this area. | |
| 1041 | |
| 1042 Chrome integration | |
| 1043 ------------------ | |
| 1044 | |
| 1045 Currently Subzero is available in Chrome for the x86-32 architecture, but under | |
| 1046 a flag. When the flag is enabled, Subzero is used when the pexe developer's | |
| 1047 manifest specifies the ``O0`` optimization level. | |
| 1048 | |
| 1049 The next step is to remove the flag, i.e. invoke Subzero as the only translator | |
| 1050 for ``O0``-specified manifest files. | |
| 1051 | |
| 1052 Ultimately, Subzero might produce code rivaling LLVM ``O2`` quality, in which | |
| 1053 case Subzero could be used for all PNaCl translation. | |
| 1054 | |
| 1055 Command line options | |
| 1056 -------------------- | |
| 1057 | |
| 1058 Subzero has a number of command-line options for debugging and diagnostics. | |
| 1059 Among the more interesting are the following. | |
| 1060 | |
| 1061 - Using the ``-verbose`` flag, Subzero will dump the Cfg, or produce other | |
| 1062 diagnostic output, with various levels of detail after each pass. Instruction | |
| 1063 numbers can be printed or suppressed. Deleted instructions can be printed or | |
| 1064 suppressed (they are retained in the instruction list, as discussed earlier, | |
| 1065 because they can help explain how lower-level instructions originated). | |
| 1066 Liveness information can be printed when available. Details of register | |
| 1067 allocation can be printed as register allocator decisions are made. And more. | |
| 1068 | |
| 1069 - Running Subzero with any level of verbosity produces an enormous amount of | |
| 1070 output. When debugging a single function, verbose output can be suppressed | |
| 1071 except for a particular function. The ``-verbose-focus`` flag suppresses | |
| 1072 verbose output except for the specified function. | |
| 1073 | |
| 1074 - Subzero has a ``-timing`` option that prints a breakdown of pass-level timing | |
| 1075 at exit. Timing markers can be placed in the Subzero source code to demarcate | |
| 1076 logical operations or passes of interest. Basic timing information plus | |
| 1077 call-stack type timing information is printed at the end. | |
| 1078 | |
| 1079 - Along with ``-timing``, the user can instead get a report on the overall | |
| 1080 translation time for each function, to help focus on timing outliers. Also, | |
| 1081 ``-timing-focus`` limits the ``-timing`` reporting to a single function, | |
| 1082 instead of aggregating pass timing across all functions. | |
| 1083 | |
| 1084 - The ``-szstats`` option reports various statistics on each function, such as | |
| 1085 stack frame size, static instruction count, etc. It may be helpful to track | |
| 1086 these stats over time as Subzero is improved, as an approximate measure of | |
| 1087 code quality. | |
| 1088 | |
| 1089 - The flag ``-asm-verbose``, in conjunction with emitting textual assembly | |
| 1090 output, annotates the assembly output with register-focused liveness | |
| 1091 information. In particular, each basic block is annotated with which | |
| 1092 registers are live-in and live-out, and each instruction is annotated with | |
| 1093 which registers' and stack locations' live ranges end at that instruction. | |
| 1094 This is really useful when studying the generated code to find opportunities | |
| 1095 for code quality improvements. | |
| 1096 | |
| 1097 Testing and debugging | |
| 1098 --------------------- | |
| 1099 | |
| 1100 LLVM lit tests | |
| 1101 ^^^^^^^^^^^^^^ | |
| 1102 | |
| 1103 For basic testing, Subzero uses LLVM's lit framework for running tests. We have | |
| 1104 a suite of hundreds of small functions where we test for particular assembly | |
| 1105 code patterns across different target architectures. | |
| 1106 | |
| 1107 Cross tests | |
| 1108 ^^^^^^^^^^^ | |
| 1109 | |
| 1110 Unfortunately, the lit tests don't do a great job of precisely testing the | |
| 1111 correctness of the output. Much better are the cross tests, which are execution | |
| 1112 tests that compare Subzero and llc translated bitcode across a wide variety of | |
| 1113 interesting inputs. Each cross test consists of a set of C, C++, and/or | |
| 1114 low-level bitcode files. The C and C++ source files are compiled down to | |
| 1115 bitcode. The bitcode files are translated by llc and also by Subzero. Subzero | |
| 1116 mangles global symbol names with a special prefix to avoid duplicate symbol | |
| 1117 errors. A driver program invokes both versions on a large set of interesting | |
| 1118 inputs, and reports when the Subzero and llc results differ. Cross tests turn | |
| 1119 out to be an excellent way of testing the basic lowering patterns, but they are | |
| 1120 less useful for testing more global things like liveness analysis and register | |
| 1121 allocation. | |
| 1122 | |
| 1123 Bisection debugging | |
| 1124 ^^^^^^^^^^^^^^^^^^^ | |
| 1125 | |
| 1126 Sometimes with a new application, Subzero will end up producing incorrect code | |
| 1127 that either crashes at runtime or otherwise produces the wrong results. When | |
| 1128 this happens, we need to narrow it down to a single function (or small set of | |
| 1129 functions) that yield incorrect behavior. For this, we have a bisection | |
| 1130 debugging framework. Here, we initially translate the entire application once | |
| 1131 with Subzero and once with llc. We then use objdump to selectively weaken | |
| 1132 symbols based on a whitelist or blacklist provided on the command line. The two | |
| 1133 object files can then be linked together without link errors, with the desired | |
| 1134 version of each method "winning". Then the binary is tested, and bisection | |
| 1135 proceeds based on whether the binary produces correct output. | |
| 1136 | |
| 1137 When the bisection completes, we are left with a minimal set of | |
| 1138 Subzero-translated functions that cause the failure. Usually it is a single | |
| 1139 function, though sometimes it might require a combination of several functions | |
| 1140 to cause a failure; this may be due to an incorrect call ABI, for example. | |
| 1141 However, Murphy's Law implies that the single failing function is enormous and | |
| 1142 impractical to debug. In that case, we can restart the bisection, explicitly | |
| 1143 blacklisting the enormous function, and try to find another candidate to debug. | |
| 1144 (Future work is to automate this to find all minimal sets of functions, so that | |
| 1145 debugging can focus on the simplest example.) | |
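The search itself is ordinary delta debugging. A minimal sketch, assuming a ``test`` callback that links with the given whitelist of Subzero-translated functions and reports whether the failure reproduces (this naive halving misses failures that require functions from both halves, which is why the manual restart described above is sometimes needed):

```python
def bisect_functions(funcs, test):
    """Narrow a failing set of functions to a smaller failing subset."""
    assert test(funcs)  # the full Subzero-translated set must fail
    changed = True
    while changed and len(funcs) > 1:
        changed = False
        half = len(funcs) // 2
        for part in (funcs[:half], funcs[half:]):
            if test(part):
                funcs = part  # this half alone reproduces the failure
                changed = True
                break
    return funcs
```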
| 1146 | |
| 1147 Fuzz testing | |
| 1148 ^^^^^^^^^^^^ | |
| 1149 | |
| 1150 As described above, we try to find internal Subzero bugs using fuzz testing | |
| 1151 techniques. | |
| 1152 | |
| 1153 Sanitizers | |
| 1154 ^^^^^^^^^^ | |
| 1155 | |
| 1156 We regularly run Subzero with asan and tsan. So far, multithreading has been | |
| 1157 simple enough that tsan hasn't found any bugs, but asan has found at least one | |
| 1158 memory leak which was subsequently fixed. | |
| 1159 | |
| 1160 Current status | |
| 1161 ============== | |
| 1162 | |
| 1163 Target architectures | |
| 1164 -------------------- | |
| 1165 | |
| 1166 Subzero is currently more or less complete for the x86-32 target. It has been | |
| 1167 refactored and extended to handle x86-64 as well, and that is mostly complete at | |
| 1168 this point. | |
| 1169 | |
| 1170 ARM32 work is in progress and is mostly complete. It currently lacks the | |
| 1171 testing level of x86, at least in part because Subzero's register allocator | |
| 1172 needs modifications to handle ARM's aliasing of floating point and vector | |
| 1173 registers. Specifically, a 64-bit register is actually a gang of two | |
| 1174 consecutive and aligned 32-bit registers, and a 128-bit register is a gang of | |
| 1175 four consecutive and aligned 32-bit registers. ARM64 work has not started; the | |
| 1175 consecutive and aligned 32-bit registers. ARM64 work has not started; the whole | |
| 1176 ARM64 sandbox model would need to be defined. | |
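Using the standard ARM VFP numbering (an illustration of the aliasing described above, not code from Subzero's source), the overlap can be sketched as follows; note that on real hardware only D0-D15 have S-register aliases:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Sketch of ARM32 VFP register aliasing: a 64-bit D register is a gang of
// two consecutive aligned 32-bit S registers, and a 128-bit Q register is
// a gang of four.
std::array<uint32_t, 2> sAliasesOfD(uint32_t D) {
  return {2 * D, 2 * D + 1}; // Dn overlaps S(2n) and S(2n+1)
}

std::array<uint32_t, 4> sAliasesOfQ(uint32_t Q) {
  return {4 * Q, 4 * Q + 1, 4 * Q + 2, 4 * Q + 3}; // Qn overlaps 4 S regs
}

// Consequence for register allocation: assigning S(k) also (partially)
// occupies the containing D and Q registers, so the allocator must treat
// them as conflicting.
uint32_t dContaining(uint32_t S) { return S / 2; }
uint32_t qContaining(uint32_t S) { return S / 4; }
```

This is the conflict structure the register allocator has to model: live ranges in S(2), S(3), D(1), and Q(0) all mutually interfere even though they name three different register classes.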
| 1177 | |
| 1178 An external contributor is adding MIPS support, in large part by following the | |
| 1179 ARM work. | |
| 1180 | |
| 1181 Translator performance | |
| 1182 ---------------------- | |
| 1183 | |
| 1184 Single-threaded translation speed is currently about 5x the llc translation | |
| 1185 speed. For a large pexe, the time breaks down as: | |
| 1186 | |
| 1187 - 11% for parsing and initial IR building | |
| 1188 | |
| 1189 - 4% for emitting to /dev/null | |
| 1190 | |
| 1191 - 27% for liveness analysis (two liveness passes plus live range construction) | |
| 1192 | |
| 1193 - 15% for linear-scan register allocation | |
| 1194 | |
| 1195 - 9% for basic lowering | |
| 1196 | |
| 1197 - 10% for advanced phi lowering | |
| 1198 | |
| 1199 - ~11% for other minor analysis | |
| 1200 | |
| 1201 - ~10% measurement overhead to acquire these numbers | |
| 1202 | |
| 1203 Some improvements could undoubtedly be made, but it will be hard to increase the | |
| 1204 speed to 10x that of llc while keeping acceptable code quality. With ``-Om1`` (lack | |
| 1205 of) optimization, we do actually achieve roughly 10x llc translation speed, but | |
| 1206 code quality drops by a factor of 3. | |
| 1207 | |
| 1208 Code quality | |
| 1209 ------------ | |
| 1210 | |
| 1211 Measured across 16 components of spec2k, Subzero's code quality is uniformly | |
| 1212 better than llc ``-O0`` code quality, and in many cases solidly between llc | |
| 1213 ``-O0`` and ``-O2``. | |
| 1214 | |
| 1215 Translator size | |
| 1216 --------------- | |
| 1217 | |
| 1218 When built in MINIMAL mode, the x86-64 native translator size for the x86-32 | |
| 1219 target is about 700 KB, not including the size of functions referenced in | |
| 1220 dynamically-linked libraries. The sandboxed version of Subzero is a bit over 1 | |
| 1221 MB, and it is statically linked and also includes nop padding for bundling as | |
| 1222 well as indirect branch masking. | |
| 1223 | |
| 1224 Translator memory footprint | |
| 1225 --------------------------- | |
| 1226 | |
| 1227 It's hard to draw firm conclusions about memory footprint, since the footprint | |
| 1228 is at least proportional to the input function size, and there is no real limit | |
| 1229 on the size of functions in the pexe file. | |
| 1230 | |
| 1231 That said, we looked at the memory footprint over time as Subzero translated | |
| 1232 pnacl-llc.pexe, which is the largest pexe (7.2 MB) at our disposal. One of | |
| 1233 LLVM's libraries that Subzero uses can report the current malloc heap usage. | |
| 1234 With single-threaded translation, Subzero tends to hover around 15 MB of memory | |
| 1235 usage. There are a couple of monstrous functions where Subzero grows to around | |
| 1236 100 MB, but then it drops back down after those functions finish translating. | |
| 1237 In contrast, llc grows larger and larger throughout translation, reaching | |
| 1238 several hundred MB by the time it completes. | |
| 1239 | |
| 1240 It's a bit more interesting when we enable multithreaded translation. When | |
| 1241 there are N translation threads, Subzero implements a policy that limits the | |
| 1242 size of the translation queue to N entries: if it is "full" when the parser | |
| 1243 tries to add a new Cfg, the parser blocks until one of the translation threads | |
| 1244 removes a Cfg. This means the number of in-memory Cfgs can (and generally does) | |
| 1245 reach 2*N+1, and so the memory footprint rises in proportion to the number of | |
| 1246 threads. Adding to the pressure is the observation that the monstrous functions | |
| 1247 also take proportionally longer time to translate, so there's a good chance many | |
| 1248 of the monstrous functions will be active at the same time with multithreaded | |
| 1249 translation. As a result, for N=32, Subzero's memory footprint peaks at about | |
| 1250 260 MB, but drops back down as the large functions finish translating. | |
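The queue policy described above can be sketched as a bounded producer/consumer queue. This is a minimal sketch with hypothetical names (``Cfg`` here is an empty stand-in struct; Subzero's real implementation differs in detail): the parser blocks in ``blockingPush`` whenever N Cfgs are already waiting.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <memory>
#include <mutex>

struct Cfg {}; // stand-in for Subzero's per-function IR container

// Sketch of a translation queue capped at MaxSize entries. The parser
// thread blocks in blockingPush() when the queue is full; translation
// worker threads drain it via blockingPop().
class BoundedTranslationQueue {
public:
  explicit BoundedTranslationQueue(size_t MaxSize) : MaxSize(MaxSize) {}

  void blockingPush(std::unique_ptr<Cfg> Func) {
    std::unique_lock<std::mutex> Lock(Mutex);
    // Parser waits here until a worker removes an entry.
    NotFull.wait(Lock, [this] { return Queue.size() < MaxSize; });
    Queue.push_back(std::move(Func));
    NotEmpty.notify_one();
  }

  std::unique_ptr<Cfg> blockingPop() {
    std::unique_lock<std::mutex> Lock(Mutex);
    NotEmpty.wait(Lock, [this] { return !Queue.empty(); });
    std::unique_ptr<Cfg> Func = std::move(Queue.front());
    Queue.pop_front();
    NotFull.notify_one();
    return Func;
  }

private:
  const size_t MaxSize;
  std::mutex Mutex;
  std::condition_variable NotFull, NotEmpty;
  std::deque<std::unique_ptr<Cfg>> Queue;
};
```

With N workers each holding one in-flight Cfg, N more queued, and one being built by the blocked parser, this policy yields the 2*N+1 bound on in-memory Cfgs noted above.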
| 1251 | |
| 1252 If this peak memory size becomes a problem, it might be possible for the parser | |
| 1253 to resequence the functions to try to spread out the larger functions, or to | |
| 1254 throttle the translation queue to prevent too many in-flight large functions. | |
| 1255 | |
| 1256 Translator scalability | |
| 1257 ---------------------- | |
| 1258 | |
| 1259 Currently scalability is "not very good". Multiple translation threads lead to | |
| 1260 faster translation, but not to the degree desired. We haven't dug in to | |
| 1261 investigate yet. | |
| 1262 | |
| 1263 There are a few areas to investigate. First, there may be contention on the | |
| 1264 constant pool, which all threads access, and which requires locked access even | |
| 1265 for reading. This could be mitigated by keeping a Cfg-local cache of the most | |
| 1266 common constants. | |
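A minimal sketch of that mitigation, with hypothetical names: a per-Cfg map sits in front of a locked global pool, so repeated lookups of the same constant within one function skip the global lock entirely after the first access.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Stand-in for Subzero's interned constant objects.
struct Constant { int64_t Value; };

// Global constant pool: every lookup takes the lock, even for reads.
class GlobalConstantPool {
public:
  Constant *getOrAdd(int64_t Value) {
    std::lock_guard<std::mutex> Lock(Mutex);
    auto It = Pool.find(Value);
    if (It != Pool.end())
      return It->second;
    Constant *C = new Constant{Value}; // pooled objects live forever
    Pool.emplace(Value, C);
    return C;
  }

private:
  std::mutex Mutex;
  std::unordered_map<int64_t, Constant *> Pool;
};

// Per-Cfg cache: only the owning translation thread touches it, so the
// fast path needs no locking and still returns the shared pooled pointer.
class CfgConstantCache {
public:
  explicit CfgConstantCache(GlobalConstantPool &Global) : Global(Global) {}

  Constant *get(int64_t Value) {
    auto It = Cache.find(Value);
    if (It != Cache.end())
      return It->second; // lock-free fast path
    Constant *C = Global.getOrAdd(Value); // slow path: one locked lookup
    Cache.emplace(Value, C);
    return C;
  }

private:
  GlobalConstantPool &Global;
  std::unordered_map<int64_t, Constant *> Cache;
};
```

Since the most common constants (small integers, zero, one) recur heavily within a function, even a small per-Cfg cache should absorb most of the lock traffic.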
| 1267 | |
| 1268 Second, there may be contention on memory allocation. While almost all Cfg | |
| 1269 objects are allocated from the Cfg-local allocator, some passes use temporary | |
| 1270 STL containers that use the default allocator, which may require global locking. | |
| 1271 This could be mitigated by switching these to the Cfg-local allocator. | |
| 1272 | |
| 1273 Third, multithreading may make the default allocator strategy more expensive. | |
| 1274 In a single-threaded environment, a pass will allocate its containers, run the | |
| 1275 pass, and deallocate the containers. This results in stack-like allocation | |
| 1276 behavior and makes the heap free list easier to manage, with less heap | |
| 1277 fragmentation. But when multithreading is added, the allocations and | |
| 1278 deallocations become much less stack-like, making allocation and deallocation | |
| 1279 operations individually more expensive. Again, this could be mitigated by | |
| 1280 switching these to the Cfg-local allocator. | |