OLD | NEW |
| (Empty) |
1 Design of the Subzero fast code generator | |
2 ========================================= | |
3 | |
4 Introduction | |
5 ------------ | |
6 | |
7 The `Portable Native Client (PNaCl) <http://gonacl.com>`_ project includes | |
8 compiler technology based on `LLVM <http://llvm.org/>`_. The developer uses the | |
9 PNaCl toolchain to compile their application to architecture-neutral PNaCl | |
10 bitcode (a ``.pexe`` file), using as much architecture-neutral optimization as | |
11 possible. The ``.pexe`` file is downloaded to the user's browser where the | |
12 PNaCl translator (a component of Chrome) compiles the ``.pexe`` file to | |
13 `sandboxed | |
14 <https://developer.chrome.com/native-client/reference/sandbox_internals/index>`_ | |
15 native code. The translator uses architecture-specific optimizations as much as | |
16 practical to generate good native code. | |
17 | |
18 The native code can be cached by the browser to avoid repeating translation on | |
19 future page loads. However, first-time user experience is hampered by long | |
20 translation times. The LLVM-based PNaCl translator is pretty slow, even when | |
21 using ``-O0`` to minimize optimizations, so delays are especially noticeable on | |
22 slow browser platforms such as ARM-based Chromebooks. | |
23 | |
24 Translator slowness can be mitigated or hidden in a number of ways. | |
25 | |
26 - Parallel translation. However, slow machines where this matters most, e.g. | |
27 ARM-based Chromebooks, are likely to have fewer cores to parallelize across, | |
28 and are likely to less memory available for multiple translation threads to | |
29 use. | |
30 | |
31 - Streaming translation, i.e. start translating as soon as the download starts. | |
32 This doesn't help much when translation speed is 10× slower than download | |
33 speed, or the ``.pexe`` file is already cached while the translated binary was | |
34 flushed from the cache. | |
35 | |
36 - Arrange the web page such that translation is done in parallel with | |
37 downloading large assets. | |
38 | |
39 - Arrange the web page to distract the user with `cat videos | |
40 <https://www.youtube.com/watch?v=tLt5rBfNucc>`_ while translation is in | |
41 progress. | |
42 | |
43 Or, improve translator performance to something more reasonable. | |
44 | |
45 This document describes Subzero's attempt to improve translation speed by an | |
46 order of magnitude while rivaling LLVM's code quality. Subzero does this | |
47 through minimal IR layering, lean data structures and passes, and a careful | |
48 selection of fast optimization passes. It has two optimization recipes: full | |
49 optimizations (``O2``) and minimal optimizations (``Om1``). The recipes are the | |
50 following (described in more detail below): | |
51 | |
52 +-----------------------------------+-----------------------+ | |
53 | O2 recipe | Om1 recipe | | |
54 +===================================+=======================+ | |
55 | Parse .pexe file | Parse .pexe file | | |
56 +-----------------------------------+-----------------------+ | |
57 | Loop nest analysis | | | |
58 +-----------------------------------+-----------------------+ | |
59 | Address mode inference | | | |
60 +-----------------------------------+-----------------------+ | |
61 | Read-modify-write (RMW) transform | | | |
62 +-----------------------------------+-----------------------+ | |
63 | Basic liveness analysis | | | |
64 +-----------------------------------+-----------------------+ | |
65 | Load optimization | | | |
66 +-----------------------------------+-----------------------+ | |
67 | | Phi lowering (simple) | | |
68 +-----------------------------------+-----------------------+ | |
69 | Target lowering | Target lowering | | |
70 +-----------------------------------+-----------------------+ | |
71 | Full liveness analysis | | | |
72 +-----------------------------------+-----------------------+ | |
73 | Register allocation | | | |
74 +-----------------------------------+-----------------------+ | |
75 | Phi lowering (advanced) | | | |
76 +-----------------------------------+-----------------------+ | |
77 | Post-phi register allocation | | | |
78 +-----------------------------------+-----------------------+ | |
79 | Branch optimization | | | |
80 +-----------------------------------+-----------------------+ | |
81 | Code emission | Code emission | | |
82 +-----------------------------------+-----------------------+ | |
83 | |
84 Goals | |
85 ===== | |
86 | |
87 Translation speed | |
88 ----------------- | |
89 | |
90 We'd like to be able to translate a ``.pexe`` file as fast as download speed. | |
91 Any faster is in a sense wasted effort. Download speed varies greatly, but | |
92 we'll arbitrarily say 1 MB/sec. We'll pick the ARM A15 CPU as the example of a | |
93 slow machine. We observe a 3× single-thread performance difference between A15 | |
94 and a high-end x86 Xeon E5-2690 based workstation, and aggressively assume a | |
95 ``.pexe`` file could be compressed to 50% on the web server using gzip transport | |
96 compression, so we set the translation speed goal to 6 MB/sec on the high-end | |
97 Xeon workstation. | |
98 | |
99 Currently, at the ``-O0`` level, the LLVM-based PNaCl translation translates at | |
100 ⅒ the target rate. The ``-O2`` mode takes 3× as long as the ``-O0`` mode. | |
101 | |
102 In other words, Subzero's goal is to improve over LLVM's translation speed by | |
103 10×. | |
104 | |
105 Code quality | |
106 ------------ | |
107 | |
108 Subzero's initial goal is to produce code that meets or exceeds LLVM's ``-O0`` | |
109 code quality. The stretch goal is to approach LLVM ``-O2`` code quality. On | |
110 average, LLVM ``-O2`` performs twice as well as LLVM ``-O0``. | |
111 | |
112 It's important to note that the quality of Subzero-generated code depends on | |
113 target-neutral optimizations and simplifications being run beforehand in the | |
114 developer environment. The ``.pexe`` file reflects these optimizations. For | |
115 example, Subzero assumes that the basic blocks are ordered topologically where | |
116 possible (which makes liveness analysis converge fastest), and Subzero does not | |
117 do any function inlining because it should already have been done. | |
118 | |
119 Translator size | |
120 --------------- | |
121 | |
122 The current LLVM-based translator binary (``pnacl-llc``) is about 10 MB in size. | |
123 We think 1 MB is a more reasonable size -- especially for such a component that | |
124 is distributed to a billion Chrome users. Thus we target a 10× reduction in | |
125 binary size. | |
126 | |
127 For development, Subzero can be built for all target architectures, and all | |
128 debugging and diagnostic options enabled. For a smaller translator, we restrict | |
129 to a single target architecture, and define a ``MINIMAL`` build where | |
130 unnecessary features are compiled out. | |
131 | |
132 Subzero leverages some data structures from LLVM's ``ADT`` and ``Support`` | |
133 include directories, which have little impact on translator size. It also uses | |
134 some of LLVM's bitcode decoding code (for binary-format ``.pexe`` files), again | |
135 with little size impact. In non-``MINIMAL`` builds, the translator size is much | |
136 larger due to including code for parsing text-format bitcode files and forming | |
137 LLVM IR. | |
138 | |
139 Memory footprint | |
140 ---------------- | |
141 | |
142 The current LLVM-based translator suffers from an issue in which some | |
143 function-specific data has to be retained in memory until all translation | |
144 completes, and therefore the memory footprint grows without bound. Large | |
145 ``.pexe`` files can lead to the translator process holding hundreds of MB of | |
146 memory by the end. The translator runs in a separate process, so this memory | |
147 growth doesn't *directly* affect other processes, but it does dirty the physical | |
148 memory and contributes to a perception of bloat and sometimes a reality of | |
149 out-of-memory tab killing, especially noticeable on weaker systems. | |
150 | |
151 Subzero should maintain a stable memory footprint throughout translation. It's | |
152 not really practical to set a specific limit, because there is not really a | |
153 practical limit on a single function's size, but the footprint should be | |
154 "reasonable" and be proportional to the largest input function size, not the | |
155 total ``.pexe`` file size. Simply put, Subzero should not have memory leaks or | |
156 inexorable memory growth. (We use ASAN builds to test for leaks.) | |
157 | |
158 Multithreaded translation | |
159 ------------------------- | |
160 | |
161 It should be practical to translate different functions concurrently and see | |
162 good scalability. Some locking may be needed, such as accessing output buffers | |
163 or constant pools, but that should be fairly minimal. In contrast, LLVM was | |
164 only designed for module-level parallelism, and as such, the PNaCl translator | |
165 internally splits a ``.pexe`` file into several modules for concurrent | |
166 translation. All output needs to be deterministic regardless of the level of | |
167 multithreading, i.e. functions and data should always be output in the same | |
168 order. | |
169 | |
170 Target architectures | |
171 -------------------- | |
172 | |
173 Initial target architectures are x86-32, x86-64, ARM32, and MIPS32. Future | |
174 targets include ARM64 and MIPS64, though these targets lack NaCl support | |
175 including a sandbox model or a validator. | |
176 | |
177 The first implementation is for x86-32, because it was expected to be | |
178 particularly challenging, and thus more likely to draw out any design problems | |
179 early: | |
180 | |
181 - There are a number of special cases, asymmetries, and warts in the x86 | |
182 instruction set. | |
183 | |
184 - Complex addressing modes may be leveraged for better code quality. | |
185 | |
186 - 64-bit integer operations have to be lowered into longer sequences of 32-bit | |
187 operations. | |
188 | |
189 - Paucity of physical registers may reveal code quality issues early in the | |
190 design. | |
191 | |
192 Detailed design | |
193 =============== | |
194 | |
195 Intermediate representation - ICE | |
196 --------------------------------- | |
197 | |
198 Subzero's IR is called ICE. It is designed to be reasonably similar to LLVM's | |
199 IR, which is reflected in the ``.pexe`` file's bitcode structure. It has a | |
200 representation of global variables and initializers, and a set of functions. | |
201 Each function contains a list of basic blocks, and each basic block constains a | |
202 list of instructions. Instructions that operate on stack and register variables | |
203 do so using static single assignment (SSA) form. | |
204 | |
205 The ``.pexe`` file is translated one function at a time (or in parallel by | |
206 multiple translation threads). The recipe for optimization passes depends on | |
207 the specific target and optimization level, and is described in detail below. | |
208 Global variables (types and initializers) are simply and directly translated to | |
209 object code, without any meaningful attempts at optimization. | |
210 | |
211 A function's control flow graph (CFG) is represented by the ``Ice::Cfg`` class. | |
212 Its key contents include: | |
213 | |
214 - A list of ``CfgNode`` pointers, generally held in topological order. | |
215 | |
216 - A list of ``Variable`` pointers corresponding to local variables used in the | |
217 function plus compiler-generated temporaries. | |
218 | |
219 A basic block is represented by the ``Ice::CfgNode`` class. Its key contents | |
220 include: | |
221 | |
222 - A linear list of instructions, in the same style as LLVM. The last | |
223 instruction of the list is always a terminator instruction: branch, switch, | |
224 return, unreachable. | |
225 | |
226 - A list of Phi instructions, also in the same style as LLVM. They are held as | |
227 a linear list for convenience, though per Phi semantics, they are executed "in | |
228 parallel" without dependencies on each other. | |
229 | |
230 - An unordered list of ``CfgNode`` pointers corresponding to incoming edges, and | |
231 another list for outgoing edges. | |
232 | |
233 - The node's unique, 0-based index into the CFG's node list. | |
234 | |
235 An instruction is represented by the ``Ice::Inst`` class. Its key contents | |
236 include: | |
237 | |
238 - A list of source operands. | |
239 | |
240 - Its destination variable, if the instruction produces a result in an | |
241 ``Ice::Variable``. | |
242 | |
243 - A bitvector indicating which variables' live ranges this instruction ends. | |
244 This is computed during liveness analysis. | |
245 | |
246 Instructions kinds are divided into high-level ICE instructions and low-level | |
247 ICE instructions. High-level instructions consist of the PNaCl/LLVM bitcode | |
248 instruction kinds. Each target architecture implementation extends the | |
249 instruction space with its own set of low-level instructions. Generally, | |
250 low-level instructions correspond to individual machine instructions. The | |
251 high-level ICE instruction space includes a few additional instruction kinds | |
252 that are not part of LLVM but are generally useful (e.g., an Assignment | |
253 instruction), or are useful across targets (e.g., BundleLock and BundleUnlock | |
254 instructions for sandboxing). | |
255 | |
256 Specifically, high-level ICE instructions that derive from LLVM (but with PNaCl | |
257 ABI restrictions as documented in the `PNaCl Bitcode Reference Manual | |
258 <https://developer.chrome.com/native-client/reference/pnacl-bitcode-abi>`_) are | |
259 the following: | |
260 | |
261 - Alloca: allocate data on the stack | |
262 | |
263 - Arithmetic: binary operations of the form ``A = B op C`` | |
264 | |
265 - Br: conditional or unconditional branch | |
266 | |
267 - Call: function call | |
268 | |
269 - Cast: unary type-conversion operations | |
270 | |
271 - ExtractElement: extract a scalar element from a vector-type value | |
272 | |
273 - Fcmp: floating-point comparison | |
274 | |
275 - Icmp: integer comparison | |
276 | |
277 - IntrinsicCall: call a known intrinsic | |
278 | |
279 - InsertElement: insert a scalar element into a vector-type value | |
280 | |
281 - Load: load a value from memory | |
282 | |
283 - Phi: implement the SSA phi node | |
284 | |
285 - Ret: return from the function | |
286 | |
287 - Select: essentially the C language operation of the form ``X = C ? Y : Z`` | |
288 | |
289 - Store: store a value into memory | |
290 | |
291 - Switch: generalized branch to multiple possible locations | |
292 | |
293 - Unreachable: indicate that this portion of the code is unreachable | |
294 | |
295 The additional high-level ICE instructions are the following: | |
296 | |
297 - Assign: a simple ``A=B`` assignment. This is useful for e.g. lowering Phi | |
298 instructions to non-SSA assignments, before lowering to machine code. | |
299 | |
300 - BundleLock, BundleUnlock. These are markers used for sandboxing, but are | |
301 common across all targets and so they are elevated to the high-level | |
302 instruction set. | |
303 | |
304 - FakeDef, FakeUse, FakeKill. These are tools used to preserve consistency in | |
305 liveness analysis, elevated to the high-level because they are used by all | |
306 targets. They are described in more detail at the end of this section. | |
307 | |
308 - JumpTable: this represents the result of switch optimization analysis, where | |
309 some switch instructions may use jump tables instead of cascading | |
310 compare/branches. | |
311 | |
312 An operand is represented by the ``Ice::Operand`` class. In high-level ICE, an | |
313 operand is either an ``Ice::Constant`` or an ``Ice::Variable``. Constants | |
314 include scalar integer constants, scalar floating point constants, Undef (an | |
315 unspecified constant of a particular scalar or vector type), and symbol | |
316 constants (essentially addresses of globals). Note that the PNaCl ABI does not | |
317 include vector-type constants besides Undef, and as such, Subzero (so far) has | |
318 no reason to represent vector-type constants internally. A variable represents | |
319 a value allocated on the stack (though not including alloca-derived storage). | |
320 Among other things, a variable holds its unique, 0-based index into the CFG's | |
321 variable list. | |
322 | |
323 Each target can extend the ``Constant`` and ``Variable`` classes for its own | |
324 needs. In addition, the ``Operand`` class may be extended, e.g. to define an | |
325 x86 ``MemOperand`` that encodes a base register, an index register, an index | |
326 register shift amount, and a constant offset. | |
327 | |
328 Register allocation and liveness analysis are restricted to Variable operands. | |
329 Because of the importance of register allocation to code quality, and the | |
330 translation-time cost of liveness analysis, Variable operands get some special | |
331 treatment in ICE. Most notably, a frequent pattern in Subzero is to iterate | |
332 across all the Variables of an instruction. An instruction holds a list of | |
333 operands, but an operand may contain 0, 1, or more Variables. As such, the | |
334 ``Operand`` class specially holds a list of Variables contained within, for | |
335 quick access. | |
336 | |
337 A Subzero transformation pass may work by deleting an existing instruction and | |
338 replacing it with zero or more new instructions. Instead of actually deleting | |
339 the existing instruction, we generally mark it as deleted and insert the new | |
340 instructions right after the deleted instruction. When printing the IR for | |
341 debugging, this is a big help because it makes it much more clear how the | |
342 non-deleted instructions came about. | |
343 | |
344 Subzero has a few special instructions to help with liveness analysis | |
345 consistency. | |
346 | |
347 - The FakeDef instruction gives a fake definition of some variable. For | |
348 example, on x86-32, a divide instruction defines both ``%eax`` and ``%edx`` | |
349 but an ICE instruction can represent only one destination variable. This is | |
350 similar for multiply instructions, and for function calls that return a 64-bit | |
351 integer result in the ``%edx:%eax`` pair. Also, using the ``xor %eax, %eax`` | |
352 trick to set ``%eax`` to 0 requires an initial FakeDef of ``%eax``. | |
353 | |
354 - The FakeUse instruction registers a use of a variable, typically to prevent an | |
355 earlier assignment to that variable from being dead-code eliminated. For | |
356 example, lowering an operation like ``x=cc?y:z`` may be done using x86's | |
357 conditional move (cmov) instruction: ``mov z, x; cmov_cc y, x``. Without a | |
358 FakeUse of ``x`` between the two instructions, the liveness analysis pass may | |
359 dead-code eliminate the first instruction. | |
360 | |
361 - The FakeKill instruction is added after a call instruction, and is a quick way | |
362 of indicating that caller-save registers are invalidated. | |
363 | |
364 Pexe parsing | |
365 ------------ | |
366 | |
367 Subzero includes an integrated PNaCl bitcode parser for ``.pexe`` files. It | |
368 parses the ``.pexe`` file function by function, ultimately constructing an ICE | |
369 CFG for each function. After a function is parsed, its CFG is handed off to the | |
370 translation phase. The bitcode parser also parses global initializer data and | |
371 hands it off to be translated to data sections in the object file. | |
372 | |
373 Subzero has another parsing strategy for testing/debugging. LLVM libraries can | |
374 be used to parse a module into LLVM IR (though very slowly relative to Subzero | |
375 native parsing). Then we iterate across the LLVM IR and construct high-level | |
376 ICE, handing off each CFG to the translation phase. | |
377 | |
378 Overview of lowering | |
379 -------------------- | |
380 | |
381 In general, translation goes like this: | |
382 | |
383 - Parse the next function from the ``.pexe`` file and construct a CFG consisting | |
384 of high-level ICE. | |
385 | |
386 - Do analysis passes and transformation passes on the high-level ICE, as | |
387 desired. | |
388 | |
389 - Lower each high-level ICE instruction into a sequence of zero or more | |
390 low-level ICE instructions. Each high-level instruction is generally lowered | |
391 independently, though the target lowering is allowed to look ahead in the | |
392 CfgNode's instruction list if desired. | |
393 | |
394 - Do more analysis and transformation passes on the low-level ICE, as desired. | |
395 | |
396 - Assemble the low-level CFG into an ELF object file (alternatively, a textual | |
397 assembly file that is later assembled by some external tool). | |
398 | |
399 - Repeat for all functions, and also produce object code for data such as global | |
400 initializers and internal constant pools. | |
401 | |
402 Currently there are two optimization levels: ``O2`` and ``Om1``. For ``O2``, | |
403 the intention is to apply all available optimizations to get the best code | |
404 quality (though the initial code quality goal is measured against LLVM's ``O0`` | |
405 code quality). For ``Om1``, the intention is to apply as few optimizations as | |
406 possible and produce code as quickly as possible, accepting poor code quality. | |
407 ``Om1`` is short for "O-minus-one", i.e. "worse than O0", or in other words, | |
408 "sub-zero". | |
409 | |
410 High-level debuggability of generated code is so far not a design requirement. | |
411 Subzero doesn't really do transformations that would obfuscate debugging; the | |
412 main thing might be that register allocation (including stack slot coalescing | |
413 for stack-allocated variables whose live ranges don't overlap) may render a | |
414 variable's value unobtainable after its live range ends. This would not be an | |
415 issue for ``Om1`` since it doesn't register-allocate program-level variables, | |
416 nor does it coalesce stack slots. That said, fully supporting debuggability | |
417 would require a few additions: | |
418 | |
419 - DWARF support would need to be added to Subzero's ELF file emitter. Subzero | |
420 propagates global symbol names, local variable names, and function-internal | |
421 label names that are present in the ``.pexe`` file. This would allow a | |
422 debugger to map addresses back to symbols in the ``.pexe`` file. | |
423 | |
424 - To map ``.pexe`` file symbols back to meaningful source-level symbol names, | |
425 file names, line numbers, etc., Subzero would need to handle `LLVM bitcode | |
426 metadata <http://llvm.org/docs/LangRef.html#metadata>`_ and ``llvm.dbg`` | |
427 `instrinsics<http://llvm.org/docs/LangRef.html#dbg-intrinsics>`_. | |
428 | |
429 - The PNaCl toolchain explicitly strips all this from the ``.pexe`` file, and so | |
430 the toolchain would need to be modified to preserve it. | |
431 | |
432 Our experience so far is that ``Om1`` translates twice as fast as ``O2``, but | |
433 produces code with one third the code quality. ``Om1`` is good for testing and | |
434 debugging -- during translation, it tends to expose errors in the basic lowering | |
435 that might otherwise have been hidden by the register allocator or other | |
436 optimization passes. It also helps determine whether a code correctness problem | |
437 is a fundamental problem in the basic lowering, or an error in another | |
438 optimization pass. | |
439 | |
440 The implementation of target lowering also controls the recipe of passes used | |
441 for ``Om1`` and ``O2`` translation. For example, address mode inference may | |
442 only be relevant for x86. | |
443 | |
444 Lowering strategy | |
445 ----------------- | |
446 | |
447 The core of Subzero's lowering from high-level ICE to low-level ICE is to lower | |
448 each high-level instruction down to a sequence of low-level target-specific | |
449 instructions, in a largely context-free setting. That is, each high-level | |
450 instruction conceptually has a simple template expansion into low-level | |
451 instructions, and lowering can in theory be done in any order. This may sound | |
452 like a small effort, but quite a large number of templates may be needed because | |
453 of the number of PNaCl types and instruction variants. Furthermore, there may | |
454 be optimized templates, e.g. to take advantage of operator commutativity (for | |
455 example, ``x=x+1`` might allow a bettern lowering than ``x=1+x``). This is | |
456 similar to other template-based approaches in fast code generation or | |
457 interpretation, though some decisions are deferred until after some global | |
458 analysis passes, mostly related to register allocation, stack slot assignment, | |
459 and specific choice of instruction variant and addressing mode. | |
460 | |
461 The key idea for a lowering template is to produce valid low-level instructions | |
462 that are guaranteed to meet address mode and other structural requirements of | |
463 the instruction set. For example, on x86, the source operand of an integer | |
464 store instruction must be an immediate or a physical register; a shift | |
465 instruction's shift amount must be an immediate or in register ``%cl``; a | |
466 function's integer return value is in ``%eax``; most x86 instructions are | |
467 two-operand, in contrast to corresponding three-operand high-level instructions; | |
468 etc. | |
469 | |
470 Because target lowering runs before register allocation, there is no way to know | |
471 whether a given ``Ice::Variable`` operand lives on the stack or in a physical | |
472 register. When the low-level instruction calls for a physical register operand, | |
473 the target lowering can create an infinite-weight Variable. This tells the | |
474 register allocator to assign infinite weight when making decisions, effectively | |
475 guaranteeing some physical register. Variables can also be pre-colored to a | |
476 specific physical register (``cl`` in the shift example above), which also gives | |
477 infinite weight. | |
478 | |
479 To illustrate, consider a high-level arithmetic instruction on 32-bit integer | |
480 operands:: | |
481 | |
482 A = B + C | |
483 | |
484 X86 target lowering might produce the following:: | |
485 | |
486 T.inf = B // mov instruction | |
487 T.inf += C // add instruction | |
488 A = T.inf // mov instruction | |
489 | |
490 Here, ``T.inf`` is an infinite-weight temporary. As long as ``T.inf`` has a | |
491 physical register, the three lowered instructions are all encodable regardless | |
492 of whether ``B`` and ``C`` are physical registers, memory, or immediates, and | |
493 whether ``A`` is a physical register or in memory. | |
494 | |
495 In this example, ``A`` must be a Variable and one may be tempted to simplify the | |
496 lowering sequence by setting ``A`` as infinite-weight and using:: | |
497 | |
498 A = B // mov instruction | |
499 A += C // add instruction | |
500 | |
501 This has two problems. First, if the original instruction was actually ``A = | |
502 B + A``, the result would be incorrect. Second, assigning ``A`` a physical | |
503 register applies throughout ``A``'s entire live range. This is probably not | |
504 what is intended, and may ultimately lead to a failure to allocate a register | |
505 for an infinite-weight variable. | |
506 | |
507 This style of lowering leads to many temporaries being generated, so in ``O2`` | |
508 mode, we rely on the register allocator to clean things up. For example, in the | |
509 example above, if ``B`` ends up getting a physical register and its live range | |
510 ends at this instruction, the register allocator is likely to reuse that | |
511 register for ``T.inf``. This leads to ``T.inf=B`` being a redundant register | |
512 copy, which is removed as an emission-time peephole optimization. | |
513 | |
514 O2 lowering | |
515 ----------- | |
516 | |
517 Currently, the ``O2`` lowering recipe is the following: | |
518 | |
519 - Loop nest analysis | |
520 | |
521 - Address mode inference | |
522 | |
523 - Read-modify-write (RMW) transformation | |
524 | |
525 - Basic liveness analysis | |
526 | |
527 - Load optimization | |
528 | |
529 - Target lowering | |
530 | |
531 - Full liveness analysis | |
532 | |
533 - Register allocation | |
534 | |
535 - Phi instruction lowering (advanced) | |
536 | |
537 - Post-phi lowering register allocation | |
538 | |
539 - Branch optimization | |
540 | |
541 These passes are described in more detail below. | |
542 | |
543 Om1 lowering | |
544 ------------ | |
545 | |
546 Currently, the ``Om1`` lowering recipe is the following: | |
547 | |
548 - Phi instruction lowering (simple) | |
549 | |
550 - Target lowering | |
551 | |
552 - Register allocation (infinite-weight and pre-colored only) | |
553 | |
554 Optimization passes | |
555 ------------------- | |
556 | |
557 Liveness analysis | |
558 ^^^^^^^^^^^^^^^^^ | |
559 | |
560 Liveness analysis is a standard dataflow optimization, implemented as follows. | |
561 For each node (basic block), its live-out set is computed as the union of the | |
562 live-in sets of its successor nodes. Then the node's instructions are processed | |
563 in reverse order, updating the live set, until the beginning of the node is | |
564 reached, and the node's live-in set is recorded. If this iteration has changed | |
565 the node's live-in set, the node's predecessors are marked for reprocessing. | |
566 This continues until no more nodes need reprocessing. If nodes are processed in | |
567 reverse topological order, the number of iterations over the CFG is generally | |
568 equal to the maximum loop nest depth. | |
569 | |
570 To implement this, each node records its live-in and live-out sets, initialized | |
571 to the empty set. Each instruction records which of its Variables' live ranges | |
572 end in that instruction, initialized to the empty set. A side effect of | |
573 liveness analysis is dead instruction elimination. Each instruction can be | |
574 marked as tentatively dead, and after the algorithm converges, the tentatively | |
575 dead instructions are permanently deleted. | |
576 | |
577 Optionally, after this liveness analysis completes, we can do live range | |
578 construction, in which we calculate the live range of each variable in terms of | |
579 instruction numbers. A live range is represented as a union of segments, where | |
580 the segment endpoints are instruction numbers. Instruction numbers are required | |
581 to be unique across the CFG, and monotonically increasing within a basic block. | |
582 As a union of segments, live ranges can contain "gaps" and are therefore | |
583 precise. Because of SSA properties, a variable's live range can start at most | |
584 once in a basic block, and can end at most once in a basic block. Liveness | |
585 analysis keeps track of which variable/instruction tuples begin live ranges and | |
586 end live ranges, and combined with live-in and live-out sets, we can efficiently | |
587 build up live ranges of all variables across all basic blocks. | |
588 | |
589 A lot of care is taken to try to make liveness analysis fast and efficient. | |
590 Because of the lowering strategy, the number of variables is generally | |
591 proportional to the number of instructions, leading to an O(N^2) complexity | |
592 algorithm if implemented naively. To improve things based on sparsity, we note | |
593 that most variables are "local" and referenced in at most one basic block (in | |
594 contrast to the "global" variables with multi-block usage), and therefore cannot | |
595 be live across basic blocks. Therefore, the live-in and live-out sets, | |
596 typically represented as bit vectors, can be limited to the set of global | |
597 variables, and the intra-block liveness bit vector can be compacted to hold the | |
598 global variables plus the local variables for that block. | |
599 | |
600 Register allocation | |
601 ^^^^^^^^^^^^^^^^^^^ | |
602 | |
603 Subzero implements a simple linear-scan register allocator, based on the | |
604 allocator described by Hanspeter Mössenböck and Michael Pfeiffer in `Linear Scan | |
605 Register Allocation in the Context of SSA Form and Register Constraints | |
606 <ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe02.PDF>`_. This allocator has | |
607 several nice features: | |
608 | |
609 - Live ranges are represented as unions of segments, as described above, rather | |
610 than a single start/end tuple. | |
611 | |
612 - It allows pre-coloring of variables with specific physical registers. | |
613 | |
614 - It applies equally well to pre-lowered Phi instructions. | |
615 | |
616 The paper suggests an approach of aggressively coalescing variables across Phi | |
617 instructions (i.e., trying to force Phi source and destination variables to have | |
618 the same register assignment), but we reject that in favor of the more natural | |
619 preference mechanism described below. | |
620 | |
621 We enhance the algorithm in the paper with the capability of automatic inference | |
622 of register preference, and with the capability of allowing overlapping live | |
623 ranges to safely share the same register in certain circumstances. If we are | |
624 considering register allocation for variable ``A``, and ``A`` has a single | |
625 defining instruction ``A=B+C``, then the preferred register for ``A``, if | |
626 available, would be the register assigned to ``B`` or ``C``, if any, provided | |
627 that ``B`` or ``C``'s live range does not overlap ``A``'s live range. In this | |
628 way we infer a good register preference for ``A``. | |
629 | |
630 We allow overlapping live ranges to get the same register in certain cases. | |
631 Suppose a high-level instruction like:: | |
632 | |
633 A = unary_op(B) | |
634 | |
635 has been target-lowered like:: | |
636 | |
637 T.inf = B | |
638 A = unary_op(T.inf) | |
639 | |
640 Further, assume that ``B``'s live range continues beyond this instruction | |
641 sequence, and that ``B`` has already been assigned some register. Normally, we | |
642 might want to infer ``B``'s register as a good candidate for ``T.inf``, but it | |
643 turns out that ``T.inf`` and ``B``'s live ranges overlap, requiring them to have | |
644 different registers. But ``T.inf`` is just a read-only copy of ``B`` that is | |
645 guaranteed to be in a register, so in theory these overlapping live ranges could | |
646 safely have the same register. Our implementation allows this overlap as long | |
647 as ``T.inf`` is never modified within ``B``'s live range, and ``B`` is never | |
648 modified within ``T.inf``'s live range. | |
649 | |
650 Subzero's register allocator can be run in 3 configurations. | |
651 | |
652 - Normal mode. All Variables are considered for register allocation. It | |
653 requires full liveness analysis and live range construction as a prerequisite. | |
654 This is used by ``O2`` lowering. | |
655 | |
656 - Minimal mode. Only infinite-weight or pre-colored Variables are considered. | |
657 All other Variables are stack-allocated. It does not require liveness | |
658 analysis; instead, it quickly scans the instructions and records first | |
659 definitions and last uses of all relevant Variables, using that to construct a | |
660 single-segment live range. Although this includes most of the Variables, the | |
661 live ranges are mostly simple, short, and rarely overlapping, which the | |
662 register allocator handles efficiently. This is used by ``Om1`` lowering. | |
663 | |
664 - Post-phi lowering mode. Advanced phi lowering is done after normal-mode | |
665 register allocation, and may result in new infinite-weight Variables that need | |
666 registers. One would like to just run something like minimal mode to assign | |
667 registers to the new Variables while respecting existing register allocation | |
668 decisions. However, it sometimes happens that there are no free registers. | |
669 In this case, some register needs to be forcibly spilled to the stack and | |
670 temporarily reassigned to the new Variable, and reloaded at the end of the new | |
671 Variable's live range. The register must be one that has no explicit | |
672 references during the Variable's live range. Since Subzero currently doesn't | |
673 track def/use chains (though it does record the CfgNode where a Variable is | |
674 defined), we just do a brute-force search across the CfgNode's instruction | |
675 list for the instruction numbers of interest. This situation happens very | |
676 rarely, so there's little point for now in improving its performance. | |
677 | |
678 The basic linear-scan algorithm may, as it proceeds, rescind an early register | |
679 allocation decision, leaving that Variable to be stack-allocated. Some of these | |
680 times, it turns out that the Variable could have been given a different register | |
681 without conflict, but by this time it's too late. The literature recognizes | |
682 this situation and describes "second-chance bin-packing", which Subzero can do. | |
683 We can rerun the register allocator in a mode that respects existing register | |
684 allocation decisions, and sometimes it finds new non-conflicting opportunities. | |
685 In fact, we can repeatedly run the register allocator until convergence. | |
686 Unfortunately, in the current implementation, these subsequent register | |
687 allocation passes end up being extremely expensive. This is because of the | |
688 treatment of the "unhandled pre-colored" Variable set, which is normally very | |
689 small but ends up being quite large on subsequent passes. Its performance can | |
690 probably be made acceptable with a better choice of data structures, but for now | |
691 this second-chance mechanism is disabled. | |
692 | |
693 Future work is to implement LLVM's `Greedy | |
694 <http://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html>`_ | |
695 register allocator as a replacement for the basic linear-scan algorithm, given | |
696 LLVM's experience with its improvement in code quality. (The blog post claims | |
697 that the Greedy allocator also improved maintainability because a lot of hacks | |
698 could be removed, but Subzero is probably not yet to that level of hacks, and is | |
699 less likely to see that particular benefit.) | |
700 | |
701 Basic phi lowering | |
702 ^^^^^^^^^^^^^^^^^^ | |
703 | |
704 The simplest phi lowering strategy works as follows (this is how LLVM ``-O0`` | |
705 implements it). Consider this example:: | |
706 | |
707 L1: | |
708 ... | |
709 br L3 | |
710 L2: | |
711 ... | |
712 br L3 | |
713 L3: | |
714 A = phi [B, L1], [C, L2] | |
715 X = phi [Y, L1], [Z, L2] | |
716 | |
717 For each destination of a phi instruction, we can create a temporary and insert | |
718 the temporary's assignment at the end of the predecessor block:: | |
719 | |
720 L1: | |
721 ... | |
722 A' = B | |
723 X' = Y | |
724 br L3 | |
725 L2: | |
726 ... | |
727 A' = C | |
728 X' = Z | |
729 br L3 | |
730 L2: | |
731 A = A' | |
732 X = X' | |
733 | |
734 This transformation is very simple and reliable. It can be done before target | |
735 lowering and register allocation, and it easily avoids the classic lost-copy and | |
736 related problems. ``Om1`` lowering uses this strategy. | |
737 | |
738 However, it has the disadvantage of initializing temporaries even for branches | |
739 not taken, though that could be mitigated by splitting non-critical edges and | |
740 putting assignments in the edge-split nodes. Another problem is that without | |
741 extra machinery, the assignments to ``A``, ``A'``, ``X``, and ``X'`` are given a | |
742 specific ordering even though phi semantics are that the assignments are | |
743 parallel or unordered. This sometimes imposes false live range overlaps and | |
744 leads to poorer register allocation. | |
745 | |
746 Advanced phi lowering | |
747 ^^^^^^^^^^^^^^^^^^^^^ | |
748 | |
749 ``O2`` lowering defers phi lowering until after register allocation to avoid the | |
750 problem of false live range overlaps. It works as follows. We split each | |
751 incoming edge and move the (parallel) phi assignments into the split nodes. We | |
752 linearize each set of assignments by finding a safe, topological ordering of the | |
753 assignments, respecting register assignments as well. For example:: | |
754 | |
755 A = B | |
756 X = Y | |
757 | |
758 Normally these assignments could be executed in either order, but if ``B`` and | |
759 ``X`` are assigned the same physical register, we would want to use the above | |
760 ordering. Dependency cycles are broken by introducing a temporary. For | |
761 example:: | |
762 | |
763 A = B | |
764 B = A | |
765 | |
766 Here, a temporary breaks the cycle:: | |
767 | |
768 t = A | |
769 A = B | |
770 B = t | |
771 | |
772 Finally, we use the existing target lowering to lower the assignments in this | |
773 basic block, and once that is done for all basic blocks, we run the post-phi | |
774 variant of register allocation on the edge-split basic blocks. | |
775 | |
776 When computing a topological order, we try to first schedule assignments whose | |
777 source has a physical register, and last schedule assignments whose destination | |
778 has a physical register. This helps reduce register pressure. | |
779 | |
780 X86 address mode inference | |
781 ^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
782 | |
783 We try to take advantage of the x86 addressing mode that includes a base | |
784 register, an index register, an index register scale amount, and an immediate | |
785 offset. We do this through simple pattern matching. Starting with a load or | |
786 store instruction where the address is a variable, we initialize the base | |
787 register to that variable, and look up the instruction where that variable is | |
788 defined. If that is an add instruction of two variables and the index register | |
789 hasn't been set, we replace the base and index register with those two | |
790 variables. If instead it is an add instruction of a variable and a constant, we | |
791 replace the base register with the variable and add the constant to the | |
792 immediate offset. | |
793 | |
794 There are several more patterns that can be matched. This pattern matching | |
795 continues on the load or store instruction until no more matches are found. | |
796 Because a program typically has few load and store instructions (not to be | |
797 confused with instructions that manipulate stack variables), this address mode | |
798 inference pass is fast. | |
799 | |
800 X86 read-modify-write inference | |
801 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
802 | |
803 A reasonably common bitcode pattern is a non-atomic update of a memory | |
804 location:: | |
805 | |
806 x = load addr | |
807 y = add x, 1 | |
808 store y, addr | |
809 | |
810 On x86, with good register allocation, the Subzero passes described above | |
811 generate code with only this quality:: | |
812 | |
813 mov [%ebx], %eax | |
814 add $1, %eax | |
815 mov %eax, [%ebx] | |
816 | |
817 However, x86 allows for this kind of code:: | |
818 | |
819 add $1, [%ebx] | |
820 | |
821 which requires fewer instructions, but perhaps more importantly, requires fewer | |
822 physical registers. | |
823 | |
824 It's also important to note that this transformation only makes sense if the | |
825 store instruction ends ``x``'s live range. | |
826 | |
827 Subzero's ``O2`` recipe includes an early pass to find read-modify-write (RMW) | |
828 opportunities via simple pattern matching. The only problem is that it is run | |
829 before liveness analysis, which is needed to determine whether ``x``'s live | |
830 range ends after the RMW. Since liveness analysis is one of the most expensive | |
831 passes, it's not attractive to run it an extra time just for RMW analysis. | |
832 Instead, we essentially generate both the RMW and the non-RMW versions, and then | |
833 during lowering, the RMW version deletes itself if it finds x still live. | |
834 | |
835 X86 compare-branch inference | |
836 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
837 | |
838 In the LLVM instruction set, the compare/branch pattern works like this:: | |
839 | |
840 cond = icmp eq a, b | |
841 br cond, target | |
842 | |
843 The result of the icmp instruction is a single bit, and a conditional branch | |
844 tests that bit. By contrast, most target architectures use this pattern:: | |
845 | |
846 cmp a, b // implicitly sets various bits of FLAGS register | |
847 br eq, target // branch on a particular FLAGS bit | |
848 | |
849 A naive lowering sequence conditionally sets ``cond`` to 0 or 1, then tests | |
850 ``cond`` and conditionally branches. Subzero has a pass that identifies | |
851 boolean-based operations like this and folds them into a single | |
852 compare/branch-like operation. It is set up for more than just cmp/br though. | |
853 Boolean producers include icmp (integer compare), fcmp (floating-point compare), | |
854 and trunc (integer truncation when the destination has bool type). Boolean | |
855 consumers include branch, select (the ternary operator from the C language), and | |
856 sign-extend and zero-extend when the source has bool type. | |
857 | |
858 Sandboxing | |
859 ^^^^^^^^^^ | |
860 | |
861 Native Client's sandbox model uses software fault isolation (SFI) to provide | |
862 safety when running untrusted code in a browser or other environment. Subzero | |
863 implements Native Client's `sandboxing | |
864 <https://developer.chrome.com/native-client/reference/sandbox_internals/index>`_ | |
865 to enable Subzero-translated executables to be run inside Chrome. Subzero also | |
866 provides a fairly simple framework for investigating alternative sandbox models | |
867 or other restrictions on the sandbox model. | |
868 | |
869 Sandboxing in Subzero is not actually implemented as a separate pass, but is | |
870 integrated into lowering and assembly. | |
871 | |
872 - Indirect branches, including the ret instruction, are masked to a bundle | |
873 boundary and bundle-locked. | |
874 | |
875 - Call instructions are aligned to the end of the bundle so that the return | |
876 address is bundle-aligned. | |
877 | |
878 - Indirect branch targets, including function entry and targets in a switch | |
879 statement jump table, are bundle-aligned. | |
880 | |
881 - The intrinsic for reading the thread pointer is inlined appropriately. | |
882 | |
883 - For x86-64, non-stack memory accesses are with respect to the reserved sandbox | |
884 base register. We reduce the aggressiveness of address mode inference to | |
885 leave room for the sandbox base register during lowering. There are no memory | |
886 sandboxing changes for x86-32. | |
887 | |
888 Code emission | |
889 ------------- | |
890 | |
891 Subzero's integrated assembler is derived from Dart's `assembler code | |
892 <https://github.com/dart-lang/sdk/tree/master/runtime/vm>'_. There is a pass | |
893 that iterates through the low-level ICE instructions and invokes the relevant | |
894 assembler functions. Placeholders are added for later fixup of branch target | |
895 offsets. (Backward branches use short offsets if possible; forward branches | |
896 generally use long offsets unless it is an intra-block branch of "known" short | |
897 length.) The assembler emits into a staging buffer. Once emission into the | |
898 staging buffer for a function is complete, the data is emitted to the output | |
899 file as an ELF object file, and metadata such as relocations, symbol table, and | |
900 string table, are accumulated for emission at the end. Global data initializers | |
901 are emitted similarly. A key point is that at this point, the staging buffer | |
902 can be deallocated, and only a minimum of data needs to held until the end. | |
903 | |
904 As a debugging alternative, Subzero can emit textual assembly code which can | |
905 then be run through an external assembler. This is of course super slow, but | |
906 quite valuable when bringing up a new target. | |
907 | |
908 As another debugging option, the staging buffer can be emitted as textual | |
909 assembly, primarily in the form of ".byte" lines. This allows the assembler to | |
910 be tested separately from the ELF related code. | |
911 | |
912 Memory management | |
913 ----------------- | |
914 | |
915 Where possible, we allocate from a ``CfgLocalAllocator`` which derives from | |
916 LLVM's ``BumpPtrAllocator``. This is an arena-style allocator where objects | |
917 allocated from the arena are never actually freed; instead, when the CFG | |
918 translation completes and the CFG is deleted, the entire arena memory is | |
919 reclaimed at once. This style of allocation works well in an environment like a | |
920 compiler where there are distinct phases with only easily-identifiable objects | |
921 living across phases. It frees the developer from having to manage object | |
922 deletion, and it amortizes deletion costs across just a single arena deletion at | |
923 the end of the phase. Furthermore, it helps scalability by allocating entirely | |
924 from thread-local memory pools, and minimizing global locking of the heap. | |
925 | |
926 Instructions are probably the most heavily allocated complex class in Subzero. | |
927 We represent an instruction list as an intrusive doubly linked list, allocate | |
928 all instructions from the ``CfgLocalAllocator``, and we make sure each | |
929 instruction subclass is basically `POD | |
930 <http://en.cppreference.com/w/cpp/concept/PODType>`_ (Plain Old Data) with a | |
931 trivial destructor. This way, when the CFG is finished, we don't need to | |
932 individually deallocate every instruction. We do similar for Variables, which | |
933 is probably the second most popular complex class. | |
934 | |
935 There are some situations where passes need to use some `STL container class | |
936 <http://en.cppreference.com/w/cpp/container>`_. Subzero has a way of using the | |
937 ``CfgLocalAllocator`` as the container allocator if this is needed. | |
938 | |
939 Multithreaded translation | |
940 ------------------------- | |
941 | |
942 Subzero is designed to be able to translate functions in parallel. With the | |
943 ``-threads=N`` command-line option, there is a 3-stage producer-consumer | |
944 pipeline: | |
945 | |
946 - A single thread parses the ``.pexe`` file and produces a sequence of work | |
947 units. A work unit can be either a fully constructed CFG, or a set of global | |
948 initializers. The work unit includes its sequence number denoting its parse | |
949 order. Each work unit is added to the translation queue. | |
950 | |
951 - There are N translation threads that draw work units from the translation | |
952 queue and lower them into assembler buffers. Each assembler buffer is added | |
953 to the emitter queue, tagged with its sequence number. The CFG and its | |
954 ``CfgLocalAllocator`` are disposed of at this point. | |
955 | |
956 - A single thread draws assembler buffers from the emitter queue and appends to | |
957 the output file. It uses the sequence numbers to reintegrate the assembler | |
958 buffers according to the original parse order, such that output order is | |
959 always deterministic. | |
960 | |
961 This means that with ``-threads=N``, there are actually ``N+1`` spawned threads | |
962 for a total of ``N+2`` execution threads, taking the parser and emitter threads | |
963 into account. For the special case of ``N=0``, execution is entirely sequential | |
964 -- the same thread parses, translates, and emits, one function at a time. This | |
965 is useful for performance measurements. | |
966 | |
967 Ideally, we would like to get near-linear scalability as the number of | |
968 translation threads increases. We expect that ``-threads=1`` should be slightly | |
969 faster than ``-threads=0`` as the small amount of time spent parsing and | |
970 emitting is done largely in parallel with translation. With perfect | |
971 scalability, we see ``-threads=N`` translating ``N`` times as fast as | |
972 ``-threads=1``, up until the point where parsing or emitting becomes the | |
973 bottleneck, or ``N+2`` exceeds the number of CPU cores. In reality, memory | |
974 performance would become a bottleneck and efficiency might peak at, say, 75%. | |
975 | |
976 Currently, parsing takes about 11% of total sequential time. If translation | |
977 scalability ever gets so fast and awesomely scalable that parsing becomes a | |
978 bottleneck, it should be possible to make parsing multithreaded as well. | |
979 | |
980 Internally, all shared, mutable data is held in the GlobalContext object, and | |
981 access to each field is guarded by a mutex. | |
982 | |
983 Security | |
984 -------- | |
985 | |
986 Subzero includes a number of security features in the generated code, as well as | |
987 in the Subzero translator itself, which run on top of the existing Native Client | |
988 sandbox as well as Chrome's OS-level sandbox. | |
989 | |
990 Sandboxed translator | |
991 ^^^^^^^^^^^^^^^^^^^^ | |
992 | |
993 When running inside the browser, the Subzero translator executes as sandboxed, | |
994 untrusted code that is initially checked by the validator, just like the | |
995 LLVM-based ``pnacl-llc`` translator. As such, the Subzero binary should be no | |
996 more or less secure than the translator it replaces, from the point of view of | |
997 the Chrome sandbox. That said, Subzero is much smaller than ``pnacl-llc`` and | |
998 was designed from the start with security in mind, so one expects fewer attacker | |
999 opportunities here. | |
1000 | |
1001 Code diversification | |
1002 ^^^^^^^^^^^^^^^^^^^^ | |
1003 | |
1004 `Return-oriented programming | |
1005 <https://en.wikipedia.org/wiki/Return-oriented_programming>`_ (ROP) is a | |
1006 now-common technique for starting with e.g. a known buffer overflow situation | |
1007 and launching it into a deeper exploit. The attacker scans the executable | |
1008 looking for ROP gadgets, which are short sequences of code that happen to load | |
1009 known values into known registers and then return. An attacker who manages to | |
1010 overwrite parts of the stack can overwrite it with carefully chosen return | |
1011 addresses such that certain ROP gadgets are effectively chained together to set | |
1012 up the register state as desired, finally returning to some code that manages to | |
1013 do something nasty based on those register values. | |
1014 | |
1015 If there is a popular ``.pexe`` with a large install base, the attacker could | |
1016 run Subzero on it and scan the executable for suitable ROP gadgets to use as | |
1017 part of a potential exploit. Note that if the trusted validator is working | |
1018 correctly, these ROP gadgets are limited to starting at a bundle boundary and | |
1019 cannot use the trick of finding a gadget that happens to begin inside another | |
1020 instruction. All the same, gadgets with these constraints still exist and the | |
1021 attacker has access to them. This is the attack model we focus most on -- | |
1022 protecting the user against misuse of a "trusted" developer's application, as | |
1023 opposed to mischief from a malicious ``.pexe`` file. | |
1024 | |
1025 Subzero can mitigate these attacks to some degree through code diversification. | |
1026 Specifically, we can apply some randomness to the code generation that makes ROP | |
1027 gadgets less predictable. This randomness can have some compile-time cost, and | |
1028 it can affect the code quality; and some diversifications may be more effective | |
1029 than others. A more detailed treatment of hardening techniques may be found in | |
1030 the Matasano report "`Attacking Clientside JIT Compilers | |
1031 <https://www.nccgroup.trust/globalassets/resources/us/presentations/documents/at
tacking_clientside_jit_compilers_paper.pdf>`_". | |
1032 | |
1033 To evaluate diversification effectiveness, we use a third-party ROP gadget | |
1034 finder and limit its results to bundle-aligned addresses. For a given | |
1035 diversification technique, we run it with a number of different random seeds, | |
1036 find ROP gadgets for each version, and determine how persistent each ROP gadget | |
1037 is across the different versions. A gadget is persistent if the same gadget is | |
1038 found at the same code address. The best diversifications are ones with low | |
1039 gadget persistence rates. | |
1040 | |
1041 Subzero implements 7 different diversification techniques. Below is a | |
1042 discussion of each technique, its effectiveness, and its cost. The discussions | |
1043 of cost and effectiveness are for a single diversification technique; the | |
1044 translation-time costs for multiple techniques are additive, but the effects of | |
1045 multiple techniques on code quality and effectiveness are not yet known. | |
1046 | |
1047 In Subzero's implementation, each randomization is "repeatable" in a sense. | |
1048 Each pass that includes a randomization option gets its own private instance of | |
1049 a random number generator (RNG). The RNG is seeded with a combination of a | |
1050 global seed, the pass ID, and the function's sequence number. The global seed | |
1051 is designed to be different across runs (perhaps based on the current time), but | |
1052 for debugging, the global seed can be set to a specific value and the results | |
1053 will be repeatable. | |
1054 | |
1055 Subzero-generated code is subject to diversification once per translation, and | |
1056 then Chrome caches the diversified binary for subsequent executions. An | |
1057 attacker may attempt to run the binary multiple times hoping for | |
1058 higher-probability combinations of ROP gadgets. When the attacker guesses | |
1059 wrong, a likely outcome is an application crash. Chrome throttles creation of | |
1060 crashy processes which reduces the likelihood of the attacker eventually gaining | |
1061 a foothold. | |
1062 | |
1063 Constant blinding | |
1064 ~~~~~~~~~~~~~~~~~ | |
1065 | |
1066 Here, we prevent attackers from controlling large immediates in the text | |
1067 (executable) section. A random cookie is generated for each function, and if | |
1068 the constant exceeds a specified threshold, the constant is obfuscated with the | |
1069 cookie and equivalent code is generated. For example, instead of this x86 | |
1070 instruction:: | |
1071 | |
1072 mov $0x11223344, <%Reg/Mem> | |
1073 | |
1074 the following code might be generated:: | |
1075 | |
1076 mov $(0x11223344+Cookie), %temp | |
1077 lea -Cookie(%temp), %temp | |
1078 mov %temp, <%Reg/Mem> | |
1079 | |
1080 The ``lea`` instruction is used rather than e.g. ``add``/``sub`` or ``xor``, to | |
1081 prevent unintended effects on the flags register. | |
1082 | |
1083 This transformation has almost no effect on translation time, and about 1% | |
1084 impact on code quality, depending on the threshold chosen. It does little to | |
1085 reduce gadget persistence, but it does remove a lot of potential opportunities | |
1086 to construct intra-instruction ROP gadgets (which an attacker could use only if | |
1087 a validator bug were discovered, since the Native Client sandbox and associated | |
1088 validator force returns and other indirect branches to be to bundle-aligned | |
1089 addresses). | |
1090 | |
1091 Constant pooling | |
1092 ~~~~~~~~~~~~~~~~ | |
1093 | |
1094 This is similar to constant blinding, in that large immediates are removed from | |
1095 the text section. In this case, each unique constant above the threshold is | |
1096 stored in a read-only data section and the constant is accessed via a memory | |
1097 load. For the above example, the following code might be generated:: | |
1098 | |
1099 mov $Label$1, %temp | |
1100 mov %temp, <%Reg/Mem> | |
1101 | |
1102 This has a similarly small impact on translation time and ROP gadget | |
1103 persistence, and a smaller (better) impact on code quality. This is because it | |
1104 uses fewer instructions, and in some cases good register allocation leads to no | |
1105 increase in instruction count. Note that this still gives an attacker some | |
1106 limited amount of control over some text section values, unless we randomize the | |
1107 constant pool layout. | |
1108 | |
1109 Static data reordering | |
1110 ~~~~~~~~~~~~~~~~~~~~~~ | |
1111 | |
1112 This transformation limits the attacker's ability to control bits in global data | |
1113 address references. It simply permutes the order in memory of global variables | |
1114 and internal constant pool entries. For the constant pool, we only permute | |
1115 within a type (i.e., emit a randomized list of ints, followed by a randomized | |
1116 list of floats, etc.) to maintain good packing in the face of alignment | |
1117 constraints. | |
1118 | |
1119 As might be expected, this has no impact on code quality, translation time, or | |
1120 ROP gadget persistence (though as above, it limits opportunities for | |
1121 intra-instruction ROP gadgets with a broken validator). | |
1122 | |
1123 Basic block reordering | |
1124 ~~~~~~~~~~~~~~~~~~~~~~ | |
1125 | |
1126 Here, we randomize the order of basic blocks within a function, with the | |
1127 constraint that we still want to maintain a topological order as much as | |
1128 possible, to avoid making the code too branchy. | |
1129 | |
1130 This has no impact on code quality, and about 1% impact on translation time, due | |
1131 to a separate pass to recompute layout. It ends up having a huge effect on ROP | |
1132 gadget persistence, tied for best with nop insertion, reducing ROP gadget | |
1133 persistence to less than 5%. | |
1134 | |
1135 Function reordering | |
1136 ~~~~~~~~~~~~~~~~~~~ | |
1137 | |
1138 Here, we permute the order that functions are emitted, primarily to shift ROP | |
1139 gadgets around to less predictable locations. It may also change call address | |
1140 offsets in case the attacker was trying to control that offset in the code. | |
1141 | |
1142 To control latency and memory footprint, we don't arbitrarily permute functions. | |
1143 Instead, for some relatively small value of N, we queue up N assembler buffers, | |
1144 and then emit the N functions in random order, and repeat until all functions | |
1145 are emitted. | |
1146 | |
1147 Function reordering has no impact on translation time or code quality. | |
1148 Measurements indicate that it reduces ROP gadget persistence to about 15%. | |
1149 | |
1150 Nop insertion | |
1151 ~~~~~~~~~~~~~ | |
1152 | |
1153 This diversification randomly adds a nop instruction after each regular | |
1154 instruction, with some probability. Nop instructions of different lengths may | |
1155 be selected. Nop instructions are never added inside a bundle_lock region. | |
1156 Note that when sandboxing is enabled, nop instructions are already being added | |
1157 for bundle alignment, so the diversification nop instructions may simply be | |
1158 taking the place of alignment nop instructions, though distributed differently | |
1159 through the bundle. | |
1160 | |
1161 In Subzero's currently implementation, nop insertion adds 3-5% to the | |
1162 translation time, but this is probably because it is implemented as a separate | |
1163 pass that adds actual nop instructions to the IR. The overhead would probably | |
1164 be a lot less if it were integrated into the assembler pass. The code quality | |
1165 is also reduced by 3-5%, making nop insertion the most expensive of the | |
1166 diversification techniques. | |
1167 | |
1168 Nop insertion is very effective in reducing ROP gadget persistence, at the same | |
1169 level as basic block randomization (less than 5%). But given nop insertion's | |
1170 impact on translation time and code quality, one would most likely prefer to use | |
1171 basic block randomization instead (though the combined effects of the different | |
1172 diversification techniques have not yet been studied). | |
1173 | |
1174 Register allocation randomization | |
1175 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
1176 | |
1177 In this diversification, the register allocator tries to make different but | |
1178 mostly functionally equivalent choices, while maintaining stable code quality. | |
1179 | |
1180 A naive approach would be the following. Whenever the allocator has more than | |
1181 one choice for assigning a register, choose randomly among those options. And | |
1182 whenever there are no registers available and there is a tie for the | |
1183 lowest-weight variable, randomly select one of the lowest-weight variables to | |
1184 evict. Because of the one-pass nature of the linear-scan algorithm, this | |
1185 randomization strategy can have a large impact on which variables are ultimately | |
1186 assigned registers, with a corresponding large impact on code quality. | |
1187 | |
1188 Instead, we choose an approach that tries to keep code quality stable regardless | |
1189 of the random seed. We partition the set of physical registers into equivalence | |
1190 classes. If a register is pre-colored in the function (i.e., referenced | |
1191 explicitly by name), it forms its own equivalence class. The remaining | |
1192 registers are partitioned according to their combination of attributes such as | |
1193 integer versus floating-point, 8-bit versus 32-bit, caller-save versus | |
1194 callee-saved, etc. Each equivalence class is randomly permuted, and the | |
1195 complete permutation is applied to the final register assignments. | |
1196 | |
1197 Register randomization reduces ROP gadget persistence to about 10% on average, | |
1198 though there tends to be fairly high variance across functions and applications. | |
1199 This probably has to do with the set of restrictions in the x86-32 instruction | |
1200 set and ABI, such as few general-purpose registers, ``%eax`` used for return | |
1201 values, ``%edx`` used for division, ``%cl`` used for shifting, etc. As | |
1202 intended, register randomization has no impact on code quality, and a slight | |
1203 (0.5%) impact on translation time due to an extra scan over the variables to | |
1204 identify pre-colored registers. | |
1205 | |
1206 Fuzzing | |
1207 ^^^^^^^ | |
1208 | |
1209 We have started fuzz-testing the ``.pexe`` files input to Subzero, using a | |
1210 combination of `afl-fuzz <http://lcamtuf.coredump.cx/afl/>`_, LLVM's `libFuzzer | |
1211 <http://llvm.org/docs/LibFuzzer.html>`_, and custom tooling. The purpose is to | |
1212 find and fix cases where Subzero crashes or otherwise ungracefully fails on | |
1213 unexpected inputs, and to do so automatically over a large range of unexpected | |
1214 inputs. By fixing bugs that arise from fuzz testing, we reduce the possibility | |
1215 of an attacker exploiting these bugs. | |
1216 | |
1217 Most of the problems found so far are ones most appropriately handled in the | |
1218 parser. However, there have been a couple that have identified problems in the | |
1219 lowering, or otherwise inappropriately triggered assertion failures and fatal | |
1220 errors. We continue to dig into this area. | |
1221 | |
1222 Future security work | |
1223 ^^^^^^^^^^^^^^^^^^^^ | |
1224 | |
1225 Subzero is well-positioned to explore other future security enhancements, e.g.: | |
1226 | |
1227 - Tightening the Native Client sandbox. ABI changes, such as the previous work | |
1228 on `hiding the sandbox base address | |
1229 <https://docs.google.com/document/d/1eskaI4353XdsJQFJLRnZzb_YIESQx4gNRzf31dqXV
G8>`_ | |
1230 in x86-64, are easy to experiment with in Subzero. | |
1231 | |
1232 - Making the executable code section read-only. This would prevent a PNaCl | |
1233 application from inspecting its own binary and trying to find ROP gadgets even | |
1234 after code diversification has been performed. It may still be susceptible to | |
1235 `blind ROP <http://www.scs.stanford.edu/brop/bittau-brop.pdf>`_ attacks, | |
1236 security is still overall improved. | |
1237 | |
1238 - Instruction selection diversification. It may be possible to lower a given | |
1239 instruction in several largely equivalent ways, which gives more opportunities | |
1240 for code randomization. | |
1241 | |
1242 Chrome integration | |
1243 ------------------ | |
1244 | |
1245 Currently Subzero is available in Chrome for the x86-32 architecture, but under | |
1246 a flag. When the flag is enabled, Subzero is used when the `manifest file | |
1247 <https://developer.chrome.com/native-client/reference/nacl-manifest-format>`_ | |
1248 linking to the ``.pexe`` file specifies the ``O0`` optimization level. | |
1249 | |
1250 The next step is to remove the flag, i.e. invoke Subzero as the only translator | |
1251 for ``O0``-specified manifest files. | |
1252 | |
1253 Ultimately, Subzero might produce code rivaling LLVM ``O2`` quality, in which | |
1254 case Subzero could be used for all PNaCl translation. | |
1255 | |
1256 Command line options | |
1257 -------------------- | |
1258 | |
1259 Subzero has a number of command-line options for debugging and diagnostics. | |
1260 Among the more interesting are the following. | |
1261 | |
1262 - Using the ``-verbose`` flag, Subzero will dump the CFG, or produce other | |
1263 diagnostic output, with various levels of detail after each pass. Instruction | |
1264 numbers can be printed or suppressed. Deleted instructions can be printed or | |
1265 suppressed (they are retained in the instruction list, as discussed earlier, | |
1266 because they can help explain how lower-level instructions originated). | |
1267 Liveness information can be printed when available. Details of register | |
1268 allocation can be printed as register allocator decisions are made. And more. | |
1269 | |
1270 - Running Subzero with any level of verbosity produces an enormous amount of | |
1271 output. When debugging a single function, verbose output can be suppressed | |
1272 except for a particular function. The ``-verbose-focus`` flag suppresses | |
1273 verbose output except for the specified function. | |
1274 | |
1275 - Subzero has a ``-timing`` option that prints a breakdown of pass-level timing | |
1276 at exit. Timing markers can be placed in the Subzero source code to demarcate | |
1277 logical operations or passes of interest. Basic timing information plus | |
1278 call-stack type timing information is printed at the end. | |
1279 | |
1280 - Along with ``-timing``, the user can instead get a report on the overall | |
1281 translation time for each function, to help focus on timing outliers. Also, | |
1282 ``-timing-focus`` limits the ``-timing`` reporting to a single function, | |
1283 instead of aggregating pass timing across all functions. | |
1284 | |
1285 - The ``-szstats`` option reports various statistics on each function, such as | |
1286 stack frame size, static instruction count, etc. It may be helpful to track | |
1287 these stats over time as Subzero is improved, as an approximate measure of | |
1288 code quality. | |
1289 | |
1290 - The flag ``-asm-verbose``, in conjunction with emitting textual assembly | |
1291 output, annotate the assembly output with register-focused liveness | |
1292 information. In particular, each basic block is annotated with which | |
1293 registers are live-in and live-out, and each instruction is annotated with | |
1294 which registers' and stack locations' live ranges end at that instruction. | |
1295 This is really useful when studying the generated code to find opportunities | |
1296 for code quality improvements. | |
1297 | |
1298 Testing and debugging | |
1299 --------------------- | |
1300 | |
1301 LLVM lit tests | |
1302 ^^^^^^^^^^^^^^ | |
1303 | |
1304 For basic testing, Subzero uses LLVM's `lit | |
1305 <http://llvm.org/docs/CommandGuide/lit.html>`_ framework for running tests. We | |
1306 have a suite of hundreds of small functions where we test for particular | |
1307 assembly code patterns across different target architectures. | |
1308 | |
1309 Cross tests | |
1310 ^^^^^^^^^^^ | |
1311 | |
1312 Unfortunately, the lit tests don't do a great job of precisely testing the | |
1313 correctness of the output. Much better are the cross tests, which are execution | |
1314 tests that compare Subzero and ``pnacl-llc`` translated bitcode across a wide | |
1315 variety of interesting inputs. Each cross test consists of a set of C, C++, | |
1316 and/or low-level bitcode files. The C and C++ source files are compiled down to | |
1317 bitcode. The bitcode files are translated by ``pnacl-llc`` and also by Subzero. | |
1318 Subzero mangles global symbol names with a special prefix to avoid duplicate | |
1319 symbol errors. A driver program invokes both versions on a large set of | |
1320 interesting inputs, and reports when the Subzero and ``pnacl-llc`` results | |
1321 differ. Cross tests turn out to be an excellent way of testing the basic | |
1322 lowering patterns, but they are less useful for testing more global things like | |
1323 liveness analysis and register allocation. | |
1324 | |
1325 Bisection debugging | |
1326 ^^^^^^^^^^^^^^^^^^^ | |
1327 | |
1328 Sometimes with a new application, Subzero will end up producing incorrect code | |
1329 that either crashes at runtime or otherwise produces the wrong results. When | |
1330 this happens, we need to narrow it down to a single function (or small set of | |
1331 functions) that yield incorrect behavior. For this, we have a bisection | |
1332 debugging framework. Here, we initially translate the entire application once | |
1333 with Subzero and once with ``pnacl-llc``. We then use ``objdump`` to | |
1334 selectively weaken symbols based on a whitelist or blacklist provided on the | |
1335 command line. The two object files can then be linked together without link | |
1336 errors, with the desired version of each method "winning". Then the binary is | |
1337 tested, and bisection proceeds based on whether the binary produces correct | |
1338 output. | |
1339 | |
1340 When the bisection completes, we are left with a minimal set of | |
1341 Subzero-translated functions that cause the failure. Usually it is a single | |
1342 function, though sometimes it might require a combination of several functions | |
1343 to cause a failure; this may be due to an incorrect call ABI, for example. | |
1344 However, Murphy's Law implies that the single failing function is enormous and | |
1345 impractical to debug. In that case, we can restart the bisection, explicitly | |
1346 blacklisting the enormous function, and try to find another candidate to debug. | |
1347 (Future work is to automate this to find all minimal sets of functions, so that | |
1348 debugging can focus on the simplest example.) | |
1349 | |
1350 Fuzz testing | |
1351 ^^^^^^^^^^^^ | |
1352 | |
1353 As described above, we try to find internal Subzero bugs using fuzz testing | |
1354 techniques. | |
1355 | |
1356 Sanitizers | |
1357 ^^^^^^^^^^ | |
1358 | |
1359 Subzero can be built with `AddressSanitizer | |
1360 <http://clang.llvm.org/docs/AddressSanitizer.html>`_ (ASan) or `ThreadSanitizer | |
1361 <http://clang.llvm.org/docs/ThreadSanitizer.html>`_ (TSan) support. This is | |
1362 done using something as simple as ``make ASAN=1`` or ``make TSAN=1``. So far, | |
1363 multithreading has been simple enough that TSan hasn't found any bugs, but ASan | |
1364 has found at least one memory leak which was subsequently fixed. | |
1365 `UndefinedBehaviorSanitizer | |
1366 <http://clang.llvm.org/docs/UsersManual.html#controlling-code-generation>`_ | |
1367 (UBSan) support is in progress. `Control flow integrity sanitization | |
1368 <http://clang.llvm.org/docs/ControlFlowIntegrity.html>`_ is also under | |
1369 consideration. | |
1370 | |
1371 Current status | |
1372 ============== | |
1373 | |
1374 Target architectures | |
1375 -------------------- | |
1376 | |
1377 Subzero is currently more or less complete for the x86-32 target. It has been | |
1378 refactored and extended to handle x86-64 as well, and that is mostly complete at | |
1379 this point. | |
1380 | |
1381 ARM32 work is in progress. It currently lacks the testing level of x86, at | |
1382 least in part because Subzero's register allocator needs modifications to handle | |
1383 ARM's aliasing of floating point and vector registers. Specifically, a 64-bit | |
1384 register is actually a gang of two consecutive and aligned 32-bit registers, and | |
1385 a 128-bit register is a gang of 4 consecutive and aligned 32-bit registers. | |
1386 ARM64 work has not started; when it does, it will be native-only since the | |
1387 Native Client sandbox model, validator, and other tools have never been defined. | |
1388 | |
1389 An external contributor is adding MIPS support, in most part by following the | |
1390 ARM work. | |
1391 | |
1392 Translator performance | |
1393 ---------------------- | |
1394 | |
1395 Single-threaded translation speed is currently about 5× the ``pnacl-llc`` | |
1396 translation speed. For a large ``.pexe`` file, the time breaks down as: | |
1397 | |
1398 - 11% for parsing and initial IR building | |
1399 | |
1400 - 4% for emitting to /dev/null | |
1401 | |
1402 - 27% for liveness analysis (two liveness passes plus live range construction) | |
1403 | |
1404 - 15% for linear-scan register allocation | |
1405 | |
1406 - 9% for basic lowering | |
1407 | |
1408 - 10% for advanced phi lowering | |
1409 | |
1410 - ~11% for other minor analysis | |
1411 | |
1412 - ~10% measurement overhead to acquire these numbers | |
1413 | |
1414 Some improvements could undoubtedly be made, but it will be hard to increase the | |
1415 speed to 10× of ``pnacl-llc`` while keeping acceptable code quality. With | |
1416 ``-Om1`` (lack of) optimization, we do actually achieve roughly 10× | |
1417 ``pnacl-llc`` translation speed, but code quality drops by a factor of 3. | |
1418 | |
1419 Code quality | |
1420 ------------ | |
1421 | |
1422 Measured across 16 components of spec2k, Subzero's code quality is uniformly | |
1423 better than ``pnacl-llc`` ``-O0`` code quality, and in many cases solidly | |
1424 between ``pnacl-llc`` ``-O0`` and ``-O2``. | |
1425 | |
1426 Translator size | |
1427 --------------- | |
1428 | |
1429 When built in MINIMAL mode, the x86-64 native translator size for the x86-32 | |
1430 target is about 700 KB, not including the size of functions referenced in | |
1431 dynamically-linked libraries. The sandboxed version of Subzero is a bit over 1 | |
1432 MB, and it is statically linked and also includes nop padding for bundling as | |
1433 well as indirect branch masking. | |
1434 | |
1435 Translator memory footprint | |
1436 --------------------------- | |
1437 | |
1438 It's hard to draw firm conclusions about memory footprint, since the footprint | |
1439 is at least proportional to the input function size, and there is no real limit | |
1440 on the size of functions in the ``.pexe`` file. | |
1441 | |
1442 That said, we looked at the memory footprint over time as Subzero translated | |
1443 ``pnacl-llc.pexe``, which is the largest ``.pexe`` file (7.2 MB) at our | |
1444 disposal. One of LLVM's libraries that Subzero uses can report the current | |
1445 malloc heap usage. With single-threaded translation, Subzero tends to hover | |
1446 around 15 MB of memory usage. There are a couple of monstrous functions where | |
1447 Subzero grows to around 100 MB, but then it drops back down after those | |
1448 functions finish translating. In contrast, ``pnacl-llc`` grows larger and | |
1449 larger throughout translation, reaching several hundred MB by the time it | |
1450 completes. | |
1451 | |
1452 It's a bit more interesting when we enable multithreaded translation. When | |
1453 there are N translation threads, Subzero implements a policy that limits the | |
1454 size of the translation queue to N entries -- if it is "full" when the parser | |
1455 tries to add a new CFG, the parser blocks until one of the translation threads | |
1456 removes a CFG. This means the number of in-memory CFGs can (and generally does) | |
1457 reach 2*N+1, and so the memory footprint rises in proportion to the number of | |
1458 threads. Adding to the pressure is the observation that the monstrous functions | |
1459 also take proportionally longer time to translate, so there's a good chance many | |
1460 of the monstrous functions will be active at the same time with multithreaded | |
1461 translation. As a result, for N=32, Subzero's memory footprint peaks at about | |
1462 260 MB, but drops back down as the large functions finish translating. | |
1463 | |
1464 If this peak memory size becomes a problem, it might be possible for the parser | |
1465 to resequence the functions to try to spread out the larger functions, or to | |
1466 throttle the translation queue to prevent too many in-flight large functions. | |
1467 It may also be possible to throttle based on memory pressure signaling from | |
1468 Chrome. | |
1469 | |
1470 Translator scalability | |
1471 ---------------------- | |
1472 | |
1473 Currently scalability is "not very good". Multiple translation threads lead to | |
1474 faster translation, but not to the degree desired. We haven't dug in to | |
1475 investigate yet. | |
1476 | |
1477 There are a few areas to investigate. First, there may be contention on the | |
1478 constant pool, which all threads access, and which requires locked access even | |
1479 for reading. This could be mitigated by keeping a CFG-local cache of the most | |
1480 common constants. | |
1481 | |
1482 Second, there may be contention on memory allocation. While almost all CFG | |
1483 objects are allocated from the CFG-local allocator, some passes use temporary | |
1484 STL containers that use the default allocator, which may require global locking. | |
1485 This could be mitigated by switching these to the CFG-local allocator. | |
1486 | |
1487 Third, multithreading may make the default allocator strategy more expensive. | |
1488 In a single-threaded environment, a pass will allocate its containers, run the | |
1489 pass, and deallocate the containers. This results in stack-like allocation | |
1490 behavior and makes the heap free list easier to manage, with less heap | |
1491 fragmentation. But when multithreading is added, the allocations and | |
1492 deallocations become much less stack-like, making allocation and deallocation | |
1493 operations individually more expensive. Again, this could be mitigated by | |
1494 switching these to the CFG-local allocator. | |
OLD | NEW |