Design of the Subzero fast code generator
=========================================

Introduction
------------

The `Portable Native Client (PNaCl) <http://gonacl.com>`_ project includes
compiler technology based on `LLVM <http://llvm.org/>`_.  The developer uses the
PNaCl toolchain to compile their application to architecture-neutral PNaCl
bitcode (a ``.pexe`` file), using as much architecture-neutral optimization as
possible.  The ``.pexe`` file is downloaded to the user's browser where the
PNaCl translator (a component of Chrome) compiles the ``.pexe`` file to
`sandboxed
<https://developer.chrome.com/native-client/reference/sandbox_internals/index>`_
native code.  The translator uses architecture-specific optimizations as much as
practical to generate good native code.

The native code can be cached by the browser to avoid repeating translation on
future page loads.  However, first-time user experience is hampered by long
translation times.  The LLVM-based PNaCl translator is pretty slow, even when
using ``-O0`` to minimize optimizations, so delays are especially noticeable on
slow browser platforms such as ARM-based Chromebooks.

Translator slowness can be mitigated or hidden in a number of ways.

- Parallel translation.  However, slow machines where this matters most, e.g.
  ARM-based Chromebooks, are likely to have fewer cores to parallelize across,
  and are likely to have less memory available for multiple translation threads
  to use.

- Streaming translation, i.e. start translating as soon as the download starts.
  This doesn't help much when translation speed is 10× slower than download
  speed, or the ``.pexe`` file is already cached while the translated binary was
  flushed from the cache.

- Arrange the web page such that translation is done in parallel with
  downloading large assets.

- Arrange the web page to distract the user with `cat videos
  <https://www.youtube.com/watch?v=tLt5rBfNucc>`_ while translation is in
  progress.

Or, improve translator performance to something more reasonable.

This document describes Subzero's attempt to improve translation speed by an
order of magnitude while rivaling LLVM's code quality.  Subzero does this
through minimal IR layering, lean data structures and passes, and a careful
selection of fast optimization passes.  It has two optimization recipes: full
optimizations (``O2``) and minimal optimizations (``Om1``).  The recipes are the
following (described in more detail below):

+-----------------------------------+-----------------------+
| O2 recipe                         | Om1 recipe            |
+===================================+=======================+
| Parse .pexe file                  | Parse .pexe file      |
+-----------------------------------+-----------------------+
| Loop nest analysis                |                       |
+-----------------------------------+-----------------------+
| Address mode inference            |                       |
+-----------------------------------+-----------------------+
| Read-modify-write (RMW) transform |                       |
+-----------------------------------+-----------------------+
| Basic liveness analysis           |                       |
+-----------------------------------+-----------------------+
| Load optimization                 |                       |
+-----------------------------------+-----------------------+
|                                   | Phi lowering (simple) |
+-----------------------------------+-----------------------+
| Target lowering                   | Target lowering       |
+-----------------------------------+-----------------------+
| Full liveness analysis            |                       |
+-----------------------------------+-----------------------+
| Register allocation               |                       |
+-----------------------------------+-----------------------+
| Phi lowering (advanced)           |                       |
+-----------------------------------+-----------------------+
| Post-phi register allocation      |                       |
+-----------------------------------+-----------------------+
| Branch optimization               |                       |
+-----------------------------------+-----------------------+
| Code emission                     | Code emission         |
+-----------------------------------+-----------------------+

Goals
=====

Translation speed
-----------------

We'd like to be able to translate a ``.pexe`` file as fast as download speed.
Any faster is in a sense wasted effort.  Download speed varies greatly, but
we'll arbitrarily say 1 MB/sec.  We'll pick the ARM A15 CPU as the example of a
slow machine.  We observe a 3× single-thread performance difference between A15
and a high-end x86 Xeon E5-2690 based workstation, and aggressively assume a
``.pexe`` file could be compressed to 50% on the web server using gzip transport
compression, so we set the translation speed goal to 6 MB/sec on the high-end
Xeon workstation.

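Spelled out, the arithmetic behind the 6 MB/sec figure is::

    1 MB/sec download rate
      × 2   (gzip compresses the .pexe to ~50%, so 1 MB downloaded is 2 MB of bitcode)
      × 3   (Xeon single-thread performance relative to the A15)
      = 6 MB/sec translation rate needed on the Xeon workstation
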
Currently, at the ``-O0`` level, the LLVM-based PNaCl translator translates at
⅒ the target rate.  The ``-O2`` mode takes 3× as long as the ``-O0`` mode.

In other words, Subzero's goal is to improve over LLVM's translation speed by
10×.

Code quality
------------

Subzero's initial goal is to produce code that meets or exceeds LLVM's ``-O0``
code quality.  The stretch goal is to approach LLVM ``-O2`` code quality.  On
average, LLVM ``-O2`` performs twice as well as LLVM ``-O0``.

It's important to note that the quality of Subzero-generated code depends on
target-neutral optimizations and simplifications being run beforehand in the
developer environment.  The ``.pexe`` file reflects these optimizations.  For
example, Subzero assumes that the basic blocks are ordered topologically where
possible (which makes liveness analysis converge fastest), and Subzero does not
do any function inlining because it should already have been done.

Translator size
---------------

The current LLVM-based translator binary (``pnacl-llc``) is about 10 MB in size.
We think 1 MB is a more reasonable size -- especially for a component that is
distributed to a billion Chrome users.  Thus we target a 10× reduction in
binary size.

For development, Subzero can be built for all target architectures, with all
debugging and diagnostic options enabled.  For a smaller translator, we restrict
to a single target architecture, and define a ``MINIMAL`` build where
unnecessary features are compiled out.

Subzero leverages some data structures from LLVM's ``ADT`` and ``Support``
include directories, which have little impact on translator size.  It also uses
some of LLVM's bitcode decoding code (for binary-format ``.pexe`` files), again
with little size impact.  In non-``MINIMAL`` builds, the translator size is much
larger due to including code for parsing text-format bitcode files and forming
LLVM IR.

Memory footprint
----------------

The current LLVM-based translator suffers from an issue in which some
function-specific data has to be retained in memory until all translation
completes, and therefore the memory footprint grows without bound.  Large
``.pexe`` files can lead to the translator process holding hundreds of MB of
memory by the end.  The translator runs in a separate process, so this memory
growth doesn't *directly* affect other processes, but it does dirty the physical
memory and contributes to a perception of bloat and sometimes a reality of
out-of-memory tab killing, especially noticeable on weaker systems.

Subzero should maintain a stable memory footprint throughout translation.  It's
not practical to set a specific limit, since there is no real limit on a single
function's size, but the footprint should be "reasonable" and be proportional to
the largest input function size, not the total ``.pexe`` file size.  Simply put,
Subzero should not have memory leaks or inexorable memory growth.  (We use ASAN
builds to test for leaks.)

Multithreaded translation
-------------------------

It should be practical to translate different functions concurrently and see
good scalability.  Some locking may be needed, such as when accessing output
buffers or constant pools, but that should be fairly minimal.  In contrast, LLVM
was only designed for module-level parallelism, and as such, the PNaCl
translator internally splits a ``.pexe`` file into several modules for
concurrent translation.  All output needs to be deterministic regardless of the
level of multithreading, i.e. functions and data should always be output in the
same order.

Target architectures
--------------------

Initial target architectures are x86-32, x86-64, ARM32, and MIPS32.  Future
targets include ARM64 and MIPS64, though these targets lack NaCl support,
including a sandbox model or a validator.

The first implementation is for x86-32, because it was expected to be
particularly challenging, and thus more likely to draw out any design problems
early:

- There are a number of special cases, asymmetries, and warts in the x86
  instruction set.

- Complex addressing modes may be leveraged for better code quality.

- 64-bit integer operations have to be lowered into longer sequences of 32-bit
  operations.

- Paucity of physical registers may reveal code quality issues early in the
  design.

Detailed design
===============

Intermediate representation - ICE
---------------------------------

Subzero's IR is called ICE.  It is designed to be reasonably similar to LLVM's
IR, which is reflected in the ``.pexe`` file's bitcode structure.  It has a
representation of global variables and initializers, and a set of functions.
Each function contains a list of basic blocks, and each basic block contains a
list of instructions.  Instructions that operate on stack and register variables
do so using static single assignment (SSA) form.

The ``.pexe`` file is translated one function at a time (or in parallel by
multiple translation threads).  The recipe for optimization passes depends on
the specific target and optimization level, and is described in detail below.
Global variables (types and initializers) are simply and directly translated to
object code, without any meaningful attempts at optimization.

A function's control flow graph (CFG) is represented by the ``Ice::Cfg`` class.
Its key contents include:

- A list of ``CfgNode`` pointers, generally held in topological order.

- A list of ``Variable`` pointers corresponding to local variables used in the
  function plus compiler-generated temporaries.

A basic block is represented by the ``Ice::CfgNode`` class.  Its key contents
include:

- A linear list of instructions, in the same style as LLVM.  The last
  instruction of the list is always a terminator instruction: branch, switch,
  return, unreachable.

- A list of Phi instructions, also in the same style as LLVM.  They are held as
  a linear list for convenience, though per Phi semantics, they are executed "in
  parallel" without dependencies on each other.

- An unordered list of ``CfgNode`` pointers corresponding to incoming edges, and
  another list for outgoing edges.

- The node's unique, 0-based index into the CFG's node list.

An instruction is represented by the ``Ice::Inst`` class.  Its key contents
include:

- A list of source operands.

- Its destination variable, if the instruction produces a result in an
  ``Ice::Variable``.

- A bitvector indicating which variables' live ranges this instruction ends.
  This is computed during liveness analysis.

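To make the shape of these data structures concrete, here is a simplified C++
sketch of the relationships just described.  The member names are invented for
illustration and are not Subzero's actual declarations::

    #include <cstdint>
    #include <vector>

    // Illustrative stand-ins for the Cfg / CfgNode / Inst / Variable / Operand
    // relationships described above; member names are made up for this sketch.
    struct Variable;

    struct Operand {
      // Every operand knows which Variables it contains, so passes can walk an
      // instruction's Variables without knowing the concrete operand kind.
      std::vector<Variable *> ContainedVars;
    };

    struct Variable : Operand {
      uint32_t Index;  // unique, 0-based index into the CFG's variable list
    };

    struct Inst {
      std::vector<Operand *> Srcs;        // source operands
      Variable *Dest = nullptr;           // destination, if any
      std::vector<bool> LiveRangesEnded;  // filled in by liveness analysis
      bool Deleted = false;               // instructions are marked, not erased
    };

    struct CfgNode {
      uint32_t Index;                 // unique, 0-based index into the node list
      std::vector<Inst *> Phis;       // phi instructions, executed "in parallel"
      std::vector<Inst *> Insts;      // linear list; last one is a terminator
      std::vector<CfgNode *> InEdges, OutEdges;
    };

    struct Cfg {
      std::vector<CfgNode *> Nodes;       // generally in topological order
      std::vector<Variable *> Variables;  // locals plus generated temporaries
    };

    // The common pattern of iterating over all Variables used by an instruction:
    inline void visitSrcVariables(const Inst &I, void (*Visit)(Variable *)) {
      for (Operand *Src : I.Srcs)
        for (Variable *Var : Src->ContainedVars)
          Visit(Var);
    }
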
Instruction kinds are divided into high-level ICE instructions and low-level
ICE instructions.  High-level instructions consist of the PNaCl/LLVM bitcode
instruction kinds.  Each target architecture implementation extends the
instruction space with its own set of low-level instructions.  Generally,
low-level instructions correspond to individual machine instructions.  The
high-level ICE instruction space includes a few additional instruction kinds
that are not part of LLVM but are generally useful (e.g., an Assignment
instruction), or are useful across targets (e.g., BundleLock and BundleUnlock
instructions for sandboxing).

Specifically, high-level ICE instructions that derive from LLVM (but with PNaCl
ABI restrictions as documented in the `PNaCl Bitcode Reference Manual
<https://developer.chrome.com/native-client/reference/pnacl-bitcode-abi>`_) are
the following:

- Alloca: allocate data on the stack

- Arithmetic: binary operations of the form ``A = B op C``

- Br: conditional or unconditional branch

- Call: function call

- Cast: unary type-conversion operations

- ExtractElement: extract a scalar element from a vector-type value

- Fcmp: floating-point comparison

- Icmp: integer comparison

- IntrinsicCall: call a known intrinsic

- InsertElement: insert a scalar element into a vector-type value

- Load: load a value from memory

- Phi: implement the SSA phi node

- Ret: return from the function

- Select: essentially the C language operation of the form ``X = C ? Y : Z``

- Store: store a value into memory

- Switch: generalized branch to multiple possible locations

- Unreachable: indicate that this portion of the code is unreachable

The additional high-level ICE instructions are the following:

- Assign: a simple ``A=B`` assignment.  This is useful for e.g. lowering Phi
  instructions to non-SSA assignments, before lowering to machine code.

- BundleLock, BundleUnlock.  These are markers used for sandboxing, but are
  common across all targets and so they are elevated to the high-level
  instruction set.

- FakeDef, FakeUse, FakeKill.  These are tools used to preserve consistency in
  liveness analysis, elevated to the high-level because they are used by all
  targets.  They are described in more detail at the end of this section.

- JumpTable: this represents the result of switch optimization analysis, where
  some switch instructions may use jump tables instead of cascading
  compare/branches.

An operand is represented by the ``Ice::Operand`` class.  In high-level ICE, an
operand is either an ``Ice::Constant`` or an ``Ice::Variable``.  Constants
include scalar integer constants, scalar floating point constants, Undef (an
unspecified constant of a particular scalar or vector type), and symbol
constants (essentially addresses of globals).  Note that the PNaCl ABI does not
include vector-type constants besides Undef, and as such, Subzero (so far) has
no reason to represent vector-type constants internally.  A variable represents
a value allocated on the stack (though not including alloca-derived storage).
Among other things, a variable holds its unique, 0-based index into the CFG's
variable list.

Each target can extend the ``Constant`` and ``Variable`` classes for its own
needs.  In addition, the ``Operand`` class may be extended, e.g. to define an
x86 ``MemOperand`` that encodes a base register, an index register, an index
register shift amount, and a constant offset.

Register allocation and liveness analysis are restricted to Variable operands.
Because of the importance of register allocation to code quality, and the
translation-time cost of liveness analysis, Variable operands get some special
treatment in ICE.  Most notably, a frequent pattern in Subzero is to iterate
across all the Variables of an instruction.  An instruction holds a list of
operands, but an operand may contain 0, 1, or more Variables.  As such, the
``Operand`` class specially holds a list of Variables contained within, for
quick access.

A Subzero transformation pass may work by deleting an existing instruction and
replacing it with zero or more new instructions.  Instead of actually deleting
the existing instruction, we generally mark it as deleted and insert the new
instructions right after the deleted instruction.  When printing the IR for
debugging, this is a big help because it makes it much clearer how the
non-deleted instructions came about.

Subzero has a few special instructions to help with liveness analysis
consistency.

- The FakeDef instruction gives a fake definition of some variable.  For
  example, on x86-32, a divide instruction defines both ``%eax`` and ``%edx``
  but an ICE instruction can represent only one destination variable.  This is
  similar for multiply instructions, and for function calls that return a 64-bit
  integer result in the ``%edx:%eax`` pair.  Also, using the ``xor %eax, %eax``
  trick to set ``%eax`` to 0 requires an initial FakeDef of ``%eax``.

- The FakeUse instruction registers a use of a variable, typically to prevent an
  earlier assignment to that variable from being dead-code eliminated.  For
  example, lowering an operation like ``x=cc?y:z`` may be done using x86's
  conditional move (cmov) instruction: ``mov z, x; cmov_cc y, x``.  Without a
  FakeUse of ``x`` between the two instructions, the liveness analysis pass may
  dead-code eliminate the first instruction.

- The FakeKill instruction is added after a call instruction, and is a quick way
  of indicating that caller-save registers are invalidated.

Pexe parsing
------------

Subzero includes an integrated PNaCl bitcode parser for ``.pexe`` files.  It
parses the ``.pexe`` file function by function, ultimately constructing an ICE
CFG for each function.  After a function is parsed, its CFG is handed off to the
translation phase.  The bitcode parser also parses global initializer data and
hands it off to be translated to data sections in the object file.

Subzero has another parsing strategy for testing/debugging.  LLVM libraries can
be used to parse a module into LLVM IR (though very slowly relative to Subzero
native parsing).  Then we iterate across the LLVM IR and construct high-level
ICE, handing off each CFG to the translation phase.

Overview of lowering
--------------------

In general, translation goes like this:

- Parse the next function from the ``.pexe`` file and construct a CFG consisting
  of high-level ICE.

- Do analysis passes and transformation passes on the high-level ICE, as
  desired.

- Lower each high-level ICE instruction into a sequence of zero or more
  low-level ICE instructions.  Each high-level instruction is generally lowered
  independently, though the target lowering is allowed to look ahead in the
  CfgNode's instruction list if desired.

- Do more analysis and transformation passes on the low-level ICE, as desired.

- Assemble the low-level CFG into an ELF object file (alternatively, a textual
  assembly file that is later assembled by some external tool).

- Repeat for all functions, and also produce object code for data such as global
  initializers and internal constant pools.

Currently there are two optimization levels: ``O2`` and ``Om1``.  For ``O2``,
the intention is to apply all available optimizations to get the best code
quality (though the initial code quality goal is measured against LLVM's ``O0``
code quality).  For ``Om1``, the intention is to apply as few optimizations as
possible and produce code as quickly as possible, accepting poor code quality.
``Om1`` is short for "O-minus-one", i.e. "worse than O0", or in other words,
"sub-zero".

High-level debuggability of generated code is so far not a design requirement.
Subzero doesn't really do transformations that would obfuscate debugging; the
main thing might be that register allocation (including stack slot coalescing
for stack-allocated variables whose live ranges don't overlap) may render a
variable's value unobtainable after its live range ends.  This would not be an
issue for ``Om1`` since it doesn't register-allocate program-level variables,
nor does it coalesce stack slots.  That said, fully supporting debuggability
would require a few additions:

- DWARF support would need to be added to Subzero's ELF file emitter.  Subzero
  propagates global symbol names, local variable names, and function-internal
  label names that are present in the ``.pexe`` file.  This would allow a
  debugger to map addresses back to symbols in the ``.pexe`` file.

- To map ``.pexe`` file symbols back to meaningful source-level symbol names,
  file names, line numbers, etc., Subzero would need to handle `LLVM bitcode
  metadata <http://llvm.org/docs/LangRef.html#metadata>`_ and ``llvm.dbg``
  `intrinsics <http://llvm.org/docs/LangRef.html#dbg-intrinsics>`_.

- The PNaCl toolchain explicitly strips all this from the ``.pexe`` file, and so
  the toolchain would need to be modified to preserve it.

Our experience so far is that ``Om1`` translates twice as fast as ``O2``, but
produces code with one third the code quality.  ``Om1`` is good for testing and
debugging -- during translation, it tends to expose errors in the basic lowering
that might otherwise have been hidden by the register allocator or other
optimization passes.  It also helps determine whether a code correctness problem
is a fundamental problem in the basic lowering, or an error in another
optimization pass.

The implementation of target lowering also controls the recipe of passes used
for ``Om1`` and ``O2`` translation.  For example, address mode inference may
only be relevant for x86.

Lowering strategy
-----------------

The core of Subzero's lowering from high-level ICE to low-level ICE is to lower
each high-level instruction down to a sequence of low-level target-specific
instructions, in a largely context-free setting.  That is, each high-level
instruction conceptually has a simple template expansion into low-level
instructions, and lowering can in theory be done in any order.  This may sound
like a small effort, but quite a large number of templates may be needed because
of the number of PNaCl types and instruction variants.  Furthermore, there may
be optimized templates, e.g. to take advantage of operator commutativity (for
example, ``x=x+1`` might allow a better lowering than ``x=1+x``).  This is
similar to other template-based approaches in fast code generation or
interpretation, though some decisions are deferred until after some global
analysis passes, mostly related to register allocation, stack slot assignment,
and specific choice of instruction variant and addressing mode.

The key idea for a lowering template is to produce valid low-level instructions
that are guaranteed to meet address mode and other structural requirements of
the instruction set.  For example, on x86, the source operand of an integer
store instruction must be an immediate or a physical register; a shift
instruction's shift amount must be an immediate or in register ``%cl``; a
function's integer return value is in ``%eax``; most x86 instructions are
two-operand, in contrast to corresponding three-operand high-level instructions;
etc.

Because target lowering runs before register allocation, there is no way to know
whether a given ``Ice::Variable`` operand lives on the stack or in a physical
register.  When the low-level instruction calls for a physical register operand,
the target lowering can create an infinite-weight Variable.  This tells the
register allocator to assign infinite weight when making decisions, effectively
guaranteeing some physical register.  Variables can also be pre-colored to a
specific physical register (``cl`` in the shift example above), which also gives
infinite weight.

To illustrate, consider a high-level arithmetic instruction on 32-bit integer
operands::

    A = B + C

X86 target lowering might produce the following::

    T.inf = B  // mov instruction
    T.inf += C // add instruction
    A = T.inf  // mov instruction

Here, ``T.inf`` is an infinite-weight temporary.  As long as ``T.inf`` has a
physical register, the three lowered instructions are all encodable regardless
of whether ``B`` and ``C`` are physical registers, memory, or immediates, and
whether ``A`` is a physical register or in memory.

In this example, ``A`` must be a Variable and one may be tempted to simplify the
lowering sequence by setting ``A`` as infinite-weight and using::

    A = B  // mov instruction
    A += C // add instruction

This has two problems.  First, if the original instruction was actually ``A =
B + A``, the result would be incorrect.  Second, assigning ``A`` a physical
register applies throughout ``A``'s entire live range.  This is probably not
what is intended, and may ultimately lead to a failure to allocate a register
for an infinite-weight variable.

This style of lowering leads to many temporaries being generated, so in ``O2``
mode, we rely on the register allocator to clean things up.  In the example
above, if ``B`` ends up getting a physical register and its live range ends at
this instruction, the register allocator is likely to reuse that register for
``T.inf``.  This leads to ``T.inf=B`` being a redundant register copy, which is
removed as an emission-time peephole optimization.

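As a rough illustration of how such a template might be written, the following
sketch emits the three-instruction sequence for a 32-bit add.  The types and
helper names are invented for this example; Subzero's actual lowering code has
a different interface, but the emitted shape is the same::

    #include <string>
    #include <vector>

    // Toy model of the context-free lowering template for "A = B + C".
    struct Var { std::string Name; bool InfiniteWeight; };
    struct LoweredInst { std::string Text; };

    struct LoweringContext {
      std::vector<LoweredInst> Out;
      int NextTemp = 0;

      Var makeInfiniteWeightTemp() {
        // The register allocator is obliged to give this Variable a register.
        return Var{"T" + std::to_string(NextTemp++) + ".inf", true};
      }
      void emit(const std::string &Text) { Out.push_back({Text}); }
    };

    // Lower "Dest = Src0 + Src1" without knowing whether any of the operands
    // will end up in registers, on the stack, or as immediates.
    void lowerAdd32(LoweringContext &Ctx, const std::string &Dest,
                    const std::string &Src0, const std::string &Src1) {
      Var T = Ctx.makeInfiniteWeightTemp();
      Ctx.emit("mov " + T.Name + ", " + Src0);   // T.inf = B
      Ctx.emit("add " + T.Name + ", " + Src1);   // T.inf += C
      Ctx.emit("mov " + Dest + ", " + T.Name);   // A = T.inf
      // If the register allocator later gives Dest's or Src0's register to T,
      // the redundant moves are removed by the emission-time peephole.
    }

The point is that the template never needs to know where ``B``, ``C``, or ``A``
actually live; the infinite-weight temporary absorbs that uncertainty.
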
O2 lowering
-----------

Currently, the ``O2`` lowering recipe is the following:

- Loop nest analysis

- Address mode inference

- Read-modify-write (RMW) transformation

- Basic liveness analysis

- Load optimization

- Target lowering

- Full liveness analysis

- Register allocation

- Phi instruction lowering (advanced)

- Post-phi lowering register allocation

- Branch optimization

These passes are described in more detail below.

Om1 lowering
------------

Currently, the ``Om1`` lowering recipe is the following:

- Phi instruction lowering (simple)

- Target lowering

- Register allocation (infinite-weight and pre-colored only)

Optimization passes
-------------------

Liveness analysis
^^^^^^^^^^^^^^^^^

Liveness analysis is a standard dataflow analysis, implemented as follows.
For each node (basic block), its live-out set is computed as the union of the
live-in sets of its successor nodes.  Then the node's instructions are processed
in reverse order, updating the live set, until the beginning of the node is
reached, and the node's live-in set is recorded.  If this iteration has changed
the node's live-in set, the node's predecessors are marked for reprocessing.
This continues until no more nodes need reprocessing.  If nodes are processed in
reverse topological order, the number of iterations over the CFG is generally
equal to the maximum loop nest depth.

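The following self-contained sketch shows the shape of that iteration, using
plain ``std::vector<bool>`` bit vectors and integer variable indices; Subzero's
actual implementation uses compacted per-block vectors and other refinements
described below::

    #include <cstddef>
    #include <vector>

    struct LiveInst {
      std::vector<std::size_t> Srcs;   // variable indices read
      std::vector<std::size_t> Dests;  // variable indices written
    };
    struct LiveNode {
      std::vector<LiveInst> Insts;            // in forward order
      std::vector<std::size_t> Preds, Succs;  // node indices
      std::vector<bool> LiveIn, LiveOut;
    };

    void computeLiveness(std::vector<LiveNode> &Nodes, std::size_t NumVars) {
      for (auto &N : Nodes) {
        N.LiveIn.assign(NumVars, false);
        N.LiveOut.assign(NumVars, false);
      }
      // Process nodes in reverse index order, which approximates reverse
      // topological order given the CFG's topologically sorted node list.
      std::vector<bool> NeedsWork(Nodes.size(), true);
      bool Changed = true;
      while (Changed) {
        Changed = false;
        for (std::size_t I = Nodes.size(); I-- > 0;) {
          if (!NeedsWork[I])
            continue;
          NeedsWork[I] = false;
          LiveNode &N = Nodes[I];
          // LiveOut = union of successors' LiveIn.
          std::vector<bool> Live(NumVars, false);
          for (std::size_t S : N.Succs)
            for (std::size_t V = 0; V != NumVars; ++V)
              if (Nodes[S].LiveIn[V]) Live[V] = true;
          N.LiveOut = Live;
          // Walk instructions backward: kill destinations, then gen sources.
          for (std::size_t J = N.Insts.size(); J-- > 0;) {
            for (std::size_t D : N.Insts[J].Dests) Live[D] = false;
            for (std::size_t S : N.Insts[J].Srcs) Live[S] = true;
          }
          if (Live != N.LiveIn) {
            N.LiveIn = Live;
            for (std::size_t P : N.Preds) NeedsWork[P] = true;  // reprocess
            Changed = true;
          }
        }
      }
    }
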
To implement this, each node records its live-in and live-out sets, initialized
to the empty set.  Each instruction records which of its Variables' live ranges
end in that instruction, initialized to the empty set.  A side effect of
liveness analysis is dead instruction elimination.  Each instruction can be
marked as tentatively dead, and after the algorithm converges, the tentatively
dead instructions are permanently deleted.

Optionally, after this liveness analysis completes, we can do live range
construction, in which we calculate the live range of each variable in terms of
instruction numbers.  A live range is represented as a union of segments, where
the segment endpoints are instruction numbers.  Instruction numbers are required
to be unique across the CFG, and monotonically increasing within a basic block.
As a union of segments, live ranges can contain "gaps" and are therefore
precise.  Because of SSA properties, a variable's live range can start at most
once in a basic block, and can end at most once in a basic block.  Liveness
analysis keeps track of which variable/instruction tuples begin live ranges and
end live ranges, and combined with live-in and live-out sets, we can efficiently
build up live ranges of all variables across all basic blocks.

A lot of care is taken to try to make liveness analysis fast and efficient.
Because of the lowering strategy, the number of variables is generally
proportional to the number of instructions, leading to an O(N^2) complexity
algorithm if implemented naively.  To improve things based on sparsity, we note
that most variables are "local" and referenced in at most one basic block (in
contrast to the "global" variables with multi-block usage), and therefore cannot
be live across basic blocks.  Therefore, the live-in and live-out sets,
typically represented as bit vectors, can be limited to the set of global
variables, and the intra-block liveness bit vector can be compacted to hold the
global variables plus the local variables for that block.

Register allocation
^^^^^^^^^^^^^^^^^^^

Subzero implements a simple linear-scan register allocator, based on the
allocator described by Hanspeter Mössenböck and Michael Pfeiffer in `Linear Scan
Register Allocation in the Context of SSA Form and Register Constraints
<ftp://ftp.ssw.uni-linz.ac.at/pub/Papers/Moe02.PDF>`_.  This allocator has
several nice features:

- Live ranges are represented as unions of segments, as described above, rather
  than a single start/end tuple.

- It allows pre-coloring of variables with specific physical registers.

- It applies equally well to pre-lowered Phi instructions.

The paper suggests an approach of aggressively coalescing variables across Phi
instructions (i.e., trying to force Phi source and destination variables to have
the same register assignment), but we reject that in favor of the more natural
preference mechanism described below.

We enhance the algorithm in the paper with automatic inference of register
preference, and with the ability to let overlapping live ranges safely share
the same register in certain circumstances.  If we are considering register
allocation for variable ``A``, and ``A`` has a single defining instruction
``A=B+C``, then the preferred register for ``A``, if available, would be the
register assigned to ``B`` or ``C``, if any, provided that ``B`` or ``C``'s
live range does not overlap ``A``'s live range.  In this way we infer a good
register preference for ``A``.

We allow overlapping live ranges to get the same register in certain cases.
Suppose a high-level instruction like::

    A = unary_op(B)

has been target-lowered like::

    T.inf = B
    A = unary_op(T.inf)

Further, assume that ``B``'s live range continues beyond this instruction
sequence, and that ``B`` has already been assigned some register.  Normally, we
might want to infer ``B``'s register as a good candidate for ``T.inf``, but it
turns out that ``T.inf`` and ``B``'s live ranges overlap, requiring them to have
different registers.  But ``T.inf`` is just a read-only copy of ``B`` that is
guaranteed to be in a register, so in theory these overlapping live ranges could
safely have the same register.  Our implementation allows this overlap as long
as ``T.inf`` is never modified within ``B``'s live range, and ``B`` is never
modified within ``T.inf``'s live range.

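A compact sketch of the preference-inference check is shown below; the
segment-based live ranges make the overlap test straightforward, and the
overlap-sharing rule above builds on the same machinery plus a check that
neither variable is modified inside the other's range.  The types and names
here are illustrative, not Subzero's actual classes::

    #include <cstdint>
    #include <initializer_list>
    #include <vector>

    struct Segment { int32_t Begin, End; };          // [Begin, End) interval
    struct LiveRange { std::vector<Segment> Segs; };  // union of segments

    inline bool overlaps(const LiveRange &A, const LiveRange &B) {
      for (const Segment &X : A.Segs)
        for (const Segment &Y : B.Segs)
          if (X.Begin < Y.End && Y.Begin < X.End)
            return true;
      return false;
    }

    struct RAVariable {
      LiveRange Range;
      int32_t Reg = -1;                           // assigned register, or -1
      const RAVariable *SingleDefSrc0 = nullptr;  // e.g. B in "A = B + C"
      const RAVariable *SingleDefSrc1 = nullptr;  // e.g. C in "A = B + C"
    };

    // Preference: if A's one defining instruction is "A = B op C" and B (or C)
    // already has a register whose live range doesn't overlap A's, prefer it.
    inline int32_t inferPreference(const RAVariable &A) {
      for (const RAVariable *Src : {A.SingleDefSrc0, A.SingleDefSrc1})
        if (Src && Src->Reg >= 0 && !overlaps(Src->Range, A.Range))
          return Src->Reg;
      return -1;  // no usable preference
    }
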
Subzero's register allocator can be run in 3 configurations.

- Normal mode.  All Variables are considered for register allocation.  It
  requires full liveness analysis and live range construction as a prerequisite.
  This is used by ``O2`` lowering.

- Minimal mode.  Only infinite-weight or pre-colored Variables are considered.
  All other Variables are stack-allocated.  It does not require liveness
  analysis; instead, it quickly scans the instructions and records first
  definitions and last uses of all relevant Variables, using that to construct a
  single-segment live range.  Although this includes most of the Variables, the
  live ranges are mostly simple, short, and rarely overlapping, which the
  register allocator handles efficiently.  This is used by ``Om1`` lowering.

- Post-phi lowering mode.  Advanced phi lowering is done after normal-mode
  register allocation, and may result in new infinite-weight Variables that need
  registers.  One would like to just run something like minimal mode to assign
  registers to the new Variables while respecting existing register allocation
  decisions.  However, it sometimes happens that there are no free registers.
  In this case, some register needs to be forcibly spilled to the stack and
  temporarily reassigned to the new Variable, and reloaded at the end of the new
  Variable's live range.  The register must be one that has no explicit
  references during the Variable's live range.  Since Subzero currently doesn't
  track def/use chains (though it does record the CfgNode where a Variable is
  defined), we just do a brute-force search across the CfgNode's instruction
  list for the instruction numbers of interest.  This situation happens very
  rarely, so there's little point for now in improving its performance.

The basic linear-scan algorithm may, as it proceeds, rescind an early register
allocation decision, leaving that Variable to be stack-allocated.  Sometimes it
turns out that the Variable could have been given a different register without
conflict, but by this time it's too late.  The literature recognizes this
situation and describes "second-chance bin-packing", which Subzero can do.  We
can rerun the register allocator in a mode that respects existing register
allocation decisions, and sometimes it finds new non-conflicting opportunities.
In fact, we can repeatedly run the register allocator until convergence.
Unfortunately, in the current implementation, these subsequent register
allocation passes end up being extremely expensive.  This is because of the
treatment of the "unhandled pre-colored" Variable set, which is normally very
small but ends up being quite large on subsequent passes.  Its performance can
probably be made acceptable with a better choice of data structures, but for now
this second-chance mechanism is disabled.

Future work is to implement LLVM's `Greedy
<http://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html>`_
register allocator as a replacement for the basic linear-scan algorithm, given
LLVM's experience with its improvement in code quality.  (The blog post claims
that the Greedy allocator also improved maintainability because a lot of hacks
could be removed, but Subzero is probably not yet to that level of hacks, and is
less likely to see that particular benefit.)

Basic phi lowering
^^^^^^^^^^^^^^^^^^

The simplest phi lowering strategy works as follows (this is how LLVM ``-O0``
implements it).  Consider this example::

    L1:
      ...
      br L3
    L2:
      ...
      br L3
    L3:
      A = phi [B, L1], [C, L2]
      X = phi [Y, L1], [Z, L2]

For each destination of a phi instruction, we can create a temporary and insert
the temporary's assignment at the end of the predecessor block::

    L1:
      ...
      A' = B
      X' = Y
      br L3
    L2:
      ...
      A' = C
      X' = Z
      br L3
    L3:
      A = A'
      X = X'

This transformation is very simple and reliable.  It can be done before target
lowering and register allocation, and it easily avoids the classic lost-copy and
related problems.  ``Om1`` lowering uses this strategy.

However, it has the disadvantage of initializing temporaries even for branches
not taken, though that could be mitigated by splitting non-critical edges and
putting assignments in the edge-split nodes.  Another problem is that without
extra machinery, the assignments to ``A``, ``A'``, ``X``, and ``X'`` are given a
specific ordering even though phi semantics are that the assignments are
parallel or unordered.  This sometimes imposes false live range overlaps and
leads to poorer register allocation.

Advanced phi lowering
^^^^^^^^^^^^^^^^^^^^^

``O2`` lowering defers phi lowering until after register allocation to avoid the
problem of false live range overlaps.  It works as follows.  We split each
incoming edge and move the (parallel) phi assignments into the split nodes.  We
linearize each set of assignments by finding a safe, topological ordering of the
assignments, respecting register assignments as well.  For example::

    A = B
    X = Y

Normally these assignments could be executed in either order, but if ``B`` and
``X`` are assigned the same physical register, we would want to use the above
ordering.  Dependency cycles are broken by introducing a temporary.  For
example::

    A = B
    B = A

Here, a temporary breaks the cycle::

    t = A
    A = B
    B = t

Finally, we use the existing target lowering to lower the assignments in this
basic block, and once that is done for all basic blocks, we run the post-phi
variant of register allocation on the edge-split basic blocks.

When computing a topological order, we try to first schedule assignments whose
source has a physical register, and last schedule assignments whose destination
has a physical register.  This helps reduce register pressure.

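The linearization itself follows the standard parallel-move scheme: emit any
assignment whose destination is not still needed as a source, and when only
cycles remain, save one destination in a temporary exactly as in the example
above.  A self-contained sketch (ignoring the register-based scheduling
heuristic) might look like this::

    #include <cstddef>
    #include <string>
    #include <vector>

    struct Move { std::string Dest, Src; };

    std::vector<Move> linearize(std::vector<Move> Pending) {
      std::vector<Move> Out;
      int TempCount = 0;
      while (!Pending.empty()) {
        bool Progress = false;
        for (std::size_t I = 0; I < Pending.size(); ++I) {
          bool DestIsRead = false;
          for (std::size_t J = 0; J < Pending.size(); ++J)
            if (J != I && Pending[J].Src == Pending[I].Dest)
              DestIsRead = true;
          if (!DestIsRead) {   // safe: nothing still needs the old value
            Out.push_back(Pending[I]);
            Pending.erase(Pending.begin() + I);
            Progress = true;
            break;
          }
        }
        if (!Progress) {       // only cycles remain; break one with a temp
          std::string Temp = "t" + std::to_string(TempCount++);
          Out.push_back({Temp, Pending[0].Dest});  // save the clobbered value
          for (std::size_t J = 1; J < Pending.size(); ++J)
            if (Pending[J].Src == Pending[0].Dest)
              Pending[J].Src = Temp;               // readers use the saved copy
        }
      }
      return Out;
    }

For the ``A = B; B = A`` cycle this produces exactly the ``t = A; A = B; B = t``
sequence shown above.
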
X86 address mode inference
^^^^^^^^^^^^^^^^^^^^^^^^^^

We try to take advantage of the x86 addressing mode that includes a base
register, an index register, an index register scale amount, and an immediate
offset.  We do this through simple pattern matching.  Starting with a load or
store instruction where the address is a variable, we initialize the base
register to that variable, and look up the instruction where that variable is
defined.  If that is an add instruction of two variables and the index register
hasn't been set, we replace the base and index register with those two
variables.  If instead it is an add instruction of a variable and a constant, we
replace the base register with the variable and add the constant to the
immediate offset.

There are several more patterns that can be matched.  This pattern matching
continues on the load or store instruction until no more matches are found.
Because a program typically has few load and store instructions (not to be
confused with instructions that manipulate stack variables), this address mode
inference pass is fast.

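In outline, the matching loop looks something like the following sketch, which
folds an add-of-variables into base+index and an add-of-constant into the
offset.  The operand and instruction types are stand-ins rather than Subzero's
actual classes::

    #include <cstdint>

    struct AddrVar;
    struct DefInst {
      enum Kind { AddVars, AddVarConst, Other } K = Other;
      AddrVar *Src0 = nullptr;   // for AddVars / AddVarConst
      AddrVar *Src1 = nullptr;   // for AddVars
      int32_t Imm = 0;           // for AddVarConst
    };
    struct AddrVar { DefInst *Def = nullptr; };  // defining instruction, if known

    struct MemOperand {
      AddrVar *Base = nullptr;
      AddrVar *Index = nullptr;
      int32_t Shift = 0;         // index is scaled by 1 << Shift
      int32_t Offset = 0;
    };

    // Starting from "load/store through Addr", repeatedly fold the defining
    // instruction of the base variable into the addressing mode.
    MemOperand inferAddressMode(AddrVar *Addr) {
      MemOperand M;
      M.Base = Addr;
      bool Changed = true;
      while (Changed) {
        Changed = false;
        DefInst *Def = M.Base ? M.Base->Def : nullptr;
        if (!Def)
          break;
        if (Def->K == DefInst::AddVars && M.Index == nullptr) {
          M.Base = Def->Src0;    // base := one addend
          M.Index = Def->Src1;   // index := the other addend (scale 1)
          Changed = true;
        } else if (Def->K == DefInst::AddVarConst) {
          M.Base = Def->Src0;    // fold the constant into the offset
          M.Offset += Def->Imm;
          Changed = true;
        }
        // The real pass matches several more patterns (e.g. folding a shift
        // into the index scale); they follow the same shape.
      }
      return M;
    }
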
X86 read-modify-write inference
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A reasonably common bitcode pattern is a non-atomic update of a memory
location::

    x = load addr
    y = add x, 1
    store y, addr

On x86, with good register allocation, the Subzero passes described above
generate code of only this quality::

    mov [%ebx], %eax
    add $1, %eax
    mov %eax, [%ebx]

However, x86 allows for this kind of code::

    add $1, [%ebx]

which requires fewer instructions, but perhaps more importantly, requires fewer
physical registers.

It's also important to note that this transformation only makes sense if the
store instruction ends ``x``'s live range.

Subzero's ``O2`` recipe includes an early pass to find read-modify-write (RMW)
opportunities via simple pattern matching.  The only problem is that it is run
before liveness analysis, which is needed to determine whether ``x``'s live
range ends after the RMW.  Since liveness analysis is one of the most expensive
passes, it's not attractive to run it an extra time just for RMW analysis.
Instead, we essentially generate both the RMW and the non-RMW versions, and then
during lowering, the RMW version deletes itself if it finds ``x`` still live.

X86 compare-branch inference
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the LLVM instruction set, the compare/branch pattern works like this::

    cond = icmp eq a, b
    br cond, target

The result of the icmp instruction is a single bit, and a conditional branch
tests that bit.  By contrast, most target architectures use this pattern::

    cmp a, b  // implicitly sets various bits of FLAGS register
    br eq, target  // branch on a particular FLAGS bit

A naive lowering sequence conditionally sets ``cond`` to 0 or 1, then tests
``cond`` and conditionally branches.  Subzero has a pass that identifies
boolean-based operations like this and folds them into a single
compare/branch-like operation.  It is set up for more than just cmp/br, though.
Boolean producers include icmp (integer compare), fcmp (floating-point compare),
and trunc (integer truncation when the destination has bool type).  Boolean
consumers include branch, select (the ternary operator from the C language), and
sign-extend and zero-extend when the source has bool type.

Sandboxing
^^^^^^^^^^

Native Client's sandbox model uses software fault isolation (SFI) to provide
safety when running untrusted code in a browser or other environment.  Subzero
implements Native Client's `sandboxing
<https://developer.chrome.com/native-client/reference/sandbox_internals/index>`_
to enable Subzero-translated executables to be run inside Chrome.  Subzero also
provides a fairly simple framework for investigating alternative sandbox models
or other restrictions on the sandbox model.

Sandboxing in Subzero is not actually implemented as a separate pass, but is
integrated into lowering and assembly.

- Indirect branches, including the ret instruction, are masked to a bundle
  boundary and bundle-locked.

- Call instructions are aligned to the end of the bundle so that the return
  address is bundle-aligned.

- Indirect branch targets, including function entry and targets in a switch
  statement jump table, are bundle-aligned.

- The intrinsic for reading the thread pointer is inlined appropriately.

- For x86-64, non-stack memory accesses are with respect to the reserved sandbox
  base register.  We reduce the aggressiveness of address mode inference to
  leave room for the sandbox base register during lowering.  There are no memory
  sandboxing changes for x86-32.

Code emission
-------------

Subzero's integrated assembler is derived from Dart's `assembler code
<https://github.com/dart-lang/sdk/tree/master/runtime/vm>`_.  There is a pass
that iterates through the low-level ICE instructions and invokes the relevant
assembler functions.  Placeholders are added for later fixup of branch target
offsets.  (Backward branches use short offsets if possible; forward branches
generally use long offsets unless it is an intra-block branch of "known" short
length.)  The assembler emits into a staging buffer.  Once emission into the
staging buffer for a function is complete, the data is emitted to the output
file as an ELF object file, and metadata such as relocations, the symbol table,
and the string table are accumulated for emission at the end.  Global data
initializers are emitted similarly.  A key point is that at this point, the
staging buffer can be deallocated, and only a minimum of data needs to be held
until the end.

As a debugging alternative, Subzero can emit textual assembly code which can
then be run through an external assembler.  This is of course super slow, but
quite valuable when bringing up a new target.

As another debugging option, the staging buffer can be emitted as textual
assembly, primarily in the form of ".byte" lines.  This allows the assembler to
be tested separately from the ELF-related code.

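The placeholder-and-fixup idea for branch targets can be sketched as follows:
emit a zero displacement, remember where it went in the staging buffer, and
patch it once the target's offset is known.  This is an illustrative
stand-alone example, not Subzero's assembler interface::

    #include <cstdint>
    #include <vector>

    class StagingBuffer {
    public:
      struct Fixup { std::size_t Pos; };   // where a rel32 needs patching

      std::size_t size() const { return Bytes.size(); }
      void emit8(uint8_t B) { Bytes.push_back(B); }

      // Emit "jmp rel32" with a placeholder displacement; return the fixup.
      Fixup emitJmpPlaceholder() {
        emit8(0xE9);                        // x86 jmp rel32 opcode
        Fixup F{Bytes.size()};
        for (int I = 0; I < 4; ++I) emit8(0);   // placeholder displacement
        return F;
      }

      // Once the target label is bound at TargetOffset, patch the displacement.
      void patch(Fixup F, std::size_t TargetOffset) {
        int32_t Rel = static_cast<int32_t>(TargetOffset) -
                      static_cast<int32_t>(F.Pos + 4);  // relative to next insn
        for (int I = 0; I < 4; ++I)
          Bytes[F.Pos + I] = static_cast<uint8_t>((Rel >> (8 * I)) & 0xFF);
      }

    private:
      std::vector<uint8_t> Bytes;
    };
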
|   912 Memory management |  | 
|   913 ----------------- |  | 
|   914  |  | 
|   915 Where possible, we allocate from a ``CfgLocalAllocator`` which derives from |  | 
|   916 LLVM's ``BumpPtrAllocator``.  This is an arena-style allocator where objects |  | 
|   917 allocated from the arena are never actually freed; instead, when the CFG |  | 
|   918 translation completes and the CFG is deleted, the entire arena memory is |  | 
|   919 reclaimed at once.  This style of allocation works well in an environment like a |  | 
|   920 compiler where there are distinct phases with only easily-identifiable objects |  | 
|   921 living across phases.  It frees the developer from having to manage object |  | 
|   922 deletion, and it amortizes deletion costs across just a single arena deletion at |  | 
|   923 the end of the phase.  Furthermore, it helps scalability by allocating entirely |  | 
|   924 from thread-local memory pools, and minimizing global locking of the heap. |  | 

Instructions are probably the most heavily allocated complex class in Subzero.
We represent an instruction list as an intrusive doubly linked list, allocate
all instructions from the ``CfgLocalAllocator``, and make sure each instruction
subclass is basically `POD
<http://en.cppreference.com/w/cpp/concept/PODType>`_ (Plain Old Data) with a
trivial destructor.  This way, when the CFG is finished, we don't need to
individually deallocate every instruction.  We do the same for Variables,
probably the second most heavily allocated complex class.

There are some situations where passes need to use some `STL container class
<http://en.cppreference.com/w/cpp/container>`_.  Subzero has a way of using the
``CfgLocalAllocator`` as the container allocator if this is needed.
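
The following sketch shows the shape of that idea: a minimal STL-compatible
allocator that hands out arena memory and makes ``deallocate`` a no-op.  The
``Arena`` and ``ArenaAllocator`` names are hypothetical stand-ins, not Subzero's
real types::

  #include <cstddef>
  #include <vector>

  // Tiny arena stand-in so the example is self-contained: individual blocks
  // are only released when the arena itself is destroyed.
  struct Arena {
    std::vector<char *> Blocks;
    char *allocate(std::size_t N) {
      char *P = new char[N];
      Blocks.push_back(P);
      return P;
    }
    ~Arena() {
      for (char *P : Blocks)
        delete[] P;
    }
  };

  // Minimal allocator adapter: containers draw from the arena and skip
  // per-element deallocation, matching the CFG-local lifetime model.
  template <class T> struct ArenaAllocator {
    using value_type = T;
    Arena *A;
    explicit ArenaAllocator(Arena *APtr) : A(APtr) {}
    template <class U> ArenaAllocator(const ArenaAllocator<U> &O) : A(O.A) {}
    T *allocate(std::size_t N) {
      return reinterpret_cast<T *>(A->allocate(N * sizeof(T)));
    }
    void deallocate(T *, std::size_t) {}  // no-op; the arena reclaims en masse
    template <class U> bool operator==(const ArenaAllocator<U> &O) const {
      return A == O.A;
    }
    template <class U> bool operator!=(const ArenaAllocator<U> &O) const {
      return A != O.A;
    }
  };

  int main() {
    Arena A;
    ArenaAllocator<int> Alloc(&A);
    std::vector<int, ArenaAllocator<int>> V(Alloc);
    V.push_back(42);  // backing storage comes from, and dies with, the arena
    return 0;
  }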

Multithreaded translation
-------------------------

Subzero is designed to be able to translate functions in parallel.  With the
``-threads=N`` command-line option, there is a 3-stage producer-consumer
pipeline (the emitter's reordering step is sketched after this list):

- A single thread parses the ``.pexe`` file and produces a sequence of work
  units.  A work unit can be either a fully constructed CFG, or a set of global
  initializers.  The work unit includes its sequence number denoting its parse
  order.  Each work unit is added to the translation queue.

- There are N translation threads that draw work units from the translation
  queue and lower them into assembler buffers.  Each assembler buffer is added
  to the emitter queue, tagged with its sequence number.  The CFG and its
  ``CfgLocalAllocator`` are disposed of at this point.

- A single thread draws assembler buffers from the emitter queue and appends to
  the output file.  It uses the sequence numbers to reintegrate the assembler
  buffers according to the original parse order, such that output order is
  always deterministic.
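
The subtle part is the final stage: translation threads finish in arbitrary
order, but the emitter must write functions in parse order.  Below is a minimal
sketch of that reordering logic, with threads and queues elided; the
``OrderedEmitter`` name and the use of ``std::cout`` as the "output file" are
illustrative assumptions::

  #include <cstdint>
  #include <iostream>
  #include <map>
  #include <string>

  // Buffers completed work keyed by sequence number and emits it strictly in
  // parse order, so the output is deterministic regardless of thread timing.
  class OrderedEmitter {
    std::map<uint64_t, std::string> Pending;
    uint64_t NextSeq = 0;  // next sequence number to emit

  public:
    void add(uint64_t Seq, std::string Buffer) {
      Pending.emplace(Seq, std::move(Buffer));
      // Drain everything that is now contiguous with what was already emitted.
      for (auto It = Pending.find(NextSeq); It != Pending.end();
           It = Pending.find(++NextSeq)) {
        std::cout << It->second;  // append to the output file
        Pending.erase(It);
      }
    }
  };

  int main() {
    OrderedEmitter E;
    E.add(1, "function_1\n");  // arrives early, held back
    E.add(0, "function_0\n");  // emits 0, then the buffered 1
    E.add(2, "function_2\n");
    return 0;
  }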

This means that with ``-threads=N``, there are actually ``N+1`` spawned threads
for a total of ``N+2`` execution threads, taking the parser and emitter threads
into account.  For the special case of ``N=0``, execution is entirely sequential
-- the same thread parses, translates, and emits, one function at a time.  This
is useful for performance measurements.

Ideally, we would like to get near-linear scalability as the number of
translation threads increases.  We expect that ``-threads=1`` should be slightly
faster than ``-threads=0``, as the small amount of time spent parsing and
emitting is done largely in parallel with translation.  With perfect
scalability, ``-threads=N`` would translate ``N`` times as fast as
``-threads=1``, up until the point where parsing or emitting becomes the
bottleneck, or ``N+2`` exceeds the number of CPU cores.  In reality, memory
performance would become a bottleneck and efficiency might peak at, say, 75%.

Currently, parsing takes about 11% of total sequential time.  If translation
ever becomes so fast and awesomely scalable that parsing becomes the bottleneck,
it should be possible to make parsing multithreaded as well.

Internally, all shared, mutable data is held in the GlobalContext object, and
access to each field is guarded by a mutex.
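
A sketch of that pattern follows; the field and method names are illustrative
and not GlobalContext's actual interface::

  #include <cstddef>
  #include <mutex>
  #include <string>
  #include <vector>

  // Each piece of shared, mutable state is paired with its own mutex, and all
  // access goes through small locked accessors.
  class GlobalContextSketch {
    std::mutex StringPoolLock;
    std::vector<std::string> StringPool;

  public:
    std::size_t internString(std::string S) {
      std::lock_guard<std::mutex> Guard(StringPoolLock);
      StringPool.push_back(std::move(S));
      return StringPool.size() - 1;
    }
  };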

Security
--------

Subzero includes a number of security features in the generated code, as well as
in the Subzero translator itself, which run on top of the existing Native Client
sandbox as well as Chrome's OS-level sandbox.

Sandboxed translator
^^^^^^^^^^^^^^^^^^^^

When running inside the browser, the Subzero translator executes as sandboxed,
untrusted code that is initially checked by the validator, just like the
LLVM-based ``pnacl-llc`` translator.  As such, the Subzero binary should be no
more or less secure than the translator it replaces, from the point of view of
the Chrome sandbox.  That said, Subzero is much smaller than ``pnacl-llc`` and
was designed from the start with security in mind, so one expects fewer attacker
opportunities here.

Code diversification
^^^^^^^^^^^^^^^^^^^^

`Return-oriented programming
<https://en.wikipedia.org/wiki/Return-oriented_programming>`_ (ROP) is a
now-common technique for turning e.g. a known buffer overflow into a deeper
exploit.  The attacker scans the executable looking for ROP gadgets, which are
short sequences of code that happen to load known values into known registers
and then return.  An attacker who manages to overwrite parts of the stack can
overwrite it with carefully chosen return addresses such that certain ROP
gadgets are effectively chained together to set up the register state as
desired, finally returning to some code that manages to do something nasty based
on those register values.

If there is a popular ``.pexe`` with a large install base, the attacker could
run Subzero on it and scan the executable for suitable ROP gadgets to use as
part of a potential exploit.  Note that if the trusted validator is working
correctly, these ROP gadgets are limited to starting at a bundle boundary and
cannot use the trick of finding a gadget that happens to begin inside another
instruction.  All the same, gadgets with these constraints still exist and the
attacker has access to them.  This is the attack model we focus on most --
protecting the user against misuse of a "trusted" developer's application, as
opposed to mischief from a malicious ``.pexe`` file.

Subzero can mitigate these attacks to some degree through code diversification.
Specifically, we can apply some randomness to the code generation that makes ROP
gadgets less predictable.  This randomness can have some compile-time cost, and
it can affect the code quality; and some diversifications may be more effective
than others.  A more detailed treatment of hardening techniques may be found in
the Matasano report "`Attacking Clientside JIT Compilers
<https://www.nccgroup.trust/globalassets/resources/us/presentations/documents/attacking_clientside_jit_compilers_paper.pdf>`_".

To evaluate diversification effectiveness, we use a third-party ROP gadget
finder and limit its results to bundle-aligned addresses.  For a given
diversification technique, we run it with a number of different random seeds,
find ROP gadgets for each version, and determine how persistent each ROP gadget
is across the different versions.  A gadget is persistent if the same gadget is
found at the same code address.  The best diversifications are ones with low
gadget persistence rates.
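
As a sketch of the metric itself, persistence can be computed by checking which
(address, gadget) pairs from a baseline build survive unchanged in every
differently-seeded build.  This is illustrative measurement tooling, not part of
Subzero::

  #include <cstddef>
  #include <cstdint>
  #include <set>
  #include <utility>
  #include <vector>

  // A gadget is identified by its code address plus a hash of its bytes.
  using Gadget = std::pair<uint32_t, uint64_t>;

  // Fraction of baseline gadgets found, unchanged and at the same address, in
  // every diversified build.  Lower is better.
  double persistenceRate(const std::set<Gadget> &Baseline,
                         const std::vector<std::set<Gadget>> &Diversified) {
    if (Baseline.empty())
      return 0.0;
    std::size_t Persistent = 0;
    for (const Gadget &G : Baseline) {
      bool InAll = true;
      for (const std::set<Gadget> &Run : Diversified)
        InAll = InAll && Run.count(G) != 0;
      if (InAll)
        ++Persistent;
    }
    return double(Persistent) / double(Baseline.size());
  }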

Subzero implements 7 different diversification techniques.  Below is a
discussion of each technique, its effectiveness, and its cost.  The discussions
of cost and effectiveness are for a single diversification technique; the
translation-time costs for multiple techniques are additive, but the effects of
multiple techniques on code quality and effectiveness are not yet known.

In Subzero's implementation, each randomization is "repeatable" in a sense.
Each pass that includes a randomization option gets its own private instance of
a random number generator (RNG).  The RNG is seeded with a combination of a
global seed, the pass ID, and the function's sequence number.  The global seed
is designed to be different across runs (perhaps based on the current time), but
for debugging, the global seed can be set to a specific value and the results
will be repeatable.
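
A sketch of how such a per-pass, per-function seed might be combined (the
function name and the use of ``std::seed_seq``/``std::mt19937_64`` are
illustrative; Subzero's actual RNG wiring differs in detail)::

  #include <cstdint>
  #include <random>

  // Combine the global seed, a pass identifier, and the function's sequence
  // number so that each (pass, function) pair gets its own repeatable stream.
  std::mt19937_64 makePassRNG(uint64_t GlobalSeed, uint64_t PassID,
                              uint64_t FnSequenceNumber) {
    std::seed_seq Seq{GlobalSeed, PassID, FnSequenceNumber};
    return std::mt19937_64(Seq);
  }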

Subzero-generated code is subject to diversification once per translation, and
then Chrome caches the diversified binary for subsequent executions.  An
attacker may attempt to run the binary multiple times hoping for
higher-probability combinations of ROP gadgets.  When the attacker guesses
wrong, a likely outcome is an application crash.  Chrome throttles the creation
of crashy processes, which reduces the likelihood of the attacker eventually
gaining a foothold.

Constant blinding
~~~~~~~~~~~~~~~~~

Here, we prevent attackers from controlling large immediates in the text
(executable) section.  A random cookie is generated for each function, and if
the constant exceeds a specified threshold, the constant is obfuscated with the
cookie and equivalent code is generated.  For example, instead of this x86
instruction::

    mov $0x11223344, <%Reg/Mem>

the following code might be generated::

    mov $(0x11223344+Cookie), %temp
    lea -Cookie(%temp), %temp
    mov %temp, <%Reg/Mem>

The ``lea`` instruction is used rather than e.g. ``add``/``sub`` or ``xor``, to
prevent unintended effects on the flags register.

This transformation has almost no effect on translation time, and about 1%
impact on code quality, depending on the threshold chosen.  It does little to
reduce gadget persistence, but it does remove a lot of potential opportunities
to construct intra-instruction ROP gadgets (which an attacker could use only if
a validator bug were discovered, since the Native Client sandbox and associated
validator force returns and other indirect branches to be to bundle-aligned
addresses).

Constant pooling
~~~~~~~~~~~~~~~~

This is similar to constant blinding, in that large immediates are removed from
the text section.  In this case, each unique constant above the threshold is
stored in a read-only data section and the constant is accessed via a memory
load.  For the above example, the following code might be generated::
    mov $Label$1, %temp
    mov %temp, <%Reg/Mem>

This has a similarly small impact on translation time and ROP gadget
persistence, and a smaller (better) impact on code quality.  This is because it
uses fewer instructions, and in some cases good register allocation leads to no
increase in instruction count.  Note that this still gives an attacker limited
control over some text section values, unless we randomize the constant pool
layout.

Static data reordering
~~~~~~~~~~~~~~~~~~~~~~

This transformation limits the attacker's ability to control bits in global data
address references.  It simply permutes the order in memory of global variables
and internal constant pool entries.  For the constant pool, we only permute
within a type (i.e., emit a randomized list of ints, followed by a randomized
list of floats, etc.) to maintain good packing in the face of alignment
constraints.

As might be expected, this has no impact on code quality, translation time, or
ROP gadget persistence (though as above, it limits opportunities for
intra-instruction ROP gadgets with a broken validator).

Basic block reordering
~~~~~~~~~~~~~~~~~~~~~~

Here, we randomize the order of basic blocks within a function, with the
constraint that we still want to maintain a topological order as much as
possible, to avoid making the code too branchy.

This has no impact on code quality, and about 1% impact on translation time, due
to a separate pass to recompute layout.  It ends up having a huge effect on ROP
gadget persistence, tied for best with nop insertion, reducing persistence to
less than 5%.

Function reordering
~~~~~~~~~~~~~~~~~~~

Here, we permute the order in which functions are emitted, primarily to shift
ROP gadgets around to less predictable locations.  It may also change call
address offsets in case the attacker was trying to control that offset in the
code.

To control latency and memory footprint, we don't arbitrarily permute functions.
Instead, for some relatively small value of N, we queue up N assembler buffers,
then emit those N functions in random order, repeating until all functions are
emitted.
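
The windowed shuffle described above might look like the following sketch; the
``WindowedReorderer`` name, the choice of ``std::string`` for an assembler
buffer, and the flush policy are all illustrative assumptions::

  #include <algorithm>
  #include <cstddef>
  #include <cstdint>
  #include <random>
  #include <string>
  #include <vector>

  // Queue up to N finished functions, then flush them in a random order,
  // bounding both output latency and memory footprint.
  class WindowedReorderer {
    std::size_t N;
    std::mt19937_64 RNG;
    std::vector<std::string> Window;  // queued assembler buffers

    void flush() {
      std::shuffle(Window.begin(), Window.end(), RNG);
      for (const std::string &Buf : Window)
        (void)Buf;  // append Buf to the output file here
      Window.clear();
    }

  public:
    WindowedReorderer(std::size_t N, uint64_t Seed) : N(N), RNG(Seed) {}
    void add(std::string Buffer) {
      Window.push_back(std::move(Buffer));
      if (Window.size() == N)
        flush();
    }
    void finish() { flush(); }  // emit any functions still queued
  };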

Function reordering has no impact on translation time or code quality.
Measurements indicate that it reduces ROP gadget persistence to about 15%.

Nop insertion
~~~~~~~~~~~~~

This diversification randomly adds a nop instruction after each regular
instruction, with some probability.  Nop instructions of different lengths may
be selected.  Nop instructions are never added inside a ``bundle_lock`` region.
Note that when sandboxing is enabled, nop instructions are already being added
for bundle alignment, so the diversification nop instructions may simply be
taking the place of alignment nop instructions, though distributed differently
through the bundle.
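
A sketch of the per-instruction decision follows; the function name, the
probability parameter, and the 1-4 byte nop lengths are illustrative::

  #include <cstddef>
  #include <random>
  #include <vector>

  // NopAfter[i] is the length in bytes of the nop to insert after instruction
  // i (0 means no nop).  Instructions inside bundle_lock regions are skipped.
  std::vector<int> planNops(const std::vector<bool> &InBundleLock, double P,
                            std::mt19937_64 &RNG) {
    std::bernoulli_distribution InsertNop(P);
    std::uniform_int_distribution<int> NopLength(1, 4);
    std::vector<int> NopAfter(InBundleLock.size(), 0);
    for (std::size_t I = 0; I < InBundleLock.size(); ++I)
      if (!InBundleLock[I] && InsertNop(RNG))
        NopAfter[I] = NopLength(RNG);
    return NopAfter;
  }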

In Subzero's current implementation, nop insertion adds 3-5% to the
translation time, but this is probably because it is implemented as a separate
pass that adds actual nop instructions to the IR.  The overhead would probably
be a lot less if it were integrated into the assembler pass.  The code quality
is also reduced by 3-5%, making nop insertion the most expensive of the
diversification techniques.

Nop insertion is very effective in reducing ROP gadget persistence, at the same
level as basic block randomization (less than 5%).  But given nop insertion's
impact on translation time and code quality, one would most likely prefer to use
basic block randomization instead (though the combined effects of the different
diversification techniques have not yet been studied).

Register allocation randomization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In this diversification, the register allocator tries to make different but
mostly functionally equivalent choices, while maintaining stable code quality.

A naive approach would be the following.  Whenever the allocator has more than
one choice for assigning a register, choose randomly among those options.  And
whenever there are no registers available and there is a tie for the
lowest-weight variable, randomly select one of the lowest-weight variables to
evict.  Because of the one-pass nature of the linear-scan algorithm, this
randomization strategy can have a large impact on which variables are ultimately
assigned registers, with a corresponding large impact on code quality.

Instead, we choose an approach that tries to keep code quality stable regardless
of the random seed.  We partition the set of physical registers into equivalence
classes.  If a register is pre-colored in the function (i.e., referenced
explicitly by name), it forms its own equivalence class.  The remaining
registers are partitioned according to their combination of attributes such as
integer versus floating-point, 8-bit versus 32-bit, caller-saved versus
callee-saved, etc.  Each equivalence class is randomly permuted, and the
complete permutation is applied to the final register assignments.
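
A sketch of that permutation step: registers are grouped by an attribute key,
each group is shuffled independently, and the resulting mapping is applied to
the allocator's output.  The string-based attribute encoding is an illustrative
simplification::

  #include <algorithm>
  #include <cstddef>
  #include <map>
  #include <numeric>
  #include <random>
  #include <string>
  #include <vector>

  // Build a permutation of physical register numbers that only maps registers
  // onto other registers in the same equivalence class.  ClassOf[Reg] encodes
  // the attribute combination, e.g. "i32 callee-saved"; a pre-colored register
  // would get a unique key, making it a singleton class (and a fixed point).
  std::vector<int> makeRegisterPermutation(
      const std::vector<std::string> &ClassOf, std::mt19937_64 &RNG) {
    std::map<std::string, std::vector<int>> Classes;
    for (int Reg = 0; Reg < int(ClassOf.size()); ++Reg)
      Classes[ClassOf[Reg]].push_back(Reg);

    std::vector<int> Perm(ClassOf.size());
    std::iota(Perm.begin(), Perm.end(), 0);
    for (auto &Entry : Classes) {
      std::vector<int> Shuffled = Entry.second;
      std::shuffle(Shuffled.begin(), Shuffled.end(), RNG);
      for (std::size_t I = 0; I < Shuffled.size(); ++I)
        Perm[Entry.second[I]] = Shuffled[I];  // stays within the class
    }
    return Perm;
  }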

Register randomization reduces ROP gadget persistence to about 10% on average,
though there tends to be fairly high variance across functions and applications.
This probably has to do with the set of restrictions in the x86-32 instruction
set and ABI, such as few general-purpose registers, ``%eax`` used for return
values, ``%edx`` used for division, ``%cl`` used for shifting, etc.  As
intended, register randomization has no impact on code quality, and only a
slight (0.5%) impact on translation time due to an extra scan over the variables
to identify pre-colored registers.

Fuzzing
^^^^^^^

We have started fuzz-testing the ``.pexe`` files input to Subzero, using a
combination of `afl-fuzz <http://lcamtuf.coredump.cx/afl/>`_, LLVM's `libFuzzer
<http://llvm.org/docs/LibFuzzer.html>`_, and custom tooling.  The purpose is to
find and fix cases where Subzero crashes or otherwise ungracefully fails on
unexpected inputs, and to do so automatically over a large range of unexpected
inputs.  By fixing bugs that arise from fuzz testing, we reduce the possibility
of an attacker exploiting these bugs.

Most of the problems found so far are ones most appropriately handled in the
parser.  However, there have been a couple that have identified problems in the
lowering, or that otherwise inappropriately triggered assertion failures and
fatal errors.  We continue to dig into this area.

Future security work
^^^^^^^^^^^^^^^^^^^^

Subzero is well-positioned to explore other future security enhancements, e.g.:

- Tightening the Native Client sandbox.  ABI changes, such as the previous work
  on `hiding the sandbox base address
  <https://docs.google.com/document/d/1eskaI4353XdsJQFJLRnZzb_YIESQx4gNRzf31dqXVG8>`_
  in x86-64, are easy to experiment with in Subzero.

- Making the executable code section read-only.  This would prevent a PNaCl
  application from inspecting its own binary and trying to find ROP gadgets even
  after code diversification has been performed.  It may still be susceptible to
  `blind ROP <http://www.scs.stanford.edu/brop/bittau-brop.pdf>`_ attacks, but
  security is still improved overall.

- Instruction selection diversification.  It may be possible to lower a given
  instruction in several largely equivalent ways, which gives more opportunities
  for code randomization.
Chrome integration
------------------

Currently Subzero is available in Chrome for the x86-32 architecture, but under
a flag.  When the flag is enabled, Subzero is used when the `manifest file
<https://developer.chrome.com/native-client/reference/nacl-manifest-format>`_
linking to the ``.pexe`` file specifies the ``O0`` optimization level.

The next step is to remove the flag, i.e. invoke Subzero as the only translator
for ``O0``-specified manifest files.

Ultimately, Subzero might produce code rivaling LLVM ``O2`` quality, in which
case Subzero could be used for all PNaCl translation.

Command line options
--------------------

Subzero has a number of command-line options for debugging and diagnostics.
Among the more interesting are the following.

- Using the ``-verbose`` flag, Subzero will dump the CFG, or produce other
  diagnostic output, with various levels of detail after each pass.  Instruction
  numbers can be printed or suppressed.  Deleted instructions can be printed or
  suppressed (they are retained in the instruction list, as discussed earlier,
  because they can help explain how lower-level instructions originated).
  Liveness information can be printed when available.  Details of register
  allocation can be printed as register allocator decisions are made.  And more.

- Running Subzero with any level of verbosity produces an enormous amount of
  output.  When debugging a single function, verbose output can be suppressed
  except for a particular function.  The ``-verbose-focus`` flag suppresses
  verbose output except for the specified function.

- Subzero has a ``-timing`` option that prints a breakdown of pass-level timing
  at exit.  Timing markers can be placed in the Subzero source code to demarcate
  logical operations or passes of interest.  Basic timing information plus
  call-stack type timing information is printed at the end.

- Along with ``-timing``, the user can instead get a report on the overall
  translation time for each function, to help focus on timing outliers.  Also,
  ``-timing-focus`` limits the ``-timing`` reporting to a single function,
  instead of aggregating pass timing across all functions.

- The ``-szstats`` option reports various statistics on each function, such as
  stack frame size, static instruction count, etc.  It may be helpful to track
  these stats over time as Subzero is improved, as an approximate measure of
  code quality.

- The flag ``-asm-verbose``, in conjunction with emitting textual assembly
  output, annotates the assembly output with register-focused liveness
  information.  In particular, each basic block is annotated with which
  registers are live-in and live-out, and each instruction is annotated with
  which registers' and stack locations' live ranges end at that instruction.
  This is really useful when studying the generated code to find opportunities
  for code quality improvements.
Testing and debugging
---------------------

LLVM lit tests
^^^^^^^^^^^^^^

For basic testing, Subzero uses LLVM's `lit
<http://llvm.org/docs/CommandGuide/lit.html>`_ framework for running tests.  We
have a suite of hundreds of small functions where we test for particular
assembly code patterns across different target architectures.

Cross tests
^^^^^^^^^^^

Unfortunately, the lit tests don't do a great job of precisely testing the
correctness of the output.  Much better are the cross tests, which are execution
tests that compare Subzero-translated and ``pnacl-llc``-translated bitcode
across a wide variety of interesting inputs.  Each cross test consists of a set
of C, C++, and/or low-level bitcode files.  The C and C++ source files are
compiled down to bitcode.  The bitcode files are translated by ``pnacl-llc`` and
also by Subzero.  Subzero mangles global symbol names with a special prefix to
avoid duplicate symbol errors.  A driver program invokes both versions on a
large set of interesting inputs, and reports when the Subzero and ``pnacl-llc``
results differ.  Cross tests turn out to be an excellent way of testing the
basic lowering patterns, but they are less useful for testing more global things
like liveness analysis and register allocation.
Bisection debugging
^^^^^^^^^^^^^^^^^^^

Sometimes with a new application, Subzero will end up producing incorrect code
that either crashes at runtime or otherwise produces the wrong results.  When
this happens, we need to narrow it down to a single function (or small set of
functions) that yields incorrect behavior.  For this, we have a bisection
debugging framework.  Here, we initially translate the entire application once
with Subzero and once with ``pnacl-llc``.  We then use ``objcopy`` to
selectively weaken symbols based on a whitelist or blacklist provided on the
command line.  The two object files can then be linked together without link
errors, with the desired version of each method "winning".  Then the binary is
tested, and bisection proceeds based on whether the binary produces correct
output.

When the bisection completes, we are left with a minimal set of
Subzero-translated functions that cause the failure.  Usually it is a single
function, though sometimes it might require a combination of several functions
to cause a failure; this may be due to an incorrect call ABI, for example.
However, Murphy's Law implies that the single failing function is enormous and
impractical to debug.  In that case, we can restart the bisection, explicitly
blacklisting the enormous function, and try to find another candidate to debug.
(Future work is to automate this to find all minimal sets of functions, so that
debugging can focus on the simplest example.)
Fuzz testing
^^^^^^^^^^^^

As described above, we try to find internal Subzero bugs using fuzz testing
techniques.

Sanitizers
^^^^^^^^^^

Subzero can be built with `AddressSanitizer
<http://clang.llvm.org/docs/AddressSanitizer.html>`_ (ASan) or `ThreadSanitizer
<http://clang.llvm.org/docs/ThreadSanitizer.html>`_ (TSan) support.  This is
done using something as simple as ``make ASAN=1`` or ``make TSAN=1``.  So far,
multithreading has been simple enough that TSan hasn't found any bugs, but ASan
has found at least one memory leak, which was subsequently fixed.
`UndefinedBehaviorSanitizer
<http://clang.llvm.org/docs/UsersManual.html#controlling-code-generation>`_
(UBSan) support is in progress.  `Control flow integrity sanitization
<http://clang.llvm.org/docs/ControlFlowIntegrity.html>`_ is also under
consideration.
Current status
==============

Target architectures
--------------------

Subzero is currently more or less complete for the x86-32 target.  It has been
refactored and extended to handle x86-64 as well, and that work is mostly
complete at this point.

ARM32 work is in progress.  It currently lacks the testing level of x86, at
least in part because Subzero's register allocator needs modifications to handle
ARM's aliasing of floating-point and vector registers.  Specifically, a 64-bit
register is actually a gang of two consecutive and aligned 32-bit registers, and
a 128-bit register is a gang of 4 consecutive and aligned 32-bit registers.
ARM64 work has not started; when it does, it will be native-only, since the
Native Client sandbox model, validator, and other tools have never been defined
for that architecture.

An external contributor is adding MIPS support, in large part by following the
ARM32 work.
Translator performance
----------------------

Single-threaded translation speed is currently about 5× the ``pnacl-llc``
translation speed.  For a large ``.pexe`` file, the time breaks down as:

- 11% for parsing and initial IR building

- 4% for emitting to /dev/null

- 27% for liveness analysis (two liveness passes plus live range construction)

- 15% for linear-scan register allocation

- 9% for basic lowering

- 10% for advanced phi lowering

- ~11% for other minor analysis

- ~10% measurement overhead to acquire these numbers

Some improvements could undoubtedly be made, but it will be hard to reach 10×
the ``pnacl-llc`` translation speed while keeping acceptable code quality.  With
``-Om1`` (lack of) optimization, we do actually achieve roughly 10× the
``pnacl-llc`` translation speed, but code quality drops by a factor of 3.
Code quality
------------

Measured across 16 components of spec2k, Subzero's code quality is uniformly
better than ``pnacl-llc`` ``-O0`` code quality, and in many cases solidly
between ``pnacl-llc`` ``-O0`` and ``-O2``.

Translator size
---------------

When built in MINIMAL mode, the x86-64 native translator size for the x86-32
target is about 700 KB, not including the size of functions referenced in
dynamically-linked libraries.  The sandboxed version of Subzero is a bit over 1
MB; it is statically linked and also includes nop padding for bundling as well
as indirect branch masking.

Translator memory footprint
---------------------------

It's hard to draw firm conclusions about memory footprint, since the footprint
is at least proportional to the input function size, and there is no real limit
on the size of functions in the ``.pexe`` file.
That said, we looked at the memory footprint over time as Subzero translated
``pnacl-llc.pexe``, which is the largest ``.pexe`` file (7.2 MB) at our
disposal.  One of LLVM's libraries that Subzero uses can report the current
malloc heap usage.  With single-threaded translation, Subzero tends to hover
around 15 MB of memory usage.  There are a couple of monstrous functions where
Subzero grows to around 100 MB, but then it drops back down after those
functions finish translating.  In contrast, ``pnacl-llc`` grows larger and
larger throughout translation, reaching several hundred MB by the time it
completes.

It's a bit more interesting when we enable multithreaded translation.  When
there are N translation threads, Subzero implements a policy that limits the
size of the translation queue to N entries -- if it is "full" when the parser
tries to add a new CFG, the parser blocks until one of the translation threads
removes a CFG.  This means the number of in-memory CFGs can (and generally does)
reach 2*N+1 (N waiting in the queue, N being translated, and one being built by
the parser), and so the memory footprint rises in proportion to the number of
threads.  Adding to the pressure, the monstrous functions also take
proportionally longer to translate, so there's a good chance many of them will
be active at the same time with multithreaded translation.  As a result, for
N=32, Subzero's memory footprint peaks at about 260 MB, but drops back down as
the large functions finish translating.

If this peak memory size becomes a problem, it might be possible for the parser
to resequence the functions to try to spread out the larger functions, or to
throttle the translation queue to prevent too many in-flight large functions.
It may also be possible to throttle based on memory pressure signaling from
Chrome.

Translator scalability
----------------------

Currently scalability is "not very good".  Multiple translation threads lead to
faster translation, but not to the degree desired.  We haven't dug in to
investigate yet.

There are a few areas to investigate.  First, there may be contention on the
constant pool, which all threads access, and which requires locked access even
for reading.  This could be mitigated by keeping a CFG-local cache of the most
common constants.
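
One possible shape for that mitigation, as a sketch (the names are hypothetical
and this is not implemented in Subzero)::

  #include <cstdint>
  #include <unordered_map>

  struct Constant;  // whatever the global pool hands back

  // Assumed to exist elsewhere: a lookup in the shared pool that must take a
  // lock even for reads.
  Constant *lookupGlobalConstant(int64_t Value);

  // CFG-local cache: repeated requests for the same constant within one
  // function hit unshared memory instead of the locked global pool.
  class CfgConstantCache {
    std::unordered_map<int64_t, Constant *> Cache;

  public:
    Constant *get(int64_t Value) {
      auto It = Cache.find(Value);
      if (It != Cache.end())
        return It->second;
      Constant *C = lookupGlobalConstant(Value);  // locked, global
      Cache.emplace(Value, C);
      return C;
    }
  };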

Second, there may be contention on memory allocation.  While almost all CFG
objects are allocated from the CFG-local allocator, some passes use temporary
STL containers that use the default allocator, which may require global locking.
This could be mitigated by switching these to the CFG-local allocator.

Third, multithreading may make the default allocator strategy more expensive.
In a single-threaded environment, a pass will allocate its containers, run the
pass, and deallocate the containers.  This results in stack-like allocation
behavior and makes the heap free list easier to manage, with less heap
fragmentation.  But when multithreading is added, the allocations and
deallocations become much less stack-like, making allocation and deallocation
operations individually more expensive.  Again, this could be mitigated by
switching these to the CFG-local allocator.
| OLD | NEW |