| OLD | NEW |
| 1 Design of the Subzero fast code generator | 1 Design of the Subzero fast code generator |
| 2 ========================================= | 2 ========================================= |
| 3 | 3 |
| 4 Introduction | 4 Introduction |
| 5 ------------ | 5 ------------ |
| 6 | 6 |
| 7 The `Portable Native Client (PNaCl) <http://gonacl.com>`_ project includes | 7 The `Portable Native Client (PNaCl) <http://gonacl.com>`_ project includes |
| 8 compiler technology based on `LLVM <http://llvm.org/>`_. The developer uses the | 8 compiler technology based on `LLVM <http://llvm.org/>`_. The developer uses the |
| 9 PNaCl toolchain to compile their application to architecture-neutral PNaCl | 9 PNaCl toolchain to compile their application to architecture-neutral PNaCl |
| 10 bitcode (a ``.pexe`` file), using as much architecture-neutral optimization as | 10 bitcode (a ``.pexe`` file), using as much architecture-neutral optimization as |
| (...skipping 31 matching lines...) Expand 10 before | Expand all | Expand 10 after Loading... |
| 42 | 42 |
| 43 Or, improve translator performance to something more reasonable. | 43 Or, improve translator performance to something more reasonable. |
| 44 | 44 |
| 45 This document describes Subzero's attempt to improve translation speed by an | 45 This document describes Subzero's attempt to improve translation speed by an |
| 46 order of magnitude while rivaling LLVM's code quality. Subzero does this | 46 order of magnitude while rivaling LLVM's code quality. Subzero does this |
| 47 through minimal IR layering, lean data structures and passes, and a careful | 47 through minimal IR layering, lean data structures and passes, and a careful |
| 48 selection of fast optimization passes. It has two optimization recipes: full | 48 selection of fast optimization passes. It has two optimization recipes: full |
| 49 optimizations (``O2``) and minimal optimizations (``Om1``). The recipes are the | 49 optimizations (``O2``) and minimal optimizations (``Om1``). The recipes are the |
| 50 following (described in more detail below): | 50 following (described in more detail below): |
| 51 | 51 |
| 52 +-----------------------------------+-----------------------+ | 52 +-----------------------------------+-----------------------------+ |
| 53 | O2 recipe | Om1 recipe | | 53 | O2 recipe | Om1 recipe | |
| 54 +===================================+=======================+ | 54 +===================================+=============================+ |
| 55 | Parse .pexe file | Parse .pexe file | | 55 | Parse .pexe file | Parse .pexe file | |
| 56 +-----------------------------------+-----------------------+ | 56 +-----------------------------------+-----------------------------+ |
| 57 | Loop nest analysis | | | 57 | Loop nest analysis | | |
| 58 +-----------------------------------+-----------------------+ | 58 +-----------------------------------+-----------------------------+ |
| 59 | Address mode inference | | | 59 | Address mode inference | | |
| 60 +-----------------------------------+-----------------------+ | 60 +-----------------------------------+-----------------------------+ |
| 61 | Read-modify-write (RMW) transform | | | 61 | Read-modify-write (RMW) transform | | |
| 62 +-----------------------------------+-----------------------+ | 62 +-----------------------------------+-----------------------------+ |
| 63 | Basic liveness analysis | | | 63 | Basic liveness analysis | | |
| 64 +-----------------------------------+-----------------------+ | 64 +-----------------------------------+-----------------------------+ |
| 65 | Load optimization | | | 65 | Load optimization | | |
| 66 +-----------------------------------+-----------------------+ | 66 +-----------------------------------+-----------------------------+ |
| 67 | | Phi lowering (simple) | | 67 | | Phi lowering (simple) | |
| 68 +-----------------------------------+-----------------------+ | 68 +-----------------------------------+-----------------------------+ |
| 69 | Target lowering | Target lowering | | 69 | Target lowering | Target lowering | |
| 70 +-----------------------------------+-----------------------+ | 70 +-----------------------------------+-----------------------------+ |
| 71 | Full liveness analysis | | | 71 | Full liveness analysis | | |
| 72 +-----------------------------------+-----------------------+ | 72 +-----------------------------------+-----------------------------+ |
| 73 | Register allocation | | | 73 | Register allocation | Minimal register allocation | |
| 74 +-----------------------------------+-----------------------+ | 74 +-----------------------------------+-----------------------------+ |
| 75 | Phi lowering (advanced) | | | 75 | Phi lowering (advanced) | | |
| 76 +-----------------------------------+-----------------------+ | 76 +-----------------------------------+-----------------------------+ |
| 77 | Post-phi register allocation | | | 77 | Post-phi register allocation | | |
| 78 +-----------------------------------+-----------------------+ | 78 +-----------------------------------+-----------------------------+ |
| 79 | Branch optimization | | | 79 | Branch optimization | | |
| 80 +-----------------------------------+-----------------------+ | 80 +-----------------------------------+-----------------------------+ |
| 81 | Code emission | Code emission | | 81 | Code emission | Code emission | |
| 82 +-----------------------------------+-----------------------+ | 82 +-----------------------------------+-----------------------------+ |
| 83 | 83 |
| 84 Goals | 84 Goals |
| 85 ===== | 85 ===== |
| 86 | 86 |
| 87 Translation speed | 87 Translation speed |
| 88 ----------------- | 88 ----------------- |
| 89 | 89 |
| 90 We'd like to be able to translate a ``.pexe`` file as fast as download speed. | 90 We'd like to be able to translate a ``.pexe`` file as fast as download speed. |
| 91 Any faster is in a sense wasted effort. Download speed varies greatly, but | 91 Any faster is in a sense wasted effort. Download speed varies greatly, but |
| 92 we'll arbitrarily say 1 MB/sec. We'll pick the ARM A15 CPU as the example of a | 92 we'll arbitrarily say 1 MB/sec. We'll pick the ARM A15 CPU as the example of a |
| (...skipping 1392 matching lines...) Expand 10 before | Expand all | Expand 10 after Loading... |
| 1485 This could be mitigated by switching these to the CFG-local allocator. | 1485 This could be mitigated by switching these to the CFG-local allocator. |
| 1486 | 1486 |
| 1487 Third, multithreading may make the default allocator strategy more expensive. | 1487 Third, multithreading may make the default allocator strategy more expensive. |
| 1488 In a single-threaded environment, a pass will allocate its containers, run the | 1488 In a single-threaded environment, a pass will allocate its containers, run the |
| 1489 pass, and deallocate the containers. This results in stack-like allocation | 1489 pass, and deallocate the containers. This results in stack-like allocation |
| 1490 behavior and makes the heap free list easier to manage, with less heap | 1490 behavior and makes the heap free list easier to manage, with less heap |
| 1491 fragmentation. But when multithreading is added, the allocations and | 1491 fragmentation. But when multithreading is added, the allocations and |
| 1492 deallocations become much less stack-like, making allocation and deallocation | 1492 deallocations become much less stack-like, making allocation and deallocation |
| 1493 operations individually more expensive. Again, this could be mitigated by | 1493 operations individually more expensive. Again, this could be mitigated by |
| 1494 switching these to the CFG-local allocator. | 1494 switching these to the CFG-local allocator. |
| OLD | NEW |