OLD | NEW |
1 Design of the Subzero fast code generator | 1 Design of the Subzero fast code generator |
2 ========================================= | 2 ========================================= |
3 | 3 |
4 Introduction | 4 Introduction |
5 ------------ | 5 ------------ |
6 | 6 |
7 The `Portable Native Client (PNaCl) <http://gonacl.com>`_ project includes | 7 The `Portable Native Client (PNaCl) <http://gonacl.com>`_ project includes |
8 compiler technology based on `LLVM <http://llvm.org/>`_. The developer uses the | 8 compiler technology based on `LLVM <http://llvm.org/>`_. The developer uses the |
9 PNaCl toolchain to compile their application to architecture-neutral PNaCl | 9 PNaCl toolchain to compile their application to architecture-neutral PNaCl |
10 bitcode (a ``.pexe`` file), using as much architecture-neutral optimization as | 10 bitcode (a ``.pexe`` file), using as much architecture-neutral optimization as |
(...skipping 31 matching lines...) Expand 10 before | Expand all | Expand 10 after Loading... |
42 | 42 |
43 Or, improve translator performance to something more reasonable. | 43 Or, improve translator performance to something more reasonable. |
44 | 44 |
45 This document describes Subzero's attempt to improve translation speed by an | 45 This document describes Subzero's attempt to improve translation speed by an |
46 order of magnitude while rivaling LLVM's code quality. Subzero does this | 46 order of magnitude while rivaling LLVM's code quality. Subzero does this |
47 through minimal IR layering, lean data structures and passes, and a careful | 47 through minimal IR layering, lean data structures and passes, and a careful |
48 selection of fast optimization passes. It has two optimization recipes: full | 48 selection of fast optimization passes. It has two optimization recipes: full |
49 optimizations (``O2``) and minimal optimizations (``Om1``). The recipes are the | 49 optimizations (``O2``) and minimal optimizations (``Om1``). The recipes are the |
50 following (described in more detail below): | 50 following (described in more detail below): |
51 | 51 |
52 +-----------------------------------+-----------------------+ | 52 +-----------------------------------+-----------------------------+ |
53 | O2 recipe | Om1 recipe | | 53 | O2 recipe | Om1 recipe | |
54 +===================================+=======================+ | 54 +===================================+=============================+ |
55 | Parse .pexe file | Parse .pexe file | | 55 | Parse .pexe file | Parse .pexe file | |
56 +-----------------------------------+-----------------------+ | 56 +-----------------------------------+-----------------------------+ |
57 | Loop nest analysis | | | 57 | Loop nest analysis | | |
58 +-----------------------------------+-----------------------+ | 58 +-----------------------------------+-----------------------------+ |
59 | Address mode inference | | | 59 | Address mode inference | | |
60 +-----------------------------------+-----------------------+ | 60 +-----------------------------------+-----------------------------+ |
61 | Read-modify-write (RMW) transform | | | 61 | Read-modify-write (RMW) transform | | |
62 +-----------------------------------+-----------------------+ | 62 +-----------------------------------+-----------------------------+ |
63 | Basic liveness analysis | | | 63 | Basic liveness analysis | | |
64 +-----------------------------------+-----------------------+ | 64 +-----------------------------------+-----------------------------+ |
65 | Load optimization | | | 65 | Load optimization | | |
66 +-----------------------------------+-----------------------+ | 66 +-----------------------------------+-----------------------------+ |
67 | | Phi lowering (simple) | | 67 | | Phi lowering (simple) | |
68 +-----------------------------------+-----------------------+ | 68 +-----------------------------------+-----------------------------+ |
69 | Target lowering | Target lowering | | 69 | Target lowering | Target lowering | |
70 +-----------------------------------+-----------------------+ | 70 +-----------------------------------+-----------------------------+ |
71 | Full liveness analysis | | | 71 | Full liveness analysis | | |
72 +-----------------------------------+-----------------------+ | 72 +-----------------------------------+-----------------------------+ |
73 | Register allocation | | | 73 | Register allocation | Minimal register allocation | |
74 +-----------------------------------+-----------------------+ | 74 +-----------------------------------+-----------------------------+ |
75 | Phi lowering (advanced) | | | 75 | Phi lowering (advanced) | | |
76 +-----------------------------------+-----------------------+ | 76 +-----------------------------------+-----------------------------+ |
77 | Post-phi register allocation | | | 77 | Post-phi register allocation | | |
78 +-----------------------------------+-----------------------+ | 78 +-----------------------------------+-----------------------------+ |
79 | Branch optimization | | | 79 | Branch optimization | | |
80 +-----------------------------------+-----------------------+ | 80 +-----------------------------------+-----------------------------+ |
81 | Code emission | Code emission | | 81 | Code emission | Code emission | |
82 +-----------------------------------+-----------------------+ | 82 +-----------------------------------+-----------------------------+ |
83 | 83 |
84 Goals | 84 Goals |
85 ===== | 85 ===== |
86 | 86 |
87 Translation speed | 87 Translation speed |
88 ----------------- | 88 ----------------- |
89 | 89 |
90 We'd like to be able to translate a ``.pexe`` file as fast as download speed. | 90 We'd like to be able to translate a ``.pexe`` file as fast as download speed. |
91 Any faster is in a sense wasted effort. Download speed varies greatly, but | 91 Any faster is in a sense wasted effort. Download speed varies greatly, but |
92 we'll arbitrarily say 1 MB/sec. We'll pick the ARM A15 CPU as the example of a | 92 we'll arbitrarily say 1 MB/sec. We'll pick the ARM A15 CPU as the example of a |
(...skipping 1392 matching lines...) Expand 10 before | Expand all | Expand 10 after Loading... |
1485 This could be mitigated by switching these to the CFG-local allocator. | 1485 This could be mitigated by switching these to the CFG-local allocator. |
1486 | 1486 |
1487 Third, multithreading may make the default allocator strategy more expensive. | 1487 Third, multithreading may make the default allocator strategy more expensive. |
1488 In a single-threaded environment, a pass will allocate its containers, run the | 1488 In a single-threaded environment, a pass will allocate its containers, run the |
1489 pass, and deallocate the containers. This results in stack-like allocation | 1489 pass, and deallocate the containers. This results in stack-like allocation |
1490 behavior and makes the heap free list easier to manage, with less heap | 1490 behavior and makes the heap free list easier to manage, with less heap |
1491 fragmentation. But when multithreading is added, the allocations and | 1491 fragmentation. But when multithreading is added, the allocations and |
1492 deallocations become much less stack-like, making allocation and deallocation | 1492 deallocations become much less stack-like, making allocation and deallocation |
1493 operations individually more expensive. Again, this could be mitigated by | 1493 operations individually more expensive. Again, this could be mitigated by |
1494 switching these to the CFG-local allocator. | 1494 switching these to the CFG-local allocator. |
OLD | NEW |