OLD | NEW |
1 Design of the Subzero fast code generator | 1 Design of the Subzero fast code generator |
2 ========================================= | 2 ========================================= |
3 | 3 |
4 Introduction | 4 Introduction |
5 ------------ | 5 ------------ |
6 | 6 |
7 The `Portable Native Client (PNaCl) <http://gonacl.com>`_ project includes | 7 The `Portable Native Client (PNaCl) <http://gonacl.com>`_ project includes |
8 compiler technology based on `LLVM <http://llvm.org/>`_. The developer uses the | 8 compiler technology based on `LLVM <http://llvm.org/>`_. The developer uses the |
9 PNaCl toolchain to compile their application to architecture-neutral PNaCl | 9 PNaCl toolchain to compile their application to architecture-neutral PNaCl |
10 bitcode (a ``.pexe`` file), using as much architecture-neutral optimization as | 10 bitcode (a ``.pexe`` file), using as much architecture-neutral optimization as |
(...skipping 31 matching lines...) Expand 10 before | Expand all | Expand 10 after Loading... |
42 | 42 |
43 Or, improve translator performance to something more reasonable. | 43 Or, improve translator performance to something more reasonable. |
44 | 44 |
45 This document describes Subzero's attempt to improve translation speed by an | 45 This document describes Subzero's attempt to improve translation speed by an |
46 order of magnitude while rivaling LLVM's code quality. Subzero does this | 46 order of magnitude while rivaling LLVM's code quality. Subzero does this |
47 through minimal IR layering, lean data structures and passes, and a careful | 47 through minimal IR layering, lean data structures and passes, and a careful |
48 selection of fast optimization passes. It has two optimization recipes: full | 48 selection of fast optimization passes. It has two optimization recipes: full |
49 optimizations (``O2``) and minimal optimizations (``Om1``). The recipes are the | 49 optimizations (``O2``) and minimal optimizations (``Om1``). The recipes are the |
50 following (described in more detail below): | 50 following (described in more detail below): |
51 | 51 |
52 +-----------------------------------+-----------------------------+ | 52 +----------------------------------------+-----------------------------+ |
53 | O2 recipe | Om1 recipe | | 53 | O2 recipe | Om1 recipe | |
54 +===================================+=============================+ | 54 +========================================+=============================+ |
55 | Parse .pexe file | Parse .pexe file | | 55 | Parse .pexe file | Parse .pexe file | |
56 +-----------------------------------+-----------------------------+ | 56 +----------------------------------------+-----------------------------+ |
57 | Loop nest analysis | | | 57 | Loop nest analysis | | |
58 +-----------------------------------+-----------------------------+ | 58 +----------------------------------------+-----------------------------+ |
59 | Address mode inference | | | 59 | Local common subexpression elimination | | |
60 +-----------------------------------+-----------------------------+ | 60 +----------------------------------------+-----------------------------+ |
61 | Read-modify-write (RMW) transform | | | 61 | Address mode inference | | |
62 +-----------------------------------+-----------------------------+ | 62 +----------------------------------------+-----------------------------+ |
63 | Basic liveness analysis | | | 63 | Read-modify-write (RMW) transform | | |
64 +-----------------------------------+-----------------------------+ | 64 +----------------------------------------+-----------------------------+ |
65 | Load optimization | | | 65 | Basic liveness analysis | | |
66 +-----------------------------------+-----------------------------+ | 66 +----------------------------------------+-----------------------------+ |
67 | | Phi lowering (simple) | | 67 | Load optimization | | |
68 +-----------------------------------+-----------------------------+ | 68 +----------------------------------------+-----------------------------+ |
69 | Target lowering | Target lowering | | 69 | | Phi lowering (simple) | |
70 +-----------------------------------+-----------------------------+ | 70 +----------------------------------------+-----------------------------+ |
71 | Full liveness analysis | | | 71 | Target lowering | Target lowering | |
72 +-----------------------------------+-----------------------------+ | 72 +----------------------------------------+-----------------------------+ |
73 | Register allocation | Minimal register allocation | | 73 | Full liveness analysis | | |
74 +-----------------------------------+-----------------------------+ | 74 +----------------------------------------+-----------------------------+ |
75 | Phi lowering (advanced) | | | 75 | Register allocation | Minimal register allocation | |
76 +-----------------------------------+-----------------------------+ | 76 +----------------------------------------+-----------------------------+ |
77 | Post-phi register allocation | | | 77 | Phi lowering (advanced) | | |
78 +-----------------------------------+-----------------------------+ | 78 +----------------------------------------+-----------------------------+ |
79 | Branch optimization | | | 79 | Post-phi register allocation | | |
80 +-----------------------------------+-----------------------------+ | 80 +----------------------------------------+-----------------------------+ |
81 | Code emission | Code emission | | 81 | Branch optimization | | |
82 +-----------------------------------+-----------------------------+ | 82 +----------------------------------------+-----------------------------+ |
| 83 | Code emission | Code emission | |
| 84 +----------------------------------------+-----------------------------+ |
83 | 85 |
84 Goals | 86 Goals |
85 ===== | 87 ===== |
86 | 88 |
87 Translation speed | 89 Translation speed |
88 ----------------- | 90 ----------------- |
89 | 91 |
90 We'd like to be able to translate a ``.pexe`` file as fast as download speed. | 92 We'd like to be able to translate a ``.pexe`` file as fast as download speed. |
91 Any faster is in a sense wasted effort. Download speed varies greatly, but | 93 Any faster is in a sense wasted effort. Download speed varies greatly, but |
92 we'll arbitrarily say 1 MB/sec. We'll pick the ARM A15 CPU as the example of a | 94 we'll arbitrarily say 1 MB/sec. We'll pick the ARM A15 CPU as the example of a |
(...skipping 418 matching lines...) Expand 10 before | Expand all | Expand 10 after Loading... |
511 register for ``T.inf``. This leads to ``T.inf=B`` being a redundant register | 513 register for ``T.inf``. This leads to ``T.inf=B`` being a redundant register |
512 copy, which is removed as an emission-time peephole optimization. | 514 copy, which is removed as an emission-time peephole optimization. |
513 | 515 |
514 O2 lowering | 516 O2 lowering |
515 ----------- | 517 ----------- |
516 | 518 |
517 Currently, the ``O2`` lowering recipe is the following: | 519 Currently, the ``O2`` lowering recipe is the following: |
518 | 520 |
519 - Loop nest analysis | 521 - Loop nest analysis |
520 | 522 |
| 523 - Local common subexpression elimination |
| 524 |
521 - Address mode inference | 525 - Address mode inference |
522 | 526 |
523 - Read-modify-write (RMW) transformation | 527 - Read-modify-write (RMW) transformation |
524 | 528 |
525 - Basic liveness analysis | 529 - Basic liveness analysis |
526 | 530 |
527 - Load optimization | 531 - Load optimization |
528 | 532 |
529 - Target lowering | 533 - Target lowering |
530 | 534 |
(...skipping 160 matching lines...) Expand 10 before | Expand all | Expand 10 after Loading... |
691 this second-chance mechanism is disabled. | 695 this second-chance mechanism is disabled. |
692 | 696 |
693 Future work is to implement LLVM's `Greedy | 697 Future work is to implement LLVM's `Greedy |
694 <http://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html>`_ | 698 <http://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html>`_ |
695 register allocator as a replacement for the basic linear-scan algorithm, given | 699 register allocator as a replacement for the basic linear-scan algorithm, given |
696 LLVM's experience with its improvement in code quality. (The blog post claims | 700 LLVM's experience with its improvement in code quality. (The blog post claims |
697 that the Greedy allocator also improved maintainability because a lot of hacks | 701 that the Greedy allocator also improved maintainability because a lot of hacks |
698 could be removed, but Subzero is probably not yet to that level of hacks, and is | 702 could be removed, but Subzero is probably not yet to that level of hacks, and is |
699 less likely to see that particular benefit.) | 703 less likely to see that particular benefit.) |
700 | 704 |
| 705 Local common subexpression elimination |
| 706 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| 707 |
| 708 The Local CSE implementation goes through each instruction and records a portion |
| 709 of each ``Seen`` instruction in a hashset-like container. That portion consists |
| 710 of the entire instruction except for any dest variable. That means ``A = X + Y`` |
| 711 and ``B = X + Y`` will be considered to be 'equal' for this purpose. This allows |
| 712 us to detect common subexpressions. |
| 713 |
| 714 Whenever a repetition is detected, the redundant variables are stored in a |
| 715 container mapping the replacee to the replacement. In the case above, it would |
| 716 be ``MAP[B] = A`` assuming ``B = X + Y`` comes after ``A = X + Y``. |
| 717 |
| 718 At any point if a variable that has an entry in the replacement table is |
| 719 encountered, it is replaced with the variable it is mapped to. This ensures that |
| 720 the redundant variables will not have any uses in the basic block, allowing |
| 721 dead code elimination to clean up the redundant instruction. |
| 722 |
| 723 With SSA, the information stored is never invalidated. However, non-SSA input is |
| 724 supported with the ``-lcse=no-ssa`` option. This has to maintain some extra |
| 725 dependency information to ensure proper invalidation on variable assignment. |
| 726 This is not rigorously tested because this pass is run at an early stage where |
| 727 it is safe to assume variables have a single definition. This is not enabled by |
| 728 default because it bumps the compile time overhead from 2% to 6%. |
| 729 |
| 730 Loop-invariant code motion |
| 731 ^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| 732 |
| 733 This pass utilizes the loop analysis information to hoist invariant instructions |
| 734 to loop pre-headers. A loop must have a single entry node (header) and that node |
| 735 must have a single external predecessor for this optimization to work, as it is |
| 736 currently implemented. |
| 737 |
| 738 The pass works by iterating over all instructions in the loop until the set of |
| 739 invariant instructions converges. In each iteration, a non-invariant instruction |
| 740 involving only constants or a variable known to be invariant is added to the |
| 741 result set. The destination variable of that instruction is added to the set of |
| 742 variables known to be invariant (which is initialized with the constant args). |
| 743 |
| 744 Improving the loop-analysis infrastructure is likely to have significant impact |
| 745 on this optimization. Inserting an extra node to act as the pre-header when the |
| 746 header has multiple incoming edges from outside could also be a good idea. |
| 747 Expanding the initial invariant variable set to contain all variables that do |
| 748 not have definitions inside the loop does not seem to improve anything. |
| 749 |
| 750 Short circuit evaluation |
| 751 ^^^^^^^^^^^^^^^^^^^^^^^^ |
| 752 |
| 753 Short circuit evaluation splits nodes and introduces early jumps when the result |
| 754 of a logical operation can be determined early and there are no observable side |
| 755 effects of skipping the rest of the computation. The instructions considered |
| 756 backwards from the end of the basic blocks. When a definition of a variable |
| 757 involved in a conditional jump is found, an extra jump can be inserted in that |
| 758 location, moving the rest of the instructions in the node to a newly inserted |
| 759 node. Consider this example:: |
| 760 |
| 761 __N : |
| 762 a = <something> |
| 763 Instruction 1 without side effect |
| 764 ... b = <something> ... |
| 765 Instruction N without side effect |
| 766 t1 = or a b |
| 767 br t1 __X __Y |
| 768 |
| 769 is transformed to:: |
| 770 |
| 771 __N : |
| 772 a = <something> |
| 773 br a __X __N_ext |
| 774 |
| 775 __N_ext : |
| 776 Instruction 1 without side effect |
| 777 ... b = <something> ... |
| 778 Instruction N without side effect |
| 779 br b __X __Y |
| 780 |
| 781 The logic for AND is analogous, the only difference is that the early jump is |
| 782 facilitated by a ``false`` value instead of ``true``. |
| 783 |
| 784 Global Variable Splitting |
| 785 ^^^^^^^^^^^^^^^^^^^^^^^^^ |
| 786 |
| 787 Global variable splitting (``-split-global-vars``) is run after register |
| 788 allocation. It works on the variables that did not manage to get registers (but |
| 789 are allowed to) and decomposes their live ranges into the individual segments |
| 790 (which span a single node at most). New variables are created (but not yet used) |
| 791 with these smaller live ranges and the register allocator is run again. This is |
| 792 not inefficient as the old variables that already had registers are now |
| 793 considered pre-colored. |
| 794 |
| 795 The new variables that get registers replace their parent variables for their |
| 796 portion of its (parent's) live range. A copy from the old variable to the new |
| 797 is introduced before the first use and the reverse after the last def in the |
| 798 live range. |
| 799 |
701 Basic phi lowering | 800 Basic phi lowering |
702 ^^^^^^^^^^^^^^^^^^ | 801 ^^^^^^^^^^^^^^^^^^ |
703 | 802 |
704 The simplest phi lowering strategy works as follows (this is how LLVM ``-O0`` | 803 The simplest phi lowering strategy works as follows (this is how LLVM ``-O0`` |
705 implements it). Consider this example:: | 804 implements it). Consider this example:: |
706 | 805 |
707 L1: | 806 L1: |
708 ... | 807 ... |
709 br L3 | 808 br L3 |
710 L2: | 809 L2: |
(...skipping 774 matching lines...) Expand 10 before | Expand all | Expand 10 after Loading... |
1485 This could be mitigated by switching these to the CFG-local allocator. | 1584 This could be mitigated by switching these to the CFG-local allocator. |
1486 | 1585 |
1487 Third, multithreading may make the default allocator strategy more expensive. | 1586 Third, multithreading may make the default allocator strategy more expensive. |
1488 In a single-threaded environment, a pass will allocate its containers, run the | 1587 In a single-threaded environment, a pass will allocate its containers, run the |
1489 pass, and deallocate the containers. This results in stack-like allocation | 1588 pass, and deallocate the containers. This results in stack-like allocation |
1490 behavior and makes the heap free list easier to manage, with less heap | 1589 behavior and makes the heap free list easier to manage, with less heap |
1491 fragmentation. But when multithreading is added, the allocations and | 1590 fragmentation. But when multithreading is added, the allocations and |
1492 deallocations become much less stack-like, making allocation and deallocation | 1591 deallocations become much less stack-like, making allocation and deallocation |
1493 operations individually more expensive. Again, this could be mitigated by | 1592 operations individually more expensive. Again, this could be mitigated by |
1494 switching these to the CFG-local allocator. | 1593 switching these to the CFG-local allocator. |
OLD | NEW |