OLD | NEW |
---|---|
1 Design of the Subzero fast code generator | 1 Design of the Subzero fast code generator |
2 ========================================= | 2 ========================================= |
3 | 3 |
4 Introduction | 4 Introduction |
5 ------------ | 5 ------------ |
6 | 6 |
7 The `Portable Native Client (PNaCl) <http://gonacl.com>`_ project includes | 7 The `Portable Native Client (PNaCl) <http://gonacl.com>`_ project includes |
8 compiler technology based on `LLVM <http://llvm.org/>`_. The developer uses the | 8 compiler technology based on `LLVM <http://llvm.org/>`_. The developer uses the |
9 PNaCl toolchain to compile their application to architecture-neutral PNaCl | 9 PNaCl toolchain to compile their application to architecture-neutral PNaCl |
10 bitcode (a ``.pexe`` file), using as much architecture-neutral optimization as | 10 bitcode (a ``.pexe`` file), using as much architecture-neutral optimization as |
(...skipping 31 matching lines...) Expand 10 before | Expand all | Expand 10 after Loading... | |
42 | 42 |
43 Or, improve translator performance to something more reasonable. | 43 Or, improve translator performance to something more reasonable. |
44 | 44 |
45 This document describes Subzero's attempt to improve translation speed by an | 45 This document describes Subzero's attempt to improve translation speed by an |
46 order of magnitude while rivaling LLVM's code quality. Subzero does this | 46 order of magnitude while rivaling LLVM's code quality. Subzero does this |
47 through minimal IR layering, lean data structures and passes, and a careful | 47 through minimal IR layering, lean data structures and passes, and a careful |
48 selection of fast optimization passes. It has two optimization recipes: full | 48 selection of fast optimization passes. It has two optimization recipes: full |
49 optimizations (``O2``) and minimal optimizations (``Om1``). The recipes are the | 49 optimizations (``O2``) and minimal optimizations (``Om1``). The recipes are the |
50 following (described in more detail below): | 50 following (described in more detail below): |
51 | 51 |
52 +-----------------------------------+-----------------------------+ | 52 +----------------------------------------+-----------------------------+ |
53 | O2 recipe | Om1 recipe | | 53 | O2 recipe | Om1 recipe | |
54 +===================================+=============================+ | 54 +========================================+=============================+ |
55 | Parse .pexe file | Parse .pexe file | | 55 | Parse .pexe file | Parse .pexe file | |
56 +-----------------------------------+-----------------------------+ | 56 +----------------------------------------+-----------------------------+ |
57 | Loop nest analysis | | | 57 | Loop nest analysis | | |
58 +-----------------------------------+-----------------------------+ | 58 +----------------------------------------+-----------------------------+ |
59 | Address mode inference | | | 59 | Local common subexpression elimination | | |
60 +-----------------------------------+-----------------------------+ | 60 +----------------------------------------+-----------------------------+ |
61 | Read-modify-write (RMW) transform | | | 61 | Address mode inference | | |
62 +-----------------------------------+-----------------------------+ | 62 +----------------------------------------+-----------------------------+ |
63 | Basic liveness analysis | | | 63 | Read-modify-write (RMW) transform | | |
64 +-----------------------------------+-----------------------------+ | 64 +----------------------------------------+-----------------------------+ |
65 | Load optimization | | | 65 | Basic liveness analysis | | |
66 +-----------------------------------+-----------------------------+ | 66 +----------------------------------------+-----------------------------+ |
67 | | Phi lowering (simple) | | 67 | Load optimization | | |
68 +-----------------------------------+-----------------------------+ | 68 +----------------------------------------+-----------------------------+ |
69 | Target lowering | Target lowering | | 69 | | Phi lowering (simple) | |
70 +-----------------------------------+-----------------------------+ | 70 +----------------------------------------+-----------------------------+ |
71 | Full liveness analysis | | | 71 | Target lowering | Target lowering | |
72 +-----------------------------------+-----------------------------+ | 72 +----------------------------------------+-----------------------------+ |
73 | Register allocation | Minimal register allocation | | 73 | Full liveness analysis | | |
74 +-----------------------------------+-----------------------------+ | 74 +----------------------------------------+-----------------------------+ |
75 | Phi lowering (advanced) | | | 75 | Register allocation | Minimal register allocation | |
76 +-----------------------------------+-----------------------------+ | 76 +----------------------------------------+-----------------------------+ |
77 | Post-phi register allocation | | | 77 | Phi lowering (advanced) | | |
78 +-----------------------------------+-----------------------------+ | 78 +----------------------------------------+-----------------------------+ |
79 | Branch optimization | | | 79 | Post-phi register allocation | | |
80 +-----------------------------------+-----------------------------+ | 80 +----------------------------------------+-----------------------------+ |
81 | Code emission | Code emission | | 81 | Branch optimization | | |
82 +-----------------------------------+-----------------------------+ | 82 +----------------------------------------+-----------------------------+ |
83 | Code emission | Code emission | | |
84 +----------------------------------------+-----------------------------+ | |
83 | 85 |
84 Goals | 86 Goals |
85 ===== | 87 ===== |
86 | 88 |
87 Translation speed | 89 Translation speed |
88 ----------------- | 90 ----------------- |
89 | 91 |
90 We'd like to be able to translate a ``.pexe`` file as fast as download speed. | 92 We'd like to be able to translate a ``.pexe`` file as fast as download speed. |
91 Any faster is in a sense wasted effort. Download speed varies greatly, but | 93 Any faster is in a sense wasted effort. Download speed varies greatly, but |
92 we'll arbitrarily say 1 MB/sec. We'll pick the ARM A15 CPU as the example of a | 94 we'll arbitrarily say 1 MB/sec. We'll pick the ARM A15 CPU as the example of a |
(...skipping 418 matching lines...) Expand 10 before | Expand all | Expand 10 after Loading... | |
511 register for ``T.inf``. This leads to ``T.inf=B`` being a redundant register | 513 register for ``T.inf``. This leads to ``T.inf=B`` being a redundant register |
512 copy, which is removed as an emission-time peephole optimization. | 514 copy, which is removed as an emission-time peephole optimization. |
513 | 515 |
514 O2 lowering | 516 O2 lowering |
515 ----------- | 517 ----------- |
516 | 518 |
517 Currently, the ``O2`` lowering recipe is the following: | 519 Currently, the ``O2`` lowering recipe is the following: |
518 | 520 |
519 - Loop nest analysis | 521 - Loop nest analysis |
520 | 522 |
523 - Local common subexpression elimination | |
524 | |
521 - Address mode inference | 525 - Address mode inference |
522 | 526 |
523 - Read-modify-write (RMW) transformation | 527 - Read-modify-write (RMW) transformation |
524 | 528 |
525 - Basic liveness analysis | 529 - Basic liveness analysis |
526 | 530 |
527 - Load optimization | 531 - Load optimization |
528 | 532 |
529 - Target lowering | 533 - Target lowering |
530 | 534 |
(...skipping 160 matching lines...) Expand 10 before | Expand all | Expand 10 after Loading... | |
691 this second-chance mechanism is disabled. | 695 this second-chance mechanism is disabled. |
692 | 696 |
693 Future work is to implement LLVM's `Greedy | 697 Future work is to implement LLVM's `Greedy |
694 <http://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html>`_ | 698 <http://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html>`_ |
695 register allocator as a replacement for the basic linear-scan algorithm, given | 699 register allocator as a replacement for the basic linear-scan algorithm, given |
696 LLVM's experience with its improvement in code quality. (The blog post claims | 700 LLVM's experience with its improvement in code quality. (The blog post claims |
697 that the Greedy allocator also improved maintainability because a lot of hacks | 701 that the Greedy allocator also improved maintainability because a lot of hacks |
698 could be removed, but Subzero is probably not yet to that level of hacks, and is | 702 could be removed, but Subzero is probably not yet to that level of hacks, and is |
699 less likely to see that particular benefit.) | 703 less likely to see that particular benefit.) |
700 | 704 |
705 Local common subexpression elimination | |
706 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
707 | |
708 The Local CSE implementation goes through each instruction and records a portion | |
709 of each ``Seen`` instruction in a hashset-like container. That portion consists | |
710 of the entire instruction except for any dest variable. That means ``A = X + Y`` | |
711 and ``B = X + Y`` will be considered to be 'equal' for this purpose. This allows | |
712 us to detect common subexpressions. | |
713 | |
714 Whenever a repetition is detected, the redundant variables are stored in a | |
715 container mapping the replacee to the replacement. In the case above, it would | |
716 be ``MAP[B] = A`` assuming ``B = X + Y`` comes after ``A = X + Y``. | |
717 | |
718 At any point if a variable that has an entry in the replacement table is | |
719 encountered, it is replaced with the variable it is mapped to. This ensures that | |
720 the redundant variables will not have any uses in the basic block, allowing | |
721 dead code elimination to clean up the redundant instruction. | |
722 | |
723 With SSA, the information stored is never invalidated. However, non-SSA input is | |
724 supported with the ``-lcse=no-ssa`` option. This has to maintain some extra | |
725 dependency information to ensure proper invalidation on variable assignment. | |
726 This is not rigorously tested because this pass is run at an early stage where | |
727 it is safe to assume variables have a single definition. This is not enabled by | |
728 default because it bumps the compile time overhead from 2% to 6%. | |
729 | |
730 Loop-invariant code motion | |
731 ^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
732 | |
733 This pass utilizes the loop analysis information to hoist invariant instructions | |
734 to loop pre-headers. A loop must have a single entry node (header) and that node | |
735 must have a single external predecessor for this optimization to work, as it is | |
736 currently implemented. | |
737 | |
738 The pass works by iterating over all instructions in the loop until the set of | |
739 invariant instructions converges. In each iteration, a non-invariant instruction | |
740 involving only constants or a variable known to be invariant is added to the | |
741 result set. The destination variable of that instruction is added to the set of | |
742 variables known to be invariant (which is initialized with the constant args). | |
743 | |
744 Improving the loop-analysis infrastructure is likely to have significant impact | |
745 on this optimization. Inserting an extra node to act as the pre-header when the | |
746 header has multiple incoming edges from outside could also be a good idea. | |
747 Expanding the initial invariant variable set to contain all variables that do | |
748 not have definitions inside the loop does not seem to improve anything. | |
749 | |
750 Short circuit evaluation | |
751 ^^^^^^^^^^^^^^^^^^^^^^^^ | |
752 | |
753 Short circuit evaluation splits nodes and introduces early jumps when the result | |
754 of a logical operation can be determined early and there are no observable side | |
755 effects of skipping the rest of the computation. The instructions considered | |
756 backwards from the end of the basic blocks. When a definition of a variable | |
757 involved in a conditional jump is found, an extra jump can be inserted in that | |
758 location, moving the rest of the instructions in the node to a newly inserted | |
759 node. Consider this example:: | |
760 | |
761 __N : | |
762 a = <something> | |
763 Instruction 1 without side effect | |
764 ... b = <something> ... | |
765 Instruction N without side effect | |
766 t1 = or a b | |
767 br t1 __X __Y | |
768 | |
769 is transformed to:: | |
770 | |
771 __N : | |
772 a = <something> | |
773 br a __X __N_ext | |
774 | |
775 __N_ext : | |
776 Instruction 1 without side effect | |
777 ... b = <something> ... | |
778 Instruction N without side effect | |
779 br b __X __Y | |
780 | |
781 The logic for AND is analogous, the only difference is that the early jump is | |
782 facilitated by a ``false`` value instead of ``true``. | |
783 | |
784 Global Variable Splitting | |
785 ^^^^^^^^^^^^^^^^^^^^^^^^^ | |
786 | |
787 Global variable splitting (``-split-global-vars``) is run after register | |
788 allocation. It works on the variables that did not manage to get registers (but | |
789 are allowed to) and decomposes their live ranges into the individual segments | |
790 (which span a single node at most). New variables are created (but not yet used) | |
791 with these smaller live ranges and the register allocator is run again. This is | |
792 not inefficient as the old variables that already had registers are now | |
793 considered pre-colored. | |
794 | |
795 The new variables that get registers replace their parent variables for their | |
796 portion of its (parent's) live range. A copy from the old variable to the new | |
797 is introduced before the first use and the reverse after the last def in the | |
798 live range. | |
799 | |
800 | |
Jim Stichnoth
2016/08/05 19:23:59
Just a single blank line for consistency?
| |
701 Basic phi lowering | 801 Basic phi lowering |
702 ^^^^^^^^^^^^^^^^^^ | 802 ^^^^^^^^^^^^^^^^^^ |
703 | 803 |
704 The simplest phi lowering strategy works as follows (this is how LLVM ``-O0`` | 804 The simplest phi lowering strategy works as follows (this is how LLVM ``-O0`` |
705 implements it). Consider this example:: | 805 implements it). Consider this example:: |
706 | 806 |
707 L1: | 807 L1: |
708 ... | 808 ... |
709 br L3 | 809 br L3 |
710 L2: | 810 L2: |
(...skipping 774 matching lines...) Expand 10 before | Expand all | Expand 10 after Loading... | |
1485 This could be mitigated by switching these to the CFG-local allocator. | 1585 This could be mitigated by switching these to the CFG-local allocator. |
1486 | 1586 |
1487 Third, multithreading may make the default allocator strategy more expensive. | 1587 Third, multithreading may make the default allocator strategy more expensive. |
1488 In a single-threaded environment, a pass will allocate its containers, run the | 1588 In a single-threaded environment, a pass will allocate its containers, run the |
1489 pass, and deallocate the containers. This results in stack-like allocation | 1589 pass, and deallocate the containers. This results in stack-like allocation |
1490 behavior and makes the heap free list easier to manage, with less heap | 1590 behavior and makes the heap free list easier to manage, with less heap |
1491 fragmentation. But when multithreading is added, the allocations and | 1591 fragmentation. But when multithreading is added, the allocations and |
1492 deallocations become much less stack-like, making allocation and deallocation | 1592 deallocations become much less stack-like, making allocation and deallocation |
1493 operations individually more expensive. Again, this could be mitigated by | 1593 operations individually more expensive. Again, this could be mitigated by |
1494 switching these to the CFG-local allocator. | 1594 switching these to the CFG-local allocator. |
OLD | NEW |