OLD | NEW |
---|---|
1 Design of the Subzero fast code generator | 1 Design of the Subzero fast code generator |
2 ========================================= | 2 ========================================= |
3 | 3 |
4 Introduction | 4 Introduction |
5 ------------ | 5 ------------ |
6 | 6 |
7 The `Portable Native Client (PNaCl) <http://gonacl.com>`_ project includes | 7 The `Portable Native Client (PNaCl) <http://gonacl.com>`_ project includes |
8 compiler technology based on `LLVM <http://llvm.org/>`_. The developer uses the | 8 compiler technology based on `LLVM <http://llvm.org/>`_. The developer uses the |
9 PNaCl toolchain to compile their application to architecture-neutral PNaCl | 9 PNaCl toolchain to compile their application to architecture-neutral PNaCl |
10 bitcode (a ``.pexe`` file), using as much architecture-neutral optimization as | 10 bitcode (a ``.pexe`` file), using as much architecture-neutral optimization as |
(...skipping 38 matching lines...) Expand 10 before | Expand all | Expand 10 after Loading... | |
49 optimizations (``O2``) and minimal optimizations (``Om1``). The recipes are the | 49 optimizations (``O2``) and minimal optimizations (``Om1``). The recipes are the |
50 following (described in more detail below): | 50 following (described in more detail below): |
51 | 51 |
52 +-----------------------------------+-----------------------------+ | 52 +-----------------------------------+-----------------------------+ |
53 | O2 recipe | Om1 recipe | | 53 | O2 recipe | Om1 recipe | |
54 +===================================+=============================+ | 54 +===================================+=============================+ |
55 | Parse .pexe file | Parse .pexe file | | 55 | Parse .pexe file | Parse .pexe file | |
56 +-----------------------------------+-----------------------------+ | 56 +-----------------------------------+-----------------------------+ |
57 | Loop nest analysis | | | 57 | Loop nest analysis | | |
58 +-----------------------------------+-----------------------------+ | 58 +-----------------------------------+-----------------------------+ |
59 | Local common sub-exp elimination | | | |
Jim Stichnoth
2016/08/05 06:20:00
I think you can spell out "subexpression" and (pai
manasijm
2016/08/05 06:57:09
Won't be that painful!
I might not be an emacs nin
manasijm
2016/08/05 17:34:44
Done.
| |
60 +-----------------------------------+-----------------------------+ | |
59 | Address mode inference | | | 61 | Address mode inference | | |
60 +-----------------------------------+-----------------------------+ | 62 +-----------------------------------+-----------------------------+ |
61 | Read-modify-write (RMW) transform | | | 63 | Read-modify-write (RMW) transform | | |
62 +-----------------------------------+-----------------------------+ | 64 +-----------------------------------+-----------------------------+ |
63 | Basic liveness analysis | | | 65 | Basic liveness analysis | | |
64 +-----------------------------------+-----------------------------+ | 66 +-----------------------------------+-----------------------------+ |
65 | Load optimization | | | 67 | Load optimization | | |
66 +-----------------------------------+-----------------------------+ | 68 +-----------------------------------+-----------------------------+ |
67 | | Phi lowering (simple) | | 69 | | Phi lowering (simple) | |
68 +-----------------------------------+-----------------------------+ | 70 +-----------------------------------+-----------------------------+ |
(...skipping 442 matching lines...) Expand 10 before | Expand all | Expand 10 after Loading... | |
511 register for ``T.inf``. This leads to ``T.inf=B`` being a redundant register | 513 register for ``T.inf``. This leads to ``T.inf=B`` being a redundant register |
512 copy, which is removed as an emission-time peephole optimization. | 514 copy, which is removed as an emission-time peephole optimization. |
513 | 515 |
514 O2 lowering | 516 O2 lowering |
515 ----------- | 517 ----------- |
516 | 518 |
517 Currently, the ``O2`` lowering recipe is the following: | 519 Currently, the ``O2`` lowering recipe is the following: |
518 | 520 |
519 - Loop nest analysis | 521 - Loop nest analysis |
520 | 522 |
523 - Local common sub-expression elimination | |
Jim Stichnoth
2016/08/05 06:20:00
"Subexpression" is usually written without the hyp
manasijm
2016/08/05 17:34:44
Done.
| |
524 | |
521 - Address mode inference | 525 - Address mode inference |
522 | 526 |
523 - Read-modify-write (RMW) transformation | 527 - Read-modify-write (RMW) transformation |
524 | 528 |
525 - Basic liveness analysis | 529 - Basic liveness analysis |
526 | 530 |
527 - Load optimization | 531 - Load optimization |
528 | 532 |
529 - Target lowering | 533 - Target lowering |
530 | 534 |
(...skipping 160 matching lines...) Expand 10 before | Expand all | Expand 10 after Loading... | |
691 this second-chance mechanism is disabled. | 695 this second-chance mechanism is disabled. |
692 | 696 |
693 Future work is to implement LLVM's `Greedy | 697 Future work is to implement LLVM's `Greedy |
694 <http://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html>`_ | 698 <http://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html>`_ |
695 register allocator as a replacement for the basic linear-scan algorithm, given | 699 register allocator as a replacement for the basic linear-scan algorithm, given |
696 LLVM's experience with its improvement in code quality. (The blog post claims | 700 LLVM's experience with its improvement in code quality. (The blog post claims |
697 that the Greedy allocator also improved maintainability because a lot of hacks | 701 that the Greedy allocator also improved maintainability because a lot of hacks |
698 could be removed, but Subzero is probably not yet to that level of hacks, and is | 702 could be removed, but Subzero is probably not yet to that level of hacks, and is |
699 less likely to see that particular benefit.) | 703 less likely to see that particular benefit.) |
700 | 704 |
705 Local common sub-expression elimination | |
706 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
707 | |
708 The Local CSE implementation goes through each instruction and record ``Seen`` | |
Jim Stichnoth
2016/08/05 06:20:00
records
manasijm
2016/08/05 17:34:44
Done.
| |
709 instructions in a ``CfgUnorderedSet``. The equality comparator for this | |
Jim Stichnoth
2016/08/05 06:20:00
I don't think the level of detail in this document
manasijm
2016/08/05 17:34:44
Done.
| |
710 container ignores the destination of the instructions. That means ``A = X + Y`` | |
711 and ``B = X + Y`` will be considered to be 'equal' for this purpose. This allows | |
712 us to detect common sub-expressions. | |
713 | |
714 Whenever a repetition is detected, the redundant variables are stored in a | |
715 container mapping the replacee to the replacement. In the case above, it would | |
716 be ``MAP[B] = A`` assuming ``B = X + Y`` comes after ``A = X + Y``. | |
717 | |
718 At any point if a variable that has an entry in the replacement table is | |
719 encountered, it is replaced with the variable it is mapped to. This ensures that | |
720 the redundant variables will not have any uses in the basic block, allowing | |
721 dead code elimination to clean up the redundant instruction. | |
722 | |
723 With SSA, the information stored is never invalidated. However, non-SSA input is | |
724 supported with the -lcse=no-ssa option. This has to maintain some extra | |
Jim Stichnoth
2016/08/05 06:19:59
``-lcse=no-ssa``
manasijm
2016/08/05 17:34:43
Done.
| |
725 dependency information to ensure proper invalidation on variable assignment. | |
726 This is not rigorously tested because this pass is run at an early stage where | |
727 it is safe to assume variables have a single definition. This is not enabled by | |
728 default because it bumps the compile time overhead from 2% to 6%. | |
729 | |
730 Loop invariant code motion | |
Jim Stichnoth
2016/08/05 06:20:00
I think it's more common to hyphenate "loop-invari
manasijm
2016/08/05 17:34:44
Done.
| |
731 ^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
732 | |
733 This pass utilizes the loop analysis information to hoist invariant instructions | |
734 to loop pre-headers. A loop must have a single entry node (header) and that node | |
735 must have a single external predecessor for this optimization to work, as it is | |
736 currently implemented. | |
737 | |
738 The pass works by iterating over all instructions in the loop until the set of | |
739 invariant instructions converge. In each iteration, a non-invariant instruction | |
Jim Stichnoth
2016/08/05 06:20:00
converges
manasijm
2016/08/05 17:34:44
Done.
| |
740 involving only constants or a variable known to be invariant is added to the | |
741 result set. The destination variable of that instruction is added to the set of | |
742 variables known to be invariant (which is initialized with the constant args). | |
743 | |
744 Improving the loop-analysis infrastructure is likely to have significant effect | |
Jim Stichnoth
2016/08/05 06:20:00
maybe s/effect/impact/ ?
manasijm
2016/08/05 17:34:43
Done.
| |
745 on this optimization. Inserting an extra node to act as the pre-header when the | |
746 header has multiple incoming edges from outside could also be good idea. | |
Jim Stichnoth
2016/08/05 06:20:00
be a good
manasijm
2016/08/05 17:34:44
Done.
| |
747 Expanding the initial invariant variable set to contain all variables that do | |
748 not have definitions inside the loop does not seem to improve anything. | |
749 | |
750 Short Circuit Evaluation | |
Jim Stichnoth
2016/08/05 06:20:00
For consistency, only capitalize the first word of
manasijm
2016/08/05 17:34:43
Done.
| |
751 ^^^^^^^^^^^^^^^^^^^^^^^^ | |
752 | |
753 Short Circuit Evaluation splits nodes and introduces early jumps when the result | |
Jim Stichnoth
2016/08/05 06:19:59
I would also not capitalize all these words. To s
manasijm
2016/08/05 17:34:44
Done.
| |
754 of a logical operation can be determined early and there are no side effects of | |
Jim Stichnoth
2016/08/05 06:20:00
maybe "no observable side effects" ?
manasijm
2016/08/05 17:34:44
Done.
| |
755 skipping the rest of the computation. The instructions considered backwards from | |
756 the end of the basic blocks. | |
757 When a definition of a variable involved in a conditional jump is found, an | |
Jim Stichnoth
2016/08/05 06:20:00
reflow paragraph, or make it a new paragraph
manasijm
2016/08/05 17:34:44
Done.
| |
758 extra jump can be inserted in that location, moving the rest of the instructions | |
759 in the node to a newly inserted node. Consider this example:: | |
760 | |
761 __N : | |
762 a = <something> | |
763 Instruction 1 without side effect | |
764 ... b = <something> ... | |
765 Instruction N without side effect | |
766 t1 = or a b | |
767 br t1 __X __Y | |
768 | |
769 is transformed to :: | |
770 | |
771 __N : | |
772 a = <something> | |
773 br a __X __N_ext | |
774 | |
775 __N_ext : | |
776 Instruction 1 without side effect | |
777 ... b = <something> ... | |
778 Instruction N without side effect | |
779 br b __X __Y | |
780 | |
781 The logic for AND is analogous, the only difference is that the early jump is | |
782 facilitated by a ``false`` value instead of ``true``. | |
783 | |
784 Global Variable Splitting | |
785 ^^^^^^^^^^^^^^^^^^^^^^^^^ | |
786 | |
787 Global variable splitting (-split-global-vars) is run after register allocation. | |
Jim Stichnoth
2016/08/05 06:19:59
``-split-global-vars``
manasijm
2016/08/05 17:34:44
Done.
| |
788 It works on the variables that did not manage to get registers (but are allowed | |
789 to) and decomposes their live ranges into the individual segments (which span a | |
790 single node at most). New variables are created (but not yet used) with these | |
791 smaller live ranges and the register allocator is run again. This is not | |
792 inefficient as the old variables that already had registers are now considered | |
793 pre-colored. | |
794 | |
795 The new variables that get registers replace their parent variables for their | |
796 portion of its (parent's) live range. A copy from the old variable to the new | |
797 is introduced before the first use and the reverse after the last def in the | |
798 live range. | |
799 | |
800 | |
701 Basic phi lowering | 801 Basic phi lowering |
702 ^^^^^^^^^^^^^^^^^^ | 802 ^^^^^^^^^^^^^^^^^^ |
703 | 803 |
704 The simplest phi lowering strategy works as follows (this is how LLVM ``-O0`` | 804 The simplest phi lowering strategy works as follows (this is how LLVM ``-O0`` |
705 implements it). Consider this example:: | 805 implements it). Consider this example:: |
706 | 806 |
707 L1: | 807 L1: |
708 ... | 808 ... |
709 br L3 | 809 br L3 |
710 L2: | 810 L2: |
(...skipping 774 matching lines...) Expand 10 before | Expand all | Expand 10 after Loading... | |
1485 This could be mitigated by switching these to the CFG-local allocator. | 1585 This could be mitigated by switching these to the CFG-local allocator. |
1486 | 1586 |
1487 Third, multithreading may make the default allocator strategy more expensive. | 1587 Third, multithreading may make the default allocator strategy more expensive. |
1488 In a single-threaded environment, a pass will allocate its containers, run the | 1588 In a single-threaded environment, a pass will allocate its containers, run the |
1489 pass, and deallocate the containers. This results in stack-like allocation | 1589 pass, and deallocate the containers. This results in stack-like allocation |
1490 behavior and makes the heap free list easier to manage, with less heap | 1590 behavior and makes the heap free list easier to manage, with less heap |
1491 fragmentation. But when multithreading is added, the allocations and | 1591 fragmentation. But when multithreading is added, the allocations and |
1492 deallocations become much less stack-like, making allocation and deallocation | 1592 deallocations become much less stack-like, making allocation and deallocation |
1493 operations individually more expensive. Again, this could be mitigated by | 1593 operations individually more expensive. Again, this could be mitigated by |
1494 switching these to the CFG-local allocator. | 1594 switching these to the CFG-local allocator. |
OLD | NEW |