Chromium Code Reviews| OLD | NEW |
|---|---|
| 1 Design of the Subzero fast code generator | 1 Design of the Subzero fast code generator |
| 2 ========================================= | 2 ========================================= |
| 3 | 3 |
| 4 Introduction | 4 Introduction |
| 5 ------------ | 5 ------------ |
| 6 | 6 |
| 7 The `Portable Native Client (PNaCl) <http://gonacl.com>`_ project includes | 7 The `Portable Native Client (PNaCl) <http://gonacl.com>`_ project includes |
| 8 compiler technology based on `LLVM <http://llvm.org/>`_. The developer uses the | 8 compiler technology based on `LLVM <http://llvm.org/>`_. The developer uses the |
| 9 PNaCl toolchain to compile their application to architecture-neutral PNaCl | 9 PNaCl toolchain to compile their application to architecture-neutral PNaCl |
| 10 bitcode (a ``.pexe`` file), using as much architecture-neutral optimization as | 10 bitcode (a ``.pexe`` file), using as much architecture-neutral optimization as |
| (...skipping 38 matching lines...) Expand 10 before | Expand all | Expand 10 after Loading... | |
| 49 optimizations (``O2``) and minimal optimizations (``Om1``). The recipes are the | 49 optimizations (``O2``) and minimal optimizations (``Om1``). The recipes are the |
| 50 following (described in more detail below): | 50 following (described in more detail below): |
| 51 | 51 |
| 52 +-----------------------------------+-----------------------------+ | 52 +-----------------------------------+-----------------------------+ |
| 53 | O2 recipe | Om1 recipe | | 53 | O2 recipe | Om1 recipe | |
| 54 +===================================+=============================+ | 54 +===================================+=============================+ |
| 55 | Parse .pexe file | Parse .pexe file | | 55 | Parse .pexe file | Parse .pexe file | |
| 56 +-----------------------------------+-----------------------------+ | 56 +-----------------------------------+-----------------------------+ |
| 57 | Loop nest analysis | | | 57 | Loop nest analysis | | |
| 58 +-----------------------------------+-----------------------------+ | 58 +-----------------------------------+-----------------------------+ |
| 59 | Local common sub-exp elimination | | | |
|
Jim Stichnoth
2016/08/05 06:20:00
I think you can spell out "subexpression" and (pai
manasijm
2016/08/05 06:57:09
Won't be that painful!
I might not be an emacs nin
manasijm
2016/08/05 17:34:44
Done.
| |
| 60 +-----------------------------------+-----------------------------+ | |
| 59 | Address mode inference | | | 61 | Address mode inference | | |
| 60 +-----------------------------------+-----------------------------+ | 62 +-----------------------------------+-----------------------------+ |
| 61 | Read-modify-write (RMW) transform | | | 63 | Read-modify-write (RMW) transform | | |
| 62 +-----------------------------------+-----------------------------+ | 64 +-----------------------------------+-----------------------------+ |
| 63 | Basic liveness analysis | | | 65 | Basic liveness analysis | | |
| 64 +-----------------------------------+-----------------------------+ | 66 +-----------------------------------+-----------------------------+ |
| 65 | Load optimization | | | 67 | Load optimization | | |
| 66 +-----------------------------------+-----------------------------+ | 68 +-----------------------------------+-----------------------------+ |
| 67 | | Phi lowering (simple) | | 69 | | Phi lowering (simple) | |
| 68 +-----------------------------------+-----------------------------+ | 70 +-----------------------------------+-----------------------------+ |
| (...skipping 442 matching lines...) Expand 10 before | Expand all | Expand 10 after Loading... | |
| 511 register for ``T.inf``. This leads to ``T.inf=B`` being a redundant register | 513 register for ``T.inf``. This leads to ``T.inf=B`` being a redundant register |
| 512 copy, which is removed as an emission-time peephole optimization. | 514 copy, which is removed as an emission-time peephole optimization. |
| 513 | 515 |
| 514 O2 lowering | 516 O2 lowering |
| 515 ----------- | 517 ----------- |
| 516 | 518 |
| 517 Currently, the ``O2`` lowering recipe is the following: | 519 Currently, the ``O2`` lowering recipe is the following: |
| 518 | 520 |
| 519 - Loop nest analysis | 521 - Loop nest analysis |
| 520 | 522 |
| 523 - Local common sub-expression elimination | |
|
Jim Stichnoth
2016/08/05 06:20:00
"Subexpression" is usually written without the hyp
manasijm
2016/08/05 17:34:44
Done.
| |
| 524 | |
| 521 - Address mode inference | 525 - Address mode inference |
| 522 | 526 |
| 523 - Read-modify-write (RMW) transformation | 527 - Read-modify-write (RMW) transformation |
| 524 | 528 |
| 525 - Basic liveness analysis | 529 - Basic liveness analysis |
| 526 | 530 |
| 527 - Load optimization | 531 - Load optimization |
| 528 | 532 |
| 529 - Target lowering | 533 - Target lowering |
| 530 | 534 |
| (...skipping 160 matching lines...) Expand 10 before | Expand all | Expand 10 after Loading... | |
| 691 this second-chance mechanism is disabled. | 695 this second-chance mechanism is disabled. |
| 692 | 696 |
| 693 Future work is to implement LLVM's `Greedy | 697 Future work is to implement LLVM's `Greedy |
| 694 <http://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html>`_ | 698 <http://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html>`_ |
| 695 register allocator as a replacement for the basic linear-scan algorithm, given | 699 register allocator as a replacement for the basic linear-scan algorithm, given |
| 696 LLVM's experience with its improvement in code quality. (The blog post claims | 700 LLVM's experience with its improvement in code quality. (The blog post claims |
| 697 that the Greedy allocator also improved maintainability because a lot of hacks | 701 that the Greedy allocator also improved maintainability because a lot of hacks |
| 698 could be removed, but Subzero is probably not yet to that level of hacks, and is | 702 could be removed, but Subzero is probably not yet to that level of hacks, and is |
| 699 less likely to see that particular benefit.) | 703 less likely to see that particular benefit.) |
| 700 | 704 |
| 705 Local common sub-expression elimination | |
| 706 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| 707 | |
| 708 The Local CSE implementation goes through each instruction and record ``Seen`` | |
|
Jim Stichnoth
2016/08/05 06:20:00
records
manasijm
2016/08/05 17:34:44
Done.
| |
| 709 instructions in a ``CfgUnorderedSet``. The equality comparator for this | |
|
Jim Stichnoth
2016/08/05 06:20:00
I don't think the level of detail in this document
manasijm
2016/08/05 17:34:44
Done.
| |
| 710 container ignores the destination of the instructions. That means ``A = X + Y`` | |
| 711 and ``B = X + Y`` will be considered to be 'equal' for this purpose. This allows | |
| 712 us to detect common sub-expressions. | |
| 713 | |
| 714 Whenever a repetition is detected, the redundant variables are stored in a | |
| 715 container mapping the replacee to the replacement. In the case above, it would | |
| 716 be ``MAP[B] = A`` assuming ``B = X + Y`` comes after ``A = X + Y``. | |
| 717 | |
| 718 At any point if a variable that has an entry in the replacement table is | |
| 719 encountered, it is replaced with the variable it is mapped to. This ensures that | |
| 720 the redundant variables will not have any uses in the basic block, allowing | |
| 721 dead code elimination to clean up the redundant instruction. | |
| 722 | |
| 723 With SSA, the information stored is never invalidated. However, non-SSA input is | |
| 724 supported with the -lcse=no-ssa option. This has to maintain some extra | |
|
Jim Stichnoth
2016/08/05 06:19:59
``-lcse=no-ssa``
manasijm
2016/08/05 17:34:43
Done.
| |
| 725 dependency information to ensure proper invalidation on variable assignment. | |
| 726 This is not rigorously tested because this pass is run at an early stage where | |
| 727 it is safe to assume variables have a single definition. This is not enabled by | |
| 728 default because it bumps the compile time overhead from 2% to 6%. | |
| 729 | |
| 730 Loop invariant code motion | |
|
Jim Stichnoth
2016/08/05 06:20:00
I think it's more common to hyphenate "loop-invari
manasijm
2016/08/05 17:34:44
Done.
| |
| 731 ^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| 732 | |
| 733 This pass utilizes the loop analysis information to hoist invariant instructions | |
| 734 to loop pre-headers. A loop must have a single entry node (header) and that node | |
| 735 must have a single external predecessor for this optimization to work, as it is | |
| 736 currently implemented. | |
| 737 | |
| 738 The pass works by iterating over all instructions in the loop until the set of | |
| 739 invariant instructions converge. In each iteration, a non-invariant instruction | |
|
Jim Stichnoth
2016/08/05 06:20:00
converges
manasijm
2016/08/05 17:34:44
Done.
| |
| 740 involving only constants or a variable known to be invariant is added to the | |
| 741 result set. The destination variable of that instruction is added to the set of | |
| 742 variables known to be invariant (which is initialized with the constant args). | |
| 743 | |
| 744 Improving the loop-analysis infrastructure is likely to have significant effect | |
|
Jim Stichnoth
2016/08/05 06:20:00
maybe s/effect/impact/ ?
manasijm
2016/08/05 17:34:43
Done.
| |
| 745 on this optimization. Inserting an extra node to act as the pre-header when the | |
| 746 header has multiple incoming edges from outside could also be good idea. | |
|
Jim Stichnoth
2016/08/05 06:20:00
be a good
manasijm
2016/08/05 17:34:44
Done.
| |
| 747 Expanding the initial invariant variable set to contain all variables that do | |
| 748 not have definitions inside the loop does not seem to improve anything. | |
| 749 | |
| 750 Short Circuit Evaluation | |
|
Jim Stichnoth
2016/08/05 06:20:00
For consistency, only capitalize the first word of
manasijm
2016/08/05 17:34:43
Done.
| |
| 751 ^^^^^^^^^^^^^^^^^^^^^^^^ | |
| 752 | |
| 753 Short Circuit Evaluation splits nodes and introduces early jumps when the result | |
|
Jim Stichnoth
2016/08/05 06:19:59
I would also not capitalize all these words. To s
manasijm
2016/08/05 17:34:44
Done.
| |
| 754 of a logical operation can be determined early and there are no side effects of | |
|
Jim Stichnoth
2016/08/05 06:20:00
maybe "no observable side effects" ?
manasijm
2016/08/05 17:34:44
Done.
| |
| 755 skipping the rest of the computation. The instructions considered backwards from | |
| 756 the end of the basic blocks. | |
| 757 When a definition of a variable involved in a conditional jump is found, an | |
|
Jim Stichnoth
2016/08/05 06:20:00
reflow paragraph, or make it a new paragraph
manasijm
2016/08/05 17:34:44
Done.
| |
| 758 extra jump can be inserted in that location, moving the rest of the instructions | |
| 759 in the node to a newly inserted node. Consider this example:: | |
| 760 | |
| 761 __N : | |
| 762 a = <something> | |
| 763 Instruction 1 without side effect | |
| 764 ... b = <something> ... | |
| 765 Instruction N without side effect | |
| 766 t1 = or a b | |
| 767 br t1 __X __Y | |
| 768 | |
| 769 is transformed to :: | |
| 770 | |
| 771 __N : | |
| 772 a = <something> | |
| 773 br a __X __N_ext | |
| 774 | |
| 775 __N_ext : | |
| 776 Instruction 1 without side effect | |
| 777 ... b = <something> ... | |
| 778 Instruction N without side effect | |
| 779 br b __X __Y | |
| 780 | |
| 781 The logic for AND is analogous, the only difference is that the early jump is | |
| 782 facilitated by a ``false`` value instead of ``true``. | |
| 783 | |
| 784 Global Variable Splitting | |
| 785 ^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| 786 | |
| 787 Global variable splitting (-split-global-vars) is run after register allocation. | |
|
Jim Stichnoth
2016/08/05 06:19:59
``-split-global-vars``
manasijm
2016/08/05 17:34:44
Done.
| |
| 788 It works on the variables that did not manage to get registers (but are allowed | |
| 789 to) and decomposes their live ranges into the individual segments (which span a | |
| 790 single node at most). New variables are created (but not yet used) with these | |
| 791 smaller live ranges and the register allocator is run again. This is not | |
| 792 inefficient as the old variables that already had registers are now considered | |
| 793 pre-colored. | |
| 794 | |
| 795 The new variables that get registers replace their parent variables for their | |
| 796 portion of its (parent's) live range. A copy from the old variable to the new | |
| 797 is introduced before the first use and the reverse after the last def in the | |
| 798 live range. | |
| 799 | |
| 800 | |
| 701 Basic phi lowering | 801 Basic phi lowering |
| 702 ^^^^^^^^^^^^^^^^^^ | 802 ^^^^^^^^^^^^^^^^^^ |
| 703 | 803 |
| 704 The simplest phi lowering strategy works as follows (this is how LLVM ``-O0`` | 804 The simplest phi lowering strategy works as follows (this is how LLVM ``-O0`` |
| 705 implements it). Consider this example:: | 805 implements it). Consider this example:: |
| 706 | 806 |
| 707 L1: | 807 L1: |
| 708 ... | 808 ... |
| 709 br L3 | 809 br L3 |
| 710 L2: | 810 L2: |
| (...skipping 774 matching lines...) Expand 10 before | Expand all | Expand 10 after Loading... | |
| 1485 This could be mitigated by switching these to the CFG-local allocator. | 1585 This could be mitigated by switching these to the CFG-local allocator. |
| 1486 | 1586 |
| 1487 Third, multithreading may make the default allocator strategy more expensive. | 1587 Third, multithreading may make the default allocator strategy more expensive. |
| 1488 In a single-threaded environment, a pass will allocate its containers, run the | 1588 In a single-threaded environment, a pass will allocate its containers, run the |
| 1489 pass, and deallocate the containers. This results in stack-like allocation | 1589 pass, and deallocate the containers. This results in stack-like allocation |
| 1490 behavior and makes the heap free list easier to manage, with less heap | 1590 behavior and makes the heap free list easier to manage, with less heap |
| 1491 fragmentation. But when multithreading is added, the allocations and | 1591 fragmentation. But when multithreading is added, the allocations and |
| 1492 deallocations become much less stack-like, making allocation and deallocation | 1592 deallocations become much less stack-like, making allocation and deallocation |
| 1493 operations individually more expensive. Again, this could be mitigated by | 1593 operations individually more expensive. Again, this could be mitigated by |
| 1494 switching these to the CFG-local allocator. | 1594 switching these to the CFG-local allocator. |
| OLD | NEW |