Chromium Code Reviews
chromiumcodereview-hr@appspot.gserviceaccount.com (chromiumcodereview-hr) | Please choose your nickname with Settings | Help | Chromium Project | Gerrit Changes | Sign out
(58)

Side by Side Diff: docs/DESIGN.rst

Issue 2217773003: Documentation for LCSE, LICM, Short-Circuit, Global-Splitting (Closed) Base URL: https://chromium.googlesource.com/native_client/pnacl-subzero.git@master
Patch Set: Remove extra line Created 4 years, 4 months ago
Use n/p to move between diff chunks; N/P to move between comments. Draft comments are only viewable by you.
Jump to:
View unified diff | Download patch
« no previous file with comments | « no previous file | no next file » | no next file with comments »
Toggle Intra-line Diffs ('i') | Expand Comments ('e') | Collapse Comments ('c') | Show Comments Hide Comments ('s')
OLDNEW
1 Design of the Subzero fast code generator 1 Design of the Subzero fast code generator
2 ========================================= 2 =========================================
3 3
4 Introduction 4 Introduction
5 ------------ 5 ------------
6 6
7 The `Portable Native Client (PNaCl) <http://gonacl.com>`_ project includes 7 The `Portable Native Client (PNaCl) <http://gonacl.com>`_ project includes
8 compiler technology based on `LLVM <http://llvm.org/>`_. The developer uses the 8 compiler technology based on `LLVM <http://llvm.org/>`_. The developer uses the
9 PNaCl toolchain to compile their application to architecture-neutral PNaCl 9 PNaCl toolchain to compile their application to architecture-neutral PNaCl
10 bitcode (a ``.pexe`` file), using as much architecture-neutral optimization as 10 bitcode (a ``.pexe`` file), using as much architecture-neutral optimization as
(...skipping 31 matching lines...) Expand 10 before | Expand all | Expand 10 after
42 42
43 Or, improve translator performance to something more reasonable. 43 Or, improve translator performance to something more reasonable.
44 44
45 This document describes Subzero's attempt to improve translation speed by an 45 This document describes Subzero's attempt to improve translation speed by an
46 order of magnitude while rivaling LLVM's code quality. Subzero does this 46 order of magnitude while rivaling LLVM's code quality. Subzero does this
47 through minimal IR layering, lean data structures and passes, and a careful 47 through minimal IR layering, lean data structures and passes, and a careful
48 selection of fast optimization passes. It has two optimization recipes: full 48 selection of fast optimization passes. It has two optimization recipes: full
49 optimizations (``O2``) and minimal optimizations (``Om1``). The recipes are the 49 optimizations (``O2``) and minimal optimizations (``Om1``). The recipes are the
50 following (described in more detail below): 50 following (described in more detail below):
51 51
52 +-----------------------------------+-----------------------------+ 52 +----------------------------------------+-----------------------------+
53 | O2 recipe | Om1 recipe | 53 | O2 recipe | Om1 recipe |
54 +===================================+=============================+ 54 +========================================+=============================+
55 | Parse .pexe file | Parse .pexe file | 55 | Parse .pexe file | Parse .pexe file |
56 +-----------------------------------+-----------------------------+ 56 +----------------------------------------+-----------------------------+
57 | Loop nest analysis | | 57 | Loop nest analysis | |
58 +-----------------------------------+-----------------------------+ 58 +----------------------------------------+-----------------------------+
59 | Address mode inference | | 59 | Local common subexpression elimination | |
60 +-----------------------------------+-----------------------------+ 60 +----------------------------------------+-----------------------------+
61 | Read-modify-write (RMW) transform | | 61 | Address mode inference | |
62 +-----------------------------------+-----------------------------+ 62 +----------------------------------------+-----------------------------+
63 | Basic liveness analysis | | 63 | Read-modify-write (RMW) transform | |
64 +-----------------------------------+-----------------------------+ 64 +----------------------------------------+-----------------------------+
65 | Load optimization | | 65 | Basic liveness analysis | |
66 +-----------------------------------+-----------------------------+ 66 +----------------------------------------+-----------------------------+
67 | | Phi lowering (simple) | 67 | Load optimization | |
68 +-----------------------------------+-----------------------------+ 68 +----------------------------------------+-----------------------------+
69 | Target lowering | Target lowering | 69 | | Phi lowering (simple) |
70 +-----------------------------------+-----------------------------+ 70 +----------------------------------------+-----------------------------+
71 | Full liveness analysis | | 71 | Target lowering | Target lowering |
72 +-----------------------------------+-----------------------------+ 72 +----------------------------------------+-----------------------------+
73 | Register allocation | Minimal register allocation | 73 | Full liveness analysis | |
74 +-----------------------------------+-----------------------------+ 74 +----------------------------------------+-----------------------------+
75 | Phi lowering (advanced) | | 75 | Register allocation | Minimal register allocation |
76 +-----------------------------------+-----------------------------+ 76 +----------------------------------------+-----------------------------+
77 | Post-phi register allocation | | 77 | Phi lowering (advanced) | |
78 +-----------------------------------+-----------------------------+ 78 +----------------------------------------+-----------------------------+
79 | Branch optimization | | 79 | Post-phi register allocation | |
80 +-----------------------------------+-----------------------------+ 80 +----------------------------------------+-----------------------------+
81 | Code emission | Code emission | 81 | Branch optimization | |
82 +-----------------------------------+-----------------------------+ 82 +----------------------------------------+-----------------------------+
83 | Code emission | Code emission |
84 +----------------------------------------+-----------------------------+
83 85
84 Goals 86 Goals
85 ===== 87 =====
86 88
87 Translation speed 89 Translation speed
88 ----------------- 90 -----------------
89 91
90 We'd like to be able to translate a ``.pexe`` file as fast as download speed. 92 We'd like to be able to translate a ``.pexe`` file as fast as download speed.
91 Any faster is in a sense wasted effort. Download speed varies greatly, but 93 Any faster is in a sense wasted effort. Download speed varies greatly, but
92 we'll arbitrarily say 1 MB/sec. We'll pick the ARM A15 CPU as the example of a 94 we'll arbitrarily say 1 MB/sec. We'll pick the ARM A15 CPU as the example of a
(...skipping 418 matching lines...) Expand 10 before | Expand all | Expand 10 after
511 register for ``T.inf``. This leads to ``T.inf=B`` being a redundant register 513 register for ``T.inf``. This leads to ``T.inf=B`` being a redundant register
512 copy, which is removed as an emission-time peephole optimization. 514 copy, which is removed as an emission-time peephole optimization.
513 515
514 O2 lowering 516 O2 lowering
515 ----------- 517 -----------
516 518
517 Currently, the ``O2`` lowering recipe is the following: 519 Currently, the ``O2`` lowering recipe is the following:
518 520
519 - Loop nest analysis 521 - Loop nest analysis
520 522
523 - Local common subexpression elimination
524
521 - Address mode inference 525 - Address mode inference
522 526
523 - Read-modify-write (RMW) transformation 527 - Read-modify-write (RMW) transformation
524 528
525 - Basic liveness analysis 529 - Basic liveness analysis
526 530
527 - Load optimization 531 - Load optimization
528 532
529 - Target lowering 533 - Target lowering
530 534
(...skipping 160 matching lines...) Expand 10 before | Expand all | Expand 10 after
691 this second-chance mechanism is disabled. 695 this second-chance mechanism is disabled.
692 696
693 Future work is to implement LLVM's `Greedy 697 Future work is to implement LLVM's `Greedy
694 <http://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html>`_ 698 <http://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html>`_
695 register allocator as a replacement for the basic linear-scan algorithm, given 699 register allocator as a replacement for the basic linear-scan algorithm, given
696 LLVM's experience with its improvement in code quality. (The blog post claims 700 LLVM's experience with its improvement in code quality. (The blog post claims
697 that the Greedy allocator also improved maintainability because a lot of hacks 701 that the Greedy allocator also improved maintainability because a lot of hacks
698 could be removed, but Subzero is probably not yet to that level of hacks, and is 702 could be removed, but Subzero is probably not yet to that level of hacks, and is
699 less likely to see that particular benefit.) 703 less likely to see that particular benefit.)
700 704
705 Local common subexpression elimination
706 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
707
708 The Local CSE implementation goes through each instruction and records a portion
709 of each ``Seen`` instruction in a hashset-like container. That portion consists
710 of the entire instruction except for any dest variable. That means ``A = X + Y``
711 and ``B = X + Y`` will be considered to be 'equal' for this purpose. This allows
712 us to detect common subexpressions.
713
714 Whenever a repetition is detected, the redundant variables are stored in a
715 container mapping the replacee to the replacement. In the case above, it would
716 be ``MAP[B] = A`` assuming ``B = X + Y`` comes after ``A = X + Y``.
717
718 At any point if a variable that has an entry in the replacement table is
719 encountered, it is replaced with the variable it is mapped to. This ensures that
720 the redundant variables will not have any uses in the basic block, allowing
721 dead code elimination to clean up the redundant instruction.
722
723 With SSA, the information stored is never invalidated. However, non-SSA input is
724 supported with the ``-lcse=no-ssa`` option. This has to maintain some extra
725 dependency information to ensure proper invalidation on variable assignment.
726 This is not rigorously tested because this pass is run at an early stage where
727 it is safe to assume variables have a single definition. This is not enabled by
728 default because it bumps the compile time overhead from 2% to 6%.
729
730 Loop-invariant code motion
731 ^^^^^^^^^^^^^^^^^^^^^^^^^^
732
733 This pass utilizes the loop analysis information to hoist invariant instructions
734 to loop pre-headers. A loop must have a single entry node (header) and that node
735 must have a single external predecessor for this optimization to work, as it is
736 currently implemented.
737
738 The pass works by iterating over all instructions in the loop until the set of
739 invariant instructions converges. In each iteration, a non-invariant instruction
740 involving only constants or a variable known to be invariant is added to the
741 result set. The destination variable of that instruction is added to the set of
742 variables known to be invariant (which is initialized with the constant args).
743
744 Improving the loop-analysis infrastructure is likely to have significant impact
745 on this optimization. Inserting an extra node to act as the pre-header when the
746 header has multiple incoming edges from outside could also be a good idea.
747 Expanding the initial invariant variable set to contain all variables that do
748 not have definitions inside the loop does not seem to improve anything.
749
750 Short circuit evaluation
751 ^^^^^^^^^^^^^^^^^^^^^^^^
752
753 Short circuit evaluation splits nodes and introduces early jumps when the result
754 of a logical operation can be determined early and there are no observable side
755 effects of skipping the rest of the computation. The instructions considered
756 backwards from the end of the basic blocks. When a definition of a variable
757 involved in a conditional jump is found, an extra jump can be inserted in that
758 location, moving the rest of the instructions in the node to a newly inserted
759 node. Consider this example::
760
761 __N :
762 a = <something>
763 Instruction 1 without side effect
764 ... b = <something> ...
765 Instruction N without side effect
766 t1 = or a b
767 br t1 __X __Y
768
769 is transformed to::
770
771 __N :
772 a = <something>
773 br a __X __N_ext
774
775 __N_ext :
776 Instruction 1 without side effect
777 ... b = <something> ...
778 Instruction N without side effect
779 br b __X __Y
780
781 The logic for AND is analogous, the only difference is that the early jump is
782 facilitated by a ``false`` value instead of ``true``.
783
784 Global Variable Splitting
785 ^^^^^^^^^^^^^^^^^^^^^^^^^
786
787 Global variable splitting (``-split-global-vars``) is run after register
788 allocation. It works on the variables that did not manage to get registers (but
789 are allowed to) and decomposes their live ranges into the individual segments
790 (which span a single node at most). New variables are created (but not yet used)
791 with these smaller live ranges and the register allocator is run again. This is
792 not inefficient as the old variables that already had registers are now
793 considered pre-colored.
794
795 The new variables that get registers replace their parent variables for their
796 portion of its (parent's) live range. A copy from the old variable to the new
797 is introduced before the first use and the reverse after the last def in the
798 live range.
799
701 Basic phi lowering 800 Basic phi lowering
702 ^^^^^^^^^^^^^^^^^^ 801 ^^^^^^^^^^^^^^^^^^
703 802
704 The simplest phi lowering strategy works as follows (this is how LLVM ``-O0`` 803 The simplest phi lowering strategy works as follows (this is how LLVM ``-O0``
705 implements it). Consider this example:: 804 implements it). Consider this example::
706 805
707 L1: 806 L1:
708 ... 807 ...
709 br L3 808 br L3
710 L2: 809 L2:
(...skipping 774 matching lines...) Expand 10 before | Expand all | Expand 10 after
1485 This could be mitigated by switching these to the CFG-local allocator. 1584 This could be mitigated by switching these to the CFG-local allocator.
1486 1585
1487 Third, multithreading may make the default allocator strategy more expensive. 1586 Third, multithreading may make the default allocator strategy more expensive.
1488 In a single-threaded environment, a pass will allocate its containers, run the 1587 In a single-threaded environment, a pass will allocate its containers, run the
1489 pass, and deallocate the containers. This results in stack-like allocation 1588 pass, and deallocate the containers. This results in stack-like allocation
1490 behavior and makes the heap free list easier to manage, with less heap 1589 behavior and makes the heap free list easier to manage, with less heap
1491 fragmentation. But when multithreading is added, the allocations and 1590 fragmentation. But when multithreading is added, the allocations and
1492 deallocations become much less stack-like, making allocation and deallocation 1591 deallocations become much less stack-like, making allocation and deallocation
1493 operations individually more expensive. Again, this could be mitigated by 1592 operations individually more expensive. Again, this could be mitigated by
1494 switching these to the CFG-local allocator. 1593 switching these to the CFG-local allocator.
OLDNEW
« no previous file with comments | « no previous file | no next file » | no next file with comments »

Powered by Google App Engine
This is Rietveld 408576698