Index: docs/DESIGN.rst |
diff --git a/docs/DESIGN.rst b/docs/DESIGN.rst |
index 8264a7aeac74bb408850e665ef0032c5476dca40..363c19c6178a61dddf12fd95ca5bc2c996206339 100644 |
--- a/docs/DESIGN.rst |
+++ b/docs/DESIGN.rst |
@@ -49,37 +49,39 @@ selection of fast optimization passes. It has two optimization recipes: full |
optimizations (``O2``) and minimal optimizations (``Om1``). The recipes are the |
following (described in more detail below): |
-+-----------------------------------+-----------------------------+ |
-| O2 recipe | Om1 recipe | |
-+===================================+=============================+ |
-| Parse .pexe file | Parse .pexe file | |
-+-----------------------------------+-----------------------------+ |
-| Loop nest analysis | | |
-+-----------------------------------+-----------------------------+ |
-| Address mode inference | | |
-+-----------------------------------+-----------------------------+ |
-| Read-modify-write (RMW) transform | | |
-+-----------------------------------+-----------------------------+ |
-| Basic liveness analysis | | |
-+-----------------------------------+-----------------------------+ |
-| Load optimization | | |
-+-----------------------------------+-----------------------------+ |
-| | Phi lowering (simple) | |
-+-----------------------------------+-----------------------------+ |
-| Target lowering | Target lowering | |
-+-----------------------------------+-----------------------------+ |
-| Full liveness analysis | | |
-+-----------------------------------+-----------------------------+ |
-| Register allocation | Minimal register allocation | |
-+-----------------------------------+-----------------------------+ |
-| Phi lowering (advanced) | | |
-+-----------------------------------+-----------------------------+ |
-| Post-phi register allocation | | |
-+-----------------------------------+-----------------------------+ |
-| Branch optimization | | |
-+-----------------------------------+-----------------------------+ |
-| Code emission | Code emission | |
-+-----------------------------------+-----------------------------+ |
++----------------------------------------+-----------------------------+ |
+| O2 recipe | Om1 recipe | |
++========================================+=============================+ |
+| Parse .pexe file | Parse .pexe file | |
++----------------------------------------+-----------------------------+ |
+| Loop nest analysis | | |
++----------------------------------------+-----------------------------+ |
+| Local common subexpression elimination | | |
++----------------------------------------+-----------------------------+ |
+| Address mode inference | | |
++----------------------------------------+-----------------------------+ |
+| Read-modify-write (RMW) transform | | |
++----------------------------------------+-----------------------------+ |
+| Basic liveness analysis | | |
++----------------------------------------+-----------------------------+ |
+| Load optimization | | |
++----------------------------------------+-----------------------------+ |
+| | Phi lowering (simple) | |
++----------------------------------------+-----------------------------+ |
+| Target lowering | Target lowering | |
++----------------------------------------+-----------------------------+ |
+| Full liveness analysis | | |
++----------------------------------------+-----------------------------+ |
+| Register allocation | Minimal register allocation | |
++----------------------------------------+-----------------------------+ |
+| Phi lowering (advanced) | | |
++----------------------------------------+-----------------------------+ |
+| Post-phi register allocation | | |
++----------------------------------------+-----------------------------+ |
+| Branch optimization | | |
++----------------------------------------+-----------------------------+ |
+| Code emission | Code emission | |
++----------------------------------------+-----------------------------+ |
Goals |
===== |
@@ -518,6 +520,8 @@ Currently, the ``O2`` lowering recipe is the following: |
- Loop nest analysis |
+- Local common subexpression elimination |
+ |
- Address mode inference |
- Read-modify-write (RMW) transformation |
@@ -698,6 +702,101 @@ that the Greedy allocator also improved maintainability because a lot of hacks |
could be removed, but Subzero is probably not yet to that level of hacks, and is |
less likely to see that particular benefit.) |
+Local common subexpression elimination |
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
+ |
+The Local CSE implementation goes through each instruction and records a portion |
+of each ``Seen`` instruction in a hashset-like container. That portion consists |
+of the entire instruction except for any dest variable. That means ``A = X + Y`` |
+and ``B = X + Y`` will be considered to be 'equal' for this purpose. This allows |
+us to detect common subexpressions. |
+ |
+Whenever a repetition is detected, the redundant variables are stored in a |
+container mapping the replacee to the replacement. In the case above, it would |
+be ``MAP[B] = A`` assuming ``B = X + Y`` comes after ``A = X + Y``. |
+ |
+At any point if a variable that has an entry in the replacement table is |
+encountered, it is replaced with the variable it is mapped to. This ensures that |
+the redundant variables will not have any uses in the basic block, allowing |
+dead code elimination to clean up the redundant instruction. |
+ |
+With SSA, the information stored is never invalidated. However, non-SSA input is |
+supported with the ``-lcse=no-ssa`` option. This has to maintain some extra |
+dependency information to ensure proper invalidation on variable assignment. |
+This is not rigorously tested because this pass is run at an early stage where |
+it is safe to assume variables have a single definition. This is not enabled by |
+default because it bumps the compile time overhead from 2% to 6%. |
+ |
+Loop-invariant code motion |
+^^^^^^^^^^^^^^^^^^^^^^^^^^ |
+ |
+This pass utilizes the loop analysis information to hoist invariant instructions |
+to loop pre-headers. A loop must have a single entry node (header) and that node |
+must have a single external predecessor for this optimization to work, as it is |
+currently implemented. |
+ |
+The pass works by iterating over all instructions in the loop until the set of |
+invariant instructions converges. In each iteration, a non-invariant instruction |
+involving only constants or a variable known to be invariant is added to the |
+result set. The destination variable of that instruction is added to the set of |
+variables known to be invariant (which is initialized with the constant args). |
+ |
+Improving the loop-analysis infrastructure is likely to have significant impact |
+on this optimization. Inserting an extra node to act as the pre-header when the |
+header has multiple incoming edges from outside could also be a good idea. |
+Expanding the initial invariant variable set to contain all variables that do |
+not have definitions inside the loop does not seem to improve anything. |
+ |
+Short circuit evaluation |
+^^^^^^^^^^^^^^^^^^^^^^^^ |
+ |
+Short circuit evaluation splits nodes and introduces early jumps when the result |
+of a logical operation can be determined early and there are no observable side |
+effects of skipping the rest of the computation. The instructions considered |
+backwards from the end of the basic blocks. When a definition of a variable |
+involved in a conditional jump is found, an extra jump can be inserted in that |
+location, moving the rest of the instructions in the node to a newly inserted |
+node. Consider this example:: |
+ |
+ __N : |
+ a = <something> |
+ Instruction 1 without side effect |
+ ... b = <something> ... |
+ Instruction N without side effect |
+ t1 = or a b |
+ br t1 __X __Y |
+ |
+is transformed to:: |
+ |
+ __N : |
+ a = <something> |
+ br a __X __N_ext |
+ |
+ __N_ext : |
+ Instruction 1 without side effect |
+ ... b = <something> ... |
+ Instruction N without side effect |
+ br b __X __Y |
+ |
+The logic for AND is analogous, the only difference is that the early jump is |
+facilitated by a ``false`` value instead of ``true``. |
+ |
+Global Variable Splitting |
+^^^^^^^^^^^^^^^^^^^^^^^^^ |
+ |
+Global variable splitting (``-split-global-vars``) is run after register |
+allocation. It works on the variables that did not manage to get registers (but |
+are allowed to) and decomposes their live ranges into the individual segments |
+(which span a single node at most). New variables are created (but not yet used) |
+with these smaller live ranges and the register allocator is run again. This is |
+not inefficient as the old variables that already had registers are now |
+considered pre-colored. |
+ |
+The new variables that get registers replace their parent variables for their |
+portion of its (parent's) live range. A copy from the old variable to the new |
+is introduced before the first use and the reverse after the last def in the |
+live range. |
+ |
Basic phi lowering |
^^^^^^^^^^^^^^^^^^ |