docs/DESIGN.rst - Issue 2217773003: Documentation for LCSE, LICM, Short-Circuit, Global-Splitting

Unified Diff: docs/DESIGN.rst

Issue 2217773003: Documentation for LCSE, LICM, Short-Circuit, Global-Splitting (Closed) Base URL: https://chromium.googlesource.com/native_client/pnacl-subzero.git@master

Patch Set: Address comments Created 4 years, 4 months ago

Use n/p to move between diff chunks; N/P to move between comments. Draft comments are only viewable by you.

Jump to:

View side-by-side diff with in-line comments

Download patch

Index: docs/DESIGN.rst

diff --git a/docs/DESIGN.rst b/docs/DESIGN.rst

index 8264a7aeac74bb408850e665ef0032c5476dca40..1e6dd69a1c3e5b695c939ac67649ee231e1ad3ea 100644

--- a/docs/DESIGN.rst

+++ b/docs/DESIGN.rst

@@ -49,37 +49,39 @@ selection of fast optimization passes. It has two optimization recipes: full

optimizations (``O2``) and minimal optimizations (``Om1``). The recipes are the

following (described in more detail below):

-+-----------------------------------+-----------------------------+

-| O2 recipe | Om1 recipe |

-+===================================+=============================+

-| Parse .pexe file | Parse .pexe file |

-+-----------------------------------+-----------------------------+

-| Loop nest analysis | |

-+-----------------------------------+-----------------------------+

-| Address mode inference | |

-+-----------------------------------+-----------------------------+

-| Read-modify-write (RMW) transform | |

-+-----------------------------------+-----------------------------+

-| Basic liveness analysis | |

-+-----------------------------------+-----------------------------+

-| Load optimization | |

-+-----------------------------------+-----------------------------+

-| | Phi lowering (simple) |

-+-----------------------------------+-----------------------------+

-| Target lowering | Target lowering |

-+-----------------------------------+-----------------------------+

-| Full liveness analysis | |

-+-----------------------------------+-----------------------------+

-| Register allocation | Minimal register allocation |

-+-----------------------------------+-----------------------------+

-| Phi lowering (advanced) | |

-+-----------------------------------+-----------------------------+

-| Post-phi register allocation | |

-+-----------------------------------+-----------------------------+

-| Branch optimization | |

-+-----------------------------------+-----------------------------+

-| Code emission | Code emission |

-+-----------------------------------+-----------------------------+

++----------------------------------------+-----------------------------+

+| O2 recipe | Om1 recipe |

++========================================+=============================+

+| Parse .pexe file | Parse .pexe file |

++----------------------------------------+-----------------------------+

+| Loop nest analysis | |

++----------------------------------------+-----------------------------+

+| Local common subexpression elimination | |

++----------------------------------------+-----------------------------+

+| Address mode inference | |

++----------------------------------------+-----------------------------+

+| Read-modify-write (RMW) transform | |

++----------------------------------------+-----------------------------+

+| Basic liveness analysis | |

++----------------------------------------+-----------------------------+

+| Load optimization | |

++----------------------------------------+-----------------------------+

+| | Phi lowering (simple) |

++----------------------------------------+-----------------------------+

+| Target lowering | Target lowering |

++----------------------------------------+-----------------------------+

+| Full liveness analysis | |

++----------------------------------------+-----------------------------+

+| Register allocation | Minimal register allocation |

++----------------------------------------+-----------------------------+

+| Phi lowering (advanced) | |

++----------------------------------------+-----------------------------+

+| Post-phi register allocation | |

++----------------------------------------+-----------------------------+

+| Branch optimization | |

++----------------------------------------+-----------------------------+

+| Code emission | Code emission |

++----------------------------------------+-----------------------------+

Goals

=====

@@ -518,6 +520,8 @@ Currently, the ``O2`` lowering recipe is the following:

- Loop nest analysis

+- Local common subexpression elimination

- Address mode inference

- Read-modify-write (RMW) transformation

@@ -698,6 +702,102 @@ that the Greedy allocator also improved maintainability because a lot of hacks

could be removed, but Subzero is probably not yet to that level of hacks, and is

less likely to see that particular benefit.)

+Local common subexpression elimination

+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

+The Local CSE implementation goes through each instruction and records a portion

+of each ``Seen`` instruction in a hashset-like container. That portion consists

+of the entire instruction except for any dest variable. That means ``A = X + Y``

+and ``B = X + Y`` will be considered to be 'equal' for this purpose. This allows

+us to detect common subexpressions.

+Whenever a repetition is detected, the redundant variables are stored in a

+container mapping the replacee to the replacement. In the case above, it would

+be ``MAP[B] = A`` assuming ``B = X + Y`` comes after ``A = X + Y``.

+At any point if a variable that has an entry in the replacement table is

+encountered, it is replaced with the variable it is mapped to. This ensures that

+the redundant variables will not have any uses in the basic block, allowing

+dead code elimination to clean up the redundant instruction.

+With SSA, the information stored is never invalidated. However, non-SSA input is

+supported with the ``-lcse=no-ssa`` option. This has to maintain some extra

+dependency information to ensure proper invalidation on variable assignment.

+This is not rigorously tested because this pass is run at an early stage where

+it is safe to assume variables have a single definition. This is not enabled by

+default because it bumps the compile time overhead from 2% to 6%.

+Loop-invariant code motion

+^^^^^^^^^^^^^^^^^^^^^^^^^^

+This pass utilizes the loop analysis information to hoist invariant instructions

+to loop pre-headers. A loop must have a single entry node (header) and that node

+must have a single external predecessor for this optimization to work, as it is

+currently implemented.

+The pass works by iterating over all instructions in the loop until the set of

+invariant instructions converges. In each iteration, a non-invariant instruction

+involving only constants or a variable known to be invariant is added to the

+result set. The destination variable of that instruction is added to the set of

+variables known to be invariant (which is initialized with the constant args).

+Improving the loop-analysis infrastructure is likely to have significant impact

+on this optimization. Inserting an extra node to act as the pre-header when the

+header has multiple incoming edges from outside could also be a good idea.

+Expanding the initial invariant variable set to contain all variables that do

+not have definitions inside the loop does not seem to improve anything.

+Short circuit evaluation

+^^^^^^^^^^^^^^^^^^^^^^^^

+Short circuit evaluation splits nodes and introduces early jumps when the result

+of a logical operation can be determined early and there are no observable side

+effects of skipping the rest of the computation. The instructions considered

+backwards from the end of the basic blocks. When a definition of a variable

+involved in a conditional jump is found, an extra jump can be inserted in that

+location, moving the rest of the instructions in the node to a newly inserted

+node. Consider this example::

+ __N :

+ a = <something>

+ Instruction 1 without side effect

+ ... b = <something> ...

+ Instruction N without side effect

+ t1 = or a b

+ br t1 __X __Y

+is transformed to::

+ __N :

+ a = <something>

+ br a __X __N_ext

+ __N_ext :

+ Instruction 1 without side effect

+ ... b = <something> ...

+ Instruction N without side effect

+ br b __X __Y

+The logic for AND is analogous, the only difference is that the early jump is

+facilitated by a ``false`` value instead of ``true``.

+Global Variable Splitting

+^^^^^^^^^^^^^^^^^^^^^^^^^

+Global variable splitting (``-split-global-vars``) is run after register

+allocation. It works on the variables that did not manage to get registers (but

+are allowed to) and decomposes their live ranges into the individual segments

+(which span a single node at most). New variables are created (but not yet used)

+with these smaller live ranges and the register allocator is run again. This is

+not inefficient as the old variables that already had registers are now

+considered pre-colored.

+The new variables that get registers replace their parent variables for their

+portion of its (parent's) live range. A copy from the old variable to the new

+is introduced before the first use and the reverse after the last def in the

+live range.

Jim Stichnoth 2016/08/05 19:23:59 Just a single blank line for consistency?

Basic phi lowering

^^^^^^^^^^^^^^^^^^

« no previous file with comments | « no previous file | no next file » | no next file with comments »