docs/DESIGN.rst - Issue 2217773003: Documentation for LCSE, LICM, Short-Circuit, Global-Splitting

Side by Side Diff: docs/DESIGN.rst

Issue 2217773003: Documentation for LCSE, LICM, Short-Circuit, Global-Splitting (Closed) Base URL: https://chromium.googlesource.com/native_client/pnacl-subzero.git@master

Patch Set: Created 4 years, 4 months ago

Use n/p to move between diff chunks; N/P to move between comments. Draft comments are only viewable by you.

Jump to:

View unified diff | Download patch

OLD	NEW
1 Design of the Subzero fast code generator	1 Design of the Subzero fast code generator

2 =========================================	2 =========================================

3	3

4 Introduction	4 Introduction

5 ------------	5 ------------

6	6

7 The `Portable Native Client (PNaCl) <http://gonacl.com>`_ project includes	7 The `Portable Native Client (PNaCl) <http://gonacl.com>`_ project includes

8 compiler technology based on `LLVM <http://llvm.org/>`_. The developer uses the	8 compiler technology based on `LLVM <http://llvm.org/>`_. The developer uses the

9 PNaCl toolchain to compile their application to architecture-neutral PNaCl	9 PNaCl toolchain to compile their application to architecture-neutral PNaCl

10 bitcode (a ``.pexe`` file), using as much architecture-neutral optimization as	10 bitcode (a ``.pexe`` file), using as much architecture-neutral optimization as

(...skipping 38 matching lines...) Expand 10 before \| Expand all \| Expand 10 after Loading...
49 optimizations (``O2``) and minimal optimizations (``Om1``). The recipes are the	49 optimizations (``O2``) and minimal optimizations (``Om1``). The recipes are the

50 following (described in more detail below):	50 following (described in more detail below):

51	51

52 +-----------------------------------+-----------------------------+	52 +-----------------------------------+-----------------------------+

53 \| O2 recipe \| Om1 recipe \|	53 \| O2 recipe \| Om1 recipe \|

54 +===================================+=============================+	54 +===================================+=============================+

55 \| Parse .pexe file \| Parse .pexe file \|	55 \| Parse .pexe file \| Parse .pexe file \|

56 +-----------------------------------+-----------------------------+	56 +-----------------------------------+-----------------------------+

57 \| Loop nest analysis \| \|	57 \| Loop nest analysis \| \|

58 +-----------------------------------+-----------------------------+	58 +-----------------------------------+-----------------------------+

	59 \| Local common sub-exp elimination \| \|
	Jim Stichnoth 2016/08/05 06:20:00 I think you can spell out "subexpression" and (pai I think you can spell out "subexpression" and (painfully, sorry) expand the columns, and still keep it with 80-col. manasijm 2016/08/05 06:57:09 Won't be that painful! I might not be an emacs nin Show quoted text On 2016/08/05 06:20:00, stichnot wrote: > I think you can spell out "subexpression" and (painfully, sorry) expand the > columns, and still keep it with 80-col. Won't be that painful! I might not be an emacs ninja, but kdevelop/kate has picked up some useful features. https://kate-editor.org/2013/09/09/multi-line-text-editing-in-kate/ manasijm 2016/08/05 17:34:44 Done. Show quoted text On 2016/08/05 06:57:09, manasijm wrote: > On 2016/08/05 06:20:00, stichnot wrote: > > I think you can spell out "subexpression" and (painfully, sorry) expand the > > columns, and still keep it with 80-col. > > Won't be that painful! > I might not be an emacs ninja, but kdevelop/kate has picked up some useful > features. > https://kate-editor.org/2013/09/09/multi-line-text-editing-in-kate/ Done.
	60 +-----------------------------------+-----------------------------+

59 \| Address mode inference \| \|	61 \| Address mode inference \| \|

60 +-----------------------------------+-----------------------------+	62 +-----------------------------------+-----------------------------+

61 \| Read-modify-write (RMW) transform \| \|	63 \| Read-modify-write (RMW) transform \| \|

62 +-----------------------------------+-----------------------------+	64 +-----------------------------------+-----------------------------+

63 \| Basic liveness analysis \| \|	65 \| Basic liveness analysis \| \|

64 +-----------------------------------+-----------------------------+	66 +-----------------------------------+-----------------------------+

65 \| Load optimization \| \|	67 \| Load optimization \| \|

66 +-----------------------------------+-----------------------------+	68 +-----------------------------------+-----------------------------+

67 \| \| Phi lowering (simple) \|	69 \| \| Phi lowering (simple) \|

68 +-----------------------------------+-----------------------------+	70 +-----------------------------------+-----------------------------+

(...skipping 442 matching lines...) Expand 10 before \| Expand all \| Expand 10 after Loading...
511 register for ``T.inf``. This leads to ``T.inf=B`` being a redundant register	513 register for ``T.inf``. This leads to ``T.inf=B`` being a redundant register

512 copy, which is removed as an emission-time peephole optimization.	514 copy, which is removed as an emission-time peephole optimization.

513	515

514 O2 lowering	516 O2 lowering

515 -----------	517 -----------

516	518

517 Currently, the ``O2`` lowering recipe is the following:	519 Currently, the ``O2`` lowering recipe is the following:

518	520

519 - Loop nest analysis	521 - Loop nest analysis

520	522

	523 - Local common sub-expression elimination
	Jim Stichnoth 2016/08/05 06:20:00 "Subexpression" is usually written without the hyp "Subexpression" is usually written without the hyphen. E.g. https://en.wikipedia.org/wiki/Common_subexpression_elimination manasijm 2016/08/05 17:34:44 Done. Show quoted text On 2016/08/05 06:20:00, stichnot wrote: > "Subexpression" is usually written without the hyphen. > E.g. https://en.wikipedia.org/wiki/Common_subexpression_elimination Done.
	524

521 - Address mode inference	525 - Address mode inference

522	526

523 - Read-modify-write (RMW) transformation	527 - Read-modify-write (RMW) transformation

524	528

525 - Basic liveness analysis	529 - Basic liveness analysis

526	530

527 - Load optimization	531 - Load optimization

528	532

529 - Target lowering	533 - Target lowering

530	534

(...skipping 160 matching lines...) Expand 10 before \| Expand all \| Expand 10 after Loading...
691 this second-chance mechanism is disabled.	695 this second-chance mechanism is disabled.

692	696

693 Future work is to implement LLVM's `Greedy	697 Future work is to implement LLVM's `Greedy

694 <http://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html>`_	698 <http://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html>`_

695 register allocator as a replacement for the basic linear-scan algorithm, given	699 register allocator as a replacement for the basic linear-scan algorithm, given

696 LLVM's experience with its improvement in code quality. (The blog post claims	700 LLVM's experience with its improvement in code quality. (The blog post claims

697 that the Greedy allocator also improved maintainability because a lot of hacks	701 that the Greedy allocator also improved maintainability because a lot of hacks

698 could be removed, but Subzero is probably not yet to that level of hacks, and is	702 could be removed, but Subzero is probably not yet to that level of hacks, and is

699 less likely to see that particular benefit.)	703 less likely to see that particular benefit.)

700	704

	705 Local common sub-expression elimination

	706 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

	707

	708 The Local CSE implementation goes through each instruction and record ``Seen``
	Jim Stichnoth 2016/08/05 06:20:00 records records manasijm 2016/08/05 17:34:44 Done. Show quoted text On 2016/08/05 06:20:00, stichnot wrote: > records Done.
	709 instructions in a ``CfgUnorderedSet``. The equality comparator for this
	Jim Stichnoth 2016/08/05 06:20:00 I don't think the level of detail in this document I don't think the level of detail in this document ever talks about specific Subzero data structure names. How about something like this? The Local CSE implementation goes through each instruction and records a portion of each ``Seen`` instruction in a hashset-like container. That portion consists of the entire instruction except for any dest variable. That means... manasijm 2016/08/05 17:34:44 Done. Show quoted text On 2016/08/05 06:20:00, stichnot wrote: > I don't think the level of detail in this document ever talks about specific > Subzero data structure names. How about something like this? > > The Local CSE implementation goes through each instruction and records a portion > of each ``Seen`` instruction in a hashset-like container. That portion consists > of the entire instruction except for any dest variable. That means... Done.
	710 container ignores the destination of the instructions. That means ``A = X + Y``

	711 and ``B = X + Y`` will be considered to be 'equal' for this purpose. This allows

	712 us to detect common sub-expressions.

	713

	714 Whenever a repetition is detected, the redundant variables are stored in a

	715 container mapping the replacee to the replacement. In the case above, it would

	716 be ``MAP[B] = A`` assuming ``B = X + Y`` comes after ``A = X + Y``.

	717

	718 At any point if a variable that has an entry in the replacement table is

	719 encountered, it is replaced with the variable it is mapped to. This ensures that

	720 the redundant variables will not have any uses in the basic block, allowing

	721 dead code elimination to clean up the redundant instruction.

	722

	723 With SSA, the information stored is never invalidated. However, non-SSA input is

	724 supported with the -lcse=no-ssa option. This has to maintain some extra
	Jim Stichnoth 2016/08/05 06:19:59 ``-lcse=no-ssa`` ``-lcse=no-ssa`` manasijm 2016/08/05 17:34:43 Done. Show quoted text On 2016/08/05 06:19:59, stichnot wrote: > ``-lcse=no-ssa`` Done.
	725 dependency information to ensure proper invalidation on variable assignment.

	726 This is not rigorously tested because this pass is run at an early stage where

	727 it is safe to assume variables have a single definition. This is not enabled by

	728 default because it bumps the compile time overhead from 2% to 6%.

	729

	730 Loop invariant code motion
	Jim Stichnoth 2016/08/05 06:20:00 I think it's more common to hyphenate "loop-invari I think it's more common to hyphenate "loop-invariant" here. manasijm 2016/08/05 17:34:44 Done. Show quoted text On 2016/08/05 06:20:00, stichnot wrote: > I think it's more common to hyphenate "loop-invariant" here. Done.
	731 ^^^^^^^^^^^^^^^^^^^^^^^^^^

	732

	733 This pass utilizes the loop analysis information to hoist invariant instructions

	734 to loop pre-headers. A loop must have a single entry node (header) and that node

	735 must have a single external predecessor for this optimization to work, as it is

	736 currently implemented.

	737

	738 The pass works by iterating over all instructions in the loop until the set of

	739 invariant instructions converge. In each iteration, a non-invariant instruction
	Jim Stichnoth 2016/08/05 06:20:00 converges converges manasijm 2016/08/05 17:34:44 Done. Show quoted text On 2016/08/05 06:20:00, stichnot wrote: > converges Done.
	740 involving only constants or a variable known to be invariant is added to the

	741 result set. The destination variable of that instruction is added to the set of

	742 variables known to be invariant (which is initialized with the constant args).

	743

	744 Improving the loop-analysis infrastructure is likely to have significant effect
	Jim Stichnoth 2016/08/05 06:20:00 maybe s/effect/impact/ ? maybe s/effect/impact/ ? manasijm 2016/08/05 17:34:43 Done. Show quoted text On 2016/08/05 06:20:00, stichnot wrote: > maybe s/effect/impact/ ? Done.
	745 on this optimization. Inserting an extra node to act as the pre-header when the

	746 header has multiple incoming edges from outside could also be good idea.
	Jim Stichnoth 2016/08/05 06:20:00 be a good be a good manasijm 2016/08/05 17:34:44 Done. Show quoted text On 2016/08/05 06:20:00, stichnot wrote: > be a good Done.
	747 Expanding the initial invariant variable set to contain all variables that do

	748 not have definitions inside the loop does not seem to improve anything.

	749

	750 Short Circuit Evaluation
	Jim Stichnoth 2016/08/05 06:20:00 For consistency, only capitalize the first word of For consistency, only capitalize the first word of the section. manasijm 2016/08/05 17:34:43 Done. Show quoted text On 2016/08/05 06:20:00, stichnot wrote: > For consistency, only capitalize the first word of the section. Done.
	751 ^^^^^^^^^^^^^^^^^^^^^^^^

	752

	753 Short Circuit Evaluation splits nodes and introduces early jumps when the result
	Jim Stichnoth 2016/08/05 06:19:59 I would also not capitalize all these words. To s I would also not capitalize all these words. To set it apart from the rest of the text, consider using italics? manasijm 2016/08/05 17:34:44 Done. Show quoted text On 2016/08/05 06:19:59, stichnot wrote: > I would also not capitalize all these words. To set it apart from the rest of > the text, consider using italics? Done.
	754 of a logical operation can be determined early and there are no side effects of
	Jim Stichnoth 2016/08/05 06:20:00 maybe "no observable side effects" ? maybe "no observable side effects" ? manasijm 2016/08/05 17:34:44 Done. Show quoted text On 2016/08/05 06:20:00, stichnot wrote: > maybe "no observable side effects" ? Done.
	755 skipping the rest of the computation. The instructions considered backwards from

	756 the end of the basic blocks.

	757 When a definition of a variable involved in a conditional jump is found, an
	Jim Stichnoth 2016/08/05 06:20:00 reflow paragraph, or make it a new paragraph reflow paragraph, or make it a new paragraph manasijm 2016/08/05 17:34:44 Done. Show quoted text On 2016/08/05 06:20:00, stichnot wrote: > reflow paragraph, or make it a new paragraph Done.
	758 extra jump can be inserted in that location, moving the rest of the instructions

	759 in the node to a newly inserted node. Consider this example::

	760

	761 __N :

	762 a = <something>

	763 Instruction 1 without side effect

	764 ... b = <something> ...

	765 Instruction N without side effect

	766 t1 = or a b

	767 br t1 __X __Y

	768

	769 is transformed to ::

	770

	771 __N :

	772 a = <something>

	773 br a __X __N_ext

	774

	775 __N_ext :

	776 Instruction 1 without side effect

	777 ... b = <something> ...

	778 Instruction N without side effect

	779 br b __X __Y

	780

	781 The logic for AND is analogous, the only difference is that the early jump is

	782 facilitated by a ``false`` value instead of ``true``.

	783

	784 Global Variable Splitting

	785 ^^^^^^^^^^^^^^^^^^^^^^^^^

	786

	787 Global variable splitting (-split-global-vars) is run after register allocation.
	Jim Stichnoth 2016/08/05 06:19:59 ``-split-global-vars`` ``-split-global-vars`` manasijm 2016/08/05 17:34:44 Done. Show quoted text On 2016/08/05 06:19:59, stichnot wrote: > ``-split-global-vars`` Done.
	788 It works on the variables that did not manage to get registers (but are allowed

	789 to) and decomposes their live ranges into the individual segments (which span a

	790 single node at most). New variables are created (but not yet used) with these

	791 smaller live ranges and the register allocator is run again. This is not

	792 inefficient as the old variables that already had registers are now considered

	793 pre-colored.

	794

	795 The new variables that get registers replace their parent variables for their

	796 portion of its (parent's) live range. A copy from the old variable to the new

	797 is introduced before the first use and the reverse after the last def in the

	798 live range.

	799

	800

701 Basic phi lowering	801 Basic phi lowering

702 ^^^^^^^^^^^^^^^^^^	802 ^^^^^^^^^^^^^^^^^^

703	803

704 The simplest phi lowering strategy works as follows (this is how LLVM ``-O0``	804 The simplest phi lowering strategy works as follows (this is how LLVM ``-O0``

705 implements it). Consider this example::	805 implements it). Consider this example::

706	806

707 L1:	807 L1:

708 ...	808 ...

709 br L3	809 br L3

710 L2:	810 L2:

(...skipping 774 matching lines...) Expand 10 before \| Expand all \| Expand 10 after Loading...
1485 This could be mitigated by switching these to the CFG-local allocator.	1585 This could be mitigated by switching these to the CFG-local allocator.

1486	1586

1487 Third, multithreading may make the default allocator strategy more expensive.	1587 Third, multithreading may make the default allocator strategy more expensive.

1488 In a single-threaded environment, a pass will allocate its containers, run the	1588 In a single-threaded environment, a pass will allocate its containers, run the

1489 pass, and deallocate the containers. This results in stack-like allocation	1589 pass, and deallocate the containers. This results in stack-like allocation

1490 behavior and makes the heap free list easier to manage, with less heap	1590 behavior and makes the heap free list easier to manage, with less heap

1491 fragmentation. But when multithreading is added, the allocations and	1591 fragmentation. But when multithreading is added, the allocations and

1492 deallocations become much less stack-like, making allocation and deallocation	1592 deallocations become much less stack-like, making allocation and deallocation

1493 operations individually more expensive. Again, this could be mitigated by	1593 operations individually more expensive. Again, this could be mitigated by

1494 switching these to the CFG-local allocator.	1594 switching these to the CFG-local allocator.

OLD	NEW

« no previous file with comments | « no previous file | no next file » | no next file with comments »