docs/DESIGN.rst - Issue 2217773003: Documentation for LCSE, LICM, Short-Circuit, Global-Splitting

Side by Side Diff: docs/DESIGN.rst

Issue 2217773003: Documentation for LCSE, LICM, Short-Circuit, Global-Splitting (Closed) Base URL: https://chromium.googlesource.com/native_client/pnacl-subzero.git@master

Patch Set: Remove extra line Created 4 years, 4 months ago

Use n/p to move between diff chunks; N/P to move between comments. Draft comments are only viewable by you.

Jump to:

View unified diff | Download patch

OLD	NEW
1 Design of the Subzero fast code generator	1 Design of the Subzero fast code generator

2 =========================================	2 =========================================

3	3

4 Introduction	4 Introduction

5 ------------	5 ------------

6	6

7 The `Portable Native Client (PNaCl) <http://gonacl.com>`_ project includes	7 The `Portable Native Client (PNaCl) <http://gonacl.com>`_ project includes

8 compiler technology based on `LLVM <http://llvm.org/>`_. The developer uses the	8 compiler technology based on `LLVM <http://llvm.org/>`_. The developer uses the

9 PNaCl toolchain to compile their application to architecture-neutral PNaCl	9 PNaCl toolchain to compile their application to architecture-neutral PNaCl

10 bitcode (a ``.pexe`` file), using as much architecture-neutral optimization as	10 bitcode (a ``.pexe`` file), using as much architecture-neutral optimization as

(...skipping 31 matching lines...) Expand 10 before \| Expand all \| Expand 10 after Loading...
42	42

43 Or, improve translator performance to something more reasonable.	43 Or, improve translator performance to something more reasonable.

44	44

45 This document describes Subzero's attempt to improve translation speed by an	45 This document describes Subzero's attempt to improve translation speed by an

46 order of magnitude while rivaling LLVM's code quality. Subzero does this	46 order of magnitude while rivaling LLVM's code quality. Subzero does this

47 through minimal IR layering, lean data structures and passes, and a careful	47 through minimal IR layering, lean data structures and passes, and a careful

48 selection of fast optimization passes. It has two optimization recipes: full	48 selection of fast optimization passes. It has two optimization recipes: full

49 optimizations (``O2``) and minimal optimizations (``Om1``). The recipes are the	49 optimizations (``O2``) and minimal optimizations (``Om1``). The recipes are the

50 following (described in more detail below):	50 following (described in more detail below):

51	51

52 +-----------------------------------+-----------------------------+	52 +----------------------------------------+-----------------------------+

53 \| O2 recipe \| Om1 recipe \|	53 \| O2 recipe \| Om1 recipe \|

54 +===================================+=============================+	54 +========================================+=============================+

55 \| Parse .pexe file \| Parse .pexe file \|	55 \| Parse .pexe file \| Parse .pexe file \|

56 +-----------------------------------+-----------------------------+	56 +----------------------------------------+-----------------------------+

57 \| Loop nest analysis \| \|	57 \| Loop nest analysis \| \|

58 +-----------------------------------+-----------------------------+	58 +----------------------------------------+-----------------------------+

59 \| Address mode inference \| \|	59 \| Local common subexpression elimination \| \|

60 +-----------------------------------+-----------------------------+	60 +----------------------------------------+-----------------------------+

61 \| Read-modify-write (RMW) transform \| \|	61 \| Address mode inference \| \|

62 +-----------------------------------+-----------------------------+	62 +----------------------------------------+-----------------------------+

63 \| Basic liveness analysis \| \|	63 \| Read-modify-write (RMW) transform \| \|

64 +-----------------------------------+-----------------------------+	64 +----------------------------------------+-----------------------------+

65 \| Load optimization \| \|	65 \| Basic liveness analysis \| \|

66 +-----------------------------------+-----------------------------+	66 +----------------------------------------+-----------------------------+

67 \| \| Phi lowering (simple) \|	67 \| Load optimization \| \|

68 +-----------------------------------+-----------------------------+	68 +----------------------------------------+-----------------------------+

69 \| Target lowering \| Target lowering \|	69 \| \| Phi lowering (simple) \|

70 +-----------------------------------+-----------------------------+	70 +----------------------------------------+-----------------------------+

71 \| Full liveness analysis \| \|	71 \| Target lowering \| Target lowering \|

72 +-----------------------------------+-----------------------------+	72 +----------------------------------------+-----------------------------+

73 \| Register allocation \| Minimal register allocation \|	73 \| Full liveness analysis \| \|

74 +-----------------------------------+-----------------------------+	74 +----------------------------------------+-----------------------------+

75 \| Phi lowering (advanced) \| \|	75 \| Register allocation \| Minimal register allocation \|

76 +-----------------------------------+-----------------------------+	76 +----------------------------------------+-----------------------------+

77 \| Post-phi register allocation \| \|	77 \| Phi lowering (advanced) \| \|

78 +-----------------------------------+-----------------------------+	78 +----------------------------------------+-----------------------------+

79 \| Branch optimization \| \|	79 \| Post-phi register allocation \| \|

80 +-----------------------------------+-----------------------------+	80 +----------------------------------------+-----------------------------+

81 \| Code emission \| Code emission \|	81 \| Branch optimization \| \|

82 +-----------------------------------+-----------------------------+	82 +----------------------------------------+-----------------------------+

	83 \| Code emission \| Code emission \|

	84 +----------------------------------------+-----------------------------+

83	85

84 Goals	86 Goals

85 =====	87 =====

86	88

87 Translation speed	89 Translation speed

88 -----------------	90 -----------------

89	91

90 We'd like to be able to translate a ``.pexe`` file as fast as download speed.	92 We'd like to be able to translate a ``.pexe`` file as fast as download speed.

91 Any faster is in a sense wasted effort. Download speed varies greatly, but	93 Any faster is in a sense wasted effort. Download speed varies greatly, but

92 we'll arbitrarily say 1 MB/sec. We'll pick the ARM A15 CPU as the example of a	94 we'll arbitrarily say 1 MB/sec. We'll pick the ARM A15 CPU as the example of a

(...skipping 418 matching lines...) Expand 10 before \| Expand all \| Expand 10 after Loading...
511 register for ``T.inf``. This leads to ``T.inf=B`` being a redundant register	513 register for ``T.inf``. This leads to ``T.inf=B`` being a redundant register

512 copy, which is removed as an emission-time peephole optimization.	514 copy, which is removed as an emission-time peephole optimization.

513	515

514 O2 lowering	516 O2 lowering

515 -----------	517 -----------

516	518

517 Currently, the ``O2`` lowering recipe is the following:	519 Currently, the ``O2`` lowering recipe is the following:

518	520

519 - Loop nest analysis	521 - Loop nest analysis

520	522

	523 - Local common subexpression elimination

	524

521 - Address mode inference	525 - Address mode inference

522	526

523 - Read-modify-write (RMW) transformation	527 - Read-modify-write (RMW) transformation

524	528

525 - Basic liveness analysis	529 - Basic liveness analysis

526	530

527 - Load optimization	531 - Load optimization

528	532

529 - Target lowering	533 - Target lowering

530	534

(...skipping 160 matching lines...) Expand 10 before \| Expand all \| Expand 10 after Loading...
691 this second-chance mechanism is disabled.	695 this second-chance mechanism is disabled.

692	696

693 Future work is to implement LLVM's `Greedy	697 Future work is to implement LLVM's `Greedy

694 <http://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html>`_	698 <http://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html>`_

695 register allocator as a replacement for the basic linear-scan algorithm, given	699 register allocator as a replacement for the basic linear-scan algorithm, given

696 LLVM's experience with its improvement in code quality. (The blog post claims	700 LLVM's experience with its improvement in code quality. (The blog post claims

697 that the Greedy allocator also improved maintainability because a lot of hacks	701 that the Greedy allocator also improved maintainability because a lot of hacks

698 could be removed, but Subzero is probably not yet to that level of hacks, and is	702 could be removed, but Subzero is probably not yet to that level of hacks, and is

699 less likely to see that particular benefit.)	703 less likely to see that particular benefit.)

700	704

	705 Local common subexpression elimination

	706 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

	707

	708 The Local CSE implementation goes through each instruction and records a portion

	709 of each ``Seen`` instruction in a hashset-like container. That portion consists

	710 of the entire instruction except for any dest variable. That means ``A = X + Y``

	711 and ``B = X + Y`` will be considered to be 'equal' for this purpose. This allows

	712 us to detect common subexpressions.

	713

	714 Whenever a repetition is detected, the redundant variables are stored in a

	715 container mapping the replacee to the replacement. In the case above, it would

	716 be ``MAP[B] = A`` assuming ``B = X + Y`` comes after ``A = X + Y``.

	717

	718 At any point if a variable that has an entry in the replacement table is

	719 encountered, it is replaced with the variable it is mapped to. This ensures that

	720 the redundant variables will not have any uses in the basic block, allowing

	721 dead code elimination to clean up the redundant instruction.

	722

	723 With SSA, the information stored is never invalidated. However, non-SSA input is

	724 supported with the ``-lcse=no-ssa`` option. This has to maintain some extra

	725 dependency information to ensure proper invalidation on variable assignment.

	726 This is not rigorously tested because this pass is run at an early stage where

	727 it is safe to assume variables have a single definition. This is not enabled by

	728 default because it bumps the compile time overhead from 2% to 6%.

	729

	730 Loop-invariant code motion

	731 ^^^^^^^^^^^^^^^^^^^^^^^^^^

	732

	733 This pass utilizes the loop analysis information to hoist invariant instructions

	734 to loop pre-headers. A loop must have a single entry node (header) and that node

	735 must have a single external predecessor for this optimization to work, as it is

	736 currently implemented.

	737

	738 The pass works by iterating over all instructions in the loop until the set of

	739 invariant instructions converges. In each iteration, a non-invariant instruction

	740 involving only constants or a variable known to be invariant is added to the

	741 result set. The destination variable of that instruction is added to the set of

	742 variables known to be invariant (which is initialized with the constant args).

	743

	744 Improving the loop-analysis infrastructure is likely to have significant impact

	745 on this optimization. Inserting an extra node to act as the pre-header when the

	746 header has multiple incoming edges from outside could also be a good idea.

	747 Expanding the initial invariant variable set to contain all variables that do

	748 not have definitions inside the loop does not seem to improve anything.

	749

	750 Short circuit evaluation

	751 ^^^^^^^^^^^^^^^^^^^^^^^^

	752

	753 Short circuit evaluation splits nodes and introduces early jumps when the result

	754 of a logical operation can be determined early and there are no observable side

	755 effects of skipping the rest of the computation. The instructions considered

	756 backwards from the end of the basic blocks. When a definition of a variable

	757 involved in a conditional jump is found, an extra jump can be inserted in that

	758 location, moving the rest of the instructions in the node to a newly inserted

	759 node. Consider this example::

	760

	761 __N :

	762 a = <something>

	763 Instruction 1 without side effect

	764 ... b = <something> ...

	765 Instruction N without side effect

	766 t1 = or a b

	767 br t1 __X __Y

	768

	769 is transformed to::

	770

	771 __N :

	772 a = <something>

	773 br a __X __N_ext

	774

	775 __N_ext :

	776 Instruction 1 without side effect

	777 ... b = <something> ...

	778 Instruction N without side effect

	779 br b __X __Y

	780

	781 The logic for AND is analogous, the only difference is that the early jump is

	782 facilitated by a ``false`` value instead of ``true``.

	783

	784 Global Variable Splitting

	785 ^^^^^^^^^^^^^^^^^^^^^^^^^

	786

	787 Global variable splitting (``-split-global-vars``) is run after register

	788 allocation. It works on the variables that did not manage to get registers (but

	789 are allowed to) and decomposes their live ranges into the individual segments

	790 (which span a single node at most). New variables are created (but not yet used)

	791 with these smaller live ranges and the register allocator is run again. This is

	792 not inefficient as the old variables that already had registers are now

	793 considered pre-colored.

	794

	795 The new variables that get registers replace their parent variables for their

	796 portion of its (parent's) live range. A copy from the old variable to the new

	797 is introduced before the first use and the reverse after the last def in the

	798 live range.

	799

701 Basic phi lowering	800 Basic phi lowering

702 ^^^^^^^^^^^^^^^^^^	801 ^^^^^^^^^^^^^^^^^^

703	802

704 The simplest phi lowering strategy works as follows (this is how LLVM ``-O0``	803 The simplest phi lowering strategy works as follows (this is how LLVM ``-O0``

705 implements it). Consider this example::	804 implements it). Consider this example::

706	805

707 L1:	806 L1:

708 ...	807 ...

709 br L3	808 br L3

710 L2:	809 L2:

(...skipping 774 matching lines...) Expand 10 before \| Expand all \| Expand 10 after Loading...
1485 This could be mitigated by switching these to the CFG-local allocator.	1584 This could be mitigated by switching these to the CFG-local allocator.

1486	1585

1487 Third, multithreading may make the default allocator strategy more expensive.	1586 Third, multithreading may make the default allocator strategy more expensive.

1488 In a single-threaded environment, a pass will allocate its containers, run the	1587 In a single-threaded environment, a pass will allocate its containers, run the

1489 pass, and deallocate the containers. This results in stack-like allocation	1588 pass, and deallocate the containers. This results in stack-like allocation

1490 behavior and makes the heap free list easier to manage, with less heap	1589 behavior and makes the heap free list easier to manage, with less heap

1491 fragmentation. But when multithreading is added, the allocations and	1590 fragmentation. But when multithreading is added, the allocations and

1492 deallocations become much less stack-like, making allocation and deallocation	1591 deallocations become much less stack-like, making allocation and deallocation

1493 operations individually more expensive. Again, this could be mitigated by	1592 operations individually more expensive. Again, this could be mitigated by

1494 switching these to the CFG-local allocator.	1593 switching these to the CFG-local allocator.

OLD	NEW

« no previous file with comments | « no previous file | no next file » | no next file with comments »