gcc/gmp/mpn/alpha/README - Issue 3050029: [gcc] GCC 4.5.0=>4.5.1

Unified Diff: gcc/gmp/mpn/alpha/README

Issue 3050029: [gcc] GCC 4.5.0=>4.5.1 (Closed) Base URL: ssh://git@gitrw.chromium.org:9222/nacl-toolchain.git

Patch Set: Created 10 years, 5 months ago

Index: gcc/gmp/mpn/alpha/README

diff --git a/gcc/gmp/mpn/alpha/README b/gcc/gmp/mpn/alpha/README

deleted file mode 100644

index 3578c53b856be53b09740e414ec69c423dc3c8d6..0000000000000000000000000000000000000000

--- a/gcc/gmp/mpn/alpha/README

+++ /dev/null

@@ -1,198 +0,0 @@

-Foundation, Inc.

-This file is part of the GNU MP Library.

-The GNU MP Library is free software; you can redistribute it and/or modify it

-under the terms of the GNU Lesser General Public License as published by the

-Free Software Foundation; either version 3 of the License, or (at your

-option) any later version.

-The GNU MP Library is distributed in the hope that it will be useful, but

-WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or

-FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License

-for more details.

-You should have received a copy of the GNU Lesser General Public License along

-with the GNU MP Library. If not, see http://www.gnu.org/licenses/.

-This directory contains mpn functions optimized for DEC Alpha processors.

-ALPHA ASSEMBLY RULES AND REGULATIONS

-The `.prologue N' pseudo op marks the end of the instructions that need special

-handling by unwinding. It also says whether $27 is really needed for computing

-the gp. The `.mask M' pseudo op says which registers are saved on the stack,

-and at what offset in the frame.

-Cray T3 code is very very different...

-"$6" / "$f6" etc is the usual syntax for registers, but on Unicos instead "r6"

-/ "f6" is required. We use the "r6" / "f6" forms, and have m4 defines expand

-them to "$6" or "$f6" where necessary.

-"0x" introduces a hex constant in gas and DEC as, but on Unicos "^X" is

-required. The X() macro accommodates this difference.

-"cvttqc" is required by DEC as, "cvttq/c" is required by Unicos, and gas will

-accept either. We use cvttqc and have an m4 define expand to cvttq/c where

-necessary.

-"not" as an alias for "ornot r31, ..." is available in gas and DEC as, but not

-the Unicos assembler. The full "ornot" must be used.

-"unop" is not available in Unicos. We make an m4 define to the usual "ldq_u

-r31,0(r30)", and in fact use that define on all systems since it comes out the

-same.

-"!literal!123" etc explicit relocations as per Tru64 4.0 are apparently not

-available in older alpha assemblers (including gas prior to 2.12), according to

-the GCC manual, so the assembler macro forms must be used (eg. ldgp).

-RELEVANT OPTIMIZATION ISSUES

-EV4

-1. This chip has very limited store bandwidth. The on-chip L1 cache is write-

- through, and a cache line is transferred from the store buffer to the off-

- chip L2 in as much as 15 cycles on most systems. This delay hurts mpn_add_n,

- mpn_sub_n, mpn_lshift, and mpn_rshift.

-2. Pairing is possible between memory instructions and integer arithmetic

- instructions.

-3. mulq and umulh are documented to have a latency of 23 cycles, but 2 of these

- cycles are pipelined. Thus, multiply instructions can be issued at a rate

- of one each 21st cycle.

-EV5

-1. The memory bandwidth of this chip is good, both for loads and stores. The

- L1 cache can handle two loads or one store per cycle, but two cycles after a

- store, no ld can issue.

-2. mulq has a latency of 12 cycles and an issue rate of 1 each 8th cycle.

- umulh has a latency of 14 cycles and an issue rate of 1 each 10th cycle.

- (Note that published documentation gets these numbers slightly wrong.)

-3. mpn_add_n. With 4-fold unrolling, we need 37 instructions, whereof 12

- are memory operations. This will take at least

- ceil(37/2) [dual issue] + 1 [taken branch] = 19 cycles

- We have 12 memory cycles, plus 4 after-store conflict cycles, or 16 data

- cache cycles, which should be completely hidden in the 19 issue cycles.

- The computation is inherently serial, with these dependencies:

-        ldq  ldq
-         \  /\
-   (or)   addq |
-    |\   /   \ |
-    | addq  cmpult
-     \  |     |
-      cmpult  |
-          \  /
-           or

- I.e., 3 operations are needed between carry-in and carry-out, making 12

- cycles the absolute minimum for the 4 limbs. We could replace the `or' with

- a cmoveq/cmovne, which could issue one cycle earlier than the `or', but that

- might waste a cycle on EV4. The total depth remains unaffected, since cmov

- has a latency of 2 cycles.

-          addq
-          /   \
-       addq  cmpult
-         |      \
-       cmpult -> cmovne

- Montgomery has a slightly different way of computing carry that requires one

- less instruction, but has depth 4 (instead of the current 3). Since the code

- is currently instruction issue bound, Montgomery's idea should save us 1/2

- cycle per limb, or bring us down to a total of 17 cycles or 4.25 cycles/limb.

- Unfortunately, this method will not be good for the EV6.

-4. addmul_1 and friends: We previously had a scheme for splitting the single-

- limb operand in 21-bit chunks and the multi-limb operand in 32-bit chunks,

- and then use FP operations for every 2nd multiply, and integer operations

- for every 2nd multiply.

- But it seems much better to split the single-limb operand in 16-bit chunks,

- since we save many integer shifts and adds that way. See powerpc64/README

- for some more details.

-EV6

-Here we have a really parallel pipeline, capable of issuing up to 4 integer

-instructions per cycle. In actual practice, it is never possible to sustain

-more than 3.5 integer insns/cycle due to rename register shortage. One integer

-multiply instruction can issue each cycle. To get optimal speed, we need to

-pretend we are vectorizing the code, i.e., minimize the depth of recurrences.

-There are two dependencies to watch out for. 1) Address arithmetic

-dependencies, and 2) carry propagation dependencies.

-We can avoid serializing due to address arithmetic by unrolling loops, so that

-addresses don't depend heavily on an index variable. Avoiding serializing

-because of carry propagation is trickier; the ultimate performance of the code

-will be determined by the number of latency cycles it takes from accepting

-carry-in to a vector point until we can generate carry-out.

-Most integer instructions can execute in either the L0, U0, L1, or U1

-pipelines. Shifts only execute in U0 and U1, and multiply only in U1.

-CMOV instructions split into two internal instructions, CMOV1 and CMOV2. CMOV

-splits the mapping process (see pg 2-26 in cmpwrgd.pdf), suggesting the CMOV

-should always be placed as the last instruction of an aligned 4 instruction

-block, or perhaps simply avoided.

-Perhaps the most important issue is the latency between the L0/U0 and L1/U1

-clusters; a result obtained on either cluster has an extra cycle of latency for

-consumers in the opposite cluster. Because of the dynamic nature of the

-implementation, it is hard to predict where an instruction will execute.

-REFERENCES

-"Alpha Architecture Handbook", version 4, Compaq, October 1998, order number

-EC-QD2KC-TE.

-"Alpha 21164 Microprocessor Hardware Reference Manual", Compaq, December 1998,

-order number EC-QP99C-TE.

-"Alpha 21264/EV67 Microprocessor Hardware Reference Manual", revision 1.4,

-Compaq, September 2000, order number DS-0028B-TE.

-"Compiler Writer's Guide for the Alpha 21264", Compaq, June 1999, order number

-EC-RJ66A-TE.

-All of the above are available online from

- http://ftp.digital.com/pub/Digital/info/semiconductor/literature/dsc-library.html

- ftp://ftp.compaq.com/pub/products/alphaCPUdocs

-"Tru64 Unix Assembly Language Programmer's Guide", Compaq, March 1996, part

-number AA-PS31D-TE.

-"Digital UNIX Calling Standard for Alpha Systems", Digital Equipment Corp,

-March 1996, part number AA-PY8AC-TE.

-The above are available online,

- http://h30097.www3.hp.com/docs/pub_page/V40F_DOCS.HTM

-(Dunno what h30097 means in this URL, but if it moves try searching for "tru64

-online documentation" from the main www.hp.com page.)

-----------------

-Local variables:

-mode: text

-fill-column: 79

-End:
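
Reviewer note: the carry chain described under EV5 item 3 of the removed README can be sketched in portable C. This is only an illustration of the dependency graph, not GMP's Alpha assembly. The names limb_t and add_n_sketch are hypothetical, and the comments map each C operation to the Alpha instruction named in the README. Only the last three operations before the store depend on the incoming carry, which appears to be the depth-3 recurrence the README refers to.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for GMP's limb type on a 64-bit target. */
typedef uint64_t limb_t;

/* Add the n-limb numbers {ap,n} and {bp,n}, store n limbs at rp,
   and return the carry-out (0 or 1).  Illustrative sketch only. */
static limb_t add_n_sketch(limb_t *rp, const limb_t *ap,
                           const limb_t *bp, size_t n)
{
    limb_t cy = 0;                 /* carry-in, the "(or)" node */
    for (size_t i = 0; i < n; i++) {
        limb_t a = ap[i];          /* ldq */
        limb_t b = bp[i];          /* ldq */
        limb_t s = a + b;          /* addq: does not depend on cy */
        limb_t c1 = s < b;         /* cmpult: does not depend on cy */
        limb_t r = s + cy;         /* addq: 1st op on the carry path */
        limb_t c2 = r < cy;        /* cmpult: 2nd op on the carry path */
        cy = c1 | c2;              /* or: 3rd op, produces carry-out */
        rp[i] = r;                 /* stq */
    }
    return cy;
}
```

The two partial carries c1 and c2 can never both be set for the same limb, so combining them with a plain `or' is safe. A 4-fold unrolled version would repeat the loop body four times per iteration, giving the 3 x 4 = 12 cycle lower bound quoted in the README.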