Unified Diff: gcc/gmp/mpn/alpha/README

Issue 3050029: [gcc] GCC 4.5.0=>4.5.1 (Closed) Base URL: ssh://git@gitrw.chromium.org:9222/nacl-toolchain.git
Patch Set: Created 10 years, 5 months ago
Index: gcc/gmp/mpn/alpha/README
diff --git a/gcc/gmp/mpn/alpha/README b/gcc/gmp/mpn/alpha/README
deleted file mode 100644
index 3578c53b856be53b09740e414ec69c423dc3c8d6..0000000000000000000000000000000000000000
--- a/gcc/gmp/mpn/alpha/README
+++ /dev/null
@@ -1,198 +0,0 @@
-Copyright 1996, 1997, 1999, 2000, 2001, 2002, 2003, 2004, 2005 Free Software
-Foundation, Inc.
-
-This file is part of the GNU MP Library.
-
-The GNU MP Library is free software; you can redistribute it and/or modify it
-under the terms of the GNU Lesser General Public License as published by the
-Free Software Foundation; either version 3 of the License, or (at your
-option) any later version.
-
-The GNU MP Library is distributed in the hope that it will be useful, but
-WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
-FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License
-for more details.
-
-You should have received a copy of the GNU Lesser General Public License along
-with the GNU MP Library. If not, see http://www.gnu.org/licenses/.
-
-
-
-
-
-This directory contains mpn functions optimized for DEC Alpha processors.
-
-ALPHA ASSEMBLY RULES AND REGULATIONS
-
-The `.prologue N' pseudo op marks the end of the instructions that need
-special handling by unwinding. It also says whether $27 is really needed for
-computing the gp. The `.mask M' pseudo op says which registers are saved on
-the stack, and at what offset in the frame.
-
-Cray T3 code is very very different...
-
-"$6" / "$f6" etc is the usual syntax for registers, but on Unicos instead "r6"
-/ "f6" is required. We use the "r6" / "f6" forms, and have m4 defines expand
-them to "$6" or "$f6" where necessary.
-
-"0x" introduces a hex constant in gas and DEC as, but on Unicos "^X" is
-required. The X() macro accommodates this difference.
-
-"cvttqc" is required by DEC as, "cvttq/c" is required by Unicos, and gas will
-accept either. We use cvttqc and have an m4 define expand to cvttq/c where
-necessary.
-
-"not" as an alias for "ornot r31, ..." is available in gas and DEC as, but not
-the Unicos assembler. The full "ornot" must be used.
-
-"unop" is not available in Unicos. We make an m4 define to the usual "ldq_u
-r31,0(r30)", and in fact use that define on all systems since it comes out the
-same.
-
-"!literal!123" etc explicit relocations as per Tru64 4.0 are apparently not
-available in older alpha assemblers (including gas prior to 2.12), according to
-the GCC manual, so the assembler macro forms must be used (eg. ldgp).
-
-
-
-RELEVANT OPTIMIZATION ISSUES
-
-EV4
-
-1. This chip has very limited store bandwidth. The on-chip L1 cache is
- write-through, and a cache line is transferred from the store buffer to the
- off-chip L2 in as many as 15 cycles on most systems. This delay hurts
- mpn_add_n, mpn_sub_n, mpn_lshift, and mpn_rshift.
-
-2. Pairing is possible between memory instructions and integer arithmetic
- instructions.
-
-3. mulq and umulh are documented to have a latency of 23 cycles, but 2 of these
- cycles are pipelined. Thus, multiply instructions can be issued at a rate
- of one each 21st cycle.
-
-EV5
-
-1. The memory bandwidth of this chip is good, both for loads and stores. The
- L1 cache can handle two loads or one store per cycle, but two cycles after a
- store, no ld can issue.
-
-2. mulq has a latency of 12 cycles and an issue rate of 1 each 8th cycle.
- umulh has a latency of 14 cycles and an issue rate of 1 each 10th cycle.
- (Note that published documentation gets these numbers slightly wrong.)
-
-3. mpn_add_n. With 4-fold unrolling, we need 37 instructions, whereof 12
- are memory operations. This will take at least
- ceil(37/2) [dual issue] + 1 [taken branch] = 19 cycles
- We have 12 memory cycles, plus 4 after-store conflict cycles, or 16 data
- cache cycles, which should be completely hidden in the 19 issue cycles.
- The computation is inherently serial, with these dependencies:
-
- ldq ldq
- \ /\
- (or) addq |
- |\ / \ |
- | addq cmpult
- \ | |
- cmpult |
- \ /
- or
-
- I.e., 3 operations are needed between carry-in and carry-out, making 12
- cycles the absolute minimum for the 4 limbs. We could replace the `or' with
- a cmoveq/cmovne, which could issue one cycle earlier than the `or', but that
- might waste a cycle on EV4. The total depth remains unaffected, since cmov
- has a latency of 2 cycles.
-
- addq
- / \
- addq cmpult
- | \
- cmpult -> cmovne
-
- Montgomery has a slightly different way of computing carry that requires one
- less instruction, but has depth 4 (instead of the current 3). Since the code
- is currently instruction issue bound, Montgomery's idea should save us 1/2
- cycle per limb, or bring us down to a total of 17 cycles or 4.25 cycles/limb.
- Unfortunately, this method will not be good for the EV6.
-
-4. addmul_1 and friends: We previously had a scheme for splitting the
- single-limb operand into 21-bit chunks and the multi-limb operand into 32-bit
- chunks, and then using FP operations for every other multiply and integer
- operations for the remaining multiplies.
-
- But it seems much better to split the single-limb operand into 16-bit chunks,
- since we save many integer shifts and adds that way. See powerpc64/README
- for some more details.
-
-EV6
-
-Here we have a really parallel pipeline, capable of issuing up to 4 integer
-instructions per cycle. In actual practice, it is never possible to sustain
-more than 3.5 integer insns/cycle due to rename register shortage. One integer
-multiply instruction can issue each cycle. To get optimal speed, we need to
-pretend we are vectorizing the code, i.e., minimize the depth of recurrences.
-
-There are two dependencies to watch out for. 1) Address arithmetic
-dependencies, and 2) carry propagation dependencies.
-
-We can avoid serializing due to address arithmetic by unrolling loops, so that
-addresses don't depend heavily on an index variable. Avoiding serializing
-because of carry propagation is trickier; the ultimate performance of the code
-will be determined by the number of latency cycles it takes from accepting
-carry-in to a vector point until we can generate carry-out.
-
-Most integer instructions can execute in either the L0, U0, L1, or U1
-pipelines. Shifts only execute in U0 and U1, and multiply only in U1.
-
-CMOV instructions split into two internal instructions, CMOV1 and CMOV2. CMOV
-splits the mapping process (see pg 2-26 in cmpwrgd.pdf), suggesting that a CMOV
-should always be placed as the last instruction of an aligned 4 instruction
-block, or perhaps simply avoided.
-
-Perhaps the most important issue is the latency between the L0/U0 and L1/U1
-clusters; a result obtained on either cluster has an extra cycle of latency for
-consumers in the opposite cluster. Because of the dynamic nature of the
-implementation, it is hard to predict where an instruction will execute.
-
-
-
-REFERENCES
-
-"Alpha Architecture Handbook", version 4, Compaq, October 1998, order number
-EC-QD2KC-TE.
-
-"Alpha 21164 Microprocessor Hardware Reference Manual", Compaq, December 1998,
-order number EC-QP99C-TE.
-
-"Alpha 21264/EV67 Microprocessor Hardware Reference Manual", revision 1.4,
-Compaq, September 2000, order number DS-0028B-TE.
-
-"Compiler Writer's Guide for the Alpha 21264", Compaq, June 1999, order number
-EC-RJ66A-TE.
-
-All of the above are available online from
-
- http://ftp.digital.com/pub/Digital/info/semiconductor/literature/dsc-library.html
- ftp://ftp.compaq.com/pub/products/alphaCPUdocs
-
-"Tru64 Unix Assembly Language Programmer's Guide", Compaq, March 1996, part
-number AA-PS31D-TE.
-
-"Digital UNIX Calling Standard for Alpha Systems", Digital Equipment Corp,
-March 1996, part number AA-PY8AC-TE.
-
-The above are available online,
-
- http://h30097.www3.hp.com/docs/pub_page/V40F_DOCS.HTM
-
-(Dunno what h30097 means in this URL, but if it moves try searching for "tru64
-online documentation" from the main www.hp.com page.)
-
-
-
-----------------
-Local variables:
-mode: text
-fill-column: 79
-End:
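
The carry dependency chain that the EV5 notes draw for mpn_add_n can be sketched in portable C. This is only an illustration of the addq / cmpult / or pattern from the diagram, not the assembly that the deleted directory contained. The names limb_t and add_n_sketch are hypothetical, and a 64-bit limb is assumed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration only; this is not the GMP API. */
typedef uint64_t limb_t;

/* Adds the n-limb numbers ap[] and bp[] (least significant limb first),
   stores the n-limb sum in rp[], and returns the carry-out (0 or 1). */
limb_t add_n_sketch(limb_t *rp, const limb_t *ap, const limb_t *bp, size_t n)
{
    limb_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        /* These two operations do not depend on the incoming carry. */
        limb_t s  = ap[i] + bp[i];  /* addq                        */
        limb_t c1 = s < bp[i];      /* cmpult: carry from a + b    */

        /* The three operations between carry-in and carry-out. */
        limb_t r  = s + cy;         /* addq:   fold in carry-in    */
        limb_t c2 = r < cy;         /* cmpult: carry from s + cy   */
        cy = c1 | c2;               /* or:     at most one is set  */

        rp[i] = r;
    }
    return cy;
}
```

Only the last three operations in the loop body sit on the carry path, which matches the README's count of 3 operations per limb, or 12 for a 4-fold unrolled loop. The two carries can never both be set, so combining them with `or' is safe.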