| Index: gcc/gmp/mpn/ia64/README | 
| diff --git a/gcc/gmp/mpn/ia64/README b/gcc/gmp/mpn/ia64/README | 
| deleted file mode 100644 | 
| index 9252271ab7a6487be3733a333e733a6180139f61..0000000000000000000000000000000000000000 | 
| --- a/gcc/gmp/mpn/ia64/README | 
| +++ /dev/null | 
| @@ -1,277 +0,0 @@ | 
| -Copyright 2000, 2001, 2002, 2003, 2004, 2005 Free Software Foundation, Inc. | 
| - | 
| -This file is part of the GNU MP Library. | 
| - | 
| -The GNU MP Library is free software; you can redistribute it and/or modify | 
| -it under the terms of the GNU Lesser General Public License as published by | 
| -the Free Software Foundation; either version 3 of the License, or (at your | 
| -option) any later version. | 
| - | 
| -The GNU MP Library is distributed in the hope that it will be useful, but | 
| -WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY | 
| -or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public | 
| -License for more details. | 
| - | 
| -You should have received a copy of the GNU Lesser General Public License | 
| -along with the GNU MP Library.  If not, see http://www.gnu.org/licenses/. | 
| - | 
| - | 
| - | 
| -                      IA-64 MPN SUBROUTINES | 
| - | 
| - | 
| -This directory contains mpn functions for the IA-64 architecture. | 
| - | 
| - | 
| -CODE ORGANIZATION | 
| - | 
| -	mpn/ia64          itanium-2, and generic ia64 | 
| - | 
| -The code here has been optimized primarily for Itanium 2.  Very few Itanium 1 | 
| -chips were ever sold, and Itanium 2 is more powerful, so the latter is what | 
| -we concentrate on. | 
| - | 
| - | 
| - | 
| -CHIP NOTES | 
| - | 
| -The IA-64 ISA keeps instructions three and three in 128 bit bundles. | 
| -Programmers/compilers need to put explicit breaks `;;' when there are WAW or | 
| -RAW dependencies, with some notable exceptions.  Such "breaks" are typically | 
| -at the end of a bundle, but can be put between operations within some bundle | 
| -types too. | 
| - | 
| -The Itanium 1 and Itanium 2 implementations can under ideal conditions | 
| -execute two bundles per cycle.  The Itanium 1 allows 4 of these instructions | 
| -to do integer operations, while the Itanium 2 allows all 6 to be integer | 
| -operations. | 
| - | 
| -Taken cloop branches seem to insert a bubble into the pipeline most of the | 
| -time on Itanium 1. | 
| - | 
| -Loads to the fp registers bypass the L1 cache and thus get extremely long | 
| -latencies, 9 cycles on the Itanium 1 and 6 cycles on the Itanium 2. | 
| - | 
| -The software pipeline stuff using br.ctop instruction causes delays, since | 
| -many issue slots are taken up by instructions with zero predicates, and | 
| -since many extra instructions are needed to set things up.  These features | 
| -are clearly designed for code density, not speed. | 
| - | 
| -Misc pipeline limitations (Itanium 1): | 
| -* The getf.sig instruction can only execute in M0. | 
| -* At most four integer instructions/cycle. | 
| -* Nops take up resources like any plain instructions. | 
| - | 
| -Misc pipeline limitations (Itanium 2): | 
| -* The getf.sig instruction can only execute in M0. | 
| -* Nops take up resources like any plain instructions. | 
| - | 
| - | 
| -ASSEMBLY SYNTAX | 
| - | 
| -.align pads with nops in a text segment, but gas 2.14 and earlier | 
| -incorrectly byte-swaps its nop bundle in big endian mode (eg. hpux), making | 
| -it come out as break instructions.  We use the ALIGN() macro in | 
| -mpn/ia64/ia64-defs.m4 when it might be executed across.  That macro | 
| -suppresses any .align if the problem is detected by configure.  Lack of | 
| -alignment might hurt performance but will at least be correct. | 
| - | 
| -foo:: to create a global symbol is not accepted by gas.  Use separate | 
| -".global foo" and "foo:" instead. | 
| - | 
| -.global is the standard global directive.  gas accepts .globl, but hpux "as" | 
| -doesn't. | 
| - | 
| -.proc / .endp generates the appropriate .type and .size information for ELF, | 
| -so the latter directives don't need to be given explicitly. | 
| - | 
| -.pred.rel "mutex"... is standard for annotating predicate register | 
| -relationships.  gas also accepts .pred.rel.mutex, but hpux "as" doesn't. | 
| - | 
| -.pred directives can't be put on a line with a label, like | 
| -".Lfoo: .pred ...", the HP assembler on HP-UX 11.23 rejects that. | 
| -gas is happy with it, and past versions of HP had seemed ok. | 
| - | 
| -// is the standard comment sequence, but we prefer "C" since it inhibits m4 | 
| -macro expansion.  See comments in ia64-defs.m4. | 
| - | 
| - | 
| -REGISTER USAGE | 
| - | 
| -Special: | 
| -   r0: constant 0 | 
| -   r1: global pointer (gp) | 
| -   r8: return value | 
| -   r12: stack pointer (sp) | 
| -   r13: thread pointer (tp) | 
| -Caller-saves: r8-r11 r14-r31 f6-f15 f32-f127 | 
| -Caller-saves but rotating: r32- | 
| - | 
| - | 
| -================================================================ | 
| -mpn_add_n, mpn_sub_n: | 
| - | 
| -The current code runs at 1.25 c/l on Itanium 2. | 
| - | 
| -================================================================ | 
| -mpn_mul_1: | 
| - | 
| -The current code runs at 2 c/l on Itanium 2. | 
| - | 
| -Using a blocked approach, working off of 4 separate places in the operands, | 
| -one could make use of the xma accumulation, and approach 1 c/l. | 
| - | 
| -	ldf8 [up] | 
| -	xma.l | 
| -	xma.hu | 
| -	stf8  [wrp] | 
| - | 
| -================================================================ | 
| -mpn_addmul_1: | 
| - | 
| -The current code runs at 2 c/l on Itanium 2. | 
| - | 
| -It seems possible to use a blocked approach, as with mpn_mul_1.  We should | 
| -read rp[] to integer registers, allowing for just one getf.sig per cycle. | 
| - | 
| -	ld8  [rp] | 
| -	ldf8 [up] | 
| -	xma.l | 
| -	xma.hu | 
| -	getf.sig | 
| -	add+add+cmp+cmp | 
| -	st8  [wrp] | 
| - | 
| -These 10 instructions can be scheduled to approach 1.667 cycles, and with | 
| -the 4 cycle latency of xma, this means we need at least 3 blocks.  Using | 
| -ldfp8 we could approach 1.583 c/l. | 
| - | 
| -================================================================ | 
| -mpn_submul_1: | 
| - | 
| -The current code runs at 2.25 c/l on Itanium 2.  Getting to 2 c/l requires | 
| -ldfp8 with all alignment headache that implies. | 
| - | 
| -================================================================ | 
| -mpn_addmul_N | 
| - | 
| -For best speed, we need to give up using mpn_addmul_1 as the main multiply | 
| -building block, and instead take multiple v limbs per loop.  For the Itanium | 
| -1, we need to take about 8 limbs at a time for full speed.  For the Itanium | 
| -2, something like mpn_addmul_4 should be enough. | 
| - | 
| -The add+cmp+cmp+add we use on the other codes is optimal for shortening | 
| -recurrencies (1 cycle) but the sequence takes up 4 execution slots.  When | 
| -recurrency depth is not critical, a more standard 3-cycle add+cmp+add is | 
| -better. | 
| - | 
| -/* First load the 8 values from v */ | 
| -	ldfp8		v0, v1 = [r35], 16;; | 
| -	ldfp8		v2, v3 = [r35], 16;; | 
| -	ldfp8		v4, v5 = [r35], 16;; | 
| -	ldfp8		v6, v7 = [r35], 16;; | 
| - | 
| -/* In the inner loop, get a new U limb and store a result limb. */ | 
| -	mov		lc = un | 
| -Loop:	ldf8		u0 = [r33], 8 | 
| -	ld8		r0 = [r32] | 
| -	xma.l		lp0 = v0, u0, hp0 | 
| -	xma.hu		hp0 = v0, u0, hp0 | 
| -	xma.l		lp1 = v1, u0, hp1 | 
| -	xma.hu		hp1 = v1, u0, hp1 | 
| -	xma.l		lp2 = v2, u0, hp2 | 
| -	xma.hu		hp2 = v2, u0, hp2 | 
| -	xma.l		lp3 = v3, u0, hp3 | 
| -	xma.hu		hp3 = v3, u0, hp3 | 
| -	xma.l		lp4 = v4, u0, hp4 | 
| -	xma.hu		hp4 = v4, u0, hp4 | 
| -	xma.l		lp5 = v5, u0, hp5 | 
| -	xma.hu		hp5 = v5, u0, hp5 | 
| -	xma.l		lp6 = v6, u0, hp6 | 
| -	xma.hu		hp6 = v6, u0, hp6 | 
| -	xma.l		lp7 = v7, u0, hp7 | 
| -	xma.hu		hp7 = v7, u0, hp7 | 
| -	getf.sig	l0 = lp0 | 
| -	getf.sig	l1 = lp1 | 
| -	getf.sig	l2 = lp2 | 
| -	getf.sig	l3 = lp3 | 
| -	getf.sig	l4 = lp4 | 
| -	getf.sig	l5 = lp5 | 
| -	getf.sig	l6 = lp6 | 
| -	add+cmp+add	xx, l0, r0 | 
| -	add+cmp+add	acc0, acc1, l1 | 
| -	add+cmp+add	acc1, acc2, l2 | 
| -	add+cmp+add	acc2, acc3, l3 | 
| -	add+cmp+add	acc3, acc4, l4 | 
| -	add+cmp+add	acc4, acc5, l5 | 
| -	add+cmp+add	acc5, acc6, l6 | 
| -	getf.sig	acc6 = lp7 | 
| -	st8		[r32] = xx, 8 | 
| -	br.cloop Loop | 
| - | 
| -	49 insn at max 6 insn/cycle:		8.167 cycles/limb8 | 
| -	11 memops at max 2 memops/cycle:	5.5 cycles/limb8 | 
| -	16 fpops at max 2 fpops/cycle:		8 cycles/limb8 | 
| -	21 intops at max 4 intops/cycle:	5.25 cycles/limb8 | 
| -	11+21 memops+intops at max 4/cycle	8 cycles/limb8 | 
| - | 
| -================================================================ | 
| -mpn_lshift, mpn_rshift | 
| - | 
| -The current code runs at 1 cycle/limb on Itanium 2. | 
| - | 
| -Using 63 separate loops, we could use the double-word shrp instruction. | 
| -That instruction has a plain single-cycle latency.  We need 63 loops since | 
| -this instruction only accept immediate count.  That would lead to a somewhat | 
| -silly code size, but the speed would be 0.75 c/l on Itanium 2 (by using shrp | 
| -each cycle plus shl/shr going down I1 for a further limb every second | 
| -cycle). | 
| - | 
| -================================================================ | 
| -mpn_copyi, mpn_copyd | 
| - | 
| -The current code runs at 0.5 c/l on Itanium 2.  But that is just for L1 | 
| -cache hit.  The 4-way unrolled loop takes just 2 cycles, and thus load-use | 
| -scheduling isn't great.  It might be best to actually use modulo scheduled | 
| -loops, since that will allow us to do better load-use scheduling without too | 
| -much unrolling. | 
| - | 
| -Depending on size or operand alignment, we get 1 c/l or 0.5 c/l on Itanium | 
| -2, according to tests/devel/try.  Cache bank conflicts? | 
| - | 
| - | 
| - | 
| -REFERENCES | 
| - | 
| -Intel Itanium Architecture Software Developer's Manual, volumes 1 to 3, | 
| -Intel document 245317-004, 245318-004, 245319-004 October 2002.  Volume 1 | 
| -includes an Itanium optimization guide. | 
| - | 
| -Intel Itanium Processor-specific Application Binary Interface (ABI), Intel | 
| -document 245370-003, May 2001.  Describes C type sizes, dynamic linking, | 
| -etc. | 
| - | 
| -Intel Itanium Architecture Assembly Language Reference Guide, Intel document | 
| -248801-004, 2000-2002.  Describes assembly instruction syntax and other | 
| -directives. | 
| - | 
| -Itanium Software Conventions and Runtime Architecture Guide, Intel document | 
| -245358-003, May 2001.  Describes calling conventions, including stack | 
| -unwinding requirements. | 
| - | 
| -Intel Itanium Processor Reference Manual for Software Optimization, Intel | 
| -document 245473-003, November 2001. | 
| - | 
| -Intel Itanium-2 Processor Reference Manual for Software Development and | 
| -Optimization, Intel document 251110-003, May 2004. | 
| - | 
| -All the above documents can be found online at | 
| - | 
| -    http://developer.intel.com/design/itanium/manuals.htm | 
| - | 
| - | 
| ----------------- | 
| -Local variables: | 
| -mode: text | 
| -fill-column: 76 | 
| -End: | 
|  |