gcc/gmp/mpn/powerpc64/README - Issue 3050029: [gcc] GCC 4.5.0=>4.5.1

Unified Diff: gcc/gmp/mpn/powerpc64/README

Issue 3050029: [gcc] GCC 4.5.0=>4.5.1 (Closed) Base URL: ssh://git@gitrw.chromium.org:9222/nacl-toolchain.git

Patch Set: Created 10 years, 5 months ago

Use n/p to move between diff chunks; N/P to move between comments. Draft comments are only viewable by you.

Jump to:

View side-by-side diff with in-line comments

Index: gcc/gmp/mpn/powerpc64/README

diff --git a/gcc/gmp/mpn/powerpc64/README b/gcc/gmp/mpn/powerpc64/README

deleted file mode 100644

index 757357b4d8fd67aaaab1bb96a5bcd32f11736c37..0000000000000000000000000000000000000000

--- a/gcc/gmp/mpn/powerpc64/README

+++ /dev/null

@@ -1,155 +0,0 @@

-This file is part of the GNU MP Library.

-The GNU MP Library is free software; you can redistribute it and/or modify

-it under the terms of the GNU Lesser General Public License as published by

-the Free Software Foundation; either version 3 of the License, or (at your

-option) any later version.

-The GNU MP Library is distributed in the hope that it will be useful, but

-WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY

-or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public

-License for more details.

-You should have received a copy of the GNU Lesser General Public License

-along with the GNU MP Library. If not, see http://www.gnu.org/licenses/.

- POWERPC-64 MPN SUBROUTINES

-This directory contains mpn functions for 64-bit PowerPC chips.

-CODE ORGANIZATION

- mpn/powerpc64 mode-neutral code

- mpn/powerpc64/mode32 code for mode32

- mpn/powerpc64/mode64 code for mode64

-The mode32 and mode64 sub-directories contain code which is for use in the

-respective chip mode, 32 or 64. The top-level directory is code that's

-unaffected by the mode.

-The "adde" instruction is the main difference between mode32 and mode64. It

-operates on either on a 32-bit or 64-bit quantity according to the chip mode.

-Other instructions have an operand size in their opcode and hence don't vary.

-POWER3/PPC630 pipeline information:

-Decoding is 4-way + branch and issue is 8-way with some out-of-order

-capability.

-Functional units:

-LS1 - ld/st unit 1

-LS2 - ld/st unit 2

-FXU1 - integer unit 1, handles any simple integer instruction

-FXU2 - integer unit 2, handles any simple integer instruction

-FXU3 - integer unit 3, handles integer multiply and divide

-FPU1 - floating-point unit 1

-FPU2 - floating-point unit 2

-Memory: Any two memory operations can issue, but memory subsystem

- can sustain just one store per cycle. No need for data

- prefetch; the hardware has very sophisticated prefetch logic.

-Simple integer: 2 operations (such as add, rl*)

-Integer multiply: 1 operation every 9th cycle worst case; exact timing depends

- on 2nd operand's most significant bit position (10 bits per

- cycle). Multiply unit is not pipelined, only one multiply

- operation in progress is allowed.

-Integer divide: ?

-Floating-point: Any plain 2 arithmetic instructions (such as fmul, fadd, and

- fmadd), latency 4 cycles.

-Floating-point divide:

- ?

-Floating-point square root:

- ?

-POWER3/PPC630 best possible times for the main loops:

-shift: 1.5 cycles limited by integer unit contention.

- With 63 special loops, one for each shift count, we could

- reduce the needed integer instructions to 2, which would

- reduce the best possible time to 1 cycle.

-add/sub: 1.5 cycles, limited by ld/st unit contention.

-mul: 18 cycles (average) unless floating-point operations are used,

- but that would only help for multiplies of perhaps 10 and more

- limbs.

-addmul/submul:Same situation as for mul.

-POWER4/PPC970 and POWER5 pipeline information:

-This is a very odd pipeline, it is basically a VLIW masquerading as a plain

-architecture. Its issue rules are not made public, and since it is so weird,

-it is very hard to figure out any useful information from experimentation.

-An example:

- A well-aligned loop with nop's take 3, 4, 6, 7, ... cycles.

- 3 cycles for 0, 1, 2, 3, 4, 5, 6, 7 nop's

- 4 cycles for 8, 9, 10, 11, 12, 13, 14, 15 nop's

- 6 cycles for 16, 17, 18, 19, 20, 21, 22, 23 nop's

- 7 cycles for 24, 25, 26, 27 nop's

- 8 cycles for 28, 29, 30, 31 nop's

- ... continues regularly

-Functional units:

-LS1 - ld/st unit 1

-LS2 - ld/st unit 2

-FXU1 - integer unit 1, handles any integer instruction

-FXU2 - integer unit 2, handles any integer instruction

-FPU1 - floating-point unit 1

-FPU2 - floating-point unit 2

-While this is one integer unit less than POWER3/PPC630, the remaining units

-are more powerful; here they handle multiply and divide.

-Memory: 2 ld/st. Stores go to the L2 cache, which can sustain just

- one store per cycle.

- L1 load latency: to gregs 3-4 cycles, to fregs 5-6 cycles.

- Operations that modify the address register might be split

- to use also a an integer issue slot.

-Simple integer: 2 operations every cycle, latency 2.

-Integer multiply: 2 operations every 6th cycle, latency 7 cycles.

-Integer divide: ?

-Floating-point: Any plain 2 arithmetic instructions (such as fmul, fadd, and

- fmadd), latency 6 cycles.

-Floating-point divide:

- ?

-Floating-point square root:

- ?

-IDEAS

-*mul_1: Handling one limb using mulld/mulhdu and two limbs using floating-

-point operations should give performance of about 20 cycles for 3 limbs, or 7

-cycles/limb.

-We should probably split the single-limb operand in 32-bit chunks, and the

-multi-limb operand in 16-bit chunks, allowing us to accumulate well in fp

-registers.

-Problem is to get 32-bit or 16-bit words to the fp registers. Only 64-bit fp

-memops copies bits without fiddling with them. We might therefore need to

-load to integer registers with zero extension, store as 64 bits into temp

-space, and then load to fp regs. Alternatively, load directly to fp space

-and add well-chosen constants to get cancelation. (Other part after given by

-subsequent subtraction.)

-Possible code mix for load-via-intregs variant:

-lwz,std,lfd

-fmadd,fmadd,fmul,fmul

-fctidz,stfd,ld,fctidz,stfd,ld

-add,adde

-lwz,std,lfd

-fmadd,fmadd,fmul,fmul

-fctidz,stfd,ld,fctidz,stfd,ld

-add,adde

-srd,sld,add,adde,add,adde

« no previous file with comments | « gcc/gmp/mpn/powerpc32/vmx/popcount.asm ('k') | gcc/gmp/mpn/powerpc64/aix.m4 » ('j') | no next file with comments »