gcc/gmp/mpn/x86/README - Issue 3050029: [gcc] GCC 4.5.0=>4.5.1

Unified Diff: gcc/gmp/mpn/x86/README

Issue 3050029: [gcc] GCC 4.5.0=>4.5.1 (Closed) Base URL: ssh://git@gitrw.chromium.org:9222/nacl-toolchain.git

Patch Set: Created 10 years, 5 months ago

Index: gcc/gmp/mpn/x86/README

diff --git a/gcc/gmp/mpn/x86/README b/gcc/gmp/mpn/x86/README

deleted file mode 100644

index 883db227d284e0d45dcbb708d51bb4628945882d..0000000000000000000000000000000000000000

--- a/gcc/gmp/mpn/x86/README

+++ /dev/null

@@ -1,514 +0,0 @@

-This file is part of the GNU MP Library.

-The GNU MP Library is free software; you can redistribute it and/or modify

-it under the terms of the GNU Lesser General Public License as published by

-the Free Software Foundation; either version 3 of the License, or (at your

-option) any later version.

-The GNU MP Library is distributed in the hope that it will be useful, but

-WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY

-or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public

-License for more details.

-You should have received a copy of the GNU Lesser General Public License

-along with the GNU MP Library. If not, see http://www.gnu.org/licenses/.

- X86 MPN SUBROUTINES

-This directory contains mpn functions for various 80x86 chips.

-CODE ORGANIZATION

-    x86                 i386, generic

-    x86/i486            i486

-    x86/pentium         Intel Pentium (P5, P54)

-    x86/pentium/mmx     Intel Pentium with MMX (P55)

-    x86/p6              Intel Pentium Pro

-    x86/p6/mmx          Intel Pentium II, III

-    x86/p6/p3mmx        Intel Pentium III

-    x86/k6              \ AMD K6

-    x86/k6/mmx          /

-    x86/k6/k62mmx       AMD K6-2

-    x86/k7              \ AMD Athlon

-    x86/k7/mmx          /

-    x86/pentium4        \

-    x86/pentium4/mmx    | Intel Pentium 4

-    x86/pentium4/sse2   /

-The top-level x86 directory contains blended style code, meant to be

-reasonable on all x86s.

-STATUS

-The code is well-optimized for AMD and Intel chips, but there's nothing

-specific for Cyrix chips, nor for actual 80386 and 80486 chips.

-ASM FILES

-The x86 .asm files are BSD style assembler code, first put through m4 for

-macro processing. The generic mpn/asm-defs.m4 is used, together with

-mpn/x86/x86-defs.m4. See comments in those files.

-The code is meant for use with GNU "gas" or a system "as". There's no

-support for assemblers that demand Intel style code.

-STACK FRAME

-m4 macros are used to define the parameters passed on the stack, and these

-act like comments on what the stack frame looks like too. For example,

-mpn_mul_1() has the following.

- defframe(PARAM_MULTIPLIER, 16)

- defframe(PARAM_SIZE, 12)

- defframe(PARAM_SRC, 8)

- defframe(PARAM_DST, 4)

-PARAM_MULTIPLIER becomes `FRAME+16(%esp)', and the others similarly. The

-return address is at offset 0, but there's not normally any need to access

-that.

-FRAME is redefined as necessary through the code so it's the number of bytes

-pushed on the stack, and hence the offsets in the parameter macros stay

-correct. At the start of a routine FRAME should be zero.

- deflit(`FRAME',0)

- ...

- deflit(`FRAME',4)

- ...

- deflit(`FRAME',8)

- ...

-Helper macros FRAME_pushl(), FRAME_popl(), FRAME_addl_esp() and

-FRAME_subl_esp() exist to adjust FRAME for the effect of those instructions,

-and can be used instead of explicit definitions if preferred.

-defframe_pushl() is a combination of FRAME_pushl() and defframe().

-There's generally some slackness in redefining FRAME. If new values aren't

-going to get used then the redefinitions are omitted to keep from cluttering

-up the code. This happens for instance at the end of a routine, where there

-might be just four pops and then a ret, so FRAME isn't getting used.

-Local variables and saved registers can be similarly defined, with negative

-offsets representing stack space below the initial stack pointer. For

-example,

- defframe(SAVE_ESI, -4)

- defframe(SAVE_EDI, -8)

- defframe(VAR_COUNTER,-12)

- deflit(STACK_SPACE, 12)

-Here STACK_SPACE gets used in a "subl $STACK_SPACE, %esp" to allocate the

-space, and that instruction must be followed by a redefinition of FRAME

-(setting it equal to STACK_SPACE) to reflect the change in %esp.
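As an illustration of the FRAME bookkeeping described above (not part of the deleted README), the following Python sketch models how a parameter or saved-register offset stays valid as %esp moves. The class `FrameTracker` and its method names are hypothetical; the real work is done by m4 macros in mpn/x86/x86-defs.m4, and this sketch folds FRAME into a single number rather than emitting the literal `FRAME+16(%esp)` text.

```python
# Illustrative model only. The real bookkeeping is done by m4 macros in
# mpn/x86/x86-defs.m4; FrameTracker and its methods are hypothetical names.

class FrameTracker:
    """Tracks FRAME, the number of bytes pushed since function entry."""

    def __init__(self):
        self.frame = 0              # deflit(`FRAME',0) at routine start

    def pushl(self):
        self.frame += 4             # pushl lowers %esp by 4

    def popl(self):
        self.frame -= 4             # popl raises %esp by 4

    def subl_esp(self, nbytes):
        self.frame += nbytes        # subl $n, %esp

    def addl_esp(self, nbytes):
        self.frame -= nbytes        # addl $n, %esp

    def operand(self, offset):
        """%esp-relative operand for a defframe() offset, FRAME folded in."""
        return f"{self.frame + offset}(%esp)"


PARAM_MULTIPLIER = 16               # offsets taken from the examples above
SAVE_ESI = -4
STACK_SPACE = 12

frame = FrameTracker()
at_entry = frame.operand(PARAM_MULTIPLIER)      # parameter before any pushes
frame.subl_esp(STACK_SPACE)                     # allocate locals
after_alloc = frame.operand(PARAM_MULTIPLIER)   # same parameter, new %esp
saved_esi = frame.operand(SAVE_ESI)             # slot below the initial %esp
```

After the `subl`, the negative SAVE_ESI offset resolves to a positive displacement from the new %esp, which is why FRAME must be redefined immediately after the allocation.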

-Definitions for pushed registers are only put in when they're going to be

-used. If registers are just saved and restored with pushes and pops then

-definitions aren't made.

-ASSEMBLER EXPRESSIONS

-Only addition and subtraction seem to be universally available; certainly

-that's all the Solaris 8 "as" seems to accept. If other operators are wanted

-then m4 eval() should be used.

-In particular note that a "/" anywhere in a line starts a comment in Solaris

-"as", and in some configurations of gas too.

-    addl $32/2, %eax           <-- wrong

-    addl $eval(32/2), %eax     <-- right

-Binutils gas/config/tc-i386.c has a choice between "/" being a comment

-anywhere in a line, or only at the start. FreeBSD patches 2.9.1 to select

-the latter, and from 2.9.5 it's the default for GNU/Linux too.

-ASSEMBLER COMMENTS

-Solaris "as" doesn't support "#" commenting, using /* */ instead. For that

-reason "C" commenting is used (see asm-defs.m4) and the intermediate ".s"

-files have no comments.

-Any comments before include(`../config.m4') must use m4 "dnl", since it's

-only after the include that "C" is available. By convention "dnl" is also

-used for comments about m4 macros.

-TEMPORARY LABELS

-Temporary numbered labels like "1:" used as "1f" or "1b" are available in

-"gas" and Solaris "as", but not in SCO "as". Normal L() labels should be

-used instead, possibly with a counter to make them unique, see jadcl0() in

-x86-defs.m4 for instance. A separate counter for each macro makes it

-possible to nest them, for instance movl_text_address() can be used within

-an ASSERT().

-"1:" etc must be avoided in gcc __asm__ blocks too. "%=" for generating a

-unique number looks like a good alternative, but is that actually a

-documented feature? In any case this problem doesn't currently arise.

-ZERO DISPLACEMENTS

-In a couple of places addressing modes like 0(%ebx) with a byte-sized zero

-displacement are wanted, rather than (%ebx) with no displacement. These are

-either for computed jumps or to get desirable code alignment. Explicit

-.byte sequences are used to ensure the assembler doesn't turn 0(%ebx) into

-(%ebx). The Zdisp() macro in x86-defs.m4 is used for this.

-Current gas 2.9.5 or recent 2.9.1 leave 0(%ebx) as written, but old gas

-1.92.3 changes it. In general changing would be the sort of "optimization"

-an assembler might perform, hence explicit ".byte"s are used where

-necessary.
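To make the size difference concrete (an illustration, not part of the deleted README), the Python sketch below builds the machine code for a 32-bit register load with and without a byte-sized zero displacement. The function name `movl_load` is hypothetical; the encoding uses the standard 0x8B opcode with a ModR/M byte, and %esp and %ebp are deliberately left out as base registers because their encodings need a SIB byte or imply a disp32.

```python
# Register numbers as used in the ModR/M byte. %esp and %ebp are omitted on
# purpose: as base registers they need special encodings not modelled here.
REG = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3, "esi": 6, "edi": 7}


def movl_load(base, dest, force_disp8=False):
    """Encode `movl (base), dest`, or `movl 0(base), dest` with a zero disp8.

    Hypothetical helper for illustration; GMP itself uses the Zdisp() m4
    macro to emit explicit .byte sequences.
    """
    mod = 0b01 if force_disp8 else 0b00      # mod=01 selects an 8-bit disp
    modrm = (mod << 6) | (REG[dest] << 3) | REG[base]
    code = bytes([0x8B, modrm])              # 0x8B = mov r32, r/m32
    if force_disp8:
        code += b"\x00"                      # the explicit zero displacement
    return code


short_form = movl_load("ebx", "eax")                    # movl (%ebx), %eax
long_form = movl_load("ebx", "eax", force_disp8=True)   # movl 0(%ebx), %eax
```

The two forms differ by one byte, which is what matters when every entry point of a computed jump must have the same code size.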

-SHLD/SHRD INSTRUCTIONS

-The %cl count forms of double shift instructions like "shldl %cl,%eax,%ebx"

-must be written "shldl %eax,%ebx" for some assemblers. gas takes either,

-Solaris "as" doesn't allow %cl, gcc generates %cl for gas and NeXT (which is

-gas), and omits %cl elsewhere.

-For GMP an autoconf test GMP_ASM_X86_SHLDL_CL is used to determine whether

-%cl should be used, and the macros shldl, shrdl, shldw and shrdw in

-mpn/x86/x86-defs.m4 pass through or omit %cl as necessary. See the comments

-with those macros for usage.

-IMUL INSTRUCTION

-GCC config/i386/i386.md (cvs rev 1.187, 21 Oct 00) under *mulsi3_1 notes

-that the following two forms produce identical object code

- imul $12, %eax

- imul $12, %eax, %eax

-but that the former isn't accepted by some assemblers, in particular the SCO

-OSR5 COFF assembler. GMP follows GCC and uses only the latter form.

-(This applies only to immediate operands; the three operand form is only

-valid with an immediate.)

-DIRECTION FLAG

-The x86 calling conventions say that the direction flag should be clear at

-function entry and exit. (See iBCS2 and SVR4 ABI books, references below.)

-Although this has been so since the year dot, it's not absolutely clear

-whether it's universally respected. Since it's better to be safe than

-sorry, GMP follows glibc and does a "cld" if it depends on the direction

-flag being clear. This happens only in a few places.

-POSITION INDEPENDENT CODE

- Coding Style

- Defining the symbol PIC in m4 processing selects SVR4 / ELF style

- position independent code. This is necessary for shared libraries

- because they can be mapped into different processes at different virtual

- addresses. Actually, relocations are allowed but text pages with

- relocations aren't shared, defeating the purpose of a shared library.

- The GOT is used to access global data, and the PLT is used for

- functions. The use of the PLT adds a fixed cost to every function call,

- and the GOT adds a cost to any function accessing global variables.

- These are small but might be noticeable when working with small

- operands.

- Scope

- It's intended, as a matter of policy, that references within libgmp are

- resolved within libgmp. Certainly there's no need for an application to

- replace any internals, and we take the view that there's no value in an

- application subverting anything documented either.

- Resolving references within libgmp in theory means calls can be made with a

- plain PC-relative call instruction, which is faster and smaller than going

- through the PLT, and data references can be similarly PC-relative, saving a

- GOT entry and fetch from there. Unfortunately the normal linker behaviour

- doesn't allow us to do this.

- By default an R_386_PC32 PC-relative reference, either for a call or for

- data, is left in libgmp.so by the linker so that it can be resolved at

- runtime to a location in the application or another shared library. This

- means a text segment relocation which we don't want.

- -Bsymbolic

- Under the "-Bsymbolic" option, the linker resolves references to symbols

- within libgmp.so. This gives us the desired effect for R_386_PC32,

- ie. it's resolved at link time. It also resolves R_386_PLT32 calls

- directly to their target without creating a PLT entry (though if this is

- done to normal compiler-generated code it still leaves a setup of %ebx

- to _GLOBAL_OFFSET_TABLE_ which may then be unnecessary).

- Unfortunately -Bsymbolic does bad things to global variables defined in

- a shared library but accessed by non-PIC code from the mainline (or a

- static library).

- The problem is that the mainline needs a fixed data address to avoid

- text segment relocations, so space is allocated in its data segment and

- the value from the variable is copied from the shared library's data

- segment when the library is loaded. Under -Bsymbolic, however,

- references in the shared library are then resolved still to the shared

- library data area. Not surprisingly it bombs badly to have mainline

- code and library code accessing different locations for what should be

- one variable.

- Note that this -Bsymbolic effect for the shared library is not just for

- R_386_PC32 offsets which might have been cooked up in assembler, but is

- done also for the contents of GOT entries. -Bsymbolic simply applies a

- general rule that symbols are resolved first from the local module.

- Visibility Attributes

- GCC __attribute__ ((visibility ("protected"))), which is available in

- recent versions, eg. 3.3, is probably what we'd like to use. It makes

- gcc generate plain PC-relative calls to indicated functions, and directs

- the linker to resolve references to the given function within the link

- module.

- Unfortunately, as of debian binutils 2.13.90.0.16 at least, the

- resulting libgmp.so comes out with text segment relocations, references

- are not resolved at link time. If the gcc description is to be believed

- this is not how it should work. If a symbol cannot be overridden

- by another module then surely references within that module can be

- resolved immediately (ie. at link time).

- Present

- In any case, all this means that we have no optimizations we can

- usefully make to function or variable usages, neither for assembler nor

- C code. Perhaps in the future the visibility attribute will work as

- we'd like.

-GLOBAL OFFSET TABLE

-The magic _GLOBAL_OFFSET_TABLE_ used by code establishing the address of the

-GOT sometimes requires an extra underscore prefix. SVR4 systems and NetBSD

-don't need a prefix, OpenBSD does need one. Note that NetBSD and OpenBSD

-are both a.out underscore systems, so the prefix for _GLOBAL_OFFSET_TABLE_

-is not simply the same as the prefix for ordinary globals.

-In any case in the asm code we write _GLOBAL_OFFSET_TABLE_ and let a macro

-in x86-defs.m4 add an extra underscore if required (according to a configure

-test).

-Old gas 1.92.3 which comes with FreeBSD 2.2.8 gets a segmentation fault when

-asked to assemble the following,

- L1:

- addl $_GLOBAL_OFFSET_TABLE_+[.-L1], %ebx

-It seems that using the label in the same instruction it refers to is the

-problem, since a nop in between works. But the simplest workaround is to

-follow gcc and omit the +[.-L1] since it does nothing,

- addl $_GLOBAL_OFFSET_TABLE_, %ebx

-Current gas 2.10 generates incorrect object code when %eax is used in such a

-construction (with or without +[.-L1]),

- addl $_GLOBAL_OFFSET_TABLE_, %eax

-The R_386_GOTPC gets a displacement of 2 rather than the 1 appropriate for

-the 1 byte opcode of "addl $n,%eax". The best workaround is just to use any

-other register, since then it's a two byte opcode+mod/rm. GCC for example

-always uses %ebx (which is needed for calls through the PLT).

-A similar problem occurs in an leal (again with or without a +[.-L1]),

- leal _GLOBAL_OFFSET_TABLE_(%edi), %ebx

-This time the R_386_GOTPC gets a displacement of 0 rather than the 2

-appropriate for the opcode and mod/rm, making this form unusable.

-SIMPLE LOOPS

-The overheads in setting up for an unrolled loop can mean that at small

-sizes a simple loop is faster. Making small sizes go fast is important,

-even if it adds a cycle or two to bigger sizes. To this end various

-routines choose between a simple loop and an unrolled loop according to

-operand size. The path to the simple loop, or to special case code for

-small sizes, is always as fast as possible.

-Adding a simple loop requires a conditional jump to choose between the

-simple and unrolled code. The size of a branch misprediction penalty

-affects whether a simple loop is worthwhile.

-The convention is for an m4 definition UNROLL_THRESHOLD to set the crossover

-point, with sizes < UNROLL_THRESHOLD using the simple loop, sizes >=

-UNROLL_THRESHOLD using the unrolled loop. If position independent code adds

-a couple of cycles to an unrolled loop setup, the threshold will vary with

-PIC or non-PIC. Something like the following is typical.

- deflit(UNROLL_THRESHOLD, ifdef(`PIC',10,8))

-There's no automated way to determine the threshold. Setting it to a small

-value and then to a big value makes it possible to measure the simple and

-unrolled loops each over a range of sizes, from which the crossover point

-can be determined. Alternatively, just adjust the threshold up or down until

-there are no further speedups.
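As an illustration of the threshold convention (not part of the deleted README), the Python sketch below shows the dispatch rule and one way a crossover point could be located from per-size costs. The function names and the two linear cost models are invented for the example and are not measurements of any GMP routine; the search also assumes the two cost curves cross only once.

```python
def choose_loop(size, unroll_threshold):
    """Sizes below the threshold take the simple loop, the rest unrolled."""
    return "simple" if size < unroll_threshold else "unrolled"


def find_threshold(simple_cost, unrolled_cost, max_size=64):
    """Smallest size at which the unrolled loop is no slower than the simple
    loop, or None if that never happens up to max_size."""
    for size in range(1, max_size + 1):
        if unrolled_cost(size) <= simple_cost(size):
            return size
    return None


# Made-up cost models in cycles, purely for illustration: the unrolled loop
# has a larger fixed setup cost but a smaller per-limb cost.
def simple_cycles(n):
    return 3 + 4 * n


def unrolled_cycles(n):
    return 15 + 2 * n


threshold = find_threshold(simple_cycles, unrolled_cycles)
```

In practice the costs would come from timing the two code paths with the threshold forced low and then high, as described above.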

-UNROLLED LOOP CODING

-The x86 addressing modes allow a byte displacement of -128 to +127, making

-it possible to access 256 bytes, which is 64 limbs, without adjusting

-pointer registers within the loop. Dword sized displacements can be used

-too, but they increase code size, and unrolling to 64 ought to be enough.

-When unrolling to the full 64 limbs/loop, the limb at the top of the loop

-will have a displacement of -128, so pointers have to have a corresponding

-+128 added before entering the loop. When unrolling to 32 limbs/loop

-displacements 0 to 127 can be used with 0 at the top of the loop and no

-adjustment needed to the pointers.

-Where 64 limbs/loop is supported, the +128 adjustment is done only when 64

-limbs/loop is selected. Usually the gain in speed using 64 instead of 32 or

-16 is small, so support for 64 limbs/loop is generally only for comparison.
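The displacement arithmetic above can be checked with a short Python sketch (an illustration, not part of the deleted README). The function name `byte_displacements` is hypothetical, and the rule used here to decide when the +128 pointer bias is needed is a simplification of the text: bias only when the unbiased displacements would not fit in a signed byte.

```python
def byte_displacements(limbs_per_loop, bytes_per_limb=4):
    """Return (pointer_bias, displacements) for one pass of an unrolled loop
    that uses only byte-sized displacements (-128 to +127)."""
    highest = bytes_per_limb * (limbs_per_loop - 1)
    bias = 128 if highest > 127 else 0       # +128 added to the pointers
    disps = [bytes_per_limb * i - bias for i in range(limbs_per_loop)]
    if disps[0] < -128 or disps[-1] > 127:
        raise ValueError("too many limbs for byte displacements")
    return bias, disps


bias64, disps64 = byte_displacements(64)     # full 64 limbs/loop
bias32, disps32 = byte_displacements(32)     # 32 limbs/loop, no bias needed
```

With 4-byte limbs, 64 limbs span exactly the 256 bytes a signed byte displacement can reach, so anything larger is rejected.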

-COMPUTED JUMPS

-When working from least significant limb to most significant limb (most

-routines) the computed jump and pointer calculations in preparation for an

-unrolled loop are as follows.

- S = operand size in limbs

- N = number of limbs per loop (UNROLL_COUNT)

- L = log2 of unrolling (UNROLL_LOG2)

- M = mask for unrolling (UNROLL_MASK)

- C = code bytes per limb in the loop

- B = bytes per limb (4 for x86)

-    computed jump             (-S & M) * C + entrypoint

-    subtract from pointers    (-S & M) * B

-    initial loop counter      (S-1) >> L

-    displacements             0 to B*(N-1)

-The loop counter is decremented at the end of each loop, and the looping

-stops when the decrement takes the counter to -1. The displacements are the

-offsets used to address each limb, eg. a load with "movl disp(%ebx), %eax".

-Usually the multiply by "C" can be handled without an imul, using instead an

-leal, or a shift and subtract.
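The formulas above can be sanity-checked by simulation (an illustration, not part of the deleted README). The Python sketch below computes the jump offset, pointer adjustment and loop counter for the least-significant-first case, then replays the loop to confirm that every limb is touched exactly once and in order. The function names are hypothetical, and the entry point is taken as offset 0.

```python
def unrolled_plan(size, unroll_log2, code_bytes_per_limb, bytes_per_limb=4):
    """Apply the formulas for S limbs with N = 2**L limbs per loop."""
    mask = (1 << unroll_log2) - 1                    # M
    skip = -size & mask                              # limbs skipped on entry
    return {
        "jump_offset": skip * code_bytes_per_limb,   # (-S & M) * C
        "pointer_adjust": skip * bytes_per_limb,     # (-S & M) * B
        "loop_counter": (size - 1) >> unroll_log2,   # (S-1) >> L
    }


def limbs_touched(size, unroll_log2, bytes_per_limb=4):
    """Replay the unrolled loop and list the limb indexes it accesses."""
    n = 1 << unroll_log2
    plan = unrolled_plan(size, unroll_log2, 1, bytes_per_limb)
    first = plan["pointer_adjust"] // bytes_per_limb   # entry slot in pass 1
    ptr = -plan["pointer_adjust"]      # byte offset from the operand start
    counter = plan["loop_counter"]
    touched = []
    while True:
        for slot in range(first, n):   # displacements B*slot, up to B*(N-1)
            touched.append((ptr + bytes_per_limb * slot) // bytes_per_limb)
        first = 0                      # later passes run the whole body
        ptr += n * bytes_per_limb      # pointers advance once per pass
        counter -= 1
        if counter < 0:                # stop when the counter reaches -1
            break
    return touched


plan = unrolled_plan(size=9, unroll_log2=3, code_bytes_per_limb=15)
```

For S=9 and N=8 the first pass enters at the last slot and handles one limb, and a second full pass handles the remaining eight. The 15 code bytes per limb is an arbitrary figure chosen for the example.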

-When working from most significant to least significant limb (eg. mpn_lshift

-and mpn_copyd), the calculations change as follows.

-    add to pointers           (-S & M) * B

-    displacements             0 to -B*(N-1)

-OLD GAS 1.92.3

-This version comes with FreeBSD 2.2.8 and has a couple of gremlins that

-affect GMP code.

-Firstly, an expression involving two forward references to labels comes out

-as zero. For example,

- addl $bar-foo, %eax

- foo:

- nop

- bar:

-This should lead to "addl $1, %eax", but it comes out as "addl $0, %eax".

-When only one forward reference is involved, it works correctly, as for

-example,

- foo:

- addl $bar-foo, %eax

- nop

- bar:

-Secondly, an expression involving two labels can't be used as the

-displacement for an leal. For example,

- foo:

- nop

- bar:

- leal bar-foo(%eax,%ebx,8), %ecx

-A slightly cryptic error is given, "Unimplemented segment type 0 in

-parse_operand". When only one label is used it's ok, and the label can be a

-forward reference too, as for example,

- leal foo(%eax,%ebx,8), %ecx

- nop

- foo:

-These problems only affect PIC computed jump calculations. The workarounds

-are just to do an leal without a displacement and then an addl, and to make

-sure the code is placed so that there's at most one forward reference in the

-addl.

-REFERENCES

-"Intel Architecture Software Developer's Manual", volumes 1, 2a, 2b, 3a, 3b,

-2006, order numbers 253665 through 253669. Available on-line,

- ftp://download.intel.com/design/Pentium4/manuals/25366518.pdf

- ftp://download.intel.com/design/Pentium4/manuals/25366618.pdf

- ftp://download.intel.com/design/Pentium4/manuals/25366718.pdf

- ftp://download.intel.com/design/Pentium4/manuals/25366818.pdf

- ftp://download.intel.com/design/Pentium4/manuals/25366918.pdf

-"System V Application Binary Interface", Unix System Laboratories Inc, 1992,

-published by Prentice Hall, ISBN 0-13-880410-9. And the "Intel386 Processor

-Supplement", AT&T, 1991, ISBN 0-13-877689-X. These have details of calling

-conventions and ELF shared library PIC coding. Versions of both available

-on-line,

- http://www.sco.com/developer/devspecs

-"Intel386 Family Binary Compatibility Specification 2", Intel Corporation,

-published by McGraw-Hill, 1991, ISBN 0-07-031219-2. (Same as the above 386

-ABI supplement.)

-----------------

-Local variables:

-mode: text

-fill-column: 76

-End:
