openssl/crypto/des/asm/readme - Issue 2072073002: Delete bundled copy of OpenSSL and replace with README.

Side by Side Diff: openssl/crypto/des/asm/readme

Issue 2072073002: Delete bundled copy of OpenSSL and replace with README. (Closed) Base URL: https://chromium.googlesource.com/chromium/deps/openssl@master

Patch Set: Delete bundled copy of OpenSSL and replace with README. Created 4 years, 6 months ago

Use n/p to move between diff chunks; N/P to move between comments. Draft comments are only viewable by you.

Jump to:

View unified diff | Download patch

OLD	NEW
	(Empty)
1 First up, let me say I don't like writing in assembler. It is not portable,

2 dependant on the particular CPU architecture release and is generally a pig

3 to debug and get right. Having said that, the x86 architecture is probably

4 the most important for speed due to number of boxes and since

5 it appears to be the worst architecture to to get

6 good C compilers for. So due to this, I have lowered myself to do

7 assembler for the inner DES routines in libdes :-).

8

9 The file to implement in assembler is des_enc.c. Replace the following

10 4 functions

11 des_encrypt1(DES_LONG data[2],des_key_schedule ks, int encrypt);

12 des_encrypt2(DES_LONG data[2],des_key_schedule ks, int encrypt);

13 des_encrypt3(DES_LONG data[2],des_key_schedule ks1,ks2,ks3);

14 des_decrypt3(DES_LONG data[2],des_key_schedule ks1,ks2,ks3);

15

16 They encrypt/decrypt the 64 bits held in 'data' using

17 the 'ks' key schedules. The only difference between the 4 functions is that

18 des_encrypt2() does not perform IP() or FP() on the data (this is an

19 optimization for when doing triple DES and des_encrypt3() and des_decrypt3()

20 perform triple des. The triple DES routines are in here because it does

21 make a big difference to have them located near the des_encrypt2 function

22 at link time..

23

24 Now as we all know, there are lots of different operating systems running on

25 x86 boxes, and unfortunately they normally try to make sure their assembler

26 formating is not the same as the other peoples.

27 The 4 main formats I know of are

28 Microsoft Windows 95/Windows NT

29 Elf Includes Linux and FreeBSD(?).

30 a.out The older Linux.

31 Solaris Same as Elf but different comments :-(.

32

33 Now I was not overly keen to write 4 different copies of the same code,

34 so I wrote a few perl routines to output the correct assembler, given

35 a target assembler type. This code is ugly and is just a hack.

36 The libraries are x86unix.pl and x86ms.pl.

37 des586.pl, des686.pl and des-som[23].pl are the programs to actually

38 generate the assembler.

39

40 So to generate elf assembler

41 perl des-som3.pl elf >dx86-elf.s

42 For Windows 95/NT

43 perl des-som2.pl win32 >win32.asm

44

45 [ update 4 Jan 1996 ]

46 I have added another way to do things.

47 perl des-som3.pl cpp >dx86-cpp.s

48 generates a file that will be included by dx86unix.cpp when it is compiled.

49 To build for elf, a.out, solaris, bsdi etc,

50 cc -E -DELF asm/dx86unix.cpp \| as -o asm/dx86-elf.o

51 cc -E -DSOL asm/dx86unix.cpp \| as -o asm/dx86-sol.o

52 cc -E -DOUT asm/dx86unix.cpp \| as -o asm/dx86-out.o

53 cc -E -DBSDI asm/dx86unix.cpp \| as -o asm/dx86bsdi.o

54 This was done to cut down the number of files in the distribution.

55

56 Now the ugly part. I acquired my copy of Intels

57 "Optimization's For Intel's 32-Bit Processors" and found a few interesting

58 things. First, the aim of the exersize is to 'extract' one byte at a time

59 from a word and do an array lookup. This involves getting the byte from

60 the 4 locations in the word and moving it to a new word and doing the lookup.

61 The most obvious way to do this is

62 xor eax, eax # clear word

63 movb al, cl # get low byte

64 xor edi DWORD PTR 0x100+des_SP[eax] # xor in word

65 movb al, ch # get next byte

66 xor edi DWORD PTR 0x300+des_SP[eax] # xor in word

67 shr ecx 16

68 which seems ok. For the pentium, this system appears to be the best.

69 One has to do instruction interleaving to keep both functional units

70 operating, but it is basically very efficient.

71

72 Now the crunch. When a full register is used after a partial write, eg.

73 mov al, cl

74 xor edi, DWORD PTR 0x100+des_SP[eax]

75 386 - 1 cycle stall

76 486 - 1 cycle stall

77 586 - 0 cycle stall

78 686 - at least 7 cycle stall (page 22 of the above mentioned document).

79

80 So the technique that produces the best results on a pentium, according to

81 the documentation, will produce hideous results on a pentium pro.

82

83 To get around this, des686.pl will generate code that is not as fast on

84 a pentium, should be very good on a pentium pro.

85 mov eax, ecx # copy word

86 shr ecx, 8 # line up next byte

87 and eax, 0fch # mask byte

88 xor edi DWORD PTR 0x100+des_SP[eax] # xor in array lookup

89 mov eax, ecx # get word

90 shr ecx 8 # line up next byte

91 and eax, 0fch # mask byte

92 xor edi DWORD PTR 0x300+des_SP[eax] # xor in array lookup

93

94 Due to the execution units in the pentium, this actually works quite well.

95 For a pentium pro it should be very good. This is the type of output

96 Visual C++ generates.

97

98 There is a third option. instead of using

99 mov al, ch

100 which is bad on the pentium pro, one may be able to use

101 movzx eax, ch

102 which may not incur the partial write penalty. On the pentium,

103 this instruction takes 4 cycles so is not worth using but on the

104 pentium pro it appears it may be worth while. I need access to one to

105 experiment :-).

106

107 eric (20 Oct 1996)

108

109 22 Nov 1996 - I have asked people to run the 2 different version on pentium

110 pros and it appears that the intel documentation is wrong. The

111 mov al,bh is still faster on a pentium pro, so just use the des586.pl

112 install des686.pl

113

114 3 Dec 1996 - I added des_encrypt3/des_decrypt3 because I have moved these

115 functions into des_enc.c because it does make a massive performance

116 difference on some boxes to have the functions code located close to

117 the des_encrypt2() function.

118

119 9 Jan 1997 - des-som2.pl is now the correct perl script to use for

120 pentiums. It contains an inner loop from

121 Svend Olaf Mikkelsen <svolaf@inet.uni-c.dk> which does raw ecb DES calls at

122 273,000 per second. He had a previous version at 250,000 and the best

123 I was able to get was 203,000. The content has not changed, this is all

124 due to instruction sequencing (and actual instructions choice) which is able

125 to keep both functional units of the pentium going.

126 We may have lost the ugly register usage restrictions when x86 went 32 bit

127 but for the pentium it has been replaced by evil instruction ordering tricks.

128

129 13 Jan 1997 - des-som3.pl, more optimizations from Svend Olaf.

130 raw DES at 281,000 per second on a pentium 100.

131

OLD	NEW

« no previous file with comments | « openssl/crypto/des/asm/desboth.pl ('k') | openssl/crypto/des/cbc3_enc.c » ('j') | no next file with comments »