OLD | NEW |
| (Empty) |
1 First up, let me say I don't like writing in assembler. It is not portable, | |
2 dependant on the particular CPU architecture release and is generally a pig | |
3 to debug and get right. Having said that, the x86 architecture is probably | |
4 the most important for speed due to number of boxes and since | |
5 it appears to be the worst architecture to to get | |
6 good C compilers for. So due to this, I have lowered myself to do | |
7 assembler for the inner DES routines in libdes :-). | |
8 | |
9 The file to implement in assembler is des_enc.c. Replace the following | |
10 4 functions | |
11 des_encrypt1(DES_LONG data[2],des_key_schedule ks, int encrypt); | |
12 des_encrypt2(DES_LONG data[2],des_key_schedule ks, int encrypt); | |
13 des_encrypt3(DES_LONG data[2],des_key_schedule ks1,ks2,ks3); | |
14 des_decrypt3(DES_LONG data[2],des_key_schedule ks1,ks2,ks3); | |
15 | |
16 They encrypt/decrypt the 64 bits held in 'data' using | |
17 the 'ks' key schedules. The only difference between the 4 functions is that | |
18 des_encrypt2() does not perform IP() or FP() on the data (this is an | |
19 optimization for when doing triple DES and des_encrypt3() and des_decrypt3() | |
20 perform triple des. The triple DES routines are in here because it does | |
21 make a big difference to have them located near the des_encrypt2 function | |
22 at link time.. | |
23 | |
24 Now as we all know, there are lots of different operating systems running on | |
25 x86 boxes, and unfortunately they normally try to make sure their assembler | |
26 formating is not the same as the other peoples. | |
27 The 4 main formats I know of are | |
28 Microsoft Windows 95/Windows NT | |
29 Elf Includes Linux and FreeBSD(?). | |
30 a.out The older Linux. | |
31 Solaris Same as Elf but different comments :-(. | |
32 | |
33 Now I was not overly keen to write 4 different copies of the same code, | |
34 so I wrote a few perl routines to output the correct assembler, given | |
35 a target assembler type. This code is ugly and is just a hack. | |
36 The libraries are x86unix.pl and x86ms.pl. | |
37 des586.pl, des686.pl and des-som[23].pl are the programs to actually | |
38 generate the assembler. | |
39 | |
40 So to generate elf assembler | |
41 perl des-som3.pl elf >dx86-elf.s | |
42 For Windows 95/NT | |
43 perl des-som2.pl win32 >win32.asm | |
44 | |
45 [ update 4 Jan 1996 ] | |
46 I have added another way to do things. | |
47 perl des-som3.pl cpp >dx86-cpp.s | |
48 generates a file that will be included by dx86unix.cpp when it is compiled. | |
49 To build for elf, a.out, solaris, bsdi etc, | |
50 cc -E -DELF asm/dx86unix.cpp | as -o asm/dx86-elf.o | |
51 cc -E -DSOL asm/dx86unix.cpp | as -o asm/dx86-sol.o | |
52 cc -E -DOUT asm/dx86unix.cpp | as -o asm/dx86-out.o | |
53 cc -E -DBSDI asm/dx86unix.cpp | as -o asm/dx86bsdi.o | |
54 This was done to cut down the number of files in the distribution. | |
55 | |
56 Now the ugly part. I acquired my copy of Intels | |
57 "Optimization's For Intel's 32-Bit Processors" and found a few interesting | |
58 things. First, the aim of the exersize is to 'extract' one byte at a time | |
59 from a word and do an array lookup. This involves getting the byte from | |
60 the 4 locations in the word and moving it to a new word and doing the lookup. | |
61 The most obvious way to do this is | |
62 xor eax, eax # clear word | |
63 movb al, cl # get low byte | |
64 xor edi DWORD PTR 0x100+des_SP[eax] # xor in word | |
65 movb al, ch # get next byte | |
66 xor edi DWORD PTR 0x300+des_SP[eax] # xor in word | |
67 shr ecx 16 | |
68 which seems ok. For the pentium, this system appears to be the best. | |
69 One has to do instruction interleaving to keep both functional units | |
70 operating, but it is basically very efficient. | |
71 | |
72 Now the crunch. When a full register is used after a partial write, eg. | |
73 mov al, cl | |
74 xor edi, DWORD PTR 0x100+des_SP[eax] | |
75 386 - 1 cycle stall | |
76 486 - 1 cycle stall | |
77 586 - 0 cycle stall | |
78 686 - at least 7 cycle stall (page 22 of the above mentioned document). | |
79 | |
80 So the technique that produces the best results on a pentium, according to | |
81 the documentation, will produce hideous results on a pentium pro. | |
82 | |
83 To get around this, des686.pl will generate code that is not as fast on | |
84 a pentium, should be very good on a pentium pro. | |
85 mov eax, ecx # copy word | |
86 shr ecx, 8 # line up next byte | |
87 and eax, 0fch # mask byte | |
88 xor edi DWORD PTR 0x100+des_SP[eax] # xor in array lookup | |
89 mov eax, ecx # get word | |
90 shr ecx 8 # line up next byte | |
91 and eax, 0fch # mask byte | |
92 xor edi DWORD PTR 0x300+des_SP[eax] # xor in array lookup | |
93 | |
94 Due to the execution units in the pentium, this actually works quite well. | |
95 For a pentium pro it should be very good. This is the type of output | |
96 Visual C++ generates. | |
97 | |
98 There is a third option. instead of using | |
99 mov al, ch | |
100 which is bad on the pentium pro, one may be able to use | |
101 movzx eax, ch | |
102 which may not incur the partial write penalty. On the pentium, | |
103 this instruction takes 4 cycles so is not worth using but on the | |
104 pentium pro it appears it may be worth while. I need access to one to | |
105 experiment :-). | |
106 | |
107 eric (20 Oct 1996) | |
108 | |
109 22 Nov 1996 - I have asked people to run the 2 different version on pentium | |
110 pros and it appears that the intel documentation is wrong. The | |
111 mov al,bh is still faster on a pentium pro, so just use the des586.pl | |
112 install des686.pl | |
113 | |
114 3 Dec 1996 - I added des_encrypt3/des_decrypt3 because I have moved these | |
115 functions into des_enc.c because it does make a massive performance | |
116 difference on some boxes to have the functions code located close to | |
117 the des_encrypt2() function. | |
118 | |
119 9 Jan 1997 - des-som2.pl is now the correct perl script to use for | |
120 pentiums. It contains an inner loop from | |
121 Svend Olaf Mikkelsen <svolaf@inet.uni-c.dk> which does raw ecb DES calls at | |
122 273,000 per second. He had a previous version at 250,000 and the best | |
123 I was able to get was 203,000. The content has not changed, this is all | |
124 due to instruction sequencing (and actual instructions choice) which is able | |
125 to keep both functional units of the pentium going. | |
126 We may have lost the ugly register usage restrictions when x86 went 32 bit | |
127 but for the pentium it has been replaced by evil instruction ordering tricks. | |
128 | |
129 13 Jan 1997 - des-som3.pl, more optimizations from Svend Olaf. | |
130 raw DES at 281,000 per second on a pentium 100. | |
131 | |
OLD | NEW |