OLD | NEW |
| (Empty) |
1 Copyright 2000, 2001, 2002, 2004 Free Software Foundation, Inc. | |
2 | |
3 This file is part of the GNU MP Library. | |
4 | |
5 The GNU MP Library is free software; you can redistribute it and/or modify | |
6 it under the terms of the GNU Lesser General Public License as published by | |
7 the Free Software Foundation; either version 3 of the License, or (at your | |
8 option) any later version. | |
9 | |
10 The GNU MP Library is distributed in the hope that it will be useful, but | |
11 WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY | |
12 or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public | |
13 License for more details. | |
14 | |
15 You should have received a copy of the GNU Lesser General Public License | |
16 along with the GNU MP Library. If not, see http://www.gnu.org/licenses/. | |
17 | |
18 | |
19 | |
20 | |
21 | |
22 GMP SPEED MEASURING AND PARAMETER TUNING | |
23 | |
24 | |
25 The programs in this directory are for knowledgeable users who want to | |
26 measure GMP routines on their machine, and perhaps tweak some settings or | |
27 identify things that can be improved. | |
28 | |
29 The programs here are tools, not ready to run solutions. Nothing is built | |
30 in a normal "make all", but various Makefile targets described below exist. | |
31 | |
32 Relatively few systems and CPUs have been tested, so be sure to verify that | |
33 results are sensible before relying on them. | |
34 | |
35 | |
36 | |
37 | |
38 MISCELLANEOUS NOTES | |
39 | |
40 --enable-assert | |
41 | |
42 Don't configure with --enable-assert, since the extra code added by | |
43 assertion checking may influence measurements. | |
44 | |
45 Direct mapped caches | |
46 | |
47 Some effort has been made to accommodate CPUs with direct mapped caches, | |
48 by putting data blocks more or less contiguously on the stack. But this | |
49 will depend on TMP_ALLOC using alloca, and even then it may or may not | |
50 be enough. | |
51 | |
52 FreeBSD 4.2 i486 getrusage | |
53 | |
54 This getrusage seems to be a bit doubtful, it looks like it's | |
55 microsecond accurate, but sometimes ru_utime remains unchanged after a | |
56 time of many microseconds has elapsed. It'd be good to detect this in | |
57 the time.c initializations, but for now the suggestion is to pretend it | |
58 doesn't exist. | |
59 | |
60 ./configure ac_cv_func_getrusage=no | |
61 | |
62 NetBSD 1.4.1 m68k macintosh time base | |
63 | |
64 On this system it's been found getrusage often goes backwards, making it | |
65 unusable (time.c getrusage_backwards_p detects this). gettimeofday | |
66 sometimes doesn't update atomically when it crosses a 1 second boundary. | |
67 Not sure what to do about this. Expect possible intermittent failures. | |
68 | |
69 SCO OpenUNIX 8 /etc/hw | |
70 | |
71 /etc/hw takes about a second to return the cpu frequency, which suggests | |
72 perhaps it's measuring each time it runs. If this is annoying when | |
73 running the speed program repeatedly then set a GMP_CPU_FREQUENCY | |
74 environment variable (see TIME BASE section below). | |
75 | |
76 Low resolution timebase | |
77 | |
78 Parameter tuning can be very time consuming if the only timebase | |
79 available is a 10 millisecond clock tick, to the point of being | |
80 unusable. This is currently the case on VAX and ARM systems. | |
81 | |
82 | |
83 | |
84 | |
85 PARAMETER TUNING | |
86 | |
87 The "tuneup" program runs some tests designed to find the best settings for | |
88 various thresholds, like MUL_KARATSUBA_THRESHOLD. Its output can be put | |
89 into gmp-mparam.h. The program is built and run with | |
90 | |
91 make tune | |
92 | |
93 If the thresholds indicated are grossly different from the values in the | |
94 selected gmp-mparam.h then there may be a performance boost in applicable | |
95 size ranges by changing gmp-mparam.h accordingly. | |
96 | |
97 Be sure to do a full reconfigure and rebuild to get any newly set thresholds | |
98 to take effect. A partial rebuild is enough sometimes, but a fresh | |
99 configure and make is certain to be correct. | |
100 | |
101 If a CPU has specific tuned parameters coming from a gmp-mparam.h in one of | |
102 the mpn subdirectories then the values from "make tune" should be similar. | |
103 But check that the configured CPU is right and there are no machine specific | |
104 effects causing a difference. | |
105 | |
106 It's hoped the compiler and options used won't have too much effect on | |
107 thresholds, since for most CPUs they ultimately come down to comparisons | |
108 between assembler subroutines. Missing out on the longlong.h macros by not | |
109 using gcc will probably have an effect. | |
110 | |
111 Some thresholds produced by the tune program are merely single values chosen | |
112 from what's a range of sizes where two algorithms are pretty much the same | |
113 speed. When this happens the program is likely to give somewhat different | |
114 values on successive runs. This is noticeable on the toom3 thresholds for | |
115 instance. | |
116 | |
117 | |
118 | |
119 | |
120 SPEED PROGRAM | |
121 | |
122 The "speed" program can be used for measuring and comparing various | |
123 routines, and producing tables of data or gnuplot graphs. Compile it with | |
124 | |
125 make speed | |
126 | |
127 (Or on DOS systems "make speed.exe".) | |
128 | |
129 Here are some examples of how to use it. Check the code for all the | |
130 options. | |
131 | |
132 Draw a graph of mpn_mul_n, stepping through sizes by 10 or a factor of 1.05 | |
133 (whichever is greater). | |
134 | |
135 ./speed -s 10-5000 -t 10 -f 1.05 -P foo mpn_mul_n | |
136 gnuplot foo.gnuplot | |
137 | |
138 Compare mpn_add_n and an mpn_lshift by 1, showing times in cycles and | |
139 showing under mpn_lshift the difference between it and mpn_add_n. | |
140 | |
141 ./speed -s 1-40 -c -d mpn_add_n mpn_lshift.1 | |
142 | |
143 Using option -c for times in cycles is interesting but normally only | |
144 necessary when looking carefully at assembler subroutines. You might think | |
145 it would always give an integer value, but this doesn't happen in practice, | |
146 probably due to overheads in the time measurements. | |
147 | |
148 In the free-form output the "#" symbol against a measurement means the | |
149 corresponding routine is fastest at that size. This is a convenient visual | |
150 cue when comparing different routines. The graph data files <name>.data | |
151 don't get this since it would upset gnuplot or other data viewers. | |
152 | |
153 | |
154 | |
155 | |
156 TIME BASE | |
157 | |
158 The time measuring method is determined in time.c, based on what the | |
159 configured host has available. A cycle counter is preferred, possibly | |
160 supplemented by another method if the counter has a limited range. A | |
161 microsecond accurate getrusage() or gettimeofday() will work quite well too. | |
162 | |
163 The cycle counters (except possibly on alpha) and gettimeofday() will depend | |
164 on the machine being otherwise idle, or rather on other jobs not stealing | |
165 CPU time from the measuring program. Short routines (those that complete | |
166 within a timeslice) should work even on a busy machine. | |
167 | |
168 Some trouble is taken by speed_measure() in common.c to avoid ill effects | |
169 from sporadic interrupts, or other intermittent things (like cron waking up | |
170 every minute). But generally an idle machine will be necessary to be | |
171 certain of consistent results. | |
172 | |
173 The CPU frequency is needed to convert between cycles and seconds, or for | |
174 when a cycle counter is supplemented by getrusage() etc. The speed program | |
175 will convert as necessary according to the output format requested. The | |
176 tune program will work with either cycles or seconds. | |
177 | |
178 freq.c knows how to get the frequency on some systems, or can measure a | |
179 cycle counter against gettimeofday() or getrusage(), but when that fails, or | |
180 needs to be overridden, an environment variable GMP_CPU_FREQUENCY can be | |
181 used (in Hertz). For example in "bash" on a 650 MHz machine, | |
182 | |
183 export GMP_CPU_FREQUENCY=650e6 | |
184 | |
185 A high precision time base makes it possible to get accurate measurements in | |
186 a shorter time. | |
187 | |
188 | |
189 | |
190 | |
191 EXAMPLE COMPARISONS - VARIOUS | |
192 | |
193 Here are some ideas for things that can be done with the speed program. | |
194 | |
195 There's always going to be a certain amount of overhead in the time | |
196 measurements, due to reading the time base, and in the loop that runs a | |
197 routine enough times to get a reading of the desired precision. Noop | |
198 functions taking various arguments are available to measure this. The | |
199 "overhead" printed by the speed program each time in its intro is the "noop" | |
200 routine, but note that this is just for information, it isn't deducted from | |
201 the times printed or anything. | |
202 | |
203 ./speed -s 1 noop noop_wxs noop_wxys | |
204 | |
205 To see how many cycles per limb a routine is taking, look at the time | |
206 increase when the size increments, using option -D. This avoids fixed | |
207 overheads in the measuring. Also, remember many of the assembler routines | |
208 have unrolled loops, so it might be necessary to compare times at, say, 16, | |
209 32, 48, 64 etc to see what the unrolled part is taking, as opposed to any | |
210 finishing off. | |
211 | |
212 ./speed -s 16-64 -t 16 -C -D mpn_add_n | |
213 | |
214 The -C option on its own gives cycles per limb, but is really only useful at | |
215 big sizes where fixed overheads are small compared to the code doing the | |
216 real work. Remember of course memory caching and/or page swapping will | |
217 affect results at large sizes. | |
218 | |
219 ./speed -s 500000 -C mpn_add_n | |
220 | |
221 Once a calculation stops fitting in the CPU data cache, it's going to start | |
222 taking longer. Exactly where this happens depends on the cache priming in | |
223 the measuring routines, and on what sort of "least recently used" the | |
224 hardware does. Here's an example for a CPU with a 16kbyte L1 data cache and | |
225 32-bit limb, showing a suddenly steeper curve for mpn_add_n at about 2000 | |
226 limbs. | |
227 | |
228 ./speed -s 1-4000 -t 5 -f 1.02 -P foo mpn_add_n | |
229 gnuplot foo.gnuplot | |
230 | |
231 When a routine has an unrolled loop for, say, multiples of 8 limbs and then | |
232 an ordinary loop for the remainder, it can happen that it's actually faster | |
233 to do an operation on, say, 8 limbs than it is on 7 limbs. The following | |
234 draws a graph of mpn_sub_n, to see whether times smoothly increase with | |
235 size. | |
236 | |
237 ./speed -s 1-100 -c -P foo mpn_sub_n | |
238 gnuplot foo.gnuplot | |
239 | |
240 If mpn_lshift and mpn_rshift have special case code for shifts by 1, it | |
241 ought to be faster (or at least not slower) than shifting by, say, 2 bits. | |
242 | |
243 ./speed -s 1-200 -c mpn_rshift.1 mpn_rshift.2 | |
244 | |
245 An mpn_lshift by 1 can be done by mpn_add_n adding a number to itself, and | |
246 if the lshift isn't faster there's an obvious improvement that's possible. | |
247 | |
248 ./speed -s 1-200 -c mpn_lshift.1 mpn_add_n_self | |
249 | |
250 On some CPUs (AMD K6 for example) an "in-place" mpn_add_n where the | |
251 destination is one of the sources is faster than a separate destination. | |
252 Here's an example to see this. ".1" selects dst==src1 for mpn_add_n (and | |
253 mpn_sub_n), for other values see speed.h SPEED_ROUTINE_MPN_BINARY_N_CALL. | |
254 | |
255 ./speed -s 1-200 -c mpn_add_n mpn_add_n.1 | |
256 | |
257 The gmp manual points out that divisions by powers of two should be done | |
258 using a right shift because it'll be significantly faster than an actual | |
259 division. The following shows by what factor mpn_rshift is faster than | |
260 mpn_divrem_1, using division by 32 as an example. | |
261 | |
262 ./speed -s 10-20 -r mpn_rshift.5 mpn_divrem_1.32 | |
263 | |
264 | |
265 | |
266 | |
267 EXAMPLE COMPARISONS - MULTIPLICATION | |
268 | |
269 mul_basecase takes a ".<r>" parameter which is the first (larger) size | |
270 parameter. For example to show speeds for 20x1 up to 20x15 in cycles, | |
271 | |
272 ./speed -s 1-15 -c mpn_mul_basecase.20 | |
273 | |
274 mul_basecase with no parameter does an NxN multiply, so for example to show | |
275 speeds in cycles for 1x1, 2x2, 3x3, etc, up to 20x20, in cycles, | |
276 | |
277 ./speed -s 1-20 -c mpn_mul_basecase | |
278 | |
279 sqr_basecase is implemented by a "triangular" method on most CPUs, making it | |
280 up to twice as fast as mul_basecase. In practice loop overheads and the | |
281 products on the diagonal mean it falls short of this. Here's an example | |
282 running the two and showing by what factor an NxN mul_basecase is slower | |
283 than an NxN sqr_basecase. (Some versions of sqr_basecase only allow sizes | |
284 below SQR_KARATSUBA_THRESHOLD, so if it crashes at that point don't worry.) | |
285 | |
286 ./speed -s 1-20 -r mpn_sqr_basecase mpn_mul_basecase | |
287 | |
288 The technique described above with -CD for showing the time difference in | |
289 cycles per limb between two size operations can be done on an NxN | |
290 mul_basecase using -E to change the basis for the size increment to N*N. | |
291 For instance a 20x20 operation is taken to be doing 400 limbs, and a 16x16 | |
292 doing 256 limbs. The following therefore shows the per crossproduct speed | |
293 of mul_basecase and sqr_basecase at around 20x20 limbs. | |
294 | |
295 ./speed -s 16-20 -t 4 -CDE mpn_mul_basecase mpn_sqr_basecase | |
296 | |
297 Of course sqr_basecase isn't really doing NxN crossproducts, but it can be | |
298 interesting to compare it to mul_basecase as if it was. For sqr_basecase | |
299 the -F option can be used to base the deltas on N*(N+1)/2 operations, which | |
300 is the triangular products sqr_basecase does. For example, | |
301 | |
302 ./speed -s 16-20 -t 4 -CDF mpn_sqr_basecase | |
303 | |
304 Both -E and -F are preliminary and might change. A consistent approach to | |
305 using them when claiming certain per crossproduct or per triangularproduct | |
306 speeds hasn't really been established, but the increment between speeds in | |
307 the range karatsuba will call seems sensible, that being k to k/2. For | |
308 instance, if the karatsuba threshold was 20 for the multiply and 30 for the | |
309 square, | |
310 | |
311 ./speed -s 10-20 -t 10 -CDE mpn_mul_basecase | |
312 ./speed -s 15-30 -t 15 -CDF mpn_sqr_basecase | |
313 | |
314 | |
315 | |
316 EXAMPLE COMPARISONS - MALLOC | |
317 | |
318 The gmp manual recommends application programs avoid excessive initializing | |
319 and clearing of mpz_t variables (and mpq_t and mpf_t too). Every new | |
320 variable will at a minimum go through an init, a realloc for its first | |
321 store, and finally a clear. Quite how long that takes depends on the C | |
322 library. The following compares an mpz_init/realloc/clear to a 10 limb | |
323 mpz_add. Don't be surprised if the mallocing is quite slow. | |
324 | |
325 ./speed -s 10 -c mpz_init_realloc_clear mpz_add | |
326 | |
327 On some systems malloc and free are much slower when dynamic linked. The | |
328 speed-dynamic program can be used to see this. For example the following | |
329 measures malloc/free, first static then dynamic. | |
330 | |
331 ./speed -s 10 -c malloc_free | |
332 ./speed-dynamic -s 10 -c malloc_free | |
333 | |
334 Of course a real world program has big problems if it's doing so many | |
335 mallocs and frees that it gets slowed down by a dynamic linked malloc. | |
336 | |
337 | |
338 | |
339 | |
340 | |
341 EXAMPLE COMPARISONS - STRING CONVERSIONS | |
342 | |
343 mpn_get_str does a binary to string conversion. The base is specified with | |
344 a ".<r>" parameter, or decimal by default. Power of 2 bases are much faster | |
345 than general bases. The following compares decimal and hex for instance. | |
346 | |
347 ./speed -s 1-20 -c mpn_get_str mpn_get_str.16 | |
348 | |
349 Smaller bases need more divisions to split a given size number, and so are | |
350 slower. The following compares base 3 and base 9. On small operands 9 will | |
351 be nearly twice as fast, though at bigger sizes this reduces since in the | |
352 current implementation both divide repeatedly by 3^20 (or 3^40 for 64 bit | |
353 limbs) and those divisions come to dominate. | |
354 | |
355 ./speed -s 1-20 -cr mpn_get_str.3 mpn_get_str.9 | |
356 | |
357 mpn_set_str does a string to binary conversion. The base is specified with | |
358 a ".<r>" parameter, or decimal by default. Power of 2 bases are faster than | |
359 general bases on large conversions. | |
360 | |
361 ./speed -s 1-512 -f 2 -c mpn_set_str.8 mpn_set_str.10 | |
362 | |
363 mpn_set_str also has some special case code for decimal which is a bit | |
364 faster than the general case, basically by giving the compiler a chance to | |
365 optimize some multiplications by 10. | |
366 | |
367 ./speed -s 20-40 -c mpn_set_str.9 mpn_set_str.10 mpn_set_str.11 | |
368 | |
369 | |
370 | |
371 | |
372 EXAMPLE COMPARISONS - GCDs | |
373 | |
374 mpn_gcd_1 has a threshold for when to reduce using an initial x%y when both | |
375 x and y are single limbs. This isn't tuned currently, but a value can be | |
376 established by a measurement like | |
377 | |
378 ./speed -s 10-32 mpn_gcd_1.10 | |
379 | |
380 This runs src[0] from 10 to 32 bits, and y fixed at 10 bits. If the div | |
381 threshold is high, say 31 so it's effectively disabled then a 32x10 bit gcd | |
382 is done by nibbling away at the 32-bit operands bit-by-bit. When the | |
383 threshold is small, say 1 bit, then an initial x%y is done to reduce it to a | |
384 10x10 bit operation. | |
385 | |
386 The threshold in mpn/generic/gcd_1.c or the various assembler | |
387 implementations can be tweaked up or down until there's no more speedups on | |
388 interesting combinations of sizes. Note that this affects only a 1x1 limb | |
389 operation and so isn't very important. (An Nx1 limb operation always does | |
390 an initial modular reduction, using mpn_mod_1 or mpn_modexact_1_odd.) | |
391 | |
392 | |
393 | |
394 | |
395 SPEED PROGRAM EXTENSIONS | |
396 | |
397 Potentially lots of things could be made available in the program, but it's | |
398 been left at only the things that have actually been wanted and are likely | |
399 to be reasonably useful in the future. | |
400 | |
401 Extensions should be fairly easy to make though. speed-ext.c is an example, | |
402 in a style that should suit one-off tests, or new code fragments under | |
403 development. | |
404 | |
405 many.pl is a script for generating a new speed program supplemented with | |
406 alternate versions of the standard routines. It can be used for measuring | |
407 experimental code, or for comparing different implementations that exist | |
408 within a CPU family. | |
409 | |
410 | |
411 | |
412 | |
413 THRESHOLD EXAMINING | |
414 | |
415 The speed program can be used to examine the speeds of different algorithms | |
416 to check the tune program has done the right thing. For example to examine | |
417 the karatsuba multiply threshold, | |
418 | |
419 ./speed -s 5-40 mpn_mul_basecase mpn_kara_mul_n | |
420 | |
421 When examining the toom3 threshold, remember it depends on the karatsuba | |
422 threshold, so the right karatsuba threshold needs to be compiled into the | |
423 library first. The tune program uses specially recompiled versions of | |
424 mpn/mul_n.c etc for this reason, but the speed program simply uses the | |
425 normal libgmp.la. | |
426 | |
427 Note further that the various routines may recurse into themselves on sizes | |
428 far enough above applicable thresholds. For example, mpn_kara_mul_n will | |
429 recurse into itself on sizes greater than twice the compiled-in | |
430 MUL_KARATSUBA_THRESHOLD. | |
431 | |
432 When doing the above comparison between mul_basecase and kara_mul_n what's | |
433 probably of interest is mul_basecase versus a kara_mul_n that does one level | |
434 of Karatsuba then calls to mul_basecase, but this only happens on sizes less | |
435 than twice the compiled MUL_KARATSUBA_THRESHOLD. A larger value for that | |
436 setting can be compiled-in to avoid the problem if necessary. The same | |
437 applies to toom3 and DC, though in a trickier fashion. | |
438 | |
439 There are some upper limits on some of the thresholds, arising from arrays | |
440 dimensioned according to a threshold (mpn_mul_n), or asm code with certain | |
441 sized displacements (some x86 versions of sqr_basecase). So putting huge | |
442 values for the thresholds, even just for testing, may fail. | |
443 | |
444 | |
445 | |
446 | |
447 FUTURE | |
448 | |
449 Make a program to check the time base is working properly, for small and | |
450 large measurements. Make it able to test each available method, including | |
451 perhaps the apparent resolution of each. | |
452 | |
453 Make a general mechanism for specifying operand overlap, and a syntax like | |
454 maybe "mpn_add_n.dst=src2" to select it. Some measuring routines do this | |
455 sort of thing with the "r" parameter currently. | |
456 | |
457 | |
458 | |
459 ---------------- | |
460 Local variables: | |
461 mode: text | |
462 fill-column: 76 | |
463 End: | |
OLD | NEW |