| OLD | NEW |
| (Empty) |
| 1 Copyright 2000, 2001, 2002, 2004 Free Software Foundation, Inc. | |
| 2 | |
| 3 This file is part of the GNU MP Library. | |
| 4 | |
| 5 The GNU MP Library is free software; you can redistribute it and/or modify | |
| 6 it under the terms of the GNU Lesser General Public License as published by | |
| 7 the Free Software Foundation; either version 3 of the License, or (at your | |
| 8 option) any later version. | |
| 9 | |
| 10 The GNU MP Library is distributed in the hope that it will be useful, but | |
| 11 WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY | |
| 12 or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public | |
| 13 License for more details. | |
| 14 | |
| 15 You should have received a copy of the GNU Lesser General Public License | |
| 16 along with the GNU MP Library. If not, see http://www.gnu.org/licenses/. | |
| 17 | |
| 18 | |
| 19 | |
| 20 | |
| 21 | |
| 22 GMP SPEED MEASURING AND PARAMETER TUNING | |
| 23 | |
| 24 | |
| 25 The programs in this directory are for knowledgeable users who want to | |
| 26 measure GMP routines on their machine, and perhaps tweak some settings or | |
| 27 identify things that can be improved. | |
| 28 | |
| 29 The programs here are tools, not ready to run solutions. Nothing is built | |
| 30 in a normal "make all", but various Makefile targets described below exist. | |
| 31 | |
| 32 Relatively few systems and CPUs have been tested, so be sure to verify that | |
| 33 results are sensible before relying on them. | |
| 34 | |
| 35 | |
| 36 | |
| 37 | |
| 38 MISCELLANEOUS NOTES | |
| 39 | |
| 40 --enable-assert | |
| 41 | |
| 42 Don't configure with --enable-assert, since the extra code added by | |
| 43 assertion checking may influence measurements. | |
| 44 | |
| 45 Direct mapped caches | |
| 46 | |
| 47 Some effort has been made to accommodate CPUs with direct mapped caches, | |
| 48 by putting data blocks more or less contiguously on the stack. But this | |
| 49 will depend on TMP_ALLOC using alloca, and even then it may or may not | |
| 50 be enough. | |
| 51 | |
| 52 FreeBSD 4.2 i486 getrusage | |
| 53 | |
| 54 This getrusage seems to be a bit doubtful, it looks like it's | |
| 55 microsecond accurate, but sometimes ru_utime remains unchanged after a | |
| 56 time of many microseconds has elapsed. It'd be good to detect this in | |
| 57 the time.c initializations, but for now the suggestion is to pretend it | |
| 58 doesn't exist. | |
| 59 | |
| 60 ./configure ac_cv_func_getrusage=no | |
| 61 | |
| 62 NetBSD 1.4.1 m68k macintosh time base | |
| 63 | |
| 64 On this system it's been found getrusage often goes backwards, making it | |
| 65 unusable (time.c getrusage_backwards_p detects this). gettimeofday | |
| 66 sometimes doesn't update atomically when it crosses a 1 second boundary. | |
| 67 Not sure what to do about this. Expect possible intermittent failures. | |
| 68 | |
| 69 SCO OpenUNIX 8 /etc/hw | |
| 70 | |
| 71 /etc/hw takes about a second to return the cpu frequency, which suggests | |
| 72 perhaps it's measuring each time it runs. If this is annoying when | |
| 73 running the speed program repeatedly then set a GMP_CPU_FREQUENCY | |
| 74 environment variable (see TIME BASE section below). | |
| 75 | |
| 76 Low resolution timebase | |
| 77 | |
| 78 Parameter tuning can be very time consuming if the only timebase | |
| 79 available is a 10 millisecond clock tick, to the point of being | |
| 80 unusable. This is currently the case on VAX and ARM systems. | |
| 81 | |
| 82 | |
| 83 | |
| 84 | |
| 85 PARAMETER TUNING | |
| 86 | |
| 87 The "tuneup" program runs some tests designed to find the best settings for | |
| 88 various thresholds, like MUL_KARATSUBA_THRESHOLD. Its output can be put | |
| 89 into gmp-mparam.h. The program is built and run with | |
| 90 | |
| 91 make tune | |
| 92 | |
| 93 If the thresholds indicated are grossly different from the values in the | |
| 94 selected gmp-mparam.h then there may be a performance boost in applicable | |
| 95 size ranges by changing gmp-mparam.h accordingly. | |
| 96 | |
| 97 Be sure to do a full reconfigure and rebuild to get any newly set thresholds | |
| 98 to take effect. A partial rebuild is enough sometimes, but a fresh | |
| 99 configure and make is certain to be correct. | |
| 100 | |
| 101 If a CPU has specific tuned parameters coming from a gmp-mparam.h in one of | |
| 102 the mpn subdirectories then the values from "make tune" should be similar. | |
| 103 But check that the configured CPU is right and there are no machine specific | |
| 104 effects causing a difference. | |
| 105 | |
| 106 It's hoped the compiler and options used won't have too much effect on | |
| 107 thresholds, since for most CPUs they ultimately come down to comparisons | |
| 108 between assembler subroutines. Missing out on the longlong.h macros by not | |
| 109 using gcc will probably have an effect. | |
| 110 | |
| 111 Some thresholds produced by the tune program are merely single values chosen | |
| 112 from what's a range of sizes where two algorithms are pretty much the same | |
| 113 speed. When this happens the program is likely to give somewhat different | |
| 114 values on successive runs. This is noticeable on the toom3 thresholds for | |
| 115 instance. | |
| 116 | |
| 117 | |
| 118 | |
| 119 | |
| 120 SPEED PROGRAM | |
| 121 | |
| 122 The "speed" program can be used for measuring and comparing various | |
| 123 routines, and producing tables of data or gnuplot graphs. Compile it with | |
| 124 | |
| 125 make speed | |
| 126 | |
| 127 (Or on DOS systems "make speed.exe".) | |
| 128 | |
| 129 Here are some examples of how to use it. Check the code for all the | |
| 130 options. | |
| 131 | |
| 132 Draw a graph of mpn_mul_n, stepping through sizes by 10 or a factor of 1.05 | |
| 133 (whichever is greater). | |
| 134 | |
| 135 ./speed -s 10-5000 -t 10 -f 1.05 -P foo mpn_mul_n | |
| 136 gnuplot foo.gnuplot | |
| 137 | |
| 138 Compare mpn_add_n and an mpn_lshift by 1, showing times in cycles and | |
| 139 showing under mpn_lshift the difference between it and mpn_add_n. | |
| 140 | |
| 141 ./speed -s 1-40 -c -d mpn_add_n mpn_lshift.1 | |
| 142 | |
| 143 Using option -c for times in cycles is interesting but normally only | |
| 144 necessary when looking carefully at assembler subroutines. You might think | |
| 145 it would always give an integer value, but this doesn't happen in practice, | |
| 146 probably due to overheads in the time measurements. | |
| 147 | |
| 148 In the free-form output the "#" symbol against a measurement means the | |
| 149 corresponding routine is fastest at that size. This is a convenient visual | |
| 150 cue when comparing different routines. The graph data files <name>.data | |
| 151 don't get this since it would upset gnuplot or other data viewers. | |
| 152 | |
| 153 | |
| 154 | |
| 155 | |
| 156 TIME BASE | |
| 157 | |
| 158 The time measuring method is determined in time.c, based on what the | |
| 159 configured host has available. A cycle counter is preferred, possibly | |
| 160 supplemented by another method if the counter has a limited range. A | |
| 161 microsecond accurate getrusage() or gettimeofday() will work quite well too. | |
| 162 | |
| 163 The cycle counters (except possibly on alpha) and gettimeofday() will depend | |
| 164 on the machine being otherwise idle, or rather on other jobs not stealing | |
| 165 CPU time from the measuring program. Short routines (those that complete | |
| 166 within a timeslice) should work even on a busy machine. | |
| 167 | |
| 168 Some trouble is taken by speed_measure() in common.c to avoid ill effects | |
| 169 from sporadic interrupts, or other intermittent things (like cron waking up | |
| 170 every minute). But generally an idle machine will be necessary to be | |
| 171 certain of consistent results. | |
| 172 | |
| 173 The CPU frequency is needed to convert between cycles and seconds, or for | |
| 174 when a cycle counter is supplemented by getrusage() etc. The speed program | |
| 175 will convert as necessary according to the output format requested. The | |
| 176 tune program will work with either cycles or seconds. | |
| 177 | |
| 178 freq.c knows how to get the frequency on some systems, or can measure a | |
| 179 cycle counter against gettimeofday() or getrusage(), but when that fails, or | |
| 180 needs to be overridden, an environment variable GMP_CPU_FREQUENCY can be | |
| 181 used (in Hertz). For example in "bash" on a 650 MHz machine, | |
| 182 | |
| 183 export GMP_CPU_FREQUENCY=650e6 | |
| 184 | |
| 185 A high precision time base makes it possible to get accurate measurements in | |
| 186 a shorter time. | |
| 187 | |
| 188 | |
| 189 | |
| 190 | |
| 191 EXAMPLE COMPARISONS - VARIOUS | |
| 192 | |
| 193 Here are some ideas for things that can be done with the speed program. | |
| 194 | |
| 195 There's always going to be a certain amount of overhead in the time | |
| 196 measurements, due to reading the time base, and in the loop that runs a | |
| 197 routine enough times to get a reading of the desired precision. Noop | |
| 198 functions taking various arguments are available to measure this. The | |
| 199 "overhead" printed by the speed program each time in its intro is the "noop" | |
| 200 routine, but note that this is just for information, it isn't deducted from | |
| 201 the times printed or anything. | |
| 202 | |
| 203 ./speed -s 1 noop noop_wxs noop_wxys | |
| 204 | |
| 205 To see how many cycles per limb a routine is taking, look at the time | |
| 206 increase when the size increments, using option -D. This avoids fixed | |
| 207 overheads in the measuring. Also, remember many of the assembler routines | |
| 208 have unrolled loops, so it might be necessary to compare times at, say, 16, | |
| 209 32, 48, 64 etc to see what the unrolled part is taking, as opposed to any | |
| 210 finishing off. | |
| 211 | |
| 212 ./speed -s 16-64 -t 16 -C -D mpn_add_n | |
| 213 | |
| 214 The -C option on its own gives cycles per limb, but is really only useful at | |
| 215 big sizes where fixed overheads are small compared to the code doing the | |
| 216 real work. Remember of course memory caching and/or page swapping will | |
| 217 affect results at large sizes. | |
| 218 | |
| 219 ./speed -s 500000 -C mpn_add_n | |
| 220 | |
| 221 Once a calculation stops fitting in the CPU data cache, it's going to start | |
| 222 taking longer. Exactly where this happens depends on the cache priming in | |
| 223 the measuring routines, and on what sort of "least recently used" the | |
| 224 hardware does. Here's an example for a CPU with a 16kbyte L1 data cache and | |
| 225 32-bit limb, showing a suddenly steeper curve for mpn_add_n at about 2000 | |
| 226 limbs. | |
| 227 | |
| 228 ./speed -s 1-4000 -t 5 -f 1.02 -P foo mpn_add_n | |
| 229 gnuplot foo.gnuplot | |
| 230 | |
| 231 When a routine has an unrolled loop for, say, multiples of 8 limbs and then | |
| 232 an ordinary loop for the remainder, it can happen that it's actually faster | |
| 233 to do an operation on, say, 8 limbs than it is on 7 limbs. The following | |
| 234 draws a graph of mpn_sub_n, to see whether times smoothly increase with | |
| 235 size. | |
| 236 | |
| 237 ./speed -s 1-100 -c -P foo mpn_sub_n | |
| 238 gnuplot foo.gnuplot | |
| 239 | |
| 240 If mpn_lshift and mpn_rshift have special case code for shifts by 1, it | |
| 241 ought to be faster (or at least not slower) than shifting by, say, 2 bits. | |
| 242 | |
| 243 ./speed -s 1-200 -c mpn_rshift.1 mpn_rshift.2 | |
| 244 | |
| 245 An mpn_lshift by 1 can be done by mpn_add_n adding a number to itself, and | |
| 246 if the lshift isn't faster there's an obvious improvement that's possible. | |
| 247 | |
| 248 ./speed -s 1-200 -c mpn_lshift.1 mpn_add_n_self | |
| 249 | |
| 250 On some CPUs (AMD K6 for example) an "in-place" mpn_add_n where the | |
| 251 destination is one of the sources is faster than a separate destination. | |
| 252 Here's an example to see this. ".1" selects dst==src1 for mpn_add_n (and | |
| 253 mpn_sub_n), for other values see speed.h SPEED_ROUTINE_MPN_BINARY_N_CALL. | |
| 254 | |
| 255 ./speed -s 1-200 -c mpn_add_n mpn_add_n.1 | |
| 256 | |
| 257 The gmp manual points out that divisions by powers of two should be done | |
| 258 using a right shift because it'll be significantly faster than an actual | |
| 259 division. The following shows by what factor mpn_rshift is faster than | |
| 260 mpn_divrem_1, using division by 32 as an example. | |
| 261 | |
| 262 ./speed -s 10-20 -r mpn_rshift.5 mpn_divrem_1.32 | |
| 263 | |
| 264 | |
| 265 | |
| 266 | |
| 267 EXAMPLE COMPARISONS - MULTIPLICATION | |
| 268 | |
| 269 mul_basecase takes a ".<r>" parameter which is the first (larger) size | |
| 270 parameter. For example to show speeds for 20x1 up to 20x15 in cycles, | |
| 271 | |
| 272 ./speed -s 1-15 -c mpn_mul_basecase.20 | |
| 273 | |
| 274 mul_basecase with no parameter does an NxN multiply, so for example to show | |
| 275 speeds in cycles for 1x1, 2x2, 3x3, etc, up to 20x20, in cycles, | |
| 276 | |
| 277 ./speed -s 1-20 -c mpn_mul_basecase | |
| 278 | |
| 279 sqr_basecase is implemented by a "triangular" method on most CPUs, making it | |
| 280 up to twice as fast as mul_basecase. In practice loop overheads and the | |
| 281 products on the diagonal mean it falls short of this. Here's an example | |
| 282 running the two and showing by what factor an NxN mul_basecase is slower | |
| 283 than an NxN sqr_basecase. (Some versions of sqr_basecase only allow sizes | |
| 284 below SQR_KARATSUBA_THRESHOLD, so if it crashes at that point don't worry.) | |
| 285 | |
| 286 ./speed -s 1-20 -r mpn_sqr_basecase mpn_mul_basecase | |
| 287 | |
| 288 The technique described above with -CD for showing the time difference in | |
| 289 cycles per limb between two size operations can be done on an NxN | |
| 290 mul_basecase using -E to change the basis for the size increment to N*N. | |
| 291 For instance a 20x20 operation is taken to be doing 400 limbs, and a 16x16 | |
| 292 doing 256 limbs. The following therefore shows the per crossproduct speed | |
| 293 of mul_basecase and sqr_basecase at around 20x20 limbs. | |
| 294 | |
| 295 ./speed -s 16-20 -t 4 -CDE mpn_mul_basecase mpn_sqr_basecase | |
| 296 | |
| 297 Of course sqr_basecase isn't really doing NxN crossproducts, but it can be | |
| 298 interesting to compare it to mul_basecase as if it was. For sqr_basecase | |
| 299 the -F option can be used to base the deltas on N*(N+1)/2 operations, which | |
| 300 is the triangular products sqr_basecase does. For example, | |
| 301 | |
| 302 ./speed -s 16-20 -t 4 -CDF mpn_sqr_basecase | |
| 303 | |
| 304 Both -E and -F are preliminary and might change. A consistent approach to | |
| 305 using them when claiming certain per crossproduct or per triangularproduct | |
| 306 speeds hasn't really been established, but the increment between speeds in | |
| 307 the range karatsuba will call seems sensible, that being k to k/2. For | |
| 308 instance, if the karatsuba threshold was 20 for the multiply and 30 for the | |
| 309 square, | |
| 310 | |
| 311 ./speed -s 10-20 -t 10 -CDE mpn_mul_basecase | |
| 312 ./speed -s 15-30 -t 15 -CDF mpn_sqr_basecase | |
| 313 | |
| 314 | |
| 315 | |
| 316 EXAMPLE COMPARISONS - MALLOC | |
| 317 | |
| 318 The gmp manual recommends application programs avoid excessive initializing | |
| 319 and clearing of mpz_t variables (and mpq_t and mpf_t too). Every new | |
| 320 variable will at a minimum go through an init, a realloc for its first | |
| 321 store, and finally a clear. Quite how long that takes depends on the C | |
| 322 library. The following compares an mpz_init/realloc/clear to a 10 limb | |
| 323 mpz_add. Don't be surprised if the mallocing is quite slow. | |
| 324 | |
| 325 ./speed -s 10 -c mpz_init_realloc_clear mpz_add | |
| 326 | |
| 327 On some systems malloc and free are much slower when dynamic linked. The | |
| 328 speed-dynamic program can be used to see this. For example the following | |
| 329 measures malloc/free, first static then dynamic. | |
| 330 | |
| 331 ./speed -s 10 -c malloc_free | |
| 332 ./speed-dynamic -s 10 -c malloc_free | |
| 333 | |
| 334 Of course a real world program has big problems if it's doing so many | |
| 335 mallocs and frees that it gets slowed down by a dynamic linked malloc. | |
| 336 | |
| 337 | |
| 338 | |
| 339 | |
| 340 | |
| 341 EXAMPLE COMPARISONS - STRING CONVERSIONS | |
| 342 | |
| 343 mpn_get_str does a binary to string conversion. The base is specified with | |
| 344 a ".<r>" parameter, or decimal by default. Power of 2 bases are much faster | |
| 345 than general bases. The following compares decimal and hex for instance. | |
| 346 | |
| 347 ./speed -s 1-20 -c mpn_get_str mpn_get_str.16 | |
| 348 | |
| 349 Smaller bases need more divisions to split a given size number, and so are | |
| 350 slower. The following compares base 3 and base 9. On small operands 9 will | |
| 351 be nearly twice as fast, though at bigger sizes this reduces since in the | |
| 352 current implementation both divide repeatedly by 3^20 (or 3^40 for 64 bit | |
| 353 limbs) and those divisions come to dominate. | |
| 354 | |
| 355 ./speed -s 1-20 -cr mpn_get_str.3 mpn_get_str.9 | |
| 356 | |
| 357 mpn_set_str does a string to binary conversion. The base is specified with | |
| 358 a ".<r>" parameter, or decimal by default. Power of 2 bases are faster than | |
| 359 general bases on large conversions. | |
| 360 | |
| 361 ./speed -s 1-512 -f 2 -c mpn_set_str.8 mpn_set_str.10 | |
| 362 | |
| 363 mpn_set_str also has some special case code for decimal which is a bit | |
| 364 faster than the general case, basically by giving the compiler a chance to | |
| 365 optimize some multiplications by 10. | |
| 366 | |
| 367 ./speed -s 20-40 -c mpn_set_str.9 mpn_set_str.10 mpn_set_str.11 | |
| 368 | |
| 369 | |
| 370 | |
| 371 | |
| 372 EXAMPLE COMPARISONS - GCDs | |
| 373 | |
| 374 mpn_gcd_1 has a threshold for when to reduce using an initial x%y when both | |
| 375 x and y are single limbs. This isn't tuned currently, but a value can be | |
| 376 established by a measurement like | |
| 377 | |
| 378 ./speed -s 10-32 mpn_gcd_1.10 | |
| 379 | |
| 380 This runs src[0] from 10 to 32 bits, and y fixed at 10 bits. If the div | |
| 381 threshold is high, say 31 so it's effectively disabled then a 32x10 bit gcd | |
| 382 is done by nibbling away at the 32-bit operands bit-by-bit. When the | |
| 383 threshold is small, say 1 bit, then an initial x%y is done to reduce it to a | |
| 384 10x10 bit operation. | |
| 385 | |
| 386 The threshold in mpn/generic/gcd_1.c or the various assembler | |
| 387 implementations can be tweaked up or down until there's no more speedups on | |
| 388 interesting combinations of sizes. Note that this affects only a 1x1 limb | |
| 389 operation and so isn't very important. (An Nx1 limb operation always does | |
| 390 an initial modular reduction, using mpn_mod_1 or mpn_modexact_1_odd.) | |
| 391 | |
| 392 | |
| 393 | |
| 394 | |
| 395 SPEED PROGRAM EXTENSIONS | |
| 396 | |
| 397 Potentially lots of things could be made available in the program, but it's | |
| 398 been left at only the things that have actually been wanted and are likely | |
| 399 to be reasonably useful in the future. | |
| 400 | |
| 401 Extensions should be fairly easy to make though. speed-ext.c is an example, | |
| 402 in a style that should suit one-off tests, or new code fragments under | |
| 403 development. | |
| 404 | |
| 405 many.pl is a script for generating a new speed program supplemented with | |
| 406 alternate versions of the standard routines. It can be used for measuring | |
| 407 experimental code, or for comparing different implementations that exist | |
| 408 within a CPU family. | |
| 409 | |
| 410 | |
| 411 | |
| 412 | |
| 413 THRESHOLD EXAMINING | |
| 414 | |
| 415 The speed program can be used to examine the speeds of different algorithms | |
| 416 to check the tune program has done the right thing. For example to examine | |
| 417 the karatsuba multiply threshold, | |
| 418 | |
| 419 ./speed -s 5-40 mpn_mul_basecase mpn_kara_mul_n | |
| 420 | |
| 421 When examining the toom3 threshold, remember it depends on the karatsuba | |
| 422 threshold, so the right karatsuba threshold needs to be compiled into the | |
| 423 library first. The tune program uses specially recompiled versions of | |
| 424 mpn/mul_n.c etc for this reason, but the speed program simply uses the | |
| 425 normal libgmp.la. | |
| 426 | |
| 427 Note further that the various routines may recurse into themselves on sizes | |
| 428 far enough above applicable thresholds. For example, mpn_kara_mul_n will | |
| 429 recurse into itself on sizes greater than twice the compiled-in | |
| 430 MUL_KARATSUBA_THRESHOLD. | |
| 431 | |
| 432 When doing the above comparison between mul_basecase and kara_mul_n what's | |
| 433 probably of interest is mul_basecase versus a kara_mul_n that does one level | |
| 434 of Karatsuba then calls to mul_basecase, but this only happens on sizes less | |
| 435 than twice the compiled MUL_KARATSUBA_THRESHOLD. A larger value for that | |
| 436 setting can be compiled-in to avoid the problem if necessary. The same | |
| 437 applies to toom3 and DC, though in a trickier fashion. | |
| 438 | |
| 439 There are some upper limits on some of the thresholds, arising from arrays | |
| 440 dimensioned according to a threshold (mpn_mul_n), or asm code with certain | |
| 441 sized displacements (some x86 versions of sqr_basecase). So putting huge | |
| 442 values for the thresholds, even just for testing, may fail. | |
| 443 | |
| 444 | |
| 445 | |
| 446 | |
| 447 FUTURE | |
| 448 | |
| 449 Make a program to check the time base is working properly, for small and | |
| 450 large measurements. Make it able to test each available method, including | |
| 451 perhaps the apparent resolution of each. | |
| 452 | |
| 453 Make a general mechanism for specifying operand overlap, and a syntax like | |
| 454 maybe "mpn_add_n.dst=src2" to select it. Some measuring routines do this | |
| 455 sort of thing with the "r" parameter currently. | |
| 456 | |
| 457 | |
| 458 | |
| 459 ---------------- | |
| 460 Local variables: | |
| 461 mode: text | |
| 462 fill-column: 76 | |
| 463 End: | |
| OLD | NEW |