gcc/gmp/tune/README - Issue 3050029: [gcc] GCC 4.5.0=>4.5.1

Side by Side Diff: gcc/gmp/tune/README

Issue 3050029: [gcc] GCC 4.5.0=>4.5.1 (Closed) Base URL: ssh://git@gitrw.chromium.org:9222/nacl-toolchain.git

Patch Set: Created 10 years, 4 months ago

Use n/p to move between diff chunks; N/P to move between comments. Draft comments are only viewable by you.

Jump to:

View unified diff | Download patch | Annotate | Revision Log

OLD	NEW
	(Empty)
1 Copyright 2000, 2001, 2002, 2004 Free Software Foundation, Inc.

2

3 This file is part of the GNU MP Library.

4

5 The GNU MP Library is free software; you can redistribute it and/or modify

6 it under the terms of the GNU Lesser General Public License as published by

7 the Free Software Foundation; either version 3 of the License, or (at your

8 option) any later version.

9

10 The GNU MP Library is distributed in the hope that it will be useful, but

11 WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY

12 or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public

13 License for more details.

14

15 You should have received a copy of the GNU Lesser General Public License

16 along with the GNU MP Library. If not, see http://www.gnu.org/licenses/.

17

18

19

20

21

22 GMP SPEED MEASURING AND PARAMETER TUNING

23

24

25 The programs in this directory are for knowledgeable users who want to

26 measure GMP routines on their machine, and perhaps tweak some settings or

27 identify things that can be improved.

28

29 The programs here are tools, not ready to run solutions. Nothing is built

30 in a normal "make all", but various Makefile targets described below exist.

31

32 Relatively few systems and CPUs have been tested, so be sure to verify that

33 results are sensible before relying on them.

34

35

36

37

38 MISCELLANEOUS NOTES

39

40 --enable-assert

41

42 Don't configure with --enable-assert, since the extra code added by

43 assertion checking may influence measurements.

44

45 Direct mapped caches

46

47 Some effort has been made to accommodate CPUs with direct mapped caches,

48 by putting data blocks more or less contiguously on the stack. But this

49 will depend on TMP_ALLOC using alloca, and even then it may or may not

50 be enough.

51

52 FreeBSD 4.2 i486 getrusage

53

54 This getrusage seems to be a bit doubtful, it looks like it's

55 microsecond accurate, but sometimes ru_utime remains unchanged after a

56 time of many microseconds has elapsed. It'd be good to detect this in

57 the time.c initializations, but for now the suggestion is to pretend it

58 doesn't exist.

59

60 ./configure ac_cv_func_getrusage=no

61

62 NetBSD 1.4.1 m68k macintosh time base

63

64 On this system it's been found getrusage often goes backwards, making it

65 unusable (time.c getrusage_backwards_p detects this). gettimeofday

66 sometimes doesn't update atomically when it crosses a 1 second boundary.

67 Not sure what to do about this. Expect possible intermittent failures.

68

69 SCO OpenUNIX 8 /etc/hw

70

71 /etc/hw takes about a second to return the cpu frequency, which suggests

72 perhaps it's measuring each time it runs. If this is annoying when

73 running the speed program repeatedly then set a GMP_CPU_FREQUENCY

74 environment variable (see TIME BASE section below).

75

76 Low resolution timebase

77

78 Parameter tuning can be very time consuming if the only timebase

79 available is a 10 millisecond clock tick, to the point of being

80 unusable. This is currently the case on VAX and ARM systems.

81

82

83

84

85 PARAMETER TUNING

86

87 The "tuneup" program runs some tests designed to find the best settings for

88 various thresholds, like MUL_KARATSUBA_THRESHOLD. Its output can be put

89 into gmp-mparam.h. The program is built and run with

90

91 make tune

92

93 If the thresholds indicated are grossly different from the values in the

94 selected gmp-mparam.h then there may be a performance boost in applicable

95 size ranges by changing gmp-mparam.h accordingly.

96

97 Be sure to do a full reconfigure and rebuild to get any newly set thresholds

98 to take effect. A partial rebuild is enough sometimes, but a fresh

99 configure and make is certain to be correct.

100

101 If a CPU has specific tuned parameters coming from a gmp-mparam.h in one of

102 the mpn subdirectories then the values from "make tune" should be similar.

103 But check that the configured CPU is right and there are no machine specific

104 effects causing a difference.

105

106 It's hoped the compiler and options used won't have too much effect on

107 thresholds, since for most CPUs they ultimately come down to comparisons

108 between assembler subroutines. Missing out on the longlong.h macros by not

109 using gcc will probably have an effect.

110

111 Some thresholds produced by the tune program are merely single values chosen

112 from what's a range of sizes where two algorithms are pretty much the same

113 speed. When this happens the program is likely to give somewhat different

114 values on successive runs. This is noticeable on the toom3 thresholds for

115 instance.

116

117

118

119

120 SPEED PROGRAM

121

122 The "speed" program can be used for measuring and comparing various

123 routines, and producing tables of data or gnuplot graphs. Compile it with

124

125 make speed

126

127 (Or on DOS systems "make speed.exe".)

128

129 Here are some examples of how to use it. Check the code for all the

130 options.

131

132 Draw a graph of mpn_mul_n, stepping through sizes by 10 or a factor of 1.05

133 (whichever is greater).

134

135 ./speed -s 10-5000 -t 10 -f 1.05 -P foo mpn_mul_n

136 gnuplot foo.gnuplot

137

138 Compare mpn_add_n and an mpn_lshift by 1, showing times in cycles and

139 showing under mpn_lshift the difference between it and mpn_add_n.

140

141 ./speed -s 1-40 -c -d mpn_add_n mpn_lshift.1

142

143 Using option -c for times in cycles is interesting but normally only

144 necessary when looking carefully at assembler subroutines. You might think

145 it would always give an integer value, but this doesn't happen in practice,

146 probably due to overheads in the time measurements.

147

148 In the free-form output the "#" symbol against a measurement means the

149 corresponding routine is fastest at that size. This is a convenient visual

150 cue when comparing different routines. The graph data files <name>.data

151 don't get this since it would upset gnuplot or other data viewers.

152

153

154

155

156 TIME BASE

157

158 The time measuring method is determined in time.c, based on what the

159 configured host has available. A cycle counter is preferred, possibly

160 supplemented by another method if the counter has a limited range. A

161 microsecond accurate getrusage() or gettimeofday() will work quite well too.

162

163 The cycle counters (except possibly on alpha) and gettimeofday() will depend

164 on the machine being otherwise idle, or rather on other jobs not stealing

165 CPU time from the measuring program. Short routines (those that complete

166 within a timeslice) should work even on a busy machine.

167

168 Some trouble is taken by speed_measure() in common.c to avoid ill effects

169 from sporadic interrupts, or other intermittent things (like cron waking up

170 every minute). But generally an idle machine will be necessary to be

171 certain of consistent results.

172

173 The CPU frequency is needed to convert between cycles and seconds, or for

174 when a cycle counter is supplemented by getrusage() etc. The speed program

175 will convert as necessary according to the output format requested. The

176 tune program will work with either cycles or seconds.

177

178 freq.c knows how to get the frequency on some systems, or can measure a

179 cycle counter against gettimeofday() or getrusage(), but when that fails, or

180 needs to be overridden, an environment variable GMP_CPU_FREQUENCY can be

181 used (in Hertz). For example in "bash" on a 650 MHz machine,

182

183 export GMP_CPU_FREQUENCY=650e6

184

185 A high precision time base makes it possible to get accurate measurements in

186 a shorter time.

187

188

189

190

191 EXAMPLE COMPARISONS - VARIOUS

192

193 Here are some ideas for things that can be done with the speed program.

194

195 There's always going to be a certain amount of overhead in the time

196 measurements, due to reading the time base, and in the loop that runs a

197 routine enough times to get a reading of the desired precision. Noop

198 functions taking various arguments are available to measure this. The

199 "overhead" printed by the speed program each time in its intro is the "noop"

200 routine, but note that this is just for information, it isn't deducted from

201 the times printed or anything.

202

203 ./speed -s 1 noop noop_wxs noop_wxys

204

205 To see how many cycles per limb a routine is taking, look at the time

206 increase when the size increments, using option -D. This avoids fixed

207 overheads in the measuring. Also, remember many of the assembler routines

208 have unrolled loops, so it might be necessary to compare times at, say, 16,

209 32, 48, 64 etc to see what the unrolled part is taking, as opposed to any

210 finishing off.

211

212 ./speed -s 16-64 -t 16 -C -D mpn_add_n

213

214 The -C option on its own gives cycles per limb, but is really only useful at

215 big sizes where fixed overheads are small compared to the code doing the

216 real work. Remember of course memory caching and/or page swapping will

217 affect results at large sizes.

218

219 ./speed -s 500000 -C mpn_add_n

220

221 Once a calculation stops fitting in the CPU data cache, it's going to start

222 taking longer. Exactly where this happens depends on the cache priming in

223 the measuring routines, and on what sort of "least recently used" the

224 hardware does. Here's an example for a CPU with a 16kbyte L1 data cache and

225 32-bit limb, showing a suddenly steeper curve for mpn_add_n at about 2000

226 limbs.

227

228 ./speed -s 1-4000 -t 5 -f 1.02 -P foo mpn_add_n

229 gnuplot foo.gnuplot

230

231 When a routine has an unrolled loop for, say, multiples of 8 limbs and then

232 an ordinary loop for the remainder, it can happen that it's actually faster

233 to do an operation on, say, 8 limbs than it is on 7 limbs. The following

234 draws a graph of mpn_sub_n, to see whether times smoothly increase with

235 size.

236

237 ./speed -s 1-100 -c -P foo mpn_sub_n

238 gnuplot foo.gnuplot

239

240 If mpn_lshift and mpn_rshift have special case code for shifts by 1, it

241 ought to be faster (or at least not slower) than shifting by, say, 2 bits.

242

243 ./speed -s 1-200 -c mpn_rshift.1 mpn_rshift.2

244

245 An mpn_lshift by 1 can be done by mpn_add_n adding a number to itself, and

246 if the lshift isn't faster there's an obvious improvement that's possible.

247

248 ./speed -s 1-200 -c mpn_lshift.1 mpn_add_n_self

249

250 On some CPUs (AMD K6 for example) an "in-place" mpn_add_n where the

251 destination is one of the sources is faster than a separate destination.

252 Here's an example to see this. ".1" selects dst==src1 for mpn_add_n (and

253 mpn_sub_n), for other values see speed.h SPEED_ROUTINE_MPN_BINARY_N_CALL.

254

255 ./speed -s 1-200 -c mpn_add_n mpn_add_n.1

256

257 The gmp manual points out that divisions by powers of two should be done

258 using a right shift because it'll be significantly faster than an actual

259 division. The following shows by what factor mpn_rshift is faster than

260 mpn_divrem_1, using division by 32 as an example.

261

262 ./speed -s 10-20 -r mpn_rshift.5 mpn_divrem_1.32

263

264

265

266

267 EXAMPLE COMPARISONS - MULTIPLICATION

268

269 mul_basecase takes a ".<r>" parameter which is the first (larger) size

270 parameter. For example to show speeds for 20x1 up to 20x15 in cycles,

271

272 ./speed -s 1-15 -c mpn_mul_basecase.20

273

274 mul_basecase with no parameter does an NxN multiply, so for example to show

275 speeds in cycles for 1x1, 2x2, 3x3, etc, up to 20x20, in cycles,

276

277 ./speed -s 1-20 -c mpn_mul_basecase

278

279 sqr_basecase is implemented by a "triangular" method on most CPUs, making it

280 up to twice as fast as mul_basecase. In practice loop overheads and the

281 products on the diagonal mean it falls short of this. Here's an example

282 running the two and showing by what factor an NxN mul_basecase is slower

283 than an NxN sqr_basecase. (Some versions of sqr_basecase only allow sizes

284 below SQR_KARATSUBA_THRESHOLD, so if it crashes at that point don't worry.)

285

286 ./speed -s 1-20 -r mpn_sqr_basecase mpn_mul_basecase

287

288 The technique described above with -CD for showing the time difference in

289 cycles per limb between two size operations can be done on an NxN

290 mul_basecase using -E to change the basis for the size increment to N*N.

291 For instance a 20x20 operation is taken to be doing 400 limbs, and a 16x16

292 doing 256 limbs. The following therefore shows the per crossproduct speed

293 of mul_basecase and sqr_basecase at around 20x20 limbs.

294

295 ./speed -s 16-20 -t 4 -CDE mpn_mul_basecase mpn_sqr_basecase

296

297 Of course sqr_basecase isn't really doing NxN crossproducts, but it can be

298 interesting to compare it to mul_basecase as if it was. For sqr_basecase

299 the -F option can be used to base the deltas on N*(N+1)/2 operations, which

300 is the triangular products sqr_basecase does. For example,

301

302 ./speed -s 16-20 -t 4 -CDF mpn_sqr_basecase

303

304 Both -E and -F are preliminary and might change. A consistent approach to

305 using them when claiming certain per crossproduct or per triangularproduct

306 speeds hasn't really been established, but the increment between speeds in

307 the range karatsuba will call seems sensible, that being k to k/2. For

308 instance, if the karatsuba threshold was 20 for the multiply and 30 for the

309 square,

310

311 ./speed -s 10-20 -t 10 -CDE mpn_mul_basecase

312 ./speed -s 15-30 -t 15 -CDF mpn_sqr_basecase

313

314

315

316 EXAMPLE COMPARISONS - MALLOC

317

318 The gmp manual recommends application programs avoid excessive initializing

319 and clearing of mpz_t variables (and mpq_t and mpf_t too). Every new

320 variable will at a minimum go through an init, a realloc for its first

321 store, and finally a clear. Quite how long that takes depends on the C

322 library. The following compares an mpz_init/realloc/clear to a 10 limb

323 mpz_add. Don't be surprised if the mallocing is quite slow.

324

325 ./speed -s 10 -c mpz_init_realloc_clear mpz_add

326

327 On some systems malloc and free are much slower when dynamic linked. The

328 speed-dynamic program can be used to see this. For example the following

329 measures malloc/free, first static then dynamic.

330

331 ./speed -s 10 -c malloc_free

332 ./speed-dynamic -s 10 -c malloc_free

333

334 Of course a real world program has big problems if it's doing so many

335 mallocs and frees that it gets slowed down by a dynamic linked malloc.

336

337

338

339

340

341 EXAMPLE COMPARISONS - STRING CONVERSIONS

342

343 mpn_get_str does a binary to string conversion. The base is specified with

344 a ".<r>" parameter, or decimal by default. Power of 2 bases are much faster

345 than general bases. The following compares decimal and hex for instance.

346

347 ./speed -s 1-20 -c mpn_get_str mpn_get_str.16

348

349 Smaller bases need more divisions to split a given size number, and so are

350 slower. The following compares base 3 and base 9. On small operands 9 will

351 be nearly twice as fast, though at bigger sizes this reduces since in the

352 current implementation both divide repeatedly by 3^20 (or 3^40 for 64 bit

353 limbs) and those divisions come to dominate.

354

355 ./speed -s 1-20 -cr mpn_get_str.3 mpn_get_str.9

356

357 mpn_set_str does a string to binary conversion. The base is specified with

358 a ".<r>" parameter, or decimal by default. Power of 2 bases are faster than

359 general bases on large conversions.

360

361 ./speed -s 1-512 -f 2 -c mpn_set_str.8 mpn_set_str.10

362

363 mpn_set_str also has some special case code for decimal which is a bit

364 faster than the general case, basically by giving the compiler a chance to

365 optimize some multiplications by 10.

366

367 ./speed -s 20-40 -c mpn_set_str.9 mpn_set_str.10 mpn_set_str.11

368

369

370

371

372 EXAMPLE COMPARISONS - GCDs

373

374 mpn_gcd_1 has a threshold for when to reduce using an initial x%y when both

375 x and y are single limbs. This isn't tuned currently, but a value can be

376 established by a measurement like

377

378 ./speed -s 10-32 mpn_gcd_1.10

379

380 This runs src[0] from 10 to 32 bits, and y fixed at 10 bits. If the div

381 threshold is high, say 31 so it's effectively disabled then a 32x10 bit gcd

382 is done by nibbling away at the 32-bit operands bit-by-bit. When the

383 threshold is small, say 1 bit, then an initial x%y is done to reduce it to a

384 10x10 bit operation.

385

386 The threshold in mpn/generic/gcd_1.c or the various assembler

387 implementations can be tweaked up or down until there's no more speedups on

388 interesting combinations of sizes. Note that this affects only a 1x1 limb

389 operation and so isn't very important. (An Nx1 limb operation always does

390 an initial modular reduction, using mpn_mod_1 or mpn_modexact_1_odd.)

391

392

393

394

395 SPEED PROGRAM EXTENSIONS

396

397 Potentially lots of things could be made available in the program, but it's

398 been left at only the things that have actually been wanted and are likely

399 to be reasonably useful in the future.

400

401 Extensions should be fairly easy to make though. speed-ext.c is an example,

402 in a style that should suit one-off tests, or new code fragments under

403 development.

404

405 many.pl is a script for generating a new speed program supplemented with

406 alternate versions of the standard routines. It can be used for measuring

407 experimental code, or for comparing different implementations that exist

408 within a CPU family.

409

410

411

412

413 THRESHOLD EXAMINING

414

415 The speed program can be used to examine the speeds of different algorithms

416 to check the tune program has done the right thing. For example to examine

417 the karatsuba multiply threshold,

418

419 ./speed -s 5-40 mpn_mul_basecase mpn_kara_mul_n

420

421 When examining the toom3 threshold, remember it depends on the karatsuba

422 threshold, so the right karatsuba threshold needs to be compiled into the

423 library first. The tune program uses specially recompiled versions of

424 mpn/mul_n.c etc for this reason, but the speed program simply uses the

425 normal libgmp.la.

426

427 Note further that the various routines may recurse into themselves on sizes

428 far enough above applicable thresholds. For example, mpn_kara_mul_n will

429 recurse into itself on sizes greater than twice the compiled-in

430 MUL_KARATSUBA_THRESHOLD.

431

432 When doing the above comparison between mul_basecase and kara_mul_n what's

433 probably of interest is mul_basecase versus a kara_mul_n that does one level

434 of Karatsuba then calls to mul_basecase, but this only happens on sizes less

435 than twice the compiled MUL_KARATSUBA_THRESHOLD. A larger value for that

436 setting can be compiled-in to avoid the problem if necessary. The same

437 applies to toom3 and DC, though in a trickier fashion.

438

439 There are some upper limits on some of the thresholds, arising from arrays

440 dimensioned according to a threshold (mpn_mul_n), or asm code with certain

441 sized displacements (some x86 versions of sqr_basecase). So putting huge

442 values for the thresholds, even just for testing, may fail.

443

444

445

446

447 FUTURE

448

449 Make a program to check the time base is working properly, for small and

450 large measurements. Make it able to test each available method, including

451 perhaps the apparent resolution of each.

452

453 Make a general mechanism for specifying operand overlap, and a syntax like

454 maybe "mpn_add_n.dst=src2" to select it. Some measuring routines do this

455 sort of thing with the "r" parameter currently.

456

457

458

459 ----------------

460 Local variables:

461 mode: text

462 fill-column: 76

463 End:

OLD	NEW

« no previous file with comments | « gcc/gmp/tune/Makefile.am ('k') | gcc/gmp/tune/common.c » ('j') | no next file with comments »