---> Test Program <---
Okay, here is the fastest memory copy that I could muster on an Athlon, and it might even still be fastest on a Pentium (haven't tested):
// 440mb/sec on 142mhz bus
void mem5(void *dst, void *src, int nbytes)
{
__asm {
mov ebx, nbytes
mov eax, src
mov edx, dst
shr ebx, 11 // 2048 bytes at a time
loop2k: // Copy 2k into temporary buffer
push edx
mov edx, tbuf
mov ecx, 2048 / 64
loopMemToL1:
prefetchnta 64 // Prefetch next loop, non-temporal
prefetchnta 96
movq mm0,
add edx, 64
movq mm1,
add eax, 64
movq mm2,
movq , mm0
movq mm3,
movq , mm1
movq mm4,
movq , mm2
movq mm5,
movq , mm3
movq mm6,
movq , mm4
movq mm7,
movq , mm5
movq , mm6
dec ecx
movq , mm7
jnz loopMemToL1
pop edx // Now copy from L1 to system memory
push eax
mov ecx, 2048 / 64
mov eax, tbuf
loopL1ToMem:
movq mm0,
add edx, 64
movq mm1,
add eax, 64
movq mm2,
movntq , mm0
movq mm3,
movntq , mm1
movq mm4,
movntq , mm2
movq mm5,
movntq , mm3
movq mm6,
movntq , mm4
movq mm7,
movntq , mm5
movntq , mm6
dec ecx
movntq , mm7
jnz loopL1ToMem
pop eax // Do next 2k block
dec ebx
jnz loop2k
emms
}
}
Bitrake,
Looks like a good piece of code, if you have the time, will you
time the REP MOVSD version for me from the MASM32 library so I
have some idea of the differences ?
Regards,
hutch@pbq.com.au
In comparison the 'REP MOVSD' instructions only achieves 200mb/sec :( If anyone is interested in the source code, I'll post it. This is based on an algorithm that was in an article on the Sun web site - I just interleved the MMX instructions. Using the L1 cache provides quite a boost.
This message was edited by bitRAKE, on 7/4/2001 12:27:10 PM
Yup i am interested in the source code ;)
i will try to check it against our code in HE and see if it can give us a boost also ;)
HERE is the original source, and my version for the Athlon is above.
Here are the result on a Pentium 3 938Mhz
SGI ex1: 229.631420ms = 139.353752mb/sec
SGI ex2: 110.588916ms = 289.359922mb/sec
SGI ex3: 118.112498ms = 270.928146mb/sec
SGI ex4: 74.363362ms = 430.319437mb/sec
SGI ex5: 77.423248ms = 413.312550mb/sec <- Code posted
memcpy 216.124954ms = 148.062495mb/sec
memfill 45.242622ms = 707.297651mb/sec
memfill 62.569405ms = 511.432067mb/sec
memfill 42.198380ms = 758.322951mb/sec
memfill 57.764884ms = 553.969783mb/sec
memset 157.080858ms = 203.716738mb/sec
What can I say, except amazing code thanks for posting it.
memcpy is similar to the speed of 'REP MOVSD'. On my Althon the code above runs a little faster than the SGI ex4 code. Looks like it's not so good on PIII's - overall I'd be best just to stick with the SGI ex4 code. Unless I can figure out how to get some more speed out of this Althon. :) I'm still reading about all the nitty-gritty details.
Just plain weird, I run the same benchmark again and got a completely different results.
TEST 2:
SGI ex1: 205.516902ms = 155.704955mb/sec
SGI ex2: 111.868687ms = 286.049660mb/sec
SGI ex3: 138.677402ms = 230.751367mb/sec
SGI ex4: 78.861699ms = 405.773657mb/sec
Code Posted: 75.972505ms = 421.205014mb/sec
memcpy 219.665349ms = 145.676140mb/sec
memfill 34.118023ms = 937.920689mb/sec
memfill 54.490166ms = 587.261933mb/sec
memfill 56.431474ms = 567.059441mb/sec
memfill 58.355461ms = 548.363414mb/sec
memset 137.294544ms = 233.075539mb/sec
TEST 3
SGI ex1: 214.224992ms = 149.375662mb/sec
SGI ex2: 113.460510ms = 282.036456mb/sec
SGI ex3: 142.639637ms = 224.341569mb/sec
SGI ex4: 75.187768ms = 425.601141mb/sec
Code Posted: 80.598512ms = 397.029663mb/sec
memcpy 219.209704ms = 145.978939mb/sec
memfill 36.348754ms = 880.360305mb/sec
memfill 54.707791ms = 584.925828mb/sec
memfill 61.250801ms = 522.442144mb/sec
memfill 57.334382ms = 558.129327mb/sec
memset 147.791689ms = 216.520972mb/sec
TEST 4:
SGI ex1: 209.088306ms = 153.045384mb/sec
SGI ex2: 112.688344ms = 283.969031mb/sec
SGI ex3: 128.655458ms = 248.726332mb/sec
SGI ex4: 81.062537ms = 394.756950mb/sec
Code Posted: 76.837419ms = 416.463753mb/sec
memcpy 236.251255ms = 135.449016mb/sec
memfill 50.594419ms = 632.480826mb/sec
memfill 59.103881ms = 541.419611mb/sec
memfill 58.789595ms = 544.314009mb/sec
memfill 62.341163ms = 513.304504mb/sec
memset 136.795878ms = 233.925178mb/sec
I suspect that windows 2000 latency has something to do with the differences. But if we only consider the highest values of the tests:
SGI ex4: 75.187768ms = 425.601141mb/sec
Code Posted: 75.972505ms = 421.205014mb/sec
It seems that SGI algo 4 stills wins Pentium 3
Changing topic a little, does anyone know if there is some issue with the athlons on Win2K, I have an Athlon 1009Mhz and its runnig much slower in the test than the Pentium 3 938. (Which must be impossible).
Ooops ....this dosent work on my Pentium2 Procesor :(
argh... (and i only have one P3 procesor here
IMHO assuming MMX is Ok today but assuming P3 is NOT :D
Its all very nice to say, but with my PIII 450Mghz 64MB RAM, I ended up with these results: :(:(:(
SGI ex1: 105868.977857ms = 0.302260mb/sec
SGI ex2: 109196.002280ms = 0.293051mb/sec
SGI ex3: 152312.268057ms = 0.210095mb/sec
SGI ex4: 141730.960123ms = 0.225780mb/sec
AMDSGI5: 99582.813993ms = 0.321341mb/sec
memcpy 99808.873766ms = 0.320613mb/sec
memfill 88819.556982ms = 0.360281mb/sec
memfill 72.106472ms = 443.788182mb/sec
memfill 61.896780ms = 516.989737mb/sec
memfill 68.005666ms = 470.549031mb/sec
memset 262.340133ms = 121.979049mb/sec
AMD: 51045.441593ms = 0.626892mb/sec
AMD: 138789.098041ms = 0.230566mb/sec
AMD: 108856.087095ms = 0.293966mb/sec
AMD: 94681.497343ms = 0.337975mb/sec
AMD: 123335.131330ms = 0.259456mb/sec
AMD: 119798.598703ms = 0.267115mb/sec
AMD: 115714.391793ms = 0.276543mb/sec
AMD: 108180.049112ms = 0.295803mb/sec
Can anybody sugest why I have ended up with such low speeds?
The answer is that you don't have enough memory to run the test program. It copies memory from one 32MB buffer to another 32MB buffer. The memfill only uses one 32MB buffer, and the first time that buffer wasn't in memory (the other three times it was :))
dxantos, that program that you're running is a virus. :D
this code is fast cause it use super-nes cheat code! :D
32MB at once, of course it's fast. to make it even faster,
make it 128MB buffer. those who have 256MB ram will jump
like the matrix.
This message was edited by disease_2000, on 7/5/2001 9:12:13 PM
I bought one 512MB PC133 DIMM. :) ....I've got room for two more. :) ...I'm starting a second job for a little while. :) I've updated the program at the link above so it runs 16 tests per algorithm and prints the highest - this provides a better side-by-side comparision of the routines. Your memory bus speed is going to effect this test directly (ie overclock that sucker:D)
Here is the best of 16 runs each for my Althon:
MMX move 64 blocks:
164.859780ms = 194.104348mb/sec
MMX move non-temporal stores:
112.386071ms = 284.732793mb/sec
MMX move non-temporal stores with prefetch:
94.231529ms = 339.589097mb/sec
(above) with 2K L1 use:
73.074092ms = 437.911703mb/sec
with instruction interleave:
72.252758ms = 442.889666mb/sec
REP MOVSD:
188.081725ms = 170.138805mb/sec
memcpy 188.095414ms = 170.126423mb/sec
memfill 32.682366ms = 979.121277mb/sec
memset 91.357421ms = 350.272584mb/sec
AMDCopy: 94.783834ms = 337.610314mb/sec
The 'AMDCopy' is directly from the AMD MMX Optimization manual.
This message was edited by bitRAKE, on 7/5/2001 11:17:28 PMdisease_2000, a virus mmm that would explain many things no wait, its not a virus, is Windows. :D
Seriously the machine is a 512MB Pentium 3 and thus the program ran entirely in memory, giving the results posted.
bitRAKE, is there a program to overclock the Athlon? I have one here for testing purposes is a 1GHz 512MB Athlon and its giving less than half the result of the Pentium 3. (Which should be impossible, faster processor and faster bus). Its not only the program that gives me bad results, but also a 3D code that runs at 60Hz flat on the Pentium 3 if giving me 30Hz on the Athlon, (exactly same video card on both, so its not that).
I don't think you should overclock until you determine what the problem is with the installation, and there is certianly a problem with that performance. :( Do you have the latest BIOS for you motherboard? Feel free to email me your detailed specs and I'll see what I can do.
This message was edited by bitRAKE, on 7/6/2001 12:14:09 AM