---> Test Program <--- Okay, here is the fastest memory copy that I could muster on an Athlon, and it might even still be fastest on a Pentium (haven't tested):

// 440mb/sec on 142mhz bus
void mem5(void *dst, void *src, int nbytes)
{
  __asm {
        mov ebx, nbytes 
        mov eax, src
        mov edx, dst
        shr ebx, 11 // 2048 bytes at a time 

loop2k: // Copy 2k into temporary buffer 
        push edx 
        mov edx, tbuf 
        mov ecx, 2048 / 64

loopMemToL1: 
        prefetchnta 64 // Prefetch next loop, non-temporal 
        prefetchnta 96 

	movq mm0, 
	add edx, 64
	movq mm1, 
	add eax, 64
	movq mm2, 
	movq , mm0
	movq mm3, 
	movq , mm1
	movq mm4, 
	movq , mm2
	movq mm5, 
	movq , mm3
	movq mm6, 
	movq , mm4
	movq mm7, 
	movq , mm5
	movq , mm6
	dec ecx
	movq , mm7
	jnz loopMemToL1 

        pop edx // Now copy from L1 to system memory 
        push eax 
        mov ecx, 2048 / 64
        mov eax, tbuf 

loopL1ToMem: 
	movq mm0, 
	add edx, 64
	movq mm1, 
	add eax, 64
	movq mm2, 
	movntq , mm0
	movq mm3, 
	movntq , mm1
	movq mm4, 
	movntq , mm2
	movq mm5, 
	movntq , mm3
	movq mm6, 
	movntq , mm4
	movq mm7, 
	movntq , mm5
	movntq , mm6
	dec ecx
	movntq , mm7
	jnz loopL1ToMem 

        pop eax // Do next 2k block 
        dec ebx 
        jnz loop2k
emms
	} 
}
Posted on 2001-07-04 03:43:00 by bitRAKE
Bitrake, Looks like a good piece of code, if you have the time, will you time the REP MOVSD version for me from the MASM32 library so I have some idea of the differences ? Regards, hutch@pbq.com.au
Posted on 2001-07-04 04:43:00 by hutch--
In comparison the 'REP MOVSD' instructions only achieves 200mb/sec :( If anyone is interested in the source code, I'll post it. This is based on an algorithm that was in an article on the Sun web site - I just interleved the MMX instructions. Using the L1 cache provides quite a boost. This message was edited by bitRAKE, on 7/4/2001 12:27:10 PM
Posted on 2001-07-04 12:16:00 by bitRAKE
Yup i am interested in the source code ;) i will try to check it against our code in HE and see if it can give us a boost also ;)
Posted on 2001-07-04 16:52:00 by BogdanOntanu
HERE is the original source, and my version for the Athlon is above.
Posted on 2001-07-05 01:47:00 by bitRAKE
Here are the result on a Pentium 3 938Mhz SGI ex1: 229.631420ms = 139.353752mb/sec SGI ex2: 110.588916ms = 289.359922mb/sec SGI ex3: 118.112498ms = 270.928146mb/sec SGI ex4: 74.363362ms = 430.319437mb/sec SGI ex5: 77.423248ms = 413.312550mb/sec <- Code posted memcpy 216.124954ms = 148.062495mb/sec memfill 45.242622ms = 707.297651mb/sec memfill 62.569405ms = 511.432067mb/sec memfill 42.198380ms = 758.322951mb/sec memfill 57.764884ms = 553.969783mb/sec memset 157.080858ms = 203.716738mb/sec What can I say, except amazing code thanks for posting it.
Posted on 2001-07-05 09:41:00 by dxantos
memcpy is similar to the speed of 'REP MOVSD'. On my Althon the code above runs a little faster than the SGI ex4 code. Looks like it's not so good on PIII's - overall I'd be best just to stick with the SGI ex4 code. Unless I can figure out how to get some more speed out of this Althon. :) I'm still reading about all the nitty-gritty details.
Posted on 2001-07-05 12:15:00 by bitRAKE
Just plain weird, I run the same benchmark again and got a completely different results. TEST 2: SGI ex1: 205.516902ms = 155.704955mb/sec SGI ex2: 111.868687ms = 286.049660mb/sec SGI ex3: 138.677402ms = 230.751367mb/sec SGI ex4: 78.861699ms = 405.773657mb/sec Code Posted: 75.972505ms = 421.205014mb/sec memcpy 219.665349ms = 145.676140mb/sec memfill 34.118023ms = 937.920689mb/sec memfill 54.490166ms = 587.261933mb/sec memfill 56.431474ms = 567.059441mb/sec memfill 58.355461ms = 548.363414mb/sec memset 137.294544ms = 233.075539mb/sec TEST 3 SGI ex1: 214.224992ms = 149.375662mb/sec SGI ex2: 113.460510ms = 282.036456mb/sec SGI ex3: 142.639637ms = 224.341569mb/sec SGI ex4: 75.187768ms = 425.601141mb/sec Code Posted: 80.598512ms = 397.029663mb/sec memcpy 219.209704ms = 145.978939mb/sec memfill 36.348754ms = 880.360305mb/sec memfill 54.707791ms = 584.925828mb/sec memfill 61.250801ms = 522.442144mb/sec memfill 57.334382ms = 558.129327mb/sec memset 147.791689ms = 216.520972mb/sec TEST 4: SGI ex1: 209.088306ms = 153.045384mb/sec SGI ex2: 112.688344ms = 283.969031mb/sec SGI ex3: 128.655458ms = 248.726332mb/sec SGI ex4: 81.062537ms = 394.756950mb/sec Code Posted: 76.837419ms = 416.463753mb/sec memcpy 236.251255ms = 135.449016mb/sec memfill 50.594419ms = 632.480826mb/sec memfill 59.103881ms = 541.419611mb/sec memfill 58.789595ms = 544.314009mb/sec memfill 62.341163ms = 513.304504mb/sec memset 136.795878ms = 233.925178mb/sec I suspect that windows 2000 latency has something to do with the differences. But if we only consider the highest values of the tests: SGI ex4: 75.187768ms = 425.601141mb/sec Code Posted: 75.972505ms = 421.205014mb/sec It seems that SGI algo 4 stills wins Pentium 3 Changing topic a little, does anyone know if there is some issue with the athlons on Win2K, I have an Athlon 1009Mhz and its runnig much slower in the test than the Pentium 3 938. (Which must be impossible).
Posted on 2001-07-05 16:10:00 by dxantos
Ooops ....this dosent work on my Pentium2 Procesor :( argh... (and i only have one P3 procesor here IMHO assuming MMX is Ok today but assuming P3 is NOT :D
Posted on 2001-07-05 16:52:00 by BogdanOntanu
Its all very nice to say, but with my PIII 450Mghz 64MB RAM, I ended up with these results: :(:(:( SGI ex1: 105868.977857ms = 0.302260mb/sec SGI ex2: 109196.002280ms = 0.293051mb/sec SGI ex3: 152312.268057ms = 0.210095mb/sec SGI ex4: 141730.960123ms = 0.225780mb/sec AMDSGI5: 99582.813993ms = 0.321341mb/sec memcpy 99808.873766ms = 0.320613mb/sec memfill 88819.556982ms = 0.360281mb/sec memfill 72.106472ms = 443.788182mb/sec memfill 61.896780ms = 516.989737mb/sec memfill 68.005666ms = 470.549031mb/sec memset 262.340133ms = 121.979049mb/sec AMD: 51045.441593ms = 0.626892mb/sec AMD: 138789.098041ms = 0.230566mb/sec AMD: 108856.087095ms = 0.293966mb/sec AMD: 94681.497343ms = 0.337975mb/sec AMD: 123335.131330ms = 0.259456mb/sec AMD: 119798.598703ms = 0.267115mb/sec AMD: 115714.391793ms = 0.276543mb/sec AMD: 108180.049112ms = 0.295803mb/sec Can anybody sugest why I have ended up with such low speeds?
Posted on 2001-07-05 20:58:00 by George
The answer is that you don't have enough memory to run the test program. It copies memory from one 32MB buffer to another 32MB buffer. The memfill only uses one 32MB buffer, and the first time that buffer wasn't in memory (the other three times it was :))
Posted on 2001-07-05 21:07:00 by bitRAKE
dxantos, that program that you're running is a virus. :D this code is fast cause it use super-nes cheat code! :D 32MB at once, of course it's fast. to make it even faster, make it 128MB buffer. those who have 256MB ram will jump like the matrix. This message was edited by disease_2000, on 7/5/2001 9:12:13 PM
Posted on 2001-07-05 21:09:00 by disease_2000
I bought one 512MB PC133 DIMM. :) ....I've got room for two more. :) ...I'm starting a second job for a little while. :) I've updated the program at the link above so it runs 16 tests per algorithm and prints the highest - this provides a better side-by-side comparision of the routines. Your memory bus speed is going to effect this test directly (ie overclock that sucker:D) Here is the best of 16 runs each for my Althon:
MMX move 64 blocks:
164.859780ms = 194.104348mb/sec
MMX move non-temporal stores:
112.386071ms = 284.732793mb/sec
MMX move non-temporal stores with prefetch:
94.231529ms = 339.589097mb/sec
(above) with 2K L1 use:
73.074092ms = 437.911703mb/sec
 with instruction interleave:
72.252758ms = 442.889666mb/sec
REP MOVSD:
188.081725ms = 170.138805mb/sec

memcpy 188.095414ms = 170.126423mb/sec

memfill 32.682366ms = 979.121277mb/sec

memset 91.357421ms = 350.272584mb/sec

AMDCopy: 94.783834ms = 337.610314mb/sec
The 'AMDCopy' is directly from the AMD MMX Optimization manual. This message was edited by bitRAKE, on 7/5/2001 11:17:28 PM
Posted on 2001-07-05 23:05:00 by bitRAKE
disease_2000, a virus mmm that would explain many things no wait, its not a virus, is Windows. :D Seriously the machine is a 512MB Pentium 3 and thus the program ran entirely in memory, giving the results posted. bitRAKE, is there a program to overclock the Athlon? I have one here for testing purposes is a 1GHz 512MB Athlon and its giving less than half the result of the Pentium 3. (Which should be impossible, faster processor and faster bus). Its not only the program that gives me bad results, but also a 3D code that runs at 60Hz flat on the Pentium 3 if giving me 30Hz on the Athlon, (exactly same video card on both, so its not that).
Posted on 2001-07-06 00:05:00 by dxantos
I don't think you should overclock until you determine what the problem is with the installation, and there is certianly a problem with that performance. :( Do you have the latest BIOS for you motherboard? Feel free to email me your detailed specs and I'll see what I can do. This message was edited by bitRAKE, on 7/6/2001 12:14:09 AM
Posted on 2001-07-06 00:13:00 by bitRAKE