Hello,

I'm currently trying to find the "fastest" memory copy algorithm. In some AMD papers there were several methods described how to increase performance and I implemented some of them for testing purposes.

However, the only remarkable performance boost I can get is by using opcode "movntq" instead of "movq". In the AMD papers there is an additional method described which preloads the 8kb cache manually by "mov" instructions and which should gain significantly better results, but I cannot reproduce this, it seems to be not faster.

The test program I wrote is attached with source. Perhaps you can try it on your machine and post the results. It uses function QueryPerformanceCounter for time measures, and the results are only comparable on the same machine using identical OS.

Method 1 is simple "rep movsd"
Method 5 is copy with "movntq"
Method 7 is copy with "manual" 8kB preload cache






Attachments:
Posted on 2005-12-12 06:54:49 by japheth
Hi japheth,
1: 3422977, 3449056, 3559521, 3434530, 3428219, avg.: 3458860.6, min.: 3422977, max.: 3559521
2: 3608045, 3442138, 3451504, 3679591, 3661740, avg.: 3568603.6, min.: 3442138, max.: 3679591
3: 3609053, 3672534, 3676860, 3433318, 3667954, avg.: 3611943.8, min.: 3433318, max.: 3676860
4: 3578507, 3404831, 3390465, 3401985, 3602129, avg.: 3475583.4, min.: 3390465, max.: 3602129
5: 2180421, 2101583, 2182866, 2199605, 2163821, avg.: 2165659.2, min.: 2101583, max.: 2199605
6: 2189776, 2110774, 2123631, 2187997, 2168920, avg.: 2156219.6, min.: 2110774, max.: 2189776
7: 2066715, 2156970, 2077948, 2144287, 2080567, avg.: 2105297.4, min.: 2066715, max.: 2156970


Pentium 4 2.4 GHz, 1GB RAM

1: 4499268, 4499029, 4779676, 4469791, 4883254, avg: 4626203.6, min.: 4469791, max.: 4883254
2: 5105891, 4638554, 5105064, 4954883, 4708889, avg: 4902656.2, min.: 4638554, max.: 5105891
3: 4517386, 4547970, 4500684, 4899371, 4516516, avg: 4596385.4, min.: 4516516, max.: 4899371
4: 4363102, 4307607, 4634410, 4613308, 4277856, avg: 4439256.6, min.: 4277856, max.: 4634410
5: 2615955, 2608975, 2609396, 2761660, 2592634, avg: 2637724.0, min.: 2592634, max.: 2761660
6: 2587534, 2674381, 2480952, 2464633, 2474081, avg: 2536316.2, min.: 2464633, max.: 2674381
7: 2744555, 2728899, 2548870, 2726562, 2527669, avg: 2655311.0, min.: 2527669, max.: 2728899


Athlon 2000 XP+, 512MB RAM
Posted on 2005-12-12 07:40:20 by ti_mo_n

thanks ti_mo_n,

apparently you get some small benefits from the "8kb preload" method 7 . But it's hardly worth the effort.

On my machine (P4 1,6 GHz with 768 MB, 133 DDR) method 7 is not faster than 6, and that's true for WinXP, Win9x and DOS.



Posted on 2005-12-12 09:05:41 by japheth
I've updated the above post with a test ran on Athlon. method 7 is slower on my athlon. so - yes, it's not worth the effort. I think it starts to be better when the FSB is extremely high. so it may prove useful on high-end machines.
Posted on 2005-12-12 09:17:23 by ti_mo_n
On an Intel P4 3GHz/2G RAM

Method 1: 1458731
Method 2: 1757155
Method 3: 1433806
Method 4: 1439436
Method 5: 0969311
Method 6: 0918067
Method 7: 0987131


Biterider
Posted on 2005-12-12 09:48:43 by Biterider

Ouch, there was a little bug in method 7. Not the whole 8kb cache was preloaded.

Now things look much better, method 7 gains about 25% compared to method 5 (AMD stated it would gain almost 60%, so there is possibly another error in the routine  :P )

Attachments:
Posted on 2005-12-12 10:00:00 by japheth
7: 1866971, 1860189, 1870577, 1871948, 1864699, avg: 1866876.8, min.: 1860189, max.: 1871948

P4.

~35% more than method 6 in previous test :) now it's worth the trouble :P

7: 1884217, 1868956, 1862362, 1901331, 1866873, avg.: 1876747.8, min.: 1862362, max.: 1901331

Athlon

Also ~35% improvement compared to the previos 6th test.
Posted on 2005-12-12 10:04:59 by ti_mo_n
Here's some results from my AMD64x2 4400+:

Method 1: 1461434551 1426432968 1465283578 1317669855 1398656315
Method 2: 1449077827 1390316779 1463173831 1466142093 1450970247
Method 3: 1414612433 1464708861 1399891895 1408903447 1358318454
Method 4: 1343857924 1332559047 1435798211 1402063777 1388287385
Method 5: 815422120 865896678 883681320 849708211 868122526
Method 6: 747811076 745205009 771848748 796355631 800860571
Method 7: 748868436 733629835 721430276 740537241 713453047


I've made a little batch script to make it easier to run a bunch of tests, along with a tool to boost priority to REALTIME and limit process affinity to one core (so XP doesn't decide to schedule to another core and thus incur cacheflush and whatnot).
Attachments:
Posted on 2005-12-12 12:23:44 by f0dder
Here is some results for a PIII (933MHz, 1GB ram)

Method 1: 7602577
Method 2: 12910954
Method 3: 9330890
Method 4: 9488025
Method 5: 6387086
Method 6: 6268085
Method 7: 6409181
Posted on 2005-12-12 15:41:10 by arafel
I extended the tool a bit so one can select video memory as destination. Because the "8kb preload" optimisation affects the source only I would have thought to gain some speed in this context, although not that much because the video memory is the bottleneck. But on my machine it seems that the simple "movsd" is the best choice if destination is video memory.
Attachments:
Posted on 2005-12-13 05:55:12 by japheth
Btw, I should re-run the tests on an idle system - I had two instances of folding@home running (one per core) when I did the test. Of course their priorities are set to IDLE and the testing was done at REALTIME, but still...
Posted on 2005-12-13 08:05:42 by f0dder
'video' mode:

1: 2654238, 2654269, 2655050, 2655902, 2655698, avg.: 2655031.4, min.: 2654238, max.: 2655902
2: 2657860, 2659286, 2662500, 2662086, 2664177, avg.: 2661181.8, min.: 2657860, max.: 2664177
3: 2655147, 2655327, 2656223, 2655713, 2657299, avg.: 2655941.8, min.: 2655147, max.: 2657299
4: 2669593, 2671223, 2671212, 2670448, 2672512, avg.: 2670997.6, min.: 2669593, max.: 2672512
5: 2674042, 2672593, 2672268, 2674904, 2673638, avg.: 2673489.0, min.: 2672268, max.: 2674904
6: 2660864, 2660804, 2660566, 2660763, 2660875, avg.: 2660774.4, min.: 2660566, max.: 2660875
7: 3434997, 3430532, 3430795, 3431993, 3430965, avg.: 3431856.4, min.: 3430532, max.: 3434997

P4 2.4 GHz, GF FX 5200 128MB RAM, AGP x4, AGP mem size 256 MB

1. methods 1-6 are almost equal.
2. method 1 is now the fastest, while the 7th is the slowest.
Posted on 2005-12-13 09:43:50 by ti_mo_n
Here's timings from running with the /v switch. I'm still running folding@home, and I'm still using my priority + affinity-limiting tool. I guess it should be noted that I have a PCI express GeForce 6600 card?


Method 1: 1934681799 1940174659 1938375354 1939336721 1929164882
Method 2: 1940902615 1939754938 1927483388 1939831370 1939690187
Method 3: 1928319427 1940157060 1939705309 1940675277 1940207536
Method 4: 1938661280 1940705484 1941287859 1940019346 1939994448
Method 5: 1939580209 1939852990 1943117551 1940708369 1939442550
Method 6: 1937937804 1938141919 1937595634 1938959079 1938952095
Method 7: 1941526693 1941014071 1941056164 1940582174 1942230754

Posted on 2005-12-13 10:24:32 by f0dder

how could it be that mode 7 is the slowest if destination is video. Confused!

here is my final version of this tool. I added a "rep movsb" method 0, which is almost as fast as "rep movsd". I wasn't aware that the P4 has such an optimisation.
Attachments:
Posted on 2005-12-14 08:37:51 by japheth
I've been checking into memmove in 64bit land.
The RtlMoveMemory function thats in the ntdll.dll (64bit) is fully optimized.
I've been trying my own functions and can't seem to beat it.

Of course making a memmove for CERTAIN condition would be easy but making one that optimized for all conditions (ie small sizes large sizes, unaligned, aligned) is pretty difficult.

While windows usually hacks together their dll functions I've found memmove to be very optimized compared to the rest. Of course on regular xp the system doesn't check if you have SSE extension so it might not take advantage of prefetch.

2cents
Posted on 2005-12-14 16:36:20 by r22
results on p 1.4GHz,587MHz.

regards
Attachments:
Posted on 2005-12-14 22:22:12 by dcskm4200

AMD64 3000+, FSB=333MHz, WinXP:

Method 0: time=4148538
Method 1: time=3651805
Method 2: time=3295470
Method 3: time=3538549
Method 4: time=3691407
Method 5: time=2209696
Method 6: time=1939679
Method 7: time=1749710
Method 8: time=3091703

With video memory destination:

Method 0: time=3152924
Method 1: time=3156330
Method 2: time=3153451
Method 3: time=3152092
Method 4: time=3150797
Method 5: time=3159150
Method 6: time=3144205
Method 7: time=3162130
Method 8: time=3159965

Eugen
Posted on 2005-12-15 06:03:50 by Eugen
Here are my results using f0dder's launcher (Pentium 4 630):

Method 1: time=784042230
Method 2: time=1245374468
Method 3: time=787389960
Method 4: time=782996108
Method 5: time=771784732
Method 6: time=594234555
Method 7: time=602990955
Method 8: time=3960825

Method 8 is one I wrote that uses SSE3 and some other small changes, it's attached. I hope I didn't make a mistake in the timing or something- it seems to be a heck of alot faster.

P.S. according to the intel manual you shouldn't do dec/jcc because it causes a partial flag stall, and in my tests it is indeed much slower than sub ,1/jcc
Attachments:
Posted on 2005-12-15 11:14:43 by stormix
P.S. according to the intel manual you shouldn't do dec/jcc because it causes a partial flag stall, and in my tests it is indeed much slower than sub ,1/jcc

Yes, on P4 you should avoid using any logic instructions (including inc and dec) and istead use arithmetic instructions. but of course shifting is still faster than multiplying/dividing (despite the fact that it's slower than on PIII). nevertheless it's beter to do add eax, eax than shl eax, 1, etc, etc...

P4 2.4GHz, 1GB RAM

0: 3520666, 3770755, 3482083, 3764127, 3498814, avg: 3607289.0
1: 3426808, 3436102, 3641159, 3443445, 3442222, avg: 3477947.2
2: 3698223, 3438047, 3717033, 3720024, 3460607, avg: 3606786.8
3: 3452370, 3420074, 3429982, 3700315, 3714090, avg: 3543366.2
4: 3622489, 3638258, 3427736, 3630247, 3400553, avg: 3543856.6
5: 2160772, 2170302, 2153640, 2259891, 2142523, avg: 2177425.6
6: 2231625, 2138851, 2139760, 2248791, 2133032, avg: 2178411.8
7: 1859462, 1873457, 1865491, 1852195, 1865391, avg: 1863199.2
8: 3308598, 3293898, 3321992, 3339967, 3314236, avg: 3315738.2

GF FX 5200, AGPx4

video:
0: 2658626, 2655579, 2655808, 2654846, 2654451, avg: 2655862.0
1: 2654186, 2653726, 2654246, 2654789, 2654357, avg: 2654260.8
2: 2662323, 2658739, 2657888, 2659072, 2664023, avg: 2660409.0
3: 2653795, 2654860, 2655766, 2654952, 2654077, avg: 2654690.0
4: 2671739, 2668665, 2667855, 2668568, 2669314, avg: 2669228.2
5: 2671195, 2674383, 2677655, 2675304, 2674301, avg: 2674567.6
6: 2663874, 2662107, 2662708, 2662197, 2662025, avg: 2662582.2
7: 3437203, 3435018, 3435464, 3435975, 3433573, avg: 3435446.6
8: 3434853, 3438124, 3436388, 3439009, 3434410, avg: 3436556.8

(minimum and maximum values have been underlined)
Posted on 2005-12-15 11:33:47 by ti_mo_n
> Yes, on P4 you should avoid using any logic instructions (including inc and dec) and istead use arithmetic

But on my P4 there is no difference (or if any, "sub ecx,1" is slightly slower! than "dec ecx"). I any case it is not that big an issue IMO.
Posted on 2005-12-16 02:40:59 by japheth