I apologize in advance if this is a cheap topic that has been covered millions of times, I just haven't found it on the boards...

Could anyone point me to some code that does memcpy and memset fast? I'm using MSVC inline asm and I'm not an asm programmer at all. I just want a replacement with no error checking or anything. I don't think I can guarantee alignment though, so it should be fairly general. An mmx version or something would be good if it really helps (nothing newer though like prefetching or whatever, I can't really impose high requirements). It should also be efficient for short spans.

Thanks.

steel
Posted on 2002-01-31 18:18:27 by steel
This gem shows you how you can use your FPU to copy data between memory
locations. The following loop can be used for block memory copying. I don't
know who was the original developer of this kind of loop, but it has been
presented in various documents. This version comes from Agner Fog's excellent
Pentium optimization manual.

;
; copying data using the fpu
;
; input:
; esi = source
; edi = destination
; ecx = number of 16-byte chunks to move
;
; output:
; none (data from esi is copied to edi)
;
; destroys:
; esi, edi, ecx
; flags, fp flags
;

topofloop:
fild qword ptr [esi]
fild qword ptr [esi+8]
fxch
fistp qword ptr [edi]
fistp qword ptr [edi+8]
add esi,16
add edi,16
dec ecx
jnz topofloop

The loop is optimal on (a fast) Pentium when both the source and
destination are aligned on 64-bit boundaries and the destination is not in
the cache. (Additionally the loop can be optimal on PPro if the destination
does not permit write-combining.)
If the destination is in the cache (or the destination memory permits write
combining on PPro) then REP MOVSD will be faster.
The loop is faster than REP MOVSD, because it does half as many writes to
external memory (with the noted exceptions). External memory is usually
very slow compared to the execution time of the loop. Consequently after a
few iterations of the loop the write buffers of the CPU become filled and
subsequent iterations of the loop will execute at the speed of external
memory. For small memory blocks you should use a simple DWORD copy loop,
because the overhead of the FPU copy loop is much higher than that of most
other memory copy loops.
You might think that you should use FLD/FSTP instead of FILD/FISTP.
Unfortunately FLD/FSTP would not work very well, because not all 64-bit values
are normal floating point values. The handling of denormal floating
point numbers is very slow.
But it's even worse. Denormals (see notes) make the FLD/FSTP copying slow,
but it will still be functionally correct. But, if the data represents an
SNAN (see notes), it will be quietly converted to a QNAN (see notes) if IE
is masked (CW.IM = 1), or you will get an exception if IE is unmasked
(CW.IM = 0).
Therefore one should really forget about trying to use FLD/FSTP for memory
copy loops.
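
For readers calling this from C rather than asm, here is a rough sketch of the same loop shape: two 8-byte loads, two 8-byte stores, then both pointers advance by 16. It is only an illustration, not the FPU routine itself. The name copy_chunks16 is made up for this example, and the compiler decides which instructions actually move the data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copies `chunks` blocks of 16 bytes from src to dst.
 * The regions must not overlap. The fixed-size memcpy calls are the portable
 * way to express 8-byte moves that may be unaligned. */
static void copy_chunks16(void *dst, const void *src, size_t chunks)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    assert(chunks == 0 || (dst != NULL && src != NULL));

    while (chunks--) {
        uint64_t lo, hi;
        memcpy(&lo, s, 8);        /* first 8-byte load   */
        memcpy(&hi, s + 8, 8);    /* second 8-byte load  */
        memcpy(d, &lo, 8);        /* first 8-byte store  */
        memcpy(d + 8, &hi, 8);    /* second 8-byte store */
        s += 16;
        d += 16;
    }
}
```

As with the asm version, the count is in 16-byte chunks, so any tail bytes have to be handled by the caller.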

For related information see Agner Fog's Pentium optimization manual (you
can find it at http://announce.com/agner/assem) and Intel's Pentium Pro
developer's manual volume 3 for information on write buffers, caches,
write-combining etc. (it can be found at Intel's developer WWW site).

notes:
SNANs are all the numbers where bits <62:52> = 7FFh, and bit <51> = 0 and
bits <50:0> !=0. An SNAN is converted to a QNAN by setting bit<51>.

Denormals are numbers where the exponent field has all bits set to 0 and the
mantissa is non-zero. In the copy process, that means bits 62-52 (the exponent
field) of an aligned 64-bit entity are all zero.
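
The bit tests in these notes can be written down directly. The sketch below follows the layout given above for a 64-bit value (exponent in bits 62:52, quiet bit 51). The function and macro names are made up for illustration and are not from any manual.

```c
#include <assert.h>
#include <stdint.h>

#define F64_EXP_MASK  UINT64_C(0x7FF0000000000000) /* bits 62:52 */
#define F64_QUIET_BIT UINT64_C(0x0008000000000000) /* bit 51     */
#define F64_LOW_MANT  UINT64_C(0x0007FFFFFFFFFFFF) /* bits 50:0  */
#define F64_MANT_MASK UINT64_C(0x000FFFFFFFFFFFFF) /* bits 51:0  */

/* SNAN: exponent all ones, bit 51 clear, bits 50:0 not all zero. */
static int is_snan(uint64_t bits)
{
    return (bits & F64_EXP_MASK) == F64_EXP_MASK
        && (bits & F64_QUIET_BIT) == 0
        && (bits & F64_LOW_MANT) != 0;
}

/* Denormal: exponent all zero, mantissa non-zero. */
static int is_denormal(uint64_t bits)
{
    return (bits & F64_EXP_MASK) == 0 && (bits & F64_MANT_MASK) != 0;
}

/* What a masked FLD/FSTP round trip would do to an SNAN: set bit 51. */
static uint64_t quiet_snan(uint64_t bits)
{
    assert(is_snan(bits));
    return bits | F64_QUIET_BIT;
}
```

Any 8-byte piece of the source data for which is_snan() is true would come out of an FLD/FSTP copy with one bit changed, which is why the copy would be corrupted and not just slow.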

Gem writer: (code) Agner Fog
Posted on 2002-01-31 18:53:57 by The Svin
Without using prefetch# you are limited in the speed you can do either function. One of our moderators "Bitrake" posted a very fast memory copy routine some time ago but it required the prefetch# instructions and PIII or higher to work.

There are a couple of appropriate routines in the MASM32 library that you could use directly in C/C++ if you wrote the prototypes for them and they are reasonable performers for non PIII code. MMX by itself does not seem to improve on MOVSD for memory copy as it is a well optimised instruction on later Intel machines.

Regards,

hutch@movsd.com
Posted on 2002-01-31 21:34:27 by hutch--
First, a question.
How much does all this *really* matter? Aren't we bound by
memory speed these days? If source and dest are dword aligned,
are any other methods "much" faster than a simple rep movsd?

Next... I'm no optimization guru (not compared to Svin or BitRake
anyway ;), but it seems like some effort has been put into
memset/memcpy from the Microsoft libc, to take care of alignment
and stuff. Supposedly this is good if you work with large blocks.
Comments by somebody who knows?

Also, memcpy/memset are some of the so-called intrinsic functions,
which means the compiler can inline "better copy code" instead of
calling the external functions - supposed to give better performance.
Again, not something I have really tested... my performance-requiring
code doesn't usually involve memcpy/memset.
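
A small sketch of the intrinsic case, for what it's worth. When the size is a compile-time constant, compilers will often expand memcpy in place instead of calling the library, but that depends on the compiler and the optimization settings, so check the generated asm. The Mat4 type and mat4_copy name are made up for this example. The pragma is the MSVC way of requesting the intrinsic form and is skipped on other compilers.

```c
#include <assert.h>
#include <string.h>

#ifdef _MSC_VER
#pragma intrinsic(memcpy)   /* MSVC: request the inline form of memcpy */
#endif

/* Hypothetical small fixed-size payload (4x4 floats). */
typedef struct { float m[4][4]; } Mat4;

static void mat4_copy(Mat4 *dst, const Mat4 *src)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, sizeof *dst);   /* size is known at compile time */
}
```

A plain struct assignment (*dst = *src) usually gives the compiler the same freedom.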
Posted on 2002-01-31 22:00:40 by f0dder
Intel and AMD both have specially tuned memcpy for P3/P4 and Athlon/K6 respectively. I don't mean to use them blindly - they are designed for general HLL use. But you certainly can glean the technique that best matches their respective chip's internal structure from the algorithms. Intel's can be found (HERE). AMD's can be found (HERE).

From a design perspective the assembly language programmer wins big time. We can plan to have almost everything DWORD aligned and not have to test for it - instant saving on the overhead of even a small size operation.
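
To make that concrete in C terms, here is a sketch with made-up names. A general routine has to run something like is_dword_aligned() on every call. A routine whose callers guarantee alignment can drop the test and go straight to the dword loop. The assert is only there to catch a broken promise in debug builds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The check a general-purpose routine cannot avoid. */
static int is_dword_aligned(const void *p)
{
    return ((uintptr_t)p & 3u) == 0;
}

/* Hypothetical fast path: the caller promises dword-aligned buffers and
 * passes the length in dwords, so there is no head or tail handling. */
static void copy_dwords_aligned(uint32_t *dst, const uint32_t *src, size_t count)
{
    size_t i;
    assert(is_dword_aligned(dst) && is_dword_aligned(src));
    for (i = 0; i < count; i++)
        dst[i] = src[i];
}
```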

Needless to say, this is a topic of much research!

Other Threads:
http://www.asmcommunity.net/board/index.php?topic=1229

Web Links:
http://www.geocities.com/~charlie_x/source.htm
http://www.azillionmonkeys.com/qed/blockcopy.html
http://www.bh.wakwak.com/~xelf/developer/MemoryCopy.html
Posted on 2002-01-31 23:17:55 by bitRAKE
Thanks for all the input guys, now we're going somewhere :-)

Basically my dilemma was how smartly will the compiler inline memcpy() and how much overhead does it have for small spans (like on the order of tens of bytes). I really have no idea. Or how well something like a for loop copy would translate. Really nested for loops of smallish block copies, matrix/vector manipulation stuff.

Basically I did assume that the compiler memcpy is smarter than me so I used it, but I figured I could do a couple of tests if there was easily available code. Since I don't see msvc compatible code it would be a liiitle bit of work but I might still try a routine or two you guys suggested.
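
A bare-bones way to run that kind of test in plain C might look like the sketch below. All names are made up for illustration, and it uses the standard clock() call, which has coarse resolution, so the repetition count has to be large for small blocks. An optimizing compiler may also simplify or drop repeated identical copies, so treat the numbers with suspicion.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

typedef void (*copy_fn)(void *dst, const void *src, size_t n);

/* Candidate 1: the library routine. */
static void libc_copy(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Candidate 2: a plain for loop copy. */
static void byte_loop_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    size_t i;
    for (i = 0; i < n; i++)
        d[i] = s[i];
}

/* Returns CPU seconds spent doing `reps` copies of n bytes with fn. */
static double time_copy(copy_fn fn, void *dst, const void *src,
                        size_t n, unsigned long reps)
{
    clock_t start;
    unsigned long r;
    assert(fn != NULL);
    start = clock();
    for (r = 0; r < reps; r++)
        fn(dst, src, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_copy(libc_copy, dst, src, 64, 5000000UL) and the same for byte_loop_copy gives two numbers to compare on one machine.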

Thanks again
Posted on 2002-02-02 15:18:35 by steel
Ok, to wrap up this subject more or less...

Bitrake, thanks for the links, that last one http://www.bh.wakwak.com/~xelf/developer/MemoryCopy.html was especially useful!

I ran a bunch of benchmarks on different machines, added some of my stuff that I wanted to test, and basically came up with the conclusion that memcpy() and memset() are indeed the best overall solution, or at least good enough :-) Just thought I'd say this so hopefully other people don't waste time going down this path like I did. Although it was a great educational experience! :-)

Here's my results, some of em are weird and surprising, you can draw your own conclusions.
Note: "MMX unaligned" is the routine from
http://www.azillionmonkeys.com/qed/blockcopy.html
It works quite well in fact, especially on misaligned pointers, but doesn't help on short spans. Of course it can be adapted to any one of these routines, not just mmx.

---------------------------------------------------------------------------------
//
// This section mostly weighs overhead for short spans
//

// Athlon 700
memory copy code benchmark VER.2001-01-29 by (C)2001 XELF.
copy size: 64

copy repetitions: 5242880

memcpy: 43.4
rep movsd: 32.2
FPU 8bytes: 15.7
MMX movntq pre 16bytes: 13.6
MMX movntq 16bytes: 14.2
MMX movntq 8bytes: 16.5
MMX 16bytes: 11.5
MMX 8bytes: 21.0
asm 8bytes: 20.2
asm 4bytes: 50.2
C++ 4bytes: 36.8
MMX unaligned : 45.0
C++ zeroing loop: 72.6
C++ memset(): 71.4
stosd memset: 125.0
Press any key to continue


// P3 866
memory copy code benchmark VER.2001-01-29 by (C)2001 XELF.
copy size: 64

memcpy: 37.8
rep movsd: 27.6
FPU 8bytes: 22.0
MMX movntq pre 16bytes: 14.0
MMX movntq 16bytes: 15.3
MMX movntq 8bytes: 31.2
SSE movntps pre 32bytes: 40.1
SSE movntps 16bytes: 11.3
SSE 16bytes: 11.6
MMX 16bytes: 11.6
MMX 8bytes: 23.3
asm 8bytes: 28.8
asm 4bytes: 37.8
C++ 4bytes: 38.5
MMX unaligned : 57.4
C++ zeroing loop: 76.5
C++ memset(): 74.7
stosd memset: 230.6
Press any key to continue

//
// The rest measures burst speed
//

// Athlon 700
memory copy code benchmark VER.2001-01-29 by (C)2001 XELF.
copy size: 16777216

copy repetitions: 10

memcpy: 105.1
rep movsd: 110.6
FPU 8bytes: 83.7
MMX movntq pre 16bytes: 77.3
MMX movntq 16bytes: 82.5
MMX movntq 8bytes: 83.5
MMX 16bytes: 83.7
MMX 8bytes: 83.5
asm 8bytes: 103.7
asm 4bytes: 131.1
C++ 4bytes: 106.0
MMX unaligned : 83.8
C++ zeroing loop: 111.8
C++ memset(): 111.9


// P3 866
memory copy code benchmark VER.2001-01-29 by (C)2001 XELF.
copy size: 16777216

copy repetitions: 10

memcpy: 72.4
rep movsd: 70.2
FPU 8bytes: 82.0
MMX movntq pre 16bytes: 48.5
MMX movntq 16bytes: 49.8
MMX movntq 8bytes: 49.1
SSE movntps pre 32bytes: 53.0
SSE movntps 16bytes: 50.2
SSE 16bytes: 83.9
MMX 16bytes: 84.2
MMX 8bytes: 82.5
asm 8bytes: 119.3
asm 4bytes: 123.1
C++ 4bytes: 121.6
MMX unaligned : 83.2
C++ zeroing loop: 73.7
C++ memset(): 73.8

// P2 400
memory copy code benchmark VER.2001-01-29 by (C)2001 XELF.
copy size: 16777216

copy repetitions: 10

memcpy: 114.9
rep movsd: 114.7
FPU 8bytes: 109.5
MMX 16bytes: 110.3
MMX 8bytes: 109.8
asm 8bytes: 161.1
asm 4bytes: 158.6
C++ 4bytes: 159.6
MMX unaligned : 109.8
C++ zeroing loop: 176.3
C++ memset(): 196.0
stosd memset: 786.1
completed.


//
// These tests check performance on unaligned pointers
// The pointer returned by new is incremented by one byte
// (Looks like new aligns the original pointer nicely,
// though I'm sure this is not always guaranteed)
//

// Athlon 700
memory copy code benchmark VER.2001-01-29 by (C)2001 XELF.
copy size: 64

copy repetitions: 5242880

memcpy: 51.8
rep movsd: 40.4
FPU 8bytes: 22.6
MMX 16bytes: 23.2
MMX 8bytes: 23.7
asm 8bytes: 27.6
asm 4bytes: 49.6
C++ 4bytes: 38.2
MMX unaligned : 55.3
C++ zeroing loop: 73.3
C++ memset(): 72.6
stosd memset: 126.4
completed.
Press any key to continue


// Athlon 700
memory copy code benchmark VER.2001-01-29 by (C)2001 XELF.
copy size: 16777216

copy repetitions: 10

memcpy: 162.0
rep movsd: 164.4
FPU 8bytes: 148.9
MMX 16bytes: 140.3
MMX 8bytes: 149.0
asm 8bytes: 154.2
asm 4bytes: 158.7
C++ 4bytes: 160.7
MMX unaligned : 88.2
C++ zeroing loop: 119.7
C++ memset(): 119.4
stosd memset: 485.4
completed.
Press any key to continue


// P3 866
memory copy code benchmark VER.2001-01-29 by (C)2001 XELF.
copy size: 16777216

copy repetitions: 10

memcpy: 146.7
rep movsd: 146.4
FPU 8bytes: 113.5
MMX movntq pre 16bytes: 133.4
MMX movntq 16bytes: 60.3
MMX movntq 8bytes: 62.8
MMX 16bytes: 110.1
MMX 8bytes: 112.9
asm 8bytes: 145.3
asm 4bytes: 146.7
C++ 4bytes: 145.6
MMX unaligned : 110.8
C++ zeroing loop: 140.5
C++ memset(): 140.6
stosd memset: 569.0
completed.
Posted on 2002-02-04 14:39:47 by steel
Really nice comparisons!

I found an extreme speed difference between copy loops from memory -> memory compared with memory -> video memory.

The MMX version I had was much faster than the CPU variant.
Posted on 2002-02-07 07:40:36 by beaster
steel,

If you are copying small blocks, one of the factors with memory copy speed is the setup time for the generically faster methods. REP MOVSD is a poor performer on blocks smaller than 64 bytes.

If you are repeatedly copying blocks that are smaller than this, it may be worth writing some simple test pieces that use the old REP MOVSB or directly indexing the memory locations with a common counter. Either will have less setup time than the faster methods for larger blocks.
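
In C terms the idea might look like the sketch below. The 64-byte cutoff is an assumption lifted from the figure mentioned above and should be tuned by measurement, and the names are made up. Note that some compilers recognise the byte loop and turn it back into a library call, so check the generated code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed cutoff, taken from the 64-byte figure above. Tune by testing. */
#define SMALL_COPY_LIMIT 64

/* Hypothetical dispatcher: short blocks use a loop with one common counter
 * indexing both buffers, longer blocks go to the library routine. */
static void copy_dispatch(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    size_t i;

    assert(n == 0 || (dst != NULL && src != NULL));

    if (n >= SMALL_COPY_LIMIT) {
        memcpy(dst, src, n);
        return;
    }
    for (i = 0; i < n; i++)
        d[i] = s[i];
}
```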

Regards,

hutch@movsd.com
Posted on 2002-02-07 21:18:35 by hutch--

"If you are copying small blocks, one of the factors with memory copy speed is the setup time for the generically faster methods. REP MOVSD is a poor performer on blocks smaller than 64 bytes."
If you're programming in ASM, you can usually have the data aligned. So there would be no set-up for REP MOVSD beyond any other method, besides a string of MOVSDs. ;)

I've seen benchmarks for REP MOVSB on the P4 that rate the same as REP MOVSD - which makes me wonder if REP MOVSB is optimized the same as REP MOVSD on the P4. I don't remember what the docs say - I just forgot since I have AMD gear now.
Posted on 2002-02-07 22:13:08 by bitRAKE
bitRAKE,

The minimum size for an efficient use of REP MOVSD is documented in Agner Fog's manual and it matches the testing I have done on copying very large numbers of small blocks of memory. I have yet to do much benchmarking on the P4 as I use it as a backup box and still develop on my PIII.

At PIII and under, REP MOVSB and direct pointer manipulation on the two buffers is usually faster than REP MOVSD on very small repeated memory blocks. On AMD I don't remember as the last testing I did was on a K6/2 550 which would be well off the pace of current AMD devices.

Regards,

hutch@movsd.com
Posted on 2002-02-08 08:15:06 by hutch--
Good link with some code:
http://sourcefrog.net/mbp/memcpyspeed/
Posted on 2002-02-15 17:09:42 by bitRAKE