Hi guys,

I've got a program where I need to copy every other line of data (1440 bytes each). The fastest way to copy this small amount seems to be memcpy. I've profiled the AMD memcpy as well, and it seems to be slightly slower. Anyway, I'm using inline assembly in my C++ function, and I would really like to use memcpy for the small data lines. I don't seem to be able to use the PROTO method mentioned in other threads. If anyone can give me a hand, I'd really appreciate it. I'm easily confused by the new syntax, as I've only been trying assembly for two days. Thanks for the help.
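
To make the task concrete, here is a minimal C++ sketch of the copy in question. The buffer layout (rows stored back to back, even rows kept and packed into the destination) and the name copy_every_other_line are assumptions made for illustration, not details from the actual program.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical layout: `src` holds `lines` rows of `line_bytes` bytes each,
// stored back to back. Rows 0, 2, 4, ... are copied into `dst`, packed
// one after another. Returns the number of rows copied.
std::size_t copy_every_other_line(const unsigned char* src, unsigned char* dst,
                                  std::size_t lines, std::size_t line_bytes = 1440)
{
    assert(src != nullptr && dst != nullptr);
    std::size_t copied = 0;
    for (std::size_t row = 0; row < lines; row += 2) {
        // One memcpy per kept row; the skipped row is simply stepped over.
        std::memcpy(dst + copied * line_bytes, src + row * line_bytes, line_bytes);
        ++copied;
    }
    return copied;
}
```

The destination needs room for (lines + 1) / 2 rows.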

~Steve
Posted on 2005-01-15 22:10:32 by diehard2
Steve,

With a byte count of 1440 for each copy action, I think from memory that REP MOVSD has the legs on most late model processors. Much over that and SSE2 code takes over, and under about 200 bytes, indexed pointers are faster.

Something that will affect any memory copy algo is whether the memory is aligned or not, and with repeated copies of the type you mentioned, that may not be that easy to control.

If you cannot control the alignment of each start address, you will actually be faster using BYTE copy.
Posted on 2005-01-16 02:39:49 by hutch--

> If you cannot control the alignment of each start address, you will actually be faster using BYTE copy.

Or even better, copy 1-3 bytes until you're on an aligned address, then copy DWORDs.
Posted on 2005-01-16 05:18:56 by f0dder

> Or even better, copy 1-3 bytes until you're on an aligned address, then copy DWORDs.

Something is missing in the above statement. It assumes that both src and dest are off by the same bytes from the next DWORD boundary, in which case, I certainly agree that aligning and moving by DWORD is better.

However, it may need some profiling effort when src and dest cannot be on a DWORD boundary at the same time, as in the "char *p; memcpy(p, p+5, n);" case, for example. I read somewhere that it is better to align the dest argument, but that turned out not to be the case on my Pentium III. Did anyone else test such cases? I'd love to hear about them.
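
Whether the two pointers can be brought onto a DWORD boundary together is easy to test up front. A minimal C++ sketch, with same_dword_offset as a made-up helper name:

```cpp
#include <cassert>
#include <cstdint>

// True when src and dst sit at the same offset from a DWORD boundary, so
// that copying 0-3 leading bytes to align one of them aligns the other too.
bool same_dword_offset(const void* src, const void* dst)
{
    const std::uintptr_t s = reinterpret_cast<std::uintptr_t>(src);
    const std::uintptr_t d = reinterpret_cast<std::uintptr_t>(dst);
    // The low two bits of the XOR are zero only if both addresses agree mod 4.
    return ((s ^ d) & 3u) == 0;
}
```

For memcpy(p, p+5, n) this returns false, so only one side can be aligned and the choice between them is what needs profiling.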
Posted on 2005-01-16 06:05:21 by Starless
Thanks for the help guys. However, out of curiosity, is it possible to use memcpy in inline assembly? I'll probably try the rep movsd; that looks pretty good and I should be aligned. Unfortunately, my alignment depends upon other third party software, so it may be release dependent :evil: . Thanks a lot.

~Steve
Posted on 2005-01-16 07:29:00 by diehard2
Steve,

Both the C runtimes and API functions will probably be off the pace, as they usually have higher overhead with SEH and similar. But if you wanted to use either, you would call them at high level, as there is no real gain in calling a high level function from asm.

These types of functions are reasonably easy to write anyway, and it's probably to your advantage to write a number of them and just benchmark them to see which is faster in the context you are using.
Posted on 2005-01-16 18:11:51 by hutch--
I've never seen a memcpy that sets up SEH - the idea is that the user sets up an SEH handler if he wants one. As for catching exceptions with SEH, this is done by the system, doesn't have overhead when exceptions aren't generated, and has nothing to do with the memory copy routine.

From Visual C++:

; The algorithm for forward moves is to align the destination to a dword
; boundary and so we can move dwords with an aligned destination. This
; occurs in 3 steps.
;
; - move x = ((4 - Dest & 3) & 3) bytes
; - move y = ((L-x) >> 2) dwords
; - move (L - x - y*4) bytes
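
As a rough illustration of those three steps (a sketch of the arithmetic only, not the CRT source; CopyPlan and plan_forward_copy are made-up names, and the clamp for very short copies is an addition of this sketch):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct CopyPlan {
    std::size_t head_bytes;  // x: bytes copied singly until dest is DWORD aligned
    std::size_t dwords;      // y: DWORDs copied with an aligned destination
    std::size_t tail_bytes;  // L - x - y*4: leftover bytes
};

// dest_addr is the destination address as an integer, len is L.
CopyPlan plan_forward_copy(std::uintptr_t dest_addr, std::size_t len)
{
    std::size_t x = static_cast<std::size_t>((4u - (dest_addr & 3u)) & 3u);
    if (x > len) x = len;                 // clamp for copies shorter than the head
    const std::size_t y = (len - x) >> 2;
    const std::size_t tail = len - x - y * 4;
    assert(x + y * 4 + tail == len);      // the three steps cover every byte
    return {x, y, tail};
}
```

For a destination ending in ...1 and L = 1440 this gives 3 head bytes, 359 DWORDs and 1 tail byte.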

This basically means "it's a pretty fast general purpose memcpy". The alignment step does add a bit of overhead, but that's only really something to consider if you have time-critical code that involves copying a lot of very small buffers around.

With intrinsic optimization on, memcpy(blaaah, maaaah, 42*1024); results in (the weird names are because of C++ name generation):


.text:00000001 mov esi, dword ptr ds:?maaaah@@3PAXA
.text:00000007 push edi
.text:00000008 mov edi, dword ptr ds:?blaaah@@3PAXA
.text:0000000E mov ecx, 2A00h
.text:00000013 rep movsd

That's right, directly inlined with no function call overhead.

RtlMoveMemory (the Windows API for memcpy()) from WinXP is a "rep movsd" plus a "rep movsb" to handle cases where the size isn't a multiple of four. It also handles situations where destination and source pointers overlap, and exits early if dst==src. Definitely no SEH or bloat there.
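
The overlap handling and early exit described here can be sketched in C++ like this. It only illustrates the behaviour, it is not the WinXP code, and move_bytes is a made-up name:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Overlap-safe byte copy: returns immediately when dst == src, copies
// forward when that cannot overwrite unread source bytes, backward otherwise.
void* move_bytes(void* dst, const void* src, std::size_t n)
{
    assert(n == 0 || (dst != nullptr && src != nullptr));
    unsigned char* d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);
    if (d == s || n == 0) return dst;          // nothing to do

    const std::uintptr_t da = reinterpret_cast<std::uintptr_t>(d);
    const std::uintptr_t sa = reinterpret_cast<std::uintptr_t>(s);
    if (da < sa || da - sa >= n) {
        for (std::size_t i = 0; i < n; ++i) d[i] = s[i];   // forward copy
    } else {
        for (std::size_t i = n; i-- > 0; ) d[i] = s[i];    // dst overlaps the end of src
    }
    return dst;
}
```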

Basically, don't be paranoid of your system, compiler or windows - have a look for yourself, and don't trust what random people say. In most cases, there won't be a reason to roll your own. If there are, you're either writing specialized code where assembly is necessary, or you should consider finding a better C++ compiler/library.
Posted on 2005-01-16 19:01:38 by f0dder
:-D

There is an old expression in motor racing, "When the flag drops, the bullsh*t stops".


> have a look for yourself, and don't trust what random people say.



> write a number of them and just benchmark them to see which is faster in the context you are using


Benchmarking ends the bullsh*t.

PS : Steve,

I should have mentioned that if you are going to write some assembler copy routines, for performance reasons write the assembler in a separate module and link it into your app, as the best available compilers are not technically good enough to handle both manually written code and their own internal optimisation together.

The magic rule is to ALWAYS benchmark algos of this type and design the benchmark to best fit the data you are going to be moving around. 1440 bytes is a relatively small byte count and the takeoff time will tend to matter to some extent. REP MOVSD code starts to be faster than incremented pointers on DWORD/BYTE style algos over about 200 / 250 bytes, but if it's truly critical and you have the hardware support available, it's worth looking at a specially cooked MMX or SSE(2) algo dedicated to your byte count.
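
A bare-bones timing harness for the 1440 byte case might look like the following C++ sketch. All names here (CopyFn, copy_memcpy, copy_bytes, time_copy) are made up for illustration, and the figures it produces only mean something on the machine and build settings you actually ship with.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>

constexpr std::size_t kLineBytes = 1440;   // byte count from the original question

using CopyFn = void (*)(void* dst, const void* src, std::size_t n);

// Candidate 1: the library memcpy.
void copy_memcpy(void* dst, const void* src, std::size_t n) { std::memcpy(dst, src, n); }

// Candidate 2: a plain byte loop, as a baseline.
void copy_bytes(void* dst, const void* src, std::size_t n)
{
    auto* d = static_cast<unsigned char*>(dst);
    auto* s = static_cast<const unsigned char*>(src);
    for (std::size_t i = 0; i < n; ++i) d[i] = s[i];
}

// Returns the average time per call in nanoseconds over `reps` calls.
double time_copy(CopyFn fn, int reps)
{
    static unsigned char src[kLineBytes];
    static unsigned char dst[kLineBytes];
    assert(reps > 0);
    volatile unsigned char sink = 0;           // keeps the copies from being optimised away
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) {
        src[0] = static_cast<unsigned char>(i);
        fn(dst, src, kLineBytes);
        sink = dst[0];
    }
    const auto t1 = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / reps;
}
```

Call time_copy once per candidate with a large repeat count (a million or so) and compare the averages; run it several times, since the first pass also pays for cache warm-up.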
Posted on 2005-01-17 00:22:01 by hutch--
Steve,

Here is a quickie to try out. It's a general purpose REP MOVSD style algo that assumes the data is aligned to at least 4 bytes. Set yourself up some method of timing the operations in a reliable way and try out the algos you have available. If the data is aligned by at least 4 and this one runs OK, it can be tweaked and replaced with more dedicated code that should get your speed up some.

Just save this code to a file and then build the file with ML.EXE with the /c /coff options so you have a complete OBJ module that you can link with the C code. You will need to write a C prototype using STDCALL for this algo.



; -------------------------------------------------------------------------

.486 ; force 32 bit code
.model flat, stdcall ; memory model & calling convention
option casemap :none ; case sensitive

.code

; -------------------------------------------------------------------------

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

align 4

mcopy proc src:DWORD,dst:DWORD,ln:DWORD

mov eax, esi ; preserve ESI & EDI
mov edx, edi

cld
mov esi, [esp+4] ; src
mov edi, [esp+8] ; dst
mov ecx, [esp+12] ; ln

shr ecx, 2
rep movsd

mov ecx, [esp+12] ; ln
and ecx, 3
rep movsb

mov edi, edx ; restore EDI & ESI
mov esi, eax

ret 12

mcopy endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; -------------------------------------------------------------------------

end
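
The STDCALL prototype mentioned above could look like this on the C++ side. This is a sketch under assumptions: USE_ASM_MCOPY is a made-up switch for builds that really link the OBJ file, copy_line is a made-up wrapper, and the memcpy stand-in exists only so the snippet compiles where the OBJ is not available.

```cpp
#include <cassert>
#include <cstring>

#if defined(_MSC_VER) && defined(_M_IX86) && defined(USE_ASM_MCOPY)
// Resolved at link time against the OBJ built from the listing above.
// extern "C" plus __stdcall gives the _mcopy@12 name that MASM emits.
extern "C" void __stdcall mcopy(const void* src, void* dst, unsigned long ln);
#else
// Stand-in with the same argument order, used when the OBJ is not linked.
inline void mcopy(const void* src, void* dst, unsigned long ln)
{
    std::memcpy(dst, src, ln);
}
#endif

// Copies one 1440 byte line. Note the source comes first, unlike memcpy.
inline void copy_line(const unsigned char* from, unsigned char* to)
{
    assert(from != nullptr && to != nullptr);
    mcopy(from, to, 1440);
}
```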
Posted on 2005-01-17 03:15:28 by hutch--
> is it possible to use memcpy in inline assembly?

Use a macro instead of a proc:



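; clobbers EAX, ECX and EDX; src, dst and ln must not be passed in
; EAX, ECX, EDX, ESI or EDI, as the macro overwrites them before use
mcopy macro src,dst,ln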

mov eax, esi ; preserve ESI & EDI
mov edx, edi

cld
mov esi, src ; src
mov edi, dst ; dst
mov ecx, ln ; ln

shr ecx, 2
rep movsd

mov ecx, ln ; ln
and ecx, 3
rep movsb

mov edi, edx ; restore EDI & ESI
mov esi, eax
endm



> Both the C runtimes and API functions will probably be off the pace as
> they usually have higher overhead with SEH

As f0dder already said, C runtime functions such as memcpy dont use SEH.
Posted on 2005-01-17 03:57:19 by japheth
hmmmm,

Shame API calls regularly DO use SEH.

Stack overhead does not matter with REP MOVSD on 1440 bytes (tested), so there is no gain in inlining the code. Here is a DWORD size copy that is marginally faster on my Prescott PIV.

The extra register usage is to prevent read after write stalls.



; -------------------------------------------------------------------------

align 4

OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE

bcopy proc src:DWORD,dst:DWORD,ln:DWORD

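; assumes ln is a non-zero multiple of 16 (1440 = 90 * 16)
push ebx ; preserve EBX, ESI, EDI & EBP on the stack rather than
push esi ; below ESP, where anything that borrows the stack
push edi ; could overwrite them
push ebp

mov esi, [esp+20] ; src (arguments sit 16 bytes higher after 4 pushes)
mov edi, [esp+24] ; dst
xor ebp, ebp

@@:
mov eax, [esi+ebp]
mov ebx, [esi+ebp+4]
mov ecx, [esi+ebp+8]
mov edx, [esi+ebp+12]

mov [edi+ebp], eax
mov [edi+ebp+4], ebx
mov [edi+ebp+8], ecx
mov [edi+ebp+12], edx

add ebp, 16
cmp ebp, [esp+28] ; ln
jb @B ; unsigned compare, ln is a byte count

pop ebp ; restore in reverse order
pop edi
pop esi
pop ebx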

ret 12

bcopy endp

OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef

; -------------------------------------------------------------------------
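
For benchmarking from the C++ side, the same access pattern (four loads, then four stores, 16 bytes per pass) can be written roughly as below. This is a sketch, not a translation of the asm: copy_unrolled16 is a made-up name, and the compiler is free to turn the loop into something quite different.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copies n bytes, 16 per iteration, through four 32 bit temporaries.
// Precondition: n is a multiple of 16 (1440 = 90 * 16).
void copy_unrolled16(void* dst, const void* src, std::size_t n)
{
    assert(n % 16 == 0);
    auto* d = static_cast<unsigned char*>(dst);
    auto* s = static_cast<const unsigned char*>(src);
    for (std::size_t off = 0; off < n; off += 16) {
        std::uint32_t a, b, c, e;
        // Four independent loads first. The 4 byte memcpy calls are the
        // portable way to spell a DWORD load without aliasing problems.
        std::memcpy(&a, s + off, 4);
        std::memcpy(&b, s + off + 4, 4);
        std::memcpy(&c, s + off + 8, 4);
        std::memcpy(&e, s + off + 12, 4);
        // Then four stores.
        std::memcpy(d + off, &a, 4);
        std::memcpy(d + off + 4, &b, 4);
        std::memcpy(d + off + 8, &c, 4);
        std::memcpy(d + off + 12, &e, 4);
    }
}
```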
Posted on 2005-01-17 04:53:53 by hutch--
API calls generally only set up SEH where it's needed. memcpy (RtlMoveMemory) doesn't set up a SEH (XP SP2 - but I doubt any other version does either). Look for yourself if you don't believe me.
Posted on 2005-01-17 06:49:46 by f0dder
> Shame API calls regularly DO use SEH.

I'm very ashamed having argued against M Hutchesson who obviously is *the* expert concerning C runtime topics.
Posted on 2005-01-17 07:06:34 by japheth
Let's not fight people. Digressing from the main topic..
Posted on 2005-01-17 08:19:00 by roticv
:-D

You will have to forgive us mere mortals who try and answer questions to all comers on a subject as broad as Windows API functions and the range of C runtime libraries written since 1995 for 32 bit Windows.

Now given that the member actually asked a question about writing some assembler code, I am in fact guilty of trying to help out here.

If you can be bothered to download the VC 2005 beta, you will see some Intel supplied asm code for some of the C runtimes that may be better than some of the crap that has been around in the past. But when you try and answer a member's question, you have to try and address what they were looking for in the first place: some info on writing a block copy routine dedicated to a particular task. :roll:
Posted on 2005-01-17 08:23:16 by hutch--
OK, let's get away from the memcpy question. Is there a way to call a C function from inline assembly? I see that it's possible for non-inline. Also, what is SEH (new guy here)? Thanks.

~Steve
Posted on 2005-01-17 08:47:13 by diehard2
Steve,

With a lot of messing around, it probably is possible but there is no gain by doing so as the compiler can easily handle a function call in what is usually a very efficient way.

If you want to use a function call with inline assembler, you would break up the blocks so that you have something like this style of code.



__asm{
; asm code here
}

// make the function call here;

__asm{
; the rest of the asm code here
}


SEH is Structured Exception Handling, which is used by the OS at times to handle critical errors. If you crash an app with an error in it, it means that the OS has handled the exception (error) because you have not. You can write this style of code yourself.
Posted on 2005-01-17 09:13:02 by hutch--
> the vc 2005 beta, you will see some Intel supplied asm code
> for some of the C runtimes that may be better than some of the crap that
> has been around in the past

The VC 5 CRT source is part of the CD and it's easy to see:

1. all the memcpy/strcpy stuff is written in ASM
2. it doesn't use SEH

VC5 is from 1997!
Posted on 2005-01-17 09:38:49 by japheth
hutch--, how exactly was your misinformation helpful?

> Both the C runtimes and API functions will probably be off the pace as they usually have higher overhead with SEH and similar.
Posted on 2005-01-17 10:09:22 by f0dder
> OK, let's get away from the memcpy question. Is there a way to call a C function from inline assembly? I see that it's possible for non-inline. Also, what is SEH (new guy here)?


Of course, you can call any C function from inline assembly. But each compiler has its own way of doing it. For example, the free version of MSC (aka VCToolkit == VC7.1) requires undecorated function names, like this:


/* memcpy(p,q,10) in inline assembly */
push 10
push q
push p
call memcpy
add esp,12


For SEH, you might want to consult MSDN or SDK documentation for introduction. (Maybe it is not in SDK documentation anymore. It was included in Win95 SDK documentation, though.) And, if you want to study more, read J Gordon's web page.

Finally, as a general note, create a separate module if you need inline assembly. Most compilers automatically turn off optimization for the module as a whole when they see inline assembly. MSC was one of them, and I personally do not use inline assembly except for testing.

Now, OT.
The number of incorrect claims in this thread about a certain C compiler and its library is disturbing. MS provided part of its library source code of MSC9 (aka Visual C 2.0) and MSC14 beta (aka VC2005) for public download. Read and compare the source code and figure out what is going on. And "Intel supplied asm code" is quite imaginative. That directory has been a part of the platform specific source tree and has nothing to do with Intel-created code.
Posted on 2005-01-17 12:08:24 by Starless