Forgive me if this is stupid, but I wanted a fast memcpy for smallish strings and the like, and for some reason I tried this for a laugh. On cached memory it seems to run better than twice as fast as the masm32lib memcopy for larger moves of ~100 bytes (this is probably due purely to the use of MMX regs). For smaller moves of ~32 bytes it runs almost three times as fast.

Still, it seems like overkill to me, and I wonder whether the size of the procedure itself would cause problems the first time it runs. I don't understand this area too well. Fasm code, but it should translate easily to Masm.
proc memcpy,mem1,mem2,size

enter
push esi edi
mov ecx,[size]
mov esi,[mem1]
mov edi,[mem2]
xor edx,edx

cmp ecx,15
ja .a8p
jmp near [.jTbl+ecx*4]

.a8p: lea esi,[esi+ecx-16]
lea edi,[edi+ecx-16]
neg ecx
jmp .a8nx
.a8lp: movq mm0,[esi+ecx]
movq mm1,[esi+ecx+8]
add edx,16
movq [edi+ecx],mm0
movq [edi+ecx+8],mm1
.a8nx: add ecx,16
jle .a8lp

mov eax,16
mov esi,[mem1]
mov edi,[mem2]
sub eax,ecx
lea esi,[esi+edx]
lea edi,[edi+edx]

cmp eax,15
ja .a0
jmp near [.jTbl+eax*4]
.a0:
pop edi esi
return
.a1: mov al,[esi]
mov [edi],al
pop edi esi
return
.a2: mov ax,[esi]
mov [edi],ax
pop edi esi
return
.a3: mov ax,[esi]
mov dl,[esi+2]
mov [edi],ax
mov [edi+2],dl
pop edi esi
return
.a4: mov eax,[esi]
mov [edi],eax
pop edi esi
return
.a5: mov eax,[esi]
mov dl,[esi+4]
mov [edi],eax
mov [edi+4],dl
pop edi esi
return
.a6: mov eax,[esi]
mov dx,[esi+4]
mov [edi],eax
mov [edi+4],dx
pop edi esi
return
.a7: mov eax,[esi]
mov dx,[esi+4]
mov cl,[esi+6]
mov [edi],eax
mov [edi+4],dx
mov [edi+6],cl
pop edi esi
return
.a8: movq mm0,[esi]
movq [edi],mm0
pop edi esi
return
.a9: movq mm0,[esi]
mov al,[esi+8]
movq [edi],mm0
mov [edi+8],al
pop edi esi
return
.a10: movq mm0,[esi]
mov ax,[esi+8]
movq [edi],mm0
mov [edi+8],ax
pop edi esi
return
.a11: movq mm0,[esi]
mov ax,[esi+8]
mov dl,[esi+8+2]
movq [edi],mm0
mov [edi+8],ax
mov [edi+8+2],dl
pop edi esi
return
.a12: movq mm0,[esi]
mov eax,[esi+8]
movq [edi],mm0
mov [edi+8],eax
pop edi esi
return
.a13: movq mm0,[esi]
mov eax,[esi+8]
mov dl,[esi+8+4]
movq [edi],mm0
mov [edi+8],eax
mov [edi+8+4],dl
pop edi esi
return
.a14: movq mm0,[esi]
mov eax,[esi+8]
mov dx,[esi+8+4]
movq [edi],mm0
mov [edi+8],eax
mov [edi+8+4],dx
pop edi esi
return
.a15: movq mm0,[esi]
mov eax,[esi+8]
mov dx,[esi+8+4]
mov cl,[esi+8+6]
movq [edi],mm0
mov [edi+8],eax
mov [edi+8+4],dx
mov [edi+8+6],cl
pop edi esi
return
.jTbl dd .a0,.a1,.a2,.a3,.a4,.a5,.a6,.a7,.a8,.a9,.a10,.a11,.a12,.a13,.a14,.a15
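
For anyone who wants to experiment with the same structure outside of assembler, here is a rough C sketch of the idea: copy whole 16-byte blocks in a loop, then dispatch on the 0..15 byte remainder. It is only an illustration, not a translation of the routine above. The name small_copy is made up, and whether the compiler turns the switch into a jump table (or the fixed-size memcpy calls into plain loads and stores) is an assumption about typical optimizing compilers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical name. Copies n bytes from src_v to dst_v (no overlap). */
void small_copy(void *dst_v, const void *src_v, size_t n)
{
    unsigned char *dst = dst_v;
    const unsigned char *src = src_v;

    assert(n == 0 || (dst_v != NULL && src_v != NULL));

    /* Main loop: whole 16-byte blocks, like the .a8lp loop. */
    while (n >= 16) {
        memcpy(dst, src, 16);
        dst += 16;
        src += 16;
        n -= 16;
    }

    /* Tail: n is now 0..15, one case per .aN label. */
    switch (n) {
    case 0:  break;
    case 1:  memcpy(dst, src, 1);  break;
    case 2:  memcpy(dst, src, 2);  break;
    case 3:  memcpy(dst, src, 3);  break;
    case 4:  memcpy(dst, src, 4);  break;
    case 5:  memcpy(dst, src, 5);  break;
    case 6:  memcpy(dst, src, 6);  break;
    case 7:  memcpy(dst, src, 7);  break;
    case 8:  memcpy(dst, src, 8);  break;
    case 9:  memcpy(dst, src, 9);  break;
    case 10: memcpy(dst, src, 10); break;
    case 11: memcpy(dst, src, 11); break;
    case 12: memcpy(dst, src, 12); break;
    case 13: memcpy(dst, src, 13); break;
    case 14: memcpy(dst, src, 14); break;
    case 15: memcpy(dst, src, 15); break;
    }
}
```

For example, small_copy(dst, src, 37) copies two 16-byte blocks and then takes the case 5 branch.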
Posted on 2003-07-31 20:02:35 by Eóin
Uhm, one question: mustn't emms (a.k.a. "the 50 cycles wait") be called after MMX code is used, or else any following function that uses the FPU may run into problems?
Posted on 2003-08-01 04:00:07 by scientica
Yes, I believe you're right, but I suppose that simply means: don't mix this code with FPU code. For me personally that's not an issue, since I tend to use SSE almost entirely now.

While I understand the problem with using MMX code that you mention, it still has its place from time to time in algorithms.
Posted on 2003-08-01 06:36:16 by Eóin
emms feels superfluous, but I don't know if it can be skipped when the app doesn't use the FPU (does some API use the FPU? If not, then maybe emms isn't required).
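
A hedged C sketch of where EMMS would go, assuming GCC or Clang on x86 with the <mmintrin.h> intrinsics, where _mm_empty() is the intrinsic for EMMS. The helper name copy8_release_fpu is made up for illustration, and the compiler is free to move the 8 bytes without touching an MMX register at all, so this only shows the placement of the call. The background is that the MMX registers alias the x87 register stack, which is why x87 code that runs after MMX code without an EMMS in between can misbehave.

```c
#include <assert.h>
#include <string.h>

#if defined(__MMX__)
#include <mmintrin.h>
#endif

/* Hypothetical helper: move 8 bytes, then hand the register file back to x87. */
static void copy8_release_fpu(void *dst, const void *src)
{
    assert(dst != NULL && src != NULL);
#if defined(__MMX__)
    __m64 q;
    memcpy(&q, src, sizeof q);   /* 8 bytes into an MMX-typed value */
    memcpy(dst, &q, sizeof q);   /* and out again */
    _mm_empty();                 /* EMMS: must come before any later x87 code */
#else
    memcpy(dst, src, 8);         /* no MMX available: plain 8-byte copy */
#endif
}
```

In a loop that moves many blocks, the call would go once after the loop rather than after every block.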
Posted on 2003-08-01 07:07:24 by scientica
will this be faster than <http://cdrom.amd.com/devconn/events/gdc_2002_amd.pdf> ? :confused:
Posted on 2003-08-01 07:08:48 by S.T.A.S.
S.T.A.S., thank you very much for that link; it's very informative and I actually learned a lot. That link, however, seems to concentrate on large memory transfers, whereas my interest here was in small transfers.

As f0dder was always keen to point out, and I agree with him, we really should have two versions of the various generic functions (strlen, memcpy, etc.): one which concentrates on smaller strings, another for large blocks of memory.
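
A minimal C sketch of that split, under stated assumptions: the names copy_small, copy_large and copy_dispatch are made up, the 64-byte threshold is a placeholder that would have to be found by measurement, and the "large" path just calls the library memcpy as a stand-in for a block-oriented routine.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { SMALL_COPY_LIMIT = 64 };   /* hypothetical threshold, tune by timing */

/* Small path: almost no setup cost, fine for short strings. */
static void copy_small(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(n <= SMALL_COPY_LIMIT);
    while (n--)
        *dst++ = *src++;
}

/* Large path: stand-in for a routine tuned for big blocks. */
static void copy_large(unsigned char *dst, const unsigned char *src, size_t n)
{
    memcpy(dst, src, n);
}

/* One entry point that picks a version by size. */
void copy_dispatch(void *dst, const void *src, size_t n)
{
    if (n <= SMALL_COPY_LIMIT)
        copy_small(dst, src, n);
    else
        copy_large(dst, src, n);
}
```

If the caller already knows the size class, calling the small or large version directly avoids even the one compare.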
Posted on 2003-08-01 20:07:34 by Eóin


As f0dder was always keen to point out, and I agree with him we really should have two versions of the various generic functions (strlen, memcpy, etc) one which concentrates on smaller strings, another for large blocks of memory.


You're right!
maybe more than 2 :grin:

i'm new at win32 asm, but some (long) time ago i coded for the Spectrum (Z80), and there the best memcopy routine looked like this:

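ld sp, source        ; point SP at the 8 source bytes (interrupts must be off)
pop hl               ; hl = source[0..1]
pop de               ; de = source[2..3]
pop bc               ; bc = source[4..5]
pop af               ; af = source[6..7]
ld sp, destination+8 ; push writes downward, so start just past the end
push af              ; destination[6..7]
push bc              ; destination[4..5]
push de              ; destination[2..3]
push hl              ; destination[0..1]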

i believe the PC has many undiscovered reserves that people miss by using HLLs

small chunks of data, i think, are the most acceleratable, now that the CPU cache is 128 KB-1 MB.
it's more than all the Spectrum's memory (48k):grin:

for <=32 byte transfers, i expect it's better to align all data by 8 and use code like:
mov ecx,
Posted on 2003-08-01 21:51:41 by S.T.A.S.
The start of an opcode as a simple linear equation? :)
Good technique.
It reminds me of the old tricks that put opcode sizes into equations.
How long have you been doing low-level coding, S.T.A.S.?
Nowadays people mostly try to put data (not opcode addresses) into equations; you must be from years ago :)
Posted on 2003-08-01 23:49:00 by The Svin

Nowadays people mostly try to put data (not opcode addresses) into equations; you must be from years ago :)


10-11 years ago i started coding Z80 asm on my Speccy
i wrote a music editor for the AY-8910/12 (Yamaha 3-voice chip) and some small progs (archiver, demos, etc)
and 7 years ago i stopped :(

half a year ago i came back to learning to code on x86/win32
now i'm writing a simple DX7 prog (hope it'll get done some time)
i posted it yesterday
http://www.asmcommunity.net/board/attachment.php?postid=112656

sorry for my english, i'm not good at it, The Svin

S.T.A.S.
Posted on 2003-08-02 00:23:58 by S.T.A.S.
Originally posted by S.T.A.S.
jmp dword ptr l0


sure, this won't work :stupid: :stupid: i was just remembering another CPU
there must be a table like .jTbl

or

lea ECX,
push ECX
ret

but it's slow. if we could do direct jumps like jmp l0+ECX*8

:stupid: why i posted it? just for you all to smile :grin:
Posted on 2003-08-02 04:51:18 by S.T.A.S.
lea ecx,
jmp ecx
Posted on 2003-08-02 06:21:28 by The Svin
Actually it's better to express
the linear dependency this way:
x = StartOffset + 8*(Count-1)
  = StartOffset - 8 + 8*Count.
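
As a loose C illustration of the equation above: the two helper functions compute the entry address both ways and give the same value, and the switch shows the general idea of choosing an entry point into an unrolled sequence from a count. Everything here is a sketch under assumptions: the names are made up, STEP_BYTES = 8 is taken to be the size of one unrolled step (my reading of the factor 8), and how the compiler actually implements the switch is up to the compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { STEP_BYTES = 8 };   /* assumed size of one unrolled step */

/* x = StartOffset + 8*(Count-1) */
static uintptr_t entry_direct(uintptr_t start, unsigned count)
{
    assert(count >= 1);
    return start + STEP_BYTES * ((uintptr_t)count - 1);
}

/* x = StartOffset - 8 + 8*Count: the -8 can be folded into a constant. */
static uintptr_t entry_folded(uintptr_t start, unsigned count)
{
    assert(count >= 1);
    return (start - STEP_BYTES) + STEP_BYTES * (uintptr_t)count;
}

/* Count picks the entry point; execution then runs through the rest. */
static void copy_dwords_unrolled(uint32_t *dst, const uint32_t *src,
                                 unsigned count)
{
    assert(count <= 4);
    switch (count) {
    case 4: dst[3] = src[3];   /* fall through */
    case 3: dst[2] = src[2];   /* fall through */
    case 2: dst[1] = src[1];   /* fall through */
    case 1: dst[0] = src[0];   /* fall through */
    case 0: break;
    }
}
```

With a start of 0x1000 and a count of 3, both forms give 0x1010. The folded form is the one that maps onto a single address calculation, since StartOffset - 8 is a constant.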

So you can implement it as:

mov ecx,
Posted on 2003-08-02 07:04:45 by The Svin

jmp ecx

thank you, The Svin!
now i see it: 0FFh, 0E1h (mov eip, ecx)
sometimes i'm confused by intel syntax.:mad:
Posted on 2003-08-02 21:06:47 by S.T.A.S.
I agree.
They try to instill the thought that (E)IP
is not accessible to a programmer,
and that CS is not changeable.
Yet that makes it difficult to
understand what jumps mean.
Of course, EIP is accessible,
and jmps, calls and rets are nothing but
loading values into EIP (and CS with FAR jmps, calls, rets)
or adjusting (adding to) the value of EIP
(with relative jmps).
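
To make that concrete, here is a toy model in C (not real x86 semantics, just the bookkeeping): ip stands in for EIP, and each control instruction is a plain assignment or addition. All names are made up for the sketch, and ip is assumed to already point at the next instruction when a relative jump is applied.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t ip;         /* stands in for EIP */
    uint32_t stack[8];   /* tiny stack for return addresses */
    int sp;              /* number of entries currently on the stack */
} ToyCpu;

static void push(ToyCpu *c, uint32_t v)
{
    assert(c->sp < 8);
    c->stack[c->sp++] = v;
}

/* jmp reg / jmp [mem]: load a value into ip ("mov eip, ecx"). */
static void jmp_abs(ToyCpu *c, uint32_t target) { c->ip = target; }

/* relative jmp: add a signed displacement to ip. */
static void jmp_rel(ToyCpu *c, int32_t disp) { c->ip += (uint32_t)disp; }

/* call: save the current ip, then load the new one. */
static void call_abs(ToyCpu *c, uint32_t target)
{
    push(c, c->ip);
    c->ip = target;
}

/* ret: pop a value straight into ip. */
static void ret_near(ToyCpu *c)
{
    assert(c->sp > 0);
    c->ip = c->stack[--c->sp];
}
```

In this model, push followed by ret_near ends with the same ip as jmp_abs, which is the "push ECX / ret" trick mentioned earlier in the thread.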
Posted on 2003-08-03 00:32:53 by The Svin

and jmps,calls,rets are nothing but
loading values in EIP (and CS with FAR jmps,calls,rets)
or adjusting (adding) to the value of EIP
(with relative jmps)

Perhaps this belongs in the Opcode thread? What the instructions really do, e.g. cmp = sub with the result not saved but the flags set.
Posted on 2003-08-03 06:31:15 by scientica
Yes, but people and docs usually devote a lot of space to what cmp or test really do, yet few stress what "control" instructions really do with EIP. And there is a lot of distracting info
that EIP and CS are not accessible, blah blah blah.
I understand their way: they want programmers to think of
control instructions in logical, not arithmetic, ways. Yet that only
slows education, and when it comes to system programming, where
you must deal with exceptions, interrupts, handling of different kinds of exceptions and ring switching under different conditions, you are simply forced to think of control instructions in arithmetic and CS:EIP-manipulating ways.
When you look at the docs with experienced eyes, you wouldn't say that the docs are wrong, yet when you are a beginner there
is very little chance of understanding at once what jmps, rets etc.
really do.
Posted on 2003-08-03 07:27:44 by The Svin
Hi, what is your problem with the standard mode copy?

some simple copy routines :



move32:shr cx,1 ; ds:si = source, es:di = destination, cx = size
jnc .nb ; les di,dest lds si,source
movsb
.nb: shr cx,1
jnc .nw
movsw
.nw: rep movsd
.sk0: ret

Move128bitFPU: ; copying data using the fpu
fild qword [ds:si] ; input:
fild qword [ds:si+8] ; ds:si = source, es:di = destination
fxch ; ecx = number of 16-byte chunks to move
fistp qword [es:di] ; output: none (data from ds:si is copied to es:di)
fistp qword [es:di+8] ; destroys: si, di, ecx, flags, fp flags
add si,16 ; ps.: requires Pentium +
add di,16
dec ecx
jnz Move128bitFPU
ret
Move128bitFPUraw: ; copying data using the fpu
fild qword [esi] ; input:
fild qword [esi+8] ; esi = source, edi = destination
fxch ; ecx = number of 16-byte chunks to move
fistp qword [edi] ; output: none (data from esi is copied to edi)
fistp qword [edi+8] ; destroys: esi, edi, ecx, flags, fp flags
add esi,16 ; ps.: requires Pentium +
add edi,16
dec ecx
jnz Move128bitFPUraw
ret
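
For comparison, a rough C rendering of what move32 above does with the size: bit 0 decides whether one byte is moved, bit 1 whether one word is moved, and the remaining count (size / 4) is moved as dwords. The function name is made up, and the fixed-size memcpy calls are assumed to be turned into plain loads and stores by the compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical name; same order as move32: byte, then word, then dwords. */
void copy_low_bits_first(void *dst_v, const void *src_v, size_t n)
{
    unsigned char *dst = dst_v;
    const unsigned char *src = src_v;

    assert(n == 0 || (dst_v != NULL && src_v != NULL));

    if (n & 1) {                 /* shr cx,1 / jnc / movsb */
        *dst++ = *src++;
    }
    if (n & 2) {                 /* shr cx,1 / jnc / movsw */
        memcpy(dst, src, 2);
        dst += 2;
        src += 2;
    }
    for (size_t i = n >> 2; i != 0; i--) {   /* rep movsd */
        memcpy(dst, src, 4);
        dst += 4;
        src += 4;
    }
}
```

Note that moving the odd byte and word first means the dword loop can start at an odd address; doing the dwords first and the leftovers last is the other common ordering.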


MATRIX
Posted on 2004-09-29 19:05:48 by >Matrix<
I think Microsoft has a VERY GOOD optimized algorithm of MemCpy Function.
It's inside some LIBs:

NTDLL.DLL or MSVCRT.DLL or MSVCR70.DLL or MSVCR71.DLL

but not in CRTDLL.DLL

so if you want your memcpy, use it.

have fun.
Posted on 2004-10-20 23:35:26 by nhnpresario
I think Microsoft has a VERY GOOD optimized algorithm of MemCpy Function.
It's inside some LIBs:

NTDLL.DLL or MSVCRT.DLL or MSVCR70.DLL or MSVCR71.DLL

but not in CRTDLL.DLL

so if you want your memcpy, use it.


Only if you are still using classic Pentium (aka P5). This also applies to >Matrix<'s code. Do people still use Pentium classic with Windows 2K or later?
Posted on 2004-10-21 05:05:22 by Starless
You must be right.
But in my opinion, software is for everyone.
"Everyone" means that when you write software, it must run on as many target machines as you can manage.
So I have a conclusion:
we test the target system before running the program.
If it is an old-style x86, we use 386 instructions.
If it is a modern one with SSE, SSE2, 3DNOWEXT,... we use the new technology.
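
A small C sketch of that idea, assuming GCC or Clang on x86, where __builtin_cpu_init() and __builtin_cpu_supports() are available; on any other compiler or CPU the sketch simply falls back to the baseline routine. The names copy_baseline, copy_sse2 and select_copy are made up, and copy_sse2 is only a placeholder that calls memcpy rather than a real SSE2 routine.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void (*copy_fn)(void *dst, const void *src, size_t n);

/* Works everywhere: plain byte loop ("386 instructions"). */
static void copy_baseline(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
}

/* Placeholder for a routine that would use SSE2 moves. */
static void copy_sse2(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Test the target system once, before the program does real work. */
static int cpu_has_sse2(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__i386__) || defined(__x86_64__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse2") != 0;
#else
    return 0;   /* unknown target: assume the old-style path */
#endif
}

/* Pick the routine once and keep the pointer for later calls. */
copy_fn select_copy(void)
{
    return cpu_has_sse2() ? copy_sse2 : copy_baseline;
}
```

The program would call select_copy() once at startup, store the result, and call through that pointer afterwards, so the feature test is not repeated on every copy.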
Regards
Posted on 2004-10-22 05:39:57 by nhnpresario