I want to copy bytes from one memory block to another, but a variable amount must be subtracted from each byte on the way (a result below zero becomes 0).
But my routine is very slow. How can I make it much faster?
Please help a newbie whose English is not good...

mov eax,0
mov ecx,dword ptr
mov ebx,dword ptr
mov edx,dword ptr
mov edi,variablecount

mov al,
sub eax,edi
jge nzero
mov al,0
mov ,al
inc ebx
inc edx
loop new
Posted on 2001-09-25 13:06:03 by Nordwind64
See Svin's tutorial on this topic:
"Fiction point" logic
Posted on 2001-09-25 13:36:42 by rafe

A couple of things. LOOP is a very slow instruction, and it is faster to use a CMP (or DEC) and a conditional jump, even though you end up with more instructions.

For a plain copy with normal integer instructions, the string instruction MOVSD used with the REP prefix is well optimised. The MASM32 library has a standard memory copy procedure that uses it.

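mov esi, [Source]   ; source address
mov edi, [Dest]     ; destination address
mov ecx, [ln]       ; length in bytes

shr ecx, 2          ; number of whole dwords
rep movsd           ; copy four bytes at a time

mov ecx, [ln]
and ecx, 3          ; 0 to 3 leftover bytes
rep movsb           ; copy the tail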

You will also get reasonably fast code by loading the source and destination addresses into registers and coding the byte copy by hand.

mov al, [esi]
mov [edi], al
inc esi
inc edi

This may give you extra control if you need to change a counter as well.
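As a rough illustration of that extra control, here is a minimal C sketch of a byte-at-a-time copy that subtracts a value from every byte and stores 0 when the result would go below zero, which is what the routine in the question appears to be doing. It is written in C only so it can be compiled and used to check an assembly version; the name copy_sub_clamped and its parameters are made up for this example and do not come from the MASM32 library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy n bytes from src to dst, subtracting diff from each byte.
   A result that would fall below zero is stored as 0 instead of
   wrapping around. Hypothetical helper, for illustration only. */
void copy_sub_clamped(uint8_t *dst, const uint8_t *src, size_t n, uint8_t diff)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    for (size_t i = 0; i < n; ++i) {
        uint8_t b = src[i];
        dst[i] = (b > diff) ? (uint8_t)(b - diff) : 0;
    }
}
```

Running the assembly routine and this function over the same buffer and comparing the results is an easy way to catch mistakes while optimising.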

I think Ricky Bower posted an MMX memory copy that is very fast but it requires a PIII or later to use it.


Posted on 2001-09-25 17:09:03 by hutch--
From your algorithm, you want to do what is called a saturated subtraction (any byte smaller than the value being subtracted is clipped to zero instead of wrapping around). This kind of operation is what MMX was designed for:

mov eax,sDiff
mov ecx,7
push eax
push eax

@@: mov BYTE PTR [esp + ecx],al
dec ecx
jne @B ;NOTE: bytes 0 & 4 are set already ;)
movq mm7,[esp]
add esp,8

mov ecx,From ;these should be qword aligned!
mov edx,To ;these should be qword aligned!

@@: sub Count,8 ;we are doing eight bytes at a time
jl @F ;fewer than eight bytes left, go do the few bytes on the end
movq mm0,[ecx]
add edx,8
psubusb mm0,mm7
add ecx,8
movq [edx-8],mm0
jmp @B
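@@: add Count,8 ;restore the leftover byte count (0 to 7)
je @Done
;do the remaining bytes... (add your code here...)
@Done: emms ;clear the MMX state before any FPU code runs
ret
SByteCopy ENDP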
Okay, flame me for being a lazy programmer, but you should get the gist of what is being done here. Personally, I'd just do the extra bytes and clip the array if needed - I don't know what you're using this for, and I'm lazy. :) Don't forget there are many more MMX registers, so there is much more parallelism that can be added here (see the MMX memcopy for details). If this is optimized, you get the saturated subtraction virtually for free (meaning it will be just as fast as memcpy). :grin:

Edit: HERE is a link to the MMX memcpy routine, and there is a link to a good article at SGI that explains it. I came up with a different version for Athlons, but AMD beat me to it (I can no longer find this code at AMD?)
Posted on 2001-09-25 18:45:46 by bitRAKE

Thank you for your answers! I will try this out.

Thanks, Nordwind
Posted on 2001-09-26 23:22:35 by Nordwind64