Hello. I wrote this fairly optimized routine to copy memory (ESI = source, EDI = destination, ECX = length in bytes):


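cld                 ; copy upward
mov eax,ecx         ; eax = length in bytes
sub ecx,edi
sub ecx,eax         ; ecx = -edi
and ecx,3           ; ecx = bytes needed to dword-align edi
sub eax,ecx         ; eax = length left after alignment
jle short @exit     ; too short to align: byte-copy it all
rep movsb           ; align the destination
mov ecx,eax
and eax,3           ; eax = trailing bytes
shr ecx,2           ; ecx = whole dwords
rep movsd
@exit:
add ecx,eax         ; ecx = bytes still to copy
rep movsb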

Now I'd like to make a "reversed" version so that I can also copy overlapping regions, but I'm confused about how to modify the routine above to make it run backward (apart from using the std instruction in place of cld).
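To illustrate why overlapping regions need a reversed copy, here is a minimal C sketch (C is used only to model the assembly; the helper names copy_forward and copy_backward are hypothetical and appear nowhere in the thread). When the destination lies above the source and the regions overlap, an ascending copy overwrites source bytes before it has read them, while a descending copy does not.

```c
#include <stddef.h>

/* Hypothetical helper: ascending byte copy, the same order that
   rep movsb uses with the direction flag clear. */
void copy_forward(unsigned char *dst, const unsigned char *src, size_t len)
{
    for (size_t i = 0; i < len; i++)
        dst[i] = src[i];
}

/* Hypothetical helper: descending byte copy, the order rep movsb
   uses with the direction flag set. */
void copy_backward(unsigned char *dst, const unsigned char *src, size_t len)
{
    while (len--)
        dst[len] = src[len];
}
```

Copying 4 bytes of "abcdef" from offset 0 to offset 2 gives "ababab" with copy_forward (the source was clobbered) and the intended "ababcd" with copy_backward.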

Does anybody have a complete suite (i.e. one that includes reversed versions) of optimized memory copy routines to share?
Posted on 2002-10-01 16:28:19 by Bugs' Bounty Hunter
Posted on 2002-10-01 16:43:09 by stryker
Actually, if you are going to use rep movs, it is quite simple. Just move to the end of the array and do what you did. For example,


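std                     ; copy downward
lea edi,[edi+ecx-1]     ; point at the last byte
lea esi,[esi+ecx-1]
mov edx,ecx             ; edx = full length
and ecx,3               ; ecx = odd bytes at the tail
rep movsb
mov ecx,edx
shr ecx,2               ; ecx = whole dwords
sub edi,3               ; step back to the dword start
sub esi,3
rep movsd
cld                     ; restore the direction flag for the caller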

Of course, this is not optimal. Using rep movsb for something less than 4 bytes long is never going to be optimal. :)
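For readers who prefer C, the same order of operations can be modeled roughly as follows. This is only a sketch: the function name copy_backward_dwords is hypothetical, and the memcpy through a temporary stands in for movsd, which loads a whole dword before storing it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name; a C model of the backward rep movsb / rep movsd idea. */
void copy_backward_dwords(unsigned char *dst, const unsigned char *src, size_t len)
{
    unsigned char *d = dst + len;          /* one past the last byte */
    const unsigned char *s = src + len;

    /* len & 3 odd bytes at the tail, highest address first (rep movsb). */
    for (size_t n = len & 3; n > 0; n--)
        *--d = *--s;

    /* len >> 2 dwords, descending (rep movsd). */
    for (size_t n = len >> 2; n > 0; n--) {
        uint32_t tmp;
        d -= 4;
        s -= 4;
        memcpy(&tmp, s, 4);   /* load the whole dword before storing it */
        memcpy(d, &tmp, 4);
    }
}
```

Because every chunk is read completely before it is written, and chunks are visited from high to low addresses, this stays correct when dst is above src and the regions overlap.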
Posted on 2002-10-01 16:45:32 by Starless
Instead of having two versions, just combine them.

This assumes the direction flag is clear on entry, which it should be on Win32.

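MemCopy proc uses esi edi, pSrc:DWORD, pDest:DWORD, dwLen:DWORD

mov esi, pSrc
mov edi, pDest
mov ecx, dwLen

mov eax, ecx            ; eax = full length
and ecx, 3              ; ecx = odd bytes (length mod 4)
cmp edi, esi            ; CF = 1 if dest is below source
jb @f                   ; dest below source: copy forward
std                     ; otherwise copy backward
lea edi, [edi+eax-1]    ; point at the last byte
lea esi, [esi+eax-1]
@@: rep movsb           ; odd bytes first (CF is untouched)
jb @f                   ; still testing CF from the cmp
sub edi, 3              ; backward: step back to the dword start
sub esi, 3
@@: mov ecx, eax
shr ecx, 2              ; ecx = whole dwords
rep movsd
cld                     ; restore the direction flag
ret                     ; needed so MASM emits the epilogue and return

MemCopy endp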
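The direction choice made by cmp edi, esi / jb can be sketched in C as below. This is a byte-wise model under stated assumptions, not a translation of the routine: the name mem_copy_overlap_safe is hypothetical, and the pointers are compared as integers via uintptr_t because comparing unrelated pointers directly is not defined in C.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name; models the cmp edi, esi / jb dispatch. */
void mem_copy_overlap_safe(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if ((uintptr_t)d < (uintptr_t)s) {
        /* Destination below source: ascending order is safe. */
        for (size_t i = 0; i < len; i++)
            d[i] = s[i];
    } else {
        /* Destination at or above source: copy descending. */
        while (len--)
            d[len] = s[len];
    }
}
```

Either branch is correct when the regions do not overlap, so the comparison only has to pick the safe order for the overlapping case.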
Posted on 2002-10-02 08:19:37 by iblis
This is always an interesting topic.
Looking at the source Bugs' Bounty Hunter posted, one may notice
that it enforces destination alignment, a known technique for optimizing
memory copy performance, especially when the destination is not write-combining.

So a "reverse", or "backward" version should do that as well (i.e. it is a feature,
not a bug :grin: ).

Here's my quick attempt at a "conversion" of the original routine that keeps
the destination alignment but copies in reverse (as is indispensable,
for example, when the src/dst regions overlap and dst>src). I could dedicate only
a few minutes to it, so I leave the clever optimizations to you. What I realized
is that MOVSx, when the direction flag is set, behaves in a very non-optimal
way, i.e. it post-decrements rather than pre-decrements, which would be much
more logical and intelligent. Either that is bad design, or I didn't look hard
enough for optimizations. Your turn now to go and do justice. ;)

EDI = Destination pointer
ESI = Source pointer
EAX = Length in bytes (can be zero without permanent damage to your SDRAM)



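;STD                    ; direction flag assumed set by the caller
LEA EDI,[EDI+EAX-1]     ; point at the last byte
LEA ESI,[ESI+EAX-1]
CMP EAX,4
JB SHORT .last          ; fewer than 4 bytes: byte-copy them all
LEA ECX,[EDI+1]         ; ECX = end of destination
AND ECX,3               ; bytes above the last dword boundary
SUB EAX,ECX
REP MOVSB               ; copy them so the rest ends dword-aligned
MOV ECX,EAX
SHR ECX,2               ; ECX = whole dwords
SUB EDI,3               ; step back to the dword start
SUB ESI,3
REP MOVSD
ADD EDI,3               ; back to byte addressing
ADD ESI,3
AND EAX,3               ; EAX = leftover bytes
.last: MOV ECX,EAX
REP MOVSB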
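As a cross-check of the logic, here is a rough C model of a backward copy that aligns the destination first. It is a sketch under assumptions: the name copy_backward_aligned is hypothetical, and memcpy through a temporary stands in for MOVSD.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name; a C model of a backward copy with destination alignment. */
void copy_backward_aligned(unsigned char *dst, const unsigned char *src, size_t len)
{
    unsigned char *d = dst + len;          /* one past the last byte */
    const unsigned char *s = src + len;

    if (len >= 4) {
        /* Bytes that sit above the last 4-byte boundary of the destination. */
        size_t head = (size_t)((uintptr_t)d & 3);
        len -= head;
        for (; head > 0; head--)
            *--d = *--s;

        /* Whole dwords, descending; d is 4-byte aligned from here on. */
        for (size_t n = len >> 2; n > 0; n--) {
            uint32_t tmp;
            d -= 4;
            s -= 4;
            memcpy(&tmp, s, 4);
            memcpy(d, &tmp, 4);
        }
        len &= 3;
    }

    /* Leftover bytes at the low end, or the whole copy when len < 4. */
    for (; len > 0; len--)
        *--d = *--s;
}
```

The len >= 4 guard plays the role of CMP EAX,4 / JB: it guarantees that subtracting up to 3 alignment bytes cannot make the remaining length negative.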


PS: For completeness, here is a well-known "forward" version; this one was cleverly optimized by Ken Silverman:


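;cld                        ; direction flag assumed clear
lea ecx,[edi+edi*2]         ; ecx = edi*3, congruent to -edi mod 4
and ecx,3                   ; bytes needed to dword-align edi
sub eax,ecx
jle short LEndBytes         ; too short to align: byte-copy it all
rep movsb
mov ecx,eax
and eax,3                   ; eax = trailing bytes
shr ecx,2                   ; ecx = whole dwords
rep movsd
LEndBytes: add ecx,eax      ; ecx = bytes still to copy
rep movsb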
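The lea ecx,[edi+edi*2] step works because 3 is congruent to -1 modulo 4, so edi*3 and -edi have the same low two bits. A small C sketch of that arithmetic (the function names pad_negate and pad_times_three are hypothetical):

```c
#include <stdint.h>

/* Bytes needed to reach the next 4-byte boundary, written the obvious way. */
uint32_t pad_negate(uint32_t addr)
{
    return (0u - addr) & 3u;
}

/* The same value via addr*3, as in lea ecx,[edi+edi*2] / and ecx,3. */
uint32_t pad_times_three(uint32_t addr)
{
    return (addr + addr * 2u) & 3u;
}
```

For example, an address ending in binary 01 needs 3 bytes of padding either way, and an aligned address needs 0. The lea form saves the separate mov and neg that the obvious version would need.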
Posted on 2002-10-03 16:11:39 by Maverick
In-place memory reverse. No extra buffer is needed, just the length of the string and a pointer to it.
memrev:


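push ebx
push esi
mov ecx, [esp+16]       ; ecx = length
mov ebx, [esp+12]       ; ebx = pointer to the string
xor edx, edx            ; edx = index from the front
mov esi, ecx
shr esi, 1
inc esi                 ; esi = number of swaps + 1

__copy:

dec esi
jz __finish             ; all swaps done
dec ecx                 ; ecx = index from the back
mov al, [ebx+edx]
mov ah, [ebx+ecx]
mov [ebx+ecx], al       ; swap the two bytes
mov [ebx+edx], ah
inc edx
jmp __copy

__finish:

pop esi
pop ebx
retn 8                  ; callee removes both arguments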
sample usage:
    invoke  lstrlen, text

push eax
push text ;[OFFSET text] for MASM....
call memrev




This has nothing to do with the main topic. It is for those who want to reverse text without using an extra buffer. BTW, I didn't check for a string length of 0. You can check whether ecx is 0 before the mov ebx instruction. :)
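The same two-index swap can be written in C roughly like this (a sketch; the name mem_reverse is hypothetical). The loop runs len / 2 times, so a length of 0 or 1 simply does nothing.

```c
#include <stddef.h>

/* Hypothetical name; reverses len bytes at p in place. */
void mem_reverse(unsigned char *p, size_t len)
{
    size_t lo = 0;     /* index from the front */
    size_t hi = len;   /* one past the index from the back */

    for (size_t n = len / 2; n > 0; n--) {
        unsigned char t;
        hi--;
        t = p[lo];
        p[lo] = p[hi];
        p[hi] = t;
        lo++;
    }
}
```

With an odd length the middle byte is never touched, which is why only len / 2 swaps are needed.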
Posted on 2002-10-03 19:59:33 by stryker
Well done, stryker :)
If I may make a suggestion, you can take advantage of BSWAP to optimize it a bit.
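One possible reading of the BSWAP idea, sketched in C: take 4 bytes from each end, byte-reverse each group, and store them at the opposite end, then finish the middle byte by byte. This is an assumption about what was meant, not code from the thread; the names bswap32 and mem_reverse_bswap are hypothetical, and bswap32 is a portable stand-in for the BSWAP instruction.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable stand-in for the x86 BSWAP instruction. */
static uint32_t bswap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* Hypothetical name; in-place reverse that moves 4 bytes per end per step. */
void mem_reverse_bswap(unsigned char *p, size_t len)
{
    size_t lo = 0, hi = len;

    /* While at least 8 bytes remain, swap a dword from each end. */
    while (hi - lo >= 8) {
        uint32_t a, b;
        hi -= 4;
        memcpy(&a, p + lo, 4);
        memcpy(&b, p + hi, 4);
        a = bswap32(a);
        b = bswap32(b);
        memcpy(p + hi, &a, 4);
        memcpy(p + lo, &b, 4);
        lo += 4;
    }

    /* Fewer than 8 bytes left in the middle: plain byte swaps. */
    while (hi - lo >= 2) {
        unsigned char t;
        hi--;
        t = p[lo];
        p[lo] = p[hi];
        p[hi] = t;
        lo++;
    }
}
```

Whether this beats the byte loop depends on the string length and the CPU; for short strings the setup may cost more than it saves.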
Posted on 2002-10-04 01:58:09 by Maverick
count   1234567890123 == 13 bytes
string  ThisIsAString
1. Swap byte T(1) and g(13) -> Actual Memory Offset(count - 1)
2. Swap byte h(2) and n(12) -> Actual Memory Offset(count - 1)
n.....

It does work like BSWAP. I'll see what I can do. :)

Sorry for the delay in responding; I was very busy with projects, homework... at school.



BTW, there is no need to check for a string length of 0 (with a length of 0 the first dec esi / jz exits at once). I don't know what I was thinking in the previous post. :)
Posted on 2002-10-09 22:04:22 by stryker