Hey everybody,
I'm writing a string copy routine where the source string is of unknown length. I figure there are two approaches:

1) First, call StringLength on the source, then copy that number of bytes from Source to Dest
2) Or, while copying the bytes, test for the end of the string.

I'm currently running with option 2 using the following code because I figure one loop is better than two. It uses the string length-testing that was posted as Jens' in the string length thread.

Just wondering if people have some better ideas.

Thanks,



StringCopy PROC uses esi edi ebx ecx edx source,dest:DWORD
mov esi,source
mov edi,dest
xor eax,eax

@@: mov edx,[esi+eax] ;read a dword
mov ebx,edx ;use some regs to determine if dword
lea ecx,[edx - 01010101h] ;contains a zero byte
xor edx,ecx
and edx,80808080h
and edx,ecx ;so does it contain a zero byte
jnz @@CopyBytes ;if yes, jump to copy remaining bytes
mov DWORD PTR [edi+eax],ebx ;if no copy the whole dword over
add eax,4
jmp @B
@@: shr ebx,8
@@CopyBytes: mov BYTE PTR [edi+eax],bl ;copy the bytes out of ebx
inc eax
test bl,bl
jnz @B
lea eax,[eax+edi-1]
ret
StringCopy ENDP


--Chorus
Posted on 2002-08-04 09:31:30 by chorus
This is not faster but simpler, shorter and a lamer :grin:
[size=9]strcpy PROC USES esi edi lpSrc:DWORD, lpDest:DWORD

mov esi, lpSrc
xor ecx, ecx
mov edi, lpDest
@@:
mov dl, BYTE PTR [esi+ecx]
mov BYTE PTR [edi+ecx], dl
inc ecx
or dl, dl
jnz @B
ret
strcpy ENDP[/size]
You can even optimize more but I won't 'coz I'm lazy:

- just move in between and and insert a -1 on the

- use eax to replace either esi or edi

- replace with

- you can also insert nops but since I dont know how, I won't :tongue:
Posted on 2002-08-04 15:20:54 by stryker
I'm hoping to move DWORDs at a time if possible ;)

--Chorus
Posted on 2002-08-04 16:03:42 by chorus
chorus,

Your idea is right.
I created the same thing (with some additional sanity considerations) for my FreeBSD about a year ago. After some test, it turned out that the new 4 byte copy routine was faster than the byte copy when the string was longer than X (I don't remember the exact result of my test, so it is some number X, but it was not that big number like MMX memcpy would require).

I'm at work now and I don't have my code right now.
When I get home, I'll post mine. :)
Posted on 2002-08-05 14:29:38 by Starless
Thank you, I'd be interested in seeing it :)

--Chorus
Posted on 2002-08-05 14:32:12 by chorus
Stupid question! Why would one want an exact copy of a string of unknown lenght?
I looked now -- now I see you want to copy a null terminated string.
Therefore I see you can not have nulls in the string.
Posted on 2002-08-05 18:52:36 by Roy Cline
Don't see too many strings with NULLs in the middle of them Roy. :)

The point is that when I want to concatenate strings together, it's nice to just be able to call the StringCopy routine. If you ever use string copying, I'm sure you have to call a proc to get the string length first, and *then* copy the string over. I wanna kill two birds with one stone as they say. Hopefully, so it's faster

--Chorus
Posted on 2002-08-05 19:01:44 by chorus
OK, here is my code. It took me some time to convert AT&T syntax into Intel syntax. ;)

The idea is the same as the code by chorus: Use Agner's strlen() to speed up strcpy. Additional code is to make this function robust under general condition -- This was intended to replace FreeBSD libc strcpy().

<EDIT>
This code was written with P6 in mind. It would be interesting if someone test this on a different CPU and see whether this function is still fast.
</EDIT>


; char *strcpy(char *dst, char *src)
; engine: Agner's fast strlen().
strcpy PROC C
push esi
mov esi,[esp+12] ; src
mov edx,esi
push edi
mov edi,[esp+12] ; dst
sub esi,edi
push edi ; save the return value

neg edx ; the number of bytes to
and edx,3 ; the next 4 byte boundary
jz LoopEntry

mov edx,edx ; 2 byte nop

; Align src: byte-by-byte copy
; Aligning src (rather than dst) is not just for speed
; but also for avoiding SIGSEGV.
; Of course, you may still get SIGSEGV from this function,
; but that means nothing but the incompetence of the C coder,
; who does not secure a large enough buffer, thereby causing
; buffer overflow.
@@: mov al,[esi+edi]
mov [edi],al
test al,al
jz Done
inc edi
dec edx
jnz @B

jmp LoopEntry

ALIGN 16
LoopTop:
mov [edi],eax
add edi,4
LoopEntry:
mov eax,[esi+edi]
mov edx,eax
lea ecx,[eax-01010101h]
not edx
and ecx,edx
and ecx,80808080h ; end of string?
jz LoopTop ; nope. go ahead and write

; We reached the end of the string.

; The method of handling the remaining 4 or less bytes
; (including NUL) could be byte-by-byte copy.
; The choice depends on what the distribution of
; (strlen(src) % 4 + 1) looks like.
; The distribution is not known a priori, and my choice
; is to minimize the number of Jcc's in the worst case.
test ecx,8080h ; was NUL in the first 2 bytes?
jnz @F
mov [edi],ax ; nope.
add edi,2
shr eax,16
shr ecx,16
@@: mov [edi],al ; whether it is NUL or not, we have
test cl,80h ; to copy the first byte.
jnz Done
mov [edi+1],ah ; and copy NUL if necessary.

Done: pop eax ; dst saved earlier
pop edi
pop esi
ret

strcpy ENDP

Posted on 2002-08-05 23:32:57 by Starless
[size=9]strcpy PROC USES ebx lpszPattern:DWORD, lpBuffer:DWORD


mov ebx, lpszPattern
xor ecx, ecx
mov edx, lpBuffer
pxor MM7, MM7

__8bytecopy:

movq MM0, [ebx+ecx]
movq MM1, MM0
pcmpeqb MM0, MM7
packsswb MM0, MM0
movd eax, MM0
test eax, eax
jnz __exit8bytecopy
movq [edx+ecx], MM1
add ecx, 8
jmp __8bytecopy

__exit8bytecopy:

mov al, BYTE PTR [ebx+ecx]
test al, al
jz __exitnowidiot
mov BYTE PTR [edx+ecx], al
inc ecx
jmp __exit8bytecopy

__exitnowidiot:

emms
ret

strcpy ENDP[/size]
I'm really tired right now. If you see some errors tell me.
Posted on 2002-08-06 00:54:50 by stryker
Want some more optimizations??
[size=9]strcpy:


push ebx
mov ebx, [esp+8]
xor ecx, ecx
mov edx, [esp+12]
pxor MM7, MM7

__8bytecopy:

movq MM0, [ebx+ecx]
movq MM1, MM0
pcmpeqb MM0, MM7
packsswb MM0, MM0
movd eax, MM0
test eax, eax
jnz __1bytecopy
movq [edx+ecx], MM1
add ecx, 8
jmp __8bytecopy

__1bytecopy:

mov al, [ebx+ecx]
inc ecx
mov [edx+ecx-1], al
test al, al
jnz __1bytecopy

emms ;femms for AMD
pop ebx
retn 8[/size]
There's a tendency programmers might forget to insert the prologue ... none ... default ... directive in MASM ... Thus creating a frame. So I changed my code above to add more optimization and be compatible with FASM too. There's also a bug in first version of the code - It doesn't terminate the destination string with 0. :)

Usage:

push OFFSET destination
push OFFSET source
call strcpy

::edit::

I forgot to preserve ebx. Remember source string must be zero terminated.
Posted on 2002-08-06 16:25:59 by stryker
Why not shorter?


strcpy:
push ebx
xor eax, eax
pxor MM7, MM7
mov ebx, [esp+8]
mov edx, [esp+12]
strcpyLoop:
movq MM0, [ebx+eax] ; Unknown length =
movq [edx+eax], MM0 ; Unknown length + 8
pcmpeqb MM0, MM7
packsswb MM0, MM0
movd ecx, MM0
add eax, 8
jecxz strcpyLoop
pop ebx
emms ;femms for AMD
retn 8
Posted on 2002-08-06 17:56:55 by lingo12
Yeah, your right :alright: but why not
strcpy3:

push ebx
xor eax, eax
pxor MM7, MM7
mov ebx, [esp+8]
mov edx, [esp+12]
strcpyLoop:
movq MM0, [ebx+eax] ; Unknown length =
movq [edx+eax], MM0 ; Unknown length + 8
pcmpeqb MM0, MM7
pmovmskb ecx, MM0
add eax, 8
jecxz strcpyLoop
pop ebx
emms ;femms for AMD
retn 8
:grin: of course, now I'm using SSE to do this, which I don't want to use due to the majority of ppl who doesn't have a P3 or AMD Athlon. :grin:

Anyway, :alright:
Posted on 2002-08-06 18:08:51 by stryker
Starless, I like your proc. I didn't bother doing any alignment in mine, I think I'll start :) I especially like what you did to copy the remainder bytes, although I'd remove the shr ecx,16 and replace test cl,80h with test al,al.

Stryker, very nice :) I'm no good with MMX yet (just starting to learn the instructions), but I'll take a close look at them.

lingo, I actually considered this approach when I first took a look at Stryker's code. It does make a shorter proc, and since I always assume the buffer is "large enough" :) It shouldn't be a problem to copy the couple extra bytes. In my case, however, I like the return value to be the pointer to the zero byte so I can make multiple calls in a row, ex:



invoke StringCopy,addr szFirstName,addr Buffer
invoke StringCopy,addr szLastName,eax
invoke StringCopy,addr szAddress,eax
etc


I'd have to add some code to scan back for the zero byte, but that wouldn't be a problem.


Thanks to all,
--Chorus
Posted on 2002-08-06 18:10:14 by chorus
chorus,

I can't see that you will get an algo any faster than scanning it for its length then copying it to another buffer, you must allocate the other buffer first unless you know the maximum size first so there seems to be little point trying to include both operations. An algo that works something like Agner Fogs strlen function and any form of copy you like, from REP MOVSD to SIMD/MMX.

Regards,

hutch@movsd.com
Posted on 2002-08-07 05:56:23 by hutch--
chorus:

Your improvement at the remaining byte handling is valuable. I must be blind I could not see the obvious thing. ;)

stryker:

While you are at using SSE, why not use movntq , mm0? :grin:

Anyhow, a quick test result shows that stryker's first MMX version (with movntq and his own correction) is unbeatable when the string is longer than 100 bytes. I would say I would use stryker's version if I know I'm dealing with mostly long strings, and for general purpose libc stuff, I would stick to my old code with improvement by chorus. :grin:

<EDIT>
I should clarify the quick test result. When there is no x87 instruction in use, stryker's version becomes unbeatable when the string is about 60 bytes long. So, when someone's code does not have any FPU stuff, stryker's version is good for moderate size strings.
</EDIT>
Posted on 2002-08-07 15:50:55 by Starless
While you are at using SSE, why not use movntq , mm0?
yeah I forgot about movntq. :) Thanks for reminding me. :alright:
Posted on 2002-08-07 17:13:02 by stryker