I have been working on a string "cut" function that goes byte by byte through a source string and copies these bytes to a destination string until it hits the specified terminator byte or a NULL byte. I personally think that my code is horrific and can be optimized but I cannot think myself through to any better. I am slightly new to programming in the assembly language, so please don't shun me as if I were a drunken guru.

Here is the code that created using MASM syntax/procedures:


    .486
    .model flat, stdcall  ; 32 bit memory model
    option casemap :none  ; case sensitive

    .code


align 4

tString proc src:DWORD,dst:DWORD,ter:BYTE

    push edi ;preserve edi

    xor ecx,ecx ;reset counter
    mov edx, src ;source address
    mov edi, dst ;destination address
    xor eax,eax ; no stall

  @@:
    mov al, ;move byte in edx at offset ecx to low single byte register (of ax of eax)
    cmp al, ter ;check if terminator
    je @B ;if it is terminator then finish up (go to last/previous label)
    cmp al, NULL ;check if null (last chance, no string overflow)
    je @B ;if it is null then finish up (go to last/previous label)
  @@:
    mov , al ;move byte into destination buffer at correct offset
    inc ecx ;increase offset by one
    jmp @B ;repeat back to previous label
  @@:
    mov BYTE PTR , 0  ; write terminator

    pop edi ;restore edi
    inc ecx ;after terminator
    mov eax, src ;move src pointer to eax
    add eax, ecx ;add offset (after terminator) in ecx to eax

    ret ;return :P

tString endp

end

Posted on 2007-01-03 22:01:24 by NegativeNull
Before optimizing, you should add an output-buffer-size parameter and make sure you don't write more bytes than that to output.
Posted on 2007-01-04 04:30:19 by f0dder
Welcome NegativeNull  :)
It looks like you inverted @B and @F in the loop. The two je @B should be je @F.
The code is not horrific at all. I was thinking about using jecxz to remove a comparison in the loop:


tString proc src:PTR BYTE,dst:PTR BYTE,ter:BYTE

    push edi ;preserve edi

    xor ecx,ecx ;reset counter
    mov edx, src ;source address
    mov edi, dst ;destination address
    xor eax, eax ; no stall

@@:
    mov cl,           ;move byte in edx at offset ecx to low single byte register (of ax of eax)
    cmp cl, ter                 ;check if terminator
    je @F                       ;if it is terminator then finish up (go to last/previous label)
    jecxz @F
    mov , cl           ;move byte into destination buffer at correct offset
    inc eax                     ;increase offset by one
    jmp @B                     ;repeat back to previous label
@@:
    mov BYTE PTR , 0  ; write terminator

    pop edi                     ;restore edi
    inc eax                     ;after terminator
    mov eax, src               ;move src pointer to eax
    add eax, eax               ;add offset (after terminator) in ecx to eax

    ret ;return :P
tString endp


But it is just a micro optimization and it looks like it's actually slowing down the code. The next step would be to read four bytes at a time and process them in parallel (like in Intel's optimized strlen algo.)
Posted on 2007-01-04 05:27:24 by Dr. Manhattan

Before optimizing, you should add an output-buffer-size parameter and make sure you don't write more bytes than that to output.


That would defeat my purpose however since all strings it uses are 0-terminator based and the destination buffer is expected to match at least the size of the source buffer automatically. If I added this recommended code this would in fact slow it down and add to the size.


Welcome NegativeNull  :)
It looks like you inverted @B and @F in the loop. The two je @B should be je @F.
The code is not horrific at all. I was thinking about using jecxz to remove a comparison in the loop:


tString proc src:PTR BYTE,dst:PTR BYTE,ter:BYTE

    push edi ;preserve edi

    xor ecx,ecx ;reset counter
    mov edx, src ;source address
    mov edi, dst ;destination address
    xor eax, eax ; no stall

@@:
    mov cl,           ;move byte in edx at offset ecx to low single byte register (of ax of eax)
    cmp cl, ter                 ;check if terminator
    je @F                       ;if it is terminator then finish up (go to last/previous label)
    jecxz @F
    mov , cl           ;move byte into destination buffer at correct offset
    inc eax                     ;increase offset by one
    jmp @B                     ;repeat back to previous label
@@:
    mov BYTE PTR , 0  ; write terminator

    pop edi                     ;restore edi
    inc eax                     ;after terminator
    mov eax, src               ;move src pointer to eax
    add eax, eax               ;add offset (after terminator) in ecx to eax

    ret ;return :P
tString endp


But it is just a micro optimization and it looks like it's actually slowing down the code. The next step would be to read four bytes at a time and process them in parallel (like in Intel's optimized strlen algo.)

I don't believe I did invert them... @B is Previous/Before and @F is Next/Forward, are they not?
If you look at my code the first two jumps are not meant to jump to the "next" label after it, it is meant to jump to the label before it in order, which would be the finalizing/end code. The next jump in the second label returns to the label before which is just a repeat.
The only down side to using JECXZ is that for the jump and comparison the recorded clock speed is far slower than other conditional jumps.

Also in your code there is a slight error...
    inc eax	                    ;after terminator
    mov eax, src               ;move src pointer to eax
    add eax, eax               ;add offset (after terminator) in ecx to eax

If you move src into eax, it replaces the current eax value. Adding eax to eax would merely multiply eax by two.
Posted on 2007-01-04 14:11:39 by NegativeNull

That would defeat my purpose however since all strings it uses are 0-terminator based and the destination buffer is expected to match at least the size of the source buffer automatically. If I added this recommended code this would in fact slow it down and add to the size.

It would add very little to the size - slowdown might be noticable... but what's best, buffer overflows or slightly slower code? :)

But okay, if you're NEVER going to run this on user intput, NEVER going to run it on network data, and NEVER going to run it on data read from files, then it's safe enough.

Even if you use an output buffer that's larger than the input buffer, what happens if the input buffer actually isn't zero-terminated? *boom*

Just some things to keep in mind!
Posted on 2007-01-04 15:20:05 by f0dder


That would defeat my purpose however since all strings it uses are 0-terminator based and the destination buffer is expected to match at least the size of the source buffer automatically. If I added this recommended code this would in fact slow it down and add to the size.

It would add very little to the size - slowdown might be noticable... but what's best, buffer overflows or slightly slower code? :)

But okay, if you're NEVER going to run this on user intput, NEVER going to run it on network data, and NEVER going to run it on data read from files, then it's safe enough.

Even if you use an output buffer that's larger than the input buffer, what happens if the input buffer actually isn't zero-terminated? *boom*

Just some things to keep in mind!


Buffer overflows is best, well in this case at least, seeing as there will be none. This procedure will be used in a case where there will always definitely be a zero-terminator or specified terminator.
The reason is that I already have an optimized function that stores a configuration file in memory and processes each line as a seperate zero-terminated string, this function is just part of the processing. It is a very important function however due to the fact it will be used to cut sections of the string to a buffer using a delimiter and return the position after that "delimiter" in the register EAX for later use.
I understood all the problems of buffer overflows before making this function and they are the least of my worries. I need optimized code that can do this seemingly small but largely important task in minimal time in order to allow my dynamic load library to be better than it's Visual C++ sibling.
Posted on 2007-01-04 15:49:31 by NegativeNull
Good, then it's obvious to everybody that this function shouldn't be a generic-use function, and we can move on to the optimization part :)
Posted on 2007-01-04 16:34:35 by f0dder
Just a quick note that if anyone dislikes programming in MASM and is only avoiding optimization due to this you can feel free to work with FASM or NASM. I would prefer MASM or FASM, but NASM is okay too.
I might be porting my project to FASM if the benchmarks and filesize footprints are near or better than MASM.
Posted on 2007-01-04 17:58:31 by NegativeNull
heres a modified strcpy routine  8)
.686
.MMX
.MODEL FLAT,STDCALL
OPTION CASEMAP:NONE
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
.CODE
;------------------------------------
; if (no Term found) Returns (NULL)
; else Returns (Pointer after Term)
;------------------------------------
;; char *p
;; p = NNStrCpy(&Dest,&Source,',');
;; if (p) {
;; }
; extern "C" char * __stdcall NNStrCpy(char *pszDest,char *pszSource,char chTerm);
NNStrCpy proc pszDest:PTR BYTE,pszSource:PTR BYTE,chTerm:BYTE
$pszDest EQU <>
$pszSource EQU <>
$chTerm EQU <byte ptr >
movzx ecx,$chTerm
mov eax,$pszSource
mov edx,$pszDest
pxor mm2,mm2
movd mm3,ecx
punpcklbw mm3,mm3
punpcklwd mm3,mm3
punpckldq mm3,mm3
@@TestAlignment:
test eax,7
jnz @@UnAlignedOrDoByte
@@AlignedAt8:
movq mm0,
movq mm1,mm0
pcmpeqb mm0,mm3
packsswb mm0,mm0
movd ecx,mm0
movq mm0,mm1
test ecx,ecx; Found Term byte?
jnz @@UnAlignedOrDoByte
pcmpeqb mm0,mm2
packsswb mm0,mm0
movd ecx,mm0
test ecx,ecx; Found NULL byte?
jnz @@UnAlignedOrDoByte
movq ,mm1
add eax,8
add edx,8
jmp @@AlignedAt8
@@UnAlignedOrDoByte:
mov cl,
inc eax
cmp cl,$chTerm
je @@Done
mov ,cl
inc edx
test cl,cl
jnz @@TestAlignment
dec edx
xor eax,eax
@@Done:
mov byte ptr ,0
ret 3*4
NNStrCpy endp

END


if youre strings arent too big maybe an inline proc might be better
__inline char * NNStrCpy(char *pszDest,char *pszSource,char chTerm)
{
__asm
{
mov edx,pszDest
mov eax,pszSource
dec edx
_1: mov cl,
inc eax
cmp cl,chTerm
sete ch
dec ch
inc edx
and cl,ch
mov ,cl
jnz _1
}
}
Posted on 2007-01-04 21:16:55 by drizz
Thank you, drizz! That is some fast code, but sadly it uses MMX. Yes, you may be asking why that is bad since most modern processors support MMX... well with my product I am not exactly sure all of my users will have a modern processor. Of course it is expected that they will, but I am scared to take a chance.
If someone feels it necessary to correct me on my worries and give me strength to use only the MMX version, please do.
Posted on 2007-01-04 21:51:19 by NegativeNull
just to say that it uses plain MMX code not SSEMMX that would require P3 or better,
this code requires PMMX (year 1995 i think) :)
Posted on 2007-01-04 21:59:25 by drizz
Thank you drizz, I do believe you have broken my fears. Wikipedia helped a bit as well.
I am guessing your code will be fast than mine, but it is worth a good benchmark.
Posted on 2007-01-04 22:07:11 by NegativeNull
I doubt that anyone today has a CPU without MMX O_O It's really hard to buy a Celeron (the 'old one', not the one with P4's technology), you know, and even this Celeron supports MMX. I once saw an auction on Allegro where one guy was adding Celerons for free if you bought P4/AthlonXP. It's XXI century ^^ If someone has a CPU without MMX, then they have it because of their own, conscious will ^^
Posted on 2007-01-05 02:43:32 by ti_mo_n
I just can't seem to get the MMX-based code to assemble!
I have MASM 6.14.8444 and I have included the .MMX flag at the beginning of the code but I still get these errors:
speedtest.asm(56) : error A2006: undefined symbol : mm2
speedtest.asm(57) : error A2006: undefined symbol : mm3
speedtest.asm(58) : error A2006: undefined symbol : mm3
speedtest.asm(59) : error A2006: undefined symbol : mm3
speedtest.asm(60) : error A2006: undefined symbol : mm3
speedtest.asm(65) : error A2006: undefined symbol : mm0
speedtest.asm(66) : error A2006: undefined symbol : mm1
speedtest.asm(67) : error A2006: undefined symbol : mm0
speedtest.asm(68) : error A2006: undefined symbol : mm0
speedtest.asm(69) : error A2006: undefined symbol : mm0
speedtest.asm(70) : error A2006: undefined symbol : mm0
speedtest.asm(73) : error A2006: undefined symbol : mm0
speedtest.asm(74) : error A2006: undefined symbol : mm0
speedtest.asm(75) : error A2006: undefined symbol : mm0
speedtest.asm(78) : error A2006: undefined symbol : mm1

Why is this happening?
Posted on 2007-01-05 23:13:23 by NegativeNull
Use upper-cases for the MMx registers...
Posted on 2007-01-06 01:46:38 by f0dder
Thank you, f0dder.
My only problem now is that my test code is crashing. I debugged it and it crashes always on the movq instruction that takes the result and puts it back into the buffer. It says that there is some sort of memory access violation, but I cannot figure out why.
The basic structure of my code after all the includes is as follows:
.data
str1 db "abc-abcd-abcde-abcdef-abcdefg-abcdefgh-abcdefghi,abc-labcd-abcde-abcdef-abcdefg-abcdefgh-abcdefghi,abc-abcd-abcde-abcdef-abcdefg-abcdefgh-abcdefghi",9,"//Comment",0
str2 db 256 dup (0)

.code
invoke NNStrCpy,addr str1,addr str2,","
fn MessageBox,0,str$(str2),"Drizz NNStrCpy",MB_OK 

invoke ExitProcess,0

Now why would there be an access violation of some sort?
Posted on 2007-01-06 02:42:59 by NegativeNull
On my system it doesn't crash, but the MessageBox shows up empty...
Attachments:
Posted on 2007-01-06 02:59:50 by f0dder


invoke NNStrCpy,addr str1,addr str2,","

Now why would there be an access violation of some sort?

NNStrCpy proc pszDest:PTR BYTE,pszSource :) :PTR BYTE,chTerm:BYTE
so the logical output is the empty msgbox.

the source was meant to be compiled and linked as .obj, (with ml 6.15 or above :twisted:)
if youre going to include that source as/in a sourcefile you must put at end:
OPTION PROLOGUE:PROLOGUEDEF
OPTION EPILOGUE:EPILOGUEDEF
so other procedures can have prologue-epilogue code.
also note that my function omits prologue-epilogue code with:
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
see f0dder's file

Posted on 2007-01-06 10:13:52 by drizz
note 2:
performance would be better if both buffers are aligned at 8, ("align N" - command).
if you use any of the memory allocation functions for buffers then you dont have to worry about that, because all of them align at 8 at least.
assuming that the source string is not static.
Posted on 2007-01-06 10:27:22 by drizz

NNStrCpy proc pszDest:PTR BYTE,pszSource :) :PTR BYTE,chTerm:BYTE
so the logical output is the empty msgbox.


That was so incredibly bright of myself to not notice the order in which you made the arguments, ha. Well my code is 400 ticks slower than your code over 1M iterations, so my code is definitely out. Thank you so much drizz, I will be sure to give you some credit for your help.

Oh, and I knew about the prologue-epilogue "issue." That was one of the more obvious things. As for the align, I never knew that. Thank you again for your help, the code I was using was merely a test, in the actual I use the heap allocation methods.
Posted on 2007-01-06 14:22:02 by NegativeNull