I've been doing a string library and I have a fast MMX lstrlen that I thought somebody might find interesting. Comes in at about 194 clocks for a 512 byte string. For comparison, the masm32 version times out at 672 and the lstrlen api at 2093.

ALIGN 16

lszlenMMX FRAME pString

mov eax,[pString]
nop
nop ; fill in stack frame+mov to 8 bytes

pxor mm0,mm0
nop ; fill pxor to 4 bytes
pxor mm1,mm1
nop ; fill pxor to 4 bytes

: ; this is aligned to 16 bytes

movq mm0,[eax]
pcmpeqb mm0,mm1
add eax,8
pmovmskb ecx,mm0
or ecx,ecx
jz <

sub eax,[pString]

bsf ecx,ecx
sub eax,8
add eax,ecx

emms

RET
ENDF
Posted on 2004-04-14 02:32:29 by donkey
As usual to speed it even more, you have to unroll and perhaps do some prefetching. Anyway here's some codes that did the unrolling and a interesting snipplet for strlen that made used to sse by nexo.

http://www.asmcommunity.net/board/index.php?topic=4058&perpage=15&highlight=strlen&pagenumber=9
Posted on 2004-04-14 08:05:04 by roticv
MMX was looked at extensively in this thread.

http://www.asmcommunity.net/board/showthread.php?threadid=3525

...how does your algo compare?
Posted on 2004-04-14 09:01:30 by bitRAKE

MMX was looked at extensively in this thread.

http://www.asmcommunity.net/board/showthread.php?threadid=3525

...how does your algo compare?


Hi BitRake,

The last algo tested in the thread is 26 clocks slower than mine on 512 characters, I tested with this algo on a P3:

StrLen FRAME lpString

uses ebx

mov ecx,[lpString]
pxor MM0,MM0
pxor MM1,MM1

mov ebx,16
ALIGN 16
0: pcmpeqb MM1,[ecx+8]
pcmpeqb MM0,[ecx]
nop

add ecx,ebx
packsswb MM1,MM1
packsswb MM0,MM0

movd edx,MM1
movd eax,MM0
or edx,eax

je 0
bsf eax,eax
jne >1
add ecx,8
bsf eax,edx
1: sub ecx,[lpString]
shr eax,2

emms
lea eax,[ecx+eax-16]
RET
ENDF


Result was 221 clocks, string aligned and algo aligned at 16, mine was 195 clocks. Not really an official test but close enough for my purposes.

My string was :

ALIGN 16

LongString DB 512 DUP ("A")
DB 0


The first one in the thread performed horribly (LongStrLen by The Svin) but I think there is a problem with my translation (as the Svin is normally very fast) so I will have to check it further.


Timings if anyone is interested:

For 1000 itterations each

String aligned
---------------------------------------- lszLenMMX
Line 220: eax = 194062
Line 234: eax = 194111
Line 248: eax = 194047
Line 262: eax = 194107
---------------------------------------- StrLen
Line 278: eax = 221033
Line 292: eax = 221042
Line 306: eax = 221030
Line 320: eax = 261810
---------------------------------------- LongStrLen
Line 336: eax = 559995
Line 350: eax = 560019
Line 364: eax = 560013
Line 378: eax = 560020
---------------------------------------- lstrlen
Line 412: eax = 2093069
Line 426: eax = 2092999
Line 440: eax = 2093000
Line 454: eax = 2142887

String not aligned
---------------------------------------- lszLenMMX
Line 219: eax = 348083
Line 233: eax = 348062
Line 247: eax = 348068
Line 261: eax = 348076
---------------------------------------- StrLen
Line 277: eax = 376061
Line 291: eax = 376067
Line 305: eax = 376045
Line 319: eax = 376056
---------------------------------------- LongStrLen
Line 335: eax = 687225
Line 349: eax = 687205
Line 363: eax = 687272
Line 377: eax = 687296
---------------------------------------- lstrlen
Line 411: eax = 2093057
Line 425: eax = 4157187
Line 439: eax = 2093013
Line 453: eax = 2107915
Posted on 2004-04-14 10:44:11 by donkey

As usual to speed it even more, you have to unroll and perhaps do some prefetching. Anyway here's some codes that did the unrolling and a interesting snipplet for strlen that made used to sse by nexo.

http://www.asmcommunity.net/board/index.php?topic=4058&perpage=15&highlight=strlen&pagenumber=9


Thanks Roticv,

I was unaware of how to unroll it, I will have to study a bit more I guess, no harm there. I'm trying to make a nice alternative to the masm32 string library functions, both normal and mmx variants. I am learning the MMX instruction set as I go but it is a brain busting process.
Posted on 2004-04-14 10:49:21 by donkey
donkey, that algo was for Athlon. Svin's should perform better as he is coding for P1 MMX CPU. The Athlon's internal handling of MMX is very different than Intel's implementation on P3 or P4. One additional ceavat is they are for very long strings - mine quickly approaches < 0.4 cycles per byte, but you won't see this in 512 bytes.
Posted on 2004-04-14 12:16:34 by bitRAKE

donkey, that algo was for Athlon. Svin's should perform better as he is coding for P1 MMX CPU. The Athlon's internal handling of MMX is very different than Intel's implementation on P3 or P4.


Hi bitRAKE,

yeah I know, that's why I specified that it was running on a P3. I haven't tried it on any other machines yet but I assume it will significantly degrade. I made a minor change, aligning each pxor at the top, it still runs at 194 clocks, there was a minor improvement on the 1000 itterations but less than 1 clock per itteration. Also I wasn't aware that I could use the same source and destination for bsf so I stole that from your code, one less register changed on exit.

This is what I have for the Svin's routine, it looks OK but I obviously missed something...

LongStrLen FRAME lpString

uses edi, esi, ebx
.data
ALIGN 8
mask1 dq ?
mask2 dq ?
.code
xor eax,eax
movq mm1,[mask1] ;put addrs in cache
movq mm2,[mask2]
mov edi,[lpString]
movd mm0,eax
mov esi,offset mask1
: movq mm1,[edi] ;1
movq mm2,[edi+8]
pcmpeqb mm1,mm0 ;2
pcmpeqb mm2,mm0
movq [mask1],mm1 ;3
movq [mask2],mm2
mov eax,[esi] ;4
mov edx,[esi+4]
mov ecx,[esi+8] ;5
mov ebx,[esi+12]
or eax,edx ;6
jne >.calc
or ecx,ebx ;7
lea edi,[edi+16]
je <

.calc: mov al,[esi]
inc esi
test al,al
je .calc
sub esi,offset mask1+17
mov edx,[lpString]
lea eax,[edi+esi]
sub eax,edx
ret
ENDF


These are the results from a length of 8192:

---------------------------------------- lszLenMMX

Line 244: eax = 2599041
Line 258: eax = 2599057
Line 272: eax = 2976620
Line 286: eax = 2599415
---------------------------------------- StrLen
Line 302: eax = 2848044
Line 316: eax = 3239447
Line 330: eax = 2934327
Line 344: eax = 2848041
---------------------------------------- LongStrLen
Line 360: eax = 8698090
Line 374: eax = 8687823
Line 388: eax = 8617627
Line 402: eax = 8214474
---------------------------------------- lstrlen
Line 418: eax = 34025910
Line 432: eax = 33675862
Line 446: eax = 33370283
Line 460: eax = 33644663



I tried a big file (>1MB)

Based on 10 itterations, file was copied to the Heap.

Line 350: [cbFile] = 1257440

---------------------------------------- lszLenMMX
Line 423: eax = 23707720
Line 437: eax = 23741159
Line 451: eax = 23966429
Line 465: eax = 23846091
---------------------------------------- StrLen
Line 481: eax = 23527538
Line 495: eax = 23462042
Line 509: eax = 23891781
Line 523: eax = 23967803
---------------------------------------- lstrlen
Line 539: eax = 80543036
Line 553: eax = 79785431
Line 567: eax = 80412584
Line 581: eax = 78676772
Posted on 2004-04-14 12:25:02 by donkey
Hi roticv,

I tried a prefetch but in all cases it ran slower, I am not completely sure why but I think it has to do with the fact that my loop is designed to be exactly 16 bytes long, any attempt to change the size seems to carry a severe penalty (+ 50 clocks or more).
Posted on 2004-04-14 14:00:25 by donkey
Donkey, what size strings are you aiming for? If it's a replacement for lstrlen, I assume the target is generic ASCIZ strings, with a usually limited length - might be overkill (and perhaps slower?) to use heavily unrolled MMX loops.

If the goal is a high-performance string library, a lot of stuff can be done. I'm thinking in the direction of making a string a structure, rather than just the raw bytes. Fields would be current length, maximum length, and dynamic growing memory buffer (growing in chunks, reallocs are expensive). Here, the 'heavy' string length routines could be used when initializing from a large ASCIZ string.

Of course such a string library won't be as generically usable without the creation of some macros, but if done correctly you can get a lot of flexibility and speed (a dword lookup will is probably a bit hard to beat ;)).

Just some stray thoughts.
Posted on 2004-04-16 22:39:43 by f0dder
Hi f0dder,

The algorithm actually compares favorably as low as 32 characters but it is meant for long strings, over 200 characters. For example counting characters in a text edit buffer. I use Agner's for short strings:

lszLen FRAME pString

uses ebx,edx,ecx

mov eax,[pString]
lea edx,[eax+3]
:
mov ebx,[eax]
add eax,4
lea ecx,[ebx-01010101h]
not ebx
and ecx,ebx
and ecx,80808080h
jz <
test ecx,00008080h
jnz >
shr ecx,16
add eax,2
:
shl cl,1
sbb eax,edx

ret
ENDF
Posted on 2004-04-16 22:50:24 by donkey
Hey guys,
Why are you going mad over a few clock cycles. AFAIK most ppl now have Pentiums. Even if they have less then Pentiums, a few ticks won't count would it?

Thomas Antony
:o :tongue: :rolleyes: :confused: :notsure: :stupid: :)
Posted on 2004-04-17 11:27:13 by thomasantony
thomasantony, it matters if you're dealing with a lot of huge string bulk processing - and even for the small regular strings, it's nice to have some fast code rather than calling kernel32.lstrlena .
Posted on 2004-04-17 11:34:36 by f0dder