This is my first post here, so a quick introduction: Hi.

Okay,
I know very little about optimisation, and I haven't coded pure intel assembly since the days of real mode and 286s, so pardon my ignorance here...

I'm just curious how a strlen() implimentation using REPNZ SCASB would fare in a benchmark. I'm guessing not very well.


-ib
Posted on 2002-03-10 06:50:09 by iblis
hi!

I've just improved my algorithm a bit (it uses MMX now).
It's 2 times as fast as my first algorithm on long strings (at least on my P2). It will use the "old" algorithm if the processor does not support MMX ... so it will ever go the fastest way. :-)



strlen proc lpString:DWORD

;-------------------------------------------------------------------
; Count the characters of a null terminated string.
;-------------------------------------------------------------------
; This procedure was written by Jens Duttke
;----------------------------------------------------------------
; The algorithm use the MMX technology. That means you need to use
; .MMX and ML /Cp to compile the routine.
; If the processor does not support MMX, the routine will use a
; very fast algorithm to scan for the null terminator, and if MMX
; is supported the routine will run more than 100 percent faster,
; on large strings, using this technology.
; NOTE : If you use the FPU, MMX, SSE or SSE2 in your program, you
; should save the state (FXSAVE) of them before you call this
; routine and restore the state (FXRSTOR) after the routine.
;-------------------------------------------------------------------

mov eax, 1 ; request CPU feature flags
cpuid ; CPUID instruction

mov ecx, lpString

test edx, 800000h ; test bit 23 to see if MMX available
jz no_mmx ; jump if no MMX is available
pxor mm0, mm0

@@:
movq mm1, qword ptr [ecx]
movq mm2, qword ptr [ecx + 8]
movq mm3, qword ptr [ecx + 16]
movq mm4, qword ptr [ecx + 24]
movq mm5, qword ptr [ecx + 32]
movq mm6, qword ptr [ecx + 40]

pcmpeqb mm1, mm0
pcmpeqb mm2, mm0
pcmpeqb mm3, mm0
pcmpeqb mm4, mm0
pcmpeqb mm5, mm0
pcmpeqb mm6, mm0

por mm1, mm2
por mm3, mm4
por mm5, mm6
por mm1, mm3
por mm1, mm5

add ecx, 48

packsswb mm1, mm1
movd eax, mm1
test eax, eax
jz @B

sub ecx, 48

emms ; Empty MMX state
no_mmx:

@@:
mov eax, dword ptr [ecx]
add ecx, 4

lea edx, [eax - 01010101h]
xor eax, edx
and eax, 80808080h
and eax, edx
jz @B

bsf edx, eax

sub edx, 4
shr edx, 3

lea eax, [ecx + edx - 4]
sub eax, lpString

ret
strlen endp


PS. Since the algorithm read 48 Bytes "at once", it would be worthless to test it with strings shorter than that. It's done for large strings (above 1KB, or better above 1MB).

Cu, Jens Duttke
----
http://www.emucheater.com
http://cyberpad.psxemu.com
Posted on 2002-03-10 07:19:16 by Jens Duttke
hi!

I've just done some tests, to compare the speed of different algorithms. Agner Fog's, the Byte Scan algorithms, the one from buliaNaza and mine. The tests runs on my Pentium 2 333 MHz.
I tested a string length from 1000 till 100000.
I noticed that bitRAKE tested also an algorithm from The Svin and himself. I wonder where i can get these algorithms (Are these the algorithms from the "LongStringLen" thread ? i tried some of them, but they aren't faster than the byte scanner on my PC, or are they optimized for AMD's Athlon ?)

Cu, Jens Duttke
----
http://www.emucheater.com
http://cyberpad.psxemu.com
Posted on 2002-03-10 14:05:11 by Jens Duttke
Hi Jens,
Yes, the MMX algos came from the LongStringLength thread. I've been optimizing for the Athlon and it's very likely other CPUs will perform differently. As long as your unrolling the MMX for speed, why go half way? Try this:
pxor mm0,mm0

pxor mm1,mm1
pxor mm2,mm2
pxor mm3,mm3
pxor mm4,mm4
pxor mm5,mm5
pxor mm6,mm6
pxor mm7,mm7

@@: pcmpeqb mm0, qword ptr [ecx]
pcmpeqb mm1, qword ptr [ecx + 8]
pcmpeqb mm2, qword ptr [ecx + 16]
pcmpeqb mm3, qword ptr [ecx + 24]
pcmpeqb mm4, qword ptr [ecx + 32]
pcmpeqb mm5, qword ptr [ecx + 40]
pcmpeqb mm6, qword ptr [ecx + 48]
pcmpeqb mm7, qword ptr [ecx + 56]

por mm0, mm1
por mm2, mm3
por mm4, mm5
por mm6, mm7
por mm0, mm2
por mm4, mm6
por mm0, mm4

add ecx, 64
packsswb mm0, mm0
movd eax, mm0
test eax, eax
jz @B

sub ecx, 64
emms ; Empty MMX state
This will tackle a whole cache line at once on the Athlon.
Posted on 2002-03-10 14:29:39 by bitRAKE
hi!

I don't know why, but if I use pcmpeqb with the memory directly the code is 14% slower on my computer ... with movq it's faster ...

Cu, Jens Duttke
----
http://www.emucheater.com
http://cyberpad.psxemu.com
Posted on 2002-03-10 15:16:06 by Jens Duttke
iblis,

Hi and welcome to the forum. REPNZ SCASB will vary a considerable amount from machine to machine. On later Intel hardware it is a long way off the pace and it seems to be because Intel have not bothered to update the older string instructions. Usually incrementing pointers through an index is faster with normal integer instructions.

The MMX examples take advantage of being able to handle larger blocks of data at a time and it shows with the benchmarks if the machine is compatible with MMX.

Regards,

hutch@movsd.com
Posted on 2002-03-10 15:17:43 by hutch--
Thanks! I think it's interesting how one has use little tweaks and convoluted coding methods in order to get a CPU to do what you want.

Typically I'm a C person, but will use inline asm when I'm bored. ( MS's __declspec( naked ) is a great directive ;) ) My only gripe about assembler is that it's not very portable.

Because I don't know any better, if I had to write my own asm version of strlen it'd probably look something like:

--------------------------------
strlen proc strPtr:DWORD

mov edi, strPtr
mov ecx, 0FFFFFFFFh
xor eax, eax
repnz scasb
mov eax, ecx
not eax
dec eax
ret

strlen endp
--------------------------------

It might be slow, but it's relatively small. (I prefer size to speed)
Er.. well.. I *think* that would work, for strings smaller than 4gb anyway. For Windows you would probably have to push edi then pop it before ret. I think I remember reading somewhere that Windows might become unstable if you modify edi, esi, and ebx. I don't know if that's true.

Great forum you have here! From what I've seen it's fairly civil. Most programming forums and chatrooms I've come across are filled to the brim with "31337 h4x0rz" who are constantly fighting over why Visual Basic is the best language, or how to "m4ke a k3wl ICMP bomb0r pr0ggy" or something stupid like that.

-ib
Posted on 2002-03-10 23:57:55 by iblis
iblis,

This is why we have a "crusades" forum so that anyone can put forward any crackpot theory they like with more or less impunity. It keeps the invective out of the main forum and specialised areas most of the time and keeps everyone happy as well.

Glad you like the place, we are lucky we have a secret formula for membership, only HUMAN BEINGS are allowed to join so it works well most of the time.

Regards,

hutch@movsd.com
Posted on 2002-03-11 03:20:52 by hutch--
hi!

I've tried the different algorithms at work now, on a P3 800 MHz, and it still seems that my algo's are faster compared to Agners. (The MMX algo is even the fastest.) I don't know why bitRAKE, get such a "bad result" with them ... maybe it's because the code works slower on the Athlon ?!?



Cu, Jens Duttke
----
http://www.emucheater.com
http://cyberpad.psxemu.com
Posted on 2002-03-11 03:21:49 by Jens Duttke
Jens,

It usually has to do with processor variations. BiTRAKE mainly writes for his AMD box which has different characteristics to later Intel hardware.

I have found this myself when I try to get algorithms to work across different processors. You can get something to work fine on one but it performs badly on some others and what you have to do to get code to work OK on all of them is to keep testing variations and select the best averages.

ts one of the things that drives you NUTZ.

Regards,

hutch@movsd.com
Posted on 2002-03-11 03:40:07 by hutch--
hii

about buliaNaza's code



mov eax, [edx] ;u get a dword (buffer is aligned)
add edx, 4 ;v ready for next
mov ecx, eax ;u save it in ecx
sub eax, 1010101h ;v sub 1 from each byte in eax
and eax, 80808080h ;u test sign
jz SLloop



to


mov ecx,[edx]
add edx,4
lea eax,[ecx-1010101h]
and eax, 80808080h
jz SLloop

Posted on 2002-03-11 03:44:48 by eko
Wow looks like my algo really is the slowest. :x I thought it would at least be a tad faster than lstrlen.

Well that's good to know should I ever come across a situation where I need to repetitively find the length of a 90mb string.

Interesting stuff.
Posted on 2002-03-11 06:23:55 by iblis
Hmm... I think I use buliaNaza's Algo for my SHA code.
It's not the fastest, but I want to understand the code I use.
And I don't know any MMX stuff :(

regards,
bAZiK


P.S: buliaNaza, I'll credit you in the SHA source and my mail client where I use this code. Thanks again for sharing!
Posted on 2002-03-11 06:32:02 by bazik
Hmm... I think I use buliaNaza's Algo for my SHA code.
It's not the fastest, but I want to understand the code I use.


amm... on my pc buliaNaza's run as fast as anger , and jens codes
but with afew fixes



;use mov esi,offset string
;call strlen

strlen proc
xor edx,edx
@@:
mov ecx,[edx+esi]
add edx,4
lea eax,[ecx-1010101h]
and eax, 80808080h ;u test sign
jz @B
;v
test al, 80H ;u is zero?
jnz C_minus4 ;v
test ah, 80H ;u is zero?
jnz C_minus3 ;v
and eax, 0800000h ;u is zero?
jnz C_minus2 ;v
lea eax, [edx-1]
ret ;
C_minus2: ;
lea eax, [edx-2] ;u eax= length of string
ret ;
C_minus3: ;
lea eax, [edx-3] ;u eax= length of string
ret ;
C_minus4: ;
lea eax, [edx-4] ;u eax= length of string
ret
strlen endp


EDIT: MORE CHANGES to the algo
Posted on 2002-03-11 07:01:32 by eko



amm... on my pc buliaNaza's run as fast as anger , and jens codes
but with afew fixes



Hmm... ok... looks nice :)
I can't test it here at work, but I'll do at home.
Seems like I need to credit you both, guys! :alright:
Posted on 2002-03-11 07:09:14 by bazik
hi!

eko, compared to the other non-MMX-algorithms yours (and buliaNAZA's), seems to be the fastest on long strings, at least on my P2 (the picture).

------

I've improved my speed-test-routine and get a much more exact result now.

All in all the result of the test ony my P2 333 MHz (128 MB RAM) with an 6000 Char-String is :

1. Jens (MMX)
2. bitRAKE (MMX)
3. The Svin (MMX)
4. buliaNaza & eko
5. Jens
6. Agner Fog
7. BScan 2
8. BScan
9. lstrlen
10. iblis

the result of a 160 char-string (i think that's the average for "normal" use) :

1. Jens
2. bitRAKE (MMX) / Jens (MMX)
3. buliaNaza & eko
4. Agner Fog
5. The Svin (MMX)
6. BScan / BScan 2
7. lstrlen
8. iblis

btw. the horizontal bar is the string length (i tested from 5 to 6004 bytes), the vertical bar are the required time.

Cu, Jens Duttke
----
http://www.emucheater.com
http://cyberpad.psxemu.com
Posted on 2002-03-11 08:42:18 by Jens Duttke
hi!

I've done a new algorithm (based on my first one, without MMX), which use an really unusual system. The speed of the algorithm depends on the content of the string. For example if the complete string contains only numbers, the speed of the routine is faster than my MMX code (atleast on my P2). But if the string contains mostly different bytes the algorithm will be damn slow.
Maybe if someone just need to check a 1000 byte string, which contains only numbers or uppercase letters, this routine could be useful. ;)
For all others it's most likely damn useless.
(Maybe anyone has an idea to improve it.)



strlen proc lpString:DWORD

;-------------------------------------------------------------------
; Count the characters of a null terminated string.
;-------------------------------------------------------------------
; This procedure was written by Jens Duttke
;-------------------------------------------------------------------
; This routine uses an unusual algorithm which read 16 bytes
; "per cycle" from the memory and check it for the null-
; terminator. The speed depends on the size and on the content of
; the string. On large strings (> 1000 KB) with "mostly the same"
; content (letters, numbers ...) it's faster than my MMX algorithm,
; but slightly slower on very short strings with different content.
;-------------------------------------------------------------------

mov ecx, lpString

@@:
mov eax, dword ptr [ecx]
and eax, dword ptr [ecx + 4]
and eax, dword ptr [ecx + 8]
and eax, dword ptr [ecx + 12]
add ecx, 16

lea edx, [eax - 01010101h]
xor eax, edx
and eax, 80808080h
and eax, edx
jz @B

sub ecx, 16

;--------
mov eax, dword ptr [ecx]
add ecx, 4

lea edx, [eax - 01010101h]
xor eax, edx
and eax, 80808080h
and eax, edx
jnz @F
;--------
mov eax, dword ptr [ecx]
add ecx, 4

lea edx, [eax - 01010101h]
xor eax, edx
and eax, 80808080h
and eax, edx
jnz @F
;--------
mov eax, dword ptr [ecx]
add ecx, 4

lea edx, [eax - 01010101h]
xor eax, edx
and eax, 80808080h
and eax, edx
jnz @F
;--------
mov eax, dword ptr [ecx]
add ecx, 4

lea edx, [eax - 01010101h]
xor eax, edx
and eax, 80808080h
and eax, edx
jz @B
;--------
@@:

bsf edx, eax

sub edx, 4
shr edx, 3

lea eax, [ecx + edx - 4]
sub eax, lpString

ret
strlen endp


Cu, Jens Duttke
----
http://www.emucheater.com
http://cyberpad.psxemu.com
Posted on 2002-03-11 09:04:32 by Jens Duttke
here is another version . i must go . so i dnot have time to check it




StrLen proc lpString:DWORD
xor edx,edx
@@:
; add edx,4
add edx,4
mov ecx,[edx+esi-4]
mov ebx,[edx+esi] ; i need to to this ever time expect the last , so at least i dont have a stall i think
lea eax,[ecx-1010101h]
and eax, 80808080h
jnz @F
lea eax,[ebx-1010101h]
and eax, 80808080h
jz @B

add edx,4
@@:
;v
test al, 80h ;u is zero?
jnz C_minus4 ;v
test ah, 80h ;u is zero?
jnz C_minus3 ;v
and eax, 00800000h ;u is zero?
jnz C_minus2 ;v
lea eax, [edx-1]
ret ;
C_minus2: ;
lea eax, [edx-2] ;u eax= length of string
ret ;
C_minus3: ;
lea eax, [edx-3] ;u eax= length of string
ret ;
C_minus4: ;
lea eax, [edx-4] ;u eax= length of string
ret

StrLen endp

Posted on 2002-03-11 11:46:32 by eko
Hi, eko
I agree with you because I have the same variants but:

THIS IS A NON WORKING VARIANT FOR THE STRING WITH ASCII CHARACTERS = OR > 80h!!!

So, it is easy to cut the code to improve the speed but you lose the universality...
Posted on 2002-03-11 15:50:02 by buliaNaza
Yeah I'd be interested to see a super fast .486p strlen that tests exclusively for the null terminator.

I'll give it a shot. It should only take me about 4 years to finish. :x
Posted on 2002-03-11 15:58:26 by iblis