It works ~ 2-3 time faster than A.Fog version.
I comment it later:


LongStrLen proc uses edi esi ebx lpString
.data?
ALIGN 8
mask1 dq ?
mask2 dq ?
.code
xor eax,eax
movq mm(1),mask1 ;put addrs in cache
movq mm(2),mask2
mov edi,lpString
movd mm(0),eax
mov esi,offset mask1
@@: movq mm(1),[edi] ;1
movq mm(2),[edi+8]
pcmpeqb mm(1),mm(0) ;2
pcmpeqb mm(2),mm(0)
movq mask1,mm(1) ;3
movq mask2,mm(2)
mov eax,[esi] ;4
mov edx,[esi+4]
mov ecx,[esi+8] ;5
mov ebx,[esi+12]
or eax,edx ;6
jne @calc
or ecx,ebx ;7
lea edi,[edi][16]
je @B

@calc: mov al,[esi]
inc esi
test al,al
je @calc
sub esi,offset mask1+17
mov edx,lpString
lea eax,[edi][esi]
sub eax,edx
ret
LongStrLen endp
Posted on 2002-02-08 09:06:50 by The Svin
movd mm(0),eax ; clear lower 32 bits

should be...

pxor MM0,MM0 ; clear 64 bits


Is not this version faster? ;)
; MMX version by Ryan Mack

; Roughly 13 + 3n + BRANCH clocks on a P-II

const unsigned __int64 STRINGTBL[8] = {0, 0xff,
0xffff, 0xffffff, 0xffffffff, 0xffffffffff,
0xffffffffffff, 0xffffffffffffff}

/* ... */

pxor mm1, mm1
mov ecx, eax
mov edx, eax
and ecx, -8
and eax, 7
movq mm0, [ecx]
por mm0, [STRINGTBL+eax*8]
MAIN:
add ecx, 8
pcmpeqb mm0, mm1
packsswb mm0, mm0
movd eax, mm0
movq mm0, [ecx]
test eax, eax
jz MAIN

bsf eax, eax
shr eax, 2
lea ecx, [ecx+eax-8]
sub ecx, edx
Posted on 2002-02-08 09:58:33 by bitRAKE
Thanks for, freind!
pxor MM0,MM0
Test it.
I check 16 bytes for 8 clocks ;)
Posted on 2002-02-08 10:47:09 by The Svin
I suppose it'd have to be unrolled to beat your algo. :rolleyes:
I assume you know that?
Posted on 2002-02-08 12:02:23 by bitRAKE
The question is HOW to unroll :)
Unroll, test and show the code if it's faster ;)
BTW first time when I use the proc the final code to caculate tail
was mush faster, but in real tasks I used it to calculate lens of
strings wich was a filds in 10000s records with size > 16 kb each (I mean size of string fild not record) and calc of tail takes almost nothing compare for the rest so dispite I'm kin on speed I replaced fast but big final code with simple byte scan.
Why not bsf? We still have a lot of PMMX, bsf works hell of slow on them.
Code you posted is very interesting.
I haven't heard of Ryan Mack before. The code gets my attention to the name. Could you post some more of his code?
Posted on 2002-02-08 12:27:32 by The Svin
to bitRake:
what is in eax?
mov ecx, eax
mov edx, eax
there is now code explaining what value is in eax before this code.
Posted on 2002-02-08 15:31:54 by The Svin
EAX is the address of the string.
Here is the source of the code:
http://www.azillionmonkeys.com/qed/asmexample.html
Posted on 2002-02-08 15:34:07 by bitRAKE
The question is HOW to unroll :)
The obvious way, combine string pointer
increment and TESTs (into OR). ;)

Unroll, test and show the code if it's faster ;)
Why must I show everthing? You don't trust me?
If it is slower, I buy your next bottle of Vodka! ;)

I haven't heard of Ryan Mack before. The code gets my attention to the name. Could you post some more of his code?
Sorry, no more code, but maybe Google would be helpful.
Posted on 2002-02-08 20:19:06 by bitRAKE
I haven't seen anything to belive or not belived yet, just expression of blured feeling that it must be done.
With all do respect the code you showed is nice looking but slow working.
The main difference of algos is calc of the tail of a string.
I don't use any "special" triks in my algo but it has better strategy in disign.
So it can not be beaten by those code without changing strategy in it.
You see, mostly we use unroll to decrease amount of iteration, but main point in my algo was to use it eliminate dependences,
that's why caching in start of algo is done and also very important.
The code you posted main flow is that it is suffering from depedences.
So what you can do? Remove dependences as I did?
But it would be a different in strategy algo which will inherit main idea from main, and so cannot be compared as two original ideas.

As for tail - it will be faster on PIII and slower on PMMX.
As for test - if it had been so easy you would have written test part immediatly when posted the msg.

Complite your idea and post.
I hope you're smart enough to understand that the post is actually sign of respect to you, not anything else but invitation to
proceed to practicall intellectual talks.
Posted on 2002-02-10 13:23:32 by The Svin
Well, here is the inner loop:
	pcmpeqb  mm0, mm1

pcmpeqb mm2, mm1

packsswb mm0, mm0
packsswb mm2, mm2

movd eax, mm0
movd edx, mm2

movq mm0, [ecx]
movq mm2, [ecx+8]

add ecx, 16 ; lea ecx,[ecx + 16]
or eax, edx

jz MAIN
Where are the dependancies?
Gains two cycles over your algo every loop!
I will post follow-up when complete.
Posted on 2002-02-10 15:19:31 by bitRAKE
Now you are talking :)
I'm glad you make help of my idea to brake dependences.
Now I'm waiting how you bypass "or" problem at the end when
there are more than one zero.
Posted on 2002-02-10 17:58:06 by The Svin
;) Well, I take your advise and use the following...
	or BYTE PTR [ecx],0

je _0
or BYTE PTR [ecx+1],0
je _1
or BYTE PTR [ecx+2],0
je _2
or BYTE PTR [ecx+3],0
je _3
...etc.
...favoring shorter strings.
Posted on 2002-02-10 18:11:19 by bitRAKE
Well, then this way is useless all packing.
You can use the same logic with my way to compare and
it will be one command shorter:


;mm(0) = 0
@@: pcmpeqb mm(1), mm0
pcmpeqb mm(2), mm0

por mm(1),mm(2)
add ecx,16

movd eax,mm(1)
movq mm(2),

PSRLQ mm(1),32

movd edx,mm(1)
movq mm(1),

test eax,edx
je @B
Posted on 2002-02-10 19:19:35 by The Svin
test eax,edx 

je @B
This comparison will not work, as anytime eax is zero the branch will be taken. I am assuming you mean "OR". There is still a way to use the tricky "BSF" method.
...

or edx, eax ; must save EAX
jz MAIN
; here EAX is processed first, and only if it's zero is
; EDX processed and 8 is added to the string.
It's tricky to get the tail end processing without branches. I'd like to not use another register, but that seems impossible. We certainly have our work ahead on figuring this one out.
Posted on 2002-02-10 22:33:24 by bitRAKE
You are right about "test".
As to bsf - good luck, but to me it seems more important fast checking that here is no zero in 16 bytes.
I'm glad we proceed on topic with real code, at least I know how to change my own proc :)
Thanks for you agreed to talk by code.
Posted on 2002-02-11 04:28:47 by The Svin
I have another simple idea wich can be used in our algos
we can just comp memory :)


pxor mm(0),mm(0)
pxor mm(1),mm(1)

@@:
pcmpeqb mm(0),[ecx]
pcmpeqb mm(1),[ecx+8]

por mm(0),mm(1)
add ecx,16

movd eax,mm(0)
PSRLQ mm(0),32

movd edx,mm(0)

or edx,eax
jz @B
;bytescan from ecx-16
Posted on 2002-02-11 05:00:47 by The Svin
I tested my new version, it showed high performance ~ 1 clock for
byte lenth, but it is not much faster than my previous version
~11%.
So I doubt if there maybe increasing speed by clocks per iteration,
there is something we miss with our theoretical calcs.
As for the first code you posted(not unrolled) it's far behind all yours and mine versions - as I said - it has flow in strategy - it should have been unrolled and parralelled, in it current state it performce just slightly faster the Fog's one.

There is one of new versions:


LngStrLen2 proc lpString
mov ecx,lpString
pxor mm(0),mm(0)
pxor mm(1),mm(1)

@@: pcmpeqb mm(0),[ecx]
pcmpeqb mm(1),[ecx+8]

por mm(0),mm(1)
add ecx,16

movd eax,mm(0)
PSRLQ mm(0),31

movd edx,mm(0)

or eax,edx
je @B

mov edx,lpString
sub ecx,16
@@: mov al,[ecx]
inc ecx
test al,al
jne @B
sub ecx,edx
emms
lea eax,[ecx-1]
ret
LngStrLen2 endp
Posted on 2002-02-11 07:39:47 by The Svin
Errmm... what do you guys do the whole day?
I would never come to think that much about an algo :)
Posted on 2002-02-11 07:47:51 by bazik
Yes, very well done, Svin. I like the idea of memory compare directly. I expected a greater increase in speed. I assume you are still testing with the same strings length previously used (16K).
Posted on 2002-02-11 09:19:31 by bitRAKE
Maybe, the shift has a high latency? Many operations can be used to test for non-zero. I'm not at home, but this looks good to me. The order isn't extremely important on the Athlon, but you might have to move instructions around a bit for your CPU?
LngStrLen2 proc lpString

mov ecx,lpString
pxor mm(0),mm(0)
pxor mm(1),mm(1)

@@: pcmpeqb mm(0),[ecx]
pcmpeqb mm(1),[ecx+8]

packsswb mm(0), mm(1) ; por mm0,mm1
packsswb mm(0), mm(0)

movd eax,mm(0)
add ecx,16

test eax,eax
je @B

mov edx,lpString
sub ecx,16
@@: mov al,[ecx]
inc ecx
test al,al
jne @B
sub ecx,edx
emms
lea eax,[ecx-1]
ret
LngStrLen2 endp
I'll test this later.
Posted on 2002-02-11 10:11:05 by bitRAKE