It appears that the latency of the movd instruction is what is holding the algorithm back from running faster on the Athlon. Storing to memory isn't any good as there is a 20 cycle stall! Oh, well - now thinking about possible K3D optimizations and cache prefetching.
Posted on 2002-02-15 15:57:02 by bitRAKE
Storing to memory isn't any good as there is a 20 cycle stall!

Have you done a dummy read of the memory region before the main loop?
You see, if the data is not in the cache, the store will go directly to RAM,
and of course there will be a stall.
Forgive me if I didn't understand what you meant by "storing to memory"; in other words, what you were storing to memory and why.
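To show what I mean by reading the region first, here is a rough C sketch, not anyone's actual code. The name touch_region and the 64-byte line size are my assumptions; it reads one byte per cache line so the lines are already in the cache when the timed loop starts.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 64-byte cache lines. Adjust for the CPU being tested. */
enum { CACHE_LINE = 64 };

/* Hypothetical helper: read one byte from every cache line of buf so the
   region is cached before the timed loop. Returns the number of lines read. */
size_t touch_region(const void *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    const volatile unsigned char *p = (const volatile unsigned char *)buf;
    size_t touched = 0;
    for (size_t i = 0; i < len; i += CACHE_LINE) {
        (void)p[i];          /* volatile read: the load is really performed */
        touched++;
    }
    return touched;
}
```

Call it once on the buffer before starting the cycle counter, so the first timed pass does not pay for the cache misses.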

Here is one more version. I'm not sure if it's faster, but at least
it is a little bit shorter:

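      mov eax,lpString        ; eax -> current 16-byte block
      pxor mm0,mm0            ; mm0 = 0
      pxor mm1,mm1            ; mm1 = 0
@@0:  pcmpeqb mm0,[eax+0]     ; bytes equal to 0 become 0FFh
      pcmpeqb mm1,[eax+8]
      packsswb mm0,mm1        ; fold both 8-byte masks into mm0
      add eax,16
      packsswb mm0,mm0        ; fold again: all 16 bytes -> low dword
      movd ecx,mm0
      test ecx,ecx
      je @@0                  ; no zero byte yet (mm0 and mm1 are 0 again)
      sub eax,16+1            ; back to the start of the block, minus 1
@@1:  inc eax
      cmp byte ptr [eax],0    ; byte scan for the terminator
      jne @@1
      sub eax,lpString        ; eax = length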
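
For anyone who wants to check the logic without an assembler, here is a rough C model of the same control flow. It is only a sketch: strlen_blocks and has_zero_byte are names I made up, and a well-known bit trick stands in for pcmpeqb/packsswb. Like the MMX loop, it reads whole 16-byte blocks, so it assumes up to 15 bytes after the terminator are readable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero if any of the 8 bytes of v is zero. This bit trick is a
   stand-in for pcmpeqb + packsswb; it is not what the MMX code does. */
static int has_zero_byte(uint64_t v)
{
    return ((v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL) != 0;
}

/* Hypothetical model of the loop: test 16 bytes per pass, then go back to
   the start of the block that contained the terminator and scan bytewise.
   Assumption: the 15 bytes after the terminator are readable. */
size_t strlen_blocks(const char *s)
{
    assert(s != NULL);
    const char *p = s;
    for (;;) {
        uint64_t lo, hi;
        memcpy(&lo, p, 8);           /* like pcmpeqb mm0,[eax+0] */
        memcpy(&hi, p + 8, 8);       /* like pcmpeqb mm1,[eax+8] */
        p += 16;                     /* like add eax,16 */
        if (has_zero_byte(lo) || has_zero_byte(hi))
            break;
    }
    p -= 16;                         /* back to the start of the block */
    while (*p != '\0')               /* the @@1 byte scan */
        p++;
    return (size_t)(p - s);          /* like sub eax,lpString */
}
```

The buffers passed in need padding after the terminator, for example char buf[32] = "hello", because the block reads run past the end of the string.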
Posted on 2002-02-15 18:08:37 by The Svin
The Athlon has a problem moving data from a 64-bit reg/mem to a 32-bit reg/mem. Internally (register to register) there is a 5 cycle latency, and through memory there is a 20 cycle stall. The loop could be unrolled further to get rid of this stall, but then it would really only be good for very long strings. I'll leave it alone - we have done a great job here. :)
Posted on 2002-02-15 22:42:53 by bitRAKE
I think that even optimization work this good still calls for separating the routines, and the discussions, by target CPU. Then (through an automatic loader or some other solution) the right routine for the host CPU gets called.

This is logical.. because a lot of effort goes into optimization, but then simply having a different CPU can wipe it all out.. or open great new possibilities. In my humble opinion there should be no talk of optimization (at *these* levels) unless we first create sub-discussions for each CPU we want to support (and maybe a generic routine for the other CPUs.. if we really have to).

There are a lot of people on this forum; surely we can cover all the CPU cases, with tests and so on.
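
As a rough illustration of the "automatic loader" idea, here is a minimal C sketch that picks a routine once through a function pointer table. Everything in it is hypothetical: the cpu_kind values, detect_cpu, select_strlen and the stand-in routines are made-up names, and a real loader would identify the processor with CPUID instead of returning a fixed value.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef size_t (*strlen_fn)(const char *);

/* Hypothetical CPU classes; a real list would come out of the sub-discussions. */
typedef enum { CPU_GENERIC, CPU_ATHLON, CPU_KIND_COUNT } cpu_kind;

/* Generic fallback: simply the C library routine. */
static size_t strlen_generic(const char *s) { return strlen(s); }

/* Stand-in for an Athlon-tuned routine (the MMX code would go here). */
static size_t strlen_athlon(const char *s)
{
    const char *p = s;
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}

static const strlen_fn strlen_table[CPU_KIND_COUNT] = {
    strlen_generic,   /* CPU_GENERIC */
    strlen_athlon     /* CPU_ATHLON  */
};

/* Placeholder detection: always reports the generic class.
   Assumption: a real version would query CPUID. */
cpu_kind detect_cpu(void) { return CPU_GENERIC; }

/* Return the routine registered for the given CPU class. */
strlen_fn select_strlen(cpu_kind cpu)
{
    assert((int)cpu >= 0 && (int)cpu < CPU_KIND_COUNT);
    return strlen_table[cpu];
}
```

A program would call select_strlen(detect_cpu()) once at startup and keep the returned pointer, so the choice costs nothing inside the hot loop.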
Posted on 2002-02-16 04:43:50 by Maverick
Maverick, what you say is true, and does improve our return on all this effort. I'm sure all of us keep several versions of algos for a wide range of situations. (I have a Crusoe - which favors small code over speed in most cases! :)).
Posted on 2002-02-16 09:50:51 by bitRAKE
This version is specially tuned for the AMD Athlon and small strings:
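StrLen MACRO lpString:REQ

    LOCAL _0,_1
    mov ecx,lpString        ; lpString must not be eax, ebx, ecx or edx
    pxor MM0,MM0
    pxor MM1,MM1

    mov ebx,16
_0: pcmpeqb MM1,[ecx+8]     ; 0FFh where bytes 8..15 are zero
    pcmpeqb MM0,[ecx]       ; 0FFh where bytes 0..7 are zero

    add ecx,ebx
    packsswb MM1,MM1        ; 8 mask bytes -> 4 (low dword)
    packsswb MM0,MM0

    movd edx,MM1
    movd eax,MM0
    or edx,eax              ; ZF = 1 if no zero byte in all 16

    je _0
    bsf eax,eax             ; ZF = 1 if bytes 0..7 held no zero
    jne _1
    add ecx,8               ; terminator is in bytes 8..15
    bsf eax,edx
_1: sub ecx,lpString
    shr eax,2               ; bit index -> byte index (0..7)

    lea eax,[ecx+eax-16]    ; eax = length
ENDM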
Minimum 22 cycles

- Instructions packaged/aligned to 8 bytes offer highest decode bandwidth.
- Branch targets aligned to 16-byte boundaries
- Use when average string is >32 bytes
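
The bsf / shr 2 step in the macro is easy to misread, so here is a small C model of it. It is a sketch with made-up names (sat_to_s8, zero_mask32, bsf32, first_zero_index) and only mimics what pcmpeqb, packsswb MMn,MMn, movd, bsf and shr do to one 8-byte group. Each pair of compare bytes packs to 7Fh, 80h or 0FFh, whose lowest set bit is bit 0 or bit 7 of the packed byte, so shifting the bit index right by 2 gives the byte index.

```c
#include <assert.h>
#include <stdint.h>

/* Signed saturation of one 16-bit lane to 8 bits, as packsswb does. */
static uint8_t sat_to_s8(int w)
{
    if (w > 127)  return 0x7F;
    if (w < -128) return 0x80;
    return (uint8_t)w;
}

/* Model of pcmpeqb against zero, then packsswb MMn,MMn and movd:
   each pair of input bytes becomes one byte of the 32-bit result. */
static uint32_t zero_mask32(const unsigned char b[8])
{
    uint32_t mask = 0;
    for (int i = 0; i < 4; i++) {
        unsigned word = (b[2 * i] == 0 ? 0x00FFu : 0u)
                      | (b[2 * i + 1] == 0 ? 0xFF00u : 0u);
        int w = word >= 0x8000u ? (int)word - 0x10000 : (int)word;
        mask |= (uint32_t)sat_to_s8(w) << (8 * i);
    }
    return mask;
}

/* Model of bsf: index of the lowest set bit (x must be nonzero). */
static int bsf32(uint32_t x)
{
    assert(x != 0);
    int n = 0;
    while ((x & 1u) == 0) {
        x >>= 1;
        n++;
    }
    return n;
}

/* Index (0..7) of the first zero byte in b, or -1 if there is none. */
int first_zero_index(const unsigned char b[8])
{
    uint32_t m = zero_mask32(b);
    return m == 0 ? -1 : bsf32(m) >> 2;
}
```

For example, a terminator at byte 3 is the high byte of pair 1, which packs to 80h in mask byte 1; bsf gives 15, and 15 shr 2 is 3.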
Posted on 2002-03-09 01:41:17 by bitRAKE