Ok, i finished my strlen routine which use MMX, now.
It include the idea of fodder to test only the first time for MMX, and modify the code, if MMX is available or not.

At the first run, it will be a bit slower, because it need to "initialize the stuff". But after that it will be much faster, than the code which check all the time for MMX.

After a last speed testing on my P2 the result is, that this routine is the fastest for strings below 8 bytes (if the string is not aligned) and above ~95 bytes.
So i think it has also it use for "normal" strings, because the "normal" string length i handle is between 100 bytes and 255 bytes.
Between 10 and ~95 bytes buliaNaza's and my non-MMX-code are at the same speed-level (sometimes buliaNaza's is faster, sometimes mine ... but only a tick).
Below 10 bytes, all non-mmx routines have nearly the same speed.

So here is the code :

strlenEx PROC lpString:DWORD
mov ecx, lpString


mov eax, 1 ; request CPU feature flags
cpuid ; CPUID instruction
test edx, 800000h ; test bit 23 to see if MMX available
jz no_mmx ; jump if no MMX is available
jmp mmx


; Simple byte loop until string is aligned to 8
mov al, byte ptr [ecx]
inc ecx
test al, al
je done
test ecx, 7
jne @B

pxor mm0, mm0

movq mm1, qword ptr [ecx]
movq mm2, qword ptr [ecx + 8]
movq mm3, qword ptr [ecx + 16]
movq mm4, qword ptr [ecx + 24]
movq mm5, qword ptr [ecx + 32]
movq mm6, qword ptr [ecx + 40]

pcmpeqb mm1, mm0
pcmpeqb mm2, mm0
pcmpeqb mm3, mm0
pcmpeqb mm4, mm0
pcmpeqb mm5, mm0
pcmpeqb mm6, mm0

por mm1, mm2
por mm3, mm4
por mm5, mm6
por mm1, mm3
por mm1, mm5

add ecx, 48

packsswb mm1, mm1
movd eax, mm1
test eax, eax
jz @B

sub ecx, 48

emms ; Empty MMX state


mov eax, dword ptr [ecx]
add ecx, 4

lea edx, [eax - 01010101h]
xor eax, edx
and eax, 80808080h
jz @B
and eax, edx
jz @B

bsf edx, eax

shr edx, 3
lea eax, [ecx + edx - 4]

sub eax, lpString


lea eax, [ecx - 1]
sub eax, lpString


flOldProtect dd ?


invoke VirtualProtect, code_base, code_end - code_base, PAGE_EXECUTE_READWRITE, offset flOldProtect
test eax, eax
.if !ZERO?
mov esi, no_mmx_code
mov edi, code_base

mov ecx, code_end - no_mmx_code


rep movsb

invoke VirtualProtect, code_base, code_end - code_base, flOldProtect, offset flOldProtect

mov ecx, lpString
jmp code_base

jmp no_mmx_code


invoke VirtualProtect, code_base, code_end - code_base, PAGE_EXECUTE_READWRITE, offset flOldProtect
test eax, eax
.if !ZERO?
mov esi, mmx_code
mov edi, code_base

mov ecx, code_end - mmx_code


rep movsb

invoke VirtualProtect, code_base, code_end - code_base, flOldProtect, offset flOldProtect

mov ecx, lpString
jmp code_base

jmp mmx_code
strlenEx ENDP

I am not sure if that's a good solution for selfmodifying code, since that's my first "real" selfmodifying code.
Maybe it's the most worst solution out there ? If so, please correct me. :)


If anyone has problems while using it, because of the large amount of readed bytes behind the string-end, it would be nice if you can inform us/me.

I recommend to allocate 48 bytes more memory, than you need, so you are on the secure side, and if you call this routine really often and need the speed, I think these 48 bytes are it worth :)

I attached the result of the test with this routine.

Cu, Jens
Posted on 2002-03-26 11:28:12 by Jens Duttke
Hi Jens,

Is this my first or second variant?
Posted on 2002-03-26 20:51:30 by buliaNaza
Hi Jens,
Please, could you post a graph zoomed on lengths from 0 to ~64?
Posted on 2002-03-26 21:01:14 by Maverick

Is this my first or second variant?

Nope, it was the first one. I havn't seen your second post.
I've tried your new one now, and it's sometimes one tick faster.
So your new code is sometimes one clock faster than mine, and sometimes at the same speed.

Maverick, I attached a test from 1 to 100 bytes.

Cu, Jens
Posted on 2002-03-27 01:34:31 by Jens Duttke
Thank you, Jens.
Posted on 2002-03-27 03:58:37 by Maverick
Again me here make a fool ;)

prefetchnta [eax+64]
movq mm1,qword ptr [eax]
movq mm2,qword ptr [eax+8]
movq mm3,qword ptr [eax+16]
movq mm4,qword ptr [eax+24]
movq mm5,qword ptr [eax+32]
movq mm6,qword ptr [eax+40]
por mm1,mm2
por mm3,mm4
por mm5,mm6
por mm1,mm3
por mm1,mm5
pcmpeqb mm1,mm0
add eax,48
packsswb mm1, mm1 ;;; pmovmskb ecx,mm1
movd ecx, mm1 ;;;
jecxz @B

I am pleased to be useful.
Posted on 2002-04-08 12:44:01 by Nexo
hi guys!
I've been somewhat following this thread and wrote my own strlen algo that I hope you guys will find interesting. It's not mmx, it's just regular integer instructions (I'm running on a AMD k6/2). Anyways, my first couple of tests were promising: it was comparable to Jens original post and Agners (I've been testing it with the slen.zip package hutch posted on page 2)

The thing is ... it looks horrible. I think there's a lot of room for optimization, but I couldn't come up with anything. Take a look:

chorusalgo proc uses ecx lptext:DWORD

mov ecx,lptext
sub ecx,4

add ecx,4
mov eax,
test al,ah
jz maybefoundit
shr eax,16
test al,ah
jnz looper
test ah,ah
jz foundit
test al,al
jnz looper
;fix up ecx to point to actual byte.
;I just test al,al/jz finished/shr eax,8/inc ecx
mov eax,ecx

chorusalgo endp

The idea is to filter out as many possibilities as I can. At the top I move the dword into eax and then test al,ah. If the test is not zero then we *know* neither al or ah is zero. Do the same for the next 16 bits of eax. Now, if the test *is* zero, then maybe one of the two values is zero so we jump to maybefoundit and test each one.

It's a different approach, but there are too many jumps. I'm sure it plays hell on the branch prediction. If anyone has some other ideas I'd be glad to hear 'em. Also, there's lots of forward dependencies cause I use eax throughout the entire thing.

One variation I tried was to bury the mov eax, by changing it into mov edx, and mov eax,edx. The hope was that I could read the *next* four bytes into edx where there was room to execute the instruction and then the mov eax,edx would be cheap at the top. It didn't work out though, but I think the idea has potential.

Thanks all

Posted on 2002-04-09 11:02:12 by chorus

hi guys!
I've been somewhat following this thread and wrote my own strlen algo that I hope you guys will find interesting. It's not mmx, it's just regular integer instructions (I'm running on a AMD k6/2).
AFAIK the K6-2 has MMX support.. so why not use it?
Posted on 2002-04-09 11:47:31 by Maverick
I guess it does...

Actually, I can't say I've used MMX. At all. I didn't think the K6-2 had it so I never bothered looking into it. I guess I will though, but usually when I program I don't assume much more than a PPlain. Just in case someone tries to run my program on a processor that doesn't support certain instructions (for most of my programs I submit my friends to doing testing for me :))

Thanks for the tip though.. I'll check out some more MMX stuff when I can.

Posted on 2002-04-09 11:56:02 by chorus
Code for search of null:

pxor mm7,mm7
@@: movq mm0,[eax+0]
pminub mm0,[eax+8]
pminub mm0,[eax+16]
pminub mm0,[eax+24]
pminub mm0,[eax+32]
pminub mm0,[eax+40]
pminub mm0,[eax+48]
pminub mm0,[eax+56]
pcmpeqb mm0,mm7
add eax,64
pmovmskb ecx,mm0
jecxz @B

It is shorter and faster.
Posted on 2002-04-09 12:47:29 by Nexo

Code for search of null:
It is shorter and faster.
... and doesn't work on older machines like the Pentium, Pentium MMX, Pentium Pro, Pentium 2 (does it work on the P3 ? ), K6/K6-2 (does the Athlon support these instructions ?) and so on.

I think it's a nice idea, but the use is currently not that high, since some (or the most ?) people still use these "old" processors, which does not support these instructions.

Cu, Jens
Posted on 2002-04-10 02:49:35 by Jens Duttke
Jens Duttke, you have forgotten many 386, 486 processors. But it not a motive of a mention of absence for them MMX instructions. This code works since a Pentium III and anyone Athlon processors. These processors exist and have the extended set of instructions which non-use is silly. For what them have added? These are main processes for which demand to create programs customers. These processors have the majority of local friends. If it is not interesting to you, can skip these messages. Or it is better to be limited to a minimum level 386? In general it is easier to me to not represent codes for these processors of times it to nobody interestingly. I for that coders used all possibilities of processors. If has found application of some instructions I am is simple glad to share it with others.
Posted on 2002-04-10 10:37:47 by Nexo

It's nice that you share your code with us, and it's nice that you've done such a optimized code (even if i can't test it, because i am one of the guys who still use a P2 :) ).

And it's also nice that Intel (and Amd) develope new instructions to make it possible to improve the speed of the programs even more.

But you limit the number of people who can use your strlen-routine extremly, because not many people buy everytime the newest computers. I think computers with cpu's with MMX are standard currently.

Cu, Jens
Posted on 2002-04-10 11:45:22 by Jens Duttke
Your last procedure works on processors with MMX and without MMX. To expand it there are no problems ;)
Posted on 2002-04-10 11:52:00 by Nexo

Nexo, your idea with

jecxz @B

is well done, so I've changed my code a bit, to do that on the same way (that saved me one TEST) and it's on larger strings (above 500 bytes), one tick faster.

My old MMX/non-MMX version had also a bug, if VirtualProtect returns 0, it will most likely crash, so that's fixed now too.

I attached it as zip, because the code is a bit large now, to post it here.

Maybe someone has some time (and the processor to test it :) ) and combine Nexo's SSE code with my non-MMX/MMX code, so it will always use the best way.

Cu, Jens
Posted on 2002-04-10 13:16:42 by Jens Duttke
Hi !
When Hutch started this thread he presented Jens algo and in


bsf edx,eax
sub edx,4
shr edx,3

I can't see what's the use of "sub edx,4" .Must be wrong copied.
But I saw even in a Jens post this part of code so please let me now if I got wrong Jens' idea .:confused:
Posted on 2002-04-14 13:04:51 by M.B.
Well there that was a most interesting thread to read. Hundreds of ideas posted and while not knowing completely what was going on i have a better idea of MMX Technology and what not ..

I would love to see a graph with Nexo's code benchmarked besides everybody elses..

and is there a simple way to detect between SSE ?

Is it as easy to determine as it was for the MMX Technology.

I would attempt to put together some of the samples and test them cause i have a 3 different types of processors and with a little convincing i can even round up a pentium 2 i think ;) but i am going to sleep right now dreaming Of ASM Instructions and 2 millisecond optimizations :P

Also I guess These are all for Regular Systems not Unicode ?
Because I would like to have a pure Unicode version of the program i am using. And String Parsing Routines would be very interesting to see :)

Also are there any links where i can read up on the MMX Technology.. the MMX Instructions .. which ones work on both AMD and Pentium and etc...

thank you very much
Posted on 2002-04-17 00:09:05 by Volcano_88101
I have an
Intel Pentium 3 - 700 MHZ
Intel Pentium 2 - 233 MHZ
AMD Athlon XP 1800 - 1.53 GHZ

If somebody has the routine I would gladly benchmark them on my computers.. Would be very interesting to see how they all performed on the differnt platforms
Posted on 2002-04-24 11:01:37 by Volcano_88101
Hmm Okay I have a question.. A While back I saw the code of a program that was for Unicode and Non/Unicode Systems..

Basically what they did is when the program first starts up it detects whether you are a Unicode System or not

And if it isnt then it defines all the functions to point to non/Unicode Functions... else we use Unicode Functions ..

Couldnt we do something like that for the MMX Detection Part?

So It doesnt have to do that every time we call it ? or am I trying to eliminate something that uses up maybe 2 or 3 cycles ?

If so you feel free to kick me
Posted on 2002-04-24 11:08:05 by Volcano_88101
yeah, of course mmx detecting can be done in a one-time-only way :).
You can either do indirect function calling through a pointer, or patch
the code. For something the really matters, I'd probably do the
latter, as it's a tiny bit faster in the end.
Posted on 2002-04-24 11:27:40 by f0dder