I know Paul Hsieh's site very well but:

"The most notable content on earlier versions of this page has gone out of date and has now
been superseded by the original author, Agner Fog. I am no longer motivated to keep it up to
date, so the old content has been removed. I would recommend that you visit Agner's own page
on How to optimize for the Pentium microprocessors for the most up to date information on
Pentium Optimization." from

It is old (and slow) stuff:
"A fast implementation of strlen()

Recently, someone wrote to me with the comment that strlen() is a very commonly called function,
and as such was interested in possible performance improvements for it. At first, without thinking
too hard about it, I didn't see how there was any opportunity to fundamentally improve the algorithm.
I was right, but as far as low level algorithmic scrutiny is concerned, there is plenty of opportunity.
Basically, the algorithm is byte scan based, and as such the typical thing that the C version will do
wrong is miss the opportunity to reduce load redundancy.
; compiler

mov edx,ebx
cmp byte ptr [ebx],0
je l2
l1: mov ah,[ebx+1] ; U
inc ebx ; V
test ah,ah ; U
jne l1 ; V +1brt
l2: sub ebx,edx
; by Paul Hsieh

lea ecx,[ebx-1]
l1: inc ecx
test ecx,3
jz l2
cmp byte ptr [ecx],0
jne l1
jmp l6
l2: mov eax,[ecx] ; U
add ecx,4 ; V
test al,al ; U
jz l5 ; V
test ah,ah ; U
jz l4 ; V
test eax,0ff0000h ; U
jz l3 ; V
test eax,0ff000000h ; U
jnz l2 ; V +1brt
inc ecx
l3: inc ecx
l4: inc ecx
l5: sub ecx,4
l6: sub ecx,ebx

Here, I've sacrificed size for performance, by essentially unrolling the loop 4 times. If the input strings are fairly long (which is when performance will matter) on a Pentium, the asm code will execute at a rate of 1.5 clocks per byte, while the C compiler takes 3 clocks per byte. If the strings are not long enough, branch mispredictions may make this solution worse than the straightforward one.

While discussing sprite data copying (see next example) I realized that there is a significant improvement for 32-bit x86's that have slow branching (P-IIs and Athlon.)
; by Paul Hsieh

lea ecx,[ebx-1]
l1: inc ecx
test ecx,3
jnz l3
l2: mov edx,[ecx] ; U
mov eax,07F7F7F7Fh ; V
and eax,edx ; U
add ecx,4 ; V
add eax,07F7F7F7Fh ; U
or eax,edx ; U
and eax,080808080h ; U
cmp eax,080808080h ; U
je l2 ; V +1brt
sub ecx,4
l3: cmp byte ptr [ecx],0
jne l1
sub ecx,ebx

I think this code will perform better in general for all 32 bit x86s due to less branching.
16 bit x86's obviously can use a similar idea, but it should be pretty clear that it would
be at least twice as slow. (I'm really starting to like this bit mask trick!)"
from as you suggested
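
For anyone who would rather read the bit mask trick in C, here is a minimal sketch of the test used in the second routine above. The names has_zero_byte and strlen_mask are made up for illustration and do not come from the quoted page. Like the asm, it reads whole aligned dwords, so it can touch up to three bytes past the terminator, which standard C does not guarantee to be safe; treat it as an illustration of the idea, not a drop-in replacement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 when at least one of the four bytes in w is zero.
   (b & 0x7F) + 0x7F sets bit 7 of a byte when its low 7 bits are nonzero
   and never carries into the next byte; OR-ing with w also covers bytes
   whose own bit 7 is set. A byte is zero exactly when its bit 7 stays clear. */
static int has_zero_byte(uint32_t w)
{
    uint32_t t = (w & 0x7F7F7F7Fu) + 0x7F7F7F7Fu;
    t = (t | w) & 0x80808080u;
    return t != 0x80808080u;
}

/* Hypothetical name; mirrors the structure of the second asm routine. */
static size_t strlen_mask(const char *s)
{
    const char *p = s;
    assert(s != NULL);

    /* Byte scan until p is 4-byte aligned. */
    while (((uintptr_t)p & 3u) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        ++p;
    }

    /* One load and one branch per dword while no byte is zero. */
    for (;;) {
        uint32_t w;
        memcpy(&w, p, sizeof w);   /* aligned 4-byte load */
        if (has_zero_byte(w))
            break;
        p += 4;
    }

    /* The terminator is somewhere in this dword; find it byte by byte. */
    while (*p != '\0')
        ++p;
    return (size_t)(p - s);
}
```

The point of the mask is that the inner loop has a single branch per four bytes instead of four, which is what the quoted text credits for the gain on CPUs with slow branching.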

Old (and slow), so search the board!!!

Dear arkadash Vortex,

"a) To get short code: use C run-time DLLs."
Are you sure? Please, post the code so we can see it...

"b) To get speedy code: use optimized algos."
Please, post the code too...

Posted on 2003-04-13 14:08:01 by lingo12
It might just be me, but programming in assembly and then using a C library isn't high on my list, even though it might be easier.

The first version of the library is going to be really simple, so that people can learn assembly from it.
Posted on 2003-04-13 18:43:16 by jInuQ
the best and fastest strlen is to not use strlen at all...
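
One possible reading of that remark (an assumption, it is not spelled out in the thread): keep the length next to the data and update it on every change, so nothing ever has to be rescanned. A rough sketch in C, with the made-up names lstr, lstr_init and lstr_append and an arbitrary 64-byte buffer:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical counted string: the length travels with the buffer. */
struct lstr {
    char   buf[64];
    size_t len;      /* number of characters, excluding the terminator */
};

static void lstr_init(struct lstr *s)
{
    assert(s != NULL);
    s->len = 0;
    s->buf[0] = '\0';
}

/* Appends n bytes from src; returns 0 on success, -1 if it would not fit. */
static int lstr_append(struct lstr *s, const char *src, size_t n)
{
    assert(s != NULL && src != NULL);
    if (n >= sizeof s->buf - s->len)   /* keep room for the terminator */
        return -1;
    memcpy(s->buf + s->len, src, n);
    s->len += n;
    s->buf[s->len] = '\0';
    return 0;
}
```

With this layout the length is just s.len, a single load, no matter how long the string is.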
Posted on 2003-04-14 02:16:42 by f0dder
Dear arkadash Lingo12,

You want to see some examples of using C run-time DLLs?

About optimised code, you can read this post:

Don't forget, amigo: think twice. :)
Posted on 2003-04-14 02:44:26 by Vortex