krish,
I know Paul Hsieh's site very well but:
"The most notable content on earlier versions of this page has gone out of date and has now
been superceded by the original author, Agner Fog. I am no longer motivated to keep it up to
date, so the old content has been removed. I would recommend that you visit Agner's own page
on How to optimize for the Pentium microprocessors for the most up to date information on
Pentium Optimization." from http://www.azillionmonkeys.com/qed/p5opt.html
It is old (and slow) stuff:
"A fast implementation of strlen()
Recently, someone wrote to me with the comment that strlen() is a very commonly called function,
and as such was interested in possible performance improvements for it. At first, without thinking
too hard about it, I didn't see how there was any opportunity to fundamentally improve the algorithm.
I was right, but as far as low level algorithmic scrutiny is concerned, there is plenty of opportunity.
Basically, the algorithm is byte scan based, and as such the typical thing that the C version will do
wrong is miss the opportunity to reduce load redundancy.
; compiler
    mov edx,ebx
    cmp byte ptr [ebx],0
    je l2
l1: mov ah,[ebx+1]          ; U
    inc ebx                 ; V
    test ah,ah              ; U
    jne l1                  ; V +1brt
l2: sub ebx,edx
; by Paul Hsieh
    lea ecx,[ebx-1]
l1: inc ecx
    test ecx,3
    jz l2
    cmp byte ptr [ecx],0
    jne l1
    jmp l6
l2: mov eax,[ecx]           ; U
    add ecx,4               ; V
    test al,al              ; U
    jz l5                   ; V
    test ah,ah              ; U
    jz l4                   ; V
    test eax,0ff0000h       ; U
    jz l3                   ; V
    test eax,0ff000000h     ; U
    jnz l2                  ; V +1brt
    inc ecx
l3: inc ecx
l4: inc ecx
l5: sub ecx,4
l6: sub ecx,ebx
Here, I've sacrificed size for performance, by essentially unrolling the loop 4 times. If the input strings are fairly long (which is when performance will matter) on a Pentium, the asm code will execute at a rate of 1.5 clocks per byte, while the C compiler takes 3 clocks per byte. If the strings are not long enough, branch mispredictions may make this solution worse than the straight forward one.
Update!
While discussing sprite data copying (see next example) I realized that there is a significant improvement for 32-bit x86's that have slow branching (P-IIs and Athlon.)
; by Paul Hsieh
    lea ecx,[ebx-1]
l1: inc ecx
    test ecx,3
    jnz l3
l2: mov edx,[ecx]           ; U
    mov eax,07F7F7F7Fh      ; V
    and eax,edx             ; U
    add ecx,4               ; V
    add eax,07F7F7F7Fh      ; U
    or eax,edx              ; U
    and eax,080808080h      ; U
    cmp eax,080808080h      ; U
    je l2                   ; V +1brt
    sub ecx,4
l3: cmp byte ptr [ecx],0
    jne l1
    sub ecx,ebx
I think this code will perform better in general for all 32 bit x86s due to less branching.
16 bit x86's obviously can use a similar idea, but it should be pretty clear that it would
be at least twice as slow. (I'm really starting to like this bit mask trick!)"
from http://www.azillionmonkeys.com/qed/asmexample.html as you suggested
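For anyone following along, the bit mask trick in the quoted update can be sketched in C roughly like this. It is only a sketch, not Paul Hsieh's code: the names has_zero_byte and strlen_mask are my own, and it assumes (as the asm does) that loading the whole aligned 4-byte word containing the terminator is safe, which standard C does not strictly guarantee.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero when at least one byte of w is 0x00.
   (w & 0x7F7F7F7F) + 0x7F7F7F7F sets bit 7 of every byte whose low
   7 bits are nonzero; OR-ing w back in covers bytes with bit 7 set.
   After masking, a byte lacks bit 7 only if it was zero. */
static int has_zero_byte(uint32_t w)
{
    uint32_t t = (w & 0x7F7F7F7Fu) + 0x7F7F7F7Fu;
    t = (t | w) & 0x80808080u;
    return t != 0x80808080u;
}

/* Hypothetical name; same structure as the quoted asm. */
size_t strlen_mask(const char *s)
{
    const char *p = s;

    /* Byte scan until p is 4-byte aligned. */
    while (((uintptr_t)p & 3u) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    /* Word scan: one load and one branch per 4 bytes. */
    for (;;) {
        uint32_t w;
        memcpy(&w, p, sizeof w);   /* aligned 4-byte load */
        if (has_zero_byte(w))
            break;
        p += 4;
    }

    /* Find the terminator inside the last word. */
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}
```

The per-byte adds never carry into the next byte (0x7F + 0x7F = 0xFE), which is why all four bytes can be tested with a single compare.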
It is old (and slow), so search the board!
Dear arkadash Vortex,
"a) To get short code: use the C run-time DLLs."
Are you sure? Please post the code so we can see it...
"b) To get speedy code: use optimized algos."
Please post that code too...
Regards,
Lingo
Vortex,
It might just be me, but programming in assembly and then using a C library isn't high on my list, even though it might be easier.
Lingo,
The first version of the library is going to be really simple, so that people can learn assembly from it.
The best and fastest strlen is not to use strlen at all...
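One way to read that remark: if the length is recorded when the string is created, later length queries never need a scan. A minimal sketch in C, where counted_str and its two helpers are hypothetical names made up for illustration, not anything from this thread or from a library:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical counted-string type: the length travels with the pointer. */
typedef struct {
    const char *data;
    size_t      len;
} counted_str;

/* Pay for the scan once, when the string enters the program. */
counted_str counted_from_cstr(const char *s)
{
    counted_str cs = { s, strlen(s) };
    return cs;
}

/* Later length queries are a field read, not a byte scan. */
size_t counted_len(counted_str cs)
{
    return cs.len;
}
```

Whether this pays off depends on how often the same string's length is requested; for a string measured only once it saves nothing.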
Dear arkadash Lingo12,
You want to see some examples of using C run-time DLLs?
http://www.asmcommunity.net/board/index.php?topic=10168
http://www.asmcommunity.net/board/index.php?topic=9510
http://www.asmcommunity.net/board/index.php?topic=9520
About optimised code, you can read this post:
http://www.asmcommunity.net/board/showthread.php?s=&postid=96315.msg96315
Don't forget, amigo: think twice. :)