bAZiK, I can think about algos when I do the dishes and laundry as well as when I eat! ;) I do the same things you do all day, and I carry around a little pad of paper to scribble on if I get a bright idea.
Posted on 2002-02-11 10:32:29 by bitRAKE
You proved your way of "or" more efficient.
Though it doesn't increase speed in this particular proc but it's
more economic and with further unrolling could also increase the speed.
I've made some last final brush touches to reduce one dependency and make the proc shorter without needless frame.
I think we've made something much better combining best of the
first procs with best of our ideas:


LngStrLen2 proc arg1 ;this fool arg needed by masm to create @4
;at the end of name in lib
;options at the same time prevent compiler from creating frame
;wich is needless since we can make all nessessery for
;appropriate return without it and without additional opcode

lpString equ [esp+4]
mov ecx,lpString
pxor mm(0),mm(0)
pxor mm(1),mm(1)

@@: pcmpeqb mm(0),[ecx]
pcmpeqb mm(1),[ecx+8]

packsswb mm(0), mm(1) ; por mm0,mm1
add ecx,16

packsswb mm(0), mm(0)

movd eax,mm(0)

test eax,eax

je @B
mov edx,lpString
sub ecx,16
@@: mov al,[ecx]
inc ecx
test al,al
jne @B
sub ecx,edx
lea eax,[ecx-1]
retn 4
LngStrLen2 endp
Posted on 2002-02-11 12:53:59 by The Svin
	sub ecx,16

@@: mov al,[ecx]
inc ecx
test al,al
jne @B
On average the cost of this code is 18 cycles - assuming even distribution of zero. This leaves quite a large area of discovery for a better method - given that all these bytes have already been tested. Under some instances it would be better to favor smaller string lengths, but I'm just curious to find other methods. I will post some follow-up code -- just wanted you to know where my thoughts are.
Posted on 2002-02-11 17:11:28 by bitRAKE
Rake, I respect your moving towords perfection.
Now something to think about:
1. It is called LongStrLen, the name itself tell us that it is design for Long Strings.
Long strings are something lenthy and its nature are subject when optimization
strategy is more important then nice look.
It is issue when overworking time estimation is very important.
When Steve took challenge to make new implementation of BM algo, I wished to
bless his work with my strongest magic spell :)
'Cause his job (and following sort routines) was very important and I could say the
most important job on procs among all he had done before upto that moment in x86.
And he did it. He was a hero :).
But when he did it some people couldn't understand what was the proc for
and reported result of testing that on small strings BM performed slower then
previous InString proc.
But BM wasn't disign for short strings, and when string is short the job is short in other
words - less important. But BM perform extremly better on lenthy strings.
On the job it was dising for.

2. Let's use simple 1st grade arithmetics.
For example we use it in arrays of string => 16 kb.
And you found a way to calculate "tail" much faster, it cost you almost nothing - main loop
was slowed down just by 1 clock.
Now how much faster is your version. Let say current scan make it for ~ 24 clocks and yours
twice faster ! (wich I think just fantstic but I agree to assume it :)
If length of 1 string is ~ 16 kb + tail then proc in avarege makes 1024 iteration and you lost
1024(losses in main loop) - 12 (bonus for fast end) = 1012 clocks.
If it used in query of 50000 records (very usuall task) you lost 50 600 000 clocks only for calculating
string length.

- We shall think of overall time
- this proc is not for strings < 100 bytes. In all strings > 100 bytes it will be faster that all I've seen before.
Posted on 2002-02-11 18:48:12 by The Svin
Yes, could also be stated: as the length of the string approaches infinity the contribution of the tail code to the total approaches zero.
Posted on 2002-02-11 19:09:31 by bitRAKE

Have i missed something here ? I have regularly written procedures with no parameters and passed any data I wanted either in registers or recently, in memory mapped files. Is there a problem with a library module not having any parameters ?

What I have generally found is that if there are no stack locals and no parameters passed on the stack, no stack frame is created at all and you basically have bare bones CALL/RET overhead.

Posted on 2002-02-11 19:48:18 by hutch--
It's a case when we actually want to make proc
1. With parameters passing to it
2. Use the proc with invoke, that comfortable way masm supplys (with checking parameters etc.)
3. But we don't want MASM to create the usual frame for it.
1. and 2. - it's how m32lib procs are usually done and meant to be used.
So the only difference is 3.
Fisrt about why would we want it.
Sometime procs are very simple and don't use the stack at all, so instead of
push ebp
mov ebp,esp

and reffering to parameters as ....

ret sizeofparameters (I write usuall frame masm automatically produces with proc Arg1,Arg2 ... code)

we can use agruments but reffer to them as etc.
and finish with
ret sizeofparameters

The aim is clear - remove 3 needless opcodes:
push ebp mov | ebp,esp | leave

For us to code easier we can declare at the beginning of a proc
lpString equ
Value equ etc.

So we actually aspect agruments, we just want to handle them ourselves without usual frame.
It's easy to prevent MASM from doing by using OPTION PROLOGUE:NONE and OPTION EPILOGUE:NONE
(We can switch it back right after and of proc if we want to)

But problem is how to make MASM check params in invoke when we didn't stated them in proc statment
We actually want masm to check for a programmer if he right appropriate number of params, and
at the same time we didn't state it in "proc" statement.
Masm invoke checking parameters in libs by looking at a name of a proc.
@number at the end of a name.
And when libs are compiled (for example by make in your m32lib) end of name are created according size
of params in proc statement.
For example for
MyProc proc Arg1,Arg2
name will be _MyProc@8
OtherProc proc Arg1
name will be _OtherProc@4
and this @number is used by invoke to check params.
Now we want to reffer to params without ebp+? using esp+? and without making frame
- If we write without params:
MyProc proc
lpString equ
lpValue equ

mov eax,lpValue
lea ecx,lpString
ret 8
MyProc endp
Our proc will make correct return and parameters handling but it cannot be used with
invoke MyProc, some,somes
'cause name for our proc in lib will be _MyProc@0 (actually _MyProc@4 - this is a bug :)
If we write MyProc as

MyProc proc lpString,lpValue

mov eax,lpValue
lea ecx,lpString

it will work incorrect 'cause masm will produce
mov eax,
lea ecx,
but without making frame.

so the way out is:

MyProc proc Arg1,Arg2 ; this force Masm to make name as _MyProc@8
lpString equ ;but actually inside the proc we use this names
lpValue equ

mov eax,lpValue
lea ecx,lpString

If the proc is written this way it can be called usuall way
invoke MyProc,x,y
and all params and return will be rightly handled and masm correctly check params in invoke,
and at the same time we have an advantage of leaving out those three frame opcodes.

I'm not sure if I spelt everything in correct English :)
Posted on 2002-02-11 20:50:27 by The Svin
You do well with your English and ASM Svin.

lpString equ [esp+4]

mov ecx,lpString
pxor MM0,MM0
pxor MM1,MM1

@@: pcmpeqb MM0,[ecx]
pcmpeqb MM1,[ecx+8]

packsswb MM0,MM0
packsswb MM1,MM1

movd eax,MM0
movd edx,MM1

add ecx,16 ; this is faster than LEA on Athlon
or edx,eax

je @B

bsf eax,eax
jne @F

add ecx,8
bsf eax,edx
@@: sub ecx,lpString
shr eax,2
lea ecx,[ecx+eax-16]
ret 4
My timings show that this takes 40 cycles for a 1 byte string, and approaches <0.4 cycles per byte very quickly. This is 6 cycles for the loop. How? I would think 5 cycles. This is fast, but it is better to have a sense of knowing what is happening within the machine for future work!
Posted on 2002-02-11 21:05:28 by bitRAKE
This demostrates that the main loop does take 5 cycles
and I'm not nuts. ;) There is a 6 cycle penalty for being
the second qword.

16319, 6575 ticks

16320, 6576 ticks ...(same timing)
16327, 6576 ticks
16328, 6582 ticks ...
16335, 6582 ticks

16336, 6581 ticks ...
16343, 6581 ticks
16344, 6587 ticks ...
16351, 6587 ticks

16352, 6586 ticks ...
16359, 6586 ticks
16360, 6592 ticks ...
16375, 6592 ticks

16376, 6598 ticks ...
16383, 6598 ticks

...And your proc above:

16319, 6224 ticks

16320, 6170 ticks
16321, 6174 ticks
16322, 6177 ticks
16323, 6182 ticks
16324, 6183 ticks
16325, 6186 ticks
16326, 6187 ticks
16327, 6192 ticks
16328, 6195 ticks
16329, 6212 ticks
16330, 6215 ticks
16331, 6215 ticks
16332, 6218 ticks
16333, 6224 ticks
16334, 6227 ticks
16335, 6230 ticks

16336, 6179 ticks
16337, 6181 ticks
16338, 6183 ticks
16339, 6185 ticks
16340, 6190 ticks
16341, 6192 ticks
16342, 6193 ticks
16343, 6198 ticks
16344, 6201 ticks
16345, 6218 ticks
16346, 6221 ticks
16347, 6224 ticks
16348, 6227 ticks
16349, 6230 ticks
16350, 6233 ticks
16351, 6236 ticks

16352, 6185 ticks
16353, 6187 ticks
16354, 6189 ticks
16355, 6194 ticks
16356, 6196 ticks
16357, 6198 ticks
16358, 6199 ticks
16359, 6204 ticks
16360, 6207 ticks
16361, 6224 ticks
16362, 6227 ticks
16363, 6230 ticks
16364, 6233 ticks
16365, 6236 ticks
16366, 6239 ticks
16367, 6242 ticks

16368, 6191 ticks
16369, 6193 ticks
16370, 6195 ticks
16371, 6200 ticks
16372, 6202 ticks
16373, 6204 ticks
16374, 6205 ticks
16375, 6210 ticks
16376, 6213 ticks
16377, 6230 ticks
16378, 6233 ticks
16379, 6236 ticks
16380, 6239 ticks
16381, 6242 ticks
16382, 6245 ticks
16383, 6248 ticks

Very nice!

I still don't understand the difference in timings. My algo produced very consistent timing, but yours took longer to stablize. There is really something missing from the analysis.
Posted on 2002-02-11 23:51:54 by bitRAKE

thanks for the description, using the stack pointer makes sense to me if the procedure allows it and there is no other stack uage in the proc. I can see the use of this technique for very short algorithms which are called sequentially many times and there are speed advantages in doing so if it clocks up OK.

Posted on 2002-02-12 00:29:39 by hutch--
Here is an image showing the part I don't understand. My algo is doing well,
then it starts taking longer than it should?

This is on an Althon 1.333 Ghz, best time of 1024 reps each string length.
Horitontal = bytes string length (not counting zero byte on end)
Vertical = cycles (best time of 1024 reps)
Everything should be in data/code cache.
Posted on 2002-02-12 00:57:45 by bitRAKE
The cpu is definitely optimizing your algo's loop after some
itterations. It becomes 1 cycle faster than it was initially.
I wonder what the criteria for optimization are?
Posted on 2002-02-12 08:45:45 by bitRAKE
I post it just to let you know that I've read your last posts with data and questions with big interest. It's become deeper and more practical.
As to your questions I have just blurred thoughts about paging,and compering uncached data.
I think algo with one read from memory\other comaperesion with
data would explain more.
I only want you know that I'm working only with P family stone and not a specialist with with different chips.
Are Ticks clocks or millisecs?
Is strings aligned 32? or at least 16?
Posted on 2002-02-12 16:13:14 by The Svin

I think algo with one read from memory\other comaperesion with data would explain more.
I am sorry, but I do not understand this sentence. 'comaperesion'? Do you mean comparison? CMP ;) I will certainly test more to find want is going on. The data should be aligned on a 4K boundary. Ticks are clocks (rdtsc).
Posted on 2002-02-12 16:52:28 by bitRAKE
Hi bitRAKEy, just a question.. how did you produce that graph? Your own program or some standard profiling/benchmarking program?

Posted on 2002-02-13 02:05:07 by Maverick
Posted on 2002-02-13 08:57:45 by bitRAKE


Hmm.. assuming you're not joking.. :tongue:

You mean you collected this data via RDTSC or similar, saved it and then processed it through Excel to display this graph?

Mavereeeek! :eek:

Posted on 2002-02-14 16:08:47 by Maverick
I'm not joking. I've spend a lot of time with Excel.
It's simply cut-n-paste and the graph wizard, big deal. :tongue:
RDTSC is the best I have right now. Might be a good idea to
use the performance counters, but I haven't played with them.
This is under windows, so I don't test pieces of code that take a long time, and I throw out everything but the best time. It is very accurate - as I've compared it to code under my own OS, and the times are the same.
Posted on 2002-02-14 16:14:12 by bitRAKE
You have your ???
Posted on 2002-02-14 16:23:01 by stryker
It's nothing, really ;) I'm still working off a floppy. It's my own version of UUU, but I've deviated quite a bit from their work.
Posted on 2002-02-14 16:28:03 by bitRAKE