Do this mean that the original Jens Duttke StrLen ( the First one that Hutch like ) is the fastest or darn close on a plain 386 - P200..... if i was only using strings between the size of 128 - 1000 bytes. Thats all i will EVER need...

Also where do you get a string speed tester from...
Posted on 2002-03-11 21:59:37 by cmax
Somebody write the fastest for 0-127 characters without the use
of MMX or any >plain586(or 486 ;)) opcodes that doesn't have to
read past the end of the actual string... :)
Posted on 2002-03-11 22:05:07 by f0dder
Ok We need to make a chart for future notice :-)

Fastest Code for Strings 0 - 80 bytes
Fastest Code for Strings 80 - 160 bytes
Fastest Code for Strings 160 - 1000 bytes
Fastest Code for Strings 1000 - 9000 bytes
Fastest Code for Strings 9000 - 100k+ bytes

I guess this may be a stupid questions but can you do somethng like this?

use one string scan technique up to 80 bytes then another one up to 160 bytes then switch to another one if it's over 1000 bytes and so on?

sorta like

cmp ecx, 160
ja SwitchStringScanTechnique

Sliver
Posted on 2002-03-12 00:38:16 by Sliver
hi!

For sure, that would be possible ... but i don't want to flood the server with thousands of chars. :grin:

I got an better idea ... I simply attach my test program, so everybody can use the same test on his computer.
The program will generate a file called table.txt in the program directory.
This file can be easily imported in MS Excel, and used to make charts from it, or compare the data directly.
It will take some seconds/minutes (depends on your computers speed). You should not run any other time-intensive processes in background and you should not move the mouse while it runs.
After the program is done it will show a messagebox "Done !" ... while running it does not show anything.
The program will test a stringlength from 5 to 6004 bytes. I don't think that the result will be another with more than 6000 bytes.

Your computer need to support a high-resolution timer.

Cu, Jens Duttke
----
http://www.emucheater.com
http://cyberpad.psxemu.com
Posted on 2002-03-12 02:02:14 by Jens Duttke

Somebody write the fastest for 0-127 characters without the use
of MMX or any >plain586(or 486 ;)) opcodes that doesn't have to
read past the end of the actual string... :)



Yeah, wouldn't it be possible to generate some kind of memory fault if you try to read 1-3 more bytes than you've alloc'ed?
Posted on 2002-03-12 03:18:29 by iblis
hi!

A friend tested the algorithms on his Athlon 1.4 GHz, 256 MB DDR-RAM.

And the result is this :

160 Byte String :

1. Jens (MMX)
2. bitRAKE (MMX)
3. Agner Fog
4. buliaNaza & eko / The Svin (MMX) / Jens
5. BScan 2
6. BScan
7. lstrlen
8. iblis

6000 Byte String :

1. Jens (MMX)
2. The Svin (MMX)
3. bitRAKE (MMX)
4. Agner Fog
5. Jens
6. buliaNaza & eko
7. BScan 2
8. BScan
9. lstrlen
10. iblis

Seems like fast code on the P2/P3 is also fast on the Athlon ... there aren't much differences in the result.

Cu, Jens Duttke
----
http://www.emucheater.com
http://cyberpad.psxemu.com
Posted on 2002-03-12 05:25:06 by Jens Duttke
Okay I gave it a shot. So far my tests show this one to be pretty fast. On my AMD Athlon 750 I got these arbitrary values:
myNewStrlen()		...5167740


JensNonMMX() ...4985968
ByteScan() ..12023752
REPNZ_SCASB() ..12029381
WinAPIstrlen() ..15450739


To get those values I used QueryPerformanceCounter calls before and after calling each function 1 million times on a 1000 byte string. Something like this:
[size=9]***begin pseudo-code***


char* string = "1000 bytes long.....->"

for( each algorithm )
{
QueryPerformanceCounter();
for( 0 - 1,000,000 )
{
strlen[ algo ]( string );
}
QueryPerformanceCounter();

printf( "Name of Algo:%", (perfcounter2 - perfcounter1) );

}

***end pseudo-code***[/size]



So I didn't calculate ticks/sec or anything. Still my strlen seemed fast.


Here's my new strlen:
[color=blue][b]strlen proc lpString:DWORD

mov ecx, lpString
_slLoopBegin:
mov eax, dword ptr [ecx]
add ecx, 4

test eax, 000000FFh
jz _slJmp3
test eax, 0000FF00h
jz _slJmp2
test eax, 00FF0000h
jz _slJmp1
test eax, 0FF000000h
jnz _slLoopBegin

inc ecx
_slJmp1:
inc ecx
_slJmp2:
inc ecx
_slJmp3:
lea eax, [ecx - 4]
sub eax, lpString
ret
strlen endp[/b][/color]


I don't know very much about optimisation, but I figured test/jmp would probably be a uv pair, since it's such a common thing. Any ideas of how to optimise it?

-ib
Posted on 2002-03-12 07:02:18 by iblis
hi!

Nice routine ... faster than your first one. :grin:
But are you sure it needs only 34257 ? That would be unbelievable fast, compared to the other routines.
I've tested it on my P3 800 MHz, and the speed is between the Byte Scanner and Agners algorithm.

btw. while testing you should also set the process priority to realtime, so that windows does not affect the testing.

Cu, Jens Duttke
----
http://www.emucheater.com
http://cyberpad.psxemu.com
Posted on 2002-03-12 07:51:26 by Jens Duttke
Yeah, Jens. Sorry I got some numbers messed up.


I've edited the post to reflect the new values.
Posted on 2002-03-12 07:59:52 by iblis
iblis,

Compliments, this is a lot better algorithm and its speed seems to be a lot better.

Jens, Iblis is testing on an AMD machine that will give different timings to an Intel machine, especially the later ones.

Regards,

hutch@movsd.com
Posted on 2002-03-12 08:56:24 by hutch--
Now that we have several great algorithms, I think it is imporant to move the discussion on to finding the best general solution. I do not mean to imply that it is a single one of these algorithms, but I would like to reach an understanding of what properties would be favorable for a general solution. Special situations could integrate a more custom string length function into their algorithm, so please try to ignore them for now. In general: What is the average string length that one would call this function with? What part of an algorithm would need a string length function? What are some situations where one would need to know the length of a string?
Posted on 2002-03-12 10:07:32 by bitRAKE
A question about license.

Since the strlen is a function that is used frequently a fast one could really help me to get some cycles back.

However im on a comercial project, thus I wish to know if the authors can give permission to use the functions. I cannot give or promise monetary gain, but I can promise credit in the the credits screens.
Posted on 2002-03-12 10:22:24 by dxantos
dxantos, any code I post here may be used comercially, or otherwise. Credit in the project isn't so important to me right now, but if you could send me a letter on company letterhead that states you've used my code that would be awesome. :grin:
Posted on 2002-03-12 11:17:59 by bitRAKE
Jens Duttke:
I had a look on your unrolled MMX algo.
There are two ways to improve it.
1. you can get rid of 6 pmovq instruction
- Before loop pxor all mmx regs
- In loop compare it to 8 memory locations
advantages:
- 8x8=64 bytes check per iteration
- no need for pmovq inside the loop
collecting the checking do the same way por+packsswb
2. minor change:
you can use mm7 and for parity check mm0 to memory
then por mm0 and mm7 with the rest before packsswb.

I'm sure you can code and test the ideas without my help
Good luck.
Posted on 2002-03-12 11:40:17 by The Svin
Generic strlen... well, most strings I juggle around "generically"
will not even be 256 bytes long. And most strings are pretty short,
probably 32 bytes or less. If you look at any windows app, the situation
will probably be very similar - a lot of rather small strings.

So, for a generic strlen, I don't think it makes sense to optimize for
long string lengths, especially not if makes it slower for short strings
(and of course it makes sense to use another strlen when you know
you will be dealing with large strings). I think read up to 3-4 bytes
too much *can* be acceptable, considering alignment and typical program
layout. For a generic strlen, I don't think MMX should be used - there's
still a few people out there with pplain, and for short strings, would
MMX even be beneficial?

Just my thoughts :)
Posted on 2002-03-12 11:44:35 by f0dder
hi!

dxantos : I'd spend many hours in my code, I would be happy if someone would use it in his project, that show me that my work wasn't useless. :grin:


Now that we have several great algorithms, I think it is imporant to move the discussion on to finding the best general solution. I do not mean to imply that it is a single one of these algorithms, but I would like to reach an understanding of what properties would be favorable for a general solution. Special situations could integrate a more custom string length function into their algorithm, so please try to ignore them for now. In general: What is the average string length that one would call this function with? What part of an algorithm would need a string length function? What are some situations where one would need to know the length of a string?


That's true ... and i found a way to improve the speed of the most algorithms (especially the MMX ones.)
Since some people said, it will extremly improve the speed to align the string, i've done some tests with the different algorithms, and alignments from 4 to 12 ...

The result is ... some algorithms like Agners or the lstrlen algorithm doesn't seems to be affected at all.

Some other algorithms seems be slow downed, if the string address is not dividable through 4, like iblis or my non-MMX algorithm.

And this is "cool" : the MMX algorithms, are much faster if the string is aligned to 8.
bitRAKEs works 37% faster, my MMX code 40%, and The Svins 45%.

Now i wonder, can i except, if the string is aligned to 8, that the code works on all systems faster. Or does that depends on the RAM-type/-speed, processor-type ?

bitRAKE :
I think strlen is required everywhere, where people need to handle strings/user input. Since user input isn't normally longer than 256 bytes, directory names are also not longer than 256 bytes and i can't think of anything else, which could be longer, i would assume, the standard for string lengths is 40 for short strings, 160 for medium strings and 240 for long strings, so a algorithm should be fast in this range. I also think the test of 60 or 100 MB strings is a bit "useless", since i really don't know a usage for strings like this (i think a book-writer will need years to write a text-string which is longer than 10 MB ;) )

Cu, Jens Duttke
Posted on 2002-03-12 11:52:43 by Jens Duttke

Generic strlen... well, most strings I juggle around "generically" will not even be 256 bytes long. And most strings are pretty short, probably 32 bytes or less. If you look at any windows app, the situation will probably be very similar - a lot of rather small strings.
From an algorithmic perspective, are these strings in the cache, or is string length the first operation performed on the string? (generally)
Posted on 2002-03-12 11:55:22 by bitRAKE
hi!

bitRAKE :

I don't know how much my processor can cache (i never thought about this.) but the code do this :

start:
writing the string to the memory
test routine 1
test routine 2
test routine 3
...
increase the length and jump to start

I am not sure if the string is cached ... but it seems like (or what ya think ?)
On the other side, if the processor really cache the string, this will also happen in a "real" program, because if i need the strlen routine, i've most like processed the string with another routine first.
But ... do you know a way to force the processor to "uncache" the string ? I am interested if this will affect the result.

Originally posted by The Svin
Jens Duttke:
I had a look on your unrolled MMX algo.
There are two ways to improve it.


nope, there aren't. :grin:


1. you can get rid of 6 pmovq instruction
- Before loop pxor all mmx regs
- In loop compare it to 8 memory locations
advantages:
- 8x8=64 bytes check per iteration
- no need for pmovq inside the loop
collecting the checking do the same way por+packsswb


I tried to use the pcmpeqb directly with the memory, and this is about 14% slower than my current way. (Don't ask why ... i don't understand that too, but it is ... (atleast on my P2))

I also thought about it to read 64 bytes at once ... but there are 3 things :

1. If the user use very short strings, this will slow down the execution much more, that it speedup the routine on fast strings ... because it need to read 16 bytes more ...
2. I already tried it, and it isn't really faster ... it's nearly the same speed
3. I am not sure, if it is really that good, to read such a high number of bytes past the string-end. Some people don't seems to like the idea to read past the end of the string. (Partial I think these guys are maybe right ... but currently i never had any problems with this.)

You can trust me, I tried nearly every possible way ("wasted" some hours on it) to get the code as fast as possible. And as far as i see there isn't any way to do faster (for me on my P2) (only the stuff which i've written in my last post will improve the speed extremly ... so i will impliment that as soon as i clicked the "Submit" button :) )

Cu, Jens Duttke
Posted on 2002-03-12 12:08:55 by Jens Duttke
Most likely wont be in the cache. Alignment will probably be no better
than 4 bytes - but most strings ought to be 4byte aligned (C compilers
will do it automatically, asm programmers ought to align strings to
4byte boundary).

So, I guess the goal is: fastest strlen for 4byte aligned zero-terminated
strings, string length from 1-256 chars (average around 20-40), reading
maximally 4 bytes past end of string, strings not in cache, compatible
with pplain (though speed optimizations should favour ppro+/athlon).

:)
Posted on 2002-03-12 12:11:44 by f0dder

So, I guess the goal is: fastest strlen for 4byte aligned zero-terminated strings, string length from 1-256 chars (average around 20-40), reading maximally 4 bytes past end of string, strings not in cache, compatible with pplain (though speed optimizations should favour ppro+/athlon).
On the Athlon this critria results in the unrolled bscan8 I posted above, but I don't know how this performs on pplain, P2, P3, P4? Unrolling less/further will move the bumps in the graph around. The P4 has larger penalties for branch miss, so I'm assuming the bumps will be bigger on P4. Alignment will improve preformance based on cacheline size (ie 64 bytes on athlon).
Posted on 2002-03-12 12:22:03 by bitRAKE