Jens has just sent this following algorithm to me among a few other useful ones and I liked the look of it but he has only tested it on early pentiums so far.

I wondered if anyone had the time if they could compare its speed to the version by Agner Fog that is in the MASM32 library and see which is faster on later Intel and AMD processors.

It is presented here as a library module but it would be easy enough to just use it as a procedure for testing purposes.

Regards,

hutch@movsd.com



; #########################################################################

;----------------------------------------------------------------
; This procedure was written by Jens Duttke
;----------------------------------------------------------------

.486
.model flat, stdcall ; 32 bit memory model
option casemap :none ; case sensitive

.code

; #########################################################################

align 8

StrLen2 proc lpString:DWORD

mov ecx, lpString

@@:
mov eax, dword ptr [ecx]
add ecx, 4

lea edx, [eax - 01010101h]
xor eax, edx
and eax, 80808080h
and eax, edx
jz @B

bsf edx, eax

sub edx, 4
shr edx, 3

lea eax, [ecx + edx - 4]
sub eax, lpString

ret
StrLen2 endp

; #########################################################################

end
Posted on 2002-03-07 15:25:01 by hutch--
Nice algo!!!

I have a pretty lame strlen :) . If anybody wants it can have it. It's pretty straightforward.



strlen PROC srcStr:DWORD
xor ecx, ecx
mov esi, srcStr
@@:
mov al, BYTE PTR [esi+ecx]
inc ecx
or al, al
jnz @b
dec ecx
ret
strlen ENDP


Return Value: ECX
Usage: invoke strlen, OFFSET yourstring

I know it's lame but hey!!! ... smile :)


I just read Agner fog's code on the masm32 library. I think I have the "classic" byte scanner. Ooops this is slow :)
Posted on 2002-03-08 00:11:03 by stryker
Nah,

Your scanner looks fine, here is a variation that I think Alex designed, I rewrote it from memory so it may be layed out a bit differently.


; ########################################################################

bscan proc lptext:DWORD

mov eax, lptext

@@:
inc eax
cmp BYTE PTR [eax], 0
jnz @B

sub eax, lptext

ret

bscan endp

; ########################################################################

Now what I really need is someone who has the time to test the new version that Jens wrote against Agner Fog's version on later AMD and Intel hardware.

Regards,

hutch@movsd.com
Posted on 2002-03-08 00:53:58 by hutch--
What kind of test do you want? How about a color graph of
lengths and alignment? Lengths of range [0 - 64K], and
colored lines for alignment [0,1,2,3]. :) The dependencies
look really bad - may as well use MMX...
Posted on 2002-03-08 01:01:26 by bitRAKE
Ricky,

I just set up a benchmark for 2 scanners and the two DWORD versions, Agners and Jens and the results are as below on my PIII 600. I have not tested them on my other boxes yet.


bscan 370 ms 3 instruction loop
bscan2 340 ms unrolled by two
Agner 305 ms original DWORD version
Jens 305 ms new DWORD version

Test piece was a single 52 meg text file with zero written to the last byte of the file.

Regards,

hutch@movsd.com
Posted on 2002-03-08 02:04:56 by hutch--

Nice algo!!!

I have a pretty lame strlen :) . If anybody wants it can have it. It's pretty straightforward.



strlen PROC srcStr:DWORD
xor ecx, ecx
mov esi, srcStr
@@:
mov al, BYTE PTR [esi+ecx]
inc ecx
or al, al
jnz @b
dec ecx
ret
strlen ENDP


Return Value: ECX


You know the standard is to return values in EAX... by returning the value in ECX, you break this standard, that makes it unusable from others langages like C (the program is unable to retrieve the return value).

I submit this version :



strlen PROC srcStr:DWORD
xor eax, eax
mov esi, srcStr
@@:
mov dl, BYTE PTR [esi+eax]
inc eax
or dl, dl
jnz @b
dec eax
ret
strlen ENDP


I think we can try to make it smaller (that is the goal of this implementation) by making the dec eax dissapear.

I will try to post the smallest implementation I can do when I will go home, but Alex's implemention has many advantages (only destroying one register while the one described on this post destroys three of them.).

To me, Jens' implementation is a very good speed/size compromise.
Posted on 2002-03-08 02:45:29 by JCP
Here is the unrolled byte scanner. It is a bit faster than the three instruction version but still short of the two DWORD versions of Agner and Jens.


; ########################################################################

bscan2 proc lptext:DWORD

mov eax, lptext

@@:
; --------------------------
add eax, 2
cmp BYTE PTR [eax-2], 0
jz lbl1
cmp BYTE PTR [eax-1], 0
jnz @B
; --------------------------
jmp @F
lbl1:
sub eax, 1
@@:

sub eax, lptext
dec eax

ret

bscan2 endp

; #########################################################################

Regards,

hutch@movsd.com
Posted on 2002-03-08 03:00:13 by hutch--
Nice new one !

I have a question :

lbl1:
sub eax, 1
@@:


Is there a particular reason to not use dec eax instead ?

If your benchmarking tool works as the BM one, please can you measure the performance of the MS lstrlen API ? Not just to compare our dicks lenghts against MS one, but to know if there is a REAL performance increase and advantage to use "home-made" function or not. :)

Thanks.
Posted on 2002-03-08 03:34:54 by JCP
Readiosys,

Nope, it should have been a DEC. I had been playing with variations and did not change it back.

Yep, the version on my win95b is a lot slower than the slowest asm version, times are as follows.


bscan 383
bscan2 350
Agner 315
Jens 315
lstrlen 545

Its probably some junk slopped out in C.

Regards,

hutch@movsd.com
Posted on 2002-03-08 05:45:53 by hutch--
hutch,
wich code do you use for timing? Just a GetTickCount or something special?
Posted on 2002-03-08 06:05:01 by bazik
BaZIK,

Yep, I am a real time man with benchmarking and once the algo runs for over 250 ms or so, it becomes a very low error tolerance.

I tested these on a single 52 meg file as a single pass to avoid cache effects looping through the same code.

Regards,

hutch@movsd.com
Posted on 2002-03-08 06:16:39 by hutch--
kernel32!lstrlenA ... first of all, it sets up (and removes) a SEH frame.
Next, it does NULL pointer checking. That's a bit of extra overhead
compared to the more optimized routines. The real bummer, however,
is the fact that lstrlenA used "repne scasb"... and used in a way that
looks more like a poor asm programmer than an optimizing compiler :).
Also remember that lstrlenA cannot afford to use optimizations that
would, for instance, read more bytes than the actual string length,
it has to be very generic.

But why they use "repne scasb" instead of a manual loop... *shrug*.
Posted on 2002-03-08 06:40:47 by f0dder
maybe is old windows code.

the version of strlen that comes with VC++ 6.0 does not use the repne scasb.

So I that makes me think that is old code in windows that no one bothers to rewrite.
Posted on 2002-03-08 08:42:20 by dxantos
Hutch--, the algos are memory bound. With these long timing test, all your doing is testing the efficiency of your P3's ability to load data from main memory into the cache. Usually, the strlen is what loads the data into cache to perform some operation on it. Why else would you want to know the length? It is still important to test both cached and non-cached data. Even running short tests on uncached data should mirror the results of the long tests, and you'll have more percise data, IMO. Is the 52 meg file loaded into memory all at once, or are you letting windows page the file in?
Posted on 2002-03-08 09:51:59 by bitRAKE
hi have an idea



can i replace

	sub	edx, 4

shr edx, 3

lea eax, [ecx + edx - 4]


with


lea eax,[ecx*8 +edx-36] ;2^2(1+2^3)= 36 , 2^3=8
shr eax,3

??
bye
eko
Posted on 2002-03-08 10:19:24 by eko
Nice one eko! :grin:
Posted on 2002-03-08 10:35:09 by bitRAKE
Let's see...
y + (x-4)/8 - 4 = y + (x-4)/8 - 32/8 = y + (x-36)/8 = (8y + x - 36)/8
...so you can.
Nice.
Posted on 2002-03-08 12:51:15 by micmic
But is it faster to do this? Or smaller? Or Both?

:stupid:
Posted on 2002-03-08 14:27:46 by Rennsemmel

But is it faster to do this? Or smaller? Or Both?
:stupid:
Both, but only matters for many small strings, IMO.
I'll steal it for my MMX routine. :)
Posted on 2002-03-08 14:31:40 by bitRAKE
Ricky,

I agree with you that memory access is the wall that any algo runs into but some still go faster than others so its not the only factor.

The file is loaded into OLE string memory which defeats paging and the tests on subsequent algorithms are consistent enough to show that.

The advantage of testing on a single read is one of avoiding cache effects in repeatedly looping through a test piece. The file size is to get the time long enough in millisecond resolution to reduce the error percentage to under a percent or so. Usually 250 to 500 milliseconds will do that.

Eko,

Thanks for the suggestion, I coded it as you wrote it and it works fine so compliments on a good optimisation. It is after the main loop so unfortunately it does not effect the timing but it does make the code a bit smaller and more efficient.

f0dder,

I think you are right but even with SEH and NULL pointer test, a 52 meg file is big enough to cover these effects. I think dxantos got it right that they have not modernised the code and left it with an old string instruction that is about DOS level code.

Regards,

hutch@movsd.com
Posted on 2002-03-08 14:55:48 by hutch--