hi!

This is my "new routine" with the code to align the string (i am sure it's more optimizable) ... but still ... it works :



strlen proc lpString:DWORD
mov eax, 1 ; request CPU feature flags
cpuid ; CPUID instruction

;- Pre-Scan to align the string-start ----
mov ecx, lpString
mov eax, ecx
cmp byte ptr [eax], 0
je done
and ecx, 0FFFFFFF8h
add ecx, 8
sub ecx, eax
cmp ecx, 8
je aligned
@@:
inc eax
cmp byte ptr [eax], 0
je done
dec ecx
jnz @B
aligned:
mov ecx, eax
;-----------------------------------------

test edx, 800000h ; test bit 23 to see if MMX available
jz no_mmx ; jump if no MMX is available
pxor mm0, mm0

@@:
movq mm1, qword ptr [ecx]
movq mm2, qword ptr [ecx + 8]
movq mm3, qword ptr [ecx + 16]
movq mm4, qword ptr [ecx + 24]
movq mm5, qword ptr [ecx + 32]
movq mm6, qword ptr [ecx + 40]

pcmpeqb mm1, mm0
pcmpeqb mm2, mm0
pcmpeqb mm3, mm0
pcmpeqb mm4, mm0
pcmpeqb mm5, mm0
pcmpeqb mm6, mm0

por mm1, mm2
por mm3, mm4
por mm5, mm6
por mm1, mm3
por mm1, mm5

add ecx, 48

packsswb mm1, mm1
movd eax, mm1
test eax, eax
jz @B

sub ecx, 48

emms ; Empty MMX state
no_mmx:

@@:
mov eax, dword ptr [ecx]
add ecx, 4

lea edx, [eax - 01010101h]
xor eax, edx
and eax, 80808080h
and eax, edx
jz @B

bsf edx, eax

sub edx, 4
shr edx, 3

lea eax, [ecx + edx - 4]

done:

sub eax, lpString

ret
strlen endp


I've attached an chart which show the different between the "unaligned" and "aligned" "code".

Cu, Jens Duttke
Posted on 2002-03-12 13:13:15 by Jens Duttke
Hmm, your routine tests for mmx on each call... imo it would be better
to do this check only once (at program startup): make the "strlen"
symbol point to a nop-filled area large enough to hold the largest
strlen algo. At startup time, VirtualProtect this area to enable write,
copy the optimal (based on processor) strlen algo there, and finish
off by VirtualProtect()ing the area back to previous protection flags.
Posted on 2002-03-12 13:19:10 by f0dder
Jens, all those sharp spikes in your graphs are do to windows multi-tasking. I choose the lowest of several runs to minimize this effect. You can see the execution pattern of the code - which is nice, IMHO. Timing short strings is just as effective when you clear the cache - there is no need to wait two minutes to time an algorithm, IMHO.
Posted on 2002-03-12 13:36:29 by bitRAKE
hi!

fodder : I tried it, and your idea seem to improve the speed noticable, thx.

bitRAKE : hmm, that's also a nice idea, i will do that, thx too. :)
btw. i still don't know how to clear the cache ... do you know any way to do that ? (I think I saw some instructions in SSE to do that, but the P2 does not support SSE).

Cu, Jens Duttke
Posted on 2002-03-12 13:52:00 by Jens Duttke

btw. i still don't know how to clear the cache ... do you know any way to do that?
I choose a section of data (the size of the cache) that doesn't have anything to do with what I'm timing, and I load it into cache - this will work on all processors. :)
Posted on 2002-03-12 13:59:20 by bitRAKE
Does that data have to be the same size as your cache? Wouldn't
any data set >= the size of your cache work?
Posted on 2002-03-12 14:07:01 by f0dder
I wrote a little test program for non-MMX strlens. I don't know what Agnor's routine is, so it's not in there. If I left anybody out, post your source, or link to it.

This series of tests was done on an AMD Athlon 750. I'll post my program if anybody wants it.


[size=9]Non MMX strlen tests

Note: Each test is actually performed 16 times per algo and averaged.

--------------------------------------------------------------
[b]Test 1 - Endurance[/b]
String is 16,777,216 bytes long.
--------------------------------------------------------------
1. lstrlen() - 368229.35
2. JensDuttke() - 276212.29
3. iblis() - 300473.56
4.BScanUnroll() - 309765.33
5. buliaNaza() - 289484.76
--------------------------------------------------------------
Best fn:
[b] JensDuttke()[/b] by a difference of 13271.47
--------------------------------------------------------------


--------------------------------------------------------------
[b]Test 2 - Practicality[/b]
String is 65536 bytes long.
--------------------------------------------------------------
1. lstrlen() - 657.06
2. JensDuttke() - 371.23
3. iblis() - 413.82
4.BScanUnroll() - 786.18
5. buliaNaza() - 412.78
--------------------------------------------------------------
Best fn:
[b] JensDuttke()[/b] by a difference of 40.55
--------------------------------------------------------------


--------------------------------------------------------------
[b]Test 3 - Readiness[/b]
String is 256 bytes long.
--------------------------------------------------------------
1. lstrlen() - 6.40
2. JensDuttke() - 5.31
3. iblis() - 5.31
4.BScanUnroll() - 5.33
5. buliaNaza() - 5.32
--------------------------------------------------------------
Best fn: [b]Tie - see results[/b]
--------------------------------------------------------------


--------------------------------------------------------------
[b]Test 4 - Overhead[/b]
String is 0 bytes long.
--------------------------------------------------------------
1. lstrlen() - 3.22
2. JensDuttke() - 4.25
3. iblis() - 3.23
4.BScanUnroll() - 3.21
5. buliaNaza() - 4.25
--------------------------------------------------------------
Best fn:
[b]BScanUnroll()[/b] by a difference of 0.00
--------------------------------------------------------------


--------------------------------------------------------------
[b]Test 5 - Search Stress[/b]
String is 65536 bytes long, functions called 100 times.
--------------------------------------------------------------
1. lstrlen() - 67959.49
2. JensDuttke() - 35184.05
3. iblis() - 41784.55
4.BScanUnroll() - 50482.14
5. buliaNaza() - 41887.96
--------------------------------------------------------------
Best fn:
[b] JensDuttke()[/b] by a difference of 6600.50
--------------------------------------------------------------


--------------------------------------------------------------
[b]Test 6 - Hash Strain[/b]
String is 256 bytes long, functions called 1000 times.
--------------------------------------------------------------
1. lstrlen() - 3113.59
2. JensDuttke() - 1516.76
3. iblis() - 1710.88
4.BScanUnroll() - 2183.48
5. buliaNaza() - 1873.11
--------------------------------------------------------------
Best fn:
[b] JensDuttke()[/b] by a difference of 193.11
--------------------------------------------------------------


--------------------------------------------------------------
[b]Test 7 - Iterative Persistence[/b]
String is 16 bytes long, functions called 1,000,000 times.
--------------------------------------------------------------
1. lstrlen() - 439857.09
2. JensDuttke() - 197054.92
3. iblis() - 156195.21
4.BScanUnroll() - 177383.43
5. buliaNaza() - 187797.31
--------------------------------------------------------------
Best fn:
[b] iblis()[/b] by a difference of 21187.22
--------------------------------------------------------------


--------------------------------------------------------------
[b]Test 8 - Super Test[/b]
String is 1,048,576 bytes long, functions called 100 times.
--------------------------------------------------------------
1. lstrlen() - 2177994.67
2. JensDuttke() - 1677631.94
3. iblis() - 1812873.57
4.BScanUnroll() - 1928963.20
5. buliaNaza() - 1809566.89
--------------------------------------------------------------
Best fn:
[b] JensDuttke()[/b] by a difference of 131935.95
--------------------------------------------------------------



[color=red]Running the same tests for non-aligned data....

--------------------------------------------------------------
[b]Test 1 - Endurance[/b]
String is 16,777,216 bytes long.
--------------------------------------------------------------
1. lstrlen() - 349326.91
2. JensDuttke() - 289607.49
3. iblis() - 269065.58
4.BScanUnroll() - 308322.13
5. buliaNaza() - 288883.21
--------------------------------------------------------------
Best fn:
[b] iblis()[/b] by a difference of 19818.62
--------------------------------------------------------------
--------------------------------------------------------------
[b]Test 2 - Practicality[/b]
String is 65536 bytes long.
--------------------------------------------------------------
1. lstrlen() - 655.97
2. JensDuttke() - 352.01
3. iblis() - 352.05
4.BScanUnroll() - 499.19
5. buliaNaza() - 410.64
--------------------------------------------------------------
Best fn:
[b] JensDuttke()[/b] by a difference of 0.03
--------------------------------------------------------------
--------------------------------------------------------------
[b]Test 3 - Readiness[/b]
String is 256 bytes long.
--------------------------------------------------------------
1. lstrlen() - 6.41
2. JensDuttke() - 4.30
3. iblis() - 5.31
4.BScanUnroll() - 5.33
5. buliaNaza() - 5.32
--------------------------------------------------------------
Best fn:
[b] JensDuttke()[/b] by a difference of 0.01
--------------------------------------------------------------
--------------------------------------------------------------
[b]Test 4 - Overhead[/b]
String is 0 bytes long.
--------------------------------------------------------------
1. lstrlen() - 3.23
2. JensDuttke() - 3.19
3. iblis() - 3.21
4.BScanUnroll() - 3.20
5. buliaNaza() - 3.21
--------------------------------------------------------------
Best fn:
[b] JensDuttke()[/b] by a difference of 0.01
--------------------------------------------------------------
--------------------------------------------------------------
[b]Test 5 - Search Stress[/b]
String is 65536 bytes long, functions called 100 times.
--------------------------------------------------------------
1. lstrlen() - 67293.83
2. JensDuttke() - 33682.14
3. iblis() - 33301.32
4.BScanUnroll() - 51075.24
5. buliaNaza() - 41414.42
--------------------------------------------------------------
Best fn:
[b] iblis()[/b] by a difference of 381.81
--------------------------------------------------------------
--------------------------------------------------------------
[b]Test 6 - Hash Strain[/b]
String is 256 bytes long, functions called 1000 times.
--------------------------------------------------------------
1. lstrlen() - 3267.24
2. JensDuttke() - 1509.32
3. iblis() - 1498.67
4.BScanUnroll() - 2047.98
5. buliaNaza() - 1732.25
--------------------------------------------------------------
Best fn:
[b] iblis()[/b] by a difference of 10.64
--------------------------------------------------------------
--------------------------------------------------------------
[b]Test 7 - Iterative Persistence[/b]
String is 16 bytes long, functions called 1,000,000 times.
--------------------------------------------------------------
1. lstrlen() - 439543.46
2. JensDuttke() - 192216.50
3. iblis() - 151165.83
4.BScanUnroll() - 176716.78
5. buliaNaza() - 183479.47
--------------------------------------------------------------
Best fn:
[b] iblis()[/b] by a difference of 25551.95
--------------------------------------------------------------
--------------------------------------------------------------
[b]Test 8 - Super Test[/b]
String is 1,048,576 bytes long, functions called 100 times.
--------------------------------------------------------------
1. lstrlen() - 2181548.78
2. JensDuttke() - 1808117.34
3. iblis() - 1682792.51
4.BScanUnroll() - 1928479.98
5. buliaNaza() - 1800880.01
--------------------------------------------------------------
Best fn:
[b] iblis()[/b] by a difference of 118087.49
--------------------------------------------------------------
[/color][/size]
Posted on 2002-03-12 14:26:15 by iblis
Here's the program. It's a console app. The .bat simply just runs it, but redirects stdout to a file.
Posted on 2002-03-12 14:30:41 by iblis
hi!

now i know : never say a code isn't more optimizable. :)
I just found a way to optimize my first code (posted by hutch) even more ... it's just one more jz, to split the last AND from the main loop. Compared to Agner Fogs and my "original" routine, this double the speed.



strlen proc lpString:DWORD
mov ecx, lpString

@@:
mov eax, dword ptr [ecx]
add ecx, 4

lea edx, [eax - 01010101h]
xor eax, edx
and eax, 80808080h
jz @B
and eax, edx
jz @B

bsf edx, eax

sub edx, 4
shr edx, 3

lea eax, [ecx + edx - 4]
sub eax, lpString

ret
strlen endp


It would be nice if someone, for whom my first routine worked slower or in the same speed as Agners, could test it again.

Cu, Jens Duttke
Posted on 2002-03-12 14:33:04 by Jens Duttke

Does that data have to be the same size as your cache? Wouldn't any data set >= the size of your cache work?
Of course! You only have to touch a byte to load the whole cacheline, too. I would hope that people would do some research themselves to know the why and how. :)
Posted on 2002-03-12 14:35:13 by bitRAKE
Good :). I was afraid there were some dark and secret reason to
use a dataset the exact size of your cache =). As for cache line size,
do any recent processors have size less than 32bytes? Otherwise
I'll just touch in 32byte increments and wont bother CPUIDing for
cache line size :).
Posted on 2002-03-12 14:40:26 by f0dder
One added note: I had to modify the algos so that they matched the same calling convention as WinAPI lstrlen()... meaning any algos that passed args through registers had to be changed to stack args. This way I could keep a table of procs and just loop through them.

Also because of __stdcall every ret became a ret 4. So stack cleanup may very well have slanted the results. (No, I didn't add a stack enter/leave frame. lpString was referenced in each proc as )
Posted on 2002-03-12 14:44:13 by iblis

I was afraid there were some dark and secret reason to
use a dataset the exact size of your cache =).
There are no secrets - just unread paragraphs of the manual. :grin:
Posted on 2002-03-12 14:50:34 by bitRAKE
Here's the program again with Jens optimisation.


Edit: Added time-critical restraining per Jens' suggestion. Also added Agner Fog's algorithm.


Again, this program is to test the algos under different conditions. That way you can see which works best for large strings, small strings, long iterations, 1 iteration... etc... This should help to do what BitRake suggested and figure out which works best for which situation.

Output is formatted for vBcodes.
Posted on 2002-03-12 14:55:50 by iblis
I wonder what will rip a meg or so of sequential zero seperated strings to pieces in a hurry,


MyString db "short string",0
db "longer string length",0
db "more unaligned zero terminated strings",0
; many more uneven length unaligned strings
db 0,0 ; terminator

Assuming 4 8 16 32 byte alignment is convenient for testing optimum conditions but a bad assumption on messy data that occurs in a form like this.

Regards,

hutch@movsd.com
Posted on 2002-03-12 20:40:45 by hutch--
Assuming 4byte alignment is fine :). You should always get (at least)
that with dynamically allocated memory, and as for your own stuff...
just align the data properly :).
Posted on 2002-03-12 20:45:10 by f0dder
You can trust me, I tried nearly every possible way


No I can not. I don't trust anybody (including myself), but tests.
I've heard those "trust" words from bitRAKE, had I let him go with it he would have not written his intereting code ever.

You obviously didn't check the second idea, it's clear from your post. Shall I do it for you?
Posted on 2002-03-12 21:14:17 by The Svin
f0dder,

==============
Assuming 4byte alignment is fine . You should always get (at least) that with dynamically allocated memory, and as for your own stuff... just align the data properly .
==============

This is fine with simple data where you can align the beginning but the example is one of the ways of storing lists of strings with just a zero as a seperator and a double zero terminator and the contents are of uneven lengths and misaligned.

Now making a DWORD array of pointers is no big deal to do if you can scan the location of each zero and put it in the array but the scan will be unaligned in many of the results.

This is why I always included 2 different algos in the MASM32 library, one to handle aligned data in single long reads and a byte scanner that could handle the messy stuff without endless stalls and a massive performance penalty.

Regards,

hutch@movsd.com
Posted on 2002-03-12 21:33:41 by hutch--
Jens Duttke:
Check it out (your last version with my idea #2)
BTW: for new stounes it may be worthy to align boundry of loop:


strlen proc lpString:DWORD
mov eax, 1
; request CPU feature flags
cpuid
; CPUID instruction
;- Pre-Scan to align the string-start ----
mov ecx, lpString
mov eax, ecx cmp
byte ptr [eax], 0
je done
and ecx, 0FFFFFFF8h
add ecx, 8
sub ecx, eax
cmp ecx, 8
je aligned
@@:
inc eax
cmp byte ptr [eax], 0
je done
dec ecx
jnz @B
aligned: mov ecx, eax
;-----------------------------------------
test edx, 800000h ; test bit 23 to see if MMX available
jz no_mmx ; jump if no MMX is available
pxor mm0, mm0
@@:
movq mm1, qword ptr [ecx]
movq mm2, qword ptr [ecx + 8]
movq mm3, qword ptr [ecx + 16]
movq mm4, qword ptr [ecx + 24]
movq mm5, qword ptr [ecx + 32]
movq mm6, qword ptr [ecx + 40]
movq mm7, qword ptr [ecx + 48] ;!
pcmpeqb mm1, mm0
pcmpeqb mm2, mm0
pcmpeqb mm3, mm0
pcmpeqb mm4, mm0
pcmpeqb mm5, mm0
pcmpeqb mm7, mm0 ;!
pcmpeqb mm6, mm0
pcmpeqb mm0,[ecx+56] ;!
por mm1, mm2
por mm3, mm4
por mm5, mm6
por mm7,mm0 ;!

por mm1,mm3
por mm5,mm7 ;!

por mm1, mm5

add ecx, 64
packsswb mm1, mm1
movd eax, mm1
test eax, eax
jz @B
sub ecx, 64
emms ; Empty MMX state
no_mmx: @@:
mov eax, dword ptr [ecx]
add ecx, 4
lea edx, [eax - 01010101h]
xor eax, edx
and eax, 80808080h
and eax, edx
jz @B
bsf edx, eax
sub edx, 4
shr edx, 3
lea eax, [ecx + edx - 4]
done: sub eax, lpString
ret
strlen endp
Posted on 2002-03-12 21:48:38 by The Svin

No I can not. I don't trust anybody (including myself), but tests.
I've heard those "trust" words from bitRAKE, had I let him go with it he would have not written his intereting code ever.
I also didn't think an unrolled byte scanner would be so good on uncached data - I wrote it, I tested it, wow! I'm using it. Or, rather a general version that scans for the byte in AL.
Posted on 2002-03-12 22:03:36 by bitRAKE