Here is the benchmark prog I used with these different algos. IT MUST have a test file named TEST.TXT in the same directory or it will GP fault on start.

The test file must be a text file, not binary as it scans for the terminating zero. I make the test file by starting with any big text file and copying it onto itself until its large enough to run for 250 to 500 ms.

The test prog appends a zero at the last byte so no other work needs to be done with the test file.

I have tested it on my p4 and it delivers much the same time relationships as my pIII.

I would be interested to see what its does on late model AMD athlons as well as older pentium hardware.

Regards,

hutch@movsd.com
Posted on 2002-03-08 15:52:52 by hutch--
Sorry folks I can't stay long (or maybe it's a happy thing:-)) but this caught my eye...

sub eax, 1


Now from the Intel P4 optimization manual...

ASSEMBLY/COMPILER CODING RULE 41. (M Impact, H Generality) inc and dec instuctions should be replaced by an add or sub instruction because add and sub instructions overwrite all flags.


Going back to the relevant section in the book, this partial flag register update apparently creates a dependency on previous writes to the flag register & can slow things down.

Yes it adds another byte to the code & they claim that you won't see a benefit for PII or before.

I haven't gotten off of my lazy ass to actually test this tho... so who knows maybe they're right.

I'll drop back when I can get back to REAL programming... none of this Perl/SQL junk I'm force-feeding the post-docs.
Posted on 2002-03-08 17:00:54 by rafe
Here's the result on my P3 800mhz. Test.txt is 9 131 110 bytes.


BScan - 60, 60, 60, 50, 70
Unrolled BScan - 50, 60, 50, 50
Agner - 40, 50, 40, 50, 50
Jens - 40, 50, 50, 40, 40
M$ - 70, 90, 71, 80, 71


Agner and Jen's code are performing Head2Head. Can't say which is the fastest. Sometimes Agner performs well with reults on the 40's while Jen's are in the 50's sometimes its the other way around.

I was wondering if we could even go faster with qwords(using MMX)? :)
Posted on 2002-03-08 17:04:37 by stryker

I was wondering if we could even go faster with qwords(using MMX)? :)
That is what ( this ) thread was about.
Posted on 2002-03-08 17:16:21 by bitRAKE
Ooops! :) he! he! he! How about DT in size? :)
Posted on 2002-03-08 17:18:23 by stryker
I use this1:


;Usage: mov esi, offset Buffer
; call StringLen
;On exit: eax->lenght of string
;........................................
StringLen: ;
mov edx, esi ;u edx=esi=lp to string
xor esi, -1 ;v esi->for final computation
SLloop: ;
mov eax, [edx] ;u get a dword (buffer is aligned)
add edx, 4 ;v ready for next
mov ecx, eax ;u save it in ecx
sub eax, 1010101h ;v sub 1 from each byte in eax
and eax, 80808080h ;u test sign
jz SLloop ;v if not loop again
not ecx ;
and eax, ecx ;u check for byte >= 80h
jz SLloop ;v
test al, al ;u is zero?
jnz C_minus4 ;v
test ah, ah ;u is zero?
jnz C_minus3 ;v
and eax, 0FF0000h ;u is zero?
jnz C_minus2 ;v
lea eax, [edx+esi] ;u eax=length of string
ret ;
C_minus2: ;
lea eax, [edx+esi-1] ;u eax= length of string
ret ;
C_minus3: ;
lea eax, [edx+esi-2] ;u eax= length of string
ret ;
C_minus4: ;
lea eax, [edx+esi-3] ;u eax= length of string
ret
Posted on 2002-03-08 18:14:19 by buliaNaza
buliaNaza,

I just pluggd in your code as is into the test bed and it produces identical times to both AGners and Jens versions.

Compliments on a fast algo.

Regards,

hutch@movsd.com
Posted on 2002-03-08 18:30:22 by hutch--
May be if you use a Russian symbols.. but in this English world..
They have 1 instruction more in the fast part of the main loop...
Posted on 2002-03-08 18:44:53 by buliaNaza
Athlon TB 1.333Ghz 512MB DDR WinXP
Average of 10 times on 89.2MB (93,584,001 bytes) text file:
bscan    320

bscan2 290
Agner 320
Jens 290
lstrlen 310
buliaNaza 300
[b][COLOR=red]Svin/bitRAKE 140 (MMX)[/COLOR][/b]
Posted on 2002-03-08 20:50:46 by bitRAKE
bitRAKE,

Ha,ha.. I agree with MMX but have questions:

- what about the length of your test string?

- are you tested a "normal" or "general purpose" string,
for instance "full path name" string?

- please, publish results for the "normal string" as follow:

db "C:\Program Files\Microsoft Visual Studio.NET\VC#\C# Language Specification.doc",0

- what is your advice to other people, which algo to use in their programs and why?

- should we use your algos with MMX or SIMD technology and why?

- please, publish your LAST and FASTEST MMX and SIMD versions here to learn them..

- ....
Posted on 2002-03-08 22:32:31 by buliaNaza
buliaNaza, the string is 93,584,001 bytes long - this test is the one Hutch-- designed and tests only the inner loop and memory speed. As stated above, ( this ) thread best outlines tests that have already been done on MMX routines for long strings. The algo in that thread has a 40 cycle overhead for a zero length string. On uncached data the overhead is overshadowed by a cache line read from memory. This algo quickly outpaces all others presented, but I have not done an analysis of the 'crossover' point between algos - when it is best to use the MMX algo verses byte scanner. I have no plans to write my thesis on string length. :) My thinking was that your algo would be faster, but the Athlon appears to have optimized Jens code to beat yours - this is a tricky chip to program on (same for Transmeta's Crusoe!). :)

My advise is to avoid string length. :)

If you have MMX, use MMX. :)
Posted on 2002-03-08 22:56:41 by bitRAKE
Thanks bitRAKE for your replay and nice advice...

For your info I know ( this ) thread in details
because I like The_Svin's ideas...

What about if I have a short string with some zeroes at the end?

"My advice is to avoid string length." ->no comments!!!

"If you have MMX, use MMX." -> but if I have MMX(with ADDITIONAL instructions) and SSE/2?
Posted on 2002-03-09 00:27:53 by buliaNaza
Ricky,

Thanks for posting the results, some interesting effects there. AMD usually handle the old string instructions better than Intel hardware so the API function is a lot faster on the AMD than an Intel box.

The MMX function is a very good performer and is OK on anything that will handle MMX but some early AMD and Intel Pentiums don't have that capacity unfortunately.

Jens version is simply shorter in loop length than the other 2 DWORD versions so it probably will be slightly faster on AMD hardware. I tried unrolling it and there was no speed difference at all so it is pretty efficient as it is.

Regards,

hutch@movsd.com
Posted on 2002-03-09 00:39:23 by hutch--
buliaNaza, I did a test with your string:
    ALIGN 32

buliaNazaString:
db "C:\Program Files\Microsoft Visual Studio.NET\"
db "VC#\C# Language Specification.doc",0
The test consisted of 100 tests for string length of the above string.
bscan    23000 cycles

bscan2 16700 cycles
Agner 9017 cycles
Jens 9718 cycles
lstrlen ?
buliaNaza 9609 cycles
The Svin 7660 (MMX)
bitRAKE 5010 (MMX)
We see here a great contrast from Hutch--'s test. All data is in the cache, so we are not testing memory speed! MMX still fairly good. :) These results were very repeatable - very little fluxuation ~1% overall.

By avoiding string length: I mean that the data is going to be parsed (test for length at time of parse - almost no cost), or data could be in a more structured form (no string length needed).

Sorry, but I have not tested SSE2. When I get a P4, I will. The text I have read would lead me to believe MMX bad on P4, and SSE/2 good on P4 - obvious marketing manuveur by Intel. :)
Posted on 2002-03-09 00:43:36 by bitRAKE
Ricky,

These are interesting results, I would be interested to see what they perform like in real time rather than cycle count.

Regards,

hutch@movsd.com
Posted on 2002-03-09 01:03:34 by hutch--
Thanks bitRAKE for your efforts and Good Night!!!
Posted on 2002-03-09 01:24:16 by buliaNaza
This graph shows the timing of all present algos for very short strings in the cache:
Posted on 2002-03-09 03:23:02 by bitRAKE
The horizontal axis is the length of the string and the vertical axis is the cycle count on an AMD Athlon. My next graph will be for non-cached strings...

The Svin's algo is very impressive on uncached strings > 64 bytes.
Posted on 2002-03-09 16:59:37 by bitRAKE
Here are my (slow :grin: ) results:


BYTE scanner 1 : 12829 ms
Unrolled BYTE scanner: 8352 ms
DWORD version Agner : 6209 ms
DWORD version Jens : 6199 ms
Windows API version : 8472 ms

Length of "test.txt" : 1.397.368.000 bytes
Posted on 2002-03-09 17:55:36 by bazik
This appears to be the fastest for small (<40 bytes) strings:
StrLen8 MACRO lpString:REQ ; byte scanner 8

LOCAL _0,_1,_2,_3,_4,_5,_6,_7,_8

mov edx, lpString
xor eax,eax
; ALIGN 16 ; align external from macro is better
_0:
cmp BYTE PTR [edx],0
jz _1
cmp BYTE PTR [edx + 1],al
jz _2
cmp BYTE PTR [edx + 2],al
jz _3
cmp BYTE PTR [edx + 3],al
jz _4
cmp BYTE PTR [edx + 4],al
jz _5
cmp BYTE PTR [edx + 5],al
jz _6
cmp BYTE PTR [edx + 6],al
jz _7
cmp BYTE PTR [edx + 7],al
jz _8
add edx,8
jmp _0
ALIGN 16
_8: inc eax
_7: inc edx
_6: inc eax
_5: inc edx
_4: inc eax
_3: inc edx
_2: inc eax
_1: sub edx, lpString
add eax, edx
ENDM
un-cached. (see graphs above)
Posted on 2002-03-09 21:35:16 by bitRAKE