No, I cannot. I don't trust anybody (including myself), only tests.
I've heard those "trust" words from bitRAKE; had I let him get away with it, he would never have written his interesting code.

You obviously didn't check the second idea, it's clear from your post. Shall I do it for you?

Your way slows down the code (at least on the P3 here). Look at the chart. I think the problem is the memory reads. If the "complete memory" is read up front, without processing the data, it seems to be much faster than processing the data directly from memory.

Cu, Jens
Posted on 2002-03-13 00:17:40 by Jens Duttke
Doesn't the P3 support prefetch?
	mov edx, lpString

pxor mm0,mm0
pxor mm1,mm1
pxor mm2,mm2
pxor mm3,mm3

sLoop: pcmpeqb mm0, qword ptr [edx]
pcmpeqb mm1, qword ptr [edx + 8]

pcmpeqb mm2, qword ptr [edx + 16]
pcmpeqb mm3, qword ptr [edx + 24]

por mm0, mm1
por mm2, mm3

pcmpeqb mm1, qword ptr [edx + 32]
pcmpeqb mm3, qword ptr [edx + 40]

por mm0, mm2
por mm1, mm3

pcmpeqb mm2, qword ptr [edx + 48]
pcmpeqb mm3, qword ptr [edx + 56]

por mm0, mm1
por mm2, mm3

por mm0, mm2
prefetchnta [edx + 64]

packsswb mm0, mm0

movd ecx, mm0

add edx, 64
test ecx, ecx
jz sLoop
This really plowed through the data on the Athlon!! Hardly any slope to the graph line at all. I put buliaNaza's code on the end of this and it beat anything on >24 character strings. My machine is coming in at 20 cycles for 64 bytes (with a couple of spikes for the tail code), but I'm hitting the memory wall. Quite an impressive combination. :)

Whoops, forgot the P3 has a 32 byte cacheline, iirc?
So, you'd need to modify the code some. :tongue:
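
A portable sketch of the loop's idea, in C rather than MMX: OR together the zero-byte tests for a whole 64-byte block and branch only once per block. The function names are hypothetical, and the sketch assumes the buffer is padded to a multiple of 64 bytes and contains a terminator. It does not prefetch and it only locates the block, so tail code is still needed to find the exact byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero if and only if some byte of w is 0x00. */
static uint64_t has_zero64(uint64_t w)
{
    return (w - 0x0101010101010101ull) & ~w & 0x8080808080808080ull;
}

/* Offset of the first 64-byte block that contains a zero byte.
 * Assumption: the buffer is padded to a multiple of 64 bytes and
 * really contains a zero byte, otherwise this reads out of bounds. */
size_t first_block_with_zero(const unsigned char *p)
{
    assert(p != NULL);
    size_t off = 0;
    for (;;) {
        uint64_t acc = 0;
        for (int k = 0; k < 8; k++) {
            uint64_t w;
            memcpy(&w, p + off + 8 * k, 8);  /* one qword load */
            acc |= has_zero64(w);            /* like por: merge the results */
        }
        if (acc != 0)                        /* single test per 64 bytes */
            return off;
        off += 64;
    }
}
```

The returned offset plays the role of edx when the loop exits; a byte-wise or dword-wise scan can then start from there.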
Posted on 2002-03-13 00:41:01 by bitRAKE

I tried your code on the P3 here, and the result is that it's slightly slower than your MMX code on short strings, and a bit faster on long strings. I've also tried reading only 32 bytes like you said, but then your SSE code is slower than your MMX code across the board.

There is also the problem that older systems (like my P2) do not support the prefetch instruction. So your code is limited to a fairly small number of systems.

Cu, Jens
Posted on 2002-03-13 01:36:00 by Jens Duttke
Jens Duttke:
Could I have the testing code, please?
Posted on 2002-03-13 02:56:59 by The Svin

Jens Duttke:
Could I have the testing code, please?

You can call me Jens. :grin:

And sure, you can download it here :


(It's complete with assembler and libs, that's why it has a size of 930kb ... so, you just need to extract it and start the make.bat)

The "string generate code" could be optimized much more. But since it runs at a good speed, I haven't done that yet.

Cu, Jens
Posted on 2002-03-13 03:22:36 by Jens Duttke
I have no idea with what to open ace files.
Posted on 2002-03-13 04:02:47 by The Svin

I have no idea with what to open ace files.
You need the WinAce archiver.

Posted on 2002-03-13 04:11:23 by Maverick

WinAce and WinRar can open it.

Cu, Jens
Posted on 2002-03-13 04:11:41 by Jens Duttke
Thanks, Maverick, I've downloaded it.

Is it a famous archiver?
I've never heard of it before.
Posted on 2002-03-13 04:31:07 by The Svin

WinAce is the best archiver for some file-types, it beats even WinRar sometimes.

Cu, Jens
Posted on 2002-03-13 04:45:56 by Jens Duttke
Alex, Jens: In my experience it's the best archiver too.. I recall a BMP image where the .ACE was half the size of the .RAR!!

Posted on 2002-03-13 04:51:08 by Maverick
ACE should always beat RAR, or at least have equal compression, since it's just an extension to RAR :) That's why WinACE has no problem reading RAR's and WinRAR can read just very few ACE's.
Posted on 2002-03-13 06:06:56 by Qweerdy
As buliaNaza said, I made a mistake.
Here is a working version (I think) of buliaNaza's code:

;mov esi, offset Buffer
;call StrLen
;On exit: eax = length of string

StrLen proc ; string pointer is passed in esi, so no stack parameter
xor edx,edx
@@:
mov ecx,[edx+esi]
add edx,4
lea eax,[ecx-1010101h]
and eax, 80808080h
jz @B
not ecx ;
and eax, ecx ;u reject bytes >= 80h, keep only real zero bytes
jz @B
test al, 80h ;u is byte 0 zero?
jnz C_minus4 ;v
test ah, 80h ;u is byte 1 zero?
jnz C_minus3 ;v
shl eax,9 ; CF = 1 if byte 2 is zero
sbb edx,0
lea eax, [edx-1] ; eax= length of string
ret
C_minus3: ;
lea eax, [edx-3] ;u eax= length of string
ret ;
C_minus4: ;
lea eax, [edx-4] ;u eax= length of string
ret ;
StrLen endp
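
The zero-byte test used in the routine above can be sketched in C as follows. The function names are hypothetical. Byte 0 means the least significant byte, which is the first byte in memory on little-endian x86. Only the lowest flag in the mask is guaranteed to be exact, because a borrow out of a zero byte can flag a 01h byte above it; this is why the bytes are checked from the bottom up.

```c
#include <assert.h>
#include <stdint.h>

/* 80h is set in each byte position flagged as zero:
 * lea eax,[ecx-1010101h] / not ecx / and eax,ecx / and eax,80808080h */
uint32_t zero_byte_mask(uint32_t w)
{
    return (w - 0x01010101u) & ~w & 0x80808080u;
}

/* Index of the lowest zero byte of w (0 = least significant byte),
 * or -1 if w contains no zero byte. */
int first_zero_byte(uint32_t w)
{
    uint32_t m = zero_byte_mask(w);
    if (m == 0)
        return -1;
    if (m & 0x00000080u)   /* test al, 80h */
        return 0;
    if (m & 0x00008000u)   /* test ah, 80h */
        return 1;
    if (m & 0x00800000u)   /* the bit that shl eax,9 moves into CF */
        return 2;
    return 3;
}
```

For example, `first_zero_byte(0x64006261)` reports byte 2, and a word such as `0xE9616161` with a byte above 80h is not flagged at all.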

And Jens Duttke, I'm just curious, why don't you change

sub edx, 4
shr edx, 3
lea eax, [ecx + edx - 4]

to

lea eax,[ecx*8 +edx-36] ;2^2(1+2^3)= 36 , 2^3=8
shr eax,3
Posted on 2002-03-13 08:14:41 by eko

Originally posted by eko
And Jens Duttke, I'm just curious, why don't you change

sub edx, 4
shr edx, 3
lea eax, [ecx + edx - 4]

to

lea eax,[ecx*8 +edx-36] ;2^2(1+2^3)= 36 , 2^3=8
shr eax,3

It sounds like a nice idea in theory ... but let's assume the address of the string (ecx) is 30000000h.
Now you calculate 30000000h (ecx) * 8 ... the result is the 33-bit number 1.8000.0000h. Since the x86 only handles 32 bits here, the highest bit is cut off, so eax will be wrong and the wrong number of bytes is returned.

And why make it that complex? Simply remove the 'sub edx, 4'; it still works. (I should comment my code a bit more, I don't remember why it was there :grin: )
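
The overflow can be checked with a small C model of the two instruction sequences. This is only a sketch: the function names are made up, and `uint32_t` arithmetic stands in for the 32-bit registers (it wraps modulo 2^32 just like `lea`).

```c
#include <assert.h>
#include <stdint.h>

/* sub edx,4 / shr edx,3 / lea eax,[ecx+edx-4] */
uint32_t len_original(uint32_t ecx, uint32_t edx)
{
    edx -= 4;
    edx >>= 3;
    return ecx + edx - 4;
}

/* lea eax,[ecx*8+edx-36] / shr eax,3 */
uint32_t len_scaled(uint32_t ecx, uint32_t edx)
{
    uint32_t eax = ecx * 8u + edx - 36u;  /* ecx*8 loses its top 3 bits */
    return eax >> 3;
}
```

With ecx = 1000h and edx = 44 both versions return 1001h. With ecx = 30000000h the original returns 30000001h, while the scaled version returns 10000001h, because 1.8000.0000h is truncated to 8000.0000h before the shift.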

Cu, Jens
Posted on 2002-03-13 09:17:25 by Jens Duttke
dxantos, any code I post here may be used commercially, or otherwise. Credit in the project isn't so important to me right now, but if you could send me a letter on company letterhead that states you've used my code, that would be awesome.

Sure, why not. Just send me a message with your postal address and we will send you the letter. (Even if this code ends up not being used I do use your FPC macro :) ).
Posted on 2002-03-13 12:37:24 by dxantos
You wrote a very good test (very stable).

Here is a little good news.
I didn't change your algo, but rearranged the instructions to remove
2 dependencies (surely, with a little time, we'll be able to remove
all of them). The result is a small but obvious (at least according to your
test) improvement.
Compared, the procs (except at the very beginning) show either:
1. the same number of ticks, or
2. the rearranged one 1-2 ticks faster.

Here is the code (only the core part; the start and end are the same as in the original):

movq mm1, qword ptr [ecx]
movq mm2, qword ptr [ecx + 8]

pcmpeqb mm1, mm0
pcmpeqb mm2, mm0

movq mm3, qword ptr [ecx + 16]
movq mm4, qword ptr [ecx + 24]

pcmpeqb mm3, mm0
pcmpeqb mm4, mm0

por mm1, mm2
por mm3, mm4

movq mm5, qword ptr [ecx + 32]
movq mm6, qword ptr [ecx + 40]

por mm1, mm3

pcmpeqb mm5, mm0
pcmpeqb mm6, mm0

por mm5, mm6
por mm1, mm5

add ecx, 48

packsswb mm1, mm1
movd eax, mm1
test eax, eax
jz @B

sub ecx, 48

Have I said that I'm very glad you are with us here?
I am.

Now teach me step by step, please, how to make Excel graphs
from those files, the way you and bitRAKE do.
Posted on 2002-03-13 22:20:39 by The Svin
Svin, I output the numbers directly to vkim's debug window and cut-n-paste them onto the spreadsheet. Then push the chart wizard button on the toolbar, select line graph, select done. :) I thought of writing a DLL for direct use in VBA scripts to automate the whole process, but it's too easy now and I wouldn't like to lose the flexibility. Maybe for a profiling tool? Just brainstorming on the keyboard...
Posted on 2002-03-13 22:30:26 by bitRAKE
Here is my second and, I hope, faster variant (without MMX),
because I hate:

movd eax, mm1
test eax, eax
jz @B

and I'll work with SSE2....

;Usage: mov   esi, offset Buffer

; call buliaNaza2Var
;On exit: eax = the number of characters in string,
; excluding the terminal NULL
buliaNaza2Var: ; strlen 2nd variant
xor edx,edx ; edx=0
C2_loop: ;
mov eax, [esi+edx] ; get a dword (buffer is aligned)
lea ecx, [eax-1010101h];sub 1 from each byte in eax
add edx, 4 ; ready for next dword
and ecx, 80808080h ; test sign
jz C2_loop ; if not loop again
test eax, 000000FFh ; is al zero?
jz C2_minus4 ;
test eax, 0000FF00h ; is ah zero?
jz C2_minus3 ;
test eax, 00FF0000h ; is zero?
jz C2_minus2 ;
and eax, 0FF000000h ; is zero?
jnz C2_loop ; if not zeroes loop again
lea eax, [edx-1] ; eax= length of string
ret ;
C2_minus2: ;
lea eax, [edx-2] ; eax= length of string
ret ;
C2_minus3: ;
lea eax, [edx-3] ; eax= length of string
ret ;
C2_minus4: ;
lea eax, [edx-4] ; eax= length of string
ret ;
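
For readers who prefer C, the same approach can be sketched as follows: a cheap filter per dword, then explicit byte tests only when the filter fires. The function name is hypothetical, and the sketch assumes the buffer is padded so that whole 4-byte reads up to the terminator stay in bounds. The filter also fires for bytes above 80h, which is why the byte tests are needed and why the loop may simply continue after them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Dword-at-a-time strlen, modelled on the routine above.
 * Assumption: s is padded so that reading 4 bytes at every
 * multiple-of-4 offset up to the terminator is valid. */
size_t strlen_dword(const char *s)
{
    assert(s != NULL);
    size_t i = 0;
    for (;;) {
        uint32_t w;
        memcpy(&w, s + i, 4);                 /* mov eax, [esi+edx] */
        i += 4;                               /* add edx, 4 */
        /* Filter: nonzero if a byte is 00h, but also for bytes > 80h */
        if (((w - 0x01010101u) & 0x80808080u) == 0)
            continue;
        /* Confirm with per-byte tests, first byte in memory first */
        for (size_t k = 0; k < 4; k++) {
            if (s[i - 4 + k] == 0)
                return i - 4 + k;             /* lea eax, [edx-4+k] */
        }
        /* False alarm caused by a high byte: keep scanning */
    }
}
```

Calling it on a 16-byte buffer holding "hello" returns 5.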
Posted on 2002-03-14 00:37:56 by buliaNaza

The Svin : It's really sometimes 1 tick faster. :)

To get it into Excel:

- Start Excel
- Click on File -> Open
- Select "All Files (*.*)" and select the table.txt file
- In the new window, click on the "Next >" button without changing anything.
- In the next window, select the "Semicolon" checkbox and click on the done button.
- Remove the text "Length" from A1 (you can also remove it from the source code, so you don't need to do that every time).
- Click on the Diagram-Wizard button (or Menu -> Insert -> Diagram...)
- In the wizard, select the type "Line", and there the first diagram type.
- Click on "Done" and the diagram should be there

I am sure it's also possible to make a macro for that ... but I am just too lazy to do that :grin:


buliaNaza : You did a damn good job, your algo is extremely fast.


I got an idea, already tried it, and it seems to work a bit faster (a tick) :

instead of :

mov ecx, lpString

mov eax, dword ptr [ecx]
add ecx, 4

i tried this :

mov ecx, lpString
shr ecx, 2

mov eax, dword ptr [ecx * 4]
inc ecx

The only problem is that the memory needs to be aligned to 4, but it should not be a problem to add code for this ... I wonder if that also increases the speed a bit on an Athlon.
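
Why the alignment matters can be seen in a small C model of the address calculation. This is a sketch with made-up function names: `shr ecx, 2` throws away the low two address bits, so `[ecx * 4]` only points at the intended dword when the address was already a multiple of 4.

```c
#include <assert.h>
#include <stdint.h>

/* Address actually read by: shr ecx,2 / mov eax,[ecx*4] */
uint32_t effective_address(uint32_t addr)
{
    uint32_t index = addr >> 2;  /* shr ecx, 2: low two bits are lost */
    return index * 4u;           /* [ecx * 4] */
}

/* Hypothetical helper: number of leading bytes that would have to be
 * handled one at a time before the dword loop can start aligned. */
uint32_t bytes_to_alignment(uint32_t addr)
{
    return (4u - (addr & 3u)) & 3u;
}
```

A string at 00403002h would be read from 00403000h, two bytes too early; handling the first `bytes_to_alignment(addr)` bytes separately brings the pointer to an address where the scaled form is exact.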

Cu, Jens
Posted on 2002-03-14 02:07:00 by Jens Duttke
The Svin : It's really sometimes 1 tick faster.

Sometimes 2 ticks.
And those "sometimes" most of the time :)

Thanx for help with Excel.
Posted on 2002-03-16 05:58:04 by The Svin