You are wrong:
to produce the same value, the data just needs to be aligned to a megabyte.
Look at the last post and you'll see the results from a 1 MB file.
The values are the same.
Posted on 2002-03-28 13:37:27 by The Svin
What processor do you use?

I'm writing now from my father-in-law's.
He has a PMMX 200. But it would be faster on any original Pentium.
I write for Pentiums and don't care about anything else :)

It should be faster if no memory is involved (we'll change it to some better way after all the bugs are dead), because it's absolutely identical to your 3rd proc except for the division, and 2 muls are always faster than 1 div (I'll remove the second mul later to make it faster).
Posted on 2002-03-28 13:44:28 by The Svin


eax = Nico : code:C9AD7C71, 321 ms for 10 loops on 2MB of data
eax = BitRAKE : code:C0B67C80, 701 ms for 10 loops on 2MB of data
eax = Thomas2 : code:C9AD7C71, 300 ms for 10 loops on 2MB of data
eax = Thomas3 : code:C9AD7C71, 200 ms for 10 loops on 2MB of data
eax = Svin2 : code:C9AD7C71, 190 ms for 10 loops on 2MB of data


:)
Posted on 2002-03-28 14:13:26 by The Svin
Very nice Svin! Could you post the code?
I'm wondering if we could do more iterations before taking the mod of the results, by quitting the loop just before an overflow of ecx or edx occurs. For now I've used the worst-case value as the number of iterations before the mod.

Thomas
Posted on 2002-03-28 15:16:47 by Thomas
something like this:


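_b2:
next:
    mov al, [esi]       ; next byte (upper bits of eax assumed zero)
_proceed:
    add ecx, eax        ; s1 += byte
    add edx, ecx        ; s2 += s1
    jo  _of_edx         ; bail out on (signed) overflow of s2
    inc esi
    dec edi
    jnz next
    jmp _done

_of_edx:
    sub edx, ecx        ; undo both additions
    sub ecx, eax
    <<<mod edx and ecx >>>
    jmp _proceed        ; redo the additions on the reduced values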


This requires an extra jcc, and 4 extra jumps when unrolled, so it's probably slower, but maybe we can get something out of it.

Thomas
Posted on 2002-03-28 15:27:45 by Thomas
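Thomas's idea of running until just before an overflow can be modelled outside assembly. This is a minimal Python sketch, not anyone's posted code: the name `adler32_lazy_mod` and the reduction counter are assumptions made for illustration, and Python's `zlib.adler32` is imported only as a cross-check.

```python
import zlib

BASE = 65521
LIMIT = 1 << 32          # the accumulators are 32-bit registers in the asm


def adler32_lazy_mod(data, adler=1):
    """Adler-32 that reduces mod BASE only when a 32-bit add would overflow.

    Returns (checksum, number_of_reductions).
    """
    s1 = adler & 0xFFFF
    s2 = (adler >> 16) & 0xFFFF
    reductions = 0
    for b in data:
        # s2 + s1 + b is the value s2 would take after this byte.
        # If it does not fit in 32 bits, reduce first, then proceed.
        if s2 + s1 + b >= LIMIT:
            s1 %= BASE
            s2 %= BASE
            reductions += 1
        s1 += b
        s2 += s1
    return ((s2 % BASE) << 16) | (s1 % BASE), reductions
```

On short inputs the reduction branch is never taken; on long runs of 0FFh bytes it fires once every few thousand bytes, which is the saving over a fixed worst-case block size that the post is after.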
Okay I fixed my algo:


------------------------------------------------------------------
eax = Nico : code:3980E24D, 171 ms for 10 loops on 2MB of data
eax = BitRAKE : code:3980E24D, 170 ms for 10 loops on 2MB of data
eax = Thomas2 : code:3980E24D, 160 ms for 10 loops on 2MB of data
eax = Thomas3 : code:3980E24D, 80 ms for 10 loops on 2MB of data
eax = Svin2 : code:DC181263, 81 ms for 10 loops on 2MB of data

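bitRAKE proc uses edi esi ebx adler:DWORD, buf:DWORD, len:DWORD
    mov eax,adler
    mov ecx,buf
    mov edx,eax
    shr eax,16          ; eax = s2
    and edx,0FFFFh      ; edx = s1

    mov esi,BASE
    sub eax,edx         ; stagger: eax holds s2 - s1
    jmp _x
_0:
    movzx ebx, BYTE PTR [ecx]
    inc ecx
    add eax,edx         ; add the previous s1 into the s2 side
    add edx,ebx         ; s1 += byte
    cmp esi,eax         ; CF = 1 only if eax > BASE
    sbb edi,edi         ; edi = -1 if CF else 0
    cmp esi,edx
    sbb ebx,ebx
    and edi,esi         ; edi = BASE or 0
    and ebx,esi
    sub eax,edi ; values are restricted to:
    sub edx,ebx ; [0, BASE] (a value equal to BASE is not reduced)

_x: dec len
    jns _0

    add eax,edx         ; undo the stagger: eax = s2
    cmp esi,eax
    sbb ebx,ebx
    and ebx,esi
    sub eax,ebx

    shl eax,16
    add eax,edx
    ret
bitRAKE ENDP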
Silly error with reversing the args to CMP - that is what I get for coding without my tools. :)
Posted on 2002-03-28 20:56:33 by bitRAKE
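The cmp/sbb/and/sub sequence in the post above is a branch-free conditional subtract. Here is a small Python model of that pattern, offered as a sketch only: the helper names `cond_sub` and `adler32_bytewise` are assumptions, and the model compares with `>=` so results land in [0, BASE), whereas `cmp esi,eax` as posted sets the carry only when eax is strictly greater than BASE.

```python
import zlib

BASE = 65521
MASK32 = 0xFFFFFFFF


def cond_sub(x, base=BASE):
    """Branch-free 'x -= base if x >= base', valid for 0 <= x < 2*base."""
    borrow = 1 if x >= base else 0      # stands in for the carry flag after CMP
    mask = (-borrow) & MASK32           # SBB reg,reg: all ones or all zeros
    return (x - (mask & base)) & MASK32  # AND with BASE, then SUB


def adler32_bytewise(data, adler=1):
    """Adler-32 with one conditional subtract per sum per byte, no division."""
    s1 = adler & 0xFFFF
    s2 = (adler >> 16) & 0xFFFF
    for b in data:
        s1 = cond_sub(s1 + b)    # s1 + b < 2*BASE, one subtract is enough
        s2 = cond_sub(s2 + s1)   # same bound holds for s2 + s1
    return (s2 << 16) | s1
```

Because each sum is kept below BASE after every byte, a single subtract always suffices and no div is needed, at the cost of extra work per byte.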
Thomas, you'll have fewer dependencies in your inner loop if you stagger the calculations like I do above - it will work on your unrolled version, too.
---------------------------------------------------------------------

eax = Thomas2 : code:3980E24D, 1602 ms for 100 loops on 2MB of data
eax = Thomas3 : code:3980E24D, 791 ms for 100 loops on 2MB of data
eax = Nico : code:3980E24D, 1452 ms for 100 loops on 2MB of data
eax = BitRAKE : code:3980E24D, 1733 ms for 100 loops on 2MB of data
eax = BitRAKE2 : code:3980E24D, 771 ms for 100 loops on 2MB of data

sub edx, ecx ;**bitRAKE2
next:
movzx eax, BYTE PTR [esi+0]
add edx, ecx
add ecx, eax

movzx eax, BYTE PTR [esi+1]
add edx, ecx
add ecx, eax

movzx eax, BYTE PTR [esi+2]
add edx, ecx
add ecx, eax

movzx eax, BYTE PTR [esi+3]
add esi, 4
add edx, ecx
add ecx, eax
dec edi
jnz next
add edx, ecx ;**bitRAKE2
Replace the code in Thomas3 with the code above. Not a big improvement on Athlons, but maybe more on other processors?
Posted on 2002-03-28 22:42:49 by bitRAKE
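The staggering marked with **bitRAKE2 can be checked with a short Python model. This is a sketch under assumptions: `sums_plain` and `sums_staggered` are hypothetical names, the sums are left unreduced, and Python integers may go negative where the registers would wrap modulo 2^32.

```python
def sums_plain(data, s1=1, s2=0):
    """Straightforward order: s2 depends on the s1 just computed."""
    for b in data:
        s1 += b
        s2 += s1
    return s1, s2


def sums_staggered(data, s1=1, s2=0):
    """Staggered order: both adds in a step read the s1 from the previous step."""
    d = s2 - s1              # sub edx, ecx before the loop
    for b in data:
        d += s1              # add edx, ecx (s1 from before this byte)
        s1 += b              # add ecx, eax (does not wait on the line above)
    return s1, d + s1        # add edx, ecx after the loop restores s2
```

In the plain order the add into s2 has to wait for the add into s1 within the same step; in the staggered order the two adds of a step are independent of each other, which is presumably why it pairs better.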
This solution is in another class.
Triple speed with prefetch. ;)
Thomas2 : code:3980E24D, 1632 ms for 100 loops on 2MB of data
Thomas3 : code:3980E24D, 721 ms for 100 loops on 2MB of data
Nico : code:3980E24D, 1302 ms for 100 loops on 2MB of data
bitRAKE : code:3980E24D, 1552 ms for 100 loops on 2MB of data
bitRAKE2 : code:3980E24D, 721 ms for 100 loops on 2MB of data
bitRAKE3! : code:3980E24D, 240 ms for 100 loops on 2MB of data


bitRAKE3 proc uses edi esi ebx adler:DWORD, buf:DWORD, len:DWORD
mov ecx, adler
mov esi, buf
mov edx, ecx
and ecx, 0ffffh ; ecx = s1
shr edx, 16 ; edx = s2
mov ebx, BASE

_l1:

CACHE_LINE EQU 64

mov edi, 86*64
sub len, edi
ja _b2
add edi, len
jz _done
and len, 0

ALIGN 8
_b2:
sub edx, ecx
next:
; three cache lines ahead ;)
prefetchnta [esi + CACHE_LINE*3]

i = CACHE_LINE
WHILE i NE 0
movzx eax, BYTE PTR [esi+CACHE_LINE-i]
IF i EQ 1
add esi,CACHE_LINE
ENDIF
add edx, ecx
add ecx, eax
i = i - 1
ENDM

sub edi,CACHE_LINE
jnz next

mov eax, ecx
add ecx, edx
xor edx, edx
div ebx
mov eax, ecx
mov ecx, edx
xor edx, edx
div ebx
jmp _l1
_done:
mov eax, edx
shl eax, 16
add eax, ecx
ret
bitRAKE3 ENDP
Shouldn't a 1GHz+ CPU beat a P200MMX by several times? Note that len must be a multiple of CACHE_LINE or else this doesn't work. Prefetching three lines ahead is tuned for my 1.3GHz TB and DDR memory - this will be different for other configurations. :(
Posted on 2002-03-28 23:38:52 by bitRAKE
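The block structure of bitRAKE3 (accumulate 86*64 bytes, then reduce both sums) can be modelled in Python. This is a sketch only: `adler32_blocked` is an assumed name, the prefetch has no counterpart here, and unlike the posted proc the model also handles a final partial block.

```python
import zlib

BASE = 65521
BLOCK = 86 * 64          # 5504 bytes per block, as in the post above


def adler32_blocked(data, adler=1, block=BLOCK):
    """Adler-32 that defers the modulo to the end of each block."""
    s1 = adler & 0xFFFF
    s2 = (adler >> 16) & 0xFFFF
    for start in range(0, len(data), block):
        for b in data[start:start + block]:
            s1 += b
            s2 += s1
        # With 5504 bytes per block the sums still fit in 32 bits,
        # even if every byte is 0FFh.
        assert s2 < (1 << 32)
        s1 %= BASE
        s2 %= BASE
    return (s2 << 16) | s1
```

The block size trades the number of divisions against the risk of 32-bit overflow; 5504 is also a multiple of the 64-byte cache line the unrolled loop is built around.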
Very nice Svin! Could you post the code?

Sure,
It's practically your code; for me it was more important just to prove that my "mul\div" method is reliable.
Only xor eax,eax was needed after the division (that simple :)
It doesn't increase speed though, because that code is rarely executed,
and with mul instead of div it is longer (in size), so it's often out of the code cache and the effect may even be negative.
In multiple tests it was clear that Thomas3 and Svin2 were competing in speed, and you cannot be sure which one was faster
(code cache effects).
I had a little look at the rest of Thomas3 and found that the code could be reorganized to remove some dependencies; this definitely had an effect.
Anyway I would call it Thomas3andSvin :) because my part was
just small and auxiliary.


Svin2 proc uses edi esi ebx adler:DWORD, buf:DWORD, len:DWORD

mov ecx, adler
mov esi, buf
mov edx, ecx
and ecx, 0ffffh ; ecx = s1
xor eax, eax
shr edx, 16 ; edx = s2
mov ebx, 80078071h
shr len, 2
_l1:

cmp len, 0
jz _done

mov edi, 963
cmp len, edi
ja _b2
mov edi, len
_b2:
sub len, edi

next:
mov al, [esi+0]
add ecx, eax
mov al, [esi+1]
add edx, ecx

add ecx, eax
mov al, [esi+2]
add edx, ecx

add ecx, eax
mov al, [esi+3]
add edx, ecx

add ecx, eax
add esi,4
add edx,ecx
dec edi

jnz next

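mov edi,edx ;dividend (s2)
mov eax,edx
mov edx,ebx ;= 80078071h
mul ebx ;edx = high dword of s2 * 80078071h
mov eax,edx
mov edx,65521
shr eax,15 ;eax = s2 / 65521
mul edx ;eax = quotient * 65521
sub edi,eax ;edi = s2 mod 65521

mov eax,ecx
mov edx,ebx
mul ebx
mov eax,edx
mov edx,65521
shr eax,15 ;eax = s1 / 65521
mul edx
sub ecx,eax ;ecx = s1 mod 65521
mov edx,edi ;edx = reduced s2
xor eax,eax ;clear eax so mov al,[esi] leaves a zero-extended byte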
jmp _l1
_done:

mov eax, edx
shl eax, 16
add eax, ecx

ret
Svin2 endp
Posted on 2002-03-29 01:11:06 by The Svin
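The "mul instead of div" step in Svin2 multiplies by a scaled reciprocal of 65521 and shifts. Below is a Python model of that arithmetic, as a sketch: the names `MAGIC`, `SHIFT` and `mod_base_by_mul` are assumptions, while the constant 80078071h and the shift by 15 come from the posted code.

```python
BASE = 65521
MAGIC = 0x80078071       # constant loaded into ebx in the post above
SHIFT = 32 + 15          # take the high dword of the product, then shr 15


def mod_base_by_mul(x):
    """x mod 65521 for 0 <= x < 2**32, using two multiplies and no division."""
    q = (x * MAGIC) >> SHIFT     # mul ebx, keep edx, shr 15: q = x // 65521
    return x - q * BASE          # mul by 65521, then sub
```

MAGIC is 2^47 / 65521 rounded up. The rounding error is MAGIC * 65521 - 2^47 = 31073, and since 31073 * x stays below 2^47 for every 32-bit x, the quotient comes out exact over the whole 32-bit range.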
Thomas, I was right about alignment to 1 MB:
try testing any data with, for example, size x*MB + 41.
The check codes will differ between the different procs.
Here is an example, testing data with size 1024*1024*8+41:



eax = Nico : code:EAC94A43, 3335 ms for 10 loops on 2MB of data
eax = BitRAKE : code:4F554A52, 5608 ms for 10 loops on 2MB of data
eax = Thomas2 : code:CAAF4ACA, 2493 ms for 10 loops on 2MB of data
eax = Thomas3 : code:A0864A16, 1522 ms for 10 loops on 2MB of data
eax = Svin2 : code:A0864A16, 1312 ms for 10 loops on 2MB of data


The last two procs give the same result just because they are absolutely identical algos with different implementations (instructions and their order).
Posted on 2002-03-29 01:55:45 by The Svin
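One plausible reason for the differing check codes on a size of x*MB + 41 is that an unrolled loop only sees whole groups of bytes; Svin2, for example, starts with `shr len, 2`. The Python sketch below assumes exactly that behaviour and nothing more: `adler32_unrolled_only` is a hypothetical helper, and `zlib.adler32` stands in for a correct full-length checksum.

```python
import zlib


def adler32_unrolled_only(data, unroll=4):
    """Checksum of only the bytes a loop unrolled `unroll` times would process."""
    usable = (len(data) // unroll) * unroll   # tail bytes are silently dropped
    return zlib.adler32(data[:usable])
```

With 41 extra bytes, a loop unrolled by 4 drops one byte, so its result matches any other proc that drops the same tail and differs from procs that handle the tail differently.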
Thomas, my test shows 5552 bytes can be processed
in the worst case without overflowing 32-bits.
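    or ecx,-1           ; ecx = -1 so the first inc gives 0

    mov eax,BASE-1      ; worst-case starting s1
    mov edx,BASE-1      ; worst-case starting s2
@@: inc ecx
    add eax,255         ; worst-case byte
    add edx,eax
    jnc @B
; ECX is max bytes before overflow in worst case: 5552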
Or, am I missing something?
Posted on 2002-03-29 02:21:39 by bitRAKE
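The same worst-case count can be reproduced in Python. This sketch mirrors the loop in the post above line by line; the function name `max_bytes_before_overflow` and its parameters are assumptions.

```python
BASE = 65521


def max_bytes_before_overflow(base=BASE, bits=32):
    """Count worst-case bytes (all 0FFh) that fit before s2 overflows."""
    limit = 1 << bits
    s1 = s2 = base - 1       # mov eax,BASE-1 / mov edx,BASE-1
    count = -1               # or ecx,-1
    while True:
        count += 1           # inc ecx
        s1 += 255            # add eax,255
        s2 += s1             # add edx,eax
        if s2 >= limit:      # jnc @B falls through once the add carries
            return count
```

In closed form, n bytes are safe when 255*n*(n+1)/2 + (n+1)*(BASE-1) is at most 2^32 - 1, and the largest such n is 5552.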
bitRAKE: That's the same number as nico had calculated, somehow my calculations were wrong...

All: Thanks for all your versions, I'll test each of them.

Thomas
Posted on 2002-03-29 02:31:15 by Thomas
I've tested all versions with a 100x loop, and made one new version based on Thomas3AndSvin, with bitRAKE's suggestion about the instruction order.


eax = 004031C0
------------------------------------------------------------------
eax = Nico : [9639F0C4], 1332 ms [100x2MB], 150.15 MB/s
eax = BitRAKE : [9639F0C4], 1722 ms [100x2MB], 116.14 MB/s
eax = BitRAKE2 : [9639F0C4], 1692 ms [100x2MB], 118.20 MB/s
eax = Thomas2 : [9639F0C4], 1453 ms [100x2MB], 137.64 MB/s
eax = Thomas3 : [9639F0C4], 801 ms [100x2MB], 249.68 MB/s
eax = Svin2 : [8B1CBB09], 741 ms [100x2MB], 269.90 MB/s
eax = Nico2 : [9639F0C4], 1091 ms [100x2MB], 183.31 MB/s
eax = Thomas3AndSvin : [9639F0C4], 781 ms [100x2MB], 256.08 MB/s
eax = Thomas3AndSvinAndBitRAKE : [9639F0C4], 772 ms [100x2MB], 259.06 MB/s
eax = BitRAKE3 : [9639F0C4], 230 ms [100x2MB], 869.56 MB/s


It's obvious that bitRAKE beats the rest with his Athlon version, but the other (latest) ones are reasonably fast as well. Could anyone test this on a Pentium?

Thomas
Posted on 2002-03-29 11:27:17 by Thomas
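The MB/s column in the table above follows from the timings: 100 loops of 2 MB is 200 MB, divided by the elapsed time. A Python sketch of that arithmetic follows; the function names are assumptions, and the truncation to two decimals is inferred from the printed figures rather than taken from the test program.

```python
def throughput_hundredths(ms, loops=100, mb_per_loop=2):
    """MB/s scaled by 100 and truncated, using integer arithmetic only."""
    return loops * mb_per_loop * 1000 * 100 // ms


def format_mb_per_s(ms):
    h = throughput_hundredths(ms)
    return f"{h // 100}.{h % 100:02d} MB/s"
```

For example, `format_mb_per_s(1332)` gives `150.15 MB/s`, matching the Nico row.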
i tried to test the program on a pentium, but i'm getting the following error during linking:

test.obj : error LNK2001: unresolved external symbol _testData

help?
Posted on 2002-03-29 12:04:17 by jademtech
jademtech, download the attachment in this post above:
http://www.asmcommunity.net/board/showthread.php?s=&postid=31317.msg31317

...copy a 2MB+ file to the same directory, rename it to file.dat and execute file2obj.bat to create the object file. Then make.bat

Edit: Replace the test.asm file with the new one above before building.
Posted on 2002-03-29 12:20:22 by bitRAKE



thanks... now it assembles... but i get an application error:

The instruction at "0x00401122" referenced memory at "0x005fd000". The memory could not be "read".

i get this on both my (Win2K) PIII and the (WinNT) pentium i am testing it on.
Posted on 2002-03-29 12:44:08 by jademtech
bitRAKE3 requires a P4/Athlon, iirc.
Comment that one out, and build again.

Edit: No, that isn't correct - the prefetch instructions should be seen as NOPs. Wish I wasn't at work - I'd build a copy for you...
Posted on 2002-03-29 12:46:15 by bitRAKE


hehheh... win32Asming on company time ;) anyway, i tried commenting bitRAKE3 out, but it turns out i have problems whenever the TESTPROC macro is called.
Posted on 2002-03-29 12:53:18 by jademtech

Well, it's Friday and I wouldn't have the job if I couldn't do both at once. You have to comment out the TESTPROC line at the end with the bitRAKE3 in it.
Posted on 2002-03-29 13:12:07 by bitRAKE



i had already tried that and if i leave even *one* TESTPROC in (nico,bitRAKE,Thomas2,etc), i still get that app error. thanks for all this time you're spending to help me :)
Posted on 2002-03-29 13:31:05 by jademtech