You are wrong:
to produce the same value, the data size just needs to be a multiple of 1 MB.
Look at the last post and you'll see results from a 1 MB file.
And the values are the same.
What processor do you use?
I'm writing now from my father-in-law's place.
He has a PMMX 200. But it would be faster on any original Pentium.
I write for Pentiums and don't care about anything else :)
It should be faster if no memory is involved (we'll change it to some better way after all bugs are dead), because it's absolutely identical to your 3rd proc except for the division, and 2 muls are always faster than 1 div (I'll remove the second mul later to make it faster).
eax = Nico : code:C9AD7C71, 321 ms for 10 loops on 2MB of data
eax = BitRAKE : code:C0B67C80, 701 ms for 10 loops on 2MB of data
eax = Thomas2 : code:C9AD7C71, 300 ms for 10 loops on 2MB of data
eax = Thomas3 : code:C9AD7C71, 200 ms for 10 loops on 2MB of data
eax = Svin2 : code:C9AD7C71, 190 ms for 10 loops on 2MB of data
:)
Very nice Svin! Could you post the code?
I'm wondering if we could do more iterations before mod-ing the results, by quitting the loop just before an overflow of ecx or edx occurs. For now I've used the worst-case value as the number of iterations before the mod.
Thomas
something like this:
_b2:
next:
mov al, [esi]
_proceed:
add ecx, eax
add edx, ecx
jo _of_edx
inc esi
dec edi
jnz next
jmp _done
_of_edx:
sub edx, ecx
sub ecx, eax
<<<mod edx and ecx >>>
jmp _proceed
This requires an extra jcc, and four extra jumps when unrolled, so it's probably slower, but maybe we can get something out of it.
Thomas
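A minimal Python sketch of the idea in the post above: keep adding, and only take the modulus when the next add would no longer fit in 32 bits. Python integers never overflow, so the overflow test is modelled here with an explicit comparison against the unsigned limit; note that jo in the assembly tests signed overflow, which triggers earlier. The name adler32_lazy is made up for this illustration, and zlib.adler32 from the standard library is used only as a reference.

```python
import zlib

BASE = 65521           # Adler-32 modulus
LIMIT = 0xFFFFFFFF     # largest value an unsigned 32-bit register can hold

def adler32_lazy(data: bytes, adler: int = 1) -> int:
    """Adler-32 that reduces modulo BASE only when an overflow is imminent."""
    s1 = adler & 0xFFFF
    s2 = (adler >> 16) & 0xFFFF
    for b in data:
        # Stand-in for the overflow jump: would the next update of s2
        # exceed 32 bits? If so, reduce both sums first, then proceed.
        if s2 + s1 + b > LIMIT:
            s1 %= BASE
            s2 %= BASE
        s1 += b
        s2 += s1
    return ((s2 % BASE) << 16) | (s1 % BASE)

if __name__ == "__main__":
    data = b"\xff" * 100_000          # worst-case bytes, long enough to force reductions
    print(adler32_lazy(data) == zlib.adler32(data))   # True
```

How often the reduction runs depends on the data, which is the attraction of the approach; whether the extra test per byte pays for itself is the open question raised in the post.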
Okay I fixed my algo:
Posted on 2002-03-28 20:56:33 by bitRAKE
------------------------------------------------------------------
eax = Nico : code:3980E24D, 171 ms for 10 loops on 2MB of data
eax = BitRAKE : code:3980E24D, 170 ms for 10 loops on 2MB of data
eax = Thomas2 : code:3980E24D, 160 ms for 10 loops on 2MB of data
eax = Thomas3 : code:3980E24D, 80 ms for 10 loops on 2MB of data
eax = Svin2 : code:DC181263, 81 ms for 10 loops on 2MB of data
bitRAKE proc uses edi esi ebx adler:DWORD, buf:DWORD, len:DWORD
mov eax,adler
mov ecx,buf
mov edx,eax
shr eax,16
and edx,0FFFFh
mov esi,BASE
sub eax,edx
jmp _x
_0:
movzx ebx, BYTE PTR [ecx]
inc ecx
add eax,edx
add edx,ebx
cmp esi,eax
sbb edi,edi
cmp esi,edx
sbb ebx,ebx
and edi,esi
and ebx,esi
sub eax,edi ; values are restricted to:
sub edx,ebx ; [0, BASE)
_x: dec len
jns _0
add eax,edx
cmp esi,eax
sbb ebx,ebx
and ebx,esi
sub eax,ebx
shl eax,16
add eax,edx
ret
bitRAKE ENDP
Silly error with reversing the args to CMP - that is what I get for coding without my tools. :)
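The cmp / sbb / and / sub sequence in the proc above is a branch-free conditional subtraction. A rough Python model of that masking idea follows; reduce_once and adler32_bytewise are hypothetical names, and the sketch assumes both halves of the incoming adler value are already below BASE. It compares with >= so that results stay strictly below BASE, whereas cmp esi,eax sets the carry only when eax is strictly greater than esi.

```python
import zlib

BASE = 65521

def reduce_once(x: int) -> int:
    """Subtract BASE from x if x >= BASE, without an if. Needs 0 <= x < 2*BASE."""
    mask = -(x >= BASE)          # 0 or -1 (all ones), like a register after sbb r,r
    return x - (mask & BASE)     # subtracts either 0 or BASE

def adler32_bytewise(data: bytes, adler: int = 1) -> int:
    """Adler-32 with both sums reduced after every byte."""
    s1 = adler & 0xFFFF
    s2 = (adler >> 16) & 0xFFFF
    for b in data:
        s1 = reduce_once(s1 + b)     # s1 + b < BASE + 256
        s2 = reduce_once(s2 + s1)    # s2 + s1 < 2*BASE
    return (s2 << 16) | s1

if __name__ == "__main__":
    data = bytes(range(256)) * 50
    print(adler32_bytewise(data) == zlib.adler32(data))   # True
```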
Thomas, you'll have fewer dependencies in your inner loop if you stagger the calculations like I do above - it will work on your unrolled version, too.
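As a sketch of what staggering means here: s2 is updated with the s1 from before the current byte, so the two adds in the loop body no longer depend on each other. One subtraction before the loop and one add after it keep the result unchanged. The function name is made up for this illustration, and the sums are left unreduced until the end because Python integers do not overflow.

```python
import zlib

BASE = 65521

def adler32_staggered(data: bytes, adler: int = 1) -> int:
    """Adler-32 with the s2 update running one step behind s1."""
    s1 = adler & 0xFFFF
    s2 = (adler >> 16) & 0xFFFF
    s2 -= s1                 # compensate: the loop adds the old s1 first
    for b in data:
        s2 += s1             # uses s1 as it was before this byte
        s1 += b              # does not depend on the line above
    s2 += s1                 # the one add the loop left outstanding
    return ((s2 % BASE) << 16) | (s1 % BASE)

if __name__ == "__main__":
    data = bytes(range(256)) * 100
    print(adler32_staggered(data) == zlib.adler32(data))   # True
```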
---------------------------------------------------------------------
eax = Thomas2 : code:3980E24D, 1602 ms for 100 loops on 2MB of data
eax = Thomas3 : code:3980E24D, 791 ms for 100 loops on 2MB of data
eax = Nico : code:3980E24D, 1452 ms for 100 loops on 2MB of data
eax = BitRAKE : code:3980E24D, 1733 ms for 100 loops on 2MB of data
eax = BitRAKE2 : code:3980E24D, 771 ms for 100 loops on 2MB of data
sub edx, ecx ;**bitRAKE2
next:
movzx eax, BYTE PTR [esi+0]
add edx, ecx
add ecx, eax
movzx eax, BYTE PTR [esi+1]
add edx, ecx
add ecx, eax
movzx eax, BYTE PTR [esi+2]
add edx, ecx
add ecx, eax
movzx eax, BYTE PTR [esi+3]
add esi, 4
add edx, ecx
add ecx, eax
dec edi
jnz next
add edx, ecx ;**bitRAKE2
Replace the code in Thomas3 with the code above. Not a big improvement on Athlons, but maybe more on other processors?
This solution is in another class.
Triple speed with prefetch. ;)
Thomas2 : code:3980E24D, 1632 ms for 100 loops on 2MB of data
Thomas3 : code:3980E24D, 721 ms for 100 loops on 2MB of data
Nico : code:3980E24D, 1302 ms for 100 loops on 2MB of data
bitRAKE : code:3980E24D, 1552 ms for 100 loops on 2MB of data
bitRAKE2 : code:3980E24D, 721 ms for 100 loops on 2MB of data
bitRAKE3! : code:3980E24D, 240 ms for 100 loops on 2MB of data
bitRAKE3 proc uses edi esi ebx adler:DWORD, buf:DWORD, len:DWORD
mov ecx, adler
mov esi, buf
mov edx, ecx
and ecx, 0ffffh ; ecx = s1
shr edx, 16 ; edx = s2
mov ebx, BASE
_l1:
CACHE_LINE EQU 64
mov edi, 86*64
sub len, edi
ja _b2
add edi, len
jz _done
and len, 0
ALIGN 8
_b2:
sub edx, ecx
next:
; three cache lines ahead ;)
prefetchnta [esi + CACHE_LINE*3]
i = CACHE_LINE
WHILE i NE 0
movzx eax, BYTE PTR [esi+CACHE_LINE-i]
IF i EQ 1
add esi,CACHE_LINE
ENDIF
add edx, ecx
add ecx, eax
i = i - 1
ENDM
sub edi,CACHE_LINE
jnz next
mov eax, ecx
add ecx, edx
xor edx, edx
div ebx
mov eax, ecx
mov ecx, edx
xor edx, edx
div ebx
jmp _l1
_done:
mov eax, edx
shl eax, 16
add eax, ecx
ret
bitRAKE3 ENDP
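Leaving the prefetch aside (it has no Python equivalent), the block structure of the proc above can be sketched roughly as follows: sum 86 cache lines of 64 bytes without any modulus, reduce both sums once, and repeat. adler32_blocks is a hypothetical name; unlike the assembly, this version slices the buffer and therefore also accepts lengths that are not a multiple of CACHE_LINE.

```python
import zlib

BASE = 65521
CACHE_LINE = 64
BLOCK = 86 * CACHE_LINE        # 5504 bytes between reductions, as in the proc above

def adler32_blocks(data: bytes, adler: int = 1) -> int:
    """Adler-32 that takes the modulus once per 5504-byte block."""
    s1 = adler & 0xFFFF
    s2 = (adler >> 16) & 0xFFFF
    for start in range(0, len(data), BLOCK):
        for b in data[start:start + BLOCK]:
            s1 += b
            s2 += s1
            assert s2 <= 0xFFFFFFFF   # the running sum still fits in 32 bits
        s1 %= BASE                    # the two div instructions after the loop
        s2 %= BASE
    return (s2 << 16) | s1

if __name__ == "__main__":
    data = b"\xff" * (3 * BLOCK + 17)
    print(adler32_blocks(data) == zlib.adler32(data))   # True
```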
Shouldn't a 1GHz+ CPU beat a P200MMX by several times? len must be a multiple of CACHE_LINE or else this doesn't work. Fetching three lines ahead is tuned for my 1.3GHz TB and DDR memory - this will be different for other configurations. :(
Very nice Svin! Could you post the code?
Sure.
It's practically your code; for me it was more important just to prove that my "mul/div" method is reliable.
Only an xor eax,eax was needed after the division (that simple :)
It doesn't increase speed though, because that code is rarely executed,
and with mul instead of div it is longer (in size), so it's often out of the code cache and the effect may even be negative.
With multiple tests it was clear that Thomas3 and Svin2 were competing in speed, and you cannot be sure which one was faster
(code cache effects).
I had a little look at the rest of Thomas3 and found that the code could be reorganized to remove some dependencies; this definitely had an effect.
Anyway, I would call it Thomas3andSvin :) because my part was
just a small and auxiliary one.
Svin2 proc uses edi esi ebx adler:DWORD, buf:DWORD, len:DWORD
mov ecx, adler
mov esi, buf
mov edx, ecx
and ecx, 0ffffh ; ecx = s1
xor eax, eax
shr edx, 16 ; edx = s2
mov ebx, 80078071h
shr len, 2
_l1:
cmp len, 0
jz _done
mov edi, 963
cmp len, edi
ja _b2
mov edi, len
_b2:
sub len, edi
next:
mov al, [esi+0]
add ecx, eax
mov al, [esi+1]
add edx, ecx
add ecx, eax
mov al, [esi+2]
add edx, ecx
add ecx, eax
mov al, [esi+3]
add edx, ecx
add ecx, eax
add esi,4
add edx,ecx
dec edi
jnz next
mov edi,edx ;dividend
mov eax,edx
mov edx,ebx ;= 80078071h
mul ebx
mov eax,edx
mov edx,65521
shr eax,15
mul edx
sub edi,eax
mov eax,ecx
mov edx,ebx
mul ebx
mov eax,edx
mov edx,65521
shr eax,15
mul edx
sub ecx,eax
mov edx,edi
xor eax,eax
jmp _l1
_done:
mov eax, edx
shl eax, 16
add eax, ecx
ret
Svin2 endp
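The mul / shr / mul / sub sequence at the end of Svin2 replaces a division by 65521 with a multiplication by a precomputed reciprocal. A small Python model of the arithmetic, assuming inputs below 2^32, is given here; mod_base is a made-up name, while the constant 80078071h and the shift by 15 are taken from the listing. Taking the high dword of the product and then shifting by 15 amounts to a total shift of 47 bits, and 80078071h is 2^47 / 65521 rounded up.

```python
BASE = 65521
MAGIC = 0x80078071           # constant loaded into ebx in Svin2

def mod_base(x: int) -> int:
    """x mod 65521 for 0 <= x < 2**32, using only multiplies, a shift and a subtract."""
    high = (x * MAGIC) >> 32     # edx after 'mul ebx' (high half of the 64-bit product)
    q = high >> 15               # 'shr eax,15' gives the quotient x // BASE
    return x - q * BASE          # multiply the quotient by 65521 and subtract

if __name__ == "__main__":
    for x in (0, 65520, 65521, 65522, 0xFFFFFFFF):
        print(x, mod_base(x) == x % BASE)   # True on every line
```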
Thomas, I was right about alignment to 1 MB:
try testing any data with a size of, for example, x*MB + 41.
The checksum results will differ between the procs.
Example: the following test used data of size 1024*1024*8+41:
eax = Nico : code:EAC94A43, 3335 ms for 10 loops on 2MB of data
eax = BitRAKE : code:4F554A52, 5608 ms for 10 loops on 2MB of data
eax = Thomas2 : code:CAAF4ACA, 2493 ms for 10 loops on 2MB of data
eax = Thomas3 : code:A0864A16, 1522 ms for 10 loops on 2MB of data
eax = Svin2 : code:A0864A16, 1312 ms for 10 loops on 2MB of data
The last two procs have the same result just because they are absolutely identical algorithms with different implementations (operators and order of use).
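One possible reason for the differing checksums, offered as a guess rather than a diagnosis: a proc that consumes four bytes per iteration and starts with shr len, 2 (as Svin2 does) never sees the last len mod 4 bytes. The sketch below models only that truncation, using zlib.adler32 for the actual sums; the helper names and the test data are hypothetical, and the other procs may diverge for other reasons.

```python
import zlib

def adler32_whole_dwords(data: bytes) -> int:
    """Checksum only the leading whole 4-byte groups, as 'shr len, 2' would."""
    usable = (len(data) >> 2) << 2       # length rounded down to a multiple of 4
    return zlib.adler32(data[:usable])

def sample(n: int) -> bytes:
    """Hypothetical test data: n bytes, none of them zero."""
    return bytes((i % 255) + 1 for i in range(n))

if __name__ == "__main__":
    aligned = sample(4096)
    unaligned = sample(4096 + 41)        # 41 extra bytes leave 1 byte unprocessed
    print(adler32_whole_dwords(aligned) == zlib.adler32(aligned))      # True
    print(adler32_whole_dwords(unaligned) == zlib.adler32(unaligned))  # False
```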
Thomas, my test shows 5552 bytes can be processed
in the worst case without overflowing 32 bits.
or ecx,-1
mov eax,BASE-1
mov edx,BASE-1
@@: inc ecx
add eax,255
add edx,eax
jnc @B
; ECX is max bytes before overflow in the worst case: 5552
Or, am I missing something?
bitRAKE: That's the same number as nico had calculated, somehow my calculations were wrong...
All: Thanks for all your versions, I'll test each of them.
Thomas
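For anyone who wants to re-check the 5552 figure without an assembler, the same worst-case count can be written as a short Python loop: both sums start at BASE-1, every byte is 255, and the loop stops when s2 would no longer fit. The function name and the bits parameter are made up for this sketch.

```python
BASE = 65521

def max_bytes_before_overflow(bits: int = 32) -> int:
    """Largest byte count whose worst-case s2 still fits in `bits` bits."""
    limit = (1 << bits) - 1
    s1 = s2 = BASE - 1           # worst-case sums left over from a previous reduction
    n = 0
    while True:
        s1 += 255                # worst-case byte value
        s2 += s1
        if s2 > limit:           # corresponds to the carry tested by jnc
            return n
        n += 1

if __name__ == "__main__":
    print(max_bytes_before_overflow(32))   # 5552
```

With bits=31 (the point at which a signed overflow test such as jo would fire) the same loop returns 3854. That may be where the 963 iterations of four bytes in Svin2 come from, since 963*4 = 3852, but that connection is only a guess.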
I've tested all versions with a 100x loop, and made one new version based on Thomas3AndSvin, with bitRAKE's suggestion about the instruction order.
eax = 004031C0
------------------------------------------------------------------
eax = Nico : [9639F0C4], 1332 ms [100x2MB], 150.15 MB/s
eax = BitRAKE : [9639F0C4], 1722 ms [100x2MB], 116.14 MB/s
eax = BitRAKE2 : [9639F0C4], 1692 ms [100x2MB], 118.20 MB/s
eax = Thomas2 : [9639F0C4], 1453 ms [100x2MB], 137.64 MB/s
eax = Thomas3 : [9639F0C4], 801 ms [100x2MB], 249.68 MB/s
eax = Svin2 : [8B1CBB09], 741 ms [100x2MB], 269.90 MB/s
eax = Nico2 : [9639F0C4], 1091 ms [100x2MB], 183.31 MB/s
eax = Thomas3AndSvin : [9639F0C4], 781 ms [100x2MB], 256.08 MB/s
eax = Thomas3AndSvinAndBitRAKE : [9639F0C4], 772 ms [100x2MB], 259.06 MB/s
eax = BitRAKE3 : [9639F0C4], 230 ms [100x2MB], 869.56 MB/s
It's obvious that bitRAKE beats the rest with his Athlon version, but the other (latest) ones are reasonably fast as well. Could anyone test this on a Pentium?
Thomas
i tried to test the program on a pentium, but i'm getting the following error during linking:
test.obj : error LNK2001: unresolved external symbol _testData
help?
jademtech, download the attachment in this post above:
http://www.asmcommunity.net/board/showthread.php?s=&postid=31317.msg31317
...copy a 2MB+ file to the same directory, rename it to file.dat and execute file2obj.bat to create the object file. Then run make.bat.
Edit: Replace the test.asm file with the new one above before building.
thanks... now it assembles... but i get an application error:
The instruction at "0x00401122" referenced memory at "0x005fd000". The memory could not be "read".
i get this on both my (Win2K) PIII and the (WinNT) pentium i am testing it on.
bitRAKE3 requires a P4/Athlon, iirc.
Comment that one out, and build again.
Edit: No, that isn't correct - the prefetch instructions should be seen as NOPs. Wish I wasn't at work - I'd build a copy for you...
hehheh... win32Asming on company time ;) anyway, i tried commenting bitRAKE3 out, but it turns out i have problems whenever the TESTPROC macro is called.
Well, it's Friday and I wouldn't have the job if I couldn't do both at once. You have to comment out the TESTPROC line at the end with the bitRAKE3 in it.
i had already tried that and if i leave even *one* TESTPROC in (nico,bitRAKE,Thomas2,etc), i still get that app error. thanks for all this time you're spending to help me :)