That's very strange, no the proc should work on any data, aligned or unaligned, any size... As I have to run it on pieces of decoded PNG data and as these pieces can start at any point in the output stream and can have any size, I can't guarantee alignment..

Thomas
Posted on 2002-03-31 08:00:13 by Thomas
Yes I meant SIZE. Cause size is a parameter to tested procs it is very easy to check my words (I seemed wrote about it)
just go to testproc and change last argument so that it won't be aligned by mb
for example to 1024*1024*8+41
Posted on 2002-03-31 08:32:15 by The Svin
Size doesn't have to be aligned by MB (would be a pretty high requirement :), but at least by DWORD for some of the procs, as they are unrolled.
As I mentioned in my first post, you can split the input in parts and calculate the adler of the full data by feeding the previous adler to the proc that calculates the next part. So if you have one proc for 1 to 3 bytes (can be done without a loop), and one that needs a size which is a multiple of 4, you can first call the dword version, and then the 1-3 byte version for the remaining bytes.

Thomas
Posted on 2002-03-31 08:43:06 by Thomas

Try my mod code and see if it helps your speed, instead of div (very slow)!
DIV isn't so slow on Athlons, and Svin already came up with a good replacement for other processors. A table would be slower because the data isn't in the cache.

What about an MMX version? :)
Posted on 2002-03-31 11:05:12 by bitRAKE
An MMX version would certainly speed things up..


new_s1 = s1 + b1 + b2 + b3 + b4 + b5 + b6 + b7 + b8
new_s2 = s2 + 8*s1 + 8*b1 + 7*b2 + 6*b3 + 5*b4 + 4*b5 + 3*b6 + 2*b7 + b8

(Hope I got it right this way)
Looks possible with MMX..

Thomas
Posted on 2002-03-31 11:15:30 by Thomas
That is correct but slower (I think), do not
forget about the mod though.
You do not need to use multiply just the adds, and
maximize the inner loop, but I have not tested
the code with the worst case: a file of just
'db FF's

Also can we assume the size is a multiple of 2, or
4, or ..?

I thought about the MMX, but I have never used it,
and I do not know how to do a 64/128 bit integer
divide.

You can hard code the mod, but if the Athon is
faster then....
Posted on 2002-03-31 17:28:57 by bdjames
Imagine calculating the value of each byte from the start for s2, each iteration the byte value is added to s2 again until all the bytes are processed. So, if we were to index the bytes starting at the end byte = index 1, then s2 equals the sum of the (byte values * index value), plus (s1 * length of buffer).

s1 = (b1 + b2 + b3 + ... + bn + s1) % BASE
s2 = (b1*1 + b2*2 + b3*3 + ... + bn*n + n*s1 + s2) % BASE
(just thinking out loud)
Posted on 2002-03-31 23:31:04 by bitRAKE
I think it might get to complex with MMX. I first thought about shifting around the bits a little to get the multiplications but since you can't shift only 1 element in an MMX register that won't be easy.

Thomas
Posted on 2002-04-01 03:16:55 by Thomas
MMX version for Pentium III+ and Athlon processors.
8 bytes aligment for source data is required.



align 8
shift dw 0,-1,-2,-3

proc Nexo adler:DWORD, buf:DWORD, len:DWORD
mov edx,
mov ecx,
test ecx,ecx
je _done
push edi esi ebx ebp
mov edi,edx
and edi,0FFFFh
shr edx,16
pxor mm7,mm7
pcmpeqd mm2,mm2
mov esi,
mov ebx,0FFF1h
psllw mm2,2
_l1:
mov ebp,1580h
cmp ecx,ebp
cmovb ebp,ecx
sub ecx,ebp
movd mm4,edi
imul edi,ebp
movd mm6,ebp
add edi,edx
pshufw mm6,mm6,0
movd mm5,edi
paddw mm6,
_l2:
prefetchnta
movq mm0,
add esi,8
sub ebp,8
movq mm3,mm0
movq mm1,mm0
psadbw mm3,mm7
punpcklbw mm0,mm7
punpckhbw mm1,mm7
paddd mm4,mm3
pmaddwd mm0,mm6
paddw mm6,mm2
paddd mm5,mm0
pmaddwd mm1,mm6
paddw mm6,mm2
paddd mm5,mm1
jne _l2
movd eax,mm4
xor edx,edx
div ebx
mov edi,edx
pshufw mm6,mm5,1110b
paddd mm5,mm6
movd eax,mm5
xor edx,edx
div ebx
test ecx,ecx
jne _l1
shl edx,16
add edx,edi
pop ebp ebx esi edi
_done:
mov eax,edx
ret
endp



For AthlonXP 1700+ with DDR RAM:



BitRAKE2 : [519163C7], 422 ms [100x2MB], 473.93 MB/s
Thomas3 : [519163C7], 406 ms [100x2MB], 492.61 MB/s
Thomas3AndSvin : [519163C7], 422 ms [100x2MB], 473.93 MB/s
Thomas3AndSvinAndBitRAKE : [519163C7], 406 ms [100x2MB], 492.61 MB/s
BitRAKE3 : [519163C7], 219 ms [100x2MB], 913.24 MB/s
Nexo : [519163C7], 172 ms [100x2MB], 1162.79 MB/s

Posted on 2002-04-04 11:35:43 by Nexo
Nexo, bravo!!
Posted on 2002-04-04 11:46:42 by bitRAKE
:alright: Very well done Nexo!

Thomas
Posted on 2002-04-04 12:56:45 by Thomas
bitRake, I saw in your code:


next:
movzx eax, BYTE PTR [esi+0]
add edx, ecx
add ecx, eax

movzx eax, BYTE PTR [esi+1]
add edx, ecx
add ecx, eax

movzx eax, BYTE PTR [esi+2]
add edx, ecx
add ecx, eax

movzx eax, BYTE PTR [esi+3]


I wander, why not use movzx just once (first time) and then
mov al, ? It will save 3 bytes and part of eax after 7bit will have zero bits anyway?



next:
movzx eax, BYTE PTR [esi+0]
add edx, ecx
add ecx, eax

mov al, [esi+1]
add edx, ecx
add ecx, eax

mov al,[esi+2]
add edx, ecx
add ecx, eax

mov al, [esi+3]

Posted on 2002-04-07 08:00:39 by The Svin
Ups :)
Posted on 2002-04-07 09:08:18 by Nexo
BitRAKE adheres to rules of optimization under a Pentium ;)
Assembly/Compiler Coding Rule 46. (M impact, MH generality) Break dependences on portions of registers between instructions by operating on 32-bit registers instead of partial registers. For moves, this can be accomplished with 32-bit moves or by using movzx.
Posted on 2002-04-07 09:08:47 by Nexo
AFAIK from Intel docs partial regestry dependencies
somewhat different.
when you change one part. registry and use different part
of the same registry, processor still wait assuming that whole
register is being changed.
For example
mov al,dl
add cl,ah ;partial regestry stall though ah is not changed.
May be I miss something?

In the code above I don't see any dependency with eax not patial regestry,
not whole (bitRake did actually great work to removing dependencies so
I wander why still movxz?)
Posted on 2002-04-07 10:31:28 by The Svin
In documents all is told unambiguously.
It certainly all to refer to PPro and is higher.
VTune analyse result:
0  0FB606 movzx eax, BYTE PTR [esi]				

3 03D1 add edx, ecx
5 03C8 add ecx, eax
7 8A4601 mov al, BYTE PTR [esi+01h]
10 03D1 add edx, ecx
12 03C8 add ecx, eax PPro_Partial_Stall_eax:12-16
14 8A4602 mov al, BYTE PTR [esi+02h]
17 03D1 add edx, ecx
19 03C8 add ecx, eax PPro_Partial_Stall_eax:12-16
21 8A4603 mov al, BYTE PTR [esi+03h]
24 03D1 add edx, ecx
26 03C8 add ecx, eax PPro_Partial_Stall_eax:12-16
Posted on 2002-04-07 10:46:38 by Nexo
Addition for AMD processors (may be K6):
Avoid superset dependencies?Using the larger form of a register immediate after an instruction uses the smaller form creates a superset dependency and prevents parallel execution. For example, avoid the following type of code:
OR AH,07h
ADD EAX,1555555h
One method for avoiding superset dependencies is to schedule the instruction with the superset dependency (for example, the ADD instruction) 4?6 instructions later than would normally be preferable. Another method, useful in some cases, is to use the MOVZX instruction to efficiently convert a byte-size value to a doubleword-size value, which can then be combined with other values in 32-bit operations.
Posted on 2002-04-07 11:12:08 by Nexo
Thanx for info.
Posted on 2002-04-07 13:25:47 by The Svin
Nexo,

This is a great implementation of alder32. I would love to use this code.
Can you tell me what license you use.

Thanks.
Nitin.
Posted on 2007-01-02 14:22:03 by nitin