Here's a new challenge for all the optimisation gurus here: the adler32 algorithm.
This is a CRC like algorithm which is for example used as checksum for a decoded ZLIB stream. The algorithm is much more simple than CRC though. The C version is:


#define BASE 65521 /* largest prime smaller than 65536 */

unsigned long update_adler32(unsigned long adler,unsigned char *buf, int len)
{
unsigned long s1 = adler & 0xffff;
unsigned long s2 = (adler >> 16) & 0xffff;
int n;
for (n = 0; n < len; n++)
{
s1 = (s1 + buf[n]) % BASE;
s2 = (s2 + s1) % BASE;
}
return (s2 << 16) + s1;
}



This is an update function for an adler code, which means you can feed the previous adler code and proceed with it, along with a pointer to the buffer with the data and a len parameter containing the length of the data. Initially, the adler code is 1. Example: if you want to compute the adler value of a 100 byte buffer, call:
adler = update_adler32([b]1[/b], pointer_buffer,100);

you can also do the buffer in pieces:


adler = update_adler32([b]1[/b], pointer_buffer,20);
adler = update_adler32([b]adler[/b], pointer_buffer + 20,30); // next 30 bytes
adler = update_adler32([b]adler[/b], pointer_buffer + 50,50); // last 50 bytes

This code should produce the same value as the first version. Knowing this may be useful for optimizations, for example you can first process as much DWORD units as possible, and do the last bytes (len%4) seperately.

Here's the first, not much optimized, straight-forward translation in asm:


BASE equ 65521
mov ecx, adler
mov edi, len
mov esi, buf
mov edx, ecx
and ecx, 0ffffh ; ecx = s1
xor eax, eax
add edi, esi
shr edx, 16 ; edx = s2
@next:
mov al, [esi]
add ecx, eax ;s1=s1+buf[n]
cmp ecx, BASE
jb @F
sub ecx, BASE
@@:
add edx, ecx
inc esi
@@:
cmp edx, BASE
jb @F
sub edx, BASE
jmp @B
@@:
cmp esi, edi
jb @next
mov eax, edx
shl eax, 16
add eax, ecx
ret


One optimisation a C compiler will probably not think of, is replacing the modulus (and thus two slow divisions) by a simple compare and one or two substractions:


s1 = (s1 + buf[n]) % BASE;
s2 = (s2 + s1) % BASE;

Initially, s1 is the lower word of adler, so it can only be 65535 at most. buf can be 255 at most, which gives a total of 65535 +255 = 65790. No need to divide by BASE, as it either does or doesn't fit in (0 or 1).
s2 goes analogous, but because s2 and s1 are added, BASE may fit in twice...
After these two lines, s1 and s2 will obviously be < BASE so the rules described above will apply to every iteration.

This code can probably be much more optimized, any bright ideas are welcome. I'm also wondering if it would be possible to unroll the loop, although it's not easy because the s1 and s2 values depend on their previous values..

Thomas
Posted on 2002-03-27 10:49:20 by Thomas
Could you not move the modulo outside the loop?



{
unsigned long s1 = adler & 0xffff;
unsigned long s2 = (adler >> 16) & 0xffff;
int n;
for (n = 0; n < len; n++)
{
s1 = (s1 + buf[n]);
s2 = (s2 + s1);
}
return ((s2 % BASE) << 16) + (s1 % BASE);
}


Mirno
Posted on 2002-03-27 11:14:04 by Mirno
I thought about that but I don't think it will work as eventually s1 or s2 will become bigger than 2^32-1 and overflow... For small data sets it will probably work.. I'll test it on a big one.

Thomas
Posted on 2002-03-27 11:45:55 by Thomas
You can unroll the loop though, here's the math that proves the formulas used:


---------------
unroll:
s1 = (s1 + buf[n+0])
s2 = (s2 + s1)

s1 = (s1 + buf[n+1])
s2 = (s2 + s1)

s1 = (s1 + buf[n+2])
s2 = (s2 + s1)

s1 = (s1 + buf[n+3])
s2 = (s2 + s1)
-----------------
change names:
A = (s1 + buf[n+0])
B = (s2 + A)

C = (A + buf[n+1])
D = (B + C)

E = (C + buf[n+2])
F = (D + E)

G = (E + buf[n+3])
H = (F + G)

new s1 = G
new s2 = H
------------------
substitute:


E =
F =

G = ((((s1 + buf[n+0]) + buf[n+1]) + buf[n+2]) + buf[n+3])
H = ((((s2 + (s1 + buf[n+0])) + ((s1 + buf[n+0]) + buf[n+1])) + (((s1 + buf[n+0]) + buf[n+1]) + buf[n+2])) + G)

new s1 = G
new s2 = H
------------------
simplify:
(b0=buf[n+0], b1=buf[n+1]....)

G = s1 + buf[n+0]+ buf[n+1] + buf[n+2] + buf[n+3]
= s1 + {b0+b1+b2+b3}
H = s2 + s1 + buf[n+0] + s1 + buf[n+0] + buf[n+1] + s1 + buf[n+0] + buf[n+1] + buf[n+2] + G
= s2 + 3*s1 + 3*b0 + 2*b1 + b2 + G

new_s1 = s1 + {b0+b1+b2+b3}
new_s2 = s2 + 3*s1 + 3*b0 + 2*b1 + b2 + new_s1



This code does work on a test set of 2,5 MB.. size has to be a multiple of 4 but as I said you can use the original proc for the remaining bytes because of the nature of adler32.
 + buf + buf + buf;

unsigned long new_s2 = s2 + 3*s1 + 3*buf + 2*buf + buf + new_s1;

s1 = new_s1;
s2 = new_s2;

s1 %= BASE;
s2 %= BASE;

}
return (s2 << 16) + s1;
}


Thomas
Posted on 2002-03-27 12:11:39 by Thomas
Here's one that is unrolled, not that much faster though..


mov ecx, adler
mov edi, len
mov esi, buf
mov edx, ecx
and ecx, 0ffffh ; ecx = s1
xor eax, eax
add edi, esi
shr edx, 16 ; edx = s2
xor ebx, ebx

next:
;s1 = (s1 + buf[n]) % BASE;
;s2 = (s2 + s1) % BASE ;

mov al, [esi+0]
mov bl, [esi+1]


add ecx, eax
add edx, ecx
add ecx, ebx
add edx, ecx

mov al, [esi+2]
mov bl, [esi+3]

add ecx, eax
add edx, ecx
add ecx, ebx
add edx, ecx

@@:
cmp ecx, BASE
jb @F
sub ecx, BASE
jmp @B

@@:
add esi, 4
@@:
cmp edx, BASE
jb @F
sub edx, BASE
jmp @B
@@:
cmp esi, edi
jb next
mov eax, edx
shl eax, 16
add eax, ecx
ret


Thomas
Posted on 2002-03-27 12:58:04 by Thomas
I haven't tested this, but the theme should be appearant:
update_adler32 PROC uses ebx esi edi, adler:DWORD, buf:DWORD, len:DWORD

mov eax,adler
mov ecx,buf
mov edx,eax
shr eax,16
and edx,0FFFFh

; xor ebx,ebx
sub eax,edx
jmp _x
_0:
; mov bl,[ecx]
movzx ebx, BYTE PTR [ecx]
inc ecx
add eax,edx
add edx,ebx
cmp eax,BASE + 1
sbb edi,edi
cmp edx,BASE + 1
sbb esi,esi
and edi,BASE
and esi,BASE
sub eax,edi ; values are restricted to:
sub edx,esi ; [0, BASE)

_x: dec len
jns _0

add eax,edx
cmp eax,BASE + 1
sbb esi,esi
and esi,BASE
sub eax,esi

shl eax,16
mov ax,dx
ret
update_adler32 ENDP
BASE could be in a register as well, if EDI is replaced with EBX. There is also a way to eliminate the CMP instructions at the expensive of having to do the modulus at the end - I'll post again later.
Posted on 2002-03-27 21:33:40 by bitRAKE
The mathematics makes it look simple. If you could
assume that you would not get an overflow, then
it is just a matter of adding to the end and then
dividing (like Mirno said). However, S2 grow exponentially.

I tried to find a relation with BASE and the
register base (4294967295) that would allow a
simple conversion at the end.

Maybe using the mmx registers, assuming that you
could get x number of iterations before you needed
to divide could help.

S2 = (ADL* + n*ADL + n*A[0] + (n-1)*A[1] ... A)%BASE

S1 = (ADL + A[0] + A[1] ... A)%BASE
Posted on 2002-03-28 02:10:25 by bdjames
I've tested all versions I have so far, but the differences aren't that big.. Recently someone (Nico) contacted me about my post on huffman codes as he has been working on a PNG decoder as well. He had written it's own adler32 procedure, which was the fastest so far.
He also pointed out that I made an error in my first post about adler32. S2 will always be smaller than 2xBASE so the extra jump isn't necessary.


Nico's version:

GetAdler proc lpData:DWORD,DataSize:DWORD
push ebx
push esi
xor eax,eax
inc eax
xor ebx,ebx
mov esi,lpData
Loop1:
cmp DataSize,0
jz EndLoop1
cmp DataSize,5552
ja MaxReached
mov edx,DataSize
jmp Cont
MaxReached:
mov edx,5552
Cont:
sub DataSize,edx
Loop2:
test edx,edx
jz EndLoop2
dec edx
xor ecx,ecx
mov cl,[esi]
add eax,ecx
inc esi
add ebx,eax
jmp Loop2
EndLoop2:
mov ecx,65521
xor edx,edx
div ecx
push edx
mov eax,ebx
xor edx,edx
div ecx
mov ebx,edx
pop eax
jmp Loop1
EndLoop1:
shl ebx,16
or eax,ebx
ExitProc:
pop esi
pop ebx
ret
GetAdler endp


Thomas
Posted on 2002-03-28 03:51:36 by Thomas
Here's the timing of the versions so far:


C-Version: 7C068FF0, time = 1943 ms for 10 loops
Nico: 7C068FF0, time = 150 ms for 10 loops
Thomas: 7C068FF0, time = 210 ms for 10 loops
Thomas2: 7C068FF0, time = 201 ms for 10 loops
bitRAKE: 20CF8FFF, time = 210 ms for 10 loops
Thomas3: 7C068FF0, time = 120 ms for 10 loops

Using getTickCount around a loop that calls the adler function 10 times, using a data buffer of 2.83MB (2972436 bytes)..

BitRAKE: Your version did not produce the right adler code (see hex values above).

Thomas3 is this version:


mov ecx, adler
mov esi, buf
mov edx, ecx
and ecx, 0ffffh ; ecx = s1
xor eax, eax
shr edx, 16 ; edx = s2
mov ebx, BASE
shr len, 2
_l1:
cmp len, 0
jz _done

mov edi, 963
cmp len, edi
ja _b2
mov edi, len
_b2:
sub len, edi
next:
mov al, [esi+0]
add ecx, eax
add edx, ecx

mov al, [esi+1]
add ecx, eax
add edx, ecx

mov al, [esi+2]
add ecx, eax
add edx, ecx

mov al, [esi+3]
add ecx, eax
add esi, 4
add edx, ecx

dec edi
jnz next

mov eax, edx
xor edx, edx
div ebx
mov edi, edx
mov eax, ecx
xor edx, edx
div ebx
mov ecx, edx
mov edx, edi

jmp _l1
_done:

mov eax, edx
shl eax, 16
add eax, ecx
ret


The inner loop does 963 dwords without MOD, then MODs the results and does the next 963 dwords. Even in the worst case, the inner loop will never overflow ecx and edx.

Thomas
Posted on 2002-03-28 03:57:00 by Thomas
Thomas, I think you can use mul instead of div, which speeds up
things greatly.
Magic multiplier to get qu ob reg\base = 80078071h
put it in ebx instead of base.
To divide value in eax on base do:
mov edx,ebx
mul ebx
shr edx,15
;edx = quantiant of eax/base
to get reminder now
mul result in by base and sub the result from value that
was in eax
.data
base dd 65521
.code
mov edi,edx ;devident
mov edx,ebx ;= 80078071h
mul ebx
mov eax,edx
xor edx,edx
shr eax,15
mul base
sub edi,eax
;and so on with second div

It may be further optimized - I am just giving main idea
Posted on 2002-03-28 05:41:56 by The Svin
The Svin: I tried your code, but it doesn't produce the right results... I've walked through the code and that time it divided correctly, but the final adler value isn't correct...



mov ecx, adler
mov esi, buf
mov edx, ecx
and ecx, 0ffffh ; ecx = s1
xor eax, eax
shr edx, 16 ; edx = s2
mov ebx, 80078071h
shr len, 2
_l1:
cmp len, 0
jz _done

mov edi, 963
cmp len, edi
ja _b2
mov edi, len
_b2:
sub len, edi
next:
mov al, [esi+0]
add ecx, eax
add edx, ecx

mov al, [esi+1]
add ecx, eax
add edx, ecx

mov al, [esi+2]
add ecx, eax
add edx, ecx

mov al, [esi+3]
add ecx, eax
add esi, 4
add edx, ecx

dec edi
jnz next

mov edi,edx ;devident
mov eax, edx
mov edx,ebx ;= 80078071h
mul ebx
mov eax,edx
xor edx,edx
shr eax,15
push 65521
mul dword ptr [esp]
sub edi,eax
mov edx, edi
add esp, 4

push edx

mov edi,ecx ;devident
mov eax, ecx
mov edx,ebx ;= 80078071h
mul ebx
mov eax,edx
xor edx,edx
shr eax,15
push 65521
mul dword ptr [esp]
sub edi,eax
mov ecx, edi
add esp, 4

pop edx



jmp _l1
_done:

mov eax, edx
shl eax, 16
add eax, ecx


Thomas
Posted on 2002-03-28 06:06:12 by Thomas


mov ecx, adler
mov esi, buf
mov edx, ecx
and ecx, 0ffffh ; ecx = s1
xor eax, eax
shr edx, 16 ; edx = s2
mov ebx, 80078071h
shr len, 2
_l1:
cmp len, 0
jz _done

mov edi, 963
cmp len, edi
ja _b2
mov edi, len
_b2:
sub len, edi
next:
mov al, [esi+0]
add ecx, eax
add edx, ecx

mov al, [esi+1]
add ecx, eax
add edx, ecx

mov al, [esi+2]
add ecx, eax
add edx, ecx

mov al, [esi+3]
add ecx, eax
add esi, 4
add edx, ecx

dec edi
jnz next
.data
ALIGN 4
base dd 65521
.code
mov edi,edx ;devident
mov edx,ebx ;= 80078071h
mul ebx
mov eax,edx
xor edx,edx
shr eax,15
mul base
sub edi,eax
mov eax,ecx
mov edx,ebx
mul ebx
mov eax,edx
xor edx,edx
shr eax,15
mul base
sub ecx,eax
mov edx,edi

jmp _l1
_done:

mov eax, edx
shl eax, 16
add eax, ecx
ret
Posted on 2002-03-28 06:22:59 by The Svin
It still gives the wrong value :(
Posted on 2002-03-28 07:14:20 by Thomas
Please, send me testing code with any proc inside it wich produce right values, I'll try write code that catches the moment of difference.
Posted on 2002-03-28 07:36:09 by The Svin
I've converted the test code from C to asm so the C version isn't in the test anymore.
To test it choose a big file (at least 2MB) from your harddisk, copy it as file.dat in the test dir, run file2obj which will use f0dder's bin2o to create blah.obj. The test will use the first 2MB of the file to test.
edit: updated it to let it use vkim's debug window

Thomas
Posted on 2002-03-28 08:17:18 by Thomas
I wrote simple test (run only inside debugger!!!)


.data
base dd 65521
.code
start:
xor ebx,ebx
mov ecx,80078071h
@@: inc ebx
xor edx,edx
mov eax,ebx
div base
mov edi,edx
mov edx,ecx
mov esi,ebx
mov eax,ebx
mul edx
mov eax,edx
xor edx,edx
shr eax,15
mul base
sub esi,eax
cmp edi,esi
je @B
mov eax,eax ;!set brakepoint here and press F9
call ExitProcess

Test showed that reminders got by both div and my method
are absolutly identical. I wasn't been able to get to mov eax,eax
(Alt x was the only way out ;)
Error somewhere out of this part - I let you know later when
catch it though your test prog.
Posted on 2002-03-28 10:34:30 by The Svin
I used file 1075239 bytes and got this results:


eax = Nico : code:7A298223, 451 ms for 10 loops on 2MB of data
eax = BitRAKE : code:7BAC8232, 721 ms for 10 loops on 2MB of data
eax = Thomas2 : code:FC4C8223, 310 ms for 10 loops on 2MB of data
eax = Thomas3 : code:F3CE8108, 190 ms for 10 loops on 2MB of data
eax = Svin2 : code:02D56DB8, 181 ms for 10 loops on 2MB of data


None of the same code :)
Posted on 2002-03-28 12:57:54 by The Svin
I found 1 place now final value (from ecx part) the same:
(the thing was that in your code part of eax (ah +) has
quontient after last devision and I changed using eax to multiply quontient):


eax = Nico : code:C9AD7C71, 431 ms for 10 loops on 2MB of data
eax = BitRAKE : code:C0B67C80, 711 ms for 10 loops on 2MB of data
eax = Thomas2 : code:C9AD7C71, 310 ms for 10 loops on 2MB of data
eax = Thomas3 : code:C9AD7C71, 190 ms for 10 loops on 2MB of data
eax = Svin2 : code:BE827C71, 181 ms for 10 loops on 2MB of data


Part of work is done.
Posted on 2002-03-28 13:33:43 by The Svin
That's not strange if you use only 1 MB of data. You should use a file of at least 2MB..
Only bitRAKE's and yours should not produce the right values:


eax = Nico : code:9639F0C4, 190 ms for 10 loops on 2MB of data
eax = BitRAKE : code:0B5AF0D3, 151 ms for 10 loops on 2MB of data
eax = Thomas2 : code:9639F0C4, 130 ms for 10 loops on 2MB of data
eax = Thomas3 : code:9639F0C4, 70 ms for 10 loops on 2MB of data
eax = Svin2 : code:8B1CBB09, 80 ms for 10 loops on 2MB of data



Thomas
Posted on 2002-03-28 13:34:10 by Thomas
Lower word is correct, one word to go :) It seems to be a little faster than my latest version. What processor do you use?

Thomas
Posted on 2002-03-28 13:36:14 by Thomas