Here's a new challenge for all the optimisation gurus here: the adler32 algorithm.

This is a CRC like algorithm which is for example used as checksum for a decoded ZLIB stream. The algorithm is much more simple than CRC though. The C version is:

This is an update function for an adler code, which means you can feed the previous adler code and proceed with it, along with a pointer to the buffer with the data and a len parameter containing the length of the data. Initially, the adler code is 1. Example: if you want to compute the adler value of a 100 byte buffer, call:

you can also do the buffer in pieces:

This code should produce the same value as the first version. Knowing this may be useful for optimizations, for example you can first process as much DWORD units as possible, and do the last bytes (len%4) seperately.

Here's the first, not much optimized, straight-forward translation in asm:

One optimisation a C compiler will probably not think of, is replacing the modulus (and thus two slow divisions) by a simple compare and one or two substractions:

Initially, s1 is the lower word of adler, so it can only be 65535 at most. buf can be 255 at most, which gives a total of 65535 +255 = 65790. No need to divide by BASE, as it either does or doesn't fit in (0 or 1).

s2 goes analogous, but because s2 and s1 are added, BASE may fit in twice...

After these two lines, s1 and s2 will obviously be < BASE so the rules described above will apply to every iteration.

This code can probably be much more optimized, any bright ideas are welcome. I'm also wondering if it would be possible to unroll the loop, although it's not easy because the s1 and s2 values depend on their previous values..

Thomas

This is a CRC like algorithm which is for example used as checksum for a decoded ZLIB stream. The algorithm is much more simple than CRC though. The C version is:

```
```

#define BASE 65521 /* largest prime smaller than 65536 */

unsigned long update_adler32(unsigned long adler,unsigned char *buf, int len)

{

unsigned long s1 = adler & 0xffff;

unsigned long s2 = (adler >> 16) & 0xffff;

int n;

for (n = 0; n < len; n++)

{

s1 = (s1 + buf[n]) % BASE;

s2 = (s2 + s1) % BASE;

}

return (s2 << 16) + s1;

}

This is an update function for an adler code, which means you can feed the previous adler code and proceed with it, along with a pointer to the buffer with the data and a len parameter containing the length of the data. Initially, the adler code is 1. Example: if you want to compute the adler value of a 100 byte buffer, call:

`adler = update_adler32([b]1[/b], pointer_buffer,100);`

you can also do the buffer in pieces:

```
```

adler = update_adler32([b]1[/b], pointer_buffer,20);

adler = update_adler32([b]adler[/b], pointer_buffer + 20,30); // next 30 bytes

adler = update_adler32([b]adler[/b], pointer_buffer + 50,50); // last 50 bytes

This code should produce the same value as the first version. Knowing this may be useful for optimizations, for example you can first process as much DWORD units as possible, and do the last bytes (len%4) seperately.

Here's the first, not much optimized, straight-forward translation in asm:

```
```

BASE equ 65521

mov ecx, adler

mov edi, len

mov esi, buf

mov edx, ecx

and ecx, 0ffffh ; ecx = s1

xor eax, eax

add edi, esi

shr edx, 16 ; edx = s2

@next:

mov al, [esi]

add ecx, eax ;s1=s1+buf[n]

cmp ecx, BASE

jb @F

sub ecx, BASE

@@:

add edx, ecx

inc esi

@@:

cmp edx, BASE

jb @F

sub edx, BASE

jmp @B

@@:

cmp esi, edi

jb @next

mov eax, edx

shl eax, 16

add eax, ecx

ret

One optimisation a C compiler will probably not think of, is replacing the modulus (and thus two slow divisions) by a simple compare and one or two substractions:

```
```

s1 = (s1 + buf[n]) % BASE;

s2 = (s2 + s1) % BASE;

Initially, s1 is the lower word of adler, so it can only be 65535 at most. buf can be 255 at most, which gives a total of 65535 +255 = 65790. No need to divide by BASE, as it either does or doesn't fit in (0 or 1).

s2 goes analogous, but because s2 and s1 are added, BASE may fit in twice...

After these two lines, s1 and s2 will obviously be < BASE so the rules described above will apply to every iteration.

This code can probably be much more optimized, any bright ideas are welcome. I'm also wondering if it would be possible to unroll the loop, although it's not easy because the s1 and s2 values depend on their previous values..

Thomas

Could you not move the modulo outside the loop?

Mirno

```
```

{

unsigned long s1 = adler & 0xffff;

unsigned long s2 = (adler >> 16) & 0xffff;

int n;

for (n = 0; n < len; n++)

{

s1 = (s1 + buf[n]);

s2 = (s2 + s1);

}

return ((s2 % BASE) << 16) + (s1 % BASE);

}

Mirno

I thought about that but I don't think it will work as eventually s1 or s2 will become bigger than 2^32-1 and overflow... For small data sets it will probably work.. I'll test it on a big one.

Thomas

Thomas

You can unroll the loop though, here's the math that proves the formulas used:

This code does work on a test set of 2,5 MB.. size has to be a multiple of 4 but as I said you can use the original proc for the remaining bytes because of the nature of adler32.

Thomas

```
```

---------------

unroll:

s1 = (s1 + buf[n+0])

s2 = (s2 + s1)

s1 = (s1 + buf[n+1])

s2 = (s2 + s1)

s1 = (s1 + buf[n+2])

s2 = (s2 + s1)

s1 = (s1 + buf[n+3])

s2 = (s2 + s1)

-----------------

change names:

A = (s1 + buf[n+0])

B = (s2 + A)

C = (A + buf[n+1])

D = (B + C)

E = (C + buf[n+2])

F = (D + E)

G = (E + buf[n+3])

H = (F + G)

new s1 = G

new s2 = H

------------------

substitute:

E =

F =

G = ((((s1 + buf[n+0]) + buf[n+1]) + buf[n+2]) + buf[n+3])

H = ((((s2 + (s1 + buf[n+0])) + ((s1 + buf[n+0]) + buf[n+1])) + (((s1 + buf[n+0]) + buf[n+1]) + buf[n+2])) + G)

new s1 = G

new s2 = H

------------------

simplify:

(b0=buf[n+0], b1=buf[n+1]....)

G = s1 + buf[n+0]+ buf[n+1] + buf[n+2] + buf[n+3]

= s1 + {b0+b1+b2+b3}

H = s2 + s1 + buf[n+0] + s1 + buf[n+0] + buf[n+1] + s1 + buf[n+0] + buf[n+1] + buf[n+2] + G

= s2 + 3*s1 + 3*b0 + 2*b1 + b2 + G

new_s1 = s1 + {b0+b1+b2+b3}

new_s2 = s2 + 3*s1 + 3*b0 + 2*b1 + b2 + new_s1

This code does work on a test set of 2,5 MB.. size has to be a multiple of 4 but as I said you can use the original proc for the remaining bytes because of the nature of adler32.

```
+ buf + buf + buf;
```

unsigned long new_s2 = s2 + 3*s1 + 3*buf + 2*buf + buf + new_s1;

s1 = new_s1;

s2 = new_s2;

s1 %= BASE;

s2 %= BASE;

}

return (s2 << 16) + s1;

}

Thomas

Here's one that is unrolled, not that much faster though..

Thomas

```
```

mov ecx, adler

mov edi, len

mov esi, buf

mov edx, ecx

and ecx, 0ffffh ; ecx = s1

xor eax, eax

add edi, esi

shr edx, 16 ; edx = s2

xor ebx, ebx

next:

;s1 = (s1 + buf[n]) % BASE;

;s2 = (s2 + s1) % BASE ;

mov al, [esi+0]

mov bl, [esi+1]

add ecx, eax

add edx, ecx

add ecx, ebx

add edx, ecx

mov al, [esi+2]

mov bl, [esi+3]

add ecx, eax

add edx, ecx

add ecx, ebx

add edx, ecx

@@:

cmp ecx, BASE

jb @F

sub ecx, BASE

jmp @B

@@:

add esi, 4

@@:

cmp edx, BASE

jb @F

sub edx, BASE

jmp @B

@@:

cmp esi, edi

jb next

mov eax, edx

shl eax, 16

add eax, ecx

ret

Thomas

I haven't tested this, but the theme should be appearant:

```
update_adler32 PROC uses ebx esi edi, adler:DWORD, buf:DWORD, len:DWORD
```

mov eax,adler

mov ecx,buf

mov edx,eax

shr eax,16

and edx,0FFFFh

; xor ebx,ebx

sub eax,edx

jmp _x

_0:

; mov bl,[ecx]

movzx ebx, BYTE PTR [ecx]

inc ecx

add eax,edx

add edx,ebx

cmp eax,BASE + 1

sbb edi,edi

cmp edx,BASE + 1

sbb esi,esi

and edi,BASE

and esi,BASE

sub eax,edi ; values are restricted to:

sub edx,esi ; [0, BASE)

_x: dec len

jns _0

add eax,edx

cmp eax,BASE + 1

sbb esi,esi

and esi,BASE

sub eax,esi

shl eax,16

mov ax,dx

ret

update_adler32 ENDP

BASE could be in a register as well, if EDI is replaced with EBX. There is also a way to eliminate the CMP instructions at the expensive of having to do the modulus at the end - I'll post again later.The mathematics makes it look simple. If you could

assume that you would not get an overflow, then

it is just a matter of adding to the end and then

dividing (like Mirno said). However, S2 grow exponentially.

I tried to find a relation with BASE and the

register base (4294967295) that would allow a

simple conversion at the end.

Maybe using the mmx registers, assuming that you

could get x number of iterations before you needed

to divide could help.

S2 = (ADL* + n*ADL + n*A[0] + (n-1)*A[1] ... A)%BASE

S1 = (ADL + A[0] + A[1] ... A)%BASE

assume that you would not get an overflow, then

it is just a matter of adding to the end and then

dividing (like Mirno said). However, S2 grow exponentially.

I tried to find a relation with BASE and the

register base (4294967295) that would allow a

simple conversion at the end.

Maybe using the mmx registers, assuming that you

could get x number of iterations before you needed

to divide could help.

S2 = (ADL* + n*ADL + n*A[0] + (n-1)*A[1] ... A)%BASE

S1 = (ADL + A[0] + A[1] ... A)%BASE

I've tested all versions I have so far, but the differences aren't that big.. Recently someone (Nico) contacted me about my post on huffman codes as he has been working on a PNG decoder as well. He had written it's own adler32 procedure, which was the fastest so far.

He also pointed out that I made an error in my first post about adler32. S2 will always be smaller than 2xBASE so the extra jump isn't necessary.

Thomas

He also pointed out that I made an error in my first post about adler32. S2 will always be smaller than 2xBASE so the extra jump isn't necessary.

```
```

Nico's version:

GetAdler proc lpData:DWORD,DataSize:DWORD

push ebx

push esi

xor eax,eax

inc eax

xor ebx,ebx

mov esi,lpData

Loop1:

cmp DataSize,0

jz EndLoop1

cmp DataSize,5552

ja MaxReached

mov edx,DataSize

jmp Cont

MaxReached:

mov edx,5552

Cont:

sub DataSize,edx

Loop2:

test edx,edx

jz EndLoop2

dec edx

xor ecx,ecx

mov cl,[esi]

add eax,ecx

inc esi

add ebx,eax

jmp Loop2

EndLoop2:

mov ecx,65521

xor edx,edx

div ecx

push edx

mov eax,ebx

xor edx,edx

div ecx

mov ebx,edx

pop eax

jmp Loop1

EndLoop1:

shl ebx,16

or eax,ebx

ExitProc:

pop esi

pop ebx

ret

GetAdler endp

Thomas

Here's the timing of the versions so far:

Using getTickCount around a loop that calls the adler function 10 times, using a data buffer of 2.83MB (2972436 bytes)..

Thomas3 is this version:

The inner loop does 963 dwords without MOD, then MODs the results and does the next 963 dwords. Even in the worst case, the inner loop will never overflow ecx and edx.

Thomas

```
```

C-Version: 7C068FF0, time = 1943 ms for 10 loops

Nico: 7C068FF0, time = 150 ms for 10 loops

Thomas: 7C068FF0, time = 210 ms for 10 loops

Thomas2: 7C068FF0, time = 201 ms for 10 loops

bitRAKE: 20CF8FFF, time = 210 ms for 10 loops

Thomas3: 7C068FF0, time = 120 ms for 10 loops

Using getTickCount around a loop that calls the adler function 10 times, using a data buffer of 2.83MB (2972436 bytes)..

**BitRAKE:**Your version did not produce the right adler code (see hex values above).Thomas3 is this version:

```
```

mov ecx, adler

mov esi, buf

mov edx, ecx

and ecx, 0ffffh ; ecx = s1

xor eax, eax

shr edx, 16 ; edx = s2

mov ebx, BASE

shr len, 2

_l1:

cmp len, 0

jz _done

mov edi, 963

cmp len, edi

ja _b2

mov edi, len

_b2:

sub len, edi

next:

mov al, [esi+0]

add ecx, eax

add edx, ecx

mov al, [esi+1]

add ecx, eax

add edx, ecx

mov al, [esi+2]

add ecx, eax

add edx, ecx

mov al, [esi+3]

add ecx, eax

add esi, 4

add edx, ecx

dec edi

jnz next

mov eax, edx

xor edx, edx

div ebx

mov edi, edx

mov eax, ecx

xor edx, edx

div ebx

mov ecx, edx

mov edx, edi

jmp _l1

_done:

mov eax, edx

shl eax, 16

add eax, ecx

ret

The inner loop does 963 dwords without MOD, then MODs the results and does the next 963 dwords. Even in the worst case, the inner loop will never overflow ecx and edx.

Thomas

Thomas, I think you can use mul instead of div, which speeds up

things greatly.

Magic multiplier to get qu ob reg\base = 80078071h

put it in ebx instead of base.

To divide value in eax on base do:

mov edx,ebx

mul ebx

shr edx,15

;edx = quantiant of eax/base

to get reminder now

mul result in by base and sub the result from value that

was in eax

.data

base dd 65521

.code

mov edi,edx ;devident

mov edx,ebx ;= 80078071h

mul ebx

mov eax,edx

xor edx,edx

shr eax,15

mul base

sub edi,eax

;and so on with second div

It may be further optimized - I am just giving main idea

things greatly.

Magic multiplier to get qu ob reg\base = 80078071h

put it in ebx instead of base.

To divide value in eax on base do:

mov edx,ebx

mul ebx

shr edx,15

;edx = quantiant of eax/base

to get reminder now

mul result in by base and sub the result from value that

was in eax

.data

base dd 65521

.code

mov edi,edx ;devident

mov edx,ebx ;= 80078071h

mul ebx

mov eax,edx

xor edx,edx

shr eax,15

mul base

sub edi,eax

;and so on with second div

It may be further optimized - I am just giving main idea

**The Svin**: I tried your code, but it doesn't produce the right results... I've walked through the code and that time it divided correctly, but the final adler value isn't correct...

```
```

mov ecx, adler

mov esi, buf

mov edx, ecx

and ecx, 0ffffh ; ecx = s1

xor eax, eax

shr edx, 16 ; edx = s2

mov ebx, 80078071h

shr len, 2

_l1:

cmp len, 0

jz _done

mov edi, 963

cmp len, edi

ja _b2

mov edi, len

_b2:

sub len, edi

next:

mov al, [esi+0]

add ecx, eax

add edx, ecx

mov al, [esi+1]

add ecx, eax

add edx, ecx

mov al, [esi+2]

add ecx, eax

add edx, ecx

mov al, [esi+3]

add ecx, eax

add esi, 4

add edx, ecx

dec edi

jnz next

mov edi,edx ;devident

mov eax, edx

mov edx,ebx ;= 80078071h

mul ebx

mov eax,edx

xor edx,edx

shr eax,15

push 65521

mul dword ptr [esp]

sub edi,eax

mov edx, edi

add esp, 4

push edx

mov edi,ecx ;devident

mov eax, ecx

mov edx,ebx ;= 80078071h

mul ebx

mov eax,edx

xor edx,edx

shr eax,15

push 65521

mul dword ptr [esp]

sub edi,eax

mov ecx, edi

add esp, 4

pop edx

jmp _l1

_done:

mov eax, edx

shl eax, 16

add eax, ecx

Thomas

```
```

mov ecx, adler

mov esi, buf

mov edx, ecx

and ecx, 0ffffh ; ecx = s1

xor eax, eax

shr edx, 16 ; edx = s2

mov ebx, 80078071h

shr len, 2

_l1:

cmp len, 0

jz _done

mov edi, 963

cmp len, edi

ja _b2

mov edi, len

_b2:

sub len, edi

next:

mov al, [esi+0]

add ecx, eax

add edx, ecx

mov al, [esi+1]

add ecx, eax

add edx, ecx

mov al, [esi+2]

add ecx, eax

add edx, ecx

mov al, [esi+3]

add ecx, eax

add esi, 4

add edx, ecx

dec edi

jnz next

.data

ALIGN 4

base dd 65521

.code

mov edi,edx ;devident

mov edx,ebx ;= 80078071h

mul ebx

mov eax,edx

xor edx,edx

shr eax,15

mul base

sub edi,eax

mov eax,ecx

mov edx,ebx

mul ebx

mov eax,edx

xor edx,edx

shr eax,15

mul base

sub ecx,eax

mov edx,edi

jmp _l1

_done:

mov eax, edx

shl eax, 16

add eax, ecx

ret

It still gives the wrong value :(

Please, send me testing code with any proc inside it wich produce right values, I'll try write code that catches the moment of difference.

I've converted the test code from C to asm so the C version isn't in the test anymore.

To test it choose a big file (at least 2MB) from your harddisk, copy it as file.dat in the test dir, run file2obj which will use f0dder's bin2o to create blah.obj. The test will use the first 2MB of the file to test.

Thomas

To test it choose a big file (at least 2MB) from your harddisk, copy it as file.dat in the test dir, run file2obj which will use f0dder's bin2o to create blah.obj. The test will use the first 2MB of the file to test.

**edit: updated it to let it use vkim's debug window**Thomas

I wrote simple test (run only inside debugger!!!)

Test showed that reminders got by both div and my method

are absolutly identical. I wasn't been able to get to mov eax,eax

(Alt x was the only way out ;)

Error somewhere out of this part - I let you know later when

catch it though your test prog.

```
```

.data

base dd 65521

.code

start:

xor ebx,ebx

mov ecx,80078071h

@@: inc ebx

xor edx,edx

mov eax,ebx

div base

mov edi,edx

mov edx,ecx

mov esi,ebx

mov eax,ebx

mul edx

mov eax,edx

xor edx,edx

shr eax,15

mul base

sub esi,eax

cmp edi,esi

je @B

mov eax,eax ;!set brakepoint here and press F9

call ExitProcess

Test showed that reminders got by both div and my method

are absolutly identical. I wasn't been able to get to mov eax,eax

(Alt x was the only way out ;)

Error somewhere out of this part - I let you know later when

catch it though your test prog.

I used file 1075239 bytes and got this results:

None of the same code :)

```
```

eax = Nico : code:7A298223, 451 ms for 10 loops on 2MB of data

eax = BitRAKE : code:7BAC8232, 721 ms for 10 loops on 2MB of data

eax = Thomas2 : code:FC4C8223, 310 ms for 10 loops on 2MB of data

eax = Thomas3 : code:F3CE8108, 190 ms for 10 loops on 2MB of data

eax = Svin2 : code:02D56DB8, 181 ms for 10 loops on 2MB of data

None of the same code :)

I found 1 place now final value (from ecx part) the same:

(the thing was that in your code part of eax (ah +) has

quontient after last devision and I changed using eax to multiply quontient):

Part of work is done.

(the thing was that in your code part of eax (ah +) has

quontient after last devision and I changed using eax to multiply quontient):

```
```

eax = Nico : code:C9AD7C71, 431 ms for 10 loops on 2MB of data

eax = BitRAKE : code:C0B67C80, 711 ms for 10 loops on 2MB of data

eax = Thomas2 : code:C9AD7C71, 310 ms for 10 loops on 2MB of data

eax = Thomas3 : code:C9AD7C71, 190 ms for 10 loops on 2MB of data

eax = Svin2 : code:BE827C71, 181 ms for 10 loops on 2MB of data

Part of work is done.

That's not strange if you use only 1 MB of data. You should use a file of at least 2MB..

Only bitRAKE's and yours should not produce the right values:

Thomas

Only bitRAKE's and yours should not produce the right values:

```
```

eax = Nico : code:9639F0C4, 190 ms for 10 loops on 2MB of data

eax = BitRAKE : code:0B5AF0D3, 151 ms for 10 loops on 2MB of data

eax = Thomas2 : code:9639F0C4, 130 ms for 10 loops on 2MB of data

eax = Thomas3 : code:9639F0C4, 70 ms for 10 loops on 2MB of data

eax = Svin2 : code:8B1CBB09, 80 ms for 10 loops on 2MB of data

Thomas

Lower word is correct, one word to go :) It seems to be a little faster than my latest version. What processor do you use?

Thomas

Thomas