Hi everyone,

I need to optimize a function f(a,b) in C that does the following: f(a,b) = a*b mod (2^16 + 1). I?m no assmbler expert, but, I?ve tried all the optimizations at my hand in C and, still, it?s slow. Does anybody have an example of how could this be calculated using assembler?

Greetings,

Aliosha

I need to optimize a function f(a,b) in C that does the following: f(a,b) = a*b mod (2^16 + 1). I?m no assmbler expert, but, I?ve tried all the optimizations at my hand in C and, still, it?s slow. Does anybody have an example of how could this be calculated using assembler?

Greetings,

Aliosha

Since the mod is not a power-of-two (very deliberately so, it seems), my approach would be to rewrite the mod to a division, a sub and a mul:

x mod y = x - ((int)(x/y)*x)

I would also rewrite the division there to a reciprocal mul:

x/y = x * 1/y;

Since y is a constant, so is 1/y.

This means that it can be precalced, and therefore the slow mod/div operation can be avoided at runtime.

You can use either floats or fixedpoint arithmetic to implement it.

x mod y = x - ((int)(x/y)*x)

I would also rewrite the division there to a reciprocal mul:

x/y = x * 1/y;

Since y is a constant, so is 1/y.

This means that it can be precalced, and therefore the slow mod/div operation can be avoided at runtime.

You can use either floats or fixedpoint arithmetic to implement it.

Actually you can just do this:

```
mov ecx,eax
```

shr ecx,16

sub ax,cx

movzx eax,ax

adc eax,0

hummm, and you're using 32bit ints and not bignums?

Explanations would help, really.

So basically you subtract the highword from the lowword of eax.

which would mean:

y = x - (x / 2^16)

Scale it up to make some sense of it:

y*(2^16) = x*(2^16)-x

y*(2^16) = x*(2^16 - 1)

and then you do a modulo 2^16 on y.

I can't seem to make sense of it, and the code (I assume it's for the mod only, after then mul?) doesn't give the proper results anyway.

So basically you subtract the highword from the lowword of eax.

which would mean:

y = x - (x / 2^16)

Scale it up to make some sense of it:

y*(2^16) = x*(2^16)-x

y*(2^16) = x*(2^16 - 1)

and then you do a modulo 2^16 on y.

I can't seem to make sense of it, and the code (I assume it's for the mod only, after then mul?) doesn't give the proper results anyway.

Although you have not specified it, let's assume that you are looking for an integer result. It can also be assumed that both variables are unsigned 32-bit integers and that the result of dividing their product by (2^16+1) would also be contained in a 32-bit integer.

The following would return the mod in EAX (which is standard with most APIs):

Explanation: The multiplication of a 32-bit value in EAX by another 32-bit value produces a 64-bit value, the High Order 32 bits being in EDX. Dividing this by a 32-bit value puts the result in EAX and the remainder (mod) into EDX.

Raymond

The following would return the mod in EAX (which is standard with most APIs):

```
mov eax,a
```

mul b

mov ecx,2^16+1

div ecx

mov eax,edx ;transfer the modulo into EAX

Explanation: The multiplication of a 32-bit value in EAX by another 32-bit value produces a 64-bit value, the High Order 32 bits being in EDX. Dividing this by a 32-bit value puts the result in EAX and the remainder (mod) into EDX.

Raymond

Raymond: I assume the point of optimizing is to remove the slow div-operation. Else you might aswell use C.

Hi everyone,

First of all, thanks for your responses. Although I read all of them I think maybe I missed some information on my original post. Currently I have the function implemented like this:

unsigned short __fastcall f(unsigned short a, unsigned short b)

{

unsigned int p;

if ( a )

{

if ( b )

{

p = (int) a * b;

b = ( low16(p) );

a = ( p >> 16 );

return b - a + (b < a);

}

else

{

return 1 - a;

}

}

else

{

return 1 - b;

}

}

it?s the fastest implementation I?ve done, but, still, I?m looking forward to make it better :grin: . Any suggestions? Also, does anyone have a sample on how to build the asm code into the C code?

Greetings,

First of all, thanks for your responses. Although I read all of them I think maybe I missed some information on my original post. Currently I have the function implemented like this:

unsigned short __fastcall f(unsigned short a, unsigned short b)

{

unsigned int p;

if ( a )

{

if ( b )

{

p = (int) a * b;

b = ( low16(p) );

a = ( p >> 16 );

return b - a + (b < a);

}

else

{

return 1 - a;

}

}

else

{

return 1 - b;

}

}

it?s the fastest implementation I?ve done, but, still, I?m looking forward to make it better :grin: . Any suggestions? Also, does anyone have a sample on how to build the asm code into the C code?

Greetings,

Sephiroth3's code is

Here I add capability of solving signed numbers, too. 64-bit is really unnecessary for the code this topic is about.

Tested it thoroughly, and here's how it works:

**perfect**for unsigned numbers :alright:Here I add capability of solving signed numbers, too. 64-bit is really unnecessary for the code this topic is about.

Tested it thoroughly, and here's how it works:

```
```

; my code:

[color=green];---------------\[/color]

mov ecx,eax

shr ecx,16

test ch,80h

jz _labl1

inc cx

stc

jz @F

_labl1:

sub ax,cx

@@:

movzx eax,ax

adc eax,0

[color=green];---------------/[/color]

[b]when we have a positive EAX:[/b]

mov ecx,eax

shr ecx,16

mov ax2,ax ; store for later reference

mov cx2,cx ; store..

sub ax,cx

movzx eax,ax ; max number is 65536

adc eax,0 ; if (ax2 < cx2) , in other words, if there was loopback, add 1

let's see several examples

[b]eax=10[/b]

mov ecx,eax ; = 10

shr ecx,16 ; ecx = 0

sub ax,cx ; ax = 10

movzx eax,ax; eax = 10

adc eax,0 ; 10 > 0 (ax2>cx2), then do not increase eax

; result : eax = 10

[b]eax=65536[/b]

mov ecx,eax ; = 65536

shr ecx,16 ; ecx = 1

sub ax,cx ; ax = 0 - 1 = 65535

movzx eax,ax; eax = 65535

adc eax,0 ; 0 < 1 (ax2<cx2), then increase eax

; result : eax = 65536

[b]eax=65537[/b]

mov ecx,eax ; = 65537

shr ecx,16 ; ecx = 1

sub ax,cx ; ax = 1 - 1 = 0

movzx eax,ax; eax = 0

adc eax,0 ; 0 == 0 (ax2==cx2), then do not increase eax

; result : eax = 0

[b]eax=65538[/b]

mov ecx,eax ; = 65538

shr ecx,16 ; ecx = 1

sub ax,cx ; ax = 2 - 1 = 1

movzx eax,ax; eax = 1

adc eax,0 ; 2 > 1 (ax2>cx2), then do not increase eax

; result : eax = 1

[b]eax=65537*65000+25[/b]

mov ecx,eax ; = 65537*65000+25 = 4259905025 = 65000*65536 + 65025

shr ecx,16 ; ecx = 65000

sub ax,cx ; ax = 65025 - 65000 = 25

movzx eax,ax; eax = 25

adc eax,0 ; 65025 > 65000 (ax2>cx2), then do not increase eax

; result : eax = 25

[color=blue][b] but with negative values of EAX uses the other code:[/b][/color]

mov ecx,eax

shr ecx,16

inc cx

stc ; set carry by default

jz @F ; if cx is zero, then the next instruction is useless,

; and it'll clear the carry flag. We need it set.

sub ax,cx

@@:

movzx eax,ax

adc eax,0

[b]eax=-1[/b]

mov ecx,eax ; =-1

shr ecx,16

inc cx ; = 0

stc ; set carry

jz @F ; do not clear it, since cx ==0

sub ax,cx

@@:

movzx eax,ax ; eax = 65535

adc eax,0 ; carry is set, so we'll increase eax

; result : eax = 65536

[b]eax=-5000000[/b] must result 46349

mov ecx,eax ; = -5000000 = FFB3B4C0h

shr ecx,16 ; ecx = FFB3h = 65459

inc cx ; cx = 65460

stc

jz @F

sub ax,cx ; ax = 46272-65460 = 46348

@@:

movzx eax,ax ; eax = 46348

adc eax,0 ; carry is set, so we'll increase eax

; result : eax = 46349

[b]eax=-65537[/b] must result 0

mov ecx,eax ; = -65537

shr ecx,16 ; ecx = 65534

inc cx ; cx = 65535

stc

jz @F

sub ax,cx ; ax = 65535 - 65535

@@:

movzx eax,ax ; eax = 0

adc eax,0 ; carry is not set

; result : eax = 0

Hi,

With all the usefull comments here, the code looks like the following.But, I?m getting weird results. If I use the C version with a=1010, b=1010 I get 37045 and with the assembler version I get 995 :confused:. Am I missing something?

Greetings,

unsigned short f(unsigned short a, unsigned short b)

{

if( a )

{

if( b )

{

unsigned int p = (int) a * b;

__asm {

mov ecx,p

shr ecx,16

sub ax,cx

movzx eax,ax

adc eax,0

}

}

else

return 1 - a;

}

else

return 1 - b;

}

With all the usefull comments here, the code looks like the following.But, I?m getting weird results. If I use the C version with a=1010, b=1010 I get 37045 and with the assembler version I get 995 :confused:. Am I missing something?

Greetings,

unsigned short f(unsigned short a, unsigned short b)

{

if( a )

{

if( b )

{

unsigned int p = (int) a * b;

__asm {

mov ecx,p

shr ecx,16

sub ax,cx

movzx eax,ax

adc eax,0

}

}

else

return 1 - a;

}

else

return 1 - b;

}

you need to have the data in eax, too:

__asm {

mov eax,p

mov ecx,eax

shr ecx,16

sub ax,cx

movzx eax,ax

adc eax,0

}

or for the signed mod:

__asm {

mov eax,p

mov ecx,eax

shr ecx,16

test ch,80h

jz _labl1

inc cx

stc

jz @F

_labl1:

sub ax,cx

@@:

movzx eax,ax

adc eax,0

}

__asm {

mov eax,p

mov ecx,eax

shr ecx,16

sub ax,cx

movzx eax,ax

adc eax,0

}

or for the signed mod:

__asm {

mov eax,p

mov ecx,eax

shr ecx,16

test ch,80h

jz _labl1

inc cx

stc

jz @F

_labl1:

sub ax,cx

@@:

movzx eax,ax

adc eax,0

}

Hi Ultrano,

Thanks for your quote. That?s working perfect!!!!

Greetings,

Thanks for your quote. That?s working perfect!!!!

Greetings,

Hi Everyone,

Strangely enough, the new C function with the inline assembler built in works slower than the original compiled C function :(

I?ve read that compilers usually don?t optimize inline assembly and I?m using the Intel Compiler v.8.0.

Any suggestions?

Greetings,

Aliosha

Strangely enough, the new C function with the inline assembler built in works slower than the original compiled C function :(

I?ve read that compilers usually don?t optimize inline assembly and I?m using the Intel Compiler v.8.0.

Any suggestions?

Greetings,

Aliosha

Make a separate function:

long mod16_1(long p){

__asm {

mov eax,p

mov ecx,eax

shr ecx,16

test ch,80h

jz _labl1

inc cx

stc

jz @F

_labl1:

sub ax,cx

@@:

movzx eax,ax

adc eax,0

mov p,eax

}

return p;

}

use it - this way the other (calling) function will work properly and optimized.

long mod16_1(long p){

__asm {

mov eax,p

mov ecx,eax

shr ecx,16

test ch,80h

jz _labl1

inc cx

stc

jz @F

_labl1:

sub ax,cx

@@:

movzx eax,ax

adc eax,0

mov p,eax

}

return p;

}

use it - this way the other (calling) function will work properly and optimized.