Hi everyone,

I need to optimize a function f(a,b) in C that does the following: f(a,b) = a*b mod (2^16 + 1). I?m no assmbler expert, but, I?ve tried all the optimizations at my hand in C and, still, it?s slow. Does anybody have an example of how could this be calculated using assembler?

Greetings,

Aliosha
Posted on 2004-02-17 14:27:46 by alioshab
Since the mod is not a power-of-two (very deliberately so, it seems), my approach would be to rewrite the mod to a division, a sub and a mul:

x mod y = x - ((int)(x/y)*x)

I would also rewrite the division there to a reciprocal mul:

x/y = x * 1/y;

Since y is a constant, so is 1/y.
This means that it can be precalced, and therefore the slow mod/div operation can be avoided at runtime.
You can use either floats or fixedpoint arithmetic to implement it.
Posted on 2004-02-17 15:01:19 by Henk-Jan
Actually you can just do this:
mov ecx,eax

shr ecx,16
sub ax,cx
movzx eax,ax
adc eax,0
Posted on 2004-02-17 15:16:18 by Sephiroth3
hummm, and you're using 32bit ints and not bignums?
Posted on 2004-02-17 15:29:35 by f0dder
Explanations would help, really.
So basically you subtract the highword from the lowword of eax.
which would mean:

y = x - (x / 2^16)

Scale it up to make some sense of it:
y*(2^16) = x*(2^16)-x
y*(2^16) = x*(2^16 - 1)

and then you do a modulo 2^16 on y.
I can't seem to make sense of it, and the code (I assume it's for the mod only, after then mul?) doesn't give the proper results anyway.
Posted on 2004-02-17 15:39:41 by Henk-Jan
Although you have not specified it, let's assume that you are looking for an integer result. It can also be assumed that both variables are unsigned 32-bit integers and that the result of dividing their product by (2^16+1) would also be contained in a 32-bit integer.

The following would return the mod in EAX (which is standard with most APIs):
mov  eax,a

mul b
mov ecx,2^16+1
div ecx
mov eax,edx ;transfer the modulo into EAX

Explanation: The multiplication of a 32-bit value in EAX by another 32-bit value produces a 64-bit value, the High Order 32 bits being in EDX. Dividing this by a 32-bit value puts the result in EAX and the remainder (mod) into EDX.

Raymond
Posted on 2004-02-17 21:13:43 by Raymond
Raymond: I assume the point of optimizing is to remove the slow div-operation. Else you might aswell use C.
Posted on 2004-02-18 06:03:13 by Henk-Jan
Hi everyone,

First of all, thanks for your responses. Although I read all of them I think maybe I missed some information on my original post. Currently I have the function implemented like this:

unsigned short __fastcall f(unsigned short a, unsigned short b)
{
unsigned int p;

if ( a )
{
if ( b )
{
p = (int) a * b;
b = ( low16(p) );
a = ( p >> 16 );

return b - a + (b < a);
}
else
{
return 1 - a;
}
}
else
{
return 1 - b;
}
}

it?s the fastest implementation I?ve done, but, still, I?m looking forward to make it better :grin: . Any suggestions? Also, does anyone have a sample on how to build the asm code into the C code?

Greetings,
Posted on 2004-02-18 08:02:42 by alioshab
Sephiroth3's code is perfect for unsigned numbers :alright:
Here I add capability of solving signed numbers, too. 64-bit is really unnecessary for the code this topic is about.
Tested it thoroughly, and here's how it works:


; my code:
[color=green];---------------\[/color]
mov ecx,eax
shr ecx,16
test ch,80h
jz _labl1
inc cx
stc
jz @F
_labl1:
sub ax,cx
@@:
movzx eax,ax
adc eax,0
[color=green];---------------/[/color]




[b]when we have a positive EAX:[/b]

mov ecx,eax
shr ecx,16
mov ax2,ax ; store for later reference
mov cx2,cx ; store..
sub ax,cx
movzx eax,ax ; max number is 65536
adc eax,0 ; if (ax2 < cx2) , in other words, if there was loopback, add 1

let's see several examples

[b]eax=10[/b]
mov ecx,eax ; = 10
shr ecx,16 ; ecx = 0
sub ax,cx ; ax = 10
movzx eax,ax; eax = 10
adc eax,0 ; 10 > 0 (ax2>cx2), then do not increase eax
; result : eax = 10

[b]eax=65536[/b]
mov ecx,eax ; = 65536
shr ecx,16 ; ecx = 1
sub ax,cx ; ax = 0 - 1 = 65535
movzx eax,ax; eax = 65535
adc eax,0 ; 0 < 1 (ax2<cx2), then increase eax
; result : eax = 65536

[b]eax=65537[/b]
mov ecx,eax ; = 65537
shr ecx,16 ; ecx = 1
sub ax,cx ; ax = 1 - 1 = 0
movzx eax,ax; eax = 0
adc eax,0 ; 0 == 0 (ax2==cx2), then do not increase eax
; result : eax = 0


[b]eax=65538[/b]
mov ecx,eax ; = 65538
shr ecx,16 ; ecx = 1
sub ax,cx ; ax = 2 - 1 = 1
movzx eax,ax; eax = 1
adc eax,0 ; 2 > 1 (ax2>cx2), then do not increase eax
; result : eax = 1


[b]eax=65537*65000+25[/b]
mov ecx,eax ; = 65537*65000+25 = 4259905025 = 65000*65536 + 65025
shr ecx,16 ; ecx = 65000
sub ax,cx ; ax = 65025 - 65000 = 25
movzx eax,ax; eax = 25
adc eax,0 ; 65025 > 65000 (ax2>cx2), then do not increase eax
; result : eax = 25

[color=blue][b] but with negative values of EAX uses the other code:[/b][/color]



mov ecx,eax
shr ecx,16
inc cx
stc ; set carry by default
jz @F ; if cx is zero, then the next instruction is useless,
; and it'll clear the carry flag. We need it set.
sub ax,cx
@@:
movzx eax,ax
adc eax,0


[b]eax=-1[/b]
mov ecx,eax ; =-1
shr ecx,16
inc cx ; = 0
stc ; set carry
jz @F ; do not clear it, since cx ==0
sub ax,cx
@@:
movzx eax,ax ; eax = 65535
adc eax,0 ; carry is set, so we'll increase eax
; result : eax = 65536


[b]eax=-5000000[/b] must result 46349
mov ecx,eax ; = -5000000 = FFB3B4C0h
shr ecx,16 ; ecx = FFB3h = 65459
inc cx ; cx = 65460
stc
jz @F
sub ax,cx ; ax = 46272-65460 = 46348
@@:
movzx eax,ax ; eax = 46348
adc eax,0 ; carry is set, so we'll increase eax
; result : eax = 46349

[b]eax=-65537[/b] must result 0
mov ecx,eax ; = -65537
shr ecx,16 ; ecx = 65534
inc cx ; cx = 65535
stc
jz @F
sub ax,cx ; ax = 65535 - 65535
@@:
movzx eax,ax ; eax = 0
adc eax,0 ; carry is not set
; result : eax = 0


Posted on 2004-02-18 09:22:39 by Ultrano
Hi,

With all the usefull comments here, the code looks like the following.But, I?m getting weird results. If I use the C version with a=1010, b=1010 I get 37045 and with the assembler version I get 995 :confused:. Am I missing something?

Greetings,

unsigned short f(unsigned short a, unsigned short b)
{
if( a )
{
if( b )
{
unsigned int p = (int) a * b;

__asm {
mov ecx,p
shr ecx,16
sub ax,cx
movzx eax,ax
adc eax,0
}

}
else
return 1 - a;
}
else
return 1 - b;
}
Posted on 2004-02-18 15:08:58 by alioshab
you need to have the data in eax, too:
__asm {
mov eax,p
mov ecx,eax
shr ecx,16
sub ax,cx
movzx eax,ax
adc eax,0
}


or for the signed mod:
__asm {
mov eax,p
mov ecx,eax
shr ecx,16
test ch,80h
jz _labl1
inc cx
stc
jz @F
_labl1:
sub ax,cx
@@:
movzx eax,ax
adc eax,0
}
Posted on 2004-02-18 16:29:20 by Ultrano
Hi Ultrano,

Thanks for your quote. That?s working perfect!!!!

Greetings,
Posted on 2004-02-19 07:18:32 by alioshab
Hi Everyone,

Strangely enough, the new C function with the inline assembler built in works slower than the original compiled C function :(

I?ve read that compilers usually don?t optimize inline assembly and I?m using the Intel Compiler v.8.0.

Any suggestions?

Greetings,

Aliosha
Posted on 2004-02-20 15:43:23 by alioshab
Make a separate function:
long mod16_1(long p){
__asm {
mov eax,p
mov ecx,eax
shr ecx,16
test ch,80h
jz _labl1
inc cx
stc
jz @F
_labl1:
sub ax,cx
@@:
movzx eax,ax
adc eax,0
mov p,eax
}
return p;
}

use it - this way the other (calling) function will work properly and optimized.
Posted on 2004-02-20 15:58:33 by Ultrano