Hola,

Spent a few minutes making an approximation for 1/n in assembly for a c++ routine I was working on...

#define FP_ONE_BITS 0x3F800000
``````
int _i = 2 * FP_ONE_BITS - *(int *)&(p);
r = *(float *)&_i;
r = r * (2.0f - (p) * r);
``````

into

``````
__asm
{
mov     eax,0x7F000000
sub     eax,dword ptr [t]
mov     dword ptr [it],eax
fld     dword ptr [t]
fmul    dword ptr [it]
fsubr   [two]
fmul    dword ptr [it]
fstp    dword ptr [it]
}
``````

where 't' = a floating point number (ex: 10.0f) and 'it' = the approximation of 1/t

The problem isn't the conversion, but more to the point how does it work... :)

for example using 10.0
the floating point would be...

0 100000101 01000000000000000000000

and 0x7F000000 = 01111111000000000000000000000000

but I'm not sure I'm doing my math right, because following this it doesn't work on paper...

Maybe someone could help me out...

Any optimizations would be really appreciated as well...

Sliver
Posted on 2003-03-09 16:05:33 by Sliver
Sliver, cool algo - I never seen that one before. You have to remember the IEEE 754 standard and the format of REAL4 numbers. The leading one bit is assumed, and the floating point is a positive biased number. So, the algo is playing with the exponent - namely, the exponent of the number is subtracted from twice the bias. If the number is greater than one, then the number is already biased plus another constant amount.

bias = 7F

number = 7F + c

2*bias - number = 7F - c

You see how the scale of the number is inverted?

The rest is just to improve the approximation:
2 - ( X * 1/X ) should be close to one...
Posted on 2003-03-09 22:39:50 by bitRAKE
would this be any faster than an fdiv?
Posted on 2003-03-09 22:59:36 by iblis
I believe it is... (Inotherwords, I really hope so...) ;)

it ran faster than
``````
fld1
fld dword ptr [t]
fdiv
fstp dword ptr [it]
``````

on some initial tests, but I could be wrong...

Haven't done much profiling except the usualy 0x0F 0x31 for clocking cycles
So if anyone feels the desire I'd like to see the results...

Again anything added would be cool :)

BitRAKE:
I do understand the bias, but I still don't understand what bit manipulation it's doing... any more clarrification possible?

so is it multiplying the number by it's -exponent equivalent ?

for example using 10 we have an exponent of 4
(after the bias) so IEEE would be stated as:
2^4 * .625

so when you subtract the floating point exponent from the bias do you get 2^-4?

The problem is I can't wrap my mind around something I can't easily parse and view (it would be easier is I could just do a bunch of masking and view the exponent, but I don't have much to work with when dealing with floating points...

Obviously help is always appreciated :) :) :)

Sliver
Posted on 2003-03-09 23:37:44 by Sliver
``````; first approximation:
mov     eax,0x7F000000		; bias(1) * 2
sub     eax,dword ptr [t]	; bias(1) * 2 - bias(t)
mov     dword ptr [it],eax	;= ~1/t
; second approximation:
fld     dword ptr [t]		;
fmul    dword ptr [it]		; t * (~1/t) = ~1-
fsubr   [two]			; 2 - (~1) = ~1+
fmul    dword ptr [it]		; ~1+ * it
fstp    dword ptr [it]		;= 1/t``````
'~' means approximate

I'm not sure what part you don't understand? It sound like you understand from what you state. Just look at the numbers in a debugger - paying attention to only the exponent bits. If you really don't care about accuracy then you could use the first approximation.
Posted on 2003-03-10 00:14:13 by bitRAKE

I'm not sure what part you don't understand? It sound like you understand from what you state.

*Exercises in stupidity*

:( Ok I just spent the last few hours in learning the fine art of stupidity

So I do understand this after all (here is a good example why you should always (a) use a debugger (b) make sure the numbers you use on paper are the same numbers your testing in the program :( I could almost cry...

Seems I was testing using the number

t = 1234.24532
so I "thought" the answer was 0.0008102117 (with my approx being 0.0008048193)

instead I was using (on paper) t = 10.0
which is 0.10000

So obviously I'm looking at the computer saying "I'm not getting what I should be after taking the bias" :( but since I *knew* this was wrong I kept re-doing the problem...

BitRAKE:
Thanks for helping me fine my way out of that mess...

Sliver
Posted on 2003-03-10 01:45:50 by Sliver
Obviously this isn't the end all be all of testing... :)

Given: t = 10.0f
``````
for (int i = 0; i < 100000; i++)
{
__asm
{
mov     eax,0x7F000000
sub     eax,dword ptr [t]
mov     dword ptr [it],eax
fld     dword ptr [t]
fmul    dword ptr [it]
fsubr   [two]
fmul    dword ptr [it]
fstp    dword ptr [it]
}
}
``````

inv expected 0.1000000000 approx 0.0991210938
Gross time (est): P2-450 / 0.00512982 sec

``````
for (int i = 0; i < 100000; i++)
{
__asm
{
fld1
fld dword ptr [t]
fdiv
fstp dword ptr [it]
}
}
``````

inv expected 0.1000000000 approx 0.1000000015
Gross time (est): P2-450 / 0.00852415 sec

``````
for (int i = 0; i < 100000; i++)
{
it = 1/t;
}
``````

inv expected 0.1000000000 approx 0.1000000015
Gross time (est): P2-450 / 0.00940774 sec
Posted on 2003-03-10 01:55:09 by Sliver

Obviously this isn't the end all be all of testing... :)
Yeah, I'll just keep it in the toolbox like an old philips driver. It's too small a piece of code to test by itself.
Posted on 2003-03-10 08:27:37 by bitRAKE
Hey help me out here, what is a biased number? Excuse me for sounding stupid but I'm only in Grade 11 :(
Posted on 2003-04-06 18:14:26 by x86asm
The IEEE format uses a bias for the exponent - 7Fh for REAL4 format. This bias allows floating point numbers to be compared with integer compares! :) An exponent of 7Fh is zero - note the one bit is implied - so the number is one (1).
Posted on 2003-04-06 19:37:44 by bitRAKE
Sliver:

It is not necessary to load the divisor for performing the dividion with the FPU. It would be interesting to see the relative comparison with your time figures if you replace the two instructions

fld dword ptr
fdiv

by the single instruction

fdiv dword ptr

Raymond
Posted on 2003-04-06 19:51:26 by Raymond
The P4 takes no less than 23 / 38 / 43 cycles for the FDIV! Athlon does a little better with 16 / 20 / 24 - three latency numbers refer to precision control settings of single precision, double precision, and extended precision, respectively.
Posted on 2003-04-06 19:58:16 by bitRAKE
Things look better on Athlon XPs: pfrcp and rcpss take 3 clocks.
Posted on 2003-04-08 11:12:09 by Jan Wassenberg
``````
__asm
{
fld1
fdiv dword ptr [t]
fstp dword ptr [it]
}

``````

inv expected 0.1000000000 approx 0.1000000015
Gross time (est): P2-450 / 0.00849256 sec, P2-500 / 0.01, P2-650 / 0.01

Still beats it cosistantly by approx. '.003 seconds'

This of course is by no means conclusive, but it's not meant to be :p

Here is the code I use to check it (in c/c++)

``````
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <assert.h>

#define COUNT_CYCLES 1

/*****************************************************************************/
#if COUNT_CYCLES
/*****************************************************************************/

#pragma warning(disable:4035)

__int64         GetCycleCount64()
{
__asm   _emit   0x0F
__asm   _emit   0x31
};

unsigned        GetCycleCount32()        // enough for about 40 seconds
{
__asm   push    EDX
__asm   _emit   0x0F
__asm   _emit   0x31
__asm   pop     EDX
};

#pragma warning(default:4035)

/*****************************************************************************/

static
__int64         cycleBegin;

static
__int64         cycleTotal;

static
__int64         cycleStart;

static
unsigned        cyclePause;

unsigned        cycleExtra;

static
void            cycleCounterInit()
{
cycleBegin = GetCycleCount64();
cycleTotal = 0;
cycleStart = 0;
cyclePause = 0;
}

static
__int64         cycleCounterDone(__int64 *realPtr)
{
assert(cyclePause == 0);

*realPtr = GetCycleCount64() - cycleBegin;

return cycleTotal;
}

void            cycleCounterBeg()
{
assert(cyclePause == 0);

cycleStart = GetCycleCount64();
}

void            cycleCounterEnd()
{
assert(cycleStart != 0);
assert(cyclePause == 0);

cycleTotal += GetCycleCount64() - cycleStart;

cycleStart  = 0;
}

void            cycleCounterPause()
{
assert(cycleStart != 0);

if  (!cyclePause)
cycleTotal += GetCycleCount64() - cycleStart;

cyclePause++;
}

void            cycleCounterResume()
{
assert(cycleStart != 0);
assert(cyclePause != 0);

if  (--cyclePause)
return;

cycleStart = GetCycleCount64();
}

/*****************************************************************************/
#endif//COUNT_CYCLES

#define FP_ONE_BITS 0x3F800000
// r = 1/p
#define FP_INV(r,p)                                                          \
{                                                                            \
int _i = 2 * FP_ONE_BITS - *(int *)&(p);                                 \
r = *(float *)&_i;                                                       \
r = r * (2.0f - (p) * r);                                                \
}

float   two = 2.0f;

#define FP_INV2(r,p)                     \
{                                        \
__asm { mov     eax,0x7F000000    }; \
__asm { sub     eax,dword ptr [p] }; \
__asm { mov     dword ptr [r],eax }; \
__asm { fld     dword ptr [p]     }; \
__asm { fmul    dword ptr [r]     }; \
__asm { fsubr   [two]             }; \
__asm { fmul    dword ptr [r]     }; \
__asm { fstp    dword ptr [r]     }; \
}

void main()
{
#if COUNT_CYCLES

/* Reset the cycle counter */

cycleCounterInit();
assert(cycleTotal == 0);
cycleCounterBeg();
#endif

float t, it;
t = 10.0f;

for (int i = 0; i < 100000; i++)
{

__asm
{
mov     eax,0x7F000000
sub     eax,dword ptr [t]
mov     dword ptr [it],eax
fld     dword ptr [t]
fmul    dword ptr [it]
fsubr   [two]
fmul    dword ptr [it]
fstp    dword ptr [it]
}
/*
__asm
{
fld1
fdiv dword ptr [t]
fstp dword ptr [it]
}
*/
}

printf("inv  expected %20.10f  approx %20.10f\n", 1/t, it);

#if COUNT_CYCLES

__int64         cycleTotal;
__int64         cycleSpent;

cycleSpent = cycleCounterDone(&cycleTotal);

if  (cycleTotal)
printf("Gross time (est): P2-450 / %2.8f sec, P2-500 / %4.2f, P2-650 / %4.2f\n", (float)cycleTotal/450000000, (float)cycleTotal/500000000, (float)cycleTotal/650000000);
if  (cycleSpent)
printf("Net   time (est): P2-450 / %2.2f sec, P2-500 / %4.2f, P2-650 / %4.2f\n", (float)cycleSpent/450000000, (float)cycleSpent/500000000, (float)cycleSpent/650000000);
if  (cycleExtra)
printf("Extra time (est): P2-450 / %4.2f sec, P2-500 / %4.2f, P2-650 / %4.2f\n", (float)cycleExtra/450000000, (float)cycleExtra/500000000, (float)cycleExtra/650000000);

printf("\n");

#endif

}

``````
Posted on 2003-04-11 00:03:32 by Sliver
A paper wrote about the approximation method for inverse square root.
http://www.math.purdue.edu/~clomont/Math/Papers/2003/InvSqrt.pdf
Posted on 2003-05-05 01:38:28 by bitRAKE
It was an interesting look at trying to better that approximation...

The funny part was he never really figured out what made that constant so important...
Brought me back to a math class a couple of years ago when I tried to understand why Sqrt(2 pi n) so was damn accurate (at least in ratio's) in Sterlings Approximation using PI = 3.141592653579 (since at the time I'd never seen "e" and "PI" in the same equation before)

Sliver
Posted on 2003-05-10 12:50:14 by Sliver
About mathematical approximations, there is a small optimization project:

http://www.optimalcode.com/fastcode/index.htm

Dennis Christensen wrote his own DKC optimizer, which computes the best coefficients for a given expression (perhaps using PSLQ or LLL ?).

JC
Posted on 2003-05-10 13:25:20 by MCoder