Hola,

Spent a few minutes converting an approximation of 1/n into assembly for a C++ routine I was working on...

#define FP_ONE_BITS 0x3F800000


int _i = 2 * FP_ONE_BITS - *(int *)&(p);
r = *(float *)&_i;
r = r * (2.0f - (p) * r);


into



__asm
{
mov eax,0x7F000000
sub eax,dword ptr [t]
mov dword ptr [it],eax
fld dword ptr [t]
fmul dword ptr [it]
fsubr [two]
fmul dword ptr [it]
fstp dword ptr [it]
}


where 't' is a floating-point number (e.g. 10.0f) and 'it' is the approximation of 1/t

The problem isn't the conversion, but rather how it works... :)

for example using 10.0
the floating point would be...

0 10000010 01000000000000000000000

and 0x7F000000 = 01111111000000000000000000000000

but I'm not sure I'm doing my math right, because following this it doesn't work on paper...

Maybe someone could help me out...

Any optimizations would be really appreciated as well...

Sliver
Posted on 2003-03-09 16:05:33 by Sliver
Sliver, cool algo - I've never seen that one before. You have to remember the IEEE 754 standard and the format of REAL4 numbers. The leading one bit is assumed, and the exponent is stored as a positive, biased number. So the algo is playing with the exponent - namely, the number's exponent is subtracted from twice the bias. If the number is greater than one, its exponent is already the bias plus some constant amount.

bias = 7F

number = 7F + c

2*bias - number = 7F - c

You see how the scale of the number is inverted?

The rest just improves the approximation:
2 - ( X * ~1/X ) should be close to one...
Posted on 2003-03-09 22:39:50 by bitRAKE
would this be any faster than an fdiv?
Posted on 2003-03-09 22:59:36 by iblis
I believe it is... (In other words, I really hope so...) ;)

it ran faster than


fld1
fld dword ptr [t]
fdiv
fstp dword ptr [it]


on some initial tests, but I could be wrong...

Haven't done much profiling except the usual 0x0F 0x31 (RDTSC) for counting cycles
So if anyone feels the desire I'd like to see the results...

Again anything added would be cool :)

BitRAKE:
I do understand the bias, but I still don't understand what bit manipulation it's doing... any more clarification possible?

so is it multiplying the number by its negative-exponent equivalent?

for example, using 10 we have an exponent of 3
(after removing the bias), so in IEEE it would be stated as:
2^3 * 1.25

so when you subtract the floating-point number from twice the bias, do you get something on the order of 2^-4?

The problem is I can't wrap my mind around something I can't easily parse and view. (It would be easier if I could just do a bunch of masking and view the exponent, but I don't have much to work with when dealing with floating points...)

Obviously help is always appreciated :) :) :)

Sliver
Posted on 2003-03-09 23:37:44 by Sliver
; first approximation:

mov eax,0x7F000000 ; bias(1) * 2
sub eax,dword ptr [t] ; bias(1) * 2 - bias(t)
mov dword ptr [it],eax ;= ~1/t
; second approximation:
fld dword ptr [t] ;
fmul dword ptr [it] ; t * (~1/t) = ~1-
fsubr [two] ; 2 - (~1) = ~1+
fmul dword ptr [it] ; ~1+ * it
fstp dword ptr [it] ;= 1/t
'~' means approximate

I'm not sure what part you don't understand? It sounds like you understand it from what you state. Just look at the numbers in a debugger - paying attention only to the exponent bits. If you really don't care about accuracy, you could use just the first approximation.
Posted on 2003-03-10 00:14:13 by bitRAKE



I'm not sure what part you don't understand? It sounds like you understand it from what you state.


*Exercises in stupidity*

:( Ok, I just spent the last few hours learning the fine art of stupidity

So I do understand this after all. (Here is a good example of why you should always (a) use a debugger and (b) make sure the numbers you use on paper are the same numbers you're testing in the program.) :( I could almost cry...

Seems I was testing using the number

t = 1234.24532
so I "thought" the answer was 0.0008102117 (with my approx being 0.0008048193)

instead I was using (on paper) t = 10.0,
whose reciprocal is 0.10000

So obviously I'm looking at the computer saying "I'm not getting what I should be after taking the bias" :( but since I *knew* this was wrong, I kept re-doing the problem...

BitRAKE:
Thanks for helping me find my way out of that mess...

Sliver
Posted on 2003-03-10 01:45:50 by Sliver
Obviously this isn't the end-all, be-all of testing... :)

Given: t = 10.0f


for (int i = 0; i < 100000; i++)
{
__asm
{
mov eax,0x7F000000
sub eax,dword ptr [t]
mov dword ptr [it],eax
fld dword ptr [t]
fmul dword ptr [it]
fsubr [two]
fmul dword ptr [it]
fstp dword ptr [it]
}
}

inv expected 0.1000000000 approx 0.0991210938
Gross time (est): P2-450 / 0.00512982 sec




for (int i = 0; i < 100000; i++)
{
__asm
{
fld1
fld dword ptr [t]
fdiv
fstp dword ptr [it]
}
}

inv expected 0.1000000000 approx 0.1000000015
Gross time (est): P2-450 / 0.00852415 sec




for (int i = 0; i < 100000; i++)
{
it = 1/t;
}

inv expected 0.1000000000 approx 0.1000000015
Gross time (est): P2-450 / 0.00940774 sec
Posted on 2003-03-10 01:55:09 by Sliver

Obviously this isn't the end-all, be-all of testing... :)
Yeah, I'll just keep it in the toolbox like an old Phillips screwdriver. It's too small a piece of code to test by itself.
Posted on 2003-03-10 08:27:37 by bitRAKE
Hey help me out here, what is a biased number? Excuse me for sounding stupid but I'm only in Grade 11 :(
Posted on 2003-04-06 18:14:26 by x86asm
The IEEE format uses a bias for the exponent - 7Fh for the REAL4 format. This bias allows positive floating-point numbers to be compared with integer compares! :) A stored exponent of 7Fh means an unbiased exponent of zero - note the leading one bit is implied - so the number is one (1.0).
Posted on 2003-04-06 19:37:44 by bitRAKE
Sliver:

It is not necessary to load the divisor onto the FPU stack to perform the division. It would be interesting to see the relative comparison with your time figures if you replace the two instructions

fld dword ptr
fdiv

by the single instruction

fdiv dword ptr

Raymond
Posted on 2003-04-06 19:51:26 by Raymond
The P4 takes no less than 23 / 38 / 43 cycles for FDIV! The Athlon does a little better with 16 / 20 / 24 - the three latency numbers refer to precision control settings of single, double, and extended precision, respectively.
Posted on 2003-04-06 19:58:16 by bitRAKE
Things look better on Athlon XPs: pfrcp and rcpss take 3 clocks.
Posted on 2003-04-08 11:12:09 by Jan Wassenberg


__asm
{
fld1
fdiv dword ptr [t]
fstp dword ptr [it]
}




inv expected 0.1000000000 approx 0.1000000015
Gross time (est): P2-450 / 0.00849256 sec, P2-500 / 0.01, P2-650 / 0.01

Still beats it consistently by approx. 0.003 seconds

This of course is by no means conclusive, but it's not meant to be :p

Here is the code I use to check it (in C/C++):



#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <assert.h>

#define COUNT_CYCLES 1

/*****************************************************************************/
#if COUNT_CYCLES
/*****************************************************************************/

#pragma warning(disable:4035)

#define CCNT_OVERHEAD64 13

__int64 GetCycleCount64()
{
__asm _emit 0x0F // RDTSC: cycle count returned in EDX:EAX
__asm _emit 0x31
}

#define CCNT_OVERHEAD32 13

unsigned GetCycleCount32() // enough for about 40 seconds
{
__asm push EDX
__asm _emit 0x0F // RDTSC
__asm _emit 0x31
__asm pop EDX // keep only the low 32 bits in EAX
}

#pragma warning(default:4035)

/*****************************************************************************/

static
__int64 cycleBegin;

static
__int64 cycleTotal;

static
__int64 cycleStart;

static
unsigned cyclePause;

unsigned cycleExtra;

static
void cycleCounterInit()
{
cycleBegin = GetCycleCount64();
cycleTotal = 0;
cycleStart = 0;
cyclePause = 0;
}

static
__int64 cycleCounterDone(__int64 *realPtr)
{
assert(cyclePause == 0);

*realPtr = GetCycleCount64() - cycleBegin;

return cycleTotal;
}

void cycleCounterBeg()
{
assert(cyclePause == 0);

cycleStart = GetCycleCount64();
}

void cycleCounterEnd()
{
assert(cycleStart != 0);
assert(cyclePause == 0);

cycleTotal += GetCycleCount64() - cycleStart;

cycleStart = 0;
}

void cycleCounterPause()
{
assert(cycleStart != 0);

if (!cyclePause)
cycleTotal += GetCycleCount64() - cycleStart;

cyclePause++;
}

void cycleCounterResume()
{
assert(cycleStart != 0);
assert(cyclePause != 0);

if (--cyclePause)
return;

cycleStart = GetCycleCount64();
}

/*****************************************************************************/
#endif//COUNT_CYCLES

#define FP_ONE_BITS 0x3F800000
// r = 1/p
#define FP_INV(r,p) \
{ \
int _i = 2 * FP_ONE_BITS - *(int *)&(p); \
r = *(float *)&_i; \
r = r * (2.0f - (p) * r); \
}

float two = 2.0f;

#define FP_INV2(r,p) \
{ \
__asm { mov eax,0x7F000000 }; \
__asm { sub eax,dword ptr [p] }; \
__asm { mov dword ptr [r],eax }; \
__asm { fld dword ptr [p] }; \
__asm { fmul dword ptr [r] }; \
__asm { fsubr [two] }; \
__asm { fmul dword ptr [r] }; \
__asm { fstp dword ptr [r] }; \
}

int main()
{
#if COUNT_CYCLES

/* Reset the cycle counter */

cycleCounterInit();
assert(cycleTotal == 0);
cycleCounterBeg();
#endif


float t, it;
t = 10.0f;

for (int i = 0; i < 100000; i++)
{


__asm
{
mov eax,0x7F000000
sub eax,dword ptr [t]
mov dword ptr [it],eax
fld dword ptr [t]
fmul dword ptr [it]
fsubr [two]
fmul dword ptr [it]
fstp dword ptr [it]
}
/*
__asm
{
fld1
fdiv dword ptr [t]
fstp dword ptr [it]
}
*/
}

printf("inv expected %20.10f approx %20.10f\n", 1/t, it);

#if COUNT_CYCLES

__int64 cycleTotal;
__int64 cycleSpent;

cycleSpent = cycleCounterDone(&cycleTotal);

if (cycleTotal)
printf("Gross time (est): P2-450 / %2.8f sec, P2-500 / %4.2f, P2-650 / %4.2f\n", (float)cycleTotal/450000000, (float)cycleTotal/500000000, (float)cycleTotal/650000000);
if (cycleSpent)
printf("Net time (est): P2-450 / %2.2f sec, P2-500 / %4.2f, P2-650 / %4.2f\n", (float)cycleSpent/450000000, (float)cycleSpent/500000000, (float)cycleSpent/650000000);
if (cycleExtra)
printf("Extra time (est): P2-450 / %4.2f sec, P2-500 / %4.2f, P2-650 / %4.2f\n", (float)cycleExtra/450000000, (float)cycleExtra/500000000, (float)cycleExtra/650000000);

printf("\n");

#endif

}

Posted on 2003-04-11 00:03:32 by Sliver
Here's a paper written about the approximation method for the inverse square root:
http://www.math.purdue.edu/~clomont/Math/Papers/2003/InvSqrt.pdf
Posted on 2003-05-05 01:38:28 by bitRAKE
It was an interesting look at trying to better that approximation...

The funny part was he never really figured out what made that constant so important...
Brought me back to a math class a couple of years ago when I tried to understand why Sqrt(2 pi n) was so damn accurate (at least in ratios) in Stirling's approximation, using PI = 3.14159265359 (since at the time I'd never seen "e" and "PI" in the same equation before)

Sliver
Posted on 2003-05-10 12:50:14 by Sliver
About mathematical approximations, there is a small optimization project:

http://www.optimalcode.com/fastcode/index.htm

Dennis Christensen wrote his own DKC optimizer, which computes the best coefficients for a given expression (perhaps using PSLQ or LLL ?).

JC
Posted on 2003-05-10 13:25:20 by MCoder