Hello,


Spent a few minutes making an approximation for 1/n in assembly for a C++ routine I was working on... converting

```
#define FP_ONE_BITS 0x3F800000

int _i = 2 * FP_ONE_BITS - *(int *)&(p);
r = *(float *)&_i;
r = r * (2.0f - (p) * r);
```

into

```
__asm
{
    mov eax,0x7F000000
    sub eax,dword ptr [t]
    mov dword ptr [it],eax
    fld dword ptr [t]
    fmul dword ptr [it]
    fsubr [two]
    fmul dword ptr [it]
    fstp dword ptr [it]
}
```

where 't' = a floating point number (ex: 10.0f) and 'it' = the approximation of 1/t

The problem isn't the conversion, but more to the point how does it work... :)

for example using 10.0

the floating point would be...

0 10000010 01000000000000000000000

and 0x7F000000 = 01111111000000000000000000000000
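To check this kind of on-paper math, a small sketch (mine, not from the thread) that pulls apart the fields of a REAL4 value; the helper names are made up, and `memcpy` stands in for the `*(int *)&` cast:

```cpp
#include <cstdint>
#include <cstring>

// Raw bit pattern of an IEEE 754 single (defined-behavior type punning).
uint32_t float_bits(float f)     { uint32_t b; std::memcpy(&b, &f, sizeof b); return b; }
// Biased exponent: bits 30..23.
uint32_t exponent_field(float f) { return (float_bits(f) >> 23) & 0xFF; }
// Mantissa: bits 22..0 (leading 1 is implied).
uint32_t mantissa_field(float f) { return float_bits(f) & 0x7FFFFF; }
```

For 10.0f this gives the pattern 0x41200000: exponent field 130 (0x82, so 2^(130-127)) and mantissa 0x200000 (1.25 with the implied one bit), i.e. 8 * 1.25 = 10.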

but I'm not sure I'm doing my math right, because following this it doesn't work on paper...

Maybe someone could help me out...

Any optimizations would be really appreciated as well...

Sliver

**Sliver**, cool algo - I've never seen that one before. You have to remember the IEEE 754 standard and the format of REAL4 numbers. The leading one bit is assumed, and the exponent is a positively biased number. So the algo is playing with the exponent - namely, the exponent of the number is subtracted from twice the bias. If the number is greater than one, its exponent is already the bias plus some constant amount:

bias = 7F

number = 7F + c

2*bias - number = 7F - c

You see how the scale of the number is inverted?

The rest is just to improve the approximation:

2 - ( X * 1/X ) should be close to one...
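The two steps above can be sketched portably (this is my restatement of the thread's macro, with `memcpy` punning instead of the pointer cast; the function name is made up):

```cpp
#include <cstdint>
#include <cstring>

// Approximate 1/p: subtract the bit pattern from twice the bias constant
// (which roughly negates the exponent), then take one Newton-Raphson step.
float fp_inv(float p)
{
    uint32_t bits;
    std::memcpy(&bits, &p, sizeof bits);        // bit pattern of p
    uint32_t seed = 2u * 0x3F800000u - bits;    // 0x7F000000 - bits(p)
    float r;
    std::memcpy(&r, &seed, sizeof r);           // rough 1/p
    return r * (2.0f - p * r);                  // refinement step
}
```

For exact powers of two the seed is already exact (e.g. fp_inv(4.0f) returns exactly 0.25f); for t = 10.0f it returns 0.0991210938, matching the timing runs later in the thread.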

would this be any faster than an fdiv?

I believe it is... (Inotherwords, I really hope so...) ;)


it ran faster than

```
fld1
fld dword ptr [t]
fdiv
fstp dword ptr [it]
```

on some initial tests, but I could be wrong...

Haven't done much profiling except the usual 0x0F 0x31 (rdtsc) for counting cycles
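For reference, 0x0F 0x31 is the encoding of `rdtsc`; modern compilers expose it as an intrinsic, so the `_emit` trick isn't needed. A sketch assuming an x86 target (the header is `<x86intrin.h>` on GCC/Clang, `<intrin.h>` on MSVC; the function is my own illustration):

```cpp
#include <cstdint>
#include <x86intrin.h>   // __rdtsc() on GCC/Clang

// Count raw cycles around a chunk of work, the same way the
// thread's hand-emitted rdtsc does.
uint64_t cycles_elapsed()
{
    uint64_t start = __rdtsc();
    volatile float r = 0.109375f;          // keep the loop from being optimized away
    for (int i = 0; i < 1000; i++)
        r = r * (2.0f - 10.0f * r);        // some work to time
    return __rdtsc() - start;
}
```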

So if anyone feels the desire I'd like to see the results...

Again anything added would be cool :)

BitRAKE:

I do understand the bias, but I still don't understand what bit manipulation it's doing... any more clarification possible?

so is it multiplying the number by its negated-exponent equivalent?

for example using 10 we have an exponent of 4

(after the bias) so IEEE would be stated as:

2^4 * .625

so when you subtract the floating point exponent from the bias do you get 2^-4?

The problem is I can't wrap my mind around something I can't easily parse and view. (It would be easier if I could just do a bunch of masking and view the exponent, but I don't have much to work with when dealing with floating points.)

Obviously help is always appreciated :) :) :)

Sliver

```
; first approximation:
mov eax,0x7F000000     ; bias(1) * 2
sub eax,dword ptr [t]  ; bias(1) * 2 - bias(t)
mov dword ptr [it],eax ; = ~1/t

; second approximation:
fld dword ptr [t]
fmul dword ptr [it]    ; t * (~1/t) = ~1-
fsubr [two]            ; 2 - (~1) = ~1+
fmul dword ptr [it]    ; ~1+ * it
fstp dword ptr [it]    ; = 1/t
```

'~' means approximate
I'm not sure what part you don't understand? It sounds like you understand from what you state. Just look at the numbers in a debugger, paying attention only to the exponent bits. If you really don't care about accuracy, you could use just the first approximation.
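To put numbers on what the second approximation buys: the Newton-Raphson step roughly squares the relative error of the seed. A sketch (again using `memcpy` punning; names are my own):

```cpp
#include <cstdint>
#include <cstring>

// First approximation only: subtract the bit pattern from 2 * bias.
float fp_seed(float p)
{
    uint32_t bits;
    std::memcpy(&bits, &p, sizeof bits);
    uint32_t s = 0x7F000000u - bits;     // 2 * 0x3F800000 - bits(p)
    float r;
    std::memcpy(&r, &s, sizeof r);
    return r;
}

// Relative error of r as an approximation of 1/p: |r*p - 1|.
float rel_err(float r, float p) { float e = r * p - 1.0f; return e < 0 ? -e : e; }
```

For t = 10.0f the seed is 0.109375 (9.375% off); one refinement gives 0.0991210938, which is 0.87890625% off - exactly 0.09375 squared.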


*Exercises in stupidity*

:( Ok, I just spent the last few hours learning the fine art of stupidity.

So I do understand this after all. (Here is a good example of why you should always (a) use a debugger and (b) make sure the numbers you use on paper are the same numbers you're testing in the program.) :( I could almost cry...

Seems I was testing using the number

t = 1234.24532

so I "thought" the answer was 0.0008102117 (with my approx being 0.0008048193)

instead I was using (on paper) t = 10.0

whose reciprocal is 0.10000

So obviously I'm looking at the computer saying "I'm not getting what I should be after taking the bias" :( but since I *knew* this was wrong I kept re-doing the problem...

BitRAKE:

Thanks for helping me find my way out of that mess...

Sliver


Given: t = 10.0f

```
for (int i = 0; i < 100000; i++)
{
    __asm
    {
        mov eax,0x7F000000
        sub eax,dword ptr [t]
        mov dword ptr [it],eax
        fld dword ptr [t]
        fmul dword ptr [it]
        fsubr [two]
        fmul dword ptr [it]
        fstp dword ptr [it]
    }
}
```

inv expected 0.1000000000 approx 0.0991210938

Gross time (est): P2-450 / 0.00512982 sec

```
for (int i = 0; i < 100000; i++)
{
    __asm
    {
        fld1
        fld dword ptr [t]
        fdiv
        fstp dword ptr [it]
    }
}
```

inv expected 0.1000000000 approx 0.1000000015

Gross time (est): P2-450 / 0.00852415 sec

```
for (int i = 0; i < 100000; i++)
{
    it = 1/t;
}
```

inv expected 0.1000000000 approx 0.1000000015

Gross time (est): P2-450 / 0.00940774 sec

Obviously this isn't the end all be all of testing... :)

Hey, help me out here: what is a biased number? Excuse me for sounding stupid, but I'm only in Grade 11 :(

The IEEE format uses a bias for the exponent - 7Fh for the REAL4 format. This bias allows positive floating point numbers to be compared with integer compares! :) An exponent field of 7Fh means 2^0 - note the leading one bit is implied - so the number is one (1).
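A quick sketch of both claims - the exponent field of 1.0 is 7Fh, and positive singles order the same way their bit patterns do as integers (helper names are my own):

```cpp
#include <cstdint>
#include <cstring>

// Bit pattern of an IEEE 754 single.
uint32_t bits_of(float f) { uint32_t b; std::memcpy(&b, &f, sizeof b); return b; }

// Positive singles sort identically to their bit patterns, because the
// biased exponent sits in the high bits above the mantissa.
bool int_compare_less(float a, float b) { return bits_of(a) < bits_of(b); }
```

Here bits_of(1.0f) is 0x3F800000 - exponent field 0x7F, mantissa 0 - and the integer compare agrees with the float compare for any two positive values.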

Sliver:

It is not necessary to load the divisor for performing the division with the FPU. It would be interesting to see the relative comparison with your time figures if you replace the two instructions

fld dword ptr

fdiv

by the single instruction

fdiv dword ptr

Raymond


The P4 takes no less than 23 / 38 / 43 cycles for the FDIV! Athlon does a little better with 16 / 20 / 24 - the three latency numbers refer to precision control settings of single precision, double precision, and extended precision, respectively.

Things look better on Athlon XPs: pfrcp and rcpss take 3 clocks.
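On SSE-capable chips the same seed-plus-refine pattern is available without the integer trick: `rcpss` gives a roughly 12-bit reciprocal in a few clocks, and one Newton-Raphson step brings it near full single precision. A sketch, assuming an x86 target with SSE (the function name is my own):

```cpp
#include <xmmintrin.h>   // SSE intrinsics

// Approximate 1/t with rcpss (|relative error| <= 1.5 * 2^-12),
// then one Newton-Raphson step: r = r * (2 - t*r).
float rcp_nr(float t)
{
    float r = _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(t)));
    return r * (2.0f - t * r);
}
```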

```
__asm
{
    fld1
    fdiv dword ptr [t]
    fstp dword ptr [it]
}
```

inv expected 0.1000000000 approx 0.1000000015

Gross time (est): P2-450 / 0.00849256 sec, P2-500 / 0.01, P2-650 / 0.01

Still beats it consistently by approx. 0.003 seconds

This of course is by no means conclusive, but it's not meant to be :p

Here is the code I use to check it (in C/C++):

```
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <assert.h>

#define COUNT_CYCLES 1

/*****************************************************************************/
#if COUNT_CYCLES
/*****************************************************************************/

#pragma warning(disable:4035)   // no return value: result comes back in EDX:EAX

#define CCNT_OVERHEAD64 13

__int64 GetCycleCount64()
{
    __asm _emit 0x0F    // rdtsc
    __asm _emit 0x31
}

#define CCNT_OVERHEAD32 13

unsigned GetCycleCount32()      // enough for about 40 seconds
{
    __asm push EDX
    __asm _emit 0x0F    // rdtsc
    __asm _emit 0x31
    __asm pop EDX
}

#pragma warning(default:4035)

/*****************************************************************************/

static __int64  cycleBegin;
static __int64  cycleTotal;
static __int64  cycleStart;
static unsigned cyclePause;
unsigned        cycleExtra;

static void cycleCounterInit()
{
    cycleBegin = GetCycleCount64();
    cycleTotal = 0;
    cycleStart = 0;
    cyclePause = 0;
}

static __int64 cycleCounterDone(__int64 *realPtr)
{
    assert(cyclePause == 0);
    *realPtr = GetCycleCount64() - cycleBegin;
    return cycleTotal;
}

void cycleCounterBeg()
{
    assert(cyclePause == 0);
    cycleStart = GetCycleCount64();
}

void cycleCounterEnd()
{
    assert(cycleStart != 0);
    assert(cyclePause == 0);
    cycleTotal += GetCycleCount64() - cycleStart;
    cycleStart = 0;
}

void cycleCounterPause()
{
    assert(cycleStart != 0);
    if (!cyclePause)
        cycleTotal += GetCycleCount64() - cycleStart;
    cyclePause++;
}

void cycleCounterResume()
{
    assert(cycleStart != 0);
    assert(cyclePause != 0);
    if (--cyclePause)
        return;
    cycleStart = GetCycleCount64();
}

/*****************************************************************************/
#endif//COUNT_CYCLES

#define FP_ONE_BITS 0x3F800000

// r = 1/p
#define FP_INV(r,p)                             \
{                                               \
    int _i = 2 * FP_ONE_BITS - *(int *)&(p);    \
    r = *(float *)&_i;                          \
    r = r * (2.0f - (p) * r);                   \
}

float two = 2.0f;

#define FP_INV2(r,p)                            \
{                                               \
    __asm { mov eax,0x7F000000 };               \
    __asm { sub eax,dword ptr [p] };            \
    __asm { mov dword ptr [r],eax };            \
    __asm { fld dword ptr [p] };                \
    __asm { fmul dword ptr [r] };               \
    __asm { fsubr [two] };                      \
    __asm { fmul dword ptr [r] };               \
    __asm { fstp dword ptr [r] };               \
}

void main()
{
#if COUNT_CYCLES
    /* Reset the cycle counter */
    cycleCounterInit();
    assert(cycleTotal == 0);
    cycleCounterBeg();
#endif

    float t, it;
    t = 10.0f;

    for (int i = 0; i < 100000; i++)
    {
        __asm
        {
            mov eax,0x7F000000
            sub eax,dword ptr [t]
            mov dword ptr [it],eax
            fld dword ptr [t]
            fmul dword ptr [it]
            fsubr [two]
            fmul dword ptr [it]
            fstp dword ptr [it]
        }
/*
        __asm
        {
            fld1
            fdiv dword ptr [t]
            fstp dword ptr [it]
        }
*/
    }

    printf("inv expected %20.10f approx %20.10f\n", 1/t, it);

#if COUNT_CYCLES
    __int64 cycleTotal;     // local: receives the gross count from cycleCounterDone
    __int64 cycleSpent;

    cycleSpent = cycleCounterDone(&cycleTotal);

    if (cycleTotal)
        printf("Gross time (est): P2-450 / %2.8f sec, P2-500 / %4.2f, P2-650 / %4.2f\n",
               (float)cycleTotal/450000000, (float)cycleTotal/500000000, (float)cycleTotal/650000000);
    if (cycleSpent)
        printf("Net time (est):   P2-450 / %2.2f sec, P2-500 / %4.2f, P2-650 / %4.2f\n",
               (float)cycleSpent/450000000, (float)cycleSpent/500000000, (float)cycleSpent/650000000);
    if (cycleExtra)
        printf("Extra time (est): P2-450 / %4.2f sec, P2-500 / %4.2f, P2-650 / %4.2f\n",
               (float)cycleExtra/450000000, (float)cycleExtra/500000000, (float)cycleExtra/650000000);

    printf("\n");
#endif
}
```

A paper was written about the approximation method for the inverse square root:

http://www.math.purdue.edu/~clomont/Math/Papers/2003/InvSqrt.pdf

It was an interesting look at trying to better that approximation...

The funny part was he never really figured out what made that constant so important...

Brought me back to a math class a couple of years ago when I tried to understand why Sqrt(2 pi n) was so damn accurate (at least in ratios) in Stirling's Approximation, using PI = 3.14159265359 (since at the time I'd never seen "e" and "PI" in the same equation before)
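For the curious, Stirling's formula is n! ~ sqrt(2 pi n) * (n/e)^n - that's where pi and e meet. A quick sketch checking the ratio (the constants and function names are my own, not from the thread):

```cpp
#include <cmath>

// Stirling's approximation to n!
double stirling(int n)
{
    const double PI = 3.14159265358979323846;
    const double E  = 2.71828182845904523536;
    return std::sqrt(2.0 * PI * n) * std::pow(n / E, (double)n);
}

// Exact factorial, computed in a double (exact up to n = 22).
double factorial(int n)
{
    double f = 1.0;
    for (int i = 2; i <= n; i++) f *= i;
    return f;
}
```

For n = 10 the ratio factorial(10)/stirling(10) is about 1.0084 - the approximation undershoots by roughly 1/(12n), which is why the ratio looks so good even for small n.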

Sliver


About mathematical approximations, there is a small optimization project:

http://www.optimalcode.com/fastcode/index.htm

Dennis Christensen wrote his own DKC optimizer, which computes the best coefficients for a given expression (perhaps using PSLQ or LLL ?).

JC
