Hello,

I've written a routine to approximate the cube root of a real number and I'm looking for some improvements. The method is not iterative; it only needs to be calculated once. It uses a rational quartic polynomial to approximate the cube root, as described in the source code. The paper I got this from claims it is accurate to 24 bits. The core of it requires only 5 muls and 1 div, but I'm looking to clean it up a bit. Any ideas are greatly appreciated.

Here is the function:


```
CubeRoot proc uses eax ecx edx x:real4
LOCAL exp:DWORD
    fld x                   ;we have to work with absolute value
    fabs
    fxtract                 ;get exponent
    fxch
    fistp exp
    mov eax,exp
    cdq
    mov ecx,3
    idiv ecx                ;eax=n, edx=exponent for y
    cmp edx,0
    jl @F
    sub edx,3               ;adjust eax and edx so -3 <= edx < 0
    inc eax
@@:
    mov exp,edx
    fild exp
    fxch
    fscale                  ;this produces y, a value in [1/8,1)
    fstp st(1)
    fld st(0)               ;calculate P(y). 3 muls 5 adds
    fmul real8 ptr CC[8*0]
    fld real8 ptr CC[8*1]
    fadd st(0),st(1)
    fmul st(0),st(0)
    fxch
    fadd real8 ptr CC[8*2]
    fadd st(0),st(1)
    fxch
    fadd real8 ptr CC[8*3]
    fmulp st(1),st(0)
    fadd real8 ptr CC[8*4]
    fld st(1)               ;calculate Q(y). 2 muls 5 adds
    fld real8 ptr CC[8*5]
    fadd st(0),st(1)
    fmul st(0),st(0)
    fxch
    fadd real8 ptr CC[8*6]
    fadd st(0),st(1)
    fxch
    fadd real8 ptr CC[8*7]
    fmulp st(1),st(0)
    fadd real8 ptr CC[8*8]
    fdivp st(1),st(0)       ;CubeRoot(y) ~= P(y)/Q(y)
    mov exp,eax             ;load up n
    fild exp
    fxch
    fscale                  ;CubeRoot(x) = 2^n * CubeRoot(y)
    fstp st(1)
    fstp st(1)
    test x,80000000h        ;restore sign
    jz @F
    fchs
@@:
    ret
CubeRoot endp
```

--Chorus

Comment #1

Unless you start with a finite precise number in your source REAL4, you cannot get a cube root of that number with more than 8 bits of precision regardless of the procedure being used (the REAL4 itself only has 24 bits of precision).

If you do start with a finite precise number in your source REAL4, the most precision you could get is 21 bits, i.e. a third of the 64 bits of precision used by the FPU data registers.

And if you are not already aware of it, what you would normally consider as a precise decimal number such as 1.35 would NOT be a precise binary number when converted to a REAL4.

(Similarly, a square root would only yield 12 bits of precision on the same REAL4 number.)
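Raymond's point about 1.35 is easy to check in C (a minimal sketch; `real4_roundtrip` is a hypothetical helper name): storing the value into a REAL4 and reading it back does not give 1.35 back.

```c
#include <math.h>

/* Store a decimal value into a REAL4 (float) and read it back.
   1.35 has no finite binary expansion, so the round trip changes it;
   a value like 1.25 = 1.01b survives exactly. */
double real4_roundtrip(double v)
{
    float f = (float)v;    /* what an `fld real4` would actually load */
    return (double)f;
}
```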

Comment #2

Your code to extract a cube root would be a lot shorter if you used the logarithmic instructions, and your result would be the most precise possible (still only to 8 bits) instead of being an approximation.

Raymond


*Originally posted by Raymond*

Your code to extract a cube root would be a lot shorter if you used the logarithmic instructions, and your result would be the most precise possible (still only to 8 bits) instead of being an approximation.

For double precision and so-called extended precision numbers, I agree with Raymond.

But for single-precision numbers, the log approach is slow. An approximation can be as accurate as the log approach, and much faster. One can verify this by using Sun's FDLIBM. After all, the reason we use single-precision numbers is speed (and a smaller memory requirement).

Thanks for your comments :) A couple of remarks:

1) I am aware that certain numbers cannot be expressed exactly in either real4, real8 or for that matter any amount of precision. Unfortunately, there's really nothing I can do about it...

2) The input, of course, is dependent on the individual application. For example, my application is actually only concerned with integers. The code is easily modified because the only two lines that depend on input precision are the function prototype and the first fld (or fild)

3) From what I understand the alternative approach (to calculate with logs) uses f2xm1 and fyl2x, and both these instructions use a lot of clock cycles. In your opinion, Raymond, do you think these instructions would be faster than my approach (with muls and divs)? It was my impression that the general x^y code would be much slower than, say, using Newton's method (or mine). Personally, I need speed and not size for my application

4) Output precision is real8. Again, the source document claims that the maximum error of the approximating polynomial is <2^(-24) in the interval [1/8,1). I have not been able to verify this on a large scale, but the few values I have observed seem to hold this true. (By the way, the original document can be found at http://www.worldserver.com/turk/computergraphics/CubeRoot.pdf).

5) For extending the precision of this approximation, one can either use Newton's method, or arbitrary precision math. Both of these are described in Numerical Recipes (an online version of which is at http://www.nr.com)

6) To starless, since all the operations are done on the FPU stack, there should be no speed difference between a real4 or real8 implementation except for the very first instruction. All stack operations are done as real8s by default (I believe)
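Point 5's Newton refinement, for reference, is one line per iteration (a sketch; `newton_cbrt_step` is a hypothetical name). Each step roughly doubles the number of correct bits, so a single step applied to a 24-bit-accurate result would already exceed real8 precision.

```c
/* One Newton-Raphson step for f(r) = r^3 - x:
   r' = r - (r^3 - x) / (3 r^2) = (2 r + x / r^2) / 3. */
double newton_cbrt_step(double x, double r)
{
    return (2.0 * r + x / (r * r)) / 3.0;
}
```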

What really concerns me about my proc, actually, is the fxtract and two fscales. If anyone can think of any clever ways of getting rid of these, I'd be interested to know... I think the fscales in particular wipe out most of the performance gains from eliminating the muls...

Again, Thanks for your comments

--Chorus


2) and 6) Not only would the parameters passed to the routine need changing, but also the FPU would have to be put into an extended precision mode.

4) Would not this error compound as these approximations are continually used to approximate the cube roots of larger numbers?

Thanks for the link to the original text and keep up the good work.


1.- The FPU functions with 80-bit registers, such that all FPU operations are performed with extended precision, i.e. REAL10.

2.- That precision is equivalent to approximately 19 decimal significant digits. The list of constants supplied in your .asm file initializes REAL8 variables which only have the precision equivalent to approximately 16 decimal significant digits. All those digits exceeding the 16th or 17th digit are thus totally useless for improving the precision of any computation.

3.- The following proc, relying on logarithms to compute the cube root, takes approximately 1.22 microsecs (based on 1000000 loops) as compared to approximately 1.16 microsecs with your proc on my computer (P4 1.5GHz). The difference is quite minimal.

When results are compared to those obtained with the Windows' calculator, the log proc is precise to the 16th significant digit. Your proc is precise to the 8th decimal digit.

Raymond


```
CubeLog proc uses eax ecx edx x:DWORD
    finit
    fld1
    fld st
    fadd st,st
    fadd            ;st=3
    fild x          ;st=x, st1=3
    fabs
    fxtract         ;st=man, st1=exp, st2=3
    fincstp         ;st7=man, st=exp, st1=3
    fst st(2)       ;st7=man, st=exp, st1=3, st2=exp
    fprem           ;st7=man, st=expmod3, st1=3, st2=exp
    fdecstp         ;st=man, st1=expmod3, st2=3, st3=exp
    fscale          ;st=man*2^expmod3, st1=expmod3, st2=3, st3=exp
    fstp st(1)      ;st=man*2^expmod3, st1=3, st2=exp
    fld1            ;st=1, st1=man*2^expmod3, st2=3, st3=exp
    fdiv st,st(2)   ;st=1/3, st1=man*2^expmod3, st2=3, st3=exp
    fxch            ;st=man*2^expmod3, st1=1/3, st2=3, st3=exp
    fyl2x           ;st=log2(man*2^expmod3)*1/3, st1=3, st2=exp
    f2xm1
    fld1
    fadd            ;st=2^[log2(man*2^expmod3)*1/3], st1=3, st2=exp
    fincstp         ;st7=2^[log2(man*2^expmod3)*1/3], st=3, st1=exp
    fdiv st(1),st   ;st7=2^[log2(man*2^expmod3)*1/3], st=3, st1=exp/3
    fdecstp         ;st=2^[log2(man*2^expmod3)*1/3], st1=3, st2=exp/3
    fstp st(1)      ;st=2^[log2(man*2^expmod3)*1/3], st1=exp/3
    fscale          ;st=2^[log2(man*2^expmod3)*1/3]*2^exp/3, st1=exp/3
    ffree st(1)     ;st=cube root (x)
    test x,80000000h
    jz @F
    fchs            ;restore sign
@@:
    ret
CubeLog endp
```

Raymond

bitRAKE,

Thank you very much for your comments :)

Regarding the FPU precision mode, I think by default the FPU is in double precision mode. Or at least, that's what I remember FINIT does, and I think Windows will call this before the application process is started (can anyone verify this?). So I believe when the first fld (or fild) is called, regardless of the original source, the value will be converted to real8 on the stack. I suppose you could change the precision to extended to buy yourself some extra bits in the calculation, but that will likely be taken care of at the application level.

As for error accumulating, I'm not sure what you mean. The procedure I wrote is not iterative, so taken by itself the error should not get worse as the number gets larger (at least not worse *percentage*-wise; it will however double in magnitude as the original number doubles, etc.). For Newton's method, the error should actually *decrease* as more iterations are run. The source code that I attached in my OP has a more detailed explanation.

Thanks

--Chorus


1. Chorus reply 6:

Actually, you can save memory with SP coefficients. What I told you is from my experience. Try transcribing Sun's FDLIBM, and compare the speed with log approach. You'll see what I meant only when you actually do it.

2. Raymond's second comment 1:

That is not true for all FPU instructions. An example to falsify your claim is fdiv. Try changing the PC bits and fdiv. You will see that in some cases, the x86 FPU fails to pass the IEEE test. (But lower precision makes fdiv faster.)

3. Finally...

PC bits are set to 3 when the CPU starts up. But several OSes change it to 2, claiming that extended precision is unreliable. Windows is one of them. One advantage of the log approach is that several transcendental functions are always done in extended precision, regardless of your PC setting. log and exponential functions are among them. On the other hand, basic operations like add/mul/sub/div are sensitive to the PC setting.

And, cube root is relatively smooth function (compared to other frequently used functions) and you don't lose much even if your coefficients are in SP. -- And I told you Sun's FDLIBM is as accurate as log approach. The results are 1 ulp apart at the worst. 1 ulp of SP number. How large error is that? ;)


About the extraction of the 3rd root, it reminds me of an interesting approach on the site optimalcode.com (which is strangely dead now).

It was a 'DKC' solver (from the name of the Danish creator), which was a function approximator; it was able to compute approximations of trigonometric functions with very few terms and good accuracy, but it should work on exponentiations too. I don't know much about the algorithms behind it.

Perhaps you should try the Chebyshev (or Chebychev, or with tch...) method to get a better approximation.

Some years ago, I also saw a binary-digit nth-root extractor. It was in Byte. Perhaps you could find it somewhere?

JC


Starless, I'll give your approach a try and see if I can reduce the size of the code (I assume that's what you mean by saving memory... ? Otherwise I'm not sure what there is to gain...)

MCoder, I was just looking into a Rational Chebyshev Approximation. I'm thinking I can get a pretty good approximation for x^(1/3) in the interval [-1/8,1/8] and use the same idea of scaling down x, calculating the root and then scaling back up. I need to get to a computer with Mathematica or something to calculate the coefficients... I don't think my calculator will solve the approximation to the necessary precision...

As an aside, I tried replacing one of the fscales in my code with multiplying by the corresponding integer value (2^n). This is much faster (speeding up the algo by 20% or so), but only works if 0 <= n < 32, i.e., it doesn't work for negative values of n (corresponding to 0<x<1) or very large values of n.
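The multiply-instead-of-fscale idea can also be made to cover negative n by building the REAL4 bit pattern for 2^n directly, instead of converting an integer 2^n. A sketch in C (`pow2f` is my name for it, and it assumes n stays in the normalized range):

```c
#include <stdint.h>
#include <string.h>

/* Construct the float 2^n from its IEEE-754 bits: sign 0, biased
   exponent n + 127, mantissa 0.  Unlike multiplying by an integer
   2^n, this also handles negative n (i.e. 0 < x < 1).  Valid only
   for -126 <= n <= 127 (normalized range). */
float pow2f(int n)
{
    uint32_t bits = (uint32_t)(n + 127) << 23;
    float f;
    memcpy(&f, &bits, sizeof f);   /* reinterpret bits as a float */
    return f;
}
```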

--Chorus


As an aside, I tried replacing one of the fscale's in my code with multiplying by the corresponding integer value (2^n). This is much faster (increasing the algo by 20% or so), but only works if 0 <= n <32. ie., doesn't work for negative values of n (corresponding to 0<x<1) or very large values of n.

```
    fstp TBYTE PTR [esp]
    ; adjust the exponent
    add [esp+8], eax
    ; load the scaled value
    fld TBYTE PTR [esp]
```


For those who may not be aware, the above code would trash the stack if used as is. Most probably bitRAKE would have preceded that with

**sub esp,10** to reserve space on the stack, and later restore the stack with an **add esp,10**.
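bitRAKE's trick, translated to C on a REAL8 for illustration (a hedged sketch; `scale_pow2` is my name, and like the asm it assumes the scale never pushes the exponent out of range, which is also what keeps the sign bit safe):

```c
#include <stdint.h>
#include <string.h>

/* Scale x by 2^n by adding n straight into the biased exponent field
   (bits 52..62 of an IEEE-754 double), just as the asm adds eax into
   the REAL10 exponent word spilled to the stack.  The sign bit is
   untouched as long as the new exponent stays in the normal range. */
double scale_pow2(double x, int n)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits += (uint64_t)(int64_t)n << 52;   /* adjust the exponent */
    memcpy(&x, &bits, sizeof x);          /* load the scaled value */
    return x;
}
```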

chorus

I was curious about the efficiency of the FpuXexpY function of the Fpulib and surprised at its speed compared to the other algo I posted. The main reason for the speed improvement was that the reciprocal of 3 had to be precalculated and in memory before calling the function.

The following proc comes at 0.70 microseconds as compared to the other figures reported previously. Significantly faster and still providing the most precision. Hard to believe but the simple computation of the reciprocal of 3 within the proc itself would add some 0.4 microseconds!!!

```
CubeLogA proc uses eax ecx edx x:DWORD

.data
recip3 dt 0.333333333333333333
.code
    finit
    fld recip3
    fild x
    fabs
    fyl2x           ;->log2(x)*1/3
    fld st(0)       ;copy the logarithm
    frndint         ;keep only the characteristic
    fsub st(1),st   ;keeps only the mantissa
    fxch            ;get the mantissa on top
    f2xm1           ;->2^(mantissa)-1
    fld1
    fadd            ;add 1 back
    fscale          ;scale it with the characteristic
    fstp st(1)      ;copy result over and "pop" it
    test x,80000000h
    jz @F
    fchs            ;restore sign
@@:
    ret
CubeLogA endp
```

Raymond

One more detail which may be difficult to explain. As stated in the previous post, computing the reciprocal of 3 within the proc was adding some 0.40 microsecond.

This was relatively constant whether the value of 3 was built on the FPU or loaded from memory as a floating point (REAL4, REAL8 or REAL10) or an integer (WORD or DWORD) before performing the division of 1.

This was also relatively constant when dividing 1 directly by a value of 3 from memory whether declared as a floating point (REAL4, REAL8) or a DWORD integer.

HOWEVER, dividing 1 directly by a value of 3 from memory declared as a


**WORD** only added 0.05 microseconds to the proc. :confused:

Raymond

Hello Raymond,

I love the simplicity of your method. I can't argue about having smaller *and* faster code :)

However, there must be a big difference between your processor and mine... when I compare the latest code of yours vs. my original, mine seems to run about 30% faster... I'm running an AMD Athlon 1100 MHz. I'm just using a simple rdtsc to count clocks after 1000000 iterations, like so:


```
start:
    mov ecx,1000000
    rdtsc
    mov FirstTimeLo,eax
    mov FirstTimeHi,edx
@@:
    push 8
    call CubeRoot
;   call CubeLogA
    fstp st(0)
    dec ecx
    jnz @B
    rdtsc
    sub eax,FirstTimeLo
    sbb edx,FirstTimeHi
    ;I'm dividing down by 1000 to make sure there's nothing
    ;in edx that I'm missing, although I expect it to be zero
    mov ecx,1000
    idiv ecx
    invoke dwtoa,eax,addr szOutput
    invoke MessageBox,NULL,addr szOutput,NULL,MB_OK
    invoke ExitProcess,NULL
```

After running it at least a dozen times your proc comes up around 200 million clocks and mine around 140 million...

Perhaps the P4 executes the f2xm1 and fyl2x much faster...

(BTW with your method you have to test for zero at the beginning b/c of the log base 2)

Regards

--Chorus

PS I looked into approximating the cube root in the range [-1/8,1/8] to try to simplify my method a little (I was hoping, for instance, to eliminate the need to check for the sign). This doesn't seem to work very well because the derivative disappears at 0. Rational polynomial approximation doesn't seem to like that very much, though I'm going to try other ideas. In the meantime, I'm going to try bitRAKE's suggestion for replacing fscale, although I think it's a little more complicated than his post because the top bit of the 10BYTE is for sign... it'll have to be preserved somehow... I think a simple rol value,1 ought to do the trick...

It's surprising at times how different computers will perform with a given set of instructions. For example, I have a little program to extract square roots with up to 10,000 decimals in the answer. That program was running almost twice as fast on my previous P3-550 than on my P4-1500!!!

Your AMD seems to be a lot more efficient than my P4. Using rdtsc (instead of the millisec timer of Windows), I'm getting 1200 cycles for my procedure and 1700 for yours. Maybe it's a P4-150 that I got.:(

When I looked at bitRAKE's suggestion for scaling, I was also annoyed at first about the sign bit. Giving more thought to it, that sign bit will not get affected unless your scaling factor would be so big (positive or negative) as to exceed the range of the REAL10 format, which is quite improbable unless you feed it garbage.

Raymond


Ok, after taking a number of different ideas I've got a much revamped version:

1) I managed to eliminate the integer divide by using magic number division (thanks to The Svin for these constants, I stole them from one of his posts :) )

2) The two fscales and one fxtract are gone, I'm doing the scaling manually using the real10 format. (For anyone unfamiliar with real10 format, check out Intel's Architecture Software Developer's Manual, Volume 1)

3) All conditional jumps are removed

4) At the bottom, I use a real4 scale value to simultaneously restore the sign of the original value and adjust the exponent

5) The constants have been recalculated and reordered. I'm not sure if the values will load any quicker if they are in the order that they get used, but I figured it couldn't hurt
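The magic-number division in point 1 works like this in C (a sketch; the constant 2863311531 = 0xAAAAAAAB is the same one the proc loads into eax):

```c
#include <stdint.h>

/* Unsigned division by 3 without a div: 2863311531 = (2^33 + 1)/3,
   so (x * 2863311531) >> 33 equals x / 3 exactly for every 32-bit x.
   The asm gets the same effect with mul ecx / shr edx,1: the high
   half of the 64-bit product is the result already shifted by 32. */
uint32_t div3(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 2863311531u) >> 33);
}
```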

This code now runs about twice as fast as yours on my machine. Strangely, I thought I could get better performance by rearranging the instructions (trying to get better pairing) but I couldn't make much of a difference. So I've left it in a more "readable" manner.

I still can't touch your procedure for precision, although I do have an idea to improve it, I don't really feel like going through it right now. The idea is pretty simple: The proc maps the input value to the range [1/8,1) which naturally breaks down to three separate regions [1/8,1/4), [1/4,1/2), and [1/2,1). My idea is simple: to have 3 sets of constants, one corresponding to a distinct approximation function in each of the three ranges. This should bring the error down to around <1e-12 or so.

Again, if anyone has any improvements I'd be glad to hear them

Thanks

--Chorus


```
.data
CC  real8  6.3637179600061402163539320738081
    real8 -12.065381127146011865968174543024
    real8  1.15441272092408400896359030695808
    real8  1.32216597568937846186167818505031
    real8  82.614153385222731367127241292588
    real8  65.00227162585858862822658857792
    real8 -90.424005465780377990655280610327
    real8  2.31555198682175862934814370456
    real8  5985.3903660662530599695396031034
.code
CubeRoot proc uses eax ecx edx x:dword
LOCAL mant[3]:DWORD
    fild x                      ;get the input value into a real10 format
    fstp real10 ptr mant        ; so we can adjust the exponent
    mov ecx,7FFFh
    mov eax,2863311531          ;Use magic number "division" of exponent by three to get
    and ecx,dword ptr [mant+8]  ; remainder and quotient. Some constants are used here
    mul ecx                     ; because of the bias on the real10 stored value.
    shr edx,1
    lea eax,[edx*2][edx-3FFFh+3]
    sub ecx,eax
    mov dword ptr [mant+8],ecx  ;Store back the adjusted exponent
    fld real8 ptr CC[0*8]       ;Start loading our constants to calculate P(y)
    fld real8 ptr CC[1*8]       ; the first polynomial
    fld real8 ptr CC[2*8]
    fld real10 ptr mant
    fmul real8 ptr CC[3*8]
    fadd st(2),st(0)
    faddp st(1),st(0)
    fmul st(0),st(0)
    fadd st(2),st(0)
    faddp st(1),st(0)
    fmulp st(1),st(0)
    fadd real8 ptr CC[4*8]      ;P(y)
    fld real8 ptr CC[5*8]       ;The second polynomial Q(y)
    fld real8 ptr CC[6*8]
    fld real8 ptr CC[7*8]
    fld real10 ptr mant
    fadd st(2),st(0)
    faddp st(1),st(0)
    fmul st(0),st(0)
    fadd st(2),st(0)
    faddp st(1),st(0)
    fmulp st(1),st(0)
    fadd real8 ptr CC[8*8]      ;Q(y)
    fdivp st(1),st(0)           ;P(y)/Q(y)
    mov eax,x                   ;Build a real4 value with the following properties:
    add edx,7Fh-1554h
    and eax,80000000h           ; 1) Has same sign as original value
    shl edx,23                  ; 2) Has exponent of cube root
    add edx,eax
    mov dword ptr [mant],edx
    fmul real4 ptr [mant]       ;Multiply by our value to simultaneously correct
    ret                         ; sign and exponent
CubeRoot endp
```

Oops... doesn't work with x = 0. Just throw in a cmp x,0/je @@ReturnFromTheProc after the fild :)