Hello,
I've written a routine to approximate the cube root of a real number and I'm looking for some improvements. The method is not iterative; it only needs to be calculated once. It uses a rational quartic polynomial to approximate the cube root, as described in the source code. The paper I got this from insists it is accurate to 24 bits. The core of it requires only 5 muls and 1 div, but I'm looking to clean it up a bit. Any ideas are greatly appreciated.

Here is the function:

``````
CubeRoot proc uses eax ecx edx x:real4
LOCAL exp:DWORD

fld x                       ;we have to work with absolute value
fabs
fxtract                     ;get exponent
fxch
fistp exp

mov eax,exp
cdq
mov ecx,3
idiv ecx                    ;eax=n, edx=exponent for y
cmp edx,0
jl @F
sub edx,3                   ;adjust eax and edx so -3 <= edx <0
inc eax
@@:
mov exp,edx
fild exp
fxch
fscale                      ;this produces y, a value in [1/8,1)
fstp st(1)

fld st(0)                   ;calculate P(y). 3 muls 5 adds
fmul real8 ptr CC[8*0]
fld real8 ptr CC[8*1]
fmul st(0),st(0)
fxch
fxch
fmulp st(1),st(0)

fld st(1)                   ;calculate Q(y). 2 muls 5 adds
fld real8 ptr CC[8*5]
fmul st(0),st(0)
fxch
fxch
fmulp st(1),st(0)

fdivp st(1),st(0)          ;CubeRoot(y) ~= P(y)/Q(y)

fild exp
fxch
fscale                     ;CubeRoot(x) = 2^n * CubeRoot(y)
fstp st(1)
fstp st(1)
test x,80000000h           ;restore sign
jz @F
fchs
@@:
ret
CubeRoot endp
``````
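For anyone who wants to play with the range reduction outside the FPU, here is a hedged sketch of the same decomposition in Python (the function name is mine, and `x ** (1.0/3.0)` stands in for the rational quartic P(y)/Q(y), which is not reproduced here):

``````python
import math

def cbrt_by_reduction(x):
    """Mirror of the proc's range reduction: write |x| = y * 2^(3n)
    with y in [1/8, 1); then cbrt(|x|) = cbrt(y) * 2^n."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(abs(x))   # m in [1/2, 1); fxtract's mantissa is m*2
    m, e = m * 2.0, e - 1       # now |x| = m * 2^e with m in [1, 2)
    n, r = divmod(e, 3)         # e = 3n + r with r in {0, 1, 2}
    r -= 3                      # force -3 <= r < 0 (the sub edx,3 / inc eax)
    n += 1
    y = math.ldexp(m, r)        # y in [1/8, 1) -- the first fscale's job
    root = y ** (1.0 / 3.0)     # stand-in for the P(y)/Q(y) rational quartic
    return math.copysign(math.ldexp(root, n), x)  # second fscale + sign
``````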

--Chorus
Posted on 2003-08-19 20:27:27 by chorus
Comment #1
Unless you start with a finite precise number in your source REAL4, you cannot get a cube root of that number with more than 8 bits of precision regardless of the procedure being used (the REAL4 itself only has 24 bits of precision).

If you do start with a finite precise number in your source REAL4, the most precision you could get is 21 bits, i.e. a third of the 64 bits of precision used by the FPU data registers.

And if you are not already aware of it, what you would normally consider as a precise decimal number such as 1.35 would NOT be a precise binary number when converted to a REAL4.

(Similarly, a square root would only yield 12 bits of precision on the same REAL4 number.)
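To see the point about 1.35 concretely, a quick check in Python (round-tripping through single precision with the struct module; the helper name is mine):

``````python
import struct

def as_real4(x):
    """Round a Python float (a double) to the nearest single-precision
    value and widen it back to double, so the rounding becomes visible."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

# 1.35 has no finite binary expansion, so a REAL4 stores a neighbour:
nearest = as_real4(1.35)   # prints as something like 1.350000023841858
``````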

Comment #2
Your code to extract a cubic root would be a lot shorter if you use the logarithmic instructions and your result would be the most precise (still only to 8 bits) instead of being an approximation.

Raymond
Posted on 2003-08-19 23:23:27 by Raymond
Originally posted by Raymond
Your code to extract a cubic root would be a lot shorter if you use the logarithmic instructions and your result would be the most precise (still only to 8 bits) instead of being an approximation.

For double precision and so-called extended precision numbers, I agree with Raymond.

But for single precision numbers, the log approach is slow. An approximation can be as accurate as the log approach, and much faster. One can verify this by using Sun's FDLIBM. After all, the reason we use single precision numbers is speed (and a smaller memory requirement).
Posted on 2003-08-20 00:21:33 by Starless

1) I am aware that certain numbers cannot be expressed exactly in either real4, real8 or for that matter any amount of precision. Unfortunately, there's really nothing I can do about it...

2) The input, of course, is dependent on the individual application. For example, my application is actually only concerned with integers. The code is easily modified because the only two lines that depend on input precision are the function prototype and the first fld (or fild).

3) From what I understand, the alternative approach (calculating with logs) uses f2xm1 and fyl2x, and both of these instructions use a lot of clock cycles. In your opinion, Raymond, do you think these instructions would be faster than my approach (with muls and divs)? It was my impression that the general x^y code would be much slower than, say, using Newton's method (or mine). Personally, I need speed and not size for my application.

4) Output precision is real8. Again, the source document claims that the maximum error of the approximating polynomial is <2^(-24) in the interval [1/8,1). I have not been able to verify this on a large scale, but the few values I have observed seem to hold this true. (By the way, the original document can be found at http://www.worldserver.com/turk/computergraphics/CubeRoot.pdf).

5) For extending the precision of this approximation, one can either use Newton's method, or arbitrary precision math. Both of these are described in Numerical Recipes (an online version of which is at http://www.nr.com)

6) To starless, since all the operations are done on the FPU stack, there should be no speed difference between a real4 or real8 implementation except for the very first instruction. All stack operations are done as real8s by default (I believe)

What really concerns me about my proc, actually, is the fxtract and two fscales. If anyone can think of any clever ways of getting rid of these, I'd be interested to know... I think the fscales in particular wipe out most of the performance gains from eliminating the muls...
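For reference, the Newton refinement mentioned in point 5 can be sketched in a couple of lines (hedged: this is the generic textbook iteration in Python, nothing proc-specific, and the function name is mine). Each step roughly doubles the number of correct bits, so one pass takes a 24-bit seed to roughly 48 bits:

``````python
def newton_cbrt_step(y, x):
    """One Newton step for f(y) = y^3 - x:
    y' = y - (y^3 - x) / (3 y^2) = (2*y + x / (y*y)) / 3."""
    return (2.0 * y + x / (y * y)) / 3.0

# example: refine a deliberately sloppy seed for cbrt(2) = 1.2599210...
seed = 1.26                           # ~4 correct digits
better = newton_cbrt_step(seed, 2.0)  # ~8-9 correct digits
``````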

--Chorus
Posted on 2003-08-20 09:51:07 by chorus
2) and 6) Not only would the parameters passed to the routine need changing, but the FPU would also have to be put into an extended precision mode.

4) Would not this error compound as these approximations are continually used to approximate the cube roots of larger numbers?

Thanks for the link to the original text and keep up the good work.
Posted on 2003-08-20 11:12:22 by bitRAKE
1.- The FPU functions with 80-bit registers, so all FPU operations are performed with extended precision, i.e. REAL10.

2.- That precision is equivalent to approximately 19 decimal significant digits. The list of constants supplied in your .asm file initializes REAL8 variables which only have the precision equivalent to approximately 16 decimal significant digits. All those digits exceeding the 16th or 17th digit are thus totally useless for improving the precision of any computation.

3.- The following proc relying on logarithms to compute the cubic root takes approximately 1.22 microsecs (based on 1000000 loops) as compared to approximately 1.16 microsecs with your proc on my computer (P4 1.5GHz). The difference is quite minimal.
When results are compared to those obtained with the Windows' calculator, the log proc is precise to the 16th significant digit. Your proc is precise to the 8th decimal digit.
``````
CubeLog proc uses eax ecx edx x:DWORD

finit
fld1
fld   st
fild  x                 ;st=x, st1=3
fabs
fxtract                 ;st=man, st1=exp, st2=3
fincstp                 ;st7=man, st=exp, st1=3
fst   st(2)             ;st7=man, st=exp, st1=3, st2=exp
fprem                   ;st7=man, st=expmod3, st1=3, st2=exp
fdecstp                 ;st=man, st1=expmod3, st2=3, st3=exp
fscale                  ;st=man*2^expmod3, st1=expmod3, st2=3, st3=exp
fstp  st(1)             ;st=man*2^expmod3, st1=3, st2=exp
fld1                    ;st=1, st1=man*2^expmod3, st2=3, st3=exp
fdiv  st,st(2)          ;st=1/3, st1=man*2^expmod3, st2=3, st3=exp
fxch                    ;st=man*2^expmod3, st1=1/3, st2=3, st3=exp
fyl2x                   ;st=log2(man*2^expmod3)*1/3, st1=3, st2=exp
f2xm1                   ;st=2^[log2(man*2^expmod3)*1/3]-1, st1=3, st2=exp
fld1
faddp st(1),st          ;st=2^[log2(man*2^expmod3)*1/3], st1=3, st2=exp
fincstp                 ;st7=2^[log2(man*2^expmod3)*1/3], st=3, st1=exp
fdiv  st(1),st          ;st7=2^[log2(man*2^expmod3)*1/3], st=3, st1=exp/3
fdecstp                 ;st=2^[log2(man*2^expmod3)*1/3], st1=3, st2=exp/3
fstp  st(1)             ;st=2^[log2(man*2^expmod3)*1/3], st1=exp/3
fscale                  ;st=2^[log2(man*2^expmod3)*1/3]*2^exp/3, st1=exp/3
ffree st(1)             ;st=cube root (x)
test x,80000000h
jz @F
fchs                    ;restore sign
@@:

ret

CubeLog endp
``````
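The flow of CubeLog is easier to follow away from the FPU stack; here is a hedged Python mirror of the same steps (frexp standing in for fxtract, math.fmod for fprem; the function name is mine). It also shows why the f2xm1-style argument log2(man*2^(exp mod 3))/3 stays inside (-2/3, 1), i.e. within f2xm1's domain:

``````python
import math

def cube_log(x):
    """Follow CubeLog step by step: split |x| = man * 2^exp with man
    in [1,2), fold (exp mod 3) back into the mantissa, take
    2^(log2(.)/3), then scale by 2^(exp/3)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(abs(x))            # m in [1/2, 1)
    man, exp = m * 2.0, float(e - 1)     # fxtract: man in [1, 2)
    r = math.fmod(exp, 3.0)              # fprem: remainder, sign of exp
    y = man * 2.0 ** r                   # fscale: man * 2^(exp mod 3)
    t = math.log2(y) / 3.0               # fyl2x with st(1) = 1/3
    root = 2.0 ** t                      # the f2xm1/fld1 part of the asm
    q = int((exp - r) / 3.0)             # what fscale truncates exp/3 to
    return math.copysign(math.ldexp(root, q), x)
``````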

Raymond
Posted on 2003-08-20 12:59:18 by Raymond
Bitrake,

Regarding the FPU precision mode, I think by default the FPU is in double precision mode. Or at least, that's what I remember FINIT does, and I think Windows will call this before the application process is started (can anyone verify this?). So I believe when the first fld (or fild) is called, regardless of the original source, the value will be converted to real8 on the stack. I suppose you could change the precision to extended to buy yourself some extra bits in the calculation, but that will likely be taken care of at the application level.

As for error accumulating, I'm not sure what you mean. The procedure I wrote is not iterative, so taken by itself the error should not get worse as the number gets larger (at least not *percentage*-wise; it will however double in magnitude as the original number doubles, etc.). For Newton's method, the error should actually *decrease* as more iterations are run. The source code that I attached in my OP has a more detailed explanation

Thanks

--Chorus
Posted on 2003-08-20 13:07:40 by chorus
1. Actually, you can save memory with SP coefficients. What I told you is from my experience. Try transcribing Sun's FDLIBM and compare the speed with the log approach. You'll see what I mean only when you actually do it.

2. On Raymond's comment 1:
That is not true for all FPU instructions. An example that falsifies the claim is fdiv: try changing the PC bits and timing fdiv. You will see that in some cases the x86 FPU fails to pass the IEEE test. (But lower precision makes fdiv faster.)

3. Finally...
PC bits are set to 3 when the CPU starts up. But several OSes change it to 2, claiming that extended precision is unreliable. Windows is one of them. One advantage of the log approach is that several transcendental functions are always done in extended precision, regardless of your PC setting; log and exponential functions are among them. On the other hand, basic operations like add/mul/sub/div are sensitive to the PC setting.

And the cube root is a relatively smooth function (compared to other frequently used functions), so you don't lose much even if your coefficients are in SP. -- And I told you Sun's FDLIBM is as accurate as the log approach. The results are 1 ulp apart at the worst -- 1 ulp of an SP number. How large an error is that? ;)
Posted on 2003-08-20 16:28:56 by Starless
About the extraction of the 3rd root, it reminds me of an interesting approach on the site optimalcode.com (which is strangely dead now).
It was a 'DKC' solver (named after its Danish creator), a function approximator: it was able to compute approximations of trigonometric functions with very few terms and good accuracy, but it should work on exponentiations too. I don't know much about the algorithms behind it.
Perhaps you should try the Chebyshev (or Chebychev, or with tch...) method to get a better approximation.

Some years ago, I also saw a binary digit nth-root extractor. It was in Byte. Perhaps could you find it somewhere ?

JC
Posted on 2003-08-20 16:56:14 by MCoder
Starless, I'll give your approach a try and see if I can reduce the size of the code (I assume that's what you mean by saving memory... ? Otherwise I'm not sure what there is to gain...)

MCoder, I was just looking into a Rational Chebyshev Approximation. I'm thinking I can get a pretty good approximation for x^(1/3) in the interval [-1/8,1/8] and use the same idea of scaling down x, calculating the root and then scaling back up. I need to get to a computer with Mathematica or something to calculate the coefficients... I don't think my calculator will solve the approximation to the necessary precision...
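In case it helps anyone experimenting along the same lines, a plain (non-rational) Chebyshev interpolant of x^(1/3) on [1/8, 1] can be generated without Mathematica. Hedged sketch in Python: this is interpolation at the Chebyshev nodes, not a true minimax or rational fit, so it needs more terms than Turkowski's quartic for the same error, and all the names are mine:

``````python
import math

def cheb_coeffs(f, a, b, n):
    """Coefficients c[0..n-1] of the degree n-1 Chebyshev interpolant
    of f on [a, b], sampled at the n Chebyshev nodes."""
    fv = [f(0.5 * (b - a) * math.cos(math.pi * (k + 0.5) / n)
            + 0.5 * (b + a)) for k in range(n)]
    return [(2.0 / n) * sum(fv[k] * math.cos(math.pi * j * (k + 0.5) / n)
                            for k in range(n)) for j in range(n)]

def cheb_eval(c, a, b, x):
    """Evaluate the interpolant with the Clenshaw recurrence
    (note the conventional -c[0]/2 term folded in at the end)."""
    t = (2.0 * x - a - b) / (b - a)
    d = dd = 0.0
    for cj in reversed(c[1:]):
        d, dd = 2.0 * t * d - dd + cj, d
    return t * d - dd + 0.5 * c[0]

# a 16-term interpolant of the cube root on [1/8, 1]
c = cheb_coeffs(lambda v: v ** (1.0 / 3.0), 0.125, 1.0, 16)
``````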

As an aside, I tried replacing one of the fscale's in my code with multiplying by the corresponding integer value (2^n). This is much faster (speeding the algo up by 20% or so), but it only works if 0 <= n < 32, i.e. it doesn't work for negative values of n (corresponding to 0 < x < 1) or very large values of n.

--Chorus
Posted on 2003-08-20 17:34:13 by chorus

Originally posted by chorus
As an aside, I tried replacing one of the fscale's in my code with multiplying by the corresponding integer value (2^n). This is much faster (speeding the algo up by 20% or so), but it only works if 0 <= n < 32, i.e. it doesn't work for negative values of n (corresponding to 0 < x < 1) or very large values of n.
Could use Agner Fogs method of avoiding fscale.

``````
fstp TBYTE PTR [esp]
fld TBYTE PTR [esp]
``````
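The idea behind that spill-to-memory trick is to patch the exponent field with integer ops instead of paying for fscale. Hedged sketch of the same idea on an IEEE-754 double (the asm works on the 80-bit TBYTE, which has a 15-bit exponent and an explicit integer bit, but the principle is identical; this version deliberately ignores overflow into infinity and subnormals, and the function name is mine):

``````python
import struct

def scale_by_pow2(x, n):
    """Multiply a normal double by 2^n by adding n straight into the
    11-bit biased exponent field -- the integer-op stand-in for fscale.
    No range checking: n must keep the biased exponent inside 1..2046."""
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]
    bits += n << 52                     # exponent lives in bits 52..62
    return struct.unpack('<d', struct.pack('<Q', bits))[0]

# scale_by_pow2(1.5, 3) gives 12.0 without any floating-point multiply
``````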
Posted on 2003-08-20 17:47:39 by bitRAKE
``````
fstp TBYTE PTR [esp]
fld TBYTE PTR [esp]
``````

For those who may not be aware, the above code would trash the stack if used as is. Most probably bitRAKE would have preceded it with a sub esp,10 to reserve space on the stack, and later restored the stack with an add esp,10.

chorus,

I was curious about the efficiency of the FpuXexpY function of the Fpulib and surprised at its speed compared to the other algo I posted. The main reason for the speed improvement was that the reciprocal of 3 had to be precalculated and in memory before calling the function.

The following proc comes in at 0.70 microseconds as compared to the other figures reported previously. Significantly faster and still providing the most precision. Hard to believe, but the simple computation of the reciprocal of 3 within the proc itself would add some 0.4 microseconds!!!
``````
CubeLogA proc uses eax ecx edx x:DWORD

.data
recip3      dt    0.333333333333333333

.code
finit
fld   recip3
fild  x
fabs
fyl2x                   ;->(1/3)*log2(x)
fld   st(0)             ;copy the logarithm
frndint                 ;keep only the characteristic
fsub  st(1),st          ;keeps only the mantissa
fxch                    ;get the mantissa on top
f2xm1                   ;->2^(mantissa)-1
fld1
faddp st(1),st          ;->2^(mantissa)
fscale                  ;scale it with the characteristic
fstp  st(1)             ;copy result over and "pop" it
test x,80000000h
jz @F
fchs                    ;restore sign
@@:
ret

CubeLogA endp
``````
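For readers tracing CubeLogA: after fyl2x the stack holds (1/3)*log2|x|, and frndint splits that into an integer part for fscale plus a fraction small enough for f2xm1. Hedged Python mirror of those steps (function name is mine; round() stands in for frndint's round-to-nearest):

``````python
import math

def cube_log_a(x):
    """Mirror CubeLogA: cbrt(|x|) = 2^L where L = log2(|x|)/3.
    Split L into a rounded integer i and a fraction f with |f| <= 1/2,
    so 2^L = 2^f * 2^i: 2^f is what f2xm1 (+1) delivers, and 2^i is a
    pure fscale."""
    if x == 0.0:
        return 0.0
    L = math.log2(abs(x)) / 3.0     # fld recip3 / fild x / fabs / fyl2x
    i = round(L)                    # frndint (round-to-nearest)
    f = L - i                       # |f| <= 1/2, safely inside f2xm1's domain
    res = math.ldexp(2.0 ** f, i)   # f2xm1 + fld1 + fadd, then fscale by i
    return math.copysign(res, x)
``````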

Raymond
Posted on 2003-08-20 19:11:06 by Raymond
One more detail which may be difficult to explain. As stated in the previous post, computing the reciprocal of 3 within the proc was adding some 0.40 microsecond.

This was relatively constant whether the value of 3 was built on the FPU or loaded from memory as a floating point (REAL4, REAL8 or REAL10) or an integer (WORD or DWORD) before performing the division of 1.

This was also relatively constant when dividing 1 directly by a value of 3 from memory whether declared as a floating point (REAL4, REAL8) or a DWORD integer.

HOWEVER, dividing 1 directly by a value of 3 from memory declared as a WORD only added 0.05 microseconds to the proc.:confused:

Raymond
Posted on 2003-08-20 23:13:09 by Raymond
Hello Raymond,
I love the simplicity of your method. I can't argue about having smaller *and* faster code :)

However, there must be a big difference between your processor and mine... when I compare your latest code vs my original, mine seems to run about 30% faster... I'm running an AMD Athlon 1100 MHz. I'm just using a simple rdtsc to count clocks over 1000000 iterations, like so:

``````
start:

mov ecx,1000000
rdtsc
mov FirstTimeLo,eax
mov FirstTimeHi,edx

@@:
push 8
call CubeRoot
;    call CubeLogA
fstp st(0)
dec ecx
jnz @B

rdtsc
sub eax,FirstTimeLo
sbb edx,FirstTimeHi

;I'm dividing down by 1000 to make sure there's nothing
;in edx that I'm missing, although I expect it to be zero

mov ecx,1000
idiv ecx

invoke ExitProcess,NULL
``````

After running it at least a dozen times your proc comes up around 200 million clocks and mine around 140 million...

Perhaps the P4 executes the f2xm1 and fyl2x much faster...

(BTW with your method you have to test for zero at the beginning b/c of the log base 2)

Regards
--Chorus

PS I looked into approximating the cube root in the range [-1/8,1/8] to try to simplify my method a little (I was hoping, for instance, to eliminate the need to check for the sign). This doesn't seem to work very well because the derivative blows up at 0, and rational polynomial approximation doesn't like that very much, though I'm going to try other ideas. In the meantime, I'm going to try bitRAKE's suggestion for replacing fscale, although I think it's a little more complicated than his post because the top bit of the TBYTE is the sign... it'll have to be preserved somehow... I think a simple rol value,1 ought to do the trick...
Posted on 2003-08-21 15:24:23 by chorus
It's surprising at times how different computers will perform with a given set of instructions. For example, I have a little program to extract square roots with up to 10,000 decimals in the answer. That program was running almost twice as fast on my previous P3-550 than on my P4-1500!!!

Your AMD seems to be a lot more efficient than my P4. Using rdtsc (instead of the millisecond timer of Windows), I'm getting 1200 cycles for my procedure and 1700 for yours. Maybe it's a P4-150 that I got.:(

When I looked at bitRAKE's suggestion for scaling, I was also annoyed at first about the sign bit. Giving more thought to it, that sign bit will not get affected unless your scaling factor would be so big (positive or negative) as to exceed the range of the REAL10 format, which is quite improbable unless you feed it garbage.

Raymond
Posted on 2003-08-21 22:35:02 by Raymond
Ok, after taking a number of different ideas I've got a much revamped version:

1) I managed to eliminate the integer divide by using magic number division (thanks to The Svin for these constants, I stole them from one of his posts :) )
2) The two fscales and one fxtract are gone, I'm doing the scaling manually using the real10 format. (For anyone unfamiliar with real10 format, check out Intel's Architecture Software Developer's Manual, Volume 1)
3) All conditional jumps are removed
4) At the bottom, I use a real4 scale value to simultaneously restore the sign of the original value and adjust the exponent
5) The constants have been recalculated and reordered. I'm not sure if the values will load any quicker if they are in the order that they get used, but I figured it couldn't hurt
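For anyone who wants to sanity-check The Svin's constant from point 1: the mul ecx / shr edx,1 pair computes the high dword of n * 0AAAAAAABh shifted right once, which is exactly n/3 for every unsigned 32-bit n. A quick Python check (hedged: this verifies only the unsigned case the proc relies on; the function name is mine):

``````python
MAGIC = 2863311531          # 0AAAAAAABh = ceil(2^33 / 3)

def div3_magic(n):
    """Unsigned n/3 via multiply-shift, exact for all 0 <= n < 2^32."""
    hi = (n * MAGIC) >> 32  # edx after 'mul ecx'
    return hi >> 1          # 'shr edx,1'

# div3_magic(100) -> 33
``````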

This code now runs about twice as fast as yours on my machine. Strangely, I thought I could get better performance by rearranging the instructions (trying to get better pairing) but I couldn't make much of a difference. So I've left it in a more "readable" manner.

I still can't touch your procedure for precision, although I do have an idea to improve mine; I don't really feel like going through it right now. The idea is pretty simple: the proc maps the input value to the range [1/8,1), which naturally breaks down into three separate regions [1/8,1/4), [1/4,1/2), and [1/2,1). I could use three sets of constants, one corresponding to a distinct approximation function in each of the three ranges. This should bring the error down to around <1e-12 or so.

Again, if anyone has any improvements I'd be glad to hear them

Thanks
--Chorus

``````
.data
CC real8    6.3637179600061402163539320738081
real8  -12.065381127146011865968174543024
real8   1.15441272092408400896359030695808
real8     1.32216597568937846186167818505031
real8   82.614153385222731367127241292588

real8   65.00227162585858862822658857792
real8  -90.424005465780377990655280610327
real8    2.31555198682175862934814370456
real8  5985.3903660662530599695396031034

.code
CubeRoot proc uses eax ecx edx x:dword
LOCAL mant:DWORD
fild x                        ;get the input value into a real10 format
fstp real10 ptr mant          ; so we can adjust the exponent
mov ecx,7FFFh
mov eax,2863311531            ;Use magic number "division" of exponent by three to get
and ecx,dword ptr [mant+8]    ; remainder and quotient. Some constants are used here
mul ecx                       ; because of the bias on the real10 stored value.
shr edx,1
lea eax,[edx*2][edx-3FFFh+3]
sub ecx,eax
mov dword ptr [mant+8],ecx    ;Store back the adjusted exponent
fld real8 ptr CC[1*8]         ; the first polynomial
fld real8 ptr CC[2*8]
fld real10 ptr mant
fmul real8 ptr CC[3*8]
fmul st(0),st(0)
fmulp st(1),st(0)
fld real8 ptr CC[5*8]         ;The second polynomial Q(y)
fld real8 ptr CC[6*8]
fld real8 ptr CC[7*8]
fld real10 ptr mant
fmul st(0),st(0)
fmulp st(1),st(0)
fdivp st(1),st(0)             ;P(y)/Q(y)
mov eax,x                     ;Build a real4 value with the following properties: