; Exponential value of top of stack ( e^X )

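sub esp,16
fldl2e ; log2(e)
fmul ; z = x*log2(e)
fist DWORD PTR [esp+12] ; round(z)
fld1
fstp TBYTE PTR [esp] ; 1.0 as a TBYTE, exponent word at [esp+8]
fisub DWORD PTR [esp+12] ; z - round(z)
mov eax, [esp+12]
add [esp+8],eax ; add round(z) to the exponent
f2xm1 ; 2^(z-round(z)) - 1
fld1
fadd ; 2^(z-round(z))
fld TBYTE PTR [esp] ; 2^(round(z))
fmul ; 2^z = e^x
add esp,16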
This isn't my idea - I got it from Agner Fog's Pentium optimization manual, but I did optimize it (a little). ;) The nice thing is that the rounding mode doesn't matter, but X has to be in the range +/-16000 IIRC?
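A rough Python model of the idea behind the routine, not a translation of the assembly: e^x is rewritten as 2^z with z = x*log2(e), z is split into an integer n and a fraction f, and 2^f is scaled by 2^n. The function name `exp_by_range_reduction` is made up for illustration, and Python doubles stand in for the 80-bit x87 format.

```python
import math

LOG2_E = math.log2(math.e)  # the constant that fldl2e pushes

def exp_by_range_reduction(x):
    """Model of e^x = 2^n * 2^f with z = x*log2(e), n = round(z), f = z - n."""
    z = x * LOG2_E
    n = round(z)                    # integer part, as fist would store it
    f = z - n                       # fraction, |f| <= 0.5 with round-to-nearest
    return math.ldexp(2.0 ** f, n)  # 2^f scaled by 2^n

for x in (-10.0, -0.5, 0.0, 1.0, 20.0):
    print(x, exp_by_range_reduction(x), math.exp(x))
```

In the assembly, the `2.0 ** f` step corresponds to f2xm1 plus 1, and the `ldexp` step to the multiply by the hand-built TBYTE.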
Posted on 2002-01-16 00:46:06 by bitRAKE
the fldcw instruction. This instruction is a synchronizing instruction and will cause a significant slowdown in the performance of your application on all IA-based processors.
This quote above from the Intel Optimization Manual is why algorithms that don't rely on the rounding mode are so important. Some algorithms do require a specific rounding mode, and then it's better to set it at the beginning of your program and leave it constant throughout.
Posted on 2002-01-16 23:07:43 by bitRAKE
Here is code to calculate e without the FPU
(this is not my code; I had it as a question in a computer contest)



EDIT: and print in binary
Posted on 2002-01-17 13:51:04 by eko
That's a beautiful algo bitRAKE. I have no access to MASM at the moment and I'd love to know how much faster it is than my pure FPU one. Could you time it? :)

fld st
fsub st(1),st
fstp st

I gather from your code that you simulate the fscale instruction through integer maths, is it much faster? I'd never have thought of trying something like that.
Posted on 2002-01-18 04:23:04 by Eóin
Eóin, on my Athlon this code is just short of twice as fast - 58 cycles versus 108 for straight FPU code. :) I've tried to document it more, but those trying to understand how this algo works should really read the FPU section of the Intel manual, Volume One.
; -16000 < st(0) < 16000

sub esp,16 ; we need some space on the stack for TBYTE + DWORD
mov eax,3FFFh ; biased exponent of a TBYTE holding 2^0
fldl2e ; log2(e)
fmul ; z = x*log2(e)
; set the 64-bit significand of the TBYTE to 1.0
and DWORD PTR [esp],0
mov DWORD PTR [esp+4],80000000h
; store the integer portion of z
fist DWORD PTR [esp+12] ; round(z)
; fraction of (z) on FPU stack
fisub DWORD PTR [esp+12] ; z - round(z)
; scale TBYTE number by the integer value of (z)
add eax,[esp+12]
mov [esp+8],eax
f2xm1 ; 2^(z-round(z)) - 1, operand must be in [-1,1]
fld1
fadd ; 2^(z-round(z))
fld TBYTE PTR [esp] ; 2^(round(z))
fmul ; 2^z = e^x
add esp,16
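To make the "build 2^n by hand" trick concrete, here is a small Python sketch of the 80-bit extended layout the integer stores above fill in: a 64-bit significand of 8000000000000000h (1.0) followed by a 16-bit word holding 3FFFh + n. The helper names `build_pow2_tbyte` and `decode_tbyte` are hypothetical, and the decoder only handles positive normal values.

```python
from fractions import Fraction

BIAS = 0x3FFF                    # biased exponent that encodes 2^0
ONE_MANTISSA = 0x80000000 << 32  # explicit integer bit set, fraction bits zero

def build_pow2_tbyte(n):
    """Return the 10 little-endian bytes of an 80-bit extended 2^n."""
    exponent = BIAS + n
    if not 0 < exponent < 0x7FFF:
        raise OverflowError("2^n is outside the normal extended range")
    return ONE_MANTISSA.to_bytes(8, "little") + exponent.to_bytes(2, "little")

def decode_tbyte(raw):
    """Decode a positive normal 80-bit extended value exactly."""
    mantissa = int.from_bytes(raw[:8], "little")
    exponent = int.from_bytes(raw[8:10], "little") & 0x7FFF
    return Fraction(mantissa, 1 << 63) * Fraction(2) ** (exponent - BIAS)

print(decode_tbyte(build_pow2_tbyte(10)))  # exact value of the encoded number
```

If this model is right, the usable range is set by round(z) having to stay within -16382..16383, so the limit of roughly 16000 would apply to z = x*log2(e) rather than to x itself, which would then be limited to about +/-11355. The assembly does no range check, so larger inputs would silently produce a wrong exponent.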
Posted on 2002-03-24 00:14:41 by bitRAKE
Slight improvements on FPU method:

fldl2e ; log2(e)
fmul ; A
fld st ; A A
frndint ;*B A
fld1 ; 1 B A
fscale ;*C B A
fxch st(2) ; A B C
fsubrp st(1),st ; D C
f2xm1 ; E C
fmul st,st(1) ; F C
fadd ; G

A: X * log2(e)
B: INT(A)
C: 2^B
D: A - B
E: 2^D - 1
F: E * C
G: F + C = C*(E+1) = e^X
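The A..G legend can be followed step by step in Python. This is only a sketch of the arithmetic with doubles; the function name `exp_fpu_style` is invented, `round` stands in for frndint in round-to-nearest mode, and `math.expm1` stands in for f2xm1.

```python
import math

def exp_fpu_style(x):
    a = x * math.log2(math.e)          # A: x * log2(e)
    b = round(a)                       # B: frndint, nearest integer
    c = math.ldexp(1.0, b)             # C: fscale applied to 1.0, i.e. 2^B
    d = a - b                          # D: A - B
    e = math.expm1(d * math.log(2.0))  # E: 2^D - 1, what f2xm1 returns
    f = e * c                          # F: E * C
    return f + c                       # G: C*(E+1) = e^x

for x in (-3.0, 0.0, 1.0, 7.25):
    print(x, exp_fpu_style(x), math.exp(x))
```

Folding the "+1" into a multiply and an add (F then G) avoids the separate fld1 of the earlier version.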
Posted on 2002-03-24 02:29:28 by bitRAKE

All the hoopla about factoring large ints got me to pull out my dusty notes on Very-HLL methods & see if I could get anything going in asm.

I really wanted to use this pretty algo but, oops!, rounding probs with this method grrrr.

bitRAKE, That method using ints only we were posting a while back? Slower but when you really have to have exact numbers it's better... 99.9999% of the time this IS the better method... leave it to me to stumble on that 0.0001% :grin:
Posted on 2002-05-30 16:26:36 by rafe

It may be a late reply, but this thread came to my attention only today. (Thanks, rafe, for bringing this thread up. :grin: )

I have some questions about your modification of Agner's code. Would you care to answer them? ;)

1. The whole point of separating fist and fisub in Agner's code is to hide the dependency created by long latency of fist. From what I read in Agner's note, your rearrangement of stack operation would require more clocks than Agner's version. What am I missing here? :confused: Would you explain your optimization?

2. While rearranging the stack operation, you used and DWORD PTR [esp],0. Is it faster than mov DWORD PTR [esp],0? I don't know which one is faster, but I guess that mov may be better because and here is a 'read-modify-write' op.

3. I don't understand why you said Agner's and your exp() are not dependent on the RC setting. As I understand it, fist is also affected by the RC setting. No? Usually, the rounding mode is set to 'nearest' for this kind of operation for precision, and that is the default RC value (at least, on the OSes I use). But if, for any reason, RC is set to a different value (either by an asm programmer or a HLL compiler), it is normally recommended to set RC back to 'nearest' for precision. (Yes, everybody in this forum knows that. :) ) If your comment about RC is right, then the recommended procedure is not necessary.

Anyhow, I'm glad to see FPU related discussion here. :alright: Personally, I was drawn to asm world to utilize x87 beyond what my C compiler was capable of. (And partly because I don't have a decent C compiler under Windows, but that is another story. :) ).
Posted on 2002-05-31 00:39:55 by Starless
Starless, good questions :)

1. I tested on an Athlon, and this was the fastest configuration.

2. Smaller, slower in some instances - not this one.

3. Yes, there is rounding, but it cancels out. Doesn't it? I guess not, considering what rafe said. Looks like this needs more testing - I'll take a look.
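The "it cancels out" argument can be checked numerically with a small Python sketch. Whatever integer n the rounding mode picks, 2^n * 2^(z-n) is still 2^z, as long as z-n stays inside the [-1,1] domain of f2xm1. The names `exp_with_rounding` and `MODES` are made up here, and this says nothing about last-bit accuracy on a real FPU, which may be what rafe ran into.

```python
import math

def exp_with_rounding(x, to_int):
    """e^x via 2^n * 2^f, where to_int models the RC-dependent fist."""
    z = x * math.log2(math.e)
    n = to_int(z)
    f = z - n
    assert -1.0 <= f <= 1.0  # f2xm1 only accepts this range
    return math.ldexp(2.0 ** f, n)

# Python stand-ins for the four x87 rounding-control settings
MODES = {"nearest": round, "down": math.floor, "up": math.ceil, "chop": math.trunc}

for name, to_int in MODES.items():
    print(name, exp_with_rounding(2.5, to_int))
```

All four modes agree to within floating-point noise in this model, so the choice of RC changes the split between n and f but not the product.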
Posted on 2002-05-31 10:18:16 by bitRAKE
10^x MACRO
fldl2t ;log2(10),x
fmulp st,st(1) ;x*log2(10)
f2xm1 ;2^(x*log2(10))-1
fld1 ;1,2^(x*log2(10))-1
faddp st,st(1) ;10^x

fyl2x ;x*log2(Y)
f2xm1 ;Y^x-1
fld1 ;1,Y^x-1
faddp st,st(1) ;Y^x
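One caveat worth sketching: f2xm1 is only defined for operands in [-1,1], so these short sequences are valid only while x*log2(Y) stays in that range (for 10^x that means |x| below about 0.3). The Python model below uses invented names `f2xm1` and `pow_small`, and raises an error where the real instruction would return an undefined result.

```python
import math

def f2xm1(v):
    """Model of f2xm1: 2^v - 1, defined only for -1 <= v <= 1."""
    if not -1.0 <= v <= 1.0:
        raise ValueError("f2xm1 operand out of range")
    return math.expm1(v * math.log(2.0))

def pow_small(y, x):
    """y^x as 2^(x*log2(y)), without any range reduction."""
    return f2xm1(x * math.log2(y)) + 1.0

print(pow_small(10.0, 0.25), 10.0 ** 0.25)
```

For larger exponents, the integer/fraction split used in the exp routines earlier in the thread would be needed in front of f2xm1.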
Posted on 2002-06-01 23:02:41 by purefiring
From good book
Posted on 2002-06-02 11:52:59 by Nexo

I guess my first two questions are really a matter of CPU design differences between Intel and AMD. When I tested it on a P3, Agner's original was (marginally) better. Or maybe it is the difference in the memory type you and I have. I have PC100 SDRAM.

About the third part, I don't think it is a mathematical issue. Yes, you are right mathematically. However, when we think about it computationally, I still have that question.

Traditionally, exp(3) has been implemented with a small piece of code that makes sure PC is 11b and RC is 0. The PC part is obvious. The RC part is not transparent: it depends on how the CPU vendor implements the calculation of the fractional exponent. If RC is 0, the fractional part will be in (-0.5,0.5), and if RC is set to some form of truncation, the fraction will be in (0,1) or (-1,0). I do not know the actual design of Intel CPUs, so I don't know which will be approximated better. But I guess that the result is better approximated with RC==0, only because Intel sets RC to 0 by default, and I believe Intel has a good reason to do that. (At least, they wouldn't hurt themselves, would they?) Again, AMD may have their own way, which may invalidate my belief. :)
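The fraction ranges mentioned here are easy to tabulate. The sketch below samples z on a grid of eighths and reports the smallest and largest z - n for each rounding rule; `fraction_range` is a hypothetical helper and the Python functions only model the four RC settings.

```python
import math

def fraction_range(to_int, samples):
    """Return (min, max) of z - to_int(z) over the sample points."""
    fracs = [z - to_int(z) for z in samples]
    return min(fracs), max(fracs)

samples = [k / 8 for k in range(-80, 81)]  # -10.0 .. 10.0 in steps of 0.125

for name, to_int in (("nearest", round), ("down", math.floor),
                     ("up", math.ceil), ("chop", math.trunc)):
    print(name, fraction_range(to_int, samples))
```

Round-to-nearest keeps the f2xm1 operand within half a unit of zero, while the other modes use up to the full width of the allowed [-1,1] range. Whether that changes the accuracy of f2xm1 on a given CPU is exactly the open question in the post.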
Posted on 2002-06-04 14:41:15 by Starless