Hey guys,

I wrote a bunch of sine routines for a trig package I will be needing. This is the first pass through the routines. The approximation equations are described along with the average/max errors, and average profile cycle time. The calcs are performed using doubles/real8's. There are 4 routines of increasing accuracy and decreasing speed available.

Thought I'd post for y'all to check out. Any feedback/optimizations are welcome.

- Chris
Posted on 2002-09-14 17:42:18 by bit
SSE for sin5
movss	xmm0, x

movss xmm1, xmm0
mulss xmm1, xmm0
movss xmm2, xmm1
divss xmm1, const_20
movss xmm3, const_1
movss xmm4, xmm3
subss xmm3, xmm1
mulss xmm3, xmm2
divss xmm3, const_6
subss xmm4, xmm3
mulss xmm0, xmm4
movss x, xmm0
Float data type size only, tested on MS-VC w/ inline asm, no doubles. Not timed how many clock cycles <--- too lazy to do that. :)

Nice compilation... I'm planning to convert every function to SSE.
Posted on 2002-09-15 00:26:01 by stryker
Nice work, bit.

But, out of curiosity,
what is the advantage of your functions compared to using fsin? Did your functions pass the IEEE754 test?

Personally, I prefer using fsin because the result does not depend on PC setting. Here is what I use:

; double sin(double x)
sin proc c
fld qword ptr [esp+4]
fnstsw ax
jp @F ; assuming most vals
ret ; are within range.
; |x| > 2^63, out of range.
; need to reduce x.
@@: fldpi
fadd st(0),st(0)
@@: fprem1
fnstsw ax
jp @B
fstp st(1)
; reduction done. do it again.
sin endp

If you believe Agner, you can eliminate the last loop with only one fprem1. But, I prefer to play safe. :)

Like I said before, this function does not depend on your PC setting, so by only chaning fld qword ptr , you get a single precision version and long double precision version (as required by C99) without ever doing fldcw.
But, I do not know the speed of my function. Can someone compare and post the performance result?

BTW, if you have to approximate at any cost, there is a fairly accurate implementation from Sun. Look in the directory ftp://ftp.netlib.org/fdlibm/. At least, the results are mostly within 1ulp range from fsin results for all representable double precision numbers. (Yeah, I did run the test for that crazy exhaustive test. :)) Its problem is the unbearably slow (well, from x86 point of view) computation.
Posted on 2002-09-15 03:25:12 by Starless
Originally posted by stryker
I'm planning to convert every function to SSE.

Most of useful mathematical functions are already implemented using SSE by Intel, and they distribute them free of charge. Just download MKL from Intel site. That will save you a lot of time.
Posted on 2002-09-15 03:37:09 by Starless
Thanks guys. The functions don't have any advantage over fsin at this point - I had been playing around with it a little bit as well. However, I have been thinking about doing a Fixed Point version that would use the same algorithms as those floating point versions. Hopefully those would be faster, but with the speed of the FPU nowadays, who knows.

And, I actually have the code from fdlibm. Didn't run it, but it seemed rather slow just from looking at it. Or, at the very least, quite complex.

Starless, I clocked the function at an average of 246 cycles. The C run time version of sin() runs at 254 cycles. However, these tests are presuming the values are in range. I didn't throw anything needing range reduction at it.

- Chris
Posted on 2002-09-15 08:31:52 by bit

I could not believe that your functions would take that long. Or, did you mean my version took 246 clocks and another C library of yours took 254 clocks? If your function took 246 clocks, did you mean your Sine9 timing?
Duh... I'm stupid. You obviously meant mine took 246 clocks. But, then, I wonder why your functions take so long as you commented in your source code.

Anyhow, one more look at your source code, I have a suggestion for speed up.

The idea is simple. 'Multiplication by a reciprocal rather than division'. For example, when you have to divide by 20, you want to multiply by 0.05. But, you might also want to do error analysis again with this modification, because none of your constants are exactly representable.
I mean, the reciprocal of your constants are not exactly representable.

To enhance the accuracy, you might want to make all constants in EP. That will cost you load time, but that is much cheaper than fdiv. And, you can hide the load latency while you are doing another fmul, so I guess the cost would be virtually invisible.
Posted on 2002-09-15 16:06:45 by Starless
Those timings listed in the code were the delta of calls to rdtsc before and after the routine. The normal C run-time took 254 clocks, and your version took 246. The respective timings of each of my functions are listed in the comments at the start of each function. I thought those were clocks cycles, but I could be wrong. At any rate, the relative meaning is still valid. The fastest one of my functions only runs at 221. So, for a difference of only 25 clocks (or 25 whatever it is) I get the real answer w/o the error.

I will implement your suggestions and see what happens. I should smack myself for not thinking of the multiply by reciprical. I hadn't given much thought to any optimizations yet. It took me long enough to figure out the math. ;)

Anyway, it seems like the fsin instruction might actually be worth using considering the improved accuracy and only small clock cycle penalty. But, we will see. I need this to be as fast as possible since I am going to be using it for a sound synthesis engine. I have no idea how much accuracy I need - thus the multiple routines.

- Chris
Posted on 2002-09-15 16:18:08 by bit
Okay, the base stats that I am working against:

sin: 254 - c run-time version using vc++ 6.0
Sine: 246 - fSin version that you posted

And ... the newest stats for the approximation functions after the multiply by reciprical optimization ...

Sine5: 168 - 5th degree Taylor Series
Sine7: 182 - 7th degree Taylor Series
Sine9: 199 - 9th degree Taylor Series
Sine11: 212 - 11th degree Taylor Series

The errors went up in some and down in others, but only by .00000001 at the max. For the most part they were unchanged.

I think there might still be some speed to squeeze out of someplace, but I'll have to beat my head against the wall for a few days to figure it out. :)

The only remaining issue is the range of acceptable input. The Taylor series has issues the further away the input is from 0 because there is not enough accuracy. Which means I need to scale the input to somewhere between -PI/2 and PI/2 (or, at the very least -PI and PI). Then, use the symetry of sine to calculate the correct value.

So, in order for me to have the approximation routines be more effective than the fsin version, I need to be able to perform that calculation and scaling in under 34 cycles (246 - 212). As long as the 5th and 7th degree approximations were faster I would probably be happy, but I would really like for the most accurate one to be faster as well - after all, what is the point of writing them if they are slower and/or not as accurate.

I will post back again when I have some more info/ideas/results.

- Chris
Posted on 2002-09-15 21:55:48 by bit
but MKL is commercially sold by Intel. Anyway, even if it is/was free, it's fun converting... :)

Next time, I'll post more converted functions.

Thanks for remembering me of MKL. I'll probably buy the library in the future when I need it. :alright:
Posted on 2002-09-15 22:11:18 by stryker

I still don't understand why your Sine5 takes that long time. After you replaced fdiv by fmul, there are only 5 multiplications. So, even if the current code cannot hide the latency due to the dependency flow, it should not take more than 50 clocks, unless the cache is trashed every time you measure it so that the memory access time outweighs the calculation time.

If you are not allowed to use cache at all, then try using fld1 instead of getting CONST_1 from the memory, and similarly, try reducing memory access at the entry, for example,

fld x
fld x
fmul x


fld x
fld st(0)
fmul st(0),st(0)

and it is shorter, too.
Posted on 2002-09-15 23:46:04 by Starless

but MKL is commercially sold by Intel.

Eh? Since when? Or, am I totally mistaken? I thought MKL distribution was free-of-charge for non-commercial use.

Hmm, if Intel sells MKL, I should stop promoting it and stop using it right now. Oh, well, that is a big loss. :( Darn, there is no good math library left for free use.
Posted on 2002-09-15 23:52:13 by Starless
I think it was the early versions that was free...I can't be sure but check out this link http://www.intel.com/software/products/mkl/mkl52/index.htm there's an evaluation for 30 days and Buy Now :(

Yeah, I remember it was free but didn't bother to download because I thought I don't have any use for it. What a bad decision... Or I could have been dreaming this all along. :grin:
Posted on 2002-09-15 23:56:05 by stryker
Maybe I am misunderstanding the rdtsc instruction. I thought it returned clock counts, but now I dunno. I know what you're saying ... it feels like it should be faster, so I don't know where all of the overhead is coming from. The only thing I can think of is that the routine I am using to time it (which is in C w/ inline asm) is taking a gigantic amount of time since it has the inline asm code and uses 64-bit variables. Maybe I'll take a look at the decompilation and see what code the compiler is writing for that 2 line routine. However, that would make all versions faster (not just mine), since I time all the routines with the function. Thus the deltas would still apply and I am back to square 1 which is that the fsin instruction is almost as fast as my approximations and fully accurate.

As a side note, I checked out Intel's optimization manual and they claim the trancendental functions can be implemented *no faster* w/o losing accuracy.

Thanks for the idea, I have actually been messing with the entry loading a smidge. So far I have shaved a few cycles off of each routine by loading the second copy of X during the multiply and then exchanging registers. I'll try changing it to load the second x from the registers instead, and I'll keep playing around with preloading some of the constants during the multiplies and see what happens. I might be able to save a few more cycles.

- Chris
Posted on 2002-09-16 00:05:59 by bit
Turns out the overhead of the Timing stuff was 121 cycles! Which gives me the following new values for everything including the optimizations that have already been done.

sin: 133- c run-time version using vc++ 6.0
Sine: 123 - fSin version that you posted
Sine5: 45 - 5th degree Taylor Series
Sine7: 59 - 7th degree Taylor Series
Sine9: 72 - 9th degree Taylor Series
Sine11: 86 - 11th degree Taylor Series

So, is it going to be possible to calculate the smaller range X and then adjust the sine accordingly? I have 37 cycles (123 - 86) in which I could do the calculation and have it be worthwhile. Wish me luck. I am going to attack it once I try and eek a few more cycles out of the others.

- Chris
Posted on 2002-09-16 09:10:20 by bit
Well, it doesn't appear that this will be a worthwhile undertaking. According to my timings, the range reduction function w/o performing any adjusting to get a correct value of sine takes about 40 cycles. Which is 2 cycles over the limit. I was able to slice a cycle off of the 11th degree approximation, but I am still cutting it way too close without even adjusting the value for the sine function.

Here is the range reducing function that also selects the proper degree to call (no code for that part yet).

; SineApprox Procedure
SineApprox proc x: REAL8, degree: DWORD

; This function calls the appropriate appoximation functions based
; on the degree specified after performing range reduction.
; All approximation functions works best when X is close to 0. So,
; we reduce x to between 0 and PI/2 then utilize the following
; properties of the sine function in each 90 degree quadrant.
; quadrant 1 = sin(x)
; quadrant 2 = sin(PI/2 - (x - PI/2))
; quadrant 3 = -sin(PI - x)
; quadrant 4 = -sin((2*PI)-x)
; Average Profile Cycles:

; Range reduce x to (0 - 2 PI)
; Calc and store the number of 2 * PI's
; in x, not including the very last
; incomplete one this is sitting in.
fld x
fistp num2PI

; Subtract number of 2PI's from x
fld CONST_2_PI
fimul num2PI
fld x
fsubrp st(1), st(0)

; The range reduced X is now on
; the top of the stack

; What degree approximation do they want?
.if degree == 5
; They want a 5th degree approx

.elseif degree == 7
.elseif degree == 9
.elseif degree == 11
; Invalid degree ... do nothing

; We completed

SineApprox ENDP
; END SineApprox

So, I am undecided about what the plan should be. Obviously the fsin routine is going to be faster when accuracy is desired. But, that is what the c run-time uses and a whole bunch of posts have been made about the lack of speed for sound wave oscillation purposes. So, is it actually fast enough nowadays and nobody ever timed it (e.g. basing it off the DOS days)? I have timed a couple other approximation routines and they are all extremely close to the fsin timing.

Anybody have any thoughts/ideas?

- Chris
Posted on 2002-09-16 13:04:25 by bit
For sound use a Goertzel resonator (oscillator). ;)

A Google search will do it.
Posted on 2002-09-16 16:45:14 by Maverick
What about reducing x using fprem1? Like

fadd st(0),st(0)
fld x
fstp st(1)
; now st(0) is in [-2*pi, 2*pi]

According to Agner, this is not faster than your code, but, if your code takes 40 clocks (which I cannot believe - your timing report seems too pessimistic), I think it is worth trying.

In another thread (almost at the end of it), there is a link to a library. I have not tested it myself, but it seems worth trying out.
Posted on 2002-09-16 17:04:30 by Starless
Thanks Maverick. I am familiar with Goertzel and I do have an implementation in C that I have been playing with. My only issue with that algorithm is the cost of frequency changing. Won't know exactly how that will affect what I am doing until I get further along though.

Starless, I don't think my timings are off. But, you never know. Perhaps I will switch to another timing method and see what the comparison yields. Doesn't fprem take like 70 cycles or so to execute?

At any rate, I have decided that the caller of the sine routines can either control the range of inputs, or just use the fsin version for full accuracy. This way I can optimize if needed for each situation instead of just blindly range reducing all the time when it can't be interleaved with anything.

And ... I think I may have found a slightly faster algorithm for the sine calcs. It looks like I can derive some coefficients to eliminate some of the multiples and subtracts. Only problem is I found it with coefficients for cosine and their implementation of sine just shifts x back by PI/2 and then calls cosine - which is OK, just slightly slower than coding the algorithm directly. So, I need to figure out how to derive the needed coefficients and then I can hopefully save about 3-4 multiples and 3-4 subtracts per approximation - which is quite a few cycles.

The one thing that strikes me as weird is the entire community is always complaining about the speed of the c run-times version of sin(). But, according to my calcs it is only a few cycles slower then the fsin assembly version, and once you handle range reduction is a little bit faster than a decent approximation. Yet, people are still going out of their way to write custom implementations and use LUTs. Makes me wonder if my timings are somehow way off. But, then they would be off for everything including the run time lib's ... so, I dunno.

- Chris
Posted on 2002-09-16 21:19:05 by bit
I just checked out the timing list and found that fprem1 takes 20-70 clocks on P5 and 33 micro-ops to port 1 on P6 (so 33 clocks or close to that number). If your timing is not off much, then, using fprem1 may be a viable alternative (which can be done with only one memory access) on P6. (At least, I thought so.)

Then again, these are numbers on a paper. According to the same timing list, fsin takes 65-100 clocks on P5 and generates 17-97 micro-ops on P6. But, as you already found out, fsin version takes longer than these numbers even if we consider the loading and testing. So, who knows how long fprem1 will take in real life? :)

I guess that the C runtime library is essentially the same as my version with some additional codes (maybe checking the existence of FPU?).

I don't know how sound synthesis works, but if you can vectorize your input, then a fully vectorized SSE version may be the way to go. (waiting for stryker's SSE version... :grin: )
Posted on 2002-09-17 01:16:06 by Starless
Hi bit,
If you continuosly change frequency and/or phase, then I suggest you to use fsin (or a faster version like BMath that was mentioned some time ago perhaps in another thread, but which anyway you can find at http://www.bmath.net/old/bmath/howbmath.htm). Also, on my K7 last time I had to face this problem, using fsin was not a problem because it was nicely executed "for free", due to other stuff done in parallel (including memory accesses). So it depends on the specific algorithm (and CPU, being the K7 excellent at FPU), and the eventual optimizations one can do on it.. but I suggest you to try all the solutions before excluding any.. because you may get very suprised.
If you use fsin I just suggest you to keep your current_phase variable always within 0..Pi2 ranges.. both for precision and for fsin overhead.
Posted on 2002-09-17 02:46:10 by Maverick