I am considering a programming project where I'll need to generate accurate sine waves in real-time, and I haven't found much information on how to generate them. I know that for accuracy you have to set the FPU rounding mode correctly, but that's about all. So which is better: leaving it up to the FPU, or using optimized non-FPU assembly?

Thanks in advance,
- keantoken

EDIT: Actually, before I ask this question, I thought it may be of importance to state what methods I planned on trying out if it turned out the FPU was a bit slow. Naturally, since I believe in a good balance between quality, efficiency, and economy, my approach was to take a method I KNEW was 100% accurate and go from there. This method being the Taylor Series! I've calculated that the longest this calculation needs to be to calculate perfectly precise 32-bit sine values within 1 full cycle (a half-cycle from both sides of the origin) is as follows:


However, this is no doubt NOT the only way to do this, though it is quite flexible if you don't mind slight degradation of the sines in return for efficiency. I believe it would be much faster to calculate this using the FPU rather than with pure non-FPU x86 assembly. But then again, I don't know for sure, which is why I'm asking.
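
A rough way to estimate how long the series has to be is the alternating-series bound: once the terms are shrinking, the truncation error is no larger than the first omitted term. The sketch below is only an illustration of that bound, not the equation from the post; the names `taylor_sin` and `terms_needed` are made up for this example, and "32-bit accuracy" is taken to mean an absolute error below 2^-32.

```python
import math

def taylor_sin(x, terms):
    """Sum the first `terms` nonzero terms of the Taylor series of sin(x) about 0."""
    total = 0.0
    term = x  # term 0 is x^1 / 1!
    for n in range(terms):
        total += term
        # Step from x^(2n+1)/(2n+1)! to the next term, flipping the sign.
        term *= -x * x / ((2 * n + 2) * (2 * n + 3))
    return total

def terms_needed(x_max, bits):
    """Smallest term count whose first omitted term at x_max is below 2**-bits.

    Assumption: accuracy is measured as absolute error against 2**-bits.
    """
    tolerance = 2.0 ** -bits
    n = 0
    term = x_max  # magnitude of term n: x^(2n+1) / (2n+1)!
    while term >= tolerance:
        n += 1
        term *= x_max * x_max / ((2 * n) * (2 * n + 1))
    return n

if __name__ == "__main__":
    for bits in (16, 32):
        print(bits, "bits:", terms_needed(math.pi, bits), "terms")
```

Under these assumptions the worst case is at x = PI, where 11 terms (up to x^21/21!) are enough for 2^-32 and 8 terms are enough for 2^-16. Restricting the input to a smaller range, for example by exploiting the symmetry of the sine, would reduce the count.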

- keantoken
Posted on 2007-10-28 14:59:29 by keantoken
Well, quite some time ago, Ultrano here wrote some code to calculate the sine of a given angle, and that method proved to be faster than using the FPU while being 'quite accurate'. So it depends on HOW ACCURATE it must be. (If your code is to be used in surgery then you'd better stick to the FPU ^^' ).

Following the aforementioned method, you could create a lookup table as large as you need and then linearly interpolate between the values in the table. Modern systems that use DDR2/3 RAM should be able to operate on large lookup tables with little penalty.
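
A minimal sketch of the lookup-table idea with linear interpolation follows. It is not Ultrano's code; the table size of 1024 and the names `SINE_TABLE` and `lut_sin` are assumptions chosen for illustration, and it uses double-precision floats rather than hand-tuned assembly.

```python
import math

TABLE_SIZE = 1024  # assumed size; larger tables trade memory for accuracy

# One extra entry so the interpolation can read index i + 1 without wrapping.
SINE_TABLE = [math.sin(2.0 * math.pi * i / TABLE_SIZE) for i in range(TABLE_SIZE + 1)]

def lut_sin(x):
    """Approximate sin(x) for any real x using the table and linear interpolation."""
    # Reduce the angle to a position in [0, TABLE_SIZE) table steps.
    pos = (x / (2.0 * math.pi)) % 1.0 * TABLE_SIZE
    i = int(pos)
    frac = pos - i          # distance from entry i toward entry i + 1
    i %= TABLE_SIZE         # guards the rare case where rounding gives pos == TABLE_SIZE
    return SINE_TABLE[i] + frac * (SINE_TABLE[i + 1] - SINE_TABLE[i])

if __name__ == "__main__":
    for angle in (0.5, 2.0, 100.0):
        print(angle, lut_sin(angle), math.sin(angle))
```

Because the angle is wrapped before the lookup, the same routine works for angles far outside one cycle.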
Posted on 2007-10-28 16:31:30 by ti_mo_n
scali did some taylor-series stuff for sines... http://srcvault.scali.eu.org/cgi-bin/Syntax/Syntax.cgi?sinsin.c - works pretty well in the [-PI;PI] range even with a relatively small amount of iterations.
Posted on 2007-10-28 17:47:57 by f0dder
A vast improvement over my simple method of sine-calculation is to use spline-interpolation, 4-point (17 cycles iirc) and 6-point (70 cycles iirc) are almost perfect. Audio-DSP code needs to do interpolations that minimize the introduction of harmonics.... and the only non-harmonic wave is the sine. Thus, interpolations for audio are perfectly suited for calculation of sines, with a look-up-table. 
But in my code I always need to get the sin() or cos() of angles WAY far from the [-PI;PI] region. Thus a LUT is the best way to go, for me.
Btw, the precision-error of the simple method with linear interpolation and a 1024-entry LUT was 0.06% iirc.

Using 3DNow or SSE in multiple passes for precision is a contradiction, imho.
Posted on 2007-10-29 01:17:45 by Ultrano

Using 3DNow or SSE in multiple passes for precision is a contradiction, imho.

Why? Yes, the main goal when doing SSE is speed, but you don't always want to sacrifice precision too much... and if you can get very good precision by one or two extra iterations, well...
Posted on 2007-10-29 02:53:09 by f0dder
f0dder, nice to see you give that obnoxious character some credit for once :)
I miss Scali, he made this place more fun.
I think we should start an obituary :)
Posted on 2007-10-29 10:55:13 by Homer
For completeness: Here's the mentioned topic. The code I proposed there is the implementation of Taylor stuff on FPU. (It's a shame that doing things this way is actually faster than using 1 instruction). Following is the LUT code with ~0.05% error.
Posted on 2007-10-29 10:57:44 by ti_mo_n
Okay, thanks, guys.

I read and studied a bit, but in the end my own creativity took over. When I calculated the longest equation needed to get 32-bit perfect accuracy, I was using GraphCalc, from SourceForge. I entered SIN(x) and its Taylor Series equivalent, then entered one minus the other, and looked at the error curve to see if it resembled anything I had seen before. I was thinking that if I could find an equation that reproduced the error curve, I could perhaps combine an inferior Taylor Series with that equation and not have to waste processor time calculating the full Taylor series. I didn't find any equations that worked, though. However, there is another option: create a lookup table from the error curve and use that in conjunction with an inferior Taylor Series to see if it would be reasonably efficient.


y1=Taylor Series
y2=True Sine
y3=Error curve

Using this method, the program could be written to compromise between higher memory use (large corrective lookup table) and higher processor usage (larger series to calculate). Of course, this compromise couldn't happen during realtime (at least that I know of), so it would have to be during init time.
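
As a rough illustration of this idea, the sketch below pairs the two-term series x - x^3/3! with a table of its error over [-PI, PI], built once at init time and read back with linear interpolation. The table size of 256 and the names `CORRECTION` and `corrected_sin` are assumptions for the example only.

```python
import math

CORR_SIZE = 256                    # assumed number of table intervals
X_MIN, X_MAX = -math.pi, math.pi   # range covered by the correction table
STEP = (X_MAX - X_MIN) / CORR_SIZE

def taylor3(x):
    """The short ("inferior") series: x - x^3/3!."""
    return x - x ** 3 / 6.0

# Built once at init time: true sine minus the short series at each table point.
CORRECTION = [math.sin(X_MIN + i * STEP) - taylor3(X_MIN + i * STEP)
              for i in range(CORR_SIZE + 1)]

def corrected_sin(x):
    """Short series plus an interpolated correction; x is assumed to lie in [-pi, pi]."""
    pos = (x - X_MIN) / STEP
    i = min(int(pos), CORR_SIZE - 1)   # clamp so that i + 1 stays inside the table
    frac = pos - i
    correction = CORRECTION[i] + frac * (CORRECTION[i + 1] - CORRECTION[i])
    return taylor3(x) + correction

if __name__ == "__main__":
    for x in (-3.0, 0.5, 3.0):
        print(x, corrected_sin(x), math.sin(x))
```

One thing the sketch makes visible: the error curve's second derivative is x - sin(x), which reaches about PI at the ends of the range, while the sine's own second derivative never exceeds 1 in magnitude. With linear interpolation, then, a correction table over the full [-PI, PI] range is not automatically more accurate than a plain sine table of the same size.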

Is it worthy of looking into?
Posted on 2007-10-29 17:07:03 by keantoken
I don't see why you would want to use "corrective" lookup tables. At that point, why not simply use a sine table of the same size with a better interpolation than linear (i.e., cubic)?

(also, the equation that fits the error curve you have there is precisely the remaining portion of the taylor series)
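
This can be checked numerically. The sketch below assumes the "inferior" series is x - x^3/3! and compares its error against the sum of the omitted terms, starting at x^5/5!; the function names are made up for the example.

```python
import math

def taylor3(x):
    """Two-term series: x - x^3/3!."""
    return x - x ** 3 / 6.0

def remainder_terms(x, count):
    """Sum `count` terms of the sine series starting at x^5/5!."""
    total = 0.0
    term = x ** 5 / 120.0
    k = 5                    # current power and factorial index
    for _ in range(count):
        total += term
        term *= -x * x / ((k + 1) * (k + 2))
        k += 2
    return total

if __name__ == "__main__":
    for x in (0.5, 1.5, 3.0):
        error_curve = math.sin(x) - taylor3(x)
        print(x, error_curve, remainder_terms(x, 12))
```

The two printed columns should agree to within floating-point rounding, so evaluating the "error equation" costs the same as evaluating the longer series in the first place.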

Posted on 2007-10-29 17:46:44 by Rockoon
keantoken, I think 3! is not enough; I would go to at least 7!
Here, take a look at those graphs
Posted on 2007-10-29 18:07:52 by Dite
I see your point, Rockoon...

A vast improvement over my simple method of sine-calculation is to use spline-interpolation, 4-point (17 cycles iirc) and 6-point (70 cycles iirc) are almost perfect. Audio-DSP code needs to do interpolations that minimize the introduction of harmonics.... and the only non-harmonic wave is the sine. Thus, interpolations for audio are perfectly suited for calculation of sines, with a look-up-table.

Audio is exactly what this is for. The highest sample rate will be 44100Hz, highest bit depth 16, possibly higher in the future, highest polyphony... Around the 31-62 area, though this is a rough estimate at the moment. How suited for this application is the aforementioned algorithm?

Also, each voice has up to 31 harmonics... Can this be implemented without stacking a bunch of algos on top of each other?


Actually, 7! is not nearly enough for what I want. However, combined with the corrective lookup table 3! is all I need. Or did I misinterpret your post?

Also, about that webpage, just because the graph shows no visible difference does not mean that there is no audible difference. I say this from experience. Not to mention that 7! is not nearly accurate considering the 32-bit depth of modern audio equipment. Of course, there are those who will argue there is no audible difference between 32-bit and 16-bit sound, and IMHO they are right to an extent. I doubt that page was directed towards audio programming, though, so I don't think it's terribly important.

Posted on 2007-10-29 18:55:26 by keantoken
If it's for audio then I suggest using 16.16 or 32.32 fixed point instead of floating point.
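
A minimal sketch of what a fixed-point oscillator could look like, written with plain integers. The layout is an assumption made for illustration: the phase is an unsigned 32-bit fraction of one cycle, the output is 16.16 fixed point (65536 represents 1.0), and the table has 1024 entries. None of the names come from the thread.

```python
import math

FRAC_BITS = 16
ONE = 1 << FRAC_BITS                      # 1.0 in 16.16 fixed point
TABLE_BITS = 10
TABLE_SIZE = 1 << TABLE_BITS              # 1024 entries (assumed)
PHASE_BITS = 32
PHASE_MASK = (1 << PHASE_BITS) - 1
INDEX_SHIFT = PHASE_BITS - TABLE_BITS     # low 22 bits are the interpolation fraction
FRAC_MASK = (1 << INDEX_SHIFT) - 1

# Sine table in 16.16, with one extra entry so that index i + 1 is always valid.
TABLE = [round(math.sin(2.0 * math.pi * i / TABLE_SIZE) * ONE)
         for i in range(TABLE_SIZE + 1)]

def phase_increment(freq_hz, sample_rate):
    """Phase step per sample, as a 32-bit fraction of a full cycle."""
    return round(freq_hz / sample_rate * (1 << PHASE_BITS)) & PHASE_MASK

def fixed_sin(phase):
    """Sine of a 32-bit phase, returned in 16.16, using integer math only."""
    i = phase >> INDEX_SHIFT
    frac = phase & FRAC_MASK
    a, b = TABLE[i], TABLE[i + 1]
    # For this table size |b - a| is about 402, so the product stays below 2^31.
    return a + (((b - a) * frac) >> INDEX_SHIFT)

def render(freq_hz, sample_rate, count):
    """Generate `count` samples; the phase wraps by masking, as a 32-bit register would."""
    phase = 0
    step = phase_increment(freq_hz, sample_rate)
    samples = []
    for _ in range(count):
        samples.append(fixed_sin(phase))
        phase = (phase + step) & PHASE_MASK
    return samples

if __name__ == "__main__":
    print([s / ONE for s in render(440.0, 44100, 8)])
```

A side effect of keeping the phase in an integer is that angles far outside one cycle need no special handling, because the wrap-around happens for free.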

Posted on 2007-10-29 20:28:50 by Rockoon
Yes, it will use fixed point. That I know.

- keantoken
Posted on 2007-10-29 21:22:53 by keantoken
Let's calculate the max acceptable error of a sine-approximation function to put in a 16-bit sample: 1/32768 = 0.003%

Let's make C++ code to check which method produces how much error. (attached it). Run it, and for a LUT-size of 4096 floats we get:

no-interpolation: 0.153332% error
linear-interp:    0.000032% error // 100 times better than required
6-point spline:  0.000004% error

Thus, linear interpolation is the clear winner, for its small-enough error and its speed.
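
The attached C++ program is not reproduced here. As a stand-in, here is a small sketch of the same kind of measurement for the no-interpolation and linear cases only. It uses double precision instead of a table of 32-bit floats, and the function names and the sample count are assumptions, so the last digits will not match the figures above exactly.

```python
import math

def build_table(size):
    """Sine table over one cycle, with one extra entry for interpolation."""
    return [math.sin(2.0 * math.pi * i / size) for i in range(size + 1)]

def lookup_truncate(table, size, phase):
    """No interpolation: take the entry at or below the requested position."""
    return table[int(phase * size)]

def lookup_linear(table, size, phase):
    """Linear interpolation between the two neighbouring entries."""
    pos = phase * size
    i = int(pos)
    frac = pos - i
    return table[i] + frac * (table[i + 1] - table[i])

def max_error(lookup, size, samples=100_000):
    """Largest absolute error over `samples` evenly spaced phases in [0, 1)."""
    table = build_table(size)
    worst = 0.0
    for k in range(samples):
        phase = k / samples
        exact = math.sin(2.0 * math.pi * phase)
        worst = max(worst, abs(lookup(table, size, phase) - exact))
    return worst

if __name__ == "__main__":
    threshold = 1.0 / 32768.0   # one step of a signed 16-bit sample
    for name, fn in (("no-interpolation", lookup_truncate), ("linear-interp", lookup_linear)):
        err = max_error(fn, 4096)
        print(f"{name}: {err * 100:.6f}% error, within 16-bit step: {err < threshold}")
```

For a 4096-entry table the expected worst cases are roughly one table step (2*PI/4096, about 0.15%) without interpolation and about a step squared over eight (about 0.00003%) with linear interpolation, which is in line with the numbers quoted above.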

I also tried hermite interpolation (4-point), but it produced errors, equal to no-interpolation XD
Btw, I use these interpolations for audio, too. I tend to use 32.32 fixed-point for "phase" (the angle in this case); all produced audio signal is kept in float32, and all intermediate results are kept in float64 or float80. But in my case, perfect, professional audio quality is a must-have. I see no reason for you to use 16.16 audio. Come on, even a P3/P4 can handle floats OK, and any AMD CPU since the Athlon is a beast in FPU calculations.
Posted on 2007-10-30 01:51:41 by Ultrano
Well, it looks like with the 6-point spline you could get away with a table of only 8 entries instead of 4096 entries. That would greatly reduce the risk of cache misses (8 floats = one cache line).

Posted on 2007-10-30 04:09:39 by Rockoon
I'm using 16.16 because that's the only format supported, ditto with bit depth. Not to mention that floats are not preferable for this application. However, I was considering 32-bit because I might want to use this code in future projects. It seems that linear interpolation is the best way to go at this point, so now the concern is efficiency. From what I'm told, it seems fast enough. However, it would preferably be able to run well with the aforementioned conditions on my computer: 500MHz PIII, 324MB RAM. Is it even possible?

For optimization purposes, all voices will need to have at most 31 harmonics. If this can be integrated into the algo without a significant speed hit, then that's one less thing to worry about. However... If there isn't a way, then 31^2=961, 961*=. Of course, for all I know, I could be worrying about nothing. I just don't have that much experience.
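
The budget can be put into numbers. The sketch below assumes 31 voices with 31 harmonics each, every harmonic computed as a separate sine per sample, a 44100Hz sample rate and the whole 500MHz clock available for sines. All of these are assumptions read from the figures in the thread, and the function name is made up.

```python
def cycles_per_sine(clock_hz, sample_rate, voices, harmonics):
    """Clock cycles available per sine evaluation if every harmonic is computed per sample."""
    sines_per_second = voices * harmonics * sample_rate
    return clock_hz / sines_per_second

if __name__ == "__main__":
    voices, harmonics, rate = 31, 31, 44100
    print("sines per sample:", voices * harmonics)
    print("sines per second:", voices * harmonics * rate)
    print("cycles per sine at 500 MHz:", round(cycles_per_sine(500e6, rate, voices, harmonics), 1))
```

Under those assumptions that is 961 sines per sample, 42,380,100 sines per second, and about 11.8 clock cycles per sine, before mixing or any other work. The 17-cycle figure quoted earlier for the 4-point spline would already exceed that, so the per-sine cost matters a great deal at this polyphony.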

At any rate, I'm thankful for all the help,
- keantoken
Posted on 2007-10-30 18:16:09 by keantoken
IMHO, a Pentium 3 would be happier executing SSE code.
Posted on 2007-10-30 18:35:30 by ti_mo_n
My processor does have SSE... Do all PIIIs have SSE? Sorry for my indefinite lack of knowledge on this subject... :P

Also, I read somewhere in the aforementioned threads that SSE compromised accuracy... Though I don't understand the comment fully.

- keantoken
Posted on 2007-10-30 19:39:04 by keantoken
Which Windows version are you using now? AFAIK the only significant addition the PentiumIII had was SSE. I remember that many people complained about this because they said that "PentiumII with SSE" would be more appropriate, since "Pentium with MMX" wasn't called PentiumII.
Posted on 2007-10-30 20:00:35 by LocoDelAssembly
I'm running W2k Pro. Actually it's a lot faster than many windows systems simply because I waste so much time trying to figure out how to make it faster ;)

Also, I thought it was somewhat strange when I heard of PIIIs having lower clocks than PIIs... That's insanely silly :P

- keantoken.
Posted on 2007-10-30 21:17:34 by keantoken