Thanks guys. I'll probably keep a version of each around and experiment on a bunch of different machines once I know exactly what I'll need for the synthesis engine.

And according to the docs, fsin on the Pentium executes in the U pipe, which means nothing can be done in parallel. The K7 has always been much faster for floating-point math; fsin just further confirms it.

Now, I have to go off in search of coefficients for the faster sine algorithms.

- Chris
Posted on 2002-09-17 09:35:37 by bit
Here's Sine7. Sorry for the delay; I wasn't able to find time to look at the code. I haven't taken into consideration optimizations such as using mul instead of div; maybe after I've finished everything, I'll profile and do some optimizing, and probably add some prefetch instructions... This is just a straight translation of the original FPU code.
movss	xmm0, x

movss xmm1, xmm0
mulss xmm1, xmm0
movss xmm2, xmm1
divss xmm2, const_42
movss xmm3, const_1
movss xmm4, xmm3
movss xmm5, xmm3
subss xmm3, xmm2
mulss xmm3, xmm1
divss xmm3, const_20
subss xmm4, xmm3
mulss xmm4, xmm1
divss xmm4, const_6
subss xmm5, xmm4
mulss xmm0, xmm5
movss x, xmm0
BTW, I'm just doing this for fun. I don't claim these are faster, but they should be.
Posted on 2002-09-17 16:07:20 by stryker
Sine9
movss	xmm0, x

movss xmm1, xmm0
mulss xmm1, xmm0
movss xmm2, xmm1
movss xmm3, xmm1
divss xmm2, const_72
movss xmm4, const_1
movss xmm5, xmm4
movss xmm6, xmm4
movss xmm7, xmm4
subss xmm4, xmm2
mulss xmm4, xmm1
divss xmm4, const_42
subss xmm5, xmm4
mulss xmm5, xmm1
divss xmm5, const_20
subss xmm6, xmm5
mulss xmm6, xmm1
divss xmm6, const_6
subss xmm7, xmm6
mulss xmm0, xmm7
movss x, xmm0
Posted on 2002-09-17 23:05:18 by stryker
Sine11
movss	xmm0, x

movss xmm1, xmm0
mulss xmm1, xmm0
movss xmm2, xmm1
movss xmm3, xmm1
divss xmm2, const_110
movss xmm4, const_1
movss xmm5, xmm4
movss xmm6, xmm4
movss xmm7, xmm4
subss xmm4, xmm2
mulss xmm4, xmm1
divss xmm4, const_72
subss xmm5, xmm4
mulss xmm5, xmm1
movss xmm4, xmm6
divss xmm5, const_42
subss xmm6, xmm5
mulss xmm6, xmm1
divss xmm6, const_20
subss xmm7, xmm6
mulss xmm7, xmm1
divss xmm7, const_6
subss xmm4, xmm7
mulss xmm0, xmm4
movss x, xmm0
Posted on 2002-09-17 23:32:20 by stryker
Yet another SSE Sine9

It takes about 40-42 clocks on my PIII. To use the SIMD features, it implements the Taylor expansion literally, computing the terms in parallel. One obvious drawback is that it supports only single precision.


.686
.xmm
.model flat
option casemap:none

.const
COEFF dd 0be2aaaabh ; -1/3!
dd 03c088889h ; 1/5!
dd 0b9500d01h ; -1/7!
dd 03638ef1dh ; 1/9!

; float Sine9(float x)
.code
Sine9 PROC C
movss XMM0,[esp+4]
shufps XMM0,XMM0,0
movaps XMM1,XMM0
mulps XMM1,XMM0
movaps XMM2,XMM1
mulss XMM1,XMM0
mulss XMM2,XMM2
movlhps XMM2,XMM1
shufps XMM2,XMM2,25h
movss XMM2,XMM0
mulps XMM2,XMM2
mulps XMM2,XMM0
mulps XMM2,COEFF
movhlps XMM1,XMM2
addps XMM1,XMM2
addss XMM0,XMM1
shufps XMM1,XMM1,1
addss XMM0,XMM1
movss [esp-4],XMM0
fld dword ptr [esp-4]
ret
Sine9 ENDP
END
Posted on 2002-09-18 23:28:38 by Starless
Have you heard about Chebyshev polynomials?
They improve considerably on the Taylor series, if I remember correctly.

JC
Posted on 2002-09-21 15:15:27 by MCoder
Not to put a damper on things, but this is what you're competing against, Intel's MKL:

vsSin
34.9 Standard Processor
33.9 PIII
19.6 PIV
6.2 Itanium

vdSin
49.9 Standard Processor
49.9 PIII
43.5 PIV
7.3 Itanium

Of course that is for bulk processing of multiple Sin values :)
Posted on 2002-09-21 18:22:02 by Eóin
MCoder,

Thanks for the info. I'll check it out

Eoin,

shoot!!! Those MKL ones are pretty fast. I still haven't started optimizing mine. Anyway, thanks for letting us know. It's a challenge. ;)
Posted on 2002-09-21 21:57:17 by stryker
MCoder, yeah those are the coefficients that I have been searching for. So far all I can find is information and coefficients for the cosine.

Eóin, how accurate are those routines of Intel's, and what exactly do you mean by bulk (a pointer to 2 or more values to take the sine of)? Also, I take it the "vs" is single precision and "vd" is double precision? I have numbers from some Athlons of about 4 cycles for my 5th-degree approximation routine. But for some reason the timing on the P4s is showing hundreds of cycles (not just for mine, either; this happens for the standard C routine as well). For whatever reason, the AMDs totally smoke the Pentiums in the FP department. Very tragic to see. :)

It looks like the SSE version Starless posted, if converted to the Chebyshev coefficients, would run as fast as Intel's lib, if not faster. And the one I posted, once converted to the Chebyshev coefficients, would probably run close to the vdSin routine, perhaps faster; who knows. Of course, these are both 9th-degree approximations, and I don't know how accurate Intel's are.

Oh yeah, Eóin, do Intel's routines do range reduction?

- Chris
Posted on 2002-09-21 22:07:58 by bit
More about the Intel algo:
http://www.intel.com/technology/itj/q41999/articles/art_5.htm

http://www.research.scea.com/gdc2002/fastermath.pdf

; 11 MUL
; 5 ADD/SUB
; 7 LOAD
fld x
fld st
fmul st,st(1)
fld one_over_110
fmul st,st(1)
fld1
fsubr
fmul st,st(1)
fmul one_over_72
fld1
fsubr
fmul st,st(1)
fmul one_over_42
fld1
fsubr
fmul st,st(1)
fmul one_over_20
fld1
fsubr
fmulp st(1),st
fmul one_over_6
fmul st,st(1)
fsub
...and some interesting C++ code (I don't know if the comments are true).
This isn't my code! :)
doubledouble sin(const doubledouble& x) {

static const doubledouble tab[9]={ // tab[b] := sin(b*Pi/16)...
0.0,
"0.1950903220161282678482848684770222409277",
"0.3826834323650897717284599840303988667613",
"0.5555702330196022247428308139485328743749",
"0.7071067811865475244008443621048490392850",
"0.8314696123025452370787883776179057567386",
"0.9238795325112867561281831893967882868225",
"0.9807852804032304491261822361342390369739",
1.0
};
static const doubledouble sinsTab[7] = { // Chebyshev coefficients
"0.9999999999999999999999999999993767021096",
"-0.1666666666666666666666666602899977158461",
"8333333333333333333322459353395394180616.0e-42",
"-1984126984126984056685882073709830240680.0e-43",
"2755731922396443936999523827282063607870.0e-45",
"-2505210805220830174499424295197047025509.0e-47",
"1605649194713006247696761143618673476113.0e-49"
};
if (fabs(x.h())<1.0e-7) return x*(1.0-sqr(x)/3);
int a,b; doubledouble sins, coss, k1, k3, t2, s, s2, sinb, cosb;
// reduce x: -Pi < x <= Pi
k1=x/doubledouble::TwoPi; k3=k1-rint(k1);
// determine integers a and b to minimize |s|, where s=x-a*Pi/2-b*Pi/16
t2=4*k3;
a=int(rint(t2));
b=int(rint((8*(t2-a)))); // must have |a|<=2 and |b|<=7 now
s=doubledouble::Pi*(k3+k3-(8*a+b)/16.0); // s is now reduced argument. -Pi/32 < s < Pi/32
s2=sqr(s);
// Chebyshev series on -Pi/32..Pi/32, max abs error 2^-98=3.16e-30, whereas
// power series has error 6e-28 at Pi/32 with terms up to x^13 included.
sins=s*(sinsTab[0]+(sinsTab[1]+(sinsTab[2]+(sinsTab[3]+(sinsTab[4]+
(sinsTab[5]+sinsTab[6]*s2)*s2)*s2)*s2)*s2)*s2);
coss=sqrt(1.0-sqr(sins)); // ok as -Pi/32 < s < Pi/32
// here sinb=sin(b*Pi/16) etc.
sinb= (b>=0) ? tab[b] : -tab[-b]; cosb=tab[8-abs(b)];
if (0==a) {
return sins*cosb+coss*sinb;
} else if (1==a) {
return -sins*sinb+coss*cosb;
} else if (-1==a) {
return sins*sinb-coss*cosb;
} else { // |a|=2
return -sins*cosb-coss*sinb;
}
// i.e. return sins*(cosa*cosb-sina*sinb)+coss*(sina*cosb+cosa*sinb);
}
Posted on 2002-09-22 02:49:05 by bitRAKE
A quick search of Chebychev and sine with Google lead to:
http://hjem.get2net.dk/bnielsen/sinus.html
or with Tchebychev:
http://home.att.net/~numericana/answer/functions.htm

Taylor:
sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ...

3-term Chebychev:
sin(x) = x - 0.16605 x^3 + 0.00761 x^5

5-term Chebychev:
sin(x) = x - 0.1666666664 x^3 + 0.0083333315 x^5
- 0.0001984090 x^7 + 0.0000027526 x^9
- 0.0000000239 x^11

Try Chebychev, Chebyshev, Tchebychev, Tchebycheff, etc...

JC
Posted on 2002-09-22 04:26:07 by MCoder
Bulk means they take in an array of dds/qds, depending on the function (vs/vd), and output an array of answers. They only seem to achieve those clock ratings when dealing with arrays of 100+ elements.

I know we've heard again and again that the FPU on the P4 sucks, but that just means Intel's given up on it. Unfortunately, with SSE we'll never get better than qword (dq) precision, but I find it hard to believe that in circumstances where qwords aren't precise enough, speed will be an issue :) .

Maybe it's just me, but bulk processing is the only place I really think you need the speed :eek: .
Posted on 2002-09-22 07:24:23 by Eóin
; Chebyshev coefficients

;cSINc1 REAL8 0.9999999999999999999999999999993767021096
cSINc2 REAL8 -0.1666666666666666666666666602899977158461
cSINc3 REAL8 0.8333333333333333333322459353395394180616e-2
cSINc4 REAL8 -0.1984126984126984056685882073709830240680e-3
cSINc5 REAL8 0.2755731922396443936999523827282063607870e-5
cSINc6 REAL8 -0.2505210805220830174499424295197047025509e-7
cSINc7 REAL8 0.1605649194713006247696761143618673476113e-9


; 8 MUL
; 6 ADD/SUB
; 3 LOAD
fld x
fld st
fmul st, st(1)
fld cSINc7
fmul st, st(1)
fadd cSINc6
fmul st, st(1)
fadd cSINc5
fmul st, st(1)
fadd cSINc4
fmul st, st(1)
fadd cSINc3
fmul st, st(1)
fadd cSINc2
fmul ; pop x^2
fmul st, st(1)
fadd ; pop x
Sum of error from -pi/2 to pi/2, 1024 sample points:
4.2130688617853282550e-08 Chebyshev
4.1491462730860308530e-06 Taylor

1024^2 Sample Points:
0.0000428065976113201 Chebyshev
0.0042198884936352558 Taylor

Looks like almost two more digits of accuracy.
Now let's take a look at the speed on an Athlon...

Average of 100 loops:
52.15 cycles Chebyshev
60.22 cycles Taylor
92.48 cycles FPU (AMD says 96-192)

These are actually slow; the dependencies are making the processor very inefficient, and you're seeing the full latency of each instruction in these times. The Athlon FPU can do one load/store, one mul, and one add operation per cycle.
Posted on 2002-09-22 11:40:17 by bitRAKE
To demonstrate the heavy dependencies of the algorithm, I have shuffled together a dual Chebyshev sine function. This algorithm executes in the same time as the single-sine algorithm above (54 cycles = 27 cycles per sine)! It computes the sine of the top two floats on the stack:
	fld	st(1)		;	x	y	x

fmul st, st ; x*x y x
fld st(1) ; y x*x y x
fmul st, st ; y*y x*x y x

fld st(1)
fld cSINc7
fmul st(1), st ; c7 c7*x*x y*y x*x y x

fmul st, st(2) ;y
fxch st(1)
fadd cSINc6

fmul st, st(3) ;x
fxch st(1)
fadd cSINc6

fmul st, st(2) ;y
fxch st(1)
fadd cSINc5

fmul st, st(3) ;x
fxch st(1)
fadd cSINc5

fmul st, st(2) ;y
fxch st(1)
fadd cSINc4

fmul st, st(3) ;x
fxch st(1)
fadd cSINc4

fmul st, st(2) ;y
fxch st(1)
fadd cSINc3

fmul st, st(3) ;x
fxch st(1)
fadd cSINc3

fmul st, st(2) ;y
fxch st(1)
fadd cSINc2

fmulp st(3), st ;x
fadd cSINc2


fmulp st(1), st ;y



fmul st, st(2) ;y



faddp st(2), st ;y
fmul st, st(2) ;x


faddp st(2), st ;x
The line spacing gives a little insight into the latency of the instructions. Note: this is not optimal; the algorithm needs to be reworked backwards from the end to save a cycle or two.
Posted on 2002-09-22 16:07:35 by bitRAKE
MCoder, I was able to find some stuff when searching with the alternate spellings. It would be nice if the transliteration were the same on all sites. Unfortunately, it isn't. Ah well.

bitRAKE, it is interesting to see exactly how much dependency there is. Chebyshev is always going to be a bit faster than Taylor for the same amount of accuracy, since it distributes the error and can thus get by with a lower-degree polynomial (typically one term fewer). So, with that in mind, converting all my functions to Chebyshev should be worthwhile.

I am still working out in my mind how I am going to organize all of these different versions and which will be the fastest on each of the platforms. Who knows, I may get lucky and be able to get by with Goertzel's algo for the oscillator(s).

- Chris
Posted on 2002-09-22 16:28:36 by bit
bit (Chris), in the tests above I actually used an extra term on Chebyshev. The difference in speed is mainly due to fewer FMUL instructions. Four sines at once might keep the Athlon busy. ;)
Posted on 2002-09-22 16:36:39 by bitRAKE
Cool. That means it should be a lot faster than the Taylor, with the same accuracy, once that extra term is removed. I can't wait until I get to the actual writing of the synthesis engine! Right now it is hard to envision how everything is going to fit together (mainly because sound synthesis is a new arena for me).

- Chris
Posted on 2002-09-22 16:44:00 by bit