Thanks guys. I'll probably keep a version of each around and experiment on a bunch of different machines once I know exactly what I'll need for the synthesis engine.

And, according to the docs, fsin on the Pentium executes in the U pipe, which means nothing can be done in parallel. The K7 has always been much faster at floating-point math - fsin just confirms it further.

Now, I have to go off in search of coefficients for the faster sine algorithms.

- Chris


Here's Sine7. Sorry for the delay - I wasn't able to find time to look at the code. I haven't taken optimizations such as using mul instead of div into consideration; maybe after I've finished everything I'll profile and optimize, and probably add some prefetch instructions. This is just a straight translation of the original FPU code.

```
movss xmm0, x
movss xmm1, xmm0
mulss xmm1, xmm0
movss xmm2, xmm1
divss xmm2, const_42
movss xmm3, const_1
movss xmm4, xmm3
movss xmm5, xmm3
subss xmm3, xmm2
mulss xmm3, xmm1
divss xmm3, const_20
subss xmm4, xmm3
mulss xmm4, xmm1
divss xmm4, const_6
subss xmm5, xmm4
mulss xmm0, xmm5
movss x, xmm0
```
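For reference, the nested polynomial the Sine7 code evaluates can be sketched in C (the function name is mine, and like the assembly it is only accurate for arguments already reduced to roughly ±pi/2):

```c
#include <math.h>

/* Sketch of the Sine7 SSE code: a 7th-degree Taylor approximation of
 * sin(x), evaluated with the same nesting as the xmm registers above. */
float sine7(float x)
{
    float x2 = x * x;
    float t = 1.0f - x2 / 42.0f;   /* 42 = 6*7, innermost factor */
    t = 1.0f - (t * x2) / 20.0f;   /* 20 = 4*5 */
    t = 1.0f - (t * x2) / 6.0f;    /*  6 = 2*3 */
    return x * t;                  /* = x - x^3/3! + x^5/5! - x^7/7! */
}
```

Expanding the nesting gives exactly the truncated Taylor series, since 6*20 = 120 = 5! and 6*20*42 = 5040 = 7!.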

BTW, I'm just doing this for fun. I don't claim these are faster, but they should be.

Sine9

```
movss xmm0, x
movss xmm1, xmm0
mulss xmm1, xmm0
movss xmm2, xmm1
movss xmm3, xmm1
divss xmm2, const_72
movss xmm4, const_1
movss xmm5, xmm4
movss xmm6, xmm4
movss xmm7, xmm4
subss xmm4, xmm2
mulss xmm4, xmm1
divss xmm4, const_42
subss xmm5, xmm4
mulss xmm5, xmm1
divss xmm5, const_20
subss xmm6, xmm5
mulss xmm6, xmm1
divss xmm6, const_6
subss xmm7, xmm6
mulss xmm0, xmm7
movss x, xmm0
```

Sine11

```
movss xmm0, x
movss xmm1, xmm0
mulss xmm1, xmm0
movss xmm2, xmm1
movss xmm3, xmm1
divss xmm2, const_110
movss xmm4, const_1
movss xmm5, xmm4
movss xmm6, xmm4
movss xmm7, xmm4
subss xmm4, xmm2
mulss xmm4, xmm1
divss xmm4, const_72
subss xmm5, xmm4
mulss xmm5, xmm1
movss xmm4, xmm6
divss xmm5, const_42
subss xmm6, xmm5
mulss xmm6, xmm1
divss xmm6, const_20
subss xmm7, xmm6
mulss xmm7, xmm1
divss xmm7, const_6
subss xmm4, xmm7
mulss xmm0, xmm4
movss x, xmm0
```

Yet another SSE Sine9

It takes about 40-42 clocks on my PIII. To use the SIMD features, it implements the Taylor expansion literally. One obvious drawback is that it supports only single precision.


```
.686
.xmm
.model flat
option casemap:none

.const
COEFF dd 0be2aaaabh ; -1/3!
      dd 03c088889h ;  1/5!
      dd 0b9500d01h ; -1/7!
      dd 03638ef1dh ;  1/9!

; float Sine9(float x)
.code
Sine9 PROC C
    movss XMM0,[esp+4]
    shufps XMM0,XMM0,0
    movaps XMM1,XMM0
    mulps XMM1,XMM0
    movaps XMM2,XMM1
    mulss XMM1,XMM0
    mulss XMM2,XMM2
    movlhps XMM2,XMM1
    shufps XMM2,XMM2,25h
    movss XMM2,XMM0
    mulps XMM2,XMM2
    mulps XMM2,XMM0
    mulps XMM2,COEFF
    movhlps XMM1,XMM2
    addps XMM1,XMM2
    addss XMM0,XMM1
    shufps XMM1,XMM1,1
    addss XMM0,XMM1
    movss [esp-4],XMM0
    fld dword ptr [esp-4]
    ret
Sine9 ENDP
END
```
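Lane juggling aside, what the routine computes is the plain Taylor sum: the odd powers x^3..x^9 scaled by the packed COEFF constants and added to x. A scalar C sketch (names are mine, not from the original):

```c
#include <math.h>

/* Scalar sketch of the SSE Sine9 above: each COEFF lane holds one
 * reciprocal-factorial coefficient, applied to the matching odd power. */
float sine9(float x)
{
    const float c[4] = {
        -1.0f / 6.0f,       /* -1/3! */
         1.0f / 120.0f,     /*  1/5! */
        -1.0f / 5040.0f,    /* -1/7! */
         1.0f / 362880.0f,  /*  1/9! */
    };
    float x2 = x * x;
    float p = x * x2;       /* start at x^3 */
    float sum = x;
    for (int i = 0; i < 4; i++) {
        sum += c[i] * p;    /* add c[i] * x^(2i+3) */
        p *= x2;
    }
    return sum;
}
```

The SIMD version computes the four products in one mulps instead of this loop, which is why it can afford the literal (un-nested) expansion.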

Have you heard about Chebyshev polynomials?

They improve considerably on the Taylor series, if I remember correctly.

JC


Not to put a damper on you, but this is what you're competing against - Intel's MKL:

```
vsSin
  34.9  Standard Processor
  33.9  PIII
  19.6  PIV
   6.2  Itanium

vdSin
  49.9  Standard Processor
  49.9  PIII
  43.5  PIV
   7.3  Itanium
```

Of course that is for bulk processing of multiple sin values :)


MCoder,

Thanks for the info. I'll check it out

Eoin,

shoot!!! those MKL ones are pretty fast. I still haven't started optimizing mine. Anyway, thanks for letting us know. It's a challenge. ;)


MCoder, yeah those are the coefficients that I have been searching for. So far all I can find is information and coefficients for the cosine.

Eoin, how accurate are those routines of Intel's, and what exactly do you mean by bulk (a pointer to two or more values to take the sine of)? Also, I take it "vs" is single precision and "vd" is double precision? I have numbers from some Athlons of about 4 cycles for my 5th-degree approximation routine. But for some reason the timing on the P4s shows hundreds of cycles (not just for mine either; this happens for the standard C routine as well). For whatever reason the AMDs totally smoke the Pentiums in the FP department. Very tragic to see. :)

It looks like the MMX version Startless posted, if converted to the Chebyshev coefficients, would run as fast as Intel's lib, if not faster. And the one I posted, once converted to the Chebyshev coefficients, would probably run close to the vdSin routine, perhaps faster - who knows. Of course, these are both 9th-degree approximations, and I don't know how accurate Intel's are.

Oh yeah, Eoin, do Intel's routines do range reduction?

- Chris


More about the Intel algo:

http://www.intel.com/technology/itj/q41999/articles/art_5.htm

http://www.research.scea.com/gdc2002/fastermath.pdf


```
; 11 MUL
; 5 ADD/SUB
; 7 LOAD
fld x
fld st
fmul st,st(1)
fld one_over_110
fmul st,st(1)
fld1
fsubr
fmul st,st(1)
fmul one_over_72
fld1
fsubr
fmul st,st(1)
fmul one_over_42
fld1
fsubr
fmul st,st(1)
fmul one_over_20
fld1
fsubr
fmulp st(1),st
fmul one_over_6
fmul st,st(1)
fsub
```

...and some interesting C++ code (I don't know if the comments are true). This isn't my code! :)

```
doubledouble sin(const doubledouble& x) {
  static const doubledouble tab[9] = { // tab[b] := sin(b*Pi/16)...
    0.0,
    "0.1950903220161282678482848684770222409277",
    "0.3826834323650897717284599840303988667613",
    "0.5555702330196022247428308139485328743749",
    "0.7071067811865475244008443621048490392850",
    "0.8314696123025452370787883776179057567386",
    "0.9238795325112867561281831893967882868225",
    "0.9807852804032304491261822361342390369739",
    1.0
  };
  static const doubledouble sinsTab[7] = { // Chebyshev coefficients
    "0.9999999999999999999999999999993767021096",
    "-0.1666666666666666666666666602899977158461",
    "8333333333333333333322459353395394180616.0e-42",
    "-1984126984126984056685882073709830240680.0e-43",
    "2755731922396443936999523827282063607870.0e-45",
    "-2505210805220830174499424295197047025509.0e-47",
    "1605649194713006247696761143618673476113.0e-49"
  };
  if (fabs(x.h())<1.0e-7) return x*(1.0-sqr(x)/3);
  int a,b; doubledouble sins, coss, k1, k3, t2, s, s2, sinb, cosb;
  // reduce x: -Pi < x <= Pi
  k1=x/doubledouble::TwoPi; k3=k1-rint(k1);
  // determine integers a and b to minimize |s|, where s=x-a*Pi/2-b*Pi/16
  t2=4*k3;
  a=int(rint(t2));
  b=int(rint((8*(t2-a)))); // must have |a|<=2 and |b|<=7 now
  s=doubledouble::Pi*(k3+k3-(8*a+b)/16.0); // s is now reduced argument, -Pi/32 < s < Pi/32
  s2=sqr(s);
  // Chebyshev series on -Pi/32..Pi/32, max abs error 2^-98=3.16e-30, whereas
  // power series has error 6e-28 at Pi/32 with terms up to x^13 included.
  sins=s*(sinsTab[0]+(sinsTab[1]+(sinsTab[2]+(sinsTab[3]+(sinsTab[4]+
       (sinsTab[5]+sinsTab[6]*s2)*s2)*s2)*s2)*s2)*s2);
  coss=sqrt(1.0-sqr(sins)); // ok as -Pi/32 < s < Pi/32
  // here sinb=sin(b*Pi/16) etc.
  sinb= (b>=0) ? tab[b] : -tab[-b]; cosb=tab[8-abs(b)];
  if (0==a) {
    return sins*cosb+coss*sinb;
  } else if (1==a) {
    return -sins*sinb+coss*cosb;
  } else if (-1==a) {
    return sins*sinb-coss*cosb;
  } else { // |a|=2
    return -sins*cosb-coss*sinb;
  }
  // i.e. return sins*(cosa*cosb-sina*sinb)+coss*(sina*cosb+cosa*sinb);
}
```

A quick Google search for Chebychev and sine leads to:

http://hjem.get2net.dk/bnielsen/sinus.html

or with Tchebychev:

http://home.att.net/~numericana/answer/functions.htm

Taylor:

sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ...

3-term Chebychev:

sin(x) = x - 0.16605 x^3 + 0.00761 x^5

5-term Chebychev:

sin(x) = x - 0.1666666664 x^3 + 0.0083333315 x^5 - 0.0001984090 x^7 + 0.0000027526 x^9 - 0.0000000239 x^11

Try Chebychev, Chebyshev, Tchebychev, Tchebycheff, etc...

JC
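Those 5-term coefficients are easy to sanity-check against libm with Horner's rule; a minimal sketch (the function name is mine), intended for arguments on roughly [-pi/2, pi/2]:

```c
#include <math.h>

/* The 5-term "Chebychev" coefficients quoted above, Horner-evaluated.
 * Error spread across [-pi/2, pi/2] is on the order of 1e-8. */
double sin_cheb(double x)
{
    double x2 = x * x;
    return x * (1.0 + x2 * (-0.1666666664 + x2 * (0.0083333315
             + x2 * (-0.0001984090 + x2 * (0.0000027526
             + x2 * (-0.0000000239))))));
}
```

Note how close the leading coefficients are to -1/3! and 1/5!; the economization only nudges them, trading a little accuracy near zero for uniform error over the interval.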


Bulk means they take an array of dwords/qwords, depending on the function (vs/vd), and output an array of answers. They only seem to achieve those clock ratings when dealing with arrays of 100+ elements.

I know we've heard again and again that the FPU on the P4 sucks, but that just means Intel's given up on it. Unfortunately, with SSE we'll never get better than qword precision, but I find it hard to believe that in circumstances where qwords aren't precise enough, speed will be an issue :) .

Maybe it's just me, but bulk processing is the only place I really think you need the speed :eek: .


```
; Chebyshev coefficients
;cSINc1 REAL8  0.9999999999999999999999999999993767021096
cSINc2 REAL8 -0.1666666666666666666666666602899977158461
cSINc3 REAL8  0.8333333333333333333322459353395394180616e-2
cSINc4 REAL8 -0.1984126984126984056685882073709830240680e-3
cSINc5 REAL8  0.2755731922396443936999523827282063607870e-5
cSINc6 REAL8 -0.2505210805220830174499424295197047025509e-7
cSINc7 REAL8  0.1605649194713006247696761143618673476113e-9

; 8 MUL
; 6 ADD/SUB
; 3 LOAD
fld x
fld st
fmul st, st(1)
fld cSINc7
fmul st, st(1)
fadd cSINc6
fmul st, st(1)
fadd cSINc5
fmul st, st(1)
fadd cSINc4
fmul st, st(1)
fadd cSINc3
fmul st, st(1)
fadd cSINc2
fmul            ; pop x^2
fmul st, st(1)
fadd            ; pop x
```
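The FPU sequence is just Horner's rule on x^2, with cSINc1 = 1 folded into the final add of x. A C rendering (function name is mine; the REAL8 constants are truncated to double precision, which is all they can hold anyway):

```c
#include <math.h>

/* sin(x) ~ x + x^3 * (c2 + x^2*(c3 + ...)), matching the fld/fmul/fadd
 * chain above: Horner from cSINc7 down to cSINc2, then the x term. */
double sin_cheb13(double x)
{
    const double c2 = -0.16666666666666666667,
                 c3 =  0.83333333333333333322e-2,
                 c4 = -0.19841269841269840567e-3,
                 c5 =  0.27557319223964439370e-5,
                 c6 = -0.25052108052208301745e-7,
                 c7 =  0.16056491947130062477e-9;
    double x2 = x * x;
    double p = ((((c7 * x2 + c6) * x2 + c5) * x2 + c4) * x2 + c3) * x2 + c2;
    return x + x2 * x * p;
}
```

Since the coefficients barely differ from the reciprocal factorials, on ±pi/2 the error is dominated by truncating after the x^13 term, i.e. well below 1e-9.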

Sum of error from -pi/2 to pi/2, 1024 sample points:

```
4.2130688617853282550e-08  Chebyshev
4.1491462730860308530e-06  Taylor
```

1024^2 sample points:

```
0.0000428065976113201  Chebyshev
0.0042198884936352558  Taylor
```

Looks like almost two more digits of accuracy.

Now let's take a look at the speed on an Athlon. Average of 100 loops:

```
52.15 cycles  Chebyshev
60.22 cycles  Taylor
92.48 cycles  FPU fsin (AMD says 96-192)
```

These are actually slow - the dependencies are making the processor very inefficient; you're seeing the full latency of each instruction in these times. The Athlon FPU can do one load/store, one multiply, and one add operation per cycle.

To demonstrate the huge dependencies of the algorithm, I have shuffled together a dual Chebyshev sine function. This algorithm executes in the same time as the single sine algorithm above (54 cycles = **27 cycles per sine**)! It computes the sine for the top two floats on the stack:

```
fld st(1)       ; x y x
fmul st, st     ; x*x y x
fld st(1)       ; y x*x y x
fmul st, st     ; y*y x*x y x
fld st(1)
fld cSINc7
fmul st(1), st  ; c7 c7*x*x y*y x*x y x
fmul st, st(2)  ; y
fxch st(1)
fadd cSINc6
fmul st, st(3)  ; x
fxch st(1)
fadd cSINc6
fmul st, st(2)  ; y
fxch st(1)
fadd cSINc5
fmul st, st(3)  ; x
fxch st(1)
fadd cSINc5
fmul st, st(2)  ; y
fxch st(1)
fadd cSINc4
fmul st, st(3)  ; x
fxch st(1)
fadd cSINc4
fmul st, st(2)  ; y
fxch st(1)
fadd cSINc3
fmul st, st(3)  ; x
fxch st(1)
fadd cSINc3
fmul st, st(2)  ; y
fxch st(1)
fadd cSINc2
fmulp st(3), st ; x
fadd cSINc2
fmulp st(1), st ; y
fmul st, st(2)  ; y
faddp st(2), st ; y
fmul st, st(2)  ; x
faddp st(2), st ; x
```
The line spacing gives a little insight into the latency of the instructions. Note: this is not optimal - the algorithm needs to be reworked backwards from the end to save a cycle or two.

MCoder, I was able to find some stuff when searching with the alternate spellings. It would be nice if the transliteration were the same on all sites. Unfortunately, it isn't. Ah well.

bitRAKE, it is interesting to see exactly how much dependency there is. The Chebyshev is always going to be a bit faster than the Taylor for the same accuracy, since it distributes the error and can thus get by with a lower-degree polynomial (typically one term fewer). So, with that in mind, converting all my functions to Chebyshev should be worthwhile.

I am still working out in my mind how I am going to organize all of these different versions and which will be the fastest on each of the platforms. Who knows, I may get lucky and be able to get by with Goertzel's algo for the oscillator(s).

- Chris


Chris, in the tests above I actually used an extra term on the Chebyshev. The difference in speed is mainly due to fewer FMUL instructions. Four sines at once might keep the Athlon busy. ;)

Cool. That means it should be a lot faster than the Taylor, but with the same accuracy, when that extra term is removed. I can't wait till I get to the actual writing of the synthesis engine! Right now it is hard to envision how everything is going to fit together (mainly because sound synthesis is a new arena for me).

- Chris
