I worked on this over night using the Infinite Addition algorithm for cosines
It took a while, because I didn't know that a ! meant integer PRODUCT from 1*2*3..*Number

The benchmarking was done on a P4 3.2ghz with varying radian values.

Making the algorithm parallel took a little fancy DataTable organizing and decent sized setup before the iterations and take down after. This will of course be made much easier with the new SSE3 instructions (I don't have access to a machine with SSE3 yet).


;IN: ptr to qword double precision FP value in radians
;OUT: value of in ptr replaced with Cosine
align 16  ;Lou Ricci, SSE Cosine, for the Community
CosSSE8_2:  ;;parallelism
    mov eax, ;note: unrolling makes SSE slower on P4
    MOVSD xmm0,qword  ;\
    MOVHPD xmm0,qword ;SSE3 MOV-DUP
    MOVDQA xmm1,xmm0 ; HI  LO
    MULPD xmm0,xmm1 ; x^2 | x^2
    MULPD xmm1,xmm1 ; x^2 | x^2
    MULSD xmm0,xmm1 ; x^2 | x^4
    MULPD xmm1,xmm1 ; x^4 | x^4
    MOVDQA xmm3, dqword ; 1.0 | 0.0
    MOVDQA xmm2,xmm0 ;copy for scaling x^'s with xmm1
    mov ecx,-16*6 ;12 iterations (6*2) parallel
  .LP:
    MULPD xmm0, dqword ;x^? / (?)!
    ADDPD xmm3,xmm0    ;1 - + - + -...
    MULPD xmm2,xmm1    ;x^? | x^? TO x^(?+4) | x^(?+4)
    MOVDQA xmm0,xmm2
    add ecx,16
    js .LP
    ;add high qword and low qword of xmm3 for result
    ;err no SSE3 horizontal add yet so, cheap work around
    ;SSE3 HADDPD
    MOVHPD qword,xmm3
    ADDSD xmm3,qword
    MOVSD qword,xmm3
    ret 4
align 16
CosOneP dq 1.0, 0.0
align 16 ;Table of 1/4! | 1/2!, 1/8! | 1/6! ...
CosTableP dq 0.041666666666666666666666666666667, -0.5
        dq 2.4801587301587301587301587301587e-5, -0.0013888888888888888888888888888889
        dq 2.0876756987868098979210090321201e-9, -2.7557319223985890652557319223986e-7
        dq 4.7794773323873852974382074911175e-14, -1.1470745597729724713851697978682e-11
        dq 4.1103176233121648584779906184361e-19, -1.5619206968586226462216364350057e-16
        dq 1.6117375710961183490487133048012e-24, -8.8967913924505732867488974425025e-22
        dq 3.2798892370698379101520417273121e-30, -2.479596263224797460074943545848e-27


The above code is faster than the FPU FCOS opcode.
You can scale precision (and speed of execution) by
lessening the number of loop iterations (ie change 16*6 to 16*5 in two lines of code).

Last note, aligning the qword ptr that you pass to the function (and that receives the result), to 16bytes may improve speed slightly if the function is called A LOT.
Posted on 2005-07-23 15:46:05 by r22
It took a while, because I didn't know that a ! meant integer PRODUCT from 1*2*3..*Number

It's called a 'factorial' ;)
Posted on 2005-07-23 23:50:36 by ti_mo_n
Nice :) 
The aliasing when compiling the CosTableP data can cause a bit of more incorrection of the result, what about computing them on the first call of the func ?
Again, I note that actual benchmark results (and comparisons) are important to state.
Posted on 2005-07-24 21:54:18 by Ultrano
FPU function

CosFPU:
mov eax,
FLD qword
fcos
FSTP qword
retn 4


Only the 2nd to last 2 lines were changed in the function using Shuffle instead of a High Mov
Iterations changed from 6 to 5

align 16? ;Lou Ricci, SSE Cosine, for the Community
CosSSE8_2:? ;;parallelism
? ? ?mov eax, ;note: unrolling makes SSE slower on P4
? ? ?MOVSD xmm0,qword? ;\
? ? ?MOVHPD xmm0,qword ;SSE3 MOV-DUP
? ? ?MOVDQA xmm1,xmm0 ; HI? LO
? ? ?MULPD xmm0,xmm1 ; x^2 | x^2
? ? ?MULPD xmm1,xmm1 ; x^2 | x^2
? ? ?MULSD xmm0,xmm1 ; x^2 | x^4
? ? ?MULPD xmm1,xmm1 ; x^4 | x^4
? ? ?MOVDQA xmm3, dqword ; 1.0 | 0.0
? ? ?MOVDQA xmm2,xmm0 ;copy for scaling x^'s with xmm1
? ? ?mov ecx,-16*5 ;10 iterations (5*2)
? .LP:
? ? ?MULPD xmm0, dqword ;x^? / (?)!
? ? ?ADDPD xmm3,xmm0? ? ;1 - + - + -...
? ? ?MULPD xmm2,xmm1? ? ;x^? | x^? TO x^(?+4) | x^(?+4)
? ? ?MOVDQA xmm0,xmm2
? ? ?add ecx,16
? ? ?js .LP
? ? ?;add high qword and low qword of xmm3 for result
? ? ?;err no SSE3 horizontal add yet so, cheap work around
? ? ?;SSE3 HADDPD
? ? ?pshufd xmm1,xmm3,0eeh ;didn't think of it
? ? ?addsd xmm3,xmm1? ? ? ?;suggested by: revolution
? ? ?MOVSD qword,xmm3
? ? ?ret 4
align 16
CosOneP dq 1.0, 0.0
align 16 ;Table of 1/4! | 1/2!, 1/8! | 1/6! ...
CosTableP dq 0.041666666666666666666666666666667, -0.5
? ? ? ? ?dq 2.4801587301587301587301587301587e-5, -0.0013888888888888888888888888888889
? ? ? ? ?dq 2.0876756987868098979210090321201e-9, -2.7557319223985890652557319223986e-7
? ? ? ? ?dq 4.7794773323873852974382074911175e-14, -1.1470745597729724713851697978682e-11
? ? ? ? ?dq 4.1103176233121648584779906184361e-19, -1.5619206968586226462216364350057e-16
? ? ? ? ?dq 1.6117375710961183490487133048012e-24, -8.8967913924505732867488974425025e-22
? ? ? ? ?dq 3.2798892370698379101520417273121e-30, -2.479596263224797460074943545848e-27


Test Values
---------------------------------------------------------------
Pi/2: 1.5707963267948966192313216916398
Pi/6: 0.52359877559829887307710723054658

Precision Of Results
Cos(Pi/2): Cos(Pi/6)?
--------------------------------------------------------------
WinXP Calc.exe
0: 0.86602540378443864676372317075294
CosFPU (using printf api with a format string of %1.30f)
0.000000000000000061230317691119: 0.866025403784438600000000000000
CosSSE8_2 (using printf api with a format string of %1.30f)
0.000000000000000222044604925031: 0.866025403784438600000000000000

Benchmark speed results in (ms)
------------------------------------------------------------
CosFPU (called 00 FF FF FF H times)
1313
CosSSE8_2 (called 00 FF FF FF H times)
1015? ? ? ? ?~20% speed up

Benchmark was done on a P4 3.2ghz 1gbDDR WinXPsp2 box


20% speed increase, percision loss after the 15-16 decimal place

I'd like to see the GCC amd64 function for Cosine using SSE (if anyone has it at hand I'd like to see it)
Posted on 2005-07-24 23:01:18 by r22