I wrote my own version of the FFT algorithm, based on the "*Numerical Recipes in C*" book and on **NaN**'s assembly implementation. In my application I separated the Real and Imaginary parts of the complex numbers, because I thought it would be clearer and better organized. It also lets me perform two real FFTs simultaneously.

The surprise came when I tested my FFT for 4 million values: it is **extremely slow**. It is approximately 2.7 times slower on my P4 2GHz than the C implementation in the book, where the Real and Imaginary parts are NOT separated.

I optimized the calculation, I use only 3 local variables, etc. Does the separation of the vectors count so much? I think the many cache misses due to the non-consecutive memory accesses slow down my algorithm. I did not think memory accesses could hurt the algorithm this much.

How can the memory accesses be optimized in this case, when the Real and Imaginary parts are separated? Or what are your suggestions?

Thanks in advance for your help and patience.

Here is my implementation:

```
fpuFFT proc Real:DoubleVector, Imag:DoubleVector, N:DWORD
LOCAL mmax:DWORD
LOCAL wpr, wpi:REAL8
pusha ; Save all the registers

; Bit Reversal section
mov esi, [Real] ; Array with the Real parts
mov edi, [Imag] ; Array with the Imaginary parts
xor eax, eax ; i = 0
xor ebx, ebx ; j = 0
mov ecx, [N]
sub ecx, 1 ; N - 1
@1: cmp eax, ebx ; i < j ?
jae @F ; Jump if False
fld REAL8 ptr [esi+eax*8] ; Swap(Real[i], Real[j])
fld REAL8 ptr [esi+ebx*8]
fstp REAL8 ptr [esi+eax*8]
fstp REAL8 ptr [esi+ebx*8]
fld REAL8 ptr [edi+eax*8] ; Swap(Imag[i], Imag[j])
fld REAL8 ptr [edi+ebx*8]
fstp REAL8 ptr [edi+eax*8]
fstp REAL8 ptr [edi+ebx*8]
@@: mov edx, [N]
shr edx, 1 ; m = N / 2
@2: jz @F ; Jump if False
cmp ebx, edx ; j >= m ?
jb @F ; Jump if False
sub ebx, edx ; j -= m
shr edx, 1 ; m >>= 1
jmp @2 ; Next m
@@: add ebx, edx ; j += m
add eax, 1 ; i++
sub ecx, 1 ; Decrement counter
jnz @1 ; Next i

; Danielson-Lanczos section
fldz ; wr
xor ecx, ecx
fldz ; wi
inc ecx ; mmax = 1
@3: cmp [N], ecx ; N > mmax ?
jbe @6 ; Jump if False
fcompp ; Free wr and wi
fld1 ; wr = 1.0
fldz ; wi = 0.0
mov [mmax], ecx
fldpi
fidiv DWORD ptr [mmax] ; theta = pi / mmax
shl ecx, 1 ; istep
fmul [Half] ; 0.5 * theta
fsincos ; sin(0.5 * theta), cos(0.5 * theta)
fmul st, st(1)
fadd st, st ; sin(theta)
fstp [wpi]
fmul st, st ; wtemp * wtemp
fadd st, st ; 2 * wtemp * wtemp
fchs ; -2 * wtemp * wtemp
xor edx, edx ; m = 0
fstp [wpr]
@4: cmp edx, [mmax] ; m < mmax ?
jae @3
mov eax, edx ; i = m
@5: cmp eax, [N] ; i < N ?
jae @F ; Jump if False
mov ebx, eax
add ebx, [mmax] ; j = i + mmax
fld REAL8 ptr [esi+ebx*8] ; Real[j]
fld REAL8 ptr [edi+ebx*8] ; Imag[j]
fld st(2) ; wi
fmul st, st(1) ; wi * Imag[j]
fld st(4) ; wr
fmul st, st(3) ; wr * Real[j]
fsubrp st(1), st ; tempr
fld st(3) ; wi
fmul st, st(3) ; wi * Real[j]
fld st(5) ; wr
fmul st, st(3) ; wr * Imag[j]
faddp st(1), st ; tempi
fld REAL8 ptr [edi+eax*8] ; Imag[i]
fld st
fadd st, st(2) ; Imag[i] + tempi
fstp REAL8 ptr [edi+eax*8] ; New Imag[i]
fsubrp st(1), st ; Imag[i] - tempi
fstp REAL8 ptr [edi+ebx*8] ; New Imag[j]
fld REAL8 ptr [esi+eax*8] ; Real[i]
fld st
fadd st, st(2) ; Real[i] + tempr
fstp REAL8 ptr [esi+eax*8] ; New Real[i]
fsubrp st(1), st ; Real[i] - tempr
fstp REAL8 ptr [esi+ebx*8] ; New Real[j]
add eax, ecx ; i += istep
fcompp ; Free Real[j] and Imag[j]
jmp @5
@@: fld st ; wi
fmul [wpi] ; wi * wpi
fld st(2) ; wr
fmul [wpr] ; wr * wpr
fsubrp st(1), st
fld st(2) ; wr
fmul [wpi] ; wr * wpi
fld st(2) ; wi
fmul [wpr] ; wi * wpr
faddp st(1), st
faddp st(2), st ; New wi
inc edx ; m++
faddp st(2), st ; New wr
jmp @4
@6: fcompp ; Free wr and wi
popa ; Load all registers
ret
fpuFFT endp
```

A couple of things come to mind right away:

Does the C version use REAL8's?

Can you do the calculations without actually physically swapping values?

Maybe wpr, wpi can be kept on the FPU stack?

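As a side note on the bit-reversal question: the j index in bszente's code is maintained incrementally rather than recomputed from scratch each iteration. Isolated in C, the update looks something like this sketch, which just tabulates the permutation:

```c
/* Tabulate the bit-reversal permutation of 0..n-1 (n a power of two)
   using the incremental "carry from the top" update of j, the same
   scheme the assembly's @2 loop implements. */
void bit_reverse_perm(int *perm, int n)
{
    int j = 0;
    for (int i = 0; i < n; i++) {
        perm[i] = j;
        int m = n >> 1;
        while (m >= 1 && j >= m) { j -= m; m >>= 1; }
        j += m;
    }
}
```

With a table like this, elements could also be gathered through `perm[]` instead of physically swapped, at the price of an extra indirection per access.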

Thanks bitRAKE for your interest.

Yes, the C version uses double precision. And the bit-reversal section is necessary. But the swapping part is not the critical one; the FFT itself is. And keeping wpr and wpi on the FPU stack is not critical either. I disassembled the C code, and it uses 10-12 local variables, so my version is much more optimized from this point of view.

The inner loop of the FFT is critical, because the indices change in big steps. As a consequence the elements will almost always be out of the cache, and this slows down the memory accesses.


Forgive my ignorance of FFTs. Are the stored values read again within the loop? If not, much speed can be gained.

Speaking of the P4, non-temporal stores would bypass the cache, but you would have to use SSE2 and it might be slower if the values are needed again without sufficient delay.

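For reference, SSE2's non-temporal stores (MOVNTPD) are exposed in C through compiler intrinsics. This hypothetical copy routine sketches the idea; as noted above, it only pays off when the stored values are not re-read soon, because the data bypasses the cache:

```c
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Copy n doubles (n even) using non-temporal stores.
   dst is assumed 16-byte aligned; src may be unaligned. */
void copy_nt(double *dst, const double *src, int n)
{
    for (int i = 0; i < n; i += 2) {
        __m128d v = _mm_loadu_pd(src + i);
        _mm_stream_pd(dst + i, v);   /* MOVNTPD: store bypassing the cache */
    }
    _mm_sfence();                    /* make the streamed stores ordered/visible */
}
```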

Hi bszente,

You could use a table of sines & cosines instead of computing them during the function (fsincos is costly!).


Oops.. I thought that fsincos was used in a heavy loop, then I realized that the twiddles are computed recursively. Anyway, storing the twiddles in a table could still improve performance, but of course this isn't the reason that the C version is faster :confused:

A little thing you can do is to give the FPU computation some time before the FSTP (FXCH may help organize the computation).

A rough test of your function on my Celeron 766MHz showed that it takes about 50,000 clocks for N=256. Can you post your test results too?


You might want to take Kiro's advice; the P4 isn't good with FPU instructions.

Hi, all of you.

I am glad for your interest in this topic. I couldn't answer quickly, because I was out of town.

My results for N=256 are around 65,000 clocks, which is pretty weird. This proves that unoptimized code gives even worse results on the P4.

If I put an **FXCH ST** before the *FST* instruction in the inner loop, the number of cycles decreases to ~60,000. I should definitely rearrange and rethink the implementation.

Another solution would be to write it using the SSE instructions, as bitRAKE said, especially because SSE also has cache-management versions of its instructions.

**bszente**, it may not be fast, but I think I might finally understand what is happening in an FFT algorithm thanks to your easy-to-read code. :)

This site has some text about in-place vs. out-of-place FFT - the author claims his FFT method is a lot faster than in-place FFT'ing. Dunno if it's of any help, but here goes:

http://www.gamedev.net/reference/articles/article1349.asp


I have been involved with FFTs for a while, mostly as related to large-number multiplication. Among people into FFTs, the version in *Numerical Recipes in C* is well known as one of the slowest implementations; it is easy to learn from, though. I got into it because I like to calculate pi to large numbers of digits: http://pages.istar.ca/~lyster/chart.html There is good discussion on an associated Yahoo forum about pi calculation which has a LOT of information on writing fast FFTs. They joke about the slower Numerical Recipes version.

http://groups.yahoo.com/group/pi-hacks/


I am extraordinarily glad to see you interested in this topic.

**bitRAKE**, I'm happy to hear that my implementation is not so useless after all :) Thanks.

Thanks, **f0dder** and **Mark Larson**, for your helpful links; the pages contain exactly what I needed. I'm working on an image processing project, which is why I need a very, very fast 2-dimensional FFT. And I also consider that x86-based PCs are strong enough to make such calculations, if they are properly programmed.

Thanks again.

There is one other place I forgot to mention. It has been around for years and has freely downloadable source. It is called FFTW, and it tends to be near the front of the pack as far as speed goes.

www.fftw.org

I recommend having a look into George Woltman's Prime95 code (http://www.mersenne.org/gimps/source23.zip). In the files called x***.mac there are a lot of macros for SSE2; it's so far the fastest FFT for P3/K7/K8 and P4. FFTW reaches about 50% of that speed.

The reason is not faster butterfly computations, but the memory layout and caching strategies used. The programmer also has to take care of TLBs. Just imagine: you have data from different memory locations in the cache (as often happens with FFTs), but you can't access it at the known cache latencies (e.g. 3 cycles for the K8 L1 cache) because there are not enough L1 TLB entries. It's difficult to circumvent this. An easier way is to use huge pages (available in newer Linux kernels and Windows Server 2003). According to a scientific paper, FFTW got a speedup of 48% using huge pages.

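On the huge-pages point: under Linux this is done by backing the working buffers with MAP_HUGETLB mappings. A sketch with a fallback, since MAP_HUGETLB fails unless the administrator has reserved huge pages (the 2 MB request size in the usage below is an assumption about the page size):

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stddef.h>

/* Try to allocate `bytes` backed by huge pages to cut TLB misses;
   fall back to normal pages if none are reserved on the system. */
void *alloc_maybe_huge(size_t bytes)
{
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED)   /* no huge pages configured: plain anonymous mapping */
        p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

For example, `alloc_maybe_huge((size_t)1 << 21)` asks for one 2 MB region; whether it actually lands on huge pages depends on the system configuration.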