I wrote my own version of the FFT algorithm, based on the "Numerical Recipes in C" book and on NaN's assembly implementation. In my application I separated the real and imaginary parts of the complex numbers, because I thought it would be much clearer and better organized. It also lets me run two real FFTs simultaneously this way.

The surprise came when I tested my FFT on 4 million values: it is extremely slow. It is approximately 2.7 times slower on my 2 GHz P4 than the C implementation in the book, which does NOT separate the real and imaginary parts.
I optimized the calculation, I use only three local variables, etc. Does the separation of the vectors really count so much? I think the many cache misses caused by the non-consecutive memory accesses slow down my algorithm. I did not think memory accesses could hurt the algorithm this much.
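For reference, here is roughly the difference between the two layouts in C (just a sketch, with a hypothetical Complex struct; not the book's actual code):

#define N (1 << 22)                      /* ~4 million values */

/* Split layout (my version): two separate arrays, so one butterfly
   touches four distant cache lines (two per array). */
double Real[N], Imag[N];

/* Interleaved layout (as in Numerical Recipes): the real and
   imaginary parts of element i sit next to each other, usually
   in the same cache line. */
typedef struct { double re, im; } Complex;
Complex data[N];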

How can the memory accesses be optimized in this case, when the real and imaginary parts are separated? Or what are your suggestions?
Thanks for your help and patience in advance.

Here is my implementation:


fpuFFT proc Real:DoubleVector, Imag:DoubleVector, N:DWORD
LOCAL mmax:DWORD
LOCAL wpr, wpi:REAL8

pusha ; Save all the registers

; Bit Reversal section
mov esi, [Real] ; Array with the Real parts
mov edi, [Imag] ; Array with the Imaginary parts
xor eax, eax ; i = 0
xor ebx, ebx ; j = 0
mov ecx, [N]
sub ecx, 1 ; N - 1
@1: cmp eax, ebx ; i < j ?
jae @F ; Jump if False
fld REAL8 ptr [esi+eax*8] ; Swap(Real[i], Real[j])
fld REAL8 ptr [esi+ebx*8]
fstp REAL8 ptr [esi+eax*8]
fstp REAL8 ptr [esi+ebx*8]
fld REAL8 ptr [edi+eax*8] ; Swap(Imag[i], Imag[j])
fld REAL8 ptr [edi+ebx*8]
fstp REAL8 ptr [edi+eax*8]
fstp REAL8 ptr [edi+ebx*8]
@@: mov edx, [N]
shr edx, 1 ; m = N / 2
@2: jz @F ; Exit when m == 0 (ZF comes from the shr above)
cmp ebx, edx ; j >= m ?
jb @F ; Jump if False
sub ebx, edx ; j -= m
shr edx,1 ; m >>= 1
jmp @2 ; Next m
@@: add ebx, edx ; j += m
add eax, 1 ; i++
sub ecx, 1 ; Decrement counter
jnz @1 ; Next i

; Danielson-Lanczos section
fldz ; wr
xor ecx, ecx
fldz ; wi
inc ecx ; mmax = 1
@3: cmp [N], ecx ; N > mmax ?
jbe @6 ; Jump if False
fcompp ; Free wr and wi
fld1 ; wr = 1.0
fldz ; wi = 0.0
mov [mmax], ecx
fldpi
fidiv DWORD ptr [mmax] ; theta = pi / mmax
shl ecx, 1 ; istep = 2 * mmax
fmul [Half] ; 0.5 * theta (Half is a REAL8 constant = 0.5)
fsincos ; ST0 = cos(0.5*theta), ST1 = sin(0.5*theta) = wtemp
fmul st, st(1) ; cos(0.5*theta) * sin(0.5*theta)
fadd st, st ; = sin(theta)
fstp [wpi]
fmul st, st ; wtemp * wtemp
fadd st, st ; 2 * wtemp * wtemp
fchs ; -2 * wtemp * wtemp
xor edx, edx ; m = 0
fstp [wpr]
@4: cmp edx, [mmax] ; m < mmax ?
jae @3
mov eax, edx ; i = m
@5: cmp eax, [N] ; i < N ?
jae @F ; Jump if False
mov ebx, eax
add ebx, [mmax] ; j = i + mmax
fld REAL8 ptr [esi+ebx*8] ; Real[j]
fld REAL8 ptr [edi+ebx*8] ; Imag[j]
fld st(2) ; wi
fmul st, st(1) ; wi * Imag[j]
fld st(4) ; wr
fmul st, st(3) ; wr * Real[j]
fsubrp st(1), st ; tempr
fld st(3) ; wi
fmul st, st(3) ; wi * Real[j]
fld st(5) ; wr
fmul st, st(3) ; wr * Imag[j]
faddp st(1), st ; tempi
fld REAL8 ptr [edi+eax*8] ; Imag[i]
fld st
fadd st, st(2) ; Imag[i] + tempi
fstp REAL8 ptr [edi+eax*8] ; New Imag[i]
fsubrp st(1), st ; Imag[i] - tempi
fstp REAL8 ptr [edi+ebx*8] ; new Imag[j]
fld REAL8 ptr [esi+eax*8] ; Real[i]
fld st
fadd st, st(2) ; Real[i] + tempr
fstp REAL8 ptr [esi+eax*8] ; New Real[i]
fsubrp st(1), st ; Real[i] - tempr
fstp REAL8 ptr [esi+ebx*8] ; New Real[j]
add eax, ecx ; i += istep
fcompp ; Free Real[j] and Imag[j]
jmp @5
@@: fld st ; wi
fmul [wpi] ; wi * wpi
fld st(2) ; wr
fmul [wpr] ; wr * wpr
fsubrp st(1), st
fld st(2) ; wr
fmul [wpi] ; wr * wpi
fld st(2) ; wi
fmul [wpr] ; wi * wpr
faddp st(1), st
faddp st(2), st ; New wi
inc edx ; m++
faddp st(2), st ; New wr
jmp @4
@6: fcompp ; Free wr and wi

popa ; Load all registers
ret

fpuFFT endp
Posted on 2004-05-12 09:41:01 by bszente
Couple things come to mind right away:

Does the C version use REAL8s?

Can you do the calculations without actually physically swapping values?

Maybe wpr, wpi can be kept on the FPU stack?
Posted on 2004-05-12 10:18:48 by bitRAKE
Thanks bitRAKE for your interest.

Yes, the C version uses double precision. And the bit-reversal section is necessary. But the swapping part is not the critical one; the FFT itself is. And keeping wpr and wpi on the FPU stack is not critical either. I disassembled the C code, and it uses 10-12 local variables. So my version is much more optimized from this point of view.

The inner loop of the FFT is the critical part, because the indices advance in big steps. As a consequence, the elements are almost always out of the cache, and this slows down the memory accesses.
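In C terms, the inner loop looks roughly like this (a sketch using my variable names, with a hypothetical dl_stage function; not the exact Numerical Recipes code):

/* One Danielson-Lanczos stage over the split arrays. */
void dl_stage(double *Real, double *Imag, int n, int mmax,
              double wr, double wi, double wpr, double wpi)
{
    int istep = 2 * mmax;
    for (int m = 0; m < mmax; m++) {
        for (int i = m; i < n; i += istep) {
            int j = i + mmax;            /* partner is mmax elements away */
            double tempr = wr * Real[j] - wi * Imag[j];
            double tempi = wr * Imag[j] + wi * Real[j];
            Real[j] = Real[i] - tempr;   /* four separate access streams:      */
            Imag[j] = Imag[i] - tempi;   /* Real[i], Real[j], Imag[i], Imag[j] */
            Real[i] += tempr;
            Imag[i] += tempi;
        }
        double wtemp = wr;               /* trigonometric recurrence for the twiddle */
        wr += wr * wpr - wi * wpi;
        wi += wi * wpr + wtemp * wpi;
    }
}

For large mmax, every butterfly touches four locations that are far apart in memory, so each one can cost up to four cache misses.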
Posted on 2004-05-12 15:35:03 by bszente
Forgive my ignorance of FFTs. Are the stored values read again within the loop? If not, much speed can be gained.

Speaking of the P4, non-temporal stores would bypass the cache, but you would have to use SSE2 and it might be slower if the values are needed again without sufficient delay.
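With intrinsics, the idea looks something like this (a sketch, assuming SSE2; the function name is made up; in assembly it would be movntpd):

#include <emmintrin.h>   /* SSE2 intrinsics */

/* Hypothetical sketch: write two doubles straight to memory,
   bypassing the cache. dst must be 16-byte aligned, and this only
   pays off if the stored values are not reloaded soon afterwards. */
static void store_pair_nt(double *dst, double re_val, double im_val)
{
    _mm_stream_pd(dst, _mm_set_pd(im_val, re_val));
}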
Posted on 2004-05-12 17:21:25 by bitRAKE
Hi bszente,

You could use a table of sines & cosines instead of computing them during the function (fsincos is costly!).
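Something like this rough sketch (make_twiddles and the indexing scheme are just one possibility):

#include <math.h>

/* Precompute the twiddle factors once, instead of calling
   fsincos (or sin/cos) inside the transform. */
void make_twiddles(double *wr, double *wi, int n)
{
    const double pi = 3.14159265358979323846;
    for (int k = 0; k < n / 2; k++) {
        wr[k] = cos(2.0 * pi * k / n);
        wi[k] = sin(2.0 * pi * k / n);   /* sign depends on transform direction */
    }
}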
Posted on 2004-05-13 10:55:55 by John Kiro
Oops.. I thought that fsincos was used in a heavy loop, then I realized that the twiddles are computed recursively (iterative approximation?).. Anyway, storing the twiddles in a table could improve performance, but of course this isn't the reason that the C version is faster :confused:

A little thing you can do is to give the FPU computation some time to finish before the FSTP (FXCH may help organize the computation).

A rough test of your function on my 766 MHz Celeron showed that it takes about 50,000 clocks for N=256. Can you post your test results too?
Posted on 2004-05-14 14:09:40 by John Kiro

You might want to take Kiro's advice above; the P4 isn't good with FPU instructions.
Posted on 2004-05-14 15:16:50 by x86asm
Hi, all of you.

I am glad for your interest in this topic. I couldn't answer quickly, because I was out of town.

My results for N=256 are around 65,000 clocks, which is pretty weird. It shows how badly unoptimized code performs on the P4.
If I put an FXCH ST before the FSTP instructions in the inner loop, the cycle count decreases to ~60,000. I should definitely rearrange and rethink the implementation.

Another solution would be to write it using SSE instructions, as bitRAKE said, especially because SSE also has cache-management versions of some instructions.
Posted on 2004-05-17 05:39:33 by bszente
bszente, it may not be fast, but I think I might finally understand what is happening in an FFT algorithm, thanks to your easy-to-read code. :)
Posted on 2004-05-17 09:16:34 by bitRAKE
This site has some text about in-place vs. out-of-place FFT - the author claims his FFT method is a lot faster than in-place FFT'ing. Dunno if it's of any help, but here goes:
http://www.gamedev.net/reference/articles/article1349.asp
Posted on 2004-05-17 12:05:02 by f0dder
I have been involved with FFTs for a while, mostly as related to large-number multiplication. Among people into FFTs, the version in Numerical Recipes in C is well known as one of the slowest implementations. It is easy to learn from, though. I got into it because I like calculating pi to large numbers of digits. http://pages.istar.ca/~lyster/chart.html There is a good discussion on an associated Yahoo forum about pi calculation, which has a LOT of information on writing fast FFTs. They joke about the slow Numerical Recipes version.

http://groups.yahoo.com/group/pi-hacks/
Posted on 2004-05-17 12:48:14 by mark_larson
I am extraordinarily glad to see you interested in this topic.

bitRAKE, I'm happy to hear that my implementation is not so useless after all :)

Thanks f0dder and Mark Larson for your helpful links; the pages contain exactly what I needed. I'm working on an image-processing project, which is why I need a very, very fast two-dimensional FFT. I also believe that x86-based PCs are powerful enough for such calculations, if they are properly programmed.
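For the 2-D case I have something like this in mind (a sketch; fft1d stands for any hypothetical 1-D transform that accepts a stride):

/* A 2-D FFT as 1-D passes over the rows, then over the columns. */
void fft1d(double *re, double *im, int n, int stride);   /* assumed to exist */

void fft2d(double *re, double *im, int w, int h)
{
    for (int y = 0; y < h; y++)               /* rows: unit stride */
        fft1d(re + y * w, im + y * w, w, 1);
    for (int x = 0; x < w; x++)               /* columns: stride w, cache-hostile */
        fft1d(re + x, im + x, h, w);
}

Transposing between the two passes (so that columns become rows) is the usual trick to keep the second pass cache-friendly.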

Thanks again.
Posted on 2004-05-18 07:13:03 by bszente

There is one other place I forgot to mention: FFTW. It has been around for years, has freely downloadable source, and tends to be near the front of the pack as far as speed goes.


www.fftw.org
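Basic usage of their FFTW 3 API looks roughly like this (a sketch from memory; check their docs for the details):

#include <fftw3.h>   /* link with -lfftw3 */

void transform(int n)
{
    fftw_complex *buf = fftw_malloc(sizeof(fftw_complex) * n);
    fftw_plan plan = fftw_plan_dft_1d(n, buf, buf,        /* in-place */
                                      FFTW_FORWARD, FFTW_MEASURE);
    /* ... fill buf with data, then: */
    fftw_execute(plan);
    fftw_destroy_plan(plan);
    fftw_free(buf);
}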
Posted on 2004-05-18 09:21:12 by mark_larson
I recommend having a look at George Woltman's Prime95 code (http://www.mersenne.org/gimps/source23.zip). The files called x***.mac contain a lot of macros for SSE2 - it's so far the fastest FFT for the P3/K7/K8 and P4. FFTW reaches about 50% of that speed.

The reason is not faster butterfly computations, but the memory layout and caching strategies used. The programmer also has to take care of TLBs. Just imagine: you have data from different memory locations in the cache (as often happens with FFTs), but you can't access it at the known cache latencies (e.g. 3 cycles for the K8 L1 cache), because there are not enough L1 TLB entries. It's difficult to circumvent this. An easier way is to use huge pages (available in newer Linux kernels and Windows Server 2003). According to a scientific paper, FFTW got a speedup of 48% using huge pages.
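One simple illustration of the layout side (a hypothetical padding trick, not Prime95's actual scheme):

/* With a power-of-two row length, the column pass of a 2-D FFT keeps
   hitting the same cache sets and only a few TLB entries. Padding
   each row by a few elements spreads the accesses across sets. */
enum { N = 1024, PAD = 8 };
static double re[N][N + PAD];
static double im[N][N + PAD];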
Posted on 2004-05-26 02:39:29 by Dresdenboy