Here is part of the FPU code for my audio mixer. It works fine: mixing 32 channels takes about 1% CPU on my AMD Athlon 900MHz. But I feel other CPUs may crumble under the pressure of the FPU code. I tried to interleave the integer and FPU code to get optimum CPU/FPU usage. Can anyone tell me whether I could do better? Thanks

dec DWORD PTR [ecx+18]             ;Decrement step 

jnz resamstatic ;Output into MixStream last sample
mov eax,[ecx+10]
mov [ecx+18],eax
mov esi,[ecx+38] ;Fetch PTR to Sample Data
fild WORD PTR [ecx+2] ;Load Left volume
movsx eax,WORD PTR [esi] ;Fetch left sample
mov [ecx+54],eax ;Store samples for simple resampling algorithm
mov templsam,eax
fld sammulfactor
movsx ebx,WORD PTR [esi+4] ;Fetch Right sample
fmulp st(1),st(0) ;Compute left sample multiply factor, ans is in ST(0)
mov temprsam,ebx
fld sammulfactor
mov [ecx+50],ebx
fild WORD PTR [ecx+4] ;Load Right volume
fmulp st(1),st(0) ;ST(0)==Right, ST(1)==Left
fild templsam
fild temprsam
;ST(0)=Right, ST(1)=Left, ST(2)=Right multiply, ST(3)=Left multiply
fmul st(0),st(2) ;Amplify right sample accordingly
fxch
fmul st(0),st(3) ;Amplify left sample accordingly
fistp templsam
fistp temprsam ;Mix into output stream
mov eax,templsam
fcompp
mov ebx,temprsam
add [edi],eax
add [edi+4],ebx

add DWORD PTR [ecx+38],4 ;Skip current set of samples and proceed to next
mov edx,[ecx+38]
.IF EDX >= [ecx+30] ;Have we passed or met loop restart position?
movzx eax,WORD PTR [ecx]
test eax,2
.IF ZERO? ;No restart
push ecx
mov esi,ecx
mov ecx,SIZEOF MixCHN
shr ecx,2
xor eax,eax
.WHILE ECX!=0
mov [esi],eax
add esi,4 ;Advance to the next DWORD of the channel structure
dec ecx
.endw
pop ecx
jmp nextchannel
.ELSE ;Restart
sub edx,[ecx+34]
mov [ecx+38],edx
.endif
.endif
jmp nextchannel

Posted on 2002-12-12 20:24:40 by x86asm
When I did signed integer multiplication, the usage always stayed at 0% and rarely went to 1%. Now, with the FPU code, it's around 0~3%. The FPU code gives more precise and wider audio volume ranges, so it's more accurate, but is the jump in CPU usage justifiable in your opinion?

Answer the question if you have the time and the energy :) to do so. If you can, give me suggestions to speed up the FPU code (I already looked at Agner's HLP file). I already used one trick from the Intel/AMD optimization manuals, which is to use FCOMPP to free up two FPU registers.
Posted on 2002-12-12 20:34:34 by x86asm
I bet bitRAKE will come up with some MMX code which no one understands, but it's only half the number of your lines and 10 times faster :)
Posted on 2002-12-13 08:19:48 by bazik
I'd have to see more code, but first thought is to keep more data on the FPU stack.
fistp templsam
fistp temprsam ;Mix into output stream
mov eax,templsam
mov ebx,temprsam
add [edi],eax
add [edi+4],ebx
Can these be generated on the stack without the memory accesses until the data is stored? Seems like too many forward dependencies in the rest of the code. Are sammulfactor and the volume changing each sample?

Okay, I see now that you're processing several samples into a single output - not just refactoring a single sample. Keeping everything in registers is not an option unless the number of samples is very small. :)
Posted on 2002-12-13 08:51:16 by bitRAKE
It's not being changed on every sample passing through, but I want it to be changeable on the fly so that the mixer takes the new value into account. How would I get rid of forward dependencies? I think I have another method of mixing using integers that I want to try, but even then it will be less accurate than the FPU one I have made.

But I think I can reduce the CPU usage. Just to verify: does the IMUL instruction store its result in EDX:EAX or just EAX?
Posted on 2002-12-13 16:00:34 by x86asm

But I think I can reduce the CPU usage. Just to verify: does the IMUL instruction store its result in EDX:EAX or just EAX?
Depends on the number of operands. Both. :)
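
To spell that out, here is a minimal sketch of the three forms. The variables sample and volume are hypothetical DWORDs for illustration, not names from your mixer:

```asm
.data
sample  SDWORD -1200        ; hypothetical signed sample value
volume  SDWORD 48           ; hypothetical volume factor

.code
; One-operand form: EDX:EAX = EAX * operand (full 64-bit signed product)
    mov  eax, sample
    imul volume

; Two-operand form: EAX = EAX * operand, truncated to the low 32 bits
    mov  eax, sample
    imul eax, volume        ; EDX is not modified

; Three-operand form: EAX = operand * immediate, truncated to the low 32 bits
    imul eax, sample, 3
```

A 16-bit sample times a 16-bit volume always fits in 32 bits, so for that case the two-operand form should be enough and it leaves EDX free.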

Well, given the number of times this loop would execute each second - it would be okay for the multiplier to be constant for the duration of the loop - the user will not be able to tell the difference.
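
A rough sketch of what I mean, not a drop-in replacement. It assumes one channel is mixed for a run of samples at a time, ESI points at interleaved 16-bit stereo input (right sample 2 bytes after the left, which may not match your layout), EDI points at the 32-bit stereo mix buffer, and EDX holds a nonzero count of sample pairs. The label mixloop and those register assignments are my guesses; sammulfactor, templsam and temprsam are your variables, and I'm assuming sammulfactor is a REAL4 or REAL8:

```asm
; Compute both multiply factors once, before the loop
    fild  WORD PTR [ecx+4]      ; right volume
    fmul  sammulfactor          ; ST(0) = right factor
    fild  WORD PTR [ecx+2]      ; left volume
    fmul  sammulfactor          ; ST(0) = left factor, ST(1) = right factor
mixloop:
    fild  WORD PTR [esi]        ; left sample; factors move to ST(1), ST(2)
    fmul  st(0), st(1)          ; amplify left sample
    fistp templsam              ; store and pop; factors back in ST(0), ST(1)
    fild  WORD PTR [esi+2]      ; right sample (assumed offset)
    fmul  st(0), st(2)          ; amplify right sample
    fistp temprsam
    mov   eax, templsam
    mov   ebx, temprsam
    add   [edi], eax            ; mix into output stream
    add   [edi+4], ebx
    add   esi, 4                ; next input sample pair
    add   edi, 8                ; next output sample pair
    dec   edx
    jnz   mixloop
    fcompp                      ; pop both factors off the FPU stack
```

The two factors stay on the FPU stack for the whole run, so each sample costs one load, one multiply and one store instead of recomputing volume * sammulfactor every time.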

Does this do up and down sampling? It is confusing to read the code with the constants - I'm not sure what is being done. I would use structures to outline the data.
Sample STRUCT
    left dw ?
    right dw ?
Sample ENDS

Channel STRUCT
    samples dd ?
    delta dd ? ; offset
    ; ...etc...
Channel ENDS
IMHO, this makes the code somewhat self-documenting and changes are easier. I'm trying to think of what the ideal loop would be, but it is difficult when I don't know what is going on exactly. Even providing both methods FPU and MMX/Integer would be good, imho.
Posted on 2002-12-13 16:35:52 by bitRAKE


I've made this do downsampling only: it mixes at 44100Hz and will go to 22050 or 11025. I've simplified it here; I've written different handlers for the module channels, and they can do both upsampling and downsampling of arbitrary frequencies. So how do I access structures? Would I do this?


mov eax,Channel[edi].lVol

Is that right? I do have a structure defined, but I didn't know there was a way to access it like you say.
Posted on 2002-12-13 17:55:47 by x86asm
mov eax, [edi].Channel.lVol ; :)
This works well, or you could use ASSUME.
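
A small sketch of both ways, using a made-up Channel layout (the field names and their order are assumptions; use your own MixCHN definition). Either way the field name is just a constant displacement resolved at assembly time, so the encoding is the same as writing the offset by hand:

```asm
Channel STRUCT
    flags   dw ?            ; offset 0
    lVol    dw ?            ; offset 2
    rVol    dw ?            ; offset 4
Channel ENDS

; Explicit structure qualifier on each access
    movzx eax, [edi].Channel.lVol       ; same as movzx eax, WORD PTR [edi+2]

; Or tell MASM once what EDI points at
    ASSUME edi:PTR Channel
    movzx eax, [edi].lVol
    movzx ebx, [edi].rVol               ; same as movzx ebx, WORD PTR [edi+4]
    ASSUME edi:NOTHING                  ; cancel the assumption when done
```

ASSUME saves typing when one register points at the same structure for a long stretch of code; the explicit qualifier is safer when the register gets reused for other things.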
Posted on 2002-12-13 19:25:55 by bitRAKE
Will it generate the same code I had before using the offsets? Thanks for your help, bitRAKE, I really appreciate it.
Posted on 2002-12-13 19:27:03 by x86asm