Can someone confirm this for me:
- when using the FPU load instructions, e.g. 'fild', can you load st(0) with either an immediate value or a value from a register, like this:
[size=12]
fild 0FFh
fild esi
[/size]
- when popping values back off the FPU, can you pop them directly into a register, like this:
[size=12]
fistp eax
[/size]
My tests show that the above are not possible, but there is always the chance that I have missed a trick...
You can't use immediate values with the coprocessor, only operands in memory or its own registers (st(*)).
Since the coprocessor is logically separate from the main processor, it can't "see" the main processor's registers.
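For the immediate case, the usual workaround is to keep the constant in memory and load it from there. A minimal sketch, assuming MASM-style syntax like the other snippets in this thread; the names `const255` and `result` are made up for illustration, and this is a fragment rather than a complete program:

```asm
.data
const255  dd 0FFh                 ; hypothetical name: the "immediate" lives in memory
result    dd ?                    ; hypothetical name: scratch dword for the result

.code
    fild  dword ptr [const255]    ; st(0) = 255.0, loaded from memory
    ; more fpu ops
    fistp dword ptr [result]      ; store st(0) as an integer and pop the FPU stack
    mov   eax, [result]           ; only now can the value reach a CPU register
```

The value always makes a round trip through memory; there is no form of fild or fistp that takes an immediate or a general-purpose register.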
Yeah, I thought something like that was the case. I was just doing some quick integer math, and found it annoying that I had to declare a couple of temporary dword variables just to transfer values between the FPU and the normal registers.
Sluggy, this kind of code takes an integer in eax, does some fpu, and gets the result back in eax with relative ease:
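push eax                ; put the integer on the stack
fild dword ptr [esp]    ; load it into st(0) from the stack slot
; more fpu ops
fistp dword ptr [esp]   ; store the integer result back into the same slot
pop eax                 ; result is now in eax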
You might be interested in my FPU tutorial. It sits at
antipasta.topcities.com/fputut.txt
Feedback appreciated
There *is* one instruction which allows access to the CPU registers: FSTSW. You can write the status word to AX. IIRC it is intended to be used with SAHF, so that you can use the jxx series (which are normally for integer comparisons) on float comparisons.
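A rough sketch of that FSTSW/SAHF pattern, assuming MASM-style syntax. The variables `a` and `b` (dword floats in memory) and the labels `unordered` and `a_is_greater` are hypothetical, and this is a fragment, not a complete program:

```asm
    fld   dword ptr [a]     ; st(0) = a
    fcomp dword ptr [b]     ; compare st(0) with b, then pop the FPU stack
    fstsw ax                ; copy the FPU status word into AX
    sahf                    ; AH -> flags: C0 -> CF, C2 -> PF, C3 -> ZF
    jp    unordered         ; PF set: the operands were unordered (a NaN)
    ja    a_is_greater      ; CF = 0 and ZF = 0: a > b
    ; execution falls through to here when a <= b
```

Because C0 lands in CF and C3 lands in ZF, it is the unsigned-style jumps (ja, jae, jb, jbe, je) that give the right answer after SAHF, not the signed ones (jg, jl).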
Good tut, AntiPasta. I'll take a look at it myself. I'm no veteran at FPU coding, but I can do it, and this will certainly enhance my skills. Just one question: what do you mean by FXCH (the FPU exchange instruction) using ZERO clock cycles when paired correctly?
Thanks!
what do you mean by the FXCH (FPU exchange instruction) using ZERO clock cycles when paired correctly?
I'm not the author of the tutorial, but... :)
That means fxch is implemented as register renaming at the lowest level. It still takes decoding time, but you don't incur execution time. AFAIK, fxch on P5 is pairable with most FPU instructions. On P6, it costs you nothing beyond the decoding time.
Be careful not to abuse this feature. Decoding time may be longer than you might expect. In my experience, loading a bunch of values and using fxch to avoid latency can be slower than sequential processing. Another example of this is the drastic performance loss in C code compiled with gcc 3.x compared to the code generated by previous versions of gcc.
Thanks starless.!! Really needed the help :D
So what instructions should I avoid pairing FXCH with? Should I avoid pairing it with instructions that use both operands of the FXCH instruction?
So what instructions should I avoid pairing FXCH with?
I was wrong about saying 'most of FPU instructions'. Darn, my memory decay parameter is so large! :( Checking Agner's note gives me the following list of instructions pairable with fxch:
fld, fadd, fsub, fsubr, fmul, fdiv, fdivr, fchs, fabs, fcom, fucom, and the associated 'pop stack' versions of them.
Remember, this is for P5. P6 does not have the concept of 'pairability'. If your target is not P5, don't worry about pairability.
And, here is a fishing rod: Get Agner's optimization note. You will find yourself reading it over and over again soon. :)
x86asm, I think Starless means:
fld xxx
fld xxy
fld xxz
...
fxch ?
...
fxch ?
...
fxch ?
versus
fld xxx
...
fld xxy
...
fld xxz
...
...
In the last paragraph he is saying the second method uses less decode bandwidth, and hence performs better where decode bandwidth is the bottleneck.
Exactly, bitRAKE. :)