Hi all!
I'm writing a fractal plotting program in Delphi that draws a Mandelbrot set on the screen. I'm now rewriting the most performance-critical parts in assembly, and so far it's going fine. I've reached the part that converts a screen coordinate to a complex coordinate. I used to do it with this piece of Pascal code:
s:=x/(imagewidth/(abs(xmin)+abs(xmax)))+xmin;
I translated it into:
fld X // st=x
fld imagewidth // st=imagewidth,st(1)=x
fld Xmin // st=Xmin,st(1)=imagewidth,st(2)=x
fabs // st=abs(Xmin),st(1)=imagewidth,st(2)=x
fld Xmax // st=Xmax,st(1)=abs(Xmin),st(2)=imagewidth,st(3)=x
fabs // st=abs(Xmax),st(1)=abs(Xmin),st(2)=imagewidth,st(3)=x
fadd // st=abs(Xmax)+abs(Xmin),st(1)=imagewidth,st(2)=x
fdiv // st=imagewidth/(abs(Xmax)+abs(Xmin)),st(1)=x
fdiv // st=x/(imagewidth/(abs(Xmax)+abs(Xmin)))
fld Xmin // st=Xmin,st(1)=x/(imagewidth/(abs(Xmax)+abs(Xmin)))
fadd // st=x/(imagewidth/(abs(Xmax)+abs(Xmin)))+Xmin
fstp s // s:=x/(imagewidth/(abs(Xmin)+abs(Xmax)))+Xmin;
This works, but it's much slower than Delphi's output!
How can I optimize this???
---EDIT-----
doh!
--------------
/Delight
x / ( y / z) = (x * z) / y
Faster:
fld Xmin
fabs
fld Xmax
fabs
fadd // st=abs(Xmax)+abs(Xmin)
fmul X // st=x*(abs(Xmax)+abs(Xmin))
fdiv imagewidth // st=x*(abs(Xmax)+abs(Xmin))/imagewidth
fadd Xmin
fstp s
Mirno
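For anyone who wants to check the identity x / (y / z) = (x * z) / y outside the FPU code, here is a minimal C sketch. The function names are made up for illustration, and it assumes all values are doubles and that xmin and xmax lie on opposite sides of zero, as the original formula does.

```c
#include <math.h>

/* Original form: x / (width / (|xmin| + |xmax|)) + xmin, with two divisions. */
double screen_to_complex_two_div(double x, double width, double xmin, double xmax)
{
    return x / (width / (fabs(xmin) + fabs(xmax))) + xmin;
}

/* Rewritten with x / (y / z) = (x * z) / y: one multiply and one division. */
double screen_to_complex_one_div(double x, double width, double xmin, double xmax)
{
    return (x * (fabs(xmin) + fabs(xmax))) / width + xmin;
}
```

Both functions should agree up to floating-point rounding; for example, with x = 320, width = 640, xmin = -2 and xmax = 1, the middle of the screen maps to -0.5.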
Thank you Mirno! One step closer to perfection...:grin:
/Delight
You should also try to avoid loading a value twice from memory as you do here with Xmin. Instead load it at the start onto the stack then reuse it.
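fld Xmin // st=Xmin, kept for the final add
fld st(0) // st=Xmin,st(1)=Xmin
fabs
fld Xmax
fabs
fadd // st=abs(Xmax)+abs(Xmin),st(1)=Xmin
fmul X
fdiv imageWidth
fadd // st=x*(abs(Xmax)+abs(Xmin))/imageWidth+Xmin
fstp s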
Also, if you don't need to preserve Xmax, you could take its absolute value in place by ANDing its sign bit with 0 in memory, and then simply add it straight from memory.
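To illustrate the sign-bit trick, here is a rough C sketch. It assumes the value is a 64-bit IEEE 754 double (what Delphi calls Double, or real8 in assembler terms), where bit 63 is the sign. The function name is hypothetical. Note that it overwrites the original value, which is why this only makes sense if Xmax does not need to be preserved.

```c
#include <stdint.h>
#include <string.h>

/* Clears bit 63 (the IEEE 754 sign bit) of a double, in place. */
void clear_sign_bit(double *value)
{
    uint64_t bits;
    memcpy(&bits, value, sizeof bits);     /* read the raw bit pattern */
    bits &= UINT64_C(0x7FFFFFFFFFFFFFFF);  /* AND the sign bit with 0, keep all other bits */
    memcpy(value, &bits, sizeof bits);     /* write it back over the original */
}
```

On little-endian x86 the sign bit sits in the upper dword, so in assembly this would presumably come down to a single AND of the dword at offset 4 with 7FFFFFFFh.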
I think that the fastest way to clear a real8 number is to do:
var db ?
...
xor var, var
Marilyn
You can't xor a memory variable with a memory variable.
A real8 is eight bytes. Storing eight bytes in one instruction requires MMX or the FPU, and either of those would need an extra instruction to load a zero first. I think this would be the fastest/shortest:
and DWORD PTR [var],0
and DWORD PTR [var + 4],0
Thanks, but shouldn't that 0 be -1 ???
/Delight
Z AND -1 = Z
Z AND 0 = 0 ; you wanted to clear it?
Z OR -1 = -1
Z OR 0 = Z
Z XOR -1 = NOT Z
Z XOR 0 = Z
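The table above can be checked quickly in C. This is only a sketch on 32-bit unsigned values, where -1 corresponds to all bits set (0xFFFFFFFF). The helper names and the ALL_ONES macro are made up for illustration.

```c
#include <stdint.h>

#define ALL_ONES ((uint32_t)-1)  /* -1 as a 32-bit pattern: 0xFFFFFFFF */

uint32_t and_with(uint32_t z, uint32_t mask) { return z & mask; }
uint32_t or_with(uint32_t z, uint32_t mask)  { return z | mask; }
uint32_t xor_with(uint32_t z, uint32_t mask) { return z ^ mask; }
```

So and_with(z, 0) clears z, while and_with(z, ALL_ONES) leaves it unchanged. That is why the mask has to be 0 to clear a value, and a mask with only some bits cleared (such as 7FFFFFFFh) to clear just those bits.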
OK, now I get it. Thanks a lot!
/Delight
:stupid:
Wouldn't it be better to move zero to the memory location?
It would avoid a read-modify-write operation.
Mirno
otherwise it wouldn't matter on the Athlon. It also affects
the flags, so it might be better to just move the zero. :)
Hi !
One more FPU optimization can be done with the following steps:
i) Start your FPU block on a 16-byte boundary, i.e. at addresses like $42010, $64df0, ...
ii) If an FPU instruction would cross a 16-byte boundary (for example, a 5-byte instruction starting at $4201e), insert nop fill-ins (or integer code that can execute simultaneously) so that the FPU instruction starts at the next 16-byte boundary.
This helps because instructions are fetched in 16-byte blocks ...
Greetings, Caleb
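To make the arithmetic behind the alignment rule concrete, here is a small C sketch that works out how many filler bytes an instruction would need. The function names are hypothetical, and it assumes instruction lengths between 1 and 16 bytes. It only models the address calculation, not the actual fetch behaviour of any particular CPU.

```c
#include <stdint.h>

/* Nonzero if an instruction of `length` bytes starting at `address`
   spans two different 16-byte blocks. Assumes 1 <= length <= 16. */
int crosses_16_byte_boundary(uint32_t address, uint32_t length)
{
    return (address / 16) != ((address + length - 1) / 16);
}

/* Filler bytes needed so the instruction starts at the next 16-byte
   boundary; 0 if it already fits inside its current block. */
uint32_t padding_needed(uint32_t address, uint32_t length)
{
    if (!crosses_16_byte_boundary(address, length))
        return 0;
    return 16 - (address % 16);
}
```

For the example in the post, a 5-byte instruction at $4201e would run through $42022, so it crosses the boundary at $42020 and padding_needed(0x4201e, 5) gives 2 filler bytes.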
NO, No, No
Remember that the Pentium chip has instruction pairing:
mov ecx, 0
xor eax, eax
mov dword ptr buffer, eax
mov ecx, 0
is faster than:
mov ecx, 0
mov dword ptr buffer, 0
mov ecx, 0
Just remember to alternate registers.
And check these out:
mov eax, dword ptr buffer
push eax
call empty
pop eax
mov eax, 0
push eax
call empty
pop eax