Hi all!

I'm writing a fractal plotting program in Delphi that draws a Mandelbrot set on the screen. I'm now trying to rewrite the most important parts in assembly, and so far I'm doing just fine. I came to a part where I wanted to convert a screen coordinate to a complex coordinate. I used to do it with this piece of code (in Pascal):

```
s:=x/(imagewidth/(abs(xmin)+abs(xmax)))+xmin;
```

I translated it into:

```
fld X           // st=x
fld imagewidth  // st=imagewidth,st(1)=x
fld Xmin        // st=Xmin,st(1)=imagewidth,st(2)=x
fabs            // st=abs(Xmin),st(1)=imagewidth,st(2)=x
fld Xmax        // st=Xmax,st(1)=abs(Xmin),st(2)=imagewidth,st(3)=x
fabs            // st=abs(Xmax),st(1)=abs(Xmin),st(2)=imagewidth,st(3)=x
fadd            // st=abs(Xmax)+abs(Xmin),st(1)=imagewidth,st(2)=x
fdiv            // st=imagewidth/(abs(Xmax)+abs(Xmin)),st(1)=x
fdiv            // st=x/(imagewidth/(abs(Xmax)+abs(Xmin)))
fld Xmin        // st=Xmin,st(1)=x/(imagewidth/(abs(Xmax)+abs(Xmin)))
fadd            // st=x/(imagewidth/(abs(Xmax)+abs(Xmin)))+Xmin
fstp s          // s:=x/(imagewidth/(abs(Xmin)+abs(Xmax)))+Xmin;
```

This works, but it's much slower than Delphi's output!

How can I optimize this???

---EDIT-----

doh!

--------------

/Delight

x / ( y / z) = (x * z) / y
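Spelled out for the expression in the first post, taking y = imagewidth and z = abs(xmin)+abs(xmax), the identity would read roughly as follows (the subscripted symbols are just shorthand for the Pascal variables):

$$
s = \frac{x}{\dfrac{\mathrm{imagewidth}}{|x_{\min}| + |x_{\max}|}} + x_{\min}
  = \frac{x \,\bigl(|x_{\min}| + |x_{\max}|\bigr)}{\mathrm{imagewidth}} + x_{\min}
$$

So x is multiplied by the coordinate range and divided by imagewidth, which leaves one division instead of two.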

Faster

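```
fld X            // st=x
fld Xmin
fabs
fld Xmax
fabs
fadd             // st=abs(Xmax)+abs(Xmin),st(1)=x
fmul             // st=x*(abs(Xmax)+abs(Xmin))
fdiv imagewidth  // st=x*(abs(Xmax)+abs(Xmin))/imagewidth
fadd Xmin
fstp s
```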

Mirno
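For anyone who wants to check the rewrite outside the FPU code, here is a minimal C sketch of the same arithmetic. The function names are hypothetical, and the last two helpers go one step further than the thread does: they assume xmin, xmax and imagewidth stay constant for a whole image, so the scale factor can be computed once outside the pixel loop.

```c
#include <math.h>

/* Hypothetical helper: the expression exactly as written in the first post. */
double map_original(double x, double imagewidth, double xmin, double xmax)
{
    return x / (imagewidth / (fabs(xmin) + fabs(xmax))) + xmin;
}

/* Same value after applying x / (y / z) = (x * z) / y:
   one multiply and one divide instead of two divides. */
double map_rewritten(double x, double imagewidth, double xmin, double xmax)
{
    return x * (fabs(xmin) + fabs(xmax)) / imagewidth + xmin;
}

/* Assumption, not from the thread: the scale depends only on per-image
   constants, so it can be computed once before the pixel loop. */
double map_scale(double imagewidth, double xmin, double xmax)
{
    return (fabs(xmin) + fabs(xmax)) / imagewidth;
}

/* Per-pixel work is then a single multiply and a single add. */
double map_with_scale(double x, double scale, double xmin)
{
    return x * scale + xmin;
}
```

The three variants can differ in the last bit or so because of rounding, so compare them with a small tolerance rather than for exact equality.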

Thank you Mirno! One step closer to perfection...:grin:

/Delight

You should also try to avoid loading a value twice from memory as you do here with Xmin. Instead load it at the start onto the stack then reuse it.

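```
fld Xmin         ; st=Xmin, kept on the stack for the final add
fld X
fld st(1)        ; Xmin
fabs
fld Xmax
fabs
fadd             ; st=abs(Xmax)+abs(Xmin),st(1)=X,st(2)=Xmin
fmul             ; st=X*(abs(Xmax)+abs(Xmin)),st(1)=Xmin
fdiv imageWidth
fadd             ; adds the Xmin loaded at the start
fstp s
```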

Also, if you don't need to preserve Xmax, you could get its absolute value by clearing its sign bit in memory (ANDing that bit with 0) and then simply adding it.

I think that the fastest way to clear a real8 number is to do:

```
var db ?
...
xor var, var
```

Marilyn

You can't xor a memory variable with a memory variable.

A real8 is eight bytes. Storing all eight in one instruction takes the FPU or MMX, and either one first needs an extra instruction to load the zero it is going to store. I think this would be the fastest and shortest:

```
and DWORD PTR [var],0
and DWORD PTR [var + 4],0
```

Thanks, but shouldn't that 0 be -1 ???

/Delight

```
Z AND -1 = Z
Z AND  0 = 0   ; you wanted to clear it?
Z OR  -1 = -1
Z OR   0 = Z
Z XOR -1 = NOT Z
Z XOR  0 = Z
```
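To tie these identities back to the two masks discussed earlier in the thread, here is a small C sketch. It assumes a 64-bit IEEE-754 double (what real8 normally means) with the sign in bit 63, and the function names are made up for illustration. ANDing every bit with 0 clears the whole number, while ANDing only the sign bit with 0 (and every other bit with 1) gives the absolute value.

```c
#include <stdint.h>
#include <string.h>

/* Z AND 0 = 0: masking every bit with 0 clears the value to +0.0.
   Assumes double is a 64-bit IEEE-754 value. */
double clear_by_and(double value)
{
    uint64_t bits;
    memcpy(&bits, &value, sizeof bits);   /* reinterpret without aliasing issues */
    bits &= UINT64_C(0);
    memcpy(&value, &bits, sizeof value);
    return value;
}

/* Z AND 1 = Z for bits 0..62, Z AND 0 = 0 for bit 63 (the sign):
   the magnitude is kept and the sign is cleared, i.e. abs(value). */
double abs_by_and(double value)
{
    uint64_t bits;
    memcpy(&bits, &value, sizeof bits);
    bits &= UINT64_C(0x7FFFFFFFFFFFFFFF);
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

In the assembly version only the high dword holds the sign bit, so the absolute-value case would need a mask on `[var + 4]` alone.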

Ok, now I get it. Thanks a lot! :stupid:

/Delight

Wouldn't it be better to move zero to the memory location?

It would avoid a read-modify-write operation.

Mirno

otherwise it wouldn't matter on the Athlon. It also affects the flags, so it might be better to just move the zero. :)

Hi !

One more FPU optimization takes two steps:

i) Start your FPU block on a 16-byte boundary, that is, at addresses like $42010, $64df0, ...

ii) If an FPU instruction would straddle a 16-byte boundary (for example, a 5-byte instruction starting at $4201e), insert NOP fill-ins (or integer code that can execute simultaneously) so that the FPU instruction starts at the next 16-byte boundary.

This helps because instructions are fetched in 16-byte blocks ...

Greetings, Caleb

NO, No, No

Remember the Pentium chip has pairing:

```
mov ecx, 0
xor eax, eax
mov dword ptr buffer, eax
mov ecx, 0
```

is faster than:

```
mov ecx, 0
mov dword ptr buffer, 0
mov ecx, 0
```

Just remember to alternate registers.

And check these out:

```
mov eax, dword ptr buffer
push eax
call empty
pop eax
```

```
mov eax, 0
push eax
call empty
pop eax
```
