Just by curiosity, I attempted recently to write a small program to generate the Mandelbrot fractal (using extended precision floating point math). For those not familiar with the subject, it consists in iterating the formula z=z*z+C until it exceeds a bailout value, z being a complex number. The number of iterations required is then converted to a color value for display. A maximum number of iterations is also specified and a fixed color value (usually black) is displayed if it is reached.

I then attempted to optimize the main computation loop. To my surprise, the main factor affecting speed was the memory location of the TBYTE variables used by the FPU. The time required for a given set of parameters was almost doubled when those variables were in consecutive memory locations!!!

Another surprise was that incrementing a global memory dword variable as a counter for the number of iterations was some 10-15% faster than using one of the CPU registers for that purpose (I tried EAX and EDX with no significant difference). And that instruction is a very small part of the loop! Furthermore, moving that counter instruction within the FPU instructions made a difference of up to 25% in the time required.

Following is the code of that main loop, MASM syntax (zr, zi, pixelX and pixelY being TBYTE global variables):

fld zr
fmul st,st
inc count ;optimized location
fld zi
fmul st,st
fld st
fadd st,st(2)
fld bailout
fcomip st,st(1)
fstp st
ja @F
jmp lastiter
fld pixelX

fld zr
fld zi
dec ecx ;also better location than before the jnz
fadd st,st
fld pixelY
fstp zi
fstp zr
jnz nextiter
jmp @F ;maximum iterations reached


Posted on 2003-12-29 15:13:37 by Raymond
AMD, right? :grin:
same at my K6-2, I think AMD had documented this. It's in fact not a slow-down, but a speed-up when you access memory +-8 bytes not consecutively :).
I use this speedup :alright:
Posted on 2003-12-29 16:47:23 by Ultrano
:) Here is a nice inner loop:

fld Real
fld Imaginary
fld st(1)
fld st(1)

;*** NOTE: This runs through atleast three times (who cares)
;ecx is max loop count, fp stack=(a,b) (r,i) (ie Z[1] is done)
brot2: ;Inner loop by Thomas Jentzsch '98 (What a guy!)
mov edx,0x40800000 ;IEEE float four compare
mov eax,edx ;set up for this pass through
mov ftmp,edx ;and next time through
fld ftmp ;R a b r i
.1 fstp ftmp ;a b r i
fld st(0) ;a a b r i
fxch st(2) ;b a a r i
fmul st(2),st(0) ;b a ab r i
cmp edx,eax ;test for exit condition
mov eax,ftmp ;get (aa+bb) from last loop
fmul st(0),st(0) ;bb a ab r i
fxch st(2) ;ab a bb r i
fadd st(0),st(0) ;2ab a bb r i
fxch st(1) ;a 2ab bb r i
fmul st(0),st(0) ;aa 2ab bb r i
fxch st(1) ;2ab aa bb r i
fld st(2) ;bb 2ab aa bb r i
fsub st(0),st(4) ;bb-r 2ab aa bb r i
fxch st(3) ;bb 2ab aa bb-r r i
fadd st(0),st(2) ;aa+bb 2ab aa bb-r r i
fxch st(1) ;2ab aa+bb aa bb-r r i
fadd st(0),st(5) ;2ab+i aa+bb aa bb-r r i
fxch st(3) ;bb-r aa+bb aa 2ab+i r i
fsubp st(2),st(0) ;aa+bb aa-bb+r 2ab+i r i
dec ecx ;one down, many to go...
ja .1 ;both exit conditions: 'cmp' above
jnc .2 ;this point just won't go away
add ecx,2 ;that 'cmp' is for two loops back!
.2 fstp st(0)
;ecx contains max loop count minus loops preformed on point
Posted on 2003-12-29 22:03:41 by bitRAKE
Thanks bitRake

Although I can't use the integer comparison of REAL4 numbers (I'm using extended precision math to get maximum zooming power up to about 10^16), I'll try to modify my code to avoid memory access in the main loop as in your suggested algo.

Posted on 2003-12-29 23:38:48 by Raymond

Although I can't use the integer comparison of REAL4 numbers (I'm using extended precision math to get maximum zooming power up to about 10^16).
It should not be a problem using only REAL4 precision for the exit value as we are only concerned if the value goes to infinity. I really enjoy playing with fractal generators. After reading Benoit B. Mandelbrot's Fractal Geometry of Nature, I wrote my first generator about 18 years ago. It is one of the first things I write in each new programming language learnt. It was the primary reason for learning assembler. :)
Posted on 2003-12-30 00:05:44 by bitRAKE
Thanks once more bitRake,

A few minutes after I posted that reply, I suddenly realized that there is no need for high precision regarding that comparison. Since I provide the option to modify the bailout value which I then keep as a REAL4 anyway, I only have to load EDX with that value instead of the suggested constant in your algo.

I'm working on the modification but I think I'll go and get a bit of sleep before continuing.

Posted on 2003-12-30 00:13:22 by Raymond
Just an update to previous posts.

The complete removal of memory accesses from the inner loop as suggested by bitRake reduced the time required to generate the fractal by more than 60%.

This optimization was possible because only 6 of the 8 FPU registers were required to hold intermediate results and perform the computations.

This may not be feasible in other types of applications. The specific location of memory variables used by the FPU may then have a significant effect on the overall performance and be worth investigating.

Posted on 2004-01-02 22:03:08 by Raymond
Raymond, did you also try doing the comparison with the FPU verses interger instructions?
Posted on 2004-01-03 18:32:45 by bitRAKE

No I have not tried that.

Although the comparison with the CPU can be performed in parallel with one of the FPU instructions, the aa+bb value has to be stored in memory. On the other hand, doing the comparison with the FPU involves loading the bailout value.

However, since I still have 2 unused registers, I will try to load that bailout value before the loop and see if using the FPU for the comparison has any benefit. Will advise.

(BTW, if you care to have a quick glance at the program still in its infancy, it's currently available on the other forum.)

Posted on 2004-01-03 22:30:52 by Raymond
Using the fcomip for a comparison with the FPU is slower than storing a REAL4 to memory. The overall speed of the program was about 10% slower.

Using the fcomp which requires storing the status word for eventual transfer to the flag register was even slower.

Alighment of the start of the loop on a page boundry seems to make a 4-8% difference.

Posted on 2004-01-03 23:40:31 by Raymond
Raymond, page as in 4k? Any difference if aligning to 32 bytes?
Posted on 2004-01-04 07:34:48 by f0dder
Sorry. It was very late last night and I should have written paragraph (16 bytes). My apology for the misinformation.

Posted on 2004-01-04 11:04:24 by Raymond
Raymond, the program worked well, but the centering became unaligned. I would have to click a couple of inches below where I wanted the center to be when zooming in -- where I clicked seemed like the lower edge of the screen on the zoomed image. It was very fast even at 5000 itterations and I enjoyed playing with it for about an hour, thank you. Posted on 2004-01-04 11:52:43 by bitRAKE
Raymond, got an executable available that I can play with? I enjoy fractals :) (spent quite some time playing with fractint and xaos :))
Posted on 2004-01-04 12:03:10 by f0dder
Here it is. As mentioned, it is still in its infancy.

I have also played a lot with fractint.

Posted on 2004-01-04 20:53:14 by Raymond
You should have a look at xaos - it's a realtime fractal zoomer, pretty neat.
Posted on 2004-01-05 09:28:12 by f0dder

Sorry for delay in answer. I wanted to have another look at my code processing the mouse input (WM_LBUTTONUP and WM_RBUTTONUP messages). I also wanted to hear from f0dder if he might have observed a similar problem.

I also tried to switch my video mode to the highest resolution shown by Windows for my system (1600x1200x32bits) to see how the program worked in that mode. However, as I learned today, my monitor may not have been designed to handle that mode and all I got was a totally black screen with no way of communicating with the computer. I though at first that my monitor had suddenly failed.

I finally got some info today on how to get back to normal (i.e. press F8 during Windows startup and choose "Safe Mode" to eventually reset the video mode to basics).

The only time which I have seen the program not centering properly was when the magnification was increased above approx. 10^16. Could you expand a bit more on when the problem was observed and what type of system you have.

Posted on 2004-01-05 15:28:16 by Raymond
Hm Raymond, windows ought to hide resolutions your monitor doesn't support :/ - really sucks having to boot to safe mode, but at least you got it working again - nice.

Humm, I do think the zooming seemed a bit weird, but it's probably because it zoomed more than I expected, and that I sometimes accidentally clicked twice because the update speed isn't *that* fast.

Still looks pretty nice though - and this is probably one of the fields where you can do a lot of work on both algorithm design and code optimization. Nice.
Posted on 2004-01-05 15:36:53 by f0dder

The program starts with the zooming factor set at 10. This can be changed from the dlg box opening with the TAB key (without any recomputing) or from the dlg box opening with the "Z" key (everything recomputed).

The update of the window starts immediately after a mouse click, but the "video memory" is not rezeroed; it's only written over with the new data. The screen is updated only line-by-line (updating each pixel as it is computed would definitely slow down the entire process without any benefit to the visual effect).

Have you had any experience similar to what was reported by bitRake?

Posted on 2004-01-05 15:57:51 by Raymond
That xaos is quite amazing. Hard to compete against that.

Posted on 2004-01-05 20:30:36 by Raymond