The fire effect that I made over the weekend (mentioned elsewhere already, http://scali.eu.org/~bohemiq/Fire5.rar) seems to have developed into a benchmark tool (my simple proof-of-concept programs tend to do that anyway, probably because I generally have an FPS counter in them).

Anyway, some interesting data:

XP1800+ with GF2GTS: ~50 fps
XP1800+ with R9700: ~50 fps

XP3000+: ~85 fps
P4 3.2 GHz, Win2k with HTT enabled: ~75 fps
P4 3.2 GHz, Win2k with HTT disabled: ~104 fps
P4 2.8 GHz, WinXP with HTT enabled: ~91 fps
P4 2.8 GHz, WinXP with HTT disabled: ~92 fps

So, some conclusions:
1) The routine is completely CPU-limited; the video card doesn't make any difference at all (assuming it supports T&L).
2) The Athlon doesn't quite live up to its 3000+ name, even though the code does not contain any specific optimizations for P4 (no SSE/SSE2 used, only x87 code, only a single thread used, compiler set to optimize for PPro/PII/PIII).
3) On Win2k, HTT is not handled properly at all: the OS treats it as two separate CPUs (CPU usage was reportedly 50%, indicating that it scheduled the program's single thread entirely on one logical CPU while trying to load-balance the other threads onto the other one), and enabling HTT considerably decreases performance in this case (see the sketch after this list for a way to experiment with thread placement).
4) On Windows XP, the OS apparently realizes that HTT is not the same as a dual-CPU system, and manages to extract pretty much the full performance of the CPU, even with HTT enabled and only a single thread in use.
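
For anyone who wants to experiment with where a thread lands on an HTT machine, Win32 lets you pin a thread to a specific logical CPU (a minimal sketch; the demo itself does not do this):

    #include <windows.h>

    // Pin the current thread to the first logical CPU. The affinity mask
    // is a bit field: bit 0 = logical CPU 0, bit 1 = logical CPU 1, etc.
    void PinToLogicalCpu0()
    {
        SetThreadAffinityMask(GetCurrentThread(), 1);
    }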

These seemed like quite interesting figures, so I thought I'd share them.
I can only wonder what would happen if I were to use SSE/SSE2 and split the code up into multiple threads (in this case that would be reasonably easy to do, since all operations are independent; each thread could work on its own piece of the fire simultaneously).
I think the P4s would be able to beat the Athlons even more convincingly then (this is what I was talking about the other day, bitRAKE).
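
Something along these lines is what I have in mind (a minimal sketch in modern C++; UpdateRows is a hypothetical per-slice routine, not code from the demo):

    #include <thread>
    #include <vector>

    // Hypothetical per-slice update: reads only the previous frame's
    // buffer and writes rows [firstRow, lastRow) of the new one.
    void UpdateRows(int firstRow, int lastRow);

    // Split the fire buffer into horizontal slices, one worker per slice.
    // Safe because every output row depends only on the previous frame.
    void UpdateFireParallel(int totalRows, int numThreads)
    {
        std::vector<std::thread> workers;
        for (int i = 0; i < numThreads; ++i)
        {
            int first = i * totalRows / numThreads;
            int last = (i + 1) * totalRows / numThreads;
            workers.emplace_back(UpdateRows, first, last);
        }
        for (std::thread& t : workers)
            t.join();
    }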
Posted on 2004-05-27 02:44:35 by Scali
I would be interested in seeing what you could do with SSE/2 and multiple threads. Oddly, I get the same 85 fps on my XP2800/GF3, and that is with many other programs running concurrently. IMVHO, this only proves compiler writers do not optimize for AMD CPUs.
Posted on 2004-05-27 09:12:48 by bitRAKE
Oddly, I get the same 85 fps on my XP2800/GF3, and that is with many other programs running concurrently.


Perhaps it is memory-limited on the XP2800+ and above?
It could be that you both have 400 MHz DDR, which just cannot deliver more performance. It could also be that your chipset delivers better performance, or a combination of factors.
The P4 obviously has more memory bandwidth, so that would explain why those still show improvements as clock speeds go up.

IMVHO, this only proves compiler writers do not optimize for AMD cpu's.


Either that, or AMD doesn't do a good job of making x86 clones. One could say that it is AMD's responsibility to either deliver their own compiler, or to design their CPUs in such a way that they run code at least as well as Intel's (come to think of it, weren't you the one who said the Itanium was not a good CPU because it was not good at running existing x86 code? This seems to be the same situation).
Personally I think the compiler (VS.NET 2003 and Intel C/C++ compiler 7.1) does a good job on Athlon when generating PPro/PII/PIII code.

I don't particularly feel like coding a multithreaded version, since I do not have a dual-CPU or HTT system available. Perhaps someone else wants to try though.
I don't really feel like coding an SSE/SSE2 version, especially if it looks like it is memory-limited on fast CPUs anyway...
Posted on 2004-05-27 09:42:39 by Scali
The Itanium is not a good CPU for me at this time, given the cost and the fact that x86 performance is much less than expected. Athlon/Opteron are good for me because of the relatively high x86 performance for low cost.

VS7 usually does fairly well on K7. So, you have probably created another memory tester. P4 certainly has bandwidth - if you want to run through a lot of data and not do much with it, then the P4 is a real winner. If you reduce the data required and make the processing more complex, then the K7 is a real winner. I do the latter, so the K7 works for me.
Posted on 2004-05-27 10:29:33 by bitRAKE
Gee... 37 fps on mine... P4 1.2 GHz on XP... with a Radeon 9500
Don't have the guts to stick it in my P3 800 with the Rage 128 :D

Sure looks nice tho.
Posted on 2004-05-27 10:37:13 by JimmyClif
So, you have probably created another memory tester. P4 certainly has bandwidth - if you want to run through a lot of data and not do much with it, then the P4 is a real winner. If you reduce the data required and make the processing more complex, then the K7 is a real winner.


Apparently you have no clue what is going on here, or how well it is optimized. Don't judge things you do not understand.
The actual dataset being operated on is merely 32k floats, so 128 KB.
The geometry being generated obviously cannot be cut down, since otherwise it wouldn't be in a format that can be rendered.
Then there are tables that are required for the algorithm itself and cannot be altered. They are all set up to be as compact as possible, and to be accessed in a linear fashion whenever possible.

The rest of the memory usage is mainly bitmaps that flag whether a certain point has already been processed, so that it can be reused and the geometry can be more compact and rendered more efficiently.
Needless to say, bitmaps are the most compact way to store an array of boolean values.
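
For reference, this is the kind of bitmap I mean (a generic sketch, not the actual code; the 32k size just matches the dataset mentioned above):

    #include <string.h>

    // One bit per flag, eight flags per byte: 32k flags fit in 4 KB.
    static unsigned char processed[(32768 + 7) / 8];

    inline void SetProcessed(int i) { processed[i >> 3] |= (unsigned char)(1 << (i & 7)); }
    inline bool IsProcessed(int i)  { return ((processed[i >> 3] >> (i & 7)) & 1) != 0; }
    inline void ClearProcessed()    { memset(processed, 0, sizeof(processed)); }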

So no, your assumptions are completely wrong, and I find it insulting that you would doubt my skills as a programmer (especially on matters that you seem to have no grasp of) before you would consider doubting your favourite processor and my (objective, benchmarked) judgement thereof.

FYI, this routine was actually written on my Athlon, and any optimizations that I applied were tested on this CPU.
Posted on 2004-05-27 12:59:24 by Scali
Doesn't run on my PC.
After I press OK on the 'audio and video selection' screen, a black window appears, but instead of showing the 'fire', it just exits.

I've got a P3-700, TNT2, 128 MB RAM.
Posted on 2004-05-27 15:15:14 by clippy
Scali, when did I say anything about you?
Go take a pill or something.
Posted on 2004-05-27 15:24:16 by bitRAKE
If you reduce the data required and make the processing more complex, then the K7 is a real winner. I do the latter, so the K7 works for me.


I was doing the latter, yet my code is faster on P4, so the K7 doesn't work for you. You just assumed this was because I didn't optimize it properly, and then tried to teach me how to optimize it. So you implied a lot about me, and about how I was (deliberately, or through lack of optimization?) making the P4 perform better than the Athlon.

What kind of reaction is that anyway? Why don't you say anything about how my code is indeed written the way you suggested, and admit that the Athlon is not a winner?
Posted on 2004-05-27 15:32:13 by Scali
Scali, cool effect. Did you post the code? Sorry, I missed it.
Posted on 2004-05-27 15:40:37 by bitRAKE
Why would I want to post the code?
Posted on 2004-05-27 15:50:56 by Scali

Why don't you say anything about how my code is indeed written the way you suggested, and admit that the Athlon is not a winner?
I wouldn't comment on your code without seeing it. ;)
I have not commented on your code, or you personally.
Posted on 2004-05-27 15:56:02 by bitRAKE
No offense, but I doubt you would understand anything about the code. You're not familiar with the algorithms used, let alone able to judge whether they are optimal, or whether they are implemented in an optimal way.
Posted on 2004-05-27 15:57:16 by Scali
Then don't expect or assume that I would comment on it - how silly.

Result is 80967345876

Aren't you impressed with my amazing discovery!?
Posted on 2004-05-27 15:59:52 by bitRAKE
You don't have to understand the code to see that it's faster on P4 than on Athlon, or to see that HTT is a real disadvantage on Win2k in some cases, while XP handles it fine.

You just want to comment on things that you don't know a thing about, and therefore you make wrong assumptions. And instead of admitting you're wrong, or even apologising, you just start talking nonsense. Pathetic.
Posted on 2004-05-27 16:17:24 by Scali
Well, it is true that I don't know much. I'm very sorry if my comments interfere with you expressing yourself. How about putting a note at the bottom of your posts to indicate where you would like me not to reply, and I'll post elsewhere?
Posted on 2004-05-27 16:35:17 by bitRAKE
Interestingly, on my P3 850 with a Matrox G400 it only runs full-screen, failing with D3DERR_NOTAVAILABLE when windowed.

It looks pretty neat, but the problem is that the console doesn't stick around afterwards, so I only get the briefest of glimpses at it... I think it's something like 14 fps :)

I'd have thought a 128 KB data set wouldn't be too troublesome; it should easily fit in the L2 cache (even Celerons have 128 KB of L2, don't they?).

Mirno
Posted on 2004-05-27 16:41:28 by Mirno
Mirno: Is your desktop in 16-bit mode, perhaps? I had that problem on a GF2Go: it only ran windowed if the desktop was 32-bit, otherwise it would give the D3DERR_NOTAVAILABLE.
My GF2GTS works fine with a 16-bit desktop though, so as far as I know it's not a bug in my code, but a bug in the driver (the GF2Go driver was much older than mine), which does not support blitting the backbuffer to a frontbuffer of a different format. That is rather silly; blitting 16<->32 bit is a reasonably standard operation, often done when rendering 32-bit textures on a 16-bit screen or vice versa, and afaik any card supports it. Only when you do page flipping do the formats of the back- and frontbuffers have to match.
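
Incidentally, you can detect this case up front (a sketch against the D3D9 interface, assuming d3d is your IDirect3D9 pointer):

    // Ask D3D9 whether a windowed 32-bit backbuffer can be presented on
    // the current desktop format, before trying to create the device.
    D3DDISPLAYMODE mode;
    d3d->GetAdapterDisplayMode(D3DADAPTER_DEFAULT, &mode);
    HRESULT hr = d3d->CheckDeviceType(D3DADAPTER_DEFAULT, D3DDEVTYPE_HAL,
                                      mode.Format,      // current desktop format
                                      D3DFMT_X8R8G8B8,  // requested backbuffer format
                                      TRUE);            // windowed
    if (FAILED(hr))
    {
        // The driver cannot present this backbuffer format on this
        // desktop; fall back to fullscreen or to a matching format.
    }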

Anyway, don't expect decent performance on a G400. It lacks hardware T&L support, which means that your CPU will have to perform T&L on about 50k triangles per frame (about 26k vertices). That's a considerable amount of extra work :)

As for the dataset, yes, that's small, but I think the problem is partly in spitting out those vertices (and also about 150k indices per frame, so another 300 KB). One vertex is 32 bytes, so we have about 2 MB of data generated per frame. The data is written directly to video memory on cards that support it, so you have to go through the AGP bus, and then the chipset comes into play as well. Write combining/fast writes and all that...
I guess the cache gets quite confused with it :)

So while the data itself may fit into the cache, there's a lot of other stuff going on (also the tables for the algorithm itself, as I mentioned before; they are not big, but if they get pushed out, you still get cache misses). I suppose the faster memory on P4s, combined with their rather unique cache setup (very small L1 cache, large but relatively fast L2 cache), is making the difference here.
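
To make the 32 bytes concrete, a typical vertex of that size looks like this (a representative layout, not necessarily the exact one the demo uses):

    // A representative 32-byte vertex: position, normal and one 2D
    // texture coordinate.
    struct FireVertex
    {
        float x, y, z;    // position: 12 bytes
        float nx, ny, nz; // normal:   12 bytes
        float u, v;       // texcoord:  8 bytes
    };                    // total:    32 bytes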
Posted on 2004-05-27 17:04:21 by Scali
I wasn't expecting high performance :tongue:
This box is coming up on six years old (well, parts of it; it's still the same 440BX that was originally a P2 350).

You are right, upping the desktop to 32 bit made it work. Not a very helpful error message from DX though!

Even at 1X the AGP bus isn't going to be stretched by 2 MB of data, but it will thrash the cache a fair bit. It'd be interesting to see the performance on the P4 Expensive Edition, or one of the 940-pin Athlon64s (what with their dual-channel memory, 1 MB cache, AND built-in memory controller). One day an Athlon64 will be mine!

Mirno
Posted on 2004-05-27 17:17:16 by Mirno
You are right, upping the desktop to 32 bit made it work. Not a very helpful error message from DX though!


Yea... it just gets that code when it asks the driver to create the IDirect3DDevice9 object, and the driver fails. You never know why :P
You would have to verify that manually, comparing the features you tried to use against the features that the card actually supports. Always a problem when it's someone else's hardware that's failing.

Even at 1X the AGP bus isn't going to be stretched by 2 MB of data


That depends on how you send it. I try to send at least one vertex at a time. But since you never know where and when your next vertex is coming from, it's hard to send more than one vertex at a time without using a temporary buffer and copying in bursts from there (which I don't do; I'm not sure if it would be worth the trouble).
That's the thing with write combining: you want to transfer the maximum amount of data you can in one AGP transfer. If you sent 2 MB at 1 byte per transfer, you'd need a lot more transfers, with all the overhead of setting each one up. Basically, it's as expensive to send 1 byte as it is to send a transfer of the maximum size.
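
The temporary-buffer variant would look roughly like this (a sketch; whether the extra copy pays off is exactly what I'm not sure about):

    #include <string.h>

    struct FireVertex { float x, y, z, nx, ny, nz, u, v; };  // 32 bytes

    static FireVertex staging[1024];  // cached system memory; size is hypothetical
    static int stagingCount = 0;

    // Copy the accumulated vertices into the locked (write-combined)
    // vertex buffer in one long sequential burst, then reset the batch.
    void FlushBatch(FireVertex* lockedVB, int& vbOffset)
    {
        memcpy(lockedVB + vbOffset, staging, stagingCount * sizeof(FireVertex));
        vbOffset += stagingCount;
        stagingCount = 0;
    }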

The same pretty much goes for the rendering process itself. Modern hardware has a lot of overhead when setting up a draw operation (many more states than old hardware, which is why it is often actually slower than old hardware when running old software). Some time ago, NVIDIA had a presentation about this, called "Batch, Batch, Batch" or something like that :)

If I recall correctly, they showed that it was just as fast to render 100 triangles as it was to render 1 triangle. You'd get the other 99 triangles 'for free' with the setup cost.
And as you sent more triangles per batch, the speed kept going up; only at around 500 triangles per call did you finally reach the maximum triangle count per second.
So it's not just about how much data you send; it's also about how you send it.
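
In D3D9 terms, the contrast looks like this (a sketch, assuming device, numVerts and numTris are already set up):

    // Pathological case: one draw call per triangle, so the per-call
    // setup cost is paid numTris times.
    for (UINT i = 0; i < numTris; ++i)
        device->DrawIndexedPrimitive(D3DPT_TRIANGLELIST, 0, 0,
                                     numVerts, i * 3, 1);

    // Batched case: one call for the whole batch, setup cost paid once.
    device->DrawIndexedPrimitive(D3DPT_TRIANGLELIST, 0, 0,
                                 numVerts, 0, numTris);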

Oh, and why would you want to run the desktop in 16-bit anyway? :)
I had a G450 once (before GeForce/Radeon I bought Matrox all the time); it should be almost the same as your G400, and it was well capable of rendering a 32-bit desktop at decent speed, at insane resolutions and refresh rates. Still excellent cards for 2D, those. The 2D acceleration is really good, and the output is very sharp and vibrant. I can't stand 16-bit. All that dithering, blocky gradients, ugly colours... bleh :P
Posted on 2004-05-27 17:33:23 by Scali