So, now it's up to you to prove that greedy out-of-order scheduling is optimal.


I leave that as an exercise to the reader :)
No, seriously... I'm not sure if there are some cases where it is indeed suboptimal... But I think we can both agree that in the general case, the ooo-scheduling works very well. It may spill the odd clock cycle every now and then, but I don't think that's something that should keep you awake at night ;)

Well, this is quite sad. I had hoped for a 15% performance increase or so 'for free' after I implemented the optimal algorithm. But now I realize I'm probably close to the optimum already. That leaves some mixed feelings though.


Well, that's the idea of ooo anyway... Since the industry has accepted that it's pretty much impossible for compilers to schedule the code properly for outdated architectures, they tried to redesign the CPUs to run 'bad' code efficiently :)
It's nice to know that they are doing a good job at it, don't you think? :)
Anyway, if you want to do software rendering, the trick is basically to cheat your arse off... Perhaps we should rather discuss software rendering and cheats and things than trying to get the 'slow' code faster? :)
Posted on 2003-12-12 07:36:10 by Bruce-li
Forgot to reply on the rest of your previous post first:
Originally posted by Bruce-li
Well, read about DirectX Next (see http://www.beyond3d.com)... Doesn't look like the CPU will ever take over graphics again :)
And the GPU is already very programmable... besides, I think you'll have trouble beating even the slowest of accelerators with a CPU :)
You'll either have to sacrifice quality (resolution, filtering, shading etc) or speed.
And I don't understand your "front row" statement.

Don't underestimate me, I knew of the DirectX Next before the name was official and I'll probably know the exact specifications before the biggest non-professional geek does. :cool:

Anyway, nothing says DirectX is a hardware-only API. It's just an interface. Whether I choose DirectX or OpenGL or anything else as my future interface is just a 'detail' from a technological point of view, and a matter of popularity. So although I have very limited time besides my studies (mostly nighttime), I'd like to keep up with that technology. There's a lot of demand for a hardware-independent implementation that is faster than the reference rasterizer. Most notably, game developers would like to get rid of some driver issues and have access to technology that's not even implemented in hardware yet (vs/ps 3.0). Or like I said before, what about all the laptops and the powerful office and scientific systems without adequate graphics cards? And let's not forget applications that don't benefit from brute force but need intelligent visibility determination, like CAD, which otherwise would require extremely expensive workstation cards. And I already pay for my studies with it, plus my internship with a respectable graphics card manufacturer...

So, I hope this also made it clear what I meant with "being on the front row". I realize I'm not there yet, but I think I'm well on my way...
Oh I see... I thought you meant you emulated x87 with SSE.
Excuse me for saying this... but this doesn't seem like a very smart thing to do. SSE-code is written differently from x87 code... Just like hardware-accelerated 3d is different from software. There are different things that can be done quickly in both cases, so you should focus on the strong points of either method.
It would perhaps make more sense to build a virtual language above SSE/x87... Something where you indicate parallel operations, but don't tie it directly to specific instructions... And then you just use 'optimal' rules to generate either SSE or x87 code...

Well, it's absolutely no priority that it runs fast on a Pentium II. And I can assure you that it's still faster than the reference rasterizer and many software renderers that were written purely in C. So, I'm glad I took the 'easy road' by writing 6000 lines of replacements for each and every SSE instruction. :notsure:

I'm quite happy with it myself and now I don't have to spend a single minute any more figuring out how to make it work on Pentium II-like processors.
Or well, the way I write 3d stuff, I only use floats in the T&L pipe anyway, not in the trifiller... so virtually all float-related code is matrix/vector, and you can just create macros for those... at a high level.

Oh, then how do you do perspective correction, mipmapping or any shader stuff, and still have sufficient registers? There are a few cases where I chose to use SSE because of register limitations. Ok, you're going to bring up the stage splitting again... Well, I have a better idea for reducing dependencies: drawing several pixels in parallel, or using hyper-threading. The first idea is for processors without hyper-threading, and would interleave two or more pixel pipelines so there's always some independent work to be done (quite like hyper-threading). Of course the real thing is even more efficient because there's no inner-loop setup overhead (i.e. splitting and synchronizing pixel pipelines is done at a higher level).

Anyway, I have -kind of- macros for everything as well. Plus, they take advantage of automatic register allocation that is hard to beat manually, and it removes all redundant copying. I wrote two (new?) very efficient algorithms for that. One I've named linear-scan copy propagation, and you can read about it at comp.compilers.
Oh, and something I just thought of... Something that may be overlooked often...
Sometimes it's as fast, or faster, to re-calc a value rather than to store it...
That is, if you need a+b multiple times, and you already have a in a register, you can just do the add with b straight from memory. You have the value in 1 clk, and you didn't need to pin it down in a register, or save it to a local var, to be used again later. This is sometimes a great help in reducing dependencies, or keeping the register-count low.

That's already done by the automatic register allocation. When a value is available in a register, of course it uses the register. Else it determines if there are enough free registers and if the variable is used often enough to load it into a register. It can even overwrite/spill a register if it detects that it is not used, or used less, in the rest of the function. It can all be done automatically (but with some manual control), and the code it produces looks very efficient in my eyes.
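Just to give the flavour of that decision, here's a toy sketch (purely illustrative, all names made up, nothing like the real code):

// Toy sketch of the allocation decision (illustrative only, NOT the real code):
// a variable keeps the register it already lives in, else takes a free one,
// else steals the register whose owner is used furthest away (or never again).
const int NUM_REGS = 8;

struct Var
{
    int reg;       // -1 when the variable is not in a register
    int nextUse;   // instruction index of its next use; a huge value = not used again
};

Var *regOwner[NUM_REGS] = {0};   // which variable currently owns each register

int allocate(Var &v)
{
    if(v.reg != -1) return v.reg;                   // already live in a register
    for(int r = 0; r < NUM_REGS; r++)               // any free register?
        if(!regOwner[r]) { regOwner[r] = &v; return v.reg = r; }
    int victim = 0;                                 // spill the least useful owner
    for(int r = 1; r < NUM_REGS; r++)
        if(regOwner[r]->nextUse > regOwner[victim]->nextUse) victim = r;
    regOwner[victim]->reg = -1;                     // (a real allocator emits a store here)
    regOwner[victim] = &v;
    return v.reg = victim;                          // (and a load of v here)
}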

I wouldn't have started this thread about getting the last few percent of performance if I hadn't already done the things any optimizing compiler would do. ;)
Well, no offence, but you are trying to tackle the very problem that the CPU manufacturers have been dealing with ever since the first compilers... Their solution was not to make better compilers, but to make better CPUs, which would allow better compilers.
The problem is, x86 predates this... And even the MMX/SSE/SSE2 instructionsets aren't very modern... For example, they still use a 2-operand model, while 3 operands would be many times more powerful.

I think that's all very relative. Who says x86 doesn't use the 3-operand model? For example, suppose add had three operands:

add eax, ebx, ecx

-or-

89 D8 01 C8 (i.e. mov eax, ebx / add eax, ecx)

There's nothing that prevents an assembler from only presenting a 3-operand model that selects the most optimal instruction(s) automatically. And only a few geeks would know it's actually sometimes more than one instruction. And we're not even fully aware what happens at micro-instruction level!
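Just to illustrate what such an assembler would do, a throwaway sketch (register names passed as plain strings, nothing more):

#include <cstdio>
#include <cstring>

// Throwaway sketch of a 3-operand 'add' pseudo-op lowered to real x86:
// the extra mov is only emitted when the destination differs from the first source.
void add3(const char *dst, const char *src1, const char *src2)
{
    if(strcmp(dst, src1) != 0)
        printf("mov %s, %s\n", dst, src1);   // only when dst != src1
    printf("add %s, %s\n", dst, src2);
}

int main()
{
    add3("eax", "ebx", "ecx");   // mov eax, ebx / add eax, ecx  (89 D8 01 C8)
    add3("eax", "eax", "ecx");   // just: add eax, ecx
    return 0;
}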

And why does suddenly "instruction merging" sound through my head? I'd even dare to say that under certain circumstances this is more efficient than the 3-operand model because the code for adding with the destination operand being the same as one of the source operands is shorter. Not that code length is that critical nowadays, but I think one of the reasons why x86 still survives is because of its unbelievable flexibility and extendability to adapt to more efficient implementations. If you look at RISC processors you see them being replaced every couple of years by totally new architectures. Think of x86 as some sort of very low-level abstraction layer that has passed the test of time...

Anyway, thanks for your many ideas and suggestions!
Posted on 2003-12-12 18:26:47 by C0D1F1ED
Originally posted by Bruce-li
I leave that as an exercise to the reader :)
No, seriously... I'm not sure if there are some cases where it is indeed suboptimal... But I think we can both agree that in the general case, the ooo-scheduling works very well. It may spill the odd clock cycle every now and then, but I don't think that's something that should keep you awake at night ;)

Just mailed one of my previous computer architecture professors to see if he knows the answer. :grin:
Well, that's the idea of ooo anyway... Since the industry has accepted that it's pretty much impossible for compilers to schedule the code properly for outdated architectures, they tried to redesign the CPUs to run 'bad' code efficiently :)
It's nice to know that they are doing a good job at it, don't you think? :)
Anyway, if you want to do software rendering, the trick is basically to cheat your arse off... Perhaps we should rather discuss software rendering and cheats and things than trying to get the 'slow' code faster? :)

Well, there's one specific thing you could help me with: bilinear filtering. :alright: It's my biggest bottleneck in the pixel pipeline. I have an implementation that uses five MMX multiplications and does mipmapping at the same time, which runs in 25 clock cycles. ;)

I'm afraid that 'cheating' takes its toll on quality though...
Posted on 2003-12-12 18:36:18 by C0D1F1ED
Anyway, nothing says DirectX is a hardware-only API. It's just an interface.


An interface for hardware, yes. Let's face it, some things are just done differently in hardware than in software.
Although the difference is getting smaller as the programmable hardware becomes more flexible though...
But an example.. not too long ago, I coded a boxfilter on my gf2... It requires log(x)+log(y) passes for an x*y sized filter kernel.
This works fine in hardware, but doing it on the CPU like that (which you would, if you would build a software-renderer for the DirectX interface and then run a program such as mine) is just instant death.
Doing an SAT-based filter on the other hand, would be quite efficient on the CPU. But you cannot code that with DirectX... Or well, with ps2.0 and float textures, you actually can, I suppose, but still not in a very efficient way for the CPU.
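Just to show the idea, a minimal single-channel sketch (plain scalar code, nothing optimized): build the summed-area table once, and then any box sum is four lookups per pixel, independent of the kernel size:

// Sketch only: summed-area table for one float channel.
void buildSAT(const float *img, float *sat, int w, int h)
{
    for(int y = 0; y < h; y++)
        for(int x = 0; x < w; x++)
            sat[y * w + x] = img[y * w + x]
                           + (x > 0 ?          sat[ y      * w + x - 1] : 0.0f)
                           + (y > 0 ?          sat[(y - 1) * w + x    ] : 0.0f)
                           - (x > 0 && y > 0 ? sat[(y - 1) * w + x - 1] : 0.0f);
}

// Box sum over [x0,x1] x [y0,y1], inclusive; assumes x0 > 0 and y0 > 0 for brevity.
inline float boxSum(const float *sat, int w, int x0, int y0, int x1, int y1)
{
    return sat[y1 * w + x1] - sat[(y0 - 1) * w + x1]
         - sat[y1 * w + x0 - 1] + sat[(y0 - 1) * w + x0 - 1];
}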

Or like I said before, what about all the laptops and the powerful office and scientific systems without adequate graphics cards?


Well, like I said, you won't be able to get very near hardware-performance unless you sacrifice quality and/or resolution... But that sorta defeats the point, I guess.

Oh, then how do you do perspective correction, mipmapping or any shader stuff, and still have sufficient registers?


I don't do perspective correction per pixel, that's just plain stupid :)
I use what I would like to call 'adaptive perspective correction', it estimates the error per poly, using linear texturemapping. Then it comes up with a value that is sure to look correct for the given poly, and does linear texturemapping of spans of that length, between 2 perspective correct points. Worst case is one div per 2 pixels. Best case is two divs per scanline.
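Simplified, one scanline boils down to something like this (a rough sketch of the idea only, not my actual code; the real thing picks spanLength per poly from the estimated error):

// Sketch: exact perspective divide only at span endpoints, linear in between.
void perspectiveSpan(float uw, float duw,     // u/w at the left end and its per-pixel step
                     float w1, float dw1,     // 1/w at the left end and its per-pixel step
                     int pixelCount, int spanLength, float *uOut)
{
    float uStart = uw / w1;                   // perspective-correct u at the left end
    for(int x = 0; x < pixelCount; x += spanLength)
    {
        int n = (pixelCount - x < spanLength) ? pixelCount - x : spanLength;
        float uwEnd = uw + n * duw;
        float w1End = w1 + n * dw1;
        float uEnd = uwEnd / w1End;           // one divide per span
        float step = (uEnd - uStart) / n;     // linear texturemapping in between
        float u = uStart;
        for(int i = 0; i < n; i++)
        {
            uOut[x + i] = u;                  // sample the texture with u here
            u += step;
        }
        uw = uwEnd; w1 = w1End; uStart = uEnd;
    }
}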
I dropped mipmapping altogether, because I didn't have time for the idea that I wanted to implement, and per-poly mipmapping didn't give satisfactory results imho.
I handcode shaders for every occasion, so I know they're optimal, and not limited by some programming model. Copy-pasting stuff from other shaders saves work :)
And as I said, do NOT keep stuff in registers, use the L1 cache.

There's nothing that prevents an assembler from only presenting a 3-operand model that selects the most optimal instruction(s) automatically. And only a few geeks would know it's actually sometimes more than one instruction. And we're not even fully aware what happens at micro-instruction level!


There are certain operations (most notably with sub) that cannot be done in 1 clk with x86 at all.
Besides, you always use 2 micro-ops anyway, while you would only need one. So a more powerful instructionset would allow you to use less operations for the same thing, and hence make more efficient use of the execution resources. That was the point.
A nice example is perhaps the IDCT routine that Intel made to show off MMX... They did a (rather low-quality) IDCT in 330 clks if I'm not mistaken (and many people have since ripped it off, and built low-quality players with it, ugh).
Motorola created the same routine on their superior G4+AltiVec CPU... They managed it in only 109 clks. Because MMX is slow and AltiVec is fast? Not really. More because AltiVec allows you to write the actual code, without moving registers around and such.

And why does suddenly "instruction merging" sound through my head? I'd even dare to say that under certain circumstances this is more efficient than the 3-operand model because the code for adding with the destination operand being the same as one of the source operands is shorter.


That's not an issue, since modern CPUs use fixed instruction lengths anyway. It's much cheaper and more efficient to decode fixed instruction length code.

If you look at RISC processors you see them being replaced every couple of years by totally new architectures. Think of x86 as some sort of very low-level abstraction layer that has passed the test of time...


Like graphics hardware for example, you mean? Let's face it, if you're not tied by a huge userbase and hardware-dependent software, you can introduce a newer, more efficient design much easier. x86 was just 'stuck' by the time RISC was becoming commonplace. If you read up on the history of Intel, you'll see that they actually produced a 32 bit RISC CPU in the late 80s or so... They probably wanted to abandon x86 already, but they never managed. So instead they went back to x86, since the RISC chip was not taking off... They're now trying again with Itanium, but it looks like AMD is going to spoil the party this time.
x86 is not good, we're just stuck with it, and making the best of it. That's not the same.

Well, there's one specific thing you could help me with: bilinear filtering. It's my biggest bottleneck in the pixel pipeline. I have an implementation that uses five MMX multiplications and does mipmapping at the same time, which runs in 25 clock cycles.


Well if you just want bilinear filter, you can try to implement a trick that I haven't been able to implement myself yet. Namely, the mipmap LOD function should be linear over the polygon... This means that the poly is basically divided by a few straight lines that separate different LODs. You could subdivide the poly at setup time so that each poly has exactly 1 LOD. Then you just feed each part to the rasterizer with no per-pixel mipmap calculations at all (note that each poly can share the rest of the gradients, and only the texture gradients need to be scaled, and each needs separate edges, assuming that you use axis-aligned gradients, which you should, anyway, since they're more stable and cheaper to do subpixel-correction with).
If you want trilinear filter (is that really necessary in software? Such a big performance hit), you could adapt this technique and use the 2 nearest LODs and blend them, I suppose.
You could also apply this for eg cubemapping and other per-pixel operations.
So anyway, that's sorta cheating... not doing per-pixel operations per-pixel :)
Actually, I believe that most hardware only does mipmapping per 2x2 block anyway.

I'm afraid that 'cheating' takes its toll on quality though...


That depends... The adaptive perspective correction for example can be tweaked so you cannot see the difference at all.
If you want to see it in action, check out: http://www.pouet.net/prod.php?which=10808
It's my Java engine that is optimized to shreds, cheating through its nose, but supporting nearly everything that a GeForce does, and the display quality is quite good I'd say, because of a stable rasterizer with bilinear filtered texturemapping.
It uses up to 3 bilerped textures at a time (embm), saturated shading, with specular per vertex, and of course the adaptive perspective correction.
I like to call it JeForce :)
And this is just Java, doing it with native C++ would already make it quite a bit faster, and then we're not even talking about MMX/SSE yet.
The thing is that it's designed to be software, not to emulate hardware. If I were to actually emulate hardware, I don't think I'd have gotten very far in Java :)

PS: Perhaps I shouldn't be saying this before you get me an internship + pay for my uni as well :)
Posted on 2003-12-12 19:34:54 by Bruce-li
Originally posted by Bruce-li
An interface for hardware, yes. Let's face it, some things are just done differently in hardware than in software.
Although the difference is getting smaller as the programmable hardware becomes more flexible though...
But an example.. not too long ago, I coded a boxfilter on my gf2... It requires log(x)+log(y) passes for an x*y sized filter kernel.
This works fine in hardware, but doing it on the CPU like that (which you would, if you would build a software-renderer for the DirectX interface and then run a program such as mine) is just instant death.
Doing an SAT-based filter on the other hand, would be quite efficient on the CPU. But you cannot code that with DirectX... Or well, with ps2.0 and float textures, you actually can, I suppose, but still not in a very efficient way for the CPU.

Sure, for some things you just need pure power. But let's not forget that the mainstream CPU now sold runs at 2.4 GHz and the mainstream GPU clocks at 250 MHz. Of course graphics cards have a lot of parallelism and are fully pipelined, but a lot has changed since we played Quake 1 on our Pentium 200 MHz!

Besides, I had a discussion with Michael Abrash a while ago about the Pixomatic software renderer being used for Unreal Tournament 2003 (which will also be standard in UT2k4). He was able to tell me that the changes to the (hardware-targeted) 3D engine were very minimal, but Pixomatic uses no special overdraw reduction method like span-buffering!

My own Quake 3 renderer also doesn't use any software-specific optimizations: Real Virtuality.
Well, like I said, you won't be able to get very near hardware-performance unless you sacrifice quality and/or resolution... But that sorta defeats the point, I guess.

Yes you can. Especially in situations where brute force is of little use. I see more and more games which require at least a Geforce 4 not because of any advanced features but just for performance reasons because they only use very basic culling. In this case a software renderer could use things like a HOM or cache all rendering calls to perform deferred rendering. That way you can still get acceptable framerates at modest resolutions.
I don't do perspective correction per pixel, that's just plain stupid :)

No it isn't. With SSE, I can use the rcp instruction, which executes in one clock cycle and gives me at least 12-bit precision. That is sufficient for all slightly further away polygons. For the near polygons I add one Newton-Raphson iteration for the reciprocal, to get 24-bit precision. So, for just a couple of clock cycles I get perfect perspective correction everywhere. Apart from linear interpolation (not doing perspective correction), I don't think there are many ways to beat that.
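In plain C intrinsics, just to show the math (a sketch, not my actual inner-loop code):

#include <xmmintrin.h>

// Sketch only: rcpss gives a ~12-bit estimate of 1/w, and one Newton-Raphson
// step x1 = x0 * (2 - w * x0) refines it to ~24-bit.
inline float reciprocal(float w)
{
    __m128 x0 = _mm_rcp_ss(_mm_set_ss(w));
    __m128 x1 = _mm_mul_ss(x0, _mm_sub_ss(_mm_set_ss(2.0f),
                                          _mm_mul_ss(_mm_set_ss(w), x0)));
    return _mm_cvtss_f32(x1);   // perspective-correct u = (u/w) * reciprocal(1/w)
}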
I use what I would like to call 'adaptive perspective correction', it estimates the error per poly, using linear texturemapping. Then it comes up with a value that is sure to look correct for the given poly, and does linear texturemapping of spans of that length, between 2 perspective correct points. Worst case is one div per 2 pixels. Best case is two divs per scanline.

I used a similar method for the Pentium II for a while. But once I decided to completely step to SSE, I found out the setup of this trick took longer than actual per-pixel correction. So -that's- pure software rendering power delivered by the CPU!
I dropped mipmapping altogether, because I didn't have time for the idea that I wanted to implement, and per-poly mipmapping didn't give satisfactory results imho.

Nowadays I use per-polygon mipmapping for distant polygons and polygons that are 'flat'. Again it makes almost no difference in performance if I use per-pixel mipmapping for the near polygons or not. This time it's not thanks to SSE but thanks to the bsr instruction which implements a log2 function.
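For clarity, here's the bsr trick in portable form (a sketch; the loop below is exactly what a single bsr instruction computes):

// floor(log2(x)) for x >= 1 -- bsr does this whole loop in one instruction.
inline int floorLog2(unsigned int x)
{
    int lod = 0;
    while(x >>= 1) lod++;
    return lod;
}

// e.g. with the texel-to-pixel rate in 16.16 fixed point (illustrative):
// int lod = floorLog2(rate) - 16;  if(lod < 0) lod = 0;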
I handcode shaders for every occasion, so I know they're optimal, and not limited by some programming model. Copy-pasting stuff from other shaders saves work :)
And as I said, do NOT keep stuff in registers, use the L1 cache.

I used to hand-code everything too. But it's way too limited and totally unmaintainable. At one point I had a few dozen shaders but only a handful actually worked. You might think that's due to bad programming practice, and you're probably partially right, but the big culprit was the method itself. You simply can't optimize tons of functions at the same time and test them after every step. If you change one simple thing, all the rest can break. A temporary solution was to use one file with a lot of preprocessor conditional directives, but it became a huge file which was unmaintainable in itself. Then I started doing the same thing but now with inline functions and templates. This worked much better, but the executable's size just kept growing exponentially every time I added a single new option! Plus, I still had to instantiate all of these template functions and write a huge 'switch' construct to select the appropriate function. And by now I was already stepping far away from the absolute hand-optimized code, because many basic operations had to read their input from memory and write it back to memory. You can say what you want, but that is two times too many mov instructions because in many cases everything can stay in registers. All in all, I can say this was a time when I thought a lot about giving up or just turning to hardware rendering, where much of it is just controlled by single transistors.

But then I saw the light. First I wanted to glue together the binary code of the functions that would compose the pixel pipeline (Pixomatic still uses this with several advancements, and calls it 'stitching'). It didn't work flawlessly (basically because I wanted too many optimizations at once), but it solved the problem of the huge executable. A caching system made sure I didn't have to do it a lot every frame. The next step was clearly to completely do the assembly compilation myself. A few months later I had a very well working prototype. It read an assembly file filled with conditional compilation directives much like I had done before. You can still see part of such a file on the main page of SoftWire. It was also the basis of the Real Virtuality demo, where the Microcode.dat file is an encrypted assembly file. But the problem of suboptimal register use was still there, and the file was too long even though I had support for includes and macros, but they were cumbersome to use because of the way I had to control the conditional compilation from C++. I even missed the syntax coloring I had in C++ files. :rolleyes:

So I took a radical new route, which I later called run-time intrinsics. They are C++ functions that have the same name as assembly instructions, and thanks to several C++ features it looks completely like assembly:


if(sampler[stage].mipmapFilter == Sampler::FILTER_POINT)
{
    shufps(xmm0, xmm0, 0xFF);
    mulss(xmm0, xmm6);
    cvtss2si(ebp, xmm0);
    bsr(ebp, ebp);
}

All these functions and their 5000 variants were automatically generated and hold an ID number into the instruction table (which in turn I had generated from the NASM manual). They don't execute the corresponding assembly instruction immediately, but store the code in a buffer. After all required run-time intrinsics are called, the buffer is used to compile and load the actual callable function. Now the conditional compilation is just a matter of C++ conditional statements and functions that don't have to be inlined or templatized. State management can be very elegant, as you can see above. Now only the problem of register use and basic optimizations remained.

But this is where run-time intrinsics become really powerful. Nothing prevents you from replacing the above registers with functions that return the register to be used. So I implemented linear-scan register allocation and the result looked like this:


if(sampler[stage].mipmapFilter == Sampler::FILTER_POINT)
{
    shufps(r128(&UV), r128(&UV), 0xFF);
    mulss(rSS(&UV), r128(&W));
    cvtss2si(x32(&lod), r128(&UV));
    bsr(x32(&lod), r32(&lod));
}

You can find all the code that uses this system in swShader. The syntax isn't the most elegant, but in the code I'm working with at the moment there is no notion of registers any more. I also added linear-scan copy propagation, dropping non-modified registers and loop optimization. The clipper and rasterizer also benefit from it because they only execute exactly those instructions that are required for a specific vertex format. Just as an example of how extreme you can go: I even made the pixel byte size (for stepping over the scanline) a soft-wired constant. So when you switch color depth it generates a new function. Without soft-wiring, you'd have to either read it from memory every time (wasting precious L1 cache space) or write several rasterizer functions just for this, which is insane and unmaintainable.
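To illustrate what soft-wiring buys you, here's the compile-time analogue (illustration only; the real thing generates the specialized code at run time, no templates involved):

// Illustration only: the pixel size becomes an immediate in the instruction
// stream ('add edi, 4' or 'add edi, 2') instead of a value read from memory
// every pixel.
template<int bytesPerPixel>
void stepScanline(unsigned char *&dest, int pixelCount)
{
    for(int x = 0; x < pixelCount; x++)
    {
        // ...write the pixel at dest...
        dest += bytesPerPixel;
    }
}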

So this is where things stand now. It's flexible, it's fast, it doesn't break when you add or optimize a feature, it's all-round elegant to use. No? :cool:
There are certain operations (most notably with sub) that cannot be done in 1 clk with x86 at all.
Besides, you always use 2 micro-ops anyway, while you would only need one. So a more powerful instructionset would allow you to use less operations for the same thing, and hence make more efficient use of the execution resources. That was the point.

There is totally no reason why 89 D8 01 C8 would be two or more micro-instructions. Depending on the CPU's implementation (instruction merging) it could be one, in which case it would be theoretically as efficient as a RISC which uses the 3-operand model. Ok, I know x86 does have its flaws, but if you look at it from an abstract point of view, with all its flexibility, I think it's not so bad after all and will survive for at least another decade (even in emulated form).
A nice example is perhaps the IDCT routine that Intel made to show off MMX... They did a (rather low-quality) IDCT in 330 clks if I'm not mistaken (and many people have since ripped it off, and built low-quality players with it, ugh).
Motorola created the same routine on their superior G4+AltiVec CPU... They managed it in only 109 clks. Because MMX is slow and AltiVec is fast? Not really. More because AltiVec allows you to write the actual code, without moving registers around and such.

You still have to factor in the clock frequency. If they added that to the x86 architecture, they would have to lower the clock frequency. And not only the instructions that effectively use three operands would suffer from it, but also the ones that could execute in less time. That's the whole idea behind micro-instructions: splitting up operations until you have micro-instructions that take the same time to execute so you don't have any waiting time. It also comes at a price of course, but I think it's still proving itself. If there was any other desktop processor that performed twice as well as x86, I don't think this forum would even exist. :grin:
That's not an issue, since modern CPUs use fixed instruction lengths anyway. It's much cheaper and more efficient to decode fixed instruction length code.

It is, but the decoder of a modern Pentium isn't really that horrendously complex. Especially if you put this in contrast with the enormous flexibility you get, I think it's all worth it. With a fixed instruction length, you're quite limited in extendability. If x86 had started off with a fixed instruction length, it would have been crushed by every other processor with modern instruction sets a long time ago. Otherwise we'd still only be using 16-bit. I'm convinced that a CPU without extendability can't stay competitive for very long.
Like graphics hardware for example, you mean? Let's face it, if you're not tied by a huge userbase and hardware-dependent software, you can introduce a newer, more efficient design much easier. x86 was just 'stuck' by the time RISC was becoming commonplace. If you read up on the history of Intel, you'll see that they actually produced a 32 bit RISC CPU in the late 80s or so... They probably wanted to abandon x86 already, but they never managed. So instead they went back to x86, since the RISC chip was not taking off... They're now trying again with Itanium, but it looks like AMD is going to spoil the party this time.
x86 is not good, we're just stuck with it, and making the best of it. That's not the same.

Nothing comes for free. Making the best of it is all we can do. But on the other hand, we -can- make the best of it. Many other architectures don't allow this. You can either look at all the things we don't get with x86, or you can look at all the things we do get and have already gotten.
Well if you just want bilinear filter, you can try to implement a trick that I haven't been able to implement myself yet. Namely, the mipmap LOD function should be linear over the polygon...

Wrong. It varies as 1/w².
This means that the poly is basically divided by a few straight lines that separate different LODs.

True, but computing these lines is computationally expensive, even if mipmap LOD varied linearly. And you don't always end up with triangles again which complicates things even more.
You could subdivide the poly at setup time so that each poly has exactly 1 LOD.

Subdividing is extremely inefficient. Not only do you have to scan many more edges, you also get shorter scanlines which are less efficient with the cache and you have to do some setup multiple times.
Then you just feed each part to the rasterizer with no per-pixel mipmap calculations at all (note that each poly can share the rest of the gradients, and only the texture gradients need to be scaled, and each needs separate edges, assuming that you use axis-aligned gradients, which you should, anyway, since they're more stable and cheaper to do subpixel-correction with).

Sub-pixel correction is implicit if you use a DDA, and you always need axis-aligned gradients for that. So, you're right, there's no extra cost per triangle there, but it's still far from optimal to go subdivide polygons.

Besides, related to what I said before, it would only be the near polygons that would really get subdivided. And since it only takes a couple of clock cycles extra to do per-pixel mipmapping, it has the same effect with much less effort and possibly better performance. I like things to scale well, so often I prefer per-pixel methods if it's possible to make them efficient without becoming totally unreadable or inflexible. If I add anisotropic filtering tomorrow I don't want to have to rewrite much of my rasterizer.

Anyway, this wasn't about fast bilinear filtering at all, was it? :wink:
If you want trilinear filter (is that really necessary in software? Such a big performance hit), you could adapt this technique and use the 2 nearest LODs and blend them, I suppose.
You could also apply this for eg cubemapping and other per-pixel operations.
So anyway, that's sorta cheating... not doing per-pixel operations per-pixel :)
Actually, I believe that most hardware only does mipmapping per 2x2 block anyway.

I never ask the question if something is really necessary. It's not I who decides how to use the renderer. If someone uses it as a reference rasterizer, for CAD or just wants one primitive to have trilinear filtering, I don't want any limitations. The whole strength of software rendering is that anything is possible!

By the way, I once did a test with 100 texture lookups (try that with hardware in one pass), and performance was still above 1 FPS! :grin: So I think trilinear filtering would scale very well too.
That depends... The adaptive perspective correction for example can be tweaked so you cannot see the difference at all.
If you want to see it in action, check out: http://www.pouet.net/prod.php?which=10808
It's my Java engine that is optimized to shreds, cheating through its nose, but supporting nearly everything that a GeForce does, and the display quality is quite good I'd say, because of a stable rasterizer with bilinear filtered texturemapping.
It uses up to 3 bilerped textures at a time (embm), saturated shading, with specular per vertex, and of course the adaptive perspective correction.
I like to call it JeForce :)
And this is just Java, doing it with native C++ would already make it quite a bit faster, and then we're not even talking about MMX/SSE yet.
The thing is that it's designed to be software, not to emulate hardware. If I were to actually emulate hardware, I don't think I'd have gotten very far in Java :)

Very impressive for a Java demo! :alright: So did you use any special tricks for the bilinear filtering, or is it the classical formula? Performance looks very close to when I enable SSE emulation. How did you get Java to perform this well? Or are the JIT compilers nowadays really that efficient?
PS: Perhaps I shouldn't be saying this before you get me an internship + pay for my uni as well :)

Well, if shader compilers are all you think about besides classes and your social life, they will come to you! :wink:
Posted on 2003-12-13 10:45:56 by C0D1F1ED
Sure, for some things you just need pure power. But let's not forget that the mainstream CPU now sold runs at 2.4 GHz and the mainstream GPU clocks at 250 MHz. Of course graphics cards have a lot of parallelism and are fully pipelined, but a lot has changed since we played Quake 1 on our Pentium 200 MHz!


That's not the point, really. The point is that with 3d hardware, texture filtering is free, perspective correction is free, and things like that...
So if you build filters that 'abuse' texture filtering and perspective correct triangles, it works fine. There is no better way... But on a CPU, you could use a better algo than a multipass one, and you don't need 2 perspective-correct textured triangles aligned in a special way to sample the pixels you want either. Then there's the problem of slow main memory... As an example... A gf4 runs the filter with a 16x16 kernel over a scene at about 200 fps... My gf2 runs it at 115 fps... My laptop, which uses an ATi IGP340M, runs 30 fps. Why? It doesn't have dedicated memory. It uses the main memory, which is 266 MHz DDR, I believe. CPUs would do even worse, since on a CPU, the triangles and filtering aren't free either. Let's say it would do 20 fps then, very positive estimate... Then you're still a factor 10 off the performance of a simple gf4 card.
And the real problem is this: a lot of software will be written to get ~60 fps from a gf4 card, and use a maximum of features. You may be lucky to get 6 fps, if you extrapolate performance from the filter example.

Besides, I had a discussion with Michael Abrash a while ago about the Pixomatic software renderer being used for Unreal Tournament 2003 (which will also be standard in UT2k4). He was able to tell me that the changes to the (hardware-targeted) 3D engine were very minimal, but Pixomatic uses no special overdraw reduction method like span-buffering!


Yea, his bilinear filter is not working properly either, it didn't look filtered properly anyway... And speedwise I wasn't that impressed... It wasn't far from my Java engine, even with simple scenes.
Span-buffering is not good by the way. It doesn't work on modern scenes anymore, there are too many polys (my demo uses scenes of about 6000 polys each, relatively little overdraw, and already zbuffer was about as fast as zsort as well, btw...), and the complexity just goes through the roof.
You could try to tile your scene, and do a span-buffer per-tile, that's an idea I had, but I'm not sure if that is going to work too well either... Only one way to find out, I guess.

My own Quake 3 renderer also doesn't use any software-specific optimizations: Real Virtuality.


But Quake 3 is basically just a software-game... It's not really different from Quake one, except for some more highpoly scenes/characters perhaps. There are no shadows (good luck doing stencil shadows in software, you need insane fillrate) for example, and not much fancy stuff otherwise.

So, for just a couple of clock cycles I get perfect perspective correction everywhere. Apart from linear interpolation (not doing perspective correction), I don't think there are many ways to beat that.


Well you've seen my engine, I get perfect perspective correction everywhere, without per-pixel operations whatsoever.

I used a similar method for the Pentium II for a while. But once I decided to completely step to SSE, I found out the setup of this trick took longer than actual per-pixel correction. So -that's- pure software rendering power delivered by the CPU!


Maybe you did the wrong setup then... My setup for the adaptive perspective correction is very simple, it may only be more expensive on really small polys... maybe < 10 pixels or so.

So this is where things stand now. It's flexible, it's fast, it doesn't break when you add or optimize a feature, it's all-round elegant to use. No?


It's nice, but as I've been saying before, you should be optimizing at a higher level.
There's a difference between performing an operation as quickly as possible and performing the quickest operation possible.

You still have to factor in the clock frequency. If they added that to the x86 architecture, they would have to lower the clock frequency.


At the time, I believe PPC and x86 were still fairly matched in clockspeed. Besides, even today, the PPC is not 3x lower clockspeed than x86, so it will still win easily.

I'm convinced that a CPU without extendability can't stay competitive for very long.


That's not the point. x86 isn't all that competitive anyway. There have always been much more powerful CPUs. The problem is that the world is stuck to x86. If it were easier to cut oneself loose from a specific instructionset, it would be much easier to introduce newer and better CPUs. At least the GPU guys realized this from the start (except 3dfx, but obviously they died).

Wrong. It varies as 1/w²


Just get the point okay? The fact that you use 1/... already shows that you might be missing that point. 1/x can never be linear. x is linear... x = w² then? QED
Besides, there are multiple mipmap formulas. You could use the min/max of the u and v directions, or the average, or the standard deviation, whatever. I'd pick a linear one for software, obviously.

And you don't always end up with triangles again which complicates things even more.


As long as they're convex polys, everything is fine. And I don't think you can subdivide a triangle into anything non-convex with just the mipmapping.

Subdividing is extremely inefficient. Not only do you have to scan many more edges, you also get shorter scanlines which are less efficient with the cache and you have to do some setup multiple times.


As long as you save time on every pixel, it may well be worth it... Besides, miplevels don't change that often, so it's not like you add so many edges anyway. Of course you can get bad cases, but I think the average case is much better, when implemented right (a lot of code can be dropped from the innerloop).

Sub-pixel correction is implicit if you use a DDA


How do you mean implicit? You need to align to hotspot, right?

and you always need axis-aligned gradients for that.


I've seen plenty of rasterizers use edge-aligned gradients though.

So, you're right, there's no extra cost per triangle there, but it's still far from optimal to go subdivide polygons.


Why? You just have to check the LOD gradient and see if it flips anywhere, if not, feed triangle as-is, else subdivide triangle at the flip-points.

So did you use any special tricks for the bilinear filtering, or is it the classical formula? Performance looks very close to when I enable SSE emulation. How did you get Java to perform this well? Or are the JIT compilers nowadays really that efficient?


The bilinear filter is 'correct', it does a weighted sum of the 4 nearest texels for every pixel. But the trick is that the filter is inlined with the rest of the shading calculations... So when I filter a texel, it is being lit at the same time (using the least-dependent method, not mathematically the fastest method!). Then I just have to saturate it after filtering, and it's done.
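For one channel the weighted sum is simply this (scalar float sketch for clarity; the real loop is integer code with the lighting folded in, and border handling is left out):

// Classical bilinear weights, single channel, no border handling.
float bilinear(const float *texture, int width, float u, float v)
{
    int x = (int)u, y = (int)v;
    float fu = u - x, fv = v - y;
    float t00 = texture[ y      * width + x    ];
    float t10 = texture[ y      * width + x + 1];
    float t01 = texture[(y + 1) * width + x    ];
    float t11 = texture[(y + 1) * width + x + 1];
    return (t00 * (1 - fu) + t10 * fu) * (1 - fv)
         + (t01 * (1 - fu) + t11 * fu) * fv;
}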
I've been trying to tell you how I got Java to perform this good, I render smarter, not harder :)
The JIT-compilers aren't too great actually... For example, the metaball-scene... I wrote the Java code based on my original D3D implementation... It is literally an order of magnitude slower... Where I'd get about 250 fps in C++ with D3D, I got about 25 fps in Java. That's probably a bad case for the JIT-compiler though, but I'd say that C++ or asm would easily outperform it in general (also note that it uses no special parallel instructions whatsoever, so it has to process the 24 bit pixels with regular ints in Java, and T&L with regular x87 float code, and special stuff because Java has to yield the exact same float results on every platform. And then there's extra overhead from garbage collection and such). I've been wanting to port it back to C++ and optimize some parts with asm to see just how fast you can get it on a modern PC, but haven't had time yet.

PS: I did make this Java engine, and also some D3D stuff, but nobody came to me yet, sadly... I was hoping that people would be interested in the Java engine, because it can run in any browser, on any OS, on any architecture, without installing extra plugins, and it should be able to compete nicely with things like Alambik, Flash, Shockwave etc.
And I can also port it to N-Gage and devices like that.
Posted on 2003-12-13 12:20:17 by Bruce-li
Originally posted by Bruce-li
And the real problem is this: a lot of software will be written to get ~60 fps from a gf4 card, and use a maximum of features. You may be lucky to get 6 fps, if you extrapolate performance from the filter example.

I never wanted to compete directly with a Geforce 4. I'm just trying to provide an alternative for the pathetic integrated graphics cards, plus offer support for things that are normally only available on high-end cards. For example, that Geforce 4 doesn't support vs/ps 2.0, but I do! And a 10x factor in performance is excellent in my opinion, and I can only hope this gap gets smaller as CPU parallelism increases (multi-core, hyper-threading with extra execution units). Last but not least there are always situations where software rendering can beat hardware rendering.
Yea, his bilinear filter is not working properly either, it didn't look filtered properly anyway... And speedwise I wasn't that impressed... It wasn't far from my Java engine, even with simple scenes.

Not far from your Java engine? Don't underestimate these guys please... they have to deal with far greater overdraw, transparency, alpha testing, mipmapping and many different stage operations. Furthermore, I noticed some rasterization cracks in your demo. That's unacceptable in a professional product and you can be sure that it takes extra computations to get everything right. By the way, I hope you've experimented with the ini file to get the quality right? Standard settings are not that optimal.
But Quake 3 is basically just a software-game... It's not really different from Quake one, except for some more highpoly scenes/characters perhaps. There are no shadows (good luck doing stencil shadows in software, you need insane fillrate) for example, and not much fancy stuff otherwise.

What do you mean "just a software-game"? Just the fact that it is targetted only at hardware rendering makes me believe the big switch was made there. And every engine just looks like the previous but with some extras. I studied the Doom III alpha and there is no reason why it would be impossible to render it in software. And stencil shadows are no problem at all. First, it's only 8-bit per pixel, secondly, MMX offers insane parallelism on this level. So there's no doubt it can run at one or less than one clock cycle per pixel. For a 2.4 GHz CPU that's 2.4 Gpixel/s and only 2.4 GB/s bandwidth so if that isn't competitive with current graphic cards? It's exactly one of those situation where the CPU is more efficient than the GPU because it's not wasting so much time for a simple operation.
Well you've seen my engine, I get perfect perspective correction everywhere, without per-pixel operations whatsoever.

Never doubted that. I just wanted to react on your "per-pixel is insane" comment. It isn't insane if you use SSE, and it's even the most efficient and best scaling.
Maybe you did the wrong setup then... My setup for the adaptive perspective correction is very simple, it may only be more expensive on really small polys... maybe < 10 pixels or so.

And what do you think the polygons are like for a model that has 2000 of them and is only 200 pixels tall? And the trend isn't going to stop. But I use linear texture mapping in such cases anyway. :grin:
It's nice, but as I've been saying before, you should be optimizing at a higher level.
There's a difference between performing an operation as quickly as possible and performing the quickest operation possible.

Run-time intrinsics give me all the abstraction I need. Plugging in a new algorithm, optimizing at a higher level, is a piece of cake. It's quite a lot harder to maintain your code if you're working with a dozen hand-optimized routines. Quite frankly, I'd say there's not a single reason to choose regular assembly over run-time intrinsics. I can just as easily optimize "performing the quickest operation possible" as much as "performing an operation as quickly as possible".
At the time, I believe PPC and x86 were still fairly matched in clockspeed. Besides, even today, the PPC is not 3x lower clockspeed than x86, so it will still win easily.

Then why does every game benchmark show that x86 still performs better? My guess is that the filter was just one specific example. I don't know much about the PPC's instruction set, but try to let it do a pavgb or pmaddwd. These instructions were added for specific algorithms that benefit a lot from them. I'm sure that once games run significantly faster on PPC, the x86 imperium would collapse. But until that day, I'm happy with ugly x86 because it does what I want it to do.
That's not the point. x86 isn't all that competitive anyway. There have always been much more powerful CPUs. The problem is that the world is stuck to x86. If it were easier to cut oneself loose from a specific instructionset, it would be much easier to introduce newer and better CPUs. At least the GPU guys realized this from the start (except 3dfx, but obviously they died).

If they really are that much more powerful, then why aren't there x86 emulators, JIT-compilers and whatnot all over the place? Then after a transition phase, PPC would just totally take over. Ok, there's a lot more to it than that, but imagine Intel produced a modern desktop RISC processor. If x86 is really that inefficient it should be easy for them to use the JIT-compiler method. It's happened before with the far inferior 68k. I can tell you, if it actually were better, it would already have been done!
Just get the point okay? The fact that you use 1/... already shows that you might be missing that point. 1/x can never be linear. x is linear... x = w² then? QED
Besides, there are multiple mipmap formulas. You could use the min/max of the u and v directions, or the average, or the standard deviation, whatever. I'd pick a linear one for software, obviously.

Euhm, what are you trying to say? The formula is 1/w², there ain't a way around that, and approximating it with 1/w (which you already have for perspective correction) or any linear approximation looks very bad.

I must admit I once tried something similar though. The idea was to calculate the mipmap level only every 16 pixels or so, keeping it constant in between. It worked, but resulted in zig-zag transition lines that were very clear to see. I reduced it to recalculating LOD every four pixels and it looked fine, but it wasn't faster. Sometimes you just can't approximate (any further) and expect it to look right and be faster.
As long as they're convex polys, everything is fine. And I don't think you can subdivide a triangle into anything non-convex with just the mipmapping.

Sure, you can triangulate those again. But that's yet another cost and added complexity.
As long as you save time on every pixel, it may well be worth it... Besides, miplevels don't change that often, so it's not like you add so many edges anyway. Of course you can get bad cases, but I think the average case is much better, when implemented right (a lot of code can be dropped from the innerloop).

Which brings me to another point... the mipmap boundaries are not straight. They are kind of radial. You don't see this on small polygons, but a big floor looks wrong without it. Here's one of my first screenshots with mipmapping: q3dm1.jpg. You can clearly see the mipmap transitions by altering your screen's gamma. I wouldn't find it acceptable if it looked any different. And I know the floor isn't one polygon, but then I just take advantage of constant mipmapping when a triangle's mipmap LOD doesn't change. And for the nearby polygons, it is -only- three more instructions, as you can see in my previous post, to perform per-pixel mipmapping. So any optimization is at most going to save the equivalent of one instruction. But all the added complexity, and having to interpolate more edges and do more scanline setup, makes me doubt it's all worth it. Subdividing polygons really sounds to me like a premature optimization that might work against you over time.
How do you mean implicit? You need to align to hotspot, right?

Yes, but only once. After that you just keep stepping discretely from one pixel center to the next. I found out this is faster than edge-aligned stepping with prestepping every scanline.
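In other words, per interpolant and per scanline it's just this (a generic sketch of the prestep, not my actual code):

#include <math.h>

// Align one interpolant to the first covered pixel centre once, then plain DDA.
void drawSpan(float xLeft, float xRight, float uAtLeft, float dudx)
{
    int x0 = (int)ceilf(xLeft  - 0.5f);                  // first pixel centre inside
    int x1 = (int)ceilf(xRight - 0.5f);                  // one past the last
    float u = uAtLeft + ((x0 + 0.5f) - xLeft) * dudx;    // the 'hotspot' alignment
    for(int x = x0; x < x1; x++)
    {
        // ...sample the texture at u for pixel x...
        u += dudx;                                       // discrete stepping afterwards
    }
}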
The bilinear filter is 'correct', it does a weighted sum of the 4 nearest texels for every pixel. But the trick is that the filter is inlined with the rest of the shading calculations... So when I filter a texel, it is being lit at the same time. Then I just have to saturate it after filtering, and it's done.

How can you light it (I mean, multiplying by diffuse, adding specular) before you have your filtered texel?
PS: I did make this Java engine, and also some D3D stuff, but nobody came to me yet, sadly... I was hoping that people would be interested in the Java engine, because it can run in any browser, on any OS, on any architecture, without installing extra plugins, and it should be able to compete nicely with things like Alambik, Flash, Shockwave etc.
And I can also port it to N-Gage and devices like that.

Well, first of all you need an impressive product. And yours is impressive, but I hate to tell you that real-time Java rendering has been done many times before and you even quote some products yourself. I'm sure it proves your interest in 3D rendering and Java, and you might get a job in this branch, but after reading a book like LaMothe's newest, anyone could copy it.

Advanced pixel shaders in software, completely compiled at run-time and optimized on the fly, that's something I haven't found anywhere else. I don't want to give myself too much credit, but I do think there's some brand new ideas in it, and it performs at a level that was considered totally unreachable for software rendering. And if you compare it with the reference rasterizer, an implementation written by professionals that work on it all day, I believe this is a nice achievement for four months of work alongside my studies.

Secondly, and possibly more importantly, you need publicity. The scene isn't really surrounded by professionals looking for new talent. Have you tried gamedev, flipCode, beyond3d? I have, and everywhere I got very positive reactions. And the things that happen on these forums get spread around among professionals and if you're lucky one will contact you and if you succeed at convincing him of your qualities then you've hit the jackpot. :grin:

So, I wish you all the luck if you really want to go for it and if you know you're up to it! :alright:
Posted on 2003-12-13 16:56:55 by C0D1F1ED
Last but not least there are always situations where software rendering can beat hardware rendering.


My point exactly, so sticking to a hw-API doesn't make sense.

Not far from your Java engine? Don't underestimate these guys please...


Why not? If the bilinear filter looks bad, and the performance of even an example with a single box is not very fast... I'm just not that impressed. And we all know that Michael Abrash is just a name, and Carmack was the real brains behind the ID stuff.

they have to deal with far greater overdraw, transparency, alpha testing, mipmapping and many different stage operations.


How can you judge that, honestly?
Why do you try to defend these guys anyway? They're not holy, you know.

Furthermore, I noticed some rasterization cracks in your demo. That's unacceptable in a professional product and you can be sure that it takes extra computations to get everything right.


There are also rasterization cracks in your Quake engine btw. In my case it's mainly because I wanted the shortest route to setup. I have alternative setup code that is more stable. I have since optimized this alternative code, and increased the precision, and ditched the code used in the demo.
Don't be so quick to judge.

What do you mean "just a software-game"? Just the fact that it is targetted only at hardware rendering makes me believe the big switch was made there.


As I say, it doesn't do anything 'exciting'. It's just Quake I/II with improved models as far as I'm concerned.
I would call Doom3 a real hardware-game, or Half-Life 2. They make excessive use of shaders and multipass algorithms to generate effects never done before in software in a game. And these algorithms don't suit software either.

And stencil shadows are no problem at all. First, it's only 8 bits per pixel, secondly, MMX offers insane parallelism on this level. So there's no doubt it can run at one or less than one clock cycle per pixel. For a 2.4 GHz CPU that's 2.4 Gpixel/s and only 2.4 GB/s of bandwidth, so how is that not competitive with current graphics cards? It's exactly one of those situations where the CPU is more efficient than the GPU because it's not wasting so much time on a simple operation.


This is rather naive I'd say. You need 2 passes per lightsource at least (using double-sided stencil), where one is full shading with textures and per-pixel diffuse/specular, and one extra pass for the ambient. So you need to render the scene 1+2*n times for n lightsources basically.
Even most hardware has trouble with that, currently, and hardware gives you virtually free T&L and more fillrate than the CPU, not to mention free rasterizing, and free z/stencil test per-pixel, and of course faster per-pixel shading operations. You'll have a LOT of trouble running this algo on the CPU. The CPU would be much better off with shadowmapping, I'd say.

Never doubted that. I just wanted to react on your "per-pixel is insane" comment. It isn't insane if you use SSE, and it's even the most efficient and best scaling.


Not doing something is always faster than doing something, so even for SSE it's faster to do scanline subdivision, this is trivial. Note that my code also automatically has shorter dependency chains because of it, and therefore schedules better.

Then why does every game benchmark show that x86 still performs better?


Perhaps because most games on Mac are actually ported x86 games, often even running on a DirectX wrapper over Mac's OpenGL? That's not really a fair comparison to begin with. Says nothing about the CPU at any rate.

My guess is that the filter was just one specific example. I don't know much about the PPC's instruction set, but try to let it do a pavgb or pmaddwd. These instructions were added for specific algorithms that benefit a lot from them.


Obviously you haven't looked into AltiVec... It's designed by Motorola, who is an expert on DSP, unlike Intel. And IDCT is not a filter, it's Inverse Discrete Cosine Transform, it's used to decode jpeg/mpeg-style lossy compression.

If they really are that much more powerful, then why aren't there x86 emulators, JIT-compilers and whatnot all over the place?


There's been an x86 JIT for Mac for years. It's called VirtualPC. It works reasonably well, but emulating foreign virtual memory is always troublesome in the worst case, of course.
Windows XP for Itanium is also getting an x86 JIT in the next service pack... It currently uses the hardware emulation circuitry of the Itanium, but it's not very efficient, so the JIT was developed.

I can tell you, if it really were better, it would already have been done!


Intel already did, in the late 80s... I believe it was the i860 CPU. It was a RISC-CPU anyway. Problem was that JIT-technology didn't exist yet, and they couldn't break away from the x86 legacy. They're going to try again now with Itanium2, but AMD64 is trying to spoil it this time.

Sure, you can triangulate those again. But that's yet another cost and added complexity.


Triangulating convex polys is as simple as a while-loop calling your rasterizer. Nothing I'd lose any sleep over.
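Something along these lines is all it takes (a sketch only; the Vertex layout and the rasterizeTriangle callback are placeholders, not anybody's actual code):

    #include <vector>
    #include <cstddef>

    struct Vertex { float x, y, z, w, u, v; };  // whatever your rasterizer expects

    // A convex polygon with n vertices becomes n-2 triangles fanned from vertex 0.
    void rasterizePolygon(const std::vector<Vertex>& poly,
                          void (*rasterizeTriangle)(const Vertex&, const Vertex&, const Vertex&))
    {
        std::size_t i = 1;
        while (i + 1 < poly.size())             // the while-loop calling your rasterizer
        {
            rasterizeTriangle(poly[0], poly[i], poly[i + 1]);
            ++i;
        }
    }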

Yes, but only once. After that you just keep stepping discretely from one pixel center to the next. I found out this is faster than edge-aligned stepping with prestepping every scanline.


Yes, that's exactly what I said.

How can you light it (I mean, multiplying by diffuse, adding specular) before you have your filtered texel?


I didn't say 'before'. I leave the rest as an exercise to the (increasingly annoying) reader.

but after reading a book like LaMothe's newest, anyone could copy it.


It's not as simple as copying a book. Writing an efficient 3d engine also requires a lot of experience and knowledge of tricks-of-the-trade... And a certain talent. Not to mention knowledge of Java-specific stuff in this case.
If you do things by the book, you get stuff like what you wrote (no offense). It does everything by the book and it looks nice, but that quake level without any dynamic shading whatsoever ran at 18 fps on my XP1800+, that is not exactly impressive. There are many engines that would run circles around yours. You sound rather naive and arrogant right now. I thought I wanted to help you out, but not if it is like this.

Advanced pixel shaders in software, completely compiled at run-time and optimized on the fly, that's something I haven't found anywhere else.


I believe there is a Finnish company that does this for their mobile phone games, they helped to develop OpenGL ES.

and it performs at a level that was considered totally unreachable for software rendering.


Depends on how you look at it. It runs shaders pretty efficiently, but purely as a software engine it's not impressive at all, as I said above.

And if you compare it with the reference rasterizer, an implementation written by professionals who work on it all day, I believe this is a nice achievement for four months of work alongside my studies.


Yes, written by your hero Michael Abrash. The idea is nice, but in practice it will mean that refrast might go from 0.001 fps to 0.01 fps when running code aimed at (shader) hardware. Which still makes the practical use 0.
My Java rasterizer can run circles around refrast as well, who cares?
I think you shouldn't stick so close to the hardware-model if you want to make the most of software rendering, and your idea of runtime compilation.

Have you tried gamedev, flipCode, beyond3d?


I've posted some of my D3D stuff as an IOTD once, and also Croissant 9, but no reactions.
Posted on 2003-12-13 18:13:16 by Bruce-li
Originally posted by Bruce-li
My point exactly, so sticking to a hw-API doesn't make sense.

Ok you do have a point there. My renderer has its own specific interface, but I still want to create a Direct3D wrapper. It might not be the most efficient choice, but it will make it possible to reach a much bigger audience. Game companies rarely want to spend any time adding software rendering support, let alone adapt to -my- interface.

Pixomatic uses its own interface in Unreal Tournament 2003 as well, but the calls were nearly directly compatible with Direct3D so the programmers got it completely working in just a few days. Mostly it was just disabling the advanced features. :grin:
Why not? If the bilinear filter looks bad, and the performance of even an example with a single box is not very fast... I'm just not that impressed. And we all know that Michael Abrash is just a name, and Carmack was the real brains behind the ID stuff.

I don't have UT2k3 on this system right now, but if I recall correctly the standard filtering just used upsampled mipmaps to fake bilinear filtering. But in the ini file you're able to choose the filter quality and change it to real bilinear!

I'm sorry but it's really Abrash who wrote the assembly code for Quake. Carmack is the brain behind the gameplay, which I admit was genius at the time. It's Quake that got me fascinated with 3D rendering.
How can you judge that, honestly?
Why do you try to defend these guys anyway? They're not holy, you know.

It's not a matter of judging, it's a fact that UT2k3 has an overdraw of three on average, is filled with transparent objects and surfaces with alpha testing, and has more than double the polygon count per level of the average Quake 3 map. They used VTune all the way to get the absolute best performance. I certainly know they're not holy, but these are professionals who have done graphics programming all their life. So, again, don't underestimate these people that you don't really know and whom you're trying to measure up against.
There are also rasterization cracks in your Quake engine btw.

Oh... Could you please point it out for me because it's not supposed to happen. Are you sure it's not a crack in the map?
In my case it's mainly because I wanted the shortest route to setup. I have alternative setup code that is more stable. I have since optimized this alternative code, and increased the precision, and ditched the code used in the demo.
Don't be so quick to judge.

Ok, no problem. I can only tell what I see!
As I say, it doesn't do anything 'exciting'. It's just Quake I/II with improved models as far as I'm concerned.
I would call Doom3 a real hardware-game, or Half-Life 2. They make excessive use of shaders and multipass algorithms to generate effects never done before in software in a game. And these algorithms don't suit software either.

Well, sure, from the current point of view there's nothing exciting about Quake 3 any more. But at that time I very well recall that people were totally amazed by it. And I still see a lot of engines today that are based on the Quake 3 engine or similar technology. And if anyone had said a few years ago that it's all possible in software, nobody would have believed him. So that's why I think it's just as much a 'hardware' game today as it was back then.

In a few years or so, I'm quite sure that Doom III will have nothing exciting any more and making it run in software will be looked upon as trivial.

You know, this has puzzled me for a while now too: it seems like everything new doesn't look as impressive any more. It's as if yesterday I was totally hooked on spherical harmonic lighting and today it seems like I've known it for years. I don't have a complete explanation for the feeling but I think it's because 3D rendering is so popular nowadays that everybody wants to do it and it's not special any more. I try to resist that feeling though, and just continue with what I find exciting at the moment...
This is rather naive I'd say. You need at least 2 passes per lightsource (using double-sided stencil): one to render the shadow volume into the stencil buffer, and one for the full shading with textures and per-pixel diffuse/specular, plus one extra pass for the ambient. So you basically render the scene 1+2*n times for n lightsources.

It's been a while since I looked at the algorithm, but I believe it's possible to do n lights in one pass if you have n stencil buffers. It's a nice example of a situation where using your own interface has great advantages. :o Wouldn't it be likely that next generation graphics cards and APIs support multiple stencil buffers as well? Clearly the current situation is sub-optimal for hardware as well.
Even most hardware has trouble with that, currently, and hardware gives you virtually free T&L and more fillrate than the CPU, not to mention free rasterizing, free z/stencil tests per-pixel, and of course faster per-pixel shading operations. You'll have a LOT of trouble running this algo on a CPU. The CPU would be much better off with shadowmapping, I'd say.

A lot of things are free on hardware, but not everything, and I expect this to get worse in the future. On the other hand, this means you'll only be 'paying for what you get'. The reason for this would be that the increasing programmability demands getting rid of the totally pipelined structure (you can't afford a dozen dividers in one pipeline) and sharing the resources. Surely a lot of the sub-operations will remain fully pipelined, and texture samplers will always be able to deliver a texel per clock. The GPU will remain faster than the CPU simply because it's designed for dedicated graphics tasks, but they certainly will look a lot more alike in the future. And when that happens, people with software rendering experience will have certain advantages.
Not doing something is always faster than doing something, so even for SSE it's faster to do scanline subdivision, this is trivial. Note that my code also automatically has shorter dependency chains because of it, and therefore schedules better.

"Not doing something" is very relative. There's still setup required and you're still linearly interpolating, and you're risking that it only works well under specific circumstances. The rcpss takes just a -single- clock cycle, which is used most of the time, and Newton's reciprocal algorithm takes just four extra. But ok, let's assume you can do it in three per pixel. What is that going to change about overall performance if the whole pipeline takes 100 clock cycles? I know I have to start small but in my opinion there are a lot more productive things to do than get this optimal.

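Concretely, that rcpss-plus-Newton-Raphson reciprocal is roughly the following (just a sketch, the function name is mine; rcpss gives about 12 bits of precision and one refinement step x1 = x0*(2 - w*x0) brings it close to full single precision):

    #include <xmmintrin.h>

    // 'w' would be the interpolated 1/w of the pixel; the result approximates 1.0f / w.
    inline float fastReciprocal(float w)
    {
        __m128 x = _mm_set_ss(w);
        __m128 r = _mm_rcp_ss(x);                   // initial approximation (rcpss)
        r = _mm_mul_ss(r, _mm_sub_ss(_mm_set_ss(2.0f), _mm_mul_ss(x, r)));  // one Newton-Raphson step
        float result;
        _mm_store_ss(&result, r);
        return result;
    }
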
By the way, there's another reason why I think this can be categorized as premature optimization. Do you know how the psx/psy shader instruction works? Ignore the rest of the paragraph if you do. It computes the gradient of -any- shader register by looking at its value in the surrounding pixels. In hardware, this is easy, because most implementations have four (or two times four) pipelines that run in parallel. They all work on the same 2x2 block of pixels, so when they all execute the psx/psy instruction at the same time, they just have to look at each other's value of the register to compute the differential. One very important use of it is mipmapping arbitrarily deformed textures. For example a 'swirl' effect needs a different mipmap level in the center than on the edges. This mipmap level has to be computed per-pixel and a fast way to get the texture gradients is to use the psx/psy instructions.
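As a rough illustration of what those gradients give you (my own sketch, not how any particular piece of hardware implements it): take the texture coordinates, in texels, at the pixels of the 2x2 block and the mipmap level falls out of the largest gradient.

    #include <cmath>
    #include <algorithm>

    // Index 0 = top-left, 1 = top-right, 2 = bottom-left pixel of the quad
    // (the fourth pixel isn't needed for this simple version).
    float mipLevelFromQuad(const float u[4], const float v[4])
    {
        float dudx = u[1] - u[0], dvdx = v[1] - v[0];   // horizontal gradients
        float dudy = u[2] - u[0], dvdy = v[2] - v[0];   // vertical gradients
        float rho  = std::max(std::sqrt(dudx * dudx + dvdx * dvdx),
                              std::sqrt(dudy * dudy + dvdy * dvdy));
        return std::max(0.0f, std::log2(rho));          // clamp: never sharper than level 0
    }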

Supporting these instructions is crucial to make many per-pixel effects look completely right (else you get aliasing or blurring effects). But now we're not working with scanlines any more, but with 2x2 pixel blocks. So all the time you've spent optimizing the scanline based mipmap level computations, and probably also the perspective correction, is lost once you step to this method. Of course you can still use the old method if no per-pixel texture effects are used, but I hope you understand why I prefer spending my time on new things once I get satisfactory performance while keeping full flexibility, rather than standing still for a performance increase that would also have been possible by overclocking. ;) Trust me, I've done my share of week-long optimization of one tiny piece of code. Only the bilinear filtering code was really worth it...
Perhaps because most games on Mac are actually ported x86 games, often even running on a DirectX wrapper over Mac's OpenGL? That's not really a fair comparison to begin with. Says nothing about the CPU at any rate.

So there are no games that are properly optimized for both x86 and PPC that can be used to convince people of the superior RISC architecture? That's a real big shame. I mean, I'd be glad to admit its superiority if I see some hard numbers!

I did a bit of looking up and found out the PPC has out-of-order execution and several hundred instructions? Where's the line between CISC and RISC really? I have to admit the instructions look more coherent and general though...
Obviously you haven't looked into AltiVec... It's designed by Motorola, who is an expert on DSP, unlike Intel. And IDCT is not a filter, it's Inverse Discrete Cosine Transform, it's used to decode jpeg/mpeg-style lossy compression.

You're almost making me believe it's a bunch of chimps working at Intel. :notsure: I know IDCT isn't a filter in itself, but it's one way to implement one. Signal processing never was my favorite course. :sweat:
There's been an x86 JIT for Mac for years. It's called VirtualPC. It works reasonably well, but emulating foreign virtual memory is always troublesome in the worst case, of course.
Windows XP for Itanium is also getting an x86 JIT in the next service pack... It currently uses the hardware emulation circuitry of the Itanium, but it's not very efficient, so the JIT was developed.

Ah, yes, I overlooked the problem of having different virtual memory models. But still, if the PPC were so much more efficient than x86, wouldn't that become of lesser importance? I really do want to believe you that it's more efficient but, and I don't mean this in a negative way, there seem to be a lot of excuses why they can't be directly compared to each other. And I understand the historical advantages x86 has but I find it hard to imagine that a superior technology can't win against it.
Intel already did, in the late 80s... I believe it was the i860 CPU. It was a RISC-CPU anyway. Problem was that JIT-technology didn't exist yet, and they couldn't break away from the x86 legacy. They're going to try again now with Itanium2, but AMD64 is trying to spoil it this time.

Are you sure they would want Itanium for the desktop market? It's a really expensive chip to produce, and its EPIC architecture is more optimal for workstation applications I think. Of course that could all change... Anyway, I'm placing my bet on extended hyper-threading and extra execution units for the next five or ten years. :alright:
Triangulating convex polys is as simple as a while-loop calling your rasterizer. Nothing I'd lose any sleep over.

Loops can be quite inefficient when jump misprediction occurs, which will happen a lot in this case. Ok I mustn't exaggerate this but there are other reasons which I've mentioned before why I wouldn't want this extra loop.
I didn't say 'before'. I leave the rest as an exercise to the (increasingly annoying) reader.

Hey, I'm not trying to annoy you. I actually like these discussions and I'm still very grateful that you've opened my eyes about the out-of-order scheduling!

Now where were we... Is it that you sample the texel for the next pixel while performing the lighting for the current pixel? I always found that a really interesting idea, but I never tried it. My main reason has always been that it would require extra copy operations. But since you've repeated so many times that the L1 cache is nearly as fast as registers, I feel compelled to try it out. It's trivial to change my pipeline now, so I'll try it as soon as possible and tell you the results tomorrow. Thanks for the idea!
It's not as simple as copying a book. Writing an efficient 3d engine also requires a lot of experience and knowledge of tricks-of-the-trade... And a certain talent. Not to mention knowledge of Java-specific stuff in this case.
If you do things by the book, you get stuff like what you wrote (no offense).

Ok that was probably a bit too rude of me. Of course it's not just copying from a book, it's applying all the theory and using it together with a lot of intelligent creativity to create new things. I've helped tons of newbies write their own software renderer but most of them just give up after a month because the math is too hard or because they can't find information about things they should actually be able to figure out by themselves. So yes, it requires a certain talent. And it's not like I think I'm the king of the world but I do think I have that talent and maybe more importantly, the perseverance. All my programming has always somehow been in relation to software rendering. When I developed SoftWire I wanted it to be an independent project but every feature I added was useful for software rendering. And I don't feel offended but I want you to know that LaMothe's latest book has taught me nothing new about 'Advanced 3D Graphics and Rasterization'. And I doubt you'll find anything about run-time intrinsics in any book either.
It does everything by the book and it looks nice, but that quake level without any dynamic shading whatsoever ran at 18 fps on my XP1800+, that is not exactly impressive. There are many engines that would run circles around yours. You sound rather naive and arrogant right now. I thought I wanted to help you out, but not if it is like this.

Whoa, whoa, lean back! A discussion where people agree on everything doesn't teach anyone anything, does it? So I'm sorry if you haven't learned anything from me yet but I think giving direct arguments (call it arrogance if you want) is an effective way to find out where I'm right and where I'm wrong. And it's not because I don't agree on certain things that I don't respect you, because I certainly do. But I also want you to understand that I have been doing this stuff for nearly five years and some of the methods and principles that have proven very useful to me are hard to let go of, even if you prove to me they can be improved or are plain wrong. So please, give it a bit of time and let me experience things the hard way instead of just telling me it's this or that way and expecting me to believe you. I don't want to put you under any kind of pressure but some more examples and demos to show your theories work would give them a lot of extra force. Because I myself have very often been wrong about something I had completely worked out in theory but which proved to have negative side-effects in practice. And I wouldn't be surprised if your mipmap sub-division theory has the same destiny...

About the 18 FPS, I'd like to stress that no optimizations have been made to reduce overdraw. It's a hardware-like brute-force rendering and Quake 3 was never designed for any software rendering. I'd really like to know what software engines would run circles around it while offering the same quality. Or you could teach me the hard way and write a faster one yourself after which you'd be my new god. ;)
I believe there is a Finnish company that does this for their mobile phone games, they helped to develop OpenGL ES.

Interesting. Could you help me find a link to their technology?
Depends on how you look at it. It runs shaders pretty efficiently, but purely as a software engine it's not impressive at all, as I said above.

I'm sorry to disappoint you. But may I ask what would have impressed you? And I don't think it would be fair to demand things you haven't already seen being done better.
Yes, written by your hero Michael Abrash. The idea is nice, but in practice it will mean that refrast might go from 0.001 fps to 0.01 fps when running code aimed at (shader) hardware. Which still makes the practical use 0.
My Java rasterizer can run circles around refrast as well, who cares?
I think you shouldn't stick so close to the hardware-model if you want to make the most of software rendering, and your idea of runtime compilation.

Ok, first of all, Abrash is not my hero or idol but I do have a great respect for him. Secondly, I happen to have reference rasterizer code that I'm sure you do not have. And it is so horribly slow because of the tons of control statements, because it loads/stores every operand from/to memory and because it has its own classes for 12-bit, 24-bit and even 32-bit floating-point emulation. Comparing that to run-time generated code where all conditional statements have been eliminated, registers are used optimally, and MMX/SSE is used for native support of adequate vector types results in a more than 10x performance increase! It's not like it's suddenly useful as a replacement for high-end hardware, but it makes the reference rasterizer useful for more than generating just one image or testing shaders in 160x120 resolution.
I've posted some of my D3D stuff as an IOTD once, and also Croissant 9, but no reactions.

Yeah people are hard to impress nowadays aren't they... :tongue:

Well, it's true, but seriously now, I don't know which IOTD it is so I'm not referring to it but you don't stand out in the crowd with things that have been done many times before. "I have a Direct3D demo" doesn't sound half as cool as "I have a new technology" even if that technology is just a trick in Direct3D. If you know you have something cool, wrap it up in a nice package. And don't be afraid to stand on the shoulders of others. Innovate, but don't reinvent the wheel unless it has nice chrome rims with per-pixel photon perturbation, euh, ... ;)
Posted on 2003-12-13 23:44:15 by C0D1F1ED
Pixomatic uses its own interface in Unreal Tournament 2003 as well, but the calls were nearly directly compatible with Direct3D so the programmers got it completely working in just a few days. Mostly it was just disabling the advanced features.


Let's not discuss the 'marvels' of Unreal's 'flexible' engine, please.

I don't have UT2k3 on this system right now, but if I recall correctly the standard filtering just used upsampled mipmaps to fake bilinear filtering. But in the ini file you're able to choose the filter quality and change it to real bilinear!


I don't have UT2k3, and I couldn't get the Pixomatic thing running with the playable demo. I only ran the example programs, which included a crate with supposedly bilinear filter, but turning it on or off made very little difference. It didn't look like true bilinear filtering such as my Java engine does, or hardware.

I'm sorry but it's really Abrash who wrote the assembly code for Quake. Carmack is the brain behind the gameplay, which I admit was genius at the time.


It doesn't matter who wrote the asm, it matters who designed the engine's architecture. You can teach any monkey to write routines in asm. But getting the design right is what matters. And I'm quite sure that Carmack has done the most important work there, with his BSP trees and all.

It's Quake that got me fascinated with 3D rendering.


n00b :)
I've been doing graphics since the heyday of the Amiga 500... Well even before that, on my old C64, but that was really too simple to mention.
A lot of the stuff wasn't even invented by then, and/or not to be found in books. Funny enough I don't own any books on graphics at all.

I certainly know they're not holy, but these are professionals who have done graphics programming all their life. So, again, don't underestimate these people that you don't really know and whom you're trying to measure up against.


And what am I, chopped liver?
Besides, if you can't measure up to anything or anyone, how can you ever compare?
I just said I wasn't impressed with a badly bilinear-filtered crate running at an unimpressive framerate. Don't get so worked up over the issue. Just because not everyone's name is Michael Abrash doesn't mean that they are necessarily less talented or experienced.

Oh... Could you please point it out for me because it's not supposed to happen. Are you sure it's not a crack in the map?


I cannot seem to find it right now, but I noticed a crack on the right side of a ... well wooden bar, over a portal or something...

But at that time I very well recall that people were totally amazed by it. And I still see a lot of engines today that are based on the Quake 3 engine or similar technology.


But that's not the point. The point is that it just does some simple low-poly triangle rendering (my embryo-scene has about 6000 polys in it, nearly all on screen at the same time, that's more than the average Quake level I believe :)), with lightmaps as the most interesting 'feature'. This has been done in Quake 1 aswell, in software. The only difference is that it's not using software anymore. BSP trees are quite suboptimal for (T&L) hardware btw.

Wouldn't it be likely that next generation graphics cards and APIs support multiple stencil buffers as well? Clearly the current situation is sub-optimal for hardware as well.


I don't think so... Firstly, multiple stencilbuffers would have much less of an effect on hardware than they do on software (T&L is much cheaper, as is z/stencil testing (not a per-pixel operation, hierarchical z/stencil buffers!), so multipass rendering is much more similar to conditional rendering than in software), and they would only make the hardware itself many times more complicated. Also, you still need 1 pass for each lightsource to fill the proper stencilbuffer, because each lightsource has its own unique shadowvolume geometry (in theory you can render it in 1 pass, but that doesn't matter, it's still the same amount of geometry and pixels to be processed). Secondly, stencil shadows are not the be all, end all solution to shadows. They work nice on DX7-class hardware (yes Doom3 is already outdated), but I think eventually shadowmaps will take over, because they're more efficient to render, and they can be filtered for softer shadows.

And when that happens, people with software rendering experience will have certain advantages.


I already considered software rendering experience an advantage for the first shader generation.

The rcpss takes just a -single- clock cycle, which is used most of the time, and Newton's reciprocal algorithm takes just four extra. But ok, let's assume you can do it in three per pixel. What is that going to change about overall performance if the whole pipeline takes 100 clock cycles?


You now leave out the muls and other stuff that you need aswell. And all that code is dependent aswell.
I need exactly 1 add per gradient per pixel, and each gradient is independent of every other, since the dependency on w is removed.
Besides, you seem to miss the point about innerloops... those 100 cycles are done per pixel.
If you have a setup that is 100 clks (not that it takes that long anyway), it is done only once. So at most you lose one pixel per polygon in terms of fillrate, if it doesn't help.
But if it does help, you may save per pixel, not per polygon, and since you have thousands of times more pixels than polygons, the gains can be huge. This is really standard optimization theory.
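Roughly like this, to make it concrete (purely an illustrative sketch; all names are invented and the span length of 16 is an arbitrary choice): u/w, v/w and 1/w are stepped with one add each, the divide happens once per span, and inside the span u and v are stepped linearly between the two corrected endpoints.

    // Attributes at the start of the scanline span; deltas are per pixel.
    void drawSpan(int x0, int x1,
                  float uw, float vw, float iw,        // u/w, v/w, 1/w at x0
                  float duw, float dvw, float diw,     // per-pixel deltas
                  void (*shadePixel)(int x, float u, float v))
    {
        const int STEP = 16;
        float u0 = uw / iw, v0 = vw / iw;              // perspective-correct values at the span start
        for (int x = x0; x < x1; )
        {
            int n = (x1 - x < STEP) ? (x1 - x) : STEP;
            float uwN = uw + duw * n, vwN = vw + dvw * n, iwN = iw + diw * n;
            float u1 = uwN / iwN, v1 = vwN / iwN;      // correct values at the span end
            float du = (u1 - u0) / n, dv = (v1 - v0) / n;
            float u = u0, v = v0;
            for (int i = 0; i < n; ++i, ++x)           // inner loop: one add per gradient per pixel
            {
                shadePixel(x, u, v);
                u += du;
                v += dv;
            }
            uw = uwN; vw = vwN; iw = iwN;
            u0 = u1;  v0 = v1;
        }
    }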

But now we're not working with scanlines any more, but with 2x2 pixel blocks.


And you couldn't think of a way to store the results from the previous scanline, or perhaps even render 2 scanlines in the same loop? Or did you try these things, and did it work out worse?

Where's the line between CISC and RISC really? I have to admit the instructions look more coherent and general though...


You pretty much answered your own question there. True RISC doesn't really exist anymore either, it has more or less inherited the CISC problems now. But the starting point is different, more RISC-like, so the problems are always less. But as you see, we're moving to VLIW/EPIC now.

You're almost making me believe it's a bunch of chimps working at Intel.


Compare AltiVec with MMX/SSE/SSE2, and draw your own conclusions.

And I understand the historical advantages x86 has but I find it hard to imagine that a superior technology can't win against it.


If you ever had an Amiga, you'd be painfully aware of the fact that the best hardware doesn't always win. It's naive to think that technology is all that matters... There are many other things that matter, such as backward compatibility, price, marketshare, marketing by the company, etc.

Are you sure they would want Itanium for the desktop market? It's a really expensive chip to produce, and its EPIC architecture is more optimal for workstation applications I think. Of course that could all change...


Itanium is expensive because of 2 reasons mainly:
1) It has a lot of cache
2) It is produced in relatively low volumes, and the demand for it is very low.

1) would be easily solved by removing some cache, which they already did, by the way. Besides, Intel has always marketed their latest CPUs as server/workstation CPUs first... I've seen them do it with 386, 486, Pentium... at least they have Xeon now, but still, eventually the technology becomes mainstream, I don't see why Itanium couldn't become mainstream aswell.
2) would be solved if the demand got higher. x86 is cheap mainly because it is produced in insane volumes, and Intel can invest almost endlessly in further development of the x86, both architecturally and in the manufacturing process. If they could focus on the Itanium instead, things would look a lot different all of a sudden. All non-Intel competitors face the same problem, and simply have fewer resources to make their RISC/VLIW CPUs competitive.

Loops can be quite inefficient when jump misprediction occurs, which will happen a lot in this case. Ok I mustn't exaggerate this but there are other reasons which I've mentioned before why I wouldn't want this extra loop.


Again, this is not an innerloop, it's much less costly. Trivial optimization rules. I mean, if you spend 100 clks per pixel, and on average render what... 100 pixels (very low estimate for a quake level obviously), you spend 10.000 clks in each call from the outer loop... So who cares if the outerloop gets mispredicted once, it will add a handful of clks that you'd never even notice.

And it's not like I think I'm the king of the world but I do think I have that talent and maybe more importantly, the perseverance.


You also seem to think that others don't.

Interesting. Could you help me find a link to their technology?


I believe it was these guys: http://www.hybrid.fi/

I'd really like to know what software engines would run circles around it while offering the same quality.


Well, since nobody does software rendering anymore, this is an easy win.
But a friend of mine had a bsp viewer where he had added bilinear filter with MMX, back in the PII 233 days, and it could render a Quake 1 map pretty nicely in 640x480... It served more or less as the blueprint for the Java engine before my current one. I wouldn't be surprised if that thing could handle Q3 maps at a much higher framerate than yours, with more or less the same quality (it did per-poly mipmap though, I believe).
But if you don't mind, I don't feel like writing an entire software renderer just to prove a point which I am pretty sure of already. As you've seen yourself, my Java engine is not even all that slow compared to yours without SSE. Imagine that one ported to C++, and then applying some SSE/MMX (also note that my scenes generally are more highpoly than Quake scenes, I use no BSP or anything, only zbuffer, and I use dynamic lightsources, bumpmapping and such).

I'm sorry to disappoint you. But may I ask what would have impressed you? And I don't think it would be fair to demand things you haven't already seen being done better.


I'll know it when I see it. (now isn't that a rotten answer? :))
It's just that in my frame of reference, I wasn't too impressed, because I've seen similar stuff on less hardware.

Secondly, I happen to have reference rasterizer code that I'm sure you do not have.


How can you be sure?
Besides, hacking at the design of the refrast is easy. It was probably written under time-pressure, because a reference was required for the developers of hardware and software. Speed was not an issue.
But as you see, the entire state-machine system for hardware doesn't suit software very well.
Posted on 2003-12-14 06:43:39 by Bruce-li
Originally posted by Bruce-li
Let's not discuss the 'marvels' of Unreal's 'flexible' engine, please.

I wasn't planning to, but now that you mention it... just kidding. :grin: But why are you so opposed to Quake 3 and Abrash and now Unreal?
I don't have UT2k3, and I couldn't get the Pixomatic thing running with the playable demo. I only ran the example programs, which included a crate with supposedly bilinear filter, but turning it on or off made very little difference. It didn't look like true bilinear filtering such as my Java engine does, or hardware.

How can you judge Pixomatic in UT2k3 if you haven't even seen it? I think that's pretty unfair. And yes, their demo from RAD Game Tools isn't that impressive. Their -real- demo is the UT2k3 renderer.
It doesn't matter who wrote the asm, it matters who designed the engine's architecture. You can teach any monkey to write routines in asm. But getting the design right is what matters. And I'm quite sure that Carmack has done the most important work there, with his BSP trees and all.

Again, why are you looking down on Abrash so much? It is he who invented the texture coordinate increment trick with sbb that executes in just five clock cycles per pixel in the innermost loop. There have been several talented people working on 3D software rendering at that time but I don't think anybody had a faster method. It's only simple -after- it's been done. It's like my linear-scan copy propagation algorithm. I searched long to see if something similar didn't already exist, but everybody seems to be using coalescing and such in a graph. So I worked on it myself and after a week of experimenting with methods that did not work or did not perform optimally I came up with an elegant solution that has only a tiny compromise. It's almost trivial when I look at it again, but you can trivialize anybody's work without truly realizing how much effort went into it. And I'm not saying you can't reproduce that, but I would rather be grateful that you can stand on the shoulders of such people, save yourself a lot of time and do more productive things.
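To give an idea of the trick for readers who haven't seen it (this is only my rough C++ rendering of the carry idea, reduced to one coordinate; the real inner loop is a handful of x86 instructions where adc/sbb pick the step through the carry flag, and all names here are made up):

    #include <cstdint>

    // The texel pointer is advanced by one of two precomputed offsets depending on
    // whether the fractional part of u overflows; no multiply per pixel, and in the
    // original asm the carry flag does the selection without any branch.
    void textureSpan(const uint8_t* texel, uint8_t* dest, int count,
                     uint32_t uFrac, uint32_t duFrac,   // 0.32 fixed-point fraction of u
                     intptr_t stepNoCarry, intptr_t stepCarry)
    {
        for (int i = 0; i < count; ++i)
        {
            dest[i] = *texel;
            uint32_t sum = uFrac + duFrac;              // wraps around on overflow
            texel += (sum < uFrac) ? stepCarry : stepNoCarry;   // the 'carry' selects the step
            uFrac = sum;
        }
    }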
n00b :)
I've been doing graphics since the heyday of the Amiga 500... Well even before that, on my old C64, but that was really too simple to mention.
A lot of the stuff wasn't even invented by then, and/or not to be found in books. Funny enough I don't own any books on graphics at all.

Well at that time I was a newbie at graphics rendering, yes. But before that I've also done some work on a computer with a green screen whose name I don't even remember, and on a Commodore 64 as well. I wrote a symbolic differentiator (with respect to x) and drew the graphs, which I could zoom in on interactively (adaptive refinement) until the point of numerical imprecision. :cool: None of it I found in books either; I had only just learned differentiation at school. And the first book I bought was Real-Time Rendering by Moller & Haines, after the Quake 3 demo when I had already started with swShader. It now serves as a little reference sometimes. Just two weeks ago I received LaMothe's book, and now it's only sitting there on my bookshelf just so I can say I have it. By looking at the table of contents I knew it was going to be a waste of money before I bought it, so I'll just have to be happy with its pretty cover...
And what am I, chopped liver?

I've never seen chopped liver have such a furious discussion with me, so I guess not. Care to tell a bit more about yourself?
Besides, if you can't measure up to anything or anyone, how can you ever compare?

I didn't say you can't, I just said you shouldn't underestimate anyone you're trying to compete with.
I just said I wasn't impressed with a badly bilinear-filtered crate running at an unimpressive framerate. Don't get so worked up over the issue. Just because not everyone's name is Michael Abrash doesn't mean that they are necessarily less talented or experienced.

It's true. That demo doesn't show the full potential of Pixomatic. And I never said anyone's name has to be Abrash to be talented or experienced, I'm just saying he is a great example of someone who is.
I cannot seem to find it right now, but I noticed a crack on the right side of a ... well wooden bar, over a portal or something...

Ah, yes, you could be referring to two things: that Quake 3 level has a lot of tiny ledges. From certain angles it can look like cracks but if you get closer and look at it from above or below you see it's really thin polygons. A second effect is that in some places I should have used texture clamping to avoid the seams that appear where two polygons with different textures meet. The information is stored in the shader files but I never bothered implementing it. I'm quite confident there are no real cracks caused by rasterization because I did various tests and the math is discrete and exact in theory.
But that's not the point. The point is that it just does some simple low-poly triangle rendering (my embryo-scene has about 6000 polys in it, nearly all on screen at the same time, that's more than the average Quake level I believe :)), with lightmaps as the most interesting 'feature'. This has been done in Quake 1 aswell, in software. The only difference is that it's not using software anymore. BSP trees are quite suboptimal for (T&L) hardware btw.

When you start the Quake 3 demo and turn around 180 degrees, there are 6000 polygons effectively being drawn, and there's 2.5x overdraw. So, no doubt your Java renderer is powerful, but who are you trying to fool? The lightmaps in Quake 1 are static, in the sense that they are premodulated with the texture and placed in a cache. Quake 3's lightmaps are dynamic, and you can't see it in my demo but I do sample and filter the lightmap at every pixel.
I don't think so... Firstly, multiple stencilbuffers would have much less of an effect on hardware than they do on software (T&L is much cheaper, as is z/stencil testing (not a per-pixel operation, hierarchical z/stencil buffers!), so multipass rendering is much more similar to conditional rendering than in software), and they would only make the hardware itself many times more complicated. Secondly, stencil shadows are not the be all, end all solution to shadows. They work nice on DX7-class hardware (yes Doom3 is already outdated), but I think eventually shadowmaps will take over, because they're more efficient to render, and they can be filtered for softer shadows.

I fully agree that shadowmaps are far superior. I once had a discussion about it at Beyond3D and the main reason why stencil shadowing is so popular is that it works on old hardware without precision issues.
You now leave out the muls and other stuff that you need aswell. And all that code is dependent aswell.
I need exactly 1 add per gradient per pixel, and each gradient is independent of every other, since the dependency on w is removed.

Ok, one extra multiply per texture then. And eliminating these few extra instructions is going to make it super fast? And you still need the control for the sub-spans. Either you do it with an extra loop or you unroll it. The first solution requires extra inc/cmp/jmp instructions or so per pixel which makes the optimization rather pointless. The second solution also requires some setup but let's assume it's averaged out to less than a few clock cycles per pixel. Then you still have to deal with more jump mispredictions (a Pentium 4's 20-stage pipeline easily makes any optimization futile) and longer code. That's not really an option for my shader cache. So it's a balance of flexibility and performance here as well. And like I said before I don't like spending my time implementing something that poses an architectural limitation that might later just become deprecated. Per-pixel perspective correction works, let's move on...
Besides, you seem to miss the point about innerloops... those 100 cycles are done per pixel.
If you have a setup that is 100 clks (not that it takes that long anyway), it is done only once. So at most you lose one pixel per polygon in terms of fillrate, if it doesn't help.
But if it does help, you may save per pixel, not per polygon, and since you have thousands of times more pixels than polygons, the gains can be huge. This is really standard optimization theory.

Well, if it were only an extra 100 clock cycles per polygon, it would be fine. But there are things that have to be computed per scanline and new things to be done per pixel as well. You can't just move everything out of the inner loop. K.I.S.S. is also standard optimization theory.
And you couldn't think of a way to store the results from the previous scanline, or perhaps even render 2 scanlines in the same loop? Or did you try these things, and did it work out worse?

I once tried to move perspective correction (here we go again) into a second loop back when I worked on a Pentium II. The main advantage was that loop unrolling was no problem and no emms was needed in the actual pixel loop. Luckily it did pay off and simple scenes were faster, but there was one thing I overlooked... Normally the z-buffer test prevents -any- further per-pixel calculations for occluded pixels. But with this extra loop I had no other choice than to compute the perspective correction for every pixel. So in practice, in a scene with overdraw, it wasn't faster. I tried to solve the problem but every extra measure undid the gains from the separate loop or made them marginal.

Anyway, I tried sampling one texel in advance and got 48 FPS instead of 47 for a skybox scene. Might have been the wind...
You pretty much answered your own question there. True RISC doesn't really exist anymore either, it has more or less inherited the CISC problems now. But the starting point is different, more RISC-like, so the problems are always less. But as you see, we're moving to VLIW/EPIC now.

I still think it's all very relative. A Pentium 4 core is already pretty much RISC-like, and the decoder is quite efficient to provide the required abstraction layer. Likewise, I don't think it's impossible that extended instruction merging could eventually become very much VLIW-like.
Compare AltiVec with MMX/SSE/SSE2, and draw your own conclusions.

It's more elegant and requires fewer instructions to perform certain tasks, yes, but concluding from this that it's inherently faster...
If you ever had an Amiga, you'd be painfully aware of the fact that the best hardware doesn't always win. It's naive to think that technology is all that matters... There are many other things that matter, such as backward compatibility, price, marketshare, marketing by the company, etc.

That's very correct. But if a technology can keep on top for several years, only then people will start to realize it's worth investing in it. And with 'on top' I mean it has to run x86 with an emulator and beat it in every area! Otherwise we're just talking about a marginal difference that can be compensated by other technology and it's best to stick with x86 in that case. It's all about economics after all.
Again, this is not an innerloop, it's much less costly. Trivial optimization rules. I mean, if you spend 100 clks per pixel, and on average render what... 100 pixels (very low estimate for a quake level obviously), you spend 10.000 clks in each call from the outer loop... So who cares if the outerloop gets mispredicted once, it will add a handful of clks that you'd never even notice.

Like I said before, it adds a lot more side-effects than just a jump misprediction. All these negative effects may very well add up to 100 clock cycles, which corresponds to the single clock cycle you were trying to save per pixel. And even if you do have a speedup, it would be one "you'd never even notice". So I'm not going to spend my time on something that makes my code a lot less elegant and might limit me in the future.
You also seem to think that others don't.

Many other people don't, some do.
I believe it was these guys: http://www.hybrid.fi/

Looks sweet. Couldn't find any reference to run-time compilation though.
Well, since nobody does software rendering anymore, this is an easy win.

Wait a second. First you say you know several engines that would run circles around my renderer, and now you say it's an easy win because nobody does software rendering anymore. Ok then I guess...
But a friend of mine had a bsp viewer where he had added bilinear filter with MMX, back in the PII 233 days, and it could render a Quake 1 map pretty nicely in 640x480...

As I said a few lines above, Quake 1 does only one texture. It also does mipmapping per polygon, has ten times less polygons in a map, uses a span buffer to reduce overdraw to zero and the BSP cells do not overlap and are completely convex. So, I'm quite sure it was pretty neat back then, but it's not like it's going to run circles around my renderer if it supported dynamic lightmaps, accurate mipmapping and high-poly BSPs with concave elements.

Recompiling that BSP so it has convex cells would give it too many splits and a deep tree. Span-buffering isn't optimal for this situation, you said so yourself. And dynamic lightmapping and accurate mipmapping are not something you get for free either. The only thing that seems doable to me is to create a map with portals like Unreal, to get rid of overdraw issues. But I have little interest in that because I love the stuff below the rendering API interface more. That's also why I won't make any engine-specific optimizations.
It served more or less as the blueprint for the Java engine before my current one. I wouldn't be surprised if that thing could handle Q3 maps at a much higher framerate than yours, with more or less the same quality (it did per-poly mipmap though, I believe).

I'd honestly be thrilled to see it. Good luck with the automatic portalizer!
But if you don't mind, I don't feel like writing an entire software renderer just to prove a point which I am pretty sure of already.

Yeah, that's like your favorite football team saying "we could have won today but we didn't feel like it, just to prove a point which we are sure of already". Seriously, an idea isn't worth a thing before it has been put into practice.

But I do understand you don't have the time or would rather do something else. But then if I were you I wouldn't start claiming I can do this or that way better. I would just keep it to myself until I can prove it, or make some suggestions that I am eager to learn about myself and be willing to admit if it was wrong.
As you've seen yourself, my Java engine is not even all that slow compared to yours without SSE. Imagine that one ported to C++, and then applying some SSE/MMX (also note that my scenes generally are more highpoly than Quake scenes, I use no BSP or anything, only zbuffer, and I use dynamic lightsources, bumpmapping and such).

Yes, for a Java engine it's pretty damn good and I don't think I've ever seen such performance before. :alright: But you have to stop claiming you could do better than me without showing anything. I'm very willing to believe you and admit I'm wrong but you won't get far with theory.
I'll know it when I see it. (now isn't that a rotten answer? :))
It's just that in my frame of reference, I wasn't too impressed, because I've seen similar stuff on less hardware.

Well then, allow me to give you a rotten response as well. Until I see a demo of yours that proves me wrong, -I- am the one with the fastest Quake 3 software renderer. And the only person who at the moment I do believe can beat me is Abrash because he's doing something similar. All the rest is just a lot of moving air.
How can you be sure?

Because you wouldn't have said those ridiculous things about only being 10x faster than the reference rasterizer if you knew how it works. And the stuff below is even more convincing:
Besides, hacking at the design of the refrast is easy. It was probably written under time-pressure, because a reference was required for the developers of hardware and software. Speed was not an issue.
But as you see, the entire state-machine system for hardware doesn't suit software very well.

If you had the code you would have known that they did include several optimizations that must have taken some time to implement. Either that, or you're just playing stupid. So instead of just bluffing, prove it or admit it.
Posted on 2003-12-14 18:20:09 by C0D1F1ED
But why are you so opposed to Quake 3 and Abrash and now Unreal?


Because Quake 1 was impressive, Quake 2 and 3 are just more of the same. Abrash, well, I thought it was a public secret that eg most of the stuff from his Black Book was actually taken from other people, and he was NOT the big man behind Quake etc. As for Unreal... You may not know this, being a n00b and all, who doesn't even know how to do stencil shadows... but designing a hardware-renderer for multiple APIs is VERY sucky... and the Unreal engine indeed is very sucky... They had one that used dynamic geometry for the entire scene, basically as a result of this 'flexible' design. I hope they have fixed it by now, else that is really sad.

How can you judge Pixomatic in UT2k3 if you haven't even seen it?


I never said I was judging it in UT2k3... I was just not impressed with the way it rendered a simple crate with broken bilinear filter at a lousy framerate.

It is he who invented the texture coordinate increment trick with sbb that executes in just five clock cycles per pixel in the innermost loop.


Are you sure about that? The Amiga-guys have been using the addx-trick to get 5-instruction innerloops for tmappers for a LOONG time. Are you sure it did not originate from there?

Care to tell a bit more about yourself?


As if you're interested, you seem too self-absorbed.

I didn't say you can't, I just said you shouldn't underestimate anyone you're trying to compete with.


I'm not trying to compete with Abrash, thank you very much.

I'm just saying he is a great example of someone who is.


You need better heroes, boy.

So, no doubt your Java renderer is powerful, but who are you trying to fool?


The question is more: who are YOU trying to fool?

And eliminating these few extra instructions is going to make it super fast?


Eliminating instructions generally makes routines faster, yes. Adding them surely isn't the answer.

When you start the Quake 3 demo and turn around 180 degrees, there are 6000 polygons effectively being drawn, and there's 2.5x overdraw.


Ah yes, then it runs at a lovely 8 fps, gee, how great.

The first solution requires extra inc/cmp/jmp instructions or so per pixel which makes the optimization rather pointless.


Are you a n00b or something? sub/jnz is all you need, and thanks to branch-prediction, this is just a 1-clk test most of the time. And all the time you gain from the linear gradients gives you plenty of time to take that branch every now and then and update the gradients. In my Java-engine the performance is almost identical to linear texturemapping, while per-pixel perspective is much slower.

(a Pentium 4's 20-stage pipeline easily makes any optimization futile)


Are you a n00b or something? A P4 has trace-cache, and therefore its effective pipeline length in loops is far less than 20 stages. It also makes loop unrolling or loop aligning pretty much useless.

I still think it's all very relative. A Pentium 4 core is already pretty much RISC-like, and the decoder is quite efficient to provide the required abstraction layer.


You seem to sound like a n00b who's never used anything other than x86.
Just because x86 is a working solution doesn't mean it's a good solution.

Looks sweet. Couldn't find any reference to run-time compilation though.


Some of them hang in #coders on IRCnet sometimes, ask about it there.

Wait a second. First you say you know several engines that would run circles around my renderer


WOULD yes, if they would be adapted to render the same quality as yours, since you added that requirement.

So, I'm quite sure it was pretty neat back then, but it's not like it's going to run circles around my renderer if it supported dynamic lightmaps, accurate mipmapping and high-poly BSPs with concave elements.


You seem so confident of yourself. It'd almost be worth it to write a good engine and shove your arrogant face in it.

I'd honestly be thrilled to see it. Good luck with the automatic portalizer!


Grow up, I'm not going to screw around with age-old code just to please you. I'm busy with hardware-rendering now.

But then if I were you I wouldn't start claiming I can do this or that way better.


I never said that. I said it could be done better, not necessarily by me, and if you are honest, you would have to admit this aswell.

or make some suggestions that I am eager to learn about myself and be willing to admit if it was wrong.


I made some suggestions but you are so confident that your engine is already perfect (which it isn't), that you don't take any advice anyway. So why would I bother?

But you have to stop claiming you could do better than me without showing anything. I'm very willing to believe you and admit I'm wrong but you won't get far with theory.


I'm not claiming that I can do better than you. I'm just saying that my Java engine has a more efficient renderer than you do, because I do things differently than you do.

Until I see a demo of yours that proves me wrong, -I- am the one with the fastest Quake 3 software renderer. And the only person who at the moment I do believe can beat me is Abrash because he's doing something similar. All the rest is just a lot of moving air.


I can't decide if you are too arrogant, or just too naive.
I have decided though that it is useless to try and help you, because you think you know everything better anyway.
Posted on 2003-12-15 04:12:34 by Bruce-li
Humm, where do I get the Pixomatic stuff? I do have UT2k3, and I'd like to see if the Pixomatic plugin thingy is worth anything - the RAD Game Tools demo certainly wasn't.

C0D1F1ED, your engine might be all fast and dandy etc, but the quake3 demo looks... bland and extremely washed-out. Try reducing the gamma? Would be a better show-off if it actually looked good ;)

And bruce-li, what about calming down a bit?
Posted on 2003-12-15 08:29:15 by f0dder
Pixomatic stuff: http://www.radgametools.com/pixomain.htm

And bruce-li, what about calming down a bit?


I'm annoyedly calm :)
Posted on 2003-12-15 08:58:59 by Bruce-li
Originally posted by Bruce-li
Because Quake 1 was impressive, Quake 2 and 3 are just more of the same.

Why would they choose to do something else? Don't change a winning team. And there's not an endless range of game genres to work on. Id clearly focuses on first-person shooters; let others work on what they do best.
Abrash, well, I thought it was a public secret that eg most of the stuff from his Black Book was actually taken from other people, and he was NOT the big man behind Quake etc.

Do you know one person who didn't learn most of his knowledge from other people? Or maybe you were born with the same knowledge you know now? Besides, like I said before, an idea isn't worth anything until it has been put into practice. The most successful people in the world are the ones who put other people's ideas into reality.
As for Unreal... You may not know this, being a n00b and all, who doesn't even know how to do stencil shadows... but designing a hardware-renderer for multiple APIs is VERY sucky... and the Unreal engine indeed is very sucky...

I am not a newbie. And I do know stencil shadows. And as far as I know the Unreal engine has always run perfectly on multiple APIs. Many times when there's a problem it's actually a driver bug. So what makes you say it sucks?
I never said I was judging it in UT2k3... I was just not impressed with the way it rendered a simple crate with broken bilinear filter at a lousy framerate.

Too bad but I was referring to UT2k3. Don't judge and underestimate things you don't really know.
Are you sure about that? The Amiga-guys have been using the addx-trick to get 5-instruction innerloops for tmappers for a LOONG time. Are you sure it did not originate from there?

Even if he wasn't the first one to use such an instruction for texture mapping, he was still the first one to apply it on x86. Could you please show me a reference about the addx trick that dates from a LOONG time before '94? And he did a lot more work for Quake, like the artificial intelligence core. But I'm sure you find that trivial as well...
As if you're interested, you seem too self-absorbed.

Yes I'm very interested. What's your age, what do you study, what's your real name, where do you live? It's great you're so good at all this but if you sign it with just Bruce-li, well, you'll be just as anonymous as the brilliant Amiga programmer who used addx for texture mapping. Or maybe it might have just been Abrash after all...

And before you ask: I'm 21 years old, I'm in the third year of civil engineering in computer science at the University of Ghent, my real name is Nicolas Capens, and I live in Sint-Niklaas, Belgium, but during the week I stay in Ghent.
I'm not trying to compete with Abrash, thank you very much.

Well then I still don't fully understand why you attack his work so personally. It might be just a feeling, but it's like he's done something wrong to you, or you're just jealous of him.
You need better heroes, boy.

He's not my hero. He's just one of the many people I respect, something you don't seem to be able to do.
The question is more: who are YOU trying to fool?

Nobody. What you see is what you get. At least I do have a Quake 3 software rendering demo. So who were you trying to fool again?
Eliminating instructions generally makes routines faster, yes. Adding them surely isn't the answer.

Sometimes it is. A Newton-Raphson division approximation takes extra instructions but it's certainly faster than one division at full precision.

Anyway, what I was referring to is that one extra instruction isn't going to kill performance. And if it makes my code more elegant and general, I'm glad to sacrifice a little performance, if that would really be the case. It's one of the reasons that made "premature optimization is the root of all evil" a famous quote. And before you try to 'win' this discussion again with things like changing the subject and turning my question back on me: yes, I do know those aren't the exact original words.
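To give a rough idea of what I mean, here's an illustrative sketch with SSE intrinsics (not the actual swShader code, the function name is just made up):

#include <xmmintrin.h>   // SSE intrinsics

// Approximate 1/a for four packed floats. rcpps gives roughly 12 bits of
// precision; one Newton-Raphson step, x1 = x0 * (2 - a * x0), about doubles that.
static inline __m128 reciprocal_nr(__m128 a)
{
    __m128 x   = _mm_rcp_ps(a);              // initial estimate
    __m128 two = _mm_set1_ps(2.0f);
    return _mm_mul_ps(x, _mm_sub_ps(two, _mm_mul_ps(a, x)));
}

Three extra instructions compared to the bare rcpps, but typically still a lot cheaper than a full-precision divps, which is the whole point.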
Ah yes, then it runs at a lovely 8 fps, gee, how great.

Yes, great, isn't it? Considering the amount of overdraw and other inefficiencies in the Quake 3 rendering system, which is really targeted at hardware rendering, it ain't bad at all. With portal clipping and fine-tuning some other parts of the render pipeline it should be possible to reach a stable 20 FPS or so. So, if I were you I'd start writing such an engine right away so you can finally put some proof on the table.
Are you a n00b or something? sub/jnz is all you need, and thanks to branch-prediction, this is just a 1-clk test most of the time.

I never said you need them all, did I? I even said "or so" to mean the most optimal combination for the situation.
And all the time you gain from the linear gradients gives you plenty of time to take that branch every now and then and update the gradients. In my Java-engine the performance is almost identical to linear texturemapping, while per-pixel perspective is much slower.

Of course it's much slower! It's not using any SSE instructions. You don't have to convince me that without SSE it's much more efficient to do perspective correction every 16 pixels or so. I used it several times myself when I was developing on a Pentium II. Don't compare apples and oranges. Without SSE and per-pixel perspective correction it would have been a lot harder to get the flexibility and performance I have now with swShader.
Are you a n00b or something? A P4 has trace-cache, and therefore its effective pipeline length in loops is far less than 20 stages. It also makes loop unrolling or loop aligning pretty much useless.

No, clearly you are the newbie, because the Pentium 4 actually has 28 stages if you count them all, of which 20 are behind the trace cache and are thus what matters for branch misprediction. I hope this is where you start to realize that other things you've said might just be plain wrong as well. Being right about the out-of-order execution doesn't mean everything else I say makes no sense. I started this thread with the best intentions of learning something new, and I'm thankful you showed me my error, but stop calling me naive and you might learn a few things yourself.
You seem to sound like a n00b who's never used anything other than x86.
Just because x86 is a working solution doesn't mean it's a good solution.

That's correct, I've only really worked with x86. But it's not always the optimal solution that wins. And I do believe x86 is a "good solution" if you look at everything it has accomplished. I'd be glad to admit another architecture is superior if it proves so in all areas.
WOULD yes, if they would be adapted to render the same quality as yours, since you added that requirement.

But "would" just ain't worth anything until it is proven.
You seem so confident of yourself. It'd almost be worth it to write a good engine and shove your arrogant face in it.

Who's being arrogant here? I -have- a product. And while it might not be the absolute fastest that is theoretically possible, it's still one of the fastest and most flexible in practice. And that's all that matters. You can't sell the bear's skin before you've killed the beast!
Grow up, I'm not going to screw around with age-old code just to please you. I'm busy with hardware-rendering now.

Fine, then keep it that way and just admit there ain't going to be any proof that shows my renderer is slow as hell.

And do you really think the tricks you've used in your "age-old code" still work today?
I never said that. I said it could be done better, not necessarily by me, and if you are honest, you would have to admit this as well.

I admit. There's still a lot of work to be done. But many of the tricks you're trying to shove down my throat either don't work as well in my situation as you think they do, or are very impractical and limit the renderer's flexibility and future extensibility. But I would be very pleased to accept any suggestions from people who do work on advanced software renderers.

Seriously, what makes you think you're in the right position to say "it could be done better" without proving anything of what you say? Yes, it's true it could be done better. But you're not helping me by saying none of it is impressive and that your Java demo (which I honestly do find impressive in its context), which only supports a few render modes, would perform much better if it used SSE.

I decompiled the jar file and there was just a handful of triangle functions in it. My renderer supports 9720 render states for the fixed-function shader, and I haven't even raised that to the power of the 8 texture stages. The triangle rasterizer and clipper support nearly the whole FVF of DirectX without any redundant loop or arithmetic operation, and the vertex pipeline is evolving in the same direction. You can see most of it for yourself in the open-source part of swShader. Proof. Practice. Power.
I made some suggestions but you are so confident that your engine is already perfect (which it isn't), that you don't take any advice anyway. So why would I bother?

I never said it's perfect. And I sincerely am a sensible person, but if you could just show me that the things you're saying have any truth, you'd gain a lot of credibility. Otherwise you're right, you shouldn't even bother, because most probably it will just work out less ideally than you think. What makes you have such an enormous belief in your theories if you're not doing software rendering any more while I'm working on it every day? Why can't you just admit that you -could- be wrong, since you can't even prove things for yourself?
I'm not claiming that I can do better than you. I'm just saying that my Java engine has a more efficient renderer than yours does, because I do things differently than you do.

Well, that's really great for you, but I'm not working on a Java renderer with the same goals, so different rules apply. All I want to say is that, for the things I aim for, my renderer is pretty damn efficient and very little of what you've said will contribute to that.
I can't decide if you are too arrogant, or just too naive.
I have decided though that it is useless to try and help you, because you think you know everything better anyway.

I don't think I know everything better. Quit trying to think what I think and using it as an argument. Start looking a bit more at yourself to see who is naive and arrogant.

Enjoy finding an excuse for the 28-stage pipeline.
Posted on 2003-12-15 18:24:54 by C0D1F1ED
Originally posted by f0dder
Humm, where do I get the pixomatic stuff? I do have UT2k3, and I'd like to see if the pixomatic plugin thingy is worth anything - the RAD GameTools certainly wasn't.

The link Bruce-li posted is the demo you've already seen. You can find the UT2k3 plugin here: http://unreal.epicgames.com/News.htm, in the middle of the page.
C0D1F1ED, your engine might be all fast and dandy etc, but the quake3 demo looks... bland and extremely washed-out. Try reducing the gamma? Would be a better show-off if it actually looked good ;)

Oh, thanks for noticing. I actually use a higher gamma to see if the mipmap transitions are correct. You can find a version with gamma 1.0 here: Real Virtuality.exe. If that still doesn't look right you can add the gamma to the command line.
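For the curious: a configurable gamma like that can boil down to a per-channel lookup table built once at startup. A minimal sketch, illustrative only and not the demo's actual code:

#include <cmath>

// Build a 256-entry table mapping 8-bit colour values to gamma-adjusted
// 8-bit values; applied per channel when the frame is copied to the screen.
void buildGammaTable(unsigned char table[256], float gamma)
{
    for (int i = 0; i < 256; i++)
    {
        float corrected = std::pow(i / 255.0f, 1.0f / gamma);   // gamma > 1 brightens
        table[i] = (unsigned char)(corrected * 255.0f + 0.5f);
    }
}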
Posted on 2003-12-15 19:14:34 by C0D1F1ED
Why would they choose to do something else? Don't change a winning team.


For starters, because BSP trees haven't been exactly optimal since T&L hardware was introduced.
You don't seem to understand anything that I'm saying.

Do you know one person who didn't learn most of his knowledge from other people?


The point is passing other people's work off as your own.

I am not a newbie. And I do know stencil shadows.


Your 'optimizations' didn't exactly convince me, nor the fact that you think they would actually be feasible in software.

And as far as I know the Unreal engine has always run perfectly on multiple APIs. Many times when there's a problem it's actually a driver bug. So what makes you say it sucks?


You must have missed the part where I said it rendered everything as dynamic geometry.

Don't judge and underestimate things you don't really know.


I only judged what I knew. And I knew Pixomatic's example programs didn't impress me.

Or maybe it might have just been Abrash after all...


Or maybe I'm Abrash, you never know.

but it's like he's done you something wrong


Not me personally, but the way he tries to achieve fame on the backs of other people and pretends he invented everything himself, that he's so hot, etc., doesn't exactly appeal to me.

something you don't seem to be able to do.


I respect many people, Abrash is just not one of them, and I have my reasons.

Sometimes it is. A Newton-Raphson division approximation takes extra instructions but it's certainly faster than one division at full precision.


Yea, but the alternative was not doing the division at all. Your friend Abrash uses this method as well by the way; it's mentioned on the Pixomatic page.

So who were you trying to fool again?


I'm not fooling anyone. You've seen my Java engine, that's the last software engine I made anyway, the most feature-filled one. I don't think I have anything to prove regarding software rendering, actually.

Yes, great, isn't it? Considering the amount of overdraw and other inefficiencies in the Quake 3 rendering system, which is really targeted at hardware rendering, it ain't bad at all. With portal clipping and fine-tuning some other parts of the render pipeline it should be possible to reach a stable 20 FPS or so. So, if I were you I'd start writing such an engine right away so you can finally put some proof on the table.


Yea, and how much extra time do you think some ACTUAL shading would take (hint: per-pixel dot3, pow, normalmap)? How much overdraw do you think stencil shadows will generate (hint: backface culling does not work)? What do you think about scenes that don't have rooms with an average of 12 polys (hint: T&L is VERY slow on CPU, compared to modern GPUs)? How often do you think portal rendering will work in general, besides Quake levels (hint: outdoor scenes)?
You did say you want a GENERIC engine, as a replacement for hardware?
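Just to make concrete the kind of per-pixel work I'm talking about, here's a rough scalar sketch (nobody's actual shader, purely illustrative):

#include <cmath>

// Rough per-pixel cost of a dot3 + specular pass: a normal fetched from the
// normal map, two dot products and a pow, before you even touch the texels.
struct Vec3 { float x, y, z; };

static inline float dot(const Vec3 &a, const Vec3 &b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// n: normal decoded from the normal map for this pixel
// l: normalized light vector, h: normalized half vector
// returns a lighting factor to modulate the diffuse texel with
float shadePixel(const Vec3 &n, const Vec3 &l, const Vec3 &h, float shininess)
{
    float diffuse = dot(n, l);
    if (diffuse < 0.0f) diffuse = 0.0f;

    float specular = dot(n, h);
    specular = (specular > 0.0f) ? std::pow(specular, shininess) : 0.0f;

    return diffuse + specular;
}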

You don't have to convince me that without SSE it's much more efficient to do perspective correction every 16 pixels or so.


Who's talking about 16 pixels? I use adaptive scanline subdivision, remember? In the best case I have 1 div per 'width' pixels. And you're saying that SSE is not slower than that? And what about Abrash? Using SSE, but still subdividing his scanlines?

Fine, then keep it that way and just admit there ain't going to be any proof that shows my renderer is slow as hell.


Well, if you want to continue dreaming, go ahead. I don't care. Just don't expect me to help you when you realize that Doom3 doesn't work in your engine after all, and that you might need some tricks of the trade that people such as myself could provide.

I decompiled the jar file and there was just a handful of triangle functions in it.


Erm, the demo only uses a handful of shading options, what's your point? Besides, I wonder if you even understand how my engine works in depth... Why don't you convince me by explaining it.
Why do you want to make a competition out of this anyway? We're not on the same level. You obviously don't know much about hardware rendering, and I am not really interested in software rendering anymore, because I've been there, done that, and went as far as I would like to go, in Java even.

And I sincerely am a sensible person, but if you could just show me that the things you're saying have any truth, you'd gain a lot of credibility.


Well that's just the whole point, isn't it? I mean, even after you've seen my Java renderer, you still doubt that I have any knowledge whatsoever about software rendering, and just assume you know everything better than me, so you basically don't even listen to any advice that I'm trying to give.
You want to make a contest out of this all the time... "My engine is the best".
We could have shared ideas and worked on making the perfect renderer together, but instead you just seem to want to prove that you're better than everyone else, and end up not getting any help. Is that sensible?
Posted on 2003-12-15 19:25:36 by Bruce-li
Yeah, bruce pointed me to that page earlier today. Dunno what to think of it, really. With default settings, it was playable - some slowdown noticeable when there's a lot of bots on screen, but still playable. And somewhat ugly-looking. Turning off 2x pixelsize and setting texture to 32bit is still playable, and looks better - still not all that great, though. Setting terrain detail to non-low and things start looking okay, but the game is unplayable. P4 2.53ghz.

Abrash seems to be reasonable at optimizing :), so I guess it's the pixomatic architecture that's not all that great for good software rendering speed - or the way ut2k3 uses it. Fair enough, really, since the game was designed to be hw-only.

I'm not that much into 3D, but I guess you could get better speed by designing an engine for software from the start? Tradeoffs and everything. I like the idea that you're trying to make SoftWire+swShader very generic and extendable, but it still doesn't seem like a realistic option for running games - at least not on an engine that's designed for hardware rendering. I get ~19fps average on my P4, not exactly game rates, especially with the occasional drop below :)

The new exe looks better - if perhaps a bit too dark. But certainly an improvement. There are still a few weird things though - like the lava being fullbright, and the "models" looking fullbright too (various skulls, the corpse in the "foggy cellar" (which also looks peculiar), the statues in the "bridge area"). Normal corridors and such look sorta like hardware quality though.

But if you can beat refrast by a couple of orders of magnitude, then it does seem like more than a cute gimmick :)

Btw, what about the volume fog stuff that q3 has? And the skybox stuff?
Posted on 2003-12-15 19:45:05 by f0dder
I'm not that much into 3D, but I guess you could get better speed by designing an engine for software from the start? Tradeoffs and everything. I like the idea that you're trying to make SoftWire+swShader very generic and extendable, but it still doesn't seem like a realistic option for running games - at least not on an engine that's designed for hardware rendering.


Yea, that's what I've been saying, and I think I can say that I speak from experience.
There are some things that you just don't want to do in software, or at least, not in the way that hardware does it.

But if you can beat refrast by a couple of orders of magnitude, then it does seem like more than a cute gimmick


Well, as I say, if you go from 0.001 fps to 0.01 fps, what does it really matter? "I have a faster slideshow-generator than you do!"?
You'll never be able to render something like Doom3/HalfLife 2 in realtime with it.
You'd get a lot further if you'd write a realtime raytracer for it :)
Posted on 2003-12-15 20:00:56 by Bruce-li
Originally posted by Bruce-li
For starters, because BSP trees haven't been exactly optimal since T&L hardware was introduced.
You don't seem to understand anything that I'm saying.

BSP trees can be made arbitrarily complex. They already started with Quake 3 by getting rid of the convexity restriction and putting more polygons in every leaf. So that really is targeted more at hardware rendering than at software rendering, isn't it?
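For reference, the classic front-to-back BSP traversal that a software renderer can exploit for occlusion looks roughly like this; a generic sketch with made-up structures, not Quake 3's or swShader's code, and exactly the ordering a z-buffered hardware pipeline doesn't need:

// Front-to-back traversal of a node-storing BSP tree.
struct Plane   { float a, b, c, d; };
struct Vec3    { float x, y, z; };
struct BspNode
{
    Plane    plane;
    BspNode *front;     // child on the positive side of the plane
    BspNode *back;      // child on the negative side
    // polygons stored at this node omitted for brevity
};

static float distance(const Plane &p, const Vec3 &v)
{
    return p.a * v.x + p.b * v.y + p.c * v.z + p.d;
}

void renderFrontToBack(const BspNode *node, const Vec3 &eye)
{
    if (!node) return;                        // reached an empty child

    if (distance(node->plane, eye) >= 0.0f)   // eye is on the front side
    {
        renderFrontToBack(node->front, eye);  // nearer half first
        // ... render the polygons stored at this node ...
        renderFrontToBack(node->back, eye);
    }
    else
    {
        renderFrontToBack(node->back, eye);
        // ... render the polygons stored at this node ...
        renderFrontToBack(node->front, eye);
    }
}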
You must have missed the part where I said it rendered everything as dynamic geometry.

And how does that affect the fact that it runs perfectly on multiple APIs? Besides, the Unreal engines have always seemed very fast to me. Is there any game you know of that doesn't run smoothly?
I only judged what I knew. And I knew Pixomatic's example programs didn't impress me.

Then admit you underestimated it, or at least that it can have more potential than what you've seen in the small demos.
Not me personally, but the way he tries to achieve fame on the backs of other people and pretends he invented everything himself, that he's so hot, etc., doesn't exactly appeal to me.

I respect many people, Abrash is just not one of them, and I have my reasons.

Do you know him personally then? You seem to know him a lot better so please tell me about it all.

And why avoid telling a bit more about yourself?
Yea, but the alternative was not doing the division at all. Your friend Abrash uses this method as well by the way; it's mentioned on the Pixomatic page.

Abrash is not my friend. I just had a discussion with him once about Pixomatic and software rendering in general, in which he was very polite and respected my work, so I respect him back. If you think that's wrong then I'd really like to know why.

And not doing the perspective division at all for far away polygons is obviously a good optimization. It works at a higher level so it doesn't put any restrictions on the rest of the rendering pipeline. I'm not saying any other optimization is inherently bad but I can't understand why you find this bad in particular.
I'm not fooling anyone. You've seen my Java engine, that's the last software engine I made anyway, the most feature-filled one. I don't think I have anything to prove regarding software rendering, actually.

It's very nice and all, but it's simply not capable of all the features my renderer has. Although certainly not impossible, I think it's a little naive to think that you can make a renderer with thousands more rendering options and keep the same performance. The complexity rises very rapidly, so it's very important to K.I.S.S. or you end up with nothing that actually works. My technology works, in practice. And I won't deny that there are still many possibilities to make it faster, but it wouldn't be wise to implement them now, and I'm sorry to say the suggestions you made are among the least interesting to add right now. The few clock cycles that I might win, or not, and the amount of work I would have to put into it just don't justify them. Therefore, if you weigh all the pros and cons, I am strongly convinced that they would have a negative overall impact. And I already did lots of experimentation and made lots of premature optimization mistakes to back up the things that I'm saying.
Yea, and how much extra time do you think some ACTUAL shading would take (hint: per-pixel dot3, pow, normalmap)? How much overdraw do you think stencil shadows will generate (hint: backface culling does not work)? What do you think about scenes that don't have rooms with an average of 12 polys (hint: T&L is VERY slow on CPU, compared to modern GPUs)? How often do you think portal rendering will work in general, besides Quake levels (hint: outdoor scenes)?

The thing is, you don't really need bump mapping or stencil shadows to play cool games. Think of the popular PlayStation I & II. And T&L really isn't that slow: my laptop has a 1.4 GHz Pentium M and an Intel Extreme Graphics chip, which has no hardware T&L. A bit to my own surprise, it has absolutely no problem running modern games like Medal of Honor or any other application that requires significant T&L.
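And to put "T&L on the CPU" in perspective: the transform stage essentially boils down to one 4x4 matrix multiply per vertex, something like this plain C++ sketch (made-up types, not my actual vertex pipeline):

struct Vec4   { float x, y, z, w; };
struct Matrix { float m[4][4]; };            // row-major

// Transform 'count' vertices by the combined model-view-projection matrix:
// 16 multiplies and 12 adds per vertex.
void transformVertices(const Matrix &mvp, const Vec4 *in, Vec4 *out, int count)
{
    for (int i = 0; i < count; i++)
    {
        const Vec4 &v = in[i];
        out[i].x = mvp.m[0][0]*v.x + mvp.m[0][1]*v.y + mvp.m[0][2]*v.z + mvp.m[0][3]*v.w;
        out[i].y = mvp.m[1][0]*v.x + mvp.m[1][1]*v.y + mvp.m[1][2]*v.z + mvp.m[1][3]*v.w;
        out[i].z = mvp.m[2][0]*v.x + mvp.m[2][1]*v.y + mvp.m[2][2]*v.z + mvp.m[2][3]*v.w;
        out[i].w = mvp.m[3][0]*v.x + mvp.m[3][1]*v.y + mvp.m[3][2]*v.z + mvp.m[3][3]*v.w;
    }
}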
You did say you want a GENERIC engine, as a replacement for hardware?

Not a replacement in the sense that you have to throw out your graphics card. But a replacement when there is little or no choice. There are various reasons why anyone would like to have a fallback when everything else fails. And I'm also trying to target a bit of the non-realtime or less interactive rendering market. There are many applications like CAD or commercial 'boutiques' that don't require framerates above your monitor's refresh but have to run no matter what the graphics card is like. And these applications could use all the performance they can get so the reference rasterizer is not an option. Last but not least there can be situations where all the brute force of hardware rendering is of little use.
Who's talking about 16 pixels? I use adaptive scanline subdivision, remember? In the best case I have 1 div per 'width' pixels.

I used to do it per 16 pixels for a while. That was because, in the time it took to render those pixels, I could let the division for the next 16 pixels run in parallel, nearly for free. So it was hardly any better to do it per 32 pixels, and per 8 pixels wasn't required. So to save the extra setup of adaptive subdivision, I just fixed it at 16 and saved some more by unrolling to do 16 pixels at once. K.I.S.S.
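In sketch form, that per-16-pixel scheme looks something like this (hypothetical structures, and without the trick of overlapping the division with the previous block's pixel work):

// Perspective correction per 16-pixel block: divide only at block
// boundaries and interpolate u,v linearly in between.
struct Span
{
    float uOverZ, vOverZ, oneOverZ;   // values at the left edge of the span
    float dUoZ, dVoZ, dOoZ;           // per-pixel gradients of u/z, v/z, 1/z
};

void drawSpan(Span s, int width)
{
    float uLeft = s.uOverZ / s.oneOverZ;
    float vLeft = s.vOverZ / s.oneOverZ;

    for (int x = 0; x < width; x += 16)
    {
        int n = (width - x < 16) ? (width - x) : 16;

        // perspective-correct u,v at the right edge of this block
        float uoz = s.uOverZ + s.dUoZ * n;
        float voz = s.vOverZ + s.dVoZ * n;
        float ooz = s.oneOverZ + s.dOoZ * n;
        float uRight = uoz / ooz;
        float vRight = voz / ooz;

        // linear steps across the block
        float du = (uRight - uLeft) / n;
        float dv = (vRight - vLeft) / n;

        float u = uLeft, v = vLeft;
        for (int i = 0; i < n; i++)
        {
            // write pixel (x + i) using u, v here; texel fetch omitted
            u += du;
            v += dv;
        }

        uLeft = uRight;
        vLeft = vRight;
        s.uOverZ   = uoz;
        s.vOverZ   = voz;
        s.oneOverZ = ooz;
    }
}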
And you're saying that SSE is not slower than that? And what about Abrash? Using SSE, but still subdividing his scanlines?

With an SSE division (approximation) there is little point any more in doing the perspective correction in parallel with pixel rendering. And bilinear filtering also takes the majority of the processing power now, so I'm not going to put my effort into perspective correction, especially if it would imply a few restrictions. I'm not saying per-pixel SSE is faster; it's just not worth the risk to do it differently while this works very satisfactorily.

I don't know what Abrash really uses, but you could be right that it's close to adaptive subdivision. But he has a fair amount of restrictions on the rendering pipeline, and when not using bilinear filtering it could become worth it for just that one extra frame per second. That's not my priority right now so it seems unwise to implement it myself.
Well, if you want to continue dreaming, go ahead. I don't care. Just don't expect me to help you when you realize that Doom3 doesn't work in your engine after all, and that you might need some tricks of the trade that people such as myself could provide.

I could very much use any help I can get. But you have to understand that, even though it brings enormous flexibility, my run-time assembler technology and my goals put some restrictions on which tricks are actually useful for me right now. Your efforts are much appreciated, but unfortunately you fail to see that much of it is of little use to me, and you seem to prefer insulting me over really understanding the situation and helping me in the direction -I- would like to go. And no, that doesn't mean I'm not listening to any ideas other than my own.
Erm, the demo only uses a handful of shading options, what's your point? Besides, I wonder if you even understand how my engine works in depth... Why don't you convince me by explaining it.

No, I don't understand what you're doing from that decompiled source. But I don't feel like I have to, or are there some other things you really find that important and that I should have in my renderer as well?
Why do you want to make a competition out of this anyway?

I'm not. I'm just trying to convince you that my work is worth something. It's not that I want you to love me and kneel down before me, but I do think I deserve a little more respect than what you have given me so far. I wouldn't insult someone just for not knowing a few things, and I would try to understand the motivations behind their choices even if I would have done things differently. It's not all black and white.
We're not on the same level.

I would rather rephrase it as not being on the same wavelength. Our goals are different, but that doesn't mean one is inferior to the other, let alone that you or I should be competing over anything.
You obviously don't know much about hardware rendering, and I am not really interested in software rendering anymore, because I've been there, done that, and went as far as I would like to go, in Java even.

Again, that's really great for you, and I really am impressed by your results! But you have to stop thinking that the stuff I do, which you aren't working on any more, is as terribly bad as you'd like to believe. I really did a lot of profiling and experimentation to get a few things quite close to optimal, like the bilinear filtering. I don't care that much about all the inefficiency caused by the BSP renderer.
Well that's just the whole point, isn't it? I mean, even after you've seen my Java renderer, you still doubt that I have any knowledge whatsoever about software rendering, and just assume you know everything better than me, so you basically don't even listen to any advice that I'm trying to give.

I never said you don't have any knowledge about software rendering whatsoever, and I'm really trying hard to fit some of your theory into my practice. But you are assuming that all software rendering is the same, so all your ideas work as well for me as they did for you and I should follow the direction you have in mind. I'm sorry, but it doesn't work that way: you can't convince anyone of something without proof. You can convince yourself, but then you might be fooling yourself.
You want to make a contest out of this all the time... "My engine is the best".

It's hard to measure anything here. I mean, if you compare my fixed-function pipeline to Pixomatic's, there's little chance that my renderer is the "best". But if you look at all the extra features it supports, then it -might- be "best" in some other respects. Why can't you just admit it has potential? There's no black and white here either!
We could have shared ideas and worked on making the perfect renderer together, but instead you just seem to want to prove that you're better than everyone else, and end up not getting any help. Is that sensible?

Well, I'd love to get personal, but I'm afraid I'll first have to know you a bit better before I trust you. And I do believe we have shared ideas; you just don't seem to realize it's not only me who can learn something from it. And I never said I'm better than everyone else, nor do I feel that way, but at least I -have- proof of my ideas. Yeah, there's sense behind that.

Now, could you please tell me a bit more about yourself? Oh, and I'd love to hear you admit a few things. Just a few, that show you are a sensible person as well. The 28-stage pipeline would be a good start. Or do you really feel ashamed that you don't know absolutely everything either? I mean, it can't be that bad to admit facts, can it? Or maybe show just a tiny bit of doubt that the ideas you presented here are really going to make my demo run twice as fast, because it doesn't sound very sensible that you know all the answers and I know nothing...
Posted on 2003-12-16 08:13:02 by C0D1F1ED